Advanced Lectures on Machine Learning: Machine Learning Summer School 2002, Canberra, Australia, February 11-22, 2002, Revised Lectures (Lecture Notes in Computer Science, 2600) [2003 ed.] 3540005293, 9783540005292

Machine Learning has become a key enabling technology for many engineering applications and theoretical problems alike.

110 89 3MB

English Pages 276 [269] Year 2003

Report DMCA / Copyright

DOWNLOAD PDF FILE

Table of contents :
Cover
Lecture Notes in Artificial Intelligence 2600
Advanced Lectures on Machine Learning
Copyright
Preface
Thanks and Acknowledgments
Table of Contents
1 A Few Notes on Statistical Learning Theory
2 A Short Introduction to Learning with Kernels
3 Bayesian Kernel Methods
4 An Introduction to Boosting and Leveraging
5 An Introduction to Reinforcement Learning Theory: Value Function Methods
6 Learning Comprehensible Theories from Structured Data
7 Algorithms for Association Rules
8 Online Learning of Linear Classifiers
Author Index
Recommend Papers

Advanced Lectures on Machine Learning: Machine Learning Summer School 2002, Canberra, Australia, February 11-22, 2002, Revised Lectures (Lecture Notes in Computer Science, 2600) [2003 ed.]
 3540005293, 9783540005292

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann

Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen

2600

3

Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo

Shahar Mendelson Alexander J. Smola

Advanced Lectures on Machine Learning Machine Learning Summer School 2002 Canberra, Australia, February 11-22, 2002 Revised Lectures

13

Series Editors Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA J¨org Siekmann, University of Saarland, Saarbr¨ucken, Germany Volume Editors Shahar Mendelson The Australian National University, RSISE Canberra, ACT 0200, Australia E-mail: [email protected] Alexander J. Smola The Australian National University Research School for Information Sciences and Engineering Canberra, ACT 0200, Australia E-mail: [email protected] Cataloging-in-Publication Data applied for A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at .

CR Subject Classification (1998): I.2.6, F.1, F.2 ISSN 0302-9743 ISBN 3-540-00529-3 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin Stefan Sossna e. K. Printed on acid-free paper SPIN: 10872572 06/3142 543210

Preface

Machine Learning has become a key enabling technology for many engineering applications and theoretical problems alike. To further discussions and to disseminate new results, a Summer School was held on February 11–22, 2002 at the Australian National University. The current book contains a collection of the main talks held during those two weeks in February, presented as tutorial chapters on topics such as Boosting, Data Mining, Kernel Methods, Logic, Reinforcement Learning, and Statistical Learning Theory. The papers provide an in-depth overview of these exciting new areas, contain a large set of references, and thereby provide the interested reader with further information to start or to pursue his own research in these directions. Complementary to the book, a recorded video of the presentations during the Summer School can be obtained at http://mlg.anu.edu.au/summer2002 It is our hope that graduate students, lecturers, and researchers alike will find this book useful in learning and teaching Machine Learning, thereby continuing the mission of the Summer School.

Canberra, November 2002

Shahar Mendelson Alexander Smola

Research School of Information Sciences and Engineering, The Australian National University

Thanks and Acknowledgments

We gratefully thank all the individuals and organizations responsible for the success of the workshop.

Local Arrangements, Co-located Conferences Help Special thanks go to John Lloyd for making this Summer School possible, to Di Kossatz and Michelle Moravec who handled the local arrangements, to Gunnar R¨atsch for his help with the manuscripts, to Dennis Andriolo, Joe Elso, Kim Holburn, and Teddy Mantoro for IT support, to Annie Lau, Petra Philips, Kristy Sim, Cheng Soon-Ong, and the students at the Computer Sciences Laboratory for their help throughout the course of the Summer School.

Sponsoring Institutions Research School of Information Sciences and Engineering, Australia The National Institute of Engineering and Information Science, Australia

Speakers Brian Anderson Peter Bartlett Peter Hall Markus Hegland Jyrki Kivinen John Lloyd Ron Meir

Shahar Mendelson Gunnar R¨ atsch Bernhard Sch¨ olkopf Arun Sharma Alex Smola Robert Williamson

Table of Contents

A Few Notes on Statistical Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . . Shahar Mendelson

1

A Short Introduction to Learning with Kernels . . . . . . . . . . . . . . . . . . . . . . . Bernhard Sch¨ olkopf, Alexander J. Smola

41

Bayesian Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexander J. Smola, Bernhard Sch¨ olkopf

65

An Introduction to Boosting and Leveraging . . . . . . . . . . . . . . . . . . . . . . . . . 118 Ron Meir, Gunnar R¨ atsch An Introduction to Reinforcement Learning Theory: Value Function Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 Peter L. Bartlett Learning Comprehensible Theories from Structured Data . . . . . . . . . . . . . . 203 J.W. Lloyd Algorithms for Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 Markus Hegland Online Learning of Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Jyrki Kivinen Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

X

Table of Contents

A Few Notes on Statistical Learning Theory Shahar Mendelson RSISE, The Australian National University, Canberra 0200, ACT, Australia

1

Introduction

In these notes our aim is to survey recent (and not so recent) results regarding the mathematical foundations of learning theory. The focus in this article is on the theoretical side and not on the applicative one; hence, we shall not present examples which may be interesting from the practical point of view but have little theoretical significance. This survey is far from being complete and it focuses on problems the author finds interesting (an opinion which is not necessarily shared by the majority of the learning community). Relevant books which present a more evenly balanced approach are, for example [1, 4, 34, 35] The starting point of our discussion is the formulation of the learning problem. Consider a class G, consisting of real valued functions defined on a space Ω, and assume that each g ∈ G maps Ω into [0, 1]. Let T be an unknown function, T : Ω → [0, 1] and set µ to be an unknown probability measure on Ω. The data one receives are a finite sample (Xi )ni=1 , where (Xi ) are independent random variables distributed according to µ, and the values of the unknown  n function on the sample T (Xi ) i=1 . The objective of the learner is to construct a function in G which is almost the closest function to T in the set, with respect to the L2 (µ) norm. In other words, given ε > 0, one seeks a function g0 ∈ G which satisfies that Eµ |g0 − T |2 ≤ inf Eµ |g − T |2 + ε, g∈G

(1)

where Eµ is the expectation with respect to the probability measure µ. Of course, this function has to be constructed according to the data at hand.  n  A mapping L is a learning rule if it maps every sn = (Xi )ni=1 , T (Xi ) i=1 to some Lsn ∈ G. The measure of the effectiveness of the learning rule is “how much data” it needs in order to produce an almost optimal function in the sense of (1). The one learning rule which seems to be the most natural (and it is the one we focus on throughout this article) is the loss minimization. For the sake of simplicity, we assume that the L2 (µ) minimal distance between T and members of G is attained at a point we denote by PG T , and define a new function class, which is based on G and T in the following manner; for every g ∈ G, let (g) = |g − T |2 − |PG T − T |2 and set L = {(g)|g ∈ G}. L is called the 2-loss class 

I would like to thank Jyrki Kivinen for his valuable comments, which improved this manuscript considerably.

S. Mendelson, A.J. Smola (Eds.): Advanced Lectures on Machine Learning, LNAI 2600, pp. 1–40, 2003.

c Springer-Verlag Berlin Heidelberg 2003 

2

S. Mendelson

associated with G and T , and there are obvious generalizations of this notion when other norms are considered. For every sample sn = (x1 , ..., xn ) and ε > 0, let g ∗ ∈ G be any function for which n n 2 2 1  1  ∗ g (xi ) − T (xi ) ≤ inf g(xi ) − T (xi ) + ε. g∈G n n i=1 i=1

(2)

Thus, any g ∗ is an “almost minimizer” of the empirical distance between members of G and the target T . To simplify the presentation, let us introduce a notation we shall use throughout these notes. Given a set {x1 , ..., xn }, let nµn be the empirical measure supported on the set. In other words, µn = n−1 i=1 δxi where δxi is the point evaluation n functional on the set {xi }. The L2 (µn ) norm is defined as f 2L2 (µn ) = n−1 i=1 f 2 (xi ). Therefore, g ∗ is defined as a function which satisfies that g ∗ − T 2L2 (µn ) ≤ inf g − T 2L2 (µn ) + ε. g∈G

From the definition of the loss class it follows that Eµn (g ∗ ) ≤ ε. Indeed, the second term in every loss function is the same — |T − PG T |2 , hence the infimum is determined only by the first term |g − T |2 . Thus, Eµn (g ∗ ) ≤ inf Eµn f + ε ≤ ε, f ∈L

since inf f ∈L Eµn f ≤ 0, simply by looking at f = (PG T ). The question we wish to address is when such a function (g ∗ ) will also be an “almost minimizer” with respect to the original L2 norm. Since g − T L2 (µ) ≥ PG T − T L2 (µ) it follows that for every g ∈ G, Eµ (g) ≥ 0. Therefore, our question is when Eµ (g ∗ ) ≤ inf Eµ (g) + ε = ε? g∈G

(3)

Formally, we attempt to solve the following Question 1. Fix ε > 0, let sn be a sample and set g ∗ to be a function which satisfies (2). Does it mean that Eµ (g ∗ ) ≤ 2ε? Of course, it is too much to hope for that the answer is affirmative for any given sample, or even for any “long enough” sample, because one can encounter arbitrarily long samples that give misleading information on the behavior of T . The hope is that an affirmative answer will be true with a relatively high probability as the size of the sample increases. The tradeoff between the desired accuracy ε, the high probability required and the size of the sample is the main question we wish to address. Any attempt to approximate T with respect to any measure other than the measure according to which the sampling is made will not be successful. For example, if one has two probability measures which are supported on disjoint

A Few Notes on Statistical Learning Theory

3

sets, any data received by sampling according to one measure will be meaningless when computing distances with respect to the other. Another observation is that if the class G is “too large” it would be impossible to construct any worthwhile approximating function using empirical data. Indeed, assume that G consists of all the continuous functions on [0, 1] which are bounded by 1, and for the sake of simplicity, assume that T is a Boolean function and that µ is the Lebesgue measure on [0, 1]. By a standard density argument, there are functions in G which are arbitrarily close to T with respect 2 to the L2 (µ) distance,  hence inf g∈G Eµ |T − g| = 0. On the other hand, for any  sample (xi ), T (xi ) of T and every ε > 0 there is some g ∈ G which coincides with T on the sample, but Eµ |T − g|2 ≥ 1 − ε. The problem one encounters in this example occurs because the class in question is too large; even if one receives as data an arbitrarily large sample, there are still “too many” very different functions in the class which behave in a similar way to (or even coincide with) T on the sample, but are very far apart. In other words, if one wants an effective learning scheme, the structure of the class should not be too rich, in the sense that additional empirical data (i.e. a larger sample) decreases the number of class members which are “close” to the target on the data. Hence, all the functions which the learning algorithm may select become “closer” to the target as the size of the sample increases. The two main approaches we focus on are outcomes of this line of reasoning. Firstly, assume that one can ensure that when the sample size is large enough, then with high probability, empirical means of members of L are uniformly close to the actual means (that is, with high probability every f ∈ L satisfies that, |Eµ f − Eµn f | < ε). In particular, if Eµn (g ∗ ) < ε then Eµ (g ∗ ) < 2ε. This naturally leads us to the definition of Glivenko-Cantelli classes. Definition 1. We say that a F is a uniform Glivenko-Cantelli class if for every ε > 0, n    1   f (Xi ) ≥ ε = 0, lim sup P r sup Eµ f − n→∞ µ n i=1 f ∈F

where (Xi )∞ i=1 are independent random variables distributed according to µ. We use the term Glivenko-Cantelli classes and uniform Glivenko-Cantelli classes interchangeably. The fact that the supremum is taken with respect to all probability measures µ is important because one does not have apriori information on the probability measure according to which the data is sampled. This definition has a quantified version. For every 0 < ε, δ < 1, let SF (ε, δ) be the first integer n0 such that for every n ≥ n0 and any probability measure µ,  P r sup |Eµ f − Eµn f | ≥ ε ≤ δ, (4) f ∈F

where µn is the random empirical measure n−1

n

i=1 δXi .

4

S. Mendelson

SF is called the Glivenko-Cantelli sample complexity of the class F with accuracy ε and confidence δ. Of course, the ability to control the means of every function within the class is a very strong property, and is only a (loose!) sufficient condition which suffices to ensure that g ∗ is a “good approximation” of T . In fact, all that we are interested in is that this type of a condition holds for a function like (g ∗ ) (i.e., an almost minimizer of (g) with respect to an empirical norm). Therefore, one would like to estimate  (5) sup P r ∃f ∈ L, Eµn f < ε, Eµ f ≥ 2ε . µ

Let CL (ε, δ) be the first integer such that for every n ≥ CL (ε, δ) the term in (5) is smaller than δ. For such a value of n, there is a set of large probability on which any function which is an “almost minimizer” of the empirical loss will be an “almost minimizer” of the actual loss regardless of the underlying probability measure, implying that our learning algorithm will be successful. These notes are divided into two main parts. The first deals with GlivenkoCantelli classes, in which we present two different approaches to the analysis of these classes. The first is based on a loose concentration result, and yields suboptimal complexity bounds, but is well known among members of the Machine Learning community. The second is based on Talagrand’s concentration inequality for empirical processes, and using it we obtain sharp complexity bounds. All our bounds are expressed in terms of parameters which measure the “richness” or size of the given class. In particular, we focus on combinatorial parameters (e.g. the VC dimension and the shattering dimension), the uniform entropy and the random averages associated with the class, and investigate the connections between the three notions of “size”. We show that the random averages capture the “correct” notion of size, and lead to sharp complexity bounds. In the second part of this article, we focus on (5) and show that under mild structural conditions on the class G it is possible to improve the estimates obtained using a Glivenko-Cantelli argument. Notational conventions we shall use are that all absolute constants are denoted by c and C. Their values may change from line to line, or even within the same line. If X and Y are random variables, Ef (X, Y ) denotes the expectation with respect to both variables.   The expectation with respect to X is denoted by EX f (X, Y ) = E f (X, Y )|Y .

2

Glivenko-Cantelli Classes

In this section we study the properties of uniform Glivenko-Cantelli classes (uGC classes for brevity), which are classes that satisfy (3) or (4). We examine various characterization theorems for uGC classes. The results which are relevant to the problem of sample complexity estimates are presented in full. We assume that the reader has some knowledge of the basic definitions in probability theory and empirical processes theory. One can turn to [5] for a more detailed introduction, or to [33, 8] for a complete and rigorous analysis.

A Few Notes on Statistical Learning Theory

5

We start this section with a presentation of the classical approach, using which sample complexity estimates for uGC classes were established in the past [36, 2]. This approach has its own merit, though the estimates one obtains using this method are suboptimal.

2.1

The Classical Approach

Let F be a class of functions whose range is contained in [−1, 1]. We say that (Zi )i∈I is a random process indexed by F if for every f ∈ F and every i ∈ I, The process is called i.i.d. if the finite dimensional Zi (f ) is a random variable.  marginal distributions Zi (f1 ), ..., Zi (fk ) are independent random vectors1 . One example the reader should have in mind is the following random process: let µ be a probability measure on the domain Ω and let X1 , ..., Xn be independent random variables distributed according to µ. Set n µn to be the empirical measure supported on X1 , ..., Xn — which is n−1 i=1 δXi . Hence, µn is a random probability  measure given by the average of point masses at Xi . Let Zi (·) = δXi − µ (·), where the last equation should be interpreted as Zi (f ) = f (Xi ) − Eµ (f ) for every f ∈ F . Note that Z1 , ..., Zn is an i.i.d. process with 0 mean (since for every f ∈ F , EZi (f ) = 0). Moreover, sup |

f ∈F

n  i=1

Zi (f )| = sup | f ∈F

n    f (Xi ) − Eµ f |, i=1

which is exactly the random variable we are interested in. Our strategy is based on the following idea, which, for the sake of simplicity, is explained for the trivial class consisting of a single element. We wish to measure “how close” empirical means are to the actual mean. If empirical means are close to the actual one with high probability, then two random empirical means  should be “close” to each other. n Thus, if (Xi ) is an independent copy of (Xi ), then the probability that | i=1 f (Xi ) − f (Xi ) | ≥ x should be an indication of the probability of deviation of the empirical means from the actual one. By symmetry, for every i, Yi = f (Xi )−f (Xi ) is distributed as −Yi . Hence, for every selection of signs εi , n n      f (Xi ) − f (Xi ) ≥ x = P r  εi f (Xi ) − f (Xi )  ≥ x . Pr  i=1

(6)

i=1

Now, consider (εi )ni=1 as independent Rademacher (that is, symmetric {−1, 1}valued) random variables, and (6) still holds, where P r on the right hand side now denotes the product measure generated by (Xi ), (Xi ) and (εi ). By the triangle inequality, and since Xi and Xi are identically distributed, 1

Throughout these notes we are going to omit all the measurability issues one should address in a completely rigorous exposition.

6

S. Mendelson n    f (Xi ) − f (Xi ) ≥ x Pr  i=1 n n  x  x   ≤P r  + Pr  εi f (Xi ) ≥ εi f (Xi ) ≥ 2 2 i=1 i=1 n  x  =2P r  εi f (Xi ) ≥ 2 i=1

n Thus, P r{| i=1 εi f (Xi )| ≥ x/2} could be the right quantity to control the deviation we require. Since this is far from being rigorous, one has to make the above reasoning precise. There are two main issues that need to be resolved; firstly, can this kind of a result be true for a “rich” class of functions, consisting of more than a single function, and secondly, how can one control the probability of deviation even after this “symmetrization” argument? The symmetrization procedure. The following symmetrization argument, due to Gin´e and Zinn [9], is the first step in the “classical” approach. Theorem 1. Let (Zi )ni=1 be an i.i.d. stochastic process which has 0 mean, and for every 1 ≤ i ≤ n, set hi : F → R to be an arbitrary function. Then, for every x>0 n

    4n Zi (f )| > x 1 − 2 sup var Z1 (f ) P r sup | x f ∈F f ∈F i=1 n     x ≤2P r sup | , εi Zi (f ) − hi (f ) | > 4 f ∈F i=1

where (εi )ni=1 are independent Rademacher random variables. Before proving this theorem, let us consider its implications for “our” empirical process. Fix a probability measure µ according to which the sampling is made. Then, Zi (f ) = f (Xi )−Eµ f and put hi (f ) = −Eµ f . Also, set v 2 = supf ∈F var(f ), √ √   and note that if x ≥ 2 2 nv then 1 − 4n x2 supf ∈F var Z1 (f ) ≥ 1/2. Therefore, for such a value of x, n n       x P r sup | . f (Xi ) − Eµ f | > x ≤ 4P r sup | εi f (Xi )| > 4 f ∈F i=1 f ∈F i=1

(7)

Now, fix any ε > 0 and let x = nε. If n ≥ 8v 2 /ε2 then n n    1 nε P r sup | . f (Xi ) − Eµ f | > ε ≤ 4P r sup | εi f (Xi )| > 4 f ∈F n i=1 f ∈F i=1

(8)

In particular, if each function in F maps Ω into [−1, 1] then v 2 ≤ 1. Thus, (8) holds for any n ≥ 8/ε2 .

A Few Notes on Statistical Learning Theory

7

Proof (Theorem 1). Let Wi be an independent copy of Zi and fix x > 0. Denote by PZ (resp. PW ) the probability measure n associated with the pro(W )). Put β = inf P r{| cess (Zi ) (resp. f ∈F i=1 Zi (f )| < x/2} and let n i A = {supf ∈F | i=1 Zi (f )| > x}. For every  element in A there is a realization of n > x. Fix this realization the process Zi and some f ∈ F such that | i=1 Zi (f )| n and f and observe that by the triangle inequality, if | i=1 Wi (f )| < x/2 then  n   n   Z (f ) − W (f ) > x/2. Since (W ) is a copy of (Zi )ni=1 then i i i i=1 i=1 n n n      x x β ≤ PW | ≤ PW | Wi (f )| < Wi (f ) − Zi (f )| > 2 2 i=1 i=1 i=1 n n    x ≤ PW sup | . Wi (f ) − Zi (f )| > 2 f ∈F i=1 i=1

Since the two extreme sides of this inequality are independent of the specific selection of f , this inequality holds on the set A. Integrating with respect to Z on A it follows that n n       x βPZ sup | . Zi (f )| > x ≤ PZ PW sup | Zi (f ) − Wi (f ) | > 2 f ∈F i=1 f ∈F i=1

Zi = −(Zi − Wi ), implying Clearly, Zi − Wi has the same distribution as Wi − n n n ) ∈ {−1, 1} , that for every selection of signs (ε i i=1 i=1 (Zi − Wi ) has the same n distribution as i=1 εi (Zi − Wi ). Hence, n     x PZ PW sup | Zi (f ) − Wi (f ) | > 2 f ∈F i=1

=PZ PW

n    x sup | εi Zi (f ) − Wi (f ) | > 2 f ∈F i=1



n     x =Eε PZ PW sup | , εi Zi (f ) − Wi (f ) | > 2 f ∈F i=1

where Eε denotes the expectation with respect to the Rademacher random variables (εi )ni=1 . By the triangle inequality, for every selection of functions hi and every fixed realization (εi )ni=1 , n     x εi Zi (f ) − Wi (f ) | > PZ PW sup | 2 f ∈F i=1

n     x ≤2PZ sup | , εi Zi (f ) − hi (f ) | > 2 f ∈F i=1

and by Fubini’s Theorem

8

S. Mendelson n

    x  εi Zi (f ) − hi (f ) | > Eε PZ sup | (εi )ni=1 2 f ∈F i=1 n     x = P r sup | . εi Zi (f ) − hi (f ) | > 2 f ∈F i=1

Finally, to estimate β, note that by Chebyshev’s inequality n    x 4n  Pr | ≤ 2 var Z1 (f ) , Zi (f )| > 2 x i=1   for every f ∈ F , and thus, β ≥ 1 − (4n/x2 ) supf ∈F var Z1 (f ) . After establishing (8), the next step is to transform a very  rich n class to a trivial class, consisting of a single function, and then estimate P r  i=1 εi f (xi ) > x . We show that one can effectively replace the (possibly) infinite class F with a finite set which approximates the original class in some sense. The “richness” of the class F will be reflected by the cardinality of the finite approximating set. This approximation scheme is commonly used in many areas of mathematics, and the main notion behind it is called covering numbers. Covering numbers and complexity estimates. Let (Y, d) be a metric space and set F ⊂ Y . For every ε > 0, denote by N (ε, F, d) the minimum number of open balls (with respect to the metric d) needed to cover F . That is, the minimum cardinality of a set {y1 , ..., ym } ⊂ Y with the property that every f ∈ F has is some yi such that d(f, yi ) < ε. The set {y1 , ..., ym } is called an ε-cover of F . The logarithm of the covering number is called the entropy of the set. We investigate metrics endowed by samples; for every sample {x1 , ..., xn } let µn be the empirical measure supported on that sample. For 1 ≤ p < ∞  1/p n and a function f , put f Lp (µn ) = n−1 i=1 |f (xi )|p and set f ∞ =   max1≤i≤n |f (xi )|. Recall that N ε, F, Lp (µn ) is the covering number of F at scale ε with respect to the Lp (µn ) norm. n Two observations we require are the following. Firstly, if n−1 | i=1 f (xi )| > t and if f − gL1 (µn ) < t/2 then n n n 1  1  1 t | g(xi )| ≥ | f (xi )| − |f (xi ) − g(xi )| > . n i=1 n i=1 n i=1 2

Secondly, for every empirical measure µn and every 1 ≤ p ≤ ∞, f L1 (µn ) ≤ f Lp (µn ) ≤ f L∞ (µn ) . Hence,       N ε, F, L1 (µn ) ≤ N ε, F, Lp (µn ) ≤ N ε, F, L∞ (µn ) . In a similar fashion to the notion of covering numbers one can define the packing numbers of a class. Roughly speaking, a packing number is the maximal cardinality of a subset of F with the property that the distance between any two of its members is “large”.

A Few Notes on Statistical Learning Theory

9

Definition 2. Let (X, d) be a metric space. We say that K ⊂ X is ε-separated with respect to the metric d if for every k1 , k2 ∈ K, d(k1 , k2 ) ≥ ε. Given a set F ⊂ X, define its ε-packing number as the maximal cardinality of a subset of F which is ε-separated, and denote it by D(ε, F, d). It is easy to see that the covering numbers and the packing numbers are closely related. Indeed, assume that K ⊂ F is a maximal ε-separated subset. By the maximality, for every f ∈ F there is some k ∈ K for which d(x, k) < ε, which shows that N (ε, F, d) ≤ D(ε, F, d). On the other hand, let {y1 , ..., ym } be an ε/2 cover of F and assume that f1 , ..., fk is a maximal ε-separated subset of F . In every ball {y|d(y, yi ) < ε/2} there is at most a single element of the packing (by the triangle inequality, the diameter of this ball is smaller than ε). Since this is true for any cover of F then D(ε, F, d) ≤ N (ε/2, F, d). Our discussion will rely heavily on covering and packing numbers. We can now combine the symmetrization argument with the notion of covering numbers and obtain the required complexity estimates. Theorem 2. Let F be a class of functions which map Ω into [−1, 1] and set µ to be a probability measure on Ω. Let (Xi )∞ i=1 be independent random variables distributed according to µ. For every ε > 0 and any n ≥ 8/ε2 ,  ε  nε2 1 P r sup | f (Xi ) − Eµ f | > ε ≤ 4Eµ N , F, L1 (µn ) e− 128 , n 8 f ∈F i=1 where µn is the (random) empirical measure supported on {X1 , ..., Xn }. One additional preliminary result we need before proceeding with the proof will enable us to handle the “trivial” case of classes consisting of a single function. This case follows from Hoeffding’s inequality [11, 33] Theorem 3. Let (ai )ni=1 ∈ Rn and let (εi )ni=1 be independent Rademacher random variables (that is, symmetric {−1, 1}-valued). Then, n    2 1 2 εi ai  > x ≤ 2e− 2 x /a2 , Pr  i=1

where a2 =

n i=1

a2i

1/2

.

In our case, (ai )ni=1 is going to be the values of the function f on a fixed sample {x1 , ..., xn }.   n Proof (Theorem 2). Let A = supf ∈F | i=1 εi f (Xi )| > nε 4 , and denote by χA the characteristic function of A. By Fubini’s Theorem, n

  nε     |X1 , ..., Xn . εi f (Xi ) > P r(A) = Eµ Eε χA |X1 , ..., Xn = Eµ P r sup  4 f ∈F i=1

(9)

10

S. Mendelson

Fix a realization of X1 , ..., Xn and let µn be the empirical measure supported on that realization. Set G to be an ε/8 cover of F with respect to the L1 (µn ) norm. Since F consists of functions which are by 1, we can assume that bounded n the same holds for every g ∈ G. If supf ∈F | i=1 εi f (Xi )| > nε/4, there is some f ∈ F for which this inequality holds. G is an ε/8-cover of F with n respect to the L1 (µn ), hence, there is some g∈ G which satisfies that n−1 i=1 |f (Xi ) − n g(Xi )| < ε/8. Therefore, supg∈G | i=1 εi g(Xi )| > nε/8, implying that for that realization of (Xi ), n n     nε  nε  ≤ P r sup | . εi f (Xi )| > εi g(Xi )| > P r sup | 4 8 g∈G i=1 f ∈F i=1

Applying nthe union bound, Hoeffding’s inequality and the fact that for every g ∈ G, i=1 g(xi )2 ≤ n, n n     nε  nε  ≤ |G|P r | P r sup | εi g(Xi )| > εi g(Xi )| > 8 8 g∈G i=1 i=1  nε2 ε ≤ N , F, L1 (µn ) e− 128 . 8

Finally, our claim follows from (9) and (8). Unfortunately, it might be very difficult to compute the expectation of the covering numbers. Thus, one natural thing to do is to introduce uniform entropy numbers. Definition 3. For every class F , 1 ≤ p ≤ ∞ and ε > 0, let    Np ε, F, n) = sup N ε, F, Lp (µn ) , µn

and    Np ε, F ) = sup sup N ε, F, Lp (µn ) . n

µn

We call log Np (ε, F ) the uniform entropy numbers of F with respect to the Lp (µn ). The only hope for establishing non-trivial uniform entropy bounds is when the covering numbers do not depend on the cardinality of the set on which the empirical measure is supported. In some sense, this implies that classes for which one can obtain uniform entropy bounds must be “small”. As we will show in sections to come, one can establish such dimension-free bounds in terms of the combinatorial parameters which are used to “measure” the size of a class of functions. The following result seems to be a weaker version of the theorem, but in the sequel we prove that it is a necessary condition for the uniform GC property as well.

A Few Notes on Statistical Learning Theory

11

Theorem 4. Assume that F is a class of functions which are all bounded by 1. If there is some 1 ≤ p ≤ ∞ such that for every ε > 0 the entropy numbers satisfy lim

n→∞

log Np (ε, F, n) = 0, n

then F is a uniform Glivenko-Cantelli class. An easy observation is that it is possible to bound the Glivenko-Cantelli sample complexity using the uniform entropy numbers of the class. Theorem 5. Let F be a class of functions which map Ω into [−1, 1]. Then for every 0 < ε, δ < 1, n  1 sup P r sup | f (Xi ) − Eµ f | ≥ ε ≤ δ, µ f ∈F n i=1

provided that n ≥

C ε2



 log N1 (ε, F ) + log(2/δ) , where C is an absolute constant.

In particular, if the uniform entropy is of power type q (that is, log N1 (ε, F ) = O(ε−q )), then the uGC sample complexity is (up to logarithmic factors in δ −1 ) O(ε−(2+q) ). As an example, assume that F is the 2-loss class associated with G and T . In this case, the Lp entropy numbers of the loss class can be controlled by those of G. Lemma 1. Let G be a class of functions whose range is contained in [0, 1] and assume that the same holds for T . If L is the 2-loss class associated with G and T , then for every ε > 0, every 1 ≤ p ≤ ∞ and every probability measure µ,   ε  N ε, L, Lp (µ) ≤ N , G, Lp (µ) . 4 Proof. Since L is a shift of the class (G − T )2 , and since covering numbers of a shifted class are the same as those of the original one (a shift is an isometry with respect to the Lp norm), it is enough to estimate the covering numbers of the class (G − T )2 . Let {y1 , ..., ym } be an ε-cover of G in Lp (µ). If g − yi Lp (µ) < ε, then pointwise |g − T |2 − |yi − T |2 = |g − yi | · |g + yi − 2T | ≤ 4|g − yi |. Hence, |g − T |2 − |yi − T |2 Lp (µ) ≤ 4g − yi Lp (µ) < 4ε. Corollary 1. Using the notation of the previous theorem, for every 0 < ε, δ < 1, SL (ε, δ) ≤

   128  log N1 ε/4, G + log(8/δ) 2 ε

The natural question which comes to mind is how to estimate the uniform entropy numbers of a class. Historically, this was the reason for the introduction of several combinatorial parameters. We will show that by using them one can control the uniform entropy.

12

2.2

S. Mendelson

Combinatorial Parameters and Covering Numbers

The first combinatorial parameter was introduced by Vapnik and Chervonenkis [36] to control the empirical L∞ entropy of Boolean classes of functions. Definition 4. Let F be a class of {0, 1}-valued functions on a space Ω. We say that F shatters {x1 , ..., xn } ⊂ Ω, if for every I ⊂ {1, ..., n} there is a function fI ∈ F for which fI (xi ) = 1 if i ∈ I and fI (xi ) = 0 if i ∈ I. Let    V C(F, Ω) = sup |A| A ⊂ Ω, A is shattered by F . V C(F, Ω) is called the VC dimension of F , but when the underlying space is clear we denote it by V C(F ). The VC dimension has  a geometric interpretation. A set sn = {x1 , .., xn } is shattered if the set { f (x1 ), ..., f (xn ) |f ∈ F } is the combinatorial cube {0, 1}n . For every sample σ denote by Pσ F the coordinate projection of F ,    Pσ F = { f (xi ) x ∈σ  f ∈ F }. i

Hence, the VC dimension is the largest cardinality of σ ⊂ Ω such that Pσ F is the combinatorial cube of dimension |σ|. Next, we present bounds on the empirical L∞ and L2 uniform entropy estimate using the VC dimension. Uniform entropy and the VC dimension. We begin with the L∞ estimates mainly for historical reasons. The following lemma, known as the Sauer-Shelah Lemma, was proved independently at least three times, by Sauer [28], Shelah [29] and Vapnik and Chervonenkis [36]. Lemma 2. Let F be a class of Boolean functions and set d = V C(F ). Then, for every finite subset σ ⊂ Ω of cardinality n ≥ d, |Pσ F | ≤

en d d

.

  d  In particular, for every ε > 0, N ε, F, L∞ (σ) ≤ |Pσ F | ≤ en/d . Using the Sauer-Shelah Lemma, one can characterize the uniform GlivenkoCantelli property of a class of Boolean functions in terms of the VC dimension. Theorem 6. Let F be a class of Boolean functions. Then F is a uniform Glivenko-Cantelli class if and only if it has a finite VC dimension. Proof. Assume that VC(F ) = ∞ and fix an integer d ≥ 2. There is a set σ ⊂ Ω, |σ| = d such that Pσ F = {0, 1}d , and let µ be the uniform measure on σ (assigns a weight of 1/d to every point). For any A ⊂ σ of cardinality n ≤ d/2, let µA n be the empirical measure supported on A. Since there is some fA ∈ F which is

A Few Notes on Statistical Learning Theory

13

1 on A and vanishes on σ\A then |Eµ fA − EµA f | = |1 − n/d| ≥ 1/2. Hence,  n A  supf ∈F |EµA f − Eµ f | ≥ 1/2. Therefore, P r supf ∈F |Eµn f − Eµ f | ≥ 1/2 = 1 n for any n ≤ d/2, and since d can be made arbitrarily large, F is not a uniform GC class. To prove the converse, recall that for every 0 < ε < 1 and  every empirical measure µn supported on the sample sn , N ε, F, L∞ (sn ) ≤ |Psn F | ≤  d en/d . Since the empirical L1 entropy is bounded by the empirical L∞ one, log N1 (ε, F, n) ≤ d log(en/d). Thus, for every ε > 0, log N1 (ε, F, n) = o(n), implying that F is a uniform GC class. In a similar fashion one can characterize the uGC property for Boolean classes using the Lp entropy numbers. Corollary 2. Let F be a class of Boolean functions. Then, F is a uniform Glivenko-Cantelli class if and only if for every 1 ≤ p ≤ ∞ and every ε > 0, log Np (ε, F, n) = o(n). Proof. Fix any 1 ≤ p ≤ ∞. If for every ε > 0 log Np (ε, F, n) = o(n), then by Theorem 2, F is a uGC class. Conversely, if F is a uGC class then it has a finite VC dimension. Denote V C(F ) = d, let σ be a sample of cardinality n and set µn to be the empirical measure supported on σ. For every ε > 0 and 1 ≤ p < ∞     en   = o(n). log N ε, F, Lp (µn ) ≤ log N ε, F, L∞ (σ) ≤ log |Pσ F | ≤ d log d There is some hope that with respect to a “weaker” norm, one will be able to obtain uniform entropy estimates (which can not be derived from the L∞ bounds presented here), that would lead to improved complexity bounds. Although the uGC property is characterized by the entropy with respect to any Lp norm (and in that sense, the L∞ one is as good as any other Lp norm), from the quantitative point of view, it is much more desirable to obtain L1 or L2 entropy estimates, which will prove to be considerably smaller than the L∞ ones. Therefore, the next order of business is to estimate the uniform entropy of a VC class with respect to empirical Lp norms. This result is due to Dudley [7] and it is based on a combination of an extraction principle and the Sauer-Shelah Lemma. The probabilistic extraction argument simply states that if K ⊂ F is “well separated” in L1 (µn ) in the sense that every two points are different on a number of coordinated which is proportional to n, one can find a much smaller set of coordinates (which depends of the cardinality of K) on which every two points in K are different on at least one coordinate. Theorem 7. There is an absolute constant C such that the following holds. Let F be a class of Boolean functions and assume that V C(F ) = d. Then, for every 1 ≤ p < ∞, and every 0 < ε < 1,

2 d 1 pd Np (ε, F ) ≤ (Cp) log . ε ε

14

S. Mendelson

Proof. Since the functions in F are {0, 1}-valued, it is enough to prove the claim for p = 1. The general case follows since for any f, g ∈ F and any probability p measure µ, f − gLp (µ) = f − gL1 (µ) .  n Let µn = n−1 i=1 δxi and fix 0 < ε < 1. Set Kε to be any ε-separated subset of F with respect to the L1 (µn ) norm and denote its cardinality by D. If V = {fi − fj |fi = fj ∈ Kε } then every v ∈ V has at least nε coordinates which belong to {−1, 1}. Indeed, since Kε is ε-separated then for any v ∈ V ε ≤ vL1 (µn ) = fi − fj L1 (µn ) =

1 1 |fi (xl ) − fj (xl )| = |v(xl )|, n n n

n

l=1

l=1

and for every 1 ≤ l ≤ n, |v(xl )| = |fi (xl ) − fj (xl )| ∈ {0, 1}. In addition, it is easy to see that |V | ≤ D2 . Take (Xi )ti=1 to be independent {x1 , ..., xn }-valued random variables, such that for every 1 ≤ i ≤ t and 1 ≤ j ≤ n, P r(Xi = xj ) = 1/n. It follows that for any v ∈ V , t     P r ∀i, v(Xi ) = 0 = P r v(Xi ) = 0 ≤ (1 − ε)t . i=1

Hence,

Therefore,

 P r ∃v ∈ V, ∀i, v(Xi ) = 0 ≤ |V | (1 − ε)t ≤ D2 (1 − ε)t .  P r ∀v ∈ V, ∃i, 1 ≤ i ≤ t |v(Xi )| = 1 ≥ 1 − D2 (1 − ε)t ,

and if the latter is greater than 0, there is a set of σ ⊂ {1, ..., n} such that |σ| = t and     |Pσ Kε | =  f (xi ) i∈σ f ∈ Kε  = D. D Select t = 2 log which suffices to ensure the existence of such a set σ. Clearly, ε we can assume that t ≥ d, otherwise, our claim follows immediately. By the Sauer-Shelah Lemma,

D = |Pσ Kε | ≤ |Pσ F | ≤

 e|σ| d  2e log D d = . d dε

It is easy to see that if α ≥ 1 and α log−1 α ≤ β then α ≤ β log(eβ log β). Applying this to (10), log D ≤ d log as claimed.

2e2 ε

log(

2e ) , ε

(10)

A Few Notes on Statistical Learning Theory

15

This result was strengthened by Haussler in [10] in a very difficult proof, which removed the superfluous logarithmic factor. Theorem 8. There is an absolute constant C such that for every Boolean class F , any 1 ≤ p < ∞ and every ε > 0, Np (ε, F ) ≤ Cd(4e)d ε−pd , where VC(F ) = d. The significance of Theorem 7 and Theorem 8 is that they provide uniform Lp entropy estimates for VC classes, while the L∞ estimates are not dimension-free. These uniform entropy bounds play a very important role in our discussion. In particular, they can be used to obtain uGC complexity estimated for VC classes, using Theorem 2. Theorem 9. There is an absolute constant C for which the following holds. Let F be a class of Boolean functions which has a finite VC dimension d. Then, for every 0 < ε, δ < 1, n  1 sup P r sup | f (Xi ) − Eµ f | ≥ ε ≤ δ, µ f ∈F n i=1

provided that n ≥

C ε2



 d log(2/ε) + log(2/δ) .

Using the same reasoning and by Lemma 1 it is possible to prove analogous results when F is the 2-loss class associated with a VC class and an arbitrary target T which maps Ω into [0, 1]. Generalized combinatorial parameters. After obtaining covering results (and generalization bounds) in the Boolean case, we attempt to extend our analysis to classes of real-valued functions. We focus on classes which consist of uniformly bounded functions, though it is possible to obtain some results in a slightly more general scenario [33]. Hence, throughout this section F will denote a class of functions which map Ω into [−1, 1]. The path we take here is very similar to the one we used for VC classes. Firstly, one has to define a combinatorial parameter which measures the “size” of the class. Definition 5. For every ε > 0, a set σ = {x1 , ..., xn } ⊂ Ω is said to be εshattered by F if there is some function s : σ → R, such that for every I ⊂ {1, ..., n} there is some fI ∈ F for which fI (xi ) ≥ s(xi ) + ε if i ∈ I, and fI (xi ) ≤ s(xi ) − ε if i ∈ I. Let    fatε (F ) = sup |σ| σ ⊂ Ω, σ is ε−shattered by F . fI is called the shattering function of the set I and the set called a witness to the ε-shattering.



s(xi )|xi ∈ σ



is

The first bounds on the empirical L∞ covering numbers in terms of the fatshattering dimension was established in [2], where is was shown that F is a uGC class if and only if it has a finite fat-shattering dimension for every ε. The proof

16

S. Mendelson

that if F is a uGC it has a finite fat-shattering dimension for every ε follows from a similar argument to the one used in the VC case. For the converse one requires empirical L∞ entropy estimates combined with Theorem 4. Dimension-free Lp entropy results for 1 ≤ p < ∞ in terms of the fat-shattering dimension were first proved in [18]. Both these results were improved in [21] and then in [22]. The proofs of all the results mentioned here are very difficult, and go beyond the scope of these notes. The second part of the following claim is due to Vershynin  (still unpublished). Let us denote by B L∞ (Ω) the unit ball in L∞ (Ω). Theorem 10. There are absolute constants K and c and constants  Kp , cp which depend only on p for which the following holds: for every F ⊂ B L∞ (Ω) , every sample sn , every 1 ≤ p < ∞ and any 0 < ε < 1,   2 Kp fatcp ε (F ) N ε, F, Lp (µ) ≤ , ε and, for any 0 < δ < 1,

n   , log N ε, F, L∞ (sn ) ≤ K · fatcδε (F ) log1+δ δε The significance of these entropy estimates goes far beyond learning theory. They are essential in solving highly non-trivial problems in convex geometry and in empirical processes [22, 25, 31, 32]. Using the bounds on the uniform entropy numbers and Theorem 2, one can establish the following sample complexity estimates. Theorem   11. There is an absolute constant C such that for every class F ⊂ B L∞ (Ω) and every 0 < ε, δ < 1, n  1 sup P r sup | f (Xi ) − Eµ f | ≥ ε ≤ δ, µ f ∈F n i=1

provided that n≥

8

2 C

+ log . fat (F ) · log ε/8 ε2 ε δ

Unfortunately, the VC dimension and the fat-shattering dimension have become the central issue in machine learning literature. One must remember that the combinatorial parameters were introduced as a way to estimate the uniform entropy numbers. In fact, they seem to be the wrong parameters to measure the complexity of learning problems. Ironically, they have a considerable geometric significance as many results indicate. To sum-up the results we have presented so far, it is possible to obtain uGC sample complexity estimates via symmetrization, a covering argument and Hoeffding’s inequality. The combinatorial parameters are used only to estimate the covering numbers one needs. One point in which a slight improvement can be made, is by replacing Hoeffding’s inequality with inequalities of a similar

A Few Notes on Statistical Learning Theory

17

nature, (e.g. Bernstein’s inequality or Bennett’s inequality [33]) in which additional data on the moments of the random variables is used to obtain tighter deviation bounds. However, this does not resolve the main problem in this line of argumentation - that passing to an ε-cover and applying the union bound is horribly loose. To solve this problem one needs a stronger deviation inequality for a supremum over a family of functions and not just a single one. This “functional” inequality is the subject of the next section and we show it yields tighter complexity bounds. 2.3

Talagrand’s Inequality

Let us begin by recalling Bernstein’s inequality [17, 33]. Theorem 12. Let µ be a probability measure on Ω and let X1 , ..., Xn be independent random n variables distributed according to µ. Given n a function f : Ω → R, set Z = i=1 f (Xi ), let b = f ∞ and put v = E( i=1 f 2 (Xi )). Then,   x2 P r |Z − Eµ Z| ≥ x ≤ 2e− 2(v+bx/3) . This deviation result is tighter than Hoeffding’s inequality because one has additional data on the variance of the random variable Z, which leads to potentially sharper bounds. It has been a long standing open question  n whether a similar result can be obtained when replacing Z by supf ∈F | i=1 f (Xi ) − Eµ f |. This “functional” inequality was first established by Talagrand [32], and later was modified and partially improved by Ledoux [14], Massart [17], Rio [27] and Bousquet [3]. Theorem 13. [17] Let µ be a probability measure on Ω and let X1 , ..., Xn be independent random variables according to µ. Given a class of func  n distributed tions F , set Z = supf ∈F | i=1 f (Xi ) − Eµ f |, let b = supf ∈F f ∞ and put   n σ 2 = supf ∈F i=1 var f (Xi ) . Then, there is an absolute constant C ≥ 1 such that for every x > 0 there is a set of probability larger than 1 − e−x on which √ (11) Z ≤ 2EZ + C(σ x + bx). Observe that if F consists of functions which are bounded by 1 then b = 1 and nε2 √ σ ≤ n. If we select x = nε2 /4C 2 then with probability larger than 1 − e− 4C 2 , 1 1 3ε f (Xi ) − Eµ f | ≤ 2E sup | f (Xi ) − Eµ f | + . n i=1 4 f ∈F n i=1 n

sup |

f ∈F

n

This equation holds with probability larger than 1 − δ provided that n ≥ (4C 2 /ε2 ) log 1δ . It follows that the dominating term in the complexity estimate is the expectation of the random variable Z. Again, the notion of symmetrization will come to our rescue in the attempt to estimate EZ. Let us define the (global) Rademacher averages associated with a class of functions.

18

S. Mendelson

Definition 6. Let µ be a probability measure on Ω and set F to be a class of uniformly bounded functions. For every integer n, let n  1 Rn (F ) = Eµ Eε √ sup | εi f (Xi )|, n f ∈F i=1

where (Xi )ni=1 are independent random variables distributed according to µ and (εi )ni=1 are independent Rademacher random variables. √ The reason for the seemingly strange normalization (of 1/ n instead of 1/n) will become evident in the next section. Now, we can prove an “averaged” version of the symmetrization result: Theorem 14. Let µ be a probability measure and set F to be a class of functions on Ω. Denote 1 f (Xi ) − Eµ f |, n i=1 n

Z = sup | f ∈F

where (Xi )ni=1 are independent variables distributed according to µ. Then, n   Rn (F ) 1 Eµ Z ≤ 2 √ εi |. ≤ 4Eµ Z + 2 sup Eµ f  · Eε | n i=1 n f ∈F

Proof. Let Y1 , ..., Yn be an independent copy of X1 , ..., Xn . Then, n 1   EX sup  f (Xi ) − EY f  f ∈F n i=1 n n 1  1   =EX sup  f (Xi ) − EY f − EY f (Yi ) − EY f | = (1). n i=1 f ∈F n i=1

Conditioning (1) with respect to X1 , ..., Xn and then applying Jensen’s inequality with respect to EY and Fubini’s Theorem, it follows that (1) ≤

n n n       1 1 f (Xi ) − f (Yi ) = EX EY sup  εi f (Xi ) − f (Yi ) , EX EY sup  n n f ∈F f ∈F i=1 i=1 i=1

where the latter inequality holds for every (εi )ni=1 ∈ {−1, 1}n . Therefore, it also holds when taking the expectation with respect to the Rademacher random variables (εi )ni=1 . By the triangle inequality, n n    2Rn (F )   1 2 EX EY Eε sup  εi f (Xi ) − f (Yi )  ≤ EX Eε sup  εi f (Xi ) = √ . n n n f ∈F i=1 f ∈F i=1

A Few Notes on Statistical Learning Theory

19

To prove the upper bound, the starting point is the triangle inequality which yields n   1 EX Eε sup  εi f (Xi ) n f ∈F i=1



n n  1       1 EX Eε sup  εi f (Xi ) − Eµ f  +  sup Eµ f  · Eε  εi . n n i=1 f ∈F i=1 f ∈F

To estimate the first term, let (Zi ) be the stochastic process defined by Zi (f ) = f (Xi ) − Eµ f and let Wi be an independent copy of (Zi ). For every f ∈ F , EWi (f ) = 0, thus, n n      εi f (Xi ) − Eµ f  = EZ Eε sup  εi Zi (f ) EX Eε sup  f ∈F i=1

f ∈F i=1 n 

 = Eε EZ sup 

f ∈F i=1

  εi Zi (f ) − EW Wi (f ) .

For every realization of the Rademacher random variables (εi )ni=1 and by Jensen’s inequality conditioned with respect to the Zi , n n       EZ sup  εi Zi (f ) − EW Wi (f )  ≤ EZ EW sup  εi Zi (f ) − Wi (f ) , f ∈F i=1

f ∈F i=1

which is invariant for under any selection of signs εi . Therefore, n n       Eε EZ sup  Zi (f ) − Wi (f )  εi Zi (f ) − EW Wi (f )  ≤ EZ EW sup  f ∈F i=1

f ∈F i=1 n 

 ≤ 2EZ sup 

f ∈F i=1

 Zi (f ).

This result implies that the expectation of the √ deviation of the empirical means from the actual ones is controlled by Rn (F )/ n. Therefore, we can formulate the following   Corollary 3. Let µ be a probability measure on Ω, set F ⊂ B L∞ (Ω) and put   n σ 2 = supf ∈F i=1 var f (Xi ) , where (Xi ) are independent random variables distributed according to µ. Then, there is an absolute constant C ≥ 1 such that for every x > 0, there is a set of probability larger than 1 − e−x on which n 1   4Rn (F ) C √ sup  f (Xi ) − Eµ f  ≤ √ + (σ x + x). n n f ∈F n i=1

In particular, there is an absolute constant C such that if  1 C , n ≥ 2 max Rn2 (F ), log ε δ   n then P r supf ∈F | n1 i=1 f (Xi ) − Eµ f | ≥ ε ≤ δ.

(12)

20

S. Mendelson

After establishing that the random averages control the uGC sample complexity, the natural question is how to estimate them. In particular, it is interesting to estimate them using the covering numbers and the combinatorial parameters which we investigated in previous sections. 2.4

Random Averages, Combinatorial Parameters, and Covering Numbers

In this section we present several ways in which one can bound the Rademacher averages associated with a class F . First we present structural results, which enable one to compute the averages of complicated classes using those of simple ones. Next, we give an example of a case in which the averages can be computed directly. Finally, we show how estimates on the empirical entropy of a class can be used to bound the random averages. Structural results. The following theorem summarizes some of the properties of the Rademacher averages we shall use. The difficulty of the proofs of the different observations varies considerably. Some of the claims are straightforward while others are very deep facts. Most of the results are true when replacing the Rademacher random variables with independent standard Gaussian ones (with very similar proofs), but we shall not present the analogous result in the Gaussian case. Theorem 15. Let F and G be classes of real-valued functions on (Ω, µ). Then, for every integer n, 1. If F ⊂ G, Rn (F ) ≤ Rn (G). 2. Rn (F ) = Rn (conv F ) = Rn (absconvF ), where conv(F ) is the convex hull of F and absconv(F ) = conv(F ∪ −F ) is the symmetric convex hull of F . 3. For every c ∈ R, Rn (cF ) = |c|Rn (F ). 4. If φ : R → R is a Lipschitz function with a constant  Lφ and satisfies φ(0) = 0, then Rn (φ ◦ F ) ≤ 2Lφ Rn (F ), where φ ◦ F = {φ f (·) |f ∈ F }. 5. For every 1 ≤ p < ∞, there is a constant cp which depend only on p, such that for every {x1 , ..., xn } ⊂ Ω, n n   p  1   cp Eε sup  εi f (xi ) p ≤ Eε sup  εi f (xi ) f ∈F

f ∈F

i=1

i=1

n  p  1  ≤ Eε sup  εi f (xi ) p . f ∈F

i=1 1

6. For any function h ∈ L2 (µ), Rn (F + h) ≤ Rn (F ) + (Eµ h2 ) 2 , where F + h = {f + h|f ∈ F }. 7. For every 1 < p < ∞ there is an absolute constant cp for which n n n p  1   p  1     cp E sup  εi f (Xi ) p ≤ E sup  εi f (Xi ) ≤ E sup  εi f (Xi ) p , f ∈F

f ∈F

i=1

2

i=1

provided that supf ∈F Eµ f ≥ 1/n.

f ∈F

i=1

A Few Notes on Statistical Learning Theory

21

Proof. Parts 1 and 3 are immediate To see part 2, ob from the definitions.  serve that Rn (F ) ≤ Rn conv(F ) ≤ Rn absconv(F ) . To prove the reverse inequality, note that H = absconv(F ) is symmetric and convex. for evHence, n n ery sample x , ..., x and any realization of (ε ) , sup | ε h(x n i i=1 i )| = h∈H i=1 i   n 1 m m suph∈H i=1 εi h(xi ). Every h ∈ H is given by i=1 λj fj where j=1 |λj | = 1, and thus, n 

εi h(xi ) =

m 

λj

j=1

i=1

n 

n   εi fj (xi ) ≤ sup  εi f (xi ). f ∈F

i=1

i=1

Hence, the supremum with respect to F and to H coincide. Part 4 is called the contraction inequality, and is due to Ledoux and Talagrand [15, Corollary 3.17]. Part 5 is the Kahane-Khintchine inequality [24]. As for part 6, note that for every sample x1 , ..., xn , n n n        Eε sup  εi f (xi ) + h(xi )  ≤ Eε sup  εi f (xi ) + Eε  εi h(xi ) = (∗). f ∈F

f ∈F

i=1

i=1

i=1

By Khintchine’s inequality for the second term and the fact that (εi )ni=1 are independent, n n    1 (∗) ≤ Eε sup  εi f (xi ) + h2 (xi ) 2 .



f ∈F

i=1

i=1

Normalizing by 1/ n, taking the expectation with respect to µ and by Jensen’s inequality, 1

Rn (F + h) ≤ Rn (F ) + (Eµ h2 ) 2 . Finally, part 7 follows from a concentration argument which will be presented in appendix 3.2. Remark 1. A significant fact we do not use but feel can not go unmentioned is that the Gaussian averages and the Rademacher averages are closely connected. Indeed, one can show (see, e.g. [24]) that there are absolute constants c and C which satisfy that for every class F , every integer n and any realization {x1 , ..., xn } n n n            εi f (xi ) ≤ Eε sup gi f (xi ) ≤ CEε sup εi f (xi ) · log n, cEε sup f ∈F

i=1

f ∈F

i=1

f ∈F

i=1

where (gi )ni=1 are independent standard Gaussian random variables. When one tries to estimate the random averages, the first and most natural approach is to try and compute them directly. There are very few cases in which such an attempt would be successful, and the one we chose to present is the case of kernel classes.

22

S. Mendelson

Example: Kernel Classes. Assume that Ω is a compact set and let K : Ω × Ω → R be a positive definite, continuous function. Let µ be a probability measure on Ω, and consider the integral operator T_K : L_2(µ) → L_2(µ) given by
\[
(T_K f)(x) = \int K(x,y)f(y)\,d\mu(y).
\]
By Mercer's Theorem, T_K has a diagonal representation; that is, there exists a complete orthonormal basis (φ_n)_{n=1}^∞ of L_2(µ) and a non-increasing sequence of eigenvalues (λ_n)_{n=1}^∞ such that for every sequence (a_n) ∈ ℓ_2, T_K(Σ_{n=1}^∞ a_n φ_n) = Σ_{n=1}^∞ a_n λ_n φ_n. Under certain mild assumptions on the measure µ, Mercer's Theorem implies that for every x, y ∈ Ω,
\[
K(x,y) = \sum_{n=1}^{\infty}\lambda_n\phi_n(x)\phi_n(y).
\]
Let F_K be the class consisting of all the functions of the form Σ_{i=1}^m a_i K(x_i, ·) for every m ∈ N ∪ {∞}, every (x_i)_{i=1}^m ∈ Ω^m and every sequence (a_i)_{i=1}^m for which Σ_{i,j=1}^m a_i a_j K(x_i, x_j) ≤ 1. One can show that F_K is the unit ball of a Hilbert space associated with the integral operator, called the reproducing kernel Hilbert space, and we denote it by H. In fact, the unit ball of H is simply √T_K B(L_2(µ)), which is the image of the L_2(µ) unit ball by the operator which maps φ_i to √λ_i φ_i. An important property of the inner product in H is that for every f ∈ H, ⟨f, K(x, ·)⟩_H = f(x).

An alternative way to define the reproducing kernel Hilbert space is via the feature map. Let Φ : Ω → ℓ_2 be defined by Φ(x) = (√λ_i φ_i(x))_{i=1}^∞. Then,
\[
F_K = \big\{ f(\cdot) = \langle\beta, \Phi(\cdot)\rangle \;\big|\; \|\beta\|_2 \le 1 \big\}.
\]
Observe that for every x, y ∈ Ω, ⟨Φ(x), Φ(y)⟩ = K(x, y). Let us compute the Rademacher averages of F_K with respect to the probability measure µ.

Theorem 16. Assume that the largest eigenvalue of T_K satisfies λ_1 ≥ 1/n. Then, for every such integer n,
\[
c\Big(\sum_{i=1}^{\infty}\lambda_i\Big)^{1/2} \le R_n(F_K) \le C\Big(\sum_{i=1}^{\infty}\lambda_i\Big)^{1/2},
\]
where (λ_i)_{i=1}^∞ are the eigenvalues of the integral operator T_K arranged in non-increasing order, C, c are absolute constants and F_K is the unit ball in the reproducing kernel Hilbert space.

Remark 2. As the proof we present reveals, the upper bound on R_n(F_K) holds even without the assumption on the largest eigenvalue of T_K.

Before proving the claim, we require the following lemma:

Lemma 3. Let F_K be the unit ball of the reproducing kernel Hilbert space H associated with the kernel K. For every sample s_n = {x_1, ..., x_n} let (θ_i(s_n))_{i=1}^n be the singular values of the operator T : R^n → H defined by T e_i = K(x_i, ·). Then,
\[
E_\varepsilon \sup_{f\in F_K}\Big|\frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i f(x_i)\Big|^2 = \frac{1}{n}\sum_{i=1}^n \theta_i^2 .
\]

Proof. By the reproducing kernel property,
\[
E_\varepsilon \sup_{f\in F_K}\Big|\sum_{i=1}^n \varepsilon_i f(x_i)\Big|^2 = E_\varepsilon \sup_{f\in F_K}\Big|\Big\langle \sum_{i=1}^n \varepsilon_i K(x_i,\cdot), f\Big\rangle_H\Big|^2 = E_\varepsilon \sup_{f\in F_K}\Big|\Big\langle \sum_{i=1}^n \varepsilon_i T e_i, f\Big\rangle_H\Big|^2 .
\]
Since F_K is the unit ball in H,
\[
E_\varepsilon \sup_{f\in F_K}\Big|\Big\langle \sum_{i=1}^n \varepsilon_i T e_i, f\Big\rangle_H\Big|^2 = E_\varepsilon \Big\|\sum_{i=1}^n \varepsilon_i T e_i\Big\|_H^2 .
\]
Thus,
\[
E_\varepsilon \sup_{f\in F_K}\Big|\sum_{i=1}^n \varepsilon_i f(x_i)\Big|^2 = E_\varepsilon \Big\|\sum_{i=1}^n \varepsilon_i T e_i\Big\|_H^2 = \sum_{i=1}^n \|T e_i\|_H^2 = \sum_{i=1}^n \theta_i^2(s_n),
\]
which, after normalizing by n, proves our claim.

Proof (Theorem 16). Firstly, it is easy to see that there exists some f ∈ F_K for which E_µ f^2 ≥ 1/n. Indeed, f = √T_K φ_1 = √λ_1 φ_1 ∈ H satisfies E_µ f^2 = λ_1 ≥ 1/n. Thus, using part 7 of Theorem 15, R_n(F_K) is equivalent to n^{-1/2}(E sup_{f∈F_K} |Σ_{i=1}^n ε_i f(X_i)|^2)^{1/2}. Applying the previous lemma and using its notation,
\[
\frac{1}{n}E_\mu E_\varepsilon\Big(\sup_{f\in F_K}\Big|\sum_{i=1}^n \varepsilon_i f(X_i)\Big|^2 \,\Big|\, X_1,...,X_n\Big) = \frac{1}{n}E_\mu \sum_{i=1}^n \theta_i^2(s_n).
\]
By the definition of the operator T, (θ_i^2(s_n))_{i=1}^n are the eigenvalues of T*T, and it is easy to see that T*T = (K(x_i, x_j))_{i,j=1}^n. Therefore,
\[
\sum_{i=1}^n \theta_i^2(s_n) = \mathrm{tr}(T^*T) = \sum_{i=1}^n K(x_i, x_i).
\]
Hence,
\[
\frac{1}{n}E_\mu E_\varepsilon\Big(\sup_{f\in F_K}\Big|\sum_{i=1}^n \varepsilon_i f(X_i)\Big|^2 \,\Big|\, X_1,...,X_n\Big) = \frac{1}{n}E_\mu \sum_{i=1}^n K(X_i, X_i).
\]


To conclude the proof, one takes the expectation with respect to µ and recalls that
\[
E_\mu K(X_i, X_i) = E_\mu \sum_{j=1}^{\infty}\lambda_j\phi_j^2(X_i) = \sum_{j=1}^{\infty}\lambda_j .
\]
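To make the identity used in the proof concrete: for a fixed sample, sup_{f∈F_K} Σ_i ε_i f(x_i) = ‖Σ_i ε_i K(x_i, ·)‖_H = (ε^T K ε)^{1/2}, where K is the Gram matrix, so E_ε sup_{f∈F_K} |Σ_i ε_i f(x_i)|^2 = E_ε ε^T K ε = tr(K). The following sketch (an illustration added here, not from the original text; the Gaussian kernel and the sample are arbitrary choices) checks this numerically.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 3))             # sample x_1, ..., x_n
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                                   # Gaussian kernel Gram matrix

# sup over the RKHS unit ball: sup_{||f||_H <= 1} sum_i eps_i f(x_i) = sqrt(eps^T K eps)
eps = rng.choice([-1.0, 1.0], size=(100000, K.shape[0]))
sup_sq = np.einsum('ri,ij,rj->r', eps, K, eps)    # (sup ...)^2 for each sign draw
print(sup_sq.mean(), np.trace(K))                 # the two numbers agree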

Corollary 4. Let (Ω, µ) be a probability space, set F_K to be the unit ball in the reproducing kernel Hilbert space and put tr(K) = Σ_{i=1}^∞ λ_i. Let T ∈ B(L_∞(Ω)) and denote by L the loss class associated with F_K and T. Then, there is an absolute constant C such that
\[
Pr\Big\{\sup_{f\in L}\Big|\frac{1}{n}\sum_{i=1}^n f(X_i) - E_\mu f\Big| \ge \varepsilon\Big\} \le \delta,
\]
provided that n ≥ (C/ε^2) max{1 + tr(K), log(1/δ)}.

Proof. The proof follows immediately from Corollary 3 and the estimates on the Rademacher averages of F_K. Indeed, by Theorem 15,
\[
R_n(L) = R_n\big((F_K - T)^2 - (P_{F_K}T - T)^2\big) \le 4R_n(F_K - T) + \|P_{F_K}T - T\|_\infty^2 \le 4\big(R_n(F_K) + C\|T\|_\infty\big) + 4,
\]
where C is an absolute constant.

Entropy and averages. Unfortunately, in the vast majority of cases it is next to impossible to compute the random averages directly. Thus, one has to resort to alternative routes to estimate the random averages, especially from above, since this is the direction one needs for sample complexity upper bounds. We show that it is possible to bound the Rademacher and Gaussian averages using the empirical L_2 entropy of the class. This follows from results due to Dudley [6] and Sudakov [30]. Originally, the bounds were established for Gaussian processes, and later they were extended to the sub-gaussian setup [8, 33], which includes Rademacher processes.

Theorem 17. There are absolute constants C and c for which the following holds. For any integer n, any sample {x_1, ..., x_n} and every class F,
\[
c\sup_{\varepsilon>0}\varepsilon\log^{1/2} N\big(\varepsilon, F, L_2(\mu_n)\big) \le \frac{1}{\sqrt{n}}E_\varepsilon\sup_{f\in F}\Big|\sum_{i=1}^n \varepsilon_i f(x_i)\Big| \le C\int_0^\infty \log^{1/2} N\big(\varepsilon, F, L_2(\mu_n)\big)\,d\varepsilon,
\]
where µ_n is the empirical measure supported on the sample. This result implies that if the class is relatively small, then its Rademacher averages are uniformly bounded.
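As a quick illustration of Theorem 17 (our own addition, not part of the original notes; the finite class and the greedy covering routine are arbitrary choices), one can compare a Monte Carlo estimate of the normalized Rademacher average with a crude evaluation of the entropy integral for a small finite class, for which covering numbers in L_2(µ_n) can be bounded by brute force:

import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 200
F = rng.uniform(-1, 1, size=(m, n))        # row = (f(x_1), ..., f(x_n)) for one function

def covering_size(F, eps):
    # greedy eps-cover in the empirical L2 metric; its size upper bounds N(eps, F, L2(mu_n))
    centers = []
    for v in F:
        if all(np.sqrt(np.mean((v - c) ** 2)) > eps for c in centers):
            centers.append(v)
    return len(centers)

signs = rng.choice([-1.0, 1.0], size=(20000, n))
rad = np.abs(signs @ F.T).max(axis=1).mean() / np.sqrt(n)   # left-hand quantity, Monte Carlo

grid = np.linspace(1e-3, 2.0, 200)
integrand = [np.sqrt(np.log(covering_size(F, e))) for e in grid]
entropy_integral = np.trapz(integrand, grid)
print(rad, entropy_integral)    # the integral dominates the average up to an absolute constant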


Corollary 5. There is an absolute constant C such that for every Boolean class F with VC(F) = d and every integer n, R_n(F) ≤ C√d.

Proof. Since F is a Boolean class, all of its members are bounded by 1. Thus, for every ε ≥ 1 only a single ball of radius ε is needed to cover F. Using the uniform L_2 entropy bound in Theorem 8 it follows that for every integer n and every empirical measure µ_n,
\[
\log N\big(\varepsilon, F, L_2(\mu_n)\big) \le Cd\log(1/\varepsilon),
\]
and the claim is evident from Theorem 17.

In a similar way one can show that if F ⊂ B(L_∞(Ω)) has a polynomial fat-shattering dimension with exponent strictly less than 2, it has uniformly bounded Rademacher averages. This is true because one can obtain a uniform L_2-entropy bound for which the entropy integral converges. It is less obvious what can be done if the entropy integral diverges, in which case Theorem 17 does not apply. To handle this case, we present a stronger version of Dudley's entropy bound, which will be formulated for Gaussian random variables.

Lemma 4. [18] Let µ_n be an empirical measure supported on {x_1, ..., x_n} ⊂ Ω, put F ⊂ B(L_∞(Ω)) and set (ε_k)_{k=0}^∞ to be a monotone sequence decreasing to 0 with ε_0 = 1. Then, there is an absolute constant C such that for every integer N,
\[
\frac{1}{\sqrt{n}}E\sup_{f\in F}\Big|\sum_{i=1}^n g_i f(x_i)\Big| \le C\sum_{k=1}^N \varepsilon_{k-1}\log^{1/2} N\big(\varepsilon_k, F, L_2(\mu_n)\big) + 2\varepsilon_N n^{1/2},
\]
where (g_i)_{i=1}^n are standard Gaussian random variables. In particular,
\[
\frac{1}{\sqrt{n}}E\sup_{f\in F}\Big|\sum_{i=1}^n g_i f(x_i)\Big| \le C\sum_{k=1}^N \varepsilon_{k-1}\,\mathrm{fat}_{\varepsilon_k/8}^{1/2}(F)\log^{1/2}\frac{2}{\varepsilon_k} + 2\varepsilon_N n^{1/2}. \tag{13}
\]
The latter part of Lemma 4 follows from its first part and Theorem 10. Before presenting the proof of Lemma 4, we require the following lemma, which is based on the classical inequality due to Slepian [26, 8].

Lemma 5. Let (Z_i)_{i=1}^N be Gaussian random variables (i.e., Z_i = Σ_{j=1}^m a_{i,j} g_j, where g_1, ..., g_m are independent standard Gaussian random variables). Then, there is some absolute constant C such that E sup_i Z_i ≤ C sup_{i,j} ‖Z_i - Z_j‖_2 log^{1/2} N.

Proof (Lemma 4). We may assume that F is symmetric and contains 0; the proof in the non-symmetric case follows the same path. Let µ_n be an empirical measure supported on {x_1, ..., x_n}. For every f ∈ F, let Z_f = n^{-1/2} Σ_{i=1}^n g_i f(x_i), where (g_i)_{i=1}^n are independent standard Gaussian random variables on the probability space (Y, P). Set Z_F = {Z_f | f ∈ F} and


define V : L_2(µ_n) → L_2(Y, P) by V(f) = Z_f. Since V is an isometry for which V(F) = Z_F,
\[
N\big(\varepsilon, F, L_2(\mu_n)\big) = N\big(\varepsilon, Z_F, L_2(P)\big).
\]
Let (ε_k)_{k=0}^∞ be a monotone sequence decreasing to 0 such that ε_0 = 1, and set H_k ⊂ Z_F to be a 2ε_k cover of Z_F. Thus, for every k ∈ N and every Z_f ∈ Z_F there is some Z_{f_k} ∈ H_k such that ‖Z_f - Z_{f_k}‖_2 ≤ 2ε_k, and we select Z_{f_0} = 0. Writing Z_f = Σ_{k=1}^N (Z_{f_k} - Z_{f_{k-1}}) + Z_f - Z_{f_N}, it follows that
\[
E\sup_{f\in F} Z_f \le \sum_{k=1}^N E\sup_{f\in F}\big(Z_{f_k} - Z_{f_{k-1}}\big) + E\sup_{f\in F}\big(Z_f - Z_{f_N}\big).
\]
By the definition of Z_{f_k} and Lemma 5, there is an absolute constant C for which
\[
E\sup_{f\in F}\big(Z_{f_k} - Z_{f_{k-1}}\big) \le E\sup\big\{Z_i - Z_j \,\big|\, Z_i\in H_k, Z_j\in H_{k-1}, \|Z_i - Z_j\|_2 \le 4\varepsilon_{k-1}\big\} \le C\sup_{i,j}\|Z_i - Z_j\|_2\log^{1/2}\big(|H_k|\,|H_{k-1}|\big) \le C\varepsilon_{k-1}\log^{1/2} N\big(\varepsilon_k, F, L_2(\mu_n)\big).
\]
Since Z_{f_N} ∈ Z_F, there is some f' ∈ F such that Z_{f_N} = Z_{f'}. Hence,
\[
\Big(\frac{1}{n}\sum_{i=1}^n\big(f(x_i) - f'(x_i)\big)^2\Big)^{1/2} = \|Z_f - Z_{f'}\|_2 \le 2\varepsilon_N,
\]
which implies that for every f ∈ F and every y ∈ Y,
\[
\big|Z_f(y) - Z_{f_N}(y)\big| \le \Big|\sum_{i=1}^n g_i(y)\frac{f(x_i) - f'(x_i)}{\sqrt{n}}\Big| \le 2\varepsilon_N\Big(\sum_{i=1}^n g_i^2(y)\Big)^{1/2}.
\]
Therefore, E sup_{f∈F}(Z_f - Z_{f_N}) ≤ 2ε_N E(Σ_{i=1}^n g_i^2)^{1/2} ≤ 2ε_N √n, and the claim follows.

Using this result it is possible to estimate the Rademacher averages of classes with a polynomial fat-shattering dimension.

Theorem 18. Let F ⊂ B(L_∞(Ω)) and assume that there is some γ > 1 such that for any ε > 0, fat_ε(F) ≤ γε^{-p}. Then, there are constants C_p, which depend only on p, such that
\[
R_n(F) \le C_p\gamma^{1/2}\begin{cases} 1 & \text{if } 0 < p < 2,\\ \log^{3/2} n & \text{if } p = 2,\\ n^{\frac{1}{2}-\frac{1}{p}}\log^{1/p} n & \text{if } p > 2.\end{cases}
\]


Proof. Let µ_n be an empirical measure on Ω. If p < 2 then by Theorem 10,
\[
\int_0^\infty \log^{1/2} N\big(\varepsilon, F, L_2(\mu_n)\big)\,d\varepsilon \le C_p\gamma^{1/2},
\]
and the bound follows from the upper bound in Theorem 17. Assume that p ≥ 2 and, using the notation of Lemma 4, select ε_k = 2^{-k} and N = p^{-1}log_2(n/log_2(n)). By (13),
\[
R_n(F) \le C_p\gamma^{1/2}\sum_{k=1}^N \varepsilon_k^{1-\frac{p}{2}}\log^{1/2}\frac{2}{\varepsilon_k} + 2\varepsilon_N n^{1/2} \le C_p\gamma^{1/2}\sum_{k=1}^N \sqrt{k}\,2^{k(\frac{p}{2}-1)} + 2n^{\frac{1}{2}-\frac{1}{p}}.
\]
If p = 2, the sum is bounded by
\[
C_p\gamma^{1/2}N^{3/2} \le C_p\gamma^{1/2}\log^{3/2} n,
\]
whereas if p > 2 it is bounded by C_p γ^{1/2} n^{1/2 - 1/p} log^{1/p} n.

These bounds on R_n are "worst case" bounds, since they hold for any empirical measure. In fact, the underlying measure µ plays no part in the bounds. Using a geometric interpretation of the fat-shattering dimension, it is possible to show that the "worst case" bounds we established are tight (up to the exact power of the logarithm), in the sense that if fat_ε(F) = Ω(ε^{-p}) for p > 2, then for every integer n there is a sample {x_1, ..., x_n} for which
\[
\frac{1}{\sqrt{n}}E_\varepsilon\sup_{f\in F}\Big|\sum_{i=1}^n \varepsilon_i f(x_i)\Big| \ge c\,n^{\frac{1}{2}-\frac{1}{p}},
\]
where c is an absolute constant. Since this is not the main issue we wish to address in these notes, we refer the interested reader to [18].

The complexity bounds one obtains using Corollary 3 and Theorem 18 are a significant improvement over the ones obtained via Theorem 11. Indeed, the sample complexity estimate obtained there was that if fat_ε(F) = O(ε^{-p}) then
\[
S_F(\varepsilon, \delta) = O\Big(\frac{1}{\varepsilon^{2+p}}\cdot\Big(\log^2\frac{2}{\varepsilon} + \log\frac{2}{\delta}\Big)\Big).
\]
Using Talagrand's inequality, we obtain a sharper bound:

Theorem 19. Let F ⊂ B(L_∞(Ω)) and assume that fat_ε(F) ≤ γε^{-p}. Then, there is a constant C_p, which depends only on p, such that
\[
S_F(\varepsilon, \delta) \le C_p\max\Big\{\frac{1}{\varepsilon^{p}},\ \frac{1}{\varepsilon^{2}}\log\frac{1}{\delta}\Big\}
\]
if p ≠ 2. If p = 2 there is an additional logarithmic factor in 1/ε.


We were able to obtain this improved result because we removed the main source of looseness in the "classical" argument: the union bound. But this is not the end of the story. There is still one additional source of sub-optimality; as we said in the introduction, the uGC property only yields upper bounds on the quantity we actually wish to explore, the learning sample complexity. In the next section, we use methods very similar to the ones used here and obtain even tighter bounds.

3 Learning Sample Complexity

After bounding the uGC sample complexity using Corollary 3 and establishing bounds on the Rademacher averages, we now turn to the alternative approach, which will prove to yield tighter learning sample complexity bounds. Recall that the question we wish to answer is how to ensure that an "almost minimizer" of the empirical loss will be close to the minimum of the actual loss. Thus, our aim is to bound
\[
Pr\Big\{\exists f\in L,\ \frac{1}{n}\sum_{i=1}^n f(X_i) \le \varepsilon/2,\ E_\mu f \ge \varepsilon\Big\}. \tag{14}
\]

To that end, we need to impose an important structural assumption on the class at hand.

Assumption 1. Assume that there is an absolute constant B such that for every f ∈ F, E_µ f^2 ≤ B E_µ f.

Though this assumption seems restrictive, it turns out that it holds in all the cases we are interested in.

Lemma 6. Let F ⊂ B(L_∞(Ω)) satisfy Assumption 1. Fix ε > 0, define
\[
H = \Big\{\frac{\varepsilon f}{E_\mu f}\ \Big|\ f\in F,\ E_\mu f \ge \varepsilon,\ E_\mu f^2 \ge \varepsilon\Big\},
\]
and set
\[
F_\varepsilon = \{f\in F \mid E_\mu f^2 \le \varepsilon\}, \qquad H_\varepsilon = \{h\in H \mid E_\mu h^2 \le B\varepsilon\}.
\]
Then,
\[
Pr\Big\{\exists f\in F,\ \frac{1}{n}\sum_{i=1}^n f(X_i) \le \varepsilon/2,\ E_\mu f \ge \varepsilon\Big\} \le Pr\Big\{\sup_{f\in F_\varepsilon}|E_\mu f - E_{\mu_n}f| \ge \frac{\varepsilon}{2}\Big\} + Pr\Big\{\sup_{h\in H_\varepsilon}|E_\mu h - E_{\mu_n}h| \ge \frac{\varepsilon}{2}\Big\}.
\]
In particular, for every 0 < δ < 1,
\[
C_L\Big(\frac{\varepsilon}{2}, \delta\Big) \le \max\Big\{S_{F_\varepsilon}\Big(\frac{\varepsilon}{2}, \frac{\delta}{2}\Big),\ S_{H_\varepsilon}\Big(\frac{\varepsilon}{2}, \frac{\delta}{2}\Big)\Big\}. \tag{15}
\]

Proof. Denote by µ_n the random empirical measure n^{-1}Σ_{i=1}^n δ_{X_i}. Then,
\[
Pr\big\{\exists f\in F,\ E_{\mu_n}f \le \varepsilon/2,\ E_\mu f \ge \varepsilon\big\} \le Pr\big\{\exists f\in F,\ E_\mu f \ge \varepsilon,\ E_\mu f^2 < \varepsilon,\ E_{\mu_n}f \le \varepsilon/2\big\} + Pr\big\{\exists f\in F,\ E_\mu f \ge \varepsilon,\ E_\mu f^2 \ge \varepsilon,\ E_{\mu_n}f \le \varepsilon/2\big\} = (1) + (2).
\]
If E_µ f ≥ ε and E_{µ_n} f ≤ ε/2, then E_µ f ≥ ½(E_µ f + ε) ≥ ½E_µ f + E_{µ_n} f. Therefore |E_µ f - E_{µ_n} f| ≥ ½E_µ f ≥ ε/2, hence
\[
(1) + (2) \le Pr\big\{\exists f\in F,\ E_\mu f^2 < \varepsilon,\ |E_\mu f - E_{\mu_n}f| \ge \tfrac{\varepsilon}{2}\big\} + Pr\big\{\exists f\in F,\ E_\mu f \ge \varepsilon,\ E_\mu f^2 \ge \varepsilon,\ |E_\mu f - E_{\mu_n}f| \ge \tfrac{1}{2}E_\mu f\big\} = (3) + (4).
\]
The first term is bounded by Pr{sup_{f∈F_ε} |E_{µ_n} f - E_µ f| ≥ ε/2}. As for the second, assume that |E_{µ_n} f - E_µ f| ≥ (E_µ f)/2 and that E_µ f ≥ ε. Then h = εf/E_µ f satisfies |E_{µ_n} h - E_µ h| ≥ ε/2, and since E_µ f^2 ≤ B(E_µ f),
\[
E_\mu h^2 \le B\,\frac{\varepsilon^2}{E_\mu f} \le B\varepsilon.
\]
Therefore, (4) ≤ Pr{∃h ∈ H_ε, |E_{µ_n} h - E_µ h| ≥ ε/2}.

To simplify this estimate, we require the following definition:

Definition 7. Let X be a normed space and let A ⊂ X. We say that A is star-shaped with center x if for every a ∈ A the interval [a, x] = {tx + (1-t)a | 0 ≤ t ≤ 1} is contained in A. Given A and x, denote by star(A, x) the union of all the intervals [a, x], where a ∈ A.

It is easy to see that each element h ∈ H is of the form α_f f, where 0 ≤ α_f ≤ 1. Thus, H ⊂ star(F, 0), and obviously F ⊂ star(F, 0). Therefore,
\[
Pr\Big\{\exists f\in F,\ \frac{1}{n}\sum_{i=1}^n f(X_i) < \varepsilon/2,\ E_\mu f \ge \varepsilon\Big\} \le 2\,Pr\Big\{\exists h\in\mathrm{star}(F,0),\ E_\mu h^2 \le B\varepsilon,\ |E_\mu h - E_{\mu_n}h| \ge \frac{\varepsilon}{2}\Big\}. \tag{16}
\]

This implies that the question of obtaining sample complexity estimates may be reduced to a GC deviation problem for a class which is the intersection of star(F, 0) with an L2 (µ) ball, centered at 0 with radius proportional to the square-root of the required deviation. Combining this with Corollary 3 yields the following fundamental result:


Theorem 20. Let F ⊂ B(L_∞(Ω)) and assume that Assumption 1 holds. Set H = star(F, 0) and for every ε > 0 let H_ε = H ∩ {h : E_µ h^2 ≤ ε}. Then, for every 0 < ε, δ < 1,
\[
Pr\Big\{\exists f\in F,\ \frac{1}{n}\sum_{i=1}^n f(X_i) \le \varepsilon/2,\ E_\mu f \ge \varepsilon\Big\} \le \delta,
\]
provided that
\[
n \ge C\max\Big\{\frac{R_n^2(H_\varepsilon)}{\varepsilon^2},\ \frac{B\log\frac{2}{\delta}}{\varepsilon}\Big\}.
\]
The proof of this theorem follows immediately from (12) in Corollary 3. Theorem 20 shows that the important quantity which governs the learning sample complexity is the "localized" Rademacher average R_n(H_ε), assuming, of course, that Assumption 1 holds.

Before presenting bounds on the localized Rademacher averages of some classes, let us comment on Assumption 1. It clearly holds for 2-loss classes if the target function is a member of the original class G, since in that case P_G T = T, and every loss function is nonnegative and bounded by 4. The situation when T ∉ G is much more difficult. One can show that if G ⊂ B(L_∞(Ω)) is convex and T ∈ B(L_∞(Ω)), then for every probability measure µ and every 2-loss function f, E_µ f^2 ≤ 16 E_µ f [16, 19]. In fact, it is possible to obtain results of a similar flavor for q-loss classes, where the "usual" exponent 2 is replaced with some q ≥ 2 (see [19]). Even the convexity assumption can be relaxed in the following sense: if G ⊂ L_2(µ) is not convex, then there will be functions which have more than a single best approximation in G. The set of functions which do not have a unique best approximation in G is denoted by nup(G, µ), and it clearly depends on the probability measure µ, because a change of measure generates a different way of measuring distances. One can show ([23]) that given a measure µ and a target T ∉ nup(G, µ), the 2-loss class L satisfies E_µ f^2 ≤ B E_µ f for every f ∈ L. The constant B will depend on "how far" T is from nup(G, µ). Thus, the complexity bounds one obtains in this case are both target and measure dependent. For the sake of simplicity, in all the cases we shall be interested in we impose the assumption that either T ∈ G or that G is convex. In both these cases, a choice of B = 16 suffices to ensure that Assumption 1 holds.

3.1 Localized Random Averages

In an analogous way to what we did in Section 2.4, we present two paths one can take when computing the random averages. For the direct approach we present the example of kernel classes. The second approach, which may be used in the vast majority of examples is to apply uniform entropy estimates.


Localized averages of kernel classes. Here, we present a direct tight bound on the localized Rademacher averages of F_K in terms of the eigenvalues of the integral operator T_K. It is important to note that the underlying measure in the definition of R_n and of T_K has to be the same, which emphasizes the difficulty from the learning theoretic viewpoint, since one does not have a priori knowledge of the underlying measure.

Theorem 21. [20] There are absolute constants c and C for which the following holds. Let K be a kernel and set µ to be a probability measure on Ω. If (λ_i)_{i=1}^∞ are the eigenvalues of the integral operator T_K (with respect to µ) and if λ_1 ≥ 1/n, then for every ε ≥ 1/n,
\[
c\Big(\sum_{j=1}^{\infty}\min\{\lambda_j, \varepsilon\}\Big)^{1/2} \le \frac{1}{\sqrt{n}}E_\mu E_\varepsilon\sup_{f\in F_\varepsilon}\Big|\sum_{i=1}^n \varepsilon_i f(X_i)\Big| \le C\Big(\sum_{j=1}^{\infty}\min\{\lambda_j, \varepsilon\}\Big)^{1/2},
\]
where F_ε = {f ∈ F_K, E_µ f^2 ≤ ε}.

Remark 3. The upper bound in Theorem 21 holds even without the assumptions on λ_1 and ε, and this is the direction we require for sample complexity bounds. The assumption is imposed only to enable one to obtain matching upper and lower bounds.

Proof. Let R_ε = sup_{f∈F_ε} |Σ_{i=1}^n ε_i f(X_i)|. Just as in the proof of Theorem 16, there is some f ∈ F_K for which E_µ f^2 ≥ 1/n. Hence, there will be some 0 < t ≤ 1 for which f_1 = tf ∈ F_ε and E_µ f_1^2 ≥ 1/n. Thus, sup_{f∈F_ε} E_µ f^2 ≥ 1/n and, by Theorem 15, part 7, E R_ε is equivalent to (E R_ε^2)^{1/2}.

We can assume that ℓ_2 is the reproducing kernel Hilbert space and recall that F_K = {f(·) = ⟨β, Φ(·)⟩ | ‖β‖_2 ≤ 1}, where Φ is the kernel feature map. Setting B(ε) = {f | E_µ f^2 ≤ ε}, it follows that f ∈ F_K is also in B(ε) if and only if its representing vector β satisfies Σ_{i=1}^∞ β_i^2 λ_i ≤ ε. Hence, in ℓ_2,
\[
F_\varepsilon = F_K\cap B(\varepsilon) = \Big\{\beta\ \Big|\ \sum_{i=1}^{\infty}\beta_i^2 \le 1,\ \sum_{i=1}^{\infty}\beta_i^2\lambda_i \le \varepsilon\Big\}.
\]
Let E ⊂ ℓ_2 be defined as {β | Σ_{i=1}^∞ µ_i β_i^2 ≤ 1}, where µ_i = (min{1, ε/λ_i})^{-1}, and note that
\[
E \subset F_K\cap B(\varepsilon) \subset \sqrt{2}\,E .
\]
Therefore, one can replace F_ε by E in the computation of R_n(F_ε), losing a factor of √2 at the most. Finally, let (e_i)_{i=1}^∞ be the standard basis in ℓ_2. By the definition of E it follows that for every v ∈ ℓ_2,
\[
\sup_{\beta\in E}\Big\langle\sum_{i=1}^{\infty}\sqrt{\mu_i}\,\beta_i e_i,\ v\Big\rangle = \langle v, v\rangle^{1/2}.
\]


Hence, it is evident that
\[
E_\varepsilon\sup_{\beta\in E}\Big|\Big\langle\beta, \sum_{j=1}^n \varepsilon_j\Phi(X_j)\Big\rangle\Big|^2 = E_\varepsilon\sup_{\beta\in E}\Big|\Big\langle\sum_{i=1}^{\infty}\sqrt{\mu_i}\,\beta_i e_i,\ \sum_{i=1}^{\infty}\Big(\frac{\lambda_i}{\mu_i}\Big)^{1/2}\Big(\sum_{j=1}^n \varepsilon_j\phi_i(X_j)\Big)e_i\Big\rangle\Big|^2 = \sum_{i=1}^{\infty}\frac{\lambda_i}{\mu_i}\sum_{j=1}^n \phi_i^2(X_j),
\]
and, since E_µ φ_i^2 = 1,
\[
E_\mu E_\varepsilon\sup_{\beta\in E}\Big|\Big\langle\beta, \sum_{j=1}^n \varepsilon_j\Phi(X_j)\Big\rangle\Big|^2 = n\sum_{i=1}^{\infty}\frac{\lambda_i}{\mu_i} = n\sum_{i=1}^{\infty}\min\{\lambda_i, \varepsilon\},
\]
which proves our claim.

As an example, consider the case where the eigenvalues of T_K satisfy λ_i ∼ 1/i^p for some p > 1. It is easy to see that in that case Σ_i min{λ_i, ε} = O(ε^{1-1/p}) (roughly the first ε^{-1/p} terms contribute ε each, and the remaining tail sums to the same order), hence R_n(F_ε) ≤ Cε^{1/2 - 1/(2p)}. Therefore, if T ∈ F_K, then according to Theorem 20 the learning sample complexity (when the sampling is done with respect to the measure µ!) is
\[
C(\varepsilon, \delta) = O\Big(\max\Big\{\frac{1}{\varepsilon^{1+1/p}},\ \frac{\log(2/\delta)}{\varepsilon}\Big\}\Big).
\]

Using the Entropy. The previous section is somewhat misleading, since the reader might develop the feeling that computing localized averages directly is a winning strategy. Unfortunately, even if the geometry of the original class is well behaved and enables direct computation, the problem becomes considerably harder in the localized case. In the latter, one has to take into account the intersection body of the original class and an L_2(µ) ball. Thus, in most cases one has no choice but to resort to indirect methods, like entropy based bounds.

Theorem 17 may be used to compute the localized version of the Rademacher averages in the following manner: let Y be a random variable which measures the empirical radius of the class, that is, Y^{1/2} = (sup_{f∈F} n^{-1}Σ_{i=1}^n f^2(X_i))^{1/2}. Given a sample (x_1, ..., x_n) and any ε ≥ Y^{1/2}(x_1, ..., x_n), only a single ball is needed to cover the entire class. Hence,
\[
\frac{1}{\sqrt{n}}E_\varepsilon\sup_{f\in F}\Big|\sum_{i=1}^n \varepsilon_i f(x_i)\Big| \le C\int_0^{Y^{1/2}(x_1,...,x_n)}\log^{1/2} N\big(\varepsilon, F, L_2(\mu_n)\big)\,d\varepsilon .
\]
Taking the expectation with respect to the sample, it follows that there is an absolute constant C such that for every class F,
\[
R_n(F) \le C\,E\int_0^{\sqrt{Y}}\log^{1/2} N\big(\varepsilon, F, L_2(\mu_n)\big)\,d\varepsilon,
\]
where Y = sup_{f∈F} n^{-1}Σ_{i=1}^n f^2(X_i). Of course, the information we have is not on the random variable Y, but rather on σ_F^2 = sup_{f∈F} E_µ f^2. Fortunately, it is possible to connect the two, as the following result, which is due to Talagrand [32], shows.

Lemma 7. Let F ⊂ B(L_∞(Ω)) and set σ_F^2 = sup_{f∈F} E_µ f^2. Then,
\[
E_\mu\sup_{f\in F}\sum_{i=1}^n f^2(X_i) \le n\sigma_F^2 + 8\sqrt{n}\,R_n(F).
\]


Using this fact, it turns out that if one has data on the uniform entropy, one can estimate the localized Rademacher averages. As an example, consider the case when the entropy is logarithmic in 1/ε.

Lemma 8. Let F ⊂ B(L_∞(Ω)) and set σ_F^2 = sup_{f∈F} E_µ f^2. Assume that there are γ > 1, d ≥ 1 and p ≥ 1 such that
\[
\log N_2(\varepsilon, F) \le d\log^p\frac{\gamma}{\varepsilon}.
\]
Then, there is a constant C_{p,γ}, which depends only on p and γ, for which
\[
R_n(F) \le C_{p,\gamma}\max\Big\{\frac{d}{\sqrt{n}}\log^p\frac{1}{\sigma_F},\ \sqrt{d}\,\sigma_F\log^{\frac{p}{2}}\frac{1}{\sigma_F}\Big\}.
\]
Before proving the lemma, we require the next result:

Lemma 9. For every 0 ≤ p < ∞ and γ > 1, there is some constant c_{p,γ} such that for every 0 < x < 1,
\[
\int_0^x\log^p\frac{\gamma}{\varepsilon}\,d\varepsilon \le 2x\log^p\frac{c_{p,\gamma}}{x},
\]
and x^{1/2}log^p(c_{p,γ}/x) is increasing and concave in (0, 10).

The first part of the proof follows from the fact that both terms are equal at x = 0, while for an appropriate constant c_{p,γ} the derivative of the left-hand side is smaller than that of the right-hand side. The second part is evident by differentiation.

Proof (Lemma 8). Set Y = n^{-1}sup_{f∈F}Σ_{i=1}^n f^2(X_i). By Theorem 17 there is an absolute constant C such that
\[
\frac{1}{\sqrt{n}}E_\varepsilon\sup_{f\in F}\Big|\sum_{i=1}^n \varepsilon_i f(X_i)\Big| \le C\int_0^{\sqrt{Y}}\log^{1/2} N\big(\varepsilon, F, L_2(\mu_n)\big)\,d\varepsilon \le C\sqrt{d}\int_0^{\sqrt{Y}}\log^{\frac{p}{2}}\frac{\gamma}{\varepsilon}\,d\varepsilon .
\]
By Lemma 9 there is a constant c_{p,γ} such that for every 0 < x ≤ 1,
\[
\int_0^x\log^{\frac{p}{2}}\frac{\gamma}{\varepsilon}\,d\varepsilon \le 2x\log^{\frac{p}{2}}\frac{c_{p,\gamma}}{x},
\]
and v(x) = √x log^{p/2}(c_{p,γ}/x) is increasing and concave in (0, 10). Since Y ≤ 1,
\[
\frac{1}{\sqrt{n}}E_\varepsilon\sup_{f\in F}\Big|\sum_{i=1}^n \varepsilon_i f(X_i)\Big| \le C_p\sqrt{d}\,\sqrt{Y}\log^{\frac{p}{2}}\frac{c_{p,\gamma}}{Y},
\]
and since σ_F^2 + 8R_n/√n ≤ 9, then by Jensen's inequality, Lemma 7 and the fact that v is increasing and concave in (0, 10),
\[
E_\mu\Big(Y^{1/2}\log^{\frac{p}{2}}\frac{c_{p,\gamma}}{Y}\Big) \le (E_\mu Y)^{1/2}\log^{\frac{p}{2}}\frac{c_{p,\gamma}}{E_\mu Y} \le c_{p,\gamma}\Big(\sigma_F^2 + \frac{8R_n}{\sqrt{n}}\Big)^{1/2}\log^{\frac{p}{2}}\frac{1}{\sigma_F}.
\]
Therefore,
\[
R_n(F) \le C_{p,\gamma}\sqrt{d}\Big(\sigma_F^2 + \frac{R_n(F)}{\sqrt{n}}\Big)^{1/2}\log^{\frac{p}{2}}\frac{1}{\sigma_F},
\]
and our claim follows from a straightforward computation.

In a similar manner one can show that if there are γ and p < 2 such that log N_2(ε, F) ≤ γ/ε^p, then
\[
R_n(F) \le C_{p,\gamma}\max\Big\{n^{-\frac{1}{2}\frac{2-p}{2+p}},\ \sigma_F^{1-\frac{p}{2}}\Big\}, \tag{17}
\]
and if log N_2(ε, F) ≤ (γ/ε^p)log^2(2/ε), then
\[
R_n(F) \le C_{p,\gamma}\max\Big\{n^{-\frac{1}{2}\frac{2-p}{2+p}}\log^{\beta}\frac{2}{\sigma_F},\ \sigma_F^{1-\frac{p}{2}}\log\frac{2}{\sigma_F}\Big\}, \tag{18}
\]
where β = 4/(2 + p).

Let F ⊂ B(L_∞(Ω)) and set F_ε = {f ∈ F | E_µ f^2 ≤ ε}. Since F_ε ⊂ F, its entropy must be smaller than that of F. Therefore, all the estimates above hold for F_ε when one replaces σ_F^2 by ε. The next step is to connect the entropy of the original class G to that of F = star(L, 0). Let us recall that the uniform entropy of the loss class is controlled by that of G (see Lemma 1). Hence, all that remains is to see whether taking the star-shaped hull of L with 0 increases the entropy by much.

Lemma 10. Let X be a normed space and let A ⊂ B(X) be totally bounded (i.e., have a compact closure). Then, for any ‖x‖ ≤ 1 and every ε > 0,
\[
\log N\big(2\varepsilon, \mathrm{star}(A, x)\big) \le \log\frac{2}{\varepsilon} + \log N(\varepsilon, A).
\]


Proof. Fix ε > 0 and let y_1, ..., y_k be an ε-cover of A. Note that for any a ∈ A and any z ∈ [a, x] there is some z' ∈ [y_i, x] such that ‖z' - z‖ < ε. Hence, an ε-cover of the union ∪_{i=1}^k [y_i, x] is a 2ε-cover for star(A, x). Since for every i, ‖x - y_i‖ ≤ 2, each interval may be covered by 2ε^{-1} balls of radius ε, and our claim follows.

Corollary 6. Assume that G consists of functions which map Ω into [0, 1] and that the same holds for T. Then, for any ε, ρ > 0,
\[
\log N_2(\rho, F_\varepsilon) \le \log N_2(\rho/8, G) + \log(4/\rho),
\]
where F_ε = {f ∈ star(L, 0) | E_µ f^2 ≤ ε}.

This result yields sample complexity estimates when one has estimates on the L_2 entropy of the class (which can be obtained using the combinatorial parameters or other methods). The case we present here is when the class has a polynomial uniform entropy.

Theorem 22. Let G ⊂ B(L_∞(Ω)) be a convex class of functions and assume that N_2(ε, G) ≤ γε^{-p} for some 0 < p < ∞. Set T ∈ B(L_∞(Ω)) and put L to be the loss class associated with G and T. Then,
\[
Pr\Big\{\exists f\in L,\ \frac{1}{n}\sum_{i=1}^n f(X_i) \le \varepsilon,\ E_\mu f \ge 2\varepsilon\Big\} \le \delta,
\]
provided that
\[
n \ge C(p, \gamma)\max\Big\{\frac{1}{\varepsilon^{1+\frac{p}{2}}},\ \frac{\log(1/\delta)}{\varepsilon}\Big\} \quad \text{if } 0 < p < 2,
\]
and
\[
n \ge C(p, \gamma)\max\Big\{\frac{1}{\varepsilon^{p}},\ \frac{\log(1/\delta)}{\varepsilon}\Big\} \quad \text{if } p > 2.
\]
Proof. Let F = star(L, 0) and set F_ε = {f ∈ F | E_µ f^2 ≤ ε}. Applying Theorem 18, it follows that for every integer n, every ε > 0 and any p > 2,
\[
R_n(F_\varepsilon) \le R_n(F) \le C_p n^{\frac{1}{2}-\frac{1}{p}}.
\]
To estimate the localized averages for 0 < p < 2, one uses the previous corollary and (17). Both parts of the theorem are now immediate from Theorem 20.

3.2 The Iterative Scheme

The biggest downside of our analysis is the fact that the localized Rademacher averages are very hard to compute, and it is almost impossible to estimate them using the empirical data one receives. In fact, all the results presented here were


based on some kind of a priori data about the learning problem we had to face; for example, we imposed assumptions on the growth rates of the uniform entropy of the class. It is highly desirable to obtain estimates which are data-dependent. This could be done if we had the ability to replace the L_2(µ) ball in the definition of the localized averages by the empirical ball {f ∈ F | n^{-1}Σ_{i=1}^n f^2(X_i) ≤ ε}. Koltchinskii and Panchenko [12] have introduced a computable iterative scheme which enabled them to replace the "actual" ball by an empirical one for a random sequence of radii r_k = r_k(X_1, ..., X_n). In some cases, this method proved to be an effective way of bounding the localized averages. In fact, when one has some "global" data (e.g. uniform entropy bounds), the iterative scheme gives the same asymptotic bounds as the ones obtained using the entropic approach. To this day, there is no proof that the iterative scheme always converges to the "correct" value of the localized averages. Even more so, the question of when it is possible to replace the L_2(µ) ball by an empirical ball remains open.

References

1. M. Anthony, P.L. Bartlett: Neural Network Learning: Theoretical Foundations, Cambridge University Press, 1999.
2. N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale sensitive dimensions, uniform convergence and learnability, J. of ACM 44(4), 615–631, 1997.
3. O. Bousquet: A Bennett concentration inequality and its application to suprema of empirical processes, preprint.
4. L. Devroye, L. Györfi, G. Lugosi: A Probabilistic Theory of Pattern Recognition, Springer, 1996.
5. R.M. Dudley: Real Analysis and Probability, Chapman and Hall, 1993.
6. R.M. Dudley: The sizes of compact subsets of Hilbert space and continuity of Gaussian processes, J. of Functional Analysis 1, 290–330, 1967.
7. R.M. Dudley: Central limit theorems for empirical measures, Annals of Probability 6(6), 899–929, 1978.
8. R.M. Dudley: Uniform Central Limit Theorems, Cambridge Studies in Advanced Mathematics 63, Cambridge University Press, 1999.
9. E. Giné, J. Zinn: Some limit theorems for empirical processes, Annals of Probability 12(4), 929–989, 1984.
10. D. Haussler: Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension, J. of Combinatorial Theory (A) 69, 217–232, 1995.
11. W. Hoeffding: Probability inequalities for sums of bounded random variables, J. of the American Statistical Association 58, 13–30, 1963.
12. V. Koltchinskii, D. Panchenko: Rademacher processes and bounding the risk of function learning, High Dimensional Probability II (Seattle, WA, 1999), 443–457, Progr. Probab. 47, Birkhäuser.
13. R. Latala, K. Oleszkiewicz: On the best constant in the Khintchine-Kahane inequality, Studia Math. 109(1), 101–104, 1994.
14. M. Ledoux: The Concentration of Measure Phenomenon, Mathematical Surveys and Monographs, Vol. 89, AMS, 2001.


15. M. Ledoux, M. Talagrand: Probability in Banach Spaces: Isoperimetry and Processes, Springer, 1991.
16. W.S. Lee, P.L. Bartlett, R.C. Williamson: The importance of convexity in learning with squared loss, IEEE Transactions on Information Theory 44(5), 1974–1980, 1998.
17. P. Massart: About the constants in Talagrand's concentration inequality for empirical processes, Annals of Probability 28(2), 863–884, 2000.
18. S. Mendelson: Rademacher averages and phase transitions in Glivenko-Cantelli classes, IEEE Transactions on Information Theory 48(1), 251–263, 2002.
19. S. Mendelson: Improving the sample complexity using global data, IEEE Transactions on Information Theory 48(7), 1977–1991, 2002.
20. S. Mendelson: Geometric parameters of kernel machines, in Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), J. Kivinen and R.H. Sloan (Eds.), Lecture Notes in Computer Science 2375, Springer, 29–43, 2002.
21. S. Mendelson, R. Vershynin: Entropy, combinatorial dimensions and random averages, in Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), J. Kivinen and R.H. Sloan (Eds.), Lecture Notes in Computer Science 2375, Springer, 14–28, 2002.
22. S. Mendelson, R. Vershynin: Entropy and the combinatorial dimension, Inventiones Mathematicae, to appear.
23. S. Mendelson, R.C. Williamson: Agnostic learning nonconvex classes of functions, in Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), J. Kivinen and R.H. Sloan (Eds.), Lecture Notes in Computer Science 2375, Springer, 1–13, 2002.
24. V.D. Milman, G. Schechtman: Asymptotic Theory of Finite Dimensional Normed Spaces, Lecture Notes in Mathematics 1200, Springer, 1986.
25. A. Pajor: Sous espaces ℓ_1^n des espaces de Banach, Hermann, Paris, 1985.
26. G. Pisier: The Volume of Convex Bodies and Banach Space Geometry, Cambridge University Press, 1989.
27. E. Rio: Une inégalité de Bennett pour les maxima de processus empiriques, preprint.
28. N. Sauer: On the density of families of sets, J. Combinatorial Theory (A) 13, 145–147, 1972.
29. S. Shelah: A combinatorial problem: stability and orders for models and theories in infinitary languages, Pacific Journal of Mathematics 41, 247–261, 1972.
30. V.N. Sudakov: Gaussian processes and measures of solid angles in Hilbert space, Soviet Mathematics Doklady 12, 412–415, 1971.
31. M. Talagrand: Type, infratype and the Elton-Pajor theorem, Inventiones Mathematicae 107, 41–59, 1992.
32. M. Talagrand: Sharper bounds for Gaussian and empirical processes, Annals of Probability 22(1), 28–76, 1994.
33. A.W. Van der Vaart, J.A. Wellner: Weak Convergence and Empirical Processes, Springer-Verlag, 1996.
34. V. Vapnik: Statistical Learning Theory, Wiley, 1998.
35. A. Vidyasagar: The Theory of Learning and Generalization, Springer-Verlag, 1996.
36. V. Vapnik, A. Chervonenkis: Necessary and sufficient conditions for uniform convergence of means to mathematical expectations, Theory Prob. Applic. 26(3), 532–553, 1971.


4 Appendix: Concentration of Measure and Rademacher Averages

In this section we prove that all the L_p norms of the Rademacher averages of a class are equivalent, as long as the class is not contained in a "very small" ball.

Theorem 23. For every 1 < p < ∞ there is a constant c_p for which the following holds. Let F be a class of functions, set µ to be a probability measure on Ω and put σ_F^2 = sup_{f∈F} E_µ f^2. If n satisfies σ_F^2 ≥ 1/n, then
\[
c_p\Big(E\sup_{f\in F}\Big|\sum_{i=1}^n \varepsilon_i f(X_i)\Big|^p\Big)^{1/p} \le E\sup_{f\in F}\Big|\sum_{i=1}^n \varepsilon_i f(X_i)\Big| \le \Big(E\sup_{f\in F}\Big|\sum_{i=1}^n \varepsilon_i f(X_i)\Big|^p\Big)^{1/p},
\]
where (X_i)_{i=1}^n are independent random variables distributed according to µ and the expectation is taken with respect to the product measure associated with the Rademacher variables and the variables X_i.

The proof of this theorem is based on the fact that sup_{f∈F} |Σ_{i=1}^n ε_i f(X_i)| is highly concentrated around its mean value, with an exponential tail. The first step in the proof is to show that if one can establish such an exponential tail for a class of functions, then all the L_p norms are equivalent on the class. In fact, we prove a little more:

Lemma 11. Let G be a class of nonnegative functions for which there is some constant c_0 such that for every g ∈ G and every integer m,
\[
Pr\big\{|g - Eg| \ge m\,Eg\big\} \le 2e^{-c_0 m}.
\]
Then, for every 0 < p < ∞ there are constants c_p and C_p, which depend only on p and c_0, such that for every g ∈ G,
\[
c_p(Eg^p)^{1/p} \le Eg \le C_p(Eg^p)^{1/p}.
\]
Proof. Fix some 0 < p < ∞ and g ∈ G, and set a = Eg. Clearly, Eg^p = Eg^p χ_{\{g


Fig. 2. A binary classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it half-way between the two classes. The problem being separable, there exists a weight vector w and a threshold b such that y_i · ((w · x_i) + b) > 0 (i = 1, ..., m). Rescaling w and b such that the point(s) closest to the hyperplane satisfy |(w · x_i) + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying y_i · ((w · x_i) + b) ≥ 1. Note that in this case, the margin, measured perpendicularly to the hyperplane, equals 2/‖w‖. This can be seen by considering two points x_1, x_2 on opposite sides of the margin, i.e., (w · x_1) + b = 1, (w · x_2) + b = -1, and projecting them onto the hyperplane normal vector w/‖w‖ (from [17]).

To construct this Optimal Hyperplane (cf. Figure 2), one solves the following optimization problem:
\[
\text{minimize } \tau(w) = \frac{1}{2}\|w\|^2 \tag{23}
\]
\[
\text{subject to } y_i\cdot((w\cdot x_i)+b) \ge 1,\quad i = 1, ..., m. \tag{24}
\]
This constrained optimization problem is dealt with by introducing Lagrange multipliers α_i ≥ 0 and a Lagrangian
\[
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i\big(y_i\cdot((x_i\cdot w)+b)-1\big). \tag{25}
\]

The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables αi (i.e., a saddle point has to be found). Let us try to get some intuition for this. If a constraint (24) is violated, then yi · ((w · xi ) + b) − 1 < 0, in which case L can be increased by increasing the corresponding αi . At the same time, w and b will have to change such that L decreases. To prevent −αi (yi · ((w · xi ) + b) − 1) from becoming arbitrarily large, the change in w and b will ensure that, provided the problem is


separable, the constraint will eventually be satisfied. Similarly, one can understand that for all constraints which are not precisely met as equalities, i.e., for which y_i · ((w · x_i) + b) - 1 > 0, the corresponding α_i must be 0: this is the value of α_i that maximizes L. The latter is the statement of the Karush-Kuhn-Tucker complementarity conditions of optimization theory [6]. The condition that at the saddle point the derivatives of L with respect to the primal variables must vanish,
\[
\frac{\partial}{\partial b}L(w, b, \alpha) = 0, \qquad \frac{\partial}{\partial w}L(w, b, \alpha) = 0, \tag{26}
\]
leads to
\[
\sum_{i=1}^m \alpha_i y_i = 0 \tag{27}
\]
and
\[
w = \sum_{i=1}^m \alpha_i y_i x_i . \tag{28}
\]
The solution vector thus has an expansion in terms of a subset of the training patterns, namely those patterns whose α_i is non-zero, called Support Vectors. By the Karush-Kuhn-Tucker complementarity conditions
\[
\alpha_i\cdot\big[y_i((x_i\cdot w)+b)-1\big] = 0, \quad i = 1, ..., m, \tag{29}
\]

the Support Vectors lie on the margin (cf. Figure 2). All remaining examples of the training set are irrelevant: their constraint (24) does not play a role in the optimization, and they do not appear in the expansion (28). This nicely captures our intuition of the problem: as the hyperplane (cf. Figure 2) is completely determined by the patterns closest to it, the solution should not depend on the other examples.

By substituting (27) and (28) into L, one eliminates the primal variables and arrives at the Wolfe dual of the optimization problem (e.g., [6]): find multipliers α_i which
\[
\text{maximize } W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j(x_i\cdot x_j) \tag{30}
\]
\[
\text{subject to } \alpha_i \ge 0,\ i = 1, ..., m, \quad\text{and}\quad \sum_{i=1}^m \alpha_i y_i = 0. \tag{31}
\]
The hyperplane decision function can thus be written as
\[
f(x) = \mathrm{sgn}\Big(\sum_{i=1}^m y_i\alpha_i\cdot(x\cdot x_i)+b\Big), \tag{32}
\]
where b is computed using (29).
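The following sketch (an illustration added here, not part of the original text; the toy data and the use of scikit-learn with a very large C to mimic the hard-margin case are our own choices) recovers the expansion (28) and checks the complementarity conditions (29) numerically.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# a linearly separable toy problem
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)), rng.normal([-2, -2], 0.3, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel='linear', C=1e6).fit(X, y)       # huge C approximates the hard margin

# dual_coef_ stores y_i * alpha_i for the support vectors
w = clf.dual_coef_ @ clf.support_vectors_          # expansion (28): w = sum_i alpha_i y_i x_i
print(np.allclose(w, clf.coef_))                   # matches the primal weight vector

# complementarity (29): support vectors sit on the margin, y_i ((w . x_i) + b) = 1
margins = y[clf.support_] * (X[clf.support_] @ clf.coef_.ravel() + clf.intercept_[0])
print(np.round(margins, 3))                        # all close to 1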


The structure of the optimization problem closely resembles those that typically arise in Lagrange’s formulation of mechanics. Also there, often only a subset of the constraints become active. For instance, if we keep a ball in a box, then it will typically roll into one of the corners. The constraints corresponding to the walls which are not touched by the ball are irrelevant, the walls could just as well be removed. Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes [9]: If we assume that each support vector xi exerts a perpendicular force of size αi and sign yi on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements of mechanical stability. The constraint (27) states that the forces on  the sheet sum to zero; and (28) implies that the torques also sum to zero, via i xi × yi αi · w/w = w × w/w = 0. There are theoretical arguments supporting the good generalization performance of the optimal hyperplane [30, 28, 4, 25, 35]. In addition, it is computationally attractive, since it can be constructed by solving a quadratic programming problem.

4 Support Vector Classifiers

We now have all the tools to describe support vector machines [29, 23]. Everything in the last section was formulated in a dot product space. We think of this space as the feature space H described in Section 1. To express the formulas in terms of the input patterns living in X, we thus need to employ (5), which expresses the dot product of bold face feature vectors x, x' in terms of the kernel k evaluated on input patterns x, x',
\[
k(x, x') = (\mathbf{x}\cdot\mathbf{x}'). \tag{33}
\]

This can be done since all feature vectors only occurred in dot products. The weight vector (cf. (28)) then becomes an expansion in feature space,^2 and will thus typically no longer correspond to the image of a single vector from input space. We thus obtain decision functions of the more general form (cf. (32))
\[
f(x) = \mathrm{sgn}\Big(\sum_{i=1}^m y_i\alpha_i\cdot(\Phi(x)\cdot\Phi(x_i))+b\Big) = \mathrm{sgn}\Big(\sum_{i=1}^m y_i\alpha_i\cdot k(x, x_i)+b\Big), \tag{34}
\]
and the following quadratic program (cf. (30)):

^2 This constitutes a special case of the so-called representer theorem, which states that under fairly general conditions, the minimizers of objective functions which contain a penalizer in terms of a norm in feature space will have kernel expansions [32, 23].


Fig. 3. Example of a Support Vector classifier found by using a radial basis function kernel k(x, x ) = exp(−x − x 2 ). Both coordinate axes range from -1 to +1. Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (24). Note that the Support Vectors found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Grey values code the modulus of the  argument m i=1 yi αi · k(x, xi ) + b of the decision function (34) (from [17]).)

\[
\text{maximize } W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j k(x_i, x_j) \tag{35}
\]
\[
\text{subject to } \alpha_i \ge 0,\ i = 1, ..., m, \quad\text{and}\quad \sum_{i=1}^m \alpha_i y_i = 0. \tag{36}
\]

In practice, a separating hyperplane may not exist, e.g. if a high noise level causes a large overlap of the classes. To allow for the possibility of examples violating (24), one introduces slack variables [10, 29, 22]
\[
\xi_i \ge 0, \quad i = 1, ..., m, \tag{37}
\]
in order to relax the constraints to
\[
y_i\cdot((w\cdot x_i)+b) \ge 1-\xi_i, \quad i = 1, ..., m. \tag{38}
\]
A classifier which generalizes well is then found by controlling both the classifier capacity (via ‖w‖) and the sum of the slacks Σ_i ξ_i. The latter is done as it can be shown to provide an upper bound on the number of training errors, which leads to a convex optimization problem.


One possible realization of a soft margin classifier is minimizing the objective function
\[
\tau(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i \tag{39}
\]
subject to the constraints (37) and (38), for some value of the constant C > 0 determining the trade-off. Here and below, we use boldface Greek letters as a shorthand for corresponding vectors ξ = (ξ_1, ..., ξ_m). Incorporating kernels, and rewriting it in terms of Lagrange multipliers, this again leads to the problem of maximizing (35), subject to the constraints
\[
0 \le \alpha_i \le C,\ i = 1, ..., m, \quad\text{and}\quad \sum_{i=1}^m \alpha_i y_i = 0. \tag{40}
\]

The only difference from the separable case is the upper bound C on the Lagrange multipliers α_i. This way, the influence of the individual patterns (which could be outliers) gets limited. As above, the solution takes the form (34). The threshold b can be computed by exploiting the fact that for all SVs x_i with α_i < C, the slack variable ξ_i is zero (this again follows from the Karush-Kuhn-Tucker complementarity conditions), and hence
\[
\sum_{j=1}^m y_j\alpha_j\cdot k(x_i, x_j)+b = y_i . \tag{41}
\]
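As an illustration (our own addition; the dataset and kernel parameters are arbitrary), one can verify (41) numerically with any soft margin SVM solver: for every support vector whose multiplier is strictly below C, the decision function must return exactly its label.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=200))   # noisy, non-separable problem

C = 1.0
clf = SVC(kernel='rbf', gamma=1.0, C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_.ravel())             # |y_i alpha_i| = alpha_i
margin_svs = clf.support_[alpha < C - 1e-8]        # SVs with alpha_i < C have xi_i = 0
# (41): sum_j y_j alpha_j k(x_i, x_j) + b = y_i for those support vectors
print(np.round(clf.decision_function(X[margin_svs]), 3))      # approximately +/- 1
print(y[margin_svs])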

Another possible realization of a soft margin variant of the optimal hyperplane uses the ν-parameterization [22]. In it, the parameter C is replaced by a parameter ν ∈ [0, 1] which can be shown to lower and upper bound the number of examples that will be SVs and that will come to lie on the wrong side of the hyperplane, respectively. It uses a primal objective function with the error term (1/(νm))Σ_i ξ_i - ρ, and separation constraints
\[
y_i\cdot((w\cdot x_i)+b) \ge \rho-\xi_i, \quad i = 1, ..., m. \tag{42}
\]
The margin parameter ρ is a variable of the optimization problem. The dual can be shown to consist of maximizing the quadratic part of (35), subject to 0 ≤ α_i ≤ 1/(νm), Σ_i α_i y_i = 0 and the additional constraint Σ_i α_i = 1.

5 Support Vector Regression

The concept of the margin is specific to pattern recognition. To generalize the SV algorithm to regression estimation [29], an analog of the margin is constructed in the space of the target values y (note that in regression, we have y ∈ R) by using Vapnik's ε-insensitive loss function (Figure 4)
\[
|y-f(x)|_\varepsilon := \max\{0, |y-f(x)|-\varepsilon\}. \tag{43}
\]



Fig. 4. In SV regression, a tube with radius ε is fitted to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables ξ) is determined by minimizing (46) (from [17]).

To estimate a linear regression
\[
f(x) = (w\cdot x)+b \tag{44}
\]
with precision ε, one minimizes
\[
\frac{1}{2}\|w\|^2 + C\sum_{i=1}^m |y_i-f(x_i)|_\varepsilon . \tag{45}
\]

Written as a constrained optimization problem, this reads:
\[
\text{minimize } \tau(w, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m(\xi_i+\xi_i^*) \tag{46}
\]
\[
\text{subject to } ((w\cdot x_i)+b)-y_i \le \varepsilon+\xi_i \tag{47}
\]
\[
\phantom{\text{subject to }} y_i-((w\cdot x_i)+b) \le \varepsilon+\xi_i^* \tag{48}
\]
\[
\phantom{\text{subject to }} \xi_i, \xi_i^* \ge 0 \tag{49}
\]
for all i = 1, ..., m. Note that according to (47) and (48), any error smaller than ε does not require a nonzero ξ_i or ξ_i^*, and hence does not enter the objective function (46). Generalization to kernel-based regression estimation is carried out in complete analogy to the case of pattern recognition. Introducing Lagrange multipliers, one thus arrives at the following optimization problem: for C > 0, ε ≥ 0 chosen a priori,

\[
\text{maximize } W(\alpha, \alpha^*) = -\varepsilon\sum_{i=1}^m(\alpha_i^*+\alpha_i) + \sum_{i=1}^m(\alpha_i^*-\alpha_i)y_i - \frac{1}{2}\sum_{i,j=1}^m(\alpha_i^*-\alpha_i)(\alpha_j^*-\alpha_j)k(x_i, x_j) \tag{50}
\]
\[
\text{subject to } 0 \le \alpha_i, \alpha_i^* \le C,\ i = 1, ..., m, \quad\text{and}\quad \sum_{i=1}^m(\alpha_i-\alpha_i^*) = 0. \tag{51}
\]
The regression estimate takes the form
\[
f(x) = \sum_{i=1}^m(\alpha_i^*-\alpha_i)k(x_i, x)+b, \tag{52}
\]

where b is computed using the fact that (47) becomes an equality with ξi = 0 if 0 < αi < C, and (48) becomes an equality with ξi∗ = 0 if 0 < αi∗ < C. Several extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on the vector (w, ξ) (cf. (46)). There are multiple degrees of freedom for constructing it, including some freedom how to penalize, or regularize, different parts of the vector, and some freedom how to use the kernel trick. For instance, more general loss functions can be used for ξ, leading to problems that can still be solved efficiently [27]. Moreover, norms other than the 2-norm . can be used to regularize the solution. Yet another example is that polynomial kernels can be incorporated which consist of multiple layers, such that the first layer only computes products within certain specified subsets of the entries of w [17]. Finally, the algorithm can be modified such that ε need not be specified a priori. Instead, one specifies an upper bound 0 ≤ ν ≤ 1 on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs) and the corresponding ε is computed automatically. This is achieved by using as primal objective function 1 w2 + C 2



νmε +

m 



|yi − f (xi )|ε

(53)

i=1

instead of (45), and treating ε ≥ 0 as a parameter that we minimize over [22]. We conclude this section by noting that the SV algorithm has not only been generalized to regression, but also, more recently, to one-class problems and novelty detection [20]. Moreover, the kernel method for computing dot products in feature spaces is not restricted to SV machines. Indeed, it has been pointed out that it can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such as principal component analysis [21], and a number of developments have followed this example.
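A minimal sketch of ε-insensitive regression in practice (our own addition, not part of the original text; the synthetic data, kernel and parameter values are arbitrary). It checks the remark after (49): training points whose error is strictly smaller than ε need no slack and therefore do not become support vectors.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, (200, 1)), axis=0)
y = np.sinc(X).ravel() + 0.1 * rng.normal(size=200)

eps = 0.1
reg = SVR(kernel='rbf', gamma=1.0, C=10.0, epsilon=eps).fit(X, y)

residuals = np.abs(y - reg.predict(X))
inside = residuals < eps - 1e-6                   # points strictly inside the epsilon-tube
# such points are not support vectors (up to solver tolerance)
print(np.intersect1d(np.where(inside)[0], reg.support_).size)   # expected: 0
print(len(reg.support_), "support vectors out of", len(X))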

6 Polynomial Kernels

We now take a closer look at the issue of the similarity measure, or kernel, k. In this section, we think of X as a subset of the vector space RN , (N ∈ N), endowed with the canonical dot product (3).



...

support vectors x1 ... xn test vector x

Fig. 5. Architecture of SV machines. The input x and the Support Vectors xi are nonlinearly mapped (by Φ) into a feature space H, where dot products are computed. By the use of the kernel k, these two layers are in practice computed in one single step. The results are linearly combined by weights υi , found by solving a quadratic program (in pattern recognition, υi = yi αi ; in regression estimation, υi = αi∗ − αi ). The linear combination is fed into the function σ (in pattern recognition, σ(x) = sgn (x + b); in regression estimation, σ(x) = x + b) (from [17]).

6.1

Product Features

Suppose we are given patterns x ∈ RN where most information is contained in the d-th order products (monomials) of entries [x]j of x, [x]j1 · · · · · [x]jd ,

(54)

where j1 , . . . , jd ∈ {1, . . . , N }. In that case, we might prefer to extract these product features, and work in the feature space H of all products of d entries. In visual recognition problems, where images are often represented as vectors, this would amount to extracting features which are products of individual pixels. For instance, in R2 , we can collect all monomial feature extractors of degree 2 in the nonlinear map Φ : R2 → H = R3 ([x]1 , [x]2 ) →

([x]21 , [x]22 , [x]1 [x]2 ).

(55) (56)


This approach works fine for small toy examples, but it fails for realistically sized problems: for N -dimensional input patterns, there exist NH =

(N + d − 1)! d!(N − 1)!

(57)

different monomials (54), comprising a feature space H of dimensionality NH . For instance, already 16 × 16 pixel input images and a monomial degree d = 5 yield a dimensionality of 1010 . In certain cases described below, there exists, however, a way of computing dot products in these high-dimensional feature spaces without explicitely mapping into them: by means of kernels nonlinear in the input space RN . Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimensionality. The following section describes how dot products in polynomial feature spaces can be computed efficiently. 6.2

Polynomial Feature Spaces Induced by Kernels

In order to compute dot products of the form (Φ(x) · Φ(x )), we employ kernel representations of the form k(x, x ) = (Φ(x) · Φ(x )),

(58)

which allow us to compute the value of the dot product in H without having to carry out the map Φ. This method was used by [8] to extend the Generalized Portrait hyperplane classifier of [30] to nonlinear Support Vector machines. [1] call H the linearization space, and used in the context of the potential function classification method to express the dot product between elements of H in terms of elements of the input space. What does k look like for the case of polynomial features? We start by giving an example [29] for N = d = 2. For the map C2 : ([x]1 , [x]2 ) → ([x]21 , [x]22 , [x]1 [x]2 , [x]2 [x]1 ),

(59)

dot products in H take the form (C2 (x) · C2 (x )) = [x]21 [x ]21 + [x]22 [x ]22 + 2[x]1 [x]2 [x ]1 [x ]2 = (x · x )2 ,

(60)

i.e., the desired kernel k is simply the square of the dot product in input space. The same works for arbitrary N, d ∈ N [8]: as a straightforward generalization of a result proved in the context of polynomial approximation (Lemma 2.1, [16]), we have: Proposition 1. Define Cd to map x ∈ RN to the vector Cd (x) whose entries are all possible d-th degree ordered products of the entries of x. Then the corresponding kernel computing the dot product of vectors mapped by Cd is k(x, x ) = (Cd (x) · Cd (x )) = (x · x )d .

(61)


Proof. We directly compute (Cd (x) · Cd (x )) =

N 

[x]j1 · · · · · [x]jd · [x ]j1 · · · · · [x ]jd

(62)

j1 ,...,jd =1

d  N  =  [x]j · [x ]j  = (x · x )d .

(63)

j=1

Instead of ordered products, we can use unordered ones to obtain a map Φd which yields the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in Cd by scaling the respective entries of Φd with the square roots of their numbers of occurrence. Then, by this definition of Φd , and (61), (Φd (x) · Φd (x )) = (Cd (x) · Cd (x )) = (x · x )d .

(64)

For instance, if n of the j_i in (54) are equal, and the remaining ones are different, then the coefficient in the corresponding component of Φ_d is √((d − n + 1)!) (for the general case, cf. [24]). For Φ_2, this simply means that [29]
\[
\Phi_2(x) = ([x]_1^2, [x]_2^2, \sqrt{2}\,[x]_1[x]_2). \tag{65}
\]
If x represents an image with the entries being pixel values, we can use the kernel (x · x')^d to work in the space spanned by products of any d pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern Φ_d(x). Using kernels of the form (61), we take into account higher-order statistics without the combinatorial explosion (cf. (57)) of time and memory complexity which goes along already with moderately high N and d. To conclude this section, note that it is possible to modify (61) such that it maps into the space of all monomials up to degree d, defining [29]
\[
k(x, x') = ((x\cdot x')+1)^d. \tag{66}
\]
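The identity (61) can be checked directly in a few lines (our own illustration; the dimension and degrees below are arbitrary): enumerating all ordered d-th degree products of the entries reproduces exactly the value of the polynomial kernel.

import numpy as np
from itertools import product

def C_d(x, d):
    # all ordered d-th degree products of the entries of x, as in the map C_d
    return np.array([np.prod([x[j] for j in js]) for js in product(range(len(x)), repeat=d)])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=4), rng.normal(size=4)
for d in (2, 3):
    lhs = C_d(x, d) @ C_d(xp, d)
    rhs = (x @ xp) ** d
    print(d, np.isclose(lhs, rhs))    # True: (C_d(x) . C_d(x')) = (x . x')^d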

7 Representing Similarities in Linear Spaces

In what follows, we will look at things the other way round, and start with the kernel. Given some kernel function, can we construct a feature space such that the kernel computes the dot product in that feature space? This question has been brought to the attention of the machine learning community by [1], [8], and [29]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the functional analytic theory of kernels is [5]; indeed, a large part of the material in the present section is based on that work. There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data. The results in the current section, in contrast,


hold for data drawn from domains which need no additional structure other than them being nonempty sets X. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available [17, 12, 33]. We start with some basic definitions and results. Definition 1 (Gram matrix). Given a kernel k and patterns x1 , . . . , xm ∈ X, the m × m matrix K := (k(xi , xj ))ij

(67)

is called the Gram matrix (or kernel matrix) of k with respect to x1 , . . . , xm . Definition 2 (Positive definite matrix). An m × m matrix Kij satisfying  ci c¯j Kij ≥ 0 (68) i,j

for all ci ∈ C is called positive definite. Definition 3 ((Positive definite) kernel). Let X be a nonempty set. A function k : X × X → C which for all m ∈ N, xi ∈ X gives rise to a positive definite Gram matrix is called a positive definite kernel. Often, we shall refer to it simply as a kernel. The term kernel stems from the first use of this type of function in the study of integral operators. A function k which gives rise to an operator T via k(x, x )f (x ) dx (69) (T f )(x) = X

is called the kernel of T . One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is usually used to denote the case where equality in (68) only occurs if c1 = · · · = cm = 0. Simply using the term positive kernel, on the other hand, could be confused with a kernel whose values are positive. In the literature, a number of different terms are used for positive definite kernels, such as reproducing kernel, Mercer kernel, or support vector kernel. The definitions for (positive definite) kernels and positive definite matrices differ only in the fact that in the former case, we are free to choose the points on which the kernel is evaluated. Positive definiteness implies positivity on the diagonal, k(x1 , x1 ) ≥ 0 for all x1 ∈ X,

(70)

(use m = 1 in (68)), and symmetry, i.e., k(xi , xj ) = k(xj , xi ).

(71)
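A small numerical companion to the definitions above (our own addition; the data, the Gaussian kernel and the tanh parameters are arbitrary choices): positive definiteness in the sense of the definition is equivalent to the Gram matrix having no negative eigenvalues, which one can test directly.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

K_gauss = np.exp(-sq)                       # Gaussian RBF Gram matrix
K_tanh = np.tanh(2.0 * (X @ X.T) + 1.0)     # sigmoid (tanh) "kernel" Gram matrix

print(np.linalg.eigvalsh(K_gauss).min())    # >= 0 up to round-off: a genuine kernel
print(np.linalg.eigvalsh(K_tanh).min())     # often negative, consistent with the sigmoid
                                            # function not being a positive definite kernel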


Note that in the complex-valued case, our definition of symmetry includes complex conjugation, depicted by the bar. The definition of symmetry of matrices is ¯ ji . analogous, i.e.. Kij = K Obviously, real-valued kernels, which are what we will mainly be concerned with, are contained in the above definition as a special case, since we did not require that the kernel take values in C\R. However, it is not sufficient to require that (68) hold for real coefficients ci . If we want to get away with real coefficients only, we additionally have to require that the kernel be symmetric, k(xi , xj ) = k(xj , xi ).

(72)

Kernels can be regarded as generalized dot products. Indeed, any dot product can be shown to be a kernel; however, linearity does not carry over from dot products to general kernels. Another property of dot products, the CauchySchwarz inequality, does have a natural generalization to kernels: Proposition 2. If k is a positive definite kernel, and x1 , x2 ∈ X, then |k(x1 , x2 )|2 ≤ k(x1 , x1 ) · k(x2 , x2 ).

(73)

Proof. For sake of brevity, we give a non-elementary proof using some basic facts of linear algebra. The 2 × 2 Gram matrix with entries Kij = k(xi , xj ) is positive definite. Hence both its eigenvalues are nonnegative, and so is their product, K’s determinant, i.e., ¯ 12 = K11 K22 − |K12 |2 . 0 ≤ K11 K22 − K12 K21 = K11 K22 − K12 K

(74)

Substituting k(xi , xj ) for Kij , we get the desired inequality. We are now in a position to construct the feature space associated with a kernel k. We define a map from X into the space of functions mapping X into C, denoted as CX , via Φ : X → CX x → k(., x).

(75)

Here, Φ(x) = k(., x) denotes the function that assigns the value k(x , x) to x ∈ X. We have thus turned each pattern into a function on the domain X. In a sense, a pattern is now represented by its similarity to all other points in the input domain X. This seems a very rich representation, but it will turn out that the kernel allows the computation of the dot product in that representation. We shall now construct a dot product space containing the images of the input patterns under Φ. To this end, we first need to endow it with the linear structure of a vector space. This is done by forming linear combinations of the form m  f (.) = αi k(., xi ). (76) i=1

Here, m ∈ N, αi ∈ C and xi ∈ X are arbitrary.


Next, we define a dot product between f and another function 

g(.) =

m 

βj k(., xj )

(77)

j=1

(m ∈ N, βj ∈ C and xj ∈ X) as 

f, g :=

m  m 

α ¯ i βj k(xi , xj ).

(78)

i=1 j=1

To see that this is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that 

m 

f, g =

βj f (xj ),

(79)

j=1

using k(xj , xi ) = k(xi , xj ). The latter, however, does not depend on the particular expansion of f . Similarly, for g, note that

f, g =

m 

α ¯ i g(xi ).

(80)

i=1

The last two equations also show that ·, · is bilinear. It is symmetric, as f, g =

g, f . Moreover, it is positive definite, since positive definiteness of k implies that for any function f , written as (76), we have

f, f =

m 

αi αj k(xi , xj ) ≥ 0.

(81)

i,j=1

The latter implies that ·, · is actually itself a positive definite kernel, defined on our space of functions. To see this, note that given functions f1 , . . . , fn , and coefficients γ1 , . . . , γn ∈ R, we have   n n n    (82) γi γj fi , fj = γi fi , γj fj ≥ 0. i,j=1

i=1

j=1

Here, the left hand equality follows from the bilinearity of ⟨·, ·⟩, and the right hand inequality from (81).
For the last step in proving that it even is a dot product, we will use the following interesting property of Φ, which follows directly from the definition: for all functions (76), we have

⟨k(·, x), f⟩ = f(x),    (83)

that is, k is the representer of evaluation. In particular,


⟨k(·, x), k(·, x')⟩ = k(x, x').    (84)

By virtue of these properties, positive definite kernels k are also called reproducing kernels [3, 5, 32, 17]. By (83) and Proposition 2, we have

|f(x)|^2 = |⟨k(·, x), f⟩|^2 ≤ k(x, x) · ⟨f, f⟩.    (85)

Therefore, ⟨f, f⟩ = 0 directly implies f = 0, which is the last property that was left to prove in order to establish that ⟨·, ·⟩ is a dot product. One can complete the space of functions (76) in the norm corresponding to the dot product, i.e., add the limit points of sequences that are convergent in that norm, and thus obtain a Hilbert space H, usually called a reproducing kernel Hilbert space. (A Hilbert space H is defined as a complete dot product space; completeness means that all sequences in H which are convergent in the norm corresponding to the dot product actually have their limits in H, too.) The case of real-valued kernels is included in the above; in that case, H can be chosen as a real Hilbert space.
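To make the construction concrete, here is a small numerical sketch of our own (not part of the original text; it assumes NumPy, a real-valued Gaussian RBF kernel, and arbitrarily chosen expansion points and coefficients). It evaluates the dot product (78) between two kernel expansions and checks the reproducing property (83).

```python
import numpy as np

def k(x, z, sigma=1.0):
    """Gaussian RBF kernel k(x, z) = exp(-(x - z)^2 / (2 sigma^2)) for scalar inputs."""
    return np.exp(-(x - z) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs = rng.normal(size=5)        # expansion points of f
zs = rng.normal(size=4)        # expansion points of g
alpha = rng.normal(size=5)     # coefficients of f(.) = sum_i alpha_i k(., x_i)
beta = rng.normal(size=4)      # coefficients of g(.) = sum_j beta_j k(., z_j)

def f(t):
    return sum(a * k(t, xi) for a, xi in zip(alpha, xs))

# <f, g> = sum_{i,j} alpha_i beta_j k(x_i, z_j), the real-valued case of (78)
K_cross = k(xs[:, None], zs[None, :])
dot_fg = alpha @ K_cross @ beta

# reproducing property (83): <k(., x), f> = f(x) at an arbitrary point x;
# the expansion of k(., x) has a single coefficient equal to 1
x = 0.3
dot_kf = sum(a * k(xi, x) for a, xi in zip(alpha, xs))
print(np.isclose(dot_kf, f(x)))   # True
print(dot_fg)
```

The value of ⟨f, g⟩ is independent of how f and g are expanded, which is exactly the well-definedness argument made above.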

8 Examples of Kernels

Besides (61), [8] and [29] suggest the usage of Gaussian radial basis function kernels [1]

k(x, x') = exp(−‖x − x'‖^2 / (2σ^2))    (86)

and sigmoid kernels

k(x, x') = tanh(κ(x · x') + Θ).    (87)

While one can show that (87) is not a kernel [26], (86) has become one of the most useful kernels in situations where no further knowledge about the problem at hand is given. Note that all the kernels discussed so far have the convenient property of unitary invariance, i.e., k(x, x') = k(Ux, Ux') if U^⊤ = U^{-1} (if we consider complex numbers, then U* has to be used instead of U^⊤). The radial basis function kernel additionally is translation invariant. Moreover, as it satisfies k(x, x) = 1 for all x ∈ X, each mapped example has unit length, ‖Φ(x)‖ = 1. In addition, as k(x, x') > 0 for all x, x' ∈ X, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (3) equals the cosine of the enclosed angle. Hence

cos(∠(Φ(x), Φ(x'))) = (Φ(x) · Φ(x')) = k(x, x') > 0,    (88)

which amounts to saying that the enclosed angle between any two mapped examples is smaller than π/2. The examples given so far apply to the case of vectorial data. Let us at least give one example where X is not a vector space.

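These geometric remarks are easy to verify numerically. The following sketch (ours, not from the original; it assumes NumPy and a handful of random points in R^3) builds the Gaussian RBF kernel matrix (86) and confirms the unit-length and same-orthant properties.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))          # six points in R^3
sigma = 1.5

# Gaussian RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), cf. (86)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.allclose(np.diag(K), 1.0))             # unit length of mapped examples
print((K > 0).all())                            # all dot products are positive ...
angles = np.arccos(np.clip(K, -1.0, 1.0))
print((angles < np.pi / 2).all())               # ... hence all enclosed angles are below pi/2
print(np.all(np.linalg.eigvalsh(K) > -1e-12))   # K is positive (semi-)definite
```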


Example 1 (Similarity of probabilistic events). If A is a σ-algebra, and P a probability measure on A, then

k(A, B) = P(A ∩ B) − P(A)P(B)    (89)

is a positive definite kernel.
Further examples include kernels for string matching, as proposed by [33] and [12]. There is an analog of the kernel trick for distances rather than dot products, i.e., dissimilarities rather than similarities. This leads to the class of conditionally positive definite kernels, which contain the standard SV kernels as special cases. Interestingly, it turns out that SVMs and kernel PCA can be applied also with this larger class of kernels, due to their being translation invariant in feature space [23].

9 Applications

Having described the basics of SV machines, we now summarize some empirical findings. By the use of kernels, the optimal margin classifier was turned into a classifier which became a serious competitor of high-performance classifiers. Surprisingly, it was noticed that when different kernel functions are used in SV machines, they empirically lead to very similar classification accuracies and SV sets [18]. In this sense, the SV set seems to characterize (or compress) the given task in a manner which up to a certain degree is independent of the type of kernel (i.e., the type of classifier) used. Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SV machines on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs, and the Reduced Set method for speeding up classification. This way, SV machines became competitive with (or, in some cases, superior to) the best available classifiers on both OCR and object recognition tasks [7, 9, 11]. Another initial weakness of SV machines, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem scaled with the number of Support Vectors. This was due to the fact that in (35), the quadratic part contained at least all SVs — the common practice was to extract the SVs by going through the training data in chunks while regularly testing for the possibility that some of the patterns that were initially not identified as SVs turn out to become SVs at a later stage (note that without chunking, the size of the matrix would be m × m, where m is the number of all training examples). What happens if we have a high-noise problem? In this case, many of the slack variables ξi will become nonzero, and all the corresponding examples will become SVs. For this case, a


decomposition algorithm was proposed [14], which is based on the observation that not only can we leave out the non-SV examples (i.e., the xi with αi = 0) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (i.e., αi = C). In fact, one can use chunks which do not even contain all SVs, and maximize over the corresponding sub-problems. SMO [15] explores an extreme case, where the sub-problems are chosen so small that one can solve them analytically. Several public domain SV packages and optimizers are listed on the web page http://www.kernel-machines.org. For more details on the optimization problem, see [23].

10 Conclusion

One of the most appealing features of kernel algorithms is the solid foundation provided by both statistical learning theory and functional analysis. Kernel methods let us interpret (and design) learning algorithms geometrically in feature spaces nonlinearly related to the input space, and combine statistics and geometry in a promising way. Kernels provide an elegant framework for studying three fundamental issues of machine learning:
– Similarity measures — the kernel can be viewed as a (nonlinear) similarity measure, and should ideally incorporate prior knowledge about the problem at hand.
– Data representation — as described above, kernels induce representations of the data in a linear space.
– Function class — due to the representer theorem, the kernel implicitly also determines the function class which is used for learning.

References
1. M. A. Aizerman, É. M. Braverman, and L. I. Rozonoér. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
2. Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
3. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
4. P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods—Support Vector Learning, pages 43–54, Cambridge, MA, 1999. MIT Press.
5. C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.
6. D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.


7. V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks—ICANN'96, pages 251–256, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.
8. B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the Annual Conference on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.
9. C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375–381, Cambridge, MA, 1997. MIT Press.
10. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
11. D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 2002. Accepted for publication. Also: Technical Report JPL-MLTR-00-1, Jet Propulsion Laboratory, Pasadena, CA, 2000.
12. D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.
13. J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London, A 209:415–446, 1909.
14. E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII—Proceedings of the 1997 IEEE Workshop, pages 276–285, New York, 1997. IEEE.
15. J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods—Support Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.
16. T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201–209, 1975.
17. B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997. Doktorarbeit, TU Berlin. Download: http://www.kernel-machines.org.
18. B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining, Menlo Park, 1995. AAAI Press.
19. B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods—Support Vector Learning. MIT Press, Cambridge, MA, 1999.
20. B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 2001.
21. B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
22. B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.
23. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.


24. A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637–649, 1998.
25. A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
26. A. J. Smola, Z. L. Óvári, and R. C. Williamson. Regularization with dot-product kernels. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 308–314. MIT Press, 2001.
27. A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211–231, 1998.
28. V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer, New York, 1982).
29. V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
30. V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
31. V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:774–780, 1963.
32. G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
33. C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press.
34. J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf, and V. Vapnik. Kernel dependency estimation. Technical Report 98, Max Planck Institute for Biological Cybernetics, 2002.
35. R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization bounds for regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 2001.

Bayesian Kernel Methods

Alexander J. Smola¹ and Bernhard Schölkopf²

¹ RSISE, The Australian National University, Canberra 0200, ACT, Australia
² Max Planck Institut für Biologische Kybernetik, 72076 Tübingen, Germany

Abstract. Bayesian methods allow for a simple and intuitive representation of the function spaces used by kernel methods. This chapter describes the basic principles of Gaussian Processes, their implementation and their connection to other kernel-based Bayesian estimation methods, such as the Relevance Vector Machine.

1 Introduction

The Bayesian approach allows for an intuitive incorporation of prior knowledge into the process of estimation. Moreover, it is possible, within the Bayesian framework, to obtain estimates of the confidence and reliability of the estimation process itself. These estimates are typically relatively easy to compute. Surprisingly enough, as we shall see, the Bayesian approach leads to algorithms much akin to those developed within the framework of risk minimization. This allows us to provide new insight into kernel algorithms such as SV classification and regression. In addition, these similarities help us design Bayesian counterparts for risk minimization algorithms (such as Laplacian Processes (Section 5)), or vice versa (Section 6). In other words, we can tap into the knowledge from both worlds and combine it to create better algorithms. We begin in Section 2 with an overview of the basic assumptions underlying Bayesian estimation. We explain the notion of prior distributions, which encode our prior belief concerning the likelihood of obtaining a certain estimate, and the concept of the posterior probability, which quantifies how plausible functions appear after we observe some data. Furthermore we show how inference is performed, and how certain numerical problems that arise can be alleviated by various types of Maximum-a-Posteriori (MAP) estimation. Once the basic tools are introduced, we analyze the specific properties of Bayesian estimators for three different types of prior probabilities: Gaussian Processes (Section 3 describes the theory and Section 4 the implementation), which rely on the assumption that adjacent coefficients are correlated, Laplacian Processes (Section 5), which assume that estimates can be expanded into a sparse linear combination of kernel functions, and therefore favor such hypotheses, and Relevance Vector Machines (Section 6), which assume that the contribution of each kernel function is governed by a normal distribution with its own variance. 

The present article is based on [62].


Readers interested in a quick overview of the principles underlying Bayesian statistics will find the introduction sufficient. We recommend that the reader focus first on Sections 2 and 3. The subsequent sections are ordered in increasing technical difficulty, and decreasing bearing on the core issues of Bayesian estimation with kernels.

2 Bayesics

The central characteristic of Bayesian estimation is that we assume certain prior knowledge or beliefs about the data generating process, and the dependencies we might encounter. It is a natural extension of the maximum likelihood framework (loosely speaking, the prior knowledge behaves exactly in the same way as additional data that we might have observed prior to the actual estimation problem). Unless stated otherwise, we observe an m-sample X := {x_1, . . . , x_m} and Y := {y_1, . . . , y_m}, based on which we will carry out inference, typically on a set of observations X' := {x'_1, . . . , x'_{m'}}. For notational convenience we sometimes use Z := {(x_1, y_1), . . . , (x_m, y_m)} instead of X, Y. We begin with an overview of the fundamental ideas (see also [3, 40, 46, 63, 56, 62] for more details).

2.1 Maximum Likelihood and Bayes Rule

Assume that we are given a set of observations Y which are drawn from a probability distribution pθ(Y). It then follows that for a given value of θ we can assess how likely it is to observe Y. Conversely, given the observations Y, we can try to find a value of θ for which Y was a particularly likely observation. For instance, if we know that Y is composed of several observations of a scalar random variable drawn from a normal distribution with mean µ and variance σ, that is θ = (µ, σ), we could try to infer µ and σ from Y or, vice versa, assess how likely it is that Y came from such a normal distribution. The process of determining θ from data by maximizing pθ(Y) is referred to as maximum likelihood estimation. We obtain

θ_ML(Y) := argmax_θ pθ(Y).    (1)

Note that we deliberately write pθ(Y) instead of the conditional probability distribution p(Y|θ), since the latter implies that θ, too, is a random variable with a certain probability distribution p(θ), an additional assumption we may not want to make. Note that pθ(Y), viewed as a function of θ, is often referred to as the likelihood of Y, given θ. (For the sake of simplicity we gloss over details such as σ-algebras. With some abuse of notation, p(x) will correspond to either a density or a probability measure, depending on the context.)


If, however, such a p(θ) exists, we can use the definition of conditional distributions

p(Y, θ) = p(Y|θ)p(θ) = p(θ|Y)p(Y)    (2)

and obtain Bayes' rule

p(θ|Y) = p(Y|θ)p(θ) / p(Y) ∝ p(Y|θ)p(θ).    (3)

This means that as soon as we assume that θ itself is a random variable with p(θ), we can infer how likely it is to observe a certain value of θ, given Y. In this way, θ is transformed from the parameter of a density to a random variable, which allows us to make statements about the likely value of θ in the presence of data. Typically p(θ|Y) is referred to as the posterior distribution, whereas p(θ) is known as the prior, and p(Y|θ) is called the evidence: with a given prior, based on our prior knowledge about θ, and using the evidence of the observations Y, we can come to a posterior assessment p(θ|Y) of the probability of having observed a certain value of θ. Finally, the mode of p(θ|Y) is used in maximum a posteriori (MAP) estimation to find a good estimate for θ via

θ_MAP(Y) := argmax_θ p(Y|θ)p(θ).    (4)

Note that the use of θ_MAP(Y) is typically preferred to θ_ML(Y), since we can avoid "unreasonable" values of θ by making suitable prior assumptions on θ, such as its range.
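As a toy numerical illustration of the difference between (1) and (4), consider the following sketch (our own, not from the text; it assumes NumPy, a Gaussian likelihood with known variance σ², and a zero-mean Gaussian prior on the location θ, so that both estimators are available in closed form).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0      # (assumed) known noise variance
tau2 = 0.5        # prior variance, p(theta) = N(0, tau2)
theta_true = 2.0
Y = theta_true + np.sqrt(sigma2) * rng.normal(size=10)

# maximum likelihood (1): maximizes p_theta(Y); here it is simply the sample mean
theta_ml = Y.mean()

# MAP (4): maximizes p(Y|theta) p(theta); for a Gaussian likelihood and prior the
# posterior mode shrinks the sample mean towards the prior mean 0
m = len(Y)
theta_map = (m / sigma2) / (m / sigma2 + 1.0 / tau2) * Y.mean()

print(theta_ml, theta_map)   # theta_map lies between 0 and theta_ml
```

The prior acts like extra (pseudo-)observations pulling the estimate towards the values it deems plausible, which is exactly the "additional data" reading of the prior mentioned above.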

2.2 Examples of Maximum Likelihood Estimation

In the following, we will be introducing various models of pθ(y), which will become useful in regression and classification problems at a later stage in this chapter. We begin with the (simplifying) assumption that Y is obtained iid (independently and identically distributed) from pθ(y), that is

pθ(Y) = Π_{i=1}^m pθ(y_i),    (5)

and that furthermore pθ(y) = p(y − θ), where p is a distribution with zero mean. In other words, we begin by studying the "estimation of a location parameter" problem. Depending on the properties of p(y − θ) we will obtain various solutions for θ_ML(Y). For the practical purpose of finding θ_ML(Y) under the assumption (5) it is advantageous to rewrite (1) as

θ_ML(Y) = argmin_θ Σ_{i=1}^m − log p(y_i − θ).    (6)


This reduces the problem of maximizing a joint function of all y_i to minimizing the average of terms, each of which depends only on one of the instances y_i. The term − log p(y − θ), considered as a function of θ, is often referred to as the negative log-likelihood.

Normal Distribution: we assume that we have a normal distribution with fixed variance σ > 0 and zero mean, that is

p(ξ) = (1/(√(2π) σ)) exp(−ξ^2/(2σ^2)),    (7)

and consequently − log p(y − θ) = (1/(2σ^2))(y − θ)^2 + c, where c is a constant independent of y and θ. This means that θ_ML(Y) satisfies

θ_ML(Y) = argmin_θ Σ_{i=1}^m (1/(2σ^2))(y_i − θ)^2.    (8)

Taking derivatives, simple algebra shows that in this case

θ_ML(Y) = (1/m) Σ_{i=1}^m y_i.    (9)

In other words, for the normal distribution θ_ML(Y) is the mean of the random variables: the location parameter is estimated by the sample mean.

Laplacian Distribution: again this distribution has zero mean, yet much longer tails than the normal distribution. It satisfies

p(ξ) = (1/(2λ)) exp(−|ξ|/λ),    (10)

and consequently − log p(y − θ) = (1/λ)|y − θ| + c, where c is a constant independent of y and θ. This means that θ_ML(Y) satisfies

θ_ML(Y) = argmin_θ Σ_{i=1}^m (1/λ)|y_i − θ|.    (11)

Taking derivatives, we can see that for the minimizer θ_ML as many terms must satisfy y_i > θ as there are terms with y_i < θ. Consequently, the median of Y is the solution for θ_ML. Clearly the median is a much more robust way of estimating the expected value of a distribution: if we corrupt Y with some additional data not taken from the true distribution, we will obtain a good estimate nonetheless, regardless of the (possibly large) numerical value of the corrupted additional observations. This is the case since we exclude the extreme values in Y on both ends. (We assume that all distributions we are dealing with are symmetric and have zero mean, hence mean and median coincide.)
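The contrast between the two estimators is easy to check numerically. The sketch below (our own illustration, assuming NumPy and arbitrary toy values) draws a clean sample, corrupts it with a single large outlier, and compares the ML estimates under the normal model (the mean) and under the Laplacian model (the median).

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(loc=1.0, scale=0.5, size=50)      # clean observations around theta = 1
Y_corrupted = np.append(Y, 1000.0)               # one grossly corrupted observation

# ML estimate under the normal model (9): the sample mean
print(Y.mean(), Y_corrupted.mean())              # the mean is dragged far away by the outlier

# ML estimate under the Laplacian model (11): the sample median
print(np.median(Y), np.median(Y_corrupted))      # the median barely moves
```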


Robust Estimation: using only the median of Y for the estimation of θ appears too drastic, especially if we have reason to believe that not all the data is corrupted. Instead, Huber [29] formalized the notion of using a trimmed mean, where one discards a certain fraction of extreme values and takes the mean of the rest. For this purpose the following density was introduced:

p(ξ) ∝ exp(−ξ^2/(2σ))  if |ξ| ≤ σ,   and   p(ξ) ∝ exp(σ/2 − |ξ|)  otherwise.    (12)

One can show that θ_ML(Y) is given by the mean of the fraction of observations that lie within an interval of ±σ around θ_ML(Y), with equal numbers of observations y_i exceeding θ_ML(Y) by more than σ from above and from below.

1 exp(−|ξ|ε ) where |ξ|ε := max(0, |ξ| − ε). 2(1 + ε)

(13)

Estimators of θML (Y ) have similar properties to the ones in the previous paragraph, with the difference that one takes the mean of the two extreme values in the ε-neighborhood of the expectation rather than the mean over the whole set. The advantage of this somewhat peculiar estimator is that optimization problems arising from it have a lower number of active constraints. This was exploited in Support Vector regression [75, 76]. Besides estimating the mean of a real-valued random variable y we may also have to deal with discrete valued ones. For simplicity we consider only binary y, that is y ∈ {±1}. In this case it is most useful to estimate the probability p(y = 1) (and p(y = −1) = 1 − p(y = 1)).3 While we could do this directly, it pays to consider a few indirect strategies for finding such an estimate. The reason is that at a later stage we will use this indirect parameterization to estimate p(y = 1) as a function of some additional parameter x (in which case we will call y the label and x the pattern), a process which is greatly simplified by an indirect parameterization. Typically we will study probabilities parameterized by pθ (y) = p(θy), where p may take on various functional forms. Often in classification, when θ is location dependent, the values θ(x)y are referred to as the margin at location x. Logistic Transfer Function: this is given by pθ (y) := 3

exp(θy) . 1 + exp(θy)

(14)

As in the real-valued case, we develop the reasoning for p(y) rather than p(y|x) and will introduce the conditioning part later in Section 3.

70

A.J. Smola and B. Sch¨ olkopf Normal Distribution

Laplace Distribution

3

3

2.5

2.5

2

2

1.5

1.5

1

1

0.5

0.5

0 −3

−2

−1

0 y

1

2

3

0 −3

−2

Huber’s Robust 3

2.5

2.5

2

2

1.5

1.5

1

1

0.5

0.5 −2

−1

0 y

1

0 y

1

2

3

Vapnik’s Insensitive

3

0 −3

−1

2

3

0 −3

−log p(y) p(y)

−2

−1

0 y

1

2

3

Fig. 1. Densities and corresponding negative log densities. Upper left: Gaussian, upper right: Laplacian, lower left: Huber's robust, lower right: ε-insensitive.
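For reference, the four negative log densities shown in Figure 1 can be written down directly. The following sketch is our own illustration (assuming NumPy); additive constants from the normalization are dropped, since they do not affect the minimizers.

```python
import numpy as np

def neg_log_gauss(xi, sigma=1.0):
    # quadratic loss, cf. (7)
    return xi ** 2 / (2 * sigma ** 2)

def neg_log_laplace(xi, lam=1.0):
    # absolute loss, cf. (10)
    return np.abs(xi) / lam

def neg_log_huber(xi, sigma=1.0):
    # Huber's robust loss, cf. (12): quadratic near zero, linear in the tails
    return np.where(np.abs(xi) <= sigma,
                    xi ** 2 / (2 * sigma),
                    np.abs(xi) - sigma / 2)

def neg_log_eps_insensitive(xi, eps=0.5):
    # Vapnik's epsilon-insensitive loss, cf. (13): flat inside the tube
    return np.maximum(0.0, np.abs(xi) - eps)

xi = np.linspace(-3, 3, 7)
for loss in (neg_log_gauss, neg_log_laplace, neg_log_huber, neg_log_eps_insensitive):
    print(loss.__name__, np.round(loss(xi), 2))
```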

In other words, p(y = 1) = e^θ/(1 + e^θ) and p(y = −1) = e^{−θ}/(1 + e^{−θ}). Note that logistic regression with (14) is equivalent to obtaining θ via

θ = ln [p(y = 1) / p(y = −1)].    (15)

Quite often the logarithm of the ratio between the two class probabilities is also referred to as the log odds ratio.

Probit: we might also assume that y is given by the sign of θ, but corrupted by Gaussian noise (see for instance [49, 50, 63]); thus, y = sgn(θ + ξ) where ξ ∼ N(0, 1). In this case, we have

pθ(y) = ∫ (sgn(θ + ξ) + 1)/2 · p(ξ) dξ    (16)
      = (1/√(2π)) ∫_{−θ}^{∞} exp(−ξ^2/2) dξ = Φ(θ).    (17)


Fig. 2. Conditional probabilities and corresponding negative log-probability. Left: logistic regression; Right: Probit.

Here Φ is the distribution function of the normal distribution. Note the correspondence between the asymptotic behaviors of the logistic and the probit model on the one hand, and the linear and quadratic soft margin loss functions (1 − θy)_+ and (1 − θy)_+^2 on the other. This also explains why the linear soft margin loss is a good proxy for logistic regression and the quadratic soft margin for the probit model. Furthermore, it indicates that, should one use the linear soft margin as a cheap proxy for optimization purposes, the logistic is the more adequate model for subsequently fitting the densities [51], whereas for the quadratic soft margin the probit model is to be preferred.

2.3 Inference

Besides the problem of finding suitable estimates of θ for pθ(Y), which can be used to understand the way the observations Y were generated from some underlying distribution, we may also simply want to infer new values Y', given some observations Y. For this purpose we need to compute p(Y'|Y). The factorization of conditional probabilities leads to

p(Y'|Y) = p(Y', Y) / p(Y) ∝ p(Y', Y).    (18)

This means that as soon as we can compute the joint probability distribution function of Y, Y' we are able to predict Y', given Y. The normalization term p(Y) is not always needed; for instance, the mode of p(Y'|Y) is independent of the scale.

Example 1 (Inference with Normal Distributions). Assume that (Y, Y') is jointly normal with covariance matrix and mean

Σ = ( Σ_YY  Σ_YY' ; Σ_Y'Y  Σ_Y'Y' )   and   µ = ( µ_Y ; µ_Y' ),


then p(Y'|Y) is a normal distribution with covariance Σ_cond = Σ_Y'Y' − Σ_Y'Y Σ_YY^{-1} Σ_YY' and mean µ_cond = µ_Y' + Σ_Y'Y Σ_YY^{-1} (Y − µ_Y).
This can be seen as follows: since p(Y, Y') is jointly normal, the conditional distribution p(Y'|Y) is also a normal distribution, which can be obtained from p(Y, Y') by collecting all the terms which depend on Y'. We know that p(Y, Y') is given by

(2π)^{-(m+m')/2} (det Σ)^{-1/2} exp( -(1/2) (Y − µ_Y, Y' − µ_Y')^⊤ Σ^{-1} (Y − µ_Y, Y' − µ_Y') ).

Writing out the inverse of Σ (see e.g., [39]) and collecting terms yields the above result.
In the next section we will use the above example to perform Gaussian Process prediction. For the moment just note that once we know p(Y, Y') it is very easy to estimate Y', given Y, since p(Y'|Y) ∝ p(Y, Y').
Quite often, unfortunately, we will not have p(Y, Y') at our disposal immediately. Instead, we may only have p(Y, Y'|θ) together with p(θ). For instance, in all settings described in Section 2.2 we assumed that the distribution of the observations is known only up to its expected value. This means that in order to obtain p(Y, Y') we need to integrate out the latent variable θ. This is achieved as follows:

p(Y, Y') = ∫ p(Y, Y', θ) dθ = ∫ p(Y, Y'|θ) p(θ) dθ.    (19)

Eq. (19) may or may not be computable in closed form. Hence there exist various strategies to deal with the problem of obtaining p(Y, Y'). We list some of them below.

Exact Solution: if we can solve for p(Y, Y') explicitly, we can proceed as before with our estimation procedure after solving the integral. An important special case is where (Y − θ, Y' − θ') ∼ N(µ, Σ) and furthermore (θ, θ') ∼ N(0, Λ), where θ, Y ∈ R^m, θ', Y' ∈ R^{m'}, and Λ ∈ R^{(m+m')×(m+m')}. Such a situation may occur where Y (and Y') are composed of two random variables, each of them normally distributed with their own mean and covariance. Typically θ assumes the role of the additive noise and Λ is a multiple of the unit matrix. This, however, need not be the case: we could deal with colored noise as well. By construction, Y, too, is normally distributed, where we simply add up the means and covariances of the two constituents to obtain (Y, Y') ∼ N(µ, Σ + Λ). Hence, without much effort, we have computed the integral (19) by remarking that for normal distributions means and covariances add up. Note that inference in this situation proceeds identically to the discussion of Example 1. We will use this setting later in this chapter under the name 'Gaussian Process Regression with Normal Noise'.


Sampling Methods: we can approximate the integral in (19) by randomly drawing (Y', θ) from p(Y, Y', θ) and thereby performing inference about the distribution of Y'. Various methods to carry out such sampling exist. Typically one uses Markov Chain Monte Carlo (MCMC) methods; see [47] for details and further references. The advantage of such methods is that, given sufficient computational resources, we are able to obtain a very good estimate of the distribution of Y'. Furthermore, the quality of the estimate keeps increasing as we wait for more samples. The obvious downside is that we will not obtain a closed form analytic expression for a density. Furthermore, sampling methods can be computationally quite expensive, especially if we wish to predict for a large amount of data.

Variational Methods: the reason why we had to resort to approximating (19) was that the integrand did not lend itself to a closed form solution of the integral. However, it may be that for a modified version of p(Y, Y'|θ) we might be able to perform the integral. One option is to use a normal distribution N(µ, Σ), where the mean µ coincides with the mode of p(Y, Y'|θ)p(θ) with respect to θ, and to use the second derivative of − ln p(Y, Y'|θ)p(θ) at θ = µ for the variance. This is often referred to as the Laplace Approximation. We set (see for instance [40])

θ|(Y, Y') ∼ N(E[θ|(Y, Y')], Σ^{-1}),  where  Σ = −∂_θ^2 [ln p(θ|(Y, Y'))]|_µ.    (20)

The advantage of such a procedure is that the integrals remain tractable. This is also one of the reasons why normal distributions enjoy a high degree of popularity in Bayesian methods. Besides, the normal distribution is the least informative distribution (largest entropy) among all distributions with bounded variance [7].
As Figure 3 indicates, a single Gaussian may not always be sufficient to capture the important properties of p(Y, Y'|θ)p(θ). A more elaborate parametric model q_φ(θ) of p(θ|Y, Y'), such as a mixture of Gaussian densities, can then be used to improve the approximation of (19). A common strategy is to resort to variational methods. The details are rather technical and go beyond the scope of this section; the interested reader is referred to [36] for an overview, and to [4] for an application to the Relevance Vector Machine of Section 6. The following theorem describes the basic idea.

Theorem 1 (Variational Approximation of Densities). Denote by θ, Y random variables with corresponding densities p(θ, Y), p(θ|Y), and p(θ). Then for any density q(θ), the following bound holds:

ln p(Y) = ∫ ln [p(θ, Y)/q(θ)] q(θ) dθ − ∫ ln [p(θ|Y)/q(θ)] q(θ) dθ ≥ ∫ ln [p(θ, Y)/q(θ)] q(θ) dθ.    (21)


Fig. 3. Left: The mode and mean of the distribution coincide, hence the MAP approximation is satisfactory. Right: For multi-modal distributions, the MAP approximation can be arbitrarily bad.

Proof. We begin with the first equality of (21). Since p(θ, Y) = p(θ|Y)p(Y), we may decompose

ln [p(θ, Y)/q(θ)] = ln p(Y) + ln [p(θ|Y)/q(θ)].    (22)

Additionally, −∫ ln [p(θ|Y)/q(θ)] q(θ) dθ = KL(q(θ)‖p(θ|Y)) is the Kullback-Leibler divergence between q(θ) and p(θ|Y) [7]. The latter is a nonnegative quantity, which proves the second part of (21).
Typically, p(θ|Y) is the true posterior distribution and q(θ) an approximation to it. The practical advantage of (21) is that

L := ∫ ln [p(θ, Y)/q(θ)] q(θ) dθ

can often be computed more easily, at least for simple enough q(θ). Furthermore, by maximizing L via a suitable choice of q, we maximize a lower bound on ln p(Y).

Expectation Maximization: another method for approximating p(Y, Y') was suggested in [10], namely to maximize the integrand in (19) jointly over the unknown variable Y' and the latent variable θ. While this is clearly not equivalent to solving the integral, there are many cases where one may hope that the maximum of the integrand will not differ too much from the location of the mean of p(Y'). Furthermore, we gain an interpretation of a plausible value of θ in conjunction with Y. This could be the noise level of the estimation process, certain parameters of the function class such as the degree of smoothness, etc. Several options are at our disposal when it comes to maximizing p(Y, Y'|θ)p(θ) with respect to θ and Y'.


Firstly, we could use a function maximization procedure directly on the joint distribution. Instead, [10] propose the so-called Expectation Maximization (EM) algorithm, which leads to a local maximum in a very intuitive fashion. Ignoring proof details, we proceed as follows. Define

Q(θ) := E_{Y'}[log p(Y'|Y, θ)] + log p(θ),    (23)

where the expectation is taken over p(Y'|Y, θ̂) for a previously chosen value θ̂. Computing Q is commonly referred to as the Expectation step. Next we find the maximum of Q and replace θ̂ by it, that is,

θ̂ ← argmax_θ Q(θ).    (24)

This is commonly known as the Maximization step. There exist many special cases where these calculations can be carried out in closed form, most notably distributions derived from the exponential family. [10] show that iteration of this procedure will converge to a local maximum of p(Y, Y', θ). However, it is not clear how good this local maximum will be; after all, it depends on the initial guess of θ̂.
An alternative is to modify the definition of Q(θ) such that θ is used both for the conditional expectation and as an argument of the expectation itself (in other words, we merge θ̂ and θ into one variable). In this way we are still guaranteed to find a joint local maximum of Y' and θ through maximizing a function of θ only, yet the reduced number of parameters can be beneficial for a general purpose function optimizer. We will come back to this observation in Section 3.
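To make the E/M alternation concrete, here is a self-contained toy sketch of our own (assuming NumPy). It treats a problem where both steps are available in closed form, a two-component Gaussian mixture with unknown means and mixing weight; it is not the latent-variable model of this chapter, only an illustration of alternating between expectations under the current parameters and maximization of Q.

```python
import numpy as np

rng = np.random.default_rng(4)
# data drawn from the mixture 0.3 * N(-2, 1) + 0.7 * N(3, 1)
z = rng.random(500) < 0.3
Y = np.where(z, rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500))

mu = np.array([-1.0, 1.0])    # initial guesses for the component means
pi = 0.5                      # initial mixing weight of component 0
for _ in range(50):
    # E-step: posterior responsibilities of component 0 under the current parameters
    p0 = pi * np.exp(-(Y - mu[0]) ** 2 / 2)
    p1 = (1 - pi) * np.exp(-(Y - mu[1]) ** 2 / 2)
    r = p0 / (p0 + p1)
    # M-step: maximize the expected complete-data log-likelihood
    mu = np.array([(r * Y).sum() / r.sum(), ((1 - r) * Y).sum() / (1 - r).sum()])
    pi = r.mean()

print(mu, pi)    # close to (-2, 3) and 0.3
```

As noted above, the result is only a local maximum; starting from different initial guesses for mu can lead to different fixed points.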

2.4 Likelihood, Priors, and Hyperpriors

Recall (19), that is, p(Y, Y') = ∫ p(Y, Y'|θ)p(θ) dθ. This means that we weigh the values p(Y, Y'|θ), which tell us how well (Y, Y') fits our model parameterized by θ, by p(θ), according to our prior knowledge of the possible values of θ. For instance, if p(Y, Y'|θ) describes a jointly normal distribution of iid random variables with mean µ and variance σ, that is θ := (µ, σ), then p(θ) models our prior knowledge about the occurrence of a particular (µ, σ) pair. For instance, we might know that the variance never exceeds a certain value and that the mean is always positive. Quite often, however, we may not even be sure about the specific form of p(θ) either, in which case we assume a distribution over possible priors, that is

p(θ) = ∫ p(θ|ω) p(ω) dω.    (25)

Here ω is a so-called hyperprior, describing the uncertainty in θ. In summary, we have the following dependency model for (Y, Y'):

ω ∼ p(ω)  →  θ ∼ p(θ|ω)  →  (Y, Y') ∼ p(Y, Y'|θ),    (26)

so that the joint density factorizes as p(Y, Y'|θ) p(θ|ω) p(ω).


Hence, in order to obtain p(Y, Y') we need to solve the integral

∫∫ p(Y, Y', ω, θ) dθ dω = ∫∫ p(Y, Y'|θ) p(θ|ω) p(ω) dθ dω,    (27)

and again, as in the previous section, we may resort to various methods of approximating (27). By far the most popular method is to maximize the integrand and to obtain what is commonly referred to as a MAP2 approximation:

ω_MAP(Y, θ) := argmax_ω p(Y|θ) p(θ|ω) p(ω).    (28)

If possible, one integrates out θ by solving the inner of the two integrals in (27), to obtain

ω_MAP(Y) := argmax_ω p(Y|ω) p(ω) = argmax_ω p(ω) ∫ p(Y|θ) p(θ|ω) dθ.    (29)

In other words, θ assumes the role of the new prior, where ω is integrated out into p(Y, Y'|θ). Such an approach will be used in Section 6 to obtain numerically attractive optimization methods for estimation.

3 Gaussian Processes

Gaussian Processes are based on the "prior" assumption that adjacent observations should convey information about each other. In particular, it is assumed that the observed variables are normal, and that the coupling between them takes place by means of the covariance matrix of a normal distribution. It turns out that this is a convenient way of extending Bayesian modeling of linear estimators to nonlinear situations (cf. [82, 80, 63]). Furthermore, it represents the counterpart of the "kernel trick" in methods minimizing the regularized risk. We present the basic ideas, and relegate details on efficient implementation of the optimization procedure required for inference to Section 4.
So far we only assumed that we have observations Y and a density p(Y), possibly p(Y|θ), that was given to us independently of any other data. However, if we wish to perform regression or classification, the observations Y will typically depend on some patterns X. In other words, data comes in (x_i, y_i) pairs, and given some novel patterns x'_i we will wish to estimate y'_i at those locations. Hence, we will introduce a conditioning on X and X' in our reasoning. Note that this does not affect any of the derivations above.

3.1 Correlated Observations

If we are observing yi at locations xi , it is only natural to assume that these values are correlated, depending on their location xi . Indeed, if this were not the case, we would not be able to perform inference, since by definition, independent random variables yi do not depend on other observations yj .


In fact, we make a rather strong assumption regarding the distribution of the y_i, namely that they form a normal distribution with mean µ and covariance matrix K. (We now use K to denote the covariance matrix, for consistency with the literature on Reproducing Kernel Hilbert Spaces and Support Vector Machines, where K denotes the kernel matrix; we will see that K plays the same role in Gaussian Processes as the kernel matrix plays in the other settings.) We could of course assume any arbitrary distribution; most other settings, however, result in inference problems that are rather expensive to compute. Furthermore, as Theorem 5 will show, there exists a large class of assumptions on the distribution of y_i that have a normal distribution as their limit.
We begin with two observations, y_1 and y_2, for which we assume zero mean µ = (0, 0) and covariance matrix

K = ( 1  3/4 ; 3/4  3/4 ).

Figure 4 shows the corresponding density of the random variables y_1 and y_2. Now assume that we observe y_1. This gives us further information about y_2, which allows us to state the conditional density

p(y_2|y_1) = p(y_1, y_2) / p(y_1).    (30)

(A convenient trick to obtain p(y_2|y_1) for normal distributions is to consider p(y_1, y_2) as a function of y_2 only, while keeping y_1 fixed at its observed value; the linear and quadratic terms then completely determine the normal distribution in y_2.)

Once the conditional density is known, the mean of y_2 need no longer be 0, and the variance of y_2 is decreased. In the example above, the latter becomes 3/16 instead of 3/4: we have performed inference from the observation y_1 to obtain possible values of y_2. In a similar fashion, we may infer the distribution of y_i based on more than two variables, provided we know the corresponding mean µ and covariance matrix K. This means that K determines how closely the prediction relates to the previous observations y_i. In the following section, we formalize the concepts presented here and show how such matrices K can be generated efficiently.
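The numbers quoted above are easy to verify. The sketch below is our own (assuming NumPy): for K = (1, 3/4; 3/4, 3/4) the conditional distribution of y_2 given y_1 has mean (3/4)·y_1 and variance 3/16, which a Monte Carlo check confirms.

```python
import numpy as np

K = np.array([[1.0, 0.75],
              [0.75, 0.75]])

y1 = 1.0   # the observed value
var_cond = K[1, 1] - K[1, 0] / K[0, 0] * K[0, 1]
mean_cond = K[1, 0] / K[0, 0] * y1
print(var_cond)    # 0.1875 = 3/16
print(mean_cond)   # 0.75 * y1

# Monte Carlo check: keep only samples whose first coordinate is close to y1
samples = np.random.default_rng(5).multivariate_normal([0, 0], K, size=200000)
near = samples[np.abs(samples[:, 0] - y1) < 0.02]
print(near[:, 1].mean(), near[:, 1].var())   # approximately 0.75 and 0.1875
```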

3.2 Definitions and Basic Notions

Assume we are given a distribution over observations y_i at locations x_1, . . . , x_m. Rather than directly specifying that the observations y_i are generated from an underlying functional dependency, we simply assume that they are generated by a Gaussian Process. Loosely speaking, Gaussian processes allow us to extend the notion of a set of random variables to random functions. More formally, we have the following definition:

Definition 1 (Gaussian Process). Denote by y(x) a stochastic process parameterized by x ∈ X (X is an arbitrary index set). Then y(x) is a Gaussian process if for any m ∈ N and {x_1, . . . , x_m} ⊂ X, the random variables (y(x_1), . . . , y(x_m)) are normally distributed.


Fig. 4. Normal distribution with two variables. Top left: normal density p(y_1, y_2) with zero mean and covariance K; Top right: contour plot of p(y_1, y_2); Bottom left: conditional density of y_2 when y_1 = 1; Bottom right: conditional density of y_2 when y_1 = −2. Note that in the last two plots, y_2 is normally distributed, but with nonzero mean.

We denote by k(x, x') the function generating the covariance matrix

K := cov{y(x_1), . . . , y(x_m)},    (31)

and by µ the mean of the distribution. We also write K_ij = k(x_i, x_j). This leads to

(y(x_1), . . . , y(x_m)) ∼ N(µ, K),  where µ ∈ R^m.    (32)

Remark 1 (Gaussian Processes and Positive Definite Matrices). The function k(x, x') is well defined and symmetric, and the matrix K is positive definite, that is, none of its eigenvalues is negative.

Proof. We first show that k(x, x') is well defined. By definition,

[cov{y(x_1), . . . , y(x_m)}]_ij = cov{y(x_i), y(x_j)}.    (33)


Consequently, K_ij is only a function of two arguments (x_i and x_j), which shows that k(x, x') is well defined. It follows directly from the definition of the covariance that k is symmetric. Finally, to show that K is positive definite, we have to prove for any α ∈ R^m that the inequality α^⊤Kα ≥ 0 holds. This follows from

0 ≤ Var[ Σ_{i=1}^m α_i y(x_i) ] = α^⊤ [cov{y(x_i), y(x_j)}] α = α^⊤ K α.    (34)

Thus K is positive definite and the function k is an admissible kernel.
Note that even if k happens to be a smooth function (this turns out to be a reasonable assumption), the actual realizations y(x), as drawn from the Gaussian process, need not be smooth at all. In fact, they may even be pointwise discontinuous.
Let us have a closer look at the prior distribution resulting from these assumptions. The standard setting is µ = 0, which implies that we have no prior knowledge about the particular value of the estimate, but assume that small values are preferred. Then, for a given set of (y(x_1), . . . , y(x_m)) =: Y, the prior density function p(Y|X) is given by

p(Y|X) = (2π)^{-m/2} (det K)^{-1/2} exp(-(1/2) Y^⊤ K^{-1} Y).    (35)

In most cases, we try to avoid inverting K. By the simple substitution

Y := Kα,    (36)

we have α ∼ N(0, K^{-1}), and consequently

p(α|X) = (2π)^{-m/2} (det K)^{-1/2} exp(-(1/2) α^⊤ K α).    (37)

Taking logs, we see that this term is identical to the penalty term arising from the regularized risk framework (cf. the chapter on Support Vectors and [62, 77, 75, 26]). This result thus connects Gaussian process priors and estimators using the Reproducing Kernel Hilbert Space framework: kernels favoring smooth functions translate immediately into covariance kernels with similar properties in a Bayesian context.
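The correspondence can be illustrated directly. The following sketch is ours (assuming NumPy and a Gaussian covariance kernel on a grid): it draws functions from the prior N(0, K) and evaluates the prior penalty of one sample, which up to constants equals (1/2) Y^⊤K^{-1}Y = (1/2) α^⊤Kα for Y = Kα.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(-5, 5, 100)
omega = 1.0
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * omega ** 2))
K += 1e-6 * np.eye(len(x))            # small jitter for numerical stability

# draw three functions from the Gaussian Process prior N(0, K)
L = np.linalg.cholesky(K)
samples = L @ rng.normal(size=(len(x), 3))

# the (unnormalized) negative log-prior of the first sample:
# 0.5 * Y^T K^{-1} Y, equivalently 0.5 * alpha^T K alpha with alpha = K^{-1} Y
Y = samples[:, 0]
alpha = np.linalg.solve(K, Y)
print(0.5 * Y @ alpha, 0.5 * alpha @ K @ alpha)   # identical penalty values
```

Smoothly varying samples incur small penalties, rapidly oscillating ones large penalties, which is the Bayesian reading of the regularizer.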

3.3 Simple Hypotheses

Let us analyze in more detail which functions are considered simple by a Gaussian process prior. As we know, hypotheses of low complexity correspond to vectors Y for which Y^⊤K^{-1}Y is small. This is in particular the case for the (normalized) eigenvectors v_i of K with large eigenvalues λ_i, since Kv_i = λ_i v_i yields

v_i^⊤ K^{-1} v_i = λ_i^{-1}.    (38)


Fig. 5. Hypotheses corresponding to the first eigenvectors of a Gaussian kernel of width 1 over a uniform distribution on the interval [−5, 5]. From top to bottom and from left to right: The functions corresponding to the first eight eigenvectors of K. Lower right: the first 20 eigenvalues of K. Note that most of the information about K is contained in the first 10 eigenvalues. The plots were obtained by computing K for an equidistant grid of 200 points on [−5, 5]. We then computed the eigenvectors e of K, and plotted them as the corresponding function values (this is possible since for α = e we have Kα = λα).
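The computation described in the caption can be reproduced in a few lines. The sketch below is ours (assuming NumPy): it builds K on the equidistant grid, takes its eigendecomposition, and inspects how quickly the eigenvalues decay.

```python
import numpy as np

x = np.linspace(-5, 5, 200)                       # equidistant grid, as in Figure 5
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)   # Gaussian kernel of width 1, cf. (39)

eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort by decreasing eigenvalue

print(np.round(eigvals[:10] / eigvals[0], 4))     # rapid decay: only a few directions matter
# eigvecs[:, 0], eigvecs[:, 1], ... are the slowly oscillating "simple" hypotheses
# of Figure 5; plotting them against x reproduces the first eight panels.
```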

In other words, the estimator is biased towards solutions with small λ_i^{-1}. This means that the spectrum and eigensystem of K represent a practical means of actually viewing the effect a certain prior has on the degree of smoothness of the estimates. Let us consider a practical example: for a Gaussian covariance kernel

k(x, x') = exp(−‖x − x'‖^2 / (2ω^2)),    (39)

where ω = 1, and under the assumption of a uniform distribution on [−5, 5], we obtain the functions depicted in Figure 5 as simple base hypotheses for our estimator. Note the similarity to a Fourier decomposition: this means that the kernel has a strong preference for slowly oscillating functions.
The Gaussian Process setting also allows for a simple connection to parametric models of uncertainty (see also [80]). For instance, assume that the observations y are derived from the patterns x via the functional dependency

y(x) = Σ_{i=1}^n β_i f_i(x),  where β ∼ N(0, Σ).    (40)

(41)

In other words, starting from a parametric model, where we would want to estimate the coefficients β we arrived at a Gaussian Process with covariance function f (x) Σf (x ). One special case is of interest: set fi (x) = (x)i , that is, fi (x) encodes the i-th coordinate of x, and Σ = 1. Here k(x, x ) = x, x . This is the simplest Gaussian Process kernel possible. Other kernels include k(x, x ) = exp (−ωx − x ) (Laplacian kernel) k(x, x ) = ( x, x + c)p with c > 0 (Polynomial kernel)

(42) (43)

and the Gaussian RBF kernel of (39). For further details on the choice of kernels see [62, 25, 78, 31, 52, 77] and the references therein. 3.4

Regression

Let us put the previous discussion to practical use. For the sake of simplicity, we begin with regression (we study the classification setting in the next section). The natural assumption is that the observations Y are generated by the Gaussian Process with covariance matrix K and mean µ. Following the reasoning of Example 1 we can infer novel yi , given xi via Y  ∼ N(Σ  , µ ) where Kcond = KY  Y  − KY  Y KY−1Y ΣY Y  µcond = µY  + KY  Y KY−1Y (Y − µY )

(44) (45)

This means that the variance is reduced from KY  Y  to a degree controlled by both the correlation between Y and Y  (via KY  Y ), and the inherent degree of variability in Y (via KY−1Y ). The more certain we can be about Y , the more this certainty carries over to Y  . Likewise, the default estimate for the mean of Y  , namely µY  is corrected by KY  Y KY−1Y (Y − µY ), that is, the deviation of Y from its default estimate µY , weighted by the correlation between the random variables Y and Y  . Furthermore note that for the purpose of inferring the mean we only need to store α := KY−1Y (Y − µY ) and hence (Y − µY ) = KY Y α. This is similar to (36), where we used α to establish a connection between the regularized risk functional and prior probabilities. Let us get back to the case where k(x, x ) = x, x , that is, the linear covariance kernel. Here we can rewrite (44) as 



Kcond = X  X  − (X  X  )(XX  )−1 (XX  ) = X  (1 − PX )X 



(46)

82

A.J. Smola and B. Sch¨ olkopf

where PX is the projection on the space spanned by X. This means that the larger X grows, the more reliably we will be able to infer Y  for a given X  . In the case where X spans the entire space (if x ∈ Rn it may suffice if |X| = n), PX is the identity and consequently Kcond = 0. In other words, we can perform inference with certainty. The avid reader will notice that prediction with certainty can pose a problem if our model is not quite exact or if the data itself is fraught with measurement errors or noise. For instance, if we were to observe some Y which do not lie in the n-dimensional subspace spanned by X, we would have to conclude that such Y must never occur. In other words, we have a modeling problem. One way to address this issue is to replace the (apparently) unsuitable kernel k(x, x ) =

x, x by one which guarantees that K has always full rank. This, however, is rather cumbersome, since the question whether K is nondegenerate depends on both the data and on the covariance function k(x, x ). A more elegant way to ensure that our statistical model is better suited to the task is to introduce an extended dependency model using latent variables as follows: p(X)

/X

p(θ|X) p(θ|X)p(X)



p(Y |θ) p(Y |θ)p(θ|X)p(X)

/Y

This means that the random variables X and Y are conditionally independent, given θ. If we wish to infer Y' from X, X', and Y, this means that we need to integrate out the latent variable θ, in the same fashion as in (26) and as discussed in Section 2.3.
In regression, a popular setting for p(Y|θ) is to assume that we have additive noise, i.e., Y = θ + ξ, where the ξ_i are drawn from some distribution (Gaussian, Laplacian, Huber's robust, etc.), such as the ones given in Figure 1. And again, as discussed in Section 2.3, some of the integrals arising from the elimination of θ may be easily solvable, or we may need to resort to further approximations.

Additive Normal Noise: we begin with the simple case, where ξ_i ∼ N(0, σ^2). In this situation Y ∼ N(µ, σ^2 1 + K), since Y is the sum of two independent normal random variables. Consequently we can re-use (44) and (45), the only difference being that now instead of k(x, x') we use k(x, x') + σ^2 δ_{x,x'} as the covariance function. Figure 6 gives an example of regression with additive normal noise. Specializing (44) and (45) to the estimation of y(x) at only one new location, and assuming µ = 0, we obtain (using k(x) := (k(x_1, x), . . . , k(x_m, x)))

K_cond = σ^2 + k(x, x) − k(x)^⊤ (K_YY + σ^2 1)^{-1} k(x)    (47)
µ_cond = k(x)^⊤ (K_YY + σ^2 1)^{-1} Y = k(x)^⊤ α,  where α := (K_YY + σ^2 1)^{-1} Y.    (48)

Note that the mean is a linear combination of kernel functions k(x_i, x). Moreover, if we were to estimate y(x) at one of the training locations, say y(x_i), we would obtain

k(x_i)^⊤ (K_YY + σ^2 1)^{-1} Y = [K_YY (K_YY + σ^2 1)^{-1} Y]_i.    (49)


Fig. 6. Gaussian Process regression with additive normal noise. The stars denote the observations, dotted lines correspond to the σ-confidence intervals with the prediction in between, and the solid line crossing the boundaries is the sine function, which, with additive normal noise, was used to generate the observations. For convenience we also plot the width of the confidence interval.
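Equations (47) and (48) give the complete predictor. The sketch below is our own illustration (assuming NumPy, an RBF kernel, and arbitrarily chosen noise level and kernel width); it fits noisy samples of a sine function, in the spirit of Figure 6.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma2, omega = 0.05, 0.3

def kernel(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * omega ** 2))

# training data: noisy observations of a sine, as in Figure 6
X = rng.random(20)
Y = np.sin(2 * np.pi * X) + np.sqrt(sigma2) * rng.normal(size=20)

Kyy = kernel(X, X)
A = Kyy + sigma2 * np.eye(len(X))
alpha = np.linalg.solve(A, Y)            # alpha = (K_YY + sigma^2 1)^{-1} Y, cf. (48)

x_test = np.linspace(0, 1, 5)
Kx = kernel(X, x_test)                   # column j is k(x) for the j-th test point
mean = Kx.T @ alpha                      # predictive mean (48)
var = sigma2 + 1.0 - np.einsum('ij,ij->j', Kx, np.linalg.solve(A, Kx))   # (47), k(x, x) = 1
print(np.round(mean, 3))
print(np.round(var, 3))
```

The predictive variance shrinks near observed inputs and reverts to σ² + 1 far away from them, which is the behavior of the confidence band in Figure 6.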

If σ^2 were 0 we would obtain y_i; with σ^2 > 0, however, we end up shrinking y_i towards 0, similarly to the shrinkage estimator of James and Stein [33].

Other Additive Noise: while we may in general not be able to integrate out θ from p(Y|θ), the noise models of regression typically allow one to perform several simplifications when it comes to estimation. For convenience we will in the following split up the latent variable into θ and θ', corresponding to Y and Y'. Moreover, we will assume that the additive noise has zero mean. (In the nonzero mean case we simply add the offset to µ, which effectively reduces the additive noise to zero mean.)
Recall from Section 2.3 that an approximation to marginalization is to maximize the joint density p(Y', θ, θ', Y, X, X') with respect to (Y', θ, θ'). Using the fact that the random variables y_i are drawn iid and that Y, Y' are conditionally independent of X, X' given θ, θ', we arrive at


p(Y', θ, θ', Y, X, X') = p(Y'|θ') p(Y|θ) p(θ'|θ, X, X') p(θ|X) p(X, X').    (50)

Note that Y' appears only in p(Y'|θ'). If p(y'_i|θ'_i) has its mode at y'_i = θ'_i, we know that Y' = θ' maximizes (50), and we only need to care about the remainder

p(Y' = θ'|θ') p(Y|θ) p(θ'|θ, X, X') p(θ|X)    (51)

and maximize it with respect to θ and θ'. A further simplification occurs if p(Y' = θ'|θ') is constant, which is typically the case (e.g., if we have the same additive noise for all observations). Then the maximization with respect to θ' can be carried out by maximizing p(θ'|θ, X, X'). Note that the latter is Gaussian in θ', so the maximum is obtained for the mean µ_cond of the corresponding normal distribution, as given by (44) and (45). Moreover, p(θ' = µ_cond|θ, X, X') is constant, if we consider µ_cond as a function of θ. This means that we have now reduced the problem to the one of maximizing

p(Y|θ) p(θ|X).    (52)

The latter depends on the training data X, Y alone, which makes the whole approach computationally very attractive. In summary, the following steps were needed in reducing (50) to (52):
1. The conditional distribution p(y_i|θ_i) is peaked for y_i = θ_i.
2. p(y_i = θ_i|θ_i) is constant, when considered as a function of θ_i.
3. We predict θ' = µ_cond for known θ.
4. The maximizer in θ is found by maximizing the posterior probability p(θ|X, Y) ∝ p(Y|θ)p(θ|X).
We shall see in the next section that even if some of the above assumptions do not lead to optimality, we may still be able to obtain reasonably good estimates. In situations where neither of the two special cases discussed above applies, we will need to resort to a reasoning which is very similar to the one used in Gaussian Process classification. Since such situations are rather rare we will not discuss them in further detail.

3.5 Classification

The main difference to regression is that in classification the observations y_i are part of a discrete set Y, which we will denote by Y = {±1} in the case of binary classification and by Y = {1, . . . , N} for multiclass problems. It is clear that in such a situation we will need a conditional probability p(y_i|θ_i) if we are to transform the problem into one involving Gaussian Processes. This will lead to various models, using the logistic transfer function (14), the Probit (16), and their multiclass extensions.


Binary Case: here θ_i plays the role of a parameter responsible for the calibration of the conditional probabilities P(y_i = 1|X) and P(y_i = −1|X), and we have

p(Y|X) = ∫ [ Π_{i=1}^m p(y_i|θ_i) ] p(θ|X) dθ.    (53)

exp( 12 θi ) + exp(− 12 θi )

exp( 12 θi )

exp(− 12 θi ) . exp( 12 θi ) + exp(− 12 θi )

(54) (55)

In other words, the probability of class 1 or −1 occurring is proportional to exp( 12 [θi ]1 ) and exp( 12 [θi ]−1 ) subject to a normalization constraint, where we defined the coordinate “[θi ]−1 := −θi . From there the extension to more than two classes is straightforward: assume that the probability of class j is proportional to exp( 12 [θi ]j ) and normalize such that all p(yi = j|θi ) sum up to 1, namely   exp 12 [θi ]j (56) p(yi = j|θi ) = N . 1 l=1 exp 2 [θi ]l The default assumption is then that all θi are drawn from a Gaussian Process. Without any further knowledge about the relation between the various classes j, one typically assumes that all coordinates are drawn independently. An extension to uncertain labels yi is straightforward. Assume that we do not know the specific class j but merely some probability pij assessing whether pattern xi belongs to class j or not. In this case we need to integrate over all j to obtain   m N    pij p(j|θi ) (57) p(Yuncertain |θ) = i=1

j=1

86

A.J. Smola and B. Sch¨ olkopf

as a replacement for p(yi = j|θi ) in the conditional probability p(Y |θ). This is closely related to uncertain multiclass settings from maximum margin theory. See [55] for further details. To solve the inference problem arising from (19) or (50) in the classification case we will need to resort to approximations. Again, as before, we cannot solve the integral explicitly, so our main goal is to (approximately) maximize the joint density. For convenience, we only study the binary case. Recall that Y  appears in the joint density only via p(Y  |θ). Knowing that this is maximized with respect to Y  if we set yi = sgn (θi ) we could turn the overall problem into one of jointly maximizing over the continuous random variables θ and θ  . Unfortunately the resulting optimization problem is rather ill behaved. Instead we resort to the EM algorithm or a related method, as described in Section 2.3. Here Y  are the unknown labels and θ, θ  the latent variables to be obtained jointly with Y  . Taking the expectation over Y  as given by p(Y  |θ  ) we can write Q(θ) (see (23)) as follows

(58) Q(θ, θ  ) = EY  log p(Y  |θ  ) + log p(Y |θ) + log p(θ, θ  |X, X  ) 

m 

=

p(yi |θi ) log p(yi |θ¯i ) + log p(Y |θ) + log p(θ, θ  |X, X  ) (59)

i=1,yi ∈Y

where θ¯i denotes the value of θi obtained from a previous estimate. One then proceeds as follows: at every step one maximizes Q(θ, θ  ) and subsequently updates θ¯i ← θi until convergence. [10] show that the algorithm converges to a local maximum of the joint density p(Y  , θ, θ  |Y, X, X  ). Unfortunately, in practice, the local maximum achieved by this process is often not very good. However, a small modification of (58) leads to an expression which (empirically) tends to lead to better estimates, yet will also lead to a local maximum of the joint density. The idea is to remove θ¯i from Q and replace it by θi directly. In this case we have 

m 



˜ θ)= Q(θ,

p(yi |θi ) log p(yi |θi ) + log p(Y |θ) + log p(θ, θ  |X, X  ) (60)

i=1,yi ∈Y 

=−

m  i=1

H(p(yi |θi )) + log p(Y |θ) + log p(θ, θ  |X, X  ),

(61)

 where H(p) = j pj log pj is the entropy of p. This function can then be maximized jointly over θ and θ  by standard nonlinear optimization methods. Note ˜ and the fixed point of Q have to coincide. Since we know that the maxima of Q ˜ too, that the latter are the (local) maxima of the joint density we know that Q, will yield maxima of the joint density. ˜ θ ) From a Minimum Description Length point of view, minimizing −Q(θ, can be viewed as minimizing the number of bits (we use base e for convenience)

Bayesian Kernel Methods

87

needed to encode (Y  , θ, θ  ), given Y, X, X  : Firstly encoding Y  , given θ  requires H(p(Y |θ  ) bits. Secondly, for a given prior probability p(θ, θ  |X, X  ), the two remaining terms in (60) describe the number of bits needed to encode these random variables with respect to a code using log p(Y |θ)+log p(θ, θ  |X, X  ) bits. The latter is the Shannon optimal code length for random variables distributed according to the assumed prior distribution. Note that quite commonly one ignores the terms dependent on X  , θ  when it comes to finding an estimate for θ, hence one effectively resorts to an optimization setting as described in Section 3.4, however without the justification that could be given in the additive-noise regression case. This tends to yield acceptable results, however, it is important to bear in mind that better estimates can be obtained if a joint maximization over all variables is carried out. Figure 7 gives an example on classification with and without taking Y  into account.

Fig. 7. Gaussian Process classification without (left) and with (right) knowledge of the test data. Circles and stars correspond to the respective classes, crosses are unlabeled observations. Note the errors introduced by ignoring the test data on the left figure.

3.6

Adjusting Hyperparameters for Gaussian Processes

More often than not, we will not know the exact amount of additive noise, the specific form of the covariance kernel, or other parameters beforehand. To address this problem, the hyperparameter formalism of Section 2.4 is needed. However, unlike in the previous section, even an EM approach may be too costly, since expectations over the set of hyperparameters ω, or over the set of further latent variables θ, θ  are too costly to be carried out. Consequently, one of the few practical method available is that of coordinate descent, that is: a) optimize over (θ, θ  , Y  ) via the EM algorithm for a fixed ω, b) maximize with respect to ω for a fixed (θ, θ  , Y  ), and repeat a) and b) until convergence occurs. To avoid technicalities, we only discuss the special and somewhat simpler case of regression with additive Gaussian noise, since here the latent variables

88

A.J. Smola and B. Sch¨ olkopf

θ, θ  can be integrated out. We refer the reader to [84, 15, 11, 54] and the references therein for integration methods based on Markov Chain Monte Carlo approximations (see also [63] for a more recent overview). More specifically, assume that both K and σ 2 (the additive normal noise) are functions of ω, so that   1 1 exp − Y  (K + σ 2 )−1 Y (62) p(Y |ω, X) =  2 (2π)m det(K + σ 2 1) and p(Y |X) = p(Y |ω, X)p(ω)dω. In other words, (62) tells us how likely it is that we observe y, if we know ω. To maximize the integrand p(Y |ω, X)p(ω) with respect to ω we require information about the gradient of (62) with respect to ω. An explicit expression is given below Since the logarithm is monotonic, we can equivalently minimize the negative log posterior, ln p(Y |ω)p(ω). With the shorthand Q := K + σ 2 1, we obtain ∂ω [− ln p(Y |ω)p(ω)]

1 1 = ∂ω (ln det Q) − ∂ω Y  Q−1 Y − ∂ω ln p(ω) 2 2  1 1  = − tr Q−1 ∂ω Q + Y  Q−1 (∂ω Q) Q−1 Y − ∂ω ln p(ω). 2 2

(63) (64)

Here (64) follows from (63) via standard matrix algebra [39]. Likewise, we could compute the Hessian of ln p(Y |ω)p(ω) with respect to ω and use a second order optimization method.7 If we assume a flat8 hyperprior (p(ω) = const.), optimization over ω simply becomes gradient descent in − ln p(Y |ω); in other words, the term depending on p(ω) vanishes. Computing (64) is still very expensive numerically since it involves the inversion of Q, which is an m × m matrix. One option to parameterize K +σ 2 1 is to assume that the covariance kernel k itself has been drawn from a Gaussian Process (in this case we need to restrict k to to the cone of Mercer kernels). Such a setting can be optimized by a so-called superkernel expansion. See [70] for further details. Finally, there exist numerous techniques, such as sparse greedy approximation methods, to alleviate this problem. We present a selection of these techniques in the following section. Additional detail on the topic of hyperparameter optimization can be found in Section 6, where hyperparameters play a crucial role in determining the sparsity of an estimate.

4

Implementation of Gaussian Processes

In this section, we discuss various methods to perform inference in the case of Gaussian process classification or regression. We begin with a general purpose 7 8

This is rather technical, and the reader is encouraged to consult the literature for further detail [41, 54, 46, 18]. Note that this is clearly an improper hyperprior, which may lead to overfitting.

Bayesian Kernel Methods

89

technique, the Laplace approximation, which is essentially an application of Newton’s method to the problem of minimizing the negative log-posterior density. Since it is a second order method, it is applicable as long as the log-densities have second order derivatives. Readers interested only in the basic ideas of Gaussian process estimation may skip the present section. For classification with the logistic transfer function we present a variational method (Section 4.2), due to Jaakkola and Jordan [32], and Gibbs and MacKay [18, 16], a linear system of equations for optimization purposes. Finally, the special case of regression in the presence of normal noise admits very efficient optimization algorithms based on the approximate minimization of quadratic forms (Section 4.3). We subsequently discuss the scaling behavior and approximation bounds for these algorithms. For convenience, we only study the problem of maximizing p(θ|Y, X) ∝ p(Y |θ)p(θ|X)

(65)

ignoring the considerations about the test data X  , which we put forward in Section 3.5. An extension to these methods is in some cases straightforward (for the Expectation Maximization setting), and in other cases essentially impossible (for the direct MDL approach). So we skip both of them. Maximizing p(θ|Y, X) is equivalent to minimizing − log p(Y |θ) − log p(θ|X), since L(θ) := − log(θ|Y, X) = − log p(θ|Y, X) = − log p(Y |θ) − log p(θ|X) + c, (66) for some constant c ∈ R. L(θ) is commonly referred to as the negative log posterior and the remainder of this section is devoted to efficient methods of minimizing it. 4.1

Laplace Approximation

Note that for Gaussian Process regression with additive normal noise (66) becomes L(θ) =

m  1 1 (yi − θi )2 + θK −1 θ  . 2 2σ 2 i=1

(67)

Minimization of (67) can be achieved by solving a quadratic optimization probˆ at lem with its minimum θ ˆ = (σ 2 1 + K)−1 KY = K(σ 2 1 + K)−1 Y θ

(68)

which is identical to the estimate obtained in (48). This means that for normal distributions seeking the mode of the density and performing a quadratic approximation at the mode is exact. In general, however, the negative log posterior (66), is not quadratic, hence the minimum cannot be found analytically and typically we will not be able to

90

A.J. Smola and B. Sch¨ olkopf

study the variation of the estimate explicitly either. A possible solution is to make successive quadratic approximations of the negative log posterior, and minimize the latter iteratively. This strategy is referred to as the Laplace approximation9 [71, 84, 63]; the Newton-Raphson method, in numerical analysis (see [72, 53]); or the Fisher scoring method, in statistics. A necessary condition for the minimum of a differentiable function g is that its first derivative be 0. For convex functions, this requirement is also sufficient. We approximate g  linearly by g  (x + ∆x) ∼ g  (x) + ∆xg  (x), and hence ∆x = −

g  (x) . g  (x)

(69)

Substituting ln p(f |X, Y ) into (69) and using the definitions c := (−∂θ1 ln p(y1 |θ1 ), . . . , −∂θm ln p(ym |θm )) ,   C := diag −∂θ21 ln p(y1 |θ1 ), . . . , −∂θ2m ln p(ym |θm ) ,

(70) (71)

we obtain the following update rule for α (where θ = Kα), αnew = (KC + 1)−1 (KCαold − c).

(72)

While (72) is usually an efficient way of finding a maximizer of the log posterior, it is far from clear that this update rule is always convergent (to prove the latter, we would need to show that the initial guess of α lies within the radius of attraction [53, 13, 19, 38]. Nonetheless, this approximation turns out to work in practice, and the implementation of the update rule is relatively simple. The major stumbling block if we want to apply (72) to large problems is that the update rule requires the inversion of an m × m matrix. This is costly, and effectively precludes efficient exact solutions for problems of size significantly larger than 103 , due to memory and computational requirements. If we are able to provide a low rank approximation of K by ˜ = U  Ksub U where U ∈ Rn×m and Ksub ∈ Rn×n K

(73)

with n m, however, we may compute (72) much more efficiently. For instance, it follows immediately from the Sherman-Woodbury-Morrison formula [22], (V + RHR )−1 = V −1 − V −1 R(H −1 + R V −1 R)−1 R V −1 ,

(74)

˜ that we obtain the following update rule for K,    −1 −1 + U CU  U C (U  Ksub U Cαold − c). αnew = 1 − U  Ksub

(75)

9

Strictly speaking, the Laplace approximation refers only to the fact that we approximate the mode of the posterior by a Gaussian distribution. We already use the Gaussian approximation in the second order method, however, in order to maximize the posterior. Hence, for all practical purposes, the two approximations just represent two different points of view on the same subject.

Bayesian Kernel Methods

91

In particular, the number of operations required to solve (72) is O(mn2 + n3 ) rather than O(m3 ). Numerically more stable, yet efficient and easily implementable methods than the Sherman-Woodbury-Morrison method exist, however their discussion would be somewhat technical. See [69, 12, 21] for further details and references. There are several ways to obtain a good approximation of (73). One way is to project k(xi , x) on a random subset of dimensions, and express the missing terms as a linear combination of the resulting sub-matrix (this is the Nystr¨ om method proposed by Seeger and Williams [83]). We might also construct a randomized sparse greedy algorithm to select the dimensions (see [68] for details), or resort to a positive diagonal pivoting strategy [12]. An approximation of K by its leading principal components, as often done in machine learning, is usually undesirable, since the computation of the eigensystem would still be costly, and the time required for prediction would still rise with the number of observations (since we cannot expect the leading eigenvectors of K to contain a significant number of zero coefficients). 4.2

Variational Methods

In the case of logistic regression, Jaakkola and Jordan [32] compute upper and lower bounds on the logistic (1 + e−t )−1 , by exploiting the log-concavity of eq:gp:logistic-model: A convex function can be bounded from below by its tangent at any point, and from above by a quadratic with sufficiently large curvature (provided the maximum curvature of the original function is bounded). These bounds are   (t − ν) 1 2 2 p(y = 1|t) ≥ − λ(ν)(t exp − ν ) , (76) 1 + e−ν 2 p(y = 1|t) ≤ exp (µt − H(µ)) , (77) −ν −1

) −1/2 where µ, ν ∈ [0, 1] and λ(ν) = (1+e 2ν . Furthermore, H(µ) is the binary entropy function, H(µ) = −µ ln µ − (1 − µ) ln(1 − µ). Likewise, bounds for p(y = −1|t) follow from p(y = −1|t) = 1 − p(y = 1|t). Equations (77) and (76) can be calculated quite easily, since they are linear or quadratic functions in t. This means that for fixed parameters µ and ν, we can optimize an upper and a lower bound on the log posterior using the same techniques as in Gaussian process regression (Section 3). Approximations (77) and (76) are only tight, however, if ν, µ are chosen suitably. Therefore we have to adapt these parameters at every iteration (or after each exact solution), for instance by gradient descent (by minimizing the upper bound and maximizing the lower bound correspondingly). See [16, 17] for details. Again, factorizations for rank-degenerate matrices as in the previous section can be used for efficient implementation.

92

A.J. Smola and B. Sch¨ olkopf 2 Logistic Lower Bound Upper Bound

1.8

1.6

1.4

p(y=1|t(x))

1.2

1

0.8

0.6

0.4

0.2

0 −5

−4

−3

−2

−1

0 t(x)

1

2

3

4

5

Fig. 8. Variational Approximation for µ = ν = 0.5. Note that the quality of the approximation varies widely, depending on the value of f (x).

4.3

Approximate Solutions for Gaussian Process Regression

The approximations of Section 4.1 indicate that one of the more efficient ways of implementing Gaussian process estimation on large amounts of data is to find a low rank approximation10 of the matrix K. Such an approximation is very much needed in practice, since (47) and (48) show that exact solutions of Gaussian Processes can be hard to come by. Even if α is computed beforehand (see Table 1 for the scaling behavior), prediction of the mean at a new location still requires O(m) operations. In particular, memory requirements are O(m2 ) to store K, and CPU time for matrix inversions, as are typically required for second order methods, scales with O(m3 ). Let us limit ourselves to an approximation of the MAP solution. One of the criteria to impose is that the posterior probability at the approximate solution be close to the maximum of the posterior probability. Note that this requirement is different from the requirement of closeness in the approximation itself, as represented for instance by the expansion coefficients (the latter requirement 10

Tresp [74] devised an efficient way of estimating f (x) if the test set is known at the time of training. He proceeds by projecting the estimators on the subspace spanned by the functions k(˜ xi , ·), where x ˜i are the training data. Likewise, Csat´ o and Opper [9] design an iterative algorithm that performs gradient descent on partial posterior distributions and simultaneously projects the estimates onto a subspace.

Bayesian Kernel Methods

93

was used by Gibbs and Mackay [18]). Proximity in the coefficients, however, is not what we want, since it does not take into account the importance of the individual variables. For instance, it is not invariant under transformations of scale in the parameters. For the remainder of the current section, we consider only additive normal noise. Here, the log posterior takes a quadratic form, given by (67). The following theorem, which uses an idea from [18], gives a bound on the approximation quality of minima of quadratic forms and is thus applicable to (67). For convenience we rewrite (67) in terms of θ = Kα. Theorem 2 (Approximation Bounds for Quadratic Forms [67]). Denote by K ∈ Rm×m a symmetric positive definite matrix, y, α ∈ Rm , and define the two quadratic forms 1 L(α) := −y Kα + α (σ 2 K + K  K)α, 2 1  2 ∗  L (α) := −y α + α (σ 1 + K)α. 2

(78) (79)

Suppose L and L∗ have minima Lmin and L∗min . Then for all α, α∗ ∈ Rm we have 1 L(α) ≥ Lmin ≥ − y2 − σ 2 L∗ (α∗ ), 2  1  ∗ ∗ −2 − y2 − L(α) , L (α ) ≥ L∗min ≥ σ 2

(80) (81)

with equalities throughout when L(α) = Lmin and L∗ (α∗ ) = L∗min . Hence, by minimizing L∗ in addition to L, we can bound L’s closeness to the optimum and vice versa. Proof. The minimum of L(α) is obtained for αopt = (K + σ 2 1)−1 y (which also minimizes L∗ ),11 hence 1 1 Lmin = − y K(K + σ 2 1)−1 y and L∗min = − y (K + σ 2 1)−1 y. 2 2

(82)

This allows us to combine Lmin and L∗min to Lmin + σ 2 L∗min = − 12 y2 . Since by definition L(α) ≥ Lmin for all α (and likewise L∗ (α∗ ) ≥ L∗min for all α∗ ), we may solve Lmin + σ 2 L∗min for either L or L∗ to obtain lower bounds for each of the two quantities. This proves (80) and (81). Equation (80) is useful for computing an approximation to the MAP solution (the objective function is identical to L(α), ignoring constant terms independent of α), whereas (81) can be used to obtain error bars on the estimate. To see this, 11

If K does not have full rank, L(α) still attains its minimum value for αopt . There will then be additional α that minimize L(α), however.

94

A.J. Smola and B. Sch¨ olkopf

note that in calculating the variance (47), the expensive quantity to compute is −k (K + σ 2 1)−1 k. This can be found as   −k (K + σ 2 1)−1 k = 2 minm −k α + 12 α σ 2 1 + K α , (83) α∈R

however. A close look reveals that the expression inside the parentheses is L∗ (α) with y = k (see (79)). Consequently, an approximate minimizer of (83) gives an upper bound on the error bars, and lower bounds can be obtained from (81). In practice, we use the relative discrepancy between the upper and lower bounds, Gap(α, α∗ ) :=

2(L(α) + σ 2 L∗ (α∗ ) + 12 y2 ) , −L(α) + σ 2 L∗ (α∗ ) + 12 y2

(84)

to determine how much further the approximation has to proceed. 4.4

Solutions on Subspaces

The central idea of the algorithm below is that improvements in speed can be achieved by a reduction in the number of free variables. Denote by P ∈ Rm×n with m ≥ n and m, n ∈ N an extension matrix (in other words, P  is a projection), with P  P = 1. We make the ansatz αP := P β where β ∈ Rn ,

(85)

and find solutions β such that L(αP ) (or L∗ (αP )) is minimized. The solution is    −1   P K y. β opt = P  σ 2 K + K  K P

(86)

If P is of rank m, then KαP is also the minimizer of (67). In all other cases, however, it is an approximation. For a given P ∈ Rm×n , let us analyze the computational cost involved in computing (86). We need O(nm) operations to evaluate P  Ky, O(n2 m) operations for (KP ) (KP ), and O(n3 ) operations for the inversion of an n × n matrix. This brings the total cost to O(n2 m). Predictions require k α, which entails O(n) operations. Likewise, we may use P to minimize L∗ (P β ∗ ), which is needed to upper-bound the log posterior. The latter costs no more than O(n3 ). To compute the posterior variance, we have to approximately minimize (83), which can done for α = P β at cost O(n3 ) . If we compute (P KP  )−1 beforehand, the cost becomes O(n2 ), and likewise for upper bounds. In addition to this, we have to minimize −k KP β + 12 β  P  (σ 2 K + K  K)P β, which again costs O(n2 m) (once the inverse matrices have been computed, however, we may also use them to compute error bars at different locations, thus limiting the cost to O(n2 )). Accurate lower bounds on the error bars are not especially crucial, since a bad estimate leads at worst to overly conservative confidence intervals, and has no further negative effect. Finally, note that we need only compute and store KP — that is, the m × n sub-matrix of K — and not K itself. Table 1 summarizes the scaling behavior of several optimization algorithms.

Bayesian Kernel Methods

95

Table 1. Computational Cost of Various Optimization Methods. Note that n  m, and that different values of n are used in Conjugate Gradient, Sparse Decomposition, and Sparse Greedy Approximation methods: nCG ≤ nSD ≤ nSGA , since the search spaces are progressively more restricted. Near-optimal results are obtained when κ = 60. Exact Solution Memory O(m2 ) Initialization O(m3 ) (= Training) Prediction: Mean O(m) Error Bars O(m2 )

Conjugate Gradient [18] O(m2 ) O(nm2 )

Sparse Decomposition O(nm) O(n2 m)

O(m) O(nm2 )

O(n) O(n) O(n2 m) or O(n2 ) O(κn2 m) or O(n2 )

Sparse Greedy Approximation O(nm) O(κn2 m)

This leads us to the question of how to choose P for optimum efficiency. Possibilities include using the principal components of K [83], performing conjugate gradient descent to minimize L [18], performing symmetric diagonal pivoting [12], or applying a sparse greedy approximation to K directly [68]. Yet these methods have the disadvantage that they either do not take the specific form of y into account [83, 68, 12], or lead to expansions that cost O(m) for prediction, and require computation and storage of the full matrix [83, 18]. By contrast to these methods, we use a data adaptive version of a sparse greedy approximation algorithm. We may then only consider matrices P that are a collection of unit vectors ei (here (ei )j = δij ), since these only select a number of rows of K equal to the rank of P (see also [62] for a template of a sparse greedy algorithm). – First, for n = 1, we choose P = ei such that L(P β) is minimal. In this case we could permit ourselves to consider all possible indices i ∈ {1, . . . m} and find the best one by trying all of them. – Next, assume that we found a good solution P β, where P contains n columns. In order to improve this solution, we expand the projection operator P into the matrix Pnew := [Pold , ei ] ∈ Rm×(n+1) and seek the best ei such that Pnew minimizes minβ L(Pnew β). Note that this method is very similar to Matching Pursuit [42] and to iterative reduced set Support Vector algorithms [60], with the difference that the target to be approximated (the full solution α) is only given implicitly via L(α). Recently Zhang [85] proved lower bounds on the rate of sparse approximation schemes. In particular, he shows that most subspace projection algorithms enjoy at least an O(n−1 ) rate of convergence. 4.5

Implementation Issues

Performing a full search over all possible n + 1 of m indices is excessively costly. Even a full search over all m − n remaining indices to select the next basis function can be prohibitively expensive. A simple result concerning the tails of rank

96

A.J. Smola and B. Sch¨ olkopf

statistics (see [62]) comes to our aid — it states that with high probability, a small subset of size κ = 59, chosen at random, guarantees near optimal performance. Hence, if we are satisfied with finding a relatively good index rather than the best index, we may resort to selecting only a random subset.

Algorithm 1.1 Sparse Greedy Quadratic Minimization. Require: Training data X = {x1 , . . . , xm }, Targets y, Noise σ 2 , Precision , corresponding quadratic forms L and L ∗ . Initialize index sets I, I ∗ = {1, . . . , m}; S, S ∗ = ∅. repeat Choose M ⊆ I, M ∗⊆ I ∗ .    Find argmin i∈M Q [P, ei ]β opt , arg mini∗ ∈M ∗ L ∗ [P ∗ , ei∗ ]β ∗opt . ∗ ∗ ∗ Move i from I to S, i from I to S . Set P := [P, ei ], P ∗ := [P ∗ , ei∗ ]. until L(P β opt ) + σ 2 L ∗ (P β ∗opt ) + 12 y2 ≤ 2 (|L(P β opt )| + |σ 2 L ∗ (P β ∗opt ) + 12 y2 | Output: Set of indices S, β opt , (P  KP )−1 , and (P  (K  K + σ 2 K)P )−1 .

It is now crucial to obtain the values of L(P β opt ) cheaply (with P = [Pold , ei ]), assuming that we found Pold previously. From (86) we can see that we need only do a rank-1 update on the inverse. We now show that this can be obtained in O(mn) operations, provided the inverse of the smaller subsystem is known. Expressing the relevant terms using Pold and ki , we obtain  P  K  y = [Pold , ei ] K  y = (Pold K  y, k (87) i y),          2  2 Pold K K + σ K Pold Pold K + σ 1 ki   P  K  K + σ2 K P = .(88) 2 2 k k i (K + σ 1)Pold i ki + σ Kii

Thus computation of the terms costs only O(nm) once we know Pold . Furthermore, we can write the inverse of a strictly positive definite matrix as 

A B B C

−1



 A−1 + (A−1 B) γ(A−1 B) −γ(A−1 B) = , −(γ(A−1 B)) γ

(89)

  where γ := (C − B  A−1 B)−1 . Hence, inversion of P  K  K + σ 2 K P costs only O(n2 ). Thus, to find the matrix P of size m×n takes O(κn2 m) time. For the error bars, (P  KP )−1 is generally a good starting value for the minimization of (83), so the typical cost for (83) is O(τ mn) for some τ < n, rather than O(mn2 ). If additional numerical stability is required, we might want to replace (89) by a rank-1 update rule for Cholesky decompositions of the corresponding positive definite matrix. Furthermore, we may want to add the kernel function chosen by positive diagonal pivoting [12] to the selected subset, in order to ensure that the n × n sub-matrix remains invertible. See numerical mathematics textbooks, such as [28], for more detail on update rules.

Bayesian Kernel Methods

4.6

97

Hardness and Approximation Results

It is worthwhile to study the theoretical guarantees on the performance of the algorithm (as described in Algorithm 1.1). It turns out that our technique closely resembles a Sparse Linear Approximation problem studied by Natarajan [44]: Given A ∈ Rm×n , b ∈ Rm , and  > 0, find x ∈ Rn with minimal number of nonzero entries such that Ax − b2 ≤ . If we define  1 A = σ 2 K + K  K 2 and b := A−1 Ky,

(90)

we may write L(α) =

1 b − Aα2 + c, 2

(91)

where c is a constant independent of α. Thus the problem of sparse approximate minimization of L(α) is a special case of Natarajan’s problem (where the matrix A is square, and strictly positive definite). In addition, the algorithm considered in [44] involves sequentially choosing columns of A to maximally decrease Ax − b. This is equivalent to the algorithm described above and we may apply the following result to our sparse greedy Gaussian process algorithm. Theorem 3 (Natarajan, 1995 [44]). A sparse greedy algorithm to approximately solve the problem minimize y − Ax

(92)

needs at most n ≤ 18n∗ ()A+ 22 ln

y 2

(93)

non-zero components, where n∗ () is the minimum number of nonzero components in vectors α for which y − Ax ≤ , and A+ is the matrix obtained from A by normalizing its columns to unit length. Corollary 1 (Approximation Rate for Gaussian Processes). Algorithm 1.1 satisfies L(α) ≤ L(αopt ) + 2 when α has n≤

18n∗ () A−1 Ky ln λ2 2

(94)

non-zero components, where n∗ () is the minimum number of nonzero components in vectors α for which L(α) ≤ L(αopt ) + 2 , A = (σ 2 K + K  K)1/2 , and λ is the smallest magnitude of the singular values of A, the matrix obtained by normalizing the columns of A. Moreover, we can also show NP-hardness of sparse approximation of Gaussian process regression. The following theorem holds:

98

A.J. Smola and B. Sch¨ olkopf

Theorem 4 (NP-Hardness of Approximate GP Regression). There exist kernels K and labels y such that the problem of finding the minimal set of indices to minimize a corresponding quadratic function L(α) with precision ε is NPhard. Proof. We use the hardness result [44, Theorem 1] for Natarajan’s quadratic approximation problem in terms of A and b. More specifically, we have to proceed in the opposite direction to (90) and (91) and show that for every A and b, there exist K and y for an equivalent optimization problem. Since Ax − b2 = x (A A)x − 2(b A)x + b2 , the value of A enters only via A A, which means that we have to find K in (78) such that A A = K  K + σ 2 K.

(95)

We can check that it is possible to find a suitable positive definite K for any A, by using identical eigensystems for A A and K, and subsequently solving the equations ai = λ2i + σ 2 λi for the respective eigenvalues ai and λi of A A and K. Furthermore, we have to satisfy y K = bA.

(96)

To see this, recall that bA is a linear combination of the nonzero eigenvectors of A A; and since K has the same rank and image as A A, the vector bA can also be represented by y K. Thus for every A, b there exists an equivalent L, which proves NP-hardness by reduction. This shows that the sparse greedy algorithm is an efficient approximate solution of an NP-hard problem. 4.7

Experimental Evidence

We conclude this section with a brief experimental demonstration of the efficiency of sparse greedy approximation methods, using the Abalone dataset. Specifically, we used Gaussian covariance kernels, and we split the data into 4000 training and 177 test examples to assess training speed (to assess generalization performance, a 3000 training and 1177 test set split was used). For the optimal parameters (2σ 2 = 0.1 and 2ω 2 = 10, chosen after [68]), the average test error of the sparse greedy approximation trained until Gap(α, α∗ ) < 0.025 is indistinguishable from the corresponding error obtained by an exact solution of the full system. The same applies for the log posterior. See Table 2 for details. Consequently for all practical purposes, full inversion of the covariance matrix and the sparse greedy approximation have comparable generalization performance. A more important quantity in practice is the number basis functions needed to minimize the log posterior to a sufficiently high precision. Table 3 shows this number for a precision of Gap(α, α∗ ) < 0.025, and its variation as a function of the kernel width σ; the latter dependency is observed since the number of

Bayesian Kernel Methods

99

−1

Gap

10

−2

10

0

20

40

60

80 100 120 Number of Iterations

140

160

180

200

Fig. 9. Speed of Convergence. We plot the size of the gap between upper and lower bound of the log posterior (Gap(α, α∗ )), for the first 4000 samples from the Abalone dataset (σ 2 = 0.1 and 2ω 2 = 10). From top to bottom: Subsets of size 1, 2, 5, 10, 20, 50, 100, 200. The results were averaged over 10 runs. The relative variance of the gap size is less than 10%. We can see that subsets of size 50 and above ensure rapid convergence.

kernels determines time and memory needed for prediction and training. In all cases, less than 10% of the kernel functions suffice to find a good minimizer of the log posterior; less than 2% are sufficient to compute the error bars. This is a significant improvement over a direct minimization approach. A similar result can be obtained on larger datasets. To illustrate, we generated a synthetic data set of size 10.000 in R20 by adding normal noise with variance σ 2 = 0.1 to a function consisting of 200 randomly chosen Gaussians of width 2ω 2 = 40 and with normally distributed expansion coefficients and centers. To avoid trivial sparse expansions, we deliberately used an inadequate Gaussian process prior (but correct noise level) consisting of Gaussians with width 2σ 2 = 10. After 500 iterations (thus, after using 5% of all basis functions), the size of Gap(α, α∗ ) was less than 0.023. This demonstrates the feasibility of the sparse greedy approach on larger datasets.

5

Laplacian Processes

So far the dependency of the latent variables  θ on X could not be factorized easily in terms of the xi . That is, p(θ|X) = i p(θi |xi ). This is due to the fact

100

A.J. Smola and B. Sch¨ olkopf

Table 2. Performance of sparse greedy approximation vs. explicit solution of the full learning problem. In these experiments, the Abalone dataset was split into 3000 training and 1177 test samples. To obtain more reliable estimates, the algorithm was run over 10 random splits of the whole dataset. Generalization Error Log Posterior Optimal Solution 1.782 ± 0.33 −1.571 · 105 (1 ± 0.005) Sparse Greedy Approximation 1.785 ± 0.32 −1.572 · 105 (1 ± 0.005) Table 3. Number of basis functions needed to minimize the log posterior on the Abalone dataset (4000 training samples), for various kernel widths ω. Also given is the number of basis functions required to approximate k (K + σ 2 1)−1 k, which is needed to compute the error bars. Results were averaged over the 177 test samples. Kernel width 1 2 5 10 20 50 Kernels for log-posterior 373 287 255 257 251 270 Kernels for error bars 79±61 49±43 26±27 17±16 12±9 8±5

that we want to infer from knowing (x, y) the likely value of (x , y  ) — a model which did not couple the various yi would fail in this regard. However, by a simple trick, we can achieve the factorization, yet still retain the inference properties of the estimator: we introduce yet another layer of dependence: rather than modelling p(X)

/X

p(θ|X) p(θ|X)p(X)

p(Y |θ)



p(Y |θ)p(θ|X)p(X)

/Y

we assume p(X)

/X m

p(θ|X) p(θ|X)p(X)



t=Kθ p(θ|X)p(X)

m

/t

p(Y |t) p(Y |t(θ))p(θ|X)p(X)

/Y

where p(θ|X) = i=1 p(θi |xi ) and p(Y |t(θ)) = i=1 p(yi |ti (θ)). That is, we moved the mixing between the various yi into the design matrix K. Note that there is no requirement that K be positive semidefinite. In fact, any arbitrary matrix would do. However, for practical purposes we will typically choose some function k with Kij = k(xi , xj ). Before we go into the technical details, let us give some motivation as to why the complexity of an estimate can depend on the locations where data occurs, since we are effectively updating our prior assumptions about t after observing the data placement. Note that we do not modify our prior assumptions based on the targets yi , but rather as a result of the distribution of patterns xi : Different input distribution densities might for instance correspond to different assumptions regarding the smoothness of the function class to be estimated. For example, it might be be advisable to favor smooth functions in areas where data are scarce, and allow more complicated functions where observations abound. We might not care about smoothness at all in regions where there is little or no chance of patterns occurring: In the problem of handwritten digit recognition,

Bayesian Kernel Methods

101

we do not (and should not) care about the behavior of the estimator on inputs x looking like faces. The specific benefit of this strategy is that it provides us with a correspondence between linear programming regularization [43, 2, 65, 6] and Bayesian priors over function spaces, by analogy to regularization in Reproducing Kernel Hilbert Spaces and Gaussian Processes.12 5.1

Examples of Factorizing Priors

Let us now study some of the priors factorizing in coefficient space. By construction we have m

m   1 γ(θi ) , where ti = θi k(xi , x). (97) p(t|X) = exp − Z i=1 i=1 Here γ(θi ) is chosen such that exp(−γ(θ)) is integrable, Z is the corresponding normalization term, and xi ∈ X. Examples of priors that depend on the locations xi include γ(θ) = 1 − ep|θ| with p > 0 (feature selection prior), γ(θ) = θ2 (weight decay prior), γ(θ) = |θ| (Laplacian prior).

(98) (99) (100)

The prior given by (98) was introduced in [5, 14] and is log-concave. While the latter characteristic is unfavorable in general, since the corresponding optimization problem exhibits many local minima, the negative log-posterior becomes strictly concave if we choose Laplacian noise (equivalent to the L1 loss in regression). By a basic result from convex analysis [57] this means that the optimum occurs at one of the extreme points, which makes optimization more feasible. Eq. (99) describes the popular weight decay prior used in Bayesian Neural Networks [40, 45, 46]. It assumes that the coefficients are independently normally distributed. We relax the assumption of a common normal distribution in Section 6 and introduce individual (hyper)parameters si . The resulting prior, p(θ|X, s) = (2π)

−m 2

m 12

m  1 2 si exp − si θi , 2 i=1 i=1

(101)

leads to the construction of the Relevance Vector Machine [73] and very sparse function expansions. Finally, the assumption underlying the Laplacian prior (100) is that only very few basis functions will be nonzero. The specific form of the prior is why we will call such estimators Laplacian Processes. This prior has two significant advantages over (98): It leads to convex optimization problems, and the integral 12

We thank Carl Magnus Rasmussen for discussions and suggestions.

102

A.J. Smola and B. Sch¨ olkopf



p(θ)dθ is finite and thus allows normalization (this is not the case for (98), which is why we call the latter an improper prior). The Laplacian prior corresponds to the regularization functional employed in sparse coding approaches, such as wavelet dictionaries [6], coding of natural images [48], independent component analysis [37], and linear programming regression [66, 65]. In the following, we focus on (100). It is straightforward to see that the MAP estimate can be obtained by minimizing the negative log posterior, which is given (up to constant terms) by −

m 

ln p(yi |f (xi ), xi ) +

i=1

m 

|θi |.

(102)

i=1

Depending on ln p(yi |ti (θ)), we may formulate the minimization of (102) as a linear or quadratic program. 5.2

Samples from the Prior

In order to illustrate our reasoning, and to show that such priors correspond to useful distributions over t, we generate samples from the prior distribution. As in Gaussian processes, smooth kernels k correspond to smooth priors. This is not surprising: As we show in the next section (Theorem 5), there exists a corresponding Gaussian process for every kernel k and every distribution p(x). The obvious advantage, however, is that we need not worry about Mercer’s condition for k but can take any arbitrary function k(x, x ) to generate a Laplacian process. We draw samples from the following three kernels, k(x, x ) = 

k(x, x ) = 

e− e

x−x 2 2σ 2

x−x  − σ



Gaussian RBF kernel,

(103)

Laplacian RBF kernel,

(104)

k(x, x ) = tanh(θ x, x + ϑ) Neural Networks kernel.

(105)

While (103) and (104) are also valid kernels for Gaussian Process estimation, (105) does not satisfy Mercer’s condition13 . Figure 10 gives sample realizations from the corresponding process. The use of (105) is impossible for GP priors, unless we diagonalize the matrix K explicitly and render it positive definite by replacing λi with |λi |. This is a very costly procedure (see also [61, 24]) as it involves computing the eigensystem of K. 5.3

Estimation

Since one of the aims of using a Laplacian prior on the coefficients θi is to achieve sparsity of the expansion, it does not appear sensible to use a Bayesian averaging 13

The covariance matrix K has to be positive definite at all times. An analogous application of the theory of conditionally positive definite kernels would be possible as well. There one simply assumes a Gaussian Process prior on a linear subspace of the yi .

Bayesian Kernel Methods

103

Fig. 10. Left Column: Grayscale plots of the realizations of several Laplacian Processes. The black dots represent data points. Right Column: 3D plots of the same samples of the process. We used 400 data points sampled at random from [0, 1]2 using a uniform distribution. Top to bottom: Gaussian kernel (103) (σ 2 = 0.1), Laplacian kernel (104) (σ = 0.1), and Neural Networks kernel (105) (θ = 10, ϑ = 1). Note that the Laplacian kernel is significantly less smooth than the Gaussian kernel, as with a Gaussian Process with Laplacian kernels. Moreover, observe that the Neural Networks kernel corresponds to a non-stationary process; that is, its covariance properties are not translation invariant.

104

A.J. Smola and B. Sch¨ olkopf

scheme to compute the mean of the posterior distribution, since such a scheme leads to mostly nonzero coefficients. Instead we seek to obtain the mode of the distribution (the MAP estimate) only. As already pointed out in the previous section, finding the mode need not give an exact solution, since mode and mean do not coincide for Laplacian regularization (recall Figure 3). Nonetheless, the MAP estimate is computationally attractive, since if both − ln p(ξi ) and γ(θi ) are convex, the optimization problem has a unique minimum. By assumption we can write the joint density in Y, t, θ as follows: p(Y, t, θ|X) = p(Y |t(θ))p(θ|X)  m  p(yi |ti (θ))p(θi ) where t(θ) = Kθ = ∝

i=1 m 

(106) (107)

 p(yi |ti (θ)) exp (−γ(θi ))

(108)

i=1

Here we exploited the fact that the prior factorizes and that the data is generated iid. Maximization of (106) is equivalent to minimizing the negative log posterior which leads to minimize

m  i=1

− ln p(yi |ti ) +

subject to t = Kθ

m  i=1

γ(θi ),

(109)

For − log p(ξi ) = |ξi | + c and γ(θi ) = |θi |, this leads to a linear program and the solution can be readily used as a MAP estimate for Laplacian processes (a similar reasoning holds for soft margin loss functions). Likewise for Gaussian noise, we obtain a quadratic program with a simple objective function but a dense set of constraints, by analogy to Basis Pursuit [6]. 5.4

Confidence Intervals for Gaussian Noise

One of the key advantages of Bayesian modeling is that we can obtain explicit confidence intervals for the predictions, provided the assumptions made regarding the priors and distribution are satisfied. Even for Gaussian noise, however, no explicit meaningful expansion using the MAP estimate αMAP is possible, since γ(θi ) = |θi | is non-differentiable at 0 (otherwise we could make a quadratic approximation at θi = 0). Nonetheless, a slight modification permits computationally efficient approximation of such error bounds. The modification consists of dropping all variables θi for which θMAP,i = 0 from the expansion (this renders the distribution flatter and thereby overestimates the error), and replacing all remaining variables by linear approximations (we replace |θi | by sgn (θMAP,i ) θi ). In other words, we assume that variables with zero coefficients do not influence the expansion, and that the signs of the remaining variables do not change.

Bayesian Kernel Methods

105

This is a sensible approximation since for large sample sizes, which Laplacian processes are designed to address, the posterior is strongly peaked around its mode. Thus the contribution of − log p(θ) = θ1 + c around θ MAP can be considered to be approximately linear. Denote by θ M the vector of nonzero variables, obtained by deleting all entries where θMAP,i = 0, by s the vector with elements ±1 such that θ MAP 1 = s θ M , and by KM the matrix generated from K by removing the columns corresponding to θMAP,i = 0. Then the posterior (now written in terms of θ for convenience) can be approximated as

m   1  1 2 (yi − KM θ M ) exp −s θ M . (110) p(θ M |X, Y ) ≈ exp − 2 Z 2σ i=1 Collecting linear and quadratic terms, we see that    θ M ∼ N(θ MAP , (KM KM )−1 ), where θ MAP = (KM KM )−1 (KM y + σ 2 s). (111)

The equation for θ MAP follows from the conditions on the optimal solution of the quadratic programming problem (109), or directly from maximizing (106) (after s is fixed). Hence predictions at a new point x are approximately normally distributed, with      −1 2  , (112) K θ , σ + k K k y(x) = N k M M M M M M where kM := (k(x1 , x), . . . , k(xM , x)) and only xi with nonzero θMAP,i are considered (thus M ≤ m). The additional σ 2 stems from the fact that we have additive Gaussian noise of variance σ 2 in addition to the Laplacian process. Equation (112) is still expensive to compute, but it is much cheaper to invert  ΣM ΣM than a dense square matrix Σ (since θ MAP may be very sparse). In addition, greedy approximation methods (as described for instance in Section 4.4) or column generation techniques [1] could be used to render the computation of (112) numerically more efficient. 5.5

An Equivalent Gaussian Process

We conclude this section with a proof that in the large sample size limit, there exists a Gaussian Process for each kernel expansion with a prior on the coefficients θi . For the purpose of the proof, we have to slightly modify the normalization condition on f : That is, we assume 1

yi (θ) = m− 2

m 

θi k(xi , x),

(113)

i=1

where θi ∼ Z1 exp(−γ(θ)). For large sample size, i.e., m → ∞, the following theorem holds.

106

A.J. Smola and B. Sch¨ olkopf

Theorem 5 (Convergence to Gaussian Process). Denote by θi independent random variables (we do not require identical distributions on θi ) with unit variance and zero mean. Furthermore, assume that there exists a distribution p(x) on X with according to which a sample {x1 , . . . , xm } is drawn, and that k(x, x ) is bounded on X × X. Then the random variable y(x) given by (113) converges for m → ∞ to a Gaussian process with zero mean and covariance function   ˜ k(x, x ) = k(x, x ¯)k(x , x ¯)p(¯ x)d¯ x. (114) X

This means that instead of a Laplacian process prior, we could use any other factorizing prior on the expansion coefficients θi and in the limit still obtain an equivalent stochastic process. Proof. To prove  the first part, we need only check is that y(x) and any linear combination j y(xj ) (for arbitrary xj ∈ X) converge to a normal distribution. By application of a theorem of Cram´er [8], this is sufficient to prove that y(x) is distributed according to a Gaussian Process. The random variable y(x) is a sum of m independent random variables with bounded variance (since k(x, x ) is bounded on X × X). Therefore in the limit m → ∞, by virtue of the Central Limit Theorem (e.g., [8]), we have y(x) ∼ N(0, σ 2 (x)) for some σ 2 (x) ∈ R. For arbitrary xj ∈ X, linear combinations of y(xj ) also have Gaussian distributions since n 

1

βi yi = m− 2

j=1

m  i=1

θi

n 

βj k(xi , xj ),

(115)

j=1

which allows n since the nthe application of the Central Limit Theorem to the sum inner sum j=1 βj k(xi , xj ) is bounded for any xi . It also implies j=1 βj yj ∼ N(0, σ 2 ) for m → ∞ and some σ 2 ∈ R, which proves that y(x) is distributed according to a Gaussian Process. To show (114), first note that y(x) has zero mean. Thus the covariance function for finite m can be found as expectation with respect to θ,   m m  1 1      θi θj k(xi , x)k(xj , x ) = k(xi , x)k(xj , x ), Eθ [y(x)y(x )] = Eθ m i,j=1 m i=1 (116) since the θi are independent and have zero mean. This expression, however, converges to the Riemann integral over X with the density p(x) as m → ∞. Thus   k(x, x ¯)k(x , x ¯)p(¯ x)d¯ x, (117) E[y(x)y(x )] −→ m→∞

which completes the proof.

X

Bayesian Kernel Methods

6

107

Relevance Vector Machines and Deconvolution

One of the problems with the probabilistic model introduced in (97) is that it may lead to somewhat intractable optimization problems, in particular if the negative log posterior γ(θ) is not convex. However, such functions γ may correspond to useful statistical assumptions on the distribution of coefficients in function expansions. 6.1

Turning Priors into Hyperpriors

Recently, Tipping [73] proposed a method to circumvent the numerical problems inherent in using a certain set of γ(θ) via deconvolution. Before presenting the specific choices [73] makes for the sake of sparse function expansions, we present the general principle, since it can be extended to priors on coefficients and likelihood terms alike, using ideas from [20]. The method works as follows: Assume we have a prior    si 1 2 exp − si θi , p(θi |si ) = (118) 2π 2 that is si plays the role of a hyperprior and θi ∼ N(0, s−1 i ). Here, given si , we can use well studied methods for inference with a Gaussian prior to perform the required estimation steps. Quite often we will be able to perform exact inference, given the hyperparameters si . Key is the suitable choice of p(si ), since clearly      si 1 exp − si θi2 p(si )dsi . (119) p(θi ) = p(θi |si )p(si )dsi = 2π 2 This means that given some p(θi ) we may be able to find some p(si ) such that (119) holds. A hyperprior p(si ) with a large weight at si = 0 is desirable since this leads to a of the distribution of θi around 0. A parameter transformation in the integral β = 2s12 yields i    1 1 (120) p(θi ) = exp(−βθi )(8π)− 2 β −1 p (2β)− 2 ) dβ.   1 1 That is, p(θi ) is the Laplace Transform of (8π)− 2 β −1 p (2β)− 2 ) and correspondingly p(si ) is given by the inverse Laplace Transform of p(θi ) and suitable variable changes. This fact allows us to match up priors p(θi ) and corresponding hyperpriors p(si ) in a fairly automatic fashion. In particular, we assume a normal distribution over θi with adjustable variance. The latter is then determined with a hyperparameter that has its most likely value at 0; This prior is expressed analytically as Normal Hyperprior: We begin with a normal distribution on p(si ), that is   1 1 2 p(si ) = . (121) exp − s 2πω 2 2ω 2 i

108

A.J. Smola and B. Sch¨ olkopf

Performing a Laplace transform leads to 1 BesselK0 p(θi ) = 2πθi



8 , θi

(122)

where BesselK is the modified Bessel function of the second kind [79]. See Figure 11 for more properties of this function.

4

10

Density p(x)

x 10

− log p(x) 15 10

8

5 6 0 4 −5 2 0

−10 0

2

4

6

8

10

−15

0

2

4

6

8

10

Fig. 11. p(θi ) for a normal hyperprior. Left: p(θi ); Right: − log p(θi ). Note the sharp peak at θi = 0 and the almost linear increase afterwards.

Γ -Hyperprior: Tipping [73] used the Gamma hyperprior to design the Relevance Vector Machine. Here p(si ) = Γ (si |a, b) :=

ba exp(−si b) sa−1 i for si > 0. Γ (a)

(123)

For non-informative (flat in logspace) priors, one typically chooses a = b = 10−4 . This leads to a polynomial prior which is given by −(a+1/2)     θ2 θ2 = b+ i . p(θi ) ∝ exp − (a + 1/2) ln b + i 2 2

(124)

Note that (123) is heavily peaked for si → 0. For regression, a similar assumption is made concerning the amount of additive Gaussian noise σ 2 ; thus p(¯ σ 2 ) = 1/¯ σ 2 or p(¯ σ 2 ) = Γ (¯ σ 2 |c, d) where typically c = d = 10−4 . Note that the priors are imposed on the inverse of σ and σ ¯ . Figure 12 depicts the scaling behavior for non-informative priors. Further choices are possible and can be obtained by consulting tables of Legendre transformations, such as in [23].

Bayesian Kernel Methods Density p(x)

109

− log p(x)

35

2

30

1

25

0

20 −1 15 −2

10

−3

5 0

0

2

4

6

8

10

−4

0

2

4

6

8

10

Fig. 12. p(θi ) for a Gamma hyperprior (a = b = 10−4 ). Left: p(θi ); Right: − log p(θi ). Note that here the distribution is much less peaked around 0 and observe the sublinear increase in the negative log density.

6.2

Further Expansions

A similar approach can be used to transform arbitrary p(θi ) into Laplacian distributions and a suitable hyperprior. Again, the connection is made via the Laplace transform, however this time by using p(si ) directly rather than by reparameterizing with the inverse argument, as done in (120):  1 p(θi ) = exp(−βθi ) p(β)dβ. (125) 2β This means that the Laplace transform of s1i p(si ) yields the effective prior and vice versa. While a Laplacian distribution may not be quite as desirable as a normal distribution (it will typically lead to the solution of a linear program rather than a simple matrix inversion), it is likely to be nonetheless favorable to a direct attempt at computing the mode of the distribution. Finally, it is worth noting that we can employ the same methods that we used to obtain a normal distribution in θi can be used to transform p(yi |ti ) into normal distributions combined with a hyperprior:    σ 1 √ i exp − σi (yi − ti )2 p(σi )dσi . (126) p(yi |ti ) = 2 2π The ensuing considerations are completely analogous to the ones in the previous chapter. We will come back to (126) when dealing with general factorizing estimation problems. 6.3

Regression with Hyperparameters

As before, we begin with the simple case of regression with additive Gaussian noise, that is Y |t ∼ N(0, σ 2 ). We know that θ ∼ N(0, S −1 ) where

110

A.J. Smola and B. Sch¨ olkopf

S := diag(s1 , . . . , sm ). Therefore t = Kθ satisfies t ∼ N(0, K  S −1 K) and finally Y ∼ N(0, K  S −1 K + σ 2 1). Consequently we obtain   m ˜ − 12 exp − 1 Y  Σ ˜ −1 Y , (127) p(Y |s, σ) = (2π)− 2 |Σ| 2 ˜ = K  S −1 K + σ 2 1. If we wish to determine p(Y  |Y, s, σ), we need an where Σ estimate of si . Assuming that all si = 0, the use of the matrix inversion formula plus the application of the Sherman-Morrison-Woodbury formula yields Y  |Y, s, σ ∼ N(µ , Σ  ).

(128)

µ = (K  ) S −1 K(σ 2 1 + K  S −1 K)−1 Y and Σ  = σ 2 1 + (K  ) (S + σ −2 KK  )−1 K  .

(130)

where (129)

This follows directly from Example 1. For K  = K, i.e., estimation on the training set, the above equations reduce to µ = K  S −1 K(σ 2 1 + K  S −1 K)−1 Y = σ −2 K(σ −2 K  K + S)−1 K  Y.

(131)

Since elimination of $s, \sigma^2$ by integration is impossible, we resort, as many times before, to the approximation of maximizing the joint density
$$p(Y, s, \sigma) = p(Y|s, \sigma) \prod_{i=1}^{m} p(s_i)\, p(\sigma). \qquad (132)$$
Assuming a gamma prior not only on $s_i$ but also on $\sigma$, the negative logarithm of (132) can be written as
$$-\ln p(Y, s, \sigma) = \frac{1}{2}\left[\ln\left|\sigma^2\mathbf{1} + K S^{-1} K^\top\right| + Y^\top \left(\sigma^2\mathbf{1} + K S^{-1} K^\top\right)^{-1} Y\right] - \sum_{i=1}^{m} (a \ln s_i - b s_i) - c \ln \sigma^2 + d \sigma^2 + \mathrm{const.} \qquad (133)$$
Of course, if we set $a = b = c = d = 0$ (flat prior) the prior terms in (133) vanish and we are left with maximizing the parameters over an improper prior. Note the similarity to logarithmic barrier methods in constrained optimization, for which constrained minimization problems are transformed into unconstrained problems by adding logarithms of the constraints to the initial objective function. In other words, the Gamma distribution can be viewed as a positivity constraint on the hyperparameters $s_i$ and $\sigma^2$. Differentiating (133) and setting the corresponding terms to 0 leads to the update rules [73]
$$s_i = \frac{1 - s_i \Xi_{ii}}{\mu_i^2}, \qquad (134)$$


where $\mu$ is given by (131) and $\Xi := (\sigma^{-2} K^\top K + S)^{-1}$. The quantity $1 - s_i \Xi_{ii}$ is a measure of the degree to which the corresponding parameter $\theta_i$ is "determined" by the data [40]. Likewise we obtain
$$\sigma^2 = \frac{\|y - K\mu\|^2}{\sum_{i=1}^{m} s_i \Xi_{ii}}. \qquad (135)$$
It turns out that many of the parameters $s_i$ tend to infinity during the optimization process. This means that the corresponding distribution of $\theta_i$ is strongly peaked around 0, and we may drop these variables from the optimization process. This speeds up the process as the minimization progresses.
It seems wasteful to first consider the full set of possible functions $k(x_i, x)$, and only then weed out the functions not needed for prediction. We could instead use a greedy method for building up predictors, similar to the greedy strategy employed in Gaussian Processes (Section 4.4). This is the approach in [73], which proposes the following algorithm. After initializing the predictor with a single basis function (the bias, for example), we test whether each new basis function yields an improvement. The latter is achieved by guessing a large initial value of $s_i$ and performing one update step. If (134) leads to an increase of $s_i$, we reject the corresponding basis function; otherwise we retain it in the optimization process.
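The interplay of the updates (134) and (135) with the pruning of coefficients whose $s_i$ diverge can be summarized in a short loop. The sketch below is a minimal illustration, not the algorithm of [73]: the pruning threshold, iteration count and initializations are arbitrary choices, and `mu` denotes the posterior mean of the coefficients $\theta$ (so that `K @ mu` reproduces the training-set predictive mean of (131)).

```python
import numpy as np

def hyperparameter_updates(K, y, sigma2=1.0, n_iter=50, prune_at=1e6):
    """Alternate the updates (134) and (135); prune basis functions whose s_i diverge."""
    m = K.shape[1]
    s = np.ones(m)
    active = np.arange(m)
    for _ in range(n_iter):
        Ka = K[:, active]
        Xi = np.linalg.inv(Ka.T @ Ka / sigma2 + np.diag(s[active]))   # (sigma^{-2} K^T K + S)^{-1}
        mu = Xi @ Ka.T @ y / sigma2                                   # posterior mean of the coefficients
        sigma2 = np.sum((y - Ka @ mu) ** 2) / np.sum(s[active] * np.diag(Xi))   # noise update (135)
        s[active] = (1.0 - s[active] * np.diag(Xi)) / (mu ** 2 + 1e-12)         # update rule (134)
        active = active[s[active] < prune_at]     # drop variables whose distribution peaks at 0
    return s, sigma2, active
```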

6.4 Classification

For classification, we follow a scheme similar to that in Section 3.5. In order to keep matters simple, we only consider the binary classification case. Specifically, we carry out logistic regression by using (14) as a model for the distribution of labels $y_i \in \{\pm 1\}$. As in regression, we use a kernel expansion, this time for the latent variables $t = K\theta$. The negative log density in $Y, \theta$ conditioned on $s, X$ is given by
$$-\ln p(Y, \theta|s, X) = \sum_{i=1}^{m} -\ln p(y_i|t_i(\theta)) - \sum_{i=1}^{m} \ln p(\theta_i|s_i) + \mathrm{const.}, \quad \text{where } t = K\theta. \qquad (136)$$
Unlike in regression, however, we cannot minimize (136) explicitly and have to resort to approximate methods, such as the Laplace approximation (see Section 4.1). Computing the first and second derivatives of (136) and using the definitions (70) and (71) yields
$$\partial_\theta\left[-\ln p(\theta|y, s)\right] = K^\top c + S\theta, \qquad (137)$$
$$\partial_\theta^2\left[-\ln p(\theta|y, s)\right] = K^\top C K + S. \qquad (138)$$
This allows us to obtain a MAP estimate of $p(\theta|y, s)$ by iterative application of (69), and we obtain an update rule for $\theta$ in a manner analogous to (72):
$$\theta^{\mathrm{new}} = \theta^{\mathrm{old}} - (K^\top C K + S)^{-1}(K^\top c + S\theta^{\mathrm{old}}) = (K^\top C K + S)^{-1} K^\top (C K \theta^{\mathrm{old}} - c). \qquad (139)$$


If the iteration scheme converges, it will converge to the minimum of the negative log posterior. We next have to provide an iterative method for updating the hyperparameters $s$ (note that we do not need $\sigma^2$). Since we cannot integrate out $\theta$ explicitly (we had to resort to an iterative method even to obtain the mode of the distribution), it is best to use the Gaussian approximation obtained from (138). This gives an approximation of the value of the posterior distribution $p(s|y)$ and allows us to apply the update rules developed for regression to the classification setting. Setting $\mu = \theta_{\mathrm{MAP}}$ and $\Xi = (K^\top C K + S)^{-1}$, we can use (134) to optimize $s_i$. See [73] for further details and motivation.
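A compact sketch of the Newton step (139), followed by the reuse of (134) with $\Xi = (K^\top C K + S)^{-1}$, is given below. We assume the binary logistic model referred to in the text, $p(y|t) = 1/(1 + e^{-yt})$ for $y \in \{\pm 1\}$; the explicit formulas for $c$ and $C$ and the function name are our own illustrative choices.

```python
import numpy as np

def laplace_map_step(K, y, theta, s):
    """One Newton step (139) for the MAP estimate of theta, followed by the
    hyperparameter update (134) based on the Laplace approximation (138)."""
    t = K @ theta
    pi = 1.0 / (1.0 + np.exp(-t))          # Pr(y = +1 | t) under the logistic model
    c = -y / (1.0 + np.exp(y * t))         # first derivative of -ln p(y|t) w.r.t. t
    C = np.diag(pi * (1.0 - pi))           # second derivative of -ln p(y|t) w.r.t. t
    S = np.diag(s)
    H = K.T @ C @ K + S                    # Hessian, cf. (138)
    theta_new = theta - np.linalg.solve(H, K.T @ c + S @ theta)   # Newton step (139)
    Xi = np.linalg.inv(H)                  # Gaussian approximation of the posterior
    s_new = (1.0 - s * np.diag(Xi)) / (theta_new ** 2 + 1e-12)    # reuse of (134) with mu = theta_MAP
    return theta_new, s_new
```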

7 Summary

In this paper, we presented an overview over some of the more common techniques of Bayesian estimation, namely Gaussian Processes and the Relevance Vector Machine, and a novel method: Laplacian Processes. Due to the wealth of existing concepts and algorithms developed in Bayesian statistics, it is impossible to give a comprehensive treatment in a single paper. Instead, such a goal would merit writing a book in its own right. We refer the reader to [46, 26, 63, 40, 77, 81] and references therein for further detail. Topics We Left Out: We did not discuss Markov-Chain Monte Carlo methods and their application to Bayesian Estimation [45, 54] as an alternate way of performing Bayesian inference. These work by sampling from the posterior distribution rather than computing an approximation of the mode. On the model side, the maximum entropy discrimination paradigm [64, 7, 30] is a worthy concept in its own right, powerful enough to spawn a whole family of new inference algorithms both with [30] and without [34] kernels. The main idea is to seek the least informative estimate for prediction purposes. In addition, rather than requiring that a specific function satisfy certain constraints, we require only that the distribution satisfy the constraints on average. Methods such as the Bayes-Point machine [27] and Kernel Billiard [59, 58] can also be used for estimation purposes. The idea behind these methods is to “play billiard” in version space and average over the existing trajectories. The version space is the set of all separating hyperplanes for which the empirical risk vanishes or is bounded by some previously chosen constant. Proponents of this strategy claim rapid convergence due to the good mixing properties of the dynamical system. Finally, we left the field of graphical models (see for instance [71, 32, 36, 35] and the references therein) completely untouched. These algorithms model the dependency structure between different random variables in a rather explicit fashion and use efficient approximate inference techniques to solve the optimization problems. It is not clear yet, how to combine graphical models with kernels. Key Issues: Topics covered in this paper include deterministic and approximate methods for Bayesian inference, with an emphasis on the Maximum a Posteriori


(MAP) estimate and the treatment of hyperparameters. As a side-effect, one can observe that the minimization of regularized risk is closely related to approximate Bayesian estimation. One of the first consequences of this link is the connection between Gaussian Processes and Support Vector Machines. While the former are defined in terms of correlations between random variables, the latter are derived from smoothness assumptions regarding the estimate and feature space considerations. This connection also allows one to exchange uniform convergence statements and Bayesian error bounds between both types of reasoning. As a side effect, this connection also gives rise to a new class of prior, namely those corresponding to 1 regularization and linear programming machines. Since the coefficients θi then follow a Laplacian distribution, we name the corresponding stochastic process a Laplacian Process. This new point of view allows the derivation of error bars for the estimates in a way that is not easily possible in a statistical learning theory framework. It turns out that this leads to a data dependent prior on the function space. Finally, the Relevance Vector Machine introduces individual hyperparameters for the distributions of the coefficients θi . This makes certain optimization problems tractable (matrix inversion) that otherwise would have remained infeasible (MAP estimate with the Student-t distribution as a prior). We expect that the technique of representing complex distributions by a normal distribution cum hyperprior is also a promising approach for other estimation problems. Taking a more abstract view, we expect a convergence between different estimation algorithms and inference principles derived from risk minimization, Bayesian estimation, and Minimum Description Length concepts. Laplacian Processes and the Relevance Vector Machine are two examples of such convergence. We hope that more such methods will follow in the next few years.

References 1. K. P. Bennett, A. Demiriz, and J. Shawe-Taylor. A column generation algorithm for boosting. In P. Langley, editor, Proceedings of the International Conference on Machine Learning, San Francisco, 2000. Morgan Kaufmann Publishers. 2. K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992. 3. C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995. 4. C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In Proceedings of 16th Conference on Uncertainty in Artificial Intelligence UAI’2000, pages 46–53, 2000. 5. P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Proceedings of the International Conference on Machine Learning, pages 82–90, San Francisco, California, 1998. Morgan Kaufmann Publishers. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/9803.ps.Z. 6. S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Siam Journal of Scientific Computing, 20(1):33–61, 1999.


7. T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, New York, 1991. 8. H. Cram´er. Mathematical Methods of Statistics. Princeton University Press, 1946. 9. L. Csat´ o and M. Opper. Sparse representation for Gaussian process models. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 444–450. MIT Press, 2001. 10. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B, 39(1):1–22, 1977. 11. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195:216–222, 1995. 12. S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representation. Technical report, IBM Watson Research Center, New York, 2000. 13. R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, New York, 1989. 14. G. Fung and O. L. Mangasarian. Data selection for support vector machine classifiers. In Proceedings of KDD’2000, 2000. also: Data Mining Institute Technical Report 00-02, University of Wisconsin, Madison. 15. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, London, 1995. 16. M. Gibbs and D. J. C. Mackay. Variational Gaussian process classifiers. Technical report, Cavendish Laboratory, Cambridge, UK, 1998. 17. M. N. Gibbs. Bayesian Gaussian Methods for Regression and Classification. PhD thesis, University of Cambridge, 1997. 18. Mark Gibbs and David J. C. Mackay. Efficient implementation of Gaussian processes. Technical report, Cavendish Laboratory, Cambridge, UK, 1997. available at http://wol.ra.phy.cam.ac.uk/mng10/GP/. 19. P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981. 20. F. Girosi. Models of noise and robust estimates. A.I. Memo 1287, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1991. 21. D. Goldfarb and K. Scheinberg. A product-form cholesky factorization method for handling dense columns in interior point methods for linear programming. Technical report, IBM Watson Research Center, Yorktown Heights, 2001. 22. G. H. Golub and C. F. Van Loan. Matrix Computations. John Hopkins University Press, Baltimore, MD, 3rd edition, 1996. 23. I. S. Gradshteyn and I. M. Ryzhik. Table of integrals, series, and products. Academic Press, New York, 1981. 24. T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 438–444, Cambridge, MA, 1999. MIT Press. 25. D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSCCRL-99-10, Computer Science Department, UC Santa Cruz, 1999. 26. R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2002. 27. Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines: Estimating the Bayes point in kernel space. In Proceedings of IJCAI Workshop Support Vector Machines, pages 23–27, 1999. 28. R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.


29. P. J. Huber. Robust statistics: a review. Annals of Statistics, 43:1041, 1972. 30. T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. Technical Report AITR-1668, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1999. 31. T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 487–493, Cambridge, MA, 1999. MIT Press. 32. T. S. Jaakkola and M. I. Jordan. Computing upper and lower bounds on likelihoods in untractable networks. In Proceedings of the 12th Conference on Uncertainty in AI. Morgan Kaufmann Publishers, 1996. 33. W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, volume 1, pages 361–380, Berkeley, 1960. University of California Press. 34. T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty In Artificial Intelligence, 2000. 35. M. I. Jordan and C. M. Bishop. An Introduction to Probabilistic Graphical Models. MIT Press, 2002. 36. M. I. Jordan, Z. Gharamani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Learning in Graphical Models, volume M.I. Jordan, pages 105–162. Kluwer Academic, 1998. 37. M. S. Lewicki and T. J. Sejnowski. Learning nonlinear overcomplete representations for efficient coding. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 556–562, Cambridge, MA, 1998. MIT Press. 38. D. G. Luenberger. Introduction to Linear and Nonlinear Programming. AddisonWesley, Reading, MA, 1973. 39. H. L¨ utkepohl. Handbook of Matrices. John Wiley and Sons, Chichester, 1996. 40. D. J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA, 1991. 41. D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5):720–736, 1992. 42. S. Mallat and Z. Zhang. Matching Pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41:3397–3415, 1993. 43. O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965. 44. B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal of Computing, 25(2):227–234, 1995. 45. R. Neal. Priors for infinite networks. Technical Report CRG-TR-94-1, Dept. of Computer Science, University of Toronto, 1994. 46. R. Neal. Bayesian Learning in Neural Networks. Springer, 1996. 47. Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Dept. of Computer Science, University of Toronto, 1993. CRGTR-93-1. 48. B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996. 49. M. Opper and O. Winther. Mean field methods for classification with Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 309–315, Cambridge, MA, 1999. MIT Press.


50. M. Opper and O. Winther. Gaussian processes and SVM: mean field and leaveone-out. In A. J. Smola, P. L. Bartlett, B. Sch¨ olkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 311–326, Cambridge, MA, 2000. MIT Press. 51. J. Platt. Probabilities for SV machines. In A. J. Smola, P. L. Bartlett, B. Sch¨ olkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–73, Cambridge, MA, 2000. MIT Press. 52. T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201– 209, 1975. 53. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge University Press, Cambridge, 1992. ISBN 0-521-43108-5. 54. C. Rasmussen. Evaluation of Gaussian Processes and Other Methods for NonLinear Regression. PhD thesis, Department of Computer Science, University of Toronto, 1996. ftp://ftp.cs.toronto.edu/pub/carl/thesis.ps.gz. 55. G. R¨ atsch, S. Mika, and A.J. Smola. Adapting codes und embeddings for polychotomies. In Neural Information Processing Systems, volume 15. MIT Press, 2002. to appear. 56. B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996. 57. R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1970. 58. P. Ruj´ an and M. Marchand. Computing the Bayes kernel classifier. In A. J. Smola, P. L. Bartlett, B. Sch¨ olkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 329–347, Cambridge, MA, 2000. MIT Press. 59. P´ al Ruj´ an. Playing billiards in version space. Neural Computation, 9:99–122, 1997. 60. B. Sch¨ olkopf, S. Mika, C. Burges, P. Knirsch, K.-R. M¨ uller, G. R¨ atsch, and A. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999. 61. B. Sch¨ olkopf, A. Smola, and K.-R. M¨ uller. Kernel principal component analysis. In B. Sch¨ olkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods—Support Vector Learning, pages 327–352. MIT Press, Cambridge, MA, 1999. 62. B. Sch¨ olkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. 63. M. Seeger. Bayesian methods for support vector machines and Gaussian processes. Master’s thesis, University of Edinburgh, Division of Informatics, 1999. 64. J. Skilling. Maximum Entropy and Bayesian Methods. Cambridge University Press, 1988. 65. A. Smola, B. Sch¨ olkopf, and G. R¨ atsch. Linear programs for automatic accuracy control in regression. In Ninth International Conference on Artificial Neural Networks, Conference Publications No. 470, pages 575–580, London, 1999. IEE. 66. A. J. Smola. Learning with Kernels. PhD thesis, Technische Universit¨ at Berlin, 1998. GMD Research Series No. 25. 67. A. J. Smola and P. L. Bartlett. Sparse greedy Gaussian process regression. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619–625. MIT Press, 2001. 68. A. J. Smola and B. Sch¨ olkopf. Sparse greedy matrix approximation for machine learning. In P. Langley, editor, Proceedings of the International Conference on Machine Learning, pages 911–918, San Francisco, 2000. Morgan Kaufmann Publishers.


69. A.J. Smola and S.V.N. Vishwanathan. Cholesky factorization for rank-k modifications of diagonal matrices. SIAM Journal of Matrix Analysis, 2002. submitted. 70. C. Soon-Ong, A.J. Smola, and R.C. Williamson. Superkernels. In Neural Information Processing Systems, volume 15. MIT Press, 2002. to appear. 71. D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579–605, 1990. 72. J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, New York, second edition, 1993. 73. M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001. 74. V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000. 75. V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995. 76. V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281–287, Cambridge, MA, 1997. MIT Press. 77. G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990. 78. C. Watkins. Dynamic alignment kernels. CSD-TR-98- 11, Royal Holloway, University of London, Egham, Surrey, UK, 1999. 79. G. N. Watson. A Treatise on the Theory of Bessel Functions. Cambridge University Press, Cambridge, UK, 2 edition, 1958. 80. C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer Academic, 1998. 81. C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Micheal Jordan, editor, Learning and Inference in Graphical Models, pages 599–621. MIT Press, 1999. 82. C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press. 83. Christoper K. I. Williams and Matthias Seeger. Using the Nystrom method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA, 2001. MIT Press. 84. Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI, 20(12):1342–1351, 1998. 85. T. Zhang. Some sparse approximation bounds for regression problems. In Proc. 18th International Conf. on Machine Learning, pages 624–631. Morgan Kaufmann, San Francisco, CA, 2001.

An Introduction to Boosting and Leveraging

Ron Meir¹ and Gunnar Rätsch²

¹ Department of Electrical Engineering, Technion, Haifa 32000, Israel
[email protected], http://www-ee.technion.ac.il/~rmeir
² Research School of Information Sciences & Engineering, The Australian National University, Canberra, ACT 0200, Australia
[email protected], http://mlg.anu.edu.au/~raetsch

Abstract. We provide an introduction to theoretical and practical aspects of Boosting and Ensemble learning, providing a useful reference for researchers in the field of Boosting as well as for those seeking to enter this fascinating area of research. We begin with a short background concerning the necessary learning theoretical foundations of weak learners and their linear combinations. We then point out the useful connection between Boosting and the Theory of Optimization, which facilitates the understanding of Boosting and later on enables us to move on to new Boosting algorithms, applicable to a broad spectrum of problems. In order to increase the relevance of the paper to practitioners, we have added remarks, pseudo code, “tricks of the trade”, and algorithmic considerations where appropriate. Finally, we illustrate the usefulness of Boosting algorithms by giving an overview of some existing applications. The main ideas are illustrated on the problem of binary classification, although several extensions are discussed.

1 A Brief History of Boosting

The underlying idea of Boosting is to combine simple "rules" to form an ensemble such that the performance of the single ensemble member is improved, i.e. "boosted". Let $h_1, h_2, \ldots, h_T$ be a set of hypotheses, and consider the composite ensemble hypothesis
$$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x). \qquad (1)$$
Here $\alpha_t$ denotes the coefficient with which the ensemble member $h_t$ is combined; both $\alpha_t$ and the learner or hypothesis $h_t$ are to be learned within the Boosting procedure. The idea of Boosting has its roots in PAC learning (cf. [188]). Kearns and Valiant [101] proved the astonishing fact that learners, each performing only slightly better than random, can be combined to form an arbitrarily good ensemble hypothesis (when enough data is available). Schapire [163] was the first


to provide a provably polynomial time Boosting algorithm, while [55] were the first to apply the Boosting idea to a real-world OCR task, relying on neural networks as base learners. The AdaBoost (Adaptive Boosting) algorithm by [67, 68, 70] (cf. Algorithm 2.1) is generally considered as a first step towards more practical Boosting algorithms. Very similar to AdaBoost is the Arcing algorithm, for which convergence to a linear programming solution can be shown (cf. [27]). Although the Boosting scheme seemed intuitive in the context of algorithmic design, a step forward in transparency was taken by explaining Boosting in terms of a stage-wise gradient descent procedure in an exponential cost function (cf. [27, 63, 74, 127, 153]). A further interesting step towards practical applicability is worth mentioning: large parts of the early Boosting literature persistently contained the misconception that Boosting would not overfit even when running for a large number of iterations. Simulations by [81, 153] on data sets with higher noise content could clearly show overfitting effects, which can only be avoided by regularizing Boosting so as to limit the complexity of the function class (cf. Section 6). When trying to develop means for achieving robust Boosting it is important to elucidate the relations between Optimization Theory and Boosting procedures (e.g. [71, 27, 48, 156, 149]). Developing this interesting relationship opened the field to new types of Boosting algorithms. Among other options it now became possible to rigorously define Boosting algorithms for regression (cf. [58, 149]), multi-class problems [3, 155], unsupervised learning (cf. [32, 150]) and to establish convergence proofs for Boosting algorithms by using results from the Theory of Optimization. Further extensions to Boosting algorithms can be found in [32, 150, 74, 168, 169, 183, 3, 57, 129, 53]. Recently, Boosting strategies have been quite successfully used in various real-world applications. For instance [176] and earlier [55] and [112] used boosted ensembles of neural networks for OCR. [139] proposed a non-intrusive monitoring system for assessing whether certain household appliances consume power or not. In [49] Boosting was used for tumor classification with gene expression data. For further applications and more details we refer to the Applications Section 8.3 and to http://www.boosting.org/applications. Although we attempt a full and balanced treatment of most available literature, naturally the presentation leans in parts towards the authors’ own work. In fact, the Boosting literature, as will become clear from the bibliography, is so extensive, that a full treatment would require a book length treatise. The present review differs from other reviews, such as the ones of [72, 165, 166], mainly in the choice of the presented material: we place more emphasis on robust algorithms for Boosting, on connections to other margin based approaches (such as support vector machines), and on connections to the Theory of Optimization. Finally, we discuss applications and extensions of Boosting. The content of this review is organized as follows. After presenting some basic ideas on Boosting and ensemble methods in the next section, a brief overview of the underlying theory is given in Section 3. The notion of the margin and its


connection to successful Boosting is analyzed in Section 4, and the important link between Boosting and Optimization Theory is discussed in Section 5. Subsequently, in Section 6, we present several approaches to making Boosting more robust, and proceed in Section 7 by presenting several extensions of the basic Boosting framework. Section 8.3 then presents a short summary of applications, and Section 9 summarizes the review and presents some open problems.

2 An Introduction to Boosting and Ensemble Methods

This section includes a brief definition of the general learning setting addressed in this paper, followed by a discussion of different aspects of ensemble learning.

2.1 Learning from Data and the PAC Property

We focus in this review (except in Section 7) on the problem of binary classification [50]. The task of binary classification is to find a rule (hypothesis) which, based on a set of observations, assigns an object to one of two classes. We represent objects as belonging to an input space $\mathcal{X}$, and denote the possible outputs of a hypothesis by $\mathcal{Y}$. One possible formalization of this task is to estimate a function $f: \mathcal{X} \to \mathcal{Y}$, using input-output training data pairs generated independently at random from an unknown probability distribution $P(x, y)$,
$$(x_1, y_1), \ldots, (x_N, y_N) \in \mathbb{R}^d \times \{-1, +1\},$$
such that $f$ will correctly predict unseen examples $(x, y)$. In the case where $\mathcal{Y} = \{-1, +1\}$ we have a so-called hard classifier, and the label assigned to an input $x$ is given by $f(x)$. Often one takes $\mathcal{Y} = \mathbb{R}$, termed a soft classifier, in which case the label assignment is performed according to $\mathrm{sign}(f(x))$. The true performance of a classifier $f$ is assessed by
$$L(f) = \int \lambda(f(x), y)\, dP(x, y), \qquad (2)$$
where $\lambda$ denotes a suitably chosen loss function. The risk $L(f)$ is often termed the generalization error, as it measures the expected loss with respect to examples which were not observed in the training set. For binary classification one often uses the so-called 0/1-loss $\lambda(f(x), y) = \mathrm{I}(yf(x) \leq 0)$, where $\mathrm{I}(E) = 1$ if the event $E$ occurs and zero otherwise. Other loss functions are often introduced depending on the specific context.
Unfortunately the risk cannot be minimized directly, since the underlying probability distribution $P(x, y)$ is unknown. Therefore, we have to try to estimate a function that is close to the optimal one based on the available information, i.e. the training sample and properties of the function class $\mathcal{F}$ from which the solution $f$ is chosen. To this end, we need what is called an induction principle. A particularly simple one consists of approximating the minimum of the risk (2) by the minimum of the empirical risk
$$\hat{L}(f) = \frac{1}{N} \sum_{n=1}^{N} \lambda(f(x_n), y_n). \qquad (3)$$

From the law of large numbers (e.g. [61]) one expects that $\hat{L}(f) \to L(f)$ as $N \to \infty$. However, in order to guarantee that the function obtained by minimizing $\hat{L}(f)$ also attains asymptotically the minimum of $L(f)$, a stronger condition is required. Intuitively, the complexity of the class of functions $\mathcal{F}$ needs to be controlled, since otherwise a function $f$ with arbitrarily low error on the sample may be found, which, however, leads to very poor generalization. A sufficient condition for preventing this phenomenon is the requirement that $\hat{L}(f)$ converge uniformly (over $\mathcal{F}$) to $L(f)$; see [50, 191, 4] for further details.
While it is possible to provide conditions for the learning machine which ensure that asymptotically (as $N \to \infty$) the empirical risk minimizer will perform optimally, for small sample sizes large deviations are possible and overfitting might occur. Then a small generalization error cannot be obtained by simply minimizing the training error (3). As mentioned above, one way to avoid the overfitting dilemma – called regularization – is to restrict the size of the function class $\mathcal{F}$ from which one chooses the function $f$ [190, 4]. The intuition, which will be formalized in Section 3, is that a "simple" function that explains most of the data is preferable to a complex one which fits the data very well (Occam's razor, e.g. [21]).
For Boosting algorithms which generate a complex composite hypothesis as in (1), one may be tempted to think that the complexity of the resulting function class would increase dramatically when using an ensemble of many learners. However, this is provably not the case under certain conditions (see e.g. Theorem 3), as the ensemble complexity saturates (cf. Section 3). This insight may lead us to think that since the complexity of Boosting is limited we might not encounter the effect of overfitting at all. However, when using Boosting procedures on noisy real-world data, it turns out that regularization (e.g. [103, 186, 143, 43]) is mandatory if overfitting is to be avoided (cf. Section 6). This is in line with the general experience in statistical learning procedures when using complex non-linear models, for instance neural networks, where a regularization term is used to appropriately limit the complexity of the models (e.g. [2, 159, 143, 133, 135, 187, 191, 36]).
Before proceeding to discuss ensemble methods we briefly review the strong and weak PAC models for learning binary concepts [188]. Let $S$ be a sample consisting of $N$ data points $\{(x_n, y_n)\}_{n=1}^{N}$, where the $x_n$ are generated independently at random from some distribution $P(x)$, $y_n = f(x_n)$, and $f$ belongs to some known class $\mathcal{F}$ of binary functions. A strong PAC learning algorithm has the property that for every distribution $P$, every $f \in \mathcal{F}$ and every $0 \leq \varepsilon, \delta \leq 1/2$, with probability larger than $1 - \delta$ the algorithm outputs a hypothesis $h$ such that $\Pr[h(x) \neq f(x)] \leq \varepsilon$. The running time of the algorithm should be polynomial in $1/\varepsilon$, $1/\delta$, $n$, $d$, where $d$ is the dimension (appropriately defined) of the input space. A weak PAC learning algorithm is defined analogously, except that it is only required to satisfy the conditions for particular $\varepsilon$ and $\delta$, rather than all pairs. Various extensions and generalizations of the basic PAC concept can be found in [87, 102, 4].


2.2 Ensemble Learning, Boosting, and Leveraging

Consider a combination of hypotheses in the form of (1). Clearly there are many approaches for selecting both the coefficients $\alpha_t$ and the base hypotheses $h_t$. In the so-called Bagging approach [25], the hypotheses $\{h_t\}_{t=1}^{T}$ are chosen based on a set of $T$ bootstrap samples, and the coefficients $\alpha_t$ are set to $\alpha_t = 1/T$ (see e.g. [142] for more refined choices). Although this algorithm seems somewhat simplistic, it has the nice property that it tends to reduce the variance of the overall estimate $f(x)$, as discussed for regression [27, 79] and classification [26, 75, 51]. Thus, Bagging is quite often found to improve the performance of complex (unstable) classifiers, such as neural networks or decision trees [25, 51].
For Boosting the combination of the hypotheses is chosen in a more sophisticated manner. The evolution of the AdaBoost algorithm can be understood from Figure 1. The intuitive idea is that examples that are misclassified get higher weights in the next iteration; for instance, the examples near the decision boundary are usually harder to classify and therefore get high weights after a few iterations. This idea of iterative reweighting of the training sample is essential to Boosting. More formally, a non-negative weighting $d^{(t)} = (d_1^{(t)}, \ldots, d_N^{(t)})$ is assigned to the data at step $t$, and a weak learner $h_t$ is constructed based on $d^{(t)}$. This weighting is updated at each iteration $t$ according to the weighted error incurred by the weak learner in the last iteration (see Algorithm 2.1). At each step $t$, the weak learner is required to produce a small weighted empirical error defined by


Fig. 1. Illustration of AdaBoost on a 2D toy data set: The color indicates the label and the diameter is proportional to the weight of the examples in the first, second, third, 5th, 10th and 100th iteration. The dashed lines show the decision boundaries of the single classifiers (up to the 5th iteration). The solid line shows the decision line of the combined classifier. In the last two plots the decision line of Bagging is plotted for a comparison. (Figure taken from [153].)


Algorithm 2.1 The AdaBoost algorithm [70].
1. Input: $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, number of iterations $T$
2. Initialize: $d_n^{(1)} = 1/N$ for all $n = 1, \ldots, N$
3. Do for $t = 1, \ldots, T$:
   a) Train classifier with respect to the weighted sample set $\{S, d^{(t)}\}$ and obtain hypothesis $h_t: x \mapsto \{-1, +1\}$, i.e. $h_t = L(S, d^{(t)})$
   b) Calculate the weighted training error $\varepsilon_t$ of $h_t$:
      $$\varepsilon_t = \sum_{n=1}^{N} d_n^{(t)} \mathrm{I}(y_n \neq h_t(x_n)),$$
   c) Set:
      $$\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$$
   d) Update weights:
      $$d_n^{(t+1)} = d_n^{(t)} \exp\{-\alpha_t y_n h_t(x_n)\}/Z_t,$$
      where $Z_t$ is a normalization constant such that $\sum_{n=1}^{N} d_n^{(t+1)} = 1$.
4. Break if $\varepsilon_t = 0$ or $\varepsilon_t \geq \frac{1}{2}$ and set $T = t - 1$.
5. Output: $f_T(x) = \sum_{t=1}^{T} \frac{\alpha_t}{\sum_{r=1}^{T} \alpha_r}\, h_t(x)$

$$\varepsilon_t(h_t, d^{(t)}) = \sum_{n=1}^{N} d_n^{(t)} \mathrm{I}(y_n \neq h_t(x_n)). \qquad (4)$$
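As a concrete illustration of Algorithm 2.1 and the weighted error (4), the following sketch implements AdaBoost with axis-parallel decision stumps as the weak learner L. It is a minimal illustration; the stump search and the handling of the degenerate case are our own choices and not part of the algorithm statement above.

```python
import numpy as np

def train_stump(X, y, d):
    """Weak learner: axis-parallel decision stump minimizing the weighted error (4)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (+1, -1):
                pred = np.where(X[:, j] > thr, sgn, -sgn)
                err = np.sum(d * (pred != y))
                if err < best[0]:
                    best = (err, j, thr, sgn)
    _, j, thr, sgn = best
    return lambda Z: np.where(Z[:, j] > thr, sgn, -sgn)

def adaboost(X, y, T=50):
    """AdaBoost as in Algorithm 2.1; labels y must lie in {-1, +1}."""
    N = X.shape[0]
    d = np.full(N, 1.0 / N)                      # step 2: uniform initial weights
    hyps, alphas = [], []
    for _ in range(T):
        h = train_stump(X, y, d)                 # step 3a
        eps = np.sum(d * (h(X) != y))            # step 3b: weighted training error
        if eps == 0 or eps >= 0.5:               # step 4: stopping condition
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # step 3c
        d = d * np.exp(-alpha * y * h(X)); d /= d.sum()   # step 3d
        hyps.append(h); alphas.append(alpha)
    a = np.array(alphas) / (np.sum(alphas) or 1.0)        # step 5: normalized coefficients
    return lambda Z: np.sign(sum(w * h(Z) for w, h in zip(a, hyps)))
```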

After selecting the hypothesis $h_t$, its weight $\alpha_t$ is computed such that it minimizes a certain loss function (cf. step (3c)). In AdaBoost one minimizes
$$G^{AB}(\alpha) = \sum_{n=1}^{N} \exp\{-y_n(\alpha h_t(x_n) + f_{t-1}(x_n))\}, \qquad (5)$$
where $f_{t-1}$ is the combined hypothesis of the previous iteration, given by
$$f_{t-1}(x_n) = \sum_{r=1}^{t-1} \alpha_r h_r(x_n). \qquad (6)$$

Another very effective approach, proposed in [74], is the LogitBoost algorithm, where the cost function analogous to (5) is given by
$$G^{LR}(\alpha) = \sum_{n=1}^{N} \log\{1 + \exp(-y_n(\alpha h_t(x_n) + f_{t-1}(x_n)))\}. \qquad (7)$$


An important feature of (7) as compared to (5) is that the former increases only linearly for negative values of yn (αht (xn ) + ft−1 (xn )), while the latter increases exponentially. It turns out that this difference is important in noisy situations, where AdaBoost tends to focus too much on outliers and noisy data. In Section 6 we will discuss other techniques to approach this problem. Some further details concerning LogitBoost will be given in Section 5.2. For AdaBoost it has been shown [168] that αt in (5) can be computed analytically leading to the expression in step (3c) of Algorithm 2.1. Based on the new combined hypothesis, the weighting d of the sample is updated as in step (3d) (1) of Algorithm 2.1. The initial weighting d(1) is chosen uniformly: dn = 1/N . The so-called Arcing algorithms proposed in [27] are similar to AdaBoost, but use a different loss function.1 Each loss function leads to different weighting schemes of the examples and hypothesis coefficients. One well-known example is Arc-X4 where one uses a fourth order polynomial to compute the weight of the examples. Arcing is the predecessor of the more general leveraging algorithms discussed in Section 5.2. Moreover, for one variant, called Arc-GV, it has been shown [27] to find the linear combination that solves a linear optimization problem (LP) (cf. Section 4). The AdaBoost algorithm presented above is based on using binary (hard) weak learners. In [168] and much subsequent work, real-valued (soft) weak learners were used. One may also find the hypothesis weight αt and the hypothesis ht in parallel, such that (5) is minimized. Many other variants of Boosting algorithms have emerged over the past few years, many of which operate very similarly to AdaBoost, even though they often do not possess the PAC-boosting property. The PAC-boosting property refers to schemes which are able to guarantee that weak learning algorithms are indeed transformed into strong learning algorithms in the sense described in Section 2.1. Duffy and Helmbold [59] reserve the term ‘boosting’ for algorithms for which the PAC-boosting property can be proved to hold, while using ‘leveraging’ in all other cases. Since we are not overly concerned with PAC learning per se in this review, we use the terms ‘boosting’ and ‘leveraging’ interchangeably. In principle, any leveraging approach iteratively selects one hypothesis ht ∈ H at a time and then updates their weights; this can be implemented in different ways. Ideally, given {h1 , . . . , ht }, one solves the optimization problem for all hypothesis coefficients {α1 , . . . , αt }, as proposed by [81, 104, 48]. In contrast to this, the greedy approach is used by the original AdaBoost/Arc-GV algorithm as discussed above: only the weight of the last hypothesis is selected, while minimizing some appropriate cost function [27, 74, 127, 153]. Later we will show in Sections 4 and 5 that this relates to barrier optimization techniques [156] and coordinate descent methods [197, 151]. Additionally, we will briefly discuss relations to information geometry [24, 34, 110, 104, 39, 147, 111] and column generation techniques (cf. Section 6.2). An important issue in the context of the algorithms discussed in this review pertains to the construction of the weak learner (e.g. step (3a) in Algorithm 2.1). 1

Any convex loss function is allowed that goes to infinity when the margin goes to minus infinity and to zero when the margin goes to infinity.
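Returning to the comparison of the exponential cost (5) and the logistic cost (7), the short sketch below evaluates both losses, and the (unnormalized) example weights they induce, as a function of the margin $y_n f(x_n)$. The function names are ours; the sketch merely illustrates that the exponential loss weights badly misclassified points exponentially strongly, whereas the logistic weights saturate.

```python
import numpy as np

def exp_loss(margin):
    """Per-example AdaBoost cost, cf. (5)."""
    return np.exp(-margin)

def logistic_loss(margin):
    """Per-example LogitBoost cost, cf. (7)."""
    return np.log1p(np.exp(-margin))

margins = np.array([-5.0, -2.0, 0.0, 2.0, 5.0])   # y_n * f(x_n)
w_exp = np.exp(-margins)                          # |d/dm exp(-m)|: explodes for outliers
w_log = 1.0 / (1.0 + np.exp(margins))             # |d/dm log(1 + e^{-m})|: bounded by 1
print(np.c_[margins, exp_loss(margins), logistic_loss(margins), w_exp, w_log])
```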


At step $t$, the weak learner is constructed based on the weighting $d^{(t)}$. There are basically two approaches to taking the weighting into consideration. In the first approach, we assume that the learning algorithm can operate with reweighted examples. For example, if the weak learner is based on minimizing a cost function (see Section 5), one can construct a revised cost function which assigns a weight to each of the examples, so that heavily weighted examples are more influential. For example, for the least squares error one may attempt to minimize $\sum_{n=1}^{N} d_n^{(t)} (y_n - h(x_n))^2$. However, not all weak learners are easily adaptable to the inclusion of weights. An approach proposed in [68] is based on resampling the data with replacement based on the distribution $d^{(t)}$. The latter approach is more general as it is applicable to any weak learner; however, the former approach has been more widely used in practice. Friedman [73] has also considered sampling based approaches within the general framework described in Section 5. He found that in certain situations (small samples and powerful weak learners) it is advantageous to sample a subset of the data rather than the full data set itself, where different subsets are sampled at each iteration of the algorithm. Overall, however, it is not yet clear whether there is a distinct advantage to using either one of the two approaches.

3 Learning Theoretical Foundations of Boosting

3.1 The Existence of Weak Learners

We have seen that AdaBoost operates by forming a re-weighting $d$ of the data at each step, and constructing a base learner based on $d$. In order to gauge the performance of a base learner, we first define a baseline learner.

Definition 1. Let $d = (d_1, \ldots, d_N)$ be a probability weighting of the data points $S$. Let $S_+$ be the subset of the positively labeled points, and similarly for $S_-$. Set $D_+ = \sum_{n: y_n = +1} d_n$ and similarly for $D_-$. The baseline classifier $f_{BL}$ is defined as $f_{BL}(x) = \mathrm{sign}(D_+ - D_-)$ for all $x$.

In other words, the baseline classifier predicts $+1$ if $D_+ \geq D_-$ and $-1$ otherwise. It is immediately obvious that for any weighting $d$, the error of the baseline classifier is at most 1/2. The notion of a weak learner has been introduced in the context of PAC learning at the end of Section 2.1. However, this definition is too limited for most applications. In the context of this review, we define a weak learner as follows. A learner is a weak learner for sample $S$ if, given any weighting $d$ on $S$, it is able to achieve a weighted classification error (see (4)) which is strictly smaller than 1/2.
A key ingredient of Boosting algorithms is a weak learner which is required to exist in order for the overall algorithm to perform well. In the context of binary classification, we demand that the weighted empirical error of each weak learner is strictly smaller than $\frac{1}{2} - \frac{1}{2}\gamma$, where $\gamma$ is an edge parameter quantifying the deviation of the performance of the weak learner from the baseline classifier introduced in Definition 1. Consider a weak learner that outputs a binary classifier $h$ based on a data set $S = \{(x_n, y_n)\}_{n=1}^{N}$, each pair $(x_n, y_n)$ of which is weighted by a non-negative weight $d_n$. We then demand that

$$\varepsilon(h, d) = \sum_{n=1}^{N} d_n \mathrm{I}[y_n \neq h(x_n)] \leq \frac{1}{2} - \frac{1}{2}\gamma, \qquad (\gamma > 0). \qquad (8)$$

For simple weak learners, it may not be possible to find a strictly positive value of $\gamma$ for which (8) holds without making any assumptions about the data. For example, consider the two-dimensional xor problem with $N = 4$, $x_1 = (-1, -1)$, $x_2 = (+1, +1)$, $x_3 = (-1, +1)$, $x_4 = (+1, -1)$, and corresponding labels $\{-1, -1, +1, +1\}$. If the weak learner $h$ is restricted to be an axis-parallel half-space, it is clear that no such $h$ can achieve an error smaller than 1/2 for a uniform weighting over the examples. We consider two situations where it is possible to establish the strict positivity of $\gamma$, and to set a lower bound on its value.
Consider a mapping $f$ from the binary cube $\mathcal{X} = \{-1, +1\}^d$ to $\mathcal{Y} = \{-1, 1\}$, and assume the true labels $y_n$ are given by $y_n = f(x_n)$. We wish to approximate $f$ by combinations of binary hypotheses $h_t$ belonging to $\mathcal{H}$. Intuitively we expect that a large edge $\gamma$ can be achieved if we can find a weak hypothesis $h$ which correlates well with $f$ (cf. Section 4.1). Let $\mathcal{H}$ be a class of binary hypotheses (Boolean functions), and let $D$ be a distribution over $\mathcal{X}$. The correlation between $f$ and $\mathcal{H}$, with respect to $D$, is given by $C_{\mathcal{H},D}(f) = \sup_{h \in \mathcal{H}} E_D\{f(x)h(x)\}$. The distribution-free correlation between $f$ and $\mathcal{H}$ is given by $C_{\mathcal{H}}(f) = \inf_D C_{\mathcal{H},D}(f)$. [64] shows that if $T > 2\log(2)\, d\, C_{\mathcal{H}}(f)^{-2}$ then $f$ can be represented exactly as $f(x) = \mathrm{sign}\left(\sum_{t=1}^{T} h_t(x)\right)$. In other words, if $\mathcal{H}$ is highly correlated with the target function $f$, then $f$ can be exactly represented as a convex combination of a small number of functions from $\mathcal{H}$. Hence, after a sufficiently large number of Boosting iterations, the empirical error can be expected to approach zero. Interestingly, this result is related to the Min-Max theorem presented in Section 4.1.


Fig. 2. A single convex set containing the positively labeled examples separated from the negative examples by a gap of η.





The results of [64] address only the case of Boolean functions and a known target function $f$. An important question that arises relates to the establishment of geometric conditions for which the existence of a weak learner can be guaranteed. Consider the case where the input patterns $x$ belong to $\mathbb{R}^d$, and let $\mathcal{H}$, the class of weak learners, consist of linear classifiers of the form $\mathrm{sign}(w^\top x + b)$. It is not hard to show [122] that for any distribution $d$ over a training set of


distinct points, $\varepsilon(h, d) \leq \frac{1}{2} - c/N$, where $c$ is an absolute constant. In other words, for any $d$ one can find a linear classifier that achieves an edge of at least $c/N$. However, as will become clear in Section 3.4, such an advantage does not suffice to guarantee good generalization. This is not surprising, since the claim holds even for arbitrarily labeled points, for which no generalization can be expected. In order to obtain a larger edge, some assumptions need to be made about the data. Intuitively, we expect that situations where the positively and negatively labeled points are well separated are conducive to achieving a large edge by linear classifiers. Consider, for example, the case where the positively labeled points are enclosed within a compact convex region $K \subset \mathbb{R}^d$, while the remaining points are located outside of this region, such that the distance between the oppositely labeled points is at least $\eta$, for some $\eta > 0$ (see Figure 2). It is not hard to show that in this case [122] $\gamma \geq \gamma_0 > 0$, namely the edge is strictly larger than zero, independently of the sample size ($\gamma_0$ is related to the number of faces of the smallest convex polytope that covers only the positively labeled points; see also the discussion in Section 4.2). In general situations, the data may be strongly overlapping. In order to deal with such cases, much more advanced tools need to be wielded, based on the theory of Geometric Discrepancy [128]. The technical details of this development are beyond the scope of this paper. The interested reader is referred to [121, 122] for further details. In general, there has not been much work on establishing geometric conditions for the existence of weak learners for other types of classifiers.

3.2 Convergence of the Training Error to Zero

As we have seen in the previous section, it is possible under appropriate conditions to guarantee that the weighted empirical error of a weak learner is smaller than $\frac{1}{2} - \frac{1}{2}\gamma$, $\gamma > 0$. We now show that this condition suffices to ensure that the empirical error of the composite hypothesis converges to zero as the number of iterations increases. In fact, anticipating the generalization bounds in Section 3.4, we present a somewhat stronger result. We establish the claim for the AdaBoost algorithm; similar claims can be proven for other Boosting algorithms (cf. Section 4.1 and [59]). Keeping in mind that we may use a real-valued function $f$ for classification, we often want to take advantage of the actual value of $f$, even though classification is performed using $\mathrm{sign}(f)$. The actual value of $f$ contains information about the confidence with which $\mathrm{sign}(f)$ is predicted (e.g. [4]). For binary classification ($y \in \{-1, +1\}$) and $f \in \mathbb{R}$ we define the margin of $f$ at example $(x_n, y_n)$ as $\rho_n(f) = y_n f(x_n)$. Consider the following function, defined for $0 \leq \theta \leq \frac{1}{2}$:
$$\varphi_\theta(z) = \begin{cases} 1 & \text{if } z \leq 0, \\ 1 - z/\theta & \text{if } 0 < z \leq \theta, \\ 0 & \text{otherwise}. \end{cases} \qquad (9)$$


Let $f$ be a real-valued function taking values in $[-1, +1]$. The empirical margin error is defined as
$$\hat{L}^\theta(f) = \frac{1}{N} \sum_{n=1}^{N} \varphi_\theta(y_n f(x_n)). \qquad (10)$$
It is obvious from the definition that the classification error, namely the fraction of misclassified examples, is given by $\theta = 0$, i.e. $\hat{L}(f) = \hat{L}^0(f)$. In addition, $\hat{L}^\theta(f)$ is monotonically increasing in $\theta$. We note that one often uses the so-called 0/1-margin error defined by
$$\tilde{L}^\theta(f) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{I}(y_n f(x_n) \leq \theta).$$
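The margin cost (9) and the two empirical margin errors just defined are straightforward to evaluate; the sketch below does so for a toy example. Function names and data are our own illustrative choices.

```python
import numpy as np

def phi(z, theta):
    """Piecewise-linear margin cost (9): 1 for z <= 0, linear on (0, theta], 0 beyond."""
    return np.clip(1.0 - z / theta, 0.0, 1.0)

def empirical_margin_error(f_values, y, theta):
    """Empirical margin error (10), computed from the margins y_n * f(x_n)."""
    return np.mean(phi(y * f_values, theta))

def zero_one_margin_error(f_values, y, theta):
    """0/1-margin error: fraction of examples with margin at most theta."""
    return np.mean(y * f_values <= theta)

# toy check: the 0/1-margin error always dominates (10)
y = np.array([1, -1, 1, 1, -1]); f_values = np.array([0.9, 0.2, -0.1, 0.05, -0.6])
assert empirical_margin_error(f_values, y, 0.3) <= zero_one_margin_error(f_values, y, 0.3)
```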

Noting that $\varphi_\theta(yf(x)) \leq \mathrm{I}(yf(x) \leq \theta)$, it follows that $\hat{L}^\theta(f) \leq \tilde{L}^\theta(f)$. Since we use $\hat{L}^\theta(f)$ as part of an upper bound on the generalization error in Section 3.4, the bound we obtain using it is tighter than would be obtained using $\tilde{L}^\theta(f)$.

Theorem 1 ([167]). Consider AdaBoost as described in Algorithm 2.1. Assume that at each round $t$, the weighted empirical error satisfies $\varepsilon(h_t, d^{(t)}) \leq \frac{1}{2} - \frac{1}{2}\gamma_t$. Then the empirical margin error of the composite hypothesis $f_T$ obeys
$$\hat{L}^\theta(f_T) \leq \prod_{t=1}^{T} (1 - \gamma_t)^{\frac{1-\theta}{2}} (1 + \gamma_t)^{\frac{1+\theta}{2}}. \qquad (11)$$

Proof. We present a proof from [167] for the case where $h_t \in \{-1, +1\}$. We begin by showing that for every $\{\alpha_t\}$
$$\tilde{L}^\theta(f_T) \leq \exp\left(\theta \sum_{t=1}^{T} \alpha_t\right) \prod_{t=1}^{T} Z_t. \qquad (12)$$
By definition
$$Z_t = \sum_{n=1}^{N} d_n^{(t)} e^{-y_n \alpha_t h_t(x_n)} = \sum_{n: y_n = h_t(x_n)} d_n^{(t)} e^{-\alpha_t} + \sum_{n: y_n \neq h_t(x_n)} d_n^{(t)} e^{\alpha_t} = (1 - \varepsilon_t) e^{-\alpha_t} + \varepsilon_t e^{\alpha_t}.$$
From the definition of $f_T$ it follows that
$$y f_T(x) \leq \theta \quad \Longrightarrow \quad \exp\left(-y \sum_{t=1}^{T} \alpha_t h_t(x) + \theta \sum_{t=1}^{T} \alpha_t\right) \geq 1,$$
which we rewrite as
$$\mathrm{I}[y f_T(x) \leq \theta] \leq \exp\left(-y \sum_{t=1}^{T} \alpha_t h_t(x) + \theta \sum_{t=1}^{T} \alpha_t\right). \qquad (13)$$
Note that
$$d_n^{(T+1)} = \frac{d_n^{(T)} \exp(-\alpha_T y_n h_T(x_n))}{Z_T} = \frac{\exp\left(-\sum_{t=1}^{T} \alpha_t y_n h_t(x_n)\right)}{N \prod_{t=1}^{T} Z_t} \quad \text{(by induction)}. \qquad (14)$$
Using (13) and (14) we find that
$$\begin{aligned}
\tilde{L}^\theta(f) &= \frac{1}{N} \sum_{n=1}^{N} \mathrm{I}[y_n f_T(x_n) \leq \theta] \\
&\leq \frac{1}{N} \sum_{n=1}^{N} \exp\left(-y_n \sum_{t=1}^{T} \alpha_t h_t(x_n) + \theta \sum_{t=1}^{T} \alpha_t\right) \\
&= \exp\left(\theta \sum_{t=1}^{T} \alpha_t\right) \frac{1}{N} \sum_{n=1}^{N} \exp\left(-y_n \sum_{t=1}^{T} \alpha_t h_t(x_n)\right) \\
&= \exp\left(\theta \sum_{t=1}^{T} \alpha_t\right) \left(\prod_{t=1}^{T} Z_t\right) \underbrace{\sum_{n=1}^{N} d_n^{(T+1)}}_{=1} \\
&= \exp\left(\theta \sum_{t=1}^{T} \alpha_t\right) \prod_{t=1}^{T} Z_t.
\end{aligned}$$
Next, set $\alpha_t = (1/2)\log((1 - \varepsilon_t)/\varepsilon_t)$ as in Algorithm 2.1, which easily implies that $Z_t = 2\sqrt{\varepsilon_t(1 - \varepsilon_t)}$. Substituting this result into (12) we find that
$$\tilde{L}^\theta(f) \leq \prod_{t=1}^{T} \sqrt{4\, \varepsilon_t^{1-\theta} (1 - \varepsilon_t)^{1+\theta}},$$
which yields the desired result upon recalling that $\varepsilon_t = 1/2 - \gamma_t/2$, and noting that $\hat{L}^\theta(f) \leq \tilde{L}^\theta(f)$.

ˆ T ) ≤ e− Tt=1 γt2 /2 , L(f T from which we infer that the condition t=1 γt2 → ∞ suffices to guarantee that √ ˆ T ) → 0. For example, the condition γt ≥ c/ t suffices for this.2 Clearly L(f 2

² When one uses binary-valued hypotheses, then $\gamma_t \geq c/t$ is already sufficient to achieve a margin of at least zero on all training examples (cf. [146], Lemma 2.2, 2.3).


this holds if $\gamma_t \geq \gamma_0 > 0$ for some positive constant $\gamma_0$. In fact, in this case $\hat{L}^\theta(f_T) \to 0$ for any $\theta \leq \gamma_0/2$.
In general, however, one may not be interested in the convergence of the empirical error to zero, due to overfitting (see discussion in Section 6). Sufficient conditions for convergence of the error to a nonzero value can be found in [121].

3.3 Generalization Error Bounds

In this section we consider binary classifiers only, namely $f: \mathcal{X} \to \{-1, +1\}$. A learning algorithm can be viewed as a procedure for mapping any data set $S = \{(x_n, y_n)\}_{n=1}^{N}$ to some hypothesis $h$ belonging to a hypothesis class $\mathcal{H}$ consisting of functions from $\mathcal{X}$ to $\{-1, +1\}$. In principle, one is interested in quantifying the performance of the hypothesis $\hat{f}$ on future data. Since the data $S$ consists of randomly drawn pairs $(x_n, y_n)$, both the data set and the generated hypothesis $\hat{f}$ are random variables. Let $\lambda(y, f(x))$ be a loss function, measuring the loss incurred by using the hypothesis $f$ to classify input $x$, the true label of which is $y$. As in Section 2.1 the expected loss incurred by a hypothesis $f$ is given by $L(f) = E\{\lambda(y, f(x))\}$, where the expectation is taken with respect to the unknown probability distribution generating the pairs $(x_n, y_n)$. In the sequel we will consider the case of binary classification and 0/1-loss $\lambda(y, f(x)) = \mathrm{I}[y \neq f(x)]$, where $\mathrm{I}[E] = 1$ if the event $E$ occurs and zero otherwise. In this case it is easy to see that $L(f) = \Pr[y \neq f(x)]$, namely the probability of misclassification.
A classic result by Vapnik and Chervonenkis relates the empirical classification error of a binary hypothesis $f$ to the probability of error $\Pr[y \neq f(x)]$. For binary functions $f$ we use the notation $P(y \neq f(x))$ for the probability of error and $\hat{P}(y \neq f(x))$ for the empirical classification error. Before presenting the result, we need to define the VC-dimension of a class of binary hypotheses $\mathcal{F}$. Consider a set of $N$ points $X = (x_1, \ldots, x_N)$. Each binary function $f$ defines a dichotomy $(f(x_1), \ldots, f(x_N)) \in \{-1, +1\}^N$ on these points. Allowing $f$ to run over all elements of $\mathcal{F}$, we generate a subset of the binary $N$-dimensional cube $\{-1, +1\}^N$, denoted by $\mathcal{F}_X$, i.e. $\mathcal{F}_X = \{(f(x_1), \ldots, f(x_N)) : f \in \mathcal{F}\}$. The VC-dimension of $\mathcal{F}$, $\mathrm{VCdim}(\mathcal{F})$, is defined as the maximal cardinality of the set of points $X = (x_1, \ldots, x_N)$ for which $|\mathcal{F}_X| = 2^N$. Good estimates of the VC dimension are available for many classes of hypotheses. For example, in the $d$-dimensional space $\mathbb{R}^d$ one finds for hyperplanes a VC dimension of $d + 1$, while for rectangles the VC dimension is $2d$. Many more results and bounds on the VC dimension of various classes can be found in [50] and [4]. We present an improved version of the classic VC bound, taken from [11].

Theorem 2 ([192]). Let $\mathcal{F}$ be a class of $\{-1, +1\}$-valued functions defined over a set $\mathcal{X}$. Let $P$ be a probability distribution on $\mathcal{X} \times \{-1, +1\}$, and suppose that $N$-samples $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ are generated independently at random according to $P$. Then, there is an absolute constant $c$, such that for any integer $N$, with probability at least $1 - \delta$ over samples of length $N$, every $f \in \mathcal{F}$ satisfies
$$P(y \neq f(x)) \leq \hat{P}(y \neq f(x)) + c\sqrt{\frac{\mathrm{VCdim}(\mathcal{F}) + \log(1/\delta)}{N}}. \qquad (15)$$

We comment that the original bound in [192] contained an extra factor of $\log N$ multiplying $\mathrm{VCdim}(\mathcal{F})$ in (15). The finite-sample bound presented in Theorem 2 has proved very useful in theoretical studies. It should also be stressed that the bound is distribution free, namely it holds independently of the underlying probability measure $P$. Moreover, it can be shown ([50], Theorem 14.5) to be optimal in rate in a precise minimax sense. However, in practical applications it is often far too conservative to yield useful results.

3.4 Margin Based Generalization Bounds

In spite of the claimed optimality of the above bound, there is good reason to investigate bounds which improve upon it. While such an endeavor may seem futile in light of the purported optimality of (15), we observe that the bound is optimal only in a distribution-free setting, where no restrictions are placed on the probability distribution P . In fact, one may want to take advantage of certain regularity properties of the data in order to improve the bound. However, in order to retain the appealing distribution-free feature of the bound (15), we do not want to impose any a-priori conditions on P . Rather, the idea is to construct a so-called luckiness function based on the data [179], which yields improved bounds in situations where the structure of the data happens to be ‘simple’. In order to make more effective use of a classifier, we now allow the class of classifiers F to be real-valued. In view of the discussion in Section 3.1, a ˆ θ (f ) defined good candidate for a luckiness function is the margin-based loss L in (10). The probability of error is given by L(f ) = P (y = sign(f (x))). For real-valued classifiers F, define a new notion of complexity, related to the VC dimension but somewhat more general. Let {σ1 , . . . , σN } be a sequence of {−1, +1}-valued random variables generated independently by setting each σn to {−1, +1} with equal probability. Additionally, let {x1 , . . . , xN } be generated independently at random according to some law P . The Rademacher complexity [189] of the class F is given by   N 1     σn f (xn ) , RN (F) = E sup   f ∈F  N n=1

where the expectation is taken with respect to both {σ_n} and {x_n}. The Rademacher complexity has proven to be essential in the derivation of effective generalization bounds (e.g. [189, 11, 108, 10]). The basic intuition behind the definition of R_N(F) is its interpretation as a measure of correlation between the class F and a random set of labels. For very rich function classes F we expect a large value of R_N(F), while small function classes can only achieve small correlations with random labels. In the special case that the class F consists of binary functions, one can show that R_N(F) = O(√(VCdim(F)/N)). For real-valued functions, one needs to extend the notion of VC dimension to the so-called pseudo-dimension (e.g. [4]), in which case one can show that R_N(F) = O(√(Pdim(F)/N)). It is not hard to show that VCdim(sign(F)) ≤ Pdim(F). The following theorem (cf. [108]) provides a bound on the probability of misclassification using a margin-based loss function.

Theorem 3 ([108]). Let F be a class of real-valued functions from X to [−1, +1], and let θ ∈ [0, 1]. Let P be a probability distribution on X × {−1, +1}, and suppose that the N samples S = {(x_1, y_1), ..., (x_N, y_N)} are generated independently at random according to P. Then, for any integer N, with probability at least 1 − δ over samples of length N, every f ∈ F satisfies

    L(f) ≤ L̂θ(f) + 4 R_N(F) / θ + √( log(2/δ) / (2N) ).        (16)
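As an aside, for a concrete finite (or finitely parameterized) function class the complexity term R_N(F) appearing in (16) can be approximated empirically by Monte Carlo: draw random signs σ_n, compute the best correlation over the class, and average. The sketch below does this for a toy class of one-dimensional threshold functions; the class and the sample are illustrative and not part of the text.

```python
import numpy as np

def empirical_rademacher(preds, n_trials=200, rng=None):
    """Monte Carlo estimate of the (empirical) Rademacher complexity.

    preds : array of shape (num_functions, N); preds[j, n] = f_j(x_n)
    """
    rng = np.random.default_rng(rng)
    num_f, N = preds.shape
    values = []
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=N)      # random labels
        corr = np.abs(preds @ sigma) / N             # |1/N * sum_n sigma_n f(x_n)|
        values.append(corr.max())                    # sup over the class
    return float(np.mean(values))

# Toy class: thresholds f_b(x) = sign(x - b) on N = 100 random points
x = np.random.default_rng(0).uniform(-1, 1, size=100)
thresholds = np.linspace(-1, 1, 50)
preds = np.sign(x[None, :] - thresholds[:, None])
print(empirical_rademacher(preds))
```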

In order to understand the significance of Theorem 3, consider a situation where one can find a function f which achieves a low margin error L̂θ(f) for a large value of the margin parameter θ. In other words, L̂θ(f) ≈ L̂(f) for large values of θ. In this case the second term on the r.h.s. of (16) can be made smaller than the corresponding term in (15) (recall that R_N(F) = O(√(Pdim(F)/N)) for classes with finite pseudo-dimension), using the standard VC bounds. Note that Theorem 3 has implications concerning the size of the margin θ. In particular, assuming F is a class with finite VC dimension, we have that the second term on the r.h.s. of (16) is of the order O(√(VCdim(F)/(Nθ²))). In order that this term decrease to zero it is mandatory that θ ≫ 1/√N. In other words, in order to lead to useful generalization bounds the margin must be sufficiently large with respect to the sample size N. However, the main significance of (16) arises from its application to the specific class of functions arising in Boosting algorithms. Recall that in Boosting one generates a hypothesis which can be written as f_T(x) = Σ_{t=1}^T α_t h_t(x), namely a linear combination with non-negative coefficients of base hypotheses h_1, ..., h_T, each belonging to the class H. Consider the class of functions

    co_T(H) = { f : f(x) = Σ_{t=1}^T α_t h_t(x),  α_t ≥ 0,  Σ_{t=1}^T α_t = 1,  h_t ∈ H },

corresponding to the ℓ1-normalized function output by the Boosting algorithm. The convex hull is given by letting T → ∞ in co_T(H). The key feature in the application of Theorem 3 to Boosting is the observation that for any class of real-valued functions H, R_N(co_T(H)) = R_N(H) for any T (e.g. [11]), namely the Rademacher complexity of co_T(H) is not larger than that of the base class H itself. On the other hand, since the h_t(x) are in general non-linear functions, the linear combination Σ_{t=1}^T α_t h_t(x) represents a potentially highly complex function which can lead to a very low margin error L̂θ(f_T). Combining these observations, we obtain the following result, which improves upon the first bounds of this nature presented in [167].



Corollary 1. Let the conditions of Theorem 3 hold, and set F = co_T(H). Then, for any integer N, with probability at least 1 − δ over samples of length N, every f ∈ co_T(H) satisfies

    L(f) ≤ L̂θ(f) + 4 R_N(H) / θ + √( log(2/δ) / (2N) ).        (17)

Had we used a bound of the form of (15) for f ∈ co_T(H), we would have obtained a bound depending on Pdim(co_T(H)), which often grows linearly in T, leading to inferior results. The important observation here is that the complexity penalty in (17) is independent of T, the number of Boosting iterations. As a final comment we add that a considerable amount of recent work has been devoted to the derivation of so-called data-dependent bounds (e.g. [5, 9, 92]), where the second term on the r.h.s. of (16) is made to depend explicitly on the data. Data-dependent bounds depending explicitly on the weights α_t of the weak learners are given in [130]. In addition, bounds which take into account the full margin distribution are presented in [180, 45, 181]. Such results are particularly useful for the purpose of model selection, but are beyond the scope of this review.

3.5 Consistency

The bounds presented in Theorems 2 and 3 depend explicitly on the data, and are therefore potentially useful for the purpose of model selection. However, an interesting question regarding the statistical consistency of these procedures arises. Consider a generic binary classification problem, characterized by the class conditional distribution function P(y|x), where τ(x) = P(y = 1|x) is assumed to belong to some target class of functions T. It is well known [50] that in this case the optimal classifier is given by the Bayes classifier f_B(x) = sign(τ(x) − 1/2), leading to the minimal error L_B = L(f_B). We say that an algorithm is strongly consistent with respect to T if, based on a sample of size N, it generates an empirical classifier f̂_N for which L(f̂_N) → L_B almost surely as N → ∞, for every τ ∈ T. While consistency may seem to be mainly of theoretical significance, it is reassuring to have the guarantee that a given procedure ultimately performs optimally. However, it turns out that in many cases inconsistent procedures perform better for finite amounts of data than consistent ones. A classic example of this is the so-called James-Stein estimator ([96, 160], Section 2.4.5). In order to establish consistency one needs to assume (or prove in specific cases) that as T → ∞ the class of functions co_T(H) is dense in T. The consistency of Boosting algorithms has recently been established in [116, 123], following related previous work [97]. The work of [123] also includes rates of convergence for specific weak learners and target classes T. We point out that the full proof of consistency must tackle at least three issues. First, it must be shown that the specific algorithm used converges as a function of the number of iterations; this is essentially an issue of optimization. Furthermore, one must show that the function to which the algorithm converges itself converges to the optimal estimator as the sample size increases; this is a statistical issue. Finally, the approximation-theoretic issue of whether the class of weak learners is sufficiently powerful to represent the underlying decision boundary must also be addressed.

4 Boosting and Large Margins

In this section we discuss AdaBoost in the context of large margin algorithms. In particular, we try to shed light on the question of whether, and under what conditions, boosting yields large margins. In [67] it was shown that AdaBoost quickly finds a combined hypothesis that is consistent with the training data. [167] and [27] indicated that AdaBoost computes hypotheses with large margins, if one continues iterating after reaching zero classification error. It is clear that the margin should be as large as possible on most training examples in order to minimize the complexity term in (17). If one assumes that the base learner always achieves a weighted training error ε_t ≤ 1/2 − γ/2 with γ > 0, then AdaBoost generates a hypothesis with margin larger than γ/2 [167, 27]. However, from the Min-Max theorem of linear programming [193] (see Theorem 4 below) one finds that the achievable margin ϱ* is at least γ [69, 27, 72]. We start with a brief review of some standard definitions and results for the margin of an example and of a hyperplane. Then we analyze the asymptotic properties of a slightly more general version, called AdaBoost_ϱ, which is equivalent to AdaBoost for ϱ = 0, while assuming that the problem is separable. We show that there will be a subset of examples – the support patterns [153] – asymptotically having the same smallest margin. All weights d are asymptotically concentrated on these examples. Furthermore, we find that AdaBoost_ϱ is able to achieve larger margins, if ϱ is chosen appropriately. We briefly discuss two algorithms for adaptively choosing ϱ to maximize the margin.

4.1 Weak Learning, Edges, and Margins

The assumption made concerning the base learning algorithm in the PAC-Boosting setting (cf. Section 3.1) is that it returns a hypothesis h from a fixed set H that is slightly better than the baseline classifier introduced in Definition 1. This means that for any distribution, the error rate ε is consistently smaller than 1/2 − γ/2 for some fixed γ > 0. Recall that the error rate ε of a base hypothesis is defined as the weighted fraction of points that are misclassified (cf. (8)). The weighting d = (d_1, ..., d_N) of the examples is such that d_n ≥ 0 and Σ_{n=1}^N d_n = 1. A more convenient quantity for measuring the quality of the hypothesis h is the edge [27], which is also applicable and useful for real-valued hypotheses:

    γ(h, d) = Σ_{n=1}^N d_n y_n h(x_n).        (18)

The edge is an affine transformation of ε(h, d) in the case where h(x) ∈ {−1, +1}: ε(h, d) = 1/2 − γ(h, d)/2 [68, 27, 168]. Recall from Section 3.2 that the margin of a function f on a given example (x_n, y_n) is defined by ρ_n(f) = y_n f(x_n) (cf. (9)).
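The edge (18) and the example margins are straightforward to compute from the weights and predictions; the following short sketch (with made-up toy data) makes the relation ε(h, d) = 1/2 − γ(h, d)/2 concrete for binary-valued hypotheses.

```python
import numpy as np

def edge(d, y, h_x):
    """Edge gamma(h, d) = sum_n d_n * y_n * h(x_n)  (eq. 18)."""
    return float(np.sum(d * y * h_x))

def weighted_error(d, y, h_x):
    """Weighted 0/1 error of a {-1,+1}-valued hypothesis."""
    return float(np.sum(d * (y != np.sign(h_x))))

# Toy data (illustrative only)
y   = np.array([+1, +1, -1, -1, +1])
h_x = np.array([+1, -1, -1, +1, +1])     # predictions of a base hypothesis
d   = np.full(5, 1.0 / 5)                # uniform example weights

gamma = edge(d, y, h_x)
eps   = weighted_error(d, y, h_x)
print(gamma, eps, 0.5 - 0.5 * gamma)     # eps == 1/2 - gamma/2
```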



Assume for simplicity that H is a finite hypothesis class H = {h̃_j | j = 1, ..., J}, and suppose we combine all possible hypotheses from H. Then the following well-known theorem establishes the connection between margins and edges (first noted in connection with Boosting in [68, 27]). Its proof follows directly from the duality of the two optimization problems.

Theorem 4 (Min-Max-Theorem, [193]).

    γ* = min_d max_{h̃ ∈ H} Σ_{n=1}^N d_n y_n h̃(x_n) = max_w min_{1≤n≤N} y_n Σ_{j=1}^J w_j h̃_j(x_n) = ϱ*,        (19)

where d ∈ P_N, w ∈ P_J, and P_k denotes the k-dimensional probability simplex. Thus, the minimal edge γ* that can be achieved over all possible weightings d of the training set is equal to the maximal margin ϱ* of a combined hypothesis from H. We refer to the left hand side of (19) as the edge minimization problem. This problem can be rewritten as a linear program (LP):

    min_{γ, d}  γ
    s.t.  d_n ≥ 0,  Σ_{n=1}^N d_n = 1,  and  Σ_{n=1}^N d_n y_n h̃(x_n) ≤ γ   for all h̃ ∈ H.        (20)

For any non-optimal weightings d and w we always have max_{h̃ ∈ H} γ(h̃, d) ≥ γ* = ϱ* ≥ min_{n=1,...,N} y_n f_w(x_n), where

    f_w(x) = Σ_{j=1}^J w_j h̃_j(x).        (21)

If the base learning algorithm is guaranteed to return a hypothesis with edge at least γ for any weighting, there exists a combined hypothesis with margin at least γ. If γ = γ*, i.e. the lower bound γ is as large as possible, then there exists a combined hypothesis with margin exactly γ = ϱ* (only using hypotheses that are actually returned by the base learner). From this discussion we can derive a sufficient condition on the base learning algorithm to reach the maximal margin: if it returns hypotheses whose edges are at least γ*, there exists a linear combination of these hypotheses that has margin γ* = ϱ*. This explains the termination condition in Algorithm 2.1 (step (4)).

Remark 1. Theorem 4 was stated for finite hypothesis sets H. However, the same result also holds for countable hypothesis classes. For uncountable classes, one can establish the same results under some regularity conditions on H, in particular that the real-valued hypotheses h are uniformly bounded (cf. [93, 149, 157, 147]).
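For a finite hypothesis set and a finite training sample, the edge minimization problem (20) is a small linear program that can be handed to any off-the-shelf LP solver. The sketch below assumes SciPy is available and uses a made-up prediction matrix; by Theorem 4, the optimal value it returns equals the maximal achievable margin ϱ*.

```python
import numpy as np
from scipy.optimize import linprog

def minimal_edge(U):
    """Solve the edge minimization LP (20).

    U : array of shape (N, J) with U[n, j] = y_n * h_j(x_n).
    Returns (gamma_star, d_star); gamma_star equals the maximal margin (Theorem 4).
    """
    N, J = U.shape
    # Variables: [d_1, ..., d_N, gamma]; objective: minimize gamma.
    c = np.zeros(N + 1)
    c[-1] = 1.0
    # For every hypothesis j: sum_n d_n * U[n, j] - gamma <= 0.
    A_ub = np.hstack([U.T, -np.ones((J, 1))])
    b_ub = np.zeros(J)
    # The d_n form a probability distribution.
    A_eq = np.hstack([np.ones((1, N)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * N + [(None, None)]   # d_n >= 0, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]

# Illustrative toy problem: 4 examples, 3 base hypotheses
U = np.array([[+1, +1, -1],
              [+1, -1, +1],
              [-1, +1, +1],
              [+1, +1, +1]], dtype=float)
gamma_star, d_star = minimal_edge(U)
print(gamma_star, d_star)
```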



To avoid confusion, note that the hypotheses indexed as elements of the hypothesis set H are marked by a tilde, i.e. h̃_1, ..., h̃_J, whereas the hypotheses returned by the base learner are denoted by h_1, ..., h_T. The output of AdaBoost and similar algorithms is a sequence of pairs (α_t, h_t) and a combined hypothesis f_t(x) = Σ_{r=1}^t α_r h_r(x). But how do the α's relate to the w's used in Theorem 4? At every step of AdaBoost, one can compute the weight for each hypothesis h̃_j in the following way:³

    w_j^t = Σ_{r=1}^t α_r I(h_r = h̃_j),   j = 1, ..., J.        (22)

It is easy to verify that Σ_{r=1}^t α_r h_r(x) = Σ_{j=1}^J w_j^t h̃_j(x). Also note that if the α's are positive (as in Algorithm 2.1), then ‖w‖₁ = ‖α‖₁ holds.

4.2 Geometric Interpretation of ℓp-Norm Margins

Margins have been used frequently in the context of Support Vector Machines (SVMs) [22, 41, 190] and Boosting. These so-called large margin algorithms focus on generating hyperplanes/functions with large margins on most training examples. Let us therefore study some properties of the maximum margin hyperplane and discuss some consequences of using different norms for measuring the margin (see also Section 6.3). Suppose we are given N examples in some space F: S = {(x_n, y_n)}, n = 1, ..., N, where (x_n, y_n) ∈ F × {−1, 1}. Note that here x denotes an element of the feature space F rather than of the input space X (details below). We are interested in the separation of the training set using a hyperplane A through the origin⁴, A = {x | ⟨x, w⟩ = 0}, in F determined by some vector w, which is assumed to be normalized with respect to some norm. The ℓp-norm margin of an example (x_n, y_n) with respect to the hyperplane A is defined as

    ρ_n^p(w) := y_n ⟨x_n, w⟩ / ‖w‖_p,

where the superscript p ∈ [1, ∞] specifies the norm with respect to which w is normalized (the default is p = 1). A positive margin corresponds to a correct classification. The margin of the hyperplane A is defined as the minimum margin over all N examples, ρ^p(w) = min_n ρ_n^p(w). To maximize the margin of the hyperplane one has to solve the following convex optimization problem [119, 22]:

    max_w ρ^p(w) = max_w min_{1≤n≤N} y_n ⟨x_n, w⟩ / ‖w‖_p.        (23)

³ For simplicity, we have omitted the normalization implicit in Theorem 4.
⁴ This can easily be generalized to general hyperplanes by introducing a bias term.



The form of (23) implies that without loss of generality we may take ‖w‖_p = 1, leading to the following convex optimization problem:

    max_{ρ, w}  ρ
    s.t.  y_n ⟨x_n, w⟩ ≥ ρ,   n = 1, 2, ..., N,
          ‖w‖_p = 1.        (24)

Observe that in the case where p = 1 and w_j ≥ 0, we obtain a Linear Program (LP) (cf. (19)). Moreover, from this formulation it is clear that typically only a few of the constraints in (24) will be active. These constraints correspond to the most difficult examples, called the support patterns in Boosting and support vectors in SVMs.⁵ The following theorem gives a geometric interpretation of (23): using the ℓp-norm to normalize w corresponds to measuring the distance to the hyperplane with the dual ℓq-norm, where 1/p + 1/q = 1.

Theorem 5 ([120]). Let x ∈ F be any point which is not on the plane A := {x̃ | ⟨x̃, w⟩ = 0}. Then for p ∈ [1, ∞]:

    |⟨x, w⟩| / ‖w‖_p = ‖x − A‖_q,        (25)

where ‖x − A‖_q = min_{x̃ ∈ A} ‖x − x̃‖_q denotes the distance of x to the plane A measured with the dual norm ℓq. Thus, the ℓp-margin of x_n is the signed ℓq-distance of the point to the hyperplane. If the point is on the correct side of the hyperplane, the margin is positive (see Figure 3 for an illustration of Theorem 5).


Fig. 3. The maximum margin solution for different norms on a toy example: ℓ∞-norm (solid) and ℓ2-norm (dashed). The margin areas are indicated by the dash-dotted lines. The examples with label +1 and −1 are shown as ‘x’ and ‘◦’, respectively. To maximize the ℓ∞-norm and ℓ2-norm margin, we solved (23) with p = 1 and p = 2, respectively. The ℓ∞-norm often leads to fewer supporting examples and sparse weight vectors.
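To illustrate Theorem 5 numerically, the following sketch (toy vectors, hypothetical values) computes the ℓp-margin of a point for p = 1 and p = 2; by the theorem these are the signed ℓ∞- and ℓ2-distances of the point to the hyperplane through the origin.

```python
import numpy as np

def lp_margin(x, y, w, p):
    """l_p-margin rho_n^p(w) = y * <x, w> / ||w||_p.  By Theorem 5 this equals the
    signed l_q-distance of x to the hyperplane {x : <x, w> = 0}, 1/p + 1/q = 1."""
    return y * np.dot(x, w) / np.linalg.norm(w, ord=p)

# Toy example (hypothetical numbers)
w = np.array([3.0, -1.0, 2.0])
x = np.array([1.0, 0.5, -0.25])
y = +1

print(lp_margin(x, y, w, p=1))   # signed l_inf-distance to the hyperplane
print(lp_margin(x, y, w, p=2))   # signed Euclidean distance to the hyperplane
```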

Let us discuss two special cases: p = 1 and p = 2.

⁵ In Section 4.4 we will discuss the relation of the Lagrange multipliers of these constraints and the weights d on the examples.



Boosting (p = 1). In the case of Boosting, the space F is spanned by the base hypotheses. One considers the set of base hypotheses that could be generated by the base learner – the base hypothesis set – and constructs a mapping Φ from the input space X to the “feature space” F:

    Φ(x) = (h̃_1(x), h̃_2(x), ...)ᵀ : X → F.        (26)

In this case the margin of an example is (cf. (22))

    ρ_n^1(w) = y_n ⟨x_n, w⟩ / ‖w‖_1 = y_n Σ_j w_j h̃_j(x_n) / Σ_{j=1}^J w_j = y_n Σ_t α_t h_t(x_n) / Σ_t α_t,

as used before in Algorithm 2.1, where we used the ℓ1-norm for normalization and without loss of generality assumed that the w's and α's are non-negative (assuming H is closed under negation). By Theorem 5, one therefore maximizes the ℓ∞-norm distance of the mapped examples to the hyperplane in the feature space F. Under mild assumptions, one can show that the maximum margin hyperplane is aligned with most coordinate axes in the feature space, i.e. many w_j's are zero (cf. Figure 3 and Section 6.3).

Support Vector Machines (p = 2). Here, the feature space F is implicitly defined by a Mercer kernel k(x, y) [131], which computes the inner product of two examples in the feature space F [22, 175]. One can show that for every such kernel there exists a mapping Φ : X → F such that k(x, y) = ⟨Φ(x), Φ(y)⟩ for all x, y ∈ X. Additionally, one uses the ℓ2-norm for normalization and, hence, the Euclidean distance to measure distances between the mapped examples and the hyperplane in the feature space F:

    ρ_n^2(w) = y_n ⟨Φ(x_n), w⟩ / ‖w‖_2 = y_n Σ_{i=1}^N β_i k(x_n, x_i) / √( Σ_{i,j=1}^N β_i β_j k(x_i, x_j) ),

where we used the Representer Theorem [103, 172], which shows that the maximum margin solution w can be written as a sum of the mapped examples, i.e. w = Σ_{n=1}^N β_n Φ(x_n).

4.3 AdaBoost and Large Margins

We have seen in Section 3.4 that a large value of the margin is conducive to good generalization, in the sense that if a large margin can be achieved with respect to the data, then an upper bound on the generalization error is small (see also discussion in Section 6). This observation motivates searching for algorithms which maximize the margin.



Convergence properties of AdaBoost_ϱ. We analyze a slightly generalized version of AdaBoost. One introduces a new parameter, ϱ, in step (3c) of Algorithm 2.1 and chooses the weight of the new hypothesis differently:

    α_t = (1/2) log( (1 + γ_t) / (1 − γ_t) ) − (1/2) log( (1 + ϱ) / (1 − ϱ) ).

This algorithm was first introduced in [27] as unnormalized Arcing with exponential function and in [153] as an AdaBoost-type algorithm. Moreover, it is similar to an algorithm proposed in [71] (see also [177]). Here we will call it AdaBoost_ϱ. Note that the original AdaBoost algorithm corresponds to the choice ϱ = 0. Let us for the moment assume that we choose ϱ differently at each iteration, i.e. we consider sequences {ϱ_t}, t = 1, ..., T, which might either be specified before running the algorithm or computed based on results obtained during the running of the algorithm. In the following we address the issue of how well this algorithm, denoted by AdaBoost_{ϱ_t}, is able to increase the margin, and bound the fraction of examples with margin smaller than some value θ. We start with a result generalizing Theorem 1 (cf. Section 3.2) to the case ϱ ≠ 0 and a slightly different loss:

Proposition 1 ([153]). Let γ_t be the edge of h_t at the t-th step of AdaBoost_{ϱ_t}. Assume −1 ≤ ϱ_t ≤ γ_t for t = 1, ..., T. Then for all θ ∈ [−1, 1]

    L̂θ(f_T) ≤ (1/N) Σ_{n=1}^N I(y_n f_T(x_n) ≤ θ) ≤ Π_{t=1}^T [ ((1 − γ_t)/(1 − ϱ_t))^{(1−θ)/2} ((1 + γ_t)/(1 + ϱ_t))^{(1+θ)/2} ].        (27)

The algorithm makes progress if each of the factors on the right hand side of (27) is smaller than one. Suppose we would like to reach a margin θ on all training examples, where we obviously need to assume θ ≤ ϱ* (here ϱ* is defined in (19)). The question arises as to which sequence {ϱ_t}, t = 1, ..., T, one should use in order to find such a combined hypothesis in as few iterations as possible (according to (27)). One can show that the right hand side of (27) is minimized for ϱ_t = θ and, hence, one should always use this choice, independent of how the base learner performs. Using Proposition 1, we can determine an upper bound on the number of iterations needed by AdaBoost_ϱ for achieving a margin of ϱ on all examples, given that the maximum margin is ϱ* (cf. [167, 157]):

Corollary 2. Assume the base learner always achieves an edge γ_t ≥ ϱ*. If 0 ≤ ϱ ≤ ϱ* − ν, ν > 0, then AdaBoost_ϱ will converge to a solution with margin of at least ϱ on all examples in at most ⌈ 2 log(N)(1 − ϱ²)/ν² ⌉ + 1 steps.

Maximal Margins. Using the methodology reviewed so far, we can also analyze to what value the maximum margin of the original AdaBoost algorithm converges asymptotically. First, we state a lower bound on the margin that is achieved by AdaBoost_ϱ. We find that the size of the margin is not as large as it could be theoretically based on Theorem 4. We briefly discuss below two algorithms that are able to maximize the margin.



As long as each factor on the r.h.s. of (27) is smaller than 1, the bound decreases. If the factors are bounded away from 1, then the bound converges exponentially fast to zero. The following corollary considers the asymptotic case and provides a lower bound on the margin when running AdaBoost_ϱ forever.

Corollary 3 ([153]). Assume AdaBoost_ϱ generates hypotheses h_1, h_2, ... with edges γ_1, γ_2, .... Let γ = inf_{t=1,2,...} γ_t and assume γ > ϱ. Then the smallest margin ρ of the combined hypothesis is asymptotically (t → ∞) bounded from below by

    ρ ≥ [ log(1 − ϱ²) − log(1 − γ²) ] / [ log((1 + γ)/(1 − γ)) − log((1 + ϱ)/(1 − ϱ)) ] ≥ (γ + ϱ)/2.        (28)

From (28) one can understand the interaction between ϱ and γ: if the difference between γ and ϱ is small, then the middle term of (28) is small. Thus, if ϱ is large (assuming ϱ ≤ γ), then ρ must be large, i.e. choosing a larger ϱ results in a larger margin on the training examples.

Remark 2. Under the conditions of Corollary 3, one can compute a lower bound on the hypothesis coefficients in each iteration. Hence the sum of the hypothesis coefficients will increase to infinity at least linearly. It can be shown that this suffices to guarantee that the combined hypothesis has a large margin, i.e. larger than ϱ (cf. [147], Section 2.3). However, in Section 4.1 we have shown that the maximal achievable margin is at least γ. Thus if ϱ is chosen to be too small, then we guarantee only a suboptimal asymptotic margin. In the original formulation of AdaBoost we have ϱ = 0 and we guarantee only that AdaBoost_0 achieves a margin of at least γ/2.⁶ This gap in the theory led to the so-called Arc-GV algorithm [27] and the Marginal AdaBoost algorithm [157].

Arc-GV. The idea of Arc-GV [27] is to set ϱ to the margin that is currently achieved by the combined hypothesis, i.e. depending on the previous performance of the base learner:⁷

    ϱ_t = max( ϱ_{t−1}, min_{n=1,...,N} y_n f_{t−1}(x_n) ).        (29)

There is a very simple proof using Corollary 3 that Arc-GV asymptotically maximizes the margin [147, 27]. The idea is to show that ϱ_t converges to ϱ* – the maximum margin. The proof is by contradiction: since {ϱ_t} is monotonically increasing on a bounded interval, it must converge. Suppose {ϱ_t} converges to a value smaller than ϱ*; then one can apply Corollary 3 to show that the margin of the combined hypothesis is asymptotically larger than the average of this limit value and ϱ*, and hence larger than the limit value itself. This leads to a contradiction, since ϱ_t is chosen to be the margin of the previous iteration. This shows that Arc-GV asymptotically maximizes the margin.

⁶ Experimental results in [157] confirm this analysis and illustrate that the bound given in Corollary 3 is tight.
⁷ The definition of Arc-GV given here is slightly modified; the original algorithm did not use the max.
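The modified hypothesis weight of AdaBoost_ϱ and the Arc-GV rule (29) are easy to add to a standard AdaBoost loop. The following sketch is written against a hypothetical base_learner(X, y, d) returning a {−1, +1}-valued prediction vector; it is only meant to show where ϱ enters the algorithm, not to reproduce the authors' reference implementation.

```python
import numpy as np

def adaboost_rho(X, y, base_learner, T, rho=0.0, arc_gv=False):
    """AdaBoost with the additional margin parameter rho (AdaBoost_rho);
    with arc_gv=True, rho is set adaptively as in (29).  Illustrative sketch."""
    N = len(y)
    d = np.full(N, 1.0 / N)
    f = np.zeros(N)                      # combined hypothesis on the sample
    alphas, hyps = [], []
    for t in range(T):
        h = base_learner(X, y, d)        # predictions h_t(x_n) in {-1, +1}
        gamma = np.sum(d * y * h)        # edge (18); assumed < 1 here
        if arc_gv and t > 0:
            margins = y * f / np.sum(alphas)     # l1-normalized margins
            rho = max(rho, margins.min())        # Arc-GV update (29)
        if gamma <= rho:                 # no further progress w.r.t. current rho
            break
        alpha = 0.5 * np.log((1 + gamma) / (1 - gamma)) \
              - 0.5 * np.log((1 + rho) / (1 - rho))
        f += alpha * h
        alphas.append(alpha); hyps.append(h)
        d = d * np.exp(-alpha * y * h)   # exponential reweighting
        d /= d.sum()
    return alphas, hyps
```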



Marginal AdaBoost. Unfortunately, it is not known how fast Arc-GV converges to the maximum margin solution. This problem is solved by Marginal AdaBoost [157]. Here one also adapts ϱ, but in a different manner. One runs AdaBoost_ϱ repeatedly, and determines ϱ by a line search procedure: if AdaBoost_ϱ is able to achieve a margin of ϱ, then ϱ is increased, otherwise it is decreased. For this algorithm one can show fast convergence to the maximum margin solution [157].

4.4 Relation to Barrier Optimization

We would like to point out how the discussed algorithms can be seen in the context of barrier optimization. This illustrates how, from an optimization point of view, Boosting algorithms are related to linear programming, and provides further insight as to why they generate combined hypotheses with large margins. The idea of barrier techniques is to solve a sequence of unconstrained optimization problems in order to solve a constrained optimization problem (e.g. [77, 136, 40, 156, 147]). We show that the exponential function acts as a barrier function for the constraints in the maximum margin LP (obtained from (24) by setting p = 1 and restricting w_j ≥ 0):

    max_{ρ, w}  ρ
    s.t.  y_n ⟨x_n, w⟩ ≥ ρ,   n = 1, 2, ..., N,
          w_j ≥ 0,  Σ_j w_j = 1.        (30)

Following the standard methodology and using the exponential barrier function⁸ β exp(−z/β) [40], we find the barrier objective for problem (30):

    F_β(w, ρ) = −ρ + β Σ_{n=1}^N exp( (1/β) ( ρ − y_n f_w(x_n) / Σ_j w_j ) ),        (31)

which is minimized with respect to ρ and w for some fixed β. We denote the optimum by w_β and ρ_β. One can show that any limit point of (w_β, ρ_β) for β → 0 is a solution of (30) [136, 19, 40]. This also holds when one only has a sequence of approximate minimizers and the approximation becomes better for decreasing β [40]. Additionally, the quantities d̃_n = exp( ρ_β Σ_j w_j − y_n f_{w_β}(x_n) ) / Z (where Z is chosen such that Σ_n d̃_n = 1) converge for β → 0 to the optimal Lagrange multipliers d_n* of the constraints in (30) (under mild assumptions, see [40], Theorem 2, for details). Hence, they are a solution of the edge minimization problem (20).

To see the connection between Boosting and the barrier approach, we choose β = (Σ_j w_j)^{−1}. If the sum of the hypothesis coefficients increases to infinity (cf. Remark 2 in Section 4.3), then β converges to zero. Plugging this choice of β into (31) one obtains: −ρ + (Σ_j w_j)^{−1} Σ_{n=1}^N exp( ρ Σ_j w_j − y_n f_w(x_n) ). Without going into further details [147, 150], one can show that this function

⁸ Other barrier functions are β log(z) (log-barrier) and β z log(z) (entropic barrier).



is minimized by the AdaBoost algorithm. Thus, one essentially obtains the exponential loss function as used in AdaBoost for ρ = ϱ. Seen in this context, AdaBoost_ϱ is an algorithm that finds a feasible solution, i.e. one that has margin at least ϱ. Also, it becomes clear why it is important that the sum of the hypothesis coefficients goes to infinity, an observation already made in the previous analysis (cf. Remark 2): otherwise β would not converge to 0 and the constraints might not be enforced. Arc-GV and Marginal AdaBoost are extensions where the parameter ρ is optimized as well. Hence, AdaBoost, AdaBoost_ϱ, Arc-GV and Marginal AdaBoost can be understood as particular implementations of a barrier optimization approach, asymptotically solving a linear optimization problem.
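A direct way to see the correspondence is to evaluate the barrier objective (31) numerically. The short sketch below (with arbitrary illustrative inputs) computes F_β(w, ρ) and the AdaBoost-style example weights d̃_n derived from it; the concrete numbers are not taken from the text.

```python
import numpy as np

def barrier_objective(w, rho, beta, Y_H):
    """Exponential-barrier objective (31) for the maximum margin LP (30).

    w    : non-negative hypothesis coefficients, shape (J,)
    rho  : margin variable
    beta : barrier parameter
    Y_H  : matrix with Y_H[n, j] = y_n * h_j(x_n), shape (N, J)
    """
    margins = (Y_H @ w) / np.sum(w)        # l1-normalized margins y_n f_w(x_n) / sum_j w_j
    return -rho + beta * np.sum(np.exp((rho - margins) / beta))

def example_weights(w, rho, Y_H):
    """Weights exp(rho * sum_j w_j - y_n f_w(x_n)); normalized, they approach the
    Lagrange multipliers of (30) as beta -> 0."""
    d = np.exp(rho * np.sum(w) - Y_H @ w)
    return d / d.sum()

# Arbitrary toy numbers, for illustration only
Y_H = np.array([[+1., +1.], [+1., -1.], [-1., +1.]])
w = np.array([0.7, 0.3])
print(barrier_objective(w, rho=0.1, beta=1.0 / w.sum(), Y_H=Y_H))
print(example_weights(w, rho=0.1, Y_H=Y_H))
```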

5 Leveraging as Stagewise Greedy Optimization

In Section 4 we focused mainly on AdaBoost. We now extend our view to more general ensemble learning methods which we refer to as leveraging algorithms [56]. We will relate these methods to numerical optimization techniques. These techniques have served as powerful tools to prove the convergence of leveraging algorithms (cf. Section 5.4). We demonstrate the convergence of ensemble learning methods such as AdaBoost [70] and Logistic Regression (LR) [74, 39], although the techniques are much more general. These algorithms have in common that they iteratively call a base learning algorithm L (a.k.a. weak learner) on a weighted training sample. The base learner is expected to return at each iteration t a hypothesis h_t from the hypothesis set H that has small weighted training error (see (4)) or large edge (see (18)). These hypotheses are then linearly combined to form the final or composite hypothesis f_T as in (1). The hypothesis coefficient α_t is determined at iteration t such that a certain objective is minimized or approximately minimized, and it is fixed for later iterations. For AdaBoost and the Logistic Regression algorithm [74] it has been shown [39] that they generate a composite hypothesis minimizing a loss function G – in the limit as the number of iterations goes to infinity. The loss depends only on the output of the combined hypothesis f_T on the training sample. However, the assumed conditions (discussed later in detail) in [39] on the performance of the base learner are rather strict and can usually not be satisfied in practice. Although parts of the analysis in [39] hold for any strictly convex cost function of Legendre type (cf. [162], p. 258), one needs to demonstrate the existence of a so-called auxiliary function (cf. [46, 39, 114, 106]) for each cost function other than the exponential or the logistic loss. This has been done for the general case in [147] under very mild assumptions on the base learning algorithm and the loss function. We present a family of algorithms that are able to generate a combined hypothesis f converging to the minimum of some loss function G[f] (if it exists). Special cases are AdaBoost [70], Logistic Regression and LS-Boost [76]. While assuming rather mild conditions on the base learning algorithm and the loss function G, linear convergence rates (e.g. [115]) of the type G[f_{t+1}] − G[f*] ≤ η (G[f_t] − G[f*]) for some fixed η ∈ [0, 1) have been shown. This means that the deviation from the minimum loss converges exponentially fast to zero (in the number of iterations). Similar convergence rates have been proven for AdaBoost in the special case of separable data (cf. Section 4.3 and [70]). In the general case these rates can only be shown when the hypothesis set is finite. However, in practice one often uses infinite sets (e.g. hypotheses parameterized by some real-valued parameters). Recently, Zhang [199] has shown order-one convergence for such algorithms, i.e. G[f_{t+1}] − G[f*] ≤ c/t for some fixed c > 0 depending on the loss function.

5.1 Preliminaries

In this section and the next one we show that AdaBoost, Logistic Regression and many other leveraging algorithms generate a combined hypothesis f minimizing a particular loss G on the training sample. The composite hypothesis f_w is a linear combination of the base hypotheses:

    f_w ∈ lin(H) := { Σ_{h̃ ∈ H} w_{h̃} h̃  |  w_{h̃} ∈ R },

where the w's are the combination coefficients. Often one would like to find a combined function with small classification error, hence one would like to minimize the 0/1-loss:

    G_{0/1}(f, S) := Σ_{n=1}^N I(y_n ≠ sign(f_w(x_n))).

However, since this loss is non-convex and not even differentiable, the problem of finding the best linear combination is a very hard problem. In fact, the problem is provably intractable [98]. One idea is to use another loss function which bounds the 0/1-loss from above. For instance, AdaBoost employs the exponential loss

    G_AB(f_w, S) := Σ_{n=1}^N exp(−y_n f_w(x_n)),

and the LogitBoost algorithm [74] uses the logistic loss:

    G_LR(f_w, S) := Σ_{n=1}^N log₂( 1 + exp(−y_n f_w(x_n)) ).

As can be seen in Figure 4, both loss functions bound the classification error G_{0/1}(f, S) from above. Other loss functions have been proposed in the literature, e.g. [74, 127, 154, 196]. It can be shown [116, 196, 28] that in the infinite sample limit, where the sample average converges to the expectation, minimizing either G_AB(f_w, S) or

Fig. 4. The exponential (dashed) and logistic (solid) loss functions, plotted against ρ = yf(x). Both bound the 0/1-loss from above.

G_LR(f_w, S) leads to the same classifier as would be obtained by minimizing the probability of error. More precisely, let G_AB(f) = E{exp(−yf(x))}, and set G_{0/1}(f) = E{I[y ≠ sign(f(x))]} = Pr[y ≠ sign(f(x))]. Denote by f* the function minimizing G_AB(f). Then it can be shown that f* minimizes Pr[y ≠ sign(f(x))], namely achieves the Bayes risk. A similar argument applies to other loss functions. This observation forms the basis for the consistency proofs in [116, 123] (cf. Section 3.5). Note that the exponential and logistic losses are both convex functions. Hence, the problem of finding the global optimum of the loss with respect to the combination coefficients can be solved efficiently. Other loss functions have been used to approach multi-class problems [70, 3, 164], ranking problems [95], unsupervised learning [150] and regression [76, 58, 149]. See Section 7 for more details on some of these approaches.
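For reference, the three losses discussed above are trivial to write down in code. The sketch below evaluates them on a vector of margins y_n f(x_n) (purely illustrative values) and confirms numerically that the exponential and logistic losses upper-bound the 0/1-loss.

```python
import numpy as np

def loss_01(margins):
    """0/1-loss: counts non-positive margins y_n * f(x_n)."""
    return np.sum(margins <= 0)

def loss_exp(margins):
    """Exponential loss G_AB used by AdaBoost."""
    return np.sum(np.exp(-margins))

def loss_logistic(margins):
    """Logistic loss G_LR (base-2 logarithm) used by LogitBoost."""
    return np.sum(np.log2(1 + np.exp(-margins)))

margins = np.array([2.0, 0.3, -0.1, -1.5, 0.8])   # illustrative values of y_n f(x_n)
print(loss_01(margins), loss_exp(margins), loss_logistic(margins))
# Both surrogate losses dominate the 0/1-loss pointwise, hence also in the sum.
```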

5.2 A Generic Algorithm

Most Boosting algorithms have in common that they iteratively run a learning algorithm to greedily select the next hypothesis h ∈ H. In addition, they use in some way a weighting on the examples to influence the choice of the base hypothesis. Once the hypothesis is chosen, the algorithm determines the combination weight for the newly obtained hypothesis. Different algorithms use different schemes to weight the examples and to determine the new hypothesis weight. In this section we discuss a generic method (cf. Algorithm 5.2) and the connection between the loss function minimized and the particular weighting schemes. Many of the proposed algorithms can be derived from this scheme. Let us assume the loss function G(f, S) has the following additive form:⁹

    G(f, S) := Σ_{n=1}^N g(f(x_n), y_n),        (32)

and we would like to solve the optimization problem

⁹ More general loss functions are possible and have been used; however, for the sake of simplicity, we chose this additive form.



    min_{f ∈ lin(H)} G(f, S) = min_w Σ_{n=1}^N g(f_w(x_n), y_n).        (33)

Algorithm 5.2 – A Leveraging algorithm for the loss function G.
1. Input: S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, No. of iterations T, loss function G : R^N → R
2. Initialize: f_0 ≡ 0, d_n^{(1)} = g′(f_0(x_n), y_n) for all n = 1, ..., N
3. Do for t = 1, ..., T:
   a) Train classifier on {S, d^{(t)}} and obtain hypothesis h_t : X → Y, h_t ∈ H
   b) Set α_t = argmin_{α ∈ R} G[f_t + α h_t]
   c) Update f_{t+1} = f_t + α_t h_t and d_n^{(t+1)} = g′(f_{t+1}(x_n), y_n), n = 1, ..., N
4. Output: f_T
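A compact coordinate-descent reading of Algorithm 5.2 is sketched below in Python. The base_learner and the scalar line search are placeholders (the line search is done here by brute force over a grid), so this only illustrates how the loss g, its derivative g′ and the example weights d_n interact; it is not a faithful reimplementation of any specific algorithm from the text.

```python
import numpy as np

def leverage(X, y, base_learner, g, g_prime, T, alpha_grid=np.linspace(0, 5, 501)):
    """Generic leveraging loop in the spirit of Algorithm 5.2.

    g(f_x, y), g_prime(f_x, y) : loss and its derivative w.r.t. the first argument
    base_learner(X, y, d)      : returns a prediction vector h(x_n) for the sample
    """
    N = len(y)
    f = np.zeros(N)                                   # f_0 = 0 on the sample
    ensemble = []
    for t in range(T):
        d = g_prime(f, y)                             # example weights d_n^(t)
        h = base_learner(X, y, d)
        # Line search: alpha_t = argmin_alpha G[f_t + alpha * h_t] (brute force here)
        losses = [np.sum(g(f + a * h, y)) for a in alpha_grid]
        alpha = alpha_grid[int(np.argmin(losses))]
        f = f + alpha * h
        ensemble.append((alpha, h))
    return ensemble

# Exponential loss (AdaBoost-like choice); note that d_n carries the label factor.
g_exp       = lambda fx, y: np.exp(-y * fx)
g_exp_prime = lambda fx, y: -y * np.exp(-y * fx)
```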

In step (2) of Algorithm 5.2 the example weights are initialized using the derivative g′ of the loss function g with respect to its first argument, evaluated at (f_0(x_n), y_n) (i.e. at (0, y_n)). At each step t of Algorithm 5.2 one can compute the weight w_j^t assigned to each base hypothesis h̃_j, such that (cf. (22))

    f_{w^t} := f_t = Σ_{h̃ ∈ H} w_{h̃}^t h̃ = Σ_{r=1}^t α_r h_r.

In iteration t of Algorithm 5.2 one chooses a hypothesis h_t, which corresponds to a weight (coordinate) w_{h_t}. This weight is then updated to minimize the loss (cf. step (3b)): w_{h_t}^{(t+1)} = w_{h_t}^{(t)} + α_t. Since one always selects a single variable for optimization at each time step, such algorithms are called coordinate descent methods. Observe that if one uses AdaBoost's exponential loss in Algorithm 5.2, then d_n^{(t+1)} = g′(f_t(x_n), y_n) = y_n exp(−y_n f_t(x_n)). Hence, in distinction from the original AdaBoost algorithm (cf. Algorithm 2.1), the d's are multiplied by the labels. This is the reason why we will later use a different definition of the edge (without the label). Consider what would happen if one always chooses the same hypothesis (or coordinate). In this case, one could not hope to prove convergence to the minimum. Hence, one has to assume the base learning algorithm performs well in selecting the base hypotheses from H. Let us first assume the base learner always finds the hypothesis with maximal edge (or minimal classification error in the hard binary case):

    h_t = argmax_{h̃ ∈ H} Σ_{n=1}^N d_n^{(t)} h̃(x_n).        (34)

This is a rather restrictive assumption, being one among many that can be made (cf. [168, 199, 152]). We will later significantly relax this condition (cf. (40)).



Let us discuss why the choice in (34) is useful. For this we compute the gradient of the loss function with respect to the weight w_{h̃} of each hypothesis h̃ in the hypothesis set:

    ∂G(f_{w^t}, S) / ∂w_{h̃} = Σ_{n=1}^N g′(f_t(x_n), y_n) h̃(x_n) = Σ_{n=1}^N d_n^{(t)} h̃(x_n).        (35)

Hence, h_t is the coordinate that has the largest component in the direction of the gradient vector of G evaluated at w^t.¹⁰ If one optimizes this coordinate, one would expect to achieve a large reduction in the value of the loss function. This method of selecting the coordinate with the largest absolute gradient component is called the Gauss-Southwell method in the field of numerical optimization (e.g. [115]). It is known to converge under mild assumptions [117] (see details in [152] and Section 5.4). Performing the explicit calculation in (35) for the loss function G_AB (see (5)) yields the AdaBoost approach described in detail in Algorithm 2.1. In this case g(f(x), y) = exp(−yf(x)), implying that d_n^{(t)} ∝ y_n exp(−y_n f_{t−1}(x_n)), which, up to a normalization constant and the factor y_n, is the form given in Algorithm 2.1. The LogitBoost algorithm [74] mentioned in Section 2.2 can be derived in a similar fashion using the loss function G_LR (see (7)). In this case we find that d_n^{(t)} ∝ y_n / (1 + exp(y_n f_{t−1}(x_n))). Observe that in this case, the weight assigned to misclassified examples with very negative margins is much smaller than the exponential weight assigned by AdaBoost. This may help to explain why LogitBoost tends to be less sensitive to noise and outliers than AdaBoost [74]. Several extensions of the AdaBoost and LogitBoost algorithms based on Bregman distances were proposed in [39] following earlier work of [104] and [110] (cf. Section 5.3).
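The two example weightings just contrasted are easy to compare directly. The snippet below (illustrative only) evaluates the exponential and logistic weighting rules as functions of the current margin and shows how much more weight AdaBoost places on badly misclassified examples.

```python
import numpy as np

# Example weights (up to normalization and the label factor y_n) as a function
# of the current margin m_n = y_n * f(x_n):
w_exponential = lambda m: np.exp(-m)                 # AdaBoost-style weighting
w_logistic    = lambda m: 1.0 / (1.0 + np.exp(m))    # LogitBoost-style weighting

margins = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(w_exponential(margins))   # grows exponentially for very negative margins
print(w_logistic(margins))      # saturates near 1 for very negative margins
```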

5.3 The Dual Formulation

AdaBoost was originally derived from results on online-learning algorithms [70], where one receives at each iteration one example, then predicts its label and incurs a loss. The important question in this setting relates to the speed at which the algorithm is able to learn to produce predictions with small loss. In [35, 106, 105] the total loss was bounded in terms of the loss of the best predictor. To derive these results, Bregman divergences [24] and generalized projections were extensively used. In the case of boosting, one takes a dual view [70]: here the set of examples is fixed, whereas at each iteration the base learning algorithm generates a new hypothesis (the “online example”). Using online-learning techniques, the convergence in the online learning domain – the dual domain – has been analyzed and shown to lead to convergence results in the primal domain (cf. Section 4.3). In the primal domain the hypothesis coefficients w are optimized, while the weighting d is optimized in the dual domain.

¹⁰ If one assumes the base hypothesis set to be closed under negation (h ∈ H ⇒ −h ∈ H), then this is equivalent to choosing the hypothesis with the largest absolute component in the direction of the gradient.



In [104, 110] AdaBoost was interpreted as entropy projection in the dual domain. The key observation is that the weighting d^{(t+1)} on the examples in the t-th iteration is computed as a generalized projection of d^{(t)} onto a hyperplane (defined by linear constraints), where one uses a generalized distance measure ∆(d, d̃). For AdaBoost the unnormalized relative entropy is used, where

    ∆(d, d̃) = Σ_{n=1}^N [ d_n log(d_n / d̃_n) − d_n + d̃_n ].        (36)

The update of the distribution in steps (3c) and (3d) of Algorithm 2.1 is the solution to the following optimization problem (“generalized projection”):

    d^{(t+1)} = argmin_{d ∈ R₊^N} ∆(d, d^{(t)})   subject to   Σ_{n=1}^N d_n y_n h_t(x_n) = 0.        (37)

Hence, the new weighting d^{(t+1)} is chosen such that the edge of the previous hypothesis becomes zero (as e.g. observed in [70]) and the relative entropy between the new d^{(t+1)} and the old d^{(t)} distribution is as small as possible [104]. After the projection, the new d lies on the hyperplane defined by the corresponding constraint (cf. Figure 5).

Fig. 5. Illustration of generalized projections: one projects a point d^{(t)} onto a plane H̃ by finding the point d^{(t+1)} on the plane that has the smallest generalized distance from d^{(t)}. If G(d) = ½ ‖d‖₂², then the generalized distance is equal to the squared Euclidean distance; hence, the projected point d_E^{(t+1)} is the closest (in the common sense) point on the hyperplane. For another Bregman function G, one finds another projected point d_B^{(t+1)}, since closeness is measured differently.
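For the unnormalized relative entropy (36), the projection (37) onto the hyperplane {d : Σ_n d_n y_n h(x_n) = 0} has the familiar exponential-update form d_n ← d_n exp(−α y_n h(x_n)), where the scalar α is chosen so that the constraint holds. The sketch below finds α by one-dimensional root finding with SciPy's brentq; it is an illustration of the projection view on toy numbers, not code from the text, and it assumes the constraint function changes sign within the chosen bracket.

```python
import numpy as np
from scipy.optimize import brentq

def entropy_projection(d_old, u):
    """Project d_old onto {d : sum_n d_n * u_n = 0} w.r.t. the unnormalized
    relative entropy (36).  Here u_n = y_n * h_t(x_n).  The projection has the
    form d_n = d_old_n * exp(-alpha * u_n), with alpha solving the constraint."""
    def constraint(alpha):
        return np.sum(d_old * np.exp(-alpha * u) * u)   # edge after the update
    alpha = brentq(constraint, -50.0, 50.0)             # assumes a sign change in this bracket
    return d_old * np.exp(-alpha * u), alpha

# Toy example: the new weighting gives the previous hypothesis zero edge
d_old = np.full(4, 0.25)
u = np.array([+1.0, +1.0, -1.0, +1.0])                  # y_n * h(x_n)
d_new, alpha = entropy_projection(d_old, u)
print(np.dot(d_new, u))                                 # ~ 0
```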

The work of [104, 110] and later of [39, 47, 111, 196] led to a more general understanding of Boosting methods in the context of Bregman distance optimization and Information Theory. Given an arbitrary strictly convex function G : R₊^N → R (of Legendre type), called a Bregman function, one can define a Bregman divergence (“generalized distance”):

    ∆_G(x, y) := G(x) − G(y) − ⟨∇G(y), x − y⟩.        (38)

AdaBoost, Logistic Regression and many other algorithms can be understood in this framework as algorithms that iteratively project onto different hyperplanes.¹¹ A precursor of such algorithms is the Bregman algorithm [24, 34], and it is known that the sequence {d^{(t)}} converges to a point in the intersection of all hyperplanes which minimizes the divergence to the first point d^{(0)} – in AdaBoost the uniform distribution. Hence they solve the following optimization problem:¹²

    min_{d ∈ R₊^N} ∆_G(d, d^{(0)})   subject to   Σ_{n=1}^N d_n y_n h̃(x_n) = 0   for all h̃ ∈ H.        (39)

An interesting question is how generalized projections relate to the coordinate descent discussed in Section 5.2. It has been shown [147] (see also [117, 107]) that a generalized projection in the dual domain onto a hyperplane (defined by h; cf. (37)) corresponds to the optimization of the variable corresponding to h in the primal domain. In particular, if one uses a Bregman function G(d), then this corresponds to a coordinate-wise descent on the loss function Ḡ( Σ_{j=1}^J w_j h̃_j(X) + (∇G)^{−1}(d^{(0)}) ), where h̃_j(X) = (h̃_j(x_1), ..., h̃_j(x_N)) and Ḡ : R^N → R is the convex conjugate of G.¹³ In the case of AdaBoost one uses G(d) = Σ_n d_n log d_n, leading to the relative entropy in (36), and (∇G)^{−1}(d^{(0)}) is zero if d^{(0)} is uniform.

5.4 Convergence Results

There has recently been a lot of work on establishing the convergence of several algorithms, which are all very similar to Algorithm 5.2. Since we cannot discuss all approaches in detail, we only provide an overview of different results which have been obtained in the past few years (cf. Table 1). In the following few paragraphs we briefly discuss several aspects which are important for proving convergence of such algorithms.

Conditions on the Loss Function. In order to prove convergence to a global minimum one usually has to assume that the loss function is convex (then every local minimum is a global one). If one allows arbitrary loss functions, one can only show convergence to a local minimum (i.e. the gradient converges to zero, e.g. [127]). Moreover, to prove convergence one usually assumes the function is reasonably smooth, where some measure of smoothness often appears in convergence rates. For some results it is important that the second derivatives are strictly positive, i.e. the loss function is strictly convex (“the function is not flat”) – otherwise, the algorithm might get stuck.

¹¹ In [48] a linear programming approach (“column generation”) for simultaneously projecting onto a growing set of hyperplanes has been considered. Some details are described in Section 6.2.
¹² These algorithms also work when the equality constraints in (37) are replaced by inhomogeneous inequality constraints: then one only projects onto a hyperplane when the corresponding constraint is not already satisfied.
¹³ In the case of Bregman functions, the convex conjugate is given by Ḡ(o) = ⟨(∇G)^{−1}(o), o⟩ − G((∇G)^{−1}(o)).



Relaxing Conditions on the Base Learner. In Section 5.2 we made the assumption that the base learner always returns the hypothesis maximizing the edge (cf. (34)). In practice, however, it might be difficult to find the hypothesis that exactly minimizes the training error (i.e. maximizes the edge). Hence, some relaxations are necessary. One of the simplest approaches is to assume that the base learner always returns a hypothesis with some edge larger than, say, γ (“γ-relaxed”, e.g. [70, 57]). This condition was used to show that AdaBoost's training error converges exponentially to zero. However, since the edge of a hypothesis is equal to the gradient with respect to its weight, this would mean that the gradient will not converge to zero. In the case of AdaBoost this actually happens and one converges to the minimum of the loss, but the length of the solution vector does not converge (cf. Section 3.2). More realistically, one might assume that the base learner returns a hypothesis that approximately maximizes the edge – compared to the best possible hypothesis. [197] considered an additive term converging to zero (“α-relaxed”) and [152] used a multiplicative term (“β-relaxed”). We summarize some senses of approximate edge-maximization (e.g. [70, 197, 151, 147]):¹⁴

    α-relaxed:  Σ_{n=1}^N d_n^t h_t(x_n) ≥ max_{h ∈ H} Σ_{n=1}^N d_n^t h(x_n) − α,
    γ-relaxed:  Σ_{n=1}^N d_n^t h_t(x_n) ≥ γ,
    τ-relaxed:  Σ_{n=1}^N d_n^t h_t(x_n) ≥ τ( max_{h ∈ H} Σ_{n=1}^N d_n^t h(x_n) ),
    β-relaxed:  Σ_{n=1}^N d_n^t h_t(x_n) ≥ β max_{h ∈ H} Σ_{n=1}^N d_n^t h(x_n),        (40)

for some fixed constants α > 0, β > 0, γ > 0 and some strictly monotone and continuous function τ : R → R with τ (0) = 0.

Size of the Hypothesis Set. For most convergence results the size of the hypothesis set does not matter. However, in order to be able to attain the minimum, one has to assume that the set is bounded and closed (cf. [149]). In [39, 48, 151] finite hypothesis sets have been assumed in order to show convergence and rates. In the case of classification, where the base hypotheses are discrete valued, this holds true. However, in practice one often wants to use real-valued, parameterized hypotheses, and then the hypothesis set is uncountable. [197, 199] present rather general convergence results for an algorithm related to that in Algorithm 5.2, which do not depend on the size of the hypothesis set and are applicable to uncountable hypothesis sets.

¹⁴ Note that the label y_n may be absorbed into the d's.



Regularization Terms on the Hypothesis Coefficients. In Section 6 we will discuss several regularization approaches for leveraging algorithms. One important ingredient is a penalty term on the hypothesis coefficients, i.e. one seeks to optimize

    min_{w ∈ R^J} ( Σ_{n=1}^N g(f_w(x_n), y_n) + C P(w) ),        (41)

where C is the regularization constant and P is some regularization operator (e.g. some ℓp-norm). We have seen that AdaBoost implicitly penalizes the ℓ1-norm (cf. Section 4.3); this is also true for the algorithms described in [183, 154, 147, 197, 199]. The use of the ℓ2-norm has been discussed in [56, 57, 183]. Only a small fraction of the analyses found in the literature allow the introduction of a regularization term. Note that when using regularized loss functions, the optimality conditions for the base learner change slightly (there will be one additional term depending on the regularization constant, cf. [147, 199]). Some regularization functions will be discussed in Section 6.3.

Convergence Rates. The first convergence result was obtained [70] for AdaBoost and the γ-relaxed condition on the base learner. It showed that the objective function is reduced at every step by a factor; hence, it decreases exponentially. Here, the convergence speed does not directly depend on the size of the hypothesis class and the number of examples. Later this was generalized to essentially exponentially decreasing functions (like in Logistic Regression, but still the α-relaxed condition, cf. [59]). Also, for finite hypothesis sets and strictly convex loss one can show the exponential decrease of the loss towards its minimum [151] (see also [117]) – the constants, however, depend heavily on the size of the training set and the number of base hypotheses. The best known rate for convex loss functions and arbitrary hypothesis sets is given in [197]. Here the rate is O(1/t) and the constants depend only on the smoothness of the loss function. In this approach it is mandatory to regularize with the ℓ1-norm. Additionally, the proposed algorithm does not exactly fit into the scheme discussed above (see [197] for details).

Convexity with respect to parameters. In practical implementations of Boosting algorithms, the weak learners are parameterized by a finite set of parameters θ = (θ_1, ..., θ_s), and the optimization needs to be performed with respect to θ rather than with respect to the weak hypothesis h directly. Even if the loss function G is convex with respect to h, it is not necessarily convex with respect to θ. Although this problem may cause some numerical difficulties, we do not dwell upon it in this review.

6 Robustness, Regularization, and Soft-Margins

It has been shown that Boosting rarely overfits in the low noise regime (e.g. [54, 70, 167]); however, it clearly does so for higher noise levels (e.g. [145, 27, 81, 153, 127, 12, 51]). In this section, we summarize techniques that yield state-of-the-art results and extend the applicability of boosting to the noisy case.

Table 1. Summary of Results on the Convergence of Greedy Approximation Methods

Ref.  | Cost Function                    | Base Learner | Hypothesis Set       | Convergence                      | Rate                                  | Regularization
[70]  | exponential loss                 | γ-relaxed    | uncountable          | global minimum, if loss zero     | O(e^{-t})                             | no
[127] | Lipschitz differentiable         | strict       | uncountable          | gradient zero                    | no                                    | no
[127] | convex, Lipschitz differentiable | strict       | uncountable          | global minimum                   | no                                    | no
[27]  | convex, positive, differentiable | strict       | countable            | global minimum                   | no                                    | ‖·‖₁ possible
[56]  | essentially exp. decreasing      | γ-relaxed    | uncountable          | global minimum                   | O(t^{-1/2}); if loss zero, O(e^{-t})  | no
[39]  | strictly convex, differentiable  | strict       | finite               | global optimum (dual problem)    | no                                    | no
[48]  | linear                           | strict       | finite               | global minimum                   | no                                    | ‖·‖₁ mandatory
[58]  | squared loss, ε-loss             | γ-relaxed    | uncountable          | global minimum                   | no                                    | ‖·‖₂ mandatory
[151] | strictly convex, differentiable  | β-relaxed    | finite               | global minimum, unique dual sol. | O(e^{-t})                             | ‖·‖₁ possible
[149] | linear                           | β-relaxed    | uncountable, compact | global minimum                   | no                                    | ‖·‖₁ possible
[147] | strictly convex                  | τ-relaxed    | uncountable, compact | global minimum                   | no                                    | ‖·‖₁ mandatory
[197] | convex                           | strict       | uncountable          | global infimum                   | O(1/t)                                | ‖·‖₁ mandatory

Notes to Table 1: “linear” also includes piecewise linear, convex functions like the soft-margin or the ε-insensitive loss; the β-relaxed condition was extended to τ-relaxed in [147]; some entries require a few more technical assumptions; strictly convex, differentiable functions of this kind are usually referred to as functions of Legendre type [162, 13].

152

R. Meir and G. R¨ atsch

alleviate the distortions that single difficult examples (e.g. outliers) can cause the decision boundary. We will discuss several techniques to approach this problem. We start with algorithms that implement the intuitive idea of limiting the influence of a single example. First we present AdaBoostReg [153] which trades off the influence with the margin, then discuss BrownBoost [65] which gives up on examples for which one cannot achieve large enough margins within a given number of iterations. We then discuss SmoothBoost [178] which prevents overfitting by disallowing overly skewed distributions, and finally briefly discuss some other approaches of the same flavor. In Section 6.2 we discuss a second group of approaches which are motivated by the margin-bounds reviewed in Section 3.4. The important insight is that it is not the minimal margin which is to be maximized, but that the whole margin distribution is important: one should accept small amounts of marginerrors on the training set, if this leads to considerable increments of the margin. First we describe the DOOM approach [125] that uses a non-convex, monotone upper bound to the training error motivated from the margin-bounds. Then we discuss a linear program (LP) implementing a soft-margin [17, 153] and outline algorithms to iteratively solve the linear programs [154, 48, 146]. The latter techniques are based on modifying the AdaBoost margin loss function to achieve better noise robustness. However, DOOM as well as the LP approaches employ ∞ -margins, which correspond to a 1 -penalized cost function. In Section 6.3 we will discuss some issues related to other choices of the penalization term.

6.1

Reducing the Influence of Examples

The weights of examples which are hard to classify are increased at every step of AdaBoost. This leads to AdaBoost’s tendency to exponentially decrease the training error and to generate combined hypotheses with large minimal margin. To discuss the suboptimal performance of such a hard margin classifier in the presence of outliers and mislabeled examples in a more abstract way, we analyze Figure 6. Let us first consider the noise free case [left]. Here, we can estimate a separating hyperplane correctly. In Figure 6 [middle] we have one outlier, which corrupts the estimation. Hard margin algorithms will concentrate on this outlier and impair the good estimate obtained in the absence of the outlier. Finally, let us consider more complex decision boundaries (cf. Figure 6 [right]). Here the overfitting problem is even more pronounced, if one can generate more and more complex functions by combining many hypotheses. Then all training examples (even mislabeled ones or outliers) can be classified correctly, which can result in poor generalization. From these cartoons, it is apparent that AdaBoost and any other algorithm with large hard margins are noise sensitive. Therefore, we need to relax the hard margin and allow for a possibility of “mistrusting” the data. We present several algorithms which address this issue.

An Introduction to Boosting and Leveraging

153

Fig. 6. The problem of finding a maximum margin “hyperplane” on reliable data [left], data with outlier [middle] and with a mislabeled example [right]. The solid line shows the resulting decision boundary, whereas the dashed line marks the margin area. In the middle and on the left the original decision boundary is plotted with dots. The hard margin implies noise sensitivity, because only one example can spoil the whole estimation of the decision boundary. (Figure taken from [153].)

AdaBoostReg . Examples that are mislabeled and usually more difficult to classify should not be forced to have a positive margin. If one knew beforehand which examples are “unreliable”, one would simply remove them from the training set or, alternatively, not require that they have a large margin. Assume one has defined a non-negative mistrust parameter ζn , which expresses the “mistrust” in an example (xn , yn ). Then one may relax the hard margin constraints (cf. (23) and (24)) leading to: max ρ w,ρ

s.t. ρn (w) ≥ ρ − Cζn ,

n = 1, . . . , N

(42)

where ρn (w) = yn xn , w/ w 1 , C is an a priori chosen constant.15 Now one could maximize the margin using these modified constraints (cf. Section 4). Two obvious questions arise from this discussion: first, how can one determine the ζ’s and second, how can one incorporate the modified constraints into a boosting-like algorithm. In the AdaBoostReg algorithm [153] one tries to solve both problems simultaneously. One uses the example weights computed at each iteration to determine which examples are highly influential and hard to classify. Assuming that the hard examples are “noisy examples”, the algorithm chooses (t) the mistrust parameter at iteration t, ζn , as the amount by which the example (xn , yn ) influenced the decision in previous iterations: ζn(t) = 15

t

(r) r=1 αr dn , t r=1 αr

Note that we use w to mark the parameters of the hyperplane in feature space and α to denote the coefficients generated by the algorithm (see Section 4.1).


where the αr's and dn(r)'s are the weights for the hypotheses and examples, respectively. Moreover, one defines a new quantity, the soft margin ρ̃n(w) := ρn(w) + Cζn of an example (xn, yn), which is used as a replacement of the margin in the AdaBoost algorithm. So, the idea in AdaBoostReg is to find a solution with large soft margin on all examples, i.e. maximize ρ w.r.t. w and ρ such that ρ̃n(w) ≥ ρ, n = 1, . . . , N. In this case one expects to observe less overfitting. One of the problems of AdaBoostReg is that it lacks a convergence proof. Additionally, it is not clear what the underlying optimization problem is. The modification is done at an algorithmic level, which makes it difficult to relate to an optimization problem. Nevertheless, it was one of the first boosting-like algorithms that achieved state-of-the-art generalization results on noisy data (cf. [146]). In experimental evaluations, it was found that this algorithm is among the best performing ones (cf. [153]).

BrownBoost. The BrownBoost algorithm [65] is based on the Boosting by Majority (BBM) algorithm [64]. An important difference between BBM and AdaBoost is that BBM uses a pre-assigned number of boosting iterations. As the algorithm approaches its predetermined termination, it becomes less and less likely that examples which have large negative margins will eventually become correctly labeled. Then, the algorithm "gives up" on those examples and concentrates its effort on those examples whose margin is, say, a small negative number [65]. Hence, the influence of the most difficult examples is reduced in the final iterations of the algorithm. This is similar in spirit to the AdaBoostReg algorithm, where examples which are difficult to classify are assigned reduced weights in later iterations. To use BBM one needs to pre-specify two parameters: (i) an upper bound on the error of the weak learner, ε ≤ 1/2 − γ/2 (cf. Section 3.1), and (ii) a "target error" ϵ > 0. In [65] it was shown that one can get rid of the γ-parameter as in AdaBoost. The target error ϵ is interpreted as the training error one wants to achieve. For noisy problems, one should aim for non-zero training error, while for noise-free and separable data one can set ϵ to zero. It has been shown that by letting ϵ in the BrownBoost algorithm approach zero, one recovers the original AdaBoost algorithm. To derive BrownBoost, the following "thought experiment" [65] – inspired by ideas from the theory of Brownian motion – is made: Assume the base learner returns a hypothesis h with edge larger than γ. Fix some 0 < δ < γ and define a new hypothesis h′ which is equal to h with probability δ/γ and random otherwise. It is easy to check that the expected edge of h′ is δ. Since the edge is small, the change of the combined hypothesis and the example weights will be small as well. Hence, the same hypothesis may have an edge greater than δ for several iterations (depending on δ) and one does not need to call the base learner again. The idea in BrownBoost is to let δ approach zero and analyze the above "dynamics" with differential equations. The resulting BrownBoost algorithm is very similar to AdaBoost but has some additional terms in the example weighting, which depend on the remaining number of iterations. In addition, in each iteration one needs to solve a differential equation with boundary conditions to compute how long a given hypothesis "survives" the above described cycle, which determines its weight.


More details on this algorithm and theoretical results on its performance can be found in [65]. Whereas the idea of the algorithm seems to be very promising, we are not aware of any empirical results illustrating the efficiency of the approach.

SmoothBoost. The skewness of the distributions generated during the Boosting process suggests that one approach to reducing the effect of overfitting is to impose limits on this skewness. A promising algorithm along these lines was recently suggested in [178], following earlier work [53]. The SmoothBoost algorithm was designed to work effectively with malicious noise, and is provably effective in this scenario, under appropriate conditions. The algorithm is similar to AdaBoost in maintaining a set of weights dn(t) at each boosting iteration, except that there is a cutoff of the weight assigned to examples with very negative margins. The version of SmoothBoost described in [178] requires two parameters as input: (i) κ, which measures the desired error rate of the final classifier, and (ii) γ, which measures the guaranteed edge of the hypothesis returned by the weak learner. Given these two parameters, SmoothBoost can be shown to converge within a finite number of steps to a composite hypothesis f such that |{n : yn f(xn) ≤ θ}| < κN, where θ ≤ κ. Moreover, it can be shown that the weights generated during the process obey dn(t) ≤ 1/(κN), namely the weights cannot become too large (compare with the ν-LP in Section 6.2). Although the convergence time of SmoothBoost is larger than that of AdaBoost, this is more than compensated for by the robustness property arising from the smoothness of the distribution. We observe that SmoothBoost operates with binary or real-valued hypotheses. While SmoothBoost seems like a very promising algorithm for dealing with noisy situations, it should be kept in mind that it is not fully adaptive, in that both κ and γ need to be supplied in advance (cf. recent work by [30]). We are not aware of any applications of SmoothBoost to real data. Other approaches aimed at directly reducing the effect of difficult examples can be found in [109, 6, 183].
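To make the reweighting ideas of this section concrete, the following sketch runs a plain AdaBoost-style loop while tracking the mistrust parameters ζn(t) and soft margins ρ̃n = ρn + Cζn defined above, and additionally caps the example weights at 1/(κN) in the spirit of SmoothBoost. The exponential reweighting through the soft margin is only an illustration of the idea, not the exact AdaBoostReg or SmoothBoost update; all names and the base-learner interface are our own.

```python
import numpy as np

def boost_with_mistrust(X, y, base_learner, T=50, C=0.5, kappa=0.1):
    """Illustrative boosting loop tracking mistrust and soft margins (Section 6.1 ideas).

    base_learner(X, y, d) must return a callable h with h(X) -> +/-1 predictions.
    """
    N = len(y)
    d = np.full(N, 1.0 / N)            # example weights d_n^{(t)}
    alphas, hyps = [], []
    weighted_d_sum = np.zeros(N)        # accumulates alpha_r * d_n^{(r)}
    F = np.zeros(N)                     # unnormalized ensemble output on the training set

    for t in range(T):
        h = base_learner(X, y, d)
        pred = h(X)
        eps = np.clip(np.sum(d * (pred != y)), 1e-12, 0.5 - 1e-12)  # weighted error
        alpha = 0.5 * np.log((1.0 - eps) / eps)                     # standard AdaBoost coefficient
        alphas.append(alpha); hyps.append(h)
        weighted_d_sum += alpha * d
        F += alpha * pred

        zeta = weighted_d_sum / np.sum(alphas)     # mistrust zeta_n^{(t)}
        rho = y * F / np.sum(alphas)               # ell_1-normalized margins rho_n
        soft_rho = rho + C * zeta                  # soft margins  rho~_n = rho_n + C * zeta_n

        # reweight via the soft margin instead of the margin (illustration only)
        logits = -soft_rho * np.sum(alphas)
        d = np.exp(logits - logits.max())
        d /= d.sum()
        d = np.minimum(d, 1.0 / (kappa * N))       # SmoothBoost-style cap on the weights
        d /= d.sum()
    return hyps, alphas
```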

6.2 Optimization of the Margins

Let us return to the analysis of AdaBoost based on margin distributions as discussed in Section 3.4. Consider a base-class of binary hypotheses H, characterized by VC dimension VCdim(H). From (16) and the fact that $R_N(H) = O\big(\sqrt{\mathrm{VCdim}(H)/N}\big)$ we conclude that with probability at least 1 − δ over the random draw of a training set S of size N the generalization error of a function f ∈ co(H) with margins ρ1, . . . , ρN can be bounded by

$$L(f) \;\le\; \frac{1}{N}\sum_{n=1}^{N} I(\rho_n \le \theta) \;+\; O\!\left(\sqrt{\frac{\mathrm{VCdim}(H)}{N\theta^{2}}} + \sqrt{\frac{\log(1/\delta)}{2N}}\,\right), \qquad (43)$$

where θ ∈ (0, 1] (cf. Corollary 1). We recall again that, perhaps surprisingly, the bound in (43) is independent of the number of hypotheses ht that contribute to defining $f = \sum_t \alpha_t h_t \in \mathrm{co}(H)$. It was stated that a reason for the success of AdaBoost, compared to other ensemble learning methods (e.g. Bagging [25]), is that it generates combined hypotheses with large margins on the training examples. It asymptotically finds a linear combination fw of base hypotheses satisfying

$$\rho_n(w) = \frac{y_n f_w(x_n)}{\sum_j w_j} \;\ge\; \rho, \qquad n = 1, \ldots, N, \qquad (44)$$

for some large margin ρ (cf. Section 4). Then the first term of (43) can be made zero for θ = ρ and the second term becomes small, if ρ is large. In [27, 81, 157] algorithms have been proposed that generate combined hypotheses with even larger margins than AdaBoost. It was shown that as the margin increases, the generalization performance can become better on data sets with almost no noise (see also [167, 138, 157]). However, on problems with a large amount of noise, it has been found that the generalization ability often degrades for hypotheses with larger margins (see also [145, 27, 153]). In Section 4 we have discussed connections of boosting to margin maximization. The algorithms under consideration approximately solve a linear programming problem, but tend to perform sub-optimally on noisy data. From the margin bound (43), this is indeed not surprising. The minimum on the right hand side of (43) is not necessarily achieved with the maximum (hard) margin (largest θ). In any event, one should keep in mind that (43) is only an upper bound, and that more sophisticated bounds (e.g. [180, 45, 181]), based on looking at the full margin distribution as opposed to the single hard margin θ, may lead to different conclusions. In this section we discuss algorithms where the number of margin errors can be controlled, and hence one is able to control the contribution of both terms in (43) separately. We first discuss the DOOM approach [125] and then present an extended linear program – the ν-LP problem – which allows for margin errors. Additionally, we briefly discuss approaches to solve the resulting linear program [48, 154, 147]. DOOM. (Direct Optimization Of Margins) The basic idea of [125, 127, 124] is to replace AdaBoost's exponential loss function with another loss function with ℓ1-normalized hypothesis coefficients. In order to do this one defines a class of B-admissible margin cost functions, parameterized by some integer M. These loss functions are motivated from the margin bounds of the kind discussed in Section 3.4, which have a free parameter θ. A large value of M ∼ 1/θ² corresponds to a "high resolution" and a high effective complexity of the convex combination (small margin θ), whereas smaller values of M correspond to larger θ and therefore smaller effective complexity. The B-admissibility condition ensures that the loss functions appropriately take care of the trade-off between complexity and empirical error. Following some motivation (which we omit here), [125, 127, 124] derived an optimal family of margin loss functions (according to the margin bound). Unfortunately, this loss function is non-monotone and non-convex (cf. Figure 7, [left]), leading to a very difficult optimization problem (NP hard).


The idea proposed in [125] is to replace this loss function with a piece-wise linear loss function that is monotone (but non-convex, cf. Figure 7, [right]):

$$C_\theta(z) = \begin{cases} 1.1 - 0.1\,z & : -1 \le z \le 0,\\ 1.1 - z/\theta & : 0 \le z \le \theta,\\ 0.1\,(1 - z)/(1 - \theta) & : \theta \le z \le 1. \end{cases}$$

The optimization of this margin loss function to a global optimum is still very difficult, but good heuristics for optimization have been proposed [125, 127, 124]. Theoretically, however, one can only prove convergence to a local minimum (cf. Section 5.4).
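The piecewise linear loss is straightforward to evaluate directly; the small sketch below implements C_θ(z) as reconstructed above (the pairing of the linear pieces with their intervals follows from continuity and monotonicity), together with a numeric check at a few margin values.

```python
import numpy as np

def doom_loss(z, theta):
    """Piecewise linear margin loss C_theta(z) for z in [-1, 1], 0 < theta < 1.

    Monotone upper bound on the 0/1-loss as used by DOOM (values outside [-1, 1] are clipped).
    """
    z = np.clip(np.asarray(z, dtype=float), -1.0, 1.0)
    return np.where(z <= 0.0, 1.1 - 0.1 * z,
           np.where(z <= theta, 1.1 - z / theta,
                    0.1 * (1.0 - z) / (1.0 - theta)))

# The loss decreases continuously from 1.2 at z = -1 to 0 at z = 1:
print(doom_loss([-1.0, 0.0, 0.25, 1.0], theta=0.25))   # -> [1.2, 1.1, 0.1, 0.0]
```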

Fig. 7. [left] The “optimal” margin loss functions CM (z), for M = 20, 50 and 200, compared to the 0/1-loss. [right] Piecewise linear upper bound on the functions CM (z) and the 0/1-loss (Figure taken from [125].)

Despite the above mentioned problems the DOOM approach seems very promising. Experimental results have shown that, in noisy situations, one can considerably improve the performance compared to AdaBoost. However, there is the additional regularization parameter θ (or M ), which needs to be chosen appropriately (e.g. via cross validation).

Linear Programming Approaches. We now discuss ideas aimed at directly controlling the number of margin errors. Doing so, one is able to directly control the contribution of both terms on the r.h.s. of (43) separately. We first discuss an extended linear program, the ν-LP problem, analyze its solution and review a few algorithms for solving this optimization problem. The ν-LP Problem. Let us consider the case where we are given a set H = {hj : x → [−1, 1], j = 1, . . . , J} of J hypotheses. To find the coefficients w for the combined hypothesis fw(x) in (21), we extend the linear problem (30) and solve the following linear optimization problem (see also [17, 154]), which we call the ν-LP problem:

$$\max_{\rho, w, \xi}\;\; \rho - \frac{1}{\nu N}\sum_{n=1}^{N}\xi_n \quad \text{s.t.}\quad y_n f_w(x_n) \ge \rho - \xi_n,\;\; \xi_n \ge 0,\;\; n = 1,\ldots,N; \quad w_j \ge 0,\;\; j = 1,\ldots,J; \quad \sum_{j=1}^{J} w_j = 1, \qquad (45)$$

where ν ∈ (1/N, 1] is a parameter of the algorithm. Here, one does not force all margins to be large and obtains a soft margin hyperplane.16 The dual optimization problem of (45) (cf. (46)) is the same as the edge minimization problem (20) with one additional constraint: dn ≤ 1/(νN). Since dn is the Lagrange multiplier associated with the constraint for the n-th example, its size characterizes how much the example influences the solution of (45). In this sense, the ν-LP also implements the intuition discussed in Section 6.1. In addition, there seems to be a direct connection to the SmoothBoost algorithm. The following proposition shows that ν has an immediate interpretation:

Proposition 2 (ν-Property, e.g. [174, 78, 150]). The solution to the optimization problem (45) possesses the following properties:
1. ν upper-bounds the fraction of margin errors.
2. 1 − ν is greater than the fraction of examples with a margin larger than ρ.

Since the slack variables ξn only enter the cost function linearly, their gradient is constant (once ξn > 0) and the absolute size is not important. Loosely speaking, this is due to the fact that for the optimum of the primal objective function, only derivatives w.r.t. the primal variables matter, and the derivative of a linear function is constant. This can be made more explicit: Any example outside the margin area, i.e. satisfying yn fw(xn) > ρ, can be shifted arbitrarily, as long as it does not enter the margin area. Only if the example is exactly on the edge of the margin area, i.e. yn fw(xn) = ρ, is (almost) no local movement possible without changing the solution. If the example (xn, yn) is in the margin area, i.e. ξn > 0, then it has been shown that almost any local movements are allowed (cf. [147] for details). A Column Generation Approach. Recently, an algorithm for solving (45) has been proposed [48] (see also the early work of [81]). It uses the Column Generation (CG) method known since the 1950's in the field of numerical optimization (e.g. [136], Section 7.4). The basic idea of column generation is to iteratively construct the optimal ensemble for a restricted subset of the hypothesis set, which is iteratively extended. To solve (45), one considers its dual optimization problem:

$$\min_{d,\gamma}\;\; \gamma \quad \text{s.t.}\quad \sum_{n=1}^{N} y_n d_n h_j(x_n) \le \gamma,\;\; j = 1,\ldots,J; \quad d \ge 0,\;\; \sum_{n=1}^{N} d_n = 1; \quad d_n \le \frac{1}{\nu N},\;\; n = 1,\ldots,N. \qquad (46)$$

16 See Figure 4 for the soft margin loss; here one uses a scaled version.
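Since the restricted versions of (45) and (46) are small linear programs, they can be handed to an off-the-shelf LP solver. The sketch below solves the restricted dual (46) with scipy.optimize.linprog and wraps it in the column generation idea outlined above; recovering the primal weights w from the solver's dual values (marginals) and their sign convention are assumptions on our part, and the whole listing is meant only as an illustration, not a production LPBoost implementation.

```python
import numpy as np
from scipy.optimize import linprog

def solve_restricted_dual(U, nu):
    """Solve the restricted dual (46).  U is (N, J) with U[n, j] = y_n * h_j(x_n).

    Variables: [gamma, d_1, ..., d_N].  Returns (gamma, d, w), where w are the
    multipliers of the edge constraints, i.e. the primal ensemble weights.
    """
    N, J = U.shape
    c = np.concatenate(([1.0], np.zeros(N)))                     # minimize gamma
    A_ub = np.hstack([-np.ones((J, 1)), U.T])                    # sum_n d_n U[n, j] - gamma <= 0
    b_ub = np.zeros(J)
    A_eq = np.concatenate(([0.0], np.ones(N)))[None, :]          # sum_n d_n = 1
    bounds = [(None, None)] + [(0.0, 1.0 / (nu * N))] * N        # 0 <= d_n <= 1/(nu N)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    gamma, d = res.x[0], res.x[1:]
    w = np.maximum(-res.ineqlin.marginals, 0.0)                  # dual values; sign convention assumed
    return gamma, d, w / max(w.sum(), 1e-12)

def lpboost(X, y, base_learner, nu=0.2, max_iter=100, tol=1e-6):
    """Column generation sketch: grow the hypothesis set until no constraint of (46) is violated."""
    N = len(y)
    d, gamma = np.full(N, 1.0 / N), -np.inf
    hyps, cols, w = [], [], np.array([])
    for _ in range(max_iter):
        h = base_learner(X, y, d)               # should return a hypothesis with large edge under d
        col = y * h(X)
        if hyps and col @ d <= gamma + tol:     # new hypothesis satisfies its constraint: optimal
            break
        hyps.append(h)
        cols.append(col)
        gamma, d, w = solve_restricted_dual(np.column_stack(cols), nu)
    return hyps, w
```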

An Introduction to Boosting and Leveraging

159

At each iteration t, (46) is solved for a small subset of hypotheses Ht ⊆ H. Then the base learner is assumed to find a hypothesis ht that violates the first constraint in (46) (cf. τ-relaxation in Section 5.4). If there exists such a hypothesis, then it is added to the restricted problem and the process is continued. This corresponds to generating a column in the primal LP (45). If all hypotheses satisfy their constraint, then the current dual variables and the ensemble (primal variables) are optimal, as all constraints of the full master problem are fulfilled. The resulting algorithm is a special case of the set of algorithms known as exchange methods (cf. [93], Section 7 and references therein). These methods are known to converge (cf. [48, 149, 93] for finite and infinite hypothesis sets). To the best of our knowledge, no convergence rates are known, but if the base learner is able to provide "good" hypotheses/columns, then it is expected to converge much faster than simply solving the complete linear program. In practice, it has been found that the column generation algorithm halts at an optimal solution in a relatively small number of iterations. Experimental results in [48, 149] show the effectiveness of the approach. Compared to AdaBoost, one can achieve considerable improvements of the generalization performance [48]. A Barrier Algorithm. Another idea is to derive a barrier algorithm for (45) along the lines of Section 4.4. Problem (45) is reformulated: one removes the constant 1/(νN) in the objective of (45), fixes ρ = 1 and adds the sum of the hypothesis coefficients multiplied with another regularization constant C (instead of 1/(νN)). One can in fact show that the modified problem has the same solution as (45): for any given ν one can find a C such that both problems have the same solution (up to scaling) [48]. The corresponding barrier objective Fβ for this problem can be derived as in Section 4.4. It has two sets of parameters, the combination coefficients w and the slack variables ξ ∈ RN. By setting ∇ξ Fβ = 0, we can find the minimizing ξ for given β and w, which one plugs in and obtains

$$F_\beta(w) = C\sum_{j=1}^{J} w_j \;+\; \beta\sum_{n=1}^{N}\log\!\left(1 + \exp\!\left(\frac{1 - y_n f_w(x_n)}{\beta}\right)\right) \;+\; \beta N. \qquad (47)$$

If one sets β = 1 and C = 0 (i.e. if we do not regularize), then one obtains a formulation which is very close to the logistic regression approach as in [74, 39]. The current approach can indeed be understood as a leveraging algorithm as in Section 5.2 with regularization. Furthermore, if we let β approach zero, then the scaled logistic loss in (47) converges to the soft-margin loss $\sum_{n=1}^{N}\max(0, 1 - y_n f_w(x_n))$. Using the techniques discussed in Section 5, one can easily derive algorithms that optimize (47) for a fixed β up to a certain precision. Here one needs to take care of the ℓ1-norm regularization, which can be directly incorporated into a coordinate descent algorithm (e.g. [147]). The idea is to reduce β when a certain precision is reached, and then the optimization is continued with the modified loss function. If one chooses the accuracy and the rate of decrease of β appropriately, one obtains a practical algorithm for which one can show asymptotic convergence (for finite hypothesis sets see [147, 149]).
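For concreteness, the following lines transcribe the barrier objective (47) as reconstructed above and check numerically that, for small β and C = 0, it is close to the soft-margin (hinge) loss plus βN; the numerically stable evaluation of log(1 + exp(·)) and the toy data are our own additions.

```python
import numpy as np

def barrier_objective(w, H, y, C, beta):
    """F_beta(w) from (47); H is (N, J) with H[n, j] = h_j(x_n), w >= 0."""
    f = H @ w                                   # f_w(x_n)
    z = (1.0 - y * f) / beta
    # stable log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|))
    soft = np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))
    return C * np.sum(w) + beta * np.sum(soft) + beta * len(y)

# As beta -> 0, beta * log(1 + exp((1 - y f)/beta)) -> max(0, 1 - y f):
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 3)); w = np.abs(rng.standard_normal(3)); y = np.sign(rng.standard_normal(5))
hinge = np.sum(np.maximum(0.0, 1.0 - y * (H @ w)))
print(barrier_objective(w, H, y, C=0.0, beta=1e-3) - hinge)     # approximately 5 * beta, i.e. nearly 0
```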


An earlier realization of the same idea was proposed in [154] and termed ν-Arc. The idea was to reformulate the linear program with soft margin into a non-linear program with maximum hard margin, by appropriately redefining the margin (similar to AdaBoostReg in Section 6.1). Then, as a heuristic, Arc-GV (cf. Section 4.3) was employed to maximize this newly defined margin.17 The barrier approach and ν-Arc led to considerable improvements compared to AdaBoost on a large range of benchmark problems [154, 147]. Since they try to solve a similar optimization problem as the column generation algorithm, one would expect very minor differences. In practice, the column generation algorithm often converges faster; however, it has a problem in combination with some base learners [48]: in the CG approach, most of the example weights will be zero after a few iterations (since the margin of these examples is larger than ρ and the margin constraints are not active). Then the base learner may have problems generating good hypotheses. Since the barrier approach is much "smoother", this effect is reduced considerably.

6.3 Regularization Terms and Sparseness

We will now discuss two main choices of regularizers that have a drastic influence on the properties of the solution of regularized problems. Let us consider optimization problems as in Section 5 with a regularization term in the objective:18

$$\min_{w}\;\; \sum_{n=1}^{N} g(f_w(x_n), y_n) + C\sum_{j=1}^{J} p(w_j) \quad \text{s.t.}\quad 0 \le w \in \mathbb{R}^{J}, \qquad (48)$$

where C is a regularization constant, g is a loss function and p : R+ → R+ is a differentiable and monotonically increasing function with p(0) = 0. For simplicity of the presentation we assume the hypothesis set is finite, but the statements below also hold for countable hypothesis sets; in the case of uncountable sets some further assumptions are required (cf. [93], Theorem 4.2, and [149, 148, 198]). We say a set of hypothesis coefficients (see (21)) is sparse if it contains O(N) non-zero elements. The set is not sparse if it contains e.g. O(J) non-zero elements, since the feature space is assumed to be much higher dimensional than N (e.g. infinite dimensional). We are interested in formulations of the form (48) for which the optimization problem is tractable, and which, at the same time, lead to sparse solutions. The motivation for the former aspect is clear, while the motivation for sparsity is that it often leads to superior generalization results (e.g. [91, 90]) and also to smaller, and therefore computationally more efficient, ensemble hypotheses. Moreover, in the case of infinite hypothesis spaces, sparsity leads to a precise representation in terms of a finite number of terms.

17 A Matlab implementation can be downloaded at http://mlg.anu.edu.au/~raetsch/software.
18 Here we force the w's to be non-negative, which can be done without loss of generality, if the hypothesis set is closed under negation.


Let us first consider the ℓ1-norm as regularizer, i.e. p(wj) = wj. Assume further that w∗ is the optimal solution to (48) for some C ≥ 0. Since the regularizer is linear, there will be a (J − 1)-dimensional subspace of w's having the same ℓ1-norm as w∗. By further restricting fw(xn) = fw∗(xn), n = 1, . . . , N, one obtains N additional constraints. Hence, one can choose a w from a (J − N − 1)-dimensional space that leads to the same objective value as w∗. Therefore, there exists a solution that has at most N + 1 non-zero entries and, hence, the solution is sparse. This observation holds for arbitrary loss functions and any concave regularizer such as the ℓ1-norm [147]. Note that N + 1 is an upper bound on the number of non-zero elements – in practice much sparser solutions are observed [37, 48, 23, 8]. In neural networks, SVMs, matching pursuit (e.g. [118]) and many other algorithms, one uses the ℓ2-norm for regularization. In this case the optimal solution w∗ can be expressed as a linear combination of the mapped examples in feature space (cf. (26)),

$$w^{*} = \sum_{n=1}^{N} \beta_n \Phi(x_n).$$

Hence, if the vectors Φ(xn) are not sparse and the β's are not all zero, the solution w∗ is also unlikely to be sparse (since it is a linear combination of non-sparse vectors). Consider, for instance, the case where the J vectors (hj(x1), . . . , hj(xN)) (j = 1, . . . , J), interpreted as points in an N-dimensional space, are in general position (any subset of N points spans the full N-dimensional space; for instance, this occurs with probability 1 for points drawn independently at random from a probability density supported over the whole space). If J ≫ N, then one can show that there is a fraction of at most O(N/(J − N)) coefficients that are zero. Hence, for large J almost all coefficients are non-zero and the solution is not sparse. This holds not only for the ℓ2-norm regularization, but for any other strictly convex regularizer [147]. This observation can be intuitively explained as follows: There is a (J − N)-dimensional subspace leading to the same output fw(xn) on the training examples. Assume the solution is sparse and one has a small number of large weights. If the regularizer is strictly convex, then one can reduce the regularization term by distributing large weights over many other weights which were previously zero (while keeping the loss term constant). We can conclude that the type of regularizer determines whether there exists an optimal hypothesis coefficient vector that is sparse or not. Minimizing a convex loss function with a strictly concave regularizer will generally lead to non-convex optimization problems with potentially many local minima. Fortunately, the ℓ1-norm is concave and convex. By employing the ℓ1-norm regularization, one can obtain sparse solutions while retaining the computational tractability of the optimization problem. This observation seems to lend strong support to the ℓ1-norm regularization.
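The sparsity contrast is easy to observe numerically. The toy experiment below fits the same heavily overcomplete linear model (J ≫ N) with an ℓ1 penalty (Lasso) and an ℓ2 penalty (Ridge) and counts the non-zero coefficients; the particular solvers, data and threshold are arbitrary choices made only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, J = 50, 500                                   # many more base functions than examples
H = rng.standard_normal((N, J))                  # columns play the role of base hypotheses h_j(x_n)
y = rng.standard_normal(N)

l1 = Lasso(alpha=0.05, max_iter=50000).fit(H, y)
l2 = Ridge(alpha=0.05).fit(H, y)

nnz = lambda coef: int(np.sum(np.abs(coef) > 1e-8))
print("l1-penalized non-zeros:", nnz(l1.coef_))  # typically at most N + 1, often far fewer
print("l2-penalized non-zeros:", nnz(l2.coef_))  # typically all J coefficients are non-zero
```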

7 Extensions

The bulk of the work discussed in this review deals with the case of binary classification and supervised learning. However, the general Boosting framework is much more widely applicable. In this section we present a brief survey of several extensions and generalizations, although many others exist, e.g. [76, 158, 62, 18, 31, 3, 155, 15, 44, 52, 82, 164, 66, 38, 137, 147, 16, 80, 185, 100, 14].

7.1 Single Class

A classical unsupervised learning task is density estimation. Assuming that the unlabeled observations x1, . . . , xN were generated independently at random according to some unknown distribution P(x), the task is to estimate its related density function. However, there are several difficulties with this task. First, a density function need not always exist — there are distributions that do not possess a density. Second, estimating densities exactly is known to be a hard task. In many applications it is enough to estimate the support of a data distribution instead of the full density. In the single-class approach one avoids solving the harder density estimation problem and concentrates on the simpler task, i.e. estimating quantiles of the multivariate distribution. So far there are two independent algorithms to solve the problem in a boosting-like manner. They mainly follow the ideas proposed in [184, 173] for kernel feature spaces, but have the benefit of better interpretability of the generated combined hypotheses (see discussion in [150]). The algorithms proposed in [32] and [150] differ in the way they solve very similar optimization problems – the first uses column generation techniques whereas the latter uses barrier optimization techniques (cf. Section 6.2). For brevity we will focus on the underlying idea and not the algorithmic details. As in the SVM case [173], one computes a hyperplane in the feature space (here spanned by the hypothesis set H, cf. Section 4.2) such that a pre-specified fraction of the training examples will lie beyond that hyperplane, while at the same time demanding that the hyperplane has maximal ℓ∞-distance (margin) to the origin – for an illustration see Figure 8. This is realized by solving the following linear optimization problem:

$$\min_{w \in \mathbb{R}^{J},\, \xi \in \mathbb{R}^{N},\, \rho \in \mathbb{R}}\;\; -\nu\rho + \frac{1}{N}\sum_{n=1}^{N}\xi_n \quad \text{s.t.}\quad \sum_{j=1}^{J} w_j h_j(x_n) \ge \rho - \xi_n,\;\; n = 1,\ldots,N; \quad \|w\|_1 = 1,\;\; \xi_n \ge 0,\;\; n = 1,\ldots,N, \qquad (49)$$

which is essentially the same as in the two-class soft-margin approach (cf. Section 6.2), but with all labels equal to +1. Hence, the derivation of appropriate optimization algorithms is very similar.
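Like (45), problem (49) is a plain linear program once the base hypotheses have been evaluated on the (unlabeled) data, so it can be solved directly with an off-the-shelf LP solver. The encoding below, with variables ordered as [ρ, w, ξ] and w constrained to be non-negative so that Σj wj = ‖w‖₁ (cf. footnote 18), is our own sketch.

```python
import numpy as np
from scipy.optimize import linprog

def single_class_lp(H, nu):
    """Solve (49).  H is (N, J) with H[n, j] = h_j(x_n); nu in (0, 1]."""
    N, J = H.shape
    c = np.concatenate(([-nu], np.zeros(J), np.full(N, 1.0 / N)))     # minimize -nu*rho + (1/N) sum xi
    A_ub = np.hstack([np.ones((N, 1)), -H, -np.eye(N)])               # rho - sum_j w_j h_j(x_n) - xi_n <= 0
    b_ub = np.zeros(N)
    A_eq = np.concatenate(([0.0], np.ones(J), np.zeros(N)))[None, :]  # ||w||_1 = 1 (assuming w >= 0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] + [(0, None)] * (J + N), method="highs")
    rho, w = res.x[0], res.x[1:1 + J]
    return rho, w
```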

7.2 Multi-class

There has been much recent interest in extending Boosting to multi-class problems, where the number of classes exceeds two. Let S = {(x1 , y 1 ), . . . , (xN , y N )}


Fig. 8. Illustration of single-class idea. A hyperplane in the feature space F is constructed that maximizes the distance to the origin while allowing for ν outliers. (Figure taken from [134].)

be a data set, where, for a K-class problem, yn is a K-dimensional vector of the form (0, 0, . . . , 0, 1, 0, . . . , 0) with a 1 in the k-th position if, and only if, xn belongs to the k-th class. The log-likelihood function can then be constructed as [20] (see also Page 84)

$$G = \sum_{n=1}^{N}\sum_{k=1}^{K}\left\{ y_{n,k}\log p(k|x_n) + (1 - y_{n,k})\log(1 - p(k|x_n)) \right\}, \qquad (50)$$

where yn,k is the k'th component of the vector yn, and p(k|xn) is the model probability that xn belongs to the k'th class. Using the standard softmax representation,

$$p(k|x) = \frac{e^{F_k(x)}}{\sum_{k'=1}^{K} e^{F_{k'}(x)}},$$

and substituting in (50), we obtain a function similar in form to others considered so far. This function can then be optimized in a stagewise fashion as described in Section 5.2, yielding a solution to the multi-class problem. This approach was essentially the one used in [74]. Several other approaches to multi-category classification were suggested in [168] and applied to text classification in [169]. Recall that in Boosting, the weights of examples which are poorly classified by previous weak learners become amplified. The basic idea in the methods suggested in [168] was to maintain a set of weights for both examples and labels. As boosting progresses, training examples and their corresponding labels that are hard to predict correctly get increasingly higher weights. This idea led to the development of two multi-category algorithms, titled AdaBoost.MH and AdaBoost.MR. Details of these algorithms can be found in [168].
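A direct transcription of the softmax model and the log-likelihood (50): F is an N × K matrix holding the per-class ensemble outputs F_k(x_n) and Y the one-hot label matrix described above; the shift and clipping constants are our own additions for numerical stability.

```python
import numpy as np

def softmax_probs(F):
    """p(k|x_n) = exp(F_k(x_n)) / sum_k' exp(F_k'(x_n)); F has shape (N, K)."""
    F = F - F.max(axis=1, keepdims=True)      # shift for numerical stability
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(F, Y, eps=1e-12):
    """G from (50): sum_{n,k} y_{n,k} log p(k|x_n) + (1 - y_{n,k}) log(1 - p(k|x_n))."""
    P = np.clip(softmax_probs(F), eps, 1.0 - eps)
    return float(np.sum(Y * np.log(P) + (1.0 - Y) * np.log(1.0 - P)))
```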


In addition to these approaches, there has recently been extensive activity on recoding multi-class problems as a sequence of binary problems, using the idea of error-correcting codes [52]. A detailed description of the approach can be found in [3] in a general context as well as in the Boosting setup. Two special cases of this general approach are the so-called one-against-all and all-pairs approaches, which were extensively used prior to the development of the error correcting code approach. In the former case, given a k-class classification problem, one constructs k (usually soft) binary classifiers, each of which learns to distinguish one class from the rest. The multi-category classifier then uses the highest ranking soft classifier. In the all-pairs approach, all possible $\binom{k}{2}$ pairs of classification problems are learned and used to form the multi-category classifier using some form of majority voting. Using soft classifiers helps in removing possible redundancies. Connections of multi-class boosting algorithms with column generation techniques are discussed in [155]. Finally, we comment that an interesting problem relates to situations where the data is not only multi-class, but is also multi-labeled in the sense that each input x may belong to several classes [168]. This is particularly important in the context of text classification, where a single document may be relevant to several topics, e.g. sports, politics and violence.

7.3 Regression

The learning problem for regression is based on a data set $S = \{(x_n, y_n)\}_{n=1}^{N}$, where, in contrast to the classification case, the variable y can take on real values rather than being restricted to a finite set. Probably the first approach to addressing regression in the context of Boosting appeared in [70], where the problem was addressed by translating the regression task into a classification problem (see also [18]). Most of the Boosting algorithms for binary classification described in this review are based on a greedy stage-wise minimization of a smooth cost function. It is therefore not surprising that we can directly use such a smooth cost function in the context of regression, where we are trying to estimate a smooth function. Probably the first work to discuss regression in the context of stage-wise minimization appeared in [76] (LS-Boost and other algorithms), and it was later further extended by others (e.g. [158, 182, 195]). The work in [156] further addressed regression in connection with barrier optimization (essentially with a two-sided soft-margin loss – the ε-insensitive loss). This approach is very similar in spirit to one of the approaches proposed in [57] (see below). Later it was extended in [149, 147] to arbitrary strictly convex loss functions and discussed in connection with infinite hypothesis sets and semi-infinite programming. The resulting algorithms were shown to be very useful in real world problems. We briefly describe a regression framework recently introduced in [57], where several Boosting-like algorithms were introduced. All these algorithms operate using the following scheme, where different cost functions are used for each algorithm. (i) At each iteration the algorithm modifies the sample to produce a new sample S̃ = {(x1, ỹ1), . . . , (xN, ỹN)} where the y's have been modified based on the residual error at each stage. (ii) A distribution d over the modified sample S̃ is constructed and a base regression algorithm is called. (iii) The base learner produces a function f ∈ F with some edge on S̃ under d (for a definition of an edge in the regression context see [57]).


(iv) A new composite regressor of the form F + αf is formed, where α is selected by solving a one-dimensional optimization problem. Theoretical performance guarantees for the regression algorithms introduced in [57] were analyzed in a similar fashion to that presented in Section 3 for binary classification. Under the assumption that a strictly positive edge was obtained at each step, it was shown that the training error converges exponentially fast to zero (this is analogous to Theorem 1 for classification). Significantly, the optimality of the base learners was not required for this claim, similarly to related results described in Section 5 for classification. Finally, generalization error bounds were derived in [57], analogous to Theorem 3. These bounds depend on the edge achieved by the weak learner.
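The stagewise scheme just described can be written down generically. The sketch below follows the squared-loss case (in the spirit of LS-Boost): the modified sample consists of the current residuals, the base learner is fit to them, and the step size α is obtained by a one-dimensional line search. It illustrates the general scheme only and is not one of the specific algorithms of [57]; all names are our own.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def boost_regression(X, y, base_learner, T=100):
    """Greedy stagewise regression: F <- F + alpha * f, with f fitted to the residuals.

    base_learner(X, targets) must return a callable f with f(X) -> predictions.
    """
    F = np.zeros(len(y))
    ensemble = []                                   # list of (alpha, f)
    for _ in range(T):
        residual = y - F                            # modified sample y~_n for the squared loss
        f = base_learner(X, residual)
        pred = f(X)
        # one-dimensional optimization of the step size alpha
        alpha = minimize_scalar(lambda a: np.sum((y - (F + a * pred)) ** 2)).x
        F += alpha * pred
        ensemble.append((alpha, f))
    return ensemble
```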

7.4 Localized Boosting

All of the Boosting approaches discussed in this review construct a composite ensemble hypothesis of the form $\sum_t \alpha_t h_t(x)$. Observe that the combination parameters αt do not depend on the input x, and thus contribute equally at each point in space. An alternative approach to the construction of composite classifiers would be to allow the coefficients αt themselves to depend on x, leading to a hypothesis of the form $\sum_t \alpha_t(x) h_t(x)$. The interpretation of this form of functional representation is rather appealing. Assume that each hypothesis ht(x) represents an expert, while αt(x) assigns a non-negative weight which is attached to the prediction of the t-th expert, where we assume that $\sum_t \alpha_t(x) = 1$. Given an input x, each of the experts makes a prediction ht(x), where the prediction is weighted by a 'confidence' parameter αt(x). Note that if the αt(x) are indicator functions, then for each input x there is a single hypothesis ht which is considered. For linear hypotheses we can think of this approach as the representation of a general non-linear function by piece-wise linear functions. This type of representation forms the basis for the so-called mixture of experts models (e.g. [99]). An important observation concerns the complexity of the functions αt(x). Clearly, by allowing these functions to be arbitrarily complex (e.g. δ-functions), we can easily fit any finite data set. This stresses the importance of regularization in any approach attempting to construct such a mixture of experts representation. A Boosting-like approach to the construction of mixture of experts representations was proposed in [129] and termed Localized Boosting. The basic observations in this work were the relationship to classic mixture models in statistics, where the EM algorithm has proven very effective, and the applicability of the general greedy stagewise gradient descent approaches described in Section 5.2. The algorithm developed in [129], termed LocBoost, can be thought of as a stagewise EM algorithm, where, similarly to Boosting algorithms, a single hypothesis is added to the ensemble at each stage, and the functions αt(·) are also estimated at each step. Regularization was achieved by restricting αt to simple parametric forms. In addition to the algorithm described in detail in [129],


generalization bounds similar to Theorem 3 were established. The algorithm has been applied to many real-world data sets, leading to performance competitive with other state-of-the-art algorithms.19

19 A Matlab implementation of the algorithm, as part of an extensive Pattern Recognition toolbox, can be downloaded from http://tiger.technion.ac.il/~eladyt/classification/index.htm.
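To illustrate the functional form Σt αt(x) ht(x), the sketch below combines a fixed list of hypotheses with input-dependent weights obtained from a softmax over simple linear gating functions. The gating parameterization is an arbitrary choice of ours for illustration and is not the LocBoost estimation procedure.

```python
import numpy as np

def localized_prediction(x, hypotheses, gate_params):
    """Evaluate f(x) = sum_t alpha_t(x) h_t(x) with softmax gates alpha_t(x).

    hypotheses: list of callables h_t(x) -> real value.
    gate_params: list of (v_t, b_t); the gate score for expert t is <v_t, x> + b_t.
    """
    scores = np.array([v @ x + b for (v, b) in gate_params])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # non-negative weights summing to one
    return sum(a * h(x) for a, h in zip(alpha, hypotheses))
```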

7.5 Other Extensions

We briefly mention other extensions of Boosting algorithms. In [140] a method was introduced for learning multiple models that use different (possibly overlapping) sets of features. In this way, more robust learning algorithms can be constructed. An algorithm combining Bagging and Boosting, aimed at improving performance in the case of noisy data was introduced in [109]. The procedure simply generates a set of Bootstrap samples as in Bagging, generates a boosted classifier for each sample, and combines the results uniformly. An online version of a Boosting algorithm was presented in [141], which was shown to be comparable in accuracy to Boosting, while much faster in terms of running time. Many more extensions are listed at the beginning of the present section.

8 Evaluation and Applications

8.1 On the Choice of Weak Learners for Boosting

A crucial ingredient for the successful application of Boosting algorithms is the construction of a good weak learner. As pointed out in Section 3, a weak learner which is too weak cannot guarantee adequate performance of the composite hypothesis. On the other hand, an overly complex weak learner may lead to overfitting, which becomes even more severe as the algorithm proceeds. Thus, as stressed in Section 6, regularization, in addition to an astute choice of weak learner, plays a key role in successful applications. Empirically we have often observed that base learners that already perform quite well but are slightly too simple for the data at hand are best suited for use with boosting. When one uses bagging, base learners that are slightly too complex tend to perform best. Additionally, real-valued (soft) hypotheses often lead to considerably better results. We briefly mention some weak learners which have been used successfully in applications. Decision trees and stumps. Decision trees have been widely used for many years in the statistical literature (e.g. [29, 144, 85]) as powerful, effective and easily interpretable classification algorithms that are able to automatically select relevant features. Hence, it is not surprising that some of the most successful initial applications of Boosting algorithms were performed using decision trees as weak learners. Decision trees are based on the recursive partition of the input space into a set of nested regions, usually of rectangular shape.



The so-called decision stump is simply a one-level decision tree, namely a classifier formed by splitting the input space in an axis-parallel fashion once and then halting. A systematic approach to the effective utilization of trees in the context of Boosting is given in [168], while several other approaches based on logistic regression with decision trees are described in [76]. This work showed that Boosting significantly enhances the performance of decision trees and stumps. Neural networks. Neural networks (e.g. [88, 20]) were extensively used during the 1990s in many applications. The feed-forward neural network, by far the most widely used in practice, is essentially a highly non-linear function representation formed by repeatedly combining a fixed non-linear transfer function. It has been shown in [113] that any real-valued continuous function over R^d can be arbitrarily well approximated by a feed-forward neural network with a single hidden layer, as long as the transfer function is not a polynomial. Given the potential power of neural networks in representing arbitrary continuous functions, it would seem that they could easily lead to overfitting and not work effectively in the context of Boosting. However, careful numerical experiments conducted by [176] demonstrated that AdaBoost can significantly improve the performance of neural network classifiers in real world applications such as OCR. Kernel Functions and Linear Combinations. One of the problems of using neural networks as weak learners is that the optimization problems often become unwieldy. An interesting alternative is forming a weak learner by linearly combining a set of fixed functions, as in

$$h^{\beta}(x) := \sum_{k=1}^{K} \beta_k G_k(x), \qquad \beta \in \mathbb{R}^{K}. \qquad (51)$$

The functions Gk could, for example, be kernel functions, i.e. Gk(x) = k(x, xk), centered around the training examples. The set {h^β} is an infinite hypothesis set and is unbounded if β is unbounded. Hence, maximizing the edge as discussed in Section 5.4 would lead to diverging β's. Keeping in mind the overfitting issues discussed in Section 6, we attempt to restrict the complexity of the class of functions by limiting the norm of β. We discuss two approaches. ℓ1-norm Penalized Combinations. Here we set H := {h^β | ‖β‖₁ ≤ 1, β ∈ R^N}. Then finding the edge-maximizing β has a closed form solution. Let

$$k^{*} = \operatorname*{argmax}_{k=1,\ldots,K} \left| \sum_{n=1}^{N} d_n G_k(x_n) \right|.$$

Then h^β with β = (0, . . . , 0, β_{k*}, 0, . . . , 0) maximizes the edge, where $\beta_{k^*} = \operatorname{sign}\big(\sum_{n=1}^{N} d_n k(x_{k^*}, x_n)\big)$. This means that we will be adding in exactly one basis function G_{k*}(·) per iteration. Hence, this approach reduces to the case of using the single functions.
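The closed-form selection above amounts to picking the dictionary function with the largest absolute weighted sum; a minimal sketch (the matrix G of function values and the names are ours, and we assume the weights d already carry whatever label information enters the edge):

```python
import numpy as np

def best_single_function(G, d):
    """G is (N, K) with G[n, k] = G_k(x_n); d are the current example weights.

    Returns (k*, beta_{k*}) for the edge-maximizing ell_1-constrained combination.
    """
    scores = G.T @ d                      # sum_n d_n G_k(x_n) for every k
    k_star = int(np.argmax(np.abs(scores)))
    return k_star, float(np.sign(scores[k_star]))
```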


A variant of this approach has been used in [183]. A related classification function approach was used for solving a drug discovery task in [149]. In [149] so-called active kernel functions have been used, where the set of functions were kernel-like functions h_{n,p}(x) = k_p(x, x_n), and k_p is an SVM kernel function parameterized by p (e.g. the variance in an RBF kernel). In this case one has to find the example x_n and the parameter p that maximize the edge. This may lead to a hard non-convex optimization problem, for which quite efficient heuristics have been proposed in [149]. ℓ2-norm Penalized Combinations. Another way is to penalize the ℓ2-norm of the combination coefficients. In this case H := {h^γ | ‖γ‖₂ ≤ 1, γ ∈ R^N}, and one attempts to solve

$$\gamma = \operatorname*{argmin}_{\gamma}\; \sum_{n=1}^{N}\left(d_n - h^{\gamma}(x_n)\right)^{2} + C\|\gamma\|_2^{2}.$$

The problem has a particularly simple solution: $\gamma_k = \frac{1}{C}\sum_{n=1}^{N} d_n G_k(x_n)$. A similar approach in the form of RBF networks has been used in [153] in connection with regularized versions of boosting.
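The ℓ2-penalized combination likewise admits the closed form stated above; as a one-line sketch (same assumptions on G and d as before):

```python
import numpy as np

def l2_penalized_weak_learner(G, d, C):
    """Closed form gamma_k = (1/C) * sum_n d_n G_k(x_n); G is (N, K), d the example weights."""
    return (G.T @ d) / C
```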

8.2 Evaluation on Benchmark Data Sets

Boosting and Leveraging have been applied to many benchmark problems. The AdaBoost algorithm and many of its early variants were tested on standard data sets from the UCI repository, and often found to compare favorably with other state of the art algorithms (see, for example, [68, 168, 51]). However, it was clear from [146, 168, 51] that AdaBoost tends to overfit if the data is noisy and no regularization is enforced. More recent experiments, using the regularized forms of Boosting described in Section 6, lead to superior performance on noisy real-world data. In fact, these algorithms often do significantly better than the original versions of Boosting, comparing very favorably with the best classifiers available (e.g. SVMs). We refer the reader to [153] for results of these benchmark studies. These results place regularized boosting techniques into the standard toolbox of data analysis techniques.

8.3 Applications

The list of applications of Boosting and Leveraging methods is growing rapidly. We discuss three applications in detail and then list a range of other applications. Non-intrusive Power Monitoring System. One of the regularized approaches described in Section 6, ν-Arc, was applied to the problem of power appliance monitoring [139]. The most difficult problem for power companies is the handling of short-term peak loads, for which additional power plants need to be built to provide security against a peak-load-instigated power failure.


A prerequisite for controlling the electric energy demand, however, is the ability to correctly and non-intrusively detect and classify the operating status of electric appliances of individual households. The goal in [139] was to develop such a non-intrusive measuring system for assessing the status of electric appliances. This is a hard problem, in particular for appliances with inverter systems,20 whereas non-intrusive measuring systems have already been developed for conventional on/off (non-inverter) operating electric equipment (cf. [83, 33]). The study in [139] presents a first evaluation of machine learning techniques to classify the operating status of electric appliances (with and without inverter) for the purpose of constructing a non-intrusive monitoring system. In this study, RBF networks, K-nearest neighbor classifiers (KNNs) (e.g. [42]), SVMs and ν-Arc (cf. Section 6.2) were compared. The data set available for this task is rather small (36 examples), since the collection and labeling of data is manual and therefore expensive. As a result, one has to be very careful with finding good model parameters. All model parameters were found using the computationally expensive but generally reliable leave-one-out method. The results reported in [139] demonstrate that the ν-Arc algorithm with RBF networks as base learner performs better on average than all other algorithms studied, followed closely by the SVM classifier with an RBF kernel. These results suggest that the goal of a control system to balance load peaks might be a feasible prospect in the not too distant future. Tumor Classification with Gene Expression Data. Micro-array experiments generate large datasets with expression values for thousands of genes but not more than a few dozen examples. Accurate supervised classification of tissue samples in such high-dimensional problems is difficult but often crucial for successful diagnosis and treatment (in typical cases the sample size is in the range of 20-100 and the number of features varies between 2,000 and 20,000; clearly here the potential for overfitting is huge). The goal is to predict the unknown class label of a new individual on the basis of its gene expression profile. Since this task is of great potential value, there have been many attempts to develop effective classification procedures to solve it. Early work applied the AdaBoost algorithm to this data; however, the results seemed to be rather disappointing. The recent work in [49] applied the LogitBoost algorithm [74], using decision trees as base learners, together with several modifications, and achieved state of the art performance on this difficult task. It turned out that in order to obtain high quality results, it was necessary to preprocess the data by scoring each individual feature (gene) according to its discriminatory power using a non-parametric approach (details can be found in [49]). Moreover, it was found that the simple one-against-all approach to multi-category classification led to much better results than the direct multi-class approach presented in [74] based on the log-likelihood function (cf. Section 7.2).

20 An inverter system controls, for example, the rotation speed of a motor (as in air-conditioners) by changing the frequency of the electric current.


Interestingly, the authors found that the quality of the results degraded very little as a function of the number of Boosting iterations (up to 100 steps). This is somewhat surprising given the small amount of data and the danger of overfitting. The success of the present approach compared to results achieved by AdaBoost tends to corroborate the assertions made in [74] concerning the effectiveness of the LogitBoost approach for noisy problems.

Text Classification. The problem of text classification is playing an increasingly important role due to the vast amount of data available over the web and within internal company repositories. The problem here is particularly challenging since the text data is often multi-labeled, namely each text may naturally fall into several categories simultaneously (e.g. Sports, Politics and Violence). In addition, the difficulty of finding an appropriate representation for text is still open. The work in [169] presented one of the first approaches to using Boosting for text classification. In particular, the approaches to multi-class multi-label classification developed in [168], were used for the present task. The weak learner used was a simple decision stump (single level decision tree), based on terms consisting of single words and word pairs. The text categorization experiments reported in [169] were applied to several of the standard text classification benchmarks (Reuters, AP titles and UseNet groups) and demonstrated that the approach yielded, in general, better results than other methods to which it was compared. Additionally, it was observed that Boosting algorithms which used real-valued (soft) weak learners performed better than algorithms using only binary weak learners. A reported drawback of the approach was the very large time required for training.

Other applications. We briefly mention several applications of Boosting and Leveraging methods in other problems. The group at AT&T has been involved in many applications of Boosting approaches beyond the text classification task discussed above. For example, the problems of text filtering [170] and routing [95] were addressed, as well as that of ranking and combining references [66]. More recently the problem of combining prior knowledge and boosting for call classification in spoken language dialogue was studied in [161], and applications to the problem of modeling auction price uncertainty were introduced in [171]. Applications of boosting methods to natural language processing have been reported in [1, 60, 84, 194], and approaches to Melanoma Diagnosis are presented in [132]. Some further applications to Pose Invariant Face Recognition [94], Lung Cancer Cell Identification [200] and Volatility Estimation for Financial Time Series [7] have also been developed. A detailed list of currently known applications of Boosting and Leveraging methods will be posted on the web at the Boosting homepage http://www.boosting.org/applications.

9 Conclusions

We have presented an overview of ensemble methods in general and the Boosting and Leveraging family of algorithms in particular. While Boosting was introduced within a rather theoretical framework in order to transform a poorly performing learning algorithm into a powerful one, this idea has turned out to have manifold extensions and applications, as discussed in this survey. As we have shown, Boosting belongs to a large family of models which are greedily constructed by adding a single base learner to a pool of previously constructed learners using adaptively determined weights. Interestingly, Boosting was shown to be derivable as a stagewise greedy gradient descent algorithm attempting to minimize a suitable cost function. In this sense Boosting is strongly related to other algorithms that were known within the Statistics literature for many years, in particular additive models [86] and matching pursuit [118]. However, the recent work on Boosting has brought to the fore many issues which were not studied previously. (i) The important concept of the margin, and its impact on learning and generalization, has been emphasized. (ii) The derivation of sophisticated finite sample data-dependent bounds has been possible. (iii) Understanding the relationship between the strength of the weak learner and the quality of the composite classifier (in terms of training and generalization errors). (iv) The establishment of consistency. (v) The development of computationally efficient procedures. With the emphasis of much of the Boosting work on the notion of the margin, it became clear that Boosting is strongly related to another very successful current algorithm, namely the Support Vector Machine [190]. As was pointed out in [167, 134, 150], both Boosting and the SVM can be viewed as attempting to maximize the margin, except that the norm used by each procedure is different. Moreover, the optimization procedures used in both cases are very different. While Boosting constructs a complex composite hypothesis, which can in principle represent highly irregular functions, the generalization bounds for Boosting turn out to be tight in cases where large margins can be guaranteed. Although initial work seemed to indicate that Boosting does not overfit, it was soon realized that overfitting does indeed occur under noisy conditions. Following this observation, regularized Boosting algorithms were developed which are able to achieve the appropriate balance between approximation and estimation required to achieve excellent performance even under noisy conditions. Regularization is also essential in order to establish consistency under general conditions. We conclude with several open questions. 1. While it has been possible to derive Boosting-like algorithms based on many types of cost functions, there does not seem to be at this point a systematic approach to the selection of a particular one. Numerical experiments and some theoretical results indicate that the choice of cost function may have a significant effect on the performance of Boosting algorithms (e.g. [74, 126, 154]). 2. The selection of the best type of weak learner for a particular task is also not entirely clear. Some weak learners are unable even in principle to represent complex decision boundaries, while overly complex weak learners quickly lead to overfitting.


This problem appears strongly related to the notoriously difficult problem of feature selection and representation in pattern recognition, and the selection of the kernel in support vector machines. Note, however, that the single weak learner in Boosting can include multi-scaling information, whereas in SVMs one has to fix a kernel inducing the kernel Hilbert space. An interesting question relates to the possibility of using very different types of weak learners at different stages of the Boosting algorithm, each of which may emphasize different aspects of the data. 3. An issue related to the previous one is the question of the existence of weak learners with provable performance guarantees. In Section 3.1 we discussed sufficient conditions for the case of linear classifiers. The extension of these results to general weak learners is an interesting and difficult open question. 4. A great deal of recent work has been devoted to the derivation of flexible data-dependent generalization bounds, which depend explicitly on the algorithm used. While these bounds are usually much tighter than classic bounds based on the VC dimension, there is still ample room for progress here, the final objective being to develop bounds which can be used for model selection in actual experiments on real data. Additionally, it would be interesting to develop bounds or efficient methods to compute the leave-one-out error, as done for SVMs in [36]. 5. In the optimization section we discussed many different approaches addressing the convergence of boosting-like algorithms. The result presented in [197] is the most general so far, since it includes many special cases which have been analyzed by others. However, the convergence rates in [197] do not seem to be optimal, and additional effort needs to be devoted to finding tight bounds on the performance. In addition, there is the question of whether it is possible to establish super-linear convergence for some variant of leveraging, which ultimately would lead to much more efficient leveraging algorithms. Finally, since many algorithms use parameterized weak learners, it is often the case that the cost function minimized by the weak learners is not convex with respect to the parameters (see Section 5.4). It would be interesting to see whether this problem could be circumvented (e.g. by designing appropriate cost functions as in [89]). Acknowledgements. We thank Klaus-R. Müller for discussions and his contribution to writing this manuscript. Additionally, we thank Shie Mannor, Sebastian Mika, Takashi Onoda, Bernhard Schölkopf, Alex Smola and Tong Zhang for valuable discussions. R.M. acknowledges partial support from the Ollendorff Center at the Electrical Engineering department at the Technion and from the fund for promotion of research at the Technion. G.R. gratefully acknowledges partial support from DFG (JA 379/9-1, MU 987/1-1), NSF and EU (NeuroColt II). Furthermore, Gunnar Rätsch would like to thank UC Santa Cruz, CRIEPI in Tokyo and Fraunhofer FIRST in Berlin for their warm hospitality.


References
1. S. Abney, R.E. Schapire, and Y. Singer. Boosting applied to tagging and pp attachment. In Proc. of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
2. H. Akaike. A new look at the statistical model identification. IEEE Trans. Automat. Control, 19(6):716–723, 1974.
3. E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
4. M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
5. A. Antos, B. Kégl, T. Linder, and G. Lugosi. Data-dependent margin-based generalization bounds for classification. JMLR, 3:73–98, 2002.
6. J.A. Aslam. Improving algorithms for boosting. In Proc. COLT, San Francisco, 2000. Morgan Kaufmann.
7. F. Audrino and P. Bühlmann. Volatility estimation with functional gradient descent for very high-dimensional financial time series. Journal of Computational Finance, 2002. To appear. See http://stat.ethz.ch/~buhlmann/bibliog.html.
8. J.P. Barnes. Capacity control in boosting using a p-convex hull. Master's thesis, Australian National University, 1999. Supervised by R.C. Williamson.
9. P. Bartlett, P. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002.
10. P.L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher averages. In Proceedings COLT'02, volume 2375 of LNAI, pages 44–58, Sydney, 2002. Springer.
11. P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2002. To appear 10/02.
12. E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36:105–142, 1999.
13. H.H. Bauschke and J.M. Borwein. Legendre functions and the method of random Bregman projections. Journal of Convex Analysis, 4:27–67, 1997.
14. S. Ben-David, P. Long, and Y. Mansour. Agnostic boosting. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 507–516, 2001.
15. K.P. Bennett and O.L. Mangasarian. Multicategory separation via linear programming. Optimization Methods and Software, 3:27–39, 1993.
16. K.P. Bennett, A. Demiriz, and R. Maclin. Exploiting unlabeled data in ensemble methods. In Proc. ICML, 2002.
17. K.P. Bennett and O.L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.
18. A. Bertoni, P. Campadelli, and M. Parodi. A boosting algorithm for regression. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, editors, Proceedings ICANN'97, Int. Conf. on Artificial Neural Networks, volume V of LNCS, pages 343–348, Berlin, 1997. Springer.
19. D.P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
20. C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.


21. A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters, 24:377–380, 1987.
22. B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, 1992.
23. P.S. Bradley and O.L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. 15th International Conf. on Machine Learning, pages 82–90. Morgan Kaufmann, San Francisco, CA, 1998.
24. L.M. Bregman. The relaxation method for finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Math. and Math. Physics, 7:200–217, 1967.
25. L. Breiman. Bagging predictors. Machine Learning, 26(2):123–140, 1996.
26. L. Breiman. Bias, variance, and arcing classifiers. Technical Report 460, Statistics Department, University of California, July 1997.
27. L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1518, 1999. Also Technical Report 504, Statistics Department, University of California Berkeley.
28. L. Breiman. Some infinity theory for predictor ensembles. Technical Report 577, Berkeley, August 2000.
29. L. Breiman, J. Friedman, J. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
30. N. Bshouty and D. Gavinsky. On boosting with polynomially bounded distributions. JMLR, pages 107–111, 2002. Accepted.
31. P. Bühlmann and B. Yu. Boosting with the L2 loss: Regression and classification. J. Amer. Statist. Assoc., 2002. Revised; also Technical Report 605, Stat Dept, UC Berkeley, August 2001.
32. C. Campbell and K.P. Bennett. A linear programming approach to novelty detection. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 395–401. MIT Press, 2001.
33. J. Carmichael. Non-intrusive appliance load monitoring system. EPRI Journal, Electric Power Research Institute, 1990.
34. Y. Censor and S.A. Zenios. Parallel Optimization: Theory, Algorithms and Applications. Numerical Mathematics and Scientific Computation. Oxford University Press, 1997.
35. N. Cesa-Bianchi, A. Krogh, and M. Warmuth. Bounds on approximate steepest descent for likelihood maximization in exponential families. IEEE Transactions on Information Theory, 40(4):1215–1220, July 1994.
36. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002.
37. S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, 1995.
38. W.W. Cohen, R.E. Schapire, and Y. Singer. Learning to order things. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
39. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285, 2002. Special Issue on New Methods for Model Selection and Model Combination.
40. R. Cominetti and J.-P. Dussault. A stable exponential penalty algorithm with superlinear convergence. J.O.T.A., 83(2), Nov 1994.
41. C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.


42. T.M. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
43. D.D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimates. The Annals of Statistics, 18(4):1676–1695, 1990.
44. K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In N. Cesa-Bianchi and S. Goldberg, editors, Proc. COLT, pages 35–46, San Francisco, 2000. Morgan Kaufmann.
45. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
46. S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, April 1997.
47. S. Della Pietra, V. Della Pietra, and J. Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, Carnegie Mellon University, 2001.
48. A. Demiriz, K.P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46:225–254, 2002.
49. M. Dettling and P. Bühlmann. How to use boosting for tumor classification with gene expression data. Preprint. See http://stat.ethz.ch/˜dettling/boosting, 2002.
50. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.
51. T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 1999.
52. T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
53. C. Domingo and O. Watanabe. A modification of AdaBoost. In Proc. COLT, San Francisco, 2000. Morgan Kaufmann.
54. H. Drucker, C. Cortes, L.D. Jackel, Y. LeCun, and V. Vapnik. Boosting and other ensemble methods. Neural Computation, 6, 1994.
55. H. Drucker, R.E. Schapire, and P.Y. Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7:705–719, 1993.
56. N. Duffy and D.P. Helmbold. A geometric approach to leveraging weak learners. In P. Fischer and H.U. Simon, editors, Computational Learning Theory: 4th European Conference (EuroCOLT '99), pages 18–33, March 1999. Long version to appear in TCS.
57. N. Duffy and D.P. Helmbold. Boosting methods for regression. Technical report, Department of Computer Science, University of Santa Cruz, 2000.
58. N. Duffy and D.P. Helmbold. Leveraging for regression. In Proc. COLT, pages 208–219, San Francisco, 2000. Morgan Kaufmann.
59. N. Duffy and D.P. Helmbold. Potential boosters? In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 258–264. MIT Press, 2000.
60. G. Escudero, L. Màrquez, and G. Rigau. Boosting applied to word sense disambiguation. In LNAI 1810: Proceedings of the 12th European Conference on Machine Learning, ECML, pages 129–141, Barcelona, Spain, 2000.
61. W. Feller. An Introduction to Probability Theory and its Applications. Wiley, Chichester, third edition, 1968.


62. D.H. Fisher, Jr., editor. Improving regressors using boosting techniques, 1997.
63. M. Frean and T. Downs. A simple cost function for boosting. Technical report, Dep. of Computer Science and Electrical Engineering, University of Queensland, 1998.
64. Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, September 1995.
65. Y. Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293–318, 2001.
66. Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In Proc. ICML, 1998.
67. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT: European Conference on Computational Learning Theory. LNCS, 1994.
68. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148–156. Morgan Kaufmann, 1996.
69. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In Proc. COLT, pages 325–332, New York, NY, 1996. ACM Press.
70. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
71. Y. Freund and R.E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
72. Y. Freund and R.E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, September 1999. Appeared in Japanese, translation by Naoki Abe.
73. J. Friedman. Stochastic gradient boosting. Technical report, Stanford University, March 1999.
74. J. Friedman, T. Hastie, and R.J. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:337–374, 2000. With discussion pp. 375–407; also Technical Report, Department of Statistics, Sequoia Hall, Stanford University.
75. J.H. Friedman. On bias, variance, 0/1-loss, and the curse of dimensionality. In Data Mining and Knowledge Discovery, volume I, pages 55–77. Kluwer Academic Publishers, 1997.
76. J.H. Friedman. Greedy function approximation. Technical report, Department of Statistics, Stanford University, February 1999.
77. K.R. Frisch. The logarithmic potential method of convex programming. Memorandum, University Institute of Economics, Oslo, May 13, 1955.
78. T. Graepel, R. Herbrich, B. Schölkopf, A.J. Smola, P.L. Bartlett, K.-R. Müller, K. Obermayer, and R.C. Williamson. Classification on proximity data with LP-machines. In D. Willshaw and A. Murray, editors, Proceedings of ICANN'99, volume 1, pages 304–309. IEE Press, 1999.
79. Y. Grandvalet. Bagging can stabilize without reducing variance. In ICANN'01, Lecture Notes in Computer Science. Springer, 2001.
80. Y. Grandvalet, F. d'Alché-Buc, and C. Ambroise. Boosting mixture models for semi-supervised tasks. In Proc. ICANN, Vienna, Austria, 2001.
81. A.J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.


82. V. Guruswami and A. Sahai. Multiclass learning, boosting, and error-correcting codes. In Proc. of the Twelfth Annual Conference on Computational Learning Theory, pages 145–155, New York, USA, 1999. ACM Press.
83. W. Hart. Non-intrusive appliance load monitoring. Proceedings of the IEEE, 80(12), 1992.
84. M. Haruno, S. Shirai, and Y. Ooyama. Using decision trees to construct a practical parser. Machine Learning, 34:131–149, 1999.
85. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics. Springer, New York, N.Y., 2001.
86. T.J. Hastie and R.J. Tibshirani. Generalized Additive Models, volume 43 of Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1990.
87. D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
88. S.S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall, second edition, 1998.
89. D.P. Helmbold, J. Kivinen, and M.K. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6):1291–1304, 1999.
90. R. Herbrich. Learning Linear Classifiers: Theory and Algorithms, volume 7 of Adaptive Computation and Machine Learning. MIT Press, 2002.
91. R. Herbrich, T. Graepel, and J. Shawe-Taylor. Sparsity vs. large margins for linear classifiers. In Proc. COLT, pages 304–308, San Francisco, 2000. Morgan Kaufmann.
92. R. Herbrich and R. Williamson. Algorithmic luckiness. JMLR, 3:175–212, 2002.
93. R. Hettich and K.O. Kortanek. Semi-infinite programming: Theory, methods and applications. SIAM Review, 3:380–429, September 1993.
94. F.J. Huang, Z.-H. Zhou, H.-J. Zhang, and T. Chen. Pose invariant face recognition. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, pages 245–250, Grenoble, France, 2000.
95. R.D. Iyer, D.D. Lewis, R.E. Schapire, Y. Singer, and A. Singhal. Boosting for document routing. In A. Agah, J. Callan, and E. Rundensteiner, editors, Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management, pages 70–77, McLean, US, 2000. ACM Press, New York, US.
96. W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, volume 1, pages 361–380, Berkeley, 1960. University of California Press.
97. W. Jiang. Some theoretical aspects of boosting in the presence of noisy data. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
98. D.S. Johnson and F.P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6:93–107, 1978.
99. M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
100. M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proc. 28th ACM Symposium on the Theory of Computing, pages 459–468. ACM Press, 1996.
101. M. Kearns and L. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, January 1994.
102. M.J. Kearns and U.V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.


103. G.S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.
104. J. Kivinen and M. Warmuth. Boosting as entropy projection. In Proc. 12th Annu. Conference on Comput. Learning Theory, pages 134–144. ACM Press, New York, NY, 1999.
105. J. Kivinen, M. Warmuth, and P. Auer. The perceptron algorithm vs. winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant. Special issue of Artificial Intelligence, 97(1–2):325–343, 1997.
106. J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1–64, 1997.
107. K.C. Kiwiel. Relaxation methods for strictly convex regularizations of piecewise linear programs. Applied Mathematics and Optimization, 38:239–259, 1998.
108. V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statis., 30(1), 2002.
109. A. Krieger, A. Wyner, and C. Long. Boosting noisy data. In Proceedings, 18th ICML. Morgan Kaufmann, 2001.
110. J. Lafferty. Additive models, boosting, and inference for generalized divergences. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 125–133, New York, NY, 1999. ACM Press.
111. G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, volume 14, 2002. To appear. Longer version also NeuroCOLT Technical Report NC-TR-2001-098.
112. Y.A. LeCun, L.D. Jackel, L. Bottou, A. Brunot, C. Cortes, J.S. Denker, H. Drucker, I. Guyon, U.A. Müller, E. Säckinger, P.Y. Simard, and V.N. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In F. Fogelman-Soulié and P. Gallinari, editors, Proceedings ICANN'95 — International Conference on Artificial Neural Networks, volume II, pages 53–60, Nanterre, France, 1995. EC2.
113. M. Leshno, V. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6:861–867, 1993.
114. N. Littlestone, P.M. Long, and M.K. Warmuth. On-line learning of linear functions. Journal of Computational Complexity, 5:1–23, 1995. Earlier version is Technical Report CRL-91-29 at UC Santa Cruz.
115. D.G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley Publishing Co., Reading, second edition, May 1984. Reprinted with corrections in May 1989.
116. G. Lugosi and N. Vayatis. A consistent strategy for boosting algorithms. In Proceedings of the Annual Conference on Computational Learning Theory, volume 2375 of LNAI, pages 303–318, Sydney, February 2002. Springer.
117. Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.
118. S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, December 1993.
119. O.L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965.
120. O.L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24(1):15–23, 1999.


121. S. Mannor and R. Meir. Geometric bounds for generalization in boosting. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 461–472, 2001.
122. S. Mannor and R. Meir. On the existence of weak learners and applications to boosting. Machine Learning, 48(1-3):219–251, 2002.
123. S. Mannor, R. Meir, and T. Zhang. The consistency of greedy algorithms for classification. In Proceedings COLT'02, volume 2375 of LNAI, pages 319–333, Sydney, 2002. Springer.
124. L. Mason. Margins and Combined Classifiers. PhD thesis, Australian National University, September 1999.
125. L. Mason, P.L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Technical report, Department of Systems Engineering, Australian National University, 1998.
126. L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and C. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 1999.
127. L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221–247. MIT Press, Cambridge, MA, 2000.
128. J. Matoušek. Geometric Discrepancy: An Illustrated Guide. Springer Verlag, 1999.
129. R. Meir, R. El-Yaniv, and S. Ben-David. Localized boosting. In Proc. COLT, pages 190–199, San Francisco, 2000. Morgan Kaufmann.
130. R. Meir and T. Zhang. Data-dependent bounds for Bayesian mixture models. Unpublished manuscript, 2002.
131. J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415–446, 1909.
132. S. Merler, C. Furlanello, B. Larcher, and A. Sboner. Tuning cost-sensitive boosting and its application to melanoma diagnosis. In J. Kittler and F. Roli, editors, Proceedings of the 2nd International Workshop on Multiple Classifier Systems MCS2001, volume 2096 of LNCS, pages 32–42. Springer, 2001.
133. J. Moody. The effective number of parameters: An analysis of generalization and regularization in non-linear learning systems. In J. Moody, S.J. Hanson, and R.P. Lippman, editors, Advances in Neural Information Processing Systems, volume 4, pages 847–854, San Mateo, CA, 1992. Morgan Kaufmann.
134. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.
135. N. Murata, S. Amari, and S. Yoshizawa. Network information criterion — determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5:865–872, 1994.
136. S. Nash and A. Sofer. Linear and Nonlinear Programming. McGraw-Hill, New York, NY, 1996.
137. R. Nock and P. Lefaucheur. A robust boosting algorithm. In Proc. 13th European Conference on Machine Learning, volume LNAI 2430, Helsinki, 2002. Springer Verlag.


138. T. Onoda, G. Rätsch, and K.-R. Müller. An asymptotic analysis of AdaBoost in the binary classification case. In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proc. of the Int. Conf. on Artificial Neural Networks (ICANN'98), pages 195–200, March 1998.
139. T. Onoda, G. Rätsch, and K.-R. Müller. A non-intrusive monitoring system for household electric appliances with inverters. In H. Bothe and R. Rojas, editors, Proc. of NC'2000, Berlin, 2000. ICSC Academic Press Canada/Switzerland.
140. J. O'Sullivan, J. Langford, R. Caruana, and A. Blum. FeatureBoost: A meta-learning algorithm that improves model robustness. In Proceedings, 17th ICML. Morgan Kaufmann, 2000.
141. N. Oza and S. Russell. Experimental comparisons of online and batch versions of bagging and boosting. In Proc. KDD-01, 2001.
142. P. Derbeko, R. El-Yaniv, and R. Meir. Variance optimized bagging. In Proc. 13th European Conference on Machine Learning, 2002.
143. T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978–982, 1990.
144. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
145. J.R. Quinlan. Boosting first-order learning. Lecture Notes in Computer Science, 1160:143, 1996.
146. G. Rätsch. Ensemble learning methods for classification. Master's thesis, Dep. of Computer Science, University of Potsdam, April 1998. In German.
147. G. Rätsch. Robust Boosting via Convex Optimization. PhD thesis, University of Potsdam, Computer Science Dept., August-Bebel-Str. 89, 14482 Potsdam, Germany, October 2001.
148. G. Rätsch. Robustes Boosting durch konvexe Optimierung. In D. Wagner et al., editor, Ausgezeichnete Informatikdissertationen 2001, volume D-2 of GI-Edition – Lecture Notes in Informatics (LNI), pages 125–136. Bonner Köllen, 2002.
149. G. Rätsch, A. Demiriz, and K. Bennett. Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning, 48(1-3):193–221, 2002. Special Issue on New Methods for Model Selection and Model Combination. Also NeuroCOLT2 Technical Report NC-TR-2000-085.
150. G. Rätsch, S. Mika, B. Schölkopf, and K.-R. Müller. Constructing boosting algorithms from SVMs: an application to one-class classification. IEEE PAMI, 24(9), September 2002. In press. Earlier version is GMD TechReport No. 119, 2000.
151. G. Rätsch, S. Mika, and M.K. Warmuth. On the convergence of leveraging. NeuroCOLT2 Technical Report 98, Royal Holloway College, London, August 2001. A short version appeared in NIPS 14, MIT Press, 2002.
152. G. Rätsch, S. Mika, and M.K. Warmuth. On the convergence of leveraging. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, 2002. In press. Longer version also NeuroCOLT Technical Report NC-TR-2001-098.
153. G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, March 2001. Also NeuroCOLT Technical Report NC-TR-1998-021.
154. G. Rätsch, B. Schölkopf, A.J. Smola, S. Mika, T. Onoda, and K.-R. Müller. Robust ensemble learning. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 207–219. MIT Press, Cambridge, MA, 2000.
155. G. Rätsch, A.J. Smola, and S. Mika. Adapting codes and embeddings for polychotomies. In NIPS, volume 15. MIT Press, 2003. Accepted.


156. G. Rätsch, M. Warmuth, S. Mika, T. Onoda, S. Lemm, and K.-R. Müller. Barrier boosting. In Proc. COLT, pages 170–179, San Francisco, 2000. Morgan Kaufmann.
157. G. Rätsch and M.K. Warmuth. Maximizing the margin with boosting. In Proc. COLT, volume 2375 of LNAI, pages 319–333, Sydney, 2002. Springer.
158. G. Ridgeway, D. Madigan, and T. Richardson. Boosting methodology for regression problems. In D. Heckerman and J. Whittaker, editors, Proceedings of Artificial Intelligence and Statistics '99, pages 152–161, 1999.
159. J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
160. C.P. Robert. The Bayesian Choice: A Decision Theoretic Motivation. Springer Verlag, New York, 1994.
161. M. Rochery, R. Schapire, M. Rahim, N. Gupta, G. Riccardi, S. Bangalore, H. Alshawi, and S. Douglas. Combining prior knowledge and boosting for call classification in spoken language dialogue. In International Conference on Acoustics, Speech and Signal Processing, 2002.
162. R.T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, New Jersey, 1970.
163. R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
164. R.E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the 14th International Conference, pages 313–321, 1997.
165. R.E. Schapire. A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
166. R.E. Schapire. The boosting approach to machine learning: An overview. In Workshop on Nonlinear Estimation and Classification. MSRI, 2002.
167. R.E. Schapire, Y. Freund, P.L. Bartlett, and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, October 1998.
168. R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, December 1999. Also Proceedings of the 14th Workshop on Computational Learning Theory 1998, pages 80–91.
169. R.E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
170. R.E. Schapire, Y. Singer, and A. Singhal. Boosting and Rocchio applied to text filtering. In Proc. 21st Annual International Conference on Research and Development in Information Retrieval, 1998.
171. R.E. Schapire, P. Stone, D. McAllester, M.L. Littman, and J.A. Csirik. Modeling auction price uncertainty using boosting-based conditional density estimation. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
172. B. Schölkopf, R. Herbrich, and A.J. Smola. A generalized representer theorem. In D.P. Helmbold and R.C. Williamson, editors, COLT/EuroCOLT, volume 2111 of LNAI, pages 416–426. Springer, 2001.
173. B. Schölkopf, J. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson. Estimating the support of a high-dimensional distribution. TR 87, Microsoft Research, Redmond, WA, 1999.
174. B. Schölkopf, A. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000. Also NeuroCOLT Technical Report NC-TR-1998-031.


175. B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
176. H. Schwenk and Y. Bengio. Boosting neural networks. Neural Computation, 12(8):1869–1887, 2000.
177. R.A. Servedio. PAC analogues of perceptron and winnow via boosting the margin. In Proc. COLT, pages 148–157, San Francisco, 2000. Morgan Kaufmann.
178. R.A. Servedio. Smooth boosting and learning with malicious noise. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 473–489, 2001.
179. J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44(5):1926–1940, September 1998.
180. J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In Proceedings of the Twelfth Conference on Computational Learning Theory, pages 278–285, 1999.
181. J. Shawe-Taylor and N. Cristianini. On the generalization of soft margin algorithms. Technical Report NC-TR-2000-082, NeuroCOLT2, June 2001.
182. J. Shawe-Taylor and G. Karakoulas. Towards a strategy for boosting regressors. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 247–258, Cambridge, MA, 2000. MIT Press.
183. Y. Singer. Leveraged vector machines. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 610–616. MIT Press, 2000.
184. D. Tax and R. Duin. Data domain description by support vectors. In M. Verleysen, editor, Proc. ESANN, pages 251–256, Brussels, 1999. D. Facto Press.
185. F. Thollard, M. Sebban, and P. Ezequel. Boosting density function estimators. In Proc. 13th European Conference on Machine Learning, volume LNAI 2430, pages 431–443, Helsinki, 2002. Springer Verlag.
186. A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-posed Problems. W.H. Winston, Washington, D.C., 1977.
187. K. Tsuda, M. Sugiyama, and K.-R. Müller. Subspace information criterion for non-quadratic regularizers – model selection for sparse regressors. IEEE Transactions on Neural Networks, 13(1):70–80, 2002.
188. L.G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.
189. A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Verlag, New York, 1996.
190. V.N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
191. V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
192. V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probab. and its Applications, 16(2):264–280, 1971.
193. J. von Neumann. Zur Theorie der Gesellschaftsspiele. Math. Ann., 100:295–320, 1928.
194. M.A. Walker, O. Rambow, and M. Rogati. SPoT: A trainable sentence planner. In Proc. 2nd Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2001.


195. R. Zemel and T. Pitassi. A gradient-based boosting algorithm for regression problems. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 696–702. MIT Press, 2001.
196. T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Technical Report RC22155, IBM Research, Yorktown Heights, NY, 2001.
197. T. Zhang. A general greedy approximation algorithm with applications. In Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.
198. T. Zhang. On the dual formulation of regularized linear systems with convex risks. Machine Learning, 46:91–129, 2002.
199. T. Zhang. Sequential greedy approximation for certain convex optimization problems. Technical report, IBM T.J. Watson Research Center, 2002.
200. Z.-H. Zhou, Y. Jiang, Y.-B. Yang, and S.-F. Chen. Lung cancer cell identification based on artificial neural network ensembles. Artificial Intelligence in Medicine, 24(1):25–36, 2002.

An Introduction to Reinforcement Learning Theory: Value Function Methods

Peter L. Bartlett
Barnhill Technologies

Abstract. These lecture notes are intended to give a tutorial introduction to the formulation and analysis of reinforcement learning problems. In these problems, an agent chooses actions to take in some environment, aiming to maximize a reward function. Many control, scheduling, planning and game-playing tasks can be formulated in this way, as problems of controlling a Markov decision process. We review the classical dynamic programming approaches to finding optimal controllers. For large state spaces, these techniques are impractical. We review methods based on approximate value functions, estimated via simulation. In particular, we discuss the motivation for (and shortcomings of) the TD(λ) algorithm.

1 Introduction

Consider a call admission control problem faced by a telecommunications carrier selling bandwidth on a communication channel. Suppose that there are different types of calls (voice, video, data, etc), each bringing different revenues. At any moment, there is a range of calls that the carrier might allow, but the bandwidth is limited, and so the carrier must choose which of these calls to allow on the channel. The carrier's aim is to maximize revenue. The key feature of this decision problem is that decisions made at one moment might affect performance later. Hence, it does not always pay to greedily choose the calls that bring the most revenue, since this might prevent subsequent, more lucrative, calls from being connected. The call arrival process is typically random, although the carrier might have some information about the statistics of the process. Problems like this, of sequential decision-making under uncertainty, have been widely studied. We refer to them as reinforcement learning problems. The books by Bertsekas and Tsitsiklis [2] and Sutton and Barto [6] are useful reference texts on reinforcement learning. These notes take the approach of [2]. In the next section, we formulate reinforcement learning problems as problems of controlling Markov decision processes so as to maximize some performance criterion. We shall see that this formulation is convenient for analysis, and is a reasonable model for many of these problems. In these notes, we consider value function methods for reinforcement learning, which estimate some quality measure (called the value) over the different states of the MDP and use these estimates to choose appropriate control actions. We first consider (in Section 4) the classical, optimal approaches to these problems, which involve dynamic programming. We give performance bounds for these algorithms. These



methods suffer from two drawbacks. First, they assume that there is a model for the system being controlled. In Section 5, we consider the problem of estimating quantities (like parameters of a model) from simulations. Second, the dynamic programming approach is computationally impractical when the state space is large, which is typical. In Section 6, we consider approximate value function methods, which use a more restricted class of functions to approximate the value function. We consider an approximate version of policy iteration, and focus on the TD(λ) algorithm, which is a simulation-based gradient method for updating a parameterized value function estimate. We review convergence results for TD(λ). Section 7 considers why TD(λ) is often unsuccessful: there are problems for which improving the value function estimate degrades the performance of the policy. This motivates the use of local policy search methods, which are the subject of a companion paper.

2 Controlling a Markov Decision Process

In this section, we define the reinforcement learning problem as that of controlling a Markov decision process.

Definition 1. A finite Markov decision process (MDP) is a tuple (S, U, P, r), where
– S = {1, 2, . . . , n} is a finite set of states.
– U = {1, 2, . . . , N} is a finite set of actions.
– P is a mapping from actions to transition probability matrices, so that for states i, j ∈ S and controls u ∈ U, the mapping j → P(u)_{i,j} = p_{i,j}(u) is a probability distribution.
– r is a function from S to R, the reward function.

Given a sequence u_1, u_2, . . . of actions, the sequence X_0, X_1, X_2, . . . of states of a Markov decision process evolves as follows.
1. An initial state X_0 ∈ S is chosen.
2. For t = 1, 2, . . ., state X_{t+1} is chosen at random with Pr(X_{t+1} = j | X_t = i) = p_{i,j}(u_t).

The reinforcement learning problem is that of choosing the control actions, U_t, so as to maximize the sequence of rewards that is received. These actions should be chosen based only on the previous history of observed states and actions. In fact, we shall see that the Markov property implies that it always suffices to choose an action based on the current state only. Thus, we assume that, at time t, action U_t = µ(X_t) is chosen, where µ is a reactive policy, that is, a mapping from the space S of states to the space U of actions. The process is illustrated in Figure 1.
For the call admission control problem described above, the state is the mix of call types on the channel, the control actions are decisions to accept calls of a


[Figure 1: block diagram of the agent–environment loop; the environment emits the state X_t and reward r(X_t), and the agent, using policy µ, returns the action U_t.]
Fig. 1. The Markov decision process (S, U, P, r), controlled by the reactive policy µ.

certain type, and the state transitions depend on the statistics of call durations and call arrivals. The reward function is the revenue received from the accepted calls.
Playing backgammon can be viewed as another example of a reinforcement learning problem. The state is the board position and current dice roll outcome. The control actions are the available moves. The state transition probability p_{s,s'}(u) is the probability that, after the player makes move u from state s, the opponent makes a move and the player rolls the dice so that the board position and dice roll outcome constitute s'. The reward is r(s) = 0, except at the end of the game, where r(s) is positive for a win, zero for a draw, and negative for a loss.
The problem of controlling a robot to perform a particular task can also be viewed in this way. The state consists of all relevant position and velocity information about the robot, as well as relevant information about its environment. The state transitions depend on the dynamics of the robot and the statistics of the environment.
The final example is from optimization. Consider the problem of optimizing a schedule. Given the data (a set of tasks, a set of resources, and the constraints), the problem is to produce a schedule that minimizes some cost function. One approach is to start with a candidate solution, and then iteratively apply perturbations (for example, interchanging two tasks) until a satisfactory solution is reached. The choice of a suitable sequence of search options is a reinforcement learning problem. The state space is the data specifying the problem and a candidate solution. The control actions are the various search options. The transition probabilities reflect how these options change the candidate solution.
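Before turning to performance criteria, the following small Python sketch (not part of the original notes; the three-state transition matrices, reward vector and policy are made-up numbers used purely for illustration) shows one way to represent the objects in Definition 1 and to simulate the controlled process under a reactive policy µ. The same toy P, r and mu are reused in the later sketches.

import numpy as np

# Hypothetical toy MDP: n = 3 states, N = 2 actions.
# P[u][i, j] = p_{i,j}(u); r[i] = r(i).
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.3, 0.7]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]],   # action 1
])
r = np.array([1.0, 0.0, 2.0])
mu = np.array([0, 1, 0])        # a reactive policy: state -> action

def simulate(P, r, mu, x0, T, seed=0):
    """Generate states X_0,...,X_T and rewards r(X_t) under the reactive policy mu."""
    rng = np.random.default_rng(seed)
    states, rewards = [x0], [float(r[x0])]
    x = x0
    for _ in range(T):
        x = int(rng.choice(len(r), p=P[mu[x], x]))   # X_{t+1} ~ p_{x,.}(mu(x))
        states.append(x)
        rewards.append(float(r[x]))
    return states, rewards

states, rewards = simulate(P, r, mu, x0=0, T=10)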


3 Performance Criteria, Value Functions

In a reinforcement learning problem, the aim is to choose actions at each time instant so as to maximize the sequence of rewards. How do we choose the relative importance of short-term rewards and long-term rewards? There are several possibilities. If the problem has a finite time horizon, that is, if it proceeds for only N < ∞ time steps, one plausible performance criterion is the total reward received,

J_N(i) = E[ Σ_{t=0}^N r(X_t) | X_0 = i ],

where i is the start state, and the expectation is over the transitions X_t → X_{t+1} of states of the MDP.
It often makes sense to discount future rewards, so that we place more emphasis on the short term over the long term. Define the discounted value of initial state i over N steps as

J_N(i) = E[ Σ_{t=0}^N α^t r(X_t) | X_0 = i ]
       = E[ r(X_0) + αr(X_1) + α²r(X_2) + · · · + α^N r(X_N) | X_0 = i ],

where α is the discount factor, 0 ≤ α ≤ 1. Clearly, when α is small, this criterion favors actions that achieve short-term rewards. When α is close to 1, it favors more prudent policies that might forgo short-term rewards in the interests of long-term rewards. If we express

α = 1 / (1 + p)

for some p ≥ 0, we can interpret the discounted value as the present value of the cash flow stream r(X_0), r(X_1), . . . , r(X_N), where p is the interest rate per time period.
In an infinite horizon setting, we might consider the discounted value of initial state i,

J(i) = lim_{T→∞} E[ Σ_{t=0}^T α^t r(X_t) | X_0 = i ],

where 0 ≤ α ≤ 1. We shall concentrate on this performance criterion, with α < 1. In this case, the limit always exists, since S is finite. (For α = 1, it exists if the state converges almost surely to a set of states that each have zero reward.) To emphasize the fact that the expectation operator (and hence the value function) depends on the policy µ, we write

J^µ(i) = lim_{N→∞} E[ Σ_{t=0}^N α^t r(X_t) | X_0 = i ],

and we consider the vector in R^n,


J^µ = [J^µ(1), . . . , J^µ(n)]^⊤.

We shall refer to this vector as the value function of the policy µ. If we define

        [ p_{1,1}(µ(1)) · · · p_{1,n}(µ(1)) ]
P^µ =   [      ·                  ·         ]
        [ p_{n,1}(µ(n)) · · · p_{n,n}(µ(n)) ]

and

r = [r(1), . . . , r(n)]^⊤,

then we notice that

E[ r(X_{t+1}) | X_t = i ] = Σ_{j∈S} p_{ij}(µ(i)) r(j) = [P^µ r]_i.

Similarly,

E[ r(X_{t+k}) | X_t = i ] = [(P^µ)^k r]_i.

That is, if we view a function of the state as a vector of size n, left multiplication by P^µ corresponds to computing the expectation one step into the future. Thus, we can write the value function as

J^µ = Σ_{t=0}^∞ (αP^µ)^t r.

Notice that this implies J^µ = r + αP^µ J^µ. That is, the value of a state is the sum of its reward and the discounted expected future value. More generally, the value of a state is the sum of expected discounted rewards l ≥ 0 steps into the future, plus the discounted expected future value after that:

J^µ = Σ_{t=0}^l (αP^µ)^t r + (αP^µ)^{l+1} J^µ.        (1)

These are commonly referred to as the Bellman equations.
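Since equation (1) with l = 0 is linear in J^µ, the value function of a fixed policy can be computed by solving a linear system. A minimal sketch (reusing the toy P, r and mu from the earlier sketch, with a discount factor chosen only for illustration):

import numpy as np

alpha = 0.9   # illustrative discount factor

def policy_value(P, r, mu, alpha):
    """Solve J = r + alpha * P_mu J, i.e. (I - alpha P_mu) J = r."""
    n = len(r)
    P_mu = np.array([P[mu[i], i] for i in range(n)])   # row i is p_{i,.}(mu(i))
    return np.linalg.solve(np.eye(n) - alpha * P_mu, r)

J_mu = policy_value(P, r, mu, alpha)   # the value function of policy mu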


4 Optimal Policies and Dynamic Programming

In this section, we assume that we know the parameters of the MDP (the reward function and transition probabilities). Our aim is to compute the optimal policy. The optimal value of state i ∈ S is

J*(i) = max_µ J^µ(i),

where the maximum is over all reactive, stationary (that is, time invariant) policies µ : S → U. It turns out that there is a policy (call it µ*) that is optimal for all initial states i. Even if we were to allow a richer class of policies, for which the control action is a time-varying function of the history of states and actions, it turns out that this would not provide an advantage: we can always choose a reactive, stationary policy. This is not true when there is a finite horizon; in that case, the optimal policy typically varies with the number of steps to go.
If we know the value function of the optimal policy, J* = J^{µ*}, then we can use the Bellman equations to compute the optimal policy: for any i we have

J*(i) = r(i) + α [P^{µ*} J*]_i = r(i) + α max_µ [P^µ J*]_i,

and hence

µ*(i) = arg max_µ [P^µ J*]_i.

Since, for each i, we can choose µ(i) separately to maximize the ith row of P^µ J*, we can write this maximization in vector form as

µ* = arg max_µ P^µ J*.

We say that µ* is the greedy policy corresponding to the value function J*.

4.1 Value Iteration

We have seen that, given the value function J* for the optimal policy, we can compute the optimal policy. This motivates the first approach to finding the optimal policy, known as value iteration. This approach uses dynamic programming to find J*, and hence µ*. To understand how this dynamic programming algorithm works, it is helpful to consider the finite horizon case. There, we can start at the last time step, and work backwards. Let J*_k(i) denote the value of state i under the optimal policy, with k steps remaining. With zero steps to go, the optimal (only) value of the discounted future reward is the reward of the current state: J*_0(i) = r(i). With one step to go, we can optimize over the choice of the final action:

J*_1(i) = max_u E[ r(X_{N−1}) + αr(X_N) | X_{N−1} = i ]
        = r(i) + max_u α Σ_j p_{ij}(u) J*_0(j).


With k steps to go, the optimal value is

J*_k(i) = r(i) + max_u α Σ_j p_{ij}(u) J*_{k−1}(j).

For any value function J, define a new value function T J by

T J = r + α max_µ P^µ J.

(Here, we use the vector notation, since the maximization can be performed component-wise, and hence involves O(|S|²|U|) operations.) Notice that T is the dynamic programming operator that transforms J*_{k−1} to J*_k.
With an infinite horizon, we can take a similar approach; we repeatedly apply the dynamic programming operator. Figure 2 describes the value iteration algorithm. It starts with the value function estimate J_0 = 0, and at every iteration applies the dynamic programming operator T to update this estimate. For every estimate J_k, there is a corresponding greedy policy, µ_k, which is the policy that maximizes the expectation of J_k at the next time step.

1. Set J_0 = 0.
2. For all k, set J_{k+1} = T J_k, µ_k = arg max_µ P^µ J_k.

Fig. 2. Value iteration.
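A direct implementation of Figure 2 is short; the sketch below (an illustration only, reusing the toy MDP and α from the earlier sketches) applies the operator T until the estimates stop changing appreciably and returns the corresponding greedy policy.

import numpy as np

def bellman_operator(J, P, r, alpha):
    """Q[u, i] = r(i) + alpha * sum_j p_ij(u) J(j); returns (T J, greedy policy)."""
    Q = r[None, :] + alpha * (P @ J)      # shape (num actions, num states)
    return Q.max(axis=0), Q.argmax(axis=0)

def value_iteration(P, r, alpha, tol=1e-10):
    J = np.zeros(len(r))                   # J_0 = 0
    while True:
        J_next, mu_greedy = bellman_operator(J, P, r, alpha)
        if np.max(np.abs(J_next - J)) < tol:   # convergence is geometric (Lemma 4)
            return J_next, mu_greedy
        J = J_next

J_star, mu_star = value_iteration(P, r, alpha)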

In order to understand the performance of the value iteration algorithm, it is convenient to make several observations about the operator T. (Our treatment for the remainder of this section follows [2], although we consider a simpler setting.) The next two lemmas are easy to verify using the definition of T. We use the notation J ⪯ J̄ to indicate that, for every component i, J(i) ≤ J̄(i).

Lemma 1 (Monotonicity). For any J, J̄ ∈ R^n,
J ⪯ J̄ implies T J ⪯ T J̄.

Lemma 2 (Exponential decay of constants). For any J ∈ R^n, β ∈ R,
T(J + βe) = T J + αβe,
where e = [1, 1, . . . , 1]^⊤.

These lemmas imply that the operator T is a contraction with respect to the infinity norm, ‖J‖_∞ = max_{i∈S} |J(i)|.


Lemma 3 (Contraction). For any J, J̄ ∈ R^n,
‖T J − T J̄‖_∞ ≤ α‖J − J̄‖_∞.

Proof. Because of monotonicity and the exponential decay of constants, we have

J − ‖J − J̄‖_∞ e ⪯ J̄ ⪯ J + ‖J − J̄‖_∞ e
T(J − ‖J − J̄‖_∞ e) ⪯ T J̄ ⪯ T(J + ‖J − J̄‖_∞ e)
T J − α‖J − J̄‖_∞ e ⪯ T J̄ ⪯ T J + α‖J − J̄‖_∞ e.

The fact that T is a contraction implies the following theorem. We use T^N J to denote the result of N applications of the operator T to the vector J,

T^N J = T(T(· · · T(J) · · ·))   (N times).

Theorem 1 (Convergence of value iteration). For any value function J : S → R, there is a J* satisfying
1. J* = lim_{N→∞} T^N J.
2. J* is the unique solution to J* = T J*.
3. J* = max_µ J^µ.
4. J* = J^{µ*}, where µ* = arg max_µ P^µ J*.

It is straightforward to determine the rate of convergence of T^N J to J*.

Lemma 4. For the value function estimates J_k produced by the value iteration algorithm,
‖J* − J_k‖_∞ ≤ α^k ‖r‖_∞ / (1 − α).

To understand how the performance of the corresponding greedy policy approaches the optimal value, we need to combine this result with the following, which shows that as the estimate of the optimal value function improves, so does the performance of the corresponding greedy policy.

Lemma 5. If Ĵ is an estimate of the optimal value J*, and µ̂ is the corresponding greedy policy, defined by
µ̂ = arg max_µ P^µ Ĵ,
then the value J^µ̂ of the policy µ̂ satisfies
‖J* − J^µ̂‖_∞ ≤ (2α / (1 − α)) ‖J* − Ĵ‖_∞.


Proof. Suppose that state i is one of the states where the value for the greedy policy µ̂ is furthest from the optimal value, ‖J* − J^µ̂‖_∞ = J*(i) − J^µ̂(i). Then we can write

‖J* − J^µ̂‖_∞ = J*(i) − J^µ̂(i)
  = α ( Σ_{j∈S} p_{ij}(µ*(i)) J*(j) − Σ_{j∈S} p_{ij}(µ̂(i)) J^µ̂(j) )
  ≤ α ( Σ_{j∈S} p_{ij}(µ*(i)) [ Ĵ(j) + ‖J* − Ĵ‖_∞ ] − Σ_{j∈S} p_{ij}(µ̂(i)) J^µ̂(j) )
  ≤ α‖J* − Ĵ‖_∞ + α ( Σ_{j∈S} p_{ij}(µ̂(i)) Ĵ(j) − Σ_{j∈S} p_{ij}(µ̂(i)) J^µ̂(j) )   (by definition of µ̂)
  ≤ α‖J* − Ĵ‖_∞ + α Σ_{j∈S} p_{ij}(µ̂(i)) ( J*(j) + ‖J* − Ĵ‖_∞ − J^µ̂(j) )
  ≤ 2α‖J* − Ĵ‖_∞ + α‖J* − J^µ̂‖_∞.

Rearranging gives the statement of the lemma.

Combining these results, we obtain the following performance bounds for the policies generated by value iteration.

Theorem 2 (Performance of value iteration). Each policy µ_k produced by the value iteration algorithm has value J^{µ_k} satisfying

‖J* − J^{µ_k}‖_∞ ≤ 2α^{k+1} ‖r‖_∞ / (1 − α)².

Hence, after a finite number of iterations, µ_k converges to the optimal policy.

The second statement of the theorem follows from the fact that there are only finitely many policies µ, and hence there is a minimum non-zero value of ‖J* − J^µ‖_∞.
Thus, the value iteration algorithm starts with an arbitrary estimate Ĵ of the optimal value function. At each iteration, it chooses the greedy policy corresponding to Ĵ, that is the policy that maximizes the expectation of this estimate at the next state. It then updates the value estimates to these expectations. The estimate Ĵ converges exponentially to the optimum J*, and hence the corresponding greedy policy has value converging exponentially to J*.

4.2 Policy Iteration

In this section, we consider a second dynamic programming approach to computing the optimal policy for an MDP. The policy iteration algorithm (see Figure 3)


maintains a policy µk , and updates it at each iteration to a policy µk+1 obtained as follows. Suppose that our task is to choose an optimal action to take at the current time, but after that we will revert to the old policy µk . The optimal actions correspond to a policy µk+1 that maximizes the expectation of the value at the next time step under the old policy, µk .

1. Choose an initial policy µ_0.
2. For k = 1, 2, . . .
   – Compute the value function for µ_k, J^{µ_k} = (I − αP^{µ_k})^{−1} r.
   – Compute the new policy (the greedy policy corresponding to J^{µ_k}), µ_{k+1} = arg max_µ P^µ J^{µ_k}.

Fig. 3. Policy iteration
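The corresponding sketch of Figure 3 (again only an illustration on the hypothetical toy MDP and α above) alternates the exact policy-evaluation solve with a greedy policy update, and stops as soon as the policy is unchanged, which Theorem 3 below guarantees happens after finitely many iterations.

import numpy as np

def policy_iteration(P, r, alpha):
    n = len(r)
    mu = np.zeros(n, dtype=int)                               # arbitrary initial policy
    while True:
        P_mu = np.array([P[mu[i], i] for i in range(n)])
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, r)      # J^{mu_k}
        mu_next = (r[None, :] + alpha * (P @ J)).argmax(axis=0)  # greedy w.r.t. J^{mu_k}
        if np.array_equal(mu_next, mu):
            return mu, J
        mu = mu_next

mu_opt, J_opt = policy_iteration(P, r, alpha)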

To understand the performance of the algorithm, we use an analysis that is similar to our analysis of value iteration. Define the dynamic programming operator T_µ as

T_µ J = r + αP^µ J.

Notice that, in contrast to the operator T which chooses µ to maximize the expectation of the value at the next time step, the T_µ operator computes this expectation under the fixed policy µ. Rearranging the Bellman equations (1) allows us to express the value function in terms of the transition probabilities and reward function:

J^µ = r + αP^µ J^µ  ⇒  J^µ = (I − αP^µ)^{−1} r.

The policy iteration algorithm uses this identity. We shall see that we can view the computation of the value J^µ in policy iteration as the limit of a sequence of applications of the T_µ operator. The following properties of T_µ follow easily from the definition.

Lemma 6 (Monotonicity). For any J, J̄ ∈ R^n,
J ⪯ J̄ implies T_µ J ⪯ T_µ J̄.

Lemma 7 (Suboptimality). For any J ∈ R^n,
T_µ J ⪯ T J.


If µ is the greedy policy corresponding to J, that is, µ = arg max_µ P^µ J, then T_µ J = T J.

Lemma 8 (Fixed point). If J ∈ R^n is the value function of policy µ, then J is the unique vector satisfying J = T_µ J (the fixed point of T_µ).

These properties imply that policy iteration is guaranteed to converge monotonically to the optimal policy in a finite number of steps, as the following theorem shows. The notation J ≺ J̄ means that J ≠ J̄ and J ⪯ J̄, that is, for all i ∈ S, J(i) ≤ J̄(i).

Theorem 3 (Convergence of policy iteration). For any initial policy µ_0, there is a k* such that
J^{µ_0} ≺ J^{µ_1} ≺ · · · ≺ J^{µ_{k*}} = J*.

Proof. We first show that the value function is non-decreasing as policy iteration proceeds. The fixed point property shows that

J^{µ_k} = T_{µ_k} J^{µ_k} ⪯ T J^{µ_k} = T_{µ_{k+1}} J^{µ_k},

from the suboptimality lemma, and so

J^{µ_k} ⪯ lim_{N→∞} T^N_{µ_{k+1}} J^{µ_k} = J^{µ_{k+1}},

again using the fixed point property. If the value does not increase, that is, J^{µ_{k+1}} = J^{µ_k}, then

J^{µ_{k+1}} = T_{µ_{k+1}} J^{µ_{k+1}} = T_{µ_{k+1}} J^{µ_k} = T J^{µ_k} = T J^{µ_{k+1}}.

So the value increases until it reaches the optimum: either J^{µ_{k+1}} ≻ J^{µ_k} or J^{µ_k} = J*.
Thus, the policy iteration algorithm starts with an arbitrary policy. At each iteration, it computes the value of the policy, then updates the policy greedily, so that the action chosen in state i maximizes the expectation of the (old) value of the next state. The value of this sequence of policies converges monotonically (no state has a value that decreases, and at least one state's value increases) until it reaches the optimum.


5 Simulation

So far, we have assumed complete knowledge of the MDP, and used this to compute an optimal policy. In many problems, this is unrealistic. For instance, we might not know the dynamics or reward function. We might have access only to simulations: observations of the sequence of states, actions and rewards generated by an MDP with a fixed policy. Alternatively, we might know the dynamics, but the system might be so large that storing and manipulating functions of every state is infeasible. (In the latter case, we would also need to use approximate methods, which we shall consider in the next section.)
If the dynamics are unknown, we could use a model based approach, in which we use simulations to estimate P and r, and then apply the techniques of the previous section. Alternatively, we could use a model free approach, in which we perform policy iteration but estimate the value function by simulation. In both cases, we need to estimate quantities (like p_{ij}(u) and r(i)) from simulation. For this to be possible, it suffices that the controlled MDP satisfies an ergodicity condition, which ensures that the sequence of states, actions and rewards contains enough information; in particular, it ensures that all states are visited infinitely often.
We need some definitions and results from the theory of finite Markov chains. We state these without proof; see [4] for further details. A finite Markov chain is a pair (S, P), where S is the finite set of states and P is the transition probability matrix. Given an initial state X_0, the states evolve according to Pr(X_{t+1} = j | X_t = i) = P_{ij}. The Markov chain of interest to us is the state sequence of the MDP under a fixed policy µ. In this case, the state transition probabilities are given by [P^µ]_{ij} = p_{ij}(µ(i)).
We say that a state j is accessible from i if, for some t > 0, [P^t]_{ij} > 0. A communicating class of states is a set of mutually accessible states. A recurrent class is a communicating class that does not access other states. A transient class is a communicating class that is not recurrent. Recurrent classes dominate the asymptotic behavior of the chain, in the sense that Pr(X_t ∈ a recurrent class) → 1. The period of a state is the greatest common divisor of the set {t : [P^t]_{ii} > 0}. If the period is one, we say that the chain is aperiodic. All states in a communicating class have the same period. An irreducible Markov chain has a single communicating class. An ergodic Markov chain is one that is irreducible and aperiodic (period = 1). The transition probability matrix of an ergodic Markov chain is primitive, that

196

P.L. Bartlett

is, for some t, all entries of P^t are strictly positive. This implies all states in an ergodic Markov chain are in the same communicating class, which is necessarily a recurrent class.
We shall represent a probability distribution over states as a vector: q ∈ R^n is the distribution over states at time t if Pr(X_t = i) = q_i. For a finite Markov chain (S, P), it is easy to see that the transition probability matrix P determines how the distribution of states evolves: if q is the distribution at time t, then the distribution at time t + 1 is given by q^⊤P. A distribution π is a stationary distribution if it is invariant under this operation: π^⊤P = π^⊤. That is, a stationary distribution is a left eigenvector of P. The Perron-Frobenius theorem says that any primitive matrix P has a positive real eigenvalue, larger than the magnitude of all other eigenvalues, and that the associated left and right eigenvectors are strictly positive and unique, up to a multiplicative constant. Thus, for an ergodic Markov chain, we can order the eigenvalues of P as 1 = λ_1 > |λ_2| ≥ |λ_3| ≥ · · · ≥ |λ_n|. The left eigenvector corresponding to λ_1 = 1 is the unique stationary distribution, and every component of it is strictly positive. An ergodic Markov chain mixes to the stationary distribution, which means that

lim_{t→∞} P^t = e π^⊤.

(Recall that e = [1 . . . 1]^⊤.) It follows that the stationary distribution π captures the asymptotic behavior of an ergodic Markov chain: with probability 1,

lim_{T→∞} (1/T) |{1 ≤ t ≤ T : X_t = i}| = π_i > 0.
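For a concrete chain these properties are easy to check numerically; the sketch below (an illustration, not part of the original notes) tests primitivity of the toy chain P^µ induced by the policy from the earlier sketches and extracts the stationary distribution as the left eigenvector for eigenvalue 1.

import numpy as np

P_mu = np.array([P[mu[i], i] for i in range(len(r))])    # chain induced by policy mu

# Primitivity check: some power of P_mu has all entries strictly positive.
is_primitive = any(np.all(np.linalg.matrix_power(P_mu, t) > 0) for t in range(1, 20))

def stationary_distribution(P_mu):
    """Left eigenvector of P_mu for eigenvalue 1, normalized to a probability vector."""
    vals, vecs = np.linalg.eig(P_mu.T)                   # right eigenvectors of P_mu^T
    k = int(np.argmin(np.abs(vals - 1.0)))
    pi = np.real(vecs[:, k])
    return pi / pi.sum()

pi = stationary_distribution(P_mu)                       # satisfies pi^T P_mu = pi^T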

In summary, if the Markov chain corresponding to an MDP with a fixed policy is ergodic, then there is a unique stationary distribution, π. The probability under π of state i is the asymptotic proportion of time spent in state i. Whatever the initial probability distribution over states, the distribution over states approaches the stationary distribution π. Furthermore, π_i > 0 for all states i, so with probability one all states are visited infinitely often.
These properties are important when we wish to estimate quantities from simulation. For instance, suppose that we have a function f : S → R, such as the error (r(i) − r̂(i)) of an estimate r̂ of the reward function. Then the empirical average of f² over a sample path,

‖f‖²_T = (1/T) Σ_{t=1}^T f(X_t)²,

approximates

‖f‖²_π = E_π f² = Σ_{i∈S} π(i) f(i)²,

in the sense that, with probability one,

lim_{T→∞} ‖f‖²_T = ‖f‖²_π.

For our purposes, the most important quantity to estimate is the value function. Figure 4 shows the policy iteration algorithm with simulation. At each iteration, simulation is used to estimate the value function J µk . In the next section, we present a popular algorithm for estimating the value function, called TD(λ).

1. Choose an initial policy $\mu_0$.
2. For $k = 1, 2, \ldots$
   – Generate states $X_0, X_1, \ldots$ and rewards $r(X_0), r(X_1), \ldots$ by simulating the system with policy $\mu_k$.
   – Use $X_0, X_1, \ldots$ and $r(X_0), r(X_1), \ldots$ to estimate the value function $J^{\mu_k}$.
   – Compute the new policy (the greedy policy corresponding to the estimate $J_k$ of $J^{\mu_k}$),
     $$\mu_{k+1}(i) = \arg\max_{u \in U} \sum_{j \in S} p_{ij}(u)\, J_k(j).$$

Fig. 4. Policy iteration with simulation
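A compact way to see how the pieces of Figure 4 fit together is the following Python sketch. It is only an illustration of the structure, not the author's code: the Monte-Carlo value estimate, the episode length and all identifiers are assumptions, and a tabular estimate of the value function stands in for whatever estimator one prefers.

```python
import numpy as np

def simulate_values(P, r, policy, alpha, n_episodes=200, horizon=200, seed=0):
    """Monte-Carlo estimate of J^mu(i): average discounted return from each start state."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    J = np.zeros(n)
    for i in range(n):
        returns = []
        for _ in range(n_episodes):
            x, g, disc = i, 0.0, 1.0
            for _ in range(horizon):
                g += disc * r[x]
                disc *= alpha
                x = rng.choice(n, p=P[x, policy[x]])
            returns.append(g)
        J[i] = np.mean(returns)
    return J

def policy_iteration_with_simulation(P, r, alpha, n_iters=10):
    """P[i, u, j] = p_ij(u); the greedy update mirrors Figure 4."""
    n, m, _ = P.shape
    policy = np.zeros(n, dtype=int)               # mu_0: arbitrary initial policy
    for _ in range(n_iters):
        J = simulate_values(P, r, policy, alpha)  # estimate J^{mu_k} by simulation
        policy = np.argmax(P @ J, axis=1)         # mu_{k+1}(i) = argmax_u sum_j p_ij(u) J(j)
    return policy
```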

6 Approximate Value Functions and TD(λ)

In practice, the number of states is often too large to allow the storage and manipulation of an estimate of the value of every state. One way to simplify the problem is to group states, and estimate the value simultaneously for every state in a group. More generally, we could consider a restricted class of value functions (rather than the full class of all mappings from $S$ to $\mathbb{R}$). We call this a class of approximate value functions. We shall focus on parameterized approximate value functions, denoted $\tilde{J}_w(i)$, where the parameter is $w = [w_1, \ldots, w_k]^\top \in \mathbb{R}^k$. For instance, $\tilde{J}_w$ might be computed by a neural network with a single output and $k$ adjustable parameters. In such a case, each state might be represented by a vector of features that form the input to the network, and the value function estimate is based on these features. The policy iteration algorithm using simulation can also be used with an approximate value function class. In this case, the estimate of the value function $J^{\mu_k}$ is of the form $\tilde{J}_w$.



Figure 5 shows the TD($\lambda$) algorithm, which is an approximate gradient method for estimating the parameters of such an approximate value function. To understand the motivation for this algorithm, we recall the $(l+1)$-step Bellman equations,
$$J^\mu = \sum_{t=0}^{l} (\alpha P^\mu)^t r + (\alpha P^\mu)^{l+1} J^\mu.$$

For each value of $l$, $J^\mu$ is the unique solution. To choose suitable parameters for an approximate value function $\tilde{J}$, we want to find an approximate solution to these equations. The question is: which equations should we approximately solve? TD($\lambda$) minimizes an approximation to the error of a weighted linear combination of all $(l+1)$-step equations. Fix $0 \le \lambda < 1$, and consider the difference
$$(1-\lambda) \sum_{l=0}^{\infty} \lambda^l \left( \sum_{t=0}^{l} (\alpha P^\mu)^t r + (\alpha P^\mu)^{l+1} \tilde{J} - \tilde{J} \right).$$
(We use the convention that $0^0 = 1$.) Notice that this is a convex combination of differences between the right and left sides of the Bellman equations. If we choose $\tilde{J}$ to minimize (in some sense) this convex combination of differences, then a small value of $\lambda$ will focus on satisfying the small-order Bellman equations, that is, the consistency of $\tilde{J}$ over a short time scale. A value of $\lambda$ near 1 will emphasize the accuracy with which $\tilde{J}$ approximates the expected discounted reward. It is easy to show that
$$(1-\lambda) \sum_{l=0}^{\infty} \lambda^l \left( \sum_{t=0}^{l} (\alpha P^\mu)^t r + (\alpha P^\mu)^{l+1} \tilde{J} - \tilde{J} \right) = \sum_{t=0}^{\infty} (\alpha\lambda P^\mu)^t \left( r + \alpha P^\mu \tilde{J} - \tilde{J} \right),$$
and this latter quantity makes sense also for $\lambda = 1$: in that case,
$$\sum_{t=0}^{\infty} (\alpha\lambda P^\mu)^t \left( r + \alpha P^\mu \tilde{J} - \tilde{J} \right) = \sum_{t=0}^{\infty} (\alpha P^\mu)^t r + \sum_{t=0}^{\infty} (\alpha P^\mu)^{t+1} \tilde{J} - \sum_{t=0}^{\infty} (\alpha P^\mu)^t \tilde{J} = J^\mu - \tilde{J}.$$
Since we wish to use simulation, we minimize the following error functional, which is the two-norm of this difference vector, weighted according to the stationary distribution:
$$E(\tilde{J}) = \frac{1}{2} \left\| \sum_{t=0}^{\infty} (\alpha\lambda P^\mu)^t \left( r + \alpha P^\mu \tilde{J} - \tilde{J} \right) \right\|^2_\pi.$$
(Recall that $\|a\|^2_\pi = \mathbb{E}_{i\sim\pi}\, a(i)^2$.) This is
$$E(\tilde{J}) = \frac{1}{2}\, \mathbb{E}_{i\sim\pi} \left[ \mathbb{E}\left( \sum_{t=0}^{\infty} (\alpha\lambda)^t d_t \,\Big|\, X_0 = i \right) \right]^2,$$

where we have defined the temporal difference at time $t$ as $d_t = r(X_t) + \alpha\tilde{J}(X_{t+1}) - \tilde{J}(X_t)$. Notice that the conditional expectation of the temporal difference is zero when $\tilde{J} = J^\mu$, since $\mathbb{E}[r(X_t) + \alpha J^\mu(X_{t+1}) \mid X_0] = \mathbb{E}[J^\mu(X_t) \mid X_0]$. The gradient of the error functional $E$ with respect to the parameters $w$ of the approximate value function $\tilde{J}_w$ is given by
$$\nabla E(\tilde{J}_w) = \mathbb{E}_{i\sim\pi}\left[ \mathbb{E}\left( \sum_{t=0}^{\infty} (\alpha\lambda)^t d_t \,\Big|\, X_0 = i \right) \nabla \mathbb{E}\left( \sum_{t=0}^{\infty} (\alpha\lambda)^t d_t \,\Big|\, X_0 = i \right) \right].$$
The TD($\lambda$) algorithm approximates the gradient factor in this expression, which involves a sum into the future, by a single gradient term: it approximates
$$\nabla \mathbb{E}\left( \sum_{t=0}^{\infty} (\alpha\lambda)^t d_t \,\Big|\, X_0 = i \right)$$
by $-\nabla \tilde{J}_w(i)$. This approximation is exact for $\alpha = 0$ or $\lambda = 1$, but in general it introduces a bias. Thus, the negative gradient is approximated by
$$\mathbb{E}_{i\sim\pi}\left[ \mathbb{E}\left( \sum_{t=0}^{\infty} (\alpha\lambda)^t d_t \,\Big|\, X_0 = i \right) \nabla \tilde{J}_w(i) \right]
\;\stackrel{a.s.}{=}\; \lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} d_t \sum_{s=0}^{t-1} (\alpha\lambda)^s \nabla \tilde{J}_w(X_{t-s})
\;=\; \lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} d_t z_t,$$
where the eligibility trace $z_t \in \mathbb{R}^k$ is the low-pass filtered gradient sequence defined by $z_0 = 0$, $z_{t+1} = \alpha\lambda z_t + \nabla\tilde{J}_w(X_{t+1})$. Despite the approximation in this derivation, the following convergence result can be proved for the case of linearly parameterized approximate value functions (for all parameter vectors $w_1, w_2$, $\tilde{J}_{w_1+w_2} = \tilde{J}_{w_1} + \tilde{J}_{w_2}$); see [8].

Theorem 4 (Convergence of TD($\lambda$)). Suppose $\tilde{J}_w$ is linear in the parameters $w$. Then, with probability 1, TD($\lambda$) converges to a parameter vector $w^*$ satisfying
$$\left\| \tilde{J}_{w^*} - J^\mu \right\|^2_\pi \le \left( \frac{1-\alpha\lambda}{1-\alpha} \right)^2 \min_w \left\| \tilde{J}_w - J^\mu \right\|^2_\pi.$$



1. Choose $\lambda \in [0, 1]$, $\alpha \in [0, 1)$, $\gamma_t$ as before.
2. Choose initial parameters $v_0 \in \mathbb{R}^k$.
3. Set $z_0 = 0$, $z_0 \in \mathbb{R}^k$.
4. Generate state sequence $X_0, X_1, \ldots, X_T$ and rewards $r(X_t)$ by simulation with policy $\mu$.
5. After each transition $X_t \to X_{t+1}$,
   – Set $d_t = r(X_t) + \alpha\tilde{J}_{v_t}(X_{t+1}) - \tilde{J}_{v_t}(X_t)$
   – Set $v_{t+1} = v_t + \gamma_t d_t z_t$
   – Set $z_{t+1} = \alpha\lambda z_t + \nabla\tilde{J}_{v_{t+1}}(X_{t+1})$
6. Return $v_T$.

Fig. 5. TD (λ)
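The following Python sketch implements the updates of Figure 5 for a linearly parameterized value function $\tilde{J}_v(i) = v^\top \phi(i)$. It is a minimal illustration, not the lecture's code: the feature map, the 1/(t+1) step sizes and the simulated-environment interface are assumptions.

```python
import numpy as np

def td_lambda(simulate_step, phi, n_steps, alpha=0.9, lam=0.7, seed=0):
    """TD(lambda) with linear function approximation J_v(i) = v . phi(i).

    simulate_step(x, rng) -> (reward r(x), next state) simulates one transition
    under the fixed policy mu; phi(i) returns the NumPy feature vector of state i.
    """
    rng = np.random.default_rng(seed)
    x = 0                                   # initial state (assumed)
    v = np.zeros(len(phi(x)))               # parameters v_0
    z = np.zeros_like(v)                    # eligibility trace z_0 = 0
    for t in range(n_steps):
        gamma_t = 1.0 / (t + 1)             # step size (assumed Robbins-Monro schedule)
        r, x_next = simulate_step(x, rng)
        d = r + alpha * v @ phi(x_next) - v @ phi(x)   # temporal difference d_t
        v = v + gamma_t * d * z             # v_{t+1} = v_t + gamma_t d_t z_t
        z = alpha * lam * z + phi(x_next)   # z_{t+1}; for linear J the gradient is phi
        x = x_next
    return v
```

With $\lambda = 0$ this reduces to the usual one-step TD update, while values of $\lambda$ close to 1 let the trace carry information from many past states, matching the bias-variance trade-off discussed below.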

Hence, for $\lambda = 1$ or $\alpha = 0$ (the two cases where the approximation is exact),
$$\left\| \tilde{J}_{w^*} - J^\mu \right\|^2_\pi = \min_w \left\| \tilde{J}_w - J^\mu \right\|^2_\pi.$$
In particular, TD(1) minimizes the weighted squared error between the approximate value function and the true value function. Also, if $\tilde{J}_w$ is sufficiently rich that
$$\min_w \left\| \tilde{J}_w - J^\mu \right\|^2_\pi = 0$$
(in particular, this is the case if $\tilde{J}_w$ includes every real-valued function defined on $S$), then TD($\lambda$) converges almost surely to the optimal $w$, for any $\lambda$.

Roughly speaking, a value of $\lambda$ close to 0 gives higher bias, but lower variance in the parameter estimates. In this case, TD($\lambda$) predominantly focuses on the 'consistency' of $\tilde{J}$ over a short time scale. A value of $\lambda$ close to 1 gives lower bias but higher variance. In this case, the algorithm predominantly uses the reward sequence to estimate $\tilde{J}$. In practice, there is often a benefit in allowing an increase in bias (by decreasing $\lambda$) in order to reduce the variance.

7 Limitations of TD(λ)

Although TD($\lambda$) has been used successfully in a number of reinforcement learning problems, including backgammon [7], elevator dispatching [3], channel allocation in a mobile phone network [5], and job shop scheduling [10], it frequently is not successful in practice. In this section, we consider why this is so; see also [1]. Recall the analysis of policy iteration in Section 4. Exact policy iteration uses the correct value function $J^{\mu_k}$ for the current policy $\mu_k$. This ensures the important property that $\mu_{k+1}$ is guaranteed to be better than $\mu_k$, in the sense that $J^{\mu_{k+1}} \ge J^{\mu_k}$.



In contrast, when we use approximate policy iteration (that is, when the value function estimate $\tilde{J}^{\mu_k}$ for the current policy is only an approximation to $J^{\mu_k}$), the best we can say about the performance of the new policy is given by a generalization of Lemma 5:
$$J^{\mu_{k+1}} \ge J^{\mu_k} - \frac{2\alpha}{1-\alpha} \left\| \tilde{J}^{\mu_k} - J^{\mu_k} \right\|_\infty e.$$

In other words, unless the approximation is exact (J˜ = J), µk+1 can be worse than µk .

Fig. 6. A two-state Markov decision process, with states 1 and 2 and rewards $r(1) = 0$ and $r(2) = 1$.

Figure 6 illustrates a two-state MDP for which this performance degradation occurs. The action space is $U = \{u_1, u_2\}$. The transition probabilities satisfy
$$P(u_1) = \begin{pmatrix} 1/3 & 2/3 \\ 1/3 & 2/3 \end{pmatrix}, \qquad P(u_2) = \begin{pmatrix} 2/3 & 1/3 \\ 2/3 & 1/3 \end{pmatrix}.$$
Clearly, the optimal policy $\mu^*$ always chooses $u_1$, and the stationary distribution under this policy satisfies $\pi^* = [1/3,\ 2/3]^\top$. Consider approximate value functions parameterized by a single real number:
$$\tilde{J}(i, w) = w\,\phi(i),$$
where $\phi(1) = 2$ and $\phi(2) = 1$. It is straightforward to solve Bellman's equations to show that
$$J^*(1) = \frac{2\alpha}{3(1-\alpha)}, \qquad J^*(2) = 1 + J^*(1).$$
Thus, if $w < 0$, $\tilde{J}(1,w) < \tilde{J}(2,w)$ and so the corresponding greedy policy is optimal, whereas if $w > 0$, the greedy policy is suboptimal. If we use the optimal policy, the solution found by TD(1) is the minimizer of
$$\left\| \tilde{J}(\cdot, w) - J^* \right\|^2_{\pi^*},$$



and it is easy to show that the solution w is positive. Thus, the new greedy policy based on w will be suboptimal. In fact, even for the suboptimal policy, the solution is positive, so the policy remains suboptimal. Thus, approximate value function methods can lead to degradation of the policy performance, even in the simplest of MDPs. This phenomenon is not restricted to contrived examples. In backgammon [9] and in tetris [2], policy degradation has been observed. This is a strong motivation for the use of techniques that ensure only an improvement in performance; see, for example, [1], and references cited there.
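The claim that the TD(1) solution has $w > 0$ can be checked with a few lines of Python. This is an illustrative verification under the quantities reconstructed above (the features $\phi(1)=2$, $\phi(2)=1$ and the weighting $\pi^* = [1/3, 2/3]$); it is not part of the original text.

```python
import numpy as np

alpha = 0.9
J1 = 2 * alpha / (3 * (1 - alpha))           # J*(1) under the optimal policy
J2 = 1 + J1                                  # J*(2)
pi = np.array([1/3, 2/3])                    # stationary distribution pi*
phi = np.array([2.0, 1.0])                   # features: J~(i, w) = w * phi(i)
J = np.array([J1, J2])

# TD(1) minimizes the pi*-weighted squared error; for a one-parameter linear
# class the minimizer has the closed form w = <phi, J>_pi / <phi, phi>_pi.
w = (pi * phi * J).sum() / (pi * phi * phi).sum()
print(w)                                     # positive, so the new greedy policy is suboptimal
print(w * phi[0] > w * phi[1])               # True: J~(1, w) > J~(2, w), so greedy picks u2
```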

References

1. J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001.
2. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
3. R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 1017–1023. MIT Press, 1996.
4. E. Seneta. Non-negative Matrices and Markov Chains. Springer-Verlag, New York, 1981.
5. S. P. Singh and D. Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, pages 974–980. MIT Press, 1997.
6. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. ISBN 0-262-19398-1.
7. G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215–219, 1994.
8. J. N. Tsitsiklis and B. Van Roy. An analysis of temporal difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
9. L. Weaver and J. Baxter. STD(λ): learning state differences with TD(λ). In Proceedings of the Post-graduate ADFA Conference on Computer Science (PACCS'01), ADFA Monographs in Computer Science Series (1), pages 63–70, 2001.
10. W. Zhang and T. G. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1114–1120. Morgan Kaufmann, 1995.

Learning Comprehensible Theories from Structured Data

J.W. Lloyd

Research School of Information Sciences and Engineering, The Australian National University, Canberra 0200, ACT, Australia

Abstract. This tutorial discusses some knowledge representation issues in machine learning. The focus is on machine learning applications for which the individuals that are the subject of learning have complex structure. To represent such individuals, a rich knowledge representation language based on higher-order logic is introduced. The logic is also employed to construct comprehensible hypotheses that one might want to learn about the individuals. The tutorial introduces the main ideas of this approach to knowledge representation in a mostly informal way and gives a number of illustrations. The application of the ideas to decision-tree learning is also illustrated with an example.

1 Introduction

Machine learning applications increasingly involve situations for which the individuals that are the subject of learning have complex structure. Such application areas include bio-informatics and text learning. However, most machine learning systems assume that the individuals are represented by feature vectors, that is, fixed-length tuples of numbers or constants. Even in the case when learning is best achieved by using such features, there is a strong case for being careful about representing the structure of the individuals to support the feature engineering phase. Furthermore, if one has a suitable knowledge representation language available, the prospect of employing learning algorithms that deal directly with the structure becomes attractive. This tutorial introduces a knowledge representation language that supports machine learning applications for which the individuals have complex structure. Having represented the individuals in this language, one can then proceed by extracting features or by employing a learning system that can directly accept the structured representation or by a combination of both. The language also provides a rich and expressive formalism for stating and constructing hypotheses. Furthermore, the hypotheses so constructed are comprehensible to users. These knowledge representation ideas have application throughout machine learning to decision-tree systems, neural networks, distance-based learning, and kernel methods. The tutorial is written in an informal fashion, omitting all theorems, and avoiding the more technical details as far as possible. Also the ideas are illustrated with a variety of examples. The next three sections introduce the key




ideas of the underlying logic and basic terms that are used for representing individuals. The logic itself is based on Church’s simple theory of types [3]. The subsequent section, Section 5, shows how basic terms can be used to represent individuals in two typical application areas. Section 6 shows how one can define kernels on individuals that are represented by basic terms. This opens up the whole array of kernel methods to structured data. Section 7 then shows how the higher-order logic can be used to construct predicates and introduces the important concept of a predicate rewrite system for expressing hypothesis languages. Section 8 illustrates the ideas of the preceding section by showing how a simple multiple-instance learning problem can be handled using a predicate rewrite system and a decision-tree learner. The last section provides references to related work and further reading.

2 Types

The logic is a polymorphically typed one, so the first task is to define a suitable collection of types which include parameters (that is, type variables) that capture the polymorphism.

Definition 1. An alphabet consists of four sets:
1. A set T of type constructors.
2. A set P of parameters.
3. A set C of constants.
4. A set V of variables.

Each type constructor in T has an arity. The set T always includes the type constructors 1 and Ω both of arity 0. 1 is the type of some distinguished singleton set and Ω is the type of the booleans. The set P is denumerable (that is, countably infinite). Parameters are type variables and are typically denoted by a, b, c, . . . . Each constant in C has a signature (see below). The set V is also denumerable. Variables are typically denoted by x, y, z, . . . . For any particular application, the alphabet is assumed fixed and all definitions are relative to the alphabet. Types are built up from the set of type constructors and the set of parameters, using the symbols → and ×. Definition 2. A type is defined inductively as follows. 1. Each parameter in P is a type. 2. If T is a type constructor in T of arity k and α1 , . . . , αk are types, then T α1 . . . αk is a type. (For k = 0, this reduces to a type constructor of arity 0 being a type.) 3. If α and β are types, then α → β is a type. 4. If α1 , . . . , αn are types, then α1 × · · · × αn is a type. (For n = 0, this reduces to 1 being a type.)



𝔖 denotes the set of all types obtained from an alphabet (𝔖 for 'sort'). The symbol → is right associative, so that α → β → γ means α → (β → γ).

Definition 3. A type is closed if it contains no parameters. 𝔖ᶜ denotes the set of all closed types obtained from an alphabet. Note that 𝔖ᶜ is non-empty, since 1, Ω ∈ 𝔖ᶜ.

Example 1. In practical applications of the logic, a variety of types is needed, including 1, Ω, Nat (the type of natural numbers), Int (the type of integers), Float (the type of floating-point numbers), Char (the type of characters), and String (the type of strings). Each of these is a nullary type constructor. Other useful type constructors are those used to define lists, trees, and so on.

In the logic, List denotes the (unary) list type constructor. Thus, if α is a type, then List α is the type of lists whose elements have type α. Use will be made of the concept of one type being more general than another.

Definition 4. A type substitution is a finite set of the form {a1/α1, . . . , an/αn}, where each ai is a parameter, each αi is a type distinct from ai, and a1, . . . , an are distinct.

Definition 5. Let µ = {a1/α1, . . . , an/αn} be a type substitution and α a type. Then αµ, the instance of α by µ, is the type obtained from α by simultaneously replacing each occurrence of the parameter ai in α by the type αi (i = 1, . . . , n).

Definition 6. Let α and β be types. Then α is more general than β if there exists a type substitution ξ such that β = αξ.

Note that "more general than" includes "equal to", since ξ can be the identity substitution.

Example 2. Let α = (List a) × Ω and β = (List Int) × Ω. Then α is more general than β, since β = αξ, where ξ = {a/Int}.

Informally, a unifier for a set E of types is a type substitution µ such that the αµ are identical, for all α ∈ E. A most general unifier is a unifier that instantiates the types 'as little as possible' to make them identical. For example, {a/Int, b/Int} unifies the set {List a, List b}, but {a/b} is a most general unifier.
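The notions of type substitution and instance are easy to prototype. The following Python sketch (illustrative only; the encoding of types as nested tuples and all names are assumptions made here, not part of the tutorial) applies a substitution to a type; it reproduces Example 2, where ξ = {a/Int} takes List a × Ω to List Int × Ω.

```python
# A type is a parameter name (str), or a tuple (constructor, arg types...),
# e.g. ("List", "a") for List a and ("x", ("List", "a"), ("Omega",)) for List a x Omega.
def apply_subst(ty, subst):
    """Simultaneously replace parameters in a type according to a substitution."""
    if isinstance(ty, str):                       # a parameter
        return subst.get(ty, ty)
    constructor, *args = ty
    return (constructor, *[apply_subst(a, subst) for a in args])

alpha = ("x", ("List", "a"), ("Omega",))          # List a x Omega
beta  = ("x", ("List", ("Int",)), ("Omega",))     # List Int x Omega
xi = {"a": ("Int",)}
print(apply_subst(alpha, xi) == beta)             # True: alpha is more general than beta
```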

3 Terms

Definition 7. A signature is the declared type for a constant. The fact that a constant C has signature α is sometimes denoted by C : α.

I distinguish two different kinds of constants: data constructors and functions. In a knowledge representation context, data constructors are used to represent individuals. In a programming language context, data constructors are used to construct data values. In contrast, functions are used to compute on data values; functions have definitions while data constructors do not. In the semantics for the logic, the data constructors are used to construct models. As examples, the constants ⊤ (true) and ⊥ (false) are data constructors, as is each integer, floating-point number, and character. The constant # (cons) used to construct lists is a data constructor.



The set C always includes the following constants (where a is a parameter).
1. (), having signature 1.
2. =, having signature a → a → Ω.
3. ⊤ and ⊥, having signature Ω.
4. ¬, having signature Ω → Ω.
5. ∧, ∨, −→, having signature Ω → Ω → Ω.
6. Σ and Π, having signature (a → Ω) → Ω.

The intended meaning of = is identity (that is, = x y is ⊤ iff x and y are identical), the intended meaning of ⊤ is true, the intended meaning of ⊥ is false, and the intended meanings of the connectives ¬, ∧, ∨, −→, . . . are the usual ones; Σ and Π provide existential and universal quantification. Each data constructor has a signature of the form α1 → · · · → αn → (T a1 . . . ak), where T is a type constructor of arity k, a1, . . . , ak are distinct parameters, and all the parameters appearing in α1, . . . , αn occur among a1, . . . , ak (n ≥ 0, k ≥ 0). Furthermore, for each type constructor T, I assume that there does exist at least one data constructor having a signature of the form α1 → · · · → αn → (T a1 . . . ak).

Example 3. The data constructors for constructing lists are [] having signature List a and # having signature a → List a → List a, where # is usually written infix. [] represents the empty list. The term s#t represents the list with head s and tail t. Thus 4#5#6#[] represents the list [4, 5, 6].

The next task is to define the central concept of a term. In the non-polymorphic case, a simple inductive definition suffices. But the polymorphic case is more complicated since, when putting terms together to make larger terms, it is generally necessary to solve a system of equations and these equations depend upon the relative types of free variables in the component terms. The effect of this is that to define a term one has to define simultaneously its type, and its set of free variables and their relative types.

Definition 8. A term, together with its type, and its set of free variables and their relative types, is defined inductively as follows. 1. Each variable x m 21 is a term of type a, where a is a parameter. The variable x is free with relative type a in x. 2. Each constant C in j3, if x is free with relative type a in t, or type a —> 8, where a is a new parameter, otherwise.

A variable other than x is free with relative type a in Xx.t if the variable is free with relative type a in t.



4- (Application) If s is a term of type a —> 0 and t a term of type 7 such that the equation a = 7,

augmented with equations of the form p = 6, for each variable that is free with relative type p in s and is also free with

relative type 5 in t, have a most general unifier 6, then {s t) is a term of type 86.

A variable is free with relative type oO in {s t) if the variable is free with relative type o in s or t.

5. (Tupling) Ift\.... ,tn are terms of type a\,... ,an, respectively, such that the set of equations of the form Aii = Ph = ••■ = Pik,

for each variable that is free with relative type pi} in the termti:i (j — 1,... , k and k > 1), have a most general unifier 9, then (ii,... , tn) is a term of type a^Q x ■ ■ ■ x anQ.

A variable is free with relative type o9 in (ii,... ,t„) if the variable is free with relative type a in tj, for some j G {1,... . n}. The type substitution 0 in Parts 4 and 5 of the definition is called the associated most general unifier. £ denotes the set of all terms obtained from an alphabet and is called the language given by the alphabet.

Definition 9. A term is closed if it contains no free variables. Example 4- Let M be a miliary type, and A : M and concatenate : List a x

List a —> List a be constants. Recall that [] : List a and (j : a —»■ List a —> List a are the data constructors for lists. I will show that {concatenate ([], [A])) is a term. For this, ([], [A]) must be shown to be a term, which leads to the consideration of [] and [A]. Now [] is a term of type List a, by Part 2 of the definition of a term. By Parts 2 and 4, (jj A) is a term of type List M —> List M, where along the way the equation a — M is solved with the most general unifier

{a/M}. Then ((jj A) []), which is the list [A], is a term of type List M by Part 4, where the equation List M = List a is solved. By Part 5, it follows that ([], [A]) is a term of type List a x List M. Finally, by Part 4 again, {concatenate ([], [A])) is a term of type List M, where the equation to be solved is List a x List a —

List a x List M whose most general unifier is {a/M}. Example 5. Consider the constants append : List a —> List a —> List a —> (2 and

process : List a —> List a. I will show that {{{append x) []) {process x)) is a term. First, the variable x is a term of type b, where the parameter is chosen to avoid a clash in the next step. Then {append x) is a term of type List a —> List a —> C2, for which the equation solved is List a — b. Next {{append x) []) is a term of type List a —> (2 and x has relative type List a in {{append x) []). Now consider {process x), for which the constituent parts arc process of type List c —> List c and the variable x of type d. Thus {process x) is a term of type List c and x

208

J.W. Lloyd

has relative type List c in (process x). Finally, I have to apply ((append x) []) to the term (process x). For this, by Part 4, there arc two equations. These arc List a = List c, coming from the top-level types, and List a = List c, coming from the free variable x in each of the components. These equations have the

most general unifier {c/a}. Thus (((append x) []) (process x)) is a term of type Terms of the form (E Xx.t) are written as Bx.t and terms of the form (77 Xx.t) arc written as Vx.t (in accord with the intended meaning of U and 77). In a higher-order logic, one may identify sets and predicates - the actual identification is between a set and its characteristic function which is a predicate. Thus, if

t is of type Q, the abstraction Xx.t may be written as {x \ t} if it is intended to emphasize that its intended meaning is a set. The notation {} means {x | ±}. The notation set means (t s), where t has type a —> fi and s has type a, for some a. Furthermore, notwithstanding the fact that sets are mathematically identified with predicates, it is sometimes convenient to maintain an informal distinction

between sets (as "collections of objects") and predicates. For this reason, the notation {a} is introduced as a synonym for the type a —> Q. The term (s t) is often written as simply s t, using juxtaposition to denote application. Juxtaposition is

left associative, so that r s t means ((r s) t). Thus (((append x) []) (process x)) can be written more simply as append x [] (process x).

4 Basic Terms

Next I identify a class of terms, called basic terms, suitable for representing

individuals in diverse applications. From a (higher-order) programming language perspective, basic terms are data values. The most interesting aspect of the class of basic terms is that it includes certain abstractions and therefore is much wider

than is normally considered for knowledge representation. These abstractions allow one to model sets, multisets, and similar data types, in an elegant way.

Of course, there arc other ways of introducing (cxtcnsional) sets, multisets, and so on, without using abstractions. For example, one can define abstract data types or one can introduce data constructors with special equality theories. The primary advantage of the approach adopted here is that one can define these

abstractions intensionally (that is, by some implicit condition) [10]. Before getting down to the first step of giving the definition of basic terms,

some motivation will be helpful. How should a (finite) set or multiset be represented? First, advantage is taken of the higher-order nature of the logic to identify sets and their cliaracteristic functions, that is, sets are viewed as predicates. With this approach, an obvious representation of sets uses the connectives,

so that Xx.(x = 1) V (x = 2) is the representation of the set {1, 2}. This was the kind of representation used in [9] and it works well for sets. But the connectives are, of course, not available for multisets, so something more general is needed.

An alternative representation for the set {1, 2} is the term Xx.if x = 1 then T else if x = 2 then T else _L and this idea generalizes to multisets and similar abstractions. For example,

Learning Comprehensible Theories from Structured Data

209

Xx.if x = A then 42 else if x = D then 21 else 0

is the multiset with 42 occurrences of A and 21 occurrences of B (and nothing else). Thus I adopt abstractions of the form Xx.if x — t\_ then sy else ... if x — tn then sn else so

to represent (extensional) sets, multisets, and so on. However, before giving the definition of a basic term, some attention has to be paid to the term s0 in the previous expression. The reason is that So in this

abstraction is usually a very specific term. For example, for finite sets, so is _L and for finite multisets, so is 0. For this reason, the concept of a default term is now introduced. The intuitive idea is that, for each closed type, there is a

(unique) default term such that each abstraction having that type as codomain takes the default term as its value for all but a finite number of points in the domain, that is. sq is the default value. The choice of default term depends on the particular application but. since sets and multisets are so useful, one would expect the set of default terms to include ± and 0. However, there could also be other types for which a default term is needed. For each type constructor T, I assume there is chosen a unique default data constructor C such that C has

signature ■■■—> a.„ —> (T a-\ ... a*,). For example, for J?, the default data constructor could be _L, for Int. the default data constructor could be 0, and for

List, the default data constructor could be []. Definition 10. The set of default terms, 2), is defined inductively as follows. 1. If C is a default data constructor having signature oy —> • • • —> on —>

(T a-\ ... ak) and U, ■ ■ ■ , tn G 2) (n > 0) such that C U ... tn G £,, then Cii...i„GS).

2. Ifte® and x G 2T, then Xx.t G 2).

3.1fti,....tne1)(n>0) and (*i,... ,tn) G £, then (ti,... ,tn) G 2). There may not be a default term for some closed types. However, if it exists, one can show that the default term for each closed type is unique. Now basic terms can be defined. In the following, Xx.sq is regarded as the special case of

Xx.if x = t\ then S\ else ... if x = tn then sn else sq when n = 0.

Definition 11. The set of basic terms, 58, is defined inductively as follows.

1. If C is a data constructor having signature ay —> • • • —> an —> (T «i... %) and U, • • ■ , tn G 0) such that CU ... tn G £,, then CU ... tn G 0),s0elS) and Xx.if x = t-\ then s-\ else ... if x = tn then sn else sq G £, then

Xx.if x = U then s-\ else ... if x = tn then sn else so G 58.

3. Iftu... ,tn G 05 (n>0) and {ti:... ,tn) G £,, then (*i,... ,t„) G 58.

210

J.W. Lloyd

The previous definition is incomplete in that it overlooks some technical issues relating to the unique representation of individuals. Details on this can be

found in [10]. Part 1 of the definition of the set of basic terms states, in particular, that individual natural numbers, integers, and so on, are basic terms. Also a term

formed by applying a constructor to (all of) its arguments, each of which is a basic term, is a basic term. As an example of this, consider the following declarations of the data constructors Circle and Rectangle. Circle : Float —> Shape

Rectangle : Float —> Float —> Shape.

Then {Circle 7.5) and {Rectangle 42.0 21.3) are basic terms of type Shape. However, {Rectangle 42.0) is not a basic term as not all arguments to Rectangle arc given. Basic terms coming from Part 1 of the definition arc called basic structures and always have a type of the form T a-\ ... ari. The abstractions formed in Part 2 of the definition are "almost constant"

abstractions since they take the default term so as value for all except a finite number of points in the domain. They are called basic abstractions and always have a type of the form j3 —> 7. This class of abstractions includes useful data

types such as (finite) sets and multisets (assuming _L and 0 arc default terms). More generally, basic abstractions can be regarded as lookup tables, with sq as the value for items not in the table.

Part 3 of the definition of basic terms just states that one can form a tuple from basic terms and obtain a basic term. These terms are called basic tuples and always have a type of the form oi\ x • • • x an. It will be convenient to gather together all basic terms that have a type more general than some specific closed type.

Definition 12. For each a e &c, define !8a = {t e 2J | t has type more general than a}. The intuitive meaning of 03 a is that it is the set of terms representing individ-

uals of type a. Note that 03 — {Jaeec 03q,. However, the 03 a are not necessarily disjoint. For example, if the alphabet includes List, then [] e 03/^ a, for each closed type a.

5

Representation of Individuals

In this section, some practical issues concerning the representation of individuals are discussed and the ideas are illustrated with two examples. To make the ideas more concrete, consider an inductive learning problem, in which there is some collection of individuals for which a general classification

is required [13]. Training examples are available that state the class of certain individuals. The classification is given by a function from the domain of the individuals to some small finite set corresponding to the classes.

Learning Comprehensible Theories from Structured Data

211

I adopt a standard approach to knowledge representation. The basic princi-

ple is that an individual should be represented by a (closed) term: this is referred to as the 'individuals-as-ternis' approach. Thus the individuals are represented by basic terms. For a complex individual, the term will be correspondingly complex. Nevertheless, this approach has significant advantages: the representation is compact, all information about an individual is contained in one place, and the structure of the term provides strong guidance on the search for a suitable induced definition.

What types are needed to represent individuals? Typically, one needs the following: integers, floats, characters, strings, and boolcans; data constructors; tuples: sets; multisets; lists; trees; and graphs. The first group are the basic types, such as Int. Float, and (2. Also needed are data constructors for user-defined types. For example, see the data constructors Abloy and Chubb for the miliary type constructor Make below. Tuples are essentially the basis of the attributevalue representation of individuals, so their utility is clear. Less commonly used elsewhere for representing individuals are sets and multisets. However, sets, especially, and multisets are basic and extremely useful data types. Other constructs needed for representing individuals include the standard data types, lists, trees, and graphs. This catalog of data types is a rich one, and intentionally so. I advocate making a careful selection of the type which best models the application being studied. Consider now the problem of determining whether a key in a bunch of keys can open a door. More precisely, suppose there arc some bunches of keys and a particular door which can be opened by a key. For each bunch of keys either no key opens the door or there is at least one key which opens the door. For each bunch of keys it is known whether there is some key which opens the door, but it is not known precisely which key docs the job, or it is known that no key opens the door. The problem is to find a classification function for the bunches of keys, where the classification is into those which contain a key that opens the door and those that do not. This problem is prototypical of a number of important

practical problems such as drug activity prediction [4], as a bunch corresponds to a molecule and a key corresponds to a conformation of a molecule, and a molecule has a certain behavior if some conformation of it does. I make the

following declarations. Abloy, Chubb, Rubo, Yale, : Make, Short, Medium, Long : Length Narrow, Normal, Broad : Width.

I also make the following type synonyms. NumProngs = Nat Key — Make x NumProngs x Length x Width

Bunch = {Key}. Thus the individuals in this case are sets whose elements are 4-tuples. The function to be learned is

opens : Bunch —> (2.

212

J.W. Lloyd

Here is a typical training example.

opens {(Abloy, 4, Medium, Broad), (Chubb,3, Long, Narrow), (Abloy, 3, Short, Normal)} = T. I will return to this example in Sections 7 and 8.
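As an aside, the individuals-as-terms view translates directly into ordinary data structures. The following Python sketch (an illustration only; the tuple layout and names are assumptions, not part of the tutorial) represents a bunch of keys as a frozenset of (make, numProngs, length, width) tuples, mirroring the type Bunch = {Key}.

```python
# A Key is a 4-tuple (make, num_prongs, length, width); a Bunch is a set of Keys.
bunch_1 = frozenset({
    ("Abloy", 4, "Medium", "Broad"),
    ("Chubb", 3, "Long", "Narrow"),
    ("Abloy", 3, "Short", "Normal"),
})

# The training example above: opens(bunch_1) = True.
training_examples = [(bunch_1, True)]
```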

As another example of knowledge representation, consider the problem of modeling a chemical molecule. The first issue is to choose a suitable type to represent a molecule. 1 use an undirected graph to model a molecule an atom is a vertex in the graph and a bond is an edge. Having made this choice, suitable types are then set up for the atoms and bonds. For this, the nullary type

constructor Element, which is the type of the (relevant) chemical elements, is first introduced. Here are the constants of type Element. Br, C, CI, F, H, I, N, O, S : Element.

I also make the following type synonyms.

AtomType — Nat Charge = Float

Atom = Element x AtomType x Charge Bond = Nat.

For (undirected) graphs, there is a "type constructor'' Graph such that the type of a graph is Graph v e, where v is the type of information in the vertices and e is the type of information in the edges. Graph is defined as follows. Label = Nat

Graph v e = {Label x v\ x {(Label —> Nat) x e}. Here the multisets of type Label —> Nat are intended to all have cardinality 2, that is, they are intended to be regarded as unordered pairs. Note that this definition corresponds closely to the mathematical definition of a graph: each vertex is labeled by a unique integer and each edge is uniquely labeled by the unordered pair of labels of the vertices it connects. Also it should be clear by now that Graph is not actually a type constructor at all; instead Graph v e is simply notational sugar for the expression on the right hand side of its definition.

The type of a molecule is now obtained as an (undirected) graph whose vertices have type Atom and whose edges have type Bond. This leads to the following definition. Molecule — Graph Atom Bond. Here is an example molecule, called dl, from the mutagenesis datasct available

at [18]. The notation (s,t) is used as a shorthand for the multiset that takes

Learning Comprehensible Theories from Structured Data

213

the value 1 on each of s and t, and is 0 elsewhere. Thus (s,t) is essentially an unordered pair.
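Before the full term is listed, here is a small Python sketch of the same representation idea (illustrative only; the dictionaries and names are assumptions): a molecule as a labelled undirected graph, with vertices carrying (element, atom type, charge) triples and edges keyed by unordered pairs of vertex labels, mirroring Graph v e above.

```python
# Vertices: label -> (element, atom_type, partial_charge)
# Edges: unordered pair of labels (a frozenset) -> bond type
vertices = {
    1: ("C", 22, -0.117),
    2: ("C", 22, -0.117),
    7: ("H", 3, 0.142),
}
edges = {
    frozenset({1, 2}): 7,   # aromatic bond between atoms 1 and 2
    frozenset({1, 7}): 1,   # single bond between atoms 1 and 7
}
molecule = (vertices, edges)   # plays the role of a term of type Molecule
```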

({(1,(0,22,-0.117)),

(2, (C, 22,-0.117)),

(3,(0,22,-0.117)),

(4, (C, 195,-0.087)),

(5, (C, 195,0.013)),

(6,(0,22,-0.117)),

(7, (if, 3,0.142)),

(8, (if, 3,0.143)),

(9, (if, 3,0.142)),

(10, (13, (16, (19,

(11, (14, (17, (20,

(12,(0,27,0.013)), (15, (if, 3, 0.143)), (18,(0,22,-0.117)), (21, (if, 3, 0.142)),

(if, 3,0.142)), (C, 22,-0.117)), (if, 3, 0.143)), (C, 22,-0.117)),

(22, (if, 3,0.143)), (25,(0,40,-0.388)),

(O, (C, (C, (O,

27, -0.087)), 22,-0.117)), 22,-0.117)), 22, -0.117)),

(23, (if, 3,0.142)), (26,(0,40,-0.388))},

(24, (A-, 38, 0.812)),

{((1,2),7),

«1,6),7),

((1,7),1),

((2.,3),7),

((2,8),1),

((3,4),7),

((3,9)..1),

((4,5), 7),

((4 ,11),7),

((5,6),7),

((5,14),7), ((12,13), 7), ((14,16),1),

((6,10),1),

((11,12),7),

((12, 20), 7),

((13,14), 7),

((17,18),7),

((17,21),1),

((11,17),7), «13,15),1), ((18,19),7),

«18,22),1),

((19,20), 7),

((19,24),1),

((20,23),1),

((24, 25), 2),

((24, 26), 2)}).

Having represented the molecules, the next task is to learn a function that provides a classification of the carcinogenicity of the molecules. One way of doing this is to build, using a set of training examples, a decision tree from which the definition of the classification function can be extracted. The most important aspect of building this tree is to find suitable predicates to split the training examples. This issue will be discussed in Section 7.

6

Kernels on Basic Terms

Learning methods that rely on kernels arc becoming increasingly widely used

[17]. This section provides a kernel for basic terms that opens up the way for kernel-based learning methods to be applied to individuals that can be represented by basic terms. Readers should note that kernel methods do not generally provide comprehensible theories; nevertheless, they are of great interest because of their potential for learning from structured data.

The starting point is the definition of a positive definite kernel. (Z+ is the set of positive integers and K is the set of real numbers.) Definition 13. Let X be a set. A symmetric function k : XxX —> R is a positive

definite kernel on X if, for all n e Z+

follows that J2i,je{i

™} ci ci k(xi,Xj) > 0.

e X, and c\.... , cn e R; it

214

J.W. Lloyd

Of course, symmetry means k(x,x') = k(x',x), for all x,x' £ X. One can think of a positive definite kernel as being a generalized dot product. Many learning algorithms, for example, support vector machines, depend only on being able to compute the dot product between individuals. Originally, individuals were actually represented by vectors in Rm and the dot product in Rm was used in these algorithms. However, recent versions of these algorithms substitute the dot product by a kernel k. The justification for this replacement is as follows. Let X be a set and k a positive definite kernel on X. Then there exists a Hilbert space 5j and a mapping

: X —> fj such that k(x,x') = ($(x)./T(x')), where (.,.) is the dot product in fj. The mapping

7, the support of u, denoted supp(u), is the set {v G 58a | V(u w) 0 ID}. Thus, for the s above, supp(s) = {A. B\. Definition 15. The function k : 58 x 58 —> R is defined inductively on the structure of terms in 58 as follows. Let s,t G 58. i. -[/«,£ G 58a, where a— T ai. ..a/c, for some T,a\,... , a^, t/jen

v' ' 'y

I kt(C, C) + 2J fc(s':' ^) otherwise I

i 1

w/iere s isCsi...s„ and t is D ty.. .tm.

2. If s,t G 58a, where a = j3 —> 7, /or some /?, 7, t/ien

k(s,i)=

Yl k(V(su),V(tv))-k(u,v). uEsupp(s) v£supp(t)

S. If 5, £ £ Q3«; where a = a-\ x ■ - - x aTlJ /or .some a-i,... , aTlJ then n

k(s,t) = ^2k{Si,ti), where s is (s-\,... , sn) and t is (£1,... , tn). 4- If there does not exist a G &c such that s,t G 58a, then k(s, t) = 0. The definition for k is, of course, only one of many possibilities, but it at least establishes the general form of kernels on 58. Many variants for each component

of the above definition are suggested in [17]. It should be clear that the definition of k does not depend on the choice of a

such that s,t G 58a- (There may be more than one such a.) What is important is only whether a has the form T ot\... a.k, 0 —► 7, or ot\ x • ■ ■ x an, and this is invariant.

One can show that, for each a G &'\ k is a positive definite kernel on 58a [6]. There are several special cases of the preceding definition that are of interest. First, if a is (3 —> (7, then the abstractions of Part 2 of the definition arc, of course, sets. Let k_q be the discrete kernel. Then

k(s,t) =

JJ

k(u, v),

uesupp(s') v£supp(t)

since k(V(s u), (V(t v)) = Kn(T, T) = 1, for u G supp(s) and v G supp(t). This kernel on sets was studied in [7]. For multisets, the component k(V(s u), V(t v))

216

J.W. Lloyd

scales by the product of the multiplicities of the element u in s and v in t

(assuming the product kernel is used on N). Let a be Real x • • • x Real, where there are n components in the product.

Denote this type by Real". Let Kneai be the product kernel. Then the kernel in Part 3 for 23_ReaP is simply the usual dot product in Kn. Next several examples illustrating how to compute the kernel on individuals of certain types are given. Example 8. Suppose that Kiist is the discrete kernel. Let M be a nullary type constructor and A, B,C,D : M. Suppose that km is the discrete kernel. Let s

be the list [A, B, C] and t the list [A, D]. Then

k(s, t) = KLlst(t,«) + k(A, A) + k([B, C], [£>]) = 1 + KM(A, A) + KrMl tl) + KB, D) + k([C}} []) = 1 + 1 + 1 + KM(B, D) + KLlst(l []) =3+0+0 = 3.

Example 9. Let BTree be a unary type constructor, and Null : BTree a and BNode : BTree a —► a —> BTree a —> BTree a be data constructors. Suppose that KBTree is the discrete kernel. Let M be a nullary type constructor and A, B,C,D : M. Suppose that km is the discrete kernel. Let s be

BNode (BNode Null A Null) B (BNode Null C (BNode Null D Null)), a binary tree of type BTree M. and t be

BNode (BNode Null A Null) B (BNode Null D Null). Then

k(s,t)

KBTree(BNode, BNode)

+ k(BNode Null A Null, BNode Null A Null) +k(B,B) + k(BNode Null C (BNode Null D Null), BNode Null D Null) 1 + KBTree(BNode, BNode) + k(Null, Null) + k(A, A) + k(Null, Null) + km(B, B) + KBTree(BNode, BNode) + k(Null, Null) + k(C, D) + k(BNode Null D Null, Null) 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1+0 + KHTreeXBNode, Null) 8.

Example 10. Suppose that kq is the discrete kernel. Let M be a nullary type constructor and A, B,C,D : M. Suppose that km is the discrete kernel. If s is

the set {A, B, C\ e Sm^.q and t is the set {A, D\ G *Bm^„q, then

Learning Comprehensible Theories from Structured Data

217

k(s, t) = k(A, A) + k{A, D) + k(B, A) + k(B, D) + k{C, A) + k{C, D) = km (A, A) + km (A, D) + km (B, A) + km (B, D) + km (C, A) + km(C,D) =1+0+0+0+0+0 = 1.
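For sets of atomic elements the kernel reduces to counting matching pairs of elements, as in the computation above. The following Python sketch (an illustration under the assumption that the element kernel is the discrete kernel and that sets are plain Python sets; it is not the authors' code) reproduces Example 10.

```python
def discrete_kernel(x, y):
    """kappa(x, y) = 1 if x == y else 0."""
    return 1.0 if x == y else 0.0

def set_kernel(s, t, element_kernel=discrete_kernel):
    """Kernel on finite sets: sum of element_kernel over all pairs (u, v) in s x t."""
    return sum(element_kernel(u, v) for u in s for v in t)

s = {"A", "B", "C"}
t = {"A", "D"}
print(set_kernel(s, t))   # 1.0, matching Example 10
```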

Example 11. Suppose that kmul is the product kernel. Let M be a miliary type constructor and A, B,C.D : M. Suppose that km is the discrete kernel. If s is

{A, A, B, C, C, C) e ft) -^ P -^ a,

where any parameters in p\,... ,pk and a appear in p, and k > 0. The type p, is distinguished and is called the source of the transformation, while the type a is called the target of the transformation. The number k is called the rank of the transformation.

218

J.W. Lloyd

Here are some examples of transformations.

Example 12. Consider again the type Shape and the data constructors Circle : Float —► Shape Rectangle : Float —> Float —> Shape.

The function isCircle : (Float —> 17) —> Shape —> 17 defined by isCircle bt = 3x.(t = Circle x) A (6 a;) is a transformation. Similarly, the function isRectangle : (Float —> 17) —> (Float —> 17) —► Shape —> 17 defined by isRectangle b ct = 3x.3y.(t = Rectangle x y) A (b x) A (c y) is a transformation. For predicates 6 and c, (isRectangle b c) is a predicate on geometrical shapes which returns T iff the shape is a rectangle whose length satisfies 6 and whose breadth satisfies c.

Example 13. Each projection j9ro.)i : ai x • • • x an —>• a*

defined by

proj.; (£i,... .*„) = t;, for i = l,... , ra, is a transformation of rank 0. Example 14- Consider the transformation

setExistsi : (a —> 17) —> {a} —> 17 defined by

setExistst bt = 3x.(b x) A (x G £). The predicate (setExistsi b) checks whether a set has an element x that satisfies b.
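Transformations like setExists1 are just higher-order functions. A minimal Python rendering (illustrative; the names are assumptions) makes the behaviour concrete: it takes a predicate on elements and returns a predicate on sets.

```python
def set_exists_1(b):
    """Given a predicate b on elements, return a predicate on sets:
    set_exists_1(b)(t) is True iff some x in t satisfies b."""
    return lambda t: any(b(x) for x in t)

is_even = lambda n: n % 2 == 0
print(set_exists_1(is_even)({1, 3, 4}))   # True
print(set_exists_1(is_even)({1, 3, 5}))   # False
```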

Example 15. The transformation

An : (a ->■ 17) -> ■ • • ->■ (a -> 17) -> o -> 17 defined by

A„ Pi • • -Pn = >^x.((pi x) A ■ ■ ■ A (pn x)), where n > 2, provides a 'conjunction' with n conjuncts.

Learning Comprehensible Theories from Structured Data

219

Example 16. There are two fundamental transformations top : a —► f2 and bottom : a —> f2 defined by top x = T and bottom x = ±, for each x. Each of top and bottom, is a constant predicate, with top being the weakest predicate on the type a and bottom being the strongest.

Example 17. Let /i be a type and suppose that A : ji. D : /i, and C : /x are constants. Then, corresponding to A, one can define a transformation

(= A) : /i ^ rt by

((= A) x) = T iff s = A, with analogous definitions for (= D) and (= C). Similarly, one can define the transformation

by

((/ A) x) = T iff a; ^ A Next the definition of the class of predicates formed by composing transformations is presented. In the following definition, it is assumed that some

(possibly infinite) class of transformations is given and all transformations considered are taken from this class. A standard predicate is defined by induction

on the number of (occurrences of) transformations it contains as follows. Definition 17. A standard predicate is a term of the form

(/i 61,1 ...bhkl)o ... o(/r, 6„,j ...bThkn),

where f;L is a transformation of rank hi (i = 1,... . n), the target of fn is fi, b;^,-h is a standard predicate (i — 1,... . n, ji — 1,... , ki), ki > 0 (i — 1,... ,n) and n > 1.

The set of all standard predicates is denoted by S. Standard predicates have type of the form /x —> (2, for some type \i. Example 18. Returning to the keys example of Section 5, let the transformations be as follows.

(= Abloy)

Make ->■ Q

(= Chubb)

Make ->■ Q

(— Rubo)

Make —> (2

(= Yale)

Make -> (2

NumProngs —> f2

(=3)

NumProngs —► Q

220

J.W. Lloyd

(=4)

NumProngs —► i?

(=5) NumProngs —> 47

(=6) (= Short) (= Medium) (= Long)

NumProngs —> i?

(= Narrow) (= Normal) (= Broad)

Mdi/i -^ J?

Length —> i?

Length —> i? Length —> i7 Mrfi/t -> 4? Mdf/i -► ft

projMake isTe?/ —> Ma&e vjNumProngs ifej/ —> NumProngs

projLength

Key —> Length

proj Width

Key -»■ ITjrfift

setExistsi

(ATej/ —> 17) —> Bunch

A4

-> Q

(Key -> Q) -> (Key - Q) ^ (toy ^ fl)

(toy ^ fl)

Then the following is a standard predicate.

setExistsi (A4 (projMake ° top) (projNumProngs ° top) (projLength ° top)

(projWidth ° fop)). Now predicate rewrite systems are introduced.

Definition 18. .4 predicate rewrite system is a finite relation >—> on S satisfying the following two properties. 1. For each p ■ 2. For each p

q, the type of p is more general than the type of q. > q. there does not exist s >—> i such that q is a strict subterm

of s. If p >—> q, then p >-^> q is called a predicate rewrite, p the head, and q the body of the predicate rewrite. Conditions 1 and 2 in Definition 18 arc needed to ensure that predicate

rewrite systems have certain important properties, but these need not concern us here. Similarly, in the next definition, some technicalities are overlooked to

simplify the exposition. The notation p[r/b] denotes the expression obtained by replacing the subterm r in p by b. Definition 19. Let >—> be a predicate rewrite system andp a standard predicate. A subterm r of p is a redex with respect to >—> if there exists a predicate rewrite

r >—> b such that p[r/b] is a standard predicate. In this case, r is said to be a redex via r >—> b.

Learning Comprehensible Theories from Structured Data

221

Definition 20. Let >—> be a predicate rewrite system, and p and q standard predicates. Then q is obtained by a predicate derivation step from p using ^-> if

there is a redex r via r >—> b in p and q = p[r/b]. Now I can give the definition of the key concept of a predicate derivation.

Definition 21. A predicate derivation with respect to a predicate rewrite system

>—> is a finite sequence (po,Pi,--- ,Pn) of standard predicates such that pi is obtained by a derivation step from Pi-\ using >—>, for i = 1.... ,n. The standard predicate po is called the initial predicate and the standard predicate pn is called the final predicate.

Usually the initial predicate is top, the weakest predicate. Typically, the search space of predicates is a tree that has top at the root. Also each branch in the search space is a predicate derivation. By traversing the search space in some systematic fashion, a collection of standard predicates is enumerated. Example 19. Consider the keys example again. Let the predicate rewrite system be as follows.

top ;—> setExistsi (A4 (projMake ° top) (projNurnProngs ° top) (projLength ° top) (projWidth ° top)) top >—>

= Abloy)

top >—>

= Chubb)

top ;—» = Rubo) top >—> = Yale) top >—>

= 2)

top >—>

= 3)

top ;—» = 4) top >—> = 5) top >—>

= 6)

top >—>

= Short)

top ;—» = Medium) top >—> = Long) top >—>

— Narrow)

top >—>

= Normal)

top ;—» = Broad).

Then the following is a predicate derivation with respect to this predicate rewrite system.

top

setExists-[ (A4 (projMake. ° top) (projNurnProngs ° top) (projLength ° top) (projWidth ° top))

222

J.W. Lloyd

setExistsi (A4 (projMake ° (= Abloy)) (projNumProngs ° top) (projLength ° top) (projWidth ° top))

setExistsi (A4 (projMake ° (= Abloy)) (projNumProngs ° top) (projLength ° (= Medium)) (projWidth ° top)) setExistsi (A4 (projMake ° (= Abloy)) (projNumProngs ° top) (projLength" (— Medium)) (projWidth° (— Broad))) setExistsi (A4 (projMake°(= Abloy)) (projNumProngs ° (= 6)) (projLength ° (= Medium)) (projWidth J?.) For the second step, the redex top has positional type Make —► i? which is the same as the type of the body of the predicate rewrite top >—> (= Abloy). Thus this occurrence of top is indeed a redex via this predicate rewrite. The remaining steps are similar.

8

A Simple Learning Example

The predicate construction process described in the previous section has applications throughout machine learning. Perhaps the most obvious of these is to

decision-tree learning. To build a decision tree, the fundamental step in the algorithm is to find a predicate that splits some given set of training examples into two sets. The split must be a good one in the sense that each of the two subsets is "purer" in the distribution of classes than the original set. The previous section provides a convenient and effective method for finding such a predicate for individuals with complex structure. Given the type of the individuals, one first decides on a suitable set of transformations and then a suitable set of rewrites

for enumerating predicates. I now turn to a problem which is interesting because it illustrates the

multiple-instance problem that is discussed in [4]. From a knowledge representation point of view, a multiple-instance problem is indicated by having a term which is a set at the top level representing an individual. Consider again the problem of determining whether a key in a bunch of keys can open a door. This problem is prototypical of a number of important prac-

tical problems such as that of pharmacophore discovery [4, 5]. A characteristic property of the multiple-instance problem is that the "individuals" are actually a collection of similar entities and the problem is to try to find a suitable entity in each collection satisfying some predicate. Thus the kind of predicate one wants to find has the form

(setExistsi b) for a suitable predicate b on the entities in the collection. Expressed this way, it is clear that the multiple-instance problem is just a special case of the general framework that is discussed in this tutorial in which the type of the individuals

Learning Comprehensible Theories from Structured Data

223

is a set. Since the individuals to be classified have a set type, following the principles proposed here, one should go inside the set to apply predicates to its elements.

To illustrate the multiple-instance problem, consider the following illustration concerning bunches of keys. The function opens to be learned has signature opens : Bunch —> Q. The training examples arc as follows.

opens{(Abloy, 3, Short, Normal),

(Abloy, 4, Medium, Broad).

(Chubb, 3, Long, Narrow)} (Chubb, 2, Long, Normal), (Chubb, 4, Medium, Broad)} (Abloy, 4, Medium, Broad).

=T

=T

opens{(Abloy, 3, Medium, Narrow),

(Chubb, 3, Long, Narrow)} (Abloy, 4, Medium, Narrow), (Yale, 4, Medium, Broad)} (Chubb, 6, Medium. Normal).

(Rubo, 5, Short, Narrow), opens{(Chubb, 3, Short, Broad), (Yale, 3, Short, Narrow), opens{( Yale, 3, Long, Narrow),

(Yale, 4, Long, Broad)} (Chubb, 4, Medium, Broad), (Yale, 4, Long, Normal)} (Yale, 4, Long, Broad)}

=T

opens{(Abloy, 3, Short, Broad), (Rubo, 4, Long, Broad), opens{(Abloy, 4, Short, Broad),

(Chubb, 3, Short, Broad), (Yale, 4, Long, Broad)} (Chubb, 3, Medium, Broad), (Rubo, 5, Long, Narrow)}

opens {(Abloy, 3, Medium, Broad), opens{(Abloy, 3, Short, Broad), opens {(Abloy, 3, Medium, Broad), (Chubb, 3, Long, Broad),

=T

=T

=± — _L = _L — ±.

The predicate rewrite system for this example is the one given in the previous section. Then a decision-tree learning system can find the following definition

for the function opens (in which some simplification has been done to reduce a term containing A4 to one containing A2). opens b =

if setExistsi (A2 (projMake ° (= Abloy)) (projLength ° (= Medium))) b then T else ±.

"A bunch of keys opens the door if and only if it contains an Abloy key of medium length''. As this example illustrates, it is straightforward to convert hypotheses in the logic to statements in structured English that can be easily understood by domain experts.

224

9

J.W. Lloyd

Related Work and Further Reading

The issues discussed in this tutorial have been studied for some decades in the

computational logic community and, more recently, by researchers in inductive

logic programming (ILP). This field was christened in [14]. The first comprehensive account of its research agenda was given in [15]. The definitive account of the theoretical foundations of inductive logic programming is in [16]. Details of the ILP workshop series are recorded at [8]. Almost all work in ILP is in the setting of first-order logic. In contrast, this tutorial advocates the advantages of higher-order logic. More details on this higher-order approach including practical applications can be found in [1], [2], [6], [10], [11], and [12]. This tutorial has not discussed important machine learning issues such as

computational complexity and sample complexity [13]. Many such issues arc orthogonal to the particular approach taken here and therefore existing results are equally applicable. However, there are some interesting open questions that need investigation such as methods for calculating the VC-diinension of the rich hypothesis languages proposed here. References 1. A.F. Bowers, C. Giraud-Carrier, and J.W. Lloyd. Classification of individuals with complex structure. In P. Langley, editor, Machine Learning: Proceedings of the. Seventeenth International Conference, (ICML2000), pages 81-88. Morgan Kaufmarm, 2000. 2. A.F. Bowers, C. Giraud-Carrier, and J.W. Lloyd. A knowledge representation framework for inductive learning, http://csl.anu.edu.au/~jwl. 2001. 3. A. Church. A formulation of the simple theory of types. Journal of Symbolic Logic, 5:56 68, 1940.

4. T.G. Dietterich, R.H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31-71, 1997.
5. P. Finn, S. Muggleton, D. Page, and A. Srinivasan. Pharmacophore discovery using the inductive logic programming system PROGOL. Machine Learning, 30:241-270, 1998.
6. T. Gartner, J.W. Lloyd, and P. Flach. Kernels for structured data. In Proceedings of the 12th International Conference on Inductive Logic Programming (ILP2002). Springer-Verlag, Lecture Notes in Computer Science, 2002.
7. D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, Department of Computer Science, 1999.
8. Home page of ILP workshops. http://www.cs.york.ac.uk/ILP-events/.
9. J.W. Lloyd. Programming in an integrated functional and logic language. Journal of Functional and Logic Programming, 1999(3), March 1999.
10. J.W. Lloyd. Knowledge representation, computation, and learning in higher-order logic. http://csl.anu.edu.au/~jwl, 2001.
11. J.W. Lloyd. Higher-order computational logic. In A. Kakas and F. Sadri, editors, Computational Logic: Logic Programming and Beyond, pages 105-137. Springer-Verlag, LNAI 2407, 2002. Essays in Honour of Robert A. Kowalski, Part I.


12. J.W. Lloyd. Predicate construction in higher-order logic. Electronic Transactions on Artificial Intelligence, 4(2000):21-51, Section B. http://www.ep.liu.se/ej/etai/2000/009/.
13. T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
14. S. Muggleton. Inductive logic programming. New Generation Computing, 8(4):295-318, 1991.
15. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19/20:629-679, 1994.
16. S.H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming. Lecture Notes in Artificial Intelligence, 1228. Springer-Verlag, 1997.
17. B. Scholkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
18. Home page of Machine Learning Group, The University of York. http://www.cs.york.ac.uk/mlg/.

Algorithms for Association Rules

Markus Hegland

Australian National University, Canberra ACT 0200, Australia
[email protected]
http://datamining.anu.edu.au

Abstract. Association rules are “if-then rules” with two measures which quantify the support and confidence of the rule for a given data set. Having their origin in market basket analysis, association rules are now one of the most popular tools in data mining. This popularity is in large part due to the availability of efficient algorithms following from the development of the Apriori algorithm. We will review the basic Apriori algorithm and discuss variants for distributed data, inclusion of constraints and data taxonomies. The review ends with an outlook on tools which have the potential to deal with long itemsets and considerably reduce the number of (uninteresting) itemsets returned. The discussion will focus on the problem of finding frequent itemsets.

1 Searching for Associations

Association rule mining originated in market basket analysis, which aims at understanding the behavior and (shopping) interests of retail customers. This understanding helps with product placement and direct marketing. A major challenge is that the retail customer often has 10,000 or more items to choose from, so that one can get widely different market baskets. If one does not consider the variations in amounts of the same items but only the different items in a market basket, one has 2^{10,000} ≈ 10^{3,000} different potential market baskets. In practice, however, no market baskets will contain very many items. However, even if one only considers market baskets with, say, up to 30 items, one gets more than 10^{100} possibilities. This is an instance of the curse of dimensionality. Any data mining algorithm will somehow have to deal with this curse. Association rule mining for market basket analysis discovers patterns in observed market baskets which occur frequently. Besides retail, the market basket analysis framework has also been used in the health and other service industries. Furthermore, applications of association rule mining now reach far beyond market basket analysis and include the detection of network intrusions or attacks from the logs of web servers and the analysis of webserver page usage. Association rule discovery is also used by scientists to mine DNA sequences and protein structure and to investigate time series. Two types of patterns can be found in association rule mining: A first type are “if-then rules” of the form: “If a customer buys milk then she also buys



bread”. A second type relates to co-occurrence of items in the market basket: “A customer buys bread and milk together”. The discovery of the second pattern is simpler than that of the first; moreover, one can see that the discovery of the first pattern can be based on the discovery of the second one. We will thus focus here on the second type of pattern. One models the potential items as a set I = {a1, . . . , am} and a transaction as a subset of I. Any patterns are derived from the transaction database, which is a sequence of itemsets DB = (T1, . . . , Tn). A pattern of the second type can be described as an itemset A ⊂ I as well. A transaction Ti is said to support an itemset A if A ⊂ Ti. The number (sometimes the proportion) of all transactions which support A is called the support of A in DB and is denoted by σ(A) = #{i | A ⊂ Ti}. A frequent itemset A is defined as an itemset with a support which is larger than some threshold s0. The choice of this threshold by the user determines how many frequent itemsets are found but also how useful these itemsets will be. Frequent itemset mining aims to find all itemsets A with support larger than the threshold, i.e., for which σ(A) ≥ s0. The naive approach would be to determine the support of all possible itemsets and select the ones which are frequent. This is infeasible, as the set of all possible itemsets is the powerset of the set of all items and in most applications very large. Association rule mining algorithms will have to find ways to detect the frequent itemsets without visiting all the possible itemsets. Note that the itemsets form a Boolean lattice, see Figure 1 for a simple example.


Fig. 1. Boolean lattice of breakfast itemsets

The search for all frequent itemsets typically starts with the minimal element of this lattice and covers as little of the lattice as possible. In the next section we review the basic Apriori algorithm, then, in Section 3, we consider some variants, and in the last section we look at the problem of finding very large frequent itemsets.
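Before turning to the algorithm, here is a minimal Python sketch of the support function σ(A) on a toy transaction database; the representation of transactions as Python sets is an illustrative choice made for this sketch only.

```python
def support(A, transactions):
    """sigma(A): number of transactions containing the itemset A."""
    return sum(1 for t in transactions if A <= t)

db = [{"milk", "bread"}, {"milk", "coffee"}, {"milk", "bread", "juice"}, {"bread", "juice"}]
print(support({"milk", "bread"}, db))   # 2, so {milk, bread} is frequent for s0 <= 2
```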

2 Breadth First Search: Apriori Algorithm

The Apriori algorithm [1] determines the support of itemsets in a level-wise BFS fashion. First it finds the supports of 1-itemsets (the itemsets with only one element), then of 2-itemsets, etc.:

  C1 is the set of all one-itemsets, k = 1
  While Ck ≠ ∅
    scan database to determine support σ(A) for all A ∈ Ck
    extract frequent itemsets from Ck into Lk
    generate Ck+1
    k := k + 1

The algorithm does not determine the supports of all possible itemsets; instead, it uses a clever strategy to determine candidates for frequent itemsets, i.e., it finds sets Ck of k-itemsets which contain all the frequent itemsets but not much else. The main observation on which such a selection of candidates is based is the apriori principle, which we formulate as a simple lemma:

Lemma 1. For any itemset A ⊂ I with σ(A) ≥ s0 and any other itemset B ⊂ I with B ⊂ A one has σ(B) ≥ s0.

Proof. For any B ⊂ A one has #{i | A ⊂ Ti} ≤ #{i | B ⊂ Ti} and so σ(A) ≤ σ(B). This is the antimonotonicity of the support function σ(·). As σ(A) ≥ s0 it follows that σ(B) ≥ s0.

So any subset of a frequent itemset has to be frequent, and once all the frequent itemsets up to size k are known, all the (proper) subsets of the frequent (k+1)-itemsets are known. Thus one chooses as the set of candidates Ck+1 just the (k+1)-itemsets which only have frequent proper subsets. This, in fact, is the minimal set of candidates one can find without a further data scan given the frequent itemsets up to level k. The simplest way to determine the candidate itemset Ck+1 would be to enumerate all possible (k+1)-itemsets and remove the ones which have infrequent subsets. This, however, is again not feasible for larger m and k, as the number of possible (k+1)-itemsets is m!/((k+1)!(m−k−1)!) ≈ m^{k+1}/(k+1)!, which, in the case of m = 10,000 and k = 5, is more than 10^{21}. (One observes that the complexities have been substantially reduced using the simple observations above. However, in order to be computationally feasible one would require the size of the problem to go below around 10^{10}.) The complexity of the determination of potential candidates is further reduced by using the join operator. If itemsets are represented as (alphabetically) ordered lists A[1 : k], the join of a set of frequent k-itemsets Lk with itself is the set of all (k+1)-itemsets which are defined as the union of any two frequent k-itemsets which share the first k − 1 items (the (k−1)-prefix):

  Lk ∗ Lk = {A ∪ B | A, B ∈ Lk, A ≠ B, A[1 : k − 1] = B[1 : k − 1]} .

The determination of the join requires one to visit the elements of Lk at most twice if they are visited in alphabetic order. One then has:


Lemma 2. Let Lk and Lk+1 be the sets of frequent k- and (k+1)-itemsets, respectively. Then every frequent (k+1)-itemset is an element of the join Lk ∗ Lk, i.e., Lk ∗ Lk ⊃ Lk+1.

Proof. As any subset of a frequent itemset is frequent, we can find for every element C ∈ Lk+1 two elements A, B ∈ Lk with common prefix such that C = A ∪ B.

The size of the join Lk ∗ Lk is now small enough that one can remove all its elements which contain subsets not in Lk, and one obtains for the candidate itemset:

  Ck+1 = {A ∈ Lk ∗ Lk | (B ⊂ A) ⇒ (B ∈ L|B|)}

where |B| denotes the size of the set B. The introduction of the Apriori algorithm by Agrawal and coworkers [1] as discussed above marked a first peak in the development of data mining algorithms. In particular it motivated many variations of the algorithm which address special situations and which will be discussed in the next section. Even the most recent developments are still to a large extent based on the apriori principle, as we will see in the last section.
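The level-wise loop, the database scan, and the join-and-prune candidate generation described above can be put together in a short Python sketch; the function names and the frozenset-based representation are choices made for this illustration and are not taken from [1].

```python
def generate_candidates(Lk):
    """C_{k+1}: join L_k with itself on the (k-1)-prefix, then prune
    itemsets that have an infrequent k-subset (the apriori principle)."""
    ordered = sorted(tuple(sorted(A)) for A in Lk)   # alphabetically ordered lists
    frequent = set(map(frozenset, ordered))
    k = len(ordered[0]) if ordered else 0
    candidates = set()
    for i, A in enumerate(ordered):
        for B in ordered[i + 1:]:
            if A[:k - 1] != B[:k - 1]:               # same-prefix itemsets are adjacent
                break
            C = frozenset(A) | frozenset(B)
            if all(C - {x} in frequent for x in C):  # prune via the apriori principle
                candidates.add(C)
    return candidates

def apriori(transactions, s0):
    """All frequent itemsets (with supports) for the absolute threshold s0."""
    items = {a for t in transactions for a in t}
    candidates = {frozenset([a]) for a in items}              # C_1
    result = {}
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= s0}     # frequent k-itemsets
        result.update(Lk)
        candidates = generate_candidates(Lk)                  # C_{k+1}
    return result

db = [frozenset(t) for t in ({"milk", "bread"}, {"milk", "bread", "juice"},
                             {"bread", "juice"}, {"milk", "coffee"})]
print(apriori(db, 2))
```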

3 Variations of Apriori

The following variations adapt the apriori algorithm to cases where the data has additional known structure. We consider here three cases. In the first case, the data is partitioned into sub-databases, in the second case the itemsets of interest satisfy constraints, and in the third case the items themselves belong to a taxonomy.

3.1 Partitioned Apriori

So far we assumed that the whole database DB was able to fit into main memory. This is not the case for very large applications. For such an application one needs to partition the data, do some processing on all the subsets, and then combine the results [8]. This approach has wider applications, however. It also leads to a generic parallel algorithm. In other cases the data is distributed over multiple databases and it is (technically or because of privacy) not feasible to combine the data into one large database. In all these cases the database is modeled as a partitioned vector of transactions, i.e., there is a sequence of transactions DB1, . . . , DBp such that DB = (DB1, . . . , DBp). We denote the support of an itemset for each partition j by σ(A, DBj). It follows that the support over DB is just the sum of the supports over the subsets, i.e.,

  σ(A, DB) = Σ_{j=1}^{p} σ(A, DBj).


In order to compare the supports on different datasets (of potentially different sizes) one needs to use the relative frequency as support measure to define a frequent itemset, i.e., we call an itemset A frequent on DBj if σ(A, DBj) ≥ s0 nj/n, where nj is the size (number of transactions) of DBj and n the size of DB. A direct consequence of the additivity of σ(·) is the invariant partitioning property:

Lemma 3. Any itemset which is frequent in DB is frequent in at least one of the partitions DBj.

Proof. If an itemset A is infrequent on all the partitions DBj then σ(A, DBj) < s0 nj/n and by summation σ(A, DB) < (s0/n) Σ_{j=1}^{p} nj = s0, as n = Σ_{j=1}^{p} nj.

From this it follows that the union of all sets of frequent itemsets obtained on the partitions contains all the frequent itemsets on the whole data set. One can then use this union as a candidate itemset. The partitioned apriori algorithm then requires two scans over the data: in a first scan, the frequent itemsets are found for each partition, and in the second scan the supports of these "local" itemsets are determined on the full database:

  Find the sets Lk(DBj) of all the frequent itemsets for all DBj
  Choose as candidates Pk = ⋃_{j=1}^{p} Lk(DBj)
  Find the supports σ(A, DBj) for all A ∈ Pk and all j
  Remove infrequent itemsets

This algorithm works particularly well if the candidates are close to the actual frequent itemsets. In this case the extra cost of this algorithm is small. One can expect this to happen if, e.g., the partitions DBj have been selected at random and are all of the same size.
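A minimal Python sketch of this two-scan scheme is given below; the function partitioned_mining and its miner argument (any frequent-itemset miner, for instance the apriori sketch above) are illustrative choices made here rather than prescriptions from [8].

```python
def partitioned_mining(partitions, s0, miner):
    """Two scans: mine each partition locally with a rescaled threshold,
    then verify all local candidates against the full database."""
    n = sum(len(db) for db in partitions)
    candidates = set()
    for db in partitions:                            # scan 1: local frequent itemsets
        candidates |= set(miner(db, s0 * len(db) / n))
    support = {A: sum(1 for db in partitions for t in db if A <= t)
               for A in candidates}                  # scan 2: global supports
    return {A: c for A, c in support.items() if c >= s0}
```

For example, partitioned_mining([db1, db2], s0, apriori) combines local results from two partitions and keeps exactly the itemsets that are frequent on the full database.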

3.2 Apriori with Constraints

In a typical association rule mining job the user first decides on the data set and the support threshold, and the computer then returns a potentially very large number of frequent itemsets. It is frequently found that many of the returned frequent itemsets are of little interest, because they are already known or because they do not lead to any action. The problem is that the user has very little control over the process except for the selection of the data and the threshold s0. In some cases the user may actually know more about the types of itemsets which would be useful. For example, it might be known that only market baskets in a certain price range lead to profits, or one may know in medical data mining which combinations of treatments are expected to occur frequently because they occur together in the treatment schedule. Thus the user may wish to impose constraints to limit the number of uninteresting associations found [6]. Constraints are in these cases modeled by Boolean functions defined on the powerset 2^I. One is only interested in the frequent itemsets A for which this function is true. Note that this is also one way to introduce interestingness into


the algorithm. A simple and sound approach to finding frequent itemsets under constraints is to first find all the frequent itemsets using the apriori algorithm and then discard the ones which do not satisfy the constraints. This method, termed Apriori+, can provide the user with a substantially reduced number of frequent itemsets. However, the computational costs are the same as for the original apriori algorithm. One can now try to push the constraints into the apriori algorithm in order to reduce the computational complexity. A simple way to do this is to remove all the itemsets from Lk which do not satisfy the constraints and only use the pruned set of frequent itemsets to generate the candidates at the next level. This naive pushing can result in substantial savings; however, it can also produce wrong results. This is because the constraint may not be consistent with the apriori condition. For example, if one is only interested in market baskets with a given minimal value, then subsets of such "expensive" market baskets are not necessarily expensive as well. One case where a substantial saving can be obtained is the case of antimonotone constraints γ, which satisfy the condition: A ⊂ B and γ(B) ⇒ γ(A). In this case the constraint γ is consistent with the apriori condition, and every subset of a frequent itemset which satisfies γ will also satisfy the constraint. This is not the case for monotone constraints, where A ⊂ B and γ(A) ⇒ γ(B). In this case one needs to run the full apriori algorithm. However, the monotonicity allows one to prune the set of frequent itemsets efficiently, as one only has to evaluate the condition γ on the minimal frequent itemsets (minimal with respect to ⊂).

3.3 Mining Items with Taxonomies

The number of frequent itemsets found depends very much on the choice of the items. In particular, in retail one might consider as items particular brands or whole groups like milk, drinks or food. The more general the items chosen, the higher one can expect the support to be. Thus one might be interested in discovering frequent itemsets composed of items which themselves form a taxonomy [9]. We model this taxonomy by a partial order ≤ defined on the items aj. We assume that if we add to any itemset A which contains an item ai ∈ A an item aj which is more general than ai, we do not change the support of A, i.e., σ(A ∪ {aj}) = σ(A); if, however, aj is more special than ai (which we denote by aj ≥ ai), then the support will be reduced, i.e., σ(A ∪ {aj}) ≤ σ(A). This is all of no concern to the apriori algorithm; in fact, one can find frequent itemsets in this case by using the classical apriori method. However, if one does this, one might find that one gets sets containing both an item and its generalization (e.g., both "milk" and "beverage"). In addition, such an itemset would


be seen by the algorithm as different from the itemset containing only the more special item (in the case of the example this would be "milk"). In order to avoid this multiplication of itemsets, one normalizes the itemsets to contain only the most special of the items, since the more general item is implicitly in the itemset as well. This defines a constraint γ which holds if the itemset is normalized and is false otherwise. It follows directly that this constraint is antimonotone as defined in the previous section, and thus it can be pushed into the apriori algorithm and potentially leads to substantial savings.
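As a small illustration of this normalization step, here is a Python sketch that removes from an itemset every item which is a (proper) generalization of another item in the set; the is_more_general relation and the dictionary-based toy taxonomy are assumptions made for this example only.

```python
# Toy taxonomy: each item maps to its direct generalization (parent), if any.
parent = {"milk": "beverage", "juice": "beverage", "beverage": "food", "bread": "food"}

def is_more_general(a, b):
    """True if item a is a proper generalization of item b in the toy taxonomy."""
    x = parent.get(b)
    while x is not None:
        if x == a:
            return True
        x = parent.get(x)
    return False

def normalize(itemset):
    """Keep only the most special items; drop items generalizing another member."""
    return {a for a in itemset
            if not any(is_more_general(a, b) for b in itemset if b != a)}

print(normalize({"milk", "beverage", "bread"}))   # {'milk', 'bread'}
```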

4 Two Challenges of Size

There are two major computational problems when using the apriori approach to discover associations. First, one often finds that the algorithms produce a large number of associations, most of them uninteresting. While one can control the number of associations produced with the minimal support threshold, one finds that by the time the number of rules has been reduced to a manageable number, there are no interesting rules left at all. The interesting rules are often found in an area of intermediate support sizes, offset from the noise, which has low support, and from the trivial and well-known rules, which have very high support. Thus the support threshold alone is not fine enough to find interesting structures. In the last section we considered constraints in order to select more interesting rules, but their effect is limited; moreover, such constraints may not be available. Thus the usage of additional interestingness measures (which act as filters) has received considerable discussion in the data mining literature, see, e.g., [5]. There is, however, a great deal of redundancy in the set of frequent itemsets, and we will consider here the removal of this redundancy, effectively compressing the set of frequent itemsets. The tool used for this compression is the apriori principle. The second computational problem relates not to the number of frequent itemsets (the size of Lk) but to the size of the frequent itemsets themselves. If one uses the Apriori algorithm, for every frequent itemset found the supports of all the subsets have to be determined before that itemset even becomes a candidate. Thus for a k-itemset the supports of all the 2^k − 2 (nontrivial) subsets will have to be determined. Now if k = 20 this means that around one million subsets need to be considered. Note that the complexity of the determination of this support grows proportionally to the product of the number of subsets considered and the data size. Thus frequent itemsets of this size are not feasible. They have been found, however, in examples in biology [2]. The computational limitation is a property of the breadth-first search method used. In order to find long frequent itemsets one needs to abandon this approach. It turns out that both challenges can be addressed with the same mathematical tool. One starts with two sets, the set of items I = {a1, . . . , am} and the set of transactions T = {t1, . . . , tn}. (In practice one uses identifiers to point to items and transactions and thus considers the sets I = {1, . . . , m} and T = {1, . . . , n}, the latter often being called the "tid set".) Note that the set of transactions is not a set of subsets of I, as the same subsets may correspond to


different transactions. The database is then modeled as a triple (I, T, R) where R ⊂ I × T is a table or relation between items and transactions. Other names for the items are (boolean) features or attributes, and the transactions are also called objects. This terminology is used in formal concept analysis [3], and in this case the database is also called a formal context. We will only consider the case of binary attributes and one table ("flat file"). The patterns of interest are described by itemsets which are subsets of I. The relation R defines a mapping from the powerset of I onto the powerset of T by

  s(A) = {tk ∈ T | R(tk, ai) for all ai ∈ A}

where A ⊂ I. This is the set of transactions which "support" the itemset A, and the support is the size of this set: σ(A) = |s(A)|. There is a dual function which maps sets of transactions (which are called samples in statistics) onto itemsets, defined by

  g(B) = {ai ∈ I | R(tk, ai) for all tk ∈ B}

where B ⊂ T. The itemset g(B) is the set of all items which are common to all transactions in B. The two mappings g and s form a Galois connection, and thus the composition g ◦ s is a closure operator, see [3]. So for any itemset A we have the closed itemset A′ = g(s(A)), and it is known that the set of all closed itemsets forms a sublattice of the powerset. Furthermore, from the supports of all the closed itemsets one can find the supports of all the itemsets:

Lemma 4. For any itemset A ⊂ I one has σ(A) = σ(A′).

Proof. For any A ⊂ I one has gs(A) ⊃ A and, as s is antimonotone, sgs(A) ⊂ s(A). A transaction in s(A) contains all the items of A, and A′ is the set of all items which are contained in all the transactions in s(A). Thus, in particular, a transaction in s(A) also contains all the items of A′ and thus belongs to the support of A′, so that s(A) ⊂ s(A′), and it follows that s(A) = s(A′). From this one gets σ(A′) = |s(A′)| = |s(A)| = σ(A).

This property amounts to a compression of the set of all possible itemsets which is lossless in terms of frequency. In particular, if one can find all the closed itemsets which are frequent, one can derive all the other frequent itemsets without having to inspect the data again. This assumes that the closure of any set A can be found without further scanning of the data. This is possible as one has

  A′ = ⋂_{A ⊂ Ai} Ai

where {Ai} is the set of all closed itemsets. This procedure has the potential to substantially reduce the number of frequent itemsets produced and also the amount of computation, especially in the case of very long frequent itemsets which are found early in the search process [10]. The determination of closed itemsets is also of interest for the application, as from the point of view of the data the itemsets A and A′ (and all the sets in between) are indistinguishable.
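Returning to the mappings s and g, a minimal Python sketch of the Galois connection and the closure g ◦ s follows; representing the relation R simply as a list of transactions (sets of items) is an implementation choice made for this illustration.

```python
def s(A, transactions):
    """Transactions supporting the itemset A (indices into the database)."""
    return {i for i, t in enumerate(transactions) if A <= t}

def g(B, transactions, items):
    """Items common to all transactions in the tid set B."""
    return {a for a in items if all(a in transactions[i] for i in B)}

def closure(A, transactions, items):
    """The closed itemset A' = g(s(A))."""
    return g(s(A, transactions), transactions, items)

db = [{"milk", "bread"}, {"milk", "bread", "juice"}, {"milk", "coffee"}]
items = {"milk", "bread", "juice", "coffee"}
print(closure({"bread"}, db, items))                          # {'milk', 'bread'}
print(len(s({"bread"}, db)), len(s({"milk", "bread"}, db)))   # equal supports, as in Lemma 4
```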


This indistinguishability, for example, may point to items which are jointly scheduled in the case of medical services. This opens a whole new area of closed itemset mining [10]. A further reduction is possible by considering only maximal frequent itemsets. A frequent itemset A is maximal if no superset of A is frequent. If all the maximal frequent itemsets are known, then any other frequent itemset will be a subset of one of these maximal frequent itemsets. Now if A is a frequent itemset then A′ is frequent as well, and if A is maximal it follows from A′ ⊃ A that A′ = A, and so the maximal frequent itemsets are closed. Thus the maximal frequent itemsets are the maximal closed itemsets which are frequent. Unlike in the previous case, one cannot derive the supports σ(A) of all the frequent itemsets from the supports of the maximal frequent itemsets. Thus this further compression of the sets considered has been called "lossy". Particular algorithms related to closed itemsets can be found in the bibliography of [10]; see, in particular, [7, 4, 2].

References

1. Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994.
2. R. J. Bayardo. Efficiently mining long patterns from databases. In ACM SIGMOD Int'l Conf. on Management of Data, pages 85–93. ACM, 1998.
3. Bernhard Ganter and Rudolf Wille. Formal Concept Analysis. Springer-Verlag, Berlin, 1999. Mathematical foundations; translated from the 1996 German original by Cornelia Franzke.
4. Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Weidong Chen, Jeffrey Naughton, and Philip A. Bernstein, editors, 2000 ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, May 2000.
5. Robert J. Hilderman and Howard J. Hamilton. Evaluation of interestingness measures for ranking discovered knowledge. Lecture Notes in Computer Science, 2035:247–??, 2001.
6. Raymond T. Ng, Laks V. S. Lakshmanan, Jiawei Han, and Alex Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. ACM SIGMOD Int. Conf. Management of Data, pages 13–24, 1998.
7. Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science, 1540:398–416, 1999.
8. Ashoka Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of the 21st Int. Conf. on VLDB, pages 432–443, 1995.
9. Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized association rules. In Proc. of the 21st Int. Conf. on VLDB, pages 407–419, 1995.
10. M. J. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In R. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani, editors, Proceedings of the Second SIAM International Conference on Data Mining. SIAM, 2002. http://www.siam.org/meetings/sdm02/proceedings/.

Online Learning of Linear Classifiers

Jyrki Kivinen

Research School of Information Sciences and Engineering
Australian National University
Canberra, ACT 0200, Australia
[email protected]

Abstract. This paper surveys some basic techniques and recent results related to online learning. Our focus is on linear classification. The most familiar algorithm for this task is the perceptron. We explain the perceptron algorithm and its convergence proof as an instance of a generic method based on Bregman divergences. This leads to a more general algorithm known as the p-norm perceptron. We give the proof of a generalization of the perceptron convergence theorem that covers the p-norm perceptron and the non-separable case. We also show how regularization, again based on Bregman divergences, can make an online algorithm more robust against target movement.

1 Introduction

We consider perhaps the most basic of learning problems, supervised classification learning. The basic components of such a problem are an instance space X and a discrete label space Y. Mappings from X to Y are called classifiers. In the usual batch learning setting, the input to the learning algorithm consists of a sequence of examples (xt, yt) ∈ X × Y, t = 1, . . . , T. Intuitively, we may often think that there is some unknown target classifier f : X → Y, and the label yt gives the "true" label f(xt), or perhaps an inaccurate or corrupted version of it. Based on the examples, the learning algorithm is supposed to produce its hypothesis h : X → Y. Again, intuitively we wish that h would be a good approximation of the target f. Since the algorithm of course does not know f, its concrete goal must be formulated in other terms. Typically the algorithm would try to approximately fit the examples, i.e., obtain h(xt) = yt for a large proportion of the examples (xt, yt), while also keeping the hypothesis h relatively simple according to some complexity measure. Structural risk minimization [26] is one particular example of such an approach. In the following we shall be mainly concerned with binary classification using linear classifiers. Thus, we assume that X = Rn and Y = { −1, +1 }. With any weight vector w ∈ Rn we associate the linear classifier hw given by hw(x) = sign(w · x), where sign(p) = −1 if p < 0 and sign(p) = +1 otherwise. We define the margin of w at example (xt, yt) as yt w · xt. Thus if the margin is positive we have a correct prediction hw(xt) = yt, and if the margin is negative we have a mistake with hw(xt) ≠ yt. A natural first goal for linear classifier learning would now be to find a weight vector that achieves a positive margin on all the



examples. In many applications, in particular when kernels are used [11], the dimensionality n of the instance space can be much larger than the number T of examples, in which case such weight vectors are usually easy to find but not very useful unless some further constraints are imposed. Intuitively, if n is large there are enough degrees of freedom for the weight vector to "memorize" the individual examples without "learning" any underlying structure. An appropriate way of controlling the complexity of the hypothesis is to minimize the norm ||w||₂ under the constraint that the margins are above some positive threshold, say yt w · xt ≥ 1 for all t. This is one of the ideas that led to support vector machines [2]. However, the model we are interested in here is not batch learning, but online learning. In this setting, the learning proceeds in a sequence of trials. At trial t, for t = 1, . . . , T, the learning algorithm is first presented with an instance xt and must produce its prediction ŷt = ht(xt) using its current hypothesis ht. Since we are concerned with linear classification, we assume that ht is represented by the current weight vector wt and the prediction is ŷt = sign(wt · xt). The algorithm then receives the desired label yt. Based on wt, xt and yt (and possibly other relevant data from previous trials) the algorithm then updates its weight vector into wt+1 before the next trial. If ŷt ≠ yt, we say that the algorithm made a mistake. In general, the goal is to make as few mistakes as possible over the whole sequence of trials. Online learning is particularly appropriate if we wish to consider settings where the learning task may change over time. Even with a purely batch learning problem it may be desirable to use an online algorithm to reduce computation time or memory requirements. Section 2 presents the well-known perceptron algorithm [23] for online learning of linear classifiers. The main goal of this paper is to show how certain simple techniques can be used to generalize the algorithm and its convergence proof. Our main tool is Bregman divergences [3], which are introduced in Section 3. Section 4 shows a particular generalization of the perceptron algorithm, called the p-norm perceptron [9], and gives a mistake bound for it. The robustness of a regularized version of the p-norm perceptron against target movement [19] is explained in Section 5. Since the goal of this paper is to explain the basic analysis techniques, the bounds given here are often not the best ones known. To get the most recent results, the reader must consult the latest papers, which are only cited here.

2 The Perceptron Algorithm

We consider online learning of linear classifiers, where the learning algorithm represents its t-th hypothesis by a weight vector wt ∈ Rn. Thus, at trial t, the algorithm receives an instance xt ∈ Rn, makes its prediction ŷt = sign(wt · xt) and receives the desired label yt ∈ { −1, +1 }. What distinguishes different algorithms is how they update wt into wt+1 based on the example (xt, yt) received at trial t. The well-known perceptron algorithm (PA) [23] is presented in Figure 1. The hypothesis is updated when there was a mistake, and also in the special case wt · xt = 0. (The special case is mainly for notational convenience and does not significantly affect the behavior of the algorithm.) The update consists of adding a scalar multiple of the instance to the hypothesis. For simplicity, we


assume that the hypothesis is initialized as the zero vector w1 = 0. This is usually the most natural choice, lacking any other preference. Notice that in this case the choice of the learning rate η affects only the magnitude of the weight vector, not the actual predictions ŷt = sign(wt · xt). We have included η for compatibility with material in later sections.

Parameters: learning rate η > 0
Initialization: w1 = 0 ∈ Rn
Repeat for t = 1, . . . , T:
1. Receive the instance xt ∈ Rn.
2. Give the prediction ŷt = sign(wt · xt).
3. Receive the desired label yt ∈ { −1, +1 }.
4. Update the hypothesis according to

  wt+1 = wt + ησt yt xt    (1)

where σt = 0 if yt wt · xt > 0 and σt = 1 otherwise.

Fig. 1. The perceptron algorithm PA
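For concreteness, the following is a minimal NumPy sketch of the algorithm in Figure 1; the function name and the array-based interface are choices made for this illustration only.

```python
import numpy as np

def perceptron(X, y, eta=1.0):
    """One pass of the perceptron (Figure 1).
    X: T x n array of instances, y: length-T array of labels in {-1, +1}."""
    T, n = X.shape
    w = np.zeros(n)                            # w1 = 0
    mistakes = 0
    for t in range(T):
        y_hat = -1 if w @ X[t] < 0 else 1      # prediction sign(w . x_t)
        if y_hat != y[t]:
            mistakes += 1
        if y[t] * (w @ X[t]) <= 0:             # sigma_t = 1: update
            w = w + eta * y[t] * X[t]
        # otherwise sigma_t = 0: no update
    return w, mistakes
```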

As a sanity check, consider how the margin yt wt · xt changes as a result of the update. Since yt² = 1, we have yt wt+1 · xt − yt wt · xt = σt η ||xt||₂². Hence, the margin increases if there was a mistake, so if the same example were to be encountered again the algorithm would at least be closer to getting the classification correct this time. A more rigorous justification for the algorithm is given by the Perceptron Convergence Theorem [23]. Denote the example sequence by S = ((x1, y1), . . . , (xT, yT)). For any u ∈ Rn and µ ≥ 0, we say that u has margin µ on S if yt u · xt > µ holds for t = 1, . . . , T. If some u has a positive margin on S, we say that S is linearly separable. Usually S is clear from the context and we omit mentioning it. Clearly if u has margin µ, then αu has margin αµ for any α > 0. Therefore, we would really expect the performance of the algorithm to depend on the normalized margin µ/||u||. Given an algorithm A, we denote by M(A) the number of mistakes it makes on the example sequence S (which again is assumed clear from the context), i.e., M(A) = |{ t | ht(xt) ≠ yt }|. In the particular case of the perceptron algorithm we have M(PA) ≤ Σ_t σt (with equality if we ignore the special case wt · xt = 0).

Theorem 1 (Perceptron Convergence Theorem, Novikoff 1962). Let S = ((x1, y1), . . . , (xT, yT)), and assume ||xt||₂ ≤ X for all t. If some u ∈ Rn with ||u||₂ ≤ U has margin µ on S, then the perceptron algorithm makes at most

  M(PA) ≤ Σ_{t=1}^{T} σt ≤ U²X²/µ²

mistakes on the example sequence S.


In particular, consider a sequence S^{(m)} which consists of m repetitions of S. If S is linearly separable, then so is S^{(m)}, and Theorem 1 shows that the number of mistakes, and therefore updates, is bounded by a quantity that does not depend on m. This means that if we repeatedly cycle the perceptron algorithm through a linearly separable example sequence, its hypothesis eventually converges to one that has a positive margin. We give the proof of Theorem 1 in a somewhat unusual form to make it more readily generalizable in what follows.

Proof of Theorem 1. We consider the progress the algorithm makes towards the target vector u, measured in terms of the squared distance from u to the current weight vector wt. The progress at trial t is

  (1/2)||u − wt||₂² − (1/2)||u − wt+1||₂² = ησt yt xt · (u − wt) − (1/2)σt η²||xt||₂²    (2)

directly from the definition of wt+1. (Notice that yt² = 1 and σt² = σt.) This holds for any u ∈ Rn. We now use the assumption that yt u · xt ≥ µ. If there was no mistake at trial t, then σt = 0 and both sides of (2) are zero. Otherwise yt wt · xt ≤ 0, and the right-hand side is lower bounded by ηµ − η²X²/2. Combining the two cases, we write

  (1/2)||u − wt||₂² − (1/2)||u − wt+1||₂² ≥ σt ( ηµ − (1/2)η²X² ).    (3)

In particular, if we choose η = µ/X², which maximizes the right-hand side of (3), we get

  (1/2)||u − wt||₂² − (1/2)||u − wt+1||₂² ≥ (1/2)(µ²/X²) σt.

Summing this over t = 1, . . . , T, we get

  (1/2)||u − w1||₂² − (1/2)||u − wT+1||₂² = Σ_{t=1}^{T} ( (1/2)||u − wt||₂² − (1/2)||u − wt+1||₂² ) ≥ (1/2)(µ²/X²) Σ_{t=1}^{T} σt.

We get the claimed bound by noticing that ||u − w1||₂² ≤ U² and ||u − wT+1||₂² ≥ 0 and solving for Σ_{t=1}^{T} σt. This was based on assuming the particular choice η = µ/X², but since the predictions of the algorithm remain the same for any positive learning rate, the result holds for all η > 0. □

The role of the learning rate η in the proof of Theorem 1 may be a little confusing. The fact that we ended up with a fixed choice η = µ/X² was due to the fact that we were analyzing convergence towards a target u with a fixed norm. We could instead fix the learning rate in advance to η = αµ/X² for any


α > 0, but then we would need to analyze convergence towards αu. The resulting mistake bound would be the same, since αu has margin αµ. It is common to include in the perceptron algorithm (as well as other linear classification algorithms) a bias term bt, with the prediction now being ŷt = sign(wt · xt + bt). To enhance the perceptron algorithm with the bias term we initialize b1 = 0 and update bt+1 = bt + ησt yt α for some constant α > 0. Notice that running this perceptron algorithm with bias on the example sequence S gives exactly the same predictions as running the basic perceptron algorithm on the example sequence S′ = ((x′t, yt))_{t=1}^{T} where x′t = (xt,1, . . . , xt,n, α) ∈ Rn+1. To analyze the algorithm, we say for any u ∈ Rn, b ∈ R and µ ≥ 0 that the pair (u, b) has margin µ on S if yt(u · xt + b) > µ holds for t = 1, . . . , T. Suppose the pair (u, b) has margin µ on S, with ||u||₂ ≤ U. Then the vector u′ = (u1, . . . , un, b/α) has margin µ on S′. Assuming ||xt||₂ ≤ X for all t, the interesting values of b clearly satisfy |b| ≤ maxt |u · xt| ≤ UX. Therefore, we can assume that ||u′||₂² = ||u||₂² + (b/α)² ≤ U² + U²X²/α². Since ||x′t||₂² = ||xt||₂² + α², applying the Perceptron Convergence Theorem to the sequence S′ now gives the mistake bound

  (U² + U²X²/α²)(X² + α²) / µ².

The bound is minimized for α = X, giving an extra factor 4 compared to the bound without the bias term. We generally ignore the bias term from now on, since the reduction technique can be used with all the algorithms we encounter here. Sometimes, however, the result is not quite satisfactory. For example, suppose we wish to find a pair (u, b) which minimizes ||u||₂ subject to the constraint that (u, b) has margin 1. Eliminating b by our reduction treats it like a component of u, punishing large values of |b|. Since the actual problem does not include a cost for large |b|, this leads to solutions with |b| too small. Better handling of the bias term in online margin maximization remains an open problem [6]. The additive form of the perceptron update has the important consequence that the weight vector is always in the span of the preceding examples: we can write

  wτ = Σ_{t=1}^{τ−1} ατ,t xt    (4)

for some coefficients ατ,t. This allows in particular generalizing the algorithm for non-linear classification tasks using the kernel method (see, e.g., [11] for a good exposition). However, there are also some drawbacks. Weight vectors of the form (4) do not perform particularly well on tasks in which most of the input dimensions are irrelevant for the classification [18]. This is in contrast to Littlestone's Winnow [20], which has a multiplicative update wt+1,i = wt,i exp(ησt yt xt,i). Another family of non-additive algorithms is the family of p-norm perceptron algorithms introduced by Grove et al. [9]. In some sense they interpolate between Winnow


and the perceptron algorithm. The p-norm perceptron and its variants will be considered in Sections 4 and 5 after introducing in Section 3 the necessary mathematical tools.

3 Bregman Divergences and Online Learning

Our proof of the Perceptron Convergence Theorem was based on analyzing the squared distance ||u − wt||₂². To extend the result, we introduce Bregman divergences which generalize the squared Euclidean distance. These divergences are an appropriate way of deriving gradient-based learning algorithms and then analyzing their behavior. Bregman [3] originally introduced them in the context of convex optimization. They were introduced into the analysis of online learning algorithms by Grove et al. [9], Jagota and Warmuth [27] and Kivinen and Warmuth [17]. Let F : Rn → R be a strictly convex twice differentiable function and f = ∇F its gradient. Notice that f is one-to-one. We define the Bregman divergence [3] dF(u, w) as the error of the first order Taylor approximation of F(u) around w (Figure 2). More formally,

  dF(u, w) = F(u) − F(w) − f(w) · (u − w) .

Then dF(u, w) ≥ 0 for all u, w with equality if and only if u = w. The Bregman divergence dF(u, w) is (strictly) convex in u but usually not in w.


Fig. 2. The Bregman divergence dF defined by a convex function F

The convex dual G of F is defined by the equation F(w) + G(θ) = w · θ where θ = f(w). It can be shown that G is strictly convex, and its gradient g = ∇G is the inverse g = f⁻¹ (see e.g. [25]). For our purposes it is interesting to note that the Bregman divergences defined by F and G are connected by dF(w, w′) = dG(θ′, θ) where θ = f(w) and θ′ = f(w′). Notice the changed order of the arguments. The following example defines the Bregman divergences related to p-norms, which are the ones we shall mainly need.


Example 1. Given 1 < q ≤ 2, define

  F(w) = (1/2)||w||_q²

where ||w||_q = (Σ_i |wi|^q)^{1/q}. This was first suggested by Grove et al. [9] as a suitable potential function for analyzing generalizations of the perceptron algorithm. We denote the related Bregman divergence by dq; thus

  dq(u, w) = (1/2)||u||_q² − (1/2)||w||_q² − f(w) · (u − w)

where the gradient f = ∇F is given by

  fi(w) = sign(wi)|wi|^{q−1} / ||w||_q^{q−2} .    (5)

It can be shown that the convex dual of F is given by

  G(θ) = (1/2)||θ||_p²

where 1/p + 1/q = 1. The gradient g = f⁻¹ of the dual is then of course given by

  fi⁻¹(θ) = sign(θi)|θi|^{p−1} / ||θ||_p^{p−2} .    (6)
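A small NumPy sketch of the link function in (5) and its inverse in (6) is given below; the function names f and f_inv and the numerical check on a random vector are illustrative additions, not part of the original analysis.

```python
import numpy as np

def f(w, q):
    """Link (5): f_i(w) = sign(w_i)|w_i|^(q-1) / ||w||_q^(q-2); maps 0 to 0."""
    if not np.any(w):
        return np.zeros_like(w)
    norm_q = np.sum(np.abs(w) ** q) ** (1.0 / q)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm_q ** (q - 2)

def f_inv(theta, p):
    """Inverse link (6): f_i^{-1}(theta) = sign(theta_i)|theta_i|^(p-1) / ||theta||_p^(p-2)."""
    if not np.any(theta):
        return np.zeros_like(theta)
    norm_p = np.sum(np.abs(theta) ** p) ** (1.0 / p)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm_p ** (p - 2)

p = 4.0
q = p / (p - 1.0)                          # 1/p + 1/q = 1
w = np.random.randn(5)
print(np.allclose(f_inv(f(w, q), p), w))   # True: f and f^{-1} are inverses
```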

Another interesting example is the (unnormalized) relative entropy. This is related to the exponentiated gradient algorithm [16] and similar algorithms such as Winnow [20, 21].

Example 2. Define

  F(w) = Σ_{i=1}^{n} (wi ln wi − wi) .

The gradient is given by fi(w) = ln wi. The inverse of the gradient is then obviously given by gi(θ) = exp(θi), and the dual of F is

  G(θ) = Σ_{i=1}^{n} exp(θi) .

By writing out the definition we see that the related Bregman divergence is

  dF(u, w) = Σ_{i=1}^{n} ( ui ln(ui/wi) − ui + wi )

which in the case Σ_i ui = Σ_i wi = 1 becomes the usual relative entropy (i.e., Kullback-Leibler divergence).


Bregman divergences are not in general symmetrical. Neither do they satisfy the triangle inequality. This can be seen from the identity

  dF(u, w′) = dF(u, w) + dF(w, w′) + (f(w′) − f(w)) · (w − u)    (7)

which follows directly from the definition. Since the dot product on the right-hand side can be either positive or negative, the triangle inequality does not hold in general. Given a set A ⊆ Rn, we call

  w2 = arg min_{w∈A} dF(w, w1)

the projection of w1 onto A with respect to dF , assuming the arg min is uniquely defined. For any R ≥ 0 and w1 ∈ Rn , the set { w | dF (w, w1 ) ≤ R } is closed, convex and bounded. Hence, the projection is well defined at least for all closed and convex sets A. Lemma 1. Let w1 ∈ Rn and u ∈ A, where A ⊂ Rn is closed and convex. Let w2 be the projection of w1 onto A with respect to dF . Then dF (u, w1 ) ≥ dF (u, w2 ) + dF (w2 , w1 ) .

(8)

In the special case that A is a hyperplane, (8) holds as an equality. See Herbster and Warmuth [12] for a proof of Lemma 1 and its earliest application in a learning setting. The result has been used earlier in various other contexts [3, 14, 4]. Notice that the special case with dF = d2 and A a hyperplane gives the familiar Pythagorean Theorem. We now consider a particular method of deriving online learning algorithms using Bregman divergences. This type of derivation was applied by Kivinen and Warmuth [16] in certain special cases of linear regression and then extended to generalized linear regression [10, 17]. Suppose we have for t = 1, . . . , T a loss function Lt where for any w ∈ Rn the loss Lt (w) tells us how well the weight vector w would have performed at trial t. A typical loss function for linear regression, where yt ∈ R and the goal is to have wt · xt = yt , would be the square loss Lt (w) = (yt − w · xt )2 . Assume for now that Lt is convex and differentiable, and assume a Bregman divergence dF has been fixed. We define a cost function Ct by Ct (w) = dF (w, wt ) + ηLt (w) .

(9)

The basic idea is to choose wt+1 so as to minimize Ct (wt+1 ). This attempts on one hand to minimize the loss Lt (w) on the current example, on the other hand to remain close to the old weight vector wt that contains all the information we have gained from previous trials. Let w∗ = arg minw Ct (w). By our assumptions about convexity and differentiability, we have ∇Ct (w∗ ) = 0. Substituting the definition of dF into (9) and setting the gradient to zero gives us f (w∗ ) = f (wt ) − η∇Lt (w∗ ) .


This is in general difficult to solve in closed form because of the way w∗ appears in the gradient on the right-hand side. It could of course be solved numerically. However, for our purposes it is sufficient to approximate ∇Lt (w∗ ) ≈ ∇Lt (wt ), which leads to the update rule f (wt+1 ) = f (wt ) − η∇Lt (wt ) .

(10)

This approximation can be justified by observing that for reasonable learning rates w∗ should be relatively close to wt. Another way of arriving at (10) is to replace Lt in (9) by its first-order Taylor approximation around wt. In general, we do not worry too much about the technicalities of this derivation. It is intended mainly as a motivation, and the real theoretical validation of the algorithms is provided by the loss bounds we prove. To get some simple examples, let us consider the linear regression case with Lt(w) = (yt − w · xt)². Then ∇Lt(w) = 2(w · xt − yt)xt. In the special case dF = d2, with f the identity function, (10) becomes just the usual online gradient descent rule wt+1 = wt − 2η(wt · xt − yt)xt. As another example, consider the unnormalized relative entropy from Example 2. Then fi(w) = ln wi and fi⁻¹(θ) = e^{θi}, so the update rule becomes wt+1,i = wt,i exp(−2η(wt · xt − yt)xt,i) for i = 1, . . . , n. For an analysis of these two particular cases, see [16]. Consider now the classification case. There our real performance measure is typically the number of mistakes, which cannot be represented by a convex and differentiable loss function. We resort to approximating it by an upper bound. Thus, given ρ ≥ 0 define the hinge loss with margin ρ as

  Lρ(w, x, y) = max { 0, ρ − yw · x }

(11)

(see Figure 3). The hinge loss was first introduced to online learning by Gentile and Warmuth [8] in the case ρ = 0. It is closely related to soft margin support vector machines; see e.g. [11] for an overview. Notice that Lρ(w, x, y)/ρ ≥ 1 if yw · x ≤ 0, and 0 ≤ Lρ(w, x, y)/ρ ≤ 1 otherwise. Thus we can use Lρ(wt, xt, yt)/ρ as an upper bound for the indicator function of whether the algorithm made a mistake at trial t. Now Lρ(w, x, y) is convex in w. Apart from the case yw · x = ρ it is also differentiable, with

  ∇w Lρ(w, x, y) = 0 if yw · x > ρ, and ∇w Lρ(w, x, y) = −yx if yw · x < ρ .

For convenience, we define ∇w Lρ(w, x, y) = −yx also when yw · x = ρ. By writing σt = 1 if yt wt · xt ≤ ρ and σt = 0 otherwise, and substituting into (10), we get

  f(wt+1) = f(wt) + ησt yt xt .    (12)


The fact that in going from (10) into (12) we handled the case yw · x = ρ somewhat arbitrarily does not matter in our analysis of the algorithm, which is based directly on (12). The gradient interpretation is used here only for motivation purposes. Notice that when f is the identity function and ρ = 0, the update (12) is exactly the perceptron update. In Section 4 we analyze this update in the general form derived from the p-norm divergence for 2 ≤ p < ∞ and with margin ρ ≥ 0. The special case ρ = 0 is what Grove et al. [9] considered when they originally proposed the p-norm family of algorithms. The p-norm updates have also been analyzed in the regression setting by Gentile and Littlestone [7].


Fig. 3. The hinge loss Lρ (w, x, y)

In general, we can write (10) as θ t+1 = θ t − η∇Lt (wt ) where θ t = f (wt ) (and hence wt = g(θ t )). Thus it looks like a reparameterized gradient descent, but it must be noticed that θ t is updated using a gradient taken with respect to wt . The result is quite unlike writing θ = f (w) and then doing gradient descent with respect to θ.

4 Marginalized p-Norm Perceptron

Let us now apply the general framework of Section 3 to the Bregman divergence dq and loss function Lρ. The resulting algorithm, which we call the marginalized p-norm perceptron algorithm and denote by MPAp, is given in Figure 4. The special case ρ = 0 would be the original p-norm perceptron of Grove et al. [9]. If we choose ρ = 0 and p = 2, we get the basic perceptron algorithm. Instead of presenting the direct generalization of the Perceptron Convergence Theorem to the p-norm case, we consider the more general case where the example sequence need not be linearly separable. Ideally, we would like to compare the number of mistakes made by the algorithm to the number


Parameters: learning rate η > 0; margin ρ ≥ 0; 2 ≤ p < ∞ (and 1 < q ≤ 2 such that 1/p + 1/q = 1)
Initialization: w1 = 0 ∈ Rn
Repeat for t = 1, . . . , T:
1. Receive the instance xt ∈ Rn.
2. Give the prediction ŷt = sign(wt · xt).
3. Receive the desired label yt ∈ { −1, +1 }.
4. Update the hypothesis according to

  wt+1 = f⁻¹(f(wt) + ησt yt xt)    (13)

where σt = 0 if yt wt · xt > ρ and σt = 1 otherwise, and f and f⁻¹ are as in (5) and (6).

Fig. 4. The marginalized p-norm perceptron algorithm MPAp
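For concreteness, here is a minimal NumPy sketch of one pass of the algorithm in Figure 4; it repeats the link functions from the sketch after (6) so that it runs on its own, and the remaining names are illustrative. For p = 2 and ρ = 0 it reduces to the perceptron sketch given earlier.

```python
import numpy as np

def f(w, q):
    """Link (5); maps the zero vector to itself."""
    if not np.any(w):
        return np.zeros_like(w)
    nq = np.sum(np.abs(w) ** q) ** (1.0 / q)
    return np.sign(w) * np.abs(w) ** (q - 1) / nq ** (q - 2)

def f_inv(theta, p):
    """Inverse link (6); maps the zero vector to itself."""
    if not np.any(theta):
        return np.zeros_like(theta)
    npn = np.sum(np.abs(theta) ** p) ** (1.0 / p)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / npn ** (p - 2)

def mpa_p(X, y, p=2.0, rho=0.0, eta=1.0):
    """One pass of MPA_p (Figure 4): update through f and f^{-1} on margin errors."""
    q = p / (p - 1.0)
    w = np.zeros(X.shape[1])                   # w1 = 0
    for x_t, y_t in zip(X, y):
        if y_t * (w @ x_t) <= rho:             # sigma_t = 1: margin error, update
            w = f_inv(f(w, q) + eta * y_t * x_t, p)
    return w
```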

  min_u | { t | yt u · xt ≤ 0 } |    (14)

of mistakes made by the best fixed classifier u. However, achieving tight bounds in this setting seems very hard. It is known that approximating the minimization problem (14) is computationally intractable [13]. Hence, it seems unlikely that we could obtain strong performance guarantees for simple online algorithms in minimizing the number of mistakes. To simplify matters, we use a continuous and convex approximation for the number of mistakes. Thus, for µ ≥ 0 define

  Lossµ(u) = Σ_{t=1}^{T} Lµ(u, xt, yt)

where Lµ(u, x, y) = max { 0, µ − yu · x } as before. Then Lossµ(u)/µ is an upper bound for the number of mistakes made by u; this upper bound can of course be very loose. In particular, u has margin µ on S if and only if Lossµ(u) = 0. We now hope to achieve bounds of the type

  M(MPAp) ≤ a1 Lossµ(u)/µ + a2    (15)

where a1 is some small numerical constant and a2 depends on µ and the norms of u and xt . We usually wish to have (15) hold for all vectors u ∈ U in some comparison class U ⊂ Rn . Typically the comparison class is defined as U = { u | ||u|| ≤ U } for some norm || · ||. We use the term comparison vector to refer to vectors u ∈ U, against which we thus compare our algorithm. Bounds of form (15), with a1 arbitrarily close to one, can indeed be achieved. However, achieving the best bounds requires careful tuning of the learning rate and margin parameters. The parameter values given in the following results depend on various properties of the example sequence which in general would not be known at the time the algorithm is started and the parameters need to be fixed. In particular, they depend on an upper bound K ≥ Lossµ (u) for


the loss of some vector u. What we then actually get are bounds of the form M(MPAp) ≤ a1 K/µ + a2. Thus, if the chosen upper bound K is very loose, then our mistake bound will be correspondingly loose. In practice, of course, instead of trying to estimate the parameters Lossµ(u) etc. and using the tuning given in our theorems, one should directly tune η and ρ by some empirical methods. Grove et al. [9] in their original work on the p-norm algorithm proved convergence bounds for the linearly separable case. The results in the rest of this section are essentially simplifications of the work of Gentile and Littlestone [7], who first provided bounds for the p-norm perceptron in the non-separable case, and also for regression. For other mistake bounds for online classification in the non-separable case, see [5, 8]. We start with a bound for the number

  ME(MPAp) = Σ_{t=1}^{T} σt

of margin errors (i.e., updates) made by the algorithm. The parameter tuning is given here in terms of the function h defined for x > 0, R > 0 and S > 0 by

  h(x, R, S) = √( (S/R)(x + S/R) ) − S/R .    (16)

We extend this continuously to R = 0 by setting h(x, 0, S) = x/2. Notice that we always have 0 ≤ h(x, R, S) ≤ x.

Theorem 2. Fix 2 ≤ p < ∞ and q such that 1/p + 1/q = 1. Let X > 0 and assume that ||xt||_p ≤ X for all t. Fix K ≥ 0 and U > 0 and write

  C = ((p − 1)/4) U²X² .    (17)

If there is a vector u ∈ Rn with Lossµ(u) ≤ K such that ||u||_q ≤ U, then the marginalized p-norm perceptron with any margin 0 ≤ ρ < µ and learning rate η = 2h(µ − ρ, K, C)/((p − 1)X²) satisfies the bound

  ME(MPAp) ≤ K/(µ − ρ) + 2C/(µ − ρ)² + 2 ( K/(µ − ρ) + C/(µ − ρ)² )^{1/2} ( C/(µ − ρ)² )^{1/2}

for its number of updates. Theorem 2 provides a bound for the number of updates, which of course is an upper bound for the number of mistakes. The interpretation of the result is perhaps easiest after properly converting it into a mistake bound. We thus postpone the discussion until after Theorem 3 where the mistake bounds are given. Before that, we present the proof of Theorem 2. A reader not interested in the technicalities of the proof should still read the summary of the method below before jumping to Theorem 3. The proof has the same basic structure as many other proofs for online algorithms. To explain it on a general level, assume we have defined some notions of

Online Learning of Linear Classifiers

247

loss Lt (A) for algorithm A and Lt (u) for comparison vector u at trial t. Thus here we would have Lt (A) = Lρ (wt , xt , yt ) and Lt (u) = Lµ (u, xt , yt ). We also use some Bregman divergence dF to define Pt = dF (u, wt ) as a potential of the current state of the algorithm, where u is some arbitrary but fixed comparison vector. The change in potential ∆t = Pt − Pt+1 is called the progress at trial t. The key part of the proof is proving a progress bound of the type ∆t ≥ aLt (A) − bLt (u) − ct

(18)

where a and b are positive constants which typically depend on the tuning parameters of the algorithm, and ct > 0 may depend on what actually happened at trial t. This means that at a trial on which the algorithm A performed much worse than the comparison vector u, there will be a large amount of progress T towards u. Since t=1 ∆t = d(u, w1 ) − d(u, wT +1 ) ≤ d(u, w1 ), we can sum (18) over t = 1, . . . , T and solve for Lt (A) to obtain T  t=1

Lt (A) ≤

T T  ct b 1 . Lt (u) + d(u, w1 ) + a t=1 a a t=1

Thus we have a bound for the loss of the algorithm in terms of the loss of an arbitrary comparison vector. The term d(u, w1 ) can be bounded, e.g., by assuming a norm bound on u. What remains to be done is to choose the tuning parameters, which affect a, b and ct , so that the bound becomes as good as possible. In the classification case there is also the question of converting the hinge loss bound into a margin error or mistake bound. We now go to the details of the proof of Theorem 2. The one specific part that is needed in order to use the particular Bregman divergence dq in the general framework sketched above is the following bound, which is essentially proven in [9]; see also Lemma 2 in [7]. Lemma 2. Fix 2 ≤ p < ∞ and q such that 1/p + 1/q = 1. Assume f (w ) = f (w) + x where f is as in (5). Then dq (w, w ) ≤

p−1 ||x||2p . 2
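The link functions f and f^{-1} and the divergence d_q are easy to experiment with numerically. The following sketch is illustrative only: it assumes the standard p-norm link f = ∇(½||·||_q²) and its inverse for (5) and (6) (as in Grove et al. [9]), uses NumPy, and simply spot-checks the inequality of Lemma 2 on random data; the function names are ours.

```python
# Spot-check of Lemma 2 (illustrative sketch; assumes the standard p-norm links).
import numpy as np

def f_link(w, q):
    """f(w) = gradient of ||w||_q^2 / 2 (maps weights to the dual space)."""
    norm = np.linalg.norm(w, ord=q)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm ** (q - 2)

def f_inv(theta, p):
    """Inverse link: gradient of ||theta||_p^2 / 2."""
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def d_q(u, w, q):
    """Bregman divergence induced by ||.||_q^2 / 2."""
    return (np.linalg.norm(u, ord=q) ** 2 - np.linalg.norm(w, ord=q) ** 2) / 2.0 \
        - np.dot(f_link(w, q), u - w)

rng = np.random.default_rng(0)
p = 4.0
q = p / (p - 1.0)                       # so that 1/p + 1/q = 1
for _ in range(1000):
    w = rng.normal(size=10)
    x = rng.normal(size=10)
    w_new = f_inv(f_link(w, q) + x, p)  # then f(w_new) = f(w) + x
    assert d_q(w, w_new, q) <= (p - 1) / 2.0 * np.linalg.norm(x, ord=p) ** 2 + 1e-9
```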

We can now prove the following progress bound.

Lemma 3. Fix 2 ≤ p < ∞ and q such that 1/p + 1/q = 1, and let 0 ≤ ρ < µ. Then at trial t, the update of the marginalized p-norm perceptron with margin ρ and learning rate η satisfies
$$d_q(u, w_t) - d_q(u, w_{t+1}) \ge \eta L_\rho(w_t, x_t, y_t) - \eta L_\mu(u, x_t, y_t) + \eta\sigma_t\left(\mu - \rho - \eta\,\frac{p-1}{2}\,||x_t||_p^2\right)$$
for any u ∈ R^n.


Proof. We apply (7) with w = w_t and w' = w_{t+1}. Since f(w_{t+1}) − f(w_t) = ησ_t y_t x_t, we get
$$d_q(u, w_t) - d_q(u, w_{t+1}) = \eta\sigma_t y_t\, x_t\cdot(u - w_t) - d_q(w_t, w_{t+1}).$$
Applying Lemma 2 yields
$$d_q(u, w_t) - d_q(u, w_{t+1}) \ge \eta\sigma_t y_t\, x_t\cdot(u - w_t) - \sigma_t\,\frac{p-1}{2}\,\eta^2\,||x_t||_p^2. \qquad (19)$$
Notice that this is a direct generalization of (2) in the proof of the Perceptron Convergence Theorem. Since we are not assuming linear separability, we need to bound the dot products a little differently from what we did there. From the definition of L_ρ we get σ_t y_t w_t · x_t = σ_t ρ − L_ρ(w_t, x_t, y_t). Similarly, σ_t y_t u · x_t ≥ σ_t µ − σ_t L_µ(u, x_t, y_t) ≥ σ_t µ − L_µ(u, x_t, y_t). Substituting these estimates into (19) gives the claim. ⊓⊔

It remains to sum the progress over all the trials and tune the learning rate. For tuning we use the following simple technical lemma.

Lemma 4. Given R ≥ 0, S > 0 and γ > 0, define
$$v(z) = \frac{R}{\gamma - z} + \frac{S}{z(\gamma - z)}$$
for 0 < z < γ. Then v(z) is minimized for z = h(γ, R, S), where h is as in (16), and the minimum value is
$$v(h(\gamma, R, S)) = \frac{R}{\gamma} + \frac{2S}{\gamma^2} + 2\left(\frac{R}{\gamma} + \frac{S}{\gamma^2}\right)^{1/2}\left(\frac{S}{\gamma^2}\right)^{1/2}.$$
Proof. Write v(z) = a/z + b/(γ − z) where a = S/γ and b = R + S/γ. The result now follows by easy differentiation. ⊓⊔

Proof of Theorem 2. By summing the bound of Lemma 3 over t = 1, ..., T we get
$$d_q(u, w_1) - d_q(u, w_{T+1}) = \sum_{t=1}^{T}\bigl(d_q(u, w_t) - d_q(u, w_{t+1})\bigr) \ge \eta\,\mathrm{Loss}_\rho(\mathrm{MPA}_p) - \eta\,\mathrm{Loss}_\mu(u) + \eta\,\mathrm{ME}(\mathrm{MPA}_p)\left(\mu - \rho - \eta\,\frac{p-1}{2}X^2\right).$$
We notice that d_q(u, w_{T+1}) ≥ 0, and for w_1 = 0 we have d_q(u, w_1) = ||u||_q²/2 ≤ U²/2. Therefore, using the assumption about Loss_µ(u), we have
$$\mathrm{ME}(\mathrm{MPA}_p)\left(\mu - \rho - \eta\,\frac{p-1}{2}X^2\right) \le K - \mathrm{Loss}_\rho(\mathrm{MPA}_p) + \frac{1}{\eta}\,\frac{U^2}{2}. \qquad (20)$$


We can safely drop the non-positive term −Loss_ρ(MPA_p). The factor µ − ρ − η(p − 1)X²/2 is positive for the parameter values we consider, so we can then solve for ME(MPA_p) to get
$$\mathrm{ME}(\mathrm{MPA}_p) \le \frac{K}{\mu - \rho - \eta(p-1)X^2/2} + \frac{U^2/2}{\eta\bigl(\mu - \rho - \eta(p-1)X^2/2\bigr)}.$$
The result follows by minimizing the right-hand side with respect to η by using Lemma 4 with γ = µ − ρ, z = η(p − 1)X²/2, R = K and S = C. ⊓⊔

We now show how to get mistake bounds from the margin error results.

Theorem 3. Fix 2 ≤ p < ∞ and q such that 1/p + 1/q = 1. Let X > 0 and assume that ||x_t||_p ≤ X for all t. Fix K ≥ 0, U > 0 and µ > 0. Define C as in (17), and let r = h(µ, K, C). Choose either ρ = 0 or ρ = µ − r, and let η = 2r/((p − 1)X²). Then for both of these parameter settings, if there is a vector u ∈ R^n with Loss_µ(u) ≤ K such that ||u||_q ≤ U, then the marginalized p-norm perceptron with margin ρ and learning rate η satisfies the mistake bound
$$M(\mathrm{MPA}_p) \le \frac{K}{\mu} + \frac{2C}{\mu^2} + 2\left(\frac{K}{\mu} + \frac{C}{\mu^2}\right)^{1/2}\left(\frac{C}{\mu^2}\right)^{1/2}.$$
Proof. For ρ = 0 this follows directly from Theorem 2. For the other case, we start with (20). Notice that up to that point in the proof of Theorem 2, no assumptions about η or ρ have been applied, so (20) holds also with our current assumptions. We now choose η so that the coefficient in front of ME(MPA_p) becomes zero. We therefore have
$$\frac{\mathrm{Loss}_\rho(\mathrm{MPA}_p)}{\rho} \le \frac{K}{\rho} + \frac{1}{\rho}\,\frac{(p-1)X^2}{2(\mu-\rho)}\,\frac{U^2}{2}. \qquad (21)$$

For any ρ > 0 we have M(MPA_p) ≤ Loss_ρ(MPA_p)/ρ. The claim follows by minimizing the right-hand side of (21) with respect to ρ by using Lemma 4 with γ = µ, z = µ − ρ, R = K and S = C. ⊓⊔

In the linearly separable case, we can take K = 0, and Theorem 3 simplifies to
$$M(\mathrm{MPA}_p) \le \frac{(p-1)U^2 X^2}{\mu^2}.$$

For ρ = 0 this is a direct generalization of the Perceptron Convergence Theorem to the p-norm case. It is interesting to notice that we can get the same mistake bound with the strictly positive margin ρ = µ/2, too. In the latter case the number of margin errors would of course be larger; the upper bound of Theorem 2 for this positive margin is worse by a factor of 4 than the mistake bound (i.e., the margin error bound for zero margin). The fact that we get a bound only for these two values of ρ is clearly an artefact of our proof technique. Whether a similar bound holds for intermediate values of ρ remains an open problem.

The Perceptron Convergence Theorem is the special case p = 2. It is also interesting to consider the other extreme, with a very large p.


Grove et al. [9] first noticed that with a large p, the mistake bounds of the p-norm perceptron become similar to those of Winnow [20, 21]. Specifically, for p = 2 ln n the estimate of Gentile and Littlestone [7] gives us
$$M(\mathrm{MPA}_p) \le \frac{(p-1)U^2 X^2}{\mu^2} \le \frac{(2e\ln n)\,U_1^2 X_\infty^2}{\mu^2}$$
where U_1 and X_∞ are upper bounds such that ||u||_1 ≤ U_1 and ||x_t||_∞ ≤ X_∞ for all t. Here ||x||_∞ = max_i |x_i|. See [16] for a discussion on how the product of norms ||u||_1 ||x||_∞ in the bounds gives a qualitatively different behavior than the pair ||u||_2 ||x||_2 appearing in the classical Perceptron Convergence Theorem.

In the limit of large K, the bound has the form
$$M(\mathrm{MPA}_p) \le \frac{K}{\mu} + c_1\sqrt{\frac{K}{\mu}} + c_2$$
where c_1 and c_2 depend on U and X but not on K. Thus, this has the desired form M ≤ (1 + o(1))K/µ discussed earlier. However, it should be noticed that the parameter tuning required for this to hold does depend on K. Recently it has been shown how online learning algorithms can in many cases tune their parameters online and achieve bounds of the form M ≤ (1 + o(1))K/µ without any prior knowledge [1]. In particular, our result here can be seen as a much weaker version of Gentile's bound for Alma [6], which does tune its parameters online.
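Before moving on, the quantities appearing in these bounds are easy to tabulate numerically. The following sketch is illustrative only: the parameter values are arbitrary, the function names are ours, and the final assertion merely spot-checks the minimization claim of Lemma 4 on a grid.

```python
# Evaluating h of (16) and the mistake bound of Theorem 3; spot-checking Lemma 4.
import numpy as np

def h(x, R, S):
    """The tuning function of (16), with the continuous extension at R = 0."""
    if R == 0.0:
        return x / 2.0
    return np.sqrt((S / R) * (x + S / R)) - S / R

def theorem3_bound(K, U, X, mu, p):
    """K/mu + 2C/mu^2 + 2*sqrt((K/mu + C/mu^2) * C/mu^2), with C as in (17)."""
    C = (p - 1) / 4.0 * U**2 * X**2
    a, b = K / mu, C / mu**2
    return a + 2.0 * b + 2.0 * np.sqrt((a + b) * b)

# Example values (arbitrary): hinge-loss budget K, norm bounds U and X, margin mu.
K, U, X, mu, p = 5.0, 1.0, 1.0, 0.1, 2.0
C = (p - 1) / 4.0 * U**2 * X**2
print("Theorem 3 bound:", theorem3_bound(K, U, X, mu, p))
print("separable case (K = 0):", theorem3_bound(0.0, U, X, mu, p),
      "vs (p-1) U^2 X^2 / mu^2 =", (p - 1) * U**2 * X**2 / mu**2)

# Spot-check Lemma 4: z = h(gamma, R, S) should minimize v on (0, gamma).
gamma, R, S = mu, K, C
v = lambda z: R / (gamma - z) + S / (z * (gamma - z))
z_star = h(gamma, R, S)
zs = np.linspace(1e-6, gamma - 1e-6, 100000)
assert v(z_star) <= v(zs).min() + 1e-9
```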

5 Regularized p-Norm Perceptron

We now consider a variant we call the regularized marginalized p-norm perceptron algorithm and denote by RMPA_p (Figure 5). The difference from MPA_p is that now there is an additional regularization parameter B > 0, which is used to bound the norm of the weight vector. Whenever the usual update of MPA_p would make the norm ||w'_{t+1}||_q larger than B, RMPA_p scales w'_{t+1} so that the constraint ||w_{t+1}||_q ≤ B is maintained. It is interesting to notice, and important for the following analysis, that the normalization step can be interpreted as a projection with respect to a Bregman divergence: we can easily verify that
$$w_{t+1} = \arg\min_{w \in \mathcal{B}} d_q(w, w'_{t+1})$$
where $\mathcal{B} = \{\, w : ||w||_q \le B \,\}$ is the q-norm ball of radius B around the origin.

There are at least two reasons for wishing to bound the norm of the weight vector. If we are interested in finding a weight vector w with a large margin, the quantity of interest is really the normalized margin min_t y_t w · x_t / ||w||, so one natural approach is to bound the norm of the weight vector and then try to achieve a large unnormalized margin min_t y_t w · x_t. Notice that here we are normalizing with respect to the q-norm, which for q < 2 leads to a different notion of normalized margin than the usual 2-norm case.
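The projection property just stated is easy to check numerically. The sketch below is purely illustrative (it again assumes the standard p-norm link f = ∇(½||·||_q²), and the Monte Carlo comparison only gives evidence, not a proof): no random point of the ball should have smaller divergence to w'_{t+1} than the scaled vector does.

```python
# Monte Carlo check that q-norm scaling is the Bregman projection onto the ball.
import numpy as np

def f_link(w, q):
    norm = np.linalg.norm(w, ord=q)
    return np.zeros_like(w) if norm == 0.0 else np.sign(w) * np.abs(w) ** (q - 1) / norm ** (q - 2)

def d_q(u, w, q):
    return (np.linalg.norm(u, ord=q) ** 2 - np.linalg.norm(w, ord=q) ** 2) / 2.0 \
        - np.dot(f_link(w, q), u - w)

rng = np.random.default_rng(1)
q, B, n = 1.5, 1.0, 5
w_prime = rng.normal(size=n) * 3.0                                 # typically outside the ball
w_proj = w_prime * min(1.0, B / np.linalg.norm(w_prime, ord=q))    # the scaling step (23)
best = d_q(w_proj, w_prime, q)
for _ in range(20000):
    cand = rng.normal(size=n)
    cand *= rng.uniform(0.0, 1.0) * B / np.linalg.norm(cand, ord=q)  # random point in the ball
    assert d_q(cand, w_prime, q) >= best - 1e-9
```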


Parameters: learning rate η > 0; margin ρ ≥ 0; regularization parameter B > 0; 2 ≤ p < ∞ (and 1 < q ≤ 2 such that 1/p + 1/q = 1)
Initialization: w_1 = 0 ∈ R^n
Repeat for t = 1, ..., T:
1. Receive the instance x_t ∈ R^n.
2. Give the prediction ŷ_t = sign(w_t · x_t).
3. Receive the desired label y_t ∈ { −1, +1 }.
4. To update the hypothesis, first compute
$$w'_{t+1} = f^{-1}\bigl(f(w_t) + \eta\sigma_t y_t x_t\bigr) \qquad (22)$$
where f and f^{-1} are as in (5) and (6), and σ_t = 0 if y_t w_t · x_t > ρ and σ_t = 1 otherwise. Then bound the norm by setting
$$w_{t+1} = w'_{t+1}\cdot\min\left\{1, \frac{B}{||w'_{t+1}||_q}\right\}. \qquad (23)$$

Fig. 5. The regularized marginalized p-norm perceptron algorithm RMPA_p
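For concreteness, the update of Figure 5 can be written out in a few lines of code. This is an illustrative sketch only; it assumes the standard p-norm link functions for (5) and (6), and the toy data, parameter values and function names are ours rather than the paper's.

```python
# Illustrative sketch of RMPA_p (Figure 5); assumes the standard p-norm links.
import numpy as np

def f_link(w, q):
    norm = np.linalg.norm(w, ord=q)
    return np.zeros_like(w) if norm == 0.0 else np.sign(w) * np.abs(w) ** (q - 1) / norm ** (q - 2)

def f_inv(theta, p):
    norm = np.linalg.norm(theta, ord=p)
    return np.zeros_like(theta) if norm == 0.0 else np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def rmpa_p(examples, p, eta, rho, B):
    """Run RMPA_p over a sequence of (x_t, y_t) pairs; return the final
    weight vector and the number of prediction mistakes."""
    q = p / (p - 1.0)
    w, mistakes = None, 0
    for x, y in examples:
        if w is None:
            w = np.zeros_like(x, dtype=float)               # w_1 = 0
        margin = y * np.dot(w, x)
        if margin <= 0.0:                                   # prediction sign(w . x) was wrong
            mistakes += 1
        if margin <= rho:                                   # margin error: sigma_t = 1, so update
            w_new = f_inv(f_link(w, q) + eta * y * x, p)               # step (22)
            w = w_new * min(1.0, B / np.linalg.norm(w_new, ord=q))     # step (23)
    return w, mistakes

# Toy usage on a synthetic, linearly separable stream (arbitrary data).
rng = np.random.default_rng(2)
u = np.array([1.0, -1.0, 0.0, 0.0])
stream = []
for _ in range(200):
    x = rng.normal(size=4)
    stream.append((x, 1.0 if np.dot(u, x) > 0 else -1.0))
w_final, n_mistakes = rmpa_p(stream, p=3.0, eta=0.1, rho=0.0, B=2.0)
print("mistakes on the toy stream:", n_mistakes)
```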

The second reason for bounding the norm of the weight vector is to make the algorithm more robust against target movement. As the norm of the weight vector increases, the relative change in it due to a single update decreases, since we keep the learning rate constant. If we wish the algorithm to converge, this would usually be a good thing. However, if we are making predictions on a non-stationary problem, we need to avoid getting stuck and pay enough attention to the most recent examples. Regularization of this specific type in the context of tracking a moving target was first considered by Herbster and Warmuth [12]. Mesterharm [22] has analyzed a variant of the Winnow algorithm [20] with a moving target using somewhat different methods.

The regularized marginalized p-norm perceptron can be seen as a simplified version of Gentile's Alma [6]. The difference is that RMPA_p needs to be given a fixed learning rate and margin as parameters at the outset. In contrast, Alma is able to tune these parameters online. This means in particular that on a linearly separable example sequence Alma will find a weight vector with approximately maximal margin within a bounded number of iterations, without needing to know the maximum margin in advance. This is of course highly desirable. However, we have not been able to generalize Gentile's analysis to allow for moving targets. Therefore we consider here the cruder version, where the parameters need to be fixed by the user at the start of the trial sequence, and a bad choice may lead to bad performance by the algorithm.

To model target movement, we introduce the notion of comparison sequences. In Section 4, our margin error and mistake bounds were given in terms of the hinge loss Loss_µ(u) of a single comparison vector. We generalize this by calling any sequence U = (u_1, ..., u_{T+1}) ∈ (R^n)^{T+1} a comparison sequence and defining its loss in the natural manner


$$\mathrm{Loss}_\mu(U) = \sum_{t=1}^{T} L_\mu(u_t, x_t, y_t).$$
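As a purely illustrative aside (the helper below is ours, not the paper's), the two quantities about a comparison sequence that enter the bounds of this section, its cumulative hinge loss and its total movement, can be computed as follows.

```python
# Cumulative hinge loss and total movement of a comparison sequence
# U = (u_1, ..., u_{T+1}) on an example sequence ((x_1, y_1), ..., (x_T, y_T)).
import numpy as np

def hinge(u, x, y, mu):
    """L_mu(u, x, y) = max(0, mu - y * (u . x))."""
    return max(0.0, mu - y * np.dot(u, x))

def loss_and_movement(U, examples, mu, q):
    loss = sum(hinge(U[t], x, y, mu) for t, (x, y) in enumerate(examples))
    movement = sum(np.linalg.norm(U[t] - U[t + 1], ord=q) for t in range(len(examples)))
    return loss, movement
```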

(Appending the extra vector u_{T+1} is a convenient technicality.) In proving a bound for the performance of the algorithm in terms of Loss_µ(U) we need to make two assumptions about the sequence U. First, analogously to Section 4, we need to bound the norm of each individual comparison vector u_t. Second, we need to bound the total amount of target movement during the trial sequence, which we measure by $\sum_{t=1}^{T} ||u_t - u_{t+1}||_q$. The following results are from [19]. We start by giving a margin error bound.

Theorem 4. Fix 2 ≤ p < ∞ and q such that 1/p + 1/q = 1. Let X > 0 and assume that ||x_t||_p ≤ X for all t. Fix K ≥ 0, U > 0 and D ≥ 0 and write
$$C = \frac{p-1}{4}\,(U^2 + 2UD)\,X^2. \qquad (24)$$
If there is a comparison sequence U ∈ (R^n)^{T+1} with Loss_µ(U) ≤ K such that ||u_t||_q ≤ U for all t and $\sum_{t=1}^{T} ||u_t - u_{t+1}||_q \le D$, then the regularized marginalized p-norm perceptron with any margin 0 ≤ ρ < µ, learning rate η = 2h(µ − ρ, K, C)/((p − 1)X²) and regularization parameter B = U satisfies the bound
$$\mathrm{ME}(\mathrm{RMPA}_p) \le \frac{K}{\mu-\rho} + \frac{2C}{(\mu-\rho)^2} + 2\left(\frac{K}{\mu-\rho} + \frac{C}{(\mu-\rho)^2}\right)^{1/2}\left(\frac{C}{(\mu-\rho)^2}\right)^{1/2}$$
for its number of updates.

To interpret the result, notice first that the actual bound is the same as the one given in Theorem 2 for MPA_p, except that the value of the constant C is different. Thus, in the constant C we have to pay both for the norms of the individual comparison vectors and for the total amount of movement. This is as expected. Another important difference is that we now have an explicit bound on the norms of the weight vectors of the algorithm. This makes the margin error bound more interesting even when there is no target movement (D = 0). Thus, assume that u has margin µ on the example sequence S, and ||u||_q ≤ U. If we now run RMPA_p by cycling repeatedly through S, then assuming ρ < µ and B = U it converges after at most
$$\frac{(p-1)U^2 X^2}{(\mu - \rho)^2}$$
updates to a weight vector w with margin ρ on S and norm ||w||_q ≤ U. If we write ρ = (1 − α)µ, the bound becomes
$$\frac{(p-1)U^2 X^2}{\alpha^2\mu^2},$$
so we see that the algorithm finds a weight vector with a margin within a factor 1 − α of the optimal in O(1/α²) iterations.


This result is similar to Gentile's result for Alma [6], but as noticed before there is the serious drawback that, unlike for Alma, we need to know the optimal margin beforehand to guarantee this. Similar approximation results for the case (p, q) = (∞, 1) have in the batch case been proved for a boosting algorithm [24]. Both for RMPA_p and Alma it should be noticed that if we include the bias term using the reduction explained in Section 2, we can no longer get such a good approximation result. Intuitively this is because the reduction makes the bias term b part of the weight vector which is regularized, although the bias term should be outside the regularization. Finding a better method for handling the bias term remains an open question.

Proof of Theorem 4. The proof is based on the technique of Herbster and Warmuth [12]. Again we use the Bregman divergence as a measure of progress, but now we need to include the effect of target movement. Therefore, we start by estimating the progress d_q(u_t, w_t) − d_q(u_{t+1}, w_{t+1}). We split it into three parts as
$$d_q(u_t, w_t) - d_q(u_{t+1}, w_{t+1}) = \bigl(d_q(u_t, w_t) - d_q(u_t, w'_{t+1})\bigr) + \bigl(d_q(u_t, w'_{t+1}) - d_q(u_t, w_{t+1})\bigr) + \bigl(d_q(u_t, w_{t+1}) - d_q(u_{t+1}, w_{t+1})\bigr).$$
For the first part, Lemma 3 directly implies
$$d_q(u_t, w_t) - d_q(u_t, w'_{t+1}) \ge \eta L_\rho(w_t, x_t, y_t) - \eta L_\mu(u_t, x_t, y_t) + \eta\sigma_t\left(\mu - \rho - \eta\,\frac{p-1}{2}\,||x_t||_p^2\right). \qquad (25)$$

For the second part, we already observed that
$$w_{t+1} = \arg\min_{w \in \mathcal{B}} d_q(w, w'_{t+1})$$
where $\mathcal{B} = \{\, w : ||w||_q \le B \,\}$ is convex. Since we also assume u_t ∈ $\mathcal{B}$, Lemma 1 gives us
$$d_q(u_t, w'_{t+1}) - d_q(u_t, w_{t+1}) \ge d_q(w_{t+1}, w'_{t+1}) \ge 0. \qquad (26)$$

For the third part, the definition of d_q gives us
$$d_q(u_t, w_{t+1}) - d_q(u_{t+1}, w_{t+1}) = \frac{1}{2}||u_t||_q^2 - \frac{1}{2}||u_{t+1}||_q^2 + f(w_{t+1})\cdot(u_{t+1} - u_t).$$
Since 1/p + 1/q = 1, the p-norm and q-norm are dual, and Hölder's inequality implies |f(w_{t+1}) · (u_{t+1} − u_t)| ≤ ||f(w_{t+1})||_p ||u_{t+1} − u_t||_q. It is easy to verify that ||f(w)||_p = ||w||_q for all w, so we get
$$d_q(u_t, w_{t+1}) - d_q(u_{t+1}, w_{t+1}) \ge \frac{1}{2}||u_t||_q^2 - \frac{1}{2}||u_{t+1}||_q^2 - ||u_{t+1} - u_t||_q\,||w_{t+1}||_q. \qquad (27)$$


We now combine (25), (26) and (27), estimate ||x_t||_p ≤ X and ||w_{t+1}||_q ≤ U, and sum over t = 1, ..., T to obtain
$$d_q(u_1, w_1) - d_q(u_{T+1}, w_{T+1}) = \sum_{t=1}^{T}\bigl(d_q(u_t, w_t) - d_q(u_{t+1}, w_{t+1})\bigr)$$
$$\ge \eta\,\mathrm{Loss}_\rho(\mathrm{RMPA}_p) - \eta\,\mathrm{Loss}_\mu(U) + \eta\,\mathrm{ME}(\mathrm{RMPA}_p)\left(\mu - \rho - \eta\,\frac{p-1}{2}X^2\right) + \frac{1}{2}||u_1||_q^2 - \frac{1}{2}||u_{T+1}||_q^2 - U\sum_{t=1}^{T}||u_t - u_{t+1}||_q.$$
We notice that for w_1 = 0 we have d_q(u_1, w_1) = ½||u_1||_q², and clearly d_q(u_{T+1}, w_{T+1}) ≥ 0. By our assumptions about U we now get
$$\mathrm{ME}(\mathrm{RMPA}_p)\left(\mu - \rho - \eta\,\frac{p-1}{2}X^2\right) \le K - \mathrm{Loss}_\rho(\mathrm{RMPA}_p) + \frac{1}{\eta}\left(\frac{U^2}{2} + UD\right). \qquad (28)$$
As before, we drop the term −Loss_ρ(RMPA_p). The coefficient µ − ρ − η(p − 1)X²/2 is positive for the parameter values we consider, so solving for ME(RMPA_p) gives us

$$\mathrm{ME}(\mathrm{RMPA}_p) \le \frac{K}{\mu - \rho - \eta(p-1)X^2/2} + \frac{U^2/2 + UD}{\eta\bigl(\mu - \rho - \eta(p-1)X^2/2\bigr)}.$$
We then apply Lemma 4 with z = η(p − 1)X²/2, γ = µ − ρ, R = K and S = C. ⊓⊔

As in the non-regularized case, we can easily convert the margin error bound into a mistake bound. Again, two different parameter settings, one with ρ = 0 and one with ρ > 0, give the same bound.

Theorem 5. Fix 2 ≤ p < ∞ and q such that 1/p + 1/q = 1. Let X > 0 and assume that ||x_t||_p ≤ X for all t. Fix K ≥ 0, U > 0, D > 0 and µ > 0. Define C as in (24), and let r = h(µ, K, C). Choose either ρ = 0 or ρ = µ − r, and let η = 2r/((p − 1)X²). Then for both of these parameter settings, if there is a sequence U ∈ (R^n)^{T+1} with Loss_µ(U) ≤ K such that ||u_t||_q ≤ U for all t and $\sum_{t=1}^{T} ||u_t - u_{t+1}||_q \le D$, then the regularized marginalized p-norm perceptron with margin ρ, learning rate η and regularization parameter B = U satisfies the mistake bound
$$M(\mathrm{RMPA}_p) \le \frac{K}{\mu} + \frac{2C}{\mu^2} + 2\left(\frac{K}{\mu} + \frac{C}{\mu^2}\right)^{1/2}\left(\frac{C}{\mu^2}\right)^{1/2}.$$


Proof. The case ρ = 0 follows directly from Theorem 4. For the other case, we start from (28) in the proof of Theorem 4. The parameters η and ρ have been chosen so that the coefficient µ − ρ − η(p − 1)X²/2 becomes zero. Therefore the term with ME(RMPA_p) in (28) vanishes, and we get
$$M(\mathrm{RMPA}_p) \le \frac{\mathrm{Loss}_\rho(\mathrm{RMPA}_p)}{\rho} \le \frac{K}{\rho} + \frac{(p-1)X^2\,(U^2/2 + UD)}{2\rho(\mu - \rho)}.$$
The claim now follows from Lemma 4 with γ = µ and z = µ − ρ. ⊓⊔

Kivinen et al. [19, 15] have also considered a closely related algorithm they call Norma. In Norma, the renormalization step of RMPA_p is replaced by w_{t+1} = (1 − λ)w'_{t+1} for some constant 0 < λ < 1. Thus, the weight vector is always shrunk, regardless of its norm. The mistake bounds given for Norma are similar to Theorem 5 but make somewhat more restrictive assumptions on the target movement, and the analysis is somewhat more complicated.
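The difference between the two ways of controlling the norm is easy to state in code. The sketch below is purely illustrative (the function names are ours); it contrasts the projection step (23), which only acts when the q-norm exceeds B, with the Norma-style shrinkage, which always scales the weight vector.

```python
# Illustrative contrast between the projection step (23) and Norma-style shrinkage.
import numpy as np

def project_q_ball(w, B, q):
    """Step (23): scale w back onto the q-norm ball of radius B, if necessary."""
    return w * min(1.0, B / np.linalg.norm(w, ord=q))

def norma_shrink(w, lam):
    """Norma-style step: always shrink by a factor (1 - lambda), regardless of the norm."""
    return (1.0 - lam) * w
```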

6 Conclusion

The main theme in this paper has been the application of Bregman divergences to the analysis of online algorithms. We have seen that the 2-norm used in analyzing the classical perceptron algorithm can be replaced by the q-norm for any 1 < q ≤ 2, leading to algorithms with different convergence bounds. The methodology also generalizes naturally to the non-separable case. Bregman divergences can also be applied as regularizers, which improves the robustness of the algorithms against target movement.

Various problems remain open for future work. The most pressing technical problem with the proposed RMPA algorithm is parameter tuning. Automatic parameter tuning methods are known for the case with a stationary target, but the case with a moving target seems more complicated. In particular, assuming a stationary target we want our algorithm eventually to converge, so the learning rate should approach zero. (Here we assume that the algorithm keeps the norm of the weight vector bounded.) If we allow for target movement, the learning rate should presumably approach some positive value that depends on the target speed. How exactly this should be done is not clear.

Another immediate question concerns using the algorithms (in the case p = 2) in conjunction with kernels. The basic kernelized versions of the perceptron and other algorithms considered here depend on maintaining all the examples on which an update was made. If we are applying the algorithms in a truly online setting, this often leads to the computational complexity growing without bound as time goes by. Some kind of forgetting of past examples seems to be called for.

On a more general level, an empirical evaluation of the algorithms would of course be of interest. Most traditional classification learning benchmarks are batch problems. It would be nice to have problems that naturally have an online nature, possibly even with non-stationary targets. Suitable test problems can perhaps be found in the field of signal processing.


Acknowledgments. This work was supported by the Australian Research Council.

References

1. P. Auer, N. Cesa-Bianchi and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75, 2002.
2. B. E. Boser, I. M. Guyon and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. 5th Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, New York, NY, 1992.
3. L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Physics, 7:200–217, 1967.
4. I. Csiszár. Why least squares and maximum entropy? An axiomatic approach for linear inverse problems. The Annals of Statistics, 19:2032–2066, 1991.
5. Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37:277–296, 1999.
6. C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
7. C. Gentile and N. Littlestone. The robustness of the p-norm algorithms. In Proc. 12th Annual Conference on Computational Learning Theory, pages 1–11. ACM Press, New York, NY, 1999.
8. C. Gentile and M. K. Warmuth. Hinge loss and average margin. In M. S. Kearns, S. A. Solla and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 225–231. MIT Press, Cambridge, MA, 1998.
9. A. J. Grove, N. Littlestone and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43:173–210, 2001.
10. D. P. Helmbold, J. Kivinen and M. K. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10:1291–1304, 1999.
11. R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, MA, 2002.
12. M. Herbster and M. K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.
13. K.-U. Höffgen, H.-U. Simon and K. S. Van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50:114–125, 1995.
14. L. Jones and C. Byrne. General entropy criteria for inverse problems, with applications to data compression, pattern classification and cluster analysis. IEEE Transactions on Information Theory, 36:23–30, 1990.
15. J. Kivinen, A. J. Smola and R. C. Williamson. Online learning with kernels. In T. G. Dietterich, S. Becker and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 785–792. MIT Press, Cambridge, MA, 2002.
16. J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132:1–64, 1997.
17. J. Kivinen and M. K. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45:301–329, 2001.
18. J. Kivinen, M. K. Warmuth and P. Auer. The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97:325–343, 1997.


19. J. Kivinen, A. J. Smola and R. C. Williamson. Large margin classification for moving targets. In N. Cesa-Bianchi, M. Numao and R. Reischuk, editors, Proc. 13th International Conference on Algorithmic Learning Theory. Springer, Berlin, November 2002.
20. N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
21. N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, Technical Report UCSC-CRL-89-11, University of California Santa Cruz, 1989.
22. C. Mesterharm. Tracking linear-threshold concepts with Winnow. In J. Kivinen and B. Sloan, editors, Proc. 15th Annual Conference on Computational Learning Theory, pages 138–152. Springer, Berlin, 2002.
23. A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622. Polytechnic Institute of Brooklyn, 1962.
24. G. Rätsch and M. K. Warmuth. Maximizing the margin with boosting. In J. Kivinen and B. Sloan, editors, Proc. 15th Annual Conference on Computational Learning Theory, pages 334–350. Springer, Berlin, 2002.
25. R. Rockafellar. Convex Analysis. Princeton University Press, 1970.
26. V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, New York, NY, 1982.
27. M. K. Warmuth and A. Jagota. Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence. In R. Greiner and E. Boros, editors, Electronic Proceedings of Fifth International Symposium on Artificial Intelligence and Mathematics, http://rutcor.rutgers.edu/~amai, 1998.

Author Index

Bartlett, Peter L. 184
Hegland, Markus 226
Kivinen, Jyrki 235
Lloyd, J.W. 203
Meir, Ron 118
Mendelson, Shahar 1
Rätsch, Gunnar 118
Schölkopf, Bernhard 41, 65
Smola, Alexander J. 41, 65