A (short) Introduction to Support Vector Machines and Kernel-based Learning

Johan Suykens
K.U. Leuven, ESAT-SCD-SISTA
Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.ac.be/sista/members/suykens.html

ESANN 2003, Bruges, April 2003



Overview

• Disadvantages of classical neural nets
• SVM properties and standard SVM classifier
• Related kernel-based learning methods
• Use of the “kernel trick” (Mercer Theorem)
• LS-SVMs: extending the SVM framework
• Towards a next generation of universally applicable models?
• The problem of learning and generalization


Classical MLPs

[Figure: a single neuron with inputs x1, ..., xn, weights w1, ..., wn, bias b, activation h(·) and output y.]

Multilayer Perceptron (MLP) properties:
• Universal approximation of continuous nonlinear functions
• Learning from input-output patterns; either off-line or on-line learning
• Parallel network architecture, multiple inputs and outputs
• Use in feedforward and recurrent networks
• Use in supervised and unsupervised learning applications

Problems:
• Existence of many local minima!
• How many neurons needed for a given task?

Support Vector Machines (SVM)

[Figure: cost function versus weights, for an MLP and for an SVM.]

• Nonlinear classification and function estimation by convex optimization with a unique solution and primal-dual interpretations.
• Number of neurons automatically follows from a convex program.
• Learning and generalization in huge dimensional input spaces (able to avoid the curse of dimensionality!).
• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ...). Application-specific kernels possible (e.g. textmining, bioinformatics).
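As a concrete illustration of the last bullet, a minimal NumPy sketch (added for this write-up, not part of the original slides) of the kernels named above; the parameter symbols τ, d, σ, κ and θ follow the definitions given later on the "Standard SVM classifier (3)" slide.

```python
import numpy as np

# Kernel evaluations K(x, x_i); x and xi are 1-D NumPy arrays.
def linear_kernel(x, xi):
    return xi @ x                                        # x_i^T x

def poly_kernel(x, xi, tau=1.0, d=3):
    return (xi @ x + tau) ** d                           # (x_i^T x + tau)^d

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)   # exp(-||x - x_i||^2 / sigma^2)

def mlp_kernel(x, xi, kappa=1.0, theta=0.0):
    # tanh(kappa x_i^T x + theta); positive definite only for some (kappa, theta)
    return np.tanh(kappa * (xi @ x) + theta)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), poly_kernel(x, z), rbf_kernel(x, z), mlp_kernel(x, z))
```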


SVM: support vectors

[Figure: two binary classification examples in the (x1, x2) plane, with the SVM decision boundary and the support vectors indicated.]

• Decision boundary can be expressed in terms of a limited number of support vectors (subset of given training data); sparseness property
• Classifier follows from the solution to a convex QP problem.
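A small sketch of the sparseness property (added for this write-up; it assumes scikit-learn's SVC, which is not mentioned on the slides): after solving the QP, only the support vectors are needed to evaluate the classifier.

```python
import numpy as np
from sklearn.svm import SVC   # assumed dependency for this illustration

rng = np.random.default_rng(0)

# Two Gaussian clouds as a toy binary classification problem (x in R^2, y in {-1, +1}).
X = np.vstack([rng.normal(-1.0, 0.7, (200, 2)), rng.normal(+1.0, 0.7, (200, 2))])
y = np.hstack([-np.ones(200), np.ones(200)])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)   # solves the convex QP internally

# Sparseness: the decision function depends only on the support vectors.
print("training points :", len(X))
print("support vectors :", len(clf.support_vectors_))
print("prediction for [0, 0]:", clf.predict([[0.0, 0.0]]))
```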


SVMs: living in two worlds ...

Primal space (→ large data sets):
Parametric: estimate w ∈ R^{n_h}
y(x) = sign[w^T ϕ(x) + b]

[Figure: network with hidden-layer features ϕ_1(x), ..., ϕ_{n_h}(x) and weights w_1, ..., w_{n_h} producing y(x); the two classes (+ and ×) shown in the input space and in the feature space.]

Dual space (→ high dimensional inputs):
Non-parametric: estimate α ∈ R^N
y(x) = sign[Σ_{i=1}^{#sv} α_i y_i K(x, x_i) + b]

[Figure: network with kernel units K(x, x_1), ..., K(x, x_{#sv}) and weights α_1, ..., α_{#sv} producing y(x).]

“Kernel trick”: K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j)
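To make the two representations concrete, a minimal sketch (added for this write-up) using an explicit, finite-dimensional feature map so that both worlds can be written out: the primal model stores w ∈ R^{n_h}, the dual model stores one α_i per support vector and only needs kernel evaluations K(x, x_i) = ϕ(x)^T ϕ(x_i).

```python
import numpy as np

# Explicit feature map (a simple degree-2 polynomial map, so n_h is finite here).
def phi(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def K(x, z):                       # kernel trick: K(x, z) = phi(x)^T phi(z)
    return phi(x) @ phi(z)

# Toy support vectors with labels and (already computed) dual variables alpha_i.
sv    = [np.array([1.0, 0.5]), np.array([-0.8, 1.2]), np.array([0.2, -1.0])]
y_sv  = np.array([+1.0, -1.0, +1.0])
alpha = np.array([0.3, 0.5, 0.2])
b = 0.1

# Primal world: parametric, w in R^{n_h}.
w = sum(a * yi * phi(xi) for a, yi, xi in zip(alpha, y_sv, sv))
def y_primal(x):
    return np.sign(w @ phi(x) + b)

# Dual world: non-parametric, kernel expansion over the support vectors.
def y_dual(x):
    return np.sign(sum(a * yi * K(x, xi) for a, yi, xi in zip(alpha, y_sv, sv)) + b)

x_new = np.array([0.3, 0.4])
print(y_primal(x_new), y_dual(x_new))   # identical by construction of w
```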


Standard SVM classifier (1)

• Training set {x_i, y_i}_{i=1}^N : inputs x_i ∈ R^n ; class labels y_i ∈ {−1, +1}

• Classifier: y(x) = sign[w^T ϕ(x) + b] with ϕ(·) : R^n → R^{n_h} a mapping to a high dimensional feature space (which can be infinite dimensional!)

• For separable data, assume
  w^T ϕ(x_i) + b ≥ +1, if y_i = +1
  w^T ϕ(x_i) + b ≤ −1, if y_i = −1
  ⇒ y_i [w^T ϕ(x_i) + b] ≥ 1, ∀i

• Optimization problem (non-separable case):
  min_{w,b,ξ} J(w, ξ) = (1/2) w^T w + c Σ_{i=1}^N ξ_i
  s.t. y_i [w^T ϕ(x_i) + b] ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., N
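For the linear case ϕ(x) = x, the constrained problem above can equivalently be written as the unconstrained hinge-loss problem min_{w,b} (1/2) w^T w + c Σ_i max(0, 1 − y_i(w^T x_i + b)). The sketch below (added for this write-up; not the route taken on the slides, which derive the dual QP next) minimizes this form by plain subgradient descent.

```python
import numpy as np

def train_linear_svm(X, y, c=1.0, lr=0.01, epochs=200):
    """Subgradient descent on (1/2)||w||^2 + c * sum_i max(0, 1 - y_i(w^T x_i + b))."""
    N, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                 # points violating y_i(w^T x_i + b) >= 1
        grad_w = w - c * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -c * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```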


Standard SVM classifier (2)

• Lagrangian:
  L(w, b, ξ; α, ν) = J(w, ξ) − Σ_{i=1}^N α_i { y_i [w^T ϕ(x_i) + b] − 1 + ξ_i } − Σ_{i=1}^N ν_i ξ_i

• Find saddle point:
  max_{α,ν} min_{w,b,ξ} L(w, b, ξ; α, ν)

• One obtains
  ∂L/∂w = 0 → w = Σ_{i=1}^N α_i y_i ϕ(x_i)
  ∂L/∂b = 0 → Σ_{i=1}^N α_i y_i = 0
  ∂L/∂ξ_i = 0 → 0 ≤ α_i ≤ c, i = 1, ..., N

Standard SVM classifier (3)

• Dual problem: QP problem
  max_α Q(α) = −(1/2) Σ_{i,j=1}^N y_i y_j K(x_i, x_j) α_i α_j + Σ_{j=1}^N α_j
  s.t. Σ_{i=1}^N α_i y_i = 0
       0 ≤ α_i ≤ c, ∀i
  with kernel trick (Mercer Theorem): K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j)

• Obtained classifier:
  y(x) = sign[Σ_{i=1}^N α_i y_i K(x, x_i) + b]

• Some possible kernels K(·, ·):
  K(x, x_i) = x_i^T x (linear SVM)
  K(x, x_i) = (x_i^T x + τ)^d (polynomial SVM of degree d)
  K(x, x_i) = exp{−‖x − x_i‖_2^2 / σ^2} (RBF kernel)
  K(x, x_i) = tanh(κ x_i^T x + θ) (MLP kernel)


Kernel-based learning: many related methods and fields

[Diagram: RKHS linked to SVMs, LS-SVMs, regularization networks, Gaussian processes, kernel ridge regression and kriging.]

Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba

SVMs are closely related to learning in Reproducing Kernel Hilbert Spaces.

Wider use of the kernel trick

• Angle between vectors:
  Input space:   cos θ_{x,z} = x^T z / (‖x‖_2 ‖z‖_2)
  Feature space: cos θ_{ϕ(x),ϕ(z)} = ϕ(x)^T ϕ(z) / (‖ϕ(x)‖_2 ‖ϕ(z)‖_2) = K(x, z) / (√K(x, x) √K(z, z))

• Distance between vectors:
  Input space:   ‖x − z‖_2^2 = (x − z)^T (x − z) = x^T x + z^T z − 2 x^T z
  Feature space: ‖ϕ(x) − ϕ(z)‖_2^2 = K(x, x) + K(z, z) − 2 K(x, z)
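A short sketch of these two identities (added for this write-up) with an RBF kernel: the feature map is never formed explicitly, only kernel evaluations are used.

```python
import numpy as np

def K(x, z, sigma=1.0):
    # RBF kernel, as on the "Standard SVM classifier (3)" slide
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def feature_cosine(x, z):
    # cosine of the angle between phi(x) and phi(z), via kernel evaluations only
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

def feature_distance_sq(x, z):
    # ||phi(x) - phi(z)||^2 = K(x, x) + K(z, z) - 2 K(x, z)
    return K(x, x) + K(z, z) - 2.0 * K(x, z)

x = np.array([1.0, 2.0])
z = np.array([0.0, 1.0])
print(feature_cosine(x, z), feature_distance_sq(x, z))
```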


LS-SVM models: extending the SVM framework

• Linear and nonlinear classification and function estimation, applicable in high dimensional input spaces; primal-dual optimization formulations.
• Solving linear systems (see the sketch after this list); link with Gaussian processes, regularization networks and kernel versions of Fisher discriminant analysis.
• Sparse approximation and robust regression (robust statistics).
• Bayesian inference (probabilistic interpretations, inference of hyperparameters, model selection, automatic relevance determination for input selection).
• Extensions to unsupervised learning: kernel PCA (and related methods of kernel PLS, CCA), density estimation (↔ clustering).
• Fixed-size LS-SVMs: large scale problems; adaptive learning machines; transductive.
• Extensions to recurrent networks and control.
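A minimal sketch of the second bullet (added for this write-up): the LS-SVM classifier is trained by solving one linear system in the dual unknowns (b, α), with regularization constant γ, instead of a QP.

```python
import numpy as np

def rbf(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)

def lssvm_classifier_train(X, y, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM classifier dual system
         [ 0     y^T              ] [ b ]   [ 0 ]
         [ y     Omega + I/gamma  ] [ a ] = [ 1 ]
       with Omega_ij = y_i y_j K(x_i, x_j)."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    Kmat = np.array([[rbf(X[i], X[j], sigma) for j in range(N)] for i in range(N)])
    Omega = np.outer(y, y) * Kmat
    Amat = np.zeros((N + 1, N + 1))
    Amat[0, 1:] = y
    Amat[1:, 0] = y
    Amat[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.hstack([0.0, np.ones(N)])
    sol = np.linalg.solve(Amat, rhs)        # one linear system instead of a QP
    b, alpha = sol[0], sol[1:]
    return alpha, b

def lssvm_predict(x, X, y, alpha, b, sigma=1.0):
    return np.sign(sum(a * yi * rbf(x, xi, sigma)
                       for a, yi, xi in zip(alpha, y, X)) + b)
```

Note that, unlike the SVM QP, every training point typically gets a nonzero α_i here; the "sparse approximation" bullet above refers to techniques for recovering sparseness.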


Towards a next generation of universal models?

              Linear   Robust Linear   Kernel   Robust Kernel   LS-SVM   SVM
FDA             ×            ×            ×           −            ×      −
PCA             ×            ×            ×           −            ×      −
PLS             ×            ×            ×           −            ×      −
CCA             ×            ×            ×           −            ×      −
Classifiers     ×            ×            ×           −            ×      ×
Regression      ×            ×            ×           ×            ×      ×
Clustering      ×            −            ×           −            ×      ×
Recurrent       ×            −            ×           −            ×      −

Research issues:
• Large scale methods
• Adaptive processing
• Robustness issues
• Statistical aspects
• Application-specific kernels

Fixed-size LS-SVM (1)

[Diagram: primal-dual representation; dual space: Nyström method, kernel PCA, density estimate, entropy criteria, eigenfunctions, SV selection; primal space: regression.]

• Modelling in view of primal-dual representations
• Link between the Nyström approximation (GP), kernel PCA and density estimation
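A hedged sketch of the Nyström idea referred to above (added for illustration): approximate eigenfunctions of the kernel are built from a small set of M selected points, giving an explicit M-dimensional feature map in which a primal model can be fitted on a large data set. Here the M points are chosen at random and a plain ridge regression is used in the primal, as a simplified stand-in for the entropy-based SV selection and fixed-size LS-SVM regression of the slides.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def nystrom_features(X, Z, sigma=1.0):
    """Approximate feature map from a small subset Z (M << N):
       phi_hat(x) = Lambda^{-1/2} U^T [K(x, z_1), ..., K(x, z_M)]^T,
       where K_MM = U Lambda U^T is the kernel matrix on Z."""
    M = len(Z)
    K_MM = np.array([[rbf(Z[i], Z[j], sigma) for j in range(M)] for i in range(M)])
    lam, U = np.linalg.eigh(K_MM)
    lam = np.maximum(lam, 1e-12)                      # guard tiny/negative eigenvalues
    W = U / np.sqrt(lam)                              # columns scaled by lambda^{-1/2}
    K_NM = np.array([[rbf(x, z, sigma) for z in Z] for x in X])
    return K_NM @ W                                   # N x M approximate feature matrix

# Toy sinc regression: many data points, few "support vectors", work in the primal.
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, (2000, 1))
y = np.sinc(X[:, 0] / np.pi) + 0.1 * rng.normal(size=2000)   # sin(x)/x plus noise

Z = X[rng.choice(len(X), size=10, replace=False)]     # 10 selected points
Phi = nystrom_features(X, Z, sigma=2.0)
w = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(Phi.shape[1]), Phi.T @ y)  # ridge fit
print("train MSE:", np.mean((Phi @ w - y) ** 2))
```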


Fixed-size LS-SVM (2)

High dimensional inputs, large data sets, adaptive learning machines (using LS-SVMlab)

[Figure, left: sinc function example (20,000 data points, 10 SV), estimate y versus x on [−10, 10]. Figure, right: Santa Fe laser data, y_k and ŷ_k versus discrete time k.]

Fixed-size LS-SVM (3)

[Figure: further fixed-size LS-SVM examples in a two-dimensional input space (both axes from −2.5 to 2.5).]

The problem of learning and generalization (1)

Different mathematical settings exist, e.g.

• Vapnik et al.: predictive learning problem (inductive inference); estimating values of functions at given points (transductive inference).
  Vapnik V. (1998) Statistical Learning Theory, John Wiley & Sons, New York.

• Poggio et al., Smale: estimate the true function f with an analysis of the approximation error and the sample error (e.g. in an RKHS or a Sobolev space).
  Cucker F., Smale S. (2002) "On the mathematical foundations of learning theory", Bulletin of the AMS, 39, 1-49.

Goal: deriving bounds on the generalization error (these can be used to determine regularization parameters and other tuning constants). For practical applications it is important to obtain sharp bounds.

The problem of learning and generalization (2) (see Pontil, ESANN 2003)

Random variables x ∈ X, y ∈ Y ⊆ R; draw i.i.d. samples from an (unknown) probability distribution ρ(x, y).

Generalization error:
  E[f] = ∫_{X,Y} L(y, f(x)) ρ(x, y) dx dy

Loss function L(y, f(x)); empirical error:
  E_N[f] = (1/N) Σ_{i=1}^N L(y_i, f(x_i))

f_ρ := arg min_f E[f] (true function);  f_N := arg min_f E_N[f]

If L(y, f) = (f − y)^2 then f_ρ = ∫_Y y ρ(y|x) dy (regression function)

Consider a hypothesis space H with f_H := arg min_{f ∈ H} E[f]
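A small numerical sketch of these definitions for the squared loss (added for this write-up): the generalization error E[f] is approximated by Monte Carlo on a large fresh sample, the empirical error E_N[f] on the N training points.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # "unknown" distribution rho(x, y): y = sin(x) + noise
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + 0.3 * rng.normal(size=n)

def sq_loss(y, fx):                      # L(y, f(x)) = (f(x) - y)^2
    return (fx - y) ** 2

f = lambda x: 0.9 * x - 0.1 * x ** 3     # some fixed hypothesis f

x_train, y_train = sample(30)            # N = 30 training points
emp_err = np.mean(sq_loss(y_train, f(x_train)))        # E_N[f]

x_big, y_big = sample(200_000)           # fresh sample ~ rho: Monte Carlo estimate of E[f]
gen_err = np.mean(sq_loss(y_big, f(x_big)))

print("empirical error E_N[f]:", emp_err)
print("generalization error E[f] (MC estimate):", gen_err)
```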


The problem of learning and generalization (3)

generalization error = sample error + approximation error
E[f_N] − E[f_ρ] = (E[f_N] − E[f_H]) + (E[f_H] − E[f_ρ])

The approximation error depends only on H (not on the sampled examples).

Sample error: E[f_N] − E[f_H] ≤ ε(1/N, h, 1/δ) with probability 1 − δ, where ε is a non-decreasing function of its arguments and h measures the size of the hypothesis space H.

Overfitting occurs when h is large and N is small (large sample error).

Goal: obtain a good trade-off between the sample error and the approximation error.
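A toy illustration of this trade-off (added for this write-up, using polynomial regression rather than kernel models): with N small, enlarging the hypothesis space (here the polynomial degree h) keeps lowering the empirical error while the estimated generalization error eventually grows again.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + 0.3 * rng.normal(size=n)

x_tr, y_tr = sample(15)                 # small N
x_te, y_te = sample(100_000)            # large fresh sample to estimate E[f_N]

for h in [1, 3, 5, 9, 13]:              # h ~ size of the hypothesis space (poly degree)
    coeffs = np.polyfit(x_tr, y_tr, deg=h)        # f_N minimizes the empirical error
    emp = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    gen = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"h={h:2d}  E_N[f_N]={emp:.3f}  E[f_N]~{gen:.3f}")
```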


Interdisciplinary challenges

NATO-ASI on Learning Theory and Practice, Leuven, July 2002
http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html

[Diagram: SVM & kernel methods at the intersection of neural networks, data mining, linear algebra, pattern recognition, mathematics, machine learning, optimization, statistics, signal processing, and systems and control theory.]

J.A.K. Suykens, G. Horvath, S. Basu, C. Micchelli, J. Vandewalle (Eds.), Advances in Learning Theory: Methods, Models and Applications, NATO-ASI Series Computer and Systems Sciences, IOS Press, 2003.

Books, software, papers ...

www.kernel-machines.org & www.esat.kuleuven.ac.be/sista/lssvmlab/

Introductory papers:

C.J.C. Burges (1998) "A tutorial on support vector machines for pattern recognition", Knowledge Discovery and Data Mining, 2(2), 121-167.

A.J. Smola, B. Schölkopf (1998) "A tutorial on support vector regression", NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK.

T. Evgeniou, M. Pontil, T. Poggio (2000) "Regularization networks and support vector machines", Advances in Computational Mathematics, 13(1), 1-50.

K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf (2001) "An introduction to kernel-based learning algorithms", IEEE Transactions on Neural Networks, 12(2), 181-201.