Applied inductive learning - Lecture 7


Applied inductive learning - Lecture 7 Louis Wehenkel & Pierre Geurts Department of Electrical Engineering and Computer Science University of Liège Montefiore - Liège - November 5, 2012 Find slides: http://montefiore.ulg.ac.be/~lwh/aia/ Wehenkel & Geurts AIA... (1/45)

Linear support vector machines: Linear classification model, Maximal margin hyperplane, Optimisation problem formulation, A brief introduction to constrained optimization, Dual problem formulation, Support vectors, Soft margin SVM
The Kernel trick: Kernel based SVMs, Formal definition of Kernel function, Examples of Kernels, Kernel methods
Kernel ridge regression
Conclusion
Wehenkel & Geurts AIA... (2/45)

Linear support vector machines Linear classification model Linear classification model Given a learning sample LS = {(x_k, y_k)}_{k=1}^N, where y_k ∈ {-1, +1} and x_k ∈ R^n, find a classifier of the form ŷ(x) = sgn(w^T x + b) which classifies the LS correctly, i.e. that minimizes Σ_{k=1}^N 1(y_k ≠ ŷ(x_k)). Several methods exist to find such a classifier: perceptron, linear discriminant analysis, naive Bayes... Wehenkel & Geurts AIA... (3/45)
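To make the notation concrete, here is a minimal NumPy sketch (the toy data and the weight vector are invented for illustration) of the linear classification model ŷ(x) = sgn(w^T x + b) and the 0-1 error it should minimise on the learning sample:

```python
import numpy as np

# Toy learning sample LS = {(x_k, y_k)}: x_k in R^2, y_k in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# An arbitrary linear classifier y_hat(x) = sgn(w^T x + b).
w = np.array([1.0, 1.0])
b = -0.5

y_hat = np.sign(X @ w + b)        # predictions on the learning sample
errors = int(np.sum(y_hat != y))  # number of misclassified points
print("misclassified:", errors, "out of", len(y))
```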

Linear support vector machines Maximal Margin Hyperplane Maximal margin hyperplane When the data is linearly separable in the feature space, the separating hyperplane is not unique. SVM maximizes the distance from the hyperplane to the nearest points in the LS, i.e. max_{w,b} min{ ‖x - x_k‖ : w^T x + b = 0, k = 1,...,N }. Wehenkel & Geurts AIA... (4/45)

Linear support vector machines Maximal Margin Hyperplane Why maximizing the margin? Intuitively, it feels safest. There exist theoretical bounds on the generalization error that depend on the margin, Err(TS) < O(1/γ) where γ is the margin, but these bounds are often loose. It works very well in practice. It yields a convex optimization problem whose solution can be written in terms of dot products only (but this is the case with other criteria as well). Wehenkel & Geurts AIA... (5/45)

Linear support vector machines Optimisation problem formulation Some geometry w is perpendicular to the line y(x) = w^T x + b = 0: y(x_A) = 0 = y(x_B) ⇒ w^T (x_A - x_B) = 0. Let x be such that y(x) = 0. The distance from the origin to the line is ‖x‖ cos(w,x) = ‖x‖ (w^T x)/(‖w‖‖x‖) = (w^T x)/‖w‖ = -b/‖w‖ (since w^T x = -b on the line). Wehenkel & Geurts AIA... (6/45)

Linear support vector machines Optimisation problem formulation Some geometry Any point x can be written as x = x_⊥ + r w/‖w‖, where x_⊥ is its orthogonal projection on the line and r is the signed distance from x to the line. Multiplying both sides by w^T and adding b, one gets w^T x + b = w^T x_⊥ + b + r (w^T w)/‖w‖ = 0 + r ‖w‖, hence r = y(x)/‖w‖. Wehenkel & Geurts AIA... (7/45)

Linear support vector machines Optimisation problem formulation Optimisation problem formulation The optimisation problem can be written arg max_{w,b} { (1/‖w‖) min_n [ y_n (w^T x_n + b) ] }. The solution is not unique, as the hyperplane is unchanged if we multiply w and b by a constant c > 0. To impose uniqueness, one typically rescales w and b so that y_k (w^T x_k + b) = 1 for the point x_k that is closest to the surface (a support vector). The problem is then equivalent to maximizing 1/‖w‖ (or minimizing ‖w‖) under the constraints y_k (w^T x_k + b) ≥ 1, k = 1,...,N. (Exercise: show that y_k (w^T x_k + b) = 1 for at least two points x_k at the solution.) Wehenkel & Geurts AIA... (8/45)

Linear support vector machines Optimisation problem formulation Optimisation problem formulation The SVM problem is equivalent to min_{w,b} E(w,b) = (1/2)‖w‖^2 subject to the N inequality constraints y_k (w^T x_k + b) ≥ 1, k = 1,...,N. NB: ‖w‖ is replaced by (1/2)‖w‖^2 for mathematical convenience. This is a quadratic programming problem. A solution exists only when the data is linearly separable. Wehenkel & Geurts AIA... (9/45)
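As a hedged illustration (not part of the original slides), this quadratic program can be written almost verbatim with the cvxpy modelling library; the toy data below are invented and assumed linearly separable:

```python
import cvxpy as cp
import numpy as np

# Invented, linearly separable toy data.
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

# Hard-margin primal: min (1/2)||w||^2  s.t.  y_k (w^T x_k + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```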

Linear support vector machines A brief introduction to constrained optimization A brief introduction to constrained optimisation Equality constrained optimisation problem: minimise (or maximise) f(x) ∈ R, with x ∈ R^n, subject to p equality constraints e_i(x) = 0, i = 1,...,p. Inequality constrained optimisation problem: minimise (or maximise) f(x) ∈ R, with x ∈ R^n, subject to p equality constraints e_i(x) = 0, i = 1,...,p, and r inequality constraints i_j(x) ≤ 0, j = 1,...,r. Feasibility: existence of at least one x satisfying all equality and inequality constraints. Wehenkel & Geurts AIA... (10/45)

Linear support vector machines A brief introduction to constrained optimization Pictorial view: equality constraints [Figure: contour lines f(x) = 1, 2, 3 and the constraint curve e(x) = 0, with the gradients ∇f and ∇e drawn at the constrained optimum.] At the optimum, ∇f(x) + α ∇e(x) = 0 and e(x) = 0. Wehenkel & Geurts AIA... (11/45)

Linear support vector machines A brief introduction to constrained optimization Pictorial view: active inequality constraint [Figure: contour lines f(x) = 1, 2, 3 and the constraint boundary between the regions i(x) < 0 and i(x) > 0, with the gradients ∇f and ∇i drawn at the constrained optimum.] At the optimum, ∇f(x) + α ∇i(x) = 0, i(x) = 0 and α > 0. Wehenkel & Geurts AIA... (12/45)

Linear support vector machines A brief introduction to constrained optimization Karush-Kuhn-Tucker complementarity conditions To minimise f(x) ∈ R, with x ∈ R^n, subject to equality constraints e_i(x) = 0, i = 1,...,p, and inequality constraints i_j(x) ≤ 0, j = 1,...,r, define the Lagrangian L(x,α,β) = f(x) + α^T e(x) + β^T i(x) (α ∈ R^p, β ∈ R^r). At the optimum, there must exist some α ∈ R^p and β ∈ R^r such that the following conditions hold: ∇_x L = 0, i.e. ∇_x f(x) + α^T ∇_x e(x) + β^T ∇_x i(x) = 0; ∇_α L = 0, i.e. e_i(x) = 0, i = 1,...,p; ∇_β L ≤ 0, i.e. i_j(x) ≤ 0, j = 1,...,r; and β_j i_j(x) = 0, β_j ≥ 0. Wehenkel & Geurts AIA... (13/45)
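As a tiny worked illustration of this Lagrangian machinery (an invented example, not from the slides): minimise f(x) = x_1^2 + x_2^2 subject to the single equality constraint e(x) = x_1 + x_2 - 1 = 0. Stationarity of L together with the constraint gives a small linear system:

```python
import numpy as np

# min f(x) = x1^2 + x2^2   s.t.   e(x) = x1 + x2 - 1 = 0
# Lagrangian: L(x, a) = x1^2 + x2^2 + a * (x1 + x2 - 1)
# Stationarity: 2*x1 + a = 0 and 2*x2 + a = 0; constraint: x1 + x2 = 1.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, a = np.linalg.solve(A, rhs)
print("x* =", (x1, x2), " multiplier =", a)   # x* = (0.5, 0.5), multiplier = -1
```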

Linear support vector machines A brief introduction to constrained optimization Convex optimization problems If the optimization problem is convex (i.e. f and the i_j, j = 1,...,r, are convex and the e_i, i = 1,...,p, are affine (linear) functions) and some conditions on the constraints are satisfied, we have (roughly): the KKT conditions are necessary and sufficient conditions for optimality; in some cases, an optimum x* may be obtained analytically from these conditions. Let W(α,β) = min_x L(x,α,β) and let α* and β* be the solution of the (Lagrange) dual problem max_{α,β} W(α,β) subject to β_j ≥ 0, j = 1,...,r; then the minimizer x* of L(x,α*,β*) is the solution of the original optimization problem. Wehenkel & Geurts AIA... (14/45)

Linear support vector machines Dual problem formulation Back to the optimisation problem formulation The SVM optimization problem is formulated as min_{w,b} E(w,b) = (1/2)‖w‖^2 subject to the N inequality constraints y_k (w^T x_k + b) ≥ 1, k = 1,...,N. Let α_k ≥ 0, k = 1,...,N, and let us construct the Lagrangian L(w,b,α) = (1/2)‖w‖^2 - Σ_{k=1}^N α_k (y_k (w^T x_k + b) - 1). This function must be minimized w.r.t. w and b, and maximized w.r.t. α. Wehenkel & Geurts AIA... (15/45)

Linear support vector machines Dual problem formulation Lagrangian equations By differentiating the Lagrangian L(w,b,α) = (1/2)‖w‖^2 - Σ_{k=1}^N α_k (y_k (w^T x_k + b) - 1) with respect to the primal variables w and b, we obtain the following optimality conditions: ∂L/∂w = 0 ⇒ w = Σ_{j=1}^N α_j y_j x_j, and ∂L/∂b = 0 ⇒ Σ_{k=1}^N α_k y_k = 0 (with α_k ≥ 0). Wehenkel & Geurts AIA... (16/45)

Linear support vector machines Dual problem formulation Dual problem Substituting the expressions w = Σ_{j=1}^N α_j y_j x_j and Σ_{k=1}^N α_k y_k = 0 into the Lagrangian L(w,b,α) = (1/2)‖w‖^2 - Σ_{k=1}^N α_k (y_k (w^T x_k + b) - 1) yields the dual (quadratic) maximisation problem: max_α W(α) = Σ_{k=1}^N α_k - (1/2) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^T x_j, subject to the N inequality constraints α_k ≥ 0, k = 1,...,N, and one equality constraint Σ_{i=1}^N α_i y_i = 0. Wehenkel & Geurts AIA... (17/45)
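The dual can also be handed to a generic solver. A minimal sketch (invented toy data, cvxpy again; note that (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j = (1/2)‖Σ_i α_i y_i x_i‖^2, which is used below to keep the objective in a standard convex form):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

alpha = cp.Variable(N)
# W(alpha) = sum_k alpha_k - (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

print("alpha =", np.round(alpha.value, 4))
```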

Linear support vector machines Support vectors Support vectors Back to the primal problem L(w,b,α) = (1/2)‖w‖^2 - Σ_{k=1}^N α_k (y_k (w^T x_k + b) - 1). According to the KKT complementarity conditions, the solution vector w is such that α_k (y_k (w^T x_k + b) - 1) = 0, k = 1,...,N: α_k = 0 if the constraint is satisfied as a strict inequality, y_k (w^T x_k + b) > 1, because this is the way to maximize L; α_k > 0 if the constraint is satisfied as an equality, y_k (w^T x_k + b) = 1, in which case x_k is a support vector. Wehenkel & Geurts AIA... (18/45)

Linear support vector machines Support vectors Final model Once the optimal values of α have been determined, the final model may be written as ŷ(x) = sgn(Σ_{i=1}^N y_i α_i x_i^T x + b), where the α_k values that are different from zero (i.e. strictly positive) correspond to the support vectors. NB: b is computed by exploiting the fact that for any α_k > 0 we necessarily have y_k (w^T x_k + b) - 1 = 0. Wehenkel & Geurts AIA... (19/45)
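Continuing the dual sketch above (it assumes alpha, X and y are still in scope; the 1e-6 threshold is an arbitrary numerical tolerance), w and b can be recovered and a new point classified:

```python
import numpy as np

# Requires alpha, X, y from the dual sketch above.
a = alpha.value
sv = a > 1e-6                     # support vectors: strictly positive multipliers
w = (a[sv] * y[sv]) @ X[sv]       # w = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)    # from y_k (w^T x_k + b) = 1 for alpha_k > 0

x_new = np.array([1.0, 0.5])
print("support vectors:", np.where(sv)[0])
print("prediction:", np.sign(w @ x_new + b))
```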

Linear support vector machines Support vectors Leave-one-out bound Having a small set of support vectors is computationally efficient, as only those vectors x_k, their weights α_k and their class labels y_k need to be stored to classify new examples: f(x) = sgn(Σ_{i: α_i > 0} y_i α_i x_i^T x + b). Moreover, if a non-support vector x is removed from the learning sample or moved around freely (outside the margin region), the solution is unchanged. The proportion of support vectors in the learning sample gives a bound on the leave-one-out error: Err_loo ≤ |{k : α_k > 0}| / N. Wehenkel & Geurts AIA... (20/45)

Linear support vector machines Soft margin SVM Soft margin Due to noise or outliers, the samples may not be linearly separable in the feature space. Discrepancies with respect to the margin are measured by slack variables ξ_i ≥ 0 with the associated relaxed constraints y_i (w^T x_i + b) ≥ 1 - ξ_i. By making ξ_i large enough, the constraints can always be met: if 0 < ξ_i < 1, the margin is not satisfied but x_i is still correctly classified; if ξ_i > 1, then x_i is misclassified. Wehenkel & Geurts AIA... (21/45)

Linear support vector machines Soft margin SVM 1-Norm soft margin optimization problem Primal problem: min_{w,b,ξ} (1/2)‖w‖^2 + C Σ_{i=1}^N ξ_i subject to y_i (w^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1,...,N, where C is a positive constant balancing the objectives of maximizing the margin and minimizing the margin error. Dual problem: max_α W(α) = Σ_{k=1}^N α_k - (1/2) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^T x_j subject to 0 ≤ α_k ≤ C, k = 1,...,N (box constraints), and Σ_{i=1}^N α_i y_i = 0. Wehenkel & Geurts AIA... (22/45)
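A hedged sketch of the soft-margin SVM in practice (invented noisy data; scikit-learn's SVC solves this dual with the box constraints internally), which also hints at the effect of C illustrated on the next slide: a small C tolerates margin violations and gives a wide margin, a large C penalises them and gives a narrow one.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping (non-separable) classes in R^2.
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<7} margin={margin:.3f}  support vectors={len(clf.support_)}")
```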

Linear support vector machines Soft margin SVM Effect of C Wehenkel & Geurts AIA... (23/45)

The Kernel trick Kernel based SVMs Dot products The quadratic optimization problem (hard and soft margin), max_α W(α) = Σ_{k=1}^N α_k - (1/2) Σ_{i,j=1}^N α_i α_j y_i y_j x_i^T x_j, and the decision function, ŷ(x) = sgn(Σ_{i=1}^N y_i α_i x_i^T x + b), make use of the learning samples x_i only through dot products. Moreover, the number of parameters to be estimated depends only on the number of training samples and is thus independent of the dimension of the input space (number of attributes). Wehenkel & Geurts AIA... (24/45)

The Kernel trick Kernel based SVMs Non-linear support vector machines The training samples may not be linearly separable in the input space (the above optimization problem then has no solution). Consider a non-linear mapping φ to a new feature space, e.g. φ(x) = [z_1, z_2, z_3]^T = [x_1^2, x_2^2, √2 x_1 x_2]^T. The dual problem becomes max_α W(α) = Σ_{k=1}^N α_k - (1/2) Σ_{i,j=1}^N α_i α_j y_i y_j φ(x_i)^T φ(x_j). Wehenkel & Geurts AIA... (25/45)

The Kernel trick Kernel based SVMs The kernel trick Instead of explicitly defining a mapping φ, we can directly specify the dot product φ(x)^T φ(x') and leave the mapping implicit. It is possible to characterise mathematically the functions K(x,x'), defined on pairs of objects, that correspond to the dot product φ(x)^T φ(x') for some implicit mapping φ (see next slides). Such a function K is called a (positive, or Mercer) kernel. It can be thought of as a similarity measure between x and x'. The kernel trick: any learning algorithm that uses the data only via dot products can rely on this implicit mapping by replacing x^T x' with K(x,x'). Wehenkel & Geurts AIA... (26/45)
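A small sketch of what this means in code (invented data; the kernels follow the standard textbook definitions): any algorithm written in terms of the Gram matrix of dot products can be moved to an implicit feature space simply by swapping the kernel function.

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                              # ordinary dot product x^T x'

def gaussian_kernel(x, xp, sigma=1.0):
    d = x - xp
    return np.exp(-(d @ d) / (2 * sigma**2))   # implicit (infinite-dim.) feature space

def gram_matrix(X, kernel):
    N = len(X)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(gram_matrix(X, linear_kernel))
print(gram_matrix(X, gaussian_kernel))
```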

The Kernel trick Kernel based SVMs Kernel-based SVM classification Assuming that we use a kernel K(x,x') corresponding to the vectorisation map φ(x), we obtain straightforwardly ŷ(x) = sgn(Σ_{k=1}^N y_k α_k φ(x_k)^T φ(x) + b) = sgn(Σ_{k=1}^N y_k α_k K(x,x_k) + b), where the α_k can be determined by solving the following quadratic maximisation problem: max_α W(α) = Σ_{k=1}^N α_k - (1/2) Σ_{i,j=1}^N α_i α_j y_i y_j K(x_i,x_j) subject to the N inequality constraints α_k ≥ 0, k = 1,...,N, and one equality constraint Σ_{i=1}^N α_i y_i = 0. Wehenkel & Geurts AIA... (27/45)
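As a sketch of how this is used in practice (not the course's own implementation): scikit-learn's SVC accepts a precomputed Gram matrix, which makes explicit that the data enters only through K(x_i, x_j):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

def rbf(A, B, sigma=1.0):
    # Gaussian kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

clf = SVC(kernel="precomputed", C=1.0).fit(rbf(X, X), y)

X_new = np.array([[1.5, 1.0], [-2.0, -1.0]])
K_new = rbf(X_new, X)       # kernel values between new points and training points
print(clf.predict(K_new))   # sgn( sum_k y_k alpha_k K(x, x_k) + b )
```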

The Kernel trick Formal definition of Kernel function Mathematical notion of Kernel (with values in R) Let U be a nonempty set of objects. A function K(·,·) : U × U → R such that, for all N ∈ ℕ and all o_1,...,o_N ∈ U, the N × N matrix K with K_{i,j} = K(o_i,o_j) is symmetric and positive (semi)definite, is called a positive kernel. Example: let φ(·) be a function defined on U with values in R^m (for some fixed m). Then K_φ(o,o') = φ(o)^T φ(o') is a positive kernel. NB: φ(o) is a vector representation of o. Wehenkel & Geurts AIA... (28/45)
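The definition can be checked numerically on a finite sample of objects (a sketch; eigenvalues may dip slightly below zero because of floating-point round-off):

```python
import numpy as np

def gaussian_kernel(o, op, sigma=1.0):
    d = o - op
    return np.exp(-(d @ d) / (2 * sigma**2))

rng = np.random.RandomState(2)
objects = rng.randn(10, 3)     # o_1, ..., o_N represented as vectors in R^3

K = np.array([[gaussian_kernel(o, op) for op in objects] for o in objects])

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to round-off
```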

The Kernel trick Formal definition of Kernel function Mathematical notion of Kernel (with values in R) The general result is as follows: for any positive kernel K defined on U, there exists a scalar product space V and a function φ(·) : U → V such that K(o,o') = φ(o) · φ(o'), where · denotes the scalar product in V. In general, the space V is not necessarily of finite dimension. The kernel defines a scalar product, hence a norm and a distance measure over U, which are inherited from V: d_U^2(o,o') = d_V^2(φ(o),φ(o')) = (φ(o) - φ(o')) · (φ(o) - φ(o')) = φ(o) · φ(o) + φ(o') · φ(o') - 2 φ(o) · φ(o') = K(o,o) + K(o',o') - 2 K(o,o'). Wehenkel & Geurts AIA... (29/45)

The Kernel trick Examples of Kernels Examples of Kernels A trivial kernel: the constant kernel K(o,o') ≡ 1. Linear kernel defined on numerical attributes: K(o,o') = a(o)^T a(o'). Hamming kernel for discrete attributes: K(o,o') = Σ_{i=1}^m δ_{a_i(o), a_i(o')} (δ is the Kronecker delta). Text kernel: the number of common substrings in o and o' (infinite dimension if the text size is not a priori bounded). Combinations of kernels: the sum of several (positive) kernels is also a (positive) kernel; the product of several kernels is also a kernel; polynomial kernels Σ_{i=0}^n a_i (K(x,x'))^i are kernels if ∀i: a_i ≥ 0. Wehenkel & Geurts AIA... (30/45)

The Kernel trick Examples of Kernels Examples of Kernels (for numerical attributes) Consider the kernel K(x,x') = (x^T x')^2. In dimension 2, writing x = (x_1, x_2) and x' = (x'_1, x'_2), we have x^T x' = x_1 x'_1 + x_2 x'_2, and thus (x^T x')^2 = (x_1 x'_1 + x_2 x'_2)^2 = x_1^2 x'_1^2 + 2 x_1 x_2 x'_1 x'_2 + x_2^2 x'_2^2. Hence, writing φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T, we have (x^T x')^2 = φ(x)^T φ(x'): K(x,x') corresponds to the scalar product in the space of degree-2 products of the original features. In general, K(x,x') = (x^T x')^d corresponds to the scalar product in the space of all degree-d product features. Another useful kernel is the Gaussian kernel: K(x,x') = exp(-(x - x')^T (x - x') / (2σ^2)). Wehenkel & Geurts AIA... (31/45)
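The degree-2 identity can be verified numerically with arbitrary vectors (a quick sketch):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

lhs = (x @ xp) ** 2        # kernel value K(x, x') = (x^T x')^2
rhs = phi(x) @ phi(xp)     # explicit dot product in the feature space
print(lhs, rhs)            # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
```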

The Kernel trick Examples of Kernels Effect of the kernel Wehenkel & Geurts AIA... (32/45)

The Kernel trick Kernel methods Kernel methods Modular approach: decouple the algorithm from the representation. Many algorithms can be kernelized: ridge regression, Fisher discriminant, PCA, k-means, etc. Kernels have been defined for many data types: time series, images, graphs, sequences. Main interests of kernel methods: we can work efficiently in very high dimensional (potentially infinite) spaces as soon as the inner product is cheap to compute; we can apply classical algorithms to general, non-vectorial data types (e.g. to build a linear model on graphs). Wehenkel & Geurts AIA... (33/45)

Kernel ridge regression Ridge regression problem Let us assume that we are given a learning set LS = {(x_k,y_k)}_{k=1}^N and have chosen a kernel K(x,x') defined on the input space, corresponding to a vectorisation mapping φ(x) ∈ R^n, i.e. φ(x)^T φ(x') = K(x,x'). Least squares ridge regression consists in building a regression model of the form ŷ(x) = w^T φ(x) + b by minimising, w.r.t. w ∈ R^n and b ∈ R, the error function (γ > 0): ERR(w,b) = (1/2) w^T w + (1/2) γ Σ_{k=1}^N (y_k - (w^T φ(x_k) + b))^2. Wehenkel & Geurts AIA... (34/45)

Kernel ridge regression Equality-constrained formulation of ridge regression We can restate the unconstrained minimisation problem min_{w,b} ERR(w,b) = (1/2) w^T w + (1/2) γ Σ_{k=1}^N (y_k - (w^T φ(x_k) + b))^2 as an (equality) constrained minimisation problem, namely min_{w,b,e} E(w,b,e) = (1/2) w^T w + (1/2) γ Σ_{k=1}^N e_k^2 subject to the N equality constraints y_k - (w^T φ(x_k) + b) = e_k, k = 1,...,N. Wehenkel & Geurts AIA... (35/45)

Kernel ridge regression Lagrangian formulation of kernel-based ridge regression The optimisation problem to be solved: min_{w,b,e} E(w,b,e) = (1/2) w^T w + (1/2) γ Σ_{k=1}^N e_k^2 subject to the N equality constraints y_k = w^T φ(x_k) + b + e_k, k = 1,...,N. This problem can be solved by defining the Lagrangian L(w,b,e;α) = (1/2) w^T w + (1/2) γ Σ_{k=1}^N e_k^2 - Σ_{k=1}^N α_k (w^T φ(x_k) + b + e_k - y_k) and stating the optimality conditions ∂L/∂w = 0; ∂L/∂b = 0; ∂L/∂e_k = 0; ∂L/∂α_k = 0. Wehenkel & Geurts AIA... (36/45)

Kernel ridge regression Lagrangian equations In other words, we have L(w,b,e;α) = (1/2)‖w‖^2 + (1/2) γ Σ_{k=1}^N e_k^2 - Σ_{k=1}^N α_k (w^T φ(x_k) + b + e_k - y_k), yielding the optimality conditions: ∂L/∂w = 0 ⇒ w = Σ_{j=1}^N α_j φ(x_j); ∂L/∂b = 0 ⇒ Σ_{k=1}^N α_k = 0; ∂L/∂e_k = 0 ⇒ α_k = γ e_k, k = 1,...,N; ∂L/∂α_k = 0 ⇒ w^T φ(x_k) + b + e_k - y_k = 0, k = 1,...,N. Wehenkel & Geurts AIA... (37/45)

Kernel ridge regression Lagrangian equations Solution: ∂L/∂w = 0 ⇒ w = Σ_{j=1}^N α_j φ(x_j); ∂L/∂b = 0 ⇒ Σ_{k=1}^N α_k = 0; ∂L/∂e_k = 0 ⇒ α_k = γ e_k, k = 1,...,N; ∂L/∂α_k = 0 ⇒ w^T φ(x_k) + b + e_k - y_k = 0, k = 1,...,N. Substituting w from the first equation into the last one, we can express e_k as e_k = y_k - b - Σ_{j=1}^N α_j φ(x_j)^T φ(x_k) = y_k - b - Σ_{j=1}^N α_j K(x_j,x_k), which allows us to eliminate e_k from the third equation. Wehenkel & Geurts AIA... (38/45)

Kernel ridge regression Matrix equation for the unknowns α and b We have to solve the following equations for α_1,...,α_N and b: Σ_{k=1}^N α_k = 0 and α_k = γ e_k, i.e. α_k = γ (y_k - b - Σ_{j=1}^N α_j K(x_j,x_k)); in other words, [ 0, 1^T ; 1, K + γ^{-1} I ] [ b ; α ] = [ 0 ; y ], where 1 = (1,...,1)^T, α = (α_1,...,α_N)^T, y = (y_1,...,y_N)^T and the matrix K is defined by K_{i,j} = K(x_i,x_j). Wehenkel & Geurts AIA... (39/45)
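A minimal sketch of assembling and solving this linear system (invented 1-D data with a Gaussian kernel; the values of γ and σ are arbitrary choices):

```python
import numpy as np

rng = np.random.RandomState(3)
X = np.linspace(0, 1, 20)[:, None]                     # N training inputs in R^1
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.randn(20)  # noisy targets
N, gamma, sigma = len(y), 10.0, 0.2

def K(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))                # Gaussian kernel matrix

# Assemble  [[0, 1^T], [1, K + (1/gamma) I]] [b; alpha] = [0; y]  and solve it.
A = np.zeros((N + 1, N + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = K(X, X) + np.eye(N) / gamma
rhs = np.concatenate([[0.0], y])

sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]

# Prediction: y_hat(x) = b + sum_i alpha_i K(x, x_i)
X_test = np.array([[0.25], [0.5]])
print(b + K(X_test, X) @ alpha)
```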

Kernel ridge regression Kernel based ridge regression: discussion The model ŷ(x) = b + Σ_{i=1}^N α_i K(x,x_i) is expressed purely in terms of the kernel function. The learning algorithm uses only the values of K(x_i,x_j) and y_k to determine α and b; we don't need to know φ(x). Learning requires on the order of N^3 operations, and prediction on the order of N operations. Refinements: sparse approximations, which prune away the training points that correspond to small values of α. Wehenkel & Geurts AIA... (40/45)

Conclusion Conclusion Support vector machines are based on two very smart ideas: margin maximization (to improve generalization) and the kernel trick (to handle very high dimensional or complex input spaces). SVMs are quite well motivated theoretically and are among the most successful techniques for classification, and there exist very efficient implementations that can handle very large problems. Drawbacks: they are essentially black-box models, and the choice of a good kernel is difficult and critical to reaching good accuracy. Wehenkel & Geurts AIA... (41/45)

Conclusion Conclusion Kernel methods are very useful tools in machine learning: several generic frameworks have been proposed to define kernels for various data types, many algorithms have been kernelized, and they are especially interesting in complex application domains like bioinformatics. Convex optimization is also a very useful tool in machine learning: many machine learning problems can be formulated as convex optimization problems, a unique solution often implies a more stable model, and we can benefit from the huge body of work on optimization algorithms and focus on designing good objective functions. Wehenkel & Geurts AIA... (42/45)

Conclusion Further reading Bishop: Chapter 7 (7.1 and 7.1.1), Appendix E Hastie et al.: 12.2, 12.3 (12.3.1) Ben-Hur, A., Soon Ong, C., Sonnenburg, S., Schölkopf, B., Rätsch, G. (2008). Support vector machines and kernels for computational biology. PLOS Computational biology 4(10). http://www.ploscompbiol.org/ Wehenkel & Geurts AIA... (43/45)

Conclusion References Schölkopf, B. and Smola, A. (2002). Learning with kernels: support vector machines, regularization, optimization and beyond. MIT Press, Cambridge, MA. Shawe-Taylor, J. and Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press. Vapnik, V. (2000). The nature of statistical learning theory. Springer. Boyd, S. and Vandenberghe, L (2004). Convex optimization. Cambridge University Press. Some slides borrowed from Pierre Dupont (UCL) Wehenkel & Geurts AIA... (44/45)

Conclusion Software LIBSVM & LIBLINEAR http://www.csie.ntu.edu.tw/~cjlin/libsvm SVM light http://svmlight.joachims.org/ SHOGUN http://www.shogun-toolbox.org/ Wehenkel & Geurts AIA... (45/45)