Machine Learning And Applications: Supervised Learning-SVM

Raphaël Bournhonesque, École Normale Supérieure de Lyon, Lyon, France (raphael.bournhonesque@ens-lyon.fr)

1 Supervised vs unsupervised learning

Machine learning algorithms can be either supervised or unsupervised. Supervised learning uses already labeled data (the training set) to infer the labels of new data. The label can be a class (classification problem) or a real value (regression problem). Unsupervised learning is appropriate when we have no information on the class of the data.

2 Penalized empirical risk minimization

During the last course, we saw a method to minimize the structural risk:

- define nested function sets of increasing complexity
- minimize the empirical risk over each family
- choose the solution that gives the best generalization performance

This approach requires solving a constrained optimization problem, which is usually harder than solving its unconstrained equivalent. The problem is often equivalent to the following penalized problem:

    \min_f \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \Omega(f)

The first term is the empirical loss and favors a good fit to the data. The second term is a penalty on functions of high complexity, and thus favors regularity of f. To solve this problem, we need to define a good measure of complexity Ω, and then compare the generalization performance of the functions obtained for decreasing values of λ.

2.1 Loss functions

For regression problems, common loss functions include the l1 or l2 norm of y − f(x). The l1 norm has the advantage of being robust, i.e. less sensitive to large errors than the l2 norm. With the l2 norm, a single large mistake can be very costly.
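As an illustration of this robustness claim (the residual values below are made up, not from the course), a single large error dominates the l2 loss far more than the l1 loss:

```python
# A minimal sketch comparing the l1 and l2 regression losses; the residuals
# y - f(x) are illustrative values containing one large mistake.
import numpy as np

residuals = np.array([0.1, -0.2, 0.1, 5.0])   # one large error of 5
print(np.abs(residuals).sum())                # l1 loss:  5.4
print((residuals ** 2).sum())                 # l2 loss: 25.06, dominated by 5**2
```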

For classification problems, we can use the 0/1, the logistic or the hinge loss function. The 0/1 loss has a major drawback: it is non-convex, and optimization problems are simpler to solve when the function to minimize is convex. Indeed, general results of convex optimization guarantee that a minimum, if found, is a global solution. Moreover, methods based on convex objectives are simpler to analyze.

2.2 Penalties

Let us consider a linear function f(x) = Θ^T x, with Θ ∈ IR^p. A very common penalty is the ridge penalty:

    \Omega(\Theta) = \|\Theta\|_2^2

The ridge penalty is used in ridge regression (a regression algorithm, where it is combined with the l2 loss) and in Support Vector Machines (SVM) (a classification algorithm, where it is combined with the hinge loss). This penalty leads to functions that are regular, in the sense that two points x and x' that are close in Euclidean space have close evaluations, since by the Cauchy-Schwarz inequality:

    |\Theta^T x - \Theta^T x'| \le \|\Theta\|_2 \, \|x - x'\|_2

This property can limit overfitting and improve generalization performance: it makes the function behave in a similar manner on similar data.

3 Ridge regression

We now assume that the data can be explained by a linear model:

    y = \Theta^T x + \epsilon

where ɛ is a random noise with mean 0 and variance σ^2, y ∈ IR and Θ, x ∈ IR^p. We observe n realizations of this linear model, represented by the matrix X ∈ IR^{n×p} and the vector Y ∈ IR^n. Let us consider the estimator:

    \hat{\Theta} = \arg\min_{\Theta} \left( \|Y - X\Theta\|^2 + \lambda \|\Theta\|^2 \right)

We can show that:

    \hat{\Theta} = (X^T X + \lambda I)^{-1} X^T Y
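A minimal numerical check of this closed form (the synthetic data, the noise level and the value of λ below are illustrative, not from the notes):

```python
# Ridge estimator Theta_hat = (X^T X + lambda I)^{-1} X^T Y on synthetic data
# drawn from the linear model y = Theta^T x + eps; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
Theta = rng.normal(size=p)
X = rng.normal(size=(n, p))
Y = X @ Theta + 0.5 * rng.normal(size=n)      # eps with mean 0 and sigma = 0.5

lam = 1.0
# Solve the linear system rather than forming the inverse explicitly.
Theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
print(Theta_hat - Theta)                      # small estimation error
```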

It is also possible to compute the bias and the variance of Θ̂:

    E[\hat{\Theta} - \Theta] = -\lambda (X^T X + \lambda I)^{-1} \Theta

    Var[\hat{\Theta}] = \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1}

The magnitude of the bias increases with λ: as λ → ∞, Θ̂ shrinks to 0 and the bias tends to −Θ. If λ = 0 (unpenalized linear regression), the bias is null. If the ratio n/p is small (i.e. the number of parameters is large compared to the number of observations), the bias and the variance increase. In practice, for a given dataset, some values of λ give smaller risk than others; λ can be chosen by hold-out or cross-validation.

4 Fundamentals of constrained optimization

Constrained optimization methods are important to understand the SVM. Let us consider an equality- and inequality-constrained optimization problem over a variable x ∈ X:

    minimize    f(x)
    subject to  h_i(x) = 0,   i = 1, ..., m,
                g_j(x) \le 0, j = 1, ..., r,

making no assumption about f, g and h. Let f* be the optimal value of the objective under the constraints (f* = f(x*) if the global minimum is reached at x*). The Lagrangian of the constrained problem is the function L : X × IR^m × IR^r → IR defined by:

    L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i h_i(x) + \sum_{j=1}^{r} \mu_j g_j(x)

We can now define the Lagrange dual function q : IR^m × IR^r → IR:

    q(\lambda, \mu) = \inf_{x \in X} L(x, \lambda, \mu)

We can show that q is concave in (λ, µ), even if the original problem is not convex. Furthermore, for all λ ∈ IR^m and all µ ∈ IR^r with µ ≥ 0, we have:

    q(\lambda, \mu) \le f^*

We can finally define the Lagrange dual problem:

    maximize    q(\lambda, \mu)
    subject to  \mu \ge 0

If we let d* be the optimal value of the Lagrange dual problem, the weak duality inequality is:

    d^* \le f^*

The difference f* − d* is called the optimal duality gap. Strong duality holds when:

    d^* = f^*

Strong duality does not hold for general nonlinear problems, but it usually holds for convex problems. The conditions that ensure strong duality for convex problems are called constraint qualifications.
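To make these definitions concrete, here is a small worked example that is not in the original notes: minimize f(x) = x^2 over x ∈ IR subject to the single inequality constraint g(x) = 1 − x ≤ 0, whose optimum is f* = 1 at x* = 1.

```latex
% Worked duality example (illustrative, not from the notes):
% minimize x^2 subject to 1 - x <= 0.
\begin{align*}
  L(x, \mu) &= x^2 + \mu (1 - x), \qquad \mu \ge 0 \\
  q(\mu)    &= \inf_{x \in \mathbb{R}} L(x, \mu)
             = \mu - \frac{\mu^2}{4}
             \qquad \text{(infimum attained at } x = \mu/2\text{)} \\
  d^*       &= \max_{\mu \ge 0} q(\mu) = q(2) = 1 = f^*
\end{align*}
```

Every value q(µ) is a lower bound on f* = 1 (weak duality), and here the bound is tight (strong duality), as expected for a convex problem whose constraint is qualified (for instance, x = 2 is strictly feasible).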

5 Penalized vs constrained risk optimization

The penalized problem of Section 2 is closely related to the constrained problem that minimizes the empirical risk subject to Ω(f) ≤ C: under mild conditions, for every λ there is a value of C (and conversely) for which the two formulations have the same solution.

6 Support vector machines

The Support Vector Machine (SVM) is a popular linear classification algorithm. Let us consider a training set D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ IR^p and y_i ∈ {−1, 1}. It is assumed that the set of points is linearly separable, i.e. that there exists (w, b) ∈ IR^p × IR such that:

    w \cdot x_i + b > 0  \quad \text{if } y_i = 1
    w \cdot x_i + b < 0  \quad \text{if } y_i = -1

Many decision boundaries can split the training data into the two classes correctly, but which one offers the best generalization performance? Among the possible boundaries, the SVM selects the one with the largest margin between the two classes. The margin is defined as the distance between a planar decision surface separating the two classes and the training samples closest to that surface. The values of w and b are chosen such that the margin is maximal under the constraint that the points of the training set are correctly classified, i.e. that the decision function gives the correct value of y_i. The decision function gives the correct value of y_i if:

    y_i (w \cdot x_i + b) \ge 1

and we can show that the margin is then 2/||w||. Maximizing the margin is therefore equivalent to minimizing ||w||^2. Hence, finding the optimal hyperplane fulfilling these conditions can be written as a constrained optimization problem: find (w, b) minimizing ||w||^2 under the constraints

    y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, ..., n.

Using the constrained optimization methods studied earlier, we can compute the Lagrange dual function

    q(\alpha) = \inf_{w \in \mathbb{R}^p, \, b \in \mathbb{R}} L(w, b, \alpha)

and obtain:

    q(\alpha) =
    \begin{cases}
      \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j
        & \text{if } \sum_{i=1}^{n} \alpha_i y_i = 0 \\
      -\infty & \text{otherwise}
    \end{cases}

The problem becomes: maximize q(α) under the constraints α ≥ 0. This is a quadratic program on IR^n, which can be solved efficiently with dedicated optimization software. Once the optimal α (denoted α*) is found, we can recover the (w*, b*) corresponding to the optimal hyperplane separating the two classes:

    w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i

and the decision function is therefore:

    f^*(x) = w^* \cdot x + b^*
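The following sketch solves this dual quadratic program numerically and recovers (w*, b*); the toy data, the variable names and the choice of SciPy's SLSQP solver are illustrative assumptions, not part of the notes:

```python
# A minimal hard-margin SVM sketch: solve the dual QP over alpha with SciPy,
# then recover w* = sum_i alpha_i y_i x_i and b* from a support vector.
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data in IR^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# K_ij = y_i y_j (x_i . x_j)
K = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # q(alpha) = sum_i alpha_i - 1/2 alpha^T K alpha; we minimize its negative.
    return 0.5 * alpha @ K @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

w = (alpha * y) @ X            # w* = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))     # index of a support vector (alpha_i > 0)
b = y[sv] - w @ X[sv]          # on the margin, y_i (w . x_i + b) = 1
print("w* =", w, " b* =", b)
print("signs:", np.sign(X @ w + b), "labels:", y)
```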

This SVM is called hard-margin, because the data is assumed to be linearly separable. If it is not, the constraints cannot all be satisfied and the problem has no solution. The separation constraints then need to be relaxed by introducing slack variables ξ_i ≥ 0 (soft-margin SVM):

    y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, ..., n.
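As a practical illustration (not part of the original notes), the soft-margin formulation is what standard libraries implement; below is a minimal scikit-learn sketch, where the hyperparameter C controls how strongly the slack variables ξ_i are penalized:

```python
# A minimal soft-margin linear SVM sketch with scikit-learn (assumed available);
# the data and the value of C are illustrative. Smaller C tolerates larger slacks.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [-2.5, -1.5]])
y = np.array([1, 1, -1, -1, 1])     # the last +1 point lies between two -1 points,
                                    # so no separating hyperplane exists

clf = SVC(kernel="linear", C=1.0)   # linear soft-margin SVM
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # learned w and b
print(clf.support_)                 # indices of the support vectors
```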