About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes


About this class
- Maximum margin classifiers
- SVMs: geometric derivation of the primal problem
- Statement of the dual problem
- The kernel trick
- SVMs as the solution to a regularization problem

Maximizing the Margin

[Figure: large and small margin separating hyperplanes]

Intuition: the large margin condition acts as a regularizer and should generalize better. The Support Vector Machine (SVM) makes this formal. Not only that, it is amenable to the kernel trick, which will allow us to get much greater representational power!

Deriving the SVM (Derivation based on Ryan Rifkin's slides in MIT 9.520 from Spring 2003)

Assume we classify a point x as sgn(w.x). Let x be a datapoint on the margin, and z the point on the separating hyperplane closest to x. We want to maximize ||x − z||.

For some k (assumed positive): w.x = k and w.z = 0, so w.(x − z) = k. Since x − z is parallel to w (both are perpendicular to the separating hyperplane), k = ||w|| ||x − z||, i.e. ||x − z|| = k / ||w||.

We can fix k = 1 (this is just a rescaling). So now, maximizing ||x − z|| is equivalent to minimizing ||w||. Now we have an optimization problem:

min_{w ∈ R^n} ||w||^2
subject to: y_i (w.x_i) ≥ 1, i = 1, ..., l

This can be solved using quadratic programming.
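
This primal problem can be handed to any quadratic programming routine. As a rough illustration (not from the lecture), here is a minimal sketch using NumPy and SciPy's general-purpose SLSQP solver on an invented, linearly separable toy set:

import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data; labels are +1/-1 and the hyperplane
# passes through the origin, matching the slide (no bias term yet).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# min_w ||w||^2  subject to  y_i (w . x_i) >= 1 for all i
margin_constraint = {"type": "ineq", "fun": lambda w: y * (X @ w) - 1.0}
res = minimize(lambda w: w @ w, x0=np.zeros(2), method="SLSQP",
               constraints=[margin_constraint])

w = res.x
print("w =", w)
print("margins y_i (w . x_i):", y * (X @ w))   # all >= 1 at the optimum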

When a Separating Hyperplane Does Not Exist

Typically we also use a bias term to shift the hyperplane around (so it doesn't have to pass through the origin). Now f(x) = sgn(w.x + b). We introduce slack variables ξ_i, and the new optimization problem becomes:

min_{w ∈ R^n, ξ ∈ R^l} C Σ_{i=1}^l ξ_i + (1/2) ||w||^2
subject to: y_i (w.x_i + b) ≥ 1 − ξ_i, i = 1, ..., l
            ξ_i ≥ 0, i = 1, ..., l

Now we are trading the error off against the margin. Think about this expression in terms of training set error and inductive bias!
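
In practice this soft-margin problem is what off-the-shelf SVM solvers implement. As a small sketch (assuming scikit-learn is installed, with invented overlapping data), SVC with a linear kernel solves essentially this C-weighted trade-off:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(20, 2)),     # two overlapping
               rng.normal(loc=[-2.0, -2.0], size=(20, 2))])  # toy classes
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0)   # C trades the slack (training error) off against the margin
clf.fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("number of support vectors:", len(clf.support_))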

The Dual Formulation

max_{α ∈ R^l} Σ_{i=1}^l α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i.x_j)
subject to: Σ_{i=1}^l y_i α_i = 0
            0 ≤ α_i ≤ C, i = 1, ..., l

This allows for a more efficient solution of the QP than we could get otherwise. The hypothesis is then:

f(x) = sgn(Σ_{i=1}^l α_i y_i (x.x_i))

Sparsity: it turns out that:
y_i f(x_i) > 1 implies α_i = 0
y_i f(x_i) < 1 implies α_i = C
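
As a hedged sketch of solving this dual directly (again with SciPy's SLSQP on made-up data), the box constraints 0 ≤ α_i ≤ C become simple bounds and the single equality constraint is Σ_i y_i α_i = 0:

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, l = 10.0, len(y)

K = X @ X.T                          # Gram matrix of dot products x_i . x_j
Q = (y[:, None] * y[None, :]) * K    # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):                 # minimize the negative of the dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

eq = {"type": "eq", "fun": lambda alpha: alpha @ y}   # sum_i y_i alpha_i = 0
res = minimize(neg_dual, x0=np.zeros(l), method="SLSQP",
               bounds=[(0.0, C)] * l, constraints=[eq])

alpha = res.x
w = (alpha * y) @ X                    # for the linear kernel, w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 3))   # most entries are 0: the sparsity noted above
print("w =", w)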

The Kernel Trick!2x x 2 The really nice thing: optimization depends only on the dot product between examples. An example from Russell & Norvig 3 2 0 - -2-3 0 2.5 2.5 0.5 2 x 2 2.5 x 0.5 2.5 Now F (x i ).F (x j ) = (x i.x j ) 2 x 2 0.5 0-0.5 - -.5 -.5 - -0.5 0 0.5.5 x Now suppose we go from representation: x =< x, x 2 > to representation: F (x) =< x 2, x2 2, 2x x 2 > We don t need to compute the actual feature representation in the higher dimensional space, because of Mercer s theorem. For a Mercer Kernel K, the dot product of F (x i ) and F (x j ) is given by K(x i, x j ). What is a Mercer kernel? Continuous, symmetric, and positive definite 6

Remember positive definiteness: for any m-size subset of the input space, the matrix K with K_ij = K(X_i, X_j) is positive definite, i.e. z^T K z > 0 for all non-zero vectors z.

This allows us to work with very high-dimensional spaces! Examples:

1. Polynomial: K(X_i, X_j) = (1 + x_i.x_j)^d (the feature space is exponential in d!)
2. Gaussian: K(X_i, X_j) = e^{−||x_i − x_j||^2 / (2σ^2)} (an infinite-dimensional feature space!)
3. String kernels, protein kernels!

How do we choose which kernel and which λ to use? (The first could be harder!)
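
As an illustration (not from the slides), the first two kernels above can be written down directly, and the positive (semi)definiteness of a Gram matrix built from them can be checked via its eigenvalues:

import numpy as np

def poly_kernel(xi, xj, d=3):
    return (1.0 + xi @ xj) ** d

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 2))   # five arbitrary points
for kernel in (poly_kernel, gaussian_kernel):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Eigenvalues of the Gram matrix should be >= 0 (up to numerical error)
    print(kernel.__name__, "min eigenvalue:", np.linalg.eigvalsh(K).min())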

Selecting the Best Hypothesis (Based on notes from Poggio, Mukherjee and Rifkin)

Define the performance of a hypothesis by a loss function V. Commonly used for regression: V(f(x), y) = (f(x) − y)^2. Could use the absolute value: V(f(x), y) = |f(x) − y|. What about classification? 0-1 loss: V(f(x), y) = I[y ≠ f(x)]. Hinge loss: V(f(x), y) = (1 − y.f(x))_+.

Expected error of a hypothesis: the expected error on a sample drawn from the underlying (unknown) distribution:

I[f] = ∫ V(f(x), y) dµ(x, y)

In discrete terms we would replace the integral with a sum and µ with P.

Empirical error, or empirical risk, is the average loss over the training set:

I_S[f] = (1/l) Σ_{i=1}^l V(f(x_i), y_i)

The hypothesis space H is the space of functions that we search. Empirical risk minimization: find the hypothesis in the hypothesis space that minimizes the empirical risk:

min_{f ∈ H} (1/n) Σ_{i=1}^n V(f(x_i), y_i)
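
A minimal sketch of these losses and of the empirical risk for a linear scorer f(x) = w.x (NumPy assumed; the small data set and weight vector are invented):

import numpy as np

def square_loss(fx, y):   return (fx - y) ** 2
def absolute_loss(fx, y): return np.abs(fx - y)
def zero_one_loss(fx, y): return (np.sign(fx) != y).astype(float)   # I[y != f(x)]
def hinge_loss(fx, y):    return np.maximum(0.0, 1.0 - y * fx)      # (1 - y.f(x))_+

def empirical_risk(w, X, y, loss):
    fx = X @ w                       # real-valued scores f(x_i) = w . x_i
    return np.mean(loss(fx, y))      # I_S[f] = (1/l) sum_i V(f(x_i), y_i)

X = np.array([[2.0, 1.0], [-1.0, -2.0], [0.5, 1.5]])
y = np.array([1.0, -1.0, -1.0])     # the third point is misclassified by w below
w = np.array([1.0, 1.0])
print(empirical_risk(w, X, y, hinge_loss),      # 1.0
      empirical_risk(w, X, y, zero_one_loss))   # 0.333...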

For most hypothesis spaces, ERM is an ill-posed problem. A problem is ill-posed if it is not well-posed, and a problem is well-posed if its solution exists, is unique, and depends continuously on the data. Regularization restores well-posedness: Ivanov regularization directly constrains the hypothesis space, and Tikhonov regularization imposes a penalty on hypothesis complexity.

Ivanov regularization:

min_{f ∈ H} (1/n) Σ_{i=1}^n V(f(x_i), y_i) subject to ω(f) ≤ τ

Tikhonov regularization:

min_{f ∈ H} (1/n) Σ_{i=1}^n V(f(x_i), y_i) + λ ω(f)

ω is the regularization or smoothness functional. The mathematical machinery for defining it is complex, and we won't get into it much more, but the interesting thing is that if we use the hinge loss and the linear kernel, the SVM comes out of solving the Tikhonov regularization problem!

What is the meaning of using an unregularized bias term? We punish function complexity but not an arbitrary translation of the origin. However, in the case of SVMs, the answer will end up being different if we add a fictional 1 to each example, because now we punish the weight we put on it!
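
As a sketch of the Tikhonov view with a linear function class, the following minimizes hinge loss plus λ||w||^2 (taking ω(f) = ||w||^2, the choice that recovers the linear SVM) by plain subgradient descent; the data, λ, step size, and iteration count are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(20, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
lam, step, n = 0.1, 0.05, len(y)

w = np.zeros(2)
for _ in range(500):
    margins = y * (X @ w)
    active = margins < 1.0           # examples with nonzero hinge loss
    # Subgradient of (1/n) sum_i (1 - y_i w.x_i)_+  +  lam ||w||^2
    grad = -(y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * w
    w -= step * grad

print("w =", w)
print("training 0-1 error:", np.mean(np.sign(X @ w) != y))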

Generalization Bounds

Important concepts of error:
1. Sample (estimation) error: the difference between the hypothesis we find in H and the best hypothesis in H
2. Approximation error: the difference between the best hypothesis in H and the true function in some other space T
3. Generalization error: the difference between the hypothesis we find in H and the true function in T, which is the sum of the two above

Tradeoff: making H bigger makes the approximation error smaller, but the estimation error larger.