L5 Support Vector Classification

Outline
- Support Vector Machine: problem definition, geometrical picture, optimization problem
- Optimization Problem: hard margin, convexity, dual problem, soft margin problem

Classification

Data: pairs of observations (x_i, y_i) generated from some distribution P(x, y), e.g. (blood status, cancer), (credit transaction, fraud), (profile of jet engine, defect).

Task: estimate y given x at a new location. Modification: find a function f(x) that does the task.

So Many Solutions (figure)

One to rule them all... (figure)

Optimal Separating Hyperplane (figure)

Optimization Problem: Margin to Norm

Separation of the two sets is given by 2/\|w\|, so maximize that. Equivalently, minimize \frac{1}{2}\|w\|; equivalently, minimize \frac{1}{2}\|w\|^2.

Constraints: separation with margin, i.e.
  \langle w, x_i \rangle + b \ge 1   if y_i = 1
  \langle w, x_i \rangle + b \le -1  if y_i = -1
Equivalent constraint: y_i(\langle w, x_i \rangle + b) \ge 1

Optimization Problem: Mathematical Programming Setting

Combining the above requirements we obtain

  minimize   \frac{1}{2}\|w\|^2
  subject to y_i(\langle w, x_i \rangle + b) - 1 \ge 0  for all 1 \le i \le m

Properties
- The problem is convex.
- Hence it has a unique minimum.
- Efficient algorithms for solving it exist.
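To make the quadratic program concrete (this is not from the original slides), here is a minimal sketch that solves the hard-margin primal with the cvxpy modelling library; the toy data, variable names, and reliance on cvxpy's default QP solver are illustrative assumptions.

```python
# Minimal sketch (illustrative): hard-margin primal QP solved with cvxpy.
import numpy as np
import cvxpy as cp

# Toy linearly separable data: rows x_i in R^2, labels y_i in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i(<w, x_i> + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```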

Lagrange Function

Objective function: \frac{1}{2}\|w\|^2.
Constraints: c_i(w, b) := 1 - y_i(\langle w, x_i \rangle + b) \le 0.

Lagrange function:
  L(w, b, \alpha) = PrimalObjective + \sum_i \alpha_i c_i
                  = \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \alpha_i (1 - y_i(\langle w, x_i \rangle + b))

Saddle Point Condition: the derivatives of L with respect to w and b must vanish.

Saddle Point of z = x^2 - y^2 (figure)

Support Vector Machines

Optimization problem:
  minimize   \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{i=1}^m \alpha_i
  subject to \sum_{i=1}^m \alpha_i y_i = 0  and  \alpha_i \ge 0

Support vector expansion:
  w = \sum_i \alpha_i y_i x_i  and hence  f(x) = \sum_{i=1}^m \alpha_i y_i \langle x_i, x \rangle + b

Kuhn-Tucker conditions:
  \alpha_i (1 - y_i(\langle w, x_i \rangle + b)) = 0

Proof (optional)

Lagrange function:
  L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \alpha_i (1 - y_i(\langle w, x_i \rangle + b))

Saddle point conditions:
  \partial_w L(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0  \iff  w = \sum_{i=1}^m \alpha_i y_i x_i
  \partial_b L(w, b, \alpha) = -\sum_{i=1}^m \alpha_i y_i = 0  \iff  \sum_{i=1}^m \alpha_i y_i = 0

To obtain the dual optimization problem we substitute these values of w and b into L. Note that the dual variables \alpha_i carry the constraint \alpha_i \ge 0.

Proof (optional): Dual Optimization Problem

After substituting the terms for b and w, the Lagrange function becomes
  -\frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^m \alpha_i
  subject to \sum_{i=1}^m \alpha_i y_i = 0  and  \alpha_i \ge 0  for all 1 \le i \le m

Practical modification: we need to maximize the dual objective function. Rewrite it as
  minimize \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{i=1}^m \alpha_i
subject to the above constraints.

Support Vector Expansion

Solution in w:
  w = \sum_{i=1}^m \alpha_i y_i x_i
- w is given by a linear combination of training patterns x_i, independent of the dimensionality of x.
- w depends on the Lagrange multipliers \alpha_i.

Kuhn-Tucker conditions: at the optimal solution, Constraint \cdot Lagrange Multiplier = 0. In our context this means
  \alpha_i (1 - y_i(\langle w, x_i \rangle + b)) = 0.
Equivalently we have
  \alpha_i \ne 0  \implies  y_i(\langle w, x_i \rangle + b) = 1.
Only points at the decision boundary can contribute to the solution.
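As an illustration of the dual problem and the support vector expansion (again not from the slides), here is a hedged cvxpy sketch that solves the dual and then recovers w, b, and the support vectors from the Kuhn-Tucker conditions. The quadratic term is written as \frac{1}{2}\|X^\top(\alpha \circ y)\|^2, which equals \frac{1}{2}\|w\|^2 under the expansion above; the 1e-6 threshold for "nonzero" \alpha_i is an arbitrary choice.

```python
# Minimal sketch (illustrative): dual QP via cvxpy, then w, b and support vectors from KKT.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)

# (1/2) sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>  =  (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Minimize(0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
sv = a > 1e-6                     # alpha_i != 0  =>  support vector (KKT)
w = (a[sv] * y[sv]) @ X[sv]       # support vector expansion  w = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)    # from y_i(<w, x_i> + b) = 1 with y_i in {-1, +1}
print("support vectors:", np.where(sv)[0], "w =", w, "b =", b)
```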

Mini Summary
- Linear classification: many solutions, optimal separating hyperplane, optimization problem
- Support Vector Machines: quadratic problem, Lagrange function, dual problem
- Interpretation: dual variables and SVs, SV expansion, hard margin and infinite weights

Kernels

Nonlinearity via feature maps: replace x_i by \phi(x_i) in the optimization problem.

Equivalent optimization problem:
  minimize   \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^m \alpha_i
  subject to \sum_{i=1}^m \alpha_i y_i = 0  and  \alpha_i \ge 0

Decision function:
  w = \sum_{i=1}^m \alpha_i y_i \phi(x_i)  implies  f(x) = \langle w, \phi(x) \rangle + b = \sum_{i=1}^m \alpha_i y_i k(x_i, x) + b.
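A small sketch of how this kernelized decision function can be evaluated in practice, assuming \alpha and b have already been obtained from a dual solver; the Gaussian RBF kernel and the value of gamma are illustrative choices, not prescribed by the slides.

```python
# Minimal sketch: Gaussian RBF kernel and the kernelized decision function
# f(x) = sum_i alpha_i y_i k(x_i, x) + b.  alpha, b are assumed given by a dual solver.
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def decision_function(X_train, y_train, alpha, b, X_new, gamma=0.5):
    K = rbf_kernel(X_train, X_new, gamma)     # k(x_i, x) for every training/new pair
    return (alpha * y_train) @ K + b          # sum_i alpha_i y_i k(x_i, x) + b

# Example use (alpha, b from a dual solve): scores = decision_function(X, y, alpha, b, X_test)
```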

Examples and Problems

Advantage: works well when the data is noise free.
Problem: a single wrong observation can ruin everything, since we require y_i f(x_i) \ge 1 for all i.
Idea: limit the influence of individual observations by making the constraints less stringent (introduce slacks).

Support Vector Machines: Three Main Ideas
1. Define what an optimal hyperplane is (in a way that can be identified computationally efficiently): maximize the margin.
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications.
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space.

Non-Linearly Separable Data (figure: points inside the margin with slack \xi_i)

Introduce slack variables \xi_i. Allow some instances to fall within the margin, but penalize them.

Formulating the Optimization Problem

The constraint becomes:
  y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,  \xi_i \ge 0  for all x_i

The objective function penalizes misclassified instances and those within the margin:
  minimize \frac{1}{2}\|w\|^2 + C \sum_i \xi_i

C trades off margin width against misclassifications.

Linear, Soft-Margin SVMs

  minimize   \frac{1}{2}\|w\|^2 + C \sum_i \xi_i
  subject to y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,  \xi_i \ge 0  for all x_i

- The algorithm tries to keep \xi_i at zero while maximizing the margin.
- Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes.
- Other formulations use \xi_i^2 instead.
- As C \to \infty, we get closer to the hard-margin solution.
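For reference (not part of the slides), a minimal scikit-learn sketch of a soft-margin linear SVM; the synthetic data and C = 1.0 are illustrative, and SVC's C parameter plays the trade-off role described above.

```python
# Minimal sketch (illustrative): soft-margin linear SVM via scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)   # C trades off margin width against the slack sum_i xi_i
clf.fit(X, y)
print("number of support vectors:", len(clf.support_))
```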

Robustness of Soft vs Hard Margin SVMs (figures: soft-margin SVM and hard-margin SVM decision boundaries on the same data)

Soft vs Hard Margin SVM
- Soft-margin always has a solution.
- Soft-margin is more robust to outliers, giving smoother surfaces (in the non-linear case).
- Hard-margin does not require guessing the cost parameter (it requires no parameters at all).

Optimization Problem (Soft Margin)

Recall the hard-margin problem:
  minimize   \frac{1}{2}\|w\|^2
  subject to y_i(\langle w, x_i \rangle + b) - 1 \ge 0

Softening the constraints:
  minimize   \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i
  subject to y_i(\langle w, x_i \rangle + b) - 1 + \xi_i \ge 0  and  \xi_i \ge 0
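The softened primal can also be written down directly with slack variables. Below is a hedged cvxpy sketch under the same illustrative setup as the hard-margin example, with one extra point deliberately placed among the opposite class so that the data are no longer separable and some \xi_i become nonzero; the value C = 10 is arbitrary.

```python
# Minimal sketch (illustrative): soft-margin primal with explicit slack variables, via cvxpy.
import numpy as np
import cvxpy as cp

# Last point is labelled +1 but lies between the two negative points, so the data are not separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [-1.5, -1.25]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
m, d = X.shape
C = 10.0

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(m)

# minimize (1/2)||w||^2 + C * sum_i xi_i
# subject to y_i(<w, x_i> + b) >= 1 - xi_i  and  xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("slacks:", np.round(xi.value, 3))
```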

Linear SVM for C = 1, 2, 5, 10, 20, 50, 100 (two series of figures showing the fitted linear SVM as C increases)

Insights: Changing C
- For clean data, C doesn't matter much.
- For noisy data, a large C leads to a narrow margin (the SVM tries to do a good job of separating, even though it isn't possible).

Noisy data
- Clean data has few support vectors.
- Noisy data leads to data in the margins.
- More support vectors for noisy data.
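To see these insights numerically (an illustration, not from the slides), the sketch below fits linear SVMs for several values of C on a synthetic clean data set and on a copy with a fraction of labels flipped, and counts the support vectors; all data-generation parameters are arbitrary.

```python
# Minimal sketch (illustrative): sweep C on clean vs. noisy data and count support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_data(label_noise=0.0):
    X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)), rng.normal(1.5, 1.0, size=(100, 2))])
    y = np.array([-1] * 100 + [1] * 100)
    flip = rng.random(len(y)) < label_noise   # flip a fraction of labels to create "noisy" data
    y[flip] *= -1
    return X, y

for name, noise in [("clean", 0.0), ("noisy", 0.1)]:
    X, y = make_data(noise)
    for C in [1, 10, 100]:
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(f"{name} data, C={C}: {len(clf.support_)} support vectors")
```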

Dual Optimization Problem

Optimization problem:
  minimize   \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^m \alpha_i
  subject to \sum_{i=1}^m \alpha_i y_i = 0  and  C \ge \alpha_i \ge 0  for all 1 \le i \le m

Interpretation
- Almost the same optimization problem as before.
- Constraint on the weight of each \alpha_i (bounds the influence of each pattern).
- Efficient solvers exist (more about that tomorrow).

SV Classification Machine (figure)

Gaussian RBF kernel with C = 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8 (series of figures showing the fitted classifier as C increases)
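A short scikit-learn sketch in the spirit of this slide sequence (the data set, gamma, and the C grid are illustrative assumptions): a Gaussian RBF SVM fitted for increasing C, reporting training accuracy and the number of support vectors.

```python
# Minimal sketch (illustrative): Gaussian RBF SVM for increasing C.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for C in [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]:
    clf = SVC(kernel="rbf", C=C, gamma=1.0).fit(X, y)
    print(f"C={C}: train accuracy {clf.score(X, y):.2f}, "
          f"{len(clf.support_)} support vectors")
```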

Summary
- Support Vector Machine: problem definition, geometrical picture, optimization problem
- Optimization Problem: hard margin, convexity, dual problem, soft margin problem