Introduction to Support Vector Machines


Introduction to Support Vector Machines
Andreas Maletti
Technische Universität Dresden, Fakultät Informatik
June 15, 2006

1. The Problem
2. The Basics
3. The Proposed Solution

Learning by Machines
- Rote Learning: memorization (hash tables)
- Reinforcement: feedback at end (Q-learning [Watkins 89])
- Induction: generalizing examples (ID3 [Quinlan 79])
- Clustering: grouping data (CMLIB [Hartigan 75])
- Analogy: representation similarity (JUPA [Yvon 94])
- Discovery: unsupervised, no goal
- Genetic Algorithms: simulated evolution (GABIL [DeJong 93])

Supervised Learning
Definition. Supervised learning: given nontrivial training data (labels known), predict test data (labels unknown).
Implementations
- Clustering: Nearest Neighbor [Cover, Hart 67]
- Rote Learning: Hash tables
- Induction: Neural Networks [McCulloch, Pitts 43], Decision Trees [Hunt 66], SVMs [Vapnik et al 92]

Problem Description (General)
Problem: classify a given input
- binary classification: two classes
- multi-class classification: several, but finitely many classes
- regression: infinitely many classes
Major Applications
- Handwriting recognition
- Cheminformatics (Quantitative Structure-Activity Relationship)
- Pattern recognition
- Spam detection (HP Labs, Palo Alto)

Problem Description (Specific): Electricity Load Prediction Challenge 2001
- Power plant that supports the energy demand of a region
- Excess production is expensive
- Load varies substantially
- Challenge won by libsvm [Chang, Lin 06]
Problem
- given: load and temperature for 730 days (about 70 kB of data)
- predict: load for the next 365 days

[Figure: Example Data. Electricity load at 12:00 and 24:00 over the days of year 1997; load values roughly between 400 and 850.]

Problem Description: Formal Definition (cf. [Lin 01])
Given a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$ of correctly classified input data vectors $x \in \mathbb{R}^n$, where:
- every input data vector appears at most once in $S$
- there exist input data vectors $p$ and $n$ such that $(p, 1) \in S$ as well as $(n, -1) \in S$ (non-triviality)
successfully classify unseen input data vectors.

Linear Classification [Vapnik 63]
Given: a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$
Goal: find a hyperplane that separates $\mathbb{R}^n$ into two half-spaces, each containing only elements of one class.

Definition (Hyperplane)
$$n \cdot (x - x_0) = 0$$
where $n \in \mathbb{R}^n$ is the weight (normal) vector, $x \in \mathbb{R}^n$ an input vector, and $x_0 \in \mathbb{R}^n$ the offset. Alternatively: $w \cdot x + b = 0$.
Decision Function
For the training set $S = \{(x_i, y_i) \mid 1 \le i \le k\}$, a separating hyperplane $w \cdot x + b = 0$ for $S$ satisfies
$$w \cdot x_i + b \begin{cases} > 0 & \text{if } y_i = 1 \\ < 0 & \text{if } y_i = -1 \end{cases}$$
Decision: $f(x) = \operatorname{sgn}(w \cdot x + b)$
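As a toy illustration (not from the slides), the sketch below evaluates the linear decision function $f(x) = \operatorname{sgn}(w \cdot x + b)$; the weight vector, offset, and test point are made-up values.

```python
# Minimal sketch of the linear decision function f(x) = sgn(w·x + b).
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical offset
x = np.array([1.0, 0.5])    # hypothetical input vector

f_x = np.sign(w @ x + b)    # +1 or -1, depending on the side of the hyperplane
print(f_x)                  # 1.0 here, since 2*1 - 1*0.5 - 0.5 = 1 > 0
```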

Learn Hyperplane
Problem
Given: training set $S$
Goal: coefficients $w$ and $b$ of a separating hyperplane
Difficulty: several or no candidates for $w$ and $b$
Solution (cf. Vapnik's statistical learning theory)
Select admissible $w$ and $b$ with maximal margin (minimal distance to any input data vector).
Observation
We can scale $w$ and $b$ such that
$$w \cdot x_i + b \begin{cases} \ge 1 & \text{if } y_i = 1 \\ \le -1 & \text{if } y_i = -1 \end{cases}$$

Maximizing the Margin
Closest points $x_+$ and $x_-$ (with $w \cdot x_\pm + b = \pm 1$).
Distance between the hyperplanes $w \cdot x + b = \pm 1$:
$$\frac{(w \cdot x_+ + b) - (w \cdot x_- + b)}{\|w\|} = \frac{2}{\|w\|} = \frac{2}{\sqrt{w \cdot w}}$$
$$\max_{w,b} \frac{2}{\sqrt{w \cdot w}} \quad\Longleftrightarrow\quad \min_{w,b} \frac{w \cdot w}{2}$$
Basic (Primal) Support Vector Machine Form
target: $\min_{w,b} \frac{1}{2}(w \cdot w)$
subject to: $y_i(w \cdot x_i + b) \ge 1$ for $i = 1, \dots, k$

Non-separable Data
Problem: a linear separating hyperplane may not exist!
Solution: allow training errors $\xi_i$, penalized by a large penalty parameter $C$.
Standard (Primal) Support Vector Machine Form
target: $\min_{w,b,\xi} \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{k} \xi_i$
subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1, \dots, k$
If $\xi_i > 1$, then $x_i$ is misclassified.
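To make the soft-margin objective concrete, here is a rough Python sketch (not from the slides) that minimizes the primal using the standard observation that at the optimum each slack equals the hinge loss, $\xi_i = \max(0,\, 1 - y_i(w \cdot x_i + b))$. The toy data, the value of $C$, and the use of scipy.optimize.minimize are illustrative assumptions.

```python
# Rough sketch of the soft-margin primal via the hinge-loss reformulation.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -0.5]])  # toy inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # toy labels
C = 10.0                                                            # penalty parameter

def objective(params):
    w, b = params[:2], params[2]
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)    # xi_i at the optimum
    return 0.5 * w @ w + C * slack.sum()

res = minimize(objective, x0=np.zeros(3), method="Nelder-Mead")
w_opt, b_opt = res.x[:2], res.x[2]
print(w_opt, b_opt)
```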

Higher Dimensional Feature Spaces
Problem: data not separable because the target function is essentially nonlinear!
Approach: the data may be separable in a higher-dimensional space.
- Map input vectors nonlinearly into a high-dimensional space (feature space)
- Perform the separation there

Higher Dimensional Feature Spaces
Literature
- Classic approach [Cover 65]
- Kernel trick [Boser, Guyon, Vapnik 92]
- Extension to soft margin [Cortes, Vapnik 95]
Example (cf. [Lin 01]): mapping $\phi$ from $\mathbb{R}^3$ into the feature space $\mathbb{R}^{10}$:
$$\phi(x) = (1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \sqrt{2}x_2x_3)$$
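A quick numerical check (my addition, not from the slides): the map $\phi$ above satisfies $\phi(x) \cdot \phi(z) = (1 + x \cdot z)^2$, which is exactly the kernel introduced below; the test points are arbitrary.

```python
# Verify phi(x)·phi(z) = (1 + x·z)^2 for the R^3 -> R^10 mapping above.
import numpy as np

def phi(x):
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3,
                     x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(phi(x) @ phi(z))       # 30.25
print((1.0 + x @ z) ** 2)    # 30.25
```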

Adapted Standard Form
Definition: Standard (Primal) Support Vector Machine Form
target: $\min_{w,b,\xi} \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{k} \xi_i$
subject to: $y_i(w \cdot \phi(x_i) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1, \dots, k$
Note: $w$ is now a vector in the high-dimensional feature space.

How to Solve?
Problem: find $w$ and $b$ from the standard SVM form.
Solution: solve via the Lagrangian dual [Bazaraa et al 93]:
$$\max_{\alpha \ge 0,\ \pi \ge 0} \Big( \min_{w,b,\xi} L(w, b, \xi, \alpha, \pi) \Big)$$
where
$$L(w, b, \xi, \alpha, \pi) = \frac{w \cdot w}{2} + C \sum_{i=1}^{k} \xi_i + \sum_{i=1}^{k} \alpha_i \big(1 - \xi_i - y_i(w \cdot \phi(x_i) + b)\big) - \sum_{i=1}^{k} \pi_i \xi_i$$

Simplifying the Dual [Chen et al 03]
Standard (Dual) Support Vector Machine Form
target: $\min_{\alpha} \frac{1}{2}(\alpha^T Q \alpha) - \sum_{i=1}^{k} \alpha_i$
subject to: $y \cdot \alpha = 0$ and $0 \le \alpha_i \le C$ for $i = 1, \dots, k$
where $Q_{ij} = y_i y_j \big(\phi(x_i) \cdot \phi(x_j)\big)$.
Solution: we obtain $w$ as $w = \sum_{i=1}^{k} \alpha_i y_i \phi(x_i)$.
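As a small illustration of the dual's ingredients (my sketch, not from the slides), the code below assembles $Q_{ij} = y_i y_j K(x_i, x_j)$ for a toy data set and shows how $w$ would be recovered from a given $\alpha$; the data, the choice of a linear kernel (so $\phi$ is the identity), and the $\alpha$ values are placeholders.

```python
# Sketch: assembling Q and recovering w for a linear-kernel dual on toy data.
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                      # K_ij = x_i · x_j (linear kernel)
Q = np.outer(y, y) * K           # Q_ij = y_i y_j K(x_i, x_j)

alpha = np.array([0.2, 0.0, 0.2, 0.0])                   # hypothetical dual variables
dual_objective = 0.5 * alpha @ Q @ alpha - alpha.sum()   # value being minimized
w = (alpha * y) @ X              # w = sum_i alpha_i y_i phi(x_i), here phi(x) = x
print(Q.shape, dual_objective, w)
```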

Where is the Benefit?
- $\alpha \in \mathbb{R}^k$ (dimension independent of the feature space)
- Only inner products in the feature space are needed
Kernel Trick
Inner products are efficiently calculated on input vectors via a kernel $K$:
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$
- Select an appropriate feature space
- Avoid the explicit nonlinear transformation into the feature space
- Benefit from better separation properties in the feature space

Kernels
Example: the mapping into the feature space $\phi\colon \mathbb{R}^3 \to \mathbb{R}^{10}$, $\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \dots, \sqrt{2}x_2x_3)$, has the kernel
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = (1 + x_i \cdot x_j)^2.$$
Popular Kernels
- Gaussian Radial Basis Function (the feature space is an infinite-dimensional Hilbert space): $g(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
- Polynomial: $g(x_i, x_j) = (x_i \cdot x_j + 1)^d$
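Minimal numpy versions of the two kernels named above (my sketch); $\gamma$ and $d$ are free parameters whose values here are arbitrary examples.

```python
# RBF and polynomial kernels evaluated on two arbitrary input vectors.
import numpy as np

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def poly_kernel(xi, xj, d=2):
    return (xi @ xj + 1.0) ** d

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(rbf_kernel(xi, xj), poly_kernel(xi, xj))
```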

The Decision Function
Observation: no explicit $w$ is needed, because
$$f(x) = \operatorname{sgn}\big(w \cdot \phi(x) + b\big) = \operatorname{sgn}\Big(\sum_{i=1}^{k} \alpha_i y_i \big(\phi(x_i) \cdot \phi(x)\big) + b\Big)$$
- Uses only the $x_i$ with $\alpha_i > 0$ (the support vectors)
- Few points determine the separation; these are the borderline points
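The kernelized decision function can be evaluated directly from the support vectors. In the sketch below (my illustration, not from the slides) the support vectors, multipliers $\alpha_i$, offset $b$, and kernel are all placeholder values chosen only to show the computation.

```python
# Kernelized decision function f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b).
import numpy as np

def decision(x, support_X, support_y, alpha, b, kernel):
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, support_y, support_X))
    return np.sign(s + b)

kernel = lambda u, v: (u @ v + 1.0) ** 2            # polynomial kernel, d = 2
support_X = np.array([[1.0, 1.0], [-1.0, -1.0]])    # hypothetical support vectors
support_y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.3])                        # hypothetical multipliers
b = 0.0
print(decision(np.array([0.8, 0.9]), support_X, support_y, alpha, b, kernel))  # 1.0
```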

Support Vectors

Definition (Support Vector Machine)
Given: kernel $K$ and training set $S$
Goal: decision function $f$
target: $\min_{\alpha} \frac{1}{2}(\alpha^T Q \alpha) - \sum_{i=1}^{k} \alpha_i$
subject to: $y \cdot \alpha = 0$ and $0 \le \alpha_i \le C$ for $i = 1, \dots, k$
decide: $f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{k} \alpha_i y_i K(x_i, x) + b\Big)$
where $Q_{ij} = y_i y_j K(x_i, x_j)$.

Quadratic Programming
Suppose $Q$ ($k \times k$) is a fully dense matrix: 70,000 training points mean 70,000 variables, and $70{,}000^2 \cdot 4\,\text{B} \approx 19\,\text{GB}$: a huge problem.
- Traditional methods (Newton, quasi-Newton) cannot be directly applied
- Current methods: decomposition [Osuna et al 97], [Joachims 98], [Platt 98]; nearest point of two convex hulls [Keerthi et al 99]
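The memory estimate quoted above, spelled out (assuming 4-byte single-precision entries):

```python
# Back-of-the-envelope size of a dense 70,000 x 70,000 kernel matrix with 4-byte entries.
entries = 70_000 ** 2
gigabytes = entries * 4 / 1e9
print(gigabytes)   # about 19.6 GB
```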

www.kernel-machines.org
- Main forum on kernel machines
- Lists over 250 active researchers
- 43 competing implementations
Sample Implementation: libsvm [Chang, Lin 06]
- Supports binary and multi-class classification and regression
- Beginners' guide for SVM classification
- Out-of-the-box system (automatic data scaling, parameter selection)
- Won the EUNITE and IJCNN challenges

Application Accuracy (automatic training using libsvm)

Application     Training Data  Features  Classes  Accuracy
Astroparticle   3,089          4         2        96.9%
Bioinformatics  391            20        3        85.2%
Vehicle         1,243          21        2        87.8%

References: Books
- Statistical Learning Theory (Vapnik). Wiley, 1998.
- Advances in Kernel Methods: Support Vector Learning (Schölkopf, Burges, Smola). MIT Press, 1999.
- An Introduction to Support Vector Machines (Cristianini, Shawe-Taylor). Cambridge University Press, 2000.
- Support Vector Machines: Theory and Applications (Wang). Springer, 2005.

References: Seminal Papers
- A training algorithm for optimal margin classifiers (Boser, Guyon, Vapnik). COLT '92, ACM Press.
- Support-vector networks (Cortes, Vapnik). Machine Learning 20, 1995.
- Fast training of support vector machines using sequential minimal optimization (Platt). In Advances in Kernel Methods, MIT Press, 1999.
- Improvements to Platt's SMO algorithm for SVM classifier design (Keerthi, Shevade, Bhattacharyya, Murthy). Technical Report, 1999.

References: Recent Papers
- A tutorial on ν-Support Vector Machines (Chen, Lin, Schölkopf). 2003.
- Support Vector and Kernel Machines (Nello Cristianini). ICML, 2001.
- libsvm: A library for Support Vector Machines (Chang, Lin). System documentation, 2006.

Sequential Minimal Optimization [Platt 98]
- Commonly used to solve the standard SVM form
- Decomposition method with the smallest possible working set, $|B| = 2$
- Subproblem is solved analytically; no need for optimization software
- Contained flaws; modified version [Keerthi et al 99]
Karush-Kuhn-Tucker (KKT) conditions of the dual (with $E = (1, \dots, 1)^T$):
$$Q\alpha - E + b\,y - \lambda + \mu = 0$$
$$\lambda_i \alpha_i = 0, \quad \lambda \ge 0, \qquad \mu_i(C - \alpha_i) = 0, \quad \mu \ge 0$$

Computing b
The KKT conditions yield
$$(Q\alpha - E + b\,y)_i \begin{cases} \ge 0 & \text{if } \alpha_i < C \\ \le 0 & \text{if } \alpha_i > 0 \end{cases}$$
Let $F_i(\alpha) = \sum_{j=1}^{k} \alpha_j y_j K(x_i, x_j) - y_i$ and
$$I_0 = \{i \mid 0 < \alpha_i < C\} \qquad I_1 = \{i \mid y_i = 1,\ \alpha_i = 0\} \qquad I_2 = \{i \mid y_i = -1,\ \alpha_i = C\}$$
$$I_3 = \{i \mid y_i = 1,\ \alpha_i = C\} \qquad I_4 = \{i \mid y_i = -1,\ \alpha_i = 0\}$$
A case analysis on $y_i$ yields bounds on $b$:
$$\max\{F_i(\alpha) \mid i \in I_0 \cup I_3 \cup I_4\} \le b \le \min\{F_i(\alpha) \mid i \in I_0 \cup I_1 \cup I_2\}$$
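A sketch of the bookkeeping behind these bounds (my illustration, not libsvm code); $\alpha$, $y$, the kernel matrix $K$, and $C$ are made-up toy values.

```python
# Compute F_i and the bounds on b from the index sets I0..I4.
import numpy as np

def b_bounds(alpha, y, K, C):
    F = K @ (alpha * y) - y                       # F_i = sum_j alpha_j y_j K(x_i, x_j) - y_i
    I0 = (alpha > 0) & (alpha < C)
    I1 = (y == 1) & (alpha == 0)
    I2 = (y == -1) & (alpha == C)
    I3 = (y == 1) & (alpha == C)
    I4 = (y == -1) & (alpha == 0)
    lower = F[I0 | I3 | I4].max()                 # max over I0 ∪ I3 ∪ I4
    upper = F[I0 | I1 | I2].min()                 # min over I0 ∪ I1 ∪ I2
    return lower, upper

y = np.array([1.0, 1.0, -1.0, -1.0])              # toy labels
alpha = np.array([0.5, 0.0, 0.5, 0.0])            # toy dual variables, C = 1.0
K = np.eye(4)                                     # toy (valid, PSD) kernel matrix
print(b_bounds(alpha, y, K, C=1.0))               # (1.0, -1.0): lower > upper,
                                                  # so this alpha is not yet optimal
```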

Working Set Selection
Observation (see [Keerthi et al 99]): $\alpha$ is not an optimal solution iff
$$\max\{F_i(\alpha) \mid i \in I_0 \cup I_3 \cup I_4\} > \min\{F_i(\alpha) \mid i \in I_0 \cup I_1 \cup I_2\}$$
Approach: select the working set $B = \{i, j\}$ with
$$i \in \arg\max_m \{F_m(\alpha) \mid m \in I_0 \cup I_3 \cup I_4\}, \qquad j \in \arg\min_m \{F_m(\alpha) \mid m \in I_0 \cup I_1 \cup I_2\}$$
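A corresponding sketch of the selection rule (same made-up data conventions as in the previous sketch):

```python
# Pick the working set B = {i, j} as the most violating pair.
import numpy as np

def select_working_set(alpha, y, K, C):
    F = K @ (alpha * y) - y                                              # F_m(alpha)
    I0 = (alpha > 0) & (alpha < C)
    up  = I0 | ((y == 1) & (alpha == C)) | ((y == -1) & (alpha == 0))    # I0 ∪ I3 ∪ I4
    low = I0 | ((y == 1) & (alpha == 0)) | ((y == -1) & (alpha == C))    # I0 ∪ I1 ∪ I2
    i = np.flatnonzero(up)[np.argmax(F[up])]
    j = np.flatnonzero(low)[np.argmin(F[low])]
    return i, j

y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.0, 0.5, 0.0])
print(select_working_set(alpha, y, np.eye(4), C=1.0))   # (3, 1) for these toy values
```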

The Subproblem
Definition: let $B = \{i, j\}$ and $N = \{1, \dots, k\} \setminus B$, and write
$$\alpha_B = \begin{pmatrix} \alpha_i \\ \alpha_j \end{pmatrix}$$
with $\alpha_N$ the remaining components (similarly for matrices).
B-Subproblem
target: $\min_{\alpha_B} \frac{1}{2}\,\alpha_B^T Q_{BB}\, \alpha_B + \sum_{b \in B} \alpha_b \big(Q_{b,N}\, \alpha_N\big) - \sum_{b \in B} \alpha_b$
subject to: $y \cdot \alpha = 0$ and $0 \le \alpha_i, \alpha_j \le C$

Final Solution
Note that $y_i \alpha_i = -(y_N \cdot \alpha_N + y_j \alpha_j)$, so substitute $\alpha_i = -y_i(y_N \cdot \alpha_N + y_j \alpha_j)$ into the target.
- This yields a one-variable optimization problem
- It can be solved analytically (cf., e.g., [Lin 01])
Iterate (yielding a new $\alpha$) until
$$\max\{F_i(\alpha) \mid i \in I_0 \cup I_3 \cup I_4\} - \min\{F_i(\alpha) \mid i \in I_0 \cup I_1 \cup I_2\} \le \epsilon$$
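A sketch of the outer loop's stopping test, reusing the $F_i$ and index-set conventions from the previous sketches; $\epsilon$ and the data are arbitrary illustrative choices, and the analytic two-variable update itself is omitted.

```python
# SMO stopping criterion: stop once the maximal KKT violation is within eps.
import numpy as np

def optimality_gap(alpha, y, K, C):
    F = K @ (alpha * y) - y
    I0 = (alpha > 0) & (alpha < C)
    up  = I0 | ((y == 1) & (alpha == C)) | ((y == -1) & (alpha == 0))    # I0 ∪ I3 ∪ I4
    low = I0 | ((y == 1) & (alpha == 0)) | ((y == -1) & (alpha == C))    # I0 ∪ I1 ∪ I2
    return F[up].max() - F[low].min()

eps = 1e-3
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.0, 0.5, 0.0])
print(optimality_gap(alpha, y, np.eye(4), C=1.0) <= eps)   # False: keep iterating
```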