MACHINE LEARNING. Support Vector Machines. Alessandro Moschitti


MACHINE LEARNING Support Vector Machines Alessandro Moschitti Department of information and communication technology University of Trento Email: moschitti@dit.unitn.it

Summary Support Vector Machines Hard-margin SVMs Soft-margin SVMs

Communications No lecture tomorrow (nor on Dec. 8). ML exams: 12 January 2011 at 9:00 and 26 January 2011 at 9:00. Exercise in Lab A201 (Polo scientifico e tecnologico): Wednesday 15 and 22 December, 2011, time 8:30-10:30.

Which hyperplane should we choose?

Classifier with a Maximum Margin. IDEA 1: select the hyperplane with maximum margin. [Figure: two classes plotted on Var 1 vs. Var 2, with the margin shown on both sides of the separating hyperplane.]

Support Vectors. [Figure: the points lying on the margin boundaries are the support vectors.]

Support Vector Machine Classifiers. The margin is equal to $\frac{2k}{\|w\|}$. [Figure: the hyperplanes $w \cdot x + b = k$, $w \cdot x + b = -k$ and the separating hyperplane $w \cdot x + b = 0$ in the Var 1/Var 2 plane.]

Support Vector Machines. The margin is equal to $\frac{2k}{\|w\|}$. We need to solve: $\max \frac{2k}{\|w\|}$ subject to $w \cdot x + b \geq +k$ if $x$ is positive and $w \cdot x + b \leq -k$ if $x$ is negative.

Support Vector Machines. There is a scale for which $k = 1$. The problem transforms into: $\max \frac{2}{\|w\|}$ subject to $w \cdot x + b \geq +1$ if $x$ is positive and $w \cdot x + b \leq -1$ if $x$ is negative.

Final Formulation. $\max \frac{2}{\|w\|}$ subject to $w \cdot x_i + b \geq +1$ if $y_i = 1$ and $w \cdot x_i + b \leq -1$ if $y_i = -1$; equivalently, $\max \frac{2}{\|w\|}$ subject to $y_i(w \cdot x_i + b) \geq 1$; equivalently, $\min \frac{\|w\|}{2}$ subject to $y_i(w \cdot x_i + b) \geq 1$; equivalently, $\min \frac{\|w\|^2}{2}$ subject to $y_i(w \cdot x_i + b) \geq 1$.

Optimization Problem. Optimal Hyperplane: minimize $\tau(w) = \frac{1}{2}\|w\|^2$ subject to $y_i((w \cdot x_i) + b) \geq 1,\ i = 1, \dots, m$. The dual problem is simpler.

Lagrangian Definition
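The transcription does not carry the formula for this slide; the standard primal Lagrangian of the hard-margin problem, written in the notation of the surrounding slides (the $\alpha_i \geq 0$ are the usual Lagrange multipliers), is presumably:

$$ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i\,(w \cdot x_i + b) - 1 \right], \qquad \alpha_i \geq 0. $$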

Dual Optimization Problem

Dual Transformation. Given the Lagrangian associated with our problem, to solve the dual problem we need to evaluate the minimum of the Lagrangian with respect to $w$ and $b$. Let us set the derivatives to 0, with respect to $w$:

Dual Transformation (cont'd): and with respect to $b$. Then we substitute them into the Lagrange function.
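A sketch of the two stationarity conditions this step presumably yields (they agree with the constraints listed on the later slide about the properties coming from the constraints):

$$ \frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0. $$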

Final Dual Problem
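The formula itself is missing from the transcription; substituting the two conditions above into the Lagrangian gives the standard hard-margin dual (quoted here as a reference form, not taken from the slide):

$$ \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to} \quad \alpha_i \geq 0, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0. $$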

Kuhn-Tucker Theorem: necessary and sufficient conditions for optimality.

Properties coming from the constraints. Lagrange constraints: $\sum_{i=1}^{m} \alpha_i y_i = 0$ and $w = \sum_{i=1}^{m} \alpha_i y_i x_i$. Karush-Kuhn-Tucker constraints: $\alpha_i [y_i(x_i \cdot w + b) - 1] = 0,\ i = 1, \dots, m$. Support Vectors have a non-null $\alpha_i$. To evaluate $b$, we can apply the following equation:
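That equation is not in the transcription; from the KKT condition of any support vector $x_k$ (any example with $\alpha_k > 0$) it is presumably the standard one:

$$ y_k\,(w \cdot x_k + b) = 1 \;\Rightarrow\; b = y_k - w \cdot x_k. $$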

Warning! In the graphical examples we always consider normalized hyperplanes (hyperplanes with a normalized gradient). In this case $b$ is exactly the distance of the hyperplane from the origin. So if the equation is not normalized, e.g. $\vec{x} \cdot \vec{w}' + b = 0$ with $\vec{x} = (x, y)$ and $\vec{w}' = (1, 1)$, then $b$ is not the distance.

Warning! Let us consider a normalized gradient $\vec{w} = (1/\sqrt{2}, 1/\sqrt{2})$: $(x, y) \cdot (1/\sqrt{2}, 1/\sqrt{2}) + b = 0 \Rightarrow x/\sqrt{2} + y/\sqrt{2} = -b \Rightarrow y = -x - b\sqrt{2}$. Now we see that $-b$ is exactly the distance. For $x = 0$ we have the intersection at $-b\sqrt{2}$; this distance projected onto $\vec{w}$ is $-b$.

Soft Margin SVMs. Slack variables $\xi_i$ are added: some errors are allowed, but they should penalize the objective function. [Figure: the hyperplanes $w \cdot x + b = 1$, $w \cdot x + b = -1$ and $w \cdot x + b = 0$, with slacks $\xi_i$ for the examples violating the margin.]

Soft Margin SVMs. The new constraints are $y_i(w \cdot x_i + b) \geq 1 - \xi_i\ \forall x_i$, where $\xi_i \geq 0$. The objective function penalizes the incorrectly classified examples: $\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$. $C$ is the trade-off between the margin and the error.
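The same penalized objective can be written in unconstrained form with the hinge loss, $\min \frac{1}{2}\|w\|^2 + C\sum_i \max(0,\, 1 - y_i(w \cdot x_i + b))$. Below is a minimal NumPy sketch of sub-gradient descent on that form, meant only to illustrate the trade-off controlled by $C$; it is not the optimization procedure used in these slides, and the toy data, learning rate eta and number of epochs are invented for the example.

import numpy as np

def train_soft_margin_svm(X, y, C=1.0, eta=0.01, epochs=1000):
    # Sub-gradient descent on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                # examples with non-zero hinge loss (the xi_i > 0 cases)
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Tiny linearly separable toy problem, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_soft_margin_svm(X, y, C=10.0)
print("predictions:", np.sign(X @ w + b))

As the later slide on the role of $C$ notes, a very large $C$ drives the slacks to zero (hard-margin behaviour), while $C = 0$ collapses $w$ to zero.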

Dual formulation: by deriving with respect to $w$, $\xi$ and $b$.

Partial Derivatives

Substitution in the objective function (using the Kronecker delta $\delta_{ij}$).

Final dual optimization problem
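The formula is missing from the transcription; the standard soft-margin dual, which differs from the hard-margin one only by the box constraint on the $\alpha_i$, is presumably:

$$ \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0. $$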

Soft Margin Support Vector Machines. The algorithm tries to keep the $\xi_i$ low and to maximize the margin. NB: the number of errors is not directly minimized (that would be an NP-complete problem); the distances from the hyperplane are minimized. If $C \to \infty$, the solution tends to the one of the hard-margin algorithm. Attention: if $C = 0$ we get $w = 0$, since we then only minimize $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i = \frac{1}{2}\|w\|^2$ and the slacks are free to satisfy $y_i(w \cdot x_i + b) \geq 1 - \xi_i\ \forall x_i$, $\xi_i \geq 0$. If $C$ increases, the number of errors decreases; when $C$ tends to infinity the number of errors must be 0, i.e. we recover the hard-margin formulation.

Robustness of Soft vs. Hard Margin SVMs. [Figure: side-by-side plots of the hyperplane $w \cdot x + b = 0$ found by the Soft Margin SVM (with slack $\xi_i$) and by the Hard Margin SVM.]

Soft vs. Hard Margin SVMs. Soft-Margin always has a solution. Soft-Margin is more robust to odd examples. Hard-Margin does not require parameters.

Parameters. $\min \frac{1}{2}\|w\|^2 + C\sum_i \xi_i = \min \frac{1}{2}\|w\|^2 + C^{+}\sum_{i^{+}} \xi_{i^{+}} + C^{-}\sum_{i^{-}} \xi_{i^{-}} = \min \frac{1}{2}\|w\|^2 + C\left(J\sum_{i^{+}} \xi_{i^{+}} + \sum_{i^{-}} \xi_{i^{-}}\right)$. $C$: trade-off parameter; $J$: cost factor.

Theoretical Justification

Definition of Training Set Error. Training data: $(x_1, y_1), \dots, (x_m, y_m) \in \mathbb{R}^N \times \{\pm 1\}$ and a classifier $f: \mathbb{R}^N \to \{\pm 1\}$. Empirical Risk (error): $R_{emp}[f] = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\,|f(x_i) - y_i|$. Risk (error): $R[f] = \int \frac{1}{2}\,|f(x) - y|\, dP(x, y)$.

Error Characterization (part 1). From PAC-learning theory (Vapnik): $R(\alpha) \leq R_{emp}(\alpha) + \varphi\!\left(\frac{d}{m}, \frac{\log(\delta)}{m}\right)$, with $\varphi\!\left(\frac{d}{m}, \frac{\log(\delta)}{m}\right) = \sqrt{\frac{d\left(\log\frac{2m}{d} + 1\right) - \log\frac{\delta}{4}}{m}}$, where $d$ is the VC-dimension, $m$ is the number of examples, $\delta$ is a bound on the probability to get such error, and $\alpha$ is a classifier parameter.
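As a quick numerical illustration of the capacity term above, the following sketch evaluates $\varphi$ for a few values of $d$ and $m$; the chosen values of $d$, $m$ and $\delta$ are arbitrary, and the natural logarithm is assumed.

import math

def phi(d, m, delta):
    # Capacity term of the VC bound: sqrt((d * (ln(2m/d) + 1) - ln(delta/4)) / m).
    return math.sqrt((d * (math.log(2 * m / d) + 1) - math.log(delta / 4)) / m)

for d, m in [(10, 1000), (10, 100000), (100, 1000)]:
    print(f"d={d:4d}  m={m:7d}  phi={phi(d, m, delta=0.05):.3f}")

For a fixed VC-dimension $d$, the term shrinks as the number of examples $m$ grows; for fixed $m$ it grows with $d$, which is the sense in which a lower VC-dimension gives a tighter bound on the true risk.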

There are many versions of this result, with different bounds.

Error Characterization (part 2)

Ranking, Regression and Multiclassification

Sec. 15.4.2: The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]. The aim is to classify instance pairs as correctly ranked or incorrectly ranked. This turns an ordinal regression problem back into a binary classification problem. We want a ranking function $f$ such that $x_i > x_j$ iff $f(x_i) > f(x_j)$, or at least one that tries to do this with minimal error. Suppose that $f$ is a linear function: $f(x_i) = w \cdot x_i$.

The Ranking SVM (Sec. 15.4.2). Ranking model: $f$ maps each instance $x_i$ to the score $f(x_i)$.

The Ranking SVM (Sec. 15.4.2). Then (combining the two equations on the last slide): $x_i > x_j$ iff $w \cdot x_i - w \cdot x_j > 0$, i.e. $x_i > x_j$ iff $w \cdot (x_i - x_j) > 0$. Let us then create a new instance space from such pairs: $z_k = x_i - x_k$ and $y_k = +1$ or $-1$ depending on whether $x_i >$ or $< x_k$.

Support Vector Ranking. Given two examples, we build one example $(x_i, x_j)$.
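A minimal sketch of the pairwise construction described above: every ordered pair of items becomes one binary example $z_k = x_i - x_j$ with label $+1$ or $-1$. The toy feature vectors and ranks are invented for the example; the resulting $(Z, Y)$ set can be fed to any standard binary SVM.

import numpy as np

def build_pairwise_examples(X, ranks):
    # z = x_i - x_j,  y = +1 if item i should be ranked above item j, else -1.
    Z, Y = [], []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if ranks[i] == ranks[j]:
                continue                     # ties carry no ordering information
            Z.append(X[i] - X[j])
            Y.append(1 if ranks[i] > ranks[j] else -1)
    return np.array(Z), np.array(Y)

X = np.array([[1.0, 0.2], [0.4, 0.9], [0.1, 0.1]])   # toy feature vectors
ranks = [3, 2, 1]                                    # higher rank = should come first
Z, Y = build_pairwise_examples(X, ranks)
print(Z)
print(Y)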

Support Vector Regression (SVR). [Figure: the regression function $f(x)$ with the $\varepsilon$-insensitive tube between $+\varepsilon$ and $-\varepsilon$ around it; the solution and the constraints are given on the slide.]

Support Vector Regression (SVR). Minimize: $\frac{1}{2} w^T w + C\sum_{i=1}^{N} (\xi_i + \xi_i^{*})$. [Figure: points above the tube receive slack $\xi$, points below it receive slack $\xi^{*}$; the constraints are given on the slide.]
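The constraints are not in the transcription; the standard $\varepsilon$-insensitive constraints that accompany this objective are presumably:

$$ y_i - (w \cdot x_i + b) \leq \varepsilon + \xi_i, \qquad (w \cdot x_i + b) - y_i \leq \varepsilon + \xi_i^{*}, \qquad \xi_i, \xi_i^{*} \geq 0, \quad i = 1, \dots, N. $$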

Support Vector Regression. $y_i$ is no longer $-1$ or $1$: now it is a real value. $\varepsilon$ is the tolerance on our function value.

From Binary to Multiclass Classifiers. Three different approaches. ONE-vs-ALL (OVA): given the example sets {E1, E2, E3, ...} for the categories {C1, C2, C3, ...}, the binary classifiers {b1, b2, b3, ...} are built. For b1, E1 is the set of positives and E2 ∪ E3 ∪ ... is the set of negatives, and so on. For testing: given a classification instance x, the category is the one associated with the maximum margin among all binary classifiers.
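A minimal sketch of the OVA scheme, assuming the labels are a NumPy array of category names and assuming a binary trainer that returns a linear model $(w, b)$, for instance the soft-margin sketch given earlier (train_binary_svm here is a placeholder argument, not a function defined in the lecture):

import numpy as np

def train_one_vs_all(X, labels, categories, train_binary_svm):
    # One binary SVM per category: positives = that category, negatives = all the others.
    classifiers = {}
    for c in categories:
        y = np.where(labels == c, 1.0, -1.0)
        classifiers[c] = train_binary_svm(X, y)      # assumed to return (w, b)
    return classifiers

def predict_one_vs_all(x, classifiers):
    # The predicted category is the one whose classifier gives the maximum margin w.x + b.
    return max(classifiers, key=lambda c: float(x @ classifiers[c][0] + classifiers[c][1]))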

From Binary to Multiclass Classifiers. ALL-vs-ALL (AVA): given the example sets {E1, E2, E3, ...} for the categories {C1, C2, C3, ...}, build the binary classifiers {b1_2, b1_3, ..., b1_n, b2_3, b2_4, ..., b2_n, ..., bn-1_n} by learning on E1 (positives) and E2 (negatives), on E1 (positives) and E3 (negatives), and so on. For testing: given an example x, the votes of all classifiers are collected, where b_E1E2 = 1 means a vote for C1 and b_E1E2 = -1 means a vote for C2. Select the category that gets the most votes.
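A corresponding sketch of AVA voting, under the same assumptions as above (NumPy array of labels, pairwise trainer returning a linear model $(w, b)$):

from collections import Counter
from itertools import combinations
import numpy as np

def train_all_vs_all(X, labels, categories, train_binary_svm):
    # One binary SVM per pair of categories: ci as positives, cj as negatives.
    classifiers = {}
    for ci, cj in combinations(categories, 2):
        mask = (labels == ci) | (labels == cj)
        y = np.where(labels[mask] == ci, 1.0, -1.0)
        classifiers[(ci, cj)] = train_binary_svm(X[mask], y)   # assumed to return (w, b)
    return classifiers

def predict_all_vs_all(x, classifiers):
    # Each pairwise classifier votes for one of its two categories; the most voted category wins.
    votes = Counter()
    for (ci, cj), (w, b) in classifiers.items():
        votes[ci if x @ w + b > 0 else cj] += 1
    return votes.most_common(1)[0][0]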

From Binary to Multiclass Classifiers. Error Correcting Output Codes (ECOC): the training set is partitioned according to binary sequences (codes) associated with category sets. For example, 10101 indicates that the examples of C1, C3 and C5 are used to train the C_10101 classifier; the data of the other categories, i.e. C2 and C4, are the negative examples. In testing: the code classifiers are used to decode the original class, e.g. C_10101 = 1 and C_11010 = 1 indicate that the instance belongs to C1, that is, the only class consistent with both codes.

SVM-light: an implementation of SVMs. It implements the soft margin and contains the procedures for solving the optimization problems. It is a binary classifier. Examples and descriptions are on the web site: http://www.joachims.org/ (http://svmlight.joachims.org/).

References. A Tutorial on Support Vector Machines for Pattern Recognition, Chris Burges (downloadable article). The Vapnik-Chervonenkis Dimension and the Learning Capability of Neural Nets (downloadable presentation). Computational Learning Theory, Sally A. Goldman, Washington University, St. Louis, Missouri (downloadable article). An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods), N. Cristianini and J. Shawe-Taylor, Cambridge University Press (check our library). The Nature of Statistical Learning Theory, Vladimir Naumovich Vapnik, Springer Verlag, December 1999 (check our library).

Exercise. 1. The equations of SVMs for Classification, Ranking and Regression (you can get them from my slides). 2. The perceptron algorithm for Classification, Ranking and Regression (the last two you have to provide by looking at what you wrote in point (1)). 3. The same as point (2) by using kernels (write the kernel definition as an introduction to this section).