Support Vector Machines

Support Vector Machines. Konstantin Tretyakov (kt@ut.ee). MTAT.03.227 Machine Learning

So far: Supervised machine learning: linear models (least squares regression; Fisher's discriminant, Perceptron, logistic model), non-linear models (neural networks, decision trees, association rules). Unsupervised machine learning: clustering/EM, PCA. Generic scaffolding: probabilistic modeling, ML/MAP estimation; performance evaluation, statistical learning theory; linear algebra, optimization methods.

Coming up next: Supervised machine learning: linear models (least squares regression, SVM; Fisher's discriminant, Perceptron, logistic regression, SVM), non-linear models (neural networks, decision trees, association rules, SVM, Kernel-XXX). Unsupervised machine learning: clustering/EM, PCA, Kernel-XXX. Generic scaffolding: probabilistic modeling, ML/MAP estimation; performance evaluation, statistical learning theory; linear algebra, optimization methods, kernels.

First things first. SVM in R: library('e1071'); with y ∈ {-1, 1}: m = svm(x, y, kernel='linear'); predict(m, newx).
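A minimal runnable sketch of that snippet on a made-up two-class toy dataset (the toy data, the seed and the cost value are my own illustration choices, not from the lecture):

```r
# Train and apply a linear SVM with e1071 on a made-up 2-D toy dataset
library(e1071)

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = -1), ncol = 2),   # 20 points of class -1
           matrix(rnorm(40, mean = +1), ncol = 2))   # 20 points of class +1
y <- factor(c(rep(-1, 20), rep(+1, 20)))             # labels in {-1, +1}

m <- svm(x, y, kernel = "linear", cost = 1)          # cost is the C parameter
newx <- matrix(c(0.5, 0.5), ncol = 2)
predict(m, newx)                                     # predicted class of a new point
```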

Quiz. This line is called...? This vector is...? Those lines are...? f(x) = ? x_1 = ? y_1 = ? Functional margin of x_1? Geometric margin of x_1? Distance to the origin?

Quiz (answers). The line is the separating hyperplane; the vector is the normal w; the lines are isolines (level lines). f(x) = wᵀx + b; x_1 = (2, 6); y_1 = -1. Functional margin of x_1: y_1 f(x_1); geometric margin of x_1: y_1 f(x_1)/‖w‖; distance of the hyperplane to the origin: d = |b|/‖w‖.
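As a quick numeric illustration of those quantities (the figure's actual w and b are not given in the transcript, so the values below are made up):

```r
# Functional margin, geometric margin and distance to origin for one point;
# w and b are assumed example values, not the ones from the lecture figure
w <- c(1, -1); b <- 2
x1 <- c(2, 6); y1 <- -1

f <- function(x) sum(w * x) + b        # f(x) = w'x + b
y1 * f(x1)                             # functional margin of x1:  2
y1 * f(x1) / sqrt(sum(w^2))            # geometric margin of x1:   ~1.41
abs(b) / sqrt(sum(w^2))                # distance of hyperplane to origin: ~1.41
```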

Quiz. Suppose we scale w and b by some constant. Will it: affect the separating hyperplane? How? Affect the functional margins? How? Affect the geometric margins? How?

Quiz. Example: w → 2w, b = 0.

Quiz (answers). Suppose we scale w and b by some constant, say 2. Will it: Affect the separating hyperplane? No: wᵀx + b = 0 ⟺ 2wᵀx + 2b = 0. Affect the functional margins? Yes: (2wᵀx + 2b)y = 2(wᵀx + b)y. Affect the geometric margins? No: (2wᵀx + 2b)/‖2w‖ = (wᵀx + b)/‖w‖.

Which classifier is best?

Maximal margin classifier

Why maximal margin? Well-defined, single stable solution; noise-tolerant; small parameterization; (fairly) efficient algorithms exist for finding it.

Maximal margin: separable case. The classifier keeps all points outside the band between the isolines f(x) = 1 and f(x) = -1, i.e. f(x_i) y_i ≥ 1 for all i. The (geometric) distance from the separating hyperplane to the isoline f(x) = 1 is d = f(x)/‖w‖ = 1/‖w‖.

Maximal margin: separable case. Among all linear classifiers (w, b) which keep all points at a functional margin of 1 or more, we look for the one which has the largest distance d to the corresponding isolines, i.e. the largest geometric margin. As d = 1/‖w‖, this is equivalent to finding the classifier with minimal ‖w‖, which is equivalent to finding the classifier with minimal ‖w‖².

Compare. Generic linear classification (separable case): find (w, b) such that all points are classified correctly, i.e. f(x_i) y_i > 0. Maximal margin classification (separable case): find (w, b) such that all points are classified correctly with a fixed functional margin, i.e. f(x_i) y_i ≥ 1, and ‖w‖² is minimal.

Remember. SVM optimization problem (separable case): min_{w,b} ½‖w‖², such that (wᵀx_i + b) y_i ≥ 1 for all i.

General case ("soft margin"). The same, but we also penalize all margin violations. SVM optimization problem: min_{w,b} ½‖w‖² + C Σ_i ξ_i, where ξ_i = (1 − f(x_i) y_i)₊ and (z)₊ = max(0, z). Writing m_i = f(x_i) y_i for the margin of point i, the penalty term becomes C Σ_i hinge(m_i), with the hinge loss hinge(m) = (1 − m)₊.

Hinge loss: hinge(m) = (1 − m)₊.

Classification loss functions. Generic classification: min_{w,b} Σ_i [m_i < 0].

Classification loss functions. Perceptron: min_{w,b} Σ_i (−m_i)₊.

Classification loss functions. Least squares classification*: min_{w,b} Σ_i (m_i − 1)².

Classification loss functions. Boosting: min_{w,b} Σ_i exp(−m_i).

Classification loss functions. Logistic regression: min_{w,b} Σ_i log(1 + exp(−m_i)).

Classification loss functions. Regularized logistic regression: min_{w,b} Σ_i log(1 + exp(−m_i)) + λ·½‖w‖².

Classification loss functions. SVM: min_{w,b} Σ_i (1 − m_i)₊ + (1/2C)‖w‖².

Classification loss functions. L2-SVM: min_{w,b} Σ_i (1 − m_i)₊² + (1/2C)‖w‖².

Classification loss functions. L1-regularized L2-SVM: min_{w,b} Σ_i (1 − m_i)₊² + (1/2C)‖w‖₁, etc.

In general: min_{w,b} Σ_i φ(m_i) + λ Ω(w). (Model fit + model complexity.)
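The surrogate losses listed above, written out as a sketch in R as functions of the margin m = f(x)·y (the function names are mine; the formulas follow the slides):

```r
# Classification losses as functions of the margin m = f(x) * y
zero_one    <- function(m) as.numeric(m < 0)     # generic 0-1 loss
perceptron  <- function(m) pmax(-m, 0)           # perceptron
squared     <- function(m) (m - 1)^2             # least squares classification
exponential <- function(m) exp(-m)               # boosting
logistic    <- function(m) log(1 + exp(-m))      # logistic regression
hinge       <- function(m) pmax(1 - m, 0)        # SVM
sq_hinge    <- function(m) pmax(1 - m, 0)^2      # L2-SVM

# Quick visual comparison of a few of them
m <- seq(-2, 3, by = 0.01)
matplot(m, cbind(zero_one(m), hinge(m), logistic(m), exponential(m)),
        type = "l", lty = 1, xlab = "margin m", ylab = "loss")
```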

Compare to MAP estimation: max_{Model} log P(Data | Model) + log P(Model). (Likelihood + model prior.)

Solving the SVM. min_{w,b} ½‖w‖² + C Σ_i (1 − f(x_i) y_i)₊

Solving the SVM. Equivalently: min_{w,b,ξ} ½‖w‖² + C Σ_i ξ_i such that f(x_i) y_i − 1 + ξ_i ≥ 0 and ξ_i ≥ 0 for all i. A quadratic function with linear constraints!

Solving the SVM. This is a quadratic program: minimize ½xᵀQx + cᵀx subject to Ax ≥ b and Cx = d. In R: library(quadprog); solve.QP(Q, -c, A, b, meq) (the first meq constraints are treated as equalities).

Solving the SVM: Dual. The problem min_{w,b,ξ} ½‖w‖² + C Σ_i ξ_i, such that f(x_i) y_i − 1 + ξ_i ≥ 0 and ξ_i ≥ 0, is equivalent to the saddle-point problem: min_{w,b,ξ} max_{α≥0, β≥0} ½‖w‖² + C Σ_i ξ_i − Σ_i α_i (f(x_i) y_i − 1 + ξ_i) − Σ_i β_i ξ_i. Grouping the ξ terms gives ½‖w‖² + Σ_i ξ_i (C − α_i − β_i) − Σ_i α_i (f(x_i) y_i − 1); setting the derivative with respect to ξ_i to zero yields C − α_i − β_i = 0, which together with β_i ≥ 0 eliminates ξ and leaves the box constraint 0 ≤ α_i ≤ C.

Solving the SVM: Dual. We are left with: min_{w,b} max_{α} ½‖w‖² − Σ_i α_i (f(x_i) y_i − 1), with 0 ≤ α_i ≤ C. Sparsity: at the optimum, α_i can be non-zero only for points with f(x_i) y_i ≤ 1, i.e. points on or inside the margin. Now swap the min and the max (allowed here in particular because everything is nice and convex).

Solving the SVM: Dual. max_{α} min_{w,b} ½‖w‖² − Σ_i α_i (f(x_i) y_i − 1), 0 ≤ α_i ≤ C. Next solve the inner (unconstrained) min as usual: ∂/∂w = w − Σ_i α_i y_i x_i = 0, ∂/∂b = Σ_i α_i y_i = 0.

Solving the SVM: Dual. Express w and substitute: w = Σ_i α_i y_i x_i (the dual representation of w), with the constraint Σ_i α_i y_i = 0.

Solving the SVM: Dual. After the substitution: max_{α} Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_iᵀx_j, subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.

Solving the SVM: Dual. In matrix form: max_{α} 1ᵀα − ½ αᵀ(K ∘ Y)α, subject to 0 ≤ α_i ≤ C and yᵀα = 0, where K_ij = x_iᵀx_j, Y_ij = y_i y_j, and ∘ denotes the elementwise product.

Solving the SVM: Dual. Equivalently: min_{α} ½ αᵀ(K ∘ Y)α − 1ᵀα, subject to 0 ≤ α_i ≤ C and yᵀα = 0. Then find b from the condition f(x_i) y_i = 1, which holds for any i with 0 < α_i < C.
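A hedged sketch of solving this dual with the quadprog package mentioned earlier (solve.QP minimizes ½αᵀDα − dᵀα subject to Aᵀα ≥ b₀, with the first meq constraints treated as equalities; the small ridge added to K ∘ Y is my workaround to keep the matrix positive definite, which solve.QP requires):

```r
# Linear SVM via the dual QP; x is an n-by-d matrix, y a vector of labels in {-1, +1}
library(quadprog)

svm_dual <- function(x, y, C = 1, ridge = 1e-8) {
  n <- nrow(x)
  K <- x %*% t(x)                        # linear kernel: K_ij = x_i' x_j
  D <- K * (y %o% y) + ridge * diag(n)   # K elementwise Y, ridged to stay positive definite
  d <- rep(1, n)
  # Constraints: y'alpha = 0 (equality), alpha_i >= 0, -alpha_i >= -C
  A  <- cbind(y, diag(n), -diag(n))
  b0 <- c(0, rep(0, n), rep(-C, n))
  alpha <- solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0, meq = 1)$solution

  w  <- drop(t(x) %*% (alpha * y))                   # primal weights from dual variables
  sv <- which(alpha > 1e-6 & alpha < C - 1e-6)       # points on the margin
  b  <- mean(y[sv] - x[sv, , drop = FALSE] %*% w)    # from f(x_i) y_i = 1
  list(alpha = alpha, w = w, b = b)
}
```

On separable data this should recover essentially the same hyperplane as svm(x, y, kernel = 'linear'), up to e1071's internal scaling conventions.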

Support vectors

Support vectors. (Figure: the training points annotated with their α_i values, between 0 and C; Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C hold, and the points with non-zero α_i are the support vectors.)

Sparsity. The dual solution is often very sparse; this allows the optimization to be performed efficiently (working set approach).

Kernels. From f(x) = wᵀx + b and w = Σ_i α_i y_i x_i we get f(x) = Σ_i α_i y_i x_iᵀx + b = Σ_i α_i y_i K(x_i, x) + b, where K is the kernel function. Replacing the linear kernel K(x_i, x) = x_iᵀx with other kernels gives non-linear classifiers, e.g. f(x) = w_1 x + w_2 x² + b (quadratic features) or f(x) = Σ_i α_i y_i exp(−‖x_i − x‖²) + b (the RBF/Gaussian kernel).
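A small sketch of that last formula, i.e. prediction in the dual representation with an RBF kernel (rbf_kernel, decision_value and gamma are names I introduce for illustration; e1071's svm(..., kernel = 'radial') does this internally):

```r
# Dual-form prediction: f(x) = sum_i alpha_i y_i K(x_i, x) + b
rbf_kernel <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))

decision_value <- function(x_new, X, y, alpha, b, gamma = 1) {
  k <- apply(X, 1, function(xi) rbf_kernel(xi, x_new, gamma))   # K(x_i, x_new) for all i
  sum(alpha * y * k) + b
}
# sign(decision_value(...)) gives the predicted class in {-1, +1}
```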

Quiz. SVM is a linear classifier. Margin maximization can be achieved via minimization of ___. SVM uses ___ loss and ___ regularization. Besides hinge loss I also know ___ loss and ___ loss. SVM in both primal and dual form is solved using ___ programming.

Quiz. In the primal formulation we solve for the parameter vector ___. In the dual formulation we solve for ___ instead. The ___ form of SVM is typically sparse. Support vectors are those training points for which ___. The relation between primal and dual variables is: ___ = ___. A kernel is a generalization of the ___ product.