Support Vector Machines

Le Song. Machine Learning I, CSE 6740, Fall 2013.

Naïve Bayes classifier
Still use the Bayes decision rule for classification: P(y | x) = P(x | y) P(y) / P(x). But assume the class-conditional distribution is fully factorized: p(x | y = 1) = ∏_{i=1}^{d} p(x_i | y = 1). In other words, the variables corresponding to each dimension of the data are independent given the label.
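As a minimal illustration of this factorization (my sketch, not from the slides), assume Gaussian per-feature densities with per-class means mu, variances var, and a class prior; the slides do not fix a particular form for p(x_i | y):

import numpy as np

def gaussian_nb_log_posterior(x, mu, var, log_prior):
    # Factorized class-conditional: log p(x | y) = sum_i log N(x_i; mu_i, var_i)
    log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    # Bayes rule up to the constant log p(x): log p(y | x) = log p(x | y) + log p(y) - log p(x)
    return log_likelihood + log_prior

Evaluating this score for each class and taking the argmax implements the Bayes decision rule without ever computing p(x).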

Discriminative classifier
Directly estimate the decision boundary h(x) = ln( q_i(x) / q_j(x) ) or the posterior distribution p(y | x). Examples: logistic regression, support vector machines, neural networks. We do not estimate p(x | y) and p(y); h(x) or f(x) := p(y = 1 | x) is a function of x only, with no probabilistic generative model for x, so it cannot be used to sample data points. Why a discriminative classifier? It avoids the difficult density estimation problem and empirically achieves better classification results.

What is the logistic regression model?
Assume that the posterior distribution p(y = 1 | x) takes a particular form:
p(y = 1 | x, θ) = 1 / (1 + exp(−θ^T x)).
This is the logistic function f(u) = 1 / (1 + exp(−u)) applied to u = θ^T x.

Learning parameters in logistic regression
Find θ such that the conditional likelihood of the labels is maximized:
max_θ l(θ) := ∑_{i=1}^{m} log P(y^i | x^i, θ).
Good news: l(θ) is a concave function of θ, so there is a single global optimum. Bad news: there is no closed-form solution, so we resort to numerical methods.

The objective function l(θ)
The logistic regression model is p(y = 1 | x, θ) = 1 / (1 + exp(−θ^T x)). Note that
p(y = 0 | x, θ) = 1 − 1 / (1 + exp(−θ^T x)) = exp(−θ^T x) / (1 + exp(−θ^T x)).
Plugging in,
l(θ) := ∑_{i=1}^{m} log P(y^i | x^i, θ) = ∑_i [ (y^i − 1) θ^T x^i − log(1 + exp(−θ^T x^i)) ].
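A small numpy sketch of this objective (an illustration, not part of the slides); X is an m × d data matrix and y a vector of 0/1 labels:

import numpy as np

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ (y_i - 1) * theta^T x_i - log(1 + exp(-theta^T x_i)) ]
    scores = X @ theta
    return np.sum((y - 1) * scores - np.log1p(np.exp(-scores)))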

The gradient of l(θ)
From l(θ) = ∑_i [ (y^i − 1) θ^T x^i − log(1 + exp(−θ^T x^i)) ], the gradient is
∂l(θ)/∂θ = ∑_i [ (y^i − 1) x^i + x^i exp(−θ^T x^i) / (1 + exp(−θ^T x^i)) ].
Setting it to 0 does not lead to a closed-form solution.

Gradient ascent/descent algorithm
Initialize the parameter θ^0. Repeat
θ^{t+1} ← θ^t + η ∑_i [ (y^i − 1) x^i + x^i exp(−(θ^t)^T x^i) / (1 + exp(−(θ^t)^T x^i)) ]
while ||θ^{t+1} − θ^t|| > ε.
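A minimal sketch of this loop in numpy (my illustration, not the course's code); eta, eps, and max_iter are hypothetical knobs:

import numpy as np

def fit_logistic_regression(X, y, eta=0.1, eps=1e-6, max_iter=10000):
    # Gradient ascent on l(theta); X is m x d, y holds 0/1 labels.
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        scores = X @ theta
        # per-example coefficient: (y_i - 1) + exp(-theta^T x_i) / (1 + exp(-theta^T x_i))
        coef = (y - 1) + np.exp(-scores) / (1 + np.exp(-scores))
        theta_new = theta + eta * (X.T @ coef)        # gradient = sum_i coef_i * x_i
        if np.linalg.norm(theta_new - theta) <= eps:  # stop when the update is tiny
            return theta_new
        theta = theta_new
    return theta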

Naïve Bayes vs. logistic regression
Consider binary y and x ∈ R^n. Number of parameters: naïve Bayes has 2n + 1 (n means, n variances, and 1 for the prior); logistic regression has n + 1 (θ_0, θ_1, θ_2, ..., θ_n).

Naïve Bayes vs. logistic regression II
Asymptotic comparison (number of training examples → infinity). When the model assumptions are correct, naïve Bayes and logistic regression produce identical classifiers. When the model assumptions are incorrect, logistic regression is less biased (it does not assume conditional independence) and has fewer parameters, and is therefore expected to outperform naïve Bayes.

Naïve Bayes vs. logistic regression III
Estimation method: naïve Bayes parameter estimates are decoupled (super easy); logistic regression parameter estimates are coupled (less easy). How do we estimate the parameters in logistic regression? By maximum likelihood estimation; more specifically, by maximizing the conditional likelihood of the labels.

Experiments with handwritten digits (figure slide).

Which decision boundary is better?
Suppose the training samples are linearly separable; then we can find a decision boundary which gives zero training error. But there are many such decision boundaries. Which one is better? (Figure: two classes, Class 1 and Class 2, with several zero-error boundaries drawn between them.)

Compare two decision boundaries
Suppose we perturb the data: which boundary is more susceptible to error?

Geometric interpretation of a classifier
Parameterize the decision boundary as w^T x + b = 0, where w is a vector orthogonal to the decision boundary and b is a scalar offset term. The dashed lines w^T x + b = c and w^T x + b = −c are parallel to the decision boundary and just touch the nearest data points of each class.

Constraints on data points
For all x in class 2, y = 1 and w^T x + b ≥ c. For all x in class 1, y = −1 and w^T x + b ≤ −c. Or, more compactly, y (w^T x + b) ≥ c.

Classifier margin
Pick two data points x^1 and x^2 lying on each dashed line respectively, so that w^T x^1 + b = c and w^T x^2 + b = −c. Subtracting the two equations and projecting onto the unit normal w / ||w|| gives the margin
γ = (1 / ||w||) w^T (x^1 − x^2) = 2c / ||w||.

Maximum margin classifier
Find the decision boundary (w, b) as far from the data points as possible:
max_{w,b} 2c / ||w||   s.t.   y^i (w^T x^i + b) ≥ c, ∀i.

Equivalent form
max_{w,b} 2c / ||w||   s.t.   y^i (w^T x^i + b) ≥ c, ∀i.
Note that the magnitude of c merely scales w and b, and does not change the relative goodness of different classifiers. Set c = 1 (and drop the factor 2) to get a cleaner problem:
max_{w,b} 1 / ||w||   s.t.   y^i (w^T x^i + b) ≥ 1, ∀i.

Support vector machine
A constrained convex quadratic programming problem:
min_{w,b} ||w||^2   s.t.   y^i (w^T x^i + b) ≥ 1, ∀i.
After optimization, the margin is given by 2 / ||w||. Only a few of the constraints are relevant: the support vectors. Kernel methods are introduced for nonlinear classification problems.
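As an illustration only (the slides do not prescribe a solver), this primal problem can be handed to a generic convex optimizer; a sketch using the cvxpy library, assuming X is an m × d array and y holds ±1 labels:

import cvxpy as cp

def hard_margin_svm_primal(X, y):
    # min ||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i
    m, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value

If the two classes are not linearly separable, this hard-margin problem is infeasible; that case is outside the scope of these slides.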

Lagrangian duality
The primal problem:
min_w f(w)   s.t.   g_i(w) ≤ 0, i = 1, ..., k;   h_i(w) = 0, i = 1, ..., l.
The Lagrangian function:
L(w, α, β) = f(w) + ∑_{i=1}^{k} α_i g_i(w) + ∑_{i=1}^{l} β_i h_i(w),
where the α_i ≥ 0 and β_i are called the Lagrangian multipliers.

The KKT conditions
If there exists a saddle point of L, then the saddle point satisfies the following Karush-Kuhn-Tucker (KKT) conditions:
∂L/∂w = 0 and ∂L/∂β = 0 (stationarity);
g_i(w) ≤ 0 and h_i(w) = 0 (primal feasibility);
α_i ≥ 0 (dual feasibility);
α_i g_i(w) = 0 (complementary slackness).

Dual problem of support vector machines
Start from the primal:
min_{w,b} ||w||^2   s.t.   y^i (w^T x^i + b) ≥ 1, ∀i.
Convert it to standard form:
min_{w,b} (1/2) w^T w   s.t.   1 − y^i (w^T x^i + b) ≤ 0, ∀i.
The Lagrangian function:
L(w, b, α) = (1/2) w^T w + ∑_{i=1}^{m} α_i [ 1 − y^i (w^T x^i + b) ].

Deriving the dual problem
L(w, b, α) = (1/2) w^T w + ∑_{i=1}^{m} α_i [ 1 − y^i (w^T x^i + b) ].
Taking derivatives and setting them to zero:
∂L/∂w = w − ∑_{i=1}^{m} α_i y^i x^i = 0,
∂L/∂b = −∑_{i=1}^{m} α_i y^i = 0.

Plug the relation for w back in
L(w, b, α) = (1/2) (∑_i α_i y^i x^i)^T (∑_j α_j y^j x^j) + ∑_i α_i [ 1 − y^i (∑_j α_j y^j (x^j)^T x^i + b) ].
After simplification (the b term vanishes because ∑_i α_i y^i = 0):
L(w, b, α) = ∑_{i=1}^{m} α_i − (1/2) ∑_{i,j=1}^{m} α_i α_j y^i y^j (x^i)^T x^j.

The dual problem
max_α ∑_{i=1}^{m} α_i − (1/2) ∑_{i,j=1}^{m} α_i α_j y^i y^j (x^i)^T x^j
s.t.   α_i ≥ 0, i = 1, ..., m;   ∑_{i=1}^{m} α_i y^i = 0.
This is a constrained quadratic program; it is nice and convex, and the global maximum can be found. Then w is recovered as w = ∑_{i=1}^{m} α_i y^i x^i. How about b?

Support vectors
Note the KKT condition α_i g_i(w) = 0, i.e., α_i [ 1 − y^i (w^T x^i + b) ] = 0. For data points with 1 − y^i (w^T x^i + b) < 0, we must have α_i = 0; only data points with 1 − y^i (w^T x^i + b) = 0 can have α_i > 0. The training data points whose α_i are nonzero are called the support vectors (SV). (Figure: most points have α = 0; only the points on the dashed margin lines, e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, have nonzero α.)

Computing b and obtaining the classifier
Pick any data point with α_i > 0 and solve for b using 1 − y^i (w^T x^i + b) = 0. For a new test point z, compute
w^T z + b = ∑_{i ∈ support vectors} α_i y^i (x^i)^T z + b,
and classify z into the y = +1 class (class 2 in the earlier figures) if the result is positive, and the other class otherwise.
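A hedged end-to-end sketch of this recipe (my illustration; the slides do not mention a specific solver): solve the dual with the cvxpy library, recover w and b, then classify. X is m × d, y holds ±1 labels, and the data are assumed linearly separable:

import cvxpy as cp
import numpy as np

def hard_margin_svm_dual(X, y):
    # max sum_i alpha_i - 1/2 || sum_i alpha_i y_i x_i ||^2
    # s.t. alpha_i >= 0 and sum_i alpha_i y_i = 0
    m = X.shape[0]
    A = y[:, None] * X                                  # rows are y_i * x_i
    alpha = cp.Variable(m)
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(A.T @ alpha))
    constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
    cp.Problem(objective, constraints).solve()
    a = alpha.value
    w = (a * y) @ X                                     # w = sum_i alpha_i y_i x_i
    sv = int(np.argmax(a))                              # a point with alpha_i > 0 lies on the margin
    b = y[sv] - X[sv] @ w                               # from y_i (w^T x_i + b) = 1 with y_i in {+1, -1}
    return w, b, a

def predict(w, b, z):
    return np.sign(z @ w + b)                           # +1 for the y = +1 class, -1 otherwise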

Interpretation of support vector machines
The optimal w is a linear combination of a small number of data points; this sparse representation can be viewed as data compression. To compute the weights α_i, and to use the support vector machine, we only need to specify the inner products (or kernel) between the examples, (x^i)^T x^j. We make decisions by comparing each new example z with only the support vectors:
y = sign( ∑_{i ∈ support vectors} α_i y^i (x^i)^T z + b ).
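Because only inner products appear, a kernel function can be substituted for (x^i)^T z; a minimal sketch (my illustration, using a Gaussian/RBF kernel as an assumed example, which these slides only allude to):

import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # k(u, v) = exp(-gamma * ||u - v||^2), used in place of the inner product u^T v
    return np.exp(-gamma * np.sum((u - v) ** 2))

def kernel_predict(sv_X, sv_y, sv_alpha, b, z, kernel=rbf_kernel):
    # y = sign( sum over support vectors of alpha_i * y_i * k(x_i, z) + b )
    score = sum(a * yi * kernel(xi, z) for a, yi, xi in zip(sv_alpha, sv_y, sv_X)) + b
    return np.sign(score)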