Support Vector Machines, Kernel SVM

Support Vector Machines, Kernel SVM. Professor Ameet Talwalkar, CS260 Machine Learning Algorithms, February 27, 2017.

Outline: 1. Administration; 2. Review of last lecture; 3. SVM: hinge loss (primal formulation); 4. Kernel SVM.

Announcements: HW4 is due now. HW5 will be posted online today. The midterm has been graded: average 64.6/90, median 64.5/90, standard deviation 14.8.

Outline: 1. Administration; 2. Review of last lecture (SVMs, geometric interpretation); 3. SVM: hinge loss (primal formulation); 4. Kernel SVM.

SVM intuition: where should we put the decision boundary? Consider a separable training dataset, i.e., assume there exists a decision boundary that separates the two classes perfectly. There are infinitely many decision boundaries $H: w^T \phi(x) + b = 0$. Which one should we pick? Idea: find a decision boundary in the middle of the two classes. In other words, we want a decision boundary that perfectly classifies the training data and is as far away from every training point as possible.

Distance from a point to the decision boundary. The unsigned distance from a point $\phi(x)$ to the decision boundary (hyperplane) $H$ is
$d_H(\phi(x)) = \frac{|w^T \phi(x) + b|}{\|w\|_2}$
We can remove the absolute value by exploiting the fact that the decision boundary classifies every point in the training dataset correctly. Namely, $w^T \phi(x) + b$ and $x$'s label $y$ must have the same sign, so
$d_H(\phi(x)) = \frac{y\,[w^T \phi(x) + b]}{\|w\|_2}$
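
As a quick sanity check, here is a minimal NumPy sketch of the distance formula above. The weight vector, offset, and points are hypothetical, with identity features $\phi(x) = x$:

```python
import numpy as np

# Hypothetical weight vector and offset; phi(x) = x.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 1.0], [-1.0, 0.0]])
y = np.array([1, -1])  # labels consistent with the signs of w.x + b

f = X @ w + b                        # raw outputs w^T x + b
unsigned = np.abs(f) / np.linalg.norm(w)
signed = y * f / np.linalg.norm(w)   # equals the unsigned distance when points are classified correctly
print(unsigned, signed)
```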

Optimizing the margin. The margin is the smallest distance between the hyperplane and any training point:
$\text{margin}(w, b) = \min_n \frac{y_n [w^T \phi(x_n) + b]}{\|w\|_2}$
How should we pick $(w, b)$ based on its margin? We want a decision boundary that is as far away from all training points as possible, so we maximize the margin:
$\max_{w,b}\, \text{margin}(w, b) = \max_{w,b} \min_n \frac{y_n [w^T \phi(x_n) + b]}{\|w\|_2} = \max_{w,b} \frac{1}{\|w\|_2} \min_n y_n [w^T \phi(x_n) + b]$

Rescaled margin. We can further constrain the problem by scaling $(w, b)$ such that
$\min_n y_n [w^T \phi(x_n) + b] = 1$
This fixes the numerator of $\text{margin}(w, b)$, giving
$\text{margin}(w, b) = \frac{1}{\|w\|_2}$
Hence the points closest to the decision boundary $H: w^T \phi(x) + b = 0$ lie on the hyperplanes $w^T \phi(x) + b = \pm 1$, at distance $1/\|w\|_2$.

SVM: max-margin formulation for separable data. Assuming separable training data, we thus want to solve
$\max_{w,b} \frac{1}{\|w\|_2} \quad \text{s.t.} \quad y_n [w^T \phi(x_n) + b] \geq 1, \; \forall n$
This is equivalent to
$\min_{w,b} \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y_n [w^T \phi(x_n) + b] \geq 1, \; \forall n$
Given our geometric intuition, the SVM is called a max-margin (or large-margin) classifier, and the constraints are called large-margin constraints.

SVM for non-separable data. Constraints in the separable setting: $y_n [w^T \phi(x_n) + b] \geq 1, \; \forall n$. Constraints in the non-separable setting: we modify the constraints to account for non-separability by introducing slack variables $\xi_n \geq 0$:
$y_n [w^T \phi(x_n) + b] \geq 1 - \xi_n, \; \forall n$
For hard training points we can increase $\xi_n$ until the above inequalities are met. What does it mean when $\xi_n$ is very large?

Soft-margin SVM formulation. We do not want the $\xi_n$ to grow too large, so we control their size by incorporating them into the optimization problem:
$\min_{w,b,\xi} \; \frac{1}{2}\|w\|_2^2 + C \sum_n \xi_n \quad \text{s.t.} \quad y_n [w^T \phi(x_n) + b] \geq 1 - \xi_n, \; \xi_n \geq 0, \; \forall n$
$C$ is a user-defined regularization hyperparameter that trades off between the two terms in the objective. This is a convex quadratic program that can be solved with general-purpose or specialized solvers.
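
For instance, this soft-margin problem can be handed to an off-the-shelf solver. A minimal sketch with scikit-learn's SVC, assuming scikit-learn is available and using made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D dataset (hypothetical); one point is placed so the classes overlap slightly.
X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5], [-1.0, -1.0], [-2.0, -1.5], [0.5, -0.2]])
y = np.array([1, 1, 1, -1, -1, -1])

# C plays the role of the regularization hyperparameter in the objective above:
# small C tolerates more slack, large C penalizes margin violations heavily.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)   # w and b for the linear kernel
print(clf.support_vectors_)        # the support vectors found by the solver
```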

What are support vectors? Support vectors are data points where the margin inequality constraint is active (i.e., holds with equality):
$1 - \xi_n - y_n [w^T \phi(x_n) + b] = 0$
There are 3 types of support vectors:
$\xi_n = 0$: this implies $y_n [w^T \phi(x_n) + b] = 1$; these points lie exactly on the margin.
$0 < \xi_n < 1$: these points are classified correctly but do not satisfy the large-margin constraint, i.e., $y_n [w^T \phi(x_n) + b] < 1$.
$\xi_n > 1$: these points are misclassified.

Visualization of how training data points are categorized. [Figure: the hyperplane $H: w^T \phi(x) + b = 0$ with the margin hyperplanes $w^T \phi(x) + b = \pm 1$; support vectors are marked with $\xi_n = 0$ (on the margin), $0 < \xi_n < 1$ (inside the margin), and $\xi_n > 1$ (misclassified).] The SVM solution is determined only by a subset of the training samples (as we will see later in the lecture). These samples are called support vectors, highlighted by the dotted orange lines in the figure.
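
To make the three categories concrete, here is a small sketch (with hypothetical fitted values, not from the lecture) that computes the slacks $\xi_n = \max(0, 1 - y_n f(x_n))$ and buckets the points:

```python
import numpy as np

# Hypothetical raw outputs f(x_n) = w^T phi(x_n) + b from some fitted soft-margin SVM.
f = np.array([1.0, 2.3, 0.4, -0.2, -1.0, -3.1])
y = np.array([1, 1, 1, 1, -1, -1])

xi = np.maximum(0.0, 1.0 - y * f)   # slack for each point

on_margin     = np.isclose(xi, 0.0) & np.isclose(y * f, 1.0)  # xi = 0 and y f = 1
inside_margin = (xi > 0) & (xi < 1)                           # correct but margin violated
misclassified = xi > 1                                        # wrong side of the boundary
print(on_margin, inside_margin, misclassified)
```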

Outline: 1. Administration; 2. Review of last lecture; 3. SVM: hinge loss (primal formulation); 4. Kernel SVM.

Hinge loss. Definition: assume $y \in \{-1, 1\}$ and the decision rule is $h(x) = \text{sign}(f(x))$ with $f(x) = w^T \phi(x) + b$. Then
$\ell_{\text{hinge}}(f(x), y) = \begin{cases} 0 & \text{if } yf(x) \geq 1 \\ 1 - yf(x) & \text{otherwise} \end{cases}$
Intuition: there is no penalty if the raw output $f(x)$ has the same sign as $y$ and is far enough from the decision boundary (i.e., if the margin is large enough). Otherwise we pay a growing penalty: between 0 and 1 if the signs match, and greater than 1 otherwise. Convenient shorthand:
$\ell_{\text{hinge}}(f(x), y) = \max(0, 1 - yf(x)) = (1 - yf(x))_+$
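
A one-line NumPy sketch of the shorthand above (the array values are illustrative):

```python
import numpy as np

def hinge_loss(f, y):
    """Elementwise hinge loss max(0, 1 - y*f) for raw outputs f and labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f)

# Example: large-margin correct point, point inside the margin, misclassified point.
print(hinge_loss(np.array([2.0, 0.5, -0.3]), np.array([1, 1, 1])))  # [0.0, 0.5, 1.3]
```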

Visualization and properties. [Plot of the hinge loss versus $yf(x)$.] The hinge loss is an upper bound on the 0/1 loss (black line in the figure). We use the hinge loss as a surrogate for the 0/1 loss. Why? The hinge loss is convex, and thus easier to work with (though it is not differentiable at the kink).

[Same plot, with additional surrogate losses.] Other surrogate losses can be used, e.g., the exponential loss for AdaBoost (in blue) and the logistic loss (not shown) for logistic regression. The hinge loss is less sensitive to outliers than the exponential (or logistic) loss. The logistic loss has a natural probabilistic interpretation, and the exponential loss can be greedily optimized (AdaBoost).

Primal formulation of support vector machines (SVM). Minimize the total hinge loss on all the training data:
$\min_{w,b} \; \sum_n \max(0, 1 - y_n [w^T \phi(x_n) + b]) + \frac{\lambda}{2}\|w\|_2^2$
This is analogous to regularized least squares, as we balance two terms (the loss and the regularizer). Previously, we used geometric arguments to derive
$\min_{w,b,\xi} \; \frac{1}{2}\|w\|_2^2 + C\sum_n \xi_n \quad \text{s.t.} \quad y_n [w^T \phi(x_n) + b] \geq 1 - \xi_n \text{ and } \xi_n \geq 0, \; \forall n$
Do these yield the same solution?

Recovering our previous SVM formulation. Define $C = 1/\lambda$:
$\min_{w,b} \; C\sum_n \max(0, 1 - y_n [w^T \phi(x_n) + b]) + \frac{1}{2}\|w\|_2^2$
Define $\xi_n \geq \max(0, 1 - y_n f(x_n))$:
$\min_{w,b,\xi} \; C\sum_n \xi_n + \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad \max(0, 1 - y_n [w^T \phi(x_n) + b]) \leq \xi_n, \; \forall n$
Since $c \geq \max(a, b) \iff c \geq a$ and $c \geq b$, we recover the previous formulation.
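
Because the hinge-loss objective above is convex (though non-smooth), a simple way to approximately minimize it is subgradient descent. A minimal sketch, assuming identity features $\phi(x) = x$ and made-up data (the step size and iteration count are arbitrary choices, not from the lecture):

```python
import numpy as np

def svm_subgradient_descent(X, y, lam=0.1, lr=0.01, n_iters=1000):
    """Minimize sum_n max(0, 1 - y_n (w^T x_n + b)) + (lam/2)||w||^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        active = margins < 1                      # points with nonzero hinge loss
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = svm_subgradient_descent(X, y)
print(w, b, np.sign(X @ w + b))   # should reproduce the labels on this separable toy set
```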

Outline: 1. Administration; 2. Review of last lecture; 3. SVM: hinge loss (primal formulation); 4. Kernel SVM (Lagrange duality theory; SVM dual formulation and kernel SVM; SVM dual derivation and support vectors).

Kernel SVM roadmap. Key concepts we'll cover: a brief review of constrained optimization with inequality constraints; primal and dual problems; strong duality and the KKT conditions; the dual SVM problem and kernel SVM; the dual SVM problem and support vectors.

Constrained optimization with equality constraints:
$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \; \forall j$
The Lagrangian is defined as
$L(x, \beta) = f(x) + \sum_j \beta_j h_j(x)$
When the problem is convex, we can find the optimal solution by computing the partial derivatives of $L$, setting them to zero, and solving the corresponding system of equations.
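
For concreteness, a tiny sketch of this recipe using SymPy on a hypothetical problem (not from the lecture): $\min_{x,y} x^2 + y^2$ subject to $x + y - 1 = 0$.

```python
import sympy as sp

x, y, beta = sp.symbols("x y beta", real=True)

f = x**2 + y**2            # objective
h = x + y - 1              # equality constraint h(x, y) = 0
L = f + beta * h           # Lagrangian

# Stationarity: set all partial derivatives of L to zero and solve the system.
sol = sp.solve([sp.diff(L, v) for v in (x, y, beta)], (x, y, beta), dict=True)
print(sol)   # [{x: 1/2, y: 1/2, beta: -1}]
```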

Constrained optimization with inequality constraints. This is the primal problem:
$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \; \forall i, \quad h_j(x) = 0, \; \forall j$
with the generalized Lagrangian
$L(x, \alpha, \beta) = f(x) + \sum_i \alpha_i g_i(x) + \sum_j \beta_j h_j(x)$
Consider the following function:
$\theta_P(x) = \max_{\alpha, \beta:\, \alpha_i \geq 0} L(x, \alpha, \beta)$
If $x$ violates a primal constraint, $\theta_P(x) = \infty$; otherwise $\theta_P(x) = f(x)$. Thus $\min_x \theta_P(x) = \min_x \max_{\alpha, \beta:\, \alpha_i \geq 0} L(x, \alpha, \beta)$ has the same solution as the primal problem, which we denote $p^*$.

Primal problem:
$p^* = \min_x \theta_P(x) = \min_x \max_{\alpha, \beta:\, \alpha_i \geq 0} L(x, \alpha, \beta)$
Dual problem: consider the function $\theta_D(\alpha, \beta) = \min_x L(x, \alpha, \beta)$ and define
$d^* = \max_{\alpha, \beta:\, \alpha_i \geq 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta:\, \alpha_i \geq 0} \min_x L(x, \alpha, \beta)$
The primal and dual are the same, except that the max and min are exchanged. What is the relationship between them? $p^* \geq d^*$ (weak duality): the min-max of any function is always greater than or equal to its max-min. https://en.wikipedia.org/wiki/Max%E2%80%93min_inequality
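
A quick numeric illustration of the max-min inequality on a hypothetical table of Lagrangian-like values $L(x, a)$ (rows play the role of $x$, columns the role of the dual variables):

```python
import numpy as np

# Purely illustrative numbers: rows index x (the "min" player), columns index a (the "max" player).
L = np.array([[3.0, 1.0, 4.0],
              [2.0, 5.0, 0.0],
              [6.0, 2.0, 3.0]])

min_max = L.max(axis=1).min()   # min over x of max over a  (analogue of p*)
max_min = L.min(axis=0).max()   # max over a of min over x  (analogue of d*)
print(min_max, max_min, min_max >= max_min)   # weak duality: min-max >= max-min
```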

Strong duality. When $p^* = d^*$, we can solve the dual problem in lieu of the primal problem. Sufficient conditions for strong duality: $f$ and the $g_i$ are convex and the $h_j$ are affine (i.e., linear with an offset); the inequality constraints are strictly feasible, i.e., there exists some $x$ such that $g_i(x) < 0$ for all $i$. These conditions are all satisfied by the SVM optimization problem! When they hold, there must exist $x^*, \alpha^*, \beta^*$ such that: $x^*$ is the solution to the primal and $(\alpha^*, \beta^*)$ is the solution to the dual; $p^* = d^* = L(x^*, \alpha^*, \beta^*)$; and $x^*, \alpha^*, \beta^*$ satisfy the KKT conditions (which in fact are necessary and sufficient).

Recap. When working with constrained optimization problems with inequality constraints, we can write down primal and dual problems. The dual solution is always a lower bound on the primal solution (weak duality). The duality gap equals 0 under certain conditions (strong duality), and in such cases we can solve either the primal or the dual problem. Strong duality holds for the SVM problem, and in particular the KKT conditions are necessary and sufficient for the optimal solution. See http://cs229.stanford.edu/notes/cs229-notes3.pdf for details.

Dual formulation of the SVM. The dual is also a convex quadratic program:
$\max_\alpha \; \sum_n \alpha_n - \frac{1}{2}\sum_{m,n} y_m y_n \alpha_m \alpha_n \phi(x_m)^T \phi(x_n)$
$\text{s.t.} \quad 0 \leq \alpha_n \leq C, \; \forall n, \quad \sum_n \alpha_n y_n = 0$
There are $N$ dual variables $\alpha_n$, one for each constraint in the primal formulation.
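
A minimal sketch of solving this dual QP numerically with SciPy's SLSQP solver, using a linear kernel, identity features, and made-up data (a real implementation would use a dedicated QP or SMO solver):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data with identity features phi(x) = x (illustrative values).
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0
K = X @ X.T                                  # Gram matrix k(x_m, x_n) = x_m^T x_n

def neg_dual(alpha):
    v = alpha * y
    return -(alpha.sum() - 0.5 * v @ K @ v)  # negate because scipy minimizes

res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
print(alpha)                                  # nonzero entries correspond to support vectors
```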

Kernel SVM. We replace the inner products $\phi(x_m)^T \phi(x_n)$ with a kernel function $k(x_m, x_n)$:
$\max_\alpha \; \sum_n \alpha_n - \frac{1}{2}\sum_{m,n} y_m y_n \alpha_m \alpha_n k(x_m, x_n)$
$\text{s.t.} \quad 0 \leq \alpha_n \leq C, \; \forall n, \quad \sum_n \alpha_n y_n = 0$
By defining a kernel function we can work with nonlinear features and learn a nonlinear decision surface.
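
For example, a sketch of the standard RBF kernel $k(x, x') = \exp(-\gamma\|x - x'\|^2)$ and its Gram matrix, which would simply replace the matrix K in the dual solve above (the value of gamma is an arbitrary illustrative choice):

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_matrix(X, X)
print(K)   # symmetric PSD matrix with ones on the diagonal
```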

Recovering the solution to the primal formulation. Weights:
$w^* = \sum_n y_n \alpha_n \phi(x_n)$
a linear combination of the input features. Offset: for any $n$ with $0 < \alpha_n < C$ we have
$b^* = y_n - w^{*T}\phi(x_n) = y_n - \sum_m y_m \alpha_m k(x_m, x_n)$
Prediction on a test point $x$:
$h(x) = \text{sign}(w^{*T}\phi(x) + b^*) = \text{sign}\Big(\sum_n y_n \alpha_n k(x_n, x) + b^*\Big)$
At test time it suffices to know the kernel function!
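
Continuing the SciPy sketch above, a sketch of recovering $b$ from a free support vector ($0 < \alpha_n < C$) and predicting, again with the linear kernel and illustrative data (the function name and tolerance are hypothetical):

```python
import numpy as np

def recover_and_predict(alpha, X, y, K, X_test, C, tol=1e-6):
    """Recover b from a free support vector and predict labels for X_test (linear kernel)."""
    free = (alpha > tol) & (alpha < C - tol)     # 0 < alpha_n < C
    n = np.flatnonzero(free)[0]
    b = y[n] - (y * alpha) @ K[:, n]             # b = y_n - sum_m y_m alpha_m k(x_m, x_n)
    scores = (y * alpha) @ (X @ X_test.T) + b    # sum_n y_n alpha_n k(x_n, x) + b
    return b, np.sign(scores)

# Usage (hypothetical), with alpha, X, y, K, C from the dual solve above:
#   b, preds = recover_and_predict(alpha, X, y, K, X, C)
```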

Derivation of the dual. We will next derive the dual formulation for SVMs. Recipe: formulate the generalized Lagrangian that incorporates the constraints and introduces dual variables; minimize the Lagrangian over the primal variables; substitute the primal variables for dual variables in the Lagrangian; maximize the Lagrangian with respect to the dual variables; recover the solution (for the primal variables) from the dual variables.

A simple example. Consider the following convex quadratic program:
$\min_x \; \frac{1}{2}x^2 \quad \text{s.t.} \quad -x \leq 0, \quad 2x - 3 \leq 0$
The generalized Lagrangian is (note that we do not have equality constraints)
$L(x, \alpha) = \frac{1}{2}x^2 + \alpha_1(-x) + \alpha_2(2x - 3) = \frac{1}{2}x^2 + (2\alpha_2 - \alpha_1)x - 3\alpha_2$
under the constraints $\alpha_1 \geq 0$ and $\alpha_2 \geq 0$. Its dual problem is
$\max_{\alpha_1 \geq 0, \alpha_2 \geq 0} \min_x L(x, \alpha) = \max_{\alpha_1 \geq 0, \alpha_2 \geq 0} \min_x \; \frac{1}{2}x^2 + (2\alpha_2 - \alpha_1)x - 3\alpha_2$

Example (cont'd). We now solve $\min_x L(x, \alpha)$. The optimal $x$ is attained by
$\frac{\partial}{\partial x}\Big(\frac{1}{2}x^2 + (2\alpha_2 - \alpha_1)x - 3\alpha_2\Big) = 0 \;\Rightarrow\; x = -(2\alpha_2 - \alpha_1)$
We next substitute this solution back into the Lagrangian:
$g(\alpha) = \min_x \; \frac{1}{2}x^2 + (2\alpha_2 - \alpha_1)x - 3\alpha_2 = -\frac{1}{2}(2\alpha_2 - \alpha_1)^2 - 3\alpha_2$
Our dual problem can now be simplified to
$\max_{\alpha_1 \geq 0, \alpha_2 \geq 0} \; -\frac{1}{2}(2\alpha_2 - \alpha_1)^2 - 3\alpha_2$
We will solve the dual next.

Solving the dual. Note that $g(\alpha) = -\frac{1}{2}(2\alpha_2 - \alpha_1)^2 - 3\alpha_2 \leq 0$ for all $\alpha_1 \geq 0, \alpha_2 \geq 0$. Thus, to maximize the function, the optimal solution is $\alpha_1^* = 0, \alpha_2^* = 0$. This brings us back to the optimal solution for $x$:
$x^* = -(2\alpha_2^* - \alpha_1^*) = 0$
Namely, we have arrived at the same solution as the one we would guess from the primal formulation.
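
A quick numeric check of this example with SciPy, solving the primal and the dual directly and confirming that $p^* = d^* = 0$ (up to solver tolerance):

```python
import numpy as np
from scipy.optimize import minimize

# Primal: min (1/2) x^2  s.t.  x >= 0 and 2x - 3 <= 0.
primal = minimize(lambda x: 0.5 * x[0] ** 2, x0=[1.0], method="SLSQP",
                  constraints=[{"type": "ineq", "fun": lambda x: x[0]},             # x >= 0
                               {"type": "ineq", "fun": lambda x: 3.0 - 2 * x[0]}])  # 2x - 3 <= 0

# Dual: max g(a) = -(1/2)(2 a2 - a1)^2 - 3 a2  s.t.  a1, a2 >= 0  (minimize -g).
dual = minimize(lambda a: 0.5 * (2 * a[1] - a[0]) ** 2 + 3 * a[1], x0=[1.0, 1.0],
                bounds=[(0.0, None), (0.0, None)], method="SLSQP")

print(primal.x, primal.fun)   # x* ~ 0, p* ~ 0
print(dual.x, -dual.fun)      # alpha* ~ (0, 0), d* ~ 0
```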

Deriving the dual for the SVM. Primal SVM:
$\min_{w,b,\xi} \; \frac{1}{2}\|w\|_2^2 + C\sum_n \xi_n \quad \text{s.t.} \quad y_n [w^T \phi(x_n) + b] \geq 1 - \xi_n, \; \xi_n \geq 0, \; \forall n$
Lagrangian:
$L(w, b, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\}) = C\sum_n \xi_n + \frac{1}{2}\|w\|_2^2 - \sum_n \lambda_n \xi_n + \sum_n \alpha_n \{1 - y_n [w^T \phi(x_n) + b] - \xi_n\}$
under the constraints $\alpha_n \geq 0$ and $\lambda_n \geq 0$.

Minimizing the Lagrangian. Taking derivatives with respect to the primal variables:
$\frac{\partial L}{\partial w} = w - \sum_n y_n \alpha_n \phi(x_n) = 0$
$\frac{\partial L}{\partial b} = -\sum_n \alpha_n y_n = 0$
$\frac{\partial L}{\partial \xi_n} = C - \lambda_n - \alpha_n = 0$
These equations link the primal variables and the dual variables and provide new constraints on the dual variables:
$w = \sum_n y_n \alpha_n \phi(x_n), \quad \sum_n \alpha_n y_n = 0, \quad C - \lambda_n - \alpha_n = 0$

Rearranging the Lagrangian and incorporating these constraints. Recall
$L(\cdot) = C\sum_n \xi_n + \frac{1}{2}\|w\|_2^2 - \sum_n \lambda_n \xi_n + \sum_n \alpha_n\{1 - y_n [w^T \phi(x_n) + b] - \xi_n\}$
where $\alpha_n \geq 0$ and $\lambda_n \geq 0$, together with the constraints from the partial derivatives: $\sum_n \alpha_n y_n = 0$ and $C - \lambda_n - \alpha_n = 0$. Substituting $w = \sum_n y_n \alpha_n \phi(x_n)$:
$g(\{\alpha_n\}, \{\lambda_n\}) = L(w, b, \{\xi_n\}, \{\alpha_n\}, \{\lambda_n\})$
$= \sum_n (C - \alpha_n - \lambda_n)\xi_n + \frac{1}{2}\Big\|\sum_n y_n \alpha_n \phi(x_n)\Big\|_2^2 + \sum_n \alpha_n - \Big(\sum_n \alpha_n y_n\Big) b - \sum_n \alpha_n y_n \Big(\sum_m y_m \alpha_m \phi(x_m)\Big)^T \phi(x_n)$
$= \sum_n \alpha_n + \frac{1}{2}\Big\|\sum_n y_n \alpha_n \phi(x_n)\Big\|_2^2 - \sum_{m,n} \alpha_m \alpha_n y_m y_n \phi(x_m)^T \phi(x_n)$
$= \sum_n \alpha_n - \frac{1}{2}\sum_{m,n} \alpha_m \alpha_n y_m y_n \phi(x_m)^T \phi(x_n)$

The dual problem. Maximize the dual under the constraints:
$\max_\alpha \; g(\{\alpha_n\}, \{\lambda_n\}) = \sum_n \alpha_n - \frac{1}{2}\sum_{m,n} y_m y_n \alpha_m \alpha_n k(x_m, x_n)$
$\text{s.t.} \quad \alpha_n \geq 0, \; \forall n, \quad \sum_n \alpha_n y_n = 0, \quad C - \lambda_n - \alpha_n = 0, \; \lambda_n \geq 0, \; \forall n$
We can simplify this because the objective does not depend on $\lambda_n$. Specifically, we can combine the constraints involving $\lambda_n$ into the single inequality constraint $\alpha_n \leq C$:
$C - \lambda_n - \alpha_n = 0, \; \lambda_n \geq 0 \;\Rightarrow\; \lambda_n = C - \alpha_n \geq 0 \;\Rightarrow\; \alpha_n \leq C$

Simplified dual:
$\max_\alpha \; \sum_n \alpha_n - \frac{1}{2}\sum_{m,n} y_m y_n \alpha_m \alpha_n \phi(x_m)^T \phi(x_n)$
$\text{s.t.} \quad 0 \leq \alpha_n \leq C, \; \forall n, \quad \sum_n \alpha_n y_n = 0$

Recovering the solution to the primal formulation. From $\frac{\partial L}{\partial w}$ we already identified the primal variable $w^* = \sum_n \alpha_n^* y_n \phi(x_n)$. Prediction therefore only depends on support vectors, i.e., points with $\alpha_n > 0$! When is $\alpha_n > 0$? The KKT conditions tell us: (1) $\lambda_n \xi_n = 0$ and (2) $\alpha_n\{1 - \xi_n - y_n [w^T \phi(x_n) + b]\} = 0$. Condition (2) tells us that $\alpha_n > 0$ only if $1 - \xi_n = y_n [w^T \phi(x_n) + b]$. If $\xi_n = 0$, the support vector is on the margin; otherwise $\xi_n > 0$ means the point is an outlier. The equality from the derivative of the Lagrangian yields (3) $C - \alpha_n - \lambda_n = 0$; if $\xi_n > 0$, then (1) and (3) imply that $\alpha_n = C$.

Expressions for the offset ($b$) and test predictions. Recovering $b$: the KKT conditions give (1) $\lambda_n \xi_n = 0$ and (2) $\alpha_n\{1 - \xi_n - y_n [w^T \phi(x_n) + b]\} = 0$, and the equality from the derivative of the Lagrangian gives (3) $C - \alpha_n - \lambda_n = 0$. Combining (1) and (3) and assuming $\alpha_n < C$:
$\lambda_n = C - \alpha_n > 0 \;\Rightarrow\; \xi_n = 0$
Using (2), if $0 < \alpha_n < C$ and $y_n \in \{-1, 1\}$:
$1 - y_n [w^T \phi(x_n) + b] = 0 \;\Rightarrow\; b = y_n - w^T \phi(x_n)$
Test prediction: $h(x) = \text{sign}\big(\sum_n y_n \alpha_n k(x_n, x) + b\big)$
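
Tying the pieces together, a short sketch (assuming scikit-learn is available; the data is made up) that fits a kernel SVM, reads off the dual coefficients $y_n \alpha_n$ and the offset $b$, and checks that the hand-written prediction $\sum_n y_n \alpha_n k(x_n, x) + b$ matches the library's decision function:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy nonlinear dataset (illustrative): class +1 near the origin, class -1 on a ring.
rng = np.random.default_rng(0)
X_inner = rng.normal(scale=0.5, size=(20, 2))
angles = rng.uniform(0, 2 * np.pi, size=20)
X_outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(scale=0.2, size=(20, 2))
X = np.vstack([X_inner, X_outer])
y = np.array([1] * 20 + [-1] * 20)

gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# dual_coef_ holds y_n * alpha_n for the support vectors; intercept_ holds b.
X_test = rng.normal(size=(5, 2))
K_test = rbf_kernel(clf.support_vectors_, X_test, gamma=gamma)
manual = clf.dual_coef_ @ K_test + clf.intercept_                    # sum_n y_n alpha_n k(x_n, x) + b
print(np.allclose(manual.ravel(), clf.decision_function(X_test)))    # True, up to numerical error
```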