Lagrange Multipliers Kernel Trick

Lagrange Multipliers and the Kernel Trick. Nicholas Ruozzi, University of Texas at Dallas. Based roughly on the slides of David Sontag.

General Optimization. A mathematical detour, we'll come back to SVMs soon! min_{x ∈ R^n} f_0(x) subject to: f_i(x) ≤ 0, i = 1, ..., m, and h_i(x) = 0, i = 1, ..., p

General Optimization. min_{x ∈ R^n} f_0(x) subject to: f_i(x) ≤ 0, i = 1, ..., m, and h_i(x) = 0, i = 1, ..., p. f_0 is not necessarily convex

General Optimization. min_{x ∈ R^n} f_0(x) subject to: f_i(x) ≤ 0, i = 1, ..., m, and h_i(x) = 0, i = 1, ..., p. Constraints do not need to be linear

Lagrangian. L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x). Incorporate the constraints into a new objective function. λ ≥ 0 and ν are vectors of Lagrange multipliers. The Lagrange multipliers can be thought of as soft constraints

Duality. Construct a dual function by minimizing the Lagrangian over the primal variables: g(λ, ν) = inf_x L(x, λ, ν). g(λ, ν) = −∞ whenever the Lagrangian is not bounded from below for a fixed λ and ν
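
To make the dual construction concrete, here is a small sketch that is not from the slides: for the toy problem min x^2 subject to 1 − x ≤ 0, the Lagrangian is L(x, λ) = x^2 + λ(1 − x), the minimizer over x is x = λ/2, and so g(λ) = λ − λ^2/4. The snippet below (the grid bounds are arbitrary choices) checks the closed form numerically.

```python
import numpy as np

def lagrangian(x, lam):
    # L(x, lambda) = f_0(x) + lambda * f_1(x) with f_0(x) = x^2 and f_1(x) = 1 - x
    return x**2 + lam * (1.0 - x)

def dual_numeric(lam, grid=np.linspace(-10.0, 10.0, 100001)):
    # g(lambda) = inf_x L(x, lambda), approximated by a minimum over a fine grid of x values
    return lagrangian(grid, lam).min()

for lam in [0.0, 1.0, 2.0, 4.0]:
    closed_form = lam - lam**2 / 4.0             # from the minimizer x = lambda / 2
    print(lam, dual_numeric(lam), closed_form)   # the two dual values should agree closely
```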

The Primal Problem. min_{x ∈ R^n} f_0(x) subject to: f_i(x) ≤ 0, i = 1, ..., m, and h_i(x) = 0, i = 1, ..., p. Equivalently, inf_x sup_{λ ≥ 0, ν} L(x, λ, ν)

The Dual Problem. sup_{λ ≥ 0, ν} g(λ, ν). Equivalently, sup_{λ ≥ 0, ν} inf_x L(x, λ, ν). The dual function is always concave, so the dual problem is a convex (concave maximization) problem, even if the primal problem is not convex

Primal vs. Dual. sup_{λ ≥ 0, ν} inf_x L(x, λ, ν) ≤ inf_x sup_{λ ≥ 0, ν} L(x, λ, ν). Why? g(λ, ν) ≤ L(x, λ, ν) for all x, and L(x, λ, ν) ≤ f_0(x) for any feasible x and λ ≥ 0 (x is feasible if it satisfies all of the constraints). Let x* be the optimal solution to the primal problem and λ ≥ 0: then g(λ, ν) ≤ L(x*, λ, ν) ≤ f_0(x*)

Duality. Under certain conditions, the two optimization problems are equivalent: sup_{λ ≥ 0, ν} inf_x L(x, λ, ν) = inf_x sup_{λ ≥ 0, ν} L(x, λ, ν). This is called strong duality. If the inequality is strict, then we say that there is a duality gap, whose size is measured by the difference between the two sides of the inequality
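
Continuing the toy sketch from above (an illustration, not part of the lecture): maximizing g(λ) = λ − λ^2/4 over λ ≥ 0 gives λ* = 2 and g(λ*) = 1, which equals the primal optimum f_0(x*) = 1 at x* = 1, so the duality gap is zero here, as strong duality predicts.

```python
import numpy as np

lams = np.linspace(0.0, 10.0, 100001)   # dual variable lambda >= 0
dual_values = lams - lams**2 / 4.0      # g(lambda) = lambda - lambda^2 / 4
xs = np.linspace(1.0, 10.0, 100001)     # primal feasible set: 1 - x <= 0, i.e., x >= 1
print(dual_values.max())                # dual optimum, approximately 1.0
print((xs**2).min())                    # primal optimum f_0(x*) = 1.0, so no duality gap
```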

Slater's Condition. For any optimization problem of the form min_{x ∈ R^n} f_0(x) subject to: f_i(x) ≤ 0, i = 1, ..., m, and Ax = b, where f_0, ..., f_m are convex functions, strong duality holds if there exists an x such that f_i(x) < 0 for i = 1, ..., m and Ax = b

Dual SVM. min_{w,b} (1/2)||w||^2 such that y_i (w^T x^(i) + b) ≥ 1 for all i. Note that Slater's condition holds as long as the data is linearly separable (see the sketch below)
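
A sketch of why separability gives a strictly feasible point for Slater's condition (the toy data here is an assumption, not from the slides): take any separating hyperplane and scale (w, b) until every margin strictly exceeds 1.

```python
import numpy as np

# Toy linearly separable data: two points per class (illustrative assumption)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b = np.array([1.0, 1.0]), 0.0      # any separating hyperplane will do
margins = y * (X @ w + b)
assert np.all(margins > 0)            # separability: every margin is positive

scale = 2.0 / margins.min()           # rescale so the smallest margin becomes 2 > 1
w, b = scale * w, scale * b
print(y * (X @ w + b))                # all strictly greater than 1, so Slater's condition holds
```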

Dual SVM. L(w, b, λ) = (1/2) w^T w + Σ_i λ_i (1 − y_i (w^T x^(i) + b)). Convex in w, so take derivatives to form the dual: ∂L/∂w_k = w_k − Σ_i λ_i y_i x_k^(i) = 0 and ∂L/∂b = −Σ_i λ_i y_i = 0

Dual SVM. L(w, b, λ) = (1/2) w^T w + Σ_i λ_i (1 − y_i (w^T x^(i) + b)). Convex in w, so take derivatives to form the dual: w = Σ_i λ_i y_i x^(i) and Σ_i λ_i y_i = 0
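
Writing out the substitution step that the slides skip: plugging w = Σ_i λ_i y_i x^(i) back into L and using Σ_i λ_i y_i = 0 to drop the b term gives L = (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x^(i))^T x^(j) + Σ_i λ_i − Σ_{i,j} λ_i λ_j y_i y_j (x^(i))^T x^(j) = Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x^(i))^T x^(j), which is the dual objective on the next slide.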

Dual SVM. max_{λ ≥ 0} −(1/2) Σ_{i,j} λ_i λ_j y_i y_j (x^(i))^T x^(j) + Σ_i λ_i such that Σ_i λ_i y_i = 0. By strong duality, solving this problem is equivalent to solving the primal problem. Given the optimal λ, we can easily construct w (b can be found by complementary slackness)
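
A minimal sketch of solving this dual with a general-purpose solver; it is not from the lecture, the toy data is an assumption, and scipy's SLSQP method is used purely for illustration (dedicated SVM solvers are what one would use in practice).

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative assumption)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)    # G_ij = y_i y_j (x^(i))^T x^(j)

def neg_dual(lam):
    # negative of the dual objective, since minimize() minimizes
    return 0.5 * lam @ G @ lam - lam.sum()

constraints = [{"type": "eq", "fun": lambda lam: lam @ y}]   # sum_i lambda_i y_i = 0
bounds = [(0.0, None)] * n                                   # lambda_i >= 0

res = minimize(neg_dual, np.zeros(n), method="SLSQP", bounds=bounds, constraints=constraints)
lam = res.x
w = (lam * y) @ X                            # w = sum_i lambda_i y_i x^(i)
print(lam, w)                                # b is recovered from a support vector (see below)
```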

Complementary Slackness. Suppose that there is zero duality gap. Let x* be an optimum of the primal and (λ*, ν*) be an optimum of the dual. Then f_0(x*) = g(λ*, ν*) = inf_x [ f_0(x) + Σ_{i=1}^m λ*_i f_i(x) + Σ_{i=1}^p ν*_i h_i(x) ] ≤ f_0(x*) + Σ_{i=1}^m λ*_i f_i(x*) + Σ_{i=1}^p ν*_i h_i(x*) ≤ f_0(x*)

Complementary Slackness. This means that Σ_{i=1}^m λ*_i f_i(x*) = 0. As λ*_i ≥ 0 and f_i(x*) ≤ 0, this can only happen if λ*_i f_i(x*) = 0 for all i. Put another way: if f_i(x*) < 0 (i.e., the constraint is not tight), then λ*_i = 0; if λ*_i > 0, then f_i(x*) = 0. This ONLY applies when there is no duality gap

Dual SVM. max_{λ ≥ 0} −(1/2) Σ_{i,j} λ_i λ_j y_i y_j (x^(i))^T x^(j) + Σ_i λ_i such that Σ_i λ_i y_i = 0. By complementary slackness, λ_i > 0 means that x^(i) is a support vector (can then solve for b using w)
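
Continuing the sketch from above (lam, w, X, y are the hypothetical quantities from that snippet): any point with λ_i > 0 is a support vector and satisfies y_i (w^T x^(i) + b) = 1, which pins down b.

```python
import numpy as np

def recover_b(lam, w, X, y, tol=1e-6):
    # Support vectors are the points with lambda_i > 0 (up to a numerical tolerance);
    # each satisfies y_i (w^T x_i + b) = 1, so b = y_i - w^T x_i for any of them.
    sv = lam > tol
    return np.mean(y[sv] - X[sv] @ w)   # average over support vectors for numerical stability

# usage with the earlier sketch: b = recover_b(lam, w, X, y)
```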

Dual SVM. max_{λ ≥ 0} −(1/2) Σ_{i,j} λ_i λ_j y_i y_j (x^(i))^T x^(j) + Σ_i λ_i such that Σ_i λ_i y_i = 0. Takes O(n^2) time just to evaluate the objective function. Active area of research to try to speed this up

The Kernel Trick. max_{λ ≥ 0} −(1/2) Σ_{i,j} λ_i λ_j y_i y_j (x^(i))^T x^(j) + Σ_i λ_i such that Σ_i λ_i y_i = 0. The dual formulation only depends on inner products between the data points. The same thing is true if we use feature vectors instead

The Kernel Trick. For some feature vectors, we can compute the inner products quickly, even if the feature vectors are very large. This is best illustrated by example. Let φ(x_1, x_2) = (x_1 x_2, x_2 x_1, x_1^2, x_2^2). Then φ(x_1, x_2) · φ(z_1, z_2) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x_1 z_1 + x_2 z_2)^2 = (x · z)^2. Reduces to a dot product in the original space
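
A quick numerical check of this identity (a sketch; the particular vectors are arbitrary):

```python
import numpy as np

def phi(v):
    # feature map from the slide: (x1*x2, x2*x1, x1^2, x2^2)
    x1, x2 = v
    return np.array([x1 * x2, x2 * x1, x1**2, x2**2])

x = np.array([1.5, -2.0])
z = np.array([0.5, 3.0])

lhs = phi(x) @ phi(z)    # explicit inner product in the 4-dimensional feature space
rhs = (x @ z) ** 2       # kernel evaluation in the original 2-dimensional space
print(lhs, rhs)          # identical up to floating-point rounding
```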

The Kernel Trick. The same idea can be applied for the feature vector φ of all polynomials of degree (exactly) d: φ(x) · φ(z) = (x · z)^d. More generally, a kernel is a function k(x, z) = φ(x) · φ(z) for some feature map φ. Rewrite the dual objective: max_{λ ≥ 0, Σ_i λ_i y_i = 0} −(1/2) Σ_{i,j} λ_i λ_j y_i y_j k(x^(i), x^(j)) + Σ_i λ_i

Examples of Kernels. Polynomial kernel of degree exactly d: k(x, z) = (x · z)^d. General polynomial kernel of degree d for some c: k(x, z) = (x · z + c)^d. Gaussian kernel for some σ: k(x, z) = exp(−||x − z||^2 / (2σ^2)); the corresponding φ is infinite dimensional! So many more
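
These kernels are easy to implement directly; the sketch below (parameter values are arbitrary choices) builds the Gram matrix K_ij = k(x^(i), x^(j)) that the kernelized dual objective needs.

```python
import numpy as np

def poly_kernel(X, Z, d=3, c=1.0):
    # general polynomial kernel k(x, z) = (x . z + c)^d; c = 0 gives the "degree exactly d" kernel
    return (X @ Z.T + c) ** d

def gaussian_kernel(X, Z, sigma=1.0):
    # Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

X = np.random.randn(5, 2)             # 5 toy points in R^2
print(poly_kernel(X, X).shape)        # (5, 5) Gram matrix
print(gaussian_kernel(X, X).shape)    # (5, 5) Gram matrix
```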

Kernels. A bigger feature space increases the possibility of overfitting. Large margin solutions should still generalize reasonably well. Alternative: add penalties to the objective to disincentivize complicated solutions: min_w (1/2)||w||^2 + c · (# of misclassifications). This is not a quadratic program anymore (in fact, it is NP-hard). Like the Hamming loss, it has no notion of how badly the data is misclassified