ICS-E4030 Kernel Methods in Machine Learning

Lecture 3: Convex optimization and duality
Juho Rousu, 28 September 2016

This lecture
A crash course on convex optimisation
Lagrangian duality
Derivation of the SVM dual
Additional reading on convex optimisation: Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004

Convex optimization problem
Standard form of a convex optimization problem:
minimize $f_0(x)$
subject to $f_i(x) \le 0$, $i = 1, \dots, m$
$a_i^T x = b_i$, $i = 1, \dots, p$
The objective $f_0$ is a convex function.
The constraint functions $f_1, \dots, f_m$ defining the inequality constraints are convex functions.
The equality constraints are affine (linear).
A value of $x$ that satisfies the constraints is called feasible; the set of all feasible points is called the feasible set.
The feasible set defined by the above constraints is convex.
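For illustration (not part of the slides), here is a minimal sketch of a problem in this standard form, written with the cvxpy modelling library; the objective, constraints and data below are made up for the example.

```python
# A minimal sketch (assumed setup, not from the lecture): a convex problem in
# standard form, minimize f0(x) s.t. f1(x) <= 0 and an affine equality constraint.
import cvxpy as cp
import numpy as np

n = 3
A = np.array([[1.0, 1.0, 1.0]])           # one affine equality constraint a^T x = b
b = np.array([1.0])

x = cp.Variable(n)
f0 = cp.sum_squares(x)                     # convex objective f0(x) = ||x||^2
constraints = [cp.norm(x, 2) - 2 <= 0,     # convex inequality constraint f1(x) <= 0
               A @ x == b]                 # affine equality constraint
prob = cp.Problem(cp.Minimize(f0), constraints)
prob.solve()
print(prob.value, x.value)
```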

Why is convexity useful?
[Figures: a convex vs. a non-convex objective; a convex vs. a non-convex feasible set]

Why convexity?
Convex objective: we can always improve a sub-optimal objective value by stepping in the direction of the negative gradient; all local optima are global optima.
Convex constraints, i.e. a convex feasible set: any point between two feasible points is feasible, and updates remain inside the feasible set as long as the update direction points towards a feasible point. This enables fast algorithms based on the principle of feasible descent.
The graph of a convex function $f$ lies below the line segment connecting $(x, f(x))$ and $(y, f(y))$.
A convex set $C$ contains all line segments between any pair of its distinct points.

Linear optimization problem (LP)
A convex optimization problem with an affine objective and affine constraints is called a linear program (LP). A general linear program has the form
minimize $c^T x + d$
subject to $Gx \preceq h$
$Ax = b$
Geometric interpretation: the feasible set is a polyhedron, given by the intersection of a set of half-spaces ($Gx \preceq h$) and a set of hyperplanes ($Ax = b$).
The objective decreases fastest when moving in the direction $-\nabla(c^T x + d) = -c$.

L1-regularized SVM as an LP
$\min_w \sum_{i=1}^N \ell_{\mathrm{Hinge}}(y_i, \langle w, \phi(x_i)\rangle) + \lambda \|w\|_1$
Express the hinge loss through an upper bound (slack) $\xi_i$ and the inequality constraint $1 - y_i w^T \phi(x_i) \le \xi_i$.
Express the L1 norm of the weight vector, $\|w\|_1 = \sum_{j=1}^D |w_j|$, as a sum of magnitudes $m_j$ with constraints $w_j \le m_j$, $-w_j \le m_j$.
We obtain the LP
$\min_{w,\xi,m} \sum_{i=1}^N \xi_i + \lambda \sum_{j=1}^D m_j$
subject to $y_i w^T \phi(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \dots, N$
$w_j \le m_j$, $-w_j \le m_j$, $m_j \ge 0$, $j = 1, \dots, D$
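A hedged sketch of this LP, assuming the identity feature map $\phi(x) = x$, a synthetic data set and the cvxpy library (none of which are prescribed by the slides):

```python
# Sketch of the L1-regularized SVM LP from the slide, with phi(x) = x and
# synthetic data; names (X, y, lam) are illustrative, not from the slides.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, D = 40, 5
X = rng.standard_normal((N, D))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(N))   # labels in {-1, +1}
lam = 0.1

w = cp.Variable(D)
xi = cp.Variable(N)        # slacks: upper bounds on the hinge loss
m = cp.Variable(D)         # magnitudes m_j >= |w_j|

objective = cp.Minimize(cp.sum(xi) + lam * cp.sum(m))
constraints = [cp.multiply(y, X @ w) >= 1 - xi,        # y_i w^T x_i >= 1 - xi_i
               xi >= 0,
               w <= m, -w <= m, m >= 0]
prob = cp.Problem(objective, constraints)
prob.solve()
print("optimal value:", prob.value)
```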

Quadratic optimization problem (QP)
A quadratic program (QP) has a quadratic objective function and affine constraint functions. A QP can be expressed in the form
minimize $(1/2)\, x^T P x + q^T x + r$
subject to $Gx \preceq h$
$Ax = b$
$P \in \mathbb{R}^{n \times n}$ is a positive semi-definite matrix, $G \in \mathbb{R}^{m \times n}$, $A \in \mathbb{R}^{p \times n}$.
The feasible set of a QP is a polyhedron.
The objective decreases fastest when moving in the direction $-\nabla f_0(x) = -(Px + q)$.
When $P = 0$, the QP reduces to an LP.

Ridge regression as a QP
$\min_w \sum_{i=1}^N (\langle w, \phi(x_i)\rangle - y_i)^2 + \lambda \|w\|^2$
Express the norm of the weight vector as $\|w\|_2^2 = w^T I_D w$.
Write the squared error in vector form: $\langle y - Xw, y - Xw\rangle = y^T y - 2 w^T X^T y + w^T X^T X w$.
We get $P = X^T X + \lambda I_D$ for the quadratic part and $q = -2 X^T y$ for the linear part:
$\min_w \; w^T (X^T X + \lambda I_D) w - 2 w^T X^T y + y^T y$
This is an example of an unconstrained QP.
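Since this QP is unconstrained, its minimizer is found by setting the gradient to zero, giving $w^\star = (X^T X + \lambda I_D)^{-1} X^T y$. A small numpy sketch checking this on made-up data (data and names are illustrative only):

```python
# Sketch: the unconstrained ridge QP min_w w^T(X^T X + lam*I)w - 2 w^T X^T y + y^T y
# is minimized where its gradient vanishes, i.e. at w* = (X^T X + lam*I)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 4, 0.5
X = rng.standard_normal((N, D))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(N)

P = X.T @ X + lam * np.eye(D)             # quadratic part
q = -2 * X.T @ y                           # linear part
w_star = np.linalg.solve(P, X.T @ y)       # closed-form minimizer

grad = 2 * P @ w_star + q                  # gradient of the QP objective at w*
print(np.allclose(grad, 0))                # True: w* is the unconstrained optimum
```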

Soft-margin SVM as a QP
$\min_w \sum_{i=1}^N \ell_{\mathrm{Hinge}}(\langle w, \phi(x_i)\rangle, y_i) + \frac{\lambda}{2} \|w\|^2$
Express the hinge loss through an upper bound (slack) $\xi_i$ and the inequality constraint $1 - y_i w^T \phi(x_i) \le \xi_i$.
Express the norm of the weight vector as $\|w\|_2^2 = w^T I_D w$.
We get $P = I_D$ (quadratic part in $w$) and $q = \mathbf{1}_N$, the vector of ones (linear part in $\xi$).
We obtain the QP
$\min_{w,\xi} \mathbf{1}_N^T \xi + \frac{\lambda}{2} w^T I_D w$
subject to $y_i w^T \phi(x_i) + \xi_i \ge 1$, $\xi_i \ge 0$, $i = 1, \dots, N$
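A sketch of this primal QP with cvxpy, again assuming $\phi(x) = x$ and synthetic data (illustrative only, not from the slides):

```python
# Sketch of the primal soft-margin SVM QP from the slide (phi(x) = x, no intercept);
# the data and the value of lam are made up for the example.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
N, D = 60, 2
X = rng.standard_normal((N, D))
y = np.sign(X[:, 0] + X[:, 1])
lam = 1.0

w = cp.Variable(D)
xi = cp.Variable(N)
objective = cp.Minimize(cp.sum(xi) + (lam / 2) * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w) + xi >= 1, xi >= 0]   # y_i w^T x_i + xi_i >= 1
prob = cp.Problem(objective, constraints)
prob.solve()
print("primal optimum:", prob.value)
```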

Duality
The principle of viewing an optimization problem from two interchangeable views, the primal and the dual view.
Intuitively:
minimization of the primal objective ↔ maximization of the dual objective
primal constraints ↔ dual variables
dual constraints ↔ primal variables
Motivation in this course: learning with features (primal) ↔ learning with kernels (dual)

Duality: Lagrangian
Consider a convex optimization problem with variable $x \in \mathbb{R}^n$:
minimize $f_0(x)$
subject to $f_i(x) \le 0$, $i = 1, \dots, m$
$h_i(x) = 0$, $i = 1, \dots, p$
We will call this the primal optimisation problem.
We take the constraints into account by augmenting the objective with a weighted sum of the constraints:
$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x)$
The resulting function is the Lagrangian associated with the optimization problem. Its domain is $\mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p$.

Example: minimizing $f_0(x) = x^2$ subject to $f_1(x) = 1 - x \le 0$
Lagrangian: $L(x, \lambda) = x^2 + \lambda(1 - x)$

Lagrangian dual function
The Lagrangian dual function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ is the minimum value of the Lagrangian over $x$:
$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_{x \in D} \{ f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \}$
Fixing the coefficients $(\lambda, \nu)$ corresponds to a certain level of penalty; the infimum returns the optimal $x$ for that level of penalty, and $g(\lambda, \nu)$ is the corresponding value of the Lagrangian.
$g(\lambda, \nu)$ is a concave function (i.e. $-g$ is convex).

Example: minimizing $x^2$ subject to $x \ge 1$
Lagrangian dual function: $g(\lambda) = \inf_x (x^2 + \lambda(1 - x))$
Set the derivative to zero: $\frac{\partial}{\partial x}\big(x^2 + \lambda(1 - x)\big) = 2x - \lambda = 0 \Rightarrow x = \lambda/2$
Plug back into the Lagrangian: $g(\lambda) = \frac{\lambda^2}{4} + \lambda(1 - \lambda/2) = -\frac{\lambda^2}{4} + \lambda$
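The same derivation can be checked symbolically; a small sympy sketch, added here for illustration:

```python
# A small sympy check (not from the slides) of the dual function derivation above.
import sympy as sp

x, lam = sp.symbols('x lam', real=True)
L = x**2 + lam * (1 - x)                  # Lagrangian of: min x^2 s.t. 1 - x <= 0
x_star = sp.solve(sp.diff(L, x), x)[0]    # stationary point: x = lam/2
g = sp.simplify(L.subs(x, x_star))        # dual function g(lam) = -lam**2/4 + lam
print(x_star, g)
```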

Lower bounds on optimal value
The Lagrangian dual function gives us lower bounds on the optimal value $p^\star$ of the original problem: for any $\lambda \ge 0$ and for any $\nu$, we have $g(\lambda, \nu) \le p^\star$.

To see this, let $\tilde{x}$ be a feasible point of the original problem, thus $f_i(\tilde{x}) \le 0$ and $h_i(\tilde{x}) = 0$.
Then $L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_{i=1}^m \lambda_i f_i(\tilde{x}) + \sum_{i=1}^p \nu_i h_i(\tilde{x}) \le f_0(\tilde{x})$, as $\lambda_i f_i(\tilde{x}) \le 0$ and $\nu_i h_i(\tilde{x}) = 0$.

Now, $g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f_0(\tilde{x})$, as the infimum is computed over a set containing $\tilde{x}$.

The Lagrange dual problem
For each pair $(\lambda, \nu)$ with $\lambda \ge 0$, the Lagrange dual function gives a lower bound on the optimal value $p^\star$.
What is the tightest lower bound that can be achieved? We need to find the maximum. This gives us an optimization problem
maximize $g(\lambda, \nu)$
subject to $\lambda \ge 0$
with variables $(\lambda, \nu)$. It is called the Lagrange dual problem of the original optimization problem.
It is a convex optimisation problem, since it is equivalent to minimising $-g(\lambda, \nu)$, which is a convex function.
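Continuing the running example $\min x^2$ s.t. $x \ge 1$ from the earlier slides, the dual problem can be solved in closed form (this worked step is added here for illustration):
$$\max_{\lambda \ge 0} \; g(\lambda) = -\frac{\lambda^2}{4} + \lambda, \qquad g'(\lambda) = -\frac{\lambda}{2} + 1 = 0 \;\Rightarrow\; \lambda^\star = 2,$$
$$d^\star = g(2) = 1 = p^\star = \min_{x \ge 1} x^2 ,$$
so for this problem the dual bound is tight (strong duality holds).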

Derivation of the dual soft-margin SVM
Recall the soft-margin SVM:
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
Subject to $1 - y_i(\langle w, \phi(x_i)\rangle + b) \le \xi_i$ and $\xi_i \ge 0$ for all $i = 1, \dots, N$.
$\xi_i$ is called the slack variable; when it is positive, the margin $\gamma(x_i) < 1$.
The sum of slacks is to be minimized, so the objective still favours hyperplanes that separate the classes well.
The coefficient $C > 0$ controls the balance between maximizing the margin and the amount of slack needed.
[Figure: soft-margin SVM on 'AG' splice-site data (GC content before vs. after 'AG'), with the slack $\xi$ of a point inside the margin indicated]

Recipe to find the dual problem
Write down the Lagrangian.
Find its minimum by setting the derivatives w.r.t. the primal variables to zero.
Plug the solution $(w^\star, \xi^\star)$ back into the Lagrangian to obtain the dual function.
Write down the optimisation problem that maximises the dual function w.r.t. the dual variables.

Deriving the dual SVM
Primal soft-margin SVM problem:
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
Subject to $1 - y_i \langle w, \phi(x_i)\rangle \le \xi_i$ and $\xi_i \ge 0$ for all $i = 1, \dots, N$.
Above we have dropped the intercept of the hyperplane; instead we assume a constant feature $\phi_0(x) = \mathrm{const}$.
First write it in the standard form:
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
Subject to $1 - y_i \langle w, \phi(x_i)\rangle - \xi_i \le 0$ for all $i = 1, \dots, N$
$-\xi_i \le 0$ for all $i = 1, \dots, N$.

Lagrangian of the soft-margin SVM
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
Subject to $1 - y_i \langle w, \phi(x_i)\rangle - \xi_i \le 0$ and $-\xi_i \le 0$ for all $i = 1, \dots, N$.
We have an objective and two sets of inequality constraints; use $\lambda = (\alpha, \beta)$, where $\alpha = (\alpha_i)_{i=1}^N$ corresponds to the first set and $\beta = (\beta_i)_{i=1}^N$ to the second set of constraints.
The Lagrangian becomes
$L(w, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i + \sum_{i=1}^N \alpha_i (1 - y_i \langle w, \phi(x_i)\rangle - \xi_i) + \sum_{i=1}^N \beta_i (-\xi_i)$

Minimum of the Lagrangian
$L(w, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i + \sum_{i=1}^N \alpha_i (1 - y_i \langle w, \phi(x_i)\rangle - \xi_i) + \sum_{i=1}^N \beta_i (-\xi_i)$
Find the minimum of the Lagrangian by setting the derivatives w.r.t. the primal variables to zero:
$\nabla_w L(w, \xi, \alpha, \beta) = w - \sum_{i=1}^N \alpha_i y_i \phi(x_i) = 0 \;\Rightarrow\; w = \sum_{i=1}^N \alpha_i y_i \phi(x_i)$

$\frac{\partial}{\partial \xi_i} L(w, \xi, \alpha, \beta) = C - \alpha_i - \beta_i = 0$
Together with $\beta_i \ge 0$, this implies $\alpha_i \le C$ for all $i = 1, \dots, N$.

Lagrangian dual function
Plug the solution $w = \sum_{i=1}^N \alpha_i y_i \phi(x_i)$ back into the Lagrangian.
With $C - \alpha_i - \beta_i = 0$, the terms containing $\xi_i$ and $\beta_i$ cancel out:
$L(w, \xi, \alpha, \beta) = \frac{1}{2} \big\|\sum_{i=1}^N \alpha_i y_i \phi(x_i)\big\|^2 + \sum_{i=1}^N \alpha_i \big(1 - y_i \langle \textstyle\sum_{j=1}^N \alpha_j y_j \phi(x_j), \phi(x_i)\rangle\big)$

Expanding the squares and simplifying gives the dual function, which depends only on the dual variable $\alpha$:
$g(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle \phi(x_j), \phi(x_i)\rangle$

Dual SVM optimisation problem
Write down the optimisation problem that maximises the dual function w.r.t. the dual variables.
We have non-negativity $\alpha_i \ge 0$, as well as the upper bound derived above, $\alpha_i \le C$; the optimal dual solution is given by
maximize $g(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle \phi(x_j), \phi(x_i)\rangle$
subject to $0 \le \alpha_i \le C$, $i = 1, \dots, N$
Plug in a kernel $K(x_i, x_j) = \langle \phi(x_j), \phi(x_i)\rangle$ to obtain the dual SVM problem:
maximize $g(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $0 \le \alpha_i \le C$, $i = 1, \dots, N$
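A sketch of solving this dual QP numerically with cvxpy, using a linear kernel and synthetic data; the library, the data and the psd_wrap hint are assumptions of this example, not part of the lecture:

```python
# Sketch of the dual SVM QP above for a linear kernel K(x_i, x_j) = <x_i, x_j>.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
N = 50
X = rng.standard_normal((N, 2))
y = np.sign(X[:, 0] - X[:, 1])
C = 1.0

K = X @ X.T                                # kernel (Gram) matrix
Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j K(x_i, x_j), PSD

alpha = cp.Variable(N)
# psd_wrap tells cvxpy that Q is PSD, avoiding numerical PSD-check issues.
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q)))
constraints = [alpha >= 0, alpha <= C]
prob = cp.Problem(objective, constraints)
prob.solve()
print("dual optimum:", prob.value)
```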

Properties of convex optimisation problems
We will look at further concepts to understand the properties of convex optimisation problems:
Weak and strong duality
Duality gap
Complementary slackness
KKT conditions

Weak and strong duality
Let $p^\star$ and $d^\star$ denote the primal and dual optimal values of an optimization problem.
Weak duality $d^\star \le p^\star$ always holds, even when the primal optimization problem is non-convex.
Strong duality $d^\star = p^\star$ holds for special classes of convex optimization problems; LPs and QPs have strong duality.

Duality gap
A pair $x, (\lambda, \nu)$, where $x$ is primal feasible and $(\lambda, \nu)$ is dual feasible, is called a primal-dual feasible pair.
For a primal-dual feasible pair, the quantity $f_0(x) - g(\lambda, \nu)$ is called the duality gap.
A primal-dual feasible pair localizes the primal and dual optimal values into an interval, $g(\lambda, \nu) \le d^\star \le p^\star \le f_0(x)$, whose width is given by the duality gap.
If the duality gap is zero, we know that $x$ is primal optimal and $(\lambda, \nu)$ is dual optimal.
We can use the duality gap as a stopping criterion for optimisation.

Stopping criterion for optimization
Suppose the algorithm generates a sequence of primal feasible $x^{(k)}$ and dual feasible $(\lambda^{(k)}, \nu^{(k)})$ solutions for $k = 1, 2, \dots$
Then the duality gap can be used as the stopping criterion: e.g. stop when $f_0(x^{(k)}) - g(\lambda^{(k)}, \nu^{(k)}) \le \epsilon$, for some $\epsilon > 0$.
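A schematic Python sketch of this stopping rule; primal_value, dual_value and step are hypothetical callables standing in for an actual primal-dual algorithm:

```python
# Illustrative sketch (not from the slides) of using the duality gap as a
# stopping test inside a generic primal-dual optimization loop.
def optimize_with_gap_stopping(primal_value, dual_value, step, x0, lam0,
                               eps=1e-6, max_iter=1000):
    """primal_value(x) = f0(x); dual_value(lam) = g(lam); step(x, lam) returns the
    next primal-feasible x and dual-feasible lam. All callables are assumed given."""
    x, lam = x0, lam0
    for k in range(max_iter):
        gap = primal_value(x) - dual_value(lam)   # f0(x_k) - g(lam_k) >= p* - d* >= 0
        if gap <= eps:                            # certificate of eps-suboptimality
            break
        x, lam = step(x, lam)
    return x, lam
```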

Complementary slackness
Let $x^\star$ be primal optimal (and thus also feasible), let $(\lambda^\star, \nu^\star)$ be dual optimal (and thus also feasible), and let strong duality hold, i.e. $d^\star = p^\star$.
Then, at the optimum,
$d^\star = g(\lambda^\star, \nu^\star) = \inf_x \{ f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x) \} \le f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star) \le f_0(x^\star) = p^\star$
First inequality: definition of the infimum; second inequality: $x^\star$ is a feasible solution.
Due to $d^\star = p^\star$, the inequalities must hold as equalities, so the penalty terms must equate to zero.

We have $\sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star) = 0$.
Since $h_i(x^\star) = 0$, $i = 1, \dots, p$, we conclude that $\sum_{i=1}^m \lambda_i^\star f_i(x^\star) = 0$, and since each term is non-positive, $\lambda_i^\star f_i(x^\star) = 0$, $i = 1, \dots, m$.
This condition is called complementary slackness.

Intuition: at the optimum there cannot be slack both in the dual variable ($\lambda_i^\star > 0$) and in the constraint ($f_i(x^\star) < 0$) at the same time:
$\lambda_i^\star > 0 \Rightarrow f_i(x^\star) = 0$ and $f_i(x^\star) < 0 \Rightarrow \lambda_i^\star = 0$
At the optimum, positive Lagrange multipliers are associated with active constraints.

Karush-Kuhn-Tucker (KKT) conditions
Summarizing, at the optimum the following conditions must hold:
Inequality constraints satisfied: $f_i(x^\star) \le 0$, $i = 1, \dots, m$
Equality constraints satisfied: $h_i(x^\star) = 0$, $i = 1, \dots, p$
Non-negativity of the dual variables of the inequality constraints: $\lambda_i^\star \ge 0$, $i = 1, \dots, m$
Complementary slackness: $\lambda_i^\star f_i(x^\star) = 0$, $i = 1, \dots, m$
Derivative of the Lagrangian vanishes: $\nabla f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(x^\star) + \sum_{i=1}^p \nu_i^\star \nabla h_i(x^\star) = 0$
These conditions are called the Karush-Kuhn-Tucker conditions.
For convex problems, satisfying the KKT conditions is sufficient as well as necessary for optimality.
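As an added sanity check (not on the slide), the KKT conditions can be verified for the running example $\min x^2$ s.t. $1 - x \le 0$, whose primal-dual solution $x^\star = 1$, $\lambda^\star = 2$ was derived above:
$$f_1(x^\star) = 1 - x^\star = 0 \le 0, \qquad \lambda^\star = 2 \ge 0, \qquad \lambda^\star f_1(x^\star) = 0,$$
$$\nabla f_0(x^\star) + \lambda^\star \nabla f_1(x^\star) = 2x^\star + \lambda^\star \cdot (-1) = 2 - 2 = 0,$$
so all KKT conditions hold, confirming optimality.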

Complementary slackness for the soft-margin SVM
Recall the Lagrangian of the soft-margin SVM:
$L(w, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i + \sum_{i=1}^N \alpha_i (1 - y_i \langle w, \phi(x_i)\rangle - \xi_i) + \sum_{i=1}^N \beta_i (-\xi_i)$
The complementary slackness conditions give, for all $i = 1, \dots, N$:
$\alpha_i^\star (1 - y_i \langle w^\star, \phi(x_i)\rangle - \xi_i^\star) = 0$
$\beta_i^\star \xi_i^\star = 0$
It can be shown that, together with the constraints of the primal and dual problems, the above implies:
A: $\alpha_i = C \Rightarrow y_i \langle w, \phi(x_i)\rangle \le 1$
B: $0 < \alpha_i < C \Rightarrow y_i \langle w, \phi(x_i)\rangle = 1$
C: $\alpha_i = 0 \Rightarrow y_i \langle w, \phi(x_i)\rangle \ge 1$

The data points are classified by their difficulty:
A: too small a margin, a difficult data point: $\alpha_i = C \Rightarrow y_i \langle w, \phi(x_i)\rangle \le 1$
B: at the margin boundary: $0 < \alpha_i < C \Rightarrow y_i \langle w, \phi(x_i)\rangle = 1$
C: more than enough margin, an easy data point: $\alpha_i = 0 \Rightarrow y_i \langle w, \phi(x_i)\rangle \ge 1$
[Figure: data points labelled A, B, C relative to the hyperplanes $\langle w, x\rangle + b = -1$, $\langle w, x\rangle + b = 0$, $\langle w, x\rangle + b = +1$, with slacks $\xi > 0$ and $\xi > 1$ indicated]
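A small numpy sketch (names and tolerance are assumed, not from the slides) that assigns training points to the categories A, B, C from a dual solution $\alpha$:

```python
# Sketch: classify training points into the categories A, B, C above
# from the dual variables alpha of the soft-margin SVM.
import numpy as np

def categorize_points(alpha, C, tol=1e-6):
    """Return 'A' (alpha_i = C, difficult point), 'B' (0 < alpha_i < C, on the
    margin boundary) or 'C' (alpha_i = 0, easy point) for each training point."""
    alpha = np.asarray(alpha)
    cats = np.full(alpha.shape, 'B', dtype='<U1')
    cats[alpha <= tol] = 'C'
    cats[alpha >= C - tol] = 'A'
    return cats

# Example with made-up dual variables and C = 1:
print(categorize_points([0.0, 0.3, 1.0, 1e-9], C=1.0))   # ['C' 'B' 'A' 'C']
```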