ICS-E4030 Kernel Methods in Machine Learning

Lecture 3: Convex optimization and duality
Juho Rousu, 28 September 2016

This lecture
A crash course on convex optimisation
Lagrangian duality
Derivation of the SVM dual
Additional reading on convex optimisation: Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004

Convex optimization problem
Standard form of a convex optimization problem:
minimize $f_0(x)$
subject to $f_i(x) \le 0$, $i = 1, \dots, m$
$a_i^T x = b_i$, $i = 1, \dots, p$
The objective $f_0$ is a convex function.
The constraint functions $f_1, \dots, f_m$ defining the inequality constraints are convex functions.
The equality constraints are affine (linear).
A value of $x$ that satisfies the constraints is called feasible; the set of all feasible points is called the feasible set.
The feasible set defined by the above constraints is convex.
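For illustration (not part of the slides), here is a minimal sketch of a problem in this standard form, written with the cvxpy modelling library; the objective, constraints and data below are made up for the example.

```python
# A minimal sketch (assumed setup, not from the lecture): a convex problem in
# standard form, minimize f0(x) s.t. f1(x) <= 0 and an affine equality constraint.
import cvxpy as cp
import numpy as np

n = 3
A = np.array([[1.0, 1.0, 1.0]])           # one affine equality constraint a^T x = b
b = np.array([1.0])

x = cp.Variable(n)
f0 = cp.sum_squares(x)                     # convex objective f0(x) = ||x||^2
constraints = [cp.norm(x, 2) - 2 <= 0,     # convex inequality constraint f1(x) <= 0
               A @ x == b]                 # affine equality constraint
prob = cp.Problem(cp.Minimize(f0), constraints)
prob.solve()
print(prob.value, x.value)
```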

Why is convexity useful?
[Figures: a convex vs. a non-convex objective; a convex vs. a non-convex feasible set]

Why convexity?
Convex objective: we can always improve a sub-optimal objective value by stepping in the direction of the negative gradient; all local optima are global optima.
Convex constraints, i.e. a convex feasible set: any point between two feasible points is feasible, and updates remain inside the feasible set as long as the update direction points towards a feasible point. This enables fast algorithms based on the principle of feasible descent.
The graph of a convex function $f$ lies below the line segment connecting $(x, f(x))$ and $(y, f(y))$.
A convex set $C$ contains all line segments between any pair of its distinct points.

Linear optimization problem (LP)
A convex optimization problem with an affine objective and affine constraints is called a linear program (LP). A general linear program has the form
minimize $c^T x + d$
subject to $Gx \preceq h$
$Ax = b$
Geometric interpretation: the feasible set is a polyhedron, given by the intersection of a set of half-spaces ($Gx \preceq h$) and a set of hyperplanes ($Ax = b$).
The objective decreases fastest when moving in the direction $-\nabla(c^T x + d) = -c$.

L1-regularized SVM as an LP
$\min_w \sum_{i=1}^N \ell_{\mathrm{Hinge}}(y_i, \langle w, \phi(x_i)\rangle) + \lambda \|w\|_1$
Express the hinge loss through an upper bound (slack) $\xi_i$ and the inequality constraint $1 - y_i w^T \phi(x_i) \le \xi_i$.
Express the L1 norm of the weight vector, $\|w\|_1 = \sum_{j=1}^D |w_j|$, as a sum of magnitudes $m_j$ with constraints $w_j \le m_j$, $-w_j \le m_j$.
We obtain the LP
$\min_{w,\xi,m} \sum_{i=1}^N \xi_i + \lambda \sum_{j=1}^D m_j$
subject to $y_i w^T \phi(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \dots, N$
$w_j \le m_j$, $-w_j \le m_j$, $m_j \ge 0$, $j = 1, \dots, D$
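A hedged sketch of this LP, assuming the identity feature map $\phi(x) = x$, a synthetic data set and the cvxpy library (none of which are prescribed by the slides):

```python
# Sketch of the L1-regularized SVM LP from the slide, with phi(x) = x and
# synthetic data; names (X, y, lam) are illustrative, not from the slides.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, D = 40, 5
X = rng.standard_normal((N, D))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(N))   # labels in {-1, +1}
lam = 0.1

w = cp.Variable(D)
xi = cp.Variable(N)        # slacks: upper bounds on the hinge loss
m = cp.Variable(D)         # magnitudes m_j >= |w_j|

objective = cp.Minimize(cp.sum(xi) + lam * cp.sum(m))
constraints = [cp.multiply(y, X @ w) >= 1 - xi,        # y_i w^T x_i >= 1 - xi_i
               xi >= 0,
               w <= m, -w <= m, m >= 0]
prob = cp.Problem(objective, constraints)
prob.solve()
print("optimal value:", prob.value)
```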

Quadratic optimization problem (QP)
A quadratic program (QP) has a quadratic objective function and affine constraint functions. A QP can be expressed in the form
minimize $(1/2)\, x^T P x + q^T x + r$
subject to $Gx \preceq h$
$Ax = b$
$P \in \mathbb{R}^{n \times n}$ is a positive semi-definite matrix, $G \in \mathbb{R}^{m \times n}$, $A \in \mathbb{R}^{p \times n}$.
The feasible set of a QP is a polyhedron.
The objective decreases fastest when moving in the direction $-\nabla f_0(x) = -(Px + q)$.
When $P = 0$, the QP reduces to an LP.

Ridge regression as a QP
$\min_w \sum_{i=1}^N (\langle w, \phi(x_i)\rangle - y_i)^2 + \lambda \|w\|^2$
Express the norm of the weight vector as $\|w\|_2^2 = w^T I_D w$.
Write the squared error in vector form: $\langle y - Xw, y - Xw\rangle = y^T y - 2 w^T X^T y + w^T X^T X w$.
We get $P = X^T X + \lambda I_D$ for the quadratic part and $q = -2 X^T y$ for the linear part:
$\min_w \; w^T (X^T X + \lambda I_D) w - 2 w^T X^T y + y^T y$
This is an example of an unconstrained QP.
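Since this QP is unconstrained, its minimizer is found by setting the gradient to zero, giving $w^\star = (X^T X + \lambda I_D)^{-1} X^T y$. A small numpy sketch checking this on made-up data (data and names are illustrative only):

```python
# Sketch: the unconstrained ridge QP min_w w^T(X^T X + lam*I)w - 2 w^T X^T y + y^T y
# is minimized where its gradient vanishes, i.e. at w* = (X^T X + lam*I)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 4, 0.5
X = rng.standard_normal((N, D))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(N)

P = X.T @ X + lam * np.eye(D)             # quadratic part
q = -2 * X.T @ y                           # linear part
w_star = np.linalg.solve(P, X.T @ y)       # closed-form minimizer

grad = 2 * P @ w_star + q                  # gradient of the QP objective at w*
print(np.allclose(grad, 0))                # True: w* is the unconstrained optimum
```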

Soft-margin SVM as a QP
$\min_w \sum_{i=1}^N \ell_{\mathrm{Hinge}}(\langle w, \phi(x_i)\rangle, y_i) + \frac{\lambda}{2} \|w\|^2$
Express the hinge loss through an upper bound (slack) $\xi_i$ and the inequality constraint $1 - y_i w^T \phi(x_i) \le \xi_i$.
Express the norm of the weight vector as $\|w\|_2^2 = w^T I_D w$.
We get $P = I_D$ (quadratic part in $w$) and $q = \mathbf{1}_N$, the vector of ones (linear part in $\xi$).
We obtain the QP
$\min_{w,\xi} \mathbf{1}_N^T \xi + \frac{\lambda}{2} w^T I_D w$
subject to $y_i w^T \phi(x_i) + \xi_i \ge 1$, $\xi_i \ge 0$, $i = 1, \dots, N$
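A sketch of this primal QP with cvxpy, again assuming $\phi(x) = x$ and synthetic data (illustrative only, not from the slides):

```python
# Sketch of the primal soft-margin SVM QP from the slide (phi(x) = x, no intercept);
# the data and the value of lam are made up for the example.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
N, D = 60, 2
X = rng.standard_normal((N, D))
y = np.sign(X[:, 0] + X[:, 1])
lam = 1.0

w = cp.Variable(D)
xi = cp.Variable(N)
objective = cp.Minimize(cp.sum(xi) + (lam / 2) * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w) + xi >= 1, xi >= 0]   # y_i w^T x_i + xi_i >= 1
prob = cp.Problem(objective, constraints)
prob.solve()
print("primal optimum:", prob.value)
```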

Duality
The principle of viewing an optimization problem from two interchangeable views, the primal and the dual view.
Intuitively:
minimization of the primal objective ↔ maximization of the dual objective
primal constraints ↔ dual variables
dual constraints ↔ primal variables
Motivation in this course: learning with features (primal) ↔ learning with kernels (dual)

Duality: Lagrangian
Consider a convex optimization problem with variable $x \in \mathbb{R}^n$:
minimize $f_0(x)$
subject to $f_i(x) \le 0$, $i = 1, \dots, m$
$h_i(x) = 0$, $i = 1, \dots, p$
We will call this the primal optimisation problem.
We take the constraints into account by augmenting the objective with a weighted sum of the constraints:
$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x)$
The resulting function is the Lagrangian associated with the optimization problem. Its domain is $\mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p$.

Example: minimizing $f_0(x) = x^2$ subject to $f_1(x) = 1 - x \le 0$
Lagrangian: $L(x, \lambda) = x^2 + \lambda(1 - x)$

Lagrangian dual function
The Lagrangian dual function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ is the minimum value of the Lagrangian over $x$:
$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_{x \in D} \{ f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \}$
Fixing the coefficients $(\lambda, \nu)$ corresponds to a certain level of penalty; the infimum returns the optimal $x$ for that level of penalty, and $g(\lambda, \nu)$ is the corresponding value of the Lagrangian.
$g(\lambda, \nu)$ is a concave function (i.e. $-g$ is convex).

Example: minimizing $x^2$ subject to $x \ge 1$
Lagrangian dual function: $g(\lambda) = \inf_x (x^2 + \lambda(1 - x))$
Set the derivative to zero: $\frac{\partial}{\partial x}\big(x^2 + \lambda(1 - x)\big) = 2x - \lambda = 0 \Rightarrow x = \lambda/2$
Plug back into the Lagrangian: $g(\lambda) = \frac{\lambda^2}{4} + \lambda(1 - \lambda/2) = -\frac{\lambda^2}{4} + \lambda$
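The same derivation can be checked symbolically; a small sympy sketch, added here for illustration:

```python
# A small sympy check (not from the slides) of the dual function derivation above.
import sympy as sp

x, lam = sp.symbols('x lam', real=True)
L = x**2 + lam * (1 - x)                  # Lagrangian of: min x^2 s.t. 1 - x <= 0
x_star = sp.solve(sp.diff(L, x), x)[0]    # stationary point: x = lam/2
g = sp.simplify(L.subs(x, x_star))        # dual function g(lam) = -lam**2/4 + lam
print(x_star, g)
```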

Lower bounds on optimal value
The Lagrangian dual function gives us lower bounds on the optimal value $p^\star$ of the original problem: for any $\lambda \ge 0$ and for any $\nu$, we have $g(\lambda, \nu) \le p^\star$.

To see this, let $\tilde{x}$ be a feasible point of the original problem, thus $f_i(\tilde{x}) \le 0$ and $h_i(\tilde{x}) = 0$.
Then $L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_{i=1}^m \lambda_i f_i(\tilde{x}) + \sum_{i=1}^p \nu_i h_i(\tilde{x}) \le f_0(\tilde{x})$, as $\lambda_i f_i(\tilde{x}) \le 0$ and $\nu_i h_i(\tilde{x}) = 0$.

Now, $g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f_0(\tilde{x})$, as the infimum is computed over a set containing $\tilde{x}$.

The Lagrange dual problem
For each pair $(\lambda, \nu)$ with $\lambda \ge 0$, the Lagrange dual function gives a lower bound on the optimal value $p^\star$.
What is the tightest lower bound that can be achieved? We need to find the maximum. This gives us an optimization problem
maximize $g(\lambda, \nu)$
subject to $\lambda \ge 0$
with variables $(\lambda, \nu)$. It is called the Lagrange dual problem of the original optimization problem.
It is a convex optimisation problem, since it is equivalent to minimising $-g(\lambda, \nu)$, which is a convex function.
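Continuing the running example $\min x^2$ s.t. $x \ge 1$ from the earlier slides, the dual problem can be solved in closed form (this worked step is added here for illustration):
$$\max_{\lambda \ge 0} \; g(\lambda) = -\frac{\lambda^2}{4} + \lambda, \qquad g'(\lambda) = -\frac{\lambda}{2} + 1 = 0 \;\Rightarrow\; \lambda^\star = 2,$$
$$d^\star = g(2) = 1 = p^\star = \min_{x \ge 1} x^2 ,$$
so for this problem the dual bound is tight (strong duality holds).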

Derivation of the dual soft-margin SVM
Recall the soft-margin SVM:
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
Subject to $1 - y_i(\langle w, \phi(x_i)\rangle + b) \le \xi_i$ and $\xi_i \ge 0$ for all $i = 1, \dots, N$.
$\xi_i$ is called the slack variable; when it is positive, the margin $\gamma(x_i) < 1$.
The sum of slacks is to be minimized, so the objective still favours hyperplanes that separate the classes well.
The coefficient $C > 0$ controls the balance between maximizing the margin and the amount of slack needed.
[Figure: soft-margin SVM on 'AG' splice-site data (GC content before vs. after 'AG'), with the slack $\xi$ of a point inside the margin indicated]

Recipe to find the dual problem
Write down the Lagrangian.
Find its minimum by setting the derivatives w.r.t. the primal variables to zero.
Plug the solution $(w^\star, \xi^\star)$ back into the Lagrangian to obtain the dual function.
Write down the optimisation problem that maximises the dual function w.r.t. the dual variables.

Deriving the dual SVM
Primal soft-margin SVM problem:
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
Subject to $1 - y_i \langle w, \phi(x_i)\rangle \le \xi_i$ and $\xi_i \ge 0$ for all $i = 1, \dots, N$.
Above we have dropped the intercept of the hyperplane; instead we assume a constant feature $\phi_0(x) = \mathrm{const}$.
First write it in the standard form:
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
Subject to $1 - y_i \langle w, \phi(x_i)\rangle - \xi_i \le 0$ for all $i = 1, \dots, N$
$-\xi_i \le 0$ for all $i = 1, \dots, N$.

Lagrangian of the soft-margin SVM
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
Subject to $1 - y_i \langle w, \phi(x_i)\rangle - \xi_i \le 0$ and $-\xi_i \le 0$ for all $i = 1, \dots, N$.
We have an objective and two sets of inequality constraints; use $\lambda = (\alpha, \beta)$, where $\alpha = (\alpha_i)_{i=1}^N$ corresponds to the first set and $\beta = (\beta_i)_{i=1}^N$ to the second set of constraints.
The Lagrangian becomes
$L(w, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i + \sum_{i=1}^N \alpha_i (1 - y_i \langle w, \phi(x_i)\rangle - \xi_i) + \sum_{i=1}^N \beta_i (-\xi_i)$

Minimum of the Lagrangian
$L(w, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i + \sum_{i=1}^N \alpha_i (1 - y_i \langle w, \phi(x_i)\rangle - \xi_i) + \sum_{i=1}^N \beta_i (-\xi_i)$
Find the minimum of the Lagrangian by setting the derivatives w.r.t. the primal variables to zero:
$\nabla_w L(w, \xi, \alpha, \beta) = w - \sum_{i=1}^N \alpha_i y_i \phi(x_i) = 0 \;\Rightarrow\; w = \sum_{i=1}^N \alpha_i y_i \phi(x_i)$

$\frac{\partial}{\partial \xi_i} L(w, \xi, \alpha, \beta) = C - \alpha_i - \beta_i = 0$
Together with $\beta_i \ge 0$, this implies $\alpha_i \le C$ for all $i = 1, \dots, N$.

Lagrangian dual function
Plug the solution $w = \sum_{i=1}^N \alpha_i y_i \phi(x_i)$ back into the Lagrangian.
With $C - \alpha_i - \beta_i = 0$, the terms containing $\xi_i$ and $\beta_i$ cancel out:
$L(w, \xi, \alpha, \beta) = \frac{1}{2} \big\|\sum_{i=1}^N \alpha_i y_i \phi(x_i)\big\|^2 + \sum_{i=1}^N \alpha_i \big(1 - y_i \langle \textstyle\sum_{j=1}^N \alpha_j y_j \phi(x_j), \phi(x_i)\rangle\big)$

Expanding the squares and simplifying gives the dual function, which depends only on the dual variable $\alpha$:
$g(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle \phi(x_j), \phi(x_i)\rangle$

Dual SVM optimisation problem
Write down the optimisation problem that maximises the dual function w.r.t. the dual variables.
We have non-negativity $\alpha_i \ge 0$, as well as the upper bound derived above, $\alpha_i \le C$; the optimal dual solution is given by
maximize $g(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle \phi(x_j), \phi(x_i)\rangle$
subject to $0 \le \alpha_i \le C$, $i = 1, \dots, N$
Plug in a kernel $K(x_i, x_j) = \langle \phi(x_j), \phi(x_i)\rangle$ to obtain the dual SVM problem:
maximize $g(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $0 \le \alpha_i \le C$, $i = 1, \dots, N$
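A sketch of solving this dual QP numerically with cvxpy, using a linear kernel and synthetic data; the library, the data and the psd_wrap hint are assumptions of this example, not part of the lecture:

```python
# Sketch of the dual SVM QP above for a linear kernel K(x_i, x_j) = <x_i, x_j>.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
N = 50
X = rng.standard_normal((N, 2))
y = np.sign(X[:, 0] - X[:, 1])
C = 1.0

K = X @ X.T                                # kernel (Gram) matrix
Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j K(x_i, x_j), PSD

alpha = cp.Variable(N)
# psd_wrap tells cvxpy that Q is PSD, avoiding numerical PSD-check issues.
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q)))
constraints = [alpha >= 0, alpha <= C]
prob = cp.Problem(objective, constraints)
prob.solve()
print("dual optimum:", prob.value)
```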

Properties of convex optimisation problems
We will look at further concepts to understand the properties of convex optimisation problems:
Weak and strong duality
Duality gap
Complementary slackness
KKT conditions

Weak and strong duality
Let $p^\star$ and $d^\star$ denote the primal and dual optimal values of an optimization problem.
Weak duality $d^\star \le p^\star$ always holds, even when the primal optimization problem is non-convex.
Strong duality $d^\star = p^\star$ holds for special classes of convex optimization problems; LPs and QPs have strong duality.

Duality gap
A pair $x, (\lambda, \nu)$, where $x$ is primal feasible and $(\lambda, \nu)$ is dual feasible, is called a primal-dual feasible pair.
For a primal-dual feasible pair, the quantity $f_0(x) - g(\lambda, \nu)$ is called the duality gap.
A primal-dual feasible pair localizes the primal and dual optimal values into an interval, $g(\lambda, \nu) \le d^\star \le p^\star \le f_0(x)$, whose width is given by the duality gap.
If the duality gap is zero, we know that $x$ is primal optimal and $(\lambda, \nu)$ is dual optimal.
We can use the duality gap as a stopping criterion for optimisation.

Stopping criterion for optimization
Suppose the algorithm generates a sequence of primal feasible $x^{(k)}$ and dual feasible $(\lambda^{(k)}, \nu^{(k)})$ solutions for $k = 1, 2, \dots$
Then the duality gap can be used as the stopping criterion: e.g. stop when $f_0(x^{(k)}) - g(\lambda^{(k)}, \nu^{(k)}) \le \epsilon$, for some $\epsilon > 0$.
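A schematic Python sketch of this stopping rule; primal_value, dual_value and step are hypothetical callables standing in for an actual primal-dual algorithm:

```python
# Illustrative sketch (not from the slides) of using the duality gap as a
# stopping test inside a generic primal-dual optimization loop.
def optimize_with_gap_stopping(primal_value, dual_value, step, x0, lam0,
                               eps=1e-6, max_iter=1000):
    """primal_value(x) = f0(x); dual_value(lam) = g(lam); step(x, lam) returns the
    next primal-feasible x and dual-feasible lam. All callables are assumed given."""
    x, lam = x0, lam0
    for k in range(max_iter):
        gap = primal_value(x) - dual_value(lam)   # f0(x_k) - g(lam_k) >= p* - d* >= 0
        if gap <= eps:                            # certificate of eps-suboptimality
            break
        x, lam = step(x, lam)
    return x, lam
```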

Complementary slackness
Let $x^\star$ be primal optimal (and thus also feasible), let $(\lambda^\star, \nu^\star)$ be dual optimal (and thus also feasible), and let strong duality hold, i.e. $d^\star = p^\star$.
Then, at the optimum,
$d^\star = g(\lambda^\star, \nu^\star) = \inf_x \{ f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x) \} \le f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star) \le f_0(x^\star) = p^\star$
First inequality: definition of the infimum; second inequality: $x^\star$ is a feasible solution.
Due to $d^\star = p^\star$, the inequalities must hold as equalities, so the penalty terms must equate to zero.

We have $\sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star) = 0$.
Since $h_i(x^\star) = 0$, $i = 1, \dots, p$, we conclude that $\sum_{i=1}^m \lambda_i^\star f_i(x^\star) = 0$, and since each term is non-positive, $\lambda_i^\star f_i(x^\star) = 0$, $i = 1, \dots, m$.
This condition is called complementary slackness.

Intuition: at the optimum there cannot be slack both in the dual variable ($\lambda_i^\star > 0$) and in the constraint ($f_i(x^\star) < 0$) at the same time:
$\lambda_i^\star > 0 \Rightarrow f_i(x^\star) = 0$ and $f_i(x^\star) < 0 \Rightarrow \lambda_i^\star = 0$
At the optimum, positive Lagrange multipliers are associated with active constraints.

Karush-Kuhn-Tucker (KKT) conditions
Summarizing, at the optimum the following conditions must hold:
Inequality constraints satisfied: $f_i(x^\star) \le 0$, $i = 1, \dots, m$
Equality constraints satisfied: $h_i(x^\star) = 0$, $i = 1, \dots, p$
Non-negativity of the dual variables of the inequality constraints: $\lambda_i^\star \ge 0$, $i = 1, \dots, m$
Complementary slackness: $\lambda_i^\star f_i(x^\star) = 0$, $i = 1, \dots, m$
Derivative of the Lagrangian vanishes: $\nabla f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(x^\star) + \sum_{i=1}^p \nu_i^\star \nabla h_i(x^\star) = 0$
These conditions are called the Karush-Kuhn-Tucker conditions.
For convex problems, satisfying the KKT conditions is sufficient as well as necessary for optimality.
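As an added sanity check (not on the slide), the KKT conditions can be verified for the running example $\min x^2$ s.t. $1 - x \le 0$, whose primal-dual solution $x^\star = 1$, $\lambda^\star = 2$ was derived above:
$$f_1(x^\star) = 1 - x^\star = 0 \le 0, \qquad \lambda^\star = 2 \ge 0, \qquad \lambda^\star f_1(x^\star) = 0,$$
$$\nabla f_0(x^\star) + \lambda^\star \nabla f_1(x^\star) = 2x^\star + \lambda^\star \cdot (-1) = 2 - 2 = 0,$$
so all KKT conditions hold, confirming optimality.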

Complementary slackness for the soft-margin SVM
Recall the Lagrangian of the soft-margin SVM:
$L(w, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i + \sum_{i=1}^N \alpha_i (1 - y_i \langle w, \phi(x_i)\rangle - \xi_i) + \sum_{i=1}^N \beta_i (-\xi_i)$
The complementary slackness conditions give, for all $i = 1, \dots, N$:
$\alpha_i^\star (1 - y_i \langle w^\star, \phi(x_i)\rangle - \xi_i^\star) = 0$
$\beta_i^\star \xi_i^\star = 0$
It can be shown that, together with the constraints of the primal and dual problems, the above implies:
A: $\alpha_i = C \Rightarrow y_i \langle w, \phi(x_i)\rangle \le 1$
B: $0 < \alpha_i < C \Rightarrow y_i \langle w, \phi(x_i)\rangle = 1$
C: $\alpha_i = 0 \Rightarrow y_i \langle w, \phi(x_i)\rangle \ge 1$

The data points are classified by their difficulty:
A: too small a margin, a difficult data point: $\alpha_i = C \Rightarrow y_i \langle w, \phi(x_i)\rangle \le 1$
B: at the margin boundary: $0 < \alpha_i < C \Rightarrow y_i \langle w, \phi(x_i)\rangle = 1$
C: more than enough margin, an easy data point: $\alpha_i = 0 \Rightarrow y_i \langle w, \phi(x_i)\rangle \ge 1$
[Figure: data points labelled A, B, C relative to the hyperplanes $\langle w, x\rangle + b = -1$, $\langle w, x\rangle + b = 0$, $\langle w, x\rangle + b = +1$, with slacks $\xi > 0$ and $\xi > 1$ indicated]
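A small numpy sketch (names and tolerance are assumed, not from the slides) that assigns training points to the categories A, B, C from a dual solution $\alpha$:

```python
# Sketch: classify training points into the categories A, B, C above
# from the dual variables alpha of the soft-margin SVM.
import numpy as np

def categorize_points(alpha, C, tol=1e-6):
    """Return 'A' (alpha_i = C, difficult point), 'B' (0 < alpha_i < C, on the
    margin boundary) or 'C' (alpha_i = 0, easy point) for each training point."""
    alpha = np.asarray(alpha)
    cats = np.full(alpha.shape, 'B', dtype='<U1')
    cats[alpha <= tol] = 'C'
    cats[alpha >= C - tol] = 'A'
    return cats

# Example with made-up dual variables and C = 1:
print(categorize_points([0.0, 0.3, 1.0, 1e-9], C=1.0))   # ['C' 'B' 'A' 'C']
```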