
CS-E4830 Kernel Methods in Machine Learning
Lecture 3: Convex optimization and duality
Juho Rousu, 27 September 2017

This lecture
A crash course on convex optimisation, Lagrangian duality, and the derivation of the SVM dual. Additional reading on convex optimisation: Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

Convex functions
A function f : ℝⁿ → ℝ is convex if for all x, y and 0 ≤ θ ≤ 1 we have f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y). Geometrical interpretation: the graph of the function lies below the line segment from (x, f(x)) to (y, f(y)). A function f is strictly convex if the strict inequality holds above, and concave if −f is convex.

First order conditions
Suppose f : ℝⁿ → ℝ is differentiable. Then f is convex if and only if f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) holds for all x, y ∈ ℝⁿ. The right-hand side, the first order Taylor approximation of f at x, is a global underestimator of f. Geometrical interpretation: a convex function lies above each of its tangents. Corresponding conditions can be written for strictly convex functions (replace ≥ with >) and concave functions (replace ≥ with ≤).
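As a small numerical sanity check (an addition, not part of the original slides), the first order condition can be tested on a known convex function; the log-sum-exp function and the random test points below are illustrative assumptions.

```python
import numpy as np

# Check f(y) >= f(x) + grad f(x)^T (y - x) for the convex log-sum-exp function
# f(x) = log(sum_i exp(x_i)), whose gradient is the softmax.
def f(x):
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lower_bound = f(x) + grad_f(x) @ (y - x)   # first order Taylor approximation at x
    assert f(y) >= lower_bound - 1e-12          # global underestimator property
print("first order condition held on all sampled pairs")
```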

Second order conditions
Assume f : ℝⁿ → ℝ is twice differentiable. Then f is convex if and only if its Hessian matrix (the matrix of second derivatives) H = ∇²f(x), with entries H_ij = ∂²f(x)/∂x_i∂x_j for i, j = 1, ..., n, is positive semi-definite: xᵀHx ≥ 0 for all x. Geometrically, the condition means that the function has non-negative curvature. Strict convexity is only partially characterized by second order conditions: if H = ∇²f(x) is positive definite, i.e. xᵀHx > 0 for all x ≠ 0, then f is strictly convex; the converse does not hold in general. For a function defined on ℝ, the condition reduces to f''(x) ≥ 0, that is, the second derivative is non-negative. Analogous conditions can be written for (strictly) concave functions and negative (semi-)definite Hessians.
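A minimal sketch of checking the second order condition numerically, assuming NumPy and a made-up symmetric matrix P acting as the constant Hessian of a quadratic function f(x) = (1/2)xᵀPx + qᵀx:

```python
import numpy as np

P = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # example symmetric matrix (an assumption)

eigenvalues = np.linalg.eigvalsh(P)  # eigenvalues of the symmetric Hessian
print("Hessian eigenvalues:", eigenvalues)
print("convex (PSD Hessian):", np.all(eigenvalues >= -1e-12))
print("strictly convex (PD Hessian):", np.all(eigenvalues > 0))
```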

Operations that preserve convexity of functions
Nonnegative weighted sums: a nonnegative weighted sum of convex functions, f = w_1 f_1 + ... + w_m f_m with w_j ≥ 0, is convex. Similarly, a nonnegative weighted sum of concave functions is concave. These properties extend to infinite sums and integrals. Pointwise maximum and supremum: the pointwise maximum f(x) = max(f_1(x), f_2(x), ..., f_m(x)) of a set f_1, ..., f_m of convex functions is convex, and the pointwise supremum of an infinite set of convex functions is convex. Similarly, the pointwise minimum (infimum) of concave functions is concave.

Convex sets
A line segment between x_1 ∈ ℝⁿ and x_2 ∈ ℝⁿ is defined as all points that satisfy x = θx_1 + (1 − θ)x_2, 0 ≤ θ ≤ 1. A convex set contains the line segment between any two distinct points in the set: x_1, x_2 ∈ C, 0 ≤ θ ≤ 1 ⟹ θx_1 + (1 − θ)x_2 ∈ C. (Figure: examples of convex and non-convex sets. Q: Which ones are convex?)

Operations that preserve convexity of sets
There are two main ways of establishing the convexity of a set C: 1. apply the definition of convexity: x_1, x_2 ∈ C, 0 ≤ θ ≤ 1 ⟹ θx_1 + (1 − θ)x_2 ∈ C; or 2. show that the set can be obtained from simpler convex sets by operations that preserve convexity, most importantly: intersection: if S_1, S_2 are convex, their intersection S_1 ∩ S_2 is convex; affine functions: if S is convex and f(x) = Ax + b is affine, the image of S under f, f(S) = {f(x) | x ∈ S}, is convex.

Convex optimization problem
Standard form of a convex optimization problem:
min_{x ∈ D} f_0(x)
s.t. f_i(x) ≤ 0, i = 1, ..., m
h_i(x) = 0, i = 1, ..., p
The problem is composed of the following components: the variable x ∈ D from a domain D (D = ℝⁿ is typical); the objective function f_0 : ℝⁿ → ℝ to be minimized, a convex function of the variable x; the constraint functions f_i : ℝⁿ → ℝ related to the inequality constraints, convex functions of x; and the constraint functions h_i(x) = a_iᵀx − b_i related to the equality constraints, affine (linear) functions of x.

Convex optimization problem
Standard form (as above):
min_{x ∈ D} f_0(x)
s.t. f_i(x) ≤ 0, i = 1, ..., m
h_i(x) = 0, i = 1, ..., p
A value of x that satisfies the constraints is called feasible; the set of all feasible points is called the feasible set. The feasible set defined by the above constraints is a convex set. x is optimal if it has the smallest objective function value among all feasible z ∈ D.

Why is convexity useful?
(Figures: a convex objective vs. a non-convex objective, and a convex feasible set vs. a non-convex feasible set.)

Why convexity?
Convex objective: we can always improve a sub-optimal objective value by stepping in the direction of the negative gradient, and all local optima are global optima. Convex constraints, i.e. a convex feasible set: any point between two feasible points is feasible, and updates remain inside the feasible set as long as the update direction is towards a feasible point ⟹ fast algorithms based on the principle of feasible descent.

Quadratic optimization problem (QP)
A quadratic program (QP) has a quadratic objective function and affine constraint functions. A QP can be expressed in the form
min_{x ∈ ℝⁿ} (1/2)xᵀPx + qᵀx + r
s.t. g_iᵀx − h_i ≤ 0, i = 1, ..., m
a_iᵀx − b_i = 0, i = 1, ..., p
where P ∈ ℝ^{n×n} is a positive semi-definite matrix and q, g_i, a_i ∈ ℝⁿ. The feasible set is convex, as the intersection of a set of half-spaces (g_iᵀx ≤ h_i) and a set of hyperplanes (a_iᵀx = b_i). The objective decreases steepest when moving along −∇f_0(x) = −(Px + q).
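A minimal sketch of the standard QP form above, assuming the cvxpy modelling library and made-up problem data (any QP solver could be substituted):

```python
import numpy as np
import cvxpy as cp

# Made-up data for min 0.5 x^T P x + q^T x  s.t.  G x <= h,  A x = b
P = np.array([[2.0, 0.5], [0.5, 1.0]])       # positive semi-definite quadratic part
q = np.array([1.0, -1.0])
G = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # half-space constraints
h = np.ones(3)
A = np.array([[1.0, 1.0]])                    # hyperplane constraint x_1 + x_2 = 0.5
b = np.array([0.5])

x = cp.Variable(2)
objective = cp.Minimize(0.5 * cp.quad_form(x, P) + q @ x)
constraints = [G @ x <= h, A @ x == b]
prob = cp.Problem(objective, constraints)
prob.solve()
print("optimal value:", prob.value, "optimal x:", x.value)
```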

Ridge regression as a QP
min_w Σ_i (⟨w, x_i⟩ − y_i)² + λ‖w‖²
Express the norm of the weight vector as ‖w‖²₂ = wᵀI_n w and write the squared error in vector form: ⟨y − Xw, y − Xw⟩ = yᵀy − 2wᵀXᵀy + wᵀXᵀXw. We get P = XᵀX + λI_n for the quadratic part and q = −2Xᵀy for the linear part. The ridge regression problem is thus written as a QP:
min_w wᵀ(XᵀX + λI_n)w − 2wᵀXᵀy + yᵀy
This is an example of an unconstrained QP.
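As a quick check (with synthetic data, an assumption), the P and q above can be compared against the usual closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
lam = 1.0

P = X.T @ X + lam * np.eye(4)     # quadratic part of the QP objective
q = -2 * X.T @ y                  # linear part, q = -2 X^T y

# Unconstrained QP: minimise w^T P w + q^T w  ->  2 P w + q = 0  ->  w = -(2P)^{-1} q
w_qp = np.linalg.solve(2 * P, -q)
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(w_qp, w_closed))   # True
```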

Soft-margin SVM as a QP
min_w Σ_i L_Hinge(⟨w, x_i⟩, y_i) + (λ/2)‖w‖²
Express the hinge loss L_Hinge(⟨w, x⟩, y) = max(0, 1 − y wᵀx) through an upper bound (slack) ξ_i ≥ 0 and an inequality constraint 1 − y_i wᵀx_i ≤ ξ_i. Express the squared norm of the weight vector as ‖w‖²₂ = wᵀI_n w. Denote P = I_n and q = 1_l ∈ ℝˡ, a vector of ones. We obtain the QP
min_{w,ξ} 1_lᵀξ + (λ/2) wᵀI_n w
s.t. 1 − y_i wᵀx_i ≤ ξ_i, ξ_i ≥ 0, i = 1, ..., l
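A sketch of this primal QP on toy two-class data, assuming cvxpy; the data, the value of λ, and the omission of the intercept (a constant feature can be appended, as on a later slide) are assumptions for illustration:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(20, 2)),
               rng.normal(-1.5, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
lam = 1.0

w = cp.Variable(2)
xi = cp.Variable(40)
objective = cp.Minimize(cp.sum(xi) + (lam / 2) * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w) >= 1 - xi,   # i.e. 1 - y_i w^T x_i <= xi_i
               xi >= 0]
cp.Problem(objective, constraints).solve()
print("w =", w.value)
```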

Duality
Duality is the principle of viewing an optimization problem from two interchangeable views, the primal and the dual. Intuitively: minimization of the primal objective ↔ maximization of the dual objective; primal constraints ↔ dual variables; dual constraints ↔ primal variables. Motivation in this course: learning with features (primal) ↔ learning with kernels (dual).

Duality: Lagrangian
Consider the primal optimisation problem with variable x ∈ ℝⁿ:
min_{x ∈ D} f_0(x)
s.t. f_i(x) ≤ 0, i = 1, ..., m
h_i(x) = 0, i = 1, ..., p
Augment the objective function with a weighted sum of the constraint functions to form the Lagrangian of the optimization problem:
L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x)
λ_i, i = 1, ..., m, and ν_i, i = 1, ..., p (ν is the Greek letter nu), are called the Lagrange multipliers or dual variables.

Example: minimizing f_0(x) = x², s.t. f_1(x) = 1 − x ≤ 0.
Lagrangian: L(x, λ) = x² + λ(1 − x)

Lagrangian dual function
The Lagrangian dual function g : ℝᵐ × ℝᵖ → ℝ is the minimum value of the Lagrangian over x:
g(λ, ν) = inf_x L(x, λ, ν) = inf_x { f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x) }
Intuitively: fixing the coefficients (λ, ν) corresponds to a certain level of penalty, the infimum returns the optimal x for that level of penalty, and g(λ, ν) is the corresponding value of the Lagrangian. g(λ, ν) is a concave function, as a pointwise infimum of a family of affine functions of (λ, ν).

Example
Minimizing x², s.t. x ≥ 1.
Lagrangian dual function: g(λ) = inf_x (x² + λ(1 − x)).
Set the derivative to zero: ∂/∂x (x² + λ(1 − x)) = 2x − λ = 0 ⟹ x = λ/2.
Plug back into the Lagrangian: g(λ) = λ²/4 + λ(1 − λ/2) = −λ²/4 + λ
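A symbolic check of this worked example (an addition, assuming sympy), which also anticipates the dual problem below by maximising g over λ ≥ 0:

```python
import sympy as sp

# Minimise x^2 subject to x >= 1, with Lagrangian L(x, lam) = x^2 + lam * (1 - x).
x, lam = sp.symbols('x lam', real=True)
L = x**2 + lam * (1 - x)

x_star = sp.solve(sp.diff(L, x), x)[0]     # stationary point: x = lam / 2
g = sp.simplify(L.subs(x, x_star))         # dual function, equivalent to -lam**2/4 + lam
print(g)

# Maximising g over lam >= 0 gives lam* = 2 and g(2) = 1, which equals the
# primal optimum p* = 1 attained at x = 1.
lam_star = sp.solve(sp.diff(g, lam), lam)[0]
print(lam_star, g.subs(lam, lam_star))     # 2, 1
```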

Lower bounds on optimal value
The Lagrangian dual function gives us lower bounds on the optimal value p* of the primal problem. (Figure: in the example above, g(λ) ≤ 1 = p*.)

Lower bounds on optimal value
In general, it holds that g(λ, ν) ≤ p* for any non-negative λ (λ_i ≥ 0, i = 1, ..., m) and for any ν. To see this, let x̃ be a feasible point of the original problem, so all primal constraints are satisfied: f_i(x̃) ≤ 0 and h_i(x̃) = 0. We have λ_i f_i(x̃) ≤ 0 for i = 1, ..., m, since λ_i ≥ 0 by assumption, and similarly ν_i h_i(x̃) = 0 for i = 1, ..., p. Thus the value of the Lagrangian is at most the objective function value at x̃:
L(x̃, λ, ν) = f_0(x̃) + Σ_{i=1}^m λ_i f_i(x̃) + Σ_{i=1}^p ν_i h_i(x̃) ≤ f_0(x̃)
Now g(λ, ν) = inf_{x ∈ D} L(x, λ, ν) ≤ L(x̃, λ, ν) ≤ f_0(x̃), as the infimum is computed over a set containing x̃.

The Lagrange dual problem
For each pair (λ, ν) with λ ≥ 0, the Lagrange dual function gives a lower bound on the optimal value p*. What is the tightest lower bound that can be achieved? We need to find the maximum. This gives us the optimization problem
max_{λ,ν} g(λ, ν) s.t. λ ≥ 0
It is called the Lagrange dual problem of the original optimization problem. It is a convex optimisation problem, since it is equivalent to minimising −g(λ, ν), which is a convex function.

Derivation of the Dual Soft-Margin SVM
Recall the soft-margin SVM from the last lecture:
min_{w,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t. 1 − y_i(⟨w, x_i⟩ + b) ≤ ξ_i, i = 1, ..., l
ξ_i ≥ 0
(Figure: a two-class scatter plot, GC content before 'AG' vs. GC content after 'AG', with the slack ξ marked for a margin violator.)
ξ_i is called the slack variable; when it is positive, the margin γ(x_i) < 1. The sum of slacks is to be minimized, so the objective still favours hyperplanes that separate the classes well. The coefficient C > 0 controls the balance between maximizing the margin and the amount of slack needed.

Note: S-T & C book formulation
We use the canonical hyperplane representation: we fix the margin γ = 1 and seek the shortest weight vector w that achieves the margin, penalizing outliers by the slack variables ξ corresponding to the hinge loss. The S-T & C book (Section 7.2.2) writes the problem in the margin maximization form, keeping the norm of the weight vector fixed. This leads to a different derivation of the dual, but the same result:
min_{γ,w,ξ} −γ + C Σ_i ξ_i
s.t. y_i(⟨w, x_i⟩ + b) ≥ γ − ξ_i, i = 1, ..., l
ξ_i ≥ 0, ‖w‖² = 1
(Figure: the same two-class scatter plot with the slack ξ marked.)

Recipe to find the dual problem
1. Write down the Lagrangian. 2. Find the minimum of the Lagrangian by setting derivatives w.r.t. the primal variables to zero. 3. Plug the solution (w*, ξ*) back into the Lagrangian to obtain the dual function. 4. Write down the optimisation problem to maximise the dual function w.r.t. the dual variables.

Deriving the dual SVM
We first write the constraints in the standard form of a QP:
min_{w,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t. 1 − y_i wᵀx_i − ξ_i ≤ 0, i = 1, ..., l
−ξ_i ≤ 0, i = 1, ..., l
Above we have dropped the intercept (b) of the hyperplane; instead we assume a constant feature x_0 = const has been added.

Lagrangian of the Soft-Margin SVM
min_{w,ξ} (1/2)‖w‖² + C Σ_i ξ_i
s.t. 1 − y_i wᵀx_i − ξ_i ≤ 0, i = 1, ..., l
−ξ_i ≤ 0, i = 1, ..., l
Step 1. Write down the Lagrangian. We have the objective and two sets of inequality constraints; use the dual variables λ = (α, β), where α = (α_i)_{i=1}^l corresponds to the first set and β = (β_i)_{i=1}^l corresponds to the second set of constraints. The Lagrangian becomes
L(w, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^l ξ_i + Σ_{i=1}^l α_i(1 − y_i wᵀx_i − ξ_i) + Σ_{i=1}^l β_i(−ξ_i)

Minimum of the Lagrangian
L(w, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^l ξ_i + Σ_{i=1}^l α_i(1 − y_i wᵀx_i − ξ_i) + Σ_{i=1}^l β_i(−ξ_i)
Step 2. Find the minimum of the Lagrangian by setting derivatives w.r.t. the primal variables to zero. For the variable w we get:
∇_w L(w, ξ, α, β) = w − Σ_{i=1}^l α_i y_i x_i = 0 ⟹ w = Σ_{i=1}^l α_i y_i x_i

Minimum of the Lagrangian
L(w, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^l ξ_i + Σ_{i=1}^l α_i(1 − y_i wᵀx_i − ξ_i) + Σ_{i=1}^l β_i(−ξ_i)
Step 2 (continued). For the variables ξ we get:
∂L/∂ξ_i = C − α_i − β_i = 0
Together with β_i ≥ 0, this implies α_i ≤ C for all i = 1, ..., l.

Lagrangian dual function
L(w, ξ, α, β) = (1/2)‖w‖² + C Σ_i ξ_i + Σ_i α_i(1 − y_i wᵀx_i − ξ_i) + Σ_i β_i(−ξ_i)
Step 3. Plug the solution back into the Lagrangian to obtain the dual function. Substitute w = Σ_{i=1}^l α_i y_i x_i and C − α_i − β_i = 0; the terms containing ξ_i and β_i cancel out:
L(w, ξ, α, β) = (1/2)‖Σ_i α_i y_i x_i‖² + Σ_i α_i(1 − y_i (Σ_j α_j y_j x_j)ᵀ x_i)

Lagrangian dual function
L(w, ξ, α, β) = (1/2)‖Σ_i α_i y_i x_i‖² + Σ_i α_i(1 − y_i (Σ_j α_j y_j x_j)ᵀ x_i)
Expanding the squares and simplifying gives the dual function, which depends only on the dual variable α:
g(α) = Σ_{i=1}^l α_i − (1/2) Σ_{i=1}^l Σ_{j=1}^l α_i α_j y_i y_j x_jᵀx_i

Dual SVM optimisation problem
Step 4. Write down the optimisation problem to maximise the dual function w.r.t. the dual variables. With the non-negativity constraints α_i ≥ 0 and the upper bound derived above, α_i ≤ C, the optimal dual solution is given by
max_α g(α) = Σ_{i=1}^l α_i − (1/2) Σ_{i=1}^l Σ_{j=1}^l α_i α_j y_i y_j ⟨x_j, x_i⟩
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l
Plug in a kernel κ(x_i, x_j) = ⟨φ(x_j), φ(x_i)⟩_F to obtain the dual SVM problem:
max_α g(α) = Σ_{i=1}^l α_i − (1/2) Σ_{i=1}^l Σ_{j=1}^l α_i α_j y_i y_j κ(x_i, x_j)
s.t. 0 ≤ α_i ≤ C, i = 1, ..., l
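A sketch of solving this kernelised dual problem (toy data, an RBF kernel and the cvxpy library are assumptions; the lecture derives the problem, not this code). The quadratic term αᵀ(yyᵀ ∘ K)α is rewritten through a Cholesky factor of K so that the modelling library recognises the objective as concave:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(20, 2)),
               rng.normal(-1.5, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
C, gamma, l = 1.0, 0.5, len(y)

# Gram matrix K_ij = kappa(x_i, x_j) with an RBF kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# sum_ij alpha_i alpha_j y_i y_j K_ij = || R^T (y * alpha) ||^2 with K = R R^T
R = np.linalg.cholesky(K + 1e-9 * np.eye(l))   # small jitter for numerical PSD

alpha = cp.Variable(l)
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(R.T @ cp.multiply(y, alpha)))
constraints = [alpha >= 0, alpha <= C]
cp.Problem(objective, constraints).solve()
print("support vectors (alpha_i > 0):", int(np.sum(alpha.value > 1e-6)))
```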

Properties of convex optimisation problems
We will look at further concepts to understand the properties of convex optimisation problems: weak and strong duality, the duality gap, complementary slackness, and the KKT conditions.

Weak and strong duality
Let p* and d* denote the primal and dual optimal values of an optimization problem. Weak duality, d* ≤ p*, always holds, even when the primal optimization problem is non-convex. Strong duality, d* = p*, holds for special classes of convex optimization problems; QPs have strong duality.

Duality gap
A pair x, (λ, ν), where x is primal feasible and (λ, ν) is dual feasible, is called a primal dual feasible pair. For a primal dual feasible pair, the quantity f_0(x) − g(λ, ν) is called the duality gap. A primal dual feasible pair localizes the primal and dual optimal values into an interval,
g(λ, ν) ≤ d* ≤ p* ≤ f_0(x),
the width of which is given by the duality gap. If the duality gap is zero, we know that x is primal optimal and (λ, ν) is dual optimal. We can use the duality gap as a stopping criterion for optimisation.

Stopping criterion for optimization
Suppose the algorithm generates a sequence of primal feasible x^(k) and dual feasible (λ^(k), ν^(k)) solutions for k = 1, 2, ... Then the duality gap can be used as the stopping criterion: e.g. stop when f_0(x^(k)) − g(λ^(k), ν^(k)) ≤ ε, for some ε > 0.
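A schematic sketch of this stopping criterion (an addition, not from the slides), using the earlier example min x² s.t. x ≥ 1; the update rules are toy assumptions whose only purpose is to keep the iterates feasible while they approach the optimum:

```python
# f0(x) = x^2 for feasible x >= 1, and g(lam) = -lam^2/4 + lam for lam >= 0.
def f0(x):
    return x ** 2

def g(lam):
    return -lam ** 2 / 4 + lam

eps = 1e-6
x, lam = 3.0, 0.5            # primal feasible / dual feasible starting points
for k in range(1000):
    gap = f0(x) - g(lam)     # duality gap: upper bound minus lower bound on p*
    if gap <= eps:
        print(f"stopping at iteration {k}: gap = {gap:.2e}")
        break
    # toy updates that stay feasible and move toward the optimum (x* = 1, lam* = 2)
    x = max(1.0, x - 0.1 * (x - 1.0))
    lam = max(0.0, lam + 0.1 * (2.0 - lam))
```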

Complementary slackness
Let x* be a primal optimal (and thus also feasible) solution, let (λ*, ν*) be a dual optimal (and thus also feasible) solution, and let strong duality hold, i.e. d* = p*. Then, at the optimum,
d* = g(λ*, ν*) = inf_x { f_0(x) + Σ_{i=1}^m λ*_i f_i(x) + Σ_{i=1}^p ν*_i h_i(x) }
≤ f_0(x*) + Σ_{i=1}^m λ*_i f_i(x*) + Σ_{i=1}^p ν*_i h_i(x*)
≤ f_0(x*) = p*
The first inequality follows from the definition of the infimum, the second from x* being a feasible solution. Due to d* = p*, the inequalities must hold as equalities ⟹ the penalty terms must equate to zero.

Complementary slackness
We have Σ_{i=1}^m λ*_i f_i(x*) + Σ_{i=1}^p ν*_i h_i(x*) = 0. Since h_i(x*) = 0, i = 1, ..., p, we conclude that Σ_{i=1}^m λ*_i f_i(x*) = 0, and since each term is non-positive, λ*_i f_i(x*) = 0, i = 1, ..., m. This condition is called complementary slackness.

Complementary slackness
Intuition: at the optimum there cannot be slack both in the dual variable (λ*_i ≥ 0) and in the constraint (f_i(x*) ≤ 0) at the same time:
λ*_i > 0 ⟹ f_i(x*) = 0 and f_i(x*) < 0 ⟹ λ*_i = 0
At the optimum, positive Lagrange multipliers are associated with active constraints.

Complementary slackness for the soft-margin SVM
Recall the Lagrangian of the soft-margin SVM:
L(w, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^l ξ_i + Σ_{i=1}^l α_i(1 − y_i wᵀx_i − ξ_i) + Σ_{i=1}^l β_i(−ξ_i)
The complementary slackness condition gives, for all i = 1, ..., l:
α*_i (1 − y_i wᵀx_i − ξ_i) = 0
β*_i ξ_i = 0

Karush-Kuhn-Tucker (KKT) conditions
Summarizing, at the optimum the following conditions must hold true: inequality constraints satisfied: f_i(x*) ≤ 0, i = 1, ..., m; equality constraints satisfied: h_i(x*) = 0, i = 1, ..., p; non-negativity of the dual variables of the inequality constraints: λ*_i ≥ 0, i = 1, ..., m; complementary slackness: λ*_i f_i(x*) = 0, i = 1, ..., m; and the derivative of the Lagrangian vanishes:
∇_x L(x*, λ*, ν*) = ∇_x ( f_0(x*) + Σ_{i=1}^m λ*_i f_i(x*) + Σ_{i=1}^p ν*_i h_i(x*) ) = 0
These conditions are called the Karush-Kuhn-Tucker conditions. For convex problems, satisfying the KKT conditions is sufficient as well as necessary for optimality.
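A small check (an addition) that the KKT conditions hold for the earlier worked example, minimise f_0(x) = x² subject to f_1(x) = 1 − x ≤ 0, whose primal and dual optima were x* = 1 and λ* = 2:

```python
x_star, lam_star = 1.0, 2.0

primal_feasible = (1 - x_star) <= 0              # f1(x*) <= 0
dual_feasible = lam_star >= 0                    # lam* >= 0
comp_slackness = lam_star * (1 - x_star) == 0    # lam* f1(x*) = 0
stationarity = (2 * x_star - lam_star) == 0      # d/dx [x^2 + lam*(1 - x)] = 0 at x*

print(primal_feasible, dual_feasible, comp_slackness, stationarity)  # all True
```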

KKT conditions for the soft-margin SVM
Inequality constraints satisfied: 1 − y_i wᵀx_i − ξ_i ≤ 0 and −ξ_i ≤ 0, i = 1, ..., l. There are no equality constraints to worry about. Non-negativity of the dual variables of the inequality constraints: α*_i ≥ 0, β*_i ≥ 0, i = 1, ..., l. Complementary slackness: for all i = 1, ..., l, α*_i(1 − y_i wᵀx_i − ξ*_i) = 0 and β*_i ξ*_i = 0. The derivative of the Lagrangian vanishes:
∇_{(w,ξ)} L(w*, ξ*, α*, β*) = ∇_{(w,ξ)} ( (1/2)‖w‖² + C Σ_i ξ*_i + Σ_i α*_i(1 − y_i wᵀx_i − ξ*_i) + Σ_i β*_i(−ξ*_i) ) = 0
⟹ w* = Σ_i α*_i y_i x_i and α*_i ≤ C, i = 1, ..., l

KKT conditions for the soft-margin SVM
By using the KKT conditions, we can categorize the data points by their position w.r.t. their margin:
A. Too small a margin, a difficult data point: α_i = C ⟹ y_i wᵀx_i ≤ 1
B. At the boundary: 0 < α_i < C ⟹ y_i wᵀx_i = 1
C. More than enough margin, an easy data point: α_i = 0 ⟹ y_i wᵀx_i ≥ 1
(Figure: data points labelled A, B, C relative to the hyperplanes ⟨w, x⟩ + b = −1, 0, +1, with slacks ξ > 0 and ξ > 1 marked for margin violators.)
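A sketch (an addition) of applying this categorisation in practice; the `alpha` vector would come from a dual SVM solver and the tolerance is an assumption to absorb solver inaccuracy:

```python
import numpy as np

def categorize(alpha, C, tol=1e-6):
    at_upper_bound = alpha >= C - tol                # case A: alpha_i = C, y_i w^T x_i <= 1
    at_boundary = (alpha > tol) & ~at_upper_bound    # case B: 0 < alpha_i < C, y_i w^T x_i = 1
    easy = alpha <= tol                              # case C: alpha_i = 0, y_i w^T x_i >= 1
    return at_upper_bound, at_boundary, easy

alpha = np.array([0.0, 0.3, 1.0, 1e-9, 0.7])         # hypothetical dual solution
case_A, case_B, case_C = categorize(alpha, C=1.0)
print("A (alpha = C):", np.where(case_A)[0])
print("B (0 < alpha < C):", np.where(case_B)[0])
print("C (alpha = 0):", np.where(case_C)[0])
```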