HT05: SC4 Statistical Data Mining and Machine Learning
Dino Sejdinovic, Department of Statistics, Oxford
Convex Optimization (slides based on Arthur Gretton's Advanced Topics in Machine Learning course)
http://www.stats.ox.ac.uk/~sejdinov/sdmml.html

Optimization and the Lagrangian

Optimization problem on x ∈ R^d (primal):

  minimize    f_0(x)
  subject to  f_i(x) ≤ 0,  i = 1, ..., m
              h_j(x) = 0,  j = 1, ..., r,

with domain D := ∩_{i=0}^m dom f_i ∩ ∩_{j=1}^r dom h_j (assumed nonempty), and p* the (primal) optimal value.

Ideally we would want to solve the equivalent unconstrained problem

  minimize  f_0(x) + Σ_{i=1}^m I_-(f_i(x)) + Σ_{j=1}^r I_0(h_j(x)),

where I_-(u) = 0 for u ≤ 0 and ∞ for u > 0, and I_0(u) = 0 for u = 0 and ∞ otherwise.

Lower bound interpretation of the Lagrangian

The Lagrangian L : R^d × R^m × R^r → R is an (easier to optimize) lower bound on the original problem:

  L(x, λ, ν) := f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{j=1}^r ν_j h_j(x),

where λ_i f_i(x) replaces I_-(f_i(x)) and ν_j h_j(x) replaces I_0(h_j(x)); it has domain dom L := D × R^m × R^r. The vectors λ and ν are called Lagrange multipliers or dual variables. To ensure a lower bound, we require λ ⪰ 0, so that λ_i f_i(x) ≤ I_-(f_i(x)) and ν_j h_j(x) ≤ I_0(h_j(x)).
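The lower-bound property can be checked numerically on a toy problem. The sketch below (my own illustrative example, not from the slides) uses minimize x^2 subject to f_1(x) = 1 − x ≤ 0, and verifies that for λ ≥ 0 the Lagrangian sits below f_0 on the feasible set:

```python
import numpy as np

# Toy primal: minimize f0(x) = x^2 subject to f1(x) = 1 - x <= 0 (i.e. x >= 1).
f0 = lambda x: x**2
f1 = lambda x: 1.0 - x

def lagrangian(x, lam):
    # L(x, lam) = f0(x) + lam * f1(x). For lam >= 0 and feasible x,
    # lam * f1(x) <= 0 <= I_-(f1(x)), so L is a lower bound on the
    # "hard" objective f0(x) + I_-(f1(x)).
    return f0(x) + lam * f1(x)

xs = np.linspace(1.0, 3.0, 201)          # feasible points only
for lam in [0.0, 0.5, 2.0, 5.0]:
    assert np.all(lagrangian(xs, lam) <= f0(xs) + 1e-12)
```

For infeasible x the hard objective is +∞, so the bound there is trivial.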
Lagrange dual: lower bound on optimum p*

The Lagrange dual function: minimize the Lagrangian over x,

  g(λ, ν) := inf_{x ∈ D} L(x, λ, ν).

A dual feasible pair (λ, ν) is a pair for which λ ⪰ 0 and (λ, ν) ∈ dom g. We will show (next slide): for any λ ⪰ 0 and any ν,

  g(λ, ν) ≤ p*.

Simplest example: minimize over x the function L(x, λ) = f_0(x) + λ f_1(x). When λ ≥ 0 and f_1(x) ≤ 0, we have λ f_1(x) ≤ 0, so L(x, λ) ≤ f_0(x) on the constraint set.

Reminders: f_0 is the function to be minimized; f_1(x) ≤ 0 is the inequality constraint; λ ≥ 0 is the Lagrange multiplier; p* is the minimum of f_0 over the constraint set (in particular, at the optimal point, f_0(x*) = p*).

[Figures from Boyd and Vandenberghe: f_0 + λ f_1 plotted for several values of λ, lying below f_0 over the feasible set, with minimum below p*.]
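For the toy problem minimize x^2 subject to 1 − x ≤ 0 (optimum p* = 1 at x = 1), the dual function can be worked out in closed form: L(x, λ) = x^2 + λ(1 − x) is minimized over x at x = λ/2, giving g(λ) = λ − λ^2/4. A minimal sketch (my own example) checking both the closed form and the bound g(λ) ≤ p*:

```python
import numpy as np

# Dual function for: minimize x^2 subject to 1 - x <= 0, with p* = 1.
# inf_x [x^2 + lam*(1 - x)] is attained at x = lam/2, so
# g(lam) = lam - lam^2 / 4.
def g(lam):
    return lam - lam**2 / 4.0

p_star = 1.0
lams = np.linspace(0.0, 10.0, 101)
assert np.all(g(lams) <= p_star + 1e-12)   # g(lam) <= p* for every lam >= 0

# Check the closed form against a brute-force grid minimization of L(x, lam).
xs = np.linspace(-5.0, 5.0, 100001)
for lam in [0.0, 1.0, 2.0, 3.0]:
    g_grid = np.min(xs**2 + lam * (1.0 - xs))
    assert abs(g_grid - g(lam)) < 1e-6
```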
Lagrange dual is a lower bound on p*

Assume x is feasible, i.e. f_i(x) ≤ 0, h_j(x) = 0, x ∈ D, and let λ ⪰ 0. Then

  Σ_{i=1}^m λ_i f_i(x) + Σ_{j=1}^r ν_j h_j(x) ≤ 0.

Thus

  g(λ, ν) := inf_{x'} ( f_0(x') + Σ_i λ_i f_i(x') + Σ_j ν_j h_j(x') )
           ≤ f_0(x) + Σ_i λ_i f_i(x) + Σ_j ν_j h_j(x)
           ≤ f_0(x).

This holds for every feasible x, hence the lower bound g(λ, ν) ≤ p* holds.

Best lower bound: maximize the dual

Best lower bound g(λ, ν) on the optimal value p* of the original problem:

  Lagrange dual problem:  maximize g(λ, ν) subject to λ ⪰ 0.

Dual feasible: (λ, ν) with λ ⪰ 0 and g(λ, ν) > −∞. Dual optimal: solutions (λ*, ν*) to the dual problem; d* is the dual optimal value (the dual is always easy to maximize: next slide).

Weak duality always holds:

  d* = max_{λ⪰0, ν} min_x L(x, λ, ν) ≤ min_x max_{λ⪰0, ν} L(x, λ, ν) = p*,

since max_{λ⪰0, ν} L(x, λ, ν) = f_0(x) if the constraints are satisfied at x, and +∞ otherwise.

Strong duality, d* = p*, does not always hold (conditions given later). If strong duality holds: solve the easy (concave) dual problem to find p*.

Maximizing the dual is always easy

The Lagrange dual function (minimize the Lagrangian lower bound):

  g(λ, ν) = inf_x L(x, λ, ν).

The dual function is a pointwise infimum of affine functions of (λ, ν), hence concave in (λ, ν), with convex constraint set λ ⪰ 0.

Example: one inequality constraint, L(x, λ) = f_0(x) + λ f_1(x), and assume there are only four possible values for x. Each line λ ↦ L(x, λ) represents a different x; g(λ) is their pointwise minimum, a concave piecewise-linear function of λ.

How do we know if strong duality holds?

Conditions under which strong duality holds are called constraint qualifications (they are sufficient, but not necessary). The (probably) best known sufficient condition: strong duality holds if

  the primal problem is convex, i.e. of the form
    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, ..., n
                Ax = b
  for convex f_0, ..., f_n, and

  Slater's condition holds: there exists a strictly feasible point x̃ such that f_i(x̃) < 0, i = 1, ..., n (this reduces to the existence of any feasible point when the inequality constraints are affine, i.e. Cx ⪯ d).
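For the toy problem minimize x^2 subject to 1 − x ≤ 0, the primal is convex and x = 2 is strictly feasible (f_1(2) = −1 < 0), so Slater's condition predicts strong duality. A minimal numerical sketch (my own example) maximizing the dual g(λ) = λ − λ^2/4 over a grid:

```python
import numpy as np

# Toy problem: min x^2 s.t. 1 - x <= 0, with p* = 1 at x = 1.
# Dual: g(lam) = lam - lam^2/4, concave. Slater holds, so d* should equal p*.
lams = np.linspace(0.0, 10.0, 100001)
g = lams - lams**2 / 4.0
d_star = g.max()
lam_star = lams[g.argmax()]
p_star = 1.0

assert d_star <= p_star + 1e-9          # weak duality (always holds)
assert abs(d_star - p_star) < 1e-6      # strong duality for this convex problem
assert abs(lam_star - 2.0) < 1e-3       # dual maximizer lam* = 2
```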
A consequence of strong duality... is complementary slackness

Assume the primal optimal value equals the dual optimal value. What are the consequences? Let x* be a solution of the original problem (minimum of f_0 under the constraints) and (λ*, ν*) a solution of the dual. Then

  f_0(x*) = g(λ*, ν*)                                                (assumed)
          = inf_x ( f_0(x) + Σ_i λ_i* f_i(x) + Σ_j ν_j* h_j(x) )     (definition)
          ≤ f_0(x*) + Σ_i λ_i* f_i(x*) + Σ_j ν_j* h_j(x*)            (definition of inf)
          ≤ f_0(x*),                                                 (4)

where (4) holds because (x*, λ*, ν*) satisfies λ* ⪰ 0, f_i(x*) ≤ 0, and h_j(x*) = 0. Hence all inequalities above are equalities, and since each term λ_i* f_i(x*) is nonpositive,

  λ_i* f_i(x*) = 0  for each i,

which is the condition of complementary slackness. This means

  λ_i* > 0  ⟹  f_i(x*) = 0,     f_i(x*) < 0  ⟹  λ_i* = 0.

From λ*, read off which inequality constraints are strict.

KKT conditions for global optimum

Assume the functions f_i, h_j are differentiable and strong duality holds. Since x* minimizes L(x, λ*, ν*), the derivative at x* is zero:

  ∇f_0(x*) + Σ_i λ_i* ∇f_i(x*) + Σ_j ν_j* ∇h_j(x*) = 0.

KKT conditions (definition): we are at a global optimum (x*, λ*, ν*) when (a) strong duality holds, and (b):

  f_i(x*) ≤ 0,        i = 1, ..., m
  h_j(x*) = 0,        j = 1, ..., r
  λ_i* ≥ 0,           i = 1, ..., m
  λ_i* f_i(x*) = 0,   i = 1, ..., m
  ∇f_0(x*) + Σ_i λ_i* ∇f_i(x*) + Σ_j ν_j* ∇h_j(x*) = 0.

In summary: if the primal problem is convex and the inequality constraints are affine, then strong duality holds. If in addition the functions f_i, h_j are differentiable, then the KKT conditions are necessary and sufficient for optimality.
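The KKT conditions can be checked directly on the toy problem minimize x^2 subject to f_1(x) = 1 − x ≤ 0, whose solution is x* = 1 with multiplier λ* = 2 (my own running example, not from the slides):

```python
# KKT check at (x*, lam*) = (1, 2) for min x^2 s.t. f1(x) = 1 - x <= 0.
x_star, lam_star = 1.0, 2.0

grad_f0 = 2.0 * x_star      # d/dx of x^2
grad_f1 = -1.0              # d/dx of (1 - x)
f1_val = 1.0 - x_star

stationarity = grad_f0 + lam_star * grad_f1

assert stationarity == 0.0               # gradient of Lagrangian vanishes
assert f1_val <= 0.0                     # primal feasibility
assert lam_star >= 0.0                   # dual feasibility
assert lam_star * f1_val == 0.0          # complementary slackness
```

Here the constraint is active (f_1(x*) = 0), consistent with λ* > 0.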
Linearly separable points

Classify two clouds of points, where there exists a hyperplane which linearly separates one cloud from the other without error.

Data given by {(x_i, y_i)}_{i=1}^n, with x_i ∈ R^p and y_i ∈ {−1, +1}. Hyperplane equation: w^T x + b = 0. The linear discriminant is given by

  f(x) = sign(w^T x + b).

For a datapoint close to the decision boundary, a small change leads to a change in classification. Can we make the classifier more robust?

The smallest distance from each class to the separating hyperplane w^T x + b = 0 is called the margin.

[Figures: two point clouds labelled y_i = +1 and y_i = −1, a separating hyperplane with normal vector w, and margin width 1/||w|| on each side.]
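A minimal sketch of the linear discriminant on a toy separable dataset; the weights w = (1, 1), b = −1 and the four points are hand-picked for illustration, not fitted:

```python
import numpy as np

# Linear discriminant f(x) = sign(w^T x + b) on a hand-made separable dataset.
w = np.array([1.0, 1.0])   # assumed (hand-picked) normal vector
b = -1.0                   # assumed (hand-picked) offset

X = np.array([[1.0, 1.0], [2.0, 0.5], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([+1, +1, -1, -1])

pred = np.sign(X @ w + b)
assert np.all(pred == y)   # this hyperplane separates the two clouds without error
```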
Maximum margin classifier, linearly separable case

This problem can be expressed as follows:

  max_{w,b} (margin) = max_{w,b} 2/||w||
  subject to  w^T x_i + b ≥ +1  for i : y_i = +1,
              w^T x_i + b ≤ −1  for i : y_i = −1.

The resulting classifier is f(x) = sign(w^T x + b). We can rewrite to obtain a quadratic program:

  min_{w,b} (1/2) ||w||^2
  subject to  y_i (w^T x_i + b) ≥ 1.

Maximum margin classifier: with errors allowed

Allow "errors": points within the margin, or even on the wrong side of the decision boundary. Ideally:

  min_{w,b} (1/2) ||w||^2 + C Σ_{i=1}^n I[ y_i (w^T x_i + b) < 0 ],

where C controls the tradeoff between maximum margin and loss. Replace the indicator with a convex upper bound:

  min_{w,b} (1/2) ||w||^2 + C Σ_{i=1}^n h( 1 − y_i (w^T x_i + b) ),

with the hinge loss

  h(α) = (α)_+ = α for α > 0, and 0 otherwise.

Hinge loss

As a function of the margin value α = y_i (w^T x_i + b), the hinge term (1 − α)_+ is a convex upper bound on the misclassification indicator I(α < 0).

Support vector classification

Substituting in the hinge loss, we get

  min_{w,b} (1/2) ||w||^2 + C Σ_{i=1}^n h( 1 − y_i (w^T x_i + b) ).

To simplify, use the substitution ξ_i = h( 1 − y_i (w^T x_i + b) ):

  min_{w,b,ξ} (1/2) ||w||^2 + C Σ_{i=1}^n ξ_i
  subject to  ξ_i ≥ 0,   y_i (w^T x_i + b) ≥ 1 − ξ_i.
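A short sketch verifying the hinge-loss bound numerically: as a function of the margin value α = y (w^T x + b), the hinge term (1 − α)_+ lies above the misclassification indicator I(α < 0) everywhere:

```python
import numpy as np

# Hinge loss h(a) = (a)_+ applied to 1 - alpha, where alpha = y * (w^T x + b).
def hinge(a):
    return np.maximum(a, 0.0)

margins = np.linspace(-3.0, 3.0, 601)          # candidate values of y*(w^T x + b)
hinge_term = hinge(1.0 - margins)              # (1 - alpha)_+
indicator = (margins < 0.0).astype(float)      # 0-1 misclassification loss

assert np.all(hinge_term >= indicator)         # convex upper bound on I(alpha < 0)
assert hinge(0.5) == 0.5 and hinge(-2.0) == 0.0
```

Note the bound is loose between α = 0 and α = 1: points correctly classified but inside the margin still incur hinge loss.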
Support vector classification: does strong duality hold?

[Figure: separating hyperplane w^T x + b = 0 with margin boundaries at distance 1/||w||; a margin violation at distance ξ/||w|| beyond its margin, clouds labelled y_i = +1 and y_i = −1.]

Is the optimization problem convex with respect to the variables (w, b, ξ)?

  minimize    f_0(w, b, ξ) := (1/2) ||w||^2 + C Σ_{i=1}^n ξ_i
  subject to  f_i(w, b, ξ) := 1 − ξ_i − y_i (w^T x_i + b) ≤ 0,  i = 1, ..., n,
              f_i(w, b, ξ) := −ξ_i ≤ 0,                         i = n+1, ..., 2n.

Each of f_0, f_1, ..., f_{2n} is convex, and there are no equality constraints. Does Slater's condition hold? Yes (trivially), since the inequality constraints are affine. Thus strong duality holds, and since the problem is differentiable, the KKT conditions hold at the global optimum.

Support vector classification: Lagrangian

The Lagrangian:

  L(w, b, ξ, α, λ) = (1/2) ||w||^2 + C Σ_{i=1}^n ξ_i
                     + Σ_{i=1}^n α_i ( 1 − ξ_i − y_i (w^T x_i + b) )
                     + Σ_{i=1}^n λ_i (−ξ_i),

with dual variable constraints α_i ≥ 0, λ_i ≥ 0. Minimize with respect to the primal variables w, b, and ξ.

Derivative with respect to w:

  ∂L/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0   ⟹   w = Σ_{i=1}^n α_i y_i x_i.

Derivative with respect to b:

  ∂L/∂b = −Σ_{i=1}^n α_i y_i = 0   ⟹   Σ_{i=1}^n α_i y_i = 0.

Derivative with respect to ξ_i:

  ∂L/∂ξ_i = C − α_i − λ_i = 0   ⟹   α_i = C − λ_i.

Since λ_i ≥ 0, this gives α_i ≤ C.

Now use complementary slackness:

  Non-margin SVs (margin errors), α_i = C > 0: we immediately have y_i (w^T x_i + b) = 1 − ξ_i. Also, from the condition α_i = C − λ_i, we have λ_i = 0, so ξ_i ≥ 0 may be strictly positive.

  Margin SVs, 0 < α_i < C: we again have y_i (w^T x_i + b) = 1 − ξ_i. This time, from α_i = C − λ_i, we have λ_i > 0, hence ξ_i = 0, so y_i (w^T x_i + b) = 1.

  Non-SVs (on the correct side of the margin), α_i = 0: from α_i = C − λ_i, we have λ_i = C > 0, hence ξ_i = 0. Thus y_i (w^T x_i + b) ≥ 1.
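The stationarity condition ∂L/∂w = 0 can be verified by finite differences: for any fixed multipliers, the w-dependent part of the Lagrangian is minimized at w = Σ_i α_i y_i x_i. The data and α values below are arbitrary illustrative choices, not a solved SVM:

```python
import numpy as np

# Check numerically that w = sum_i alpha_i y_i x_i zeroes dL/dw, where the
# w-dependent part of the SVM Lagrangian is 0.5*||w||^2 - sum_i alpha_i y_i w^T x_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                      # arbitrary toy inputs
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
alpha = np.array([0.3, 0.0, 0.7, 0.2, 0.1])      # arbitrary nonnegative multipliers

def L_w(w):
    # only the terms of L(w, b, xi, alpha, lambda) that depend on w
    return 0.5 * w @ w - alpha @ (y * (X @ w))

w_star = (alpha * y) @ X                         # candidate stationary point
eps = 1e-6
grad = np.array([(L_w(w_star + eps * e) - L_w(w_star - eps * e)) / (2 * eps)
                 for e in np.eye(2)])            # central finite differences

assert np.allclose(grad, 0.0, atol=1e-6)         # gradient vanishes at w_star
```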
The support vectors

We observe:

  The solution is sparse: points which are neither on the margin nor margin errors have α_i = 0.
  The support vectors: only those points on the margin, or which are margin errors, contribute to w.
  The influence of the non-margin SVs is bounded, since their weight α_i cannot exceed C.

Support vector classification: dual function

Thus, our goal is to maximize the dual. Substituting w = Σ_j α_j y_j x_j, Σ_i α_i y_i = 0, and C − α_i − λ_i = 0 into the Lagrangian:

  g(α, λ) = (1/2) ||w||^2 + C Σ_i ξ_i + Σ_i α_i ( 1 − ξ_i − y_i (w^T x_i + b) ) − Σ_i λ_i ξ_i
          = (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j − Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
            − b Σ_i α_i y_i + Σ_i α_i + Σ_i (C − α_i − λ_i) ξ_i
          = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j,

where the b term vanishes since Σ_i α_i y_i = 0, and the ξ terms vanish since C − α_i − λ_i = 0.

Support vector classification: dual problem

Maximize the dual

  g(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j

subject to the constraints

  0 ≤ α_i ≤ C,    Σ_{i=1}^n y_i α_i = 0.

This is a quadratic program. From α*, obtain the hyperplane with w = Σ_{i=1}^n α_i* y_i x_i. The offset b can be obtained from any of the margin SVs, using 1 = y_i (w^T x_i + b).
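The whole pipeline can be sketched end to end on a tiny hand-made dataset. The sketch below uses SciPy's generic SLSQP solver as a stand-in for a dedicated QP solver; the data, C value, and tolerances are my own illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable 2D dataset (hand-made for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, C = len(y), 10.0
Q = (y[:, None] * X) @ (y[:, None] * X).T      # Q_ij = y_i y_j x_i^T x_j

# Dual: maximize g(a) = sum(a) - 0.5 a^T Q a, s.t. 0 <= a_i <= C, a^T y = 0.
neg_g = lambda a: 0.5 * a @ Q @ a - a.sum()    # minimize -g(alpha)
res = minimize(neg_g, np.zeros(n),
               jac=lambda a: Q @ a - np.ones(n),
               bounds=[(0.0, C)] * n,
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               method='SLSQP')
alpha = res.x

w = (alpha * y) @ X                                   # w = sum_i alpha_i y_i x_i
sv = np.argmax((alpha > 1e-6) & (alpha < C - 1e-6))   # index of a margin SV
b = y[sv] - w @ X[sv]                                 # from 1 = y_i (w^T x_i + b)

assert np.all(np.sign(X @ w + b) == y)   # training points correctly classified
```

Only the two points closest to the boundary end up with α_i > 0, illustrating the sparsity of the solution.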