Convex Optimization and Support Vector Machine

Problem 0. Consider a two-class classification problem. The training data is $L_n = \{(x_1, t_1), \ldots, (x_n, t_n)\}$, where each $t_i \in \{-1, 1\}$ and $x_i \in \mathbb{R}^p$. We consider classification made using linear models of the form $y(x) = \beta_0 + \beta^t x$: classify $x$ as $+1$ if $y(x)$ is positive, and as $-1$ otherwise. We assume for now that the data is linearly separable. The SVM maximum margin classifier chooses the decision boundary for which the margin is maximized: among all separating hyperplanes, it returns the one that makes the biggest gap (or margin $M$) between the two classes,
$$
\begin{aligned}
\underset{\beta_0,\,\beta}{\text{maximize}}\quad & M\\
\text{subject to}\quad & \sum_{i=1}^{p} \beta_i^2 = 1,\\
& t_i(\beta_0 + \beta^t x_i) \ge M, \quad i = 1, \ldots, n.
\end{aligned}
\tag{1}
$$

(a) Show that the optimisation problem (1) can be re-expressed as
$$
\begin{aligned}
\underset{\beta_0,\,\beta}{\text{minimize}}\quad & \tfrac{1}{2}\|\beta\|^2\\
\text{subject to}\quad & t_i(\beta_0 + \beta^t x_i) \ge 1, \quad i = 1, \ldots, n.
\end{aligned}
\tag{2}
$$

(b) Write down the Lagrangian of problem (2).

(c) State the KKT conditions as they apply to this problem.

(d) Derive the dual problem of (2).

(e) Does strong duality hold? Explain why/why not.

(f) Using complementary slackness, find which points in the training data contribute to the optimal solution (that is, find the support vectors). Express the optimal $\beta$ in terms of the training data points and the optimal Lagrange multipliers.

(g) Suggest an expression for the optimal intercept $\beta_0$.

(h) How would the expression of the dual problem derived in (d) change if you decided to use a kernel SVM approach?

(i) Derive an expression for the margin in terms of the optimal values of the Lagrange multipliers.

(j) Suppose that the data is linearly non-separable. Introducing slack variables, suggest a modification to the optimization problem (2) that allows some training points to be misclassified.
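
As a numerical companion to parts (d), (f), (g) and (i), the sketch below solves the dual of problem (2) with CVXPY on a small synthetic, linearly separable data set and recovers the support vectors, $\beta$, $\beta_0$ and the margin. The data set, solver and tolerance are illustrative choices, not part of the problem statement.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 40, 2
X = np.vstack([rng.normal(-2.0, 0.6, (n // 2, p)),   # class -1 cloud
               rng.normal(+2.0, 0.6, (n // 2, p))])  # class +1 cloud
t = np.repeat([-1.0, 1.0], n // 2)

# Dual of (2): maximize sum_i a_i - 1/2 || sum_i a_i t_i x_i ||^2
#              subject to a_i >= 0 and sum_i a_i t_i = 0.
a = cp.Variable(n)
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ cp.multiply(a, t)))
constraints = [a >= 0, a @ t == 0]
cp.Problem(objective, constraints).solve()

alpha = a.value
sv = alpha > 1e-5                       # support vectors: strictly positive multipliers
beta = X.T @ (alpha * t)                # beta* = sum_i a_i t_i x_i
beta0 = np.mean(t[sv] - X[sv] @ beta)   # averaged over the support vectors
margin = 1.0 / np.linalg.norm(beta)     # at the optimum this equals 1 / sqrt(sum_i a_i)

print("support vectors:", np.where(sv)[0])
print("margin:", margin, "check:", 1.0 / np.sqrt(alpha.sum()))
```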

Problem 1. Which of the following sets are convex?

(i) A slab, that is, a set of the form $\{x \in \mathbb{R}^n \mid \alpha \le a^t x \le \beta\}$.

(ii) A wedge, that is, $\{x \in \mathbb{R}^n \mid a_1^t x \le b_1,\ a_2^t x \le b_2\}$.

(iii) The set of points closer to a given point $x_0$ than to a given set $S \subseteq \mathbb{R}^n$, that is,
$$\{x \mid \|x - x_0\|_2 \le \|x - y\|_2 \text{ for all } y \in S\}.$$

(iv) The set of points closer to one set than to another,
$$\{x \mid \operatorname{dist}(x, S) \le \operatorname{dist}(x, T)\},$$
where $S, T \subseteq \mathbb{R}^n$ and $\operatorname{dist}(x, S) = \inf\{\|x - z\|_2 \mid z \in S\}$.

(v) The set $\{x \mid x + S_2 \subseteq S_1\}$, where $S_1, S_2 \subseteq \mathbb{R}^n$ with $S_1$ convex.

Problem 2. (i) Derive the Lagrange dual problem of a linear program in inequality form,
$$\text{minimize } c^t x \quad \text{subject to } Ax \le b.$$
(ii) Verify that the Lagrange dual of the dual is equivalent to the primal problem.
(iii) When does strong duality hold?

Problem 3. Consider the following optimization problem in $\mathbb{R}^2$:
$$
\begin{aligned}
\text{minimize}\quad & J(x, y) = x + y\\
\text{subject to}\quad & g_1(x, y) = (x - 1)^2 + y^2 - 1 \le 0,\\
& g_2(x, y) = (x + 4)^2 + (y + 3)^2 - 25 \le 0.
\end{aligned}
$$
(i) Show that the feasible set is convex and sketch it.
(ii) Derive the KKT conditions for this problem, and deduce the solution(s) to the problem.
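
For Problem 3, the analytical KKT solution can be cross-checked numerically. The sketch below (an illustrative check, not the requested derivation) solves the problem with CVXPY and reads off the Lagrange multipliers of the two constraints to verify stationarity and complementary slackness.

```python
import numpy as np
import cvxpy as cp

z = cp.Variable(2)                                     # z = (x, y)
g1 = cp.sum_squares(z - np.array([1.0, 0.0])) - 1.0
g2 = cp.sum_squares(z - np.array([-4.0, -3.0])) - 25.0
constraints = [g1 <= 0, g2 <= 0]
problem = cp.Problem(cp.Minimize(cp.sum(z)), constraints)
problem.solve()

mu1, mu2 = constraints[0].dual_value, constraints[1].dual_value
# Stationarity: grad J + mu1 * grad g1 + mu2 * grad g2 = 0 at the optimum.
residual = (np.array([1.0, 1.0])
            + mu1 * 2 * (z.value - np.array([1.0, 0.0]))
            + mu2 * 2 * (z.value - np.array([-4.0, -3.0])))
print("solution (x, y):", z.value, "multipliers:", (mu1, mu2))
print("stationarity residual:", np.linalg.norm(residual))
print("complementary slackness:", mu1 * g1.value, mu2 * g2.value)
```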

Problem 4. In each case, recognise and formulate the problem as either a linear or a quadratic optimisation problem.

(i) A healthy diet contains $m$ different nutrients in quantities at least equal to $b_1, \ldots, b_m$. We can compose such a diet by choosing nonnegative quantities $x_1, \ldots, x_n$ of $n$ different foods. One unit quantity of food $j$ contains an amount $a_{ij}$ of nutrient $i$, and has a cost of $c_j$. We want to determine the cheapest diet that satisfies the nutritional requirements.

(ii) Find the distance between two polyhedra $P_1 = \{x \mid A_1 x \le b_1\}$ and $P_2 = \{x \mid A_2 x \le b_2\}$.

Problem 5. Let $X$ be a discrete random variable such that $\mathbb{P}(X = j) = p_j$, and let $A = (A_{ij})$, where $A_{ij} = f_i(j)$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. We want to find the distribution $p = (p_i)$ with maximum entropy (the closest to the uniform distribution), under the constraints $\mathbb{E}(f_i(X)) \le b_i$, for $i = 1, \ldots, m$:
$$
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{n} p_i \log(p_i)\\
\text{subject to}\quad & Ap \le b,\\
& \mathbf{1}^t p = 1.
\end{aligned}
$$

(i) Show that the dual problem simplifies to
$$
\text{maximize } -b^t \lambda - \log\Big(\sum_{i=1}^{n} e^{-a_i^t \lambda}\Big) \quad \text{subject to } \lambda \ge 0,
$$
where $a_i$ is the $i$-th column of $A$.

(ii) Under which condition(s) is the duality gap zero?
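
For Problem 5, the sketch below (illustrative only, with $A$ and $b$ generated as random placeholders such that the uniform distribution is strictly feasible) solves the maximum-entropy primal and the simplified dual of part (i) with CVXPY and compares their optimal values.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
m, n = 3, 8
A = rng.normal(size=(m, n))                 # A_ij = f_i(j), placeholder values
b = A @ np.full(n, 1.0 / n) + 0.1           # uniform distribution is strictly feasible

# Primal: minimize sum_i p_i log p_i  s.t.  A p <= b,  1^t p = 1.
p = cp.Variable(n)
primal = cp.Problem(cp.Minimize(-cp.sum(cp.entr(p))),
                    [A @ p <= b, cp.sum(p) == 1])
primal.solve()

# Simplified dual: maximize -b^t lam - log(sum_i exp(-a_i^t lam)),  lam >= 0.
lam = cp.Variable(m)
dual = cp.Problem(cp.Maximize(-b @ lam - cp.log_sum_exp(-A.T @ lam)),
                  [lam >= 0])
dual.solve()

print("primal optimal value:", primal.value)
print("dual optimal value:  ", dual.value)  # equal up to solver tolerance
```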

Problem 6. In a Boolean linear program, the variable $x$ is constrained to have components equal to 0 or 1:
$$
\begin{aligned}
\text{minimize}\quad & c^t x\\
\text{subject to}\quad & Ax \le b,\\
& x_i \in \{0, 1\}, \quad i = 1, \ldots, n.
\end{aligned}
$$
Although the feasible set is finite, this optimization problem is in general difficult to solve. We refer to this problem as the Boolean LP. We investigate two methods to obtain a lower bound on its optimal value.

In the first method, called relaxation, the constraint that $x_i$ is 0 or 1 is replaced with the linear inequalities $0 \le x_i \le 1$:
$$
\begin{aligned}
\text{minimize}\quad & c^t x\\
\text{subject to}\quad & Ax \le b,\\
& 0 \le x_i \le 1, \quad i = 1, \ldots, n.
\end{aligned}
$$
We refer to this problem as the LP relaxation. This problem is far easier to solve than the original problem.

(i) Show that the optimal value of the LP relaxation is a lower bound on the optimal value of the Boolean LP.

(ii) What can you say about the Boolean LP if the LP relaxation is infeasible?

The Boolean LP can be reformulated as
$$
\begin{aligned}
\text{minimize}\quad & c^t x\\
\text{subject to}\quad & Ax \le b,\\
& x_i(1 - x_i) = 0, \quad i = 1, \ldots, n,
\end{aligned}
\tag{3}
$$
which has quadratic equality constraints.

(iii) Find the Lagrange dual function of problem (3), and show that the dual problem can be written
$$
\text{maximize } -b^t \lambda + \sum_{i=1}^{n} \min\{0,\, c_i + a_i^t \lambda\} \quad \text{subject to } \lambda \ge 0,
$$
where $a_i$ is the $i$-th column of $A$.

(iv) The optimal value of the dual of problem (3) provides a lower bound on the optimal value of the Boolean LP. This method of finding a lower bound is called Lagrangian relaxation. Show that the lower bound obtained using Lagrangian relaxation is the same as the lower bound obtained using LP relaxation.
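
The sketch below illustrates parts (i)-(iv) on a small randomly generated Boolean LP: it computes the LP-relaxation bound and the Lagrangian-relaxation bound of part (iii) with CVXPY and checks that the two coincide. The data $A$, $b$, $c$ are placeholders chosen so that the problem is feasible.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
m, n = 6, 5
A = rng.normal(size=(m, n))
c = rng.normal(size=n)
x0 = (rng.random(n) > 0.5).astype(float)    # some Boolean point ...
b = A @ x0 + 0.5                            # ... kept feasible with a little slack

# LP relaxation: minimize c^t x  s.t.  A x <= b,  0 <= x <= 1.
x = cp.Variable(n)
lp = cp.Problem(cp.Minimize(c @ x), [A @ x <= b, x >= 0, x <= 1])
lp.solve()

# Lagrangian relaxation (dual of (3)):
#   maximize -b^t lam + sum_i min{0, c_i + a_i^t lam}  s.t.  lam >= 0.
lam = cp.Variable(m)
lagr = cp.Problem(cp.Maximize(-b @ lam + cp.sum(cp.minimum(0, c + A.T @ lam))),
                  [lam >= 0])
lagr.solve()

print("LP relaxation bound:        ", lp.value)
print("Lagrangian relaxation bound:", lagr.value)   # the two lower bounds coincide
```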

Problem 7. We extend the SVM to regression problems. In regularised linear regression, the error function is given by
$$
\sum_{i=1}^{n} (t_i - y(x_i))^2 + \lambda \|\beta\|^2, \quad \text{where } y(x_i) = \beta_0 + x_i^t \beta.
$$
The quadratic error function is replaced with an error function which gives zero error if $|t_i - y(x_i)|$ is less than some $\epsilon > 0$, and a linear penalty otherwise:
$$
E_\epsilon(t_i - y(x_i)) =
\begin{cases}
0 & \text{if } |t_i - y(x_i)| < \epsilon,\\
|t_i - y(x_i)| - \epsilon & \text{otherwise.}
\end{cases}
$$
The problem is now to minimise the following regularised error function:
$$
C \sum_{i=1}^{n} E_\epsilon(t_i - y(x_i)) + \frac{1}{2}\|\beta\|^2, \quad \text{where } y(x_i) = \beta_0 + x_i^t \beta,
$$
with $C > 0$ some regularisation parameter. For each observation $x_i$, we introduce two slack variables $\xi_i \ge 0$ and $\hat{\xi}_i \ge 0$, where $\xi_i > 0$ corresponds to a point for which $t_i > y(x_i) + \epsilon$, and $\hat{\xi}_i > 0$ to a point for which $t_i < y(x_i) - \epsilon$.

(i) Show that the error function to minimise for support vector regression can be re-expressed as
$$
\begin{aligned}
\text{minimize}\quad & C \sum_{i=1}^{n} (\xi_i + \hat{\xi}_i) + \frac{1}{2}\|\beta\|^2\\
\text{subject to}\quad & \xi_i \ge 0, \quad \hat{\xi}_i \ge 0,\\
& t_i \le y(x_i) + \epsilon + \xi_i \quad \text{and} \quad t_i \ge y(x_i) - \epsilon - \hat{\xi}_i.
\end{aligned}
$$

(ii) Write down the expression of the Lagrangian function, and the associated KKT conditions.

(iii) Write down the dual optimisation problem.

(iv) Express the optimal solution $y^*(x_i)$ in terms of the optimal vector of coefficients $\beta^*$. You do not need to return an expression for the optimal intercept $\beta_0$ at this stage.

(v) Provide a detailed analysis of the solution: which points have Lagrange multipliers strictly positive? Equal to zero? Equal to $C$? Which points are support vectors?

(vi) Suggest an expression for the optimal intercept $\beta_0$.
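
A minimal sketch of the support vector regression problem of part (i), solved directly in its primal slack-variable form with CVXPY on synthetic one-dimensional data; the data, $C$ and $\epsilon$ are illustrative choices.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, p = 50, 1
X = np.sort(rng.uniform(-3, 3, (n, p)), axis=0)
t = 0.8 * X[:, 0] + 0.3 * rng.normal(size=n)   # noisy linear target
C, eps = 1.0, 0.2                              # illustrative hyperparameters

beta, beta0 = cp.Variable(p), cp.Variable()
xi, xi_hat = cp.Variable(n), cp.Variable(n)
y = X @ beta + beta0

problem = cp.Problem(
    cp.Minimize(C * cp.sum(xi + xi_hat) + 0.5 * cp.sum_squares(beta)),
    [xi >= 0, xi_hat >= 0,
     t <= y + eps + xi,         # points above the epsilon-tube use xi
     t >= y - eps - xi_hat])    # points below the epsilon-tube use xi_hat
problem.solve()

# Support vectors lie on or outside the epsilon-tube (active constraints,
# hence nonzero Lagrange multipliers).
residual = np.abs(t - (X @ beta.value + beta0.value))
print("beta:", beta.value, "beta0:", beta0.value)
print("number of support vectors:", int(np.sum(residual >= eps - 1e-6)))
```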