Convex Optimization in Classification Problems


New Trends in Optimization and Computational Algorithms, December 9-13, 2001

Convex Optimization in Classification Problems

Laurent El Ghaoui
Department of EECS, UC Berkeley
elghaoui@eecs.berkeley.edu

goal
- the connection between classification and LP / convex QP has a long history (Vapnik, Mangasarian, Bennett, etc.)
- recent progress in convex optimization: conic and semidefinite programming; geometric programming; robust optimization
- we'll outline some connections between convex optimization and classification problems
- joint work with: M. Jordan, N. Cristianini, G. Lanckriet, C. Bhattacharyya

outline
- convex optimization
- SVMs and robust linear programming
- minimax probability machine
- kernel optimization

convex optimization
- standard form: min_x f_0(x) : f_i(x) ≤ 0, i = 1, ..., m
- arises in many applications
- convexity is not always recognized in practice
- large classes of convex problems can be solved in polynomial time (Nesterov, Nemirovski, 1990)

conic optimization
- special class of convex problems: min_x c^T x : Ax = b, x ∈ K, where K is a cone, a direct product of the following building blocks:
  - K = R^n_+ : linear programming
  - K = {(y, t) ∈ R^{n+1} : t ≥ ‖y‖_2} : second-order cone programming, quadratic programming
  - K = {X ∈ R^{n×n} : X = X^T ⪰ 0} : semidefinite programming
- fact: conic problems over these cones can be solved in polynomial time (Nesterov, Nemirovski, 1990)
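
As an aside (editor's sketch, not part of the original slides), the three cone classes above can be written directly with the cvxpy modeling library; the problem data here is random and purely illustrative.

```python
# Sketch: LP, SOCP and SDP as instances of conic optimization (toy random data).
import cvxpy as cp
import numpy as np

np.random.seed(0)
A, b, c = np.random.randn(3, 5), np.random.randn(3), np.random.randn(5)

# LP: x in the nonnegative orthant R^n_+
x = cp.Variable(5)
lp = cp.Problem(cp.Minimize(c @ x), [A @ x == b, x >= 0])

# SOCP: (y, t) in the second-order cone, i.e. ||y||_2 <= t
y, t = cp.Variable(5), cp.Variable()
socp = cp.Problem(cp.Minimize(t), [A @ y == b, cp.norm(y, 2) <= t])

# SDP: X symmetric positive semidefinite
X = cp.Variable((5, 5), PSD=True)
sdp = cp.Problem(cp.Minimize(cp.trace(X)), [cp.trace(A.T @ A @ X) == 1])

for name, prob in [("LP", lp), ("SOCP", socp), ("SDP", sdp)]:
    prob.solve()
    print(name, prob.status, prob.value)
```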

conic duality
- the dual of the conic problem min_x c^T x : Ax = b, x ∈ K is max_y b^T y : c − A^T y ∈ K*
- where K* = {z : ⟨z, x⟩ ≥ 0 for all x ∈ K} is the cone dual to K
- for the cones mentioned before, and direct products of them, K* = K (they are self-dual)

robust optimization
- conic problem in dual form: max_y b^T y : c − A^T y ∈ K
- what if A is unknown-but-bounded, say A ∈ 𝒜, where the uncertainty set 𝒜 is given?
- robust counterpart: max_y b^T y : c − A^T y ∈ K for all A ∈ 𝒜
- still convex, but tractability depends on 𝒜
- systematic ways to approximate (get lower bounds)
- for large classes of 𝒜, the approximation is exact

example: robust LP
- linear program: min_x c^T x : a_i^T x ≤ b_i, i = 1, ..., m
- assume the a_i's are unknown-but-bounded in ellipsoids E_i := { a : (a − â_i)^T Γ_i^{-1} (a − â_i) ≤ 1 }, where â_i is the center and Γ_i ≻ 0 the shape matrix
- robust LP: min_x c^T x : a_i^T x ≤ b_i for all a_i ∈ E_i, i = 1, ..., m

robust LP: SOCP representation
- the robust LP is equivalent to min_x c^T x : â_i^T x + ‖Γ_i^{1/2} x‖_2 ≤ b_i, i = 1, ..., m
- a second-order cone program!
- interpretation: smooths the boundary of the feasible set
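
A hedged sketch (mine, not from the slides) of the robust LP in its SOCP form, using cvxpy with synthetic data; the names a_hat, Gammas and the box bound are my own toy choices.

```python
# Sketch: robust LP with ellipsoidal uncertainty, written directly as an SOCP.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6
a_hat = rng.standard_normal((m, n))       # nominal constraint vectors â_i
b = rng.uniform(1.0, 2.0, size=m)         # right-hand sides b_i (x = 0 stays robustly feasible)
Gammas = []
for _ in range(m):
    G = rng.standard_normal((n, n))
    Gammas.append(G @ G.T + 0.1 * np.eye(n))   # shape matrices Γ_i > 0
c = rng.standard_normal(n)

x = cp.Variable(n)
constraints = [cp.abs(x) <= 10]           # box just to keep the toy problem bounded
for i in range(m):
    L = np.linalg.cholesky(Gammas[i])     # Γ_i = L L^T, so ||Γ_i^{1/2} x||_2 = ||L^T x||_2
    # worst case of a_i^T x over the ellipsoid E_i:
    constraints.append(a_hat[i] @ x + cp.norm(L.T @ x, 2) <= b[i])

prob = cp.Problem(cp.Minimize(c @ x), constraints)
prob.solve()
print("robust optimal value:", prob.value)
```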

LP with Gaussian coefficients
- assume a ~ N(â, Γ); then, for a given x, Prob{a^T x ≤ b} ≥ 1 − ε is equivalent to â^T x + κ ‖Γ^{1/2} x‖_2 ≤ b, where κ = Φ^{-1}(1 − ε) and Φ is the c.d.f. of N(0, 1)
- hence, LPs with Gaussian coefficients can be solved using second-order cone programming
- the resulting SOCP is similar to the one obtained with ellipsoidal uncertainty

LP with random coefficients
- assume a ~ (â, Γ), i.e. the distribution of a has mean â and covariance matrix Γ, but is otherwise unknown
- Chebychev inequality: Prob{a^T x ≤ b} ≥ 1 − ε (for every such distribution) is equivalent to â^T x + κ ‖Γ^{1/2} x‖_2 ≤ b, where κ = √((1 − ε)/ε)
- leads to an SOCP similar to the ones obtained previously
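
For concreteness, a small sketch (not from the talk) comparing the two safety factors κ from the last two slides; scipy's norm.ppf supplies Φ^{-1}.

```python
# Sketch: safety factor kappa for Prob{a^T x <= b} >= 1 - eps,
# Gaussian case vs. distribution-free (Chebychev) case.
import numpy as np
from scipy.stats import norm

eps = 0.05
kappa_gauss = norm.ppf(1 - eps)               # κ = Φ^{-1}(1 - ε)
kappa_chebychev = np.sqrt((1 - eps) / eps)    # κ = sqrt((1 - ε)/ε)

print(kappa_gauss, kappa_chebychev)           # ≈ 1.64 vs ≈ 4.36

# Either value plugs into the same SOCP constraint
#   â^T x + κ ||Γ^{1/2} x||_2 <= b ;
# the distribution-free κ is the more conservative of the two.
```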

outline
- convex optimization
- SVMs and robust linear programming
- minimax probability machine
- kernel optimization

SVMs: setup
- given data points x_i with labels y_i = ±1, i = 1, ..., N
- two-class linear classification with support vector machines: min_a ‖a‖_2 : y_i(a^T x_i − b) ≥ 1, i = 1, ..., N
- amounts to selecting one separating hyperplane among the many possible
- the problem is feasible iff there exists a separating hyperplane between the two classes
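
A minimal sketch (editor's, not the speaker's code) of this separating-hyperplane problem on synthetic, separable data, again with cvxpy.

```python
# Sketch: min ||a||_2  s.t.  y_i (a^T x_i - b) >= 1, on two separable clusters.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
X_pos = rng.standard_normal((20, 2)) + np.array([4.0, 4.0])
X_neg = rng.standard_normal((20, 2)) - np.array([4.0, 4.0])
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

a = cp.Variable(2)
b = cp.Variable()
prob = cp.Problem(cp.Minimize(cp.norm(a, 2)),
                  [cp.multiply(y, X @ a - b) >= 1])
prob.solve()
print("margin 1/||a|| =", 1.0 / np.linalg.norm(a.value))
```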

SVMs: robust optimization interpretation
- interpretation: SVMs are a way to handle noise in the data points
- assume each data point is unknown-but-bounded in a sphere of radius ρ centered at x_i
- find the largest ρ such that separation is still possible between the two classes of perturbed points

variations
- can use other data noise models:
  - hypercube uncertainty (gives rise to an LP)
  - ellipsoidal uncertainty (→ QP)
  - probabilistic uncertainty, Gaussian or Chebychev (→ QP)

separation with hypercube uncertainty
- assume each data point is unknown-but-bounded in a hypercube C_i: x_i ∈ C_i := { x̂_i + ρ P u : ‖u‖_∞ ≤ 1 }, where the centers x̂_i and shape matrix P are given
- robust separation leads to the linear program min_a ‖P a‖_1 : y_i(a^T x̂_i − b) ≥ 1, i = 1, ..., N

separation with ellipsoidal uncertainty
- assume each data point is unknown-but-bounded in an ellipsoid E_i: x_i ∈ E_i := { x̂_i + ρ P u : ‖u‖_2 ≤ 1 }, where the centers x̂_i and shape matrix P are given
- robust separation leads to the QP min_a ‖P a‖_2 : y_i(a^T x̂_i − b) ≥ 1, i = 1, ..., N
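
The following sketch (my own, under the assumptions of the last two slides, with P = I and synthetic data) shows how the choice of uncertainty set only changes the norm in the objective.

```python
# Sketch: robust separation where the regularizer's norm follows the noise model:
# ||P a||_1 for hypercube (box) noise, ||P a||_2 for ellipsoidal noise.
import cvxpy as cp
import numpy as np

def robust_separator(X_hat, y, P, norm_order):
    """min ||P a||_p  s.t.  y_i (a^T x̂_i - b) >= 1."""
    n = X_hat.shape[1]
    a, b = cp.Variable(n), cp.Variable()
    prob = cp.Problem(cp.Minimize(cp.norm(P @ a, norm_order)),
                      [cp.multiply(y, X_hat @ a - b) >= 1])
    prob.solve()
    return a.value, b.value

rng = np.random.default_rng(2)
X_hat = np.vstack([rng.standard_normal((15, 2)) + 4.0,
                   rng.standard_normal((15, 2)) - 4.0])
y = np.hstack([np.ones(15), -np.ones(15)])
P = np.eye(2)

a_box, _ = robust_separator(X_hat, y, P, 1)   # hypercube noise -> LP
a_ell, _ = robust_separator(X_hat, y, P, 2)   # ellipsoidal noise -> QP
print(a_box, a_ell)
```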

outline
- convex optimization
- SVMs and robust linear programming
- minimax probability machine
- kernel optimization

minimax probability machine
- goal: make assumptions about the data-generating process:
  - do not assume Gaussian distributions
  - use a second-moment analysis of the two classes
- let x̂_±, Γ_± be the mean and covariance matrix of class y = ±1
- MPM: maximize 1 − ε such that there exists (a, b) with
  inf_{x ~ (x̂_+, Γ_+)} Prob{a^T x ≥ b} ≥ 1 − ε
  inf_{x ~ (x̂_−, Γ_−)} Prob{a^T x ≤ b} ≥ 1 − ε

MPMs: optimization problem
- two-sided, multivariate Chebychev inequality:
  inf_{x ~ (x̂, Γ)} Prob{a^T x ≤ b} = (b − a^T x̂)_+^2 / ((b − a^T x̂)_+^2 + a^T Γ a)
- the MPM leads to the second-order cone program
  min_a ‖Γ_+^{1/2} a‖_2 + ‖Γ_−^{1/2} a‖_2 : a^T(x̂_+ − x̂_−) = 1
- complexity is the same as for standard SVMs
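
A sketch (not from the talk) of this SOCP with plug-in means and covariances of two synthetic classes; the relation between the optimal value and the worst-case accuracy follows the MPM paper cited at the end.

```python
# Sketch: MPM SOCP  min ||Γ_+^{1/2} a|| + ||Γ_-^{1/2} a||  s.t.  a^T (x̂_+ - x̂_-) = 1.
import cvxpy as cp
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(3)
X_pos = rng.multivariate_normal([2, 2], [[1.0, 0.3], [0.3, 1.0]], size=200)
X_neg = rng.multivariate_normal([-2, -2], [[1.5, -0.2], [-0.2, 0.5]], size=200)

x_hat_p, x_hat_m = X_pos.mean(axis=0), X_neg.mean(axis=0)
S_p = np.real(sqrtm(np.cov(X_pos, rowvar=False)))   # Γ_+^{1/2}
S_m = np.real(sqrtm(np.cov(X_neg, rowvar=False)))   # Γ_-^{1/2}

a = cp.Variable(2)
prob = cp.Problem(cp.Minimize(cp.norm(S_p @ a, 2) + cp.norm(S_m @ a, 2)),
                  [a @ (x_hat_p - x_hat_m) == 1])
prob.solve()

# kappa = 1 / optimal value; the bias b is recovered from the optimality
# conditions (omitted in this sketch).
kappa = 1.0 / prob.value
print("worst-case accuracy 1 - eps =", kappa**2 / (1 + kappa**2))
```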

dual problem
- express the problem as an unconstrained min-max problem:
  min_a max_{‖u‖_2 ≤ 1, ‖v‖_2 ≤ 1} u^T Γ_+^{1/2} a − v^T Γ_−^{1/2} a + λ(1 − a^T(x̂_+ − x̂_−))
- exchange min and max, and set ρ := 1/λ:
  min_{ρ,u,v} ρ : x̂_+ + Γ_+^{1/2} u = x̂_− + Γ_−^{1/2} v, ‖u‖_2 ≤ ρ, ‖v‖_2 ≤ ρ
- geometric interpretation: define the two ellipsoids E_±(ρ) := { x̂_± + Γ_±^{1/2} u : ‖u‖_2 ≤ ρ } and find the smallest ρ for which the ellipsoids intersect (equivalently, the largest ρ for which they can still be separated)
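
A self-contained sketch (editor's) of the dual, geometric form, with toy means and covariance square roots chosen by hand.

```python
# Sketch: smallest rho for which the two ellipsoids E_+(rho) and E_-(rho) intersect.
import cvxpy as cp
import numpy as np

x_hat_p, x_hat_m = np.array([2.0, 2.0]), np.array([-2.0, -2.0])
S_p = np.array([[1.0, 0.2], [0.2, 1.0]])   # Γ_+^{1/2}
S_m = np.array([[1.2, 0.0], [0.0, 0.7]])   # Γ_-^{1/2}

u, v, rho = cp.Variable(2), cp.Variable(2), cp.Variable()
dual = cp.Problem(cp.Minimize(rho),
                  [x_hat_p + S_p @ u == x_hat_m + S_m @ v,
                   cp.norm(u, 2) <= rho,
                   cp.norm(v, 2) <= rho])
dual.solve()
# by strong duality, rho* equals 1 / (optimal value of the SOCP on the previous slide)
print("rho* =", dual.value)
```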

robust optimization interpretation
- assume the data is generated as follows: for data with label +, x_+ ∈ E_+(ρ) := { x̂_+ + Γ_+^{1/2} u : ‖u‖_2 ≤ ρ }, and similarly for data with label −
- the MPM finds the largest ρ for which robust separation is possible
- [figure: ellipsoids centered at x̂_+ and x̂_−, separated by the hyperplane a^T x − b = 0]

variations
- minimize a weighted sum of misclassification probabilities
- quadratic separation: find a quadratic set Q such that
  inf_{x ~ (x̂_+, Γ_+)} Prob{x ∈ Q} ≥ 1 − ε
  inf_{x ~ (x̂_−, Γ_−)} Prob{x ∉ Q} ≥ 1 − ε
  leads to a semidefinite programming problem
- nonlinear classification via kernels (using plug-in estimates of the mean and covariance matrix)

outline
- convex optimization
- SVMs and robust linear programming
- minimax probability machine
- kernel optimization

transduction
- transduction: given a labeled training set and an unlabeled test set, predict the labels
- the data contains both labeled points and unlabeled points

kernel methods
- main goal: separate using a nonlinear classifier a^T φ(x) = b, where φ is a nonlinear operator
- define the kernel matrix (on both labeled and unlabeled data) K_ij = φ(x_i)^T φ(x_j)
- in a transductive setting, all we need to know to predict the labels are a, b and the kernel matrix

kernel methods: idea of proof
- all the linear classification methods we've seen so far are such that, at the optimum, a is in the range of the labeled data: a = Σ_i λ_i x_i
- thus, in the nonlinear case, the optimization problem depends only on the values of the kernel matrix K_ij for labeled points x_i, x_j
- in a transductive setting, the prediction of labels also involves only K_ij, since for an unlabeled data point x_j, a^T φ(x_j) = Σ_i λ_i φ(x_i)^T φ(x_j) involves only the K_ij's
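
A tiny illustration (not from the slides) that prediction only requires kernel evaluations, never φ itself; the RBF kernel and the expansion coefficients λ_i below are placeholders.

```python
# Sketch: once a = Σ_i λ_i φ(x_i), the decision value at a test point is
# a^T φ(x_test) - b = Σ_i λ_i K(x_i, x_test) - b.
import numpy as np

def rbf(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
lam = np.array([0.5, -1.0, 0.5])    # expansion coefficients (placeholder values)
b = 0.1

def decision(x_test):
    return sum(l * rbf(xi, x_test) for l, xi in zip(lam, X_train)) - b

print(decision(np.array([0.5, 0.5])))
```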

kernel optimization
- all the previous algorithms can be kernelized
- what is a good kernel?
  - the kernel should be close to a target kernel
  - the kernel matrix should satisfy some structure constraints
- main idea: the kernel can be described via the Gram matrix of the data points, hence is a positive semidefinite matrix
- semidefinite programming plays a role in kernel optimization

setup
- we assume we are given training and test sets
- goal: maximize alignment to a given kernel on the training set (translates into constraints on the upper-left block of the kernel matrix)
- the kernel matrix satisfies structure constraints (translates into constraints on the whole matrix, including the test set)

alignment
- idea: align K to a target kernel C by maximizing A(K, C) := ⟨C, K⟩ / (‖K‖_F ‖C‖_F), where ⟨C, K⟩ = Tr(CK) is the inner product of two symmetric matrices, and ‖C‖_F = √⟨C, C⟩ is the Frobenius norm
- we can impose a lower bound α on the alignment with the second-order cone constraint on K: α ‖C‖_F ‖K‖_F ≤ ⟨C, K⟩

affine constraints
- it is also useful to impose that the kernel lies in some affine subspace
- example: assume that K is of the form K = Σ_{i=1}^{N} λ_i u_i u_i^T, where λ_i ≥ 0 are the (variable) eigenvalues, and the u_i's are the (fixed) eigenvectors

optimizing kernels: example problem
- goal: find a kernel that
  - has alignment at least α with a given matrix (e.g., C = yy^T) on the training set
  - belongs to some affine set 𝒦
- the problem reduces to a semidefinite programming feasibility problem: find K such that K ∈ 𝒦, α ‖C‖_F ‖K‖_F ≤ ⟨C, K⟩, K ⪰ 0
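
A sketch of this feasibility problem (my construction, not the authors' code): the eigenvectors come from an initial RBF Gram matrix on training plus test points, the eigenvalues are the variables, and the alignment level α = 0.3 and all data are toy choices.

```python
# Sketch: kernel learning as a feasibility problem over K = Σ λ_i u_i u_i^T, λ_i >= 0,
# with an alignment constraint on the training block.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n_train, n_test, N = 8, 4, 12
X = np.vstack([
    rng.standard_normal((4, 3)) + 3.0,   # training, class +1
    rng.standard_normal((4, 3)) - 3.0,   # training, class -1
    rng.standard_normal((2, 3)) + 3.0,   # test (unlabeled)
    rng.standard_normal((2, 3)) - 3.0,   # test (unlabeled)
])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])       # training labels only

sq_dists = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
K0 = np.exp(-0.05 * sq_dists)                    # initial RBF Gram matrix
_, U = np.linalg.eigh(K0)                        # fixed eigenvectors u_i

lam = cp.Variable(N, nonneg=True)                # variable eigenvalues
K = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(N))   # K >= 0 automatically
K_tr = K[:n_train, :n_train]                     # upper-left (training) block
C = np.outer(y, y)
alpha = 0.3

prob = cp.Problem(cp.Minimize(0), [
    cp.trace(K) == N,                            # normalization
    alpha * np.linalg.norm(C, "fro") * cp.norm(K_tr, "fro") <= cp.trace(C @ K_tr),
])
prob.solve()
print(prob.status)
```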

kernel optimization: what's next?
- of course, this is not a learning method
- there is much to learn from duality theory
- many other constraints can be handled, e.g., margin requirements

wrap-up
- convex optimization has much to offer to, and to gain from, interaction with classification
- we described variations on linear classification
- many robust optimization interpretations
- all these methods can be kernelized
- kernel optimization has high potential

see also
- Learning the Kernel Matrix with Semi-Definite Programming (Lanckriet, Cristianini, Bartlett, El Ghaoui, Jordan), in preparation (2002)
- Minimax Probability Machine (Lanckriet, Bhattacharyya, El Ghaoui, Jordan), NIPS 2001