Convex optimization. Javier Peña, Carnegie Mellon University. Universidad de los Andes, Bogotá, Colombia, September 2014

Convex optimization
Javier Peña, Carnegie Mellon University
Universidad de los Andes, Bogotá, Colombia
September 2014
1 / 41

Convex optimization
A problem of the form
$\min_x \; f(x)$ subject to $x \in Q$,
where $Q \subseteq \mathbb{R}^n$ is a convex set:
$x, y \in Q, \ \lambda \in [0,1] \ \Rightarrow \ \lambda x + (1-\lambda)y \in Q$,
and $f : Q \to \mathbb{R}$ is a convex function:
$\mathrm{epigraph}(f) = \{(x,t) \in \mathbb{R}^{n+1} : x \in Q, \ t \ge f(x)\}$ is a convex set.
2 / 41

Special cases

Linear programming
$\min_y \; \langle c, y \rangle$ subject to $\langle a_i, y \rangle - b_i \ge 0, \ i = 1, \dots, n$.

Semidefinite programming
$\min_y \; \langle c, y \rangle$ subject to $\sum_{j=1}^m A_j y_j - B \succeq 0$.

Second-order cone programming
$\min_y \; \langle c, y \rangle$ subject to $\langle a_i, y \rangle - b_i \ge \|A_i y - d_i\|_2, \ i = 1, \dots, r$.
3 / 41
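
To make these formulations concrete, here is a minimal cvxpy sketch of the linear and second-order cone programs (assumptions: cvxpy and numpy are installed, and all problem data below are made up so that the problems are feasible and bounded):

import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n = 5, 8
a = np.random.randn(n, m)            # rows are the vectors a_i
y0 = np.random.randn(m)              # used only to construct feasible data
b = a @ y0 - 1.0                     # y0 strictly satisfies <a_i, y> - b_i >= 0
c = a.T @ np.ones(n)                 # keeps the objective bounded below
A1 = np.random.randn(3, m)
d1 = A1 @ y0                         # the cone constraint also holds at y0

y = cp.Variable(m)

# Linear program: minimize <c, y> subject to <a_i, y> - b_i >= 0
lp = cp.Problem(cp.Minimize(c @ y), [a @ y - b >= 0])
lp.solve()

# Second-order cone program: add <a_1, y> - b_1 >= ||A_1 y - d_1||_2
socp = cp.Problem(cp.Minimize(c @ y),
                  [a @ y - b >= 0,
                   cp.norm(A1 @ y - d1, 2) <= a[0] @ y - b[0]])
socp.solve()
print(lp.value, socp.value)

# A semidefinite program is written analogously, with an LMI constraint
# such as  sum_j y[j]*A_j - B >> 0  (cvxpy's PSD constraint).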

Agenda Applications Algorithms Open problems 4 / 41

Applications 5 / 41

Classification
Classification data $D = \{(x_1, l_1), \dots, (x_n, l_n)\}$, with $x_i \in \mathbb{R}^d$, $l_i \in \{-1, 1\}$.

Linear classification
Find $(\beta_0, \beta) \in \mathbb{R}^{d+1}$ such that for $i = 1, \dots, n$
$\mathrm{sgn}(\beta_0 + \langle \beta, x_i \rangle) = l_i \ \Leftrightarrow \ l_i(\beta_0 + \langle x_i, \beta \rangle) > 0$.

Support vector machines
Find the linear classifier with largest margin:
$\min_{\beta_0, \beta} \; \|\beta\|_2$ subject to $l_i(\langle x_i, \beta \rangle + \beta_0) \ge 1, \ i = 1, \dots, n$.
6 / 41
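
Below is a minimal cvxpy sketch of the hard-margin SVM stated above; the linearly separable toy data and all variable names are my own illustration, not part of the talk:

import cvxpy as cp
import numpy as np

np.random.seed(1)
n, d = 40, 2
X = np.random.randn(n, d)
l = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # labels from a linear rule
X += 0.5 * l[:, None]                            # push the two classes apart

beta = cp.Variable(d)
beta0 = cp.Variable()
svm = cp.Problem(cp.Minimize(cp.norm(beta, 2)),
                 [cp.multiply(l, X @ beta + beta0) >= 1])
svm.solve()
print(beta.value, beta0.value)                   # maximum-margin separator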

Regression
Regression data $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.

Linear regression
Find $\beta \in \mathbb{R}^d$ that minimizes the training error:
$\min_\beta \; \sum_{i=1}^n (\beta_0 + \langle \beta, x_i \rangle - y_i)^2 \ \equiv \ \min_\beta \; \|X\beta - y\|_2^2$,
where
$X := \begin{bmatrix} 1 & x_1^T \\ \vdots & \vdots \\ 1 & x_n^T \end{bmatrix}, \quad y := \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$.
7 / 41
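
The least-squares problem can be solved directly; a minimal numpy sketch on made-up data (the intercept handling and noise level are my own choices):

import numpy as np

np.random.seed(2)
n, d = 50, 3
x = np.random.randn(n, d)
beta_true = np.array([1.0, -2.0, 0.5])
y = 0.3 + x @ beta_true + 0.1 * np.random.randn(n)

X = np.hstack([np.ones((n, 1)), x])               # prepend a column of ones for beta_0
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves min ||X beta - y||_2^2
print(beta_hat)                                   # approximately [0.3, 1.0, -2.0, 0.5]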

Sparse regression
High-dimensional regression data $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, with large $d$, e.g., $d > n$.
Want $\beta \in \mathbb{R}^d$ sparse:
$\min_\beta \left( \|X\beta - y\|_2^2 + \lambda \|\beta\|_0 \right)$, where $\|\beta\|_0 := |\{i : \beta_i \ne 0\}|$.

Lasso regression (Tibshirani, 1996)
The above problem is computationally intractable. Use instead
$\min_\beta \left( \|X\beta - y\|_2^2 + \lambda \|\beta\|_1 \right)$.

Extensions
Group lasso, fused lasso, and others.
8 / 41
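
A minimal cvxpy sketch of the lasso problem above, on made-up data with a sparse ground truth (the value of lambda is arbitrary, for illustration only):

import cvxpy as cp
import numpy as np

np.random.seed(3)
n, d = 30, 100                                # d > n: high-dimensional regime
X = np.random.randn(n, d)
beta_true = np.zeros(d)
beta_true[:5] = [3.0, -2.0, 1.5, 2.0, -1.0]   # 5-sparse ground truth
y = X @ beta_true + 0.1 * np.random.randn(n)

lam = 1.0
beta = cp.Variable(d)
lasso = cp.Problem(cp.Minimize(cp.sum_squares(X @ beta - y) + lam * cp.norm(beta, 1)))
lasso.solve()
print(np.sum(np.abs(beta.value) > 1e-4))      # number of (numerically) nonzero coefficients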

Compressive sensing
[Figure: a raw 3 MB image versus a compressed 0.3 MB JPEG version.]

Question
If an image is compressible, can it be acquired efficiently?
9 / 41

Compressive sensing
Compressibility corresponds to sparsity in a suitable representation. Restatement of the above question:

Question
Can we recover a sparse vector $\bar{x} \in \mathbb{R}^n$ from $m \ll n$ linear measurements
$b_k = \langle a_k, \bar{x} \rangle, \ k = 1, \dots, m \ \Leftrightarrow \ b = A\bar{x}$?

Example (group testing)
Suppose only one component of $\bar{x}$ is different from zero. Then $\log_2 n$ measurements or fewer suffice to find $\bar{x}$.
10 / 41

Compressive sensing via linear programming

Possible approach to recover sparse $\bar{x}$
Take $m \ll n$ measurements $b = A\bar{x}$ and solve
$\min_x \; \|x\|_0$ subject to $Ax = b$.
The above is computationally intractable. Use instead
$\min_x \; \|x\|_1$ subject to $Ax = b$.

Theorem (Candès & Tao, 2005)
If $m \gtrsim s \log n$ and $A$ is suitably chosen, then the $\ell_1$-minimization problem recovers $\bar{x}$ with high probability.
11 / 41
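
A minimal cvxpy sketch of the l1-minimization recovery described above, with a random Gaussian measurement matrix; the dimensions and sparsity level are made up, chosen so that recovery typically succeeds, as the theorem suggests:

import cvxpy as cp
import numpy as np

np.random.seed(4)
n, m, s = 200, 60, 5                     # ambient dimension, measurements, sparsity
A = np.random.randn(m, n) / np.sqrt(m)   # random Gaussian measurement matrix
x_bar = np.zeros(n)
x_bar[np.random.choice(n, s, replace=False)] = np.random.randn(s)
b = A @ x_bar                            # m << n linear measurements

x = cp.Variable(n)
l1 = cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == b])
l1.solve()
print(np.linalg.norm(x.value - x_bar))   # typically tiny: exact recovery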

Matrix completion

Problem
Assume $M \in \mathbb{R}^{n \times n}$ has low rank and we observe some entries of $M$. Can we recover $M$?

Possible approach to recover low-rank $M$
Assume we observe the entries in $\Omega \subseteq \{1, \dots, n\} \times \{1, \dots, n\}$. Solve
$\min_X \; \mathrm{rank}(X)$ subject to $X_{ij} = M_{ij}, \ (i,j) \in \Omega$.
Rank minimization is computationally intractable.
12 / 41

Matrix completion via semidefinite programming
Fazel, 2001: Use instead
$\min_X \; \|X\|_*$ subject to $X_{ij} = M_{ij}, \ (i,j) \in \Omega$.
Here $\|\cdot\|_*$ is the nuclear norm: $\|X\|_* := \sum_{i=1}^n \sigma_i(X)$.

Theorem (Candès & Recht, 2010)
Assume $\mathrm{rank}(M) = r$ and $\Omega$ is random with $|\Omega| \ge C \mu r (1+\beta)\, n \log^2 n$. Then the nuclear norm minimization problem recovers $M$ with high probability.
13 / 41
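
A minimal cvxpy sketch of nuclear norm minimization for matrix completion; the rank, matrix size, and sampling rate are made up for illustration:

import cvxpy as cp
import numpy as np

np.random.seed(5)
n, r = 30, 2
M = np.random.randn(n, r) @ np.random.randn(r, n)        # rank-2 ground truth
mask = (np.random.rand(n, n) < 0.5).astype(float)        # observe about half the entries

X = cp.Variable((n, n))
mc = cp.Problem(cp.Minimize(cp.normNuc(X)),
                [cp.multiply(mask, X - M) == 0])         # X_ij = M_ij on observed entries
mc.solve()
print(np.linalg.norm(X.value - M) / np.linalg.norm(M))   # typically a small relative error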

Algorithms 14 / 41

Late 20th century: interior-point methods
To solve
$\min_x \; \langle c, x \rangle$ subject to $x \in Q$,
trace the path $\{x(\mu) : \mu > 0\}$, where $x(\mu)$ minimizes
$F_\mu(x) := \langle c, x \rangle + \mu f(x)$.
Here $f : Q \to \mathbb{R}$ is a suitable barrier function for $Q$.

Some barrier functions
$Q = \{y : \langle a_i, y \rangle - b_i \ge 0, \ i = 1, \dots, n\}$: $\quad f(y) = -\sum_{i=1}^n \log(\langle a_i, y \rangle - b_i)$
$Q = \{y : \sum_{j=1}^m A_j y_j - B \succeq 0\}$: $\quad f(y) = -\log\det\left(\sum_{j=1}^m A_j y_j - B\right)$
15 / 41

Interior-point methods
Recall $F_\mu(x) = \langle c, x \rangle + \mu f(x)$ with $f : Q \to \mathbb{R}$ a barrier function.

Template of interior-point method
pick $\mu_0 > 0$ and $x_0 \approx x(\mu_0)$
for $t = 0, 1, 2, \dots$
  pick $\mu_{t+1} < \mu_t$
  $x_{t+1} := x_t - [\nabla^2 F_{\mu_{t+1}}(x_t)]^{-1} \nabla F_{\mu_{t+1}}(x_t)$
end for
The above can be done so that $x_t \to x^*$, where $x^*$ solves $\min_x \; \langle c, x \rangle$ subject to $x \in Q$.
16 / 41
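
A minimal numpy sketch of this template for linear inequalities, using the log-barrier from the previous slide; the data construction, damping rule, and iteration counts are my own choices for illustration:

import numpy as np

np.random.seed(6)
m, n = 3, 20                                   # variable dimension, number of inequalities
A = np.random.randn(n, m)                      # rows a_i of the constraints <a_i, x> - b_i >= 0
x = np.zeros(m)
b = A @ x - 1.0                                # x = 0 is strictly feasible
c = A.T @ np.ones(n)                           # objective bounded below on Q

def grad_hess(x, mu):
    s = A @ x - b                              # positive slacks inside Q
    g = c - mu * (A.T @ (1.0 / s))             # gradient of F_mu
    H = mu * (A.T @ np.diag(1.0 / s**2) @ A)   # Hessian of F_mu
    return g, H

mu = 1.0
for _ in range(200):
    mu *= 0.9                                  # shrink the barrier parameter
    for _ in range(20):                        # damped Newton steps on F_mu
        g, H = grad_hess(x, mu)
        dx = np.linalg.solve(H, -g)
        lam = np.sqrt(max(-g @ dx, 0.0))       # Newton decrement
        step = 1.0 if lam < 0.25 else 1.0 / (1.0 + lam)
        while np.min(A @ (x + step * dx) - b) <= 0:
            step *= 0.5                        # guard: stay strictly feasible
        x = x + step * dx
print(c @ x)                                   # approaches the optimal value as mu -> 0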

Interior-point methods

Features
Superb theoretical properties.
Numerical performance far better than what the theory states.
Excellent accuracy.
Commercial and open-source implementations.

Limitations
Require a barrier function for the entire constraint set.
Solve a system of equations (Newton's step) at each iteration.
Numerically challenged for very large or dense problems.
Often inadequate for the above applications.
17 / 41

Early 21st century: algorithms with simpler iterations
Trade off the above features against the limitations. In many applications modest accuracy is fine.

Interior-point methods
Need a barrier function for the entire constraint set, second-order information (gradient & Hessian), and the solution of a system of equations at each iteration.

Simpler algorithms
Use less information about the problem. Avoid costly operations.
18 / 41

Convex feasibility problem
Assume $Q \subseteq \mathbb{R}^m$ is a convex set and consider the problem
Find $y \in Q$.
Any convex optimization problem can be recast this way. The difficulty depends on how $Q$ is described. Assume a separation oracle for $Q \subseteq \mathbb{R}^m$ is available.

Separation oracle for $Q$
Given $y \in \mathbb{R}^m$, either verify $y \in Q$ or generate $0 \ne a \in \mathbb{R}^m$ such that
$\langle a, y \rangle < \langle a, v \rangle$ for all $v \in Q$.
19 / 41

Examples

Linear inequalities: $a_i \in \mathbb{R}^m$, $b_i \in \mathbb{R}$, $i = 1, \dots, n$,
$Q = \{y \in \mathbb{R}^m : \langle a_i, y \rangle - b_i \ge 0, \ i = 1, \dots, n\}$.
Oracle: Given $y$, check each $\langle a_i, y \rangle - b_i \ge 0$.

Linear matrix inequalities: $B, A_j \in \mathbb{R}^{n \times n}$, $j = 1, \dots, m$, symmetric,
$Q = \left\{ y \in \mathbb{R}^m : \sum_{j=1}^m A_j y_j - B \succeq 0 \right\}$.
Oracle: Given $y$, check $\sum_{j=1}^m A_j y_j - B \succeq 0$. If this fails, get $u \ne 0$ such that
$\sum_{j=1}^m \langle u, A_j u \rangle y_j < \langle u, B u \rangle \le \sum_{j=1}^m \langle u, A_j u \rangle v_j$ for all $v \in Q$.
20 / 41
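
A minimal numpy sketch of the separation oracle for linear matrix inequalities described above (the function name and the toy example are mine):

import numpy as np

def lmi_separation_oracle(A_list, B, y):
    """Either certify that S(y) = sum_j y_j A_j - B is PSD (return None),
    or return the separating vector a with a_j = <u, A_j u>, which satisfies
    <a, y> < <u, B u> <= <a, v> for every v in Q."""
    S = sum(yj * Aj for yj, Aj in zip(y, A_list)) - B
    w, V = np.linalg.eigh(S)
    if w[0] >= 0:
        return None                            # the LMI holds: y is in Q
    u = V[:, 0]                                # eigenvector of the most negative eigenvalue
    return np.array([u @ Aj @ u for Aj in A_list])

# Toy example: A_j = e_j e_j^T and B = I, so Q = {y : y_j >= 1 for all j}
A_list = [np.outer(np.eye(3)[j], np.eye(3)[j]) for j in range(3)]
print(lmi_separation_oracle(A_list, np.eye(3), np.array([2.0, 2.0, 0.0])))  # ~[0, 0, 1]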

Relaxation method (Agmon, Motzkin-Schoenberg)
Assume $\|a_i\|_2 = 1$, $i = 1, \dots, n$, and consider
$Q = \{y \in \mathbb{R}^m : \langle a_i, y \rangle \ge b_i, \ i = 1, \dots, n\}$.

Relaxation method
$y_0 := 0$; $t := 0$
while there exists $i$ such that $\langle a_i, y_t \rangle < b_i$
  $y_{t+1} := y_t + \lambda (b_i - \langle a_i, y_t \rangle) a_i$
  $t := t + 1$
end

Theorem (Agmon, 1954)
If $Q \ne \emptyset$ and $\lambda \in (0, 2)$ then $y_t \to \bar{y} \in Q$.

Theorem (Motzkin-Schoenberg, 1954)
If $\mathrm{int}(Q) \ne \emptyset$ and $\lambda = 2$ then $y_t \in Q$ for $t$ large enough.
21 / 41
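
A minimal numpy sketch of the relaxation method with λ = 1 (projection onto the most violated halfspace); the data are built so that Q has nonempty interior, and the stopping tolerance is my own addition:

import numpy as np

np.random.seed(7)
m, n = 5, 40
A = np.random.randn(n, m)
A /= np.linalg.norm(A, axis=1, keepdims=True)   # normalize: ||a_i||_2 = 1
y_star = np.random.randn(m)
b = A @ y_star - 0.5                            # Q contains a ball around y_star

lam = 1.0                                       # any lambda in (0, 2) works
y = np.zeros(m)
for t in range(50000):
    slack = A @ y - b
    i = int(np.argmin(slack))
    if slack[i] >= -1e-9:                       # (numerically) inside Q
        break
    y = y + lam * (b[i] - A[i] @ y) * A[i]
print(t, np.min(A @ y - b))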

Perceptron algorithm (Rosenblatt, 1958)
Consider $C = \{y \in \mathbb{R}^m : A^T y > 0\}$, where
$A = [\, a_1 \ \cdots \ a_n \,] \in \mathbb{R}^{m \times n}, \quad \|a_i\|_2 = 1, \ i = 1, \dots, n$.

Perceptron algorithm
$y_0 := 0$; $t := 0$
while there exists $i$ such that $\langle a_i, y_t \rangle \le 0$
  $y_{t+1} := y_t + a_i$
  $t := t + 1$
end
22 / 41
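
A minimal numpy sketch of the perceptron algorithm (picking the most violated index, as in the variant shown later); the data model, with columns in the nonnegative orthant so that A^T y > 0 is feasible, is my own:

import numpy as np

np.random.seed(8)
m, n = 10, 200
A = np.abs(np.random.randn(m, n))              # columns in the nonnegative orthant,
A /= np.linalg.norm(A, axis=0, keepdims=True)  # so y = all-ones satisfies A^T y > 0

y = np.zeros(m)
for t in range(10000):                         # Block-Novikoff: at most 1/tau_C^2 updates
    i = int(np.argmin(A.T @ y))
    if A[:, i] @ y > 0:                        # every <a_i, y> > 0: y is in C
        break
    y = y + A[:, i]                            # perceptron update
print(t, np.min(A.T @ y))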

Cone width
Assume $C \subseteq \mathbb{R}^m$ is a convex cone. The width of $C$ is
$\tau_C := \sup_{\|y\|_2 = 1} \{ r : B_2(y, r) \subseteq C \}$.
Observe: $\tau_C > 0$ if and only if $\mathrm{int}(C) \ne \emptyset$.

Theorem (Block, Novikoff 1962)
Assume $C = \{y \in \mathbb{R}^m : A^T y > 0\}$. Then the perceptron algorithm finds $y \in C$ in at most $\frac{1}{\tau_C^2}$ iterations.
23 / 41

General perceptron algorithm
The perceptron algorithm and the above convergence rate hold for a general convex cone $C$ provided a separation oracle is available.

Notation: $S^{m-1} := \{v \in \mathbb{R}^m : \|v\|_2 = 1\}$.

Perceptron algorithm (general case)
$y_0 := 0$; $t := 0$
while $y_t \notin C$
  let $a \in S^{m-1}$ be such that $\langle a, y_t \rangle \le 0 < \langle a, v \rangle$ for all $v \in C$
  $y_{t+1} := y_t + a$
  $t := t + 1$
end
24 / 41

Rescaled perceptron algorithm (Soheili-P 2013)

Key idea
If $C \subseteq \mathbb{R}^m$ is a convex cone and $a \in S^{m-1}$ is such that
$C \subseteq \left\{ y \in \mathbb{R}^m : 0 \le \langle a, y \rangle \le \frac{1}{6m} \|y\|_2 \right\}$,
then dilate space along $a$ to get the wider cone $\hat{C} := (I + a a^T) C$.

Lemma
If $C, a, \hat{C}$ are as above then $\mathrm{vol}(\hat{C} \cap S^{m-1}) \ge 1.5 \, \mathrm{vol}(C \cap S^{m-1})$.

Lemma
If $C$ is a convex cone then $\mathrm{vol}(C \cap S^{m-1}) \ge \left( \frac{\tau_C}{\sqrt{1 + \tau_C^2}} \right)^{m-1} \mathrm{vol}(S^{m-1})$.
25 / 41

Rescaled perceptron algorithm (Soheili-P 2013)
Assume a separation oracle for $C$ is available.

Rescaled perceptron
(1) Run the perceptron for $C$ up to $6m^4$ steps.
(2) Identify $a \in S^{m-1}$ such that $C \subseteq \left\{ y \in \mathbb{R}^m : 0 \le \langle a, y \rangle \le \frac{1}{6m} \|y\|_2 \right\}$.
(3) Rescale: $C := (I + a a^T) C$; and go back to (1).

Theorem (Soheili-P 2013)
Assume $\mathrm{int}(C) \ne \emptyset$. The above rescaled perceptron algorithm finds $y \in C$ in at most $O\!\left( m^5 \log \frac{1}{\tau_C} \right)$ perceptron steps.
Recall: the plain perceptron stops after $\frac{1}{\tau_C^2}$ steps.
26 / 41

Perceptron algorithm again
Consider again $A^T y > 0$ where $\|a_i\|_2 = 1$, $i = 1, \dots, n$.

Perceptron algorithm (slight variant)
$y_0 := 0$
for $t = 0, 1, \dots$
  $a_i := \mathrm{argmin}\{ \langle a_j, y_t \rangle : j = 1, \dots, n \}$
  $y_{t+1} := y_t + a_i$
end

Let $x(y) := \mathrm{argmin}_{x \in \Delta_n} \langle A^T y, x \rangle$, where $\Delta_n := \{x \in \mathbb{R}^n : x \ge 0, \ \|x\|_1 = 1\}$.

Normalized perceptron algorithm
$y_0 := 0$
for $t = 0, 1, \dots$
  $y_{t+1} := \left(1 - \frac{1}{t+1}\right) y_t + \frac{1}{t+1} A x(y_t)$
end
27 / 41

Smooth perceptron (Soheili-P 2011)

Key idea
Use a smooth version of $x(y) = \mathrm{argmin}_{x \in \Delta_n} \langle A^T y, x \rangle$, namely
$x_\mu(y) := \frac{\exp(-A^T y / \mu)}{\|\exp(-A^T y / \mu)\|_1}$
for some $\mu > 0$.
28 / 41

Smooth perceptron algorithm
Let $\theta_t := \frac{2}{t+2}$, $\mu_t := \frac{4}{(t+1)(t+2)}$, $t = 0, 1, \dots$

Smooth perceptron algorithm
$y_0 := \frac{1}{n} A \mathbf{1}$; $x_0 := x_{\mu_0}(y_0)$
for $t = 0, 1, \dots$
  $y_{t+1} := (1 - \theta_t)(y_t + \theta_t A x_t) + \theta_t^2 A x_{\mu_t}(y_t)$
  $x_{t+1} := (1 - \theta_t) x_t + \theta_t x_{\mu_{t+1}}(y_{t+1})$
end for

Recall the main loop in the normalized version:
for $t = 0, 1, \dots$
  $y_{t+1} := \left(1 - \frac{1}{t+1}\right) y_t + \frac{1}{t+1} A x(y_t)$
end for
29 / 41
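
A minimal numpy sketch of the smooth perceptron updates above, reusing the data model from the earlier perceptron sketch; the stopping test and the shift inside the softmax (for numerical stability) are my own additions:

import numpy as np

np.random.seed(9)
m, n = 10, 200
A = np.abs(np.random.randn(m, n))               # A^T y > 0 is feasible (e.g. y = all-ones)
A /= np.linalg.norm(A, axis=0, keepdims=True)

def x_mu(y, mu):
    s = A.T @ y
    z = np.exp(-(s - s.min()) / mu)             # softmax of -A^T y / mu, shifted for stability
    return z / z.sum()

y = A @ np.ones(n) / n                          # y_0 = (1/n) A 1
x = x_mu(y, 4.0 / (1 * 2))                      # x_0 = x_{mu_0}(y_0)
for t in range(10000):
    if np.min(A.T @ y) > 0:                     # y is in C = {y : A^T y > 0}
        break
    theta = 2.0 / (t + 2)
    mu_t = 4.0 / ((t + 1) * (t + 2))
    mu_next = 4.0 / ((t + 2) * (t + 3))
    y = (1 - theta) * (y + theta * (A @ x)) + theta**2 * (A @ x_mu(y, mu_t))
    x = (1 - theta) * x + theta * x_mu(y, mu_next)
print(t, np.min(A.T @ y))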

Theorem (Soheili & P, 2011)
Assume $C = \{y \in \mathbb{R}^m : A^T y > 0\}$. Then the above smooth perceptron algorithm finds $y \in C$ in at most
$\frac{2\sqrt{2 \log n}}{\tau_C}$
elementary iterations.

Remarks
The smooth version retains the algorithm's original simplicity.
Improvement on the perceptron iteration bound $\frac{1}{\tau_C^2}$.
Very weak dependence on $n$.
30 / 41

Binary classification again
Classification data $D = \{(u_1, l_1), \dots, (u_n, l_n)\}$, with $u_i \in \mathbb{R}^d$, $l_i \in \{-1, 1\}$.

Linear classification
Find $\beta \in \mathbb{R}^d$ such that for $i = 1, \dots, n$
$\mathrm{sgn}(\beta^T u_i) = l_i \ \Leftrightarrow \ l_i u_i^T \beta > 0$.

Taking $A = [\, l_1 u_1 \ \cdots \ l_n u_n \,]$ and $y = \beta$, we can rephrase this as $A^T y > 0$.
31 / 41

Kernels and Reproducing Kernel Hilbert Spaces
Assume $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a symmetric positive definite kernel:
$x_1, \dots, x_m \in \mathbb{R}^d \ \Rightarrow \ [K(x_i, x_j)]_{ij} \succeq 0$.

Reproducing Kernel Hilbert Space
$\mathcal{F}_K := \left\{ f(\cdot) = \sum_{i \ge 1} \beta_i K(\cdot, z_i) : \ \beta_i \in \mathbb{R}, \ z_i \in \mathbb{R}^d, \ \|f\|_K < \infty \right\}$.

Feature mapping
$\varphi : \mathbb{R}^d \to \mathcal{F}_K$, $\quad u \mapsto K(\cdot, u)$.
For $f \in \mathcal{F}_K$ and $u \in \mathbb{R}^d$ we have $f(u) = \langle f, \varphi(u) \rangle_K$.
32 / 41
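
A small numpy illustration of the positive definiteness condition: for a symmetric positive definite kernel, every Gram matrix [K(x_i, x_j)]_ij is positive semidefinite. The Gaussian (RBF) kernel and bandwidth below are my own example:

import numpy as np

np.random.seed(10)
m, d = 20, 3
x = np.random.randn(m, d)

def rbf_kernel(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

G = np.array([[rbf_kernel(x[i], x[j]) for j in range(m)] for i in range(m)])
print(np.min(np.linalg.eigvalsh(G)) >= -1e-10)   # True: all eigenvalues >= 0 (up to roundoff)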

Kernelized classification

Nonlinear kernelized classification
Find $f \in \mathcal{F}_K$ such that for $i = 1, \dots, n$
$\mathrm{sgn}(f(u_i)) = l_i \ \Leftrightarrow \ l_i f(u_i) > 0$.

Separation margin
Assume $D = \{(u_1, l_1), \dots, (u_n, l_n)\}$ and $K$ are given. Define the margin $\rho_K$ as
$\rho_K := \sup_{\|f\|_K = 1} \min_{i = 1, \dots, n} l_i f(u_i)$.
33 / 41

Kernelized smooth perceptron

Theorem (Ramdas & P 2014)
Assume $\rho_K > 0$.
(a) The kernelized version of the smooth perceptron finds a nonlinear separator after at most $\frac{2\sqrt{2 \log n}}{\rho_K}$ iterations.
(b) The kernelized smooth perceptron generates $f_t \in \mathcal{F}_K$ such that
$\|f_t - \bar{f}\|_K \le \frac{2\sqrt{2 \log n}}{\rho_K \, t}$,
where $\bar{f} \in \mathcal{F}_K$ is the separator with best margin.
34 / 41

Open problems 35 / 41

Smale's 9th problem
Is there a polynomial-time algorithm over the real numbers which decides the feasibility of the linear system of inequalities $Ax \ge b$?

Related work
Tardos, 1986: A polynomial algorithm for combinatorial linear programs.
Renegar, Freund, Cucker, P (2000s): Algorithms that are polynomial in the problem dimension and the condition number $C(A, b)$.
Ye, 2005: A polynomial interior-point algorithm for the Markov Decision Problem with fixed discount rate.
Ye, 2011: The simplex method is polynomial for the Markov Decision Problem with fixed discount rate.
36 / 41

Hirsch conjecture
A polyhedron is a set of the form $\{y \in \mathbb{R}^m : \langle a_i, y \rangle - b_i \ge 0, \ i = 1, \dots, n\}$.
A face of a polyhedron is a non-empty intersection with a non-cutting hyperplane.
Vertices: zero-dimensional faces.
Edges: one-dimensional faces.
Facets: highest-dimensional faces.

Observation
The vertices and edges of a polyhedron form a graph.
37 / 41

Hirsch conjecture

Conjecture (Hirsch, 1957)
For every polyhedron $P$ with $n$ facets and dimension $d$,
$\mathrm{diam}(P) \le n - d$.

Related work
Klee and Walkup, 1967: Unbounded counterexample.
Santos, 2012: First bounded counterexample.
Todd, 2014: $\mathrm{diam}(P) \le (n - d)^{\log_2 d}$.

Questions
True for special classes of bounded polyhedra?
Small bound (e.g., linear in $n, d$) on $\mathrm{diam}(P)$?
38 / 41

Lax conjecture

Definition
A homogeneous polynomial $p \in \mathbb{R}[x]$ is hyperbolic if there exists $e \in \mathbb{R}^n$ such that for every $x \in \mathbb{R}^n$ the roots of
$t \mapsto p(x + te)$
are real.

Theorem (Gårding, 1959)
Assume $p$ is hyperbolic. Then each connected component of $\{x \in \mathbb{R}^n : p(x) > 0\}$ is an open convex cone.

Hyperbolicity cone: a connected component of $\{x \in \mathbb{R}^n : p(x) > 0\}$ for some hyperbolic polynomial $p$.
39 / 41

Lax conjecture

Question
Can every hyperbolicity cone be described in terms of linear matrix inequalities,
$\left\{ y \in \mathbb{R}^m : \sum_{j=1}^m A_j y_j \succeq 0 \right\}$?

Related work
Helton and Vinnikov, 2007: Every hyperbolicity cone in $\mathbb{R}^3$ is of the form $\{x \in \mathbb{R}^3 : I x_1 + A_2 x_2 + A_3 x_3 \succeq 0\}$ for some symmetric matrices $A_2, A_3$ (Lax conjecture, 1958).
Brändén, 2011: Disproved some versions of this conjecture for more general hyperbolicity cones in $\mathbb{R}^n$.
40 / 41

Slides and references jpena@cmu.edu http://www.andrew.cmu.edu/user/jfp/ 41 / 41