Convex Optimization

Lieven Vandenberghe, Electrical Engineering Department, UCLA
Joint work with Stephen Boyd, Stanford University

Ph.D. School in Optimization in Computer Vision, DTU, May 19, 2008
Introduction
Mathematical optimization

    minimize   f_0(x)
    subject to f_i(x) ≤ 0,  i = 1,...,m

- x = (x_1,...,x_n): optimization variables
- f_0: R^n → R: objective function
- f_i: R^n → R, i = 1,...,m: constraint functions
Solving optimization problems

General optimization problem
- can be extremely difficult
- methods involve a compromise: long computation times, or settling for local optimality

Exceptions: certain problem classes can be solved efficiently and reliably
- linear least-squares problems
- linear programming problems
- convex optimization problems
Least-squares

    minimize ∥Ax − b∥₂²

- analytical solution: x⋆ = (A^T A)⁻¹ A^T b
- reliable and efficient algorithms and software
- computation time proportional to n²p (for A ∈ R^{p×n}); less if structured
- a widely used technology

Using least-squares
- least-squares problems are easy to recognize
- standard techniques increase flexibility (weights, regularization, ...)
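The analytical solution on the slide can be checked directly with NumPy; this is an illustrative sketch (random data, names invented here), comparing the normal-equations formula x⋆ = (A^T A)⁻¹A^T b with the more numerically stable library solver.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 10
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # analytical formula
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # SVD-based solver

# both give the least-squares solution, whose residual is
# orthogonal to the columns of A
assert np.allclose(x_normal, x_lstsq)
assert np.allclose(A.T @ (A @ x_lstsq - b), 0, atol=1e-8)
```

In practice `lstsq` (or a QR factorization) is preferred over forming A^T A, which squares the condition number.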
Linear programming

    minimize   c^T x
    subject to a_i^T x ≤ b_i,  i = 1,...,m

- no analytical formula for the solution; extensive theory
- reliable and efficient algorithms and software
- computation time proportional to n²m if m ≥ n; less with structure
- a widely used technology

Using linear programming
- not as easy to recognize as least-squares problems
- a few standard tricks used to convert problems into linear programs (e.g., problems involving ℓ_1- or ℓ_∞-norms, piecewise-linear functions)
Convex optimization problem

    minimize   f_0(x)
    subject to f_i(x) ≤ 0,  i = 1,...,m

objective and constraint functions are convex:

    f_i(θx + (1−θ)y) ≤ θ f_i(x) + (1−θ) f_i(y)   for all x, y, 0 ≤ θ ≤ 1

includes least-squares problems and linear programs as special cases
Solving convex optimization problems
- no analytical solution
- reliable and efficient algorithms
- computation time (roughly) proportional to max{n³, n²m, F}, where F is the cost of evaluating the f_i's and their first and second derivatives
- almost a technology

Using convex optimization
- often difficult to recognize
- many tricks for transforming problems into convex form
- surprisingly many problems can be solved via convex optimization
History
- 1940s: linear programming

      minimize   c^T x
      subject to a_i^T x ≤ b_i,  i = 1,...,m

- 1950s: quadratic programming
- 1960s: geometric programming
- 1990s: semidefinite programming, second-order cone programming, quadratically constrained quadratic programming, robust optimization, sum-of-squares programming, ...
New applications since 1990
- linear matrix inequality techniques in control
- circuit design via geometric programming
- support vector machine learning via quadratic programming
- semidefinite programming relaxations in combinatorial optimization
- applications in structural optimization, statistics, signal processing, communications, image processing, quantum information theory, finance, ...
Interior-point methods

Linear programming
- 1984 (Karmarkar): first practical polynomial-time algorithm
- 1984–1990: efficient implementations for large-scale LPs

Nonlinear convex optimization
- around 1990 (Nesterov & Nemirovski): polynomial-time interior-point methods for nonlinear convex programming
- since 1990: extensions and high-quality software packages
Traditional and new view of convex optimization

Traditional: special case of nonlinear programming, with interesting theory

New: extension of LP, as tractable but substantially more general; reflected in the cone programming notation

    minimize   c^T x
    subject to Ax ⪯_K b

where ⪯_K is the inequality with respect to a non-polyhedral convex cone K
Outline
- Convex sets and functions
- Modeling systems
- Cone programming
- Robust optimization
- Semidefinite relaxations
- ℓ_1-norm sparsity heuristics
- Interior-point algorithms
Convex Sets and Functions
Convex sets

contains the line segment between any two points in the set:

    x_1, x_2 ∈ C,  0 ≤ θ ≤ 1  ⟹  θx_1 + (1−θ)x_2 ∈ C

(figure: one convex and two nonconvex sets)
Examples and properties
- solution set of linear equations
- solution set of linear inequalities
- norm balls {x | ∥x∥ ≤ R} and norm cones {(x, t) | ∥x∥ ≤ t}
- set of positive semidefinite matrices
- image of a convex set under a linear transformation is convex
- inverse image of a convex set under a linear transformation is convex
- intersection of convex sets is convex
Convex functions

domain dom f is a convex set and

    f(θx + (1−θ)y) ≤ θ f(x) + (1−θ) f(y)   for all x, y ∈ dom f, 0 ≤ θ ≤ 1

(figure: the chord between (x, f(x)) and (y, f(y)) lies above the graph)

f is concave if −f is convex
Examples
- exp x, −log x, x log x are convex
- x^α is convex for x > 0 and α ≥ 1 or α ≤ 0; |x|^α is convex for α ≥ 1
- quadratic-over-linear function x^T x / t is convex in (x, t) for t > 0
- geometric mean (x_1 x_2 ⋯ x_n)^{1/n} is concave for x ⪰ 0
- log det X is concave on the set of positive definite matrices
- log(e^{x_1} + ⋯ + e^{x_n}) is convex
- linear and affine functions are convex and concave
- norms are convex
Operations that preserve convexity

Pointwise maximum: if f(x, y) is convex in x for fixed y, then

    g(x) = sup_{y ∈ A} f(x, y)

is convex in x

Composition rules: if h is convex and increasing and g is convex, then h(g(x)) is convex

Perspective: if f(x) is convex, then t f(x/t) is convex in (x, t) for t > 0
Example

m lamps illuminating n (small, flat) patches; intensity I_k at patch k depends linearly on the lamp powers p_j:

    I_k = a_k^T p

(figure: lamp power p_j, angle θ_kj, distance r_kj, illumination I_k)

Problem: achieve desired illumination I_k ≈ 1 with bounded lamp powers

    minimize   max_{k=1,...,n} |log(a_k^T p)|
    subject to 0 ≤ p_j ≤ p_max,  j = 1,...,m
Convex formulation: problem is equivalent to

    minimize   max_{k=1,...,n} max{a_k^T p, 1/(a_k^T p)}
    subject to 0 ≤ p_j ≤ p_max,  j = 1,...,m

(figure: graph of max{u, 1/u})

cost function is convex because a maximum of convex functions is convex
Quasiconvex functions

domain dom f is convex and the sublevel sets

    S_α = {x ∈ dom f | f(x) ≤ α}

are convex for all α

(figure: sublevel sets of a quasiconvex function at levels α and β)

f is quasiconcave if −f is quasiconvex
Examples
- √|x| is quasiconvex on R
- ceil(x) = inf{z ∈ Z | z ≥ x} is quasiconvex and quasiconcave
- log x is quasiconvex and quasiconcave on R_++
- f(x_1, x_2) = x_1 x_2 is quasiconcave on R²_++
- linear-fractional function

      f(x) = (a^T x + b) / (c^T x + d),   dom f = {x | c^T x + d > 0}

  is quasiconvex and quasiconcave
- distance ratio

      f(x) = ∥x − a∥_2 / ∥x − b∥_2,   dom f = {x | ∥x − a∥_2 ≤ ∥x − b∥_2}

  is quasiconvex
Quasiconvex optimization

Example

    minimize   p(x)/q(x)
    subject to Ax ⪯ b

p convex, q concave, and p(x) ≥ 0, q(x) > 0

Equivalent formulation (variables x, t)

    minimize   t
    subject to p(x) − t q(x) ≤ 0
               Ax ⪯ b

- for fixed t, the constraint is a convex feasibility problem
- can determine the optimal t via bisection
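The bisection idea can be sketched in a few lines. This toy (the functions p, q, interval, and grid are all illustrative choices, not from the slides) uses a dense grid as a stand-in for a real convex feasibility solver: for each fixed t it checks whether p(x) − t·q(x) ≤ 0 somewhere on the feasible set.

```python
import numpy as np

p = lambda x: (x - 3.0) ** 2 + 1.0   # convex, nonnegative
q = lambda x: 5.0 - 0.5 * x          # concave, positive on [0, 4]
xs = np.linspace(0.0, 4.0, 2001)     # feasible set {x | 0 <= x <= 4}

lo, hi = 0.0, p(xs).max() / q(xs).min()   # bracket for the optimal value t*
for _ in range(60):
    t = 0.5 * (lo + hi)
    # convex feasibility subproblem: does p(x) - t q(x) <= 0 have a solution?
    feasible = np.any(p(xs) - t * q(xs) <= 0)
    lo, hi = (lo, t) if feasible else (t, hi)

t_star = hi
assert abs(t_star - (p(xs) / q(xs)).min()) < 1e-6
```

Each bisection step halves the interval containing t*, so 60 steps locate it to essentially machine precision.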
Modeling Systems
Convex optimization modeling systems
- allow simple specification of convex problems in natural form
  - declare optimization variables
  - form affine, convex, concave expressions
  - specify objective and constraints
- automatically transform the problem to canonical form, call a solver, transform back
- built using object-oriented methods and/or compiler-compilers
Example

    minimize −Σ_{i=1}^m w_i log(b_i − a_i^T x)

variable x ∈ R^n; parameters a_i, b_i, w_i > 0 are given

Specification in CVX (Grant, Boyd & Ye)

    cvx_begin
        variable x(n)
        minimize( -w'*log(b-A*x) )
    cvx_end
Example

    minimize   ∥Ax − b∥_2 + λ∥x∥_1
    subject to Fx ⪯ g + (Σ_i x_i) h

variable x ∈ R^n; parameters A, b, F, g, h given

CVX specification

    cvx_begin
        variable x(n)
        minimize( norm(A*x-b,2) + lambda*norm(x,1) )
        subject to
            F*x <= g + sum(x)*h
    cvx_end
Illumination problem

    minimize   max_{k=1,...,n} max{a_k^T x, 1/(a_k^T x)}
    subject to 0 ⪯ x ⪯ 1

variable x ∈ R^m; parameters a_k given (and nonnegative)

CVX specification

    cvx_begin
        variable x(m)
        minimize( max( [ A*x; inv_pos(A*x) ] ) )
        subject to
            x >= 0
            x <= 1
    cvx_end
History
- general purpose optimization modeling systems: AMPL, GAMS (1970s)
- systems for SDPs/LMIs (1990s): SDPSOL (Wu, Boyd), LMILAB (Gahinet, Nemirovski), LMITOOL (El Ghaoui)
- YALMIP (Löfberg 2000)
- automated convexity checking (Crusius PhD thesis 2002)
- disciplined convex programming (DCP) (Grant, Boyd, Ye 2004)
- CVX (Grant, Boyd, Ye 2005)
- CVXOPT (Dahl, Vandenberghe 2005)
- GGPLAB (Mutapcic, Koh, et al. 2006)
- CVXMOD (Mattingley 2007)
Cone Programming
Linear programming

    minimize   c^T x
    subject to Ax ⪯ b

⪯ is elementwise inequality between vectors

(figure: polyhedral feasible set, objective direction c, and optimal point x⋆)
Linear discrimination

separate two sets of points {x_1,...,x_N}, {y_1,...,y_M} by a hyperplane:

    a^T x_i + b > 0,  i = 1,...,N
    a^T y_i + b < 0,  i = 1,...,M

homogeneous in a, b, hence equivalent to the linear inequalities (in a, b)

    a^T x_i + b ≥ 1,  i = 1,...,N,      a^T y_i + b ≤ −1,  i = 1,...,M
Approximate linear separation of non-separable sets

    minimize Σ_{i=1}^N max{0, 1 − a^T x_i − b} + Σ_{i=1}^M max{0, 1 + a^T y_i + b}

can be interpreted as a heuristic for minimizing the number of misclassified points
Linear programming formulation

    minimize Σ_{i=1}^N max{0, 1 − a^T x_i − b} + Σ_{i=1}^M max{0, 1 + a^T y_i + b}

Equivalent LP

    minimize   Σ_{i=1}^N u_i + Σ_{i=1}^M v_i
    subject to u_i ≥ 1 − a^T x_i − b,  i = 1,...,N
               v_i ≥ 1 + a^T y_i + b,  i = 1,...,M
               u ⪰ 0,  v ⪰ 0

variables a, b, u ∈ R^N, v ∈ R^M
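The hinge-loss objective above is convex, so any convex solver applies; as an illustrative sketch (random clouds, step sizes and iteration count are ad-hoc choices, not from the slides), a plain subgradient method minimizes the same objective the LP formulation solves.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(+1.5, 1.0, size=(40, 2))   # points x_i: want a^T x + b >= 1
Y = rng.normal(-1.5, 1.0, size=(40, 2))   # points y_i: want a^T y + b <= -1

def hinge_obj(a, b):
    return (np.maximum(0, 1 - X @ a - b).sum()
            + np.maximum(0, 1 + Y @ a + b).sum())

a, b = np.zeros(2), 0.0
for k in range(2000):
    gx = (1 - X @ a - b) > 0                   # active hinge terms for X
    gy = (1 + Y @ a + b) > 0                   # active hinge terms for Y
    ga = -X[gx].sum(axis=0) + Y[gy].sum(axis=0)  # subgradient w.r.t. a
    gb = -gx.sum() + gy.sum()                    # subgradient w.r.t. b
    step = 0.1 / (1 + k)                         # diminishing step size
    a -= step * ga
    b -= step * gb

# the fitted hyperplane should misclassify few points
err = (X @ a + b <= 0).sum() + (Y @ a + b >= 0).sum()
assert err <= 8
```

For serious use one would call an LP solver on the equivalent formulation; the point here is only that the objective is easy to evaluate and subdifferentiate.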
Cone programming

    minimize   c^T x
    subject to Ax ⪯_K b

- y ⪯_K z means z − y ∈ K, where K is a proper convex cone
- extends linear programming (K = R^m_+) to nonpolyhedral cones
- (duality) theory and algorithms very similar to linear programming
Second-order cone programming

Second-order cone

    C_{m+1} = {(x, t) ∈ R^m × R | ∥x∥ ≤ t}

(figure: boundary of the second-order cone in R³)

Second-order cone program

    minimize   f^T x
    subject to ∥A_i x + b_i∥_2 ≤ c_i^T x + d_i,  i = 1,...,m
               Fx = g

inequality constraints require (A_i x + b_i, c_i^T x + d_i) ∈ C_{m_i+1}
Linear program with chance constraints

    minimize   c^T x
    subject to prob(a_i^T x ≤ b_i) ≥ η,  i = 1,...,m

a_i is Gaussian with mean ā_i, covariance Σ_i, and η ≥ 1/2

Equivalent SOCP

    minimize   c^T x
    subject to ā_i^T x + Φ⁻¹(η) ∥Σ_i^{1/2} x∥_2 ≤ b_i,  i = 1,...,m

Φ(x) is the zero-mean unit-variance Gaussian CDF
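The reformulation rests on a^T x being Gaussian with mean ā^T x and variance x^T Σ x, so the chance constraint holds with equality when b = ā^T x + Φ⁻¹(η)·∥Σ^{1/2}x∥_2. A Monte Carlo sanity check on toy data (all values illustrative):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n, eta = 4, 0.9
abar = rng.standard_normal(n)
L = 0.3 * rng.standard_normal((n, n))
Sigma = L @ L.T                       # covariance of the random vector a
x = rng.standard_normal(n)

# tightest b for which the SOCP constraint holds with equality
b = abar @ x + NormalDist().inv_cdf(eta) * np.sqrt(x @ Sigma @ x)

samples = rng.multivariate_normal(abar, Sigma, size=200_000)
emp = np.mean(samples @ x <= b)       # empirical prob(a^T x <= b)
assert abs(emp - eta) < 0.01
```

`statistics.NormalDist().inv_cdf` supplies Φ⁻¹ from the standard library, so no SciPy dependency is needed.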
Semidefinite programming

Positive semidefinite cone

    S^m_+ = {X ∈ S^m | X ⪰ 0}

(figure: boundary of the positive semidefinite cone for 2×2 matrices)

Semidefinite programming

    minimize   c^T x
    subject to x_1 A_1 + ⋯ + x_n A_n ⪯ B

constraint requires B − x_1 A_1 − ⋯ − x_n A_n ∈ S^m_+
Eigenvalue minimization

    minimize λ_max(A(x))

where A(x) = A_0 + x_1 A_1 + ⋯ + x_n A_n (with given A_i ∈ S^k)

Equivalent SDP

    minimize   t
    subject to A(x) ⪯ tI

- variables x ∈ R^n, t ∈ R
- follows from λ_max(A) ≤ t ⟺ A ⪯ tI
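The equivalence λ_max(A) ≤ t ⟺ A ⪯ tI is easy to verify numerically (random symmetric matrix, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                         # random symmetric matrix
lam_max = np.linalg.eigvalsh(A)[-1]       # eigvalsh sorts ascending

def is_psd(S, tol=1e-9):
    return np.linalg.eigvalsh(S).min() >= -tol

assert is_psd(lam_max * np.eye(5) - A)              # t = lambda_max works
assert not is_psd((lam_max - 0.1) * np.eye(5) - A)  # any smaller t fails
```

This is exactly the constraint an SDP solver would impose when minimizing t subject to A(x) ⪯ tI.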
Matrix norm minimization

    minimize ∥A(x)∥_2 = (λ_max(A(x)^T A(x)))^{1/2}

where A(x) = A_0 + x_1 A_1 + ⋯ + x_n A_n (with given A_i ∈ R^{p×q})

Equivalent SDP

    minimize   t
    subject to [ tI      A(x) ]
               [ A(x)^T  tI   ] ⪰ 0

- variables x ∈ R^n, t ∈ R
- constraint follows from

      ∥A∥_2 ≤ t  ⟺  A^T A ⪯ t²I, t ≥ 0  ⟺  [ tI  A ; A^T  tI ] ⪰ 0
Chebyshev inequalities

Classical inequality: if X is a random variable with E X = 0, E X² = σ², then

    prob(|X| ≥ 1) ≤ σ²

Generalized inequality: sharp lower bounds on prob(X ∈ C)
- X ∈ R^n is a random variable with known moments E X = a, E XX^T = S
- C ⊆ R^n is defined by quadratic inequalities

      C = {x | x^T A_i x + 2b_i^T x + c_i < 0,  i = 1,...,m}
Equivalent SDP

    maximize   1 − tr(SP) − 2a^T q − r
    subject to [ P    q     ]        [ A_i    b_i ]
               [ q^T  r − 1 ]  ⪰ τ_i [ b_i^T  c_i ],   i = 1,...,m
               τ_i ≥ 0,  i = 1,...,m
               [ P    q ]
               [ q^T  r ] ⪰ 0

- an SDP with variables P ∈ S^n, q ∈ R^n, and scalars r, τ_i
- optimal value is a tight lower bound on prob(X ∈ C)
- solution provides a distribution that achieves the lower bound
Example

(figure: set C and point a = E X; dashed line shows {x | (x − a)^T (S − aa^T)⁻¹ (x − a) = 1})

- lower bound on prob(X ∈ C) is 0.3992
- achieved by the distribution shown in red
Detection example

    x = s + v

- x ∈ R^n: received signal
- s: transmitted signal, s ∈ {s_1, s_2,..., s_N} (one of N possible symbols)
- v: noise with E v = 0, E vv^T = σ²I

Detection problem: given the observed value of x, estimate s
Example (N = 7): bound on probability of correct detection of s_1 is 0.205

(figure: constellation points s_1,...,s_7; dots show a distribution with probability of correct detection 0.205)
Duality

Cone program

    minimize   c^T x
    subject to Ax ⪯_K b

Dual cone program

    maximize   −b^T z
    subject to A^T z + c = 0
               z ⪰_{K*} 0

- K* is the dual cone: K* = {z | z^T x ≥ 0 for all x ∈ K}
- nonnegative orthant, second-order cone, PSD cone are self-dual: K = K*

Properties: optimal values are equal (if primal or dual is strictly feasible)
Robust Optimization
Robust optimization

(worst-case) robust convex optimization problem

    minimize   sup_{θ∈A} f_0(x, θ)
    subject to sup_{θ∈A} f_i(x, θ) ≤ 0,  i = 1,...,m

- x is the optimization variable; θ is an unknown parameter
- f_i convex in x for fixed θ
- tractability depends on A

(Ben-Tal, Nemirovski, El Ghaoui, Bertsimas, ...)
Robust linear programming

    minimize   c^T x
    subject to a_i^T x ≤ b_i  for all a_i ∈ A_i,  i = 1,...,m

coefficients unknown but contained in ellipsoids A_i:

    A_i = {ā_i + P_i u | ∥u∥_2 ≤ 1}    (ā_i ∈ R^n, P_i ∈ R^{n×n})

center is ā_i; semi-axes determined by singular values/vectors of P_i

Equivalent SOCP

    minimize   c^T x
    subject to ā_i^T x + ∥P_i^T x∥_2 ≤ b_i,  i = 1,...,m
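The SOCP follows from the closed form of the worst case: over a = ā + Pu with ∥u∥_2 ≤ 1, the maximum of a^T x is ā^T x + ∥P^T x∥_2, attained at u = P^T x / ∥P^T x∥_2. A sampled check on toy data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
abar = rng.standard_normal(n)
P = rng.standard_normal((n, n))
x = rng.standard_normal(n)

worst_closed_form = abar @ x + np.linalg.norm(P.T @ x)

# sample many u with ||u||_2 <= 1 and evaluate a^T x for a = abar + P u
U = rng.standard_normal((n, 50_000))
U /= np.maximum(1.0, np.linalg.norm(U, axis=0))
worst_sampled = ((abar[:, None] + P @ U).T @ x).max()

assert worst_sampled <= worst_closed_form + 1e-9          # never exceeded
assert worst_closed_form - worst_sampled < 0.01 * np.linalg.norm(P.T @ x)
```

Sampling can only approach the closed-form worst case from below, which is what the two assertions express.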
Robust least-squares

    minimize sup_{∥u∥_2 ≤ 1} ∥(A_0 + u_1 A_1 + ⋯ + u_p A_p)x − b∥_2

coefficient matrix lies in an ellipsoid; choose x to minimize the worst-case residual norm

Equivalent SDP

    minimize   t_1 + t_2
    subject to [ I              P(x)    A_0 x − b ]
               [ P(x)^T         t_1 I   0         ]
               [ (A_0 x − b)^T  0       t_2       ] ⪰ 0

where P(x) = [ A_1 x  A_2 x  ⋯  A_p x ]
Example (p = 2, u uniformly distributed in unit disk)

(figure: histograms of the residual r(u) = ∥A(u)x − b∥_2 for x_ls, x_tik, and x_rls; x_tik minimizes ∥A_0 x − b∥₂² + ∥x∥₂²)
Semidefinite Relaxations
Relaxation and randomization

convex optimization is increasingly used
- to find good bounds for hard (i.e., nonconvex) problems, via relaxation
- as a heuristic for finding suboptimal points, often via randomization
Boolean least-squares

    minimize   ∥Ax − b∥₂²
    subject to x_i² = 1,  i = 1,...,n

- a basic problem in digital communications
- non-convex, very hard to solve exactly

Equivalent formulation

    minimize   tr(A^T A Z) − 2b^T Az + b^T b
    subject to Z_ii = 1,  i = 1,...,n
               Z = zz^T

follows from ∥Az − b∥₂² = tr(A^T A Z) − 2b^T Az + b^T b if Z = zz^T
Semidefinite relaxation

replace the constraint Z = zz^T with Z ⪰ zz^T; gives an SDP with variables Z, z

    minimize   tr(A^T A Z) − 2b^T Az + b^T b
    subject to Z_ii = 1,  i = 1,...,n
               [ Z    z ]
               [ z^T  1 ] ⪰ 0

- optimal value is a lower bound for the Boolean LS optimal value
- rounding Z, z gives a suboptimal solution for Boolean LS

Randomized rounding
- generate vectors from N(z, Z − zz^T)
- round components to ±1
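Solving the SDP itself needs a semidefinite solver, so this sketch only illustrates the surrounding relaxation-and-rounding idea on a tiny instance (sizes, the 0.3 rounding noise, and the use of plain least squares as a cruder relaxation are all ad-hoc choices): relax, round candidate points to {−1, +1}, and compare with exhaustive search over all 2^n sign vectors.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
m, n = 12, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
obj = lambda x: np.linalg.norm(A @ x - b) ** 2

# relaxation: plain least squares (weaker bound than the SDP on the slide)
z = np.linalg.lstsq(A, b, rcond=None)[0]
x_round = np.where(z >= 0, 1.0, -1.0)       # deterministic sign rounding

# randomized rounding around z, in the spirit of the slide
cand = [np.where(z + 0.3 * rng.standard_normal(n) >= 0, 1.0, -1.0)
        for _ in range(100)]
x_rand = min(cand, key=obj)

# exhaustive search over all 2^8 = 256 sign vectors (only feasible when small)
best = min(obj(np.array(s)) for s in product([-1.0, 1.0], repeat=n))
assert obj(z) <= best + 1e-9     # any relaxation lower-bounds the Boolean optimum
assert obj(min([x_round, x_rand], key=obj)) >= best - 1e-9
```

The SDP relaxation on the slide gives a much tighter lower bound than `obj(z)` here, and its solution (Z, z) supplies the covariance Z − zz^T used for the randomized rounding.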
Example

- (randomly chosen) parameters A ∈ R^{150×100}, b ∈ R^150
- x ∈ R^100, so the feasible set has 2^100 ≈ 10^30 points

(figure: histogram of ∥Ax − b∥_2 / (SDP bound) for randomized solutions based on the SDP solution, with the SDP bound and the rounded LS solution marked)
Sums of squares and semidefinite programming

Sum of squares: a function of the form

    f(t) = Σ_{k=1}^s (y_k^T q(t))²

q(t): vector of basis functions (polynomial, trigonometric, ...)

SDP parametrization:

    f(t) = q(t)^T X q(t),   X ⪰ 0

- a sufficient condition for nonnegativity of f, useful in nonconvex polynomial optimization (Parrilo, Lasserre, Henrion, De Klerk, ...)
- in some important special cases, necessary and sufficient
Example: cosine polynomials

    f(ω) = x_0 + x_1 cos ω + ⋯ + x_{2n} cos 2nω ≥ 0

Sum of squares theorem: f(ω) ≥ 0 for α ≤ ω ≤ β if and only if

    f(ω) = g_1(ω)² + s(ω) g_2(ω)²

- g_1, g_2: cosine polynomials of degree n and n − 1
- s(ω) = (cos ω − cos β)(cos α − cos ω) is a given weight function

Equivalent SDP formulation: f(ω) ≥ 0 for α ≤ ω ≤ β if and only if

    x^T p(ω) = q_1(ω)^T X_1 q_1(ω) + s(ω) q_2(ω)^T X_2 q_2(ω),   X_1 ⪰ 0, X_2 ⪰ 0

p, q_1, q_2: basis vectors (1, cos ω, cos 2ω, ...) up to order 2n, n, n − 1
Example: linear-phase Nyquist filter

    minimize sup_{ω ≥ ω_s} |h_0 + h_1 cos ω + ⋯ + h_n cos nω|

with h_0 = 1/M, h_{kM} = 0 for positive integer k

(figure: magnitude |H(ω)| of the resulting filter; example with n = 50, M = 5, ω_s = 0.69)
SDP formulation

    minimize   t
    subject to −t ≤ H(ω) ≤ t,   ω_s ≤ ω ≤ π

Equivalent SDP

    minimize   t
    subject to t − H(ω) = q_1(ω)^T X_1 q_1(ω) + s(ω) q_2(ω)^T X_2 q_2(ω)
               t + H(ω) = q_1(ω)^T X_3 q_1(ω) + s(ω) q_2(ω)^T X_4 q_2(ω)
               X_1 ⪰ 0, X_2 ⪰ 0, X_3 ⪰ 0, X_4 ⪰ 0

variables t, h_i (i ≠ kM), and 4 matrices X_i of size roughly n
Multivariate trigonometric sums of squares

    h(ω) = Σ_{k=−n}^n x_k e^{jk^T ω} = Σ_i |g_i(ω)|²,    (x_{−k} = x̄_k, ω ∈ R^d)

- g_i is a polynomial in e^{jk^T ω}; it can have degree higher than n, and this may be necessary for nonnegative h
- restricting the degrees of g_i gives a sufficient condition for nonnegativity

Spectral mask constraints defined by trigonometric polynomials d_i:

    h(ω) = s_0(ω) + Σ_i d_i(ω) s_i(ω),   s_i is s.o.s.

guarantees h(ω) ≥ 0 on {ω | d_i(ω) ≥ 0}

(B. Dumitrescu)
Two-dimensional FIR filter design

    minimize   δ_s
    subject to |H(ω) − 1| ≤ δ_p,  ω ∈ D_p
               |H(ω)| ≤ δ_s,  ω ∈ D_s

where H(ω) = Σ_{i=0}^n Σ_{k=0}^n h_ik cos iω_1 cos kω_2

(figure: passband D_p and stopband D_s in the (ω_1, ω_2) plane; magnitude |H(ω)| in dB of the resulting filter)
1-Norm Sparsity Heuristics
1-Norm heuristics

use the ℓ_1-norm ∥x∥_1 as a convex approximation of the ℓ_0-"norm" card(x)

sparse regressor selection (Tibshirani, Hastie, ...)

    minimize ∥Ax − b∥_2 + ρ∥x∥_1

sparse signal representation (basis pursuit, sparse compression) (Donoho, Candès, Tao, Romberg, ...)

    minimize   ∥x∥_1
    subject to Ax = b

    minimize   ∥x∥_1
    subject to ∥Ax − b∥_2 ≤ ε
Norm approximation

    minimize ∥Ax − b∥_2        minimize ∥Ax − b∥_1

example (A is 100×30):

(figure: histograms of the residuals of the 2-norm and 1-norm solutions)

note the large number of zero residuals in the 1-norm solution
Robust regression

(figure: 42 points (t_i, y_i) (circles), including two outliers; function f(t) = α + βt fitted using the 2-norm (dashed) and the 1-norm)
Sparse reconstruction

- signal x̂ ∈ R^n with n = 1000, 10 nonzero components

(figure: the exact signal x̂)

- m = 100 random noisy measurements: b = Ax̂ + v
- A_ij ~ N(0, 1) i.i.d. and v ~ N(0, σ²I), σ = 0.01
ℓ_2-Norm reconstruction

    minimize ∥Ax − b∥₂² + ∥x∥₂²

(figure: left, exact signal x̂; right, ℓ_2 reconstruction)
ℓ_1-Norm reconstruction

    minimize ∥Ax − b∥_2 + ∥x∥_1

(figure: left, exact signal x̂; right, ℓ_1 reconstruction)
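The ℓ_1 reconstruction idea can be sketched with proximal gradient descent (ISTA) on the closely related lasso objective ∥Ax − b∥₂² + λ∥x∥_1. Sizes are scaled down from the slide's example (n = 1000, m = 100) to keep the demo fast, and the step size, λ, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, k, sigma = 200, 60, 5, 0.01
xhat = np.zeros(n)
xhat[rng.choice(n, k, replace=False)] = rng.choice([-1.0, 1.0], k)
A = rng.standard_normal((m, n)) / np.sqrt(m)
b = A @ xhat + sigma * rng.standard_normal(m)

lam = 0.05
t = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)     # step <= 1/L for the smooth part
x = np.zeros(n)
for _ in range(3000):
    g = x - t * (2 * A.T @ (A @ x - b))        # gradient step on ||Ax-b||^2
    x = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0)  # soft thresholding

# the l1 solution is sparse and close to the true signal
assert (np.abs(x) > 0.1).sum() <= 3 * k
assert np.linalg.norm(x - xhat) < 0.5 * np.linalg.norm(xhat)
```

The soft-thresholding step is exactly the proximal operator of λ∥x∥_1, which is where the sparsity of the ℓ_1 solution comes from.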
Interior-Point Algorithms
Interior-point algorithms
- handle linear and nonlinear convex problems
- follow the central path as a guide to the solution (using Newton's method)
- worst-case complexity theory: # Newton iterations grows as √(problem size)
- in practice: # Newton steps between 10 and 50
- performance is similar across a wide range of problem dimensions, problem data, problem classes
- controlled by a small number of easily tuned algorithm parameters
Primal and dual cone program

Cone program

    minimize   c^T x
    subject to Ax + s = b
               s ⪰_K 0

Dual

    maximize   −b^T z
    subject to A^T z + c = 0
               z ⪰_{K*} 0

- s ⪰_K 0 means s ∈ K (convex cone)
- z ⪰_{K*} 0 means z ∈ K* (dual cone K* = {z | s^T z ≥ 0 for all s ∈ K})

Examples (of self-dual cones: K = K*)
- linear program: K is the nonnegative orthant
- second-order cone program: K is the second-order cone {(t, x) | ∥x∥_2 ≤ t}
- semidefinite program: K is the cone of positive semidefinite matrices
Central path

solution {(x(t), s(t)) | t > 0} of

    minimize   t c^T x + φ(s)
    subject to Ax + s = b

φ is a logarithmic barrier for the primal cone K
- nonnegative orthant: φ(u) = −Σ_k log u_k
- second-order cone: φ(u, v) = −log(u² − v^T v)
- positive semidefinite cone: φ(V) = −log det V
Example: central path for linear program

(figure: polyhedral feasible set with optimal point x⋆ and central path x(t))
Central path optimality conditions

    Ax + s = b,    A^T z + c = 0,    z + (1/t)∇φ(s) = 0

Newton equation: linearize the optimality conditions

    [ 0 ]   [ 0   A^T ] [ Δx ]   [ −c − A^T z  ]
    [ Δs ] + [ A   0  ] [ Δz ] = [ b − Ax − s  ]

    Δz + (1/t)∇²φ(s) Δs = −z − (1/t)∇φ(s)

- gives search directions Δx, Δs, Δz
- many variations (e.g., primal-dual symmetric linearizations)
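A minimal primal barrier method makes the central-path idea concrete. This is a toy sketch (random bounded LP, ad-hoc t-update and tolerances), not the primal-dual method real solvers use: for increasing t it minimizes t·c^T x + φ(s) with φ(s) = −Σ log s_k, s = b − Ax, by damped Newton steps.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 2, 40
A = rng.standard_normal((m, n))
b = 1.0 + rng.random(m)                  # x = 0 is strictly feasible
c = rng.standard_normal(n)

def barrier(x, t):
    s = b - A @ x
    return t * c @ x - np.log(s).sum() if np.all(s > 0) else np.inf

x, t = np.zeros(n), 1.0
for _ in range(10):                      # outer iterations: increase t
    for _ in range(50):                  # inner: damped Newton on the barrier
        s = b - A @ x
        g = t * c + A.T @ (1 / s)        # gradient
        H = A.T @ ((1 / s**2)[:, None] * A)   # Hessian
        dx = np.linalg.solve(H, -g)
        if -g @ dx < 1e-10:              # Newton decrement^2 small: done
            break
        alpha, f0 = 1.0, barrier(x, t)   # backtracking keeps s > 0
        while alpha > 1e-12 and barrier(x + alpha * dx, t) > f0 + 0.25 * alpha * (g @ dx):
            alpha *= 0.5
        x = x + alpha * dx
    t *= 4.0
t_solved = t / 4.0                       # t used in the last inner solve

# central-path certificate: z = 1/(t s) is (nearly) dual feasible,
# and the duality gap on the central path is m / t
s = b - A @ x
z = 1 / (t_solved * s)
assert np.all(s > 0)
assert np.linalg.norm(c + A.T @ z) < 1e-3
assert c @ x + b @ z < 1e-3              # small certified duality gap
```

Each outer iteration reduces the duality gap bound m/t by a factor of 4, which is why a few tens of Newton steps suffice overall.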
Computational effort per Newton step
- Newton step effort dominated by solving linear equations to find the search direction
- equations inherit structure from the underlying problem
- equations are the same as for a weighted LS problem of similar size and structure

Conclusion: we can solve a convex problem with about the same effort as solving 30 least-squares problems
Direct methods for exploiting sparsity
- well developed, since the late 1970s
- based on (heuristic) variable orderings, sparse factorizations
- standard in general purpose LP, QP, GP, SOCP implementations
- can solve problems with up to 10^5 variables, constraints (depending on the sparsity pattern)
Some convex optimization solvers

primal-dual, interior-point, exploit sparsity
- many for LP, QP (GLPK, CPLEX, ...)
- SeDuMi, SDPT3 (open source; Matlab; LP, SOCP, SDP)
- DSDP, CSDP, SDPA (open source; C; SDP)
- MOSEK (commercial; C with Matlab interface; LP, SOCP, GP, ...)
- solver.com (commercial; Excel interface; LP, SOCP)
- GPCVX (open source; Matlab; GP)
- CVXOPT (open source; Python/C; LP, SOCP, SDP, GP, ...)
- ... and many others
Problem structure beyond sparsity
- state structure
- Toeplitz, circulant, Hankel; displacement rank
- fast transform (DFT, wavelet, ...)
- Kronecker, Lyapunov structure
- symmetry

can exploit for efficiency, but not in most generic solvers
Example: 1-norm approximation

    minimize ∥Ax − b∥_1

Equivalent LP

    minimize   Σ_k y_k
    subject to −y ⪯ Ax − b ⪯ y

Newton equation (D_1, D_2 positive diagonal)

    [ 0    0   A^T  −A^T ] [ Δx   ]   [ r_1 ]
    [ 0    0   −I   −I   ] [ Δy   ] = [ r_2 ]
    [ A    −I  −D_1  0   ] [ Δz_1 ]   [ r_3 ]
    [ −A   −I   0   −D_2 ] [ Δz_2 ]   [ r_4 ]

reduces to an equation of the form

    A^T D A Δx = r

cost = cost of a (weighted) least-squares problem
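The slide's point is that each interior-point step for 1-norm approximation costs one weighted least-squares solve, A^T D A Δx = r. Iteratively reweighted least squares (IRLS) makes the same connection directly and is easy to demo; this sketch (toy data, and the smoothing floor 1e-6 and iteration count are ad-hoc choices) repeatedly solves a weighted LS problem to approach the 1-norm solution.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 100, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.1 * rng.standard_normal(m)
b[::10] += 5.0                           # a few large outliers

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
x = x_ls.copy()
for _ in range(50):
    # weights w_i = 1/|r_i| turn sum w_i r_i^2 into an approximation
    # of sum |r_i|; each step is one weighted LS solve A^T D A x = A^T D b
    w = 1.0 / np.maximum(np.abs(A @ x - b), 1e-6)
    x = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * b))

obj1, obj1_ls = np.abs(A @ x - b).sum(), np.abs(A @ x_ls - b).sum()
assert obj1 <= obj1_ls + 1e-9            # IRLS improves the 1-norm objective
assert np.linalg.norm(x - x_true) < 0.5  # robust to the outliers
```

Like the interior-point step above, every IRLS iteration is a weighted least-squares solve, which is why 1-norm fitting costs only a small multiple of ordinary least squares.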
Iterative methods
- conjugate-gradient (and variants like LSQR)
- exploit general structure: rely on fast methods to evaluate Ax and A^T y, where A is huge
- can terminate early, to get a truncated-Newton interior-point method
- can solve huge problems (10^7 variables, constraints), with a good preconditioner + proper tuning + some luck
Solving specific problems

in developing a custom solver for a specific application, we can
- exploit structure very efficiently
- determine ordering, memory allocation beforehand
- cut corners in the algorithm, e.g., terminate early
- use warm starts

to get a very fast solver; opens up the possibility of real-time embedded convex optimization
Conclusions
Convex optimization

Fundamental theory: recent advances include new problem classes, robust optimization, semidefinite relaxations of nonconvex problems, ℓ_1-norm heuristics, ...

Applications: recent applications in a wide range of areas; many more to be discovered

Algorithms and software
- high-quality general-purpose implementations of interior-point methods
- customized implementations can be orders of magnitude faster

Good modeling systems: with the right software, suitable for embedded applications