Nonconvex? NP! (No Problem!) Ryan Tibshirani, Convex Optimization 10-725/36-725
2 Outline

Today:
- Convex versus nonconvex?
- Classical nonconvex problems
- Eigen problems
- Graph problems
- Nonconvex proximal operators
- Discrete problems
- Infinite-dimensional problems
- Statistical problems
- Miscellaneous
3 Beyond the tip?
4 Some takeaway points

- If possible, formulate the task in terms of convex optimization: typically easier to solve and easier to analyze.
- Nonconvex does not necessarily mean nonscientific! However, statistically, it can often mean high(er) variance. This is true both intrinsically, and because we can rarely solve nonconvex problems (to global optimality).
- In more cases than you might expect, nonconvex problems can be solved exactly (to global optimality).
5 What does it mean for a problem to be nonconvex?

Consider a generic optimization problem:

$$\min_x \; f(x) \quad \text{subject to} \quad g_i(x) \le 0,\; i=1,\dots,m, \qquad h_j(x) = 0,\; j=1,\dots,r$$

This is a convex problem if f and g_i, i = 1, ..., m are convex, and h_j, j = 1, ..., r are affine. A nonconvex problem is one of this form where not all of these conditions are met. But trivial modifications of convex problems can lead to nonconvex formulations, so we really just consider nonconvex problems that are not trivially equivalent to convex ones.
6 What does it mean to solve a nonconvex problem?

Nonconvex problems can have local minima, i.e., there can exist a feasible x such that

$$f(y) \ge f(x) \;\; \text{for all feasible } y \text{ with } \|x - y\|_2 \le R,$$

but x is still not globally optimal. (Note: we proved that this could not happen for convex problems.) Hence by solving a nonconvex problem, we mean finding the global minimizer. We also implicitly mean doing it efficiently, i.e., in polynomial time.
7 Addendum

This is really about putting together a list of interesting problems that are surprisingly tractable... so there will be exceptions about nonconvexity and/or requiring exact global optima. (Also, I'm sure that there are many more examples out there that I'm missing, so I invite you to give me ideas / contribute!)
8 Classical nonconvex problems
9 Linear-fractional programs

A linear-fractional program is of the form

$$\min_x \; \frac{c^T x + d}{e^T x + f} \quad \text{subject to} \quad Gx \le h,\; e^T x + f > 0,\; Ax = b$$

This is nonconvex (but quasiconvex). Provided that this problem is feasible, it is in fact equivalent to the linear program

$$\min_{y,z} \; c^T y + dz \quad \text{subject to} \quad Gy - hz \le 0,\; z \ge 0,\; Ay - bz = 0,\; e^T y + fz = 1$$
10 The link between the two problems is the transformation

$$y = \frac{x}{e^T x + f}, \qquad z = \frac{1}{e^T x + f}$$

The proof of their equivalence is simple; e.g., see B & V Chapter 4. Linear-fractional problems show up in the study of solution paths for some common statistical estimation problems. E.g., knots in the lasso path (values of λ at which a coefficient becomes nonzero) are optimal values of linear-fractional programs. See Taylor et al. (2013), "Inference in adaptive regression via the Kac-Rice formula".
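As a quick illustration (not from the slides), here is a sketch of the LP reformulation solved with scipy.optimize.linprog on made-up data; the Ax = b constraint is dropped for brevity, and the box constraints are only there to keep the toy problem bounded.

```python
# Sketch: solve a linear-fractional program via its LP reformulation (made-up data).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 3, 5
c, d = rng.standard_normal(n), 1.0
e, f = rng.standard_normal(n), 10.0                 # assume e^T x + f > 0 on the feasible set
G = np.vstack([rng.standard_normal((m, n)), np.eye(n), -np.eye(n)])   # includes -1 <= x <= 1
h = np.ones(m + 2 * n)

# LP variables (y, z): minimize c^T y + d z
#   subject to  G y - h z <= 0,  z >= 0,  e^T y + f z = 1
obj = np.concatenate([c, [d]])
A_ub = np.hstack([G, -h[:, None]])
b_ub = np.zeros(len(h))
A_eq = np.concatenate([e, [f]])[None, :]
b_eq = np.array([1.0])
bounds = [(None, None)] * n + [(0, None)]           # y free, z >= 0

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
y, z = res.x[:n], res.x[n]
x_hat = y / z                                       # recover the fractional program's solution
print("optimal ratio:", (c @ x_hat + d) / (e @ x_hat + f))
```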
11 Geometric programs

A monomial is a function f : R^n_{++} -> R of the form

$$f(x) = \gamma\, x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}$$

for γ > 0 and a_1, ..., a_n ∈ R. A posynomial is a sum of monomials,

$$f(x) = \sum_{k=1}^p \gamma_k\, x_1^{a_{k1}} x_2^{a_{k2}} \cdots x_n^{a_{kn}}$$

A geometric program is of the form

$$\min_x \; f(x) \quad \text{subject to} \quad g_i(x) \le 1,\; i=1,\dots,m, \qquad h_j(x) = 1,\; j=1,\dots,r$$

where f and g_i, i = 1, ..., m are posynomials and h_j, j = 1, ..., r are monomials. This is nonconvex.
12 This is equivalent to a convex problem, via a simple transformation. Given f(x) = γ x_1^{a_1} x_2^{a_2} ··· x_n^{a_n}, let y_i = log x_i and rewrite this as

$$\gamma (e^{y_1})^{a_1} (e^{y_2})^{a_2} \cdots (e^{y_n})^{a_n} = e^{a^T y + b}$$

for b = log γ. Also, a posynomial can be written as $\sum_{k=1}^p e^{a_k^T y + b_k}$. With this variable substitution, and after taking logs, a geometric program is equivalent to

$$\begin{aligned}
\min_y \;\; & \log\Big(\sum_{k=1}^{p_0} e^{a_{0k}^T y + b_{0k}}\Big) \\
\text{subject to} \;\; & \log\Big(\sum_{k=1}^{p_i} e^{a_{ik}^T y + b_{ik}}\Big) \le 0,\; i=1,\dots,m \\
& c_j^T y + d_j = 0,\; j=1,\dots,r
\end{aligned}$$

This is convex, recalling the convexity of soft max (log-sum-exp) functions.
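A minimal sketch of a geometric program in CVXPY, which applies exactly this log change of variables internally when called with gp=True; the tiny problem data below are made up for illustration.

```python
# Sketch: a small geometric program via CVXPY's disciplined geometric programming mode.
import cvxpy as cp

x = cp.Variable(pos=True)
y = cp.Variable(pos=True)
z = cp.Variable(pos=True)

# Maximize a monomial subject to posynomial constraints (equivalent to a GP in
# standard minimization form after inverting the objective).
objective = cp.Maximize(x * y * z)
constraints = [4 * x * y * z + 2 * x * z <= 10, x <= 2 * y, y <= 2 * x, z >= 1]
prob = cp.Problem(objective, constraints)
prob.solve(gp=True)       # CVXPY handles the log transformation internally
print(prob.value, x.value, y.value, z.value)
```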
13 Many interesting problems are geometric programs; see Boyd et al. (2007), "A tutorial on geometric programming", and also Chapters 4.5 and 8.8 of the B & V book. Example floor planning program:

$$\begin{aligned}
\min_{W,H,x,y,w,h} \;\; & W H \\
\text{subject to} \;\; & 0 \le x_i \le W,\; i=1,\dots,n \\
& 0 \le y_i \le H,\; i=1,\dots,n \\
& x_i + w_i \le x_j,\; (i,j) \in L \\
& y_i + h_i \le y_j,\; (i,j) \in B \\
& w_i h_i = C_i,\; i=1,\dots,n
\end{aligned}$$

(Extension: Sra and Hosseini (2013), "Geometric optimization on positive definite matrices with application to elliptically contoured distributions".)
14 Problems with two quadratic functions

Consider a problem involving two quadratics:

$$\min_x \; x^T A_0 x + 2 b_0^T x + c_0 \quad \text{subject to} \quad x^T A_1 x + 2 b_1^T x + c_1 \le 0$$

Here A_0, A_1 need not be positive definite, so this is nonconvex. The dual problem can be cast as

$$\begin{aligned}
\max_{u,v} \;\; & u \\
\text{subject to} \;\; & \begin{bmatrix} A_0 + v A_1 & b_0 + v b_1 \\ (b_0 + v b_1)^T & c_0 + v c_1 - u \end{bmatrix} \succeq 0, \quad v \ge 0
\end{aligned}$$

The dual is convex (as always), and strong duality holds. See Appendix B of B & V, and also Beck and Eldar (2006), "Strong duality in nonconvex quadratic optimization with two quadratic constraints".
15 Handling convex equality constraints

Given convex f and g_i, i = 1, ..., m, the problem

$$\min_x \; f(x) \quad \text{subject to} \quad g_i(x) \le 0,\; i=1,\dots,m, \qquad h(x) = 0$$

is nonconvex when h is convex but not affine. A convex relaxation of this problem is

$$\min_x \; f(x) \quad \text{subject to} \quad g_i(x) \le 0,\; i=1,\dots,m, \qquad h(x) \le 0$$

If we can ensure that h(x*) = 0 at any solution x* of the relaxed problem, then the two are equivalent.
16 From B & V Exercises 4.6 and 4.58, e.g., consider the maximum utility problem

$$\begin{aligned}
\max_{x,b} \;\; & \sum_{t=1}^T \alpha_t u(x_t) \\
\text{subject to} \;\; & b_{t+1} = b_t + f(b_t) - x_t,\; t=1,\dots,T \\
& 0 \le x_t \le b_t,\; t=1,\dots,T
\end{aligned}$$

where the initial budget b_1 ≥ 0 is fixed. Interpretation: x_t is the amount you spend of your total available money b_t at time t; the concave function u gives utility, and the concave function f measures investment return. This is not a convex problem, because of the equality constraint; but we can relax it to b_{t+1} ≤ b_t + f(b_t) - x_t, t = 1, ..., T, without changing the solution (think about throwing out money).
17 Convexifying constraint sets

Given a nonconvex set C, consider the nonconvex problem

$$\min_x \; c^T x \quad \text{subject to} \quad x \in C$$

Due to linearity of the objective, this is equivalent to the convex problem

$$\min_x \; c^T x \quad \text{subject to} \quad x \in \mathrm{conv}(C)$$

Proof: let f* be the optimal value in the first problem, and x* a solution of the second. Then x* = Σ_i a_i x_i where x_i ∈ C, a_i ≥ 0, and Σ_i a_i = 1. Note

$$f^\star \ge c^T x^\star = \sum_i a_i c^T x_i \ge \sum_i a_i f^\star = f^\star$$

Thus all x_i must be optimal for the first problem. But note that the convex problem is not necessarily easy! It could be very hard to even form conv(C) (cf. the cutting plane method for integer programming).
18 Eigen problems
19 Principal component analysis

Given a matrix X ∈ R^{n×p}, consider the nonconvex problem

$$\min_R \; \|X - R\|_F^2 \quad \text{subject to} \quad \mathrm{rank}(R) = k$$

for some fixed k. The solution here is given by the singular value decomposition of X: if X = U D V^T, then R̂ = U_k D_k V_k^T, where U_k, V_k are the first k columns of U, V, and D_k contains the first k diagonal elements of D. I.e., R̂ is the reconstruction of X from its first k principal components. This is often called the Eckart-Young theorem, established in 1936, but it was probably known even earlier; see Stewart (1992), "On the early history of the singular value decomposition".
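A minimal numpy sketch of the Eckart-Young solution: the best rank-k approximation of X in Frobenius norm via its truncated SVD (toy data made up).

```python
# Sketch: best rank-k approximation via the truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
k = 5

U, d, Vt = np.linalg.svd(X, full_matrices=False)
R_hat = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]    # U_k D_k V_k^T

print(np.linalg.matrix_rank(R_hat))              # k
print(np.linalg.norm(X - R_hat, "fro") ** 2)     # optimal value of the rank-k problem
```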
20 Fantope

Another characterization of the SVD is via the following nonconvex problem, given X ∈ R^{n×p}:

$$\min_{Z \in S^p} \; \|X - XZ\|_F^2 \quad \text{subject to} \quad \mathrm{rank}(Z) = k,\; Z \text{ a projection}$$

or, equivalently,

$$\max_{Z \in S^p} \; \langle X^T X, Z \rangle \quad \text{subject to} \quad \mathrm{rank}(Z) = k,\; Z \text{ a projection}$$

The solution here is Ẑ = V_k V_k^T, where the columns of V_k ∈ R^{p×k} give the first k eigenvectors of X^T X. This is equivalent to a convex problem. Express the constraint set C as

$$C = \{ Z \in S^p : \mathrm{rank}(Z) = k,\; Z \text{ is a projection} \} = \{ Z \in S^p : \lambda_i(Z) \in \{0,1\},\; i=1,\dots,p,\; \mathrm{tr}(Z) = k \}$$
21 Now consider the convex hull F_k = conv(C):

$$F_k = \{ Z \in S^p : \lambda_i(Z) \in [0,1],\; i=1,\dots,p,\; \mathrm{tr}(Z) = k \} = \{ Z \in S^p : 0 \preceq Z \preceq I,\; \mathrm{tr}(Z) = k \}$$

This is called the Fantope of order k. Further, the convex problem

$$\max_{Z \in S^p} \; \langle X^T X, Z \rangle \quad \text{subject to} \quad Z \in F_k$$

admits the same solution as the original one, i.e., Ẑ = V_k V_k^T. See Fan (1949), "On a theorem of Weyl concerning eigenvalues of linear transformations", and Overton and Womersley (1992), "On the sum of the largest eigenvalues of a symmetric matrix". Sparse PCA extension: Vu et al. (2013), "Fantope projection and selection: near-optimal convex relaxation of sparse PCA".
22 Classical multidimensional scaling

Let x_1, ..., x_n ∈ R^p, and define similarities S_ij = (x_i - x̄)^T (x_j - x̄). For fixed k, classical multidimensional scaling or MDS solves the nonconvex problem

$$\min_{z_1,\dots,z_n} \; \sum_{i,j=1}^n \big( S_{ij} - (z_i - \bar z)^T (z_j - \bar z) \big)^2$$

(Figure from Hastie et al. (2009), "The elements of statistical learning".)
23 Let S be the similarity matrix (entries S_ij = (x_i - x̄)^T (x_j - x̄)). The classical MDS problem has an exact solution in terms of the eigendecomposition S = U D^2 U^T: ẑ_1, ..., ẑ_n are the rows of U_k D_k, where U_k is the first k columns of U, and D_k the first k diagonal entries of D. Note that other very similar forms of MDS are not convex, and not directly solvable, e.g., least squares scaling, with d_ij = ||x_i - x_j||_2:

$$\min_{z_1,\dots,z_n} \; \sum_{i,j=1}^n \big( d_{ij} - \|z_i - z_j\|_2 \big)^2$$

See Hastie et al. (2009), Chapter 14.
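A sketch of classical MDS via the eigendecomposition of the centered similarity matrix, as described above; the data are made up.

```python
# Sketch: classical MDS from the eigendecomposition S = U D^2 U^T.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((30, 10))        # n = 30 points in R^10
k = 2

xc = x - x.mean(axis=0)                  # center the points
S = xc @ xc.T                            # S_ij = (x_i - xbar)^T (x_j - xbar)

evals, U = np.linalg.eigh(S)             # eigenvalues in ascending order
order = np.argsort(evals)[::-1][:k]      # indices of the top-k eigenvalues (= D^2)
Z = U[:, order] * np.sqrt(np.maximum(evals[order], 0.0))   # rows of U_k D_k = embeddings z_i

print(Z.shape)                           # (30, 2)
```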
24 Generalized eigenvalue problems

Given B, W ∈ S^p_{++}, consider the nonconvex problem

$$\max_v \; \frac{v^T B v}{v^T W v}$$

This is a generalized eigenvalue problem, with exact solution given by the top eigenvector of W^{-1} B. This is important, e.g., in Fisher's discriminant analysis, where B is the between-class covariance matrix, and W the within-class covariance matrix. See Hastie et al. (2009), Chapter 4.
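A sketch of solving the generalized Rayleigh quotient with scipy's symmetric generalized eigensolver; B and W below are random positive definite stand-ins.

```python
# Sketch: maximize v^T B v / v^T W v via the generalized eigenproblem B v = lambda W v.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p = 5
A1, A2 = rng.standard_normal((p, p)), rng.standard_normal((p, p))
B = A1 @ A1.T + np.eye(p)        # made-up "between-class" PD matrix
W = A2 @ A2.T + np.eye(p)        # made-up "within-class" PD matrix

evals, evecs = eigh(B, W)        # generalized symmetric eigenproblem
v = evecs[:, -1]                 # eigenvector for the largest generalized eigenvalue

print(evals[-1], (v @ B @ v) / (v @ W @ v))   # the two numbers agree
```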
25 Graph problems
26 Min cut

Given a graph G = (V, E) with V = {1, ..., n}, two nodes s, t ∈ V, and costs c_ij ≥ 0 on edges (i, j) ∈ E, the min cut problem is:

$$\begin{aligned}
\min_{b \in \mathbb{R}^{|E|},\, x \in \mathbb{R}^{|V|}} \;\; & \sum_{(i,j) \in E} b_{ij} c_{ij} \\
\text{subject to} \;\; & b_{ij} \ge x_j - x_i \\
& b_{ij}, x_i, x_j \in \{0,1\} \text{ for all } i, j \\
& x_s = 0,\; x_t = 1
\end{aligned}$$

Think of b_ij as the indicator that the edge (i, j) traverses the cut from s to t; think of x_i as an indicator that node i is grouped with t. This nonconvex problem can be solved exactly using max flow.
27 A relaxation of min cut

$$\begin{aligned}
\min_{b \in \mathbb{R}^{|E|},\, x \in \mathbb{R}^{|V|}} \;\; & \sum_{(i,j) \in E} b_{ij} c_{ij} \\
\text{subject to} \;\; & b_{ij} \ge x_j - x_i \text{ for all } i, j \\
& b \ge 0,\; x_s = 0,\; x_t = 1
\end{aligned}$$

This is an LP, and recall that it is the dual of the max flow LP:

$$\begin{aligned}
\max_{f \in \mathbb{R}^{|E|}} \;\; & \sum_{(s,j) \in E} f_{sj} \\
\text{subject to} \;\; & f_{ij} \ge 0,\; f_{ij} \le c_{ij} \text{ for all } (i,j) \in E \\
& \sum_{(i,k) \in E} f_{ik} = \sum_{(k,j) \in E} f_{kj} \text{ for all } k \in V \setminus \{s,t\}
\end{aligned}$$

The max flow min cut theorem tells us that the relaxed min cut is tight.
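A small sketch (not from the slides) of computing a min s-t cut via max flow with networkx; the toy graph and capacities below are made up.

```python
# Sketch: min cut equals max flow on a toy directed graph.
import networkx as nx

G = nx.DiGraph()
edges = [("s", "a", 3.0), ("s", "b", 2.0), ("a", "b", 1.0),
         ("a", "t", 2.0), ("b", "t", 3.0)]
G.add_weighted_edges_from(edges, weight="capacity")

cut_value, (s_side, t_side) = nx.minimum_cut(G, "s", "t")
flow_value, flow_dict = nx.maximum_flow(G, "s", "t")

print(cut_value, flow_value)      # equal, by max flow min cut duality
print(s_side, t_side)             # the two sides of the optimal cut
```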
28 Max cut

Given an undirected graph G = (V, E) with V = {1, ..., n} and edge costs c_ij ≥ 0, (i, j) ∈ E, the max cut problem is:

$$\max_{v_1,\dots,v_n \in \mathbb{R}} \; \sum_{(i,j) \in E} c_{ij} \frac{1 - v_i v_j}{2} \quad \text{subject to} \quad v_i^2 = 1,\; i=1,\dots,n$$

Nonconvex, NP hard! SDP relaxation of Goemans and Williamson (1994), "Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming":

$$\max_{v_1,\dots,v_n \in \mathbb{R}^n} \; \sum_{(i,j) \in E} c_{ij} \frac{1 - v_i^T v_j}{2} \quad \text{subject to} \quad \|v_i\|_2^2 = 1,\; i=1,\dots,n$$

Remarkable fact: with randomized rounding, E[f_SDP] ≥ 0.878 f_OPT.
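A sketch of the Goemans-Williamson relaxation written over the Gram matrix, with randomized hyperplane rounding, using cvxpy; the small graph and weights are made up.

```python
# Sketch: max cut SDP relaxation + randomized rounding (Goemans-Williamson style).
import numpy as np
import cvxpy as cp

n = 5
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0), (4, 0, 1.0), (0, 2, 1.0)]
C = np.zeros((n, n))
for i, j, w in edges:
    C[i, j] = C[j, i] = w

# SDP over the Gram matrix V_ij = v_i^T v_j: maximize sum_{edges} c_ij (1 - V_ij) / 2
V = cp.Variable((n, n), PSD=True)
obj = cp.Maximize(cp.sum(cp.multiply(C, 1 - V)) / 4)   # /4 since each edge appears twice in C
prob = cp.Problem(obj, [cp.diag(V) == 1])
prob.solve()

# Randomized rounding: factor V and cut with a random hyperplane
L = np.linalg.cholesky(V.value + 1e-6 * np.eye(n))     # rows approximate the vectors v_i
g = np.random.default_rng(0).standard_normal(n)
signs = np.sign(L @ g)
cut_value = sum(w for i, j, w in edges if signs[i] != signs[j])
print(prob.value, cut_value)
```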
29 Nonconvex proximal operators
30 Hard-thresholding

Given y ∈ R^n, one of the simplest problems in signal processing:

$$\min_\beta \; \sum_{i=1}^n (y_i - \beta_i)^2 + \sum_{i=1}^n \lambda_i 1\{\beta_i \neq 0\}$$

The solution is given by hard-thresholding y,

$$\beta_i = \begin{cases} y_i & \text{if } y_i^2 > \lambda_i \\ 0 & \text{otherwise} \end{cases}, \qquad i=1,\dots,n,$$

and can be seen by inspection. Special case of λ_i = λ, i = 1, ..., n:

$$\min_\beta \; \|y - \beta\|_2^2 + \lambda \|\beta\|_0$$

Compare to soft-thresholding, the proximal operator for the l_1 norm. Also, note: changing the loss to ||y - Xβ||_2^2 gives best subset selection, which is NP hard for a general matrix X.
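A one-line sketch of the hard-thresholding solution above (constant λ case, made-up input):

```python
# Sketch: hard-thresholding, the prox of the l0 penalty.
import numpy as np

def hard_threshold(y, lam):
    """Solve min_beta ||y - beta||_2^2 + lam * ||beta||_0 elementwise."""
    return np.where(y ** 2 > lam, y, 0.0)

y = np.array([0.2, -1.5, 0.9, 3.0])
print(hard_threshold(y, lam=1.0))   # [ 0.  -1.5  0.   3. ]
```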
31 Potts minimization

Consider the 1d segmentation problem, also called Potts minimization:

$$\min_\beta \; \sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{n-1} 1\{\beta_i \neq \beta_{i+1}\}$$

Nonconvex, but solvable by dynamic programming, in two ways: Bellman (1961), "On the approximation of curves by line segments using dynamic programming", and Johnson (2013), "A dynamic programming algorithm for the fused lasso and L0-segmentation". Johnson: more efficient; Bellman: more general. Worst-case O(n^2), but with practical performance more like O(n).
32 Tree-leaves projection

Given a target u ∈ R^n, a tree g on R^n, and a label y ∈ {0, 1}, consider

$$\min_z \; \|u - z\|_2^2 + \lambda \, 1\{g(z) \neq y\}$$

Interpretation: find z close to u, whose label under g is not unlike y. Argue directly that the solution is either ẑ = u or ẑ = P_S(u), where S = g^{-1}(y) = {z : g(z) = y} is the set of leaves of g assigned label y. Just compute both options for ẑ and compare costs; the problem reduces to computing P_S(u), the projection onto a set of tree leaves, which is highly nonconvex. (This is a subroutine of a broader algorithm for nonconvex optimization; see Carreira-Perpinan and Wang (2012), "Distributed optimization of deeply nested systems".)
33 The set S is a union of axis-aligned boxes; projection onto any one box is fast, O(n) operations.
34 To project onto S, we could just scan through all boxes and take the closest. Faster: decorate each node of the tree with the labels of its leaves and its bounding box, then perform depth-first search, pruning nodes:
- that do not contain a leaf labeled y, or
- whose bounding box is farther away than the current closest box
35 Discrete problems
36 Binary graph segmentation

Given y ∈ R^n and an undirected graph G = (V, E) with V = {1, ..., n}, consider binary graph segmentation:

$$\min_{\beta \in \{0,1\}^n} \; \sum_{i=1}^n (y_i - \beta_i)^2 + \sum_{(i,j) \in E} \lambda_{ij} 1\{\beta_i \neq \beta_j\}$$

Nonconvex, but a simple manipulation delivers the equivalent form

$$\max_{A \subseteq \{1,\dots,n\}} \; \sum_{i \in A} a_i + \sum_{j \in A^c} b_j - \sum_{(i,j) \in E,\, |A \cap \{i,j\}| = 1} \lambda_{ij}$$

which is a segmentation problem that can be solved exactly using min cut/max flow. E.g., Kleinberg and Tardos (2005), "Algorithm design", Chapter 7.
37 Can apply recursively to get a version of graph hierarchical clustering (divisive). E.g., take the graph as a 2d grid for image segmentation. (From ac.kr)
38 Discrete Potts minimization

Given y ∈ R^n, now consider discrete Potts minimization:

$$\min_{\beta \in \{b_1,\dots,b_k\}^n} \; \sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{n-1} 1\{\beta_i \neq \beta_{i+1}\}$$

where {b_1, ..., b_k} is some fixed discrete set. This is nonconvex, and can be efficiently solved using classical dynamic programming. The key insight is that the 1-dimensional structure allows us to exactly solve and store

$$\begin{aligned}
\hat\beta_1(\beta_2) &= \mathop{\mathrm{argmin}}_{\beta_1 \in \{b_1,\dots,b_k\}} \; \underbrace{(y_1 - \beta_1)^2 + \lambda 1\{\beta_1 \neq \beta_2\}}_{f_1(\beta_1,\beta_2)} \\
\hat\beta_2(\beta_3) &= \mathop{\mathrm{argmin}}_{\beta_2 \in \{b_1,\dots,b_k\}} \; f_1\big(\hat\beta_1(\beta_2), \beta_2\big) + (y_2 - \beta_2)^2 + \lambda 1\{\beta_2 \neq \beta_3\} \\
&\;\;\vdots
\end{aligned}$$
39 DP algorithm:

- Make a forward pass over β_1, ..., β_{n-1}, keeping a look-up table; also keep a look-up table for the optimal partial criterion values f_1, ..., f_{n-1}
- Solve exactly for β_n
- Make a backward pass over β_{n-1}, ..., β_1, reading off the look-up table

(The look-up tables are indexed by β_1, ..., β_{n-1} and f_1, ..., f_{n-1} against the candidate values b_1, ..., b_k.)

Requires O(n k^2) operations.
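A sketch of this O(n k^2) dynamic program (forward pass of partial costs, then backtracking); the toy data at the bottom are made up.

```python
# Sketch: dynamic programming for discrete Potts minimization.
import numpy as np

def discrete_potts(y, levels, lam):
    """Minimize sum_i (y_i - beta_i)^2 + lam * sum_i 1{beta_i != beta_{i+1}},
    with each beta_i restricted to `levels`."""
    y = np.asarray(y, dtype=float)
    b = np.asarray(levels, dtype=float)
    n, k = len(y), len(b)

    cost = (y[0] - b) ** 2                 # cost[j]: best partial cost with beta_i = b[j]
    argmin_prev = np.zeros((n, k), dtype=int)

    for i in range(1, n):
        # transition cost from previous level l to current level j: lam if l != j
        trans = cost[:, None] + lam * (1 - np.eye(k))
        argmin_prev[i] = np.argmin(trans, axis=0)
        cost = trans[argmin_prev[i], np.arange(k)] + (y[i] - b) ** 2

    beta_idx = np.empty(n, dtype=int)
    beta_idx[-1] = int(np.argmin(cost))
    for i in range(n - 1, 0, -1):          # backward pass, reading off the table
        beta_idx[i - 1] = argmin_prev[i, beta_idx[i]]
    return b[beta_idx]

y = [0.1, 0.0, 0.2, 1.1, 0.9, 1.0]
print(discrete_potts(y, levels=[0.0, 1.0], lam=0.5))   # [0, 0, 0, 1, 1, 1]
```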
40 Nearly optimal K-means

Given data points x_1, ..., x_n ∈ R^p, the K-means problem is

$$\min_{c_1,\dots,c_K} \; \underbrace{\sum_{i=1}^n \min_{k=1,\dots,K} \|x_i - c_k\|_2^2}_{f(c_1,\dots,c_K)}$$

This is nonconvex, NP hard, and it is usually approximately solved using Lloyd's algorithm, run many times, with random starts. A careful choice of starting positions makes a big impact: if we run Lloyd's algorithm once, starting at c_1 = s_1, ..., c_K = s_K, for special (random) s_1, ..., s_K, then we get estimates ĉ_1, ..., ĉ_K with

$$\mathbb{E}\big[ f(\hat c_1,\dots,\hat c_K) \big] \le 8(\log K + 2) \min_{c_1,\dots,c_K \in \mathbb{R}^p} f(c_1,\dots,c_K)$$
41 See Arthur and Vassilvitskii (2007), "k-means++: The advantages of careful seeding". Their construction of s_1, ..., s_K is simple (see the sketch after this list):

- Begin by choosing s_1 uniformly at random among x_1, ..., x_n
- Compute squared distances d_i^2 = ||x_i - s_1||_2^2 for all points i not chosen, and choose s_2 by drawing from the remaining points, with probability weights d_i^2 / Σ_j d_j^2
- Recompute the squared distances as d_i^2 = min{ ||x_i - s_1||_2^2, ||x_i - s_2||_2^2 } and choose s_3 according to the same recipe
- And so on, until all of s_1, ..., s_K are chosen
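A sketch (not from the paper) of the k-means++ seeding recipe described above, on made-up data:

```python
# Sketch: k-means++ style D^2-weighted seeding.
import numpy as np

def kmeanspp_seeds(X, K, rng=None):
    """Return K seed centers chosen by squared-distance-weighted sampling."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]                       # s_1 uniform at random
    for _ in range(1, K):
        d2 = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        probs = d2 / d2.sum()                          # D^2 weighting (chosen points get 0)
        seeds.append(X[rng.choice(n, p=probs)])
    return np.array(seeds)

X = np.vstack([np.random.default_rng(0).normal(m, 0.3, size=(50, 2)) for m in (0, 3, 6)])
print(kmeanspp_seeds(X, K=3, rng=1))
```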
42 Infinite-dimensional problems
43 Smoothing splines

Given pairs (x_i, y_i) ∈ R × R, i = 1, ..., n, the smoothing spline solves

$$\min_f \; \sum_{i=1}^n \big( y_i - f(x_i) \big)^2 + \lambda \int \big(f''(t)\big)^2 \, dt$$

Optimization domain: all functions f such that ∫ (f''(t))^2 dt < ∞, an infinite-dimensional set. One can show that the solution f̂ to the above problem is unique, and given by a natural cubic spline with knots at x_1, ..., x_n. (Proof: use integration by parts.) Hence we can parametrize by

$$f = \sum_{j=1}^n \theta_j \eta_j$$

where η_1, ..., η_n are natural cubic spline basis functions. The task now is to solve for the coefficients θ ∈ R^n.
44 Plugging in f = Σ_{j=1}^n θ_j η_j, we transform the smoothing spline problem into the finite-dimensional form

$$\min_\theta \; \|y - N\theta\|_2^2 + \lambda \theta^T \Omega \theta$$

where N_ij = η_j(x_i), and Ω_ij = ∫ η_i''(t) η_j''(t) dt. The solution is explicitly given by

$$\hat\theta = (N^T N + \lambda \Omega)^{-1} N^T y$$

and the fitted function is f̂ = Σ_{j=1}^n θ̂_j η_j. With a proper choice of basis functions (B-splines), the calculation of θ̂ is O(n). See Wahba (1990), "Spline models for observational data"; Green and Silverman (1994), "Nonparametric regression and generalized linear models"; Hastie et al. (2009), Chapter 5.
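A sketch of the finite-dimensional solve above, assuming the basis matrix N (with N_ij = η_j(x_i)) and the penalty matrix Ω have already been built from a natural cubic spline basis; here they are stand-in placeholder matrices only.

```python
# Sketch: smoothing spline coefficients from the normal equations (placeholder N, Omega).
import numpy as np

rng = np.random.default_rng(0)
n, lam = 40, 0.1
y = rng.standard_normal(n)
N = rng.standard_normal((n, n))          # placeholder for eta_j(x_i)
Omega = np.eye(n)                        # placeholder for the curvature penalty matrix

theta_hat = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
fitted = N @ theta_hat                   # values of f_hat at the inputs x_i
print(fitted[:5])
```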
45 Locally adaptive regression splines

Given the same setup, the (cubic) locally adaptive regression spline solves

$$\min_f \; \sum_{i=1}^n \big( y_i - f(x_i) \big)^2 + \lambda \cdot \mathrm{TV}(f''')$$

Optimization domain: all functions f with TV(f''') < ∞, which is again infinite-dimensional. Similar to before, one can show that the solution f̂ to the above problem is a cubic spline, but with two key differences:
- It can have any number of knots ≤ n - 4 (tuned by λ)
- The knots do not necessarily lie at the input points x_1, ..., x_n!

Details in Mammen and van de Geer (1997), "Locally adaptive regression splines". Summary: these are statistically more adaptive but computationally more challenging than smoothing splines.
46 [Figure: smoothing spline fit (easier to compute) vs. locally adaptive regression spline fit (more adaptive)]

A finite-dimensional approximation is given in Mammen and van de Geer (1997), and a much faster approximation in Tibshirani (2014), "Adaptive piecewise polynomial estimation via trend filtering".
47 Reproducing kernel Hilbert spaces

Let κ : R^d × R^d → R_{++} be a positive definite kernel, and H_κ the function space generated by (possibly infinite) linear combinations of the functions κ(·, z), z ∈ R^d. This is a reproducing kernel Hilbert space or RKHS, and is equipped with a norm ||·||_{H_κ}. Given data (x_i, y_i) ∈ R^d × R, i = 1, ..., n, consider the problem

$$\min_{f \in H_\kappa} \; \sum_{i=1}^n \big( y_i - f(x_i) \big)^2 + \lambda \|f\|_{H_\kappa}^2$$

This is infinite-dimensional, but by Mercer's theorem, the solution satisfies f = Σ_{j=1}^n α_j κ(·, x_j). Letting K ∈ R^{n×n} have elements K_ij = κ(x_i, x_j), our problem reduces to

$$\min_\alpha \; \|y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha$$

(Beyond regression: the same trick works for kernel SVMs, kernel PCA, etc.)
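A sketch of the finite-dimensional kernel problem above, with a Gaussian kernel on made-up 1d data; for symmetric invertible K, the stationarity condition simplifies to (K + λI)α = y.

```python
# Sketch: kernel ridge regression, min_alpha ||y - K alpha||^2 + lam alpha^T K alpha.
import numpy as np

rng = np.random.default_rng(0)
n, lam, bandwidth = 50, 0.1, 1.0
x = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(n)

K = np.exp(-((x - x.T) ** 2) / (2 * bandwidth ** 2))   # K_ij = kappa(x_i, x_j)

# Gradient condition: K (K alpha - y) + lam K alpha = 0  =>  (K + lam I) alpha = y
alpha = np.linalg.solve(K + lam * np.eye(n), y)
fitted = K @ alpha
print(np.mean((fitted - y) ** 2))
```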
48 Statistical problems
49 Sparse underdetermined linear systems

Suppose that X ∈ R^{n×p} has unit-normed columns, ||X_i||_2 = 1, for i = 1, ..., p. Given y ∈ R^n, consider the problem of finding the sparsest solution to the linear system:

$$\min_\beta \; \|\beta\|_0 \quad \text{subject to} \quad X\beta = y$$

This is nonconvex and known to be NP hard, for a generic X. A natural convex relaxation is the l_1 basis pursuit problem:

$$\min_\beta \; \|\beta\|_1 \quad \text{subject to} \quad X\beta = y$$

It turns out that there is a deep connection between the two; we cite results from Donoho (2006), "For most large underdetermined systems of linear equations, the minimal l_1-norm solution is also the sparsest solution".
50 As n, p grow large, with p > n, there exists a threshold ρ (depending on the ratio p/n), such that for most matrices X, the following holds. If we solve the l_1 problem and find a solution with:

- fewer than ρn nonzero components, then we must have found the unique solution of the l_0 problem
- greater than ρn nonzero components, then there is no solution of the linear system with fewer than ρn nonzero components

Here "most" can be quantified precisely via a uniform probability measure over matrices X with unit-norm columns. There is a large and fast-moving body of related literature. See Donoho et al. (2009), "Message-passing algorithms for compressed sensing", for a nice review.
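A sketch (not from the references) of l_1 basis pursuit recovering a sparse vector from an underdetermined system, using cvxpy on synthetic data:

```python
# Sketch: basis pursuit on a random underdetermined system with a sparse ground truth.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, s = 40, 100, 5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)                   # unit-normed columns
beta_true = np.zeros(p)
beta_true[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
y = X @ beta_true

beta = cp.Variable(p)
prob = cp.Problem(cp.Minimize(cp.norm1(beta)), [X @ beta == y])
prob.solve()
print(np.max(np.abs(beta.value - beta_true)))    # near zero: exact recovery
```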
51 Exact low-rank matrix completion

Given a matrix Y ∈ R^{n×n}, partially observed over a set of indices Ω ⊆ {1, ..., n}^2, consider the problem of finding the lowest-rank matrix matching Y on the observed set:

$$\min_B \; \mathrm{rank}(B) \quad \text{subject to} \quad B_{ij} = Y_{ij},\; (i,j) \in \Omega$$

This is nonconvex. Natural convex relaxation:

$$\min_B \; \|B\|_{\mathrm{tr}} \quad \text{subject to} \quad B_{ij} = Y_{ij},\; (i,j) \in \Omega$$

Under some assumptions, it can be shown that the solution to the convex problem is exactly equal to the solution to the nonconvex problem, with high probability over the sampling model. See, e.g., Candes and Recht (2008), "Exact matrix completion via convex optimization", and many papers since.
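A sketch of the trace-norm (nuclear norm) relaxation on a small synthetic low-rank matrix, using cvxpy; the observation pattern and data are made up.

```python
# Sketch: nuclear norm matrix completion on a rank-2 toy matrix.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 20, 2
Y = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-2 target
M = (rng.random((n, n)) < 0.6).astype(float)                    # 1 = observed entry

B = cp.Variable((n, n))
prob = cp.Problem(cp.Minimize(cp.normNuc(B)), [cp.multiply(M, B) == M * Y])
prob.solve()
print(np.linalg.norm(B.value - Y) / np.linalg.norm(Y))          # small relative error
```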
52 Miscellaneous
53 Curvature domination

Given convex f and nonconvex g, consider a problem

$$\min_x \; f(x) + g(x)$$

For convex h, we can always rewrite this problem as

$$\min_x \; \underbrace{f(x) + g(x) - h(x)}_{F(x)} + h(x)$$

Sometimes we can choose h to make F smooth and strictly convex! How? Prove that ∇²f(x) ≻ ∇²(h - g)(x) for all x. This is apparently an old idea. See Parekh and Selesnick (2015), "Convex fused lasso denoising with non-convex regularization and its use for pulse detection", and references therein. (Related ideas on curvature domination from statistics: Zhang, Loh, Wainwright.)
54 Setting of Parekh and Selesnick (2015) (simplified):

$$\min_\beta \; \frac{1}{2}\|y - \beta\|_2^2 + \lambda \sum_{i=1}^{n-1} \phi_a(\beta_i - \beta_{i+1})$$

where φ_a(x) = (1/a) log(1 + a|x|), for a > 0. This uses a nonconvex segmentation penalty, so the whole problem appears to be nonconvex. Fact: s_a(x) = φ_a(x) - |x| is twice differentiable, strictly concave, with -a ≤ s_a''(x) ≤ 0 for all x.
55 Rewrite the problem (denoting by D the first difference operator) as

$$\min_\beta \; \underbrace{\frac{1}{2}\|y - \beta\|_2^2 + \lambda \sum_{i=1}^{n-1} \big( \phi_a([D\beta]_i) - |[D\beta]_i| \big)}_{F_a(\beta)} + \lambda \|D\beta\|_1$$

Now compute

$$\nabla^2 F_a(\beta) = I + \lambda D^T S_a(D\beta) D$$

where S_a(x) = diag(s_a''(x_1), ..., s_a''(x_{n-1})). It is easy to check that

$$D^T S_a(D\beta) D \succeq -a\, D^T D \succeq -4a\, I$$

using our previous fact, and λ_max(D^T D) = 4. Thus F_a(β), and hence our whole problem, is strictly convex provided 1 - 4aλ > 0, i.e., a < 1/(4λ).
56 [Figure from Parekh and Selesnick (2015): surface/contour plots of the criterion illustrating the convexity condition — the function G(x) is convex for a_0 = 1/2, a_1 = 1/8 (a small enough), and not convex for a_0 = 1/2, a_1 = 1/3 (a too large, violating their Theorem 1).]
57 Gradient descent converges to minimizers

Given f twice continuously differentiable, dom(f) = R^n, and an initial point x^(0) ∈ R^n, our old friend gradient descent repeats:

$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \dots$$

When f has Lipschitz gradient, with constant L > 0, but is nonconvex, can anything be said? A very old problem. Remarkable result from Lee et al. (2016), "Gradient descent converges to minimizers": if the step sizes are small enough, t_k ≤ 1/L, k = 1, 2, 3, ..., and x^(0) is drawn from any density over R^n, then saddle points are unlikely limit points, i.e.,

$$\mathbb{P}\Big( \lim_{k \to \infty} x^{(k)} = x^\star \Big) = 0, \quad \text{for any isolated strict saddle point } x^\star \text{ of } f$$

Panageas and Piliouras (2016) have already relaxed the isolated saddle point and global Lipschitz conditions. Much progress since...
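A toy sketch of the phenomenon (not from the paper): gradient descent with a small fixed step size and a random start on the nonconvex function f(x_1, x_2) = (x_1^2 - 1)^2 + x_2^2, which has a strict saddle at (0, 0) and minimizers at (±1, 0).

```python
# Sketch: randomly initialized gradient descent avoids the strict saddle at the origin.
import numpy as np

def grad_f(x):
    # gradient of (x1^2 - 1)^2 + x2^2
    return np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]])

rng = np.random.default_rng(0)
x = rng.standard_normal(2)          # random initialization (a density over R^2)
t = 0.05                            # small fixed step size
for _ in range(2000):
    x = x - t * grad_f(x)

print(x)                            # lands near (+1, 0) or (-1, 0), not at the saddle (0, 0)
```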