The Optimization Work of Paul Tseng


1 The Optimization Work of Paul Tseng
Zhi-Quan Luo, University of Minnesota
May 23, 2010

2 Paul Y. Tseng ( )

3 Overview
- Paul's thesis work
- Paul's work on algorithm analysis (first-order methods)
- Paul's work on applications (QCQP/SDP relaxations in wireless communication)

4 Paul's Optimization Work
- Central issues in optimization: analysis, algorithms, software, applications
- Paul made fundamental contributions to all four aspects of optimization
- Paul was a leader in the field of optimization, publishing over 120 journal articles
- Editorial positions in leading optimization/computational math journals
- Invited and plenary talks at various conferences
- The impact of his work is still being felt

5 Thesis Work
Ph.D. thesis: Relaxation Methods for Monotropic Programming Problems, MIT. Supervisor: Dimitri Bertsekas.
Dealt with Gauss-Seidel-type relaxation methods for general monotropic programming: problems with a convex, separable objective and linear constraints,
f(x) = Σ_i f_i(x_i), X = a polytope.
Included are algorithms for several important special cases:
- network flow problems with linear and convex cost functions;
- network flow problems with and without gains;
- standard linear programming problems; etc.
Wrote very popular codes (with Bertsekas) called RELAX, ERELAX for network flows.

6 Thesis Work
- Received 2nd place in the George Nicholson Student Paper Competition in 1985
- Published 7 journal papers on thesis work
- Another 9 journal papers on post-thesis work with Bertsekas

7 Coordinate Descent for QP
Over the next 20+ years, Paul worked on many important theoretical and application problems in optimization:
- First to establish linear convergence of the matrix splitting method (a long-standing open question, L.-Tseng 91)
- First to show the convergence of the affine scaling algorithm for linear programming (Tseng-L. 92)
- Analysis of first-order methods for smooth/nonsmooth optimization
- Interior point methods and smoothing methods for LP/QP/LCP/...
- Applications in image denoising, wireless communication, statistics, compressive sensing, etc.

8 Coordinate Descent for QP
Consider a convex quadratic minimization problem over R^n_+:
minimize (1/2) x^T A x + b^T x subject to x ∈ X   (1)
If X = R^n, we solve (1) by a simple coordinate descent (Gauss-Seidel type) method: write A = A_low + D + A_upp (with A ⪰ 0) and iterate
(A_low + D) x^{r+1} + A_upp x^r + b = 0, i.e., x^{r+1} = -(A_low + D)^{-1} (A_upp x^r + b).
If X = R^n_+, the coordinate descent algorithm becomes
x^{r+1} = [x^{r+1} - (A_low + D) x^{r+1} - A_upp x^r - b]_+,
where [u]_+ = max{0, u}; related to SOR, block SOR.
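
To make the update concrete, here is a minimal numpy sketch of the projected cyclic coordinate descent (Gauss-Seidel) iteration for the case X = R^n_+; the function and variable names are illustrative, and this is not Paul's RELAX code.

```python
import numpy as np

def cd_qp_nonneg(A, b, x0, num_sweeps=200):
    """Cyclic coordinate descent (projected Gauss-Seidel) for
    minimize 0.5*x'Ax + b'x  subject to x >= 0,
    assuming A is symmetric PSD with positive diagonal."""
    x = x0.astype(float).copy()
    n = len(b)
    for _ in range(num_sweeps):
        for i in range(n):
            # exact minimization over coordinate i, then projection onto [0, inf)
            grad_i = A[i] @ x + b[i]
            x[i] = max(0.0, x[i] - grad_i / A[i, i])
    return x

# tiny usage example with a random PSD matrix
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M + 0.1 * np.eye(5)
b = rng.standard_normal(5)
x = cd_qp_nonneg(A, b, np.zeros(5))
```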

9 Coordinate Descent (CD) for QP
Consider a convex quadratic minimization problem over R^n_+:
minimize (1/2) x^T A x + b^T x subject to x ∈ X
- CD has many applications in image processing, MRI, ...
- Convergence was known for symmetric A ≻ 0, with a unique optimal solution
- Convergence for the case of symmetric A ⪰ 0 (possibly unbounded optimal solution set) was unresolved for many years
- Studied by Hildreth, Cryer, Mangasarian, Pang, ... in various contexts since the 1950's.

10 Matrix Splitting Algorithm
Given a symmetric A ⪰ 0, consider a splitting A = B + C with B - C ≻ 0. For
minimize f(x) = (1/2) x^T A x + b^T x subject to x ∈ X = R^n_+,
the optimality condition is
x = Proj_X[x - Ax - b], with Proj_X[·] = projection onto X.
Optimal solution set: X* = {x : x = Proj_X[x - ∇f(x)]}.
Matrix splitting algorithm:
x^{r+1} = Proj_{R^n_+}[x^{r+1} - B x^{r+1} - C x^r - b], r = 1, 2, ...
- CD is a special case
- Linear convergence was settled in 1989 (L.-Tseng 91).

11 Convergence Analysis of Matrix Splitting Algorithm
Key proof steps:
- Sufficient decrease: f(x^r) - f(x^{r+1}) ≥ c_1 ||x^{r+1} - x^r||^2, for some c_1 > 0.
- Cost-to-go estimate: f(x^r) - f* ≤ c_2 φ(x^r)^2, where c_2 > 0 and φ(x) := min_{x̄ ∈ X*} ||x - x̄||.
- Local error bound: φ(x) ≤ c_3 ||x - Proj_X[x - ∇f(x)]||, for all x such that ||x - Proj_X[x - ∇f(x)]|| ≤ c_4, for some c_3, c_4 > 0.

12 Convergence Analysis of Matrix Splitting Algorithm
Comments from leading researchers:
"The notable iterate convergence recently established by Luo and Tseng for exact subproblem solution is extended here to inexact subproblem solution for a symmetric matrix splitting." (Olvi Mangasarian, SIAM J. Optimization, Vol. 1)
"Luo and Tseng [13] were the first to provide an affirmative answer to this outstanding question; but their proof was not easy to follow. Shortly afterwards, Mangasarian established a similar result by assuming a symmetric splitting but allowing inexact solutions in the subproblems." (Jong-Shi Pang, Math. Prog., Vol. 79, 1997)

13 Many Extensions
Let X be a polyhedral set and g(·) be strictly convex. Consider
minimize f(x) = g(Kx) + c^T x subject to x ∈ X
The three key steps remain valid for a broad class of algorithms
x^{r+1} = Proj_X[x^r - α^r ∇f(x^r) + e^r], r = 1, 2, ...
where the step size α^r > 0 can be chosen by the Armijo rule, or the Goldstein or Levin-Polyak rules, etc., and the error e^r satisfies ||e^r|| ≤ c_5 ||x^{r+1} - x^r|| for some c_5 > 0.
Includes many known algorithms: extragradient method, proximal point method, coordinate descent method, matrix splitting method.
Convergence result: the iterates converge linearly to an optimal solution in X*. (L.-Tseng 93, 94, 95)
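
Below is a small numpy sketch of this feasible-descent framework in its simplest form (error term e^r set to zero, X a box so that Proj_X is a clip, Armijo backtracking for α^r); all names and tolerances are illustrative assumptions, not part of the original framework.

```python
import numpy as np

def grad_proj_armijo(f, grad_f, proj, x0, alpha0=1.0, beta=0.5, sigma=1e-4,
                     max_iter=500, tol=1e-8):
    """Gradient projection x_{r+1} = Proj_X[x_r - alpha_r*grad f(x_r)]
    with Armijo backtracking along the projection arc (e_r omitted)."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad_f(x)
        alpha = alpha0
        x_new = proj(x - alpha * g)
        # shrink alpha until the Armijo sufficient-decrease test holds
        while f(x_new) > f(x) + sigma * g @ (x_new - x) and alpha > 1e-12:
            alpha *= beta
            x_new = proj(x - alpha * g)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# example: minimize ||Kx - c||^2 over the box 0 <= x <= 1
rng = np.random.default_rng(1)
K, c = rng.standard_normal((20, 8)), rng.standard_normal(20)
f = lambda x: np.sum((K @ x - c) ** 2)
grad_f = lambda x: 2 * K.T @ (K @ x - c)
proj = lambda x: np.clip(x, 0.0, 1.0)
x_box = grad_proj_armijo(f, grad_f, proj, np.zeros(8))
```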

14 Extensions to Non-monotone Affine Equilibrium Problems
Consider the symmetric affine variational inequality problem (A = A^T):
find an x* ∈ X such that ⟨Ax* + b, x - x*⟩ ≥ 0 for all x ∈ X.
- includes linearly constrained non-convex QP
- the solution set X* := {x : x = Proj_X[x - Ax - b]} may be unbounded or disconnected
Convergence result: the iterates converge linearly to a solution in X*. (L.-Tseng 93)

15 Dual Coordinate Ascent
Consider a convex minimization problem
min f(x), s.t. Ax ≥ b
Assume feasibility: X := {x : Ax ≥ b} ≠ ∅. Its dual is
max -f*(A^T y) + b^T y, s.t. y ≥ 0
where f* is the Legendre transform of f.
Applying CD to the dual corresponds to row-action algorithms, successive projection onto convex sets (POCS), etc.
Convergence result: If X ≠ ∅, then the primal/dual iterates converge linearly to a primal/dual optimal solution pair. (L.-Tseng 93)
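
For the special case f(x) = (1/2)||x||^2 (projecting the origin onto {x : Ax ≥ b}), dual coordinate ascent reduces to Hildreth's classical row-action method. The numpy sketch below illustrates that correspondence; it is a toy under these stated assumptions, not a general dual CD solver.

```python
import numpy as np

def hildreth(A, b, num_sweeps=500):
    """Dual coordinate ascent for  min 0.5*||x||^2  s.t.  Ax >= b.
    The dual is  max_{y>=0}  b'y - 0.5*||A'y||^2 ; cyclic coordinate
    ascent on y is Hildreth's row-action method, and x = A'y recovers
    the primal iterate."""
    m, n = A.shape
    y = np.zeros(m)
    x = np.zeros(n)                      # x = A.T @ y, kept up to date
    row_norm2 = np.sum(A * A, axis=1)
    for _ in range(num_sweeps):
        for i in range(m):
            # exact maximization of the dual over y_i, clipped at y_i >= 0
            y_new = max(0.0, y[i] + (b[i] - A[i] @ x) / row_norm2[i])
            x += (y_new - y[i]) * A[i]
            y[i] = y_new
    return x, y

# usage: project the origin onto the polyhedron {x : Ax >= b}
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6) - 1.0
x, y = hildreth(A, b)
```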

16 Row Action / Projection onto Convex Sets (POCS) Method
[Figure: alternating projections between two convex sets C and D, generating iterates x^0, x^1, x^2, x^3 converging to x*]

17 Incremental Gradient Method
If X = ∅ (the constraints are inconsistent), we can still use row-action or POCS-type algorithms:
min f_1(x) + f_2(x) + ... + f_m(x)   (2)
Incremental gradient method:
x^{r + i/m} = x^{r + (i-1)/m} - α^r ∇f_i(x^{r + (i-1)/m}), i = 1, 2, ..., m, r = 1, 2, ...
- back propagation for neural nets, ...
- reduces to dual coordinate ascent for the linear feasibility problem
- applications in distributed optimization, sensor nets (Nedic/Bertsekas, Mangasarian, Solodov, Lewis, ...)
Convergence result: If Σ_r α^r = ∞ and Σ_r (α^r)^2 < ∞, then the iterates x^r converge to a stationary point of (2). (L.-Tseng 94)
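
A minimal numpy sketch of the incremental gradient iteration, applied to a sum of least-squares terms with the diminishing stepsize α^r = a/(r+1) (so that Σ α^r = ∞ and Σ (α^r)^2 < ∞); the problem instance and names are illustrative.

```python
import numpy as np

def incremental_gradient(grads, x0, num_epochs=200, a=0.5):
    """Incremental gradient method for  min sum_i f_i(x):
    within epoch r, cycle through the components and step along
    -grad f_i at the current point with stepsize alpha_r = a/(r+1)."""
    x = x0.copy()
    for r in range(num_epochs):
        alpha = a / (r + 1)
        for g_i in grads:
            x = x - alpha * g_i(x)
    return x

# example: f_i(x) = 0.5*(a_i'x - b_i)^2, a linear least-squares problem
rng = np.random.default_rng(3)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
grads = [lambda x, a_i=A[i], b_i=b[i]: (a_i @ x - b_i) * a_i for i in range(30)]
x = incremental_gradient(grads, np.zeros(5))
```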

18 Sparse Signal Reconstruction, Compressive Sensing
Suppose x̄ is sparse. Given (A, b), find the sparsest point:
x̂ = argmin_x ||x||_0, subject to Ax = b   (3)
where ||x||_0 = number of nonzero entries in x.
From a hard combinatorial problem to convex optimization:
x̂ = argmin_x ||x||_1, subject to Ax = b   (4)
Motivation: the 1-norm is sparsity promoting.
- Basis pursuit (Donoho et al. 98)
- Many variants; e.g., ||Ax - b||_2 ≤ σ for noisy b
- Greedy algorithms (e.g., Tropp-Gilbert 05, ...)
Key question: when are (3) and (4) equivalent?
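
One quick way to experiment with formulation (4) is to rewrite it as a linear program with auxiliary variables t ≥ |x|. The sketch below does this with scipy.optimize.linprog; it assumes SciPy is available and is meant only as a small illustration of the convex reformulation.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Solve  min ||x||_1  s.t.  Ax = b  as the LP
        min 1't  s.t.  x - t <= 0,  -x - t <= 0,  Ax = b,
    over the stacked variable z = [x; t]."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])    # objective: sum of t
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])             # encodes |x_i| <= t_i
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])          # Ax = b
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b, bounds=bounds)
    return res.x[:n]

# example: recover a sparse vector from random Gaussian measurements
rng = np.random.default_rng(4)
n, m, k = 60, 25, 4
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n))
x_hat = basis_pursuit(A, A @ x_true)
```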

19 Compressive Sensing: Exact Recovery via Convex Optimization
Assume Ψ = I. Let x̄ be the true sparse solution and
x̂ = argmin_x ||x||_1, subject to: Ax = Ax̄.
Theorem (Candes-Tao 2005, Candes-Romberg-Tao 2005): If A ∈ R^{m×n} is iid standard normal and
||x̄||_0 ≤ O(m/[1 + log(n/m)]) < m < n,
then, with high probability (WHP), x̂ = x̄.

20 Magnetic Resonance Imaging
MRI scan gives Fourier coefficients, which give images.
Is it possible to reduce the time to scan and reconstruct images?
Compressive sensing: reconstruct the image from a subset of the Fourier coefficients.

21 Compressive Sensing for Magnetic Resonance Imaging
- Pick 25% of the coefficients at random (with bias)
- Reconstruct the image from the 25% coefficients

22 Compressive Sensing for Magnetic Resonance Imaging
Use 1/4 of the Fourier coefficients.
Figure: Original vs. Reconstructed Image (courtesy of Y. Zhang, Rice University)

23 Convex Formulations
min_{x ∈ C^n} ||Ψx||_1, subject to: Ax = b
min_{x ∈ R^n} ||x||_1, subject to: Ax = b; x ≥ 0
min_{x ∈ C^n} ||Ψx||_1 + ρ ||Ax - b||_2^2
min_{x ∈ R^n} ||Ax - b||_2^2, subject to: ||x||_1 ≤ ν.

24 Convex Formulations
Simple and convex problems, but they involve
- large and dense matrices
- real-time processing requirements
Standard (simplex, interior-point) methods are not suitable.
First-order algorithms can be efficient: they involve matrix multiplications using A or A^T.
- Fast transforms and structure help
- Sparsity reduces computational complexity

25 Coordinate Descent for Non-smooth Optimization
[Figure: smooth case vs. non-smooth case]
The CD method for non-smooth problems can get stuck at non-interesting points.

26 Coordinate Descent for Non-smooth Optimization
Consider the nonsmooth formulation:
min_{x ∈ R^n} τ ||x||_1 + ||Ax - b||_2^2
Coordinate descent algorithm (CD): iteratively and cyclically minimize with respect to each variable.
Soft thresholding: let x^+ = argmin_{y ∈ R} τ|y| + (1/2)(y - x)^2; then
x^+ = x + τ, if x ≤ -τ;  x^+ = 0, if -τ ≤ x ≤ τ;  x^+ = x - τ, if x ≥ τ.
Convergence:
- In general, the CD algorithm does not converge for nonsmooth optimization.
- For nonsmoothness appearing directly on the variables, e.g., ||x||_1, the CD algorithm does converge linearly (Tseng, 2000).
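
A minimal numpy sketch of this cyclic coordinate descent with exact coordinate minimization by soft thresholding, for the objective τ||x||_1 + ||Ax - b||_2^2; names and the stopping rule (a fixed number of sweeps) are illustrative choices.

```python
import numpy as np

def soft(u, thresh):
    """Soft thresholding: argmin_z  thresh*|z| + 0.5*(z - u)^2."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def cd_lasso(A, b, tau, num_sweeps=200):
    """Cyclic coordinate descent for  min_x  tau*||x||_1 + ||Ax - b||_2^2;
    each coordinate subproblem is solved exactly in closed form."""
    n = A.shape[1]
    x = np.zeros(n)
    col_norm2 = np.sum(A * A, axis=0)
    resid = b - A @ x
    for _ in range(num_sweeps):
        for i in range(n):
            r_i = resid + A[:, i] * x[i]          # residual without coordinate i
            x_new = soft(A[:, i] @ r_i, tau / 2.0) / col_norm2[i]
            resid = r_i - A[:, i] * x_new
            x[i] = x_new
    return x

# usage on a small random instance
rng = np.random.default_rng(5)
A = rng.standard_normal((40, 15))
b = rng.standard_normal(40)
x = cd_lasso(A, b, tau=1.0)
```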

27 Projected Landweber Algorithm
Consider the formulation:
min_{x ∈ R^n} ||Kx - c||_2^2, s.t. x ≥ 0.
Optimality condition: x = Proj_{R^n_+}[x - α K^T(Kx - c)], where α > 0.
Projected Landweber algorithm:
x^{r+1} = Proj_{R^n_+}[x^r - α K^T(Kx^r - c)], r = 0, 1, 2, ...
Convergence:
- If K has full column rank, then x^r converges linearly to the unique optimum. (Bertsekas, 1999)
- If K ≥ 0 (entry-wise) and each row is nonzero, then the iterates {x^r} converge linearly. (Johansson-Elfving-Kozlov-Censor-Forssen-Granlund, 2006)

28 Projected Landweber Algorithm
Consider the formulation:
min_{x ∈ R^n} ||Kx - c||_2^2, subject to: ||x||_1 ≤ ν.
Optimality condition: x = Proj_X[x - α K^T(Kx - c)], where α > 0 and X = {x : ||x||_1 ≤ ν}.
Projection onto X is simple: iterative thresholding.
Projected Landweber algorithm:
x^{r+1} = Proj_X[x^r - α^r K^T(Kx^r - c)], r = 0, 1, 2, ...
Convergence (Daubechies-Fornasier-Loris, 2008; Bredies-Lorenz, 2008; ...): {x^r} converges in Hilbert space; if K has full column rank, then linear convergence.
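
The projection onto X = {x : ||x||_1 ≤ ν} has a well-known closed form based on sorting and soft thresholding. The sketch below implements that projection and one way to run the projected Landweber iteration with it; the stepsize choice 1/||K||^2 and all names are illustrative assumptions.

```python
import numpy as np

def proj_l1_ball(v, nu):
    """Euclidean projection of v onto {x : ||x||_1 <= nu} via the standard
    sort-based soft-thresholding construction."""
    if np.sum(np.abs(v)) <= nu:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.max(ks[u - (css - nu) / ks > 0])
    theta = (css[rho - 1] - nu) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_landweber_l1(K, c, nu, alpha=None, num_iters=500):
    """Projected Landweber iteration  x <- Proj_X[x - alpha*K'(Kx - c)]
    with X = {x : ||x||_1 <= nu}."""
    if alpha is None:
        alpha = 1.0 / np.linalg.norm(K, 2) ** 2     # conservative constant step
    x = np.zeros(K.shape[1])
    for _ in range(num_iters):
        x = proj_l1_ball(x - alpha * K.T @ (K @ x - c), nu)
    return x

# usage
rng = np.random.default_rng(6)
K, c = rng.standard_normal((30, 12)), rng.standard_normal(30)
x = projected_landweber_l1(K, c, nu=2.0)
```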

29 Recall a General Convergence Result
Let X = {x : Cx ≤ d} be a nonempty polyhedral set. Consider
min_{x ∈ X} f(x) = g(Kx) + p^T x
where g is strictly convex and ∇g is Lipschitz continuous.
Optimality: x = Proj_X[x - α ∇f(x)]
Gradient projection algorithm: x^{k+1} = Proj_X[x^k - α^k ∇f(x^k)]
Convergence (L.-Tseng, 1991): For various stepsize rules (including the Armijo rule), x^k → x* linearly, for some optimal solution x*.
If g(·) = ||· - c||^2, p = 0 and X = {x : ||x||_1 ≤ ν}, then we recover the projected Landweber algorithm.

30 Extensions: Proximal Gradient Methods
Consider
min f_1(x) + f_2(x), s.t. x ∈ R^n   (5)
where f_1 is convex and nonsmooth, while f_2 is convex and smooth.
Prox operator: Prox_{f_1}(x) = argmin_y f_1(y) + (1/2) ||x - y||^2
If f_1(x) = 1_X(x) (indicator function) for some convex set X, then problem (5) becomes min_{x ∈ X} f_2(x) and
Prox_{f_1}(x) = argmin_{y ∈ X} (1/2) ||x - y||^2 = Proj_X[x].
Optimality: x = Prox_{f_1}(x - α ∇f_2(x)), α > 0.
Proximal gradient (proximal splitting) method:
x^{r+1} = Prox_{f_1}(x^r - α^r ∇f_2(x^r))
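
For f_1 = τ||·||_1 and f_2 = (1/2)||Kx - c||_2^2 this is the familiar iterative soft-thresholding scheme. A minimal numpy sketch follows; the constant stepsize 1/||K||^2 and the iteration count are illustrative choices, and the prox is taken of the scaled function ατ||·||_1.

```python
import numpy as np

def prox_l1(x, t):
    """Prox of t*||.||_1: componentwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient(K, c, tau, num_iters=500):
    """Proximal gradient iteration for
        min_x  tau*||x||_1 + 0.5*||Kx - c||_2^2 :
    a gradient step on the smooth part, then the prox of the nonsmooth part."""
    alpha = 1.0 / np.linalg.norm(K, 2) ** 2
    x = np.zeros(K.shape[1])
    for _ in range(num_iters):
        grad = K.T @ (K @ x - c)                 # gradient of the smooth part
        x = prox_l1(x - alpha * grad, alpha * tau)
    return x

# usage
rng = np.random.default_rng(7)
K, c = rng.standard_normal((50, 20)), rng.standard_normal(50)
x = proximal_gradient(K, c, tau=0.5)
```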

31 Proximal Gradient Method
[Figure: proximal gradient iterates x^0, y^0, x^1, y^1, ... and convex sets C, D, converging to x*]
Paul's formulation:
x^{r+1} = argmin_x { ⟨∇f_2(x^r), x - x^r⟩ + f_1(x) + (1/(2α^r)) ||x - x^r||^2 }
- ||·||^2 can be replaced by a Bregman distance
- Also related to mirror descent, augmented Lagrangian

32 Analysis of Proximal Gradient Method
Consider f_1(x) = ||x||_1, f_2(x) = (1/2) ||Kx - c||_2^2.
Then Prox_{τ f_1}(x) = soft thresholding (x minus its projection onto [-τ, τ]^n).
In general, consider f_1(x) = Σ_{J ∈ 𝒥} ||x_J||_2, f_2(x) = g(Kx), where 𝒥 is a partition of the coordinates and g(·) is strongly convex. Then Prox_{τ f_1}(x) can still be computed efficiently (e.g., in closed form).
When g(·) = ||· - b||^2, this corresponds to the Group Lasso in statistics. Also includes Total Variation (TV) minimization in image processing.
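
The closed-form prox for the group penalty f_1(x) = Σ_J ||x_J||_2 is blockwise shrinkage. A small numpy sketch of that operator (with an illustrative hand-built partition) is given below.

```python
import numpy as np

def prox_group_l2(v, tau, groups):
    """Prox of  tau * sum_J ||x_J||_2  over a partition `groups` of the
    coordinates: each block is shrunk by the group soft-thresholding rule
    x_J = max(0, 1 - tau/||v_J||) * v_J."""
    x = np.zeros_like(v)
    for J in groups:
        nrm = np.linalg.norm(v[J])
        if nrm > tau:
            x[J] = (1.0 - tau / nrm) * v[J]
    return x

# usage: 9 coordinates split into 3 groups of 3
v = np.array([3.0, -1.0, 0.5, 0.1, 0.2, -0.1, 2.0, 2.0, -2.0])
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
x = prox_group_l2(v, tau=0.5, groups=groups)
```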

33 Linear Convergence of Proximal Gradient Method
The proximal gradient method consists of two steps: a gradient descent step with respect to f_2(x), then soft thresholding.
The three key proof steps (sufficient decrease, cost-to-go estimate, local error bound), as established by L. and Tseng in the early 90's, still hold true for the proximal gradient method.
Convergence (Tseng, 2009): For any matrix K, linear convergence of the iterates to an optimal solution.

34 Application Work: Transmit Beamforming
Downlink transmission: the basestation has K antennas; m receivers.
n multicast groups {G_1, ..., G_n}, with G_k ∩ G_l = ∅ (k ≠ l), ∪_{k=1}^n G_k = {1, ..., m}, and Σ_{k=1}^n |G_k| = m.
- w_k: beamforming vector for G_k
- s_k: signal sent to group G_k
- transmitted signal: s(t) = Σ_{k=1}^n s_k(t) w_k

35 Transmit Beamforming
Assume each receiver has one antenna, with channel vector h_i.
For a ULA, LoS, far field:
h_i = (1, e^{jφ_i}, ..., e^{j(K-1)φ_i}), where φ_i = 2π (d/λ) sin θ_i; Vandermonde.
Received signal at receiver i ∈ G_k:
s_k w_k^H h_i + Σ_{l ≠ k} s_l w_l^H h_i + v_i
SINR for receiver i ∈ G_k:
|w_k^H h_i|^2 / (Σ_{l ≠ k} |w_l^H h_i|^2 + σ_i^2)
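
A small numpy helper that evaluates the SINR expression above for given beamformers and channels may make the QoS constraints on the next slide easier to parse; the data layout (channels as columns, one list entry per group) is an illustrative assumption.

```python
import numpy as np

def sinr(W, H, groups, sigma2):
    """Per-receiver SINR  |w_k^H h_i|^2 / (sum_{l != k} |w_l^H h_i|^2 + sigma_i^2).
    W: list of beamforming vectors w_k (one per group); H: channel vectors h_i
    as columns; groups[k]: indices of receivers in G_k; sigma2: noise powers."""
    out = np.zeros(H.shape[1])
    for k, G_k in enumerate(groups):
        for i in G_k:
            h = H[:, i]
            signal = np.abs(np.vdot(W[k], h)) ** 2
            interference = sum(np.abs(np.vdot(W[l], h)) ** 2
                               for l in range(len(W)) if l != k)
            out[i] = signal / (interference + sigma2[i])
    return out

# usage: 4 antennas, 6 receivers in 2 multicast groups, random complex channels
rng = np.random.default_rng(8)
H = (rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))) / np.sqrt(2)
W = [rng.standard_normal(4) + 1j * rng.standard_normal(4) for _ in range(2)]
groups = [np.arange(0, 3), np.arange(3, 6)]
vals = sinr(W, H, groups, sigma2=np.ones(6))
```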

36 Problem Description: Transmit Beamforming
Transmit beamforming problem: minimize transmit power, subject to a QoS constraint for each receiver in each group.
minimize Σ_{k=1}^n ||w_k||^2
subject to |w_k^H h_i|^2 / (Σ_{l ≠ k} |w_l^H h_i|^2 + σ_i^2) ≥ c_i, for all i ∈ G_k, k ∈ {1, ..., n}
Equivalently,
minimize Σ_{k=1}^n ||w_k||^2
subject to |w_k^H h_i|^2 ≥ c_i Σ_{l ≠ k} |w_l^H h_i|^2 + c_i σ_i^2, for all i ∈ G_k, all k.
A separable homogeneous QCQP. How effective is SDP relaxation for this problem?

37 Empirical Behavior
- Measured VDSL channel data from France Telecom R&D; 17 measured channel scenarios, 28 loops, 300m, ... MHz
- Compared the SDP solution vs. no precoding (ignoring crosstalk)
- SDP yields nearly a doubling of the minimum received signal power
- SDP relaxation is tight in over 50% of instances. (SDL 05)
- SDP solved by the basestation
Question: a theoretical performance bound?

38 Problem Description: Separable Homogeneous QCQP
Let C_i, A_ij be Hermitian matrices, and H = R or C.
QCQP (NP-hard):
min Σ_{i=1}^n w_i^H C_i w_i over w_i ∈ H^{K_i}
s.t. w_1^H A_11 w_1 + w_2^H A_12 w_2 + ... + w_n^H A_1n w_n ≥ b_1
     w_1^H A_21 w_1 + w_2^H A_22 w_2 + ... + w_n^H A_2n w_n ≥ b_2
     ...
     w_1^H A_m1 w_1 + w_2^H A_m2 w_2 + ... + w_n^H A_mn w_n ≥ b_m
SDP relaxation:
minimize Σ_{i=1}^n Tr(C_i W_i)
subject to Tr(A_11 W_1 + A_12 W_2 + ... + A_1n W_n) ≥ b_1
           ...
           Tr(A_m1 W_1 + A_m2 W_2 + ... + A_mn W_n) ≥ b_m
           W_i ⪰ 0.
Question: How good is the SDP relaxation?

39 Special Case: Vandermonde h_i
For a uniform linear array, line-of-sight, far-field propagation, h_i is Vandermonde:
h_i = (1, e^{jφ_i}, ..., e^{j(K-1)φ_i}) := a(φ_i).
In this case, the SDP relaxation of the separable homogeneous QCQP always has a rank-1 solution. (m > n allowed)
QCQP:
minimize Σ_{k=1}^n ||w_k||^2
subject to |w_k^H h_i|^2 ≥ c_i Σ_{l ≠ k} |w_l^H h_i|^2 + c_i σ_i^2, i ∈ G_k, k ∈ {1, ..., n}
SDP relaxation:
minimize Σ_{k=1}^n Tr(W_k)
subject to Tr(W_k h_i h_i^H) ≥ c_i Σ_{l ≠ k} Tr(W_l h_i h_i^H) + c_i σ_i^2, i ∈ G_k, all k; W_k ⪰ 0.

40 Why?
Suppose W_i* = Σ_l w_il w_il^H is optimal. Goal: find a rank-1 W̄_i such that Tr(W̄_i h_j h_j^H) = Tr(W_i* h_j h_j^H) and Tr(W̄_i) = Tr(W_i*).
Recall a(φ) = (1, e^{jφ}, ..., e^{j(K-1)φ}). Write Tr(W_i* h_j h_j^H) = Σ_l |a(φ_j)^H w_il|^2.
The trig polynomial f_i(φ) := Σ_l |a(φ)^H w_il|^2 ≥ 0 for all φ, so f_i(φ) = |a(φ)^H w̄_i|^2 for some w̄_i. (Riesz-Fejer)
Integrating out φ in |a(φ)^H w̄_i|^2 = Σ_l |a(φ)^H w_il|^2 yields
Tr(w̄_i w̄_i^H) = ||w̄_i||^2 = Σ_l ||w_il||^2 = Σ_l Tr(w_il w_il^H).
Also Tr(W_i* h_j h_j^H) = Σ_l |a(φ_j)^H w_il|^2 = f_i(φ_j) = Tr(w̄_i w̄_i^H h_j h_j^H).
Thus {W̄_i = w̄_i w̄_i^H}_{i=1}^n is feasible and has the same objective value as {W_i*}_{i=1}^n: a rank-1 solution.

41 Worst-case SDP Approximation Performance
In general, SDP relaxation for separable homogeneous QCQP is not tight.
Example: For any M > 0, consider
υ_qp := min x^2 + y^2
s.t. y^2 ≥ 1, x^2 - Mxy ≥ 1, x^2 + Mxy ≥ 1.
Notice that the last two constraints of the QCQP imply x^2 ≥ M|x||y| + 1 and x^2 ≥ 1, which, in light of the constraint y^2 ≥ 1, further imply x^2 ≥ M + 1. Hence υ_qp ≥ M + 2.
However, for the SDP relaxation
υ_sdp := min X(11) + X(22)
s.t. X(22) ≥ 1, X(11) - M X(12) ≥ 1, X(11) + M X(12) ≥ 1, X ⪰ 0,
X = I is clearly a feasible solution, implying υ_sdp ≤ 2.
Hence the performance ratio υ_qp/υ_sdp is at least (M + 2)/2, which can be arbitrarily large.

42 Special Case: Single Group Multicasting
When n = 1 (single group), the transmit beamforming problem becomes
υ_qp := min_{w ∈ H^n} ||w||^2
s.t. w^H H_i w := Σ_{l ∈ I_i} |h_l^H w|^2 ≥ 1, i = 1, ..., m,
where h_l ≠ 0, h_l ∈ H^n (H = C or R), and I_1 ∪ ... ∪ I_m = {1, ..., M}; for z = x + iy (x, y ∈ R^n), z^H = x^T - i y^T; |I_i| = number of antennas at receiver i.
w* = the least-norm vector in the exteriors of the ellipsoids.
This formulation is dual to the max-norm-over-ellipsoids problem considered by NRT 99.

43 SDP Relaxation
Finding a global minimum of the QP is NP-hard (reduction from PARTITION).
Approximate the QP by an easy convex optimization problem: a semidefinite programming (SDP) relaxation (Lovász 91, Shor 87).
Let H_i = Σ_{l ∈ I_i} h_l h_l^H. The SDP relaxation is
υ_sdp := min Tr(W) s.t. Tr(H_i W) ≥ 1, i = 1, ..., m, W ⪰ 0.
Then 0 < υ_sdp ≤ υ_qp ≤? C υ_sdp (for some C ≥ 1?)
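
For readers who want to try the relaxation numerically, here is a sketch of the single-group SDP using the cvxpy modeling package (assumed to be installed, together with an SDP-capable solver); for simplicity each receiver is given a single channel vector, i.e., |I_i| = 1.

```python
import numpy as np
import cvxpy as cp

def multicast_sdp(h_list):
    """SDP relaxation of the single-group multicast problem:
        minimize Tr(W)  s.t.  Tr(H_i W) >= 1,  W PSD,
    with H_i = h_i h_i^H (one channel vector per receiver here)."""
    K = len(h_list[0])
    H = [np.outer(h, h.conj()) for h in h_list]
    W = cp.Variable((K, K), hermitian=True)
    constraints = [W >> 0] + [cp.real(cp.trace(Hi @ W)) >= 1 for Hi in H]
    prob = cp.Problem(cp.Minimize(cp.real(cp.trace(W))), constraints)
    prob.solve()
    return W.value, prob.value        # prob.value is upsilon_sdp <= upsilon_qp

# usage: 8 receivers, 4 transmit antennas, complex Gaussian channels
rng = np.random.default_rng(9)
h_list = [(rng.standard_normal(4) + 1j * rng.standard_normal(4)) / np.sqrt(2)
          for _ in range(8)]
W_opt, val = multicast_sdp(h_list)
```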

44 Work and Hike

45 Work and Hike

46 Approximation Upper & Lower Bounds
Theorem 1 (LSTZ 07): υ_qp ≤ C υ_sdp, where
(1/(2π^2)) m^2 ≤ C ≤ (27/π) m^2 if H = R,
(1/(2(3.6π)^2)) m ≤ C ≤ 8m if H = C.

47 Numerical Experience
Simulation with randomly generated h_l (m = 8, n = 4) shows that both the mean and the maximum of the ratio υ_qp/υ_sdp are lower in the H = C case than in the H = R case.

48 Summary: SDP Relaxation for Nonconvex QP
Setup: A_i, Ā_i ⪰ 0, i = 0, 1, 2, ..., m; B_j indefinite, j = 0, 1, 2, ..., d. Approximation ratios by case, listed in the order (R, d = 0), (R, d = 1 or C, d = 0, 1), (R or C, d ≥ 2):
- min w^H A_0 w s.t. w^H A_i w ≥ 1, w^H B_j w ≥ 1:  Θ(m^2), Θ(m)
- max w^H B_0 w s.t. w^H A_i w ≤ 1, w^H B_j w ≤ 1:  Θ(log^{-1} m), Θ(log^{-1} m)
- max min_{1 ≤ i ≤ m} (w^H A_i w)/(w^H Ā_i w + σ^2) s.t. ||w||^2 ≤ P:  Θ(m^2), Θ(m), N.A.
References: NRT 99 (blue in the original slide); LSTZ 07, CLC 07, HLNZ 07, SYZ 08 (red).

49 The Optimization Work of Paul Tseng
- Paul's work is very broad in scope and has tremendous depth
- Only able to give a glimpse of Paul's overall contribution...
- Still several unpublished works, unfinished collaborations...
- Sadly missed by all...

50 Thank You! Questions/Comments?
