Algorithms for sparse optimization

1 Algorithms for sparse optimization. Michael P. Friedlander, University of California, Davis. Google, May 19, 2015.

2 Sparse representations. A signal y can be written as y = Hx (Haar basis), y = Dx (DCT basis), or y = [H D]x (Haar/DCT dictionary).

3 Sparsity via a transform. Represent a signal y as a superposition of elementary signal atoms: y = Σⱼ φⱼ xⱼ, i.e., y = Φx. Orthogonal bases yield a unique representation: x = Φᵀy. Overcomplete dictionaries, e.g., A = [Φ₁ Φ₂], have more flexibility, but the representation is not unique. One approach: minimize nnz(x) subject to Ax ≈ y.

4 Efficient transforms. [Figure: percentage of signal energy captured versus percentage of coefficients retained, for the Identity, Fourier, and Wavelet transforms.]

10 Prototypical workflow. Measure a structured signal y via linear measurements: bᵢ = ⟨fᵢ, y⟩, i = 1, ..., m, i.e., b = My. Decode the m measurements: find sparse x with MΦx ≈ b. Reconstruct the signal: ŷ := Φx. Guarantees bound the number of samples m needed to reconstruct the signal with high probability: compressed sensing (e.g., M Gaussian, Φ orthogonal): m = O(k log(n/k)); matrix completion: number of matrix samples O(n^{5/4} r log n).

11 Sparse optimization. Problem: find a sparse solution x: minimize nnz(x) subject to Ax ≈ b. Convex relaxations, e.g.: basis pursuit [Chen et al. (1998); Candès et al. (2006)]: minimize_{x ∈ Rⁿ} ‖x‖₁ subject to Ax ≈ b (m ≪ n); nuclear norm [Fazel (2002); Candès and Recht (2009)]: minimize_{X ∈ R^{n×p}} Σᵢ σᵢ(X) subject to 𝒜X ≈ b (m ≪ np).

12 APPLICATIONS and FORMULATIONS

13 Compressed sensing. Measurements ⟨φ₁, y⟩, ..., ⟨φₘ, y⟩, i.e., Φy with Φ Gaussian; composing Φ with the Haar/DCT dictionary gives b ≈ Ax with x sparse. Decode via: minimize ‖x‖₁ subject to Ax ≈ b.

14 Image deblurring. Observe b = My, where y is the image and M is the blurring operator. Recover the significant coefficients via: minimize ½‖MWx − b‖² + λ‖x‖₁.

λ          ‖Ax − b‖    nnz(x)
2.3e+0     2.6e+2      72
2.3e-2     3.0e+0      1,741
2.3e-4     3.1e-2      28,484

Sparco problem blurrycam [Figueiredo et al. (2007)].

15 Joint sparsity: source localization [Malioutov et al. (2005)]. Sensor measurements B = [b₁ b₂ ... b_p] are related to source arrival angles θ through a phase-shift gain matrix A; the columns x₁, ..., x_p share a common sparsity pattern: minimize ‖X‖₁,₂ subject to ‖B − AX‖_F ≤ σ.

16 Mass spectrometry: separating mixtures. The mixture spectrum (relative intensity versus m/z) is a nonnegative combination of the spectral fingerprints of the component molecules, which form the columns of A; the recovered weights give the percentage of each component (actual vs. recovered, with and without noise): minimize Σᵢ xᵢ subject to Ax ≈ b, x ≥ 0. [Du-Angeletti '06; van den Berg-Friedlander '11]

17 Phase retrieval. Observe a signal x ∈ Cⁿ through magnitudes: b_k = |⟨a_k, x⟩|², k = 1, ..., m. Lift: b_k = |⟨a_k, x⟩|² = ⟨a_k a_k*, xx*⟩ = ⟨a_k a_k*, X⟩ with X = xx*. Decode: find X such that ⟨a_k a_k*, X⟩ ≈ b_k, X ⪰ 0, rank(X) = 1. Convex relaxation: minimize_{X ∈ Hⁿ} tr(X) subject to ⟨a_k a_k*, X⟩ ≈ b_k, X ⪰ 0. [Chai et al. (2011); Candès et al. (2012); Waldspurger et al. (2015)]

18 Matrix completion. Observe random entries of an m×p matrix Y: b_k = Y_{i_k, j_k} = e_{i_k}ᵀ Y e_{j_k} = ⟨e_{i_k} e_{j_k}ᵀ, Y⟩, k = 1, ..., m. Decode: find an m×p matrix X such that ⟨e_{i_k} e_{j_k}ᵀ, X⟩ ≈ b_k with rank(X) minimal. Convex relaxation: minimize_{X ∈ R^{m×p}} Σᵢ σᵢ(X) subject to ⟨e_{i_k} e_{j_k}ᵀ, X⟩ ≈ b_k. [Fazel (2002); Candès and Recht (2009)]

19 CONVEX OPTIMIZATION

20 Convex optimization. minimize_{x ∈ C} f(x), where f : Rⁿ → R is a convex function and C ⊆ Rⁿ is a convex set. Essential objective: f̄ : Rⁿ → R̄ := R ∪ {+∞}, with f̄(x) = f(x) if x ∈ C and +∞ otherwise. There is no loss of generality in restricting ourselves to unconstrained problems of the form minimize_x f(x) + g(Ax), with f : Rⁿ → R̄, g : Rᵐ → R̄, and A an m×n matrix.

21 Convex functions. Definition: f : Rⁿ → R̄ is convex if its epigraph is convex, where epi f = { (x, τ) ∈ Rⁿ × R | f(x) ≤ τ }.

22 Examples. Indicator on a convex set: δ_C(x) = 0 if x ∈ C, +∞ otherwise; epi δ_C = C × R₊. Supremum of convex functions: f := sup_{j ∈ J} f_j with {f_j}_{j ∈ J} convex; epi f = ∩_{j ∈ J} epi f_j. Infimal convolution (epi-addition): (f □ g)(x) = inf_z { f(z) + g(x − z) }; epi(f □ g) = epi f + epi g.

23 Examples. Generic form: minimize_x f(x) + g(Ax). Basis pursuit: min_x ‖x‖₁ s.t. Ax = b, with f = ‖·‖₁, g = δ_{{b}}. Basis pursuit denoising: min_x ‖x‖₁ s.t. ‖Ax − b‖₂ ≤ σ, with f = ‖·‖₁, g = δ_{{z : ‖z − b‖₂ ≤ σ}}. Lasso: min_x ½‖Ax − b‖² s.t. ‖x‖₁ ≤ τ, with f = δ_{τB₁}, g = ½‖· − b‖².

24 DUALITY

25 Duality. minimize_x f(x) + g(Ax). Decouple the objective terms: minimize_{x,z} f(x) + g(z) subject to Ax = z. Lagrangian function: L(x, z, y) := f(x) + g(z) + ⟨y, Ax − z⟩ encapsulates all the information about the problem: sup_y L(x, z, y) = f(x) + g(z) if Ax = z, +∞ otherwise; p* = inf_{x,z} sup_y L(x, z, y).

26 Primal-dual pair: p* = inf_{x,z} sup_y L(x, z, y) (primal problem); d* = sup_y inf_{x,z} L(x, z, y) (dual problem). The following always holds: p* ≥ d* (weak duality). Under some constraint qualification, p* = d* (strong duality).

27 Fenchel dual. Dual problem: sup_y inf_{x,z} L(x, z, y) = sup_y inf_{x,z} { f(x) + g(z) + ⟨y, Ax − z⟩ } = sup_y { −sup_x [ ⟨−Aᵀy, x⟩ − f(x) ] − sup_z [ ⟨y, z⟩ − g(z) ] } = sup_y { −f*(−Aᵀy) − g*(y) }. Conjugate function: h*(u) = sup_x { ⟨u, x⟩ − h(x) }. Fenchel-dual pair (after the change of variable y ↦ −y): minimize_x f(x) + g(Ax) and minimize_y f*(Aᵀy) + g*(−y).
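
As a worked instance of this Fenchel pair (not on the slide), take basis pursuit: f = ‖·‖₁ and g = δ_{{b}}. The conjugates are f*(u) = δ_{B∞}(u), the indicator of the unit ∞-norm ball, and g*(y) = ⟨b, y⟩, so minimize_y f*(Aᵀy) + g*(−y) becomes the familiar dual

    maximize_y ⟨b, y⟩ subject to ‖Aᵀy‖∞ ≤ 1.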

28 PROXIMAL METHODS

29 Proximal algorithm. minimize_x f(x) (the case g ≡ 0). Proximal operator: prox_{αf}(x) := argmin_z { f(z) + (1/2α)‖z − x‖² } (the minimum value defines the envelope f □ (1/2α)‖·‖²). Optimality: every optimal solution x* satisfies x* = prox_{αf}(x*) for α > 0. Proximal iteration: x⁺ = prox_{αf}(x). [Martinet (1970)]

30 Proximal iteration: x_{k+1} = prox_{α_k f}(x_k) := argmin_z { f(z) + (1/2α_k)‖z − x_k‖² }. Descent: because x_{k+1} is the minimizing argument, f(x_{k+1}) + (1/2α_k)‖x_{k+1} − x_k‖² ≤ f(x) + (1/2α_k)‖x − x_k‖² for all x; setting x = x_k gives f(x_{k+1}) ≤ f(x_k) − (1/2α_k)‖x_{k+1} − x_k‖². Convergence: finite convergence for f polyhedral [Polyak and Tretyakov (1973)]; subproblems may be solved approximately [Rockafellar (1976)]; f(x_k) − p* = O(1/k) [Güler (1991)]; f(x_k) − p* = O(1/k²) via an accelerated variant [Güler (1992)].

31 Moreau-Yosida regularization. M-Y envelope: f_α(x) = min_z { f(z) + (1/2α)‖z − x‖² }; proximal map: prox_{αf}(x) = argmin_z { f(z) + (1/2α)‖z − x‖² }. Useful properties: differentiable: ∇f_α(x) = (1/α)[x − prox_{αf}(x)]; preserves minima: argmin f_α = argmin f; decomposition: x = prox_f(x) + prox_{f*}(x). Examples: vector 1-norm: prox_{α‖·‖₁}(x) = sgn(x) ⊙ max{|x| − α, 0}; Schatten 1-norm: prox_{α‖·‖₁}(X) = U Σ̄ Vᵀ, with Σ̄ = Diag[ prox_{α‖·‖₁}(σ(X)) ].
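
A minimal numpy sketch (not from the slides; the function names are mine) of the vector 1-norm example: the prox is componentwise soft-thresholding, and the envelope gradient follows the formula ∇f_α(x) = (1/α)[x − prox_{αf}(x)].

    import numpy as np

    def prox_l1(x, alpha):
        # prox of alpha*||.||_1: componentwise soft-thresholding at level alpha
        return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

    def grad_envelope_l1(x, alpha):
        # gradient of the Moreau-Yosida envelope of ||.||_1 with parameter alpha
        return (x - prox_l1(x, alpha)) / alpha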

32 Augmented Lagrangian (AL) algorithm. Rockafellar ('73) connects AL to the proximal method on the dual. minimize f(x) subject to Ax = b, i.e., f(x) + δ_{{b}}(Ax). Lagrangian: L(x, y) = f(x) + yᵀ(Ax − b). Augmented Lagrangian: L_ρ(x, y) = f(x) + yᵀ(Ax − b) + (ρ/2)‖Ax − b‖². AL algorithm: x_{k+1} = argmin_x L_ρ(x, y_k) [may be inexact]; y_{k+1} = y_k + ρ(Ax_{k+1} − b) [multiplier update]. For linear constraints, the AL algorithm ≡ Bregman iteration; used by L1-Bregman for f(x) = ‖x‖₁ [Yin et al. (2008)]; proposed by Hestenes/Powell ('69) for smooth nonlinear optimization.
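
A schematic numpy sketch of the multiplier-update loop (not from the slides; it assumes a user-supplied routine argmin_x_of_AL that approximately minimizes L_ρ(·, y_k) over x, since that inner solve is problem-specific):

    import numpy as np

    def augmented_lagrangian(argmin_x_of_AL, A, b, rho=1.0, iters=50):
        # argmin_x_of_AL(y) should (approximately) minimize over x:
        #   L_rho(x, y) = f(x) + y @ (A @ x - b) + (rho/2) * ||A @ x - b||^2
        y = np.zeros(A.shape[0])
        x = None
        for _ in range(iters):
            x = argmin_x_of_AL(y)          # primal step (may be inexact)
            y = y + rho * (A @ x - b)      # multiplier update
        return x, y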

33 SPLITTING METHODS projected gradient, proximal-gradient, iterative soft-thresholding, alternating projections

34 Proximal gradient. minimize_x f(x) + g(x), with f smooth. Algorithm: x⁺ = prox_{αg}(x − α∇f(x)), with α ∈ (µ, 2/L), µ > 0, where ∇f is Lipschitz: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖. Convergence: f(x_k) − p* = O(1/k) with constant step size (α ≡ 1/L); f(x_k) − p* = O(1/k²) for an accelerated variant [Beck and Teboulle (2009)]; f(x_k) − p* = O(γᵏ), γ < 1, under stronger assumptions.
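
A minimal numpy sketch of the iteration (not from the slides; grad_f and prox_g are assumed callables supplied by the user):

    import numpy as np

    def proximal_gradient(grad_f, prox_g, x0, alpha, iters=500):
        # x+ = prox_{alpha*g}( x - alpha * grad f(x) ), with alpha roughly 1/L
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(iters):
            x = prox_g(x - alpha * grad_f(x), alpha)
        return x

With grad_f(x) = Aᵀ(Ax − b) and prox_g the soft-thresholding operator, this reduces to the iterative soft-thresholding method a few slides ahead.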

35 Interpretations. Majorization-minimization: the Lipschitz assumption on ∇f implies f(z) ≤ f(x) + ⟨∇f(x), z − x⟩ + (L/2)‖z − x‖² for all x, z. The proximal-gradient iteration minimizes the majorant: x⁺ = prox_{αg}(x − α∇f(x)) = argmin_z { g(z) + f(x) + ⟨∇f(x), z − x⟩ + (1/2α)‖z − x‖² }, α ∈ (0, 1/L]. Forward-backward splitting: x⁺ = prox_{αg}(x − α∇f(x)) = (I + α∂g)⁻¹ (I − α∇f)(x), where the gradient step (I − α∇f) is the forward step and the resolvent (I + α∂g)⁻¹ is the backward step.

36 Projected gradient. minimize f(x) subject to x ∈ C: x_{k+1} = proj_C(x_k − α_k ∇f(x_k)). LASSO: minimize_{‖x‖₁ ≤ τ} ½‖Ax − b‖², via f(x) = ½‖Ax − b‖², g(x) = δ_{τB₁}(x); proj_{τB₁}(·) can be computed in O(n) (randomized) [Duchi et al. (2008)]; used by SPGL1 for Lasso subproblems [van den Berg and Friedlander (2008)]. BPDN (Lagrangian form): minimize_{x ∈ Rⁿ} ½‖Ax − b‖² + λ‖x‖₁, via minimize_{x̄ ∈ R²ⁿ₊} ½‖Āx̄ − b‖² + λeᵀx̄, where proj_{≥0}(·) can be computed in O(n); GPSR uses Barzilai-Borwein steps α_k [Figueiredo-Nowak-Wright '07].
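
A sort-based O(n log n) sketch of the projection onto the ℓ₁ ball {x : ‖x‖₁ ≤ τ} (the slide cites an O(n) randomized method of Duchi et al.; this simpler variant is not from the slides):

    import numpy as np

    def proj_l1_ball(x, tau):
        # Euclidean projection onto { z : ||z||_1 <= tau }, tau > 0
        if np.sum(np.abs(x)) <= tau:
            return x.copy()
        u = np.sort(np.abs(x))[::-1]                      # magnitudes, descending
        css = np.cumsum(u)
        j = np.arange(1, len(u) + 1)
        rho = np.max(np.nonzero(u * j > css - tau)[0])    # last index where the threshold is active
        theta = (css[rho] - tau) / (rho + 1.0)
        return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

Projected gradient for the Lasso then iterates x ← proj_l1_ball(x − α Aᵀ(Ax − b), τ).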

37 Iterative soft-thresholding (IST). minimize_x ½‖Ax − b‖² + λ‖x‖₁: x⁺ = prox_{αλ‖·‖₁}(x − αAᵀr) with r := Ax − b, i.e., x⁺ = S_{αλ}(x − αAᵀr), where α ∈ (0, 2/‖A‖²) and S_τ(x) = sgn(x) max{|x| − τ, 0}. [a.k.a. iterative shrinkage, iterative Landweber] Derived from various viewpoints: expectation-maximization [Figueiredo & Nowak '03]; surrogate functionals [Daubechies et al. '04]; forward-backward splitting [Combettes & Wajs '05]; FISTA (accelerated variant) [Beck-Teboulle '09].
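
A compact numpy sketch of the IST loop as stated (not from the slides; the fixed step 1/‖A‖² lies in the admissible range):

    import numpy as np

    def ist(A, b, lam, iters=1000):
        # iterative soft-thresholding for  0.5*||Ax - b||^2 + lam*||x||_1
        alpha = 1.0 / np.linalg.norm(A, 2) ** 2        # step in (0, 2/||A||^2)
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            z = x - alpha * (A.T @ (A @ x - b))        # gradient step on the smooth term
            x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)   # S_{alpha*lam}
        return x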

38 Low-rank approximation via the nuclear norm: minimize ½‖𝒜X − b‖² + λ‖X‖_n, with ‖X‖_n = Σⱼ σⱼ(X). Singular-value soft-thresholding algorithm: X⁺ = S_{αλ}(X − α𝒜*(r)) with r := 𝒜X − b, where S_λ(Z) = U diag(σ̄)Vᵀ and σ̄ = S_λ(σ(Z)). Computing S_λ: no need to compute the full SVD of Z, only the leading singular values/vectors; Monte-Carlo approach [Ma-Goldfarb-Chen '08]; Lanczos bidiagonalization (via PROPACK) [Cai-Candès-Shen '08; Jain et al. '10].
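
A dense-SVD sketch of the singular-value soft-thresholding operator S_λ (not from the slides; at scale one would compute only the leading singular triples, as noted above):

    import numpy as np

    def svt(Z, lam):
        # S_lam(Z) = U diag(max(sigma - lam, 0)) V^T
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        return (U * np.maximum(s - lam, 0.0)) @ Vt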

39 Typical behavior. minimize_{X ∈ R^{n₁×n₂}} ‖X‖₁ subject to X_Ω ≈ b; recover a matrix X* with rank(X*) = 4. [Figure: number of singular triples computed at each iteration.]

40 Backward-backward splitting. Approximate one of the functions via its (smooth) Moreau envelope: minimize_x f_α(x) + g(x) (both f and g nonsmooth), with f_α = f □ (1/2α)‖·‖² and ∇f_α(x) = (1/α)[x − prox_{αf}(x)]. Proximal gradient: x⁺ = prox_{αg}(x − α∇f_α(x)) = prox_{αg}(prox_{αf}(x)). Projection onto convex sets (POCS): with f = δ_C and g = δ_D, x⁺ = proj_D(proj_C(x)), which solves minimize_{x,z} ½‖z − x‖² subject to x ∈ D, z ∈ C.
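
A generic sketch of the backward-backward/POCS iteration (not from the slides; proj_C and proj_D are assumed callables):

    import numpy as np

    def pocs(proj_C, proj_D, x0, iters=200):
        # alternating projections: x+ = proj_D(proj_C(x))
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(iters):
            x = proj_D(proj_C(x))
        return x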

41 Alternating direction method of multipliers (ADMM). minimize_x f(x) + g(x). ADMM algorithm: x⁺ = prox_{αf}(z − u); z⁺ = prox_{αg}(x⁺ + u); u⁺ = u + x⁺ − z⁺. Used by YALL1 [Zhang et al. (2011)]. The ADMM algorithm ≡ Douglas-Rachford splitting, and ≡ split Bregman for linear constraints; see Esser's UCLA PhD thesis ('10) and the survey by Boyd et al. ('11).
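
A generic numpy sketch of the three-step iteration exactly as written (not from the slides; prox_f and prox_g are assumed callables taking a point and the scaling α):

    import numpy as np

    def admm(prox_f, prox_g, x0, alpha, iters=500):
        # ADMM / Douglas-Rachford iteration for  minimize f(x) + g(x)
        x = np.asarray(x0, dtype=float).copy()
        z = x.copy()
        u = np.zeros_like(x)
        for _ in range(iters):
            x = prox_f(z - u, alpha)       # x+ = prox_{alpha f}(z - u)
            z = prox_g(x + u, alpha)       # z+ = prox_{alpha g}(x+ + u)
            u = u + x - z                  # scaled dual update
        return z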

42 PROXIMAL SMOOTHING

43 Primal smoothing. minimize_x f(x) + g(Ax). Partial smoothing via the Moreau envelope: minimize_x f_µ(x) + g(Ax), with f_µ = f □ (1/2µ)‖·‖². Example (basis pursuit): f = ‖·‖₁, g = δ_{{b}}, i.e., minimize_x ‖x‖₁ subject to Ax = b. The smoothed function is the Huber function with parameter µ, applied componentwise: f_µ(x) = x²/2µ if |x| ≤ µ, and |x| − µ/2 otherwise; as µ → 0, x_µ → x*.
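
A one-line numpy sketch of the componentwise Huber smoothing of the 1-norm (not from the slides):

    import numpy as np

    def huber_l1(x, mu):
        # Moreau envelope of |.| with parameter mu, summed over components:
        # quadratic t^2/(2*mu) for |t| <= mu, linear |t| - mu/2 otherwise
        a = np.abs(x)
        return np.sum(np.where(a <= mu, a ** 2 / (2.0 * mu), a - mu / 2.0))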

44 Dual smoothing. Regularize the primal problem: minimize_x f(x) + (µ/2)‖x‖² + g(Ax). Dualize: regularizing f smooths the conjugate: (f + (µ/2)‖·‖²)* = f* □ (1/2µ)‖·‖² = (f*)_µ, giving minimize_y (f*)_µ(Aᵀy) + g*(−y). Example (basis pursuit): f = ‖·‖₁, g = δ_{{b}}, i.e., min_x ‖x‖₁ + (µ/2)‖x‖² subject to Ax = b; (f*)_µ = (1/2µ) dist²_{B∞}, so the dual is max_y ⟨b, y⟩ − (1/2µ) dist²_{B∞}(Aᵀy). Exact regularization: x_µ = x* for all µ ∈ [0, µ̄), some µ̄ > 0 [Mangasarian and Meyer (1979); Friedlander and Tseng (2007)]. [Nesterov (2005); Beck and Teboulle (2012)]; TFOCS [Becker et al. (2011)].
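
A small numpy sketch (not from the slides) of the smoothed basis-pursuit dual objective, using the fact that the distance to the unit ∞-norm ball is computed by clipping:

    import numpy as np

    def smoothed_bp_dual(y, A, b, mu):
        # <b, y> - (1/(2*mu)) * dist^2_{B_inf}(A^T y), to be maximized over y
        w = A.T @ y
        dist2 = np.sum((w - np.clip(w, -1.0, 1.0)) ** 2)
        return b @ y - dist2 / (2.0 * mu)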

45 CONCLUSION

46 Conclusion. Much left unsaid, e.g.: optimal first-order methods [Nesterov (1988, 2005)]; block coordinate descent [Sardy et al. (2004)]; active-set methods (e.g., LARS & Homotopy) [Osborne et al. (2000)]. My own work: avoiding proximal operators [Friedlander et al. (2014) (SIOPT)]; greedy coordinate descent [Nutini et al. (2015) (ICML)]. web: mpf

47 REFERENCES

48 References I
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 2009.
A. Beck and M. Teboulle. Smoothing and first order methods: a unified framework. SIAM J. Optim., 22(2), 2012.
S. Becker, E. Candès, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comp., 3, 2011.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4).
E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9, December 2009.

49 References II
E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8), 2006.
E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pur. Appl. Ana., 2012.
A. Chai, M. Moscoso, and G. Papanicolaou. Array imaging using intensity-only measurements. Inverse Problems, 27(1):015005, 2011.
S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33-61, 1998.
P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul., 4(4), 2005.
I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57, 2004.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.

50 References III
J. E. Esser. Primal Dual Algorithms for Convex Models and Applications to Image Restoration, Registration and Nonlocal Inpainting. PhD thesis, University of California, Los Angeles, 2010.
M. Fazel. Matrix rank minimization with applications. PhD thesis, Elec. Eng. Dept., Stanford University, 2002.
M. Figueiredo, R. Nowak, and S. J. Wright. Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process., 1(4), 2007.
M. A. T. Figueiredo and R. D. Nowak. An EM algorithm for wavelet-based image restoration. IEEE Trans. Image Process., 12(8), August 2003.
M. P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM J. Optim., 18(4), 2007.
M. P. Friedlander, I. Macêdo, and T. K. Pong. Gauge optimization and duality. SIAM J. Optim., 24(4), 2014.
O. Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2), 1991.

51 References IV
O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4), 1992.
M. R. Hestenes. Multiplier and gradient methods. J. Optim. Theory Appl., 4(5), 1969.
S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program., 128.
D. Malioutov, M. Çetin, and A. S. Willsky. A sparse signal reconstruction perspective for source localization with sensor arrays. IEEE Trans. Sig. Proc., 53(8), August 2005.
O. L. Mangasarian and R. R. Meyer. Nonlinear perturbation of linear programs. SIAM J. Control Optim., 17(6), November 1979.
B. Martinet. Régularisation d'inéquations variationnelles par approximations successives. Rev. Française Informat. Recherche Opérationnelle, 4 (Sér. R-3):154-158, 1970.
Y. Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i Mat. Metody, 24, 1988.

52 References V
Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103, 2005.
P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013.
J. Nutini, I. Laradji, M. Schmidt, and M. Friedlander. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Inter. Conf. Mach. Learning, 2015.
M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3), 2000.
B. Polyak and N. Tretyakov. The method of penalty bounds for constrained extremum problems. Zh. Vych. Mat. i Mat. Fiz., 13:34-46, 1973.
M. J. D. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, chapter 19. Academic Press, London and New York, 1969.

53 References VI
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1(2):97-116, 1976.
S. Sardy, A. Antoniadis, and P. Tseng. Automatic smoothing with wavelets for a wide class of distributions. J. Comput. Graph. Stat., 13, 2004.
E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput., 31(2), 2008.
I. Waldspurger, A. d'Aspremont, and S. Mallat. Phase recovery, MaxCut and complex semidefinite programming. Math. Program., 149(1-2):47-81, 2015.
W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for l1 minimization with applications to compressed sensing. SIAM J. Imag. Sci., 1, 2008.
Y. Zhang, W. Deng, J. Yang, and W. Yin. YALL1, 2011.
