Sparse Optimization Lecture: Dual Methods, Part I

Sparse Optimization Lecture: Dual Methods, Part I
Instructor: Wotao Yin, July 2013 (online discussions on piazza.com)

Those who complete this lecture will know
- the dual (sub)gradient iteration
- the augmented $\ell_1$ iteration (linearized Bregman iteration)
- dual smoothing minimization
- the augmented Lagrangian iteration
- the Bregman iteration and the addback iteration

1 / 39

Review

The last two lectures
- studied explicit and implicit (proximal) gradient updates
- derived the Lagrange dual problem
- overviewed the following dual methods:
  1. the dual (sub)gradient method (a.k.a. Uzawa's method)
  2. the dual proximal method (a.k.a. the augmented Lagrangian method (ALM))
  3. operator splitting methods applied to the dual of $\min_{x,z}\{f(x) + g(z) : Ax + Bz = b\}$

The operator splitting methods studied include
- forward-backward splitting
- Peaceman-Rachford splitting
- Douglas-Rachford splitting (giving rise to ADM or ADMM)

This lecture studies these dual methods in more detail and presents their applications to sparse optimization models.

2 / 39

About sparse optimization

Throughout the lecture, keep in mind that sparse optimization models typically have one or more nonsmooth yet simple part(s) and one or more smooth parts with dense data.

Common preferences:
- for the nonsmooth and simple part, proximal operations are preferred over subgradient descent
- if smoothing is applied, exact smoothing is preferred over inexact smoothing
- for the smooth part with dense data, the simple (gradient) operator is preferred over more complicated operators
- when applying divide-and-conquer, fewer divisions and simpler subproblems are preferred

In general, the dual methods appear to be more versatile than the primal-only methods (e.g., (sub)gradient and prox-linear methods).

3 / 39

Dual (sub)gradient ascent

Primal problem: $\min_x f(x)$, s.t. $Ax = b$.

Lagrangian relaxation: $L(x; y) = f(x) - y^T(Ax - b)$

Lagrangian dual problem: $\min_y -d(y)$, or equivalently $\max_y d(y)$, where $d(y) = \min_x L(x; y)$.

If $d$ is differentiable, you can apply $y^{k+1} \gets y^k + c_k \nabla d(y^k)$; otherwise, apply $y^{k+1} \gets y^k + c_k g$, where $g \in \partial d(y^k)$.

4 / 39

Dual (sub)gradient ascent

Derive $\nabla d$ or $\partial d$ by hand, or use $L(x; y^k)$: compute $\bar{x} \in \arg\min_x L(x; y^k)$; then $b - A\bar{x} \in \partial d(y^k)$.

Iteration: $x^{k+1} \in \arg\min_x L(x; y^k)$, $y^{k+1} = y^k + c_k(b - Ax^{k+1})$.

Application: augmented $\ell_1$ minimization, a.k.a. linearized Bregman.

5 / 39
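
A minimal Python sketch of the iteration above (not part of the original slides); `argmin_L(y)` is an assumed user-supplied oracle that minimizes the Lagrangian $L(x; y)$ over $x$ for a fixed $y$, and the constant step size `c` stands in for $c_k$:

```python
import numpy as np

def dual_subgradient_ascent(argmin_L, A, b, c=1.0, iters=200):
    """Dual (sub)gradient ascent for min f(x) s.t. Ax = b, where
    argmin_L(y) returns some x in argmin_x L(x; y)."""
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        x = argmin_L(y)            # x^{k+1} in argmin_x L(x; y^k)
        y = y + c * (b - A @ x)    # ascent step: b - Ax^{k+1} in ∂d(y^k)
    return x, y
```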

Augmented $\ell_1$ minimization

Augment $\ell_1$ by $\ell_2^2$:

(L1+LS)  $\min_x \|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2$  s.t. $Ax = b$

- the primal objective becomes strongly convex (but still non-differentiable)
- hence, its dual is unconstrained and differentiable
- a sufficiently large but finite $\alpha$ leads to the exact $\ell_1$ solution
- related to: the linearized Bregman algorithm, the elastic net model

[Figure: a test with Gaussian $A$ and sparse $x$ with Gaussian entries, comparing the solutions of $\min\{\|x\|_1 : Ax = b\}$, $\min\{\|x\|_2^2 : Ax = b\}$, and $\min\{\|x\|_1 + \frac{1}{25}\|x\|_2^2 : Ax = b\}$; the last is exactly the same as the $\ell_1$ solution.]

6 / 39

Lagrangian dual of (L1+LS)

Theorem (Convex Analysis, Rockafellar [1970])
If a convex program has a strictly convex objective, it has a unique solution and its Lagrangian dual program is differentiable.

Lagrangian (separable in $x$ for fixed $y$):

$L(x; y) = \|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2 - y^T(Ax - b)$

Lagrange dual problem:

$\min_y\ -d(y) = -b^T y + \frac{\alpha}{2}\,\|A^T y - \mathrm{Proj}_{[-1,1]^n}(A^T y)\|_2^2$

Note: $\mathrm{shrink}(x, \gamma) = \max\{|x| - \gamma, 0\}\,\mathrm{sign}(x) = x - \mathrm{Proj}_{[-\gamma,\gamma]}(x)$.

Objective gradient: $\nabla(-d)(y) = -b + \alpha A\,\mathrm{shrink}(A^T y)$

Dual gradient iteration:

$x^{k+1} = \alpha\,\mathrm{shrink}(A^T y^k)$,  $y^{k+1} = y^k + c_k(b - Ax^{k+1})$.

7 / 39
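
As a concrete instance, a sketch of the dual gradient iteration above, i.e., the linearized Bregman iteration. The default step size is an assumption based on the dual gradient's Lipschitz constant $\alpha\|A\|^2$, not a rule from the slides:

```python
import numpy as np

def shrink(z, gamma=1.0):
    """shrink(z, gamma) = max(|z| - gamma, 0) * sign(z)."""
    return np.maximum(np.abs(z) - gamma, 0.0) * np.sign(z)

def augmented_l1(A, b, alpha, c=None, iters=1000):
    """Dual gradient iteration for (L1+LS) (linearized Bregman)."""
    if c is None:
        c = 1.9 / (alpha * np.linalg.norm(A, 2) ** 2)  # < 2/Lipschitz
    y = np.zeros(A.shape[0])
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = alpha * shrink(A.T @ y)   # x^{k+1} = alpha * shrink(A^T y^k)
        y = y + c * (b - A @ x)       # y^{k+1} = y^k + c_k (b - A x^{k+1})
    return x
```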

How to choose $\alpha$

Exact smoothing: there is a finite $\alpha_0$ so that every $\alpha > \alpha_0$ leads to the $\ell_1$ solution.

In practice, $\alpha = 10\|x_{\mathrm{sol}}\|_\infty$ suffices (Lai and Yin [2012]), with recovery guarantees under RIP, NSP, and other conditions.

Although every $\alpha > \alpha_0$ leads to the same and unique primal solution $x^*$, the dual solution set $Y^*$ contains multiple points and depends on $\alpha$.

Dynamically adjusting $\alpha$ may not be a good idea for the dual algorithms.

8 / 39

Exact regularization

Theorem (Friedlander and Tseng [2007], Yin [2010])
There exists a finite $\alpha_0 > 0$ such that whenever $\alpha > \alpha_0$, the solution to

(L1+LS)  $\min_x \|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2$  s.t. $Ax = b$

is also a solution to

(L1)  $\min_x \|x\|_1$  s.t. $Ax = b$.

[Figure: comparison of the L1 and L1+LS solutions.]

9 / 39

L1+LS in compressive sensing

If, in some scenario, L1 gives exact or stable recovery provided that

#measurements $m \ge C \cdot F(\text{signal dim } n,\ \text{signal sparsity } k)$,

then, after adding $\frac{1}{2\alpha}\|x\|_2^2$, the condition becomes

#measurements $m \ge \left(C + O\!\left(\frac{1}{2\alpha}\right)\right) \cdot F(\text{signal dim } n,\ \text{signal sparsity } k)$.

Theorem (exact recovery, Lai and Yin [2012])
Under the assumptions
1. $x^0$ is $k$-sparse, and $A$ satisfies RIP with $\delta_{2k} \le 0.4404$, and
2. $\alpha \ge 10\|x^0\|_\infty$,
(L1+LS) uniquely recovers $x^0$.

The bound on $\delta_{2k}$ is tighter than that for $\ell_1$, and it depends on the bound on $\alpha$.

10 / 39

Stable recovery

For approximately sparse signals and/or noisy measurements, consider:

$\min_x \left\{\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2 : \|Ax - b\|_2 \le \sigma\right\}$  (1)

Theorem (stable recovery, Lai and Yin [2012])
Let $x^0$ be an arbitrary vector, $S = \{\text{indices of the largest } k \text{ entries of } x^0\}$, and $Z = S^C$. Let $b := Ax^0 + n$, where $n$ is arbitrary noise. If $A$ satisfies RIP with $\delta_{2k} \le 0.3814$ and $\alpha \ge 10\|x^0\|_\infty$, then the solution $x^*$ of (1) with $\sigma = \|n\|_2$ satisfies

$\|x^* - x^0\|_2 \le \bar{C}_1 \|n\|_2 + \bar{C}_2 \|x^0_Z\|_1/\sqrt{k}$,

where $\bar{C}_1$ and $\bar{C}_2$ are constants depending on $\delta_{2k}$.

11 / 39

Implementation

Since the dual is $C^1$ and unconstrained, various first-order techniques apply:
- accelerated gradient descent (Nesterov [1983])
- Barzilai-Borwein step sizes (Barzilai and Borwein [1988]) / (non-monotone) line search (Zhang and Hager [2004])
- quasi-Newton methods (with some caution, since the dual is not $C^2$)

Fits the dual-decomposition framework; easy to parallelize (later lecture).

Results generalize to $\ell_1$-like functions.

Matlab codes and demos at www.caam.rice.edu/~optimization/linearized_bregman/

12 / 39
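
A sketch of the Barzilai-Borwein option listed above, applied to the same dual problem; the safeguard-free BB step and the conservative initial step are illustrative assumptions:

```python
import numpy as np

def augmented_l1_bb(A, b, alpha, iters=300):
    """Dual gradient iteration for (L1+LS) with a BB step size."""
    y = np.zeros(A.shape[0])
    x = np.zeros(A.shape[1])
    c = 1.0 / (alpha * np.linalg.norm(A, 2) ** 2)  # conservative first step
    y_old = g_old = None
    for _ in range(iters):
        z = A.T @ y
        x = alpha * np.maximum(np.abs(z) - 1, 0) * np.sign(z)  # alpha*shrink
        g = A @ x - b                  # gradient of -d at y^k
        if y_old is not None:
            s, t = y - y_old, g - g_old
            if t @ t > 0:
                c = (s @ t) / (t @ t)  # BB step <s,t>/<t,t>
        y_old, g_old = y, g
        y = y - c * g                  # descent on -d, i.e., ascent on d
    return x
```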

Review: convex conjugate

Recall the convex conjugate (the Legendre transform):

$f^*(y) = \sup_{x \in \mathrm{dom}\,f} \{y^T x - f(x)\}$

- $f^*$ is convex, since it is a point-wise maximum of linear functions
- if $f$ is proper, closed, and convex, then $(f^*)^* = f$, i.e.,

$f(x) = \sup_{y \in \mathrm{dom}\,f^*} \{y^T x - f^*(y)\}$.

Examples:
- $f(x) = \iota_C(x)$, an indicator function, and $f^*(y) = \sup_{x \in C} y^T x$, its support function
- $f(x) = \iota_{\{-1 \le x \le 1\}}(x)$ and $f^*(y) = \|y\|_1$
- $f(x) = \iota_{\{\|x\|_2 \le 1\}}(x)$ and $f^*(y) = \|y\|_2$
- lots of smooth examples...

13 / 39

Review: convex conjugate

One can introduce an alternative representation via convex conjugacy:

$f(x) = \sup_{y \in \mathrm{dom}\,h^*} \{y^T(Ax + b) - h^*(y)\} = h(Ax + b)$.

Example: let $C = \{y = [y_1; y_2] : y_1 + y_2 = 1,\ y_1, y_2 \ge 0\}$ and

$A = \begin{bmatrix} I \\ -I \end{bmatrix}$.

Since $\|x\|_1 = \sup_y \{(y_1 - y_2)^T x - \iota_C(y)\} = \sup_y \{y^T Ax - \iota_C(y)\}$, we have $\|x\|_1 = \iota_C^*(Ax)$, where $\iota_C^*([x_1; x_2])$ takes the entry-wise $\max\{x_1, x_2\}$ (summed over the entries).

14 / 39

Dual smoothing

Idea: strongly convexify $h^*$, so that $f$ becomes differentiable and has Lipschitz $\nabla f$.

Represent $f$ using $h^*$:

$f(x) = \sup_{y \in \mathrm{dom}\,h^*} \{y^T(Ax + b) - h^*(y)\}$

Strongly convexify $h^*$ by adding a strongly convex function $\mu d$:

$\hat{h}^*(y) = h^*(y) + \mu d(y)$

Obtain a differentiable approximation:

$f_\mu(x) = \sup_{y \in \mathrm{dom}\,h^*} \{y^T(Ax + b) - \hat{h}^*(y)\}$

$f_\mu(x)$ is differentiable since $h^*(y) + \mu d(y)$ is strongly convex.

15 / 39

Example: augmented $\ell_1$

Primal problem: $\min\{\|x\|_1 : Ax = b\}$

Dual problem: $\max_y \{b^T y - \iota_{[-1,1]^n}(A^T y)\}$

$f(y) = \iota_{[-1,1]^n}(y)$ is non-differentiable.

Let $f^*(x) = \|x\|_1$ and represent $f(y) = \sup_x \{y^T x - f^*(x)\}$.

Add $\frac{\mu}{2}\|x\|_2^2$ to $f^*(x)$ and obtain

$f_\mu(y) = \sup_x \left\{y^T x - \left(\|x\|_1 + \frac{\mu}{2}\|x\|_2^2\right)\right\} = \frac{1}{2\mu}\|y - \mathrm{Proj}_{[-1,1]^n}(y)\|_2^2$

$f_\mu(y)$ is differentiable; $\nabla f_\mu(y) = \frac{1}{\mu}\mathrm{shrink}(y)$.

On the other hand, we can also smooth $f^*(x) = \|x\|_1$ and obtain a differentiable $f^*_\mu(x)$ by adding $\mu d(y)$ to $f(y)$ (see the next slide...).

16 / 39
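
A short sketch of this smoothed dual term and its gradient (the function name is mine):

```python
import numpy as np

def f_mu(y, mu):
    """f_mu(y) = (1/(2 mu)) ||y - Proj_[-1,1]^n(y)||_2^2 and its
    gradient (1/mu) shrink(y, 1)."""
    s = y - np.clip(y, -1.0, 1.0)         # shrink(y, 1)
    return (s @ s) / (2 * mu), s / mu     # (value, gradient)
```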

Example: smoothed absolute value

Recall $f^*(x) = |x| = \sup_y \{yx - \iota_{[-1,1]}(y)\}$.

Let $d(y) = y^2/2$:

$f^*_\mu(x) = \sup_y \{yx - (\iota_{[-1,1]}(y) + \mu y^2/2)\} = \begin{cases} x^2/(2\mu), & |x| \le \mu, \\ |x| - \mu/2, & |x| > \mu, \end{cases}$

which is the Huber function.

Let $d(y) = 1 - \sqrt{1 - y^2}$:

$f^*_\mu(x) = \sup_y \{yx - (\iota_{[-1,1]}(y) + \mu(1 - \sqrt{1 - y^2}))\} = \sqrt{x^2 + \mu^2} - \mu$.

Recall $|x| = \sup_y \{(y_1 - y_2)x - \iota_C(y)\}$ for $C = \{y : y_1 + y_2 = 1,\ y_1, y_2 \ge 0\}$. Let $d(y) = y_1\log y_1 + y_2\log y_2 + \log 2$:

$f^*_\mu(x) = \sup_y \{(y_1 - y_2)x - (\iota_C(y) + \mu d(y))\} = \mu\log\frac{e^{x/\mu} + e^{-x/\mu}}{2}$.

17 / 39
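
For reference, a sketch of the three smoothed absolute values in Python (function names are mine; `np.logaddexp` gives a numerically stable log-sum-exp):

```python
import numpy as np

def huber(x, mu):
    """d(y) = y^2/2: the Huber function."""
    return np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)

def sqrt_smooth(x, mu):
    """d(y) = 1 - sqrt(1 - y^2): gives sqrt(x^2 + mu^2) - mu."""
    return np.sqrt(x**2 + mu**2) - mu

def logexp_smooth(x, mu):
    """Entropy prox-function: mu * log((e^{x/mu} + e^{-x/mu}) / 2)."""
    return mu * (np.logaddexp(x / mu, -x / mu) - np.log(2.0))
```

All three tend to $|x|$ as $\mu \to 0$, with different curvature near $0$.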

Compare three smoothed functions

[Figure: the three smoothed absolute-value functions plotted together. Courtesy of L. Vandenberghe.]

18 / 39

Example: smoothed maximum eigenvalue

Let $C = \{Y \in S^n : \mathrm{tr}\,Y = 1,\ Y \succeq 0\}$. For $X \in S^n$,

$f(X) = \lambda_{\max}(X) = \sup_Y \{\langle Y, X\rangle - \iota_C(Y)\}$

Negative entropy of $\{\lambda_i(Y)\}$:

$d(Y) = \sum_{i=1}^n \lambda_i(Y)\log\lambda_i(Y) + \log n$

Smoothed function (courtesy of L. Vandenberghe):

$f_\mu(X) = \sup_Y \{\langle Y, X\rangle - (\iota_C(Y) + \mu d(Y))\} = \mu\log\left(\sum_{i=1}^n e^{\lambda_i(X)/\mu}\right) - \mu\log n$

19 / 39

Application: smoothed minimization (Nesterov [2005])

Instead of solving $\min_x f(x)$, solve

$\min_x f_\mu(x) = \sup_{y \in \mathrm{dom}\,h^*} \{y^T(Ax + b) - [h^*(y) + \mu d(y)]\}$

by gradient descent, with acceleration, line search, etc.

The gradient is given by $\nabla f_\mu(x) = A^T\bar{y}$, where

$\bar{y} = \arg\max_{y \in \mathrm{dom}\,h^*} \{y^T(Ax + b) - [h^*(y) + \mu d(y)]\}$.

If $d(y)$ is strongly convex with modulus $\nu > 0$, then $h^*(y) + \mu d(y)$ is strongly convex with modulus at least $\mu\nu$, and $\nabla f_\mu(x)$ is Lipschitz continuous with constant no more than $\|A\|^2/(\mu\nu)$.

Error control by bounding $f_\mu(x) - f(x)$ or $\|x^*_\mu - x^*\|$.

20 / 39
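
A sketch of this scheme for $f(x) = \|Ax + b\|_1$ smoothed with $d(y) = \|y\|_2^2/2$ (so $\nu = 1$ and each component of $f_\mu$ is the Huber function from slide 17); the constant step $1/L$ with $L = \|A\|^2/\mu$ follows the bound above:

```python
import numpy as np

def smoothed_l1_min(A, b, mu, iters=500):
    """Gradient descent on f_mu(x) = sum_i Huber((Ax + b)_i)."""
    L = np.linalg.norm(A, 2) ** 2 / mu     # Lipschitz bound ||A||^2/(mu nu)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        ybar = np.clip((A @ x + b) / mu, -1.0, 1.0)  # maximizer over [-1,1]^m
        x = x - (1.0 / L) * (A.T @ ybar)             # grad f_mu(x) = A^T ybar
    return x
```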

Augmented Lagrangian (a.k.a. method of multipliers)

Augment $L(x; y^k) = f(x) - (y^k)^T(Ax - b)$ by adding $\frac{c}{2}\|Ax - b\|_2^2$.

Augmented Lagrangian:

$L_A(x; y^k) = f(x) - (y^k)^T(Ax - b) + \frac{c}{2}\|Ax - b\|_2^2$

Iteration:

$x^{k+1} = \arg\min_x L_A(x; y^k)$
$y^{k+1} = y^k + c(b - Ax^{k+1})$

starting from $k = 0$ and $y^0 = 0$; $c > 0$ can change from iteration to iteration.

The objective of the first step is convex in $x$ if $f(\cdot)$ is convex, and it is linear in $y$.

Equivalent to the dual implicit (proximal) iteration.

21 / 39
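
A minimal sketch of the iteration, assuming an oracle `argmin_LA(y, c)` for the $x$-subproblem:

```python
import numpy as np

def method_of_multipliers(argmin_LA, A, b, c=1.0, iters=100):
    """ALM for min f(x) s.t. Ax = b, where argmin_LA(y, c) returns a
    minimizer of L_A(x; y) = f(x) - y'(Ax - b) + (c/2)||Ax - b||^2."""
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        x = argmin_LA(y, c)        # x^{k+1} = argmin_x L_A(x; y^k)
        y = y + c * (b - A @ x)    # y^{k+1} = y^k + c (b - A x^{k+1})
    return x, y
```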

Augmented Lagrangian (a.k.a. method of multipliers)

Recall the KKT conditions (omitting the complementarity part):

(primal feasibility) $Ax^* = b$
(dual feasibility) $0 \in \partial f(x^*) - A^T y^*$

Compare the 2nd condition with the optimality condition of the ALM subproblem:

$0 \in \partial f(x^{k+1}) - A^T(y^k + c(b - Ax^{k+1})) = \partial f(x^{k+1}) - A^T y^{k+1}$

Conclusion: dual feasibility is maintained for $(x^{k+1}, y^{k+1})$ for all $k$.

It also works toward primal feasibility: since $y^k = c\sum_{i=1}^{k}(b - Ax^i)$, the multiplier term accumulates the past residuals,

$-(y^k)^T(Ax - b) + \frac{c}{2}\|Ax - b\|_2^2 = c\left\langle \sum_{i=1}^{k}(Ax^i - b),\ Ax - b\right\rangle + \frac{c}{2}\|Ax - b\|_2^2$,

so the iteration keeps adding penalty to the violation of $Ax = b$. In the limit, $Ax = b$ holds (for polyhedral $f(\cdot)$, in finitely many steps).

22 / 39

Augmented Lagrangian (a.k.a. method of multipliers)

Compared to dual (sub)gradient ascent:

Pros:
- It converges for nonsmooth and extended-value $f$ (thanks to the proximal term).

Cons:
- If $f$ is nice and dual ascent works, ALM may be slower than dual ascent, since its subproblem is more difficult.
- The term $\frac{c}{2}\|Ax - b\|_2^2$ in the $x$-subproblem couples the different blocks of $x$ (unless $A$ has a block-diagonal structure).

Application/alternative derivation: Bregman iterative regularization (Osher, Burger, Goldfarb, Xu, and Yin [2005]; Yin, Osher, Goldfarb, and Darbon [2008])

23 / 39

Bregman distance

Definition: let $r(x)$ be a convex function; then

$D_r(x, y; p) = r(x) - r(y) - \langle p, x - y\rangle$, where $p \in \partial r(y)$.

Not a distance, but it has a flavor of distance.

[Figure: $D_{\ell_2^2}(u, u^k; p^k)$ (differentiable case) versus $D_{\ell_1}(u, u^k; p^k)$ (non-differentiable case).]

24 / 39

Bregman iterative regularization

Iteration:

$x^{k+1} = \arg\min_x D_r(x, x^k; p^k) + g(x)$
$p^{k+1} = p^k - \nabla g(x^{k+1})$

starting from $k = 0$ and $(x^0, p^0) = (0, 0)$.

The update of $p$ follows from $0 \in \partial r(x^{k+1}) - p^k + \nabla g(x^{k+1})$, so in the next iteration $D_r(x, x^{k+1}; p^{k+1})$ is well defined.

The Bregman iteration is related, or equivalent, to
1. the proximal point iteration
2. the residual addback iteration
3. the augmented Lagrangian iteration

25 / 39
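
A sketch in the same oracle style; `argmin_rpg(p)` is assumed to minimize $r(x) - \langle p, x\rangle + g(x)$, which equals the Bregman subproblem $D_r(x, x^k; p^k) + g(x)$ up to constants in $x$:

```python
import numpy as np

def bregman_iteration(argmin_rpg, grad_g, n, iters=50):
    """Bregman iterative regularization with subgradient bookkeeping."""
    p = np.zeros(n)
    x = np.zeros(n)
    for _ in range(iters):
        x = argmin_rpg(p)     # x^{k+1} = argmin_x D_r(x, x^k; p^k) + g(x)
        p = p - grad_g(x)     # p^{k+1} = p^k - grad g(x^{k+1})
    return x, p
```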

Bregman and proximal point

If $r(x) = \frac{c}{2}\|x\|_2^2$, the Bregman method reduces to the classical proximal point method

$x^{k+1} = \arg\min_x g(x) + \frac{c}{2}\|x - x^k\|_2^2$.

Hence, the Bregman iteration with function $r$ is an $r$-proximal algorithm for $\min g(x)$.

Traditionally, $r = \ell_2^2$ or another smooth function is used as the proximal function. Few use non-differentiable convex functions like $\ell_1$ to generate the proximal term, because $\ell_1$ is not stable! But using the $\ell_1$ proximal function has interesting properties.

26 / 39

Bregman convergence

If $g(x) = p(Ax - b)$ is a strictly convex function that penalizes the violation of $Ax = b$, which is feasible, then the iteration

$x^{k+1} = \arg\min_x D_r(x, x^k; p^k) + g(x)$

converges to a solution of

$\min r(x)$, s.t. $Ax = b$.

Recall that the augmented Lagrangian algorithm has a similar property.

Next we restrict our analysis to $g(x) = \frac{c}{2}\|Ax - b\|_2^2$.

27 / 39

Residual addback iteration

If $g(x) = \frac{c}{2}\|Ax - b\|_2^2$, we can derive the equivalent iteration

$b^{k+1} = b + (b^k - Ax^k)$
$x^{k+1} = \arg\min_x r(x) + \frac{c}{2}\|Ax - b^{k+1}\|_2^2$.

Interpretation: at every iteration, the residual $b - Ax^k$ is added back to $b^k$; every subproblem is identical but with different data.

28 / 39
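
A sketch of the addback loop; `argmin_rls(bk)` is an assumed solver for the identical subproblem $\min_x r(x) + \frac{c}{2}\|Ax - b^k\|_2^2$:

```python
import numpy as np

def addback_iteration(argmin_rls, A, b, iters=10):
    """Residual addback: re-solve the same subproblem with updated data."""
    bk = b.copy()                 # equals b^1 when x^0 = 0 and b^0 = 0
    for _ in range(iters):
        x = argmin_rls(bk)        # x^{k+1}
        bk = b + (bk - A @ x)     # b^{k+2} = b + (b^{k+1} - A x^{k+1})
    return x
```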

Bregman = residual addback

The equivalence of the two forms is given by

$p^k = cA^T(b^k - Ax^k)$.

Proof by induction: assume both iterations have produced the same $x^k$ so far (with $c = \delta$ in the original notation); see the sketch below.

29 / 39
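
A sketch of the induction step, reconstructed from the update rules above (my derivation, not the slide's own):

```latex
% Suppose p^k = cA^T(b^{k+1} - b), where b^{k+1} - b = b^k - Ax^k.
% Substituting into the optimality condition of the Bregman subproblem:
\begin{align*}
0 &\in \partial r(x^{k+1}) - p^k + cA^T(Ax^{k+1} - b) \\
  &=  \partial r(x^{k+1}) + cA^T(Ax^{k+1} - b^{k+1}),
\end{align*}
% which is exactly the optimality condition of the addback subproblem
% x^{k+1} = argmin_x r(x) + (c/2)||Ax - b^{k+1}||_2^2,
% so both iterations produce the same x^{k+1}. The subgradient update
% then preserves the relation:
\begin{align*}
p^{k+1} = p^k - cA^T(Ax^{k+1} - b)
        = cA^T(b^{k+1} - Ax^{k+1})
        = cA^T(b^{k+2} - b).
\end{align*}
```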

Bregman = residual addback = augmented Lagrangian

Assume $f(x) = r(x)$ and $g(x) = \frac{c}{2}\|Ax - b\|_2^2$.

Addback iteration: $b^{k+1} = b^k + (b - Ax^k) = \cdots = b^0 + \sum_{i=0}^{k}(b - Ax^i)$.

Augmented Lagrangian iteration: $y^{k+1} = y^k + c(b - Ax^{k+1}) = \cdots = y^0 + c\sum_{i=1}^{k+1}(b - Ax^i)$.

Bregman iteration: $p^{k+1} = p^k + cA^T(b - Ax^{k+1}) = \cdots = p^0 + cA^T\sum_{i=1}^{k+1}(b - Ax^i)$.

Their equivalence is established by

$y^k = c(b^{k+1} - b)$ and $p^k = A^T y^k$, for $k = 0, 1, \ldots$,

with the initial values $x^0 = 0$, $b^0 = 0$, $p^0 = 0$, $y^0 = 0$.

30 / 39

Residual addback in a regularization perspective

Adding the residual $b - Ax^k$ back to $b^k$ is somewhat counter-intuitive. In the regularized least-squares problem

$\min_x r(x) + \frac{c}{2}\|Ax - b\|_2^2$,

the residual $b - Ax^k$ contains both unwanted error and wanted features. The question is how to extract the features out of $b - Ax^k$.

An intuitive approach is to solve

$y^* = \arg\min_y r(y) + \frac{c}{2}\|Ay - (b - Ax^k)\|_2^2$

and then let $x^{k+1} \gets x^k + y^*$. However, the addback iteration keeps the same $r$ and adds the residuals back to $b^k$. Surprisingly, this gives good denoising results.

31 / 39

Good denoising effect

Compare addback (Bregman) to BPDN: $\min\{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}$, where $b = Ax^o + n$.

[Figure: four panels plotting recovered signals against the true signal: (a) the BPDN solution with $\sigma = \|n\|_2$, (b) Bregman iteration 1, (c) Bregman iteration 3, (d) Bregman iteration 5.]

32 / 39

Good denoising effect

From this example, given noisy observations and starting over-regularized:
- some intermediate solutions have better fitting and less noise than $x^*_\sigma = \arg\min\{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}$, in fact, no matter how $\sigma$ is chosen;
- the addback intermediate solutions are not on the path of $x^*_\sigma = \arg\min\{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}$ traced by varying $\sigma > 0$.

Recall that if the addback iteration is continued, it converges to the solution of $\min_x \|x\|_1$, s.t. $Ax = b$.

33 / 39

Example of total variation denoising

Problem: $u$ is a 2D image and $b$ is its noisy observation; the noise is Gaussian. Apply the addback iteration with $A = I$ and

$r(u) = \mathrm{TV}(u) = \|\nabla u\|_1$.

The subproblem has the form

$\min_u \mathrm{TV}(u) + \frac{\delta}{2}\|u - f^k\|_2^2$,

where the initial $f^0$ is the noisy observation of $u_{\text{original}}$.

34 / 39
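
A sketch of the addback loop for this example ($A = I$, $b = f^0$); `prox_tv(f, delta)` is a hypothetical placeholder for any solver of $\min_u \mathrm{TV}(u) + \frac{\delta}{2}\|u - f\|_2^2$ (e.g., Chambolle's algorithm), and the stopping rule anticipates slide 36:

```python
import numpy as np

def tv_addback(f0, prox_tv, delta, noise_level, max_iters=10):
    """Addback iteration for TV denoising; stops at the noise level."""
    fk = f0.copy()
    u = prox_tv(fk, delta)                 # u^1
    for _ in range(max_iters):
        if np.linalg.norm(u - f0) <= noise_level:
            break                          # residual has reached noise level
        fk = f0 + (fk - u)                 # add the residual back
        u = prox_tv(fk, delta)             # next u^{k+1}
    return u
```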

[Figure: TV denoising results over successive addback iterations.]

35 / 39

When to stop the addback iteration?

[Figure: two curves over the iterations. The 2nd curve shows that the optimal stopping iteration is 5; the 1st curve shows where the residual just crosses the noise level.]

Solution: stop when $\|Ax^k - b\|_2 \approx$ noise level. Some theoretical results exist (Osher, Burger, Goldfarb, Xu, and Yin [2005]).

36 / 39

Numerical stability

The addback/augmented-Lagrangian iterations are more stable than the Bregman iteration, though the three are the same on paper. Two reasons:
- $p^k$ in the Bregman iteration may gradually lose the property $p^k \in \mathcal{R}(A^T)$ due to accumulated round-off errors; the other two iterations multiply by $A^T$ explicitly and are thus more stable.
- The addback iteration enjoys error forgetting and error cancellation when $r$ is polyhedral.

37 / 39

Numerical stability for $\ell_1$ minimization

$\ell_1$-based error-forgetting simulation:
- use five different subproblem solvers for the $\ell_1$ Bregman iterations
- for each subproblem, stop the solver at accuracy $10^{-6}$
- track and plot $\|x^k - x^*\|_2 / \|x^*\|_2$ vs iteration $k$

Errors made in the subproblems get cancelled iteratively. See Yin and Osher [2012].

38 / 39

Summary
- Dual gradient method, after smoothing
- Exact smoothing for $\ell_1$
- Smooth a function by adding a strongly convex function to its convex conjugate
- Augmented Lagrangian, Bregman, and residual addback iterations; their equivalence
- Better denoising results of the residual addback iteration
- Numerical stability, error forgetting, and error cancellation of the residual addback iteration; numerical instability of the Bregman iteration

39 / 39

(Incomplete) References:

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

M.-J. Lai and W. Yin. Augmented $\ell_1$ and nuclear-norm models with a globally linearly convergent algorithm. Submitted to SIAM Journal on Imaging Sciences, 2012.

M. P. Friedlander and P. Tseng. Exact regularization of convex programs. SIAM Journal on Optimization, 18(4):1326-1350, 2007.

W. Yin. Analysis and generalizations of the linearized Bregman method. SIAM Journal on Imaging Sciences, 3(4):856-877, 2010.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372-376, 1983.

J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988.

Hongchao Zhang and William W. Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM Journal on Optimization, 14(4):1043-1056, 2004.

Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. SIAM Journal on Multiscale Modeling and Simulation, 4(2):460-489, 2005.

40 / 39

W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for $\ell_1$-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143-168, 2008.

W. Yin and S. Osher. Error forgetting of Bregman iteration. Journal of Scientific Computing, 54(2):684-698, 2012.

41 / 39