Lecture: Algorithms for Compressed Sensing


Lecture: Algorithms for Compressed Sensing
Zaiwen Wen
Beijing International Center For Mathematical Research, Peking University
Acknowledgement: part of these slides is based on Prof. Lieven Vandenberghe's and Prof. Wotao Yin's lecture notes

Outline
- Proximal gradient method
- Complexity analysis of the proximal gradient method
- Alternating direction method of multipliers (ADMM)
- Linearized alternating direction method of multipliers
Check lecture notes at http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html

l1-regularized least squares problem
Consider
    min ψ_µ(x) := µ‖x‖_1 + (1/2)‖Ax − b‖_2^2
Approaches:
- Interior point method: l1_ls
- Spectral gradient method: GPSR
- Fixed-point continuation method: FPC
- Active set method: FPC_AS
- Alternating direction augmented Lagrangian method
- Nesterov's optimal first-order method
- many others

Subgradient
Recall the basic inequality for a convex differentiable f: f(y) ≥ f(x) + ∇f(x)^T(y − x).
g is a subgradient of a convex function f at x ∈ dom f if
    f(y) ≥ f(x) + g^T(y − x), ∀ y ∈ dom f.
(Figure: g_2, g_3 are subgradients at x_2; g_1 is a subgradient at x_1.)

Optimality conditions: unconstrained
x* minimizes f(x) if and only if 0 ∈ ∂f(x*).
Proof: by definition, f(y) ≥ f(x*) + 0^T(y − x*) for all y ⟺ 0 ∈ ∂f(x*).

Optimality conditions: constrained
    min f_0(x)  s.t. f_i(x) ≤ 0, i = 1, ..., m.
From Lagrange duality: if strong duality holds, then x*, λ* are primal, dual optimal if and only if
- x* is primal feasible
- λ* ≥ 0
- complementarity: λ_i* f_i(x*) = 0 for i = 1, ..., m
- x* is a minimizer of min_x L(x, λ*) = f_0(x) + ∑_i λ_i* f_i(x), i.e.,
    0 ∈ ∂_x L(x*, λ*) = ∂f_0(x*) + ∑_i λ_i* ∂f_i(x*)

Proximal Gradient Method
Let f(x) = (1/2)‖Ax − b‖_2^2. The gradient is ∇f(x) = A^T(Ax − b). Consider
    min ψ_µ(x) := µ‖x‖_1 + f(x).
First-order approximation + proximal term:
    x^{k+1} := argmin_{x ∈ R^n} µ‖x‖_1 + (∇f(x^k))^T(x − x^k) + (1/(2τ))‖x − x^k‖_2^2
             = shrink(x^k − τ∇f(x^k), µτ)
Forward-backward operator splitting: x^{k+1} := (I + τT_1)^{-1}(I − τT_2)x^k
    0 ∈ ∂φ(x) := ∂(µ‖x‖_1) + ∇f(x) = T_1(x) + T_2(x) = T(x)
    0 ∈ (x + τT_1(x)) − (x − τT_2(x))  ⟺  x = (I + τT_1)^{-1}(I − τT_2)x
- gradient step: bring in candidates for nonzero components
- shrinkage step: eliminate some of them by soft thresholding

Shrinkage (soft thresholding)
    shrink(y, ν) := argmin_{x ∈ R} ν|x| + (1/2)(x − y)^2
                  = sgn(y) max(|y| − ν, 0)
                  = y − ν sgn(y) if |y| > ν, and 0 otherwise
(applied componentwise to vectors)
References: Chambolle, DeVore, Lee and Lucier; Figueiredo, Nowak and Wright; Elad, Matalon and Zibulevsky; Hale, Yin and Zhang; Darbon, Osher; many others
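As a quick illustration, the shrinkage operator is a componentwise one-liner in numpy; the sketch below (the function name shrink and the choice of numpy are illustrative assumptions) applies the closed form above.

```python
import numpy as np

def shrink(y, nu):
    """Soft-thresholding: argmin_x nu*|x| + 0.5*(x - y)^2, applied componentwise."""
    return np.sign(y) * np.maximum(np.abs(y) - nu, 0.0)
```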

Proximal gradient method for general problems
Consider the model
    min F(x) := f(x) + r(x)
- f(x) is convex, differentiable
- r(x) is convex but may be nondifferentiable
General scheme: linearize f(x) and add a proximal term:
    x^{k+1} := argmin_{x ∈ R^n} r(x) + (∇f(x^k))^T(x − x^k) + (1/(2τ))‖x − x^k‖_2^2
             = argmin_{x ∈ R^n} τ r(x) + (1/2)‖x − (x^k − τ∇f(x^k))‖_2^2
             = prox_{τr}(x^k − τ∇f(x^k))
Proximal operator:
    prox_r(y) := argmin_x r(x) + (1/2)‖x − y‖_2^2

Proximal mapping
If r is convex and closed (has a closed epigraph), then
    prox_r(y) := argmin_x r(x) + (1/2)‖x − y‖_2^2
exists and is unique for all y.
From the optimality conditions of the minimization in the definition:
    x = prox_r(y)  ⟺  y − x ∈ ∂r(x)  ⟺  r(z) ≥ r(x) + (y − x)^T(z − x), ∀ z
The subdifferential of a convex function is a monotone operator:
    (u − v)^T(x − y) ≥ 0, ∀ x, y, u ∈ ∂f(x), v ∈ ∂f(y)
Proof: combine the following two inequalities
    f(y) ≥ f(x) + u^T(y − x),  f(x) ≥ f(y) + v^T(x − y)

Nonexpansiveness
If u = prox_r(x) and v = prox_r(y), then
    (u − v)^T(x − y) ≥ ‖u − v‖_2^2
prox_r is firmly nonexpansive, or co-coercive with constant 1.
This follows from the previous page and monotonicity of subgradients:
    x − u ∈ ∂r(u), y − v ∈ ∂r(v)  ⟹  (x − u − y + v)^T(u − v) ≥ 0
It implies (from the Cauchy-Schwarz inequality)
    ‖prox_r(x) − prox_r(y)‖_2 ≤ ‖x − y‖_2

Global convergence
- The proximal mapping is nonexpansive.
- Under certain assumptions, h(x) = x − τ∇f(x) is nonexpansive: ‖h(x) − h(x*)‖ ≤ ‖x − x*‖, and ∇f(x) = ∇f(x*) whenever the equality holds.
- If x* is a fixed point, then
    ‖prox_{τr}(h(x)) − x*‖ = ‖prox_{τr}(h(x)) − prox_{τr}(h(x*))‖ ≤ ‖h(x) − h(x*)‖ ≤ ‖x − x*‖

Proximal gradient method for l1-minimization
Advantages:
- Low computational cost (first-order method)
- Finite convergence of the nondegenerate support
- Global q-linear convergence
Disadvantages:
- Takes a lot of steps when the support is relatively large
- Slow convergence: degenerate support identification or magnitude recovery?
Strategies for improving shrinkage:
- Adjust λ dynamically (use λ_k)
- Line search: x^{k+1} = x^k + α_k (shrink(x^k − λ_k ∇f^k, µλ_k) − x^k)
- Continuation: approximately solve for µ_0 > µ_1 > ... > µ_p = µ (FPC)
- Debiasing or subspace optimization

Line search approach
Let x^k(τ_k) = shrink(x^k − τ_k ∇f^k, µτ_k). Then set
    x^{k+1} = x^k + α_k (x^k(τ_k) − x^k) = x^k + α_k d^k
Choosing τ_k: Barzilai-Borwein method. With ∇f^k = A^T(Ax^k − b), s^{k−1} = x^k − x^{k−1} and y^{k−1} = ∇f^k − ∇f^{k−1},
    τ_{k,BB1} = ((s^{k−1})^T s^{k−1}) / ((s^{k−1})^T y^{k−1})  or  τ_{k,BB2} = ((s^{k−1})^T y^{k−1}) / ((y^{k−1})^T y^{k−1}).
Choosing α_k: Armijo-like line search
    ψ_µ(x^k + α_k d^k) ≤ C_k + σ α_k Δ_k
- FPC: Δ_k := (∇f^k)^T d^k
- FPC_AS: Δ_k := (∇f^k)^T d^k + µ‖x^k(τ_k)‖_1 − µ‖x^k‖_1
- C_k = (ηQ_{k−1} C_{k−1} + ψ_µ(x^k))/Q_k, Q_k = ηQ_{k−1} + 1, with C_0 = ψ_µ(x^0) and Q_0 = 1 (Zhang and Hager)
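The sketch below shows one such line-search shrinkage step in numpy, combining a BB1 step size, the FPC-style Δ_k, and the Zhang-Hager reference value; the variable names, the backtracking factor and the safeguards are illustrative assumptions, not the exact FPC/FPC_AS implementation.

```python
import numpy as np

def shrinkage_bb_step(A, b, x, x_prev, g_prev, mu, Ck, Qk, eta=0.85, sigma=1e-4):
    """One nonmonotone line-search shrinkage step (sketch). Returns x_new, grad, C, Q."""
    psi = lambda z: mu * np.linalg.norm(z, 1) + 0.5 * np.linalg.norm(A @ z - b) ** 2
    g = A.T @ (A @ x - b)                               # gradient of 0.5*||Ax - b||^2
    s, y = x - x_prev, g - g_prev
    tau = (s @ s) / (s @ y) if s @ y > 0 else 1.0       # BB1 step size (safeguarded)
    xt = np.sign(x - tau * g) * np.maximum(np.abs(x - tau * g) - mu * tau, 0.0)
    d = xt - x                                          # search direction d^k
    delta = g @ d                                       # FPC-style Delta_k
    alpha = 1.0
    while psi(x + alpha * d) > Ck + sigma * alpha * delta and alpha > 1e-10:
        alpha *= 0.5                                    # Armijo-like backtracking
    x_new = x + alpha * d
    Q_new = eta * Qk + 1.0                              # Zhang-Hager reference update
    C_new = (eta * Qk * Ck + psi(x_new)) / Q_new
    return x_new, g, C_new, Q_new
```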

Outline: Complexity Analysis
- Amir Beck and Marc Teboulle, A fast iterative shrinkage thresholding algorithm for linear inverse problems
- Paul Tseng, On accelerated proximal gradient methods for convex-concave optimization
- Paul Tseng, Approximation accuracy, gradient methods and error bound for structured convex optimization
- N. S. Aybat and G. Iyengar, A first-order augmented Lagrangian method for compressed sensing

Proximal Gradient / ISTA / FPC
The following terminologies refer to the same algorithm:
- proximal gradient method
- ISTA: iterative shrinkage thresholding algorithm
- FPC: fixed-point continuation method
Consider
    min F(x) := µ‖x‖_1 + (1/2)‖Ax − b‖_2^2
ISTA/FPC computes
    x^{k+1} := argmin µ‖x‖_1 + (∇f^k)^T(x − x^k) + (1/(2τ))‖x − x^k‖_2^2 = shrink(x^k − τ∇f^k, µτ)
Complexity? How can we speed it up?
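For concreteness, a minimal ISTA/FPC loop might look as follows; it is a sketch, not the FPC code, and it assumes a fixed step τ = 1/‖A‖_2^2 and an arbitrary iteration count.

```python
import numpy as np

def ista(A, b, mu, n_iter=500):
    """Proximal gradient (ISTA/FPC) for min mu*||x||_1 + 0.5*||Ax - b||_2^2."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2                      # step size <= 1/L, L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - tau * (A.T @ (A @ x - b))                      # gradient (forward) step
        x = np.sign(z) * np.maximum(np.abs(z) - mu * tau, 0.0) # shrinkage (backward) step
    return x
```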

Generalization of ISTA
Consider a solvable model
    min F(x) := f(x) + r(x)
- r(x) continuous convex, may be nonsmooth
- f(x) continuously differentiable with Lipschitz continuous gradient: ‖∇f(x) − ∇f(y)‖_2 ≤ L_f ‖x − y‖_2
The cost of the following problem is tractable:
    p_L(y) := argmin_x Q_L(x, y) := f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖_2^2 + r(x)
            = argmin_x r(x) + (L/2)‖x − (y − (1/L)∇f(y))‖_2^2
            = prox_{r/L}(y − (1/L)∇f(y))
ISTA basic iteration: x^k = p_L(x^{k−1})

Quadratic upper bound
Suppose ∇f is Lipschitz continuous with parameter L and dom f is convex. Then
    f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖_2^2, ∀ x, y.
Proof: Let v = y − x. Then
    f(y) = f(x) + ∇f(x)^T v + ∫_0^1 (∇f(x + tv) − ∇f(x))^T v dt
         ≤ f(x) + ∇f(x)^T v + ∫_0^1 ‖∇f(x + tv) − ∇f(x)‖_2 ‖v‖_2 dt
         ≤ f(x) + ∇f(x)^T v + ∫_0^1 L t ‖v‖_2^2 dt
         = f(x) + ∇f(x)^T v + (L/2)‖y − x‖_2^2
The above inequality also implies that F(p_L(y)) ≤ Q(p_L(y), y).

Lemma: Assume F(p_L(y)) ≤ Q(p_L(y), y). Then for any x ∈ R^n,
    F(x) − F(p_L(y)) ≥ (L/2)‖p_L(y) − y‖^2 + L⟨y − x, p_L(y) − y⟩ = (L/2)(‖p_L(y) − x‖_2^2 − ‖x − y‖_2^2)
Proof: Optimality of p_L(y) gives ∇f(y) + L(p_L(y) − y) + z = 0 for some z ∈ ∂r(p_L(y)). The convexity of f and r gives
    f(x) ≥ f(y) + ∇f(y)^T(x − y),  r(x) ≥ r(p_L(y)) + z^T(x − p_L(y)).
Using the definition of Q(p_L(y), y) and z:
    F(x) − F(p_L(y)) ≥ F(x) − Q(p_L(y), y) ≥ −(L/2)‖p_L(y) − y‖^2 + (x − p_L(y))^T(∇f(y) + z)
Substituting ∇f(y) + z = −L(p_L(y) − y) gives the claim.

Complexity analysis of ISTA
Complexity result:
    F(x^k) − F(x*) ≤ L_f ‖x^0 − x*‖_2^2 / (2k)
Apply the lemma with x = x*, y = x^j (so p_L(x^j) = x^{j+1}):
    (2/L)(F(x*) − F(x^{j+1})) ≥ ‖x* − x^{j+1}‖_2^2 − ‖x* − x^j‖_2^2
Summing over j = 0, ..., k−1 gives
    (2/L)(k F(x*) − ∑_{j=0}^{k−1} F(x^{j+1})) ≥ ‖x* − x^k‖_2^2 − ‖x* − x^0‖_2^2
Applying the lemma with x = y = x^j yields
    (2/L)(F(x^j) − F(x^{j+1})) ≥ ‖x^j − x^{j+1}‖_2^2
Multiplying the last inequality by j and summing over j = 0, ..., k−1 gives
    (2/L)(−k F(x^k) + ∑_{j=0}^{k−1} F(x^{j+1})) ≥ ∑_{j=0}^{k−1} j‖x^j − x^{j+1}‖_2^2
Therefore, adding the two summed inequalities,
    (2k/L)(F(x*) − F(x^k)) ≥ ‖x* − x^k‖_2^2 + ∑_{j=0}^{k−1} j‖x^j − x^{j+1}‖_2^2 − ‖x* − x^0‖_2^2
Since the first two terms on the right are nonnegative, the complexity bound follows.

FISTA: Accelerated proximal gradient
Given y^1 = x^0 and t_1 = 1, compute:
    x^k = p_L(y^k)
    t_{k+1} = (1 + √(1 + 4t_k^2)) / 2
    y^{k+1} = x^k + ((t_k − 1)/t_{k+1})(x^k − x^{k−1})
Complexity result:
    F(x^k) − F(x*) ≤ 2L_f ‖x^0 − x*‖_2^2 / (k + 1)^2
Lipschitz constant L_f unknown? Choose L by backtracking so that F(p_L(y^k)) ≤ Q_L(p_L(y^k), y^k).
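A minimal FISTA sketch with a fixed constant L = ‖A‖_2^2 (no backtracking); names and iteration count are illustrative assumptions.

```python
import numpy as np

def fista(A, b, mu, n_iter=500):
    """FISTA for min mu*||x||_1 + 0.5*||Ax - b||_2^2 with fixed L = ||A||_2^2."""
    L = np.linalg.norm(A, 2) ** 2
    x = x_prev = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        z = y - A.T @ (A @ y - b) / L                        # gradient step at y^k
        x = np.sign(z) * np.maximum(np.abs(z) - mu / L, 0.0) # x^k = p_L(y^k)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)          # extrapolation step
        x_prev, t = x, t_next
    return x
```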

Complexity analysis of FISTA
Let v_k := F(x^k) − F(x*) and u^k := t_k x^k − (t_k − 1)x^{k−1} − x*. The key estimate is
    (2/L)(t_k^2 v_k − t_{k+1}^2 v_{k+1}) ≥ ‖u^{k+1}‖_2^2 − ‖u^k‖_2^2
If a_k − a_{k+1} ≥ b_{k+1} − b_k with a_1 + b_1 ≤ c, then a_k ≤ c. Also t_k ≥ (k + 1)/2.
Let a_k = (2/L) t_k^2 v_k, b_k = ‖u^k‖_2^2, and c = ‖x^0 − x*‖_2^2.
Check a_1 + b_1 ≤ c: with t_1 = 1 and y^1 = x^0,
    F(x*) − F(x^1) = F(x*) − F(p_L(y^1)) ≥ (L/2)‖p_L(y^1) − y^1‖_2^2 + L⟨y^1 − x*, p_L(y^1) − y^1⟩
                   = (L/2)(‖x^1 − x*‖_2^2 − ‖y^1 − x*‖_2^2)

Complexity analysis of FISTA
Apply the lemma with (x = x^k, y = y^{k+1}) and with (x = x*, y = y^{k+1}). Note x^{k+1} = p_L(y^{k+1}):
    2L^{-1}(v_k − v_{k+1}) ≥ ‖x^{k+1} − y^{k+1}‖^2 + 2⟨x^{k+1} − y^{k+1}, y^{k+1} − x^k⟩   (1)
    −2L^{-1} v_{k+1} ≥ ‖x^{k+1} − y^{k+1}‖^2 + 2⟨x^{k+1} − y^{k+1}, y^{k+1} − x*⟩   (2)
Forming ((1)·(t_{k+1} − 1) + (2))·t_{k+1} and using t_k^2 = t_{k+1}^2 − t_{k+1}:
    (2t_{k+1}/L)((t_{k+1} − 1)v_k − t_{k+1}v_{k+1}) = (2/L)(t_k^2 v_k − t_{k+1}^2 v_{k+1})
        ≥ ‖t_{k+1}(x^{k+1} − y^{k+1})‖^2 + 2t_{k+1}⟨x^{k+1} − y^{k+1}, t_{k+1}y^{k+1} − (t_{k+1} − 1)x^k − x*⟩
Using the identity ‖b − a‖^2 + 2⟨b − a, a − c⟩ = ‖b − c‖_2^2 − ‖a − c‖_2^2 and y^{k+1} = x^k + ((t_k − 1)/t_{k+1})(x^k − x^{k−1}):
    (2/L)(t_k^2 v_k − t_{k+1}^2 v_{k+1}) ≥ ‖t_{k+1}x^{k+1} − (t_{k+1} − 1)x^k − x*‖_2^2 − ‖t_{k+1}y^{k+1} − (t_{k+1} − 1)x^k − x*‖_2^2 = ‖u^{k+1}‖_2^2 − ‖u^k‖_2^2

Proximal gradient method
Consider the model
    min F(x) := f(x) + r(x)
Linearize f(x):
    l_f(x, y) := f(y) + ⟨∇f(y), x − y⟩ + r(x)
Given a strictly convex function h(x), the Bregman distance is
    D(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩.
For example, take h(x) = ‖x‖_2^2. Then D(x, y) = ‖x − y‖_2^2.
The proximal gradient method can also be written as
    x^{k+1} := argmin_x l_f(x, x^k) + (L/2) D(x, x^k)

APG Variant 1
Accelerated proximal gradient (APG): set x^{−1} = x^0 and θ_{−1} = θ_0 = 1:
    y^k = x^k + θ_k(θ_{k−1}^{−1} − 1)(x^k − x^{k−1})
    x^{k+1} = argmin_x l_f(x, y^k) + (L/2)‖x − y^k‖_2^2
    θ_{k+1} = (√(θ_k^4 + 4θ_k^2) − θ_k^2)/2
Question: what is the difference between θ_k and t_k?

APG Variant 2
Replace (1/2)‖x − y^k‖_2^2 by a Bregman distance D(x, z^k). Set x^0, z^0 and θ_0 = 1:
    y^k = (1 − θ_k)x^k + θ_k z^k
    z^{k+1} = argmin_x l_f(x, y^k) + θ_k L D(x, z^k)
    x^{k+1} = (1 − θ_k)x^k + θ_k z^{k+1}
    θ_{k+1} = (√(θ_k^4 + 4θ_k^2) − θ_k^2)/2

APG Variant 3
Set x^0, z^0 := argmin_x h(x) and θ_0 = 1:
    y^k = (1 − θ_k)x^k + θ_k z^k
    z^{k+1} = argmin_x ∑_{i=0}^{k} l_f(x, y^i)/θ_i + L h(x)
    x^{k+1} = (1 − θ_k)x^k + θ_k z^{k+1}
    θ_{k+1} = (√(θ_k^4 + 4θ_k^2) − θ_k^2)/2

Complexity analysis
- Proximal gradient method: F(x^k) ≤ F(x) + (1/k) L D(x, x^0)
- APG1: F(x^k) ≤ F(x) + (4/(k + 1)^2) L D(x, x^0)
- APG2: F(x^k) ≤ F(x) + (4/(k + 1)^2) L D(x, z^0)
- APG3: F(x^k) ≤ F(x) + (4/(k + 1)^2) L (h(x) − h(x^0))

Complexity analysis
Let f(x) be convex and assume ∇f(x) is Lipschitz continuous. Then
    F(x) ≥ l_f(x, y) ≥ F(x) − (L/2)‖x − y‖_2^2.
For any proper lsc convex function ψ(x), if z^+ = argmin_x ψ(x) + D(x, z) and h(x) is differentiable at z^+, then
    ψ(x) + D(x, z) ≥ ψ(z^+) + D(x, z^+) + D(z^+, z).
Proof: The optimality at z^+ and the definition of the subgradient give
    ψ(x) + ⟨∇_x D(z^+, z), x − z^+⟩ ≥ ψ(z^+).
Note ∇_x D(z^+, z) = ∇h(z^+) − ∇h(z). Rearranging terms yields
    ψ(x) − ⟨∇h(z), x − z⟩ ≥ ψ(z^+) − ⟨∇h(z), z^+ − z⟩ − ⟨∇h(z^+), x − z^+⟩.
Add h(x) − h(z) to both sides.

Complexity analysis of APG2
    F(x^{k+1}) ≤ l_f(x^{k+1}, y^k) + (L/2)‖x^{k+1} − y^k‖_2^2
              = l_f((1 − θ_k)x^k + θ_k z^{k+1}, y^k) + (Lθ_k^2/2)‖z^{k+1} − z^k‖_2^2
              ≤ (1 − θ_k) l_f(x^k, y^k) + θ_k l_f(z^{k+1}, y^k) + θ_k^2 L D(z^{k+1}, z^k)
              ≤ (1 − θ_k) l_f(x^k, y^k) + θ_k (l_f(x, y^k) + θ_k L D(x, z^k) − θ_k L D(x, z^{k+1}))
              ≤ (1 − θ_k) F(x^k) + θ_k F(x) + θ_k^2 L D(x, z^k) − θ_k^2 L D(x, z^{k+1})
The third inequality applies the lemma of the previous page to l_f(·, y^k) + θ_k L D(·, z^k).
Let e_k = F(x^k) − F(x) and Δ_k = L D(x, z^k). Then
    e_{k+1} ≤ (1 − θ_k)e_k + θ_k^2 Δ_k − θ_k^2 Δ_{k+1}

Complexity analysis of APG2
Divide both sides by θ_k^2 and use 1/θ_k^2 = (1 − θ_{k+1})/θ_{k+1}^2:
    ((1 − θ_{k+1})/θ_{k+1}^2) e_{k+1} + Δ_{k+1} ≤ ((1 − θ_k)/θ_k^2) e_k + Δ_k
Hence
    ((1 − θ_{k+1})/θ_{k+1}^2) e_{k+1} + Δ_{k+1} ≤ ((1 − θ_0)/θ_0^2) e_0 + Δ_0
Using 1/θ_k^2 = (1 − θ_{k+1})/θ_{k+1}^2 and θ_0 = 1:
    (1/θ_k^2) e_{k+1} ≤ Δ_0 = L D(x, z^0)

An application to the Basis Pursuit problem
Consider
    min ‖x‖_1  s.t. Ax = b
Augmented Lagrangian (Bregman) framework (b^1 = b, k = 1):
    x^k := argmin_x F_k(x) := µ_k‖x‖_1 + (1/2)‖Ax − b^k‖_2^2
    b^{k+1} := b + (µ_{k+1}/µ_k)(b^k − Ax^k)
- Obtaining x^{k+1} exactly is difficult; inexact solver: how do we control the accuracy?
- Analysis of the Bregman approach (see Wotao Yin's work) gives no complexity bound.
- Solving each subproblem by the APG algorithms? (A sketch of the outer loop is given below.)
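The following sketch implements the outer loop with a fixed µ_k ≡ µ, in which case the update reduces to b^{k+1} = b + (b^k − Ax^k); the inner subproblem is handed to any l1 solver, for example the FISTA sketch above, and the tolerance and iteration counts are arbitrary illustrative choices.

```python
import numpy as np

def bregman_bp(A, b, mu=1.0, n_outer=50, inner_solver=None):
    """Outer augmented Lagrangian (Bregman) loop for min ||x||_1 s.t. Ax = b (sketch)."""
    bk = b.copy()
    x = np.zeros(A.shape[1])
    for _ in range(n_outer):
        # inner problem: min mu*||x||_1 + 0.5*||Ax - bk||_2^2, solved inexactly
        x = inner_solver(A, bk, mu)
        bk = b + (bk - A @ x)                      # add back the residual
        if np.linalg.norm(A @ x - b) <= 1e-8 * np.linalg.norm(b):
            break
    return x

# usage (reusing the fista sketch above):
# x = bregman_bp(A, b, mu=1.0, inner_solver=lambda A, bk, mu: fista(A, bk, mu, n_iter=200))
```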

Outline: ADMM
- Alternating direction augmented Lagrangian methods
- Variable splitting method
- Convergence for problems with two blocks of variables

References
- Wotao Yin, Stanley Osher, Donald Goldfarb, Jerome Darbon, Bregman Iterative Algorithms for l1-Minimization with Applications to Compressed Sensing
- Junfeng Yang, Yin Zhang, Alternating Direction Algorithms for l1-Problems in Compressive Sensing
- Tom Goldstein, Stanley Osher, The Split Bregman Method for L1-Regularized Problems
- B.S. He, H. Yang, S.L. Wang, Alternating Direction Method with Self-Adaptive Penalty Parameters for Monotone Variational Inequalities

Basis pursuit problem
Primal: min ‖x‖_1, s.t. Ax = b
Dual: max b^T λ, s.t. ‖A^T λ‖_∞ ≤ 1
The dual problem is equivalent to
    max b^T λ, s.t. A^T λ = s, ‖s‖_∞ ≤ 1.

Augmented Lagrangian (Bregman) framework
Augmented Lagrangian function:
    L(λ, s, x) := −b^T λ + x^T(A^T λ − s) + (1/(2µ))‖A^T λ − s‖_2^2
Algorithmic framework: compute λ^{k+1} and s^{k+1} at the k-th iteration
    (DL)  min_{λ,s} L(λ, s, x^k), s.t. ‖s‖_∞ ≤ 1
Update the Lagrange multiplier:
    x^{k+1} = x^k + (A^T λ^{k+1} − s^{k+1})/µ
Pros and cons:
- Pros: rich theory, well understood, and a lot of algorithms
- Cons: L(λ, s, x^k) is not separable in λ and s, and the subproblem (DL) is difficult to minimize

An alternating direction minimization scheme
ADMM:
- Divide variables into different blocks according to their roles
- Minimize the augmented Lagrangian function with respect to one block at a time while all other blocks are fixed
    λ^{k+1} = argmin_λ L(λ, s^k, x^k)
    s^{k+1} = argmin_s L(λ^{k+1}, s, x^k), s.t. ‖s‖_∞ ≤ 1
    x^{k+1} = x^k + (A^T λ^{k+1} − s^{k+1})/µ

An alternating direction minimization scheme
Explicit solutions:
    λ^{k+1} = (AA^T)^{-1}(As^k − µ(Ax^k − b))
    s^{k+1} = argmin_s ‖s − A^T λ^{k+1} − µx^k‖_2^2, s.t. ‖s‖_∞ ≤ 1, i.e., s^{k+1} = P_{[−1,1]}(A^T λ^{k+1} + µx^k)
    x^{k+1} = x^k + (A^T λ^{k+1} − s^{k+1})/µ
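These explicit updates transcribe directly into a short loop; the sketch below assumes AA^T is nonsingular, uses a dense solve each iteration (in practice one would factor AA^T once), and follows the sign conventions reconstructed above.

```python
import numpy as np

def admm_bp_dual(A, b, mu=1.0, n_iter=300):
    """ADMM on the dual of basis pursuit; the multiplier x approaches the BP solution."""
    m, n = A.shape
    x, s = np.zeros(n), np.zeros(n)
    AAt = A @ A.T                                    # assumes full row rank
    for _ in range(n_iter):
        lam = np.linalg.solve(AAt, A @ s - mu * (A @ x - b))
        s = np.clip(A.T @ lam + mu * x, -1.0, 1.0)   # projection onto [-1, 1]^n
        x = x + (A.T @ lam - s) / mu                 # multiplier update
    return x
```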

ADMM for BP-denoising
Primal: min ‖x‖_1, s.t. ‖Ax − b‖_2 ≤ σ, which is equivalent to
    min ‖x‖_1, s.t. Ax − b + r = 0, ‖r‖_2 ≤ σ
Lagrangian function:
    L(x, r, λ) := ‖x‖_1 − λ^T(Ax − b + r) + π(‖r‖_2 − σ)
                = ‖x‖_1 − (A^T λ)^T x + π‖r‖_2 − λ^T r + b^T λ − πσ
Hence, the dual problem is:
    max b^T λ − πσ, s.t. ‖A^T λ‖_∞ ≤ 1, ‖λ‖_2 ≤ π

ADMM for BP-denoising
The dual problem is equivalent to:
    max b^T λ − πσ, s.t. A^T λ = s, ‖s‖_∞ ≤ 1, ‖λ‖_2 ≤ π
Augmented Lagrangian function:
    L(λ, s, x) := −b^T λ + πσ + x^T(A^T λ − s) + (1/(2µ))‖A^T λ − s‖_2^2
ADM scheme:
    λ^{k+1} = argmin_λ (1/(2µ))‖A^T λ − s^k‖^2 + (Ax^k − b)^T λ, s.t. ‖λ‖_2 ≤ π_k
    s^{k+1} = argmin_s ‖s − A^T λ^{k+1} − µx^k‖_2^2, s.t. ‖s‖_∞ ≤ 1, i.e., s^{k+1} = P_{[−1,1]}(A^T λ^{k+1} + µx^k)
    π_{k+1} = ‖λ^{k+1}‖_2
    x^{k+1} = x^k + (A^T λ^{k+1} − s^{k+1})/µ

ADMM for the l1-regularized problem
Primal: min µ‖x‖_1 + (1/2)‖Ax − b‖_2^2, which is equivalent to
    min µ‖x‖_1 + (1/2)‖r‖_2^2, s.t. Ax − b = r.
Lagrangian function:
    L(x, r, λ) := µ‖x‖_1 + (1/2)‖r‖_2^2 − λ^T(Ax − b − r)
                = µ‖x‖_1 − (A^T λ)^T x + (1/2)‖r‖_2^2 + λ^T r + b^T λ
Hence, the dual problem is:
    max b^T λ − (1/2)‖λ‖^2, s.t. ‖A^T λ‖_∞ ≤ µ

ADMM for the l1-regularized problem
The dual problem is equivalent to
    max b^T λ − (1/2)‖λ‖^2, s.t. A^T λ = s, ‖s‖_∞ ≤ µ.
Augmented Lagrangian function:
    L(λ, s, x) := −b^T λ + (1/2)‖λ‖^2 + x^T(A^T λ − s) + (1/(2µ))‖A^T λ − s‖_2^2
ADM scheme:
    λ^{k+1} = (AA^T + µI)^{-1}(As^k − µ(Ax^k − b))
    s^{k+1} = argmin_s ‖s − A^T λ^{k+1} − µx^k‖_2^2, s.t. ‖s‖_∞ ≤ µ, i.e., s^{k+1} = P_{[−µ,µ]}(A^T λ^{k+1} + µx^k)
    x^{k+1} = x^k + (A^T λ^{k+1} − s^{k+1})/µ
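The same pattern applies to the regularized dual, now with an (AA^T + µI) solve and a projection onto [−µ, µ]; again a sketch under the sign conventions reconstructed above, with illustrative iteration counts.

```python
import numpy as np

def admm_l1reg_dual(A, b, mu, n_iter=300):
    """ADMM on the dual of min mu*||x||_1 + 0.5*||Ax - b||_2^2 (sketch of the updates above)."""
    m, n = A.shape
    x, s = np.zeros(n), np.zeros(n)
    M = A @ A.T + mu * np.eye(m)                     # regularized normal matrix
    for _ in range(n_iter):
        lam = np.linalg.solve(M, A @ s - mu * (A @ x - b))
        s = np.clip(A.T @ lam + mu * x, -mu, mu)     # projection onto [-mu, mu]^n
        x = x + (A.T @ lam - s) / mu                 # multiplier update
    return x
```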

YALL1
Derive ADMM for the following problems:
- BP: min_{x ∈ C^n} ‖Wx‖_{w,1}, s.t. Ax = b
- L1/L1: min_{x ∈ C^n} ‖Wx‖_{w,1} + (1/ν)‖Ax − b‖_1
- L1/L2: min_{x ∈ C^n} ‖Wx‖_{w,1} + (1/(2ρ))‖Ax − b‖_2^2
- BP+: min_{x ∈ R^n} ‖x‖_{w,1}, s.t. Ax = b, x ≥ 0
- L1/L1+: min_{x ∈ R^n} ‖x‖_{w,1} + (1/ν)‖Ax − b‖_1, s.t. x ≥ 0
- L1/L2+: min_{x ∈ R^n} ‖x‖_{w,1} + (1/(2ρ))‖Ax − b‖_2^2, s.t. x ≥ 0
Here ν, ρ ≥ 0, A ∈ C^{m×n}, b ∈ C^m, x ∈ C^n for the first three and x ∈ R^n for the last three, W ∈ C^{n×n} is a unitary matrix serving as a sparsifying basis, and ‖x‖_{w,1} := ∑_{i=1}^n w_i|x_i|.

Variable splitting
Given A ∈ R^{m×n}, consider min f(x) + g(Ax), which is
    min f(x) + g(y), s.t. Ax = y
Augmented Lagrangian function:
    L(x, y, λ) = f(x) + g(y) − λ^T(Ax − y) + (1/(2µ))‖Ax − y‖_2^2
ADM:
    (P_x): x^{k+1} := argmin_{x ∈ X} L(x, y^k, λ^k)
    (P_y): y^{k+1} := argmin_{y ∈ Y} L(x^{k+1}, y, λ^k)
    (P_λ): λ^{k+1} := λ^k − γ(Ax^{k+1} − y^{k+1})/µ

Variable splitting
Split Bregman (Goldstein and Osher) for anisotropic TV:
    min α‖Du‖_1 + β‖Ψu‖_1 + (1/2)‖Au − f‖_2^2
Introduce y = Du and w = Ψu to obtain
    min α‖y‖_1 + β‖w‖_1 + (1/2)‖Au − f‖_2^2, s.t. y = Du, w = Ψu
Augmented Lagrangian function:
    L := α‖y‖_1 + β‖w‖_1 + (1/2)‖Au − f‖_2^2 − p^T(Du − y) + (1/(2µ))‖Du − y‖_2^2 − q^T(Ψu − w) + (1/(2µ))‖Ψu − w‖_2^2

Variable splitting
The variable u can be obtained from
    (A^T A + (1/µ)(D^T D + I)) u = A^T f + (1/µ)(D^T y + Ψ^T w) + D^T p + Ψ^T q
If A and D are diagonalizable by FFT, then the computational cost is very cheap. For example, A = RF, where both R and D are circulant matrices.
Variables y and w:
    y := shrink(Du − µp, αµ)
    w := shrink(Ψu − µq, βµ)
Apply a few iterations before updating the Lagrange multipliers p and q.
Exercise: isotropic TV
    min α‖Du‖_2 + β‖Ψu‖_1 + (1/2)‖Au − f‖_2^2
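To illustrate the FFT solve, here is a hedged 1-D sketch: if A and D are circulant with first columns a and d, the linear system for u diagonalizes under the DFT (the 2-D imaging case works the same way with 2-D FFTs); the helper name and the 1-D simplification are assumptions for illustration only.

```python
import numpy as np

def solve_u_circulant(a, d, mu, rhs):
    """Solve (A^T A + (1/mu)(D^T D + I)) u = rhs when A, D are circulant (1-D sketch).
    a, d are the first columns of A and D; both matrices diagonalize under the FFT."""
    ah, dh = np.fft.fft(a), np.fft.fft(d)
    denom = np.abs(ah) ** 2 + (np.abs(dh) ** 2 + 1.0) / mu   # eigenvalues of the system matrix
    return np.real(np.fft.ifft(np.fft.fft(rhs) / denom))
```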

FTVd: Fast TV deconvolution
Wang-Yang-Yin-Zhang consider:
    min_u ∑_i ‖D_i u‖ + (1/(2µ))‖Ku − f‖_2^2
Introducing w and a quadratic penalty:
    min_{u,w} ∑_i (‖w_i‖ + (β/2)‖w_i − D_i u‖_2^2) + (1/(2µ))‖Ku − f‖_2^2
Alternating minimization:
- For fixed u, {w_i} can be solved by shrinkage at O(N)
- For fixed {w_i}, u can be solved by FFT at O(N log N)

Outline: Linearized ADMM
- Linearized Bregman and Bregmanized operator splitting
- ADM + proximal point method
Reference: Xiaoqun Zhang, Martin Burger, Stanley Osher, A unified primal-dual algorithm framework based on Bregman iteration

Review of Bregman method
Consider BP: min ‖x‖_1, s.t. Ax = b
Bregman method:
    D_J^{p^k}(x, x^k) := ‖x‖_1 − ‖x^k‖_1 − ⟨p^k, x − x^k⟩
    x^{k+1} := argmin_x µ D_J^{p^k}(x, x^k) + (1/2)‖Ax − b‖_2^2
    p^{k+1} = p^k + (1/µ) A^T(b − Ax^{k+1})
Augmented Lagrangian (updating the multiplier or b):
    x^{k+1} := argmin_x µ‖x‖_1 + (1/2)‖Ax − b^k‖_2^2
    b^{k+1} = b + (b^k − Ax^{k+1})
They are equivalent; see Yin-Osher-Goldfarb-Darbon.

Linearized approaches
Linearized Bregman method:
    x^{k+1} := argmin_x µ D_J^{p^k}(x, x^k) + (A^T(Ax^k − b))^T(x − x^k) + (1/(2δ))‖x − x^k‖_2^2,
    p^{k+1} := p^k − (1/(µδ))(x^{k+1} − x^k) − (1/µ) A^T(Ax^k − b),
which is equivalent to
    x^{k+1} := argmin_x µ‖x‖_1 + (1/(2δ))‖x − v^k‖_2^2
    v^{k+1} := v^k − δ A^T(Ax^{k+1} − b)
Bregmanized operator splitting:
    x^{k+1} := argmin_x µ‖x‖_1 + (A^T(Ax^k − b^k))^T(x − x^k) + (1/(2δ))‖x − x^k‖_2^2
    b^{k+1} = b + (b^k − Ax^{k+1})
Are they equivalent?

Linearized approaches
Linearized Bregman method:
    x^{k+1} := argmin_x µ D_J^{p^k}(x, x^k) + (A^T(Ax^k − b))^T(x − x^k) + (1/(2δ))‖x − x^k‖_2^2,
    p^{k+1} := p^k − (1/(µδ))(x^{k+1} − x^k) − (1/µ) A^T(Ax^k − b),
which is equivalent to
    x^{k+1} := shrink(v^k, µδ)
    v^{k+1} := v^k − δ A^T(Ax^{k+1} − b)
or
    x^{k+1} := shrink(δ A^T b^k, µδ)
    b^{k+1} := b + (b^k − Ax^{k+1})
Bregmanized operator splitting:
    x^{k+1} := shrink(x^k − δ A^T(Ax^k − b^k), µδ) = shrink(δ A^T b^k + x^k − δ A^T A x^k, µδ)
    b^{k+1} = b + (b^k − Ax^{k+1})
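The two-line form of linearized Bregman is easy to prototype; the sketch below fixes δ = 1/‖A‖_2^2 and the iteration count arbitrarily.

```python
import numpy as np

def linearized_bregman(A, b, mu=5.0, delta=None, n_iter=2000):
    """Linearized Bregman: x^{k+1} = shrink(v^k, mu*delta), v^{k+1} = v^k - delta*A^T(Ax^{k+1} - b)."""
    if delta is None:
        delta = 1.0 / np.linalg.norm(A, 2) ** 2
    v = np.zeros(A.shape[1])
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = np.sign(v) * np.maximum(np.abs(v) - mu * delta, 0.0)  # shrinkage step
        v = v - delta * A.T @ (A @ x - b)                         # residual accumulation
    return x
```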

Linearized approaches
Linearized Bregman: if the sequence x^k converges and p^k is bounded, then the limit of x^k is the unique solution of
    min µ‖x‖_1 + (1/(2δ))‖x‖_2^2  s.t. Ax = b.
For µ large enough, the limit solution solves BP.
Exact regularization: this holds once δ exceeds a (problem-dependent) threshold.
What about Bregmanized operator splitting?

Primal ADM for the l1-regularized problem
Primal: min µ‖x‖_1 + (1/2)‖Ax − b‖_2^2, which is equivalent to
    min µ‖x‖_1 + (1/2)‖r‖_2^2, s.t. Ax − b = r.
Augmented Lagrangian function:
    L(x, r, λ) = µ‖x‖_1 + (1/2)‖r‖_2^2 − λ^T(Ax − b − r) + (1/(2δ))‖Ax − b − r‖_2^2
ADM scheme:
    x^{k+1} = argmin_x µ‖x‖_1 + (1/(2δ))‖Ax − b − r^k − δλ^k‖_2^2   (a subproblem of the same form as the original problem)
    r^{k+1} = argmin_r (1/2)‖r‖_2^2 + (1/(2δ))‖Ax^{k+1} − b − r − δλ^k‖_2^2
    λ^{k+1} = λ^k − (Ax^{k+1} − b − r^{k+1})/δ

Primal ADM for the l1-regularized problem
Primal: min µ‖x‖_1 + (1/2)‖Ax − b‖_2^2, which is equivalent to
    min µ‖x‖_1 + (1/2)‖r‖_2^2, s.t. Ax − b = r.
Augmented Lagrangian function:
    L(x, r, λ) = µ‖x‖_1 + (1/2)‖r‖_2^2 − λ^T(Ax − b − r) + (1/(2δ))‖Ax − b − r‖_2^2
Linearized ADM scheme (with g^k the gradient at x^k of the smooth term (1/(2δ))‖Ax − b − r^k − δλ^k‖_2^2):
    x^{k+1} = argmin_x µ‖x‖_1 + (g^k)^T(x − x^k) + (1/(2τ))‖x − x^k‖_2^2
    r^{k+1} = argmin_r (1/2)‖r‖_2^2 + (1/(2δ))‖Ax^{k+1} − b − r − δλ^k‖_2^2
    λ^{k+1} = λ^k − (Ax^{k+1} − b − r^{k+1})/δ
Convergence of the linearized scheme?
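A sketch of this linearized scheme under the conventions reconstructed above (constraint Ax − b = r, g^k the gradient of the smooth x-term, multiplier step λ^{k+1} = λ^k − (Ax^{k+1} − b − r^{k+1})/δ); the step size τ = δ/‖A‖_2^2, the value of δ and the iteration count are illustrative choices.

```python
import numpy as np

def linearized_primal_adm(A, b, mu, delta=1.0, n_iter=500):
    """Linearized primal ADM for min mu*||x||_1 + 0.5*||Ax - b||_2^2 (sketch)."""
    m, n = A.shape
    x, r, lam = np.zeros(n), np.zeros(m), np.zeros(m)
    tau = delta / np.linalg.norm(A, 2) ** 2                   # proximal step for the x-update
    for _ in range(n_iter):
        g = A.T @ (A @ x - b - r - delta * lam) / delta       # gradient of the smooth x-term
        z = x - tau * g
        x = np.sign(z) * np.maximum(np.abs(z) - mu * tau, 0.0)
        r = (A @ x - b - delta * lam) / (1.0 + delta)         # closed-form r-update
        lam = lam - (A @ x - b - r) / delta                   # multiplier update
    return x
```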

Generalized algorithm
Consider
    min f(x), s.t. Ax = b
Proximal-point method:
    x^{k+1} := argmin_x f(x) − ⟨λ^k, Ax − b⟩ + (1/(2δ))‖Ax − b‖_2^2 + (1/2)‖x − x^k‖_Q^2
    Cλ^{k+1} := Cλ^k + (b − Ax^{k+1})
- Augmented Lagrangian or Bregman method if Q = 0 and C = δI
- Proximal point method by Rockafellar if Q = I and C = γI
- Bregmanized operator splitting if Q = (1/δ)(I − A^T A)
Theoretical results: lim_k ‖Ax^k − b‖_2 = 0, lim_k f(x^k) = f(x̄), and all limit points of (x^k, λ^k) are saddle points.

Convergence proof
From the x-subproblem:
    ∂f(x^{k+1}) ∋ s^{k+1} := A^T λ^k − (1/δ) A^T(Ax^{k+1} − b) − Q(x^{k+1} − x^k)
Assume (x̄, λ̄) is a saddle point; then Ax̄ = b and s̄ − A^T λ̄ = 0 for some s̄ ∈ ∂f(x̄).
Let s_e^{k+1} = s^{k+1} − s̄, x_e^{k+1} = x^{k+1} − x̄ and λ_e^{k+1} = λ^{k+1} − λ̄:
    s_e^{k+1} + (1/δ) A^T A x_e^{k+1} + Q x_e^{k+1} = Q x_e^k + A^T λ_e^k,
    C λ_e^{k+1} = C λ_e^k − A x_e^{k+1}
Taking the inner product with x_e^{k+1} on both sides of the first equality:
    (1/2)(‖x_e^{k+1}‖_Q^2 + ‖x^{k+1} − x^k‖_Q^2 − ‖x_e^k‖_Q^2) = −⟨s_e^{k+1}, x_e^{k+1}⟩ − (1/δ)‖A x_e^{k+1}‖_2^2 + ⟨λ_e^k, A x_e^{k+1}⟩
Taking the inner product with λ_e^{k+1} on both sides of the second equality:
    (1/2)(‖λ_e^{k+1}‖_C^2 − ‖λ_e^k‖_C^2 − ‖A x_e^{k+1}‖_{C^{-1}}^2) = −⟨λ_e^k, A x_e^{k+1}⟩
Adding the above (with w^k = (x^k, λ^k) and ‖w_e^k‖^2 := ‖x_e^k‖_Q^2 + ‖λ_e^k‖_C^2):
    ‖w_e^{k+1}‖^2 + ‖x^{k+1} − x^k‖_Q^2 + 2⟨s_e^{k+1}, x_e^{k+1}⟩ + (2/δ)‖A x_e^{k+1}‖_2^2 − ‖A x_e^{k+1}‖_{C^{-1}}^2 = ‖w_e^k‖^2

Convergence proof
The convexity of f(x) gives ⟨s_e^{k+1}, x_e^{k+1}⟩ ≥ 0. If 2/δ > 1/λ_min(C), then
    2⟨s_e^{k+1}, x_e^{k+1}⟩ + (2/δ)‖A x_e^{k+1}‖_2^2 − ‖A x_e^{k+1}‖_{C^{-1}}^2 ≥ 0
Hence ‖w_e^{k+1}‖^2 ≤ ‖w_e^k‖^2, which implies the boundedness of w^k = (x^k, λ^k). Furthermore,
    ∑_k ( 2⟨s_e^{k+1}, x_e^{k+1}⟩ + ‖x^{k+1} − x^k‖_Q^2 + (2/δ)‖A x_e^{k+1}‖_2^2 − ‖A x_e^{k+1}‖_{C^{-1}}^2 ) ≤ ‖w_e^0‖^2 < ∞
Hence 0 = lim_k A x_e^k = lim_k (Ax^k − b), and lim_k (s^k − A^T λ^k) = 0 follows from
    s^{k+1} := A^T λ^k − (1/δ) A^T(Ax^{k+1} − b) − Q(x^{k+1} − x^k)

Generalized algorithm
Consider
    min f(x) + g(z), s.t. Bx = z
Proximal-point method:
    x^{k+1} := argmin_x f(x) − ⟨λ^k, Bx⟩ + (1/(2δ))‖Bx − z^k‖^2 + (1/2)‖x − x^k‖_Q^2
    z^{k+1} := argmin_z g(z) + ⟨λ^k, z⟩ + (1/(2δ))‖Bx^{k+1} − z‖^2 + (1/2)‖z − z^k‖_M^2
    Cλ^{k+1} := Cλ^k − (Bx^{k+1} − z^{k+1})
- ADM if Q = M = 0 and C = δI
- Proximal point method by Rockafellar if Q = I and C = γI
- Bregmanized operator splitting if Q = (1/δ)(I − A^T A)
Theoretical results: lim_k ‖Bx^k − z^k‖_2 = 0, lim_k f(x^k) = f(x̄), lim_k g(z^k) = g(z̄), and all limit points of (x^k, z^k, λ^k) are saddle points.
