Convex Optimization in Machine Learning and Inverse Problems Part 2: First-Order Methods

1 Convex Optimization in Machine Learning and Inverse Problems Part 2: First-Order Methods Mário A. T. Figueiredo 1 and Stephen J. Wright 2 1 Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal 2 Computer Sciences Department, University of Wisconsin, Madison, WI, USA Condensed version of ICCOPT tutorial, Lisbon, Portugal, 2013 M. Figueiredo and S. Wright First-Order Methods April / 68

2 In general, optimization problems are unsolvable, Nesterov, 1984 M. Figueiredo and S. Wright First-Order Methods April / 68

4 Focus (Initially) on Smooth Convex Functions Consider $\min f(x)$, $x \in \mathbb{R}^n$, with $f$ smooth and convex. Usually assume $\mu I \preceq \nabla^2 f(x) \preceq L I$ for all $x$, with $0 \leq \mu \leq L$ (thus $L$ is a Lipschitz constant of $\nabla f$). If $\mu > 0$, then $f$ is $\mu$-strongly convex (as seen in Part 1) and $f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}\|y - x\|_2^2$. Define conditioning (or condition number) as $\kappa := L/\mu$. We are often interested in convex quadratics: $f(x) = \frac{1}{2} x^T A x$ with $\mu I \preceq A \preceq L I$, or $f(x) = \frac{1}{2}\|Bx - b\|_2^2$ with $\mu I \preceq B^T B \preceq L I$. M. Figueiredo and S. Wright First-Order Methods April / 68

5 What's the Setup? We consider iterative algorithms $x_{k+1} = \Phi(x_k)$, or $x_{k+1} = \Phi(x_k, x_{k-1})$. For now, assume we can evaluate $f(x_k)$ and $\nabla f(x_k)$ at each iteration. Later, we look at broader classes of problems: nonsmooth $f$; $\nabla f$ not available (or too expensive to evaluate exactly); only an estimate of the gradient is available; nonsmooth regularization, i.e., instead of simply $f(x)$, we want to minimize $f(x) + \tau\psi(x)$. We focus on algorithms that can be adapted to those scenarios. M. Figueiredo and S. Wright First-Order Methods April / 68

6 Skip to slide 29 Recall Francis Bach's talk M. Figueiredo and S. Wright First-Order Methods April / 68

7 Steepest Descent Steepest descent (a.k.a. gradient descent): $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$, for some $\alpha_k > 0$. Different ways to select an appropriate $\alpha_k$: 1. Hard: interpolating scheme with safeguarding to identify an approximately minimizing $\alpha_k$. 2. Easy: backtracking. $\bar\alpha, \bar\alpha/2, \bar\alpha/4, \bar\alpha/8, \ldots$ until sufficient decrease in $f$ is obtained. 3. Trivial: don't test for function decrease; use rules based on $L$ and $\mu$. Analysis for 1 and 2 usually yields global convergence at unspecified rate. The greedy strategy of getting good decrease in the current search direction may lead to better practical results. Analysis for 3: Focuses on convergence rate, and leads to accelerated multi-step methods. M. Figueiredo and S. Wright First-Order Methods April / 68
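As a concrete illustration of option 2 (not part of the original slides), here is a minimal NumPy sketch of steepest descent with a backtracking, sufficient-decrease step choice; the function names and the toy least-squares problem are illustrative only.

```python
import numpy as np

def gradient_descent_backtracking(f, grad, x0, alpha_bar=1.0, c1=1e-4,
                                  max_iter=500, tol=1e-8):
    """Steepest descent with backtracking: try alpha_bar, alpha_bar/2, ...
    until the sufficient-decrease condition holds."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha_bar
        # backtrack until f(x - alpha*g) <= f(x) - c1*alpha*||g||^2
        while f(x - alpha * g) > f(x) - c1 * alpha * (g @ g):
            alpha /= 2.0
        x = x - alpha * g
    return x

# Toy strongly convex quadratic: f(x) = 0.5*||B x - b||^2
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((B @ x - b) ** 2)
grad = lambda x: B.T @ (B @ x - b)
x_star = gradient_descent_backtracking(f, grad, np.zeros(5))
print(np.linalg.norm(grad(x_star)))  # near-zero gradient norm
```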

8 Line Search Seek $\alpha_k$ that satisfies the Wolfe conditions: sufficient decrease in $f$: $f(x_k - \alpha_k \nabla f(x_k)) \leq f(x_k) - c_1 \alpha_k \|\nabla f(x_k)\|^2$, $(0 < c_1 < 1)$, while not being too small (significant increase in the directional derivative): $\nabla f(x_{k+1})^T \nabla f(x_k) \leq c_2 \|\nabla f(x_k)\|^2$, $(c_1 < c_2 < 1)$. (Works for nonconvex $f$.) Can show that accumulation points $\bar x$ of $\{x_k\}$ are stationary: $\nabla f(\bar x) = 0$ (thus minimizers, if $f$ is convex). Can do one-dimensional line search for $\alpha_k$, taking minima of quadratic or cubic interpolations of the function and gradient at the last two values tried. Use brackets to ensure steady convergence. Often finds suitable $\alpha$ within 3 attempts. (Nocedal and Wright, 2006) M. Figueiredo and S. Wright First-Order Methods April / 68

9 Backtracking Try $\alpha_k = \bar\alpha, \bar\alpha/2, \bar\alpha/4, \bar\alpha/8, \ldots$ until the sufficient decrease condition is satisfied. No need to check the second Wolfe condition: the $\alpha_k$ thus identified is within striking distance of an $\alpha$ that's too large, so it is not too short. Backtracking is widely used in applications, but doesn't work on nonsmooth problems, or when $f$ is not available / too expensive. M. Figueiredo and S. Wright First-Order Methods April / 68

10 Constant (Short) Steplength By elementary use of Taylor's theorem, and since $\nabla^2 f(x) \preceq L I$, $f(x_{k+1}) \leq f(x_k) - \alpha_k\left(1 - \frac{\alpha_k}{2} L\right)\|\nabla f(x_k)\|_2^2$. For $\alpha_k \equiv 1/L$, $f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|_2^2$, thus $\|\nabla f(x_k)\|_2^2 \leq 2L\,[f(x_k) - f(x_{k+1})]$. Summing for $k = 0, 1, \ldots, N$, and telescoping the sum, $\sum_{k=0}^{N}\|\nabla f(x_k)\|_2^2 \leq 2L\,[f(x_0) - f(x_{N+1})]$. It follows that $\nabla f(x_k) \to 0$ if $f$ is bounded below. M. Figueiredo and S. Wright First-Order Methods April / 68

11 Rate Analysis Suppose that the minimizer $x^*$ is unique. Another elementary use of Taylor's theorem shows that $\|x_{k+1} - x^*\|^2 \leq \|x_k - x^*\|^2 - \alpha_k\left(\frac{2}{L} - \alpha_k\right)\|\nabla f(x_k)\|^2$, so that $\{\|x_k - x^*\|\}$ is decreasing. Define for convenience: $\Delta_k := f(x_k) - f(x^*)$. By convexity, have $\Delta_k \leq \nabla f(x_k)^T(x_k - x^*) \leq \|\nabla f(x_k)\|\,\|x_k - x^*\| \leq \|\nabla f(x_k)\|\,\|x_0 - x^*\|$. From the previous page (subtracting $f(x^*)$ from both sides of the inequality), and using the inequality above, we have $\Delta_{k+1} \leq \Delta_k - \frac{1}{2L}\|\nabla f(x_k)\|^2 \leq \Delta_k - \frac{1}{2L\|x_0 - x^*\|^2}\Delta_k^2$. M. Figueiredo and S. Wright First-Order Methods April / 68

12 Weakly convex: 1/k sublinear; Strongly convex: linear Take the reciprocal of both sides and manipulate (using $(1-\epsilon)^{-1} \geq 1 + \epsilon$): $\frac{1}{\Delta_{k+1}} \geq \frac{1}{\Delta_k} + \frac{1}{2L\|x_0 - x^*\|^2}$, which yields $f(x_{k+1}) - f(x^*) \leq \frac{2L\|x_0 - x^*\|^2}{k+1}$. The classic 1/k convergence rate! By assuming $\mu > 0$, can set $\alpha_k \equiv 2/(\mu + L)$ and get a linear (geometric) rate, much better than sublinear in the long run: $\|x_k - x^*\|^2 \leq \left(\frac{L - \mu}{L + \mu}\right)^{2k}\|x_0 - x^*\|^2 = \left(1 - \frac{2}{\kappa + 1}\right)^{2k}\|x_0 - x^*\|^2$. M. Figueiredo and S. Wright First-Order Methods April / 68

13 Weakly convex: 1/k sublinear; Strongly convex: linear Since by Taylor's theorem we have $\Delta_k = f(x_k) - f(x^*) \leq (L/2)\|x_k - x^*\|^2$, it follows immediately that $f(x_k) - f(x^*) \leq \frac{L}{2}\left(1 - \frac{2}{\kappa + 1}\right)^{2k}\|x_0 - x^*\|^2$. Note: A geometric / linear rate is generally better than almost any sublinear ($1/k$ or $1/k^2$) rate. M. Figueiredo and S. Wright First-Order Methods April / 68

14 Exact minimizing $\alpha_k$: Faster rate? Question: does taking $\alpha_k$ as the exact minimizer of $f$ along $-\nabla f(x_k)$ yield a better rate of linear convergence? Consider $f(x) = \frac{1}{2} x^T A x$ (thus $x^* = 0$ and $f(x^*) = 0$). We have $\nabla f(x_k) = A x_k$. Exactly minimizing w.r.t. $\alpha_k$: $\alpha_k = \arg\min_\alpha \frac{1}{2}(x_k - \alpha A x_k)^T A (x_k - \alpha A x_k) = \frac{x_k^T A^2 x_k}{x_k^T A^3 x_k} \in \left[\frac{1}{L}, \frac{1}{\mu}\right]$. Thus $f(x_{k+1}) = f(x_k)\left(1 - \frac{(x_k^T A^2 x_k)^2}{(x_k^T A x_k)(x_k^T A^3 x_k)}\right)$, so, defining $z_k := A x_k$, we have $\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} = 1 - \frac{\|z_k\|^4}{(z_k^T A^{-1} z_k)(z_k^T A z_k)}$. M. Figueiredo and S. Wright First-Order Methods April / 68

15 Exact minimizing $\alpha_k$: Faster rate? Using the Kantorovich inequality: $(z^T A z)(z^T A^{-1} z) \leq \frac{(L + \mu)^2}{4 L \mu}\|z\|^4$. Thus $\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} \leq 1 - \frac{4 L \mu}{(L + \mu)^2} = \left(1 - \frac{2}{\kappa + 1}\right)^2$, and so $f(x_k) - f(x^*) \leq \left(1 - \frac{2}{\kappa + 1}\right)^{2k}[f(x_0) - f(x^*)]$. No improvement in the linear rate over constant steplength! M. Figueiredo and S. Wright First-Order Methods April / 68

16 The slow linear rate is typical! Not just a pessimistic bound! M. Figueiredo and S. Wright First-Order Methods April / 68

17 Multistep Methods: The Heavy-Ball Enhance the search direction using a contribution from the previous step (known as heavy ball, momentum, or two-step). Consider first a constant step length $\alpha$, and a second parameter $\beta$ for the momentum term: $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})$. Analyze by defining a composite iterate vector: $w_k := \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix}$. Thus $w_{k+1} = B w_k + o(\|w_k\|)$, with $B := \begin{bmatrix} -\alpha \nabla^2 f(x^*) + (1+\beta) I & -\beta I \\ I & 0 \end{bmatrix}$. M. Figueiredo and S. Wright First-Order Methods April / 68

18 Multistep Methods: The Heavy-Ball Matrix $B$ has the same eigenvalues as $\begin{bmatrix} -\alpha\Lambda + (1+\beta) I & -\beta I \\ I & 0 \end{bmatrix}$, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, where $\lambda_i$ are the eigenvalues of $\nabla^2 f(x^*)$. Choose $\alpha, \beta$ to explicitly minimize the max eigenvalue of $B$; obtain $\alpha = \frac{4}{L}\frac{1}{(1 + 1/\sqrt\kappa)^2}$, $\beta = \left(1 - \frac{2}{\sqrt\kappa + 1}\right)^2$. Leads to linear convergence for $\|x_k - x^*\|$ with rate approximately $1 - \frac{2}{\sqrt\kappa + 1}$. M. Figueiredo and S. Wright First-Order Methods April / 68
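For intuition (not from the slides), here is a minimal NumPy sketch of the heavy-ball iteration on a convex quadratic, with constant α and β set from L and µ as above (written equivalently as α = 4/(√L + √µ)²); the function name and test problem are illustrative.

```python
import numpy as np

def heavy_ball(grad, x0, L, mu, max_iter=300):
    """Heavy-ball / momentum iteration with the constant parameters
    alpha = 4/(sqrt(L)+sqrt(mu))^2, beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2."""
    kappa = L / mu
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(max_iter):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Quadratic f(x) = 0.5 x^T A x with known eigenvalue range [mu, L]
eigvals = np.linspace(1.0, 100.0, 50)          # mu = 1, L = 100
A = np.diag(eigvals)
grad = lambda x: A @ x
x = heavy_ball(grad, np.ones(50), L=100.0, mu=1.0)
print(np.linalg.norm(x))   # should be close to 0 (the minimizer)
```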

19 Summary: Linear Convergence, Strictly Convex f Steepest descent: linear rate approx $\left(1 - \frac{2}{\kappa}\right)$; Heavy-ball: linear rate approx $\left(1 - \frac{2}{\sqrt\kappa}\right)$. Big difference! To reduce $\|x_k - x^*\|$ by a factor $\epsilon$, need $k$ large enough that $\left(1 - \frac{2}{\kappa}\right)^k \leq \epsilon$, i.e. $k \geq \frac{\kappa}{2}|\log\epsilon|$ (steepest descent); $\left(1 - \frac{2}{\sqrt\kappa}\right)^k \leq \epsilon$, i.e. $k \geq \frac{\sqrt\kappa}{2}|\log\epsilon|$ (heavy-ball). A factor of $\sqrt\kappa$ difference; e.g. if $\kappa = 1000$ (not at all uncommon in inverse problems), need about 30 times fewer steps. M. Figueiredo and S. Wright First-Order Methods April / 68

20 Conjugate Gradient Basic conjugate gradient (CG) step is $x_{k+1} = x_k + \alpha_k p_k$, $p_k = -\nabla f(x_k) + \gamma_k p_{k-1}$. Can be identified with heavy-ball, with $\beta_k = \frac{\alpha_k \gamma_k}{\alpha_{k-1}}$. However, CG can be implemented in a way that doesn't require knowledge (or estimation) of $L$ and $\mu$: Choose $\alpha_k$ to (approximately) minimize $f$ along $p_k$; Choose $\gamma_k$ by a variety of formulae (Fletcher-Reeves, Polak-Ribière, etc.), all of which are equivalent if $f$ is a convex quadratic, e.g. $\gamma_k = \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}$. M. Figueiredo and S. Wright First-Order Methods April / 68

21 Conjugate Gradient Nonlinear CG: Variants include Fletcher-Reeves, Polak-Ribière, Hestenes. Restarting periodically with $p_k = -\nabla f(x_k)$ is useful (e.g. every $n$ iterations, or when $p_k$ is not a descent direction). For quadratic $f$, convergence analysis is based on eigenvalues of $A$ and Chebyshev polynomials, min-max arguments. Get: finite termination in as many iterations as there are distinct eigenvalues; asymptotic linear convergence with rate approx $1 - \frac{2}{\sqrt\kappa}$ (like heavy-ball). (Nocedal and Wright, 2006) M. Figueiredo and S. Wright First-Order Methods April / 68
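Here is a short nonlinear-CG sketch of my own (Fletcher-Reeves formula, simple backtracking line search, periodic restarts); it is not a production implementation, and the test problem is illustrative.

```python
import numpy as np

def nonlinear_cg_fr(f, grad, x0, max_iter=200, tol=1e-8, restart=None):
    """Nonlinear CG with the Fletcher-Reeves formula and a simple
    backtracking line search (a sketch, not a tuned implementation)."""
    x = x0.copy()
    g = grad(x)
    p = -g
    restart = restart or len(x0)
    for k in range(1, max_iter + 1):
        if np.linalg.norm(g) < tol:
            break
        if g @ p >= 0 or k % restart == 0:
            p = -g                            # restart with steepest descent
        alpha, c1 = 1.0, 1e-4
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
        g_new = grad(x)
        gamma = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves
        p = -g_new + gamma * p
        g = g_new
    return x

# Example: least squares f(x) = 0.5 ||B x - b||^2
rng = np.random.default_rng(1)
B = rng.standard_normal((30, 10)); b = rng.standard_normal(30)
f = lambda x: 0.5 * np.sum((B @ x - b) ** 2)
grad = lambda x: B.T @ (B @ x - b)
print(np.linalg.norm(grad(nonlinear_cg_fr(f, grad, np.zeros(10)))))
```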

22 Accelerated First-Order Methods Accelerate the rate to $1/k^2$ for weakly convex, while retaining the linear rate (related to $\sqrt\kappa$) for the strongly convex case. Nesterov (1983) describes a method that requires $\kappa$. Initialize: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$. Iterate: $x_{k+1} \leftarrow y_k - \frac{1}{L}\nabla f(y_k)$ (short step); find $\alpha_{k+1} \in (0,1)$ with $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + \frac{\alpha_{k+1}}{\kappa}$; set $\beta_k = \frac{\alpha_k(1 - \alpha_k)}{\alpha_k^2 + \alpha_{k+1}}$; set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$. Still works for weakly convex ($\kappa = \infty$). M. Figueiredo and S. Wright First-Order Methods April / 68

23 [Figure: geometric illustration of the iterates $x_k, y_k, x_{k+1}, y_{k+1}, x_{k+2}, y_{k+2}$.] Separates the gradient descent and momentum step components. M. Figueiredo and S. Wright First-Order Methods April / 68

24 Convergence Results: Nesterov If $\alpha_0 \geq 1/\sqrt\kappa$, have $f(x_k) - f(x^*) \leq c_1 \min\left(\left(1 - \frac{1}{\sqrt\kappa}\right)^k, \frac{4L}{(\sqrt{L} + c_2 k)^2}\right)$, where constants $c_1$ and $c_2$ depend on $x_0$, $\alpha_0$, $L$. Linear convergence at the heavy-ball rate for strongly convex $f$; $1/k^2$ sublinear rate otherwise. In the special case of $\alpha_0 = 1/\sqrt\kappa$, this scheme yields $\alpha_k \equiv \frac{1}{\sqrt\kappa}$, $\beta_k \equiv 1 - \frac{2}{\sqrt\kappa + 1}$. M. Figueiredo and S. Wright First-Order Methods April / 68

25 Beck and Teboulle Beck and Teboulle (2009) propose a similar algorithm, with a fairly short and elementary analysis (though still not intuitive). Initialize: Choose $x_0$; set $y_1 = x_0$, $t_1 = 1$. Iterate: $x_k \leftarrow y_k - \frac{1}{L}\nabla f(y_k)$; $t_{k+1} \leftarrow \frac{1}{2}\left(1 + \sqrt{1 + 4 t_k^2}\right)$; $y_{k+1} \leftarrow x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$. For (weakly) convex $f$, converges with $f(x_k) - f(x^*) \sim 1/k^2$. When $L$ is not known, increase an estimate of $L$ until it's big enough. Beck and Teboulle (2009) do the convergence analysis in 2-3 pages; elementary, but technical. M. Figueiredo and S. Wright First-Order Methods April / 68
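A minimal sketch (mine, not the authors') of this accelerated scheme applied to a smooth least-squares objective; the Lipschitz constant is computed directly for the toy problem.

```python
import numpy as np

def accelerated_gradient(grad, L, x0, max_iter=500):
    """Beck-Teboulle-style accelerated gradient method for smooth convex f
    (the t_k / y_k update from the slide, with step 1/L)."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(max_iter):
        x = y - grad(y) / L                     # short gradient step from y_k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev

# Toy problem: f(x) = 0.5 ||B x - b||^2, L = largest eigenvalue of B^T B
rng = np.random.default_rng(2)
B = rng.standard_normal((40, 15)); b = rng.standard_normal(40)
L = np.linalg.norm(B, 2) ** 2
grad = lambda x: B.T @ (B @ x - b)
x = accelerated_gradient(grad, L, np.zeros(15))
print(np.linalg.norm(grad(x)))
```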

26 A Non-Monotone Gradient Method: Barzilai-Borwein Barzilai and Borwein (1988) (BB) proposed an unusual choice of $\alpha_k$. Allows $f$ to increase (sometimes a lot) on some steps: non-monotone. Explicitly, we have $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$, where $\alpha_k := \arg\min_\alpha \|s_k - \alpha z_k\|^2$, with $s_k := x_k - x_{k-1}$, $z_k := \nabla f(x_k) - \nabla f(x_{k-1})$, giving $\alpha_k = \frac{s_k^T z_k}{z_k^T z_k}$. Note that for $f(x) = \frac{1}{2} x^T A x$, we have $\alpha_k = \frac{s_k^T A s_k}{s_k^T A^2 s_k} \in \left[\frac{1}{L}, \frac{1}{\mu}\right]$. BB can be viewed as a quasi-Newton method, with the Hessian approximated by $\alpha_k^{-1} I$. M. Figueiredo and S. Wright First-Order Methods April / 68
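To make the BB step concrete, a small sketch of my own (no line search, no safeguards, which the later slides note are needed in general); the convex quadratic test problem is illustrative.

```python
import numpy as np

def barzilai_borwein(grad, x0, alpha0=1.0, max_iter=200, tol=1e-8):
    """Non-monotone gradient method with the Barzilai-Borwein step
    alpha_k = (s_k^T z_k)/(z_k^T z_k)."""
    x = x0.copy()
    g = grad(x)
    alpha = alpha0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - alpha * g
        g_new = grad(x_new)
        s, z = x_new - x, g_new - g
        if s @ z > 0:                       # keep the step positive
            alpha = (s @ z) / (z @ z)
        x, g = x_new, g_new
    return x

# Strongly convex quadratic example: f(x) = 0.5 x^T A x
rng = np.random.default_rng(3)
M = rng.standard_normal((25, 25))
A = M.T @ M + np.eye(25)
grad = lambda x: A @ x
print(np.linalg.norm(barzilai_borwein(grad, np.ones(25))))
```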

27 Comparison: BB vs Greedy Steepest Descent M. Figueiredo and S. Wright First-Order Methods April / 68

28 There Are Many BB Variants Use $\alpha_k = s_k^T s_k / s_k^T z_k$ in place of $\alpha_k = s_k^T z_k / z_k^T z_k$; alternate between these two formulae; hold $\alpha_k$ constant for a number (2, 3, 5) of successive steps; take $\alpha_k$ to be the steepest descent step from the previous iteration. Nonmonotonicity appears essential to performance. Some variants get global convergence by requiring a sufficient decrease in $f$ over the worst of the last $M$ (say 10) iterates. The original 1988 analysis in BB's paper is nonstandard and illuminating (just for a 2-variable quadratic). In fact, most analyses of BB and related methods are nonstandard, and consider only special cases. The precursor of such analyses is Akaike (1959). More recently, see Ascher, Dai, Fletcher, Hager and others. M. Figueiredo and S. Wright First-Order Methods April / 68

29 Extending to the Constrained Case: $x \in \Omega$ How to change these methods to handle the constraint $x \in \Omega$? (assuming that $\Omega$ is a closed convex set) Some algorithms and theory stay much the same, ...if we can involve the constraint $x \in \Omega$ explicitly in the subproblems. Example: Nesterov's constant step scheme requires just one calculation to be changed from the unconstrained version. Initialize: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$. Iterate: $x_{k+1} \leftarrow \arg\min_{y \in \Omega} \frac{1}{2}\left\|y - \left[y_k - \frac{1}{L}\nabla f(y_k)\right]\right\|_2^2$; find $\alpha_{k+1} \in (0,1)$ with $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + \frac{\alpha_{k+1}}{\kappa}$; set $\beta_k = \frac{\alpha_k(1 - \alpha_k)}{\alpha_k^2 + \alpha_{k+1}}$; set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$. Convergence theory is unchanged. M. Figueiredo and S. Wright First-Order Methods April / 68

33 Regularized Optimization How to change these methods to handle regularized optimization? $\min_x f(x) + \tau\psi(x)$, where $f$ is convex and smooth, while $\psi$ is convex but usually nonsmooth. Often, all that is needed is to change the update step to $x_{k+1} = \arg\min_x \frac{1}{2}\|x - \Phi(x_k)\|_2^2 + \lambda\,\psi(x)$, where $\Phi(x_k)$ is a gradient descent step, or something more complicated; e.g., an accelerated method with $\Phi(x_k, x_{k-1})$ instead of $\Phi(x_k)$. This is the shrinkage/thresholding step; how to solve it with nonsmooth $\psi$? That's the topic of the following slides. M. Figueiredo and S. Wright First-Order Methods April / 68

37 Handling Nonsmoothness (e.g. $\ell_1$ Norm) Convexity $\Rightarrow$ continuity (on the domain of the function). Convexity $\not\Rightarrow$ differentiability (e.g., $\psi(x) = \|x\|_1$). Subgradients generalize gradients for general convex functions: $v$ is a subgradient of $f$ at $x$ if $f(x') \geq f(x) + v^T(x' - x)$ for all $x'$ (a linear lower bound). Subdifferential: $\partial f(x) = \{$all subgradients of $f$ at $x\}$. If $f$ is differentiable, $\partial f(x) = \{\nabla f(x)\}$. [Figures: linear lower bound; nondifferentiable case.] M. Figueiredo and S. Wright First-Order Methods April / 68

40 More on Subgradients and Subdifferentials The subdifferential is a set-valued function: $f: \mathbb{R}^d \to \mathbb{R}$, $\partial f: \mathbb{R}^d \to 2^{\mathbb{R}^d}$ (power set of $\mathbb{R}^d$). Example: $f(x) = \begin{cases} -2x - 1, & x \leq -1 \\ -x, & -1 < x \leq 0 \\ x^2/2, & x > 0 \end{cases}$, $\quad \partial f(x) = \begin{cases} \{-2\}, & x < -1 \\ [-2, -1], & x = -1 \\ \{-1\}, & -1 < x < 0 \\ [-1, 0], & x = 0 \\ \{x\}, & x > 0 \end{cases}$. Fermat's Rule: $x \in \arg\min_x f(x) \Leftrightarrow 0 \in \partial f(x)$. M. Figueiredo and S. Wright First-Order Methods April / 68

44 A Key Tool: Moreau's Proximity Operators Moreau (1962) proximity operator: $\mathrm{prox}_\psi(y) := \arg\min_x \frac{1}{2}\|x - y\|_2^2 + \psi(x)$ ...well defined for convex $\psi$, since $\frac{1}{2}\|\cdot - y\|_2^2$ is coercive and strictly convex. Example (seen above): $\mathrm{prox}_{\tau|\cdot|}(y) = \mathrm{soft}(y, \tau) = \mathrm{sign}(y)\max\{|y| - \tau, 0\}$. Block separability: if $x = (x_1, \ldots, x_N)$ (a partition of the components of $x$) and $\psi(x) = \sum_i \psi_i(x_i)$, then $(\mathrm{prox}_\psi(y))_i = \mathrm{prox}_{\psi_i}(y_i)$. Relationship with subdifferential: $z = \mathrm{prox}_\psi(y) \Leftrightarrow y - z \in \partial\psi(z)$. Resolvent: $z = \mathrm{prox}_\psi(y) \Leftrightarrow 0 \in \partial\psi(z) + (z - y) \Leftrightarrow y \in (\partial\psi + I)z \Leftrightarrow \mathrm{prox}_\psi(y) = (\partial\psi + I)^{-1} y$. M. Figueiredo and S. Wright First-Order Methods April / 68
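To make the prox concrete, here is a small NumPy sketch (mine) of the soft-thresholding operator together with a brute-force numerical check against the defining minimization for the ℓ1 case; the grid search is only for illustration.

```python
import numpy as np

def soft(y, tau):
    """Soft-thresholding: prox of tau*|.| applied componentwise,
    soft(y, tau) = sign(y) * max(|y| - tau, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# Check against the definition prox_psi(y) = argmin_x 0.5*(x-y)^2 + tau*|x|
y, tau = 1.3, 0.5
grid = np.linspace(-3, 3, 100001)
objective = 0.5 * (grid - y) ** 2 + tau * np.abs(grid)
print(soft(np.array([y]), tau)[0], grid[np.argmin(objective)])  # both ~0.8
```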

48 Important Proximity Operators Soft-thresholding is the proximity operator of the $\ell_1$ norm. Consider the indicator $\iota_S$ of a convex set $S$: $\mathrm{prox}_{\iota_S}(u) = \arg\min_x \frac{1}{2}\|x - u\|_2^2 + \iota_S(x) = \arg\min_{x \in S}\frac{1}{2}\|x - u\|_2^2 = P_S(u)$ ...the Euclidean projection on $S$. Squared Euclidean norm (separable, smooth): $\mathrm{prox}_{\frac{\tau}{2}\|\cdot\|_2^2}(y) = \arg\min_x \frac{1}{2}\|x - y\|_2^2 + \frac{\tau}{2}\|x\|_2^2 = \frac{y}{1 + \tau}$. Euclidean norm (not separable, nonsmooth): $\mathrm{prox}_{\tau\|\cdot\|_2}(y) = \begin{cases} \frac{y}{\|y\|_2}(\|y\|_2 - \tau), & \text{if } \|y\|_2 > \tau \\ 0, & \text{if } \|y\|_2 \leq \tau \end{cases}$. M. Figueiredo and S. Wright First-Order Methods April / 68
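Two of these prox maps in code (a sketch of my own): the prox of the Euclidean norm and the projection onto a Euclidean ball, i.e. the prox of its indicator; the numeric example is illustrative.

```python
import numpy as np

def prox_l2_norm(y, tau):
    """Prox of tau*||.||_2 (block soft-thresholding): shrink the whole
    vector toward zero by tau in Euclidean norm, or map it to zero."""
    norm_y = np.linalg.norm(y)
    if norm_y <= tau:
        return np.zeros_like(y)
    return (1.0 - tau / norm_y) * y

def project_l2_ball(u, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}, i.e. the prox of
    the indicator of that ball."""
    norm_u = np.linalg.norm(u)
    return u if norm_u <= radius else (radius / norm_u) * u

y = np.array([3.0, 4.0])          # ||y||_2 = 5
print(prox_l2_norm(y, 1.0))        # [2.4, 3.2], norm shrunk from 5 to 4
print(project_l2_ball(y, 2.0))     # [1.2, 1.6], norm 2
```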

49 More Proximity Operators (Combettes and Pesquet, 2011) M. Figueiredo and S. Wright First-Order Methods April / 68

53 Another Key Tool: Fenchel-Legendre Conjugates The Fenchel-Legendre conjugate of a proper convex function $f$, denoted by $f^*: \mathbb{R}^n \to \bar{\mathbb{R}}$, is defined by $f^*(u) = \sup_x \, x^T u - f(x)$. Main properties and relationship with proximity operators: Biconjugation: if $f$ is convex and proper, $f^{**} = f$. Moreau's decomposition: $\mathrm{prox}_f(u) + \mathrm{prox}_{f^*}(u) = u$ ...meaning that, if you know $\mathrm{prox}_f$, you know $\mathrm{prox}_{f^*}$, and vice-versa. Conjugate of indicator: if $f(x) = \iota_C(x)$, where $C$ is a convex set, $f^*(u) = \sup_x \, x^T u - \iota_C(x) = \sup_{x \in C} x^T u = \sigma_C(u)$ (the support function of $C$). M. Figueiredo and S. Wright First-Order Methods April / 68

55 From Conjugates to Proximity Operators Notice that $|u| = \sup_{x \in [-1,1]} x\,u = \sigma_{[-1,1]}(u)$, thus $|\cdot|^* = \iota_{[-1,1]}$. Using Moreau's decomposition, we easily derive the soft-threshold: $\mathrm{prox}_{\tau|\cdot|} = I - \mathrm{prox}_{\iota_{[-\tau,\tau]}} = I - P_{[-\tau,\tau]} = \mathrm{soft}(\cdot, \tau)$. Conjugate of a norm: if $f(x) = \tau\|x\|_p$ then $f^* = \iota_{\{x: \|x\|_q \leq \tau\}}$, where $\frac{1}{q} + \frac{1}{p} = 1$ (a Hölder pair, or Hölder conjugates). That is, $\|\cdot\|_p$ and $\|\cdot\|_q$ are dual norms: $\|z\|_q = \sup\{x^T z : \|x\|_p \leq 1\} = \sup_{x \in B_p(1)} x^T z = \sigma_{B_p(1)}(z)$. M. Figueiredo and S. Wright First-Order Methods April / 68
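A quick numerical illustration (my own) of Moreau's decomposition in the scalar case above: the prox of τ|·| plus the projection onto [−τ, τ] recovers the input.

```python
import numpy as np

def soft(y, tau):
    # prox of tau*|.|
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def project_interval(y, tau):
    # prox of the indicator of [-tau, tau], i.e. the prox of the conjugate
    return np.clip(y, -tau, tau)

y = np.array([-2.0, -0.3, 0.1, 1.7])
tau = 0.5
# Moreau decomposition: prox_f(y) + prox_{f*}(y) = y
print(soft(y, tau) + project_interval(y, tau))   # equals y
```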

58 From Conjugates to Proximity Operators Proximity operator of a norm: $\mathrm{prox}_{\tau\|\cdot\|_p} = I - P_{B_q(\tau)}$, where $B_q(\tau) = \{x : \|x\|_q \leq \tau\}$ and $\frac{1}{q} + \frac{1}{p} = 1$. Example: computing $\mathrm{prox}_{\tau\|\cdot\|_\infty}$ (notice $\ell_\infty$ is not separable): since the dual of $\|\cdot\|_\infty$ is $\|\cdot\|_1$, $\mathrm{prox}_{\tau\|\cdot\|_\infty} = I - P_{B_1(\tau)}$ ...the proximity operator of the $\ell_\infty$ norm is the residual of the projection on an $\ell_1$ ball. Projection on the $\ell_1$ ball has no closed form, but there are efficient (linear cost) algorithms (Brucker, 1984), (Maculan and de Paula, 1989). M. Figueiredo and S. Wright First-Order Methods April / 68
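As an illustration (mine), here is a simple O(n log n) sort-based ℓ1-ball projection rather than the linear-time algorithms cited above, together with the prox of the ℓ∞ norm obtained as its residual via Moreau's decomposition.

```python
import numpy as np

def project_l1_ball(u, tau):
    """Euclidean projection onto the l1 ball {x : ||x||_1 <= tau}.
    Simple O(n log n) sort-based version (the cited references give O(n) methods)."""
    if np.sum(np.abs(u)) <= tau:
        return u.copy()
    a = np.sort(np.abs(u))[::-1]                 # sorted magnitudes, descending
    cumsum = np.cumsum(a)
    rho = np.nonzero(a - (cumsum - tau) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (cumsum[rho] - tau) / (rho + 1.0)    # soft-threshold level
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def prox_linf(y, tau):
    """Prox of tau*||.||_inf via Moreau: residual of the projection onto the l1 ball."""
    return y - project_l1_ball(y, tau)

y = np.array([3.0, -1.0, 0.5])
print(project_l1_ball(y, 2.0))   # l1 norm of the result is 2
print(prox_linf(y, 2.0))         # largest entry pulled toward the others
```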

59 Geometry and Effect of $\mathrm{prox}_{\tau\|\cdot\|_\infty}$ Whereas $\ell_1$ regularization promotes sparsity, $\ell_\infty$ regularization promotes equality (in absolute value). M. Figueiredo and S. Wright First-Order Methods April / 68

60 From Conjugates to Proximity Operators The dual of the l 2 norm is the l 2 norm. M. Figueiredo and S. Wright First-Order Methods April / 68

63 Group Norms and their Prox Operators Group-norm regularizer: $\psi(x) = \sum_{m=1}^{M} \lambda_m \|x_{G_m}\|_p$. In the non-overlapping case ($G_1, \ldots, G_M$ is a partition of $\{1, \ldots, n\}$), simply use separability: $\left(\mathrm{prox}_\psi(u)\right)_{G_m} = \mathrm{prox}_{\lambda_m\|\cdot\|_p}(u_{G_m})$. In the tree-structured case, can get a complete ordering of the groups, $G_1 \preceq G_2 \preceq \ldots \preceq G_M$, where $G \preceq G'$ means $(G \subseteq G')$ or $(G \cap G' = \emptyset)$. Define $\Pi_m: \mathbb{R}^n \to \mathbb{R}^n$ by $(\Pi_m(u))_{G_m} = \mathrm{prox}_{\lambda_m\|\cdot\|_p}(u_{G_m})$, $(\Pi_m(u))_{\bar G_m} = u_{\bar G_m}$, where $\bar G_m = \{1, \ldots, n\} \setminus G_m$. Then $\mathrm{prox}_\psi = \Pi_M \circ \cdots \circ \Pi_2 \circ \Pi_1$ ...only valid for $p \in \{1, 2, \infty\}$ (Jenatton et al., 2011). M. Figueiredo and S. Wright First-Order Methods April / 68
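For the non-overlapping case, a small sketch (mine) of the group prox with p = 2, i.e. vector soft-thresholding applied block by block; the group partition and weights are illustrative.

```python
import numpy as np

def prox_group_l2(u, groups, lambdas):
    """Prox of sum_m lambda_m * ||x_{G_m}||_2 for a partition of indices:
    apply vector soft-thresholding independently on each group."""
    x = np.empty_like(u)
    for G, lam in zip(groups, lambdas):
        block = u[G]
        norm = np.linalg.norm(block)
        x[G] = 0.0 if norm <= lam else (1.0 - lam / norm) * block
    return x

u = np.array([3.0, 4.0, 0.2, -0.1, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_group_l2(u, groups, lambdas=[1.0, 1.0, 0.5]))
# first block shrunk, second block set to zero, third shrunk toward 0
```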

64 Matrix Nuclear Norm and its Prox Operator Recall the trace/nuclear norm: $\|X\|_* = \sum_{i=1}^{\min\{m,n\}} \sigma_i$. The dual of a Schatten $p$-norm is a Schatten $q$-norm, with $\frac{1}{q} + \frac{1}{p} = 1$. Thus, the dual of the nuclear norm is the spectral norm: $\|X\| = \max\{\sigma_1, \ldots, \sigma_{\min\{m,n\}}\}$. If $Y = U\Lambda V^T$ is the SVD of $Y$, we have $\mathrm{prox}_{\tau\|\cdot\|_*}(Y) = U\Lambda V^T - P_{\{X:\, \max\{\sigma_1,\ldots,\sigma_{\min\{m,n\}}\} \leq \tau\}}(U\Lambda V^T) = U\,\mathrm{soft}(\Lambda, \tau)\,V^T$. M. Figueiredo and S. Wright First-Order Methods April / 68
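A minimal NumPy sketch (mine) of this singular-value soft-thresholding; the random low-rank test matrix is illustrative.

```python
import numpy as np

def prox_nuclear(Y, tau):
    """Prox of tau*||.||_* : soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt     # same as U @ diag(s_shrunk) @ Vt

rng = np.random.default_rng(4)
Y = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))  # rank 3
X = prox_nuclear(Y, tau=1.0)
print(np.linalg.matrix_rank(Y), np.linalg.matrix_rank(X))  # rank can only drop
```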

65 Atomic Norms: A Unified View M. Figueiredo and S. Wright First-Order Methods April / 68

69 Another Use of Fenchel-Legendre Conjugates The original problem: $\min_x f(x) + \psi(x)$. Often this has the form $\min_x g(Ax) + \psi(x)$. Using the definition of the conjugate, $g(Ax) = \sup_u \, u^T A x - g^*(u)$, so $\min_x g(Ax) + \psi(x) = \inf_x \sup_u \left( u^T A x - g^*(u) + \psi(x) \right) = \sup_u \left( -g^*(u) + \inf_x \left( u^T A x + \psi(x) \right) \right) = \sup_u \Big( -g^*(u) - \underbrace{\sup_x \left( x^T(-A^T u) - \psi(x) \right)}_{\psi^*(-A^T u)} \Big) = -\inf_u \left( g^*(u) + \psi^*(-A^T u) \right)$. The problem $\inf_u \, g^*(u) + \psi^*(-A^T u)$ is sometimes easier to handle. M. Figueiredo and S. Wright First-Order Methods April / 68

70 Basic Proximal-Gradient Algorithm Use the basic structure $x_{k+1} = \arg\min_x \frac{1}{2}\|x - \Phi(x_k)\|_2^2 + \alpha_k\psi(x)$, with $\Phi(x_k) = x_k - \alpha_k\nabla f(x_k)$ a simple gradient descent step, thus $x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k \nabla f(x_k)\right)$. This approach goes by different names, such as proximal gradient algorithm (PGA), iterative shrinkage/thresholding (IST or ISTA), forward-backward splitting (FBS). It has been proposed in different communities: optimization, PDEs, convex analysis, signal processing, machine learning. M. Figueiredo and S. Wright First-Order Methods April / 68
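A minimal sketch (my own) of this proximal-gradient / ISTA iteration, taking the gradient of f, a step size, and a prox callable; the ℓ2-ℓ1 instance below and all names are illustrative.

```python
import numpy as np

def proximal_gradient(grad_f, prox_psi, x0, alpha, max_iter=500):
    """Proximal-gradient (IST/ISTA/FBS) iteration:
    x_{k+1} = prox_{alpha*psi}(x_k - alpha * grad f(x_k))."""
    x = x0.copy()
    for _ in range(max_iter):
        x = prox_psi(x - alpha * grad_f(x), alpha)
    return x

# l2-l1 example: min 0.5||Bx - b||^2 + tau*||x||_1
rng = np.random.default_rng(5)
B = rng.standard_normal((50, 100)); b = rng.standard_normal(50)
tau = 0.5
grad_f = lambda x: B.T @ (B @ x - b)
prox_psi = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a * tau, 0.0)
alpha = 1.0 / np.linalg.norm(B, 2) ** 2          # step below 2/L
x_hat = proximal_gradient(grad_f, prox_psi, np.zeros(100), alpha)
print(np.count_nonzero(x_hat), "nonzeros out of", x_hat.size)
```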

73 Convergence of the Proximal-Gradient Algorithm Basic algorithm: $x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k \nabla f(x_k)\right)$. Generalized (possibly inexact) version: $x_{k+1} = (1 - \lambda_k) x_k + \lambda_k \left( \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k\left(\nabla f(x_k) + b_k\right)\right) + a_k \right)$, where $a_k$ and $b_k$ are errors in computing the prox and the gradient, and $\lambda_k$ is an over-relaxation parameter. Convergence is guaranteed (Combettes and Wajs, 2006) if $0 < \inf \alpha_k \leq \sup \alpha_k < \frac{2}{L}$, $\lambda_k \in (0, 1]$ with $\inf \lambda_k > 0$, $\sum_k \|a_k\| < \infty$ and $\sum_k \|b_k\| < \infty$. M. Figueiredo and S. Wright First-Order Methods April / 68

78 Proximal-Gradient Algorithm: Quadratic Case Consider the quadratic case (of great interest): $f(x) = \frac{1}{2}\|Bx - b\|_2^2$. Here, $\nabla f(x) = B^T(Bx - b)$ and the IST/PGA/FBS algorithm is $x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k B^T(Bx_k - b)\right)$, which requires only matrix-vector multiplications with $B$ and $B^T$. Very important in signal/image processing, where fast algorithms exist to compute these products (e.g. FFT, DWT). In this case, some more refined convergence results are available. Even more refined results are available if $\psi(x) = \tau\|x\|_1$. M. Figueiredo and S. Wright First-Order Methods April / 68

81 More on IST/FBS/PGA for the $\ell_2$-$\ell_1$ Case Problem: $\hat x = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$ (recall $B^T B \preceq L I$). IST/FBS/PGA becomes $x_{k+1} = \mathrm{soft}\left(x_k - \alpha B^T(Bx_k - b), \alpha\tau\right)$ with $\alpha < 2/L$. The zero set: $Z \subseteq \{1, \ldots, n\}$ such that $\hat x_Z = 0$. Zeros are found in a finite number of iterations (Hale et al., 2008): after a finite number of iterations, $(x_k)_Z = 0$. After that, if $B_{\bar Z}^T B_{\bar Z} \succeq \mu I$ with $\mu > 0$ (where $\bar Z$ is the support, thus $\kappa(B_{\bar Z}^T B_{\bar Z}) = L/\mu < \infty$), we have linear convergence $\|x_{k+1} - \hat x\|_2 \leq \frac{\kappa - 1}{\kappa + 1}\|x_k - \hat x\|_2$ for the optimal choice $\alpha = 2/(L + \mu)$ (see unconstrained theory). M. Figueiredo and S. Wright First-Order Methods April / 68

84 Heavy Ball Acceleration: FISTA FISTA (fast iterative shrinkage-thresholding algorithm) is a heavy-ball-type acceleration of IST (based on Nesterov (1983)) (Beck and Teboulle, 2009). Initialize: Choose $\alpha \leq 1/L$, $x_0$; set $y_1 = x_0$, $t_1 = 1$. Iterate: $x_k \leftarrow \mathrm{prox}_{\alpha\tau\psi}\left(y_k - \alpha\nabla f(y_k)\right)$; $t_{k+1} \leftarrow \frac{1}{2}\left(1 + \sqrt{1 + 4t_k^2}\right)$; $y_{k+1} \leftarrow x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$. Acceleration: FISTA: $f(x_k) - f(\hat x) \sim O\left(\frac{1}{k^2}\right)$; IST: $f(x_k) - f(\hat x) \sim O\left(\frac{1}{k}\right)$. When $L$ is not known, increase an estimate of $L$ until it's big enough. M. Figueiredo and S. Wright First-Order Methods April / 68
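A compact FISTA sketch (mine), reusing the prox-callable convention of the earlier proximal-gradient sketch; the toy ℓ2-ℓ1 problem is illustrative.

```python
import numpy as np

def fista(grad_f, prox_psi, x0, alpha, max_iter=500):
    """FISTA: proximal-gradient step from y_k plus the t_k extrapolation."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(max_iter):
        x = prox_psi(y - alpha * grad_f(y), alpha)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

# Same l2-l1 toy problem as before
rng = np.random.default_rng(6)
B = rng.standard_normal((50, 100)); b = rng.standard_normal(50)
tau = 0.5
grad_f = lambda x: B.T @ (B @ x - b)
prox_psi = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a * tau, 0.0)
alpha = 1.0 / np.linalg.norm(B, 2) ** 2           # alpha <= 1/L
x_hat = fista(grad_f, prox_psi, np.zeros(100), alpha)
print(0.5 * np.sum((B @ x_hat - b) ** 2) + tau * np.sum(np.abs(x_hat)))
```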

88 Heavy Ball Acceleration: TwIST TwIST (two-step iterative shrinkage/thresholding (Bioucas-Dias and Figueiredo, 2007)) is a heavy-ball-type acceleration of IST for $\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\psi(x)$. Iterations (with $\alpha < 2/L$): $x_{k+1} = (\gamma - \beta)\,x_k + (1 - \gamma)\,x_{k-1} + \beta\,\mathrm{prox}_{\alpha\tau\psi}\left(x_k - \alpha B^T(Bx_k - b)\right)$. Analysis in the strongly convex case: $\mu I \preceq B^T B \preceq L I$, with $\mu > 0$; condition number $\kappa = L/\mu < \infty$. Optimal parameters $\gamma = \rho^2 + 1$, $\beta = \frac{2\alpha}{\mu + L}$, where $\rho = \frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}$, yield linear convergence $\|x_{k+1} - \hat x\|_2 \leq \frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\|x_k - \hat x\|_2$ (versus $\frac{\kappa - 1}{\kappa + 1}$ for IST). M. Figueiredo and S. Wright First-Order Methods April / 68

89 Illustration of the TwIST Acceleration M. Figueiredo and S. Wright First-Order Methods April / 68

92 Acceleration via Larger Steps: SpaRSA The standard step size $\alpha_k < 2/L$ in IST is too timid. The SpaRSA (sparse reconstruction by separable approximation) framework proposes bolder choices of $\alpha_k$ (Wright et al., 2009): Barzilai-Borwein, to mimic Newton steps; keep increasing $\alpha_k$ until monotonicity is violated: backtrack. Convergence to critical points (minima in the convex case) is guaranteed for a safeguarded version: ensure sufficient decrease w.r.t. the worst value in the previous $M$ iterations. M. Figueiredo and S. Wright First-Order Methods April / 68

95 Another Approach: Gradient Projection $\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$ can be written as a standard QP: $\min_{u,v} \frac{1}{2}\|B(u - v) - b\|_2^2 + \tau\,\mathbf{1}^T u + \tau\,\mathbf{1}^T v$ s.t. $u \geq 0$, $v \geq 0$, where $u_i = \max\{0, x_i\}$ and $v_i = \max\{0, -x_i\}$. With $z = \begin{bmatrix} u \\ v \end{bmatrix}$, the problem can be written in canonical form: $\min_z \frac{1}{2} z^T Q z + c^T z$ s.t. $z \geq 0$. Solving this problem with projected gradient using Barzilai-Borwein steps: GPSR (gradient projection for sparse reconstruction) (Figueiredo et al., 2007). M. Figueiredo and S. Wright First-Order Methods April / 68

96 Speed Comparisons Lorenz (2011) proposed a way of generating problem instances with known solution $\bar x$: useful for speed comparison. Define: $R_k = \frac{\|x_k - \bar x\|_2}{\|\bar x\|_2}$ and $r_k = \frac{L(x_k) - L(\bar x)}{L(\bar x)}$ (where $L(x) = f(x) + \tau\psi(x)$). M. Figueiredo and S. Wright First-Order Methods April / 68

97 More Speed Comparisons M. Figueiredo and S. Wright First-Order Methods April / 68

98 Even More Speed Comparisons M. Figueiredo and S. Wright First-Order Methods April / 68

102 Acceleration by Continuation IST/FBS/PGA can be very slow if $\tau$ is very small and/or $f$ is poorly conditioned. A very simple acceleration strategy: continuation/homotopy. Initialization: Set $\tau_0 > \tau$, starting point $\bar x$, factor $\sigma \in (0, 1)$, and $k = 0$. Iterations: Find an approximate solution $x(\tau_k)$ of $\min_x f(x) + \tau_k\psi(x)$, starting from $\bar x$; if $\tau_k = \tau$, STOP; set $\tau_{k+1} \leftarrow \max(\tau, \sigma\tau_k)$ and $\bar x \leftarrow x(\tau_k)$. Often the solution path $x(\tau)$, for a range of values of $\tau$, is desired anyway (e.g., within an outer method to choose an optimal $\tau$). Shown to be very effective in practice (Hale et al., 2008; Wright et al., 2009). Recently analyzed by Xiao and Zhang (2012). M. Figueiredo and S. Wright First-Order Methods April / 68
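A small sketch (mine) of this warm-started continuation loop wrapped around a few inner IST iterations; the choice of τ₀, the inner solver, the iteration counts, and the toy data are all illustrative.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(B, b, tau, x0, alpha, iters=100):
    """A few inner IST iterations for min 0.5||Bx-b||^2 + tau||x||_1."""
    x = x0.copy()
    for _ in range(iters):
        x = soft(x - alpha * B.T @ (B @ x - b), alpha * tau)
    return x

def continuation(B, b, tau, sigma=0.5, inner_iters=100):
    """Warm-started continuation: solve a sequence of problems with
    decreasing regularization tau_0 > tau_1 > ... down to the target tau."""
    alpha = 1.0 / np.linalg.norm(B, 2) ** 2
    tau_k = 0.9 * np.max(np.abs(B.T @ b))   # large tau_0 (x = 0 solves tau >= ||B^T b||_inf)
    x = np.zeros(B.shape[1])
    while True:
        x = ista(B, b, tau_k, x, alpha, inner_iters)
        if tau_k == tau:
            return x
        tau_k = max(tau, sigma * tau_k)

rng = np.random.default_rng(7)
B = rng.standard_normal((60, 200)); b = rng.standard_normal(60)
x_hat = continuation(B, b, tau=0.1)
print(np.count_nonzero(x_hat))
```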

103 Acceleration by Continuation: An Example Classical sparse reconstruction problem (Wright et al., 2009): $\hat x = \arg\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$, with $B \in \mathbb{R}^{1024 \times 4096}$ (thus $x \in \mathbb{R}^{4096}$ and $b \in \mathbb{R}^{1024}$). M. Figueiredo and S. Wright First-Order Methods April / 68

104 A Final Touch: Debiasing Consider problems of the form $\hat x = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$. Often, the original goal was to minimize the quadratic term, after the support of $x$ had been found. But the $\ell_1$ term can cause the nonzero values of $x_i$ to be suppressed. Debiasing: find the zero set (complement of the support of $\hat x$), $Z(\hat x) = \{1, \ldots, n\} \setminus \mathrm{supp}(\hat x)$; solve $\min_x \|Bx - b\|_2^2$ s.t. $x_{Z(\hat x)} = 0$. (Fix the zeros and solve an unconstrained problem over the support.) Often, this problem has to be solved using an algorithm that only involves products by $B$ and $B^T$, since this matrix cannot be partitioned. M. Figueiredo and S. Wright First-Order Methods April / 68

105 Effect of Debiasing [Figure: original signal (n = 4096, 160 nonzeros); SpaRSA reconstruction (m = 1024, τ = 0.08); debiased reconstruction (MSE = 3.377e-5); minimum-norm solution (MSE = 1.568).] M. Figueiredo and S. Wright First-Order Methods April / 68

106 Example: Matrix Recovery (Toh and Yun, 2010) M. Figueiredo and S. Wright First-Order Methods April / 68

107 Conditional Gradient Also known as Frank-Wolfe after the authors who devised it in the 1950s. Later analysis by Dunn (around 1990). Suddenly a topic of enormous renewed interest; see for example (Jaggi, 2013). $\min_{x \in \Omega} f(x)$, where $f$ is a convex function and $\Omega$ is a closed, bounded, convex set. Start at $x_0 \in \Omega$. At iteration $k$: $v_k := \arg\min_{v \in \Omega} v^T \nabla f(x_k)$; $x_{k+1} := (1 - \alpha_k) x_k + \alpha_k v_k$, with $\alpha_k = \frac{2}{k + 2}$. Potentially useful when it is easy to minimize a linear function over the original constraint set $\Omega$. Admits an elementary convergence theory: 1/k sublinear rate. The same convergence theory holds if we use a line search for $\alpha_k$. M. Figueiredo and S. Wright First-Order Methods April / 68

108 Conditional Gradient for Atomic-Norm Constraints Conditional Gradient is particularly useful for optimization over atomic-norm constraints: $\min f(x)$ s.t. $\|x\|_{\mathcal{A}} \leq \tau$. Reminder: Given the set of atoms $\mathcal{A}$ (possibly infinite) we have $\|x\|_{\mathcal{A}} := \inf\left\{\sum_{a \in \mathcal{A}} c_a : x = \sum_{a \in \mathcal{A}} c_a a,\; c_a \geq 0\right\}$. The search direction $v_k$ is $\tau \bar a_k$, where $\bar a_k := \arg\min_{a \in \mathcal{A}} \langle a, \nabla f(x_k)\rangle$. That is, we seek the atom that lines up best with the negative gradient direction $-\nabla f(x_k)$. M. Figueiredo and S. Wright First-Order Methods April / 68

109 Generating Atoms We can think of each step as the addition of a new atom to the basis. Note that $x_k$ is expressed in terms of $\{\bar a_0, \bar a_1, \ldots, \bar a_k\}$. If few iterations are needed to find a solution of acceptable accuracy, then we have an approximate solution that's represented in terms of few atoms, that is, sparse or compactly represented. For many atomic sets $\mathcal{A}$ of interest, the new atom can be found cheaply. Example: For the constraint $\|x\|_1 \leq \tau$, the atoms are $\{\pm e_i : i = 1, 2, \ldots, n\}$. If $i_k$ is the index at which $|[\nabla f(x_k)]_i|$ attains its maximum, we have $\bar a_k = -\mathrm{sign}([\nabla f(x_k)]_{i_k})\, e_{i_k}$. Example: For the constraint $\|x\|_\infty \leq \tau$, the atoms are the $2^n$ vectors with entries $\pm 1$. We have $[\bar a_k]_i = -\mathrm{sign}[\nabla f(x_k)]_i$, $i = 1, 2, \ldots, n$. M. Figueiredo and S. Wright First-Order Methods April / 68
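A minimal Frank-Wolfe sketch (mine) for an ℓ1-ball constraint using exactly this cheap signed-coordinate atom selection; the least-squares objective and all names are illustrative.

```python
import numpy as np

def frank_wolfe_l1(grad_f, tau, n, max_iter=200):
    """Conditional gradient over {x : ||x||_1 <= tau}: at each step the
    linear subproblem is solved by a single signed coordinate atom."""
    x = np.zeros(n)
    for k in range(max_iter):
        g = grad_f(x)
        i = np.argmax(np.abs(g))                  # best atom index
        v = np.zeros(n)
        v[i] = -tau * np.sign(g[i])               # v_k = tau * a_bar_k
        gamma = 2.0 / (k + 2.0)
        x = (1.0 - gamma) * x + gamma * v
    return x

rng = np.random.default_rng(8)
B = rng.standard_normal((40, 80)); b = rng.standard_normal(40)
grad_f = lambda x: B.T @ (B @ x - b)
x = frank_wolfe_l1(grad_f, tau=2.0, n=80)
print(np.sum(np.abs(x)) <= 2.0 + 1e-12, np.count_nonzero(x))
```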

110 More Examples Example: Nuclear Norm. For the constraint $\|X\|_* \leq \tau$, for which the atoms are the rank-one matrices, we have $\bar A_k = -u_k v_k^T$, where $u_k$ and $v_k$ are the first columns of the matrices $U_k$ and $V_k$ obtained from the SVD $\nabla f(X_k) = U_k \Sigma_k V_k^T$. Example: sum-of-$\ell_2$. For the constraint $\sum_{i=1}^{m}\|x_{[i]}\|_2 \leq \tau$, the atoms are the vectors $a$ that contain all zeros except for a vector $u_{[i]}$ with unit 2-norm in the $[i]$ block position. (Infinitely many.) The atom $\bar a_k$ contains nonzero components in the block $i_k$ for which $\|[\nabla f(x_k)]_{[i]}\|$ is maximized, and the nonzero part is $u_{[i_k]} = -[\nabla f(x_k)]_{[i_k]} / \|[\nabla f(x_k)]_{[i_k]}\|$. M. Figueiredo and S. Wright First-Order Methods April / 68

111 Other Enhancements Reoptimizing. Instead of fixing the contribution α k from each atom at the time it joins the basis, we can periodically and approximately reoptimize over the current basis. This is a finite dimension optimization problem over the (nonnegative) coefficients of the basis atoms. It need only be solved approximately. If any coefficient is reduced to zero, it can be dropped from the basis. Dropping Atoms. Sparsity of the solution can be improved by dropping atoms from the basis, if doing so does not degrade the value of f too much (see (Rao et al., 2013)). In the important least-squares case, the effect of dropping can be evaluated efficiently. M. Figueiredo and S. Wright First-Order Methods April / 68

112 References I Akaike, H. (1959). On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Annals of the Institute of Statistics and Mathematics of Tokyo, 11:1–17. Barzilai, J. and Borwein, J. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8: Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1): Bioucas-Dias, J. and Figueiredo, M. (2007). A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16: Brucker, P. (1984). An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3: Candès, E. and Romberg, J. (2005). l1-MAGIC: Recovery of sparse signals via convex programming. Technical report, California Institute of Technology. Combettes, P. and Pesquet, J.-C. (2011). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer. Combettes, P. and Wajs, V. (2006). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4: M. Figueiredo and S. Wright First-Order Methods April / 68

113 References II Ferris, M. C. and Munson, T. S. (2002). Interior-point methods for massive support vector machines. SIAM Journal on Optimization, 13(3): Figueiredo, M., Nowak, R., and Wright, S. (2007). Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing: Special Issue on Convex Optimization Methods for Signal Processing, 1: Fountoulakis, K., Gondzio, J., and Zhlobich, P. (2012). Matrix-free interior point method for compressed sensing problems. Technical Report, University of Edinburgh. Gertz, E. M. and Wright, S. J. (2003). Object-oriented software for quadratic programming. ACM Transations on Mathematical Software, 29: Gondzio, J. and Fountoulakis, K. (2013). Second-order methods for l1-regularization. Talk at Optimization and Big Data Workshop, Edinburgh. Hale, E., Yin, W., and Zhang, Y. (2008). Fixed-point continuation for l1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19: Jaggi, M. (2013). Revisiting frank-wolfe: Projection-free sparse convex optimization. Technical Report, École Polytechnique, France. Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011). Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12: Lorenz, D. (2011). Constructing test instances for basis pursuit denoising. arxiv.org/abs/ M. Figueiredo and S. Wright First-Order Methods April / 68

114 References III Maculan, N. and de Paula, G. G. (1989). A linear-time median-finding algorithm for projecting a vector on the simplex of R n. Operations Research Letters, 8: Moreau, J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math, 255: Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k 2 ). Soviet Math. Doklady, 27: Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, New York. Rao, N., Shah, P., Wright, S. J., and Nowak, R. (2013). A greedy forward-backward algorithm for atomic-norm-constrained optimization. In Proceedings of ICASSP. Toh, K.-C. and Yun, S. (2010). An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6: Wright, S., Nowak, R., and Figueiredo, M. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57: Xiao, L. and Zhang, T. (2012). A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization. (to appear; available at M. Figueiredo and S. Wright First-Order Methods April / 68


This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization x = prox_f(x)+prox_{f^*}(x) use to get prox of norms! PROXIMAL METHODS WHY PROXIMAL METHODS Smooth

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Step-size Estimation for Unconstrained Optimization Methods

Step-size Estimation for Unconstrained Optimization Methods Volume 24, N. 3, pp. 399 416, 2005 Copyright 2005 SBMAC ISSN 0101-8205 www.scielo.br/cam Step-size Estimation for Unconstrained Optimization Methods ZHEN-JUN SHI 1,2 and JIE SHEN 3 1 College of Operations

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

About Split Proximal Algorithms for the Q-Lasso

About Split Proximal Algorithms for the Q-Lasso Thai Journal of Mathematics Volume 5 (207) Number : 7 http://thaijmath.in.cmu.ac.th ISSN 686-0209 About Split Proximal Algorithms for the Q-Lasso Abdellatif Moudafi Aix Marseille Université, CNRS-L.S.I.S

More information

arxiv: v1 [cs.lg] 20 Feb 2014

arxiv: v1 [cs.lg] 20 Feb 2014 GROUP-SPARSE MATRI RECOVERY iangrong Zeng and Mário A. T. Figueiredo Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal ariv:1402.5077v1 [cs.lg] 20 Feb 2014 ABSTRACT We apply the

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018 Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Proximal Algorithm Meets a Conjugate descent

Proximal Algorithm Meets a Conjugate descent Proximal Algorithm Meets a Conjugate descent Matthieu Kowalski To cite this version: Matthieu Kowalski. Proximal Algorithm Meets a Conjugate descent. 2010. HAL Id: hal-00505733 https://hal.archives-ouvertes.fr/hal-00505733v1

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Large-Scale L1-Related Minimization in Compressive Sensing and Beyond

Large-Scale L1-Related Minimization in Compressive Sensing and Beyond Large-Scale L1-Related Minimization in Compressive Sensing and Beyond Yin Zhang Department of Computational and Applied Mathematics Rice University, Houston, Texas, U.S.A. Arizona State University March

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

MS&E 318 (CME 338) Large-Scale Numerical Optimization

MS&E 318 (CME 338) Large-Scale Numerical Optimization Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods

More information

January 29, Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes Stiefel 1 / 13

January 29, Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes Stiefel 1 / 13 Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière Hestenes Stiefel January 29, 2014 Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes

More information

Homotopy methods based on l 0 norm for the compressed sensing problem

Homotopy methods based on l 0 norm for the compressed sensing problem Homotopy methods based on l 0 norm for the compressed sensing problem Wenxing Zhu, Zhengshan Dong Center for Discrete Mathematics and Theoretical Computer Science, Fuzhou University, Fuzhou 350108, China

More information

Math 164: Optimization Barzilai-Borwein Method

Math 164: Optimization Barzilai-Borwein Method Math 164: Optimization Barzilai-Borwein Method Instructor: Wotao Yin Department of Mathematics, UCLA Spring 2015 online discussions on piazza.com Main features of the Barzilai-Borwein (BB) method The BB

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

Convex Optimization and l 1 -minimization

Convex Optimization and l 1 -minimization Convex Optimization and l 1 -minimization Sangwoon Yun Computational Sciences Korea Institute for Advanced Study December 11, 2009 2009 NIMS Thematic Winter School Outline I. Convex Optimization II. l

More information

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1 EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

Optimized first-order minimization methods

Optimized first-order minimization methods Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization

More information

Optimization in Machine Learning

Optimization in Machine Learning Optimization in Machine Learning Stephen Wright University of Wisconsin-Madison Singapore, 14 Dec 2012 Stephen Wright (UW-Madison) Optimization Singapore, 14 Dec 2012 1 / 48 1 Learning from Data: Applications

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

Math 273a: Optimization Convex Conjugacy

Math 273a: Optimization Convex Conjugacy Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper

More information

Proximal Methods for Optimization with Spasity-inducing Norms

Proximal Methods for Optimization with Spasity-inducing Norms Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Lecture: Algorithms for Compressed Sensing

Lecture: Algorithms for Compressed Sensing 1/56 Lecture: Algorithms for Compressed Sensing Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:

More information

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Clément Royer (Université du Wisconsin-Madison, États-Unis) Toulouse, 8 janvier 2019 Nonconvex optimization via Newton-CG

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Lecture 23: November 19

Lecture 23: November 19 10-725/36-725: Conve Optimization Fall 2018 Lecturer: Ryan Tibshirani Lecture 23: November 19 Scribes: Charvi Rastogi, George Stoica, Shuo Li Charvi Rastogi: 23.1-23.4.2, George Stoica: 23.4.3-23.8, Shuo

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

Linear Programming: Simplex

Linear Programming: Simplex Linear Programming: Simplex Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Linear Programming: Simplex IMA, August 2016

More information

Composite nonlinear models at scale

Composite nonlinear models at scale Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)

More information

SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS

SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS XIANTAO XIAO, YONGFENG LI, ZAIWEN WEN, AND LIWEI ZHANG Abstract. The goal of this paper is to study approaches to bridge the gap between

More information

Introduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems

Introduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems Z. Akbari 1, R. Yousefpour 2, M. R. Peyghami 3 1 Department of Mathematics, K.N. Toosi University of Technology,

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via convex relaxations Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

arxiv: v1 [cs.na] 13 Nov 2014

arxiv: v1 [cs.na] 13 Nov 2014 A FIELD GUIDE TO FORWARD-BACKWARD SPLITTING WITH A FASTA IMPLEMENTATION TOM GOLDSTEIN, CHRISTOPH STUDER, RICHARD BARANIUK arxiv:1411.3406v1 [cs.na] 13 Nov 2014 Abstract. Non-differentiable and constrained

More information

Elaine T. Hale, Wotao Yin, Yin Zhang

Elaine T. Hale, Wotao Yin, Yin Zhang , Wotao Yin, Yin Zhang Department of Computational and Applied Mathematics Rice University McMaster University, ICCOPT II-MOPTA 2007 August 13, 2007 1 with Noise 2 3 4 1 with Noise 2 3 4 1 with Noise 2

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Gradient Descent, Newton-like Methods Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them in class/help-session/tutorial this

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Compressed Sensing: a Subgradient Descent Method for Missing Data Problems

Compressed Sensing: a Subgradient Descent Method for Missing Data Problems Compressed Sensing: a Subgradient Descent Method for Missing Data Problems ANZIAM, Jan 30 Feb 3, 2011 Jonathan M. Borwein Jointly with D. Russell Luke, University of Goettingen FRSC FAAAS FBAS FAA Director,

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

Introduction to Alternating Direction Method of Multipliers

Introduction to Alternating Direction Method of Multipliers Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction

More information

Interpolation-Based Trust-Region Methods for DFO

Interpolation-Based Trust-Region Methods for DFO Interpolation-Based Trust-Region Methods for DFO Luis Nunes Vicente University of Coimbra (joint work with A. Bandeira, A. R. Conn, S. Gratton, and K. Scheinberg) July 27, 2010 ICCOPT, Santiago http//www.mat.uc.pt/~lnv

More information

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22

More information