Convex Optimization in Machine Learning and Inverse Problems Part 2: First-Order Methods

1 Convex Optimization in Machine Learning and Inverse Problems Part 2: First-Order Methods Mário A. T. Figueiredo 1 and Stephen J. Wright 2 1 Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal 2 Computer Sciences Department, University of Wisconsin, Madison, WI, USA Condensed version of ICCOPT tutorial, Lisbon, Portugal, 2013 M. Figueiredo and S. Wright First-Order Methods April / 68

2 In general, optimization problems are unsolvable, Nesterov, 1984 M. Figueiredo and S. Wright First-Order Methods April / 68

4 Focus (Initially) on Smooth Convex Functions Consider $\min f(x)$, $x \in \mathbb{R}^n$, with $f$ smooth and convex. Usually assume $\mu I \preceq \nabla^2 f(x) \preceq L I$ for all $x$, with $0 \leq \mu \leq L$ (thus $L$ is a Lipschitz constant of $\nabla f$). If $\mu > 0$, then $f$ is $\mu$-strongly convex (as seen in Part 1) and $f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}\|y - x\|_2^2$. Define conditioning (or condition number) as $\kappa := L/\mu$. We are often interested in convex quadratics: $f(x) = \frac{1}{2} x^T A x$ with $\mu I \preceq A \preceq L I$, or $f(x) = \frac{1}{2}\|Bx - b\|_2^2$ with $\mu I \preceq B^T B \preceq L I$. M. Figueiredo and S. Wright First-Order Methods April / 68

5 What's the Setup? We consider iterative algorithms $x_{k+1} = \Phi(x_k)$, or $x_{k+1} = \Phi(x_k, x_{k-1})$. For now, assume we can evaluate $f(x_k)$ and $\nabla f(x_k)$ at each iteration. Later, we look at broader classes of problems: nonsmooth $f$; $\nabla f$ not available (or too expensive to evaluate exactly); only an estimate of the gradient is available; nonsmooth regularization, i.e., instead of simply $f(x)$, we want to minimize $f(x) + \tau\psi(x)$. We focus on algorithms that can be adapted to those scenarios. M. Figueiredo and S. Wright First-Order Methods April / 68

6 Skip to slide 29 Recall Francis Bach's talk M. Figueiredo and S. Wright First-Order Methods April / 68

7 Steepest Descent Steepest descent (a.k.a. gradient descent): $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$, for some $\alpha_k > 0$. Different ways to select an appropriate $\alpha_k$: 1. Hard: interpolating scheme with safeguarding to identify an approximately minimizing $\alpha_k$. 2. Easy: backtracking. $\bar\alpha, \bar\alpha/2, \bar\alpha/4, \bar\alpha/8, \ldots$ until sufficient decrease in $f$ is obtained. 3. Trivial: don't test for function decrease; use rules based on $L$ and $\mu$. Analysis for 1 and 2 usually yields global convergence at unspecified rate. The greedy strategy of getting good decrease in the current search direction may lead to better practical results. Analysis for 3: Focuses on convergence rate, and leads to accelerated multi-step methods. M. Figueiredo and S. Wright First-Order Methods April / 68
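As a concrete illustration of option 2 (not part of the original slides), here is a minimal NumPy sketch of steepest descent with a backtracking, sufficient-decrease step choice; the function names and the toy least-squares problem are illustrative only.

```python
import numpy as np

def gradient_descent_backtracking(f, grad, x0, alpha_bar=1.0, c1=1e-4,
                                  max_iter=500, tol=1e-8):
    """Steepest descent with backtracking: try alpha_bar, alpha_bar/2, ...
    until the sufficient-decrease condition holds."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha_bar
        # backtrack until f(x - alpha*g) <= f(x) - c1*alpha*||g||^2
        while f(x - alpha * g) > f(x) - c1 * alpha * (g @ g):
            alpha /= 2.0
        x = x - alpha * g
    return x

# Toy strongly convex quadratic: f(x) = 0.5*||B x - b||^2
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.sum((B @ x - b) ** 2)
grad = lambda x: B.T @ (B @ x - b)
x_star = gradient_descent_backtracking(f, grad, np.zeros(5))
print(np.linalg.norm(grad(x_star)))  # near-zero gradient norm
```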

8 Line Search Seek $\alpha_k$ that satisfies the Wolfe conditions: sufficient decrease in $f$: $f(x_k - \alpha_k \nabla f(x_k)) \leq f(x_k) - c_1 \alpha_k \|\nabla f(x_k)\|^2$, $(0 < c_1 < 1)$, while not being too small (significant increase in the directional derivative): $\nabla f(x_{k+1})^T \nabla f(x_k) \leq c_2 \|\nabla f(x_k)\|^2$, $(c_1 < c_2 < 1)$. (Works for nonconvex $f$.) Can show that accumulation points $\bar x$ of $\{x_k\}$ are stationary: $\nabla f(\bar x) = 0$ (thus minimizers, if $f$ is convex). Can do one-dimensional line search for $\alpha_k$, taking minima of quadratic or cubic interpolations of the function and gradient at the last two values tried. Use brackets to ensure steady convergence. Often finds suitable $\alpha$ within 3 attempts. (Nocedal and Wright, 2006) M. Figueiredo and S. Wright First-Order Methods April / 68

9 Backtracking Try $\alpha_k = \bar\alpha, \bar\alpha/2, \bar\alpha/4, \bar\alpha/8, \ldots$ until the sufficient decrease condition is satisfied. No need to check the second Wolfe condition: the $\alpha_k$ thus identified is within striking distance of an $\alpha$ that's too large, so it is not too short. Backtracking is widely used in applications, but doesn't work on nonsmooth problems, or when $f$ is not available / too expensive. M. Figueiredo and S. Wright First-Order Methods April / 68

10 Constant (Short) Steplength By elementary use of Taylor's theorem, and since $\nabla^2 f(x) \preceq L I$, $f(x_{k+1}) \leq f(x_k) - \alpha_k\left(1 - \frac{\alpha_k}{2} L\right)\|\nabla f(x_k)\|_2^2$. For $\alpha_k \equiv 1/L$, $f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|_2^2$, thus $\|\nabla f(x_k)\|_2^2 \leq 2L\,[f(x_k) - f(x_{k+1})]$. Summing for $k = 0, 1, \ldots, N$, and telescoping the sum, $\sum_{k=0}^{N}\|\nabla f(x_k)\|_2^2 \leq 2L\,[f(x_0) - f(x_{N+1})]$. It follows that $\nabla f(x_k) \to 0$ if $f$ is bounded below. M. Figueiredo and S. Wright First-Order Methods April / 68

11 Rate Analysis Suppose that the minimizer $x^*$ is unique. Another elementary use of Taylor's theorem shows that $\|x_{k+1} - x^*\|^2 \leq \|x_k - x^*\|^2 - \alpha_k\left(\frac{2}{L} - \alpha_k\right)\|\nabla f(x_k)\|^2$, so that $\{\|x_k - x^*\|\}$ is decreasing. Define for convenience: $\Delta_k := f(x_k) - f(x^*)$. By convexity, have $\Delta_k \leq \nabla f(x_k)^T(x_k - x^*) \leq \|\nabla f(x_k)\|\,\|x_k - x^*\| \leq \|\nabla f(x_k)\|\,\|x_0 - x^*\|$. From the previous page (subtracting $f(x^*)$ from both sides of the inequality), and using the inequality above, we have $\Delta_{k+1} \leq \Delta_k - \frac{1}{2L}\|\nabla f(x_k)\|^2 \leq \Delta_k - \frac{1}{2L\|x_0 - x^*\|^2}\Delta_k^2$. M. Figueiredo and S. Wright First-Order Methods April / 68

12 Weakly convex: 1/k sublinear; Strongly convex: linear Take the reciprocal of both sides and manipulate (using $(1-\epsilon)^{-1} \geq 1 + \epsilon$): $\frac{1}{\Delta_{k+1}} \geq \frac{1}{\Delta_k} + \frac{1}{2L\|x_0 - x^*\|^2}$, which yields $f(x_{k+1}) - f(x^*) \leq \frac{2L\|x_0 - x^*\|^2}{k+1}$. The classic 1/k convergence rate! By assuming $\mu > 0$, can set $\alpha_k \equiv 2/(\mu + L)$ and get a linear (geometric) rate, much better than sublinear in the long run: $\|x_k - x^*\|^2 \leq \left(\frac{L - \mu}{L + \mu}\right)^{2k}\|x_0 - x^*\|^2 = \left(1 - \frac{2}{\kappa + 1}\right)^{2k}\|x_0 - x^*\|^2$. M. Figueiredo and S. Wright First-Order Methods April / 68

13 Weakly convex: 1/k sublinear; Strongly convex: linear Since by Taylor's theorem we have $\Delta_k = f(x_k) - f(x^*) \leq (L/2)\|x_k - x^*\|^2$, it follows immediately that $f(x_k) - f(x^*) \leq \frac{L}{2}\left(1 - \frac{2}{\kappa + 1}\right)^{2k}\|x_0 - x^*\|^2$. Note: A geometric / linear rate is generally better than almost any sublinear ($1/k$ or $1/k^2$) rate. M. Figueiredo and S. Wright First-Order Methods April / 68

14 Exact minimizing $\alpha_k$: Faster rate? Question: does taking $\alpha_k$ as the exact minimizer of $f$ along $-\nabla f(x_k)$ yield a better rate of linear convergence? Consider $f(x) = \frac{1}{2} x^T A x$ (thus $x^* = 0$ and $f(x^*) = 0$). We have $\nabla f(x_k) = A x_k$. Exactly minimizing w.r.t. $\alpha_k$: $\alpha_k = \arg\min_\alpha \frac{1}{2}(x_k - \alpha A x_k)^T A (x_k - \alpha A x_k) = \frac{x_k^T A^2 x_k}{x_k^T A^3 x_k} \in \left[\frac{1}{L}, \frac{1}{\mu}\right]$. Thus $f(x_{k+1}) = f(x_k)\left(1 - \frac{(x_k^T A^2 x_k)^2}{(x_k^T A x_k)(x_k^T A^3 x_k)}\right)$, so, defining $z_k := A x_k$, we have $\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} = 1 - \frac{\|z_k\|^4}{(z_k^T A^{-1} z_k)(z_k^T A z_k)}$. M. Figueiredo and S. Wright First-Order Methods April / 68

15 Exact minimizing $\alpha_k$: Faster rate? Using the Kantorovich inequality: $(z^T A z)(z^T A^{-1} z) \leq \frac{(L + \mu)^2}{4 L \mu}\|z\|^4$. Thus $\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} \leq 1 - \frac{4 L \mu}{(L + \mu)^2} = \left(1 - \frac{2}{\kappa + 1}\right)^2$, and so $f(x_k) - f(x^*) \leq \left(1 - \frac{2}{\kappa + 1}\right)^{2k}[f(x_0) - f(x^*)]$. No improvement in the linear rate over constant steplength! M. Figueiredo and S. Wright First-Order Methods April / 68

16 The slow linear rate is typical! Not just a pessimistic bound! M. Figueiredo and S. Wright First-Order Methods April / 68

17 Multistep Methods: The Heavy-Ball Enhance the search direction using a contribution from the previous step (known as heavy ball, momentum, or two-step). Consider first a constant step length $\alpha$, and a second parameter $\beta$ for the momentum term: $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})$. Analyze by defining a composite iterate vector: $w_k := \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix}$. Thus $w_{k+1} = B w_k + o(\|w_k\|)$, with $B := \begin{bmatrix} -\alpha \nabla^2 f(x^*) + (1+\beta) I & -\beta I \\ I & 0 \end{bmatrix}$. M. Figueiredo and S. Wright First-Order Methods April / 68

18 Multistep Methods: The Heavy-Ball Matrix $B$ has the same eigenvalues as $\begin{bmatrix} -\alpha\Lambda + (1+\beta) I & -\beta I \\ I & 0 \end{bmatrix}$, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, where $\lambda_i$ are the eigenvalues of $\nabla^2 f(x^*)$. Choose $\alpha, \beta$ to explicitly minimize the max eigenvalue of $B$; obtain $\alpha = \frac{4}{L}\frac{1}{(1 + 1/\sqrt\kappa)^2}$, $\beta = \left(1 - \frac{2}{\sqrt\kappa + 1}\right)^2$. Leads to linear convergence for $\|x_k - x^*\|$ with rate approximately $1 - \frac{2}{\sqrt\kappa + 1}$. M. Figueiredo and S. Wright First-Order Methods April / 68
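For intuition (not from the slides), here is a minimal NumPy sketch of the heavy-ball iteration on a convex quadratic, with constant α and β set from L and µ as above (written equivalently as α = 4/(√L + √µ)²); the function name and test problem are illustrative.

```python
import numpy as np

def heavy_ball(grad, x0, L, mu, max_iter=300):
    """Heavy-ball / momentum iteration with the constant parameters
    alpha = 4/(sqrt(L)+sqrt(mu))^2, beta = ((sqrt(kappa)-1)/(sqrt(kappa)+1))^2."""
    kappa = L / mu
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(max_iter):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Quadratic f(x) = 0.5 x^T A x with known eigenvalue range [mu, L]
eigvals = np.linspace(1.0, 100.0, 50)          # mu = 1, L = 100
A = np.diag(eigvals)
grad = lambda x: A @ x
x = heavy_ball(grad, np.ones(50), L=100.0, mu=1.0)
print(np.linalg.norm(x))   # should be close to 0 (the minimizer)
```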

19 Summary: Linear Convergence, Strictly Convex f Steepest descent: linear rate approx $\left(1 - \frac{2}{\kappa}\right)$; Heavy-ball: linear rate approx $\left(1 - \frac{2}{\sqrt\kappa}\right)$. Big difference! To reduce $\|x_k - x^*\|$ by a factor $\epsilon$, need $k$ large enough that $\left(1 - \frac{2}{\kappa}\right)^k \leq \epsilon$, i.e. $k \geq \frac{\kappa}{2}|\log\epsilon|$ (steepest descent); $\left(1 - \frac{2}{\sqrt\kappa}\right)^k \leq \epsilon$, i.e. $k \geq \frac{\sqrt\kappa}{2}|\log\epsilon|$ (heavy-ball). A factor of $\sqrt\kappa$ difference; e.g. if $\kappa = 1000$ (not at all uncommon in inverse problems), need about 30 times fewer steps. M. Figueiredo and S. Wright First-Order Methods April / 68

20 Conjugate Gradient Basic conjugate gradient (CG) step is $x_{k+1} = x_k + \alpha_k p_k$, $p_k = -\nabla f(x_k) + \gamma_k p_{k-1}$. Can be identified with heavy-ball, with $\beta_k = \frac{\alpha_k \gamma_k}{\alpha_{k-1}}$. However, CG can be implemented in a way that doesn't require knowledge (or estimation) of $L$ and $\mu$: Choose $\alpha_k$ to (approximately) minimize $f$ along $p_k$; Choose $\gamma_k$ by a variety of formulae (Fletcher-Reeves, Polak-Ribière, etc.), all of which are equivalent if $f$ is a convex quadratic, e.g. $\gamma_k = \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}$. M. Figueiredo and S. Wright First-Order Methods April / 68

21 Conjugate Gradient Nonlinear CG: Variants include Fletcher-Reeves, Polak-Ribière, Hestenes. Restarting periodically with $p_k = -\nabla f(x_k)$ is useful (e.g. every $n$ iterations, or when $p_k$ is not a descent direction). For quadratic $f$, convergence analysis is based on eigenvalues of $A$ and Chebyshev polynomials, min-max arguments. Get: finite termination in as many iterations as there are distinct eigenvalues; asymptotic linear convergence with rate approx $1 - \frac{2}{\sqrt\kappa}$ (like heavy-ball). (Nocedal and Wright, 2006) M. Figueiredo and S. Wright First-Order Methods April / 68
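Here is a short nonlinear-CG sketch of my own (Fletcher-Reeves formula, simple backtracking line search, periodic restarts); it is not a production implementation, and the test problem is illustrative.

```python
import numpy as np

def nonlinear_cg_fr(f, grad, x0, max_iter=200, tol=1e-8, restart=None):
    """Nonlinear CG with the Fletcher-Reeves formula and a simple
    backtracking line search (a sketch, not a tuned implementation)."""
    x = x0.copy()
    g = grad(x)
    p = -g
    restart = restart or len(x0)
    for k in range(1, max_iter + 1):
        if np.linalg.norm(g) < tol:
            break
        if g @ p >= 0 or k % restart == 0:
            p = -g                            # restart with steepest descent
        alpha, c1 = 1.0, 1e-4
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
        g_new = grad(x)
        gamma = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves
        p = -g_new + gamma * p
        g = g_new
    return x

# Example: least squares f(x) = 0.5 ||B x - b||^2
rng = np.random.default_rng(1)
B = rng.standard_normal((30, 10)); b = rng.standard_normal(30)
f = lambda x: 0.5 * np.sum((B @ x - b) ** 2)
grad = lambda x: B.T @ (B @ x - b)
print(np.linalg.norm(grad(nonlinear_cg_fr(f, grad, np.zeros(10)))))
```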

22 Accelerated First-Order Methods Accelerate the rate to $1/k^2$ for weakly convex, while retaining the linear rate (related to $\sqrt\kappa$) for the strongly convex case. Nesterov (1983) describes a method that requires $\kappa$. Initialize: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$. Iterate: $x_{k+1} \leftarrow y_k - \frac{1}{L}\nabla f(y_k)$ (short step); find $\alpha_{k+1} \in (0,1)$ with $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + \frac{\alpha_{k+1}}{\kappa}$; set $\beta_k = \frac{\alpha_k(1 - \alpha_k)}{\alpha_k^2 + \alpha_{k+1}}$; set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$. Still works for weakly convex ($\kappa = \infty$). M. Figueiredo and S. Wright First-Order Methods April / 68

23 [Figure: geometric illustration of the iterates $x_k, y_k, x_{k+1}, y_{k+1}, x_{k+2}, y_{k+2}$.] Separates the gradient descent and momentum step components. M. Figueiredo and S. Wright First-Order Methods April / 68

24 Convergence Results: Nesterov If $\alpha_0 \geq 1/\sqrt\kappa$, have $f(x_k) - f(x^*) \leq c_1 \min\left(\left(1 - \frac{1}{\sqrt\kappa}\right)^k, \frac{4L}{(\sqrt{L} + c_2 k)^2}\right)$, where constants $c_1$ and $c_2$ depend on $x_0$, $\alpha_0$, $L$. Linear convergence at the heavy-ball rate for strongly convex $f$; $1/k^2$ sublinear rate otherwise. In the special case of $\alpha_0 = 1/\sqrt\kappa$, this scheme yields $\alpha_k \equiv \frac{1}{\sqrt\kappa}$, $\beta_k \equiv 1 - \frac{2}{\sqrt\kappa + 1}$. M. Figueiredo and S. Wright First-Order Methods April / 68

25 Beck and Teboulle Beck and Teboulle (2009) propose a similar algorithm, with a fairly short and elementary analysis (though still not intuitive). Initialize: Choose $x_0$; set $y_1 = x_0$, $t_1 = 1$. Iterate: $x_k \leftarrow y_k - \frac{1}{L}\nabla f(y_k)$; $t_{k+1} \leftarrow \frac{1}{2}\left(1 + \sqrt{1 + 4 t_k^2}\right)$; $y_{k+1} \leftarrow x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$. For (weakly) convex $f$, converges with $f(x_k) - f(x^*) \sim 1/k^2$. When $L$ is not known, increase an estimate of $L$ until it's big enough. Beck and Teboulle (2009) do the convergence analysis in 2-3 pages; elementary, but technical. M. Figueiredo and S. Wright First-Order Methods April / 68
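A minimal sketch (mine, not the authors') of this accelerated scheme applied to a smooth least-squares objective; the Lipschitz constant is computed directly for the toy problem.

```python
import numpy as np

def accelerated_gradient(grad, L, x0, max_iter=500):
    """Beck-Teboulle-style accelerated gradient method for smooth convex f
    (the t_k / y_k update from the slide, with step 1/L)."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(max_iter):
        x = y - grad(y) / L                     # short gradient step from y_k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev

# Toy problem: f(x) = 0.5 ||B x - b||^2, L = largest eigenvalue of B^T B
rng = np.random.default_rng(2)
B = rng.standard_normal((40, 15)); b = rng.standard_normal(40)
L = np.linalg.norm(B, 2) ** 2
grad = lambda x: B.T @ (B @ x - b)
x = accelerated_gradient(grad, L, np.zeros(15))
print(np.linalg.norm(grad(x)))
```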

26 A Non-Monotone Gradient Method: Barzilai-Borwein Barzilai and Borwein (1988) (BB) proposed an unusual choice of $\alpha_k$. Allows $f$ to increase (sometimes a lot) on some steps: non-monotone. Explicitly, we have $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$, where $\alpha_k := \arg\min_\alpha \|s_k - \alpha z_k\|^2$, with $s_k := x_k - x_{k-1}$, $z_k := \nabla f(x_k) - \nabla f(x_{k-1})$, giving $\alpha_k = \frac{s_k^T z_k}{z_k^T z_k}$. Note that for $f(x) = \frac{1}{2} x^T A x$, we have $\alpha_k = \frac{s_k^T A s_k}{s_k^T A^2 s_k} \in \left[\frac{1}{L}, \frac{1}{\mu}\right]$. BB can be viewed as a quasi-Newton method, with the Hessian approximated by $\alpha_k^{-1} I$. M. Figueiredo and S. Wright First-Order Methods April / 68
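To make the BB step concrete, a small sketch of my own (no line search, no safeguards, which the later slides note are needed in general); the convex quadratic test problem is illustrative.

```python
import numpy as np

def barzilai_borwein(grad, x0, alpha0=1.0, max_iter=200, tol=1e-8):
    """Non-monotone gradient method with the Barzilai-Borwein step
    alpha_k = (s_k^T z_k)/(z_k^T z_k)."""
    x = x0.copy()
    g = grad(x)
    alpha = alpha0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - alpha * g
        g_new = grad(x_new)
        s, z = x_new - x, g_new - g
        if s @ z > 0:                       # keep the step positive
            alpha = (s @ z) / (z @ z)
        x, g = x_new, g_new
    return x

# Strongly convex quadratic example: f(x) = 0.5 x^T A x
rng = np.random.default_rng(3)
M = rng.standard_normal((25, 25))
A = M.T @ M + np.eye(25)
grad = lambda x: A @ x
print(np.linalg.norm(barzilai_borwein(grad, np.ones(25))))
```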

27 Comparison: BB vs Greedy Steepest Descent M. Figueiredo and S. Wright First-Order Methods April / 68

28 There Are Many BB Variants Use $\alpha_k = s_k^T s_k / s_k^T z_k$ in place of $\alpha_k = s_k^T z_k / z_k^T z_k$; alternate between these two formulae; hold $\alpha_k$ constant for a number (2, 3, 5) of successive steps; take $\alpha_k$ to be the steepest descent step from the previous iteration. Nonmonotonicity appears essential to performance. Some variants get global convergence by requiring a sufficient decrease in $f$ over the worst of the last $M$ (say 10) iterates. The original 1988 analysis in BB's paper is nonstandard and illuminating (just for a 2-variable quadratic). In fact, most analyses of BB and related methods are nonstandard, and consider only special cases. The precursor of such analyses is Akaike (1959). More recently, see Ascher, Dai, Fletcher, Hager and others. M. Figueiredo and S. Wright First-Order Methods April / 68

29 Extending to the Constrained Case: $x \in \Omega$ How to change these methods to handle the constraint $x \in \Omega$? (assuming that $\Omega$ is a closed convex set) Some algorithms and theory stay much the same, ...if we can involve the constraint $x \in \Omega$ explicitly in the subproblems. Example: Nesterov's constant step scheme requires just one calculation to be changed from the unconstrained version. Initialize: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$. Iterate: $x_{k+1} \leftarrow \arg\min_{y \in \Omega} \frac{1}{2}\left\|y - \left[y_k - \frac{1}{L}\nabla f(y_k)\right]\right\|_2^2$; find $\alpha_{k+1} \in (0,1)$ with $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + \frac{\alpha_{k+1}}{\kappa}$; set $\beta_k = \frac{\alpha_k(1 - \alpha_k)}{\alpha_k^2 + \alpha_{k+1}}$; set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$. Convergence theory is unchanged. M. Figueiredo and S. Wright First-Order Methods April / 68

33 Regularized Optimization How to change these methods to handle regularized optimization? $\min_x f(x) + \tau\psi(x)$, where $f$ is convex and smooth, while $\psi$ is convex but usually nonsmooth. Often, all that is needed is to change the update step to $x_{k+1} = \arg\min_x \frac{1}{2}\|x - \Phi(x_k)\|_2^2 + \lambda\,\psi(x)$, where $\Phi(x_k)$ is a gradient descent step, or something more complicated; e.g., an accelerated method with $\Phi(x_k, x_{k-1})$ instead of $\Phi(x_k)$. This is the shrinkage/thresholding step; how to solve it with nonsmooth $\psi$? That's the topic of the following slides. M. Figueiredo and S. Wright First-Order Methods April / 68

37 Handling Nonsmoothness (e.g. $\ell_1$ Norm) Convexity $\Rightarrow$ continuity (on the domain of the function). Convexity $\not\Rightarrow$ differentiability (e.g., $\psi(x) = \|x\|_1$). Subgradients generalize gradients for general convex functions: $v$ is a subgradient of $f$ at $x$ if $f(x') \geq f(x) + v^T(x' - x)$ for all $x'$ (a linear lower bound). Subdifferential: $\partial f(x) = \{$all subgradients of $f$ at $x\}$. If $f$ is differentiable, $\partial f(x) = \{\nabla f(x)\}$. [Figures: linear lower bound; nondifferentiable case.] M. Figueiredo and S. Wright First-Order Methods April / 68

40 More on Subgradients and Subdifferentials The subdifferential is a set-valued function: $f: \mathbb{R}^d \to \mathbb{R}$, $\partial f: \mathbb{R}^d \to 2^{\mathbb{R}^d}$ (power set of $\mathbb{R}^d$). Example: $f(x) = \begin{cases} -2x - 1, & x \leq -1 \\ -x, & -1 < x \leq 0 \\ x^2/2, & x > 0 \end{cases}$, $\quad \partial f(x) = \begin{cases} \{-2\}, & x < -1 \\ [-2, -1], & x = -1 \\ \{-1\}, & -1 < x < 0 \\ [-1, 0], & x = 0 \\ \{x\}, & x > 0 \end{cases}$. Fermat's Rule: $x \in \arg\min_x f(x) \Leftrightarrow 0 \in \partial f(x)$. M. Figueiredo and S. Wright First-Order Methods April / 68

44 A Key Tool: Moreau's Proximity Operators Moreau (1962) proximity operator: $\mathrm{prox}_\psi(y) := \arg\min_x \frac{1}{2}\|x - y\|_2^2 + \psi(x)$ ...well defined for convex $\psi$, since $\frac{1}{2}\|\cdot - y\|_2^2$ is coercive and strictly convex. Example (seen above): $\mathrm{prox}_{\tau|\cdot|}(y) = \mathrm{soft}(y, \tau) = \mathrm{sign}(y)\max\{|y| - \tau, 0\}$. Block separability: if $x = (x_1, \ldots, x_N)$ (a partition of the components of $x$) and $\psi(x) = \sum_i \psi_i(x_i)$, then $(\mathrm{prox}_\psi(y))_i = \mathrm{prox}_{\psi_i}(y_i)$. Relationship with subdifferential: $z = \mathrm{prox}_\psi(y) \Leftrightarrow y - z \in \partial\psi(z)$. Resolvent: $z = \mathrm{prox}_\psi(y) \Leftrightarrow 0 \in \partial\psi(z) + (z - y) \Leftrightarrow y \in (\partial\psi + I)z \Leftrightarrow \mathrm{prox}_\psi(y) = (\partial\psi + I)^{-1} y$. M. Figueiredo and S. Wright First-Order Methods April / 68
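To make the prox concrete, here is a small NumPy sketch (mine) of the soft-thresholding operator together with a brute-force numerical check against the defining minimization for the ℓ1 case; the grid search is only for illustration.

```python
import numpy as np

def soft(y, tau):
    """Soft-thresholding: prox of tau*|.| applied componentwise,
    soft(y, tau) = sign(y) * max(|y| - tau, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# Check against the definition prox_psi(y) = argmin_x 0.5*(x-y)^2 + tau*|x|
y, tau = 1.3, 0.5
grid = np.linspace(-3, 3, 100001)
objective = 0.5 * (grid - y) ** 2 + tau * np.abs(grid)
print(soft(np.array([y]), tau)[0], grid[np.argmin(objective)])  # both ~0.8
```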

48 Important Proximity Operators Soft-thresholding is the proximity operator of the $\ell_1$ norm. Consider the indicator $\iota_S$ of a convex set $S$: $\mathrm{prox}_{\iota_S}(u) = \arg\min_x \frac{1}{2}\|x - u\|_2^2 + \iota_S(x) = \arg\min_{x \in S}\frac{1}{2}\|x - u\|_2^2 = P_S(u)$ ...the Euclidean projection on $S$. Squared Euclidean norm (separable, smooth): $\mathrm{prox}_{\frac{\tau}{2}\|\cdot\|_2^2}(y) = \arg\min_x \frac{1}{2}\|x - y\|_2^2 + \frac{\tau}{2}\|x\|_2^2 = \frac{y}{1 + \tau}$. Euclidean norm (not separable, nonsmooth): $\mathrm{prox}_{\tau\|\cdot\|_2}(y) = \begin{cases} \frac{y}{\|y\|_2}(\|y\|_2 - \tau), & \text{if } \|y\|_2 > \tau \\ 0, & \text{if } \|y\|_2 \leq \tau \end{cases}$. M. Figueiredo and S. Wright First-Order Methods April / 68
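Two of these prox maps in code (a sketch of my own): the prox of the Euclidean norm and the projection onto a Euclidean ball, i.e. the prox of its indicator; the numeric example is illustrative.

```python
import numpy as np

def prox_l2_norm(y, tau):
    """Prox of tau*||.||_2 (block soft-thresholding): shrink the whole
    vector toward zero by tau in Euclidean norm, or map it to zero."""
    norm_y = np.linalg.norm(y)
    if norm_y <= tau:
        return np.zeros_like(y)
    return (1.0 - tau / norm_y) * y

def project_l2_ball(u, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}, i.e. the prox of
    the indicator of that ball."""
    norm_u = np.linalg.norm(u)
    return u if norm_u <= radius else (radius / norm_u) * u

y = np.array([3.0, 4.0])          # ||y||_2 = 5
print(prox_l2_norm(y, 1.0))        # [2.4, 3.2], norm shrunk from 5 to 4
print(project_l2_ball(y, 2.0))     # [1.2, 1.6], norm 2
```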

49 More Proximity Operators (Combettes and Pesquet, 2011) M. Figueiredo and S. Wright First-Order Methods April / 68

53 Another Key Tool: Fenchel-Legendre Conjugates The Fenchel-Legendre conjugate of a proper convex function $f$, denoted by $f^*: \mathbb{R}^n \to \bar{\mathbb{R}}$, is defined by $f^*(u) = \sup_x \, x^T u - f(x)$. Main properties and relationship with proximity operators: Biconjugation: if $f$ is convex and proper, $f^{**} = f$. Moreau's decomposition: $\mathrm{prox}_f(u) + \mathrm{prox}_{f^*}(u) = u$ ...meaning that, if you know $\mathrm{prox}_f$, you know $\mathrm{prox}_{f^*}$, and vice-versa. Conjugate of indicator: if $f(x) = \iota_C(x)$, where $C$ is a convex set, $f^*(u) = \sup_x \, x^T u - \iota_C(x) = \sup_{x \in C} x^T u = \sigma_C(u)$ (the support function of $C$). M. Figueiredo and S. Wright First-Order Methods April / 68

55 From Conjugates to Proximity Operators Notice that $|u| = \sup_{x \in [-1,1]} x\,u = \sigma_{[-1,1]}(u)$, thus $|\cdot|^* = \iota_{[-1,1]}$. Using Moreau's decomposition, we easily derive the soft-threshold: $\mathrm{prox}_{\tau|\cdot|} = I - \mathrm{prox}_{\iota_{[-\tau,\tau]}} = I - P_{[-\tau,\tau]} = \mathrm{soft}(\cdot, \tau)$. Conjugate of a norm: if $f(x) = \tau\|x\|_p$ then $f^* = \iota_{\{x: \|x\|_q \leq \tau\}}$, where $\frac{1}{q} + \frac{1}{p} = 1$ (a Hölder pair, or Hölder conjugates). That is, $\|\cdot\|_p$ and $\|\cdot\|_q$ are dual norms: $\|z\|_q = \sup\{x^T z : \|x\|_p \leq 1\} = \sup_{x \in B_p(1)} x^T z = \sigma_{B_p(1)}(z)$. M. Figueiredo and S. Wright First-Order Methods April / 68
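A quick numerical illustration (my own) of Moreau's decomposition in the scalar case above: the prox of τ|·| plus the projection onto [−τ, τ] recovers the input.

```python
import numpy as np

def soft(y, tau):
    # prox of tau*|.|
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def project_interval(y, tau):
    # prox of the indicator of [-tau, tau], i.e. the prox of the conjugate
    return np.clip(y, -tau, tau)

y = np.array([-2.0, -0.3, 0.1, 1.7])
tau = 0.5
# Moreau decomposition: prox_f(y) + prox_{f*}(y) = y
print(soft(y, tau) + project_interval(y, tau))   # equals y
```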

58 From Conjugates to Proximity Operators Proximity operator of a norm: $\mathrm{prox}_{\tau\|\cdot\|_p} = I - P_{B_q(\tau)}$, where $B_q(\tau) = \{x : \|x\|_q \leq \tau\}$ and $\frac{1}{q} + \frac{1}{p} = 1$. Example: computing $\mathrm{prox}_{\tau\|\cdot\|_\infty}$ (notice $\ell_\infty$ is not separable): since the dual of $\|\cdot\|_\infty$ is $\|\cdot\|_1$, $\mathrm{prox}_{\tau\|\cdot\|_\infty} = I - P_{B_1(\tau)}$ ...the proximity operator of the $\ell_\infty$ norm is the residual of the projection on an $\ell_1$ ball. Projection on the $\ell_1$ ball has no closed form, but there are efficient (linear cost) algorithms (Brucker, 1984), (Maculan and de Paula, 1989). M. Figueiredo and S. Wright First-Order Methods April / 68
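As an illustration (mine), here is a simple O(n log n) sort-based ℓ1-ball projection rather than the linear-time algorithms cited above, together with the prox of the ℓ∞ norm obtained as its residual via Moreau's decomposition.

```python
import numpy as np

def project_l1_ball(u, tau):
    """Euclidean projection onto the l1 ball {x : ||x||_1 <= tau}.
    Simple O(n log n) sort-based version (the cited references give O(n) methods)."""
    if np.sum(np.abs(u)) <= tau:
        return u.copy()
    a = np.sort(np.abs(u))[::-1]                 # sorted magnitudes, descending
    cumsum = np.cumsum(a)
    rho = np.nonzero(a - (cumsum - tau) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (cumsum[rho] - tau) / (rho + 1.0)    # soft-threshold level
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def prox_linf(y, tau):
    """Prox of tau*||.||_inf via Moreau: residual of the projection onto the l1 ball."""
    return y - project_l1_ball(y, tau)

y = np.array([3.0, -1.0, 0.5])
print(project_l1_ball(y, 2.0))   # l1 norm of the result is 2
print(prox_linf(y, 2.0))         # largest entry pulled toward the others
```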

59 Geometry and Effect of $\mathrm{prox}_{\tau\|\cdot\|_\infty}$ Whereas $\ell_1$ regularization promotes sparsity, $\ell_\infty$ regularization promotes equality (in absolute value). M. Figueiredo and S. Wright First-Order Methods April / 68

60 From Conjugates to Proximity Operators The dual of the l 2 norm is the l 2 norm. M. Figueiredo and S. Wright First-Order Methods April / 68

63 Group Norms and their Prox Operators Group-norm regularizer: $\psi(x) = \sum_{m=1}^{M} \lambda_m \|x_{G_m}\|_p$. In the non-overlapping case ($G_1, \ldots, G_M$ is a partition of $\{1, \ldots, n\}$), simply use separability: $\left(\mathrm{prox}_\psi(u)\right)_{G_m} = \mathrm{prox}_{\lambda_m\|\cdot\|_p}(u_{G_m})$. In the tree-structured case, can get a complete ordering of the groups, $G_1 \preceq G_2 \preceq \ldots \preceq G_M$, where $G \preceq G'$ means $(G \subseteq G')$ or $(G \cap G' = \emptyset)$. Define $\Pi_m: \mathbb{R}^n \to \mathbb{R}^n$ by $(\Pi_m(u))_{G_m} = \mathrm{prox}_{\lambda_m\|\cdot\|_p}(u_{G_m})$, $(\Pi_m(u))_{\bar G_m} = u_{\bar G_m}$, where $\bar G_m = \{1, \ldots, n\} \setminus G_m$. Then $\mathrm{prox}_\psi = \Pi_M \circ \cdots \circ \Pi_2 \circ \Pi_1$ ...only valid for $p \in \{1, 2, \infty\}$ (Jenatton et al., 2011). M. Figueiredo and S. Wright First-Order Methods April / 68
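For the non-overlapping case, a small sketch (mine) of the group prox with p = 2, i.e. vector soft-thresholding applied block by block; the group partition and weights are illustrative.

```python
import numpy as np

def prox_group_l2(u, groups, lambdas):
    """Prox of sum_m lambda_m * ||x_{G_m}||_2 for a partition of indices:
    apply vector soft-thresholding independently on each group."""
    x = np.empty_like(u)
    for G, lam in zip(groups, lambdas):
        block = u[G]
        norm = np.linalg.norm(block)
        x[G] = 0.0 if norm <= lam else (1.0 - lam / norm) * block
    return x

u = np.array([3.0, 4.0, 0.2, -0.1, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_group_l2(u, groups, lambdas=[1.0, 1.0, 0.5]))
# first block shrunk, second block set to zero, third shrunk toward 0
```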

64 Matrix Nuclear Norm and its Prox Operator Recall the trace/nuclear norm: $\|X\|_* = \sum_{i=1}^{\min\{m,n\}} \sigma_i$. The dual of a Schatten $p$-norm is a Schatten $q$-norm, with $\frac{1}{q} + \frac{1}{p} = 1$. Thus, the dual of the nuclear norm is the spectral norm: $\|X\| = \max\{\sigma_1, \ldots, \sigma_{\min\{m,n\}}\}$. If $Y = U\Lambda V^T$ is the SVD of $Y$, we have $\mathrm{prox}_{\tau\|\cdot\|_*}(Y) = U\Lambda V^T - P_{\{X:\, \max\{\sigma_1,\ldots,\sigma_{\min\{m,n\}}\} \leq \tau\}}(U\Lambda V^T) = U\,\mathrm{soft}(\Lambda, \tau)\,V^T$. M. Figueiredo and S. Wright First-Order Methods April / 68
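A minimal NumPy sketch (mine) of this singular-value soft-thresholding; the random low-rank test matrix is illustrative.

```python
import numpy as np

def prox_nuclear(Y, tau):
    """Prox of tau*||.||_* : soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt     # same as U @ diag(s_shrunk) @ Vt

rng = np.random.default_rng(4)
Y = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))  # rank 3
X = prox_nuclear(Y, tau=1.0)
print(np.linalg.matrix_rank(Y), np.linalg.matrix_rank(X))  # rank can only drop
```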

65 Atomic Norms: A Unified View M. Figueiredo and S. Wright First-Order Methods April / 68

69 Another Use of Fenchel-Legendre Conjugates The original problem: $\min_x f(x) + \psi(x)$. Often this has the form $\min_x g(Ax) + \psi(x)$. Using the definition of the conjugate, $g(Ax) = \sup_u \, u^T A x - g^*(u)$, so $\min_x g(Ax) + \psi(x) = \inf_x \sup_u \left( u^T A x - g^*(u) + \psi(x) \right) = \sup_u \left( -g^*(u) + \inf_x \left( u^T A x + \psi(x) \right) \right) = \sup_u \Big( -g^*(u) - \underbrace{\sup_x \left( x^T(-A^T u) - \psi(x) \right)}_{\psi^*(-A^T u)} \Big) = -\inf_u \left( g^*(u) + \psi^*(-A^T u) \right)$. The problem $\inf_u \, g^*(u) + \psi^*(-A^T u)$ is sometimes easier to handle. M. Figueiredo and S. Wright First-Order Methods April / 68

70 Basic Proximal-Gradient Algorithm Use the basic structure $x_{k+1} = \arg\min_x \frac{1}{2}\|x - \Phi(x_k)\|_2^2 + \alpha_k\psi(x)$, with $\Phi(x_k) = x_k - \alpha_k\nabla f(x_k)$ a simple gradient descent step, thus $x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k \nabla f(x_k)\right)$. This approach goes by different names, such as proximal gradient algorithm (PGA), iterative shrinkage/thresholding (IST or ISTA), forward-backward splitting (FBS). It has been proposed in different communities: optimization, PDEs, convex analysis, signal processing, machine learning. M. Figueiredo and S. Wright First-Order Methods April / 68
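A minimal sketch (my own) of this proximal-gradient / ISTA iteration, taking the gradient of f, a step size, and a prox callable; the ℓ2-ℓ1 instance below and all names are illustrative.

```python
import numpy as np

def proximal_gradient(grad_f, prox_psi, x0, alpha, max_iter=500):
    """Proximal-gradient (IST/ISTA/FBS) iteration:
    x_{k+1} = prox_{alpha*psi}(x_k - alpha * grad f(x_k))."""
    x = x0.copy()
    for _ in range(max_iter):
        x = prox_psi(x - alpha * grad_f(x), alpha)
    return x

# l2-l1 example: min 0.5||Bx - b||^2 + tau*||x||_1
rng = np.random.default_rng(5)
B = rng.standard_normal((50, 100)); b = rng.standard_normal(50)
tau = 0.5
grad_f = lambda x: B.T @ (B @ x - b)
prox_psi = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a * tau, 0.0)
alpha = 1.0 / np.linalg.norm(B, 2) ** 2          # step below 2/L
x_hat = proximal_gradient(grad_f, prox_psi, np.zeros(100), alpha)
print(np.count_nonzero(x_hat), "nonzeros out of", x_hat.size)
```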

73 Convergence of the Proximal-Gradient Algorithm Basic algorithm: $x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k \nabla f(x_k)\right)$. Generalized (possibly inexact) version: $x_{k+1} = (1 - \lambda_k) x_k + \lambda_k \left( \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k\left(\nabla f(x_k) + b_k\right)\right) + a_k \right)$, where $a_k$ and $b_k$ are errors in computing the prox and the gradient, and $\lambda_k$ is an over-relaxation parameter. Convergence is guaranteed (Combettes and Wajs, 2006) if $0 < \inf \alpha_k \leq \sup \alpha_k < \frac{2}{L}$, $\lambda_k \in (0, 1]$ with $\inf \lambda_k > 0$, $\sum_k \|a_k\| < \infty$ and $\sum_k \|b_k\| < \infty$. M. Figueiredo and S. Wright First-Order Methods April / 68

78 Proximal-Gradient Algorithm: Quadratic Case Consider the quadratic case (of great interest): $f(x) = \frac{1}{2}\|Bx - b\|_2^2$. Here, $\nabla f(x) = B^T(Bx - b)$ and the IST/PGA/FBS algorithm is $x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k B^T(Bx_k - b)\right)$, which requires only matrix-vector multiplications with $B$ and $B^T$. Very important in signal/image processing, where fast algorithms exist to compute these products (e.g. FFT, DWT). In this case, some more refined convergence results are available. Even more refined results are available if $\psi(x) = \tau\|x\|_1$. M. Figueiredo and S. Wright First-Order Methods April / 68

81 More on IST/FBS/PGA for the $\ell_2$-$\ell_1$ Case Problem: $\hat x = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$ (recall $B^T B \preceq L I$). IST/FBS/PGA becomes $x_{k+1} = \mathrm{soft}\left(x_k - \alpha B^T(Bx_k - b), \alpha\tau\right)$ with $\alpha < 2/L$. The zero set: $Z \subseteq \{1, \ldots, n\}$ such that $\hat x_Z = 0$. Zeros are found in a finite number of iterations (Hale et al., 2008): after a finite number of iterations, $(x_k)_Z = 0$. After that, if $B_{\bar Z}^T B_{\bar Z} \succeq \mu I$ with $\mu > 0$ (where $\bar Z$ is the support, thus $\kappa(B_{\bar Z}^T B_{\bar Z}) = L/\mu < \infty$), we have linear convergence $\|x_{k+1} - \hat x\|_2 \leq \frac{\kappa - 1}{\kappa + 1}\|x_k - \hat x\|_2$ for the optimal choice $\alpha = 2/(L + \mu)$ (see unconstrained theory). M. Figueiredo and S. Wright First-Order Methods April / 68

84 Heavy Ball Acceleration: FISTA FISTA (fast iterative shrinkage-thresholding algorithm) is a heavy-ball-type acceleration of IST (based on Nesterov (1983)) (Beck and Teboulle, 2009). Initialize: Choose $\alpha \leq 1/L$, $x_0$; set $y_1 = x_0$, $t_1 = 1$. Iterate: $x_k \leftarrow \mathrm{prox}_{\alpha\tau\psi}\left(y_k - \alpha\nabla f(y_k)\right)$; $t_{k+1} \leftarrow \frac{1}{2}\left(1 + \sqrt{1 + 4t_k^2}\right)$; $y_{k+1} \leftarrow x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$. Acceleration: FISTA: $f(x_k) - f(\hat x) \sim O\left(\frac{1}{k^2}\right)$; IST: $f(x_k) - f(\hat x) \sim O\left(\frac{1}{k}\right)$. When $L$ is not known, increase an estimate of $L$ until it's big enough. M. Figueiredo and S. Wright First-Order Methods April / 68
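A compact FISTA sketch (mine), reusing the prox-callable convention of the earlier proximal-gradient sketch; the toy ℓ2-ℓ1 problem is illustrative.

```python
import numpy as np

def fista(grad_f, prox_psi, x0, alpha, max_iter=500):
    """FISTA: proximal-gradient step from y_k plus the t_k extrapolation."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(max_iter):
        x = prox_psi(y - alpha * grad_f(y), alpha)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

# Same l2-l1 toy problem as before
rng = np.random.default_rng(6)
B = rng.standard_normal((50, 100)); b = rng.standard_normal(50)
tau = 0.5
grad_f = lambda x: B.T @ (B @ x - b)
prox_psi = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a * tau, 0.0)
alpha = 1.0 / np.linalg.norm(B, 2) ** 2           # alpha <= 1/L
x_hat = fista(grad_f, prox_psi, np.zeros(100), alpha)
print(0.5 * np.sum((B @ x_hat - b) ** 2) + tau * np.sum(np.abs(x_hat)))
```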

88 Heavy Ball Acceleration: TwIST TwIST (two-step iterative shrinkage/thresholding (Bioucas-Dias and Figueiredo, 2007)) is a heavy-ball-type acceleration of IST for $\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\psi(x)$. Iterations (with $\alpha < 2/L$): $x_{k+1} = (\gamma - \beta)\,x_k + (1 - \gamma)\,x_{k-1} + \beta\,\mathrm{prox}_{\alpha\tau\psi}\left(x_k - \alpha B^T(Bx_k - b)\right)$. Analysis in the strongly convex case: $\mu I \preceq B^T B \preceq L I$, with $\mu > 0$; condition number $\kappa = L/\mu < \infty$. Optimal parameters $\gamma = \rho^2 + 1$, $\beta = \frac{2\alpha}{\mu + L}$, where $\rho = \frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}$, yield linear convergence $\|x_{k+1} - \hat x\|_2 \leq \frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\|x_k - \hat x\|_2$ (versus $\frac{\kappa - 1}{\kappa + 1}$ for IST). M. Figueiredo and S. Wright First-Order Methods April / 68

89 Illustration of the TwIST Acceleration M. Figueiredo and S. Wright First-Order Methods April / 68

92 Acceleration via Larger Steps: SpaRSA The standard step size $\alpha_k < 2/L$ in IST is too timid. The SpaRSA (sparse reconstruction by separable approximation) framework proposes bolder choices of $\alpha_k$ (Wright et al., 2009): Barzilai-Borwein, to mimic Newton steps; keep increasing $\alpha_k$ until monotonicity is violated: backtrack. Convergence to critical points (minima in the convex case) is guaranteed for a safeguarded version: ensure sufficient decrease w.r.t. the worst value in the previous $M$ iterations. M. Figueiredo and S. Wright First-Order Methods April / 68

95 Another Approach: Gradient Projection $\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$ can be written as a standard QP: $\min_{u,v} \frac{1}{2}\|B(u - v) - b\|_2^2 + \tau\,\mathbf{1}^T u + \tau\,\mathbf{1}^T v$ s.t. $u \geq 0$, $v \geq 0$, where $u_i = \max\{0, x_i\}$ and $v_i = \max\{0, -x_i\}$. With $z = \begin{bmatrix} u \\ v \end{bmatrix}$, the problem can be written in canonical form: $\min_z \frac{1}{2} z^T Q z + c^T z$ s.t. $z \geq 0$. Solving this problem with projected gradient using Barzilai-Borwein steps: GPSR (gradient projection for sparse reconstruction) (Figueiredo et al., 2007). M. Figueiredo and S. Wright First-Order Methods April / 68

96 Speed Comparisons Lorenz (2011) proposed a way of generating problem instances with known solution $\bar x$: useful for speed comparison. Define: $R_k = \frac{\|x_k - \bar x\|_2}{\|\bar x\|_2}$ and $r_k = \frac{L(x_k) - L(\bar x)}{L(\bar x)}$ (where $L(x) = f(x) + \tau\psi(x)$). M. Figueiredo and S. Wright First-Order Methods April / 68

97 More Speed Comparisons M. Figueiredo and S. Wright First-Order Methods April / 68

98 Even More Speed Comparisons M. Figueiredo and S. Wright First-Order Methods April / 68

102 Acceleration by Continuation IST/FBS/PGA can be very slow if $\tau$ is very small and/or $f$ is poorly conditioned. A very simple acceleration strategy: continuation/homotopy. Initialization: Set $\tau_0 > \tau$, starting point $\bar x$, factor $\sigma \in (0, 1)$, and $k = 0$. Iterations: Find an approximate solution $x(\tau_k)$ of $\min_x f(x) + \tau_k\psi(x)$, starting from $\bar x$; if $\tau_k = \tau$, STOP; set $\tau_{k+1} \leftarrow \max(\tau, \sigma\tau_k)$ and $\bar x \leftarrow x(\tau_k)$. Often the solution path $x(\tau)$, for a range of values of $\tau$, is desired anyway (e.g., within an outer method to choose an optimal $\tau$). Shown to be very effective in practice (Hale et al., 2008; Wright et al., 2009). Recently analyzed by Xiao and Zhang (2012). M. Figueiredo and S. Wright First-Order Methods April / 68
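A small sketch (mine) of this warm-started continuation loop wrapped around a few inner IST iterations; the choice of τ₀, the inner solver, the iteration counts, and the toy data are all illustrative.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(B, b, tau, x0, alpha, iters=100):
    """A few inner IST iterations for min 0.5||Bx-b||^2 + tau||x||_1."""
    x = x0.copy()
    for _ in range(iters):
        x = soft(x - alpha * B.T @ (B @ x - b), alpha * tau)
    return x

def continuation(B, b, tau, sigma=0.5, inner_iters=100):
    """Warm-started continuation: solve a sequence of problems with
    decreasing regularization tau_0 > tau_1 > ... down to the target tau."""
    alpha = 1.0 / np.linalg.norm(B, 2) ** 2
    tau_k = 0.9 * np.max(np.abs(B.T @ b))   # large tau_0 (x = 0 solves tau >= ||B^T b||_inf)
    x = np.zeros(B.shape[1])
    while True:
        x = ista(B, b, tau_k, x, alpha, inner_iters)
        if tau_k == tau:
            return x
        tau_k = max(tau, sigma * tau_k)

rng = np.random.default_rng(7)
B = rng.standard_normal((60, 200)); b = rng.standard_normal(60)
x_hat = continuation(B, b, tau=0.1)
print(np.count_nonzero(x_hat))
```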

103 Acceleration by Continuation: An Example Classical sparse reconstruction problem (Wright et al., 2009): $\hat x = \arg\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$, with $B \in \mathbb{R}^{1024 \times 4096}$ (thus $x \in \mathbb{R}^{4096}$ and $b \in \mathbb{R}^{1024}$). M. Figueiredo and S. Wright First-Order Methods April / 68

104 A Final Touch: Debiasing Consider problems of the form $\hat x = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$. Often, the original goal was to minimize the quadratic term, after the support of $x$ had been found. But the $\ell_1$ term can cause the nonzero values of $x_i$ to be suppressed. Debiasing: find the zero set (complement of the support of $\hat x$), $Z(\hat x) = \{1, \ldots, n\} \setminus \mathrm{supp}(\hat x)$; solve $\min_x \|Bx - b\|_2^2$ s.t. $x_{Z(\hat x)} = 0$. (Fix the zeros and solve an unconstrained problem over the support.) Often, this problem has to be solved using an algorithm that only involves products by $B$ and $B^T$, since this matrix cannot be partitioned. M. Figueiredo and S. Wright First-Order Methods April / 68

105 Effect of Debiasing [Figure: original signal (n = 4096, 160 nonzeros); SpaRSA reconstruction (m = 1024, τ = 0.08); debiased reconstruction (MSE = 3.377e-5); minimum-norm solution (MSE = 1.568).] M. Figueiredo and S. Wright First-Order Methods April / 68

106 Example: Matrix Recovery (Toh and Yun, 2010) M. Figueiredo and S. Wright First-Order Methods April / 68

107 Conditional Gradient Also known as Frank-Wolfe after the authors who devised it in the 1950s. Later analysis by Dunn (around 1990). Suddenly a topic of enormous renewed interest; see for example (Jaggi, 2013). $\min_{x \in \Omega} f(x)$, where $f$ is a convex function and $\Omega$ is a closed, bounded, convex set. Start at $x_0 \in \Omega$. At iteration $k$: $v_k := \arg\min_{v \in \Omega} v^T \nabla f(x_k)$; $x_{k+1} := (1 - \alpha_k) x_k + \alpha_k v_k$, with $\alpha_k = \frac{2}{k + 2}$. Potentially useful when it is easy to minimize a linear function over the original constraint set $\Omega$. Admits an elementary convergence theory: 1/k sublinear rate. The same convergence theory holds if we use a line search for $\alpha_k$. M. Figueiredo and S. Wright First-Order Methods April / 68

108 Conditional Gradient for Atomic-Norm Constraints Conditional Gradient is particularly useful for optimization over atomic-norm constraints: $\min f(x)$ s.t. $\|x\|_{\mathcal{A}} \leq \tau$. Reminder: Given the set of atoms $\mathcal{A}$ (possibly infinite) we have $\|x\|_{\mathcal{A}} := \inf\left\{\sum_{a \in \mathcal{A}} c_a : x = \sum_{a \in \mathcal{A}} c_a a,\; c_a \geq 0\right\}$. The search direction $v_k$ is $\tau \bar a_k$, where $\bar a_k := \arg\min_{a \in \mathcal{A}} \langle a, \nabla f(x_k)\rangle$. That is, we seek the atom that lines up best with the negative gradient direction $-\nabla f(x_k)$. M. Figueiredo and S. Wright First-Order Methods April / 68

109 Generating Atoms We can think of each step as the addition of a new atom to the basis. Note that $x_k$ is expressed in terms of $\{\bar a_0, \bar a_1, \ldots, \bar a_k\}$. If few iterations are needed to find a solution of acceptable accuracy, then we have an approximate solution that's represented in terms of few atoms, that is, sparse or compactly represented. For many atomic sets $\mathcal{A}$ of interest, the new atom can be found cheaply. Example: For the constraint $\|x\|_1 \leq \tau$, the atoms are $\{\pm e_i : i = 1, 2, \ldots, n\}$. If $i_k$ is the index at which $|[\nabla f(x_k)]_i|$ attains its maximum, we have $\bar a_k = -\mathrm{sign}([\nabla f(x_k)]_{i_k})\, e_{i_k}$. Example: For the constraint $\|x\|_\infty \leq \tau$, the atoms are the $2^n$ vectors with entries $\pm 1$. We have $[\bar a_k]_i = -\mathrm{sign}[\nabla f(x_k)]_i$, $i = 1, 2, \ldots, n$. M. Figueiredo and S. Wright First-Order Methods April / 68
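A minimal Frank-Wolfe sketch (mine) for an ℓ1-ball constraint using exactly this cheap signed-coordinate atom selection; the least-squares objective and all names are illustrative.

```python
import numpy as np

def frank_wolfe_l1(grad_f, tau, n, max_iter=200):
    """Conditional gradient over {x : ||x||_1 <= tau}: at each step the
    linear subproblem is solved by a single signed coordinate atom."""
    x = np.zeros(n)
    for k in range(max_iter):
        g = grad_f(x)
        i = np.argmax(np.abs(g))                  # best atom index
        v = np.zeros(n)
        v[i] = -tau * np.sign(g[i])               # v_k = tau * a_bar_k
        gamma = 2.0 / (k + 2.0)
        x = (1.0 - gamma) * x + gamma * v
    return x

rng = np.random.default_rng(8)
B = rng.standard_normal((40, 80)); b = rng.standard_normal(40)
grad_f = lambda x: B.T @ (B @ x - b)
x = frank_wolfe_l1(grad_f, tau=2.0, n=80)
print(np.sum(np.abs(x)) <= 2.0 + 1e-12, np.count_nonzero(x))
```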

110 More Examples Example: Nuclear Norm. For the constraint $\|X\|_* \leq \tau$, for which the atoms are the rank-one matrices, we have $\bar A_k = -u_k v_k^T$, where $u_k$ and $v_k$ are the first columns of the matrices $U_k$ and $V_k$ obtained from the SVD $\nabla f(X_k) = U_k \Sigma_k V_k^T$. Example: sum-of-$\ell_2$. For the constraint $\sum_{i=1}^{m}\|x_{[i]}\|_2 \leq \tau$, the atoms are the vectors $a$ that contain all zeros except for a vector $u_{[i]}$ with unit 2-norm in the $[i]$ block position. (Infinitely many.) The atom $\bar a_k$ contains nonzero components in the block $i_k$ for which $\|[\nabla f(x_k)]_{[i]}\|$ is maximized, and the nonzero part is $u_{[i_k]} = -[\nabla f(x_k)]_{[i_k]} / \|[\nabla f(x_k)]_{[i_k]}\|$. M. Figueiredo and S. Wright First-Order Methods April / 68

111 Other Enhancements Reoptimizing. Instead of fixing the contribution α k from each atom at the time it joins the basis, we can periodically and approximately reoptimize over the current basis. This is a finite dimension optimization problem over the (nonnegative) coefficients of the basis atoms. It need only be solved approximately. If any coefficient is reduced to zero, it can be dropped from the basis. Dropping Atoms. Sparsity of the solution can be improved by dropping atoms from the basis, if doing so does not degrade the value of f too much (see (Rao et al., 2013)). In the important least-squares case, the effect of dropping can be evaluated efficiently. M. Figueiredo and S. Wright First-Order Methods April / 68

112 References I Akaike, H. (1959). On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Annals of the Institute of Statistics and Mathematics of Tokyo, 11:1–17. Barzilai, J. and Borwein, J. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8: Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1): Bioucas-Dias, J. and Figueiredo, M. (2007). A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16: Brucker, P. (1984). An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3: Candès, E. and Romberg, J. (2005). l1-MAGIC: Recovery of sparse signals via convex programming. Technical report, California Institute of Technology. Combettes, P. and Pesquet, J.-C. (2011). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer. Combettes, P. and Wajs, V. (2006). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4: M. Figueiredo and S. Wright First-Order Methods April / 68

113 References II Ferris, M. C. and Munson, T. S. (2002). Interior-point methods for massive support vector machines. SIAM Journal on Optimization, 13(3): Figueiredo, M., Nowak, R., and Wright, S. (2007). Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing: Special Issue on Convex Optimization Methods for Signal Processing, 1: Fountoulakis, K., Gondzio, J., and Zhlobich, P. (2012). Matrix-free interior point method for compressed sensing problems. Technical Report, University of Edinburgh. Gertz, E. M. and Wright, S. J. (2003). Object-oriented software for quadratic programming. ACM Transations on Mathematical Software, 29: Gondzio, J. and Fountoulakis, K. (2013). Second-order methods for l1-regularization. Talk at Optimization and Big Data Workshop, Edinburgh. Hale, E., Yin, W., and Zhang, Y. (2008). Fixed-point continuation for l1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19: Jaggi, M. (2013). Revisiting frank-wolfe: Projection-free sparse convex optimization. Technical Report, École Polytechnique, France. Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011). Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12: Lorenz, D. (2011). Constructing test instances for basis pursuit denoising. arxiv.org/abs/ M. Figueiredo and S. Wright First-Order Methods April / 68

114 References III Maculan, N. and de Paula, G. G. (1989). A linear-time median-finding algorithm for projecting a vector on the simplex of R n. Operations Research Letters, 8: Moreau, J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math, 255: Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k 2 ). Soviet Math. Doklady, 27: Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, New York. Rao, N., Shah, P., Wright, S. J., and Nowak, R. (2013). A greedy forward-backward algorithm for atomic-norm-constrained optimization. In Proceedings of ICASSP. Toh, K.-C. and Yun, S. (2010). An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6: Wright, S., Nowak, R., and Figueiredo, M. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57: Xiao, L. and Zhang, T. (2012). A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization. (to appear; available at M. Figueiredo and S. Wright First-Order Methods April / 68


This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization x = prox_f(x)+prox_{f^*}(x) use to get prox of norms! PROXIMAL METHODS WHY PROXIMAL METHODS Smooth

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Step-size Estimation for Unconstrained Optimization Methods

Step-size Estimation for Unconstrained Optimization Methods Volume 24, N. 3, pp. 399 416, 2005 Copyright 2005 SBMAC ISSN 0101-8205 www.scielo.br/cam Step-size Estimation for Unconstrained Optimization Methods ZHEN-JUN SHI 1,2 and JIE SHEN 3 1 College of Operations

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

About Split Proximal Algorithms for the Q-Lasso

About Split Proximal Algorithms for the Q-Lasso Thai Journal of Mathematics Volume 5 (207) Number : 7 http://thaijmath.in.cmu.ac.th ISSN 686-0209 About Split Proximal Algorithms for the Q-Lasso Abdellatif Moudafi Aix Marseille Université, CNRS-L.S.I.S

More information

arxiv: v1 [cs.lg] 20 Feb 2014

arxiv: v1 [cs.lg] 20 Feb 2014 GROUP-SPARSE MATRI RECOVERY iangrong Zeng and Mário A. T. Figueiredo Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal ariv:1402.5077v1 [cs.lg] 20 Feb 2014 ABSTRACT We apply the

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018 Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Proximal Algorithm Meets a Conjugate descent

Proximal Algorithm Meets a Conjugate descent Proximal Algorithm Meets a Conjugate descent Matthieu Kowalski To cite this version: Matthieu Kowalski. Proximal Algorithm Meets a Conjugate descent. 2010. HAL Id: hal-00505733 https://hal.archives-ouvertes.fr/hal-00505733v1

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Large-Scale L1-Related Minimization in Compressive Sensing and Beyond

Large-Scale L1-Related Minimization in Compressive Sensing and Beyond Large-Scale L1-Related Minimization in Compressive Sensing and Beyond Yin Zhang Department of Computational and Applied Mathematics Rice University, Houston, Texas, U.S.A. Arizona State University March

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

MS&E 318 (CME 338) Large-Scale Numerical Optimization

MS&E 318 (CME 338) Large-Scale Numerical Optimization Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods

More information

January 29, Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes Stiefel 1 / 13

January 29, Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes Stiefel 1 / 13 Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière Hestenes Stiefel January 29, 2014 Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes

More information

Homotopy methods based on l 0 norm for the compressed sensing problem

Homotopy methods based on l 0 norm for the compressed sensing problem Homotopy methods based on l 0 norm for the compressed sensing problem Wenxing Zhu, Zhengshan Dong Center for Discrete Mathematics and Theoretical Computer Science, Fuzhou University, Fuzhou 350108, China

More information

Math 164: Optimization Barzilai-Borwein Method

Math 164: Optimization Barzilai-Borwein Method Math 164: Optimization Barzilai-Borwein Method Instructor: Wotao Yin Department of Mathematics, UCLA Spring 2015 online discussions on piazza.com Main features of the Barzilai-Borwein (BB) method The BB

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation

A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation A Brief Overview of Practical Optimization Algorithms in the Context of Relaxation Zhouchen Lin Peking University April 22, 2018 Too Many Opt. Problems! Too Many Opt. Algorithms! Zero-th order algorithms:

More information

Convex Optimization and l 1 -minimization

Convex Optimization and l 1 -minimization Convex Optimization and l 1 -minimization Sangwoon Yun Computational Sciences Korea Institute for Advanced Study December 11, 2009 2009 NIMS Thematic Winter School Outline I. Convex Optimization II. l

More information

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1 EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

Optimized first-order minimization methods

Optimized first-order minimization methods Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization

More information

Optimization in Machine Learning

Optimization in Machine Learning Optimization in Machine Learning Stephen Wright University of Wisconsin-Madison Singapore, 14 Dec 2012 Stephen Wright (UW-Madison) Optimization Singapore, 14 Dec 2012 1 / 48 1 Learning from Data: Applications

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

Math 273a: Optimization Convex Conjugacy

Math 273a: Optimization Convex Conjugacy Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper

More information

Proximal Methods for Optimization with Spasity-inducing Norms

Proximal Methods for Optimization with Spasity-inducing Norms Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Lecture: Algorithms for Compressed Sensing

Lecture: Algorithms for Compressed Sensing 1/56 Lecture: Algorithms for Compressed Sensing Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:

More information

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Clément Royer (Université du Wisconsin-Madison, États-Unis) Toulouse, 8 janvier 2019 Nonconvex optimization via Newton-CG

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Lecture 23: November 19

Lecture 23: November 19 10-725/36-725: Conve Optimization Fall 2018 Lecturer: Ryan Tibshirani Lecture 23: November 19 Scribes: Charvi Rastogi, George Stoica, Shuo Li Charvi Rastogi: 23.1-23.4.2, George Stoica: 23.4.3-23.8, Shuo

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

Linear Programming: Simplex

Linear Programming: Simplex Linear Programming: Simplex Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Linear Programming: Simplex IMA, August 2016

More information

Composite nonlinear models at scale

Composite nonlinear models at scale Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)

More information

SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS

SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS SEMI-SMOOTH SECOND-ORDER TYPE METHODS FOR COMPOSITE CONVEX PROGRAMS XIANTAO XIAO, YONGFENG LI, ZAIWEN WEN, AND LIWEI ZHANG Abstract. The goal of this paper is to study approaches to bridge the gap between

More information

Introduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems

Introduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems Z. Akbari 1, R. Yousefpour 2, M. R. Peyghami 3 1 Department of Mathematics, K.N. Toosi University of Technology,

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via convex relaxations Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

arxiv: v1 [cs.na] 13 Nov 2014

arxiv: v1 [cs.na] 13 Nov 2014 A FIELD GUIDE TO FORWARD-BACKWARD SPLITTING WITH A FASTA IMPLEMENTATION TOM GOLDSTEIN, CHRISTOPH STUDER, RICHARD BARANIUK arxiv:1411.3406v1 [cs.na] 13 Nov 2014 Abstract. Non-differentiable and constrained

More information

Elaine T. Hale, Wotao Yin, Yin Zhang

Elaine T. Hale, Wotao Yin, Yin Zhang , Wotao Yin, Yin Zhang Department of Computational and Applied Mathematics Rice University McMaster University, ICCOPT II-MOPTA 2007 August 13, 2007 1 with Noise 2 3 4 1 with Noise 2 3 4 1 with Noise 2

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Gradient Descent, Newton-like Methods Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them in class/help-session/tutorial this

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Compressed Sensing: a Subgradient Descent Method for Missing Data Problems

Compressed Sensing: a Subgradient Descent Method for Missing Data Problems Compressed Sensing: a Subgradient Descent Method for Missing Data Problems ANZIAM, Jan 30 Feb 3, 2011 Jonathan M. Borwein Jointly with D. Russell Luke, University of Goettingen FRSC FAAAS FBAS FAA Director,

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

Introduction to Alternating Direction Method of Multipliers

Introduction to Alternating Direction Method of Multipliers Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction

More information

Interpolation-Based Trust-Region Methods for DFO

Interpolation-Based Trust-Region Methods for DFO Interpolation-Based Trust-Region Methods for DFO Luis Nunes Vicente University of Coimbra (joint work with A. Bandeira, A. R. Conn, S. Gratton, and K. Scheinberg) July 27, 2010 ICCOPT, Santiago http//www.mat.uc.pt/~lnv

More information

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22

More information