Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra

Optimal gradient methods

Optimal gradient methods

We saw the following efficiency estimates for the gradient method:

$f \in C_L^1$:  $f(x^k) - f^* \le \dfrac{2L \|x^0 - x^*\|_2^2}{k+4}$

$f \in S_{L,\mu}^1$:  $f(x^k) - f^* \le \dfrac{L}{2} \left( \dfrac{L-\mu}{L+\mu} \right)^{2k} \|x^0 - x^*\|_2^2$

We also saw lower complexity bounds:

$f \in C_L^1$:  $f(x^k) - f(x^*) \ge \dfrac{3L \|x^0 - x^*\|_2^2}{32(k+1)^2}$

$f \in S_{L,\mu}$:  $f(x^k) - f(x^*) \ge \dfrac{\mu}{2} \left( \dfrac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} \right)^{2k} \|x^0 - x^*\|_2^2$

Can we close the gap?

Gradient with momentum

Polyak's method (aka the heavy-ball method) for $f \in S_{L,\mu}^1$:

$x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})$

Converges (locally, i.e., for $\|x^0 - x^*\|_2 \le \epsilon$) as

$\|x^k - x^*\|_2^2 \le \left( \dfrac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} \right)^{2k} \|x^0 - x^*\|_2^2,$

for $\alpha_k = \dfrac{4}{(\sqrt{L} + \sqrt{\mu})^2}$ and $\beta_k = \left( \dfrac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} \right)^2$.
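To make the update concrete, here is a minimal Matlab sketch of the heavy-ball iteration; the quadratic test function $f(x) = \frac{1}{2} x^T Q x$, its eigenvalues, and the iteration count are illustrative assumptions, not part of the slides.

% Heavy-ball method on f(x) = 0.5*x'*Q*x (illustrative quadratic; f* = 0).
Q = diag([1, 10, 100]);                 % so mu = 1 and L = 100
mu = min(eig(Q));  L = max(eig(Q));
alpha = 4 / (sqrt(L) + sqrt(mu))^2;     % stepsize from the slide
beta  = ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))^2;   % momentum weight
x = [1; 1; 1];  xprev = x;              % take x^{-1} = x^0
for k = 1:200
    xnew  = x - alpha * (Q * x) + beta * (x - xprev);      % heavy-ball update
    xprev = x;
    x     = xnew;
end
fprintf('||x^k - x*||_2 = %g\n', norm(x));   % x* = 0 for this quadratic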

Nesterov's optimal gradient method

$\min_x f(x)$, where $f \in S_{L,\mu}^1$ with $\mu \ge 0$

1. Choose $x^0 \in \mathbb{R}^n$ and $\alpha_0 \in (0, 1)$
2. Let $y^0 \leftarrow x^0$; set $q = \mu/L$
3. $k$-th iteration ($k \ge 0$):
a) Compute $f(y^k)$ and $\nabla f(y^k)$; update the primary solution $x^{k+1} = y^k - \frac{1}{L} \nabla f(y^k)$
b) Compute the stepsize $\alpha_{k+1}$ by solving $\alpha_{k+1}^2 = (1 - \alpha_{k+1}) \alpha_k^2 + q \alpha_{k+1}$
c) Set $\beta_k = \alpha_k (1 - \alpha_k) / (\alpha_k^2 + \alpha_{k+1})$
d) Update the secondary solution $y^{k+1} = x^{k+1} + \beta_k (x^{k+1} - x^k)$

Optimal gradient method rate

Theorem. Let $\{x^k\}$ be the sequence generated by the above algorithm. If $\alpha_0 \ge \sqrt{\mu/L}$, then

$f(x^k) - f(x^*) \le c_1 \min \left\{ \left( 1 - \sqrt{\frac{\mu}{L}} \right)^k,\ \frac{4L}{(2\sqrt{L} + c_2 k)^2} \right\},$

where the constants $c_1, c_2$ depend on $\alpha_0$, $L$, $\mu$.

Proof: Somewhat involved; see notes.

Strongly convex case simplification

If $\mu > 0$, select $\alpha_0 = \sqrt{\mu/L}$. The two main steps,

1. Set $\beta_k = \alpha_k (1 - \alpha_k) / (\alpha_k^2 + \alpha_{k+1})$
2. $y^{k+1} = x^{k+1} + \beta_k (x^{k+1} - x^k)$,

simplify, since

$\alpha_k = \sqrt{\frac{\mu}{L}}, \qquad \beta_k = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}, \qquad k \ge 0.$

The optimal method simplifies to:

1. Choose $y^0 = x^0 \in \mathbb{R}^n$
2. $k$-th iteration ($k \ge 0$):
a) $x^{k+1} = y^k - \frac{1}{L} \nabla f(y^k)$
b) $y^{k+1} = x^{k+1} + \beta (x^{k+1} - x^k)$

Notice the similarity to Polyak's method!
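A minimal Matlab sketch of this simplified method; as in the earlier sketch, the quadratic test function and iteration count are assumptions made for illustration.

% Nesterov's method, strongly convex simplification (same illustrative quadratic).
Q = diag([1, 10, 100]);  mu = min(eig(Q));  L = max(eig(Q));
beta = (sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu));   % fixed momentum weight
x = [1; 1; 1];  y = x;                                % y^0 = x^0
for k = 1:200
    xnew = y - (1/L) * (Q * y);        % a) gradient step from y^k
    y    = xnew + beta * (xnew - x);   % b) momentum extrapolation
    x    = xnew;
end
fprintf('f(x^k) - f* = %g\n', 0.5 * x' * Q * x);      % f* = 0 here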

Summary so far

- Convex $f(x)$ with subgradients bounded by $G$: subgradient method
- Differentiable $f \in C_L^1$: gradient methods
- Rate of convergence for smooth convex problems
- Faster rate of convergence for smooth, strongly convex problems
- Constrained gradient methods: Frank-Wolfe method
- Constrained gradient methods: gradient projection
- Nesterov's optimal gradient methods (smooth)

Gap between lower and upper bounds: $O(1/\sqrt{t})$ for convex (subgradient method); $O(1/t^2)$ for $C_L^1$; linear rate for smooth, strongly convex.

Nonsmooth optimization

Unconstrained problem: $\min_{x \in \mathbb{R}^n} f(x)$, where

- $f$ is convex on $\mathbb{R}^n$, and Lipschitz continuous on a bounded set $X$:
  $|f(x) - f(y)| \le L \|x - y\|_2$ for all $x, y \in X$
- At each step, we have access to $f(x)$ and some $g \in \partial f(x)$
- Goal: find $x^k \in \mathbb{R}^n$ such that $f(x^k) - f^* \le \epsilon$
- First-order methods: $x^k \in x^0 + \mathrm{span}\{g^0, \ldots, g^{k-1}\}$

Nonsmooth optimization: example

Let $\phi(x) = |x|$ for $x \in \mathbb{R}$.

Subgradient method: $x^{k+1} = x^k - \alpha_k g^k$, where $g^k \in \partial \phi(x^k)$.

If $x^0 = 1$ and $\alpha_k = \frac{1}{\sqrt{k+1}} + \frac{1}{\sqrt{k+2}}$ (this stepsize is known to be optimal), then $|x^k| = \frac{1}{\sqrt{k+1}}$.

Thus, $O(1/\epsilon^2)$ iterations are needed to obtain $\epsilon$-accuracy.

This behavior is typical for the subgradient method, which exhibits $O(1/\sqrt{k})$ convergence in general.

Can we do better in general?
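The claimed behavior is easy to verify numerically; a short Matlab sketch (the number of iterations printed is arbitrary):

% Subgradient method on phi(x) = |x| with the stepsize from the slide.
x = 1;                                     % x^0 = 1, so |x^0| = 1/sqrt(1)
for k = 0:9
    alpha = 1/sqrt(k+1) + 1/sqrt(k+2);     % the "optimal" stepsize
    x = x - alpha * sign(x);               % sign(x) is a subgradient of |x|
    fprintf('|x^%d| = %.4f vs 1/sqrt(%d) = %.4f\n', k+1, abs(x), k+2, 1/sqrt(k+2));
end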

Nonsmooth optimization

Nope!

Theorem (Nesterov). Let $B = \{x : \|x - x^0\|_2 \le D\}$ and assume $x^* \in B$. There exists a convex function $f \in C_L^0(B)$ (with $L > 0$) such that for $0 \le k \le n-1$, the lower bound

$f(x^k) - f(x^*) \ge \frac{LD}{2(1 + \sqrt{k+1})}$

holds for any algorithm that generates $x^k$ by linearly combining the previous iterates and subgradients.

Should we give up? No! Several possibilities remain!

Nonsmooth optimization

The blackbox model is too pessimistic.

- Nesterov's breakthroughs: the excessive gap technique; composite objective minimization
- Nemirovski's workhorses of general convex optimization: mirror descent, NERML; mirror-prox
- Other techniques, problem classes, etc.

Proximal splitting

Composite objectives

Frequently, nonsmooth problems take the composite form

minimize $f(x) := \ell(x) + r(x)$

with $\ell$ smooth and $r$ nonsmooth.

Example: $\ell(x) = \frac{1}{2} \|Ax - b\|^2$ and $r(x) = \lambda \|x\|_1$: Lasso, L1-LS, compressed sensing

Example: $\ell(x)$ the logistic loss and $r(x) = \lambda \|x\|_1$: $\ell_1$-logistic regression, sparse LR

Composite objective minimization

minimize $f(x) := \ell(x) + r(x)$

Subgradient method: $x^{k+1} = x^k - \alpha_k g^k$, $g^k \in \partial f(x^k)$

The subgradient method converges slowly, at rate $O(1/\sqrt{k})$.

But $f$ is smooth plus nonsmooth: we should exploit the smoothness of $\ell$ to get a better method!

Projections: another view

Let $I_X$ be the indicator function of a closed, convex set $X$:

$I_X(x) := \begin{cases} 0 & \text{if } x \in X, \\ \infty & \text{otherwise.} \end{cases}$

Recall the orthogonal projection $P_X(y)$:

$P_X(y) := \operatorname{argmin}\ \tfrac{1}{2} \|x - y\|_2^2 \quad \text{s.t. } x \in X.$

Rewrite the orthogonal projection $P_X(y)$ as

$P_X(y) := \operatorname{argmin}_{x \in \mathbb{R}^n}\ \tfrac{1}{2} \|x - y\|_2^2 + I_X(x).$
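For instance, when $X$ is a box, $P_X$ (equivalently, the minimizer of $\frac{1}{2}\|x - y\|_2^2 + I_X(x)$) is componentwise clipping; a small Matlab illustration with arbitrarily chosen bounds:

% Projection onto the box X = [lo, hi]^n, written as the prox of I_X.
lo = -1;  hi = 1;                     % illustrative bounds
P_X = @(y) min(max(y, lo), hi);       % clip each component into [lo, hi]
disp(P_X([2; -0.5; -3]))              % prints [1; -0.5; -1]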

Generalizing projections: proximity

Projection: $P_X(y) := \operatorname{argmin}_{x \in \mathbb{R}^n}\ \tfrac{1}{2} \|x - y\|_2^2 + I_X(x)$

Proximity: replace $I_X$ by an arbitrary convex function!

$\operatorname{prox}_r(y) := \operatorname{argmin}_{x \in \mathbb{R}^n}\ \tfrac{1}{2} \|x - y\|_2^2 + r(x)$

Def. The map $\operatorname{prox}_r : \mathbb{R}^n \to \mathbb{R}^n$ is called a proximity operator.

[Figure: the proximity operator of the $\ell_1$-norm, shown with the $\ell_1$-norm ball of radius $\rho(\lambda)$.]

Proximity operators

Exercise: Let $r(x) = \|x\|_1$. Solve for $\operatorname{prox}_{\lambda r}(y)$:

$\min_{x \in \mathbb{R}^n}\ \tfrac{1}{2} \|x - y\|_2^2 + \lambda \|x\|_1.$

Hint 1: The problem decomposes into $n$ independent subproblems of the form $\min_{x \in \mathbb{R}}\ \tfrac{1}{2}(x - y)^2 + \lambda |x|$.
Hint 2: Consider the two cases separately: either $x = 0$ or $x \ne 0$.

Also known as: the soft-thresholding operator.
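Working through the exercise yields the closed form stated on the FBS example slide below; a minimal Matlab implementation (the test vector is arbitrary):

% Soft-thresholding: prox of lambda*||.||_1, applied componentwise.
soft = @(y, lambda) sign(y) .* max(abs(y) - lambda, 0);
disp(soft([3; -0.5; 1.2], 1))      % prints [2; 0; 0.2]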

Basics of proximal splitting

Recall gradient projection for solving $\min_{x \in X} f(x)$ with $f \in C_L^1$:

$x^{k+1} = P_X(x^k - \alpha_k \nabla f(x^k))$

The proximal gradient method solves $\min\ \ell(x) + r(x)$ via

$x^{k+1} = \operatorname{prox}_{\alpha_k r}(x^k - \alpha_k \nabla \ell(x^k)).$

This method is also known as forward-backward splitting (FBS):
- Forward step: the gradient-descent step
- Backward step: the prox operator

FBS example: Lasso / L1-LS

$\min\ \tfrac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1.$

$\operatorname{prox}_{\lambda \|\cdot\|_1}(y) = \operatorname{sgn}(y) \odot \max(|y| - \lambda, 0)$ (componentwise)

$x^{k+1} = \operatorname{prox}_{\alpha_k \lambda \|\cdot\|_1}(x^k - \alpha_k A^T (Ax^k - b)).$

This is the so-called iterative soft-thresholding algorithm!

Exercise: Try solving the problem $\min\ \tfrac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_2$.
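Putting the pieces together, a minimal Matlab sketch of the resulting loop; the random problem data, the constant stepsize $1/L$ with $L = \|A\|_2^2$, and the iteration count are illustrative assumptions:

% Iterative soft-thresholding for min 0.5*||A*x - b||_2^2 + lambda*||x||_1.
rng(0);  A = randn(20, 50);  b = randn(20, 1);  lambda = 0.5;
soft  = @(y, t) sign(y) .* max(abs(y) - t, 0);   % prox of t*||.||_1
alpha = 1 / norm(A)^2;                 % constant stepsize 1/L, L = ||A||_2^2
x = zeros(50, 1);
for k = 1:500
    x = soft(x - alpha * A' * (A * x - b), alpha * lambda);
end
fprintf('nnz(x) = %d, objective = %g\n', nnz(x), ...
    0.5 * norm(A * x - b)^2 + lambda * norm(x, 1));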

Exercise

Recall our older example: $\tfrac{1}{2} \|D^T x - b\|_2^2$. We solved its unconstrained and constrained versions so far. Now implement a Matlab script to solve

$\min\ \tfrac{1}{2} \|D^T x - b\|_2^2 + \lambda \|x\|_1.$

- Use FBS as shown above (a starting skeleton follows this list)
- Try different values of $\lambda > 0$ in your code
- Use different choices of $b$
- Show a choice of $b$ for which $x^* = 0$ (the zero vector) is optimal
- Experiment with different stepsizes (try $\alpha_k = 1/4$, and if that does not work, try smaller values). Do not expect monotonic descent.
- Compare with versions of the subgradient method
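A minimal skeleton to start from; $D$ and $b$ below are random placeholders (the earlier example's data is not reproduced here), so substitute your own:

% FBS skeleton for min 0.5*||D'*x - b||_2^2 + lambda*||x||_1.
% D and b are placeholders; substitute the D, b from the earlier example.
n = 50;  D = randn(n, n);  b = randn(n, 1);  lambda = 1;
soft  = @(y, t) sign(y) .* max(abs(y) - t, 0);
alpha = 1 / norm(D)^2;             % a safe 1/L choice; the slide suggests also trying 1/4
x = zeros(n, 1);
for k = 1:1000
    grad = D * (D' * x - b);       % gradient of the smooth part
    x = soft(x - alpha * grad, alpha * lambda);
end
fprintf('objective = %g\n', 0.5 * norm(D' * x - b)^2 + lambda * norm(x, 1));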

Proximity operators

Proximity operators

$\operatorname{prox}_r$ has several nice properties. Read / skim the paper: "Proximal Splitting Methods in Signal Processing," by Combettes and Pesquet (2010).

Theorem. The operator $\operatorname{prox}_r$ is firmly nonexpansive (FNE):

$\|\operatorname{prox}_r x - \operatorname{prox}_r y\|_2^2 \le \langle \operatorname{prox}_r x - \operatorname{prox}_r y,\ x - y \rangle$

Proof: (blackboard)

Corollary. The operator $\operatorname{prox}_r$ is nonexpansive.
Proof: Apply Cauchy-Schwarz to FNE.
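One can sanity-check firm nonexpansiveness numerically, e.g., for the soft-thresholding prox from earlier; the random test points are purely illustrative:

% Numerical check of firm nonexpansiveness for the prox of ||.||_1.
rng(1);  soft = @(y) sign(y) .* max(abs(y) - 1, 0);   % lambda = 1
x = randn(5, 1);  y = randn(5, 1);
px = soft(x);  py = soft(y);
lhs = norm(px - py)^2;
rhs = dot(px - py, x - y);
fprintf('FNE inequality holds: %d (lhs %.4f <= rhs %.4f)\n', lhs <= rhs, lhs, rhs);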

Consequence of FNE

Gradient projection: $x^{k+1} = P_X(x^k - \alpha_k \nabla f(x^k))$
Proximal gradients / FBS: $x^{k+1} = \operatorname{prox}_{\alpha_k r}(x^k - \alpha_k \nabla \ell(x^k))$

The same convergence theory goes through!

Exercise: Try extending the convergence proof of gradient projection to FBS. Hint: First show that $x^*$ satisfies the fixed-point equation

$x^* = \operatorname{prox}_{\alpha r}(x^* - \alpha \nabla \ell(x^*)), \qquad \alpha > 0.$

Moreau decomposition

Aim: compute $\operatorname{prox}_r y$.

Sometimes it is easier to compute $\operatorname{prox}_{r^*} y$, where $r^*$ is the conjugate

$r^*(u) := \sup_x\ u^T x - r(x).$

Moreau decomposition: $y = \operatorname{prox}_r y + \operatorname{prox}_{r^*} y$.
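For example, with $r = \lambda \|\cdot\|_1$ the conjugate $r^*$ is the indicator of the $\ell_\infty$-ball of radius $\lambda$, so $\operatorname{prox}_{r^*}$ is projection (clipping) onto that ball; a quick Matlab check of the identity (the test vector and $\lambda$ are arbitrary):

% Moreau decomposition check for r = lambda*||.||_1:
% r* is the indicator of {u : ||u||_inf <= lambda}, so prox_{r*} is a clip.
lambda = 0.7;  y = [2; -0.3; 1.1; -5];
prox_r     = sign(y) .* max(abs(y) - lambda, 0);   % soft-thresholding
prox_rstar = min(max(y, -lambda), lambda);         % clip into [-lambda, lambda]
disp(norm(y - (prox_r + prox_rstar)))              % prints 0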

Moreau decomposition

Proof sketch: Consider $\min_x\ \tfrac{1}{2} \|x - y\|_2^2 + r(x)$.

Introduce a new variable $z = x$ to get

$\operatorname{prox}_r y := \operatorname{argmin}\ \tfrac{1}{2} \|x - y\|_2^2 + r(z) \quad \text{s.t. } x = z.$

- Derive the Lagrangian dual of this problem
- Simplify, and conclude!