Coordinate Update Algorithm Short Course: Proximal Operators and Algorithms


Coordinate Update Algorithm Short Course: Proximal Operators and Algorithms
Instructor: Wotao Yin (UCLA Math), Summer 2016

Why proximal?
- Newton's method: for C²-smooth, unconstrained problems; allows modest problem sizes
- Gradient method: for C¹-smooth, unconstrained problems; allows large problem sizes and parallel implementations
- Proximal method: for smooth and non-smooth, constrained and unconstrained problems; exploits problem structures; allows large problem sizes and parallel implementations

- Newton's algorithm uses a low-level (explicit) operation: x^{k+1} ← x^k − λ H⁻¹(x^k) ∇f(x^k)
- The gradient algorithm uses a low-level (explicit) operation: x^{k+1} ← x^k − λ ∇f(x^k)
- The proximal-point algorithm uses a high-level (implicit) operation: x^{k+1} ← prox_{λf}(x^k)
  - well-known algorithms are special cases
- The proximal operator prox_{λf} is itself an optimization problem
  - either standalone or used as a subproblem
  - simple for structured f, and there are many such functions

Notation and Assumptions
- f : R^n → R ∪ {∞} is a closed, proper, convex function (why? to ensure that prox_{λf} is well-defined and unique)
  - proper means dom f ≠ ∅
  - closed means epi f is a closed set
- R^n can be extended to a general Hilbert space H
- f can take the value ∞; this saves writing "x ∈ dom f" in many settings
- an operator maps R^n to R^n; it is also called a mapping or a map
- indicator function of a set C: ι_C(x) := 0 if x ∈ C, and ∞ otherwise
- argmin_x f(x) is the set of minimizers of f; if the minimizer is unique, it denotes the minimizer

Definition
The proximal operator prox_f : R^n → R^n of a function f is defined by
    prox_f(v) = argmin_{x ∈ R^n} ( f(x) + (1/2) ‖x − v‖² ).
The (scaled) proximal operator prox_{λf} : R^n → R^n is defined by
    prox_{λf}(v) = argmin_{x ∈ R^n} ( f(x) + (1/(2λ)) ‖x − v‖² ).
These subproblems are strongly convex with unique minimizers, so "=" (rather than "∈") is used.
The Moreau envelope
    f_λ(v) := min_{x ∈ R^n} ( f(x) + (1/(2λ)) ‖x − v‖² )
is a differentiable function, with ∇f_λ(v) = (1/λ)(v − x*), where x* := prox_{λf}(v).
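As a quick numerical sanity check (a sketch with an assumed test function, not from the slides): for f(x) = |x|, the argmin in the definition can be approximated by brute-force grid search, and it matches the soft-thresholding formula derived later in the lecture.

```python
# Brute-force sketch (assumed example): approximate prox_{lam*f}(v) for
# f(x) = |x| by minimizing f(x) + (x - v)^2 / (2*lam) over a fine grid.

def prox_numeric(f, v, lam, lo=-10.0, hi=10.0, steps=40001):
    best_x, best_val = lo, float("inf")
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        val = f(x) + (x - v) ** 2 / (2 * lam)
        if val < best_val:
            best_x, best_val = x, val
    return best_x

lam, v = 0.5, 2.0
x_star = prox_numeric(abs, v, lam)
print(abs(x_star - (v - lam)) < 1e-3)  # True: matches soft-thresholding
```

The grid search stands in for the inner minimization only for illustration; in practice one uses the closed forms developed below.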

Proximal of an indicator function is a projection
Consider a nonempty closed convex set C:
    prox_{ι_C}(x) = argmin_y ( ι_C(y) + (1/2) ‖y − x‖² )
                  = argmin_{y ∈ C} (1/2) ‖y − x‖²
                  =: proj_C(x)
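In 1D, for C = [l, u], this projection is just clipping. A minimal sketch (the interval bounds are assumed for illustration):

```python
def proj_interval(x, l, u):
    """prox of the indicator of C = [l, u]: the closest point of C to x."""
    return min(max(x, l), u)

print(proj_interval(2.5, 0.0, 1.0))   # 1.0 (x above the interval)
print(proj_interval(-3.0, 0.0, 1.0))  # 0.0 (x below the interval)
print(proj_interval(0.4, 0.0, 1.0))   # 0.4 (x already in C)
```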

1D illustration
[Figure: 1D plot.] f(x) + (1/(2λ)) ‖x − v‖² lies above f(x); unless v ∈ argmin f, prox_{λf}(v) moves away from v, and f(prox_{λf}(v)) < f(v).

Proximal parameter
Tuning λ in
    prox_{λf}(v) = argmin_{x ∈ R^n} ( f(x) + (1/(2λ)) ‖x − v‖² ):
- as λ → ∞: prox_{λf}(v) → proj_{argmin f}(v)
- as λ → 0: prox_{λf}(v) → proj_{dom f}(v), where
    proj_{dom f}(v) = argmin_{x ∈ R^n} { (1/2) ‖v − x‖² : f(x) is finite }
- (prox_{λf}(v) − v) is generally nonlinear in λ, so λ is a nonlinear step size

prox_{λf} is a soft projection
- The path {prox_{λf}(v) : λ > 0} ⊂ dom f
- prox_{λf}(v) lies between proj_{dom f}(v) and proj_{argmin f}(v)
- The paths generated by different v may overlap or join
- If v ∈ argmin f, then prox_{λf}(v) = v is a fixed point

Examples

Examples
Linear function: let a ∈ R^n, b ∈ R, and f(x) := aᵀx + b.
Proximal of a linear function:
    prox_{λf}(v) := argmin_{x ∈ R^n} ( aᵀx + b + (1/(2λ)) ‖x − v‖² )
has the first-order optimality condition
    a + (1/λ)(prox_{λf}(v) − v) = 0  ⟹  prox_{λf}(v) = v − λa.
Application: proximal of the linear approximation of f. Let
    f⁽¹⁾(x) = f(x₀) + ⟨∇f(x₀), x − x₀⟩.
Then prox_{λf⁽¹⁾}(x₀) = x₀ − λ∇f(x₀) is a gradient step with size λ.
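A small check of this closed form (the vectors a, v and the step λ are made-up test values): the first-order optimality condition evaluates to zero at v − λa.

```python
import numpy as np

# Verify prox_{lam*f}(v) = v - lam*a for the linear f(x) = a^T x + b
# by plugging it into the optimality condition a + (v* - v)/lam = 0.

a = np.array([1.0, -2.0, 0.5])
v = np.array([0.0, 3.0, 1.0])
lam = 0.1

v_star = v - lam * a               # closed-form prox
residual = a + (v_star - v) / lam  # should be exactly zero
print(np.allclose(residual, 0.0))  # True
```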

Examples
Quadratic function: let A be a symmetric positive semidefinite matrix, b ∈ R^n, and f(x) := (1/2) xᵀAx − bᵀx + c.
Proximal of a quadratic function:
    prox_{λf}(v) := argmin_{x ∈ R^n} ( f(x) + (1/(2λ)) ‖x − v‖² )
has the first-order optimality condition (writing v* := prox_{λf}(v))
    (Av* − b) + (1/λ)(v* − v) = 0
    ⟹ v* = (λA + I)⁻¹(λb + v)
         = (λA + I)⁻¹(λb + λAv + v − λAv)
         = v + (A + (1/λ)I)⁻¹(b − Av).
It recovers an iterative-refinement method for Ax = b.
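The linear-solve form can be checked numerically (the matrix A, vectors b, v, and λ below are randomly generated, not from the slides):

```python
import numpy as np

# Sketch: prox of f(x) = (1/2) x^T A x - b^T x solves the linear system
# (lam*A + I) x = lam*b + v; verify the optimality condition afterwards.

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T                    # symmetric positive semidefinite
b = rng.standard_normal(3)
v = rng.standard_normal(3)
lam = 0.5

x_star = np.linalg.solve(lam * A + np.eye(3), lam * b + v)
# optimality condition (A x* - b) + (x* - v)/lam = 0
print(np.allclose(A @ x_star - b + (x_star - v) / lam, 0.0))  # True
```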

Application: proximal of the quadratic approximation of f. Let
    f⁽²⁾(x) = f(x₀) + ⟨∇f(x₀), x − x₀⟩ + (1/2)(x − x₀)ᵀ ∇²f(x₀) (x − x₀) =: (1/2) xᵀAx − bᵀx + c,
where
    A = ∇²f(x₀),  b = (∇²f(x₀))ᵀ x₀ − ∇f(x₀).
By letting v = x₀, we get
    prox_{λf⁽²⁾}(x₀) = x₀ − (∇²f(x₀) + (1/λ)I)⁻¹ ∇f(x₀),
which recovers the modified-Hessian Newton or Levenberg–Marquardt method.

Examples
ℓ₁-norm: let f(x) = ‖x‖₁. Proximal of the ℓ₁-norm:
    prox_{λf}(v) := argmin_{x ∈ R^n} ( ‖x‖₁ + (1/(2λ)) ‖x − v‖² ).
The subgradient optimality condition (writing v* := prox_{λf}(v)):
    0 ∈ ∂f(v*) + (1/λ)(v* − v)  ⟺  v − v* ∈ λ ∂f(v*).
Recall that ∂f(x) = ∂|x₁| × ⋯ × ∂|xₙ| for f(x) = ‖x‖₁. Hence the condition reduces to component-wise subproblems:
    vᵢ − vᵢ* ∈ λ ∂|vᵢ*|.

Proximal of the ℓ₁-norm (cont.) Three cases:
- vᵢ* > 0: then vᵢ − vᵢ* = λ, so vᵢ* = vᵢ − λ
- vᵢ* < 0: then vᵢ − vᵢ* = −λ, so vᵢ* = vᵢ + λ
- vᵢ* = 0: then vᵢ − vᵢ* = vᵢ ∈ [−λ, λ]
Rewriting these conditions in terms of vᵢ:
- if vᵢ > λ, then vᵢ* = vᵢ − λ
- if vᵢ < −λ, then vᵢ* = vᵢ + λ
- if vᵢ ∈ [−λ, λ], then vᵢ* = 0.
prox_{λf} is the shrinkage (or element-wise soft-thresholding) operator
    shrink(v, λ)ᵢ = max(|vᵢ| − λ, 0) · vᵢ/|vᵢ|.
In Matlab: max(abs(v)-lambda,0).*sign(v)
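A NumPy translation of the Matlab one-liner above (the test vector is made-up):

```python
import numpy as np

def shrink(v, lam):
    """Element-wise soft-thresholding: the prox of lam * ||.||_1."""
    return np.maximum(np.abs(v) - lam, 0.0) * np.sign(v)

v = np.array([3.0, -0.5, 0.0, -2.0])
print(shrink(v, 1.0))  # each entry moves toward 0 by 1, stopping at 0
```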

Examples
Proximal of the ℓ₂-norm: let f(x) = ‖x‖₂. Then
    prox_{λf}(x) = max(‖x‖₂ − λ, 0) · x/‖x‖₂ = x − proj_{B₂(0,λ)}(x),
with the special convention 0/0 = 0 if x = 0.
General pattern: proximal of the ℓ_p-norm. Suppose p⁻¹ + q⁻¹ = 1 with p, q ∈ [1, ∞]. Then
    prox_{λ‖·‖_p}(x) = x − proj_{B_q(0,λ)}(x).
Useful for getting the proximals of the ℓ_∞-norm and the ℓ₂,₁-norm.
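The ℓ₂ formula is "block" soft-thresholding: the whole vector shrinks toward the origin and is zeroed once its norm drops below λ. A sketch (test vector made-up):

```python
import numpy as np

def prox_l2(x, lam):
    """prox of lam * ||.||_2, with the convention 0/0 = 0."""
    nrm = np.linalg.norm(x)
    if nrm <= lam:
        return np.zeros_like(x)
    return (1.0 - lam / nrm) * x

x = np.array([3.0, 4.0])   # ||x||_2 = 5
print(prox_l2(x, 1.0))     # scales x by (1 - 1/5): [2.4, 3.2]
print(prox_l2(np.array([0.1, 0.0]), 1.0))  # norm below lam: zeroed
```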

Examples
Unitarily invariant matrix norms are vector norms applied to the singular values:
- Frobenius norm: ℓ₂-norm of the singular values
- nuclear norm: ℓ₁-norm of the singular values
- ℓ₂-operator (spectral) norm: ℓ_∞-norm of the singular values
Note: the spectral norm is the square root of the largest eigenvalue of AᵀA; for an asymmetric matrix A, the largest eigenvalue of A may not equal the largest singular value.
Notation: let ‖·‖ denote a unitarily invariant matrix norm, and let the corresponding vector norm (applied to the singular values) be written the same way.

Proximals of unitarily invariant matrix norms:
    X* = prox_{λ‖·‖}(A) := argmin_X ( ‖X‖ + (1/(2λ)) ‖X − A‖²_F )
Computation steps:
1. SVD: A = U diag(σ) Vᵀ
2. proximal: σ* = argmin_s ( ‖s‖ + (1/(2λ)) ‖s − σ‖² )
3. return: X* = U diag(σ*) Vᵀ
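For the nuclear norm, step 2 is exactly the ℓ₁ shrink applied to the singular values. A sketch of the three steps (the test matrix is made-up; a diagonal matrix is used so the result is easy to read):

```python
import numpy as np

def prox_nuclear(A, lam):
    """prox of lam * ||.||_* : soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # step 1: SVD
    s_shrunk = np.maximum(s - lam, 0.0)                # step 2: l1 prox on s
    return U @ np.diag(s_shrunk) @ Vt                  # step 3: rebuild

A = np.diag([3.0, 1.0, 0.2])
X = prox_nuclear(A, 0.5)
print(np.diag(X))  # singular values 3, 1, 0.2 shrink to 2.5, 0.5, 0
```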

Proximable functions
Definition: a function f : R^n → R is proximable if prox_{λf} can be computed in O(n) or O(n polylog(n)) time.
Examples:
- norms: ℓ₁, ℓ₂, ℓ₂,₁, ℓ_∞, ...
- separable functions/constraints: ‖x‖₀, l ≤ x ≤ u
- the standard simplex: {x ∈ R^n : 1ᵀx = 1, x ≥ 0}
- ...
In general, f and g both being proximable does NOT imply that f + g is proximable; but there are exceptions. If f + g is proximable, we can simplify operator splitting.

f + g proximable functions
Let ∘ denote operator composition: for example,
    (prox_f ∘ prox_g)(x) := prox_f( prox_g(x) ).
Rule 1: if f : R → R is convex and f′(0) = 0, then the scalar function f + μ|·| is proximable:
    prox_{f + μ|·|} = prox_f ∘ prox_{μ|·|}.
Application: the elastic net regularizer (1/2)‖x‖₂² + α‖x‖₁.
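A scalar sketch of the elastic-net application (λ, α, and the test point are made-up values): the prox of (1/2)x² + α|x| is computed by composing the shrink with the quadratic prox, and checked against a brute-force minimization.

```python
# prox_{lam*h}(v) for h(x) = (1/2)x^2 + alpha*|x|, via the composition
# rule: apply the prox of lam*alpha*|.| first, then of lam*(1/2)(.)^2.

def soft(v, t):
    return max(abs(v) - t, 0.0) * (1 if v > 0 else -1 if v < 0 else 0)

def prox_elastic_net(v, lam, alpha):
    y = soft(v, lam * alpha)   # prox of lam*alpha*|.|
    return y / (1.0 + lam)     # prox of lam*(1/2)(.)^2

lam, alpha, v = 0.5, 1.0, 3.0
x_comp = prox_elastic_net(v, lam, alpha)

# brute-force check: minimize h(x) + (x - v)^2 / (2*lam) on a fine grid
x_bf = min((-5.0 + 1e-4 * i for i in range(100001)),
           key=lambda x: 0.5 * x * x + alpha * abs(x)
                         + (x - v) ** 2 / (2 * lam))
print(abs(x_comp - x_bf) < 1e-3)  # True: the composition gives the prox
```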

Rule 2: g is a 1-homogeneous function if g(αx) = αg(x) for all α ≥ 0. Examples: ℓ₁, ℓ_∞, ι_{x≥0}, ι_{x≤0}.
If g is a 1-homogeneous function, then (1/2)‖·‖₂² + g is proximable:
    prox_{(1/2)‖·‖₂² + g} = prox_{(1/2)‖·‖₂²} ∘ prox_g.

Rule 3: the 1D discrete total variation is
    TV(x) := Σ_{i=1}^{n−1} |x_{i+1} − x_i|.
f is component prox-monotonic if, for all x ∈ R^n and i, j ∈ {1, ..., n},
    xᵢ < xⱼ ⟹ (prox_f(x))ᵢ ≤ (prox_f(x))ⱼ,
    xᵢ = xⱼ ⟹ (prox_f(x))ᵢ = (prox_f(x))ⱼ.
Examples: ℓ₁, ℓ₂, ℓ_∞, ι_{x≥l}, ι_{x≤u}, ι_{[l,u]}.
If f is component prox-monotonic, then f + TV is proximable:
    prox_{f+TV} = prox_f ∘ prox_TV.
Application: the fused LASSO regularizer α‖x‖₁ + TV(x).

Properties

Separable sum
Proposition: for a separable function f(x, y) = φ(x) + ψ(y),
    prox_{λf}(v, w) = ( prox_{λφ}(v), prox_{λψ}(w) ).
We have observed this with the proximal of ‖x‖₁ := Σ_{i=1}^n |xᵢ|.
It can be used to derive the proximal of ‖x‖₂,₁ := Σ_{i=1}^p ‖x₍ᵢ₎‖₂ (as in non-overlapping group LASSO).

Proximal fixed point
Theorem (fixed point = minimizer). Let λ > 0. A point x* ∈ R^n is a minimizer of f if, and only if, prox_{λf}(x*) = x*.
Proof.
⟹: Let x* ∈ argmin f(x). Then for any x ∈ R^n,
    f(x) + (1/(2λ)) ‖x − x*‖² ≥ f(x*) + (1/(2λ)) ‖x* − x*‖².
Thus x* = argmin_x ( f(x) + (1/(2λ)) ‖x − x*‖² ), so x* = prox_{λf}(x*).
⟸: Let x* = prox_{λf}(x*). Then by the subgradient optimality condition,
    0 ∈ ∂f(x*) + (1/λ)(x* − x*) = ∂f(x*).
Thus 0 ∈ ∂f(x*), and x* ∈ argmin f(x). ∎
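A quick illustration of the theorem (the test function f(x) = |x − 2| is made-up; its prox follows by translating the soft-thresholding formula by 2):

```python
# f(x) = |x - 2| has minimizer x* = 2, and prox_{lam*f}(v) = 2 + soft(v-2, lam).
# The minimizer is a fixed point of the prox; other points move toward it.

def soft(v, t):
    return max(abs(v) - t, 0.0) * (1 if v > 0 else -1 if v < 0 else 0)

def prox_f(v, lam):
    return 2.0 + soft(v - 2.0, lam)

print(prox_f(2.0, 0.7))  # 2.0: the minimizer is a fixed point
print(prox_f(5.0, 0.7))  # 4.3: a non-minimizer moves toward 2
```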

Proximal operator and resolvent
Definition: for a monotone operator T, (I + λT)⁻¹ is the (well-defined) resolvent of T.
Proposition: prox_{λf} = (I + λ∂f)⁻¹.
Informal proof.
    x ∈ (I + λ∂f)⁻¹(v)
    ⟺ v ∈ (I + λ∂f)(x)
    ⟺ v ∈ x + λ∂f(x)
    ⟺ 0 ∈ x − v + λ∂f(x)
    ⟺ 0 ∈ (1/λ)(x − v) + ∂f(x)
    ⟺ x = argmin_{x ∈ R^n} ( f(x) + (1/(2λ)) ‖x − v‖² ).

Proximal-Point Algorithm

Proximal-point algorithm (PPA)
Iteration: x^{k+1} ← prox_{λf}(x^k)
- seldom used directly to minimize f, because computing prox_{λf} is as difficult as the original problem
- recovers the method of multipliers, or augmented Lagrangian method (later lecture)
- has nice iterate-convergence properties
- λ can be relaxed to take values in an interval of R₊₊
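A toy run of the PPA iteration (the test function f(x) = |x − 2| and the step λ are made-up): each step moves λ toward the minimizer and then stays there exactly.

```python
# PPA x^{k+1} = prox_{lam*f}(x^k) on f(x) = |x - 2|, starting from x = 10.

def soft(v, t):
    return max(abs(v) - t, 0.0) * (1 if v > 0 else -1 if v < 0 else 0)

lam, x = 0.7, 10.0
for k in range(30):
    x = 2.0 + soft(x - 2.0, lam)  # prox of lam * |. - 2|
print(x)  # 2.0: converged to the minimizer in finitely many steps
```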

Proximal is firmly nonexpansive
Definition: a map T is nonexpansive if ‖T(x) − T(y)‖ ≤ ‖x − y‖ for all x, y.
Definition: a map T is firmly nonexpansive if
    ‖T(x) − T(y)‖² ≤ ‖x − y‖² − ‖(x − T(x)) − (y − T(y))‖²  for all x, y.
A key property in the development and analysis of first-order algorithms!

Proposition
For proper closed convex f and λ > 0, prox_{λf} is firmly nonexpansive.
Proof. Take arbitrary x, y, and let x* := prox_{λf}(x) and y* := prox_{λf}(y). By the subgradient optimality conditions,
    (x − x*) ∈ λ∂f(x*),  (y − y*) ∈ λ∂f(y*).
Since ∂f is monotone, i.e., ⟨p − q, x* − y*⟩ ≥ 0 for p ∈ ∂f(x*) and q ∈ ∂f(y*), we have
    ⟨(x − x*) − (y − y*), x* − y*⟩ ≥ 0,
which is equivalent to
    ‖x* − y*‖² ≤ ‖x − y‖² − ‖(x − x*) − (y − y*)‖².
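A numerical spot-check of the proposition on the scalar soft-thresholding operator, the prox of λ|·| (random test points, not a proof):

```python
import random

# Check ||Tx - Ty||^2 <= ||x - y||^2 - ||(x - Tx) - (y - Ty)||^2
# for T = soft-thresholding with lam = 1, at many random scalar pairs.

def soft(v, t):
    return max(abs(v) - t, 0.0) * (1 if v > 0 else -1 if v < 0 else 0)

random.seed(0)
ok = True
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    tx, ty = soft(x, 1.0), soft(y, 1.0)
    lhs = (tx - ty) ** 2
    rhs = (x - y) ** 2 - ((x - tx) - (y - ty)) ** 2
    ok = ok and (lhs <= rhs + 1e-12)
print(ok)  # True: firm nonexpansiveness holds at every sampled pair
```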

PPA convergence properties (without proofs)
Assume f is convex and a minimizer x* exists. Since prox_{λf} is firmly nonexpansive:
- x^k → x* (weakly, in an infinite-dimensional Hilbert space H)
- the above remains true subject to summable errors in computing prox_{λf}(x^k)
- fixed-point residual rate: ‖prox_{λf}(x^k) − x^k‖² = o(1/k²)
- objective rate: f(x^k) − f(x*) = o(1/k)

Assume f is strongly convex: there is C > 0 such that
    ⟨p − q, x − y⟩ ≥ C ‖x − y‖²  for all x, y and p ∈ ∂f(x), q ∈ ∂f(y).
Then prox_{λf} is a contraction:
    ‖prox_{λf}(x) − prox_{λf}(y)‖² ≤ (1/(1 + 2λC)) ‖x − y‖²,
and (assuming x* exists) thus
    ‖x^{k+1} − x*‖² ≤ (1/(1 + 2λC)) ‖x^k − x*‖².
Therefore, x^k → x* linearly.

PPA interpretations
(Destination) subgradient-descent interpretation:
    x^{k+1} = prox_{λf}(x^k)
    ⟺ x^{k+1} = (I + λ∂f)⁻¹(x^k)
    ⟺ x^k ∈ (I + λ∂f)(x^{k+1})
    ⟺ x^k ∈ x^{k+1} + λ∂f(x^{k+1})
    ⟺ x^{k+1} = x^k − λ g^{k+1}, where g^{k+1} ∈ ∂f(x^{k+1}).
Interpretation: descent using the negative subgradient at the destination.
Compare: a subgradient at the origin (the current point) is not necessarily a descent direction.

Dual interpretation: let y^{k+1} = g^{k+1} ∈ ∂f(x^{k+1}). Substituting the formula for x^{k+1}, we get
    y^{k+1} ∈ ∂f(x^k − λ y^{k+1}).
Computing prox_{λf}(x^k) is equivalent to solving for a subgradient at the descent destination. Related to the Moreau decomposition (in a later lecture).

Approximate-gradient interpretation: assume that f is twice differentiable. Then, as λ → 0,
    prox_{λf}(x) = (I + λ∇f)⁻¹(x) = x − λ∇f(x) + o(λ).
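This expansion can be observed numerically on a smooth test function (f(x) = (1/2)x², whose exact prox is x/(1 + λ); the test point is made-up):

```python
# Compare the exact prox of lam*(1/2)x^2 with the explicit gradient step
# x - lam*f'(x) = x - lam*x; the gap shrinks like lam^2, i.e. o(lam).

x = 3.0
gaps = []
for lam in (1e-1, 1e-2, 1e-3):
    exact = x / (1.0 + lam)   # prox_{lam*f}(x)
    approx = x - lam * x      # first-order expansion
    gaps.append(abs(exact - approx))
print(gaps)  # each gap is about 100x smaller than the previous one
```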

Disappearing Tikhonov-regularization interpretation:
    x^{k+1} ← prox_{λf}(x^k) = argmin_{x ∈ R^n} ( f(x) + (1/(2λ)) ‖x − x^k‖₂² ).
The second term is a regularization: x^{k+1} should stay close to x^k. The regularization effect goes away as x^k converges.

Bregman iterative regularization:
    x^{k+1} ← argmin_x ( (1/λ) D_r^p(x; x^k) + f(x) ),
where, given a proper closed convex function r and a subgradient p(x^k) ∈ ∂r(x^k),
    D_r^p(x; x^k) := r(x) − r(x^k) − ⟨p, x − x^k⟩.
PPA is the special case corresponding to setting r = (1/2)‖·‖₂².

Summary
- The proximal operator is easy to understand
- It is a standard tool for nonsmooth/constrained optimization
- It gives a fixed-point optimality condition
- PPA is more stable than gradient descent
- It sits at a high level of abstraction
- It is available in closed form for many functions

Not covered
- Usage in operator splitting
- Proximals of dual functions, computed by minimizing augmented Lagrangians
- Proximals of nonconvex functions