A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods: Lifting the Lipschitz Gradient Continuity Restriction

Marc Teboulle, School of Mathematical Sciences, Tel Aviv University
Joint work with H. Bauschke and J. Bolte

Optimization and Discrete Geometry: Theory and Practice, 24-26 April 2018, Tel Aviv University

Recall: The Basic Pillar underlying FOM

inf{ Φ(x) := f(x) + g(x) : x ∈ R^d },  f, g convex, with g ∈ C¹.

Captures many applied problems, and is the source of fundamental FOM. Usual key assumption: g admits an L-Lipschitz continuous gradient on R^d.

A simple, yet key consequence of this is the so-called descent lemma:

g(x) ≤ g(y) + ⟨∇g(y), x − y⟩ + (L/2)‖x − y‖²,  ∀ x, y ∈ R^d.

This inequality naturally provides:
1. An upper quadratic approximation of g.
2. A crucial pillar in the analysis of current FOM.

However, in many contexts and applications the differentiable function g does not have an L-Lipschitz gradient. This precludes direct use of the basic FOM methodology and schemes.

FOM Beyond Lipschitz Gradient Continuity

Goals/Outline:
- Circumvent the longstanding restriction of Lipschitz gradient continuity imposed on FOM.
- Derive FOM free of this smoothness assumption, with guaranteed complexity estimates and convergence results.
- Apply our results to a broad class of important problems lacking smooth gradients.

Main Observation: An Elementary Fact

Consider the descent lemma for smooth g ∈ C^{1,1}_L on R^d:

g(x) ≤ g(y) + ⟨x − y, ∇g(y)⟩ + (L/2)‖x − y‖²,  ∀ x, y ∈ R^d.

Simple algebra shows that it can be equivalently written as

(L/2)‖x‖² − g(x) ≥ (L/2)‖y‖² − g(y) + ⟨Ly − ∇g(y), x − y⟩,  ∀ x, y ∈ R^d.

This is nothing else but the gradient inequality for the convex function (L/2)‖x‖² − g(x)! Thus, for a given smooth function g on R^d:

Descent lemma  ⟺  (L/2)‖x‖² − g(x) is convex on R^d.

Capture the geometry of the constraint/objective: this naturally suggests replacing the squared norm with a general convex function h(·) that captures the geometry of the constraint/objective.
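The equivalence above is easy to probe numerically. A minimal sketch (not part of the talk; the softplus function and the test grid are illustrative choices) checks both sides for g(x) = log(1 + eˣ), whose gradient is Lipschitz with L = 1/4:

```python
import math

# g(x) = softplus(x); its gradient (the sigmoid) is 1/4-Lipschitz, so with
# L = 1/4 both the descent lemma and convexity of (L/2)x^2 - g(x) should hold.
L = 0.25
g = lambda x: math.log1p(math.exp(x))
dg = lambda x: 1.0 / (1.0 + math.exp(-x))   # sigmoid

def descent_lemma_holds(x, y, tol=1e-12):
    return g(x) <= g(y) + dg(y) * (x - y) + 0.5 * L * (x - y) ** 2 + tol

def q(x):
    # the function (L/2)x^2 - g(x), claimed convex by the equivalence
    return 0.5 * L * x * x - g(x)

grid = [i / 10.0 for i in range(-50, 51)]
descent_ok = all(descent_lemma_holds(x, y) for x in grid for y in grid)

# midpoint convexity of q on the grid (a necessary condition for convexity)
convex_ok = all(q(0.5 * (a + b)) <= 0.5 * (q(a) + q(b)) + 1e-12
                for a in grid for b in grid)
```

Both flags come out true on this grid, as the equivalence predicts.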

A Lipschitz-like Convexity Condition

Following our basic observation: replace ‖·‖² with a convex h. Trade the L-smooth gradient of g on R^d for a convexity condition on the couple (g, h), with dom h ⊆ dom g and g ∈ C¹(int dom h).

A Lipschitz-like/Convexity Condition (LC): ∃ L > 0 with Lh − g convex on int dom h.

Condition (LC) yields the new descent lemma we seek. It also naturally leads to the well-known Bregman distance.

A Descent Lemma without Lipschitz Gradient Continuity

Lemma (Descent lemma without Lipschitz gradient continuity). The condition (LC), Lh − g convex on int dom h, is equivalent to

g(x) ≤ g(y) + ⟨∇g(y), x − y⟩ + L·D_h(x, y),  ∀ (x, y) ∈ dom h × int dom h,

where D_h stands for the Bregman distance associated to a convex h:

D_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩,  x ∈ dom h, y ∈ int dom h.

Proof of the descent lemma: D_{Lh−g}(x, y) ≥ 0 for the convex function Lh − g!

Distance-like properties. For all (x, y) ∈ dom h × int dom h:
- x ↦ D_h(x, y) is convex when h is convex.
- D_h(x, y) ≥ 0, and D_h(x, y) = 0 iff x = y (for h strictly convex).
- However, note that D_h is in general not symmetric!

The use of Bregman distances in optimization started with Bregman (67). For initial works and main results on proximal Bregman algorithms: [Censor-Zenios (92), T. (92), Chen-T. (93), Eckstein (93), Bauschke-Borwein (97)].

Some Useful Examples of Bregman Distances D_h

Each example is a one-dimensional convex h. The corresponding function h and Bregman distance in R^d simply use the formulae h(x) = Σ_{j=1}^n h(x_j) and D_h(x, y) = Σ_{j=1}^n D_h(x_j, y_j).

Name                      | h                              | dom h
Energy                    | x²/2                           | R
Boltzmann-Shannon entropy | x log x                        | [0, ∞)
Burg's entropy            | −log x                         | (0, ∞)
Fermi-Dirac entropy       | x log x + (1 − x) log(1 − x)   | [0, 1]
Hellinger                 | −(1 − x²)^{1/2}                | [−1, 1]
Fractional power          | (px − x^p)/(1 − p), p ∈ (0, 1) | [0, ∞)

Other possible/useful kernels h include nonseparable Bregman distances (e.g., any convex h on R^d), as well as kernels for handling matrix problems: PSD matrices, cone constraints, etc. [details in Auslender and T. (2005)].
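The coordinatewise formula above is straightforward to implement. A minimal sketch (not from the talk; the kernels and test points are illustrative choices) for three kernels from the table:

```python
import math

# Coordinatewise Bregman distances D_h(x, y) = sum_j h(x_j) - h(y_j) - h'(y_j)(x_j - y_j)
# for three kernels from the table: energy, Boltzmann-Shannon, and Burg.
kernels = {
    "energy":  (lambda t: 0.5 * t * t,     lambda t: t),
    "shannon": (lambda t: t * math.log(t), lambda t: math.log(t) + 1.0),
    "burg":    (lambda t: -math.log(t),    lambda t: -1.0 / t),
}

def bregman(name, x, y):
    h, dh = kernels[name]
    return sum(h(xi) - h(yi) - dh(yi) * (xi - yi) for xi, yi in zip(x, y))

x, y = [0.5, 2.0], [1.0, 3.0]
d_energy = bregman("energy", x, y)   # equals (1/2)||x - y||^2 = 0.625
d_burg_xy = bregman("burg", x, y)
d_burg_yx = bregman("burg", y, x)    # differs from d_burg_xy: D_h is not symmetric
```

For the energy kernel the distance is the usual squared Euclidean half-distance; for Burg's entropy the two orderings give different values, illustrating the lack of symmetry noted on the previous slide.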

The Convex Model and Blanket Assumption

Our aim is to solve the composite convex problem

(P)  v(P) = inf{ Φ(x) := f(x) + g(x) : x ∈ cl(dom h) },

where cl(dom h) denotes the closure of dom h, under the following standard assumptions. The hidden h (in the unconstrained case) will adapt to the nonlinear geometry of (P).

Blanket assumption as usual:
(i) f : R^d → (−∞, ∞] is proper, lower semicontinuous (lsc) and convex;
(ii) h : R^d → (−∞, ∞] is proper, lsc and convex;
(iii) g : R^d → (−∞, ∞] is proper, lsc and convex, with dom h ⊆ dom g and g ∈ C¹(int dom h);
(iv) dom f ∩ int dom h ≠ ∅;
(v) −∞ < v(P) = inf{ Φ(x) : x ∈ cl(dom h) } = inf{ Φ(x) : x ∈ dom h }.

Algorithm NoLips for inf{ f(x) + g(x) : x ∈ C ≡ cl(dom h) }

Main algorithmic operator [reduces to the classical prox-grad when h is quadratic]:

T_λ(x) := argmin{ f(u) + g(x) + ⟨∇g(x), u − x⟩ + (1/λ) D_h(u, x) : u ∈ dom h }  (λ > 0).

NoLips main iteration: x ∈ int dom h,  x⁺ = T_λ(x)  (λ > 0).

Algorithm NoLips in more detail:
0. Input. Choose a convex function h such that there exists L > 0 with Lh − g convex on int dom h.
1. Initialization. Start with any x⁰ ∈ int dom h.
2. Recursion. For each k ≥ 1 with λ_k > 0, generate {x^k} ⊂ int dom h via

x^k = T_{λ_k}(x^{k−1}) = argmin_x { f(x) + ⟨∇g(x^{k−1}), x − x^{k−1}⟩ + (1/λ_k) D_h(x, x^{k−1}) }.
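As a sanity check on the reduction noted above, here is a minimal sketch (a hypothetical one-dimensional toy problem, not from the talk): with the quadratic kernel h = ½|x|², T_λ is exactly the classical proximal gradient step.

```python
# Toy problem: min_x f(x) + g(x) with f(x) = mu*|x| and g(x) = (1/2)(x - a)^2.
# With h = (1/2)x^2, D_h(u, x) = (1/2)(u - x)^2 and T_lambda is prox-grad.
mu, a = 0.3, 1.0
L = 1.0              # g' is 1-Lipschitz, so lambda = 1/L is an admissible step
lam = 1.0 / L

def soft(t, tau):
    # prox of tau*|.| in closed form (soft-thresholding)
    return max(abs(t) - tau, 0.0) * (1.0 if t >= 0 else -1.0)

def T(x):
    # NoLips operator with quadratic h = classical proximal gradient step
    grad = x - a
    return soft(x - lam * grad, lam * mu)

x = 5.0
for _ in range(50):
    x = T(x)
# the minimizer of (1/2)(x - a)^2 + mu*|x| is soft(a, mu) = 0.7 here
```

With λ = 1/L the iteration lands on the minimizer soft(a, μ) = 0.7 of this toy objective.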

Main Issues / Questions for NoLips
- Well-posedness and computation of T_λ(·)?
- What is the complexity of NoLips?
- Does NoLips converge to an optimal solution?
- In particular: can we identify the most aggressive step size in terms of the problem's data?

NoLips is Well Defined

We assume h is a Legendre function [Rockafellar 70]:
- h is strictly convex and differentiable on int dom h, and dom ∂h = int dom h with ∂h(x) = {∇h(x)} for all x ∈ int dom h;
- ‖∇h(x^k)‖ → ∞ whenever {x^k} ⊂ int dom h with x^k → x ∈ bdry(dom h).

With h Legendre, ∇h is a bijection from int dom h onto int dom h*, and (∇h)^{−1} = ∇h*.

Note: Legendre functions abound for defining useful D_h (all the previous examples and more). They are crucial for deriving meaningful convergence results.

Equipped with the above, one can prove (see the technical details in our paper):

Lemma (Well-posedness of the method). The proximal gradient map T_λ is single-valued and maps int dom h into int dom h.

NoLips: Decomposition of T_λ(·) into Elementary Steps

T_λ shares the same structural decomposition as the usual proximal gradient. It splits into elementary steps useful for computational purposes.

Define the Bregman gradient map

p_λ(x) := argmin{ λ⟨∇g(x), u⟩ + D_h(u, x) : u ∈ R^d } = ∇h*( ∇h(x) − λ∇g(x) ),

which clearly reduces to the usual explicit gradient step when h = ½‖·‖².

Define the proximal Bregman map

prox^h_{λf}(y) := argmin{ λf(u) + D_h(u, y) : u ∈ R^d },  y ∈ int dom h.

One can show that NoLips is the composition of these two Bregman maps:

NoLips main iteration: x ∈ int dom h,  x⁺ = prox^h_{λf}( p_λ(x) )  (λ > 0).

For specific and useful examples, see the paper.
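The Bregman gradient map is explicit once ∇h* is known. A minimal sketch (not from the talk; the step size and test points are illustrative) for the coordinatewise Boltzmann-Shannon entropy, where ∇h(t) = log t + 1 has inverse s ↦ exp(s − 1) and the map becomes a multiplicative update, alongside the quadratic kernel, where it is the plain gradient step:

```python
import math

def p_shannon(x, grad_g, lam):
    # h(t) = t log t: exp(log x_j + 1 - lam*g_j - 1) = x_j * exp(-lam * g_j),
    # the familiar multiplicative / exponentiated-gradient update
    return [xj * math.exp(-lam * gj) for xj, gj in zip(x, grad_g)]

def p_energy(x, grad_g, lam):
    # h = (1/2)||.||^2: grad h = identity, so the map is the plain gradient step
    return [xj - lam * gj for xj, gj in zip(x, grad_g)]

x = [1.0, 2.0]
grad = [0.5, -0.25]
step_shannon = p_shannon(x, grad, lam=0.1)   # stays strictly positive
step_energy = p_energy(x, grad, lam=0.1)     # [0.95, 2.025]
```

Note how the Shannon kernel keeps the iterate in the interior of the positive orthant automatically, which is exactly the role of h in NoLips.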

The Key Estimation Inequality for Analyzing NoLips

Lemma (Descent inequality for NoLips). Let λ > 0. For all x ∈ int dom h, let x⁺ := T_λ(x). Then

λ( Φ(x⁺) − Φ(u) ) ≤ D_h(u, x) − D_h(u, x⁺) − (1 − λL) D_h(x⁺, x),  ∀ u ∈ dom h.

The proof simply combines the NoLips descent lemma with known old results [Lemma 3.1 and Lemma 3.2, Chen and T. (1993)]:

1. (Three-points identity) For any x, y ∈ int dom h and u ∈ dom h:

D_h(u, y) − D_h(u, x) − D_h(x, y) = ⟨∇h(y) − ∇h(x), x − u⟩.

2. (Bregman-based proximal inequality) Given z ∈ int dom h, define u⁺ := argmin{ ϕ(u) + t⁻¹ D_h(u, z) : u ∈ X }, with ϕ convex and t > 0. Then, for any u ∈ dom h,

t( ϕ(u⁺) − ϕ(u) ) ≤ D_h(u, z) − D_h(u, u⁺) − D_h(u⁺, z).
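The three-points identity is a purely algebraic fact and can be checked numerically. A small sketch with the Burg kernel (the test points are arbitrary choices, not from the talk):

```python
import math

# Three-points identity check for h(x) = -sum_j log x_j on the positive orthant:
# D_h(u, y) - D_h(u, x) - D_h(x, y) = <grad h(y) - grad h(x), x - u>.

def dh(x):
    # gradient of the Burg kernel
    return [-1.0 / t for t in x]

def D(x, y):
    # Bregman distance D_h(x, y) for the Burg kernel, coordinatewise
    return sum(-math.log(a) + math.log(b) + (1.0 / b) * (a - b)
               for a, b in zip(x, y))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

u, x, y = [0.7, 1.3], [2.0, 0.5], [1.0, 1.0]
lhs = D(u, y) - D(u, x) - D(x, y)
rhs = dot([gy - gx for gy, gx in zip(dh(y), dh(x))],
          [a - b for a, b in zip(x, u)])
```

The two sides agree to machine precision, independently of the chosen points.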

Complexity for NoLips: O(1/k)

Theorem (NoLips: complexity).
(i) (Global estimate in function values) Let {x^k} be the sequence generated by NoLips with λ ∈ (0, 1/L]. Then

Φ(x^k) − Φ(u) ≤ L·D_h(u, x⁰)/k,  ∀ u ∈ dom h.

(ii) (Complexity for h with closed domain) Assume in addition that cl(dom h) = dom h and that (P) has at least one solution. Then, for any solution x̄ of (P),

Φ(x^k) − min Φ ≤ L·D_h(x̄, x⁰)/k ≡ C/k.

Notes:
- The entropies of Boltzmann-Shannon, Fermi-Dirac and Hellinger are nontrivial examples for which the assumption cl(dom h) = dom h holds.
- When h(x) = ½‖x‖² and g ∈ C^{1,1}_L, we thus recover the classical sublinear global rate of the usual proximal gradient method.

Does NoLips Globally Converge to a Minimizer?

Yes! By introducing an interesting notion of symmetry for D_h. Bregman distances are in general not symmetric, except when h is the energy.

Definition (Symmetry coefficient: measures lack of symmetry). Let h : R^d → (−∞, ∞] be a Legendre function. Its symmetry coefficient is defined by

α(h) := inf{ D_h(x, y)/D_h(y, x) : (x, y) ∈ int dom h × int dom h, x ≠ y }.

Properties of the symmetry coefficient α(h) ∈ [0, 1]:
- The closer α(h) is to 1, the more symmetric D_h is.
- α(h) = 1: perfect symmetry! Holds when h is the energy.
- α(h) = 0: total lack of symmetry, e.g., h(x) = x log x and h(x) = −log x.
- α(h) > 0: some symmetry, e.g., h(x) = x⁴ has α(h) = 2 − √3.

The symmetry coefficient allows us to determine the best step size of NoLips for pointwise convergence of the generated sequence {x^k}.
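A crude numerical probe of α(h) (a sketch, not from the talk; the sample grid is an arbitrary choice, and sampling only upper-bounds the infimum): for the energy the ratio is identically 1, while for Burg's entropy it can be driven toward 0 near the boundary of the domain.

```python
import math

def d_burg(x, y):
    # one-dimensional Burg Bregman distance: x/y - log(x/y) - 1
    return x / y - math.log(x / y) - 1.0

def d_energy(x, y):
    return 0.5 * (x - y) ** 2

# logarithmically spaced sample points in (0, infinity)
pts = [10.0 ** k for k in range(-6, 3)]
pairs = [(x, y) for x in pts for y in pts if x != y]

ratio_energy = {round(d_energy(x, y) / d_energy(y, x), 12) for x, y in pairs}
min_ratio_burg = min(d_burg(x, y) / d_burg(y, x) for x, y in pairs)
```

On this grid the energy ratios all equal 1 (consistent with α = 1), and the smallest Burg ratio is tiny, consistent with α = 0 for Burg's entropy.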

The Step-Size Choice λ for Global Convergence of NoLips

Defining the step size in terms of the problem's data:

0 < λ ≤ (1 + α(h) − δ)/L,  for some δ ∈ (0, 1 + α(h)),

where α(h) ∈ [0, 1] is the symmetry coefficient of h, and L > 0 is the constant in condition (LC), Lh − g convex.

When h(·) := ‖·‖²/2, then α(h) = 1, L is the usual Lipschitz constant of ∇g, and the above reduces to 0 < λ ≤ (2 − δ)/L, recovering the classical step size allowed for pointwise convergence of the classical proximal gradient method [Combettes-Wajs 05].

With the above step-size choice, we can establish global convergence of the sequence {x^k} generated by NoLips.

Pointwise Convergence for NoLips

Theorem (NoLips: pointwise convergence, with λ ∈ (0, (1 + α(h))/L)). Assume that the solution set S of (P) is nonempty. Then the following holds:
(i) (Subsequential convergence) If S is compact, any limit point of {x^k} is a solution of (P).
(ii) (Global convergence) Assume that cl(dom h) = dom h and that (H) is satisfied. Then the sequence {x^k} converges to some solution x* of (P).

Note: nontrivial examples: the Boltzmann-Shannon, Fermi-Dirac and Hellinger entropies satisfy the set of assumptions in (H), with cl(dom h) = dom h. The additional assumption on D_h ensures separation properties of D_h at the boundary.

Assumption (H):
(i) For every x ∈ dom h and β ∈ R, the level set { y ∈ int dom h : D_h(x, y) ≤ β } is bounded.
(ii) If {x^k} converges to some x in dom h, then D_h(x, x^k) → 0.
(iii) Reciprocally, if x is in dom h and {x^k} is such that D_h(x, x^k) → 0, then x^k → x.

Applications. A Prototype: Linear Inverse Problems with Poisson Noise

A very large class of problems arising in the statistical and image sciences: inverse problems where data measurements are collected by counting discrete events (e.g., photons, electrons), contaminated by noise described by a Poisson process. Huge amount of literature: astronomy, nuclear medicine (PET), electron microscopy, statistical estimation (EM), image deconvolution, denoising of speckle (multiplicative) noise, etc.

Problem: given a matrix A ∈ R^{m×n}_+ and b ∈ R^m_{++}, the goal is to reconstruct the signal/image x ∈ R^n_+ from the noisy measurements b such that Ax ≈ b.

A natural proximity measure in R^n_+ is the Kullback-Leibler divergence

D(b, Ax) := Σ_{i=1}^m { b_i log( b_i/(Ax)_i ) + (Ax)_i − b_i },

which (up to some constant) is the negative Poisson log-likelihood function.

The optimization problem:

(E)  minimize { μ f(x) + g(x) : x ∈ R^n_+ },  g(x) ≡ D(b, Ax),

with f a regularizer, smooth or nonsmooth, and μ > 0. The function x ↦ D(b, Ax) is convex, but does not admit a globally Lipschitz continuous gradient.

NoLips in Action: New Simple Schemes for Many Problems

The optimization problem will be of the form

(E)  min_x { μ f(x) + D_φ(b, Ax) }  or  min_x { μ f(x) + D_φ(Ax, b) },

where g(x) := D_φ(b, Ax) for some convex φ, and f(x) is some convex regularizer.

Applying NoLips requires:
1. Picking an adequate h, so that Lh − g is convex, with L expressed in terms of the problem's data.
2. This in turn determines the step size λ, defined through (L, α(h)).
3. Computing p_λ(·) and prox^h_{λf}(·), the Bregman gradient and proximal steps.

Our convergence/complexity results hold and produce new simple algorithms: simple schemes via an explicit map M_j(·):

x > 0,  x⁺_j = M_j(b, A, x; μ, λ) x_j,  j = 1, …, n.

Two Simple Algorithms for Poisson Linear Inverse Problems

Given g(x) := D_φ(b, Ax) (with φ(u) = u log u), to apply NoLips:
- We take h(x) = −Σ_{j=1}^n log x_j (Burg's entropy), dom h = R^n_{++}.
- We need to find L > 0 such that Lh − g is convex on R^n_{++}.

Lemma. With (g, h) as above, Lh − g is convex on R^n_{++} for any L ≥ ‖b‖₁ := Σ_{i=1}^m b_i.

Thus we can take λ = L⁻¹ = ‖b‖₁⁻¹, and applying NoLips with x ∈ R^n_{++} reads

x⁺ = argmin{ μ f(u) + ⟨∇g(x), u⟩ + ‖b‖₁ Σ_{j=1}^n ( u_j/x_j − log(u_j/x_j) − 1 ) : u > 0 }.

The above yields closed-form algorithms for Poisson reconstruction problems with two typical regularizers.
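The lemma can be probed numerically on a toy random instance (a sketch, not from the talk; the dimensions and sampling ranges are arbitrary choices): with L = ‖b‖₁, the generalized descent lemma g(x) ≤ g(y) + ⟨∇g(y), x − y⟩ + L·D_h(x, y) should hold at random positive points.

```python
import math, random

random.seed(0)
m, n = 4, 3
A = [[random.uniform(0.1, 1.0) for _ in range(n)] for _ in range(m)]
b = [random.uniform(0.5, 2.0) for _ in range(m)]
L = sum(b)   # L = ||b||_1, which the lemma says suffices

def Ax(x):
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]

def g(x):
    # g(x) = D_KL(b, Ax) = sum_i b_i log(b_i/(Ax)_i) + (Ax)_i - b_i
    ax = Ax(x)
    return sum(bi * math.log(bi / ai) + ai - bi for bi, ai in zip(b, ax))

def grad_g(x):
    # partial_j g = sum_i a_ij (1 - b_i/(Ax)_i)
    ax = Ax(x)
    return [sum(A[i][j] * (1.0 - b[i] / ax[i]) for i in range(m))
            for j in range(n)]

def D_burg(x, y):
    return sum(xj / yj - math.log(xj / yj) - 1.0 for xj, yj in zip(x, y))

def lemma_gap(x, y):
    # g(y) + <grad g(y), x - y> + L*D_h(x, y) - g(x), should be >= 0
    gy = grad_g(y)
    lin = sum(gj * (xj - yj) for gj, xj, yj in zip(gy, x, y))
    return g(y) + lin + L * D_burg(x, y) - g(x)

gaps = [lemma_gap([random.uniform(0.05, 5.0) for _ in range(n)],
                  [random.uniform(0.05, 5.0) for _ in range(n)])
        for _ in range(200)]
```

All sampled gaps come out nonnegative, as the lemma guarantees.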

Example 1: Sparse Poisson Linear Inverse Problems

Sparse regularization. Let f(x) := ‖x‖₁, known to promote sparsity, with regularization weight μ > 0. Define

c_j(x) := Σ_{i=1}^m b_i a_{ij} / ⟨a_i, x⟩,  r_j := Σ_i a_{ij} > 0.

NoLips for sparse Poisson linear inverse problems:

x_j > 0,  x⁺_j = ‖b‖₁ x_j / ( ‖b‖₁ + μ x_j + x_j ( r_j − c_j(x) ) ),  j = 1, …, n.

Special case: μ = 0, where (E) is Poisson maximum likelihood estimation. NoLips then yields a new scheme for Poisson MLE:

x_j > 0,  x⁺_j = ‖b‖₁ x_j / ( ‖b‖₁ + x_j ( r_j − c_j(x) ) ),  j = 1, …, n.
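A sketch of this closed-form update on a toy random instance (the data, μ, and iteration count are illustrative choices, not from the talk). By the key estimation inequality with λ = 1/‖b‖₁, the objective should decrease monotonically and the iterates stay strictly positive:

```python
import math, random

random.seed(1)
m, n = 5, 4
A = [[random.uniform(0.1, 1.0) for _ in range(n)] for _ in range(m)]
b = [random.uniform(0.5, 2.0) for _ in range(m)]
mu = 0.1
b1 = sum(b)          # ||b||_1; the step size is lambda = 1/b1

def Ax(x):
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]

def objective(x):
    # mu*||x||_1 + D_KL(b, Ax); since x > 0, ||x||_1 = sum(x)
    ax = Ax(x)
    kl = sum(bi * math.log(bi / ai) + ai - bi for bi, ai in zip(b, ax))
    return mu * sum(x) + kl

def nolips_step(x):
    # the closed-form update from the slide, coordinate by coordinate
    ax = Ax(x)
    new = []
    for j in range(n):
        c_j = sum(b[i] * A[i][j] / ax[i] for i in range(m))
        r_j = sum(A[i][j] for i in range(m))
        new.append(b1 * x[j] / (b1 + mu * x[j] + x[j] * (r_j - c_j)))
    return new

x = [1.0] * n
values = [objective(x)]
for _ in range(100):
    x = nolips_step(x)
    values.append(objective(x))
```

Setting mu = 0 turns the same loop into the Poisson MLE scheme from the slide.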

Example 2: Tikhonov Regularized Poisson Linear Inverse Problems

Tikhonov regularization. Let f(x) := ‖x‖²/2, with weight μ. Recall that this term is used as a penalty in order to promote solutions of Ax ≈ b with small Euclidean norms.

Using the previous notation, NoLips yields a Poisson-Tikhonov method: set λ = ‖b‖₁⁻¹, start with x ∈ R^n_{++}, and update

x⁺_j = ( √( ρ_j(x)² + 4μλ x_j² ) − ρ_j(x) ) / ( 2μλ x_j ),  j = 1, …, n,

where ρ_j(x) := 1 + λ x_j ( r_j − c_j(x) ),  j = 1, …, n.

As just mentioned, many other interesting methods can be derived:
- by choosing different kernels φ, or
- by reversing the order of the arguments in the proximity measure (which is not symmetric, hence defining different problems; see the paper).
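The same kind of sketch for the Poisson-Tikhonov update (toy random instance; μ and the iteration count are arbitrary illustrative choices). The update is the positive root of λμx_j u² + ρ_j(x) u − x_j = 0, so positivity of the iterates is automatic:

```python
import math, random

random.seed(2)
m, n = 5, 4
A = [[random.uniform(0.1, 1.0) for _ in range(n)] for _ in range(m)]
b = [random.uniform(0.5, 2.0) for _ in range(m)]
mu = 0.5
lam = 1.0 / sum(b)          # lambda = 1/||b||_1

def Ax(x):
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]

def objective(x):
    # (mu/2)||x||^2 + D_KL(b, Ax)
    ax = Ax(x)
    kl = sum(bi * math.log(bi / ai) + ai - bi for bi, ai in zip(b, ax))
    return 0.5 * mu * sum(t * t for t in x) + kl

def step(x):
    # closed-form Poisson-Tikhonov update from the slide
    ax = Ax(x)
    out = []
    for j in range(n):
        c_j = sum(b[i] * A[i][j] / ax[i] for i in range(m))
        r_j = sum(A[i][j] for i in range(m))
        rho = 1.0 + lam * x[j] * (r_j - c_j)
        out.append((math.sqrt(rho * rho + 4.0 * mu * lam * x[j] ** 2) - rho)
                   / (2.0 * mu * lam * x[j]))
    return out

x = [1.0] * n
vals = [objective(x)]
for _ in range(100):
    x = step(x)
    vals.append(objective(x))
```

As with the sparse example, the NoLips theory with λ = 1/L guarantees monotone decrease of the objective along the iterates.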

Conclusion and More Details/Results on NoLips

The proposed framework offers a new paradigm for FOM:
- It removes the longstanding requirement of an L-smooth gradient.
- Proven complexity and pointwise convergence, as in the classical case.
- It allows deriving new FOM without Lipschitz gradient continuity.

Details and more results:
Bauschke H., Bolte J., and Teboulle M., A Descent Lemma beyond Lipschitz Gradient Continuity: First Order Methods Revisited and Applications. Mathematics of Operations Research (2017), 330-348. Available online: http://dx.doi.org/10.1287/moor.2016.0817

THANK YOU FOR LISTENING!

NoLips (red) versus a fast NoLips (blue)... [concluding plot slide]