Perturbed Proximal Gradient Algorithm


Perturbed Proximal Gradient Algorithm
Gersende FORT, LTCI, CNRS, Telecom ParisTech, Université Paris-Saclay, 75013 Paris, France
Large-scale inverse problems and optimization: applications to image processing and astrophysics
Grenoble, November 2015

Introduction

Works in collaboration with Eric Moulines (Professor, Ecole Polytechnique) and Yves Atchadé (Assistant Professor, Univ. Michigan, USA); and also with Jean-Francois Aujol (IMB, Univ. Bordeaux), Charles Dossal (IMB, Univ. Bordeaux) and Soukaina Douissi.
Reference: Y. Atchadé, G. Fort and E. Moulines. On Stochastic Proximal Gradient Algorithms. arXiv:1402.2365 [math.ST].

Introduction / Outline

- Introduction: optimization problem; proximal-gradient algorithm; intractable proximal-gradient iteration; perturbed proximal gradient
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress

Introduction / Optimization problem

Problem: $\operatorname{argmin}_{\theta \in \Theta} F(\theta)$ with $F(\theta) = f(\theta) + g(\theta)$, where
- $\Theta$ is a finite-dimensional Euclidean space with scalar product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$;
- the function $f: \Theta \to \mathbb{R}$ is smooth, i.e. $f$ is continuously differentiable and there exists $L > 0$ such that $\|\nabla f(\theta) - \nabla f(\theta')\| \le L \|\theta - \theta'\|$;
- the function $g: \Theta \to (-\infty, +\infty]$ is convex, not identically equal to $+\infty$, and lower semi-continuous;
in the case where $f(\theta)$ and $\nabla f(\theta)$ are intractable.

Introduction / Proximal-Gradient algorithm

Classical algorithm when $\nabla f$ tractable (1/3). Since $\nabla f$ is Lipschitz, for any $u, \theta$,
$$f(\theta) \le f(u) + \langle \nabla f(u), \theta - u \rangle + \frac{L}{2} \|\theta - u\|^2,$$
which yields, for any $\gamma \le L^{-1}$,
$$F(\theta) \le f(u) + \langle \nabla f(u), \theta - u \rangle + \frac{1}{2\gamma} \|\theta - u\|^2 + g(\theta).$$
[Figure: $F(\theta)$ and its quadratic upper bound built at the point $u$.]
The RHS satisfies:
- for fixed $u$, it is an upper bound of $\theta \mapsto F(\theta)$; for $\theta = u$, this upper bound is equal to $F(u)$;
- for fixed $u$, it is convex in $\theta$ and can be written $C(u) + \frac{1}{2\gamma} \|\theta - \{u - \gamma \nabla f(u)\}\|^2 + g(\theta)$.

Introduction / Proximal-Gradient algorithm

Classical algorithm when $\nabla f$ tractable (2/3). Denote the upper bound by
$$Q_\gamma(\theta \mid u) \stackrel{\text{def}}{=} C(u) + \frac{1}{2\gamma} \|\theta - \{u - \gamma \nabla f(u)\}\|^2 + g(\theta).$$
Majorization-Minimization (MM) algorithm: define $\{\theta_n, n \ge 0\}$ iteratively by
$$\theta_{n+1} = \operatorname{argmin}_\theta Q_\gamma(\theta \mid \theta_n),$$
or equivalently
$$\theta_{n+1} = \operatorname{Prox}_\gamma(\theta_n - \gamma \nabla f(\theta_n)) \quad \text{with} \quad \operatorname{Prox}_\gamma(\tau) \stackrel{\text{def}}{=} \operatorname{argmin}_\theta \left\{ g(\theta) + \frac{1}{2\gamma} \|\theta - \tau\|^2 \right\},$$
also called the Proximal-Gradient algorithm.

Introduction / Proximal-Gradient algorithm

Classical algorithm when $\nabla f$ tractable (3/3). The sequence $\{\theta_n, n \ge 0\}$ is given by $\theta_{n+1} = \operatorname{argmin}_\theta Q_\gamma(\theta \mid \theta_n)$, where the upper bound $\theta \mapsto Q_\gamma(\theta \mid u)$ satisfies
$$F(\theta) \le Q_\gamma(\theta \mid u), \qquad F(u) = Q_\gamma(u \mid u) \qquad \text{(Lyapunov function)}.$$
[Figure: $F(\theta)$, the majorizer $Q_\gamma(\theta \mid \theta_n)$, and the points $\theta_n$, $\theta_{n+1}$.]
Hence $F(\theta_{n+1}) \le F(\theta_n)$, since $F(\theta_{n+1}) \le Q_\gamma(\theta_{n+1} \mid \theta_n) \le Q_\gamma(\theta_n \mid \theta_n) = F(\theta_n)$.
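As a concrete illustration of the MM iteration above, here is a minimal NumPy sketch on a toy lasso-type problem: $f(\theta) = \frac{1}{2}\|A\theta - b\|^2$ and $g = \lambda \|\cdot\|_1$. The data `A`, `b` and the weight `lam` are illustrative choices, not from the talk; the final assertion checks the Lyapunov property $F(\theta_{n+1}) \le F(\theta_n)$.

```python
import numpy as np

# Toy composite objective F = f + g (illustrative, not from the talk):
#   f(theta) = 0.5 * ||A theta - b||^2   (smooth, grad f is L-Lipschitz, L = ||A^T A||)
#   g(theta) = lam * ||theta||_1         (convex, non-smooth)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1

def grad_f(theta):
    return A.T @ (A @ theta - b)

def F(theta):
    r = A @ theta - b
    return 0.5 * r @ r + lam * np.abs(theta).sum()

def prox_g(tau, gamma):
    # Prox_gamma(tau) = argmin_theta { g(theta) + ||theta - tau||^2 / (2 gamma) },
    # which for g = lam * l1 is coordinate-wise soft-thresholding.
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f
gamma = 1.0 / L                  # step size in (0, 1/L]

theta = np.zeros(5)
values = [F(theta)]
for _ in range(200):
    # one MM / proximal-gradient step
    theta = prox_g(theta - gamma * grad_f(theta), gamma)
    values.append(F(theta))

# Majorization-minimization guarantees monotone decrease of F.
assert all(v1 <= v0 + 1e-12 for v0, v1 in zip(values, values[1:]))
```

The monotone decrease is exactly the chain $F(\theta_{n+1}) \le Q_\gamma(\theta_{n+1} \mid \theta_n) \le Q_\gamma(\theta_n \mid \theta_n) = F(\theta_n)$ from the slide.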

Introduction / Intractable proximal-gradient iteration

The exact proximal-gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n)),$$
where $\{\gamma_n, n \ge 0\}$ is a step-size sequence in $(0, 1/L]$.
1. $\operatorname{Prox}_\gamma(u)$ can be intractable (not in this talk).
2. $\nabla f$ can be intractable (in this talk).

Introduction / Intractable proximal-gradient iteration

Explicit proximal operators:
(Projection on $C$) When $g(\theta) = 0$ if $\theta \in C$ and $+\infty$ otherwise, where $C$ is closed convex, $\operatorname{Prox}_\gamma(\tau) = \operatorname{argmin}_{\theta \in C} \|\tau - \theta\|^2$, the projection of $\tau$ onto $C$.
(Elastic net penalty) $g(\theta) = \lambda \left( \frac{1-\alpha}{2} \|\theta\|_2^2 + \alpha \|\theta\|_1 \right)$:
$$(\operatorname{Prox}_\gamma(\tau))_i = \frac{1}{1 + \gamma\lambda(1-\alpha)} \times \begin{cases} \tau_i - \gamma\lambda\alpha & \text{if } \tau_i \ge \gamma\lambda\alpha, \\ \tau_i + \gamma\lambda\alpha & \text{if } \tau_i \le -\gamma\lambda\alpha, \\ 0 & \text{otherwise.} \end{cases}$$
Here the proximal-gradient algorithm is a thresholded gradient algorithm.
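The elastic-net proximal operator above is easy to code coordinate-wise: soft-threshold at level $\gamma\lambda\alpha$, then shrink by $1/(1+\gamma\lambda(1-\alpha))$. A sketch (the parameter names `gamma`, `lam`, `alpha` mirror $\gamma, \lambda, \alpha$; the test values are illustrative):

```python
import numpy as np

def prox_elastic_net(tau, gamma, lam, alpha):
    """Proximal operator of g = lam * ((1-alpha)/2 ||.||_2^2 + alpha ||.||_1):
    soft-threshold at gamma*lam*alpha, then shrink by 1/(1 + gamma*lam*(1-alpha))."""
    shrunk = np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam * alpha, 0.0)
    return shrunk / (1.0 + gamma * lam * (1.0 - alpha))

tau = np.array([-2.0, -0.05, 0.0, 0.05, 2.0])
gamma, lam = 0.5, 1.0

# alpha = 1: pure l1 penalty, reduces to plain soft-thresholding at gamma*lam.
assert np.allclose(prox_elastic_net(tau, gamma, lam, alpha=1.0),
                   [-1.5, 0.0, 0.0, 0.0, 1.5])

# alpha = 0: pure ridge penalty, reduces to a plain rescaling of tau.
assert np.allclose(prox_elastic_net(tau, gamma, lam, alpha=0.0), tau / 1.5)
```

The two limit cases ($\alpha = 1$: lasso, $\alpha = 0$: ridge) confirm that the closed form matches the display above.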

Introduction / Intractable proximal-gradient iteration

Intractable gradient $\nabla f$:
1. Unknown function $f$ whose gradient is of the form $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx)$. In this case
$$\nabla f(\theta) \approx \frac{1}{m} \sum_{k=1}^m H_\theta(X_k), \qquad \{X_k, k \ge 1\}:$$
(online) learning, Markov chain Monte Carlo.
2. Large-scale optimization: $f(\theta) = \frac{1}{N} \sum_{k=1}^N f_k(\theta)$ with large $N$. In this case
$$\nabla f(\theta) = \frac{1}{N} \sum_{k=1}^N \nabla f_k(\theta) \approx \frac{1}{m} \sum_{k=1}^m \nabla f_{I_k}(\theta).$$

Introduction / Perturbed Proximal Gradient

In this talk. The exact proximal-gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n)).$$
The perturbed proximal-gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \{\nabla f(\theta_n) + \eta_{n+1}\}).$$
1. Which conditions on $\gamma_n, \eta_n$ ensure convergence to the same limiting set as for the exact algorithm?
2. When $\eta_n$ is a (random) Monte Carlo approximation, which conditions on $\gamma_n, m_n$?
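A hedged numerical sketch of the perturbed iteration: the same kind of toy lasso problem as before, with an artificial perturbation $\eta_n$ of decaying size injected into the gradient. Under such decay the perturbed run ends up near the exact one, which is the behaviour the convergence conditions are after. All problem data and the `1/n` decay rate are illustrative choices, not from the talk.

```python
import numpy as np

# Toy lasso data (illustrative)
data_rng = np.random.default_rng(1)
A = data_rng.standard_normal((30, 4))
b = data_rng.standard_normal(30)
lam = 0.05
L = np.linalg.norm(A.T @ A, 2)

def prox_l1(tau, gamma):
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def run(noise_scale, n_iter=2000, seed=2):
    """Perturbed proximal-gradient: the gradient is corrupted by a noise
    eta_n whose magnitude decays like 1/n (noise_scale = 0 gives the exact
    algorithm)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(4)
    gamma = 1.0 / L
    for n in range(1, n_iter + 1):
        eta = noise_scale * rng.standard_normal(4) / n   # decaying perturbation
        theta = prox_l1(theta - gamma * (A.T @ (A @ theta - b) + eta), gamma)
    return theta

exact = run(0.0)
perturbed = run(1.0)
# With sufficiently fast-decaying perturbations, both runs reach
# (nearly) the same limit point.
assert np.linalg.norm(exact - perturbed) < 1e-2
```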

Convergence of the (stable) perturbed proximal-gradient algorithm / Outline

- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm: assumptions; on the convergence of $\{\theta_n, n \ge 0\}$; on the convergence of $F(\bar\theta_n)$
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress

Convergence of the (stable) perturbed proximal-gradient algorithm / Assumptions

Assumptions for $\operatorname{argmin}_{\theta \in \Theta} F(\theta)$, $F(\theta) = f(\theta) + g(\theta)$:
1. The function $g: \Theta \to (-\infty, +\infty]$ is convex, not identically equal to $+\infty$, and lower semi-continuous.
2. The function $f: \Theta \to \mathbb{R}$ is continuously differentiable and there exists $L > 0$ such that $\|\nabla f(\theta) - \nabla f(\theta')\| \le L \|\theta - \theta'\|$.
3. The function $f$ is convex and the set $\mathcal{L} \stackrel{\text{def}}{=} \operatorname{argmin}_\theta F(\theta)$ is not empty.
4. The step sizes $\{\gamma_n, n \ge 0\}$ are positive with $\gamma_n \in (0, 1/L]$.

Convergence of the (stable) perturbed proximal-gradient algorithm / Assumptions

The algorithm. Stable sequence: let $K \subseteq \operatorname{int}(\operatorname{dom}(g))$ be a compact subset of $\Theta$ such that $K \cap \mathcal{L} \neq \emptyset$. Algorithm:
$$\tilde\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n) - \gamma_{n+1} \eta_{n+1}), \qquad \theta_{n+1} = \operatorname{Proj}_K(\tilde\theta_{n+1}).$$
Weighted average sequence: let $\{a_n, n \ge 0\}$ be a non-negative sequence and set
$$\bar\theta_n = \frac{1}{\sum_{k=1}^n a_k} \sum_{k=1}^n a_k \theta_k.$$
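The stabilisation device can be sketched in a few lines, taking for $K$ an $\ell_2$ ball of arbitrary radius `R` (an illustrative choice of compact set; the helper names `proj_K` and `weighted_average` are mine, not from the talk):

```python
import numpy as np

R = 10.0  # radius of the illustrative compact set K = {theta : ||theta|| <= R}

def proj_K(theta):
    """Projection onto the l2 ball of radius R: rescale iterates that escape K."""
    nrm = np.linalg.norm(theta)
    return theta if nrm <= R else theta * (R / nrm)

def weighted_average(thetas, a):
    """Weighted average theta_bar_n = (sum a_k)^{-1} sum a_k theta_k."""
    a = np.asarray(a, dtype=float)
    return (a[:, None] * np.asarray(thetas)).sum(axis=0) / a.sum()

# Small sanity checks
assert np.allclose(proj_K(np.array([30.0, 40.0])), [6.0, 8.0])
assert np.allclose(weighted_average([np.array([1.0, 0.0]),
                                     np.array([3.0, 0.0])], [1, 1]), [2.0, 0.0])
```

In the algorithm, each perturbed prox-gradient output $\tilde\theta_{n+1}$ would be passed through `proj_K`, and `weighted_average` applied to the history of projected iterates with the weights $a_k$.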

Convergence of the (stable) perturbed proximal-gradient algorithm / On the convergence of $\{\theta_n, n \ge 0\}$

Convergence of $\{\theta_n, n \ge 0\}$:
$$\tilde\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n) - \gamma_{n+1} \eta_{n+1}), \qquad \theta_{n+1} = \operatorname{Proj}_K(\tilde\theta_{n+1}).$$
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 4, $\sum_n \gamma_n = +\infty$, and that the series
$$\sum_n \gamma_{n+1} \eta_{n+1}, \qquad \sum_n \gamma_{n+1} \langle T_{\gamma_{n+1}}(\theta_n), \eta_{n+1} \rangle, \qquad \sum_n \gamma_{n+1}^2 \|\eta_{n+1}\|^2$$
converge, where $T_\gamma(\theta) = \operatorname{Prox}_\gamma(\theta - \gamma \nabla f(\theta))$. Then there exists $\theta^\star \in \mathcal{L} \cap K$ such that $\lim_n \theta_n = \lim_n \tilde\theta_n = \theta^\star$.
Includes the convergence analysis for the exact algorithm ($\eta_n = 0$), Beck and Teboulle (2009); improves previous results, Combettes and Wajs (2005); Combettes and Pesquet (2014).

Convergence of the (stable) perturbed proximal-gradient algorithm / On the convergence of $F(\bar\theta_n)$

Rates of convergence for $\{F(\bar\theta_n), n \ge 0\}$:
$$\tilde\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n) - \gamma_{n+1} \eta_{n+1}), \qquad \theta_{n+1} = \operatorname{Proj}_K(\tilde\theta_{n+1}).$$
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 4. For any $a_k \ge 0$,
$$\sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \le U_n,$$
with
$$U_n \stackrel{\text{def}}{=} \frac{1}{2} \sum_{k=1}^n \left( \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \right) \|\theta_{k-1} - \theta^\star\|^2 + \frac{a_0}{2\gamma_0} \|\theta_0 - \theta^\star\|^2 - \sum_{k=1}^n a_k \langle T_{\gamma_k}(\theta_{k-1}) - \theta^\star, \eta_k \rangle + \sum_{k=1}^n a_k \gamma_k \|\eta_k\|^2.$$
Includes the convergence analysis for the exact algorithm ($\eta_{n+1} = 0$); extends previous results in the case $\gamma_n = \gamma$, $a_n = 1$, Schmidt, Le Roux, Bach (2011), where it is assumed $\sum_n \|\eta_n\| < \infty$.

Convergence of the Monte Carlo proximal-gradient algorithm / Outline

- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm: Monte Carlo approximation; additional assumptions; convergence of $\theta_n$; convergence of $F(\bar\theta_n)$; how to choose $\gamma_n, m_n$?
- Conclusion, other results and works in progress

Convergence of the Monte Carlo proximal-gradient algorithm / Monte Carlo Approximation

Monte Carlo approximation of the gradient. Assume that $\nabla f(\theta)$ is of the form $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx)$. Consider a Monte Carlo perturbation
$$\eta_{n+1} = \frac{1}{m_{n+1}} \sum_{k=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(k)}) - \nabla f(\theta_n),$$
which includes the cases:
1. $\{X_{n+1}^{(1)}, \dots, X_{n+1}^{(m_{n+1})}\}$ are i.i.d. with distribution $\pi_{\theta_n}$: $\mathbb{E}[\eta_{n+1} \mid \text{Past}_n] = 0$ (unbiased approximation).
2. $\{X_{n+1}^{(1)}, \dots, X_{n+1}^{(m_{n+1})}\}$ is a non-stationary Markov chain (e.g. an MCMC path) with invariant distribution $\pi_{\theta_n}$: $\mathbb{E}[\eta_{n+1} \mid \text{Past}_n] \neq 0$ (biased approximation).
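The unbiased (i.i.d.) case can be checked numerically on a toy model where everything is explicit: take $\pi_\theta = \mathcal{N}(\theta, 1)$ and $H_\theta(x) = x$, so that $\nabla f(\theta) = \int x\, \pi_\theta(dx) = \theta$ and $\eta_{n+1}$ is a centred sample-mean error. This model is an illustrative assumption of mine, not from the talk.

```python
import numpy as np

# Toy model (illustrative): pi_theta = N(theta, 1), H_theta(x) = x,
# so grad f(theta) = theta exactly, and the Monte Carlo error is
#   eta = (1/m) sum_k X_k - theta,  X_k iid ~ N(theta, 1).
rng = np.random.default_rng(3)
theta = 1.5
m = 100

def mc_grad(m):
    """Monte Carlo estimate (1/m) sum H_theta(X_k) of grad f(theta)."""
    x = rng.normal(loc=theta, scale=1.0, size=m)   # iid draws from pi_theta
    return x.mean()

etas = np.array([mc_grad(m) - theta for _ in range(2000)])

# Unbiased case: E[eta | past] = 0, and Var(eta) scales like 1/m.
assert abs(etas.mean()) < 0.02
assert abs(etas.var() * m - 1.0) < 0.2
```

In the biased case the $X^{(k)}$ would instead come from an MCMC chain targeting $\pi_{\theta_n}$ started out of stationarity, so the conditional mean of `etas` would no longer vanish.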

Convergence of the Monte Carlo proximal-gradient algorithm / Additional assumptions

Additional assumptions:
5. The error is of the form $\eta_{n+1} = \frac{1}{m_{n+1}} \sum_{k=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(k)}) - \nabla f(\theta_n)$, where $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx)$.
6. $\{X_{n+1}^{(k)}, k \ge 0\}$ is a Markov chain with transition kernel $P_{\theta_n}$; for all $\theta$, $\pi_\theta$ is invariant for $P_\theta$.
7. The kernels $\{P_\theta, \theta \in \Theta\}$ are geometrically ergodic uniformly in $\theta$ (aperiodic, phi-irreducible, uniform-in-$\theta$ geometric drift inequalities w.r.t. $W^p$ where $p \ge 2$, level sets of $W^p$ are small): there exists $p \ge 2$ such that for any $l \in (0, p]$, there exist $C > 0$, $\rho \in (0, 1)$ s.t.
$$\sup_{\theta \in K} \|P_\theta^n(x, \cdot) - \pi_\theta\|_{W^l} \le C \rho^n W^l(x).$$
This condition is trivial in the i.i.d. case; there exist many sufficient conditions for the Markov case when samples are drawn from MCMC samplers.

Convergence of the Monte Carlo proximal-gradient algorithm / Convergence of $\theta_n$

Convergence of $\theta_n$ when $m_n \to \infty$:
$$\eta_{n+1} = \frac{1}{m_{n+1}} \sum_{k=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(k)}) - \nabla f(\theta_n).$$
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7, $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_{n+1}^2 m_{n+1}^{-1} < \infty$. If the approximation is biased, assume also $\sum_n \gamma_{n+1} m_{n+1}^{-1} < \infty$. Then, with probability one, there exists $\theta^\star \in \mathcal{L} \cap K$ such that $\lim_n \theta_n = \lim_n \tilde\theta_n = \theta^\star$.
The key ingredient of the proof is the control (F. and Moulines (2003)), for $p \ge 2$, w.p. 1:
$$\|\mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]\| \le C\, m_{n+1}^{-1} W(X_n^{(m_n)}), \qquad \mathbb{E}[\|\eta_{n+1}\|^p \mid \mathcal{F}_n] \le C\, m_{n+1}^{-p/2} W^p(X_n^{(m_n)}),$$
and the decomposition
$$\eta_{n+1} = \underbrace{\eta_{n+1} - \mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]}_{\text{martingale increment}} + \underbrace{\mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]}_{\text{bias}}.$$

Convergence of the Monte Carlo proximal-gradient algorithm / Convergence of $\theta_n$

Convergence of $\theta_n$ when $m_n = m$:
$$\eta_{n+1} = \frac{1}{m} \sum_{k=1}^{m} H_{\theta_n}(X_{n+1}^{(k)}) - \nabla f(\theta_n).$$
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7, $\sum_n \gamma_{n+1} = +\infty$ and $\sum_n \gamma_{n+1}^2 < \infty$. If the approximation is biased, assume also:
- there exists a constant $C$ such that for any $\theta, \theta' \in K$, $\|H_\theta - H_{\theta'}\|_W + \|P_\theta - P_{\theta'}\|_W + \|\pi_\theta - \pi_{\theta'}\|_W \le C \|\theta - \theta'\|$;
- $\sup_{\gamma \in (0, 1/L]} \sup_{\theta \in K} \gamma^{-1} \|\operatorname{Prox}_\gamma(\theta) - \theta\| < \infty$;
- $\sum_n |\gamma_{n+1} - \gamma_n| < \infty$.
Then, with probability one, there exists $\theta^\star \in \mathcal{L} \cap K$ such that $\lim_n \theta_n = \lim_n \tilde\theta_n = \theta^\star$.

Convergence of the Monte Carlo proximal-gradient algorithm / Convergence of $F(\bar\theta_n)$

Convergence of $F(\bar\theta_n)$ when $m_n \to \infty$. Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7. For any $q \in (1, p/2]$, there exists $C > 0$ s.t.
$$\Big\| \sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \Big\|_{L_q} \le C \left( \frac{a_0}{\gamma_0} + \sum_{k=1}^n \Big| \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big| + \Big( \sum_{k=1}^n a_k^2 m_{k+1}^{-1} \Big)^{1/2} + \sum_{k=1}^n a_k (\gamma_k + \upsilon) m_{k+1}^{-1} \right)$$
and
$$\sum_{k=1}^n a_k \{\mathbb{E}[F(\theta_k)] - \min F\} \le C \left( \frac{a_0}{\gamma_0} + \sum_{k=1}^n \Big| \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big| + \sum_{k=1}^n a_k (\gamma_k + \upsilon) m_{k+1}^{-1} \right),$$
where $\upsilon = 0$ if the Monte Carlo approximation is unbiased and $\upsilon = 1$ otherwise.

Convergence of the Monte Carlo proximal-gradient algorithm / Convergence of $F(\bar\theta_n)$

Convergence of $F(\bar\theta_n)$ when $m_n = m$. Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7. For any $q \in (1, p/2]$, there exists $C > 0$ s.t.
$$\Big\| \sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \Big\|_{L_q} \le C \left( \frac{a_0}{\gamma_0} + \sum_{k=1}^n \Big| \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big| + \Big( \sum_{k=1}^n a_k^2 \Big)^{1/2} + \sum_{k=1}^n a_k \gamma_k + \upsilon \sum_{k=1}^n |a_{k+1} - a_k| \right)$$
and
$$\sum_{k=1}^n a_k \{\mathbb{E}[F(\theta_k)] - \min F\} \le C \left( \frac{a_0}{\gamma_0} + \sum_{k=1}^n \Big| \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big| + \sum_{k=1}^n a_k \gamma_k + \upsilon \sum_{k=1}^n |a_{k+1} - a_k| \right),$$
where $\upsilon = 0$ if the Monte Carlo approximation is unbiased and $\upsilon = 1$ otherwise.

Convergence of the Monte Carlo proximal-gradient algorithm / How to choose $\gamma_n, m_n$?

Fixed or increasing batch size $m_n$? Fixed or decreasing step size $\gamma_n$? Consider the $L_q$-convergence rate of
$$\Big( \sum_{k=1}^n a_k \Big)^{-1} \sum_{k=1}^n a_k F(\theta_k) - F(\theta^\star).$$
- Increasing batch size $m_n$: with $\gamma_n = \gamma$, $m_n \propto n$, $a_n = 1$: rate $O(\ln n / n)$; complexity (rate as a function of the total number of simulations) $O(\ln n / \sqrt{n})$.
- Fixed batch size $m_n = m$: with $\gamma_n \propto \gamma^\star / \sqrt{n}$, $a_n = 1$ or $a_n = \gamma_n$: rate $O(1/\sqrt{n})$; complexity $O(1/\sqrt{n})$.
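The two regimes above can be written out as schedules. The sketch below (with an illustrative base step `gamma_star`, not from the talk) only tallies the simulation budget $\sum_{k \le n} m_k$, which is what the complexity statements refer to: the increasing-batch regime spends $O(n^2)$ simulations in $n$ iterations, the fixed-batch regime $O(n)$.

```python
import math

gamma_star = 0.1  # illustrative base step size

def increasing_batch(n):
    """Regime 1: gamma_n = gamma (constant), m_n ~ n, a_n = 1."""
    return gamma_star, n

def fixed_batch(n, m=1):
    """Regime 2: gamma_n ~ gamma / sqrt(n), m_n = m (constant), a_n = 1 or gamma_n."""
    return gamma_star / math.sqrt(n), m

# Total simulation budget after n iterations in each regime.
n = 1000
cost_increasing = sum(increasing_batch(k)[1] for k in range(1, n + 1))  # ~ n^2 / 2
cost_fixed = sum(fixed_batch(k)[1] for k in range(1, n + 1))            # = n

assert cost_increasing == n * (n + 1) // 2
assert cost_fixed == n
```

Measured per iteration, the increasing-batch regime is faster ($\ln n / n$ vs $1/\sqrt{n}$); measured per simulation, the two regimes are essentially comparable, which is the point of the slide.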

Conclusion, Other results and Works in progress / Outline

- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress

Conclusion, Other results and Works in progress / Conclusion

Contributions:
a) NOT in the strongly convex case.
b) Sufficient conditions for the convergence of perturbed proximal-gradient algorithms.
c) The case of Monte Carlo approximations: biased or unbiased, increasing or fixed batch size.
Major contributions:
a) for Monte Carlo approximations;
b) biased approximations;
c) fixed batch size.

Conclusion, Other results and Works in progress / Other results, works in progress and future works

a) When $f$ is not convex.
b) Accelerations (Nesterov, ...).
c) Convergence of the Proximal Stochastic Approximation Expectation Maximization algorithm: for the maximization of a penalized likelihood in latent-variable models, using a generalization of the SAEM algorithm.
d) Rates of convergence: explicit controls.