Penalized inference in models with intractable likelihood via perturbed proximal-gradient algorithms


Penalized inference in models with intractable likelihood via perturbed proximal-gradient algorithms. Gersende Fort, Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier, Toulouse, France.

Based on joint works with:
- Yves Atchadé (Univ. Michigan, USA) and Eric Moulines (Ecole Polytechnique, France): On Perturbed Proximal-Gradient algorithms (JMLR, 2016).
- Edouard Ollier (ENS Lyon, France) and Adeline Samson (Univ. Grenoble Alpes, France): Penalized inference in Mixed Models by Proximal Gradient methods (work in progress).
- Jean-François Aujol (IMB, Bordeaux, France) and Charles Dossal (IMB, Bordeaux, France): Acceleration for perturbed Proximal Gradient algorithms (work in progress).

Motivation: Pharmacokinetics (1/2). N patients. For patient i, observations $\{Y_{ij}, 1 \le j \le J\}$: evolution of the concentration at times $t_{ij}$, $1 \le j \le J$. Initial dose D. Model:
$$Y_{ij} = f(t_{ij}, X_i) + \epsilon_{ij}, \qquad X_i = Z_i \beta + d_i \in \mathbb{R}^L,$$
where the $\epsilon_{ij}$ are i.i.d. $\mathcal{N}(0, \sigma^2)$; the $d_i$ are i.i.d. $\mathcal{N}_L(0, \Omega)$ and independent of $\epsilon$; $Z_i$ is a known matrix such that each row of $X_i$ has an intercept (fixed effect) and covariates.

Statistical analysis: estimation of $(\beta, \sigma^2, \Omega)$ under sparsity constraints on $\beta$; selection of the covariates based on $\hat\beta$. Penalized Maximum Likelihood.

Motivation: Pharmacokinetics (2/2). Same model as above. Likelihoods: the distribution of $\{Y_{ij}, X_i;\ 1 \le i \le N, 1 \le j \le J\}$ has an explicit expression. The distribution of $\{Y_{ij};\ 1 \le i \le N, 1 \le j \le J\}$ does not have an explicit expression: it is the marginal of the previous one.
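To fix ideas, here is a minimal sketch of how such data could be simulated; the one-compartment curve f, the dimensions and all numeric values are illustrative assumptions of this example, not the talk's exact setting.

```python
import numpy as np

def simulate_pk(N=50, J=8, q=5, D=100.0, sigma=0.1, seed=0):
    """Toy simulation of Y_ij = f(t_ij, X_i) + eps_ij, X_i = Z_i beta + d_i,
    with X_i = (log ka, log V, log Cl) and a one-compartment curve f
    (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.5, 12.0, J)                      # sampling times
    beta = np.zeros(3 * q)
    beta[[0, q, 2 * q]] = [0.5, 3.0, 1.0]              # sparse fixed effects
    Omega = 0.01 * np.eye(3)                           # random-effects covariance
    Y = np.empty((N, J))
    for i in range(N):
        cov = np.concatenate([[1.0], rng.standard_normal(q - 1)])  # intercept + covariates
        Z = np.kron(np.eye(3), cov)                    # each row of X_i uses the covariates
        X = Z @ beta + rng.multivariate_normal(np.zeros(3), Omega)
        ka, V, Cl = np.exp(X)                          # log-parameterization
        ke = Cl / V
        f = D * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))
        Y[i] = f + sigma * rng.standard_normal(J)
    return Y, t
```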

Outline:
- Penalized Maximum Likelihood inference in models with intractable likelihood (Example 1: Latent variable models; Example 2: Discrete graphical model, i.e. Markov random field)
- Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
- Convergence analysis

Penalized Maximum Likelihood inference with intractable likelihood. N observations: $Y = (Y_1, \dots, Y_N)$. A parametric statistical model: $\theta \in \Theta \subseteq \mathbb{R}^d$, with $\theta \mapsto L(\theta)$ the likelihood of the observations. A penalty on the parameter: $\theta \mapsto g(\theta)$, e.g. for sparsity constraints on $\theta$; usually $g$ is non-smooth and convex. Goal: computation of
$$\hat\theta \in \operatorname{argmin}_{\theta \in \Theta}\left( -\frac{1}{N}\log L(\theta) + g(\theta) \right)$$
when the likelihood $L$ has no closed-form expression and cannot be evaluated.

Example 1: Latent variable model. The log-likelihood of the observations Y is of the form $\theta \mapsto \log L(\theta)$ with
$$L(\theta) = \int_X p_\theta(x)\, \mu(dx),$$
where $\mu$ is a positive $\sigma$-finite measure on a set X; x are the missing/latent data, and (x, Y) are the complete data. In these models, the complete likelihood $p_\theta(x)$ can be evaluated explicitly, but the likelihood has no closed-form expression. The exact integral could be replaced by a Monte Carlo approximation, but this is known to be inefficient. Numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches). What about the gradient of the log-likelihood?

Gradient of the likelihood in a latent variable model: $\log L(\theta) = \log \int p_\theta(x)\, \mu(dx)$. Under regularity conditions, $\theta \mapsto \log L(\theta)$ is $C^1$ and
$$\nabla \log L(\theta) = \frac{\int \nabla_\theta\, p_\theta(x)\, \mu(dx)}{\int p_\theta(z)\, \mu(dz)} = \int \nabla_\theta \log p_\theta(x)\ \underbrace{\frac{p_\theta(x)\, \mu(dx)}{\int p_\theta(z)\, \mu(dz)}}_{\text{the a posteriori distribution}}.$$
Hence the gradient of the log-likelihood,
$$\nabla\left\{ \frac{1}{N} \log L(\theta) \right\} = \int H_\theta(x)\, \pi_\theta(dx),$$
is an intractable expectation w.r.t. the conditional distribution $\pi_\theta$ of the latent variables given the observations Y. For all $(x, \theta)$, $H_\theta(x)$ can be evaluated.

Approximation of the gradient
$$\nabla\left\{ \frac{1}{N} \log L(\theta) \right\} = \int_X H_\theta(x)\, \pi_\theta(dx).$$
1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Use i.i.d. samples from $\pi_\theta$ to define a Monte Carlo approximation: not possible, in general.
3. Use m samples from a non-stationary Markov chain $\{X_{j,\theta}, j \ge 0\}$ with unique stationary distribution $\pi_\theta$, and define a Monte Carlo approximation; MCMC samplers provide such a chain (see the sketch below).
This yields a stochastic approximation of the gradient which is biased: $E[h(X_{j,\theta})] \ne \int h(x)\, \pi_\theta(dx)$. If the Markov chain is ergodic enough, the bias vanishes as $j \to \infty$.
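As a concrete illustration of option 3, a minimal sketch with a random-walk Metropolis chain; the function names, signatures and the warm-start convention are assumptions made for this example, not part of the talk.

```python
import numpy as np

def mcmc_gradient_estimate(theta, H, log_p, x0, m, step=0.5, rng=None):
    """Biased Monte Carlo estimate of int H_theta(x) pi_theta(dx), where
    pi_theta is known only up to a constant through log_p(theta, x)
    (the complete-data log-likelihood), via a random-walk Metropolis chain.
    Returns the estimate and the last chain state (useful for warm starts)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    acc = np.zeros_like(np.asarray(H(theta, x), dtype=float))
    for _ in range(m):
        prop = x + step * rng.standard_normal(x.shape)        # RW proposal
        if np.log(rng.uniform()) < log_p(theta, prop) - log_p(theta, x):
            x = prop                                          # MH accept step
        acc += H(theta, x)
    return acc / m, x
```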

Example 2: Discrete graphical model (Markov random field). N independent observations of an undirected graph with p nodes; each node takes values in a finite alphabet X. The N i.i.d. observations $Y_i$ in $X^p$ have distribution, for $y = (y_1, \dots, y_p)$,
$$\pi_\theta(y) \overset{\text{def}}{=} \frac{1}{Z_\theta} \exp\left( \sum_{k=1}^p \theta_{kk} B(y_k, y_k) + \sum_{1 \le j < k \le p} \theta_{kj} B(y_k, y_j) \right) = \frac{1}{Z_\theta} \exp\left( \langle \theta, B(y) \rangle \right),$$
where B is a symmetric function and $\theta$ is a symmetric $p \times p$ matrix. The normalizing constant (partition function) $Z_\theta$ cannot be computed: it is a sum over $|X|^p$ terms.

Likelihood and its gradient in a Markov random field. The likelihood is of the form (the scalar product between matrices being the Frobenius inner product)
$$\frac{1}{N} \log L(\theta) = \left\langle \theta, \frac{1}{N} \sum_{i=1}^N B(Y_i) \right\rangle - \log Z_\theta.$$
The likelihood is intractable. The gradient is of the form
$$\nabla\left( \frac{1}{N} \log L(\theta) \right) = \frac{1}{N} \sum_{i=1}^N B(Y_i) - \sum_{y \in X^p} B(y)\, \pi_\theta(y), \qquad \pi_\theta(y) \overset{\text{def}}{=} \frac{1}{Z_\theta} \exp\left( \langle \theta, B(y) \rangle \right).$$
The gradient of the log-likelihood is intractable.

Approximation of the gradient
$$\nabla\left( \frac{1}{N} \log L(\theta) \right) = \frac{1}{N} \sum_{i=1}^N B(Y_i) - \sum_{y \in X^p} B(y)\, \pi_\theta(y).$$
The Gibbs measure $\pi_\theta(y) = Z_\theta^{-1} \exp(\langle \theta, B(y) \rangle)$ is known up to the constant $Z_\theta$. Exact sampling from $\pi_\theta$ is not feasible, but it can be approximated by MCMC samplers (Gibbs-type samplers such as Swendsen-Wang, ...). Hence a biased approximation of the gradient is available.
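For intuition, a minimal single-site Gibbs sketch for the binary case with $B(y) = y y^\top$ (a concrete choice made here for the example; the talk only assumes B symmetric):

```python
import numpy as np

def gibbs_ising_moment(theta, p, n_sweeps, rng=None):
    """MCMC estimate of E_{pi_theta}[B(y)] for y in {-1,+1}^p with
    B(y) = y y^T, by single-site Gibbs sampling. theta: symmetric (p, p)."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.choice([-1.0, 1.0], size=p)
    acc = np.zeros((p, p))
    for _ in range(n_sweeps):
        for k in range(p):
            s = theta[k] @ y - theta[k, k] * y[k]    # neighbors' contribution
            # P(y_k = +1 | rest) under the Gibbs measure exp(<theta, y y^T>)
            y[k] = 1.0 if rng.uniform() < 1.0 / (1.0 + np.exp(-4.0 * s)) else -1.0
        acc += np.outer(y, y)
    return acc / n_sweeps
```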

To summarize. Problem:
$$\operatorname{argmin}_{\theta \in \Theta} F(\theta), \qquad F(\theta) = f(\theta) + g(\theta), \qquad \Theta \subseteq \mathbb{R}^d,$$
where
- g is a convex non-smooth function (explicit);
- f is $C^1$, with an intractable gradient of the form $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx)$, which can be approximated by biased Monte Carlo techniques;
- $\nabla f$ is Lipschitz: there exists $L > 0$ such that $\|\nabla f(\theta) - \nabla f(\theta')\| \le L\, \|\theta - \theta'\|$ for all $\theta, \theta'$;
- f is not necessarily convex.

Outline:
- Penalized Maximum Likelihood inference in models with intractable likelihood
- Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms (Algorithms; Numerical illustration)
- Convergence analysis

The Proximal-Gradient algorithm (1/2).
$$\operatorname{argmin}_{\theta \in \Theta} F(\theta) \quad \text{with} \quad F(\theta) = \underbrace{f(\theta)}_{\text{smooth}} + \underbrace{g(\theta)}_{\text{non-smooth}}.$$
The Proximal Gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} \nabla f(\theta_n) \right), \qquad \operatorname{Prox}_{\gamma, g}(\tau) \overset{\text{def}}{=} \operatorname{argmin}_{\theta \in \Theta}\left( g(\theta) + \frac{1}{2\gamma} \|\theta - \tau\|^2 \right).$$
Proximal map: Moreau (1962). Proximal Gradient algorithm: Beck-Teboulle (2010); Combettes-Pesquet (2011); Parikh-Boyd (2013). It is a generalization of the gradient algorithm to a composite objective function, and an iterative algorithm (of Majorize-Minimize type) which produces a sequence $\{\theta_n, n \ge 0\}$ such that $F(\theta_{n+1}) \le F(\theta_n)$.

The Proximal-Gradient algorithm (2/2). About the Prox step:
- when $g = 0$: $\operatorname{Prox}_{\gamma, g}(\tau) = \tau$;
- when $g$ is the indicator function of a compact set: the algorithm is the projected gradient;
- in some cases, Prox is explicit (e.g. the elastic net penalty; see the sketch below); otherwise it is computed by numerical approximation:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} \nabla f(\theta_n) \right) + \epsilon_{n+1}.$$
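For instance, with $g(\theta) = \lambda \|\theta\|_1$ the prox is coordinatewise soft-thresholding, and the elastic net prox is also closed-form; a minimal sketch (the function names are mine):

```python
import numpy as np

def prox_l1(tau, gamma, lam):
    """Prox of g = lam * ||.||_1 with step gamma: soft-thresholding."""
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def prox_elastic_net(tau, gamma, lam, mu):
    """Prox of g = lam * ||.||_1 + (mu/2) * ||.||_2^2, in closed form."""
    return prox_l1(tau, gamma, lam) / (1.0 + gamma * mu)

def proximal_gradient_step(theta, grad_f_theta, gamma, lam):
    """One exact proximal gradient update for F = f + lam * ||.||_1."""
    return prox_l1(theta - gamma * grad_f_theta, gamma, lam)
```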

The perturbed proximal-gradient algorithm. Given a deterministic sequence $\{\gamma_n, n \ge 0\}$,
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} H_{n+1} \right),$$
where $H_{n+1}$ is an approximation of $\nabla f(\theta_n)$.

Algorithm: Monte Carlo-Proximal Gradient for Penalized ML. When the gradient of the log-likelihood is of the form $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(x)\mu(dx)$, the MC-Proximal Gradient algorithm proceeds as follows. Given the current value $\theta_n$:
1. Sample a Markov chain $\{X_{j,n}, j \ge 0\}$ from an MCMC sampler with target distribution $\pi_{\theta_n}\, d\mu$.
2. Set
$$H_{n+1} = \frac{1}{m_{n+1}} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{j,n}).$$
3. Update the current value of the parameter: $\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} H_{n+1} \right)$.
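Putting the pieces together, a sketch of the full loop for an $\ell_1$ penalty, reusing mcmc_gradient_estimate and prox_l1 from the earlier snippets; warm-starting the chain across outer iterations is an implementation choice of this example, not prescribed by the talk.

```python
def mc_proximal_gradient(theta0, H, log_p, x0, n_iter, lam, gamma=0.01, m=200):
    """MC-Proximal Gradient for F = f + lam*||.||_1, where
    grad f(theta) = int H_theta(x) pi_theta(dx) is estimated by MCMC."""
    theta, x = np.asarray(theta0, dtype=float), x0
    for _ in range(n_iter):
        Hn, x = mcmc_gradient_estimate(theta, H, log_p, x, m)  # steps 1-2
        theta = prox_l1(theta - gamma * Hn, gamma, lam)        # step 3
    return theta
```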

Algorithm: Stochastic Approximation-Proximal Gradient for Penalized ML. If, in addition, $H_\theta(x) = \Phi(\theta) + \Psi(\theta) S(x)$, which implies
$$\nabla f(\theta) = \Phi(\theta) + \Psi(\theta) \int S(x)\, \pi_\theta(x)\mu(dx),$$
the SA-Proximal Gradient algorithm proceeds as follows. Given the current value $\theta_n$:
1. Sample a Markov chain $\{X_{j,n}, j \ge 0\}$ from an MCMC sampler with target distribution $\pi_{\theta_n}\, d\mu$.
2. Set $H_{n+1} = \Phi(\theta_n) + \Psi(\theta_n) S_{n+1}$ with
$$S_{n+1} = S_n + \delta_{n+1}\left( \frac{1}{m_{n+1}} \sum_{j=1}^{m_{n+1}} S(X_{j,n}) - S_n \right).$$
3. Update the current value of the parameter: $\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} H_{n+1} \right)$.
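A sketch of this variant; sample_chain is an assumed helper returning m states of an MCMC chain targeting $\pi_\theta$ (it could be built from the Metropolis kernel above), Psi is assumed to return a matrix acting on the statistic, and all names are mine.

```python
def sa_proximal_gradient(theta0, Phi, Psi, S_fun, sample_chain, x0,
                         n_iter, lam, gamma=0.01, delta=0.1, m=1):
    """SA-Proximal Gradient for F = f + lam*||.||_1 when
    H_theta(x) = Phi(theta) + Psi(theta) S(x): the sufficient statistic is
    smoothed across iterations, so even m = 1 draw per iteration works."""
    theta, x, S = np.asarray(theta0, dtype=float), x0, None
    for _ in range(n_iter):
        xs = sample_chain(theta, x, m)                       # step 1
        x = xs[-1]
        S_mc = np.mean([S_fun(xi) for xi in xs], axis=0)
        S = S_mc if S is None else S + delta * (S_mc - S)    # step 2 (Robbins-Monro)
        H = Phi(theta) + Psi(theta) @ S
        theta = prox_l1(theta - gamma * H, gamma, lam)       # step 3
    return theta
```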

(*) Penalized Expectation-Maximization vs Proximal-Gradient. EM (Dempster et al., 1977) is a Majorize-Minimize algorithm for the computation of the ML estimate in a latent variable model. The Proximal Gradient algorithm is a Penalized Generalized-EM algorithm:
$$-g(\theta_{n+1}) + Q(\theta_{n+1}, \theta_n) \ge -g(\theta_n) + Q(\theta_n, \theta_n),$$
where
$$Q(\theta, \theta') \overset{\text{def}}{=} \int \log p_\theta(x)\ \pi_{\theta'}(x)\, d\mu(x), \qquad \pi_{\theta'}(x) \overset{\text{def}}{=} \frac{p_{\theta'}(x)}{\int p_{\theta'}(z)\, d\mu(z)}.$$
MC-Proximal Gradient corresponds to the Penalized Generalized-MCEM algorithm (Wei and Tanner, 1990); SA-Proximal Gradient corresponds to the Penalized Generalized-SAEM algorithm (Delyon et al., 1999).

Numerical illustration (1/3): pharmacokinetics. Choices for the implementation of the algorithm:
- Penalty term: $g(\theta) = \lambda \|\beta\|_1$. How to choose $\lambda$? A weighted norm?
- Step size sequences: constant or vanishing sequence $\{\gamma_n, n \ge 0\}$? (And $\delta_n$ for the SA-Prox Gradient algorithm.)
- Monte Carlo approximation: fixed or increasing batch size?

Numerical illustration (2/3): pharmacokinetics. [Figure: parameter values over 500 iterations, for Proximal MCEM and Proximal SAEM, each with a decreasing and an adaptive step size (four panels; x-axis: iteration, y-axis: parameter value).]

Numerical illustration (3/3): pharmacokinetics. [Figure: Low-dimension simulation setting. Regularization path of the covariate parameters for the clearance (left), absorption constant (middle) and volume of distribution (right) parameters. The black dashed line corresponds to the $\lambda$ value selected by EBIC. Each colored curve corresponds to a covariate.]

Outline:
- Penalized Maximum Likelihood inference in models with intractable likelihood
- Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
- Convergence analysis

Convergence analysis. Problem:
$$\operatorname{argmin}_{\theta \in \Theta} F(\theta) \quad \text{with} \quad F(\theta) = \underbrace{f(\theta)}_{\text{smooth}} + \underbrace{g(\theta)}_{\text{non-smooth}}.$$
The Perturbed Proximal Gradient algorithm: given a $(0, 1/L]$-valued deterministic sequence $\{\gamma_n, n \ge 0\}$, set for $n \ge 0$
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} H_{n+1} \right),$$
where $H_{n+1}$ is an approximation of $\nabla f(\theta_n)$. Which conditions on the sequence $\{\gamma_n, n \ge 0\}$ and on the perturbations $H_{n+1} - \nabla f(\theta_n)$ ensure that this algorithm converges to the minima of F?

The assumptions. For $\operatorname{argmin}_{\theta \in \Theta} F(\theta)$ with $F = f + g$:
- the function $g: \mathbb{R}^d \to [0, +\infty]$ is convex, non-smooth, not identically equal to $+\infty$, and lower semi-continuous;
- the function $f: \mathbb{R}^d \to \mathbb{R}$ is a smooth convex function, i.e. f is continuously differentiable and there exists $L > 0$ such that
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\, \|\theta - \theta'\| \qquad \forall\, \theta, \theta' \in \mathbb{R}^d;$$
- $\Theta \subseteq \mathbb{R}^d$ is the domain of g: $\Theta = \{\theta \in \mathbb{R}^d : g(\theta) < \infty\}$;
- the set $\operatorname{argmin}_\Theta F$ is a non-empty subset of $\Theta$.

Existing results in the literature. Results exist under (some of) the following assumptions:
- $\inf_n \gamma_n > 0$;
- $\sum_n \|H_{n+1} - \nabla f(\theta_n)\| < \infty$;
- i.i.d. Monte Carlo approximation, i.e. unbiased sampling — almost NO results cover biased sampling, such as MCMC;
- non-vanishing stepsize sequence $\{\gamma_n, n \ge 0\}$;
- increasing batch size: when $H_{n+1}$ is a Monte Carlo sum, $H_{n+1} = \frac{1}{m_{n+1}} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{j,n})$, then $\lim_n m_n = +\infty$ at some rate.
References: Combettes (2001) Elsevier Science; Combettes-Wajs (2005) Multiscale Modeling and Simulation; Combettes-Pesquet (2015, 2016) SIAM J. Optim., arXiv; Lin-Rosasco-Villa-Zhou (2015) arXiv; Rosasco-Villa-Vu (2014, 2015) arXiv; Schmidt-Le Roux-Bach (2011) NIPS.

Convergence of the perturbed proximal gradient algorithm: $\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}(\theta_n - \gamma_{n+1} H_{n+1})$ with $H_{n+1} \approx \nabla f(\theta_n)$. Set $\mathcal{L} = \operatorname{argmin}_\Theta (f + g)$ and $\eta_{n+1} = H_{n+1} - \nabla f(\theta_n)$.

Theorem (Atchadé, F., Moulines (2015)). Assume: g convex, lower semi-continuous; f convex, $C^1$, and its gradient is Lipschitz with constant L; $\mathcal{L}$ is non-empty; $\sum_n \gamma_n = +\infty$ and $\gamma_n \in (0, 1/L]$; and convergence of the series
$$\sum_n \gamma_{n+1}^2 \|\eta_{n+1}\|^2, \qquad \sum_n \gamma_{n+1} \eta_{n+1}, \qquad \sum_n \gamma_{n+1} \langle T_n, \eta_{n+1} \rangle,$$
where $T_n = \operatorname{Prox}_{\gamma_{n+1}, g}(\theta_n - \gamma_{n+1} \nabla f(\theta_n))$. Then there exists $\theta_\infty \in \mathcal{L}$ such that $\lim_n \theta_n = \theta_\infty$.

Convergence when $H_{n+1}$ is a Monte Carlo approximation (1/3). In the case
$$H_{n+1} = \frac{1}{m_{n+1}} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{j,n}) \approx \nabla f(\theta_n),$$
where $\{X_{j,n}, j \ge 0\}$ is a non-stationary Markov chain with unique stationary distribution $\pi_{\theta_n}$: for any $n \ge 0$,
$$X_{j+1,n} \mid \text{past} \sim P_{\theta_n}(X_{j,n}, \cdot), \qquad \pi_\theta P_\theta = \pi_\theta.$$
Let us check one of the conditions:
$$\sum_n \gamma_{n+1} \eta_{n+1} = \sum_n \gamma_{n+1} \left( H_{n+1} - \nabla f(\theta_n) \right) = \sum_n \gamma_{n+1} \underbrace{\left\{ H_{n+1} - E[H_{n+1} \mid \mathcal{F}_n] \right\}}_{\text{martingale increment}} + \sum_n \gamma_{n+1} \underbrace{\left( E[H_{n+1} \mid \mathcal{F}_n] - \nabla f(\theta_n) \right)}_{\text{bias, non-vanishing when } m_n = m}.$$
The most technical case is the biased approximation with fixed batch size.

Convergence when $H_{n+1}$ is a Monte Carlo approximation (2/3). Increasing batch size: $\lim_n m_n = +\infty$.
Conditions on the step sizes and batch sizes:
$$\sum_n \gamma_n = +\infty, \qquad \sum_n \frac{\gamma_n^2}{m_n} < \infty; \qquad \sum_n \frac{\gamma_n}{m_n} < \infty \ \text{(biased case)}.$$
Conditions on the Markov kernels: there exist $\lambda \in (0, 1)$, $b < \infty$, $p \ge 2$ and a measurable function $W: X \to [1, +\infty)$ such that
$$\sup_{\theta \in \Theta} \|H_\theta\|_W < \infty, \qquad \sup_{\theta \in \Theta} P_\theta W^p \le \lambda W^p + b.$$
In addition, for any $l \in (0, p]$, there exist $C < \infty$ and $\rho \in (0, 1)$ such that for any $x \in X$,
$$\sup_{\theta \in \Theta} \|P_\theta^n(x, \cdot) - \pi_\theta\|_{W^l} \le C \rho^n W^l(x).$$
Condition on $\Theta$: $\Theta$ is bounded.

Convergence when $H_{n+1}$ is a Monte Carlo approximation (3/3). Fixed batch size: $m_n = m$.
Conditions on the step sizes:
$$\sum_n \gamma_n = +\infty, \qquad \sum_n \gamma_n^2 < \infty, \qquad \sum_n |\gamma_{n+1} - \gamma_n| < \infty.$$
Conditions on the Markov chain: the same as in the increasing-batch-size case, and there exists a constant C such that for any $\theta, \theta' \in \Theta$,
$$\|H_\theta - H_{\theta'}\|_W + \sup_x \frac{\|P_\theta(x, \cdot) - P_{\theta'}(x, \cdot)\|_W}{W(x)} + \|\pi_\theta - \pi_{\theta'}\|_W \le C\, \|\theta - \theta'\|.$$
Condition on the Prox:
$$\sup_{\gamma \in (0, 1/L]} \sup_{\theta \in \Theta} \gamma^{-1} \left\| \operatorname{Prox}_{\gamma, g}(\theta) - \theta \right\| < \infty.$$
Condition on $\Theta$: $\Theta$ is bounded.

Rates of convergence (1/3): the problem. For non-negative weights $a_k$, find an upper bound on
$$\sum_{k=1}^n \frac{a_k}{\sum_{l=1}^n a_l} \left( F(\theta_k) - \min F \right).$$
It provides an upper bound for the cumulative regret ($a_k = 1$), and an upper bound for an averaging strategy when F is convex, since
$$F\left( \sum_{k=1}^n \frac{a_k}{\sum_{l=1}^n a_l}\, \theta_k \right) - \min F \le \sum_{k=1}^n \frac{a_k}{\sum_{l=1}^n a_l} F(\theta_k) - \min F.$$

Rates of convergence (2/3): a deterministic control.
Theorem (Atchadé, F., Moulines (2016)). For any $\theta^* \in \operatorname{argmin}_\Theta F$,
$$\sum_{k=1}^n \frac{a_k}{A_n} \left( F(\theta_k) - \min F \right) \le \frac{a_0 \|\theta_0 - \theta^*\|^2}{2 \gamma_0 A_n} + \frac{1}{2 A_n} \sum_{k=1}^n \left( \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \right) \|\theta_{k-1} - \theta^*\|^2 + \frac{1}{A_n} \sum_{k=1}^n a_k \gamma_k \|\eta_k\|^2 - \frac{1}{A_n} \sum_{k=1}^n a_k \langle T_{k-1} - \theta^*, \eta_k \rangle,$$
where $A_n = \sum_{l=1}^n a_l$, $\eta_k = H_k - \nabla f(\theta_{k-1})$, and $T_k = \operatorname{Prox}_{\gamma_k, g}(\theta_{k-1} - \gamma_k \nabla f(\theta_{k-1}))$.

Rates of convergence (3/3): when $H_{n+1}$ is a Monte Carlo approximation, a bound in $L^q$:
$$\left\| F\left( \frac{1}{n} \sum_{k=1}^n \theta_k \right) - \min F \right\|_{L^q} \le \left\| \frac{1}{n} \sum_{k=1}^n F(\theta_k) - \min F \right\|_{L^q} \le u_n.$$
- $u_n = O(1/\sqrt{n})$ with fixed batch size $m_n = m$ and a (slowly) decaying stepsize $\gamma_n = \gamma / n^a$, $a \in [1/2, 1]$. With averaging: the optimal rate, even with a slowly decaying stepsize $\gamma_n \propto 1/\sqrt{n}$.
- $u_n = O(\ln n / n)$ with increasing batch size $m_n \propto n$ and constant stepsize $\gamma_n = \gamma$. This rate costs $O(n^2)$ Monte Carlo samples in total!
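In practice the averaging above is just a running mean of the iterates; a minimal sketch, where the helper step_fn stands for one perturbed proximal gradient update (an assumption of this example):

```python
def averaged_run(step_fn, theta0, n_iter):
    """Runs n_iter updates and returns the uniform (a_k = 1) average of the
    iterates, the quantity controlled by the L^q bounds above."""
    theta = np.asarray(theta0, dtype=float)
    avg = np.zeros_like(theta)
    for k in range(1, n_iter + 1):
        theta = step_fn(theta)
        avg += (theta - avg) / k     # running mean of theta_1, ..., theta_k
    return avg
```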

Acceleration (1). Let $\{t_n, n \ge 0\}$ be a positive sequence such that $\gamma_{n+1} t_n (t_n - 1) \le \gamma_n t_{n-1}^2$. Nesterov acceleration of the Proximal Gradient algorithm (Nesterov (2004), Tseng (2008), Beck-Teboulle (2009)):
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \tau_n - \gamma_{n+1} \nabla f(\tau_n) \right), \qquad \tau_{n+1} = \theta_{n+1} + \frac{t_n - 1}{t_{n+1}} (\theta_{n+1} - \theta_n).$$
See also Zhu-Orecchia (2015); Attouch-Peypouquet (2015); Bubeck-Lee-Singh (2015); Su-Boyd-Candès (2015). Rates in the deterministic case: Proximal Gradient, $F(\theta_n) - \min F = O(1/n)$; Accelerated Proximal Gradient, $F(\theta_n) - \min F = O(1/n^2)$.
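A sketch of the deterministic accelerated scheme with the classical FISTA choice of $t_n$ and an $\ell_1$ penalty, reusing prox_l1 from the earlier snippet:

```python
def accelerated_proximal_gradient(theta0, grad_f, gamma, lam, n_iter):
    """Nesterov-accelerated proximal gradient (FISTA) for f + lam*||.||_1,
    with a constant step gamma in (0, 1/L]."""
    theta = np.asarray(theta0, dtype=float)
    tau, t = theta.copy(), 1.0
    for _ in range(n_iter):
        theta_next = prox_l1(tau - gamma * grad_f(tau), gamma, lam)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))   # classical t_n update
        tau = theta_next + ((t - 1.0) / t_next) * (theta_next - theta)
        theta, t = theta_next, t_next
    return theta
```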

Acceleration (2). Aujol-Dossal-F.-Moulines, work in progress. Perturbed Nesterov acceleration: some convergence results. Choose $\gamma_n$, $m_n$, $t_n$ such that
$$\gamma_n \in (0, 1/L], \qquad \lim_n \gamma_n t_n^2 = +\infty, \qquad \sum_n \frac{\gamma_n t_n (1 + \gamma_n t_n)}{m_n} < \infty.$$
Then there exists $\theta_\infty \in \operatorname{argmin}_\Theta F$ such that $\lim_n \theta_n = \theta_\infty$. In addition,
$$F(\theta_{n+1}) - \min F = O\left( \frac{1}{\gamma_{n+1} t_n^2} \right).$$
See also Schmidt-Le Roux-Bach (2011); Dossal-Chambolle (2014); Aujol-Dossal (2015).

Table: control of $F(\theta_n) - \min F$.
  $\gamma_n$          $m_n$    $t_n$   rate          Nbr MC samples
  $\gamma$            $n^3$    $n$     $n^{-2}$      $n^4$
  $\gamma/\sqrt{n}$   $n^2$    $n$     $n^{-3/2}$    $n^3$