Penalized inference in models with intractable likelihood via perturbed proximal-gradient algorithms


Penalized inference in models with intractable likelihood via perturbed proximal-gradient algorithms. Gersende Fort, Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier, Toulouse, France.

Based on joint works with:
- Yves Atchadé (Univ. Michigan, USA) and Eric Moulines (Ecole Polytechnique, France): On Perturbed Proximal-Gradient algorithms (JMLR, 2016).
- Edouard Ollier (ENS Lyon, France) and Adeline Samson (Univ. Grenoble Alpes, France): Penalized inference in Mixed Models by Proximal Gradient methods (work in progress).
- Jean-François Aujol (IMB, Bordeaux, France) and Charles Dossal (IMB, Bordeaux, France): Acceleration for perturbed Proximal Gradient algorithms (work in progress).

Motivation: Pharmacokinetics (1/2). N patients. For patient i, observations $\{Y_{ij}, 1 \le j \le J\}$: evolution of the concentration at times $t_{ij}$, $1 \le j \le J$. Initial dose D. Model:
$$Y_{ij} = f(t_{ij}, X_i) + \epsilon_{ij}, \qquad X_i = Z_i \beta + d_i \in \mathbb{R}^L,$$
where the $\epsilon_{ij}$ are i.i.d. $\mathcal{N}(0, \sigma^2)$; the $d_i$ are i.i.d. $\mathcal{N}_L(0, \Omega)$ and independent of $\epsilon$; $Z_i$ is a known matrix such that each row of $X_i$ has an intercept (fixed effect) and covariates.

Statistical analysis: estimation of $(\beta, \sigma^2, \Omega)$ under sparsity constraints on $\beta$; selection of the covariates based on $\hat\beta$. Penalized Maximum Likelihood.

Motivation: Pharmacokinetics (2/2). Same model as above. Likelihoods: the distribution of $\{Y_{ij}, X_i;\ 1 \le i \le N, 1 \le j \le J\}$ has an explicit expression. The distribution of $\{Y_{ij};\ 1 \le i \le N, 1 \le j \le J\}$ does not have an explicit expression: it is the marginal of the previous one.
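To fix ideas, here is a minimal sketch of how such data could be simulated; the one-compartment curve f, the dimensions and all numeric values are illustrative assumptions of this example, not the talk's exact setting.

```python
import numpy as np

def simulate_pk(N=50, J=8, q=5, D=100.0, sigma=0.1, seed=0):
    """Toy simulation of Y_ij = f(t_ij, X_i) + eps_ij, X_i = Z_i beta + d_i,
    with X_i = (log ka, log V, log Cl) and a one-compartment curve f
    (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.5, 12.0, J)                      # sampling times
    beta = np.zeros(3 * q)
    beta[[0, q, 2 * q]] = [0.5, 3.0, 1.0]              # sparse fixed effects
    Omega = 0.01 * np.eye(3)                           # random-effects covariance
    Y = np.empty((N, J))
    for i in range(N):
        cov = np.concatenate([[1.0], rng.standard_normal(q - 1)])  # intercept + covariates
        Z = np.kron(np.eye(3), cov)                    # each row of X_i uses the covariates
        X = Z @ beta + rng.multivariate_normal(np.zeros(3), Omega)
        ka, V, Cl = np.exp(X)                          # log-parameterization
        ke = Cl / V
        f = D * ka / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))
        Y[i] = f + sigma * rng.standard_normal(J)
    return Y, t
```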

Outline:
- Penalized Maximum Likelihood inference in models with intractable likelihood (Example 1: Latent variable models; Example 2: Discrete graphical model, i.e. Markov random field)
- Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
- Convergence analysis

Penalized Maximum Likelihood inference with intractable likelihood. N observations: $Y = (Y_1, \dots, Y_N)$. A parametric statistical model: $\theta \in \Theta \subseteq \mathbb{R}^d$, with $\theta \mapsto L(\theta)$ the likelihood of the observations. A penalty on the parameter: $\theta \mapsto g(\theta)$, e.g. for sparsity constraints on $\theta$; usually $g$ is non-smooth and convex. Goal: computation of
$$\hat\theta \in \operatorname{argmin}_{\theta \in \Theta}\left( -\frac{1}{N}\log L(\theta) + g(\theta) \right)$$
when the likelihood $L$ has no closed-form expression and cannot be evaluated.

Example 1: Latent variable model. The log-likelihood of the observations Y is of the form $\theta \mapsto \log L(\theta)$ with
$$L(\theta) = \int_X p_\theta(x)\, \mu(dx),$$
where $\mu$ is a positive $\sigma$-finite measure on a set X; x are the missing/latent data, and (x, Y) are the complete data. In these models, the complete likelihood $p_\theta(x)$ can be evaluated explicitly, but the likelihood has no closed-form expression. The exact integral could be replaced by a Monte Carlo approximation, but this is known to be inefficient. Numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches). What about the gradient of the log-likelihood?

Gradient of the likelihood in a latent variable model: $\log L(\theta) = \log \int p_\theta(x)\, \mu(dx)$. Under regularity conditions, $\theta \mapsto \log L(\theta)$ is $C^1$ and
$$\nabla \log L(\theta) = \frac{\int \nabla_\theta\, p_\theta(x)\, \mu(dx)}{\int p_\theta(z)\, \mu(dz)} = \int \nabla_\theta \log p_\theta(x)\ \underbrace{\frac{p_\theta(x)\, \mu(dx)}{\int p_\theta(z)\, \mu(dz)}}_{\text{the a posteriori distribution}}.$$
Hence the gradient of the log-likelihood,
$$\nabla\left\{ \frac{1}{N} \log L(\theta) \right\} = \int H_\theta(x)\, \pi_\theta(dx),$$
is an intractable expectation w.r.t. the conditional distribution $\pi_\theta$ of the latent variables given the observations Y. For all $(x, \theta)$, $H_\theta(x)$ can be evaluated.

Approximation of the gradient
$$\nabla\left\{ \frac{1}{N} \log L(\theta) \right\} = \int_X H_\theta(x)\, \pi_\theta(dx).$$
1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Use i.i.d. samples from $\pi_\theta$ to define a Monte Carlo approximation: not possible, in general.
3. Use m samples from a non-stationary Markov chain $\{X_{j,\theta}, j \ge 0\}$ with unique stationary distribution $\pi_\theta$, and define a Monte Carlo approximation; MCMC samplers provide such a chain (see the sketch below).
This yields a stochastic approximation of the gradient which is biased: $E[h(X_{j,\theta})] \ne \int h(x)\, \pi_\theta(dx)$. If the Markov chain is ergodic enough, the bias vanishes as $j \to \infty$.
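As a concrete illustration of option 3, a minimal sketch with a random-walk Metropolis chain; the function names, signatures and the warm-start convention are assumptions made for this example, not part of the talk.

```python
import numpy as np

def mcmc_gradient_estimate(theta, H, log_p, x0, m, step=0.5, rng=None):
    """Biased Monte Carlo estimate of int H_theta(x) pi_theta(dx), where
    pi_theta is known only up to a constant through log_p(theta, x)
    (the complete-data log-likelihood), via a random-walk Metropolis chain.
    Returns the estimate and the last chain state (useful for warm starts)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    acc = np.zeros_like(np.asarray(H(theta, x), dtype=float))
    for _ in range(m):
        prop = x + step * rng.standard_normal(x.shape)        # RW proposal
        if np.log(rng.uniform()) < log_p(theta, prop) - log_p(theta, x):
            x = prop                                          # MH accept step
        acc += H(theta, x)
    return acc / m, x
```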

Example 2: Discrete graphical model (Markov random field). N independent observations of an undirected graph with p nodes; each node takes values in a finite alphabet X. The N i.i.d. observations $Y_i$ in $X^p$ have distribution, for $y = (y_1, \dots, y_p)$,
$$\pi_\theta(y) \overset{\text{def}}{=} \frac{1}{Z_\theta} \exp\left( \sum_{k=1}^p \theta_{kk} B(y_k, y_k) + \sum_{1 \le j < k \le p} \theta_{kj} B(y_k, y_j) \right) = \frac{1}{Z_\theta} \exp\left( \langle \theta, B(y) \rangle \right),$$
where B is a symmetric function and $\theta$ is a symmetric $p \times p$ matrix. The normalizing constant (partition function) $Z_\theta$ cannot be computed: it is a sum over $|X|^p$ terms.

Likelihood and its gradient in a Markov random field. The likelihood is of the form (the scalar product between matrices being the Frobenius inner product)
$$\frac{1}{N} \log L(\theta) = \left\langle \theta, \frac{1}{N} \sum_{i=1}^N B(Y_i) \right\rangle - \log Z_\theta.$$
The likelihood is intractable. The gradient is of the form
$$\nabla\left( \frac{1}{N} \log L(\theta) \right) = \frac{1}{N} \sum_{i=1}^N B(Y_i) - \sum_{y \in X^p} B(y)\, \pi_\theta(y), \qquad \pi_\theta(y) \overset{\text{def}}{=} \frac{1}{Z_\theta} \exp\left( \langle \theta, B(y) \rangle \right).$$
The gradient of the log-likelihood is intractable.

Approximation of the gradient
$$\nabla\left( \frac{1}{N} \log L(\theta) \right) = \frac{1}{N} \sum_{i=1}^N B(Y_i) - \sum_{y \in X^p} B(y)\, \pi_\theta(y).$$
The Gibbs measure $\pi_\theta(y) = Z_\theta^{-1} \exp(\langle \theta, B(y) \rangle)$ is known up to the constant $Z_\theta$. Exact sampling from $\pi_\theta$ is not feasible, but it can be approximated by MCMC samplers (Gibbs-type samplers such as Swendsen-Wang, ...). Hence a biased approximation of the gradient is available.
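For intuition, a minimal single-site Gibbs sketch for the binary case with $B(y) = y y^\top$ (a concrete choice made here for the example; the talk only assumes B symmetric):

```python
import numpy as np

def gibbs_ising_moment(theta, p, n_sweeps, rng=None):
    """MCMC estimate of E_{pi_theta}[B(y)] for y in {-1,+1}^p with
    B(y) = y y^T, by single-site Gibbs sampling. theta: symmetric (p, p)."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.choice([-1.0, 1.0], size=p)
    acc = np.zeros((p, p))
    for _ in range(n_sweeps):
        for k in range(p):
            s = theta[k] @ y - theta[k, k] * y[k]    # neighbors' contribution
            # P(y_k = +1 | rest) under the Gibbs measure exp(<theta, y y^T>)
            y[k] = 1.0 if rng.uniform() < 1.0 / (1.0 + np.exp(-4.0 * s)) else -1.0
        acc += np.outer(y, y)
    return acc / n_sweeps
```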

To summarize. Problem:
$$\operatorname{argmin}_{\theta \in \Theta} F(\theta), \qquad F(\theta) = f(\theta) + g(\theta), \qquad \Theta \subseteq \mathbb{R}^d,$$
where
- g is a convex non-smooth function (explicit);
- f is $C^1$, with an intractable gradient of the form $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx)$, which can be approximated by biased Monte Carlo techniques;
- $\nabla f$ is Lipschitz: there exists $L > 0$ such that $\|\nabla f(\theta) - \nabla f(\theta')\| \le L\, \|\theta - \theta'\|$ for all $\theta, \theta'$;
- f is not necessarily convex.

Outline:
- Penalized Maximum Likelihood inference in models with intractable likelihood
- Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms (Algorithms; Numerical illustration)
- Convergence analysis

The Proximal-Gradient algorithm (1/2).
$$\operatorname{argmin}_{\theta \in \Theta} F(\theta) \quad \text{with} \quad F(\theta) = \underbrace{f(\theta)}_{\text{smooth}} + \underbrace{g(\theta)}_{\text{non-smooth}}.$$
The Proximal Gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} \nabla f(\theta_n) \right), \qquad \operatorname{Prox}_{\gamma, g}(\tau) \overset{\text{def}}{=} \operatorname{argmin}_{\theta \in \Theta}\left( g(\theta) + \frac{1}{2\gamma} \|\theta - \tau\|^2 \right).$$
Proximal map: Moreau (1962). Proximal Gradient algorithm: Beck-Teboulle (2010); Combettes-Pesquet (2011); Parikh-Boyd (2013). It is a generalization of the gradient algorithm to a composite objective function, and an iterative algorithm (of Majorize-Minimize type) which produces a sequence $\{\theta_n, n \ge 0\}$ such that $F(\theta_{n+1}) \le F(\theta_n)$.

The Proximal-Gradient algorithm (2/2). About the Prox step:
- when $g = 0$: $\operatorname{Prox}_{\gamma, g}(\tau) = \tau$;
- when $g$ is the indicator function of a compact set: the algorithm is the projected gradient;
- in some cases, Prox is explicit (e.g. the elastic net penalty; see the sketch below); otherwise it is computed by numerical approximation:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} \nabla f(\theta_n) \right) + \epsilon_{n+1}.$$
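For instance, with $g(\theta) = \lambda \|\theta\|_1$ the prox is coordinatewise soft-thresholding, and the elastic net prox is also closed-form; a minimal sketch (the function names are mine):

```python
import numpy as np

def prox_l1(tau, gamma, lam):
    """Prox of g = lam * ||.||_1 with step gamma: soft-thresholding."""
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def prox_elastic_net(tau, gamma, lam, mu):
    """Prox of g = lam * ||.||_1 + (mu/2) * ||.||_2^2, in closed form."""
    return prox_l1(tau, gamma, lam) / (1.0 + gamma * mu)

def proximal_gradient_step(theta, grad_f_theta, gamma, lam):
    """One exact proximal gradient update for F = f + lam * ||.||_1."""
    return prox_l1(theta - gamma * grad_f_theta, gamma, lam)
```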

The perturbed proximal-gradient algorithm. Given a deterministic sequence $\{\gamma_n, n \ge 0\}$,
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} H_{n+1} \right),$$
where $H_{n+1}$ is an approximation of $\nabla f(\theta_n)$.

Algorithm: Monte Carlo-Proximal Gradient for Penalized ML. When the gradient of the log-likelihood is of the form $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(x)\mu(dx)$, the MC-Proximal Gradient algorithm proceeds as follows. Given the current value $\theta_n$:
1. Sample a Markov chain $\{X_{j,n}, j \ge 0\}$ from an MCMC sampler with target distribution $\pi_{\theta_n}\, d\mu$.
2. Set
$$H_{n+1} = \frac{1}{m_{n+1}} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{j,n}).$$
3. Update the current value of the parameter: $\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} H_{n+1} \right)$.
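Putting the pieces together, a sketch of the full loop for an $\ell_1$ penalty, reusing mcmc_gradient_estimate and prox_l1 from the earlier snippets; warm-starting the chain across outer iterations is an implementation choice of this example, not prescribed by the talk.

```python
def mc_proximal_gradient(theta0, H, log_p, x0, n_iter, lam, gamma=0.01, m=200):
    """MC-Proximal Gradient for F = f + lam*||.||_1, where
    grad f(theta) = int H_theta(x) pi_theta(dx) is estimated by MCMC."""
    theta, x = np.asarray(theta0, dtype=float), x0
    for _ in range(n_iter):
        Hn, x = mcmc_gradient_estimate(theta, H, log_p, x, m)  # steps 1-2
        theta = prox_l1(theta - gamma * Hn, gamma, lam)        # step 3
    return theta
```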

Algorithm: Stochastic Approximation-Proximal Gradient for Penalized ML. If, in addition, $H_\theta(x) = \Phi(\theta) + \Psi(\theta) S(x)$, which implies
$$\nabla f(\theta) = \Phi(\theta) + \Psi(\theta) \int S(x)\, \pi_\theta(x)\mu(dx),$$
the SA-Proximal Gradient algorithm proceeds as follows. Given the current value $\theta_n$:
1. Sample a Markov chain $\{X_{j,n}, j \ge 0\}$ from an MCMC sampler with target distribution $\pi_{\theta_n}\, d\mu$.
2. Set $H_{n+1} = \Phi(\theta_n) + \Psi(\theta_n) S_{n+1}$ with
$$S_{n+1} = S_n + \delta_{n+1}\left( \frac{1}{m_{n+1}} \sum_{j=1}^{m_{n+1}} S(X_{j,n}) - S_n \right).$$
3. Update the current value of the parameter: $\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} H_{n+1} \right)$.
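A sketch of this variant; sample_chain is an assumed helper returning m states of an MCMC chain targeting $\pi_\theta$ (it could be built from the Metropolis kernel above), Psi is assumed to return a matrix acting on the statistic, and all names are mine.

```python
def sa_proximal_gradient(theta0, Phi, Psi, S_fun, sample_chain, x0,
                         n_iter, lam, gamma=0.01, delta=0.1, m=1):
    """SA-Proximal Gradient for F = f + lam*||.||_1 when
    H_theta(x) = Phi(theta) + Psi(theta) S(x): the sufficient statistic is
    smoothed across iterations, so even m = 1 draw per iteration works."""
    theta, x, S = np.asarray(theta0, dtype=float), x0, None
    for _ in range(n_iter):
        xs = sample_chain(theta, x, m)                       # step 1
        x = xs[-1]
        S_mc = np.mean([S_fun(xi) for xi in xs], axis=0)
        S = S_mc if S is None else S + delta * (S_mc - S)    # step 2 (Robbins-Monro)
        H = Phi(theta) + Psi(theta) @ S
        theta = prox_l1(theta - gamma * H, gamma, lam)       # step 3
    return theta
```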

(*) Penalized Expectation-Maximization vs Proximal-Gradient. EM (Dempster et al., 1977) is a Majorize-Minimize algorithm for the computation of the ML estimate in a latent variable model. The Proximal Gradient algorithm is a Penalized Generalized-EM algorithm:
$$-g(\theta_{n+1}) + Q(\theta_{n+1}, \theta_n) \ge -g(\theta_n) + Q(\theta_n, \theta_n),$$
where
$$Q(\theta, \theta') \overset{\text{def}}{=} \int \log p_\theta(x)\ \pi_{\theta'}(x)\, d\mu(x), \qquad \pi_{\theta'}(x) \overset{\text{def}}{=} \frac{p_{\theta'}(x)}{\int p_{\theta'}(z)\, d\mu(z)}.$$
MC-Proximal Gradient corresponds to the Penalized Generalized-MCEM algorithm (Wei and Tanner, 1990); SA-Proximal Gradient corresponds to the Penalized Generalized-SAEM algorithm (Delyon et al., 1999).

Numerical illustration (1/3): pharmacokinetics. Choices for the implementation of the algorithm:
- Penalty term: $g(\theta) = \lambda \|\beta\|_1$. How to choose $\lambda$? A weighted norm?
- Step size sequences: constant or vanishing sequence $\{\gamma_n, n \ge 0\}$? (And $\delta_n$ for the SA-Prox Gradient algorithm.)
- Monte Carlo approximation: fixed or increasing batch size?

Numerical illustration (2/3): pharmacokinetics. [Figure: parameter values over 500 iterations, for Proximal MCEM and Proximal SAEM, each with a decreasing and an adaptive step size (four panels; x-axis: iteration, y-axis: parameter value).]

Numerical illustration (3/3): pharmacokinetics. [Figure: Low-dimension simulation setting. Regularization path of the covariate parameters for the clearance (left), absorption constant (middle) and volume of distribution (right) parameters. The black dashed line corresponds to the $\lambda$ value selected by EBIC. Each colored curve corresponds to a covariate.]

Outline:
- Penalized Maximum Likelihood inference in models with intractable likelihood
- Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
- Convergence analysis

Convergence analysis. Problem:
$$\operatorname{argmin}_{\theta \in \Theta} F(\theta) \quad \text{with} \quad F(\theta) = \underbrace{f(\theta)}_{\text{smooth}} + \underbrace{g(\theta)}_{\text{non-smooth}}.$$
The Perturbed Proximal Gradient algorithm: given a $(0, 1/L]$-valued deterministic sequence $\{\gamma_n, n \ge 0\}$, set for $n \ge 0$
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \theta_n - \gamma_{n+1} H_{n+1} \right),$$
where $H_{n+1}$ is an approximation of $\nabla f(\theta_n)$. Which conditions on the sequence $\{\gamma_n, n \ge 0\}$ and on the perturbations $H_{n+1} - \nabla f(\theta_n)$ ensure that this algorithm converges to the minima of F?

The assumptions. For $\operatorname{argmin}_{\theta \in \Theta} F(\theta)$ with $F = f + g$:
- the function $g: \mathbb{R}^d \to [0, +\infty]$ is convex, non-smooth, not identically equal to $+\infty$, and lower semi-continuous;
- the function $f: \mathbb{R}^d \to \mathbb{R}$ is a smooth convex function, i.e. f is continuously differentiable and there exists $L > 0$ such that
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\, \|\theta - \theta'\| \qquad \forall\, \theta, \theta' \in \mathbb{R}^d;$$
- $\Theta \subseteq \mathbb{R}^d$ is the domain of g: $\Theta = \{\theta \in \mathbb{R}^d : g(\theta) < \infty\}$;
- the set $\operatorname{argmin}_\Theta F$ is a non-empty subset of $\Theta$.

Existing results in the literature. Results exist under (some of) the following assumptions:
- $\inf_n \gamma_n > 0$;
- $\sum_n \|H_{n+1} - \nabla f(\theta_n)\| < \infty$;
- i.i.d. Monte Carlo approximation, i.e. unbiased sampling — almost NO results cover biased sampling, such as MCMC;
- non-vanishing stepsize sequence $\{\gamma_n, n \ge 0\}$;
- increasing batch size: when $H_{n+1}$ is a Monte Carlo sum, $H_{n+1} = \frac{1}{m_{n+1}} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{j,n})$, then $\lim_n m_n = +\infty$ at some rate.
References: Combettes (2001) Elsevier Science; Combettes-Wajs (2005) Multiscale Modeling and Simulation; Combettes-Pesquet (2015, 2016) SIAM J. Optim., arXiv; Lin-Rosasco-Villa-Zhou (2015) arXiv; Rosasco-Villa-Vu (2014, 2015) arXiv; Schmidt-Le Roux-Bach (2011) NIPS.

Convergence of the perturbed proximal gradient algorithm: $\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}(\theta_n - \gamma_{n+1} H_{n+1})$ with $H_{n+1} \approx \nabla f(\theta_n)$. Set $\mathcal{L} = \operatorname{argmin}_\Theta (f + g)$ and $\eta_{n+1} = H_{n+1} - \nabla f(\theta_n)$.

Theorem (Atchadé, F., Moulines (2015)). Assume: g convex, lower semi-continuous; f convex, $C^1$, and its gradient is Lipschitz with constant L; $\mathcal{L}$ is non-empty; $\sum_n \gamma_n = +\infty$ and $\gamma_n \in (0, 1/L]$; and convergence of the series
$$\sum_n \gamma_{n+1}^2 \|\eta_{n+1}\|^2, \qquad \sum_n \gamma_{n+1} \eta_{n+1}, \qquad \sum_n \gamma_{n+1} \langle T_n, \eta_{n+1} \rangle,$$
where $T_n = \operatorname{Prox}_{\gamma_{n+1}, g}(\theta_n - \gamma_{n+1} \nabla f(\theta_n))$. Then there exists $\theta_\infty \in \mathcal{L}$ such that $\lim_n \theta_n = \theta_\infty$.

Convergence when $H_{n+1}$ is a Monte Carlo approximation (1/3). In the case
$$H_{n+1} = \frac{1}{m_{n+1}} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{j,n}) \approx \nabla f(\theta_n),$$
where $\{X_{j,n}, j \ge 0\}$ is a non-stationary Markov chain with unique stationary distribution $\pi_{\theta_n}$: for any $n \ge 0$,
$$X_{j+1,n} \mid \text{past} \sim P_{\theta_n}(X_{j,n}, \cdot), \qquad \pi_\theta P_\theta = \pi_\theta.$$
Let us check one of the conditions:
$$\sum_n \gamma_{n+1} \eta_{n+1} = \sum_n \gamma_{n+1} \left( H_{n+1} - \nabla f(\theta_n) \right) = \sum_n \gamma_{n+1} \underbrace{\left\{ H_{n+1} - E[H_{n+1} \mid \mathcal{F}_n] \right\}}_{\text{martingale increment}} + \sum_n \gamma_{n+1} \underbrace{\left( E[H_{n+1} \mid \mathcal{F}_n] - \nabla f(\theta_n) \right)}_{\text{bias, non-vanishing when } m_n = m}.$$
The most technical case is the biased approximation with fixed batch size.

Convergence when $H_{n+1}$ is a Monte Carlo approximation (2/3). Increasing batch size: $\lim_n m_n = +\infty$.
Conditions on the step sizes and batch sizes:
$$\sum_n \gamma_n = +\infty, \qquad \sum_n \frac{\gamma_n^2}{m_n} < \infty; \qquad \sum_n \frac{\gamma_n}{m_n} < \infty \ \text{(biased case)}.$$
Conditions on the Markov kernels: there exist $\lambda \in (0, 1)$, $b < \infty$, $p \ge 2$ and a measurable function $W: X \to [1, +\infty)$ such that
$$\sup_{\theta \in \Theta} \|H_\theta\|_W < \infty, \qquad \sup_{\theta \in \Theta} P_\theta W^p \le \lambda W^p + b.$$
In addition, for any $l \in (0, p]$, there exist $C < \infty$ and $\rho \in (0, 1)$ such that for any $x \in X$,
$$\sup_{\theta \in \Theta} \|P_\theta^n(x, \cdot) - \pi_\theta\|_{W^l} \le C \rho^n W^l(x).$$
Condition on $\Theta$: $\Theta$ is bounded.

Convergence when $H_{n+1}$ is a Monte Carlo approximation (3/3). Fixed batch size: $m_n = m$.
Conditions on the step sizes:
$$\sum_n \gamma_n = +\infty, \qquad \sum_n \gamma_n^2 < \infty, \qquad \sum_n |\gamma_{n+1} - \gamma_n| < \infty.$$
Conditions on the Markov chain: the same as in the increasing-batch-size case, and there exists a constant C such that for any $\theta, \theta' \in \Theta$,
$$\|H_\theta - H_{\theta'}\|_W + \sup_x \frac{\|P_\theta(x, \cdot) - P_{\theta'}(x, \cdot)\|_W}{W(x)} + \|\pi_\theta - \pi_{\theta'}\|_W \le C\, \|\theta - \theta'\|.$$
Condition on the Prox:
$$\sup_{\gamma \in (0, 1/L]} \sup_{\theta \in \Theta} \gamma^{-1} \left\| \operatorname{Prox}_{\gamma, g}(\theta) - \theta \right\| < \infty.$$
Condition on $\Theta$: $\Theta$ is bounded.

Rates of convergence (1/3): the problem. For non-negative weights $a_k$, find an upper bound on
$$\sum_{k=1}^n \frac{a_k}{\sum_{l=1}^n a_l} \left( F(\theta_k) - \min F \right).$$
It provides an upper bound for the cumulative regret ($a_k = 1$), and an upper bound for an averaging strategy when F is convex, since
$$F\left( \sum_{k=1}^n \frac{a_k}{\sum_{l=1}^n a_l}\, \theta_k \right) - \min F \le \sum_{k=1}^n \frac{a_k}{\sum_{l=1}^n a_l} F(\theta_k) - \min F.$$

Rates of convergence (2/3): a deterministic control.
Theorem (Atchadé, F., Moulines (2016)). For any $\theta^* \in \operatorname{argmin}_\Theta F$,
$$\sum_{k=1}^n \frac{a_k}{A_n} \left( F(\theta_k) - \min F \right) \le \frac{a_0 \|\theta_0 - \theta^*\|^2}{2 \gamma_0 A_n} + \frac{1}{2 A_n} \sum_{k=1}^n \left( \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \right) \|\theta_{k-1} - \theta^*\|^2 + \frac{1}{A_n} \sum_{k=1}^n a_k \gamma_k \|\eta_k\|^2 - \frac{1}{A_n} \sum_{k=1}^n a_k \langle T_{k-1} - \theta^*, \eta_k \rangle,$$
where $A_n = \sum_{l=1}^n a_l$, $\eta_k = H_k - \nabla f(\theta_{k-1})$, and $T_k = \operatorname{Prox}_{\gamma_k, g}(\theta_{k-1} - \gamma_k \nabla f(\theta_{k-1}))$.

Rates of convergence (3/3): when $H_{n+1}$ is a Monte Carlo approximation, a bound in $L^q$:
$$\left\| F\left( \frac{1}{n} \sum_{k=1}^n \theta_k \right) - \min F \right\|_{L^q} \le \left\| \frac{1}{n} \sum_{k=1}^n F(\theta_k) - \min F \right\|_{L^q} \le u_n.$$
- $u_n = O(1/\sqrt{n})$ with fixed batch size $m_n = m$ and a (slowly) decaying stepsize $\gamma_n = \gamma / n^a$, $a \in [1/2, 1]$. With averaging: the optimal rate, even with a slowly decaying stepsize $\gamma_n \propto 1/\sqrt{n}$.
- $u_n = O(\ln n / n)$ with increasing batch size $m_n \propto n$ and constant stepsize $\gamma_n = \gamma$. This rate costs $O(n^2)$ Monte Carlo samples in total!
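In practice the averaging above is just a running mean of the iterates; a minimal sketch, where the helper step_fn stands for one perturbed proximal gradient update (an assumption of this example):

```python
def averaged_run(step_fn, theta0, n_iter):
    """Runs n_iter updates and returns the uniform (a_k = 1) average of the
    iterates, the quantity controlled by the L^q bounds above."""
    theta = np.asarray(theta0, dtype=float)
    avg = np.zeros_like(theta)
    for k in range(1, n_iter + 1):
        theta = step_fn(theta)
        avg += (theta - avg) / k     # running mean of theta_1, ..., theta_k
    return avg
```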

Acceleration (1). Let $\{t_n, n \ge 0\}$ be a positive sequence such that $\gamma_{n+1} t_n (t_n - 1) \le \gamma_n t_{n-1}^2$. Nesterov acceleration of the Proximal Gradient algorithm (Nesterov (2004), Tseng (2008), Beck-Teboulle (2009)):
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}, g}\left( \tau_n - \gamma_{n+1} \nabla f(\tau_n) \right), \qquad \tau_{n+1} = \theta_{n+1} + \frac{t_n - 1}{t_{n+1}} (\theta_{n+1} - \theta_n).$$
See also Zhu-Orecchia (2015); Attouch-Peypouquet (2015); Bubeck-Lee-Singh (2015); Su-Boyd-Candès (2015). Rates in the deterministic case: Proximal Gradient, $F(\theta_n) - \min F = O(1/n)$; Accelerated Proximal Gradient, $F(\theta_n) - \min F = O(1/n^2)$.
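A sketch of the deterministic accelerated scheme with the classical FISTA choice of $t_n$ and an $\ell_1$ penalty, reusing prox_l1 from the earlier snippet:

```python
def accelerated_proximal_gradient(theta0, grad_f, gamma, lam, n_iter):
    """Nesterov-accelerated proximal gradient (FISTA) for f + lam*||.||_1,
    with a constant step gamma in (0, 1/L]."""
    theta = np.asarray(theta0, dtype=float)
    tau, t = theta.copy(), 1.0
    for _ in range(n_iter):
        theta_next = prox_l1(tau - gamma * grad_f(tau), gamma, lam)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))   # classical t_n update
        tau = theta_next + ((t - 1.0) / t_next) * (theta_next - theta)
        theta, t = theta_next, t_next
    return theta
```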

Acceleration (2). Aujol-Dossal-F.-Moulines, work in progress. Perturbed Nesterov acceleration: some convergence results. Choose $\gamma_n$, $m_n$, $t_n$ such that
$$\gamma_n \in (0, 1/L], \qquad \lim_n \gamma_n t_n^2 = +\infty, \qquad \sum_n \frac{\gamma_n t_n (1 + \gamma_n t_n)}{m_n} < \infty.$$
Then there exists $\theta_\infty \in \operatorname{argmin}_\Theta F$ such that $\lim_n \theta_n = \theta_\infty$. In addition,
$$F(\theta_{n+1}) - \min F = O\left( \frac{1}{\gamma_{n+1} t_n^2} \right).$$
See also Schmidt-Le Roux-Bach (2011); Dossal-Chambolle (2014); Aujol-Dossal (2015).

Table: control of $F(\theta_n) - \min F$.
  $\gamma_n$          $m_n$    $t_n$   rate          Nbr MC samples
  $\gamma$            $n^3$    $n$     $n^{-2}$      $n^4$
  $\gamma/\sqrt{n}$   $n^2$    $n$     $n^{-3/2}$    $n^3$