
1 Penalized inference in models with intractable likelihood via perturbed proximal-gradient algorithms. Gersende Fort, Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier, Toulouse, France

2 Based on joint works with:
Yves Atchadé (Univ. Michigan, USA) and Eric Moulines (Ecole Polytechnique, France): On Perturbed Proximal-Gradient algorithms (JMLR, 2016).
Edouard Ollier (ENS Lyon, France) and Adeline Samson (Univ. Grenoble Alpes, France): Penalized inference in Mixed Models by Proximal Gradient methods (work in progress).
Jean-François Aujol (IMB, Bordeaux, France) and Charles Dossal (IMB, Bordeaux, France): Acceleration for perturbed Proximal Gradient algorithms (work in progress).


4 Motivation: Pharmacokinetics (1/2). N patients. For patient i, observations {Y_ij, 1 ≤ j ≤ J}: evolution of the concentration at times t_ij, 1 ≤ j ≤ J; initial dose D.
Model: Y_ij = f(t_ij, X_i) + ε_ij, with X_i = Z_i β + d_i ∈ R^L, where ε_ij are i.i.d. N(0, σ²), d_i are i.i.d. N_L(0, Ω) and independent of ε, and Z_i is a known matrix such that each row of X_i has an intercept (fixed effect) and covariates.
Statistical analysis: estimation of (β, σ², Ω) under sparsity constraints on β, and selection of the covariates based on β̂. Penalized Maximum Likelihood.

5 Motivation: Pharmacokinetics (2/2). Model: Y_ij = f(t_ij, X_i) + ε_ij, with X_i = Z_i β + d_i ∈ R^L, where ε_ij are i.i.d. N(0, σ²), d_i are i.i.d. N_L(0, Ω) and independent of ε, and Z_i is a known matrix such that each row of X_i has an intercept (fixed effect) and covariates.
Likelihoods: the distribution of {Y_ij, X_i; 1 ≤ i ≤ N, 1 ≤ j ≤ J} has an explicit expression. The distribution of {Y_ij; 1 ≤ i ≤ N, 1 ≤ j ≤ J} does not: it is the marginal of the previous one.

6 Outline
Penalized Maximum Likelihood inference in models with intractable likelihood
  Example 1: Latent variable models
  Example 2: Discrete graphical model (Markov random field)
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
Convergence analysis

7 Penalized Maximum Likelihood inference with intractable likelihood.
N observations: Y = (Y_1, ..., Y_N). A parametric statistical model indexed by θ ∈ Θ ⊆ R^d, with likelihood L(θ) of the observations. A penalty on the parameter, θ ↦ g(θ), e.g. enforcing sparsity constraints on θ; usually, g is non-smooth and convex.
Goal: computation of θ* ∈ argmin_{θ∈Θ} ( −(1/N) log L(θ) + g(θ) ), when the likelihood L has no closed-form expression and cannot be evaluated.

8 Example 1: Latent variable model. The log-likelihood of the observations Y is of the form θ ↦ log L(θ) with L(θ) = ∫_X p_θ(x) µ(dx), where µ is a positive σ-finite measure on a set X; x are the missing/latent data, and (x, Y) are the complete data.
In these models, the complete likelihood p_θ(x) can be evaluated explicitly, but the likelihood has no closed-form expression. The exact integral could be replaced by a Monte Carlo approximation, but this is known to be inefficient; numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches). What about the gradient of the log-likelihood?


10 Gradient of the likelihood in a latent variable model. log L(θ) = log ∫ p_θ(x) µ(dx). Under regularity conditions, θ ↦ log L(θ) is C¹ and
∇ log L(θ) = ∫ ∇_θ p_θ(x) µ(dx) / ∫ p_θ(z) µ(dz) = ∫ ∇_θ log p_θ(x) [ p_θ(x) / ∫ p_θ(z) µ(dz) ] µ(dx),
where the bracketed ratio is the a posteriori distribution.
The gradient of the log-likelihood,
∇{ (1/N) log L(θ) } = ∫ H_θ(x) π_θ(dx),
is an intractable expectation w.r.t. π_θ, the conditional distribution of the latent variable given the observations Y. For all (x, θ), H_θ(x) can be evaluated.


12 Approximation of the gradient ∇{ (1/N) log L(θ) } = ∫_X H_θ(x) π_θ(dx).
1. Quadrature techniques: poor behavior w.r.t. the dimension of X.
2. Use i.i.d. samples from π_θ to define a Monte Carlo approximation: not possible, in general.
3. Use m samples from a non-stationary Markov chain {X_{j,θ}, j ≥ 0} with unique stationary distribution π_θ, and define a Monte Carlo approximation; MCMC samplers provide such a chain.
Stochastic approximation of the gradient: a biased approximation, E[h(X_{j,θ})] ≠ ∫ h(x) π_θ(dx). If the Markov chain is ergodic enough, the bias vanishes as j → ∞.
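Option 3 can be sketched in a few lines. The example below is hypothetical and not from the talk: a one-dimensional latent-variable model x ∼ N(θ, 1), y | x ∼ N(x, 1), for which π_θ is the posterior of x given y and H_θ(x) = ∇_θ log p_θ(x) = x − θ. A random-walk Metropolis chain supplies the (biased) Monte Carlo average; the model is chosen so the exact gradient, ∇ log L(θ) = (y − θ)/2, is available as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model, for illustration only:
# complete likelihood p_theta(x) = N(x; theta, 1) * N(y; x, 1), one observation y.
y = 1.5

def log_p(theta, x):
    # complete log-likelihood (up to an additive constant)
    return -0.5 * (x - theta) ** 2 - 0.5 * (y - x) ** 2

def H(theta, x):
    # gradient of the complete log-likelihood w.r.t. theta
    return x - theta

def mcmc_gradient(theta, m=20000, x0=0.0, step=1.0):
    """Biased Monte Carlo estimate of grad log L(theta): average H_theta
    along a random-walk Metropolis chain targeting pi_theta(x) ∝ p_theta(x)."""
    x, total = x0, 0.0
    for _ in range(m):
        prop = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_p(theta, prop) - log_p(theta, x):
            x = prop
        total += H(theta, x)
    return total / m

# for this toy model, grad log L(theta) = (y - theta)/2, i.e. 0.75 at theta = 0
est = mcmc_gradient(0.0)
```

Only evaluations of the complete likelihood p_θ are needed; neither L(θ) nor π_θ's normalizing constant ever appears.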

13 Example 2: Discrete graphical model (Markov random field). N independent observations of an undirected graph with p nodes; each node takes values in a finite alphabet X. N i.i.d. observations Y_i in X^p with distribution, for y = (y_1, ..., y_p),
π_θ(y) := (1/Z_θ) exp( Σ_{k=1}^p θ_kk B(y_k, y_k) + Σ_{1≤j<k≤p} θ_kj B(y_k, y_j) ) = (1/Z_θ) exp( ⟨θ, B(y)⟩ ),
where B is a symmetric function and θ a symmetric p×p matrix. The normalizing constant (partition function) Z_θ cannot be computed: it is a sum over |X|^p terms.


15 Likelihood and its gradient in a Markov random field. Likelihood of the form (scalar product between matrices = Frobenius inner product)
(1/N) log L(θ) = ⟨θ, (1/N) Σ_{i=1}^N B(Y_i)⟩ − log Z_θ.
The likelihood is intractable.
Gradient of the form
∇{ (1/N) log L(θ) } = (1/N) Σ_{i=1}^N B(Y_i) − ∫_{X^p} B(y) π_θ(y) µ(dy), with π_θ(y) := (1/Z_θ) exp( ⟨θ, B(y)⟩ ).
The gradient of the log-likelihood is intractable.

16 Approximation of the gradient ∇{ (1/N) log L(θ) } = (1/N) Σ_{i=1}^N B(Y_i) − ∫_{X^p} B(y) π_θ(y) µ(dy).
The Gibbs measure π_θ(y) := (1/Z_θ) exp( ⟨θ, B(y)⟩ ) is known up to the constant Z_θ. Exact sampling from π_θ can be approximated by MCMC samplers (Gibbs-type samplers such as Swendsen-Wang, ...). A biased approximation of the gradient is available.
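The intractable moment ∫ B(y) π_θ(y) µ(dy) can be estimated by a single-site Gibbs sampler. A minimal sketch under hypothetical choices (Ising-type model with alphabet {−1, +1}, B(u, v) = uv, p = 4 nodes, made-up interaction matrix θ); only ratios of π_θ are used, so Z_θ never has to be evaluated.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
# hypothetical symmetric interaction matrix (zero diagonal), for illustration only
theta = 0.3 * (np.ones((p, p)) - np.eye(p))

def gibbs_mean_B(theta, n_sweeps=2000):
    """Estimate E_pi[B(y)], with B(y)_{kj} = y_k * y_j, under the Gibbs measure
    pi_theta(y) ∝ exp(<theta, B(y)>) on {-1,+1}^p, by single-site Gibbs sampling."""
    p = theta.shape[0]
    y = rng.choice([-1.0, 1.0], size=p)
    acc = np.zeros((p, p))
    for _ in range(n_sweeps):
        for k in range(p):
            # conditional log-odds of y_k = +1 given the other sites is 2 * field
            field = sum(theta[k, j] * y[j] for j in range(p) if j != k)
            y[k] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1.0
        acc += np.outer(y, y)          # accumulate B(y)
    return acc / n_sweeps

B_hat = gibbs_mean_B(theta)
```

With positive couplings the off-diagonal estimates are positive, and the diagonal equals 1 since y_k² = 1; the ergodic average is exactly the biased Monte Carlo approximation the slide refers to.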

17 To summarize. Problem: argmin_{θ∈Θ} F(θ), with F(θ) = f(θ) + g(θ) and θ ∈ Θ ⊆ R^d, where
g is a convex, non-smooth, explicit function;
f is C¹, with an intractable gradient of the form ∇f(θ) = ∫ H_θ(x) π_θ(dx), which can be approximated by biased Monte Carlo techniques;
∇f is Lipschitz: there exists L > 0 such that, for all θ, θ', ‖∇f(θ) − ∇f(θ')‖ ≤ L ‖θ − θ'‖;
f is not necessarily convex.

18 Outline
Penalized Maximum Likelihood inference in models with intractable likelihood
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
  Algorithms
  Numerical illustration
Convergence analysis


20 The Proximal-Gradient algorithm (1/2). argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), f smooth and g non-smooth.
The Proximal Gradient algorithm: θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) ), where
Prox_{γ,g}(τ) := argmin_{θ∈Θ} ( g(θ) + (1/(2γ)) ‖θ − τ‖² ).
Proximal map: Moreau (1962). Proximal Gradient algorithm: Beck-Teboulle (2010); Combettes-Pesquet (2011); Parikh-Boyd (2013).
A generalization of the gradient algorithm to a composite objective function. An iterative algorithm (of Majorize-Minimize type) which produces a sequence {θ_n, n ≥ 0} such that F(θ_{n+1}) ≤ F(θ_n).
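When ∇f is exactly computable, the iteration is a few lines of code. A minimal sketch on a hypothetical lasso instance (f(θ) = ½‖Aθ − b‖², g = λ‖θ‖₁, whose prox is soft-thresholding); the data A, b and the value of λ are made up for illustration.

```python
import numpy as np

def soft_threshold(u, t):
    """Closed-form prox of t*||.||_1 (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(grad_f, prox_g, theta0, gamma, n_iter=1000):
    """theta_{n+1} = Prox_{gamma,g}(theta_n - gamma * grad f(theta_n))."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = prox_g(theta - gamma * grad_f(theta), gamma)
    return theta

# Hypothetical lasso problem: f(theta) = 0.5*||A theta - b||^2, g = lam*||.||_1
rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = A @ np.array([1.0, -2.0, 0.0, 0.0, 0.5])
lam = 0.5
grad_f = lambda th: A.T @ (A @ th - b)
L = np.linalg.norm(A.T @ A, 2)                 # Lipschitz constant of grad f
prox_g = lambda u, g: soft_threshold(u, g * lam)
theta_hat = proximal_gradient(grad_f, prox_g, np.zeros(5), 1.0 / L)
```

The fixed points of the iteration are exactly the minimizers of F, which is what the residual ‖θ − Prox_{γ,g}(θ − γ∇f(θ))‖ measures.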

21 The Proximal-Gradient algorithm (2/2). argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), f smooth and g non-smooth.
The Proximal Gradient algorithm: θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) ), where Prox_{γ,g}(τ) := argmin_{θ∈Θ} ( g(θ) + (1/(2γ)) ‖θ − τ‖² ).
About the Prox step:
when g = 0, Prox(τ) = τ;
when g is the indicator function of a compact convex set, Prox is the Euclidean projection onto that set and the algorithm is the projected gradient;
in some cases, Prox is explicit (e.g. the elastic-net penalty); otherwise it is computed by numerical approximation: θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) ) + ε_{n+1}.
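The three explicit cases listed above can be written down directly. A short sketch (the box bounds and elastic-net weights are hypothetical placeholders):

```python
import numpy as np

def prox_zero(u, gamma):
    # g = 0: the prox is the identity, and the scheme is plain gradient descent
    return u

def prox_box(u, gamma, lo=-1.0, hi=1.0):
    # g = indicator of the box [lo, hi]^d: the prox is the Euclidean projection,
    # and the scheme becomes the projected-gradient algorithm
    return np.clip(u, lo, hi)

def prox_elastic_net(u, gamma, lam1=1.0, lam2=0.1):
    # g = lam1*||.||_1 + (lam2/2)*||.||^2: explicit prox
    # (soft-threshold, then shrink by 1/(1 + gamma*lam2))
    return np.sign(u) * np.maximum(np.abs(u) - gamma * lam1, 0.0) / (1.0 + gamma * lam2)
```

Note that gamma enters every prox, consistent with Prox_{γ,g}; for the first two it simply has no effect.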

22 The perturbed proximal-gradient algorithm. Given a deterministic sequence {γ_n, n ≥ 0},
θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} ),
where H_{n+1} is an approximation of ∇f(θ_n).

23 Algorithm: Monte Carlo-Proximal Gradient for Penalized ML. When the gradient of the log-likelihood is of the form ∇f(θ) = ∫ H_θ(x) π_θ(x) µ(dx):
The MC-Proximal Gradient algorithm. Given the current value θ_n,
1. Sample a Markov chain {X_{j,n}, j ≥ 0} from an MCMC sampler with target distribution π_{θ_n} dµ.
2. Set H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}).
3. Update the current value of the parameter: θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} ).
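The three steps fit in one loop. A minimal sketch on a hypothetical one-dimensional toy model (x ∼ N(θ, 1), y | x ∼ N(x, 1), so H_θ(x) = x − θ and π_θ is the posterior of x given y), with g = λ|θ|; all numerical choices (λ, step sizes, batch size) are illustrative, not the talk's. For this model the penalized MLE is known in closed form, θ* = y − 2λ.

```python
import numpy as np

rng = np.random.default_rng(3)
y_obs = 1.5   # single observation of the hypothetical toy model

def log_p(theta, x):
    # complete log-likelihood: x ~ N(theta,1), y_obs | x ~ N(x,1)
    return -0.5 * (x - theta) ** 2 - 0.5 * (y_obs - x) ** 2

def mc_prox_gradient(theta0=0.0, lam=0.1, n_iter=200, m=200):
    """MC-Proximal Gradient: at each iteration run a short Metropolis chain
    targeting pi_{theta_n}, average H_theta(x) = x - theta, then take a
    prox-gradient step with g = lam*|.| (soft-thresholding)."""
    theta, x = theta0, 0.0                       # chain warm-started across iterations
    for n in range(1, n_iter + 1):
        gamma = 0.5 / np.sqrt(n)                 # decaying step size
        H = 0.0
        for _ in range(m):                       # biased MCMC estimate of grad log L
            prop = x + rng.standard_normal()
            if np.log(rng.random()) < log_p(theta, prop) - log_p(theta, x):
                x = prop
            H += (x - theta)
        H /= m
        u = theta + gamma * H                    # descent step on -log L
        theta = np.sign(u) * max(abs(u) - gamma * lam, 0.0)
    return theta

theta_hat = mc_prox_gradient()   # penalized MLE here is y_obs - 2*lam = 1.3
```

Warm-starting the chain at the last state of the previous iteration is what makes the small fixed batch size workable in practice.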

24 Algorithm: Stochastic Approximation-Proximal Gradient for Penalized ML. If in addition H_θ(x) = Φ(θ) + Ψ(θ) S(x), which implies
∇f(θ) = Φ(θ) + Ψ(θ) ∫ S(x) π_θ(x) µ(dx):
The SA-Proximal Gradient algorithm. Given the current value θ_n,
1. Sample a Markov chain {X_{j,n}, j ≥ 0} from an MCMC sampler with target distribution π_{θ_n} dµ.
2. Set H_{n+1} = Φ(θ_n) + Ψ(θ_n) S_{n+1}, with S_{n+1} = S_n + δ_{n+1} ( (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} S(X_{j,n}) − S_n ).
3. Update the current value of the parameter: θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} ).
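In the SA variant, the running average of the sufficient statistic replaces the large Monte Carlo batch, so a single draw per iteration can suffice. A sketch on a hypothetical toy model (x ∼ N(θ, 1), y | x ∼ N(x, 1)), where H_θ(x) = x − θ fits the required form with Φ(θ) = −θ, Ψ(θ) = 1 and S(x) = x; all tuning constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
y_obs = 1.5   # single observation of the hypothetical toy model

def log_p(theta, x):
    # complete log-likelihood: x ~ N(theta,1), y_obs | x ~ N(x,1)
    return -0.5 * (x - theta) ** 2 - 0.5 * (y_obs - x) ** 2

def sa_prox_gradient(theta0=0.0, lam=0.1, n_iter=2000):
    """SA-Proximal Gradient with m_n = 1: one Metropolis move per iteration,
    Robbins-Monro averaging of S(x) = x, then a prox step with g = lam*|.|."""
    theta, x, S = theta0, 0.0, 0.0
    for n in range(1, n_iter + 1):
        prop = x + rng.standard_normal()          # one MCMC draw (m_{n+1} = 1)
        if np.log(rng.random()) < log_p(theta, prop) - log_p(theta, x):
            x = prop
        delta = 1.0 / n ** 0.6                    # SA step for the statistic
        S += delta * (x - S)                      # S_{n+1} = S_n + delta*(S(X) - S_n)
        H = -theta + S                            # Phi(theta) + Psi(theta)*S_{n+1}
        gamma = 0.5 / np.sqrt(n)
        u = theta + gamma * H
        theta = np.sign(u) * max(abs(u) - gamma * lam, 0.0)
    return theta

theta_hat = sa_prox_gradient()   # penalized MLE here is y_obs - 2*lam = 1.3
```

This is the same coupling of a single-draw MCMC step with an averaged statistic that underlies SAEM-type schemes.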


26 (*) Penalized Expectation-Maximization vs Proximal Gradient. EM (Dempster et al., 1977) is a Majorize-Minimize algorithm for the computation of the ML estimate in a latent variable model. The Proximal Gradient algorithm is a Penalized Generalized-EM algorithm:
g(θ_{n+1}) − Q(θ_{n+1}, θ_n) ≤ g(θ_n) − Q(θ_n, θ_n),
where Q(θ, θ') := ∫ log p_θ(x) π_{θ'}(x) dµ(x) and π_θ(x) := p_θ(x) / ∫ p_θ(z) dµ(z).
MC-Proximal Gradient corresponds to the Penalized Generalized-MCEM algorithm, Wei and Tanner (1990). SA-Proximal Gradient corresponds to the Penalized Generalized-SAEM algorithm, Delyon et al. (1999).

27 Numerical illustration (1/3): pharmacokinetics. For the implementation of the algorithm:
Penalty term: g(θ) = λ ‖β‖₁. How to choose λ? A weighted norm?
Step-size sequences: constant or vanishing sequence {γ_n, n ≥ 0}? (and {δ_n} for the SA-Prox Gradient algorithm)
Monte Carlo approximation: fixed or increasing batch size?

28 Numerical illustration (2/3): pharmacokinetics. [Figure: parameter value vs. iteration for four variants: Proximal MCEM with decreasing step size, Proximal MCEM with adaptive step size, Proximal SAEM with decreasing step size, Proximal SAEM with adaptive step size.]

29 Numerical illustration (3/3): pharmacokinetics. [Figure: Low-dimension simulation setting. Regularization path of the covariate parameters for the clearance (left), absorption constant (middle) and volume of distribution (right) parameters. The black dashed line corresponds to the λ value selected by EBIC; each colored curve corresponds to a covariate.]

30 Outline
Penalized Maximum Likelihood inference in models with intractable likelihood
Numerical methods for Penalized ML in such models: Perturbed Proximal Gradient algorithms
Convergence analysis

31 Problem: argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), f smooth and g non-smooth.
The Perturbed Proximal Gradient algorithm: given a (0, 1/L]-valued deterministic sequence {γ_n, n ≥ 0}, set for n ≥ 0
θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} ),
where H_{n+1} is an approximation of ∇f(θ_n).
Which conditions on the sequence {γ_n, n ≥ 0} and on the perturbations H_{n+1} − ∇f(θ_n) ensure that this algorithm converges to the minima of F?

32 The assumptions. argmin_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), where
the function g: R^d → [0, +∞] is convex, non-smooth, not identically equal to +∞, and lower semi-continuous;
the function f: R^d → R is smooth and convex, i.e. f is continuously differentiable and there exists L > 0 such that ‖∇f(θ) − ∇f(θ')‖ ≤ L ‖θ − θ'‖ for all θ, θ' ∈ R^d;
Θ ⊆ R^d is the domain of g: Θ = {θ ∈ R^d : g(θ) < ∞};
the set argmin_Θ F is a non-empty subset of Θ.

33 Existing results in the literature. There exist results under (some of) the following assumptions:
inf_n γ_n > 0 and Σ_n ‖H_{n+1} − ∇f(θ_n)‖ < ∞;
i.i.d. Monte Carlo approximation, i.e. results for unbiased sampling — almost no results cover biased sampling, such as MCMC;
non-vanishing step-size sequence {γ_n, n ≥ 0};
increasing batch size: when H_{n+1} is a Monte Carlo sum, H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}), then lim_n m_n = +∞ at some rate.
Combettes (2001), Elsevier Science; Combettes-Wajs (2005), Multiscale Modeling and Simulation; Combettes-Pesquet (2015, 2016), SIAM J. Optim., arXiv; Lin-Rosasco-Villa-Zhou (2015), arXiv; Rosasco-Villa-Vu (2014, 2015), arXiv; Schmidt-Le Roux-Bach (2011), NIPS.

34 Convergence of the perturbed proximal gradient algorithm. θ_{n+1} = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} H_{n+1} ), with H_{n+1} ≈ ∇f(θ_n). Set L := argmin_Θ (f + g) and η_{n+1} := H_{n+1} − ∇f(θ_n).
Theorem (Atchadé, F., Moulines (2015)). Assume: g convex, lower semi-continuous; f convex, C¹, with L-Lipschitz gradient; L non-empty; Σ_n γ_n = +∞ and γ_n ∈ (0, 1/L]; and convergence of the series
Σ_n γ_{n+1}² ‖η_{n+1}‖², Σ_n γ_{n+1} η_{n+1}, Σ_n γ_{n+1} ⟨T_n, η_{n+1}⟩,
where T_n = Prox_{γ_{n+1}, g}( θ_n − γ_{n+1} ∇f(θ_n) ). Then there exists θ_∞ ∈ L such that lim_n θ_n = θ_∞.



37 Convergence: when H_{n+1} is a Monte Carlo approximation (1/3). In the case
∇f(θ_n) ≈ H_{n+1} = (1/m_{n+1}) Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{j,n}),
where {X_{j,n}, j ≥ 0} is a non-stationary Markov chain with unique stationary distribution π_{θ_n}: for any n ≥ 0, X_{j+1,n} | past ∼ P_{θ_n}(X_{j,n}, ·), with π_θ P_θ = π_θ. Let us check one of the conditions:
Σ_n γ_{n+1} η_{n+1} = Σ_n γ_{n+1} ( H_{n+1} − ∇f(θ_n) )
= Σ_n γ_{n+1} { H_{n+1} − E[H_{n+1} | F_n] } + Σ_n γ_{n+1} { E[H_{n+1} | F_n] − ∇f(θ_n) }
= Σ_n γ_{n+1} (martingale increment) + Σ_n γ_{n+1} (bias, non-vanishing when m_n = m).
The most technical case is the biased approximation with fixed batch size.

38 Convergence: when H_{n+1} is a Monte Carlo approximation (2/3). Increasing batch size: lim_n m_n = +∞.
Conditions on the step sizes and batch sizes: Σ_n γ_n = +∞, Σ_n γ_n²/m_n < ∞; and Σ_n γ_n/m_n < ∞ (biased case).
Conditions on the Markov kernels: there exist λ ∈ (0, 1), b < ∞, p ≥ 2 and a measurable function W: X → [1, +∞) such that
sup_{θ∈Θ} ‖H_θ‖_W < ∞, sup_{θ∈Θ} P_θ W^p ≤ λ W^p + b.
In addition, for any l ∈ (0, p], there exist C < ∞ and ρ ∈ (0, 1) such that, for any x ∈ X,
sup_{θ∈Θ} ‖P_θ^n(x, ·) − π_θ‖_{W^l} ≤ C ρ^n W^l(x).
Condition on Θ: Θ is bounded.

39 Convergence: when H_{n+1} is a Monte Carlo approximation (3/3). Fixed batch size: m_n = m.
Conditions on the step sizes: Σ_n γ_n = +∞, Σ_n γ_n² < ∞, Σ_n |γ_{n+1} − γ_n| < ∞.
Condition on the Markov chain: same as in the increasing-batch-size case, and there exists a constant C such that, for any θ, θ' ∈ Θ,
‖H_θ − H_θ'‖_W + sup_x ( ‖P_θ(x, ·) − P_θ'(x, ·)‖_W / W(x) ) + ‖π_θ − π_θ'‖_W ≤ C ‖θ − θ'‖.
Condition on the Prox: sup_{γ∈(0,1/L]} sup_{θ∈Θ} γ⁻¹ ‖Prox_{γ,g}(θ) − θ‖ < ∞.
Condition on Θ: Θ is bounded.

40 Rates of convergence (1/3): the problem. For non-negative weights a_k, find an upper bound of
Σ_{k=1}^n ( a_k / Σ_{l=1}^n a_l ) ( F(θ_k) − min F ).
It provides an upper bound for the cumulative regret (a_k = 1), and an upper bound for an averaging strategy when F is convex, since
F( Σ_{k=1}^n a_k θ_k / Σ_{l=1}^n a_l ) − min F ≤ Σ_{k=1}^n ( a_k / Σ_{l=1}^n a_l ) ( F(θ_k) − min F ).

41 Rates of convergence (2/3): a deterministic control.
Theorem (Atchadé, F., Moulines (2016)). For any θ* ∈ argmin_Θ F,
Σ_{k=1}^n (a_k / A_n) ( F(θ_k) − min F ) ≤ a_0 ‖θ_0 − θ*‖² / (2 γ_0 A_n) + (1/(2 A_n)) Σ_{k=1}^n ( a_k/γ_k − a_{k−1}/γ_{k−1} ) ‖θ_{k−1} − θ*‖² + (1/A_n) Σ_{k=1}^n a_k γ_k ‖η_k‖² − (1/A_n) Σ_{k=1}^n a_k ⟨T_k − θ*, η_k⟩,
where A_n = Σ_{l=1}^n a_l, η_k = H_k − ∇f(θ_{k−1}), and T_k = Prox_{γ_k, g}( θ_{k−1} − γ_k ∇f(θ_{k−1}) ).

42 Rates (3/3): when H_{n+1} is a Monte Carlo approximation, bound in L^q.
‖ F( (1/n) Σ_{k=1}^n θ_k ) − min F ‖_{L^q} ≤ ‖ (1/n) Σ_{k=1}^n F(θ_k) − min F ‖_{L^q} ≤ u_n.
u_n = O(1/√n), with a fixed batch size m_n = m and a (slowly) decaying step size γ_n = γ/n^a, a ∈ [1/2, 1]. With averaging: optimal rate, even with a slowly decaying step size γ_n ∝ 1/√n.
u_n = O(ln n / n), with an increasing batch size m_n ∝ n and a constant step size γ_n = γ. Rate obtained with O(n²) Monte Carlo samples!

43 Acceleration (1). Let {t_n, n ≥ 0} be a positive sequence such that γ_{n+1} t_n (t_n − 1) ≤ γ_n t_{n−1}².
Nesterov acceleration of the Proximal Gradient algorithm — Nesterov (2004), Tseng (2008), Beck-Teboulle (2009):
θ_{n+1} = Prox_{γ_{n+1}, g}( τ_n − γ_{n+1} ∇f(τ_n) ), τ_{n+1} = θ_{n+1} + ((t_n − 1)/t_{n+1}) (θ_{n+1} − θ_n).
See also Zhu-Orecchia (2015); Attouch-Peypouquet (2015); Bubeck-Lee-Singh (2015); Su-Boyd-Candès (2015).
(Deterministic) Proximal Gradient: F(θ_n) − min F = O(1/n). (Deterministic) Accelerated Proximal Gradient: F(θ_n) − min F = O(1/n²).
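The accelerated scheme differs from the plain proximal gradient only by the extrapolation step. A minimal deterministic sketch with the classical FISTA choice of t_n (which satisfies the condition above when γ is constant), run on a hypothetical lasso instance with made-up data:

```python
import numpy as np

def fista(grad_f, prox_g, theta0, gamma, n_iter=2000):
    """Nesterov-accelerated proximal gradient:
    theta_{n+1} = Prox_{gamma,g}(tau_n - gamma * grad f(tau_n)),
    tau_{n+1}   = theta_{n+1} + (t_n - 1)/t_{n+1} * (theta_{n+1} - theta_n),
    with t_{n+1} = (1 + sqrt(1 + 4 t_n^2))/2."""
    theta, tau, t = theta0.copy(), theta0.copy(), 1.0
    for _ in range(n_iter):
        theta_new = prox_g(tau - gamma * grad_f(tau), gamma)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        tau = theta_new + (t - 1.0) / t_new * (theta_new - theta)
        theta, t = theta_new, t_new
    return theta

# Hypothetical lasso instance: f(theta) = 0.5*||A theta - b||^2, g = lam*||.||_1
rng = np.random.default_rng(5)
A = rng.standard_normal((20, 5))
b = A @ np.array([1.0, -2.0, 0.0, 0.0, 0.5])
lam = 0.5
grad_f = lambda th: A.T @ (A @ th - b)
Lip = np.linalg.norm(A.T @ A, 2)               # Lipschitz constant of grad f
prox = lambda u, g: np.sign(u) * np.maximum(np.abs(u) - g * lam, 0.0)
theta_acc = fista(grad_f, prox, np.zeros(5), 1.0 / Lip)
```

Unlike the plain scheme, the accelerated iteration is not monotone in F, but it reaches the O(1/n²) rate stated on the slide.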

44 Acceleration (2). Aujol-Dossal-F.-Moulines, work in progress. Perturbed Nesterov acceleration: some convergence results.
Choose γ_n, m_n, t_n such that γ_n ∈ (0, 1/L], lim_n γ_n t_n² = +∞, and Σ_n γ_n t_n (1 + γ_n t_n) / m_n < ∞. Then there exists θ_∞ ∈ argmin_Θ F such that lim_n θ_n = θ_∞; in addition,
F(θ_{n+1}) − min F = O( 1 / (γ_n t_n²) ).
Schmidt-Le Roux-Bach (2011); Dossal-Chambolle (2014); Aujol-Dossal (2015).
Table (control of F(θ_n) − min F):
γ_n = γ, m_n = n³, t_n = n: rate 1/n², total number of Monte Carlo samples n⁴.
γ_n = γ/√n, m_n = n², t_n = n: rate 1/n^{3/2}, total number of Monte Carlo samples n³.


More information

VCMC: Variational Consensus Monte Carlo

VCMC: Variational Consensus Monte Carlo VCMC: Variational Consensus Monte Carlo Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015 probabilistic models! sky fog bridge water grass object

More information

Mathematical methods for Image Processing

Mathematical methods for Image Processing Mathematical methods for Image Processing François Malgouyres Institut de Mathématiques de Toulouse, France invitation by Jidesh P., NITK Surathkal funding Global Initiative on Academic Network Oct. 23

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

A Backward Particle Interpretation of Feynman-Kac Formulae

A Backward Particle Interpretation of Feynman-Kac Formulae A Backward Particle Interpretation of Feynman-Kac Formulae P. Del Moral Centre INRIA de Bordeaux - Sud Ouest Workshop on Filtering, Cambridge Univ., June 14-15th 2010 Preprints (with hyperlinks), joint

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

Random Eigenvalue Problems Revisited

Random Eigenvalue Problems Revisited Random Eigenvalue Problems Revisited S Adhikari Department of Aerospace Engineering, University of Bristol, Bristol, U.K. Email: S.Adhikari@bristol.ac.uk URL: http://www.aer.bris.ac.uk/contact/academic/adhikari/home.html

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Learning Gaussian Graphical Models with Unknown Group Sparsity

Learning Gaussian Graphical Models with Unknown Group Sparsity Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation

More information

Notes on pseudo-marginal methods, variational Bayes and ABC

Notes on pseudo-marginal methods, variational Bayes and ABC Notes on pseudo-marginal methods, variational Bayes and ABC Christian Andersson Naesseth October 3, 2016 The Pseudo-Marginal Framework Assume we are interested in sampling from the posterior distribution

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Some Results on the Ergodicity of Adaptive MCMC Algorithms

Some Results on the Ergodicity of Adaptive MCMC Algorithms Some Results on the Ergodicity of Adaptive MCMC Algorithms Omar Khalil Supervisor: Jeffrey Rosenthal September 2, 2011 1 Contents 1 Andrieu-Moulines 4 2 Roberts-Rosenthal 7 3 Atchadé and Fort 8 4 Relationship

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

Sparse Stochastic Inference for Latent Dirichlet Allocation

Sparse Stochastic Inference for Latent Dirichlet Allocation Sparse Stochastic Inference for Latent Dirichlet Allocation David Mimno 1, Matthew D. Hoffman 2, David M. Blei 1 1 Dept. of Computer Science, Princeton U. 2 Dept. of Statistics, Columbia U. Presentation

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Mixed effect model for the spatiotemporal analysis of longitudinal manifold value data

Mixed effect model for the spatiotemporal analysis of longitudinal manifold value data Mixed effect model for the spatiotemporal analysis of longitudinal manifold value data Stéphanie Allassonnière with J.B. Schiratti, O. Colliot and S. Durrleman Université Paris Descartes & Ecole Polytechnique

More information

Mean field simulation for Monte Carlo integration. Part II : Feynman-Kac models. P. Del Moral

Mean field simulation for Monte Carlo integration. Part II : Feynman-Kac models. P. Del Moral Mean field simulation for Monte Carlo integration Part II : Feynman-Kac models P. Del Moral INRIA Bordeaux & Inst. Maths. Bordeaux & CMAP Polytechnique Lectures, INLN CNRS & Nice Sophia Antipolis Univ.

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University

SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some

More information

Manifold Monte Carlo Methods

Manifold Monte Carlo Methods Manifold Monte Carlo Methods Mark Girolami Department of Statistical Science University College London Joint work with Ben Calderhead Research Section Ordinary Meeting The Royal Statistical Society October

More information

Lecture: Smoothing.

Lecture: Smoothing. Lecture: Smoothing http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Smoothing 2/26 introduction smoothing via conjugate

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Semi-Parametric Importance Sampling for Rare-event probability Estimation

Semi-Parametric Importance Sampling for Rare-event probability Estimation Semi-Parametric Importance Sampling for Rare-event probability Estimation Z. I. Botev and P. L Ecuyer IMACS Seminar 2011 Borovets, Bulgaria Semi-Parametric Importance Sampling for Rare-event probability

More information

Quantifying Stochastic Model Errors via Robust Optimization

Quantifying Stochastic Model Errors via Robust Optimization Quantifying Stochastic Model Errors via Robust Optimization IPAM Workshop on Uncertainty Quantification for Multiscale Stochastic Systems and Applications Jan 19, 2016 Henry Lam Industrial & Operations

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Multilevel Sequential 2 Monte Carlo for Bayesian Inverse Problems

Multilevel Sequential 2 Monte Carlo for Bayesian Inverse Problems Jonas Latz 1 Multilevel Sequential 2 Monte Carlo for Bayesian Inverse Problems Jonas Latz Technische Universität München Fakultät für Mathematik Lehrstuhl für Numerische Mathematik jonas.latz@tum.de November

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Maria Ryskina, Yen-Chia Hsu 1 Introduction

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Inference in state-space models with multiple paths from conditional SMC

Inference in state-space models with multiple paths from conditional SMC Inference in state-space models with multiple paths from conditional SMC Sinan Yıldırım (Sabancı) joint work with Christophe Andrieu (Bristol), Arnaud Doucet (Oxford) and Nicolas Chopin (ENSAE) September

More information

Introduction to Restricted Boltzmann Machines

Introduction to Restricted Boltzmann Machines Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative

More information

Restricted Boltzmann Machines

Restricted Boltzmann Machines Restricted Boltzmann Machines Boltzmann Machine(BM) A Boltzmann machine extends a stochastic Hopfield network to include hidden units. It has binary (0 or 1) visible vector unit x and hidden (latent) vector

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology February 2014

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Convex relaxation for Combinatorial Penalties

Convex relaxation for Combinatorial Penalties Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

CONVERGENCE OF A STOCHASTIC APPROXIMATION VERSION OF THE EM ALGORITHM

CONVERGENCE OF A STOCHASTIC APPROXIMATION VERSION OF THE EM ALGORITHM The Annals of Statistics 1999, Vol. 27, No. 1, 94 128 CONVERGENCE OF A STOCHASTIC APPROXIMATION VERSION OF THE EM ALGORITHM By Bernard Delyon, Marc Lavielle and Eric Moulines IRISA/INRIA, Université Paris

More information

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Advanced Monte Carlo integration methods. P. Del Moral (INRIA team ALEA) INRIA & Bordeaux Mathematical Institute & X CMAP

Advanced Monte Carlo integration methods. P. Del Moral (INRIA team ALEA) INRIA & Bordeaux Mathematical Institute & X CMAP Advanced Monte Carlo integration methods P. Del Moral (INRIA team ALEA) INRIA & Bordeaux Mathematical Institute & X CMAP MCQMC 2012, Sydney, Sunday Tutorial 12-th 2012 Some hyper-refs Feynman-Kac formulae,

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

A framework for adaptive Monte-Carlo procedures

A framework for adaptive Monte-Carlo procedures A framework for adaptive Monte-Carlo procedures Jérôme Lelong (with B. Lapeyre) http://www-ljk.imag.fr/membres/jerome.lelong/ Journées MAS Bordeaux Friday 3 September 2010 J. Lelong (ENSIMAG LJK) Journées

More information

A Review of Pseudo-Marginal Markov Chain Monte Carlo

A Review of Pseudo-Marginal Markov Chain Monte Carlo A Review of Pseudo-Marginal Markov Chain Monte Carlo Discussed by: Yizhe Zhang October 21, 2016 Outline 1 Overview 2 Paper review 3 experiment 4 conclusion Motivation & overview Notation: θ denotes the

More information

On Bayesian Computation

On Bayesian Computation On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints

More information

Variational Inference via Stochastic Backpropagation

Variational Inference via Stochastic Backpropagation Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016 Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary Outline Preliminaries Stochastic Backpropagation

More information