Perturbed Proximal Gradient Algorithm
1 Perturbed Proximal Gradient Algorithm. Gersende FORT, LTCI, CNRS, Telecom ParisTech, Université Paris-Saclay, 75013 Paris, France. Large-scale inverse problems and optimization: applications to image processing and astrophysics. Grenoble, November 2015.
2 Introduction. Works in collaboration with Eric Moulines (Professor, Ecole Polytechnique), Yves Atchadé (Assistant Professor, Univ. Michigan, USA), and also Jean-François Aujol (IMB, Univ. Bordeaux), Charles Dossal (IMB, Univ. Bordeaux) and Soukaina Douissi. Y. Atchadé, G. Fort and E. Moulines. On Stochastic Proximal Gradient Algorithms. arXiv:1402.2365 [math.ST].
3 Introduction: Optimization problem. Outline:
- Introduction
  - Optimization problem
  - Proximal-gradient algorithm
  - Intractable proximal-gradient iteration
  - Perturbed proximal gradient
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress
4 Introduction: Optimization problem. Problem: (arg)min_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), where
- Θ is a finite-dimensional Euclidean space with scalar product ⟨·,·⟩ and norm ‖·‖;
- the function f: Θ → R is smooth, i.e. f is continuously differentiable and there exists L > 0 such that ‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖;
- the function g: Θ → (−∞, +∞] is convex, not identically equal to +∞, and lower semi-continuous;
in the case where f(θ) and ∇f are intractable.
5 Introduction: Proximal-gradient algorithm. Classical algorithm when ∇f is tractable (1/3). Since ∇f is Lipschitz, for any u, θ,
f(θ) ≤ f(u) + ⟨∇f(u), θ − u⟩ + (L/2) ‖θ − u‖²,
which yields, for any γ with Lγ ≤ 1,
F(θ) ≤ f(u) + ⟨∇f(u), θ − u⟩ + (1/(2γ)) ‖θ − u‖² + g(θ).
The RHS satisfies: for fixed u, it is an upper bound of θ ↦ F(θ); for θ = u, this upper bound is equal to F(u); for fixed u, it is convex in θ and can be written
C(u) + (1/(2γ)) ‖θ − {u − γ∇f(u)}‖² + g(θ).
6 Introduction: Proximal-gradient algorithm. Classical algorithm when ∇f is tractable (2/3). Denote the upper bound by
Q_γ(θ|u) := C(u) + (1/(2γ)) ‖θ − {u − γ∇f(u)}‖² + g(θ).
Majorization-Minimization (MM) algorithm: define {θ_n, n ≥ 0} iteratively by
θ_{n+1} = argmin_θ Q_γ(θ|θ_n),
or equivalently
θ_{n+1} = Prox_γ(θ_n − γ∇f(θ_n)), with Prox_γ(τ) := argmin_θ { g(θ) + (1/(2γ)) ‖θ − τ‖² },
also called the proximal-gradient algorithm.
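To make the iteration concrete, here is a minimal sketch (not from the talk) of the proximal-gradient update for a least-squares f and the ℓ₁ penalty g = λ‖·‖₁, whose proximal map is soft-thresholding; the function names and the toy data are illustrative assumptions.

```python
import numpy as np

def prox_l1(tau, gamma, lam):
    """Proximal operator of g = lam * ||.||_1: componentwise soft-thresholding at gamma*lam."""
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def proximal_gradient(grad_f, theta0, gamma, lam, n_iter):
    """theta_{n+1} = Prox_gamma(theta_n - gamma * grad f(theta_n))."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = prox_l1(theta - gamma * grad_f(theta), gamma, lam)
    return theta

# Toy example: f(theta) = 0.5 * ||A theta - b||^2, sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = A @ np.array([1.0, -2.0] + [0.0] * 8) + 0.01 * rng.standard_normal(50)
L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad f
theta_hat = proximal_gradient(lambda t: A.T @ (A @ t - b), np.zeros(10),
                              gamma=1.0 / L, lam=0.1, n_iter=500)
```

With the step size γ = 1/L the MM majorization above guarantees monotone decrease of F, and the iterates here recover the sparse support of the ground truth.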
7 Introduction: Proximal-gradient algorithm. Classical algorithm when ∇f is tractable (3/3). The sequence {θ_n, n ≥ 0} is given by θ_{n+1} = argmin_θ Q_γ(θ|θ_n), where the upper bound θ ↦ Q_γ(θ|u) satisfies
F(θ) ≤ Q_γ(θ|u) and F(u) = Q_γ(u|u).
Lyapunov function: F(θ_{n+1}) ≤ F(θ_n), since
F(θ_{n+1}) ≤ Q_γ(θ_{n+1}|θ_n) ≤ Q_γ(θ_n|θ_n) = F(θ_n).
8 Introduction: Intractable proximal-gradient iteration. The exact proximal-gradient algorithm:
θ_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n)),
where {γ_n, n ≥ 0} is a step-size sequence in (0, 1/L].
1. Prox_γ(u) can be intractable (not in this talk).
2. ∇f can be intractable (in this talk).
9 Introduction: Intractable proximal-gradient iteration: explicit proximal operator.
(Projection on C) When g(θ) = 0 if θ ∈ C and +∞ otherwise, where C is closed convex, Prox_γ(τ) = argmin_{θ∈C} ‖τ − θ‖², i.e. the projection of τ on C.
(Elastic net penalty) g(θ) = λ ( (1−α)/2 ‖θ‖² + α ‖θ‖₁ ), componentwise:
(Prox_γ(τ))_i = (τ_i − γλα) / (1 + γλ(1−α)) if τ_i ≥ γλα,
(Prox_γ(τ))_i = (τ_i + γλα) / (1 + γλ(1−α)) if τ_i ≤ −γλα,
(Prox_γ(τ))_i = 0 otherwise.
Proximal-gradient algorithm = thresholded gradient algorithm.
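The componentwise elastic-net formula above fits in a few lines; this is a hedged sketch (the function name and argument order are our own):

```python
import numpy as np

def prox_elastic_net(tau, gamma, lam, alpha):
    """Prox of g(theta) = lam * ((1 - alpha)/2 * ||theta||^2 + alpha * ||theta||_1),
    applied componentwise: soft-threshold at gamma*lam*alpha, then shrink by
    1 / (1 + gamma*lam*(1 - alpha))."""
    shrunk = np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam * alpha, 0.0)
    return shrunk / (1.0 + gamma * lam * (1.0 - alpha))
```

Setting α = 1 recovers the lasso (pure soft-thresholding) and α = 0 recovers the ridge scaling τ / (1 + γλ).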
10 Introduction: Intractable proximal-gradient iteration: intractable gradient ∇f.
1. Unknown function f whose gradient is of the form ∇f(θ) = ∫ H_θ(x) π_θ(dx). In this case,
∇f(θ) ≈ (1/m) Σ_{k=1}^m H_θ(X_k), {X_k, k ≥ 1}: (online) learning, Markov chain Monte Carlo.
11 Introduction: Intractable proximal-gradient iteration: intractable gradient ∇f (cont.).
2. Large-scale optimization: f(θ) = (1/N) Σ_{k=1}^N f_k(θ) with large N. In this case,
∇f(θ) = (1/N) Σ_{k=1}^N ∇f_k(θ) ≈ (1/m) Σ_{k=1}^m ∇f_{I_k}(θ).
12 Introduction: Perturbed proximal gradient. In this talk:
The exact proximal-gradient algorithm: θ_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n)).
The perturbed proximal-gradient algorithm: θ_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}{∇f(θ_n) + η_{n+1}}).
1. Which conditions on γ_n, η_n ensure convergence to the same limiting set as for the exact algorithm?
2. When η_n is a (random) Monte Carlo approximation, which conditions on γ_n, m_n?
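A generic sketch of the perturbed iteration (illustrative; the noise model, the step-size schedule, and the toy quadratic below are our assumptions, not the talk's):

```python
import numpy as np

def perturbed_proximal_gradient(grad_f, prox, theta0, steps, noise):
    """theta_{n+1} = Prox_{gamma_{n+1}}(theta_n - gamma_{n+1} * (grad f(theta_n) + eta_{n+1}))."""
    theta = theta0.copy()
    for gamma in steps:
        eta = noise(theta)                         # perturbation eta_{n+1}
        theta = prox(theta - gamma * (grad_f(theta) + eta), gamma)
    return theta

# Toy check: f(theta) = 0.5 * ||theta - c||^2, g = 0 (prox = identity),
# unbiased Gaussian noise, and gamma_n = n^{-0.7} so that
# sum_n gamma_n = +inf while sum_n gamma_n^2 < +inf.
rng = np.random.default_rng(1)
c = np.array([1.0, -1.0])
theta_inf = perturbed_proximal_gradient(
    grad_f=lambda t: t - c,
    prox=lambda v, gamma: v,                       # g = 0
    theta0=np.array([5.0, 5.0]),
    steps=[k ** -0.7 for k in range(1, 3001)],
    noise=lambda t: 0.1 * rng.standard_normal(2),
)
```

With diverging step sums and square-summable steps, the iterates settle near the minimizer c despite the injected noise, as the convergence theory below predicts for the unbiased case.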
13 Convergence of the (stable) perturbed proximal-gradient algorithm. Outline:
- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
  - Assumptions
  - On the convergence of {θ_n, n ≥ 0}
  - On the convergence of F(θ̄_n)
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress
14 Convergence of the (stable) perturbed proximal-gradient algorithm: Assumptions. (arg)min_{θ∈Θ} F(θ), F(θ) = f(θ) + g(θ).
1. The function g: Θ → (−∞, +∞] is convex, not identically equal to +∞, and lower semi-continuous.
2. The function f: Θ → R is continuously differentiable and there exists L > 0 such that ‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖.
3. The function f is convex and the set L := argmin_θ F(θ) is not empty.
4. The step-size sequence {γ_n, n ≥ 0} is positive with γ_n ∈ (0, 1/L].
15 Convergence of the (stable) perturbed proximal-gradient algorithm: The algorithm. Stable sequence: let K ⊆ int(dom(g)) be a compact subset of Θ such that K ∩ L ≠ ∅. Algorithm:
θ̃_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n) − γ_{n+1}η_{n+1}),
θ_{n+1} = Proj_K(θ̃_{n+1}).
Weighted average sequence: let {a_n, n ≥ 0} be a non-negative sequence and set
θ̄_n = (Σ_{k=1}^n a_k)^{-1} Σ_{k=1}^n a_k θ_k.
16 Convergence of the (stable) perturbed proximal-gradient algorithm: Convergence of {θ_n, n ≥ 0}.
θ̃_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n) − γ_{n+1}η_{n+1}), θ_{n+1} = Proj_K(θ̃_{n+1}).
Theorem (Atchadé, F., Moulines (2015)). If assumptions 1 to 4 hold, Σ_n γ_n = +∞, and the series
Σ_n γ_{n+1}η_{n+1}, Σ_n γ_{n+1}⟨T_{γ_{n+1}}(θ_n), η_{n+1}⟩, Σ_n γ²_{n+1}‖η_{n+1}‖²
converge, then there exists θ∞ ∈ L ∩ K such that lim_n θ_n = lim_n θ̃_n = θ∞, where T_γ(θ) = Prox_γ(θ − γ∇f(θ)).
Includes the convergence analysis for the exact algorithm (η_n ≡ 0), Beck and Teboulle (2009); improves previous results, Combettes and Wajs (2005); Combettes and Pesquet (2014).
17 Convergence of the (stable) perturbed proximal-gradient algorithm: On the convergence of F(θ̄_n). Rates of convergence for {F(θ_n), n ≥ 0}.
θ̃_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n) − γ_{n+1}η_{n+1}), θ_{n+1} = Proj_K(θ̃_{n+1}).
Theorem (Atchadé, F., Moulines (2015)). If assumptions 1 to 4 hold, then for any a_k ≥ 0 and any θ⋆ ∈ L ∩ K, with U_n := Σ_{k=1}^n a_k {F(θ_k) − min F},
U_n ≤ (1/2) Σ_{k=1}^n (a_k/γ_k − a_{k−1}/γ_{k−1}) ‖θ_{k−1} − θ⋆‖² + (a_0/(2γ_0)) ‖θ_0 − θ⋆‖²
  − Σ_{k=1}^n a_k ⟨T_{γ_k}(θ_{k−1}) − θ⋆, η_k⟩ + Σ_{k=1}^n a_k γ_k ‖η_k‖².
Includes the convergence analysis for the exact algorithm (η_{n+1} ≡ 0); extends previous results in the case γ_n = γ, a_n = 1, Schmidt, Le Roux, Bach (2011), where it is assumed Σ_n ‖η_n‖ < ∞.
18 Convergence of the Monte Carlo proximal-gradient algorithm. Outline:
- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
  - Monte Carlo approximation
  - Additional assumptions
  - Convergence of θ_n
  - Convergence of F(θ̄_n)
  - How to choose γ_n, m_n?
- Conclusion, other results and works in progress
19 Convergence of the Monte Carlo proximal-gradient algorithm: Monte Carlo approximation of the gradient. Assume that ∇f(θ) is of the form ∇f(θ) = ∫ H_θ(x) π_θ(dx). Consider a Monte Carlo perturbation
η_{n+1} = (1/m_{n+1}) Σ_{k=1}^{m_{n+1}} H_{θ_n}(X^{(k)}_{n+1}) − ∇f(θ_n).
20 Convergence of the Monte Carlo proximal-gradient algorithm: Monte Carlo approximation of the gradient (cont.). This includes the case:
1. {X^{(1)}_{n+1}, …, X^{(m_{n+1})}_{n+1}} are i.i.d. with distribution π_{θ_n}: E[η_{n+1} | Past_n] = 0 (unbiased approximation).
21 Convergence of the Monte Carlo proximal-gradient algorithm: Monte Carlo approximation of the gradient (cont.). This also includes the case:
2. {X^{(1)}_{n+1}, …, X^{(m_{n+1})}_{n+1}} is a non-stationary Markov chain (e.g. an MCMC path) with invariant distribution π_{θ_n}: E[η_{n+1} | Past_n] ≠ 0 (biased approximation).
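In the unbiased case 1, the perturbation is simply a centered Monte Carlo average; a toy sketch (the choice π_θ = N(θ, I) and H_θ(x) = x, so that ∇f(θ) = θ, is our illustrative assumption, not the talk's model):

```python
import numpy as np

def mc_gradient(theta, m, rng):
    """(1/m) * sum_k H_theta(X^(k)) with X^(1..m) i.i.d. ~ pi_theta = N(theta, I)
    and H_theta(x) = x: an unbiased Monte Carlo estimate of grad f(theta) = theta."""
    samples = theta + rng.standard_normal((m, theta.size))
    return samples.mean(axis=0)

rng = np.random.default_rng(0)
theta = np.array([2.0, -3.0, 0.5])
# The perturbation eta_{n+1}: conditionally centered, variance 1/m per coordinate.
eta = mc_gradient(theta, m=10_000, rng=rng) - theta
```

Drawing the X^(k) from an MCMC kernel targeting π_θ instead of i.i.d. sampling gives the biased case 2, since the chain is not started at stationarity.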
22 Convergence of the Monte Carlo proximal-gradient algorithm: Additional assumptions.
5. The error is of the form η_{n+1} = (1/m_{n+1}) Σ_{k=1}^{m_{n+1}} H_{θ_n}(X^{(k)}_{n+1}) − ∇f(θ_n), where ∇f(θ) = ∫ H_θ(x) π_θ(dx).
6. {X^{(k)}_{n+1}, k ≥ 0} is a Markov chain with transition kernel P_{θ_n}. For all θ, π_θ is invariant for P_θ.
7. The kernels {P_θ, θ ∈ Θ} are geometrically ergodic uniformly in θ (aperiodic, phi-irreducible, uniform-in-θ geometric drift inequalities w.r.t. W^p where p ≥ 2, level sets of W^p are small): there exists p ≥ 2 such that for any l ∈ (0, p] there exist C > 0, ρ ∈ (0, 1) with
sup_{θ∈K} ‖P^n_θ(x, ·) − π_θ‖_{W^l} ≤ C ρ^n W^l(x).
This is a trivial condition in the i.i.d. case; there exist many sufficient conditions for the Markov case when samples are drawn from MCMC samplers.
23 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of θ_n when m_n → ∞.
η_{n+1} = (1/m_{n+1}) Σ_{k=1}^{m_{n+1}} H_{θ_n}(X^{(k)}_{n+1}) − ∇f(θ_n).
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7, Σ_n γ_n = +∞ and Σ_n γ²_{n+1} m^{-1}_{n+1} < ∞. If the approximation is biased, assume also Σ_n γ_{n+1} m^{-1}_{n+1} < ∞. Then, with probability one, there exists θ∞ ∈ L ∩ K such that lim_n θ_n = lim_n θ̃_n = θ∞.
24 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of θ_n when m_n → ∞ (cont.). The key ingredient of the proof is the control (F. and Moulines (2003)), for p ≥ 2, w.p. 1:
‖E[η_{n+1} | F_n]‖ ≤ C m^{-1}_{n+1} W(X^{(m_n)}_n), E[‖η_{n+1}‖^p | F_n] ≤ C m^{-p/2}_{n+1} W^p(X^{(m_n)}_n),
together with the decomposition
η_{n+1} = (η_{n+1} − E[η_{n+1} | F_n]) + E[η_{n+1} | F_n] = martingale increment + bias.
25 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of θ_n when m_n = m.
η_{n+1} = (1/m) Σ_{k=1}^{m} H_{θ_n}(X^{(k)}_{n+1}) − ∇f(θ_n).
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7, Σ_n γ_{n+1} = +∞ and Σ_n γ²_{n+1} < ∞. If the approximation is biased, assume also:
- there exists a constant C such that for any θ, θ′ ∈ K,
  ‖H_θ − H_{θ′}‖_W + ‖P_θ − P_{θ′}‖_W + ‖π_θ − π_{θ′}‖_W ≤ C ‖θ − θ′‖;
- sup_{γ∈(0,1/L]} sup_{θ∈K} γ^{-1} ‖Prox_γ(θ) − θ‖ < ∞;
- Σ_n |γ_{n+1} − γ_n| < ∞.
Then, with probability one, there exists θ∞ ∈ L ∩ K such that lim_n θ_n = lim_n θ̃_n = θ∞.
26 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of F(θ̄_n) when m_n → ∞.
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7. For any q ∈ (1, p/2], there exists C > 0 such that
‖ Σ_{k=1}^n a_k {F(θ_k) − min F} ‖_{L^q} ≤ C ( a_0/γ_0 + Σ_{k=1}^n |a_k/γ_k − a_{k−1}/γ_{k−1}| + (Σ_{k=1}^n a²_k m^{-1}_{k+1})^{1/2} + Σ_{k=1}^n a_k (γ_k + υ) m^{-1}_{k+1} ),
and
Σ_{k=1}^n a_k {E[F(θ_k)] − min F} ≤ C ( a_0/γ_0 + Σ_{k=1}^n |a_k/γ_k − a_{k−1}/γ_{k−1}| + Σ_{k=1}^n a_k (γ_k + υ) m^{-1}_k ),
where υ = 0 if the Monte Carlo approximation is unbiased and υ = 1 otherwise.
27 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of F(θ̄_n) when m_n = m.
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7. For any q ∈ (1, p/2], there exists C > 0 such that
‖ Σ_{k=1}^n a_k {F(θ_k) − min F} ‖_{L^q} ≤ C ( a_0/γ_0 + Σ_{k=1}^n |a_k/γ_k − a_{k−1}/γ_{k−1}| + (Σ_{k=1}^n a²_k)^{1/2} + Σ_{k=1}^n a_k γ_k + υ Σ_{k=1}^n |a_{k+1} − a_k| ),
and
Σ_{k=1}^n a_k {E[F(θ_k)] − min F} ≤ C ( a_0/γ_0 + Σ_{k=1}^n |a_k/γ_k − a_{k−1}/γ_{k−1}| + Σ_{k=1}^n a_k γ_k + υ Σ_{k=1}^n |a_{k+1} − a_k| ),
where υ = 0 if the Monte Carlo approximation is unbiased and υ = 1 otherwise.
28 Convergence of the Monte Carlo proximal-gradient algorithm: How to choose γ_n, m_n? Fixed or increasing batch size m_n? Fixed or decreasing step size γ_n? Consider the L^q convergence rate of
(Σ_{k=1}^n a_k)^{-1} ‖ Σ_{k=1}^n a_k F(θ_k) − F(θ⋆) ‖_{L^q}.
Increasing batch size: with γ_n = γ, m_n ∝ n, a_n = 1: rate O(ln n / n); complexity (rate as a function of the total number of Monte Carlo draws) O(ln n / √n).
Fixed batch size m_n = m: with γ_n ∝ γ/√n, a_n = 1 or a_n = γ_n: rate O(1/√n); complexity O(1/√n).
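The complexity comparison can be made concrete by counting the total number of Monte Carlo draws after n iterations under each batch schedule (a trivial sketch; the specific schedules are the illustrative choices from the slide):

```python
def total_draws(batch_size, n):
    """Total number of Monte Carlo samples after n iterations,
    for a batch-size schedule k -> m_k."""
    return sum(batch_size(k) for k in range(1, n + 1))

n = 1000
draws_increasing = total_draws(lambda k: k, n)   # m_k = k: O(n^2) draws, rate O(ln n / n)
draws_fixed = total_draws(lambda k: 1, n)        # m_k = m: O(n) draws, rate O(1 / sqrt(n))
```

In both regimes, N total draws buy an accuracy of order roughly 1/√N (up to logarithmic factors), which is why neither schedule dominates in terms of complexity.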
29 Conclusion, other results and works in progress. Outline:
- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress
30 Conclusion. Contributions:
a) Results NOT restricted to the strongly convex case.
b) Sufficient conditions for the convergence of perturbed proximal-gradient algorithms.
c) Case of Monte Carlo approximations, biased or unbiased, increasing or fixed batch size.
Major contributions:
a) for Monte Carlo approximations,
b) biased approximations,
c) fixed batch size.
31 Other results, works in progress and future works:
a) When f is not convex.
b) Accelerations (Nesterov, …).
c) Convergence of the Proximal Stochastic Approximation Expectation Maximization algorithm: for the maximization of a penalized likelihood in latent variable models, using a generalization of the SAEM algorithm.
d) Rates of convergence: explicit controls.
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine
More informationLarge-scale machine learning and convex optimization
Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE IFCAM, Bangalore - July 2014 Big data revolution? A new scientific
More informationSampling multimodal densities in high dimensional sampling space
Sampling multimodal densities in high dimensional sampling space Gersende FORT LTCI, CNRS & Telecom ParisTech Paris, France Journées MAS Toulouse, Août 4 Introduction Sample from a target distribution
More informationLecture 10. Theorem 1.1 [Ergodicity and extremality] A probability measure µ on (Ω, F) is ergodic for T if and only if it is an extremal point in M.
Lecture 10 1 Ergodic decomposition of invariant measures Let T : (Ω, F) (Ω, F) be measurable, and let M denote the space of T -invariant probability measures on (Ω, F). Then M is a convex set, although
More informationIntroduction to Restricted Boltzmann Machines
Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,
More informationStochastic Gradient Descent with Variance Reduction
Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction
More informationTowards stability and optimality in stochastic gradient descent
Towards stability and optimality in stochastic gradient descent Panos Toulis, Dustin Tran and Edoardo M. Airoldi August 26, 2016 Discussion by Ikenna Odinaka Duke University Outline Introduction 1 Introduction
More informationA framework for adaptive Monte-Carlo procedures
A framework for adaptive Monte-Carlo procedures Jérôme Lelong (with B. Lapeyre) http://www-ljk.imag.fr/membres/jerome.lelong/ Journées MAS Bordeaux Friday 3 September 2010 J. Lelong (ENSIMAG LJK) Journées
More informationLarge-scale machine learning and convex optimization
Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Allerton Conference - September 2015 Slides available at www.di.ens.fr/~fbach/gradsto_allerton.pdf
More informationAgenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples
Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method
More informationLearning Energy-Based Models of High-Dimensional Data
Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal
More informationFAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč
FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom
More informationBandit Convex Optimization
March 7, 2017 Table of Contents 1 (BCO) 2 Projection Methods 3 Barrier Methods 4 Variance reduction 5 Other methods 6 Conclusion Learning scenario Compact convex action set K R d. For t = 1 to T : Predict
More informationStochastic Dynamic Programming: The One Sector Growth Model
Stochastic Dynamic Programming: The One Sector Growth Model Esteban Rossi-Hansberg Princeton University March 26, 2012 Esteban Rossi-Hansberg () Stochastic Dynamic Programming March 26, 2012 1 / 31 References
More informationA regeneration proof of the central limit theorem for uniformly ergodic Markov chains
A regeneration proof of the central limit theorem for uniformly ergodic Markov chains By AJAY JASRA Department of Mathematics, Imperial College London, SW7 2AZ, London, UK and CHAO YANG Department of Mathematics,
More informationMCMC: Markov Chain Monte Carlo
I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov
More informationMarkov chain Monte Carlo
1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop
More informationNon-homogeneous random walks on a semi-infinite strip
Non-homogeneous random walks on a semi-infinite strip Chak Hei Lo Joint work with Andrew R. Wade World Congress in Probability and Statistics 11th July, 2016 Outline Motivation: Lamperti s problem Our
More informationStochastic Gradient Descent in Continuous Time
Stochastic Gradient Descent in Continuous Time Justin Sirignano University of Illinois at Urbana Champaign with Konstantinos Spiliopoulos (Boston University) 1 / 27 We consider a diffusion X t X = R m
More informationLecture 2 February 25th
Statistical machine learning and convex optimization 06 Lecture February 5th Lecturer: Francis Bach Scribe: Guillaume Maillard, Nicolas Brosse This lecture deals with classical methods for convex optimization.
More informationRecent Advances in Regional Adaptation for MCMC
Recent Advances in Regional Adaptation for MCMC Radu Craiu Department of Statistics University of Toronto Collaborators: Yan Bai (Statistics, Toronto) Antonio Fabio di Narzo (Statistics, Bologna) Jeffrey
More informationEstimators based on non-convex programs: Statistical and computational guarantees
Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationarxiv: v3 [stat.me] 12 Jul 2015
Derivative-Free Estimation of the Score Vector and Observed Information Matrix with Application to State-Space Models Arnaud Doucet 1, Pierre E. Jacob and Sylvain Rubenthaler 3 1 Department of Statistics,
More informationTitles and Abstracts
Titles and Abstracts Stability of the Nonlinear Filter for Random Expanding Maps Jochen Broecker A ubiquitous problem in science and engineering is to reconstruct the state of a hidden Markov process (the
More information