Perturbed Proximal Gradient Algorithm


Perturbed Proximal Gradient Algorithm
Gersende FORT, LTCI, CNRS, Telecom ParisTech, Université Paris-Saclay, 75013 Paris, France
Large-scale inverse problems and optimization: applications to image processing and astrophysics
Grenoble, November 2015

Introduction

Works in collaboration with Eric Moulines (Professor, Ecole Polytechnique) and Yves Atchadé (Assistant Professor, Univ. Michigan, USA); and also with Jean-Francois Aujol (IMB, Univ. Bordeaux), Charles Dossal (IMB, Univ. Bordeaux) and Soukaina Douissi.
Reference: Y. Atchadé, G. Fort and E. Moulines. On Stochastic Proximal Gradient Algorithms. arXiv:1402.2365 [math.ST].

Introduction / Outline

- Introduction: optimization problem; proximal-gradient algorithm; intractable proximal-gradient iteration; perturbed proximal gradient
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress

Introduction / Optimization problem

Problem: $\operatorname{argmin}_{\theta \in \Theta} F(\theta)$ with $F(\theta) = f(\theta) + g(\theta)$, where
- $\Theta$ is a finite-dimensional Euclidean space with scalar product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$;
- the function $f: \Theta \to \mathbb{R}$ is smooth, i.e. $f$ is continuously differentiable and there exists $L > 0$ such that $\|\nabla f(\theta) - \nabla f(\theta')\| \le L \|\theta - \theta'\|$;
- the function $g: \Theta \to (-\infty, +\infty]$ is convex, not identically equal to $+\infty$, and lower semi-continuous;
in the case where $f(\theta)$ and $\nabla f(\theta)$ are intractable.

Introduction / Proximal-Gradient algorithm

Classical algorithm when $\nabla f$ tractable (1/3). Since $\nabla f$ is Lipschitz, for any $u, \theta$,
$$f(\theta) \le f(u) + \langle \nabla f(u), \theta - u \rangle + \frac{L}{2} \|\theta - u\|^2,$$
which yields, for any $\gamma \le L^{-1}$,
$$F(\theta) \le f(u) + \langle \nabla f(u), \theta - u \rangle + \frac{1}{2\gamma} \|\theta - u\|^2 + g(\theta).$$
[Figure: $F(\theta)$ and its quadratic upper bound built at the point $u$.]
The RHS satisfies:
- for fixed $u$, it is an upper bound of $\theta \mapsto F(\theta)$; for $\theta = u$, this upper bound is equal to $F(u)$;
- for fixed $u$, it is convex in $\theta$ and can be written $C(u) + \frac{1}{2\gamma} \|\theta - \{u - \gamma \nabla f(u)\}\|^2 + g(\theta)$.

Introduction / Proximal-Gradient algorithm

Classical algorithm when $\nabla f$ tractable (2/3). Denote the upper bound by
$$Q_\gamma(\theta \mid u) \stackrel{\text{def}}{=} C(u) + \frac{1}{2\gamma} \|\theta - \{u - \gamma \nabla f(u)\}\|^2 + g(\theta).$$
Majorization-Minimization (MM) algorithm: define $\{\theta_n, n \ge 0\}$ iteratively by
$$\theta_{n+1} = \operatorname{argmin}_\theta Q_\gamma(\theta \mid \theta_n),$$
or equivalently
$$\theta_{n+1} = \operatorname{Prox}_\gamma(\theta_n - \gamma \nabla f(\theta_n)) \quad \text{with} \quad \operatorname{Prox}_\gamma(\tau) \stackrel{\text{def}}{=} \operatorname{argmin}_\theta \left\{ g(\theta) + \frac{1}{2\gamma} \|\theta - \tau\|^2 \right\},$$
also called the Proximal-Gradient algorithm.

Introduction / Proximal-Gradient algorithm

Classical algorithm when $\nabla f$ tractable (3/3). The sequence $\{\theta_n, n \ge 0\}$ is given by $\theta_{n+1} = \operatorname{argmin}_\theta Q_\gamma(\theta \mid \theta_n)$, where the upper bound $\theta \mapsto Q_\gamma(\theta \mid u)$ satisfies
$$F(\theta) \le Q_\gamma(\theta \mid u), \qquad F(u) = Q_\gamma(u \mid u) \qquad \text{(Lyapunov function)}.$$
[Figure: $F(\theta)$, the majorizer $Q_\gamma(\theta \mid \theta_n)$, and the points $\theta_n$, $\theta_{n+1}$.]
Hence $F(\theta_{n+1}) \le F(\theta_n)$, since $F(\theta_{n+1}) \le Q_\gamma(\theta_{n+1} \mid \theta_n) \le Q_\gamma(\theta_n \mid \theta_n) = F(\theta_n)$.
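As a concrete illustration of the MM iteration above, here is a minimal NumPy sketch on a toy lasso-type problem: $f(\theta) = \frac{1}{2}\|A\theta - b\|^2$ and $g = \lambda \|\cdot\|_1$. The data `A`, `b` and the weight `lam` are illustrative choices, not from the talk; the final assertion checks the Lyapunov property $F(\theta_{n+1}) \le F(\theta_n)$.

```python
import numpy as np

# Toy composite objective F = f + g (illustrative, not from the talk):
#   f(theta) = 0.5 * ||A theta - b||^2   (smooth, grad f is L-Lipschitz, L = ||A^T A||)
#   g(theta) = lam * ||theta||_1         (convex, non-smooth)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1

def grad_f(theta):
    return A.T @ (A @ theta - b)

def F(theta):
    r = A @ theta - b
    return 0.5 * r @ r + lam * np.abs(theta).sum()

def prox_g(tau, gamma):
    # Prox_gamma(tau) = argmin_theta { g(theta) + ||theta - tau||^2 / (2 gamma) },
    # which for g = lam * l1 is coordinate-wise soft-thresholding.
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f
gamma = 1.0 / L                  # step size in (0, 1/L]

theta = np.zeros(5)
values = [F(theta)]
for _ in range(200):
    # one MM / proximal-gradient step
    theta = prox_g(theta - gamma * grad_f(theta), gamma)
    values.append(F(theta))

# Majorization-minimization guarantees monotone decrease of F.
assert all(v1 <= v0 + 1e-12 for v0, v1 in zip(values, values[1:]))
```

The monotone decrease is exactly the chain $F(\theta_{n+1}) \le Q_\gamma(\theta_{n+1} \mid \theta_n) \le Q_\gamma(\theta_n \mid \theta_n) = F(\theta_n)$ from the slide.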

Introduction / Intractable proximal-gradient iteration

The exact proximal-gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n)),$$
where $\{\gamma_n, n \ge 0\}$ is a step-size sequence in $(0, 1/L]$.
1. $\operatorname{Prox}_\gamma(u)$ can be intractable (not in this talk).
2. $\nabla f$ can be intractable (in this talk).

Introduction / Intractable proximal-gradient iteration

Explicit proximal operators:
(Projection on $C$) When $g(\theta) = 0$ if $\theta \in C$ and $+\infty$ otherwise, where $C$ is closed convex, $\operatorname{Prox}_\gamma(\tau) = \operatorname{argmin}_{\theta \in C} \|\tau - \theta\|^2$, the projection of $\tau$ onto $C$.
(Elastic net penalty) $g(\theta) = \lambda \left( \frac{1-\alpha}{2} \|\theta\|_2^2 + \alpha \|\theta\|_1 \right)$:
$$(\operatorname{Prox}_\gamma(\tau))_i = \frac{1}{1 + \gamma\lambda(1-\alpha)} \times \begin{cases} \tau_i - \gamma\lambda\alpha & \text{if } \tau_i \ge \gamma\lambda\alpha, \\ \tau_i + \gamma\lambda\alpha & \text{if } \tau_i \le -\gamma\lambda\alpha, \\ 0 & \text{otherwise.} \end{cases}$$
Here the proximal-gradient algorithm is a thresholded gradient algorithm.
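The elastic-net proximal operator above is easy to code coordinate-wise: soft-threshold at level $\gamma\lambda\alpha$, then shrink by $1/(1+\gamma\lambda(1-\alpha))$. A sketch (the parameter names `gamma`, `lam`, `alpha` mirror $\gamma, \lambda, \alpha$; the test values are illustrative):

```python
import numpy as np

def prox_elastic_net(tau, gamma, lam, alpha):
    """Proximal operator of g = lam * ((1-alpha)/2 ||.||_2^2 + alpha ||.||_1):
    soft-threshold at gamma*lam*alpha, then shrink by 1/(1 + gamma*lam*(1-alpha))."""
    shrunk = np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam * alpha, 0.0)
    return shrunk / (1.0 + gamma * lam * (1.0 - alpha))

tau = np.array([-2.0, -0.05, 0.0, 0.05, 2.0])
gamma, lam = 0.5, 1.0

# alpha = 1: pure l1 penalty, reduces to plain soft-thresholding at gamma*lam.
assert np.allclose(prox_elastic_net(tau, gamma, lam, alpha=1.0),
                   [-1.5, 0.0, 0.0, 0.0, 1.5])

# alpha = 0: pure ridge penalty, reduces to a plain rescaling of tau.
assert np.allclose(prox_elastic_net(tau, gamma, lam, alpha=0.0), tau / 1.5)
```

The two limit cases ($\alpha = 1$: lasso, $\alpha = 0$: ridge) confirm that the closed form matches the display above.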

Introduction / Intractable proximal-gradient iteration

Intractable gradient $\nabla f$:
1. Unknown function $f$ whose gradient is of the form $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx)$. In this case
$$\nabla f(\theta) \approx \frac{1}{m} \sum_{k=1}^m H_\theta(X_k), \qquad \{X_k, k \ge 1\}:$$
(online) learning, Markov chain Monte Carlo.
2. Large-scale optimization: $f(\theta) = \frac{1}{N} \sum_{k=1}^N f_k(\theta)$ with large $N$. In this case
$$\nabla f(\theta) = \frac{1}{N} \sum_{k=1}^N \nabla f_k(\theta) \approx \frac{1}{m} \sum_{k=1}^m \nabla f_{I_k}(\theta).$$

Introduction / Perturbed Proximal Gradient

In this talk. The exact proximal-gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n)).$$
The perturbed proximal-gradient algorithm:
$$\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \{\nabla f(\theta_n) + \eta_{n+1}\}).$$
1. Which conditions on $\gamma_n, \eta_n$ ensure convergence to the same limiting set as for the exact algorithm?
2. When $\eta_n$ is a (random) Monte Carlo approximation, which conditions on $\gamma_n, m_n$?
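A hedged numerical sketch of the perturbed iteration: the same kind of toy lasso problem as before, with an artificial perturbation $\eta_n$ of decaying size injected into the gradient. Under such decay the perturbed run ends up near the exact one, which is the behaviour the convergence conditions are after. All problem data and the `1/n` decay rate are illustrative choices, not from the talk.

```python
import numpy as np

# Toy lasso data (illustrative)
data_rng = np.random.default_rng(1)
A = data_rng.standard_normal((30, 4))
b = data_rng.standard_normal(30)
lam = 0.05
L = np.linalg.norm(A.T @ A, 2)

def prox_l1(tau, gamma):
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def run(noise_scale, n_iter=2000, seed=2):
    """Perturbed proximal-gradient: the gradient is corrupted by a noise
    eta_n whose magnitude decays like 1/n (noise_scale = 0 gives the exact
    algorithm)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(4)
    gamma = 1.0 / L
    for n in range(1, n_iter + 1):
        eta = noise_scale * rng.standard_normal(4) / n   # decaying perturbation
        theta = prox_l1(theta - gamma * (A.T @ (A @ theta - b) + eta), gamma)
    return theta

exact = run(0.0)
perturbed = run(1.0)
# With sufficiently fast-decaying perturbations, both runs reach
# (nearly) the same limit point.
assert np.linalg.norm(exact - perturbed) < 1e-2
```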

Convergence of the (stable) perturbed proximal-gradient algorithm / Outline

- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm: assumptions; on the convergence of $\{\theta_n, n \ge 0\}$; on the convergence of $F(\bar\theta_n)$
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress

Convergence of the (stable) perturbed proximal-gradient algorithm / Assumptions

Assumptions for $\operatorname{argmin}_{\theta \in \Theta} F(\theta)$, $F(\theta) = f(\theta) + g(\theta)$:
1. The function $g: \Theta \to (-\infty, +\infty]$ is convex, not identically equal to $+\infty$, and lower semi-continuous.
2. The function $f: \Theta \to \mathbb{R}$ is continuously differentiable and there exists $L > 0$ such that $\|\nabla f(\theta) - \nabla f(\theta')\| \le L \|\theta - \theta'\|$.
3. The function $f$ is convex and the set $\mathcal{L} \stackrel{\text{def}}{=} \operatorname{argmin}_\theta F(\theta)$ is not empty.
4. The step sizes $\{\gamma_n, n \ge 0\}$ are positive with $\gamma_n \in (0, 1/L]$.

Convergence of the (stable) perturbed proximal-gradient algorithm / Assumptions

The algorithm. Stable sequence: let $K \subseteq \operatorname{int}(\operatorname{dom}(g))$ be a compact subset of $\Theta$ such that $K \cap \mathcal{L} \neq \emptyset$. Algorithm:
$$\tilde\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n) - \gamma_{n+1} \eta_{n+1}), \qquad \theta_{n+1} = \operatorname{Proj}_K(\tilde\theta_{n+1}).$$
Weighted average sequence: let $\{a_n, n \ge 0\}$ be a non-negative sequence and set
$$\bar\theta_n = \frac{1}{\sum_{k=1}^n a_k} \sum_{k=1}^n a_k \theta_k.$$
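The stabilisation device can be sketched in a few lines, taking for $K$ an $\ell_2$ ball of arbitrary radius `R` (an illustrative choice of compact set; the helper names `proj_K` and `weighted_average` are mine, not from the talk):

```python
import numpy as np

R = 10.0  # radius of the illustrative compact set K = {theta : ||theta|| <= R}

def proj_K(theta):
    """Projection onto the l2 ball of radius R: rescale iterates that escape K."""
    nrm = np.linalg.norm(theta)
    return theta if nrm <= R else theta * (R / nrm)

def weighted_average(thetas, a):
    """Weighted average theta_bar_n = (sum a_k)^{-1} sum a_k theta_k."""
    a = np.asarray(a, dtype=float)
    return (a[:, None] * np.asarray(thetas)).sum(axis=0) / a.sum()

# Small sanity checks
assert np.allclose(proj_K(np.array([30.0, 40.0])), [6.0, 8.0])
assert np.allclose(weighted_average([np.array([1.0, 0.0]),
                                     np.array([3.0, 0.0])], [1, 1]), [2.0, 0.0])
```

In the algorithm, each perturbed prox-gradient output $\tilde\theta_{n+1}$ would be passed through `proj_K`, and `weighted_average` applied to the history of projected iterates with the weights $a_k$.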

Convergence of the (stable) perturbed proximal-gradient algorithm / On the convergence of $\{\theta_n, n \ge 0\}$

Convergence of $\{\theta_n, n \ge 0\}$:
$$\tilde\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n) - \gamma_{n+1} \eta_{n+1}), \qquad \theta_{n+1} = \operatorname{Proj}_K(\tilde\theta_{n+1}).$$
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 4, $\sum_n \gamma_n = +\infty$, and that the series
$$\sum_n \gamma_{n+1} \eta_{n+1}, \qquad \sum_n \gamma_{n+1} \langle T_{\gamma_{n+1}}(\theta_n), \eta_{n+1} \rangle, \qquad \sum_n \gamma_{n+1}^2 \|\eta_{n+1}\|^2$$
converge, where $T_\gamma(\theta) = \operatorname{Prox}_\gamma(\theta - \gamma \nabla f(\theta))$. Then there exists $\theta^\star \in \mathcal{L} \cap K$ such that $\lim_n \theta_n = \lim_n \tilde\theta_n = \theta^\star$.
Includes the convergence analysis for the exact algorithm ($\eta_n = 0$), Beck and Teboulle (2009); improves previous results, Combettes and Wajs (2005); Combettes and Pesquet (2014).

Convergence of the (stable) perturbed proximal-gradient algorithm / On the convergence of $F(\bar\theta_n)$

Rates of convergence for $\{F(\bar\theta_n), n \ge 0\}$:
$$\tilde\theta_{n+1} = \operatorname{Prox}_{\gamma_{n+1}}(\theta_n - \gamma_{n+1} \nabla f(\theta_n) - \gamma_{n+1} \eta_{n+1}), \qquad \theta_{n+1} = \operatorname{Proj}_K(\tilde\theta_{n+1}).$$
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 4. For any $a_k \ge 0$,
$$\sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \le U_n,$$
with
$$U_n \stackrel{\text{def}}{=} \frac{1}{2} \sum_{k=1}^n \left( \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \right) \|\theta_{k-1} - \theta^\star\|^2 + \frac{a_0}{2\gamma_0} \|\theta_0 - \theta^\star\|^2 - \sum_{k=1}^n a_k \langle T_{\gamma_k}(\theta_{k-1}) - \theta^\star, \eta_k \rangle + \sum_{k=1}^n a_k \gamma_k \|\eta_k\|^2.$$
Includes the convergence analysis for the exact algorithm ($\eta_{n+1} = 0$); extends previous results in the case $\gamma_n = \gamma$, $a_n = 1$, Schmidt, Le Roux, Bach (2011), where it is assumed $\sum_n \|\eta_n\| < \infty$.

Convergence of the Monte Carlo proximal-gradient algorithm / Outline

- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm: Monte Carlo approximation; additional assumptions; convergence of $\theta_n$; convergence of $F(\bar\theta_n)$; how to choose $\gamma_n, m_n$?
- Conclusion, other results and works in progress

Convergence of the Monte Carlo proximal-gradient algorithm / Monte Carlo Approximation

Monte Carlo approximation of the gradient. Assume that $\nabla f(\theta)$ is of the form $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx)$. Consider a Monte Carlo perturbation
$$\eta_{n+1} = \frac{1}{m_{n+1}} \sum_{k=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(k)}) - \nabla f(\theta_n),$$
which includes the cases:
1. $\{X_{n+1}^{(1)}, \dots, X_{n+1}^{(m_{n+1})}\}$ are i.i.d. with distribution $\pi_{\theta_n}$: $\mathbb{E}[\eta_{n+1} \mid \text{Past}_n] = 0$ (unbiased approximation).
2. $\{X_{n+1}^{(1)}, \dots, X_{n+1}^{(m_{n+1})}\}$ is a non-stationary Markov chain (e.g. an MCMC path) with invariant distribution $\pi_{\theta_n}$: $\mathbb{E}[\eta_{n+1} \mid \text{Past}_n] \neq 0$ (biased approximation).
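The unbiased (i.i.d.) case can be checked numerically on a toy model where everything is explicit: take $\pi_\theta = \mathcal{N}(\theta, 1)$ and $H_\theta(x) = x$, so that $\nabla f(\theta) = \int x\, \pi_\theta(dx) = \theta$ and $\eta_{n+1}$ is a centred sample-mean error. This model is an illustrative assumption of mine, not from the talk.

```python
import numpy as np

# Toy model (illustrative): pi_theta = N(theta, 1), H_theta(x) = x,
# so grad f(theta) = theta exactly, and the Monte Carlo error is
#   eta = (1/m) sum_k X_k - theta,  X_k iid ~ N(theta, 1).
rng = np.random.default_rng(3)
theta = 1.5
m = 100

def mc_grad(m):
    """Monte Carlo estimate (1/m) sum H_theta(X_k) of grad f(theta)."""
    x = rng.normal(loc=theta, scale=1.0, size=m)   # iid draws from pi_theta
    return x.mean()

etas = np.array([mc_grad(m) - theta for _ in range(2000)])

# Unbiased case: E[eta | past] = 0, and Var(eta) scales like 1/m.
assert abs(etas.mean()) < 0.02
assert abs(etas.var() * m - 1.0) < 0.2
```

In the biased case the $X^{(k)}$ would instead come from an MCMC chain targeting $\pi_{\theta_n}$ started out of stationarity, so the conditional mean of `etas` would no longer vanish.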

Convergence of the Monte Carlo proximal-gradient algorithm / Additional assumptions

Additional assumptions:
5. The error is of the form $\eta_{n+1} = \frac{1}{m_{n+1}} \sum_{k=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(k)}) - \nabla f(\theta_n)$, where $\nabla f(\theta) = \int H_\theta(x)\, \pi_\theta(dx)$.
6. $\{X_{n+1}^{(k)}, k \ge 0\}$ is a Markov chain with transition kernel $P_{\theta_n}$; for all $\theta$, $\pi_\theta$ is invariant for $P_\theta$.
7. The kernels $\{P_\theta, \theta \in \Theta\}$ are geometrically ergodic uniformly in $\theta$ (aperiodic, phi-irreducible, uniform-in-$\theta$ geometric drift inequalities w.r.t. $W^p$ where $p \ge 2$, level sets of $W^p$ are small): there exists $p \ge 2$ such that for any $l \in (0, p]$, there exist $C > 0$, $\rho \in (0, 1)$ s.t.
$$\sup_{\theta \in K} \|P_\theta^n(x, \cdot) - \pi_\theta\|_{W^l} \le C \rho^n W^l(x).$$
This condition is trivial in the i.i.d. case; there exist many sufficient conditions for the Markov case when samples are drawn from MCMC samplers.

Convergence of the Monte Carlo proximal-gradient algorithm / Convergence of $\theta_n$

Convergence of $\theta_n$ when $m_n \to \infty$:
$$\eta_{n+1} = \frac{1}{m_{n+1}} \sum_{k=1}^{m_{n+1}} H_{\theta_n}(X_{n+1}^{(k)}) - \nabla f(\theta_n).$$
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7, $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_{n+1}^2 m_{n+1}^{-1} < \infty$. If the approximation is biased, assume also $\sum_n \gamma_{n+1} m_{n+1}^{-1} < \infty$. Then, with probability one, there exists $\theta^\star \in \mathcal{L} \cap K$ such that $\lim_n \theta_n = \lim_n \tilde\theta_n = \theta^\star$.
The key ingredient of the proof is the control (F. and Moulines (2003)), for $p \ge 2$, w.p. 1:
$$\|\mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]\| \le C\, m_{n+1}^{-1} W(X_n^{(m_n)}), \qquad \mathbb{E}[\|\eta_{n+1}\|^p \mid \mathcal{F}_n] \le C\, m_{n+1}^{-p/2} W^p(X_n^{(m_n)}),$$
and the decomposition
$$\eta_{n+1} = \underbrace{\eta_{n+1} - \mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]}_{\text{martingale increment}} + \underbrace{\mathbb{E}[\eta_{n+1} \mid \mathcal{F}_n]}_{\text{bias}}.$$

Convergence of the Monte Carlo proximal-gradient algorithm / Convergence of $\theta_n$

Convergence of $\theta_n$ when $m_n = m$:
$$\eta_{n+1} = \frac{1}{m} \sum_{k=1}^{m} H_{\theta_n}(X_{n+1}^{(k)}) - \nabla f(\theta_n).$$
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7, $\sum_n \gamma_{n+1} = +\infty$ and $\sum_n \gamma_{n+1}^2 < \infty$. If the approximation is biased, assume also:
- there exists a constant $C$ such that for any $\theta, \theta' \in K$, $\|H_\theta - H_{\theta'}\|_W + \|P_\theta - P_{\theta'}\|_W + \|\pi_\theta - \pi_{\theta'}\|_W \le C \|\theta - \theta'\|$;
- $\sup_{\gamma \in (0, 1/L]} \sup_{\theta \in K} \gamma^{-1} \|\operatorname{Prox}_\gamma(\theta) - \theta\| < \infty$;
- $\sum_n |\gamma_{n+1} - \gamma_n| < \infty$.
Then, with probability one, there exists $\theta^\star \in \mathcal{L} \cap K$ such that $\lim_n \theta_n = \lim_n \tilde\theta_n = \theta^\star$.

Convergence of the Monte Carlo proximal-gradient algorithm / Convergence of $F(\bar\theta_n)$

Convergence of $F(\bar\theta_n)$ when $m_n \to \infty$. Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7. For any $q \in (1, p/2]$, there exists $C > 0$ s.t.
$$\Big\| \sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \Big\|_{L_q} \le C \left( \frac{a_0}{\gamma_0} + \sum_{k=1}^n \Big| \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big| + \Big( \sum_{k=1}^n a_k^2 m_{k+1}^{-1} \Big)^{1/2} + \sum_{k=1}^n a_k (\gamma_k + \upsilon) m_{k+1}^{-1} \right)$$
and
$$\sum_{k=1}^n a_k \{\mathbb{E}[F(\theta_k)] - \min F\} \le C \left( \frac{a_0}{\gamma_0} + \sum_{k=1}^n \Big| \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big| + \sum_{k=1}^n a_k (\gamma_k + \upsilon) m_{k+1}^{-1} \right),$$
where $\upsilon = 0$ if the Monte Carlo approximation is unbiased and $\upsilon = 1$ otherwise.

Convergence of the Monte Carlo proximal-gradient algorithm / Convergence of $F(\bar\theta_n)$

Convergence of $F(\bar\theta_n)$ when $m_n = m$. Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7. For any $q \in (1, p/2]$, there exists $C > 0$ s.t.
$$\Big\| \sum_{k=1}^n a_k \{F(\theta_k) - \min F\} \Big\|_{L_q} \le C \left( \frac{a_0}{\gamma_0} + \sum_{k=1}^n \Big| \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big| + \Big( \sum_{k=1}^n a_k^2 \Big)^{1/2} + \sum_{k=1}^n a_k \gamma_k + \upsilon \sum_{k=1}^n |a_{k+1} - a_k| \right)$$
and
$$\sum_{k=1}^n a_k \{\mathbb{E}[F(\theta_k)] - \min F\} \le C \left( \frac{a_0}{\gamma_0} + \sum_{k=1}^n \Big| \frac{a_k}{\gamma_k} - \frac{a_{k-1}}{\gamma_{k-1}} \Big| + \sum_{k=1}^n a_k \gamma_k + \upsilon \sum_{k=1}^n |a_{k+1} - a_k| \right),$$
where $\upsilon = 0$ if the Monte Carlo approximation is unbiased and $\upsilon = 1$ otherwise.

Convergence of the Monte Carlo proximal-gradient algorithm / How to choose $\gamma_n, m_n$?

Fixed or increasing batch size $m_n$? Fixed or decreasing step size $\gamma_n$? Consider the $L_q$-convergence rate of
$$\Big( \sum_{k=1}^n a_k \Big)^{-1} \sum_{k=1}^n a_k F(\theta_k) - F(\theta^\star).$$
- Increasing batch size $m_n$: with $\gamma_n = \gamma$, $m_n \propto n$, $a_n = 1$: rate $O(\ln n / n)$; complexity (rate as a function of the total number of simulations) $O(\ln n / \sqrt{n})$.
- Fixed batch size $m_n = m$: with $\gamma_n \propto \gamma^\star / \sqrt{n}$, $a_n = 1$ or $a_n = \gamma_n$: rate $O(1/\sqrt{n})$; complexity $O(1/\sqrt{n})$.
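The two regimes above can be written out as schedules. The sketch below (with an illustrative base step `gamma_star`, not from the talk) only tallies the simulation budget $\sum_{k \le n} m_k$, which is what the complexity statements refer to: the increasing-batch regime spends $O(n^2)$ simulations in $n$ iterations, the fixed-batch regime $O(n)$.

```python
import math

gamma_star = 0.1  # illustrative base step size

def increasing_batch(n):
    """Regime 1: gamma_n = gamma (constant), m_n ~ n, a_n = 1."""
    return gamma_star, n

def fixed_batch(n, m=1):
    """Regime 2: gamma_n ~ gamma / sqrt(n), m_n = m (constant), a_n = 1 or gamma_n."""
    return gamma_star / math.sqrt(n), m

# Total simulation budget after n iterations in each regime.
n = 1000
cost_increasing = sum(increasing_batch(k)[1] for k in range(1, n + 1))  # ~ n^2 / 2
cost_fixed = sum(fixed_batch(k)[1] for k in range(1, n + 1))            # = n

assert cost_increasing == n * (n + 1) // 2
assert cost_fixed == n
```

Measured per iteration, the increasing-batch regime is faster ($\ln n / n$ vs $1/\sqrt{n}$); measured per simulation, the two regimes are essentially comparable, which is the point of the slide.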

Conclusion, Other results and Works in progress / Outline

- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress

Conclusion, Other results and Works in progress / Conclusion

Contributions:
a) NOT in the strongly convex case.
b) Sufficient conditions for the convergence of perturbed proximal-gradient algorithms.
c) The case of Monte Carlo approximations: biased or unbiased, increasing or fixed batch size.
Major contributions:
a) for Monte Carlo approximations;
b) biased approximations;
c) fixed batch size.

Conclusion, Other results and Works in progress / Other results, works in progress and future works

a) When $f$ is not convex.
b) Accelerations (Nesterov, ...).
c) Convergence of the Proximal Stochastic Approximation Expectation Maximization algorithm: for the maximization of a penalized likelihood in latent-variable models, using a generalization of the SAEM algorithm.
d) Rates of convergence: explicit controls.