Perturbed Proximal Gradient Algorithm
1 Perturbed Proximal Gradient Algorithm. Gersende FORT, LTCI, CNRS, Telecom ParisTech, Université Paris-Saclay, 75013 Paris, France. Large-scale inverse problems and optimization: applications to image processing and astrophysics. Grenoble, November 2015.
2 Introduction. Works in collaboration with Eric Moulines (Professor, Ecole Polytechnique), Yves Atchadé (Assistant Professor, Univ. Michigan, USA), and also Jean-François Aujol (IMB, Univ. Bordeaux), Charles Dossal (IMB, Univ. Bordeaux) and Soukaina Douissi. Y. Atchadé, G. Fort and E. Moulines. On Stochastic Proximal Gradient Algorithms. arXiv:1402.2365 [math.ST].
3 Introduction: Optimization problem. Outline:
- Introduction
  - Optimization problem
  - Proximal-gradient algorithm
  - Intractable proximal-gradient iteration
  - Perturbed proximal gradient
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress
4 Introduction: Optimization problem. Problem: (arg)min_{θ∈Θ} F(θ) with F(θ) = f(θ) + g(θ), where
- Θ is a finite-dimensional Euclidean space with scalar product ⟨·,·⟩ and norm ‖·‖;
- the function f: Θ → R is smooth, i.e. f is continuously differentiable and there exists L > 0 such that ‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖;
- the function g: Θ → (−∞, +∞] is convex, not identically equal to +∞, and lower semi-continuous;
in the case where f(θ) and ∇f are intractable.
5 Introduction: Proximal-gradient algorithm. Classical algorithm when ∇f is tractable (1/3). Since ∇f is Lipschitz, for any u, θ,
f(θ) ≤ f(u) + ⟨∇f(u), θ − u⟩ + (L/2) ‖θ − u‖²,
which yields, for any γ with Lγ ≤ 1,
F(θ) ≤ f(u) + ⟨∇f(u), θ − u⟩ + (1/(2γ)) ‖θ − u‖² + g(θ).
The RHS satisfies: for fixed u, it is an upper bound of θ ↦ F(θ); for θ = u, this upper bound is equal to F(u); for fixed u, it is convex in θ and can be written
C(u) + (1/(2γ)) ‖θ − {u − γ∇f(u)}‖² + g(θ).
6 Introduction: Proximal-gradient algorithm. Classical algorithm when ∇f is tractable (2/3). Denote the upper bound by
Q_γ(θ|u) := C(u) + (1/(2γ)) ‖θ − {u − γ∇f(u)}‖² + g(θ).
Majorization-Minimization (MM) algorithm: define {θ_n, n ≥ 0} iteratively by
θ_{n+1} = argmin_θ Q_γ(θ|θ_n),
or equivalently
θ_{n+1} = Prox_γ(θ_n − γ∇f(θ_n)), with Prox_γ(τ) := argmin_θ { g(θ) + (1/(2γ)) ‖θ − τ‖² },
also called the proximal-gradient algorithm.
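To make the iteration concrete, here is a minimal sketch (not from the talk) of the proximal-gradient update for a least-squares f and the ℓ₁ penalty g = λ‖·‖₁, whose proximal map is soft-thresholding; the function names and the toy data are illustrative assumptions.

```python
import numpy as np

def prox_l1(tau, gamma, lam):
    """Proximal operator of g = lam * ||.||_1: componentwise soft-thresholding at gamma*lam."""
    return np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam, 0.0)

def proximal_gradient(grad_f, theta0, gamma, lam, n_iter):
    """theta_{n+1} = Prox_gamma(theta_n - gamma * grad f(theta_n))."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = prox_l1(theta - gamma * grad_f(theta), gamma, lam)
    return theta

# Toy example: f(theta) = 0.5 * ||A theta - b||^2, sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = A @ np.array([1.0, -2.0] + [0.0] * 8) + 0.01 * rng.standard_normal(50)
L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad f
theta_hat = proximal_gradient(lambda t: A.T @ (A @ t - b), np.zeros(10),
                              gamma=1.0 / L, lam=0.1, n_iter=500)
```

With the step size γ = 1/L the MM majorization above guarantees monotone decrease of F, and the iterates here recover the sparse support of the ground truth.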
7 Introduction: Proximal-gradient algorithm. Classical algorithm when ∇f is tractable (3/3). The sequence {θ_n, n ≥ 0} is given by θ_{n+1} = argmin_θ Q_γ(θ|θ_n), where the upper bound θ ↦ Q_γ(θ|u) satisfies
F(θ) ≤ Q_γ(θ|u) and F(u) = Q_γ(u|u).
Lyapunov function: F(θ_{n+1}) ≤ F(θ_n), since
F(θ_{n+1}) ≤ Q_γ(θ_{n+1}|θ_n) ≤ Q_γ(θ_n|θ_n) = F(θ_n).
8 Introduction: Intractable proximal-gradient iteration. The exact proximal-gradient algorithm:
θ_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n)),
where {γ_n, n ≥ 0} is a step-size sequence in (0, 1/L].
1. Prox_γ(u) can be intractable (not in this talk).
2. ∇f can be intractable (in this talk).
9 Introduction: Intractable proximal-gradient iteration: explicit proximal operator.
(Projection on C) When g(θ) = 0 if θ ∈ C and +∞ otherwise, where C is closed convex, Prox_γ(τ) = argmin_{θ∈C} ‖τ − θ‖², i.e. the projection of τ on C.
(Elastic net penalty) g(θ) = λ ( (1−α)/2 ‖θ‖² + α ‖θ‖₁ ), componentwise:
(Prox_γ(τ))_i = (τ_i − γλα) / (1 + γλ(1−α)) if τ_i ≥ γλα,
(Prox_γ(τ))_i = (τ_i + γλα) / (1 + γλ(1−α)) if τ_i ≤ −γλα,
(Prox_γ(τ))_i = 0 otherwise.
Proximal-gradient algorithm = thresholded gradient algorithm.
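The componentwise elastic-net formula above fits in a few lines; this is a hedged sketch (the function name and argument order are our own):

```python
import numpy as np

def prox_elastic_net(tau, gamma, lam, alpha):
    """Prox of g(theta) = lam * ((1 - alpha)/2 * ||theta||^2 + alpha * ||theta||_1),
    applied componentwise: soft-threshold at gamma*lam*alpha, then shrink by
    1 / (1 + gamma*lam*(1 - alpha))."""
    shrunk = np.sign(tau) * np.maximum(np.abs(tau) - gamma * lam * alpha, 0.0)
    return shrunk / (1.0 + gamma * lam * (1.0 - alpha))
```

Setting α = 1 recovers the lasso (pure soft-thresholding) and α = 0 recovers the ridge scaling τ / (1 + γλ).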
10 Introduction: Intractable proximal-gradient iteration: intractable gradient ∇f.
1. Unknown function f whose gradient is of the form ∇f(θ) = ∫ H_θ(x) π_θ(dx). In this case,
∇f(θ) ≈ (1/m) Σ_{k=1}^m H_θ(X_k), {X_k, k ≥ 1}: (online) learning, Markov chain Monte Carlo.
11 Introduction: Intractable proximal-gradient iteration: intractable gradient ∇f (cont.).
2. Large-scale optimization: f(θ) = (1/N) Σ_{k=1}^N f_k(θ) with large N. In this case,
∇f(θ) = (1/N) Σ_{k=1}^N ∇f_k(θ) ≈ (1/m) Σ_{k=1}^m ∇f_{I_k}(θ).
12 Introduction: Perturbed proximal gradient. In this talk:
The exact proximal-gradient algorithm: θ_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n)).
The perturbed proximal-gradient algorithm: θ_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}{∇f(θ_n) + η_{n+1}}).
1. Which conditions on γ_n, η_n ensure convergence to the same limiting set as for the exact algorithm?
2. When η_n is a (random) Monte Carlo approximation, which conditions on γ_n, m_n?
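A generic sketch of the perturbed iteration (illustrative; the noise model, the step-size schedule, and the toy quadratic below are our assumptions, not the talk's):

```python
import numpy as np

def perturbed_proximal_gradient(grad_f, prox, theta0, steps, noise):
    """theta_{n+1} = Prox_{gamma_{n+1}}(theta_n - gamma_{n+1} * (grad f(theta_n) + eta_{n+1}))."""
    theta = theta0.copy()
    for gamma in steps:
        eta = noise(theta)                         # perturbation eta_{n+1}
        theta = prox(theta - gamma * (grad_f(theta) + eta), gamma)
    return theta

# Toy check: f(theta) = 0.5 * ||theta - c||^2, g = 0 (prox = identity),
# unbiased Gaussian noise, and gamma_n = n^{-0.7} so that
# sum_n gamma_n = +inf while sum_n gamma_n^2 < +inf.
rng = np.random.default_rng(1)
c = np.array([1.0, -1.0])
theta_inf = perturbed_proximal_gradient(
    grad_f=lambda t: t - c,
    prox=lambda v, gamma: v,                       # g = 0
    theta0=np.array([5.0, 5.0]),
    steps=[k ** -0.7 for k in range(1, 3001)],
    noise=lambda t: 0.1 * rng.standard_normal(2),
)
```

With diverging step sums and square-summable steps, the iterates settle near the minimizer c despite the injected noise, as the convergence theory below predicts for the unbiased case.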
13 Convergence of the (stable) perturbed proximal-gradient algorithm. Outline:
- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
  - Assumptions
  - On the convergence of {θ_n, n ≥ 0}
  - On the convergence of F(θ̄_n)
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress
14 Convergence of the (stable) perturbed proximal-gradient algorithm: Assumptions. (arg)min_{θ∈Θ} F(θ), F(θ) = f(θ) + g(θ).
1. The function g: Θ → (−∞, +∞] is convex, not identically equal to +∞, and lower semi-continuous.
2. The function f: Θ → R is continuously differentiable and there exists L > 0 such that ‖∇f(θ) − ∇f(θ′)‖ ≤ L ‖θ − θ′‖.
3. The function f is convex and the set L := argmin_θ F(θ) is not empty.
4. The step-size sequence {γ_n, n ≥ 0} is positive with γ_n ∈ (0, 1/L].
15 Convergence of the (stable) perturbed proximal-gradient algorithm: The algorithm. Stable sequence: let K ⊆ int(dom(g)) be a compact subset of Θ such that K ∩ L ≠ ∅. Algorithm:
θ̃_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n) − γ_{n+1}η_{n+1}),
θ_{n+1} = Proj_K(θ̃_{n+1}).
Weighted average sequence: let {a_n, n ≥ 0} be a non-negative sequence and set
θ̄_n = (Σ_{k=1}^n a_k)^{-1} Σ_{k=1}^n a_k θ_k.
16 Convergence of the (stable) perturbed proximal-gradient algorithm: Convergence of {θ_n, n ≥ 0}.
θ̃_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n) − γ_{n+1}η_{n+1}), θ_{n+1} = Proj_K(θ̃_{n+1}).
Theorem (Atchadé, F., Moulines (2015)). If assumptions 1 to 4 hold, Σ_n γ_n = +∞, and the series
Σ_n γ_{n+1}η_{n+1}, Σ_n γ_{n+1}⟨T_{γ_{n+1}}(θ_n), η_{n+1}⟩, Σ_n γ²_{n+1}‖η_{n+1}‖²
converge, then there exists θ∞ ∈ L ∩ K such that lim_n θ_n = lim_n θ̃_n = θ∞, where T_γ(θ) = Prox_γ(θ − γ∇f(θ)).
Includes the convergence analysis for the exact algorithm (η_n ≡ 0), Beck and Teboulle (2009); improves previous results, Combettes and Wajs (2005); Combettes and Pesquet (2014).
17 Convergence of the (stable) perturbed proximal-gradient algorithm: On the convergence of F(θ̄_n). Rates of convergence for {F(θ_n), n ≥ 0}.
θ̃_{n+1} = Prox_{γ_{n+1}}(θ_n − γ_{n+1}∇f(θ_n) − γ_{n+1}η_{n+1}), θ_{n+1} = Proj_K(θ̃_{n+1}).
Theorem (Atchadé, F., Moulines (2015)). If assumptions 1 to 4 hold, then for any a_k ≥ 0 and any θ⋆ ∈ L ∩ K, with U_n := Σ_{k=1}^n a_k {F(θ_k) − min F},
U_n ≤ (1/2) Σ_{k=1}^n (a_k/γ_k − a_{k−1}/γ_{k−1}) ‖θ_{k−1} − θ⋆‖² + (a_0/(2γ_0)) ‖θ_0 − θ⋆‖²
  − Σ_{k=1}^n a_k ⟨T_{γ_k}(θ_{k−1}) − θ⋆, η_k⟩ + Σ_{k=1}^n a_k γ_k ‖η_k‖².
Includes the convergence analysis for the exact algorithm (η_{n+1} ≡ 0); extends previous results in the case γ_n = γ, a_n = 1, Schmidt, Le Roux, Bach (2011), where it is assumed Σ_n ‖η_n‖ < ∞.
18 Convergence of the Monte Carlo proximal-gradient algorithm. Outline:
- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
  - Monte Carlo approximation
  - Additional assumptions
  - Convergence of θ_n
  - Convergence of F(θ̄_n)
  - How to choose γ_n, m_n?
- Conclusion, other results and works in progress
19 Convergence of the Monte Carlo proximal-gradient algorithm: Monte Carlo approximation of the gradient. Assume that ∇f(θ) is of the form ∇f(θ) = ∫ H_θ(x) π_θ(dx). Consider a Monte Carlo perturbation
η_{n+1} = (1/m_{n+1}) Σ_{k=1}^{m_{n+1}} H_{θ_n}(X^{(k)}_{n+1}) − ∇f(θ_n).
20 Convergence of the Monte Carlo proximal-gradient algorithm: Monte Carlo approximation of the gradient (cont.). This includes the case:
1. {X^{(1)}_{n+1}, …, X^{(m_{n+1})}_{n+1}} are i.i.d. with distribution π_{θ_n}: E[η_{n+1} | Past_n] = 0 (unbiased approximation).
21 Convergence of the Monte Carlo proximal-gradient algorithm: Monte Carlo approximation of the gradient (cont.). This also includes the case:
2. {X^{(1)}_{n+1}, …, X^{(m_{n+1})}_{n+1}} is a non-stationary Markov chain (e.g. an MCMC path) with invariant distribution π_{θ_n}: E[η_{n+1} | Past_n] ≠ 0 (biased approximation).
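In the unbiased case 1, the perturbation is simply a centered Monte Carlo average; a toy sketch (the choice π_θ = N(θ, I) and H_θ(x) = x, so that ∇f(θ) = θ, is our illustrative assumption, not the talk's model):

```python
import numpy as np

def mc_gradient(theta, m, rng):
    """(1/m) * sum_k H_theta(X^(k)) with X^(1..m) i.i.d. ~ pi_theta = N(theta, I)
    and H_theta(x) = x: an unbiased Monte Carlo estimate of grad f(theta) = theta."""
    samples = theta + rng.standard_normal((m, theta.size))
    return samples.mean(axis=0)

rng = np.random.default_rng(0)
theta = np.array([2.0, -3.0, 0.5])
# The perturbation eta_{n+1}: conditionally centered, variance 1/m per coordinate.
eta = mc_gradient(theta, m=10_000, rng=rng) - theta
```

Drawing the X^(k) from an MCMC kernel targeting π_θ instead of i.i.d. sampling gives the biased case 2, since the chain is not started at stationarity.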
22 Convergence of the Monte Carlo proximal-gradient algorithm: Additional assumptions.
5. The error is of the form η_{n+1} = (1/m_{n+1}) Σ_{k=1}^{m_{n+1}} H_{θ_n}(X^{(k)}_{n+1}) − ∇f(θ_n), where ∇f(θ) = ∫ H_θ(x) π_θ(dx).
6. {X^{(k)}_{n+1}, k ≥ 0} is a Markov chain with transition kernel P_{θ_n}. For all θ, π_θ is invariant for P_θ.
7. The kernels {P_θ, θ ∈ Θ} are geometrically ergodic uniformly in θ (aperiodic, phi-irreducible, uniform-in-θ geometric drift inequalities w.r.t. W^p where p ≥ 2, level sets of W^p are small): there exists p ≥ 2 such that for any l ∈ (0, p] there exist C > 0, ρ ∈ (0, 1) with
sup_{θ∈K} ‖P^n_θ(x, ·) − π_θ‖_{W^l} ≤ C ρ^n W^l(x).
This is a trivial condition in the i.i.d. case; there exist many sufficient conditions for the Markov case when samples are drawn from MCMC samplers.
23 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of θ_n when m_n → ∞.
η_{n+1} = (1/m_{n+1}) Σ_{k=1}^{m_{n+1}} H_{θ_n}(X^{(k)}_{n+1}) − ∇f(θ_n).
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7, Σ_n γ_n = +∞ and Σ_n γ²_{n+1} m^{-1}_{n+1} < ∞. If the approximation is biased, assume also Σ_n γ_{n+1} m^{-1}_{n+1} < ∞. Then, with probability one, there exists θ∞ ∈ L ∩ K such that lim_n θ_n = lim_n θ̃_n = θ∞.
24 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of θ_n when m_n → ∞ (cont.). The key ingredient of the proof is the control (F. and Moulines (2003)), for p ≥ 2, w.p. 1:
‖E[η_{n+1} | F_n]‖ ≤ C m^{-1}_{n+1} W(X^{(m_n)}_n), E[‖η_{n+1}‖^p | F_n] ≤ C m^{-p/2}_{n+1} W^p(X^{(m_n)}_n),
together with the decomposition
η_{n+1} = (η_{n+1} − E[η_{n+1} | F_n]) + E[η_{n+1} | F_n] = martingale increment + bias.
25 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of θ_n when m_n = m.
η_{n+1} = (1/m) Σ_{k=1}^{m} H_{θ_n}(X^{(k)}_{n+1}) − ∇f(θ_n).
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7, Σ_n γ_{n+1} = +∞ and Σ_n γ²_{n+1} < ∞. If the approximation is biased, assume also:
- there exists a constant C such that for any θ, θ′ ∈ K,
  ‖H_θ − H_{θ′}‖_W + ‖P_θ − P_{θ′}‖_W + ‖π_θ − π_{θ′}‖_W ≤ C ‖θ − θ′‖;
- sup_{γ∈(0,1/L]} sup_{θ∈K} γ^{-1} ‖Prox_γ(θ) − θ‖ < ∞;
- Σ_n |γ_{n+1} − γ_n| < ∞.
Then, with probability one, there exists θ∞ ∈ L ∩ K such that lim_n θ_n = lim_n θ̃_n = θ∞.
26 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of F(θ̄_n) when m_n → ∞.
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7. For any q ∈ (1, p/2], there exists C > 0 such that
‖ Σ_{k=1}^n a_k {F(θ_k) − min F} ‖_{L^q} ≤ C ( a_0/γ_0 + Σ_{k=1}^n |a_k/γ_k − a_{k−1}/γ_{k−1}| + (Σ_{k=1}^n a²_k m^{-1}_{k+1})^{1/2} + Σ_{k=1}^n a_k (γ_k + υ) m^{-1}_{k+1} ),
and
Σ_{k=1}^n a_k {E[F(θ_k)] − min F} ≤ C ( a_0/γ_0 + Σ_{k=1}^n |a_k/γ_k − a_{k−1}/γ_{k−1}| + Σ_{k=1}^n a_k (γ_k + υ) m^{-1}_k ),
where υ = 0 if the Monte Carlo approximation is unbiased and υ = 1 otherwise.
27 Convergence of the Monte Carlo proximal-gradient algorithm: Convergence of F(θ̄_n) when m_n = m.
Theorem (Atchadé, F., Moulines (2015)). Assume assumptions 1 to 7. For any q ∈ (1, p/2], there exists C > 0 such that
‖ Σ_{k=1}^n a_k {F(θ_k) − min F} ‖_{L^q} ≤ C ( a_0/γ_0 + Σ_{k=1}^n |a_k/γ_k − a_{k−1}/γ_{k−1}| + (Σ_{k=1}^n a²_k)^{1/2} + Σ_{k=1}^n a_k γ_k + υ Σ_{k=1}^n |a_{k+1} − a_k| ),
and
Σ_{k=1}^n a_k {E[F(θ_k)] − min F} ≤ C ( a_0/γ_0 + Σ_{k=1}^n |a_k/γ_k − a_{k−1}/γ_{k−1}| + Σ_{k=1}^n a_k γ_k + υ Σ_{k=1}^n |a_{k+1} − a_k| ),
where υ = 0 if the Monte Carlo approximation is unbiased and υ = 1 otherwise.
28 Convergence of the Monte Carlo proximal-gradient algorithm: How to choose γ_n, m_n? Fixed or increasing batch size m_n? Fixed or decreasing step size γ_n? Consider the L^q convergence rate of
(Σ_{k=1}^n a_k)^{-1} ‖ Σ_{k=1}^n a_k F(θ_k) − F(θ⋆) ‖_{L^q}.
Increasing batch size: with γ_n = γ, m_n ∝ n, a_n = 1: rate O(ln n / n); complexity (rate as a function of the total number of Monte Carlo draws) O(ln n / √n).
Fixed batch size m_n = m: with γ_n ∝ γ/√n, a_n = 1 or a_n = γ_n: rate O(1/√n); complexity O(1/√n).
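The complexity comparison can be made concrete by counting the total number of Monte Carlo draws after n iterations under each batch schedule (a trivial sketch; the specific schedules are the illustrative choices from the slide):

```python
def total_draws(batch_size, n):
    """Total number of Monte Carlo samples after n iterations,
    for a batch-size schedule k -> m_k."""
    return sum(batch_size(k) for k in range(1, n + 1))

n = 1000
draws_increasing = total_draws(lambda k: k, n)   # m_k = k: O(n^2) draws, rate O(ln n / n)
draws_fixed = total_draws(lambda k: 1, n)        # m_k = m: O(n) draws, rate O(1 / sqrt(n))
```

In both regimes, N total draws buy an accuracy of order roughly 1/√N (up to logarithmic factors), which is why neither schedule dominates in terms of complexity.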
29 Conclusion, other results and works in progress. Outline:
- Introduction
- Convergence of the (stable) perturbed proximal-gradient algorithm
- Convergence of the Monte Carlo proximal-gradient algorithm
- Conclusion, other results and works in progress
30 Conclusion. Contributions:
a) Results NOT restricted to the strongly convex case.
b) Sufficient conditions for the convergence of perturbed proximal-gradient algorithms.
c) Case of Monte Carlo approximations, biased or unbiased, increasing or fixed batch size.
Major contributions:
a) for Monte Carlo approximations,
b) biased approximations,
c) fixed batch size.
31 Other results, works in progress and future works:
a) When f is not convex.
b) Accelerations (Nesterov, …).
c) Convergence of the Proximal Stochastic Approximation Expectation Maximization algorithm: for the maximization of a penalized likelihood in latent variable models, using a generalization of the SAEM algorithm.
d) Rates of convergence: explicit controls.
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine
More informationLarge-scale machine learning and convex optimization
Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE IFCAM, Bangalore - July 2014 Big data revolution? A new scientific
More informationSampling multimodal densities in high dimensional sampling space
Sampling multimodal densities in high dimensional sampling space Gersende FORT LTCI, CNRS & Telecom ParisTech Paris, France Journées MAS Toulouse, Août 4 Introduction Sample from a target distribution
More informationLecture 10. Theorem 1.1 [Ergodicity and extremality] A probability measure µ on (Ω, F) is ergodic for T if and only if it is an extremal point in M.
Lecture 10 1 Ergodic decomposition of invariant measures Let T : (Ω, F) (Ω, F) be measurable, and let M denote the space of T -invariant probability measures on (Ω, F). Then M is a convex set, although
More informationIntroduction to Restricted Boltzmann Machines
Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,
More informationStochastic Gradient Descent with Variance Reduction
Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction
More informationTowards stability and optimality in stochastic gradient descent
Towards stability and optimality in stochastic gradient descent Panos Toulis, Dustin Tran and Edoardo M. Airoldi August 26, 2016 Discussion by Ikenna Odinaka Duke University Outline Introduction 1 Introduction
More informationA framework for adaptive Monte-Carlo procedures
A framework for adaptive Monte-Carlo procedures Jérôme Lelong (with B. Lapeyre) http://www-ljk.imag.fr/membres/jerome.lelong/ Journées MAS Bordeaux Friday 3 September 2010 J. Lelong (ENSIMAG LJK) Journées
More informationLarge-scale machine learning and convex optimization
Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Allerton Conference - September 2015 Slides available at www.di.ens.fr/~fbach/gradsto_allerton.pdf
More informationAgenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples
Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method
More informationLearning Energy-Based Models of High-Dimensional Data
Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal
More informationFAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč
FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom
More informationBandit Convex Optimization
March 7, 2017 Table of Contents 1 (BCO) 2 Projection Methods 3 Barrier Methods 4 Variance reduction 5 Other methods 6 Conclusion Learning scenario Compact convex action set K R d. For t = 1 to T : Predict
More informationStochastic Dynamic Programming: The One Sector Growth Model
Stochastic Dynamic Programming: The One Sector Growth Model Esteban Rossi-Hansberg Princeton University March 26, 2012 Esteban Rossi-Hansberg () Stochastic Dynamic Programming March 26, 2012 1 / 31 References
More informationA regeneration proof of the central limit theorem for uniformly ergodic Markov chains
A regeneration proof of the central limit theorem for uniformly ergodic Markov chains By AJAY JASRA Department of Mathematics, Imperial College London, SW7 2AZ, London, UK and CHAO YANG Department of Mathematics,
More informationMCMC: Markov Chain Monte Carlo
I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov
More informationMarkov chain Monte Carlo
1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop
More informationNon-homogeneous random walks on a semi-infinite strip
Non-homogeneous random walks on a semi-infinite strip Chak Hei Lo Joint work with Andrew R. Wade World Congress in Probability and Statistics 11th July, 2016 Outline Motivation: Lamperti s problem Our
More informationStochastic Gradient Descent in Continuous Time
Stochastic Gradient Descent in Continuous Time Justin Sirignano University of Illinois at Urbana Champaign with Konstantinos Spiliopoulos (Boston University) 1 / 27 We consider a diffusion X t X = R m
More informationLecture 2 February 25th
Statistical machine learning and convex optimization 06 Lecture February 5th Lecturer: Francis Bach Scribe: Guillaume Maillard, Nicolas Brosse This lecture deals with classical methods for convex optimization.
More informationRecent Advances in Regional Adaptation for MCMC
Recent Advances in Regional Adaptation for MCMC Radu Craiu Department of Statistics University of Toronto Collaborators: Yan Bai (Statistics, Toronto) Antonio Fabio di Narzo (Statistics, Bologna) Jeffrey
More informationEstimators based on non-convex programs: Statistical and computational guarantees
Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationarxiv: v3 [stat.me] 12 Jul 2015
Derivative-Free Estimation of the Score Vector and Observed Information Matrix with Application to State-Space Models Arnaud Doucet 1, Pierre E. Jacob and Sylvain Rubenthaler 3 1 Department of Statistics,
More informationTitles and Abstracts
Titles and Abstracts Stability of the Nonlinear Filter for Random Expanding Maps Jochen Broecker A ubiquitous problem in science and engineering is to reconstruct the state of a hidden Markov process (the
More information