Stochastic Proximal Gradient Algorithm


Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l'Information. Joint work with: Y. Atchadé (Ann Arbor, USA), G. Fort (LTCI/Télécom ParisTech), and the kind help of A. Juditsky (LJK, Grenoble).

Motivation

Problem Statement

(P)   $\operatorname{Argmin}_{\theta \in \Theta} \{ -\ell(\theta) + g(\theta) \}$,

where $\ell$ is a smooth log-likelihood function (or some other smooth statistical learning function), and $g$ is a possibly non-smooth convex penalty term. This problem has attracted a lot of attention with the growing need to address high-dimensional statistical problems. This work focuses on the case where the function $\ell$ and its gradient $\nabla\ell$ are both intractable, and where $\nabla\ell$ is given by

$\nabla\ell(\theta) = \int H_\theta(x)\, \pi_\theta(\mathrm{d}x)$,

for some probability measure $\pi_\theta$.

Network structure

Problem: estimate sparse network structures from measurements on the nodes. For discrete measurements, this amounts to estimating a Gibbs measure with pair-wise interactions

$f_\theta(x_1, \dots, x_p) = \frac{1}{Z_\theta} \exp\Big( \sum_{i=1}^p \theta_{ii} B_0(x_i) + \sum_{1 \le j < i \le p} \theta_{ij} B(x_i, x_j) \Big)$,

for a function $B_0 : \mathsf{X} \to \mathbb{R}$ and a symmetric function $B : \mathsf{X} \times \mathsf{X} \to \mathbb{R}$, where $\mathsf{X}$ is a finite set. The absence of an edge encodes conditional independence.

Network structure

Each graph represents a model class of graphical models; learning a graph is then a model-class selection problem. Constraint-based approaches test conditional independence from the data and then determine a graph that most closely represents those independencies. Score-based approaches combine a metric for the complexity of the graph with a measure of the goodness of fit of the graph to the data... but the number of graph structures grows super-exponentially, and the problem is in general NP-hard.

Network structure

For $x \in \mathsf{X}^p$, define $\bar{B}(x) \stackrel{\mathrm{def}}{=} (B_{jk}(x_j, x_k))_{1 \le j,k \le p} \in \mathbb{R}^{p \times p}$. The $\ell_1$-penalized maximum likelihood estimate of $\theta$ is obtained by solving an optimization problem of the form (P), where $\ell$ and $g$ are given by

$\ell(\theta) = \frac{1}{n} \sum_{i=1}^n \langle \theta, \bar{B}(x^{(i)}) \rangle - \log Z_\theta, \qquad g(\theta) = \lambda \sum_{1 \le k < j \le p} |\theta_{jk}|$.

Fisher identity

Fact 1: the normalization constant $Z_\theta$ is given by $Z_\theta = \sum_x \exp(\langle \theta, \bar{B}(x) \rangle)$, where the sum is over all possible configurations.

Fact 2: the gradient $\nabla \log Z_\theta$ is the expectation of the sufficient statistics: $\nabla \log Z_\theta = \sum_x \bar{B}(x) f_\theta(x)$.

Problem: none of these quantities can be computed explicitly... Nevertheless, they can be estimated using Monte Carlo integration.
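To make Fact 2 concrete, here is a toy sketch (ours, with illustrative choices $p = 3$, $M = 2$, $B_0(x) = x$) in which $Z_\theta$ is still computable by enumeration, so the exact gradient of $\log Z_\theta$ can be compared with its Monte Carlo estimate. In the regime of interest only the Monte Carlo column is available.

```python
import itertools
import numpy as np

# Toy pairwise model (p = 3 nodes, M = 2 states, B0(x) = x, B(x, y) = 1{x == y}),
# small enough that Z_theta is computable by brute-force enumeration.
p, M = 3, 2
rng = np.random.default_rng(0)
A = rng.normal(size=(p, p))
theta = (A + A.T) / 2                            # symmetric parameter matrix

def suff_stat(x):
    """Bbar(x): entries 1{x_i == x_j} off the diagonal, B0(x_i) = x_i on it."""
    S = np.equal.outer(x, x).astype(float)
    np.fill_diagonal(S, x)
    return S

configs = [np.array(c) for c in itertools.product(range(M), repeat=p)]
log_w = [np.sum(np.tril(theta) * suff_stat(x)) for x in configs]
probs = np.exp(log_w) / np.sum(np.exp(log_w))    # f_theta(x) = exp(<theta, Bbar(x)>)/Z_theta

# Fact 2, exactly: grad log Z_theta = sum_x Bbar(x) f_theta(x) ...
exact = sum(pr * suff_stat(x) for pr, x in zip(probs, configs))

# ... and by Monte Carlo integration, the only option when enumeration is impossible.
draws = rng.choice(len(configs), size=20_000, p=probs)
mc = sum(suff_stat(configs[i]) for i in draws) / len(draws)
print(np.abs(exact - mc).max())                  # small, O(m^{-1/2})
```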

General framework

(P)   $\operatorname{Argmin}_{\theta \in \Theta} \{ -\ell(\theta) + g(\theta) \}$, where

- $\ell$ is a smooth log-likelihood function or some other smooth statistical learning function,
- $g$ is a non-smooth convex sparsity-inducing penalty,
- the function $\ell$ and its gradient $\nabla\ell$ are intractable,
- the score function $\nabla\ell$ is given by $\nabla\ell(\theta) = \int H_\theta(x)\, \pi_\theta(\mathrm{d}x)$, for some probability measure $\pi_\theta$ on some measurable space $(\mathsf{X}, \mathcal{B})$ and some function $H_\theta : \mathsf{X} \to \Theta$.


Definition

Definition: the proximal mapping associated with a closed convex function $g$ and stepsize $\gamma$ is

$\operatorname{prox}_\gamma(\theta) = \operatorname{Argmin}_{\vartheta \in \Theta} \big( g(\vartheta) + (2\gamma)^{-1} \|\vartheta - \theta\|_2^2 \big)$.

If $g = I_K$, where $K$ is a closed convex set ($I_K(x) = 0$ for $x \in K$, $I_K(x) = \infty$ otherwise), then $\operatorname{prox}_\gamma$ is the Euclidean projection onto $K$:

$\operatorname{prox}_\gamma(\theta) = \operatorname{Argmin}_{\vartheta \in K} \|\vartheta - \theta\|_2^2 = P_K(\theta)$.

If $g(\theta) = \sum_{i=1}^p \lambda_i |\theta_i|$, then $\operatorname{prox}_\gamma$ is the shrinkage (soft-thresholding) operation

$[S_{\lambda,\gamma}(\theta)]_i = \begin{cases} \theta_i - \gamma\lambda_i & \theta_i \ge \gamma\lambda_i \\ 0 & |\theta_i| \le \gamma\lambda_i \\ \theta_i + \gamma\lambda_i & \theta_i \le -\gamma\lambda_i \end{cases}$
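Both examples are one-liners in code; a minimal NumPy sketch (function names are ours, not from any library):

```python
import numpy as np

def prox_box(theta, a):
    """prox of g = I_K for the box K = [0, a]^d: the Euclidean projection P_K
    (independent of the stepsize gamma)."""
    return np.clip(theta, 0.0, a)

def prox_l1(theta, lam, gamma):
    """prox of g(theta) = sum_i lam_i * |theta_i| with stepsize gamma:
    the entrywise soft-thresholding S_{lam,gamma} of the slide."""
    return np.sign(theta) * np.maximum(np.abs(theta) - gamma * np.asarray(lam), 0.0)
```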

Proximal gradient method

Unconstrained problem with the cost function split into two components:

Minimize $f(\theta) = -\ell(\theta) + g(\theta)$

- $-\ell$ convex, differentiable on $\mathbb{R}^n$,
- $g$ closed, convex, possibly non-differentiable... but $\operatorname{prox}_g$ is inexpensive!

Proximal gradient algorithm:

$\theta^{(k)} = \operatorname{prox}_{\gamma_k g}\big(\theta^{(k-1)} + \gamma_k \nabla\ell(\theta^{(k-1)})\big)$,

where $\{\gamma_k, k \in \mathbb{N}\}$ is a sequence of stepsizes, which can be constant, decreasing, or determined by line search.
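A minimal sketch of this iteration on the classical lasso instance of (P), where $-\ell(\theta) = \frac{1}{2}\|A\theta - b\|_2^2$ and $g(\theta) = \lambda\|\theta\|_1$; the data and stepsize choice are our own illustration.

```python
import numpy as np

def proximal_gradient(A, b, lam, n_iter=500):
    """Proximal gradient for the lasso: minimize 0.5*||A@theta - b||^2 + lam*||theta||_1,
    i.e. -ell(theta) = 0.5*||A@theta - b||^2 and g(theta) = lam*||theta||_1."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2        # fixed stepsize gamma = 1/L
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad_ell = -A.T @ (A @ theta - b)          # gradient of ell
        z = theta + gamma * grad_ell               # gradient step (ascent on ell)
        theta = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # prox step
    return theta

rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 20)), rng.normal(size=50)
print(proximal_gradient(A, b, lam=2.0))            # a sparse solution
```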

Interpretation

Denote $\theta^+ = \operatorname{prox}_\gamma(\theta + \gamma\nabla\ell(\theta))$. From the definition of the proximal operator:

$\theta^+ = \operatorname{Argmin}_\vartheta \big( g(\vartheta) + (2\gamma)^{-1}\|\vartheta - \theta - \gamma\nabla\ell(\theta)\|_2^2 \big) = \operatorname{Argmin}_\vartheta \big( g(\vartheta) - \ell(\theta) - \nabla\ell(\theta)^T(\vartheta - \theta) + (2\gamma)^{-1}\|\vartheta - \theta\|_2^2 \big)$.

$\theta^+$ minimizes $g(\vartheta)$ plus a simple quadratic local model of $-\ell(\vartheta)$ around $\theta$. If $\gamma \le 1/L$, the surrogate function on the RHS majorizes the target function, and the algorithm may be seen as a specific instance of the Majorization-Minimization algorithm.

Some specific examples

- If $g(\theta) = 0$, then proximal gradient = gradient method: $\theta^{(k)} = \theta^{(k-1)} + \gamma_k \nabla\ell(\theta^{(k-1)})$.
- If $g(\theta) = I_K(\theta)$, then proximal gradient = projected gradient: $\theta^{(k)} = P_K\big(\theta^{(k-1)} + \gamma_k \nabla\ell(\theta^{(k-1)})\big)$.
- If $g(\theta) = \sum_i \lambda_i |\theta_i|$, then proximal gradient = soft-thresholded gradient: $\theta^{(k)} = S_{\lambda,\gamma_k}\big(\theta^{(k-1)} + \gamma_k \nabla\ell(\theta^{(k-1)})\big)$.

Gradient map

The proximal gradient iteration may be equivalently rewritten as

$\theta^{(k)} = \theta^{(k-1)} - \gamma_k G_{\gamma_k}(\theta^{(k-1)})$,

where the function $G_\gamma$ is given by

$G_\gamma(\theta) = \frac{1}{\gamma}\big(\theta - \operatorname{prox}_\gamma(\theta + \gamma\nabla\ell(\theta))\big)$.

The subgradient characterization of the proximal map implies

$G_\gamma(\theta) \in -\nabla\ell(\theta) + \partial g(\theta - \gamma G_\gamma(\theta))$.

Therefore, $G_\gamma(\theta) = 0$ if and only if $\theta$ minimizes $f(\theta) = -\ell(\theta) + g(\theta)$.

Convergence of the proximal gradient

Assumptions: $f(\theta) = -\ell(\theta) + g(\theta)$,

- $\nabla\ell$ is Lipschitz continuous with constant $L > 0$: $\|\nabla\ell(\theta) - \nabla\ell(\vartheta)\|_2 \le L\|\theta - \vartheta\|_2$ for all $\theta, \vartheta \in \Theta$,
- the optimal value $f^\star$ is finite and attained at $\theta^\star$ (not necessarily unique).

Theorem: $f(\theta^{(k)}) - f^\star$ decreases at least as fast as $1/k$

- if a fixed stepsize $\gamma_k \equiv 1/L$ is used, or
- if backtracking line search is used.

Stochastic Approximation vs minibatch proximal Stochastic Approximation

Back to the original problem! The score function is given by

$\nabla\ell(\theta) = \int H_\theta(x)\, \pi_\theta(\mathrm{d}x)$.

Therefore, at each iteration, the score function must be approximated. The case where $\pi_\theta = \pi$ and the random variables $\{X_n, n \in \mathbb{N}\}$ are each marginally distributed according to $\pi$ = online learning (Juditsky, Nemirovski, 2010; Duchi et al., 2011).

Back to the original problem!

$\nabla\ell(\theta) = \int H_\theta(x)\, \pi_\theta(\mathrm{d}x)$.

Here $\pi_\theta$ depends on the unknown parameter $\theta$... Sampling directly from $\pi_\theta$ is often not feasible. But one may construct a Markov chain, with Markov kernel $P_\theta$, such that $\pi_\theta P_\theta = \pi_\theta$. The Metropolis-Hastings algorithm or Gibbs sampling provides a natural framework to handle such problems.

Stochastic Approximation / Mini-batches

Assume $g \equiv 0$, and consider the recursion

$\theta_{n+1} = \theta_n + \gamma_{n+1} H_{n+1}$,

where $H_{n+1}$ approximates $\nabla\ell(\theta_n)$.

- Stochastic Approximation: $\gamma_n \to 0$ and $H_{n+1} = H_{\theta_n}(X_{n+1})$, with $X_{n+1} \sim P_{\theta_n}(X_n, \cdot)$.
- Mini-batch setting: $\gamma_n \equiv \gamma$ and

$H_{n+1} = m_{n+1}^{-1} \sum_{j=1}^{m_{n+1}} H_{\theta_n}(X_{n+1,j})$,

where $m_n \to \infty$ and $\{X_{n+1,j}\}_{j=1}^{m_{n+1}}$ is a run of length $m_{n+1}$ of a Markov chain with transition kernel $P_{\theta_n}$.

Beware! For SA, $n$ iterations = $n$ simulations. For mini-batches, $n$ iterations = $\sum_{j=1}^n m_j$ simulations.
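A toy sketch contrasting the two recursions (entirely our own illustration: an AR(1) chain stands in for the Metropolis-Hastings/Gibbs kernel $P_\theta$, and $\ell(\theta) = -\theta^2/2$ so that $\nabla\ell(\theta) = -\theta = \int(-x)\,\pi_\theta(\mathrm{d}x)$ with $\pi_\theta = N(\theta, \sigma^2)$):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, sigma = 0.8, 1.0

def kernel(theta, x):
    """One step of a Markov kernel P_theta with invariant law N(theta, sigma^2)."""
    return theta + rho * (x - theta) + sigma * np.sqrt(1 - rho**2) * rng.normal()

H = lambda x: -x   # H_theta(x) = -x, so int H dpi_theta = -theta = grad ell(theta)

# Stochastic approximation: one simulation per iteration, decaying stepsize.
theta_sa, x = 5.0, 0.0
for n in range(1, 2001):
    x = kernel(theta_sa, x)
    theta_sa += n ** -0.6 * H(x)            # gamma_n = n^{-alpha}, alpha in (1/2, 1)

# Mini-batch: constant stepsize, growing batch; n iterations = sum_j m_j simulations.
theta_mb, x, gamma = 5.0, 0.0, 0.5
for n in range(1, 101):
    m = 10 + n                               # m_n -> infinity
    batch = []
    for _ in range(m):
        x = kernel(theta_mb, x)
        batch.append(H(x))
    theta_mb += gamma * np.mean(batch)

print(theta_sa, theta_mb)                    # both near the root theta* = 0
```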

Averaging

$\bar\theta_n \stackrel{\mathrm{def}}{=} \frac{\sum_{k=1}^n a_k \theta_k}{\sum_{k=1}^n a_k} = \Big(1 - \frac{a_n}{\sum_{k=1}^n a_k}\Big)\bar\theta_{n-1} + \frac{a_n}{\sum_{k=1}^n a_k}\theta_n$.

- Stochastic approximation: take $a_n \equiv 1$, $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$; then $\sqrt{n}\,(\bar\theta_n - \theta^\star) \to_D N(0, \sigma^2)$.
- Mini-batch SA: take $a_n \propto m_n$, $\gamma_n \equiv \gamma \le 1/(2L)$, and $m_n \to \infty$ sufficiently fast; then $\sqrt{n}\,(\bar\theta_{N_n} - \theta^\star) \to_D N(0, \sigma^2)$, where $N_n$ is the number of iterations for $n$ simulations: $\sum_{k=1}^{N_n} m_k \le n < \sum_{k=1}^{N_n+1} m_k$.
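The averaging recursion above is cheap to maintain online; a small sketch of the streaming form (ours):

```python
def weighted_average(thetas, weights):
    """Streaming form of the slide's recursion:
    thetabar_n = (1 - a_n/A_n) * thetabar_{n-1} + (a_n/A_n) * theta_n, A_n = sum_{k<=n} a_k."""
    thetabar, A = 0.0, 0.0
    for theta_n, a_n in zip(thetas, weights):
        A += a_n
        thetabar = thetabar + (a_n / A) * (theta_n - thetabar)
    return thetabar

iterates = [5.0, 3.0, 2.0, 1.5]
print(weighted_average(iterates, [1.0] * 4))   # a_n = 1: plain average, 2.875
```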

Stochastic Approximation

$\theta_{n+1} = \theta_n + \gamma_{n+1}\nabla\ell(\theta_n) + \gamma_{n+1}\eta_{n+1}$,
$\eta_{n+1} = H_{\theta_n}(X_{n+1}) - \nabla\ell(\theta_n) = H_{\theta_n}(X_{n+1}) - \pi_{\theta_n}(H_{\theta_n})$.

Idea: split the error into a martingale increment + a remainder term.

Key tool: the Poisson equation $\hat H_\theta - P_\theta \hat H_\theta = H_\theta - \pi_\theta(H_\theta)$.

Decomposition of the error

$\eta_{n+1} = H_{\theta_n}(X_{n+1}) - \pi_{\theta_n}(H_{\theta_n}) = \hat H_{\theta_n}(X_{n+1}) - P_{\theta_n}\hat H_{\theta_n}(X_{n+1})$
$= \big\{\hat H_{\theta_n}(X_{n+1}) - P_{\theta_n}\hat H_{\theta_n}(X_n)\big\} + \big\{P_{\theta_n}\hat H_{\theta_n}(X_{n+1}) - P_{\theta_n}\hat H_{\theta_n}(X_n)\big\}$,

where the first bracket is a martingale increment. We further split the second term:

$P_{\theta_n}\hat H_{\theta_n}(X_{n+1}) - P_{\theta_n}\hat H_{\theta_n}(X_n) = \big\{P_{\theta_{n+1}}\hat H_{\theta_{n+1}}(X_{n+1}) - P_{\theta_n}\hat H_{\theta_n}(X_n)\big\} + \big\{P_{\theta_n}\hat H_{\theta_n}(X_{n+1}) - P_{\theta_{n+1}}\hat H_{\theta_{n+1}}(X_{n+1})\big\}$.

To prove that the remainder term goes to zero, one must establish the regularity of the Poisson solution with respect to $\theta$, i.e. that $\theta \mapsto \hat H_\theta$ and $\theta \mapsto P_\theta \hat H_\theta$ are smooth in some sense... this is not always a trivial issue!

Minibatch case

Assume that the Markov kernel is nice...

Bias:

$\mathrm{E}[\eta_{n+1} \mid \mathcal{F}_n] = m_{n+1}^{-1}\sum_{j=0}^{m_{n+1}-1}\big(\nu_{\theta_n} P_{\theta_n}^j H_{\theta_n} - \pi_{\theta_n} H_{\theta_n}\big) = O(m_{n+1}^{-1})$.

Fluctuation:

$m_{n+1}^{-1}\sum_{j=0}^{m_{n+1}-1} H_{\theta_n}(X_j) - \pi_{\theta_n}(H_{\theta_n}) = m_{n+1}^{-1}\sum_{j=1}^{m_{n+1}-1}\big(\hat H_{\theta_n}(X_j) - P_{\theta_n}\hat H_{\theta_n}(X_{j-1})\big) + \text{remainders}$.

Minibatch case

Contrary to SA, the noise $\eta_{n+1}$ in the recursion

$\theta_{n+1} = \theta_n + \gamma\nabla\ell(\theta_n) + \gamma\eta_{n+1}$

converges to zero a.s., and the stepsize is kept constant, $\gamma_n \equiv \gamma$.

Idea: perturbation of a discrete-time dynamical system

$\tilde\theta_{k+1} = \tilde\theta_k + \gamma\nabla\ell(\tilde\theta_k)$

having a unique fixed point and a Lyapunov function $-\ell$ (i.e. $-\ell(\tilde\theta_{k+1}) \le -\ell(\tilde\theta_k)$), in the presence of the vanishingly small perturbation $\eta_{n+1}$. The a.s. convergence of a perturbed dynamical system with a Lyapunov function can be established under very weak assumptions...

Stochastic proximal gradient

The stochastic proximal gradient sequence $\{\theta_n, n \in \mathbb{N}\}$ can be rewritten as

$\theta_{n+1} = \operatorname{prox}_{\gamma_{n+1}}\big(\theta_n + \gamma_{n+1}\nabla\ell(\theta_n) + \gamma_{n+1}\eta_{n+1}\big)$,

where $\eta_{n+1} \stackrel{\mathrm{def}}{=} H_{n+1} - \nabla\ell(\theta_n)$ is the approximation error.

Questions:
- Convergence and rate of convergence in the SA and mini-batch settings?
- Stochastic Approximation / Minibatch: which one should I prefer?
- Tuning of the parameters (stepsize for SA, size of minibatches, averaging weights, etc.)?
- Acceleration (à la Nesterov)?
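A toy sketch of this perturbed iteration (our own quadratic example, with $-\ell(\theta) = \frac{1}{2}\|\theta - \mu\|_2^2$ so that $\nabla\ell(\theta) = \mu - \theta$, and i.i.d. Gaussian noise standing in for the MCMC approximation error):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, lam, gamma = np.array([2.0, -0.3, 0.0]), 0.5, 0.5

def prox_l1(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

theta = np.zeros(3)
for n in range(1, 201):
    m = 10 + n                                        # growing minibatch size m_n
    noise = rng.normal(size=(m, 3)).mean(axis=0)      # eta_{n+1}: mean of m noisy draws
    H = (mu - theta) + noise                          # H_{n+1} = grad ell(theta_n) + eta_{n+1}
    theta = prox_l1(theta + gamma * H, gamma * lam)   # theta_{n+1} = prox(theta_n + gamma*H_{n+1})

print(theta)   # approaches the exact solution soft(mu, lam) = [1.5, 0, 0]
```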

Main result

Lemma. Suppose that $\{\gamma_n, n \in \mathbb{N}\}$ is decreasing and $0 < L\gamma_n \le 1$ for all $n \ge 1$. For any $\theta^\star \in \Theta$ and any nonnegative sequence $\{a_n, n \in \mathbb{N}\}$,

$\Big(\sum_{j=1}^n a_j\Big)\{f(\bar\theta_n) - f(\theta^\star)\} \le \frac{1}{2}\sum_{j=1}^n \Big(\frac{a_j}{\gamma_j} - \frac{a_{j-1}}{\gamma_{j-1}}\Big)\|\theta_{j-1} - \theta^\star\|^2 + \frac{a_0}{2\gamma_0}\|\theta_0 - \theta^\star\|^2 - \sum_{j=1}^n a_j\big\langle T_{\gamma_j}(\theta_{j-1}) - \theta^\star, \eta_j\big\rangle + \sum_{j=1}^n a_j\gamma_j\|\eta_j\|^2$,

where $T_\gamma(\theta) \stackrel{\mathrm{def}}{=} \operatorname{prox}_\gamma(\theta + \gamma\nabla\ell(\theta))$ is the proximal gradient map.

Stochastic Approximation setting

Take $a_j = \gamma_j$ and decompose, using the Poisson equation, $\eta_j = \xi_j + r_j$, where $\xi_j$ is a martingale increment and $r_j$ is a remainder term:

$\Big(\sum_{j=1}^n \gamma_j\Big)\{f(\bar\theta_n) - f(\theta^\star)\} \le \frac{a_0}{2\gamma_0}\|\theta_0 - \theta^\star\|^2 - \sum_{j=1}^n \gamma_j\big\langle T_{\gamma_j}(\theta_{j-1}) - \theta^\star, \xi_j\big\rangle + \sum_{j=1}^n \gamma_j^2\|\eta_j\|^2 + \text{remainders}$.

The martingale term has a bracket bounded by

$\sum_{j=1}^n \gamma_j^2\,\|\theta_{j-1} - \theta^\star\|^2\, \mathrm{E}\big[\|\xi_j\|^2 \mid \mathcal{F}_{j-1}\big]$.

If $\sum_j \gamma_j = \infty$ and $\sum_j \gamma_j^2 < \infty$, $\{\bar\theta_n, n \in \mathbb{N}\}$ converges; a rate of convergence $\ln(n)/\sqrt{n}$ is obtained by taking $\gamma_j = j^{-1/2}$.

Minibatch setting

Theorem. Let $\{\bar\theta_n, n \ge 0\}$ be the averaged estimator. Then for all $n \ge 1$,

$\Big(\sum_{j=1}^n a_j\Big)\mathrm{E}\big[f(\bar\theta_n) - f(\theta^\star)\big] \le \frac{1}{2}\sum_{j=1}^n\Big(\frac{a_j}{\gamma_j} - \frac{a_{j-1}}{\gamma_{j-1}}\Big)\mathrm{E}\big[\|\theta_{j-1} - \theta^\star\|^2\big] + \frac{a_0}{2\gamma_0}\mathrm{E}\big[\|\theta_0 - \theta^\star\|^2\big] + \sum_{j=1}^n a_j\,\mathrm{E}\big[\|\theta_{j-1} - \theta^\star\|\,\epsilon^{(1)}_{j-1}\big] + \sum_{j=1}^n a_j\gamma_j\,\mathrm{E}\big[\epsilon^{(2)}_{j-1}\big]$,

where $\epsilon^{(1)}_n \stackrel{\mathrm{def}}{=} \big\|\mathrm{E}[\eta_{n+1} \mid \mathcal{F}_n]\big\|$ and $\epsilon^{(2)}_n \stackrel{\mathrm{def}}{=} \mathrm{E}\big[\|\eta_{n+1}\|^2 \mid \mathcal{F}_n\big]$.

Convergence analysis

Corollary. Suppose that $\gamma_n \in (0, 1/L]$ and that there exist constants $C_1, C_2, B < \infty$ such that, for $n \ge 1$,

$\mathrm{E}\big[\epsilon^{(1)}_n\big] \le C_1 m_{n+1}^{-1}, \qquad \mathrm{E}\big[\epsilon^{(2)}_n\big] \le C_2 m_{n+1}^{-1}, \qquad \sup_{n \in \mathbb{N}}\|\theta_n - \theta^\star\| \le B \quad \text{P-a.s.}$

Then, setting $\gamma_n \equiv \gamma$, $m_n = n$ and $a_n \equiv 1$,

$\mathrm{E}\big[f(\bar\theta_n) - f(\theta^\star)\big] \le C/n \quad \text{and} \quad \mathrm{E}\big[f(\bar\theta_{N_n}) - f(\theta^\star)\big] \le C/\sqrt{n}$,

where $N_n$ is the number of iterations performed with $n$ simulations. This is the same rate as for SA, without the logarithmic term.


Potts model

We focus on the particular case where $\mathsf{X} = \{1, \dots, M\}$ and $B(x, y) = \mathbb{1}_{\{x = y\}}$, which corresponds to the well-known Potts model

$f_\theta(x_1, \dots, x_p) = \frac{1}{Z_\theta}\exp\Big(\sum_{i=1}^p \theta_{ii}B_0(x_i) + \sum_{1 \le j < i \le p}\theta_{ij}\mathbb{1}_{\{x_i = x_j\}}\Big)$.

The term $\sum_{i=1}^p \theta_{ii}B_0(x_i)$ is sometimes referred to as the external field and defines the distribution in the absence of interaction. We focus on the case where the interaction terms $\theta_{ij}$, $i \ne j$, are nonnegative. This corresponds to networks where there is either no interaction, or a collaborative interaction, between the nodes.

Algorithm

At the $k$-th iteration, given $\mathcal{F}_k = \sigma(\theta_1, \dots, \theta_k)$:

1. Generate the $\mathsf{X}^p$-valued Markov sequence $\{X_{k+1,j}\}_{j=0}^{m_{k+1}}$ with transition kernel $P_{\theta_k}$ and initial distribution $\nu_{\theta_k}$, and compute the approximate gradient

$H_{k+1} = \frac{1}{n}\sum_{i=1}^n \bar{B}(x^{(i)}) - \frac{1}{m_{k+1}}\sum_{j=1}^{m_{k+1}}\bar{B}(X_{k+1,j})$.

2. Compute $\theta_{k+1} = \Pi_{K_a}\big(s_{\gamma_{k+1},\lambda}(\theta_k + \gamma_{k+1}H_{k+1})\big)$, where the operation $s_{\gamma,\lambda}(M)$ soft-thresholds each entry of the matrix $M$ and the operation $\Pi_{K_a}(M)$ projects each entry of $M$ onto $[0, a]$ (see the sketch below).
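Step 2 is fully explicit; a minimal sketch (ours) of this deterministic half of the iteration, with $H_{k+1}$ supplied by the MCMC sampler of the next slides:

```python
import numpy as np

def potts_update(theta, H, gamma, lam, a):
    """Step 2 of the algorithm: theta_{k+1} = Pi_{K_a}( s_{gamma,lam}(theta_k + gamma*H) ).
    s_{gamma,lam} soft-thresholds each entry; Pi_{K_a} projects each entry onto [0, a],
    enforcing the nonnegative (collaborative) interactions."""
    z = theta + gamma * H
    z = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)   # s_{gamma,lam}
    return np.clip(z, 0.0, a)                                    # Pi_{K_a}
```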

MCMC scheme

For $j < i$, set $b_{ij} = e^{\theta_{ij}}$. Notice that $b_{ij} \ge 1$. For $x = (x_1, \dots, x_p)$,

$f_\theta(x) = \frac{1}{Z_\theta}\exp\Big(\sum_{i=1}^p \theta_{ii}B_0(x_i)\Big)\prod_{1 \le j < i \le p}\big(b_{ij}\mathbb{1}_{\{x_i = x_j\}} + \mathbb{1}_{\{x_i \ne x_j\}}\big)$.

Augment the likelihood with auxiliary variables $\{\delta_{ij}, 1 \le j < i \le p\}$, $\delta_{ij} \in \{0, 1\}$, such that the joint distribution of $(x, \delta)$ is given by

$f_\theta(x, \delta) \propto \exp\Big(\sum_{i=1}^p \theta_{ii}B_0(x_i)\Big)\prod_{j<i}\Big(\mathbb{1}_{\{x_i = x_j\}}\,b_{ij}\big(1 - b_{ij}^{-1}\big)^{\delta_{ij}}\big(b_{ij}^{-1}\big)^{1-\delta_{ij}} + \mathbb{1}_{\{x_i \ne x_j\}}\,0^{\delta_{ij}}\,1^{1-\delta_{ij}}\Big)$.

The marginal distribution of $x$ in this joint distribution is the same $f_\theta$ given above.

MCMC scheme

The auxiliary variables $\{\delta_{ij}, 1 \le j < i \le p\}$ are conditionally independent given $x = (x_1, \dots, x_p)$: if $x_i \ne x_j$, then $\delta_{ij} = 0$ with probability 1; if $x_i = x_j$, then $\delta_{ij} = 1$ with probability $1 - b_{ij}^{-1}$ and $\delta_{ij} = 0$ with probability $b_{ij}^{-1}$.

The auxiliary variables $\{\delta_{ij}, 1 \le j < i \le p\}$ define an undirected graph with nodes $\{1, \dots, p\}$, where there is an edge between $i$ and $j$ if $\delta_{ij} = 1$, and no edge otherwise. This graph partitions the nodes $\{1, \dots, p\}$ into maximal clusters $C_1, \dots, C_K$ (sets of nodes where there is a path joining any two of them). Notice that $\delta_{ij} = 1$ implies $x_i = x_j$; hence all the nodes in a given cluster hold the same value of $x$:

$f_\theta(x \mid \delta) \propto \prod_{k=1}^K\Big(\exp\Big(\sum_{i \in C_k}\theta_{ii}B_0(x_i)\Big)\prod_{j<i,\; i,j \in C_k}\mathbb{1}_{\{x_i = x_j\}}\Big)$.

Wolff algorithm

Given $X = (X_1, \dots, X_p)$:

1. Randomly select a node $i \in \{1, \dots, p\}$, and set $C_0 = \{i\}$.
2. Grow $C_0$ until it can no longer grow: for each new addition $j$ to $C_0$, and for each $j' \notin C_0$ such that $\theta_{jj'} > 0$, starting with $\delta_{jj'} = 0$, do the following. If $X_j = X_{j'}$, set $\delta_{jj'} = 1$ with probability $1 - e^{-\theta_{jj'}}$. If $\delta_{jj'} = 1$, add $j'$ to $C_0$.
3. If $X_i = v$, randomly select $v' \in \{1, \dots, M\} \setminus \{v\}$, and propose a new vector $X' \in \mathsf{X}^p$, where $X'_j = v'$ for $j \in C_0$ and $X'_j = X_j$ for $j \notin C_0$. Accept $X'$ with probability (see the sketch after this list)

$1 \wedge \exp\Big(\big(B_0(v') - B_0(v)\big)\sum_{j \in C_0}\theta_{jj}\Big)$.
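A sketch of one such cluster move (ours; we take $B_0(v) = v$ for concreteness, and a dense loop over candidate neighbours rather than adjacency lists):

```python
import numpy as np

def wolff_step(X, theta, M, rng):
    """One Wolff cluster move for the Potts model of the slides.
    theta: symmetric (p, p) matrix; off-diagonal theta[i, j] >= 0 are interactions,
    diagonal theta[i, i] weights the external field. B0(v) = v is an illustrative choice."""
    B0 = lambda v: float(v)
    p = len(X)
    i = rng.integers(p)
    cluster, frontier = {i}, [i]
    while frontier:                          # grow C_0 until it can no longer grow
        j = frontier.pop()
        for k in range(p):
            if k in cluster or theta[j, k] <= 0:
                continue
            if X[k] == X[j] and rng.random() < 1 - np.exp(-theta[j, k]):
                cluster.add(k)               # delta_{jk} = 1: add k to the cluster
                frontier.append(k)
    v = X[i]
    v_new = rng.choice([u for u in range(M) if u != v])
    # Accept with probability 1 ^ exp((B0(v') - B0(v)) * sum_{j in C_0} theta_jj).
    log_acc = (B0(v_new) - B0(v)) * sum(theta[j, j] for j in cluster)
    if np.log(rng.random()) < min(0.0, log_acc):
        X = X.copy()
        X[list(cluster)] = v_new
    return X

rng = np.random.default_rng(3)
theta = np.full((5, 5), 0.3)
np.fill_diagonal(theta, 0.1)
X = rng.integers(0, 3, size=5)               # M = 3 states on p = 5 nodes
for _ in range(100):
    X = wolff_step(X, theta, M=3, rng=rng)
print(X)
```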

Figure: Simulation results for p = 50, n = 500 observations, 1% of off-diagonal terms, minibatch, m_n = 100 + n. Panels (vs. iterations, 0-1000): relative error, true positive rate, and false discovery rate, for Stoch. Grad., Stoch. Grad. Averaged, and the accelerated scheme.

Figure: Simulation results for p = 100, n = 500 observations, 1% of off-diagonal terms, minibatch, m_n = 100 + n. Panels (vs. iterations, 0-1000): relative error, true positive rate, and false discovery rate.

Figure: Simulation results for p = 200, n = 500 observations, 1% of off-diagonal terms, minibatch, m_n = 100 + n. Panels (vs. iterations, 0-1000): relative error, true positive rate, and false discovery rate.


Take-home message

- Efficient and globally converging procedures for penalized likelihood inference in incomplete data models are available when the complete-data likelihood is globally concave and the sparsity-inducing penalty is convex (provided that computing the proximal operator is easy).
- Stochastic Approximation and Minibatch algorithms achieve the same rate, which is $1/\sqrt{n}$, where $n$ is the number of simulations.
- Minibatch algorithms are in general preferable if the computation of the proximal operator is complex.

Thanks for your attention... and patience!