Stochastic Proximal Gradient Algorithm

Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind help of A. Juditsky LJK, Grenoble

1 Motivation 2 3 4 5

Problem Statement (P) Argmin θ Θ { l(θ) + g(θ)}, where l is a smooth log-likelihood function or some other smooth statistical learning function, and g is a possibly non-smooth convex penalty term. This problem has attracted a lot of attention with the growing need to address high-dimensional statistical problems This work focuses on the case where the function l and its gradient l are both intractable, and where l is given by l(θ) = H θ (x)π θ (dx), for some probability measure π θ.

Network structure Problem: Estimate sparse network structures from measurements on the nodes. For discrete measurement: amounts to estimate a Gibbs measure with pair-wise interactions f θ (x 1,..., x p ) = 1 p exp θ ii B 0 (x i ) + θ ij B(x i, x j ) Z θ, i=1 1 j<i p for a function B 0 : X R, and a symmetric function B : X X R, where X is a finite set. The absence of an edge encodes conditional independence.

Network structure Each graph represents a model class of graphical models; learning a graph then is a model class selection problem. Constraint-based approaches: test conditional independence from the data and then determine a graph that most closely represents those independencies. Score-based approaches combine a metric for the complexity of the graph with a measure of the goodness of fit of the graph to the data... but the number of graph structures grows super-exponentially, and the problem is in general NP-hard.

Network structure For x X p, define B(x) def = (B jk (x j, x k )) 1 j,k p R p p. The l 1 -penalized maximum likelihood estimate of θ is obtained by solving an optimization problem of the form (P) where l and g are given by l(θ) = 1 n n i=1 θ, B(x (i) ) log Z θ, g(θ) = λ θ jk. 1 k<j p

Fisher identity Fact 1: Z θ is the normalization constant is given by Z θ = x exp( θ, B(x) where the sum is over all the possible configurations. Fact 2: the gradient log Z θ is the expectation of the sufficient statistics: log Z θ = B(x)f θ (x) x Problem: None of these quantities can be computed explicitly... Nevertheless, they can be estimated using Monte Carlo integration.

General framework where re Argmin θ Θ { l(θ) + g(θ)}, l is a smooth log-likelihood function or some other smooth statistical learning function, g is a non-smooth convex sparsity-inducing penalty. the function l and its gradient l are intractable, The score function l is given by l(θ) = H θ (x)π θ (dx), for some probability measure π θ on some measurable space (X, B), and some function H θ : X Θ.

1 Motivation 2 3 4 5

Definition Definition: Proximal mapping associated with closed convex function g and stepsize γ ( prox γ (θ) = Argmin ϑ Θ g(ϑ) + (2γ) 1 ϑ θ 2 ) 2 If g = I K, where K is a closed convex set (I K (x) = 0, x K, I K (x) = otherwise), then prox γ is the Euclidean projection on K prox γ (θ) = Argmin ϑ K ϑ θ 2 2 = P K (θ) if g(θ) = p i=1 λ i θ i then prox g is shrinkage (soft threshold) operation θ i γλ i θ i γλ i [S λ,γ (θ)] i = 0 θ i γλ i θ i + γλ i θ i γλ i

Proximal gradient method Unconstrained problem with cost function split in two components Minimize f(θ) = l(θ) + g(θ) l convex, differentiable with dom(g) = R n g closed, convex, possibly non differentiable... but prox g is inexpensive! Proximal gradient algorithm θ (k) = prox γk g(θ (k 1) + γ k l(θ (k 1) )) where {γ k, k N} is a sequence stepsizes, which either be constant, decreasing or determined by line search

Interpretation Denote θ + = prox γ (θ + γ l(θ)) from definition of proximal operator: θ + = Argmin ϑ (g(ϑ) + (2γ) 1 ϑ θ γ l(θ) 2 2) = Argmin ϑ (g(ϑ) l(θ) l(θ) T (ϑ θ) + (2γ) 1 ϑ θ 2 2). θ + minimizes g(ϑ) plus a simple quadratic local model of l(ϑ) around θ If γ 1/L, the surrogate function on the RHS majorizes the target function, and the algorithm might be seen as a specific instance of the Majorization-Minimization algorithm.

Some specific examples if g(θ) = 0 then proximal gradient = gradient method. θ (k) = θ (k 1) + γ k l(θ (k 1) ) if g(θ) = I K (θ), then proximal gradient = projected gradient θ (k) = P K (θ (k 1) + γ k l(θ (k 1) )). if g(θ) = i λ i θ i then proximal gradient = soft-thresholded gradient θ (k) = S λ,γk (θ (k 1) + γ k l(θ (k 1) ))

Gradient map The proximal gradient may be equivalently rewritten as θ (k) = θ (k 1) γ k G γk (θ (k 1) ) where the function G γ is given by G γ (θ) = 1 γ (θ prox γ(θ + γ l(θ))) The subgradient characterization of the proximal map implies G γ (θ) l(θ) + g(θ γg γ (θ)) Therefore, G γ (θ) = 0 if and only if θ minimizes f(θ) = l(θ) + g(θ)

Convergence of the proximal gradient Assumptions: f(θ) = l(θ) + g(θ) l is Lipschitz continuous with constant L > 0 l(θ) l(ϑ) 2 L θ ϑ 2 θ, ϑ Θ optimal value f is finite and attained at θ (not necessarily unique) Theorem f(θ (k) ) f decreases at least as fast as 1/k if fixed step size γ k 1/L is used if backtracking line search is used

Stochastic Approximation vs minibatch proximal Stochastic Approximation 1 Motivation 2 3 4 5

Stochastic Approximation vs minibatch proximal Stochastic Approximation Back to the original problem! The score function l is given by l(θ) = H θ (x)π θ (dx). Therefore, at each iteration, the score function should be approximated. The case where π θ = π and and random variables {X n, n N} each marginally distributed according to π = online learning (Juditsky, Nemirovski, 2010, Duchi et al, 2011).

Stochastic Approximation vs minibatch proximal Stochastic Approximation Back to the original problems! l(θ) = H θ (x)π θ (dx). π θ depends on the unknown parameter θ... Sampling directly from π θ is often not directly feasible. But one may construct a Markov chain, with Markov kernel P θ, such that π θ P θ = π θ The Metropolis-Hastings algorithm or Gibbs sampling provides a natural framework to handle such problems.

Stochastic Approximation vs minibatch proximal Stochastic Approximation Stochastic Approximation / Mini-batches g 0 where H n+1 approximates l(θ n ). θ n+1 = θ n + γ n+1 H n+1 Stochastic Approximation: γ n 0 and H n+1 = H θn (X n+1 ) and X n+1 P θn (X n, ). Mini-batches setting: γ n γ and m n+1 1 H n+1 = m 1 n+1 H(θ n, X n+1,j ), j=0 where m n and {X n+1,j } mn+1 j=1 is a run of the length m n+1 of a Markov chain with transition kernel P θn. Beware! For SA, n iterations = n simulations. For minibatches, n iterations = n j=1 m j simulations.

Stochastic Approximation vs minibatch proximal Stochastic Approximation Averaging θ n def = n k=1 a kθ k n k=1 a k = ( 1 ) a n n k=1 a θ n 1 + k a n n k=1 a k θ n. Stochastic approximation: take a n 1, γ n = Cn α with α (1/2, 1), then n ( θn θ ) D N(0, σ 2 ) Mini-batch SA: take a n m n, γ n γ 1/(2L) and m n sufficiently fast, then n ( θnn θ ) D N(0, σ 2 ) where N n is the number of iterations for n simulations: Nn k=1 m k n < N n+1 k=1 m k.

Stochastic Approximation vs minibatch proximal Stochastic Approximation Stochastic Approximation θ n+1 = θ n + γ n+1 l(θ n ) + γ n+1 η n+1 η n+1 = H θn (X n+1 ) l(θ n ) = H θn (X n+1 ) π θn (H θn ). Idea Split the error into a martingale increment + remainder term Key tool Poisson equation Ĥ θ P θ Ĥ θ = H θ π θ (H θ ).

Stochastic Approximation vs minibatch proximal Stochastic Approximation Decomposition of the error η n+1 = H θn (X n+1 ) π θn (H θn ) = Ĥθ n (X n+1 ) P θn Ĥ θn (X n+1 ) = Ĥθ n (X n+1 ) P θn Ĥ θn (X n ) + P θn Ĥ θn (X n+1 ) P θn Ĥ θn (X n ) We further split the error P θn Ĥ θn (X n+1 ) P θn Ĥ θn (X n ) = P θn+1 Ĥ θn+1 (X n+1 ) P θn Ĥ θn (X n )+P θn Ĥ θn (X n+1 ) P θn+1 Ĥ θn+1 (X n+1 ). To prove that the remainder term goes to zero, it is required to prove the regularity of the Poisson solution with respect to θ, to prove that θ Ĥθ and θ P θ Ĥ θ is smooth in some sense... this is not always a trivial issue!

Stochastic Approximation vs minibatch proximal Stochastic Approximation Minibatch case Assume that the Markov kernel is nice... Bias E [η n+1 F n ] = m 1 n+1 Fluctuation m n+1 1 m 1 n+1 j=0 m n+1 1 j=0 H θn (X j ) π θn (H θn ) (ν θn P jθn H θn π θn H θn ) = O(m 1 n+1 ) m n+1 1 = m 1 n+1 Ĥ θn (X j ) P θn Ĥ θn (X j 1 ) + remainders j=1

Stochastic Approximation vs minibatch proximal Stochastic Approximation Minibatches case Contrary to SA, the noise η n+1 in the recursion θ n+1 = θ n + γ l(θ n ) + γη n+1 converges to zero a.s. and the stepsize is kept constant γ n = γ. Idea: perturbation of a discrete time dynamic system θ k+1 = θ k + γ l( θ k ) having a unique fixed point and a Lyapunov function l: l(θ k+1 ) l(θ k ) in presence of vanishingly small perturbation η n+1. a.s convergence of perturbed dynamical system with a Lyapunov function can be established under very weak assumptions...

Stochastic Approximation vs minibatch proximal Stochastic Approximation Stochastic proximal gradient The stochastic proximal gradient sequence {θ n, n N} can be rewritten as θ n+1 = prox γn+1 (θ n + γ n+1 l(θ n ) + γ n+1 η n+1 ), where η n+1 def = H n+1 l(θ n ) is the approximation error. Questions: Convergence and rate of convergence in the SA and mini-batch settings? Stochastic Approximation / Minibatch: which one should I prefer? Tuning of the parameters (stepsize for SA, size of minibatches, averaging weights, etc...) Acceleration (à la Nesterov)?

Stochastic Approximation vs minibatch proximal Stochastic Approximation Main result Lemma Suppose that {γ n, n N} is decreasing and 0 < Lγ n 1 for all n 1. For any θ Θ, and any nonnegative sequence {a n, n N}, n {f( θ n ) f(θ )} j=1 a j 1 2 + n j=1 ( aj a ) j 1 θ j 1 θ 2 + a 0 θ 0 θ 2 γ j γ j 1 2γ 0 n n a j Tγj (θ j 1 ) θ, η j + a j γ j η j 2, j=1 where T γ (θ) = def = prox γ (θ + γ l(θ)) is the gradient proximal map, j=1

Stochastic Approximation vs minibatch proximal Stochastic Approximation Stochastic Approximation setting Take a j = γ j and decompose, using the Poisson equation, η j = ξ j + r j, where ξ j is a martingale term and r j is a remainder term. n γ j {f( θ n ) f(θ )} a 0 θ 0 θ 2 2γ 0 j=1 + n n γ j Tγj (θ j 1 ) θ, ξ j + γj 2 η j 2 + remainders, j=1 j=1 The red term is a martingale with a bracket bounded by n γj 2 θ j 1 θ 2 E [ ξ j 2 ] F j 1 j=1 If j=1 γ j = and j=1 γ2 j <, { θ n, n N} converges. rate of convergence ln(n)/ n by taking γ j = j 1/2.

Stochastic Approximation vs minibatch proximal Stochastic Approximation Minibatch setting Theorem Let { θ n, n 0} be the average estimator. Then for all n 1, n E [ f( θ n ) f(θ ) ] j=1 a j 1 2 + j=1 n j=1 ( aj a ) j 1 E [ θ j 1 θ 2] + a 0 E[ θ 0 θ 2 ] γ j γ j 1 2γ 0 n [ ] a j E θ j 1 θ ɛ (1) j 1 + n j=1 [ ] a j γ j E ɛ (2) j 1. where ɛ (1) n def = E [η n+1 F n ], ɛ (2) def n = E [ η ] n+1 2 Fn.

Stochastic Approximation vs minibatch proximal Stochastic Approximation Convergence analysis Corollary Suppose that γ n (0, 1/L], and there exist constants C 1, C 2, B < such that, for n 1, E[ɛ (1) n ] C 1 m 1 n+1, E[ɛ(2) n ] C 2 m 1 n+1 Then, setting γ n = γ, m n = n and a n 1,, and sup θ n θ B, P a.s. n N E [f(θ n ) f(θ )] C/n and E [f(θ Nn ) f(θ )] C/ n where N n is the number of iterations for n simulations. This is the same rate than for the SA (without the logarithmic term).

1 Motivation 2 3 4 5

Potts model We focus on the particular case where X = {1,..., M}, and B(x, y) = 1 {x=y}, which corresponds to the well known Potts model f θ (x 1,..., x p ) = 1 p exp θ ii B 0 (x i ) + Z θ i=1 1 j<i p θ ij 1 {xi=x j}. The term p i=1 θ iib 0 (x i ) is sometimes referred to as the external field and defines the distribution in the absence of interaction. We focus on the case where the interactions terms θ ij for i j are nonnegative. This corresponds to networks with there is either no interaction, or collaborative interactions between the nodes.

Algorithm At the k-th iteration, and given F k = σ(θ 1,..., θ k ): 1 generate the X p -valued Markov sequence {X k+1,j } m k+1 j=0 with transition P θk and initial distribution ν θk, and compute the approximate gradient H k+1 = 1 n n i=1 B(x (i) ) 1 m k+1 m k+1 j=1 B(X k+1,j ), 2 Compute θ k+1 = Π Ka ( sγk+1,λ (θ k + γ k+1 H k+1 ) ), the operation s γ,λ (M) soft-thresholds each entry of the matrix M, and the operation Π Ka (M) projects each entry of M on [0, a].

MCMC scheme For j i, we set b ij = e θij. Notice that b ij 1. For x = (x 1,..., x p ), ( p ) f θ (x) = 1 ( ) exp θ ii B 0 (x i ) bij 1 {xi=x Z j} + 1 {xi x j}. θ i=1 1 j<i p Augment the likelihood with auxiliary variables {δ ij, 1 j < i p}, δ ij {0, 1} such that the joint distribution of (x, δ) is given by ( p ) f θ (x, δ) exp θ ii B 0 (x i ) j<i ( i=1 1 {xi=x j}b ij ( 1 b 1 ij ) δij ) δ b ij 1 ij + 1 {xi x j}0 δij 1 1 δij. The marginal distribution of x in this joint distribution is the same f θ given above.

MCMC scheme The auxiliary variables {δ ij, 1 j < i p} are conditionally independent given x = (x 1,..., x p ); if x i x j, then δ ij = 0 with probability 1. If x i = x j, then δ ij = 1 with probability 1 b 1 ij, and δ ij = 0 with probability b 1 ij. The auxiliary variables {δ ij, 1 j < i p} defines an undirected graph with nodes {1,..., p} where there is an edge between i j if δ ij = 1, and there is no edge otherwise. This graph partitions the nodes {1,..., p} into maximal clusters C 1,..., C K (a set of nodes where there is a path joining any two of them). Notice that δ ij = 1 implies x i = x j. Hence all the nodes in a given cluster holds the same value of x. ( ) K f θ (x δ) exp θ ii B 0 (x i ). i C k k=1 j<i,(i,j) C k 1 {xi=x j}

Wolff algorithm Given X = (X 1,..., X p ) 1 Randomly select a node i {1,..., p}, and set C 0 = {i}. 2 Do until C 0 can no longer grow. For each new addition j to C 0, and for each j / C 0 such that θ jj > 0, starting with δ jj = 0, do the following. If X j = X j, set δ jj = 1 with probability 1 e θ jj. If δ jj = 1, add j to C 0. 3 If X i = v, randomly select v {1,..., M} \ {v}, and propose a new vector X X p, where X j = v for j C 0 and X j = X j for j / C 0. Accept X with probability 1 exp (B 0 (v ) B 0 (v)) θ jj. j C0

Relative error rate 0.5 0.6 0.7 0.8 0.9 1.0 1.1 Stoch. Grad. Stoch. Grad. Averaged Acc. scheme True positive rate 0.50 0.55 0.60 0.65 0.70 0.75 0.80 Stoch. Grad. Acc. scheme False discovery rate 0.0 0.1 0.2 0.3 0.4 Stoch. Grad. Acc. scheme 0 200 400 600 800 1000 Iterations 0 200 400 600 800 1000 Iterations 0 200 400 600 800 1000 Iterations Figure : Simulation results for p = 50, n = 500 observations, 1% of off-diagonal terms, minibatch, m n = 100 + n

Relative error rate 0.80 0.85 0.90 0.95 1.00 1.05 1.10 Stoch. Grad. Stoch. Grad. Averaged True positive rate 0.76 0.78 0.80 0.82 0.84 False discovery rate 0.0 0.1 0.2 0.3 0.4 0 200 400 600 800 1000 Iterations 0 200 400 600 800 1000 Iterations 0 200 400 600 800 1000 Iterations Figure : Simulation results for p = 100, n = 500 observations, 1% of off-diagonal terms, n = 500 observations, 1% of off-diagonal terms, minibatch, m n = 100 + n

Relative error rate 0.80 0.85 0.90 0.95 1.00 1.05 1.10 Stoch. Grad. Stoch. Grad. Averaged True positive rate 0.80 0.85 0.90 0.95 1.00 False discovery rate 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0 200 400 600 800 1000 Iterations 0 200 400 600 800 1000 Iterations 0 200 400 600 800 1000 Iterations Figure : Simulation results for p = 200, n = 500 observations, 1% of off-diagonal terms, 1% of off-diagonal terms, minibatch, m n = 100 + n

1 Motivation 2 3 4 5

Take-home message Efficient and globally converging procedure for penalized likelihood inference in incomplete data models are available if the complete data likelihood is globally concave with convex sparsity-inducing penalty (provided that computing the proximal operator is easy) Stochastic Approximation and Minibatch algorithms achieve the same rate, which is 1/ n where n is the number of simulations. Minibatch algorithms are in general preferable if the computation of the proximal operator is complex. Thanks for your attention... and patience!