On Markov Chain Gradient Descent


Tao Sun, College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China (nudtsuntao@163.com). Yuejiao Sun, Department of Mathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA (sunyj@math.ucla.edu). Wotao Yin, Department of Mathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA (wotaoyin@math.ucla.edu).

Abstract

Stochastic gradient methods are the workhorse algorithms of large-scale optimization problems in machine learning, signal processing, and other computational sciences and engineering. This paper studies Markov chain gradient descent, a variant of stochastic gradient descent where the random samples are taken on the trajectory of a Markov chain. Existing results of this method assume convex objectives and a reversible Markov chain and thus have their limitations. We establish new non-ergodic convergence under wider step sizes, for nonconvex problems, and for non-reversible finite-state Markov chains. Nonconvexity makes our method applicable to broader problem classes. Non-reversible finite-state Markov chains, on the other hand, can mix substantially faster. To obtain these results, we introduce a new technique that varies the mixing levels of the Markov chains. The reported numerical results validate our contributions.

1 Introduction

In this paper, we consider a stochastic minimization problem. Let $\Xi$ be a statistical sample space with probability distribution $\Pi$ (we omit the underlying $\sigma$-algebra). Let $X \subseteq \mathbb{R}^n$ be a closed convex set, which represents the parameter space. $F(\,\cdot\,;\xi): X \to \mathbb{R}$ is a closed convex function associated with $\xi \in \Xi$. We aim to solve the following problem:
$$\operatorname*{minimize}_{x \in X \subseteq \mathbb{R}^n} \ \mathbb{E}_{\xi}\big[F(x;\xi)\big] = \int F(x,\xi)\, d\Pi(\xi). \qquad (1)$$
A common method to minimize (1) is Stochastic Gradient Descent (SGD) [11]:
$$x^{k+1} = \operatorname{Proj}_X\big(x^k - \gamma_k \nabla F(x^k;\xi^k)\big), \quad \text{samples } \xi^k \overset{\mathrm{i.i.d.}}{\sim} \Pi. \qquad (2)$$
However, for some problems and distributions, direct sampling from $\Pi$ is expensive or impossible, and it is possible that the sample space $\Xi$ is not explicitly known. In these cases, it can be much cheaper to sample by following a Markov chain that has the desired equilibrium distribution $\Pi$.

This work is supported in part by the National Key R&D Program of China 2017YFB0202902 and the China Scholarship Council. The work of Y. Sun and W. Yin was supported in part by AFOSR MURI FA9550-18-1-0502, NSF DMS-1720237, NSFC 11728105, and ONR N000141712162.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

To be concrete, imagine solving problem (1) with a discrete space $\Xi := \{x \in \{0,1\}^n \mid \langle a, x\rangle \le b\}$, where $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$, and the uniform distribution $\Pi$ over $\Xi$. A straightforward way to obtain a uniform sample is to iteratively draw $x \in \{0,1\}^n$ uniformly at random until the constraint $\langle a, x\rangle \le b$ is satisfied. Even if the feasible set is small, it may take up to $O(2^n)$ iterations to get a feasible sample. Instead, one can sample a trajectory of a Markov chain described in [4]; to obtain a sample $\varepsilon$-close to the distribution $\Pi$, one only needs $\log(|\Xi|/\varepsilon)\,\exp\!\big(O(\sqrt{n}\,(\log n)^{5/2})\big)$ samples [2], where $|\Xi|$ is the cardinality of $\Xi$. This presents a significant saving in sampling cost.

Markov chains also naturally arise in some applications. Common examples are systems that evolve according to Markov chains, for example, linear dynamic systems with random transitions or errors. Another example is a distributed system in which every node locally stores a subset of training samples; to train a model using these samples, we can let a token that holds all the model parameters traverse the nodes following a random walk, so the samples are accessed according to a Markov chain.

Suppose that the Markov chain has a stationary distribution $\Pi$ and a finite mixing time $T$, which is how long a random trajectory needs to be until its current state has a distribution that roughly matches $\Pi$. A larger $T$ means a closer match. Then, in order to run one iteration of (2), we can generate a trajectory of samples $\xi^1, \xi^2, \xi^3, \dots, \xi^T$ and only take the last sample $\xi := \xi^T$. To run another iteration of (2), we repeat this process, i.e., sample a new trajectory $\xi^1, \xi^2, \xi^3, \dots, \xi^T$ and take $\xi := \xi^T$.

Clearly, sampling a long trajectory just to use the last sample wastes a lot of samples, especially when $T$ is large. But this may seem necessary because $\xi^t$, for all small $t$, have large biases. After all, it can take a long time for a random trajectory to explore all of the space, and it will often double back and visit states that it previously visited. Furthermore, it is also difficult to choose an appropriate $T$. A small $T$ will cause a large bias in $\xi^T$, which slows the SGD convergence and reduces its final accuracy. A large $T$, on the other hand, is wasteful, especially when $x^k$ is still far from convergence and some bias does not prevent (2) from making good progress. Therefore, $T$ should increase adaptively as $k$ increases; this makes the choice of $T$ even more difficult.

So, why waste samples, why worry about $T$, and why not just apply every sample immediately in stochastic gradient descent? This approach has appeared in [5, 6], which we call the Markov Chain Gradient Descent (MCGD) algorithm for problem (1):
$$x^{k+1} = \operatorname{Proj}_X\big(x^k - \gamma_k \hat\nabla F(x^k;\xi^k)\big), \qquad (3)$$
where $\xi^0, \xi^1, \dots$ are samples on a Markov chain trajectory and $\hat\nabla F(x^k;\xi^k) \in \partial F(x^k;\xi^k)$ is a subgradient.

Let us examine some special cases. Suppose the distribution $\Pi$ is supported on a set of $M$ points, $y^1, \dots, y^M$. Then, by letting $f_i(x) := M \cdot \mathrm{Prob}(\xi = y^i)\, F(x, y^i)$, problem (1) reduces to the finite-sum problem:
$$\operatorname*{minimize}_{x \in X \subseteq \mathbb{R}^d} \ f(x) \equiv \frac{1}{M}\sum_{i=1}^{M} f_i(x). \qquad (4)$$
By the definition of $f_i$, each state $i$ has the uniform probability $1/M$. At each iteration of MCGD, we have
$$x^{k+1} = \operatorname{Proj}_X\big(x^k - \gamma_k \hat\nabla f_{j_k}(x^k)\big), \qquad (5)$$
where $(j_k)_{k \ge 0}$ is a trajectory of a Markov chain on $\{1, 2, \dots, M\}$ that has a uniform stationary distribution. Here, $(\xi^k)_{k \ge 0} \subseteq \Xi$ and $(j_k)_{k \ge 0} \subseteq [M]$ are two different, but related, Markov chains. Starting from a deterministic and arbitrary initialization $x^0$, the iteration is illustrated by the following diagram:
$$j_0 \longrightarrow j_1 \longrightarrow j_2 \longrightarrow \cdots \qquad\qquad x^0 \longrightarrow x^1 \longrightarrow x^2 \longrightarrow x^3 \longrightarrow \cdots \qquad (6)$$
In the diagram, given each $j_k$, the next state $j_{k+1}$ is statistically independent of $j_{k-1}, \dots, j_0$; given $j_k$ and $x^k$, the next iterate $x^{k+1}$ is statistically independent of $j_{k-1}, \dots, j_0$ and $x^{k-1}, \dots, x^0$.
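To make scheme (5) concrete, here is a minimal Python sketch of MCGD on a toy finite-sum problem (our illustration, not the paper's experiment): the quadratic losses, the transition matrix `P`, the ball constraint, and the stepsize exponent are all assumed for the example.

```python
import numpy as np

# Minimal MCGD sketch for the finite-sum problem (4)-(5):
# the index j_k is sampled along ONE Markov-chain trajectory, not i.i.d.
rng = np.random.default_rng(0)
M, d = 4, 3
targets = rng.standard_normal((M, d))        # toy losses f_i(x) = 0.5*||x - targets[i]||^2

def grad_f(i, x):
    return x - targets[i]                    # gradient of f_i at x

# A simple symmetric (hence doubly stochastic) transition matrix on {0,...,M-1},
# so the stationary distribution is uniform, as scheme (5) assumes.
P = np.full((M, M), 0.1) + 0.6 * np.eye(M)
P = P / P.sum(axis=1, keepdims=True)

def proj_ball(x, radius=10.0):               # Proj_X onto a Euclidean ball (toy X)
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

x = np.zeros(d)
j = 0                                        # initial state j_0
for k in range(1, 5001):
    gamma = 1.0 / k**0.75                    # gamma_k = 1/k^q with q in (1/2, 1)
    x = proj_ball(x - gamma * grad_f(j, x))  # MCGD step (5)
    j = rng.choice(M, p=P[j])                # move to the next state of the chain

print("MCGD iterate:", x)
print("true minimizer:", targets.mean(axis=0))
```

The only difference from plain SGD in this sketch is the last line of the loop: the sample index follows the chain's transition kernel instead of being drawn i.i.d. from the uniform distribution.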

Another application of MCGD involves a network: consider a strongly connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with the set of vertices $\mathcal{V} = \{1, 2, \dots, M\}$ and set of edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. Each node $j \in \{1, 2, \dots, M\}$ possesses some data and can compute $\nabla f_j(\cdot)$. To run MCGD, we employ a token that carries the variable $x$, walking randomly over the network. When it reaches a node $j$, node $j$ reads $x$ from the token and computes $\nabla f_j(\cdot)$ to update $x$ according to (5). Then, the token walks away to a random neighbor of node $j$.

1.1 Numerical tests

We present two kinds of numerical results. The first shows that MCGD uses fewer samples to train both a convex model and a nonconvex model. The second demonstrates the advantage of the faster mixing of a non-reversible Markov chain. Our results on nonconvex objectives and non-reversible chains are new.

1.1.1 Comparison with SGD

Let us compare:
1. MCGD (3), where $j_k$ is taken from one trajectory of the Markov chain;
2. SGD$T$, for $T = 1, 2, 4, 8, 16, 32$, where each $j_k$ is the $T$th sample of a fresh, independent trajectory.
All trajectories are generated by starting from the same initial state. To compute $T$ gradients, SGD$T$ uses $T$ times as many samples as MCGD. We did not try to adapt $T$ as $k$ increases because theoretical guidance for doing so is lacking.

In the first test, we recover a vector $u$ from an auto-regressive process, which closely resembles the first experiment in [1]. Set the matrix $A$ as a subdiagonal matrix with random entries $A_{i,i-1} \sim \mathcal{U}[0.8, 0.99]$. Randomly sample a vector $u \in \mathbb{R}^d$, $d = 50$, with unit 2-norm. Our data $(\xi_t^1, \xi_t^2)_{t=1}^{\infty}$ are generated according to the following auto-regressive process:
$$\xi_t^1 = A\,\xi_{t-1}^1 + e_1 W_t, \quad W_t \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, 1);$$
$$\bar\xi_t^2 = \begin{cases} 1, & \text{if } \langle u, \xi_t^1 \rangle > 0, \\ 0, & \text{otherwise;} \end{cases} \qquad \xi_t^2 = \begin{cases} \bar\xi_t^2, & \text{with probability } 0.8, \\ 1 - \bar\xi_t^2, & \text{with probability } 0.2. \end{cases}$$
Clearly, $(\xi_t^1, \xi_t^2)_{t \ge 1}$ forms a Markov chain. Let $\Pi$ denote the stationary distribution of this Markov chain. We recover $u$ as the solution to the following problem:
$$\operatorname*{minimize}_{x} \ \mathbb{E}_{(\xi^1,\xi^2) \sim \Pi}\, \ell(x;\xi^1,\xi^2).$$
We consider both convex and nonconvex loss functions, which were not done before in the literature. The convex one is the logistic loss
$$\ell(x;\xi^1,\xi^2) = -\xi^2 \log\big(\sigma(\langle x, \xi^1\rangle)\big) - (1 - \xi^2)\log\big(1 - \sigma(\langle x, \xi^1\rangle)\big),$$
where $\sigma(t) = \frac{1}{1 + \exp(-t)}$. The nonconvex one is taken as $\ell(x;\xi^1,\xi^2) = \frac{1}{2}\big(\sigma(\langle x, \xi^1\rangle) - \xi^2\big)^2$ from [7]. We choose $\gamma_k = \frac{1}{k^q}$ as our stepsize, where $q = 0.501$. This choice is consistent with our theory below.

Our results in Figure 1 are surprisingly positive on MCGD, even more so than we had expected. As anticipated, MCGD used significantly fewer total samples than SGD$T$ for every $T$. But it is surprising that MCGD did not even need more gradient evaluations. Randomly generated data must have helped homogenize the samples over the different states, making it less important for a trajectory to converge. It is important to note that SGD1 and SGD2, as well as SGD4 in the nonconvex case, stagnate at noticeably lower accuracies because their $T$ values are too small for convergence.
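For readers who want to reproduce the flavor of this convex test, the Python sketch below generates the auto-regressive Markov chain and runs the logistic-loss MCGD update described above; it is our illustration under the stated distributions, and details such as the iteration count and random seed are assumptions rather than the exact setup behind Figure 1.

```python
import numpy as np

# Illustrative sketch of the auto-regressive Markov chain and the convex
# (logistic-loss) MCGD run described above; seed and iteration count are assumed.
rng = np.random.default_rng(1)
d = 50
A = np.diag(rng.uniform(0.8, 0.99, d - 1), k=-1)   # subdiagonal entries A_{i,i-1}
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                             # ground truth with unit 2-norm
e1 = np.zeros(d); e1[0] = 1.0

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.zeros(d)
xi1 = np.zeros(d)
for k in range(1, 20001):
    # one step of the Markov chain (xi^1_t, xi^2_t)
    xi1 = A @ xi1 + e1 * rng.standard_normal()
    label = 1.0 if u @ xi1 > 0 else 0.0
    xi2 = label if rng.random() < 0.8 else 1.0 - label
    # MCGD step with the logistic loss: gradient = (sigma(<x, xi1>) - xi2) * xi1
    x -= (sigma(x @ xi1) - xi2) * xi1 / k**0.501   # gamma_k = 1/k^q, q = 0.501

print("alignment with u:", float(u @ x / np.linalg.norm(x)))
```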

Figure 1: Comparisons of MCGD and SGD$T$ for $T = 1, 2, 4, 8, 16, 32$, in the convex and nonconvex cases. The panels plot $f(\bar x^k) - f(x^*)$ against the number of gradient evaluations and the number of samples, where $\bar x^k$ is the average of $x^1, \dots, x^k$.

1.1.2 Comparison of reversible and non-reversible Markov chains

We also compare the convergence of MCGD when working with reversible and non-reversible Markov chains (the definition of reversibility is given in the next section). As mentioned in [14], transforming a reversible Markov chain into a non-reversible Markov chain can significantly accelerate the mixing process. This technique also helps to accelerate the convergence of MCGD.

In our experiment, we first construct an undirected connected graph with $n = 20$ nodes and randomly generated edges. Let $G$ denote the adjacency matrix of the graph, that is,
$$G_{i,j} = \begin{cases} 1, & \text{if } i, j \text{ are connected;} \\ 0, & \text{otherwise.} \end{cases}$$
Let $d_{\max}$ be the maximum number of outgoing edges of a node. Select $d = 10$ and compute $\beta \sim \mathcal{N}(0, I_d)$. The transition probability of the reversible Markov chain, known as the Metropolis-Hastings Markov chain, is then defined by
$$P_{i,j} = \begin{cases} \frac{1}{d_{\max}}, & \text{if } j \ne i,\ G_{i,j} = 1; \\[2pt] 1 - \frac{\sum_{j \ne i} G_{i,j}}{d_{\max}}, & \text{if } j = i; \\[2pt] 0, & \text{otherwise.} \end{cases}$$
Obviously, $P$ is symmetric and the stationary distribution is uniform. The non-reversible Markov chain is constructed by adding cycles. The edges of these cycles are directed; let $V$ denote the adjacency matrix of these cycles. If $V_{i,j} = 1$, then $V_{j,i} = 0$. Let $w_0 > 0$ be the weight of the flows along these cycles. Then we construct the transition probability of the non-reversible Markov chain as follows:
$$Q_{i,j} = \frac{W_{i,j}}{\sum_{l} W_{i,l}}, \quad \text{where } W = d_{\max} P + w_0 V.$$
See [14] for an explanation of why this change makes the chain mix faster. In our experiment, we add 5 cycles of length 4, with edges existing in $G$; $w_0$ is set to $\frac{d_{\max}}{2}$.

We test MCGD on a least squares problem. First, we select $\beta \sim \mathcal{N}(0, I_d)$; then, for each node $i$, we generate $x_i \sim \mathcal{N}(0, I_d)$ and $y_i = x_i^T \beta$. The objective function is defined as
$$f(\beta) = \frac{1}{2n}\sum_{i=1}^{n} (x_i^T \beta - y_i)^2.$$
The convergence results are depicted in Figure 2.

Figure 2: Comparison of reversible and non-reversible Markov chains, plotting $f(\beta^k) - f^*$ against the iteration count. The second largest eigenvalues of the reversible and non-reversible Markov chains are 0.75 and 0.66, respectively.
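The sketch below shows one way to realize this construction in Python, building the Metropolis-Hastings matrix $P$ from an adjacency matrix and a non-reversible $Q$ by adding a weighted directed cycle; the random graph, the single hard-coded cycle, and the edge probability are our illustrative assumptions, not the exact instance behind Figure 2.

```python
import numpy as np

# Illustrative construction of the reversible Metropolis-Hastings chain P and a
# non-reversible chain Q obtained by row-normalizing W = d_max * P + w0 * V.
rng = np.random.default_rng(2)
n = 20
G = (rng.random((n, n)) < 0.2).astype(float)       # random undirected graph (toy)
G = np.triu(G, 1)
G = G + G.T                                        # symmetric adjacency, no self-loops
for i in range(n):                                 # add a ring so the graph is connected
    G[i, (i + 1) % n] = G[(i + 1) % n, i] = 1.0
d_max = int(G.sum(axis=1).max())

# Metropolis-Hastings transition matrix: symmetric, uniform stationary distribution
P = G / d_max
np.fill_diagonal(P, 1.0 - G.sum(axis=1) / d_max)

# One directed cycle with weight w0 (the paper adds 5 cycles of length 4 along
# existing edges of G; a single cycle keeps the sketch short).
cycle = [0, 1, 2, 3]
V = np.zeros((n, n))
for a, b in zip(cycle, cycle[1:] + cycle[:1]):
    V[a, b] = 1.0
w0 = d_max / 2.0
W = d_max * P + w0 * V
Q = W / W.sum(axis=1, keepdims=True)               # non-reversible transition matrix

# Faster mixing typically shows up as a smaller second-largest eigenvalue modulus.
lam2 = lambda T: np.sort(np.abs(np.linalg.eigvals(T)))[::-1][1]
print("second-largest |eigenvalue|: reversible", lam2(P), " non-reversible", lam2(Q))
```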

1.2 Known approaches and results

It is more difficult to analyze MCGD due to its biased samples. To see this, let $p_{k,j}$ denote the probability of selecting $f_j$ in the $k$th iteration. SGD's uniform probability selection ($p_{k,j} \equiv \frac{1}{M}$) yields an unbiased gradient estimate
$$\mathbb{E}_{j_k}\big(\nabla f_{j_k}(x^k)\big) = C\,\nabla f(x^k) \qquad (7)$$
for some $C > 0$. However, in MCGD, it is possible to have $p_{k,j} = 0$ for some $k, j$. Consider a random walk: the probability $p_{k,j}$ is determined by the current state $j_k$, and we have $p_{k,i} > 0$ only for $i \in \mathcal{N}(j_k)$ and $p_{k,i} = 0$ for $i \notin \mathcal{N}(j_k)$, where $\mathcal{N}(j_k)$ denotes the neighborhood of $j_k$. Therefore, we no longer have (7). All analyses of MCGD must deal with this biased expectation.

Papers [6, 5] investigate the conditional expectation $\mathbb{E}_{j_{k+\tau} \mid j_k}\big(\nabla f_{j_{k+\tau}}(x^k)\big)$. For a sufficiently large $\tau \in \mathbb{Z}_+$, it is sufficiently close to $\frac{1}{M}\sum_{i=1}^{M}\nabla f_i(x^k)$ (but still different). In [6, 5], the authors proved that, to achieve an $\epsilon$ error, MCGD with stepsize $O(\epsilon)$ can return a solution in $O(\frac{1}{\epsilon^2})$ iterations. Their error bound is given in the ergodic sense and using $\liminf$. The authors of [10] proved that $\liminf f(x^k)$ and $\mathbb{E}\,\mathrm{dist}^2(x^k, X^*)$ have almost sure convergence under diminishing stepsizes $\gamma_k = \frac{1}{k^q}$, $\frac{2}{3} < q \le 1$. Although the authors did not compute any rates, we computed that their stepsizes lead to a solution with $\epsilon$ error in $O(\epsilon^{-\frac{1}{1-q}})$ iterations for $\frac{2}{3} < q < 1$, and $O(e^{1/\epsilon})$ iterations for $q = 1$. In [1], the authors improved the stepsizes to $\gamma_k = \frac{1}{\sqrt{k}}$ and showed ergodic convergence; in other words, to achieve an $\epsilon$ error, it is enough to run MCGD for $O(\frac{1}{\epsilon^2})$ iterations. There is no non-ergodic result regarding the convergence of $f(x^k)$. It is worth mentioning that [10, 1] use time non-homogeneous Markov chains, where the transition probability can change over the iterations as long as there is still a finite mixing time. In [1], MCGD is generalized from gradient descent to mirror descent. In all these works, the Markov chain is required to be reversible, and all functions $f_i$, $i \in [M]$, are assumed to be convex. However, non-reversible chains can have substantially faster convergence and are thus more numerically efficient.

1.3 Our approaches and results

In this paper, we improve the analyses of MCGD to non-reversible finite-state Markov chains and to nonconvex functions. The former allows us to have faster mixing, and the latter frequently appears in applications. Our convergence result is given in the non-ergodic sense, though the rate results are still given in the ergodic sense. It is important to mention that, in our analysis, the mixing time of the underlying Markov chain is not tied to a fixed mixing level but can vary over different levels. This is essential because MCGD needs time to reduce its objective error from its current value to a lower one, and this time becomes longer when the current value is lower, since a more accurate Markov chain convergence, and thus a longer mixing time, are required.

When $f_1, f_2, \dots, f_M$ are all convex, we allow them to be non-differentiable and MCGD to use subgradients, provided that $X$ is bounded. When any of them is nonconvex, we assume $X$ is the full space and $f_1, f_2, \dots, f_M$ are differentiable with bounded gradients. The bounded-gradient assumption is due to a technical difficulty associated with nonconvexity. Specifically, in the convex setting, we prove $\lim_k \mathbb{E} f(x^k) = f^*$ (the minimum of $f$ over $X$) for both exact and inexact MCGD with stepsizes $\gamma_k = \frac{1}{k^q}$, $\frac{1}{2} < q < 1$. The convergence rates of MCGD with exact and inexact subgradient computations are presented. The first analysis of nonconvex MCGD is also presented, with its convergence given in terms of the expectation of $\|\nabla f(x^k)\|$. These results hold for non-reversible finite-state Markov chains and can be extended to time non-homogeneous Markov chains under extra assumptions ([10, Assumptions 4 and 5] and [1, Assumption C]), which essentially ensure finite mixing.
Our results for finite-state Markov chains are first presented in Sections 3 and 4. They are extended to continuous-state reversible Markov chains in Section 5. Some novel results are developed based on new techniques and approaches introduced in this paper. To obtain the stronger results in general cases, we use varying mixing times rather than a fixed one. We list possible extensions of MCGD that are not discussed in this paper. The first is accelerated versions, including Nesterov's acceleration and variance-reduction schemes. The second is the design and optimization of Markov chains to improve the convergence of MCGD.

2 Preliminaries

2.1 Markov chain

We recall some definitions, properties, and existing results about Markov chains. Although we use finite-state time-homogeneous Markov chains, the results can be extended to more general chains under extra assumptions similar to [10, Assumptions 4, 5] and [1, Assumption C].

Definition 1 (finite-state time-homogeneous Markov chain) Let $P$ be an $M \times M$ matrix with real-valued elements. A stochastic process $X_1, X_2, \dots$ in a finite state space $[M] := \{1, 2, \dots, M\}$ is called a time-homogeneous Markov chain with transition matrix $P$ if, for $k \in \mathbb{N}$, $i, j \in [M]$, and $i_0, i_1, \dots, i_{k-1} \in [M]$, we have
$$\mathbb{P}(X_{k+1} = j \mid X_0 = i_0, X_1 = i_1, \dots, X_k = i) = \mathbb{P}(X_{k+1} = j \mid X_k = i) = P_{i,j}. \qquad (8)$$

Let the probability distribution of $X_k$ be denoted by the non-negative row vector $\pi^k = (\pi_1^k, \pi_2^k, \dots, \pi_M^k)$, that is, $\mathbb{P}(X_k = j) = \pi_j^k$; $\pi^k$ satisfies $\sum_{i=1}^{M} \pi_i^k = 1$. When the Markov chain is time-homogeneous, we have $\pi^k = \pi^{k-1} P$ and
$$\pi^k = \pi^{k-1} P = \cdots = \pi^0 P^k, \qquad (9)$$
for $k \in \mathbb{N}$, where $P^k$ denotes the $k$th power of $P$. A Markov chain is irreducible if, for any $i, j \in [M]$, there exists $k$ such that $(P^k)_{i,j} > 0$. State $i \in [M]$ is said to have period $d$ if $(P^k)_{i,i} = 0$ whenever $k$ is not a multiple of $d$ and $d$ is the greatest integer with this property. If $d = 1$, then we say state $i$ is aperiodic. If every state is aperiodic, the Markov chain is said to be aperiodic.

Any time-homogeneous, irreducible, and aperiodic Markov chain has a stationary distribution $\pi^* = \lim_k \pi^k = [\pi_1^*, \pi_2^*, \dots, \pi_M^*]$ with $\sum_{i=1}^{M} \pi_i^* = 1$ and $\min_i \{\pi_i^*\} > 0$, and $\pi^* = \pi^* P$. It also holds that
$$\lim_k P^k = [\pi^*; \pi^*; \dots; \pi^*] =: \Pi^* \in \mathbb{R}^{M \times M}. \qquad (10)$$
The largest eigenvalue of $P$ is 1, and the corresponding left eigenvector is $\pi^*$.

Assumption 1 The Markov chain $(X_k)_{k \ge 0}$ is time-homogeneous, irreducible, and aperiodic. It has a transition matrix $P$ and a stationary distribution $\pi^*$.

2.2 Mixing time

Mixing time measures how long a Markov chain evolves until its current state has a distribution very close to its stationary distribution. The literature has a thorough investigation of various kinds of mixing times, with the majority for reversible Markov chains (that is, $\pi_i^* P_{i,j} = \pi_j^* P_{j,i}$). Mixing times of non-reversible Markov chains are discussed in [3]. In this part, we consider a new type of mixing time for non-reversible Markov chains. The proofs are based on basic matrix analysis. Our mixing time gives a direct relationship between $k$ and the deviation of the distribution of the current state from the stationary distribution.

To state the lemma, we review some basic notions in linear algebra. Let $\mathbb{C}$ be the complex field. The modulus of a complex number $a \in \mathbb{C}$ is denoted $|a|$. For a vector $x \in \mathbb{C}^n$, the $\ell^\infty$ and $\ell^2$ norms are defined as $\|x\|_\infty := \max_i |x_i|$ and $\|x\|_2 := \sqrt{\sum_{i=1}^{n} |x_i|^2}$. For a matrix $A = [a_{i,j}] \in \mathbb{C}^{m \times n}$, its entrywise max norm and Frobenius norm are $\|A\|_\infty := \max_{i,j} |a_{i,j}|$ and $\|A\|_F := \sqrt{\sum_{i,j} |a_{i,j}|^2}$, respectively.

We know $P^k \to \Pi^*$ as $k \to \infty$. The following lemma presents a deviation bound for finite $k$.

Lemma 1 Let Assumption 1 hold, let $\lambda_i(P) \in \mathbb{C}$ be the $i$th largest eigenvalue of $P$, and let
$$\lambda(P) := \frac{\max\{|\lambda_2(P)|, |\lambda_M(P)|\} + 1}{2} \in [0, 1).$$
Then, we can bound the largest entrywise absolute value of the deviation matrix $\delta^k := \Pi^* - P^k \in \mathbb{R}^{M \times M}$ as
$$\|\delta^k\|_\infty \le C_P \cdot \lambda^k(P) \qquad (11)$$
for $k \ge K_P$, where $C_P$ is a constant that also depends on the Jordan canonical form of $P$, and $K_P$ is a constant that depends on $\lambda(P)$ and $\lambda_2(P)$. Their formulas are given in (45) and (46) in the Supplementary Material.
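As a quick numerical sanity check of Lemma 1 (our illustration; the three-state chain is an assumed toy example), the Python snippet below computes $\lambda(P)$ and confirms that $\|\Pi^* - P^k\|_\infty$ decays geometrically:

```python
import numpy as np

# Toy check of Lemma 1: the entrywise deviation ||Pi* - P^k||_inf decays geometrically.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.3, 0.3, 0.4]])     # irreducible, aperiodic 3-state chain (assumed)

# Stationary distribution pi*: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()
Pi_star = np.tile(pi, (3, 1))       # every row of Pi* equals pi*, as in (10)

# lambda(P) = (max{|lambda_2(P)|, |lambda_M(P)|} + 1) / 2, as defined in Lemma 1.
mods = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
lam = (max(mods[1], mods[-1]) + 1.0) / 2.0

Pk = np.eye(3)
for k in range(1, 31):
    Pk = Pk @ P
    deviation = np.max(np.abs(Pi_star - Pk))       # ||delta^k||_inf
    if k % 10 == 0:
        print(f"k={k:2d}   ||delta^k||_inf = {deviation:.2e}   lambda(P)^k = {lam**k:.2e}")
```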

Remark 1 If $P$ is symmetric, then all $\lambda_i(P)$'s are real and nonnegative, $K_P = 0$, and $C_P \le M^{\frac{3}{2}}$. Furthermore, the bound in Lemma 1 can be improved by directly using $\lambda_2(P)$ on the right side:
$$\|\delta^k\|_\infty \le \|\delta^k\|_F \le M^{\frac{3}{2}} \lambda_2^k(P), \quad k \ge 0.$$

3 Convergence analysis for convex minimization

This part considers the convergence of MCGD in the convex case, i.e., $f_1, f_2, \dots, f_M$ and $X$ are all convex. We investigate the convergence of scheme (5). We prove non-ergodic convergence of the expected objective value sequence under diminishing non-summable stepsizes, where the stepsizes are required to be "almost" square summable. Therefore, the convergence requirements are almost the same as those of SGD. This section uses the following assumption.

Assumption 2 The set $X$ is assumed to be convex and compact.

Now, we present the convergence results for MCGD in the convex (but not necessarily differentiable) case. Let $f^*$ be the minimum value of $f$ over $X$.

Theorem 1 Let Assumptions 1 and 2 hold and let $(x^k)_{k \ge 0}$ be generated by scheme (5). Assume that $f_i$, $i \in [M]$, are convex functions, and the stepsizes satisfy
$$\sum_k \gamma_k = +\infty, \qquad \sum_k \ln k \cdot \gamma_k^2 < +\infty. \qquad (12)$$
Then, we have
$$\lim_k \mathbb{E} f(x^k) = f^*. \qquad (13)$$
Define
$$\psi(P) := \max\Big\{1, \frac{1}{\ln(1/\lambda(P))}\Big\}.$$
We have
$$\mathbb{E}\big(f(\bar x^k) - f^*\big) = O\bigg(\frac{\psi(P)}{\sum_{i=1}^{k} \gamma_i}\bigg), \qquad (14)$$
where $\bar x^k := \frac{\sum_{i=1}^{k} \gamma_i x^i}{\sum_{i=1}^{k} \gamma_i}$. Therefore, if we select the stepsize $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$, we get the rate $\mathbb{E}\big(f(\bar x^k) - f^*\big) = O\big(\frac{\psi(P)}{k^{1-q}}\big)$.

Furthermore, consider the inexact version of MCGD:
$$x^{k+1} = \operatorname{Proj}_X\Big(x^k - \gamma_k\big(\hat\nabla f_{j_k}(x^k) + e^k\big)\Big), \qquad (15)$$
where the noise sequence $(e^k)_{k \ge 0}$ is arbitrary but obeys
$$\sum_{k=2}^{+\infty} \frac{\|e^k\|^2}{\ln k} < +\infty. \qquad (16)$$
Then, for iteration (15), results (13) and (14) still hold; furthermore, if $\|e^k\| = O(\frac{1}{k^p})$ with $p > \frac{1}{2}$ and $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$, the rate $\mathbb{E}\big(f(\bar x^k) - f^*\big) = O\big(\frac{\psi(P)}{k^{1-q}}\big)$ also holds.

The stepsize requirement (12) is nearly identical to that of SGD and subgradient algorithms. In the theorem above, we use the stepsize setting $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$. This kind of stepsize requirement also works for SGD and subgradient algorithms. The convergence rate of MCGD is $O\big(\frac{1}{\sum_{i=1}^{k} \gamma_i}\big) = O\big(\frac{1}{k^{1-q}}\big)$, which is the same as that of SGD and subgradient algorithms for $\gamma_k = O(\frac{1}{k^q})$.
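To make the stepsize condition concrete (this verification is ours, added for illustration), take $\gamma_k = \frac{1}{k^q}$ with $\frac{1}{2} < q < 1$: then $\sum_k \gamma_k = \sum_k \frac{1}{k^q} = +\infty$ because $q < 1$, while $\sum_{k \ge 2} \ln k \cdot \gamma_k^2 = \sum_{k \ge 2} \frac{\ln k}{k^{2q}} < +\infty$ because $2q > 1$ and $\ln k$ grows slower than any positive power of $k$. Moreover, $\sum_{i=1}^{k} \gamma_i = \Theta(k^{1-q})$, which is exactly how the $O\big(\frac{\psi(P)}{k^{1-q}}\big)$ rate above follows from (14).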

4 Convergence analysis for nonconvex minimization

This section considers the convergence of MCGD when one or more of the $f_i$ is nonconvex. In this case, we assume $f_i$, $i = 1, 2, \dots, M$, are differentiable and $\nabla f_i$ is Lipschitz with constant $L$.² We also set $X$ as the full space. We study the following scheme:
$$x^{k+1} = x^k - \gamma_k \nabla f_{j_k}(x^k). \qquad (17)$$
We prove non-ergodic convergence of the expected gradient norm of $f$ under diminishing non-summable stepsizes. The stepsize requirements in this section are slightly stronger than those in the convex case, with an extra $\ln k$ factor. In this part, we use the following assumption.

Assumption 3 The gradients of the $f_i$ are assumed to be bounded, i.e., there exists $D > 0$ such that
$$\|\nabla f_i(x)\| \le D, \quad i \in [M]. \qquad (18)$$

We use this new assumption because $X$ is now the full space, and we have to bound the size of $\nabla f_i(x^k)$ directly. In the nonconvex case, we cannot obtain objective value convergence, and we only bound the gradients. Now, we are prepared to present our convergence results for nonconvex MCGD.

Theorem 2 Let Assumptions 1 and 3 hold and let $(x^k)_{k \ge 0}$ be generated by scheme (17). Also, assume each $f_i$ is differentiable and $\nabla f_i$ is $L$-Lipschitz, and the stepsizes satisfy
$$\sum_k \gamma_k = +\infty, \qquad \sum_k \ln^2 k \cdot \gamma_k^2 < +\infty. \qquad (19)$$
Then, we have
$$\lim_k \mathbb{E}\|\nabla f(x^k)\| = 0, \qquad (20)$$
and
$$\mathbb{E}\Big(\min_{i \le k}\big\{\|\nabla f(x^i)\|^2\big\}\Big) = O\bigg(\frac{\psi(P)}{\sum_{i=1}^{k} \gamma_i}\bigg), \qquad (21)$$
where $\psi(P)$ is defined in Theorem 1. If we select the stepsize as $\gamma_k = O(\frac{1}{k^q})$, $\frac{1}{2} < q < 1$, then we get the rate $\mathbb{E}\big(\min_{i \le k}\{\|\nabla f(x^i)\|^2\}\big) = O\big(\frac{\psi(P)}{k^{1-q}}\big)$.

Furthermore, let $(e^k)_{k \ge 0}$ be a sequence of noise and consider the inexact nonconvex MCGD iteration:
$$x^{k+1} = x^k - \gamma_k\big(\nabla f_{j_k}(x^k) + e^k\big). \qquad (22)$$
If the noise sequence obeys
$$\sum_{k=1}^{+\infty} \gamma_k \|e^k\| < +\infty, \qquad (23)$$
then the convergence results (20) and (21) still hold for inexact nonconvex MCGD. In addition, if we set $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$ and the noise satisfies $\|e^k\| = O(\frac{1}{k^p})$ for $p + q > 1$, then (20) still holds and $\mathbb{E}\big(\min_{i \le k}\{\|\nabla f(x^i)\|^2\}\big) = O\big(\frac{\psi(P)}{k^{1-q}}\big)$.

The proof of Theorem 2 differs from the previous one. In particular, due to nonconvexity, we cannot expect any sort of convergence to $f(x^*)$, where $x^* \in \operatorname*{argmin} f$. Instead, we use the Lipschitz continuity of $\nabla f_i$ ($i \in [M]$) to derive a "descent" inequality. Here, the $O(\cdot)$ contains a polynomial composition of the constants $D$ and $L$. Compared with MCGD in the convex case, the stepsize requirements of nonconvex MCGD are slightly stronger; in the summable part, we need $\sum_k \ln^2 k \cdot \gamma_k^2 < +\infty$ rather than $\sum_k \ln k \cdot \gamma_k^2 < +\infty$. Nevertheless, we can still use $\gamma_k = O(\frac{1}{k^q})$ for $\frac{1}{2} < q < 1$.

² This is for the convenience of the presentation in the proofs. If each $\nabla f_i$ has its own Lipschitz constant $L_i$, it is possible to improve our results slightly. But we simply set $L := \max_i \{L_i\}$.

5 Convergence analysis for continuous state space

When the state space $\Xi$ is a continuum, there are infinitely many possible states. In this case, we consider an infinite-state Markov chain that is time-homogeneous and reversible. Using the results in [8, Theorem 4.9], the mixing time of this kind of Markov chain still enjoys a geometric decrease like (11). Since Lemma 1 is based on a linear-algebra analysis, it no longer applies to the continuous case. Nevertheless, the previous results still hold with nearly unchanged proofs under the following assumption:

Assumption 4 For any $\xi \in \Xi$, $|F(x;\xi) - F(y;\xi)| \le L\|x - y\|$, $\sup_{x \in X, \xi \in \Xi}\{\|\hat\nabla F(x;\xi)\|\} \le D$, $\mathbb{E}_\xi \hat\nabla F(x;\xi) \in \partial\, \mathbb{E}_\xi F(x;\xi)$, and $\sup_{x,y \in X, \xi \in \Xi} |F(x;\xi) - F(y;\xi)| \le H$.

We consider the general scheme
$$x^{k+1} = \operatorname{Proj}_X\Big(x^k - \gamma_k\big(\hat\nabla F(x^k;\xi^k) + e^k\big)\Big), \qquad (24)$$
where $\xi^k$ are samples on a Markov chain trajectory. If $e^k \equiv 0$, the scheme reduces to (3).

Corollary 1 Assume $F(\,\cdot\,;\xi)$ is convex for each $\xi \in \Xi$. Let the stepsizes satisfy (12), $(x^k)_{k \ge 0}$ be generated by Algorithm (24), and $(e^k)_{k \ge 0}$ satisfy (16). Let $F^* := \min_{x \in X} \mathbb{E}_\xi F(x;\xi)$. If Assumption 4 holds and the Markov chain is time-homogeneous, irreducible, aperiodic, and reversible, then we have
$$\lim_k \mathbb{E}\big(\mathbb{E}_\xi F(x^k;\xi) - F^*\big) = 0, \qquad \mathbb{E}\big(\mathbb{E}_\xi F(\bar x^k;\xi) - F^*\big) = O\bigg(\frac{\max\{1, \frac{1}{\ln(1/\lambda)}\}}{\sum_{i=1}^{k} \gamma_i}\bigg),$$
where $0 < \lambda < 1$ is the geometric rate of the mixing time of the Markov chain (which corresponds to $\lambda(P)$ in the finite-state case).

Next, we present our result for a possibly nonconvex objective function $F(\,\cdot\,;\xi)$ under the following assumption.

Assumption 5 For any $\xi \in \Xi$, $F(x;\xi)$ is differentiable, and $\|\nabla F(x;\xi) - \nabla F(y;\xi)\| \le L\|x - y\|$. In addition, $\sup_{x \in X, \xi \in \Xi}\{\|\nabla F(x;\xi)\|\} < +\infty$, $X$ is the full space, and $\mathbb{E}_\xi \nabla F(x;\xi) = \nabla \mathbb{E}_\xi F(x;\xi)$.

Since $F(x;\xi)$ is differentiable and $X$ is the full space, the iteration reduces to
$$x^{k+1} = x^k - \gamma_k\big(\nabla F(x^k;\xi^k) + e^k\big). \qquad (25)$$

Corollary 2 Let the stepsizes satisfy (19), $(x^k)_{k \ge 0}$ be generated by Algorithm (25), the noise obey (23), and Assumption 5 hold. Assume the Markov chain is time-homogeneous, irreducible, aperiodic, and reversible. Then, we have
$$\lim_k \mathbb{E}\|\nabla \mathbb{E}_\xi F(x^k;\xi)\| = 0, \qquad \mathbb{E}\Big(\min_{i \le k}\big\{\|\nabla \mathbb{E}_\xi F(x^i;\xi)\|^2\big\}\Big) = O\bigg(\frac{\max\{1, \frac{1}{\ln(1/\lambda)}\}}{\sum_{i=1}^{k} \gamma_i}\bigg), \qquad (26)$$
where $0 < \lambda < 1$ is the geometric rate of the mixing time of the Markov chain.

6 Conclusion

In this paper, we have analyzed the stochastic gradient descent method where the samples are taken on a trajectory of a Markov chain. One of our main contributions is a non-ergodic convergence analysis for convex MCGD, which uses a novel line of analysis. The result is then extended to inexact gradients. This analysis lets us establish convergence for non-reversible finite-state Markov chains and for nonconvex minimization problems. Our results are useful in cases where it is impossible or expensive to take samples directly from a distribution, or the distribution is not even known, but sampling via a Markov chain is possible. Our results also apply to decentralized learning over a network, where we can employ a random walker to traverse the network and minimize an objective that is defined over the samples held at the nodes in a distributed fashion.

References

[1] John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549-1578, 2012.

[2] Martin Dyer, Alan Frieze, Ravi Kannan, Ajai Kapoor, Ljubomir Perkovic, and Umesh Vazirani. A mildly exponential time algorithm for approximating the number of solutions to a multidimensional knapsack problem. Combinatorics, Probability and Computing, 2(3):271-284, 1993.

[3] James Allen Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process. The Annals of Applied Probability, 1(1):62-87, 1991.

[4] Mark Jerrum and Alistair Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. Citeseer, 1996.

[5] Björn Johansson, Maben Rabi, and Mikael Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In Decision and Control, 2007 46th IEEE Conference on, pages 4705-4710. IEEE, 2007.

[6] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157-1170, 2009.

[7] Song Mei, Yu Bai, Andrea Montanari, et al. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747-2774, 2018.

[8] Ravi Montenegro, Prasad Tetali, et al. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237-354, 2006.

[9] Rufus Oldenburger et al. Infinite powers of matrices and characteristic roots. Duke Mathematical Journal, 6(2):357-361, 1940.

[10] S. Sundhar Ram, A. Nedić, and Venugopal V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691-717, 2009.

[11] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

[12] Herbert Robbins and David Siegmund. A convergence theorem for non-negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233-257. Elsevier, 1971.

[13] Ralph Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 2015.

[14] Konstantin S. Turitsyn, Michael Chertkov, and Marija Vucelja. Irreversible Monte Carlo algorithms for efficient sampling. Physica D: Nonlinear Phenomena, 240(4-5):410-414, 2011.

[15] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834-2848, 2018.