On Markov Chain Gradient Descent


Tao Sun
College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China

Yuejiao Sun
Department of Mathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA

Wotao Yin
Department of Mathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA

Abstract

Stochastic gradient methods are the workhorse algorithms of large-scale optimization problems in machine learning, signal processing, and other computational sciences and engineering. This paper studies Markov chain gradient descent, a variant of stochastic gradient descent where the random samples are taken on the trajectory of a Markov chain. Existing results of this method assume convex objectives and a reversible Markov chain and thus have their limitations. We establish new non-ergodic convergence under wider step sizes, for nonconvex problems, and for non-reversible finite-state Markov chains. Nonconvexity makes our method applicable to broader problem classes. Non-reversible finite-state Markov chains, on the other hand, can mix substantially faster. To obtain these results, we introduce a new technique that varies the mixing levels of the Markov chains. The reported numerical results validate our contributions.

1 Introduction

In this paper, we consider a stochastic minimization problem. Let $\Xi$ be a statistical sample space with probability distribution $\Pi$ (we omit the underlying $\sigma$-algebra). Let $X \subseteq \mathbb{R}^n$ be a closed convex set, which represents the parameter space. $F(\cdot\,;\xi): X \to \mathbb{R}$ is a closed convex function associated with $\xi \in \Xi$. We aim to solve the following problem:

$$\min_{x \in X \subseteq \mathbb{R}^n} \; \mathbb{E}_\xi F(x;\xi) = \int_\Xi F(x,\xi)\, d\Pi(\xi). \tag{1}$$

A common method to minimize (1) is Stochastic Gradient Descent (SGD) [11]:

$$x^{k+1} = \mathrm{Proj}_X\big(x^k - \gamma_k \nabla F(x^k;\xi^k)\big), \quad \text{samples } \xi^k \overset{\text{i.i.d.}}{\sim} \Pi. \tag{2}$$

However, for some problems and distributions, direct sampling from $\Pi$ is expensive or impossible, and it is possible that the sample space $\Xi$ is not explicitly known. In these cases, it can be much cheaper to sample by following a Markov chain that has the desired equilibrium distribution $\Pi$.

The work is supported in part by the National Key R&D Program of China 207YFB and the China Scholarship Council. The work of Y. Sun and W. Yin was supported in part by AFOSR MURI FA, NSF DMS, NSFC 72805, and ONR N.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

To be concrete, imagine solving problem (1) with a discrete space $\Xi := \{x \in \{0,1\}^n : \langle a, x\rangle \le b\}$, where $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$, and the uniform distribution $\Pi$ over $\Xi$. A straightforward way to obtain a uniform sample is to iteratively sample $x \in \{0,1\}^n$ uniformly at random until the constraint $\langle a, x\rangle \le b$ is satisfied. Even if the feasible set is small, it may take up to $O(2^n)$ iterations to get a feasible sample. Instead, one can sample a trajectory of a Markov chain described in [4]; to obtain a sample $\varepsilon$-close to the distribution $\Pi$, one only needs $O\big(\log(|\Xi|/\varepsilon)\exp(O(\sqrt{n}(\log n)^{5/2}))\big)$ samples [2], where $|\Xi|$ is the cardinality of $\Xi$. This presents a significant saving in sampling cost.

Markov chains also naturally arise in some applications. Common examples are systems that evolve according to Markov chains, for example, linear dynamic systems with random transitions or errors. Another example is a distributed system in which every node locally stores a subset of training samples; to train a model using these samples, we can let a token that holds all the model parameters traverse the nodes following a random walk, so the samples are accessed according to a Markov chain.

Suppose that the Markov chain has a stationary distribution $\Pi$ and a finite mixing time $T$, which is how long a random trajectory needs to be until its current state has a distribution that roughly matches $\Pi$. A larger $T$ means a closer match. Then, in order to run one iteration of (2), we can generate a trajectory of samples $\xi^1, \xi^2, \xi^3, \ldots, \xi^T$ and only take the last sample $\xi := \xi^T$. To run another iteration of (2), we repeat this process, i.e., sample a new trajectory $\xi^1, \xi^2, \xi^3, \ldots, \xi^T$ and take $\xi := \xi^T$.

Clearly, sampling a long trajectory just to use the last sample wastes a lot of samples, especially when $T$ is large. But this may seem necessary because $\xi^t$, for all small $t$, has a large bias. After all, it can take a long time for a random trajectory to explore all of the space, and it will often double back and visit states that it previously visited. Furthermore, it is also difficult to choose an appropriate $T$. A small $T$ causes a large bias in $\xi^T$, which slows the SGD convergence and reduces its final accuracy. A large $T$, on the other hand, is wasteful, especially when $x^k$ is still far from convergence and some bias does not prevent (2) from making good progress. Therefore, $T$ should increase adaptively as $k$ increases; this makes the choice of $T$ even more difficult.

So, why waste samples, why worry about $T$, and why not just apply every sample immediately in stochastic gradient descent? This approach has appeared in [5, 6], which we call the Markov Chain Gradient Descent (MCGD) algorithm for problem (1):

$$x^{k+1} = \mathrm{Proj}_X\big(x^k - \gamma_k \hat\nabla F(x^k;\xi^k)\big), \tag{3}$$

where $\xi^0, \xi^1, \ldots$ are samples on a Markov chain trajectory and $\hat\nabla F(x^k;\xi^k) \in \hat\partial F(x^k;\xi^k)$ is a subgradient.

Let us examine some special cases. Suppose the distribution $\Pi$ is supported on a set of $M$ points, $y_1, \ldots, y_M$. Then, by letting $f_i(x) := M\,\mathrm{Prob}(\xi = y_i)\, F(x, y_i)$, problem (1) reduces to the finite-sum problem:

$$\min_{x \in X \subseteq \mathbb{R}^d} \; f(x) \equiv \frac{1}{M}\sum_{i=1}^M f_i(x). \tag{4}$$

By the definition of $f_i$, each state $i$ has the uniform probability $1/M$. At each iteration of MCGD, we have

$$x^{k+1} = \mathrm{Proj}_X\big(x^k - \gamma_k \hat\nabla f_{j_k}(x^k)\big), \tag{5}$$

where $(j_k)_{k\ge0}$ is a trajectory of a Markov chain on $\{1, 2, \ldots, M\}$ that has a uniform stationary distribution. Here, $(\xi^k)_{k\ge0}$ (in $\Xi$) and $(j_k)_{k\ge0}$ (in $[M]$) are two different, but related, Markov chains. Starting from a deterministic and arbitrary initialization $x^0$, the iteration is illustrated by the following diagram:

$$\begin{array}{ccccccccc}
j_0 & \longrightarrow & j_1 & \longrightarrow & j_2 & \longrightarrow & j_3 & \longrightarrow & \cdots\\
& \searrow & & \searrow & & \searrow & & \searrow & \\
x^0 & \longrightarrow & x^1 & \longrightarrow & x^2 & \longrightarrow & x^3 & \longrightarrow & \cdots
\end{array}\tag{6}$$
In the diagram, given each $j_k$, the next state $j_{k+1}$ is statistically independent of $j_{k-1}, \ldots, j_0$; given $j_k$ and $x^k$, the next iterate $x^{k+1}$ is statistically independent of $j_{k-1}, \ldots, j_0$ and $x^{k-1}, \ldots, x^0$.
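To make the update (5) concrete, the following is a minimal sketch of MCGD for the finite-sum problem (4) with $X$ the full space (so the projection is the identity), where $(j_k)$ is a lazy random walk on a ring of $M$ states; the toy quadratic objective, the ring chain, and all names are ours, not from the paper.

```python
import numpy as np

def mcgd(grad_fns, P, x0, n_iters, q=0.501, rng=None):
    """Markov chain gradient descent (5): at iteration k, take one step along
    the gradient of f_{j_k}, where (j_k) is a single trajectory of a Markov
    chain with transition matrix P and uniform stationary distribution."""
    rng = np.random.default_rng() if rng is None else rng
    M = len(grad_fns)
    x, j = x0.astype(float).copy(), 0          # deterministic initialization
    for k in range(1, n_iters + 1):
        x -= grad_fns[j](x) / k**q             # stepsize gamma_k = 1/k^q, 1/2 < q < 1
        j = rng.choice(M, p=P[j])              # one Markov chain transition
    return x

# Toy instance: f_i(x) = 0.5*||x - a_i||^2; the minimizer of f is mean(a_i).
rng = np.random.default_rng(0)
M, d = 10, 5
A = rng.standard_normal((M, d))
grads = [lambda x, a=a: x - a for a in A]
P = np.zeros((M, M))
for i in range(M):                             # lazy ring walk: symmetric, hence
    P[i, [i, (i - 1) % M, (i + 1) % M]] = [0.5, 0.25, 0.25]  # uniform stationary dist.
x = mcgd(grads, P, np.zeros(d), 50000, rng=rng)
print(np.linalg.norm(x - A.mean(axis=0)))      # small: MCGD approaches the minimizer
```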

Another application of MCGD involves a network: consider a strongly connected graph $G = (V, E)$ with the set of vertices $V = \{1, 2, \ldots, M\}$ and set of edges $E \subseteq V \times V$. Each node $j \in \{1, 2, \ldots, M\}$ possesses some data and can compute $\nabla f_j(\cdot)$. To run MCGD, we employ a token that carries the variable $x$, walking randomly over the network. When it reaches a node $j$, node $j$ reads $x$ from the token and computes $\nabla f_j(\cdot)$ to update $x$ according to (5). Then, the token walks away to a random neighbor of node $j$.

1.1 Numerical tests

We present two kinds of numerical results. The first shows that MCGD uses fewer samples to train both a convex model and a nonconvex model. The second demonstrates the advantage of the faster mixing of a non-reversible Markov chain. Our results on nonconvex objectives and non-reversible chains are new.

1.1.1 Comparison with SGD

Let us compare:
1. MCGD (3), where $j_k$ is taken from one trajectory of the Markov chain;
2. SGD$T$, for $T = 1, 2, 4, 8, 16, 32$, where each $j_k$ is the $T$th sample of a fresh, independent trajectory.

All trajectories are generated by starting from the same state 0. To compute the same number of gradients, SGD$T$ uses $T$ times as many samples as MCGD. We did not try to adapt $T$ as $k$ increases because there lacks theoretical guidance.

In the first test, we recover a vector $u$ from an autoregressive process, which closely resembles the first (i.i.d.) experiment in [1]. Set matrix $A$ as a subdiagonal matrix with random entries $A_{i,i-1} \sim \mathcal{U}[0.8, 0.99]$. Randomly sample a vector $u \in \mathbb{R}^d$, $d = 50$, with the unit 2-norm. Our data $(\xi_t^1, \xi_t^2)_{t=1}^\infty$ are generated according to the following autoregressive process:

$$\xi_t^1 = A\xi_{t-1}^1 + e_1 W_t, \quad W_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1);$$

$$\bar\xi_t^2 = \begin{cases} 1, & \text{if } \langle u, \xi_t^1\rangle > 0,\\ 0, & \text{otherwise;}\end{cases} \qquad
\xi_t^2 = \begin{cases} \bar\xi_t^2, & \text{with probability } 0.8,\\ 1 - \bar\xi_t^2, & \text{with probability } 0.2.\end{cases}$$

Clearly, $(\xi_t^1, \xi_t^2)_{t=1}^\infty$ forms a Markov chain. Let $\Pi$ denote the stationary distribution of this Markov chain. We recover $u$ as the solution to the following problem:

$$\min_x \; \mathbb{E}_{(\xi^1,\xi^2)\sim\Pi}\, \ell(x; \xi^1, \xi^2).$$

We consider both convex and nonconvex loss functions, which were not done before in the literature. The convex one is the logistic loss

$$\ell(x;\xi^1,\xi^2) = -\xi^2 \log\big(\sigma(\langle x, \xi^1\rangle)\big) - (1 - \xi^2)\log\big(1 - \sigma(\langle x, \xi^1\rangle)\big),$$

where $\sigma(t) = \frac{1}{1+\exp(-t)}$. The nonconvex one is taken as

$$\ell(x;\xi^1,\xi^2) = \tfrac{1}{2}\big(\sigma(\langle x,\xi^1\rangle) - \xi^2\big)^2$$

from [7]. We choose $\gamma_k = \frac{1}{k^q}$ as our stepsize, where $q = 0.501$. This choice is consistent with our theory below.

Our results in Figure 1 are surprisingly positive on MCGD, more so than we had expected. As expected, MCGD used significantly fewer total samples than SGD for every $T$. But it is surprising that MCGD did not need even more gradient evaluations. Randomly generated data must have helped homogenize the samples over the different states, making it less important for a trajectory to converge. It is important to note that SGD1 and SGD2, as well as SGD4 in the nonconvex case, stagnate at noticeably lower accuracies because their $T$ values are too small for convergence.
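As a reference for this setup, here is a small sketch of the data-generating chain and the MCGD iteration on the convex logistic loss, under our reading of the formulas above; the horizon, seed, and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
u = rng.standard_normal(d); u /= np.linalg.norm(u)     # unit-norm ground truth
A = np.diag(rng.uniform(0.8, 0.99, size=d - 1), k=-1)  # subdiagonal AR matrix
e1 = np.zeros(d); e1[0] = 1.0
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

xi1, x = np.zeros(d), np.zeros(d)                      # chain state and MCGD iterate
for k in range(1, 200001):
    # One transition of the Markov chain (xi1_t, xi2_t).
    xi1 = A @ xi1 + e1 * rng.standard_normal()
    bar_xi2 = 1.0 if u @ xi1 > 0 else 0.0
    xi2 = bar_xi2 if rng.random() < 0.8 else 1.0 - bar_xi2   # 20% label flips
    # MCGD step on the logistic loss; its gradient is (sigmoid(<x,xi1>) - xi2) xi1.
    x -= (sigmoid(x @ xi1) - xi2) * xi1 / k**0.501
print(u @ (x / np.linalg.norm(x)))                     # cosine similarity with u
```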

Figure 1: Comparisons of MCGD and SGD$T$ for $T = 1, 2, 4, 8, 16, 32$, plotting $f(\bar{x}^k) - f(x^*)$ against the number of gradient evaluations and the number of samples, in the convex and nonconvex cases. Here $\bar{x}^k$ is the average of $x^1, \ldots, x^k$.

1.1.2 Comparison of reversible and non-reversible Markov chains

We also compare the convergence of MCGD when working with reversible and non-reversible Markov chains (the definition of reversibility is given in the next section). As mentioned in [14], transforming a reversible Markov chain into a non-reversible Markov chain can significantly accelerate the mixing process. This technique also helps to accelerate the convergence of MCGD.

In our experiment, we first construct an undirected connected graph with $n = 20$ nodes and randomly generated edges. Let $G$ denote the adjacency matrix of the graph, that is,

$$G_{i,j} = \begin{cases} 1, & \text{if } i, j \text{ are connected;}\\ 0, & \text{otherwise.}\end{cases}$$

Let $d_{\max}$ be the maximum number of outgoing edges of a node. Select $d = 10$. The transition probability of the reversible Markov chain, known as the Metropolis-Hastings Markov chain, is then defined by

$$P_{i,j} = \begin{cases} \frac{1}{d_{\max}}, & \text{if } j \ne i,\; G_{i,j} = 1;\\[2pt] 1 - \frac{\sum_{j\ne i} G_{i,j}}{d_{\max}}, & \text{if } j = i;\\[2pt] 0, & \text{otherwise.}\end{cases}$$

Obviously, $P$ is symmetric and the stationary distribution is uniform. The non-reversible Markov chain is constructed by adding cycles. The edges of these cycles are directed; let $V$ denote the adjacency matrix of these cycles. If $V_{i,j} = 1$, then $V_{j,i} = 0$. Let $w_0 > 0$ be the weight of flows along these cycles. Then we construct the transition probability of the non-reversible Markov chain as follows:

$$Q_{i,j} = \frac{W_{i,j}}{\sum_l W_{i,l}}, \quad \text{where } W = d_{\max} P + w_0 V.$$

See [14] for an explanation of why this change makes the chain mix faster. In our experiment, we add 5 cycles of length 4, with edges existing in $G$; $w_0$ is set to $\frac{d_{\max}}{2}$.

We test MCGD on a least squares problem. First, we select $\hat\beta \sim \mathcal{N}(0, I_d)$; then, for each node $i$, we generate $x_i \sim \mathcal{N}(0, I_d)$ and $y_i = x_i^T \hat\beta$. The objective function is defined as

$$f(\beta) = \frac{1}{2}\sum_{i=1}^n (x_i^T\beta - y_i)^2.$$

The convergence results are depicted in Figure 2.

Figure 2: Comparison of reversible and non-reversible Markov chains, plotting $f(\beta^k) - f^*$ against the iteration count. The second largest eigenvalues of the reversible and non-reversible Markov chains are 0.75 and 0.66, respectively.
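The two chains can be constructed directly from the formulas above; the sketch below builds $P$ and $Q$ on a small random graph and compares their second-largest eigenvalue moduli. The specific graph, the single added cycle, and the helper names are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
# Undirected connected graph: a ring plus a chord and some random extra edges.
G = np.zeros((n, n), dtype=int)
for i in range(n):
    G[i, (i + 1) % n] = G[(i + 1) % n, i] = 1
G[0, 3] = G[3, 0] = 1                    # chord so the directed 4-cycle below lies in G
for _ in range(20):
    i, j = rng.choice(n, size=2, replace=False)
    G[i, j] = G[j, i] = 1
d_max = G.sum(axis=1).max()

# Reversible Metropolis-Hastings chain: symmetric, uniform stationary distribution.
P = G / d_max
np.fill_diagonal(P, 1.0 - G.sum(axis=1) / d_max)

# Non-reversible chain: one directed cycle 0 -> 1 -> 2 -> 3 -> 0 with weight w0.
V = np.zeros((n, n))
for a, b in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    V[a, b] = 1.0
W = d_max * P + (d_max / 2.0) * V        # w0 = d_max / 2, as in the experiment
Q = W / W.sum(axis=1, keepdims=True)

slem = lambda T: np.sort(np.abs(np.linalg.eigvals(T)))[-2]  # 2nd-largest modulus
print(slem(P), slem(Q))                  # Q typically has the smaller value: faster mixing
```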

1.2 Known approaches and results

It is more difficult to analyze MCGD due to its biased samples. To see this, let $p_{k,j}$ be the probability of selecting $\nabla f_j$ in the $k$th iteration. SGD's uniform probability selection ($p_{k,j} \equiv \frac{1}{M}$) yields an unbiased gradient estimate

$$\mathbb{E}_{j_k}\big(\nabla f_{j_k}(x^k)\big) = C\nabla f(x^k) \tag{7}$$

for some $C > 0$. However, in MCGD, it is possible to have $p_{k,j} = 0$ for some $k, j$. Consider a random walk. The probability $p_{k+1,j}$ is determined by the current state $j_k$, and we have $p_{k+1,i} > 0$ only for $i \in \mathcal{N}(j_k)$ and $p_{k+1,i} = 0$ for $i \notin \mathcal{N}(j_k)$, where $\mathcal{N}(j_k)$ denotes the neighborhood of $j_k$. Therefore, we no longer have (7). All analyses of MCGD must deal with this biased expectation.

Papers [6, 5] investigate the conditional expectation $\mathbb{E}_{j_{k+\tau}\mid j_k}\big(\nabla f_{j_{k+\tau}}(x^k)\big)$. For a sufficiently large $\tau \in \mathbb{Z}^+$, it is sufficiently close to $\frac{1}{M}\nabla f(x^k)$ (but still different). In [6, 5], the authors proved that, to achieve an $\epsilon$ error, MCGD with stepsize $O(\epsilon)$ can return a solution in $O(\frac{1}{\epsilon^2})$ iterations. Their error bound is given in the ergodic sense and using $\liminf$. The authors of [10] proved that $\liminf f(x^k)$ and $\mathbb{E}\,\mathrm{dist}^2(x^k, X^*)$ have almost sure convergence under diminishing stepsizes $\gamma_k = \frac{1}{k^q}$, $\frac{2}{3} < q \le 1$. Although the authors did not compute any rates, we computed that their stepsizes will lead to a solution with $\epsilon$ error in $O(\frac{1}{\epsilon^{1/(1-q)}})$ iterations for $\frac{2}{3} < q < 1$, and $O(e^{1/\epsilon})$ for $q = 1$. In [1], the authors improved the stepsizes to $\gamma_k = \frac{1}{\sqrt{k}}$ and showed ergodic convergence; in other words, to achieve $\epsilon$ error, it is enough to run MCGD for $O(\frac{1}{\epsilon^2})$ iterations. There is no non-ergodic result regarding the convergence of $f(x^k)$. It is worth mentioning that [10, 1] use time non-homogeneous Markov chains, where the transition probability can change over the iterations as long as there is still a finite mixing time. In [1], MCGD is generalized from gradient descent to mirror descent. In all these works, the Markov chain is required to be reversible, and all functions $f_i$, $i \in [M]$, are assumed to be convex. However, non-reversible chains can mix substantially faster and are thus more numerically efficient.

1.3 Our approaches and results

In this paper, we improve the analyses of MCGD to non-reversible finite-state Markov chains and to nonconvex functions. The former allows us to have faster mixing, and the latter frequently appears in applications. Our convergence result is given in the non-ergodic sense, though the rate results are still given in the ergodic sense. It is important to mention that, in our analysis, the mixing time of the underlying Markov chain is not tied to a fixed mixing level but can vary to different levels. This is essential because MCGD needs time to reduce its objective error from its current value to a lower one, and this time becomes longer when the current value is lower, since a more accurate Markov chain convergence, and thus a longer mixing time, is required.

When $f_1, f_2, \ldots, f_M$ are all convex, we allow them to be non-differentiable and MCGD to use subgradients, provided that $X$ is bounded. When any of them is nonconvex, we assume $X$ is the full space and $f_1, f_2, \ldots, f_M$ are differentiable with bounded gradients. The bounded-gradient assumption is due to a technical difficulty associated with nonconvexity.

Specifically, in the convex setting, we prove $\lim_k \mathbb{E} f(x^k) = f^*$ (the minimum value of $f$ over $X$) for both exact and inexact MCGD with stepsizes $\gamma_k = \frac{1}{k^q}$, $\frac{1}{2} < q < 1$. The convergence rates of MCGD with exact and inexact subgradient computations are presented. The first analysis of nonconvex MCGD is also presented, with its convergence given in terms of the expectation of $\|\nabla f(x^k)\|$. These results hold for non-reversible finite-state Markov chains and can be extended to time non-homogeneous Markov chains under extra assumptions ([10, Assumptions 4 and 5] and [1, Assumption C]), which essentially ensure finite mixing.
Our results for finite-state Markov chains are first presented in Sections 3 and 4. They are extended to continuous-state reversible Markov chains in Section 5. Some novel results are developed based on the new techniques and approaches introduced in this paper. To get the stronger results in general cases, we use varying mixing times rather than a fixed one.

We list possible extensions of MCGD that are not discussed in this paper. The first is accelerated versions, including Nesterov's acceleration and variance reduction schemes. The second is the design and optimization of Markov chains to improve the convergence of MCGD.

2 Preliminaries

2.1 Markov chain

We recall some definitions, properties, and existing results about Markov chains. Although we use finite-state time-homogeneous Markov chains, the results can be extended to more general chains under similar extra assumptions as in [10, Assumptions 4, 5] and [1, Assumption C].

Definition 1 (finite-state time-homogeneous Markov chain) Let $P$ be an $M \times M$ matrix with real-valued elements. A stochastic process $X_1, X_2, \ldots$ in a finite state space $[M] := \{1, 2, \ldots, M\}$ is called a time-homogeneous Markov chain with transition matrix $P$ if, for $k \in \mathbb{N}$, $i, j \in [M]$, and $i_0, i_1, \ldots, i_{k-1} \in [M]$, we have

$$\mathbb{P}(X_{k+1} = j \mid X_0 = i_0, X_1 = i_1, \ldots, X_k = i) = \mathbb{P}(X_{k+1} = j \mid X_k = i) = P_{i,j}. \tag{8}$$

Let the probability distribution of $X_k$ be denoted by the non-negative row vector $\pi^k = (\pi_1^k, \pi_2^k, \ldots, \pi_M^k)$, that is, $\mathbb{P}(X_k = j) = \pi_j^k$; $\pi^k$ satisfies $\sum_{i=1}^M \pi_i^k = 1$. When the Markov chain is time-homogeneous, we have $\pi^k = \pi^{k-1}P$ and

$$\pi^k = \pi^{k-1}P = \cdots = \pi^0 P^k \tag{9}$$

for $k \in \mathbb{N}$, where $P^k$ denotes the $k$th power of $P$.

A Markov chain is irreducible if, for any $i, j \in [M]$, there exists $k$ such that $(P^k)_{i,j} > 0$. State $i \in [M]$ is said to have a period $d$ if $(P^k)_{i,i} = 0$ whenever $k$ is not a multiple of $d$ and $d$ is the greatest integer with this property. If $d = 1$, then we say state $i$ is aperiodic. If every state is aperiodic, the Markov chain is said to be aperiodic.

Any time-homogeneous, irreducible, and aperiodic Markov chain has a stationary distribution $\pi^* = \lim_k \pi^k = [\pi_1^*, \pi_2^*, \ldots, \pi_M^*]$ with $\sum_{i=1}^M \pi_i^* = 1$ and $\min_i\{\pi_i^*\} > 0$, and $\pi^* = \pi^* P$. It also holds that

$$\lim_k P^k = [\pi^*; \pi^*; \ldots; \pi^*] =: \Pi^* \in \mathbb{R}^{M\times M}. \tag{10}$$

The largest eigenvalue of $P$ is 1, and the corresponding left eigenvector is $\pi^*$.

Assumption 1 The Markov chain $(X_k)_{k\ge0}$ is time-homogeneous, irreducible, and aperiodic. It has a transition matrix $P$ and a stationary distribution $\pi^*$.

2.2 Mixing time

The mixing time measures how long a Markov chain evolves until its current state has a distribution very close to its stationary distribution. The literature has a thorough investigation of various kinds of mixing times, with the majority for reversible Markov chains (that is, $\pi_i P_{i,j} = \pi_j P_{j,i}$). Mixing times of non-reversible Markov chains are discussed in [3]. In this part, we consider a new type of mixing time for non-reversible Markov chains. The proofs are based on basic matrix analysis. Our mixing time gives a direct relationship between $k$ and the deviation of the distribution of the current state from the stationary distribution.

To state the lemma, we review some basic notions in linear algebra. Let $\mathbb{C}$ be the complex field. The modulus of a complex number $a \in \mathbb{C}$ is written $|a|$. For a vector $x \in \mathbb{C}^n$, the $\ell_\infty$ and $\ell_2$ norms are defined as $\|x\|_\infty := \max_i |x_i|$ and $\|x\|_2 := \sqrt{\sum_{i=1}^n |x_i|^2}$. For a matrix $A = [a_{i,j}] \in \mathbb{C}^{m\times n}$, its $\ell_\infty$ and Frobenius norms are $\|A\|_\infty := \max_{i,j} |a_{i,j}|$ and $\|A\|_F := \sqrt{\sum_{i,j} |a_{i,j}|^2}$, respectively.

We know $P^k \to \Pi^*$ as $k \to \infty$. The following lemma presents a deviation bound for finite $k$.

Lemma 1 Let Assumption 1 hold and let $\lambda_i(P) \in \mathbb{C}$ be the $i$th largest eigenvalue of $P$, and

$$\lambda(P) := \frac{\max\{|\lambda_2(P)|, |\lambda_M(P)|\} + 1}{2} \in [0, 1).$$

Then, we can bound the largest entry-wise absolute value of the deviation matrix $\delta^k := \Pi^* - P^k \in \mathbb{R}^{M\times M}$ as

$$\|\delta^k\|_\infty \le C_P \cdot \lambda^k(P) \tag{11}$$

for $k \ge K_P$, where $C_P$ is a constant that also depends on the Jordan canonical form of $P$, and $K_P$ is a constant that depends on $\lambda(P)$ and $\lambda_2(P)$. Their formulas are given in (45) and (46) in the Supplementary Material.
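The bound (11) is easy to probe numerically; below is a minimal check on a small non-reversible chain of our own choosing (not from the paper), comparing $\|\delta^k\|_\infty$ with $\lambda^k(P)$.

```python
import numpy as np

# A small irreducible, aperiodic, non-reversible chain (ours, for illustration).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.4, 0.5],
              [0.4, 0.2, 0.4]])
w, V = np.linalg.eig(P.T)
pi = np.abs(V[:, np.argmax(w.real)].real)
pi /= pi.sum()                                   # stationary distribution pi*
Pi_star = np.tile(pi, (3, 1))                    # limit matrix: every row is pi*

mods = np.sort(np.abs(w))[::-1]                  # eigenvalue moduli, descending
lam = (max(mods[1], mods[-1]) + 1.0) / 2.0       # lambda(P) as defined in Lemma 1

Pk = np.eye(3)
for k in range(1, 31):
    Pk = Pk @ P
    if k % 5 == 0:                               # ||delta^k||_inf vs. lambda(P)^k
        print(k, np.abs(Pi_star - Pk).max(), lam**k)
```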

Remark 1 If $P$ is symmetric, then all the $\lambda_i(P)$ are real and nonnegative, $K_P = 0$, and $C_P \le M^{3/2}$. Furthermore, (11) can be improved by directly using $\lambda_2(P)$ on the right side:

$$\|\delta^k\|_\infty \le \|\delta^k\|_F \le M^{3/2}\,\lambda_2^k(P), \quad k \ge 0.$$

3 Convergence analysis for convex minimization

This section considers the convergence of MCGD in the convex case, i.e., $f_1, f_2, \ldots, f_M$ and $X$ are all convex. We investigate the convergence of scheme (5). We prove non-ergodic convergence of the expected objective value sequence under diminishing non-summable stepsizes, where the stepsizes are required to be "almost" square summable. Therefore, the convergence requirements are almost equal to those of SGD. This section uses the following assumption.

Assumption 2 The set $X$ is assumed to be convex and compact.

Now, we present the convergence results for MCGD in the convex (but not necessarily differentiable) case. Let $f^*$ be the minimum value of $f$ over $X$.

Theorem 1 Let Assumptions 1 and 2 hold and let $(x^k)_{k\ge0}$ be generated by scheme (5). Assume that $f_i$, $i \in [M]$, are convex functions, and the stepsizes satisfy

$$\sum_k \gamma_k = +\infty, \qquad \sum_k \ln k \cdot \gamma_k^2 < +\infty. \tag{12}$$

Then, we have

$$\lim_k \mathbb{E} f(x^k) = f^*. \tag{13}$$

Define $\psi(P) := \max\big\{1, \frac{1}{\ln(1/\lambda(P))}\big\}$. We have

$$\mathbb{E}\big(f(\bar{x}^k) - f^*\big) = O\Big(\frac{\psi(P)}{\sum_{i=1}^k \gamma_i}\Big), \tag{14}$$

where $\bar{x}^k := \frac{\sum_{i=1}^k \gamma_i x^i}{\sum_{i=1}^k \gamma_i}$. Therefore, if we select the stepsize $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$, we get the rate $\mathbb{E}(f(\bar{x}^k) - f^*) = O(\frac{\psi(P)}{k^{1-q}})$.

Furthermore, consider the inexact version of MCGD:

$$x^{k+1} = \mathrm{Proj}_X\big(x^k - \gamma_k(\hat\nabla f_{j_k}(x^k) + e^k)\big), \tag{15}$$

where the noise sequence $(e^k)_{k\ge0}$ is arbitrary but obeys

$$\sum_{k=2}^{+\infty} \frac{\|e^k\|^2}{\ln k} < +\infty. \tag{16}$$

Then, for iteration (15), results (13) and (14) still hold; furthermore, if $\|e^k\| = O(\frac{1}{k^p})$ with $p > \frac{1}{2}$ and $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$, the rate $\mathbb{E}(f(\bar{x}^k) - f^*) = O(\frac{\psi(P)}{k^{1-q}})$ also holds.

The stepsize requirement (12) is nearly identical to that of SGD and subgradient algorithms. In the theorem above, we use the stepsize setting $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$. This kind of stepsize requirement also works for SGD and subgradient algorithms. The convergence rate of MCGD is $O(\frac{1}{\sum_{i=1}^k \gamma_i}) = O(\frac{1}{k^{1-q}})$, which is the same as that of SGD and subgradient algorithms for $\gamma_k = O(\frac{1}{k^q})$.
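For instance, the canonical choice $\gamma_k = \frac{1}{k^q}$ with $\frac{1}{2} < q < 1$ satisfies (12); a quick check (our worked example):

$$\sum_{k\ge1}\gamma_k = \sum_{k\ge1}\frac{1}{k^{q}} = +\infty \ \ (\text{since } q < 1), \qquad \sum_{k\ge2}\ln k\,\gamma_k^2 = \sum_{k\ge2}\frac{\ln k}{k^{2q}} < +\infty \ \ (\text{since } 2q > 1),$$

where the second series converges because $\ln k \le C_\delta k^{\delta}$ for every $\delta > 0$, so its terms are $O(1/k^{2q-\delta})$ with $2q - \delta > 1$ for $\delta$ small enough.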

4 Convergence analysis for nonconvex minimization

This section considers the convergence of MCGD when one or more of the $f_i$ is nonconvex. In this case, we assume $f_i$, $i = 1, 2, \ldots, M$, are differentiable and $\nabla f_i$ is Lipschitz with constant $L$.² We also set $X$ as the full space. We study the following scheme:

$$x^{k+1} = x^k - \gamma_k \nabla f_{j_k}(x^k). \tag{17}$$

We prove non-ergodic convergence of the expected gradient norm of $f$ under diminishing non-summable stepsizes. The stepsize requirements in this section are slightly stronger than those in the convex case, with an extra $\ln k$ factor. In this part, we use the following assumption.

Assumption 3 The gradients of the $f_i$ are assumed to be bounded, i.e., there exists $D > 0$ such that

$$\|\nabla f_i(x)\| \le D, \quad i \in [M]. \tag{18}$$

We use this new assumption because $X$ is now the full space, and we have to directly bound the size of $\nabla f_i(x)$. In the nonconvex case, we cannot obtain objective value convergence, and we only bound the gradients. Now, we are prepared to present our convergence results for nonconvex MCGD.

Theorem 2 Let Assumptions 1 and 3 hold and let $(x^k)_{k\ge0}$ be generated by scheme (17). Also, assume each $f_i$ is differentiable and $\nabla f_i$ is $L$-Lipschitz, and the stepsizes satisfy

$$\sum_k \gamma_k = +\infty, \qquad \sum_k \ln^2 k \cdot \gamma_k^2 < +\infty. \tag{19}$$

Then, we have

$$\lim_k \mathbb{E}\|\nabla f(x^k)\| = 0 \tag{20}$$

and

$$\mathbb{E}\Big(\min_{1\le i\le k}\{\|\nabla f(x^i)\|^2\}\Big) = O\Big(\frac{\psi(P)}{\sum_{i=1}^k \gamma_i}\Big), \tag{21}$$

where $\psi(P)$ is given in Lemma 1. If we select the stepsize as $\gamma_k = O(\frac{1}{k^q})$, $\frac{1}{2} < q < 1$, then we get the rate $\mathbb{E}(\min_{1\le i\le k}\{\|\nabla f(x^i)\|^2\}) = O(\frac{\psi(P)}{k^{1-q}})$.

Furthermore, let $(e^k)_{k\ge0}$ be a sequence of noise and consider the inexact nonconvex MCGD iteration:

$$x^{k+1} = x^k - \gamma_k\big(\nabla f_{j_k}(x^k) + e^k\big). \tag{22}$$

If the noise sequence obeys

$$\sum_{k=1}^{+\infty} \gamma_k \|e^k\| < +\infty, \tag{23}$$

then the convergence results (20) and (21) still hold for inexact nonconvex MCGD. In addition, if we set $\gamma_k = O(\frac{1}{k^q})$ with $\frac{1}{2} < q < 1$ and the noise satisfies $\|e^k\| = O(\frac{1}{k^p})$ with $p + q > 1$, then (20) still holds and $\mathbb{E}(\min_{1\le i\le k}\{\|\nabla f(x^i)\|^2\}) = O(\frac{\psi(P)}{k^{1-q}})$.

The proof of Theorem 2 is different from the previous one. In particular, due to nonconvexity, we cannot expect any sort of convergence to $f(x^*)$, where $x^* \in \operatorname{argmin} f$. Instead, we use the Lipschitz continuity of $\nabla f_i$ ($i \in [M]$) to derive the "descent". Here, the "$O$" contains a polynomial composition of the constants $D$ and $L$. Compared with MCGD in the convex case, the stepsize requirements of nonconvex MCGD become a tad stronger; in the summability part, we need $\sum_k \ln^2 k \cdot \gamma_k^2 < +\infty$ rather than $\sum_k \ln k \cdot \gamma_k^2 < +\infty$. Nevertheless, we can still use $\gamma_k = O(\frac{1}{k^q})$ for $\frac{1}{2} < q < 1$.

² This is for the convenience of the presentation in the proofs. If each $f_i$ has its own $L_i$, it is possible to improve our results slightly. But we simply set $L := \max_i\{L_i\}$.
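As an illustration of scheme (17) and the quantity bounded in (21), here is a sketch on a small nonconvex finite sum built from the squared-sigmoid loss family used in Section 1.1; the instance, the chain, and all names are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
M, d = 5, 3
A = rng.standard_normal((M, d))
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def grad_fi(i, x):
    # f_i(x) = 0.5*(sigmoid(<a_i, x>) - 1)^2: nonconvex, with bounded gradient.
    s = sigmoid(A[i] @ x)
    return (s - 1.0) * s * (1.0 - s) * A[i]

grad_f = lambda x: sum(grad_fi(i, x) for i in range(M)) / M

P = np.zeros((M, M))                      # lazy ring walk: uniform stationary dist.
for i in range(M):
    P[i, [i, (i - 1) % M, (i + 1) % M]] = [0.5, 0.25, 0.25]

x, j, best = rng.standard_normal(d), 0, np.inf
for k in range(1, 50001):
    x = x - grad_fi(j, x) / k**0.75       # scheme (17): no projection, gamma_k = 1/k^q
    j = rng.choice(M, p=P[j])
    best = min(best, float(np.linalg.norm(grad_f(x))**2))
print(best)                               # min_i ||grad f(x^i)||^2 decays, as in (21)
```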

5 Convergence analysis for continuous state space

When the state space $\Xi$ is a continuum, there are infinitely many possible states. In this case, we consider an infinite-state Markov chain that is time-homogeneous and reversible. Using the results in [8, Theorem 4.9], the mixing time of this kind of Markov chain still enjoys a geometric decrease like (11). Since Lemma 1 is based on a linear algebra analysis, it no longer applies to the continuous case. Nevertheless, the previous results still hold, with nearly unchanged proofs, under the following assumption:

Assumption 4 For any $\xi \in \Xi$, $|F(x;\xi) - F(y;\xi)| \le L\|x - y\|$; $\sup_{x\in X,\,\xi\in\Xi}\{\|\hat\nabla F(x;\xi)\|\} \le D$; $\mathbb{E}_\xi \hat\nabla F(x;\xi) \in \hat\partial\big(\mathbb{E}_\xi F(x;\xi)\big)$; and $\sup_{x,y\in X,\,\xi\in\Xi}|F(x;\xi) - F(y;\xi)| \le H$.

We consider the general scheme

$$x^{k+1} = \mathrm{Proj}_X\big(x^k - \gamma_k(\hat\nabla F(x^k;\xi^k) + e^k)\big), \tag{24}$$

where the $\xi^k$ are samples on a Markov chain trajectory. If $e^k \equiv 0$, the scheme reduces to (3).

Corollary 1 Assume $F(\cdot\,;\xi)$ is convex for each $\xi \in \Xi$. Let the stepsizes satisfy (12), $(x^k)_{k\ge0}$ be generated by iteration (24), and $(e^k)_{k\ge0}$ satisfy (16). Let $F^* := \min_{x\in X} \mathbb{E}_\xi F(x;\xi)$. If Assumption 4 holds and the Markov chain is time-homogeneous, irreducible, aperiodic, and reversible, then we have

$$\lim_k \mathbb{E}\big(\mathbb{E}_\xi F(x^k;\xi)\big) - F^* = 0, \qquad \mathbb{E}\big(\mathbb{E}_\xi F(\bar{x}^k;\xi)\big) - F^* = O\Big(\frac{\max\{1, \frac{1}{\ln(1/\lambda)}\}}{\sum_{i=1}^k \gamma_i}\Big),$$

where $0 < \lambda < 1$ is the geometric rate of the mixing time of the Markov chain (which corresponds to $\lambda(P)$ in the finite-state case).

Next, we present our result for a possibly nonconvex objective function $F(\cdot\,;\xi)$ under the following assumption.

Assumption 5 For any $\xi \in \Xi$, $F(x;\xi)$ is differentiable, and $\|\nabla F(x;\xi) - \nabla F(y;\xi)\| \le L\|x - y\|$. In addition, $\sup_{x\in X,\,\xi\in\Xi}\{\|\nabla F(x;\xi)\|\} < +\infty$, $X$ is the full space, and $\nabla \mathbb{E}_\xi F(x;\xi) = \mathbb{E}_\xi \nabla F(x;\xi)$.

Since $F(x;\xi)$ is differentiable and $X$ is the full space, the iteration reduces to

$$x^{k+1} = x^k - \gamma_k\big(\nabla F(x^k;\xi^k) + e^k\big). \tag{25}$$

Corollary 2 Let the stepsizes satisfy (19), $(x^k)_{k\ge0}$ be generated by iteration (25), the noise obey (23), and Assumption 5 hold. Assume the Markov chain is time-homogeneous, irreducible, aperiodic, and reversible. Then, we have

$$\lim_k \mathbb{E}\big\|\nabla \mathbb{E}_\xi F(x^k;\xi)\big\| = 0, \qquad \mathbb{E}\Big(\min_{1\le i\le k}\big\{\|\nabla \mathbb{E}_\xi F(x^i;\xi)\|^2\big\}\Big) = O\Big(\frac{\max\{1, \frac{1}{\ln(1/\lambda)}\}}{\sum_{i=1}^k \gamma_i}\Big), \tag{26}$$

where $0 < \lambda < 1$ is the geometric rate of the mixing time of the Markov chain.
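As a continuous-state illustration, the sketch below runs iteration (25) (with $e^k \equiv 0$) using a Gaussian AR(1) chain, which is time-homogeneous and reversible with stationary distribution $\mathcal{N}(0,1)$. The quadratic $F$ and all parameters are ours; its gradient is bounded only on bounded sets, so this is an illustration rather than a verification of Assumption 5.

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.9                                   # AR(1) coefficient controls mixing speed
xi, x = 0.0, 5.0                            # chain state; scalar iterate
# F(x; xi) = (x - xi)^2, so E_xi F is minimized at x = E[xi] = 0.
for k in range(1, 100001):
    xi = rho * xi + np.sqrt(1 - rho**2) * rng.standard_normal()  # reversible chain
    x -= 2.0 * (x - xi) / k**0.75           # iteration (25) with e^k = 0
print(x)                                    # approaches the minimizer 0
```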

6 Conclusion

In this paper, we have analyzed the stochastic gradient descent method where the samples are taken on a trajectory of a Markov chain. One of our main contributions is a non-ergodic convergence analysis for convex MCGD, which uses a novel line of analysis. The result is then extended to inexact gradients. This analysis lets us establish convergence for non-reversible finite-state Markov chains and for nonconvex minimization problems. Our results are useful in cases where it is impossible or expensive to directly take samples from a distribution, or the distribution is not even known, but sampling via a Markov chain is possible. Our results also apply to decentralized learning over a network, where we can employ a random walker to traverse the network and minimize the objective that is defined over the samples held at the nodes in a distributed fashion.

References

[1] John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549-1578, 2012.
[2] Martin Dyer, Alan Frieze, Ravi Kannan, Ajai Kapoor, Ljubomir Perkovic, and Umesh Vazirani. A mildly exponential time algorithm for approximating the number of solutions to a multidimensional knapsack problem. Combinatorics, Probability and Computing, 2(3):271-284, 1993.
[3] James Allen Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process. The Annals of Applied Probability, 1(1):62-87, 1991.
[4] Mark Jerrum and Alistair Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. Citeseer, 1996.
[5] Björn Johansson, Maben Rabi, and Mikael Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In Proceedings of the 46th IEEE Conference on Decision and Control. IEEE, 2007.
[6] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157-1170, 2009.
[7] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747-2774, 2018.
[8] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237-354, 2006.
[9] Rufus Oldenburger. Infinite powers of matrices and characteristic roots. Duke Mathematical Journal, 6(2):357-361, 1940.
[10] S. Sundhar Ram, A. Nedić, and Venugopal V. Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691-717, 2009.
[11] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
[12] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233-257. Elsevier, 1971.
[13] Ralph Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 2015.
[14] Konstantin S. Turitsyn, Michael Chertkov, and Marija Vucelja. Irreversible Monte Carlo algorithms for efficient sampling. Physica D: Nonlinear Phenomena, 240(4-5):410-414, 2011.
[15] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834-2848, 2018.
