On Markov Chain Gradient Descent
|
|
- Adelia Anderson
- 5 years ago
- Views:
Transcription
1 On Marov Chain Gradient Descent Tao Sun College of Computer National University of Defense Technology Changsha, Hunan 40073, China Yuejiao Sun Department of Mathematics University of California, Los Angeles Los Angeles, CA 90095, USA Wotao Yin Department of Mathematics University of California, Los Angeles Los Angeles, CA 90095, USA Abstract Stochastic gradient methods are the worhorse algorithms) of large-scale optimization problems in machine learning, signal processing, and other computational sciences and engineering. This paper studies Marov chain gradient descent, a variant of stochastic gradient descent where the random samples are taen on the trajectory of a Marov chain. Existing results of this method assume convex objectives and a reversible Marov chain and thus have their limitations. We establish new non-ergodic convergence under wider step sizes, for nonconvex problems, and for non-reversible finite-state Marov chains. Nonconvexity maes our method applicable to broader problem classes. Non-reversible finite-state Marov chains, on the other hand, can mix substatially faster. To obtain these results, we introduce a new technique that varies the mixing levels of the Marov chains. The reported numerical results validate our contributions. Introduction In this paper, we consider a stochastic minimization problem. Let Ξ be a statistical sample space with probability distribution Π we omit the underlying σ-algebra). Let X R n be a closed convex set, which represents the parameter space. F ; ξ) : X R is a closed convex function associated with ξ Ξ. We aim to solve the following problem: ) minimize E ξ F x; ξ) = F x, ξ)dπξ). ) x X R n A common method to minimize ) is Stochastic Gradient Descent SGD) []: Π x + = Proj X x γ F x ; ξ ) ), samples ξ i.i.d Π. 2) However, for some problems and distributions, direct sampling from Π is expensive or impossible, and it is possible that the sample space Ξ is not explicitly nown. In these cases, it can be much cheaper to sample by following a Marov chain that has a desired equilibrium distribution Π. The wor is supported in part by the National Key R&D Program of China 207YFB , China Scholarship Council. The wor of Y. Sun and W. Yin was supported in part by AFOSR MURI FA , NSF DMS , NSFC 72805, and ONR N nd Conference on Neural Information Processing Systems NeurIPS 208), Montréal, Canada.
2 To be concrete, imagine solving problem ) with a discrete space Ξ := {x {0, } n a, x b}, where a R n and b R, and the uniform distribution Π over Ξ. A straightforward way to obtain a uniform sample is iteratively randomly sampling x {0, } n until the constraint a, x b is satisfied. Even if the feasible set is small, it may tae up to O2 n ) iterations to get a feasible sample. Instead, one can sample a trajectory of a Marov chain described in [4]; to obtain a sample ε-close Ξ to the distribution Π, one only needs log ε ) expo nlog n) 5 2 )) samples [2], where Ξ is the cardinality of Ξ. This presents a signifant saving in sampling cost. Marov chains also naturally arise in some applications. Common examples are systems that evolve according to Marov chains, for example, linear dynamic systems with random transitions or errors. Another example is a distributed system in which every node locally stores a subset of training samples; to train a model using these samples, we can let a toen that holds all the model parameters traverse the nodes following a random wal, so the samples are accessed according to a Marov chain. Suppose that the Marov chain has a stationary distribution Π and a finite mixing time T, which is how long a random trajectory needs to be until its current state has a distribution that roughly matches Π. A larger T means a closer match. Then, in order to run one iteration of 2), we can generate a trajectory of samples ξ, ξ 2, ξ 3,..., ξ T and only tae the last sample ξ := ξ T. To run another iteration of 2), we repeat this process, i.e., sample a new trajectory ξ, ξ 2, ξ 3,..., ξ T and tae ξ := ξ T. Clearly, sampling a long trajectory just to use the last sample wastes a lot of samples, especially when T is large. But, this may seem necessary because ξ t, for all small t, have large biases. After all, it can tae a long time for a random trajectory to explore all of the space, and it will often double bac and visit states that it previously visited. Furthermore, it is also difficult to choose an appropriate T. A small T will cause large bias in ξ T, which slows the SGD convergence and reduces its final accuracy. A large T, on the other hand, is wasteful especially when x is still far from convergence and some bias does not prevent 2) to mae good progress. Therefore, T should increase adaptively as increases this maes the choice of T even more difficult. So, why waste samples, why worry about T, and why not just apply every sample immediately in stochastic gradient descent? This approach has appeared in [5, 6], which we call the Marov Chain Gradient Descent MCGD) algorithm for problem ): x + = Proj X x γ ˆ F x ; ξ ) ), 3) where ξ 0, ξ,... are samples on a Marov chain trajectory and ˆ F x ; ξ ) F x ; ξ ) is a subgradient. Let us examine some special cases. Suppose the distribution Π is supported on a set of M points, y,..., y M. Then, by letting f i x) := M Probξ = y i ) F x, y i ), problem ) reduces to the finite-sum problem: minimize fx) x X R d M M f i x). 4) By the definition of f i, each state i has the uniform probability /M. At each iteration of MCGD, we have x + = Proj X x γ ˆ fj x ) ), 5) where j ) 0 is a trajectory of a Marov chain on {, 2,..., M} that has a uniform stationary distribution. Here, ξ ) 0 Π and j ) 0 [M] are two different, but related Marov chains. Starting from a deterministic and arbitrary initialization x 0, the iteration is illustrated by the following diagram: j 0 j j 2... x 0 x x 2 x 3... In the diagram, given each j, the next state j + is statistically independent of j,..., j 0 ; given j and x, the next iterate x + is statistically independent of j,..., j 0 and x,..., x 0. i= 6) 2
3 Another application of MCGD involves a networ: consider a strongly connected graph G = V, E) with the set of vertices V = {, 2,..., M} and set of edges E V V. Each node j {, 2,..., M} possess some data and can compute f j ). To run MCGD, we employ a toen that carries the variable x, waling randomly over the networ. When it reaches a node j, node j reads x form the toen and computes f j ) to update x according to 5). Then, the toen wals away to a random neighbor of node j.. Numerical tests We present two inds of numerical results. The first one is to show that MCGD uses fewer samples to train both a convex model and a nonconvex model. The second one demonstrates the advantage of the faster mixing of a non-reversible Marov chain. Our results on nonconvex objective and non-reversible chains are new.. Comparision with SGD Let us compare:. MCGD 3), where j is taen from one trajectory of the Marov chain; 2. SGDT, for T =, 2, 4, 8, 6, 32, where each j is the T th sample of a fresh, independent trajectory. All trajectories are generated by starting from the same state 0. To compute T gradients, SGDT uses T times as many samples as MCGD. We did not try to adapt T as increases because there lacs a theoretical guidance. In the first test, we recover a vector u from an auto regressive process, which closely resembles the first i.i.d experiment in []. Set matrix A as a subdiagonal matrix with random entries A i,i U[0.8, 0.99]. Randomly sample a vector u R d, d = 50, with the unit 2-norm. Our data ξt, ξt 2 ) t= are generated according to the following auto regressive process: ξ t = Aξ t + e W t, W t i.i.d N0, ) { ξ t 2, if u, ξ = t > 0, 0, otherwise; { ξ2 ξt 2 = t, with probability 0.8, ξ t 2, with probability 0.2. Clearly, ξ t, ξ 2 t ) t= forms a Marov chain. Let Π denote the stationary distribution of this Marov chain. We recover u as the solution to the following problem: minimize x E ξ,ξ 2 ) Πlx; ξ, ξ 2 ). We consider both convex and nonconvex loss functions, which were not done before in the literature. The convex one is the logistic loss lx; ξ, ξ 2 ) = ξ 2 logσ x, ξ )) ξ 2 ) log σ x, ξ )), where σt) = +exp t). And the nonconvex one is taen as lx; ξ, ξ 2 ) = 2 σ x, ξ ) ξ 2 ) 2 from [7]. We choose γ = as our stepsize, where q = This choice is consistently with our theory below. q Our results in Figure are surprisingly positive on MCGD, more so to our expectation. As we had expected, MCGD used significantly fewer total samples than SGD on every T. But, it is surprising that MCGD did not need even more gradient evaluations. Randomly generated data must have helped homogenize the samples over the different states, maing it less important for a trajectory to converge. It is important to note that SGD and SGD2, as well as SGD4, in the nonconvex case, stagnate at noticeably lower accuracies because their T values are too small for convergence. 3
4 Convex case 0 0 Convex case 0 - Nonconvex case 0 - Nonconvex case f x ) fx ) MCSGD SGD SGD2 SGD4 SGD8 SGD6 SGD32 f x ) fx ) f x ) fx ) 0-2 f x ) fx ) Number of gradient evaluations Number of samples Number of gradient evaluations Number of samples Figure : Comparisons of MCGD and SGDT for T =, 2, 4, 8, 6, 32. x is the average of x,..., x. 2. Comparison of reversible and non-reversible Marov chains We also compare the convergence of MCGD when woring with reversible and non-reversible Marov chains the definition of reversibility is given in next section). As mentioned in [4], transforming a reversible Marov chain into non-reversible Marov chain can significantly accelerate the mixing process. This technique also helps to accelerate the convergence of MCGD. In our experiment, we first construct an undirected connected graph with n = 20 nodes with edges randomly generated. Let G denote the adjacency matrix of the graph, that is, {, if i, j are connected; G i,j = 0, otherwise. Let d max be the maximum number of outgoing edges of a node. Select d = 0 and compute β N 0, I d ). The transition probability of the reversible Marov chain is then defined by, nown as Metropolis-Hastings marov chain, P i,j = d max, if j i, G i,j = ; j i Gi,j d max, if j = i; 0, otherwise. Obviously, P is symmetric and the stationary distribution is uniform. The non-reversible Marov chain is constructed by adding cycles. The edges of these cycles are directed and let V denote the adjacency matrix of these cycles. If V i,j =, then V j,i = 0. Let w 0 > 0 be the weight of flows along these cycles. Then we construct the transition probability of the non-reversible Marov chain as follows, f )-f * Reversible Non-reversible Q i,j = W i,j l W, i,l where W = d max P + w 0 V. See [4] for an explanation why this change maes the chain mix faster. In our experiment, we add 5 cycles of length 4, with edges existing in G. w 0 is set to be dmax 2. We test MCGD on a least square problem. First, we select β N 0, I d ); and then for each node i, we generate x i N 0, I d ), and y i = x T i β. The objective function is defined as, fβ) = n x T i β y i ) 2. 2 The convergence results are depicted in Figure 2. i= Iteration Figure 2: Comparison of reversible and irreversible Marov chains. The second largest eigenvalues of reversible and nonreversible Marov chains are 0.75 and 0.66 respectively..2 Known approaches and results It is more difficult to analyze MCGD due to its biased samples. To see this, let p,j be the probability to select f j in the th iteration. SGD s uniform probability selection p,j M ) yields an unbiased 4
5 gradient estimate E j f j x )) = C fx ) 7) for some C > 0. However, in MCGD, it is possible to have p,j = 0 for some, j. Consider a random wal. The probability p j,j is determined by the current state j, and we have p j,i > 0 only for i N j ) and p j,i = 0 for i / N j ), where N j ) denotes the neighborhood of j. Therefore, we no longer have 7). All analyses of MCGD must deal with the biased expectation. Papers [6, 5] investigate the conditional expectation E j+τ j f j+τ x )). For a sufficiently large τ Z +, it is sufficiently close to M fx ) but still different). In [6, 5], the authors proved that, to achieve an ɛ error, MCGD with stepsize Oɛ) can return a solution in O ɛ ) iteration. Their error bound is given in the ergodic sense 2 and using liminf. The authors of [0] proved a lim inf fx ) and Edist 2 x, X ) have almost sure convergence under diminishing stepsizes γ = q, 2 3 < q. Although the authors did not compute ) iterations, any rates, we computed that their stepsizes will lead to a solution with ɛ error in O ɛ q for 2 3 < q <, and Oe ɛ ) for q =. In [], the authors improved the stepsizes to γ = and showed ergodic convergence; in other words, to achieve ɛ error, it is enough to run MCGD for O ɛ 2 ) iterations. There is no non-ergodic result regarding the convergence of fx ). It is worth mentioning that [0, ] use time non-homogeneous Marov chains, where the transition probability can change over the iterations as long as there is still a finite mixing time. In [], MCGD is generalized from gradient descent to mirror descent. In all these wors, the Marov chain is required to be reversible, and all functions f i, i [M], are assumed to be convex. However, non-reversible chains can have substantially faster convergence and thus more numerically efficient..3 Our approaches and results In this paper, we improve the analyses of MCGD to non-reversible finite-state Marov chains and to nonconvex functions. The former allows us to have faster mixing, and the latter frequently appears in applications. Our convergence result is given in the non-ergodic sense though the rate results are still given the ergodic sense. It is important to mention that, in our analysis, the mixing time of the underlying Marov chain is not tied to a fixed mixing level but can vary to different levels. This is essential because MCGD needs time to reduce its objective error from its current value to a lower one, and this time becomes longer when the current value is lower since a more accurate Marov chain convergence and thus a longer mixing time are required. When f, f 2,..., f M are all convex, we allow them to be non-differentiable and MCGD to use subgradients, provided that X is bounded. When any of them is nonconvex, we assume X is the full space and f, f 2,..., f M are differentiable with bounded gradients. The bounded-gradient assumption is due to a technical difficulty associated with nonconvexity. Specifically, in the convex setting, we prove lim Efx ) = f minimum of f over X) for both exact and inexact MCGD with stepsizes γ =, q 2 < q <. The convergence rates of MCGD with exact and inexact subgradient computations are presented. The first analysis of nonconvex MCGD is also presented with its convergence given in the expectation of fx ). These results hold for non-reversible finite-state Marov chains and can be extended to time non-homogeneous Marov chain under extra assumptions [0, Assumptions 4 and 5] and [, Assumption C], which essentially ensure finite mixing. Our results for finite-state Marov chains are first presented in Sections 3 and 4. They are extended to continuous-state reversible Marov chains in Section 5. Some novel results are are developed based on new techniques and approaches developed in this paper. To get the stronger results in general cases, we used the varying mixing time rather than fixed ones. We list the possible extensions of MCGD that are not discussed in this paper. The first one is the accelerated versions including the Nesterov s acceleration and variance reduction schemes. The second one is the design and optimization of Marov chains to improve the convergence of MCGD. 5
6 2 Preliminaries 2. Marov chain We recall some definitions, properties, and existing results about the Marov chain. Although we use the finite-state time-homogeneous Marov chain, results can be extended to more general chains under similar extra assumptions in [0, Assumptions 4, 5] and [, Assumption C]. Definition finite-state time-homogeneous Marov chain) Let P be an M M-matrix with real-valued elements. A stochastic process X, X 2,... in a finite state space [M] := {, 2,..., M} is called a time-homogeneous Marov chain with transition matrix P if, for N, i, j [M], and i 0, i,..., i [M], we have PX + = j X 0 = i 0, X = i,..., X = i) = PX + = j X = i) = P i,j. 8) Let the probability distribution of X be denoted as the non-negative row vector π = π, π2,..., πm ), that is, PX = j) = πj. π satisfies M i= π i =. When the Marov chain is time-homogeneous, we have π = π P and π = π P = = π 0 P, 9) for N, where P denotes the th power of P. A Marov chain is irreducible if, for any i, j [M], there exists such that P ) i,j > 0. State i [M] is said to have a period d if P i,i = 0 whenever is not a multiple of d and d is the greatest integer with this property. If d =, then we say state i is aperiodic. If every state is aperiodic, the Marov chain is said to be aperiodic. Any time-homogeneous, irreducible, and aperiodic Marov chain has a stationary distribution π = lim π = [π, π2,..., πm ] with M i= π i = and min i{πi } > 0, and π = π P. It also holds that lim P = [π ); π );... ; π )] =: Π R M M. 0) The largest eigenvalue of P is, and the corresponding left eigenvector is π. Assumption The Marov chain X ) 0 is time-homogeneous, irreducible, and aperiodic. It has a transition matrix P and has stationary distribution π. 2.2 Mixing time Mixing time is how long a Marov chain evolves until its current state has a distribution very close to its stationary distribution. The literature has a thorough investigation of various inds of mixing times, with the majority for reversible Marov chains that is, π i P i,j = π j P j,i ). Mixing times of non-reversible Marov chains are discussed in [3]. In this part, we consider a new type of mixing time of non-reversible Marov chain. The proofs are based on basic matrix analysis. Our mixing time gives us a direct relationship between and the deviation of the distribution of the current state from the stationary distribution. To start a lemma, we review some basic notions in linear algebra. Let C be the n-dimensional complex field. The modulus of a complex number a C is given as a. For a vector x C n, the l and l 2 norms are defined as x := max i x i, x 2 := n i= x i 2. For a matrix A = [a i,j ] C m n, its -induced and Frobenius norms are A := max i,j a i,j, A F := n i,j= a i,j 2, respectively. We now P Π, as. The following lemma presents a deviation bound for finite. Lemma Let Assumption hold and let λ i P ) C be the ith largest eigenvalue of P, and λp ) := max{ λ 2P ), λ M P ) } + [0, ). 2 Then, we can bound the largest entry-wise absolute value of the deviation matrix δ := Π P R M M as δ C P λ P ) ) 6
7 for K P, where C P is a constant that also depends on the Jordan canonical form of P and K P is a constant that depends on λp ) and λ 2 P ). Their formulas are given in 45) and 46) in the Supplementary Material. Remar If P is symmetric, then all λ i P ) s are all real and nonnegative, K P = 0, and C P M 3 2. Furthermore, 42) can be improved by directly using λ 2P ) for the right side as δ δ F M 3 2 λ 2 P ), 0. 3 Convergence analysis for convex minimization This part considers the convergence of MCGD in the convex cases, i.e., f, f 2,..., f M and X are all convex. We investigate the convergence of scheme 5). We prove non-ergodic convergence of the expected objective value sequence under diminishing non-summable stepsizes, where the stepsizes are required to be almost" square summable. Therefore, the convergence requirements are almost equal to SGD. This section uses the following assumption. Assumption 2 The set X is assumed to be convex and compact. Now, we present the convergence results for MCGD in the convex but not necessarily differentiable) case. Let f be the minimum value of f over X. Theorem Let Assumptions and 2 hold and x ) 0 be generated by scheme 5). Assume that f i, i [M], are convex functions, and the stepsizes satisfy γ = +, ln γ 2 < +. 2) Then, we have Define We have: lim Efx ) = f. 3) ψp ) := max{, ln/λp )) }. ψp ) ) Efx ) f ) = O i= γ, 4) i where x i= := γixi. Therefore, if we select the stepsize γ = O i= γi ) as q 2 < q <, we get the rate Efx ) f ) = O ψp ) ). q Furthermore, consider the inexact version of MCGD: x + = Proj X x γ ˆ f j x ) + e ) ), 5) where the noise sequence e ) 0 is arbitrary but obeys + =2 e 2 ln < +. 6) Then, for iteration 5), results 3) and 4) still hold; furthermore, if e = O ) with p > p 2 and γ = O ) as q 2 < q <, the rate Efx ) f ) = O ψp ) ) also holds. q The stepsizes requirement 2) is nearly identical to the one of SGD and subgradient algorithms. In the theorem above, we use the stepsize setting γ = O ) as q 2 < q <. This ind of stepsize requirements also wors for SGD and subgradient algorithms. The convergence rate of MCGD is O ) = O i= γi ), which is also as the same as SGD and subgradient algorithms for q γ = O ). q 7
8 4 Convergence analysis for nonconvex minimization This section considers the convergence of MCGD when one or more of f i is nonconvex. In this case, we assume f i, i =, 2,..., M, are differentiable and f i is Lipschitz with L 2. We also set X as the full space. We study the following scheme x + = x γ f j x ). 7) We prove non-ergodic convergence of the expected gradient norm of f under diminishing nonsummable stepsizes. The stepsize requirements in this section are slightly stronger than those in the convex case with an extra ln factor. In this part, we use the following assumption. Assumption 3 The gradients of f i are assumed to be bounded, i.e., there exists D > 0 such that f i x) D, i [M]. 8) We use this new assumption because X is now the full space, and we have to directly bound the size of f i x). In the nonconvex case, we cannot obtain objective value convergence, and we only bound the gradients. Now, we are prepared to present our convergence results of nonconvex MCGD. Theorem 2 Let Assumptions and 3 hold and x ) 0 be generated by scheme 7). Also, assume f i is differentiable and f i is L-Lipschitz, and the stepsizes satisfy γ = +, ln 2 γ 2 < +. 9) Then, we have lim E fx ) = 0. 20) and E min i { fxi ) 2 } ) ψp ) ) = O i= γ, 2) i where ψp ) is given in Lemma. If we select the stepsize as γ = O ), q 2 < q <, then we get the rate E min i { fx i ) 2 } ) = O ψp ) ). q Furthermore, let e ) 0 be a sequence of noise and consider the inexact nonconvex MCGD iteration: If the noise sequence obeys x + = x γ fj x ) + e ). 22) + = γ e < +, 23) then the convergence results 20) and 2) still hold for inexact nonconvex MCGD. In addition, if we set γ = O ) as q 2 < q < and the noise satisfy e = O ) for p + q >, then 20) still p holds and E min i { fx i ) 2 } ) = O ψp ) ). q This proof of Theorem 2 is different from previous one. In particular, we cannot expect some sort of convergence to fx ), where x argmin f due to nonconvexity. To this end, we use the Lipschitz continuity of f i i [M]) to derive the descent". Here, the O" contains a polynomial compisition of constants D and L. Compared with MCGD in the convex case, the stepsize requirements of nonconvex MCGD become a tad higher; in summable part, we need ln2 γ 2 < + rather than ln γ2 < +. Nevertheless, we can still use γ = O q ) for 2 < q <. 2 This is for the convenience of the presentation in the proofs. If each f i has a L i, it is possible to improve our results slights. But, we simply set L := max i{l i} 8
9 5 Convergence analysis for continuous state space When the state space Ξ is a continuum, there are infinitely many possible states. In this case, we consider an infinite-state Marov chain that is time-homogeneous and reversible. Using the results in [8, Theorem 4.9], the mixing time of this ind of Marov chain still has geometric decrease lie ). Since Lemma is based on a linear algebra analysis, it no longer applies to the continuous case. Nevertheless, previous results still hold with nearly unchanged proofs under the following assumption: Assumption 4 For any ξ Ξ, F x; ξ) F y; ξ) L x y, sup x X,ξ Ξ { ˆ F x; ξ) } D, E ξ ˆ F x; ξ) Eξ F x; ξ), and sup x,y X,ξ Ξ F x; ξ) F y; ξ) H. We consider the general scheme x + = Proj X x γ ˆ F x ; ξ ) + e ) ), 24) where ξ are samples on a Marov chain trajectory. If e 0, the scheme then reduces to 3). Corollary Assume F ; ξ) is convex for each ξ Ξ. Let the stepsizes satisfy 2) and x ) 0 be generated by Algorithm 24), and e ) 0 satisfy 6). Let F := min x X E ξ F x; ξ)). If Assumption 4 holds and the Marov chain is time-homogeneous, irreducible, aperiodic, and reversible, then we have lim E E ξ F x ; ξ)) F ) max{, = 0, EE ξ F x ; ξ)) F ln/λ) ) = O } ), i= γi where 0 < λ < is the geometric rate of the mixing time of the Marov chain which corresponds to λp ) in the finite-state case). Next, we present our result for a possibly nonconvex objective function F ; ξ) under the following assumption. Assumption 5 For any ξ Ξ, F x; ξ) is differentiable, and F x; ξ) F y; ξ) L x y. In addition, sup x X,ξ Ξ { F x; ξ) } < +, X is the full space, and E ξ F x; ξ) = E ξ F x; ξ). Since F x, ξ) is differentiable and X is the full space, the iteration reduces to x + = x γ F x ; ξ ) + e ). 25) Corollary 2 Let the stepsizes satisfy 9), x ) 0 be generated by Algorithm 25), the noises obey 23), and Assumption 5 hold. Assume the Marov chain is time-homogeneous, irreducible, and aperiodic and reversible. Then, we have max{, lim E E ξ F x ; ξ)) = 0, E min { E ξf x i ; ξ)) 2 }) = O i i= γ i where 0 < λ < is geometric rate for the mixing time of the Marov chain. 6 Conclusion ln/λ) } ), 26) In this paper, we have analyzed the stochastic gradient descent method where the samples are taen on a trajectory of Marov chain. One of our main contributions is non-ergodic convergence analysis for convex MCGD, which uses a novel line of analysis. The result is then extended to the inexact gradients. This analysis lets us establish convergence for non-reversible finite-state Marov chains and for nonconvex minimization problems. Our results are useful in the cases where it is impossible or expensive to directly tae samples from a distribution, or the distribution is not even nown, but sampling via a Marov chain is possible. Our results also apply to decentralized learning over a networ, where we can employ a random waler to traverse the networ and minimizer the objective that is defined over the samples that are held at the nodes in a distribute fashion. 9
10 References [] John C Duchi, Aleh Agarwal, Miael Johansson, and Michael I Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 224): , 202. [2] Martin Dyer, Alan Frieze, Ravi Kannan, Ajai Kapoor, Ljubomir Perovic, and Umesh Vazirani. A mildly exponential time algorithm for approximating the number of solutions to a multidimensional napsac problem. Combinatorics, Probability and Computing, 23):27 284, 993. [3] James Allen Fill. Eigenvalue bounds on convergence to stationarity for nonreversible marov chains, with an application to the exclusion process. The annals of applied probability, ):62 87, 99. [4] Mar Jerrum and Alistair Sinclair. The Marov chain Monte Carlo method: an approach to approximate counting and integration. Citeseer, 996. [5] Bjorn Johansson, Maben Rabi, and Miael Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networs. In Decision and Control, th IEEE Conference on, pages IEEE, [6] Björn Johansson, Maben Rabi, and Miael Johansson. A randomized incremental subgradient method for distributed optimization in networed systems. SIAM Journal on Optimization, 203):57 70, [7] Song Mei, Yu Bai, Andrea Montanari, et al. The landscape of empirical ris for nonconvex losses. The Annals of Statistics, 466A): , 208. [8] Ravi Montenegro, Prasad Tetali, et al. Mathematical aspects of mixing times in marov chains. Foundations and Trends R in Theoretical Computer Science, 3): , [9] Rufus Oldenburger et al. Infinite powers of matrices and characteristic roots. Due Mathematical Journal, 62):357 36, 940. [0] S Sundhar Ram, A Nedić, and Venugopal V Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 202):69 77, [] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 223): , 95. [2] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing methods in statistics, pages Elsevier, 97. [3] Ralph Tyrell Rocafellar. Convex Analysis. Princeton university press, 205. [4] Konstantin S Turitsyn, Michael Chertov, and Marija Vucelja. Irreversible monte carlo algorithms for efficient sampling. Physica D: Nonlinear Phenomena, ):40 44, 20. [5] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on signal processing, 66): ,
WE consider an undirected, connected network of n
On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been
More informationErgodic Subgradient Descent
Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov
More information6 Markov Chain Monte Carlo (MCMC)
6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution
More informationA Distributed Newton Method for Network Optimization
A Distributed Newton Method for Networ Optimization Ali Jadbabaie, Asuman Ozdaglar, and Michael Zargham Abstract Most existing wor uses dual decomposition and subgradient methods to solve networ optimization
More informationStochastic Gradient Descent. Ryan Tibshirani Convex Optimization
Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so
More informationarxiv: v3 [math.oc] 8 Jan 2019
Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate
More informationMarkov Chains, Random Walks on Graphs, and the Laplacian
Markov Chains, Random Walks on Graphs, and the Laplacian CMPSCI 791BB: Advanced ML Sridhar Mahadevan Random Walks! There is significant interest in the problem of random walks! Markov chain analysis! Computer
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationAlternative Characterization of Ergodicity for Doubly Stochastic Chains
Alternative Characterization of Ergodicity for Doubly Stochastic Chains Behrouz Touri and Angelia Nedić Abstract In this paper we discuss the ergodicity of stochastic and doubly stochastic chains. We define
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationQ-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming
MATHEMATICS OF OPERATIONS RESEARCH Vol. 37, No. 1, February 2012, pp. 66 94 ISSN 0364-765X (print) ISSN 1526-5471 (online) http://dx.doi.org/10.1287/moor.1110.0532 2012 INFORMS Q-Learning and Enhanced
More informationApproximate Counting and Markov Chain Monte Carlo
Approximate Counting and Markov Chain Monte Carlo A Randomized Approach Arindam Pal Department of Computer Science and Engineering Indian Institute of Technology Delhi March 18, 2011 April 8, 2011 Arindam
More informationMAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing
MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very
More informationMarkov Chains Handout for Stat 110
Markov Chains Handout for Stat 0 Prof. Joe Blitzstein (Harvard Statistics Department) Introduction Markov chains were first introduced in 906 by Andrey Markov, with the goal of showing that the Law of
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationWAITING FOR A BAT TO FLY BY (IN POLYNOMIAL TIME)
WAITING FOR A BAT TO FLY BY (IN POLYNOMIAL TIME ITAI BENJAMINI, GADY KOZMA, LÁSZLÓ LOVÁSZ, DAN ROMIK, AND GÁBOR TARDOS Abstract. We observe returns of a simple random wal on a finite graph to a fixed node,
More informationConvergence rates for distributed stochastic optimization over random networks
Convergence rates for distributed stochastic optimization over random networs Dusan Jaovetic, Dragana Bajovic, Anit Kumar Sahu and Soummya Kar Abstract We establish the O ) convergence rate for distributed
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationINTRODUCTION TO MARKOV CHAIN MONTE CARLO
INTRODUCTION TO MARKOV CHAIN MONTE CARLO 1. Introduction: MCMC In its simplest incarnation, the Monte Carlo method is nothing more than a computerbased exploitation of the Law of Large Numbers to estimate
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More informationWE consider an undirected, connected network of n
On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been
More informationHomework 10 Solution
CS 174: Combinatorics and Discrete Probability Fall 2012 Homewor 10 Solution Problem 1. (Exercise 10.6 from MU 8 points) The problem of counting the number of solutions to a napsac instance can be defined
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationComputer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo
Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative
More informationPropp-Wilson Algorithm (and sampling the Ising model)
Propp-Wilson Algorithm (and sampling the Ising model) Danny Leshem, Nov 2009 References: Haggstrom, O. (2002) Finite Markov Chains and Algorithmic Applications, ch. 10-11 Propp, J. & Wilson, D. (1996)
More informationA Distributed Newton Method for Network Utility Maximization, I: Algorithm
A Distributed Newton Method for Networ Utility Maximization, I: Algorithm Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract Most existing wors use dual decomposition and first-order
More informationThe Perron Frobenius theorem and the Hilbert metric
The Perron Frobenius theorem and the Hilbert metric Vaughn Climenhaga April 7, 03 In the last post, we introduced basic properties of convex cones and the Hilbert metric. In this post, we loo at how these
More informationConvex Optimization CMU-10725
Convex Optimization CMU-10725 Simulated Annealing Barnabás Póczos & Ryan Tibshirani Andrey Markov Markov Chains 2 Markov Chains Markov chain: Homogen Markov chain: 3 Markov Chains Assume that the state
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling Rejection Importance Hastings-Metropolis Gibbs Markov Chains
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationDistributed Optimization over Networks Gossip-Based Algorithms
Distributed Optimization over Networks Gossip-Based Algorithms Angelia Nedić angelia@illinois.edu ISE Department and Coordinated Science Laboratory University of Illinois at Urbana-Champaign Outline Random
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationMarkov Chain Monte Carlo The Metropolis-Hastings Algorithm
Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability
More informationMarkov Chain Monte Carlo Methods
Markov Chain Monte Carlo Methods p. /36 Markov Chain Monte Carlo Methods Michel Bierlaire michel.bierlaire@epfl.ch Transport and Mobility Laboratory Markov Chain Monte Carlo Methods p. 2/36 Markov Chains
More informationDefinition A finite Markov chain is a memoryless homogeneous discrete stochastic process with a finite number of states.
Chapter 8 Finite Markov Chains A discrete system is characterized by a set V of states and transitions between the states. V is referred to as the state space. We think of the transitions as occurring
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationNonlinear Optimization Methods for Machine Learning
Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks
More informationNewton-like method with diagonal correction for distributed optimization
Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić August 15, 2015 Abstract We consider distributed optimization problems
More information25.1 Markov Chain Monte Carlo (MCMC)
CS880: Approximations Algorithms Scribe: Dave Andrzejewski Lecturer: Shuchi Chawla Topic: Approx counting/sampling, MCMC methods Date: 4/4/07 The previous lecture showed that, for self-reducible problems,
More informationLecture 16: Introduction to Neural Networks
Lecture 16: Introduction to Neural Networs Instructor: Aditya Bhasara Scribe: Philippe David CS 5966/6966: Theory of Machine Learning March 20 th, 2017 Abstract In this lecture, we consider Bacpropagation,
More informationMulti-Robotic Systems
CHAPTER 9 Multi-Robotic Systems The topic of multi-robotic systems is quite popular now. It is believed that such systems can have the following benefits: Improved performance ( winning by numbers ) Distributed
More informationMarkov Chains and MCMC
Markov Chains and MCMC Markov chains Let S = {1, 2,..., N} be a finite set consisting of N states. A Markov chain Y 0, Y 1, Y 2,... is a sequence of random variables, with Y t S for all points in time
More informationStochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values
Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited
More informationAsynchronous Non-Convex Optimization For Separable Problem
Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent
More informationSampling Contingency Tables
Sampling Contingency Tables Martin Dyer Ravi Kannan John Mount February 3, 995 Introduction Given positive integers and, let be the set of arrays with nonnegative integer entries and row sums respectively
More informationRandom Walks on Graphs. One Concrete Example of a random walk Motivation applications
Random Walks on Graphs Outline One Concrete Example of a random walk Motivation applications shuffling cards universal traverse sequence self stabilizing token management scheme random sampling enumeration
More informationMarkov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains
Markov Chains A random process X is a family {X t : t T } of random variables indexed by some set T. When T = {0, 1, 2,... } one speaks about a discrete-time process, for T = R or T = [0, ) one has a continuous-time
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationIncremental Gradient, Subgradient, and Proximal Methods for Convex Optimization
Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology February 2014
More informationMixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines
Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Christopher Tosh Department of Computer Science and Engineering University of California, San Diego ctosh@cs.ucsd.edu Abstract The
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationReinforcement Learning
Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha
More informationConsensus-Based Distributed Optimization with Malicious Nodes
Consensus-Based Distributed Optimization with Malicious Nodes Shreyas Sundaram Bahman Gharesifard Abstract We investigate the vulnerabilities of consensusbased distributed optimization protocols to nodes
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationarxiv: v1 [stat.ml] 12 Nov 2015
Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints Mengdi Wang, Yichen Chen, Jialin Liu, Yuantao Gu arxiv:5.03760v [stat.ml] Nov 05 November 3,
More informationStochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos
1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and
More informationSTOCHASTIC PROCESSES Basic notions
J. Virtamo 38.3143 Queueing Theory / Stochastic processes 1 STOCHASTIC PROCESSES Basic notions Often the systems we consider evolve in time and we are interested in their dynamic behaviour, usually involving
More informationLecture 15: MCMC Sanjeev Arora Elad Hazan. COS 402 Machine Learning and Artificial Intelligence Fall 2016
Lecture 15: MCMC Sanjeev Arora Elad Hazan COS 402 Machine Learning and Artificial Intelligence Fall 2016 Course progress Learning from examples Definition + fundamental theorem of statistical learning,
More informationComputational statistics
Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationConductance and Rapidly Mixing Markov Chains
Conductance and Rapidly Mixing Markov Chains Jamie King james.king@uwaterloo.ca March 26, 2003 Abstract Conductance is a measure of a Markov chain that quantifies its tendency to circulate around its states.
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationStochastic optimization Markov Chain Monte Carlo
Stochastic optimization Markov Chain Monte Carlo Ethan Fetaya Weizmann Institute of Science 1 Motivation Markov chains Stationary distribution Mixing time 2 Algorithms Metropolis-Hastings Simulated Annealing
More informationUniversity of Toronto Department of Statistics
Norm Comparisons for Data Augmentation by James P. Hobert Department of Statistics University of Florida and Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0704
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationDistributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 55, NO. 9, SEPTEMBER 2010 1987 Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE Abstract
More informationWE consider finite-state Markov decision processes
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 54, NO. 7, JULY 2009 1515 Convergence Results for Some Temporal Difference Methods Based on Least Squares Huizhen Yu and Dimitri P. Bertsekas Abstract We consider
More informationConvex Optimization of Graph Laplacian Eigenvalues
Convex Optimization of Graph Laplacian Eigenvalues Stephen Boyd Abstract. We consider the problem of choosing the edge weights of an undirected graph so as to maximize or minimize some function of the
More informationarxiv: v2 [math.oc] 5 May 2018
The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems Viva Patel a a Department of Statistics, University of Chicago, Illinois, USA arxiv:1709.04718v2 [math.oc]
More informationNewton-like method with diagonal correction for distributed optimization
Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić February 7, 2017 Abstract We consider distributed optimization
More informationCHAPTER 11. A Revision. 1. The Computers and Numbers therein
CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of
More informationFinite Time Bounds for Temporal Difference Learning with Function Approximation: Problems with some state-of-the-art results
Finite Time Bounds for Temporal Difference Learning with Function Approximation: Problems with some state-of-the-art results Chandrashekar Lakshmi Narayanan Csaba Szepesvári Abstract In all branches of
More informationConvex Optimization of Graph Laplacian Eigenvalues
Convex Optimization of Graph Laplacian Eigenvalues Stephen Boyd Stanford University (Joint work with Persi Diaconis, Arpita Ghosh, Seung-Jean Kim, Sanjay Lall, Pablo Parrilo, Amin Saberi, Jun Sun, Lin
More informationNon-Convex Optimization. CS6787 Lecture 7 Fall 2017
Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More information18.600: Lecture 32 Markov Chains
18.600: Lecture 32 Markov Chains Scott Sheffield MIT Outline Markov chains Examples Ergodicity and stationarity Outline Markov chains Examples Ergodicity and stationarity Markov chains Consider a sequence
More informationDistributed Optimization. Song Chong EE, KAIST
Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationfor Global Optimization with a Square-Root Cooling Schedule Faming Liang Simulated Stochastic Approximation Annealing for Global Optim
Simulated Stochastic Approximation Annealing for Global Optimization with a Square-Root Cooling Schedule Abstract Simulated annealing has been widely used in the solution of optimization problems. As known
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More information18.175: Lecture 30 Markov chains
18.175: Lecture 30 Markov chains Scott Sheffield MIT Outline Review what you know about finite state Markov chains Finite state ergodicity and stationarity More general setup Outline Review what you know
More informationSearch Directions for Unconstrained Optimization
8 CHAPTER 8 Search Directions for Unconstrained Optimization In this chapter we study the choice of search directions used in our basic updating scheme x +1 = x + t d. for solving P min f(x). x R n All
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationCS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash
CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash Equilibrium Price of Stability Coping With NP-Hardness
More informationAsymptotic Filtering and Entropy Rate of a Hidden Markov Process in the Rare Transitions Regime
Asymptotic Filtering and Entropy Rate of a Hidden Marov Process in the Rare Transitions Regime Chandra Nair Dept. of Elect. Engg. Stanford University Stanford CA 94305, USA mchandra@stanford.edu Eri Ordentlich
More informationRandom walks, Markov chains, and how to analyse them
Chapter 12 Random walks, Markov chains, and how to analyse them Today we study random walks on graphs. When the graph is allowed to be directed and weighted, such a walk is also called a Markov Chain.
More informationALMOST SURE CONVERGENCE OF RANDOM GOSSIP ALGORITHMS
ALMOST SURE CONVERGENCE OF RANDOM GOSSIP ALGORITHMS Giorgio Picci with T. Taylor, ASU Tempe AZ. Wofgang Runggaldier s Birthday, Brixen July 2007 1 CONSENSUS FOR RANDOM GOSSIP ALGORITHMS Consider a finite
More informationMarkov Chains and Stochastic Sampling
Part I Markov Chains and Stochastic Sampling 1 Markov Chains and Random Walks on Graphs 1.1 Structure of Finite Markov Chains We shall only consider Markov chains with a finite, but usually very large,
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationIntroduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa
Introduction to Search Engine Technology Introduction to Link Structure Analysis Ronny Lempel Yahoo Labs, Haifa Outline Anchor-text indexing Mathematical Background Motivation for link structure analysis
More informationINTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING
INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING ERIC SHANG Abstract. This paper provides an introduction to Markov chains and their basic classifications and interesting properties. After establishing
More informationDistributed Control of the Laplacian Spectral Moments of a Network
Distributed Control of the Laplacian Spectral Moments of a Networ Victor M. Preciado, Michael M. Zavlanos, Ali Jadbabaie, and George J. Pappas Abstract It is well-nown that the eigenvalue spectrum of the
More information