Convergence Analysis of Gradient Descent for Eigenvector Computation


Zhiqiang Xu 1, Xin Cao 2, and Xin Gao 1
1 KAUST, Saudi Arabia    2 UNSW, Australia
zhiqiang.xu@kaust.edu.sa, xin.cao@unsw.edu.au, xin.gao@kaust.edu.sa
(Corresponding author <zhiqiangxu200@gmail.com>)

Abstract

We present a novel, simple, and systematic convergence analysis of gradient descent for eigenvector computation. As a popular, practical, and provable approach to numerous machine learning problems, gradient descent has found successful applications to eigenvector computation as well. Surprisingly, however, it lacks a thorough theoretical analysis for the underlying geodesically non-convex problem. In this work, the convergence of the gradient descent solver for the leading eigenvector computation is shown to be at a global rate $O(\min\{(\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon},\ \frac{1}{\epsilon}\})$, where $\Delta_p = \lambda_p - \lambda_{p+1} > 0$ represents the generalized positive eigengap and always exists without loss of generality, with $\lambda_i$ being the $i$-th largest eigenvalue of the given real symmetric matrix and $p$ being the multiplicity of $\lambda_1$. The rate is linear at $O((\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon})$ if $(\frac{\lambda_1}{\Delta_p})^2 = O(1)$, otherwise sub-linear at $O(\frac{1}{\epsilon})$. We also show that the convergence depends only logarithmically, instead of quadratically, on the initial iterate. In particular, this is the first time that linear convergence is established for the gradient descent solver in the case where the conventionally considered eigengap $\Delta_1 = \lambda_1 - \lambda_2 = 0$ but the generalized eigengap satisfies $(\frac{\lambda_1}{\Delta_p})^2 = O(1)$, and the first time that the logarithmic dependence on the initial iterate is established for it. We are also the first to leverage for analysis the log principal angle between the iterate and the space of globally optimal solutions. Theoretical properties are verified in experiments.

1 Introduction

Eigenvector computation is a ubiquitous problem in data processing nowadays, arising in spectral clustering [Ng et al., 2002; Xu and Ke, 2016a], PageRank computation, dimensionality reduction [Fan et al., 2018], and so on. Classic solvers from numerical algebra are power methods and Lanczos algorithms [Golub and Van Loan, 1996], based on which there has been a recent surge of interest in developing varieties of solvers [Hardt and Price, 2014; Musco and Musco, 2015; Garber et al., 2016; Balcan et al., 2016]. However, most of them are purely theoretic. In this work, we focus on a general and practical solver [Wen and Yin, 2013], namely gradient descent from the optimization perspective, for the leading eigenvector computation:

$\min_{x\in\mathbb{R}^n:\ \|x\|_2 = 1} f(x) \triangleq -\frac{1}{2}x^\top A x,$   (1)

where $A\in\mathbb{R}^{n\times n}$ and $A^\top = A$. Gradient descent is the most widely used method in optimization for machine learning [Sra et al., 2011], due to its effectiveness, simplicity, and provable theoretical guarantees. It has also found successful applications to eigenvector computation [Absil et al., 2008; Wen and Yin, 2013; Pitaval et al., 2015; Liu et al., 2016; Zhang et al., 2016; Xu and Ke, 2016b]. In particular, it was demonstrated to have comparable performance and better robustness in comparison to Lanczos algorithms in extensive experimental studies [Wen and Yin, 2013]. Despite being practical, however, its convergence analysis has remained unsatisfactory thus far. Pitaval et al. proved its global convergence, but the rate was unknown [Pitaval et al., 2015]. Under a large positive eigengap assumption, a linear but local rate was achieved in [Xu and Ke, 2016b; Liu et al., 2016; Xu et al., 2017], while a gap-free rate $O(\frac{1}{\epsilon^2})$ was given by [Arora et al., 2013; Shamir, 2016a]. Very recently, the rate for the case of an arbitrary eigengap has been improved to $O(\frac{1}{\epsilon})$ [Xu and Gao, 2018].
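As a quick illustration of Problem (1), the following NumPy sketch (ours, not from the paper; the matrix and sizes are arbitrary test choices) checks that a leading eigenvector attains the minimum value $-\lambda_1/2$ of $f$ on the unit sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                      # a real symmetric matrix, as in Problem (1)

def f(x, A):
    """Objective of Problem (1): f(x) = -1/2 * x^T A x on the unit sphere."""
    return -0.5 * x @ A @ x

evals, evecs = np.linalg.eigh(A)       # eigenvalues in ascending order
v1 = evecs[:, -1]                      # a leading (unit-norm) eigenvector

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                 # an arbitrary unit vector for comparison

print(np.isclose(f(v1, A), -evals[-1] / 2))   # True: f(v1) = -lambda_1 / 2
print(f(x, A) >= f(v1, A))                    # True: no unit vector does better
```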
As shown in our experimental study, however, certain cases of zero eigengap are significantly underestimated by this result, and the quadratic dependence on the initial iterate does not hold. In this paper, we present a systematic convergence analysis of the gradient descent solver for Problem (1) that addresses all the issues mentioned above, and we do so from the viewpoint of Riemannian optimization. The sphere constraint in (1) represents a Riemannian manifold, called the sphere manifold, which is a special case of the Stiefel manifold defined by the orthogonality constraint [Absil et al., 2008]. In this sense, Problem (1) is geodesically non-convex. A gradient descent step in Riemannian optimization takes the following form:

$x_{t+1} = R(x_t,\ -\alpha_{t+1}\nabla f(x_t)),$   (2)

where $\nabla f(x_t)$ denotes the Riemannian gradient representing the steepest ascent direction in the equidimensional Euclidean space tangent to the manifold at $x_t$, $\alpha_{t+1} > 0$ represents the step-size, and $R(x_t, \cdot)$ represents the retraction map from the tangent space at $x_t$ to the manifold.
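As a concrete reading of (2) on the sphere, here is a minimal sketch (ours; the test matrix is arbitrary) of the two ingredients, a Riemannian gradient obtained by tangent-space projection and a normalization retraction; the specific choices adopted by the analyzed solver are given in Section 3.

```python
import numpy as np

def riemannian_grad(A, x):
    """Riemannian gradient of f(x) = -1/2 x^T A x on the sphere: the Euclidean
    gradient -Ax projected onto the tangent space at x, i.e., -(I - x x^T) A x."""
    return -(A @ x - (x @ A @ x) * x)

def retract(x, xi):
    """Normalization retraction R(x, xi) = (x + xi) / ||x + xi||_2, mapping a
    tangent step back onto the sphere (one valid retraction among several)."""
    y = x + xi
    return y / np.linalg.norm(y)

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
A = (B + B.T) / 2
x = rng.standard_normal(n)
x /= np.linalg.norm(x)

g = riemannian_grad(A, x)
print(abs(x @ g) < 1e-8)               # True: the Riemannian gradient is tangent at x
x_next = retract(x, -0.1 * g)          # one step of update (2) with alpha_{t+1} = 0.1
print(np.isclose(np.linalg.norm(x_next), 1.0))   # True: the iterate stays on the sphere
```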

In particular, in order to measure the progress of the iterate $x_t$ towards one of the globally optimal solutions, we use for analysis a novel potential function defined by the log principal angle between $x_t$ and the space of such solutions, rather than only one of those solutions as in previous work [Shamir, 2015]. It turns out that this potential function has the advantage of establishing global convergence and of improving the dependence on the initial iterate. We present a novel general analysis that depends on the generalized eigengap $\Delta_p = \lambda_p - \lambda_{p+1}$, where $\lambda_i$ represents the $i$-th largest eigenvalue of the given matrix $A$ and $p$ is the multiplicity of $\lambda_1$. $\Delta_p$ always remains positive without loss of generality. This unifies all the cases of the conventionally considered eigengap $\Delta_1 = \lambda_1 - \lambda_2$, i.e., $\Delta_1 > 0$ and $\Delta_1 = 0$. When $\Delta_p$ is as small as the target precision parameter $\epsilon$, the resulting theoretical complexity is high. However, another unified analysis that we present subsequently, which holds regardless of the value of $\Delta_p$, shows that this can be significantly reduced. Specifically, we make the following contributions:

- A global convergence rate $O(\min\{(\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon},\ \frac{1}{\epsilon}\})$ of the Riemannian gradient descent solver is achieved for Problem (1). The rate is linear at $O((\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon})$ if $(\frac{\lambda_1}{\Delta_p})^2 = O(1)$, otherwise sub-linear at $O(\frac{1}{\epsilon})$. This shows that even if $\Delta_1 = 0$, it is still possible to converge linearly.
- The quadratic dependence of the convergence on the initial iterate is improved to a logarithmic one.
- Theoretical properties, especially those related to $\Delta_1 = 0$, are empirically verified on synthetic and real data.

2 Related Work

Due to the space limit, we only discuss gradient based solvers and refer readers to the cited references for more of the literature. There are two categories of such solvers: projected gradient descent and Riemannian gradient descent. Arora et al. proposed the stochastic power method without theoretical guarantees [Arora et al., 2012], which is in fact equivalent to projected stochastic gradient descent for the principal component analysis (PCA) problem. It was named after the power method because the projected (deterministic) gradient descent for PCA degenerates to the power method when the step-size goes to infinity (see the sketch at the end of this section). This method was subsequently extended via convex relaxation and proved to have a global, gap-free, and sub-linear convergence rate $O(\frac{1}{\epsilon^2})$ [Arora et al., 2013]. Balsubramani et al. demonstrated that Oja's algorithm converges at a global, gap-dependent, and sub-linear rate $O(\frac{1}{\epsilon})$ via martingale analysis [Balsubramani et al., 2013]. Note that gap dependence refers to the dependence on $\Delta_1$ in most of the existing analyses. More recently, stochastic gradient descent (SGD) for PCA was shown to converge either at a global, gap-dependent, and sub-linear rate $O(\frac{1}{\epsilon})$ or at a global, gap-free, and sub-linear rate $O(\frac{1}{\epsilon^2})$ [Shamir, 2016a]. Shamir proposed projected SGD with variance reduction for PCA, called VR-PCA, and proved its global or local, gap-dependent, and linear rate $O(\frac{1}{\Delta_1^2}\log\frac{1}{\epsilon})$ [Shamir, 2015; 2016b]. More relevant to our work is an increasing body of Riemannian gradient descent methods. It is worth noting that general Riemannian optimization methods are applicable to Problem (1), including Riemannian gradient descent [Edelman et al., 1999; Absil et al., 2008; Wen and Yin, 2013], Riemannian SGD [Bonnabel, 2013], and Riemannian SVRG [Zhang et al., 2016].
Notably, Wen et al. demonstrated that curvilinear search, which in fact performs Riemannian gradient descent with a Cayley-transformation-based retraction, achieves comparable performance and is more robust compared to Lanczos algorithms [Wen and Yin, 2013]. However, the existing analysis at most achieves either a global, gap-free, and sub-linear convergence to critical points [Absil et al., 2008; Wen and Yin, 2013; Bonnabel, 2013; Zhang et al., 2016] or a local, gap-dependent, and linear convergence to globally optimal solutions [Absil et al., 2008], due to the geodesic non-convexity. As we know, each eigenvector corresponds to a critical point of the objective function in Problem (1), and thus such guarantees fail to meet global optimality in theory. On the other hand, gradient descent for low-rank approximation was proven to converge globally, but without any rate provided, in [Pitaval et al., 2015]. Xu et al. established the local, gap-dependent, and linear rate $O(\frac{1}{\Delta_1^2}\log\frac{1}{\epsilon})$ of the Riemannian SVRG solver [Xu and Ke, 2016b; Xu et al., 2017]. By proving an explicit Łojasiewicz exponent of $\frac{1}{2}$, Liu et al. demonstrated a local, gap-dependent, and linear convergence of Riemannian line-search methods for the quadratic problem under the orthogonality constraint, which includes Problem (1) as a special case [Liu et al., 2016]. In addition, despite its motivation from optimization on Riemannian quotient manifolds, Alecton, an SGD solver for streaming low-rank matrix approximation, ends up with an update like the stochastic power method, and achieves a global, gap-dependent, and sub-linear rate $O(\frac{1}{\Delta_1^2\epsilon})$ via martingale analysis [Sa et al., 2015]. Although gradient based methods for Problem (1) work well in practice [Arora et al., 2012; Wen and Yin, 2013], the related theory lags far behind. Despite great interest in developing stochastic solvers, surprisingly, there does not even exist a thorough convergence analysis for the deterministic gradient descent solver such as the one achieved in this work, which we believe will be an important basis for developing stochastic versions with better theoretical guarantees than the state-of-the-art.
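The power-method limit mentioned above for projected gradient descent can be checked in a few lines. This sketch is ours (the matrix is an arbitrary symmetric test instance), not code from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                      # an arbitrary symmetric test matrix

x = rng.standard_normal(n)
x /= np.linalg.norm(x)

def projected_gd_step(A, x, alpha):
    """Projected gradient descent for PCA: move along the (Euclidean) gradient
    of x^T A x / 2, i.e., A x, then project back onto the unit sphere."""
    y = x + alpha * (A @ x)
    return y / np.linalg.norm(y)

power_step = (A @ x) / np.linalg.norm(A @ x)   # one power-method step from x

# As the step-size grows, the projected gradient step approaches the power step.
for alpha in (1.0, 10.0, 1e3, 1e6):
    print(alpha, np.linalg.norm(projected_gd_step(A, x, alpha) - power_step))
```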

3 Analysis of Gradient Descent

We now conduct the convergence analysis of the Riemannian gradient descent solver for Problem (1), starting with the necessary notions and notations. The main results are then stated in a theorem, followed by a few important supporting lemmas. The section ends with the proofs of the theorem and the lemmas.

3.1 Notions and Notations

Recall that, given a real symmetric matrix $A\in\mathbb{R}^{n\times n}$, its $i$-th largest eigenvalue is denoted as $\lambda_i$, i.e., $\lambda_1 \ge \cdots \ge \lambda_n$, and the corresponding eigenvector is denoted as $v_i$. Define the eigengap $\Delta_i \triangleq \lambda_i - \lambda_{i+1}$ and let $p$ be the multiplicity of $\lambda_1$, with the associated eigenspace denoted as $V_p \triangleq [v_1, \dots, v_p]$. One then has $\Delta_j = 0$ for $j = 1, \dots, p-1$ and $\Delta_p > 0$ without loss of generality, i.e., $p < n$, because $f(x)$ would be a constant function if $p = n$. Instead of focusing on a specific case of the eigengap, such as $\Delta_1 > 0$, we target a general framework admitting all the cases of $\Delta_1$. To this end, we rely on $\Delta_p$ and $V_p$, where we follow [Xu and Gao, 2018] in terming $\Delta_p$ the generalized eigengap. That is, it suffices for us to show convergence to one of the globally optimal solutions, i.e., a unit vector $v \in \mathrm{span}(V_p)$, instead of a specific solution, e.g., $v_1$. We thus define the following potential function:

$\psi(x_t) \triangleq -2\log\|V_p^\top x_t\|_2,$   (3)

where $\|V_p^\top x_t\|_2 = \cos\theta(x_t, V_p)$ represents the cosine of the principal angle between the current iterate $x_t$ and the space of globally optimal solutions. As $\|V_p^\top x_t\|_2 \le 1$, it is easy to see that $\psi(x_t) \ge 0$, and that $x_t$ converges to a unit vector in $\mathrm{span}(V_p)$ when $\psi(x_t)$ goes to $0$. In addition, we take the normalization retraction, i.e., $R(x, \xi) = \frac{x+\xi}{\|x+\xi\|_2}$. Thus, the update in (2) can be explicitly written as

$x_{t+1} = \frac{x_t - \alpha_{t+1}\nabla f(x_t)}{\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2},$   (4)

where the Riemannian gradient for Problem (1) is $\nabla f(x) = -(I - xx^\top)Ax$.

3.2 Main Results

Theorem 1. If the initial iterate is $x_0 = \frac{y}{\|y\|_2}$, where the entries of $y$ are i.i.d. standard normal samples, i.e., $y_i \sim \mathcal{N}(0, 1)$, then the Riemannian gradient descent solver for Problem (1) converges to a certain unit vector $v \in \mathrm{span}(V_p)$ at a global rate $O(\min\{(\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon},\ \frac{1}{\epsilon}\})$ with high probability. Specifically, for any $\epsilon \in (0, 1)$:

(I) the solver with constant step-sizes $\alpha < \frac{\Delta_p}{2\lambda_1^2(1+\alpha\Delta_p)}$ will converge, i.e., $\psi(x_T) < \epsilon$, after $T = O((\frac{\lambda_1}{\Delta_p})^2\log\frac{\psi(x_0)}{\epsilon})$ iterations, with probability at least $1 - \nu^p$, where $\nu = 0$ if $p = 1$ and $\nu \in (0, 1)$ otherwise;

(II) the solver with diminishing step-sizes $\alpha_t = \frac{c}{\tau + t}$ for sufficiently large constants $c, \tau > 0$ will converge, i.e., $\psi(x_T) < \epsilon$, after $T = O(\frac{1}{\epsilon})$ iterations, with probability at least $1 - \nu^p$, where $\nu = 0$ if $p = 1$ and $\nu \in (0, 1)$ otherwise.

As the convergence holds with high probability for any random initial iterate, it is global by convention. In contrast, the success probability given in [Shamir, 2016a] is $\Omega(\frac{1}{n})$. To prove the theorem, we need a few important lemmas, whose proofs are deferred to after that of Theorem 1.

Lemma 2. $\lambda_1 - x^\top A x \ge \Delta_p \sin^2\theta(x, V_p)$ for any unit vector $x$.

Lemma 3. $\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2^2 \le 1 + \alpha_{t+1}^2(\lambda_1^2 + \lambda_2^2)\sin^2\theta(x_t, V_p)$.

Lemma 4. $\frac{x}{1+x} \le \log(1+x) \le x$ for any $x > -1$, and $-\log(1-x) \le \frac{x}{1-x}$ for any $x \in (0, 1)$.

Lemma 5. $\|V_p^\top x_0\|_2 > 0$ with probability $1$ if $p = 1$, otherwise with probability at least $1 - \nu^p$, where $0 < \nu < 1$ is a constant about $V_p^\top y$.
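For reference, the following sketch (ours) implements the solver of Theorem 1(I): update (4) with the normalization retraction, a random unit-norm initial iterate, a constant step-size, and the potential (3) monitored against a dense eigensolver. The step-size below is an arbitrary small constant rather than the exact bound of the theorem.

```python
import numpy as np

def rgd_leading_eigenvector(A, alpha, T, rng):
    """Riemannian gradient descent for Problem (1) with the normalization
    retraction, i.e., update (4): x <- (x - a*grad) / ||x - a*grad||_2."""
    n = A.shape[0]
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                      # x_0 = y / ||y||_2 with y ~ N(0, I)
    iterates = [x]
    for _ in range(T):
        grad = -(A @ x - (x @ A @ x) * x)       # Riemannian gradient -(I - x x^T) A x
        x = x - alpha * grad
        x /= np.linalg.norm(x)                  # normalization retraction
        iterates.append(x)
    return iterates

def psi(x, V_p):
    """Potential (3): psi(x) = -2 log ||V_p^T x||_2."""
    return -2.0 * np.log(np.linalg.norm(V_p.T @ x))

rng = np.random.default_rng(2)
n = 300
B = rng.standard_normal((n, n))
A = (B + B.T) / 2
evals, evecs = np.linalg.eigh(A)
V_p = evecs[:, -1:]                             # lambda_1 is simple here, so p = 1

alpha = 0.5 / np.linalg.norm(A, 2)              # a small constant step-size (our choice)
iterates = rgd_leading_eigenvector(A, alpha, T=300, rng=rng)
print([round(psi(x, V_p), 4) for x in iterates[::60]])   # psi should decrease towards 0
```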

Proof of Theorem 1

Proof. (I) We start by expanding $\psi(x_{t+1})$ using (3)-(4):

$\psi(x_{t+1}) = -2\log\|V_p^\top x_{t+1}\|_2 = -2\log\|V_p^\top(x_t - \alpha_{t+1}\nabla f(x_t))\|_2 + 2\log\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2.$

To proceed, we need the full eigen-decomposition of $A$, i.e.,

$A = \lambda_1 V_p V_p^\top + V_p^\perp\,\mathrm{diag}(\lambda_{p+1}, \dots, \lambda_n)\,(V_p^\perp)^\top,$   (5)

where $V_p^\perp$ represents the orthogonal complement of $V_p$. Plugging this decomposition into $V_p^\top A x_t$, one then gets

$V_p^\top\nabla f(x_t) = -V_p^\top(I - x_t x_t^\top)A x_t = -V_p^\top A x_t + (x_t^\top A x_t)V_p^\top x_t = -(\lambda_1 - x_t^\top A x_t)V_p^\top x_t.$

We can now write

$\psi(x_{t+1}) = \psi(x_t) - 2\log(1 + \alpha_{t+1}(\lambda_1 - x_t^\top A x_t)) + 2\log\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2,$

where we have

$-2\log(1 + \alpha_{t+1}(\lambda_1 - x_t^\top A x_t)) \le -2\log(1 + \alpha_{t+1}\Delta_p\sin^2\theta(x_t, V_p))$  (by Lemma 2)
$\le -\frac{2\alpha_{t+1}\Delta_p\sin^2\theta(x_t, V_p)}{1 + \alpha_{t+1}\Delta_p\sin^2\theta(x_t, V_p)}$  (by Lemma 4)
$\le -\frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p}\sin^2\theta(x_t, V_p),$

and

$2\log\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2 \le \log(1 + \alpha_{t+1}^2(\lambda_1^2 + \lambda_2^2)\sin^2\theta(x_t, V_p))$  (by Lemma 3)
$\le \alpha_{t+1}^2(\lambda_1^2 + \lambda_2^2)\sin^2\theta(x_t, V_p)$  (by Lemma 4)
$\le 2\alpha_{t+1}^2\lambda_1^2\sin^2\theta(x_t, V_p).$

One thus arrives at

$\psi(x_{t+1}) \le \psi(x_t) - \frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p}\sin^2\theta(x_t, V_p) + 2\alpha_{t+1}^2\lambda_1^2\sin^2\theta(x_t, V_p)$   (6)
$= \psi(x_t) - \Big(\frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p} - 2\alpha_{t+1}^2\lambda_1^2\Big)\frac{\sin^2\theta(x_t, V_p)}{\psi(x_t)}\,\psi(x_t).$

Note that by Lemma 4,

$\frac{\sin^2\theta(x_t, V_p)}{\psi(x_t)} = \frac{\sin^2\theta(x_t, V_p)}{-\log(1 - \sin^2\theta(x_t, V_p))} \ge \frac{1}{1 - \log(1 - \sin^2\theta(x_t, V_p))} = \frac{1}{1 + \psi(x_t)}.$   (7)

If $\frac{\Delta_p}{1 + \alpha_{t+1}\Delta_p} - 2\alpha_{t+1}\lambda_1^2 > 0$, i.e., $\alpha_{t+1} < \frac{\Delta_p}{2\lambda_1^2(1 + \alpha_{t+1}\Delta_p)}$, then $\psi(x_{t+1}) < \psi(x_t)$ and

$\psi(x_{t+1}) \le \Big(1 - \frac{\alpha_{t+1}\Delta_p}{(1 + \alpha_{t+1}\Delta_p)(1 + \psi(x_t))}\Big)\psi(x_t)$   (8)
$\le \Big(1 - \frac{\alpha\Delta_p}{(1 + \alpha\Delta_p)(1 + \psi(x_0))}\Big)\psi(x_t)$   (9)

if $\alpha_{t+1} \equiv \alpha$. We thus have

$\psi(x_T) \le \Big(1 - \frac{\alpha\Delta_p}{(1 + \alpha\Delta_p)(1 + \psi(x_0))}\Big)^T\psi(x_0) \le \exp\Big\{-\frac{T\alpha\Delta_p}{(1 + \alpha\Delta_p)(1 + \psi(x_0))}\Big\}\psi(x_0) \triangleq \Xi.$

If $\Xi < \epsilon$, i.e.,

$\psi(x_0) < +\infty$   (10)

and

$T > \frac{(1 + \alpha\Delta_p)(1 + \psi(x_0))}{\alpha\Delta_p}\log\frac{\psi(x_0)}{\epsilon},$   (11)

we must have $\psi(x_T) < \epsilon$. Plugging $\alpha < \frac{\Delta_p}{2\lambda_1^2(1 + \alpha\Delta_p)}$ into $T$, we can write

$T = O\Big(\frac{1}{\alpha\Delta_p}\log\frac{\psi(x_0)}{\epsilon}\Big) = O\Big(\Big(\frac{\lambda_1}{\Delta_p}\Big)^2\log\frac{\psi(x_0)}{\epsilon}\Big).$   (12)

Finally, note that (10) is equivalent to $\|V_p^\top x_0\|_2 > 0$, which occurs with probability $1$ if $p = 1$, and otherwise with probability at least $1 - \nu^p$ for a certain constant $\nu \in (0, 1)$, by Lemma 5.

(II) We now restart from (6). Plugging (7) into (6), we can write

$\psi(x_{t+1}) \le \psi(x_t) - \frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p}\sin^2\theta(x_t, V_p) + 2\alpha_{t+1}^2\lambda_1^2\sin^2\theta(x_t, V_p)$
$\le \psi(x_t) - \frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p}\cdot\frac{\psi(x_t)}{1 + \psi(x_t)} + 2\lambda_1^2\alpha_{t+1}^2.$

Let $\alpha_{t+1} = \frac{c}{\tau + t}$, where $c, \tau > 0$ are constants. It is easy to see that for sufficiently small $\alpha_{t+1}$, equivalently sufficiently large $\tau$, we have $\psi(x_{t+1}) < \psi(x_t)$. Thus, one gets

$\psi(x_{t+1}) \le \Big(1 - \frac{2\alpha_{t+1}\Delta_p}{(1 + c\Delta_p/\tau)(1 + \psi(x_0))}\Big)\psi(x_t) + 2\lambda_1^2\alpha_{t+1}^2.$

Denoting $a = \frac{2c\Delta_p}{(1 + c\Delta_p/\tau)(1 + \psi(x_0))}$ and $b = 2c^2\lambda_1^2$, one can write

$\psi(x_{t+1}) \le \Big(1 - \frac{a}{\tau + t}\Big)\psi(x_t) + \frac{b}{(\tau + t)^2}.$

As long as $\psi(x_0) < +\infty$ and $a > 1$, by Lemma D.1 in [Balsubramani et al., 2013], the recursion of the above inequality yields

$\psi(x_t) \le \Big(\frac{\tau + 1}{t + \tau + 1}\Big)^a\psi(x_0) + \frac{2^{a+1}b}{a - 1}\cdot\frac{1}{t + \tau + 1}.$

We thus have $\psi(x_T) = O(\frac{1}{T})$, i.e., $T = O(\frac{1}{\epsilon})$. The proof is completed by noting that $\psi(x_0) < +\infty$ occurs with high probability similarly, and that $a > 1$ can be guaranteed by choosing sufficiently large constants $c, \tau$.

Remark 1. We make a few remarks on our proof. Equation (8) shows that the convergence is at least linear, because $\psi(x_{t+1}) < \psi(x_t)$ and thus the contraction factor keeps decreasing for constant step-sizes. In Equation (9), $\alpha_{t+1}$ is not necessarily constant. In fact, the results hold for any step-size scheme as long as $\alpha_{t+1} < \frac{\Delta_p}{2\lambda_1^2(1 + \alpha_{t+1}\Delta_p)}$. This provides theoretical support for flexibly choosing step-size schemes. Equation (11) shows that larger step-sizes in a safe range lead to faster convergence. Moreover, we can see from Equation (12) that the convergence actually depends on the relative generalized eigengap $\frac{\Delta_p}{\lambda_1}$. This seems to be explicitly shown for the first time for the considered solver. As mentioned in Section 1, the proof provides two general analysis frameworks. The generality lies in the fact that both rates hold for any value of the positive $\Delta_p$. The combination of the results from the two kinds of analysis comprehensively portrays the convergence behavior of the solver. For both kinds of analysis, the convergence has only a logarithmic dependence on the initial iterate via $\psi(x_0) = -2\log\|V_p^\top x_0\|_2$.

Proof of Lemma 2

Proof. Plugging (5) into $x^\top A x$, we get

$x^\top A x = \lambda_1\|V_p^\top x\|_2^2 + x^\top V_p^\perp\,\mathrm{diag}(\lambda_{p+1}, \dots, \lambda_n)\,(V_p^\perp)^\top x$
$\le \lambda_1\|V_p^\top x\|_2^2 + \lambda_{p+1}\|(V_p^\perp)^\top x\|_2^2$
$= \lambda_1\|V_p^\top x\|_2^2 + \lambda_{p+1}(1 - \|V_p^\top x\|_2^2)$
$= \lambda_1\cos^2\theta(x, V_p) + \lambda_{p+1}\sin^2\theta(x, V_p).$

Thus, one arrives at

$\lambda_1 - x^\top A x \ge \lambda_1 - \lambda_1\cos^2\theta(x, V_p) - \lambda_{p+1}\sin^2\theta(x, V_p) = (\lambda_1 - \lambda_{p+1})\sin^2\theta(x, V_p) = \Delta_p\sin^2\theta(x, V_p).$
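To illustrate part (II) of the proof, the following sketch (ours) runs the same update with diminishing step-sizes $\alpha_{t+1} = c/(\tau + t)$ on a synthetic matrix with a controlled spectrum; the constants $c$ and $\tau$ below are illustrative choices, not the values dictated by the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# A synthetic matrix with a controlled spectrum: lambda_1 = 1, lambda_2 = 0.5 (our choice).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = np.concatenate(([1.0, 0.5], np.linspace(0.4, 0.0, n - 2)))
A = (Q * spectrum) @ Q.T
V_p = Q[:, :1]                                    # leading eigenspace, p = 1 here

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
c, tau = 20.0, 100.0                              # illustrative constants, not tuned as the theorem requires
psis = []
for t in range(2000):
    alpha = c / (tau + t)                         # diminishing step-size alpha_{t+1} = c / (tau + t)
    grad = -(A @ x - (x @ A @ x) * x)             # Riemannian gradient -(I - x x^T) A x
    x = x - alpha * grad
    x /= np.linalg.norm(x)                        # normalization retraction, update (4)
    psis.append(-2.0 * np.log(np.linalg.norm(V_p.T @ x)))

# psi(x_t) should decrease toward 0; Theorem 1(II) bounds the number of
# iterations needed to reach psi < eps by O(1/eps) for such step-sizes.
print([f"{psis[t]:.2e}" for t in (0, 99, 499, 1999)])
```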

Proof of Lemma 3

Proof. Since $x_t^\top\nabla f(x_t) = -x_t^\top(I - x_t x_t^\top)A x_t = 0$, we have

$\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2^2 = (x_t - \alpha_{t+1}\nabla f(x_t))^\top(x_t - \alpha_{t+1}\nabla f(x_t)) = 1 + \alpha_{t+1}^2\|\nabla f(x_t)\|_2^2.$

For any unit vector $v \in \mathrm{span}(V_p)$, $A$ admits the full eigen-decomposition

$A = \lambda_1 vv^\top + v^\perp\,\mathrm{diag}(\lambda_2, \dots, \lambda_n)\,(v^\perp)^\top,$

where $v^\perp$ represents the orthogonal complement of $v$. Plugging this into $\nabla f(x_t) = -(I - x_t x_t^\top)A x_t$ and expanding, we get

$\|\nabla f(x_t)\|_2^2 \le \lambda_1^2(v^\top x_t)^2\big(1 - (v^\top x_t)^2\big) + \lambda_2^2\big(1 - (v^\top x_t)^2\big) \le (\lambda_1^2 + \lambda_2^2)\big(1 - (v^\top x_t)^2\big).$

Since the above inequality holds for any unit vector $v \in \mathrm{span}(V_p)$, we have

$\|\nabla f(x_t)\|_2^2 \le (\lambda_1^2 + \lambda_2^2)\min_{v\in\mathrm{span}(V_p),\ \|v\|_2 = 1}\big(1 - (v^\top x_t)^2\big).$

By the definition of principal angles [Golub and Van Loan, 1996],

$\sin^2\theta(x_t, V_p) = 1 - \|V_p^\top x_t\|_2^2 = \min_{v\in\mathrm{span}(V_p),\ \|v\|_2 = 1}\big(1 - (v^\top x_t)^2\big).$

One thus has $\|\nabla f(x_t)\|_2^2 \le (\lambda_1^2 + \lambda_2^2)\sin^2\theta(x_t, V_p)$, which completes the proof.

Proof of Lemma 4

Proof. For any $x$, it holds that $1 + x \le e^x$. Then, for any $x > -1$, $\log(1+x) \le x$. If one lets $y = 1 + x$ in this inequality, then $\log y \le y - 1$. Further letting $y = \frac{1}{z}$ yields $-\log z \le \frac{1}{z} - 1 = \frac{1-z}{z}$. Last, setting $z = \frac{1}{1+x}$ gives $\log(1+x) \ge \frac{x}{1+x}$. Note that $\log(1+x) = \sum_{i\ge 0}\frac{(-1)^i x^{i+1}}{i+1}$ for $|x| < 1$. One then can write, for $x \in (0, 1)$, that

$-\log(1-x) = \sum_{i\ge 0}\frac{x^{i+1}}{i+1} \le \sum_{i\ge 0}x^{i+1} = \frac{x}{1-x},$

which completes the proof.

Proof of Lemma 5

Proof. Let $\sigma_{\min}(\cdot)$ and $\sigma_{\max}(\cdot)$ be the smallest and the largest singular value of a matrix, respectively. We then have

$\|V_p^\top x_0\|_2 = \frac{\|V_p^\top y\|_2}{\|y\|_2} \ge \frac{\sigma_{\min}(V_p^\top y)}{\sigma_{\max}(y)},$

where $\sigma_{\max}(y) > 0$ almost surely. Since the entries of $y$ are i.i.d. samples from $\mathcal{N}(0, 1)$ and $V_p$ is orthonormal, the entries of $V_p^\top y$ are i.i.d. standard normal samples as well. By Equation (3.2) in [Rudelson and Vershynin, 2010], $\sigma_{\min}(V_1^\top y) > 0$ almost surely (i.e., with probability 1). By Theorem 3.3 in [Rudelson and Vershynin, 2010], we have $\sigma_{\min}(V_p^\top y) > 0$ with probability at least $1 - \nu^p$, where $0 < \nu < 1$ is a constant about $V_p^\top y$.

4 Experiments

Theoretical properties of the considered solver were already observed and verified empirically in, e.g., [Wen and Yin, 2013; Liu et al., 2016], albeit without thorough theoretical support. For the sake of completeness, we reproduce a few experiments on both synthetic and real data here, with new tests on matrices of zero eigengap. All the ground-truth information, e.g., $v_1$ or $V_p$, is obtained by MATLAB's eigs function for the purpose of benchmarking. The advantage of using synthetic data mainly lies in the control of the eigengap. We set $n = 1000$ and follow [Shamir, 2015] to generate data using the full eigenvalue decomposition $A = U\Sigma U^\top$. $U$ is an orthogonal matrix and is set the same way as $x_0$. $\Sigma = \mathrm{diag}(\Sigma_1, \Sigma_2)$, where $\Sigma_2 = \mathrm{diag}(\frac{g_1}{n}, \dots, \frac{g_{n-r}}{n})$ with $g_i \sim \mathcal{N}(0, 1)$ and $r$ being the order of $\Sigma_1$. In addition, the following two settings are considered:

$p = 1$: $\Sigma_1 = \mathrm{diag}(1,\ 1-\eta,\ 1-1.1\eta,\ 1-1.2\eta,\ 1-1.3\eta,\ 1-1.4\eta)$, and then $\Delta_p = \Delta_1 = \eta > 0$, where $\eta \in \{0.2, 0.5, 0.8\}$.

$p = 3$: $\Sigma_1 = \mathrm{diag}(1, 1, 1)$, and then $\Delta_p = \Delta_3 = 1 - \max_i\frac{g_i}{n} > 0$.

For the conventionally considered case that $\Delta_1 > 0$, two types of convergence curves, in terms of the relative function error $\frac{f(x_t) - f(v_1)}{|f(v_1)|}$ and the commonly used potential function $\sin^2\theta(x_t, v_1)$, respectively, are shown in Figure 2 for different eigengap values. Global linear convergence of the solver is observed, and a larger eigengap yields faster convergence, which agrees with the theory. The convergence trends remain largely consistent between the two types of curves, which holds in the rest of the experiments as well. For the case that $\Delta_1 = 0$ and $\alpha_{t+1} = \frac{c}{\tau + t}$ (see Remark 1), different pairs of step-size parameters are tested in Figure 1. As described by Theorem 1, the convergence reported in Figures 1(a)-(b) is globally linear, rather than sub-linear as suggested by [Xu and Gao, 2018], and larger step-sizes result in faster convergence as well.
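For readers who want to reproduce the synthetic setting, here is a generation sketch following the description above; the exact diagonal of $\Sigma_1$ in the $p = 1$ case below is our own choice following that description and may differ in detail from the paper's.

```python
import numpy as np

def synthetic_matrix(n=1000, setting="p1", eta=0.5, rng=None):
    """Synthetic data in the spirit of Section 4: A = U Sigma U^T with U a random
    orthogonal matrix, Sigma = diag(Sigma_1, Sigma_2), and Sigma_2 = diag(g_i / n)
    with g_i ~ N(0, 1). The Sigma_1 blocks below follow our reading of the setup."""
    rng = np.random.default_rng() if rng is None else rng
    if setting == "p1":   # eigengap Delta_1 = eta > 0, multiplicity p = 1
        sigma1 = np.array([1.0, 1 - eta, 1 - 1.1 * eta, 1 - 1.2 * eta, 1 - 1.3 * eta, 1 - 1.4 * eta])
        p = 1
    else:                 # "p3": Delta_1 = 0 but Delta_p > 0, multiplicity p = 3
        sigma1 = np.array([1.0, 1.0, 1.0])
        p = 3
    r = len(sigma1)
    sigma2 = rng.standard_normal(n - r) / n       # small bulk eigenvalues g_i / n
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = (U * np.concatenate((sigma1, sigma2))) @ U.T
    return A, U[:, :p]                            # also return the leading eigenspace V_p

A, V_p = synthetic_matrix(setting="p3", rng=np.random.default_rng(4))
print(np.linalg.eigvalsh(A)[-4:])   # three (numerically) equal leading eigenvalues: Delta_1 = 0, Delta_3 > 0
```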
As it is unknown which leading eigenvector $x_t$ will converge to in this case, it is necessary to change the potential from the commonly used $\sin^2\theta(x_t, v_1)$ to $\sin^2\theta(x_t, V_p)$. As demonstrated in Figure 1(c), $\sin^2\theta(x_t, v_1)$ fails to converge, which implies that $x_t$ has converged to another leading eigenvector in $\mathrm{span}(V_p)$.
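The two potentials compared in Figures 1(b)-(c) are straightforward to compute; a small sketch of ours:

```python
import numpy as np

def sin2_theta(x, V):
    """sin^2 of the principal angle between a unit vector x and the column space
    of a matrix V with orthonormal columns: 1 - ||V^T x||_2^2."""
    return 1.0 - np.linalg.norm(V.T @ x) ** 2

# Toy illustration: if x converges to a leading eigenvector other than v_1,
# sin2_theta(x, V_p) still vanishes while sin2_theta(x, v_1) does not.
V_p = np.eye(4)[:, :2]        # pretend the leading eigenspace is span(e_1, e_2)
v_1 = V_p[:, :1]
x = V_p[:, 1]                 # x equals e_2, another unit vector in span(V_p)
print(sin2_theta(x, V_p), sin2_theta(x, v_1))   # 0.0 and 1.0
```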

[Figure 1: Synthetic data, $\Delta_1 = 0$, step-sizes $\alpha_{t+1} = c/(\tau+t)$ with $c \in \{2.87, 4.48, 6.09\}$. (a) Relative function error. (b) Potential $\sin^2\theta(x_t, V_p)$. (c) Improper potential $\sin^2\theta(x_t, v_1)$.]

[Figure 2: Synthetic data, $\Delta_1 > 0$, for eigengap values $\eta = 0.20, 0.50, 0.80$. (a) Relative function error. (b) Potential $\sin^2\theta(x_t, v_1)$.]

We run two implementations of the solver on a real symmetric matrix $A$, named Schenk, which is of size 10,728 x 10,728 with 85,000 nonzero entries. Each implementation is characterized by the combination of the retraction and the step-size scheme used. Curvilinear search, proposed in [Wen and Yin, 2013], uses the Cayley transform with the Barzilai-Borwein step-size. The other implementation uses normalization with a constant step-size. They were fed with the same random initial iterate. The results are reported in Figure 3, which reflects the global linear convergence as well, and meanwhile shows that different step-size schemes matter in practice.

[Figure 3: Real data (Schenk): Cayley + Barzilai-Borwein versus normalization + constant step-size. (a) Relative function error. (b) Potential $\sin^2\theta(x_t, v_1)$.]

5 Conclusion

We presented a simple yet comprehensive convergence analysis of the Riemannian gradient descent solver for the leading eigenvector computation. Two kinds of general analysis jointly established the true global rate of convergence to one of the leading eigenvectors. The generalized eigengap $\Delta_p$ eliminates the limitation of the commonly considered eigengap $\Delta_1$. When $\Delta_p$ is large, the convergence is linear, as described by the first general analysis. If it is as small as $\epsilon$, the second general analysis shows that the convergence is sub-linear at $O(\frac{1}{\epsilon})$ rather than $O(\frac{1}{\epsilon^2})$. It was also shown that the convergence depends only logarithmically on the initial iterate. The key to these breakthroughs made for the considered solver is to use for analysis the log principal angle between the iterates and the space of the leading eigenvectors. Along this line of work, there are a few interesting future research directions. For example, it is unknown whether the results can be extended to $k > 1$ for the top-$k$ eigenspace computation without deflation.

In particular, it would be more attractive to practitioners if the analysis could be translated to the case of the stochastic gradient descent solver, which is well worth further investigation. Also, although the rate is optimal for the considered solver, it would be of great interest to develop faster solvers of this type that match the optimal rate for the problem, e.g., that of Lanczos algorithms.

Acknowledgements

This research is supported in part by funding from King Abdullah University of Science and Technology (KAUST).

References

[Absil et al., 2008] P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[Arora et al., 2012] Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012.
[Arora et al., 2013] Raman Arora, Andrew Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. In NIPS, 2013.
[Balcan et al., 2016] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In COLT, 2016.
[Balsubramani et al., 2013] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In NIPS. Curran Associates, Inc., 2013.
[Bonnabel, 2013] Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9), 2013.
[Edelman et al., 1999] Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), April 1999.
[Fan et al., 2018] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, and Ziwei Zhu. Principal component analysis for big data. arXiv preprint, 2018.
[Garber et al., 2016] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In ICML, 2016.
[Golub and Van Loan, 1996] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.
[Hardt and Price, 2014] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In NIPS, 2014.
[Liu et al., 2016] Huikang Liu, Weijie Wu, and Anthony Man-Cho So. Quadratic optimization with orthogonality constraints: Explicit Łojasiewicz exponent and linear convergence of line-search methods. In ICML, 2016.
[Musco and Musco, 2015] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In NIPS, 2015.
[Ng et al., 2002] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS. MIT Press, 2002.
[Pitaval et al., 2015] Renaud-Alexandre Pitaval, Wei Dai, and Olav Tirkkonen. Convergence of gradient descent for low-rank matrix approximation. IEEE Transactions on Information Theory, 61(8), 2015.
[Rudelson and Vershynin, 2010] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. arXiv preprint, 2010.
[Sa et al., 2015] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In ICML, 2015.
[Shamir, 2015] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In ICML, 2015.
[Shamir, 2016a] Ohad Shamir. Convergence of stochastic gradient descent for PCA. In ICML, 2016.
[Shamir, 2016b] Ohad Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In ICML, 2016.
[Sra et al., 2011] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. The MIT Press, 2011.
[Wen and Yin, 2013] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2), 2013.
[Xu and Gao, 2018] Zhiqiang Xu and Xin Gao. On truly block eigensolvers via Riemannian optimization. In AISTATS, 2018.
[Xu and Ke, 2016a] Zhiqiang Xu and Yiping Ke. Effective and efficient spectral clustering on text and link data. In CIKM, 2016.
[Xu and Ke, 2016b] Zhiqiang Xu and Yiping Ke. Stochastic variance reduced Riemannian eigensolver. CoRR, 2016.
[Xu et al., 2017] Zhiqiang Xu, Yiping Ke, and Xin Gao. A fast algorithm for matrix eigen-decomposition. In UAI, Sydney, Australia, August 2017.
[Zhang et al., 2016] Hongyi Zhang, Sashank J. Reddi, and Suvrit Sra. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.


More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

COR-OPT Seminar Reading List Sp 18

COR-OPT Seminar Reading List Sp 18 COR-OPT Seminar Reading List Sp 18 Damek Davis January 28, 2018 References [1] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank Solutions of Linear Matrix Equations via Procrustes

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Normalized power iterations for the computation of SVD

Normalized power iterations for the computation of SVD Normalized power iterations for the computation of SVD Per-Gunnar Martinsson Department of Applied Mathematics University of Colorado Boulder, Co. Per-gunnar.Martinsson@Colorado.edu Arthur Szlam Courant

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR

THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR WEN LI AND MICHAEL K. NG Abstract. In this paper, we study the perturbation bound for the spectral radius of an m th - order n-dimensional

More information

Lecture 7: Weak Duality

Lecture 7: Weak Duality EE 227A: Conve Optimization and Applications February 7, 2012 Lecture 7: Weak Duality Lecturer: Laurent El Ghaoui 7.1 Lagrange Dual problem 7.1.1 Primal problem In this section, we consider a possibly

More information

Sparsification by Effective Resistance Sampling

Sparsification by Effective Resistance Sampling Spectral raph Theory Lecture 17 Sparsification by Effective Resistance Sampling Daniel A. Spielman November 2, 2015 Disclaimer These notes are not necessarily an accurate representation of what happened

More information

Recursive Generalized Eigendecomposition for Independent Component Analysis

Recursive Generalized Eigendecomposition for Independent Component Analysis Recursive Generalized Eigendecomposition for Independent Component Analysis Umut Ozertem 1, Deniz Erdogmus 1,, ian Lan 1 CSEE Department, OGI, Oregon Health & Science University, Portland, OR, USA. {ozertemu,deniz}@csee.ogi.edu

More information

Computation of the Joint Spectral Radius with Optimization Techniques

Computation of the Joint Spectral Radius with Optimization Techniques Computation of the Joint Spectral Radius with Optimization Techniques Amir Ali Ahmadi Goldstine Fellow, IBM Watson Research Center Joint work with: Raphaël Jungers (UC Louvain) Pablo A. Parrilo (MIT) R.

More information