Convergence Analysis of Gradient Descent for Eigenvector Computation


Zhiqiang Xu 1, Xin Cao 2, and Xin Gao 1
1 KAUST, Saudi Arabia    2 UNSW, Australia
zhiqiang.xu@kaust.edu.sa, xin.cao@unsw.edu.au, xin.gao@kaust.edu.sa
(Corresponding author <zhiqiangxu200@gmail.com>)

Abstract

We present a novel, simple, and systematic convergence analysis of gradient descent for eigenvector computation. As a popular, practical, and provable approach to numerous machine learning problems, gradient descent has found successful applications to eigenvector computation as well. Surprisingly, however, it lacks a thorough theoretical analysis for the underlying geodesically non-convex problem. In this work, the convergence of the gradient descent solver for the leading eigenvector computation is shown to be at a global rate $O(\min\{(\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon},\ \frac{1}{\epsilon}\})$, where $\Delta_p = \lambda_p - \lambda_{p+1} > 0$ represents the generalized positive eigengap and always exists without loss of generality, with $\lambda_i$ being the $i$-th largest eigenvalue of the given real symmetric matrix and $p$ being the multiplicity of $\lambda_1$. The rate is linear at $O((\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon})$ if $(\frac{\lambda_1}{\Delta_p})^2 = O(1)$, otherwise sub-linear at $O(\frac{1}{\epsilon})$. We also show that the convergence depends only logarithmically, instead of quadratically, on the initial iterate. In particular, this is the first time that linear convergence is established for the gradient descent solver in the case where the conventionally considered eigengap $\Delta_1 = \lambda_1 - \lambda_2 = 0$ but the generalized eigengap satisfies $(\frac{\lambda_1}{\Delta_p})^2 = O(1)$, and the first time that the logarithmic dependence on the initial iterate is established for it. We are also the first to leverage for analysis the log principal angle between the iterate and the space of globally optimal solutions. Theoretical properties are verified in experiments.

1 Introduction

Eigenvector computation is a ubiquitous problem in data processing nowadays, arising in spectral clustering [Ng et al., 2002; Xu and Ke, 2016a], PageRank computation, dimensionality reduction [Fan et al., 2018], and so on. Classic solvers from numerical algebra are power methods and Lanczos algorithms [Golub and Van Loan, 1996], based on which there has been a recent surge of interest in developing varieties of solvers [Hardt and Price, 2014; Musco and Musco, 2015; Garber et al., 2016; Balcan et al., 2016]. However, most of them are purely theoretic. In this work, we focus on a general and practical solver [Wen and Yin, 2013], namely gradient descent from the optimization perspective, for the leading eigenvector computation:

$\min_{x\in\mathbb{R}^n:\ \|x\|_2 = 1} f(x) \triangleq -\frac{1}{2}x^\top A x,$   (1)

where $A\in\mathbb{R}^{n\times n}$ and $A^\top = A$. Gradient descent is the most widely used method in optimization for machine learning [Sra et al., 2011], due to its effectiveness, simplicity, and provable theoretical guarantees. It has also found successful applications to eigenvector computation [Absil et al., 2008; Wen and Yin, 2013; Pitaval et al., 2015; Liu et al., 2016; Zhang et al., 2016; Xu and Ke, 2016b]. In particular, it was demonstrated to have comparable performance and better robustness in comparison to Lanczos algorithms in extensive experimental studies [Wen and Yin, 2013]. Despite being practical, however, its convergence analysis has remained unsatisfactory thus far. Pitaval et al. proved its global convergence, but the rate was unknown [Pitaval et al., 2015]. Under a large positive eigengap assumption, a linear but local rate was achieved in [Xu and Ke, 2016b; Liu et al., 2016; Xu et al., 2017], while a gap-free rate $O(\frac{1}{\epsilon^2})$ was given by [Arora et al., 2013; Shamir, 2016a]. Very recently, the rate for the case of an arbitrary eigengap has been improved to $O(\frac{1}{\epsilon})$ [Xu and Gao, 2018].
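As a quick illustration of Problem (1), the following NumPy sketch (ours, not from the paper; the matrix and sizes are arbitrary test choices) checks that a leading eigenvector attains the minimum value $-\lambda_1/2$ of $f$ on the unit sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                      # a real symmetric matrix, as in Problem (1)

def f(x, A):
    """Objective of Problem (1): f(x) = -1/2 * x^T A x on the unit sphere."""
    return -0.5 * x @ A @ x

evals, evecs = np.linalg.eigh(A)       # eigenvalues in ascending order
v1 = evecs[:, -1]                      # a leading (unit-norm) eigenvector

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                 # an arbitrary unit vector for comparison

print(np.isclose(f(v1, A), -evals[-1] / 2))   # True: f(v1) = -lambda_1 / 2
print(f(x, A) >= f(v1, A))                    # True: no unit vector does better
```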
As shown in our experimental study, however, certain cases of zero eigengap are significantly underestimated by this result, and the quadratic dependence on the initial iterate does not hold. In this paper, we present a systematic convergence analysis of the gradient descent solver for Problem (1) that addresses all the issues mentioned above, and we do so from the viewpoint of Riemannian optimization. The sphere constraint in (1) represents a Riemannian manifold, called the sphere manifold, which is a special case of the Stiefel manifold defined by the orthogonality constraint [Absil et al., 2008]. In this sense, Problem (1) is geodesically non-convex. A gradient descent step in Riemannian optimization takes the following form:

$x_{t+1} = R(x_t,\ -\alpha_{t+1}\nabla f(x_t)),$   (2)

where $\nabla f(x_t)$ denotes the Riemannian gradient representing the steepest ascent direction in the equidimensional Euclidean space tangent to the manifold at $x_t$, $\alpha_{t+1} > 0$ represents the step-size, and $R(x_t, \cdot)$ represents the retraction map from the tangent space at $x_t$ to the manifold.
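As a concrete reading of (2) on the sphere, here is a minimal sketch (ours; the test matrix is arbitrary) of the two ingredients, a Riemannian gradient obtained by tangent-space projection and a normalization retraction; the specific choices adopted by the analyzed solver are given in Section 3.

```python
import numpy as np

def riemannian_grad(A, x):
    """Riemannian gradient of f(x) = -1/2 x^T A x on the sphere: the Euclidean
    gradient -Ax projected onto the tangent space at x, i.e., -(I - x x^T) A x."""
    return -(A @ x - (x @ A @ x) * x)

def retract(x, xi):
    """Normalization retraction R(x, xi) = (x + xi) / ||x + xi||_2, mapping a
    tangent step back onto the sphere (one valid retraction among several)."""
    y = x + xi
    return y / np.linalg.norm(y)

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
A = (B + B.T) / 2
x = rng.standard_normal(n)
x /= np.linalg.norm(x)

g = riemannian_grad(A, x)
print(abs(x @ g) < 1e-8)               # True: the Riemannian gradient is tangent at x
x_next = retract(x, -0.1 * g)          # one step of update (2) with alpha_{t+1} = 0.1
print(np.isclose(np.linalg.norm(x_next), 1.0))   # True: the iterate stays on the sphere
```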

In particular, in order to measure the progress of the iterate $x_t$ towards one of the globally optimal solutions, we use for analysis a novel potential function defined by the log principal angle between $x_t$ and the space of such solutions, rather than only one of those solutions as in previous work [Shamir, 2015]. It turns out that this potential function has the advantage of establishing global convergence and of improving the dependence on the initial iterate. We present a novel general analysis that depends on the generalized eigengap $\Delta_p = \lambda_p - \lambda_{p+1}$, where $\lambda_i$ represents the $i$-th largest eigenvalue of the given matrix $A$ and $p$ is the multiplicity of $\lambda_1$. $\Delta_p$ always remains positive without loss of generality. This unifies all the cases of the conventionally considered eigengap $\Delta_1 = \lambda_1 - \lambda_2$, i.e., $\Delta_1 > 0$ and $\Delta_1 = 0$. When $\Delta_p$ is as small as the target precision parameter $\epsilon$, the resulting theoretical complexity is high. However, another unified analysis that we present subsequently, which holds regardless of the value of $\Delta_p$, shows that this can be significantly reduced. Specifically, we make the following contributions:

- A global convergence rate $O(\min\{(\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon},\ \frac{1}{\epsilon}\})$ of the Riemannian gradient descent solver is achieved for Problem (1). The rate is linear at $O((\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon})$ if $(\frac{\lambda_1}{\Delta_p})^2 = O(1)$, otherwise sub-linear at $O(\frac{1}{\epsilon})$. This shows that even if $\Delta_1 = 0$, it is still possible to converge linearly.
- The quadratic dependence of the convergence on the initial iterate is improved to a logarithmic one.
- Theoretical properties, especially those related to $\Delta_1 = 0$, are empirically verified on synthetic and real data.

2 Related Work

Due to the space limit, we only discuss gradient based solvers and refer readers to the cited references for more of the literature. There are two categories of such solvers: projected gradient descent and Riemannian gradient descent. Arora et al. proposed the stochastic power method without theoretical guarantees [Arora et al., 2012], which is in fact equivalent to projected stochastic gradient descent for the principal component analysis (PCA) problem. It was named after the power method because the projected (deterministic) gradient descent for PCA degenerates to the power method when the step-size goes to infinity (see the sketch at the end of this section). This method was subsequently extended via convex relaxation and proved to have a global, gap-free, and sub-linear convergence rate $O(\frac{1}{\epsilon^2})$ [Arora et al., 2013]. Balsubramani et al. demonstrated that Oja's algorithm converges at a global, gap-dependent, and sub-linear rate $O(\frac{1}{\epsilon})$ via martingale analysis [Balsubramani et al., 2013]. Note that gap dependence refers to the dependence on $\Delta_1$ in most of the existing analyses. More recently, stochastic gradient descent (SGD) for PCA was shown to converge either at a global, gap-dependent, and sub-linear rate $O(\frac{1}{\epsilon})$ or at a global, gap-free, and sub-linear rate $O(\frac{1}{\epsilon^2})$ [Shamir, 2016a]. Shamir proposed projected SGD with variance reduction for PCA, called VR-PCA, and proved its global or local, gap-dependent, and linear rate $O(\frac{1}{\Delta_1^2}\log\frac{1}{\epsilon})$ [Shamir, 2015; 2016b]. More relevant to our work is an increasing body of Riemannian gradient descent methods. It is worth noting that general Riemannian optimization methods are applicable to Problem (1), including Riemannian gradient descent [Edelman et al., 1999; Absil et al., 2008; Wen and Yin, 2013], Riemannian SGD [Bonnabel, 2013], and Riemannian SVRG [Zhang et al., 2016].
Notably, Wen et al. demonstrated that curvilinear search, which in fact performs Riemannian gradient descent with a Cayley-transformation-based retraction, achieves comparable performance and is more robust compared to Lanczos algorithms [Wen and Yin, 2013]. However, the existing analysis at most achieves either a global, gap-free, and sub-linear convergence to critical points [Absil et al., 2008; Wen and Yin, 2013; Bonnabel, 2013; Zhang et al., 2016] or a local, gap-dependent, and linear convergence to globally optimal solutions [Absil et al., 2008], due to the geodesic non-convexity. As we know, each eigenvector corresponds to a critical point of the objective function in Problem (1), and thus such guarantees fail to meet global optimality in theory. On the other hand, gradient descent for low-rank approximation was proven to converge globally, but without any rate provided, in [Pitaval et al., 2015]. Xu et al. established the local, gap-dependent, and linear rate $O(\frac{1}{\Delta_1^2}\log\frac{1}{\epsilon})$ of the Riemannian SVRG solver [Xu and Ke, 2016b; Xu et al., 2017]. By proving an explicit Łojasiewicz exponent of $\frac{1}{2}$, Liu et al. demonstrated a local, gap-dependent, and linear convergence of Riemannian line-search methods for the quadratic problem under the orthogonality constraint, which includes Problem (1) as a special case [Liu et al., 2016]. In addition, despite its motivation from optimization on Riemannian quotient manifolds, Alecton, an SGD solver for streaming low-rank matrix approximation, ends up with an update like the stochastic power method, and achieves a global, gap-dependent, and sub-linear rate $O(\frac{1}{\Delta_1^2\epsilon})$ via martingale analysis [Sa et al., 2015]. Although gradient based methods for Problem (1) work well in practice [Arora et al., 2012; Wen and Yin, 2013], the related theory lags far behind. Despite great interest in developing stochastic solvers, surprisingly, there does not even exist a thorough convergence analysis for the deterministic gradient descent solver such as the one achieved in this work, which we believe will be an important basis for developing stochastic versions with better theoretical guarantees than the state-of-the-art.
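The power-method limit mentioned above for projected gradient descent can be checked in a few lines. This sketch is ours (the matrix is an arbitrary symmetric test instance), not code from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                      # an arbitrary symmetric test matrix

x = rng.standard_normal(n)
x /= np.linalg.norm(x)

def projected_gd_step(A, x, alpha):
    """Projected gradient descent for PCA: move along the (Euclidean) gradient
    of x^T A x / 2, i.e., A x, then project back onto the unit sphere."""
    y = x + alpha * (A @ x)
    return y / np.linalg.norm(y)

power_step = (A @ x) / np.linalg.norm(A @ x)   # one power-method step from x

# As the step-size grows, the projected gradient step approaches the power step.
for alpha in (1.0, 10.0, 1e3, 1e6):
    print(alpha, np.linalg.norm(projected_gd_step(A, x, alpha) - power_step))
```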

3 Analysis of Gradient Descent

We now conduct the convergence analysis of the Riemannian gradient descent solver for Problem (1), starting with the necessary notions and notations. The main results are then stated in a theorem, followed by a few important supporting lemmas. The section ends with the proofs of the theorem and the lemmas.

3.1 Notions and Notations

Recall that, given a real symmetric matrix $A\in\mathbb{R}^{n\times n}$, its $i$-th largest eigenvalue is denoted as $\lambda_i$, i.e., $\lambda_1 \ge \cdots \ge \lambda_n$, and the corresponding eigenvector is denoted as $v_i$. Define the eigengap $\Delta_i \triangleq \lambda_i - \lambda_{i+1}$ and let $p$ be the multiplicity of $\lambda_1$, with the associated eigenspace denoted as $V_p \triangleq [v_1, \dots, v_p]$. One then has $\Delta_j = 0$ for $j = 1, \dots, p-1$ and $\Delta_p > 0$ without loss of generality, i.e., $p < n$, because $f(x)$ would be a constant function if $p = n$. Instead of focusing on a specific case of the eigengap, such as $\Delta_1 > 0$, we target a general framework admitting all the cases of $\Delta_1$. To this end, we rely on $\Delta_p$ and $V_p$, where we follow [Xu and Gao, 2018] in terming $\Delta_p$ the generalized eigengap. That is, it suffices for us to show convergence to one of the globally optimal solutions, i.e., a unit vector $v \in \mathrm{span}(V_p)$, instead of a specific solution, e.g., $v_1$. We thus define the following potential function:

$\psi(x_t) \triangleq -2\log\|V_p^\top x_t\|_2,$   (3)

where $\|V_p^\top x_t\|_2 = \cos\theta(x_t, V_p)$ represents the cosine of the principal angle between the current iterate $x_t$ and the space of globally optimal solutions. As $\|V_p^\top x_t\|_2 \le 1$, it is easy to see that $\psi(x_t) \ge 0$, and that $x_t$ converges to a unit vector in $\mathrm{span}(V_p)$ when $\psi(x_t)$ goes to $0$. In addition, we take the normalization retraction, i.e., $R(x, \xi) = \frac{x+\xi}{\|x+\xi\|_2}$. Thus, the update in (2) can be explicitly written as

$x_{t+1} = \frac{x_t - \alpha_{t+1}\nabla f(x_t)}{\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2},$   (4)

where the Riemannian gradient for Problem (1) is $\nabla f(x) = -(I - xx^\top)Ax$.

3.2 Main Results

Theorem 1. If the initial iterate is $x_0 = \frac{y}{\|y\|_2}$, where the entries of $y$ are i.i.d. standard normal samples, i.e., $y_i \sim \mathcal{N}(0, 1)$, then the Riemannian gradient descent solver for Problem (1) converges to a certain unit vector $v \in \mathrm{span}(V_p)$ at a global rate $O(\min\{(\frac{\lambda_1}{\Delta_p})^2\log\frac{1}{\epsilon},\ \frac{1}{\epsilon}\})$ with high probability. Specifically, for any $\epsilon \in (0, 1)$:

(I) the solver with constant step-sizes $\alpha < \frac{\Delta_p}{2\lambda_1^2(1+\alpha\Delta_p)}$ will converge, i.e., $\psi(x_T) < \epsilon$, after $T = O((\frac{\lambda_1}{\Delta_p})^2\log\frac{\psi(x_0)}{\epsilon})$ iterations, with probability at least $1 - \nu^p$, where $\nu = 0$ if $p = 1$ and $\nu \in (0, 1)$ otherwise;

(II) the solver with diminishing step-sizes $\alpha_t = \frac{c}{\tau + t}$ for sufficiently large constants $c, \tau > 0$ will converge, i.e., $\psi(x_T) < \epsilon$, after $T = O(\frac{1}{\epsilon})$ iterations, with probability at least $1 - \nu^p$, where $\nu = 0$ if $p = 1$ and $\nu \in (0, 1)$ otherwise.

As the convergence holds with high probability for any random initial iterate, it is global by convention. In contrast, the success probability given in [Shamir, 2016a] is $\Omega(\frac{1}{n})$. To prove the theorem, we need a few important lemmas, whose proofs are deferred to after that of Theorem 1.

Lemma 2. $\lambda_1 - x^\top A x \ge \Delta_p \sin^2\theta(x, V_p)$ for any unit vector $x$.

Lemma 3. $\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2^2 \le 1 + \alpha_{t+1}^2(\lambda_1^2 + \lambda_2^2)\sin^2\theta(x_t, V_p)$.

Lemma 4. $\frac{x}{1+x} \le \log(1+x) \le x$ for any $x > -1$, and $-\log(1-x) \le \frac{x}{1-x}$ for any $x \in (0, 1)$.

Lemma 5. $\|V_p^\top x_0\|_2 > 0$ with probability $1$ if $p = 1$, otherwise with probability at least $1 - \nu^p$, where $0 < \nu < 1$ is a constant about $V_p^\top y$.
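For reference, the following sketch (ours) implements the solver of Theorem 1(I): update (4) with the normalization retraction, a random unit-norm initial iterate, a constant step-size, and the potential (3) monitored against a dense eigensolver. The step-size below is an arbitrary small constant rather than the exact bound of the theorem.

```python
import numpy as np

def rgd_leading_eigenvector(A, alpha, T, rng):
    """Riemannian gradient descent for Problem (1) with the normalization
    retraction, i.e., update (4): x <- (x - a*grad) / ||x - a*grad||_2."""
    n = A.shape[0]
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                      # x_0 = y / ||y||_2 with y ~ N(0, I)
    iterates = [x]
    for _ in range(T):
        grad = -(A @ x - (x @ A @ x) * x)       # Riemannian gradient -(I - x x^T) A x
        x = x - alpha * grad
        x /= np.linalg.norm(x)                  # normalization retraction
        iterates.append(x)
    return iterates

def psi(x, V_p):
    """Potential (3): psi(x) = -2 log ||V_p^T x||_2."""
    return -2.0 * np.log(np.linalg.norm(V_p.T @ x))

rng = np.random.default_rng(2)
n = 300
B = rng.standard_normal((n, n))
A = (B + B.T) / 2
evals, evecs = np.linalg.eigh(A)
V_p = evecs[:, -1:]                             # lambda_1 is simple here, so p = 1

alpha = 0.5 / np.linalg.norm(A, 2)              # a small constant step-size (our choice)
iterates = rgd_leading_eigenvector(A, alpha, T=300, rng=rng)
print([round(psi(x, V_p), 4) for x in iterates[::60]])   # psi should decrease towards 0
```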

Proof of Theorem 1

Proof. (I) We start by expanding $\psi(x_{t+1})$ using (3)-(4):

$\psi(x_{t+1}) = -2\log\|V_p^\top x_{t+1}\|_2 = -2\log\|V_p^\top(x_t - \alpha_{t+1}\nabla f(x_t))\|_2 + 2\log\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2.$

To proceed, we need the full eigen-decomposition of $A$, i.e.,

$A = \lambda_1 V_p V_p^\top + V_p^\perp\,\mathrm{diag}(\lambda_{p+1}, \dots, \lambda_n)\,(V_p^\perp)^\top,$   (5)

where $V_p^\perp$ represents the orthogonal complement of $V_p$. Plugging this decomposition into $V_p^\top A x_t$, one then gets

$V_p^\top\nabla f(x_t) = -V_p^\top(I - x_t x_t^\top)A x_t = -V_p^\top A x_t + (x_t^\top A x_t)V_p^\top x_t = -(\lambda_1 - x_t^\top A x_t)V_p^\top x_t.$

We can now write

$\psi(x_{t+1}) = \psi(x_t) - 2\log(1 + \alpha_{t+1}(\lambda_1 - x_t^\top A x_t)) + 2\log\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2,$

where we have

$-2\log(1 + \alpha_{t+1}(\lambda_1 - x_t^\top A x_t)) \le -2\log(1 + \alpha_{t+1}\Delta_p\sin^2\theta(x_t, V_p))$  (by Lemma 2)
$\le -\frac{2\alpha_{t+1}\Delta_p\sin^2\theta(x_t, V_p)}{1 + \alpha_{t+1}\Delta_p\sin^2\theta(x_t, V_p)}$  (by Lemma 4)
$\le -\frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p}\sin^2\theta(x_t, V_p),$

and

$2\log\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2 \le \log(1 + \alpha_{t+1}^2(\lambda_1^2 + \lambda_2^2)\sin^2\theta(x_t, V_p))$  (by Lemma 3)
$\le \alpha_{t+1}^2(\lambda_1^2 + \lambda_2^2)\sin^2\theta(x_t, V_p)$  (by Lemma 4)
$\le 2\alpha_{t+1}^2\lambda_1^2\sin^2\theta(x_t, V_p).$

One thus arrives at

$\psi(x_{t+1}) \le \psi(x_t) - \frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p}\sin^2\theta(x_t, V_p) + 2\alpha_{t+1}^2\lambda_1^2\sin^2\theta(x_t, V_p)$   (6)
$= \psi(x_t) - \Big(\frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p} - 2\alpha_{t+1}^2\lambda_1^2\Big)\frac{\sin^2\theta(x_t, V_p)}{\psi(x_t)}\,\psi(x_t).$

Note that by Lemma 4,

$\frac{\sin^2\theta(x_t, V_p)}{\psi(x_t)} = \frac{\sin^2\theta(x_t, V_p)}{-\log(1 - \sin^2\theta(x_t, V_p))} \ge \frac{1}{1 - \log(1 - \sin^2\theta(x_t, V_p))} = \frac{1}{1 + \psi(x_t)}.$   (7)

If $\frac{\Delta_p}{1 + \alpha_{t+1}\Delta_p} - 2\alpha_{t+1}\lambda_1^2 > 0$, i.e., $\alpha_{t+1} < \frac{\Delta_p}{2\lambda_1^2(1 + \alpha_{t+1}\Delta_p)}$, then $\psi(x_{t+1}) < \psi(x_t)$ and

$\psi(x_{t+1}) \le \Big(1 - \frac{\alpha_{t+1}\Delta_p}{(1 + \alpha_{t+1}\Delta_p)(1 + \psi(x_t))}\Big)\psi(x_t)$   (8)
$\le \Big(1 - \frac{\alpha\Delta_p}{(1 + \alpha\Delta_p)(1 + \psi(x_0))}\Big)\psi(x_t)$   (9)

if $\alpha_{t+1} \equiv \alpha$. We thus have

$\psi(x_T) \le \Big(1 - \frac{\alpha\Delta_p}{(1 + \alpha\Delta_p)(1 + \psi(x_0))}\Big)^T\psi(x_0) \le \exp\Big\{-\frac{T\alpha\Delta_p}{(1 + \alpha\Delta_p)(1 + \psi(x_0))}\Big\}\psi(x_0) \triangleq \Xi.$

If $\Xi < \epsilon$, i.e.,

$\psi(x_0) < +\infty$   (10)

and

$T > \frac{(1 + \alpha\Delta_p)(1 + \psi(x_0))}{\alpha\Delta_p}\log\frac{\psi(x_0)}{\epsilon},$   (11)

we must have $\psi(x_T) < \epsilon$. Plugging $\alpha < \frac{\Delta_p}{2\lambda_1^2(1 + \alpha\Delta_p)}$ into $T$, we can write

$T = O\Big(\frac{1}{\alpha\Delta_p}\log\frac{\psi(x_0)}{\epsilon}\Big) = O\Big(\Big(\frac{\lambda_1}{\Delta_p}\Big)^2\log\frac{\psi(x_0)}{\epsilon}\Big).$   (12)

Finally, note that (10) is equivalent to $\|V_p^\top x_0\|_2 > 0$, which occurs with probability $1$ if $p = 1$, and otherwise with probability at least $1 - \nu^p$ for a certain constant $\nu \in (0, 1)$, by Lemma 5.

(II) We now restart from (6). Plugging (7) into (6), we can write

$\psi(x_{t+1}) \le \psi(x_t) - \frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p}\sin^2\theta(x_t, V_p) + 2\alpha_{t+1}^2\lambda_1^2\sin^2\theta(x_t, V_p)$
$\le \psi(x_t) - \frac{2\alpha_{t+1}\Delta_p}{1 + \alpha_{t+1}\Delta_p}\cdot\frac{\psi(x_t)}{1 + \psi(x_t)} + 2\lambda_1^2\alpha_{t+1}^2.$

Let $\alpha_{t+1} = \frac{c}{\tau + t}$, where $c, \tau > 0$ are constants. It is easy to see that for sufficiently small $\alpha_{t+1}$, equivalently sufficiently large $\tau$, we have $\psi(x_{t+1}) < \psi(x_t)$. Thus, one gets

$\psi(x_{t+1}) \le \Big(1 - \frac{2\alpha_{t+1}\Delta_p}{(1 + c\Delta_p/\tau)(1 + \psi(x_0))}\Big)\psi(x_t) + 2\lambda_1^2\alpha_{t+1}^2.$

Denoting $a = \frac{2c\Delta_p}{(1 + c\Delta_p/\tau)(1 + \psi(x_0))}$ and $b = 2c^2\lambda_1^2$, one can write

$\psi(x_{t+1}) \le \Big(1 - \frac{a}{\tau + t}\Big)\psi(x_t) + \frac{b}{(\tau + t)^2}.$

As long as $\psi(x_0) < +\infty$ and $a > 1$, by Lemma D.1 in [Balsubramani et al., 2013], the recursion of the above inequality yields

$\psi(x_t) \le \Big(\frac{\tau + 1}{t + \tau + 1}\Big)^a\psi(x_0) + \frac{2^{a+1}b}{a - 1}\cdot\frac{1}{t + \tau + 1}.$

We thus have $\psi(x_T) = O(\frac{1}{T})$, i.e., $T = O(\frac{1}{\epsilon})$. The proof is completed by noting that $\psi(x_0) < +\infty$ occurs with high probability similarly, and that $a > 1$ can be guaranteed by choosing sufficiently large constants $c, \tau$.

Remark 1. We make a few remarks on our proof. Equation (8) shows that the convergence is at least linear, because $\psi(x_{t+1}) < \psi(x_t)$ and thus the contraction factor keeps decreasing for constant step-sizes. In Equation (9), $\alpha_{t+1}$ is not necessarily constant. In fact, the results hold for any step-size scheme as long as $\alpha_{t+1} < \frac{\Delta_p}{2\lambda_1^2(1 + \alpha_{t+1}\Delta_p)}$. This provides theoretical support for flexibly choosing step-size schemes. Equation (11) shows that larger step-sizes in a safe range lead to faster convergence. Moreover, we can see from Equation (12) that the convergence actually depends on the relative generalized eigengap $\frac{\Delta_p}{\lambda_1}$. This seems to be explicitly shown for the first time for the considered solver. As mentioned in Section 1, the proof provides two general analysis frameworks. The generality lies in the fact that both rates hold for any value of the positive $\Delta_p$. The combination of the results from the two kinds of analysis comprehensively portrays the convergence behavior of the solver. For both kinds of analysis, the convergence has only a logarithmic dependence on the initial iterate via $\psi(x_0) = -2\log\|V_p^\top x_0\|_2$.

Proof of Lemma 2

Proof. Plugging (5) into $x^\top A x$, we get

$x^\top A x = \lambda_1\|V_p^\top x\|_2^2 + x^\top V_p^\perp\,\mathrm{diag}(\lambda_{p+1}, \dots, \lambda_n)\,(V_p^\perp)^\top x$
$\le \lambda_1\|V_p^\top x\|_2^2 + \lambda_{p+1}\|(V_p^\perp)^\top x\|_2^2$
$= \lambda_1\|V_p^\top x\|_2^2 + \lambda_{p+1}(1 - \|V_p^\top x\|_2^2)$
$= \lambda_1\cos^2\theta(x, V_p) + \lambda_{p+1}\sin^2\theta(x, V_p).$

Thus, one arrives at

$\lambda_1 - x^\top A x \ge \lambda_1 - \lambda_1\cos^2\theta(x, V_p) - \lambda_{p+1}\sin^2\theta(x, V_p) = (\lambda_1 - \lambda_{p+1})\sin^2\theta(x, V_p) = \Delta_p\sin^2\theta(x, V_p).$
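To illustrate part (II) of the proof, the following sketch (ours) runs the same update with diminishing step-sizes $\alpha_{t+1} = c/(\tau + t)$ on a synthetic matrix with a controlled spectrum; the constants $c$ and $\tau$ below are illustrative choices, not the values dictated by the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# A synthetic matrix with a controlled spectrum: lambda_1 = 1, lambda_2 = 0.5 (our choice).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = np.concatenate(([1.0, 0.5], np.linspace(0.4, 0.0, n - 2)))
A = (Q * spectrum) @ Q.T
V_p = Q[:, :1]                                    # leading eigenspace, p = 1 here

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
c, tau = 20.0, 100.0                              # illustrative constants, not tuned as the theorem requires
psis = []
for t in range(2000):
    alpha = c / (tau + t)                         # diminishing step-size alpha_{t+1} = c / (tau + t)
    grad = -(A @ x - (x @ A @ x) * x)             # Riemannian gradient -(I - x x^T) A x
    x = x - alpha * grad
    x /= np.linalg.norm(x)                        # normalization retraction, update (4)
    psis.append(-2.0 * np.log(np.linalg.norm(V_p.T @ x)))

# psi(x_t) should decrease toward 0; Theorem 1(II) bounds the number of
# iterations needed to reach psi < eps by O(1/eps) for such step-sizes.
print([f"{psis[t]:.2e}" for t in (0, 99, 499, 1999)])
```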

Proof of Lemma 3

Proof. Since $x_t^\top\nabla f(x_t) = -x_t^\top(I - x_t x_t^\top)A x_t = 0$, we have

$\|x_t - \alpha_{t+1}\nabla f(x_t)\|_2^2 = (x_t - \alpha_{t+1}\nabla f(x_t))^\top(x_t - \alpha_{t+1}\nabla f(x_t)) = 1 + \alpha_{t+1}^2\|\nabla f(x_t)\|_2^2.$

For any unit vector $v \in \mathrm{span}(V_p)$, $A$ admits the full eigen-decomposition

$A = \lambda_1 vv^\top + v^\perp\,\mathrm{diag}(\lambda_2, \dots, \lambda_n)\,(v^\perp)^\top,$

where $v^\perp$ represents the orthogonal complement of $v$. Plugging this into $\nabla f(x_t) = -(I - x_t x_t^\top)A x_t$ and expanding, we get

$\|\nabla f(x_t)\|_2^2 \le \lambda_1^2(v^\top x_t)^2\big(1 - (v^\top x_t)^2\big) + \lambda_2^2\big(1 - (v^\top x_t)^2\big) \le (\lambda_1^2 + \lambda_2^2)\big(1 - (v^\top x_t)^2\big).$

Since the above inequality holds for any unit vector $v \in \mathrm{span}(V_p)$, we have

$\|\nabla f(x_t)\|_2^2 \le (\lambda_1^2 + \lambda_2^2)\min_{v\in\mathrm{span}(V_p),\ \|v\|_2 = 1}\big(1 - (v^\top x_t)^2\big).$

By the definition of principal angles [Golub and Van Loan, 1996],

$\sin^2\theta(x_t, V_p) = 1 - \|V_p^\top x_t\|_2^2 = \min_{v\in\mathrm{span}(V_p),\ \|v\|_2 = 1}\big(1 - (v^\top x_t)^2\big).$

One thus has $\|\nabla f(x_t)\|_2^2 \le (\lambda_1^2 + \lambda_2^2)\sin^2\theta(x_t, V_p)$, which completes the proof.

Proof of Lemma 4

Proof. For any $x$, it holds that $1 + x \le e^x$. Then, for any $x > -1$, $\log(1+x) \le x$. If one lets $y = 1 + x$ in this inequality, then $\log y \le y - 1$. Further letting $y = \frac{1}{z}$ yields $-\log z \le \frac{1}{z} - 1 = \frac{1-z}{z}$. Last, setting $z = \frac{1}{1+x}$ gives $\log(1+x) \ge \frac{x}{1+x}$. Note that $\log(1+x) = \sum_{i\ge 0}\frac{(-1)^i x^{i+1}}{i+1}$ for $|x| < 1$. One then can write, for $x \in (0, 1)$, that

$-\log(1-x) = \sum_{i\ge 0}\frac{x^{i+1}}{i+1} \le \sum_{i\ge 0}x^{i+1} = \frac{x}{1-x},$

which completes the proof.

Proof of Lemma 5

Proof. Let $\sigma_{\min}(\cdot)$ and $\sigma_{\max}(\cdot)$ be the smallest and the largest singular value of a matrix, respectively. We then have

$\|V_p^\top x_0\|_2 = \frac{\|V_p^\top y\|_2}{\|y\|_2} \ge \frac{\sigma_{\min}(V_p^\top y)}{\sigma_{\max}(y)},$

where $\sigma_{\max}(y) > 0$ almost surely. Since the entries of $y$ are i.i.d. samples from $\mathcal{N}(0, 1)$ and $V_p$ is orthonormal, the entries of $V_p^\top y$ are i.i.d. standard normal samples as well. By Equation (3.2) in [Rudelson and Vershynin, 2010], $\sigma_{\min}(V_1^\top y) > 0$ almost surely (i.e., with probability 1). By Theorem 3.3 in [Rudelson and Vershynin, 2010], we have $\sigma_{\min}(V_p^\top y) > 0$ with probability at least $1 - \nu^p$, where $0 < \nu < 1$ is a constant about $V_p^\top y$.

4 Experiments

Theoretical properties of the considered solver were already observed and verified empirically in, e.g., [Wen and Yin, 2013; Liu et al., 2016], albeit without thorough theoretical support. For the sake of completeness, we reproduce a few experiments on both synthetic and real data here, with new tests on matrices of zero eigengap. All the ground-truth information, e.g., $v_1$ or $V_p$, is obtained by MATLAB's eigs function for the purpose of benchmarking. The advantage of using synthetic data mainly lies in the control of the eigengap. We set $n = 1000$ and follow [Shamir, 2015] to generate data using the full eigenvalue decomposition $A = U\Sigma U^\top$. $U$ is an orthogonal matrix and is set the same way as $x_0$. $\Sigma = \mathrm{diag}(\Sigma_1, \Sigma_2)$, where $\Sigma_2 = \mathrm{diag}(\frac{g_1}{n}, \dots, \frac{g_{n-r}}{n})$ with $g_i \sim \mathcal{N}(0, 1)$ and $r$ being the order of $\Sigma_1$. In addition, the following two settings are considered:

$p = 1$: $\Sigma_1 = \mathrm{diag}(1,\ 1-\eta,\ 1-1.1\eta,\ 1-1.2\eta,\ 1-1.3\eta,\ 1-1.4\eta)$, and then $\Delta_p = \Delta_1 = \eta > 0$, where $\eta \in \{0.2, 0.5, 0.8\}$.

$p = 3$: $\Sigma_1 = \mathrm{diag}(1, 1, 1)$, and then $\Delta_p = \Delta_3 = 1 - \max_i\frac{g_i}{n} > 0$.

For the conventionally considered case that $\Delta_1 > 0$, two types of convergence curves, in terms of the relative function error $\frac{f(x_t) - f(v_1)}{|f(v_1)|}$ and the commonly used potential function $\sin^2\theta(x_t, v_1)$, respectively, are shown in Figure 2 for different eigengap values. Global linear convergence of the solver is observed, and a larger eigengap yields faster convergence, which agrees with the theory. The convergence trends remain largely consistent between the two types of curves, which holds in the rest of the experiments as well. For the case that $\Delta_1 = 0$ and $\alpha_{t+1} = \frac{c}{\tau + t}$ (see Remark 1), different pairs of step-size parameters are tested in Figure 1. As described by Theorem 1, the convergence reported in Figures 1(a)-(b) is globally linear, rather than sub-linear as suggested by [Xu and Gao, 2018], and larger step-sizes result in faster convergence as well.
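For readers who want to reproduce the synthetic setting, here is a generation sketch following the description above; the exact diagonal of $\Sigma_1$ in the $p = 1$ case below is our own choice following that description and may differ in detail from the paper's.

```python
import numpy as np

def synthetic_matrix(n=1000, setting="p1", eta=0.5, rng=None):
    """Synthetic data in the spirit of Section 4: A = U Sigma U^T with U a random
    orthogonal matrix, Sigma = diag(Sigma_1, Sigma_2), and Sigma_2 = diag(g_i / n)
    with g_i ~ N(0, 1). The Sigma_1 blocks below follow our reading of the setup."""
    rng = np.random.default_rng() if rng is None else rng
    if setting == "p1":   # eigengap Delta_1 = eta > 0, multiplicity p = 1
        sigma1 = np.array([1.0, 1 - eta, 1 - 1.1 * eta, 1 - 1.2 * eta, 1 - 1.3 * eta, 1 - 1.4 * eta])
        p = 1
    else:                 # "p3": Delta_1 = 0 but Delta_p > 0, multiplicity p = 3
        sigma1 = np.array([1.0, 1.0, 1.0])
        p = 3
    r = len(sigma1)
    sigma2 = rng.standard_normal(n - r) / n       # small bulk eigenvalues g_i / n
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = (U * np.concatenate((sigma1, sigma2))) @ U.T
    return A, U[:, :p]                            # also return the leading eigenspace V_p

A, V_p = synthetic_matrix(setting="p3", rng=np.random.default_rng(4))
print(np.linalg.eigvalsh(A)[-4:])   # three (numerically) equal leading eigenvalues: Delta_1 = 0, Delta_3 > 0
```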
As it is unknown which leading eigenvector $x_t$ will converge to in this case, it is necessary to change the potential from the commonly used $\sin^2\theta(x_t, v_1)$ to $\sin^2\theta(x_t, V_p)$. As demonstrated in Figure 1(c), $\sin^2\theta(x_t, v_1)$ fails to converge, which implies that $x_t$ has converged to another leading eigenvector in $\mathrm{span}(V_p)$.
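The two potentials compared in Figures 1(b)-(c) are straightforward to compute; a small sketch of ours:

```python
import numpy as np

def sin2_theta(x, V):
    """sin^2 of the principal angle between a unit vector x and the column space
    of a matrix V with orthonormal columns: 1 - ||V^T x||_2^2."""
    return 1.0 - np.linalg.norm(V.T @ x) ** 2

# Toy illustration: if x converges to a leading eigenvector other than v_1,
# sin2_theta(x, V_p) still vanishes while sin2_theta(x, v_1) does not.
V_p = np.eye(4)[:, :2]        # pretend the leading eigenspace is span(e_1, e_2)
v_1 = V_p[:, :1]
x = V_p[:, 1]                 # x equals e_2, another unit vector in span(V_p)
print(sin2_theta(x, V_p), sin2_theta(x, v_1))   # 0.0 and 1.0
```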

[Figure 1: Synthetic data, $\Delta_1 = 0$, step-sizes $\alpha_{t+1} = c/(\tau+t)$ with $c \in \{2.87, 4.48, 6.09\}$. (a) Relative function error. (b) Potential $\sin^2\theta(x_t, V_p)$. (c) Improper potential $\sin^2\theta(x_t, v_1)$.]

[Figure 2: Synthetic data, $\Delta_1 > 0$, for eigengap values $\eta = 0.20, 0.50, 0.80$. (a) Relative function error. (b) Potential $\sin^2\theta(x_t, v_1)$.]

We run two implementations of the solver on a real symmetric matrix $A$, named Schenk, which is of size 10,728 x 10,728 with 85,000 nonzero entries. Each implementation is characterized by the combination of the retraction and the step-size scheme used. Curvilinear search, proposed in [Wen and Yin, 2013], uses the Cayley transform with the Barzilai-Borwein step-size. The other implementation uses normalization with a constant step-size. They were fed with the same random initial iterate. The results are reported in Figure 3, which reflects the global linear convergence as well, and meanwhile shows that different step-size schemes matter in practice.

[Figure 3: Real data (Schenk): Cayley + Barzilai-Borwein versus normalization + constant step-size. (a) Relative function error. (b) Potential $\sin^2\theta(x_t, v_1)$.]

5 Conclusion

We presented a simple yet comprehensive convergence analysis of the Riemannian gradient descent solver for the leading eigenvector computation. Two kinds of general analysis jointly established the true global rate of convergence to one of the leading eigenvectors. The generalized eigengap $\Delta_p$ eliminates the limitation of the commonly considered eigengap $\Delta_1$. When $\Delta_p$ is large, the convergence is linear, as described by the first general analysis. If it is as small as $\epsilon$, the second general analysis shows that the convergence is sub-linear at $O(\frac{1}{\epsilon})$ rather than $O(\frac{1}{\epsilon^2})$. It was also shown that the convergence depends only logarithmically on the initial iterate. The key to these breakthroughs made for the considered solver is to use for analysis the log principal angle between the iterates and the space of the leading eigenvectors. Along this line of work, there are a few interesting future research directions. For example, it is unknown whether the results can be extended to $k > 1$ for the top-$k$ eigenspace computation without deflation.

In particular, it would be more attractive to practitioners if the analysis could be translated to the case of the stochastic gradient descent solver, which is well worth further investigation. Also, although the rate is optimal for the considered solver, it would be of great interest to develop faster solvers of this type that match the optimal rate for the problem, e.g., that of Lanczos algorithms.

Acknowledgements

This research is supported in part by funding from King Abdullah University of Science and Technology (KAUST).

References

[Absil et al., 2008] P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[Arora et al., 2012] Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012.
[Arora et al., 2013] Raman Arora, Andrew Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. In NIPS, 2013.
[Balcan et al., 2016] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In COLT, 2016.
[Balsubramani et al., 2013] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In NIPS. Curran Associates, Inc., 2013.
[Bonnabel, 2013] Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9), 2013.
[Edelman et al., 1999] Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2), April 1999.
[Fan et al., 2018] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, and Ziwei Zhu. Principal component analysis for big data. arXiv preprint, 2018.
[Garber et al., 2016] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In ICML, 2016.
[Golub and Van Loan, 1996] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.
[Hardt and Price, 2014] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In NIPS, 2014.
[Liu et al., 2016] Huikang Liu, Weijie Wu, and Anthony Man-Cho So. Quadratic optimization with orthogonality constraints: Explicit Łojasiewicz exponent and linear convergence of line-search methods. In ICML, 2016.
[Musco and Musco, 2015] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In NIPS, 2015.
[Ng et al., 2002] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS. MIT Press, 2002.
[Pitaval et al., 2015] Renaud-Alexandre Pitaval, Wei Dai, and Olav Tirkkonen. Convergence of gradient descent for low-rank matrix approximation. IEEE Transactions on Information Theory, 61(8), 2015.
[Rudelson and Vershynin, 2010] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. arXiv preprint, 2010.
[Sa et al., 2015] Christopher De Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some non-convex matrix problems. In ICML, 2015.
[Shamir, 2015] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In ICML, 2015.
[Shamir, 2016a] Ohad Shamir. Convergence of stochastic gradient descent for PCA. In ICML, 2016.
[Shamir, 2016b] Ohad Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In ICML, 2016.
[Sra et al., 2011] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. The MIT Press, 2011.
[Wen and Yin, 2013] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2), 2013.
[Xu and Gao, 2018] Zhiqiang Xu and Xin Gao. On truly block eigensolvers via Riemannian optimization. In AISTATS, 2018.
[Xu and Ke, 2016a] Zhiqiang Xu and Yiping Ke. Effective and efficient spectral clustering on text and link data. In CIKM, 2016.
[Xu and Ke, 2016b] Zhiqiang Xu and Yiping Ke. Stochastic variance reduced Riemannian eigensolver. CoRR, 2016.
[Xu et al., 2017] Zhiqiang Xu, Yiping Ke, and Xin Gao. A fast algorithm for matrix eigen-decomposition. In UAI, Sydney, Australia, August 2017.
[Zhang et al., 2016] Hongyi Zhang, Sashank J. Reddi, and Suvrit Sra. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.


More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

COR-OPT Seminar Reading List Sp 18

COR-OPT Seminar Reading List Sp 18 COR-OPT Seminar Reading List Sp 18 Damek Davis January 28, 2018 References [1] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank Solutions of Linear Matrix Equations via Procrustes

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Normalized power iterations for the computation of SVD

Normalized power iterations for the computation of SVD Normalized power iterations for the computation of SVD Per-Gunnar Martinsson Department of Applied Mathematics University of Colorado Boulder, Co. Per-gunnar.Martinsson@Colorado.edu Arthur Szlam Courant

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR

THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR WEN LI AND MICHAEL K. NG Abstract. In this paper, we study the perturbation bound for the spectral radius of an m th - order n-dimensional

More information

Lecture 7: Weak Duality

Lecture 7: Weak Duality EE 227A: Conve Optimization and Applications February 7, 2012 Lecture 7: Weak Duality Lecturer: Laurent El Ghaoui 7.1 Lagrange Dual problem 7.1.1 Primal problem In this section, we consider a possibly

More information

Sparsification by Effective Resistance Sampling

Sparsification by Effective Resistance Sampling Spectral raph Theory Lecture 17 Sparsification by Effective Resistance Sampling Daniel A. Spielman November 2, 2015 Disclaimer These notes are not necessarily an accurate representation of what happened

More information

Recursive Generalized Eigendecomposition for Independent Component Analysis

Recursive Generalized Eigendecomposition for Independent Component Analysis Recursive Generalized Eigendecomposition for Independent Component Analysis Umut Ozertem 1, Deniz Erdogmus 1,, ian Lan 1 CSEE Department, OGI, Oregon Health & Science University, Portland, OR, USA. {ozertemu,deniz}@csee.ogi.edu

More information

Computation of the Joint Spectral Radius with Optimization Techniques

Computation of the Joint Spectral Radius with Optimization Techniques Computation of the Joint Spectral Radius with Optimization Techniques Amir Ali Ahmadi Goldstine Fellow, IBM Watson Research Center Joint work with: Raphaël Jungers (UC Louvain) Pablo A. Parrilo (MIT) R.

More information