A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir

A Stochastic PCA Algorithm with an Exponential Convergence Rate
Ohad Shamir, Weizmann Institute of Science
NIPS Optimization Workshop, December 2014

Principal Component Analysis (PCA)
Input: $x_1, \dots, x_n \in \mathbb{R}^d$
Goal: Find $k$ directions with most variance:
$\max_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \frac{1}{n}\sum_{i=1}^n \|U^\top x_i\|^2$
For $k = 1$: Find the leading eigenvector of the covariance matrix:
$\max_{w \in \mathbb{R}^d : \|w\| = 1} \; w^\top \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^\top\right) w$

Existing Approaches
$\max_{w \in \mathbb{R}^d : \|w\| = 1} \; w^\top \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^\top\right) w$
Regime: $n, d$ large, non-sparse matrix
Approach 1: Eigendecomposition
Compute the leading eigenvector of $\frac{1}{n}\sum_{i=1}^n x_i x_i^\top$ exactly (e.g. via QR decomposition)
Runtime: $O(d^3)$
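As a concrete baseline, here is a minimal NumPy sketch of the exact approach; the synthetic data and all names are illustrative, not from the talk. Later snippets reuse `X`, `v1`, and `rng` defined here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))        # rows are the data points x_1, ..., x_n

A = (X.T @ X) / n                      # covariance matrix (1/n) sum_i x_i x_i^T, costs O(n d^2)
eigvals, eigvecs = np.linalg.eigh(A)   # exact eigendecomposition, O(d^3)
v1 = eigvecs[:, -1]                    # leading eigenvector (eigh returns ascending eigenvalues)
print("leading eigenvalue:", eigvals[-1])
```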

Existing Approaches
Approach 2: Power Iterations
Initialize $w_1$ randomly on the unit sphere
For $t = 1, 2, \dots$:
  $w_{t+1} := \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^\top\right) w_t = \frac{1}{n}\sum_{i=1}^n \langle w_t, x_i \rangle x_i$
  $w_{t+1} := w_{t+1} / \|w_{t+1}\|$
$O\left(\frac{1}{\lambda}\log\left(\frac{1}{\epsilon}\right)\right)$ iterations for $\epsilon$-optimality ($\lambda$: eigengap)
$O(nd)$ runtime per iteration
Overall runtime $O\left(\frac{nd}{\lambda}\log\left(\frac{d}{\epsilon}\right)\right)$

Approach 2.5: Lanczos Iterations
More complex algorithm, but roughly similar iteration runtime and only $O\left(\frac{1}{\sqrt{\lambda}}\log\left(\frac{1}{\epsilon}\right)\right)$ iterations [Kuczyński and Woźniakowski 1989]
Overall runtime $O\left(\frac{nd}{\sqrt{\lambda}}\log\left(\frac{d}{\epsilon}\right)\right)$
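A minimal sketch of the power-iteration loop above, assuming `X`, `v1`, and `rng` from the previous snippet; the fixed iteration count stands in for a proper stopping rule.

```python
def power_iteration(X, num_iters=100, seed=1):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                 # random initialization on the unit sphere
    for _ in range(num_iters):
        w = X.T @ (X @ w) / n              # (1/n) sum_i <w, x_i> x_i, one O(nd) pass
        w /= np.linalg.norm(w)
    return w

w_power = power_iteration(X)
print("alignment with exact leading eigenvector:", abs(w_power @ v1))
```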

Existing Approaches
Approach 3: Stochastic/Incremental Algorithms
Example (Oja's algorithm)
Initialize $w_1$ randomly on the unit sphere
For $t = 1, 2, \dots$:
  Pick $i_t \in \{1, \dots, n\}$ (randomly or otherwise)
  $w_{t+1} := w_t + \eta_t x_{i_t} x_{i_t}^\top w_t$
  $w_{t+1} := w_{t+1} / \|w_{t+1}\|$
Also Krasulina 1969; Arora, Cotter, Livescu, Srebro 2012; Mitliagkas, Caramanis, Jain 2013; De Sa, Olukotun, Ré 2014...
$O(d)$ runtime per iteration
Iteration bounds:
  Balsubramani, Dasgupta, Freund 2013: $\tilde{O}\left(\frac{d}{\lambda^2}\left(\frac{1}{\epsilon} + d\right)\right)$
  De Sa, Olukotun, Ré 2014: For a different SGD method, $\tilde{O}\left(\frac{d}{\lambda^2 \epsilon}\right)$
Runtime: $\tilde{O}\left(\frac{d^2}{\lambda^2 \epsilon}\right)$
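A minimal sketch of Oja's algorithm as written above, with a $1/t$ step-size schedule as one illustrative choice (not prescribed by the talk); it reuses `X` and `v1` from the earlier snippets.

```python
def oja(X, num_iters=20000, eta0=1.0, seed=2):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for t in range(1, num_iters + 1):
        x = X[rng.integers(n)]             # pick i_t uniformly at random
        w = w + (eta0 / t) * (x @ w) * x   # w_{t+1} = w_t + eta_t * x_{i_t} x_{i_t}^T w_t
        w /= np.linalg.norm(w)             # normalize back to the unit sphere, O(d) per iteration
    return w

w_oja = oja(X)
print("alignment with exact leading eigenvector:", abs(w_oja @ v1))
```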

Existing Approaches
Up to constants/log-factors:

Algorithm        Time per iter.    # iter.                      Runtime
Exact            -                 -                            $d^3$
Power/Lanczos    $nd$              $1/\lambda^p$                $nd/\lambda^p$
Incremental      $d$               $d/(\lambda^2 \epsilon)$     $d^2/(\lambda^2 \epsilon)$

(here $p = 1$ for power iterations and $p = 1/2$ for Lanczos)

Main Question: Can we get the best of both worlds? $O(d)$ time per iteration and fast convergence (logarithmic dependence on $\epsilon$)?

Convex Optimization to the Rescue?
Our problem is equivalent to:
$\min_{w : \|w\| = 1} \; \frac{1}{n}\sum_{i=1}^n \left(-\langle w, x_i \rangle^2\right)$
Much recent progress in solving strongly convex + smooth problems with finite-sum structure:
$\min_{w \in \mathcal{W}} \; \frac{1}{n}\sum_{i=1}^n f_i(w)$
Stochastic algorithms with $O(d)$ runtime per iteration and exponential convergence [Le Roux, Schmidt, Bach 2012; Shalev-Shwartz and Zhang 2012; Johnson and Zhang 2013; Zhang, Mahdavi, Jin 2013; Konečný and Richtárik 2013; Xiao and Zhang 2014; Zhang and Xiao 2014...]
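For intuition on these finite-sum methods, here is a minimal SVRG-style sketch for a strongly convex, smooth example (regularized least squares); the problem, step size, and epoch length are illustrative assumptions, and the snippet reuses `X`, `v1`, and `rng` from above.

```python
def svrg_ridge(X, y, lam=0.1, eta=0.01, epochs=20, m=None, seed=3):
    """Minimize (1/n) sum_i [ 0.5*(x_i^T w - y_i)^2 + 0.5*lam*||w||^2 ] with SVRG."""
    n, d = X.shape
    m = m or n
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i] + lam * w        # gradient of the i-th term
    for _ in range(epochs):
        w_ref = w.copy()
        full_grad = X.T @ (X @ w_ref - y) / n + lam * w_ref         # full gradient at reference point
        for _ in range(m):
            i = rng.integers(n)
            w -= eta * (grad_i(w, i) - grad_i(w_ref, i) + full_grad)  # variance-reduced step
    return w

y = X @ v1 + 0.1 * rng.standard_normal(X.shape[0])                  # synthetic regression targets
w_svrg = svrg_ridge(X, y)
```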

Convex Optimization to the Rescue?
Unfortunately:
$\min_{w : \|w\| = 1} \; \frac{1}{n}\sum_{i=1}^n \left(-\langle w, x_i \rangle^2\right)$
The function is not strongly convex, or even convex (in fact, it is concave everywhere)
It has more than one global optimum, plateaus...
Existing results don't work as-is
But: maybe we can borrow some ideas...

Algorithm
Oja iteration for $\min_{w : \|w\| = 1} \frac{1}{n}\sum_{i=1}^n \left(-\langle w, x_i \rangle^2\right)$:
  Choose $i_t \in \{1, \dots, n\}$ at random
  $w_{t+1} = w_t + \eta_t \langle w_t, x_{i_t} \rangle x_{i_t}$
  $w_{t+1} := w_{t+1} / \|w_{t+1}\|$
Essentially projected stochastic gradient descent

Algorithm
Letting $A = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$, the update step is
$w_{t+1} = w_t + \eta_t x_{i_t} x_{i_t}^\top w_t = w_t + \underbrace{\eta_t A w_t}_{\text{power/gradient step}} + \underbrace{\eta_t \left(x_{i_t} x_{i_t}^\top - A\right) w_t}_{\text{zero-mean noise}}$

Main idea: Replace this by
$w_{t+1} = w_t + \underbrace{\eta A w_t}_{\text{power/gradient step}} + \underbrace{\eta \left(x_{i_t} x_{i_t}^\top - A\right)(w_t - \tilde{u})}_{\text{zero-mean noise}}$
where $\tilde{u}$ is close to $w_t$ (similar to SVRG of Johnson and Zhang, 2013)

Algorithm
VR-PCA
Parameters: step size $\eta$, epoch length $m$
Input: data set $\{x_i\}_{i=1}^n$, initial unit vector $\tilde{w}_0$
For $s = 1, 2, \dots$:
  $\tilde{u} = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top \tilde{w}_{s-1}$
  $w_0 = \tilde{w}_{s-1}$
  For $t = 1, 2, \dots, m$:
    Pick $i_t \in \{1, \dots, n\}$ uniformly at random
    $w'_t = w_{t-1} + \eta \left(x_{i_t} x_{i_t}^\top (w_{t-1} - \tilde{w}_{s-1}) + \tilde{u}\right)$
    $w_t = w'_t / \|w'_t\|$
  $\tilde{w}_s = w_m$

To get $k > 1$ directions: either repeat, or perform orthogonal-iteration-like updates:
  Replace all vectors by $d \times k$ matrices
  Replace the normalization step by an orthogonalization step
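A minimal NumPy sketch of the $k = 1$ pseudocode above; the parameter defaults are placeholders rather than tuned values from the talk, and `X` is assumed from the earlier snippets.

```python
def vr_pca(X, eta, epochs=20, m=None, seed=4):
    n, d = X.shape
    m = m or n                                  # epoch length; m = n means one effective pass
    rng = np.random.default_rng(seed)
    w_tilde = rng.standard_normal(d)
    w_tilde /= np.linalg.norm(w_tilde)          # initial unit vector w~_0
    for _ in range(epochs):
        u_tilde = X.T @ (X @ w_tilde) / n       # u~ = (1/n) sum_i x_i x_i^T w~_{s-1}, one full pass
        w = w_tilde.copy()
        for _ in range(m):
            x = X[rng.integers(n)]              # pick i_t uniformly at random
            w = w + eta * ((x @ (w - w_tilde)) * x + u_tilde)   # variance-reduced update
            w /= np.linalg.norm(w)              # normalize back to the unit sphere
        w_tilde = w                             # w~_s = w_m
    return w_tilde
```

The inner update matches the decomposition on the previous slide: it equals $\eta\, x_{i_t} x_{i_t}^\top (w_{t-1} - \tilde{w}_{s-1}) + \eta A \tilde{w}_{s-1}$, so its expectation over $i_t$ is the exact power/gradient step $\eta A w_{t-1}$. A usage example with the experiment settings appears after the Experiments slide below.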

Analysis
Theorem. Suppose $\max_i \|x_i\|^2 \le r$ and $A$ has leading eigenvector $v_1$ with eigengap $\lambda$. Assume $\langle \tilde{w}_0, v_1 \rangle \ge \frac{1}{\sqrt{2}}$. Then for any $\delta, \epsilon \in (0, 1)$, if
$\eta \le \frac{c_1 \delta^2}{r^2}\,\lambda, \qquad m \ge \frac{c_2 \log(2/\delta)}{\eta\lambda}, \qquad m\eta^2 r^2 + r\sqrt{m\eta^2 \log(2/\delta)} \le c_3$
(where $c_1, c_2, c_3$ are constants) and we run $T = \left\lceil \frac{\log(1/\epsilon)}{\log(2/\delta)} \right\rceil$ epochs, then
$\Pr\left( \langle \tilde{w}_T, v_1 \rangle^2 \ge 1 - \epsilon \right) \ge 1 - 2\log(1/\epsilon)\,\delta$

Corollary. Picking $\eta, m$ appropriately, $\epsilon$-convergence holds w.h.p. in runtime $O\left(d\left(n + \frac{1}{\lambda^2}\right)\log\left(\frac{1}{\epsilon}\right)\right)$:
  Exponential convergence with $O(d)$-time iterations
  Runtime proportional to the number of examples plus $1/\lambda^2$ (inverse squared eigengap)
  Proportional to a single data pass if $\lambda \ge 1/\sqrt{n}$

Proof Idea
Track the decay of $F(w_t) = 1 - \langle w_t, v_1 \rangle^2$
Key Lemma. Assuming $\eta = \alpha\lambda$ and $F(w_t) \le 3/4$,
$\mathbb{E}[F(w_{t+1}) \mid w_t] \le \left(1 - \Theta(\alpha\lambda^2)\right) F(w_t) + O\left(\alpha^2\lambda^2 F(\tilde{w}_{s-1})\right)$

Proof Idea
Assume $\eta = \alpha\lambda$ ($\alpha \ll 1$)
Using martingale arguments: w.h.p., we never reach the flat region within $m \le O\left(\frac{1}{\alpha^2\lambda^2}\right)$ iterations
For all $t \le m$:
$\mathbb{E}[F(w_{t+1}) \mid w_t] \le \left(1 - \Theta(\alpha\lambda^2)\right) F(w_t) + O\left(\alpha^2\lambda^2 F(\tilde{w}_{s-1})\right)$
Carefully unwinding the recursion, and using $w_0 = \tilde{w}_{s-1}$:
$\mathbb{E}[F(w_m) \mid w_0] \le \left(\left(1 - \Theta(\alpha\lambda^2)\right)^m + O(\alpha)\right) F(\tilde{w}_{s-1})$
If $m \ge \Omega\left(\frac{1}{\alpha\lambda^2}\right)$, then $F(w_m)$ is smaller than $F(\tilde{w}_{s-1})$ by a constant factor
Overall, if $\frac{1}{\alpha\lambda^2} \ll m \ll \frac{1}{\alpha^2\lambda^2}$, every epoch shrinks $F$ by a constant factor
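To spell out the unwinding step (a sketch, using only the per-iteration bound above): write $\theta = \Theta(\alpha\lambda^2)$ for the contraction factor and note that $F(\tilde{w}_{s-1})$ is fixed within an epoch. Iterating the bound $m$ times and summing the geometric series gives
$\mathbb{E}[F(w_m) \mid w_0] \le (1-\theta)^m F(w_0) + O(\alpha^2\lambda^2)\, F(\tilde{w}_{s-1}) \sum_{j=0}^{m-1}(1-\theta)^j \le \left((1-\theta)^m + O\left(\tfrac{\alpha^2\lambda^2}{\theta}\right)\right) F(\tilde{w}_{s-1}) = \left((1-\Theta(\alpha\lambda^2))^m + O(\alpha)\right) F(\tilde{w}_{s-1}),$
using $w_0 = \tilde{w}_{s-1}$ and $\sum_{j \ge 0}(1-\theta)^j = 1/\theta$.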

Experiments
Ran VR-PCA with:
  Random initialization
  $m = n$ (one epoch = one pass over the data)
  $\eta = 0.05/\sqrt{n}$ (based on theory)
Compared to Oja's algorithm with hand-tuned step sizes
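A sketch of how one might reproduce this setup with the snippets above; the number of epochs, the error measure $F(w) = 1 - \langle w, v_1 \rangle^2$, and the step-size grid for Oja's algorithm are assumptions.

```python
n = X.shape[0]
w_vr = vr_pca(X, eta=0.05 / np.sqrt(n), epochs=20, m=n)        # settings from this slide
print("VR-PCA error:", 1 - (w_vr @ v1) ** 2)                   # F(w) = 1 - <w, v_1>^2

# Oja's algorithm with a few hand-tuned step-size schedules eta_t = c / t,
# using the same total number of stochastic iterations (20 passes over the data).
for c in (1.0, 3.0, 9.0, 27.0):
    w = oja(X, num_iters=20 * n, eta0=c)
    print(f"Oja (eta_t = {c}/t) error:", 1 - (w @ v1) ** 2)
```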

Experiments 0 1 2 3 4 λ=0.258 5 0 20 40 60 0 1 2 3 4 λ=0.016 5 0 20 40 60 0 1 2 3 4 λ=0.163 5 0 20 40 60 0 1 2 3 4 λ=0.004 5 0 20 40 60 0 1 2 3 4 λ=0.078 5 0 20 40 60 VR PCA η t = 1/t η t = 3/t η t = 9/t η t = 27/t Ohad Shamir Stochastic PCA with Exponential Convergence 16/19

Preliminary Experiments
[Figure: legend entries: VR-PCA; $\eta_t = 3$; $\eta_t = 9$; $\eta_t = 27$; $\eta_t = 81$; $\eta_t = 243$; Hybrid.]

Work-in-progress and Open Questions
Runtime is $O\left(d\left(n + \frac{1}{\lambda^2}\right)\log\left(\frac{1}{\epsilon}\right)\right)$; can we get $O\left(d\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$ or better, analogous to the convex case? (Experimentally, the required parameters do seem somewhat different.)
Generalizing the analysis to PCA with several directions
Theoretically optimal parameters without knowing $\lambda$
Other fast-stochastic approaches
Other non-convex problems

More details: arXiv:1409.2848
Thanks!