A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir

A Stochastic PCA Algorithm with an Exponential Convergence Rate
Ohad Shamir, Weizmann Institute of Science
NIPS Optimization Workshop, December 2014

Principal Component Analysis (PCA)
Input: $x_1, \dots, x_n \in \mathbb{R}^d$
Goal: Find $k$ directions with most variance:
$\max_{U \in \mathbb{R}^{d \times k} : U^\top U = I} \; \frac{1}{n}\sum_{i=1}^n \|U^\top x_i\|^2$
For $k = 1$: Find the leading eigenvector of the covariance matrix:
$\max_{w \in \mathbb{R}^d : \|w\| = 1} \; w^\top \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^\top\right) w$

Existing Approaches
$\max_{w \in \mathbb{R}^d : \|w\| = 1} \; w^\top \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^\top\right) w$
Regime: $n, d$ large, non-sparse matrix
Approach 1: Eigendecomposition
Compute the leading eigenvector of $\frac{1}{n}\sum_{i=1}^n x_i x_i^\top$ exactly (e.g. via QR decomposition)
Runtime: $O(d^3)$
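As a concrete baseline, here is a minimal NumPy sketch of the exact approach; the synthetic data and all names are illustrative, not from the talk. Later snippets reuse `X`, `v1`, and `rng` defined here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))        # rows are the data points x_1, ..., x_n

A = (X.T @ X) / n                      # covariance matrix (1/n) sum_i x_i x_i^T, costs O(n d^2)
eigvals, eigvecs = np.linalg.eigh(A)   # exact eigendecomposition, O(d^3)
v1 = eigvecs[:, -1]                    # leading eigenvector (eigh returns ascending eigenvalues)
print("leading eigenvalue:", eigvals[-1])
```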

Existing Approaches
Approach 2: Power Iterations
Initialize $w_1$ randomly on the unit sphere
For $t = 1, 2, \dots$:
  $w_{t+1} := \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^\top\right) w_t = \frac{1}{n}\sum_{i=1}^n \langle w_t, x_i \rangle x_i$
  $w_{t+1} := w_{t+1} / \|w_{t+1}\|$
$O\left(\frac{1}{\lambda}\log\left(\frac{1}{\epsilon}\right)\right)$ iterations for $\epsilon$-optimality ($\lambda$: eigengap)
$O(nd)$ runtime per iteration
Overall runtime $O\left(\frac{nd}{\lambda}\log\left(\frac{d}{\epsilon}\right)\right)$

Approach 2.5: Lanczos Iterations
More complex algorithm, but roughly similar iteration runtime and only $O\left(\frac{1}{\sqrt{\lambda}}\log\left(\frac{1}{\epsilon}\right)\right)$ iterations [Kuczyński and Woźniakowski 1989]
Overall runtime $O\left(\frac{nd}{\sqrt{\lambda}}\log\left(\frac{d}{\epsilon}\right)\right)$
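A minimal sketch of the power-iteration loop above, assuming `X`, `v1`, and `rng` from the previous snippet; the fixed iteration count stands in for a proper stopping rule.

```python
def power_iteration(X, num_iters=100, seed=1):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                 # random initialization on the unit sphere
    for _ in range(num_iters):
        w = X.T @ (X @ w) / n              # (1/n) sum_i <w, x_i> x_i, one O(nd) pass
        w /= np.linalg.norm(w)
    return w

w_power = power_iteration(X)
print("alignment with exact leading eigenvector:", abs(w_power @ v1))
```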

Existing Approaches
Approach 3: Stochastic/Incremental Algorithms
Example (Oja's algorithm)
Initialize $w_1$ randomly on the unit sphere
For $t = 1, 2, \dots$:
  Pick $i_t \in \{1, \dots, n\}$ (randomly or otherwise)
  $w_{t+1} := w_t + \eta_t x_{i_t} x_{i_t}^\top w_t$
  $w_{t+1} := w_{t+1} / \|w_{t+1}\|$
Also Krasulina 1969; Arora, Cotter, Livescu, Srebro 2012; Mitliagkas, Caramanis, Jain 2013; De Sa, Olukotun, Ré 2014...
$O(d)$ runtime per iteration
Iteration bounds:
  Balsubramani, Dasgupta, Freund 2013: $\tilde{O}\left(\frac{d}{\lambda^2}\left(\frac{1}{\epsilon} + d\right)\right)$
  De Sa, Olukotun, Ré 2014: For a different SGD method, $\tilde{O}\left(\frac{d}{\lambda^2 \epsilon}\right)$
Runtime: $\tilde{O}\left(\frac{d^2}{\lambda^2 \epsilon}\right)$
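A minimal sketch of Oja's algorithm as written above, with a $1/t$ step-size schedule as one illustrative choice (not prescribed by the talk); it reuses `X` and `v1` from the earlier snippets.

```python
def oja(X, num_iters=20000, eta0=1.0, seed=2):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for t in range(1, num_iters + 1):
        x = X[rng.integers(n)]             # pick i_t uniformly at random
        w = w + (eta0 / t) * (x @ w) * x   # w_{t+1} = w_t + eta_t * x_{i_t} x_{i_t}^T w_t
        w /= np.linalg.norm(w)             # normalize back to the unit sphere, O(d) per iteration
    return w

w_oja = oja(X)
print("alignment with exact leading eigenvector:", abs(w_oja @ v1))
```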

Existing Approaches
Up to constants/log-factors:

Algorithm        Time per iter.    # iter.                      Runtime
Exact            -                 -                            $d^3$
Power/Lanczos    $nd$              $1/\lambda^p$                $nd/\lambda^p$
Incremental      $d$               $d/(\lambda^2 \epsilon)$     $d^2/(\lambda^2 \epsilon)$

(here $p = 1$ for power iterations and $p = 1/2$ for Lanczos)

Main Question: Can we get the best of both worlds? $O(d)$ time per iteration and fast convergence (logarithmic dependence on $\epsilon$)?

Convex Optimization to the Rescue?
Our problem is equivalent to:
$\min_{w : \|w\| = 1} \; \frac{1}{n}\sum_{i=1}^n \left(-\langle w, x_i \rangle^2\right)$
Much recent progress in solving strongly convex + smooth problems with finite-sum structure:
$\min_{w \in \mathcal{W}} \; \frac{1}{n}\sum_{i=1}^n f_i(w)$
Stochastic algorithms with $O(d)$ runtime per iteration and exponential convergence [Le Roux, Schmidt, Bach 2012; Shalev-Shwartz and Zhang 2012; Johnson and Zhang 2013; Zhang, Mahdavi, Jin 2013; Konečný and Richtárik 2013; Xiao and Zhang 2014; Zhang and Xiao 2014...]
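For intuition on these finite-sum methods, here is a minimal SVRG-style sketch for a strongly convex, smooth example (regularized least squares); the problem, step size, and epoch length are illustrative assumptions, and the snippet reuses `X`, `v1`, and `rng` from above.

```python
def svrg_ridge(X, y, lam=0.1, eta=0.01, epochs=20, m=None, seed=3):
    """Minimize (1/n) sum_i [ 0.5*(x_i^T w - y_i)^2 + 0.5*lam*||w||^2 ] with SVRG."""
    n, d = X.shape
    m = m or n
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i] + lam * w        # gradient of the i-th term
    for _ in range(epochs):
        w_ref = w.copy()
        full_grad = X.T @ (X @ w_ref - y) / n + lam * w_ref         # full gradient at reference point
        for _ in range(m):
            i = rng.integers(n)
            w -= eta * (grad_i(w, i) - grad_i(w_ref, i) + full_grad)  # variance-reduced step
    return w

y = X @ v1 + 0.1 * rng.standard_normal(X.shape[0])                  # synthetic regression targets
w_svrg = svrg_ridge(X, y)
```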

Convex Optimization to the Rescue?
Unfortunately:
$\min_{w : \|w\| = 1} \; \frac{1}{n}\sum_{i=1}^n \left(-\langle w, x_i \rangle^2\right)$
The function is not strongly convex, or even convex (in fact, it is concave everywhere)
It has more than one global optimum, plateaus...
Existing results don't work as-is
But: maybe we can borrow some ideas...

Algorithm
Oja iteration for $\min_{w : \|w\| = 1} \frac{1}{n}\sum_{i=1}^n \left(-\langle w, x_i \rangle^2\right)$:
  Choose $i_t \in \{1, \dots, n\}$ at random
  $w_{t+1} = w_t + \eta_t \langle w_t, x_{i_t} \rangle x_{i_t}$
  $w_{t+1} := w_{t+1} / \|w_{t+1}\|$
Essentially projected stochastic gradient descent

Algorithm
Letting $A = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$, the update step is
$w_{t+1} = w_t + \eta_t x_{i_t} x_{i_t}^\top w_t = w_t + \underbrace{\eta_t A w_t}_{\text{power/gradient step}} + \underbrace{\eta_t \left(x_{i_t} x_{i_t}^\top - A\right) w_t}_{\text{zero-mean noise}}$

Main idea: Replace this by
$w_{t+1} = w_t + \underbrace{\eta A w_t}_{\text{power/gradient step}} + \underbrace{\eta \left(x_{i_t} x_{i_t}^\top - A\right)(w_t - \tilde{u})}_{\text{zero-mean noise}}$
where $\tilde{u}$ is close to $w_t$ (similar to SVRG of Johnson and Zhang, 2013)

Algorithm
VR-PCA
Parameters: step size $\eta$, epoch length $m$
Input: data set $\{x_i\}_{i=1}^n$, initial unit vector $\tilde{w}_0$
For $s = 1, 2, \dots$:
  $\tilde{u} = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top \tilde{w}_{s-1}$
  $w_0 = \tilde{w}_{s-1}$
  For $t = 1, 2, \dots, m$:
    Pick $i_t \in \{1, \dots, n\}$ uniformly at random
    $w'_t = w_{t-1} + \eta \left(x_{i_t} x_{i_t}^\top (w_{t-1} - \tilde{w}_{s-1}) + \tilde{u}\right)$
    $w_t = w'_t / \|w'_t\|$
  $\tilde{w}_s = w_m$

To get $k > 1$ directions: either repeat, or perform orthogonal-iteration-like updates:
  Replace all vectors by $d \times k$ matrices
  Replace the normalization step by an orthogonalization step
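A minimal NumPy sketch of the $k = 1$ pseudocode above; the parameter defaults are placeholders rather than tuned values from the talk, and `X` is assumed from the earlier snippets.

```python
def vr_pca(X, eta, epochs=20, m=None, seed=4):
    n, d = X.shape
    m = m or n                                  # epoch length; m = n means one effective pass
    rng = np.random.default_rng(seed)
    w_tilde = rng.standard_normal(d)
    w_tilde /= np.linalg.norm(w_tilde)          # initial unit vector w~_0
    for _ in range(epochs):
        u_tilde = X.T @ (X @ w_tilde) / n       # u~ = (1/n) sum_i x_i x_i^T w~_{s-1}, one full pass
        w = w_tilde.copy()
        for _ in range(m):
            x = X[rng.integers(n)]              # pick i_t uniformly at random
            w = w + eta * ((x @ (w - w_tilde)) * x + u_tilde)   # variance-reduced update
            w /= np.linalg.norm(w)              # normalize back to the unit sphere
        w_tilde = w                             # w~_s = w_m
    return w_tilde
```

The inner update matches the decomposition on the previous slide: it equals $\eta\, x_{i_t} x_{i_t}^\top (w_{t-1} - \tilde{w}_{s-1}) + \eta A \tilde{w}_{s-1}$, so its expectation over $i_t$ is the exact power/gradient step $\eta A w_{t-1}$. A usage example with the experiment settings appears after the Experiments slide below.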

Analysis
Theorem. Suppose $\max_i \|x_i\|^2 \le r$ and $A$ has leading eigenvector $v_1$ with eigengap $\lambda$. Assume $\langle \tilde{w}_0, v_1 \rangle \ge \frac{1}{\sqrt{2}}$. Then for any $\delta, \epsilon \in (0, 1)$, if
$\eta \le \frac{c_1 \delta^2}{r^2}\,\lambda, \qquad m \ge \frac{c_2 \log(2/\delta)}{\eta\lambda}, \qquad m\eta^2 r^2 + r\sqrt{m\eta^2 \log(2/\delta)} \le c_3$
(where $c_1, c_2, c_3$ are constants) and we run $T = \left\lceil \frac{\log(1/\epsilon)}{\log(2/\delta)} \right\rceil$ epochs, then
$\Pr\left( \langle \tilde{w}_T, v_1 \rangle^2 \ge 1 - \epsilon \right) \ge 1 - 2\log(1/\epsilon)\,\delta$

Corollary. Picking $\eta, m$ appropriately, $\epsilon$-convergence holds w.h.p. in runtime $O\left(d\left(n + \frac{1}{\lambda^2}\right)\log\left(\frac{1}{\epsilon}\right)\right)$:
  Exponential convergence with $O(d)$-time iterations
  Runtime proportional to the number of examples plus $1/\lambda^2$ (inverse squared eigengap)
  Proportional to a single data pass if $\lambda \ge 1/\sqrt{n}$

Proof Idea
Track the decay of $F(w_t) = 1 - \langle w_t, v_1 \rangle^2$
Key Lemma. Assuming $\eta = \alpha\lambda$ and $F(w_t) \le 3/4$,
$\mathbb{E}[F(w_{t+1}) \mid w_t] \le \left(1 - \Theta(\alpha\lambda^2)\right) F(w_t) + O\left(\alpha^2\lambda^2 F(\tilde{w}_{s-1})\right)$

Proof Idea
Assume $\eta = \alpha\lambda$ ($\alpha \ll 1$)
Using martingale arguments: w.h.p., we never reach the flat region within $m \le O\left(\frac{1}{\alpha^2\lambda^2}\right)$ iterations
For all $t \le m$:
$\mathbb{E}[F(w_{t+1}) \mid w_t] \le \left(1 - \Theta(\alpha\lambda^2)\right) F(w_t) + O\left(\alpha^2\lambda^2 F(\tilde{w}_{s-1})\right)$
Carefully unwinding the recursion, and using $w_0 = \tilde{w}_{s-1}$:
$\mathbb{E}[F(w_m) \mid w_0] \le \left(\left(1 - \Theta(\alpha\lambda^2)\right)^m + O(\alpha)\right) F(\tilde{w}_{s-1})$
If $m \ge \Omega\left(\frac{1}{\alpha\lambda^2}\right)$, then $F(w_m)$ is smaller than $F(\tilde{w}_{s-1})$ by a constant factor
Overall, if $\frac{1}{\alpha\lambda^2} \ll m \ll \frac{1}{\alpha^2\lambda^2}$, every epoch shrinks $F$ by a constant factor
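To spell out the unwinding step (a sketch, using only the per-iteration bound above): write $\theta = \Theta(\alpha\lambda^2)$ for the contraction factor and note that $F(\tilde{w}_{s-1})$ is fixed within an epoch. Iterating the bound $m$ times and summing the geometric series gives
$\mathbb{E}[F(w_m) \mid w_0] \le (1-\theta)^m F(w_0) + O(\alpha^2\lambda^2)\, F(\tilde{w}_{s-1}) \sum_{j=0}^{m-1}(1-\theta)^j \le \left((1-\theta)^m + O\left(\tfrac{\alpha^2\lambda^2}{\theta}\right)\right) F(\tilde{w}_{s-1}) = \left((1-\Theta(\alpha\lambda^2))^m + O(\alpha)\right) F(\tilde{w}_{s-1}),$
using $w_0 = \tilde{w}_{s-1}$ and $\sum_{j \ge 0}(1-\theta)^j = 1/\theta$.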

Experiments
Ran VR-PCA with:
  Random initialization
  $m = n$ (one epoch = one pass over the data)
  $\eta = 0.05/\sqrt{n}$ (based on theory)
Compared to Oja's algorithm with hand-tuned step sizes
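A sketch of how one might reproduce this setup with the snippets above; the number of epochs, the error measure $F(w) = 1 - \langle w, v_1 \rangle^2$, and the step-size grid for Oja's algorithm are assumptions.

```python
n = X.shape[0]
w_vr = vr_pca(X, eta=0.05 / np.sqrt(n), epochs=20, m=n)        # settings from this slide
print("VR-PCA error:", 1 - (w_vr @ v1) ** 2)                   # F(w) = 1 - <w, v_1>^2

# Oja's algorithm with a few hand-tuned step-size schedules eta_t = c / t,
# using the same total number of stochastic iterations (20 passes over the data).
for c in (1.0, 3.0, 9.0, 27.0):
    w = oja(X, num_iters=20 * n, eta0=c)
    print(f"Oja (eta_t = {c}/t) error:", 1 - (w @ v1) ** 2)
```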

Experiments 0 1 2 3 4 λ=0.258 5 0 20 40 60 0 1 2 3 4 λ=0.016 5 0 20 40 60 0 1 2 3 4 λ=0.163 5 0 20 40 60 0 1 2 3 4 λ=0.004 5 0 20 40 60 0 1 2 3 4 λ=0.078 5 0 20 40 60 VR PCA η t = 1/t η t = 3/t η t = 9/t η t = 27/t Ohad Shamir Stochastic PCA with Exponential Convergence 16/19

Preliminary Experiments
[Figure: legend entries: VR-PCA; $\eta_t = 3$; $\eta_t = 9$; $\eta_t = 27$; $\eta_t = 81$; $\eta_t = 243$; Hybrid.]

Work-in-progress and Open Questions
Runtime is $O\left(d\left(n + \frac{1}{\lambda^2}\right)\log\left(\frac{1}{\epsilon}\right)\right)$; can we get $O\left(d\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$ or better, analogous to the convex case? (Experimentally, the required parameters do seem somewhat different.)
Generalizing the analysis to PCA with several directions
Theoretically optimal parameters without knowing $\lambda$
Other fast-stochastic approaches
Other non-convex problems

More details: arXiv:1409.2848
Thanks!