arxiv: v2 [stat.ml] 24 Apr 2017

Similar documents
Principal Component Analysis

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

arxiv: v4 [math.oc] 24 Apr 2017

randomized block krylov methods for stronger and faster approximate svd

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate

Linear Algebra Massoud Malek

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

Lecture 8: Linear Algebra Background

Jim Lambers MAT 610 Summer Session Lecture 2 Notes

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

Mathematical Optimisation, Chpt 2: Linear Equations and inequalities

1. General Vector Spaces

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

14 Singular Value Decomposition

arxiv: v1 [cs.lg] 17 Nov 2017

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012.

Maths for Signals and Systems Linear Algebra in Engineering

Linear Algebra: Matrix Eigenvalue Problems

Iterative Methods for Solving A x = b

NORMS ON SPACE OF MATRICES

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

Lecture 10: October 27, 2016

7 Principal Component Analysis

AN ELEMENTARY PROOF OF THE SPECTRAL RADIUS FORMULA FOR MATRICES

Linear Regression and Its Applications

22.3. Repeated Eigenvalues and Symmetric Matrices. Introduction. Prerequisites. Learning Outcomes

Lecture 2: Linear Algebra Review

Linear Algebra. Session 12

8 Numerical methods for unconstrained problems

Linear Algebra for Machine Learning. Sargur N. Srihari

Repeated Eigenvalues and Symmetric Matrices

FIXED POINT ITERATIONS

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

COMP 558 lecture 18 Nov. 15, 2010

Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =

UNIT 6: The singular value decomposition.

Lecture 9: Krylov Subspace Methods. 2 Derivation of the Conjugate Gradient Algorithm

What is A + B? What is A B? What is AB? What is BA? What is A 2? and B = QUESTION 2. What is the reduced row echelon matrix of A =

arxiv: v1 [math.na] 5 May 2011

Big Data Analytics: Optimization and Randomization

Mathematical foundations - linear algebra

The Steepest Descent Algorithm for Unconstrained Optimization

Linear Algebra and Eigenproblems

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

LECTURE NOTES ELEMENTARY NUMERICAL METHODS. Eusebius Doedel

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

DELFT UNIVERSITY OF TECHNOLOGY

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Least Sparsity of p-norm based Optimization Problems with p > 1

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Linear Algebra March 16, 2019

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Lecture 5 : Projections

MTH 2032 SemesterII

Stat 159/259: Linear Algebra Notes

Stochastic and online algorithms

Convex Functions and Optimization

Chapter 7 Iterative Techniques in Matrix Algebra

Trust Regions. Charles J. Geyer. March 27, 2013

Optimization Theory. A Concise Introduction. Jiongmin Yong

Some definitions. Math 1080: Numerical Linear Algebra Chapter 5, Solving Ax = b by Optimization. A-inner product. Important facts

Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics )

Introduction to Matrix Algebra

1. Background: The SVD and the best basis (questions selected from Ch. 6- Can you fill in the exercises?)

Linear Algebra Primer

Optimization methods

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Numerical Methods - Numerical Linear Algebra

arxiv: v4 [math.oc] 11 Jun 2018

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Approximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo

Eigenvalues and Eigenvectors

Ir O D = D = ( ) Section 2.6 Example 1. (Bottom of page 119) dim(v ) = dim(l(v, W )) = dim(v ) dim(f ) = dim(v )

MATH 205 HOMEWORK #3 OFFICIAL SOLUTION. Problem 1: Find all eigenvalues and eigenvectors of the following linear transformations. (a) F = R, V = R 3,

MIT Final Exam Solutions, Spring 2017

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

A Randomized Algorithm for the Approximation of Matrices

Introduction to Iterative Solvers of Linear Systems

Line Search Methods for Unconstrained Optimisation

Lecture Notes 6: Dynamic Equations Part C: Linear Difference Equation Systems

Gradient Descent Methods

Iterative solvers for linear equations

Computational Methods. Eigenvalues and Singular Values

Chapter 7. Canonical Forms. 7.1 Eigenvalues and Eigenvectors

7. Symmetric Matrices and Quadratic Forms

The Hilbert Space of Random Variables

CHAPTER 3 Further properties of splines and B-splines

A Quick Tour of Linear Algebra and Optimization for Machine Learning

Iterative solvers for linear equations

10-725/36-725: Convex Optimization Prerequisite Topics

Notes on Eigenvalues, Singular Values and QR

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

8. Diagonalization.

arxiv: v4 [math.oc] 5 Jan 2016

MATRICES ARE SIMILAR TO TRIANGULAR MATRICES

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Transcription:

Faster Principal Component Regression and Stable Matrix Chebyshev Approximation arxiv:68.4773v [stat.ml] 4 Apr 7 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / IAS August 6, 6 Abstract Yuanzhi Li yuanzhil@cs.princeton.edu Princeton University We solve principal component regression (PCR), up to a multiplicative accuracy + γ, by reducing the problem to Õ(γ ) black-box calls of ridge regression. Therefore, our algorithm does not require any explicit construction of the top principal components, and is suitable for large-scale PCR instances. In contrast, previous result requires Õ(γ ) such black-box calls. We obtain this result by developing a general stable recurrence formula for matrix Chebyshev polynomials, and a degree-optimal polynomial approximation to the matrix sign function. Our techniques may be of independent interests, especially when designing iterative methods. Introduction In machine learning and statistics, it is often desirable to represent a large-scale dataset in a more tractable, lower-dimensional form, without losing too much information. One of the most robust ways to achieve this goal is through principal component projection (PCP): PCP: project vectors onto the span of the top principal components of the a matrix. It is well-known that PCP decreases noise and increases efficiency in downstream tasks. One of the main applications is principal component regression (PCR): PCR: linear regression but restricted to the subspace of top principal components. Classical algorithms for PCP or PCR rely on a principal component analysis (PCA) solver to recover the top principal components first; with these components available, the tasks of PCP and PCR become trivial because the projection matrix can be constructed explicitly. Unfortunately, PCA solvers demand a running time that at least linearly scales with the number of top principal components chosen for the projection. For instance, to project a vector onto the top principal components of a high-dimensional dataset, even the most efficient -based [8] or Lanczos-based [4] methods require a running time that is proportional to 4 = 4 4 times the input matrix sparsity, if the or Lanczos method is executed for 4 iterations. This is usually computationally intractable.

. Approximating PCP Without PCA In, we propose the following notion of PCP approximation. Given a data matrix A R d d (with singular values no greater than ) and a threshold λ >, we say that an algorithm solves (γ, ε)-approximate PCP if informally speaking and up to a multiplicative ± ε error it projects (see Def. 3. for a formal inition). any eigenvector ν of A A with value in [ λ( + γ), ] to ν,. any eigenvector ν of A A with value in [, λ( γ) ] to, 3. any eigenvector ν of A A with value in [ λ( γ), λ( + γ) ] to anywhere between and ν. Such a inition also extends to (γ, ε)-approximate PCR (see Def. 3.). It was first noticed by Frostig et al. [3] that approximate PCP and PCR be solved with a running time independent of the number of principal components above threshold λ. More specifically, they reduced (γ, ε)-approximate PCP and PCR to O ( γ log(/ε) ) black-box calls of any ridge regression subroutine where each call computes (A A + λi) u for some vector u. Our main focus of is to quadratically improve this performance and reduce PCP and PCR to O ( γ log(/γε) ) black-box calls of any ridge regression subroutine where each call again computes (A A + λi) u. Remark.. Frostig et al. only showed their algorithm satisfies the properties and of (γ, ε)- approximation (but not the property 3), and thus their proof was only for matrix A with no singular value in the range [ λ( γ), λ( + γ)]. This is known as the eigengap assumption, which is rarely satisfied in practice [8]. In, we prove our result both with and without such eigengap assumption. Since our techniques also imply the algorithm of Frostig et al. satisfies property 3, throughout the paper, we say Frostig et al. solve (γ, ε)-approximate PCP and PCR.. From PCP to Polynomial Approximation The main technique of Frostig et al. is to construct a polynomial to approximate the sign function sgn(x): [, ] {±}: { +, x ; sgn(x) =, x <. In particular, given any polynomial g(x) satisfying g(x) sgn(x) ε x [, γ] [γ, ], and (.) g(x) x [ γ, γ], (.) the problem of (γ, ε)-approximate PCP can be reduced to computing the matrix polynomial g(s) for S = (A A + λi) (A A λi) (cf. Fact 7.). In other words, to project any vector χ R d to top principal components, we can compute g(s)χ instead; and Ridge regression is often considered as an easy-to-solve machine learning problem: using for instance SVRG [7], one can usually solve ridge regression to an 8 accuracy with at most 4 passes of the data.

to compute g(s)χ, we can reduce it to ridge regression for each evaluation of Su for some vector u. Remark.. Since the transformation from A A to S is not linear, the final approximation to the PCP is a rational function (as opposed to a polynomial) over A A. We restrict to polynomial choices of g( ) because in this way, the final rational function has all the denominators being A A + λi, thus reduces to ridge regressions. Remark.3. The transformation from A A to S ensures that all the eigenvalues of A A in the range ( ± γ)λ roughly map to the eigenvalues of S in the range [ γ, γ]. Main Challenges. There are two main challenges regarding the design of polynomial g(x). Efficiency. We wish to minimize the degree n = deg(g(x)) because the computation of g(s)χ usually requires n calls of ridge regression. Stability. We wish g(x) to be stable; that is, g(s)χ must be given by a recursive formula where if we make ε error in each recursion (due to error incurred from ridge regression), the final error of g(s)χ must be at most ε poly(d). Remark.4. Efficient routines such as SVRG [7] solve ridge regression and thus compute Su for any u R d, with running times only logarithmically in /ε. Therefore, by setting ε = ε/poly(d), one can blow up the running time by a small factor O(log(d)) in order to obtain an ε-accurate solution for g(s)χ. The polynomial g(x) constructed by Frostig et al. comes from truncated Taylor expansion. It has degree O ( γ log(/ε) ) and is stable. This γ dependency limits the practical performance of their proposed PCP and PCR algorithms, especially in a high accuracy regime. At the same time, the optimal degree for a polynomial to satisfy even only (.) is Θ ( γ log(/ε) ) [9, ]. Frostig et al. were unable to find a stable polynomial matching this optimal degree and left it as open question..3 Our Results and Main Ideas We provide an efficient and stable polynomial approximation to the matrix sign function that has a near-optimal degree O(γ log(/γε)). At a high level, we construct a polynomial q(x) that approximately equals ( ) +κ x / for some κ = Θ(γ ); then we set g(x) = x q( + κ x ) which approximates sgn(x). To construct q(x), we first note that ( ) +κ x / has no singular point on [, ] so we can apply Chebyshev approximation theory to obtain some q(x) of degree O(γ log(/γε)) satisfying ( + κ x ) / q(x) ε for every x [, ]. This can be shown to imply g(x) sgn(x) ε for every x [, γ] [γ, ], so (.) is satisfied. In order to prove (.) (i.e., g(x) for every x [ γ, γ]), we prove a separate lemma: 3 ( + κ x ) / q(x) for every x [, + κ]. Using degree reduction, Frostig et al. found an explicit polynomial g(x) of degree O ( γ log(/γε) ) satisfying (.). However, that polynomial is unstable because it is constructed monomial by monomial and has exponentially large coefficients in front of each monomial. Furthermore, it is not clear if their polynomial satisfies the (.). 3 We proved a general lemma which holds for any function whose all orders of derivatives are non-negative at x =. 3

Note that this does not follow from standard Chebyshev theory because Chebyshev approximation guarantees are only with respect to x [, ] and do not extend to singular point x = + κ. This proves the efficiency part of the main challenges discussed earlier. As for the stability part, we prove a general theorem regarding any weighted sum of Chebyshev polynomials applied to matrices. We provide a backward recurrence algorithm and show that it is stable under noisy computations. This may be of independent interest. For interested readers, we compare our polynomial q(x) with that of Frostig et al. in Figure.....5.5.5 -. -.5.5. -. -.5.5. -. -.5.5. -.5 -.5 -.5 -. (a) degree -. (b) degree 4 -. (c) degree Figure : Comparing our polynomial g(x) (orange solid curve) with that of Frostig et al. (blue dashed curve)..4 Related Work There are a few attempts to reduce the cost of PCA when solving PCR, by for instance approximating the matrix AP λ where P λ is the PCP projection matrix [6, 7]. However, they cost a running time that linearly scales with the number of principal components above λ. A significant number of papers have focused on the low-rank case of PCA [, 4, 8] and its online variant [3]. Unfortunately, all of these methods require a running time that scales at least linearly with respect to the number of top principal components. More related to is work on matrix sign function, which plays an important role in control theory and quantum chromodynamics. Several results have addressed methods for applying the sign function in the so-called subspace, without explicitly constructing any approximate polynomial [, 4]. However, methods are not (γ, ε)-approximate PCP solvers, and there is no supporting stability theory behind them. 4 Other iterative methods have also been proposed, see Section 5 of textbook [6]. For instance, Schur s method is a slow one and also requires the matrix to be explicitly given. The Newton s iteration and its numerous variants (e.g. [9]) provide rational approximations to the matrix sign function as opposed to polynomial approximations. Our result and Frostig et al. [3] differ from these cited works, because we have only accessed an approximate ridge regression oracle, so ensuring a polynomial approximation to the sign function and ensuring its stability are crucial. Using matrix Chebyshev polynomials to approximate matrix functions is not new. Perhaps the most celebrated example is to approximate S using polynomials on S, used in the analysis of conjugate gradient []. Independent from, 5 Han et al. [5] used Chebyshev polynomials to approximate the trace of the matrix sign function, i.e., Tr(sgn(S)), which is similar but a different problem. 6 Also, they did not study the case when the matrix-vector multiplication oracle is only approximate (like we do in ), or the case when S has eigenvalues in the range [ γ, γ]. 4 We anyways have included method in our empirical evaluation section and shall discuss its performance there, see for instance Remark 8.. 5 Their paper appeared online two months before us, and we became aware of their work in March 7. 6 In particular, their degree of the Chebyshev polynomial is O ( γ (log (/γ) + log(/γ) log(/ε)) ) in the language of ; in contrast, we have degree O ( γ log(/γε) ). 4

Roadmap. In Section, we provide notions for and basics for Chebyshev polynomials In Section 3, we put forward our formal initions for approximate PCP and PCR, and show a reduction from approximate PCR to approximate PCP. In Section 4, we prove a general lemma regarding Chebyshev approximations outside [, ]. In Section 5, we design our polynomial approximation to sgn(x). In Section 6, we show how to stably compute any weighted sum of Chebyshev polynomials. In Section 7, we provide pseudocode and prove our main theorems regarding PCP and PCR. In Section 8, we provide empirical evaluations of our theory. Preliminaries We denote by [e] {, } the indicator function for event e, by v or v the Euclidean norm of a vector v, by M the Moore-Penrose pseudo-inverse of a symmetric matrix M, and by M its spectral norm. We sometimes use v to emphasize that v is a vector. Given a symmetric d d matrix M and any f : R R, f(m) is the matrix function applied to M, which is equal to Udiag{f(D ),..., f(d d )}U if M = Udiag{D,..., D d }U is its eigendecomposition. Throughout the paper, matrix A is of dimension d d. We denote by σ max (A) the largest singular value of A. Following the tradition of [3] and keeping the notations light, we assume without loss of generality that σ max (A). We are interested in PCP and PCR problems with an eigenvalue threshold λ (, ). Throughout the paper, we denote by λ λ d the eigenvalues of A A, and by ν,..., ν d R d the eigenvectors of A A corresponding to λ,..., λ d. We denote by P λ the projection matrix P λ = (ν,..., ν j )(ν,..., ν j ) where j is the largest index satisfying λ j λ. In other words, P λ is a projection matrix to the eigenvectors of A A with eigenvalues λ. Definition.. The principal component projection (PCP) of χ R d at threshold λ is ξ = P λ χ. Definition.. The principal component regression (PCR) of regressand b R d. Ridge Regression x = arg min AP λ y b or equivalently x = (A A) P λ (A b). y R d at threshold λ is Definition.3. A black-box algorithm ApxRidge(A, λ, u) is an ε-approximate ridge regression solver, if for every u R d, it satisfies ApxRidge(A, λ, u) (A A + λi) u ε u. 7 Ridge regression is equivalent to solving well-conditioned linear systems, or minimizing strongly convex and smooth objectives f(y) = y (A A + λi)y u y. Remark.4. There is huge literature on efficient algorithms solving ridge regression. Most notably, () Conjugate gradient [] or accelerated gradient descent [] gives fastest full-gradient methods; () SVRG [7] and its acceleration Katyusha [] give the fastest stochastic-gradient method; and (3) NUACDM [5] gives the fastest coordinate-descent method. 7 In fact, throughout the paper, we only need ApxRidge to satisfy this property with high probability for each u. 5

The running time of () is O(nnz(A)λ / log(/ε)) where nnz(a) is time to multiply A to any vector. The running times of () and (3) depend on structural properties of A and are always faster than (). Because the best complexity of ridge regression depends on the structural properties of A, following Frostig et al., we only compute our running time in terms of the number of black-box calls to a ridge regression solver.. Chebyshev Polynomials Definition.5. Chebyshev polynomials of st and nd kind are {T n (x)} n and {U n (x)} n where T (x) =, T (x) = x, T n+ (x) = x T n (x) T n (x) U (x) =, U (x) = x, U n+ (x) = x U n (x) U n (x) Fact.6 ([3]). It satisfies d dx T n(x) = nu n (x) for n and cos(n arccos(x)), if x ; n : T n (x) = cosh(n arccosh(x)), if x ; ( ) n cosh(n arccosh( x)), if x. In particular, when x, T n (x) = [( x x ) n ( + x + x ) n] and Un (x) = x ) n+ ( x x ) n+]. T n (x) = [( x x ) n ( + x + x ) n] U n (x) = [( x + x ) n+ ( x x ) n+] x x [( x + Definition.7. For function f(x) whose domain contains [, ], its degree-n Chebyshev truncated series and degree-n Chebyshev interpolation are respectively n n p n (x) = a k T k (x) and q n (x) = c k T k (x), where a k = [k = ] π k= f(x)t k (x) x dx and c k = k= [k = ] n + Above, x j = cos ( (j+.5)π ) [, ] is the j-th Chebyshev point of order n. n+ n f ( ) ( ) x j Tk xj. The following lemma is known as the aliasing formula for Chebyshev coefficients: Lemma.8 (cf. Theorem 4. of [3]). Let f be Lipschitz continuous on [, ] and {a k }, {c k } be ined in Def..7, then c = a + a n + a 4n +..., c n = a n + a 3n + a 5n +..., and k {,,..., n }: c k = a k + (a k+n + a k+4n +...) + (a k+n + a k+4n +...) Definition.9. For every ρ >, let E ρ be the ellipse E of foci ± with major radius + ρ. (This is also known as Bernstein ellipse with parameter + ρ + ρ + ρ.) The following lemma is the main theory regarding Chebyshev approximation: j= 6

Lemma. (cf. Theorem 8. and 8. of [3]). Suppose f(z) is analytic on E ρ and f(z) M on E ρ. Let p n (x) and q n (x) be the degree-n Chebyshev truncated series and Chebyshev interpolation of f(x) on [, ]. Then, max x [,] f(x) p n (x) max x [,] f(x) q n (x) M ρ+ ρ+ρ ( + ρ + ρ + ρ ) n ; 4M ρ+ ρ+ρ ( + ρ + ρ + ρ ) n. a M and a k M ( + ρ + ρ + ρ ) k for k. 3 Approximate PCP and PCR We formalize our notions of approximation for PCP and PCR, and provide a reduction from PCR to PCP. 3. Our Notions of Approximation Recall that Frostig et al. [3] work only with matrices A that satisfy the eigengap assumption, that is, A has no singular value in the range [ λ( γ), λ( + γ)]. Their approximation guarantees are very straightforward: an output ξ is ε-approximate for PCP on vector χ if ξ ξ ε χ ; an output x is ε-approximate for PCR with regressand b if x x ε b. Unfortunately, these notions are too strong and impossible to satisfy for matrices that do not have a large eigengap around the projection threshold λ. In we propose the following more general (but yet very meaningful) approximation notions. Definition 3.. An algorithm B(χ) is (γ, ε)-approximate PCP for threshold λ, if for every χ R d. P (+γ)λ ( B(χ) χ ) ε χ.. (I P ( γ)λ )B(χ) ε χ. 3. i such that λ i [ ( γ)λ, ( + γ)λ ], it satisfies ν i, B(χ) χ ν i, χ + ε χ. Intuitively, the first property above states that, if projected to the eigenspace with eigenvalues above ( + γ)λ, then B(χ) and χ are almost identical; the second property states that, if projected to the eigenspace with eigenvalues below ( γ)λ, then B(χ) is almost zero; and the third property states that, for each eigenvector ν i with eigenvalue in the range [( γ)λ, ( + γ)λ], the projection ν i, B(χ) must be between and ν i, χ (but up to an error ε χ ). Naturally, P λ (χ) itself is a (, )-approximate PCP. We propose the following notion for approximate PCR: Definition 3.. An algorithm C(b) is (γ, ε)-approximate PCR for threshold λ, if for every b R d. (I P ( γ)λ )C(b) ε b.. AC(b) b Ax b + ε b. where x = (A A) P (+γ)λ A b is the exact PCR solution for threshold ( + γ)λ. 7

The first notion states that the output x = C(b) has nearly no correlation with eigenvectors below threshold ( γ)λ; and the second states that the regression error should be nearly optimal with respect to the exact PCR solution but at a different threshold ( + γ)λ. Relationship to Frostig et al. Under eigengap assumption, our notions are equivalent to Frostig et al.: Fact 3.3. If A has no singular value in [ λ( γ), λ( + γ)], then Def. 3. is equivalent to B(χ) P λ (χ) O(ε) χ. Def. 3. implies C(χ) x O(ε/λ) b and C(χ) x O(ε) b implies Def. 3.. Above, x = (A A) P λ A b is the exact PCR solution. 3. Reductions from PCR to PCP If the PCP solution ξ = P λ (A b) is computed exactly, then by inition one can compute (A A) ξ which gives a solution to PCR by solving a linear system. However, as pointed by Frostig et al. [3], this computation is problematic if ξ is only approximate. The following approach has been proposed to improve its accuracy by Frostig et al. compute p((a A + λi) )ξ where p(x) is a polynomial that approximates function x λx. x λx and This is a good approximation to (A A) ξ because the composition of functions +λx is exactly x. Frostig et al. picked p(x) = p m (x) = m t= λt x t which is a truncated Taylor series, and used the following procedure to compute s m p m ((A A + λi) )ξ: s = B(A b), s = ApxRidge(A, λ, s ), k : s k+ = s + λ ApxRidge(A, λ, s k ). (3.) Above, B is an approximate PCP solver and ApxRidge is an approximate ridge regression solver. Under the eigengap assumption, Frostig et al. [3] showed that Lemma 3.4 (PCR-to-PCP). For fixed λ, γ, ε (, ), let A be a matrix whose singular values lie in [, ( γ)λ ] [ ( γ)λ, ]. Let ApxRidge be any O( ε )-approximate ridge regression m solver, and let B be any (γ, O( ελ ))-approximate PCP solver 8. Then, procedure (3.) satisfies m s m (A A) P λ A b ε b if m = Θ(log(/εγ)). Unfortunately, the above lemma does not hold without eigengap assumption. In, we fix this issue by proving the following analogous lemma: Lemma 3.5 (gap free PCR-to-PCP). For fixed λ, ε (, ) and γ (, /3], let A be a matrix whose singular values are no more than. Let ApxRidge be any O( ε )-approximate ridge regression solver, and B be any (γ, O( ελ ))-approximate PCP solver. Then, procedure (3.) satisfies, m m { } (I P ( γ)λ )s m ε b, and if m = Θ(log(/εγ)) As m b A(A A) P (+γ)λ A b b + ε b Note that the conclusion of this lemma exactly corresponds to the two properties in our Def. 3.. The proof of Lemma 3.5 is not hard, but requires a very careful case analysis by decomposing vectors b and each s k into three components, each corresponding to eigenvalues of A A in the range [, ( γ)λ], [( γ)λ, ( + γ)λ] and [( + γ)λ, ]. We er the details to Appendix A. 8 Recall from Fact 3.3 that this requirement is equivalent to saying that B(χ) P λ χ O( ε λ m ) χ. 8

4 Property of Chebyshev Approximation Outside [, ] Classical Chebyshev approximation theory (such as Lemma.) only talks about the behaviors of p n (x) or g n (x) on interval [, ]. However, for the purpose of, we must also bound its value for x >. We prove the following general lemma in Appendix B, and believe it could be of independent interest: (we denote by f (k) (x) the k-th derivative of f at x) Lemma 4.. Suppose f(z) is analytic on E ρ and for every k, f (k) (). Then, for every n N, letting p n (x) and q n (x) be be the degree-n Chebyshev truncated series and Chebyshev interpolation of f(x), we have y [, ρ]: p n ( + y), q n ( + y) f( + y). 5 Our Polynomial Approximation of sgn(x) For fixed κ (, ], we consider the degree-n Chebyshev interpolation q n (x) = n k= c kt k (x) of the function f(x) = ( ) +κ x / on [, ]. Def..7 tells us that c k = [k = ] n + n j= ( ( k(j +.5)π ))( cos + κ cos n + Our final polynomial to approximate sgn(x) is therefore g n (x) = x q n ( + κ x ) and deg(g n (x)) = n +. We prove the following theorem in this section: ( (j +.5)π )) /. n + Theorem 5.. For every α (, ], ε (, /), choosing κ = α, our function g n (x) = x q n ( + κ x ) satisfies that as long as n α log 3, then (see also Figure ) εα g n (x) sgn(x) ε for every x [, α] [α, ]. g n (x) [, ] for every x [, α] and g n (x) [, ] for every x [ α, ]. Note that our degree n = O ( α log(/αε) ) is near-optimal, because the minimum degree for a polynomial to satisfy even only the first item is Θ ( α log(/ε) ) [9, ]. However, the results of [9, ] are not constructive, and thus may not lead to stable matrix polynomials. We prove Theorem 5. by first establishing two simple lemmas. The following lemma is a consequence of Lemma.: Lemma 5.. For every ε (, /) and κ (, ], if n κ ( log κ + log 4 ε) then x [, ], f(x) q n (x) ε. Proof of Lemma 5.. Denoting by f(z) = ( ) +κ z.5, we know that f(z) is analytic on ellipse Eρ with ρ = κ/, and it satisfies f(z) /κ in E ρ. Applying Lemma., we know that when n ( κ log κ + log 4 ε) it satisfies f(x) qn (x) ε. The next lemma an immediate consequence of our Lemma 4. with f(z) = ( ) +κ z.5: Lemma 5.3. For every ε (, /), κ (, ], n N, and x [, κ], we have ( κ x ) / q n ( + x). 9

Proof of Theorem 5.. We are now ready to prove Theorem 5.. When x [, α] [α, ], it satisfies + κ x [, ]. Therefore, applying Lemma 5. we have whenever n κ log 6 εκ = α log 3 it satisfies f( + κ x ) q εα n ( + κ x ) ε. This further implies g n (x) sgn(x) = xq n (+κ x ) xf(+κ x ) x f(+κ x ) q n (+κ x ) ε. When x α, it satisfies + κ x [, + κ]. Applying Lemma 5.3 we have x [, α]: g n (x) = x q n ( + κ x ) x (x ) / = and similarly for x [ α, ] it satisfies g n (x). A Bound on Chebyshev Coefficients. We also give an upper bound to the coefficients of polynomial q n (x). Its proof can be found in Appendix C, and this upper bound shall be used in our final stability analysis. Lemma 5.4 (coefficients of q n ). Let q n (x) = n k= c kt k (x) be the degree-n Chebyshev interpolation of f(x) = ( ) +κ x / on [, ]. Then, i {,,..., n}: c i e 3(i + ) ( + κ + ) i κ + κ κ 6 Stable Computation of Matrix Chebyshev Polynomials In this section we show that any polynomial that is a weighted summation of Chebyshev polynomials with bounded coefficients, can be stably computed when applied to matrices with approximate computations. We achieve so by first generalizing Clenshaw s backward method to matrix case in Section 6. in order to compute a matrix variant of Chebyshev sum, and then analyze its stability in Section 6. with the help from Elloit s forward-backward transformation [8]. Remark 6.. We wish to point out that although Chebyshev polynomials are known to be stable under error when computed on scalars [4], it is not immediately clear why it holds also for matrices. Recall that Chebyshev polynomials satisfy T n+ (x) = xt n (x) T n (x). In the matrix case, we have T n+ (M)χ = MT n (M)χ T n (M)χ where χ R d is a vector. If we analyzed this formula coordinate by coordinate, error could blow up by a factor d per iteration. In addition, we need to ensure that the stability theorem holds for matrices M with eigenvalues that can exceed. This is not standard because Chebyshev polynomials are typically analyzed only on domain [, ]. 6. Clenshaw s Method in Matrix Form In the scalar case, Clenshaw s method (sometimes referred to as backward recurrence) is one of the most widely used implementations for Chebyshev polynomials. We now generalize it to matrices. Consider any computation of the form s N = N T k (M) c k R d where M R d d is symmetric and each c k is in R d. (6.) k= (Note that for PCP and PCR purposes, we it suffices to consider c k = c k χ where c k R is a scalar and χ R d is a fixed vector for all k. However, we need to work on this more general form for our stability analysis.)

Vector s N can be computed using the following procedure: Lemma 6. (backward recurrence). s N = b M b where bn+ =, bn = c N, and r {N,..., }: b r = M b r+ b r+ + c r R d. 6. Inexact Clenshaw s Method in Matrix Form We show that, if implemented using the backward recurrence formula, the Chebyshev sum of (6.) can be stably computed. We ine the following model to capture the error with respect to matrix-vector multiplications. Definition 6.3 (inexact backward recurrence). Let M be an approximate algorithm that satisfies M(u) Mu ε u for every u R d. Then, ine inexact backward recurrence to be bn+ =, bn = c N, and r {N,..., }: b r = M ( br+ ) br+ + c r R d, and ine the output as ŝ N = b M( b ). The following theorem gives an error analysis to our inexact backward recurrence. We prove it in Appendix D., and the main idea of our proof is to convert each error vector of a recursion of the backward procedure into an error vector corresponding to some original c k. Theorem 6.4 (stable Chebyshev sum). For every N N, suppose the eigenvalues of M are in [a, b] and suppose there are parameters C U, C T, ρ, C c satisfying { k {,,..., N}: ρ k c k C c x [a, b]: Tk (x) C T ρ k and U k (x) C U ρ k}. Then, if the inexact backward recurrence in Def. 6.3 is applied with ε 4NC U, we have ŝ N s N ε ( + NC T )NC U C c. 7 Algorithms and Main Theorems for PCP and PCR We are now ready to state our main theorems for PCP and PCR. We first note a simple fact: Fact 7.. (P λ )χ = I+sgn(S) where S = (A A + λi) A A I = (A A + λi) (A A λi). In other words, for every vector χ R d, the exact PCP solution P λ (χ) is the same as computing (P λ )χ = I+sgn(S) χ. Thus, we can use our polynomial g n (x) introduced in Section 5 and compute g n (S)χ sgn(s)χ. Finally, in order to compute g n (S), we need to multiply S to deg(g n ) vectors; whenever we do so, we call perform ridge regression once. 7. Our Pseudo Codes First of all, we can approximately compute Sχ for an arbitrary χ R d. This simply uses one oracle call to ridge regression, see Algorithm. Next, since we are interested in (γ, ε)-approximate PCP, we want g n (x) to be close to sgn(x) on all eigenvalues of A A that are outside [( γ)λ, ( + γ)λ], or equivalently all eigenvalues of S outside the range [ ( + γ) + ( + γ) ( γ) ], + ( γ).

Algorithm MultS(A, λ, χ) Input: A R d d ; λ > ; χ R d. Output: a vector that approximately equals Sχ = (A A + λi) (A A λi)χ : return ApxRidge(A, λ, A Aχ λχ). Since this new interval contains [ α, α] for α = γ/(+γ) = γ/ O(γ ), we can apply Theorem 5., which gives us a polynomial g n (x) = x q n ( + κ x ) where κ = α = (γ/( + γ)). We use (inexact) backward ( recurrence see Lemma 6. to compute the Chebyshev interpolation polynomial u q n (+κ)i S ) χ. Our final output for approximate PCP is simply Su+χ because P λ Sgn((+κ) S )+I. We summarize this algorithm as QuickPCP(A, χ, λ, γ, n) in Algorithm. Algorithm QuickPCP(A, χ, λ, γ, n) Input: A R d d data matrix satisfying σ max (A) ; χ R d, vector to project; λ >, eigenvalue threshold for PCP; γ (, /3], PCP approximation ratio. n, number of iterations one can also ignore γ and set γ =, see Remark 7.5 Output: a vector ξ R d satisfying ξ P λ (χ). : γ max{γ, log(n) n } if γ to small, work in a γ-free regime, see Remark 7.5 : κ ( γ/( + γ) ) recall κ = α = (γ/( + γ)) in our analysis 3: Define c k = [k=] ( n ( n+ j= cos k(j+.5)π ) )( n+ + κ cos ( (j+.5)π ) ) / n+ 4: b n+, b n c n χ 5: for r n to do coefficients for q n(x) 6: w ( + κ)b r+ MultS(A, λ, MultS(A, λ, b r+ )); w (( + κ)i S )b r+ 7: b r w b r+ + c r χ 8: end for 9: u MultS(A, λ, b w); u S(g n(( + κ)i S ))χ sgn(s)χ : return u + sgn(s)+i χ output χ Finally, we apply the PCR-to-PCP reduction (see Section 3) to derive a solution for PCR from an approximate solution for PCP. See QuickPCR(A, b, λ, γ, n, m) in Algorithm 3. Algorithm 3 QuickPCR(A, b, λ, γ, n, m) Input: A, λ, γ, n the same as QuickPCP; b R d is the regressand vector; m is the number of iterations for PCR. choosing m = it sufficient for practical purposes Output: a vector x R d that solves approximate PCR. : v QuickPCP(A, A b, λ, γ, n), s v, s ApxRidge(A, λ, v); : for r to m do 3: s λ ApxRidge(A, λ, s) + s ; 4: return s Fact 7.. QuickPCP calls ridge regression n + times and QuickPCR calls it n + m + times.

7. Our Main Theorems We first state our main theorem under the eigengap assumption, in order to provide a direct comparison to that of Frostig et al. [3]. d Theorem 7.3 (eigengap assumption). Given A R d and λ, γ (, ), assume that the singular values of A are in the range [, ( γ)λ] [ ( + γ)λ, ]. Given χ R d and b R d, denote by ξ = P λ χ and x = (A A) P λ A b the exact PCP and PCR solutions. If ApxRidge is an ε -approximate ridge regression solver, then the output ξ QuickPCP(A, χ, λ, γ, n) satisfies ξ ξ ε χ if n = Θ ( γ log ) γε and log(/ε ) = Θ ( log γε) ; the output x QuickPCR(A, b, λ, γ, n, m) satisfies x x ε b if n = Θ ( γ log ) ( ) γλε, m = Θ log γε and log(/ε ) = Θ ( log γλε). In contrast, the number of ridge-regression oracle calls was Θ(γ log γε ) for PCP and Θ(γ log for PCR in [3]. We include the proof of Theorem 7.3 in Appendix E.. Next we state our stronger theorem without the eigengap assumption. Theorem 7.4 (gap-free). Given A R d d, λ (, ), and γ (, /3], assume that A. Given χ R d and b R d, and suppose ApxRidge is an ε -approximate ridge regression solver, then QuickPCP outputs ξ that is (γ, ε)-approximate PCP with O ( γ log γε) oracle calls to ApxRidge as long as log(/ε ) = Θ ( log γε). QuickPCR outputs x that is (γ, ε)-approximate PCR with O ( γ log γλε) oracle calls to ApxRidge as long as elog(/ε ) = Θ ( log γλε). We make a final remark here regarding the practical usage of QuickPCP and QuickPCR. Remark 7.5. Since our theory is for (γ, ε)-approximations that have two parameters, the user in principle has to feed in both γ and n (in addition to other ault inputs such as A, b and λ). In practice, however, it is usually sufficient to obtain (ε, ε)-approximate PCP and PCR. Therefore, our pseudocodes allow users to set γ = and thus ignore this parameter γ; in such a case, we shall use γ = log(n)/n which is equivalent to setting γ = Θ(ε) because n = Θ(γ log(/γε)). 8 Experiments In the same way as [3], we conclude with an empirical evaluation to demonstrate our theorems. Datasets. We consider synthetic and real-life datasets. We generate the synthetic dataset in the same way as [3]. That is, we form a 3 dimensional matrix A via the SVD A = UΣV where U and V are random orthonormal matrices and Σ contains random singular values. Among the singular values, we let half of them be randomly chosen from [,.( a)] and the other half randomly chosen from [.(+a), ]. We generate vector b by adding noise to the response Ax of a random true x that correlates with A s top principal components. We consider eigenvalue threshold λ =., and use a =,.,.,. in our experiments. We call these datasets random-a. γλε ) 3

As for the real-life dataset, we use mnist []. After scaling its largest singular value to one, 9 we choose the eigenvalue threshold λ =.5 (or equivalently singular value threshold λ =.5). The closest singular values to this threshold are respectively.57 and.4958. Algorithms. We implemented our algorithm and Frostig et al. [3] (which we call for short) and minimized the number of calls to ridge regression in our implementations. For instance, if using our pseudocode QuickPCP, the number of ridge regression calls is n + ; if using our pseudocode QuickPCR, the number of extra ridge regression calls is m +. We choose m = in all of our experiments because the theoretical prediction of m is only a small logarithmic quantity (see Lemma 3.4 and Lemma 3.5). We also implemented a practical heuristic using subspace that were found on the website []. We call this algorithm method for short. method transforms the covariance matrix AA into a lower-dimensional subspace and performs exact PCP and PCR there. Similar to, method also reduces PCP and PCR to multiple calls of ridge regressions. We emphasize that method has no supporting theory behind it. Since we find it performs much faster than in practice, we include it in our experiments for a stronger comparison. Remark 8.. There are two main issues behind the missing theory of method. Stability. If matrix-vector multiplications are only approximate, -based methods are usually unstable so one needs to replace it with other stable variants. Our polynomial approximation g n (x) can be viewed as one such stable variant. Accuracy. To the best of our knowledge, even with exact computations, if there is no eigengap around threshold λ which is usually the case in real life it is unlikely that method can achieve a log(/ε) convergence with respect to the ε-parameter in (γ, ε)-approximate PCP or PCR. Our experiments later (namely Figure 3(c) and 3(f)) shall also confirm on this. 8. Evaluation : With Eigengap Assumption In the first evaluation we consider matrices that satisfy the eigengap assumption. To simulate an eigengap, we use random datasets random-a with a =.,.,. and present our findings in Figure in terms of the following three performance measures: Regression Error: x x / x ; where x is the output of a PCR algorithm and x = (A A) P λ A b is the exact PCR solution. Projection Error: ξ ξ / ξ ; where ξ is the output of a PCP algorithm and ξ = P λ A b is the exact PCP solution. Denoising Error: (I P λ )ξ / ξ ; where ξ is the output of a PCP algorithm. The x-axis of these plots represent the number of calls to ridge regression, and in Figure we use exact implementations of ridge regression similar to the experiments in [3]. Note that the horizontal axis starts with for projection performances (second and third column) and with 9 This is a cheap procedure and for instance can be done by power method [3]. The original code [], when working with subspace of dimension k, requires k calls of ridge regression. In our experiments, we improved this implementation and reduced it from k calls to k calls for a stronger comparison. This is so because method works in a smaller dimension whose so-called Ritz values approximate the original eigenvalues of A A. However, this approximation cannot be exponentially close because there are only very few Ritz values as compared to the original eigenvalues of A A. 4

E+ E+ E- E- E- E-6 E-8 6 6 E- E-6 E-8 5 (a) random-., regression error 5 (b) random-., projection error E+ 5 (c) random-., denoising error E- E+ E- E- 5 E- E- E- 6 6 (d) random-., regression error E+ 5 5 (e) random-., projection error E+ 5 8 5 (f) random-., denoising error 9 E- E- E- E- E- E- 6 6 (g) random-., regression error 5 5 (h) random-., projection error 5 5 (i) random-., denoising error Figure : Performance comparison on random-a datasets with eigengap a >. In the plots, the x-axis represents the number of oracle calls to ridge regression and the y-axis represents performance. Denoting by x and ξ respectively the PCR and PCP outputs, then regression error is kx x k /kx k, projection error is kξ ξ k /kξ k, and denoising error is k(i Pλ )ξk /kξk. for regression performance (first column). This is so because in order to reduce PCR to PCP one needs m + calls to ridge regression in QuickPCR and in our experiments we simply choose m =. We make some important observations from these results We significantly outperform for our choices of a. Our performance degrades as a (and thus γ) decreases; this is consistent to our theory. The performance of method fluctuates partly due to the missing theory behind it. This limits the practicality of method, because it is hardly possible for the algorithm to determine when is the best time to stop the algorithm. If the fluctuation of method is ignored, it matches the performance of QuickPCP and QuickPCR. This is an interesting phenomenon and might even be a first evidence towards a theoretical proof for method. 8. Evaluation : Without Eigengap Assumption In our second evaluation we consider scenarios when there is no significant eigengap around the projection threshold λ. We consider dataset random-a for a = as well as dataset mnist. This Of course, if the true projection matrix Pλ is given explicitly, we can determine a good iteration to stop. However, the entire PCP problem is regarding how to compute Pλ without explicitly constructing it. 5

E+ E- E- 6 6 E+ (a) random-, regression error E+ E- E- 5 5 E- (b) random-, projection error E- E- E-5 5 5 (c) random-, denoising error (small) E- E-5 E- 3 5 7 9 4 6 8 E-6 4 6 8 (d) mnist, regression error (e) mnist, projection error (f) mnist, denoising error (small) Figure 3: Gap-free performance comparison on random- and mnist. In the plots, the x-axis represents the number of oracle calls to ridge regression and the y-axis represents performance. Denoting by x and ξ respectively the PCR and PCP outputs, then regression error is x x / x, projection error is ξ ξ / ξ, and denoising error (small) is (I P.8λ )ξ / ξ. time, we also consider three performance measures. The first two are the same as the previous subsection, as for the third measure, we replace it with denoising error (small): (I P.8λ )ξ / ξ. We emphasize here that in gap-free scenarios, regression error, projection error, or even the quantity (I P λ )ξ can all be very large in the extreme case if there is an eigenvector that has exactly eigenvalue λ, then these quantities do not converge to zero. This is why our gap-free approximation initions do not account for such quantities (see Def. 3. and Def. 3.). In contrast, by focusing only on eigenvectors that are less than threshold ( γ)λ for some γ >, and looking at (I P ( γ)λ )ξ, this quantity can indeed converge to ε > with a speed that is O(γ log(/ε)) if our algorithm is used (see Theorem 7.4). Note that this speed was only O(γ log(/ε)) for. We present our findings in Figure 3 and make some important conclusions here: Our method still significantly outperforms. In terms of denoising error, our method significantly outperforms method. This is so because, according to Remark 8., method cannot achieve a log(/ε) convergence rate with respect to the ε-parameter in (γ, ε)-approximate PCP or PCR. Threfore, our method is clearly the best for denoising purposes. 8.3 Evaluation 3: Stability Test In our third evaluation, we verify that our method continues to work well even if ridge regressions are computed with moderate error. We consider two types of errors in our experiments: 6

E- E- E- E-6 E-6 E-6 E-8 E-8 5 5 (a) random γ =., ridge-exact E-8 5 5 (b) random γ =., ridge-svrg E- E- E- E- E- E-5 E-5 5 5 5 5 (e) random γ =, ridge-svrg E- E- E-5 E-5 E-5 E-6 E-6 4 6 8 (g) mnist, ridge-exact 5 6 (f) random γ =, ridge- 5 E- 5 6 E-5 (d) random γ =, ridge-exact (c) random γ =., ridge- E- 5 E-6 4 6 8 (h) mnist, ridge-svrg 4 6 8 krylov (i) mnist, ridge- 7 Figure 4: Stability test exact vs. approximate ridge regression subroutines. In the plots, the x-axis represents the number of oracle calls to ridge regression and the y-axis represents the denoising error. We compare exact implementation of ridge regression with ridge-svrg and ridge- k. Remark. Although it seems our method is more affected by error than, we emphasize that this is because is too slow and still works in a very low-accuracy regime in the plots. (For instance, as a stable algorithm, should not be affected by error of magnitude around 6 when the desired accuracy is above 4.) ridge-svrg: we run the SVRG [7] method for 5 passes to solve each ridge regression.3 ridge- k : we run exact ridge regression but randomly add noise [ k, k ] per coordinate. We present our findings in Figure 4. For cleanness, we compare only the denoising error and only on datasets mnist, random- and random-..4 We make the following conclusions and remarks: Even with inexact ridge regression, our method still works very well. We continue to outperform significantly. Compared with method, we continue to outperform it significantly in gap-free scenarios. Although it seems our method is more affected by error than, we emphasize that this is because is too slow and still works in a very low-accuracy regime in the plots. (For 3 We choose the epoch length of SVRG to be n, and therefore full gradients are computed every n stochastic iterations. Each n stochastic iterations is counted as one pass of the data, and each full gradient computation is counted as one pass of the data. 4 Since mnist and random- are datasets without significant eigengap, we present denoising error (small) as ined in Section 8.. 7

instance, as a stable algorithm, should not be affected by error of magnitude around 6 when the desired accuracy is above 4.) 9 Conclusion We summarize our contributions. We put forward approximate notions for PCP and PCR that do not rely on any eigengap assumption. Our notions reduce to standard ones under the eigengap assumption. We design near-optimal polynomial approximation g(x) to sgn(x) satisfying (.) and (.). We develop general stable recurrence formula for matrix Chebyshev polynomials; as a corollary, our g(x) can be applied to matrices in a stable manner. We obtain faster, provable PCA-free algorithms for PCP and PCR than known results. Acknowledgements We thank Yin Tat Lee for suggesting us the new title, and anonymous referees for useful suggestions. Z. Allen-Zhu is partially supported an NSF Grant, no. CCF-4958, and a Microsoft Research Grant, no. 58584. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF or Microsoft. A Proof of Lemma 3.5 Appendix Lemma 3.5. For fixed λ, ε (, ) and γ (, /3], let A be a matrix whose singular values are no more than. Let ApxRidge be any O( ε )-approximate ridge regression solver, and B be any m (γ, O( ελ ))-approximate PCP solver. Then, procedure (3.) satisfies, m { } (I P ( γ)λ )s m ε b, and if m = Θ(log(/εγ)) As m b A(A A) P (+γ)λ A b b + ε b Proof of Lemma 3.5. We first notice that the approximation guarantee of B implies s = B(A b) A b + O ( ελ/m ) b b. Let us consider a new exact sequence {s k } k where s = P ( γ)λ s, s = (A A + λi) s, k : s k+ = s + λ (A A + λi) s k. Step I. We first bound the error between s k and s k. We have s k+ s +λ A A+λI s k s + s k which implies s k k s k λ s k λ b. Therefore, s k+ s k+ s s + λ (A A + λi) s k ApxRidge(A, λ, s k) s s + λ (A A + λi) (s k s k) + λ (A A + λi) s k ApxRidge(A, λ, s k ) s s + s k s k + O ( λε/m ) s k. (A.) Since s k k λ b m λ b and since s s O( ελ m ) b, we can conclude from (A.) (by telescoping sum over k =,..., k ) that k {,,..., m}: s k s k ε b. 8

Step II. We next focus on s k and decompose s k into three parts: for every k, ine v,k = P (+γ)λ s k =: P s k, v,k = (I P ( γ)λ )s k =: P s k, v 3,k = (P ( γ)λ P (+γ)λ )s k =: P 3s k. The update rule of s k tells us that i [3], k : v i,k = λ k ( λ(a A + λi) ) ts P i. t= In particular, since v, = P s = we always have v,m =. As for v,m and v 3,m, we first notice that if we denote by p k (x) = k t= λt x t, then v i,k = p k ((A A + λi) )P i s. But since lim k p k (x) = =: p(x), we have x λx lim v i,k = p((a A + λi) )P i s = (A A) P i s = (A A) v i,. k At the same time, note that the spectral norms λ (A A + λi) P and λ (A A + λi) P 3 are both no more than 3 4. (This is so because for every eigenvalue λ j of A A that is below λ( γ) λ λ we have λ+λ j λ+(/3)λ = 3 4.) Therefore, for both i = and i = 3, we have vi,m lim v i,k λ (A A + λi) P i t s k λ (3/4) m O ( b /λ ). t=m+ In other words, choosing m = Θ(log(/ελ)), we have Step III. v,m (A A) v, ε b and v 3,m (A A) v 3, ε b. (A.) We now take into account the error of the PCP solver B. For v,m, we have: v,m (A A) P A b v,m (A A) v, + (A A) P (B(A b) A b) ε b + λ P ( B(A b) A b ) ε b, (A.3) where the first inequality uses triangle inequality, the second uses (A.), and the third uses Def. 3. and A b b. As for v 3,m, we let A = UΣV be the SVD of A and let Σ be the same matrix Σ except all non-zero elements get inverted. We have A(v 3,m (A A) P 3 A b) = 3 = 4 A ( (A A) v 3, (A A) P 3 A b ) + Av3,m A(A A) v 3, A ( (A A) v 3, (A A) P 3 A b ) + ε b UΣ V ( P 3 B(A b) A b ) + ε b Σ V P 3 (B(A b) A b) + ε b i:λ i [( γ)λ,(+γ)λ] i:λ i [( γ)λ,(+γ)λ] λi v i, B(A b) A b + ε b λi v i, A b + ε b = (A A) P 3 A b + ε b. (A.4) Above, uses triangle inequality, uses (A.) and the fact A, 3 uses U, 4 uses Def. 3. and A b b. 9

Step IV. Finally we put everything together and bound the regression error. Denote by opt = A(A A) P (+γ)λ A b b. If we decompose b as ( 3 ) b = A(A A) P i A b + (b A(A A) A b), (A.5) i= then the four vectors in (A.5) are orthogonal to each other, which gives us opt = A(A A) P A b b = A(A A) P A b + A(A A) P 3 A b + A(A A) A b b. (A.6) Now we compute the regression error with respect to s m: As m b = A(v,m + v 3,m ) b 3 = A(v,m + v 3,m ) A(A A) P i A b + (b A(A A) A b) i= 3 A(v,m (A A) P A b) + A(v 3,m (A A) P 3 A b) 4 + A(A A) P A b + A(A A) A b b 4ε b + A(A A) P A b + A(A A) P 3 A b + A(A A) A b b 5 = opt + 4ε b. Above, is because v,m = ; uses (A.5); 3 uses triangle inequality; 4 uses (A.3) and (A.4); 5 uses (A.6). Finally, using s m s m ε b we complete the proof that As m b opt + 5ε b. We also have P s m ε b + P s m = ε b because P s m = v,m =. B Appendix for Section 4 Lemma 4.. Suppose f(z) is analytic on E ρ and for every k, f (k) (). Then, for every n N, letting p n (x) and q n (x) be be the degree-n Chebyshev truncated series and Chebyshev interpolation of f(x), we have y [, ρ]: p n ( + y), q n ( + y) f( + y). To show Lemma 4. we first need an auxiliary lemma, which can be proved by some careful case analysis (see Appendix B.). Lemma B.. Let m, n N be two integers, then a m,n = x m T x n(x)dx. Lemma B. essentially says that the Chebyshev coefficients of any function x m must be all non-negative. We also recall the following lemma regarding high-order derivatives of Chebyshev truncated series: Lemma B. (cf. Theorem. of [3]). Suppose f(z) is analytic on E ρ with ρ >, and let p n (x) be the degree-n Chebyshev truncated series of f(x). Then, for every k, { } lim max f (k) (x) p n (k) (x) =. n + x [,]

We are now ready to prove Lemma 4.. The main idea is to expand f into its Taylor series, and then deal with monomials x m one by one: Proof of Lemma 4.. Since f (k) () for all k, and since f(z) is analytic, we can write f as f(z) = k= r kz k where each r k is a nonnegative real. Consider the i-th coefficient of Chebyshev series: [i = ] f(x) a i = π T [i = ] x k i(x)dx = r x k π T i(x) x where the last inequality is due to Lemma B., and the integral and infinite Taylor sum are interchangeable. 5 This implies we can write p n (x) = n i= a it i (x) where each a i. Since each T i (+y) is a polynomial of degree i, it exactly equals to its degree-i Taylor expansion i y k k= k! T (k) i (). Thus, we have (recall y [, ρ]) ( n n i a i n n ) p n ( + y) = a i T i ( + y) = k! T (k) i ()y k = a i T (k) i () y k. k! i= k= i=k i= k= Denote by b k,n = ( n i=k a it (k) i () ). Since for every i, k it satisfies T (k) i () (which is a factual property of Chebyshev polynomial) and a i, we know b k,n and moreover b k,n is monotonically non-decreasing in n for each k. On the other hand, Lemma B. implies lim () f (k) () = lim b k,n f (k) () =, n p (k) n k= n so we must have b k,n f (k) () for every n N (because b k,n is non-decreasing in n). Therefore, for every y [, ρ]: n p n ( + y) = k! b k,ny k k! b k,ny k k! f (k) ()y k = f( + y). (B.) k= Finally, since q n (x) = n k= c kt k (x) is a degree-n Chebyshev interpolation polynomial, the aliasing Lemma.8 tells us c i for every i =,,..., n. Furthermore, applying the aliasing Lemma.8 again we have c i a i for i =,,..., n but n i= c i = i= a i. Therefore, using the fact that T (k) i () is a monotone increasing function in i (for every fixed k), we have n c i T (k) i () a i T (k) i () = lim b k,n = f (k) (). n i= i= Finally, an analogous proof as (B.) also shows q n ( + y) f( + y) for every y [, ρ]. B. Proof of Lemma B. Lemma B.. Let m, n N be two integers, then a m,n = x m T x n(x)dx. 5 The interchangeability and be verified as follows. Denoting by f m(x) = m k= rmxm, we have f m(x) uniformly converges to f(x) on x [, ] because the Taylor expansion of any analytical function has local uniform convergence, but [, ] is a compact, closed interval so local uniform convergence becomes global uniform convergence. For every ε >, let M be the integer so that for every m M it satisfies max x [,] f m(x) f(x) ε. We compute that f(x) Ti(x)dx m x k= r k xk Ti(x)dx = f(x) f m(x) T x i(x)dx x ε dx = επ. x Therefore, the left hand side converges to zero so the integral and the infinite Taylor sum are interchangeable. k= k=