arxiv: v2 [stat.ml] 24 Apr 2017


Faster Principal Component Regression and Stable Matrix Chebyshev Approximation

Zeyuan Allen-Zhu, zeyuan@csail.mit.edu, Princeton University / IAS
Yuanzhi Li, yuanzhil@cs.princeton.edu, Princeton University

August 16, 2016

Abstract

We solve principal component regression (PCR), up to a multiplicative accuracy $1+\gamma$, by reducing the problem to $\tilde{O}(\gamma^{-1})$ black-box calls of ridge regression. Therefore, our algorithm does not require any explicit construction of the top principal components, and is suitable for large-scale PCR instances. In contrast, the previous result requires $\tilde{O}(\gamma^{-2})$ such black-box calls. We obtain this result by developing a general stable recurrence formula for matrix Chebyshev polynomials, and a degree-optimal polynomial approximation to the matrix sign function. Our techniques may be of independent interest, especially when designing iterative methods.

1 Introduction

In machine learning and statistics, it is often desirable to represent a large-scale dataset in a more tractable, lower-dimensional form, without losing too much information. One of the most robust ways to achieve this goal is through principal component projection (PCP):

PCP: project vectors onto the span of the top principal components of a matrix.

It is well known that PCP decreases noise and increases efficiency in downstream tasks. One of the main applications is principal component regression (PCR):

PCR: linear regression, but restricted to the subspace of top principal components.

Classical algorithms for PCP or PCR rely on a principal component analysis (PCA) solver to recover the top principal components first; with these components available, the tasks of PCP and PCR become trivial because the projection matrix can be constructed explicitly. Unfortunately, PCA solvers demand a running time that scales at least linearly with the number of top principal components chosen for the projection.
For instance, to project a vector onto the top 1000 principal components of a high-dimensional dataset, even the most efficient Krylov-based [8] or Lanczos-based [4] methods require a running time proportional to $1000 \times 40 = 4 \times 10^4$ times the input matrix sparsity, if the Krylov or Lanczos method is executed for 40 iterations. This is usually computationally intractable.

1.1 Approximating PCP Without PCA

In this paper, we propose the following notion of PCP approximation. Given a data matrix $A \in \mathbb{R}^{d' \times d}$ (with singular values no greater than 1) and a threshold $\lambda > 0$, we say that an algorithm solves $(\gamma, \varepsilon)$-approximate PCP if, informally speaking and up to a multiplicative $1 \pm \varepsilon$ error, it projects (see Def. 3.1 for a formal definition)

1. any eigenvector $\nu$ of $A^\top A$ with value in $[\lambda(1+\gamma), 1]$ to $\nu$,
2. any eigenvector $\nu$ of $A^\top A$ with value in $[0, \lambda(1-\gamma)]$ to $\vec{0}$,
3. any eigenvector $\nu$ of $A^\top A$ with value in $[\lambda(1-\gamma), \lambda(1+\gamma)]$ to anywhere between $\vec{0}$ and $\nu$.

Such a definition also extends to $(\gamma, \varepsilon)$-approximate PCR (see Def. 3.2).

It was first noticed by Frostig et al. [3] that approximate PCP and PCR can be solved with a running time independent of the number of principal components above threshold $\lambda$. More specifically, they reduced $(\gamma, \varepsilon)$-approximate PCP and PCR to $O(\gamma^{-2} \log(1/\varepsilon))$ black-box calls of any ridge regression subroutine, where each call computes $(A^\top A + \lambda I)^{-1} u$ for some vector $u$. Our main focus in this paper is to quadratically improve this performance and reduce PCP and PCR to $O(\gamma^{-1} \log(1/\gamma\varepsilon))$ black-box calls of any ridge regression subroutine, where each call again computes $(A^\top A + \lambda I)^{-1} u$.

Footnote 1: Ridge regression is often considered an easy-to-solve machine learning problem: using for instance SVRG [7], one can usually solve ridge regression to a $10^{-8}$ accuracy with at most 40 passes of the data.

Remark 1.1. Frostig et al. only showed that their algorithm satisfies properties 1 and 2 of $(\gamma, \varepsilon)$-approximation (but not property 3), and thus their proof was only for matrices $A$ with no singular value in the range $[\sqrt{\lambda(1-\gamma)}, \sqrt{\lambda(1+\gamma)}]$. This is known as the eigengap assumption, which is rarely satisfied in practice [8]. In this paper, we prove our result both with and without such an eigengap assumption. Since our techniques also imply that the algorithm of Frostig et al. satisfies property 3, throughout the paper we say that Frostig et al. solve $(\gamma, \varepsilon)$-approximate PCP and PCR.

1.2 From PCP to Polynomial Approximation

The main technique of Frostig et al. is to construct a polynomial that approximates the sign function $\mathrm{sgn}(x) \colon [-1,1] \to \{\pm 1\}$:

$$\mathrm{sgn}(x) = \begin{cases} +1, & x \ge 0; \\ -1, & x < 0. \end{cases}$$

In particular, given any polynomial $g(x)$ satisfying

$$|g(x) - \mathrm{sgn}(x)| \le \varepsilon \quad \forall x \in [-1, -\gamma] \cup [\gamma, 1], \text{ and} \tag{1.1}$$
$$|g(x)| \le 1 \quad \forall x \in [-\gamma, \gamma], \tag{1.2}$$

the problem of $(\gamma, \varepsilon)$-approximate PCP can be reduced to computing the matrix polynomial $g(S)$ for $S := (A^\top A + \lambda I)^{-1}(A^\top A - \lambda I)$ (cf. Fact 7.1). In other words, to project any vector $\chi \in \mathbb{R}^d$ onto the top principal components, we can compute $g(S)\chi$ instead; and to compute $g(S)\chi$, we can reduce it to one ridge regression for each evaluation of $Su$ for some vector $u$.

Remark 1.2. Since the transformation from $A^\top A$ to $S$ is not linear, the final approximation to the PCP is a rational function (as opposed to a polynomial) over $A^\top A$. We restrict ourselves to polynomial choices of $g(\cdot)$ because in this way the final rational function has all its denominators equal to $A^\top A + \lambda I$, and thus reduces to ridge regressions.

Remark 1.3. The transformation from $A^\top A$ to $S$ ensures that all the eigenvalues of $A^\top A$ in the range $(1 \pm \gamma)\lambda$ roughly map to eigenvalues of $S$ in the range $[-\gamma, \gamma]$.

Main Challenges. There are two main challenges regarding the design of the polynomial $g(x)$.

Efficiency. We wish to minimize the degree $n = \deg(g(x))$, because the computation of $g(S)\chi$ usually requires $n$ calls of ridge regression.

Stability. We wish $g(x)$ to be stable; that is, $g(S)\chi$ must be given by a recursive formula such that, if we make $\varepsilon'$ error in each recursion (due to error incurred from ridge regression), the final error of $g(S)\chi$ is at most $\varepsilon' \times \mathrm{poly}(d)$.

Remark 1.4. Efficient routines such as SVRG [7] solve ridge regression, and thus compute $Su$ for any $u \in \mathbb{R}^d$, with running times that depend only logarithmically on $1/\varepsilon'$. Therefore, by setting $\varepsilon' = \varepsilon/\mathrm{poly}(d)$, one blows up the running time by only a small factor $O(\log(d))$ in order to obtain an $\varepsilon$-accurate solution for $g(S)\chi$.

The polynomial $g(x)$ constructed by Frostig et al. comes from a truncated Taylor expansion. It has degree $O(\gamma^{-2} \log(1/\varepsilon))$ and is stable. This $\gamma^{-2}$ dependency limits the practical performance of their proposed PCP and PCR algorithms, especially in a high-accuracy regime. At the same time, the optimal degree for a polynomial to satisfy even only (1.1) is $\Theta(\gamma^{-1} \log(1/\varepsilon))$ [9, ]. Frostig et al. were unable to find a stable polynomial matching this optimal degree and left it as an open question.

1.3 Our Results and Main Ideas

We provide an efficient and stable polynomial approximation to the matrix sign function that has a near-optimal degree $O(\gamma^{-1} \log(1/\gamma\varepsilon))$.
At a high level, we construct a polynomial $q(x)$ that approximately equals $\left(\frac{1+\kappa-x}{2}\right)^{-1/2}$ for some $\kappa = \Theta(\gamma^2)$; then we set $g(x) := x \cdot q(1 + \kappa - 2x^2)$, which approximates $\mathrm{sgn}(x)$.

To construct $q(x)$, we first note that $\left(\frac{1+\kappa-x}{2}\right)^{-1/2}$ has no singular point on $[-1,1]$, so we can apply Chebyshev approximation theory to obtain some $q(x)$ of degree $O(\gamma^{-1} \log(1/\gamma\varepsilon))$ satisfying

$$\left| \left(\tfrac{1+\kappa-x}{2}\right)^{-1/2} - q(x) \right| \le \varepsilon \quad \text{for every } x \in [-1, 1].$$

This can be shown to imply $|g(x) - \mathrm{sgn}(x)| \le \varepsilon$ for every $x \in [-1,-\gamma] \cup [\gamma, 1]$, so (1.1) is satisfied. In order to prove (1.2) (i.e., $|g(x)| \le 1$ for every $x \in [-\gamma, \gamma]$), we prove a separate lemma (see Footnote 3):

$$0 \le q(x) \le \left(\tfrac{1+\kappa-x}{2}\right)^{-1/2} \quad \text{for every } x \in [1, 1+\kappa].$$

Footnote 2: Using degree reduction, Frostig et al. found an explicit polynomial $g(x)$ of degree $O(\gamma^{-1} \log(1/\gamma\varepsilon))$ satisfying (1.1). However, that polynomial is unstable because it is constructed monomial by monomial and has exponentially large coefficients in front of each monomial. Furthermore, it is not clear whether their polynomial satisfies (1.2).

Footnote 3: We proved a general lemma which holds for any function all of whose derivatives are non-negative at $x = 0$.

Note that this does not follow from standard Chebyshev theory, because Chebyshev approximation guarantees hold only for $x \in [-1, 1]$ and do not extend to the singular point $x = 1 + \kappa$.

This proves the efficiency part of the main challenges discussed earlier. As for the stability part, we prove a general theorem regarding any weighted sum of Chebyshev polynomials applied to matrices. We provide a backward recurrence algorithm and show that it is stable under noisy computations. This may be of independent interest.

For interested readers, we compare our polynomial $g(x)$ with that of Frostig et al. in Figure 1.

[Figure 1: Comparing our polynomial $g(x)$ (orange solid curve) with that of Frostig et al. (blue dashed curve). Panels (a), (b), (c) show three different polynomial degrees.]

1.4 Related Work

There are a few attempts to reduce the cost of PCA when solving PCR, for instance by approximating the matrix $AP_\lambda$ where $P_\lambda$ is the PCP projection matrix [6, 7]. However, their running time scales linearly with the number of principal components above $\lambda$.

A significant number of papers have focused on the low-rank case of PCA [, 4, 8] and its online variant [3]. Unfortunately, all of these methods require a running time that scales at least linearly with the number of top principal components.

More related to this paper is work on the matrix sign function, which plays an important role in control theory and quantum chromodynamics. Several results have addressed Krylov methods for applying the sign function in the so-called Krylov subspace, without explicitly constructing any approximate polynomial [, 4]. However, Krylov methods are not $(\gamma, \varepsilon)$-approximate PCP solvers, and there is no supporting stability theory behind them (see Footnote 4). Other iterative methods have also been proposed, see Section 5 of textbook [6]. For instance, Schur's method is a slow one and also requires the matrix to be explicitly given. Newton's iteration and its numerous variants (e.g. [9]) provide rational approximations to the matrix sign function, as opposed to polynomial approximations. Our result and Frostig et al. [3] differ from these cited works because we only have access to an approximate ridge regression oracle, so ensuring a polynomial approximation to the sign function and ensuring its stability are crucial.

Using matrix Chebyshev polynomials to approximate matrix functions is not new. Perhaps the most celebrated example is to approximate $S^{-1}$ using polynomials in $S$, used in the analysis of conjugate gradient []. Independently of this paper (see Footnote 5), Han et al. [5] used Chebyshev polynomials to approximate the trace of the matrix sign function, i.e., $\mathrm{Tr}(\mathrm{sgn}(S))$, which is a similar but different problem (see Footnote 6). Also, they did not study the case when the matrix-vector multiplication oracle is only approximate (as we do in this paper), or the case when $S$ has eigenvalues in the range $[-\gamma, \gamma]$.

Footnote 4: We have nevertheless included the Krylov method in our empirical evaluation section and shall discuss its performance there, see for instance Remark 8.1.

Footnote 5: Their paper appeared online two months before ours, and we became aware of their work in March 2017.

Footnote 6: In particular, the degree of their Chebyshev polynomial is $O\big(\gamma^{-2}(\log^2(1/\gamma) + \log(1/\gamma)\log(1/\varepsilon))\big)$ in the language of this paper; in contrast, we have degree $O(\gamma^{-1} \log(1/\gamma\varepsilon))$.

Roadmap. In Section 2, we provide notations for this paper and basics for Chebyshev polynomials. In Section 3, we put forward our formal definitions for approximate PCP and PCR, and show a reduction from approximate PCR to approximate PCP. In Section 4, we prove a general lemma regarding Chebyshev approximations outside $[-1, 1]$. In Section 5, we design our polynomial approximation to $\mathrm{sgn}(x)$. In Section 6, we show how to stably compute any weighted sum of Chebyshev polynomials. In Section 7, we provide pseudocode and prove our main theorems regarding PCP and PCR. In Section 8, we provide empirical evaluations of our theory.

2 Preliminaries

We denote by $\mathbb{1}[e] \in \{0, 1\}$ the indicator function for event $e$, by $\|v\|$ or $\|v\|_2$ the Euclidean norm of a vector $v$, by $M^\dagger$ the Moore-Penrose pseudo-inverse of a symmetric matrix $M$, and by $\|M\|_2$ its spectral norm. We sometimes use $\vec{v}$ to emphasize that $v$ is a vector.

Given a symmetric $d \times d$ matrix $M$ and any $f \colon \mathbb{R} \to \mathbb{R}$, $f(M)$ is the matrix function applied to $M$, which is equal to $U \mathrm{diag}\{f(D_1), \dots, f(D_d)\} U^\top$ if $M = U \mathrm{diag}\{D_1, \dots, D_d\} U^\top$ is its eigendecomposition.

Throughout the paper, matrix $A$ is of dimension $d' \times d$. We denote by $\sigma_{\max}(A)$ the largest singular value of $A$. Following the tradition of [3] and keeping the notations light, we assume without loss of generality that $\sigma_{\max}(A) \le 1$. We are interested in PCP and PCR problems with an eigenvalue threshold $\lambda \in (0, 1)$.

Throughout the paper, we denote by $\lambda_1 \ge \cdots \ge \lambda_d \ge 0$ the eigenvalues of $A^\top A$, and by $\nu_1, \dots, \nu_d \in \mathbb{R}^d$ the eigenvectors of $A^\top A$ corresponding to $\lambda_1, \dots, \lambda_d$. We denote by $P_\lambda$ the projection matrix $P_\lambda := (\nu_1, \dots, \nu_j)(\nu_1, \dots, \nu_j)^\top$ where $j$ is the largest index satisfying $\lambda_j \ge \lambda$. In other words, $P_\lambda$ is the projection matrix onto the eigenvectors of $A^\top A$ with eigenvalues $\ge \lambda$.

Definition 2.1. The principal component projection (PCP) of $\chi \in \mathbb{R}^d$ at threshold $\lambda$ is $\xi^* = P_\lambda \chi$.

Definition 2.2. The principal component regression (PCR) of regressand $b \in \mathbb{R}^{d'}$ at threshold $\lambda$ is

$$x^* = \arg\min_{y \in \mathbb{R}^d} \|A P_\lambda y - b\|_2 \quad \text{or equivalently} \quad x^* = (A^\top A)^{-1} P_\lambda (A^\top b).$$

2.1 Ridge Regression

Definition 2.3. A black-box algorithm $\mathrm{ApxRidge}(A, \lambda, u)$ is an $\varepsilon$-approximate ridge regression solver if, for every $u \in \mathbb{R}^d$, it satisfies $\|\mathrm{ApxRidge}(A, \lambda, u) - (A^\top A + \lambda I)^{-1} u\| \le \varepsilon \|u\|$ (see Footnote 7).

Ridge regression is equivalent to solving well-conditioned linear systems, or minimizing strongly convex and smooth objectives $f(y) := \frac{1}{2} y^\top (A^\top A + \lambda I) y - u^\top y$.

Remark 2.4. There is a huge literature on efficient algorithms for solving ridge regression. Most notably, (1) conjugate gradient [] or accelerated gradient descent [] gives the fastest full-gradient methods; (2) SVRG [7] and its acceleration Katyusha [] give the fastest stochastic-gradient methods; and (3) NUACDM [5] gives the fastest coordinate-descent method.

Footnote 7: In fact, throughout the paper, we only need ApxRidge to satisfy this property with high probability for each $u$.

The running time of (1) is $O(\mathrm{nnz}(A) \lambda^{-1/2} \log(1/\varepsilon))$, where $\mathrm{nnz}(A)$ is the time to multiply $A$ with any vector. The running times of (2) and (3) depend on structural properties of $A$ and are always faster than (1).

Because the best complexity of ridge regression depends on the structural properties of $A$, following Frostig et al., we only state our running time in terms of the number of black-box calls to a ridge regression solver.

2.2 Chebyshev Polynomials

Definition 2.5. Chebyshev polynomials of the 1st and 2nd kind are $\{T_n(x)\}_{n \ge 0}$ and $\{U_n(x)\}_{n \ge 0}$ where

$$T_0(x) := 1, \quad T_1(x) := x, \quad T_{n+1}(x) := 2x \cdot T_n(x) - T_{n-1}(x),$$
$$U_0(x) := 1, \quad U_1(x) := 2x, \quad U_{n+1}(x) := 2x \cdot U_n(x) - U_{n-1}(x).$$

Fact 2.6 ([3]). It satisfies $\frac{d}{dx} T_n(x) = n U_{n-1}(x)$ for $n \ge 1$, and for every $n \ge 0$:

$$T_n(x) = \begin{cases} \cos(n \arccos(x)), & \text{if } |x| \le 1; \\ \cosh(n \,\mathrm{arccosh}(x)), & \text{if } x \ge 1; \\ (-1)^n \cosh(n \,\mathrm{arccosh}(-x)), & \text{if } x \le -1. \end{cases}$$

In particular, when $x \ge 1$,

$$T_n(x) = \tfrac{1}{2}\left[ \left(x - \sqrt{x^2-1}\right)^n + \left(x + \sqrt{x^2-1}\right)^n \right], \qquad U_n(x) = \frac{1}{2\sqrt{x^2-1}}\left[ \left(x + \sqrt{x^2-1}\right)^{n+1} - \left(x - \sqrt{x^2-1}\right)^{n+1} \right].$$

Definition 2.7. For a function $f(x)$ whose domain contains $[-1, 1]$, its degree-$n$ Chebyshev truncated series and degree-$n$ Chebyshev interpolation are respectively

$$p_n(x) := \sum_{k=0}^{n} a_k T_k(x) \quad \text{and} \quad q_n(x) := \sum_{k=0}^{n} c_k T_k(x),$$

where

$$a_k := \frac{2 - \mathbb{1}[k=0]}{\pi} \int_{-1}^{1} \frac{f(x) T_k(x)}{\sqrt{1-x^2}} \, dx \quad \text{and} \quad c_k := \frac{2 - \mathbb{1}[k=0]}{n+1} \sum_{j=0}^{n} f(x_j) T_k(x_j).$$

Above, $x_j := \cos\left(\frac{(j+0.5)\pi}{n+1}\right) \in [-1, 1]$ is the $j$-th Chebyshev point of order $n$.

The following lemma is known as the aliasing formula for Chebyshev coefficients:

Lemma 2.8 (cf. Theorem 4.2 of [3]). Let $f$ be Lipschitz continuous on $[-1, 1]$ and let $\{a_k\}, \{c_k\}$ be defined as in Def. 2.7. Then

$$c_0 = a_0 + a_{2n} + a_{4n} + \cdots, \qquad c_n = a_n + a_{3n} + a_{5n} + \cdots,$$

and for every $k \in \{1, 2, \dots, n-1\}$:

$$c_k = a_k + (a_{k+2n} + a_{k+4n} + \cdots) + (a_{-k+2n} + a_{-k+4n} + \cdots).$$

Definition 2.9. For every $\rho > 0$, let $E_\rho$ be the ellipse of foci $\pm 1$ with major radius $1 + \rho$. (This is also known as the Bernstein ellipse with parameter $1 + \rho + \sqrt{2\rho + \rho^2}$.)

The following lemma is the main theory regarding Chebyshev approximation:

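Lemma 2.10 (cf. Theorems 8.1 and 8.2 of [3]). Suppose $f(z)$ is analytic on $E_\rho$ and $|f(z)| \le M$ on $E_\rho$. Let $p_n(x)$ and $q_n(x)$ be the degree-$n$ Chebyshev truncated series and Chebyshev interpolation of $f(x)$ on $[-1, 1]$. Then,

$$\max_{x \in [-1,1]} |f(x) - p_n(x)| \le \frac{2M}{\rho + \sqrt{2\rho + \rho^2}} \left(1 + \rho + \sqrt{2\rho + \rho^2}\right)^{-n};$$
$$\max_{x \in [-1,1]} |f(x) - q_n(x)| \le \frac{4M}{\rho + \sqrt{2\rho + \rho^2}} \left(1 + \rho + \sqrt{2\rho + \rho^2}\right)^{-n};$$
$$|a_0| \le M \quad \text{and} \quad |a_k| \le 2M \left(1 + \rho + \sqrt{2\rho + \rho^2}\right)^{-k} \text{ for } k \ge 1.$$

3 Approximate PCP and PCR

We formalize our notions of approximation for PCP and PCR, and provide a reduction from PCR to PCP.

3.1 Our Notions of Approximation

Recall that Frostig et al. [3] work only with matrices $A$ that satisfy the eigengap assumption, that is, $A$ has no singular value in the range $[\sqrt{\lambda(1-\gamma)}, \sqrt{\lambda(1+\gamma)}]$. Their approximation guarantees are very straightforward:

an output $\xi$ is $\varepsilon$-approximate for PCP on vector $\chi$ if $\|\xi - \xi^*\| \le \varepsilon \|\chi\|$;
an output $x$ is $\varepsilon$-approximate for PCR with regressand $b$ if $\|x - x^*\| \le \varepsilon \|b\|$.

Unfortunately, these notions are too strong and impossible to satisfy for matrices that do not have a large eigengap around the projection threshold $\lambda$. In this paper we propose the following more general (yet still meaningful) approximation notions.

Definition 3.1. An algorithm $\mathcal{B}(\chi)$ is $(\gamma, \varepsilon)$-approximate PCP for threshold $\lambda$ if, for every $\chi \in \mathbb{R}^d$:

1. $\|P_{(1+\gamma)\lambda}(\mathcal{B}(\chi) - \chi)\| \le \varepsilon \|\chi\|$.
2. $\|(I - P_{(1-\gamma)\lambda}) \mathcal{B}(\chi)\| \le \varepsilon \|\chi\|$.
3. For every $i$ such that $\lambda_i \in [(1-\gamma)\lambda, (1+\gamma)\lambda]$, it satisfies $|\langle \nu_i, \mathcal{B}(\chi) - \chi \rangle| \le |\langle \nu_i, \chi \rangle| + \varepsilon \|\chi\|$.

Intuitively, the first property states that, if projected to the eigenspace with eigenvalues above $(1+\gamma)\lambda$, then $\mathcal{B}(\chi)$ and $\chi$ are almost identical; the second property states that, if projected to the eigenspace with eigenvalues below $(1-\gamma)\lambda$, then $\mathcal{B}(\chi)$ is almost zero; and the third property states that, for each eigenvector $\nu_i$ with eigenvalue in the range $[(1-\gamma)\lambda, (1+\gamma)\lambda]$, the projection $\langle \nu_i, \mathcal{B}(\chi) \rangle$ must be between $0$ and $\langle \nu_i, \chi \rangle$ (up to an error $\varepsilon \|\chi\|$).

Naturally, $P_\lambda(\chi)$ itself is a $(0, 0)$-approximate PCP.

We propose the following notion for approximate PCR:

Definition 3.2. An algorithm $\mathcal{C}(b)$ is $(\gamma, \varepsilon)$-approximate PCR for threshold $\lambda$ if, for every $b \in \mathbb{R}^{d'}$:

1. $\|(I - P_{(1-\gamma)\lambda}) \mathcal{C}(b)\| \le \varepsilon \|b\|$.
2. $\|A \mathcal{C}(b) - b\| \le \|A x^* - b\| + \varepsilon \|b\|$,

where $x^* = (A^\top A)^{-1} P_{(1+\gamma)\lambda} A^\top b$ is the exact PCR solution for threshold $(1+\gamma)\lambda$.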
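To make the three properties of Definition 3.1 concrete, here is a minimal NumPy sketch that measures, for a given output vector, the smallest ε for which each property holds. It is only an illustration under stated assumptions: the function names `exact_pcp` and `pcp_approximation_gaps` are hypothetical, and a dense eigendecomposition is used purely as a reference oracle (the paper's algorithms deliberately avoid it).

```python
import numpy as np

def exact_pcp(A, lam, chi):
    """Reference PCP: project chi onto eigenvectors of A^T A with eigenvalue >= lam."""
    evals, evecs = np.linalg.eigh(A.T @ A)
    top = evecs[:, evals >= lam]
    return top @ (top.T @ chi)

def pcp_approximation_gaps(A, lam, gamma, chi, out):
    """Return (eps1, eps2, eps3): the smallest eps for which `out`, viewed as
    B(chi), satisfies properties 1, 2 and 3 of Definition 3.1 respectively."""
    evals, evecs = np.linalg.eigh(A.T @ A)   # eigenpairs (lambda_i, nu_i) of A^T A
    coef_out = evecs.T @ out                 # <nu_i, B(chi)>
    coef_chi = evecs.T @ chi                 # <nu_i, chi>
    norm_chi = np.linalg.norm(chi)

    top = evals >= (1 + gamma) * lam         # range of P_{(1+gamma) lam}
    low = evals < (1 - gamma) * lam          # range of I - P_{(1-gamma) lam}
    mid = (evals >= (1 - gamma) * lam) & (evals <= (1 + gamma) * lam)

    eps1 = np.linalg.norm((coef_out - coef_chi)[top]) / norm_chi
    eps2 = np.linalg.norm(coef_out[low]) / norm_chi
    # Property 3: |<nu_i, B(chi) - chi>| - |<nu_i, chi>| must not exceed eps * ||chi||.
    slack = np.abs(coef_out - coef_chi)[mid] - np.abs(coef_chi)[mid]
    eps3 = slack.max(initial=0.0) / norm_chi
    return eps1, eps2, eps3
```

With this checker, the exact projection should report gaps at the level of floating-point rounding, while returning χ unchanged violates property 2 and returning the zero vector violates property 1.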

The first notion states that the output $x = \mathcal{C}(b)$ has nearly no correlation with eigenvectors below threshold $(1-\gamma)\lambda$; the second states that the regression error should be nearly optimal with respect to the exact PCR solution, but at a different threshold $(1+\gamma)\lambda$.

Relationship to Frostig et al. Under the eigengap assumption, our notions are equivalent to those of Frostig et al.:

Fact 3.3. If $A$ has no singular value in $[\sqrt{\lambda(1-\gamma)}, \sqrt{\lambda(1+\gamma)}]$, then

Def. 3.1 is equivalent to $\|\mathcal{B}(\chi) - P_\lambda(\chi)\| \le O(\varepsilon) \|\chi\|$.
Def. 3.2 implies $\|\mathcal{C}(b) - x^*\| \le O(\varepsilon/\lambda) \|b\|$, and $\|\mathcal{C}(b) - x^*\| \le O(\varepsilon) \|b\|$ implies Def. 3.2.

Above, $x^* = (A^\top A)^{-1} P_\lambda A^\top b$ is the exact PCR solution.

3.2 Reductions from PCR to PCP

If the PCP solution $\xi = P_\lambda(A^\top b)$ is computed exactly, then by definition one can compute $(A^\top A)^{-1} \xi$, which gives a solution to PCR, by solving a linear system. However, as pointed out by Frostig et al. [3], this computation is problematic if $\xi$ is only approximate. The following approach has been proposed by Frostig et al. to improve its accuracy:

compute $p\big((A^\top A + \lambda I)^{-1}\big)\xi$, where $p(x)$ is a polynomial that approximates the function $\frac{x}{1 - \lambda x}$.

This is a good approximation to $(A^\top A)^{-1} \xi$ because the composition of the functions $\frac{x}{1 - \lambda x}$ and $\frac{1}{x + \lambda}$ is exactly $x^{-1}$. Frostig et al. picked $p(x) = p_m(x) = \sum_{t=1}^{m} \lambda^{t-1} x^t$, which is a truncated Taylor series, and used the following procedure to compute $s_m \approx p_m\big((A^\top A + \lambda I)^{-1}\big)\xi$:

$$s_0 = \mathcal{B}(A^\top b), \quad s_1 = \mathrm{ApxRidge}(A, \lambda, s_0), \quad \forall k \ge 1 \colon\ s_{k+1} = s_1 + \lambda \cdot \mathrm{ApxRidge}(A, \lambda, s_k). \tag{3.1}$$

Above, $\mathcal{B}$ is an approximate PCP solver and ApxRidge is an approximate ridge regression solver. Under the eigengap assumption, Frostig et al. [3] showed that

Lemma 3.4 (PCR-to-PCP). For fixed $\lambda, \gamma, \varepsilon \in (0, 1)$, let $A$ be a matrix whose singular values lie in $[0, \sqrt{(1-\gamma)\lambda}] \cup [\sqrt{(1+\gamma)\lambda}, 1]$. Let ApxRidge be any $O(\frac{\varepsilon}{m^2})$-approximate ridge regression solver, and let $\mathcal{B}$ be any $(\gamma, O(\frac{\varepsilon\lambda}{m^2}))$-approximate PCP solver (see Footnote 8). Then, procedure (3.1) satisfies

$$\|s_m - (A^\top A)^{-1} P_\lambda A^\top b\| \le \varepsilon \|b\| \quad \text{if } m = \Theta(\log(1/\varepsilon\gamma)).$$

Unfortunately, the above lemma does not hold without the eigengap assumption.
In this paper, we fix this issue by proving the following analogous lemma:

Lemma 3.5 (gap-free PCR-to-PCP). For fixed $\lambda, \varepsilon \in (0, 1)$ and $\gamma \in (0, 2/3]$, let $A$ be a matrix whose singular values are no more than 1. Let ApxRidge be any $O(\frac{\varepsilon}{m^2})$-approximate ridge regression solver, and let $\mathcal{B}$ be any $(\gamma, O(\frac{\varepsilon\lambda}{m^2}))$-approximate PCP solver. Then, procedure (3.1) satisfies

$$\|(I - P_{(1-\gamma)\lambda}) s_m\| \le \varepsilon \|b\| \quad \text{and} \quad \|A s_m - b\| \le \|A (A^\top A)^{-1} P_{(1+\gamma)\lambda} A^\top b - b\| + \varepsilon \|b\| \quad \text{if } m = \Theta(\log(1/\varepsilon\gamma)).$$

Note that the conclusion of this lemma exactly corresponds to the two properties in our Def. 3.2. The proof of Lemma 3.5 is not hard, but requires a very careful case analysis by decomposing the vector $b$ and each $s_k$ into three components, corresponding to eigenvalues of $A^\top A$ in the ranges $[0, (1-\gamma)\lambda]$, $[(1-\gamma)\lambda, (1+\gamma)\lambda]$ and $[(1+\gamma)\lambda, 1]$. We defer the details to Appendix A.

Footnote 8: Recall from Fact 3.3 that this requirement is equivalent to saying that $\|\mathcal{B}(\chi) - P_\lambda \chi\| \le O(\frac{\varepsilon\lambda}{m^2}) \|\chi\|$.
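Procedure (3.1) can be sketched in a few lines of NumPy. The sketch below is illustrative only and makes simplifying assumptions that are not part of the paper: the PCP solver and the ridge solver are both exact (dense eigendecomposition and `np.linalg.solve`), and the helper names `exact_pcp`, `exact_pcr`, `ridge_solve`, `pcr_via_pcp` and `gapped_matrix` are hypothetical. With exact oracles, the output equals $p_m((A^\top A + \lambda I)^{-1}) P_\lambda A^\top b$, so the only remaining error is the truncation of the Taylor series.

```python
import numpy as np

def exact_pcp(A, lam, chi):
    """Stand-in for the PCP solver B: exact projection via eigendecomposition."""
    evals, evecs = np.linalg.eigh(A.T @ A)
    top = evecs[:, evals >= lam]
    return top @ (top.T @ chi)

def exact_pcr(A, lam, b):
    """Reference PCR solution: invert A^T A on the span of eigenvalues >= lam."""
    evals, evecs = np.linalg.eigh(A.T @ A)
    keep = evals >= lam
    top = evecs[:, keep]
    return top @ ((top.T @ (A.T @ b)) / evals[keep])

def ridge_solve(A, lam, u):
    """Stand-in for ApxRidge: returns (A^T A + lam I)^{-1} u exactly."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), u)

def pcr_via_pcp(A, b, lam, m, pcp=exact_pcp, apx_ridge=ridge_solve):
    """Procedure (3.1): s_0 = B(A^T b), s_1 = Ridge(s_0),
    s_{k+1} = s_1 + lam * Ridge(s_k) for k >= 1; returns s_m (m >= 1)."""
    s0 = pcp(A, lam, A.T @ b)
    s1 = apx_ridge(A, lam, s0)
    s = s1
    for _ in range(m - 1):
        s = s1 + lam * apx_ridge(A, lam, s)
    return s

def gapped_matrix(rows=30, cols=20, seed=0):
    """Test matrix (an assumption for this demo) whose singular values avoid
    a neighbourhood of sqrt(0.1), so lam = 0.1 has an eigengap."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    V, _ = np.linalg.qr(rng.standard_normal((cols, cols)))
    sv = np.concatenate([np.linspace(0.05, 0.2, cols // 2),
                         np.linspace(0.5, 1.0, cols - cols // 2)])
    return U @ np.diag(sv) @ V.T
```

On an eigenvalue $x \ge \lambda$ the truncated series has relative error $(\lambda/(x+\lambda))^m \le 2^{-m}$, which is why a logarithmic number of iterations suffices in this idealized setting.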

4 Property of Chebyshev Approximation Outside $[-1, 1]$

Classical Chebyshev approximation theory (such as Lemma 2.10) only talks about the behavior of $p_n(x)$ or $q_n(x)$ on the interval $[-1, 1]$. However, for the purpose of this paper, we must also bound its value for $x > 1$. We prove the following general lemma in Appendix B, and believe it could be of independent interest (we denote by $f^{(k)}(x)$ the $k$-th derivative of $f$ at $x$):

Lemma 4.1. Suppose $f(z)$ is analytic on $E_\rho$ and for every $k \ge 0$, $f^{(k)}(0) \ge 0$. Then, for every $n \in \mathbb{N}$, letting $p_n(x)$ and $q_n(x)$ be the degree-$n$ Chebyshev truncated series and Chebyshev interpolation of $f(x)$, we have

$$\forall y \in [0, \rho] \colon \quad 0 \le p_n(1+y),\ q_n(1+y) \le f(1+y).$$

5 Our Polynomial Approximation of sgn(x)

For fixed $\kappa \in (0, 1]$, we consider the degree-$n$ Chebyshev interpolation $q_n(x) = \sum_{k=0}^{n} c_k T_k(x)$ of the function $f(x) = \left(\frac{1+\kappa-x}{2}\right)^{-1/2}$ on $[-1, 1]$. Def. 2.7 tells us that

$$c_k := \frac{2 - \mathbb{1}[k=0]}{n+1} \sum_{j=0}^{n} \cos\left(\frac{k(j+0.5)\pi}{n+1}\right) \left( \frac{1 + \kappa - \cos\left(\frac{(j+0.5)\pi}{n+1}\right)}{2} \right)^{-1/2}.$$

Our final polynomial to approximate $\mathrm{sgn}(x)$ is therefore

$$g_n(x) = x \cdot q_n(1 + \kappa - 2x^2) \quad \text{and} \quad \deg(g_n(x)) = 2n + 1.$$

We prove the following theorem in this section:

Theorem 5.1. For every $\alpha \in (0, 1]$ and $\varepsilon \in (0, 1/2)$, choosing $\kappa = 2\alpha^2$, our function $g_n(x) := x \cdot q_n(1 + \kappa - 2x^2)$ satisfies that, as long as $n \ge \frac{1}{\sqrt{2}\alpha} \log \frac{3}{\varepsilon\alpha^2}$ (see also Figure 1):

$|g_n(x) - \mathrm{sgn}(x)| \le \varepsilon$ for every $x \in [-1, -\alpha] \cup [\alpha, 1]$;
$g_n(x) \in [0, 1]$ for every $x \in [0, \alpha]$ and $g_n(x) \in [-1, 0]$ for every $x \in [-\alpha, 0]$.

Note that our degree $n = O(\alpha^{-1} \log(1/\alpha\varepsilon))$ is near-optimal, because the minimum degree for a polynomial to satisfy even only the first item is $\Theta(\alpha^{-1} \log(1/\varepsilon))$ [9, ]. However, the results of [9, ] are not constructive, and thus may not lead to stable matrix polynomials.

We prove Theorem 5.1 by first establishing two simple lemmas. The following lemma is a consequence of Lemma 2.10:

Lemma 5.2. For every $\varepsilon \in (0, 1/2)$ and $\kappa \in (0, 1]$, if $n \ge \frac{1}{\sqrt{\kappa}} \left( \log \frac{1}{\kappa} + \log \frac{4}{\varepsilon} \right)$, then

$$\forall x \in [-1, 1], \quad |f(x) - q_n(x)| \le \varepsilon.$$

Proof of Lemma 5.2. Denoting $f(z) = \left(\frac{1+\kappa-z}{2}\right)^{-0.5}$, we know that $f(z)$ is analytic on the ellipse $E_\rho$ with $\rho = \kappa/2$, and that it satisfies $|f(z)| \le 2/\sqrt{\kappa}$ in $E_\rho$. Applying Lemma 2.10, we know that when $n \ge \frac{1}{\sqrt{\kappa}} \left( \log \frac{1}{\kappa} + \log \frac{4}{\varepsilon} \right)$ it satisfies $|f(x) - q_n(x)| \le \varepsilon$.

The next lemma is an immediate consequence of our Lemma 4.1 with $f(z) = \left(\frac{1+\kappa-z}{2}\right)^{-0.5}$:

Lemma 5.3. For every $\varepsilon \in (0, 1/2)$, $\kappa \in (0, 1]$, $n \in \mathbb{N}$, and $x \in [0, \kappa]$, we have

$$0 \le q_n(1 + x) \le \left(\frac{\kappa - x}{2}\right)^{-1/2}.$$
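The construction of $g_n$ can be checked numerically. The following NumPy sketch computes the Chebyshev interpolation coefficients of $f(x) = ((1+\kappa-x)/2)^{-1/2}$ from Def. 2.7 and evaluates $g_n(x) = x \cdot q_n(1+\kappa-2x^2)$ with `numpy.polynomial.chebyshev.chebval`. The function names are hypothetical, and the degree rule in `degree_from_theorem` follows one reading of the bound in Theorem 5.1 (its constants are an assumption here, not something this sketch establishes).

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def sign_poly_coeffs(kappa, n):
    """Chebyshev interpolation coefficients c_0..c_n of
    f(x) = ((1 + kappa - x) / 2)^(-1/2) at the points x_j = cos((j+0.5)pi/(n+1))."""
    theta = (np.arange(n + 1) + 0.5) * np.pi / (n + 1)
    f_vals = ((1.0 + kappa - np.cos(theta)) / 2.0) ** -0.5
    k = np.arange(n + 1)
    # T_k(x_j) = cos(k * theta_j); the factor is (2 - 1[k = 0]) / (n + 1).
    c = (2.0 / (n + 1)) * (np.cos(np.outer(k, theta)) @ f_vals)
    c[0] /= 2.0
    return c

def g_n(x, c, kappa):
    """Evaluate g_n(x) = x * q_n(1 + kappa - 2 x^2), where q_n = sum_k c_k T_k."""
    x = np.asarray(x, dtype=float)
    return x * cheb.chebval(1.0 + kappa - 2.0 * x ** 2, c)

def degree_from_theorem(alpha, eps):
    """Assumed degree rule: n >= log(3 / (eps * alpha^2)) / (sqrt(2) * alpha)."""
    return int(np.ceil(np.log(3.0 / (eps * alpha ** 2)) / (np.sqrt(2.0) * alpha)))
```

For example, with $\alpha = 0.1$ and $\varepsilon = 0.01$ one would set `kappa = 2 * alpha**2`, `n = degree_from_theorem(alpha, eps)`, and compare `g_n(x, sign_poly_coeffs(kappa, n), kappa)` against `np.sign(x)` on $[\alpha, 1]$, and against the bound 1 on $[-\alpha, \alpha]$.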

Proof of Theorem 5.1. We are now ready to prove Theorem 5.1.

When $x \in [-1, -\alpha] \cup [\alpha, 1]$, it satisfies $1 + \kappa - 2x^2 \in [-1, 1]$. Therefore, applying Lemma 5.2, we have that whenever $n \ge \frac{1}{\sqrt{\kappa}} \log \frac{6}{\varepsilon\kappa} = \frac{1}{\sqrt{2}\alpha} \log \frac{3}{\varepsilon\alpha^2}$ it satisfies $|f(1 + \kappa - 2x^2) - q_n(1 + \kappa - 2x^2)| \le \varepsilon$. This further implies

$$|g_n(x) - \mathrm{sgn}(x)| = |x q_n(1+\kappa-2x^2) - x f(1+\kappa-2x^2)| \le |x| \cdot |f(1+\kappa-2x^2) - q_n(1+\kappa-2x^2)| \le \varepsilon.$$

When $|x| \le \alpha$, it satisfies $1 + \kappa - 2x^2 \in [1, 1+\kappa]$. Applying Lemma 5.3 we have

$$\forall x \in [0, \alpha] \colon \quad 0 \le g_n(x) = x \cdot q_n(1 + \kappa - 2x^2) \le x \cdot (x^2)^{-1/2} = 1,$$

and similarly for $x \in [-\alpha, 0]$ it satisfies $0 \ge g_n(x) \ge -1$.

A Bound on Chebyshev Coefficients. We also give an upper bound on the coefficients of the polynomial $q_n(x)$. Its proof can be found in Appendix C, and this upper bound shall be used in our final stability analysis.

Lemma 5.4 (coefficients of $q_n$). Let $q_n(x) = \sum_{k=0}^{n} c_k T_k(x)$ be the degree-$n$ Chebyshev interpolation of $f(x) = \left(\frac{1+\kappa-x}{2}\right)^{-1/2}$ on $[-1, 1]$. Then,

$$\forall i \in \{0, 1, \dots, n\} \colon \quad |c_i| \le \frac{e\sqrt{32(i+1)}}{\kappa} \left(1 + \kappa + \sqrt{2\kappa + \kappa^2}\right)^{-i}.$$

6 Stable Computation of Matrix Chebyshev Polynomials

In this section we show that any polynomial that is a weighted sum of Chebyshev polynomials with bounded coefficients can be stably computed when applied to matrices with approximate computations. We achieve this by first generalizing Clenshaw's backward method to the matrix case in Section 6.1, in order to compute a matrix variant of the Chebyshev sum, and then analyzing its stability in Section 6.2 with the help of Elliott's forward-backward transformation [8].

Remark 6.1. We wish to point out that although Chebyshev polynomials are known to be stable under error when computed on scalars [4], it is not immediately clear why this also holds for matrices. Recall that Chebyshev polynomials satisfy $T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)$. In the matrix case, we have $T_{n+1}(M)\chi = 2M T_n(M)\chi - T_{n-1}(M)\chi$ where $\chi \in \mathbb{R}^d$ is a vector. If we analyzed this formula coordinate by coordinate, the error could blow up by a factor $d$ per iteration. In addition, we need to ensure that the stability theorem holds for matrices $M$ with eigenvalues that can exceed 1.
This is not standard because Chebyshev polynomials are typically analyzed only on the domain $[-1, 1]$.

6.1 Clenshaw's Method in Matrix Form

In the scalar case, Clenshaw's method (sometimes referred to as backward recurrence) is one of the most widely used implementations for Chebyshev polynomials. We now generalize it to matrices. Consider any computation of the form

$$\vec{s}_N := \sum_{k=0}^{N} T_k(M) \vec{c}_k \in \mathbb{R}^d \quad \text{where } M \in \mathbb{R}^{d \times d} \text{ is symmetric and each } \vec{c}_k \text{ is in } \mathbb{R}^d. \tag{6.1}$$

(Note that for PCP and PCR purposes, it suffices to consider $\vec{c}_k = c'_k \chi$ where $c'_k \in \mathbb{R}$ is a scalar and $\chi \in \mathbb{R}^d$ is a fixed vector for all $k$. However, we need to work with this more general form for our stability analysis.)
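As an illustration of the sum (6.1) in the scalar-coefficient case $\vec{c}_k = c'_k \chi$, the sketch below evaluates it in two ways: directly through the three-term recurrence of Def. 2.5, and through the classical Clenshaw backward recurrence $b_r = 2Mb_{r+1} - b_{r+2} + c_r$ with output $b_0 - Mb_1$. The function names and the noise model in `noisy_matvec` are assumptions made for this demo; the backward routine only accesses $M$ through a matrix-vector callback, so an inexact oracle can be plugged in.

```python
import numpy as np

def chebyshev_sum_direct(M, coeffs, chi):
    """Reference: sum_k coeffs[k] * T_k(M) chi using T_{k+1} = 2 M T_k - T_{k-1}."""
    t_prev = chi.copy()                      # T_0(M) chi
    total = coeffs[0] * t_prev
    if len(coeffs) == 1:
        return total
    t_cur = M @ chi                          # T_1(M) chi
    total = total + coeffs[1] * t_cur
    for k in range(2, len(coeffs)):
        t_prev, t_cur = t_cur, 2.0 * (M @ t_cur) - t_prev
        total = total + coeffs[k] * t_cur
    return total

def chebyshev_sum_clenshaw(matvec, coeffs, chi):
    """Backward recurrence: b_{N+1} = 0, b_N = c_N chi,
    b_r = 2 M b_{r+1} - b_{r+2} + c_r chi, and the sum equals b_0 - M b_1."""
    N = len(coeffs) - 1
    b_next2 = np.zeros_like(chi)             # holds b_{r+2}
    b_next = coeffs[N] * chi                 # holds b_{r+1}; starts as b_N
    for r in range(N - 1, -1, -1):
        b_r = 2.0 * matvec(b_next) - b_next2 + coeffs[r] * chi
        b_next2, b_next = b_next, b_r
    # Here b_next = b_0 and b_next2 = b_1.
    return b_next - matvec(b_next2)

def noisy_matvec(M, eps, rng):
    """Hypothetical inexact oracle: returns M u plus noise of norm eps * ||u||."""
    def apply(u):
        noise = rng.standard_normal(u.shape)
        noise *= eps * np.linalg.norm(u) / np.linalg.norm(noise)
        return M @ u + noise
    return apply
```

Passing `lambda u: M @ u` as `matvec` reproduces the exact sum, while `noisy_matvec(M, eps, rng)` gives a quick empirical feel for how per-step errors accumulate.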

Vector $\vec{s}_N$ can be computed using the following procedure:

Lemma 6.2 (backward recurrence). $\vec{s}_N = \vec{b}_0 - M \vec{b}_1$ where

$$\vec{b}_{N+1} := \vec{0}, \quad \vec{b}_N := \vec{c}_N, \quad \text{and} \quad \forall r \in \{N-1, \dots, 0\} \colon\ \vec{b}_r := 2M \vec{b}_{r+1} - \vec{b}_{r+2} + \vec{c}_r \in \mathbb{R}^d.$$

6.2 Inexact Clenshaw's Method in Matrix Form

We show that, if implemented using the backward recurrence formula, the Chebyshev sum of (6.1) can be stably computed. We define the following model to capture the error with respect to matrix-vector multiplications.

Definition 6.3 (inexact backward recurrence). Let $\mathcal{M}$ be an approximate algorithm that satisfies $\|\mathcal{M}(u) - Mu\| \le \varepsilon \|u\|$ for every $u \in \mathbb{R}^d$. Then, define the inexact backward recurrence to be

$$\hat{b}_{N+1} := \vec{0}, \quad \hat{b}_N := \vec{c}_N, \quad \text{and} \quad \forall r \in \{N-1, \dots, 0\} \colon\ \hat{b}_r := 2\mathcal{M}(\hat{b}_{r+1}) - \hat{b}_{r+2} + \vec{c}_r \in \mathbb{R}^d,$$

and define the output as $\hat{s}_N := \hat{b}_0 - \mathcal{M}(\hat{b}_1)$.

The following theorem gives an error analysis for our inexact backward recurrence. We prove it in Appendix D; the main idea of our proof is to convert each error vector of a recursion of the backward procedure into an error vector corresponding to some original $\vec{c}_k$.

Theorem 6.4 (stable Chebyshev sum). For every $N \in \mathbb{N}$, suppose the eigenvalues of $M$ are in $[a, b]$ and suppose there are parameters $C_U \ge 1$, $C_T \ge 1$, $\rho \ge 1$, $C_c \ge 0$ satisfying

$$\forall k \in \{0, 1, \dots, N\} \colon \quad \rho^k \|\vec{c}_k\| \le C_c \quad \text{and} \quad \forall x \in [a, b] \colon\ |T_k(x)| \le C_T \rho^k \text{ and } |U_k(x)| \le C_U \rho^k.$$

Then, if the inexact backward recurrence in Def. 6.3 is applied with $\varepsilon \le \frac{1}{4NC_U}$, we have

$$\|\hat{s}_N - \vec{s}_N\| \le \varepsilon \cdot 2(1 + 2NC_T) N C_U C_c.$$

7 Algorithms and Main Theorems for PCP and PCR

We are now ready to state our main theorems for PCP and PCR. We first note a simple fact:

Fact 7.1. $(P_\lambda)\chi = \frac{I + \mathrm{sgn}(S)}{2} \chi$ where $S := 2(A^\top A + \lambda I)^{-1} A^\top A - I = (A^\top A + \lambda I)^{-1}(A^\top A - \lambda I)$.

In other words, for every vector $\chi \in \mathbb{R}^d$, the exact PCP solution $P_\lambda(\chi)$ is the same as computing $(P_\lambda)\chi = \frac{I + \mathrm{sgn}(S)}{2} \chi$. Thus, we can use our polynomial $g_n(x)$ introduced in Section 5 and compute $g_n(S)\chi \approx \mathrm{sgn}(S)\chi$. Finally, in order to compute $g_n(S)\chi$, we need to multiply $S$ with $\deg(g_n)$ vectors; whenever we do so, we call ridge regression once.

7.1 Our Pseudocode

First of all, we can approximately compute $S\chi$ for an arbitrary $\chi \in \mathbb{R}^d$.
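This simply uses one oracle call to ridge regression, see Algorithm 1. Next, since we are interested in $(\gamma, \varepsilon)$-approximate PCP, we want $g_n(x)$ to be close to $\mathrm{sgn}(x)$ on all eigenvalues of $A^\top A$ that are outside $[(1-\gamma)\lambda, (1+\gamma)\lambda]$, or equivalently on all eigenvalues of $S$ outside the range

$$\left[ \frac{(1-\gamma) - 1}{(1-\gamma) + 1},\ \frac{(1+\gamma) - 1}{(1+\gamma) + 1} \right] = \left[ -\frac{\gamma}{2-\gamma},\ \frac{\gamma}{2+\gamma} \right].$$

Algorithm 1 MultS(A, λ, χ)
Input: $A \in \mathbb{R}^{d' \times d}$; $\lambda > 0$; $\chi \in \mathbb{R}^d$.
Output: a vector that approximately equals $S\chi = (A^\top A + \lambda I)^{-1}(A^\top A - \lambda I)\chi$.
    return $\mathrm{ApxRidge}(A, \lambda, A^\top A \chi - \lambda \chi)$.

Since this new interval contains $[-\alpha, \alpha]$ for $\alpha := \gamma/(2+\gamma) = \gamma/2 - O(\gamma^2)$, we can apply Theorem 5.1, which gives us a polynomial $g_n(x) = x \cdot q_n(1 + \kappa - 2x^2)$ where $\kappa = 2\alpha^2 = 2(\gamma/(2+\gamma))^2$. We use the (inexact) backward recurrence (see Lemma 6.2) to compute the Chebyshev interpolation polynomial $u \leftarrow q_n\big((1+\kappa)I - 2S^2\big)\chi$. Our final output for approximate PCP is simply $\frac{Su + \chi}{2}$, because $P_\lambda = \frac{\mathrm{sgn}(S) + I}{2}$ and $S q_n\big((1+\kappa)I - 2S^2\big) \approx \mathrm{sgn}(S)$. We summarize this algorithm as QuickPCP(A, χ, λ, γ, n) in Algorithm 2.

Algorithm 2 QuickPCP(A, χ, λ, γ, n)
Input: $A \in \mathbb{R}^{d' \times d}$, data matrix satisfying $\sigma_{\max}(A) \le 1$; $\chi \in \mathbb{R}^d$, vector to project; $\lambda > 0$, eigenvalue threshold for PCP; $\gamma \in (0, 2/3]$, PCP approximation ratio (one can also ignore $\gamma$ and set $\gamma = 0$, see Remark 7.5); $n$, number of iterations.
Output: a vector $\xi \in \mathbb{R}^d$ satisfying $\xi \approx P_\lambda(\chi)$.
    $\gamma \leftarrow \max\{\gamma, \log(n)/n\}$    (if $\gamma$ is too small, work in a $\gamma$-free regime, see Remark 7.5)
    $\kappa \leftarrow 2(\gamma/(2+\gamma))^2$    (recall $\kappa = 2\alpha^2 = 2(\gamma/(2+\gamma))^2$ in our analysis)
    Define $c_k \leftarrow \frac{2 - \mathbb{1}[k=0]}{n+1} \sum_{j=0}^{n} \cos\left(\frac{k(j+0.5)\pi}{n+1}\right) \left( \frac{1 + \kappa - \cos\left(\frac{(j+0.5)\pi}{n+1}\right)}{2} \right)^{-1/2}$    (coefficients for $q_n(x)$)
    $b_{n+1} \leftarrow \vec{0}$, $b_n \leftarrow c_n \chi$
    for $r \leftarrow n-1$ to $0$ do
        $w \leftarrow (1+\kappa) b_{r+1} - 2 \cdot \mathrm{MultS}(A, \lambda, \mathrm{MultS}(A, \lambda, b_{r+1}))$    ($w \approx ((1+\kappa)I - 2S^2) b_{r+1}$)
        $b_r \leftarrow 2w - b_{r+2} + c_r \chi$
    end for
    $u \leftarrow \mathrm{MultS}(A, \lambda, b_0 - w)$    ($u = S\, q_n((1+\kappa)I - 2S^2)\chi \approx \mathrm{sgn}(S)\chi$)
    return $\frac{1}{2}(u + \chi)$    (the output is $\approx \frac{\mathrm{sgn}(S) + I}{2}\chi$)

Finally, we apply the PCR-to-PCP reduction (see Section 3) to derive a solution for PCR from an approximate solution for PCP. See QuickPCR(A, b, λ, γ, n, m) in Algorithm 3.

Algorithm 3 QuickPCR(A, b, λ, γ, n, m)
Input: $A, \lambda, \gamma, n$ the same as in QuickPCP; $b \in \mathbb{R}^{d'}$ is the regressand vector; $m$ is the number of iterations for PCR (a small $m$ is sufficient for practical purposes).
Output: a vector $x \in \mathbb{R}^d$ that solves approximate PCR.
    $v \leftarrow \mathrm{QuickPCP}(A, A^\top b, \lambda, \gamma, n)$, $s \leftarrow v$, $s_1 \leftarrow \mathrm{ApxRidge}(A, \lambda, v)$
    for $r \leftarrow 1$ to $m$ do
        $s \leftarrow \lambda \cdot \mathrm{ApxRidge}(A, \lambda, s) + s_1$
    return $s$

Fact 7.2. QuickPCP calls ridge regression $2n + 1$ times and QuickPCR calls it $2n + m + 2$ times.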
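The pseudocode for MultS and QuickPCP can be turned into a short NumPy sketch. This is an illustrative reading of the algorithm rather than a reference implementation: the function name `quick_pcp` and the single-argument callback `apx_ridge` (assumed to return $(A^\top A + \lambda I)^{-1}u$, here solved exactly by default) are assumptions of this example, and $n \ge 1$ is required.

```python
import numpy as np

def quick_pcp(A, chi, lam, gamma, n, apx_ridge=None):
    """Sketch of QuickPCP: approximate P_lam chi using 2n + 1 ridge-regression
    calls. `apx_ridge(u)` should return (A^T A + lam I)^{-1} u; an exact dense
    solve is used when it is not supplied (an assumption for this demo)."""
    d = A.shape[1]
    if apx_ridge is None:
        gram = A.T @ A + lam * np.eye(d)
        apx_ridge = lambda u: np.linalg.solve(gram, u)

    def mult_s(v):
        # Algorithm 1 (MultS): S v = (A^T A + lam I)^{-1} (A^T A v - lam v).
        return apx_ridge(A.T @ (A @ v) - lam * v)

    gamma = max(gamma, np.log(n) / n)
    kappa = 2.0 * (gamma / (2.0 + gamma)) ** 2

    # Chebyshev interpolation coefficients of f(x) = ((1 + kappa - x) / 2)^(-1/2).
    theta = (np.arange(n + 1) + 0.5) * np.pi / (n + 1)
    f_vals = ((1.0 + kappa - np.cos(theta)) / 2.0) ** -0.5
    c = (2.0 / (n + 1)) * (np.cos(np.outer(np.arange(n + 1), theta)) @ f_vals)
    c[0] /= 2.0

    # Backward recurrence for q_n(M) chi with M = (1 + kappa) I - 2 S^2.
    b_next2 = np.zeros_like(chi)             # b_{r+2}
    b_next = c[n] * chi                      # b_{r+1}, starts as b_n
    w = np.zeros_like(chi)
    for r in range(n - 1, -1, -1):
        w = (1.0 + kappa) * b_next - 2.0 * mult_s(mult_s(b_next))  # w = M b_{r+1}
        b_r = 2.0 * w - b_next2 + c[r] * chi
        b_next2, b_next = b_next, b_r
    # Now b_next = b_0 and w = M b_1, so b_0 - w = q_n(M) chi.
    u = mult_s(b_next - w)                   # u ~ sgn(S) chi
    return 0.5 * (u + chi)
```

Swapping in an iterative solver (for example a conjugate-gradient routine wrapped to match the `apx_ridge(u)` signature) changes only the callback; the recurrence itself is untouched.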

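7.2 Our Main Theorems

We first state our main theorem under the eigengap assumption, in order to provide a direct comparison to that of Frostig et al. [3].

Theorem 7.3 (eigengap assumption). Given $A \in \mathbb{R}^{d' \times d}$ and $\lambda, \gamma \in (0, 1)$, assume that the singular values of $A$ are in the range $[0, \sqrt{(1-\gamma)\lambda}] \cup [\sqrt{(1+\gamma)\lambda}, 1]$. Given $\chi \in \mathbb{R}^d$ and $b \in \mathbb{R}^{d'}$, denote by $\xi^* = P_\lambda \chi$ and $x^* = (A^\top A)^{-1} P_\lambda A^\top b$ the exact PCP and PCR solutions. If ApxRidge is an $\varepsilon'$-approximate ridge regression solver, then

the output $\xi \leftarrow \mathrm{QuickPCP}(A, \chi, \lambda, \gamma, n)$ satisfies $\|\xi^* - \xi\| \le \varepsilon \|\chi\|$ if $n = \Theta\left(\gamma^{-1} \log \frac{1}{\gamma\varepsilon}\right)$ and $\log(1/\varepsilon') = \Theta\left(\log \frac{1}{\gamma\varepsilon}\right)$;

the output $x \leftarrow \mathrm{QuickPCR}(A, b, \lambda, \gamma, n, m)$ satisfies $\|x - x^*\| \le \varepsilon \|b\|$ if $n = \Theta\left(\gamma^{-1} \log \frac{1}{\gamma\lambda\varepsilon}\right)$, $m = \Theta\left(\log \frac{1}{\gamma\varepsilon}\right)$ and $\log(1/\varepsilon') = \Theta\left(\log \frac{1}{\gamma\lambda\varepsilon}\right)$.

In contrast, the number of ridge-regression oracle calls was $\Theta\left(\gamma^{-2} \log \frac{1}{\gamma\varepsilon}\right)$ for PCP and $\Theta\left(\gamma^{-2} \log \frac{1}{\gamma\lambda\varepsilon}\right)$ for PCR in [3]. We include the proof of Theorem 7.3 in Appendix E.

Next we state our stronger theorem without the eigengap assumption.

Theorem 7.4 (gap-free). Given $A \in \mathbb{R}^{d' \times d}$, $\lambda \in (0, 1)$, and $\gamma \in (0, 2/3]$, assume that $\|A\|_2 \le 1$. Given $\chi \in \mathbb{R}^d$ and $b \in \mathbb{R}^{d'}$, and supposing ApxRidge is an $\varepsilon'$-approximate ridge regression solver, then

QuickPCP outputs $\xi$ that is $(\gamma, \varepsilon)$-approximate PCP with $O\left(\gamma^{-1} \log \frac{1}{\gamma\varepsilon}\right)$ oracle calls to ApxRidge, as long as $\log(1/\varepsilon') = \Theta\left(\log \frac{1}{\gamma\varepsilon}\right)$;

QuickPCR outputs $x$ that is $(\gamma, \varepsilon)$-approximate PCR with $O\left(\gamma^{-1} \log \frac{1}{\gamma\lambda\varepsilon}\right)$ oracle calls to ApxRidge, as long as $\log(1/\varepsilon') = \Theta\left(\log \frac{1}{\gamma\lambda\varepsilon}\right)$.

We make a final remark here regarding the practical usage of QuickPCP and QuickPCR.

Remark 7.5. Since our theory is for $(\gamma, \varepsilon)$-approximations that have two parameters, the user in principle has to feed in both $\gamma$ and $n$ (in addition to other default inputs such as $A$, $b$ and $\lambda$). In practice, however, it is usually sufficient to obtain $(\varepsilon, \varepsilon)$-approximate PCP and PCR. Therefore, our pseudocode allows users to set $\gamma = 0$ and thus ignore this parameter; in such a case, we use $\gamma = \log(n)/n$, which is equivalent to setting $\gamma = \Theta(\varepsilon)$ because $n = \Theta(\gamma^{-1} \log(1/\gamma\varepsilon))$.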
8 Experiments

In the same way as [3], we conclude with an empirical evaluation to demonstrate our theorems.

Datasets. We consider synthetic and real-life datasets. We generate the synthetic dataset in the same way as [3]. That is, we form a 3 dimensional matrix A via the SVD $A = U\Sigma V^\top$, where $U$ and $V$ are random orthonormal matrices and $\Sigma$ contains random singular values. Among the singular values, we let half of them be randomly chosen from [,.( a)] and the other half randomly chosen from [.(+a), ]. We generate vector $b$ by adding noise to the response $Ax$ of a random true $x$ that correlates with $A$'s top principal components. We consider eigenvalue threshold λ =., and use a =,.,.,. in our experiments. We call these datasets random-a.

As for the real-life dataset, we use mnist []. After scaling its largest singular value to one,$^{9}$ we choose the eigenvalue threshold λ =.5 (or equivalently singular value threshold λ =.5). The closest singular values to this threshold are respectively.57 and

Algorithms. We implemented our algorithm and that of Frostig et al. [3], and minimized the number of calls to ridge regression in our implementations. For instance, with our pseudocode QuickPCP the number of ridge regression calls is determined by $n$, and with our pseudocode QuickPCR the number of extra ridge regression calls is determined by $m$. We fix $m$ to a small constant in all of our experiments because the theoretical prediction of $m$ is only a small logarithmic quantity (see Lemma 3.4 and Lemma 3.5).

We also implemented a practical heuristic using Krylov subspace techniques that were found on the website []. We call this algorithm Krylov method for short. Krylov method transforms the covariance matrix $A^\top A$ into a lower-dimensional subspace and performs exact PCP and PCR there. Similar to the algorithm of Frostig et al. [3], Krylov method also reduces PCP and PCR to multiple calls of ridge regression.$^{10}$ We emphasize that Krylov method has no supporting theory behind it. Since we find it performs much faster than the algorithm of Frostig et al. [3] in practice, we include it in our experiments for a stronger comparison.

Remark 8.1. There are two main issues behind the missing theory of Krylov method.
- Stability. If matrix-vector multiplications are only approximate, such subspace-based methods are usually unstable, so one needs to replace them with other stable variants. Our polynomial approximation $g_n(x)$ can be viewed as one such stable variant.
- Accuracy. To the best of our knowledge, even with exact computations, if there is no eigengap around the threshold $\lambda$ (which is usually the case in real life), it is unlikely that Krylov method can achieve a $\log(1/\varepsilon)$ convergence with respect to the $\varepsilon$-parameter in $(\gamma, \varepsilon)$-approximate PCP or PCR.$^{11}$ Our experiments later (namely Figure 3(c) and 3(f)) also confirm this.

$^{9}$ This is a cheap procedure and for instance can be done by power method [3].
$^{10}$ The original code [], when working with a subspace of dimension $k$, requires a number of ridge regression calls that grows linearly with $k$. In our experiments, we improved this implementation and reduced that number for a stronger comparison.
$^{11}$ This is so because Krylov method works in a smaller dimension whose so-called Ritz values approximate the original eigenvalues of $A^\top A$. However, this approximation cannot be exponentially close because there are only very few Ritz values as compared to the original eigenvalues of $A^\top A$.

8.1 Evaluation 1: With Eigengap Assumption

In the first evaluation we consider matrices that satisfy the eigengap assumption. To simulate an eigengap, we use the random datasets random-a with three positive values of $a$ and present our findings in Figure 2 in terms of the following three performance measures:
- Regression Error: $\|x - x^*\| / \|x^*\|$, where $x$ is the output of a PCR algorithm and $x^* = (A^\top A)^{-1} P_\lambda A^\top b$ is the exact PCR solution.
- Projection Error: $\|\xi - \xi^*\| / \|\xi^*\|$, where $\xi$ is the output of a PCP algorithm and $\xi^* = P_\lambda A^\top b$ is the exact PCP solution.
- Denoising Error: $\|(I - P_\lambda)\xi\| / \|\xi\|$, where $\xi$ is the output of a PCP algorithm.

The x-axis of these plots represents the number of calls to ridge regression, and in Figure 2 we use exact implementations of ridge regression similar to the experiments in [3]. Note that the horizontal axis starts at a different value for projection performance (second and third columns) than for regression performance (first column). This is so because reducing PCR to PCP requires extra calls to ridge regression in QuickPCR, whose number depends on $m$, and $m$ is fixed in our experiments.

Figure 2: Performance comparison on random-a datasets with eigengap $a > 0$. Panels (a)-(c), (d)-(f) and (g)-(i) show regression error, projection error and denoising error for the three random-a datasets. In the plots, the x-axis represents the number of oracle calls to ridge regression and the y-axis represents performance. Denoting by $x$ and $\xi$ respectively the PCR and PCP outputs, regression error is $\|x - x^*\| / \|x^*\|$, projection error is $\|\xi - \xi^*\| / \|\xi^*\|$, and denoising error is $\|(I - P_\lambda)\xi\| / \|\xi\|$.

We make some important observations from these results:
- We significantly outperform the algorithm of Frostig et al. [3] for our choices of $a$.
- Our performance degrades as $a$ (and thus $\gamma$) decreases; this is consistent with our theory.
- The performance of Krylov method fluctuates, partly due to the missing theory behind it. This limits the practicality of Krylov method, because it is hardly possible for the algorithm to determine the best time to stop.$^{12}$
- If the fluctuation of Krylov method is ignored, it matches the performance of QuickPCP and QuickPCR. This is an interesting phenomenon and might even be first evidence towards a theoretical proof for Krylov method.

$^{12}$ Of course, if the true projection matrix $P_\lambda$ is given explicitly, we can determine a good iteration to stop. However, the entire PCP problem is about how to compute $P_\lambda$ without explicitly constructing it.

8.2 Evaluation 2: Without Eigengap Assumption

In our second evaluation we consider scenarios where there is no significant eigengap around the projection threshold $\lambda$. We consider dataset random-a for $a = 0$ as well as dataset mnist. This time, we also consider three performance measures. The first two are the same as in the previous subsection; the third is replaced with denoising error (small): $\|(I - P_{.8\lambda})\xi\| / \|\xi\|$.

We emphasize here that in gap-free scenarios, regression error, projection error, or even the quantity $\|(I - P_\lambda)\xi\|$ can all be very large: in the extreme case, if there is an eigenvector that has exactly eigenvalue $\lambda$, then these quantities do not converge to zero. This is why our gap-free approximation definitions do not account for such quantities (see Def. 3.1 and Def. 3.2). In contrast, by focusing only on eigenvectors whose eigenvalues are below the threshold $(1-\gamma)\lambda$ for some $\gamma > 0$, and looking at $\|(I - P_{(1-\gamma)\lambda})\xi\|$, this quantity can indeed converge to $\varepsilon > 0$ with a speed that is $O(\gamma^{-1} \log(1/\varepsilon))$ if our algorithm is used (see Theorem 7.4). Note that this speed was only $O(\gamma^{-2} \log(1/\varepsilon))$ for the algorithm of Frostig et al. [3].

Figure 3: Gap-free performance comparison on random-0 and mnist. Panels (a)-(c) show regression error, projection error and denoising error (small) on random-0; panels (d)-(f) show the same on mnist. In the plots, the x-axis represents the number of oracle calls to ridge regression and the y-axis represents performance. Denoting by $x$ and $\xi$ respectively the PCR and PCP outputs, regression error is $\|x - x^*\| / \|x^*\|$, projection error is $\|\xi - \xi^*\| / \|\xi^*\|$, and denoising error (small) is $\|(I - P_{.8\lambda})\xi\| / \|\xi\|$.

We present our findings in Figure 3 and make some important conclusions here:
- Our method still significantly outperforms the algorithm of Frostig et al. [3].
- In terms of denoising error, our method significantly outperforms Krylov method. This is so because, according to Remark 8.1, Krylov method cannot achieve a $\log(1/\varepsilon)$ convergence rate with respect to the $\varepsilon$-parameter in $(\gamma, \varepsilon)$-approximate PCP or PCR. Therefore, our method is clearly the best for denoising purposes.
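The performance measures used in Sections 8.1 and 8.2 can be sketched as below. This is an illustration only: the helper names are hypothetical, the denoising error is normalized by the norm of the algorithm's output (our reading of the figure captions), and the `shrink` factor for the "small" variant is a free parameter of the sketch rather than the constant used in the experiments.

```python
import numpy as np

def projector_at(A, threshold):
    """Projector onto eigenvectors of A^T A with eigenvalue >= threshold."""
    evals, evecs = np.linalg.eigh(A.T @ A)
    V = evecs[:, evals >= threshold]
    return V @ V.T

def regression_error(x, x_star):
    return np.linalg.norm(x - x_star) / np.linalg.norm(x_star)

def projection_error(xi, xi_star):
    return np.linalg.norm(xi - xi_star) / np.linalg.norm(xi_star)

def denoising_error(A, xi, lam, shrink=1.0):
    """Mass of xi outside the top eigenspace; shrink < 1 gives the 'small' variant."""
    P = projector_at(A, shrink * lam)
    return np.linalg.norm(xi - P @ xi) / np.linalg.norm(xi)

# Toy check on a diagonal matrix: eigenvalues of A^T A are 1, 0.64, 0.36, 0.04, 0.01.
A = np.diag([1.0, 0.8, 0.6, 0.2, 0.1])
lam = 0.3
chi = np.ones(5)
xi_star = projector_at(A, lam) @ chi   # exact PCP output
noise_part = chi - xi_star             # component below the threshold
```

On this toy instance the exact PCP output has zero projection and denoising error, while a vector lying entirely below the threshold has denoising error one.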
8.3 Evaluation 3: Stability Test

In our third evaluation, we verify that our method continues to work well even if ridge regressions are computed with moderate error. We consider two types of errors in our experiments:
- ridge-svrg: we run the SVRG [7] method for 5 passes to solve each ridge regression.$^{13}$
- ridge-$10^{-k}$: we run exact ridge regression but randomly add noise in $[-10^{-k}, 10^{-k}]$ per coordinate.

Figure 4: Stability test: exact vs. approximate ridge regression subroutines. Panels (a)-(c), (d)-(f) and (g)-(i) show two random datasets and mnist, each with ridge-exact, ridge-svrg and noisy ridge regression. In the plots, the x-axis represents the number of oracle calls to ridge regression and the y-axis represents the denoising error. We compare the exact implementation of ridge regression with ridge-svrg and ridge-$10^{-k}$. Remark. Although it seems our method is more affected by error than the algorithm of Frostig et al. [3], we emphasize that this is because the latter is too slow and still works in a very low-accuracy regime in the plots. (For instance, as a stable algorithm, it should not be affected by error of magnitude around $10^{-6}$ when the desired accuracy is above $10^{-4}$.)

We present our findings in Figure 4. For cleanness, we compare only the denoising error and only on datasets mnist, random-0 and one random-a dataset with $a > 0$.$^{14}$ We make the following conclusions and remarks:
- Even with inexact ridge regression, our method still works very well.
- We continue to outperform the algorithm of Frostig et al. [3] significantly.
- Compared with Krylov method, we continue to outperform it significantly in gap-free scenarios.
- Although it seems our method is more affected by error than the algorithm of Frostig et al. [3], we emphasize that this is because the latter is too slow and still works in a very low-accuracy regime in the plots. (For instance, as a stable algorithm, it should not be affected by error of magnitude around $10^{-6}$ when the desired accuracy is above $10^{-4}$.)

$^{13}$ We choose the epoch length of SVRG to be n, and therefore full gradients are computed every n stochastic iterations. Each n stochastic iterations is counted as one pass of the data, and each full gradient computation is counted as one pass of the data.
$^{14}$ Since mnist and random-0 are datasets without significant eigengap, we present denoising error (small) as defined in Section 8.2.

9 Conclusion

We summarize our contributions.
- We put forward approximate notions for PCP and PCR that do not rely on any eigengap assumption. Our notions reduce to standard ones under the eigengap assumption.
- We design a near-optimal polynomial approximation $g(x)$ to $\mathrm{sgn}(x)$ satisfying (.) and (.).
- We develop a general stable recurrence formula for matrix Chebyshev polynomials; as a corollary, our $g(x)$ can be applied to matrices in a stable manner.
- We obtain faster, provable PCA-free algorithms for PCP and PCR than known results.

Acknowledgements

We thank Yin Tat Lee for suggesting the new title to us, and anonymous referees for useful suggestions. Z. Allen-Zhu is partially supported by an NSF Grant, no. CCF-4958, and a Microsoft Research Grant, no. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF or Microsoft.

Appendix

A Proof of Lemma 3.5

Lemma 3.5. For fixed $\lambda, \varepsilon \in (0, 1)$ and $\gamma \in (0, 2/3]$, let $A$ be a matrix whose singular values are no more than 1. Let ApxRidge be any $O(\varepsilon/m^2)$-approximate ridge regression solver, and $B$ be any $(\gamma, O(\varepsilon\lambda/m^2))$-approximate PCP solver. Then, procedure (3.) satisfies, if $m = \Theta(\log(1/\varepsilon\gamma))$,
$$\|(I - P_{(1-\gamma)\lambda}) s_m\| \le \varepsilon \|b\| \quad\text{and}\quad \|A s_m - b\| \le \|A (A^\top A)^{-1} P_{(1+\gamma)\lambda} A^\top b - b\| + \varepsilon \|b\| .$$

Proof of Lemma 3.5. We first notice that the approximation guarantee of $B$ implies $\|s_0\| = \|B(A^\top b)\| \le \|A^\top b\| + O(\varepsilon\lambda/m^2) \|b\| \le 2\|b\|$. Let us consider a new exact sequence $\{s^*_k\}_{k \ge 0}$ where
$$s^*_0 = P_{(1-\gamma)\lambda} s_0, \qquad s^*_1 = (A^\top A + \lambda I)^{-1} s^*_0, \qquad \forall k \ge 1:\ s^*_{k+1} = s^*_1 + \lambda (A^\top A + \lambda I)^{-1} s^*_k .$$

Step I. We first bound the error between $s_k$ and $s^*_k$. We have $\|s^*_{k+1}\| \le \|s^*_1\| + \|\lambda (A^\top A + \lambda I)^{-1}\| \cdot \|s^*_k\| \le \|s^*_1\| + \|s^*_k\|$, which implies $\|s^*_k\| \le k \|s^*_1\| \le \frac{k}{\lambda} \|s^*_0\| \le \frac{2k}{\lambda} \|b\|$. Therefore,
$$\begin{aligned}
\|s^*_{k+1} - s_{k+1}\| &\le \|s^*_1 - s_1\| + \lambda \|(A^\top A + \lambda I)^{-1} s^*_k - \mathrm{ApxRidge}(A, \lambda, s_k)\| \\
&\le \|s^*_1 - s_1\| + \lambda \|(A^\top A + \lambda I)^{-1} (s^*_k - s_k)\| + \lambda \|(A^\top A + \lambda I)^{-1} s_k - \mathrm{ApxRidge}(A, \lambda, s_k)\| \\
&\le \|s^*_1 - s_1\| + \|s^*_k - s_k\| + O(\lambda\varepsilon/m^2) \|s_k\| . \qquad \text{(A.1)}
\end{aligned}$$
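Since $\|s^*_k\| \le \frac{2k}{\lambda} \|b\| \le \frac{2m}{\lambda} \|b\|$ and since $\|s_0 - s^*_0\| \le O(\varepsilon\lambda/m^2) \|b\|$, we can conclude from (A.1) (by telescoping sum over $k' = 0, 1, \ldots, k-1$) that $\forall k \in \{0, 1, \ldots, m\}$: $\|s_k - s^*_k\| \le \varepsilon \|b\|$.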

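Step II. We next focus on $s^*_k$ and decompose $s^*_k$ into three parts: for every $k \ge 0$, define
$$v_{1,k} = P_{(1+\gamma)\lambda} s^*_k =: P_1 s^*_k, \qquad v_{2,k} = (I - P_{(1-\gamma)\lambda}) s^*_k =: P_2 s^*_k, \qquad v_{3,k} = (P_{(1-\gamma)\lambda} - P_{(1+\gamma)\lambda}) s^*_k =: P_3 s^*_k .$$
The update rule of $s^*_k$ tells us that
$$\forall i \in [3],\ \forall k \ge 1:\quad v_{i,k} = \lambda^{-1} \sum_{t=1}^{k} \big(\lambda (A^\top A + \lambda I)^{-1}\big)^t P_i s^*_0 .$$
In particular, since $v_{2,0} = P_2 s^*_0 = 0$ we always have $v_{2,m} = 0$. As for $v_{1,m}$ and $v_{3,m}$, we first notice that if we denote by $p_k(x) = \sum_{t=1}^{k} \lambda^{t-1} x^t$, then $v_{i,k} = p_k\big((A^\top A + \lambda I)^{-1}\big) P_i s^*_0$. But since $\lim_{k \to \infty} p_k(x) = \frac{x}{1 - \lambda x} =: p(x)$, we have
$$\lim_{k \to \infty} v_{i,k} = p\big((A^\top A + \lambda I)^{-1}\big) P_i s^*_0 = (A^\top A)^{-1} P_i s^*_0 = (A^\top A)^{-1} v_{i,0} .$$
At the same time, note that the spectral norms $\|\lambda (A^\top A + \lambda I)^{-1} P_1\|$ and $\|\lambda (A^\top A + \lambda I)^{-1} P_3\|$ are both no more than $3/4$. (This is so because for every eigenvalue $\lambda_j$ of $A^\top A$ that is at least $(1-\gamma)\lambda \ge \lambda/3$ we have $\frac{\lambda}{\lambda + \lambda_j} \le \frac{\lambda}{\lambda + (1/3)\lambda} = \frac{3}{4}$.) Therefore, for both $i = 1$ and $i = 3$, we have
$$\Big\| v_{i,m} - \lim_{k \to \infty} v_{i,k} \Big\| \le \lambda^{-1} \sum_{t=m+1}^{\infty} \big\|\lambda (A^\top A + \lambda I)^{-1} P_i\big\|^t \|s^*_0\| \le (3/4)^m \cdot O(\|b\|/\lambda) .$$
In other words, choosing $m = \Theta(\log(1/\varepsilon\lambda))$, we have
$$\|v_{1,m} - (A^\top A)^{-1} v_{1,0}\| \le \varepsilon \|b\| \quad\text{and}\quad \|v_{3,m} - (A^\top A)^{-1} v_{3,0}\| \le \varepsilon \|b\| . \qquad \text{(A.2)}$$

Step III. We now take into account the error of the PCP solver $B$. For $v_{1,m}$, we have:
$$\begin{aligned}
\|v_{1,m} - (A^\top A)^{-1} P_1 A^\top b\| &\le \|v_{1,m} - (A^\top A)^{-1} v_{1,0}\| + \|(A^\top A)^{-1} P_1 (B(A^\top b) - A^\top b)\| \\
&\le \varepsilon \|b\| + \lambda^{-1} \|P_1 (B(A^\top b) - A^\top b)\| \le 2\varepsilon \|b\| , \qquad \text{(A.3)}
\end{aligned}$$
where the first inequality uses the triangle inequality, the second uses (A.2), and the third uses Def. 3.1 and $\|A^\top b\| \le \|b\|$. As for $v_{3,m}$, we let $A = U \Sigma V^\top$ be the SVD of $A$ and let $\Sigma^{-1}$ be the same matrix $\Sigma$ except all non-zero elements get inverted. Writing $\lambda_i$ and $\nu_i$ for the eigenvalues and eigenvectors of $A^\top A$, we have
$$\begin{aligned}
\|A (v_{3,m} - (A^\top A)^{-1} P_3 A^\top b)\|
&\overset{(1)}{\le} \|A ((A^\top A)^{-1} v_{3,0} - (A^\top A)^{-1} P_3 A^\top b)\| + \|A v_{3,m} - A (A^\top A)^{-1} v_{3,0}\| \\
&\overset{(2)}{\le} \|A ((A^\top A)^{-1} v_{3,0} - (A^\top A)^{-1} P_3 A^\top b)\| + \varepsilon \|b\| \\
&= \|U \Sigma^{-1} V^\top P_3 (B(A^\top b) - A^\top b)\| + \varepsilon \|b\| \\
&\overset{(3)}{\le} \|\Sigma^{-1} V^\top P_3 (B(A^\top b) - A^\top b)\| + \varepsilon \|b\| \\
&= \Big( \sum_{i:\, \lambda_i \in [(1-\gamma)\lambda, (1+\gamma)\lambda]} \lambda_i^{-1} \langle \nu_i, B(A^\top b) - A^\top b \rangle^2 \Big)^{1/2} + \varepsilon \|b\| \\
&\overset{(4)}{\le} \Big( \sum_{i:\, \lambda_i \in [(1-\gamma)\lambda, (1+\gamma)\lambda]} \lambda_i^{-1} \langle \nu_i, A^\top b \rangle^2 \Big)^{1/2} + 2\varepsilon \|b\| \\
&= \|A (A^\top A)^{-1} P_3 A^\top b\| + 2\varepsilon \|b\| . \qquad \text{(A.4)}
\end{aligned}$$
Above, (1) uses the triangle inequality, (2) uses (A.2) and the fact $\|A\| \le 1$, (3) uses $\|U\| \le 1$, and (4) uses Def. 3.1 and $\|A^\top b\| \le \|b\|$.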
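The limit used in Step II, namely that the recurrence $s^*_{k+1} = s^*_1 + \lambda (A^\top A + \lambda I)^{-1} s^*_k$ converges geometrically to $(A^\top A)^{-1} s^*_0$ on the top eigenspace, can be checked numerically. The sketch below is an illustration under simplifying assumptions: it uses an exact ridge solve in place of ApxRidge and a starting vector that already lies in the top eigenspace (an ideal PCP output); the names `ridge_solve` and `inverse_via_ridge` are hypothetical.

```python
import numpy as np

def ridge_solve(A, lam, u):
    """Exact ridge regression step: (A^T A + lam I)^{-1} u."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), u)

def inverse_via_ridge(A, lam, s0, m):
    """Run s_1 = ridge(s_0), s_{k+1} = s_1 + lam * ridge(s_k), and return s_m."""
    s1 = ridge_solve(A, lam, s0)
    s = s1
    for _ in range(m - 1):
        s = s1 + lam * ridge_solve(A, lam, s)
    return s

# Eigenvalues of A^T A are 1, 0.81, 0.64 (kept) and 0.01 (discarded) for lam = 0.25.
A = np.diag([1.0, 0.9, 0.8, 0.1])
lam = 0.25
s0 = np.array([1.0, 1.0, 1.0, 0.0])              # lies in the top eigenspace
target = np.array([1.0, 1 / 0.81, 1 / 0.64, 0.0])  # (A^T A)^{-1} s0 on that subspace

for m in (1, 5, 10, 40):
    err = np.linalg.norm(inverse_via_ridge(A, lam, s0, m) - target)
    print(m, err)
```

Each coordinate with eigenvalue $\mu$ has error $(\lambda/(\lambda+\mu))^m / \mu$ after $m$ steps, so the printed errors shrink geometrically, in line with the $(3/4)^m$ bound in the proof.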

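Step IV. Finally we put everything together and bound the regression error. Denote by $\mathrm{opt} = \|A (A^\top A)^{-1} P_{(1+\gamma)\lambda} A^\top b - b\|$. If we decompose $b$ as
$$b = \Big( \sum_{i=1}^{3} A (A^\top A)^{-1} P_i A^\top b \Big) + \big( b - A (A^\top A)^{-1} A^\top b \big) , \qquad \text{(A.5)}$$
then the four vectors in (A.5) are orthogonal to each other, which gives us
$$\mathrm{opt}^2 = \|A (A^\top A)^{-1} P_1 A^\top b - b\|^2 = \|A (A^\top A)^{-1} P_2 A^\top b\|^2 + \|A (A^\top A)^{-1} P_3 A^\top b\|^2 + \|A (A^\top A)^{-1} A^\top b - b\|^2 . \qquad \text{(A.6)}$$
Now we compute the regression error with respect to $s^*_m$:
$$\begin{aligned}
\|A s^*_m - b\| &\overset{(1)}{=} \|A (v_{1,m} + v_{3,m}) - b\| \\
&\overset{(2)}{=} \Big\| A (v_{1,m} + v_{3,m}) - \sum_{i=1}^{3} A (A^\top A)^{-1} P_i A^\top b - \big( b - A (A^\top A)^{-1} A^\top b \big) \Big\| \\
&\overset{(3)}{\le} \|A (v_{1,m} - (A^\top A)^{-1} P_1 A^\top b)\| + \Big( \|A (v_{3,m} - (A^\top A)^{-1} P_3 A^\top b)\|^2 + \|A (A^\top A)^{-1} P_2 A^\top b\|^2 + \|A (A^\top A)^{-1} A^\top b - b\|^2 \Big)^{1/2} \\
&\overset{(4)}{\le} 4\varepsilon \|b\| + \Big( \|A (A^\top A)^{-1} P_2 A^\top b\|^2 + \|A (A^\top A)^{-1} P_3 A^\top b\|^2 + \|A (A^\top A)^{-1} A^\top b - b\|^2 \Big)^{1/2} \\
&\overset{(5)}{=} \mathrm{opt} + 4\varepsilon \|b\| .
\end{aligned}$$
Above, (1) is because $v_{2,m} = 0$; (2) uses (A.5); (3) uses the triangle inequality together with the orthogonality of the remaining three vectors; (4) uses (A.3) and (A.4); (5) uses (A.6). Finally, using $\|s_m - s^*_m\| \le \varepsilon \|b\|$ we complete the proof that $\|A s_m - b\| \le \mathrm{opt} + 5\varepsilon \|b\|$. We also have $\|P_2 s_m\| \le \varepsilon \|b\| + \|P_2 s^*_m\| = \varepsilon \|b\|$ because $P_2 s^*_m = v_{2,m} = 0$.

B Appendix for Section 4

Lemma 4. Suppose $f(z)$ is analytic on $E_\rho$ and for every $k \ge 0$, $f^{(k)}(0) \ge 0$. Then, for every $n \in \mathbb{N}$, letting $p_n(x)$ and $q_n(x)$ be the degree-$n$ Chebyshev truncated series and Chebyshev interpolation of $f(x)$, we have
$$\forall y \in [0, \rho]:\quad 0 \le p_n(1 + y),\ q_n(1 + y) \le f(1 + y) .$$

To show Lemma 4. we first need an auxiliary lemma, which can be proved by some careful case analysis (see Appendix B.1).

Lemma B.1. Let $m, n \in \mathbb{N}$ be two integers, then $a_{m,n} = \int_{-1}^{1} \frac{x^m T_n(x)}{\sqrt{1 - x^2}} \, dx \ge 0$.

Lemma B.1 essentially says that the Chebyshev coefficients of any function $x^m$ must be all non-negative. We also recall the following lemma regarding high-order derivatives of Chebyshev truncated series:

Lemma B.2 (cf. Theorem . of [3]). Suppose $f(z)$ is analytic on $E_\rho$ with $\rho > 0$, and let $p_n(x)$ be the degree-$n$ Chebyshev truncated series of $f(x)$. Then, for every $k \ge 0$,
$$\lim_{n \to +\infty} \Big\{ \max_{x \in [-1, 1]} \big| f^{(k)}(x) - p_n^{(k)}(x) \big| \Big\} = 0 .$$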
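Before the proof, the statement of Lemma 4. can be illustrated numerically for $f = \exp$, whose derivatives at $0$ are all positive. The sketch below is a sanity check under stated assumptions, not part of the argument: the truncated-series coefficients are approximated by Gauss-Chebyshev quadrature, the interpolation nodes are taken to be $\cos(\pi j / n)$ (an assumption about which Chebyshev interpolation is meant), and both helper names are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def truncated_series_coeffs(f, n, quad_nodes=200):
    """First n+1 Chebyshev series coefficients of f, via Gauss-Chebyshev quadrature."""
    theta = np.pi * (np.arange(quad_nodes) + 0.5) / quad_nodes
    fx = f(np.cos(theta))
    a = np.array([2.0 / quad_nodes * np.sum(fx * np.cos(i * theta))
                  for i in range(n + 1)])
    a[0] /= 2.0  # the T_0 coefficient carries weight 1/pi instead of 2/pi
    return a

def interpolation_coeffs(f, n):
    """Coefficients of the degree-n interpolant at the points cos(pi*j/n), j = 0..n."""
    j = np.arange(n + 1)
    fx = f(np.cos(np.pi * j / n))
    w = np.ones(n + 1)
    w[0] = w[-1] = 0.5  # endpoint samples are halved
    c = np.array([2.0 / n * np.sum(w * fx * np.cos(np.pi * i * j / n))
                  for i in range(n + 1)])
    c[0] /= 2.0
    c[-1] /= 2.0
    return c

n = 6
a = truncated_series_coeffs(np.exp, n)
c = interpolation_coeffs(np.exp, n)
y = np.linspace(0.0, 1.0, 11)
p_vals = C.chebval(1.0 + y, a)   # p_n(1 + y)
q_vals = C.chebval(1.0 + y, c)   # q_n(1 + y)
f_vals = np.exp(1.0 + y)         # f(1 + y)
print(np.max(p_vals / f_vals), np.max(q_vals / f_vals))
```

Both printed ratios stay at or below one up to rounding, and the coefficients satisfy $c_i \ge a_i \ge 0$, which is the aliasing fact the proof relies on.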

We are now ready to prove Lemma 4. The main idea is to expand $f$ into its Taylor series, and then deal with monomials $x^m$ one by one:

Proof of Lemma 4. Since $f^{(k)}(0) \ge 0$ for all $k \ge 0$, and since $f(z)$ is analytic, we can write $f$ as $f(z) = \sum_{k=0}^{\infty} r_k z^k$ where each $r_k$ is a nonnegative real. Consider the $i$-th coefficient of the Chebyshev series:
$$a_i = \frac{2 - \mathbb{1}[i = 0]}{\pi} \int_{-1}^{1} \frac{f(x) T_i(x)}{\sqrt{1 - x^2}} \, dx = \frac{2 - \mathbb{1}[i = 0]}{\pi} \sum_{k=0}^{\infty} r_k \int_{-1}^{1} \frac{x^k T_i(x)}{\sqrt{1 - x^2}} \, dx \ge 0 ,$$
where the last inequality is due to Lemma B.1, and the integral and infinite Taylor sum are interchangeable.$^{15}$ This implies we can write $p_n(x) = \sum_{i=0}^{n} a_i T_i(x)$ where each $a_i \ge 0$. Since each $T_i(1 + y)$ is a polynomial of degree $i$, it exactly equals its degree-$i$ Taylor expansion $\sum_{k=0}^{i} \frac{y^k}{k!} T_i^{(k)}(1)$. Thus, we have (recall $y \in [0, \rho]$)
$$p_n(1 + y) = \sum_{i=0}^{n} a_i T_i(1 + y) = \sum_{i=0}^{n} \sum_{k=0}^{i} \frac{a_i}{k!} T_i^{(k)}(1) y^k = \sum_{k=0}^{n} \frac{1}{k!} \Big( \sum_{i=k}^{n} a_i T_i^{(k)}(1) \Big) y^k .$$
Denote by $b_{k,n} = \sum_{i=k}^{n} a_i T_i^{(k)}(1)$. Since for every $i, k$ it satisfies $T_i^{(k)}(1) \ge 0$ (which is a factual property of Chebyshev polynomials) and $a_i \ge 0$, we know $b_{k,n} \ge 0$ and moreover $b_{k,n}$ is monotonically non-decreasing in $n$ for each $k$. On the other hand, Lemma B.2 implies
$$\lim_{n \to \infty} \big( p_n^{(k)}(1) - f^{(k)}(1) \big) = \lim_{n \to \infty} \big( b_{k,n} - f^{(k)}(1) \big) = 0 ,$$
so we must have $b_{k,n} \le f^{(k)}(1)$ for every $n \in \mathbb{N}$ (because $b_{k,n}$ is non-decreasing in $n$). Therefore, for every $y \in [0, \rho]$ (with the convention $b_{k,n} = 0$ for $k > n$):
$$0 \le p_n(1 + y) = \sum_{k=0}^{n} \frac{1}{k!} b_{k,n} y^k = \sum_{k=0}^{\infty} \frac{1}{k!} b_{k,n} y^k \le \sum_{k=0}^{\infty} \frac{1}{k!} f^{(k)}(1) y^k = f(1 + y) . \qquad \text{(B.1)}$$
Finally, since $q_n(x) = \sum_{k=0}^{n} c_k T_k(x)$ is a degree-$n$ Chebyshev interpolation polynomial, the aliasing Lemma .8 tells us $c_i \ge 0$ for every $i = 0, 1, \ldots, n$. Furthermore, applying the aliasing Lemma .8 again we have $c_i \ge a_i$ for $i = 0, 1, \ldots, n$ but $\sum_{i=0}^{n} c_i = \sum_{i=0}^{\infty} a_i$. Therefore, using the fact that $T_i^{(k)}(1)$ is a monotone increasing function in $i$ (for every fixed $k$), we have
$$\sum_{i=0}^{n} c_i T_i^{(k)}(1) \le \sum_{i=0}^{\infty} a_i T_i^{(k)}(1) = \lim_{n \to \infty} b_{k,n} = f^{(k)}(1) .$$
Finally, an analogous proof as (B.1) also shows $q_n(1 + y) \le f(1 + y)$ for every $y \in [0, \rho]$.

$^{15}$ The interchangeability can be verified as follows. Denoting by $f_m(x) = \sum_{k=0}^{m} r_k x^k$, we have that $f_m(x)$ uniformly converges to $f(x)$ on $x \in [-1, 1]$ because the Taylor expansion of any analytic function has local uniform convergence, but $[-1, 1]$ is a compact, closed interval so local uniform convergence becomes global uniform convergence. For every $\varepsilon > 0$, let $M$ be the integer so that for every $m \ge M$ it satisfies $\max_{x \in [-1, 1]} |f_m(x) - f(x)| \le \varepsilon$. We compute that
$$\Big| \int_{-1}^{1} \frac{f(x) T_i(x)}{\sqrt{1 - x^2}} \, dx - \sum_{k=0}^{m} r_k \int_{-1}^{1} \frac{x^k T_i(x)}{\sqrt{1 - x^2}} \, dx \Big| = \Big| \int_{-1}^{1} \frac{(f(x) - f_m(x)) T_i(x)}{\sqrt{1 - x^2}} \, dx \Big| \le \varepsilon \int_{-1}^{1} \frac{dx}{\sqrt{1 - x^2}} = \varepsilon \pi .$$
Therefore, the left hand side converges to zero so the integral and the infinite Taylor sum are interchangeable.

B.1 Proof of Lemma B.1

Lemma B.1. Let $m, n \in \mathbb{N}$ be two integers, then $a_{m,n} = \int_{-1}^{1} \frac{x^m T_n(x)}{\sqrt{1 - x^2}} \, dx \ge 0$.
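The claim of Lemma B.1 can be sanity-checked for small $m$ by expanding each monomial in the Chebyshev basis. This is a numerical illustration rather than a proof; it relies on the fact that the integral $a_{m,n}$ equals the $n$-th Chebyshev coefficient of $x^m$ up to a positive factor ($\pi$ for $n = 0$ and $\pi/2$ otherwise), and the helper name `monomial_cheb_coeffs` is hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def monomial_cheb_coeffs(m):
    """Coefficients of x^m in the Chebyshev basis T_0, ..., T_m."""
    mono = np.zeros(m + 1)
    mono[m] = 1.0          # power-basis coefficients of x^m, low to high degree
    return C.poly2cheb(mono)

for m in range(6):
    print(m, monomial_cheb_coeffs(m))
```

For instance, $x^2 = \tfrac{1}{2} T_0(x) + \tfrac{1}{2} T_2(x)$, and every coefficient printed is non-negative.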


More information

LECTURE NOTES ELEMENTARY NUMERICAL METHODS. Eusebius Doedel

LECTURE NOTES ELEMENTARY NUMERICAL METHODS. Eusebius Doedel LECTURE NOTES on ELEMENTARY NUMERICAL METHODS Eusebius Doedel TABLE OF CONTENTS Vector and Matrix Norms 1 Banach Lemma 20 The Numerical Solution of Linear Systems 25 Gauss Elimination 25 Operation Count

More information

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017 The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT -09 Computational and Sensitivity Aspects of Eigenvalue-Based Methods for the Large-Scale Trust-Region Subproblem Marielba Rojas, Bjørn H. Fotland, and Trond Steihaug

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Least Sparsity of p-norm based Optimization Problems with p > 1

Least Sparsity of p-norm based Optimization Problems with p > 1 Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from

More information

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88 Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant

More information

Linear Algebra March 16, 2019

Linear Algebra March 16, 2019 Linear Algebra March 16, 2019 2 Contents 0.1 Notation................................ 4 1 Systems of linear equations, and matrices 5 1.1 Systems of linear equations..................... 5 1.2 Augmented

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

MTH 2032 SemesterII

MTH 2032 SemesterII MTH 202 SemesterII 2010-11 Linear Algebra Worked Examples Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education December 28, 2011 ii Contents Table of Contents

More information

Stat 159/259: Linear Algebra Notes

Stat 159/259: Linear Algebra Notes Stat 159/259: Linear Algebra Notes Jarrod Millman November 16, 2015 Abstract These notes assume you ve taken a semester of undergraduate linear algebra. In particular, I assume you are familiar with the

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Convex Functions and Optimization

Convex Functions and Optimization Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

Trust Regions. Charles J. Geyer. March 27, 2013

Trust Regions. Charles J. Geyer. March 27, 2013 Trust Regions Charles J. Geyer March 27, 2013 1 Trust Region Theory We follow Nocedal and Wright (1999, Chapter 4), using their notation. Fletcher (1987, Section 5.1) discusses the same algorithm, but

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

Some definitions. Math 1080: Numerical Linear Algebra Chapter 5, Solving Ax = b by Optimization. A-inner product. Important facts

Some definitions. Math 1080: Numerical Linear Algebra Chapter 5, Solving Ax = b by Optimization. A-inner product. Important facts Some definitions Math 1080: Numerical Linear Algebra Chapter 5, Solving Ax = b by Optimization M. M. Sussman sussmanm@math.pitt.edu Office Hours: MW 1:45PM-2:45PM, Thack 622 A matrix A is SPD (Symmetric

More information

Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics )

Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics ) Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics ) James Martens University of Toronto June 24, 2010 Computer Science UNIVERSITY OF TORONTO James Martens (U of T) Learning

More information

Introduction to Matrix Algebra

Introduction to Matrix Algebra Introduction to Matrix Algebra August 18, 2010 1 Vectors 1.1 Notations A p-dimensional vector is p numbers put together. Written as x 1 x =. x p. When p = 1, this represents a point in the line. When p

More information

1. Background: The SVD and the best basis (questions selected from Ch. 6- Can you fill in the exercises?)

1. Background: The SVD and the best basis (questions selected from Ch. 6- Can you fill in the exercises?) Math 35 Exam Review SOLUTIONS Overview In this third of the course we focused on linear learning algorithms to model data. summarize: To. Background: The SVD and the best basis (questions selected from

More information

Linear Algebra Primer

Linear Algebra Primer Linear Algebra Primer David Doria daviddoria@gmail.com Wednesday 3 rd December, 2008 Contents Why is it called Linear Algebra? 4 2 What is a Matrix? 4 2. Input and Output.....................................

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Numerical Methods - Numerical Linear Algebra

Numerical Methods - Numerical Linear Algebra Numerical Methods - Numerical Linear Algebra Y. K. Goh Universiti Tunku Abdul Rahman 2013 Y. K. Goh (UTAR) Numerical Methods - Numerical Linear Algebra I 2013 1 / 62 Outline 1 Motivation 2 Solving Linear

More information

arxiv: v4 [math.oc] 11 Jun 2018

arxiv: v4 [math.oc] 11 Jun 2018 Natasha : Faster Non-Convex Optimization han SGD How to Swing By Saddle Points (version 4) arxiv:708.08694v4 [math.oc] Jun 08 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Microsoft Research, Redmond August 8,

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Approximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo

Approximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo Approximate Second Order Algorithms Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo Why Second Order Algorithms? Invariant under affine transformations e.g. stretching a function preserves the convergence

More information

Eigenvalues and Eigenvectors

Eigenvalues and Eigenvectors Contents Eigenvalues and Eigenvectors. Basic Concepts. Applications of Eigenvalues and Eigenvectors 8.3 Repeated Eigenvalues and Symmetric Matrices 3.4 Numerical Determination of Eigenvalues and Eigenvectors

More information

Ir O D = D = ( ) Section 2.6 Example 1. (Bottom of page 119) dim(v ) = dim(l(v, W )) = dim(v ) dim(f ) = dim(v )

Ir O D = D = ( ) Section 2.6 Example 1. (Bottom of page 119) dim(v ) = dim(l(v, W )) = dim(v ) dim(f ) = dim(v ) Section 3.2 Theorem 3.6. Let A be an m n matrix of rank r. Then r m, r n, and, by means of a finite number of elementary row and column operations, A can be transformed into the matrix ( ) Ir O D = 1 O

More information

MATH 205 HOMEWORK #3 OFFICIAL SOLUTION. Problem 1: Find all eigenvalues and eigenvectors of the following linear transformations. (a) F = R, V = R 3,

MATH 205 HOMEWORK #3 OFFICIAL SOLUTION. Problem 1: Find all eigenvalues and eigenvectors of the following linear transformations. (a) F = R, V = R 3, MATH 205 HOMEWORK #3 OFFICIAL SOLUTION Problem 1: Find all eigenvalues and eigenvectors of the following linear transformations. a F = R, V = R 3, b F = R or C, V = F 2, T = T = 9 4 4 8 3 4 16 8 7 0 1

More information

MIT Final Exam Solutions, Spring 2017

MIT Final Exam Solutions, Spring 2017 MIT 8.6 Final Exam Solutions, Spring 7 Problem : For some real matrix A, the following vectors form a basis for its column space and null space: C(A) = span,, N(A) = span,,. (a) What is the size m n of

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

Introduction to Iterative Solvers of Linear Systems

Introduction to Iterative Solvers of Linear Systems Introduction to Iterative Solvers of Linear Systems SFB Training Event January 2012 Prof. Dr. Andreas Frommer Typeset by Lukas Krämer, Simon-Wolfgang Mages and Rudolf Rödl 1 Classes of Matrices and their

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Lecture Notes 6: Dynamic Equations Part C: Linear Difference Equation Systems

Lecture Notes 6: Dynamic Equations Part C: Linear Difference Equation Systems University of Warwick, EC9A0 Maths for Economists Peter J. Hammond 1 of 45 Lecture Notes 6: Dynamic Equations Part C: Linear Difference Equation Systems Peter J. Hammond latest revision 2017 September

More information

Gradient Descent Methods

Gradient Descent Methods Lab 18 Gradient Descent Methods Lab Objective: Many optimization methods fall under the umbrella of descent algorithms. The idea is to choose an initial guess, identify a direction from this point along

More information

Iterative solvers for linear equations

Iterative solvers for linear equations Spectral Graph Theory Lecture 23 Iterative solvers for linear equations Daniel A. Spielman November 26, 2018 23.1 Overview In this and the next lecture, I will discuss iterative algorithms for solving

More information

Computational Methods. Eigenvalues and Singular Values

Computational Methods. Eigenvalues and Singular Values Computational Methods Eigenvalues and Singular Values Manfred Huber 2010 1 Eigenvalues and Singular Values Eigenvalues and singular values describe important aspects of transformations and of data relations

More information

Chapter 7. Canonical Forms. 7.1 Eigenvalues and Eigenvectors

Chapter 7. Canonical Forms. 7.1 Eigenvalues and Eigenvectors Chapter 7 Canonical Forms 7.1 Eigenvalues and Eigenvectors Definition 7.1.1. Let V be a vector space over the field F and let T be a linear operator on V. An eigenvalue of T is a scalar λ F such that there

More information

7. Symmetric Matrices and Quadratic Forms

7. Symmetric Matrices and Quadratic Forms Linear Algebra 7. Symmetric Matrices and Quadratic Forms CSIE NCU 1 7. Symmetric Matrices and Quadratic Forms 7.1 Diagonalization of symmetric matrices 2 7.2 Quadratic forms.. 9 7.4 The singular value

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

CHAPTER 3 Further properties of splines and B-splines

CHAPTER 3 Further properties of splines and B-splines CHAPTER 3 Further properties of splines and B-splines In Chapter 2 we established some of the most elementary properties of B-splines. In this chapter our focus is on the question What kind of functions

More information

A Quick Tour of Linear Algebra and Optimization for Machine Learning

A Quick Tour of Linear Algebra and Optimization for Machine Learning A Quick Tour of Linear Algebra and Optimization for Machine Learning Masoud Farivar January 8, 2015 1 / 28 Outline of Part I: Review of Basic Linear Algebra Matrices and Vectors Matrix Multiplication Operators

More information

Iterative solvers for linear equations

Iterative solvers for linear equations Spectral Graph Theory Lecture 17 Iterative solvers for linear equations Daniel A. Spielman October 31, 2012 17.1 About these notes These notes are not necessarily an accurate representation of what happened

More information

10-725/36-725: Convex Optimization Prerequisite Topics

10-725/36-725: Convex Optimization Prerequisite Topics 10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the

More information

Notes on Eigenvalues, Singular Values and QR

Notes on Eigenvalues, Singular Values and QR Notes on Eigenvalues, Singular Values and QR Michael Overton, Numerical Computing, Spring 2017 March 30, 2017 1 Eigenvalues Everyone who has studied linear algebra knows the definition: given a square

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},

More information

8. Diagonalization.

8. Diagonalization. 8. Diagonalization 8.1. Matrix Representations of Linear Transformations Matrix of A Linear Operator with Respect to A Basis We know that every linear transformation T: R n R m has an associated standard

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

MATRICES ARE SIMILAR TO TRIANGULAR MATRICES

MATRICES ARE SIMILAR TO TRIANGULAR MATRICES MATRICES ARE SIMILAR TO TRIANGULAR MATRICES 1 Complex matrices Recall that the complex numbers are given by a + ib where a and b are real and i is the imaginary unity, ie, i 2 = 1 In what we describe below,

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information