arxiv: v1 [stat.ml] 5 Nov 2018


Kernel Conjugate Gradient Methods with Random Projections

Junhong Lin and Volkan Cevher
Laboratory for Information and Inference Systems
École Polytechnique Fédérale de Lausanne
CH-1015 Lausanne, Switzerland

arXiv:1811.01760v1 [stat.ML] 5 Nov 2018

November 6, 2018

Abstract

We propose and study kernel conjugate gradient methods (KCGM) with random projections for least-squares regression over a separable Hilbert space. Considering two types of random projections, generated by randomized sketches and Nyström subsampling, we prove optimal statistical results with respect to variants of norms for the algorithms under a suitable stopping rule. In particular, our results show that if the projection dimension is proportional to the effective dimension of the problem, KCGM with randomized sketches can generalize optimally, while achieving a computational advantage. As a corollary, we derive optimal rates for classic KCGM in the case that the target function may not be in the hypothesis space, filling a theoretical gap.

Keywords: Learning theory, Conjugate gradient methods, Randomized sketches, Integral operator

Mathematics Subject Classification: 68T05, 94A0, 4A35

1 Introduction

Let the input space be a separable Hilbert space $H$ with inner product $\langle\cdot,\cdot\rangle_H$, and the output space $\mathbb{R}$. Let $\rho$ be an unknown probability measure on $H\times\mathbb{R}$. We study the following expected risk minimization,
$$\inf_{\omega\in H}\tilde{\mathcal{E}}(\omega),\qquad \tilde{\mathcal{E}}(\omega)=\int_{H\times\mathbb{R}}\big(\langle\omega,x\rangle_H-y\big)^2\,d\rho(x,y), \tag{1}$$
where the measure $\rho$ is known only through a sample $\mathbf{z}=\{z_i=(x_i,y_i)\}_{i=1}^n$ of size $n\in\mathbb{N}$, independently and identically distributed (i.i.d.) according to $\rho$. As noted in [0, ], this setting covers nonparametric regression with kernel methods [8, 33], and it is close to functional linear regression [7] with zero intercept and to linear inverse problems [].
In large-scale learning scenarios, the search for an approximate estimator for the above problem via some specific algorithm can be restricted to a smaller subspace $S$, in order to achieve computational advantages [36, 3, 0]. Typically, with a subsample/sketch dimension $m<n$, $S=\mathrm{span}\{\tilde{x}_j : 1\le j\le m\}$ where $\tilde{x}_j$ is chosen randomly from the input set $\mathbf{x}=\{x_1,\dots,x_n\}$, or $S=\mathrm{span}\{\sum_{j=1}^{n}G_{ij}x_j : 1\le i\le m\}$ where $G=[G_{ij}]_{1\le i\le m,\,1\le j\le n}$ is a general random matrix whose rows are drawn according to a distribution. The former is called Nyström subsampling while the latter is called randomized sketches. Limiting the solution within the subspace $S$, replacing expected risk by empirical risk over $\mathbf{z}$, and combining with a linear-fashion and

explicit regularization technique based on spectral filtering of the empirical covariance operator, this leads to the projected-regularized algorithms. We refer to the previous papers [, 37, 9] and references therein for the statistical results and computational advantages of this kind of algorithm.

In this paper, we take a different step and apply the random-projection techniques to another class of efficient iterative algorithms: kernel conjugate gradient type algorithms. As noted in [9], a solution of the empirical risk minimization over the subspace $S$ can be given by solving a projected normal equation. We apply the kernel conjugate gradient methods (KCGM) [5, 5] to this normal equation without any explicit regularization term, and at the $t$-th iteration we get an estimator that fits the linear equation best over the $t$-th order Krylov subspace. The regularization that ensures its best performance is realized by early stopping of the iterative procedure. Early-stopping iterative regularization [40, 38, 8] has its own benefit compared with spectral-filtering algorithms, as it tunes the regularization parameter adaptively if a suitable stopping rule is used. Thus, for some easy learning problems, an iterative algorithm can stop earlier while generalizing optimally, leading to computational advantages.

Considering either randomized sketches or Nyström subsampling, we provide statistical results in terms of different norms with optimal rates. In particular, our results indicate that for KCGM with randomized sketches, the algorithm can generalize optimally after some number of iterations, provided that the sketch dimension is proportional to the effective dimension [39] of the problem. Furthermore, we point out that the computational complexities of the algorithm are $O(m^3)$ in time and $O(m^2)$ in space, which are lower than the $O(n^2t)$ in time and $O(n^2)$ in space of classic KCGM.
Thus, our results suggest that KCGM with randomized sketches can generalize optimally with lower computational cost, e.g., $O(n^{3/2})$ in time and $O(n)$ in space without considering the benign assumptions of the problem in the attainable case (i.e., the expected risk minimization has at least one solution in $H$). Finally, as a corollary, we derive the first result with optimal capacity-dependent rates for classical KCGM in the non-attainable case, filling a theoretical gap since [4].

The paper is organized as follows. We first introduce some preliminary notations and the studied algorithms in Section 2. We then introduce some basic assumptions and state our main results in Section 3, followed by some simple discussions and numerical illustrations. All the proofs are given in Section 4 and the Appendix.

2 Learning with Kernel Conjugate Gradient Methods and Random Projection

In this section, we first introduce some necessary notations. We then present KCGM with projection (abbreviated as projected-KCGM), and discuss its numerical realizations considering two types of projection, generated by randomized sketches and Nyström sketches/subsampling.
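As a rough illustration of the two kinds of random projection, the sketch below builds a matrix $G$ for each case so that $S$ is spanned by the rows of $GX$, where the rows of $X$ are the inputs. It assumes finite-dimensional inputs stored in a NumPy array. The function names, the Gaussian entries scaled by $1/\sqrt{m}$, and uniform sampling without replacement for Nyström subsampling are illustrative assumptions, not choices prescribed by the text.

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    """m x n sketch with i.i.d. N(0, 1/m) entries (one common subgaussian choice)."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

def subsampling_matrix(m, n, rng):
    """Row-selection matrix: row i picks one input, so G @ X is a subsample of X."""
    idx = rng.choice(n, size=m, replace=False)  # assumption: uniform, no replacement
    G = np.zeros((m, n))
    G[np.arange(m), idx] = 1.0
    return G, idx

rng = np.random.default_rng(0)
n, d, m = 50, 4, 8
X = rng.standard_normal((n, d))        # rows are the inputs x_1, ..., x_n

G_sketch = gaussian_sketch(m, n, rng)
G_nys, idx = subsampling_matrix(m, n, rng)

basis_sketch = G_sketch @ X            # row i is sum_j G_ij x_j
basis_nys = G_nys @ X                  # row i is the sampled input x_{idx[i]}
```

Seen this way, Nyström subsampling is the special case of a sketch whose rows are standard basis vectors, which is why both regimes can share one implementation.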

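2.1 Notations and Auxiliary Operators

Let $Z=H\times\mathbb{R}$, $\rho_X(\cdot)$ the induced marginal measure on $H$ of $\rho$, and $\rho(\cdot|x)$ the conditional probability measure on $\mathbb{R}$ with respect to $x\in H$ and $\rho$. Define the hypothesis space
$$H_\rho=\{f:H\to\mathbb{R}\;|\;\exists\,\omega\in H \text{ with } f(x)=\langle\omega,x\rangle_H,\ \rho_X\text{-almost surely}\}.$$
Denote by $L^2_{\rho_X}$ the Hilbert space of square integrable functions from $H$ to $\mathbb{R}$ with respect to $\rho_X$, with its norm given by $\|f\|_\rho=\big(\int_H|f(x)|^2\,d\rho_X\big)^{1/2}$. Throughout this paper, we assume that the support of $\rho_X$ is compact and there exists a constant $\kappa\in[1,\infty[$ such that
$$\langle x,x'\rangle_H\le\kappa^2,\qquad\forall x,x'\in H,\ \rho_X\text{-almost every}. \tag{2}$$
For a given bounded operator $L$ mapping from a separable Hilbert space $H_1$ to another separable Hilbert space $H_2$, $\|L\|$ denotes the operator norm of $L$, i.e., $\|L\|=\sup_{f\in H_1,\|f\|_{H_1}=1}\|Lf\|_{H_2}$. Let $r\in\mathbb{N}_+$; the set $\{1,\dots,r\}$ is denoted by $[r]$. For any real number $a$, $a_+=\max(a,0)$ and $a_-=\min(0,a)$.

Let $S_\rho:H\to L^2_{\rho_X}$ be the linear map $\omega\mapsto\langle\omega,\cdot\rangle_H$, which is bounded by $\kappa$ under Assumption (2). Furthermore, we consider the adjoint operator $S_\rho^*:L^2_{\rho_X}\to H$, the covariance operator $T:H\to H$ given by $T=S_\rho^*S_\rho$, and the integral operator $L:L^2_{\rho_X}\to L^2_{\rho_X}$ given by $S_\rho S_\rho^*$. It can be easily proved that
$$S_\rho^*g=\int_H x\,g(x)\,d\rho_X(x),\qquad Lf=S_\rho S_\rho^*f=\int_H f(x)\langle x,\cdot\rangle_H\,d\rho_X(x),$$
and
$$T=S_\rho^*S_\rho=\int_H\langle\cdot,x\rangle_H\,x\,d\rho_X(x).$$
Under Assumption (2), the operators $T$ and $L$ can be proved to be positive trace class operators (and hence compact):
$$\|L\|=\|T\|\le\mathrm{tr}(T)=\int_H\mathrm{tr}(x\otimes x)\,d\rho_X(x)=\int_H\|x\|_H^2\,d\rho_X(x)\le\kappa^2. \tag{3}$$
For any $\omega\in H$, it is easy to prove the following isometry property,
$$\|S_\rho\omega\|_\rho=\|T^{1/2}\omega\|_H. \tag{4}$$
Moreover, according to the singular value decomposition of a compact operator, one can prove
$$\|L^{-1/2}S_\rho\omega\|_\rho\le\|\omega\|_H. \tag{5}$$
Similarly, for all $f\in L^2_{\rho_X}$, there holds
$$\|S_\rho^*f\|_H=\|L^{1/2}f\|_\rho, \tag{6}$$
and
$$\|T^{-1/2}S_\rho^*f\|_H\le\|f\|_\rho. \tag{7}$$
We define the normalized sampling operator $S_{\mathbf{x}}:H\to\mathbb{R}^n$ by
$$(S_{\mathbf{x}}\omega)_i=\frac{1}{\sqrt{n}}\langle\omega,x_i\rangle_H,\qquad i\in[n],$$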

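where the norm $\|\cdot\|_2$ in $\mathbb{R}^n$ is the usual Euclidean norm. Its adjoint operator $S_{\mathbf{x}}^*:\mathbb{R}^n\to H$, defined by $\langle S_{\mathbf{x}}^*\mathbf{y},\omega\rangle_H=\langle\mathbf{y},S_{\mathbf{x}}\omega\rangle_2$ for $\mathbf{y}\in\mathbb{R}^n$, is thus given by
$$S_{\mathbf{x}}^*\mathbf{y}=\frac{1}{\sqrt{n}}\sum_{i=1}^n y_ix_i.$$
For notational simplicity, we also denote $\bar{\mathbf{y}}=\frac{1}{\sqrt{n}}\mathbf{y}$. Moreover, we can define the empirical covariance operator $T_{\mathbf{x}}:H\to H$ such that $T_{\mathbf{x}}=S_{\mathbf{x}}^*S_{\mathbf{x}}$. Obviously,
$$T_{\mathbf{x}}=S_{\mathbf{x}}^*S_{\mathbf{x}}=\frac{1}{n}\sum_{i=1}^n\langle\cdot,x_i\rangle_H\,x_i.$$
By Assumption (2), similar to (3), we have
$$\|T_{\mathbf{x}}\|\le\mathrm{tr}(T_{\mathbf{x}})\le\kappa^2. \tag{8}$$
Denote by $K_{\mathbf{x}\tilde{\mathbf{x}}}$ the $|\mathbf{x}|\times|\tilde{\mathbf{x}}|$ matrix with its $(i,j)$-th entry given by $\frac{1}{\sqrt{|\mathbf{x}||\tilde{\mathbf{x}}|}}\langle x_i,\tilde{x}_j\rangle_H$ for any two input sets $\mathbf{x}$ and $\tilde{\mathbf{x}}$. Obviously,
$$K_{\mathbf{x}\mathbf{x}}=S_{\mathbf{x}}S_{\mathbf{x}}^*.$$
Problem (1) is equivalent to
$$\inf_{f\in H_\rho}\mathcal{E}(f),\qquad\mathcal{E}(f)=\int_{H\times\mathbb{R}}(f(x)-y)^2\,d\rho(x,y). \tag{9}$$
The function that minimizes the expected risk over all measurable functions is the regression function [8, 33], defined as
$$f_\rho(x)=\int_{\mathbb{R}}y\,d\rho(y|x),\qquad x\in H,\ \rho_X\text{-almost every}. \tag{10}$$
A simple calculation shows that the following well-known fact holds [8, 33]: for all $f\in L^2_{\rho_X}$, $\mathcal{E}(f)-\mathcal{E}(f_\rho)=\|f-f_\rho\|_\rho^2$. Under Assumption (2), $H_\rho$ is a subspace of $L^2_{\rho_X}$. Thus a solution $f_H$ for the problem (9) is the projection of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$, and for all $f\in H_\rho$ [0],
$$S_\rho^*f_\rho=S_\rho^*f_H, \tag{11}$$
and
$$\mathcal{E}(f)-\mathcal{E}(f_H)=\|f-f_H\|_\rho^2. \tag{12}$$

2.2 Kernel Conjugate Gradient Methods with Projection

In this subsection, we introduce KCGM with solutions restricted to the subspace $S$, a closed subspace of $H$. Let $P$ be the projection operator with range $S$. As noted in [9], a solution for the empirical risk minimization over $S$ is given by $\hat{\omega}=P\tilde{\omega}$ with $\tilde{\omega}$ such that
$$PT_{\mathbf{x}}P\tilde{\omega}=PS_{\mathbf{x}}^*\bar{\mathbf{y}}. \tag{13}$$
Note that as $T_{\mathbf{x}}=S_{\mathbf{x}}^*S_{\mathbf{x}}$,
$$PT_{\mathbf{x}}P=PS_{\mathbf{x}}^*S_{\mathbf{x}}P=(S_{\mathbf{x}}P)^*S_{\mathbf{x}}P.$$
Thus, (13) can be viewed as the normal equation of $S_{\mathbf{x}}P\omega=\bar{\mathbf{y}}$. Motivated by [5, 4], we study the following conjugate gradient type algorithm applied to this normal equation. For notational simplicity, we let
$$U=PT_{\mathbf{x}}P, \tag{14}$$
and write $U_\lambda$ to mean $U+\lambda I$.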

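Algorithm 1 (Projected-KCGM). For any $t=1,\dots,T$,
$$\omega_t=\arg\min_{\omega\in\mathcal{K}_t(U,PS_{\mathbf{x}}^*\bar{\mathbf{y}})}\|U\omega-PS_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H. \tag{15}$$
Here, $\mathcal{K}_t(U,PS_{\mathbf{x}}^*\bar{\mathbf{y}})$ is the so-called Krylov subspace, defined as
$$\mathcal{K}_t(U,PS_{\mathbf{x}}^*\bar{\mathbf{y}})=\mathrm{span}\{PS_{\mathbf{x}}^*\bar{\mathbf{y}},\,UPS_{\mathbf{x}}^*\bar{\mathbf{y}},\,\dots,\,U^{t-1}PS_{\mathbf{x}}^*\bar{\mathbf{y}}\}=\{p(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}:p\in\mathcal{P}_{t-1}\},$$
where $\mathcal{P}_{t-1}$ denotes the set of real polynomials of degree at most $t-1$.

Different choices of the subspace $S$ correspond to different algorithms. In particular, when $P=I$, the algorithm is the classical KCGM. In this paper, we will set
$$S=\mathrm{span}\Big\{\sum_{j=1}^{n}G_{ij}x_j:1\le i\le m\Big\},$$
where $G=[G_{ij}]_{1\le i\le m,\,1\le j\le n}$ is a random matrix, or $S=\mathrm{span}\{\tilde{x}_j:1\le j\le m\}$ with $\tilde{x}_j$ chosen randomly from $\mathbf{x}$. The following examples provide numerical realizations of Algorithm 1, considering randomized sketches, Nyström-subsampling sketches and non-sketching regimes.

Example 2.1 (Randomized sketches). Let $S=\mathrm{span}\{\sum_{j=1}^{n}G_{ij}x_j:1\le i\le m\}$, and $G=[G_{ij}]$ be a matrix in $\mathbb{R}^{m\times n}$. Let $R\in\mathbb{R}^{m\times r}$ be the matrix such that $RR^\top=(GK_{\mathbf{x}\mathbf{x}}G^\top)^\dagger$ with $r=\mathrm{rank}(R)$. Denote $\mathbf{K}=R^\top GK_{\mathbf{x}\mathbf{x}}^2G^\top R$ and $\mathbf{b}=R^\top GK_{\mathbf{x}\mathbf{x}}\bar{\mathbf{y}}$. In this case, Algorithm 1 is equivalent to $\omega_t=\frac{1}{\sqrt{n}}\sum_{i=1}^n(G^\top R\mathbf{a}_t)_i\,x_i$ with $\mathbf{a}_t$ given by
$$\mathbf{a}_t=\arg\min_{\mathbf{a}\in\mathcal{K}_t(\mathbf{K},\mathbf{b})}\|\mathbf{K}\mathbf{a}-\mathbf{b}\|_2. \tag{16}$$
We call this type of algorithm sketched-KCGM.

Example 2.2 (Subsampling sketches). In Nyström-subsampling sketches, $\tilde{\mathbf{x}}=\{\tilde{x}_1,\dots,\tilde{x}_m\}$ with each $\tilde{x}_j$ drawn randomly following a distribution from $\mathbf{x}$. Let $R\in\mathbb{R}^{m\times r}$ be the matrix such that $RR^\top=K_{\tilde{\mathbf{x}}\tilde{\mathbf{x}}}^\dagger$ with $r=\mathrm{rank}(R)$. Denote $\mathbf{K}=R^\top K_{\tilde{\mathbf{x}}\mathbf{x}}K_{\mathbf{x}\tilde{\mathbf{x}}}R$ and $\mathbf{b}=R^\top K_{\tilde{\mathbf{x}}\mathbf{x}}\bar{\mathbf{y}}$. In this case, Algorithm 1 is equivalent to $\omega_t=\frac{1}{\sqrt{m}}\sum_{i=1}^m(R\mathbf{a}_t)_i\,\tilde{x}_i$ with $\mathbf{a}_t$ given by
$$\mathbf{a}_t=\arg\min_{\mathbf{a}\in\mathcal{K}_t(\mathbf{K},\mathbf{b})}\|\mathbf{K}\mathbf{a}-\mathbf{b}\|_2.$$
We call this algorithm Nyström-KCGM.

Example 2.3 (Non-sketches [4]). For the ordinary non-sketching regimes, $S=H$. Let $\mathbf{K}=S_{\mathbf{x}}S_{\mathbf{x}}^*$. Then Algorithm 1 is equivalent to $\omega_t=\frac{1}{\sqrt{n}}\sum_{i=1}^n(\hat{\mathbf{a}}_t)_i\,x_i$, with $\hat{\mathbf{a}}_t$ given by
$$\hat{\mathbf{a}}_t=\arg\min_{\mathbf{a}\in\mathcal{K}_t(\mathbf{K},\bar{\mathbf{y}})}\|\mathbf{K}\mathbf{a}-\bar{\mathbf{y}}\|_{\mathbf{K}}.$$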
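The randomized-sketch realization can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, $K_{\mathbf{x}\mathbf{x}}$ is taken to be the Gram matrix divided by $n$ (as implied by the $1/\sqrt{n}$ normalization of the sampling operator), $R$ is read as a factor of a pseudo-inverse, and the Krylov least-squares problem is solved naively (orthonormal Krylov basis plus a dense least-squares solve) rather than by a conjugate gradient recursion.

```python
import numpy as np

def krylov_min_residual(K, b, t, tol=1e-12):
    """Minimize ||K a - b||_2 over a in span{b, K b, ..., K^{t-1} b}."""
    basis = []
    v = b.astype(float).copy()
    for _ in range(t):
        for _ in range(2):                 # two Gram-Schmidt passes for stability
            for q in basis:
                v = v - (q @ v) * q
        norm = np.linalg.norm(v)
        if norm <= tol * np.linalg.norm(b):    # Krylov space stopped growing
            break
        q_new = v / norm
        basis.append(q_new)
        v = K @ q_new                      # next Krylov direction
    if not basis:
        return np.zeros_like(b, dtype=float)
    V = np.column_stack(basis)
    coef, *_ = np.linalg.lstsq(K @ V, b, rcond=None)
    return V @ coef

def sketched_kcgm(gram, y, G, t):
    """Return c (length n) such that the t-th iterate is w_t = sum_i c[i] * x_i.

    gram[i, j] = <x_i, x_j> is the unnormalized Gram matrix and G is the m x n
    sketch matrix.
    """
    n = gram.shape[0]
    Kn = gram / n                          # assumed normalization: K_xx = gram / n
    y_bar = y / np.sqrt(n)
    M = G @ Kn @ G.T                       # G K_xx G^T
    w, V = np.linalg.eigh(M)
    keep = w > 1e-10 * w.max()
    R = V[:, keep] / np.sqrt(w[keep])      # R R^T = pinv(G K_xx G^T)
    GK = G @ Kn
    K_tilde = R.T @ GK @ GK.T @ R          # R^T G K_xx^2 G^T R
    b = R.T @ GK @ y_bar                   # R^T G K_xx y_bar
    a_t = krylov_min_residual(K_tilde, b, t)
    return G.T @ R @ a_t / np.sqrt(n)
```

A prediction at a new point $x$ is then $\sum_i c_i\langle x_i,x\rangle_H$. In practice the iteration count `t` would be chosen by a stopping rule on the residual; here it is simply an argument.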

In all the above examples, in order to execute the algorithms, one only needs to know how to compute $\langle x,x'\rangle_H$ for any two points $x,x'\in H$, which is possible in many cases, such as learning with kernel methods. In general, as the computation of the matrix $GK_{\mathbf{x}\mathbf{x}}=\frac{1}{\sqrt{n}}[GK_{\mathbf{x}x_1},GK_{\mathbf{x}x_2},\dots,GK_{\mathbf{x}x_n}]$ or $K_{\mathbf{x}\tilde{\mathbf{x}}}R$ can be parallelized, the computational costs are $O(m^3+m^2T)$ in time and $O(m^2)$ in space for sketched/Nyström KCGM after $T$ iterations, while they are $O(n^2T)$ in time and $O(n^2)$ in space for non-sketched KCGM. As shown both in theory and in our numerical results, the total number of iterations $T$ for the algorithms to achieve best performance is typically less than $m$ for sketched/Nyström KCGM.

A classical [9] or sketched [] kernel conjugate gradient type algorithm was proposed for solving the penalized empirical risk minimization. In contrast, Algorithm 1 solves the unpenalized empirical risk minimization and does not involve any explicit penalty. In this case, we do not need to tune the penalty parameter. The best generalization ability of Algorithm 1 is ensured by early stopping of the procedure, considering a suitable stopping rule.

The proofs for the three examples will be given in Subsection 4.1.

3 Main Results

In this section, we first introduce some common assumptions from statistical learning theory, and then present our statistical results for sketched/Nyström-KCGM and classical KCGM.

3.1 Assumptions

Assumption 1. There exist positive constants $Q$ and $M$ such that for all $l\ge2$ with $l\in\mathbb{N}$,
$$\int_{\mathbb{R}}|y|^l\,d\rho(y|x)\le\frac{1}{2}\,l!\,M^{l-2}Q^2, \tag{17}$$
$\rho_X$-almost surely. Furthermore, for some $B>0$, $f_H$ satisfies
$$\int_H\big(f_H(x)-f_\rho(x)\big)^2\,x\otimes x\,d\rho_X(x)\preceq B^2T. \tag{18}$$
Obviously, Assumption 1 implies that the regression function $f_\rho$ is bounded almost surely, as
$$|f_\rho(x)|\le\int_{\mathbb{R}}|y|\,d\rho(y|x)\le\Big(\int_{\mathbb{R}}|y|^2\,d\rho(y|x)\Big)^{1/2}\le Q. \tag{19}$$
(17) is satisfied if $y$ is bounded almost surely or $y=\langle\omega_*,x\rangle_H+\epsilon$ for some Gaussian noise $\epsilon$. (18) is satisfied if $f_H-f_\rho$ is bounded almost surely or the hypothesis space is consistent, i.e., $\inf_{H_\rho}\mathcal{E}=\mathcal{E}(f_\rho)$.

Assumption 2.
$f_H$ satisfies the following Hölder source condition
$$f_H=L^\zeta g_0,\qquad\text{with }\|g_0\|_\rho\le R. \tag{20}$$
Here, $R$ and $\zeta$ are non-negative numbers.

Assumption 2 relates to the regularity/smoothness of $f_H$. The bigger $\zeta$ is, the stronger the assumption and the smoother $f_H$, as $L^{\zeta_1}(L^2_{\rho_X})\subseteq L^{\zeta_2}(L^2_{\rho_X})$ when $\zeta_1\ge\zeta_2$.

7 Particularly, when ζ /, there exists some ω H H such that S ρ ω H = f H almost surely [33], while for ζ = 0, the assumption holds trivially. Assumption 3. For some γ [0, ] and c γ > 0, T satisfies N := trt T + I c γ γ, for all > 0. Assumption 3 characters the capacity of H. The left-hand side of is called the effective dimension [39]. As T is a trace-class operator, Condition is trivially satisfied with γ = which is called the capacity-independent case. γ 0, ] if the eigenvalues { i } of T satisfy i i γ. We refer to [9] for more comments on the above assumptions. Furthermore, it is satisfied with a general 3. General Results for Kernel Conjugate Gradient Method with Projection The following results provide convergence results for general projected-kcgm with a datadependent stopping rule. Theorem 3.. Under Assumptions, and 3, let a [0, ζ ]. Assume that for some C, and for any δ 0,, P I P T > C ζ a a log δ δ, = n ζ+γ b n,ζ,γ. Then the following results hold with probability at least δ. There exist positive constants C and C which depend only on ζ, γ, c γ, T, κ, M, Q, B, R, C such that if the stopping rule is then Uω t P S xȳ H C log 3 L a S ρ ωˆt f H ρ C log a δ n Furthermore, if ζ /, f H = S ρ ω H for some ω H H and ζ+/ δ n ζ+γ b ζ+/ n,ζ,γ, ζ a ζ+γ b ζ a n,ζ,γ. T a ωˆt ω H H C log a δ n ζ+γ. 3 ζ a Here, b n,ζ,γ = log n γ {ζ+γ }. 4 The convergence rate from the above is optimal as it matches the minimax lower rate On ζ a ζ+γ derived for ζ / in [7, 5]. Convergence results with respect to different measures are raised from statistical learning theory and inverse problems. In statistical learning theory, one typically is interested in the generalization ability, measured in terms of excess risks, S ρ ωˆt f ρ ρ = Ẽωˆt inf H Ẽ. In inverse problems, one is interested in the convergence within the space H. Theorem 3. asserts that projected-kcgm converges optimally if the projection error is small enough. 
Condition (22) is satisfied with random projections induced by randomized sketches or Nyström subsampling if the sketching dimension is large enough, as shown in Section 4. Thus we have the following corollaries for sketched or Nyström KCGM.

8 3.3 Results for Kernel Conjugate Gradient Methods with Randomized Sketches In this subsection, we state optimal convergence results with respect to different norms for KCGM with randomized sketches from Example.. We assume that the sketching matrix G satisfies the following concentration property: For any finite subset E in R n and for any t > 0, t m P Ga a t a c E e 0 logβ n. 5 Here, c 0 and β are universal non-negative constants. Example 3.. Many matrices satisfy the concentration property. Subgaussian sketches. Matrices with i.i.d. subgaussian such as Gaussian or Bernoulli entries satisfy 5 with some universal constant c 0 and β = 0. More general, if the rows of G are independent scaled copies of an isotropic ψ vector, then G also satisfies 5 [3]. Recall that a random vector a R n is ψ isotropic if for all v R n for some constant α. E[ a, v ] = v, and inf{t : E[exp a, v /t ] } α v, Randomized orthogonal system ROS sketches. As noted in [7], matrix that satisfies restricted isometric property from compressed sensing [6, ] with randomized column signs satisfies 5. Particularly, random partial Fourier matrix, or random partial Hadamard matrix with randomized column signs satisfies 5 with β = 4 for some universal constant c 0. Corollary 3.. Under Assumptions, and 3, let S = range{s xg }, where G R m n is a random matrix satisfying 5. Let δ 0,, a [0, ζ ] and m C 3 log 3 3 n γ [ log n γ ] γ, if ζ + γ, δ logβ n n γζ a aζ+γ, if ζ, n γ ζ+γ otherwise, for some C 3 > 0 which depends only on ζ, γ, c γ, T, κ, M, Q, B, R, c 0. Then the conclusions in Theorem 3. hold. When γ < ζ, the minimal sketching dimension is proportional to the effective dimension On γ ζ+γ up to a logarithmic factor, which we believe that it is unimprovable. According to Corollary 3., sketched-kcgm can generalize optimally if the sketching dimension is large enough. 
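Since the required sketch size is tied to the effective dimension $N(\lambda)=\mathrm{tr}\big(T(T+\lambda I)^{-1}\big)$, it may help to see how that quantity behaves. The sketch below is a minimal illustration computed from a finite list of eigenvalues; the helper name `effective_dimension` and the spectrum $\sigma_i=i^{-2}$ are assumptions made for the example, not taken from the paper.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """N(lam) = sum_i sigma_i / (sigma_i + lam) for the eigenvalues sigma_i of T."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(eigvals / (eigvals + lam)))

# Hypothetical spectrum with polynomial decay sigma_i = i^{-2}.
sigma = 1.0 / np.arange(1, 5001) ** 2
lams = [1e-1, 1e-2, 1e-3, 1e-4]
dims = [effective_dimension(sigma, lam) for lam in lams]
```

For this spectrum $N(\lambda)$ grows like $\lambda^{-1/2}$ as $\lambda$ decreases, far slower than the trivial bound $\mathrm{tr}(T)/\lambda$, which is the sense in which a fast eigenvalue decay permits a small projection dimension.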
3.4 Results for Kernel Conjugate Gradient Methods with Nyström Sketches In this subsection, we provide optimal rates with respect to different norms for KCGM with Nyström sketches from Example.. Corollary 3.3. Under Assumptions, and 3, let S = span{x,, x m }, ζ + γ >, δ 6 0,, a [0, ζ ] and m n ζ a aζ+γ [ log n γ ]. Then the conclusions in Theorem 3. are true. 8

The requirement on the sketch dimension $m$ of Nyström-KCGM does not depend on the probability constant $\delta$, but it is stronger than that of sketched-KCGM if $\gamma<1$, ignoring the factor $\delta$.

Remark 3.4. In the above, we only consider the plain Nyström subsampling. Using the approximated leveraging score (ALS) Nyström subsampling [35, 0], we can further improve the projection dimension condition to (26); see Section 4 for details. However, in this case, we need to compute the ALS with an appropriate pseudo regularization parameter.

3.5 Optimal Rates for Classical Kernel Conjugate Gradient Methods

As a direct corollary, we derive optimal rates for classical KCGM as follows, covering the non-attainable cases.

Corollary 3.5. Under Assumptions 1, 2 and 3, let $P=I$, $\delta\in(0,1)$ and $a\in[0,\zeta\wedge\frac{1}{2}]$. Then the conclusions in Theorem 3.1 are true.

To the best of our knowledge, the above results provide the first optimal capacity-dependent rate for KCGM in the non-attainable case, i.e., $\zeta<1/2$. This provides an answer to a question open since [4]. Convergence results for kernel partial least squares under different stopping rules have been derived in [, 30], but the derived optimal rates are only for the attainable cases. Our analysis could be extended to this different type of algorithm with similar stopping rules.

Figure 1: Squared prediction errors and training errors versus iterations for sketched KCGM with $m=n^{1/3}$ and plain Nyström KCGM with $m=n^{2/3}$, and $n=1024$.

We present some numerical results to illustrate our derived results in the setting of learning with kernel methods. In all the simulations, we constructed training data $\{(x_i,y_i)\}_{i=1}^n\subseteq\mathbb{R}\times\mathbb{R}$ from the regression model $y=f_\rho(x)+\xi$, where the regression function is $f_\rho(x)=|x-1/2|-1/2$, the input $x$ is uniformly drawn from $[0,1]$, and $\xi$ is zero-mean Gaussian noise with a fixed standard deviation.
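The simulation setup can be reproduced along the following lines. This is a hedged sketch: it takes the regression function to be $f_\rho(x)=|x-1/2|-1/2$ and the kernel to be the Sobolev kernel $K(x,x')=1+\min(x,x')$ used in these simulations, and it treats the noise level as a free parameter `noise_std` because its value is not recoverable from this text. All function names are hypothetical.

```python
import numpy as np

def f_rho(x):
    """Assumed regression function |x - 1/2| - 1/2 on [0, 1]."""
    return np.abs(x - 0.5) - 0.5

def make_data(n, noise_std, rng):
    """Draw x uniformly on [0, 1] and y = f_rho(x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = f_rho(x) + noise_std * rng.standard_normal(n)
    return x, y

def sobolev_gram(x, x2):
    """Gram matrix of the first-order Sobolev kernel K(x, x') = 1 + min(x, x')."""
    return 1.0 + np.minimum.outer(x, x2)

rng = np.random.default_rng(0)
x, y = make_data(64, noise_std=1.0, rng=rng)   # noise_std=1.0 is a placeholder
gram = sobolev_gram(x, x)
```

The resulting `gram` and `y` are the only inputs a kernel method needs, which is the point made earlier about requiring inner products alone.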
By construction, the function $f_\rho$ belongs to the first-order Sobolev space with $\|f_\rho\|_H=1$. In all the simulations, the RKHS is associated with a Sobolev

kernel $K(x,x')=1+\min(x,x')$. As noted in [37, Example 3] for the Sobolev kernel, according to [4], Assumption 3 is satisfied with $\gamma=1/2$. As suggested by our theory, we set the projection dimension $m=n^{1/3}$ for KCGM with ROS sketches based on the fast Hadamard transform, and $m=n^{2/3}$ for KCGM with plain Nyström sketches. We performed simulations for $n$ in the set $\{32, 64, 128, 256, 512, 1024\}$ so as to study scaling with the sample size. For each $n$, we performed 100 trials, and both squared prediction errors and training errors averaged over these 100 trials were computed. The errors for $n=1024$ versus the iterations are reported in Figure 1. For each $n$, the minimal squared prediction error over the first $m$ iterations is computed, and these errors versus the sample size are reported in Figure 2 in order to compare with a state-of-the-art algorithm, kernel ridge regression (KRR). From Figure 1, we see that the squared prediction errors decrease over the first 3 iterations and then increase, for both sketched and Nyström KCGM. This indicates that the number of iterations has a regularization effect. Our theory predicts that the squared prediction loss should tend to zero at the same rate $n^{-2/3}$ as that of KRR. Figure 2 confirms this theoretical prediction.

Figure 2: Prediction errors $\|S_\rho\hat{\omega}-f_\rho\|_\rho^2$ and scaled prediction errors $n^{2/3}\|S_\rho\hat{\omega}-f_\rho\|_\rho^2$ versus sample sizes for KRR, sketched KCGM with $m=n^{1/3}$, and plain Nyström KCGM with $m=n^{2/3}$.

All the results stated in this section will be proved in Section 4.

4 Proof

In this section and the appendix, we provide all the proofs.

4.1 Proof for Subsection 2.2

Let $Q$ be a compact operator from the Euclidean space $(\mathbb{R}^m,\|\cdot\|_2)$ to $H$ such that $S=\mathrm{range}(Q)$. It is easy to see that $Q^*Q\in\mathbb{R}^{m\times m}$. Let $r=\mathrm{rank}(Q^*Q)$ and $R\in\mathbb{R}^{m\times r}$ be the matrix such that $RR^\top=(Q^*Q)^\dagger$. As $P$ is the projection operator onto $S$, then
$$P=Q(Q^*Q)^\dagger Q^*=QRR^\top Q^*. \tag{27}$$
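The projection formula above can be checked numerically in finite dimensions. A minimal sketch, assuming $H=\mathbb{R}^d$ so that $Q$ is a $d\times m$ matrix, and reading $R$ as a factor of the pseudo-inverse, $RR^\top=(Q^\top Q)^\dagger$; the sizes and the deliberately rank-deficient $Q$ are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 4
Q = rng.standard_normal((d, m))
Q[:, 3] = Q[:, 0] + Q[:, 1]            # make Q rank deficient on purpose

gram = Q.T @ Q                         # Q^* Q, an m x m matrix of rank 3
w, V = np.linalg.eigh(gram)
keep = w > 1e-10 * w.max()             # discard the numerically zero eigenvalue
R = V[:, keep] / np.sqrt(w[keep])      # m x r factor with R R^T = pinv(Q^* Q)

P = Q @ R @ R.T @ Q.T                  # candidate orthogonal projection onto range(Q)
```

Building $R$ from an eigendecomposition keeps only the $r$ nonzero directions, so the reduced problem has size $r$ rather than $m$ when the sketched Gram matrix is singular.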

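For any polynomial function $q$, we have that
$$q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=q(PT_{\mathbf{x}}P)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=q(PS_{\mathbf{x}}^*S_{\mathbf{x}}P)PS_{\mathbf{x}}^*\bar{\mathbf{y}}.$$
Noting that $(S_{\mathbf{x}}P)^*=PS_{\mathbf{x}}^*$, and using Lemma 4.2 from the coming subsection,
$$q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=PS_{\mathbf{x}}^*q(S_{\mathbf{x}}PPS_{\mathbf{x}}^*)\bar{\mathbf{y}}=PS_{\mathbf{x}}^*q(S_{\mathbf{x}}PS_{\mathbf{x}}^*)\bar{\mathbf{y}}.$$
Introducing (27),
$$q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=QRR^\top Q^*S_{\mathbf{x}}^*\,q(S_{\mathbf{x}}QRR^\top Q^*S_{\mathbf{x}}^*)\,\bar{\mathbf{y}}. \tag{28}$$
Noting that $(R^\top Q^*S_{\mathbf{x}}^*)^*=S_{\mathbf{x}}QR$, and applying Lemma 4.2,
$$q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}=QR\,q(R^\top Q^*S_{\mathbf{x}}^*S_{\mathbf{x}}QR)\,R^\top Q^*S_{\mathbf{x}}^*\bar{\mathbf{y}}=QR\,q(\mathbf{K})\mathbf{b}, \tag{29}$$
where we denote
$$\mathbf{b}=R^\top Q^*S_{\mathbf{x}}^*\bar{\mathbf{y}},\qquad\text{and}\qquad\mathbf{K}=R^\top Q^*S_{\mathbf{x}}^*S_{\mathbf{x}}QR.$$
Using $RR^\top=(Q^*Q)^\dagger$, which implies $RR^\top Q^*QRR^\top=RR^\top$ and, for any $g\in H$,
$$\|QRR^\top Q^*g\|_H^2=\langle QRR^\top Q^*QRR^\top Q^*g,g\rangle_H=\langle QRR^\top Q^*g,g\rangle_H=\|R^\top Q^*g\|_2^2,$$
we get from (28) that
$$\|q(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H=\|R^\top Q^*S_{\mathbf{x}}^*\,q(S_{\mathbf{x}}QRR^\top Q^*S_{\mathbf{x}}^*)\,\bar{\mathbf{y}}\|_2=\|q(\mathbf{K})\mathbf{b}\|_2, \tag{30}$$
where we used Lemma 4.2 for the last equality. Note that the solution of (15) is given by $\omega_t=p_t(U)PS_{\mathbf{x}}^*\bar{\mathbf{y}}$, with
$$p_t=\arg\min_{p\in\mathcal{P}_{t-1}}\|(Up(U)-I)PS_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H.$$
Using (29) and (30), we know that $\omega_t=QR\,p_t(\mathbf{K})\mathbf{b}$, with
$$p_t=\arg\min_{p\in\mathcal{P}_{t-1}}\|(\mathbf{K}p(\mathbf{K})-I)\mathbf{b}\|_2,$$
which is equivalent to $\omega_t=QR\mathbf{a}_t$, with
$$\mathbf{a}_t=\arg\min_{\mathbf{a}\in\mathcal{K}_t(\mathbf{K},\mathbf{b})}\|\mathbf{K}\mathbf{a}-\mathbf{b}\|_2.$$

Proof for Example 2.1. For general randomized sketches, $Q=S_{\mathbf{x}}^*G^\top$. In this case, $Q^*Q=GS_{\mathbf{x}}S_{\mathbf{x}}^*G^\top=GK_{\mathbf{x}\mathbf{x}}G^\top$,
$$\mathbf{K}=R^\top GS_{\mathbf{x}}S_{\mathbf{x}}^*S_{\mathbf{x}}S_{\mathbf{x}}^*G^\top R=R^\top GK_{\mathbf{x}\mathbf{x}}^2G^\top R,\qquad\mathbf{b}=R^\top GS_{\mathbf{x}}S_{\mathbf{x}}^*\bar{\mathbf{y}}=R^\top GK_{\mathbf{x}\mathbf{x}}\bar{\mathbf{y}},$$
and $\omega_t=S_{\mathbf{x}}^*G^\top R\mathbf{a}_t$.

Proof for Example 2.2. In Nyström subsampling, $\tilde{\mathbf{x}}$ is a subset of size $m<n$ drawn randomly following a distribution from $\mathbf{x}$, $Q=S_{\tilde{\mathbf{x}}}^*$, and $Q^*Q=K_{\tilde{\mathbf{x}}\tilde{\mathbf{x}}}$. In this case, $\mathbf{K}=R^\top K_{\tilde{\mathbf{x}}\mathbf{x}}K_{\mathbf{x}\tilde{\mathbf{x}}}R$, $\mathbf{b}=R^\top K_{\tilde{\mathbf{x}}\mathbf{x}}\bar{\mathbf{y}}$, and $\omega_t=S_{\tilde{\mathbf{x}}}^*R\mathbf{a}_t$.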
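The derivation above repeatedly moves a polynomial across an operator through the identity $f(SS^*)S=Sf(S^*S)$ quoted from the next subsection. A quick numerical check for a polynomial $f$ and a real matrix $S$ is sketched below; the helper `poly_of_matrix` and the particular coefficients are illustrative assumptions.

```python
import numpy as np

def poly_of_matrix(coeffs, A):
    """Evaluate sum_k coeffs[k] * A^k by Horner's rule."""
    identity = np.eye(A.shape[0])
    result = np.zeros_like(A)
    for c in reversed(coeffs):
        result = result @ A + c * identity
    return result

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 3))        # stands in for a compact operator
coeffs = [1.0, 2.0, 3.0]               # f(u) = 1 + 2u + 3u^2

lhs = poly_of_matrix(coeffs, S @ S.T) @ S   # f(S S^*) S, computed with a 5 x 5 matrix
rhs = S @ poly_of_matrix(coeffs, S.T @ S)   # S f(S^* S), computed with a 3 x 3 matrix
```

The practical value of the identity is visible in the shapes: the right-hand side only ever forms the smaller $3\times3$ matrix, which mirrors how the proof reduces an operator problem on $H$ to an $r\times r$ matrix problem.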

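Proof for Example 2.3. For the ordinary non-sketching regimes, $S=H$ and $P=I$. Denote $\mathbf{K}=S_{\mathbf{x}}S_{\mathbf{x}}^*$. Then
$$\omega_t=\arg\min_{\omega\in\mathcal{K}_t(T_{\mathbf{x}},S_{\mathbf{x}}^*\bar{\mathbf{y}})}\|T_{\mathbf{x}}\omega-S_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H$$
is equivalent to
$$\omega_t=p_t(T_{\mathbf{x}})S_{\mathbf{x}}^*\bar{\mathbf{y}}=p_t(S_{\mathbf{x}}^*S_{\mathbf{x}})S_{\mathbf{x}}^*\bar{\mathbf{y}}=S_{\mathbf{x}}^*p_t(\mathbf{K})\bar{\mathbf{y}}=S_{\mathbf{x}}^*\hat{\mathbf{a}}_t,$$
with $\hat{\mathbf{a}}_t$ given by
$$\hat{\mathbf{a}}_t=\arg\min_{\mathbf{a}\in\mathcal{K}_t(\mathbf{K},\bar{\mathbf{y}})}\|\mathbf{K}\mathbf{a}-\bar{\mathbf{y}}\|_{\mathbf{K}}.$$
Indeed,
$$\|T_{\mathbf{x}}\omega-S_{\mathbf{x}}^*\bar{\mathbf{y}}\|_H=\|S_{\mathbf{x}}^*(S_{\mathbf{x}}\omega-\bar{\mathbf{y}})\|_H=\|S_{\mathbf{x}}\omega-\bar{\mathbf{y}}\|_{\mathbf{K}},$$
and for any polynomial function $p$,
$$S_{\mathbf{x}}p(T_{\mathbf{x}})S_{\mathbf{x}}^*\bar{\mathbf{y}}=S_{\mathbf{x}}p(S_{\mathbf{x}}^*S_{\mathbf{x}})S_{\mathbf{x}}^*\bar{\mathbf{y}}=\mathbf{K}p(\mathbf{K})\bar{\mathbf{y}}.$$

In the remaining subsections, we present the proofs for Section 3.

4.2 Operator Inequalities

We first introduce some necessary operator inequalities.

Lemma 4.1 ([3, Cordes inequality]). Let $A$ and $B$ be two positive bounded linear operators on a separable Hilbert space. Then
$$\|A^sB^s\|\le\|AB\|^s,\qquad\text{when }0\le s\le1.$$

Lemma 4.2. Let $H_1$, $H_2$ be two separable Hilbert spaces and $S:H_1\to H_2$ a compact operator. Then for any well-defined function $f$ over $[0,\|S\|^2]$,
$$f(SS^*)S=Sf(S^*S).$$
Proof. The result can be proved using the singular value decomposition of a compact operator.

Lemma 4.3. Let $A$ and $B$ be two non-negative bounded linear operators on a separable Hilbert space with $\max(\|A\|,\|B\|)\le\kappa^2$ for some non-negative $\kappa^2$. Then for any $\zeta>0$,
$$\|A^\zeta-B^\zeta\|\le C_{\zeta,\kappa}\|A-B\|^{\zeta\wedge1}, \tag{31}$$
where
$$C_{\zeta,\kappa}=\begin{cases}1 & \text{when }\zeta\le1,\\ 2\zeta\kappa^{2\zeta-2} & \text{when }\zeta>1.\end{cases} \tag{32}$$
Proof. The proof is based on the fact that $u^\zeta$ is operator monotone if $0<\zeta\le1$. For $\zeta\ge1$, we refer to [9] or [5] for the proof.

Lemma 4.4. Let $X$ and $A$ be bounded linear operators on a separable Hilbert space $H$. Suppose that $A\succeq0$ and $\|X\|\le1$. Then for any $s\in[0,1]$ and any $\lambda\ge0$,
$$X^*(A+\lambda I)^sX\preceq(X^*AX+\lambda X^*X)^s\preceq(X^*AX+\lambda I)^s. \tag{33}$$
As a result, for any $\lambda\ge0$ and any $\omega\in H$,
$$\|(A+\lambda I)^{s/2}X\omega\|_H\le\|(X^*AX+\lambda X^*X)^{s/2}\omega\|_H\le\|(X^*AX+\lambda I)^{s/2}\omega\|_H, \tag{34}$$
and for any bounded linear operator $F$ on $H$,
$$\|FX^*(A+\lambda I)^{s/2}\|\le\|F(X^*AX+\lambda I)^{s/2}\|. \tag{35}$$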

13 Proof. Note that X X I since X. In fact, X Xω, ω H = Xω H ω H = ω, ω H. Following from [6], the fact that the function u s is operator monotone, one can prove 33: X A + I s X X AX + X X s X AX + I s. The proof for 34 can be done by applying 33: A + I s Xω H = X A + I s Xω, ω H X AX + I s ω, ω H = X AX + I s ω H. The proof for 35 can be done by applying 33: F X A + I s = F X A + I s XF F X AX + I s F = F X AX + I s. Lemma 4.5 [9]. Let P be a projection operator in a Hilbert space H, and A, B be two semidefinite positive operators on H. For any 0 s, t, we have A s I P A t A B s+t + B I P B s+t. 4.3 Orthogonal Polynomials and Some Notations We denote by ξ x,i, e x,i i an eigenvalue-eigenvector orthogonal basis for the operator U. It is easy to see that ξ x,i [0, κ ], as U is semi-definite and U T x κ by 8. For any u 0, we denote F u the orthogonal projection in H onto the subspace {e x,i : ξ x,i < u} and let F u = I F u. Denote N 0 = N {0}. For any t N 0, denote with P t the set of polynomials of degree at most t and P 0 t the set of polynomials in P t having constant term equal to. For any t N 0 and functions ψ, φ : R R, define Denote p r t and let q r t q t the minimizer for [ψ, φ] r = ψup S xȳ, U r φup S xȳ H. arg min[p, p] r, p Pt 0 P t be such that p r t u = uq r t u. We write p t and q t to mean p t and, respectively. According to the definition from Algorithm, we know that ω i = q i UP S xȳ, p i u = uq i u. In the case i = 0, we set q 0 = 0 and p 0 =. Let r N 0. Observe that for any function φ, [φ, φ] r = i φξ x,i ξ r x,i P S xȳ, e x,i H. Define m 0 the number of distinct positive eigenvalues of U such that P Sxȳ has nonzero projection on the corresponding eigenspace. Using that Ue x,i = 0 implies S x P e x,i = 0 as U = S x P S x P, we can prove that the measure defining [, ] r has finite support of cardinality m 0. Using the fact that a polynomial of degree t has at most t roots except t = 0, it is easy to show 3

14 that [, ] r with r N 0 is an inner product on the space P m0. Furthermore, there exists some p m0 P m 0 0 such that [p m0, p m0 ] r = 0, and p m0 has m 0 distinct roots belonging to 0, κ ]. Based on [5, Proposition.] or using a similar argument based on the projection theorem as that in [4], {p r i } m 0 i= are orthogonal with respect to [, ] r. Thus the polynomial p r t with t < m 0 has exactly t distinct roots belonging to 0, κ ], denoted by x r k,t k t in increasing order. For notational simplicity, we write x k,t to mean x k,t. The following lemma summarizes some basic facts about the orthogonal polynomials. Lemma 4.6. Let r N and t be any integer satisfying t < m 0. Then the following results hold. x r,t < xr+,t For u [0, x r,t 3 p r t ], 0 pr t 0 x r,t. 4 p t0 p t 0 + [p t,p t ] 0. u, 0 q r t [p t,p t ] Proof. See [5, Corollary.7]. As p r t 0 p r t Pt 0, p r t 0 =. Thus, p r t u. Moreover, 0 q r t uu = p r t uu and q r t u p r t 0. is convex and decreasing on [0, x r,t ]. Therefore, u and q r t u = pr t u u = pr t 0 p r t u 0 u p r t 0 = p r t 0. 3 Rewriting p r t u as t j= u/xr j,t, and taking the derivative on 0, we get p r t 0 = t j= x r j,t x r,t, which leads to the desired result. 4 Following from [5, Corollary.6], p r t 0 0 in the proof for Part, and that [p t, p t ] 0 [p t, p t ] 0 since p t is the minimizer of [, ] 0 over Pt 0, one can get the result. 4.4 Deterministic Analysis In the proof, we introduce an intermediate function ω H, defined as follows, ω = G T S ρf H, 36 where G u = { u, if u, 0, if u <. Lemma 4.7. Under Assumption, let ω be given by 36 for some > 0. Then we have For any a ζ, L a S ρ ω f H ρ R ζ a. 37 T a / ω H R { ζ+a, if ζ a ζ, κ ζ+a, if a ζ. 38 4

15 The proof can be found in [8, Page 40]. We next introduce some useful notations. := T x T T x T, := T T x ω S xȳ H, 3 := T x T HS, 4 := T T T x, 5 := T I P = T I P T, We also need the following preliminary lemmas. Lemma 4.8 [9]. Under Assumption, we have T x S xȳ T xp ω H + R { 5 + ζ, if ζ, κ κ ζ, if ζ >. 39 The proof for the above lemma can be found in [9]. We provide a proof in Appendix A. for completeness. Lemma 4.9. Let A : H H be a bounded operator. Under Assumption, AP ω R AU H ζ, if ζ, R A C ζ,κ ζ 3 + AU C ζ,κ ζ + AU ζ, if ζ >. 40 Proof. If 0 < ζ, by a simple calculation, and applying Part of Lemma 4.7, Using 35 from Lemma 4.4, we get which leads to the desired result. If ζ, applying Part of Lemma 4.7, AP ω H AP T x T x T T ω H AP T x T ω H AP T x R ζ. AP ω H AP T x P + I Rζ, AP ω H AP T ζ T ζ ω AP T ζ R. Adding and subtracting with the same term and using the triangle inequality, AP ω H R AP T ζ ζ T x + AP T ζ x R AP T ζ ζ T x + AP T ζ x. 5

16 Applying Lemma 4.3 with 3 and 8, we get With AP ω H R AP C ζ R AP C ζ,κ ζ,κ ζ V = T x P T x = P T x P T x 3 + AP T ζ x 3 + AP T ζ x. 4 and Lemma 4., we can rewrite P T ζ x as P T x T ζ x V ζ + P T x V ζ = P T x T ζ x V ζ + U ζ P T x. Thus, combining with the triangle inequality, we get AP T ζ x AP T x T ζ x V ζ + AU ζ P T x Applying Lemma 4.3 with V T x κ, AP T x T ζ x V ζ + AU ζ P T x. AP T ζ x AP T x C ζ,κ T x V ζ + AU ζ P T x. Using Lemma 4.5, I P = I P and A A = A, we have and we thus get T x V = T x I P T x T x T + T I P T 3 + 5, AP T ζ x AP T x C ζ,κ ζ + AU ζ P T x. 4 Applying 35 of Lemma 4.4, we get AP T x AP T x P = AU and AU ζ P T x AU ζ. Thus, AP T ζ x AU Cζ,κ ζ + AU ζ. Introducing the above into 4, one can get AP ω H R AP C ζ,κ ζ which leads to the desired result by noting that AP A. 3 + AU Cζ,κ ζ + AU ζ, With the above lemmas, we can prove the following result for estimating L a S ρ ω t f H ρ. Lemma 4.0. Under Assumption, let u 0, x,t ] and 0 a ζ. Then the following statements hold. If ζ, L a S ρ ω t f H ρ a + a u + a u p t0 a + a p t0 + u + a + 5 / + R ζ u Uω t P S xȳ H + R u + a ζ + 5 / a + R ζ a. 43 6

17 If ζ, L a S ρ ω t f H ρ a p t 0 a + a p t0 + + a RC ζ,κ u + a + Rκ ζ κ u ζ ζ u + u ζ u + a + a Uω t P S xȳ u + a H + R κ ζ a 5 + ζ a. 44 u Proof. Adding and subtracting with the same term, and then using the triangle inequality, L a S ρ ω t f H ρ L a S ρ ω t ω ρ + L a S ρ ω f H ρ L a S ρ ω t ω ρ + R ζ a, where we used Part of Lemma 4.7 for the last inequality. Using and 5, L a S ρ = L Sρ S ρ a S ρ = L Sρ S ρs ρ a = L Sρ T a L a S ρ ω t f H ρ L Sρ T a ω t ω ρ + R ζ a T a ω t ω H + R ζ a. 45 Subtracting and adding with the same term, then using the triangle inequality, L a S ρ ω t f H ρ T a ω t P ω H + T a I P ω H + R ζ a. Since P is a projection operator, I P s = I P for any s > 0, and we thus can get L a S ρ ω t f H ρ T a ω t P ω H + T a I P a I P T T ω H + R ζ a. Using Lemma 4. and Part of Lemma 4.7, we get [9], L a S ρ ω t f H ρ T a ω t P ω H + a 5 Rκ ζ + ζ + R ζ a. 46 In what follows, we estimate T a ω t P ω H. Estimating T a ω t P ω H. We first have T a ω t P ω H T a T a T a T a x T a x ω t P ω H. Obviously, T a T a and by Lemma 4., T a T a x T T x a a. Thus, T a ω t P ω H a T a x ω t P ω H = a T a x P ω t P ω H, where the last equality follows from the facts that ω t S and that P is the projection operator with range S which implies P = P and ω t = P ω t. Noting that P, using 34, we get T a ω t P ω H a U a ω t P ω H. Adding and subtracting with the same term, using the triangle inequality, and noting that ω t = P ω t, T a ω t P ω H a F u U a ω t P ω H + Fu U a P ω t ω H. 7

18 Introducing with ω t = q t UP S xȳ, T a ω t P ω H a F u U a q t UP Sxȳ P ω H + Fu U a P ω t ω H. In what follows, we estimate the last two terms from the above. Estimating Fu U a P ω t ω H. By a direct calculation, following from the definition of U given by 4 and P = P, Fu U a P ω t ω H Fu U a U Fu U UP ω t ω H u + a Fu U u Uω t P T x P ω H. Adding and subtracting with the same term, and using the triangle inequality, F u U a u + a u u + a u P ω t ω H Using 35, U P T x U F u U Uω t P S xȳ H + F u U Uω t P S xȳ H u + Fu U a P ω t ω u + a H u P S xȳ T xp ω H + U P T x T x S xȳ T xp ω H P T xp + I =, and thus Uω t P S xȳ H u + + T x S xȳ T xp ω H Estimating F u U a q t UP S xȳ ω H. Adding and subtracting with the same term, noting that P = P, and using the triangle inequality, we get Using 35, F u U a q t UP Sxȳ P ω H F u U a q t UP Sxȳ T xp ω H + F u U a p t UP ω H F u U a q t UP T x T x S xȳ T xp ω H + F u U a p t UP ω H. 49 F u U a q t UP T x F uu a q t UP T x P + I = Fu U a q t U max x + x [0,u] a q t x max xqt x a q t x a + a q t x x [0,u] p t0 a + a p t0, 50 where we used Part of Lemma 4.6 with u [0, x,t ] for the last inequality. Introducing the above into 49, we get F u U a q t UP Sxȳ P ω H p t0 a + a p t0 T x S xȳ T xp ω H + F u U a p t UP ω H. 8

Substituting the above and 48 into 47, we get T a ω t P ω H a p t0 a + a p t0 + + a u + a u u + a T u x S xȳ T xp ω H Uω t P S xȳ H + F u U a p t UP ω H

In what follows, we estimate F u U a p t UP ω H, considering two different cases.

If 0 < ζ, applying Lemma 4.9, F u U a p t UP ω H F u U a p t UU R ζ max x [0,u] p txx + a R ζ u + a R ζ, where we used Part of Lemma 4.6 for the last inequality. Substituting the above and 39 into 5, and then combining with 46, one can prove the desired result for ζ.

If ζ, applying Lemma 4.9 with A = F u U a p t U, we get F u U a p t UP ω H R A C ζ,κ ζ For any s 0, using Part of Lemma 4.6, AU Cζ,κ ζ + AU ζ. AU s = max x [0,u] x + a p t xx s u + a u s. 5 Using the above with s = 0,, ζ into 5, we get F u U a p t UP ω H RC ζ,κ ζ 3 + C ζ,κ ζ u + u ζ u + Substituting the above and 39 into 5, and then combining with 46, we can prove the desired result for ζ.

From Lemma 4.10, we can see that in order to control the error, we need to estimate the random quantities,, 3, 4, 5, p t0, and Uω t P S xȳ H. The random quantities will be estimated in Subsections 4.5 and 4.6, while Uω t P S xȳ H can be bounded due to the stopping rule. In order to estimate p t0, we introduce the following two lemmas; together with the stopping rule, they yield an estimate of p t0, as shown in the proof of the main theorem below.

Lemma 4.11. The following statements hold. If ζ, a. Uω t P S xȳ H p t / + R ζ + R c 3 p t0 3 ζ + ζ p t 0.

If ζ >, Uω t P S xȳ H + RC ζ,κ ζ p t0 + + R κ κ ζ 3 p t0 + c 3 C ζ,κ ζ p t0 3 + cζ+ p t0 ζ+. 53 Here, we denote 0 0 = and

Proof. Let Following from [5, 3.8], c v = v v, v 0. x,t φ t x = p t x. x,t x Uω t P S xȳ H F x,t φ t UP S xȳ H. Using the triangle inequality, with a basic calculation, we get Uω t P S xȳ H F x,t φ t UP S xȳ T xp ω H + F x,t φ t UUω H F x,t φ t UP T x T x S xȳ T xp ω H + F x,t φ t UUω H F x,t φ t UU T x S xȳ T xp ω H + F x,t φ t UUω H, 54 where we used 35 of Lemma 4.4 for the last inequality. Note that F x,t φ t UU Following from [5, 3.0], Thus, we get that sup φ t xx + sup x + φt x. x [0,x,t ] x [0,x,t ] sup φ t xx v c v p t0 v, v x [0,x,t ] F x,t φ t UU p t0 +. Substituting the above into 54, we get that Uω t P S xȳ H p t0 + T x S xȳ T xp ω H + F x,t φ t UUω H. 56

Now, we consider two cases.

Case I: ζ. Using Lemma 4.9, with U = UP, F x,t φ t UUω H R F x,t φ t UUU ζ R ζ max φ txxx +. x [0,x,t ] Applying 55, F x,t φ t UUω H R c 3 p t0 3 ζ + ζ p t 0.

Substituting the above and 39 into 56, one can get the desired result.

Case II: ζ >. Applying 55, we get that for any s 0, F x,t φ t UUU s H max φ txx s+ c s+ p t0 s+. x [0,x,t ] Using the above and Lemma 4.9, with U = UP, we get that F x,t φ t UUP ω H RC ζ,κ ζ 3 p t0 + c 3 Applying the above and 39 into 56, we get the desired result. C ζ,κ ζ p t0 3 + cζ+ p t0 ζ+.

Lemma 4.12. Let u 0, x,t ]. Then the following statements hold. If ζ, If ζ >, [p t, p t ] 0 u + + R 5 + ζ + Ru ζ + u [p t, p t ]. 57 [p t, p t ] 0 u + + R κ κ ζ + RC ζ,κu ζ 3 + u 3 Cζ,κ ζ + u ζ+ + u [p t, p t ]. 58

Proof. Since p t is the minimizer of [p, p] 0 over P 0 t and p t P 0 t, Using the triangle inequality, [p t, p t ] 0 [p t, p t ] 0 = p t UP S xȳ H [p t, p t ] 0 F u p t UP S xȳ H + Fu p t UP S xȳ H By a basic calculation, F u p t UP S xȳ T xp ω H + F u p t UUω H + F u p t UP S xȳ H. [p t, p t ] 0 F u p t UP T x T x S xȳ T xp ω H + F u p t UUω H + F u U U p t UP S xȳ H F u p t UU T x S xȳ T xp ω H + F u p t UUω H + u [p t, p t ], where we used 35 of Lemma 4.4 for the last inequality. Using Part of Lemma 4.6, we get and thus F u p t UU max x [0,u] p t xx + u +, [p t, p t ] 0 u + T x S xȳ T xp ω H + F u p t UUω H + u [p t, p t ]. 59

Case I: ζ. Using P = P and Lemma 4.9, F u p t UUω H = F u p t UUP ω R F u p t UUU ζ R max x [0,u] p t xxx + ζ. Using Part of Lemma 4.6, F u p t UUω H Ruu + ζ. Substituting the above and 39 into 59, one can get the desired result for ζ.

Case II: ζ. Using Part of Lemma 4.6, for any s 0, F u p t UUU s = max x [0,u] p t xx s+ u s+, Noting that as P = P, F u p t UUω H = F u p t UUP ω, and combining with Lemma 4.9, we get F u p t UUω H RC ζ,κu ζ 3 + u 3 Cζ,κ ζ + u ζ+. Substituting the above and 39 into 59, one can get the desired result for ζ.

4.5 Probabilistic Estimates

In this subsection, we introduce some probabilistic estimates to bound the random quantities,, 3, and 4

Lemma 4.13. Under Assumption 3, let δ 0,, and = n θ with θ [0, or = [ log n γ ]/n. Then with probability at least δ, T + I / T x + I / T + I / T x + I / 3aδ, where aδ = 8κ log 4κ ec γ+ δ T if = [ log n γ ]/n, or aδ = 8κ otherwise. log 4κ c γ+ δ T + θγ e θ

The proof of the above result for the case = n θ with θ [0, can be found in [8]. Here, using essentially the same idea, we also provide a similar result for the case = [ log n γ ]/n. We report the proof in Appendix A..

Lemma 4.14. Let 0 < δ < /. It holds with probability at least δ : T T x T T x HS κ log/δ κ + 4 log/δ. n n Here, HS denotes the Hilbert-Schmidt norm.

Proof. Using Lemma which is a direct corollary of the concentration inequality for Hilbert-space valued random variables from [6] from [3], one can prove the desired result.
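The preceding lemma controls the Hilbert-Schmidt deviation between the covariance operator T and its empirical counterpart T x. A minimal numerical sketch of that concentration, under assumptions that are ours and not the paper's, is given below: H is replaced by a finite-dimensional space, T is a known diagonal matrix with a polynomially decaying spectrum, and the inputs have Rademacher coordinates so that the squared norm of every x equals the trace of T (playing the role of κ squared). The function name `hs_deviation` and all parameter values are hypothetical. In finite dimensions the Hilbert-Schmidt norm is the Frobenius norm, and the printed deviation times the square root of n should stay roughly constant.

```python
import numpy as np

def hs_deviation(n, spectrum, rng):
    """Frobenius (Hilbert-Schmidt) norm of T - T_x for n i.i.d. samples.

    Assumed toy model: x = sqrt(spectrum) * eps with Rademacher eps, so that
    E[x x^T] = diag(spectrum) and ||x||^2 = sum(spectrum) for every sample.
    """
    d = spectrum.size
    signs = rng.choice([-1.0, 1.0], size=(n, d))
    X = signs * np.sqrt(spectrum)      # rows are the samples x_i
    T_x = X.T @ X / n                  # empirical covariance operator
    T = np.diag(spectrum)              # population covariance operator
    return np.linalg.norm(T - T_x, ord="fro")

rng = np.random.default_rng(0)
spectrum = 1.0 / np.arange(1, 51) ** 2   # eigenvalues of T (assumed decay)
for n in (100, 400, 1600, 6400):
    dev = hs_deviation(n, spectrum, rng)
    # If the deviation scales like 1/sqrt(n), the last column is roughly flat.
    print(n, dev, dev * np.sqrt(n))
```

This only illustrates the rate in n; it does not check the constants or the dependence on δ in the lemma.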

Lemma 4.15. Under Assumptions and 3, with probability at least δ, the following holds: T T x ω S xȳ H 4κM + κ ζ R ζ 83R n + κ ζ + 3B + 4Q c γ γ log n δ + Rζ. 60

The above lemma is essentially proved in [8, ]. We provide a proof in Appendix A.3.

Lemma 4.16. Under Assumption 3, let 0 < δ < /. It holds with probability at least δ : T κ T T x HS κ n + cγ n γ log δ.

The proof for the above lemma can be found in [9].

4.6 Projection Errors

In this subsection, we estimate projection errors I P T, considering different projections. The first lemma provides upper bounds on projection errors with plain Nyström subsampling.

Lemma 4.17. Under Assumption 3, let P be the projection operator with range Then with probability at least δ, δ 0, S = span{x,, x m }. I P T I P T η log mγ m 4κ log 4κ ec γ +, 6 δ T where η = log mγ m.

The following lemma estimates projection errors with randomized sketches.

Lemma 4.18. Under Assumption 3, let S = range{s xg }, where G R m n is a random matrix satisfying 5 and P be the projection operator with its range S. Then with probability at least 3δ δ 0, /3, we have I P T n θ log nγ n θ 7a γ log 4 δ, provided that m Cn θγ log β n log n γ c log 3 4 δ, c = { 0, if θ <, γ, if θ =. 6 Here, a γ = 4κ log κ e c γ+ T, and C = 00c 0 + 0b γ with b γ = 4κ 4κ + κ { θγ c γ + c γ log κ c γ + e θ, if θ <, + + c, c = T, if θ =.

Finally, the next lemma upper bounds projection errors with ALS Nyström subsampling sketches. The ALS Nyström subsampling is defined as follows.

Approximated Leveraging Scores (ALS) Nyström Subsampling. In this regime, S = range{sxg }, where each row m a j of G is i.i.d. drawn according to P a = e i = q i, qi where q i > 0 will be chosen later and {e i : i [n]} is the standard basis of R n. For every i [n] and > 0, the leveraging scores of KK + I are the sequence {l i } n i= with l i = KK + I ii, i [n]. In practice, the leveraging scores of KK + I are hard to compute, and we can only compute approximations ˆl i such that L l i ˆl i Ll i, for some L. In the ALS Nyström subsampling, we set q i := q i = ˆl i j ˆl j.

Lemma 4.19. Under Assumption 3, let S = range{s xg }, where G R m n is a randomized matrix related to ALS Nyström subsampling, and P be the projection operator with its range S. Then with probability at least 3δ δ 0, /3, we have I P T n θ log nγ n θ 4a γ log 4 δ, provided that m C n θγ log n γ c log 3 4 δ, c = {, if θ <, γ, if θ =. 63 Here, C = 8b γ L 4 + logb γ where a γ and b γ are given by Lemma 4.18. Here, = n θ if log nγ θ [0,, or = n if θ =.

Part of the proofs for the above lemmas can be found in [9]. We provide the proofs in Appendix A.4, A.5 and A.6.

4.7 Deriving Main Results

We are ready to prove the main theorem and its corollaries.

Proof of Theorem 3.1. Applying Lemmas 4.13, 4.14, , and Condition, and noting that [n, ], we get that with probability at least 5δ, the following inequalities hold: C log δ, C ζ log δ, 3 κ + n log δ C 3 ζ+γ log δ, C 3 = κ +,

where κ 4 κ n + cγ n γ log δ C 4 ζ log δ, C 4 = κκ + c γ, 5 C ζ a a log δ, 4κ log κ ec γ+ T + γ ζ+γ C = 4κ log κ ec γ+ T C = 4κM + κ ζ R +, if ζ + γ >, otherwise, 83R κ + 3B + 4Q c γ + R.

In what follows, we assume the above estimates hold and we prove the results considering two different cases.

Case I: ζ. We first have where we denote + 5 / + R ζ C 5 ζ log 3 C 5 = C C + C + R. By the above inequality and Lemma 4.11, we have δ, 64

Step 1: Uω t P S xȳ H p t0 + C 5 log 3 Now set the stopping rule as δ ζ + RC log δ c 3 Uω t P S xȳ H τ + C 5 log 3 where τ > 0 is given later. From the definition of ˆt, we have Combining with 65, noting that log δ τ log 3 δ ζ+ C 5 p ˆt 0 ζ log 3 3 log 3 δ ζ max Uωˆt P S xȳ H > τ + C 5 log 3 δ + RC log p t0 3 ζ + ζ p t δ ζ+,, by a simple calculation, δ C 5 p ˆt 0, RC c 3 δ ζ+. 66 c 3 p ˆt 0 3 ζ + ζ p ˆt 0 p ˆt 0 3, RC p ˆt 0. If the maximum is achieved at the first term of the right-hand side above, then and by a direct calculation, τ log 3 δ ζ+ 3C5 log 3 p ˆt 0 3C 5/τ. δ p ˆt 0 ζ, If the maximum is achieved at the second term or the third term, using a similar argument, one can show that at least one of the following two inequalities holds, p ˆt 0 3RC c 3 /τ 3,

Now we choose τ as Then, following from the above analysis, p ˆt 0 6C R/τ. τ max 3 C 5, 6 RC c 3, C R. p ˆt

Step 2: In this step, we choose u =. Using 67 and Part 3 of Lemma 4.6, it is easy to show that Applying Lemma 4.12, with 64, Combining with 66, [pˆt, pˆt ] 0 u p ˆt 0 x,ˆt. C 5 + C R ζ+ 3 log C5 + C R δ + u [p ˆt, p ˆt ]. [pˆt, pˆt ] 0 Uωˆt C 5 + τ P S xȳ H + u [p ˆt, p ˆt ] C5 + C R = [pˆt C 5 + τ, pˆt ] 0 + u [p ˆt, p ˆt ] [pˆt, pˆt ] 0 + u [p ˆt, p ˆt ], provided that τ C 5 + C R C 5. Thus, we get Combining with Part 4 of Lemma 4.6 and 67, we get that [pˆt, pˆt ] 0 u [p ˆt, p ˆt ], 68 p ˆt 0 p ˆt 0 + 4u

Step 3. In this step, we let u = 5. Then following from 69 and Part 3 of Lemma 4.6, we have u p ˆt 0 x,ˆt. 70 Using Lemma 4.10, and substituting 69, 64 and the above estimates, we have where L a S ρ ωˆt f H ρ C 6 ζ a log a δ + 6C a a Uωˆt P S xȳ H log a δ. C 6 = C 5 a + C 5 + C a R6/5 a + C a + R. From the definition of the stopping rule, we get L a S ρ ωˆt f H ρ C 6 + 6C a τ + C 5 ζ a log a δ,

which leads to the desired result for ζ.

Case II: ζ >.

Step 1. Substituting the estimates given in the beginning of the proof and using ζ a a, where Using Lemma 4.11, Uω t P S xȳ H + RC ζ,κ ζ + R κ κ ζ C 7 ζ log 3 C 7 = C C + RκC 4 + C κ ζ. p t0 + C 7 ζ log 3 3 p t0 + c 3 δ ζ as δ, 7 C ζ,κ ζ p t0 3 + cζ+ p t0 ζ+. Notice that by a direct calculation, with ζ >, <, κ and log δ, Therefore, Uω t P S xȳ H ζ 3 C 3 ζ+γ/ log ζ C 3 ζ log, and 7 δ δ ζ C 3 ζ+γ/ log δ + ζ a a C log ζ δ p t0 + C 7 ζ log 3 + RC ζ,κc 3 ζ p t 0 + c 3 C ζ,κ C3 + C Now set the stopping rule as C 3 + C ζ log δ. 73 δ Uω t P S xȳ H τ + C 7 log 3 where τ > 0 is given later. From the definition of ˆt, we have Uωˆt P S xȳ H > τ + C 7 log 3 ζ p t0 3 + cζ+ δ ζ+, Letting t = ˆt in 74 and combining with 75, by a direct calculation, τ log 3 δ ζ+ C7 p ˆt 0 ζ log 3 + RC ζ,κc 3 ζ p ˆt 0 + c 3 C ζ,κ C3 + C 4 log 3 δ ζ max δ C 7 p ˆt 0, Rc 3 C ζ,κ C3 + C p ˆt 0 3, RC ζ,κc 3 p ˆt 0, c ζ+ R ζ p ˆt 0 ζ+. p t0 ζ+ log δ. 74 δ ζ+. 75 ζ p ˆt cζ+ p ˆt 0 ζ+ log δ

Therefore, if τ max 4 C 7, 8 Rc 3 C ζ,κ C3 + C, 6RCζ then 67 holds, using a similar basic argument,κc 3, ζ+ 5 cζ+

Step 2. In this step, we let u =. Using 67 and Part 3 of Lemma 4.6, it is easy to show that u p ˆt 0 x,ˆt. Applying Lemma 4.12, substituting 7, 7 and 73, and by a direct calculation, where Combining with 75, we get that [pˆt, pˆt ] 0 C 8 ζ+ log 3 δ + [p ˆt, p ˆt ], C 8 = C 7 + R C ζ,κc 3 + C ζ,κ C 3 + C +. [pˆt, pˆt ] 0 C 8 τ + C 7 Uωˆt P S xȳ H+ [p ˆt, p ˆt ] [pˆt, pˆt ] 0 + [p ˆt, p ˆt ], provided that τ C 8 C 7. This leads to 68. Combining with Part 4 of Lemma 4.6 and 67, we get that 69 holds.

Step 3. In this step, we let u = 5. Then following from 69 and Part 3 of Lemma 4.6, we have 70. The rest of the proof parallels that for the case ζ, so we only sketch it. Applying Lemma 4.10, substituting 69, 7, 7 and 73, where C 9 = C a R, L a S ρ ωˆt f H ρ C 9 ζ a log a δ + C a 6 Uωˆt P S xȳ H a log a δ, C 7 5 a + + RC ζ,κ6/5 a C 3 + C 3 + C / ζ +Rκ ζ C a +. Following from the definition of the stopping rule, one can get the desired result for the case ζ.

The proof for 3 with ζ / is the same, as we can replace L a S ρ ω t f H H by T a ω t ω H H in the whole proof for the convergence with respect to L ρ X -norm.

Proof of Corollary 3.2. We use Theorem 3.1 and Lemma 4.18 to prove the result. We only need to verify is satisfied. In Lemma 4.18, we let ζ a aζ+γ, if ζ >, θ = ζ+γ, otherwise,, if ζ + γ. log nγ γ log n θ Clearly, θ. For θ <, we have = γ n θ θn θ θ. Therefore, following from Lemma 4.18 and Condition 6, we have that, with probability at least 3δ δ 0, /3, I P T C ζ a a log 4 δ, with C = 7a γ if = [ log n γ ]/n or C = 7aγ θ 8 otherwise. The proof is complete.
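The projection error I P T from Subsection 4.6, which the corollary proof above relies on, can be explored numerically. The following is a hedged sketch under our own assumptions, not the authors' experiment: H is finite-dimensional, T is a known diagonal matrix, the inputs have Rademacher coordinates, the Gaussian matrix stands in for a generic randomized sketch, and exact leverage scores (L = 1) of the normalized Gram matrix K = X X^T / n are used for the ALS scheme; that normalization of K and of the regularization parameter is an assumption. All function names are hypothetical.

```python
import numpy as np

def range_basis(B, tol=1e-10):
    """Orthonormal basis (as columns) of range(B^T), the subspace S."""
    U, sv, _ = np.linalg.svd(B.T, full_matrices=False)
    return U[:, sv > tol * sv[0]]

def projection_error(B, T_sqrt):
    """Spectral norm of (I - P) T^{1/2}, P = orthogonal projector onto range(B^T)."""
    Q = range_basis(B)
    residual = T_sqrt - Q @ (Q.T @ T_sqrt)
    return np.linalg.norm(residual, 2)

def leverage_scores(X, lam):
    """Diagonal of K (K + lam I)^{-1} with K = X X^T / n (assumed normalization)."""
    n = X.shape[0]
    K = X @ X.T / n
    # K and (K + lam I)^{-1} commute, so solving (K + lam I) Z = K gives the same diagonal.
    return np.diag(np.linalg.solve(K + lam * np.eye(n), K))

rng = np.random.default_rng(0)
n, d, m = 300, 60, 15
spectrum = 1.0 / np.arange(1, d + 1) ** 2
X = rng.choice([-1.0, 1.0], size=(n, d)) * np.sqrt(spectrum)   # rows x_i
T_sqrt = np.diag(np.sqrt(spectrum))

# Plain Nystrom: S spanned by m inputs chosen uniformly at random.
plain = X[rng.choice(n, size=m, replace=False)]
# Randomized sketch: S = range of (G X)^T with a Gaussian G of size m x n.
gauss = rng.standard_normal((m, n)) @ X
# ALS Nystrom: m i.i.d. draws with probabilities proportional to leverage scores.
scores = leverage_scores(X, lam=1.0 / n)
als = X[rng.choice(n, size=m, replace=True, p=scores / scores.sum())]

for name, B in (("plain", plain), ("gaussian", gauss), ("als", als)):
    print(name, projection_error(B, T_sqrt))
```

Rescaling the rows of G by the inverse square roots of the sampling probabilities, as in the definition of the ALS scheme, does not change the range S, so the sketch omits it.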

Proof of Corollary 3.3. The proof for Corollary 3.3 is similar, using Theorem 3.1 and Lemma 4.17. We thus skip it.

Combining Theorem 3.1 with Lemma 4.19, we get the following result for KCGM with ALS Nyström sketches.

Corollary 4.20. Under Assumptions, and 3, let δ 0,, a [0, ζ ], and S = span{ x,, x m } with x j drawn i.i.d. according to the ALS Nyström subsampling regime in Lemma 4.19 with an appropriate. Assume that m C 4 L log 3 3 n γ [ log n γ ] γ, if ζ + γ, n γζ a aζ+γ [ log n δ γ ], if ζ, 76 n γ ζ+γ [ log n γ ] otherwise, for some C 4 > 0 which depends only on ζ, γ, c γ, T, κ, M, Q, B, R. Then the conclusions in Theorem 3.1 are true.

Acknowledgements

This work was sponsored by the Department of the Navy, Office of Naval Research (ONR) under a grant number N It has also received funding from Hasler Foundation Program: Cyber Human Systems (project number 6066), and from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement n time-data).

References

[1] A. Alaoui and M. W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. arXiv preprint; Advances in Neural Information Processing Systems, 2015.
[2] H. Avron, K. L. Clarkson, and D. P. Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017.
[3] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
[4] G. Blanchard and N. Krämer. Optimal learning rates for kernel conjugate gradient regression. In Advances in Neural Information Processing Systems, pages 226–234, 2010.
[5] G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, 2018.
[6] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[7] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368.
[8] F. Cucker and D. X. Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
[9] L. H. Dicker, D. P. Foster, and D. Hsu. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11(1):1022–1047, 2017.
[10] P. Drineas and M. W. Mahoney. Lectures on randomized numerical linear algebra. arXiv preprint, 2017.
[11] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems, volume 375. Springer Science & Business Media, 1996.
[12] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing, volume.
[13] J. Fujii, M. Fujii, T. Furuta, and R. Nakamoto. Norm inequalities equivalent to Heinz inequality. Proceedings of the American Mathematical Society, 118(3):827–830, 1993.
[14] C. Gu. Smoothing spline ANOVA models, volume 297. Springer Science & Business Media, 2013.
[15] M. Hanke. Conjugate gradient type methods for ill-posed problems. Routledge, 2017.
[16] F. Hansen. An operator inequality. Mathematische Annalen, 246(3):249–250, 1980.
[17] F. Krahmer and R. Ward. New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.
[18] J. Lin and V. Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. arXiv preprint; under revision at Journal of Machine Learning Research, 2018.
[19] J. Lin and V. Cevher. Optimal rates of sketched-regularized algorithms for least-squares regression over Hilbert spaces. arXiv preprint; Proceedings of the 35th International Conference on Machine Learning, 2018.
[20] J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.
[21] J. Lin, A. Rudi, L. Rosasco, and V. Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. arXiv preprint; to appear in Applied and Computational Harmonic Analysis, 2018.
[22] S.-B. Lin and D.-X. Zhou. Optimal learning rates for kernel partial least squares. Journal of Fourier Analysis and Applications, 24(3), 2018.
[23] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3):277–289, 2008.
[24] S. Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. arXiv preprint arXiv:1112.5448, 2011.


More information
