arxiv: v1 [stat.ml] 5 Nov 2018


Kernel Conjugate Gradient Methods with Random Projections

Junhong Lin and Volkan Cevher
{junhong.lin,
Laboratory for Information and Inference Systems, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland

arXiv:1811.01760v1 [stat.ML] 5 Nov 2018

November 6, 2018

Abstract

We propose and study kernel conjugate gradient methods (KCGM) with random projections for least-squares regression over a separable Hilbert space. Considering two types of random projections generated by randomized sketches and Nyström subsampling, we prove optimal statistical results with respect to variants of norms for the algorithms under a suitable stopping rule. Particularly, our results show that if the projection dimension is proportional to the effective dimension of the problem, KCGM with randomized sketches can generalize optimally, while achieving a computational advantage. As a corollary, we derive optimal rates for classic KCGM in the case that the target function may not be in the hypothesis space, filling a theoretical gap.

Keywords: Learning theory, Conjugate gradient methods, Randomized sketches, Integral operator

Mathematics Subject Classification: 68T05, 94A20, 41A35

1 Introduction

Let the input space be a separable Hilbert space $H$ with inner product $\langle \cdot, \cdot \rangle_H$, and the output space $\mathbb{R}$. Let $\rho$ be an unknown probability measure on $H \times \mathbb{R}$. We study the following expected risk minimization,

$$\inf_{\omega \in H} \tilde{\mathcal{E}}(\omega), \qquad \tilde{\mathcal{E}}(\omega) = \int_{H \times \mathbb{R}} \left( \langle \omega, x \rangle_H - y \right)^2 d\rho(x, y), \tag{1}$$

where the measure $\rho$ is known only through a sample $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^{n}$ of size $n \in \mathbb{N}$, independently and identically distributed (i.i.d.) according to $\rho$. As noted in [0, ], this setting covers nonparametric regression with kernel methods [8, 33], and it is close to functional linear regression [7] with zero intercept and to linear inverse problems [].
In large-scale learning scenarios, the search for an approximate estimator of the above problem via some specific algorithm can be limited to a smaller subspace $S$, in order to achieve some computational advantages [36, 3, 0]. Typically, with a subsample/sketch dimension $m < n$, $S = \mathrm{span}\{\tilde{x}_j : 1 \le j \le m\}$ where each $\tilde{x}_j$ is chosen randomly from the input set $\mathbf{x} = \{x_1, \dots, x_n\}$, or $S = \mathrm{span}\{\sum_{j=1}^{n} G_{ij} x_j : 1 \le i \le m\}$ where $G = [G_{ij}]_{1 \le i \le m,\, 1 \le j \le n}$ is a general random matrix whose rows are drawn according to a distribution. The former is called Nyström subsampling while the latter is called randomized sketches. Limiting the solution to the subspace $S$, replacing the expected risk by the empirical risk over $\mathbf{z}$, and combining with a linear and explicit regularization technique based on spectral filtering of the empirical covariance operator, leads to the projected-regularized algorithms. We refer to the previous papers [, 37, 9] and references therein for the statistical results and computational advantages of this kind of algorithm.

In this paper, we take a different step and apply the random-projection techniques to another class of efficient and powerful iterative algorithms: kernel conjugate gradient type algorithms. As noted in [9], a solution of the empirical risk minimization over the subspace $S$ can be given by solving a projected normalized linear equation. We apply the kernel conjugate gradient methods (KCGM) [5, 5] to solve this normalized linear equation without any explicit regularization term, and at the $t$-th iteration we get an estimator that fits the linear equation best over the $t$-th order Krylov subspace. The regularization that ensures its best performance is realized by early-stopping the iterative procedure. Early-stopping iterative regularization [40, 38, 8] has its own benefit compared with spectral-filtering algorithms, as it can tune the regularization parameter adaptively if a suitable stopping rule is used. Thus, for some easy learning problems, an iterative algorithm can stop earlier while generalizing optimally, leading to some computational advantages.

Considering either randomized sketches or Nyström subsampling, we provide statistical results in terms of different norms with optimal rates. Particularly, our results indicate that for KCGM with randomized sketches, the algorithm can generalize optimally after some number of iterations, provided that the sketch dimension is proportional to the effective dimension [39] of the problem. Furthermore, we point out that the computational complexities of the algorithm are $O(m^3)$ in time and $O(m^2)$ in space, which are lower than the $O(n^2 t)$ in time and $O(n^2)$ in space of classic KCGM.
Thus, our results suggest that KCGM with randomized sketches can generalize optimally with lower computational complexity, e.g., $O(n^{3/2})$ in time and $O(n)$ in space, without considering the benign assumptions of the problem, in the attainable case (i.e., the expected risk minimization has at least one solution in $H$). Finally, as a corollary, we derive the first result with optimal capacity-dependent rates for classical KCGM in the non-attainable case, filling a theoretical gap open since [4].

The paper is organized as follows. We first introduce some preliminary notations and the studied algorithms in Section 2. We then introduce some basic assumptions and state our main results in Section 3, followed by some simple discussions and numerical illustrations. All the proofs are given in Section 4 and the Appendix.

2 Learning with Kernel Conjugate Gradient Methods and Random Projection

In this section, we first introduce some necessary notations. We then present KCGM with projection (abbreviated as projected-KCGM), and discuss its numerical realizations considering two types of projection generated by randomized sketches and Nyström sketches/subsampling.

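2.1 Notations and Auxiliary Operators

Let $Z = H \times \mathbb{R}$, $\rho_X(\cdot)$ the induced marginal measure on $H$ of $\rho$, and $\rho(\cdot|x)$ the conditional probability measure on $\mathbb{R}$ with respect to $x \in H$ and $\rho$. Define the hypothesis space $H_\rho = \{f : H \to \mathbb{R} \mid \exists\, \omega \in H \text{ with } f(x) = \langle \omega, x \rangle_H,\ \rho_X\text{-almost surely}\}$. Denote by $L^2_{\rho_X}$ the Hilbert space of square integrable functions from $H$ to $\mathbb{R}$ with respect to $\rho_X$, with its norm given by $\|f\|_\rho = \left( \int_H |f(x)|^2 d\rho_X \right)^{1/2}$. Throughout this paper, we assume that the support of $\rho_X$ is compact and that there exists a constant $\kappa \in [1, \infty[$ such that

$$\langle x, x' \rangle_H \le \kappa^2, \qquad \forall x, x' \in H,\ \rho_X\text{-almost every}. \tag{2}$$

For a given bounded operator $L$ mapping from a separable Hilbert space $H_1$ to another separable Hilbert space $H_2$, $\|L\|$ denotes the operator norm of $L$, i.e., $\|L\| = \sup_{f \in H_1, \|f\|_{H_1} = 1} \|Lf\|_{H_2}$. For $r \in \mathbb{N}_+$, the set $\{1, \dots, r\}$ is denoted by $[r]$. For any real number $a$, $a_+ = \max(a, 0)$ and $a_- = \min(0, a)$.

Let $S_\rho : H \to L^2_{\rho_X}$ be the linear map $\omega \mapsto \langle \omega, \cdot \rangle_H$, which is bounded by $\kappa$ under Assumption (2). Furthermore, we consider the adjoint operator $S_\rho^* : L^2_{\rho_X} \to H$, the covariance operator $T : H \to H$ given by $T = S_\rho^* S_\rho$, and the integral operator $L : L^2_{\rho_X} \to L^2_{\rho_X}$ given by $S_\rho S_\rho^*$. It can be easily proved that

$$S_\rho^* g = \int_H x\, g(x)\, d\rho_X(x), \qquad L f = S_\rho S_\rho^* f = \int_H f(x) \langle x, \cdot \rangle_H\, d\rho_X(x), \qquad T = S_\rho^* S_\rho = \int_H \langle \cdot, x \rangle_H\, x\, d\rho_X(x).$$

Under Assumption (2), the operators $T$ and $L$ can be proved to be positive trace class operators (and hence compact):

$$\|L\| = \|T\| \le \mathrm{tr}(T) = \int_H \mathrm{tr}(x \otimes x)\, d\rho_X(x) = \int_H \|x\|_H^2\, d\rho_X(x) \le \kappa^2. \tag{3}$$

For any $\omega \in H$, it is easy to prove the following isometry property,

$$\|S_\rho \omega\|_\rho = \|T^{1/2} \omega\|_H. \tag{4}$$

Moreover, according to the singular value decomposition of a compact operator, one can prove

$$\|L^{-1/2} S_\rho \omega\|_\rho \le \|\omega\|_H. \tag{5}$$

Similarly, for all $f \in L^2_{\rho_X}$, there holds

$$\|S_\rho^* f\|_H = \|L^{1/2} f\|_\rho, \tag{6}$$

and

$$\|T^{-1/2} S_\rho^* f\|_H \le \|f\|_\rho. \tag{7}$$

We define the normalized sampling operator $S_x : H \to \mathbb{R}^n$ by

$$(S_x \omega)_i = \frac{1}{\sqrt{n}} \langle \omega, x_i \rangle_H, \qquad i \in [n],$$

where the norm $\|\cdot\|_2$ in $\mathbb{R}^n$ is the usual Euclidean norm. Its adjoint operator $S_x^* : \mathbb{R}^n \to H$, defined by $\langle S_x^* y, \omega \rangle_H = \langle y, S_x \omega \rangle_2$ for $y \in \mathbb{R}^n$, is thus given by

$$S_x^* y = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} y_i x_i.$$

For notational simplicity, we also denote $\bar{y} = \frac{1}{\sqrt{n}} y$. Moreover, we can define the empirical covariance operator $T_x : H \to H$ such that $T_x = S_x^* S_x$. Obviously,

$$T_x = S_x^* S_x = \frac{1}{n} \sum_{i=1}^{n} \langle \cdot, x_i \rangle_H\, x_i.$$

By Assumption (2), similar to (3), we have

$$\|T_x\| \le \mathrm{tr}(T_x) \le \kappa^2. \tag{8}$$

For any two input sets $\mathbf{x}$ and $\mathbf{x}'$, denote by $K_{xx'}$ the $|\mathbf{x}| \times |\mathbf{x}'|$ matrix with its $(i, j)$-th entry given by $\frac{1}{\sqrt{|\mathbf{x}||\mathbf{x}'|}} \langle x_i, x'_j \rangle_H$. Obviously,

$$K_{xx'} = S_x S_{x'}^*.$$

Problem (1) is equivalent to

$$\inf_{f \in H_\rho} \mathcal{E}(f), \qquad \mathcal{E}(f) = \int_{H \times \mathbb{R}} (f(x) - y)^2\, d\rho(x, y). \tag{9}$$

The function that minimizes the expected risk over all measurable functions is the regression function [8, 33], defined as

$$f_\rho(x) = \int_{\mathbb{R}} y\, d\rho(y|x), \qquad x \in H,\ \rho_X\text{-almost every}. \tag{10}$$

A simple calculation shows that the following well-known fact holds [8, 33], for all $f \in L^2_{\rho_X}$,

$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2. \tag{11}$$

Under Assumption (2), $H_\rho$ is a subspace of $L^2_{\rho_X}$. Thus a solution $f_H$ for Problem (9) is the projection of the regression function $f_\rho$ onto the closure of $H_\rho$ in $L^2_{\rho_X}$, and for all $f \in H_\rho$ [0],

$$S_\rho^* f_\rho = S_\rho^* f_H, \qquad \text{and} \qquad \mathcal{E}(f) - \mathcal{E}(f_H) = \|f - f_H\|_\rho^2. \tag{12}$$

2.2 Kernel Conjugate Gradient Methods with Projection

In this subsection, we introduce KCGM with solutions restricted to the subspace $S$, a closed subspace of $H$. Let $P$ be the projection operator with range $S$. As noted in [9], a solution of the empirical risk minimization over $S$ is given by $\hat{\omega} = P \hat{\omega}$ with $\hat{\omega}$ such that

$$P T_x P \hat{\omega} = P S_x^* \bar{y}. \tag{13}$$

Note that as $T_x = S_x^* S_x$,

$$P T_x P = P S_x^* S_x P = (S_x P)^* S_x P.$$

Thus, (13) could be viewed as a normalized equation of $S_x P \omega = \bar{y}$. Motivated by [5, 4], we study the following conjugate gradient type algorithms applied to this normalized equation. For notational simplicity, we let

$$U = P T_x P, \tag{14}$$

and write $U_\lambda$ to mean $U + \lambda I$.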
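The sampling operators above become ordinary matrices when $H$ is finite dimensional. The following is a minimal sketch, assuming $H = \mathbb{R}^d$ with the standard inner product and synthetic inputs scaled so that $\kappa = 1$; the variable names are illustrative and not taken from the paper. It checks the identities $S_x^* y = \frac{1}{\sqrt{n}} \sum_i y_i x_i$, $K_{xx} = S_x S_x^*$ and the bound $\|T_x\| \le \mathrm{tr}(T_x) \le \kappa^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5

# Assumption for this sketch: H = R^d, inputs scaled so that ||x_i||^2 <= 1 (kappa = 1).
X = rng.uniform(-1.0, 1.0, size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)

# Normalized sampling operator: (S_x w)_i = <w, x_i> / sqrt(n), an n x d matrix here.
S_x = X / np.sqrt(n)
y_bar = y / np.sqrt(n)

# Adjoint applied to y, empirical covariance operator and normalized Gram matrix.
adjoint_y = S_x.T @ y          # S_x^* y
T_x = S_x.T @ S_x              # T_x = S_x^* S_x
K_xx = S_x @ S_x.T             # K_xx = S_x S_x^*

explicit_adjoint = (y[:, None] * X).sum(axis=0) / np.sqrt(n)
trace_T = np.trace(T_x)
op_norm_T = np.linalg.norm(T_x, 2)
```

In the kernel setting the same quantities are accessed only through inner products, so `K_xx` would be built from kernel evaluations divided by `n` rather than from an explicit `X`.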

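Algorithm 1 (Projected-KCGM). For any $t = 1, \dots, T$,

$$\omega_t = \arg\min_{\omega \in K_t(U, P S_x^* \bar{y})} \|U \omega - P S_x^* \bar{y}\|_H. \tag{15}$$

Here, $K_t(U, P S_x^* \bar{y})$ is the so-called Krylov subspace, defined as

$$K_t(U, P S_x^* \bar{y}) = \mathrm{span}\{P S_x^* \bar{y},\ U P S_x^* \bar{y},\ \dots,\ U^{t-1} P S_x^* \bar{y}\} = \{p(U) P S_x^* \bar{y} : p \in \mathcal{P}_{t-1}\},$$

where $\mathcal{P}_{t-1}$ denotes the set of real polynomials of degree at most $t - 1$.

Different choices of the subspace $S$ correspond to different algorithms. Particularly, when $P = I$, the algorithm is the classical KCGM. In this paper, we will set

$$S = \mathrm{span}\Big\{\sum_{j=1}^{n} G_{ij} x_j : 1 \le i \le m\Big\},$$

where $G = [G_{ij}]_{1 \le i \le m,\, 1 \le j \le n}$ is a random matrix, or $S = \mathrm{span}\{\tilde{x}_j : 1 \le j \le m\}$ with $\tilde{x}_j$ chosen randomly from $\mathbf{x}$. The following examples provide numerical realizations of Algorithm 1, considering randomized sketches, Nyström-subsampling sketches and non-sketching regimes.

Example 2.1 (Randomized sketches). Let $S = \mathrm{span}\{\sum_{j=1}^{n} G_{ij} x_j : 1 \le i \le m\}$, and let $G = [G_{ij}]$ be a matrix in $\mathbb{R}^{m \times n}$. Let $R \in \mathbb{R}^{m \times r}$ be the matrix such that $R R^\top = (G K_{xx} G^\top)^\dagger$ with $r = \mathrm{rank}(R)$. Denote $\tilde{K} = R^\top G K_{xx}^2 G^\top R$ and $b = R^\top G K_{xx} \bar{y}$. In this case, Algorithm 1 is equivalent to $\omega_t = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (G^\top R a_t)_i\, x_i$ with $a_t$ given by

$$a_t = \arg\min_{a \in K_t(\tilde{K}, b)} \|\tilde{K} a - b\|_2. \tag{16}$$

We call this type of algorithm sketched-KCGM.

Example 2.2 (Subsampling sketches). In Nyström-subsampling sketches, $\tilde{\mathbf{x}} = \{\tilde{x}_1, \dots, \tilde{x}_m\}$ with each $\tilde{x}_j$ drawn randomly following a distribution from $\mathbf{x}$. Let $R \in \mathbb{R}^{m \times r}$ be the matrix such that $R R^\top = (K_{\tilde{x}\tilde{x}})^\dagger$ with $r = \mathrm{rank}(R)$. Denote $\tilde{K} = R^\top K_{\tilde{x}x} K_{x\tilde{x}} R$ and $b = R^\top K_{\tilde{x}x} \bar{y}$. In this case, Algorithm 1 is equivalent to $\omega_t = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} (R a_t)_i\, \tilde{x}_i$ with $a_t$ given by

$$a_t = \arg\min_{a \in K_t(\tilde{K}, b)} \|\tilde{K} a - b\|_2.$$

We call this algorithm Nyström-KCGM.

Example 2.3 (Non-sketches [4]). For the ordinary non-sketching regimes, $S = H$. Let $K = S_x S_x^*$. Then Algorithm 1 is equivalent to $\omega_t = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (a_t)_i\, x_i$, with $a_t$ given by

$$a_t = \arg\min_{a \in K_t(K, \bar{y})} \|K a - \bar{y}\|_K,$$

where $\|v\|_K^2 = v^\top K v$.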
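The sketched realization of Example 2.1 can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it solves the least-squares problem (16) by building an orthonormal Krylov basis explicitly and calling a dense least-squares solver, which is simpler than (and not as efficient as) a conjugate gradient recursion. The helper names `pinv_factor`, `krylov_least_squares` and `sketched_kcgm`, the Gaussian sketch, the kernel $1 + \min(x, x')$ and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def pinv_factor(M, rtol=1e-10):
    """Return R with R @ R.T equal to the pseudo-inverse of a symmetric PSD matrix M."""
    w, V = np.linalg.eigh(M)
    keep = w > rtol * w.max()
    return V[:, keep] / np.sqrt(w[keep])

def krylov_least_squares(K, b, t):
    """Minimise ||K a - b||_2 over a in span{b, K b, ..., K^(t-1) b}."""
    basis, v = [], b.astype(float)
    for _ in range(t):
        for q in basis:                      # Gram-Schmidt against earlier directions
            v = v - (q @ v) * q
        norm = np.linalg.norm(v)
        if norm < 1e-12:                     # the Krylov subspace stopped growing
            break
        basis.append(v / norm)
        v = K @ basis[-1]
    if not basis:
        return np.zeros(b.shape[0])
    V = np.stack(basis, axis=1)
    c, *_ = np.linalg.lstsq(K @ V, b, rcond=None)
    return V @ c

def sketched_kcgm(K_xx, y_bar, G, t):
    """Return coefficients c (omega_t = sum_i c_i x_i) and the residual of (16)."""
    n = K_xx.shape[0]
    C = G @ K_xx                             # G K_xx, an m x n matrix
    R = pinv_factor(C @ G.T)                 # R R^T = (G K_xx G^T)^+
    K_tilde = R.T @ C @ C.T @ R              # R^T G K_xx^2 G^T R
    b = R.T @ C @ y_bar                      # R^T G K_xx y_bar
    a_t = krylov_least_squares(K_tilde, b, t)
    residual = np.linalg.norm(K_tilde @ a_t - b)
    return G.T @ R @ a_t / np.sqrt(n), residual

# Illustrative data: all choices below are assumptions for this sketch.
rng = np.random.default_rng(0)
n, m = 64, 8
x = rng.uniform(0.0, 1.0, size=n)
y = np.abs(x - 0.5) - 0.5 + rng.normal(size=n)
gram = 1.0 + np.minimum(x[:, None], x[None, :])   # kernel evaluations k(x_i, x_j)
K_xx = gram / n                                   # normalized Gram matrix
G = rng.normal(size=(m, n)) / np.sqrt(m)          # Gaussian sketch

residuals = []
for t in range(1, 6):
    coef, res = sketched_kcgm(K_xx, y / np.sqrt(n), G, t)
    residuals.append(res)
train_pred = gram @ coef     # <omega_t, x_j> = sum_i c_i k(x_i, x_j)
```

Because the Krylov subspaces are nested, the residuals of (16) are non-increasing in `t`; the statistical regularization comes from stopping early, not from driving this residual to zero. The Nyström variant of Example 2.2 would follow the same pattern with `C` replaced by $K_{\tilde{x}x}$ and `C @ G.T` replaced by $K_{\tilde{x}\tilde{x}}$.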

In all the above examples, in order to execute the algorithms, one only needs to know how to compute $\langle x, x' \rangle_H$ for any two points $x, x' \in H$, which is met by many cases such as learning with kernel methods. In general, as the computation of the matrix $G K_{xx} = \frac{1}{\sqrt{n}} [G K_{xx_1}, G K_{xx_2}, \dots, G K_{xx_n}]$ or $K_{x\tilde{x}} R$ can be parallelized, the computational costs are $O(m^3 + m^2 T)$ in time and $O(m^2)$ in space for sketched/Nyström KCGM after $T$ iterations, while they are $O(n^2 T)$ in time and $O(n^2)$ in space for non-sketched KCGM. As shown both in theory and in our numerical results, the total number of iterations $T$ for the algorithms to achieve best performance is typically less than $m$ for sketched/Nyström KCGM.

A classical [9] or sketched [] kernel conjugate gradient type algorithm was proposed for solving the penalized empirical risk minimization. In contrast, Algorithm 1 solves the unpenalized empirical risk minimization and does not involve any explicit penalty. In this case, we do not need to tune the penalty parameter. The best generalization ability of Algorithm 1 is ensured by early-stopping the procedure, considering a suitable stopping rule.

The proofs for the three examples will be given in Subsection 4.1.

3 Main Results

In this section, we first introduce some common assumptions from statistical learning theory, and then present our statistical results for sketched/Nyström-KCGM and classical KCGM.

3.1 Assumptions

Assumption 1. There exist positive constants $Q$ and $M$ such that for all $l \ge 2$ with $l \in \mathbb{N}$,

$$\int_{\mathbb{R}} |y|^l\, d\rho(y|x) \le \frac{1}{2}\, l!\, M^{l-2} Q^2, \tag{17}$$

$\rho_X$-almost surely. Furthermore, for some $B > 0$, $f_H$ satisfies

$$\int_H (f_H(x) - f_\rho(x))^2\, x \otimes x\, d\rho_X(x) \preceq B^2\, T. \tag{18}$$

Obviously, Assumption 1 implies that the regression function $f_\rho$ is bounded almost surely, as

$$|f_\rho(x)| \le \int_{\mathbb{R}} |y|\, d\rho(y|x) \le \left( \int_{\mathbb{R}} |y|^2\, d\rho(y|x) \right)^{1/2} \le Q. \tag{19}$$

(17) is satisfied if $y$ is bounded almost surely or $y = \langle \omega, x \rangle_H + \epsilon$ for some Gaussian noise $\epsilon$. (18) is satisfied if $f_H - f_\rho$ is bounded almost surely or the hypothesis space is consistent, i.e., $\inf_{H_\rho} \mathcal{E} = \mathcal{E}(f_\rho)$.

Assumption 2. $f_H$ satisfies the following Hölder source condition

$$f_H = L^\zeta g_0, \qquad \text{with } \|g_0\|_\rho \le R. \tag{20}$$

Here, $R$ and $\zeta$ are non-negative numbers.

Assumption 2 relates to the regularity/smoothness of $f_H$. The bigger $\zeta$ is, the stronger the assumption is and the smoother $f_H$ is, as $L^{\zeta_1}(L^2_{\rho_X}) \subseteq L^{\zeta_2}(L^2_{\rho_X})$ when $\zeta_1 \ge \zeta_2$.

7 Particularly, when ζ /, there exists some ω H H such that S ρ ω H = f H almost surely [33], while for ζ = 0, the assumption holds trivially. Assumption 3. For some γ [0, ] and c γ > 0, T satisfies N := trt T + I c γ γ, for all > 0. Assumption 3 characters the capacity of H. The left-hand side of is called the effective dimension [39]. As T is a trace-class operator, Condition is trivially satisfied with γ = which is called the capacity-independent case. γ 0, ] if the eigenvalues { i } of T satisfy i i γ. We refer to [9] for more comments on the above assumptions. Furthermore, it is satisfied with a general 3. General Results for Kernel Conjugate Gradient Method with Projection The following results provide convergence results for general projected-kcgm with a datadependent stopping rule. Theorem 3.. Under Assumptions, and 3, let a [0, ζ ]. Assume that for some C, and for any δ 0,, P I P T > C ζ a a log δ δ, = n ζ+γ b n,ζ,γ. Then the following results hold with probability at least δ. There exist positive constants C and C which depend only on ζ, γ, c γ, T, κ, M, Q, B, R, C such that if the stopping rule is then Uω t P S xȳ H C log 3 L a S ρ ωˆt f H ρ C log a δ n Furthermore, if ζ /, f H = S ρ ω H for some ω H H and ζ+/ δ n ζ+γ b ζ+/ n,ζ,γ, ζ a ζ+γ b ζ a n,ζ,γ. T a ωˆt ω H H C log a δ n ζ+γ. 3 ζ a Here, b n,ζ,γ = log n γ {ζ+γ }. 4 The convergence rate from the above is optimal as it matches the minimax lower rate On ζ a ζ+γ derived for ζ / in [7, 5]. Convergence results with respect to different measures are raised from statistical learning theory and inverse problems. In statistical learning theory, one typically is interested in the generalization ability, measured in terms of excess risks, S ρ ωˆt f ρ ρ = Ẽωˆt inf H Ẽ. In inverse problems, one is interested in the convergence within the space H. Theorem 3. asserts that projected-kcgm converges optimally if the projection error is small enough. 
The condition is satisfied with random projections induced by randomized sketches or Nyström subsampling if the sketching dimension is large enough, as shown in Section 4. Thus we have the following corollaries for sketched or Nyström KCGM.

8 3.3 Results for Kernel Conjugate Gradient Methods with Randomized Sketches In this subsection, we state optimal convergence results with respect to different norms for KCGM with randomized sketches from Example.. We assume that the sketching matrix G satisfies the following concentration property: For any finite subset E in R n and for any t > 0, t m P Ga a t a c E e 0 logβ n. 5 Here, c 0 and β are universal non-negative constants. Example 3.. Many matrices satisfy the concentration property. Subgaussian sketches. Matrices with i.i.d. subgaussian such as Gaussian or Bernoulli entries satisfy 5 with some universal constant c 0 and β = 0. More general, if the rows of G are independent scaled copies of an isotropic ψ vector, then G also satisfies 5 [3]. Recall that a random vector a R n is ψ isotropic if for all v R n for some constant α. E[ a, v ] = v, and inf{t : E[exp a, v /t ] } α v, Randomized orthogonal system ROS sketches. As noted in [7], matrix that satisfies restricted isometric property from compressed sensing [6, ] with randomized column signs satisfies 5. Particularly, random partial Fourier matrix, or random partial Hadamard matrix with randomized column signs satisfies 5 with β = 4 for some universal constant c 0. Corollary 3.. Under Assumptions, and 3, let S = range{s xg }, where G R m n is a random matrix satisfying 5. Let δ 0,, a [0, ζ ] and m C 3 log 3 3 n γ [ log n γ ] γ, if ζ + γ, δ logβ n n γζ a aζ+γ, if ζ, n γ ζ+γ otherwise, for some C 3 > 0 which depends only on ζ, γ, c γ, T, κ, M, Q, B, R, c 0. Then the conclusions in Theorem 3. hold. When γ < ζ, the minimal sketching dimension is proportional to the effective dimension On γ ζ+γ up to a logarithmic factor, which we believe that it is unimprovable. According to Corollary 3., sketched-kcgm can generalize optimally if the sketching dimension is large enough. 
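To make the statement about the minimal sketching dimension concrete, the following is a small numerical sketch. It assumes eigenvalues $\sigma_i = i^{-1/\gamma}$ of $T$ (one way to satisfy Assumption 3), reads the effective dimension as $N(\lambda) = \sum_i \sigma_i / (\sigma_i + \lambda)$, and takes $\lambda = n^{-1/(2\zeta+\gamma)}$ with logarithmic factors dropped, in the regime $2\zeta + \gamma > 1$ and $\zeta < 1$. These readings of the garbled formulas, and the function names, are assumptions of this sketch. It compares $N(\lambda)$ with the scaling $n^{\gamma/(2\zeta+\gamma)}$ quoted for the sketch dimension.

```python
def effective_dimension(lam, gamma, n_terms=100_000):
    """N(lam) = sum_i s_i / (s_i + lam) for assumed eigenvalues s_i = i**(-1/gamma)."""
    return sum(1.0 / (1.0 + lam * i ** (1.0 / gamma)) for i in range(1, n_terms + 1))

def sketch_dimension_scale(n, zeta, gamma):
    """n**(gamma / (2*zeta + gamma)), the scaling quoted for the sketch dimension."""
    return n ** (gamma / (2.0 * zeta + gamma))

# Illustrative values: zeta = gamma = 1/2 gives a sketch dimension of order n**(1/3).
n, zeta, gamma = 1024, 0.5, 0.5
lam = n ** (-1.0 / (2.0 * zeta + gamma))
ratio = effective_dimension(lam, gamma) / sketch_dimension_scale(n, zeta, gamma)
```

Under these assumptions the ratio stays of constant order as `n` grows, which is the sense in which the sketch dimension is proportional to the effective dimension.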
3.4 Results for Kernel Conjugate Gradient Methods with Nyström Sketches In this subsection, we provide optimal rates with respect to different norms for KCGM with Nyström sketches from Example.. Corollary 3.3. Under Assumptions, and 3, let S = span{x,, x m }, ζ + γ >, δ 6 0,, a [0, ζ ] and m n ζ a aζ+γ [ log n γ ]. Then the conclusions in Theorem 3. are true. 8

The requirement on the sketch dimension $m$ of Nyström-KCGM does not depend on the probability constant $\delta$, but it is stronger than that of sketched-KCGM if $\gamma < 1$, ignoring the factor $\delta$.

Remark 3.4. In the above, we only consider plain Nyström subsampling. Using approximated leveraging score (ALS) Nyström subsampling [35, 0], we can further improve the projection dimension condition to (26); see Section 4 for details. However, in this case, we need to compute the ALS with an appropriate pseudo regularization parameter.

3.5 Optimal Rates for Classical Kernel Conjugate Gradient Methods

As a direct corollary, we derive optimal rates for classical KCGM as follows, covering the non-attainable cases.

Corollary 3.5. Under Assumptions 1, 2 and 3, let $P = I$, $\delta \in (0, 1)$ and $a \in [0, \zeta \wedge \frac{1}{2}]$. Then the conclusions in Theorem 3.1 are true.

To the best of our knowledge, the above results provide the first optimal capacity-dependent rate for KCGM in the non-attainable case, i.e., $\zeta < 1/2$. This thus provides an answer to a question open since [4]. Convergence results for kernel partial least squares under different stopping rules have been derived in [, 30], but the derived optimal rates are only for the attainable cases. Our analysis could be extended to this different type of algorithm with similar stopping rules.

[Figure 1: two panels showing the squared prediction error and the training error versus iterations, each with curves for sketched KCGM and Nyström KCGM.]

Figure 1: Squared prediction errors and training errors for sketched KCGM with $m = n^{1/3}$ and plain Nyström KCGM with $m = n^{2/3}$ and $n = 1024$.

We present some numerical results to illustrate our derived results in the setting of learning with kernel methods. In all the simulations, we constructed training data $\{(x_i, y_i)\}_{i=1}^{n} \subset \mathbb{R} \times \mathbb{R}$ from the regression model $y = f_\rho(x) + \xi$, where the regression function is $f_\rho(x) = |x - 1/2| - 1/2$, the input $x$ is uniformly drawn from $[0, 1]$, and $\xi$ is a Gaussian noise with zero mean and standard deviation 1. By construction, the function $f_\rho$ belongs to the first-order Sobolev space with $\|f_\rho\|_H = 1$. In all the simulations, the RKHS is associated with the Sobolev kernel $K(x, x') = 1 + \min(x, x')$. As noted in [37, Example 3] for the Sobolev kernel, according to [4], Assumption 3 is satisfied with $\gamma = 1/2$. As suggested by our theory, we set the projection dimension $m = n^{1/3}$ for KCGM with ROS sketches based on the fast Hadamard transform, and $m = n^{2/3}$ for KCGM with plain Nyström sketches. We performed simulations for $n$ in the set $\{32, 64, 128, 256, 512, 1024\}$ so as to study scaling with the sample size. For each $n$, we performed 100 trials, and both squared prediction errors and training errors averaged over these 100 trials were computed. The errors for $n = 1024$ versus the iterations are reported in Figure 1. For each $n$, the minimal squared prediction error over the first $m$ iterations is computed, and these errors versus the sample size are reported in Figure 2 in order to compare with a state-of-the-art algorithm, kernel ridge regression (KRR).

[Figure 2: two panels showing the squared prediction error and the scaled squared prediction error versus sample size, each with curves for sketched KCGM, Nyström KCGM and KRR.]

Figure 2: Prediction errors $\|S_\rho \hat{\omega} - f_\rho\|_\rho^2$ and scaled prediction errors $n^{2/3} \|S_\rho \hat{\omega} - f_\rho\|_\rho^2$ versus sample sizes for KRR, sketched KCGM with $m = n^{1/3}$, and plain Nyström KCGM with $m = n^{2/3}$.

From Figure 1, we see that the squared prediction errors decrease over the first 3 iterations and then increase, for both sketched and Nyström KCGM. This indicates that the number of iterations has a regularization effect. Our theory predicts that the squared prediction loss should tend to zero at the same rate $n^{-2/3}$ as that of KRR. Figure 2 confirms this theoretical prediction.

All the results stated in this section will be proved in Section 4.

4 Proof

In this section and the appendix, we provide all the proofs.

4.1 Proof for Subsection 2.2

Let $Q$ be a compact operator from the Euclidean space $(\mathbb{R}^m, \|\cdot\|_2)$ to $H$ such that $S = \mathrm{range}(Q)$. It is easy to see that $Q^* Q \in \mathbb{R}^{m \times m}$. Let $t = \mathrm{rank}(R)$ and $R \in \mathbb{R}^{m \times t}$ be the matrix such that $R R^\top = (Q^* Q)^\dagger$. As $P$ is the projection operator onto $S$,

$$P = Q (Q^* Q)^\dagger Q^* = Q R R^\top Q^*. \tag{27}$$
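The projection formula $P = Q (Q^* Q)^\dagger Q^* = Q R R^\top Q^*$ can be checked numerically. The following is a minimal sketch, assuming $H = \mathbb{R}^d$ so that $Q$ is a $d \times m$ matrix; the rank deficiency is introduced on purpose to show why the pseudo-inverse (rather than an inverse) is needed. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 12, 4

# Assumption for this sketch: H = R^d, so Q is a d x m matrix with range S.
Q = rng.normal(size=(d, m))
Q[:, 3] = Q[:, 0] + Q[:, 1]            # make Q rank deficient (rank 3)

# Factor the pseudo-inverse of Q^T Q as R R^T via an eigendecomposition.
w, V = np.linalg.eigh(Q.T @ Q)
keep = w > 1e-10 * w.max()
R = V[:, keep] / np.sqrt(w[keep])

# P = Q R R^T Q^T should be the orthogonal projection onto range(Q).
P = Q @ R @ R.T @ Q.T
```

The number of columns of `R` equals the rank of `Q`, and `P` is symmetric, idempotent and leaves every column of `Q` unchanged.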

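For any polynomial function $q$, we have that

$$q(U) P S_x^* \bar{y} = q(P T_x P) P S_x^* \bar{y} = q(P S_x^* S_x P) P S_x^* \bar{y}.$$

Noting that $(S_x P)^* = P S_x^*$, and using Lemma 4.2 from the coming subsection,

$$q(U) P S_x^* \bar{y} = P S_x^* q(S_x P P S_x^*) \bar{y} = P S_x^* q(S_x P S_x^*) \bar{y}.$$

Introducing (27),

$$q(U) P S_x^* \bar{y} = Q R R^\top Q^* S_x^*\, q(S_x Q R R^\top Q^* S_x^*)\, \bar{y}. \tag{28}$$

Noting that $(R^\top Q^* S_x^*)^* = S_x Q R$, and applying Lemma 4.2,

$$q(U) P S_x^* \bar{y} = Q R\, q(R^\top Q^* S_x^* S_x Q R)\, R^\top Q^* S_x^* \bar{y} = Q R\, q(\tilde{K})\, b, \tag{29}$$

where we denote $b = R^\top Q^* S_x^* \bar{y}$ and $\tilde{K} = R^\top Q^* S_x^* S_x Q R$. Using $R R^\top = (Q^* Q)^\dagger$, which implies $R R^\top Q^* Q R R^\top = R R^\top$ and, for any $g \in H$,

$$\|Q R R^\top Q^* g\|_H^2 = \langle Q R R^\top Q^* Q R R^\top Q^* g, g \rangle_H = \langle Q R R^\top Q^* g, g \rangle_H = \|R^\top Q^* g\|_2^2,$$

we get from (28) that

$$\|q(U) P S_x^* \bar{y}\|_H = \|R^\top Q^* S_x^*\, q(S_x Q R R^\top Q^* S_x^*)\, \bar{y}\|_2 = \|q(\tilde{K}) b\|_2, \tag{30}$$

where we used Lemma 4.2 for the last equality. Note that the solution of (15) is given by $\omega_t = p_t(U) P S_x^* \bar{y}$, with

$$p_t = \arg\min_{p \in \mathcal{P}_{t-1}} \|(U p(U) - I) P S_x^* \bar{y}\|_H.$$

Using (29) and (30), we know that $\omega_t = Q R\, p_t(\tilde{K})\, b$, with

$$p_t = \arg\min_{p \in \mathcal{P}_{t-1}} \|(\tilde{K} p(\tilde{K}) - I) b\|_2,$$

which is equivalent to $\omega_t = Q R a_t$, with

$$a_t = \arg\min_{a \in K_t(\tilde{K}, b)} \|\tilde{K} a - b\|_2.$$

Proof for Example 2.1. For general randomized sketches, $Q = S_x^* G^\top$. In this case, $Q^* Q = G S_x S_x^* G^\top = G K_{xx} G^\top$, $\tilde{K} = R^\top G S_x S_x^* S_x S_x^* G^\top R = R^\top G K_{xx}^2 G^\top R$, $b = R^\top G S_x S_x^* \bar{y} = R^\top G K_{xx} \bar{y}$, and $\omega_t = S_x^* G^\top R a_t$.

Proof for Example 2.2. In Nyström subsampling, $\tilde{\mathbf{x}}$ is a subset of size $m < n$ drawn randomly following a distribution from $\mathbf{x}$, $Q = S_{\tilde{x}}^*$, and $Q^* Q = K_{\tilde{x}\tilde{x}}$. In this case, $\tilde{K} = R^\top K_{\tilde{x}x} K_{x\tilde{x}} R$, $b = R^\top K_{\tilde{x}x} \bar{y}$, and $\omega_t = S_{\tilde{x}}^* R a_t$.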

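Proof for Example 2.3. For the ordinary non-sketching regimes, $S = H$ and $P = I$. Denote $K = S_x S_x^*$. Then

$$\omega_t = \arg\min_{\omega \in K_t(T_x, S_x^* \bar{y})} \|T_x \omega - S_x^* \bar{y}\|_H$$

is equivalent to

$$\omega_t = p_t(T_x) S_x^* \bar{y} = p_t(S_x^* S_x) S_x^* \bar{y} = S_x^* p_t(K) \bar{y} = S_x^* \hat{a}_t,$$

with $\hat{a}_t$ given by

$$\hat{a}_t = \arg\min_{a \in K_t(K, \bar{y})} \|K a - \bar{y}\|_K.$$

Indeed, $\|T_x \omega - S_x^* \bar{y}\|_H = \|S_x^* (S_x \omega - \bar{y})\|_H = \|S_x \omega - \bar{y}\|_K$, and for any polynomial function $p$, $S_x p(T_x) S_x^* \bar{y} = S_x p(S_x^* S_x) S_x^* \bar{y} = K p(K) \bar{y}$.

In the remaining subsections, we present the proofs for Section 3.

4.2 Operator Inequalities

We first introduce some necessary operator inequalities.

Lemma 4.1 ([3, Cordes inequality]). Let $A$ and $B$ be two positive bounded linear operators on a separable Hilbert space. Then

$$\|A^s B^s\| \le \|A B\|^s, \qquad \text{when } 0 \le s \le 1.$$

Lemma 4.2. Let $H_1, H_2$ be two separable Hilbert spaces and $S : H_1 \to H_2$ a compact operator. Then for any well-defined function $f$ over $[0, \|S\|^2]$,

$$f(S S^*) S = S f(S^* S).$$

Proof. The result can be proved using the singular value decomposition of a compact operator.

Lemma 4.3. Let $A$ and $B$ be two non-negative bounded linear operators on a separable Hilbert space with $\max(\|A\|, \|B\|) \le \kappa^2$ for some non-negative $\kappa^2$. Then for any $\zeta > 0$,

$$\|A^\zeta - B^\zeta\| \le C_{\zeta,\kappa} \|A - B\|^{\zeta \wedge 1}, \tag{31}$$

where

$$C_{\zeta,\kappa} = \begin{cases} 1 & \text{when } \zeta \le 1, \\ 2 \zeta \kappa^{2\zeta - 2} & \text{when } \zeta > 1. \end{cases} \tag{32}$$

Proof. The proof is based on the fact that $u^\zeta$ is operator monotone if $0 < \zeta \le 1$. For $\zeta \ge 1$, we refer to [9] or [5] for the proof.

Lemma 4.4. Let $X$ and $A$ be bounded linear operators on a separable Hilbert space $H$. Suppose that $A \succeq 0$ and $\|X\| \le 1$. Then for any $s \in [0, 1]$ and any $\lambda \ge 0$,

$$X^* (A + \lambda I)^s X \preceq (X^* A X + \lambda X^* X)^s \preceq (X^* A X + \lambda I)^s. \tag{33}$$

As a result, for any $\lambda \ge 0$ and any $\omega \in H$,

$$\|(A + \lambda I)^{s/2} X \omega\|_H \le \|(X^* A X + \lambda X^* X)^{s/2} \omega\|_H \le \|(X^* A X + \lambda I)^{s/2} \omega\|_H, \tag{34}$$

and for any bounded linear operator $F$ on $H$,

$$\|F X^* (A + \lambda I)^{s/2}\| \le \|F (X^* A X + \lambda I)^{s/2}\|. \tag{35}$$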
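The first two operator inequalities above can be sanity-checked on random matrices. The following is a minimal sketch, assuming finite-dimensional positive semi-definite matrices in place of operators; the helper names `psd_power` and `random_psd` are hypothetical. It evaluates the gap $\|AB\|^s - \|A^s B^s\|$ from the Cordes inequality for several $s \in (0, 1]$, and both sides of the identity $f(SS^*)S = S f(S^*S)$ with $f(u) = \sqrt{u}$.

```python
import numpy as np

def psd_power(A, s):
    """Fractional power of a symmetric PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)          # remove tiny negative eigenvalues from round-off
    return (V * w ** s) @ V.T

rng = np.random.default_rng(2)

def random_psd(d):
    """Random symmetric PSD matrix (illustrative test data)."""
    M = rng.normal(size=(d, d))
    return M @ M.T / d

A, B = random_psd(6), random_psd(6)

# Cordes inequality: ||A^s B^s|| <= ||A B||^s for 0 <= s <= 1 (spectral norms).
gaps = []
for s in [0.1, 0.25, 0.5, 0.75, 1.0]:
    lhs = np.linalg.norm(psd_power(A, s) @ psd_power(B, s), 2)
    rhs = np.linalg.norm(A @ B, 2) ** s
    gaps.append(rhs - lhs)

# Lemma 4.2 with f(u) = sqrt(u): f(S S^T) S = S f(S^T S).
S = rng.normal(size=(4, 6))
left = psd_power(S @ S.T, 0.5) @ S
right = S @ psd_power(S.T @ S, 0.5)
```

All entries of `gaps` should be non-negative up to round-off, with equality at `s = 1`, and `left` should agree with `right`.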

13 Proof. Note that X X I since X. In fact, X Xω, ω H = Xω H ω H = ω, ω H. Following from [6], the fact that the function u s is operator monotone, one can prove 33: X A + I s X X AX + X X s X AX + I s. The proof for 34 can be done by applying 33: A + I s Xω H = X A + I s Xω, ω H X AX + I s ω, ω H = X AX + I s ω H. The proof for 35 can be done by applying 33: F X A + I s = F X A + I s XF F X AX + I s F = F X AX + I s. Lemma 4.5 [9]. Let P be a projection operator in a Hilbert space H, and A, B be two semidefinite positive operators on H. For any 0 s, t, we have A s I P A t A B s+t + B I P B s+t. 4.3 Orthogonal Polynomials and Some Notations We denote by ξ x,i, e x,i i an eigenvalue-eigenvector orthogonal basis for the operator U. It is easy to see that ξ x,i [0, κ ], as U is semi-definite and U T x κ by 8. For any u 0, we denote F u the orthogonal projection in H onto the subspace {e x,i : ξ x,i < u} and let F u = I F u. Denote N 0 = N {0}. For any t N 0, denote with P t the set of polynomials of degree at most t and P 0 t the set of polynomials in P t having constant term equal to. For any t N 0 and functions ψ, φ : R R, define Denote p r t and let q r t q t the minimizer for [ψ, φ] r = ψup S xȳ, U r φup S xȳ H. arg min[p, p] r, p Pt 0 P t be such that p r t u = uq r t u. We write p t and q t to mean p t and, respectively. According to the definition from Algorithm, we know that ω i = q i UP S xȳ, p i u = uq i u. In the case i = 0, we set q 0 = 0 and p 0 =. Let r N 0. Observe that for any function φ, [φ, φ] r = i φξ x,i ξ r x,i P S xȳ, e x,i H. Define m 0 the number of distinct positive eigenvalues of U such that P Sxȳ has nonzero projection on the corresponding eigenspace. Using that Ue x,i = 0 implies S x P e x,i = 0 as U = S x P S x P, we can prove that the measure defining [, ] r has finite support of cardinality m 0. Using the fact that a polynomial of degree t has at most t roots except t = 0, it is easy to show 3

14 that [, ] r with r N 0 is an inner product on the space P m0. Furthermore, there exists some p m0 P m 0 0 such that [p m0, p m0 ] r = 0, and p m0 has m 0 distinct roots belonging to 0, κ ]. Based on [5, Proposition.] or using a similar argument based on the projection theorem as that in [4], {p r i } m 0 i= are orthogonal with respect to [, ] r. Thus the polynomial p r t with t < m 0 has exactly t distinct roots belonging to 0, κ ], denoted by x r k,t k t in increasing order. For notational simplicity, we write x k,t to mean x k,t. The following lemma summarizes some basic facts about the orthogonal polynomials. Lemma 4.6. Let r N and t be any integer satisfying t < m 0. Then the following results hold. x r,t < xr+,t For u [0, x r,t 3 p r t ], 0 pr t 0 x r,t. 4 p t0 p t 0 + [p t,p t ] 0. u, 0 q r t [p t,p t ] Proof. See [5, Corollary.7]. As p r t 0 p r t Pt 0, p r t 0 =. Thus, p r t u. Moreover, 0 q r t uu = p r t uu and q r t u p r t 0. is convex and decreasing on [0, x r,t ]. Therefore, u and q r t u = pr t u u = pr t 0 p r t u 0 u p r t 0 = p r t 0. 3 Rewriting p r t u as t j= u/xr j,t, and taking the derivative on 0, we get p r t 0 = t j= x r j,t x r,t, which leads to the desired result. 4 Following from [5, Corollary.6], p r t 0 0 in the proof for Part, and that [p t, p t ] 0 [p t, p t ] 0 since p t is the minimizer of [, ] 0 over Pt 0, one can get the result. 4.4 Deterministic Analysis In the proof, we introduce an intermediate function ω H, defined as follows, ω = G T S ρf H, 36 where G u = { u, if u, 0, if u <. Lemma 4.7. Under Assumption, let ω be given by 36 for some > 0. Then we have For any a ζ, L a S ρ ω f H ρ R ζ a. 37 T a / ω H R { ζ+a, if ζ a ζ, κ ζ+a, if a ζ. 38 4

15 The proof can be found in [8, Page 40]. We next introduce some useful notations. := T x T T x T, := T T x ω S xȳ H, 3 := T x T HS, 4 := T T T x, 5 := T I P = T I P T, We also need the following preliminary lemmas. Lemma 4.8 [9]. Under Assumption, we have T x S xȳ T xp ω H + R { 5 + ζ, if ζ, κ κ ζ, if ζ >. 39 The proof for the above lemma can be found in [9]. We provide a proof in Appendix A. for completeness. Lemma 4.9. Let A : H H be a bounded operator. Under Assumption, AP ω R AU H ζ, if ζ, R A C ζ,κ ζ 3 + AU C ζ,κ ζ + AU ζ, if ζ >. 40 Proof. If 0 < ζ, by a simple calculation, and applying Part of Lemma 4.7, Using 35 from Lemma 4.4, we get which leads to the desired result. If ζ, applying Part of Lemma 4.7, AP ω H AP T x T x T T ω H AP T x T ω H AP T x R ζ. AP ω H AP T x P + I Rζ, AP ω H AP T ζ T ζ ω AP T ζ R. Adding and subtracting with the same term and using the triangle inequality, AP ω H R AP T ζ ζ T x + AP T ζ x R AP T ζ ζ T x + AP T ζ x. 5

16 Applying Lemma 4.3 with 3 and 8, we get With AP ω H R AP C ζ R AP C ζ,κ ζ,κ ζ V = T x P T x = P T x P T x 3 + AP T ζ x 3 + AP T ζ x. 4 and Lemma 4., we can rewrite P T ζ x as P T x T ζ x V ζ + P T x V ζ = P T x T ζ x V ζ + U ζ P T x. Thus, combining with the triangle inequality, we get AP T ζ x AP T x T ζ x V ζ + AU ζ P T x Applying Lemma 4.3 with V T x κ, AP T x T ζ x V ζ + AU ζ P T x. AP T ζ x AP T x C ζ,κ T x V ζ + AU ζ P T x. Using Lemma 4.5, I P = I P and A A = A, we have and we thus get T x V = T x I P T x T x T + T I P T 3 + 5, AP T ζ x AP T x C ζ,κ ζ + AU ζ P T x. 4 Applying 35 of Lemma 4.4, we get AP T x AP T x P = AU and AU ζ P T x AU ζ. Thus, AP T ζ x AU Cζ,κ ζ + AU ζ. Introducing the above into 4, one can get AP ω H R AP C ζ,κ ζ which leads to the desired result by noting that AP A. 3 + AU Cζ,κ ζ + AU ζ, With the above lemmas, we can prove the following result for estimating L a S ρ ω t f H ρ. Lemma 4.0. Under Assumption, let u 0, x,t ] and 0 a ζ. Then the following statements hold. If ζ, L a S ρ ω t f H ρ a + a u + a u p t0 a + a p t0 + u + a + 5 / + R ζ u Uω t P S xȳ H + R u + a ζ + 5 / a + R ζ a. 43 6

17 If ζ, L a S ρ ω t f H ρ a p t 0 a + a p t0 + + a RC ζ,κ u + a + Rκ ζ κ u ζ ζ u + u ζ u + a + a Uω t P S xȳ u + a H + R κ ζ a 5 + ζ a. 44 u Proof. Adding and subtracting with the same term, and then using the triangle inequality, L a S ρ ω t f H ρ L a S ρ ω t ω ρ + L a S ρ ω f H ρ L a S ρ ω t ω ρ + R ζ a, where we used Part of Lemma 4.7 for the last inequality. Using and 5, L a S ρ = L Sρ S ρ a S ρ = L Sρ S ρs ρ a = L Sρ T a L a S ρ ω t f H ρ L Sρ T a ω t ω ρ + R ζ a T a ω t ω H + R ζ a. 45 Subtracting and adding with the same term, then using the triangle inequality, L a S ρ ω t f H ρ T a ω t P ω H + T a I P ω H + R ζ a. Since P is a projection operator, I P s = I P for any s > 0, and we thus can get L a S ρ ω t f H ρ T a ω t P ω H + T a I P a I P T T ω H + R ζ a. Using Lemma 4. and Part of Lemma 4.7, we get [9], L a S ρ ω t f H ρ T a ω t P ω H + a 5 Rκ ζ + ζ + R ζ a. 46 In what follows, we estimate T a ω t P ω H. Estimating T a ω t P ω H. We first have T a ω t P ω H T a T a T a T a x T a x ω t P ω H. Obviously, T a T a and by Lemma 4., T a T a x T T x a a. Thus, T a ω t P ω H a T a x ω t P ω H = a T a x P ω t P ω H, where the last equality follows from the facts that ω t S and that P is the projection operator with range S which implies P = P and ω t = P ω t. Noting that P, using 34, we get T a ω t P ω H a U a ω t P ω H. Adding and subtracting with the same term, using the triangle inequality, and noting that ω t = P ω t, T a ω t P ω H a F u U a ω t P ω H + Fu U a P ω t ω H. 7

18 Introducing with ω t = q t UP S xȳ, T a ω t P ω H a F u U a q t UP Sxȳ P ω H + Fu U a P ω t ω H. In what follows, we estimate the last two terms from the above. Estimating Fu U a P ω t ω H. By a direct calculation, following from the definition of U given by 4 and P = P, Fu U a P ω t ω H Fu U a U Fu U UP ω t ω H u + a Fu U u Uω t P T x P ω H. Adding and subtracting with the same term, and using the triangle inequality, F u U a u + a u u + a u P ω t ω H Using 35, U P T x U F u U Uω t P S xȳ H + F u U Uω t P S xȳ H u + Fu U a P ω t ω u + a H u P S xȳ T xp ω H + U P T x T x S xȳ T xp ω H P T xp + I =, and thus Uω t P S xȳ H u + + T x S xȳ T xp ω H Estimating F u U a q t UP S xȳ ω H. Adding and subtracting with the same term, noting that P = P, and using the triangle inequality, we get Using 35, F u U a q t UP Sxȳ P ω H F u U a q t UP Sxȳ T xp ω H + F u U a p t UP ω H F u U a q t UP T x T x S xȳ T xp ω H + F u U a p t UP ω H. 49 F u U a q t UP T x F uu a q t UP T x P + I = Fu U a q t U max x + x [0,u] a q t x max xqt x a q t x a + a q t x x [0,u] p t0 a + a p t0, 50 where we used Part of Lemma 4.6 with u [0, x,t ] for the last inequality. Introducing the above into 49, we get F u U a q t UP Sxȳ P ω H p t0 a + a p t0 T x S xȳ T xp ω H + F u U a p t UP ω H. 8

Substituting the above and (48) into (47), we get

T a ω t P ω H a p t0 a + a p t0 + + a u + a u u + a T u x S xȳ T xp ω H Uω t P S xȳ H + F u U a p t UP ω H (51)

In what follows, we estimate F u U a p t UP ω H, considering two different cases.

If 0 < ζ ≤ 1, applying Lemma 4.9,

F u U a p t UP ω H F u U a p t UU R ζ max x [0,u] p txx + a R ζ u + a R ζ,

where we used Part of Lemma 4.6 for the last inequality. Substituting the above and (39) into (51), and then combining with (46), one can prove the desired result for ζ ≤ 1.

If ζ ≥ 1, applying Lemma 4.9 with A = F u U a p t U, we get

F u U a p t UP ω H R A C ζ,κ ζ AU Cζ,κ ζ + AU ζ. (52)

For any s ≥ 0, using Part of Lemma 4.6,

AU s = max x [0,u] x + a p t xx s u + a u s.

Using the above with s = 0,, ζ in (52), we get

F u U a p t UP ω H RC ζ,κ ζ 3 + C ζ,κ ζ u + u ζ u +

Substituting the above and (39) into (51), and then combining with (46), we can prove the desired result for ζ ≥ 1.

From Lemma 4.10, we can see that in order to control the error, we need to estimate the random quantities Δ1, Δ2, Δ3, Δ4, Δ5, p t0, and Uω t P S xȳ H. The random quantities will be estimated in Subsections 4.5 and 4.6, while Uω t P S xȳ H can be bounded due to the stopping rule. In order to estimate p t0, we introduce the following two lemmas. Together with the stopping rule, they allow us to estimate p t0, as shown in the proof of the main theorem.

Lemma 4.11. The following statements hold. If ζ ≤ 1,

Uω t P S xȳ H p t / + R ζ + R c 3 p t0 3 ζ + ζ p t 0.

If ζ > 1,

Uω t P S xȳ H + RC ζ,κ ζ p t0 + + R κ κ ζ 3 p t0 + c 3 C ζ,κ ζ p t0 3 + cζ+ p t0 ζ+. (53)

Here, we denote 0^0 = 1 and c_v = v^v, v ≥ 0.

Proof. Let

x,t φ t x = p t x. x,t x

Following from [15, (3.8)],

Uω t P S xȳ H F x,t φ t UP S xȳ H.

Using the triangle inequality, with a basic calculation, we get

Uω t P S xȳ H F x,t φ t UP S xȳ T xp ω H + F x,t φ t UUω H F x,t φ t UP T x T x S xȳ T xp ω H + F x,t φ t UUω H F x,t φ t UU T x S xȳ T xp ω H + F x,t φ t UUω H, (54)

where we used (35) of Lemma 4.4 for the last inequality. Note that

F x,t φ t UU sup x [0,x,t ] φ t xx + sup x [0,x,t ] x + φt x.

Following from [15, (3.10)],

sup x [0,x,t ] φ t xx v c v p t0 v, v (55)

Thus, we get that

F x,t φ t UU p t0 +.

Substituting the above into (54), we get that

Uω t P S xȳ H p t0 + T x S xȳ T xp ω H + F x,t φ t UUω H. (56)

Now, we consider two cases.

Case I: ζ ≤ 1. Using Lemma 4.9, with U = UP,

F x,t φ t UUω H R F x,t φ t UUU ζ R ζ max x [0,x,t ] φ txxx +.

Applying (55),

F x,t φ t UUω H R c 3 p t0 3 ζ + ζ p t 0.

Substituting the above and (39) into (56), one can get the desired result.

Case II: ζ > 1. Applying (55), we get that for any s ≥ 0,

F x,t φ t UUU s H max x [0,x,t ] φ txx s+ c s+ p t0 s+.

Using the above and Lemma 4.9, with U = UP, we get that

F x,t φ t UUP ω H RC ζ,κ ζ 3 p t0 + c 3 C ζ,κ ζ p t0 3 + cζ+ p t0 ζ+.

Applying the above and (39) to (56), we get the desired result.

Lemma 4.12. Let u 0, x,t ]. Then the following statements hold. If ζ ≤ 1,

[p t, p t ] 0 u + + R 5 + ζ + Ru ζ + u [p t, p t ]. (57)

If ζ > 1,

[p t, p t ] 0 u + + R κ κ ζ + RC ζ,κu ζ 3 + u 3 Cζ,κ ζ + u ζ+ + u [p t, p t ]. (58)

Proof. Since p t is the minimizer of [p, p] 0 over P 0 t and p t P 0 t,

[p t, p t ] 0 [p t, p t ] 0 = p t UP S xȳ H

Using the triangle inequality,

[p t, p t ] 0 F u p t UP S xȳ H + Fu p t UP S xȳ H F u p t UP S xȳ T xp ω H + F u p t UUω H + F u p t UP S xȳ H.

By a basic calculation,

[p t, p t ] 0 F u p t UP T x T x S xȳ T xp ω H + F u p t UUω H + F u U U p t UP S xȳ H F u p t UU T x S xȳ T xp ω H + F u p t UUω H + u [p t, p t ],

where we used (35) of Lemma 4.4 for the last inequality. Using Part of Lemma 4.6, we get

F u p t UU max x [0,u] p t xx + u +,

and thus

[p t, p t ] 0 u + T x S xȳ T xp ω H + F u p t UUω H + u [p t, p t ]. (59)

Case I: ζ ≤ 1. Using P² = P and Lemma 4.9,

F u p t UUω H = F u p t UUP ω R F u p t UUU ζ R max x [0,u] p t xxx + ζ.

Using Part of Lemma 4.6,

F u p t UUω H Ruu + ζ.

Substituting the above and (39) into (59), one can get the desired result for ζ ≤ 1.

Case II: ζ ≥ 1. Using Part of Lemma 4.6, for any s ≥ 0,

F u p t UUU s = max x [0,u] p t xx s+ u s+,

Noting that, as P² = P, F u p t UUω H = F u p t UUP ω, and combining with Lemma 4.9, we get

F u p t UUω H RC ζ,κu ζ 3 + u 3 Cζ,κ ζ + u ζ+.

Substituting the above and (39) into (59), one can get the desired result for ζ ≥ 1.

4.5 Probabilistic Estimates

In this subsection, we introduce some probabilistic estimates to bound the random quantities Δ1, Δ2, Δ3, and Δ4.

Lemma 4.13. Under Assumption 3, let δ ∈ (0, 1), and λ = n^{−θ} with θ ∈ [0, 1) or λ = [1 ∨ log n^γ]/n. Then with probability at least 1 − δ,

T + I / T x + I / T + I / T x + I / 3aδ,

where aδ = 8κ log 4κ ec γ+ δ T if λ = [1 ∨ log n^γ]/n, or aδ = 8κ log 4κ c γ+ δ T + θγ e θ otherwise.

The proof of the above result for the case λ = n^{−θ} with θ ∈ [0, 1) can be found in [8]. Here, using essentially the same idea, we also provide a similar result for the case λ = [1 ∨ log n^γ]/n. We report the proof in Appendix A..

Lemma 4.14. Let 0 < δ < 1/2. It holds with probability at least 1 − δ:

T T x T T x HS κ log/δ κ + 4 log/δ. n n

Here, HS denotes the Hilbert-Schmidt norm.

Proof. Using Lemma from [3] (which is a direct corollary of the concentration inequality for Hilbert-space-valued random variables from [6]), one can prove the desired result.
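The Hilbert-Schmidt concentration of the empirical covariance operator stated in the preceding lemma can be illustrated numerically. The following is a minimal sketch and not part of the paper: it assumes a finite-dimensional H = R^d, Rademacher inputs scaled so that every sample has norm exactly κ (which makes the population operator T = (κ²/d) I available in closed form), and the hypothetical helper name `hs_deviation`. It only shows the expected decay of order n^{-1/2}; it does not reproduce the constants of the lemma.

```python
import numpy as np


def hs_deviation(n, d=5, kappa=1.0, seed=0):
    """Return ||T - T_x||_HS for n i.i.d. samples x with ||x|| = kappa.

    Assumption (illustrative only): x has independent Rademacher coordinates
    scaled by kappa / sqrt(d), so that T = E[x x^T] = (kappa^2 / d) * I.
    """
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n, d)) * (kappa / np.sqrt(d))
    T = (kappa**2 / d) * np.eye(d)   # population covariance operator
    T_x = X.T @ X / n                # empirical covariance operator
    # In finite dimensions the Hilbert-Schmidt norm is the Frobenius norm.
    return np.linalg.norm(T - T_x, ord="fro")


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        err = hs_deviation(n)
        print(f"n={n:6d}  ||T - T_x||_HS={err:.4f}  sqrt(n)*err={np.sqrt(n) * err:.3f}")
```

If the deviation behaves as the lemma suggests, the rescaled quantity `sqrt(n) * err` should stay of order one as n grows.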

Lemma 4.15. Under Assumptions and 3, with probability at least 1 − δ, the following holds:

T T x ω S xȳ H 4κM + κ ζ R ζ 83R n + κ ζ + 3B + 4Q c γ γ log n δ + Rζ. (60)

The above lemma is essentially proved in [8, ]. We provide a proof in Appendix A.3.

Lemma 4.16. Under Assumption 3, let 0 < δ < 1/2. It holds with probability at least 1 − δ:

T κ T T x HS κ n + cγ n γ log δ.

The proof of the above lemma can be found in [9].

4.6 Projection Errors

In this subsection, we estimate projection errors I P T, considering different projections. The first lemma provides upper bounds on projection errors with plain Nyström subsampling.

Lemma 4.17. Under Assumption 3, let P be the projection operator with range

S = span{x,, x m }.

Then with probability at least 1 − δ (δ ∈ (0, 1)),

I P T I P T η log mγ m 4κ log 4κ ec γ +, δ T (61)

where η = log mγ m.

The following lemma estimates projection errors with randomized sketches.

Lemma 4.18. Under Assumption 3, let S = range{s xg }, where G R m n is a random matrix satisfying 5 and P is the projection operator with range S. Then with probability at least 1 − 3δ (δ ∈ (0, 1/3)), we have

I P T n θ log nγ n θ 7a γ log 4 δ,

provided that

m Cn θγ log β n log n γ c log 3 4 δ, c = { 0, if θ <, γ, if θ =. (62)

Here, a γ = 4κ log κ e c γ+ T, and C = 00c 0 + 0b γ with

b γ = 4κ 4κ + κ { θγ c γ + c γ log κ c γ + e θ, if θ <, + + c, c = T, if θ =.

Finally, the next lemma upper bounds projection errors with ALS Nyström subsampling sketches. The ALS Nyström subsampling is defined as follows.

Approximated Leveraging Scores (ALS) Nyström Subsampling. In this regime, S = range{sxg }, where each row m a j of G is drawn i.i.d. according to

P a = e i = q i, qi

where q i > 0 will be chosen later and {e i : i ∈ [n]} is the standard basis of R^n. For every i ∈ [n] and λ > 0, the leveraging scores of KK + I form the sequence {l i } n i= with

l i = KK + I ii, i ∈ [n].

In practice, the leveraging scores of KK + I are hard to compute, and we can only compute approximations ˆl i such that

L l i ˆl i Ll i,

for some L. In the ALS Nyström subsampling, we set q i := q i = ˆl i j ˆl j.

Lemma 4.19. Under Assumption 3, let S = range{s xg }, where G R m n is a random matrix related to ALS Nyström subsampling, and P is the projection operator with range S. Then with probability at least 1 − 3δ (δ ∈ (0, 1/3)), we have

I P T n θ log nγ n θ 4a γ log 4 δ,

provided that

m C n θγ log n γ c log 3 4 δ, c = {, if θ <, γ, if θ =. (63)

Here, C = 8b γ L 4 + logb γ where a γ and b γ are given by Lemma 4.18. Here, = n θ if log nγ θ [0,, or = n if θ =.

Part of the proofs for the above lemmas can be found in [9]. We provide the proofs in Appendix A.4, A.5 and A.6.

4.7 Deriving Main Results

We are ready to prove the main theorem and its corollaries.

Proof of Theorem 3.1. Applying Lemmas 4.13, 4.14, 4.15, 4.16 and Condition, and noting that [n, ], we get that with probability at least 1 − 5δ, the following inequalities hold:

C log δ, C ζ log δ, 3 κ + n log δ C 3 ζ+γ log δ, C 3 = κ +,

κ 4 κ n + cγ n γ log δ C 4 ζ log δ, C 4 = κκ + c γ, 5 C ζ a a log δ,

where

4κ log κ ec γ+ T + γ ζ+γ C = 4κ log κ ec γ+ T C = 4κM + κ ζ R +, if ζ + γ >, otherwise, 83R κ + 3B + 4Q c γ + R.

In what follows, we assume the above estimates hold and we prove the results considering two different cases.

Case I: ζ ≤ 1. We first have

+ 5 / + R ζ C 5 ζ log 3 δ, (64)

where we denote C 5 = C C + C + R.

Step 1: By the above inequality and Lemma 4.11, we have

Uω t P S xȳ H p t0 + C 5 log 3 δ ζ + RC log δ c 3 p t0 3 ζ + ζ p t (65)

Now set the stopping rule as

Uω t P S xȳ H τ + C 5 log 3 δ ζ+,

where τ > 0 is given later. From the definition of ˆt, we have

Uωˆt P S xȳ H > τ + C 5 log 3 δ ζ+. (66)

Combining with (65), noting that log δ, by a simple calculation,

τ log 3 δ ζ+ C 5 p ˆt 0 ζ log 3 δ + RC log δ c 3 p ˆt 0 3 ζ + ζ p ˆt 0 3 log 3 δ ζ max C 5 p ˆt 0, RC c 3 p ˆt 0 3, RC p ˆt 0.

If the maximum is achieved at the first term of the right-hand side above, then

τ log 3 δ ζ+ 3C5 log 3 δ p ˆt 0 ζ,

and by a direct calculation,

p ˆt 0 3C 5/τ.

If the maximum is achieved at the second term or the third term, using a similar argument, one can show that at least one of the following two inequalities holds,

p ˆt 0 3RC c 3 /τ 3,

p ˆt 0 6C R/τ.

Now we choose τ as

τ max 3 C 5, 6 RC c 3, C R.

Then, following from the above analysis,

p ˆt (67)

Step 2: In this step, we choose u =. Using (67) and Part 3 of Lemma 4.6, it is easy to show that

u p ˆt 0 x,ˆt.

Applying Lemma 4.12, with (64),

[pˆt, pˆt ] 0 C 5 + C R ζ+ 3 log δ + u [p ˆt, p ˆt ].

Combining with (66),

[pˆt, pˆt ] 0 C5 + C R C 5 + τ Uωˆt P S xȳ H + u [p ˆt, p ˆt ] = C5 + C R C 5 + τ [pˆt, pˆt ] 0 + u [p ˆt, p ˆt ] [pˆt, pˆt ] 0 + u [p ˆt, p ˆt ],

provided that τ C 5 + C R C 5. Thus, we get

[pˆt, pˆt ] 0 u [p ˆt, p ˆt ], (68)

Combining with Part 4 of Lemma 4.6 and (67), we get that

p ˆt 0 p ˆt 0 + 4u (69)

Step 3. In this step, we let u = 5. Then following from (69) and Part 3 of Lemma 4.6, we have

u p ˆt 0 x,ˆt. (70)

Using Lemma 4.10, and substituting (69), (64) and the above estimates, we have

L a S ρ ωˆt f H ρ C 6 ζ a log a δ + 6C a a Uωˆt P S xȳ H log a δ,

where

C 6 = C 5 a + C 5 + C a R6/5 a + C a + R.

From the definition of the stopping rule, we get

L a S ρ ωˆt f H ρ C 6 + 6C a τ + C 5 ζ a log a δ,

which leads to the desired result for ζ ≤ 1.

Case II: ζ > 1.

Step 1. Substituting the estimates given at the beginning of the proof and using ζ a a,

+ RC ζ,κ ζ + R κ κ ζ C 7 ζ log 3 δ ζ as δ, (71)

where

C 7 = C C + RκC 4 + C κ ζ.

Using Lemma 4.11,

Uω t P S xȳ H p t0 + C 7 ζ log 3 3 p t0 + c 3 C ζ,κ ζ p t0 3 + cζ+ p t0 ζ+.

Notice that by a direct calculation, with ζ > 1, <, κ and log δ,

ζ 3 C 3 ζ+γ/ log δ ζ C 3 ζ log δ, (72)

and

ζ C 3 ζ+γ/ log δ + ζ a a C log ζ δ C 3 + C ζ log δ. (73)

Therefore,

Uω t P S xȳ H p t0 + C 7 ζ log 3 δ + RC ζ,κc 3 ζ p t 0 + c 3 C ζ,κ C3 + C ζ p t0 3 + cζ+ p t0 ζ+ log δ. (74)

Now set the stopping rule as

Uω t P S xȳ H τ + C 7 log 3 δ ζ+,

where τ > 0 is given later. From the definition of ˆt, we have

Uωˆt P S xȳ H > τ + C 7 log 3 δ ζ+. (75)

Letting t = ˆt in (74) and combining with (75), by a direct calculation,

τ log 3 δ ζ+ C7 p ˆt 0 ζ log 3 δ + RC ζ,κc 3 ζ p ˆt 0 + c 3 C ζ,κ C3 + C ζ p ˆt cζ+ p ˆt 0 ζ+ log δ 4 log 3 δ ζ max C 7 p ˆt 0, Rc 3 C ζ,κ C3 + C p ˆt 0 3, RC ζ,κc 3 p ˆt 0, c ζ+ R ζ p ˆt 0 ζ+.

Therefore, if

τ max 4 C 7, 8 Rc 3 C ζ,κ C3 + C, 6RCζ,κc 3, ζ+ 5 cζ+

then (67) holds, using a similar basic argument.

Step 2. In this step, we let u =. Using (67) and Part 3 of Lemma 4.6, it is easy to show that

u p ˆt 0 x,ˆt.

Applying Lemma 4.12, substituting (71), (72) and (73), and by a direct calculation,

[pˆt, pˆt ] 0 C 8 ζ+ log 3 δ + [p ˆt, p ˆt ],

where

C 8 = C 7 + R C ζ,κc 3 + C ζ,κ C 3 + C +.

Combining with (75), we get that

[pˆt, pˆt ] 0 C 8 τ + C 7 Uωˆt P S xȳ H+ [p ˆt, p ˆt ] [pˆt, pˆt ] 0 + [p ˆt, p ˆt ],

provided that τ C 8 C 7. This leads to (68). Combining with Part 4 of Lemma 4.6 and (67), we get that (69) holds.

Step 3. In this step, we let u = 5. Then following from (69) and Part 3 of Lemma 4.6, we have (70). The rest of the proof parallels that for the case ζ ≤ 1, so we only sketch it. Applying Lemma 4.10, and substituting (69), (71), (72) and (73),

L a S ρ ωˆt f H ρ C 9 ζ a log a δ + C a 6 Uωˆt P S xȳ H a log a δ,

where

C 9 = C a R, C 7 5 a + + RC ζ,κ6/5 a C 3 + C 3 + C / ζ +Rκ ζ C a +.

Following from the definition of the stopping rule, one can get the desired result for the case ζ ≥ 1.

The proof for 3 with ζ ≥ 1/2 is the same, as we can replace L a S ρ ω t f H H by T a ω t ω H H throughout the proof of convergence with respect to the L ρ X -norm.

Proof of Corollary 3.2. We use Theorem 3.1 and Lemma 4.18 to prove the result. We only need to verify is satisfied. In Lemma 4.18, we let

θ = ζ a aζ+γ, if ζ >, ζ+γ, otherwise,, if ζ + γ.

Clearly, θ. For θ <, we have log nγ γ log n θ = γ n θ θn θ θ. Therefore, following from Lemma 4.18 and Condition 6, we have that with probability at least 1 − 3δ (δ ∈ (0, 1/3)),

I P T C ζ a a log 4 δ,

with C = 7a γ if λ = [1 ∨ log n^γ]/n or C = 7aγ θ otherwise. The proof is complete.
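The two kinds of random projections used in the corollaries can be compared numerically through the projection error ‖(I − P) T^{1/2}‖² = ‖(I − P) T (I − P)‖. The following is a minimal sketch and not the authors' implementation: it assumes a finite-dimensional H = R^d, uses the empirical covariance T_x = XᵀX/n as a stand-in for T, and takes a Gaussian matrix G as the randomized sketch. The helper names (`projector`, `projection_error`, `run_demo`) and all sizes are hypothetical. Subspaces are built from nested row sets, so the error is non-increasing in the projection dimension m by construction.

```python
import numpy as np


def projector(B, tol=1e-10):
    """Orthogonal projector onto S = span{rows of B}."""
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    V = Vt[s > tol * s.max()].T          # orthonormal basis of the row space
    return V @ V.T


def projection_error(X, B):
    """||(I - P) T_x^{1/2}||^2, computed as the spectral norm of (I - P) T_x (I - P)."""
    n, d = X.shape
    T_x = X.T @ X / n                    # empirical covariance (stand-in for T)
    R = np.eye(d) - projector(B)
    return np.linalg.norm(R @ T_x @ R, ord=2)


def run_demo(seed=0, n=200, d=30, sizes=(2, 5, 10, 20, 30)):
    rng = np.random.default_rng(seed)
    # Inputs with decaying coordinate scales, so T_x has a decaying spectrum.
    X = rng.standard_normal((n, d)) / np.arange(1, d + 1)
    # Plain Nystrom subsampling: S is spanned by m randomly chosen inputs.
    nystrom_rows = X[rng.permutation(n)]
    # Randomized sketch: S is spanned by the rows of G X with Gaussian G (m x n).
    G = rng.standard_normal((max(sizes), n)) / np.sqrt(n)
    sketch_rows = G @ X
    nys = [projection_error(X, nystrom_rows[:m]) for m in sizes]
    sk = [projection_error(X, sketch_rows[:m]) for m in sizes]
    return list(sizes), nys, sk


if __name__ == "__main__":
    for m, e_nys, e_sk in zip(*run_demo()):
        print(f"m={m:3d}  Nystrom error={e_nys:.3e}  sketch error={e_sk:.3e}")
```

In this toy setting the error vanishes once m reaches the rank of T_x; the lemmas of Subsection 4.6 quantify how large m must be for the error to fall below the regularization level in the general case.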

Proof of Corollary 3.3. The proof of Corollary 3.3 is similar, using Theorem 3.1 and Lemma 4.17. We thus skip it.

Combining Theorem 3.1 with Lemma 4.19, we get the following result for KCGM with ALS Nyström sketches.

Corollary 4.20. Under Assumptions, and 3, let δ ∈ (0, 1), a [0, ζ ], and S = span{ x,, x m } with x j drawn i.i.d. according to the ALS Nyström subsampling regime in Lemma 4.19 with an appropriate. Assume that

m C 4 L log 3 3 n γ [ log n γ ] γ, if ζ + γ, n γζ a aζ+γ [ log n δ γ ], if ζ, n γ ζ+γ [ log n γ ] otherwise, (76)

for some C 4 > 0 which depends only on ζ, γ, c γ, T, κ, M, Q, B, R. Then the conclusions in Theorem 3.1 are true.

Acknowledgements

This work was sponsored by the Department of the Navy, Office of Naval Research (ONR) under a grant number N. It has also received funding from Hasler Foundation Program: Cyber Human Systems (project number 6066), and from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement n time-data).

References

[1] A. Alaoui and M. W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, 2015.
[2] H. Avron, K. L. Clarkson, and D. P. Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017.
[3] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.
[4] G. Blanchard and N. Krämer. Optimal learning rates for kernel conjugate gradient regression. In Advances in Neural Information Processing Systems, pages 226–234, 2010.
[5] G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, 2018.
[6] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[7] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[8] F. Cucker and D. X. Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
[9] L. H. Dicker, D. P. Foster, and D. Hsu. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11(1):1022–1047, 2017.
[10] P. Drineas and M. W. Mahoney. Lectures on randomized numerical linear algebra. arXiv preprint, 2017.
[11] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems, volume 375. Springer Science & Business Media, 1996.
[12] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing, volume 1.
[13] J. Fujii, M. Fujii, T. Furuta, and R. Nakamoto. Norm inequalities equivalent to Heinz inequality. Proceedings of the American Mathematical Society, 118(3):827–830, 1993.
[14] C. Gu. Smoothing spline ANOVA models, volume 297. Springer Science & Business Media, 2013.
[15] M. Hanke. Conjugate gradient type methods for ill-posed problems. Routledge, 2017.
[16] F. Hansen. An operator inequality. Mathematische Annalen, 246(3):249–250, 1980.
[17] F. Krahmer and R. Ward. New and improved Johnson-Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.
[18] J. Lin and V. Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. arXiv preprint; under revision for Journal of Machine Learning Research, 2018.
[19] J. Lin and V. Cevher. Optimal rates of sketched-regularized algorithms for least-squares regression over Hilbert spaces. arXiv preprint; in Proceedings of the 35th International Conference on Machine Learning, 2018.
[20] J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.
[21] J. Lin, A. Rudi, L. Rosasco, and V. Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. arXiv preprint; to appear in Applied and Computational Harmonic Analysis, 2018.
[22] S.-B. Lin and D.-X. Zhou. Optimal learning rates for kernel partial least squares. Journal of Fourier Analysis and Applications, 24(3), 2018.
[23] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constructive Approximation, 28(3):277–289, 2008.
[24] S. Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. arXiv preprint arXiv:1112.5448, 2011.
