arxiv: v1 [math.st] 28 May 2016

Size: px

Start display at page:

Download "arxiv: v1 [math.st] 28 May 2016"

Todd McKinney
5 years ago
Views:

1 Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators Lee H. Dicker Dean P. Foster Daniel Hsu arxiv: v1 [math.st] 8 May 016 May 31, 016 Abstract Regularization is an essential element of virtually all kernel methods for nonparametric regression problems. A critical factor in the effectiveness of a given kernel method is the type of regularization that is employed. This article compares and contrasts members from a general class of regularization techniques, which notably includes ridge regression and principal component regression. We derive an explicit finite-sample risk bound for regularization-based estimators that simultaneously accounts for i the structure of the ambient function space, ii the regularity of the true regression function, and iii the adaptability or qualification of the regularization. A simple consequence of this upper bound is that the risk of the regularization-based estimators matches the minimax rate in a variety of settings. The general bound also illustrates how some regularization techniques are more adaptable than others to favorable regularity properties that the true regression function may possess. This, in particular, demonstrates a striking difference between kernel ridge regression and kernel principal component regression. Our theoretical results are supported by numerical experiments. 1 Introduction Suppose that the observed data consists of z i = y i, x i, i = 1,..., n, where y i Y R and x i X R d. Suppose further that z 1,..., z n ρ are iid from some probability distribution ρ on Y X. Let ρ x denote the conditional distribution of y i given x i = x X and let ρ X denote the marginal distribution of x i. Our goal is to use the available data to estimate the regression function of y on x, f x = y dρy x, which minimizes the mean-squared prediction error y fx dρy, x Y X Y over ρ X -measurable functions f : X R. More specifically, for an estimator ˆf define the risk R ρ ˆf = E[ ˆfx ] [ f x dρx x = E f ˆf ] ρx, 1 where the expectation is computed over z 1,..., z n, and ρx denotes the norm on L ρ X ; we seek estimators ˆf which minimize R ρ ˆf. This is a version of the random design nonparametric regression problem. There is a vast literature on nonparametric regression, along with a huge variety of corresponding methods [e.g., 11, 3]. In this paper, Rutgers University, ldicker@stat.rutgers.edu Amazon NYC, dean@foster.net Columbia University, djhsu@cs.columbia.edu 1

2 we focus on regularization and kernel methods for estimating f. 1 Most of our results apply to general regularization operators. However, our motivating examples are two well-known regularization techniques: Kernel ridge regression which we refer to as KRR ; KRR is also known as Tikhonov regularization and kernel principal component regression KPCR ; also known as spectral cut-off regularization. Our main theorem is a new upper bound on the risk of a general class of kernel-regularization methods, which includes both KRR and KPCR Theorem 1. The theorem substantially generalizes previously published bounds see Section for a discussion of related work and illustrates the dependence of the risk on three important features: i the structure of the ambient reproducing kernel Hilbert space RKHS, ii the specific regularization technique employed, and iii the regularity often interpreted as smoothness of the function to be estimated. One consequence of the theorem is that the regularization methods studied in this paper including KRR and KPCR achieve the minimax rate for estimating f in a variety of settings. A second consequence is that certain regularization methods including KPCR, but not KRR may adapt to favorable regularity of f to attain even faster convergence rates, while others notably KRR are limited in this regard due to a well-known saturation effect [, 15, 17]. This illustrates a striking advantage that KPCR may have over KRR in these settings. Related work Kernel ridge regression has been studied extensively in the literature. Indeed, bounds for KRR that are similar to our Theorem 1 have been derived by Caponnetto and De Vito [4]. Moreover, it is well-known that KRR is minimax in many of the settings considered in this paper, such as those described in Corollaries 1 3 [4, 4, 5]. However, these cited results apply only to KRR, while the results in this paper apply to a substantially larger class of regularization operators including, for example, KPCR. Beyond KRR, there has also been significant research into more general regularization methods, like those considered in this paper. However, our bounds are sharper than previously published results on general regularization operators. For instance, unlike our Theorem 1, the bounds of Bauer et al. [] do not illustrate the dependence of the risk on the ambient Hilbert space. Thus, while our approach immediately implies that many of the regularization methods under consideration are minimax optimal, it seems difficult if not impossible to draw this conclusion using the approach of Bauer et al.. General regularization operators are studied by Caponnetto and Yao [5], but their results require a semi-supervised setting where an additional pool of unlabeled data is available. One of the major practical implications of this paper is that KPCR may have significant advantages over KRR in some settings. This has been observed previously by other researchers; others have even noted that Tikhonov regularization KRR saturates, while spectral cut-off regularization KPCR does not [, 14, 15]. Our results Theorem 1 and Corollaries 1 3 sharpen these observations by precisely quantifying the advantages of unsaturated regularization operators in terms of adaptability and minimaxity. In other related work, Dhillon et al. [8] have illustrated the potential advantages of KPCR over KRR in finitedimensional problems with linear kernels; though their work is not framed in terms of saturation and general regularization operators, it relies on similar concepts. We recently became aware of simultaneous independent work of Blanchard and Mücke [3] that proves upper bounds on the risk for a comparable class of regularization methods which includes KRR and KPCR, as well as minimax lower bounds that match the upper bounds; their results are specialized to RKHS s where the covariance operator has polynomially decaying eigenvalues. Compared to that work, our results require a weaker moment condition on the noise for risk bounds, apply to a much broader class of RKHS s, and also consider target functions that live in finite-dimensional subspaces see Proposition 1. Another distinguishing feature of the present work is that it contains numerical experiments that illustrate the implications of our theoretical bounds in some practical settings Section 6. The main engine behind the technical results in this paper is a collection of large-deviation results for Hilbert-Schmidt operators. The required machinery is developed in the appendix. These results build on straightforward extensions of results of Tropp [] and Minsker [16]. Our most precise results for KPCR and achieving parametric rates for estimation over finite-dimensional subspaces Proposition 1 rely on slightly 1 Here, we mean kernel as in reproducing kernel Hilbert space, rather than kernel-smoothing, which is another popular approach to nonparametric regression.

3 different arguments, which are based on well-known eigenvalue perturbation results that have been adapted to handle Hilbert-Schmidt operators e.g., the Davis-Kahan sin Θ theorem [7]. 3 Statistical setting and assumptions Our basic assumption on the distribution of z = y, x ρ is that the residual variance is bounded; more specifically, we assume that there exists a constant σ > 0 such that [ ] y f x dρy x σ Y for almost all x X. Zhang et al. [5] also assume ; this assumption is slightly weaker than the analogous assumption of Bauer et al. [] Equation 1 in their paper. Note that holds if y is bounded almost surely or if y = f x + ɛ, where ɛ is independent of x, Eɛ = 0, and Varɛ = σ. Let K : X X R be a symmetric positive-definite kernel function. We assume that K is bounded i.e., that there exists κ > 0 such that sup x X Kx, x κ. Additionally, we assume that there is a countable basis of eigenfunctions {ψ j } L ρ X and a sequence of corresponding eigenvalues t 1 t 0 such that Kx, x = t jψ j xψ j x, x, x X 3 and the convergence is absolute. Mercer s theorem and various generalizations give conditions under which representations like 3 are known to hold [6]; one of the simplest examples is when X is a compact Hausdorff space, ρ X is a probability measure on the Borel sets of X, and K is continuous. Observe that t j = t j X ψ j x dρ X x = X Kx, x dρ X x κ ; in particular, {t j } l1 N. Let H L ρ X be the RKHS corresponding to K [1] and let φ j = t j ψ j, j = 1,,.... It follows from basic facts about RKHSs that {φ j } is an orthonormal basis for H if t > t +1 = 0, then { } φ j is an orthonormal basis for H. Furthermore, H is characterized by H = f = θ j ψ j L θ j ρ X ; < and the inner product f, f H = θ j ψ j, θ j ψ j t j the corresponding norm is denoted by H. Our main assumption on the relationship between y, x, and the kernel K is that f H. 4 This is a regularity or smoothness assumption on f. Many of the results in this paper can be modified, so that they apply to settings where f / H, by replacing f with an appropriate projection of f onto H and including an approximation error term in the corresponding bounds. This approach leads to the study of oracle inequalities [1, 13, 1, 4, 5], which we do not pursue in detail here. However, investigating oracle inequalities for general regularization operators may be of interest for future research, as most existing work focuses on ridge regularization. Another interpretation of condition 4 is that it is a minimal regularity condition on f for ensuring that f can be estimated consistently using the kernel methods considered below. One key aspect of the upper bounds in Section 5 is that they show f can be estimated more efficiently, if it satisfies stronger regularity conditions and if the regularization method used is sufficiently adaptable. A convenient way to formulate H = θ j θj t j 3

4 a collection of regularity conditions with varying strengths, which will be useful in the sequel, is as follows. For ζ 0, define the Hilbert space H ζ = f = θ j ψ j L θ j ρ X ; < t 1+ζ. 5 j Then H = H 0 and H ζ H ζ1 whenever ζ ζ 1 0. The norm on H ζ is defined by f H ζ = θ j t 1+ζ j Additionally, positive integers define the finite-rank subspace H = f = θ j ψ j L ρ X ; θ 1,..., θ R. 6 We have the inclusion H H +1 and, if t 1,..., t > 0, then H H ζ for any ζ 0. In particular, it is clear from 5 6 that f H ζ is a stronger regularity condition than f H and that f H is an even stronger condition provided the t j are strictly positive. Conditions such as f H ζ and f H are known as source conditions elsewhere in the literature [e.g.,, 5]. 4 Regularization As discussed in Section 1, our goal is find estimators ˆf that minimize the risk 1. In this paper, we focus on regularization-based estimators for f. In order to precisely describe these estimators, we require some additional notation for various operators that will be of interest, and some basic definitions from regularization theory. 4.1 Finite-rank operators of interest For x X, define K x H by K x x = Kx, x, x X. Let X = x 1,..., x n R n d and y = y 1,..., y n R n. Additionally, define the finite-rank linear operators S X : H R n and T X : H H both depending on X by S X φ = φ, K x1 H,..., φ, K xn H = φx 1,..., φx n, T X φ = 1 n φ, K xi H K xi = 1 n φx i K xi, n n i=1 where φ H. Let, R n denote the normalized inner-product on R n, defined by v, ṽ R n = n 1 v ṽ for v = v 1,..., v n, ṽ = ṽ 1,..., ṽ n R n. Then the adjoint of S X with respect to, H and, R n, S X : Rn H, is given by S X v = 1 n n i=1 v ik xi. Additionally, we have T X = S X S X. Finally, observe that S X S X : Rn R n is given by the n n matrix S X S X = n 1 K, where K = Kx i, x j 1 i,j n ; K is the kernel matrix, which is ubiquitous in kernel methods and enables finite computation. 4. Basic definitions A family of functions g λ : [0, [0, indexed by λ > 0 is called a regularization family if it satisfies the following three conditions: R1 sup 0<t κ tg λ t < 1. R sup 0<t κ 1 tg λ t 1. R3 sup 0<t κ g λ t < λ 1. This definition follows Engl et al. [10] and Bauer et al. [], but is slightly more restrictive. i=1. 4

5 The main idea behind a regularization family is that it looks similar to t 1/t, but is better-behaved near t = 0, i.e., it is bounded by λ 1. An important quantity that is related to the adaptability of a regularization family is the qualification of the regularization. The qualification of the regularization family {g λ } λ>0 is defined to be the maximal ξ 0 such that sup 0<t κ 1 tg λ t t ξ λ ξ. If a regularization family has qualification ξ, we say that it saturates at ξ. Two regularization families that are the major motivation for the results in this paper are ridge Tikhonov regularization, where g λ t = 1 t+λ and principal component spectral cut-off regularization, where g λ t = s λ t = 1 1{t λ}. 7 t Observe that ridge regularization has qualification 1 and principal component regularization has qualification. Another example of a regularization family with qualification is the Landweber iteration, which can be viewed as a special case of gradient descent [see, e.g.,, 14, 19]. 4.3 Estimators Given a regularization family {g λ } λ>0, we define the g λ -regularized estimator for f, ˆf λ = g λ T X S Xy. 8 Here, g λ acts on the spectrum eigenvalues of the finite-rank operator T X which is the same as the spectrum of the kernel matrix K, up to scaling. Therefore, a finitely-computable representation is ˆf λ = n i=1 ˆγ ik xi, where ˆγ 1,..., ˆγ n = g λ K/ny/n; computing the ˆγ i involves an eigenvalue decomposition of the matrix K. The dependence of ˆfλ on the regularization family is implicit; our results hold for any regularization family except where explicitly stated otherwise. The estimators ˆf λ are the main focus of this paper. 5 Main results 5.1 General bound on the risk Theorem 1. Let ˆf λ be the estimator defined in 8 with regularization family {g λ } λ>0. Let 0 δ 1 and assume that there is some κ δ > 0 such that sup x X t 1 δ j ψ j x κ δ < 9 Assume that the source condition f H ζ holds for some ζ 0, and that g λ has qualification at least max{ζ+1/, 1}. Define the effective dimension d λ = t j t +λ. Finally, assume that 8/3+ 5/3κ δ /n j λ 1 δ κ δ. The following risk bound holds: R ρ ˆf λ ζ+3 f H ζ λ ζ+1 + 4d λσ n + 4d λ f Ht 1 + κ σ exp λn 3λ1 δ n 8κ δ + 1{ζ > 1} 16ζ 3/ ζ 1 f H ζ t 1 + λ ζ 1 κ 4 34 n + 15 n. 10 5

6 Theorem 1 is proved in Appendix A. The first two terms in the upper bound 10 are typically the dominant terms. In the upper bound 10, the interaction between the kernel K and the distribution ρ X is reflected in the effective dimension d λ [see, e.g., 4, 4]. The regularity of f enters through norm of f both the H- and H ζ -norms and the exponent on λ. The condition 9 in Theorem 1 is always satisfied by taking δ = 0 and κ 0 = κ. Requiring 9 with δ > 0 imposes additional conditions on the RKHS H. For Corollary 1 below which applies when the eigenvalues {t j } have polynomial-decay, we take δ = 0 and κ 0 = κ. The stronger condition with δ > 0 is required to obtain obtain minimax rates for kernels where the eigenvalues {t j } have exponential or Gaussian-type decay see Corollaries 3. Risk bounds on general regularization estimators similar to ˆf λ were previously obtained by Bauer et al. []. However, their bounds [e.g., Theorem 10 in ] are independent of the ambient RKHS H, i.e., they do not depend on the eigenvalues {t j }. Our bounds are tighter than those of Bauer et al. [] because we take advantage of the structure of H. In contrast with our Theorem 1, the results of Bauer et al. [] do not give minimax bounds not easily, at least, because minimax rates must depend on the t j. 5. Implications for kernels characterized by their eigenvalues rate of decay We now state consequences of Theorem 1 that give explicit rates for estimating f via ˆf λ, for any regularization family, under specific assumptions about the decay rate of the eigenvalues {t j }. We first consider the case where the eigenvalues have polynomial decay. Corollary 1. Assume that C 0 and ν > 1/ are constants such that 0 < t j Cj ν for all j = 1,,.... Assume the source condition f H ζ for some ζ 0, and that g λ has qualification at least max{ζ +1/, 1}. Finally, take λ = C n ν νζ+1+1 for a suitable constant C > 0 so that the conditions on λ from Theorem 1 are satisfied. Then { R ρ ˆf λ = O f H ζ + σ } n νζ+1 νζ+1+1, where the constants implicit in the big-o may depend on κ, C, C, ν, and ζ, but nothing else. Remark 1. Observe that if g λ has qualification at least max{ζ + 1/, 1} and the other conditions of Corollary 1 are met, then ˆf λ obtains the minimax rate for estimating functions over H ζ [18]. Thus, if g λ has higher qualification, then ˆf λ can effectively adapt to a broader range of subspaces H ζ H = H 0. In particular, KPCR with infinite qualification can adapt to source conditions with arbitrary ζ 0; on the other hand, KRR satisfies the conditions of Corollary 1 only when ζ 1, because KRR has qualification 1. Remark. As mentioned earlier, a very similar result for polynomial decay eigenvalues was independently and simultaneously obtained by Blanchard and Mücke [3] for essentially the same class of regularization operators. Our Theorem 1, from which the corollary follows, applies to a broader class of kernels than results of Blanchard and Mücke. When the eigenvalues {t j } have exponential or Gaussian-type decay, the rates are nearly the same as in finite dimensions. Corollary. Assume that C, α 0 are constants such that 0 < t j Ce αj for all j = 1,,.... Assume that g λ has qualification at least 1 and that 9 holds for any 0 < δ 1. Finally, take λ = C n 1 logn for a suitable constant C > 0 so that the conditions on λ from Theorem 1 are satisfied. Then { R ρ ˆf λ = O f H + σ } logn n where the constants implicit in the big-o may depend on κ, C, C, α, δ, and κ δ, but nothing else., 6

7 Corollary 3. Assume that C, α 0 are constants such that 0 < t j Ce αj for all j = 1,,.... Assume that g λ has qualification at least 1 and that 9 holds for any 0 < δ 1. Finally, take λ = C n 1 logn for a suitable constant C > 0 so that the conditions on λ from Theorem 1 are satisfied. Then { R ρ ˆf λ = O f H + σ } logn, n where the constants implicit in the big-o may depend on κ, C, C, α, δ, and κ δ, but nothing else. Remark 3. In Corollaries 3, we get minimax estimation over H = H 0 [18]. However, our bounds are not refined enough to pick up any potential improvements which may be had if a stronger source condition is satisfied e.g., f H ζ for ζ > 0. This is typical in settings like this because the minimax rate is already quite fast, i.e., within a log-factor of the parametric rate n Parametric rates for finite-dimensional kernels and subspaces If the kernel has finite rank i.e., t j = 0 for j sufficiently large, then it follows directly from Theorem 1 that R ρ ˆf λ = O{ f H + σ }/n. If the kernel has infinite rank, but f is contained in the finite-dimensional subspace H for some <, then Theorem 1 can still be applied, provided g λ has high qualification. Indeed, if g λ has infinite qualification and f H, then it follows that f H ζ for all ζ 0 and Theorem 1 implies that for any 0 < α 1, R ρ ˆf λ = O{ f H + σ }/n 1 α for appropriately chosen λ. In fact, we can improve on this rate for KPCR. The next proposition implies that the risk of KPCR matches the parametric rate n 1 for f H ; the proof requires a different argument, based on eigenvalue perturbation theory, which we give in Appendix B. Proposition 1. Let ˆfKPCR,λ be the KPCR estimator, with principal component regularization 7, and assume that f H. Let 0 < r < 1 be a constant and let λ = 1 rt. If rt κ /n 1/ + κ /3n, then { } R ρ ˆf KPCR,λ 1 n 34κ 6 r t 4 f H + 3κ 1 rt σ + κ f H 15κ 4 nr t 4 n r t exp κ 4 + κ rt /3. Proposition 1 implies that KPCR may reach the parametric rate for estimating f H. On the other hand, it is known that KRR may perform dramatically worse than KPCR in these settings due to the saturation effect [see, e.g., 4, 8, 9]. 6 Numerical experiments 6.1 Simulated data This simulation study shows how KPCR is able to adapt to highly structured signals, while KRR requires more favorable structure from the ambient RKHS. For this experiment, we take X = {1,,..., 13 }. The data distribution ρ on Y X is specified as follows. The marginal distribution on X is ρ X x x 1/, the regression function f is given by f x = 5 1{x = j}, and ρ x is normal with mean f x and variance 1/4. To compute ˆf λ, we use the discrete kernel Kx, x = 1{x = x} Using an iid sample of size n = 13, we compute ˆf λ either KRR or KPCR for λ in a discrete grid of 10 values uniformly spaced between 10 5 and 0.0, and then choose the value of λ for which ˆf λ has smallest validation mean-squared error n 1 n i=1 yv i ˆf λ x v i, computed using a separate iid sample {x v i, yv i }n i=1 of size n = 13. Figure 1a shows the validation-mse of each ˆf λ ; the plot of ρ X -MSE f ˆf λ L ρ X has roughly the same shape, just shifted down by σ = 1/4 Figure 1b. The selected λ is ˆλ KRR = for KRR, and ˆλ KPCR = for KPCR. These choices of λ yield the final estimators, and ˆfKRR,ˆλKRR ˆf KPCR,ˆλKPCR ; 7

8 KRR KPCR KRR KPCR Validation-MSE ρx-mse ˆfx λ a KRR KPCR x c ρx-mse λ b KRR KPCR α d Figure 1: a Validation-MSE of ˆfKRR,λ and ˆf KPCR,λ for ρ X x x 1/ as λ varies; b ρ X -MSE of ˆf KRR,λ and ˆf KPCR,λ for ρ X x x 1/ as λ varies; c estimated functions ˆf KRR,ˆλKRR and ˆf KPCR,ˆλKPCR for ρ X x x 1/ ; d ρ X -MSE of ˆf KRR,ˆλKRR and ˆf KPCR,ˆλKPCR for ρ X x x α as α varies. 8

9 the ρ X -MSE is for KRR, and for KPCR. In Figure 1c, we plot the functions ˆf KRR,ˆλKRR and ˆf KPCR,ˆλKPCR ; the KRR function is non-zero for much of the domain, while the KPCR function is zero for nearly all of the domain like f. We repeat the above simulation for different marginal distributions ρ X x x α, for 1/ α, which imply different eigenvalue sequences {t j }. The mean and standard deviation of the ρ X -MSE s over 10 repetitions are shown in Figure 1b. This confirms KPCR s to adapt to the regularity of f, regardless of the ambient RKHS; KRR requires more structure to achieve similar results. 6. Real data We also compared KRR and KPCR using three weighted degree kernels designed for recognizing splice sites in genetic sequences [0]. 3 The 3300 samples are divided into a training set 1000, validation set 1100, and testing set 100. For each kernel, we use the training data to compute ˆf λ for λ in a discrete grid of 10 equally-spaced values between 10 5 and 0.4, and select the value of λ on which the MSE of ˆf λ on the validation set is smallest. The MSE on the testing set and the intrinsic dimension dˆλ for the selected ˆλ on the training data are as follows: Kernel 1 Kernel Kernel 3 MSE of ˆf KRR,ˆλKRR MSE of ˆf KPCR,ˆλKPCR dˆλkrr dˆλkpcr KRR outperforms KPCR with Kernel 1, where the intrinsic dimension of the kernel is low, while the reverse happens with Kernels and 3, where the intrinsic dimensions are high. This resonates with our theoretical results, which suggest that KRR requires low intrinsic dimension to perform most effectively. 7 Discussion Our unified analysis for a general class of regularization families in nonparametric regression highlights two important statistical properties. First, the results show minimax optimality for this general class in several commonly studied settings, which was only previously established for specific regularization methods. Second, the results demonstrate the adaptivity of certain regularization families to subspaces of the RKHS, showing that these techniques may take advantage of additional smoothness properties that the signal may possess. It is notable that the most well-studied family, KRR/Tikhonov regularization, does not possess this adaptability property. References [1] N. Aronszajn. Theory of reproducing kernels. T. A. Math. Soc., 68: , [] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory.. Complexity, 3:5 7, 007. [3] G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. ArXiv e-prints, April 016. [4] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math., 7: , 007. [5] A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl., 8: , We use the first three kernels for the data obtained from 9

10 [6] C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl., 4: , 006. [7] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM. Numer. Anal., 7:1 46, [8] P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk comparison of ordinary least squares vs ridge regression.. Mach. Learn. Res., 14: , 013. [9] L. H. Dicker. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, :1 37, 016. [10] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Mathematics and Its Applications, Vol Springer, [11] L. Gyorfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution Free Theory of Nonparametric Regression. Springer, 00. [1] D. Hsu, S. M. Kakade, and T. Zhang. Random design analysis of ridge regression. Found. Comput. Math., 14: , 014. [13] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat., 34: , 006. [14] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Comput., 7: , 008. [15] P. Mathé. Saturation of regularization methods for linear ill-posed problems in Hilbert spaces. SIAM. Numer. Anal., 4: , 005. [16] S. Minsker. On Some Extensions of Bernstein s Inequality for Self-adjoint Operators. ArXiv e-prints, 011. [17] A. Neubauer. On converse and saturation results for Tikhonov regularization of linear ill-posed problems. SIAM. Numer. Anal., 34:517 57, [18] M. S. Pinsker. Optimal filtration of square-integrable signals in Gaussian noise. Probl. Inf. Transm., 16:5 68, [19] L. Rosasco, E. De Vito, and A. Verri. Spectral methods for regularization in learning theory. Technical Report DISI-TR-05-18, Universitá degli Studi di Genova, Italy, 005. [0] S. Sonnenburg. Machine Learning for Genomic Sequence Analysis. PhD thesis, Fraunhofer Institute FIRST, December 008. Supervised by K.-R. Müller and G. Rätsch. [1] I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Conference on Learning Theory, 009. []. A. Tropp. An Introduction to Matrix Concentration Inequalities. ArXiv e-prints, 015. [3] L. Wasserman. All of Nonparametric Statistics. Springer, 006. [4] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Comput., 17: , 005. [5] Y. Zhang,. C. Duchi, and M.. Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory, pages ,

11 A Proof of Theorem 1 To provide some intuition behind ˆf λ and our proof strategy, define the positive self-adjoint operator T : H H by T φ = X φ, K x H K x dρ X x = t j φ, φ j H φ j, φ H. Observe that T is a population version of the operator T X. Unlike T X, T often has infinite rank; however, we still might expect that T T X for large n, where the approximation holds in some suitable sense. We also have a large-n approximation for SX y. For φ H, φ, SXy H = 1 n n y i φ, K xi H i=1 Y X yφxdρy, x = X f xφx dρ X x = φ, f L ρ X = φ, T f H, where, L ρ X denotes the inner-product on L ρ X and we have used the identity φ j = t j ψ j to obtain the last equality. It follows that SX y T f. Hence, to recover f from y, it would be natural to apply the inverse of T to SX y. However, T is not invertible whenever it has infinite rank, and regularization becomes necessary. We thus arrive at the chain of approximations which help motivate ˆf λ : ˆf λ = g λ T X S Xy g λ T T f f, where g λ T may be viewed as an approximate inverse for a suitably chosen regularization parameter λ. A.1 Bias-variance decomposition The proof of Theorem 1 is based on a simple bias-variance decomposition of the risk of ˆfλ. Let ɛ = y 1 f x 1,..., y n f x n R n. Proposition. The risk R ρ ˆf λ has the decomposition R ρ ˆf λ = B ρ ˆf λ + V ρ ˆf λ, 11 where B ρ ˆf λ and V ρ ˆf λ are defined as [ T B ρ ˆf ] [ λ = E 1/ {I g λ T X T X }f T, V ρ ˆf λ = E 1/ g λ T X SXɛ H H ]. Our proof separately bounds the bias B ρ ˆf λ and variance V ρ ˆf λ terms from Proposition. together, these bounds imply a bound on R ρ ˆf λ. Taken A. Translation to vector and matrix notation We first note that the Hilbert space H is isometric to l N via the isometric isomorphism ι: H l N, given by ι: α j φ j α 1, α,

12 we take all elements of l N to be infinite-dimensional column vectors. Using this equivalence, we can convert elements of H and linear operators on H appearing in Proposition into infinite-dimensional vectors and matrices, respectively, which we find simpler to analyze in the sequel. Define the infinite-dimensional diagonal matrix and the vector T = diagt 1, t,..., β = β 1, β,... l N, where β i = f, φ i H. Next, define the n random matrices Observe that Ψ = ψ j x i 1 i n;1 j<, Φ = ΨT = φ j x i 1 i n;1 j<. S X = Φ ι, 1 SX = ι 1, n Φ T = ι 1 T ι. Also, for 1 i n, let φ i φ i be the matrix whose j, j -th entry is φ j x i φ j x i, and define Σ = 1 n φ n i φ i = 1 Φ. n Φ i=1 Finally, let I = diag1, 1,..., and let = l N denote the norm on l N. In these matrix and vector notations, the bias-variance decomposition from Proposition translates to the following: T B ρ ˆf λ = E[ 1/ {I g λ ΣΣ}β ], V ρ ˆf λ = The boundedness of the kernel implies where the norms are the operator norms in l N. A.3 Probabilistic bounds For 0 < r < 1, define the event A r = 1 [ T n E 1/ g λ ΣΦ ɛ ]. trt κ, φ i φ i κ, Σ κ, { T + λi 1/ Σ TT + λi 1/ r }, 13 and let A c r denote its complement. Our bounds on bias B ρ ˆf λ and variance V ρ ˆf λ are based on analysis in the event A c r for a constant r. Therefore, we also need to show that A c r has large probability equivalently, show that A r has small probability. 1

13 Lemma 1. Assume that sup x X t 1 δ j ψ j x κ δ < 14 for some 0 δ < 1, κ δ > 0. Further assume that λ1 δ κ δ. If r κ δ /λ1 δ n + κ δ /3λ1 δ n, then P A r 4d λ exp λ1 δ nr κ. δ 1 + r/3 Proof. The proof is an application of Lemma 5. Define, for 1 i n, as well as Y = n i=1 X i. We have It is clear that E[X i ] = 0. Observe that t j t j + λψ jx i X i = 1 n T + λi 1/ {φ i φ i T}T + λi 1/, Y = T + λi 1/ Σ TT + λi 1/. max j 1 t δ j t j + λ t 1 δ j ψ j x i κ δ max j 1 t δ j t j + λ κ δ, 15 λ1 δ where the last inequality uses the inequality of arithmetic and geometric means. Therefore, by the assumption λ 1 δ κ δ, X i 1 } { T n max + λi 1/ φ i φ i T + λi 1/, T + λi 1/ TT + λi 1/ Moreover, = 1 n max t j t j + λψ jx i, max j 1 t j t j + λ κ δ λ 1 δ n. E[X i ] = 1 [ n E T + λi 1/ φ i φ i T + λi 1 φ i φ i T + λi 1/ T + λi T ] = 1 n E t j t j + λψ jx i T + λi 1/ φ i φ i T + λi 1/ T + λi T. Combining this with 15 gives E[Y ] κ δ λ 1 δ n [T E + λi 1/ φ 1 φ 1 T + λi 1/] and tr E[Y ] = κ δ λ 1 δ n T + λi 1 T κ δ λ 1 δ n κ δ λ 1 δ n tr E [T + λi 1/ φ 1 φ 1 T + λi 1/] = κ δ λ 1 δ n tr T + λi 1 T = d λκ δ λ 1 δ n. Applying Lemma 5 with V = R = κ δ /λ1 δ n and D = d λ, gives P Y r 4d λ exp λ1 δ nr κ. δ 1 + r/3 13

14 Lemma. E [ Σ T ] 34κ4 n + 15κ4 n. Proof. The proof is an application of Lemma 6. Define, for 1 i n, X i = 1 n φ iφ i T, and also define Y = n i=1 X i, so Y = Σ T. Clearly E[X i ] = 0 and X i κ /n. Additionally, since E[Y ] = n 1 E[ φ 1 φ 1 φ 1 T ], we have E[Y ] κ 4 /n and tre[y ] κ 4 /n. The claim thus follows by applying Lemma 6 with V = κ 4 /n, D = 1, and R = κ /n. A.4 Bias bound Lemma 3. Assume that β = T ζ/ α for some ζ 0 and that g λ has qualification at least max{ζ + 1/, 1}. For any 0 < r 1/, B ρ ˆf λ ζ+ 1 r α λ ζ+1 + t 1 β P A r + 1{ζ > 1} 8ζ 1 + r ζ 1 1 r [ α T + λi ζ 1 E Σ T ]. Proof. Define h λ Σ = I Σg λ Σ. Since 1 sg λ s 1 for 0 < s κ and Σ κ, we have h λ Σ 1. Moreover, using β = T ζ/ α, T B ρ ˆf λ = E[ 1/ h λ Σβ ] [ T ] E 1/ h λ ΣT ζ/ 1{A c d} α + T 1/ h λ Σ β P A d [ T ] E 1/ h λ ΣT ζ/ 1{A c d} α + t 1 β P A d. The rest of the proof involves bounding E[ T 1/ h λ ΣT ζ/ 1{A c r}]. We separately consider two cases: i ζ 1, and ii ζ > 1. Case 1: ζ 1. Since g λ has qualification at least 1, it follows that s + λ1 sg λ s λ for 0 < s κ. This implies T 1/ h λ ΣT ζ/ T 1/ Σ + λi 1/ Σ + λihλ Σ T ζ/ Σ + λi 1/ 4λ T 1/ Σ + λi 1 T 1/ T ζ/ Σ + λi 1 T ζ/. 16 For 0 z 1, T z/ Σ + λi 1 T z/ T = z/ { Σ T + T + λi } 1 T z/ T z/ T + λi 1/ T + λi 1/{ Σ T + T + λi } 1 T + λi 1/ = T z T + λi 1 I T + λi 1/ T ΣT + λi 1/ 1 λ z 1 I T + λi 1/ T ΣT + λi 1/ 1, 17 where the final inequality uses the fact s z /s + λ λ z 1 for 0 z 1 and s 0. The final quantity in 17 is bounded above by λ z 1 /1 r on the event A c r, so applying this inequality with z = 1 and z = ζ to 16 gives T 1/ h λ ΣT ζ/ 1{A c r } 4λ 1 1 r λζ 1 1 r = 4λζ+1 1 r. 14

15 So the bias in this case is bounded as B ρ ˆf λ α λ ζ+1 + t 1 β P A r. 1 r Case : ζ > 1. We have T 1/ h λ ΣT ζ/ T 1/ Σ + λi 1/ Σ + λi 1/ h λ ΣT ζ/ = T 1/ Σ + λi 1 T 1/ Σ + λi 1/ h λ ΣT ζ/ T 1/ Σ + λi 1 T 1/ Σ + λi 1/ h λ ΣT + λi ζ/. 18 The first factor on the right-hand side of 18 can be bounded using 17 on the event A c r. For the second factor, we have Σ + λi 1/ h λ ΣT + λi ζ/ Σ + λi ζ+1/ { h λ Σ + Σ + λi1/ h λ Σ T + λi ζ/ Σ + λi ζ/} λ ζ+1/ + λ 1/ T + λi ζ/ Σ + λi ζ/. 19 Above, the first inequality is due to the triangle inequality, and the second inequality uses the facts that g λ has qualification at least ζ + 1/, and that s + λ z+1/ 1+z+1/ s z+1/ + λ z+1/ for s 0 and z 1. We now bound T + λi ζ/ Σ + λi ζ/ in terms of T Σ. First, observe that on the event A c r, and, consequently, T Σ T + λi T + λi 1/ T ΣT + λi 1/ < r T + λi, For a small constant s > 0, define Σ + λi 1 + r T + λi. A s = B s = 1 T + λi, 1 + r + s T + λi 1 Σ + λi. 1 + r + s T + λi Then, on the event A c r, applying Lemma 7 and Lemma 8 gives A ζ/ s B ζ/ s ζ A 1/ s B 1/ s { } 1/ λ ζ A s B s 1 + r + s T + λi = ζ {λ 1 + r + s T + λi } 1/ T Σ. Taking s 0, it follows that on A c r, T + λi ζ/ Σ + λi ζ/ ζ λ 1/ {1 + r T + λi } ζ 1/ T Σ. 0 Combining 19 and 0 gives Σ + λi 1/ h λ ΣT + λi ζ/ λ ζ+1/ + ζ {1 + r T + λi } ζ 1/ T Σ. 15

16 Using this together with 17 in 18 and T 1/ h λ ΣT ζ/ 1{A c r } Therefore B ρ ˆf λ α 1 r A.5 Variance bound Lemma 4. For any 0 < r < 1, 1 1 r λ ζ+1/ + ζ {1 + r T + λi } ζ 1/ T Σ. λ ζ+1 + 8ζ {1 + r T + λi } ζ 1 E [ T Σ ] + t 1 β P A r. V ρ ˆf λ 1 r dλσ n + κ σ λn P A r. Proof. The assumption on ɛ = ɛ 1,..., ɛ n implies E[ɛ i ] = 0, E[ɛ i ɛ j ] = 0 for i j, and E[ɛ i ] σ. So, by Von Neumann s inequality, V ρ ˆf λ = 1 [ T n E 1/ g λ ΣΦ ɛ ] [ σ n E tr Tg λ Σ Σ ]. Using Von Neumann s inequality together with trt κ and g λ s s 1/λ for 0 < s κ, tr Tg λ Σ Σ trt g λ Σ Σ κ λ. Therefore [ V ρ ˆf ] λ σ n E tr Tg λ Σ Σ 1{A c r} + κ σ λn P A r. Using Von Neumann s inequality twice more, and s + λg λ s s for 0 < s κ, we have tr Tg λ Σ Σ = tr TΣ + λi 1 Σ + λig λ Σ Σ tr TΣ + λi 1 Σ + λigλ Σ Σ tr TΣ + λi 1 = tr TT + λi 1 T + λi 1/ Σ + λi 1 T + λi 1/ tr TT + λi 1 T + λi 1/ Σ + λi 1 T + λi 1/ = d λ I T + λi 1/ T Σ 1 T + λi 1/ 1. This final quantity is bounded above by d λ /1 r on the event A c d. A.6 Finishing the proof of Theorem 1 Using the bias-variance decomposition from Proposition, we apply the bias bound Lemma 3 and variance bound Lemma 4 to obtain a bound on the risk: R ρ ˆf λ = B ρ ˆf λ + V ρ ˆf λ ζ+ 1 r α λ ζ+1 + t 1 β P A r + 1{ζ > 1} 8ζ 1 + r ζ 1 1 r + 1 r dλσ n + κ σ λn P A r. α t 1 + λ ζ 1 E [ Σ T ] 16

17 Now set r = 1/, and apply Lemma 1 to bound P A r. Note that the assumption 8/3+ 5/3κ δ /n λ1 δ satisfies the conditions on r in Lemma 1 for r = 1/. Finally, apply Lemma to bound E[ Σ T ]. B Proof of Proposition 1 Let ˆt 1 ˆt 0 denote the eigenvalues of Σ and define T = diagˆt 1, ˆt,.... Let Û be an orthogonal transformation satisfying Σ = Û TÛ. Additionally, let h λ t = I{t λ}, define Ĵ = Ĵλ = inf{j : ˆt j > λ}, and write Û = Û Û+, where Û is the Ĵ matrix comprised of the first Ĵ columns of Û, and Û+ consists of the remaining columns of Û. Finally, define T = diagt 1,..., t, T + = diagˆt, ˆt,..., Ĵ+1 Ĵ+ and define the matrix U = I 0, where I is the identity matrix. Now consider the bias term B ρ ˆf KPCR,λ = E[ T 1/ h λ Σβ ], and observe that T 1/ h λ Σβ = T 1/ Û + Û +U U β κ Û +U f H. Thus, B ρ ˆf KPCR,λ κ f H E Û +U 1{ Ĵ } + κ f H P Ĵ <. 1 Now we bound +U Û on the event {Ĵ }. We derive this bound from basic principles, but it is essentially the Davis-Kahan inequality [7]. Let D = Σ T. Then DU = ΣU TU = ΣU U T and Next notice that U DÛ+ = U ΣÛ+ T U Û+ = U Û+ T + T U Û+. Thus, D U DÛ+ T U Û+ U Û+ T + t U Û+ 1 rt U Û+ = rt U Û+. U Û+ 1 rt D on the event {Ĵ > }. Next we bound P Ĵ < = P ˆt λ = P {ˆt 1 rt }. By Weyl s inequality, ˆt t D = Σ T, and, by Lemma 5, P Σ T rt nr t 4 4 exp κ 4 + κ rt /3, provided rt κ /n 1/ + κ /3n. Thus, P Ĵ < = P {ˆt 1 rt } nr t 4 4 exp κ 4 + κ rt /3. 3 Combining 1 3 and using Lemma gives B ρ ˆf [ KPCR,λ κ f HE Σ T ] + 4κ f H exp r t 4 κ r t 4 f 34κ 4 H n + 15κ4 n + 4κ f H exp nr t 4 κ 4 + κ rt /3 nr t 4 κ 4 + κ rt /3. 17

18 Next, we combine this bound on B ρ ˆf KPCR,λ with the variance bound Lemma 4 taking r = 0 in the lemma to obtain R ρ ˆf KPCR,λ = B ρ ˆf λ + V ρ ˆf λ κ r t 4 f 34κ 4 H n + 15κ4 n + 4κ f nr t 4 H exp κ 4 + κ rt /3 + d λσ + κ σ n λn κ r t 4 f 34κ 4 H n + 15κ4 n + 4κ f nr t 4 H exp κ 4 + κ rt /3 + 3κ σ 1 rt n { } = 1 34κ 6 n r t 4 f H + 3κ 1 rt σ + κ f 15κ 4 nr t 4 H r t exp n κ 4 + κ rt /3. This completes the proof of the proposition. C Supporting results C.1 Sums of random operators Lemma 5. Let R > 0 be a positive real constant and consider a finite sequence of self-adjoint Hilbert-Schmidt operators {X i } n i=1 satisfying E[X i] = 0 and X i R almost surely. Define Y = n i=1 X i and suppose there are constants V, D > 0 satisfying E[Y ] V and tre[y ] V D. For all t V 1/ + R/3, P Y t 4D exp t V + Rt/3 Proof. This is a straightforward generalization of [, Theorem 7.7.1], using the arguments from [16, Section 4] to extend from self-adjoint matrices to self-adjoint Hilbert-Schmidt operators. Lemma 6. In the same setting as Lemma 5, [ E Y ] + 3DV D Proof. The proof is based on integrating the tail bound from Lemma 5: [ E Y ] = = = 0 V 1/ + R 3 + V 1/ + R 3 + P Y t 1/ dt 4D exp V 1/ + R 3 9V /R 0 4D exp V 1/ + R { + 16V D 1 exp 3 t 4V V 1/ + R { + 16V D 1 + exp 9V D + 3DV + R R. t dt V + Rt 1/ /3 dt + 9V /R 4D exp 9V } 4R + 18R D 9 } 4R + 18R D exp 9 { 9V 4R + 1 9V 4R 3t1/ dt 4R } exp 9V 4R 18

19 C. Differences of powers of bounded operators Lemma 7. Let A and B be non-negative self-adjoint operators with A < 1 and B < 1. For any γ 1, A γ B γ γ A B. Proof. The proof considers three possible cases for the value of γ: i γ is an integer, ii 1 < γ <, and iii γ is a non-integer larger than two. Case 1: γ is an integer. In this case, we have the following identity: γ A γ B γ = A j 1 A BB γ j. So, by the triangle inequality, A γ B γ A B γ A j 1 B γ j γ A B. Case : 1 < γ <. Pick 0 < r < 1 such that A < r and B < r, and fix 0 < t < 1 r/. Define A t = A + ti and B t = B + ti, and γ γγ 1 γ k + 1 =, k = 1,,.... k k! Then, using the power series 1 + s γ = 1 + γ k=1 k s k for 1 < s < 1, A γ t B γ γ k γ t = {A t I k B t I k } = A j 1 A BB k j. k k k=1 Convergence is assured because A t I 1 t and B t I 1 t. Moreover, A γ t B γ γ k t A B A k t I j 1 B t I k j k=1 γ A B k1 tk 1 = γ A B k 1 + γ 1 1 tk k k=1 k=1 { = γ A B 1 1 t γ 1} γ A B. Above, we have used the power series 1 1 s γ 1 = k=1 γ 1 k s k at s = 1 t. Taking t 0 yields k=1 A γ B γ γ A B. Case 3: γ is a non-integer larger than two. We can write γ = k + q for an integer k and real number 0 < q < 1. Applying the results from the previous two cases gives A γ B γ = A k γ/k B k γ/k γ k Ak B k γ A B. Lemma 8. Pick any real numbers r, γ 0, 1. Let A and B be non-negative self-adjoint operators, each with spectrum contained in [r, 1. Then A γ B γ γr γ 1 A B. Proof. Since A I 1 r and B I 1 r, the proof is similar to that of Lemma 7: A γ B γ γ A B 1 + γ 1 1 rk k = γrγ 1 A B. k=1 Above, we have used the power series 1 s γ 1 = 1 + case in Lemma 7 because 0 < γ < 1. k=1 γ 1 k s k at s = 1 r which differs from the 19

Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators

Electronic Journal of Statistics Vol. 11 017 10 1047 ISSN: 1935-754 DOI: 10.114/17-EJS158 Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators