arxiv: v1 [math.st] 28 May 2016


Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators

Lee H. Dicker, Dean P. Foster, Daniel Hsu

May 31, 2016

Abstract

Regularization is an essential element of virtually all kernel methods for nonparametric regression problems. A critical factor in the effectiveness of a given kernel method is the type of regularization that is employed. This article compares and contrasts members from a general class of regularization techniques, which notably includes ridge regression and principal component regression. We derive an explicit finite-sample risk bound for regularization-based estimators that simultaneously accounts for (i) the structure of the ambient function space, (ii) the regularity of the true regression function, and (iii) the adaptability (or qualification) of the regularization. A simple consequence of this upper bound is that the risk of the regularization-based estimators matches the minimax rate in a variety of settings. The general bound also illustrates how some regularization techniques are more adaptable than others to favorable regularity properties that the true regression function may possess. This, in particular, demonstrates a striking difference between kernel ridge regression and kernel principal component regression. Our theoretical results are supported by numerical experiments.

1 Introduction

Suppose that the observed data consists of $z_i = (y_i, x_i)$, $i = 1, \ldots, n$, where $y_i \in \mathcal{Y} \subseteq \mathbb{R}$ and $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$. Suppose further that $z_1, \ldots, z_n \sim \rho$ are iid from some probability distribution $\rho$ on $\mathcal{Y} \times \mathcal{X}$. Let $\rho(\cdot \mid x)$ denote the conditional distribution of $y_i$ given $x_i = x \in \mathcal{X}$ and let $\rho_X$ denote the marginal distribution of $x_i$. Our goal is to use the available data to estimate the regression function of $y$ on $x$,

$$f^\dagger(x) = \int_{\mathcal{Y}} y \, d\rho(y \mid x),$$

which minimizes the mean-squared prediction error

$$\int_{\mathcal{Y} \times \mathcal{X}} \{y - f(x)\}^2 \, d\rho(y, x)$$

over $\rho_X$-measurable functions $f : \mathcal{X} \to \mathbb{R}$.
More specifically, for an estimator $\hat f$ define the risk

$$R_\rho(\hat f) = E\left[\int_{\mathcal{X}} \{\hat f(x) - f^\dagger(x)\}^2 \, d\rho_X(x)\right] = E\left[\|f^\dagger - \hat f\|_{\rho_X}^2\right], \tag{1}$$

where the expectation is computed over $z_1, \ldots, z_n$, and $\|\cdot\|_{\rho_X}$ denotes the norm on $L^2(\rho_X)$; we seek estimators $\hat f$ which minimize $R_\rho(\hat f)$. This is a version of the random design nonparametric regression problem. There is a vast literature on nonparametric regression, along with a huge variety of corresponding methods [e.g., 11, 23]. In this paper, we focus on regularization and kernel methods for estimating $f^\dagger$.¹ Most of our results apply to general regularization operators. However, our motivating examples are two well-known regularization techniques: kernel ridge regression (which we refer to as KRR; KRR is also known as Tikhonov regularization) and kernel principal component regression (KPCR; also known as spectral cut-off regularization).

Our main theorem is a new upper bound on the risk of a general class of kernel-regularization methods, which includes both KRR and KPCR (Theorem 1). The theorem substantially generalizes previously published bounds (see Section 2 for a discussion of related work) and illustrates the dependence of the risk on three important features: (i) the structure of the ambient reproducing kernel Hilbert space (RKHS), (ii) the specific regularization technique employed, and (iii) the regularity (often interpreted as smoothness) of the function to be estimated. One consequence of the theorem is that the regularization methods studied in this paper (including KRR and KPCR) achieve the minimax rate for estimating $f^\dagger$ in a variety of settings. A second consequence is that certain regularization methods (including KPCR, but not KRR) may adapt to favorable regularity of $f^\dagger$ to attain even faster convergence rates, while others (notably KRR) are limited in this regard due to a well-known saturation effect [2, 15, 17]. This illustrates a striking advantage that KPCR may have over KRR in these settings.

Author affiliations: Lee H. Dicker, Rutgers University, ldicker@stat.rutgers.edu; Dean P. Foster, Amazon NYC, dean@foster.net; Daniel Hsu, Columbia University, djhsu@cs.columbia.edu.

2 Related work

Kernel ridge regression has been studied extensively in the literature. Indeed, bounds for KRR that are similar to our Theorem 1 have been derived by Caponnetto and De Vito [4]. Moreover, it is well-known that KRR is minimax in many of the settings considered in this paper, such as those described in Corollaries 1–3 [4, 24, 25]. However, these cited results apply only to KRR, while the results in this paper apply to a substantially larger class of regularization operators (including, for example, KPCR).
Beyond KRR, there has also been significant research into more general regularization methods, like those considered in this paper. However, our bounds are sharper than previously published results on general regularization operators. For instance, unlike our Theorem 1, the bounds of Bauer et al. [2] do not illustrate the dependence of the risk on the ambient Hilbert space. Thus, while our approach immediately implies that many of the regularization methods under consideration are minimax optimal, it seems difficult (if not impossible) to draw this conclusion using the approach of Bauer et al. General regularization operators are studied by Caponnetto and Yao [5], but their results require a semi-supervised setting where an additional pool of unlabeled data is available.

One of the major practical implications of this paper is that KPCR may have significant advantages over KRR in some settings. This has been observed previously by other researchers; others have even noted that Tikhonov regularization (KRR) saturates, while spectral cut-off regularization (KPCR) does not [2, 14, 15]. Our results (Theorem 1 and Corollaries 1–3) sharpen these observations by precisely quantifying the advantages of unsaturated regularization operators in terms of adaptability and minimaxity. In other related work, Dhillon et al. [8] have illustrated the potential advantages of KPCR over KRR in finite-dimensional problems with linear kernels; though their work is not framed in terms of saturation and general regularization operators, it relies on similar concepts.

We recently became aware of simultaneous independent work of Blanchard and Mücke [3] that proves upper bounds on the risk for a comparable class of regularization methods (which includes KRR and KPCR), as well as minimax lower bounds that match the upper bounds; their results are specialized to RKHSs where the covariance operator has polynomially decaying eigenvalues.
Compared to that work, our results require a weaker moment condition on the noise for risk bounds, apply to a much broader class of RKHSs, and also consider target functions that live in finite-dimensional subspaces (see Proposition 1). Another distinguishing feature of the present work is that it contains numerical experiments that illustrate the implications of our theoretical bounds in some practical settings (Section 6).

The main engine behind the technical results in this paper is a collection of large-deviation results for Hilbert-Schmidt operators. The required machinery is developed in the appendix. These results build on straightforward extensions of results of Tropp [22] and Minsker [16]. Our most precise results for KPCR and achieving parametric rates for estimation over finite-dimensional subspaces (Proposition 1) rely on slightly different arguments, which are based on well-known eigenvalue perturbation results that have been adapted to handle Hilbert-Schmidt operators (e.g., the Davis-Kahan $\sin\Theta$ theorem [7]).

¹ Here, we mean kernel as in reproducing kernel Hilbert space, rather than kernel-smoothing, which is another popular approach to nonparametric regression.

3 Statistical setting and assumptions

Our basic assumption on the distribution of $z = (y, x) \sim \rho$ is that the residual variance is bounded; more specifically, we assume that there exists a constant $\sigma^2 > 0$ such that

$$\int_{\mathcal{Y}} \{y - f^\dagger(x)\}^2 \, d\rho(y \mid x) \le \sigma^2 \tag{2}$$

for almost all $x \in \mathcal{X}$. Zhang et al. [25] also assume (2); this assumption is slightly weaker than the analogous assumption of Bauer et al. [2] (Equation 1 in their paper). Note that (2) holds if $y$ is bounded almost surely or if $y = f^\dagger(x) + \epsilon$, where $\epsilon$ is independent of $x$, $E(\epsilon) = 0$, and $\mathrm{Var}(\epsilon) = \sigma^2$.

Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric positive-definite kernel function. We assume that $K$ is bounded, i.e., that there exists $\kappa^2 > 0$ such that $\sup_{x \in \mathcal{X}} K(x, x) \le \kappa^2$. Additionally, we assume that there is a countable basis of eigenfunctions $\{\psi_j\}_{j=1}^\infty \subseteq L^2(\rho_X)$ and a sequence of corresponding eigenvalues $t_1^2 \ge t_2^2 \ge \cdots \ge 0$ such that

$$K(x, \tilde x) = \sum_{j=1}^\infty t_j^2 \psi_j(x) \psi_j(\tilde x), \quad x, \tilde x \in \mathcal{X}, \tag{3}$$

and the convergence is absolute. Mercer's theorem and various generalizations give conditions under which representations like (3) are known to hold [6]; one of the simplest examples is when $\mathcal{X}$ is a compact Hausdorff space, $\rho_X$ is a probability measure on the Borel sets of $\mathcal{X}$, and $K$ is continuous. Observe that

$$\sum_{j=1}^\infty t_j^2 = \sum_{j=1}^\infty t_j^2 \int_{\mathcal{X}} \psi_j(x)^2 \, d\rho_X(x) = \int_{\mathcal{X}} K(x, x) \, d\rho_X(x) \le \kappa^2;$$

in particular, $\{t_j^2\} \in \ell^1(\mathbb{N})$.

Let $\mathcal{H} \subseteq L^2(\rho_X)$ be the RKHS corresponding to $K$ [1] and let $\phi_j = t_j \psi_j$, $j = 1, 2, \ldots$. It follows from basic facts about RKHSs that $\{\phi_j\}_{j=1}^\infty$ is an orthonormal basis for $\mathcal{H}$ (if $t_J^2 > t_{J+1}^2 = 0$, then $\{\phi_j\}_{j=1}^J$ is an orthonormal basis for $\mathcal{H}$). Furthermore, $\mathcal{H}$ is characterized by

$$\mathcal{H} = \left\{ f = \sum_{j=1}^\infty \theta_j \psi_j \in L^2(\rho_X) \; ; \; \sum_{j=1}^\infty \frac{\theta_j^2}{t_j^2} < \infty \right\}$$

and the inner product

$$\langle f, \tilde f \rangle_{\mathcal{H}} = \sum_{j=1}^\infty \frac{\theta_j \tilde\theta_j}{t_j^2}, \quad f = \sum_{j=1}^\infty \theta_j \psi_j, \; \tilde f = \sum_{j=1}^\infty \tilde\theta_j \psi_j \in \mathcal{H}$$

(the corresponding norm is denoted by $\|\cdot\|_{\mathcal{H}}$). Our main assumption on the relationship between $y$, $x$, and the kernel $K$ is that

$$f^\dagger \in \mathcal{H}. \tag{4}$$

This is a regularity or smoothness assumption on $f^\dagger$. Many of the results in this paper can be modified, so that they apply to settings where $f^\dagger \notin \mathcal{H}$, by replacing $f^\dagger$ with an appropriate projection of $f^\dagger$ onto $\mathcal{H}$ and including an approximation error term in the corresponding bounds. This approach leads to the study of oracle inequalities [12, 13, 21, 24, 25], which we do not pursue in detail here. However, investigating oracle inequalities for general regularization operators may be of interest for future research, as most existing work focuses on ridge regularization.

Another interpretation of condition (4) is that it is a minimal regularity condition on $f^\dagger$ for ensuring that $f^\dagger$ can be estimated consistently using the kernel methods considered below. One key aspect of the upper bounds in Section 5 is that they show $f^\dagger$ can be estimated more efficiently if it satisfies stronger regularity conditions and if the regularization method used is sufficiently adaptable. A convenient way to formulate a collection of regularity conditions with varying strengths, which will be useful in the sequel, is as follows. For $\zeta \ge 0$, define the Hilbert space

$$\mathcal{H}_\zeta = \left\{ f = \sum_{j=1}^\infty \theta_j \psi_j \in L^2(\rho_X) \; ; \; \sum_{j=1}^\infty \frac{\theta_j^2}{t_j^{2(1+\zeta)}} < \infty \right\}. \tag{5}$$

Then $\mathcal{H} = \mathcal{H}_0$ and $\mathcal{H}_{\zeta_2} \subseteq \mathcal{H}_{\zeta_1}$ whenever $\zeta_2 \ge \zeta_1 \ge 0$. The norm on $\mathcal{H}_\zeta$ is defined by $\|f\|_{\mathcal{H}_\zeta}^2 = \sum_{j=1}^\infty \theta_j^2 / t_j^{2(1+\zeta)}$. Additionally, for positive integers $J$, define the finite-rank subspace

$$\mathcal{H}_J^\circ = \left\{ f = \sum_{j=1}^J \theta_j \psi_j \in L^2(\rho_X) \; ; \; \theta_1, \ldots, \theta_J \in \mathbb{R} \right\}. \tag{6}$$

We have the inclusion $\mathcal{H}_J^\circ \subseteq \mathcal{H}_{J+1}^\circ$ and, if $t_1^2, \ldots, t_J^2 > 0$, then $\mathcal{H}_J^\circ \subseteq \mathcal{H}_\zeta$ for any $\zeta \ge 0$. In particular, it is clear from (5)–(6) that $f^\dagger \in \mathcal{H}_\zeta$ is a stronger regularity condition than $f^\dagger \in \mathcal{H}$ and that $f^\dagger \in \mathcal{H}_J^\circ$ is an even stronger condition (provided the $t_j^2$ are strictly positive). Conditions such as $f^\dagger \in \mathcal{H}_\zeta$ and $f^\dagger \in \mathcal{H}_J^\circ$ are known as source conditions elsewhere in the literature [e.g., 2, 5].

4 Regularization

As discussed in Section 1, our goal is to find estimators $\hat f$ that minimize the risk (1). In this paper, we focus on regularization-based estimators for $f^\dagger$. In order to precisely describe these estimators, we require some additional notation for various operators that will be of interest, and some basic definitions from regularization theory.

4.1 Finite-rank operators of interest

For $x \in \mathcal{X}$, define $K_x \in \mathcal{H}$ by $K_x(\tilde x) = K(x, \tilde x)$, $\tilde x \in \mathcal{X}$. Let $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times d}$ and $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$. Additionally, define the finite-rank linear operators $S_X : \mathcal{H} \to \mathbb{R}^n$ and $T_X : \mathcal{H} \to \mathcal{H}$ (both depending on $X$) by

$$S_X \phi = (\langle \phi, K_{x_1} \rangle_{\mathcal{H}}, \ldots, \langle \phi, K_{x_n} \rangle_{\mathcal{H}})^\top = (\phi(x_1), \ldots, \phi(x_n))^\top,$$

$$T_X \phi = \frac{1}{n} \sum_{i=1}^n \langle \phi, K_{x_i} \rangle_{\mathcal{H}} K_{x_i} = \frac{1}{n} \sum_{i=1}^n \phi(x_i) K_{x_i},$$

where $\phi \in \mathcal{H}$. Let $\langle \cdot, \cdot \rangle_{\mathbb{R}^n}$ denote the normalized inner-product on $\mathbb{R}^n$, defined by $\langle v, \tilde v \rangle_{\mathbb{R}^n} = n^{-1} v^\top \tilde v$ for $v = (v_1, \ldots, v_n)^\top$, $\tilde v = (\tilde v_1, \ldots, \tilde v_n)^\top \in \mathbb{R}^n$. Then the adjoint of $S_X$ with respect to $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and $\langle \cdot, \cdot \rangle_{\mathbb{R}^n}$, $S_X^* : \mathbb{R}^n \to \mathcal{H}$, is given by $S_X^* v = \frac{1}{n} \sum_{i=1}^n v_i K_{x_i}$. Additionally, we have $T_X = S_X^* S_X$. Finally, observe that $S_X S_X^* : \mathbb{R}^n \to \mathbb{R}^n$ is given by the $n \times n$ matrix $S_X S_X^* = n^{-1} \mathbf{K}$, where $\mathbf{K} = (K(x_i, x_j))_{1 \le i, j \le n}$; $\mathbf{K}$ is the kernel matrix, which is ubiquitous in kernel methods and enables finite computation.

4.2
Basic definitions

A family of functions $g_\lambda : [0, \infty) \to [0, \infty)$ indexed by $\lambda > 0$ is called a regularization family if it satisfies the following three conditions:

(R1) $\sup_{0 < t \le \kappa^2} t g_\lambda(t) < 1$.

(R2) $\sup_{0 < t \le \kappa^2} |1 - t g_\lambda(t)| \le 1$.

(R3) $\sup_{0 < t \le \kappa^2} g_\lambda(t) < \lambda^{-1}$.

This definition follows Engl et al. [10] and Bauer et al. [2], but is slightly more restrictive.

The main idea behind a regularization family is that it looks similar to $t \mapsto 1/t$, but is better-behaved near $t = 0$, i.e., it is bounded by $\lambda^{-1}$. An important quantity that is related to the adaptability of a regularization family is the qualification of the regularization. The qualification of the regularization family $\{g_\lambda\}_{\lambda > 0}$ is defined to be the maximal $\xi \ge 0$ such that

$$\sup_{0 < t \le \kappa^2} |1 - t g_\lambda(t)| \, t^\xi \le \lambda^\xi.$$

If a regularization family has qualification $\xi$, we say that it saturates at $\xi$. Two regularization families that are the major motivation for the results in this paper are ridge (Tikhonov) regularization, where $g_\lambda(t) = \frac{1}{t + \lambda}$, and principal component (spectral cut-off) regularization, where

$$g_\lambda(t) = s_\lambda(t) = \frac{1}{t} \mathbb{1}\{t \ge \lambda\}. \tag{7}$$

Observe that ridge regularization has qualification 1 and principal component regularization has qualification $\infty$. Another example of a regularization family with qualification $\infty$ is the Landweber iteration, which can be viewed as a special case of gradient descent [see, e.g., 2, 14, 19].

4.3 Estimators

Given a regularization family $\{g_\lambda\}_{\lambda > 0}$, we define the $g_\lambda$-regularized estimator for $f^\dagger$,

$$\hat f_\lambda = g_\lambda(T_X) S_X^* y. \tag{8}$$

Here, $g_\lambda$ acts on the spectrum (eigenvalues) of the finite-rank operator $T_X$ (which is the same as the spectrum of the kernel matrix $\mathbf{K}$, up to scaling). Therefore, a finitely-computable representation is $\hat f_\lambda = \sum_{i=1}^n \hat\gamma_i K_{x_i}$, where $(\hat\gamma_1, \ldots, \hat\gamma_n)^\top = g_\lambda(\mathbf{K}/n) y / n$; computing the $\hat\gamma_i$ involves an eigenvalue decomposition of the matrix $\mathbf{K}$. The dependence of $\hat f_\lambda$ on the regularization family is implicit; our results hold for any regularization family except where explicitly stated otherwise. The estimators $\hat f_\lambda$ are the main focus of this paper.

5 Main results

5.1 General bound on the risk

Theorem 1. Let $\hat f_\lambda$ be the estimator defined in (8) with regularization family $\{g_\lambda\}_{\lambda > 0}$. Let $0 \le \delta \le 1$ and assume that there is some $\kappa_\delta^2 > 0$ such that

$$\sup_{x \in \mathcal{X}} \sum_{j=1}^\infty t_j^{2(1-\delta)} \psi_j(x)^2 \le \kappa_\delta^2 < \infty. \tag{9}$$

Assume that the source condition $f^\dagger \in \mathcal{H}_\zeta$ holds for some $\zeta \ge 0$, and that $g_\lambda$ has qualification at least $\max\{(\zeta+1)/2, 1\}$.
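Define the effective dimension

$$d_\lambda = \sum_{j=1}^\infty \frac{t_j^2}{t_j^2 + \lambda}.$$

Finally, assume that $(8/3 + 2\sqrt{5/3}) \, \kappa_\delta^2 / n \le \lambda^{1-\delta} \le \kappa_\delta^2$. The following risk bound holds:

$$R_\rho(\hat f_\lambda) \le 2^{\zeta+3} \|f^\dagger\|_{\mathcal{H}_\zeta}^2 \lambda^{\zeta+1} + \frac{4 d_\lambda \sigma^2}{n} + 4 d_\lambda \left( \|f^\dagger\|_{\mathcal{H}}^2 t_1^2 + \frac{\kappa^2 \sigma^2}{\lambda n} \right) \exp\left( -\frac{3 \lambda^{1-\delta} n}{28 \kappa_\delta^2} \right) + \mathbb{1}\{\zeta > 1\} \, 16 \zeta^2 (3/2)^{\zeta-1} \|f^\dagger\|_{\mathcal{H}_\zeta}^2 (t_1^2 + \lambda)^{\zeta-1} \kappa^4 \left( \frac{34}{n} + \frac{15}{n^2} \right). \tag{10}$$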
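The finitely-computable form of the estimator (8) and the effective dimension from Theorem 1 can be illustrated with a short NumPy sketch. This is not the authors' code: the function names (`g_ridge`, `g_cutoff`, `fit_coefficients`, `effective_dimension`, `qualification_ratio`) are hypothetical, the qualification check is only a grid approximation of the supremum, and the toy data at the end are made up for illustration.

```python
import numpy as np

def g_ridge(t, lam):
    """Ridge (Tikhonov) filter: g_lambda(t) = 1 / (t + lambda)."""
    return 1.0 / (np.asarray(t, dtype=float) + lam)

def g_cutoff(t, lam):
    """Spectral cut-off filter (7): g_lambda(t) = 1{t >= lambda} / t."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    keep = t >= lam
    out[keep] = 1.0 / t[keep]
    return out

def fit_coefficients(K, y, lam, g):
    """Return gamma = g_lambda(K/n) y / n, so that
    f_hat(x) = sum_i gamma_i K(x, x_i)."""
    n = K.shape[0]
    evals, evecs = np.linalg.eigh(K / n)   # K/n is symmetric
    evals = np.clip(evals, 0.0, None)      # clear tiny negative round-off
    return evecs @ (g(evals, lam) * (evecs.T @ y)) / n

def effective_dimension(eigvals, lam):
    """d_lambda = sum_j t_j^2 / (t_j^2 + lambda); eigvals holds the t_j^2."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(eigvals / (eigvals + lam)))

def qualification_ratio(g, lam, xi, kappa2=1.0, grid=10000):
    """Grid approximation of sup_{0<t<=kappa2} |1 - t g(t)| t^xi / lam^xi.
    A value <= 1 is consistent with qualification at least xi."""
    t = np.linspace(kappa2 / grid, kappa2, grid)
    return float(np.max(np.abs(1.0 - t * g(t, lam)) * t ** xi) / lam ** xi)

# Toy example (made-up data) with the discrete kernel K(x, x') = 1{x = x'}.
x = np.array([0, 0, 0, 1])
y = np.array([1.0, 2.0, 3.0, 10.0])
K = (x[:, None] == x[None, :]).astype(float)
gamma_kpcr = fit_coefficients(K, y, lam=0.5, g=g_cutoff)
fitted_kpcr = K @ gamma_kpcr               # fitted values at the sample points
gamma_krr = fit_coefficients(K, y, lam=0.5, g=g_ridge)
```

For the ridge filter the coefficients coincide with the familiar closed form $(\mathbf{K} + n\lambda I)^{-1} y$, which gives a quick sanity check. In the toy example the cut-off at $\lambda = 0.5$ keeps only the leading eigen-direction of $\mathbf{K}/n$, so the isolated fourth observation is fitted as zero.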

Theorem 1 is proved in Appendix A. The first two terms in the upper bound (10) are typically the dominant terms. In the upper bound (10), the interaction between the kernel $K$ and the distribution $\rho_X$ is reflected in the effective dimension $d_\lambda$ [see, e.g., 4, 24]. The regularity of $f^\dagger$ enters through the norm of $f^\dagger$ (both the $\mathcal{H}$- and $\mathcal{H}_\zeta$-norms) and the exponent on $\lambda$.

The condition (9) in Theorem 1 is always satisfied by taking $\delta = 0$ and $\kappa_0^2 = \kappa^2$. Requiring (9) with $\delta > 0$ imposes additional conditions on the RKHS $\mathcal{H}$. For Corollary 1 below (which applies when the eigenvalues $\{t_j^2\}$ have polynomial decay), we take $\delta = 0$ and $\kappa_0^2 = \kappa^2$. The stronger condition with $\delta > 0$ is required to obtain minimax rates for kernels where the eigenvalues $\{t_j^2\}$ have exponential or Gaussian-type decay (see Corollaries 2–3).

Risk bounds on general regularization estimators similar to $\hat f_\lambda$ were previously obtained by Bauer et al. [2]. However, their bounds [e.g., Theorem 10 in 2] are independent of the ambient RKHS $\mathcal{H}$, i.e., they do not depend on the eigenvalues $\{t_j^2\}$. Our bounds are tighter than those of Bauer et al. [2] because we take advantage of the structure of $\mathcal{H}$. In contrast with our Theorem 1, the results of Bauer et al. [2] do not give minimax bounds (not easily, at least), because minimax rates must depend on the $t_j^2$.

5.2 Implications for kernels characterized by their eigenvalues' rate of decay

We now state consequences of Theorem 1 that give explicit rates for estimating $f^\dagger$ via $\hat f_\lambda$, for any regularization family, under specific assumptions about the decay rate of the eigenvalues $\{t_j^2\}$. We first consider the case where the eigenvalues have polynomial decay.

Corollary 1. Assume that $C > 0$ and $\nu > 1/2$ are constants such that $0 < t_j^2 \le C j^{-2\nu}$ for all $j = 1, 2, \ldots$. Assume the source condition $f^\dagger \in \mathcal{H}_\zeta$ for some $\zeta \ge 0$, and that $g_\lambda$ has qualification at least $\max\{(\zeta+1)/2, 1\}$. Finally, take $\lambda = \tilde C n^{-\frac{2\nu}{2\nu(\zeta+1)+1}}$ for a suitable constant $\tilde C > 0$ so that the conditions on $\lambda$ from Theorem 1 are satisfied. Then

$$R_\rho(\hat f_\lambda) = O\left( \left\{ \|f^\dagger\|_{\mathcal{H}_\zeta}^2 + \sigma^2 \right\} n^{-\frac{2\nu(\zeta+1)}{2\nu(\zeta+1)+1}} \right),$$

where the constants implicit in the big-$O$ may depend on $\kappa^2$, $C$, $\tilde C$, $\nu$, and $\zeta$, but nothing else.

Remark 1. Observe that if $g_\lambda$ has qualification at least $\max\{(\zeta+1)/2, 1\}$ and the other conditions of Corollary 1 are met, then $\hat f_\lambda$ obtains the minimax rate for estimating functions over $\mathcal{H}_\zeta$ [18]. Thus, if $g_\lambda$ has higher qualification, then $\hat f_\lambda$ can effectively adapt to a broader range of subspaces $\mathcal{H}_\zeta \subseteq \mathcal{H} = \mathcal{H}_0$. In particular, KPCR (with infinite qualification) can adapt to source conditions with arbitrary $\zeta \ge 0$; on the other hand, KRR satisfies the conditions of Corollary 1 only when $\zeta \le 1$, because KRR has qualification 1.

Remark 2. As mentioned earlier, a very similar result for polynomial decay eigenvalues was independently and simultaneously obtained by Blanchard and Mücke [3] for essentially the same class of regularization operators. Our Theorem 1, from which the corollary follows, applies to a broader class of kernels than the results of Blanchard and Mücke.

When the eigenvalues $\{t_j^2\}$ have exponential or Gaussian-type decay, the rates are nearly the same as in finite dimensions.

Corollary 2. Assume that $C, \alpha > 0$ are constants such that $0 < t_j^2 \le C e^{-\alpha j}$ for all $j = 1, 2, \ldots$. Assume that $g_\lambda$ has qualification at least 1 and that (9) holds for any $0 < \delta \le 1$. Finally, take $\lambda = \tilde C n^{-1} \log(n)$ for a suitable constant $\tilde C > 0$ so that the conditions on $\lambda$ from Theorem 1 are satisfied. Then

$$R_\rho(\hat f_\lambda) = O\left( \left\{ \|f^\dagger\|_{\mathcal{H}}^2 + \sigma^2 \right\} \frac{\log(n)}{n} \right),$$

where the constants implicit in the big-$O$ may depend on $\kappa^2$, $C$, $\tilde C$, $\alpha$, $\delta$, and $\kappa_\delta^2$, but nothing else.

Corollary 3. Assume that $C, \alpha > 0$ are constants such that $0 < t_j^2 \le C e^{-\alpha j^2}$ for all $j = 1, 2, \ldots$. Assume that $g_\lambda$ has qualification at least 1 and that (9) holds for any $0 < \delta \le 1$. Finally, take $\lambda = \tilde C n^{-1} \sqrt{\log(n)}$ for a suitable constant $\tilde C > 0$ so that the conditions on $\lambda$ from Theorem 1 are satisfied. Then

$$R_\rho(\hat f_\lambda) = O\left( \left\{ \|f^\dagger\|_{\mathcal{H}}^2 + \sigma^2 \right\} \frac{\sqrt{\log(n)}}{n} \right),$$

where the constants implicit in the big-$O$ may depend on $\kappa^2$, $C$, $\tilde C$, $\alpha$, $\delta$, and $\kappa_\delta^2$, but nothing else.

Remark 3. In Corollaries 2–3, we get minimax estimation over $\mathcal{H} = \mathcal{H}_0$ [18]. However, our bounds are not refined enough to pick up any potential improvements which may be had if a stronger source condition is satisfied (e.g., $f^\dagger \in \mathcal{H}_\zeta$ for $\zeta > 0$). This is typical in settings like this because the minimax rate is already quite fast, i.e., within a log-factor of the parametric rate $n^{-1}$.

5.3 Parametric rates for finite-dimensional kernels and subspaces

If the kernel has finite rank (i.e., $t_j^2 = 0$ for $j$ sufficiently large), then it follows directly from Theorem 1 that $R_\rho(\hat f_\lambda) = O(\{\|f^\dagger\|_{\mathcal{H}}^2 + \sigma^2\}/n)$. If the kernel has infinite rank, but $f^\dagger$ is contained in the finite-dimensional subspace $\mathcal{H}_J^\circ$ for some $J < \infty$, then Theorem 1 can still be applied, provided $g_\lambda$ has high qualification. Indeed, if $g_\lambda$ has infinite qualification and $f^\dagger \in \mathcal{H}_J^\circ$, then it follows that $f^\dagger \in \mathcal{H}_\zeta$ for all $\zeta \ge 0$ and Theorem 1 implies that for any $0 < \alpha \le 1$, $R_\rho(\hat f_\lambda) = O(\{\|f^\dagger\|_{\mathcal{H}}^2 + \sigma^2\}/n^{1-\alpha})$ for appropriately chosen $\lambda$. In fact, we can improve on this rate for KPCR. The next proposition implies that the risk of KPCR matches the parametric rate $n^{-1}$ for $f^\dagger \in \mathcal{H}_J^\circ$; the proof requires a different argument, based on eigenvalue perturbation theory, which we give in Appendix B.

Proposition 1. Let $\hat f_{\mathrm{KPCR},\lambda}$ be the KPCR estimator, with principal component regularization (7), and assume that $f^\dagger \in \mathcal{H}_J^\circ$. Let $0 < r < 1$ be a constant and let $\lambda = (1 - r) t_J^2$. If $r t_J^2 \ge \kappa^2 / n^{1/2} + \kappa^2 / (3n)$, then

{ } R ρ ˆf KPCR,λ 1 n 34κ 6 r t 4 f H + 3κ 1 rt σ + κ f H 15κ 4 nr t 4 n r t exp κ 4 + κ rt /3.

Proposition 1 implies that KPCR may reach the parametric rate for estimating $f^\dagger \in \mathcal{H}_J^\circ$.
On the other hand, it is known that KRR may perform dramatically worse than KPCR in these settings due to the saturation effect [see, e.g., 4, 8, 9].

6 Numerical experiments

6.1 Simulated data

This simulation study shows how KPCR is able to adapt to highly structured signals, while KRR requires more favorable structure from the ambient RKHS. For this experiment, we take $\mathcal{X} = \{1, 2, \ldots, 2^{13}\}$. The data distribution $\rho$ on $\mathcal{Y} \times \mathcal{X}$ is specified as follows. The marginal distribution on $\mathcal{X}$ is $\rho_X(x) \propto x^{-1/2}$, the regression function $f^\dagger$ is given by $f^\dagger(x) = \sum_{j=1}^5 \mathbb{1}\{x = j\}$, and $\rho(\cdot \mid x)$ is normal with mean $f^\dagger(x)$ and variance $1/4$. To compute $\hat f_\lambda$, we use the discrete kernel $K(x, \tilde x) = \mathbb{1}\{x = \tilde x\}$.

Using an iid sample of size $n = 2^{13}$, we compute $\hat f_\lambda$ (either KRR or KPCR) for $\lambda$ in a discrete grid of 10 values uniformly spaced between $10^{-5}$ and 0.02, and then choose the value of $\lambda$ for which $\hat f_\lambda$ has smallest validation mean-squared error $n^{-1} \sum_{i=1}^n \{y_i^v - \hat f_\lambda(x_i^v)\}^2$, computed using a separate iid sample $\{(x_i^v, y_i^v)\}_{i=1}^n$ of size $n = 2^{13}$. Figure 1a shows the validation-MSE of each $\hat f_\lambda$; the plot of the $\rho_X$-MSE $\|f^\dagger - \hat f_\lambda\|_{L^2(\rho_X)}^2$ has roughly the same shape, just shifted down by $\sigma^2 = 1/4$ (Figure 1b). The selected $\lambda$ is $\hat\lambda_{\mathrm{KRR}} =$ for KRR, and $\hat\lambda_{\mathrm{KPCR}} =$ for KPCR. These choices of $\lambda$ yield the final estimators, $\hat f_{\mathrm{KRR},\hat\lambda_{\mathrm{KRR}}}$ and $\hat f_{\mathrm{KPCR},\hat\lambda_{\mathrm{KPCR}}}$; the $\rho_X$-MSE is for KRR, and for KPCR. In Figure 1c, we plot the functions $\hat f_{\mathrm{KRR},\hat\lambda_{\mathrm{KRR}}}$ and $\hat f_{\mathrm{KPCR},\hat\lambda_{\mathrm{KPCR}}}$; the KRR function is non-zero for much of the domain, while the KPCR function is zero for nearly all of the domain (like $f^\dagger$).

Figure 1: (a) Validation-MSE of $\hat f_{\mathrm{KRR},\lambda}$ and $\hat f_{\mathrm{KPCR},\lambda}$ for $\rho_X(x) \propto x^{-1/2}$ as $\lambda$ varies; (b) $\rho_X$-MSE of $\hat f_{\mathrm{KRR},\lambda}$ and $\hat f_{\mathrm{KPCR},\lambda}$ for $\rho_X(x) \propto x^{-1/2}$ as $\lambda$ varies; (c) estimated functions $\hat f_{\mathrm{KRR},\hat\lambda_{\mathrm{KRR}}}$ and $\hat f_{\mathrm{KPCR},\hat\lambda_{\mathrm{KPCR}}}$ for $\rho_X(x) \propto x^{-1/2}$; (d) $\rho_X$-MSE of $\hat f_{\mathrm{KRR},\hat\lambda_{\mathrm{KRR}}}$ and $\hat f_{\mathrm{KPCR},\hat\lambda_{\mathrm{KPCR}}}$ for $\rho_X(x) \propto x^{-\alpha}$ as $\alpha$ varies.

We repeat the above simulation for different marginal distributions $\rho_X(x) \propto x^{-\alpha}$, for $1/2 \le \alpha \le 2$, which imply different eigenvalue sequences $\{t_j^2\}$. The mean and standard deviation of the $\rho_X$-MSEs over 10 repetitions are shown in Figure 1d. This confirms KPCR's ability to adapt to the regularity of $f^\dagger$, regardless of the ambient RKHS; KRR requires more structure to achieve similar results.

6.2 Real data

We also compared KRR and KPCR using three weighted degree kernels designed for recognizing splice sites in genetic sequences [20].³ The 3300 samples are divided into a training set (1000), validation set (1100), and testing set (1200). For each kernel, we use the training data to compute $\hat f_\lambda$ for $\lambda$ in a discrete grid of 10 equally-spaced values between $10^{-5}$ and 0.4, and select the value of $\lambda$ on which the MSE of $\hat f_\lambda$ on the validation set is smallest. The MSE on the testing set and the intrinsic dimension $d_{\hat\lambda}$ for the selected $\hat\lambda$ on the training data are as follows:

[Table: columns Kernel 1, Kernel 2, Kernel 3; rows MSE of $\hat f_{\mathrm{KRR},\hat\lambda_{\mathrm{KRR}}}$, MSE of $\hat f_{\mathrm{KPCR},\hat\lambda_{\mathrm{KPCR}}}$, $d_{\hat\lambda_{\mathrm{KRR}}}$, $d_{\hat\lambda_{\mathrm{KPCR}}}$]

KRR outperforms KPCR with Kernel 1, where the intrinsic dimension of the kernel is low, while the reverse happens with Kernels 2 and 3, where the intrinsic dimensions are high. This resonates with our theoretical results, which suggest that KRR requires low intrinsic dimension to perform most effectively.

7 Discussion

Our unified analysis for a general class of regularization families in nonparametric regression highlights two important statistical properties. First, the results show minimax optimality for this general class in several commonly studied settings, which was only previously established for specific regularization methods.
Second, the results demonstrate the adaptivity of certain regularization families to subspaces of the RKHS, showing that these techniques may take advantage of additional smoothness properties that the signal may possess. It is notable that the most well-studied family, KRR/Tikhonov regularization, does not possess this adaptability property.

³ We use the first three kernels for the data obtained from

References

[1] N. Aronszajn. Theory of reproducing kernels. T. A. Math. Soc., 68.
[2] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. J. Complexity, 23:52–72, 2007.
[3] G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. ArXiv e-prints, April 2016.
[4] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math., 7, 2007.
[5] A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl., 8.
[6] C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl., 4, 2006.
[7] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7:1–46.
[8] P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk comparison of ordinary least squares vs ridge regression. J. Mach. Learn. Res., 14, 2013.
[9] L. H. Dicker. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22:1–37, 2016.
[10] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Mathematics and Its Applications. Springer.
[11] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution Free Theory of Nonparametric Regression. Springer, 2002.
[12] D. Hsu, S. M. Kakade, and T. Zhang. Random design analysis of ridge regression. Found. Comput. Math., 14, 2014.
[13] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat., 34, 2006.
[14] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Comput., 7, 2008.
[15] P. Mathé. Saturation of regularization methods for linear ill-posed problems in Hilbert spaces. SIAM J. Numer. Anal., 42, 2005.
[16] S. Minsker. On Some Extensions of Bernstein's Inequality for Self-adjoint Operators. ArXiv e-prints, 2011.
[17] A. Neubauer. On converse and saturation results for Tikhonov regularization of linear ill-posed problems. SIAM J. Numer. Anal., 34:517–527.
[18] M. S. Pinsker. Optimal filtration of square-integrable signals in Gaussian noise. Probl. Inf. Transm., 16:52–68.
[19] L. Rosasco, E. De Vito, and A. Verri. Spectral methods for regularization in learning theory. Technical Report DISI-TR-05-18, Università degli Studi di Genova, Italy, 2005.
[20] S. Sonnenburg. Machine Learning for Genomic Sequence Analysis. PhD thesis, Fraunhofer Institute FIRST, December 2008. Supervised by K.-R. Müller and G. Rätsch.
[21] I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Conference on Learning Theory, 2009.
[22] J. A. Tropp. An Introduction to Matrix Concentration Inequalities. ArXiv e-prints, 2015.
[23] L. Wasserman. All of Nonparametric Statistics. Springer, 2006.
[24] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Comput., 17, 2005.
[25] Y. Zhang, J. C. Duchi, and M. J. Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory.

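A Proof of Theorem 1

To provide some intuition behind $\hat f_\lambda$ and our proof strategy, define the positive self-adjoint operator $T : \mathcal{H} \to \mathcal{H}$ by

$$T\phi = \int_{\mathcal{X}} \langle \phi, K_x \rangle_{\mathcal{H}} K_x \, d\rho_X(x) = \sum_{j=1}^\infty t_j^2 \langle \phi, \phi_j \rangle_{\mathcal{H}} \phi_j, \quad \phi \in \mathcal{H}.$$

Observe that $T$ is a population version of the operator $T_X$. Unlike $T_X$, $T$ often has infinite rank; however, we still might expect that $T \approx T_X$ for large $n$, where the approximation holds in some suitable sense. We also have a large-$n$ approximation for $S_X^* y$. For $\phi \in \mathcal{H}$,

$$\langle \phi, S_X^* y \rangle_{\mathcal{H}} = \frac{1}{n} \sum_{i=1}^n y_i \langle \phi, K_{x_i} \rangle_{\mathcal{H}} \approx \int_{\mathcal{Y} \times \mathcal{X}} y \phi(x) \, d\rho(y, x) = \int_{\mathcal{X}} f^\dagger(x) \phi(x) \, d\rho_X(x) = \langle \phi, f^\dagger \rangle_{L^2(\rho_X)} = \langle \phi, T f^\dagger \rangle_{\mathcal{H}},$$

where $\langle \cdot, \cdot \rangle_{L^2(\rho_X)}$ denotes the inner-product on $L^2(\rho_X)$ and we have used the identity $\phi_j = t_j \psi_j$ to obtain the last equality. It follows that $S_X^* y \approx T f^\dagger$. Hence, to recover $f^\dagger$ from $y$, it would be natural to apply the inverse of $T$ to $S_X^* y$. However, $T$ is not invertible whenever it has infinite rank, and regularization becomes necessary. We thus arrive at the chain of approximations which help motivate $\hat f_\lambda$:

$$\hat f_\lambda = g_\lambda(T_X) S_X^* y \approx g_\lambda(T) T f^\dagger \approx f^\dagger,$$

where $g_\lambda(T)$ may be viewed as an approximate inverse for a suitably chosen regularization parameter $\lambda$.

A.1 Bias-variance decomposition

The proof of Theorem 1 is based on a simple bias-variance decomposition of the risk of $\hat f_\lambda$. Let $\epsilon = (y_1 - f^\dagger(x_1), \ldots, y_n - f^\dagger(x_n))^\top \in \mathbb{R}^n$.

Proposition 2. The risk $R_\rho(\hat f_\lambda)$ has the decomposition

$$R_\rho(\hat f_\lambda) = B_\rho(\hat f_\lambda) + V_\rho(\hat f_\lambda), \tag{11}$$

where $B_\rho(\hat f_\lambda)$ and $V_\rho(\hat f_\lambda)$ are defined as

$$B_\rho(\hat f_\lambda) = E\left[ \left\| T^{1/2} \{I - g_\lambda(T_X) T_X\} f^\dagger \right\|_{\mathcal{H}}^2 \right], \quad V_\rho(\hat f_\lambda) = E\left[ \left\| T^{1/2} g_\lambda(T_X) S_X^* \epsilon \right\|_{\mathcal{H}}^2 \right].$$

Our proof separately bounds the bias $B_\rho(\hat f_\lambda)$ and variance $V_\rho(\hat f_\lambda)$ terms from Proposition 2. Taken together, these bounds imply a bound on $R_\rho(\hat f_\lambda)$.

A.2 Translation to vector and matrix notation

We first note that the Hilbert space $\mathcal{H}$ is isometric to $\ell^2(\mathbb{N})$ via the isometric isomorphism $\iota : \mathcal{H} \to \ell^2(\mathbb{N})$, given by

$$\iota : \sum_{j=1}^\infty \alpha_j \phi_j \mapsto (\alpha_1, \alpha_2, \ldots)^\top$$

(we take all elements of $\ell^2(\mathbb{N})$ to be infinite-dimensional column vectors). Using this equivalence, we can convert elements of $\mathcal{H}$ and linear operators on $\mathcal{H}$ appearing in Proposition 2 into infinite-dimensional vectors and matrices, respectively, which we find simpler to analyze in the sequel. Define the infinite-dimensional diagonal matrix $\mathbf{T} = \mathrm{diag}(t_1^2, t_2^2, \ldots)$ and the vector $\beta = (\beta_1, \beta_2, \ldots)^\top \in \ell^2(\mathbb{N})$, where $\beta_i = \langle f^\dagger, \phi_i \rangle_{\mathcal{H}}$. Next, define the $n \times \infty$ random matrices

$$\Psi = (\psi_j(x_i))_{1 \le i \le n; \, 1 \le j < \infty}, \quad \Phi = \Psi \mathbf{T}^{1/2} = (\phi_j(x_i))_{1 \le i \le n; \, 1 \le j < \infty}.$$

Observe that

$$S_X = \Phi \circ \iota, \quad S_X^* = \iota^{-1} \circ \left( \frac{1}{n} \Phi^\top \right), \quad T = \iota^{-1} \circ \mathbf{T} \circ \iota.$$

Also, for $1 \le i \le n$, let $\phi_i \phi_i^\top$ be the $\infty \times \infty$ matrix whose $(j, j')$-th entry is $\phi_j(x_i) \phi_{j'}(x_i)$, and define

$$\Sigma = \frac{1}{n} \sum_{i=1}^n \phi_i \phi_i^\top = \frac{1}{n} \Phi^\top \Phi.$$

Finally, let $I = \mathrm{diag}(1, 1, \ldots)$, and let $\|\cdot\| = \|\cdot\|_{\ell^2(\mathbb{N})}$ denote the norm on $\ell^2(\mathbb{N})$. In these matrix and vector notations, the bias-variance decomposition from Proposition 2 translates to the following:

$$B_\rho(\hat f_\lambda) = E\left[ \left\| \mathbf{T}^{1/2} \{I - g_\lambda(\Sigma) \Sigma\} \beta \right\|^2 \right], \quad V_\rho(\hat f_\lambda) = \frac{1}{n^2} E\left[ \left\| \mathbf{T}^{1/2} g_\lambda(\Sigma) \Phi^\top \epsilon \right\|^2 \right].$$

The boundedness of the kernel implies

$$\mathrm{tr}(\mathbf{T}) \le \kappa^2, \quad \|\phi_i \phi_i^\top\| \le \kappa^2, \quad \|\Sigma\| \le \kappa^2,$$

where the norms are the operator norms in $\ell^2(\mathbb{N})$.

A.3 Probabilistic bounds

For $0 < r < 1$, define the event

$$A_r = \left\{ \left\| (\mathbf{T} + \lambda I)^{-1/2} (\Sigma - \mathbf{T}) (\mathbf{T} + \lambda I)^{-1/2} \right\| \ge r \right\}, \tag{13}$$

and let $A_r^c$ denote its complement. Our bounds on the bias $B_\rho(\hat f_\lambda)$ and variance $V_\rho(\hat f_\lambda)$ are based on analysis in the event $A_r^c$ for a constant $r$. Therefore, we also need to show that $A_r^c$ has large probability (equivalently, show that $A_r$ has small probability).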

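Lemma 1. Assume that

$$\sup_{x \in \mathcal{X}} \sum_{j=1}^\infty t_j^{2(1-\delta)} \psi_j(x)^2 \le \kappa_\delta^2 < \infty \tag{14}$$

for some $0 \le \delta < 1$, $\kappa_\delta^2 > 0$. Further assume that $\lambda^{1-\delta} \le \kappa_\delta^2$. If $r \ge \sqrt{\kappa_\delta^2 / (\lambda^{1-\delta} n)} + \kappa_\delta^2 / (3 \lambda^{1-\delta} n)$, then

$$P(A_r) \le 4 d_\lambda \exp\left( -\frac{\lambda^{1-\delta} n r^2}{2 \kappa_\delta^2 (1 + r/3)} \right).$$

Proof. The proof is an application of Lemma 5. Define, for $1 \le i \le n$,

$$X_i = \frac{1}{n} (\mathbf{T} + \lambda I)^{-1/2} \{\phi_i \phi_i^\top - \mathbf{T}\} (\mathbf{T} + \lambda I)^{-1/2},$$

as well as $Y = \sum_{i=1}^n X_i$. We have

$$Y = (\mathbf{T} + \lambda I)^{-1/2} (\Sigma - \mathbf{T}) (\mathbf{T} + \lambda I)^{-1/2}.$$

It is clear that $E[X_i] = 0$. Observe that

$$\sum_{j=1}^\infty \frac{t_j^2}{t_j^2 + \lambda} \psi_j(x_i)^2 \le \left( \max_{j \ge 1} \frac{t_j^{2\delta}}{t_j^2 + \lambda} \right) \sum_{j=1}^\infty t_j^{2(1-\delta)} \psi_j(x_i)^2 \le \kappa_\delta^2 \max_{j \ge 1} \frac{t_j^{2\delta}}{t_j^2 + \lambda} \le \frac{\kappa_\delta^2}{\lambda^{1-\delta}}, \tag{15}$$

where the last inequality uses the inequality of arithmetic and geometric means. Therefore, by the assumption $\lambda^{1-\delta} \le \kappa_\delta^2$,

$$\|X_i\| \le \frac{1}{n} \max\left\{ \left\| (\mathbf{T} + \lambda I)^{-1/2} \phi_i \phi_i^\top (\mathbf{T} + \lambda I)^{-1/2} \right\|, \; \left\| (\mathbf{T} + \lambda I)^{-1/2} \mathbf{T} (\mathbf{T} + \lambda I)^{-1/2} \right\| \right\} = \frac{1}{n} \max\left\{ \sum_{j=1}^\infty \frac{t_j^2}{t_j^2 + \lambda} \psi_j(x_i)^2, \; \max_{j \ge 1} \frac{t_j^2}{t_j^2 + \lambda} \right\} \le \frac{\kappa_\delta^2}{\lambda^{1-\delta} n}.$$

Moreover,

$$E[X_i^2] = \frac{1}{n^2} E\left[ (\mathbf{T} + \lambda I)^{-1/2} \phi_i \phi_i^\top (\mathbf{T} + \lambda I)^{-1} \phi_i \phi_i^\top (\mathbf{T} + \lambda I)^{-1/2} - (\mathbf{T} + \lambda I)^{-2} \mathbf{T}^2 \right] = \frac{1}{n^2} E\left[ \left\{ \sum_{j=1}^\infty \frac{t_j^2}{t_j^2 + \lambda} \psi_j(x_i)^2 \right\} (\mathbf{T} + \lambda I)^{-1/2} \phi_i \phi_i^\top (\mathbf{T} + \lambda I)^{-1/2} \right] - \frac{1}{n^2} (\mathbf{T} + \lambda I)^{-2} \mathbf{T}^2.$$

Combining this with (15) gives

$$\|E[Y^2]\| \le \frac{\kappa_\delta^2}{\lambda^{1-\delta} n} \left\| E\left[ (\mathbf{T} + \lambda I)^{-1/2} \phi_1 \phi_1^\top (\mathbf{T} + \lambda I)^{-1/2} \right] \right\| = \frac{\kappa_\delta^2}{\lambda^{1-\delta} n} \left\| (\mathbf{T} + \lambda I)^{-1} \mathbf{T} \right\| \le \frac{\kappa_\delta^2}{\lambda^{1-\delta} n}$$

and

$$\mathrm{tr}(E[Y^2]) \le \frac{\kappa_\delta^2}{\lambda^{1-\delta} n} \mathrm{tr}\left( E\left[ (\mathbf{T} + \lambda I)^{-1/2} \phi_1 \phi_1^\top (\mathbf{T} + \lambda I)^{-1/2} \right] \right) = \frac{\kappa_\delta^2}{\lambda^{1-\delta} n} \mathrm{tr}\left( (\mathbf{T} + \lambda I)^{-1} \mathbf{T} \right) = \frac{d_\lambda \kappa_\delta^2}{\lambda^{1-\delta} n}.$$

Applying Lemma 5 with $V = R = \kappa_\delta^2 / (\lambda^{1-\delta} n)$ and $D = d_\lambda$ gives

$$P(\|Y\| \ge r) \le 4 d_\lambda \exp\left( -\frac{\lambda^{1-\delta} n r^2}{2 \kappa_\delta^2 (1 + r/3)} \right).$$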
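As a rough numerical illustration of Lemma 1 (not part of the proof), consider the discrete kernel $K(x, \tilde x) = \mathbb{1}\{x = \tilde x\}$ on a finite set, as used in Section 6.1. There the eigenvalues are $t_j^2 = \rho_X(j)$ and $\phi_j(x) = \mathbb{1}\{x = j\}$, so both $\mathbf{T}$ and $\Sigma$ are diagonal (true and empirical frequencies), and (14) holds with $\delta = 0$ and $\kappa_0^2 = 1$. The sketch below, with hypothetical function names and made-up parameters, compares a Monte Carlo estimate of $P(A_r)$ with the stated bound under these assumptions.

```python
import math
import random

def deviation_norm(probs, sample, lam):
    """Operator norm of (T + lam I)^{-1/2} (Sigma - T) (T + lam I)^{-1/2}
    for the discrete kernel, where T = diag(probs) and Sigma is the
    diagonal matrix of empirical frequencies of the sample."""
    counts = [0] * len(probs)
    for x in sample:
        counts[x] += 1
    n = len(sample)
    return max(abs(c / n - p) / (p + lam) for c, p in zip(counts, probs))

def lemma1_bound(probs, n, lam, r):
    """4 d_lambda exp(-lam n r^2 / (2 (1 + r/3))), i.e. the Lemma 1 bound
    with delta = 0 and kappa_0^2 = 1 (discrete-kernel assumption)."""
    d_lam = sum(p / (p + lam) for p in probs)
    return 4.0 * d_lam * math.exp(-lam * n * r * r / (2.0 * (1.0 + r / 3.0)))

def empirical_prob(probs, n, lam, r, reps=200, seed=0):
    """Monte Carlo estimate of P(A_r) = P(deviation_norm >= r)."""
    rng = random.Random(seed)
    points = list(range(len(probs)))
    hits = 0
    for _ in range(reps):
        sample = rng.choices(points, weights=probs, k=n)
        if deviation_norm(probs, sample, lam) >= r:
            hits += 1
    return hits / reps

# Illustrative parameters: uniform distribution on 4 points.
probs = [0.25, 0.25, 0.25, 0.25]
bound = lemma1_bound(probs, n=2000, lam=0.1, r=0.5)
estimate = empirical_prob(probs, n=2000, lam=0.1, r=0.5, reps=50)
```

With these parameters the condition $r \ge \sqrt{1/(\lambda n)} + 1/(3 \lambda n)$ of the lemma is satisfied, and both the bound and the simulated frequency are essentially zero; shrinking $n$ or $\lambda$ makes the event $A_r$ more likely and the bound correspondingly weaker.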

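Lemma 2.
\[ E\big[ \|\Sigma - \mathbf{T}\|^2 \big] \le \frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2}. \]

Proof. The proof is an application of Lemma 6. Define, for $1 \le i \le n$,
\[ X_i = \frac{1}{n} (\phi_i \otimes \phi_i - \mathbf{T}), \]
and also define $Y = \sum_{i=1}^{n} X_i$, so $Y = \Sigma - \mathbf{T}$. Clearly $E[X_i] = 0$ and $\|X_i\| \le \kappa^2/n$. Additionally, since $E[Y^2] = n^{-1} E[ \|\phi_1\|^2 \phi_1 \otimes \phi_1 - \mathbf{T}^2 ]$, we have $\|E[Y^2]\| \le \kappa^4/n$ and $\operatorname{tr}(E[Y^2]) \le \kappa^4/n$. The claim thus follows by applying Lemma 6 with $V = \kappa^4/n$, $D = 1$, and $R = \kappa^2/n$.

A.4 Bias bound

Lemma 3. Assume that $\beta = \mathbf{T}^{\zeta/2} \alpha$ for some $\zeta \ge 0$ and that $g_\lambda$ has qualification at least $\max\{(\zeta + 1)/2, 1\}$. For any $0 < r \le 1/2$,
\[ B_\rho(\hat f_\lambda) \le \frac{2^{\zeta+2}}{(1-r)^2} \|\alpha\|^2 \lambda^{\zeta+1} + t_1^2 \|\beta\|^2 P(A_r) + 1\{\zeta > 1\} \cdot \frac{8\zeta^2 (1+r)^{\zeta-1}}{1-r} \|\alpha\|^2 \|\mathbf{T} + \lambda I\|^{\zeta-1} E\big[ \|\Sigma - \mathbf{T}\|^2 \big]. \]

Proof. Define $h_\lambda(\Sigma) = I - \Sigma g_\lambda(\Sigma)$. Since $|1 - s g_\lambda(s)| \le 1$ for $0 < s \le \kappa^2$ and $\|\Sigma\| \le \kappa^2$, we have $\|h_\lambda(\Sigma)\| \le 1$. Moreover, using $\beta = \mathbf{T}^{\zeta/2} \alpha$,
\[ \begin{aligned} B_\rho(\hat f_\lambda) &= E\big[ \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \beta \|^2 \big] \\ &\le E\big[ \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 1\{A_r^c\} \big] \|\alpha\|^2 + \|\mathbf{T}^{1/2}\|^2 \|h_\lambda(\Sigma)\|^2 \|\beta\|^2 P(A_r) \\ &\le E\big[ \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 1\{A_r^c\} \big] \|\alpha\|^2 + t_1^2 \|\beta\|^2 P(A_r). \end{aligned} \]
The rest of the proof involves bounding $E[ \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 1\{A_r^c\} ]$. We separately consider two cases: (i) $\zeta \le 1$, and (ii) $\zeta > 1$.

Case 1: $\zeta \le 1$. Since $g_\lambda$ has qualification at least 1, it follows that $(s + \lambda) |1 - s g_\lambda(s)| \le 2\lambda$ for $0 < s \le \kappa^2$. This implies
\[ \begin{aligned} \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 &\le \| \mathbf{T}^{1/2} (\Sigma + \lambda I)^{-1/2} \|^2 \, \| (\Sigma + \lambda I) h_\lambda(\Sigma) \|^2 \, \| \mathbf{T}^{\zeta/2} (\Sigma + \lambda I)^{-1/2} \|^2 \\ &\le 4\lambda^2 \, \| \mathbf{T}^{1/2} (\Sigma + \lambda I)^{-1} \mathbf{T}^{1/2} \| \, \| \mathbf{T}^{\zeta/2} (\Sigma + \lambda I)^{-1} \mathbf{T}^{\zeta/2} \|. \end{aligned} \tag{16} \]
For $0 \le z \le 1$,
\[ \begin{aligned} \| \mathbf{T}^{z/2} (\Sigma + \lambda I)^{-1} \mathbf{T}^{z/2} \| &= \big\| \mathbf{T}^{z/2} \{ (\Sigma - \mathbf{T}) + (\mathbf{T} + \lambda I) \}^{-1} \mathbf{T}^{z/2} \big\| \\ &\le \| \mathbf{T}^{z/2} (\mathbf{T} + \lambda I)^{-1/2} \|^2 \, \big\| (\mathbf{T} + \lambda I)^{1/2} \{ (\Sigma - \mathbf{T}) + (\mathbf{T} + \lambda I) \}^{-1} (\mathbf{T} + \lambda I)^{1/2} \big\| \\ &= \| \mathbf{T}^{z} (\mathbf{T} + \lambda I)^{-1} \| \, \big\| \{ I - (\mathbf{T} + \lambda I)^{-1/2} (\mathbf{T} - \Sigma) (\mathbf{T} + \lambda I)^{-1/2} \}^{-1} \big\| \\ &\le \lambda^{z-1} \big\| \{ I - (\mathbf{T} + \lambda I)^{-1/2} (\mathbf{T} - \Sigma) (\mathbf{T} + \lambda I)^{-1/2} \}^{-1} \big\|, \end{aligned} \tag{17} \]
where the final inequality uses the fact that $s^z/(s + \lambda) \le \lambda^{z-1}$ for $0 \le z \le 1$ and $s \ge 0$. The final quantity in (17) is bounded above by $\lambda^{z-1}/(1 - r)$ on the event $A_r^c$, so applying this inequality with $z = 1$ and $z = \zeta$ to (16) gives
\[ \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 1\{A_r^c\} \le 4\lambda^2 \cdot \frac{1}{1-r} \cdot \frac{\lambda^{\zeta-1}}{1-r} = \frac{4\lambda^{\zeta+1}}{(1-r)^2}. \]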

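So the bias in this case is bounded as
\[ B_\rho(\hat f_\lambda) \le \frac{4}{(1-r)^2} \|\alpha\|^2 \lambda^{\zeta+1} + t_1^2 \|\beta\|^2 P(A_r). \]

Case 2: $\zeta > 1$. We have
\[ \begin{aligned} \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 &\le \| \mathbf{T}^{1/2} (\Sigma + \lambda I)^{-1/2} \|^2 \, \| (\Sigma + \lambda I)^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 \\ &= \| \mathbf{T}^{1/2} (\Sigma + \lambda I)^{-1} \mathbf{T}^{1/2} \| \, \| (\Sigma + \lambda I)^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 \\ &\le \| \mathbf{T}^{1/2} (\Sigma + \lambda I)^{-1} \mathbf{T}^{1/2} \| \, \| (\Sigma + \lambda I)^{1/2} h_\lambda(\Sigma) (\mathbf{T} + \lambda I)^{\zeta/2} \|^2. \end{aligned} \tag{18} \]
The first factor on the right-hand side of (18) can be bounded using (17) on the event $A_r^c$. For the second factor, we have
\[ \begin{aligned} \| (\Sigma + \lambda I)^{1/2} h_\lambda(\Sigma) (\mathbf{T} + \lambda I)^{\zeta/2} \| &\le \| (\Sigma + \lambda I)^{(\zeta+1)/2} h_\lambda(\Sigma) \| + \| (\Sigma + \lambda I)^{1/2} h_\lambda(\Sigma) \| \, \| (\mathbf{T} + \lambda I)^{\zeta/2} - (\Sigma + \lambda I)^{\zeta/2} \| \\ &\le 2^{(\zeta+1)/2} \lambda^{(\zeta+1)/2} + 2\lambda^{1/2} \| (\mathbf{T} + \lambda I)^{\zeta/2} - (\Sigma + \lambda I)^{\zeta/2} \|. \end{aligned} \tag{19} \]
Above, the first inequality is due to the triangle inequality, and the second inequality uses the facts that $g_\lambda$ has qualification at least $(\zeta + 1)/2$, and that $(s + \lambda)^{(z+1)/2} \le 2^{-1+(z+1)/2} (s^{(z+1)/2} + \lambda^{(z+1)/2})$ for $s \ge 0$ and $z \ge 1$.

We now bound $\| (\mathbf{T} + \lambda I)^{\zeta/2} - (\Sigma + \lambda I)^{\zeta/2} \|$ in terms of $\|\mathbf{T} - \Sigma\|$. First, observe that on the event $A_r^c$,
\[ \|\mathbf{T} - \Sigma\| \le \|\mathbf{T} + \lambda I\| \, \| (\mathbf{T} + \lambda I)^{-1/2} (\mathbf{T} - \Sigma) (\mathbf{T} + \lambda I)^{-1/2} \| < r \|\mathbf{T} + \lambda I\|, \]
and, consequently,
\[ \|\Sigma + \lambda I\| \le (1 + r) \|\mathbf{T} + \lambda I\|. \]
For a small constant $s > 0$, define
\[ A_s = \frac{1}{(1 + r + s) \|\mathbf{T} + \lambda I\|} (\mathbf{T} + \lambda I), \qquad B_s = \frac{1}{(1 + r + s) \|\mathbf{T} + \lambda I\|} (\Sigma + \lambda I). \]
Then, on the event $A_r^c$, applying Lemma 7 and Lemma 8 gives
\[ \| A_s^{\zeta/2} - B_s^{\zeta/2} \| \le 2\zeta \| A_s^{1/2} - B_s^{1/2} \| \le \zeta \left\{ \frac{\lambda}{(1 + r + s) \|\mathbf{T} + \lambda I\|} \right\}^{-1/2} \| A_s - B_s \| = \zeta \big\{ \lambda (1 + r + s) \|\mathbf{T} + \lambda I\| \big\}^{-1/2} \|\mathbf{T} - \Sigma\|. \]
Taking $s \to 0$, it follows that on $A_r^c$,
\[ \| (\mathbf{T} + \lambda I)^{\zeta/2} - (\Sigma + \lambda I)^{\zeta/2} \| \le \zeta \lambda^{-1/2} \big\{ (1 + r) \|\mathbf{T} + \lambda I\| \big\}^{(\zeta-1)/2} \|\mathbf{T} - \Sigma\|. \tag{20} \]
Combining (19) and (20) gives
\[ \| (\Sigma + \lambda I)^{1/2} h_\lambda(\Sigma) (\mathbf{T} + \lambda I)^{\zeta/2} \| \le 2^{(\zeta+1)/2} \lambda^{(\zeta+1)/2} + 2\zeta \big\{ (1 + r) \|\mathbf{T} + \lambda I\| \big\}^{(\zeta-1)/2} \|\mathbf{T} - \Sigma\|. \]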

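Using this together with (17) in (18) gives
\[ \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \mathbf{T}^{\zeta/2} \|^2 1\{A_r^c\} \le \frac{1}{1-r} \Big( 2^{(\zeta+1)/2} \lambda^{(\zeta+1)/2} + 2\zeta \big\{ (1 + r) \|\mathbf{T} + \lambda I\| \big\}^{(\zeta-1)/2} \|\mathbf{T} - \Sigma\| \Big)^2. \]
Therefore, since $(a + b)^2 \le 2a^2 + 2b^2$,
\[ B_\rho(\hat f_\lambda) \le \frac{\|\alpha\|^2}{1-r} \Big( 2^{\zeta+2} \lambda^{\zeta+1} + 8\zeta^2 \big\{ (1 + r) \|\mathbf{T} + \lambda I\| \big\}^{\zeta-1} E\big[ \|\mathbf{T} - \Sigma\|^2 \big] \Big) + t_1^2 \|\beta\|^2 P(A_r). \]

A.5 Variance bound

Lemma 4. For any $0 < r < 1$,
\[ V_\rho(\hat f_\lambda) \le \frac{2}{1-r} \cdot \frac{d_\lambda \sigma^2}{n} + \frac{\kappa^2 \sigma^2}{\lambda n} P(A_r). \]

Proof. The assumption on $\epsilon = (\epsilon_1, \dots, \epsilon_n)^\top$ implies $E[\epsilon_i] = 0$, $E[\epsilon_i \epsilon_j] = 0$ for $i \ne j$, and $E[\epsilon_i^2] \le \sigma^2$. So, by Von Neumann's inequality,
\[ V_\rho(\hat f_\lambda) = \frac{1}{n^2} E\big[ \| \mathbf{T}^{1/2} g_\lambda(\Sigma) \Phi^\top \epsilon \|^2 \big] \le \frac{\sigma^2}{n} E\big[ \operatorname{tr}\big( \mathbf{T} g_\lambda(\Sigma)^2 \Sigma \big) \big]. \]
Using Von Neumann's inequality together with $\operatorname{tr}(\mathbf{T}) \le \kappa^2$ and $g_\lambda(s)^2 s \le 1/\lambda$ for $0 < s \le \kappa^2$,
\[ \operatorname{tr}\big( \mathbf{T} g_\lambda(\Sigma)^2 \Sigma \big) \le \operatorname{tr}(\mathbf{T}) \, \| g_\lambda(\Sigma)^2 \Sigma \| \le \frac{\kappa^2}{\lambda}. \]
Therefore
\[ V_\rho(\hat f_\lambda) \le \frac{\sigma^2}{n} E\big[ \operatorname{tr}\big( \mathbf{T} g_\lambda(\Sigma)^2 \Sigma \big) 1\{A_r^c\} \big] + \frac{\kappa^2 \sigma^2}{\lambda n} P(A_r). \]
Using Von Neumann's inequality twice more, and $(s + \lambda) g_\lambda(s)^2 s \le 2$ for $0 < s \le \kappa^2$, we have
\[ \begin{aligned} \operatorname{tr}\big( \mathbf{T} g_\lambda(\Sigma)^2 \Sigma \big) &= \operatorname{tr}\big( \mathbf{T} (\Sigma + \lambda I)^{-1} (\Sigma + \lambda I) g_\lambda(\Sigma)^2 \Sigma \big) \\ &\le \operatorname{tr}\big( \mathbf{T} (\Sigma + \lambda I)^{-1} \big) \, \| (\Sigma + \lambda I) g_\lambda(\Sigma)^2 \Sigma \| \\ &\le 2 \operatorname{tr}\big( \mathbf{T} (\Sigma + \lambda I)^{-1} \big) \\ &= 2 \operatorname{tr}\big( \mathbf{T} (\mathbf{T} + \lambda I)^{-1} (\mathbf{T} + \lambda I)^{1/2} (\Sigma + \lambda I)^{-1} (\mathbf{T} + \lambda I)^{1/2} \big) \\ &\le 2 \operatorname{tr}\big( \mathbf{T} (\mathbf{T} + \lambda I)^{-1} \big) \, \| (\mathbf{T} + \lambda I)^{1/2} (\Sigma + \lambda I)^{-1} (\mathbf{T} + \lambda I)^{1/2} \| \\ &= 2 d_\lambda \big\| \{ I - (\mathbf{T} + \lambda I)^{-1/2} (\mathbf{T} - \Sigma) (\mathbf{T} + \lambda I)^{-1/2} \}^{-1} \big\|. \end{aligned} \]
This final quantity is bounded above by $2 d_\lambda/(1 - r)$ on the event $A_r^c$.

A.6 Finishing the proof of Theorem 1

Using the bias-variance decomposition from Proposition 2, we apply the bias bound (Lemma 3) and the variance bound (Lemma 4) to obtain a bound on the risk:
\[ \begin{aligned} R_\rho(\hat f_\lambda) &= B_\rho(\hat f_\lambda) + V_\rho(\hat f_\lambda) \\ &\le \frac{2^{\zeta+2}}{(1-r)^2} \|\alpha\|^2 \lambda^{\zeta+1} + t_1^2 \|\beta\|^2 P(A_r) + 1\{\zeta > 1\} \cdot \frac{8\zeta^2 (1+r)^{\zeta-1}}{1-r} \|\alpha\|^2 (t_1^2 + \lambda)^{\zeta-1} E\big[ \|\Sigma - \mathbf{T}\|^2 \big] \\ &\quad + \frac{2}{1-r} \cdot \frac{d_\lambda \sigma^2}{n} + \frac{\kappa^2 \sigma^2}{\lambda n} P(A_r). \end{aligned} \]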

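Now set $r = 1/2$, and apply Lemma 1 to bound $P(A_r)$. Note that the assumption $(8/3 + 2\sqrt{5/3}) \kappa_\delta^2 / n \le \lambda^{1-\delta}$ satisfies the conditions on $r$ in Lemma 1 for $r = 1/2$. Finally, apply Lemma 2 to bound $E[ \|\Sigma - \mathbf{T}\|^2 ]$.

B Proof of Proposition 1

Let $\hat t_1^2 \ge \hat t_2^2 \ge \cdots \ge 0$ denote the eigenvalues of $\Sigma$ and define $\widehat{\mathbf{T}} = \operatorname{diag}(\hat t_1^2, \hat t_2^2, \dots)$. Let $\hat U$ be an orthogonal transformation satisfying $\Sigma = \hat U \widehat{\mathbf{T}} \hat U^\top$. Additionally, let $h_\lambda(t) = I\{t \le \lambda\}$, define $\hat J = \hat J_\lambda = \sup\{ j : \hat t_j^2 > \lambda \}$, and write $\hat U = (\hat U_- \ \hat U_+)$, where $\hat U_-$ is the $\infty \times \hat J$ matrix comprised of the first $\hat J$ columns of $\hat U$, and $\hat U_+$ consists of the remaining columns of $\hat U$. Finally, define
\[ \mathbf{T}_- = \operatorname{diag}(t_1^2, \dots, t_J^2), \qquad \widehat{\mathbf{T}}_+ = \operatorname{diag}(\hat t_{\hat J + 1}^2, \hat t_{\hat J + 2}^2, \dots), \]
and define the $\infty \times J$ matrix $U_- = (I_J \ 0)^\top$, where $I_J$ is the $J \times J$ identity matrix.

Now consider the bias term $B_\rho(\hat f_{\mathrm{KPCR},\lambda}) = E[ \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \beta \|^2 ]$, and observe that
\[ \| \mathbf{T}^{1/2} h_\lambda(\Sigma) \beta \|^2 = \| \mathbf{T}^{1/2} \hat U_+ \hat U_+^\top U_- U_-^\top \beta \|^2 \le \kappa^2 \| \hat U_+^\top U_- \|^2 \| f^\dagger \|_{\mathcal{H}}^2. \]
Thus,
\[ B_\rho(\hat f_{\mathrm{KPCR},\lambda}) \le \kappa^2 \| f^\dagger \|_{\mathcal{H}}^2 \, E\big[ \| \hat U_+^\top U_- \|^2 1\{ \hat J \ge J \} \big] + \kappa^2 \| f^\dagger \|_{\mathcal{H}}^2 \, P(\hat J < J). \tag{21} \]
Now we bound $\| \hat U_+^\top U_- \|$ on the event $\{ \hat J \ge J \}$. We derive this bound from basic principles, but it is essentially the Davis-Kahan inequality [7]. Let $D = \Sigma - \mathbf{T}$. Then
\[ D U_- = \Sigma U_- - \mathbf{T} U_- = \Sigma U_- - U_- \mathbf{T}_- \]
and
\[ U_-^\top D \hat U_+ = U_-^\top \Sigma \hat U_+ - \mathbf{T}_- U_-^\top \hat U_+ = U_-^\top \hat U_+ \widehat{\mathbf{T}}_+ - \mathbf{T}_- U_-^\top \hat U_+. \]
Next notice that
\[ \|D\| \ge \| U_-^\top D \hat U_+ \| \ge \| \mathbf{T}_- U_-^\top \hat U_+ \| - \| U_-^\top \hat U_+ \widehat{\mathbf{T}}_+ \| \ge t_J^2 \| U_-^\top \hat U_+ \| - (1 - r) t_J^2 \| U_-^\top \hat U_+ \| = r t_J^2 \| U_-^\top \hat U_+ \|. \]
Thus,
\[ \| U_-^\top \hat U_+ \| \le \frac{1}{r t_J^2} \|D\| \tag{22} \]
on the event $\{ \hat J \ge J \}$.

Next we bound $P(\hat J < J) = P(\hat t_J^2 \le \lambda) = P\{ \hat t_J^2 \le (1 - r) t_J^2 \}$. By Weyl's inequality, $| \hat t_J^2 - t_J^2 | \le \|D\| = \| \Sigma - \mathbf{T} \|$, and, by Lemma 5,
\[ P\big( \| \Sigma - \mathbf{T} \| \ge r t_J^2 \big) \le 4 \exp\left( -\frac{n r^2 t_J^4}{2 (\kappa^4 + \kappa^2 r t_J^2 / 3)} \right), \]
provided $r t_J^2 \ge \kappa^2 / n^{1/2} + \kappa^2 / (3n)$. Thus,
\[ P(\hat J < J) = P\{ \hat t_J^2 \le (1 - r) t_J^2 \} \le 4 \exp\left( -\frac{n r^2 t_J^4}{2 (\kappa^4 + \kappa^2 r t_J^2 / 3)} \right). \tag{23} \]
Combining (21)-(23) and using Lemma 2 gives
\[ \begin{aligned} B_\rho(\hat f_{\mathrm{KPCR},\lambda}) &\le \frac{\kappa^2}{r^2 t_J^4} \| f^\dagger \|_{\mathcal{H}}^2 \, E\big[ \| \Sigma - \mathbf{T} \|^2 \big] + 4\kappa^2 \| f^\dagger \|_{\mathcal{H}}^2 \exp\left( -\frac{n r^2 t_J^4}{2 (\kappa^4 + \kappa^2 r t_J^2 / 3)} \right) \\ &\le \frac{\kappa^2}{r^2 t_J^4} \| f^\dagger \|_{\mathcal{H}}^2 \left( \frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2} \right) + 4\kappa^2 \| f^\dagger \|_{\mathcal{H}}^2 \exp\left( -\frac{n r^2 t_J^4}{2 (\kappa^4 + \kappa^2 r t_J^2 / 3)} \right). \end{aligned} \]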

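Next, we combine this bound on $B_\rho(\hat f_{\mathrm{KPCR},\lambda})$ with the variance bound in Lemma 4 (taking $r = 0$ in the lemma) to obtain
\[ \begin{aligned} R_\rho(\hat f_{\mathrm{KPCR},\lambda}) &= B_\rho(\hat f_{\mathrm{KPCR},\lambda}) + V_\rho(\hat f_{\mathrm{KPCR},\lambda}) \\ &\le \frac{\kappa^2}{r^2 t_J^4} \| f^\dagger \|_{\mathcal{H}}^2 \left( \frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2} \right) + 4\kappa^2 \| f^\dagger \|_{\mathcal{H}}^2 \exp\left( -\frac{n r^2 t_J^4}{2 (\kappa^4 + \kappa^2 r t_J^2 / 3)} \right) + \frac{2 d_\lambda \sigma^2}{n} + \frac{\kappa^2 \sigma^2}{\lambda n} \\ &\le \frac{\kappa^2}{r^2 t_J^4} \| f^\dagger \|_{\mathcal{H}}^2 \left( \frac{34\kappa^4}{n} + \frac{15\kappa^4}{n^2} \right) + 4\kappa^2 \| f^\dagger \|_{\mathcal{H}}^2 \exp\left( -\frac{n r^2 t_J^4}{2 (\kappa^4 + \kappa^2 r t_J^2 / 3)} \right) + \frac{3\kappa^2 \sigma^2}{(1 - r) t_J^2 n} \\ &= \frac{1}{n} \left\{ \frac{34\kappa^6}{r^2 t_J^4} \| f^\dagger \|_{\mathcal{H}}^2 + \frac{3\kappa^2}{(1 - r) t_J^2} \sigma^2 \right\} + \kappa^2 \| f^\dagger \|_{\mathcal{H}}^2 \left\{ \frac{15\kappa^4}{n^2 r^2 t_J^4} + 4 \exp\left( -\frac{n r^2 t_J^4}{2 (\kappa^4 + \kappa^2 r t_J^2 / 3)} \right) \right\}. \end{aligned} \]
This completes the proof of the proposition.

C Supporting results

C.1 Sums of random operators

Lemma 5. Let $R > 0$ be a positive real constant and consider a finite sequence of self-adjoint Hilbert-Schmidt operators $\{X_i\}_{i=1}^{n}$ satisfying $E[X_i] = 0$ and $\|X_i\| \le R$ almost surely. Define $Y = \sum_{i=1}^{n} X_i$ and suppose there are constants $V, D > 0$ satisfying $\|E[Y^2]\| \le V$ and $\operatorname{tr}(E[Y^2]) \le VD$. For all $t \ge V^{1/2} + R/3$,
\[ P(\|Y\| \ge t) \le 4D \exp\left( -\frac{t^2}{2 (V + Rt/3)} \right). \]

Proof. This is a straightforward generalization of [, Theorem 7.7.1], using the arguments from [16, Section 4] to extend from self-adjoint matrices to self-adjoint Hilbert-Schmidt operators.

Lemma 6. In the same setting as Lemma 5,
\[ E\big[ \|Y\|^2 \big] \le (2 + 32D) V + \frac{2 + 128D}{9} R^2. \]

Proof. The proof is based on integrating the tail bound from Lemma 5:
\[ \begin{aligned} E\big[ \|Y\|^2 \big] &= \int_0^\infty P\big( \|Y\| \ge t^{1/2} \big) \, dt \\ &\le \Big( V^{1/2} + \frac{R}{3} \Big)^2 + \int_{(V^{1/2} + R/3)^2}^{\infty} 4D \exp\left( -\frac{t}{2 (V + R t^{1/2}/3)} \right) dt \\ &\le \Big( V^{1/2} + \frac{R}{3} \Big)^2 + \int_0^{9V^2/R^2} 4D \exp\left( -\frac{t}{4V} \right) dt + \int_{9V^2/R^2}^{\infty} 4D \exp\left( -\frac{3 t^{1/2}}{4R} \right) dt \\ &= \Big( V^{1/2} + \frac{R}{3} \Big)^2 + 16VD \left\{ 1 - \exp\left( -\frac{9V}{4R^2} \right) \right\} + \frac{128 R^2 D}{9} \left\{ \frac{9V}{4R^2} + 1 \right\} \exp\left( -\frac{9V}{4R^2} \right) \\ &= \Big( V^{1/2} + \frac{R}{3} \Big)^2 + 16VD \left\{ 1 + \exp\left( -\frac{9V}{4R^2} \right) \right\} + \frac{128 R^2 D}{9} \exp\left( -\frac{9V}{4R^2} \right) \\ &\le (2 + 32D) V + \frac{2 + 128D}{9} R^2. \end{aligned} \]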
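The arithmetic connecting Lemma 5, Lemma 6 and Lemma 2 can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the paper: it takes the Lemma 6 bound in the form $(2 + 32D)V + (2 + 128D)R^2/9$, plugs in the Lemma 2 choices $V = \kappa^4/n$, $R = \kappa^2/n$, $D = 1$ with made-up values $\kappa = 1$, $n = 100$, and also integrates the Lemma 5 tail with a plain trapezoid rule (truncated at an arbitrary upper limit where the tail is negligible). The function names are hypothetical.

```python
import math


def lemma5_tail(t, V, R, D):
    """Lemma 5 tail bound 4 D exp(-t^2 / (2 (V + R t / 3))), for t >= sqrt(V) + R/3."""
    return 4.0 * D * math.exp(-t * t / (2.0 * (V + R * t / 3.0)))


def lemma6_bound(V, R, D):
    """Closed-form bound on E||Y||^2, assuming the form (2 + 32 D) V + (2 + 128 D) R^2 / 9."""
    return (2.0 + 32.0 * D) * V + (2.0 + 128.0 * D) / 9.0 * R * R


def integrated_tail_bound(V, R, D, upper=5.0, steps=20000):
    """t0^2 + int_{t0}^{upper} 2 s * tail(s) ds, i.e. the tail integral after u = s^2."""
    t0 = math.sqrt(V) + R / 3.0
    h = (upper - t0) / steps

    def integrand(s):
        return 2.0 * s * lemma5_tail(s, V, R, D)

    total = 0.5 * (integrand(t0) + integrand(upper))
    for i in range(1, steps):
        total += integrand(t0 + i * h)
    return t0 * t0 + h * total


def lemma2_bound(kappa, n):
    """Right-hand side of Lemma 2: 34 kappa^4 / n + 15 kappa^4 / n^2."""
    return 34.0 * kappa ** 4 / n + 15.0 * kappa ** 4 / n ** 2


kappa, n = 1.0, 100                      # illustrative values only
V, R, D = kappa ** 4 / n, kappa ** 2 / n, 1.0
numeric = integrated_tail_bound(V, R, D)
closed_form = lemma6_bound(V, R, D)
print(numeric, closed_form, lemma2_bound(kappa, n))
```

The closed form is loose relative to the numerically integrated tail, and rounding $130/9$ up to $15$ is what produces the constants in Lemma 2.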

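C.2 Differences of powers of bounded operators

Lemma 7. Let $A$ and $B$ be non-negative self-adjoint operators with $\|A\| < 1$ and $\|B\| < 1$. For any $\gamma \ge 1$,
\[ \| A^\gamma - B^\gamma \| \le 2\gamma \| A - B \|. \]

Proof. The proof considers three possible cases for the value of $\gamma$: (i) $\gamma$ is an integer, (ii) $1 < \gamma < 2$, and (iii) $\gamma$ is a non-integer larger than two.

Case 1: $\gamma$ is an integer. In this case, we have the following identity:
\[ A^\gamma - B^\gamma = \sum_{j=1}^{\gamma} A^{j-1} (A - B) B^{\gamma-j}. \]
So, by the triangle inequality,
\[ \| A^\gamma - B^\gamma \| \le \| A - B \| \sum_{j=1}^{\gamma} \|A\|^{j-1} \|B\|^{\gamma-j} \le \gamma \| A - B \|. \]

Case 2: $1 < \gamma < 2$. Pick $0 < r < 1$ such that $\|A\| < r$ and $\|B\| < r$, and fix $0 < t < (1 - r)/2$. Define $A_t = A + tI$ and $B_t = B + tI$, and
\[ \binom{\gamma}{k} = \frac{\gamma (\gamma - 1) \cdots (\gamma - k + 1)}{k!}, \qquad k = 1, 2, \dots. \]
Then, using the power series $(1 + s)^\gamma = 1 + \sum_{k=1}^{\infty} \binom{\gamma}{k} s^k$ for $-1 < s < 1$,
\[ A_t^\gamma - B_t^\gamma = \sum_{k=1}^{\infty} \binom{\gamma}{k} \big\{ (A_t - I)^k - (B_t - I)^k \big\} = \sum_{k=1}^{\infty} \binom{\gamma}{k} \sum_{j=1}^{k} (A_t - I)^{j-1} (A - B) (B_t - I)^{k-j}. \]
Convergence is assured because $\|A_t - I\| \le 1 - t$ and $\|B_t - I\| \le 1 - t$. Moreover,
\[ \begin{aligned} \| A_t^\gamma - B_t^\gamma \| &\le \sum_{k=1}^{\infty} \left| \binom{\gamma}{k} \right| \| A - B \| \sum_{j=1}^{k} \| A_t - I \|^{j-1} \| B_t - I \|^{k-j} \\ &\le \| A - B \| \sum_{k=1}^{\infty} \left| \binom{\gamma}{k} \right| k (1 - t)^{k-1} \\ &= \gamma \| A - B \| \left\{ 1 + \sum_{k=1}^{\infty} \left| \binom{\gamma - 1}{k} \right| (1 - t)^k \right\} \\ &= \gamma \| A - B \| \big\{ 2 - (1 - (1 - t))^{\gamma-1} \big\} \\ &\le 2\gamma \| A - B \|. \end{aligned} \]
Above, we have used the power series $1 - (1 - s)^{\gamma-1} = \sum_{k=1}^{\infty} \left| \binom{\gamma - 1}{k} \right| s^k$ at $s = 1 - t$. Taking $t \to 0$ yields
\[ \| A^\gamma - B^\gamma \| \le 2\gamma \| A - B \|. \]

Case 3: $\gamma$ is a non-integer larger than two. We can write $\gamma = k + q$ for an integer $k$ and real number $0 < q < 1$. Applying the results from the previous two cases gives
\[ \| A^\gamma - B^\gamma \| = \| (A^k)^{\gamma/k} - (B^k)^{\gamma/k} \| \le \frac{2\gamma}{k} \| A^k - B^k \| \le 2\gamma \| A - B \|. \]

Lemma 8. Pick any real numbers $r, \gamma \in (0, 1)$. Let $A$ and $B$ be non-negative self-adjoint operators, each with spectrum contained in $[r, 1)$. Then
\[ \| A^\gamma - B^\gamma \| \le \gamma r^{\gamma-1} \| A - B \|. \]

Proof. Since $\|A - I\| \le 1 - r$ and $\|B - I\| \le 1 - r$, the proof is similar to that of Lemma 7:
\[ \| A^\gamma - B^\gamma \| \le \gamma \| A - B \| \left\{ 1 + \sum_{k=1}^{\infty} \left| \binom{\gamma - 1}{k} \right| (1 - r)^k \right\} = \gamma r^{\gamma-1} \| A - B \|. \]
Above, we have used the power series $(1 - s)^{\gamma-1} = 1 + \sum_{k=1}^{\infty} \left| \binom{\gamma - 1}{k} \right| s^k$ at $s = 1 - r$ (which differs from the case in Lemma 7 because $0 < \gamma < 1$).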
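The two operator inequalities can be spot-checked on non-commuting $2 \times 2$ symmetric matrices, where fractional powers and operator norms have closed forms. The sketch below is only a sanity check under assumptions of our own: random spectra and eigenvector angles, a fixed seed, and the inequalities taken in the forms $\|A^\gamma - B^\gamma\| \le 2\gamma \|A - B\|$ for $\gamma \ge 1$ and $\|A^\gamma - B^\gamma\| \le \gamma r^{\gamma-1} \|A - B\|$ for $0 < \gamma < 1$ with spectra in $[r, 1)$. It does not replace the proofs.

```python
import math
import random


def sym2(l1, l2, theta):
    """Entries (a, b, d) of [[a, b], [b, d]] with eigenvalues l1, l2 and eigenvector angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (l1 * c * c + l2 * s * s, (l1 - l2) * c * s, l1 * s * s + l2 * c * c)


def op_norm(m):
    """Operator norm of a symmetric 2x2 matrix: the largest absolute eigenvalue."""
    a, b, d = m
    return abs(0.5 * (a + d)) + math.hypot(0.5 * (a - d), b)


def power_gap(spec_a, spec_b, gamma):
    """Return (||A^gamma - B^gamma||, ||A - B||) for matrices given as (l1, l2, theta)."""
    def diff(p, q):
        return tuple(x - y for x, y in zip(p, q))

    (l1, l2, th), (m1, m2, ph) = spec_a, spec_b
    lhs = op_norm(diff(sym2(l1 ** gamma, l2 ** gamma, th),
                       sym2(m1 ** gamma, m2 ** gamma, ph)))
    rhs = op_norm(diff(sym2(l1, l2, th), sym2(m1, m2, ph)))
    return lhs, rhs


def worst_ratio(gamma, low, trials=2000, seed=0):
    """Largest observed ||A^gamma - B^gamma|| / ||A - B|| with spectra in [low, 0.999]."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        a = (rng.uniform(low, 0.999), rng.uniform(low, 0.999), rng.uniform(0.0, math.pi))
        b = (rng.uniform(low, 0.999), rng.uniform(low, 0.999), rng.uniform(0.0, math.pi))
        lhs, rhs = power_gap(a, b, gamma)
        if rhs > 1e-6:  # skip nearly identical pairs to avoid dividing by ~0
            worst = max(worst, lhs / rhs)
    return worst


ratio_lemma7 = worst_ratio(gamma=1.5, low=0.0)   # compare with 2 * 1.5
ratio_lemma8 = worst_ratio(gamma=0.5, low=0.2)   # compare with 0.5 * 0.2 ** (-0.5)
print(ratio_lemma7, ratio_lemma8)
```

In the application inside the proof of Lemma 3, Lemma 7 is used with $\gamma = \zeta$ on the square roots and Lemma 8 with $\gamma = 1/2$, which is why the $\gamma = 0.5$ case is the one exercised here.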



Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Spectral Properties of the Kernel Matrix and their Relation to Kernel Methods in Machine Learning

Spectral Properties of the Kernel Matrix and their Relation to Kernel Methods in Machine Learning Spectral Properties of the Kernel Matrix and their Relation to Kernel Methods in Machine Learning Dissertation zur Erlangung des Doktorgrades (Dr. rer. nat.) der Mathematisch-Naturwissenschaftlichen Fakultät

More information

Stein s Method for Matrix Concentration

Stein s Method for Matrix Concentration Stein s Method for Matrix Concentration Lester Mackey Collaborators: Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A. Tropp University of California, Berkeley California Institute of Technology

More information

An inverse problem perspective on machine learning

An inverse problem perspective on machine learning An inverse problem perspective on machine learning Lorenzo Rosasco University of Genova Massachusetts Institute of Technology Istituto Italiano di Tecnologia lcsl.mit.edu Feb 9th, 2018 Inverse Problems

More information

Inverse problems in statistics

Inverse problems in statistics Inverse problems in statistics Laurent Cavalier (Université Aix-Marseille 1, France) YES, Eurandom, 10 October 2011 p. 1/32 Part II 2) Adaptation and oracle inequalities YES, Eurandom, 10 October 2011

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal In this class we continue our journey in the world of RKHS. We discuss the Mercer theorem which gives

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

The Residual Spectrum and the Continuous Spectrum of Upper Triangular Operator Matrices

The Residual Spectrum and the Continuous Spectrum of Upper Triangular Operator Matrices Filomat 28:1 (2014, 65 71 DOI 10.2298/FIL1401065H Published by Faculty of Sciences and Mathematics, University of Niš, Serbia Available at: http://www.pmf.ni.ac.rs/filomat The Residual Spectrum and the

More information

Math 307 Learning Goals. March 23, 2010

Math 307 Learning Goals. March 23, 2010 Math 307 Learning Goals March 23, 2010 Course Description The course presents core concepts of linear algebra by focusing on applications in Science and Engineering. Examples of applications from recent

More information

Elementary linear algebra

Elementary linear algebra Chapter 1 Elementary linear algebra 1.1 Vector spaces Vector spaces owe their importance to the fact that so many models arising in the solutions of specific problems turn out to be vector spaces. The

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Ioannis Gkioulekas arvard SEAS Cambridge, MA 038 igkiou@seas.harvard.edu Todd Zickler arvard SEAS Cambridge, MA 038 zickler@seas.harvard.edu

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

Analysis of Spectral Kernel Design based Semi-supervised Learning

Analysis of Spectral Kernel Design based Semi-supervised Learning Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,

More information

Spectral Regularization for Support Estimation

Spectral Regularization for Support Estimation Spectral Regularization for Support Estimation Ernesto De Vito DSA, Univ. di Genova, and INFN, Sezione di Genova, Italy devito@dima.ungie.it Lorenzo Rosasco CBCL - MIT, - USA, and IIT, Italy lrosasco@mit.edu

More information

General Theory of Large Deviations

General Theory of Large Deviations Chapter 30 General Theory of Large Deviations A family of random variables follows the large deviations principle if the probability of the variables falling into bad sets, representing large deviations

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Plug-in Approach to Active Learning

Plug-in Approach to Active Learning Plug-in Approach to Active Learning Stanislav Minsker Stanislav Minsker (Georgia Tech) Plug-in approach to active learning 1 / 18 Prediction framework Let (X, Y ) be a random couple in R d { 1, +1}. X

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

consistent learning by composite proximal thresholding

consistent learning by composite proximal thresholding consistent learning by composite proximal thresholding Saverio Salzo Università degli Studi di Genova Optimization in Machine learning, vision and image processing Université Paul Sabatier, Toulouse 6-7

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

Manifold Learning: Theory and Applications to HRI

Manifold Learning: Theory and Applications to HRI Manifold Learning: Theory and Applications to HRI Seungjin Choi Department of Computer Science Pohang University of Science and Technology, Korea seungjin@postech.ac.kr August 19, 2008 1 / 46 Greek Philosopher

More information

Lecture 8 : Eigenvalues and Eigenvectors

Lecture 8 : Eigenvalues and Eigenvectors CPS290: Algorithmic Foundations of Data Science February 24, 2017 Lecture 8 : Eigenvalues and Eigenvectors Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Hermitian Matrices It is simpler to begin with

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Random matrices: Distribution of the least singular value (via Property Testing)

Random matrices: Distribution of the least singular value (via Property Testing) Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued

More information