Distributed Learning with Regularized Least Squares


Journal of Machine Learning Research 18 (2017). Submitted 11/15; Revised 6/17; Published 9/17.

Distributed Learning with Regularized Least Squares

Shao-Bo Lin (sblin1983@gmail.com)
Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Xin Guo (xguo@polyu.edu.hk)
Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

Ding-Xuan Zhou (mazhou@cityu.edu.hk)
Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Editor: Ingo Steinwart

Abstract

We study distributed learning with the least squares regularization scheme in a reproducing kernel Hilbert space (RKHS). By a divide-and-conquer approach, the algorithm partitions a data set into disjoint data subsets, applies the least squares regularization scheme to each data subset to produce an output function, and then takes an average of the individual output functions as a final global estimator or predictor. We show with error bounds and learning rates in expectation in both the L²-metric and RKHS-metric that the global output function of this distributed learning is a good approximation to the algorithm processing the whole data in one single machine. Our derived learning rates in expectation are optimal and stated in a general setting without any eigenfunction assumption. The analysis is achieved by a novel second order decomposition of operator differences in our integral operator approach. Even for the classical least squares regularization scheme in the RKHS associated with a general kernel, we give the best learning rate in expectation in the literature.

Keywords: Distributed learning, divide-and-conquer, error analysis, integral operator, second order decomposition

1. Introduction and Distributed Learning Algorithms

In the era of big data, the rapid expansion of computing capacities in automatic data generation and acquisition brings data of unprecedented size and complexity, and raises a series of scientific challenges such as storage bottleneck and algorithmic scalability (Zhou et al., 2014). To overcome the difficulty, some approaches for generating scalable approximate algorithms have been introduced in the literature, such as low-rank approximations of kernel matrices for kernel principal component analysis (Schölkopf et al., 1998; Bach, 2013), incomplete Cholesky decomposition (Fine and Scheinberg, 2001), early stopping of iterative optimization algorithms for gradient descent methods (Yao et al., 2007; Raskutti et al., 2014; Lin et al., 2016), and greedy-type algorithms. Another method proposed recently is distributed learning based on a divide-and-conquer approach and a particular learning algorithm implemented in individual machines (Zhang et al., 2015; Shamir and Srebro, 2014). This method

produces distributed learning algorithms consisting of three steps: partitioning the data into disjoint subsets, applying a particular learning algorithm implemented in an individual machine to each data subset to produce an individual output function, and synthesizing a global output by utilizing some average of the individual outputs. This method can significantly reduce computing time and lower single-machine memory requirements. For practical applications in medicine, finance, business and some other areas, the data are often stored naturally across multiple servers in a distributive way and are not combined, for reasons of protecting privacy and avoiding high costs. In such situations, the first step of data partitioning is not needed. It has been observed in many practical applications that when the number of data subsets is not too large, the divide-and-conquer approach does not cause a noticeable loss of performance compared with the learning scheme which processes the whole data on a single big machine. Theoretical attempts have been made recently in (Zhang et al., 2013, 2015) to derive learning rates for distributed learning with least squares regularization schemes under certain assumptions. This paper aims at error analysis of the distributed learning with regularized least squares and its approximation to the algorithm processing the whole data in one single machine.

Recall (Cristianini and Shawe-Taylor, 2000; Evgeniou et al., 2000) that in a reproducing kernel Hilbert space (RKHS) H_K induced by a Mercer kernel K on an input metric space X, with a sample D = {(x_i, y_i)}_{i=1}^{|D|} ⊂ X × Y, where Y = R is the output space, the least squares regularization scheme can be stated as

f_{D,λ} = arg min_{f ∈ H_K} { (1/|D|) Σ_{(x,y) ∈ D} (f(x) − y)² + λ ‖f‖_K² }.   (1)

Here λ > 0 is a regularization parameter and |D| is the cardinality of D. This learning algorithm is also called kernel ridge regression in statistics and has been well studied in learning theory; see, e.g., (De Vito et al., 2005; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Bauer et al., 2007; Smale and Zhou, 2007; Steinwart and Christmann, 2008). The regularization scheme (1) can be explicitly solved by using a standard matrix inversion technique, which requires costs of O(|D|³) in time and O(|D|²) in memory. However, this matrix inversion technique may not conquer the challenges of storage or computation arising from big data.

The distributed learning algorithm studied in this paper starts with partitioning the data set D into m disjoint subsets {D_j}_{j=1}^m. Then it assigns each data subset D_j to one machine or processor to produce a local estimator f_{D_j,λ} by the least squares regularization scheme (1). Finally, these local estimators are communicated to a central processor, and a global estimator f̄_{D,λ} is synthesized by taking a weighted average

f̄_{D,λ} = Σ_{j=1}^m (|D_j|/|D|) f_{D_j,λ}   (2)

of the local estimators {f_{D_j,λ}}_{j=1}^m. This algorithm has been studied with a matrix analysis approach in (Zhang et al., 2015), where some error analysis has been conducted under some eigenfunction assumptions for the integral operator associated with the kernel, presenting error bounds in expectation.
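To make the construction concrete, here is a minimal NumPy sketch of the regularization scheme (1) and the averaged estimator (2). It is an illustration only: the Gaussian kernel, the random partition, and all function and variable names are assumptions made for the example, not part of the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); A is (n, d), B is (m, d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    # Scheme (1): the minimizer is f = sum_i alpha_i K(x_i, .) with
    # (K + lam * n * I) alpha = y, costing O(n^3) time and O(n^2) memory.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return X, alpha

def krr_predict(model, X_test, sigma=1.0):
    X_train, alpha = model
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

def distributed_krr(X, y, lam, m, sigma=1.0, seed=0):
    # Scheme (2): random partition into m disjoint subsets, one local
    # estimator per subset, then the |D_j|/|D| weighted average of the
    # local predictions.
    idx = np.random.default_rng(seed).permutation(len(X))
    parts = np.array_split(idx, m)
    models = [krr_fit(X[p], y[p], lam, sigma) for p in parts]
    weights = [len(p) / len(X) for p in parts]

    def predict(X_test):
        return sum(w * krr_predict(mod, X_test, sigma)
                   for w, mod in zip(weights, models))

    return predict
```

Note that all subsets share a single regularization parameter λ, as in the theory developed below.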

In this paper we shall use a novel integral operator approach to prove that f̄_{D,λ} is a good approximation of f_{D,λ}. We present a representation of the difference f̄_{D,λ} − f_{D,λ} in terms of empirical integral operators, and analyze the error f̄_{D,λ} − f_{D,λ} in expectation without any eigenfunction assumptions. As a by-product, we present the best learning rates in expectation for the least squares regularization scheme (1) in a general setting, which surprisingly has not been done for a general kernel in the literature (see the detailed comparisons below).

2. Main Results

Our analysis is carried out in the standard least squares regression framework with a Borel probability measure ρ on Z := X × Y, where the input space X is a compact metric space. The sample D is drawn independently according to ρ. The Mercer kernel K : X × X → R defines an integral operator L_K on H_K by

L_K f = ∫_X K_x f(x) dρ_X,   f ∈ H_K,   (3)

where K_x is the function K(·, x) in H_K and ρ_X is the marginal distribution of ρ on X.

2.1 Error Bounds for the Distributed Learning Algorithm

Our error bounds in expectation for the distributed learning algorithm require the uniform boundedness condition for the output y, that is, for some constant M > 0, there holds |y| ≤ M almost surely. Our bounds are stated in terms of the approximation error

‖f_λ − f_ρ‖_ρ,   (4)

where f_λ is the data-free limit of (1) defined by

f_λ = arg min_{f ∈ H_K} { ∫_Z (f(x) − y)² dρ + λ ‖f‖_K² },   (5)

‖·‖_ρ denotes the norm of L²_{ρ_X}, the Hilbert space of square integrable functions with respect to ρ_X, and f_ρ is the regression function (conditional mean) of ρ defined by

f_ρ(x) = ∫_Y y dρ(y|x),   x ∈ X,

with ρ(·|x) being the conditional distribution of ρ induced at x ∈ X. Since K is continuous, symmetric and positive semidefinite, L_K is a compact positive operator of trace class and L_K + λI is invertible. Define a quantity measuring the complexity of H_K with respect to ρ_X, the effective dimension (Zhang, 2005), to be the trace of the operator (L_K + λI)^{-1} L_K:

N(λ) = Tr((L_K + λI)^{-1} L_K),   λ > 0.   (6)

In Section 6 we shall prove the following first main result of this paper concerning error bounds in expectation for f̄_{D,λ} − f_{D,λ} in H_K and in L²_{ρ_X}. Denote κ = sup_{x∈X} √(K(x, x)).
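The effective dimension (6) can be estimated from data: the eigenvalues of the normalized kernel matrix on a sample drawn from ρ_X serve as a proxy for the eigenvalues of L_K. The sketch below is such a plug-in estimate; treating the matrix eigenvalues as those of the integral operator is a standard heuristic assumption made for this illustration, not part of the paper's analysis.

```python
import numpy as np

def effective_dimension(K_matrix, lam):
    # Plug-in estimate of N(lambda) = Tr((L_K + lambda I)^{-1} L_K):
    # the eigenvalues of K/n, where K is the n x n kernel matrix on a
    # sample drawn from rho_X, stand in for those of L_K.
    n = K_matrix.shape[0]
    mu = np.linalg.eigvalsh(K_matrix) / n
    mu = np.clip(mu, 0.0, None)   # clear tiny negative round-off values
    return float(np.sum(mu / (mu + lam)))
```

For kernels whose eigenvalues decay polynomially, this quantity grows like a power of 1/λ, which is the behavior quantified by the capacity assumption used in the learning rates below.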

4 Lin, Guo and Zhou Theorem Assume y M almost surely If = m for j =,, m, then we have E { f D, f ρ D, C κ M { mc m + } mc m + f ρ f ρ m + } mc m and E f D, f D, C κ { M mc m { + mc m } + f ρ f ρ m + } mc m, where C κ is a constant depending only on κ, and C m is the quantity given in terms of m,,, by C m := m + To derive explicit learning rates from the general error bounds in Theorem, one can quantify the increment of by the following capacity assumption, a characteristic of the complexity of the hypothesis space Caponnetto and De Vito, 007; Blanchard and rämer, 00, c β, > 0 7 for some constants β > 0 and c > 0 Let { l, ϕ l } l be a set of normalized eigenpairs of L on H with {ϕ l } l being an orthonormal basis of H and { l } l arranged in a nonincreasing order A sufficient condition for the capacity assumption 7 with 0 < β < is l = Ol /β, which can be found in, eg Caponnetto and De Vito 007 Remark The sufficient condition l = Ol /β with the index β = d τ < is satisfied by the Sobolev space W τ BR d with the smoothness index τ > d/ on a ball BR d of the Euclidean space R d when the marginal distribution ρ X is the uniform distribution on BR d, see Steinwart et al, 009; Edmunds and Triebel, 996 Condition 7 with β = always holds true with the choice of the constant c = κ In fact, the eigenvalues of the operator L + I L are { l l + } l So its trace is given by = l l l + l l = TrL κ In the existing literature on learning rates for the classical least squares algorithm, the regularization parameter is often taken to satisfy the restriction = O as in Caponnetto and De Vito, 007 up to a logarithmic factor or in Steinwart et al, 009 under some assumptions on, ρ X see 4 below Here, to derive learning rates for E f D, f ρ D, with m corresponding to the distributed learning algorithm, we consider to satisfy the following restriction with some constant C 0 > 0, 0 < C 0 and m C 0 8 Corollary 3 Assume y M almost surely If = m for j =,, m, and satisfies 8, then we have E { f D, f ρ D, C κ M + m f } ρ f ρ 4

5 Distributed Learning and E f D, f D, C κ { M + m f } ρ f ρ, where C κ is a constant depending only on κ, C 0, and the largest eigenvalue of L In the special case that f ρ H, the approximation error can be bounded as f f ρ ρ f ρ A more general regularity condition can be imposed for the regression function as f ρ = L r g ρ for some g ρ L ρ X, r > 0, 9 where the integral operator L is regarded as a compact positive operator on L ρ X and its rth power L r is well defined for any r > 0 The condition 9 means f ρ lies in the range of L r, and the special case f ρ H corresponds to the choice r = / It implies f f ρ ρ r g ρ ρ by Lemma below Thus, under condition 9, we obtain from Corollary 3, by choosing to minimize the order of the error bound subject to the restriction 8, the following nice convergence rates for the error f D, f D, of the distributed learning algorithm Corollary 4 Assume regularity condition 9 for some 0 < r, capacity assumption 7 with β = α for some α > 0, and y M almost surely If D j = m for j =,, m with α max{r,}+ +αr m α max{r,}++αr, 0 and = α m α max{r,}+, then we have E f D, f ρ D, C m α max{r,} α max{r,}+ κ,c,r m and E f D, f D, C m α max{r,0} α max{r,}+ κ,c,r, m where C κ,c,r is a constant independent of or m In particular, when f ρ H and m E f D, f D, ρ C κ,c,r α α+ m +α, taking = m 4α+ and E f D, f D, C κ,c,r α α+ yields the rates Remark 5 In Corollary 4, we present learning rates in both H and L ρ X norms The rates with respect to the L ρ X norm provide estimates for the generalization ability of the algorithm for regression The rates with respect to the H norm give error estimates with respect to the uniform metric due to the relation f κ f, and might be used to solve some mismatch problems in learning theory where the generalization ability of learning algorithms is measured with respect to a probability measure µ different from ρ X Remark 6 The learning rates in Corollary 4 are stated for the norms of the difference f D, f D, which reflects the variance of the distributed learning algorithm These rates decrease as m increases subject to the restriction 0 and thereby the regularization parameter increases, which is different from the learning rates presented for E f D, f ρ ρ in Zhang et al, 05 To derive learning rates for E f D, f ρ ρ by our analysis, the regularization parameter should be independent of m, as shown in Corollary below m 5

6 Lin, Guo and Zhou Optimal Learning Rates for Least Squares Regularization Scheme The second main result of this paper is optimal learning rates in expectation for the least squares regularization scheme We can even relax the uniform boundedness to a moment condition that for some constant p, σ ρ L p ρ X, where σ ρ is the conditional variance defined by σ ρx = Y y f ρx dρy x The following learning rates in expectation for the least squares regularization scheme will be proved in Section 5 The existence of f is ensured by E[y < Theorem 7 Assume E[y < and moment condition for some p Then we have E [ f D, f ρ ρ + 56κ κ + κ + + { + f f ρ ρ + σρ p } p p If the parameters satisfy = O, we have the following explicit bound Corollary 8 Assume E[y < and moment condition for some p If satisfies 8 with m =, then we have E [ f D, f ρ ρ = O f f ρ ρ + p p In particular, if p =, that is, the conditional variances are uniformly bounded, then E [ f D, f ρ ρ = O f f ρ ρ + In particular, when regularity condition 9 is satisfied, we have the following learning rates in expectation Corollary 9 Assume E[y <, moment condition for some p, and regularity condition 9 for some 0 < r If the capacity assumption = O α holds with some α > 0, then by taking = α α max{r,}+ we have E [ f D, f ρ ρ = O rα α max{r,}+ + p α α max{r,}+ In particular, when p = the conditional variances are uniformly bounded, we have E [ f D, f ρ ρ = O rα α max{r,}+ 6

7 Distributed Learning Remark 0 It was shown in Caponnetto and De Vito, 007; Steinwart et al, 009; Bauer et al, 007 that under the condition of Corollary 9 with p = and r [,, the best learning rate, called minimax lower rate of convergence, for learning algorithms with output functions in H is O rα 4αr+ So the convergence rate we obtain in Corollary 9 is optimal Combining bounds for f D, f D, ρ and f D, f ρ ρ, we shall prove in Section 6 the following learning rates in expectation for the distributed learning algorithm for regression Corollary Assume y M almost surely and regularity condition 9 for some < r If the capacity assumption = O α holds with some α > 0, = m for j =,, m, and m satisfies the restriction α then by taking = 4αr+, we have [ E f D, f ρ ρ = O αr 4αr+ m αr 4αr+, 3 Remark Corollary shows that distributed learning with the least squares regularization scheme in a RHS can achieve the optimal learning rates in expectation, provided that m satisfies the restriction 3 It should be pointed out that our error analysis is carried out under regularity condition 9 with / < r while the work in Zhang et al, 05 focused on the case with r = / When r approaches /, the number m of local processors under the restriction 3 reduces to, which corresponds to the non-distributed case In our follow-up work, we will consider to relax the restriction 3 in a semi-supervised learning framework by using additional unlabelled data, as done in Caponnetto and Yao, 00 The main contribution of our analysis for distributed learning in this paper is to remove an eigenfunction assumption in Zhang et al, 05 by using a novel second order decomposition for a difference of operator inverses Remark 3 In Corollary and Corollary 9, the choice of the regularization parameter is independent of the number m of local processors This is consistent with the results in Zhang et al, 05 There have been several approaches for selecting the regularization parameter in regularization schemes in the literature including cross-validation Györfy et al, 00; Blanchard and rämer, 06 and the balancing principle De Vito et al, 00 For practical applications of distributed learning algorithms, how to choose and m except the situations when the data are stored naturally in a distributive way is crucial Though we only consider the theoretical topic of error analysis in this paper, it would be interesting to develop parameter selection methods for distributed learning 3 Comparisons and Discussion The least squares regularization scheme is a classical algorithm for regression and has been extensively investigated in statistics and learning theory There is a vast literature on 7

8 Lin, Guo and Zhou this topic Here for a general kernel beyond those for the Sobolev spaces, we compare our results with the best learning rates in the existing literature Under the assumption that the orthogonal projection f H of f ρ in L ρ X onto the closure of H satisfies regularity condition 9 for some r, and that the eigenvalues { i} i of L satisfy i i α with some α > /, it was proved in Caponnetto and De Vito, 007 that [ lim lim sup sup Prob f D, f H τ ρ > τr = 0, where ρ Pα = α log 4αr+, α if < r, α+, if r =, and Pα denotes a set of probability measures ρ satisfying some moment decay condition which is satisfied when y M This learning rate in probability is stated with a limit as τ and it has a logarithmic factor in the case r = In particular, to guarantee f D, f H ρ τ η r with confidence η by this result, one needs to restrict η to be large enough and has the constant τ η depending on η to be large enough Using existing tools for error analysis including those in Caponnetto and De Vito, 007, we develop a novel second order decomposition technique for a difference of operator inverses in this paper, and succeed in deriving the optimal learning rate in expectation in Corollary 9 by removing the logarithmic factor in the case r = Under the assumption that y M almost surely, the eigenvalues { i } i satisfying i ai α with some α > / and a > 0, and for some constant C > 0, the pair, ρ X satisfying f C f α f α ρ 4 for every f H, it was proved in Steinwart et al, 009 that for some constant c α,c depending only on α and C, with confidence η, for any 0 <, π M f D, f ρ ρ 9A a /α M log3/η + c α,c /α Here π M is the projection onto the interval [ M, M defined Chen et al, 004; Wu et al, 006 by M, if fx > M, π M fx = fx, if fx M, M, if fx < M, and A is the approximation error defined by { } A = inf f f ρ f H ρ + f When f ρ H, A = O and the choice = α α+ gives a learning rate of order f D, f ρ ρ = O α α+ Here one needs to impose the condition 4 for the functions spaces L ρ X and H, and to take the projection onto [ M, M Our learning rates 8

9 Distributed Learning in Corollary 9 do not require such a condition for the function spaces, nor do we take the projection Let us mention the fact from Steinwart et al, 009; Mendelson and eeman, 00 that condition 4 is satisfied when H is a Sobolev space on a domain of a Euclidean space and ρ X is the uniform distribution, or when the eigenfunctions {ϕ l / l : l > 0} of L normalized in L ρ X are uniformly bounded Recall that ϕ l L ρx = l since l = l ϕ l = lϕ l, ϕ l = L ϕ l, ϕ l equals x ϕ l xdρ X x, ϕ l = ϕ l xϕ l xdρ X x = ϕ l L ρ X X X Learning rates for the least squares regularization scheme in the H -metric have been investigated in Smale and Zhou, 007 For the distributed learning algorithm with subsets {D j } m j= of equal size, under the assumption that for some constants < k and A <, the eigenfunctions {ϕ i } i satisfy [ ϕ sup i >0 E i x k i A k, when k <, 5 sup i >0 ϕ ix i A, when k =, and that f ρ H and i ai α for some α > / and a > 0, it was proved in Zhang et al, 05 that for = α α+ and m satisfying the restriction k 4α k k α+ m c A 4k log k, when k < α 6 α α+ A 4 log, when k = with a constant c α depending only on α, there holds E [ f D, f ρ = O α α+ This ρ interesting result was achieved by a matrix analysis approach for which the eigenfunction assumption 5 played an essential role It might be challenging to check the eigenfunction assumption 5 involving the integral operator L = L,ρX for the pair, ρ X To our best knowledge, besides the case of finite dimensional RHSs, there exist in the literature only three classes of pairs, ρ X proved satisfying this eigenfunction assumption: the first class Steinwart et al, 009 is the Sobolev reproducing kernels on Euclidean domains together with the unform measures ρ X, the second Williamson et al, 00 is periodical kernels, and the third class can be constructed by a Mercer type expansion x, y = i ϕ i x ϕ i i y, 7 i i where { ϕ ix i } i is an orthonormal system in L ρ X An example of a C Mercer kernel was presented in Zhou, 00, 003 to show that only the smoothness of the Mercer kernel does not guarantee the uniform boundedness of the eigenfunctions { ϕ ix i } i Furthermore, it was shown in Gittens and Mahoney, 06 that these normalized eigenfunctions associated with radial basis kernels tend to be localized when the radial basis parameters are made smaller, which raises a practical concern for the uniform boundedness assumption on the eigenfunctions 9
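The uniform boundedness of the normalized eigenfunctions discussed above is hard to verify analytically, but it can at least be probed numerically on a sample from ρ_X: the eigenvectors of the kernel matrix give approximate values of the L²(ρ_X)-normalized eigenfunctions at the sample points. The sketch below is only such a heuristic diagnostic under that sampling approximation; it is not a procedure from the paper.

```python
import numpy as np

def normalized_eigenfunction_sup_norms(K_matrix, top=10):
    # If u_i is a unit eigenvector of the kernel matrix on x_1, ..., x_n,
    # then sqrt(n) * u_i[j] approximates the value at x_j of the
    # L^2(rho_X)-normalized eigenfunction phi_i / sqrt(lambda_i).
    # Large sup-norms therefore hint at localized eigenfunctions, the
    # practical concern raised by Gittens and Mahoney (2016).
    n = K_matrix.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K_matrix)
    leading = np.argsort(eigvals)[::-1][:top]
    return [float(np.max(np.abs(np.sqrt(n) * eigvecs[:, i]))) for i in leading]
```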

10 Lin, Guo and Zhou Remark 4 To see how the eigenfunctions {ϕ i x} i change with the marginal distribution, we consider a different measure µ induced by a nonnegative function P L ρ X as dµ = P xdρ X An eigenpair, ϕ of the integral operator L,µ associated with the pair, µ with > 0 satisfies L,µ ϕ =, xϕxp xdρ X = ϕ i i X i i >0 X ϕ i x i ϕxp xdρ X = ϕ We introduce an index set I := {i : i > 0}, a possibly infinite matrix P ϕ i x = i P x ϕ jx dρ X, 8 i j X and express ϕ in terms of the orthonormal basis {ϕ i } i I as ϕ = i c i i I ϕ i where the ϕ sequence c = c i i I is given by c i = ϕ, i i L ρx = i ϕ, ϕ i Then the eigenpair, ϕ satisfies L,µ ϕ = ϕ i I ϕ i i j I P i,j c j = i I i,j I c i i ϕ i P c = c Thus the eigenpairs of the integral operator L,µ associated with, µ can be characterized by those of the possibly infinite matrix P defined by 8 Finding the eigenpairs of P is an interesting question involving matrix analysis in linear algebra and multiplier operators in harmonic analysis ote that when P corresponding to µ = ρ X, P is diagonal with diagonal entries { i } i and its eigenvectors yield eigenfunctions {ϕ i } i From the above observation, we can see that the marginal distribution ρ X plays an essential role in the eigenfunction assumption 5 which might be difficult to check for a general marginal distribution ρ X For example, it is even unknown whether any of the Gaussian kernels on X = [0, d satisfies the eigenfunction assumption 5 when ρ X is a general Borel measure Our learning rates stated in Corollary 4 or Corollary do not require the eigenfunction assumption 5 Moreover, our restriction 0 for the number m of local processors in Corollary 4 is more general than 6 when α is close to /: with r = / corresponding to the condition f ρ H, our restriction 0 in Corollary 4 is m 4+6α with the power index tending to 7 while the restriction in 6 with k = takes the form m c α α α+ A 4 log with the power index tending to 0 as α ote that the learning rates stated in Corollary 4 are for the difference f D, f D, between the output function of the distributed learning algorithm and that of the algorithm using the whole data In the special case of α α+ m r =, we can see that E f D, f ρ D, C κ,c,r 4α+, achieved by choosing = α m α+, is smaller as m becomes larger Here the dependence of on m is crucial for achieving this convergence rate of the sample error: if we fix and, the error bounds for E f D, f D, stated in Theorem and Corollary 3 become larger as m increases On the other hand, as one expects, increasing the number m of local processors would increase 0

the approximation error for the regression problem, which can also be seen from the bound stated in Theorem 7 with the corresponding choice of λ. The result in Corollary 11 with r > 1/2 compensates and gives the optimal learning rate E[‖f̄_{D,λ} − f_ρ‖_ρ] = O(|D|^{−2rα/(4rα+1)}) by restricting m as in (13).

Besides the divide-and-conquer technique, there are many other approaches in the literature to reduce the computational complexity in handling big data. These include localized learning (Meister and Steinwart, 2016), Nyström regularization (Bach, 2013; Rudi et al., 2015), and online learning (Dekel et al., 2012). The divide-and-conquer technique has the advantage of reducing the single-machine complexity of both space and time without a significant loss of prediction power.

It is an important and interesting problem how the big data are decomposed for distributed learning. Here we only study the approach of random decomposition, and the data subsets are regarded as being drawn from the same distribution. There should be better decomposition strategies for some practical applications. For example, intuitively data points should be divided according to their spatial distribution so that the learning process would yield locally optimal predictors, which could then be synthesized to the output function. See (Thomann et al., 2016).

In this paper, we consider the regularized least squares with Mercer kernels. Our results might be extended to more general kernels. A natural class consists of bounded and positive semidefinite kernels as studied in (Steinwart and Scovel, 2012). By means of the Mercer type expansion (17), one needs some assumptions on the system {ϕ_i(x)/√λ_i}_i, the domain X, and the measure ρ_X to relax the continuity of the kernel while keeping compactness of the integral operator. How to minimize the assumptions and to maximize the scope of applications of the framework, such as the situation of an input space X without a metric (Shen et al., 2014; De Vito et al., 2013), is a valuable question to investigate.

Here we only consider distributed learning with the regularized least squares. It would be of great interest and value to develop the theory for distributed learning with other algorithms such as spectral algorithms (Bauer et al., 2007), empirical feature-based learning (Guo and Zhou, 2012; Guo et al., 2017; Shi et al., 2011), the minimum error entropy principle (Hu et al., 2015; Fan et al., 2016), and randomized Kaczmarz algorithms (Lin and Zhou, 2015).

Remark 15. After the submission of this paper, in our follow-up paper by Z. C. Guo, S. B. Lin, and D. X. Zhou entitled Learning theory of distributed spectral algorithms, published in Inverse Problems, error analysis and optimal learning rates for distributed learning with spectral algorithms were derived. In late 2016, we learned that similar analysis was carried out for classical non-distributed spectral algorithms implemented in one machine by G. Blanchard and N. Mücke in a paper entitled Optimal rates for regularization of statistical inverse learning problems (arXiv preprint, April 2016), and by L. H. Dicker, D. P. Foster, and D. Hsu in a paper entitled Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators (arXiv preprint, May 2016). We are indebted to one of the referees for pointing this out to us.

4. Second Order Decomposition of Operator Differences and Norms

Our error estimates are achieved by a novel second order decomposition of operator differences in our integral operator approach. We approximate the integral operator L_K by the empirical integral operator L_{K,D(x)} on H_K defined with the input data set D(x) = {x_i}_{i=1}^{|D|} = {x : (x, y) ∈ D for some y ∈ Y} as

L_{K,D(x)} f = (1/|D|) Σ_{x ∈ D(x)} f(x) K_x = (1/|D|) Σ_{x ∈ D(x)} ⟨f, K_x⟩_K K_x,   f ∈ H_K,   (19)

where the reproducing property f(x) = ⟨f, K_x⟩_K for f ∈ H_K is used. Since K is a Mercer kernel, L_{K,D_j(x)} is a finite-rank positive operator and L_{K,D_j(x)} + λI is invertible. The operator difference in our study is that of the inverses of two operators defined by

Q_{D(x)} = (L_{K,D(x)} + λI)^{-1} − (L_K + λI)^{-1}.   (20)

If we denote A = L_{K,D(x)} + λI and B = L_K + λI, then our second order decomposition for the operator difference Q_{D(x)} is a special case of the following second order decomposition for the difference A^{-1} − B^{-1}.

Lemma 16. Let A and B be invertible operators on a Banach space. Then we have

A^{-1} − B^{-1} = B^{-1}{B − A}B^{-1} + B^{-1}{B − A}A^{-1}{B − A}B^{-1}.   (21)

Proof. We can decompose the operator A^{-1} − B^{-1} as

A^{-1} − B^{-1} = B^{-1}{B − A}A^{-1}.   (22)

This is the first order decomposition. Write the last term A^{-1} as B^{-1} + (A^{-1} − B^{-1}) and apply another first order decomposition similar to (22), A^{-1} − B^{-1} = A^{-1}{B − A}B^{-1}. It follows from (22) that

A^{-1} − B^{-1} = B^{-1}{B − A}{ B^{-1} + A^{-1}{B − A}B^{-1} }.

Then the desired identity (21) is verified. The lemma is proved.

To demonstrate how the second order decomposition (21) leads to tight error bounds for the classical least squares regularization scheme (1) with the output function f_{D,λ}, we recall the following formula (see, e.g., Caponnetto and De Vito, 2007; Smale and Zhou, 2007):

f_{D,λ} − f_λ = (L_{K,D(x)} + λI)^{-1} Δ_D,   Δ_D := (1/|D|) Σ_{z ∈ D} (ξ(z) − E[ξ]),   (23)

where Δ_D is induced by the random variable ξ with values in the Hilbert space H_K defined as

ξ(z) = (y − f_λ(x)) K_x,   z = (x, y) ∈ Z.   (24)
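Since Lemma 16 is a purely algebraic identity, it can be sanity-checked numerically on any pair of invertible operators; the sketch below does so for random positive definite matrices playing the roles of A = L_{K,D(x)} + λI and B = L_K + λI. The matrices and dimensions are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 8, 0.1

# Random positive definite matrices standing in for
# A = L_{K,D(x)} + lambda*I and B = L_K + lambda*I.
MA = rng.standard_normal((d, d))
MB = rng.standard_normal((d, d))
A = MA @ MA.T / d + lam * np.eye(d)
B = MB @ MB.T / d + lam * np.eye(d)

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)

# First order decomposition (22) and second order decomposition (21).
first_order = Binv @ (B - A) @ Ainv
second_order = Binv @ (B - A) @ Binv + Binv @ (B - A) @ Ainv @ (B - A) @ Binv

print(np.max(np.abs(Ainv - Binv - first_order)))    # close to machine precision
print(np.max(np.abs(Ainv - Binv - second_order)))   # close to machine precision
```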

13 Distributed Learning Then we can express f D, f by means of the notation Q Dx = L,Dx + I L + I as f D, f = [ Q Dx D + L + I D Due to the identity between the L ρ X norm and the H metric g ρ = L g, g L ρ X, 5 we can estimate the error f D, f ρ = L f D, f by bounding the H norm of the following expression obtained from the second order decomposition with B A = L L,Dx { L [ QDx D = L L + I { + L L + I { {L } } } L,Dx L + I {L + I D } { L + I { } } L L,Dx L + I D } { L + I { } } L L,Dx L,Dx + I Combining this expression with the operator norm bound L L + I and the notation Ξ D = L + I { } L L,Dx 6 for convenience, we find L [ L QDx D Ξ D + I D + Ξ D L,Dx + I Ξ D L + I D By decomposing L + I D as L + I L + I D and using the bounds L,Dx + I, L + I /, we know that and L [ QDx D f D, f ρ ΞD + Ξ D L + I D 7 L ΞD + Ξ D + + I D 8 Thus the classical least squares regularization scheme can be analyzed after estimating the operator norm Ξ D = L + I { } L L,Dx and the function norm L + I D In the following two lemmas, to be proved in the appendix, these norms are estimated in terms of effective dimensions by methods in the existing literature Caponnetto and De Vito, 007; Bauer et al, 007; Blanchard and rämer, 00 3

14 Lin, Guo and Zhou Lemma 7 Let D be a sample drawn independently according to ρ Then the following estimates for the operator norm L + I { } L L,Dx hold [ L a E + I { } L L,Dx κ b For any 0 < δ <, with confidence at least δ, there holds L + I { } L L,Dx B, log/δ, 9 where we denote the quantity B, = { κ κ + } 30 c For any d >, there holds { [ L E + I { } } d d L L,Dx dγd + d B,, where Γ is the Gamma function defined for d > 0 by Γd = 0 u d exp { u} du Lemma 8 Let D be a sample drawn independently according to ρ and g be a measurable bounded function on Z and ξ g be the random variable with values in H defined by ξ g z = gz x for z = x, y Z Then the following statements hold [ L a E + I / x = b For almost every x X, there holds L + I / x κ c For any 0 < δ <, with confidence at least δ, there holds { L + I / ξ g z E [ξ g g log/δ κ + } z D Remark 9 In the existing literature, the first order decomposition was used To bound the norm of L,Dx + I D by this approach, one needs to either use the brute force estimate L,Dx + I leading to suboptimal learning rates, or applying the approximation of L by L,Dx which is valid only with confidence and leads to confidencebased estimates with depending on the confidence level In our second order decomposition, we decompose the inverse operator L,Dx + I further after the first order decomposition This leads to finer estimates with an additional factor B AB in the second term of the bound and gives the refined error bound 8 4

15 Distributed Learning 5 Deriving Error Bounds for Least Squares Regularization Scheme In this section we prove our main result on error bounds for the least squares regularization scheme, which illustrates how to apply the second order decomposition for operator differences in our integral operator approach Proposition 0 Assume E[y < and moment condition for some p Then E [ f D, f ρ + 56κ κ + + { κ p σ p p } p f f ρ ρ ρ + κ Proof We apply the error bound 8 and the Schwarz inequality to get { [ E [ f D, f ρ E + Ξ } / { [ L } D + Ξ D E + I / / D The first factor in the above bound involves Ξ D = it can be estimated by applying Lemma 7 as { [ E + Ξ D + Ξ D where L + I } / { [ Ξ } / { [ + E D Ξ 4 + E D { { κ 49B 4, + } / + 3 { } L L,Dx and } / } / + 56κ4 + 57κ 3 To deal with the factor in the bound 3, we separate D = z D ξ z E[ξ as D := z D D = D + D, ξ 0 z, D := ξ ξ 0 z E[ξ are induced by another random variable ξ 0 with values in the Hilbert space H defined by z D ξ 0 z = y f ρ x x, z = x, y Z 33 Then the second factor in 3 can be separated as { [ L } E + I / / D { E [ L + I / D } / + { [ L E + I / D } / 34 5

16 Lin, Guo and Zhou Let us bound the first term of 34 Observe that L + I / D = z D y f ρx L + I / x Each term in this expression is unbiased because Y y f ρx dρy x = 0 This unbiasedness and the independence tell us that { [ L E + I / D } / = If σρ L, then σρx σρ and by Lemma 8 we have = { [ L E + I / { [ y [ E fρ x L + I / } x / { [ [ E σρx L + I / x } / 35 D } / σ ρ / If σ ρ L p ρ X with p <, we take q = p p q = for p = satisfying p + q = and apply the Hölder inequality E[ ξη E[ ξ p /p E[ η q /q to ξ = σρ, η = L + I / x to find But [ [ E σρx L + I / x [ and E L + I / x [ L + I / x { [ [ σ p ρ E L + I / x q κ/ q = by Lemma 8 So we have q } /q { [ L E + I / D } / = { σ ρ σ ρ p κ p { } κ q /q } / p q p p Combining the above two cases, we know that for either p = or p <, the first term of 34 can be bounded as { [ L E + I / D } / σρ p κ p 6 p p

17 Distributed Learning The second term of 34 can be bounded easily as { [ L E + I / D } / [ {E f ρ x f x L + I / x {E [f ρ x f x κ } / = κ f ρ f ρ } / Putting the above estimates for the two terms of 34 into 3 and applying the bound 3 involving Ξ D, we know that E [ f D, f ρ is bounded by + 56κ4 + 57κ σρ p κ p p p κ f ρ f ρ + Then our desired error bound follows The proof of the proposition is complete Proof of Theorem 7 Combining Proposition 0 with the triangle inequality f D, f ρ ρ f D, f ρ + f f ρ ρ, we know that E [ f D, f ρ ρ f f ρ ρ κ κ + + { κ p σ p p } p κ ρ + f f ρ ρ Then the desired error bound holds true, and the proof of Theorem 7 is complete Proof of Corollary 8 By the definition of effective dimension, = l l l + + Combining this with the restriction 8 with m =, we find +C 0 and +C 0 C 0 Putting these and the restriction 8 with m = into the error bound, we know that E [ f D, f ρ ρ + 56κ κ + κ + + C 0 C0 + C 0 { + + C 0 C 0 / f f ρ ρ + σρ p Then the desired bound follows The proof of Corollary 8 is complete p p } To prove Corollary 9, we need the following bounds Smale and Zhou, 007 for f f ρ ρ and f f ρ 7

18 Lin, Guo and Zhou Lemma Assume regularity condition 9 with 0 < r There holds f f ρ ρ r g ρ ρ 36 Furthermore, if / r, then we have f f ρ r / g ρ ρ 37 Proof of Corollary 9 By Lemma, regularity condition 9 with 0 < r implies f f ρ ρ r g ρ ρ If C 0 α, > 0 for some constant C 0, then the choice = α α max{r,}+ yields C 0 α α+ = C 0 α max{r,}+ C 0 So 8 with m = is satisfied With this choice we also have p p C p 0 α max{r,} α max{r,}+ = C p 0 α max{r,} α max{r,}+ + p p α max{r,}+ α α max{r,}+ α α max{r,}+ p Putting these estimates into Corollary 8, we know that E [ f D, f ρ ρ = O αr α max{r,}+ + = O α min{r,max{r,}} + α max{r,}+ p α max{r,} α max{r,}+ + α p α max{r,}+ α α max{r,}+ But we find min {r, max{r, }} = r by discussing the two different cases 0 < r < and immediately The proof of Corollary 9 is complete r Then our conclusion follows 6 Proof of Error Bounds for the Distributed Learning Algorithm To analyze the error f D, f D, for distributed learning, we recall the notation Q Dx = L,Dx + I L + I for the difference of inverse operators and use the notation Q Dj x involving the data subset D j The empirical integral operator L,Dj x is defined with D replaced by the data subset D j For our error analysis for the distributed learning algorithm, we need the following error decomposition for f D, f D, 8

19 Distributed Learning Lemma Assume E[y < For > 0, we have f D, f D, = = m j= m j= [ L,Djx + I L,Dx + I j [ Q Dj x j + m j= [ Q Dj x j [ Q Dx D, 38 where and j := j := ξ z E[ξ, D := z D j ξ z E[ξ, z D ξ 0 z, j := ξ ξ 0 z E[ξ z D j z D j Proof Recall the expression 3 for f D, f When the data subset D j is used, we have So we know that f D, f = m j= We can decompose D as D = f Dj, f = ξ z E[ξ = z D L,Dj x + I j { fdj, f } = m m j= in the expression 3 for f D, f and find f D, f = m j= j= L,Dj x + I j ξ z E[ξ = m z D j j= L,Dx + I j Then the first desired expression for f D, f D, follows j By adding and subtracting the operator L + I, writing j = j + j, and noting E[ξ 0 = 0, we know that the first expression implies 38 This proves Lemma Before proving our first main result on the error f D, f D, in the H metric and L ρ metric, we state the following more general result which allows different sizes for data subsets {D j } 9

20 Lin, Guo and Zhou Theorem 3 Assume that for some constant M > 0, y M almost surely Then we have and [ E f D, f ρ D, f ρ f ρ Dj C κm m Dj j= + S, + C κ [ f E D, f D, C κm f ρ f ρ Dj + S, m Dj j= + C κ S, + S3, + S, S, + S3, + S, M M + C κ m j= + f ρ f ρ + C κ m j= + f ρ f ρ, where C κ is a constant depending only on κ, and S Dj, is the quantity given by S Dj, = + Proof Recall the expression 38 for f D, f D, in Lemma It enables us to express where the terms J, J, J 3 are given by J = m j= [ L / Q D j x L / { } f D, f D, = J + J + J 3, 39 j, J = m j= [ L / Q D j x [ j, J 3 = L / Q Dx D These three terms will be dealt with separately in the following For the first term[ J of 39, each summand with j {,, m} can be expressed as z D j y f ρx L / Q D j x x, and is unbiased because Y y f ρxdρy x = 0 The unbiasedness and the independence tell us that E [ J { [ } / E J m j= Dj [ [ E L / Q D j x j / 40 Let j {,, m} The estimate 7 derived from the second order decomposition yields [ L / Q D j x j ΞDj + Ξ D j L + I / j 4 0

21 Distributed Learning ow we apply the formula E[ξ = 0 Prob [ξ > t dt 4 to estimate the expected value of 4 By Part b of Lemma 7, for 0 < δ <, there exists a subset Z D j δ, of Z D j of measure at least δ such that Ξ Dj B Dj, log/δ, D j Z D j δ, 43 Applying Part c of Lemma 8 to gz = y f ρ x with g M and the data subset D j, we know that there exists another subset Z D j δ, of Z Dj of measure at least δ such that L + I / j M κ B, log/δ, D j Z D j Combining this with 43 and 4, we know that for D j Z D j δ, Z D j δ,, δ, [ L / Q D j x j B Dj, + B4, M κ B, log/δ6 Since the measure of the set Z D j δ, Z D j δ, is at least δ, by denoting C Dj, = 64 B Dj, + B4, M κ B,, we see that [ [ Prob L / Q D j x j > C, log/δ 6 δ For 0 < t <, the equation C Dj, log/δ 6 = t has the solution { } /6 δ t = exp t/c Dj, When δ t <, we have [ [ Prob L / Q D j x j { } /6 > t δ t = 4 exp t/c Dj, This inequality holds trivially when δ t since the probability is at most Thus we can [ apply the formula 4 to the nonnegative random variable ξ = L / Q D j x j and obtain [ [ { } E L / Q D j x /6 j = Prob [ξ > t dt 4 exp t/c Dj, dt 0 0

22 Lin, Guo and Zhou which equals 4Γ6C Dj, Therefore, m Dj E [ J 4Γ6C Dj, j= 536 m Dj κ 5κM + κ { + 8κ + } j= For the second term J of 39, we apply 7 again and obtain [ L / Q D j x ΞDj j + Ξ L D j + I / j Applying the Schwarz inequality and Lemmas 7 and 8, we get [[ / E L / Q D j x j E ΞDj + Ξ D j It follows that E [ J m j= / { [ E f ρ x f x L + I / x Dj { κ } / { } 49B 4 / Dj, + κ f ρ f ρ Dj κ + 56κ κ + κ f ρ f ρ Dj κ f ρ f ρ κ κ Dj + 56κ + } / The last term J 3 of 39 has been handled by 7 and in the proof of Proposition 0 by ignoring the summand in the bound 3, and we find from the trivial bound σ ρ 4M with p = that κ E [ J 3 κ + 56κ + κ f ρ f ρ M + Combining the above estimates for the three terms of 39, we see that the desired error bound in the L ρ X metric holds true The estimate in the H metric follows from the steps in deriving the error bound in the L ρ X metric except that in the representation 39 the operator L / in the front disappears This change gives an additional factor /, the bound for the operator L + I /, and proves the desired error bound in the H metric /

23 Distributed Learning Remark 4 A crucial step in the above error analysis for the distributed learning algorithm is to use the unbiasedness and independence to get 40 where the norm is squared in the expected value Thus we can obtain the optimal learning rates in expectation for the distributed learning algorithm but have difficulty in getting rates in probability It would be interesting to derive error bounds in probability by combining our second order decomposition technique with some analysis in the literature Caponnetto and De Vito, 007; Blanchard and rämer, 00; Steinwart and Christmann, 008; Wu and Zhou, 008 Proof of Theorem Since = m for j =,, m, the bound in Theorem 3 in the L ρ X metric can be simplified as } [ E f D, f ρ D, C κm m m +C κ +C κ f ρ f ρ m f ρ f ρ + M + { + m m + m m m } C κm m + m + { + m m + +C κ f ρ f ρ m + + m m m + C κm C m { m + m Cm } + C κ f ρ f ρ m + mc m + Then the desired error bound in the L ρ X metric with C κ = C κ follows The proof for the error bound in the H metric is similar The proof of Theorem is complete Proof of Corollary 3 As in the proof of Corollary 8, the restriction 8 implies m and +C 0 C 0 It follows that +C 0 and with C κ := +C 0 C 0 m + C 0 C 0 + C 0 C 0 +, C m C κ m Putting these bounds into Theorem and applying the restriction C 0, we know that E { } 3 f D, f ρ m m D, C κ C κ + M + m f ρ f ρ C κ C κ 3 + C 0 3 { M } m f ρ f ρ C 0 +

24 Lin, Guo and Zhou and E 3 f D, f D, C κ C κ + { C 0 M C 0 + m f } ρ f ρ Then the desired error bounds hold by taking the constant C κ = C κ C κ 3 + C0 This proves Corollary 3 Proof of Corollary 4 If C 0 α, > 0 for some constant C 0, then the choice = m choice we also have m C 0 m αmax{r,} α max{r,}+ α α max{r,}+ satisfies 8 With this Since regularity condition 9 yields f f ρ ρ g ρ ρ r by Lemma, we have by Corollary 3, E f D, f ρ D, C m αmax{r,} α max{r,}+ m α α max{r,}+ κ C0 g ρ m ρ m +M The inequality α m α max{r,}+ r m + m α α max{r,}+ α m m α max{r,}+ r + α α max{r,}+ is equivalent to α α max{r,}+ r and it can be expressed as 0 Since 0 is valid, we have E f D, f ρ D, C m α max{r,} α max{r,}+ κ C0 g ρ ρ + M m This proves the first desired convergence rate The second rate follows easily This proves Corollary 4 Proof of Corollary By Corollary 9, with the choice = 4αr+, we can immediately bound f D, f ρ ρ as E [ f D, f ρ ρ = O αr 4αr+ The assumption = O α tells us that for some constant C 0, C 0 α, > 0 α So the choice = α 4αr+ yields m m +α α C 0 = C 0 m +α 4αr+ = C 0 m α r 4αr+ 44 4

25 Distributed Learning If m satisfies 3, then 8 is valid, and by Corollary 3, E f D, f ρ D, C κ C0 α r 4αr+ r m g ρ + M C α r κ C0 g ρ + M r m 4αr+ + α r 4αr+ r = C κ C0 g ρ + M αr 4αr+ m αr + 4αr+ + Since 3 is satisfied, we have E f D, f D, ρ C κ C0 g ρ + M αr 4αr+, and thereby [ f E D, f ρ ρ = O αr 4αr+ This proves Corollary Acknowledgments We thank the anonymous referees for their constructive suggestions The work described in this paper is supported partially by the Research Grants Council of Hong ong [Project o CityU 3044 The corresponding author is Ding-Xuan Zhou Appendix This appendix provides detailed proofs of the norm estimates stated in Lemmas 7 and 8 involving the approximation of L by L,Dx To this end, we need the following probability inequality for vector-valued random variables in Pinelis, 994 Lemma 5 For a random variable ξ on Z, ρ with values in a separable Hilbert space H, satisfying ξ M < almost surely, and a random sample {z i } s i= independent drawn according to ρ, there holds with confidence δ, s [ξz i Eξ s M log/ δ E ξ + log/ δ 45 s s i= Proof of Lemma 7 We apply Lemma 5 to the random variable η defined by η x = L + I /, x x, x X 46 It takes values in HSH, the Hilbert space of Hilbert-Schmidt operators on H, with inner product A, B HS = TrB T A Here Tr denotes the trace of a trace-class linear operator The norm is given by A HS = i Ae i where {e i} is an orthonormal basis 5

26 Lin, Guo and Zhou of H The space HSH is a subspace of the space of bounded linear operators on H, denoted as LH,, with the norm relations A A HS, AB HS A HS B 47 ow we use effective dimensions to estimate norms involving η The random variable η defined by 46 has mean Eη = L + I / L and sample mean L + I / L,Dx Recall the set of normalized in H eigenfunctions {ϕ i } i of L It is an orthonormal basis of H If we regard L as an operator on L ρ X, the normalized eigenfunctions in L ρ X are { i ϕ i } i and they form an orthonormal basis of the orthogonal complement of the eigenspace associated with eigenvalue 0 By the Mercer Theorem, we have the following uniform convergent Mercer expansion x, y = i i i ϕ i x i ϕ i y = i ϕ i xϕ i y 48 Take the orthonormal basis {ϕ i } i of H By the definition of the HS norm, we have η x HS = L + I /, x x ϕ i i For a fixed i,, x x ϕ i = ϕ i x x, and x H can be expended by the orthonormal basis {ϕ l } l as x = l ϕ l, x ϕ l = l ϕ l xϕ l 49 Hence η x HS = ϕ ix i l = ϕ ix i l Combining this with 48, we see that ϕ l x L + I / ϕ l ϕ l x l + ϕ l = i ϕ i x l ϕ l x l + η x HS = x, x l ϕ l x, x X 50 l + and But E [ η x HS κ E X [ l ϕ l x = κ l + l X ϕ lx dρ X l + ϕ l x dρ X = ϕ l L = ρ X l ϕ l = l 5 l L ρ X 6

27 Distributed Learning So we have and E E [ η HS κ l l + = κ Tr L + I L = κ 5 l x Dx η x E[η HS [ L = E + I / { } L L,Dx HS Then our desired inequality in Part a follows from the first inequality of 47 From 49 and 50, we find a bound for η as κ η x HS κ l ϕ l x κ x, x κ, x X Applying Lemma 5 to the random variable η with M = κ, we know by 47 that with confidence at least δ, E[η η x E[η η x x Dx Writing the above bound by taking a factor x Dx κ log/δ + HS κ log/δ κ log/δ, we get the desired bound 9 Recall B, defined by 30 Apply the formula 4 for nonnegative random variables to ξ = L + I / { } d L L,Dx and use the bound Prob [ξ > t = Prob derived from 9 for t log d B, We find { [ξ d > t d exp E [ L + I / { L L,Dx } d log d B, + t d B, 0 } exp { t d B, The second term on the right-hand side of above equation equals db, d 0 u d exp { u} du Then the desired bound in Part c follows from 0 u d exp { u} du = Γd and the lemma is proved Proof of Lemma 8 Consider the random variable η defined by η z = L + I / x, z = x, y Z 53 } dt 7

It takes values in H_K. By (49), it satisfies

η_2(z) = (L_K + λI)^{-1/2} Σ_l ϕ_l(x) ϕ_l = Σ_l (λ_l + λ)^{-1/2} ϕ_l(x) ϕ_l,

so

E[ ‖η_2(z)‖_K² ] = E[ Σ_l ϕ_l(x)²/(λ_l + λ) ] = Σ_l λ_l/(λ_l + λ) = N(λ).

This is the statement of Part (a). We also see from (49) that for x ∈ X,

‖η_2(z)‖_K² = Σ_l ϕ_l(x)²/(λ_l + λ) ≤ (1/λ) Σ_l ϕ_l(x)² = K(x, x)/λ ≤ κ²/λ.

This verifies the statement of Part (b). For Part (c), we consider another random variable η_3 defined by

η_3(z) = (L_K + λI)^{-1/2} g(z) K_x,   z = (x, y) ∈ Z.   (54)

It takes values in H_K and satisfies η_3(z) = g(z)(L_K + λI)^{-1/2} K_x = g(z) η_2(z). So

‖η_3(z)‖_K ≤ κ ‖g‖_∞ / √λ,   z ∈ Z,

and

E[ ‖η_3‖_K² ] ≤ ‖g‖_∞² E[ Σ_l ϕ_l(x)²/(λ_l + λ) ] = ‖g‖_∞² N(λ).

Applying Lemma 25 verifies the statement in Part (c). The proof of the lemma is complete.

References

F. Bach. Sharp analysis of low-rank kernel matrix approximations. Proceedings of the 26th Annual Conference on Learning Theory, pp. 185-209, 2013.

F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23:52-72, 2007.

G. Blanchard and N. Krämer. Optimal learning rates for kernel conjugate gradient regression. Advances in Neural Information Processing Systems, pp. 226-234, 2010.

G. Blanchard and N. Krämer. Convergence rates for kernel conjugate gradient for random design regression. Analysis and Applications, 14, 2016.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.

A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8:161-183, 2010.

D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5:1143-1175, 2004.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165-202, 2012.

E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5:59-85, 2005.

E. De Vito, S. Pereverzyev, and L. Rosasco. Adaptive kernel methods using the balancing principle. Foundations of Computational Mathematics, 10:455-479, 2010.

E. De Vito, V. Umanità, and S. Villa. An extension of Mercer theorem to vector-valued measurable kernels. Applied and Computational Harmonic Analysis, 34:339-351, 2013.

D. E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, Cambridge, 1996.

T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1-50, 2000.

J. Fan, T. Hu, Q. Wu, and D. X. Zhou. Consistency analysis of an empirical minimum error entropy algorithm. Applied and Computational Harmonic Analysis, 41:164-189, 2016.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264, 2001.

A. Gittens and M. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17:1-65, 2016.

X. Guo and D. X. Zhou. An empirical feature-based learning algorithm producing sparse approximations. Applied and Computational Harmonic Analysis, 32, 2012.

Z. C. Guo, D. H. Xiang, X. Guo, and D. X. Zhou. Thresholded spectral algorithms for sparse approximations. Analysis and Applications, 15, 2017.

L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, Berlin, 2002.

T. Hu, J. Fan, Q. Wu, and D. X. Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 13, 2015.

J. H. Lin, L. Rosasco, and D. X. Zhou. Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, 17(77):1-38, 2016.

J. H. Lin and D. X. Zhou. Learning theory of randomized Kaczmarz algorithm. Journal of Machine Learning Research, 16:3341-3365, 2015.

S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38:526-565, 2010.

M. Meister and I. Steinwart. Optimal learning rates for localized SVMs. Journal of Machine Learning Research, 17:1-44, 2016.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22:1679-1706, 1994.

G. Raskutti, M. Wainwright, and B. Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15, 2014.

A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 2015.

B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In the 52nd Annual Allerton Conference on Communication, Control and Computing, 2014.

W. J. Shen, H. S. Wong, Q. W. Xiao, X. Guo, and S. Smale. Introduction to the peptide binding problem of computational immunology: new results. Foundations of Computational Mathematics, 14:951-984, 2014.

L. Shi, Y. L. Feng, and D. X. Zhou. Concentration estimates for learning with l^1-regularizer and data dependent hypothesis spaces. Applied and Computational Harmonic Analysis, 31:286-302, 2011.

S. Smale and D. X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153-172, 2007.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (S. Dasgupta and A. Klivans, eds.), pp. 79-93, 2009.

I. Steinwart and C. Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363-417, 2012.

P. Thomann, I. Steinwart, I. Blaschzyk, and M. Meister. Spatial decompositions for large scale SVMs. arXiv preprint, 2016.


More information

Less is More: Computational Regularization by Subsampling

Less is More: Computational Regularization by Subsampling Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro

More information

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms (February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops

More information

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

arxiv: v1 [math.st] 28 May 2016

arxiv: v1 [math.st] 28 May 2016 Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators Lee H. Dicker Dean P. Foster Daniel Hsu arxiv:1605.08839v1 [math.st] 8 May 016 May 31, 016 Abstract

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces

Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces Junhong Lin 1, Alessandro Rudi 2,3, Lorenzo Rosasco 4,5, Volkan Cevher 1 1 Laboratory for Information and Inference

More information

Mercer s Theorem, Feature Maps, and Smoothing

Mercer s Theorem, Feature Maps, and Smoothing Mercer s Theorem, Feature Maps, and Smoothing Ha Quang Minh, Partha Niyogi, and Yuan Yao Department of Computer Science, University of Chicago 00 East 58th St, Chicago, IL 60637, USA Department of Mathematics,

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Analysis of Spectral Kernel Design based Semi-supervised Learning

Analysis of Spectral Kernel Design based Semi-supervised Learning Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,

More information

Functional Analysis Review

Functional Analysis Review Outline 9.520: Statistical Learning Theory and Applications February 8, 2010 Outline 1 2 3 4 Vector Space Outline A vector space is a set V with binary operations +: V V V and : R V V such that for all

More information

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department

More information

1 Math 241A-B Homework Problem List for F2015 and W2016

1 Math 241A-B Homework Problem List for F2015 and W2016 1 Math 241A-B Homework Problem List for F2015 W2016 1.1 Homework 1. Due Wednesday, October 7, 2015 Notation 1.1 Let U be any set, g be a positive function on U, Y be a normed space. For any f : U Y let

More information

arxiv: v2 [math.pr] 27 Oct 2015

arxiv: v2 [math.pr] 27 Oct 2015 A brief note on the Karhunen-Loève expansion Alen Alexanderian arxiv:1509.07526v2 [math.pr] 27 Oct 2015 October 28, 2015 Abstract We provide a detailed derivation of the Karhunen Loève expansion of a stochastic

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

1. Mathematical Foundations of Machine Learning

1. Mathematical Foundations of Machine Learning 1. Mathematical Foundations of Machine Learning 1.1 Basic Concepts Definition of Learning Definition [Mitchell (1997)] A computer program is said to learn from experience E with respect to some class of

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Recall that any inner product space V has an associated norm defined by

Recall that any inner product space V has an associated norm defined by Hilbert Spaces Recall that any inner product space V has an associated norm defined by v = v v. Thus an inner product space can be viewed as a special kind of normed vector space. In particular every inner

More information

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels

EECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels EECS 598: Statistical Learning Theory, Winter 2014 Topic 11 Kernels Lecturer: Clayton Scott Scribe: Jun Guo, Soumik Chatterjee Disclaimer: These notes have not been subjected to the usual scrutiny reserved

More information

Convergence of Eigenspaces in Kernel Principal Component Analysis

Convergence of Eigenspaces in Kernel Principal Component Analysis Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation

More information

Spectral theory for compact operators on Banach spaces

Spectral theory for compact operators on Banach spaces 68 Chapter 9 Spectral theory for compact operators on Banach spaces Recall that a subset S of a metric space X is precompact if its closure is compact, or equivalently every sequence contains a Cauchy

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information

Spectral Geometry of Riemann Surfaces

Spectral Geometry of Riemann Surfaces Spectral Geometry of Riemann Surfaces These are rough notes on Spectral Geometry and their application to hyperbolic riemann surfaces. They are based on Buser s text Geometry and Spectra of Compact Riemann

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define Homework, Real Analysis I, Fall, 2010. (1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define ρ(f, g) = 1 0 f(x) g(x) dx. Show that

More information

Early Stopping for Computational Learning

Early Stopping for Computational Learning Early Stopping for Computational Learning Lorenzo Rosasco Universita di Genova, Massachusetts Institute of Technology Istituto Italiano di Tecnologia CBMM Sestri Levante, September, 2014 joint work with

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Are Loss Functions All the Same?

Are Loss Functions All the Same? Are Loss Functions All the Same? L. Rosasco E. De Vito A. Caponnetto M. Piana A. Verri November 11, 2003 Abstract In this paper we investigate the impact of choosing different loss functions from the viewpoint

More information

Eigenvalues and Eigenfunctions of the Laplacian

Eigenvalues and Eigenfunctions of the Laplacian The Waterloo Mathematics Review 23 Eigenvalues and Eigenfunctions of the Laplacian Mihai Nica University of Waterloo mcnica@uwaterloo.ca Abstract: The problem of determining the eigenvalues and eigenvectors

More information

Convergence Rates of Kernel Quadrature Rules

Convergence Rates of Kernel Quadrature Rules Convergence Rates of Kernel Quadrature Rules Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE NIPS workshop on probabilistic integration - Dec. 2015 Outline Introduction

More information

3. Some tools for the analysis of sequential strategies based on a Gaussian process prior

3. Some tools for the analysis of sequential strategies based on a Gaussian process prior 3. Some tools for the analysis of sequential strategies based on a Gaussian process prior E. Vazquez Computer experiments June 21-22, 2010, Paris 21 / 34 Function approximation with a Gaussian prior Aim:

More information

Lecture 6: September 19

Lecture 6: September 19 36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 68 November 999 C.B.C.L

More information

Optimal kernel methods for large scale learning

Optimal kernel methods for large scale learning Optimal kernel methods for large scale learning Alessandro Rudi INRIA - École Normale Supérieure, Paris joint work with Luigi Carratino, Lorenzo Rosasco 6 Mar 2018 École Polytechnique Learning problem

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,

More information

Optimal Rates for Regularized Least Squares Regression

Optimal Rates for Regularized Least Squares Regression Optimal Rates for Regularized Least Suares Regression Ingo Steinwart, Don Hush, and Clint Scovel Modeling, Algorithms and Informatics Group, CCS-3 Los Alamos National Laboratory {ingo,dhush,jcs}@lanl.gov

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Learning, Regularization and Ill-Posed Inverse Problems

Learning, Regularization and Ill-Posed Inverse Problems Learning, Regularization and Ill-Posed Inverse Problems Lorenzo Rosasco DISI, Università di Genova rosasco@disi.unige.it Andrea Caponnetto DISI, Università di Genova caponnetto@disi.unige.it Ernesto De

More information

Real Variables # 10 : Hilbert Spaces II

Real Variables # 10 : Hilbert Spaces II randon ehring Real Variables # 0 : Hilbert Spaces II Exercise 20 For any sequence {f n } in H with f n = for all n, there exists f H and a subsequence {f nk } such that for all g H, one has lim (f n k,

More information

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t))

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t)) Notations In this chapter we investigate infinite systems of interacting particles subject to Newtonian dynamics Each particle is characterized by its position an velocity x i t, v i t R d R d at time

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Kernels for Multi task Learning

Kernels for Multi task Learning Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano

More information

An Even Order Symmetric B Tensor is Positive Definite

An Even Order Symmetric B Tensor is Positive Definite An Even Order Symmetric B Tensor is Positive Definite Liqun Qi, Yisheng Song arxiv:1404.0452v4 [math.sp] 14 May 2014 October 17, 2018 Abstract It is easily checkable if a given tensor is a B tensor, or

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

A Concise Course on Stochastic Partial Differential Equations

A Concise Course on Stochastic Partial Differential Equations A Concise Course on Stochastic Partial Differential Equations Michael Röckner Reference: C. Prevot, M. Röckner: Springer LN in Math. 1905, Berlin (2007) And see the references therein for the original

More information

Inner product spaces. Layers of structure:

Inner product spaces. Layers of structure: Inner product spaces Layers of structure: vector space normed linear space inner product space The abstract definition of an inner product, which we will see very shortly, is simple (and by itself is pretty

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Back to the future: Radial Basis Function networks revisited

Back to the future: Radial Basis Function networks revisited Back to the future: Radial Basis Function networks revisited Qichao Que, Mikhail Belkin Department of Computer Science and Engineering Ohio State University Columbus, OH 4310 que, mbelkin@cse.ohio-state.edu

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Online Learning with Samples Drawn from Non-identical Distributions

Online Learning with Samples Drawn from Non-identical Distributions Journal of Machine Learning Research 10 (009) 873-898 Submitted 1/08; Revised 7/09; Published 1/09 Online Learning with Samples Drawn from Non-identical Distributions Ting Hu Ding-uan hou Department of

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data

Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data Supplementary Material: A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data Jinfeng Yi JINFENGY@US.IBM.COM IBM Thomas J. Watson Research Center, Yorktown

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction SHARP BOUNDARY TRACE INEQUALITIES GILES AUCHMUTY Abstract. This paper describes sharp inequalities for the trace of Sobolev functions on the boundary of a bounded region R N. The inequalities bound (semi-)norms

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

The goal of this chapter is to study linear systems of ordinary differential equations: dt,..., dx ) T

The goal of this chapter is to study linear systems of ordinary differential equations: dt,..., dx ) T 1 1 Linear Systems The goal of this chapter is to study linear systems of ordinary differential equations: ẋ = Ax, x(0) = x 0, (1) where x R n, A is an n n matrix and ẋ = dx ( dt = dx1 dt,..., dx ) T n.

More information

Kernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt.

Kernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt. SINGAPORE SHANGHAI Vol TAIPEI - Interdisciplinary Mathematical Sciences 19 Kernel-based Approximation Methods using MATLAB Gregory Fasshauer Illinois Institute of Technology, USA Michael McCourt University

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information