Derivative reproducing properties for kernel methods in learning theory


Journal of Computational and Applied Mathematics 220 (2008) 456–463. doi:10.1016/j.cam.2007.08.023

Derivative reproducing properties for kernel methods in learning theory

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China
E-mail address: mazhou@cityu.edu.hk

Received 27 June 2007

Abstract

The regularity of functions from reproducing kernel Hilbert spaces (RKHSs) is studied in the setting of learning theory. We provide a reproducing property for partial derivatives up to order $s$ when the Mercer kernel is $C^{2s}$. For such a kernel on a general domain we show that the RKHS can be embedded into the function space $C^s$. These observations yield a representer theorem for regularized learning algorithms involving data for function values and gradients. Examples of Hermite learning and semi-supervised learning penalized by gradients on data are considered. © 2007 Elsevier B.V. All rights reserved.

MSC: 68T05; 62J02

Keywords: Learning theory; Reproducing kernel Hilbert spaces; Derivative reproducing; Representer theorem; Hermite learning and semi-supervised learning

1. Introduction

Reproducing kernel Hilbert spaces (RKHSs) form an important class of function spaces in learning theory. Their reproducing property together with the Hilbert space structure ensures the effectiveness of many practical learning algorithms implemented in these function spaces.

Let $X$ be a separable metric space and $K : X \times X \to \mathbb{R}$ be a continuous and symmetric function such that for any finite set of points $\{x_1, \ldots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite. Such a function is called a Mercer kernel. The RKHS $\mathcal{H}_K$ associated with the kernel $K$ is defined (see [2]) to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle\cdot,\cdot\rangle_K$ given by $\langle K_x, K_y \rangle_K = K(x, y)$. That is, $\langle \sum_i \alpha_i K_{x_i}, \sum_j \beta_j K_{y_j} \rangle_K = \sum_{i,j} \alpha_i \beta_j K(x_i, y_j)$. The reproducing property takes the form

$$\langle K_x, f \rangle_K = f(x) \qquad \forall x \in X,\ f \in \mathcal{H}_K. \tag{1.1}$$
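The following minimal sketch (not from the paper) illustrates these definitions numerically for an assumed Gaussian kernel on $X = [0,1]^2$: the kernel matrix of a finite point set is symmetric and positive semidefinite, and for $f = \sum_i c_i K_{x_i}$ in the span of kernel sections the reproducing property (1.1) reduces evaluation and the RKHS norm to finite sums involving $K$.

```python
# Sketch: Mercer-kernel conditions and the inner product on span{K_x}.
# The Gaussian kernel and the random points are illustrative assumptions.
import numpy as np

def K(x, y, sigma=0.7):
    """Gaussian Mercer kernel on R^n (continuous, symmetric, PSD)."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(8, 2))             # {x_1,...,x_l} in X = [0,1]^2
G = np.array([[K(p, q) for q in pts] for p in pts])   # (K(x_i, x_j))_{i,j}

assert np.allclose(G, G.T)                            # symmetry
assert np.linalg.eigvalsh(G).min() > -1e-12           # positive semidefiniteness

# For f = sum_i c_i K_{x_i}, (1.1) gives <K_x, f>_K = sum_i c_i K(x, x_i) = f(x),
# and ||f||_K^2 = c^T G c.
c = rng.normal(size=len(pts))
x = np.array([0.3, 0.6])
f_at_x = sum(ci * K(x, xi) for ci, xi in zip(c, pts))
print(f"f(x) = {f_at_x:.4f},  ||f||_K^2 = {c @ G @ c:.4f}")
```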

1.1. Learning algorithms by regularization in RKHS

A large family of learning algorithms is generated by regularization schemes in RKHS. Such a scheme can be expressed [6] in terms of a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in (X \times \mathbb{R})^m$ for learning and a loss function $V : \mathbb{R}^2 \to \mathbb{R}_+$ as

$$f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(x_i)) + \lambda \|f\|_K^2 \right\}, \qquad \lambda > 0. \tag{1.2}$$

The reproducing property (1.1) ensures that the solution of (1.2), a minimization problem over the possibly infinite dimensional space $\mathcal{H}_K$, is achieved in the finite dimensional subspace spanned by $\{K_{x_i}\}_{i=1}^m$. This is called a representer theorem [5]. When $V$ is convex with respect to the second variable, (1.2) can be solved by a convex optimization problem for the coefficients of $f_{\mathbf{z},\lambda} = \sum_{i=1}^m c_i K_{x_i}$ over $\mathbb{R}^m$. For some special loss functions such as those in support vector machines, this convex optimization problem is actually a convex quadratic programming one, hence many efficient computing tools are available. This makes the kernel method in learning theory very powerful in various applications.

For regression problems, one often takes the loss function to be $V(y, f(x)) = \psi(y - f(x))$ with $\psi : \mathbb{R} \to \mathbb{R}_+$ a convex function. In particular, for least square regression we choose $\psi(t) = t^2$, and for support vector machine regression we choose $\psi(t) = \max\{0, |t| - \varepsilon\}$, an $\varepsilon$-insensitive loss function with some threshold $\varepsilon > 0$. For binary classification problems, $y \in \{1, -1\}$, so one usually sets $V(y, f(x)) = \phi(yf(x))$ with $\phi : \mathbb{R} \to \mathbb{R}_+$ a convex function. In particular, for least square classification, $\phi(t) = (1 - t)^2$; for support vector machine classification, we choose $\phi$ to be the hinge loss $\phi(t) = \max\{0, 1 - t\}$ or the support vector machine $q$-norm loss $\phi_q(t) = (\max\{0, 1 - t\})^q$ with $1 < q < \infty$.

1.2. Learning with gradients

For some applications, one may have gradient data or unlabelled data available for improving learning ability [3,7]. Such situations yield learning algorithms involving data for function values or their gradients. Here $X \subset \mathbb{R}^n$, and with the $n$ variables $\{x^1, \ldots, x^n\}$ of $\mathbb{R}^n$, the gradient of a differentiable function $f : X \to \mathbb{R}$ is the vector formed by its partial derivatives, $\nabla f = (\partial f/\partial x^1, \ldots, \partial f/\partial x^n)^T$.

Example 1 (Semi-supervised learning with gradients of functions in $\mathcal{H}_K$). If $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in (X \times \mathbb{R})^m$ are labelled data and $\mathbf{u} = \{x_i\}_{i=m+1}^{m+l} \in X^l$ are unlabelled data, we introduce a semi-supervised learning algorithm involving gradients as

$$f_{\mathbf{z},\mathbf{u},\lambda,\mu} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(x_i)) + \frac{\mu}{m+l} \sum_{i=1}^{m+l} |\nabla f(x_i)|^2 + \lambda \|f\|_K^2 \right\}. \tag{1.3}$$

Here $\lambda, \mu > 0$ are two regularization parameters.

Example 2 (Hermite learning with gradient data). Assume that in addition to the data $\mathbf{z}$ approximating values of a desired function $f$ (i.e. $y_i \approx f(x_i)$), we get sampling values $\mathbf{y}' = \{y'_i\}_{i=1}^m$, $y'_i \in \mathbb{R}^n$, for the gradients of $f$ (i.e. $y'_i \approx \nabla f(x_i)$). Then we introduce a Hermite learning algorithm by learning the function values and gradients simultaneously as

$$f_{\mathbf{z},\mathbf{y}',\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} \left[ (y_i - f(x_i))^2 + |y'_i - \nabla f(x_i)|^2 \right] + \lambda \|f\|_K^2 \right\}. \tag{1.4}$$

To solve optimization problems like (1.3) and (1.4) with effective computing tools (such as those for convex quadratic programming), we study representer theorems and need reproducing properties for gradients similar to (1.1). This is the first purpose of this paper.
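As a point of reference for these schemes, the following sketch (an illustration, not the paper's algorithm) shows how the representer theorem turns the least-squares instance of (1.2) into an $m \times m$ linear system: with $f_{\mathbf{z},\lambda} = \sum_i c_i K_{x_i}$ and the square loss, the coefficient vector satisfies $(G + \lambda m I)\,c = y$, where $G = (K(x_i, x_j))_{i,j}$. The Gaussian kernel and the toy data are assumptions made for the example.

```python
# Sketch: least-squares instance of (1.2) (kernel ridge regression) solved in
# the finite dimensional span of {K_{x_i}}.  Kernel and data are assumptions.
import numpy as np

def K(x, y, sigma=0.5):
    return np.exp(-((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
m = 30
X = np.sort(rng.uniform(0, 1, m))                     # sample points x_i in X = [0,1]
y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=m)

lam = 1e-3
G = K(X[:, None], X[None, :])                         # kernel matrix (K(x_i, x_j))
# Minimizing (1/m)||y - G c||^2 + lam * c^T G c is achieved at (G + lam*m*I) c = y.
c = np.linalg.solve(G + lam * m * np.eye(m), y)

f = lambda t: K(t[:, None], X[None, :]) @ c           # f_{z,lambda}(t) = sum_i c_i K(t, x_i)
t = np.linspace(0, 1, 5)
print(np.round(f(t), 3), "vs target", np.round(np.sin(2 * np.pi * t), 3))
```

Schemes (1.3) and (1.4) lead to analogous finite linear systems once the derivative reproducing property of Section 2 and the representer theorem of Section 3 are available; a sketch for (1.4) is given after Theorem 2.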

1.3. Capacity of RKHS

Learning ability of algorithms in $\mathcal{H}_K$ depends on the kernel $K$, a loss function measuring errors, and the probability distributions from which the samples are drawn [1,11]. Its quantitative estimates in terms of $\mathcal{H}_K$ rely on two features of the RKHS: the approximation power and the capacity [2,5,6,7]. The latter can be measured by covering numbers of the unit ball of the RKHS as a subset of $C(X)$. These covering numbers have been extensively studied in learning theory, see e.g. [1,9,18]. In particular, when $X = [0,1]^n$ and $K$ is $C^s$ with $s$ not being an even integer, an explicit bound for the covering numbers was presented in [19]. This was done by showing that $\mathcal{H}_K$ can be embedded into $C^{s/2}(X)$, a Hölder space on $X$. The embedding result yields error estimates in the $C^{s/2}$ metric by means of bounds in the $\mathcal{H}_K$ metric for learning algorithms. For example, for the least square regularized regression algorithm (1.2) with $V(y, f(x)) = (f(x) - y)^2$, when $K \in C^{2+\varepsilon}(X \times X)$ with some $\varepsilon > 0$, rates of the error $\|f_{\mathbf{z},\lambda} - f_\rho\|_{C^1(X)}$ for learning the regression function $f_\rho$ were provided in [4] from bounds for $\|f_{\mathbf{z},\lambda} - f_\rho\|_K$. A natural question is whether the extra $\varepsilon > 0$ can be omitted. The general problem for $K \in C^s$ is whether $\mathcal{H}_K$ can be embedded into $C^{s/2}(X)$ when $X$ is a general domain in $\mathbb{R}^n$ or when $s$ is an even integer. Solving this general question is the second purpose of this paper. The main difficulty we overcome here is the lack of regularity of the general domain.

2. Reproducing partial derivatives in RKHS

To allow a general situation, we do not assume any regularity for the boundary of $X$. To consider partial derivatives, we assume that the interior of $X$ is nonempty.

For $s \in \mathbb{Z}_+$, we denote an index set $I_s := \{\alpha \in \mathbb{Z}_+^n : |\alpha| \le s\}$ where $|\alpha| = \sum_{j=1}^n \alpha_j$ for $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{Z}_+^n$. For a function $f$ of $n$ variables and $x = (x^1, \ldots, x^n) \in \mathbb{R}^n$, we denote its partial derivative $D^\alpha f$ at $x$ (if it exists) as

$$D^\alpha f(x) = D_1^{\alpha_1} \cdots D_n^{\alpha_n} f(x) = \frac{\partial^{|\alpha|} f(x)}{(\partial x^1)^{\alpha_1} \cdots (\partial x^n)^{\alpha_n}}.$$

Definition (Ziemer [21]). Let $X$ be a compact subset of $\mathbb{R}^n$ which is the closure of its nonempty interior $X^o$. Define $C^s(X^o)$ to be the space of functions $f$ on $X^o$ such that $D^\alpha f$ is well-defined and continuous on $X^o$ for each $\alpha \in I_s$. Define $C^s(X)$ to be the space of continuous functions $f$ on $X$ such that $f|_{X^o} \in C^s(X^o)$ and for each $\alpha \in I_s$, $D^\alpha(f|_{X^o})$ has a continuous extension to $X$ denoted as $D^\alpha f$. In particular, the extension of $f|_{X^o}$ is $f$ itself.

The linear space $C^s(X)$ is actually a Banach space with the norm

$$\|f\|_{C^s(X)} = \sum_{\alpha \in I_s} \|D^\alpha f\|_\infty = \sum_{\alpha \in I_s} \sup_{x \in X^o} |D^\alpha(f|_{X^o})(x)|.$$

Observe that for $f \in C^1(X)$, the gradient $\nabla f$ equals $(D^{e_1} f, \ldots, D^{e_n} f)$ where $e_j$ is the $j$th standard unit vector in $\mathbb{R}^n$.

The property of reproducing partial derivatives of functions in $\mathcal{H}_K$ is given by partial derivatives of the Mercer kernel. If $K \in C^{2s}(X \times X)$, for $\alpha \in I_s$ we extend $\alpha$ to $\mathbb{Z}_+^{2n}$ by adding zeros to the last $n$ components and denote the partial derivative of $K$ as $D^\alpha K$. That is,

$$D^\alpha K(x, y) = \frac{\partial^{|\alpha|} K}{(\partial x^1)^{\alpha_1} \cdots (\partial x^n)^{\alpha_n}}(x^1, \ldots, x^n, y^1, \ldots, y^n), \qquad x, y \in X^o,$$

and $D^\alpha K$ is a continuous extension of $D^\alpha(K|_{X^o \times X^o})$ to $X \times X$. For $x \in X$, denote $(D^\alpha K)_x$ as the function on $X$ given by $(D^\alpha K)_x(y) = D^\alpha K(x, y)$. By the symmetry of $K$, we have

$$D^\alpha(K_y)(x) = (D^\alpha K)_x(y) = D^\alpha K(x, y) \qquad \forall x, y \in X. \tag{2.1}$$

Now we can give the result on reproducing partial derivatives and embedding of $\mathcal{H}_K$ into $C^s(X)$. Here (1.1) and the weak compactness of a closed ball of a Hilbert space play an important role.

Theorem 1. Let $s \in \mathbb{N}$ and $K : X \times X \to \mathbb{R}$ be a Mercer kernel such that $K \in C^{2s}(X \times X)$. Then the following statements hold:

(a) For any $x \in X$ and $\alpha \in I_s$, $(D^\alpha K)_x \in \mathcal{H}_K$.

(b) A partial derivative reproducing property holds true for $\alpha \in I_s$:

$$D^\alpha f(x) = \langle (D^\alpha K)_x, f \rangle_K \qquad \forall x \in X,\ f \in \mathcal{H}_K. \tag{2.2}$$

(c) The inclusion $J : \mathcal{H}_K \to C^s(X)$ is well-defined and bounded:

$$\|f\|_{C^s(X)} \le n^s \|K\|_{C^{2s}(X \times X)}^{1/2} \|f\|_K, \qquad f \in \mathcal{H}_K. \tag{2.3}$$

(d) $J$ is compact. More strongly, for any closed bounded subset $B$ of $\mathcal{H}_K$, $J(B)$ is a compact subset of $C^s(X)$.

Proof. We first prove (a) and (b) together by induction on $|\alpha| = 0, 1, \ldots, s$. The case $|\alpha| = 0$ is trivial since for $\alpha = 0$, $(D^0 K)_x = K_x$ satisfies (1.1).

Let $0 \le l < s$. Suppose that $(D^\alpha K)_x \in \mathcal{H}_K$ and (2.2) holds for any $x \in X$ and $\alpha \in \mathbb{Z}_+^n$ with $|\alpha| = l$. Then (2.2) implies that for any $y \in X$,

$$\langle (D^\alpha K)_y, (D^\alpha K)_x \rangle_K = D^\alpha((D^\alpha K)_x)(y) = D^\alpha(D^\alpha K(x, \cdot))(y) = D^{(\alpha,\alpha)} K(x, y). \tag{2.4}$$

Here $(\alpha, \alpha) \in \mathbb{Z}_+^{2n}$ is formed by $\alpha$ in the first and second sets of $n$ components.

Now we turn to the case $l + 1$. Consider the index $\alpha + e_j$ with $|\alpha + e_j| = l + 1$. We prove (a) and (b) for this index in four steps.

Step 1: Proving $(D^{\alpha+e_j} K)_x \in \mathcal{H}_K$ for $x \in X^o$. Since $x \in X^o$, there exists some $r > 0$ such that $x + \{y \in \mathbb{R}^n : |y| \le r\} \subset X^o$. Then by (2.4), the set $\{\frac{1}{t}((D^\alpha K)_{x+te_j} - (D^\alpha K)_x) : |t| \le r\}$ of functions in $\mathcal{H}_K$ satisfies

$$\Big\|\frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big)\Big\|_K^2 = \frac{1}{t^2}\Big\{D^{(\alpha,\alpha)} K(x+te_j, x+te_j) - D^{(\alpha,\alpha)} K(x+te_j, x) - D^{(\alpha,\alpha)} K(x, x+te_j) + D^{(\alpha,\alpha)} K(x, x)\Big\} \le \|D^{(\alpha+e_j,\alpha+e_j)} K\|_\infty, \qquad |t| \le r.$$

Here we have used the assumption $K \in C^{2s}(X \times X)$ and $|(\alpha+e_j, \alpha+e_j)| = 2|\alpha| + 2 = 2l + 2 \le 2s$. That means $\{\frac{1}{t}((D^\alpha K)_{x+te_j} - (D^\alpha K)_x) : |t| \le r\}$ lies in the closed ball of the Hilbert space $\mathcal{H}_K$ with a finite radius $\|D^{(\alpha+e_j,\alpha+e_j)} K\|_\infty^{1/2}$. Since this ball is weakly compact, there is a sequence $\{t_i\}_{i=1}^\infty$ with $|t_i| \le r$ and $\lim_{i\to\infty} t_i = 0$ such that $\{\frac{1}{t_i}((D^\alpha K)_{x+t_ie_j} - (D^\alpha K)_x)\}$ converges weakly to an element $g_x$ of $\mathcal{H}_K$ as $i \to \infty$. The weak convergence tells us that

$$\lim_{i\to\infty} \Big\langle \frac{1}{t_i}\big((D^\alpha K)_{x+t_ie_j} - (D^\alpha K)_x\big), f \Big\rangle_K = \langle g_x, f \rangle_K \qquad \forall f \in \mathcal{H}_K. \tag{2.5}$$

In particular, by taking $f = K_y$ with $y \in X$, there holds

$$g_x(y) = \lim_{i\to\infty} \Big\langle \frac{1}{t_i}\big((D^\alpha K)_{x+t_ie_j} - (D^\alpha K)_x\big), K_y \Big\rangle_K.$$

By (2.2) for $\alpha$ and (2.1) we have

$$g_x(y) = \lim_{i\to\infty} \frac{1}{t_i}\big(D^\alpha(K_y)(x+t_ie_j) - D^\alpha(K_y)(x)\big) = \lim_{i\to\infty} \frac{1}{t_i}\big(D^\alpha K(x+t_ie_j, y) - D^\alpha K(x, y)\big) = D^{\alpha+e_j} K(x, y) = (D^{\alpha+e_j} K)_x(y).$$

This is true for an arbitrary point $y \in X$. Hence $(D^{\alpha+e_j} K)_x = g_x$ as functions on $X$. Since $g_x \in \mathcal{H}_K$, we know $(D^{\alpha+e_j} K)_x \in \mathcal{H}_K$.

Step 2: Proving for $x \in X^o$ the convergence

$$\frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big) \to (D^{\alpha+e_j} K)_x \quad \text{in } \mathcal{H}_K \ (t \to 0). \tag{2.6}$$

Applying (2.5) and (2.2) for $\alpha$ to the function $(D^{\alpha+e_j} K)_x \in \mathcal{H}_K$ yields

$$\langle (D^{\alpha+e_j} K)_x, (D^{\alpha+e_j} K)_x \rangle_K = \lim_{i\to\infty} \frac{1}{t_i}\Big\{D^\alpha((D^{\alpha+e_j} K)_x)(x+t_ie_j) - D^\alpha((D^{\alpha+e_j} K)_x)(x)\Big\} = \lim_{i\to\infty} \frac{1}{t_i}\Big\{D^\alpha(D^{\alpha+e_j} K(x, \cdot))(x+t_ie_j) - D^\alpha(D^{\alpha+e_j} K(x, \cdot))(x)\Big\} = D^{(\alpha+e_j,\alpha+e_j)} K(x, x).$$

This in connection with (2.2) implies

$$\Big\|\frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big) - (D^{\alpha+e_j} K)_x\Big\|_K^2 = \frac{1}{t^2}\Big\{D^{(\alpha,\alpha)} K(x+te_j, x+te_j) - 2D^{(\alpha,\alpha)} K(x+te_j, x) + D^{(\alpha,\alpha)} K(x, x)\Big\} - \frac{2}{t}\Big\{D^\alpha((D^{\alpha+e_j} K)_x)(x+te_j) - D^\alpha((D^{\alpha+e_j} K)_x)(x)\Big\} + D^{(\alpha+e_j,\alpha+e_j)} K(x, x)$$
$$= \frac{1}{t^2}\int_0^t\!\!\int_0^t D^{(\alpha+e_j,\alpha+e_j)} K(x+ue_j, x+ve_j)\,du\,dv - \frac{2}{t}\int_0^t D^{(\alpha+e_j,\alpha+e_j)} K(x, x+ve_j)\,dv + D^{(\alpha+e_j,\alpha+e_j)} K(x, x)$$
$$= \frac{1}{t^2}\int_0^t\!\!\int_0^t \Big\{D^{(\alpha+e_j,\alpha+e_j)} K(x+ue_j, x+ve_j) - 2D^{(\alpha+e_j,\alpha+e_j)} K(x, x+ve_j) + D^{(\alpha+e_j,\alpha+e_j)} K(x, x)\Big\}\,du\,dv.$$

If we define the modulus of continuity for a function $g \in C(X \times X)$ to be a function of $\delta \in (0, \infty)$ as

$$\omega(g, \delta) := \sup\{|g(x_1, y_1) - g(x_2, y_2)| : x_i, y_i \in X \text{ with } |x_1 - x_2| \le \delta,\ |y_1 - y_2| \le \delta\}, \tag{2.7}$$

we know from the uniform continuity of $g$ that $\lim_{\delta\to 0+} \omega(g, \delta) = 0$. Moreover, the function $\omega(g, \delta)$ is continuous on $(0, \infty)$. Using the modulus of continuity for the function $D^{(\alpha+e_j,\alpha+e_j)} K \in C(X \times X)$ we see that

$$\Big\|\frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big) - (D^{\alpha+e_j} K)_x\Big\|_K^2 \le 2\omega\big(D^{(\alpha+e_j,\alpha+e_j)} K, |t|\big). \tag{2.8}$$

This converges to zero as $t \to 0$. Therefore (2.6) holds true.

Step 3: Proving (2.2) for $x \in X^o$ and $\alpha + e_j$. Let $f \in \mathcal{H}_K$. By (2.6) we have

$$\langle (D^{\alpha+e_j} K)_x, f \rangle_K = \lim_{t\to 0} \Big\langle \frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big), f \Big\rangle_K.$$

By (2.2) for $\alpha$, we see that this equals

$$\langle (D^{\alpha+e_j} K)_x, f \rangle_K = \lim_{t\to 0} \frac{1}{t}\big\{D^\alpha f(x+te_j) - D^\alpha f(x)\big\}.$$

That is, $D^{\alpha+e_j} f(x)$ exists and equals $\langle (D^{\alpha+e_j} K)_x, f \rangle_K$. This verifies (2.2) for $\alpha + e_j$.

Step 4: Proving (a) and (b) for $x \in \partial X := X \setminus X^o$. Notice from the first three steps that for $x, \bar{x} \in X^o$, there holds

$$\|(D^{\alpha+e_j} K)_x - (D^{\alpha+e_j} K)_{\bar{x}}\|_K^2 = \Big\{D^{(\alpha+e_j,\alpha+e_j)} K(x, x) - D^{(\alpha+e_j,\alpha+e_j)} K(x, \bar{x}) - D^{(\alpha+e_j,\alpha+e_j)} K(\bar{x}, x) + D^{(\alpha+e_j,\alpha+e_j)} K(\bar{x}, \bar{x})\Big\} \le 2\omega\big(D^{(\alpha+e_j,\alpha+e_j)} K, |x - \bar{x}|\big).$$

It follows that for any sequence $\{x^{(i)} \in X^o\}_{i=1}^\infty$ converging to $x$, the sequence of functions $\{(D^{\alpha+e_j} K)_{x^{(i)}}\}$ is a Cauchy sequence in the Hilbert space $\mathcal{H}_K$. So it converges to a limit function $h \in \mathcal{H}_K$. Applying what we have proved for $x^{(i)} \in X^o$ we get

$$h(y) = \langle h, K_y \rangle_K = \lim_{i\to\infty} \langle (D^{\alpha+e_j} K)_{x^{(i)}}, K_y \rangle_K = (D^{\alpha+e_j} K)_x(y) \qquad \forall y \in X.$$

This verifies $(D^{\alpha+e_j} K)_x = h \in \mathcal{H}_K$.

Let $f \in \mathcal{H}_K$. We define a function $f^{[j]}$ on $X$ as $f^{[j]}(x) = \langle (D^{\alpha+e_j} K)_x, f \rangle_K$, $x \in X$. By the conclusion in Step 3, we know that $f^{[j]}(x) = D^{\alpha+e_j} f(x)$ for $x \in X^o$, hence $f^{[j]}$ is continuous on $X^o$. Let us now prove the continuity of $f^{[j]}$ at each $x \in \partial X$. If $\{x^{(i)} \in X^o\}_{i=1}^\infty$ is a sequence satisfying $\lim_{i\to\infty} x^{(i)} = x$, then the above proof tells us that $(D^{\alpha+e_j} K)_{x^{(i)}}$ converges to $(D^{\alpha+e_j} K)_x$ in the $\mathcal{H}_K$ metric, meaning that $\lim_{i\to\infty} \|(D^{\alpha+e_j} K)_{x^{(i)}} - (D^{\alpha+e_j} K)_x\|_K = 0$. So by the definition of $f^{[j]}$ and the Schwarz inequality, we have

$$|f^{[j]}(x^{(i)}) - f^{[j]}(x)| = |\langle (D^{\alpha+e_j} K)_{x^{(i)}} - (D^{\alpha+e_j} K)_x, f \rangle_K| \le \|(D^{\alpha+e_j} K)_{x^{(i)}} - (D^{\alpha+e_j} K)_x\|_K \|f\|_K \to 0 \quad \text{as } i \to \infty.$$

Thus the function $f^{[j]}$ is continuous on $X$, and it is a continuous extension of $D^{\alpha+e_j} f$ from $X^o$ to $X$. So (b) holds true for $x \in \partial X$. This completes the induction procedure for proving the statements in (a) and (b).

(c) We use (2.2) and (2.4). For $f \in \mathcal{H}_K$, $x, \bar{x} \in X$ and $\alpha \in I_s$, the Schwarz inequality implies

$$|D^\alpha f(x) - D^\alpha f(\bar{x})| = |\langle (D^\alpha K)_x - (D^\alpha K)_{\bar{x}}, f \rangle_K| \le \|(D^\alpha K)_x - (D^\alpha K)_{\bar{x}}\|_K \|f\|_K.$$

Hence

$$|D^\alpha f(x) - D^\alpha f(\bar{x})| \le \sqrt{D^{(\alpha,\alpha)} K(x, x) - 2D^{(\alpha,\alpha)} K(x, \bar{x}) + D^{(\alpha,\alpha)} K(\bar{x}, \bar{x})}\ \|f\|_K \le \sqrt{2\omega\big(D^{(\alpha,\alpha)} K, |x - \bar{x}|\big)}\ \|f\|_K \qquad \forall x, \bar{x} \in X. \tag{2.9}$$

As $\lim_{\delta\to 0+} \omega(D^{(\alpha,\alpha)} K, \delta) = 0$, we know that $D^\alpha f \in C(X)$. It means $f \in C^s(X)$ and the inclusion $J$ is well-defined. To see the boundedness, we apply the Schwarz inequality again and have

$$|D^\alpha f(x)| = |\langle (D^\alpha K)_x, f \rangle_K| \le \sqrt{D^{(\alpha,\alpha)} K(x, x)}\ \|f\|_K \le \|D^{(\alpha,\alpha)} K\|_\infty^{1/2} \|f\|_K.$$

It follows that

$$\|f\|_{C^s(X)} = \sum_{\alpha \in I_s} \|D^\alpha f\|_\infty \le \sum_{\alpha \in I_s} \|D^{(\alpha,\alpha)} K\|_\infty^{1/2} \|f\|_K \le n^s \Big(\sum_{\alpha \in I_s} \|D^{(\alpha,\alpha)} K\|_\infty\Big)^{1/2} \|f\|_K.$$

Then (2.3) is verified.

(d) If $B$ is a closed bounded subset of $\mathcal{H}_K$, there is some $R > 0$ such that $B \subseteq \{f \in \mathcal{H}_K : \|f\|_K \le R\}$. To show that $J(B)$ is compact, let $\{f_j\}_{j=1}^\infty$ be a sequence in $B$. The estimate (2.9) tells us that for each $\alpha \in I_s$ and $j \in \mathbb{N}$,

$$|D^\alpha f_j(x) - D^\alpha f_j(\bar{x})| \le \sqrt{2\omega\big(D^{(\alpha,\alpha)} K, |x - \bar{x}|\big)}\ R \qquad \forall x, \bar{x} \in X.$$

It says that the sequence of functions $\{D^\alpha f_j\}_{j=1}^\infty$ is uniformly equicontinuous; it is also uniformly bounded since $|D^\alpha f_j(x)| \le \|D^{(\alpha,\alpha)} K\|_\infty^{1/2} R$. This is true for each $\alpha \in I_s$. So by the Arzelà–Ascoli theorem and by taking subsequences for each $\alpha$ (one after another), we know that there is a subsequence $\{f_{j_l}\}_{l=1}^\infty$ which converges to a function $\bar{f} \in C^s(X)$ in the metric of $C^s(X)$. Observe that $\{f_{j_l}\}_{l=1}^\infty$ lies in the ball of $\mathcal{H}_K$ with radius $R$, which is weakly compact, so it contains a subsequence $\{f_{j_{l_k}}\}_{k=1}^\infty$ which is also a subsequence of $\{f_j\}_{j=1}^\infty$ and converges weakly in $\mathcal{H}_K$ to a function $f^* \in \mathcal{H}_K$. According to (2.2), the weak convergence in $\mathcal{H}_K$ tells us that $D^\alpha f_{j_{l_k}}(x) = \langle (D^\alpha K)_x, f_{j_{l_k}} \rangle_K \to D^\alpha f^*(x)$ for every $x \in X$ and $\alpha \in I_s$, so the $C^s(X)$ limit $\bar{f}$ coincides with $f^*$. Therefore, $\bar{f} = f^* \in \mathcal{H}_K$ and $\{f_j\}_{j=1}^\infty \subseteq J(B)$ contains a subsequence which converges in $C^s(X)$ to $f^*$. This proves that $J(B)$ is compact. The proof of Theorem 1 is complete.
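The derivative reproducing property (2.2) and the convergence (2.6) can be checked numerically for a concrete smooth kernel. The sketch below is an illustration under assumed choices (a Gaussian kernel on $\mathbb{R}$, random points and coefficients), not part of the paper: it compares a finite difference of $f \in \mathrm{span}\{K_{x_i}\}$ with $\langle (D^{e_1}K)_x, f\rangle_K$, and evaluates the squared $\mathcal{H}_K$ distance appearing in (2.6) in closed form via (2.4)-type identities.

```python
# Numerical illustration of Theorem 1 for the Gaussian kernel on R (n = 1),
# which is C^infinity; we check the first-order case.  Kernel, points and
# coefficients below are assumed toy choices.
import numpy as np

sigma = 0.6
K    = lambda u, v: np.exp(-(u - v) ** 2 / (2 * sigma ** 2))
D1K  = lambda u, v: -(u - v) / sigma ** 2 * K(u, v)                          # derivative in 1st argument
D11K = lambda u, v: (1 / sigma ** 2 - (u - v) ** 2 / sigma ** 4) * K(u, v)   # D^{(1,1)}K

rng = np.random.default_rng(2)
xs, c = rng.uniform(0, 1, 6), rng.normal(size=6)
f = lambda u: sum(ci * K(u, xi) for ci, xi in zip(c, xs))

# (2.2): f'(x) = <(D^1 K)_x, f>_K = sum_i c_i D^1 K(x, x_i); compare with a
# central finite difference of f.
x, h = 0.4, 1e-5
fd  = (f(x + h) - f(x - h)) / (2 * h)
rep = sum(ci * D1K(x, xi) for ci, xi in zip(c, xs))
print(f"finite difference {fd:.6f}  vs  derivative reproducing {rep:.6f}")

# (2.6): the squared H_K distance of the difference quotient to (D^1 K)_x,
# expanded through inner products of kernel sections, shrinks as t -> 0.
for t in [0.1, 0.01, 0.001]:
    err2 = ((K(x + t, x + t) - 2 * K(x + t, x) + K(x, x)) / t ** 2
            - (2 / t) * (D1K(x, x + t) - D1K(x, x)) + D11K(x, x))
    print(f"t = {t:g}: squared H_K distance = {err2:.2e}")
```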

Theorem 1 can be extended to other kernels [10]. Relation (2.3) tells us that error bounds in the $\|\cdot\|_K$ norm can be used to estimate convergence rates of learning algorithms in the norm of $C^s(X)$, as done in [4].

3. Representer theorems for learning with derivative data

A general learning algorithm of regularization in $\mathcal{H}_K$ involving partial derivative data takes the form

$$f_{\mathbf{x},\mathbf{y},\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \sum_{i=1}^m V_i\big(\mathbf{y}_i, \{D^\alpha f(x_i)\}_{\alpha \in J_i}\big) + \lambda \|f\|_K^2 \right\}, \tag{3.1}$$

where for each $i \in \{1, \ldots, m\}$, $x_i \in X$, $\mathbf{y}_i$ is a vector, $J_i$ is a subset of $I_s$ and $V_i$ is a loss function with values in $\mathbb{R}_+$ of compatible variables. Denote the number of elements in the set $J_i$ as $\#(J_i)$.

The partial derivative reproducing property (2.2) stated in Theorem 1 enables us to derive a representer theorem for the learning algorithm (3.1), which asserts that the minimization over the possibly infinite dimensional space $\mathcal{H}_K$ can be achieved in a finite dimensional subspace generated by $\{K_{x_i}\}$ and their partial derivatives.

Theorem 2. Let $s \in \mathbb{N}$ and $K : X \times X \to \mathbb{R}$ be a Mercer kernel such that $K \in C^{2s}(X \times X)$. If $\lambda > 0$, then the solution $f_{\mathbf{x},\mathbf{y},\lambda}$ of scheme (3.1) exists and lies in the subspace spanned by $\{(D^\alpha K)_{x_i} : \alpha \in J_i,\ i = 1, \ldots, m\}$. If we write $f_{\mathbf{x},\mathbf{y},\lambda} = \sum_{i=1}^m \sum_{\alpha \in J_i} c_{i,\alpha} (D^\alpha K)_{x_i}$ with $\mathbf{c} = (c_{i,\alpha})_{\alpha \in J_i, i=1,\ldots,m} \in \mathbb{R}^N$ where $N = \sum_{i=1}^m \#(J_i)$, then

$$\mathbf{c} = \arg\min_{\mathbf{c} \in \mathbb{R}^N}\ \sum_{i=1}^m V_i\Big(\mathbf{y}_i, \Big\{\sum_{j=1}^m \sum_{\beta \in J_j} c_{j,\beta}\, D^{(\beta,\alpha)} K(x_j, x_i)\Big\}_{\alpha \in J_i}\Big) + \lambda \sum_{i=1}^m \sum_{\alpha \in J_i} \sum_{j=1}^m \sum_{\beta \in J_j} c_{i,\alpha} c_{j,\beta}\, D^{(\beta,\alpha)} K(x_j, x_i).$$

Proof. By Theorem 1, we know that for any $\alpha$ in $J_i$, the function $(D^\alpha K)_{x_i}$ lies in $\mathcal{H}_K$. Denote the subspace of $\mathcal{H}_K$ spanned by $\{(D^\alpha K)_{x_i} : \alpha \in J_i,\ i = 1, \ldots, m\}$ as $\mathcal{H}_{K,\mathbf{x}}$. Let $P$ be the orthogonal projection onto this subspace. Then for any $f \in \mathcal{H}_K$, the function $f - P(f)$ is orthogonal to $\mathcal{H}_{K,\mathbf{x}}$. In particular, $\langle f - P(f), (D^\alpha K)_{x_i} \rangle_K = 0$ for any $\alpha \in J_i$ and $1 \le i \le m$. This in connection with the partial derivative reproducing property (2.2) tells us that

$$D^\alpha(f - P(f))(x_i) = D^\alpha f(x_i) - D^\alpha(P(f))(x_i) = 0 \qquad \forall \alpha \in J_i,\ i = 1, \ldots, m.$$

Thus, if we denote $\mathcal{E}_{\mathbf{z}}(f) = \sum_{i=1}^m V_i(\mathbf{y}_i, \{D^\alpha f(x_i)\}_{\alpha \in J_i})$, we see that $\mathcal{E}_{\mathbf{z}}(f) = \mathcal{E}_{\mathbf{z}}(P(f))$. Notice that $\|P(f)\|_K \le \|f\|_K$ and the strict inequality holds unless $f = P(f)$, i.e., $f \in \mathcal{H}_{K,\mathbf{x}}$. Therefore,

$$\min_{f \in \mathcal{H}_K} \{\mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_K^2\} = \min_{f \in \mathcal{H}_{K,\mathbf{x}}} \{\mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_K^2\},$$

and a minimizer $f_{\mathbf{x},\mathbf{y},\lambda}$ exists and lies in $\mathcal{H}_{K,\mathbf{x}}$ since the subspace is finite dimensional. The second statement is trivial.

We shall not discuss learning rates of the learning algorithms (1.3) and (1.4). Though rough estimates can be given using methods from [4,8,13], satisfactory error analysis for more general learning algorithms [9,20] will be done later.

Acknowledgements

The work described in this paper was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 03704) and a Joint Research Fund for Hong Kong and Macau Young Scholars from the National Science Fund for Distinguished Young Scholars (Project No. 05290).
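To make Theorem 2 concrete, the following sketch implements the Hermite learning scheme (1.4) for an assumed Gaussian kernel on $[0,1]$ (so $n = 1$ and $J_i = \{0, 1\}$ for every $i$): the minimizer is expanded as $f = \sum_j [c_{j,0} K_{x_j} + c_{j,1} (D^1 K)_{x_j}]$, the matrix $M$ with entries $D^{(\beta,\alpha)}K(x_j, x_i)$ is assembled from closed-form kernel derivatives, and for the square loss the coefficient vector solves $(M + \lambda m I)\mathbf{c} = (\mathbf{y}, \mathbf{y}')$. The kernel, data, and the closed-form solve are illustrative assumptions, not the paper's text.

```python
# Sketch: Hermite learning (1.4) via the representer theorem of Theorem 2,
# Gaussian kernel on [0,1] with function values and first derivatives as data.
import numpy as np

sigma, lam = 0.4, 1e-4
rng = np.random.default_rng(3)
m = 15
x = np.sort(rng.uniform(0, 1, m))
y  = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=m)               # y_i  ~ f(x_i)
yp = 2 * np.pi * np.cos(2 * np.pi * x) + 0.05 * rng.normal(size=m)   # y'_i ~ f'(x_i)

d = x[None, :] - x[:, None]                          # d[i, j] = x_j - x_i
Kmat = np.exp(-d ** 2 / (2 * sigma ** 2))
B00 = Kmat                                           # D^{(0,0)}K(x_j, x_i)
B01 = -d / sigma ** 2 * Kmat                         # D^{(1,0)}K(x_j, x_i)
B10 = d / sigma ** 2 * Kmat                          # D^{(0,1)}K(x_j, x_i)
B11 = (1 / sigma ** 2 - d ** 2 / sigma ** 4) * Kmat  # D^{(1,1)}K(x_j, x_i)
M = np.block([[B00, B01], [B10, B11]])               # rows (i,alpha), columns (j,beta)

# Minimizing (1/m)||yhat - M c||^2 + lam * c^T M c gives (M + lam*m*I) c = yhat.
yhat = np.concatenate([y, yp])
c = np.linalg.solve(M + lam * m * np.eye(2 * m), yhat)
c0, c1 = c[:m], c[m:]

def f(t):
    """Evaluate f(t) = sum_j [ c_{j,0} K(x_j, t) + c_{j,1} D^1K(x_j, t) ]."""
    dt = x[None, :] - t[:, None]                     # x_j - t
    Kt = np.exp(-dt ** 2 / (2 * sigma ** 2))
    return Kt @ c0 + (-dt / sigma ** 2 * Kt) @ c1

t = np.linspace(0, 1, 5)
print(np.round(f(t), 3), "vs target", np.round(np.sin(2 * np.pi * t), 3))
```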

References

[1] M. Anthony, P.L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, 1999.
[2] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950) 337–404.
[3] M. Belkin, P. Niyogi, Semi-supervised learning on Riemannian manifolds, Mach. Learn. 56 (2004) 209–239.
[4] E. De Vito, A. Caponnetto, L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005) 59–85.
[5] E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, A. Verri, Some properties of regularized kernel methods, J. Mach. Learn. Res. 5 (2004) 1363–1390.
[6] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000) 1–50.
[7] D. Hardin, I. Tsamardinos, C.F. Aliferis, A theoretical characterization of linear SVM-based feature selection, in: Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
[8] S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin, Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, Adv. Comput. Math. 25 (2006) 161–193.
[9] S. Mukherjee, Q. Wu, Estimation of gradients and coordinate covariation in classification, J. Mach. Learn. Res. 7 (2006) 2481–2514.
[10] R. Schaback, J. Werner, Linearly constrained reconstruction of functions by kernels with applications to machine learning, Adv. Comput. Math. 25 (2006) 237–258.
[11] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inform. Theory 44 (1998) 1926–1940.
[12] S. Smale, D.X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003) 17–41.
[13] S. Smale, D.X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Amer. Math. Soc. 41 (2004) 279–305.
[14] S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007) 153–172.
[15] G. Wahba, Spline Models for Observational Data, SIAM, Philadelphia, PA, 1990.
[16] Y. Yao, On complexity issues of online learning algorithms, IEEE Trans. Inform. Theory, to appear.
[17] Y. Ying, Convergence analysis of online algorithms, Adv. Comput. Math. 27 (2007) 273–291.
[18] D.X. Zhou, The covering number in learning theory, J. Complexity 18 (2002) 739–767.
[19] D.X. Zhou, Capacity of reproducing kernel spaces in learning theory, IEEE Trans. Inform. Theory 49 (2003) 1743–1752.
[20] D.X. Zhou, K. Jetter, Approximation with polynomial kernels and SVM classifiers, Adv. Comput. Math. 25 (2006) 323–344.
[21] W.P. Ziemer, Weakly Differentiable Functions, Springer, New York, 1989.