Derivative reproducing properties for kernel methods in learning theory


Journal of Computational and Applied Mathematics 220 (2008) 456–463. doi:10.1016/j.cam.2007.08.023

Derivative reproducing properties for kernel methods in learning theory

Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China
E-mail address: mazhou@cityu.edu.hk

Received 27 June 2007

Abstract

The regularity of functions from reproducing kernel Hilbert spaces (RKHSs) is studied in the setting of learning theory. We provide a reproducing property for partial derivatives up to order $s$ when the Mercer kernel is $C^{2s}$. For such a kernel on a general domain we show that the RKHS can be embedded into the function space $C^s$. These observations yield a representer theorem for regularized learning algorithms involving data for function values and gradients. Examples of Hermite learning and semi-supervised learning penalized by gradients on data are considered. © 2007 Elsevier B.V. All rights reserved.

MSC: 68T05; 62J02

Keywords: Learning theory; Reproducing kernel Hilbert spaces; Derivative reproducing; Representer theorem; Hermite learning and semi-supervised learning

1. Introduction

Reproducing kernel Hilbert spaces (RKHSs) form an important class of function spaces in learning theory. Their reproducing property together with the Hilbert space structure ensures the effectiveness of many practical learning algorithms implemented in these function spaces.

Let $X$ be a separable metric space and $K : X \times X \to \mathbb{R}$ be a continuous and symmetric function such that for any finite set of points $\{x_1, \ldots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite. Such a function is called a Mercer kernel. The RKHS $\mathcal{H}_K$ associated with the kernel $K$ is defined (see [2]) to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle\cdot,\cdot\rangle_K$ given by $\langle K_x, K_y \rangle_K = K(x, y)$. That is, $\langle \sum_i \alpha_i K_{x_i}, \sum_j \beta_j K_{y_j} \rangle_K = \sum_{i,j} \alpha_i \beta_j K(x_i, y_j)$. The reproducing property takes the form

$$\langle K_x, f \rangle_K = f(x) \qquad \forall x \in X,\ f \in \mathcal{H}_K. \tag{1.1}$$
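The following minimal sketch (not from the paper) illustrates these definitions numerically for an assumed Gaussian kernel on $X = [0,1]^2$: the kernel matrix of a finite point set is symmetric and positive semidefinite, and for $f = \sum_i c_i K_{x_i}$ in the span of kernel sections the reproducing property (1.1) reduces evaluation and the RKHS norm to finite sums involving $K$.

```python
# Sketch: Mercer-kernel conditions and the inner product on span{K_x}.
# The Gaussian kernel and the random points are illustrative assumptions.
import numpy as np

def K(x, y, sigma=0.7):
    """Gaussian Mercer kernel on R^n (continuous, symmetric, PSD)."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(8, 2))             # {x_1,...,x_l} in X = [0,1]^2
G = np.array([[K(p, q) for q in pts] for p in pts])   # (K(x_i, x_j))_{i,j}

assert np.allclose(G, G.T)                            # symmetry
assert np.linalg.eigvalsh(G).min() > -1e-12           # positive semidefiniteness

# For f = sum_i c_i K_{x_i}, (1.1) gives <K_x, f>_K = sum_i c_i K(x, x_i) = f(x),
# and ||f||_K^2 = c^T G c.
c = rng.normal(size=len(pts))
x = np.array([0.3, 0.6])
f_at_x = sum(ci * K(x, xi) for ci, xi in zip(c, pts))
print(f"f(x) = {f_at_x:.4f},  ||f||_K^2 = {c @ G @ c:.4f}")
```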

1.1. Learning algorithms by regularization in RKHS

A large family of learning algorithms is generated by regularization schemes in RKHS. Such a scheme can be expressed [6] in terms of a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in (X \times \mathbb{R})^m$ for learning and a loss function $V : \mathbb{R}^2 \to \mathbb{R}_+$ as

$$f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(x_i)) + \lambda \|f\|_K^2 \right\}, \qquad \lambda > 0. \tag{1.2}$$

The reproducing property (1.1) ensures that the solution of (1.2), a minimization problem over the possibly infinite dimensional space $\mathcal{H}_K$, is achieved in the finite dimensional subspace spanned by $\{K_{x_i}\}_{i=1}^m$. This is called a representer theorem [5]. When $V$ is convex with respect to the second variable, (1.2) can be solved by a convex optimization problem for the coefficients of $f_{\mathbf{z},\lambda} = \sum_{i=1}^m c_i K_{x_i}$ over $\mathbb{R}^m$. For some special loss functions such as those in support vector machines, this convex optimization problem is actually a convex quadratic programming one, hence many efficient computing tools are available. This makes the kernel method in learning theory very powerful in various applications.

For regression problems, one often takes the loss function to be $V(y, f(x)) = \psi(y - f(x))$ with $\psi : \mathbb{R} \to \mathbb{R}_+$ a convex function. In particular, for least square regression we choose $\psi(t) = t^2$, and for support vector machine regression we choose $\psi(t) = \max\{0, |t| - \varepsilon\}$, an $\varepsilon$-insensitive loss function with some threshold $\varepsilon > 0$. For binary classification problems, $y \in \{1, -1\}$, so one usually sets $V(y, f(x)) = \phi(yf(x))$ with $\phi : \mathbb{R} \to \mathbb{R}_+$ a convex function. In particular, for least square classification, $\phi(t) = (1 - t)^2$; for support vector machine classification, we choose $\phi$ to be the hinge loss $\phi(t) = \max\{0, 1 - t\}$ or the support vector machine $q$-norm loss $\phi_q(t) = (\max\{0, 1 - t\})^q$ with $1 < q < \infty$.

1.2. Learning with gradients

For some applications, one may have gradient data or unlabelled data available for improving learning ability [3,7]. Such situations yield learning algorithms involving data for function values or their gradients. Here $X \subset \mathbb{R}^n$, and with the $n$ variables $\{x^1, \ldots, x^n\}$ of $\mathbb{R}^n$, the gradient of a differentiable function $f : X \to \mathbb{R}$ is the vector formed by its partial derivatives, $\nabla f = (\partial f/\partial x^1, \ldots, \partial f/\partial x^n)^T$.

Example 1 (Semi-supervised learning with gradients of functions in $\mathcal{H}_K$). If $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in (X \times \mathbb{R})^m$ are labelled data and $\mathbf{u} = \{x_i\}_{i=m+1}^{m+l} \in X^l$ are unlabelled data, we introduce a semi-supervised learning algorithm involving gradients as

$$f_{\mathbf{z},\mathbf{u},\lambda,\mu} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} V(y_i, f(x_i)) + \frac{\mu}{m+l} \sum_{i=1}^{m+l} |\nabla f(x_i)|^2 + \lambda \|f\|_K^2 \right\}. \tag{1.3}$$

Here $\lambda, \mu > 0$ are two regularization parameters.

Example 2 (Hermite learning with gradient data). Assume that in addition to the data $\mathbf{z}$ approximating values of a desired function $f$ (i.e. $y_i \approx f(x_i)$), we get sampling values $\mathbf{y}' = \{y'_i\}_{i=1}^m$, $y'_i \in \mathbb{R}^n$, for the gradients of $f$ (i.e. $y'_i \approx \nabla f(x_i)$). Then we introduce a Hermite learning algorithm by learning the function values and gradients simultaneously as

$$f_{\mathbf{z},\mathbf{y}',\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} \left[ (y_i - f(x_i))^2 + |y'_i - \nabla f(x_i)|^2 \right] + \lambda \|f\|_K^2 \right\}. \tag{1.4}$$

To solve optimization problems like (1.3) and (1.4) with effective computing tools (such as those for convex quadratic programming), we study representer theorems and need reproducing properties for gradients similar to (1.1). This is the first purpose of this paper.
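As a point of reference for these schemes, the following sketch (an illustration, not the paper's algorithm) shows how the representer theorem turns the least-squares instance of (1.2) into an $m \times m$ linear system: with $f_{\mathbf{z},\lambda} = \sum_i c_i K_{x_i}$ and the square loss, the coefficient vector satisfies $(G + \lambda m I)\,c = y$, where $G = (K(x_i, x_j))_{i,j}$. The Gaussian kernel and the toy data are assumptions made for the example.

```python
# Sketch: least-squares instance of (1.2) (kernel ridge regression) solved in
# the finite dimensional span of {K_{x_i}}.  Kernel and data are assumptions.
import numpy as np

def K(x, y, sigma=0.5):
    return np.exp(-((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
m = 30
X = np.sort(rng.uniform(0, 1, m))                     # sample points x_i in X = [0,1]
y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=m)

lam = 1e-3
G = K(X[:, None], X[None, :])                         # kernel matrix (K(x_i, x_j))
# Minimizing (1/m)||y - G c||^2 + lam * c^T G c is achieved at (G + lam*m*I) c = y.
c = np.linalg.solve(G + lam * m * np.eye(m), y)

f = lambda t: K(t[:, None], X[None, :]) @ c           # f_{z,lambda}(t) = sum_i c_i K(t, x_i)
t = np.linspace(0, 1, 5)
print(np.round(f(t), 3), "vs target", np.round(np.sin(2 * np.pi * t), 3))
```

Schemes (1.3) and (1.4) lead to analogous finite linear systems once the derivative reproducing property of Section 2 and the representer theorem of Section 3 are available; a sketch for (1.4) is given after Theorem 2.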

1.3. Capacity of RKHS

Learning ability of algorithms in $\mathcal{H}_K$ depends on the kernel $K$, a loss function measuring errors, and the probability distributions from which the samples are drawn [1,11]. Its quantitative estimates in terms of $\mathcal{H}_K$ rely on two features of the RKHS: the approximation power and the capacity [2,5,6,7]. The latter can be measured by covering numbers of the unit ball of the RKHS as a subset of $C(X)$. These covering numbers have been extensively studied in learning theory, see e.g. [1,9,18]. In particular, when $X = [0,1]^n$ and $K$ is $C^s$ with $s$ not being an even integer, an explicit bound for the covering numbers was presented in [19]. This was done by showing that $\mathcal{H}_K$ can be embedded into $C^{s/2}(X)$, a Hölder space on $X$. The embedding result yields error estimates in the $C^{s/2}$ metric by means of bounds in the $\mathcal{H}_K$ metric for learning algorithms. For example, for the least square regularized regression algorithm (1.2) with $V(y, f(x)) = (f(x) - y)^2$, when $K \in C^{2+\varepsilon}(X \times X)$ with some $\varepsilon > 0$, rates of the error $\|f_{\mathbf{z},\lambda} - f_\rho\|_{C^1(X)}$ for learning the regression function $f_\rho$ were provided in [4] from bounds for $\|f_{\mathbf{z},\lambda} - f_\rho\|_K$. A natural question is whether the extra $\varepsilon > 0$ can be omitted. The general problem for $K \in C^s$ is whether $\mathcal{H}_K$ can be embedded into $C^{s/2}(X)$ when $X$ is a general domain in $\mathbb{R}^n$ or when $s$ is an even integer. Solving this general question is the second purpose of this paper. The main difficulty we overcome here is the lack of regularity of the general domain.

2. Reproducing partial derivatives in RKHS

To allow a general situation, we do not assume any regularity for the boundary of $X$. To consider partial derivatives, we assume that the interior of $X$ is nonempty.

For $s \in \mathbb{Z}_+$, we denote an index set $I_s := \{\alpha \in \mathbb{Z}_+^n : |\alpha| \le s\}$ where $|\alpha| = \sum_{j=1}^n \alpha_j$ for $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{Z}_+^n$. For a function $f$ of $n$ variables and $x = (x^1, \ldots, x^n) \in \mathbb{R}^n$, we denote its partial derivative $D^\alpha f$ at $x$ (if it exists) as

$$D^\alpha f(x) = D_1^{\alpha_1} \cdots D_n^{\alpha_n} f(x) = \frac{\partial^{|\alpha|} f(x)}{(\partial x^1)^{\alpha_1} \cdots (\partial x^n)^{\alpha_n}}.$$

Definition (Ziemer [21]). Let $X$ be a compact subset of $\mathbb{R}^n$ which is the closure of its nonempty interior $X^o$. Define $C^s(X^o)$ to be the space of functions $f$ on $X^o$ such that $D^\alpha f$ is well-defined and continuous on $X^o$ for each $\alpha \in I_s$. Define $C^s(X)$ to be the space of continuous functions $f$ on $X$ such that $f|_{X^o} \in C^s(X^o)$ and for each $\alpha \in I_s$, $D^\alpha(f|_{X^o})$ has a continuous extension to $X$ denoted as $D^\alpha f$. In particular, the extension of $f|_{X^o}$ is $f$ itself.

The linear space $C^s(X)$ is actually a Banach space with the norm

$$\|f\|_{C^s(X)} = \sum_{\alpha \in I_s} \|D^\alpha f\|_\infty = \sum_{\alpha \in I_s} \sup_{x \in X^o} |D^\alpha(f|_{X^o})(x)|.$$

Observe that for $f \in C^1(X)$, the gradient $\nabla f$ equals $(D^{e_1} f, \ldots, D^{e_n} f)$ where $e_j$ is the $j$th standard unit vector in $\mathbb{R}^n$.

The property of reproducing partial derivatives of functions in $\mathcal{H}_K$ is given by partial derivatives of the Mercer kernel. If $K \in C^{2s}(X \times X)$, for $\alpha \in I_s$ we extend $\alpha$ to $\mathbb{Z}_+^{2n}$ by adding zeros to the last $n$ components and denote the partial derivative of $K$ as $D^\alpha K$. That is,

$$D^\alpha K(x, y) = \frac{\partial^{|\alpha|} K}{(\partial x^1)^{\alpha_1} \cdots (\partial x^n)^{\alpha_n}}(x^1, \ldots, x^n, y^1, \ldots, y^n), \qquad x, y \in X^o,$$

and $D^\alpha K$ is a continuous extension of $D^\alpha(K|_{X^o \times X^o})$ to $X \times X$. For $x \in X$, denote $(D^\alpha K)_x$ as the function on $X$ given by $(D^\alpha K)_x(y) = D^\alpha K(x, y)$. By the symmetry of $K$, we have

$$D^\alpha(K_y)(x) = (D^\alpha K)_x(y) = D^\alpha K(x, y) \qquad \forall x, y \in X. \tag{2.1}$$

Now we can give the result on reproducing partial derivatives and embedding of $\mathcal{H}_K$ into $C^s(X)$. Here (1.1) and the weak compactness of a closed ball of a Hilbert space play an important role.

Theorem 1. Let $s \in \mathbb{N}$ and $K : X \times X \to \mathbb{R}$ be a Mercer kernel such that $K \in C^{2s}(X \times X)$. Then the following statements hold:

(a) For any $x \in X$ and $\alpha \in I_s$, $(D^\alpha K)_x \in \mathcal{H}_K$.

(b) A partial derivative reproducing property holds true for $\alpha \in I_s$:

$$D^\alpha f(x) = \langle (D^\alpha K)_x, f \rangle_K \qquad \forall x \in X,\ f \in \mathcal{H}_K. \tag{2.2}$$

(c) The inclusion $J : \mathcal{H}_K \to C^s(X)$ is well-defined and bounded:

$$\|f\|_{C^s(X)} \le n^s \|K\|_{C^{2s}(X \times X)}^{1/2} \|f\|_K, \qquad f \in \mathcal{H}_K. \tag{2.3}$$

(d) $J$ is compact. More strongly, for any closed bounded subset $B$ of $\mathcal{H}_K$, $J(B)$ is a compact subset of $C^s(X)$.

Proof. We first prove (a) and (b) together by induction on $|\alpha| = 0, 1, \ldots, s$. The case $|\alpha| = 0$ is trivial since for $\alpha = 0$, $(D^0 K)_x = K_x$ satisfies (1.1).

Let $0 \le l < s$. Suppose that $(D^\alpha K)_x \in \mathcal{H}_K$ and (2.2) holds for any $x \in X$ and $\alpha \in \mathbb{Z}_+^n$ with $|\alpha| = l$. Then (2.2) implies that for any $y \in X$,

$$\langle (D^\alpha K)_y, (D^\alpha K)_x \rangle_K = D^\alpha((D^\alpha K)_x)(y) = D^\alpha(D^\alpha K(x, \cdot))(y) = D^{(\alpha,\alpha)} K(x, y). \tag{2.4}$$

Here $(\alpha, \alpha) \in \mathbb{Z}_+^{2n}$ is formed by $\alpha$ in the first and second sets of $n$ components.

Now we turn to the case $l + 1$. Consider the index $\alpha + e_j$ with $|\alpha + e_j| = l + 1$. We prove (a) and (b) for this index in four steps.

Step 1: Proving $(D^{\alpha+e_j} K)_x \in \mathcal{H}_K$ for $x \in X^o$. Since $x \in X^o$, there exists some $r > 0$ such that $x + \{y \in \mathbb{R}^n : |y| \le r\} \subset X^o$. Then by (2.4), the set $\{\frac{1}{t}((D^\alpha K)_{x+te_j} - (D^\alpha K)_x) : |t| \le r\}$ of functions in $\mathcal{H}_K$ satisfies

$$\Big\|\frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big)\Big\|_K^2 = \frac{1}{t^2}\Big\{D^{(\alpha,\alpha)} K(x+te_j, x+te_j) - D^{(\alpha,\alpha)} K(x+te_j, x) - D^{(\alpha,\alpha)} K(x, x+te_j) + D^{(\alpha,\alpha)} K(x, x)\Big\} \le \|D^{(\alpha+e_j,\alpha+e_j)} K\|_\infty, \qquad |t| \le r.$$

Here we have used the assumption $K \in C^{2s}(X \times X)$ and $|(\alpha+e_j, \alpha+e_j)| = 2|\alpha| + 2 = 2l + 2 \le 2s$. That means $\{\frac{1}{t}((D^\alpha K)_{x+te_j} - (D^\alpha K)_x) : |t| \le r\}$ lies in the closed ball of the Hilbert space $\mathcal{H}_K$ with a finite radius $\|D^{(\alpha+e_j,\alpha+e_j)} K\|_\infty^{1/2}$. Since this ball is weakly compact, there is a sequence $\{t_i\}_{i=1}^\infty$ with $|t_i| \le r$ and $\lim_{i\to\infty} t_i = 0$ such that $\{\frac{1}{t_i}((D^\alpha K)_{x+t_ie_j} - (D^\alpha K)_x)\}$ converges weakly to an element $g_x$ of $\mathcal{H}_K$ as $i \to \infty$. The weak convergence tells us that

$$\lim_{i\to\infty} \Big\langle \frac{1}{t_i}\big((D^\alpha K)_{x+t_ie_j} - (D^\alpha K)_x\big), f \Big\rangle_K = \langle g_x, f \rangle_K \qquad \forall f \in \mathcal{H}_K. \tag{2.5}$$

In particular, by taking $f = K_y$ with $y \in X$, there holds

$$g_x(y) = \lim_{i\to\infty} \Big\langle \frac{1}{t_i}\big((D^\alpha K)_{x+t_ie_j} - (D^\alpha K)_x\big), K_y \Big\rangle_K.$$

By (2.2) for $\alpha$ and (2.1) we have

$$g_x(y) = \lim_{i\to\infty} \frac{1}{t_i}\big(D^\alpha(K_y)(x+t_ie_j) - D^\alpha(K_y)(x)\big) = \lim_{i\to\infty} \frac{1}{t_i}\big(D^\alpha K(x+t_ie_j, y) - D^\alpha K(x, y)\big) = D^{\alpha+e_j} K(x, y) = (D^{\alpha+e_j} K)_x(y).$$

This is true for an arbitrary point $y \in X$. Hence $(D^{\alpha+e_j} K)_x = g_x$ as functions on $X$. Since $g_x \in \mathcal{H}_K$, we know $(D^{\alpha+e_j} K)_x \in \mathcal{H}_K$.

Step 2: Proving for $x \in X^o$ the convergence

$$\frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big) \to (D^{\alpha+e_j} K)_x \quad \text{in } \mathcal{H}_K \ (t \to 0). \tag{2.6}$$

Applying (2.5) and (2.2) for $\alpha$ to the function $(D^{\alpha+e_j} K)_x \in \mathcal{H}_K$ yields

$$\langle (D^{\alpha+e_j} K)_x, (D^{\alpha+e_j} K)_x \rangle_K = \lim_{i\to\infty} \frac{1}{t_i}\Big\{D^\alpha((D^{\alpha+e_j} K)_x)(x+t_ie_j) - D^\alpha((D^{\alpha+e_j} K)_x)(x)\Big\} = \lim_{i\to\infty} \frac{1}{t_i}\Big\{D^\alpha(D^{\alpha+e_j} K(x, \cdot))(x+t_ie_j) - D^\alpha(D^{\alpha+e_j} K(x, \cdot))(x)\Big\} = D^{(\alpha+e_j,\alpha+e_j)} K(x, x).$$

This in connection with (2.2) implies

$$\Big\|\frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big) - (D^{\alpha+e_j} K)_x\Big\|_K^2 = \frac{1}{t^2}\Big\{D^{(\alpha,\alpha)} K(x+te_j, x+te_j) - 2D^{(\alpha,\alpha)} K(x+te_j, x) + D^{(\alpha,\alpha)} K(x, x)\Big\} - \frac{2}{t}\Big\{D^\alpha((D^{\alpha+e_j} K)_x)(x+te_j) - D^\alpha((D^{\alpha+e_j} K)_x)(x)\Big\} + D^{(\alpha+e_j,\alpha+e_j)} K(x, x)$$
$$= \frac{1}{t^2}\int_0^t\!\!\int_0^t D^{(\alpha+e_j,\alpha+e_j)} K(x+ue_j, x+ve_j)\,du\,dv - \frac{2}{t}\int_0^t D^{(\alpha+e_j,\alpha+e_j)} K(x, x+ve_j)\,dv + D^{(\alpha+e_j,\alpha+e_j)} K(x, x)$$
$$= \frac{1}{t^2}\int_0^t\!\!\int_0^t \Big\{D^{(\alpha+e_j,\alpha+e_j)} K(x+ue_j, x+ve_j) - 2D^{(\alpha+e_j,\alpha+e_j)} K(x, x+ve_j) + D^{(\alpha+e_j,\alpha+e_j)} K(x, x)\Big\}\,du\,dv.$$

If we define the modulus of continuity for a function $g \in C(X \times X)$ to be a function of $\delta \in (0, \infty)$ as

$$\omega(g, \delta) := \sup\{|g(x_1, y_1) - g(x_2, y_2)| : x_i, y_i \in X \text{ with } |x_1 - x_2| \le \delta,\ |y_1 - y_2| \le \delta\}, \tag{2.7}$$

we know from the uniform continuity of $g$ that $\lim_{\delta\to 0+} \omega(g, \delta) = 0$. Moreover, the function $\omega(g, \delta)$ is continuous on $(0, \infty)$. Using the modulus of continuity for the function $D^{(\alpha+e_j,\alpha+e_j)} K \in C(X \times X)$ we see that

$$\Big\|\frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big) - (D^{\alpha+e_j} K)_x\Big\|_K^2 \le 2\omega\big(D^{(\alpha+e_j,\alpha+e_j)} K, |t|\big). \tag{2.8}$$

This converges to zero as $t \to 0$. Therefore (2.6) holds true.

Step 3: Proving (2.2) for $x \in X^o$ and $\alpha + e_j$. Let $f \in \mathcal{H}_K$. By (2.6) we have

$$\langle (D^{\alpha+e_j} K)_x, f \rangle_K = \lim_{t\to 0} \Big\langle \frac{1}{t}\big((D^\alpha K)_{x+te_j} - (D^\alpha K)_x\big), f \Big\rangle_K.$$

By (2.2) for $\alpha$, we see that this equals

$$\langle (D^{\alpha+e_j} K)_x, f \rangle_K = \lim_{t\to 0} \frac{1}{t}\big\{D^\alpha f(x+te_j) - D^\alpha f(x)\big\}.$$

That is, $D^{\alpha+e_j} f(x)$ exists and equals $\langle (D^{\alpha+e_j} K)_x, f \rangle_K$. This verifies (2.2) for $\alpha + e_j$.

Step 4: Proving (a) and (b) for $x \in \partial X := X \setminus X^o$. Notice from the first three steps that for $x, \bar{x} \in X^o$, there holds

$$\|(D^{\alpha+e_j} K)_x - (D^{\alpha+e_j} K)_{\bar{x}}\|_K^2 = \Big\{D^{(\alpha+e_j,\alpha+e_j)} K(x, x) - D^{(\alpha+e_j,\alpha+e_j)} K(x, \bar{x}) - D^{(\alpha+e_j,\alpha+e_j)} K(\bar{x}, x) + D^{(\alpha+e_j,\alpha+e_j)} K(\bar{x}, \bar{x})\Big\} \le 2\omega\big(D^{(\alpha+e_j,\alpha+e_j)} K, |x - \bar{x}|\big).$$

It follows that for any sequence $\{x^{(i)} \in X^o\}_{i=1}^\infty$ converging to $x$, the sequence of functions $\{(D^{\alpha+e_j} K)_{x^{(i)}}\}$ is a Cauchy sequence in the Hilbert space $\mathcal{H}_K$. So it converges to a limit function $h \in \mathcal{H}_K$. Applying what we have proved for $x^{(i)} \in X^o$ we get

$$h(y) = \langle h, K_y \rangle_K = \lim_{i\to\infty} \langle (D^{\alpha+e_j} K)_{x^{(i)}}, K_y \rangle_K = (D^{\alpha+e_j} K)_x(y) \qquad \forall y \in X.$$

This verifies $(D^{\alpha+e_j} K)_x = h \in \mathcal{H}_K$.

Let $f \in \mathcal{H}_K$. We define a function $f^{[j]}$ on $X$ as $f^{[j]}(x) = \langle (D^{\alpha+e_j} K)_x, f \rangle_K$, $x \in X$. By the conclusion in Step 3, we know that $f^{[j]}(x) = D^{\alpha+e_j} f(x)$ for $x \in X^o$, hence $f^{[j]}$ is continuous on $X^o$. Let us now prove the continuity of $f^{[j]}$ at each $x \in \partial X$. If $\{x^{(i)} \in X^o\}_{i=1}^\infty$ is a sequence satisfying $\lim_{i\to\infty} x^{(i)} = x$, then the above proof tells us that $(D^{\alpha+e_j} K)_{x^{(i)}}$ converges to $(D^{\alpha+e_j} K)_x$ in the $\mathcal{H}_K$ metric, meaning that $\lim_{i\to\infty} \|(D^{\alpha+e_j} K)_{x^{(i)}} - (D^{\alpha+e_j} K)_x\|_K = 0$. So by the definition of $f^{[j]}$ and the Schwarz inequality, we have

$$|f^{[j]}(x^{(i)}) - f^{[j]}(x)| = |\langle (D^{\alpha+e_j} K)_{x^{(i)}} - (D^{\alpha+e_j} K)_x, f \rangle_K| \le \|(D^{\alpha+e_j} K)_{x^{(i)}} - (D^{\alpha+e_j} K)_x\|_K \|f\|_K \to 0 \quad \text{as } i \to \infty.$$

Thus the function $f^{[j]}$ is continuous on $X$, and it is a continuous extension of $D^{\alpha+e_j} f$ from $X^o$ to $X$. So (b) holds true for $x \in \partial X$. This completes the induction procedure for proving the statements in (a) and (b).

(c) We use (2.2) and (2.4). For $f \in \mathcal{H}_K$, $x, \bar{x} \in X$ and $\alpha \in I_s$, the Schwarz inequality implies

$$|D^\alpha f(x) - D^\alpha f(\bar{x})| = |\langle (D^\alpha K)_x - (D^\alpha K)_{\bar{x}}, f \rangle_K| \le \|(D^\alpha K)_x - (D^\alpha K)_{\bar{x}}\|_K \|f\|_K.$$

Hence

$$|D^\alpha f(x) - D^\alpha f(\bar{x})| \le \sqrt{D^{(\alpha,\alpha)} K(x, x) - 2D^{(\alpha,\alpha)} K(x, \bar{x}) + D^{(\alpha,\alpha)} K(\bar{x}, \bar{x})}\ \|f\|_K \le \sqrt{2\omega\big(D^{(\alpha,\alpha)} K, |x - \bar{x}|\big)}\ \|f\|_K \qquad \forall x, \bar{x} \in X. \tag{2.9}$$

As $\lim_{\delta\to 0+} \omega(D^{(\alpha,\alpha)} K, \delta) = 0$, we know that $D^\alpha f \in C(X)$. It means $f \in C^s(X)$ and the inclusion $J$ is well-defined. To see the boundedness, we apply the Schwarz inequality again and have

$$|D^\alpha f(x)| = |\langle (D^\alpha K)_x, f \rangle_K| \le \sqrt{D^{(\alpha,\alpha)} K(x, x)}\ \|f\|_K \le \|D^{(\alpha,\alpha)} K\|_\infty^{1/2} \|f\|_K.$$

It follows that

$$\|f\|_{C^s(X)} = \sum_{\alpha \in I_s} \|D^\alpha f\|_\infty \le \sum_{\alpha \in I_s} \|D^{(\alpha,\alpha)} K\|_\infty^{1/2} \|f\|_K \le n^s \Big(\sum_{\alpha \in I_s} \|D^{(\alpha,\alpha)} K\|_\infty\Big)^{1/2} \|f\|_K.$$

Then (2.3) is verified.

(d) If $B$ is a closed bounded subset of $\mathcal{H}_K$, there is some $R > 0$ such that $B \subseteq \{f \in \mathcal{H}_K : \|f\|_K \le R\}$. To show that $J(B)$ is compact, let $\{f_j\}_{j=1}^\infty$ be a sequence in $B$. The estimate (2.9) tells us that for each $\alpha \in I_s$ and $j \in \mathbb{N}$,

$$|D^\alpha f_j(x) - D^\alpha f_j(\bar{x})| \le \sqrt{2\omega\big(D^{(\alpha,\alpha)} K, |x - \bar{x}|\big)}\ R \qquad \forall x, \bar{x} \in X.$$

It says that the sequence of functions $\{D^\alpha f_j\}_{j=1}^\infty$ is uniformly equicontinuous; it is also uniformly bounded since $|D^\alpha f_j(x)| \le \|D^{(\alpha,\alpha)} K\|_\infty^{1/2} R$. This is true for each $\alpha \in I_s$. So by the Arzelà–Ascoli theorem and by taking subsequences for each $\alpha$ (one after another), we know that there is a subsequence $\{f_{j_l}\}_{l=1}^\infty$ which converges to a function $\bar{f} \in C^s(X)$ in the metric of $C^s(X)$. Observe that $\{f_{j_l}\}_{l=1}^\infty$ lies in the ball of $\mathcal{H}_K$ with radius $R$, which is weakly compact, so it contains a subsequence $\{f_{j_{l_k}}\}_{k=1}^\infty$ which is also a subsequence of $\{f_j\}_{j=1}^\infty$ and converges weakly in $\mathcal{H}_K$ to a function $f^* \in \mathcal{H}_K$. According to (2.2), the weak convergence in $\mathcal{H}_K$ tells us that $D^\alpha f_{j_{l_k}}(x) = \langle (D^\alpha K)_x, f_{j_{l_k}} \rangle_K \to D^\alpha f^*(x)$ for every $x \in X$ and $\alpha \in I_s$, so the $C^s(X)$ limit $\bar{f}$ coincides with $f^*$. Therefore, $\bar{f} = f^* \in \mathcal{H}_K$ and $\{f_j\}_{j=1}^\infty \subseteq J(B)$ contains a subsequence which converges in $C^s(X)$ to $f^*$. This proves that $J(B)$ is compact. The proof of Theorem 1 is complete.
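The derivative reproducing property (2.2) and the convergence (2.6) can be checked numerically for a concrete smooth kernel. The sketch below is an illustration under assumed choices (a Gaussian kernel on $\mathbb{R}$, random points and coefficients), not part of the paper: it compares a finite difference of $f \in \mathrm{span}\{K_{x_i}\}$ with $\langle (D^{e_1}K)_x, f\rangle_K$, and evaluates the squared $\mathcal{H}_K$ distance appearing in (2.6) in closed form via (2.4)-type identities.

```python
# Numerical illustration of Theorem 1 for the Gaussian kernel on R (n = 1),
# which is C^infinity; we check the first-order case.  Kernel, points and
# coefficients below are assumed toy choices.
import numpy as np

sigma = 0.6
K    = lambda u, v: np.exp(-(u - v) ** 2 / (2 * sigma ** 2))
D1K  = lambda u, v: -(u - v) / sigma ** 2 * K(u, v)                          # derivative in 1st argument
D11K = lambda u, v: (1 / sigma ** 2 - (u - v) ** 2 / sigma ** 4) * K(u, v)   # D^{(1,1)}K

rng = np.random.default_rng(2)
xs, c = rng.uniform(0, 1, 6), rng.normal(size=6)
f = lambda u: sum(ci * K(u, xi) for ci, xi in zip(c, xs))

# (2.2): f'(x) = <(D^1 K)_x, f>_K = sum_i c_i D^1 K(x, x_i); compare with a
# central finite difference of f.
x, h = 0.4, 1e-5
fd  = (f(x + h) - f(x - h)) / (2 * h)
rep = sum(ci * D1K(x, xi) for ci, xi in zip(c, xs))
print(f"finite difference {fd:.6f}  vs  derivative reproducing {rep:.6f}")

# (2.6): the squared H_K distance of the difference quotient to (D^1 K)_x,
# expanded through inner products of kernel sections, shrinks as t -> 0.
for t in [0.1, 0.01, 0.001]:
    err2 = ((K(x + t, x + t) - 2 * K(x + t, x) + K(x, x)) / t ** 2
            - (2 / t) * (D1K(x, x + t) - D1K(x, x)) + D11K(x, x))
    print(f"t = {t:g}: squared H_K distance = {err2:.2e}")
```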

Theorem 1 can be extended to other kernels [10]. Relation (2.3) tells us that error bounds in the $\|\cdot\|_K$ norm can be used to estimate convergence rates of learning algorithms in the norm of $C^s(X)$, as done in [4].

3. Representer theorems for learning with derivative data

A general learning algorithm of regularization in $\mathcal{H}_K$ involving partial derivative data takes the form

$$f_{\mathbf{x},\mathbf{y},\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \sum_{i=1}^m V_i\big(\mathbf{y}_i, \{D^\alpha f(x_i)\}_{\alpha \in J_i}\big) + \lambda \|f\|_K^2 \right\}, \tag{3.1}$$

where for each $i \in \{1, \ldots, m\}$, $x_i \in X$, $\mathbf{y}_i$ is a vector, $J_i$ is a subset of $I_s$ and $V_i$ is a loss function with values in $\mathbb{R}_+$ of compatible variables. Denote the number of elements in the set $J_i$ as $\#(J_i)$.

The partial derivative reproducing property (2.2) stated in Theorem 1 enables us to derive a representer theorem for the learning algorithm (3.1), which asserts that the minimization over the possibly infinite dimensional space $\mathcal{H}_K$ can be achieved in a finite dimensional subspace generated by $\{K_{x_i}\}$ and their partial derivatives.

Theorem 2. Let $s \in \mathbb{N}$ and $K : X \times X \to \mathbb{R}$ be a Mercer kernel such that $K \in C^{2s}(X \times X)$. If $\lambda > 0$, then the solution $f_{\mathbf{x},\mathbf{y},\lambda}$ of scheme (3.1) exists and lies in the subspace spanned by $\{(D^\alpha K)_{x_i} : \alpha \in J_i,\ i = 1, \ldots, m\}$. If we write $f_{\mathbf{x},\mathbf{y},\lambda} = \sum_{i=1}^m \sum_{\alpha \in J_i} c_{i,\alpha} (D^\alpha K)_{x_i}$ with $\mathbf{c} = (c_{i,\alpha})_{\alpha \in J_i, i=1,\ldots,m} \in \mathbb{R}^N$ where $N = \sum_{i=1}^m \#(J_i)$, then

$$\mathbf{c} = \arg\min_{\mathbf{c} \in \mathbb{R}^N}\ \sum_{i=1}^m V_i\Big(\mathbf{y}_i, \Big\{\sum_{j=1}^m \sum_{\beta \in J_j} c_{j,\beta}\, D^{(\beta,\alpha)} K(x_j, x_i)\Big\}_{\alpha \in J_i}\Big) + \lambda \sum_{i=1}^m \sum_{\alpha \in J_i} \sum_{j=1}^m \sum_{\beta \in J_j} c_{i,\alpha} c_{j,\beta}\, D^{(\beta,\alpha)} K(x_j, x_i).$$

Proof. By Theorem 1, we know that for any $\alpha$ in $J_i$, the function $(D^\alpha K)_{x_i}$ lies in $\mathcal{H}_K$. Denote the subspace of $\mathcal{H}_K$ spanned by $\{(D^\alpha K)_{x_i} : \alpha \in J_i,\ i = 1, \ldots, m\}$ as $\mathcal{H}_{K,\mathbf{x}}$. Let $P$ be the orthogonal projection onto this subspace. Then for any $f \in \mathcal{H}_K$, the function $f - P(f)$ is orthogonal to $\mathcal{H}_{K,\mathbf{x}}$. In particular, $\langle f - P(f), (D^\alpha K)_{x_i} \rangle_K = 0$ for any $\alpha \in J_i$ and $1 \le i \le m$. This in connection with the partial derivative reproducing property (2.2) tells us that

$$D^\alpha(f - P(f))(x_i) = D^\alpha f(x_i) - D^\alpha(P(f))(x_i) = 0 \qquad \forall \alpha \in J_i,\ i = 1, \ldots, m.$$

Thus, if we denote $\mathcal{E}_{\mathbf{z}}(f) = \sum_{i=1}^m V_i(\mathbf{y}_i, \{D^\alpha f(x_i)\}_{\alpha \in J_i})$, we see that $\mathcal{E}_{\mathbf{z}}(f) = \mathcal{E}_{\mathbf{z}}(P(f))$. Notice that $\|P(f)\|_K \le \|f\|_K$ and the strict inequality holds unless $f = P(f)$, i.e., $f \in \mathcal{H}_{K,\mathbf{x}}$. Therefore,

$$\min_{f \in \mathcal{H}_K} \{\mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_K^2\} = \min_{f \in \mathcal{H}_{K,\mathbf{x}}} \{\mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_K^2\},$$

and a minimizer $f_{\mathbf{x},\mathbf{y},\lambda}$ exists and lies in $\mathcal{H}_{K,\mathbf{x}}$ since the subspace is finite dimensional. The second statement is trivial.

We shall not discuss learning rates of the learning algorithms (1.3) and (1.4). Though rough estimates can be given using methods from [4,8,13], satisfactory error analysis for more general learning algorithms [9,20] will be done later.

Acknowledgements

The work described in this paper was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 03704) and a Joint Research Fund for Hong Kong and Macau Young Scholars from the National Science Fund for Distinguished Young Scholars (Project No. 05290).
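To make Theorem 2 concrete, the following sketch implements the Hermite learning scheme (1.4) for an assumed Gaussian kernel on $[0,1]$ (so $n = 1$ and $J_i = \{0, 1\}$ for every $i$): the minimizer is expanded as $f = \sum_j [c_{j,0} K_{x_j} + c_{j,1} (D^1 K)_{x_j}]$, the matrix $M$ with entries $D^{(\beta,\alpha)}K(x_j, x_i)$ is assembled from closed-form kernel derivatives, and for the square loss the coefficient vector solves $(M + \lambda m I)\mathbf{c} = (\mathbf{y}, \mathbf{y}')$. The kernel, data, and the closed-form solve are illustrative assumptions, not the paper's text.

```python
# Sketch: Hermite learning (1.4) via the representer theorem of Theorem 2,
# Gaussian kernel on [0,1] with function values and first derivatives as data.
import numpy as np

sigma, lam = 0.4, 1e-4
rng = np.random.default_rng(3)
m = 15
x = np.sort(rng.uniform(0, 1, m))
y  = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=m)               # y_i  ~ f(x_i)
yp = 2 * np.pi * np.cos(2 * np.pi * x) + 0.05 * rng.normal(size=m)   # y'_i ~ f'(x_i)

d = x[None, :] - x[:, None]                          # d[i, j] = x_j - x_i
Kmat = np.exp(-d ** 2 / (2 * sigma ** 2))
B00 = Kmat                                           # D^{(0,0)}K(x_j, x_i)
B01 = -d / sigma ** 2 * Kmat                         # D^{(1,0)}K(x_j, x_i)
B10 = d / sigma ** 2 * Kmat                          # D^{(0,1)}K(x_j, x_i)
B11 = (1 / sigma ** 2 - d ** 2 / sigma ** 4) * Kmat  # D^{(1,1)}K(x_j, x_i)
M = np.block([[B00, B01], [B10, B11]])               # rows (i,alpha), columns (j,beta)

# Minimizing (1/m)||yhat - M c||^2 + lam * c^T M c gives (M + lam*m*I) c = yhat.
yhat = np.concatenate([y, yp])
c = np.linalg.solve(M + lam * m * np.eye(2 * m), yhat)
c0, c1 = c[:m], c[m:]

def f(t):
    """Evaluate f(t) = sum_j [ c_{j,0} K(x_j, t) + c_{j,1} D^1K(x_j, t) ]."""
    dt = x[None, :] - t[:, None]                     # x_j - t
    Kt = np.exp(-dt ** 2 / (2 * sigma ** 2))
    return Kt @ c0 + (-dt / sigma ** 2 * Kt) @ c1

t = np.linspace(0, 1, 5)
print(np.round(f(t), 3), "vs target", np.round(np.sin(2 * np.pi * t), 3))
```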

References

[1] M. Anthony, P.L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, 1999.
[2] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950) 337–404.
[3] M. Belkin, P. Niyogi, Semi-supervised learning on Riemannian manifolds, Mach. Learn. 56 (2004) 209–239.
[4] E. De Vito, A. Caponnetto, L. Rosasco, Model selection for regularized least-squares algorithm in learning theory, Found. Comput. Math. 5 (2005) 59–85.
[5] E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, A. Verri, Some properties of regularized kernel methods, J. Mach. Learn. Res. 5 (2004) 1363–1390.
[6] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (2000) 1–50.
[7] D. Hardin, I. Tsamardinos, C.F. Aliferis, A theoretical characterization of linear SVM-based feature selection, in: Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
[8] S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin, Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, Adv. Comput. Math. 25 (2006) 161–193.
[9] S. Mukherjee, Q. Wu, Estimation of gradients and coordinate covariation in classification, J. Mach. Learn. Res. 7 (2006) 2481–2514.
[10] R. Schaback, J. Werner, Linearly constrained reconstruction of functions by kernels with applications to machine learning, Adv. Comput. Math. 25 (2006) 237–258.
[11] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inform. Theory 44 (1998) 1926–1940.
[12] S. Smale, D.X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003) 17–41.
[13] S. Smale, D.X. Zhou, Shannon sampling and function reconstruction from point values, Bull. Amer. Math. Soc. 41 (2004) 279–305.
[14] S. Smale, D.X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx. 26 (2007) 153–172.
[15] G. Wahba, Spline Models for Observational Data, SIAM, Philadelphia, PA, 1990.
[16] Y. Yao, On complexity issues of online learning algorithms, IEEE Trans. Inform. Theory, to appear.
[17] Y. Ying, Convergence analysis of online algorithms, Adv. Comput. Math. 27 (2007) 273–291.
[18] D.X. Zhou, The covering number in learning theory, J. Complexity 18 (2002) 739–767.
[19] D.X. Zhou, Capacity of reproducing kernel spaces in learning theory, IEEE Trans. Inform. Theory 49 (2003) 1743–1752.
[20] D.X. Zhou, K. Jetter, Approximation with polynomial kernels and SVM classifiers, Adv. Comput. Math. 25 (2006) 323–344.
[21] W.P. Ziemer, Weakly Differentiable Functions, Springer, New York, 1989.