Universal algorithms for learning theory Part II: piecewise polynomial functions

Peter Binev, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore

December 6, 2005

Abstract

This paper is concerned with estimating the regression function $f_\rho$ in supervised learning by utilizing piecewise polynomial approximations on adaptively generated partitions. The main point of interest is algorithms that with high probability are optimal in terms of the least square error achieved for a given number $m$ of observed data. In a previous paper [1], we have developed for each $\beta>0$ an algorithm for piecewise constant approximation which is proven to provide such optimal order estimates with probability larger than $1-m^{-\beta}$. In this paper, we consider the case of higher degree polynomials. We show that for general probability measures $\rho$ empirical least squares minimization will not provide optimal error estimates with high probability. We go further in identifying certain conditions on the probability measure $\rho$ which will allow optimal estimates with high probability.

Key words: Regression, universal piecewise polynomial estimators, optimal convergence rates in probability, adaptive partitioning, thresholding, on-line updates

AMS Subject Classification: 62G08, 62G20, 41A25

This research was supported by the Office of Naval Research Contracts ONR-N00014-03-1-0051, ONR/DEPSCoR N00014-03-1-0675 and ONR/DEPSCoR N00014-00-1-0470; the Army Research Office Contract DAAD 19-02-1-0028; the AFOSR Contract UF/USAF F49620-03-1-0381; NSF contracts DMS-0221642 and DMS-0200187; the European Community's Human Potential Programme under contract HPRN-CT-2002-00286 (BREAKING COMPLEXITY); and the National Science Foundation Grant DMS-0200665.

1 Introduction

This paper is concerned with providing estimates in probability for the approximation of the regression function in supervised learning when using piecewise polynomials on adaptively generated partitions. We shall work in the following setting. We suppose that $\rho$ is an unknown measure on a product space $Z:=X\times Y$, where $X$ is a bounded domain of $\mathbb{R}^d$ and $Y=\mathbb{R}$.

Given $m$ independent random observations $z_i=(x_i,y_i)$, $i=1,\dots,m$, identically distributed according to $\rho$, we are interested in estimating the regression function $f_\rho(x)$ defined as the conditional expectation of the random variable $y$ at $x$:

$$f_\rho(x):=\int_Y y\,d\rho(y|x) \qquad (1.1)$$

with $\rho(y|x)$ the conditional probability measure on $Y$ with respect to $x$. We shall use $z=\{z_1,\dots,z_m\}\subset Z^m$ to denote the set of observations. One of the goals of learning is to provide estimates under minimal restrictions on the measure $\rho$, since this measure is unknown to us. In this paper, we shall always work under the assumption that

$$|y|\le M \qquad (1.2)$$

almost surely. It follows in particular that $|f_\rho|\le M$. This property of $\rho$ can often be inferred in practical applications.

We denote by $\rho_X$ the marginal probability measure on $X$ defined by

$$\rho_X(S):=\rho(S\times Y). \qquad (1.3)$$

We shall assume that $\rho_X$ is a Borel measure on $X$. We have

$$d\rho(x,y)=d\rho(y|x)\,d\rho_X(x). \qquad (1.4)$$

It is easy to check that $f_\rho$ is the minimizer of the risk functional

$$\mathcal E(f):=\int_Z (y-f(x))^2\,d\rho \qquad (1.5)$$

over $f\in L_2(X,\rho_X)$, where this space consists of all functions from $X$ to $Y$ which are square integrable with respect to $\rho_X$. In fact, one has

$$\mathcal E(f)=\mathcal E(f_\rho)+\|f-f_\rho\|^2,\quad f\in L_2(X,\rho_X), \qquad (1.6)$$

where

$$\|\cdot\|:=\|\cdot\|_{L_2(X,\rho_X)}. \qquad (1.7)$$

Our objective will be to find an estimator $f_z$ for $f_\rho$ based on $z$ such that the quantity $\|f_z-f_\rho\|$ is small with high probability. This type of regression problem is referred to as distribution-free. A recent survey on distribution-free regression theory is provided in the book [8], which includes most existing approaches as well as the analysis of their rate of convergence in the expectation sense.

A common approach to this problem is to choose a hypothesis (or model) class $\mathcal H$ and then to define $f_z$, in analogy to (1.5), as the minimizer of the empirical risk

$$f_z:=\mathop{\mathrm{argmin}}_{f\in\mathcal H}\mathcal E_z(f),\quad\text{with}\quad \mathcal E_z(f):=\frac1m\sum_{j=1}^m (y_j-f(x_j))^2. \qquad (1.8)$$
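As a concrete illustration of the empirical risk minimization step (1.8), the following short sketch fits a least squares model over a finite-dimensional linear hypothesis class. The choice of a low-degree polynomial class on $[0,1]$, the synthetic data, and all function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def empirical_risk_minimizer(x, y, degree=3):
    """Least squares minimizer of the empirical risk E_z(f) in (1.8) over the
    linear hypothesis class H = {polynomials of degree <= `degree`} on [0, 1]."""
    V = np.vander(x, N=degree + 1, increasing=True)      # monomial design matrix
    coef, *_ = np.linalg.lstsq(V, y, rcond=None)
    return lambda t: np.vander(np.atleast_1d(t), N=degree + 1, increasing=True) @ coef

def empirical_risk(f, x, y):
    """E_z(f) = (1/m) * sum_j (y_j - f(x_j))^2."""
    return np.mean((y - f(x)) ** 2)

# toy usage with synthetic observations z = (x_i, y_i)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
f_z = empirical_risk_minimizer(x, y)
print(empirical_risk(f_z, x, y))
```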

In other words, $f_z$ is the best approximation to $(y_j)_{j=1}^m$ from $\mathcal H$ in the empirical norm

$$\|g\|_m^2:=\frac1m\sum_{j=1}^m |g(x_j)|^2. \qquad (1.9)$$

Typically, $\mathcal H=\mathcal H_m$ depends on a finite number $n=n(m)$ of parameters. In some algorithms, the number $n$ is chosen using an a priori assumption on $f_\rho$. We want to avoid such prior assumptions. In other procedures, the number $n$ is adapted to the data and thereby avoids any a priori assumptions. We shall be interested in estimators of this type.

The usual way of evaluating the performance of the estimator $f_z$ is by studying its convergence either in probability or in expectation, i.e. the rate of decay of the quantities

$$P\{\|f_\rho-f_z\|\ge\eta\},\ \eta>0,\quad\text{or}\quad E(\|f_\rho-f_z\|^2) \qquad (1.10)$$

as the sample size $m$ increases. Here both the expectation and the probability are taken with respect to the product measure $\rho^m$ defined on $Z^m$. Estimates in probability are to be preferred since they give more information about the success of a particular algorithm, and they automatically yield an estimate in expectation by integrating with respect to $\eta$. Much more is known about the performance of algorithms in expectation than in probability, as we shall explain below. The present paper will be mainly concerned with estimates in probability, and we shall show that this problem has some interesting twists.

Estimates for the decay of the quantities in (1.10) are usually obtained under certain assumptions (called priors) on $f_\rho$. We emphasize that the algorithms should not depend on prior assumptions on $f_\rho$. Only in the analysis of the algorithms do we impose such prior assumptions in order to see how well the algorithm performs.

Priors on $f_\rho$ are typically expressed by a condition of the type $f_\rho\in\Theta$, where $\Theta$ is a class of functions that necessarily must be contained in $L_2(X,\rho_X)$. If we wish the error, as measured in (1.10), to tend to zero as the number $m$ of samples tends to infinity, then we necessarily need that $\Theta$ is a compact subset of $L_2(X,\rho_X)$. There are three common ways to measure the compactness of a set $\Theta$: (i) minimal coverings, (ii) smoothness conditions on the elements of $\Theta$, (iii) the rate of approximation of the elements of $\Theta$ by a specific approximation process. We have discussed the advantages and limitations of each of these approaches in [1] (see also [6]). In the present paper we shall take the view of (iii) and seek estimators which are optimal for a certain collection of priors of this type. Describing compactness in this way provides a benchmark for what could be achieved at best by a concrete estimator.

Our previous work [1] has considered the special case of approximation using piecewise constants on adaptively generated partitions. In that case, we have introduced algorithms that we prove to be optimal in the sense that their rate of convergence is best possible, for a large collection of prior classes, among all methods that utilize piecewise constant approximation on adaptively generated partitions based on isotropic refinements. Moreover, the methods we proposed in [1] had certain aesthetic and numerical advantages. For example, they are implemented by a simple thresholding procedure that can be done on line. This means that in the case of streaming data only a small number of updates are necessary as new data appear. Also, the analysis of our methods provided not only estimates in expectation but also the estimates in probability which we seek.

On the other hand, from an approximation theoretic point of view, using just piecewise constants severely limits the range of error decay rates that can be obtained even for very regular approximands. Much better rates can be expected when employing higher order piecewise polynomials, which would result in equally local and, due to higher accuracy, overall more economical procedures. However, in this previous work, we have purposefully not considered the general case of piecewise polynomial approximation because we had already identified certain notable - perhaps at first sight surprising - distinctions with the piecewise constant case. The purpose of the present paper is to analyze the case of general piecewise polynomial approximation, and in particular, to draw out these distinctions. We mention a few of these in this introduction.

In the piecewise constant case, we have shown that estimators built on empirical risk minimization (1.8) are guaranteed with high probability to approximate the regression function with optimal accuracy (in terms of rates of convergence). Here, by high probability we mean that the probability of not providing an optimal order of approximation tends to zero faster than $m^{-\beta}$ for any prescribed $\beta>0$. We shall show in §3 that in general such probability estimates do not hold when using empirical risk minimization with piecewise polynomial approximation. This means that if we seek estimators which perform well in probability, then either we must assume something more about the underlying probability measure $\rho$ or we must find an alternative to empirical risk minimization. In §4 we put some additional restrictions on the measure $\rho_X$ and show that under these restrictions we can again design algorithms based on empirical risk minimization which perform optimally with high probability. While, as we have already mentioned, these assumptions on $\rho_X$ are undesirable, we believe that in view of the counterexample of §3 they represent roughly what can be done if one proceeds only with empirical risk minimization.

2 Approximating the Regression Function: General Strategies

In studying the estimation of the regression function, the question arises at the outset as to what are the best approximation methods to use in deriving algorithms for approximating $f_\rho$, and therefore indirectly in defining prior classes. With no additional knowledge of $\rho$ (and thereby $f_\rho$) there is no general answer to this question. However, we can draw some distinctions between certain strategies.

Suppose that we seek to approximate $f_\rho$ by elements from a hypothesis class $\mathcal H=\Sigma_n$. Here the parameter $n$ measures the complexity associated to the process. In the case of approximation by elements from linear spaces we will take the space $\Sigma_n$ to be of dimension $n$. For nonlinear methods, the space $\Sigma_n$ is not linear and now $n$ represents the number of parameters used in the approximation. For example, if we choose to approximate by piecewise polynomials on partitions with the degree $r$ of the polynomials fixed, then $n$ could be chosen as the number of cells in the partition. The potential effectiveness of the approximation process for our regression problem would be measured by the error of approximation in the $L_2(X,\rho_X)$ norm.

We define this error for a function $g\in L_2(X,\rho_X)$ by

$$E_n(g):=E(g,\Sigma_n):=\inf_{S\in\Sigma_n}\|g-S\|,\quad n=1,2,\dots. \qquad (2.1)$$

If we have two approximation methods corresponding to sequences of approximation spaces $(\Sigma_n)$ and $(\Sigma_n')$, then the second process would be superior to the first in terms of rates of approximation if $E_n'(g)\le C E_n(g)$ for all $g$ and an absolute constant $C>0$. For example, approximation using piecewise linear functions would in this sense be superior to approximation by piecewise constants. In our learning context, however, there are other considerations since: (i) the rate of approximation need not translate directly into results about estimating $f_\rho$ because of the uncertainty in our observations, (ii) it may be that the superior approximation method is in fact much more difficult (or impossible) to implement in practice. For example, a typical nonlinear method may consist of finding an approximation to $g$ from a family of linear spaces each of dimension $N$. The larger the family, the more powerful the approximation method. However, too large a family will generally make the numerical implementation of this method of approximation impossible.

Suppose that we have chosen the space $\Sigma_n$ to be used as our hypothesis class $\mathcal H$ in the approximation of $f_\rho$ from our given data $z$. How should we define our approximation? As we have noted in the introduction, the most common approach is empirical risk minimization, which gives the function $\hat f_z:=\hat f_{z,\Sigma_n}$ defined by (1.8). However, since we know $|f_\rho|\le M$, the approximation will be improved if we post-truncate $\hat f_z$ by $M$. For this, we define the truncation operator

$$T_M(x):=\min(|x|,M)\,\mathrm{sign}(x) \qquad (2.2)$$

for any real number $x$ and define

$$f_z:=f_{z,\mathcal H}:=T_M(\hat f_{z,\mathcal H}). \qquad (2.3)$$

There are general results that provide estimates for how well $f_z$ approximates $f_\rho$. One such estimate given in [8] (see Theorem 11.3) applies when $\mathcal H$ is a linear space of dimension $n$ and gives

$$E(\|f_\rho-f_z\|^2)\lesssim \frac{n\log(m)}{m}+\inf_{g\in\mathcal H}\|f_\rho-g\|^2 \qquad (2.4)$$

(here and later in this paper we use the notation $A\lesssim B$ to mean $A\le CB$ for some absolute constant $C$). The second term is the bias and equals our approximation error $E_n(f_\rho)^2$ for approximation using the elements of $\mathcal H$. The first term is the variance, which bounds the error due to uncertainty. One can derive rates of convergence in expectation by balancing both terms (see [8] and [6]) for specific applications. The deficiency of this approach is that one needs to know the behavior of $E_n(f_\rho)$ in order to choose the best value of $n$, and this requires a priori knowledge of $f_\rho$.

There is a general procedure known as model selection which circumvents this difficulty and tries to automatically choose a good value of $n$ (depending on $f_\rho$) by introducing a penalty term. Suppose that $(\Sigma_n)_{n=1}^m$ is a family of linear spaces each of dimension $n$.

For each $n=1,2,\dots,m$, we have the corresponding estimator $f_{z,\Sigma_n}$ defined by (2.3) and the empirical error

$$E_{n,z}:=\frac1m\sum_{j=1}^m (y_j-f_{z,\Sigma_n}(x_j))^2. \qquad (2.5)$$

Notice that $E_{n,z}$ is a computable quantity which we can view as an estimate for $E_n(f_\rho)$. In complexity regularization, one chooses a value of $n$ by

$$n^*:=n^*(z):=\mathop{\mathrm{argmin}}_{1\le n\le m}\Big\{E_{n,z}+\frac{n\log m}{m}\Big\}. \qquad (2.6)$$

We now define

$$f_z:=f_{z,\Sigma_{n^*}} \qquad (2.7)$$

as our estimator to $f_\rho$. One can then prove (see Chapter 12 of [8]) that whenever $f_\rho$ can be approximated to accuracy $E_n(f_\rho)\le Mn^{-s}$ for some $s>0$, then

$$E(\|f_\rho-f_z\|^2)\le C\Big[\frac{\log m}{m}\Big]^{\frac{2s}{2s+1}}, \qquad (2.8)$$

which save for the logarithm is an optimal rate estimation in expectation. For a certain range of $s$, one can also prove similar estimates in probability (see [6]). Notice that the estimator did not need to have knowledge of $s$ and nevertheless obtains the optimal performance.

Throughout the paper we use the following conventions concerning constants. Constants like $C$, $c$, $\tilde c$ depend on the specified parameters, but they may vary at each occurrence, even in the same line. We shall indicate the dependence of a constant on other parameters whenever this is important.

Model selection can also be applied in the setting of nonlinear approximation, i.e. when the spaces $\Sigma_n$ are nonlinear, but in this case one needs to invoke conditions on the compatibility of the penalty with the complexity of the approximation process as measured by an entropy restriction. We refer the reader to Chapter 12 of [8] for a more detailed discussion of this topic and will briefly take up this point again in §2.3.

Let us also note that the penalty approach is not always compatible with the practical requirement of on-line computations. By on-line computation, we mean that the estimator for the sample size $m$ can be derived by a simple update of the estimator for the sample size $m-1$. In penalty methods, the optimization problem needs to be globally re-solved when adding a new sample. However, when there is additional structure in the approximation process, such as the adaptive partitioning that we discuss in the next section, then there are algorithms that circumvent this difficulty (see the discussion of CART algorithms given in the following section).

2.1 Adaptive Partitioning

In this paper, we shall be interested in approximation by piecewise polynomials on partitions generated by adaptive partitioning. We shall restrict our discussion to the case $X=[0,1]^d$ and the case of dyadic partitions. However, all results would follow in the more general setting described in [1].
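Looking back at the complexity regularization rule (2.5)-(2.7), the following small sketch makes the selection step concrete before the dyadic machinery is introduced. The nested monomial spaces playing the role of $\Sigma_n$ and the absence of an extra penalty constant are illustrative assumptions.

```python
import numpy as np

def select_by_complexity_regularization(x, y, n_max=20):
    """Pick n* = argmin_n [ E_{n,z} + n*log(m)/m ] as in (2.6), where Sigma_n is
    modeled here by the span of the first n monomials, and return the fit."""
    m = len(x)
    best = None
    for n in range(1, n_max + 1):
        V = np.vander(x, N=n, increasing=True)       # dimension-n linear space
        coef, *_ = np.linalg.lstsq(V, y, rcond=None)
        emp_err = np.mean((y - V @ coef) ** 2)       # empirical error E_{n,z}, see (2.5)
        score = emp_err + n * np.log(m) / m          # penalized criterion, see (2.6)
        if best is None or score < best[0]:
            best = (score, n, coef)
    _, n_star, coef = best
    f_z = lambda t: np.vander(np.atleast_1d(t), N=n_star, increasing=True) @ coef
    return n_star, f_z
```

Re-running this selection from scratch for every new sample is exactly the on-line drawback mentioned above.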

Let $\mathcal D_j=\mathcal D_j(X)$ be the collection of dyadic subcubes of $X$ of sidelength $2^{-j}$ and $\mathcal D:=\cup_{j\ge0}\mathcal D_j$. These cubes are naturally aligned on a tree $\mathcal T=\mathcal T(\mathcal D)$. Each node of the tree $\mathcal T$ corresponds to a cube $I\in\mathcal D$. If $I\in\mathcal D_j$, then its children are the $2^d$ dyadic cubes $J\in\mathcal D_{j+1}$ with $J\subset I$. We denote the set of children of $I$ by $\mathcal C(I)$. We call $I$ the parent of each such child $J$ and write $I=P(J)$.

A proper subtree $\mathcal T_0$ of $\mathcal T$ is a collection of nodes of $\mathcal T$ with the properties: (i) the root node $I=X$ is in $\mathcal T_0$, (ii) if $I\ne X$ is in $\mathcal T_0$ then its parent is also in $\mathcal T_0$. We obtain (dyadic) partitions $\Lambda$ of $X$ from finite proper subtrees $\mathcal T_0$ of $\mathcal T$. Given any such $\mathcal T_0$, the outer leaves of $\mathcal T_0$ consist of all $J\in\mathcal T$ such that $J\notin\mathcal T_0$ but $P(J)$ is in $\mathcal T_0$. The collection $\Lambda=\Lambda(\mathcal T_0)$ of outer leaves of $\mathcal T_0$ is a partition of $X$ into dyadic cubes. It is easily checked that

$$\#(\mathcal T_0)\le\#(\Lambda)\le 2^d\#(\mathcal T_0). \qquad (2.9)$$

A uniform partition of $X$ into dyadic cubes consists of all dyadic cubes in $\mathcal D_j(X)$ for some $j\ge0$. Thus, each cube in a uniform partition has the same measure $2^{-jd}$. Another way of generating partitions is through some refinement strategy. One begins at the root $X$ and decides whether to refine $X$ (i.e. subdivide $X$) based on some refinement criterion. If $X$ is subdivided, then one examines each child and decides whether or not to refine such a child based on the refinement strategy. Partitions obtained this way are called adaptive.

We let $\Pi_K$ denote the space of multivariate polynomials of total degree at most $K$, with $K\ge0$ a fixed integer. In the analysis we present, the space $\Pi_K$ could be replaced by any space of functions of fixed finite dimension without affecting our general discussion. Given a dyadic cube $I\in\mathcal D$ and a function $f\in L_2(X,\rho_X)$, we denote by $p_I(f)$ the best approximation to $f$ on $I$:

$$p_I(f):=\mathop{\mathrm{argmin}}_{p\in\Pi_K}\|f-p\|_{L_2(I,\rho_X)}. \qquad (2.10)$$

Given $K\ge0$ and a partition $\Lambda$, let us denote by $S^K_\Lambda$ the space of piecewise polynomial functions of degree at most $K$ subordinate to $\Lambda$. Each $S\in S^K_\Lambda$ can be written

$$S=\sum_{I\in\Lambda}p_I\chi_I,\quad p_I\in\Pi_K, \qquad (2.11)$$

where for $G\subset X$ we denote by $\chi_G$ the indicator function, i.e. $\chi_G(x)=1$ for $x\in G$ and $\chi_G(x)=0$ for $x\notin G$. We shall consider the approximation of a given function $f\in L_2(X,\rho_X)$ by the elements of $S^K_\Lambda$. The best approximation to $f$ in this space is given by

$$P_\Lambda f:=\sum_{I\in\Lambda}p_I(f)\chi_I. \qquad (2.12)$$

We shall be interested in two types of approximation corresponding to uniform refinement and adaptive refinement. We first discuss uniform refinement. Let

$$E_n(f):=\|f-P_{\Lambda_n}f\|,\quad n=0,1,\dots, \qquad (2.13)$$

which is the error for uniform refinement, where $\Lambda_n$ denotes the uniform partition $\mathcal D_n$. We shall denote by $\mathcal A^s$ the approximation class consisting of all functions $f\in L_2(X,\rho_X)$ such that

$$E_n(f)\le M_02^{-nds},\quad n=0,1,\dots. \qquad (2.14)$$
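The tree structure just described is easy to make concrete. The sketch below encodes a dyadic cube by its level and integer index and recovers the partition $\Lambda(\mathcal T_0)$ of outer leaves from a proper subtree, in the spirit of (2.9); the encoding and the function names are illustrative.

```python
from itertools import product

def children(cube):
    """The 2^d dyadic children of a cube (j, k) representing I = 2^-j (k + [0,1]^d)."""
    j, k = cube
    return [(j + 1, tuple(2 * ki + bi for ki, bi in zip(k, b)))
            for b in product((0, 1), repeat=len(k))]

def outer_leaves(tree, root):
    """Partition Lambda(T0): children of nodes of the proper subtree that are
    not themselves in the subtree (the whole domain if the subtree is trivial)."""
    tree = set(tree)
    if not tree:
        return [root]
    return [child for node in tree for child in children(node) if child not in tree]

# toy usage in d = 2: refine the root cube and one of its children
root = (0, (0, 0))
T0 = {root, (1, (0, 0))}
print(sorted(outer_leaves(T0, root)))   # 7 cubes, consistent with (2.9): 2 <= 7 <= 4*2
```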

Notice that $\#(\Lambda_n)=2^{nd}$, so that the decay in (2.14) is like $N^{-s}$ with $N$ the number of elements in the partition. The smallest $M_0$ for which (2.14) holds serves to define the semi-norm $|f|_{\mathcal A^s}$ on $\mathcal A^s$. The space $\mathcal A^s$ can be viewed as a smoothness space of order $ds>0$ with smoothness measured with respect to $\rho_X$. For example, if $\rho_X$ is the Lebesgue measure then $\mathcal A^{s/d}=B^s_\infty(L_2)$, $0<s\le1$, with equivalent norms. Here $B^s_\infty(L_2)$ is a Besov space. For $s<K$ one can take $\sup_{t>0}t^{-s}\omega_K(f,t)_{L_2}$ as a norm for this space, where $\omega_K(f,t)_{L_2}$ is the $K$-th order modulus of smoothness in $L_2$ (see [2] for the definition and properties of Besov spaces).

Instead of working with a priori fixed partitions, there is a second kind of approximation where the partition is generated adaptively and will vary with $f$. Adaptive partitions are typically generated by using some refinement criterion that determines whether or not to subdivide a given cell. We shall use a refinement criterion that is motivated by adaptive wavelet constructions such as those given in [4] for image compression.

Given a function $f\in L_2(X,\rho_X)$, we define the local atoms

$$\psi_I(f):=\sum_{J\in\mathcal C(I)}p_J(f)\chi_J-p_I(f)\chi_I,\ I\ne X,\qquad\psi_X(f):=p_X(f), \qquad (2.15)$$

and

$$\epsilon_I(f):=\|\psi_I(f)\|. \qquad (2.16)$$

Clearly, we have

$$f=\sum_{I\in\mathcal D}\psi_I(f), \qquad (2.17)$$

and since the $\psi_I$ are mutually orthogonal, we also have

$$\|f\|^2_{L_2(X,\rho_X)}=\sum_{I\in\mathcal D}\epsilon_I(f)^2. \qquad (2.18)$$

The number $\epsilon_I(f)$ gives the improvement in the $L_2(X,\rho_X)$ error squared when the cell $I$ is refined. We let $T(f,\eta)$ be the smallest proper tree that contains all $I\in\mathcal D$ such that $\epsilon_I(f)>\eta$. Corresponding to this tree we have the partition $\Lambda(f,\eta)$ consisting of the outer leaves of $T(f,\eta)$. We shall define some new smoothness spaces $\mathcal B^s$ which measure the regularity of a given function $f$ by the size of the tree $T(f,\eta)$. Given $s>0$, we let $\mathcal B^s$ be the collection of all $f\in L_2(X,\rho_X)$ such that, for $p:=(s+1/2)^{-1}$, the following quantity is finite:

$$|f|^p_{\mathcal B^s}:=\sup_{\eta>0}\eta^p\#(T(f,\eta)). \qquad (2.19)$$

We obtain the norm for $\mathcal B^s$ by adding $\|f\|$ to $|f|_{\mathcal B^s}$. One can show that

$$\|f-P_{\Lambda(f,\eta)}f\|\le C_s|f|^{\frac1{2s+1}}_{\mathcal B^s}\eta^{\frac{2s}{2s+1}}\le C_s|f|_{\mathcal B^s}N^{-s},\quad N:=\#(T(f,\eta)), \qquad (2.20)$$

where the constant $C_s$ depends only on $s$. The proof of this estimate can be based on the same strategy as used in [4], where a similar result is proven in the case of the Lebesgue measure.

One introduces the trees $T_j:=T(f,2^{-j}\eta)$, which have the property $T_j\subset T_{j+1}$ with $T_0=T(f,\eta)$, and then writes

$$\|f-P_{\Lambda(f,\eta)}f\|^2=\sum_{I\notin T(f,\eta)}\|\psi_I(f)\|^2=\sum_{j\ge0}\sum_{I\in T_{j+1}\setminus T_j}\|\psi_I(f)\|^2\le\sum_{j\ge0}\#(T_{j+1})(2^{-j}\eta)^2\le|f|^p_{\mathcal B^s}\sum_{j\ge0}2^p(2^{-j}\eta)^{2-p},$$

which gives (2.20) with $C_s:=\big(2^p\sum_{j\ge0}2^{(p-2)j}\big)^{1/2}$. Invoking (2.9), it follows that every function $f\in\mathcal B^s$ can be approximated to order $O(N^{-s})$ by $P_\Lambda f$ for some partition $\Lambda$ with $\#(\Lambda)=N$. This should be contrasted with $\mathcal A^s$, which has the same approximation order for the uniform partition. It is easy to see that $\mathcal B^s$ is larger than $\mathcal A^s$. In classical settings, the class $\mathcal B^s$ is well understood. For example, in the case of Lebesgue measure and dyadic partitions we know that each Besov space $B^s_q(L_\tau)$ with $\tau>(s/d+1/2)^{-1}$ and $0<q\le\infty$ is contained in $\mathcal B^{s/d}$ (see [4]). This should be compared with the $\mathcal A^s$, where we know that $\mathcal A^{s/d}=B^s_\infty(L_2)$ as we have noted earlier.

2.2 An adaptive algorithm for learning

In the learning context, we cannot use the algorithm described in the previous section since the regression function $f_\rho$ and the measure $\rho$ are not known to us. Instead we shall use an empirical version of this adaptive procedure. Given the data $z$ and any Borel set $I\subset X$, we define

$$p_{I,z}:=\mathop{\mathrm{argmin}}_{p\in\Pi_K}\frac1m\sum_{i=1}^m(p(x_i)-y_i)^2\chi_I(x_i). \qquad (2.21)$$

When there are no $x_i$ in $I$, we set $p_{I,z}=0$. Given a partition $\Lambda$ of $X$, we define the estimator $f_z$ as

$$f_z=f_{z,\Lambda}:=\sum_{I\in\Lambda}T_M(p_{I,z})\chi_I \qquad (2.22)$$

with $T_M$ the truncation operator defined earlier. Note that the empirical minimization (2.21) is not done over the truncated polynomials, since this is not numerically feasible. Instead, truncation is only used as a post-processing.

As in the previous section, we have two ways to generate partitions $\Lambda$. The first is to use the uniform partition $\Lambda_n$ consisting of all dyadic cells in $\mathcal D_n$. The second is to define an empirical analogue of the $\epsilon_I$. For each cell $I$ in the master tree $\mathcal T$, we define

$$\epsilon_I(z):=\Big\|T_M\Big(\sum_{J\in\mathcal C(I)}p_{J,z}\chi_J-p_{I,z}\chi_I\Big)\Big\|_m, \qquad (2.23)$$

where $\|\cdot\|_m$ is the empirical norm defined in (1.9).

A data based adaptive partitioning requires limiting the depth of the corresponding trees. To this end, let $\gamma>0$ be an arbitrary but fixed constant.

We define $j_0=j_0(m,\gamma)$ as the smallest integer $j$ such that $2^{jd}\ge\tau_m^{-1/\gamma}$. We then consider the smallest tree $T(\tau_m,z)$ which contains the set

$$\Sigma(z,m):=\{I\in\cup_{j\le j_0}\Lambda_j:\ \epsilon_I(z)\ge\tau_m\}, \qquad (2.24)$$

where $\tau_m$ is a threshold to be set further. We then define the partition $\Lambda=\Lambda(\tau_m,z)$ associated to this tree and the corresponding estimator $f_z:=f_{z,\Lambda}$. Obviously, the role of the integer $j_0$ is to limit the depth search for the coefficients $\epsilon_I(z)$ which are larger than the threshold $\tau_m$. Without this restriction, the tree $T(\tau_m,z)$ could be infinite, preventing a numerical implementation.

The essential steps of the adaptive algorithm in the present setting read as follows:

Algorithm: Given $z$, choose $\gamma>0$ and $\tau_m:=\kappa\sqrt{\frac{\log m}{m}}$; for $j_0(m,\gamma)$ determine the set $\Sigma(z,m)$ according to (2.24); form $T(\tau_m,z)$ and $\Lambda(\tau_m,z)$ and compute $f_z$ according to (2.22).

For further comments concerning the treatment of streaming data we refer to an analogous strategy outlined in [1].

In our previous work [1], we have analyzed the above algorithm in the case of piecewise constant approximation and we have proved the following result.

Theorem 2.1 Let $\beta,\gamma>0$ be arbitrary. Then, using piecewise constant approximations in the above scheme, i.e. $K=0$, there exists $\kappa_0=\kappa_0(\beta,\gamma,M)$ such that if $\kappa\ge\kappa_0$ in the definition of $\tau_m$, then whenever $f_\rho\in\mathcal A^\gamma\cap\mathcal B^s$ for some $s>0$, the following concentration estimate holds:

$$P\Big\{\|f_\rho-f_z\|\ge\tilde c\Big(\frac{\log m}{m}\Big)^{\frac{s}{2s+1}}\Big\}\le Cm^{-\beta}, \qquad (2.25)$$

where the constants $\tilde c$ and $C$ are independent of $m$.

Let us make some remarks on this theorem. First note that truncation does not play any role in the case of piecewise constant approximation, since in that case the constant of best empirical approximation automatically is at most $M$ in absolute value. The theorem gives an estimate for the error $\|f_\rho-f_z\|$ in probability, which is the type of estimate we are looking for in this paper. From this one obtains a corresponding estimate in expectation. The order of approximation can be shown to be optimal save for the logarithmic term by using the results on lower estimates from [6]. Finally, note that the role of the space $\mathcal A^\gamma$ is a minor one since the only assumption on $\gamma$ is that it be positive. This assumption merely guarantees that a finite depth search will behave close to an infinite depth search.

The goal of the present paper is to determine whether the analogue of Theorem 2.1 holds when piecewise polynomials are used in place of piecewise constants. We shall see that this is not the case by means of a counterexample in §3. We shall then show that such estimates are possible if we place restrictions on the measure $\rho_X$.
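Before moving on, here is a compact sketch of the algorithm box above in dimension $d=1$ on $X=[0,1]$: it computes the empirical quantities $\epsilon_I(z)$ of (2.23) up to depth $j_0$, thresholds them at $\tau_m$ as in (2.24), completes the selected cells to the smallest tree, and evaluates the truncated piecewise polynomial fit (2.22) on the resulting partition. The cell encoding, the guards for cells with too few samples, and the handling of the right endpoint are illustrative simplifications, not part of the paper.

```python
import numpy as np

def adaptive_estimator(x, y, M=1.0, K=1, kappa=1.0, gamma=1.0):
    """Sketch of the adaptive scheme of Section 2.2 for d = 1, X = [0, 1]."""
    m = len(x)
    tau = kappa * np.sqrt(np.log(m) / m)                          # threshold tau_m
    j0 = int(np.ceil(np.log2(max(tau ** (-1.0 / gamma), 1.0))))   # depth limit j_0

    def fit(j, k):
        """Coefficients of p_{I,z} on I = [k 2^-j, (k+1) 2^-j), see (2.21)."""
        lo, hi = k * 2.0 ** -j, (k + 1) * 2.0 ** -j
        sel = (x >= lo) & (x < hi)
        return np.polyfit(x[sel], y[sel], K) if sel.sum() > K else np.zeros(K + 1)

    def eps(j, k):
        """Empirical quantity eps_I(z) of (2.23), using the empirical norm (1.9)."""
        lo, hi = k * 2.0 ** -j, (k + 1) * 2.0 ** -j
        sel = (x >= lo) & (x < hi)
        if not sel.any():
            return 0.0
        coarse = np.polyval(fit(j, k), x[sel])
        fine = np.where(x[sel] < (lo + hi) / 2,
                        np.polyval(fit(j + 1, 2 * k), x[sel]),
                        np.polyval(fit(j + 1, 2 * k + 1), x[sel]))
        return np.sqrt(np.sum(np.clip(fine - coarse, -M, M) ** 2) / m)

    # cells above threshold up to depth j0, completed to the smallest proper tree
    kept = {(j, k) for j in range(j0 + 1) for k in range(2 ** j) if eps(j, k) >= tau}
    tree = {(0, 0)}
    for jj, kk in kept:
        while (jj, kk) not in tree:
            tree.add((jj, kk))
            jj, kk = jj - 1, kk // 2
    leaves = [(j + 1, c) for (j, k) in tree for c in (2 * k, 2 * k + 1)
              if (j + 1, c) not in tree]                           # outer leaves = partition

    def f_z(t):
        t = np.atleast_1d(t).astype(float)
        out = np.zeros_like(t)
        for j, k in leaves:
            lo, hi = k * 2.0 ** -j, (k + 1) * 2.0 ** -j
            sel = (t >= lo) & (t < hi)
            out[sel] = np.clip(np.polyval(fit(j, k), t[sel]), -M, M)  # T_M, see (2.22)
        return out

    return f_z
```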

2.3 Estimates in Expectation

Before starting the analysis of results in probability for the higher order piecewise polynomial case, let us point out that it is possible to derive estimates in expectation for adaptive partitions constructed by other strategies, such as model selection by complexity regularization. In particular, the following result can easily be derived from Theorem 12.1 in [8].

Theorem 2.2 Let $\gamma>0$ be arbitrary and let $j_0=j_0(\gamma,m)$ be defined again as the smallest integer $j$ such that $2^{jd}\ge\big(\frac{m}{\log m}\big)^{\frac1{2\gamma}}$. Consider the set $\mathcal M_m$ of all partitions $\Lambda$ induced by proper trees $\mathcal T\subset\cup_{j\le j_0}\Lambda_j$. Then, there exists $\kappa_0=\kappa_0(d,K)$ such that if

$$\mathrm{pen}_m(\Lambda)=\kappa\frac{\log m}{m}\#(\Lambda)$$

for some $\kappa\ge\kappa_0$, the estimator defined by $f_z:=f_{z,\Lambda^*}$ with

$$\Lambda^*:=\mathop{\mathrm{argmin}}_{\Lambda\in\mathcal M_m}\big\{\|f_{z,\Lambda}-y\|_m^2+\mathrm{pen}_m(\Lambda)\big\}$$

satisfies

$$E(\|f_\rho-f_z\|)\le C\Big(\frac{\log m}{m}\Big)^{\frac{s}{2s+1}},\quad m=1,2,\dots, \qquad (2.26)$$

if $f_\rho\in\mathcal A^\gamma\cap\mathcal B^s$, where the constant $C$ depends on $\kappa$, $M$, $|f_\rho|_{\mathcal B^s}$, $|f_\rho|_{\mathcal A^\gamma}$, but not on $m$.

Let us also remark that the search for the optimal partition $\Lambda^*$ in the above theorem can be performed at a reasonable computational cost using a CART algorithm (see e.g. [3] or [7]). Note that our approach for selecting the appropriate partition differs from the CART algorithms, in the sense that it is based on a thresholding procedure rather than on solving an optimization problem.

3 A counterexample

We begin by showing that in general we cannot obtain optimal estimates with high probability when using empirical risk minimization with piecewise polynomials of degree larger than zero. We shall first consider the case of approximation by linear functions on the interval $X=[-1,1]$ for the bound $M=1$. For each $m=1,2,\dots$, we will construct a measure $\rho=\rho_m$ on $[-1,1]\times[-1,1]$ for which empirical risk minimization does not perform well in probability.

Let $p$ be the polynomial

$$p:=\mathop{\mathrm{argmin}}_{g\in\Pi_1}E(|g(x)-y|^2)=\mathop{\mathrm{argmin}}_{g\in\Pi_1}\|f_\rho-g\|^2, \qquad (3.1)$$

with $f_\rho$ the regression function and $\|\cdot\|$ the $L_2(X,\rho_X)$ norm. Consider also the empirical least squares minimizer

$$\hat p:=\mathop{\mathrm{argmin}}_{g\in\Pi_1}\sum_{i=1}^m|g(x_i)-y_i|^2. \qquad (3.2)$$

We are interested in the concentration properties between $T(p)$ and $T(\hat p)$ in the $\|\cdot\|$ metric, where $T(u)=\mathrm{sign}(u)\min\{1,|u|\}$ is the truncation operator. We shall prove the following result, which expresses the fact that we cannot hope for a distribution-free concentration inequality with a fast rate of decay of the probability.

Lemma 3.1 Given any $\beta>2$, there exist absolute constants $c,\tilde c>0$ such that for each $m=1,2,\dots$ there is a distribution $\rho=\rho_m$ such that the following inequalities hold:

$$P\{\|T(p)-T(\hat p)\|\ge c\}\ge\tilde cm^{-\beta+1} \qquad (3.3)$$

and

$$P\{\|f_\rho-T(\hat p)\|\ge c\}\ge\tilde cm^{-\beta+1}. \qquad (3.4)$$

On the other hand, we have

$$\|f_\rho-T(p)\|^2\le C_0[m^{-2\beta+4}+m^{-\beta}] \qquad (3.5)$$

for an absolute constant $C_0$.

Remark 3.2 Note that this clearly implies the same results as in (3.3) and (3.4) with the truncation operator removed. By contrast, for least squares approximation by constants $q$, we have (see [1]) for all $\rho$, $m$ and $\eta$

$$P\{\|q-\hat q\|\ge\eta\}\lesssim e^{-cm\eta^2}. \qquad (3.6)$$

Proof of Lemma 3.1: In order to prove this result, we consider the probability measure

$$\rho_X:=(1/2-\kappa)(\delta_\gamma+\delta_{-\gamma})+\kappa(\delta_1+\delta_{-1}), \qquad (3.7)$$

where $\gamma:=\gamma_m=\frac1{3m}$ and $\kappa:=\kappa_m:=m^{-\beta}$. We then define $\rho=\rho_m$ completely by

$$y(\gamma)=1,\quad y(-\gamma)=-1,\quad y(\pm1)=0,\quad\text{with probability }1. \qquad (3.8)$$

Therefore, there is no randomness in the $y$ direction. It follows that

$$f_\rho(\gamma)=1,\quad f_\rho(-\gamma)=-1,\quad f_\rho(\pm1)=0. \qquad (3.9)$$

We next proceed in three steps.

1. Properties of $p$: By symmetry, the linear function that best approximates $f_\rho$ in $L_2(\rho_X)$ is of the form $p(x)=ax$, where $a$ minimizes

$$F(a)=(1/2-\kappa)(a\gamma-1)^2+\kappa a^2. \qquad (3.10)$$

We therefore find that

$$p(\pm\gamma)=\pm a\gamma=\pm\frac{(1/2-\kappa)\gamma^2}{(1/2-\kappa)\gamma^2+\kappa}=\pm1+O(m^{-\beta+2}). \qquad (3.11)$$

This shows that

$$\|f_\rho-Tp\|^2\le C_0m^{-2\beta+4}+2\kappa\le C_0[m^{-2\beta+4}+m^{-\beta}] \qquad (3.12)$$

with $C_0$ an absolute constant.

2. Properties of $\hat p$: We can write the empirical least squares polynomial $\hat p$ as

$$\hat p(x)=\hat b+\hat a(x-\hat\xi), \qquad (3.13)$$

where $\hat b=\frac1m\sum_{i=1}^my_i$ and $\hat\xi:=\frac1m\sum_{i=1}^mx_i$. (Notice that $1$ and $x-\hat\xi$ are orthogonal with respect to the empirical measure $\frac1m\sum_{i=1}^m\delta_{x_i}$.) From

$$\frac{\hat p(\gamma)-\hat p(-\gamma)}{2\gamma}=\hat a=\frac{\hat p(\hat\xi)-\hat p(\gamma)}{\hat\xi-\gamma}, \qquad (3.14)$$

it follows that

$$\hat p(\gamma)=\hat p(-\gamma)\Big(1+\frac{2\gamma}{\hat\xi-\gamma}\Big)^{-1}+\hat p(\hat\xi)\Big(1+\frac{\hat\xi-\gamma}{2\gamma}\Big)^{-1}. \qquad (3.15)$$

Since $\hat p(\hat\xi)=\hat b\in[-1,1]$, (3.15) shows that whenever $\hat\xi\ge 2\gamma$, then either $\hat p(\gamma)\le 1/2$ or $\hat p(-\gamma)\ge-1/2$. It follows that whenever $\hat\xi\ge 2\gamma$, we have

$$\|f_\rho-T(\hat p)\|^2\ge(1/2-\kappa)(1/2)^2\ge c^2 \qquad (3.16)$$

with $c$ an absolute constant. Using (3.12), we see that (3.3) is also satisfied provided that $P\{\hat\xi\ge2\gamma\}\ge\tilde cm^{-\beta+1}$.

3. Study of $\hat\xi$: We consider the event where $x_i=1$ for one $i\in\{1,\dots,m\}$ and $x_j=\gamma$ or $-\gamma$ for the other $j\ne i$. In such an event, we have

$$\hat\xi\ge\frac1m(1-(m-1)\gamma)>\frac1m\Big(1-\frac13\Big)=\frac2{3m}=2\gamma. \qquad (3.17)$$

The probability of this event is

$$P=m\kappa(1-2\kappa)^{m-1}\ge\tilde cm^{-\beta+1}. \qquad (3.18)$$

This concludes the proof of (3.3).

Let us now adjust the example of the lemma to give information about piecewise linear approximation on adaptively generated dyadic partitions. We first note that we can rescale the measure of the lemma to any given interval $I$. We choose a dyadic interval $I$ of length $2^{-m}$ and scale the measure of the lemma to that interval. We denote this measure again by $\rho$. The regression function $f_\rho$ will be denoted by $f$, and it will take values $\pm1$ at the rescaled points corresponding to $\gamma$ and $-\gamma$ and will take the value $0$ everywhere else. We know from Lemma 3.1 that we can approximate $f$ by a single polynomial to accuracy

$$\|f_\rho-p\|^2\le C_0[m^{-2\beta+4}+m^{-\beta}]. \qquad (3.19)$$
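The construction in Lemma 3.1 is simple enough to probe numerically. The sketch below samples from the measure $\rho_m$ of (3.7)-(3.8), computes the empirical line $\hat p$ by least squares, and counts how often the truncated fit misses $f_\rho$ by the absolute constant coming from (3.16). The trial count, the success threshold, and the use of numpy.polyfit are illustrative choices and not part of the proof.

```python
import numpy as np

def simulate_counterexample(m=50, beta=3.0, trials=20000, seed=0):
    """Monte Carlo illustration of Lemma 3.1: fraction of draws on which the
    truncated empirical line misses f_rho by an absolute constant."""
    rng = np.random.default_rng(seed)
    gamma, kappa = 1.0 / (3 * m), m ** (-beta)
    support = np.array([gamma, -gamma, 1.0, -1.0])
    probs = np.array([0.5 - kappa, 0.5 - kappa, kappa, kappa])
    f_rho = {gamma: 1.0, -gamma: -1.0, 1.0: 0.0, -1.0: 0.0}

    bad = 0
    for _ in range(trials):
        x = rng.choice(support, size=m, p=probs)
        y = np.array([f_rho[v] for v in x])          # no randomness in y, see (3.8)
        a, b = np.polyfit(x, y, 1)                   # empirical line p_hat(x) = a*x + b
        # squared L2(rho_X) error of the truncated fit against f_rho
        err2 = sum(w * (f_rho[t] - np.clip(a * t + b, -1, 1)) ** 2
                   for t, w in zip(support, probs))
        bad += err2 >= 0.25 * (0.5 - kappa)          # threshold suggested by (3.16)
    return bad / trials   # compare with the lower bound of order m**(1 - beta)

print(simulate_counterexample())
```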

Also, if we allow dyadic partitions with more than $m$ elements, we can approximate $f_\rho$ exactly. In other words, $f_\rho$ is in $\mathcal B^s$ with $s=\min(\beta-2,\beta/2)$. On the other hand, any adaptively generated partition with at most $m$ elements will have an interval $J$ containing $I$. The empirical data sets will be rescaled to $I$, and the empirical $\hat p_J$'s will all be the empirical linear functions of Lemma 3.1 scaled to $I$. Hence, for any of the bad draws $z$ of Lemma 3.1, we will have

$$\|f_\rho-\hat f_z\|\ge c \qquad (3.20)$$

on a set of $z$ with probability larger than $\tilde cm^{-\beta+1}$. This shows that empirical least squares using piecewise linear functions on adaptively generated partitions will not provide optimal bounds with high probability.

Note that the above results are not in contradiction with optimal estimates in expectation. The counterexample also indicates, however, that the arguments leading to optimal rates in expectation based on complexity regularization cannot be expected to be refined towards estimates in probability.

4 Optimal results in probability under regularity assumptions on $\rho_X$

In view of the results of the previous section, it is not possible to prove optimal convergence rates with high probability for adaptive methods based on piecewise polynomials in the general setting of the regression problem in learning. In this section, we want to show that if we impose some restrictions on the marginal measure $\rho_X$, then high probability results are possible.

We fix the value $K$ of the polynomial space $\Pi_K$ in this section. For each dyadic cube $I\subset\mathbb{R}^d$ and any function $f\in L_2(X,\rho_X)$, we define $p_I(f)$ as in (2.10). This means that given any dyadic partition $\Lambda$, the best approximation to $f$ from $S_\Lambda$ is given by the projector

$$P_\Lambda f=\sum_{I\in\Lambda}p_I(f)\chi_I. \qquad (4.1)$$

We shall work under the following assumption in this section:

Assumption A: There exists a constant $C_A>0$ such that for each dyadic cube $I$ there exists an $L_2(I,\rho_X)$-orthonormal basis $(L_{I,k})_{k=1,\dots,\lambda}$ of $\Pi_K$ (with $\lambda$ the algebraic dimension of $\Pi_K$) such that

$$\|L_{I,k}\|_{L_\infty(I)}\le C_A(\rho_X(I))^{-1/2},\quad k=1,\dots,\lambda. \qquad (4.2)$$

Note that this assumption implies that the least squares projection is uniquely defined on each $I$ by

$$p_I(f)=\sum_{k=1}^{\lambda}\langle f,L_{I,k}\rangle_{L_2(I,\rho_X)}L_{I,k}, \qquad (4.3)$$

and in particular that $\rho_X(I)\ne0$.

It also implies that for all partitions $\Lambda$ and for all $f\in L_\infty(X)$,

$$\|P_\Lambda f\|_{L_\infty}\le\lambda C_A\|f\|_{L_\infty}, \qquad (4.4)$$

i.e. the projectors $P_\Lambda$ are bounded in $L_\infty$ independently of $\Lambda$. Indeed, from the Cauchy-Schwarz inequality, we have

$$\|L_{I,k}\|_{L_1(I,\rho_X)}\le(\rho_X(I))^{1/2}\|L_{I,k}\|_{L_2(I,\rho_X)}=(\rho_X(I))^{1/2} \qquad (4.5)$$

and therefore for all $x\in I\in\Lambda$

$$|(P_\Lambda f)(x)|\le\sum_{k=1}^{\lambda}|\langle f,L_{I,k}\chi_I\rangle|\,|L_{I,k}(x)|\le\lambda C_A\|f\|_{L_\infty(I)}. \qquad (4.6)$$

It is readily seen that Assumption A holds when $\rho_X$ is the Lebesgue measure $dx$. On the other hand, it may not hold when $\rho_X$ is a very irregular measure. A particularly simple case where Assumption A always holds is the following:

Assumption B: We have $d\rho_X=\omega(x)dx$, and there exist constants $B,C_B>0$ such that for all $I\in\mathcal D$ there exists a convex subset $D\subset I$ with $|I|\le B|D|$ and such that

$$0<\rho_X(I)\le C_B|I|\inf_{x\in D}\omega(x).$$

This assumption is obviously fulfilled by weight functions such that $0<c\le\omega(x)\le C$, but it is also fulfilled by functions which may vanish (or go to $+\infty$) with a power-like behaviour at isolated points or lower dimensional manifolds.

Let us explain why Assumption B implies Assumption A. Any convex domain $D$ can be framed by $E\subset D\subset\tilde E$, where $E$ is an ellipsoid and $\tilde E$ a dilated version of $E$ by a fixed factor depending only on the dimension $d$. Therefore, if $P$ is a polynomial with $\|P\|_{L_2(I,\rho_X)}=1$, the bound $\|P\|^2_{L_\infty(I)}\le C\|P\|^2_{L_\infty(D)}$ holds for a constant $C$ depending only on the degree $K$, the spatial dimension $d$ and the constant $B$ in Assumption B. Hence, using Assumption B, we further conclude

$$\|P\|^2_{L_\infty(I)}\lesssim\|P\|^2_{L_\infty(D)}\lesssim|D|^{-1}\int_D|P(x)|^2dx\lesssim|D|^{-1}|I|\,\rho_X(I)^{-1}\int_D|P(x)|^2\omega(x)dx\lesssim\rho_X(I)^{-1}\int_I|P(x)|^2d\rho_X=\rho_X(I)^{-1}.$$

Here, all the constants in the inequalities depend on $d$, $K$, $B$, $C_B$ but are independent of $I\in\mathcal T$.

4.1 Probability estimates for the empirical estimator

We shall now show that under Assumption A we can estimate the discrepancy between the truncated least squares polynomial approximation to $f_\rho$ and the truncated least squares polynomial fit to the empirical data.

This should be compared with the counterexample of the last section, which showed that for general $\rho$ we do not have this property.

Lemma 4.1 There exists a constant $\tilde c>0$, which depends on the constant $C_A$ in Assumption A, on the polynomial space dimension $\lambda=\lambda(K)$ of $\Pi_K$ and on the bound $M$, such that for all $I\in\mathcal D$

$$P\{\|T_M(p_I)\chi_I-T_M(p_{I,z})\chi_I\|>\eta\}\le c\,e^{-\tilde cm\eta^2}, \qquad (4.7)$$

where $c=2(\lambda+\lambda^2)$.

Proof: We clearly have

$$\|T_M(p_I)\chi_I-T_M(p_{I,z})\chi_I\|^2\le 4M^2\rho_X(I), \qquad (4.8)$$

and therefore it suffices to prove (4.7) for those $I$ such that $\eta^2\le 4M^2\rho_X(I)$. We use the basis $(L_{I,k})_{k=1,\dots,\lambda}$ of Assumption A to express $p_I$ and $p_{I,z}$ according to

$$p_I=\sum_{k=1}^{\lambda}c_{I,k}L_{I,k}\quad\text{and}\quad p_{I,z}=\sum_{k=1}^{\lambda}c_{I,k}(z)L_{I,k}, \qquad (4.9)$$

and we denote by $c_I$ and $c_I(z)$ the corresponding coordinate vectors. We next remark that since $T_M$ is a contraction, it follows that

$$\|T_M(p_I)\chi_I-T_M(p_{I,z})\chi_I\|\le\|p_I\chi_I-p_{I,z}\chi_I\| \qquad (4.10)$$

and therefore

$$\|T_M(p_I)\chi_I-T_M(p_{I,z})\chi_I\|^2\le\sum_{k=1}^{\lambda}|c_{I,k}-c_{I,k}(z)|^2=\|c_I-c_I(z)\|^2_{\ell_2(\mathbb{R}^\lambda)}, \qquad (4.11)$$

where $\|\cdot\|_{\ell_2(\mathbb{R}^\lambda)}$ is the $\lambda$-dimensional Euclidean norm. Introducing

$$\alpha_{I,k}:=\int_Z yL_{I,k}(x)\chi_I(x)\,d\rho\quad\text{and}\quad\alpha_{I,k}(z):=\frac1m\sum_{i=1}^m y_iL_{I,k}(x_i)\chi_I(x_i), \qquad (4.12)$$

with $\alpha_I$ and $\alpha_I(z)$ the corresponding coordinate vectors, we clearly have

$$c_I=\alpha_I. \qquad (4.13)$$

On the other hand, we have $G_I(z)c_I(z)=\alpha_I(z)$, where

$$(G_I(z))_{k,l}:=\frac1m\sum_{i=1}^m L_{I,k}(x_i)L_{I,l}(x_i)\chi_I(x_i)=\langle L_{I,k}\chi_I,L_{I,l}\chi_I\rangle_m,$$

and $c_I(z):=0$ when there are no $x_i$ in $I$.

Therefore, when $G_I(z)$ is invertible, we can write

$$c_I(z)-c_I=G_I(z)^{-1}\alpha_I(z)-\alpha_I=G_I(z)^{-1}[\alpha_I(z)-G_I(z)\alpha_I],$$

and therefore

$$\|c_I(z)-c_I\|_{\ell_2(\mathbb{R}^\lambda)}\le\|G_I(z)^{-1}\|_{\ell_2(\mathbb{R}^\lambda)}\big(\|\alpha_I(z)-\alpha_I\|_{\ell_2(\mathbb{R}^\lambda)}+\|G_I(z)-I\|_{\ell_2(\mathbb{R}^\lambda)}\|\alpha_I\|_{\ell_2(\mathbb{R}^\lambda)}\big), \qquad (4.14)$$

where we also denote by $\|G\|_{\ell_2(\mathbb{R}^\lambda)}$ the spectral norm of a $\lambda\times\lambda$ matrix $G$. Since $\|G_I(z)-I\|_{\ell_2(\mathbb{R}^\lambda)}\le 1/2$ implies $\|G_I(z)^{-1}\|_{\ell_2(\mathbb{R}^\lambda)}\le 2$, it follows that $\|c_I(z)-c_I\|_{\ell_2(\mathbb{R}^\lambda)}\le\eta$ provided that we have jointly

$$\|\alpha_I(z)-\alpha_I\|_{\ell_2(\mathbb{R}^\lambda)}\le\frac{\eta}{4} \qquad (4.15)$$

and

$$\|G_I(z)-I\|_{\ell_2(\mathbb{R}^\lambda)}\le\min\Big\{\frac12,\ \frac{\eta}{4}\|\alpha_I\|^{-1}_{\ell_2(\mathbb{R}^\lambda)}\Big\}. \qquad (4.16)$$

By Bernstein's inequality, we have

$$P\Big\{\|\alpha_I(z)-\alpha_I\|_{\ell_2(\mathbb{R}^\lambda)}>\frac{\eta}{4}\Big\}\le P\Big\{|\alpha_{I,k}(z)-\alpha_{I,k}|>\frac{\eta}{4\lambda^{1/2}}\ \text{for some }k\Big\}\le 2\lambda e^{-\frac{cm\eta^2}{\sigma^2+S\eta}}, \qquad (4.17)$$

with $c$ depending on $\lambda$, where, on account of (4.2),

$$S:=\sup_k\sup_{x,y}|yL_{I,k}(x)|\le MC_A\rho_X(I)^{-1/2}\quad\text{and}\quad\sigma^2:=\sup_k E(y^2|L_{I,k}|^2)\le M^2.$$

Since $\eta^2\le 4M^2\rho_X(I)$, we thus have, up to a change in the constant $c$,

$$P\Big\{\|\alpha_I(z)-\alpha_I\|_{\ell_2(\mathbb{R}^\lambda)}>\frac{\eta}{4}\Big\}\le 2\lambda e^{-\frac{cm\eta^2}{M^2+M\rho_X(I)^{-1/2}\eta}}\le 2\lambda e^{-\tilde cm\eta^2}, \qquad (4.18)$$

with $\tilde c$ depending on $\lambda$, $C_A$ and $M$. In a similar way, we obtain by Bernstein's inequality that

$$P\{\|G_I(z)-I\|_{\ell_2(\mathbb{R}^\lambda)}>\xi\}\le P\Big\{|(G_I(z))_{k,l}-\delta_{k,l}|>\frac{\xi}{\lambda}\ \text{for some }(k,l)\Big\}\le 2\lambda^2 e^{-\frac{cm\xi^2}{\sigma^2+S\xi}}, \qquad (4.19)$$

with $c$ depending on $\lambda$, where now again by (4.2)

$$S:=\sup_{k,l}\sup_x|L_{I,k}(x)L_{I,l}(x)|\le C_A^2\rho_X(I)^{-1} \qquad (4.20)$$

and

$$\sigma^2:=\sup_{k,l}E(|L_{I,k}(x)L_{I,l}(x)|^2)\le C_A^2\rho_X(I)^{-1}. \qquad (4.21)$$

In the case $\xi=\frac12\le\frac{\eta}{4}\|\alpha_I\|^{-1}_{\ell_2(\mathbb{R}^\lambda)}$, we thus obtain from (4.19), using that $\eta^2\le 4M^2\rho_X(I)$,

$$P\{\|G_I(z)-I\|_{\ell_2(\mathbb{R}^\lambda)}>\xi\}\le 2\lambda^2 e^{-cm\rho_X(I)}\le 2\lambda^2 e^{-\tilde cm\eta^2}, \qquad (4.22)$$

with $\tilde c$ depending on $\lambda$, $C_A$ and $M$. In the case $\xi=\frac{\eta}{4}\|\alpha_I\|^{-1}_{\ell_2(\mathbb{R}^\lambda)}\le\frac12$, we notice that

$$\|\alpha_I\|_{\ell_2(\mathbb{R}^\lambda)}\le\lambda^{1/2}M\sup_k\|L_{I,k}\|_{L_1(I,\rho_X)}\le\lambda^{1/2}MC_A\rho_X(I)^{1/2} \qquad (4.23)$$

because of Assumption A. It follows that $\xi\ge\eta(16\lambda M^2C_A^2\rho_X(I))^{-1/2}$, and the absolute value of the exponent in the exponential of (4.19) is bounded from below by

$$\frac{cm\eta^2\rho_X(I)^{-1}}{24\lambda M^2C_A^4\rho_X(I)^{-1}}\ge\tilde cm\eta^2, \qquad (4.24)$$

with $\tilde c$ depending on $\lambda$, $C_A$ and $M$. The proof of (4.7) is therefore complete.

Remark 4.2 The constant $\tilde c$ in (4.7) depends on $M$ and $C_A$ and behaves like $(MC_A^2)^{-2}$.

4.2 Learning on Fixed and Uniform Partitions

From the basic estimate (4.7), we can immediately derive an estimate for an arbitrary but fixed partition $\Lambda$ consisting of disjoint dyadic cubes. If $\#(\Lambda)=N$, we have

$$P\{\|f_{z,\Lambda}-T_M(P_\Lambda f_\rho)\|>\eta\}\le P\{\|T_M(p_I)\chi_I-T_M(p_{I,z})\chi_I\|>\eta N^{-1/2}\ \text{for some }I\in\Lambda\},$$

which yields the following analogue to Theorem 2.1 of [1].

Remark 4.3 Under Assumption A, one has for any fixed integer $K\ge0$, any partition $\Lambda$ and $\eta>0$

$$P\{\|f_{z,\Lambda}-T_M(P_\Lambda f_\rho)\|>\eta\}\le C_0Ne^{-\tilde cm\eta^2/N}, \qquad (4.25)$$

where $N:=\#(\Lambda)$, $C_0=C_0(\lambda)$ and $\tilde c=\tilde c(\lambda,M,C_A)$. We can then derive by integration over $\eta>0$ an estimate in the mean square sense,

$$E(\|f_{z,\Lambda}-T_M(P_\Lambda f_\rho)\|^2)\le C\frac{N\log N}{m}, \qquad (4.26)$$

similar to Corollary 2.2 of [1], with $C=C(M,C_A,\lambda)$. Combining these estimates with the definition of the approximation classes $\mathcal A^s$, we obtain similar convergence rates as in Theorem 2.3 of [1].

Theorem 4.4 If $f_\rho\in\mathcal A^s$ and the estimator is defined by $f_z:=f_{\Lambda_j,z}$ with $j$ chosen as the smallest integer such that $2^{jd(1+2s)}\ge\frac{m}{\log m}$, then given any $\beta>0$ there exist constants $C=C(M,C_A,\lambda,\beta)$ and $\tilde c=\tilde c(M,C_A,\lambda,\beta)$ such that

$$P\Big\{\|f_\rho-f_z\|>(\tilde c+|f_\rho|_{\mathcal A^s})\Big(\frac{\log m}{m}\Big)^{\frac{s}{2s+1}}\Big\}\le Cm^{-\beta}. \qquad (4.27)$$

There also exists a constant $C=C(M,C_A,\lambda)$ such that

$$E(\|f_\rho-f_z\|^2)\le(C+|f_\rho|^2_{\mathcal A^s})\Big(\frac{\log m}{m}\Big)^{\frac{2s}{2s+1}}. \qquad (4.28)$$
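For the uniform-partition estimator of Theorem 4.4, the balancing of bias and variance is explicit once $s$ is known. The following sketch (again in $d=1$) picks the smallest admissible level $j$, fits a degree-$K$ polynomial per dyadic interval, and truncates at $M$. Assuming knowledge of $s$ mirrors the statement of the theorem; the guards for nearly empty cells are illustrative.

```python
import numpy as np

def uniform_partition_estimator(x, y, s, M=1.0, K=1):
    """Estimator in the spirit of Theorem 4.4 for d = 1: smallest j with
    2^{j(1+2s)} >= m/log m, least squares of degree K on each cell of the
    uniform partition Lambda_j, truncated at M (a sketch, not exact constants)."""
    m = len(x)
    j = 0
    while 2.0 ** (j * (1 + 2 * s)) < m / np.log(m):
        j += 1
    cells = 2 ** j
    edges = np.linspace(0.0, 1.0, cells + 1)
    fits = []
    for k in range(cells):
        sel = (x >= edges[k]) & (x < edges[k + 1])
        fits.append(np.polyfit(x[sel], y[sel], K) if sel.sum() > K else np.zeros(K + 1))

    def f_z(t):
        t = np.atleast_1d(t).astype(float)
        idx = np.clip((t * cells).astype(int), 0, cells - 1)
        return np.array([np.clip(np.polyval(fits[c], ti), -M, M) for c, ti in zip(idx, t)])

    return j, f_z
```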

As we have noted earlier, the second estimate (4.28) could have been obtained in a different way, namely by using Theorem 11.3 in [8] and without Assumption A. On the other hand, it is not clear that the probability estimate (4.27) could be obtained without this assumption.

4.3 Learning on Adaptive Partitions

We now turn to the adaptive algorithm as defined in §2.2. Here, we shall extend Theorem 2.5 of [1] in two ways. Recall that the depth of the tree is limited by $j_0=j_0(m,\gamma)$, the smallest integer $j$ such that $2^{jd}\ge\tau_m^{-1/\gamma}$.

Theorem 4.5 We fix an arbitrary $\beta\ge1$ and $\gamma>0$. If we take the threshold $\tau_m:=\kappa\sqrt{\frac{\log m}{m}}$ with $\kappa\ge\kappa_0=\kappa_0(\gamma,\beta,M,\lambda,d,C_A)$, then for any $\rho$ such that $f_\rho\in\mathcal A^\gamma\cap\mathcal B^s$, $s>0$, the adaptive algorithm gives an $f_z$ satisfying the concentration estimate

$$P\Big\{\|f_\rho-f_z\|\ge\tilde c\Big(\frac{\log m}{m}\Big)^{\frac{s}{2s+1}}\Big\}\le Cm^{-\beta}, \qquad (4.29)$$

with $\tilde c=\tilde c(s,d,\lambda,\beta,|f_\rho|_{\mathcal B^s},|f_\rho|_{\mathcal A^\gamma})$. We also have the following expectation bound

$$E(\|f_\rho-f_z\|^2)\le C\Big(\frac{\log m}{m}\Big)^{\frac{2s}{2s+1}}, \qquad (4.30)$$

with $C=C(s,\lambda,M,C_A,d,\beta,|f_\rho|_{\mathcal B^s},|f_\rho|_{\mathcal A^\gamma})$.

A defect of this theorem is the dependency of $\kappa_0$ on $C_A$, which may be unknown in our regression problem. We shall propose later another convergence theorem where this defect is removed.

The strategy for proving Theorem 4.5 is similar to the one for Theorem 2.5 in [1]. The idea is to show that the set of coefficients chosen by the adaptive empirical algorithm is with high probability similar to the set that would be chosen if the adaptive thresholding took place directly on $f_\rho$. This will be established by probability estimates which control the discrepancy between $\epsilon_I$ and $\epsilon_I(z)$. This is given by the following result, which is a substitute for Lemma 4.1 in [1].

Lemma 4.6 For any $\eta>0$ and any element $I\in\cup_{j\le j_0}\Lambda_j$, one has

$$P\{\epsilon_I(z)\le\eta\ \text{and}\ \epsilon_I\ge 8\lambda C_A\eta\}\le c(1+\eta^{-C})e^{-\tilde cm\eta^2} \qquad (4.31)$$

and

$$P\{\epsilon_I\le\eta\ \text{and}\ \epsilon_I(z)\ge 4\eta\}\le c(1+\eta^{-C})e^{-\tilde cm\eta^2}, \qquad (4.32)$$

where $c=c(\lambda,M,d)$, $\tilde c=\tilde c(\lambda,M,C_A,d)$ and $C=C(\lambda,d)$.

The proof of Lemma 4.6 is rather different from the counterpart in [1] and is postponed to the end of this section. Its usage in deriving the claimed bounds of Theorem 4.5, however, is very similar to the proof in [1].

Because of a few changes due to the truncation operator, we briefly recall first its main steps, suppressing at times details that could be recovered from [1].

For any two partitions $\Lambda_0,\Lambda_1$ we denote by $\Lambda_0\vee\Lambda_1$, $\Lambda_0\wedge\Lambda_1$ the partitions induced by the corresponding trees $\mathcal T_0\cup\mathcal T_1$, $\mathcal T_0\cap\mathcal T_1$, respectively. Recall that the data dependent partitions $\Lambda(\eta,z)$, defined via (2.24), do not contain cubes from levels higher than $j_0$. Likewise, for any threshold $\eta>0$, we need the deterministic analogue $\Lambda(\eta):=\Lambda(f_\rho,\eta)$, where $\Lambda(f_\rho,\eta)$ is the partition associated to the smallest tree $T(f_\rho,\eta)$ that contains all $I$ for which $\epsilon_I(f_\rho)>\eta$.

As in the proof of Theorem 2.5 in [1], we estimate the error by

$$\|f_\rho-f_{z,\Lambda(\tau_m,z)}\|\le e_1+e_2+e_3+e_4 \qquad (4.33)$$

with each term now defined by

$$e_1:=\|f_\rho-T_M(P_{\Lambda(\tau_m,z)\vee\Lambda(8C_A\lambda\tau_m)}f_\rho)\|,$$
$$e_2:=\|T_M(P_{\Lambda(\tau_m,z)\vee\Lambda(8C_A\lambda\tau_m)}f_\rho)-T_M(P_{\Lambda(\tau_m,z)\vee\Lambda(\tau_m/4)}f_\rho)\|,$$
$$e_3:=\|T_M(P_{\Lambda(\tau_m,z)\vee\Lambda(\tau_m/4)}f_\rho)-f_{z,\Lambda(\tau_m,z)\vee\Lambda(\tau_m/4)}\|,$$
$$e_4:=\|f_{z,\Lambda(\tau_m,z)\vee\Lambda(\tau_m/4)}-f_{z,\Lambda(\tau_m,z)}\|.$$

The first term $e_1$ is treated by approximation:

$$e_1\le C_s|f_\rho|^{\frac1{2s+1}}_{\mathcal B^s}(8C_A\lambda\tau_m)^{\frac{2s}{2s+1}}+\tau_m|f_\rho|_{\mathcal A^\gamma}\le c(|f_\rho|_{\mathcal B^s}+|f_\rho|_{\mathcal A^\gamma})\Big(\frac{\log m}{m}\Big)^{\frac{s}{2s+1}}, \qquad (4.34)$$

where $c=c(s,C_A,\lambda,\kappa)$. The second summand in the first estimate of (4.34) stems from the fact that optimal trees might be clipped by the restriction to $\cup_{j\le j_0}\Lambda_j$, and this missing part exploits the assumption $f_\rho\in\mathcal A^\gamma$.

The third term $e_3$ can be estimated by an inequality similar to (4.25), although $\Lambda(\tau_m,z)\vee\Lambda(\tau_m/4)$ is a data-dependent partition. We use the fact that all the cubes in this partition are always contained in the tree consisting of $T(f_\rho,\tau_m/4)$ completed by its outer leaves, i.e. by the cubes of $\Lambda(\tau_m/4)$. We denote by $\tilde T(f_\rho,\tau_m/4)$ this tree and by $\tilde N$ its cardinality. Using the same argument that led us to (4.25) for a fixed partition, we obtain

$$P\{e_3>\eta\}\le C_0\tilde Ne^{-\tilde cm\eta^2/\tilde N} \qquad (4.35)$$

with $\tilde c$ and $C_0$ the same constants as in (4.25). From the definition of the space $\mathcal B^s$, we have the estimate

$$\tilde N\le(2^d+1)\#(T(f_\rho,\tau_m/4))\le(2^d+1)(4\kappa^{-1}|f_\rho|_{\mathcal B^s})^{\frac2{2s+1}}\Big(\frac{m}{\log m}\Big)^{\frac1{2s+1}}. \qquad (4.36)$$

Thus, for $\kappa$ larger than $\kappa_0(\beta)$ we obtain that

$$P\Big\{e_3>c\Big(\frac{\log m}{m}\Big)^{\frac{s}{2s+1}}\Big\}\le m^{-\beta}, \qquad (4.37)$$

with $c=c(d,\lambda,\beta,C_A,|f_\rho|_{\mathcal B^s},s,\kappa)$.

The terms $e_2$ and $e_4$ are treated using the probabilistic estimates of Lemma 4.6. From these estimates it follows that

$$P\{e_2>0\}+P\{e_4>0\}\le 2\#(\Lambda_{j_0})\,c(1+\eta^{-C})e^{-\tilde cm\eta^2} \qquad (4.38)$$

with $\eta=\tau_m/4$. It follows that for $\kappa\ge\kappa_0(\beta,\gamma,\lambda,M,C_A,d)$, we have

$$P\{e_2>0\}+P\{e_4>0\}\le m^{-\beta}. \qquad (4.39)$$

Combining the estimates for $e_1$, $e_2$, $e_3$ and $e_4$, we therefore obtain the probabilistic estimate (4.29).

For the expectation estimate, we clearly infer from (4.34)

$$E(e_1^2)\le c^2(|f_\rho|^2_{\mathcal B^s}+|f_\rho|^2_{\mathcal A^\gamma})\Big(\frac{\log m}{m}\Big)^{\frac{2s}{2s+1}} \qquad (4.40)$$

with $c=c(s,C_A,\lambda,\kappa)$. Also, since $e_2^2,e_4^2\le 4M^2$, we obtain

$$E(e_2^2)+E(e_4^2)\le CM^2m^{-\beta}, \qquad (4.41)$$

with $C=C(\kappa,\lambda,C_A,M,d)$. We can estimate $E(e_3^2)$ by integration of (4.35), which gives

$$E(e_3^2)\le 4M^2m^{-\beta}+c^2\Big(\frac{\log m}{m}\Big)^{\frac{2s}{2s+1}}, \qquad (4.42)$$

with $c$ as in (4.37). This completes the proof of Theorem 4.5.

We now turn to the proof of Lemma 4.6. We shall give a different approach than that used in [1] for piecewise constant approximation. Recall that

$$\epsilon_I(z):=\Big\|T_M\Big(\sum_{J\in\mathcal C(I)}p_{J,z}\chi_J-p_{I,z}\chi_I\Big)\Big\|_m \qquad (4.43)$$

and

$$\epsilon_I:=\epsilon_I(f_\rho):=\Big\|\sum_{J\in\mathcal C(I)}p_J\chi_J-p_I\chi_I\Big\|. \qquad (4.44)$$

We introduce the auxiliary quantities

$$\tilde\epsilon_I(z):=\Big\|T_M\Big(\sum_{J\in\mathcal C(I)}p_{J,z}\chi_J-p_{I,z}\chi_I\Big)\Big\| \qquad (4.45)$$

and

$$\tilde\epsilon_I:=\Big\|T_M\Big(\sum_{J\in\mathcal C(I)}p_J\chi_J-p_I\chi_I\Big)\Big\|. \qquad (4.46)$$

For the proof of (4.31), we first remark that from the $L_\infty$ bound (4.4) on the least squares projection, we know that

$$\Big\|\sum_{J\in\mathcal C(I)}p_J\chi_J-p_I\chi_I\Big\|_{L_\infty}\le\Big\|\sum_{J\in\mathcal C(I)}p_J\chi_J\Big\|_{L_\infty}+\|p_I\chi_I\|_{L_\infty}\le 2\lambda C_AM. \qquad (4.47)$$

It follows that the inequality

$$\Big|\sum_{J\in\mathcal C(I)}p_J\chi_J-p_I\chi_I\Big|\le 2\lambda C_A\Big|T_M\Big(\sum_{J\in\mathcal C(I)}p_J\chi_J-p_I\chi_I\Big)\Big| \qquad (4.48)$$

holds at all points, and therefore $\epsilon_I\le 2\lambda C_A\tilde\epsilon_I$, so that

$$P\{\epsilon_I(z)\le\eta\ \text{and}\ \epsilon_I\ge 8\lambda C_A\eta\}\le P\{\epsilon_I(z)\le\eta\ \text{and}\ \tilde\epsilon_I\ge 4\eta\}. \qquad (4.49)$$

We now write

$$P\{\epsilon_I(z)\le\eta\ \text{and}\ \tilde\epsilon_I\ge 4\eta\}\le P\{\tilde\epsilon_I(z)\le 3\eta\ \text{and}\ \tilde\epsilon_I\ge 4\eta\}+P\{\epsilon_I(z)\le\eta\ \text{and}\ \tilde\epsilon_I(z)\ge 3\eta\}\le P\{|\tilde\epsilon_I(z)-\tilde\epsilon_I|\ge\eta\}+P\{\tilde\epsilon_I(z)-2\epsilon_I(z)\ge\eta\}=:p_1+p_2.$$

The first probability $p_1$ is estimated by

$$p_1\le P\Big\{\Big\|T_M\Big(\sum_{J\in\mathcal C(I)}p_J\chi_J-p_I\chi_I\Big)-T_M\Big(\sum_{J\in\mathcal C(I)}p_{J,z}\chi_J-p_{I,z}\chi_I\Big)\Big\|>\eta\Big\}\le ce^{-\tilde cm\eta^2}, \qquad (4.50)$$

with $c=c(\lambda,d)$ and $\tilde c=\tilde c(\lambda,M,C_A,d)$, where the last inequality is obtained by the same technique as in the proof of Lemma 4.1, applied to $\sum_{J\in\mathcal C(I)}p_J\chi_J-p_I\chi_I$ instead of $p_I\chi_I$ only.

The second probability $p_2$ is estimated by using results from [8]. Fix $I$ and let $\mathcal F=\mathcal F(I)$ be the set of all functions $f$ of the form

$$f=T_M\Big(\sum_{J\in\mathcal C(I)}q_J\chi_J-q_I\chi_I\Big), \qquad (4.51)$$

where $q_I$ and the $q_J$ are arbitrary polynomials from $\Pi_K$. Let $t=(t_1,\dots,t_{2m})$ with the $t_j$ chosen independently and at random with respect to the measure $\rho_X$, and consider the discrete norm

$$\|f\|_t:=\Big(\frac1{2m}\sum_{j=1}^{2m}|f(t_j)|^2\Big)^{1/2}. \qquad (4.52)$$

We will need a bound for the covering number $\mathcal N(\mathcal F,\eta,t)$, which is the smallest number of balls of radius $\eta$ which cover $\mathcal F$ with respect to the norm $\|\cdot\|_t$. It is well known that if $V$ is a linear space of dimension $q$ and $\mathcal G:=\{T_Mg:\ g\in V\}$, then

$$\mathcal N(\mathcal G,\eta,t)\le(C\eta^{-1})^{2q+2},\quad 0<\eta\le1, \qquad (4.53)$$

with $C=C(M)$ (see e.g. Theorems 9.4 and 9.5 in [8]). In our present situation, this leads to

$$\mathcal N(\mathcal F,\eta,t)\le(C\eta^{-1})^{2\lambda(2^d+1)+2},\quad 0<\eta\le1. \qquad (4.54)$$

We apply this bound on covering numbers in Theorem 11.2 of [8], which states that

$$P\{\|f\|-2\|f\|_m>\eta\ \text{for some }f\in\mathcal F\}\le 3e^{-\frac{m\eta^2}{288M^2}}E(\mathcal N(\mathcal F,\eta,t)). \qquad (4.55)$$

Here the probability is with respect to $z$ and the expectation is with respect to $t$.

Here the probability is with respect to z and the expectation is with respect to t. We can put in the entropy bound (4.54) for N into (4.55) and obtain p 2 P { f 2 f > η for soe f F} cη C e cη2 (4.56) with c, C, c as stated in the lea. This is the estiate we wanted for p 2. For the proof of (4.32), we first reark that obviously ɛ I ɛ I so that P {ɛ I η and ɛ I (z) 4η} P {ɛ I η and ɛ I (z) 4η}. (4.57) We proceed siilarily to the proof of (4.31) by writing P {ɛ I η and ɛ I (z) 4η} P {ɛ I η and ɛ I (z) 3 } 2 η { + P ɛ I (z) 3 } 2 η and ɛ I(z) 4η P { ɛ I (z) ɛ I η/2} + P {ɛ I (z) 2 ɛ I (z) η} = p 1 + p 2. (4.58) The first probability p 1 is estiated as previously. For the second probability p 2, we need a syetric stateent to Theore 11.2 in [8], to derive P { f 2 f > η for soe f F} Ce cη2 (4.59) with c and C as before. It is easily checked fro the proof of Theore 11.2 in [8], that such a stateent also holds (one only needs to odify the first page of the proof of this theore in [8] in order to bound P{ f 2 f > η for soe f F} by 3P{ f f > η/4 for soe f F}/2 and the rest of the proof is then identical). The proof of Lea 4.6 is therefore coplete. We finally odify our algorith slightly by choosing a slightly larger threshold which is now independent of the unknown constant C A, naely τ := log. We could actually use any threshold of the type log τ := κ(), (4.60) where κ() is a sequence which grows very slowly to +. This results in an additional logarith factor in our convergence estiates. Moreover, the sae analysis as in 5 of [1] shows that this new algorith is universally consistent. We record this in the following theore for which we do not give a proof since it is very siilar to the proof of Theore 4.5. Theore 4.7 Given an arbitrary β 1 and γ > 0, we take the threshold τ := log. Then the adaptive algorith has the property that whenever f ρ A γ B s for soe s > 0, the following concentration estiate holds { ( log ) } 2s 2s+1 P f ρ f z c β, (4.61) 23

with $\tilde c=\tilde c(s,C_A,\lambda,|f_\rho|_{\mathcal B^s},|f_\rho|_{\mathcal A^\gamma})$, as well as the following expectation bound

$$E(\|f_\rho-f_z\|^2)\le C\Big(\frac{(\log m)^2}{m}\Big)^{\frac{2s}{2s+1}} \qquad (4.62)$$

with $C=C(s,\lambda,M,C_A,d,|f_\rho|_{\mathcal B^s},|f_\rho|_{\mathcal A^\gamma})$. For a general regression function $f_\rho$, we have the universal consistency estimate

$$\lim_{m\to+\infty}E(\|f_\rho-f_z\|^2)=0, \qquad (4.63)$$

which in turn implies the convergence in probability: for all $\epsilon>0$,

$$\lim_{m\to+\infty}P\{\|f_\rho-f_z\|>\epsilon\}=0. \qquad (4.64)$$

References

[1] Binev, P., A. Cohen, W. Dahmen, R. DeVore, and V. Temlyakov (2004), Universal algorithms in learning theory - Part I: piecewise constant functions, Journal of Machine Learning Research (JMLR) 6 (2005), 1297-1321.

[2] Bennett, C., and R. Sharpley (1988), Interpolation of Operators, Vol. 129 in Pure and Applied Mathematics, Academic Press, New York.

[3] Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone (1984), Classification and Regression Trees, Wadsworth International, Belmont, CA.

[4] Cohen, A., W. Dahmen, I. Daubechies, and R. DeVore (2001), Tree structured approximation and optimal encoding, Appl. Comput. Harmon. Anal. 11, 192-226.

[5] Cohen, A., R. DeVore, G. Kerkyacharian, and D. Picard (2001), Maximal spaces with given rate of convergence for thresholding algorithms, Appl. Comput. Harmon. Anal. 11, 167-191.

[6] DeVore, R., G. Kerkyacharian, D. Picard, and V. Temlyakov (2004), Mathematical methods for supervised learning, to appear in J. of FOCM.

[7] Donoho, D.L. (1997), CART and best-ortho-basis: a connection, Ann. Stat. 25, 1870-1911.

[8] Györfi, L., M. Kohler, A. Krzyżak, and H. Walk (2002), A Distribution-Free Theory of Nonparametric Regression, Springer, Berlin.

Peter Binev, Industrial Mathematics Institute, University of South Carolina, Columbia, SC 29208, binev@math.sc.edu

Albert Cohen, Laboratoire Jacques-Louis Lions, Université Pierre et Marie Curie, 175 rue du Chevaleret, 75013 Paris, France, cohen@ann.jussieu.fr