Adaptive Minimax Classification with Dyadic Decision Trees


1 Adaptive Minimax Classification with Dyadic Decision Trees Clayton Scott Robert Nowak Electrical and Computer Engineering Electrical and Computer Engineering Rice University University of Wisconsin Houston, TX Madison, WI June 4, 2004 Abstract Decision trees are surprisingly adaptive in four important respects: They automatically (1) adapt to Tsybakov s noise condition; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features; (4) adapt to Bayes decision boundary smoothness. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve optimal rates for the class of distributions studied here while being practical and implementable. 1 Introduction This paper establishes four adaptivity properties of decision trees that lead to faster rates of convergence for a broad range of pattern classification problems. These properties are: Noise Adaptivity: Decision trees are capable of automatically adapting to the (unknown) regularity of the risk function in the neighborhood of the Bayes decision boundary. This condition is expressed in terms of Tsybakov s noise exponent [1]. Manifold Focus: When the distribution of features happens to have support on a lower dimensional manifold, decision trees can automatically detect and adapt their structure to the manifold. Thus decision trees learn the effective data dimension. Feature Rejection: If certain features are irrelevant (i.e., independent of the class labels), then decision trees automatically ignore these features. Thus decision trees learn the relevant data dimension. Decision Boundary Adaptivity: If the Bayes decision boundary has bounded derivatives, decision trees adapt to achieve faster rates for smoother boundaries. We consider only trees with axis-orthogonal splits. For more general trees, e.g., perceptron trees, adaptivity to should be possible, although retaining implementability may be challenging. Each of the above properties can be formalized and translated into a class of distributions with known minimax rates of convergence. Adaptivity is a highly desirable quality of classifiers since in practice the precise characteristics of the distribution are unknown. We show that dyadic decision trees can achieve the optimal rate (to within a log factor) without needing to know the specific parameters defining the class. Such trees are constructed by minimizing a complexity, 1

2 penalized empirical risk over an appropriate family of dyadic partitions. The complexity term is derived from a new generalization error bound for trees, inspired by [2]. This bound in turn leads to an oracle inequality from which the optimal rates are derived. The restriction to dyadic splits is necessary to achieve a computationally tractable classifier. Our classifiers have computational complexity nearly linear in the training sample size. The same rates may be achieved by more general tree classifiers, but these require searches over prohibitively large families of partitions. Dyadic decision trees are thus preferred because they are simultaneously implementable, analyzable, and sufficiently flexible to achieve optimal rates. 1.1 Notation Let be a random variable taking values in a set, and let be iid realizations of. Let be the probability measure for, and let be the empirical estimate of based on :,, where denotes the indicator function. In classification we take, where is the collection of feature vectors and is a finite set of class labels. Assume and. A classifier is a measurable function! " #. Each classifier! induces a set $ % ' & (! % ) &. Define the probability of error and empirical error (risk) of! by *! $ $ and *!, respectively. The Bayes classifier! + achieves minimum probability of error and is given by! + %, -./0 12, where 3 % ( % is the posterior probability that the correct label is 1. The Bayes error is *! + and denoted * +. The Bayes decision boundary, denoted 7 8+, is the topological boundary of the Bayes decision set 8+ % (! + %. 1.2 Rates of Convergence in Classification In this paper we study the rate at which 9 : *! ; * + goes to zero as # <, where! is a classifier constructed from but without knowledge of. Yang [3] shows that for 3 % in certain smoothness classes minimax optimal rates are achieved by appropriate plug-in density estimates. Tsybakov and collaborators replace global restrictions on 3 by restrictions on 3 near the Bayes decision boundary. Faster rates are then possible, although existing optimal classifiers typically rely on =-nets for their construction and in general are not implementable [1, 4, 5]. Other authors have derived rates of convergence for existing practical classifiers, but these rates are suboptimal in the minimax sense considered here [6 8]. Our contribution is to demonstrate implementable classifiers that adaptively achieve minimax optimal rates for some of Tsybakov's and other classes. In an earlier paper, we demonstrated the (non-adaptive) near-minimax optimality of dyadic decision trees in a very special case of the general class considered herein. Here we also simplify and improve the bounding techniques used in that work [10]. 2 Dyadic Decision Trees A dyadic decision tree (DDT) is a decision tree that divides the input space by means of axis-orthogonal dyadic splits. More precisely, a dyadic decision tree > is specified by assigning an ' A to each internal node of > (corresponding to the coordinate that is split at that node), and a binary label 0 or 1 to each terminal (or leaf) node of >. Each node of > is associated with a cell (hyperrectangle) in B according to the following rules: (1) The root node is associated with ; (2) if an internal node with children C and D is associated with the cell EF, then C and D are associated with the cells EG and EH obtained by splitting EF at its midpoint along the coordinate assigned to that node. Let I > E E J denote the partition induced by >. If a cell has depth K then L EF MNO where L is the Lebesgue measure on B. 2
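To make the cell-splitting rules above concrete, here is a minimal Python sketch of a dyadic tree node (our own illustration, not the authors' implementation; all names are ours). Each node stores its hyperrectangular cell and splits it at the midpoint of a chosen coordinate, exactly as in rules (1) and (2); a cyclic DDT would take the split coordinate to be the depth modulo d, while a free DDT may choose it arbitrarily at each internal node.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DyadicNode:
    """One node of a dyadic decision tree. The cell is a hyperrectangle,
    stored as per-coordinate (lo, hi) intervals; internal nodes record the
    coordinate split at that node, leaves carry a 0/1 class label."""
    cell: List[Tuple[float, float]]
    split_dim: Optional[int] = None    # None means this node is a leaf
    label: Optional[int] = None        # class label if a leaf
    left: Optional["DyadicNode"] = None
    right: Optional["DyadicNode"] = None

    def split(self, dim: int):
        """Dyadic split: halve this node's cell at its midpoint along `dim`."""
        lo, hi = self.cell[dim]
        mid = 0.5 * (lo + hi)
        left_cell = list(self.cell); left_cell[dim] = (lo, mid)
        right_cell = list(self.cell); right_cell[dim] = (mid, hi)
        self.split_dim = dim
        self.left, self.right = DyadicNode(left_cell), DyadicNode(right_cell)
        return self.left, self.right

    def classify(self, x):
        """Route a feature vector to its leaf cell and return the leaf label."""
        if self.split_dim is None:
            return self.label
        lo, hi = self.cell[self.split_dim]
        child = self.left if x[self.split_dim] < 0.5 * (lo + hi) else self.right
        return child.classify(x)

# The root cell is the unit hypercube [0,1]^d (here d = 2).
root = DyadicNode([(0.0, 1.0), (0.0, 1.0)])
```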

3 We distinguish two kinds of DDTs. In a cyclic DDT is the root node, and? mod A for every parent-child C. In a free DDT may be arbitrary for each internal node. Define FR and CY to be the collections of all free and cyclic DDTs, respectively. Similarly, define FR and CY to be the collections of all cells corresponding to nodes of trees in FR and CY, respectively. Let be a nonnegative integer and define FR to be the collection of all free DDTs such that no terminal cell has a sidelength smaller than MN. In other words, no coordinate is split more than times when traversing a path from the root to a leaf node. Similarly, define CY to be the collection of all cyclic DDTs such that no terminal cell has a sidelength smaller than MN. Because of the forced nature of the splits in CDDTs, CY is simply the set of all CDDTs with depth not exceeding A. To preview our results, appropriately constructed FDDTs can optimally adapt to all four properties outlined above. CDDTs, on the other hand, can optimally adapt to the first two. We will consider classifiers of the form > arg min * > > (1) where FR or CY and is a penalty or regularization term specified below. This penalty has the property that it is additive, meaning it is the sum over its leaves of a certain functional. This fact allows for fast algorithms for constructing > when combined with the fact that most cells of trees in contain no data. When CY the optimization reduces to pruning a tree with A nodes, and may be solved via a simple bottom-up tree-pruning algorithm in A operations. When FR an algorithm of Blanchard et al. [9] may be used to compute > in A A operations. For all of our theorems on rates of convergence below we have, in which case the complexities are A for and A for FR CY. 3 Generalization Error Bounds for Trees In this section we state a uniform error bound and an oracle inequality for DDTs. The bounding techniques are quite general and can be extended to larger (even uncountable) families of trees using VC theory, but for the sake of simplicity we confine the discussion to DDTs. Before stating these results, some additional notation is necessary. Let FR or CY. Suppose E ' corresponds to a node at depth K. If FR define E ; 2 K K 2 A, and if CY define E 2 K K ;. E represents the number of bits needed to uniquely encode E and will be used to measure the complexity of a DDT having E as a leaf cell. In particular, E satisfies the Kraft summability condition MN - /, and allows us to manage the confidence with which a collection of bounds (one for each E ' ) holds simultaneously. Note that cells in FDDTs require more bits than those in CDDTs because there are more cells to encode at any given depth. In what follows, the value of E will depend on whether free or cyclic DDTs are being considered. For a cell E, define 6 E and 6. Further define E M and E M M. It can be shown (Lemma 3) that with high probability, and ' uniformly over all E. This is a key to making our proposed classifier both computable on the one hand and analyzable on the other. 3
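Because the penalty introduced in the next section is additive over leaves, the minimization in (1) can be carried out by the simple bottom-up pruning pass alluded to above. The following sketch shows that standard dynamic program (again our own illustration, not the authors' code); it assumes the DyadicNode class from the previous sketch and a user-supplied leaf_cost functional returning the local penalized empirical risk of making a given cell a leaf (best 0/1 label plus the per-leaf penalty term). Additivity of the penalty is what makes the local comparison at each node globally optimal.

```python
def prune(node, leaf_cost):
    """Bottom-up pruning for an additive (sum-over-leaves) penalty.

    Returns the optimal penalized cost of the subtree rooted at `node`,
    collapsing the subtree to a leaf in place whenever keeping the split
    does not pay off. One pass over the tree suffices because the penalty
    decomposes as a sum of per-leaf terms.
    """
    cost_as_leaf = leaf_cost(node)
    if node.left is None and node.right is None:
        return cost_as_leaf
    cost_as_split = prune(node.left, leaf_cost) + prune(node.right, leaf_cost)
    if cost_as_leaf <= cost_as_split:
        node.left = node.right = None       # collapse to a leaf
        node.split_dim = None
        return cost_as_leaf
    return cost_as_split
```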

4 Let FR or CY, with E defined accordingly. Define the data-dependent penalty > - / E M M (2) Our first main result is the following uniform error bound. Theorem 1. With probability at least ; M, * > * > > for all > '. (3) Traditional error bounds for trees involve a penalty proportional to (> (, where (> ( denotes the number of leaves in > (see [11] or the naive bound in [2]). The penalty in (2) assigns a different weight to each leaf of the tree depending on both the depth of the leaf and the fraction of data reaching the leaf. Indeed, for very deep leaves, little data will reach those nodes, and such leaves will contribute very little to the overall penalty. For example, we may bound by with high probability, and if has a bounded density, then decays like MNO, where K is the depth of E. Thus, as K increases, E grows additively with K, but decays at a multiplicative rate. The upshot is that the penalty > favors unbalanced trees. Intuitively, if two trees have the same size and empirical error, minimizing the penalized empirical risk with this new penalty will select the tree that is more unbalanced, whereas a traditional penalty based only on tree size would not distinguish the two trees. This has advantages for classification because unbalanced trees are what we expect to do a good job of approximating a lower dimensional decision boundary. The derivation of (2) comes from applying standard concentration inequalities (for sums of Bernoulli trials) in a spatially decomposed manner. Spatial decomposition allows the introduction of local probabilities to offset the complexity of each leaf node E. Our analysis is inspired by the work of Mansour and McAllester [2]. The uniform error bound of Theorem 1 can be converted (using standard techniques) into an oracle inequality that is the key to deriving rates of convergence for DDTs. Theorem 2. Let FR or CY, and let > be as in (1) with as in (2). Define Then > - / E M 9 : * ; > * + * > ; * + > (4) Note that with high probability, is a non-random upper bound on, and therefore upper bounds. The use of instead of in the oracle bound facilitates rate of convergence analysis. The oracle inequality essentially says that > performs nearly as well as the DDT chosen by an oracle to minimize * > ; * +. The right hand side of (4) bears the interpretation of a decomposition into approximation error (* > ; * +) and estimation error >. We also remark that the remainder term can be improved to using relative VC bounds, although this improvement is not needed here. 4 Rates of Convergence The classes of distributions we study are motivated by the work of Mammen and Tsybakov [4] and Tsybakov [1] which we now review. The classes are indexed by the smoothness of the Bayes decision boundary 4

5 7 8+ and a parameter that quantifies how noisy the distribution is near We write when and if both and. Let, and take ; to be the largest integer not exceeding. Suppose " N # is times differentiable, and let denote the Taylor polynomial of order at the point?. For a constant, define, the class of functions with Hölder regularity, to be the collection of all such that (? ;? ( (? ;? ( for all?? ' N. Using Tsybakov s terminology, the Bayes decision set 8 + is a boundary fragment of smoothness if 8 + epi ' for some. Here epi '? "? is the epigraph of. In other words, for a boundary fragment, the last coordinate of 7 8+ is a Hölder- function of the first A ; coordinates. Tsybakov also introduces a margin assumption (not to be confused with the data-dependent notion of margin) that characterizes the level of noise near in terms of a noise exponent <. Let!!2 % ' "! % )!2 %. A distribution satisfies the margin assumption with noise exponent if, for some constants 2 =2, 6!! + 2 *! ; * + 1 for all! '! + =2 Here! + =2! " 6!! + =2. The case is the low noise case and corresponds to a jump of 3 % at the Bayes decision boundary. The case < is the high noise case and imposes no constraint on the distribution (provided 2 ). See [6] for further discussion. Define the class 2 =2 to be the collection of distributions of such that 0A For all measurable E, 6 E L E 1A 8 + is a boundary fragment with smoothness 2A The margin condition is satisfied with noise exponent and constants 2 =2. Introducing the parameter A ;, Tsybakov [1] proved the lower bound : ; $ : 9 *! * + N 1-2 N (5) Mammen and Tsybakov [4] demonstrate that empirical risk minimization (ERM) over yields a classifier achieving the minimax rate when. Tsybakov [1] shows that ERM over a suitable =-net also achieves the minimax rate for. Tsybakov and van de Geer [5] propose a minimum penalized empirical risk classifier that achieves the minimax rate for all. In all of these works, no computationally efficient algorithms are given for constructing these classifiers. It is important to understand how the minimax rate changes for different smoothness and noise parameters. Clearly, for fixed, the rate tends to N 12 as # <. If, the lower bound in (5) increases as # <, and the easiest case, meaning the smallest lower bound, occurs when. In contrast, if, the lower bound decreases as # <, and the easiest case occurs when <. Thus, when the smoothness exceeds A ;, convergence to the Bayes error is faster when there is less noise near the decision boundary. When A ;, on the other hand, high noise actually increases the rate. Of course, high noise does not make learning easier in an absolute sense, only relative to the hardest problem in that class. In light of the above, care must be taken when speaking about the minimax rate when. From the definition, for any. Hence the minimax rate for can be no faster 5

6 than N 1 -, the minimax rate for. To allow the possibility of a classifier actually achieving rates faster than N 1 - when, clearly an additional assumption must be made. We propose the following two-sided margin condition. Let <, and let 2 =2. We say the distribution of satisfies the two-sided margin condition with noise exponents if 6!! + ; 1 ' 2 *! * + for all!! + =2 (6) 6 ; 1 '!! + *! * + for all!! + (7) 2 Here! + is the set of all! such that!! + has no (topological) connected component which does not intersect the Bayes decision boundary. Note that Tsybakov and van de Geer [5] make a similar assumption, but with. Theirs is a much more restrictive assumption since it requires 3 % to behave very uniformly in all regions of space near the Bayes decision boundary. We also note that! + is not needed in their definition because they are assuming 8+ is a boundary fragment, a condition we will be relaxing. Observe that (6) entails the original margin condition for, while the second excludes distributions with margin. In other words, the noise level is not too high and not too low. We write 2B The two-sided margin condition is satisfied with noise parameters and constants 2 =2. It can be shown that the optimal rate under 0A, 1A, and 2B is no faster than N 1-2 N. Although this is not immediately obvious (2B is more restrictive than 2A, so the lower bound could in principle be smaller), it follows from a close inspection of the original derivation of the lower bound (5) given in [4]. There, the discrete hypercube of classifiers used in an Assouad-style argument satisfies not only 2A but also 2B. 5 Adaptive Rates for Dyadic Decision Trees All of our rate of convergence proofs use the oracle inequality in the same basic way. The objective is to find an oracle tree > + ' such that both * > + ; * + and > + decay at the desired rate. This tree is roughly constructed as follows. First form a regular dyadic partition (the exact construction will depend on the specific problem) into cells of sidelength MN, for a certain. Then prune back all cells that do not intersect the Bayes decision boundary. Both approximation and estimation error may now be bounded using the given assumptions and elementary bounding methods. For example, * > + ; * + > +! + (by 2B) L > +! + (by 0A) N (by 1A). This example reveals how the noise exponent enters the picture to affect the approximation error. Clearly other classifiers satisfying similar oracle inequalities will admit a similar analysis. 5.1 Margin Adaptive Classification Dyadic decision trees, selected according to the penalized empirical risk criterion discussed earlier, adapt to the unknown margin to achieve faster rates as stated in Theorem 3 below. For now we focus on distributions with ( A ; ), i.e., Lipschitz decision boundaries (the case is discussed in Section 5.4), and arbitrary two-sided margin parameters. From our previous discussion, the optimal rate for this class is N 1-2 N 2/. We will see that DDTs can adaptively learn at a rate of 1-2 N 2/. In an effort to be more general and practical, we replace the boundary fragment condition 1A with a less restrictive assumption. Tsybakov and van de Geer [5] assume the Bayes decision set 8 + is a boundary 1 If this condition is not imposed, it is always possible, except for trivial distributions, to find an such that (7) is not satisfied, rendering the class degenerate. On the other hand, little is lost by only considering because the margin condition seeks to characterize the noise level near the Bayes decision boundary. 6

7 fragment, meaning it is known a priori that (a) one coordinate of 7 8+ is a function of the others, (b) that coordinate is known, and (c) class 1 corresponds to the region above the boundary. The following condition includes all piecewise Lipschitz decision boundaries, and allows to have arbitrary orientation and 8+ to have multiple connected components. Let denote the regular partition of into hypercubes of sidelength where is a dyadic integer (i.e., a power of 2). A distribution of satisfies the box-counting assumption with constant if 1B For all dyadic integers, intersects at most N of the cells in. Condition 1A ( ) implies 1B, so the minimax rate under 0A, 1B, and 2B is no faster than N 1-2 N 2/. Theorem 3. Let FR or CY, with M. Define > as in (1) with as in (2). If 0A, 1B, and 2B hold, then 9 : * > ; * + N (8) The complexity-regularized DDT is adaptive in the sense that the noise exponents and constants 2 =2 need not be known. > can always be constructed and in opportune circumstances the rate in (8) is achieved. Also notice that in the special case, > achieves the minimax rate (within a log factor). 5.2 When the Data Lie on a Manifold For certain problems it may happen that the feature vectors lie on a manifold in the ambient space. When this happens, dyadic decision trees automatically adapt to achieve faster rates of convergence. To recast assumptions 0A and 1B in terms of a data manifold 2, we again use box-counting ideas. Let and A A. The boundedness and regularity assumptions for a A dimensional manifold are given by 0B For all dyadic integers and all E ', 6 E N. 1C For all dyadic integers, passes through at most N of the hypercubes in. The minimax rate under these assumptions is no faster than N 1. To see this, consider the mapping of features ' to M M '. Then lives on a A dimensional manifold, and clearly there can be no classifier achieving a rate faster than N 1 uniformly over all such, as this would lead to a classifier outperforming the minimax rate for. As the following theorem shows, DDTs can achieve this rate to within a log factor. Theorem 4. Let FR or CY, with M. Define > as in (1) with as in (2). If assumptions 0B and 1C hold, then 9 : * > ; * + N (9) Again, > is adaptive in that it does not require knowledge of the dimension A or the constants. 2 For simplicity, we eliminate the margin assumption here and in subsequent sections, although it could be easily incorporated to yield faster rates. 7
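The box-counting conditions used above (1B for the Bayes decision boundary, 0B and 1C for a data manifold) all control how many cells of the regular dyadic partition a given set can meet at each resolution. The sketch below is a purely illustrative numerical diagnostic of that notion and is not part of the method; the function and the circle example are our own.

```python
import numpy as np

def dyadic_boxes_hit(points, j):
    """Count the cells of the partition of [0,1)^d into hypercubes of
    sidelength 2**(-j) that contain at least one of the given points.
    For points concentrated on a d'-dimensional set, this count typically
    grows like 2**(j * d'), which is the behavior the box-counting
    assumptions formalize."""
    m = 2 ** j
    idx = np.floor(np.clip(points, 0.0, 1.0 - 1e-12) * m).astype(int)
    return len({tuple(row) for row in idx})

# A 1-dimensional curve (a circle) embedded in [0,1]^2 hits on the order
# of 2**j boxes at resolution j, not 4**j as a full-dimensional set would.
theta = np.random.default_rng(0).uniform(0.0, 2.0 * np.pi, 5000)
circle = 0.5 + 0.25 * np.c_[np.cos(theta), np.sin(theta)]
print([dyadic_boxes_hit(circle, j) for j in range(1, 7)])
```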

8 5.3 Irrelevant Features Whether the favorable conditions characterizing the distribution take the form of margin or manifold constraints, dyadic decision trees based on either cyclic or free dyadic partitions achieve the same rates of convergence. On the other hand, when some of the features are irrelevant (having no discriminatory information), cyclic DDTs clearly seem limited. In this case, the relevant data dimension is some number A A, where A is the number of relevant features, and we might hope to achieve a faster rate on the order of N 1. This can in fact be seen to be the minimax rate by an argument similar to the one in the previous section. As the following theorem indicates, free DDTs achieve this rate to within a log factor. Theorem 5. Let FR with M. Define > as in (1) with as in (2). If all but A of the features are independent of, and 0A and 1B hold, then 9 : * > ; * + N (10) As in the previous theorems, the classifier is adaptive in the sense that it does not need to be told A or which A features are relevant. 5.4 Adapting to Bayes Decision Boundary Smoothness Thus far in our discussion we have assumed has smoothness. In [10] we show that DDTs with polynomial classifiers decorating the leaves can achieve faster rates for. Combined with the margin analysis here, these rates can approach N. Unfortunately, these rates are suboptimal and the classifiers are not practical. For, free DDTs can adaptively attain the minimax rate (within a log factor). Our results apply to distributions such that 1D One coordinate of is a Hölder- function of the others. It is possible to relax this condition to more general 7 8+ (piecewise Hölder boundaries with multiple connected components) using, e.g., box-counting ideas, although we do not pursue this here. Even without this generalization, when compared to Tsybakov and van de Geer [5] free DDTs have the advantage (in addition to being implementable) that it is not necessary to know the orientation of 7 8+ (which coordinate is a function of the others), or which side of corresponds to class 1. Theorem 6. Let FR with M. Define > as in (1) with as in (2). If assumptions 0A and 1D (with ) hold, then 9 : * > ; * + N (11) The problem of finding practical classifiers that adapt to the optimal rate for is an open problem that we are currently pursuing. 6 Conclusion Dyadic decision trees adapt to Tsybakov's noise condition, data manifold dimension, relevant features, and (partially) Bayes decision boundary smoothness to achieve minimax optimal rates of convergence (to within a log factor). Although we consider each condition separately so as to simplify the discussion, it should not be too difficult to combine results to yield a rate of 1-2 N where A+ ; and A+ is the dimension of the manifold supporting the relevant features. 8
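Before the proofs, it may help to see the oracle-tree construction that Section 5 outlines and that the proofs of Theorems 3-5 reuse: refine a cell, cycling through the coordinates, only while it meets the Bayes decision boundary and the target resolution has not been reached. The sketch below is our own rendering of that recipe; it reuses the DyadicNode class from the Section 2 sketch, and boundary_hits_cell is a hypothetical predicate standing in for the oracle's knowledge of the boundary.

```python
def build_oracle_tree(node, depth, max_depth, d, boundary_hits_cell):
    """Oracle construction used in the rate proofs (a sketch, not the paper's
    code). With max_depth = d * k, every leaf cell has sidelength at least
    2**(-k) in each coordinate; cells away from the boundary are left as
    leaves, which is why the resulting tree is highly unbalanced."""
    if depth == max_depth or not boundary_hits_cell(node.cell):
        node.label = 0   # in the proofs, leaf labels are chosen to minimize risk
        return
    left, right = node.split(depth % d)   # cyclic split rule
    build_oracle_tree(left, depth + 1, max_depth, d, boundary_hits_cell)
    build_oracle_tree(right, depth + 1, max_depth, d, boundary_hits_cell)
```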

9 7 Proofs Lemma 1. Let FR or CY and define E 2 K K ; (if CY ) or E 2 K K ; 2 A (if FR ). Then MN - / (12) Proof. E may be thought of as a codelength in a prefix code for. Suppose E corresponds to a node having depth K. Then 2 K bits are needed to encode K, K ; bits are needed to encode whether the K ; ancestors branch left or right, and, in the case of free DDTs, an additional K ; 2 A bits are needed to encode which axes are being split at the ancestors. This leads to the stated formulas for E which, by the Kraft inequality for uniquely decodable codes, satisfy (12). Our analysis employs the following concentration inequalities. The first is known as a relative Chernoff bound [12], the second is a standard (additive) Chernoff bound, and the last two were proved by Okamoto [13]. Lemma 2. Let be a Bernoulli random variable with, and let be iid realizations. Set. For all = : ; = N 12 (13) : = 2 N (14) : = 2 N (15) : = N (16) Corollary 1. Under the assumptions of the previous lemma : ; M. The main step in proving Theorem 1 is the following result. This is proved by applying (13) with = 2-1/. Proposition 1. Let FR or CY. Let '. With probability at least ;, all > ' simultaneously satisfy * > ; * > - / E M M (17) Proof. Let > '. * > ; * > - ' E > ) ; ' E > ) / ; - / 9

10 where % ' B & ( % ' E > % ) &. For fixed E >, consider the Bernoulli trial which equals 1 if ' and 0 otherwise. By Corollary 1 ; M E M except on a set of probability not exceeding MN - - /. We want this to hold for all. Note that the sets are in 2-to-1 correspondence with cells E ', because each cell could have one of two class labels. Using, the union bound, and applying the same argument for each possible label, we have that (17) holds uniformly except on a set of probability not exceeding MN - - / MN - - / MN - /, where the last step follows from Lemma 1. In order for the above bound to be useful, it must be computable, which means we must eliminate the dependence on. This is accomplished by the following lemma. Lemma 3. Let FR or CY, with E defined accordingly. Then : ' for all E, ; (18) : ' for all E, ; (19) Proof. We prove the second statement. The first follows in like fashion. For fixed E : : E M M : M E M M : E M M MN - /, where the last inequality follows from (15) with = E M M. The result follows by repeating this argument for each E, applying the union bound and Lemma 1. Proof of Theorem 1. Apply (17) (with ), (18), and the union bound. ' Proof of Theorem 2. Let > be the tree minimizing the expression on the right-hand side of (4). By the additive Chernoff bound (14), with probability at least ;, * > * > M (20) Take to be the set of all such that the events in (3), (19), and (20) hold. Then 9 : * > : 9 : * > ( : 9 : * > ( 9 : * > ( Given ', we know * > * > > * > > * > > * > > M where the first inequality follows from (3), the second from (1), the third from (19), and the fourth from (20). The result now follows by subtracting * + from both sides. 10
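The argument above leans on the relative (multiplicative) Chernoff bound: when a cell's probability is small, it is far tighter than an additive bound at the same threshold, which is why deep, lightly populated leaves contribute so little to the penalty. The Monte Carlo sketch below illustrates this with the standard lower-tail form P(p_hat <= (1 - eps) p) <= exp(-n p eps^2 / 2); the exact inequalities and constants used in (13)-(16) are those of [12, 13], and the parameter values here are arbitrary.

```python
import numpy as np

def relative_chernoff_check(n=2000, p=0.02, eps=0.5, trials=20000, seed=0):
    """Compare the empirical lower-tail probability of a Binomial(n, p)
    proportion with the standard relative Chernoff bound and with a
    Hoeffding (additive) bound at the same threshold (1 - eps) * p."""
    rng = np.random.default_rng(seed)
    p_hat = rng.binomial(n, p, size=trials) / n
    empirical = np.mean(p_hat <= (1.0 - eps) * p)
    relative_bound = np.exp(-n * p * eps ** 2 / 2.0)
    additive_bound = np.exp(-2.0 * n * (eps * p) ** 2)  # much looser for small p
    return empirical, relative_bound, additive_bound

print(relative_chernoff_check())
```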

11 Proof of Theorem 3. Since CY FR, it suffices to prove the result for CY. Let M be a dyadic integer,, such that 1-2 N 2/. This is always possible by the assumption M. Recall denotes the partition of into hypercubes of sidelength. Let be the collection of cells in that intersect Take > + to be the smallest element of CY such that I > +. In other words, > + consists of the cells in, together with all cells in CY that are ancestors of nodes in, together with their children. Choose class labels for the leaves of > + such that * > + is minimized. Note that > + has depth A. Our method of proof is to show that, for the given choice of, both * > + ; * + and the desired rate. We begin by bounding the approximation error. Lemma 4. For all, where 2. Proof. From assumption 2B we have * > + ; * + N * > + ; * + 2 > +! + Note that by construction, > + '! +. Continuing, we have > +! + L > +! + L E " E ' L E ( ( N N N N > + decay at where the first inequality follows from 0A and the last inequality from 1B. Next we bound the estimation error. Lemma 5. Proof. We begin with three observations. First, > + 12N M E M M M E M M Second, if E corresponds to a node of depth K K E, then by 0A, L E MNO - /. Third, E 2 K K ; M 2 MA. Combining these, we have > + > + > +, where > MNO - / - / and 2 > - (21) / 11

12 2 We note that > + > +. This follows from 1-2 N 2/, for then N MN MNO - / for all E, since a leaf in > + can have a maximum depth of A. It remains to bound > +. Let O be the number of terminal cells in > + at depth K. By 1B, O 1 M O - N. Then > + O M O 1 - N /MNO M - N MN - N / M - N MN - N / 12 MN 12 M - 12 N M - 12 N 12 N The proof now follows by the oracle inequality and plugging 1-2 N 2/ into the above bounds on approximation and estimation error. Proof of Theorem 4. As in the previous proof, it suffices to consider cyclic DDTs. Again we seek a tree > + ' CY so that the approximation and estimation errors are balanced. Let M be a dyadic integer,, with 1. Let be the collection of cells in that intersect Take > + to be the smallest element of CY such that I > +. In other words, > + consists of the cells in, together with all cells in CY that are ancestors of nodes in, together with their children. Choose class labels for the leaves of > + such that * > + is minimized. Note that > + has depth A. The construction of > + is identical to the previous proof; the difference now is that ( ( is substantially smaller. Lemma 6. For all, * > + ; * + N Proof. We have * > + ; * + > + )! + E " E ' E ( ( N N N N where the third inequality follows from 0B and the last inequality from 1C. Next we bound the estimation error. 12

13 Lemma 7. > + 12 N Proof. If E is a cell at depth K in >, then 1 M O by assumption 0B. Arguing as in the proof of Theorem 3, we have 2 > + > + > +, where > M O 1 - / and 2 2 > is as in (21). Note that > + > +. This follows from 1, for then N MN MNO - / for all E, since a leaf in > + can have a maximum depth of A. It remains to bound > +. Let O be the number of terminal cells in > + at depth K. By 1C, O 1 M O - N. Then > + O M O 1 - N /MN O 1 M - N A M - 12 N M - 12 N 12 N MN - N / The proof now follows by the oracle inequality and plugging 1 into the above bounds on approximation and estimation error. Proof of Theorem 5. Assume without loss of generality that the first A features are relevant and the remaining A ; A are statistically independent of. Before we construct the oracle tree > +, we develop some intuition. If A M and A, then is a vertical line segment. If A and A, then 7 8+ is a plane orthogonal to the axis. If A and A M, then is a vertical sheet over a Lipschitz curve in the 2 plane. In general, 7 8+ is the Cartesian product of a Lipschitz curve in the with N. We may translate this intuition into the following lemma. Lemma 8. Let be a dyadic integer, and consider the partition of into hypercubes of sidelength. Then the projection of 7 8+ onto intersects at most N of those hypercubes. Proof. If not, then 7 8+ intersects more than N members of in, in violation of the box-counting assumption. Now construct the tree > + as follows. Let M be a dyadic integer,, with 1. Let be the partition of obtained by splitting the first A features uniformly into cells of sidelength. Let ' be the collection of cells in that intersect the Bayes decision boundary. Let > INIT + FR be the free DDT formed by splitting cyclically through the first A features until all leaf nodes have a depth of 13

14 A 2. Take > + to be the smallest pruned subtree of > + INIT such that I > +. Choose class labels for the leaves of > + such that * > + is minimized. Note that > + has depth A. Lemma 9. For all, * > + ; * + N Proof. We have * > + ; * + > + )! + E " E ' E ( ( N N N N where the third inequality follows from 0B and the last inequality from Lemma 8. The remainder of the proof follows in a manner entirely analogous to our previous proofs, where now O M O 1 - N /. Proof of Theorem 6. Assume without loss of generality that the last coordinate of 7 8+ is a function of the others. Let M be a dyadic integer,, with 1- N. Let M be the largest dyadic integer not exceeding. Note that. Construct the tree > + as follows. First, cycle through the first A ; coordinates ; times. Then, cycle through all A coordinates times. Call this tree > INIT +. Finally, form > + by pruning back all cells in > INIT + whose parents do not intersect the Bayes decision boundary. Note that > + has depth ; A ; A. Lemma 10. Let denote the number of nodes in > + at depth K. Then O K ; A ; O M - N /- N K ; A ; ; A where A ; 12 M and in the second case and A. Proof. In the first case the result is obvious by construction of > + and the assumption that one coordinate of 7 8+ is a function of the others. For the second case, let K ; A ; ; A for some and. Define K to be the partition of formed by the set of cells in > INIT + having depth K. Define K to be the set of cells in K that intersect the Bayes decision boundary. Observe that leaves of > + at depth K are in one-to-one correspondence with cells in K. If K is the maximum depth, then the two sets of cells are the same. If K, then the leaves of > + at depth K are precisely the siblings of the cells in K. From the obvious fact ( K ( ( K (, we conclude O ( K ( ( ; A ; A ( Thus it remains to show ( ; A ; A ( M - N /- N ;. Each cell in A ; A is a rectangle of the form, where N M - N /- N is a hypercube of sidelength MN - N /, and M is an interval of length MN. For each M - N /- N, set ' ; A ; A ( M. 14

15 The proof will be complete if we can show ( ( A ; 12 M, for then ( ; A ; A ( 2 ( ( M - N /- N as desired. To prove this fact, recall 7 8+? ' (? for some function " N # satisfying ( ;?? ( ; ' (?? ( for all?? N. Therefore, the value of on a single hypercube can vary by no more than A ; MN - N /. Here we use the fact that the maximum distance between points in is A ; MN - N /. Since each interval has length MN This completes the proof. ( ( A ; 12 MN - N / MN M ; A 12 MN - N N / M A ; 12 MN - N N / M A ; 12 MN - N /- N / M A ; 12 M The following lemma bounds the approximation error. Lemma 11. For all, where M A ; 12 M. * > + ; * + N Proof. Recall > + has depth ; A ; A, and define K as in the proof of the Lemma 10. * > + ; ) * + > +! + E " E ' - / E - / L E By construction, L E MN - N / N. Noting that MN MN MN, we have L E M MN - N / M N - N. Thus * > + ; * + M N - N M - N / N - N N N - N N The estimation error decays as follows. 15

16 Lemma 12. > + - N N 12 The proof of this lemma follows by Lemma 10 and techniques used in previous proofs. The proof of the theorem follows by the oracle inequality and plugging 1- N into the above bounds on approximation and estimation error. References [1] A. B. Tsybakov, Optimal aggregation of classifiers in statistical learning, Ann. Stat., vol. 32, no. 1, pp , [2] Y. Mansour and D. McAllester, Generalization bounds for decision trees, in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, N. Cesa-Bianchi and S. Goldman, Eds., Palo Alto, CA, 2000, pp [3] Y. Yang, Minimax nonparametric classification Part I: Rates of convergence, IEEE Trans. Inform. Theory, vol. 45, no. 7, pp , [4] E. Mammen and A. B. Tsybakov, Smooth discrimination analysis, Ann. Stat., vol. 27, pp , [5] A. B. Tsybakov and S. A. van de Geer, Square root penalty: adaptation to the margin in classification and in edge estimation, preprint, [Online]. Available: [6] P. Bartlett, M. Jordan, and J. McAuliffe, Convexity, classification, and risk bounds, Department of Statistics, U.C. Berkeley, Tech. Rep. 638, [7] G. Blanchard, G. Lugosi, and N. Vayatis, On the rate of convergence of regularized boosting classifiers, J. Machine Learning Research, vol. 4, pp , [8] J. Scovel and I. Steinwart, Fast rates for support vector machines, Los Alamos National Laboratory, Tech. Rep. LA-UR , [9] G. Blanchard, C. Schäfer, and Y. Rozenholc, Oracle bounds and exact algorithm for dyadic classification trees, preprint, [Online]. Available: pdf [10] C. Scott and R. Nowak, Near-minimax optimal classification with dyadic classification trees, in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. Cambridge, MA: MIT Press, [11] A. Nobel, Analysis of a complexity based pruning scheme for classification trees, IEEE Trans. Inform. Theory, vol. 48, no. 8, pp , [12] T. Hagerup and C. Rüb, A guided tour of Chernoff bounds, Inform. Process. Lett., vol. 33, no. 6, pp , [13] M. Okamoto, Some inequalities relating to the partial sum of binomial probabilities, Annals of the Institute of Statistical Mathematics, vol. 10, pp ,


More information

Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

More information

Support Vector Machines. Machine Learning Fall 2017

Support Vector Machines. Machine Learning Fall 2017 Support Vector Machines Machine Learning Fall 2017 1 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost 2 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost Produce

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Properties and Classification of the Wheels of the OLS Polytope.

Properties and Classification of the Wheels of the OLS Polytope. Properties and Classification of the Wheels of the OLS Polytope. G. Appa 1, D. Magos 2, I. Mourtos 1 1 Operational Research Department, London School of Economics. email: {g.appa, j.mourtos}@lse.ac.uk

More information

Rademacher Complexity Bounds for Non-I.I.D. Processes

Rademacher Complexity Bounds for Non-I.I.D. Processes Rademacher Complexity Bounds for Non-I.I.D. Processes Mehryar Mohri Courant Institute of Mathematical ciences and Google Research 5 Mercer treet New York, NY 00 mohri@cims.nyu.edu Afshin Rostamizadeh Department

More information

10-704: Information Processing and Learning Fall Lecture 10: Oct 3

10-704: Information Processing and Learning Fall Lecture 10: Oct 3 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 0: Oct 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy of

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Notes on the Dual Ramsey Theorem

Notes on the Dual Ramsey Theorem Notes on the Dual Ramsey Theorem Reed Solomon July 29, 2010 1 Partitions and infinite variable words The goal of these notes is to give a proof of the Dual Ramsey Theorem. This theorem was first proved

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Mihai Pǎtraşcu, MIT, web.mit.edu/ mip/www/ Index terms: partial-sums problem, prefix sums, dynamic lower bounds Synonyms: dynamic trees 1

More information

THE information capacity is one of the most important

THE information capacity is one of the most important 256 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 1, JANUARY 1998 Capacity of Two-Layer Feedforward Neural Networks with Binary Weights Chuanyi Ji, Member, IEEE, Demetri Psaltis, Senior Member,

More information

PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers

PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers PAC-Bayes Ris Bounds for Sample-Compressed Gibbs Classifiers François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Département d informatique et de génie logiciel,

More information

Multiscale Density Estimation

Multiscale Density Estimation Multiscale Density Estimation 1 R. M. Willett, Student Member, IEEE, and R. D. Nowak, Member, IEEE July 4, 2003 Abstract The nonparametric density estimation method proposed in this paper is computationally

More information

A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones

A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones Abstract We present a regularity lemma for Boolean functions f : {, } n {, } based on noisy influence, a measure of how locally correlated

More information

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees Francesc Rosselló 1, Gabriel Valiente 2 1 Department of Mathematics and Computer Science, Research Institute

More information

A practical theory for designing very deep convolutional neural networks. Xudong Cao.

A practical theory for designing very deep convolutional neural networks. Xudong Cao. A practical theory for designing very deep convolutional neural networks Xudong Cao notcxd@gmail.com Abstract Going deep is essential for deep learning. However it is not easy, there are many ways of going

More information

CS264: Beyond Worst-Case Analysis Lecture #18: Smoothed Complexity and Pseudopolynomial-Time Algorithms

CS264: Beyond Worst-Case Analysis Lecture #18: Smoothed Complexity and Pseudopolynomial-Time Algorithms CS264: Beyond Worst-Case Analysis Lecture #18: Smoothed Complexity and Pseudopolynomial-Time Algorithms Tim Roughgarden March 9, 2017 1 Preamble Our first lecture on smoothed analysis sought a better theoretical

More information

Generalization and Overfitting

Generalization and Overfitting Generalization and Overfitting Model Selection Maria-Florina (Nina) Balcan February 24th, 2016 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

Tree sets. Reinhard Diestel

Tree sets. Reinhard Diestel 1 Tree sets Reinhard Diestel Abstract We study an abstract notion of tree structure which generalizes treedecompositions of graphs and matroids. Unlike tree-decompositions, which are too closely linked

More information

Dominating Set Counting in Graph Classes

Dominating Set Counting in Graph Classes Dominating Set Counting in Graph Classes Shuji Kijima 1, Yoshio Okamoto 2, and Takeaki Uno 3 1 Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan kijima@inf.kyushu-u.ac.jp

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu February

More information