Adaptive Minimax Classification with Dyadic Decision Trees


1 Adaptive Minimax Classification with Dyadic Decision Trees Clayton Scott Robert Nowak Electrical and Computer Engineering Electrical and Computer Engineering Rice University University of Wisconsin Houston, TX Madison, WI June 4, 2004 Abstract Decision trees are surprisingly adaptive in four important respects: They automatically (1) adapt to Tsybakov s noise condition; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features; (4) adapt to Bayes decision boundary smoothness. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve optimal rates for the class of distributions studied here while being practical and implementable. 1 Introduction This paper establishes four adaptivity properties of decision trees that lead to faster rates of convergence for a broad range of pattern classification problems. These properties are: Noise Adaptivity: Decision trees are capable of automatically adapting to the (unknown) regularity of the risk function in the neighborhood of the Bayes decision boundary. This condition is expressed in terms of Tsybakov s noise exponent [1]. Manifold Focus: When the distribution of features happens to have support on a lower dimensional manifold, decision trees can automatically detect and adapt their structure to the manifold. Thus decision trees learn the effective data dimension. Feature Rejection: If certain features are irrelevant (i.e., independent of the class labels), then decision trees automatically ignore these features. Thus decision trees learn the relevant data dimension. Decision Boundary Adaptivity: If the Bayes decision boundary has bounded derivatives, decision trees adapt to achieve faster rates for smoother boundaries. We consider only trees with axis-orthogonal splits. For more general trees, e.g., perceptron trees, adaptivity to should be possible, although retaining implementability may be challenging. Each of the above properties can be formalized and translated into a class of distributions with known minimax rates of convergence. Adaptivity is a highly desirable quality of classifiers since in practice the precise characteristics of the distribution are unknown. We show that dyadic decision trees can achieve the optimal rate (to within a log factor) without needing to know the specific parameters defining the class. Such trees are constructed by minimizing a complexity, 1

2 penalized empirical risk over an appropriate family of dyadic partitions. The complexity term is derived from a new generalization error bound for trees, inspired by [2]. This bound in turn leads to an oracle inequality from which the optimal rates are derived. The restriction to dyadic splits is necessary to achieve a computationally tractable classifier. Our classifiers have computational complexity nearly linear in the training sample size. The same rates may be achieved by more general tree classifiers, but these require searches over prohibitively large families of partitions. Dyadic decision trees are thus preferred because they are simultaneously implementable, analyzable, and sufficiently flexible to achieve optimal rates. 1.1 Notation Let be a random variable taking values in a set, and let be iid realizations of. Let be the probability measure for, and let be the empirical estimate of based on :,, where denotes the indicator function. In classification we take, where is the collection of feature vectors and is a finite set of class labels. Assume and. A classifier is a measurable function! " #. Each classifier! induces a set $ % ' & (! % ) &. Define the probability of error and empirical error (risk) of! by *! $ $ and *!, respectively. The Bayes classifier! + achieves minimum probability of error and is given by! + %, -./0 12, where 3 % ( % is the posterior probability that the correct label is 1. The Bayes error is *! + and denoted * +. The Bayes decision boundary, denoted 7 8+, is the topological boundary of the Bayes decision set 8+ % (! + %. 1.2 Rates of Convergence in Classification In this paper we study the rate at which 9 : *! ; * + goes to zero as # <, where! is a classifier constructed from but without knowledge of. Yang [3] shows that for 3 % in certain smoothness classes minimax optimal rates are achieved by appropriate plug-in density estimates. Tsybakov and collaborators replace global restrictions on 3 by restrictions on 3 near the Bayes decision boundary. Faster rates are then possible, although existing optimal classifiers typically rely on =-nets for their construction and in general are not implementable [1, 4, 5]. Other authors have derived rates of convergence for existing practical classifiers, but these rates are suboptimal in the minimax sense considered here [6 8]. Our contribution is to demonstrate implementable classifiers that adaptively achieve minimax optimal rates for some of Tsybakov's and other classes. In an earlier paper, we demonstrated the (non-adaptive) near-minimax optimality of dyadic decision trees in a very special case of the general class considered herein. Here we also simplify and improve the bounding techniques used in that work [10]. 2 Dyadic Decision Trees A dyadic decision tree (DDT) is a decision tree that divides the input space by means of axis-orthogonal dyadic splits. More precisely, a dyadic decision tree > is specified by assigning an ' A to each internal node of > (corresponding to the coordinate that is split at that node), and a binary label 0 or 1 to each terminal (or leaf) node of >. Each node of > is associated with a cell (hyperrectangle) in B according to the following rules: (1) The root node is associated with ; (2) if an internal node with children C and D is associated with the cell EF, then C and D are associated with the cells EG and EH obtained by splitting EF at its midpoint along the coordinate assigned to that node. Let I > E E J denote the partition induced by >. If a cell has depth K then L EF MNO where L is the Lebesgue measure on B. 2
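To make the cell-splitting rules above concrete, here is a minimal Python sketch of a dyadic tree node (our own illustration, not the authors' implementation; all names are ours). Each node stores its hyperrectangular cell and splits it at the midpoint of a chosen coordinate, exactly as in rules (1) and (2); a cyclic DDT would take the split coordinate to be the depth modulo d, while a free DDT may choose it arbitrarily at each internal node.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DyadicNode:
    """One node of a dyadic decision tree. The cell is a hyperrectangle,
    stored as per-coordinate (lo, hi) intervals; internal nodes record the
    coordinate split at that node, leaves carry a 0/1 class label."""
    cell: List[Tuple[float, float]]
    split_dim: Optional[int] = None    # None means this node is a leaf
    label: Optional[int] = None        # class label if a leaf
    left: Optional["DyadicNode"] = None
    right: Optional["DyadicNode"] = None

    def split(self, dim: int):
        """Dyadic split: halve this node's cell at its midpoint along `dim`."""
        lo, hi = self.cell[dim]
        mid = 0.5 * (lo + hi)
        left_cell = list(self.cell); left_cell[dim] = (lo, mid)
        right_cell = list(self.cell); right_cell[dim] = (mid, hi)
        self.split_dim = dim
        self.left, self.right = DyadicNode(left_cell), DyadicNode(right_cell)
        return self.left, self.right

    def classify(self, x):
        """Route a feature vector to its leaf cell and return the leaf label."""
        if self.split_dim is None:
            return self.label
        lo, hi = self.cell[self.split_dim]
        child = self.left if x[self.split_dim] < 0.5 * (lo + hi) else self.right
        return child.classify(x)

# The root cell is the unit hypercube [0,1]^d (here d = 2).
root = DyadicNode([(0.0, 1.0), (0.0, 1.0)])
```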

3 We distinguish two kinds of DDTs. In a cyclic DDT is the root node, and? mod A for every parent-child C. In a free DDT may be arbitrary for each internal node. Define FR and CY to be the collections of all free and cyclic DDTs, respectively. Similarly, define FR and CY to be the collections of all cells corresponding to nodes of trees in FR and CY, respectively. Let be a nonnegative integer and define FR to be the collection of all free DDTs such that no terminal cell has a sidelength smaller than MN. In other words, no coordinate is split more than times when traversing a path from the root to a leaf node. Similarly, define CY to be the collection of all cyclic DDTs such that no terminal cell has a sidelength smaller than MN. Because of the forced nature of the splits in CDDTs, CY is simply the set of all CDDTs with depth not exceeding A. To preview our results, appropriately constructed FDDTs can optimally adapt to all four properties outlined above. CDDTs, on the other hand, can optimally adapt to the first two. We will consider classifiers of the form > arg min * > > (1) where FR or CY and is a penalty or regularization term specified below. This penalty has the property that it is additive, meaning it is the sum over its leaves of a certain functional. This fact allows for fast algorithms for constructing > when combined with the fact that most cells of trees in contain no data. When CY the optimization reduces to pruning a tree with A nodes, and may be solved via a simple bottom-up tree-pruning algorithm in A operations. When FR an algorithm of Blanchard et al. [9] may be used to compute > in A A operations. For all of our theorems on rates of convergence below we have, in which case the complexities are A for and A for FR CY. 3 Generalization Error Bounds for Trees In this section we state a uniform error bound and an oracle inequality for DDTs. The bounding techniques are quite general and can be extended to larger (even uncountable) families of trees using VC theory, but for the sake of simplicity we confine the discussion to DDTs. Before stating these results, some additional notation is necessary. Let FR or CY. Suppose E ' corresponds to a node at depth K. If FR define E ; 2 K K 2 A, and if CY define E 2 K K ;. E represents the number of bits needed to uniquely encode E and will be used to measure the complexity of a DDT having E as a leaf cell. In particular, E satisfies the Kraft summability condition MN - /, and allows us to manage the confidence with which a collection of bounds (one for each E ' ) holds simultaneously. Note that cells in FDDTs require more bits than those in CDDTs because there are more cells to encode at any given depth. In what follows, the value of E will depend on whether free or cyclic DDTs are being considered. For a cell E, define 6 E and 6. Further define E M and E M M. It can be shown (Lemma 3) that with high probability, and ' uniformly over all E. This is a key to making our proposed classifier both computable on the one hand and analyzable on the other. 3
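Because the penalty introduced in the next section is additive over leaves, the minimization in (1) can be carried out by the simple bottom-up pruning pass alluded to above. The following sketch shows that standard dynamic program (again our own illustration, not the authors' code); it assumes the DyadicNode class from the previous sketch and a user-supplied leaf_cost functional returning the local penalized empirical risk of making a given cell a leaf (best 0/1 label plus the per-leaf penalty term). Additivity of the penalty is what makes the local comparison at each node globally optimal.

```python
def prune(node, leaf_cost):
    """Bottom-up pruning for an additive (sum-over-leaves) penalty.

    Returns the optimal penalized cost of the subtree rooted at `node`,
    collapsing the subtree to a leaf in place whenever keeping the split
    does not pay off. One pass over the tree suffices because the penalty
    decomposes as a sum of per-leaf terms.
    """
    cost_as_leaf = leaf_cost(node)
    if node.left is None and node.right is None:
        return cost_as_leaf
    cost_as_split = prune(node.left, leaf_cost) + prune(node.right, leaf_cost)
    if cost_as_leaf <= cost_as_split:
        node.left = node.right = None       # collapse to a leaf
        node.split_dim = None
        return cost_as_leaf
    return cost_as_split
```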

4 Let FR or CY, with E defined accordingly. Define the data-dependent penalty > - / E M M (2) Our first main result is the following uniform error bound. Theorem 1. With probability at least ; M, * > * > > for all > '. (3) Traditional error bounds for trees involve a penalty proportional to (> (, where (> ( denotes the number of leaves in > (see [11] or the naive bound in [2]). The penalty in (2) assigns a different weight to each leaf of the tree depending on both the depth of the leaf and the fraction of data reaching the leaf. Indeed, for very deep leaves, little data will reach those nodes, and such leaves will contribute very little to the overall penalty. For example, we may bound by with high probability, and if has a bounded density, then decays like MNO, where K is the depth of E. Thus, as K increases, E grows additively with K, but decays at a multiplicative rate. The upshot is that the penalty > favors unbalanced trees. Intuitively, if two trees have the same size and empirical error, minimizing the penalized empirical risk with this new penalty will select the tree that is more unbalanced, whereas a traditional penalty based only on tree size would not distinguish the two trees. This has advantages for classification because unbalanced trees are what we expect to do a good job of approximating a lower dimensional decision boundary. The derivation of (2) comes from applying standard concentration inequalities (for sums of Bernoulli trials) in a spatially decomposed manner. Spatial decomposition allows the introduction of local probabilities to offset the complexity of each leaf node E. Our analysis is inspired by the work of Mansour and McAllester [2]. The uniform error bound of Theorem 1 can be converted (using standard techniques) into an oracle inequality that is the key to deriving rates of convergence for DDTs. Theorem 2. Let FR or CY, and let > be as in (1) with as in (2). Define Then > - / E M 9 : * ; > * + * > ; * + > (4) Note that with high probability, is a non-random upper bound on, and therefore upper bounds. The use of instead of in the oracle bound facilitates rate of convergence analysis. The oracle inequality essentially says that > performs nearly as well as the DDT chosen by an oracle to minimize * > ; * +. The right hand side of (4) bears the interpretation of a decomposition into approximation error (* > ; * +) and estimation error >. We also remark that the remainder term can be improved to using relative VC bounds, although this improvement is not needed here. 4 Rates of Convergence The classes of distributions we study are motivated by the work of Mammen and Tsybakov [4] and Tsybakov [1] which we now review. The classes are indexed by the smoothness of the Bayes decision boundary 4

5 7 8+ and a parameter that quantifies how noisy the distribution is near We write when and if both and. Let, and take ; to be the largest integer not exceeding. Suppose " N # is times differentiable, and let denote the Taylor polynomial of order at the point?. For a constant, define, the class of functions with Hölder regularity, to be the collection of all such that (? ;? ( (? ;? ( for all?? ' N. Using Tsybakov s terminology, the Bayes decision set 8 + is a boundary fragment of smoothness if 8 + epi ' for some. Here epi '? "? is the epigraph of. In other words, for a boundary fragment, the last coordinate of 7 8+ is a Hölder- function of the first A ; coordinates. Tsybakov also introduces a margin assumption (not to be confused with the data-dependent notion of margin) that characterizes the level of noise near in terms of a noise exponent <. Let!!2 % ' "! % )!2 %. A distribution satisfies the margin assumption with noise exponent if, for some constants 2 =2, 6!! + 2 *! ; * + 1 for all! '! + =2 Here! + =2! " 6!! + =2. The case is the low noise case and corresponds to a jump of 3 % at the Bayes decision boundary. The case < is the high noise case and imposes no constraint on the distribution (provided 2 ). See [6] for further discussion. Define the class 2 =2 to be the collection of distributions of such that 0A For all measurable E, 6 E L E 1A 8 + is a boundary fragment with smoothness 2A The margin condition is satisfied with noise exponent and constants 2 =2. Introducing the parameter A ;, Tsybakov [1] proved the lower bound : ; $ : 9 *! * + N 1-2 N (5) Mammen and Tsybakov [4] demonstrate that empirical risk minimization (ERM) over yields a classifier achieving the minimax rate when. Tsybakov [1] shows that ERM over a suitable =-net also achieves the minimax rate for. Tsybakov and van de Geer [5] propose a minimum penalized empirical risk classifier that achieves the minimax rate for all. In all of these works, no computationally efficient algorithms are given for constructing these classifiers. It is important to understand how the minimax rate changes for different smoothness and noise parameters. Clearly, for fixed, the rate tends to N 12 as # <. If, the lower bound in (5) increases as # <, and the easiest case, meaning the smallest lower bound, occurs when. In contrast, if, the lower bound decreases as # <, and the easiest case occurs when <. Thus, when the smoothness exceeds A ;, convergence to the Bayes error is faster when there is less noise near the decision boundary. When A ;, on the other hand, high noise actually increases the rate. Of course, high noise does not make learning easier in an absolute sense, only relative to the hardest problem in that class. In light of the above, care must be taken when speaking about the minimax rate when. From the definition, for any. Hence the minimax rate for can be no faster 5

6 than N 1 -, the minimax rate for. To allow the possibility of a classifier actually achieving rates faster than N 1 - when, clearly an additional assumption must be made. We propose the following two-sided margin condition. Let <, and let 2 =2. We say the distribution of satisfies the two-sided margin condition with noise exponents if 6!! + ; 1 ' 2 *! * + for all!! + =2 (6) 6 ; 1 '!! + *! * + for all!! + (7) 2 Here! + is the set of all! such that!! + has no (topological) connected component which does not intersect the Bayes decision boundary. Note that Tsybakov and van de Geer [5] make a similar assumption, but with. Theirs is a much more restrictive assumption since it requires 3 % to behave very uniformly in all regions of space near the Bayes decision boundary. We also note that! + is not needed in their definition because they are assuming 8+ is a boundary fragment, a condition we will be relaxing. Observe that (6) entails the original margin condition for, while the second excludes distributions with margin. In other words, the noise level is not too high and not too low. We write 2B The two-sided margin condition is satisfied with noise parameters and constants 2 =2. It can be shown that the optimal rate under 0A, 1A, and 2B is no faster than N 1-2 N. Although this is not immediately obvious (2B is more restrictive than 2A, so the lower bound could in principle be smaller), it follows from a close inspection of the original derivation of the lower bound (5) given in [4]. There, the discrete hypercube of classifiers used in an Assouad-style argument satisfies not only 2A but also 2B. 5 Adaptive Rates for Dyadic Decision Trees All of our rate of convergence proofs use the oracle inequality in the same basic way. The objective is to find an oracle tree > + ' such that both * > + ; * + and > + decay at the desired rate. This tree is roughly constructed as follows. First form a regular dyadic partition (the exact construction will depend on the specific problem) into cells of sidelength MN, for a certain. Then prune back all cells that do not intersect the Bayes decision boundary. Both approximation and estimation error may now be bounded using the given assumptions and elementary bounding methods. For example, * > + ; * + > +! + (by 2B) L > +! + (by 0A) N (by 1A). This example reveals how the noise exponent enters the picture to affect the approximation error. Clearly other classifiers satisfying similar oracle inequalities will admit a similar analysis. 5.1 Margin Adaptive Classification Dyadic decision trees, selected according to the penalized empirical risk criterion discussed earlier, adapt to the unknown margin to achieve faster rates as stated in Theorem 3 below. For now we focus on distributions with ( A ; ), i.e., Lipschitz decision boundaries (the case is discussed in Section 5.4), and arbitrary two-sided margin parameters. From our previous discussion, the optimal rate for this class is N 1-2 N 2/. We will see that DDTs can adaptively learn at a rate of 1-2 N 2/. In an effort to be more general and practical, we replace the boundary fragment condition 1A with a less restrictive assumption. Tsybakov and van de Geer [5] assume the Bayes decision set 8 + is a boundary 1 If this condition is not imposed, it is always possible, except for trivial distributions, to find an such that (7) is not satisfied, rendering the class degenerate. On the other hand, little is lost by only considering because the margin condition seeks to characterize the noise level near the Bayes decision boundary. 6

7 fragment, meaning it is known a priori that (a) one coordinate of 7 8+ is a function of the others, (b) that coordinate is known, and (c) class 1 corresponds to the region above the boundary. The following condition includes all piecewise Lipschitz decision boundaries, and allows to have arbitrary orientation and 8+ to have multiple connected components. Let denote the regular partition of into hypercubes of sidelength where is a dyadic integer (i.e., a power of 2). A distribution of satisfies the box-counting assumption with constant if 1B For all dyadic integers, intersects at most N of the cells in. Condition 1A ( ) implies 1B, so the minimax rate under 0A, 1B, and 2B is no faster than N 1-2 N 2/. Theorem 3. Let FR or CY, with M. Define > as in (1) with as in (2). If 0A, 1B, and 2B hold, then 9 : * > ; * + N (8) The complexity-regularized DDT is adaptive in the sense that the noise exponents and constants 2 =2 need not be known. > can always be constructed and in opportune circumstances the rate in (8) is achieved. Also notice that in the special case, > achieves the minimax rate (within a log factor). 5.2 When the Data Lie on a Manifold For certain problems it may happen that the feature vectors lie on a manifold in the ambient space. When this happens, dyadic decision trees automatically adapt to achieve faster rates of convergence. To recast assumptions 0A and 1B in terms of a data manifold 2, we again use box-counting ideas. Let and A A. The boundedness and regularity assumptions for a A dimensional manifold are given by 0B For all dyadic integers and all E ', 6 E N. 1C For all dyadic integers, passes through at most N of the hypercubes in. The minimax rate under these assumptions is no faster than N 1. To see this, consider the mapping of features ' to M M '. Then lives on a A dimensional manifold, and clearly there can be no classifier achieving a rate faster than N 1 uniformly over all such, as this would lead to a classifier outperforming the minimax rate for. As the following theorem shows, DDTs can achieve this rate to within a log factor. Theorem 4. Let FR or CY, with M. Define > as in (1) with as in (2). If assumptions 0B and 1C hold, then 9 : * > ; * + N (9) Again, > is adaptive in that it does not require knowledge of the dimension A or the constants. 2 For simplicity, we eliminate the margin assumption here and in subsequent sections, although it could be easily incorporated to yield faster rates. 7
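The box-counting conditions used above (1B for the Bayes decision boundary, 0B and 1C for a data manifold) all control how many cells of the regular dyadic partition a given set can meet at each resolution. The sketch below is a purely illustrative numerical diagnostic of that notion and is not part of the method; the function and the circle example are our own.

```python
import numpy as np

def dyadic_boxes_hit(points, j):
    """Count the cells of the partition of [0,1)^d into hypercubes of
    sidelength 2**(-j) that contain at least one of the given points.
    For points concentrated on a d'-dimensional set, this count typically
    grows like 2**(j * d'), which is the behavior the box-counting
    assumptions formalize."""
    m = 2 ** j
    idx = np.floor(np.clip(points, 0.0, 1.0 - 1e-12) * m).astype(int)
    return len({tuple(row) for row in idx})

# A 1-dimensional curve (a circle) embedded in [0,1]^2 hits on the order
# of 2**j boxes at resolution j, not 4**j as a full-dimensional set would.
theta = np.random.default_rng(0).uniform(0.0, 2.0 * np.pi, 5000)
circle = 0.5 + 0.25 * np.c_[np.cos(theta), np.sin(theta)]
print([dyadic_boxes_hit(circle, j) for j in range(1, 7)])
```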

8 5.3 Irrelevant Features Whether the favorable conditions characterizing the distribution take the form of margin or manifold constraints, dyadic decision trees based on either cyclic or free dyadic partitions achieve the same rates of convergence. On the other hand, when some of the features are irrelevant (having no discriminatory information), cyclic DDTs clearly seem limited. In this case, the relevant data dimension is some number A A, where A is the number of relevant features, and we might hope to achieve a faster rate on the order of N 1. This can in fact be seen to be the minimax rate by an argument similar to the one in the previous section. As the following theorem indicates, free DDTs achieve this rate to within a log factor. Theorem 5. Let FR with M. Define > as in (1) with as in (2). If all but A of the features are independent of, and 0A and 1B hold, then 9 : * > ; * + N (10) As in the previous theorems, the classifier is adaptive in the sense that it does not need to be told A or which A features are relevant. 5.4 Adapting to Bayes Decision Boundary Smoothness Thus far in our discussion we have assumed has smoothness. In [10] we show that DDTs with polynomial classifiers decorating the leaves can achieve faster rates for. Combined with the margin analysis here, these rates can approach N. Unfortunately, these rates are suboptimal and the classifiers are not practical. For, free DDTs can adaptively attain the minimax rate (within a log factor). Our results apply to distributions such that 1D One coordinate of is a Hölder- function of the others. It is possible to relax this condition to more general 7 8+ (piecewise Hölder boundaries with multiple connected components) using, e.g., box-counting ideas, although we do not pursue this here. Even without this generalization, when compared to Tsybakov and van de Geer [5] free DDTs have the advantage (in addition to being implementable) that it is not necessary to know the orientation of 7 8+ (which coordinate is a function of the others), or which side of corresponds to class 1. Theorem 6. Let FR with M. Define > as in (1) with as in (2). If assumptions 0A and 1D (with ) hold, then 9 : * > ; * + N (11) The problem of finding practical classifiers that adapt to the optimal rate for is an open problem that we are currently pursuing. 6 Conclusion Dyadic decision trees adapt to Tsybakov's noise condition, data manifold dimension, relevant features, and (partially) Bayes decision boundary smoothness to achieve minimax optimal rates of convergence (to within a log factor). Although we consider each condition separately so as to simplify the discussion, it should not be too difficult to combine results to yield a rate of 1-2 N where A+ ; and A+ is the dimension of the manifold supporting the relevant features. 8
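Before the proofs, it may help to see the oracle-tree construction that Section 5 outlines and that the proofs of Theorems 3-5 reuse: refine a cell, cycling through the coordinates, only while it meets the Bayes decision boundary and the target resolution has not been reached. The sketch below is our own rendering of that recipe; it reuses the DyadicNode class from the Section 2 sketch, and boundary_hits_cell is a hypothetical predicate standing in for the oracle's knowledge of the boundary.

```python
def build_oracle_tree(node, depth, max_depth, d, boundary_hits_cell):
    """Oracle construction used in the rate proofs (a sketch, not the paper's
    code). With max_depth = d * k, every leaf cell has sidelength at least
    2**(-k) in each coordinate; cells away from the boundary are left as
    leaves, which is why the resulting tree is highly unbalanced."""
    if depth == max_depth or not boundary_hits_cell(node.cell):
        node.label = 0   # in the proofs, leaf labels are chosen to minimize risk
        return
    left, right = node.split(depth % d)   # cyclic split rule
    build_oracle_tree(left, depth + 1, max_depth, d, boundary_hits_cell)
    build_oracle_tree(right, depth + 1, max_depth, d, boundary_hits_cell)
```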

9 7 Proofs Lemma 1. Let FR or CY and define E 2 K K ; (if CY ) or E 2 K K ; 2 A (if FR ). Then MN - / (12) Proof. E may be thought of as a codelength in a prefix code for. Suppose E corresponds to a node having depth K. Then 2 K bits are needed to encode K, K ; bits are needed to encode whether the K ; ancestors branch left or right, and, in the case of free DDTs, an additional K ; 2 A bits are needed to encode which axes are being split at the ancestors. This leads to the stated formulas for E which, by the Kraft inequality for uniquely decodable codes, satisfy (12). Our analysis employs the following concentration inequalities. The first is known as a relative Chernoff bound [12], the second is a standard (additive) Chernoff bound, and the last two were proved by Okamoto [13]. Lemma 2. Let be a Bernoulli random variable with, and let be iid realizations. Set. For all = : ; = N 12 (13) : = 2 N (14) : = 2 N (15) : = N (16) Corollary 1. Under the assumptions of the previous lemma : ; M. The main step in proving Theorem 1 is the following result. This is proved by applying (13) with = 2-1/. Proposition 1. Let FR or CY. Let '. With probability at least ;, all > ' simultaneously satisfy * > ; * > - / E M M (17) Proof. Let > '. * > ; * > - ' E > ) ; ' E > ) / ; - / 9

10 where % ' B & ( % ' E > % ) &. For fixed E >, consider the Bernoulli trial which equals 1 if ' and 0 otherwise. By Corollary 1 ; M E M except on a set of probability not exceeding MN - - /. We want this to hold for all. Note that the sets are in 2-to-1 correspondence with cells E ', because each cell could have one of two class labels. Using, the union bound, and applying the same argument for each possible label, we have that (17) holds uniformly except on a set of probability not exceeding MN - - / MN - - / MN - /, where the last step follows from Lemma 1. In order for the above bound to be useful, it must be computable, which means we must eliminate the dependence on. This is accomplished by the following lemma. Lemma 3. Let FR or CY, with E defined accordingly. Then : ' for all E, ; (18) : ' for all E, ; (19) Proof. We prove the second statement. The first follows in like fashion. For fixed E : : E M M : M E M M : E M M MN - /, where the last inequality follows from (15) with = E M M. The result follows by repeating this argument for each E, applying the union bound and Lemma 1. Proof of Theorem 1. Apply (17) (with ), (18), and the union bound. ' Proof of Theorem 2. Let > be the tree minimizing the expression on the right-hand side of (4). By the additive Chernoff bound (14), with probability at least ;, * > * > M (20) Take to be the set of all such that the events in (3), (19), and (20) hold. Then 9 : * > : 9 : * > ( : 9 : * > ( 9 : * > ( Given ', we know * > * > > * > > * > > * > > M where the first inequality follows from (3), the second from (1), the third from (19), and the fourth from (20). The result now follows by subtracting * + from both sides. 10
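The argument above leans on the relative (multiplicative) Chernoff bound: when a cell's probability is small, it is far tighter than an additive bound at the same threshold, which is why deep, lightly populated leaves contribute so little to the penalty. The Monte Carlo sketch below illustrates this with the standard lower-tail form P(p_hat <= (1 - eps) p) <= exp(-n p eps^2 / 2); the exact inequalities and constants used in (13)-(16) are those of [12, 13], and the parameter values here are arbitrary.

```python
import numpy as np

def relative_chernoff_check(n=2000, p=0.02, eps=0.5, trials=20000, seed=0):
    """Compare the empirical lower-tail probability of a Binomial(n, p)
    proportion with the standard relative Chernoff bound and with a
    Hoeffding (additive) bound at the same threshold (1 - eps) * p."""
    rng = np.random.default_rng(seed)
    p_hat = rng.binomial(n, p, size=trials) / n
    empirical = np.mean(p_hat <= (1.0 - eps) * p)
    relative_bound = np.exp(-n * p * eps ** 2 / 2.0)
    additive_bound = np.exp(-2.0 * n * (eps * p) ** 2)  # much looser for small p
    return empirical, relative_bound, additive_bound

print(relative_chernoff_check())
```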

11 Proof of Theorem 3. Since CY FR, it suffices to prove the result for CY. Let M be a dyadic integer,, such that 1-2 N 2/. This is always possible by the assumption M. Recall denotes the partition of into hypercubes of sidelength. Let be the collection of cells in that intersect Take > + to be the smallest element of CY such that I > +. In other words, > + consists of the cells in, together with all cells in CY that are ancestors of nodes in, together with their children. Choose class labels for the leaves of > + such that * > + is minimized. Note that > + has depth A. Our method of proof is to show that, for the given choice of, both * > + ; * + and the desired rate. We begin by bounding the approximation error. Lemma 4. For all, where 2. Proof. From assumption 2B we have * > + ; * + N * > + ; * + 2 > +! + Note that by construction, > + '! +. Continuing, we have > +! + L > +! + L E " E ' L E ( ( N N N N > + decay at where the first inequality follows from 0A and the last inequality from 1B. Next we bound the estimation error. Lemma 5. Proof. We begin with three observations. First, > + 12N M E M M M E M M Second, if E corresponds to a node of depth K K E, then by 0A, L E MNO - /. Third, E 2 K K ; M 2 MA. Combining these, we have > + > + > +, where > MNO - / - / and 2 > - (21) / 11

12 2 We note that > + > +. This follows from 1-2 N 2/, for then N MN MNO - / for all E, since a leaf in > + can have a maximum depth of A. It remains to bound > +. Let O be the number of terminal cells in > + at depth K. By 1B, O 1 M O - N. Then > + O M O 1 - N /MNO M - N MN - N / M - N MN - N / 12 MN 12 M - 12 N M - 12 N 12 N The proof now follows by the oracle inequality and plugging 1-2 N 2/ into the above bounds on approximation and estimation error. Proof of Theorem 4. As in the previous proof, it suffices to consider cyclic DDTs. Again we seek a tree > + ' CY so that the approximation and estimation errors are balanced. Let M be a dyadic integer,, with 1. Let be the collection of cells in that intersect Take > + to be the smallest element of CY such that I > +. In other words, > + consists of the cells in, together with all cells in CY that are ancestors of nodes in, together with their children. Choose class labels for the leaves of > + such that * > + is minimized. Note that > + has depth A. The construction of > + is identical to the previous proof; the difference now is that ( ( is substantially smaller. Lemma 6. For all, * > + ; * + N Proof. We have * > + ; * + > + )! + E " E ' E ( ( N N N N where the third inequality follows from 0B and the last inequality from 1C. Next we bound the estimation error. 12

13 Lemma 7. > + 12 N Proof. If E is a cell at depth K in >, then 1 M O by assumption 0B. Arguing as in the proof of Theorem 3, we have 2 > + > + > +, where > M O 1 - / and 2 2 > is as in (21). Note that > + > +. This follows from 1, for then N MN MNO - / for all E, since a leaf in > + can have a maximum depth of A. It remains to bound > +. Let O be the number of terminal cells in > + at depth K. By 1C, O 1 M O - N. Then > + O M O 1 - N /MN O 1 M - N A M - 12 N M - 12 N 12 N MN - N / The proof now follows by the oracle inequality and plugging 1 into the above bounds on approximation and estimation error. Proof of Theorem 5. Assume without loss of generality that the first A features are relevant and the remaining A ; A are statistically independent of. Before we construct the oracle tree > +, we develop some intuition. If A M and A, then is a vertical line segment. If A and A, then 7 8+ is a plane orthogonal to the axis. If A and A M, then is a vertical sheet over a Lipschitz curve in the 2 plane. In general, 7 8+ is the Cartesian product of a Lipschitz curve in the with N. We may translate this intuition into the following lemma. Lemma 8. Let be a dyadic integer, and consider the partition of into hypercubes of sidelength. Then the projection of 7 8+ onto intersects at most N of those hypercubes. Proof. If not, then 7 8+ intersects more than N members of in, in violation of the box-counting assumption. Now construct the tree > + as follows. Let M be a dyadic integer,, with 1. Let be the partition of obtained by splitting the first A features uniformly into cells of sidelength. Let ' be the collection of cells in that intersect the Bayes decision boundary. Let > INIT + FR be the free DDT formed by splitting cyclically through the first A features until all leaf nodes have a depth of 13

14 A 2. Take > + to be the smallest pruned subtree of > + INIT such that I > +. Choose class labels for the leaves of > + such that * > + is minimized. Note that > + has depth A. Lemma 9. For all, * > + ; * + N Proof. We have * > + ; * + > + )! + E " E ' E ( ( N N N N where the third inequality follows from 0B and the last inequality from Lemma 8. The remainder of the proof follows in a manner entirely analogous to our previous proofs, where now O M O 1 - N /. Proof of Theorem 6. Assume without loss of generality that the last coordinate of 7 8+ is a function of the others. Let M be a dyadic integer,, with 1- N. Let M be the largest dyadic integer not exceeding. Note that. Construct the tree > + as follows. First, cycle through the first A ; coordinates ; times. Then, cycle through all A coordinates times. Call this tree > INIT +. Finally, form > + by pruning back all cells in > INIT + whose parents do not intersect the Bayes decision boundary. Note that > + has depth ; A ; A. Lemma 10. Let denote the number of nodes in > + at depth K. Then O K ; A ; O M - N /- N K ; A ; ; A where A ; 12 M and in the second case and A. Proof. In the first case the result is obvious by construction of > + and the assumption that one coordinate of 7 8+ is a function of the others. For the second case, let K ; A ; ; A for some and. Define K to be the partition of formed by the set of cells in > INIT + having depth K. Define K to be the set of cells in K that intersect the Bayes decision boundary. Observe that leaves of > + at depth K are in one-to-one correspondence with cells in K. If K is the maximum depth, then the two sets of cells are the same. If K, then the leaves of > + at depth K are precisely the siblings of the cells in K. From the obvious fact ( K ( ( K (, we conclude O ( K ( ( ; A ; A ( Thus it remains to show ( ; A ; A ( M - N /- N ;. Each cell in A ; A is a rectangle of the form, where N M - N /- N is a hypercube of sidelength MN - N /, and M is an interval of length MN. For each M - N /- N, set ' ; A ; A ( M. 14

15 The proof will be complete if we can show ( ( A ; 12 M, for then ( ; A ; A ( 2 ( ( M - N /- N as desired. To prove this fact, recall 7 8+? ' (? for some function " N # satisfying ( ;?? ( ; ' (?? ( for all?? N. Therefore, the value of on a single hypercube can vary by no more than A ; MN - N /. Here we use the fact that the maximum distance between points in is A ; MN - N /. Since each interval has length MN This completes the proof. ( ( A ; 12 MN - N / MN M ; A 12 MN - N N / M A ; 12 MN - N N / M A ; 12 MN - N /- N / M A ; 12 M The following lemma bounds the approximation error. Lemma 11. For all, where M A ; 12 M. * > + ; * + N Proof. Recall > + has depth ; A ; A, and define K as in the proof of the Lemma 10. * > + ; ) * + > +! + E " E ' - / E - / L E By construction, L E MN - N / N. Noting that MN MN MN, we have L E M MN - N / M N - N. Thus * > + ; * + M N - N M - N / N - N N N - N N The estimation error decays as follows. 15

16 Lemma 12. > + - N N 12 The proof of this lemma follows by Lemma 10 and techniques used in previous proofs. The proof of the theorem follows by the oracle inequality and plugging 1- N into the above bounds on approximation and estimation error. References [1] A. B. Tsybakov, Optimal aggregation of classifiers in statistical learning, Ann. Stat., vol. 32, no. 1, pp , [2] Y. Mansour and D. McAllester, Generalization bounds for decision trees, in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, N. Cesa-Bianchi and S. Goldman, Eds., Palo Alto, CA, 2000, pp [3] Y. Yang, Minimax nonparametric classification Part I: Rates of convergence, IEEE Trans. Inform. Theory, vol. 45, no. 7, pp , [4] E. Mammen and A. B. Tsybakov, Smooth discrimination analysis, Ann. Stat., vol. 27, pp , [5] A. B. Tsybakov and S. A. van de Geer, Square root penalty: adaptation to the margin in classification and in edge estimation, preprint, [Online]. Available: [6] P. Bartlett, M. Jordan, and J. McAuliffe, Convexity, classification, and risk bounds, Department of Statistics, U.C. Berkeley, Tech. Rep. 638, [7] G. Blanchard, G. Lugosi, and N. Vayatis, On the rate of convergence of regularized boosting classifiers, J. Machine Learning Research, vol. 4, pp , [8] J. Scovel and I. Steinwart, Fast rates for support vector machines, Los Alamos National Laboratory, Tech. Rep. LA-UR , [9] G. Blanchard, C. Schäfer, and Y. Rozenholc, Oracle bounds and exact algorithm for dyadic classification trees, preprint, [Online]. Available: pdf [10] C. Scott and R. Nowak, Near-minimax optimal classification with dyadic classification trees, in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. Cambridge, MA: MIT Press, [11] A. Nobel, Analysis of a complexity based pruning scheme for classification trees, IEEE Trans. Inform. Theory, vol. 48, no. 8, pp , [12] T. Hagerup and C. Rüb, A guided tour of Chernoff bounds, Inform. Process. Lett., vol. 33, no. 6, pp , [13] M. Okamoto, Some inequalities relating to the partial sum of binomial probabilities, Annals of the Institute of Statistical Mathematics, vol. 10, pp ,


More information

Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

More information

Support Vector Machines. Machine Learning Fall 2017

Support Vector Machines. Machine Learning Fall 2017 Support Vector Machines Machine Learning Fall 2017 1 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost 2 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost Produce

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Properties and Classification of the Wheels of the OLS Polytope.

Properties and Classification of the Wheels of the OLS Polytope. Properties and Classification of the Wheels of the OLS Polytope. G. Appa 1, D. Magos 2, I. Mourtos 1 1 Operational Research Department, London School of Economics. email: {g.appa, j.mourtos}@lse.ac.uk

More information

Rademacher Complexity Bounds for Non-I.I.D. Processes

Rademacher Complexity Bounds for Non-I.I.D. Processes Rademacher Complexity Bounds for Non-I.I.D. Processes Mehryar Mohri Courant Institute of Mathematical ciences and Google Research 5 Mercer treet New York, NY 00 mohri@cims.nyu.edu Afshin Rostamizadeh Department

More information

10-704: Information Processing and Learning Fall Lecture 10: Oct 3

10-704: Information Processing and Learning Fall Lecture 10: Oct 3 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 0: Oct 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy of

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Notes on the Dual Ramsey Theorem

Notes on the Dual Ramsey Theorem Notes on the Dual Ramsey Theorem Reed Solomon July 29, 2010 1 Partitions and infinite variable words The goal of these notes is to give a proof of the Dual Ramsey Theorem. This theorem was first proved

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Mihai Pǎtraşcu, MIT, web.mit.edu/ mip/www/ Index terms: partial-sums problem, prefix sums, dynamic lower bounds Synonyms: dynamic trees 1

More information

THE information capacity is one of the most important

THE information capacity is one of the most important 256 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 1, JANUARY 1998 Capacity of Two-Layer Feedforward Neural Networks with Binary Weights Chuanyi Ji, Member, IEEE, Demetri Psaltis, Senior Member,

More information

PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers

PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers PAC-Bayes Ris Bounds for Sample-Compressed Gibbs Classifiers François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Département d informatique et de génie logiciel,

More information

Multiscale Density Estimation

Multiscale Density Estimation Multiscale Density Estimation 1 R. M. Willett, Student Member, IEEE, and R. D. Nowak, Member, IEEE July 4, 2003 Abstract The nonparametric density estimation method proposed in this paper is computationally

More information

A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones

A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones A Noisy-Influence Regularity Lemma for Boolean Functions Chris Jones Abstract We present a regularity lemma for Boolean functions f : {, } n {, } based on noisy influence, a measure of how locally correlated

More information

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees Francesc Rosselló 1, Gabriel Valiente 2 1 Department of Mathematics and Computer Science, Research Institute

More information

A practical theory for designing very deep convolutional neural networks. Xudong Cao.

A practical theory for designing very deep convolutional neural networks. Xudong Cao. A practical theory for designing very deep convolutional neural networks Xudong Cao notcxd@gmail.com Abstract Going deep is essential for deep learning. However it is not easy, there are many ways of going

More information

CS264: Beyond Worst-Case Analysis Lecture #18: Smoothed Complexity and Pseudopolynomial-Time Algorithms

CS264: Beyond Worst-Case Analysis Lecture #18: Smoothed Complexity and Pseudopolynomial-Time Algorithms CS264: Beyond Worst-Case Analysis Lecture #18: Smoothed Complexity and Pseudopolynomial-Time Algorithms Tim Roughgarden March 9, 2017 1 Preamble Our first lecture on smoothed analysis sought a better theoretical

More information

Generalization and Overfitting

Generalization and Overfitting Generalization and Overfitting Model Selection Maria-Florina (Nina) Balcan February 24th, 2016 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

Tree sets. Reinhard Diestel

Tree sets. Reinhard Diestel 1 Tree sets Reinhard Diestel Abstract We study an abstract notion of tree structure which generalizes treedecompositions of graphs and matroids. Unlike tree-decompositions, which are too closely linked

More information

Dominating Set Counting in Graph Classes

Dominating Set Counting in Graph Classes Dominating Set Counting in Graph Classes Shuji Kijima 1, Yoshio Okamoto 2, and Takeaki Uno 3 1 Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan kijima@inf.kyushu-u.ac.jp

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu February

More information