Lecture 11: Decision Trees


ECE901 Spring 2007 Statistical Learning Theory
Instructor: R. Nowak

1 Minimum Complexity Penalized Function

Recall the basic results of the last lectures: let $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces respectively. Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be random variables with unknown joint probability distribution $P_{XY}$. We would like to use $X$ to predict $Y$. Consider a loss function $\ell(y_1, y_2)$, $y_1, y_2 \in \mathcal{Y}$. This function is used to measure the accuracy of our prediction. Let $\mathcal{F}$ be a collection of candidate functions (models), $f : \mathcal{X} \to \mathcal{Y}$. The expected risk we incur is given by $R(f) = E_{XY}[\ell(f(X), Y)]$. We have access only to a number of i.i.d. samples, $\{X_i, Y_i\}_{i=1}^n$. These allow us to compute the empirical risk $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)$.

Assume in the following that $\mathcal{F}$ is countable. Assign a positive number $c(f)$ to each $f \in \mathcal{F}$ such that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. If we use a prefix code to describe each element of $\mathcal{F}$ and define $c(f)$ to be the codeword length (in bits) for each $f \in \mathcal{F}$, the last inequality is automatically satisfied.

We define the minimum complexity penalized estimator as

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\}.$$

As we showed previously, we have the bound

$$E[R(\hat{f}_n)] \le \min_{f \in \mathcal{F}} \left\{ R(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$

The performance (risk) of $\hat{f}_n$ is on average better than

$$R(\tilde{f}_n) + \sqrt{\frac{c(\tilde{f}_n)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}, \quad \text{where} \quad \tilde{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ R(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\}.$$

If it happens that the optimal function, that is $f^* = \arg\min_{f \text{ measurable}} R(f)$, is close to an $f \in \mathcal{F}$ with a small $c(f)$, then $\hat{f}_n$ will perform almost as well as the optimal function.

Example 1 Suppose $f^* \in \mathcal{F}$; then

$$E[R(\hat{f}_n)] \le R(f^*) + \sqrt{\frac{c(f^*)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}.$$
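To make the estimator concrete, here is a small Python sketch (not from the lecture; the threshold model class, the noise level, and the equal-length codewords are all hypothetical choices) that minimizes the penalized empirical risk over a finite class of threshold classifiers.

```python
import numpy as np

# Hypothetical illustration of the minimum complexity penalized estimator:
# F is a finite class of threshold classifiers f_t(x) = 1{x >= t} on [0, 1],
# described with equal-length codewords, so c(f) = log2(|F|) bits for every f
# (then sum_f 2^{-c(f)} = 1, satisfying the Kraft condition).
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0, 1, n)
flips = (rng.uniform(0, 1, n) < 0.1).astype(int)     # 10% label noise
Y = (X >= 0.3).astype(int) ^ flips                   # true boundary at t = 0.3

thresholds = np.linspace(0, 1, 33)                   # the candidate models
c = np.full(thresholds.size, np.log2(thresholds.size))

def empirical_risk(t):
    """0-1 empirical risk of the threshold classifier f_t(x) = 1{x >= t}."""
    return np.mean((X >= t).astype(int) != Y)

# Penalized criterion: R_n(f) + sqrt((c(f) log 2 + (1/2) log n) / (2n)).
penalty = np.sqrt((c * np.log(2) + 0.5 * np.log(n)) / (2 * n))
scores = np.array([empirical_risk(t) for t in thresholds]) + penalty
print(f"selected threshold: {thresholds[np.argmin(scores)]:.3f}")
```

Here the penalty is constant across the class, so the criterion reduces to empirical risk minimization; the penalty only matters once models of different description lengths compete, as in the tree classes below.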

Furthermore, if $c(f^*) = O(\log n)$ then

$$E[R(\hat{f}_n)] \le R(f^*) + O\left(\sqrt{\frac{\log n}{n}}\right),$$

that is, only within a small $O\left(\sqrt{\frac{\log n}{n}}\right)$ offset of the optimal risk.

In general, we can also bound the excess risk $E[R(\hat{f}_n)] - R^*$, where $R^*$ is the Bayes risk, $R^* = \inf_{f \text{ measurable}} R(f)$. By subtracting $R^*$ (a constant) from both sides of the inequality

$$E[R(\hat{f}_n)] \le \min_{f \in \mathcal{F}} \left\{ R(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}$$

we obtain

$$E[R(\hat{f}_n)] - R^* \le \min_{f \in \mathcal{F}} \left\{ R(f) - R^* + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$

Note the two terms in this upper bound: $R(f) - R^*$ is a bound on the approximation error of a model $f$, and the remainder is a bound on the estimation error associated with $f$. Thus, we see that complexity regularization automatically optimizes a balance between approximation and estimation errors. In other words, complexity regularization is adaptive to the unknown tradeoff between approximation and estimation.

2 Classification

Consider the particularization of the above to a classification scenario. Let $\mathcal{X} = [0,1]^d$, $\mathcal{Y} = \{0,1\}$ and $\ell(\hat{y}, y) = 1\{\hat{y} \neq y\}$. Then $R(f) = E_{XY}[1\{f(X) \neq Y\}] = P(f(X) \neq Y)$. The Bayes risk is given by $R^* = \inf_{f \text{ measurable}} R(f)$. As was observed before, the Bayes classifier (i.e., a classifier that achieves the Bayes risk) is given by

$$f^*(x) = \begin{cases} 1, & P(Y=1 \mid X=x) \ge 1/2 \\ 0, & P(Y=1 \mid X=x) < 1/2. \end{cases}$$

This classifier can be expressed in a different way. Consider the set $G^* = \{x : P(Y=1 \mid X=x) \ge 1/2\}$. The Bayes classifier can be written as $f^*(x) = 1\{x \in G^*\}$. Therefore the classifier is characterized entirely by the set $G^*$: if $X \in G^*$ then the best guess is that $Y$ is one, and vice-versa. The boundary of this set corresponds to the points where the decision is harder. The boundary of $G^*$ is called the Bayes Decision Boundary. In Figure 1(a) this concept is illustrated. If $\eta(x) = P(Y=1 \mid X=x)$ is a continuous function then the Bayes decision boundary is simply given by $\{x : P(Y=1 \mid X=x) = 1/2\}$. Clearly the structure of the decision boundary provides important information on the difficulty of the problem.

2.1 Empirical Classifier Design

Given $n$ i.i.d. training pairs, $\{X_i, Y_i\}_{i=1}^n$, we want to construct a classifier $\hat{f}_n$ that performs well on average, i.e., we want $E[R(\hat{f}_n)]$ as close to $R^*$ as possible. In Figure 1(b) an example of the i.i.d. training pairs is depicted.
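As a concrete illustration (a sketch with a made-up $\eta$, not from the lecture), the following snippet evaluates the Bayes classifier $1\{\eta(x) \ge 1/2\}$ and a Monte Carlo estimate of the Bayes risk $R^* = E[\min(\eta(X), 1-\eta(X))]$, assuming $X$ uniform on the unit square.

```python
import numpy as np

# Hypothetical eta(x) = P(Y=1 | X=x) on the unit square; the Bayes classifier
# predicts 1 exactly on G* = {x : eta(x) >= 1/2}, whose boundary (the 1/2-level
# set of eta) is the Bayes decision boundary.
rng = np.random.default_rng(0)

def eta(x1, x2):
    # smooth toy conditional probability; its 1/2-level set is a wavy curve
    return 1.0 / (1.0 + np.exp(-8.0 * (x2 - 0.5 - 0.2 * np.sin(2 * np.pi * x1))))

def bayes_classifier(x1, x2):
    return (eta(x1, x2) >= 0.5).astype(int)

# Monte Carlo estimate of R* = E[min(eta(X), 1 - eta(X))] for X ~ Uniform([0,1]^2).
x1, x2 = rng.uniform(0, 1, 100_000), rng.uniform(0, 1, 100_000)
bayes_risk = np.mean(np.minimum(eta(x1, x2), 1 - eta(x1, x2)))
print(f"estimate of R*: {bayes_risk:.4f}")
print(f"fraction of the square in G*: {bayes_classifier(x1, x2).mean():.3f}")
```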

Figure 1: (a) The Bayes classifier and the Bayes decision boundary; (b) Example of the i.i.d. training pairs.

The construction of a classifier boils down to the estimation of the Bayes decision boundary. The histogram rule, discussed in a previous lecture, approaches the problem by subdividing the feature space into small boxes and taking a majority vote of the training data in each box. A typical result is depicted in Figure 2(a).

The main problem with the histogram rule is that it is solving a more complicated problem than is actually necessary. We do not need to determine the correct label for each individual box directly (the histogram rule is essentially estimating $\eta(x)$). In principle we only need to locate the decision boundary and assign the correct label on either side (notice that the accuracy of a majority vote over a region increases with the size of the region). The next example illustrates this.

Example 2 (Three Different Classifiers) The pictures in Figure 2 correspond to the approximation of the Bayes classifier by three different classifiers: a histogram classifier, a linear classifier, and a tree classifier.

Figure 2: (a) Histogram classifier; (b) Linear classifier; (c) Decision tree.

The linear classifier and the tree classifier (to be defined formally later) both attack the problem of finding the boundary more directly than the histogram classifier, and therefore they tend to produce much better results in theory and practice. In the following we will demonstrate this for decision trees.

3 Decision Trees

Decision trees are constructed by a two-step process:

1. Tree growing
2. Tree pruning

The basic idea is to first grow a very large, complicated tree classifier that explains the training data very accurately, but has poor generalization characteristics, and then to prune this tree to avoid overfitting.

3.1 Growing Trees

The growing process is based on recursively subdividing the feature space. Usually the subdivisions are splits of existing regions into two smaller regions (i.e., binary splits), and usually the splits are perpendicular to one of the feature axes. An example of such a construction is depicted in Figure 3.

Figure 3: Growing a recursive binary tree ($\mathcal{X} = [0,1]^2$).

Often the splitting process is based on the training data, and is designed to separate data with different labels as much as possible. In such constructions, the splits, and hence the tree structure itself, are data dependent. Alternatively, the splitting and subdivision could be independent of the training data. The latter approach is the one we are going to investigate in detail, and we will consider Dyadic Decision Trees and Recursive Dyadic Partitions (depicted in Figure 4) in particular.

Until now we have been referring to trees, but have not made clear how trees relate to partitions. It turns out that any decision tree can be associated with a partition of the input space $\mathcal{X}$ and vice-versa. In particular, a Recursive Dyadic Partition (RDP) can be associated with a (binary) tree; in fact, this is the most efficient way of describing an RDP. In Figure 4 we illustrate the procedure. Each leaf of the tree corresponds to a cell of the partition. The nodes in the tree correspond to the various partition cells that are generated in the construction of the tree. The orientation of the dyadic split alternates between the levels of the tree (for the example of Figure 4, at the root level the split is done along the horizontal axis, at the level below that the split is done along the vertical axis, and so on). The tree is called dyadic because the splits of cells are always at the midpoint along one coordinate axis, and consequently the sidelengths of all cells are dyadic (i.e., powers of 2).

In the following we are going to consider the 2-dimensional case, but all the results can be easily generalized to the d-dimensional case ($d \ge 2$), provided the dyadic tree construction is defined properly. Consider a recursive dyadic partition of the feature space into $k$ boxes of equal size. Associated with this partition is a tree $T$. Minimizing the empirical risk with respect to this partition produces the histogram classifier with $k$ equal-sized cells. Consider also all the possible partitions corresponding to pruned versions of the tree $T$. Minimizing the empirical risk with respect to those other partitions results in other classifiers (dyadic decision trees) that are fundamentally different than the histogram rule we analyzed earlier.

Figure 4: Example of Recursive Dyadic Partition (RDP) growing ($\mathcal{X} = [0,1]^2$).
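The RDP construction is easy to state in code. The sketch below (illustrative class and function names, not from the lecture) represents an RDP of $[0,1]^2$ as a binary tree: each node stores a rectangular cell, a split bisects the cell at the midpoint of the axis determined by the depth, and the leaves of the tree are exactly the cells of the partition.

```python
# A minimal sketch (hypothetical representation) of a Recursive Dyadic
# Partition of [0,1]^2: splits always bisect a cell at the midpoint, with the
# split axis alternating with depth, so every leaf is a dyadic cell.

class Cell:
    def __init__(self, x0, x1, y0, y1, depth=0):
        self.bounds = (x0, x1, y0, y1)
        self.depth = depth
        self.children = None              # None means this cell is a leaf

    def split(self):
        """Dyadic split: bisect along x at even depths, along y at odd depths."""
        x0, x1, y0, y1 = self.bounds
        d = self.depth + 1
        if self.depth % 2 == 0:           # split perpendicular to the x-axis
            xm = (x0 + x1) / 2
            self.children = [Cell(x0, xm, y0, y1, d), Cell(xm, x1, y0, y1, d)]
        else:                             # split perpendicular to the y-axis
            ym = (y0 + y1) / 2
            self.children = [Cell(x0, x1, y0, ym, d), Cell(x0, x1, ym, y1, d)]
        return self.children

def leaves(cell):
    """The leaves of the tree are exactly the cells of the partition."""
    if cell.children is None:
        return [cell]
    return [leaf for child in cell.children for leaf in leaves(child)]

root = Cell(0, 1, 0, 1)
left, right = root.split()                # first split: along the x-axis
left.split()                              # second level: along the y-axis
print([leaf.bounds for leaf in leaves(root)])
# [(0, 0.5, 0, 0.5), (0, 0.5, 0.5, 1), (0.5, 1, 0, 1)] -- three dyadic cells
```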

3.2 Pruning

Let $\mathcal{F}$ be the collection of all possible dyadic decision trees corresponding to recursive dyadic partitions of the feature space. Each such tree can be prefix encoded with a bit-string proportional to the number of leafs in the tree as follows: encode the structure of the tree in a top-down fashion, (i) assigning a zero at each branch node and a one at each leaf node (terminal node), and (ii) reading the code in a breadth-first fashion, top-down, left-to-right. Figure 5 exemplifies this coding strategy, and a code sketch of the encoder appears at the end of this subsection.

Notice that, since we are considering binary trees, the total number of nodes is twice the number of leafs minus one; that is, if the number of leafs in the tree is $k$ then the number of nodes is $2k-1$. Therefore, to encode a tree with $k$ leafs we need $2k-1$ bits. Since we want to use the partition associated with this tree for classification, we need to assign a decision label (either zero or one) to each leaf. Hence, to encode a decision tree in this fashion we need $3k-1$ bits, where $k$ is the number of leafs: the first $2k-1$ bits of the codeword encode the tree structure, and the remaining $k$ bits encode the classification labels. This is easily shown to be a prefix code, therefore we can use it under our classification scenario.

Figure 5: Illustration of the tree coding technique: example of a tree and corresponding prefix code.

Let

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \sqrt{\frac{(3k(f)-1)\log 2 + \frac{1}{2}\log n}{2n}} \right\},$$

where $k(f)$ denotes the number of leafs of the tree associated with $f$. This optimization can be solved through a bottom-up pruning process (starting from a very large initial tree $T$) in $O(|T|^2)$ operations, where $|T|$ is the number of leafs in the initial tree. The complexity regularization theorem tells us that

$$E[R(\hat{f}_n)] \le \min_{f \in \mathcal{F}} \left\{ R(f) + \sqrt{\frac{(3k(f)-1)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}. \tag{1}$$
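The encoder itself is mechanical. A short sketch (hypothetical nested-tuple representation of labeled trees, not from the lecture) of the breadth-first coding described above:

```python
from collections import deque

# Sketch of the prefix code described above: breadth-first over the tree, emit
# 0 at each branch node and 1 at each leaf, then append one label bit per leaf.
# A tree is a nested tuple: either ('leaf', label) or ('branch', left, right).

def encode(tree):
    structure, labels = [], []
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if node[0] == 'leaf':
            structure.append('1')
            labels.append(str(node[1]))
        else:                      # branch node: enqueue children left-to-right
            structure.append('0')
            queue.extend(node[1:])
    return ''.join(structure + labels)   # (2k-1) structure bits + k label bits

# Example: k = 3 leaves, so (2*3 - 1) + 3 = 8 bits total.
tree = ('branch', ('leaf', 0), ('branch', ('leaf', 1), ('leaf', 0)))
code = encode(tree)
print(code, len(code))   # prints: 01011010 8
```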

4 Comparison between Histogram Classifiers and Classification Trees

In the following we will illustrate the idea behind complexity regularization by applying the basic theorem to histogram classifiers and decision trees (using our setup above). Consider the classification setup described in Section 2, with $\mathcal{X} = [0,1]^2$.

4.1 Histogram Risk Bound

Recall the setup and results of a previous lecture. (The description here is slightly different from the one in the previous lecture.) Let $\mathcal{F}_k^H = \{\text{histogram rules with } k^2 \text{ cells}\}$. Then $|\mathcal{F}_k^H| = 2^{k^2}$. Let $\mathcal{F}^H = \bigcup_{k \ge 1} \mathcal{F}_k^H$. We can encode each element $f$ of $\mathcal{F}^H$ with $c_H(f) = k + k^2$ bits, where the first $k$ bits indicate the smallest $k$ such that $f \in \mathcal{F}_k^H$ and the following $k^2$ bits encode the labels of each bin. This is a prefix encoding of all the elements in $\mathcal{F}^H$.

We define our estimator as $\hat{f}_n^H = \hat{f}_n^{(\hat{k})}$, where

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k^H} \hat{R}_n(f) \quad \text{and} \quad \hat{k} = \arg\min_k \left\{ \hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{(k+k^2)\log 2 + \frac{1}{2}\log n}{2n}} \right\}.$$

Therefore $\hat{f}_n^H$ minimizes

$$\hat{R}_n(f) + \sqrt{\frac{c_H(f)\log 2 + \frac{1}{2}\log n}{2n}}$$

over all $f \in \mathcal{F}^H$. We showed before that

$$E[R(\hat{f}_n^H)] - R^* \le \min_{f \in \mathcal{F}^H} \left\{ R(f) - R^* + \sqrt{\frac{c_H(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$

To proceed with our analysis we need to make some assumptions on the intrinsic difficulty of the problem. We will assume that the Bayes decision boundary is a well-behaved 1-dimensional set, in the sense that it has box-counting dimension one (see Appendix A). This implies that, for a histogram with $k^2$ cells, the Bayes decision boundary intersects less than $Ck$ cells, where $C$ is a constant that does not depend on $k$. Furthermore, we assume that the marginal distribution of $X$ satisfies $P_X(A) \le K|A|$ for any measurable subset $A \subseteq [0,1]^2$. This means that the samples collected do not accumulate anywhere in the unit square.

Under the above assumptions we can conclude that the excess risk of the best histogram is at most the $P_X$-measure of the cells intersected by the boundary:

$$\min_{f \in \mathcal{F}_k^H} R(f) - R^* \le K \cdot Ck \cdot \frac{1}{k^2} = \frac{CK}{k}.$$

Therefore

$$E[R(\hat{f}_n^H)] - R^* \le \frac{CK}{k} + \sqrt{\frac{(k+k^2)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}.$$

We can balance the terms on the right side of the above expression using $k = n^{1/4}$ (for $n$ large); therefore

$$E[R(\hat{f}_n^H)] - R^* = O(n^{-1/4}), \quad \text{as } n \to \infty.$$
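The balancing step can be checked numerically. The sketch below (with hypothetical constants $C = K = 1$, not from the lecture) evaluates the right-hand side of the bound over a grid of $k$ and confirms that the minimizer grows roughly like $n^{1/4}$.

```python
import numpy as np

# Numerical check (with hypothetical constants C = K = 1) that the histogram
# bound CK/k + sqrt(((k + k^2) log 2 + 0.5 log n) / (2n)) is minimized by
# k growing like n^(1/4), matching the O(n^(-1/4)) rate derived above.
C = K = 1.0
for n in [10**3, 10**4, 10**5, 10**6]:
    k = np.arange(1, 201)
    bound = C * K / k + np.sqrt(((k + k**2) * np.log(2) + 0.5 * np.log(n)) / (2 * n))
    k_best = k[np.argmin(bound)]
    print(f"n = {n:>7d}: best k = {k_best:3d}, n^(1/4) = {n**0.25:6.1f}")
```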

4.2 Risk Bounds for Dyadic Decision Trees

Now let's consider the dyadic decision trees, under the assumptions above, and contrast them with the histogram classifier. Let

$$\mathcal{F}_k^T = \{\text{dyadic decision trees with } k \text{ leafs}\}.$$

Let $\mathcal{F}^T = \bigcup_k \mathcal{F}_k^T$. We can prefix encode each element $f$ of $\mathcal{F}^T$ with $c_T(f) = 3k-1$ bits, as described before. Let $\hat{f}_n^T = \hat{f}_n^{(\hat{k})}$, where

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k^T} \hat{R}_n(f) \quad \text{and} \quad \hat{k} = \arg\min_k \left\{ \hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{(3k-1)\log 2 + \frac{1}{2}\log n}{2n}} \right\}.$$

Hence $\hat{f}_n^T$ minimizes

$$\hat{R}_n(f) + \sqrt{\frac{c_T(f)\log 2 + \frac{1}{2}\log n}{2n}}$$

over all $f \in \mathcal{F}^T$. In fact, the optimization can be performed using a simple bottom-up tree-pruning algorithm in $O(n^2)$ time (C. Scott, "Tree pruning with subadditive penalties," IEEE Transactions on Signal Processing, vol. 53, no. 12, pp. 4518-4525, Dec. 2005); a sketch of such a pruning pass, for the simpler additive penalty, appears at the end of this subsection. Moreover,

$$E[R(\hat{f}_n^T)] - R^* \le \min_{f \in \mathcal{F}^T} \left\{ R(f) - R^* + \sqrt{\frac{c_T(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$

If the Bayes decision boundary is a 1-dimensional set, as in Section 4.1, there exists a tree with at most $8Ck$ leafs such that the boundary is contained in at most $Ck$ squares, each of volume $1/k^2$. To see this, start with a tree yielding the histogram partition with $k^2$ boxes (i.e., the tree partitioning the unit square into $k^2$ equal-sized squares), and prune all the nodes that do not intersect the boundary. In Figure 6 we illustrate the procedure. If one carefully bounds the number of leafs required at each level, then it can be shown that the total number of leafs is at most $8Ck$. We conclude then that there exists a tree with at most $8Ck$ leafs that has the same risk as a histogram with $k^2$ cells. Therefore, using equation (1) we have

$$E[R(\hat{f}_n^T)] - R^* \le \frac{CK}{k} + \sqrt{\frac{(3(8Ck)-1)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}.$$

We can balance the terms on the right side of the above expression using $k = n^{1/3}$ (for $n$ large); therefore

$$E[R(\hat{f}_n^T)] - R^* = O(n^{-1/3}), \quad \text{as } n \to \infty.$$
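Exact pruning under the square-root penalty requires the subadditive-penalty algorithm cited above. As a simpler illustration of the bottom-up pass (a sketch with a hypothetical tree encoding, not the lecture's algorithm), the snippet below prunes with an additive penalty $\alpha$ per leaf, the CART-style criterion discussed in the final comments: a branch is collapsed to a single leaf whenever collapsing does not increase the penalized cost.

```python
# Minimal sketch of bottom-up pruning with an *additive* penalty alpha * k.
# Trees are nested tuples: ('leaf', risk) or ('branch', collapsed_risk, L, R),
# where collapsed_risk is the empirical risk of the majority label over the
# branch's whole cell (the risk incurred if the branch is merged into a leaf).

def prune(tree, alpha):
    """Return (pruned_tree, cost) minimizing empirical risk + alpha * #leaves."""
    if tree[0] == 'leaf':
        return tree, tree[1] + alpha
    _, collapsed_risk, left, right = tree
    pruned_left, cost_left = prune(left, alpha)
    pruned_right, cost_right = prune(right, alpha)
    keep_cost = cost_left + cost_right          # keep both (pruned) subtrees
    collapse_cost = collapsed_risk + alpha      # replace the branch by a leaf
    if collapse_cost <= keep_cost:
        return ('leaf', collapsed_risk), collapse_cost
    return ('branch', collapsed_risk, pruned_left, pruned_right), keep_cost

# Example: a larger alpha makes collapsing branches worthwhile.
t = ('branch', 0.4, ('leaf', 0.1), ('branch', 0.25, ('leaf', 0.1), ('leaf', 0.1)))
print(prune(t, alpha=0.01))   # keeps the full tree
print(prune(t, alpha=0.2))    # collapses everything to a single leaf
```

Because the additive penalty decomposes over subtrees, one bottom-up pass suffices; the square-root penalty does not decompose this way, which is why the subadditive-penalty machinery is needed there.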

Figure 6: Illustration of the tree pruning procedure: (a) Histogram classification rule, for a partition with 16 cells, and corresponding binary tree representation (with 16 leafs). (b) Pruned version of the histogram tree, yielding exactly the same classification rule, but now requiring only 6 leafs. (Note: the trees were constructed using the procedure of Figure 4.)

5 Final Comments

Trees generally work much better than histogram classifiers. This is essentially because they provide much more efficient ways of approximating the Bayes decision boundary (as we saw in our example, under reasonable assumptions on the Bayes boundary, a tree encoded with $O(k)$ bits can describe the same classifier as a histogram that requires $O(k^2)$ bits).

The dyadic decision trees studied here are different than classical tree rules, such as CART (Breiman et al., 1984) or C4.5 (Quinlan, 1993). Those techniques select a tree according to

$$\hat{k} = \arg\min_k \left\{ \hat{R}_n(\hat{f}_n^{(k)}) + \alpha k \right\}, \quad \text{for some } \alpha > 0,$$

whereas in the analysis above the penalty was roughly

$$\hat{k} = \arg\min_k \left\{ \hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{\alpha k}{n}} \right\}, \quad \text{for } \alpha \approx 3\log 2.$$

The square root penalty is essential for the risk bound. No such bound exists for CART or C4.5, except under very restrictive assumptions. Moreover, recent experimental work has shown that the square root penalty often performs better in practice.

Finally, recent results show that a slightly tighter bounding procedure for the estimation error can be used to show that dyadic decision trees (with a slightly different pruning procedure) achieve a rate of

$$E[R(\hat{f}_n^T)] - R^* = O(n^{-1/2}), \quad \text{as } n \to \infty,$$

which turns out to be the minimax optimal rate (i.e., under the boundary assumptions above, no method can achieve a faster rate of convergence to the Bayes error).

A Box Counting Dimension

The notion of dimension of a set arises in many aspects of mathematics, and it is particularly relevant to the study of fractals (which, besides some important applications, make really cool t-shirts). The dimension somehow indicates how we should measure the complexity of a set (length, area, volume, etc.). The box-counting dimension is a simple measure of the dimension of a set. The main idea is to cover the set with boxes of sidelength $r$. Let $N(r)$ denote the smallest number of such boxes; then the box counting dimension is defined as

$$\lim_{r \to 0} \frac{\log N(r)}{-\log r}.$$

Although the boxes considered above do not need to be aligned on a rectangular grid (and can in fact overlap), we can usually consider them over a grid and obtain an upper bound on the box-counting dimension. To illustrate the main ideas, let's consider a simple example and connect it to the classification scenario considered before.

Let $f : [0,1] \to [0,1]$ be a Lipschitz function with Lipschitz constant $L$ (i.e., $|f(a) - f(b)| \le L|a-b|$ for all $a, b \in [0,1]$). Define the set $A = \{x = (x_1, x_2) : x_2 = f(x_1)\}$; that is, the set $A$ is the graph of the function $f$. Consider a partition with $k^2$ squared boxes (just like the ones we used in the histograms); then the points in the set $A$ intersect at most $Ck$ boxes, with $C = (2 + L)$ (and also the number of intersected boxes is at least $k$). The sidelength of the boxes is $1/k$, therefore the box-counting dimension of $A$ satisfies

$$\dim_B(A) \le \lim_{1/k \to 0} \frac{\log(Ck)}{-\log(1/k)} = \lim_{k \to \infty} \frac{\log C + \log k}{\log k} = 1.$$

The result above will hold for any "normal" set $A \subseteq [0,1]^2$ that does not occupy any area. For most sets the box-counting dimension is going to be an integer, but for some "weird" sets (called fractal sets) it is not an integer. For example, the Koch curve (see for example IntroToFrac/InitGen/InitGenKoch.html) has box-counting dimension $\log(4)/\log(3) \approx 1.26$. This means that it is not quite as small as a 1-dimensional curve, but not as big as a 2-dimensional set (hence it occupies no area).

To connect these concepts to our classification scenario, consider a simple example. Let $\eta(x) = P(Y=1 \mid X=x)$ and assume $\eta(x)$ has the form

$$\eta(x) = \frac{1}{2} + x_2 - f(x_1), \quad \forall x \equiv (x_1, x_2) \in \mathcal{X}, \tag{2}$$

where $f : [0,1] \to [0,1]$ is Lipschitz with Lipschitz constant $L$. The Bayes classifier is then given by

$$f^*(x) = 1\{\eta(x) \ge 1/2\} \equiv 1\{x_2 \ge f(x_1)\}.$$

This is depicted in Figure 7. Note that this is a special, restricted class of problems: we are considering the subset of all classification problems such that the joint distribution $P_{XY}$ satisfies $P(Y=1 \mid X=x) = 1/2 + x_2 - f(x_1)$ for some function $f$ that is Lipschitz. The Bayes decision boundary is therefore given by $A = \{x = (x_1, x_2) : x_2 = f(x_1)\}$. As we observed before, this set has box-counting dimension 1.
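The box-counting argument can be checked numerically. The following sketch (a toy Lipschitz $f$, not from the lecture) counts the grid cells hit by the graph of $f$ and shows $\log N(1/k)/\log k$ approaching 1, as the calculation above predicts.

```python
import numpy as np

# Numerical sketch (toy function, not from the lecture): estimate the
# box-counting dimension of the graph of a Lipschitz function by counting,
# on a k x k grid, how many cells of sidelength r = 1/k the graph intersects.
# For a 1-dimensional set, log N(1/k) / log k should approach 1.
f = lambda x: 0.5 + 0.25 * np.sin(2 * np.pi * x)    # Lipschitz on [0, 1]

for k in [8, 32, 128, 512]:
    x = np.linspace(0.0, 1.0, 50 * k)               # dense samples of the graph
    i = np.minimum((x * k).astype(int), k - 1)      # column index of each sample
    j = np.minimum((f(x) * k).astype(int), k - 1)   # row index of each sample
    N = len(set(zip(i.tolist(), j.tolist())))       # number of boxes hit
    print(f"k = {k:4d}:  N = {N:5d},  log N / log k = {np.log(N) / np.log(k):.3f}")
```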

Figure 7: Bayes decision boundary for the setup described in Appendix A.
