Bounds on the Moments for an Ensemble of Random Decision Trees

Noname manuscript No. (will be inserted by the editor)

Bounds on the Moments for an Ensemble of Random Decision Trees

Amit Dhurandhar

Received: Sep. 17, 2013 / Revised: Mar. 04, 2014 / Accepted: Jun. 30, 2014

Abstract An ensemble of random decision trees is a popular classification technique, especially known for its ability to scale to large domains. In this paper, we provide an efficient strategy to compute bounds on the moments of the generalization error, computed over all datasets of a particular size drawn from an underlying distribution, for this classification technique. Being able to estimate these moments can help us gain insights into the performance of this model. As we will see in the experimental section, these bounds tend to be significantly tighter than the state-of-the-art Breiman's bounds based on strength and correlation and, hence, more useful in practice.

Keywords bounds, random decision trees, moments

1 Introduction

An ensemble of random decision trees is a widely used classification technique [16, 13, 22, 14], especially known for its ability to scale to large domains. This classification technique was introduced in [13], where the authors portrayed the efficacy of the method on real applications. According to the description in [13], the technique can be viewed as a special case of the random forest algorithm given in [7], where an attribute is chosen uniformly at random from the available list of attributes (i.e., from a uniform distribution over the available attributes) to form a node in the tree. In a random forest, a set of attributes (say m attributes) is chosen uniformly at random as potential candidates for a node and then, based on statistical measures such as information gain (IG), gini gain (GG), etc., an attribute is selected from this set. Hence, an ensemble of random decision trees is the special case of a random forest where m = 1. The primary motivation and the advantage of focusing on this variant is that it is highly scalable, since measures such as IG or GG do not have to be computed at every node in the tree. In addition, the prediction accuracy of this model is shown to be quite good in practice [16, 13, 22, 23].

Amit Dhurandhar, IBM T.J. Watson, adhuran@us.ibm.com

Realizing the widespread utility of an ensemble of random decision trees, it is important that we carefully analyze this model. In previous works [11], the authors provided a moment-based methodology to accurately characterize classification models. In particular, they provided generic expressions to compute the first two moments of the generalization error (i.e., the expected error over the entire input), given an underlying probability distribution. These moments were computed over Z(N), which is the set of all possible classifiers produced by training a given classification algorithm over datasets of size N drawn independently and identically distributed (i.i.d.) from some distribution. The challenge was then to customize these expressions to specific algorithms in order to meticulously study them. In [11, 10, 12], these expressions were customized for the Naive Bayes classifier, the nearest neighbor classifier and (single) random decision trees (not an ensemble). It was seen that such expressions have the potential to provide extremely tight characterizations of the behavior of individual classification algorithms as well as certain model selection techniques (viz. cross-validation, hold-out-set estimation). In addition to being accurate, these moments were also very efficient to compute. In [10], one of the future challenges that was laid out was to develop strategies to efficiently and accurately estimate these moments for an ensemble of random decision trees. The strategies and expressions presented in [10] are practical only for single random decision trees since, as we will later see, a straightforward extension of those ideas to an ensemble leads to highly inefficient characterizations.

Estimating moments to evaluate the quality of classifiers is a standard procedure in machine learning [4, 5], where a dataset is split into multiple training and test sets, and the average error with a confidence interval or variance is reported as a measure of performance. Our methodology is an idealized version of this approach where, rather than reporting the error averaged over a limited number of trials, we want to compute the expected error, and maybe in some cases even the variance, over all possible classifiers that would be produced by training over datasets of size N sampled from the underlying distribution. This, as discussed in previous works [11, 10, 12], is a much more robust evaluation of classification algorithms. Recently, other such moment-based methodologies [1-3] have also gained prominence for parameter estimation of models over traditional maximum likelihood and Bayesian approaches, which are inefficient and have a tendency to get stuck in poor local minima.

It is important to stress the fact that the generic expressions provided in [11] are difficult to customize to specific learning techniques. The customization requires analyzing the inner workings of the techniques, which are unique to every learning algorithm. The situation here is analogous to the PAC-Bayes bounds literature [17, 15, 18], where many works have been published describing a way to compute them for different techniques. Moreover, in all the relevant previous works [11, 10, 12], exact and efficient-to-compute formulations were derived for the respective techniques. For an ensemble of random decision trees though, exact formulations based on extensions of the results in [10] lead to highly inefficient characterizations, which are exponential in the number of trees T. Thus, in this paper, for the first time in this line of work, we describe a novel way to find accurate and efficient-to-compute upper bounds.
Non-trivial and efficient-to-compute upper bounds can be very useful in gaining confidence in a method [6]. In fact, the gain in time complexity is exponential in T. This is a significant improvement given that T is usually in the hundreds. The strategies employed to derive this bound bear little resemblance to previous analysis. Consequently, the contributions in this paper are by no means attainable by incremental extensions of prior works.

Table 1 Notation used in the paper.

Symbol          Meaning
X               Random vector modeling the input.
X               Domain of the random vector X (input space).
Y               Random variable modeling the output.
Y(x)            Random variable modeling the output for input x.
Y               Set of class labels (output space).
N               Sample size.
ζ               Classifier.
GE(ζ)           Generalization error of classifier ζ.
Z(N)            The set of classifiers obtained by application of a classification algorithm to i.i.d. samples of size N.
E_Z(N)[·]       Expectation w.r.t. the space of classifiers built on a sample of size N.
d               Dimensionality of the input space.
h               Height of the trees in the ensemble.
path_p          The p-th possible path amongst the total number of allowed paths given by the tree-building algorithm.
ξ(path_p, y)    Count of the number of samples in the training set that correspond to path_p and class y.
η_x^y           Denotes the condition that class y gets the maximum number of votes in classifying x through the respective paths in the ensemble.
m_h(S)          Number of paths of length h that classify x into class y in the sample S.

To summarize, we provide in this paper a method to efficiently obtain bounds on these moments for an ensemble of random decision trees grown to a specified height. We analyze the fixed-height variant, since the main strength of this technique is being able to create an ensemble with high diversity [13]. This is true for most ensemble techniques, as we want to cover different parts of the hypothesis space. Growing trees too deep limits this and, hence, restricting the height is desirable. Note that the case where trees are grown to their full height is just a special case in this setting and thus our analysis is still applicable. Moreover, our analysis does not place any restriction on the number of classes or the number of splits per attribute and is hence applicable to a large class of random decision tree ensembles. Through experiments on synthetic data and bootstrap distributions built on real data, we show that our (upper) bounds are tighter than the widely used strength and correlation bounds introduced in [7].

The rest of the paper is organized as follows. In Section 2, we first describe the ensemble algorithm and then briefly review the technical framework that our analysis is based on. In Section 3, we discuss the issues with directly extending the ideas used for characterizing single random decision trees to an ensemble. We then provide a solution by adopting a different perspective than the one in [10]. Moreover, as in the previous study, our analysis applies to categorical as well as continuous attributes with split points predetermined for each attribute. In Section 4, we compare the bounds derived from this analysis with Breiman's strength and correlation bounds on synthetic and real datasets. We discuss future lines of research and summarize the major developments in the paper in Section 5.

2 Preliminaries

In this section, we first provide a precise description of the algorithm for random decision trees that we analyze here. We then revisit the generic expressions provided in [11] that need to be customized to specific classification algorithms; in this case, an ensemble

of random decision trees. Finally, we provide the customized expressions for random decision trees (not an ensemble) of prespecified height and discuss the intuitions in their derivation.

Algorithm 1 Procedure to build an ensemble of random decision trees.
Input: T, h, F = {f_1, ..., f_d} {T is the # of trees in the ensemble, h is the height and the f_i's are the d attributes.}
Output: W {The ensemble.}
Set W = ∅ {No trees in the ensemble initially.}
for i = 1; i ≤ T; i++ do
  Randomly pick f_i ∈ F {Pick the root.}
  w = {f_i}, w_l = {f_i} {w denotes the tree built so far and w_l the attributes at the leaves of w.}
  for j = 1; j < h; j++ do
    w_t = ∅
    for f ∈ w_l do
      If f has s splits, then randomly pick s attributes F_s ⊆ F, one for each split, such that each attribute is distinct from its ancestors.
      w_t = w_t ∪ F_s
    end for
    w_l = w_t
    w = w ∪ w_l
  end for
  W = W ∪ w
end for
Return W

2.1 Ensemble of Random Decision Trees

In this paper, we consider the following procedure for building each random decision tree in an ensemble of T trees. The root of a tree is chosen uniformly at random from the available list of attributes. Nodes at the next level in the tree are chosen uniformly at random from a list of attributes that excludes the root. Hence, at each level of the tree a node is picked at random (uniform distribution) from the list of attributes that excludes its parents and ancestors. The tree is grown to a prespecified height. This procedure is clearly elucidated in Algorithm 1. It is assumed that the split points for continuous attributes are predetermined. After T trees have been built using the aforementioned procedure to form an ensemble, the class label of a datapoint is determined by taking a majority vote. The procedure for building random decision trees considered here has been previously mentioned in [13, 10] and is shown to be efficient as well as accurate in practice.
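A minimal Python sketch of Algorithm 1, assuming every attribute has the same number of splits s (e.g., s = 2 for binary splits) and h ≤ d; attribute indices stand in for the f_i's and all names are illustrative rather than part of the paper.

    # Sketch of Algorithm 1: each node's attribute is drawn uniformly at random
    # from the attributes not already used by its ancestors; trees have height h.
    import random

    def build_random_tree(d, h, s):
        """Return one random tree as nested dicts: each node stores its attribute
        index and one child per split; attributes never repeat on a root-to-leaf path."""
        def grow(ancestors, depth):
            if depth == h:
                return None                                   # leaf placeholder
            attr = random.choice([a for a in range(d) if a not in ancestors])
            return {"attr": attr,
                    "children": [grow(ancestors | {attr}, depth + 1)
                                 for _ in range(s)]}
        return grow(set(), 0)

    def build_ensemble(T, d, h, s=2):
        return [build_random_tree(d, h, s) for _ in range(T)]

An input is then classified by routing it down each of the T trees and taking a majority vote, with the class counts at the leaves obtained from the training sample.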

2.2 Generic Expressions

We first introduce some notation that is used primarily in this section. X is a random vector modeling the input, whose domain is denoted by X. Y is a random variable modeling the output, whose domain is denoted by Y (the set of class labels). Y(x) is a random variable modeling the output for input x. ζ represents a particular classifier with its generalization error (GE) denoted by GE(ζ). Formally,

GE(ζ) = E[λ(ζ(X), Y)]    (1)

where λ(·,·) is a zero-one loss function and the expectation is over the distribution on the X × Y space. Z(N) denotes a set of classifiers obtained by application of a classification algorithm to different samples of size N.

The fundamental concept behind the generic expressions is to characterize a class of classifiers induced by a classification algorithm and an i.i.d. sample of a particular size from an underlying distribution. We thus have a distribution over classifiers, with the GE of these classifiers being a random function that has its own distribution. Computing the entire distribution can be highly inefficient, especially for a complicated function such as the GE of classifiers, and hence we resort to efficiently computing the first few moments. With this, we now revisit the expressions for the first two moments around zero of the GE of a classifier,

E_{Z(N)}[GE(ζ)] = ∫_{x∈X} P[X = x] Σ_{y∈Y} P_{Z(N)}[ζ(x) = y | x] P[Y(x) ≠ y | x] dx    (2)

E_{Z(N)×Z(N)}[GE(ζ)GE(ζ')] = ∫_{x∈X} ∫_{x'∈X} P[X = x] P[X = x'] Σ_{y∈Y} Σ_{y'∈Y} P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y' | x, x'] P[Y(x) ≠ y | x] P[Y(x') ≠ y' | x'] dx dx'    (3)

It becomes clear from the above equations that to compute these moments we must be able to characterize the behavior of a classifier on each input independently for the first moment, and on pairs of inputs for the second moment. More specifically, we need to be able to compute P_{Z(N)}[ζ(x) = y] for the first moment and P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'] for the second moment for the classification algorithm under consideration (footnote 1). The other terms in these equations indicate the error of a single classifier, or the errors of two classifiers, derived from the underlying distribution for the first and second moment respectively. This can be better understood from the following simple example. If the class prior for class 1 is p and that for class 2 is 1 − p, then the error of a classifier classifying data into class 1 is 1 − p and the error of a classifier classifying data into class 2 is p.

Footnote 1: These probabilities and P[Y(x) ≠ y] are conditioned on x. We omit explicitly writing the conditional since it improves readability and is obvious from the context.

2.3 Customized Expressions for Random Decision Trees

As we saw in the previous subsection, in order to compute the moments we need to characterize P_{Z(N)}[ζ(x) = y] for the first moment and P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'] for the second moment for specific classification algorithms. To characterize these probabilities we have to mathematically model the manner in which a particular classification algorithm classifies a specific input (or pairs of inputs for the second moment).
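Before specializing to decision trees, the following minimal sketch illustrates how equation (2) turns the per-input probabilities into the first moment over a finite input space; it assumes the probabilities P_{Z(N)}[ζ(x) = y] have already been characterized for the algorithm of interest (they are simply passed in), and all names are illustrative rather than from the paper.

    # Sketch of equation (2) on a finite domain.
    def first_moment_of_GE(p_x, p_y_given_x, p_classify):
        """p_x[x]            : P[X = x]
           p_y_given_x[x][y] : P[Y(x) = y] under the true distribution
           p_classify[x][y]  : P_{Z(N)}[zeta(x) = y] for the learning algorithm
           Returns E_{Z(N)}[GE(zeta)]."""
        moment = 0.0
        for x, px in p_x.items():
            for y, pzy in p_classify[x].items():
                # error incurred when the learned classifier outputs y at x
                moment += px * pzy * (1.0 - p_y_given_x[x][y])
        return moment

The second moment in equation (3) has the same structure over pairs (x, x'), with the joint probability P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'] in place of p_classify.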

In the case of decision trees, a specific input is classified based on the relevant path in the tree from root to leaf, with the majority class being chosen as the class for the input. Thus, P_{Z(N)}[ζ(x) = y] for a decision tree algorithm is given by

P_{Z(N)}[ζ(x) = y] = Σ_p P_{Z(N)}[ξ(path_p, y) > ξ(path_p, y'), path_p exists, ∀ y' ≠ y, y, y' ∈ Y]    (4)

where p indexes all paths allowed by the decision tree algorithm in classifying input x. Inside the summation, the term ξ(path_p, y) is the number (count) of inputs in the path path_p that lie in class y. A similar characterization exists for P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'] where, rather than single paths, in each probability after the sum we have pairs of paths. The allowed paths are governed by the attribute selection method and the stopping criteria of the decision tree algorithm. For example, if we grow random decision trees of height h where there are a total of d attributes, the above probability would be

P_{Z(N)}[ζ(x) = y] = (1 / C(d, h)) Σ_p P_{Z(N)}[ξ(path_p, y) > ξ(path_p, y'), ∀ y' ≠ y, y, y' ∈ Y]    (5)

where C(d, h) denotes the binomial coefficient "d choose h". This is the case since, for any input, there are C(d, h) possible paths and every path is equally likely, i.e., P[path_p exists] = 1/C(d, h) ∀ p. The count of possible paths comes from the fact that, as we build the tree, there are d choices to pick the root, d − 1 choices to pick each of the nodes at the next level, and so on till we reach level h, where we have d − h + 1 choices; since the order of the attributes along a path does not matter for a fixed input, this gives C(d, h) distinct paths, each equally likely. Each of these paths is indexed by p and path_p is the corresponding path. By similar arguments, the probability for the second moment is given by

P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'] = (1 / C(d, h)²) Σ_{p,q} P_{Z(N)×Z(N)}[ξ(path_p, y) > ξ(path_p, y''), ξ(path_q, y') > ξ(path_q, y'''), ∀ y'' ≠ y, y''' ≠ y', y, y', y'', y''' ∈ Y]    (6)

An important thing to notice in the above formulas is that there are two sources of randomness: the first due to the tree building method (i.e., path_p exists) and the second due to the random samples (i.e., comparing counts) that are generated from the underlying distribution. As we will see later, this observation will help us in deriving bounds for the moments of an ensemble of random decision trees.
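The following Monte Carlo sketch illustrates equation (5) for a single random decision tree over discrete attributes. It is a brute-force check rather than the paper's analytical characterization: datasets of size N are sampled from the underlying distribution, and for each of the C(d, h) paths (attribute subsets) the strict-majority condition ξ(path_p, y) > ξ(path_p, y') is tested in the leaf region containing x. The handling of empty leaves (treated as not voting for y) is a simplifying assumption, and all names are illustrative.

    # Monte Carlo estimate of (1/C(d,h)) * sum_p P_{Z(N)}[majority in leaf_p(x) is y].
    import itertools, random
    from collections import Counter

    def prob_tree_classifies(x, y, h, dist, N, n_samples=2000, seed=0):
        rng = random.Random(seed)
        cells = list(dist.keys())                  # (input tuple, label) pairs
        weights = [dist[c] for c in cells]
        subsets = list(itertools.combinations(range(len(x)), h))   # the C(d, h) paths
        hits = [0] * len(subsets)
        for _ in range(n_samples):
            sample = rng.choices(cells, weights=weights, k=N)       # one dataset of size N
            for i, A in enumerate(subsets):
                # leaf region of path A for input x: training points agreeing with x on A
                counts = Counter(lab for (inp, lab) in sample
                                 if all(inp[a] == x[a] for a in A))
                # strict majority for class y, mirroring xi(path_p, y) > xi(path_p, y')
                if counts and all(counts[y] > counts.get(other, 0)
                                  for other in counts if other != y):
                    hits[i] += 1
        return sum(hits) / (len(subsets) * n_samples)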

3 Characterizing an Ensemble of Random Decision Trees

In this section, we first show that naively extending the analysis for (single) random decision trees to an ensemble leads to highly inefficient formulations with no obvious way of deriving reasonable bounds on the moments. We then suggest a solution using the generic expressions that is efficient to compute and gives tighter bounds than directly finding bounds on the moments.

3.1 Straightforward Extension

An ensemble of random decision trees classifies an input into a class based on the classes predicted by the individual trees in the ensemble. In particular, any input is classified into the class that receives the maximum number of votes. To characterize P_{Z(N)}[ζ(x) = y] we have to characterize all possible ways in which class y receives the maximum number of votes in classifying input x. Given that the ensemble has T trees, with p_i indexing all allowed paths for an input x in the i-th tree (i ≤ T), we have

P_{Z(N)}[ζ(x) = y] = (1 / C(d, h)^T) Σ_{p_1,...,p_T} P_{Z(N)}[η_x^y]    (7)

where η_x^y denotes the condition that class y gets the maximum number of votes in classifying x through the respective paths of the T trees. Hence, given a set of paths, P_{Z(N)}[η_x^y] computes the probability of classifying input x into y over all samples of size N from an underlying distribution. η_x^y is characterized by comparing counts of the various classes in the leaves of the trees in the ensemble. This is a generalization of the characterization for single trees shown in equation 5, going from counts being compared in a single leaf to counts being compared in the corresponding leaves of all the trees in the ensemble. This leads to an exponential-in-T number of massive joint probabilities that need to be computed. Given that T is generally in the hundreds, exactly computing the above formula is impractical for even extremely small domains. It is also important to notice that the joint probabilities containing the counts from different trees cannot be factorized as a product of probabilities containing counts of individual trees: although the trees are built independently, the individual tree classifiers are correlated through the sample. There also does not seem to be any obvious way of deriving efficient and tight bounds on these expressions. The same issues are faced when estimating or bounding the probability for pairs of inputs used in the computation of the second moment.

3.2 Main Result

In the previous subsection we observed that a naive extension of the ideas used to characterize single random decision trees leads to highly inefficient formulations for an ensemble of random decision trees. As indicated before, there are two sources of randomness: the first due to the random tree building method and the second due to the random samples from the underlying distribution. To derive the main result in this paper, we characterize the second source of randomness, then fix it and compute probabilities based on the first source of randomness. This strategy leads to bounds that can be computed efficiently. With this we have the following theorem.

Theorem 1 Consider a set of samples drawn from the underlying distribution that have cumulative measure/probability mass α, and let S_max be a sample in this set for which the number of paths of length h (h ≤ d) that classify input x into class y is maximum. Let us denote this maximum number of paths by m_h(S_max). Then, given that there are T trees in the ensemble and q = C(d, h) is the total number of paths of length h, we have

P_{Z(N)}[ζ(x) = y] ≤ β B(⌈T/2⌉, T, m_h(S_max)/q) + (1 − β)

where 0 ≤ β ≤ α ≤ 1 and B(⌈T/2⌉, T, m_h(S)/q) = Σ_{i=⌈T/2⌉}^{T} C(T, i) (m_h(S)/q)^i (1 − m_h(S)/q)^{T−i}.

The bound on the probability for the second moment is analogous to the above bound, with the only difference being that we consider pairs of inputs rather than just single inputs. Before we describe the details of deriving the above bound and techniques to efficiently estimate the necessary parameters (viz. m_h(S_max) and β), we provide a brief overview of how these critical parameters can be estimated.

1. To compute m_h(S_max) for samples of size N with total measure α, find the (worst) sample in this set for which the number of paths of length h that would classify input x into class y is maximum.
2. Given this sample, we can easily check how many of these q paths classify input x into class y. The higher the m_h(S_max), the looser the bound.
3. β can be estimated using Bonferroni's inequality [21] for the given set of samples. The lower the β, the looser the bound.

Notice that there can be many sets of samples of size N with measure α. The best set of samples, i.e., the one that would give the tightest bound, would be the one that has the lowest m_h(S_max). In this paper, we provide a strategy that finds a particular set of samples efficiently, which is not necessarily the best. The set of samples our procedure finds seems reasonable, however, given that our bounds are mostly non-trivial and significantly tighter than Breiman's bounds, as witnessed in the experimental section. Deciphering the best set of samples efficiently is part of future research.
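As a small illustration, the bound of Theorem 1 is cheap to evaluate once m_h(S_max), q = C(d, h), T and β are known (their estimation is the subject of Section 3.3). A minimal sketch, with illustrative names:

    # Sketch of the Theorem 1 bound.
    from math import comb, ceil

    def majority_tail(T, p):
        """B(ceil(T/2), T, p): probability that at least ceil(T/2) of T independent
        trees vote for the class, when each tree does so with probability p."""
        return sum(comb(T, i) * p**i * (1 - p)**(T - i)
                   for i in range(ceil(T / 2), T + 1))

    def theorem1_upper_bound(m_h_smax, q, T, beta):
        p = m_h_smax / q
        return beta * majority_tail(T, p) + (1 - beta)

For instance, with T = 100, q = 10, m_h(S_max) = 3 and β = 0.99, the bound evaluates to roughly 0.01, far below the trivial value of 1.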

3.3 Derivation and Efficient Parameter Estimation of the Bound

We have seen that a straightforward extension of the ideas that were used to characterize single random decision trees cannot be used (due to computational considerations) to characterize an ensemble of random decision trees. We thus have to come up with a different strategy or perspective to tackle this problem. There are three key elements in coming up with this characterization, which are as follows:

1. First, characterize the second source of randomness, i.e., the randomness due to the underlying distribution, and then, conditioning on this source of randomness, characterize the first source of randomness, which is due to the tree building process.
2. Define a parameter 0 ≤ α < 1 that denotes the measure of the samples of size N drawn from the underlying distribution. This parameter affects the tightness of the bounds.
3. Finding a set of samples whose measure is exactly α can be computationally intensive; hence, find samples with measure, say, β, where β ≤ α, which can be done efficiently and still leads to a useful upper bound on the moments.

Fig. 1 The figure shows the key steps in deriving the bound for an ensemble of random decision trees.

These elements are also mentioned in Figure 1, and the details regarding them are given below.

Two Sources of Randomness: As mentioned before, there are two sources of randomness present: the first due to the tree building method (i.e., path_p exists) and the second due to the random samples (i.e., comparing counts) that are generated from the underlying distribution. In the characterization for random decision trees, we observe that we compute the probability for a certain path to exist and then, conditioned on this path, we compute the probability of classifying an input into a particular class over all samples of a particular size. In other words, we characterize the first source of randomness and then fix it in the computation of P_{Z(N)}[ζ(x) = y] and P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y']. However, there is also the other possibility of characterizing the second source of randomness and fixing it in order to characterize these probabilities. In this scenario we would have

P_{Z(N)}[ζ(x) = y] = Σ_{S∈D(N)} P[S] P_t[η_x^y | S]    (8)

where D(N) denotes all datasets of size N from the underlying distribution, and P_t[η_x^y | S] is the probability of classifying input x into class y based on the random tree generation process, denoted by t, given a particular dataset S ∈ D(N). For a given dataset, the number of paths of length h that would classify an input x into class y is the same for all trees in the ensemble. In addition, the trees are built independently.

Hence, if m_h(S) is the number of paths of length h that classify input x into class y given a dataset S, and if T is the total number of trees in the ensemble, then P_t[η_x^y | S] is given by a binomial cumulative distribution function (cdf),

P_t[η_x^y | S] = B(⌈T/2⌉, T, m_h(S)/q)    (9)

where q = C(d, h) is the total number of paths of length h and B(⌈T/2⌉, T, m_h(S)/q) = Σ_{i=⌈T/2⌉}^{T} C(T, i) (m_h(S)/q)^i (1 − m_h(S)/q)^{T−i}.

It is thus easy to see that, given a sample, P_t[η_x^y | S] can be computed efficiently. However, the number of samples will be huge for any reasonable problem size. Hence, directly computing the moments using this formulation also does not seem feasible. In fact, this formulation is infeasible even for single random decision trees. Nonetheless, this formulation does present us with an opportunity to compute bounds on the moments, which we will now derive. The solution we provide is efficient and, as we will see in the experimental section, turns out to be tighter than Breiman's bound κ(1 − s²)/s², where κ is the correlation between the random decision trees in an ensemble and s is the strength of the resultant classifier (footnote 2).

Footnote 2: For further details refer to [7] and [8].

Defining α: One way of upper bounding P_{Z(N)}[ζ(x) = y] is to find the maximum P_t[η_x^y | S] over all samples and then to substitute this value for each of the conditionals in equation 8. The maximum P_t[η_x^y | S] would most likely be 1, since for arbitrary distributions there might exist samples where m_h(S) = q. This would make the bound trivial (i.e., 1), since Σ_{S∈D(N)} P[S] = 1. Hence, rather than choosing all samples, we can choose a fraction of them with measure α to upper bound the probabilities. α = 1 corresponds to choosing all samples, while lower values of α (viz. 0.99, etc.) correspond to choosing a subset of the samples with measure α. For this subset, we can find the sample which has the maximum m_h(S) and, hence, the maximum P_t[η_x^y | S] for that set. In the experimental section, we will see that our method of finding this sample, and consequently the maximum P_t[η_x^y | S] for that set, leads to bounds that are significantly tighter than Breiman's strength and correlation bounds. With this we have the following inequality,

P_{Z(N)}[ζ(x) = y] = Σ_{S∈D(N)} P[S] P_t[η_x^y | S] ≤ α P_t[η_x^y | S_max] + (1 − α)    (10)

S_max is the sample for which P_t[η_x^y | S_max] has the maximum value in that set of samples with total measure α. Note that there can be many sets of samples with total measure α. Ideally, one would want to choose the set for which P_t[η_x^y | S_max] is the least, since this would lead to the tightest possible bound among the alternatives. At this point, it is not clear how this optimal set can be deciphered; however, we provide a solution where we find one of the feasible sets in an efficient manner and, as we will see later, this solution tends to give reasonably good bounds.

As mentioned before, our analysis applies to discrete attributes as well as continuous attributes with prespecified split points.

Given this, any distribution over the data can be represented as a multinomial with parameters N, p_11, p_12, ..., p_nc, as shown in Table 2. Here, N is the sample size and p_11, p_12, ..., p_nc are the probabilities for the corresponding input-output pairs (n distinct inputs and c classes) to occur, with Σ_{i=1}^{n} Σ_{j=1}^{c} p_ij = 1. Notice that bootstrap distributions built by resampling a dataset are subsumed by the given class of distributions. Given such a multinomial distribution, we want to find a set of samples/datasets of size N where the total measure of these datasets is equal to α.

Table 2 Example data distribution.

X, Y    y_1     y_2     ...   y_c
x_1     p_11    p_12    ...   p_1c
x_2     p_21    p_22    ...   p_2c
...     ...     ...     ...   ...
x_n     p_n1    p_n2    ...   p_nc
(sample size N)

Computing β: Attempting to directly find these samples would be computationally intensive, since we would have to sum the measures of each sample in the set in order to verify that the measure is α. However, since we want to find an upper bound on P_{Z(N)}[ζ(x) = y], substituting a lower bound for α in equation 10 would give us an upper bound on P_{Z(N)}[ζ(x) = y], as desired. Finding a lower bound on α can be done efficiently with the help of Bonferroni's inequality [21]. Bonferroni's inequality states that given g random variables Z_1, ..., Z_g and 2g real numbers a_1, ..., a_g, b_1, ..., b_g, the following relationship holds:

P[a_1 ≤ Z_1 ≤ b_1, ..., a_g ≤ Z_g ≤ b_g] ≥ 1 − Σ_{i=1}^{g} (1 − P[a_i ≤ Z_i ≤ b_i])    (11)

Hence, the joint probability of the variables can be lower bounded by a function of the marginals. The reason this result is useful is that one-dimensional confidence regions (i.e., confidence intervals or domains of the marginals) of a particular measure can be efficiently computed, as opposed to simultaneous (multidimensional) confidence regions, which are much harder to compute exactly [19, 20]. One easy way of obtaining confidence intervals for commonly used measures is by using standard formulas. For example, if we want the confidence interval of 0.95 measure for Z_i, we know that an approximation of it is given by µ_i ± 2σ_i, where µ_i and σ_i are the mean and standard deviation of Z_i. Similar formulas are known for other measures and, hence, one can compute confidence intervals for these measures in constant time. If we want the exact confidence interval, we can compute it as a binomial cdf, which can also be done in essentially constant time using series expansions for the incomplete regularized beta function. With this, if β is the measure on the right-hand side of equation 11, then the measure over the simultaneous confidence region, i.e., over the relevant set of datasets for our problem, is at least β.
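A minimal sketch of this β computation, assuming each cell count Z_i of the collapsed table is treated as a Binomial(N, p_i) marginal and its interval [a_i, b_i] has already been chosen (e.g., mean plus or minus a few standard deviations); names are illustrative.

    # Bonferroni lower bound (equation 11) from per-cell binomial marginals.
    from math import comb

    def binom_cdf(k, n, p):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

    def bonferroni_beta(N, cell_probs, intervals):
        """cell_probs[i] = p_i for cell i; intervals[i] = (a_i, b_i) integer limits.
        Returns the lower bound beta on the measure of the set of datasets whose
        cell counts all fall inside their intervals."""
        slack = 0.0
        for p, (a, b) in zip(cell_probs, intervals):
            marginal = binom_cdf(b, N, p) - binom_cdf(a - 1, N, p)   # P[a <= Z_i <= b]
            slack += 1.0 - marginal
        return max(0.0, 1.0 - slack)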

If α is the true measure over the set of datasets, but β ≤ α is what we efficiently compute based on the above inequality, then we have

P_{Z(N)}[ζ(x) = y] = Σ_{S∈D(N)} P[S] P_t[η_x^y | S] ≤ α P_t[η_x^y | S_max] + (1 − α) ≤ β P_t[η_x^y | S_max] + (1 − β)    (12)

In our problem, the Z_i represent the numbers of datapoints corresponding to the input-output pairs where the input has at least h attributes with the same value as the input x_i for which the above probability is to be upper bounded. This is seen in Table 3, where x^{x_i}_j denotes the j-th input which has at least h attributes with the same value as input x_i. If each of the d attributes can take v values (footnote 3), giving a total of n = v^d distinct inputs, then k = Σ_{i=h}^{d} C(d, i) (v − 1)^{d−i} < n, and R^{x_i} denotes the entry in the table for the probability remaining after accounting for the probabilities of each of the kc pairs, i.e., p^{x_i}_R = 1 − Σ_{j=1}^{k} Σ_{l=1}^{c} p^{x_i}_{jl}; N as usual is the sample size. Hence, we have kc degrees of freedom, i.e., g = kc, since the probabilities sum to 1 and the probability in the last cell (p^{x_i}_R) is determined by the other probabilities.

Table 3 Collapsed view of the underlying distribution as required for bounding P_{Z(N)}[ζ(x_i) = y].

X, Y         y_1           y_2           ...   y_c
x^{x_i}_1    p^{x_i}_11    p^{x_i}_12    ...   p^{x_i}_1c
x^{x_i}_2    p^{x_i}_21    p^{x_i}_22    ...   p^{x_i}_2c
...          ...           ...           ...   ...
x^{x_i}_k    p^{x_i}_k1    p^{x_i}_k2    ...   p^{x_i}_kc
R^{x_i}      p^{x_i}_R
(sample size N)

Footnote 3: This is after splitting the continuous attributes.

With this, Z_1 is the random variable corresponding to the count in the cell formed by the intersection of input x^{x_i}_1 and class y_1, Z_2 is the random variable corresponding to the count in the cell formed by the intersection of input x^{x_i}_1 and class y_2, and so on until Z_g, which is the random variable corresponding to the count in the cell formed by the intersection of input x^{x_i}_k and class y_c. Consequently, one way of obtaining a confidence interval of measure 0.95 for the random variable representing the intersection of input x^{x_i}_j and class y_l would be Np^{x_i}_{jl} ± 2√(Np^{x_i}_{jl}(1 − p^{x_i}_{jl})). We can, however, choose any confidence interval for each Z_i (not necessarily the same) so as to have the appropriate β, which, as we know, lower bounds α. We also need to check that the individual confidence intervals produce legal datasets, since the total sample size is a fixed number N. This can easily be done by adding the upper limits of the intervals for each Z_i and verifying that the eventual sum is at most N. If it is not, one can adjust the limits of one or more of the intervals so as to satisfy the condition.

The confidence intervals for each Z_i together form a confidence region that represents the relevant set of datasets. From this set we now have to find S_max, i.e., the dataset for which the number of paths that classify input x into class y is the highest in the set. This dataset is the one formed by taking the upper limits of the confidence intervals corresponding to class y and the lower limits of the remaining confidence intervals, with the appropriate value for

the cell corresponding to R^{x_i}, in order to have a total sample size of N. A dataset formed by this procedure would be S_max, since taking the upper and lower limits as indicated would indeed produce a dataset with the maximum number of paths classifying the particular input into class y within the relevant set. An overview of this procedure is given in Algorithm 2.

Algorithm 2 Overview of the procedure to find S_max and m_h(S_max) for an input-output pair (x_i, y_b).
Input: h, X = {x_1, ..., x_n}, Y = {y_1, ..., y_c}, P = {p_11, ..., p_nc} {P is the estimated or given underlying probability distribution over X × Y.}
Output: S_max, m_h(S_max)
Let x^{x_i}_j denote the j-th input which has at least h attributes with the same value as input x_i.
Let p^{x_i}_{jl} be the (rolled-up) probability of observing (x^{x_i}_j, y_l), which is computed from P.
Let n_r = N
for all j ∈ {1, ..., k}, l ∈ {1, ..., c} do
  if l = b then {γ_jl below depends on the desired β, which is the right-hand side of equation 11, and on N.}
    Compute n_jl = Np^{x_i}_{jl} + γ_jl √(Np^{x_i}_{jl}(1 − p^{x_i}_{jl}))
  else
    Compute n_jl = Np^{x_i}_{jl} − γ_jl √(Np^{x_i}_{jl}(1 − p^{x_i}_{jl}))
  end if
  Compute n_r = n_r − n_jl
end for
S_max = {n_11, ..., n_kc, n_r}
Compute m_h(S_max) = count of the number of paths of length h in S_max for which y_b is the majority class.
Return S_max, m_h(S_max)

With this, we have completely described a process that can be used to upper bound P_{Z(N)}[ζ(x) = y]. An analogous procedure can be used to upper bound P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'], with the slight difference being that we have to consider pairs of inputs rather than single inputs. The basic technique, however, remains the same.
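A compact sketch of Algorithm 2 under simplifying assumptions: the distribution is a dictionary over (input tuple, class) cells, a single interval multiplier γ is used for every cell (the paper allows it to vary per cell so as to reach the desired β), and the cell counts are left as real numbers rather than rounded to integers. All names are illustrative.

    # Sketch of Algorithm 2: worst-case dataset S_max for the pair (x_i, y_b)
    # and the corresponding m_h(S_max).
    import itertools
    from math import sqrt

    def find_s_max(x_i, y_b, h, dist, N, gamma):
        d = len(x_i)
        # rows of the collapsed table: inputs sharing at least h attribute values with x_i
        rows = sorted({inp for (inp, _) in dist
                       if sum(u == w for u, w in zip(inp, x_i)) >= h})
        classes = sorted({lab for (_, lab) in dist})
        counts, n_r = {}, N
        for inp in rows:
            for lab in classes:
                p = dist.get((inp, lab), 0.0)
                half = gamma * sqrt(N * p * (1 - p))
                # upper interval limit for the target class y_b, lower limit otherwise
                counts[(inp, lab)] = N * p + half if lab == y_b else max(0.0, N * p - half)
                n_r -= counts[(inp, lab)]
        # m_h(S_max): number of length-h paths whose leaf region around x_i has
        # y_b as the strict majority class under these counts
        m_h = 0
        for A in itertools.combinations(range(d), h):
            leaf = {lab: sum(counts[(inp, lab)] for inp in rows
                             if all(inp[a] == x_i[a] for a in A))
                    for lab in classes}
            if all(leaf[y_b] > leaf[lab] for lab in classes if lab != y_b):
                m_h += 1
        return counts, n_r, m_h

The returned m_h can then be plugged into the Theorem 1 bound sketched earlier.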

3.4 Time Complexity

In the previous subsection we provided a strategy to upper bound P_{Z(N)}[ζ(x) = y] and P_{Z(N)×Z(N)}[ζ(x) = y ∧ ζ'(x') = y'] ∀ x, x' ∈ X, y, y' ∈ Y, and consequently the moments. In this subsection, we analyze the time complexity of this strategy and compare it with the straightforward extension described in Section 3. Consider the following setup, where we have d input attributes each taking v values, a class attribute taking c values, a sample size of N, and T trees in the ensemble, each of height h ≤ d. Given this, the straightforward extension would have a time complexity of O(n c q^T N^{T+1}) for the first moment and O(n² c² q^{2T} N^{2T+2}) for the second moment, where n = v^d is the number of distinct inputs and q = C(d, h) is the total number of paths of length h for a particular input. The nc term in the time complexity comes from the fact that we have to estimate nc probabilities, the q^T term indicates that we have to sum over all combinations of q paths for the T trees, and N^{T+1} is the time it takes to compute each of the huge probabilities in equation 7. The N^{T+1} term can be reduced by approximating each of the probabilities using Monte Carlo sampling; however, the q^T term is still very intimidating given that T is usually in the hundreds.

The time complexity of our bound is O(n k c² q T) for the first moment and a quadratic function of this value for the second moment, where k = Σ_{i=h}^{d} C(d, i) (v − 1)^{d−i} < n. This time complexity comes from the fact that tables such as Table 3 can be created for each input in one pass over the dataset, and then for each input-output pair we can find m_h(S_max)/q followed by the binomial probability in O(kcqT) time. A key component enabling this exponential gain is the fact that the simultaneous confidence region can be lower bounded by bounding each variable independently, and that m_h(S_max) can be found efficiently, as shown before. Hence, there is a huge overall reduction in time complexity for practical applications, as there is an exponential gain in T, with the sample size N not affecting the analysis. The increased O(kc) factor is more than compensated for by these reductions. We have thus reduced the complexity from something that was practically impossible to compute even for small-scale problems to something that is reasonable to compute for medium-scale and even certain large-scale problems. Moreover, the bounds on the individual probabilities can be computed in parallel.

3.5 Practical Considerations

In the previous subsection we analyzed the worst-case time complexity of computing our bound. It may, however, be possible to further speed up its computation in practice. For instance, to upper bound the moments, we may not have to upper bound the probabilities for every input-output pair. For every input, it is sufficient to upper bound the probabilities starting from the one corresponding to the highest error (i.e., the highest P[X = x] P[Y(x) ≠ y], y ∈ Y) down to the ones with lower errors, in (descending) sorted fashion, until the upper bounds sum to more than 1 or we reach the probability with the lowest error. That is, for the first moment, P_{Z(N)}[ζ(x) = y] need not be upper bounded if P[X = x] P[Y(x) ≠ y] is the lowest error among all of Y, and similarly for the second moment. The weight multiplied with the lowest-error term corresponding to input x is 1 − Σ_{y'∈Y, y'≠y} U(x, y'), where U(x, y') is the upper bound on P_{Z(N)}[ζ(x) = y'], provided the sum of the U(x, y') does not exceed 1. This can be done since the probabilities of classifying an input into the different classes form a convex combination. Conversely, if we first upper bound the probability with the lowest error for a particular input and then, in (ascending) sorted fashion, upper bound the probabilities with successively higher errors until one of the aforementioned conditions is met, we would get lower bounds on the moments. Such a strategy would give tighter bounds than blindly upper bounding every probability.

It is important to realize that, because of the generic expressions, we get tighter bounds when compared with directly bounding the moments using the procedure mentioned in the previous subsection. From Bonferroni's inequality in equation 11, we see that as the number of random variables increases, the bound loosens (linearly). Hence, having fewer variables will invariably lead to tighter bounds. If we were to derive bounds directly, we would have to consider the entire dataset with all the cells, without collapsing or aggregating cells into fewer cells, and hence fewer variables [11], as is the case using the generic expressions. The reason for this is that the generic expressions consider cells corresponding to an individual input or pairs of inputs rather than the entire input space.
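One schematic reading of this speed-up for the first moment is sketched below: for a given input x, weight is allocated to classes in descending order of error, capping each class at its upper bound U(x, y) and the total at 1, since the classification probabilities form a convex combination. This is an illustrative sketch, not the paper's implementation.

    # Upper bound on the contribution of one input x to E_{Z(N)}[GE].
    def upper_bound_contribution(err, U):
        """err[y] = P[X = x] * P[Y(x) != y]; U[y] = upper bound on P_{Z(N)}[zeta(x) = y].
        Classes missing from U (e.g., the lowest-error class) simply absorb the
        leftover mass."""
        total, remaining = 0.0, 1.0
        for y in sorted(err, key=err.get, reverse=True):   # highest error first
            w = min(U.get(y, 1.0), remaining)
            total += w * err[y]
            remaining -= w
            if remaining <= 0.0:
                break
        return total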

4 Experiments

In the previous section we described a technique to derive bounds on the moments. In this section, we test the usefulness of these bounds by performing experiments on synthetic and real data. In particular, we compare the tightness of the upper bound derived by our method with the commonly used strength and correlation bounds given by Breiman, which are upper bounds on the GE. As we will see, our bound tends to be significantly tighter than the alternative in most circumstances.

4.1 Setup

The experimental setup that we have here is very similar to the one in a previous study [10]. In each of the experiments on synthetic and real data that we describe below, we obtain a robust estimate of E_{Z(N)}[GE(ζ)] by generating 1000 hold-out or test sets of datapoints and then averaging the error over these 1000 hold-out sets. Note that, even in the synthetic case, E_{Z(N)}[GE(ζ)] cannot be computed exactly, since the expectation is over all possible classifiers built over datasets of size N, which is computationally infeasible even for N in the hundreds. We set the confidence intervals per variable (Z_i) to a fixed measure in all experiments. The number of trees in the ensemble (T) is set to 100 for the real data and varied from 100 to 500 for the synthetic data. We also report the average computation time in seconds for the respective bounds.

The experimental setup for synthetic data is as follows. In our initial experiments we fix N to 100 and then increase it to 10,000. The number of classes is fixed to two. We compute the upper bounds on E_{Z(N)}[GE(ζ)] where the number of attributes is fixed to d = 5, with each attribute having 2 attribute values. We then increase the number of attribute values to 3 to observe the effect that increasing the number of split points has on the performance of the estimators. We also increase the number of attributes to d = 8 to study the effect that increasing the number of attributes has on the performance. With this we have a (d + 1)-dimensional contingency table whose first d dimensions are the attributes and whose (d + 1)-th dimension represents the class labels. When each attribute has two values, the total number of cells in the table is c = 2^{d+1}, and with three values the total number of cells is c = 3^d · 2. If we fix the probability of observing a datapoint in cell i to be p_i, such that Σ_{i=1}^{c} p_i = 1, and the sample size to N, the distribution that perfectly models this scenario is a multinomial distribution with parameters N and the set {p_1, p_2, ..., p_c}. In fact, irrespective of the value of d and the number of attribute values for each attribute, the scenario can be modeled by a multinomial distribution. In the studies that follow, the p_i's are varied and the amount of dependence between the attributes and the class labels (ρ) is computed for each set of p_i's using the Chi-square test [9]. More precisely, we sum over all i the squares of the difference of each p_i with the product of its corresponding marginals, with each squared difference being divided by this product, that is, correlation = Σ_i (p_i − p_im)² / p_im, where p_im is the product of the marginals for the i-th cell.
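A small sketch of this dependence measure, under the assumption that "the product of its corresponding marginals" means the product of the marginal over the input configuration and the marginal over the class label; names are illustrative.

    # Chi-square style dependence between the attributes (jointly) and the class.
    def dependence(p):
        """p[(x, y)] = joint probability of input configuration x (a tuple) and class y.
        Returns sum_i (p_i - p_im)^2 / p_im with p_im = P[x] * P[y]."""
        px, py = {}, {}
        for (x, y), prob in p.items():
            px[x] = px.get(x, 0.0) + prob
            py[y] = py.get(y, 0.0) + prob
        rho = 0.0
        for x in px:
            for y in py:
                p_im = px[x] * py[y]
                if p_im > 0:
                    rho += (p.get((x, y), 0.0) - p_im) ** 2 / p_im
        return rho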

In the case of real data, we perform experiments by building bootstrap distributions on three UCI data sets that have been used previously in the context of decision trees [10] and six other industrial datasets obtained from diverse domains. The six proprietary datasets, denoted by Oil, Semiconductor, Store, Spend, Airline and Invoice in Table 6, are obtained from the petrochemical industry, the semiconductor industry, the consumer products industry, the finance domain, the airline industry and the procurement domain respectively.

Oil: The Oil dataset has information obtained from a major oil corporation. There are a total of 9 measures in the dataset. These measures are obtained from the sensors of a 3-stage separator that separates oil, water, and gas. The 9 measures are composed of 2 measured levels of the oil-water interface at each of the 3 stages and 3 overall measures. In particular, the measures are as follows:

1. 20LT 0017 mean (stage 1)
2. 20LT 0021 mean (stage 1)
3. 20LT 0081 mean (stage 2)
4. 20LT 0085 mean (stage 2)
5. 20LT 0116 mean (stage 3)
6. 20LT 0120 mean (stage 3)
7. Daily water in oil (Daily WIO)
8. Daily oil in water (Daily OIW)
9. Daily production normal

Our target measure is the last listed measure, Daily production normal (a binary variable), which indicates whether the daily oil production meets the desired level or not. The dataset size is 992.

Semiconductor: In the chip manufacturing industry, accurately predicting ahead of time whether a wafer (which is a collection of chips) lies within certain spec can be crucial in choosing the appropriate set of wafers to send forward for further processing. Eliminating faulty wafers can save the industry a huge amount of resources in terms of time and money. The dataset we have has 175 features including the target. The target is a binary variable which indicates whether the wafer conforms to the required spec or not. The other features are a combination of physical measurements and electrical measurements made on the wafer.

Store: In the consumer products industry, predicting when a certain item of a competitor brand may or may not be on promotion can be critical for strategic planning. The dataset we have has information about past promotions that we have carried out for a type of bread and, for the corresponding time periods, we also have information about 10 competitors that have carried out promotions. The target here is the binary time series corresponding to our closest competitor. The dataset size is 140, corresponding to 140 weeks of data.

Spend: The Spend dataset contains a couple of years' worth of (spend) transactions spread across various categories belonging to a large corporation. The transactions are indicative of the company's expenditure in this time frame. The dataset has 13 attributes, namely: requestor name, cost center name, description code, material group, vendor name, business unit name, region, purchase order type, addressable, spend type, compliant, invoice spend amount. Our target is the compliance attribute, which indicates whether a transaction was compliant or not. Given this, the goal is to identify the characteristics of a transaction that highly correlate with it being compliant/non-compliant. With this information the company would be able to put in

place appropriate policies and practices that could lead to potentially huge savings.

Airline: This dataset contains information about flights of a major airline carrier. In particular, the dataset has information about the date of departure and arrival of a flight, the airport the flight departs from and arrives at, the pilot's serial number, the type of aircraft, the dispatcher's name, the cost, and multiple fuel indicators signifying the amount of fuel put in the flight, the minimum required by the government and other safety fuel limits. We have information on over 100,000 flights and the goal is to identify factors that lead to a flight carrying more fuel than what is required. We have a binary indicator which depicts whether the flight carried excess fuel. Controlling the factors which lead to this can help an airline carrier save money and improve safety.

Invoice: Large corporations have many business units (BUs), viz. travel, marketing, auditing, information technology, etc., with each BU consisting of various commodity councils (CCs), viz. office supplies, tech services, communication services, etc. Throughout a calendar year, each of the commodity councils carries out multiple transactions and records the corresponding invoices. It is extremely useful for these CCs, and BUs at large, to estimate in advance the invoice amounts that are likely to be registered in the near future. This dataset has 8 BUs, with each BU consisting of 150 CCs. The data was collected daily for a year and, hence, the dataset has only 365 datapoints. Our target is the CC communication services (footnote 4) under the BU information technology, which has one of the highest invoice amounts and therefore is critical to business.

Footnote 4: Partitioned into 3 categories: high, medium and low.

Table 4 The table shows E_{Z(N)}[GE(ζ)] and the corresponding upper bounds for different levels of correlation (ρ) between the attributes and class labels for T = 100. The values before the commas in the cells under the different levels of correlation correspond to our method, while the values following the commas correspond to Breiman's strength and correlation bounds.

Parameter Settings      Metric           ρ = 0.9    ρ = 0.36   ρ = 0.11   ρ = 0.02
N = 100, binary,        Bounds           0.22, ·    ·, ·       ·, ·       ·, 1.34
d = 5, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             0.09, ·    ·, ·       ·, ·       ·, 0.11
N = 100, ternary,       Bounds           0.48, ·    ·, ·       ·, ·       ·, 1.7
d = 5, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             0.21, ·    ·, ·       ·, ·       ·, 0.32
N = 100, binary,        Bounds           0.71, ·    ·, ·       ·, ·       ·, 0.97
d = 8, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             0.35, ·    ·, ·       ·, ·       ·, 0.56
N = 10000, binary,      Bounds           0.22, ·    ·, ·       ·, ·       ·, 0.69
d = 5, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             0.10, ·    ·, ·       ·, ·       ·, 0.12
N = 10000, ternary,     Bounds           0.26, ·    ·, ·       ·, ·       ·, 1.42
d = 5, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             0.22, ·    ·, ·       ·, ·       ·, 0.32
N = 10000, binary,      Bounds           0.68, ·    ·, ·       ·, ·       ·, 0.62
d = 8, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             0.34, ·    ·, ·       ·, ·       ·, 0.58

For each of the above datasets, we split the continuous attributes at the mean of the given data. We thus can form a contingency table representing each of the data

sets.

Table 5 The table shows E_{Z(N)}[GE(ζ)] and the corresponding upper bounds for different levels of correlation (ρ) between the attributes and class labels for T = 500. The values before the commas in the cells under the different levels of correlation correspond to our method, while the values following the commas correspond to Breiman's strength and correlation bounds.

Parameter Settings      Metric           ρ = 0.9    ρ = 0.36   ρ = 0.11   ρ = 0.02
N = 100, binary,        Bounds           0.23, ·    ·, ·       ·, ·       ·, 1.29
d = 5, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             0.48, ·    ·, ·       ·, ·       ·, 0.56
N = 100, ternary,       Bounds           0.49, ·    ·, ·       ·, ·       ·, 1.8
d = 5, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             1.07, ·    ·, ·       ·, ·       ·, 1.59
N = 100, binary,        Bounds           0.73, ·    ·, ·       ·, ·       ·, 0.98
d = 8, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             1.78, ·    ·, ·       ·, ·       ·, 2.89
N = 10000, binary,      Bounds           0.23, ·    ·, ·       ·, ·       ·, 0.67
d = 5, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             0.49, ·    ·, ·       ·, ·       ·, 0.60
N = 10000, ternary,     Bounds           0.29, ·    ·, ·       ·, ·       ·, 1.41
d = 5, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             1.08, ·    ·, ·       ·, ·       ·, 1.73
N = 10000, binary,      Bounds           0.71, ·    ·, ·       ·, ·       ·, 0.61
d = 8, h = 3            E_Z(N)[GE(ζ)]    ·          ·          ·          ·
                        Time             1.73, ·    ·, ·       ·, ·       ·, 2.96

Table 6 The table shows E_{Z(N)}[GE(ζ)] for the real datasets with the respective upper bounds and running times.

Dataset                   E_Z(N)[GE(ζ)]   Our Bound, Time   Breiman's Bound, Time
Pima Indians              ·               ·, ·              ·, 0.57
Balloon                   ·               ·, ·              ·, 0.08
Shuttle Landing Control   ·               ·, ·              ·, 0.10
Oil                       ·               ·, ·              ·, 0.63
Semiconductor             ·               ·, ·              ·, ·
Store                     ·               ·, ·              ·, 0.71
Spend                     ·               ·, ·              ·, ·
Airline                   ·               ·, ·              ·, ·
Invoice                   ·               ·, ·              ·, ·

The counts in the individual cells divided by the data set size provide us with empirical estimates for the individual cell probabilities (p_i's). Thus, with the knowledge of N (data set size) and the individual p_i's, we have a multinomial distribution. Using this distribution we observe the behavior of the bounds. The results are shown in Tables 4, 5 and 6, where we also see the upper bounds computed using Breiman's formula [7], κ(1 − s²)/s², described before. Estimating κ and s, we find the upper bound on the GE for a particular classifier. Since we need an estimate of E_{Z(N)}[GE(ζ)], we perform the above procedure multiple times, thus building multiple ensembles and computing an upper bound on the GE for each. We then average the upper bounds that we have computed and report the result as an estimate of the upper bound on E_{Z(N)}[GE(ζ)].
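A minimal sketch of this construction: split each continuous attribute at its mean, count the (discretized input, label) cells and divide by the dataset size to obtain the cell probabilities p_i of the multinomial. The data layout and names are illustrative; the resulting dictionary has the same format as the distributions assumed in the earlier sketches.

    # Empirical multinomial (bootstrap distribution) from a labeled dataset.
    from collections import Counter

    def empirical_multinomial(rows, labels, continuous_cols):
        """rows: list of attribute tuples; labels: list of class labels;
        continuous_cols: indices of attributes to be split at their mean."""
        N = len(rows)
        means = {j: sum(r[j] for r in rows) / N for j in continuous_cols}
        def discretize(r):
            return tuple((1 if r[j] > means[j] else 0) if j in continuous_cols else r[j]
                         for j in range(len(r)))
        counts = Counter((discretize(r), y) for r, y in zip(rows, labels))
        return {cell: cnt / N for cell, cnt in counts.items()}, N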

4.2 Observations

In the synthetic experiments, our bound is the tightest when d = 5, h = 3 and we have binary splits, while it is the worst when d = 8, h = 3. This is because the number of variables is the least in the first scenario while it is the most in the second scenario. This trend is especially seen at moderate correlation (0.36), since the number of paths classifying inputs into the class with high errors for the sample S_max increases more dramatically with the number of variables in this case than at high or low correlations. In any case, we do significantly better than Breiman's bound. Increasing the sample size from 100 to 10,000 has little effect on the time it takes to compute our bound, while the quality of the bound remains intact, which is promising. Increasing the number of trees in the ensemble from 100 (Table 4) to 500 (Table 5) also maintains the quality of our bound, though the time to compute it increases linearly. This is consistent with the complexity analysis we provided in the previous section.

In the experiments on real data, we see a similar behavior, where our bound outperforms Breiman's bound on all nine datasets. These datasets are from diverse domains and exhibit varied characteristics in terms of sample size, dimensionality and type of attributes. The attributes are a mix of continuous and discrete, with the discrete attributes having a range of splits. We see in these as well as the synthetic experiments that the time to compute our bound increases as the number of attributes and the splitting factor increase, though it is still faster than computing Breiman's bound. The reason for this is that we do not have to explicitly build the trees to compute our bound. Hence, from the experiments we see that our bound is never worse than that computed using Breiman's formula, though in certain situations it is close to being trivial, i.e., close to 1.

5 Discussion

In this paper we described a procedure to derive bounds for an ensemble of random decision trees. It would be interesting to extend such an analysis to the more general random forest algorithm. In order to derive bounds for the random forest algorithm, we would have to characterize the probability of choosing a particular attribute based on two criteria: first, that it is part of the subset of attributes that are randomly chosen and, second, that it has the maximum IG or GG in that set. For the ensemble of random decision trees, we just had to deal with the first criterion, with a subset size of one. Deriving acceptable bounds for the random forest algorithm efficiently is thus a challenge for the future.

In the analysis presented in the previous sections, one of the main ideas in computing the bound efficiently was to use Bonferroni's inequality. However, this inequality can lead to loose bounds with increasing dimensionality. In the future, it would be interesting to come up with better characterizations of the simultaneous confidence region, leading to tighter and, hence, more useful bounds. The challenge, however, is to decipher these regions in a scalable fashion.

To conclude, we have presented a novel procedure to derive bounds on the moments of the GE for an ensemble of random decision trees. This procedure is efficient and leads to tighter bounds than the well-established Breiman's strength and correlation bounds. The derivation of these bounds required a perspective and a set of analytical tools

that are significantly different from the strategies used to characterize (single) random decision trees [10]. It would be interesting to see if the ideas used in analyzing the ensemble of random decision trees can be used to analyze other such ensembles in the future.

Acknowledgement I would like to thank the editor and the anonymous reviewers for their constructive comments. I would also like to thank Katherine Dhurandhar for proofreading the paper.

References

1. A. Anandkumar, D. Foster, D. Hsu, S. Kakade and Y. Liu. A spectral algorithm for Latent Dirichlet Allocation. In NIPS, Lake Tahoe, USA, 2012.
2. B. Boots and G. Gordon. Two manifold problems with applications to nonlinear system identification. In ICML, page 338, Edinburgh, Scotland, UK, 2012.
3. N. Bshouty and P. Long. Finding planted partitions in nearly linear time using arrested spectral clustering. In ICML, Haifa, Israel, 2010.
4. T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning. Springer, 2nd edition, 2009.
5. R. Duda, P. Hart and D. Stork. Pattern Classification. Wiley, New York, 2nd edition, 2001.
6. A. Dhurandhar and A. Dobra. Distribution free bounds for relational classification. Knowledge and Information Systems.
7. L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
8. S. Buttrey and I. Kobayashi. On strength and correlation in random forests. In Proceedings of the 2003 Joint Statistical Meetings, Section on Statistical Computing, 2003.
9. J. Connor-Linton. Chi square tutorial. http:// web chi tut.html.
10. A. Dhurandhar and A. Dobra. Probabilistic characterization of random decision trees. Journal of Machine Learning Research, 9, 2008.
11. A. Dhurandhar and A. Dobra. Semi-analytical method for analyzing models and model selection measures based on moment analysis. ACM Transactions on Knowledge Discovery from Data.
12. A. Dhurandhar and A. Dobra. Probabilistic characterization of nearest neighbor classifiers. International Journal of Machine Learning and Cybernetics.
13. W. Fan, H. Wang, P. Yu and S. Ma. Is random model better? On its accuracy and efficiency. In ICDM '03: Proceedings of the Third IEEE International Conference on Data Mining, page 51, Washington, DC, USA, 2003. IEEE Computer Society.
14. P. Geurts, D. Ernst and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3-42, 2006.
15. J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6, December 2005.
16. F. Liu, K. Ting and W. Fan. Maximizing tree diversity by building complete-random decision trees. In PAKDD, 2005.
17. D. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, ACM Press, 1999.
18. D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT, 2003.
19. S. Roy and R. Bose. Simultaneous confidence interval estimation. Annals of Mathematical Statistics, 24(3), 1953.
20. C. Sison and J. Glaz. Simultaneous confidence intervals and sample size determination for multinomial proportions. JASA, 90(429), 1995.
21. Y. Tong. Probabilistic Inequalities for Multivariate Distributions. Academic Press, 1st edition, 1980.
22. K. Zhang and W. Fan. Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond. Knowledge and Information Systems, 14(3), 2008.
23. X. Zhang, Q. Yuan, S. Zhao, W. Fan, W. Zheng and Z. Wang. Multi-label classification without the multi-label cost. In SDM '10: Proceedings of the SIAM Conference on Data Mining, 2010.

Author Biography

Amit Dhurandhar is a research staff member in the Mathematical Sciences Dept. at IBM T.J. Watson. He received his B.E. in computer engineering from Pune University. He then received his Masters and Ph.D. in computer engineering from the University of Florida in 2005 and 2009 respectively. He is the acting Knowledge Discovery and Data Mining Professional Interest Community (KDD PIC) chair at IBM T.J. Watson. Broadly speaking, Amit's research interests primarily span the areas of machine learning, data mining and computational neuroscience. He has authored several papers and has been a reviewer for many top-quality conferences and journals.


More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximatinga function fx, wose values at a set of distinct points x, x, x,, x n are known, by a polynomial P x suc

More information

Combining functions: algebraic methods

Combining functions: algebraic methods Combining functions: algebraic metods Functions can be added, subtracted, multiplied, divided, and raised to a power, just like numbers or algebra expressions. If f(x) = x 2 and g(x) = x + 2, clearly f(x)

More information

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER*

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER* EO BOUNDS FO THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BADLEY J. LUCIE* Abstract. Te expected error in L ) attimet for Glimm s sceme wen applied to a scalar conservation law is bounded by + 2 ) ) /2 T

More information

2.1 THE DEFINITION OF DERIVATIVE

2.1 THE DEFINITION OF DERIVATIVE 2.1 Te Derivative Contemporary Calculus 2.1 THE DEFINITION OF DERIVATIVE 1 Te grapical idea of a slope of a tangent line is very useful, but for some uses we need a more algebraic definition of te derivative

More information

ch (for some fixed positive number c) reaching c

ch (for some fixed positive number c) reaching c GSTF Journal of Matematics Statistics and Operations Researc (JMSOR) Vol. No. September 05 DOI 0.60/s4086-05-000-z Nonlinear Piecewise-defined Difference Equations wit Reciprocal and Cubic Terms Ramadan

More information

HARMONIC ALLOCATION TO MV CUSTOMERS IN RURAL DISTRIBUTION SYSTEMS

HARMONIC ALLOCATION TO MV CUSTOMERS IN RURAL DISTRIBUTION SYSTEMS HARMONIC ALLOCATION TO MV CUSTOMERS IN RURAL DISTRIBUTION SYSTEMS V Gosbell University of Wollongong Department of Electrical, Computer & Telecommunications Engineering, Wollongong, NSW 2522, Australia

More information

Chapter 2 Ising Model for Ferromagnetism

Chapter 2 Ising Model for Ferromagnetism Capter Ising Model for Ferromagnetism Abstract Tis capter presents te Ising model for ferromagnetism, wic is a standard simple model of a pase transition. Using te approximation of mean-field teory, te

More information

Lecture 15. Interpolation II. 2 Piecewise polynomial interpolation Hermite splines

Lecture 15. Interpolation II. 2 Piecewise polynomial interpolation Hermite splines Lecture 5 Interpolation II Introduction In te previous lecture we focused primarily on polynomial interpolation of a set of n points. A difficulty we observed is tat wen n is large, our polynomial as to

More information

How to Find the Derivative of a Function: Calculus 1

How to Find the Derivative of a Function: Calculus 1 Introduction How to Find te Derivative of a Function: Calculus 1 Calculus is not an easy matematics course Te fact tat you ave enrolled in suc a difficult subject indicates tat you are interested in te

More information

Regularized Regression

Regularized Regression Regularized Regression David M. Blei Columbia University December 5, 205 Modern regression problems are ig dimensional, wic means tat te number of covariates p is large. In practice statisticians regularize

More information

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example,

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example, NUMERICAL DIFFERENTIATION James T Smit San Francisco State University In calculus classes, you compute derivatives algebraically: for example, f( x) = x + x f ( x) = x x Tis tecnique requires your knowing

More information

IEOR 165 Lecture 10 Distribution Estimation

IEOR 165 Lecture 10 Distribution Estimation IEOR 165 Lecture 10 Distribution Estimation 1 Motivating Problem Consider a situation were we ave iid data x i from some unknown distribution. One problem of interest is estimating te distribution tat

More information

MATH745 Fall MATH745 Fall

MATH745 Fall MATH745 Fall MATH745 Fall 5 MATH745 Fall 5 INTRODUCTION WELCOME TO MATH 745 TOPICS IN NUMERICAL ANALYSIS Instructor: Dr Bartosz Protas Department of Matematics & Statistics Email: bprotas@mcmasterca Office HH 36, Ext

More information

Preface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Preface. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. Preface Here are my online notes for my course tat I teac ere at Lamar University. Despite te fact tat tese are my class notes, tey sould be accessible to anyone wanting to learn or needing a refreser

More information

Symmetry Labeling of Molecular Energies

Symmetry Labeling of Molecular Energies Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry

More information

Problem Solving. Problem Solving Process

Problem Solving. Problem Solving Process Problem Solving One of te primary tasks for engineers is often solving problems. It is wat tey are, or sould be, good at. Solving engineering problems requires more tan just learning new terms, ideas and

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximating a function f(x, wose values at a set of distinct points x, x, x 2,,x n are known, by a polynomial P (x

More information

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point MA00 Capter 6 Calculus and Basic Linear Algebra I Limits, Continuity and Differentiability Te concept of its (p.7 p.9, p.4 p.49, p.55 p.56). Limits Consider te function determined by te formula f Note

More information

Consider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx.

Consider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx. Capter 2 Integrals as sums and derivatives as differences We now switc to te simplest metods for integrating or differentiating a function from its function samples. A careful study of Taylor expansions

More information

A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES

A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES Ronald Ainswort Hart Scientific, American Fork UT, USA ABSTRACT Reports of calibration typically provide total combined uncertainties

More information

Time (hours) Morphine sulfate (mg)

Time (hours) Morphine sulfate (mg) Mat Xa Fall 2002 Review Notes Limits and Definition of Derivative Important Information: 1 According to te most recent information from te Registrar, te Xa final exam will be eld from 9:15 am to 12:15

More information

Pre-Calculus Review Preemptive Strike

Pre-Calculus Review Preemptive Strike Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly

More information

Introduction to Machine Learning. Recitation 8. w 2, b 2. w 1, b 1. z 0 z 1. The function we want to minimize is the loss over all examples: f =

Introduction to Machine Learning. Recitation 8. w 2, b 2. w 1, b 1. z 0 z 1. The function we want to minimize is the loss over all examples: f = Introduction to Macine Learning Lecturer: Regev Scweiger Recitation 8 Fall Semester Scribe: Regev Scweiger 8.1 Backpropagation We will develop and review te backpropagation algoritm for neural networks.

More information

Effect of the Dependent Paths in Linear Hull

Effect of the Dependent Paths in Linear Hull 1 Effect of te Dependent Pats in Linear Hull Zenli Dai, Meiqin Wang, Yue Sun Scool of Matematics, Sandong University, Jinan, 250100, Cina Key Laboratory of Cryptologic Tecnology and Information Security,

More information

EDML: A Method for Learning Parameters in Bayesian Networks

EDML: A Method for Learning Parameters in Bayesian Networks : A Metod for Learning Parameters in Bayesian Networks Artur Coi, Kaled S. Refaat and Adnan Darwice Computer Science Department University of California, Los Angeles {aycoi, krefaat, darwice}@cs.ucla.edu

More information

HOMEWORK HELP 2 FOR MATH 151

HOMEWORK HELP 2 FOR MATH 151 HOMEWORK HELP 2 FOR MATH 151 Here we go; te second round of omework elp. If tere are oters you would like to see, let me know! 2.4, 43 and 44 At wat points are te functions f(x) and g(x) = xf(x)continuous,

More information

7 Semiparametric Methods and Partially Linear Regression

7 Semiparametric Methods and Partially Linear Regression 7 Semiparametric Metods and Partially Linear Regression 7. Overview A model is called semiparametric if it is described by and were is nite-dimensional (e.g. parametric) and is in nite-dimensional (nonparametric).

More information

Introduction to Derivatives

Introduction to Derivatives Introduction to Derivatives 5-Minute Review: Instantaneous Rates and Tangent Slope Recall te analogy tat we developed earlier First we saw tat te secant slope of te line troug te two points (a, f (a))

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

Financial Econometrics Prof. Massimo Guidolin

Financial Econometrics Prof. Massimo Guidolin CLEFIN A.A. 2010/2011 Financial Econometrics Prof. Massimo Guidolin A Quick Review of Basic Estimation Metods 1. Were te OLS World Ends... Consider two time series 1: = { 1 2 } and 1: = { 1 2 }. At tis

More information

Lab 6 Derivatives and Mutant Bacteria

Lab 6 Derivatives and Mutant Bacteria Lab 6 Derivatives and Mutant Bacteria Date: September 27, 20 Assignment Due Date: October 4, 20 Goal: In tis lab you will furter explore te concept of a derivative using R. You will use your knowledge

More information

SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY

SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY (Section 3.2: Derivative Functions and Differentiability) 3.2.1 SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY LEARNING OBJECTIVES Know, understand, and apply te Limit Definition of te Derivative

More information

Learning based super-resolution land cover mapping

Learning based super-resolution land cover mapping earning based super-resolution land cover mapping Feng ing, Yiang Zang, Giles M. Foody IEEE Fellow, Xiaodong Xiuua Zang, Siming Fang, Wenbo Yun Du is work was supported in part by te National Basic Researc

More information

Exam 1 Review Solutions

Exam 1 Review Solutions Exam Review Solutions Please also review te old quizzes, and be sure tat you understand te omework problems. General notes: () Always give an algebraic reason for your answer (graps are not sufficient),

More information

Derivatives. By: OpenStaxCollege

Derivatives. By: OpenStaxCollege By: OpenStaxCollege Te average teen in te United States opens a refrigerator door an estimated 25 times per day. Supposedly, tis average is up from 10 years ago wen te average teenager opened a refrigerator

More information

WYSE Academic Challenge 2004 Sectional Mathematics Solution Set

WYSE Academic Challenge 2004 Sectional Mathematics Solution Set WYSE Academic Callenge 00 Sectional Matematics Solution Set. Answer: B. Since te equation can be written in te form x + y, we ave a major 5 semi-axis of lengt 5 and minor semi-axis of lengt. Tis means

More information

Order of Accuracy. ũ h u Ch p, (1)

Order of Accuracy. ũ h u Ch p, (1) Order of Accuracy 1 Terminology We consider a numerical approximation of an exact value u. Te approximation depends on a small parameter, wic can be for instance te grid size or time step in a numerical

More information

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 Mat 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 f(x+) f(x) 10 1. For f(x) = x 2 + 2x 5, find ))))))))) and simplify completely. NOTE: **f(x+) is NOT f(x)+! f(x+) f(x) (x+) 2 + 2(x+) 5 ( x 2

More information

1. Which one of the following expressions is not equal to all the others? 1 C. 1 D. 25x. 2. Simplify this expression as much as possible.

1. Which one of the following expressions is not equal to all the others? 1 C. 1 D. 25x. 2. Simplify this expression as much as possible. 004 Algebra Pretest answers and scoring Part A. Multiple coice questions. Directions: Circle te letter ( A, B, C, D, or E ) net to te correct answer. points eac, no partial credit. Wic one of te following

More information

Section 3: The Derivative Definition of the Derivative

Section 3: The Derivative Definition of the Derivative Capter 2 Te Derivative Business Calculus 85 Section 3: Te Derivative Definition of te Derivative Returning to te tangent slope problem from te first section, let's look at te problem of finding te slope

More information

5.1 We will begin this section with the definition of a rational expression. We

5.1 We will begin this section with the definition of a rational expression. We Basic Properties and Reducing to Lowest Terms 5.1 We will begin tis section wit te definition of a rational epression. We will ten state te two basic properties associated wit rational epressions and go

More information

Chapter 5 FINITE DIFFERENCE METHOD (FDM)

Chapter 5 FINITE DIFFERENCE METHOD (FDM) MEE7 Computer Modeling Tecniques in Engineering Capter 5 FINITE DIFFERENCE METHOD (FDM) 5. Introduction to FDM Te finite difference tecniques are based upon approximations wic permit replacing differential

More information

Section 2: The Derivative Definition of the Derivative

Section 2: The Derivative Definition of the Derivative Capter 2 Te Derivative Applied Calculus 80 Section 2: Te Derivative Definition of te Derivative Suppose we drop a tomato from te top of a 00 foot building and time its fall. Time (sec) Heigt (ft) 0.0 00

More information

Chapter 1. Density Estimation

Chapter 1. Density Estimation Capter 1 Density Estimation Let X 1, X,..., X n be observations from a density f X x. Te aim is to use only tis data to obtain an estimate ˆf X x of f X x. Properties of f f X x x, Parametric metods f

More information

1watt=1W=1kg m 2 /s 3

1watt=1W=1kg m 2 /s 3 Appendix A Matematics Appendix A.1 Units To measure a pysical quantity, you need a standard. Eac pysical quantity as certain units. A unit is just a standard we use to compare, e.g. a ruler. In tis laboratory

More information

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES (Section.0: Difference Quotients).0. SECTION.0: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES Define average rate of cange (and average velocity) algebraically and grapically. Be able to identify, construct,

More information

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Statistica Sinica 24 2014, 395-414 doi:ttp://dx.doi.org/10.5705/ss.2012.064 EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Jun Sao 1,2 and Seng Wang 3 1 East Cina Normal University,

More information

Exercises for numerical differentiation. Øyvind Ryan

Exercises for numerical differentiation. Øyvind Ryan Exercises for numerical differentiation Øyvind Ryan February 25, 2013 1. Mark eac of te following statements as true or false. a. Wen we use te approximation f (a) (f (a +) f (a))/ on a computer, we can

More information

Notes on wavefunctions II: momentum wavefunctions

Notes on wavefunctions II: momentum wavefunctions Notes on wavefunctions II: momentum wavefunctions and uncertainty Te state of a particle at any time is described by a wavefunction ψ(x). Tese wavefunction must cange wit time, since we know tat particles

More information

Cubic Functions: Local Analysis

Cubic Functions: Local Analysis Cubic function cubing coefficient Capter 13 Cubic Functions: Local Analysis Input-Output Pairs, 378 Normalized Input-Output Rule, 380 Local I-O Rule Near, 382 Local Grap Near, 384 Types of Local Graps

More information

Parameter Fitted Scheme for Singularly Perturbed Delay Differential Equations

Parameter Fitted Scheme for Singularly Perturbed Delay Differential Equations International Journal of Applied Science and Engineering 2013. 11, 4: 361-373 Parameter Fitted Sceme for Singularly Perturbed Delay Differential Equations Awoke Andargiea* and Y. N. Reddyb a b Department

More information

The Complexity of Computing the MCD-Estimator

The Complexity of Computing the MCD-Estimator Te Complexity of Computing te MCD-Estimator Torsten Bernolt Lerstul Informatik 2 Universität Dortmund, Germany torstenbernolt@uni-dortmundde Paul Fiscer IMM, Danisc Tecnical University Kongens Lyngby,

More information

2.11 That s So Derivative

2.11 That s So Derivative 2.11 Tat s So Derivative Introduction to Differential Calculus Just as one defines instantaneous velocity in terms of average velocity, we now define te instantaneous rate of cange of a function at a point

More information

The Krewe of Caesar Problem. David Gurney. Southeastern Louisiana University. SLU 10541, 500 Western Avenue. Hammond, LA

The Krewe of Caesar Problem. David Gurney. Southeastern Louisiana University. SLU 10541, 500 Western Avenue. Hammond, LA Te Krewe of Caesar Problem David Gurney Souteastern Louisiana University SLU 10541, 500 Western Avenue Hammond, LA 7040 June 19, 00 Krewe of Caesar 1 ABSTRACT Tis paper provides an alternative to te usual

More information

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab To appear in: Advances in Neural Information Processing Systems 9, eds. M. C. Mozer, M. I. Jordan and T. Petsce. MIT Press, 997 Bayesian Model Comparison by Monte Carlo Caining David Barber D.Barber@aston.ac.uk

More information

Taylor Series and the Mean Value Theorem of Derivatives

Taylor Series and the Mean Value Theorem of Derivatives 1 - Taylor Series and te Mean Value Teorem o Derivatives Te numerical solution o engineering and scientiic problems described by matematical models oten requires solving dierential equations. Dierential

More information

The Priestley-Chao Estimator

The Priestley-Chao Estimator Te Priestley-Cao Estimator In tis section we will consider te Pristley-Cao estimator of te unknown regression function. It is assumed tat we ave a sample of observations (Y i, x i ), i = 1,..., n wic are

More information

Online Learning: Bandit Setting

Online Learning: Bandit Setting Online Learning: Bandit Setting Daniel asabi Summer 04 Last Update: October 0, 06 Introduction [TODO Bandits. Stocastic setting Suppose tere exists unknown distributions ν,..., ν, suc tat te loss at eac

More information

Continuity and Differentiability of the Trigonometric Functions

Continuity and Differentiability of the Trigonometric Functions [Te basis for te following work will be te definition of te trigonometric functions as ratios of te sides of a triangle inscribed in a circle; in particular, te sine of an angle will be defined to be te

More information

Probabilistic Graphical Models Homework 1: Due January 29, 2014 at 4 pm

Probabilistic Graphical Models Homework 1: Due January 29, 2014 at 4 pm Probabilistic Grapical Models 10-708 Homework 1: Due January 29, 2014 at 4 pm Directions. Tis omework assignment covers te material presented in Lectures 1-3. You must complete all four problems to obtain

More information

1 Proving the Fundamental Theorem of Statistical Learning

1 Proving the Fundamental Theorem of Statistical Learning THEORETICAL MACHINE LEARNING COS 5 LECTURE #7 APRIL 5, 6 LECTURER: ELAD HAZAN NAME: FERMI MA ANDDANIEL SUO oving te Fundaental Teore of Statistical Learning In tis section, we prove te following: Teore.

More information

Simulation and verification of a plate heat exchanger with a built-in tap water accumulator

Simulation and verification of a plate heat exchanger with a built-in tap water accumulator Simulation and verification of a plate eat excanger wit a built-in tap water accumulator Anders Eriksson Abstract In order to test and verify a compact brazed eat excanger (CBE wit a built-in accumulation

More information

Fast Exact Univariate Kernel Density Estimation

Fast Exact Univariate Kernel Density Estimation Fast Exact Univariate Kernel Density Estimation David P. Hofmeyr Department of Statistics and Actuarial Science, Stellenbosc University arxiv:1806.00690v2 [stat.co] 12 Jul 2018 July 13, 2018 Abstract Tis

More information

DIGRAPHS FROM POWERS MODULO p

DIGRAPHS FROM POWERS MODULO p DIGRAPHS FROM POWERS MODULO p Caroline Luceta Box 111 GCC, 100 Campus Drive, Grove City PA 1617 USA Eli Miller PO Box 410, Sumneytown, PA 18084 USA Clifford Reiter Department of Matematics, Lafayette College,

More information

Material for Difference Quotient

Material for Difference Quotient Material for Difference Quotient Prepared by Stepanie Quintal, graduate student and Marvin Stick, professor Dept. of Matematical Sciences, UMass Lowell Summer 05 Preface Te following difference quotient

More information

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning Long Term Time Series Prediction wit Multi-Input Multi-Output Local Learning Gianluca Bontempi Macine Learning Group, Département d Informatique Faculté des Sciences, ULB, Université Libre de Bruxelles

More information

Finding and Using Derivative The shortcuts

Finding and Using Derivative The shortcuts Calculus 1 Lia Vas Finding and Using Derivative Te sortcuts We ave seen tat te formula f f(x+) f(x) (x) = lim 0 is manageable for relatively simple functions like a linear or quadratic. For more complex

More information

Profit Based Unit Commitment in Deregulated Electricity Markets Using A Hybrid Lagrangian Relaxation - Particle Swarm Optimization Approach

Profit Based Unit Commitment in Deregulated Electricity Markets Using A Hybrid Lagrangian Relaxation - Particle Swarm Optimization Approach Profit Based Unit Commitment in Deregulated Electricity Markets Using A Hybrid Lagrangian Relaxation - Particle Swarm Optimization Approac Adline K. Bikeri, Cristoper M. Maina and Peter K. Kiato Proceedings

More information

Optimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems

Optimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems Comp. Part. Mec. 04) :357 37 DOI 0.007/s4057-04-000-9 Optimal parameters for a ierarcical grid data structure for contact detection in arbitrarily polydisperse particle systems Dinant Krijgsman Vitaliy

More information

Quantum Numbers and Rules

Quantum Numbers and Rules OpenStax-CNX module: m42614 1 Quantum Numbers and Rules OpenStax College Tis work is produced by OpenStax-CNX and licensed under te Creative Commons Attribution License 3.0 Abstract Dene quantum number.

More information

158 Calculus and Structures

158 Calculus and Structures 58 Calculus and Structures CHAPTER PROPERTIES OF DERIVATIVES AND DIFFERENTIATION BY THE EASY WAY. Calculus and Structures 59 Copyrigt Capter PROPERTIES OF DERIVATIVES. INTRODUCTION In te last capter you

More information

Excursions in Computing Science: Week v Milli-micro-nano-..math Part II

Excursions in Computing Science: Week v Milli-micro-nano-..math Part II Excursions in Computing Science: Week v Milli-micro-nano-..mat Part II T. H. Merrett McGill University, Montreal, Canada June, 5 I. Prefatory Notes. Cube root of 8. Almost every calculator as a square-root

More information

REVIEW LAB ANSWER KEY

REVIEW LAB ANSWER KEY REVIEW LAB ANSWER KEY. Witout using SN, find te derivative of eac of te following (you do not need to simplify your answers): a. f x 3x 3 5x x 6 f x 3 3x 5 x 0 b. g x 4 x x x notice te trick ere! x x g

More information

APPLICATION OF A DIRAC DELTA DIS-INTEGRATION TECHNIQUE TO THE STATISTICS OF ORBITING OBJECTS

APPLICATION OF A DIRAC DELTA DIS-INTEGRATION TECHNIQUE TO THE STATISTICS OF ORBITING OBJECTS APPLICATION OF A DIRAC DELTA DIS-INTEGRATION TECHNIQUE TO THE STATISTICS OF ORBITING OBJECTS Dario Izzo Advanced Concepts Team, ESTEC, AG Noordwijk, Te Neterlands ABSTRACT In many problems related to te

More information

THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS. L. Trautmann, R. Rabenstein

THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS. L. Trautmann, R. Rabenstein Worksop on Transforms and Filter Banks (WTFB),Brandenburg, Germany, Marc 999 THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS L. Trautmann, R. Rabenstein Lerstul

More information

Differentiation. Area of study Unit 2 Calculus

Differentiation. Area of study Unit 2 Calculus Differentiation 8VCE VCEco Area of stud Unit Calculus coverage In tis ca 8A 8B 8C 8D 8E 8F capter Introduction to limits Limits of discontinuous, rational and brid functions Differentiation using first

More information

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT LIMITS AND DERIVATIVES Te limit of a function is defined as te value of y tat te curve approaces, as x approaces a particular value. Te limit of f (x) as x approaces a is written as f (x) approaces, as

More information

Differentiation in higher dimensions

Differentiation in higher dimensions Capter 2 Differentiation in iger dimensions 2.1 Te Total Derivative Recall tat if f : R R is a 1-variable function, and a R, we say tat f is differentiable at x = a if and only if te ratio f(a+) f(a) tends

More information

Blueprint Algebra I Test

Blueprint Algebra I Test Blueprint Algebra I Test Spring 2003 2003 by te Commonwealt of Virginia Department of Education, James Monroe Building, 101 N. 14t Street, Ricmond, Virginia, 23219. All rigts reserved. Except as permitted

More information

MAT244 - Ordinary Di erential Equations - Summer 2016 Assignment 2 Due: July 20, 2016

MAT244 - Ordinary Di erential Equations - Summer 2016 Assignment 2 Due: July 20, 2016 MAT244 - Ordinary Di erential Equations - Summer 206 Assignment 2 Due: July 20, 206 Full Name: Student #: Last First Indicate wic Tutorial Section you attend by filling in te appropriate circle: Tut 0

More information

Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Spike train entropy-rate estimation using hierarchical Dirichlet process priors publised in: Advances in Neural Information Processing Systems 26 (23), 276 284. Spike train entropy-rate estimation using ierarcical Diriclet process priors Karin Knudson Department of Matematics kknudson@mat.utexas.edu

More information

Solution for the Homework 4

Solution for the Homework 4 Solution for te Homework 4 Problem 6.5: In tis section we computed te single-particle translational partition function, tr, by summing over all definite-energy wavefunctions. An alternative approac, owever,

More information

Math 34A Practice Final Solutions Fall 2007

Math 34A Practice Final Solutions Fall 2007 Mat 34A Practice Final Solutions Fall 007 Problem Find te derivatives of te following functions:. f(x) = 3x + e 3x. f(x) = x + x 3. f(x) = (x + a) 4. Is te function 3t 4t t 3 increasing or decreasing wen

More information

New Fourth Order Quartic Spline Method for Solving Second Order Boundary Value Problems

New Fourth Order Quartic Spline Method for Solving Second Order Boundary Value Problems MATEMATIKA, 2015, Volume 31, Number 2, 149 157 c UTM Centre for Industrial Applied Matematics New Fourt Order Quartic Spline Metod for Solving Second Order Boundary Value Problems 1 Osama Ala yed, 2 Te

More information

FUNDAMENTAL ECONOMICS Vol. I - Walrasian and Non-Walrasian Microeconomics - Anjan Mukherji WALRASIAN AND NON-WALRASIAN MICROECONOMICS

FUNDAMENTAL ECONOMICS Vol. I - Walrasian and Non-Walrasian Microeconomics - Anjan Mukherji WALRASIAN AND NON-WALRASIAN MICROECONOMICS FUNDAMENTAL ECONOMICS Vol. I - Walrasian and Non-Walrasian Microeconomics - Anjan Mukerji WALRASIAN AND NON-WALRASIAN MICROECONOMICS Anjan Mukerji Center for Economic Studies and Planning, Jawaarlal Neru

More information

Robotic manipulation project

Robotic manipulation project Robotic manipulation project Bin Nguyen December 5, 2006 Abstract Tis is te draft report for Robotic Manipulation s class project. Te cosen project aims to understand and implement Kevin Egan s non-convex

More information