
Random Forests

Sören Mindermann
September 21, 2016

Bachelor project
Supervision: dr. B. Kleijn

Korteweg-de Vries Instituut voor Wiskunde
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Universiteit van Amsterdam

Abstract

Tree-based classification methods are widely applied tools in machine learning that create a taxonomy of the space of objects to classify. Nonetheless, their workings are not documented to a high degree of formalism. The present paper places them into the mathematical context of decision theory, examines them from the perspective of Bayesian classification, where the probability of misclassification is minimized, and covers the example of handwritten character recognition. Tree classifiers exhibit low bias but high variance. Pruning of trees reduces variance by decreasing the complexity of the classification model. The random forest algorithm further reduces variance by combining multiple trees and aggregating their votes. To make use of the Law of Large Numbers, a sample from the space of tree classifiers is taken. Both the tree construction and the selection of training data are randomized for this purpose. Given a measure of the strength of individual trees and of the dependence between them, an upper bound for the probability of misclassification of a random forest with an infinite number of trees can be derived.

Title: Random Forests
Author: Sören Mindermann, soeren.mindermann@gmail.com
Supervision: dr. B. Kleijn
Completion date: September 21, 2016

Korteweg-de Vries Instituut voor Wiskunde
Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam

Contents

1 Introduction to Decision Theory
  1.1 Decision Theory
    1.1.1 Frequentist Decision Theory
    1.1.2 Bayesian Decision Theory
  1.2 Classification

2 Classification with Decision Trees
  2.1 Supervised learning
  2.2 Tree classifiers
    2.2.1 Construction of tree classifiers
    2.2.2 Bias and Variance
    2.2.3 Model overfitting
    2.2.4 Pruning

3 Random Trees and Forests
  3.1 Bootstrapping
  3.2 Randomized trees for character recognition
    3.2.1 Introduction
    3.2.2 Feature extraction
    3.2.3 Exploring shape space
    3.2.4 Feature randomization
  3.3 Random Forests
    3.3.1 The random forest algorithm
    3.3.2 Convergence of random forests

4 Conclusion

5 Popular summary

Bibliography

1 Introduction to Decision Theory

Many situations require us to make decisions under uncertainty. The first chapter builds a mathematical framework for such decision-theoretic problems, first from a frequentist and then from a Bayesian point of view, which lays the foundation for the theory of classification. In chapter 2, we introduce tree classifiers, an applied algorithm used to classify objects. Chapter 3 then turns to an application of tree classifiers and examines random forests, a combination of randomized tree classifiers that improves the accuracy of classification.

1.1 Decision Theory

Consider, as an example, the diagnosis of an illness: decisions can be rated according to certain optimality criteria. For a severe illness such as cancer, a false positive diagnosis can be much more harmful than a false negative, which is different from a less severe illness. Such differences are not taken into account in the classical framework of statistical inference. In that framework, we work with concepts such as Type-I and Type-II errors as well as confidence intervals, but these don't allow us to directly assign a value to how unfavorable certain errors are. Statistical decision theory solves this by introducing a loss-function, which quantifies the impact of an incorrect decision.

The mathematical set-up for decision-theoretic problems is similar to that of a regular problem of statistical inference, with mostly semantic differences. Instead of a model we speak of a state-space Θ and an unknown state θ ∈ Θ. The observation Z takes a value in the sample space Z, which is measurable with σ-algebra B. Z is a random variable with distribution P_θ : B → [0, 1] for a given state θ ∈ Θ. We take a decision a from the decision-space A based on the realization z of Z. There is an optimal decision for each state θ of the system, but since θ is unknown the decision made can be suboptimal. Statistical decision theory attempts to decide in an optimal way given only the observation Z. The decision procedure can be seen as a way of estimating the optimal decision. The main feature that distinguishes decision theory from statistical inference is the loss-function.

Definition 1.1. A loss-function can be any lower-bounded function L : Θ × A → R. The function −L is called the utility-function.

Given a particular state θ of the system, a loss L(θ, a) is incurred for an action a. The loss can be positive or negative. In the latter case it is a profit, which usually signifies that the decision was optimal or near-optimal. Since we do not have θ but only the data Z to form the decision, a decision rule that maps observations to decisions is needed.

Definition 1.2. Let A be a measurable space with σ-algebra H. A measurable function δ : Z → A is called a decision rule. The space of decision rules under consideration is denoted by Δ.

Clearly, the decision theorist's goal is to find a δ that is optimal at minimizing the loss according to some principle. Again, the decision rule can be compared to the estimator in the case of statistical inference, which maps observations to estimates of a parameter in the frequentist case or to beliefs about a parameter in the Bayesian case. Having set up the common foundation for decision-theoretic problems, we now examine both cases separately, starting with frequentist decision theory.

1.1.1 Frequentist Decision Theory

The frequentist assumes that Z is distributed according to a true distribution P_θ_0 for some state θ_0 ∈ Θ. A key concept for analyzing optimality of decision rules is the expected loss.

Definition 1.3. The risk-function R : Θ × Δ → R is defined as the expected loss under P_θ when δ is the decision rule,

    R(θ, δ) = ∫_Z L(θ, δ(z)) dP_θ(z).    (1.1)

Of course, only the expected loss under the true distribution is of interest to the frequentist. Still, we need a way to compare the risks of two decision rules for any parameter θ. This motivates the following definition.

Definition 1.4. Given are the state-space Θ, the states P_θ (θ ∈ Θ), the decision space A and the loss L. For δ_1, δ_2 ∈ Δ, δ_1 is called R-better than δ_2 if for all θ ∈ Θ

    R(θ, δ_1) < R(θ, δ_2).    (1.2)

A decision rule δ is admissible if there exists no δ′ which is R-better than δ (and inadmissible otherwise).

The ordering induced on Δ by this definition is merely partial. It is possible that two decision rules δ_1, δ_2 exist such that neither is R-better than the other. Therefore we define the minimax decision principle, which picks the decision rule that is optimal in the worst possible case.

Definition 1.5. Let Θ be the state-space, P_θ (θ ∈ Θ) the states, A the decision space and L the loss. The minimax risk is given by the function

    δ ↦ sup_{θ ∈ Θ} R(θ, δ).    (1.3)

Let δ_1, δ_2 ∈ Δ be given. Then δ_1 is minimax-preferred to δ_2 if

    sup_{θ ∈ Θ} R(θ, δ_1) < sup_{θ ∈ Θ} R(θ, δ_2).    (1.4)

If δ_M minimizes δ ↦ sup_{θ ∈ Θ} R(θ, δ), then it is called a minimax decision rule.

The existence of such a δ_M is guaranteed by the minimax theorem under very general conditions, which can be found in [1].

1.1.2 Bayesian Decision Theory

In Bayesian decision theory we incorporate the prior by integrating over Θ. This approach is more balanced than minimizing the maximum of the risk function over Θ: because a prior over Θ is given, taking it into account does not only consider the worst possible state, which makes it less pessimistic than the minimax decision rule.

Definition 1.6. Let Θ be the state-space, P_θ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. The Bayesian risk function is given by

    r(Π, δ) = ∫_Θ R(θ, δ) dΠ(θ).    (1.5)

Furthermore, let δ_1, δ_2 ∈ Δ be given. Then δ_1 is Bayes-preferred to δ_2 if

    r(Π, δ_1) < r(Π, δ_2).    (1.6)

If δ_Π minimizes δ ↦ r(Π, δ), then δ_Π is called a Bayes rule. r(Π, δ_Π) is called the Bayes risk.

The Bayes risk is always upper bounded by the minimax risk, because the minimax risk equals the Bayes risk for a pessimistic prior that puts all probability mass on a state θ ∈ Θ that maximizes the risk function.

Definition 1.7. Let Θ be the state-space, P_θ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. We define the decision rule δ* : Z → A to be such that for all z ∈ Z,

    ∫_Θ L(θ, δ*(z)) dΠ(θ | Z = z) = inf_{a ∈ A} ∫_Θ L(θ, a) dΠ(θ | Z = z).    (1.7)

In other words, δ*(z) minimizes the posterior expected loss point-wise for every z. Given that δ* is defined as a point-wise minimizer, the question arises whether it exists and is unique. These cannot be proven in the general case. However, given the existence of δ*, the following theorem establishes that it is indeed a Bayes rule.

Theorem 1.8. Let Θ be the state-space, P_θ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. Assume there is a σ-finite measure µ : B → R such that P_θ is dominated by µ for all θ ∈ Θ. If the decision rule δ* is well-defined, then δ* is a Bayes rule.

The proof uses the Radon-Nikodym theorem as well as Fubini's theorem and is given in Kleijn [1].
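To make these definitions concrete, the following sketch (not part of the original text; the diagnosis problem, its probabilities and its losses are invented purely for illustration) computes the risk function, the minimax rule of Definition 1.5 and the Bayes rule of Definition 1.6 by brute force for a finite state space, a finite sample space and the finite set of all decision rules.

```python
# Illustrative sketch: brute-force computation of the minimax rule and the
# Bayes rule for a finite decision problem. All numbers below are made up.
from itertools import product

states = ["healthy", "ill"]               # Theta
observations = ["negative", "positive"]   # Z (a diagnostic test result)
actions = ["no treatment", "treatment"]   # A

# P_theta(z): distribution of the observation given the state.
P = {"healthy": {"negative": 0.9, "positive": 0.1},
     "ill":     {"negative": 0.2, "positive": 0.8}}

# Loss L(theta, a): treating a healthy patient is bad, missing an illness is worse.
L = {("healthy", "no treatment"): 0.0, ("healthy", "treatment"): 1.0,
     ("ill", "no treatment"): 5.0,     ("ill", "treatment"): 0.0}

# A decision rule delta maps each observation to an action; enumerate all of them.
rules = [dict(zip(observations, choice))
         for choice in product(actions, repeat=len(observations))]

def risk(theta, delta):
    """R(theta, delta) = expected loss under P_theta, as in equation (1.1)."""
    return sum(P[theta][z] * L[(theta, delta[z])] for z in observations)

# Minimax rule: minimize the worst-case risk over states, as in (1.3).
minimax_rule = min(rules, key=lambda d: max(risk(th, d) for th in states))

# Bayes rule for a prior Pi on the states: minimize the integrated risk (1.5).
prior = {"healthy": 0.95, "ill": 0.05}
bayes_rule = min(rules, key=lambda d: sum(prior[th] * risk(th, d) for th in states))

print("minimax rule:", minimax_rule)
print("Bayes rule:  ", bayes_rule)
```

With a very skewed prior the Bayes rule may accept a higher worst-case risk than the minimax rule, which illustrates the difference between the two principles.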

1.2 Classification

A problem of classification arises in the context where there is a probability space (Ω, F, P) containing objects to classify, a set C = {1, 2, ..., c} of classes and a random variable Y : Ω → C that maps each object to a class. Furthermore, each object has observable features, which gives rise to a random variable V : Ω → V, where V = {0, 1}^M is the feature space and M the total number of features. In what follows, the feature space can be ignored in most cases. In the language of decision theory, C is the state-space.

The problem to solve is that the class Y(x) is not known for each object x ∈ Ω. Therefore, a decision rule Ŷ : Ω → C, normally called a classifier, is required. Ŷ can be seen as an estimate for the true class function Y. Ŷ can also be written as a function V → C. A loss function L : C × C → R is given. When not specified, we define it by default as

    L(k, l) = 1_{k ≠ l},    (1.8)

i.e. there is a loss of 1 for each misclassification. According to the minimax decision principle we look for a classifier δ_M : Z → C that minimizes

    δ ↦ sup_{k ∈ C} ∫_Z L(k, δ(z)) dP(z | Y = k) = sup_{k ∈ C} P(δ(Z) ≠ k | Y = k);    (1.9)

in other words, the principle demands that the probability of misclassification is minimized uniformly over all classes.

The Bayesian analogue requires a prior on the state-space, which here is the class space C. The prior is given by the marginal distribution of Y. For practical purposes, it can often be approximated by the relative frequencies of the classes {1, 2, ..., c} in a sample. Otherwise, equal probability is assigned to each class. Given the probabilities P(Y = k), the prior is thus defined with respect to the counting measure as

    π(k) = P(Y = k).    (1.10)

For classification problems, the Bayes rule is the minimizer of

    δ ↦ Σ_{k ∈ C} L(k, δ(z)) Π(Y = k | Z = z) = Π(δ(z) ≠ Y | Z = z)    (1.11)

for every z ∈ Z. By theorem 1.8, δ* minimizes the Bayes risk, which in this case is the probability of misclassification:

    r(Π, δ) = Σ_{k ∈ C} R(k, δ) π(k)
            = Σ_{k ∈ C} ∫_Z L(k, δ(z)) dP(z | Y = k) π(k)
            = Σ_{k ∈ C} P(δ(Z) ≠ k | Y = k) P(Y = k)
            = P(δ(Z) ≠ Y).

Thus, in Bayesian classification, as opposed to minimax classification, we minimize the overall probability of misclassification without referring to the class of the object.
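As an illustration (not from the thesis; the class-conditional distributions below are invented), under 0-1 loss the Bayes classifier simply picks the class with the largest posterior probability, δ*(z) = argmax_k π(k) P(z | Y = k):

```python
# Sketch: the Bayes classifier under 0-1 loss maximizes the posterior class
# probability. Prior and likelihoods are made up for illustration.
import numpy as np

classes = [0, 1, 2]
prior = np.array([0.5, 0.3, 0.2])              # pi(k) = P(Y = k)

# P(z | Y = k) for a single binary observation z in {0, 1}; rows are classes.
likelihood = np.array([[0.9, 0.1],
                       [0.4, 0.6],
                       [0.2, 0.8]])

def bayes_classify(z):
    """Return the class that minimizes the posterior probability of error."""
    joint = prior * likelihood[:, z]           # proportional to the posterior
    return classes[int(np.argmax(joint))]

for z in (0, 1):
    print(f"observation z={z} -> class {bayes_classify(z)}")

# The Bayes risk P(delta(Z) != Y) of this rule:
risk = sum(prior[k] * likelihood[k, z]
           for z in (0, 1) for k in classes if bayes_classify(z) != k)
print("Bayes risk:", risk)
```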

2 Classification with Decision Trees

Classification trees are classifiers, i.e. they map an object to a class based on the features of that object. In the context of this chapter, a (rooted) tree is built from a set of labelled training data l = {x_1, x_2, ..., x_n} (where x_k ∈ Ω for k ≤ n) with known class labels Y(x_1), Y(x_2), ..., Y(x_n). This allows it to classify data from the general population Ω from which the training data stem. A tree classifies objects by creating a taxonomy, i.e. a partition of the feature space V. At each node, the training data are split into a subset that does and a subset that does not have a certain feature.

We work in a Bayesian set-up, i.e. we try to minimize the probability of misclassification. However, no Bayesian updating procedure will be considered. Note that there is usually an unmanageably large number of possible decision rules, and there is no general procedure for finding the minimizer of the misclassification probability in a reasonable amount of time, so it is usually intractable to find the Bayes classifier. This chapter describes the construction of tree classifiers as an example of supervised learning, introduces the trade-off of bias and variance as well as overfitting, and then examines methods of reducing variance.

2.1 Supervised learning

In supervised learning, a set of training data l = {x_1, ..., x_n}, where x_1, ..., x_n are assumed to be an i.i.d. sample from Ω, is given alongside class labels Y(x_1), ..., Y(x_n). Then a function T analyzes l and maps it to a classifier t : Ω → C such that t can generalize to different input, i.e. such that it can classify new examples. Many techniques exist to construct such classifiers. In what follows, we will focus on decision trees.

2.2 Tree classifiers

A tree classifier, or decision tree t, built from training data l, classifies an object x by asking a series of queries about the object's feature vector. In the case at hand, each object is assumed to be fully defined by a vector of binary features. Thus, Ω = {0, 1}^M, where M is the total number of features. The classification function Y : Ω → C is deterministic, contrary to the default set-up in 1.2. Each node in the tree represents a query, and depending on the answer the object moves to one of the (usually two) children, where another query is posed. The leaves thus represent a partition of the feature space. This recursive partitioning is also called a taxonomy. An object x ends up at a leaf as it descends down the tree t. Since each leaf is assigned a class label, the tree can then be seen as a random variable t : Ω → C.
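To make the query-by-query descent concrete, the following sketch (an invented toy example, not taken from the thesis) represents a small tree over binary features as nested nodes and classifies a feature vector by descending to a leaf:

```python
# Toy sketch of a tree classifier over binary feature vectors: each internal
# node queries one feature, each leaf carries a class label. The tree and the
# feature indices are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None   # index of the binary feature queried here
    yes: Optional["Node"] = None    # child followed when the feature is 1
    no: Optional["Node"] = None     # child followed when the feature is 0
    label: Optional[int] = None     # class label; set only at leaves

def classify(tree: Node, x: list) -> int:
    """Descend from the root to a leaf and return that leaf's class label."""
    node = tree
    while node.label is None:
        node = node.yes if x[node.feature] == 1 else node.no
    return node.label

# A hand-built tree over M = 3 binary features and classes {0, 1}.
tree = Node(feature=0,
            yes=Node(label=1),
            no=Node(feature=2,
                    yes=Node(label=1),
                    no=Node(label=0)))

print(classify(tree, [1, 0, 0]))  # -> 1
print(classify(tree, [0, 0, 0]))  # -> 0
```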

2.2.1 Construction of tree classifiers

As we will see later on, the number of possible trees grows exponentially with the number of features in Ω. Therefore, it is normally infeasible to find the optimal tree given a set of training data l. Nonetheless, algorithms have been developed to find reasonably accurate trees. A general class of algorithms, Hunt's algorithm, works according to the following recursive definition for a given node m and the subset l_m of l which arrives at the node:

Step 1: If all objects in l_m have the same class, m is a leaf.

Step 2: If not all objects in l_m belong to the same class, a query is selected which further splits l_m into child nodes, based on the answer to the query.

It is possible that one of the child nodes created in step 2 is empty, i.e. none of the training data is associated with the node. In this case, the node will be a leaf. Hunt's algorithm leaves two questions open which are crucial for the design of a tree classifier. Firstly, a splitting criterion has to be selected in step 2. Secondly, a stopping criterion other than the one in step 1 should be defined. It is possible that a large number of splits is needed before the remaining training data belong to the same class. N splits would mean that the tree has at least 2^N − 1 nodes if the process is not stopped at any branch. Depending on the number of training data, the amount of available computing power and the accuracy needed, the tree can quickly grow too large and become computationally unmanageable.

Splitting criterion

We will therefore set up a framework for analyzing the probability distributions at each node. The joint probability distribution over Ω × C at node m is taken to be known, as it can be estimated by the empirical distribution of the examples at m. Then there is a marginal distribution P(k | m) over the classes. Let A be a binary feature to split on, which can take values a_0 = 0 or a_1 = 1. This gives a new distribution

    P(k | m, A = a_i) = P(A = a_i | m, k) P(k | m) / P(A = a_i | m)    (2.1)

over the classes at each child node (by Bayes' theorem). All of these probabilities are estimated by the empirical distributions of the training data. It should be noted that trees can split on binary, nominal, ordinal (discrete) or continuous features. For simplicity we focus on binary features, although the analysis can be extended to other attribute types. Note that nominal features and (finite-dimensional) ordinal and continuous features can be approximated by multiple binary splits.

Let P(k | m) denote the probability that x has class k given that it ends up at node m. To classify x with maximal accuracy we would like this probability to be as close as

possible to 0 or to 1, all else being equal. The least favorable case would be a uniform distribution over both classes. This intuition is formalized via impurity measures of a distribution. A common requirement for a measure of impurity is that it is zero when one class has all probability mass and maximal when the class distribution is uniform. Three common measures are the Shannon entropy

    H(m) = − Σ_{k=1}^{c} P(k | m) log_2 P(k | m),    (2.2)

where we define 0 log_2 0 = 0, the Gini index

    G(m) = Σ_{i ≠ j} P(i | m) P(j | m) = 1 − Σ_{k=1}^{c} [P(k | m)]^2,    (2.3)

and the classification error

    P_E(m) = 1 − max_k P(k | m).    (2.4)

The entropy can be understood as the expected amount of information (for some information measure) contained in a realization k of a random object drawn from the population at m. The Gini index can be viewed as the expected error when the class is randomly chosen from C with distribution P(· | m). The feature to split on is then selected at each node in a locally optimal way, namely such that it minimizes the remaining impurity according to one of these measures. Rather than merely minimizing the impurity of the data at one child node of m, the splitting criterion should minimize the expected impurity across all child nodes.

Minimizing the entropy or the Gini index is not the optimal way of splitting just before the leaves. As we have seen in chapter 1, the Bayesian classifier picks the class k with the highest probability P(k | m). If the child nodes of the node m will be leaves, it makes sense to minimize the classification error itself. However, the classification error only maximizes the largest class probability, whereas the other measures offer a more general reduction of impurity across classes. Entropy and Gini index are therefore more useful for also minimizing impurity several splits onward. It is easy to see that all measures are indeed zero when the probability is concentrated on one class. Figure 2.1 shows that they are maximal for a uniform distribution when there are two classes.

Finally, the feature A with possible values a_1, ..., a_d to split on is selected as the maximizer of the expected reduction in the impurity measure i, the information gain

    i(P(· | m)) − Σ_{k=1}^{d} P(a_k | m) i(P(· | a_k, m)).    (2.5)

When the classification problem is binary and the training data are split into two children at each split, the impurity measures can also be visualized as a function of p_i, the fraction of objects with class i, as shown in figure 2.1.
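The following sketch (illustration only; the class counts at the node are made up) computes the three impurity measures (2.2)-(2.4) and the information gain (2.5) for a candidate binary split from empirical class counts:

```python
# Sketch: impurity measures and information gain for a binary split, all
# estimated from class counts at a node. The counts below are invented.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    return 1.0 - np.max(p)

def information_gain(parent_counts, children_counts, impurity=entropy):
    """i(P(.|m)) - sum_k P(a_k|m) * i(P(.|a_k, m)), all estimated empirically."""
    parent = np.asarray(parent_counts, dtype=float)
    gain = impurity(parent / parent.sum())
    for child in children_counts:
        child = np.asarray(child, dtype=float)
        gain -= (child.sum() / parent.sum()) * impurity(child / child.sum())
    return gain

# Node m with 40 objects of class 1 and 40 of class 2, split by a binary feature A.
parent = [40, 40]
children = [[30, 10],   # objects with A = 0
            [10, 30]]   # objects with A = 1

for name, imp in [("entropy", entropy), ("gini", gini), ("error", classification_error)]:
    print(name, "gain:", round(information_gain(parent, children, imp), 4))
```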

Figure 2.1: Comparison of impurity measures for binary classification problems [6].

Note that the uniform distribution is indeed the value p_1 = 0.5 which maximizes all three measures, and both p_1 = 1 and p_2 = 1 indeed minimize all measures, as required. Empirical studies have demonstrated that different measures of impurity yield similar results. This is due to the fact that they are quite consistent with each other [4], as figure 2.1 suggests. Another feature revealed here is that these functions are concave. The following proposition assures that the information gain is always non-negative when the impurity i(P) is strictly concave.

Proposition 2.1. Let P be the space of probability measures on the space C, endowed with the σ-algebra P(C), the power set of C. Let i : P → R be a strictly concave impurity measure. Then a split on any feature A results in a non-negative information gain. The information gain is zero if and only if the distribution over classes P(· | A = a_i) is the same in all children.

Proof. We denote by P(· | m) the distribution over classes at m. Now Jensen's inequality assures that for concave i,

    Σ_{i=0,1} P(a_i | m) i(P(· | a_i)) = E_{A|m} i(P(· | A))
                                       ≤ i(E_{A|m} P(· | A))
                                       = i(Σ_{i=0,1} P(a_i | m) P(· | a_i))
                                       = i(P(· | m)),    (2.6)

with equality if and only if P(c | a_i) = P(c | m) for all i and c, i.e. if and only if the distribution over classes is the same in all children.

Impurity can also be viewed at the level of the entire tree as

    I(T) = Σ_{m ∈ M(T)} P(m) i(P(· | m)),    (2.7)

where M(T) is the set of leaves of T. If the impurity measure taken is the classification error, then this gives the classification error on the training set, or the general probability of misclassification if P is not estimated by the empirical distribution of the training data at each node.

2.2.2 Bias and Variance

In supervised learning two types of error are distinguished. In the context of classification, the training error describes the share of misclassifications made on the training data,

    E_T = (1 / |l|) Σ_{x ∈ l} 1_{T(x) ≠ Y(x)},    (2.8)

where l = {X_1, X_2, ..., X_n} are the training data, drawn i.i.d. from Ω. By contrast, the generalization error refers to the expected share of misclassifications on new data.

Definition 2.2. The generalization error of a classifier T : Ω → C, denoted P_E, is

    P_E = P(T(X) ≠ Y(X)) = E_X L(Y(X), T(X)),    (2.9)

where L is the loss function L(k, l) = 1_{k ≠ l}.

Bias and variance can be explained in the context of the space of classifiers when there is a norm on this space. Usually the model T, which includes all classifiers under consideration, does not include the true classification function Y. Assuming that T is closed and convex, there exists a T* ∈ T which minimizes ‖T − Y‖. The model bias is then defined as ‖T* − Y‖. Since the algorithm for finding a classifier does not necessarily find T*, the chosen classifier T̂ varies around T*. This allows us to define variance as E‖T̂ − T*‖^2. A natural consequence of this perspective is that larger models will exhibit a smaller bias, as ‖T* − Y‖ is likely to be smaller. However, they also allow for greater fluctuation of T̂ around T*, increasing variance. A large model is given when, e.g., there is a large number of parameters. As this is the case for deep trees, they tend to have high variance.

Since no default norm on T is actually given, bias and variance are hard to quantify. Following the usual definition, the bias would be defined as E_X d(Y(X), T(X)), the expected distance between the classification function and the classifier in some metric d. Neither Amit and Geman nor Breiman specify the metric, and instead the bias is taken to be the expectation of the training error, E_{X_1, X_2, ..., X_n} E_T.

The error due to variance is the amount by which the prediction based on one training set differs from the expected prediction over all possible training sets. In cases where T(X) takes a numerical value this can be written as E_X (T(X) − E_L T(X))^2, where E_L denotes the expectation over training sets. However, given that the class label is not generally a numerical variable, neither the expected class nor the difference between classes is clearly defined. Amit and Geman [3] as well as Breiman [5] therefore do not quantify variance, or simply define it as P_E − E_T. Variance can thus be seen as the component of the generalization error that stems from feeding the classifier previously unknown data.

2.2.3 Model overfitting

When variance increases too much due to a large model, the result is considered overfitting. When grown too large, tree classifiers are prone to overfitting, as they have many parameters, which contribute to a large model. On the one hand, the tree-building procedure can continue until every unique example in the training set ends up in its own leaf. By assigning each leaf the class of its particular example, the tree can classify the whole training set correctly. On the other hand, a tree built this way will depend strongly on the training data used, so variance will be high. Thus, when a tree is grown larger, the training error (or bias) decreases while variance tends to increase, as displayed in figure 2.2.

Figure 2.2: Relationship of bias, variance, generalization error and model complexity [6].

From this observation we can conclude that there is an optimal amount of model complexity which minimizes the generalization error. While low bias is attained by sufficiently large trees, the increase in variance cannot be rigorously shown in the general case, and there is no general agreement on the precise reasons for overfitting [3].
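As an illustration of this trade-off (not part of the thesis; it relies on scikit-learn and a synthetic dataset as assumptions), the gap between the training error and the hold-out error typically widens as a tree is grown deeper:

```python
# Sketch: training error keeps falling as a tree grows, while the hold-out
# (generalization) error eventually rises again. Synthetic data, scikit-learn
# trees; purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for depth in (1, 2, 4, 8, 16, None):   # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(f"max_depth={depth}: training error {train_err:.3f}, "
          f"hold-out error {test_err:.3f}")
```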

2.2.4 Pruning

As variance is a key challenge in decision tree learning, methods to reduce a tree's complexity while preserving accuracy on the training set are highly sought after. Pruning is one such method. It works by removing the branches of a fully-grown tree which add little to reducing bias but do increase variance. We will discuss pre-pruning and cost-complexity pruning (the most well-known form of pruning).

Pre-pruning reduces the size of a tree by specifying a stopping criterion (for an overview, see [12]). By default, a node is not split further if all examples at the node belong to the same class; with pre-pruning, induction stops when either this or the additional stopping criterion is met, whichever comes first. Pre-pruning is more computationally parsimonious because it prevents the further induction of a tree and does not have a post-pruning step. Ideally, pre-pruning methods stop the induction when no further accuracy gain is expected. However, in practice they filter out both relevant and irrelevant splits. For this reason, they have mostly been abandoned, except for large-scale applications where the training set is too large to grow a full tree. Various stopping criteria are possible; e.g. Amit and Geman [3] stop the induction when the number of examples at a leaf drops below a threshold.

Regular pruning (also called post-pruning) reduces a fully-grown tree T to a smaller tree T′ by removing sub-trees and replacing them with leaves. Minimal cost-complexity pruning (MCCP), proposed by Breiman et al. in 1984 [9], is an early, interesting and widely used method which we will examine in detail.

Let T be the space of possible trees, let L ⊂ Ω be any set of test examples used to evaluate the performance of a tree, and assume that the class labels of the examples in L are known. Furthermore, let s(T) be the size of a tree, measured by the number of its leaves. The map s is proportional to the number of nodes, which equals 2s − 1. Then, let R : T → R be a measure of a tree's performance that sums the contributions from the leaves,

    R(T) = Σ_{m ∈ M(T)} R(m),    (2.10)

where M(T) is the set of leaves of T. Breiman et al. let R be the number of misclassifications on the training set or on a test set. These measures too can be expressed in the form of (2.10) by letting R(m) be the number of examples that are misclassified at node m. When estimating P_E, a separate test set of labelled data is used. Such a set may be called either a pruning set or a cross-validation set, depending on whether it is also used to estimate the generalization error of the eventual pruned tree. R(m) is then defined as the number of examples from the test set arriving at node m that are misclassified. Similarly, if R were the entropy, then R(m) is the entropy of the distribution P(k | m), estimated by P_L(k | m) as usual. The cost-complexity function is then defined as

    R_α(T) = R(T) + α s(T),    (2.11)

where α > 0 scales the penalty assigned to more complex trees in terms of the number of leaves s(T). A method to choose α will be discussed below. MCCP chooses a subtree T′ of the full tree T_0 which minimizes R_α. Naturally, α = 0 should be used if R(T) is the generalization error estimated with a test set, since R_α is supposed to approximate P_E; the size of a tree need only be penalized because it increases the generalization error.

Breiman et al. showed that for the full tree T_0 there is a set of nested trees T_k which minimize R_α on the intervals [α_k, α_{k+1}), for some partition of [0, ∞) into such intervals (with {α_k} ⊂ [0, ∞) increasing). Moreover, as we will show, a simple algorithm can find such a sequence.

There may be multiple subtrees of T_0 which minimize R_α. If one of these is a sub-tree of all the others, call it T(α). We will see an algorithm to find T(α). Let T be a tree with more than one node and let T_m be the sub-tree of T with root m. We write R(m) for the R-measure of the node m when T is pruned at m. Similarly, s(m) is always 1. Define the function g, which compares the increase in R with the reduction in size when pruning at m, as

    g(m, T) = (R(m) − R(T_m)) / (s(T_m) − s(m)).    (2.12)

Note that when g(m, T) > α then R_α(m) > R_α(T_m), and vice versa. For a rooted tree T, define the set of children of a node m as all the nodes which are connected to the root via m and are one edge further away from the root than m. Then, for a rooted tree T_m, the set of branches of T_m is defined as B(T_m) = {T_b : b is a child of m}. From (2.10) it follows that we can write R_α(T) = Σ_{T_b ∈ B(T)} R_α(T_b). For the proof of the following proposition it is important to note that any node in the original tree T_0 has either more than one child or none, by Hunt's algorithm (see 2.2.1). The same is true for a pruned sub-tree of T_0, because pruning at a node removes all of its branches. This guarantees that any subtree T′ of a tree T with the same root node as T has as many branches as T if it has been created by pruning T.

Proposition 2.3. Let α ∈ R and create an enumeration of the nodes of a tree T such that each parent node comes after its child nodes. Consider each node in order and prune at node m if R_α(m) ≤ R_α(T′_m) for the remaining tree T′. The resulting tree is T(α).

Proof. Firstly, a tree T with root node m will be called optimally pruned if any sub-tree that can be created by pruning T and that is also rooted at m has a strictly larger value of R_α; this is what we denote by T(α). Now let T be a tree and let m_n, n = 1, ..., 2s(T) − 1, be an enumeration of its nodes such that each node precedes its parent. We prove the proposition by induction on n. T_{m_1} is optimally pruned since it is a single node. Let n > 1, write T′ for the current tree, and assume that all trees T′_{m_k} are optimally pruned for k < n. There are two options. If R_α(m_n) ≤ R_α(T′_{m_n}) we prune at m_n, otherwise T′_{m_n} remains. In the first case the resulting tree is trivial and therefore optimally pruned. In the second case, suppose a strict sub-tree T″_{m_n} of T′_{m_n} could be created by pruning T′_{m_n} such that R_α(T″_{m_n}) ≤ R_α(T′_{m_n}). T″_{m_n} would not be trivial, since otherwise we would have pruned at m_n. Thus

T″_{m_n} has the same number of branches as T′_{m_n}. Note that R_α(T″_{m_n}) = Σ_{b ∈ B(T″_{m_n})} R_α(b), because R_α(T) is a sum over the leaves of T. But then there must be a branch b″ ∈ B(T″_{m_n}) with R_α(b″) ≤ R_α(b′), where b′ denotes the corresponding branch of T′_{m_n}. By the assumption that T″_{m_n} can be created by pruning T′_{m_n}, b″ can be created by pruning b′. This is a contradiction with the assumption that all branches of T′_{m_n} were optimally pruned. It follows that T′_{m_n} is optimally pruned, and by induction the algorithm yields an optimally pruned tree.

Given a value of α, this allows us to find T(α). We now show how to find the sequence α_k and the tree sequence T_k. From now on we assume that s is monotonically increasing in the number of nodes of T.

Proposition 2.4. Let T be a tree and let α_1 be the minimum of g(m, T) over all nodes m of T that are not leaves. T is optimally pruned whenever α < α_1. When pruning every node m with g(m, T) = α_1, the result is T_1 = T(α_1). Furthermore, g(m, T_1) > α_1 for every non-leaf node of T_1.

Proof. That T is optimally pruned for α < α_1 follows from Proposition 2.3: if g(m, T) > α for every non-leaf m, then R_α(m) > R_α(T_m), so no node is pruned and T is already optimal. Now let α = α_1 and prune according to Proposition 2.3. Each time we prune at a node m, R_α(T_c) is not changed for any node c upstream of m. Thus R_α(m) ≤ R_α(T′_m) for the current tree T′ if and only if R_α(m) ≤ R_α(T_m) for the original tree, which is equivalent to g(m, T) ≤ α_1, i.e. g(m, T) = α_1. So the algorithm in Proposition 2.3 is equivalent to pruning each node with g(m, T) = α_1. Hence, the latter also results in T(α_1). Lastly, let m be a non-leaf node that remains after pruning with α = α_1. Then g(m, T_1) > α_1 follows from

    R(m) − R((T_1)_m) = R(m) − R(T_m) + [R(T_m) − R((T_1)_m)]
                      = g(m, T)[s(T_m) − s(m)] − α_1[s(T_m) − s((T_1)_m)]
                      > α_1[s(T_m) − s(m)] − α_1[s(T_m) − s((T_1)_m)]
                      = α_1[s((T_1)_m) − s(m)],

where the second equality uses R(m) − R(T_m) = g(m, T)[s(T_m) − s(m)] and the fact that pruning at α = α_1 does not change R_α_1, so that R(T_m) − R((T_1)_m) = −α_1[s(T_m) − s((T_1)_m)].

For the following proposition, recall that given an initial tree T_0, T(α) is a sub-tree that minimizes R_α and is a sub-tree of all other sub-trees of T_0 that also minimize R_α.

Proposition 2.5. Let β > α. Then T(β) is a sub-tree of T(α), and T(β) = T(α)(β).

Proof. Enumerate the nodes of T as in the proof of Proposition 2.3. We show by induction that T_{m_n}(β) is (weakly) a subtree of T_{m_n}(α) for every n, and therefore T(β) is a subtree of T(α). With Proposition 2.3 we then conclude that T(α)(β) = T(β).

For n = 1, T_{m_1} is a leaf, so the claim holds. Let n > 1 and assume that for k < n, T_{m_k}(β) is a subtree of T_{m_k}(α). At node m_n we prune T_{m_n}(α) if R_α(m_n) ≤ R_α(T_{m_n}(α)), and equivalently for T_{m_n}(β) with R_β. We consider two cases. If R_α(m_n) > R_α(T_{m_n}(α)), then T_{m_n}(β) is automatically a subtree of T_{m_n}(α), because either T_{m_n}(β) is trivial or all of its branches are subtrees of the corresponding branches in T_{m_n}(α) by the induction hypothesis. If R_α(m_n) ≤ R_α(T_{m_n}(α)), we need to show that R_β(m_n) ≤ R_β(T_{m_n}(β)), so that both T_{m_n}(α) and T_{m_n}(β) are trivial. Now T_{m_n}(α) minimizes R_α over subtrees of T_{m_n}, so R_α(T_{m_n}(α)) ≤ R_α(T_{m_n}(β)). Thus we have

    R_β(m_n) = R_α(m_n) + (β − α)s(m_n)
             ≤ R_α(T_{m_n}(α)) + (β − α)s(m_n)
             ≤ R_α(T_{m_n}(β)) + (β − α)s(m_n)
             = R_β(T_{m_n}(β)) − (β − α)[s(T_{m_n}(β)) − s(m_n)]
             ≤ R_β(T_{m_n}(β)).

Finally, T(β) = T(α)(β), because T(β) minimizes R_β over all rooted subtrees of T, which include the rooted subtrees of T(α).

These propositions suffice to find the sequence α_k and the corresponding trees T(α_k). Having found T(α_1), the algorithm in Proposition 2.3 can be applied to this tree to find T(α_2) (because g(m, T(α_1)) > α_1 for each non-leaf of T(α_1) by Proposition 2.4), and so forth. Eventually, T(α_p) will be the trivial tree for some p. For α_{i−1} < α < α_i, i = 1, ..., p, we have T(α) = T(α_{i−1}) by Propositions 2.4 and 2.5. Similarly, Proposition 2.5 shows that T(α) = T(α_p) for α > α_p.

Now that we can find the correct tree for each value of α, a reasonable value has to be found. In general, we seek to minimize the generalization error (although in some cases computation time should be minimized simultaneously). If we use a test set which is drawn i.i.d. from Ω, independently of the training set, and let R be the error rate, then R(T) already forms an unbiased estimate of the generalization error. We penalize size only because it contributes to variance, which in turn contributes to the generalization error, so in this case we can leave α = 0.

If a test set were drawn i.i.d. from Ω, the expectation of R(T(α)) would be the generalization error of T(α). Absent a test set, we therefore need to estimate this expectation with minimal bias and select α to minimize it. Breiman et al. suggest using cross-validation. The training set is split into a partition J, and for each part j ∈ J a tree is constructed from the remaining |J| − 1 parts, along with the corresponding sequence α_k and the trees T(α_k). This gives a piecewise constant function r_j(α) = R(j, T(α)). They then compute the average function α ↦ Σ_{j ∈ J} r_j(α) / |J| and select α as the minimizer of that function. Since the training set is disjoint from the test set for each j, r_j(α) is expected to be a reasonably unbiased estimate of the expectation of R(T(α)) (where the training set and test set are drawn i.i.d. from Ω). Averaging |J| cross-validation experiments reduces the variance of this estimate. If |J| is small, the size of the training set is considerably smaller (e.g. 2/3 of its original size for |J| = 3, as used by Breiman et al.), so the resulting trees

may have a higher error rate on the test set. But this does not mean that the relative estimates of R(T(α)) for different α are seriously biased. Therefore, the minimizing value of α does not necessarily change.

For reference, surveys of decision-tree pruning methods to avoid overfitting are given by Breslow and Aha [12] and Esposito et al. [13]. Some of the other typical pruning methods include reduced error pruning, pessimistic error pruning, minimum error pruning, critical value pruning, cost-complexity pruning, and error-based pruning. Quinlan and Rivest proposed using the minimum description length principle for decision tree pruning in [7].
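As an aside (not from the thesis), scikit-learn ships an implementation of minimal cost-complexity pruning. The sketch below, on a synthetic dataset, computes the sequence of effective α_k values, fits the corresponding nested pruned trees, and selects α by cross-validation, much as discussed above:

```python
# Sketch: minimal cost-complexity pruning with scikit-learn. Synthetic data,
# used only to illustrate the alpha_k sequence and the cross-validated choice.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The sequence of effective alphas at which the optimally pruned tree changes.
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# For each alpha_k, fit the pruned tree and estimate its error by cross-validation.
scores = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    scores.append(cross_val_score(tree, X, y, cv=5).mean())

best = alphas[int(np.argmax(scores))]
best_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best).fit(X, y)
print(f"chosen alpha = {best:.5f}, leaves = {best_tree.get_n_leaves()}")
```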

3 Random Trees and Forests

The final chapter starts with section 3.1 on bootstrapping, a commonly used technique in applied statistics that illustrates the principle behind the random selection of training data used in section 3.3. We continue with section 3.2, which introduces an application of trees and goes into the randomization of the feature selection. Finally, section 3.3 covers the random forest algorithm.

3.1 Bootstrapping

We start this chapter with an introduction to bootstrapping, a technique that will later be used on the training data of random forests. We illustrate with an example. Given i.i.d. random variables X_1, X_2, ..., X_n ~ D, we want to estimate the standard deviation of a statistic θ̂(X_1, X_2, ..., X_n). In a typical case, θ̂ might for example be the sample standard deviation, whose distribution is often hard to derive analytically. Writing the standard deviation of θ̂ as σ(D, n, θ̂) = σ(D) shows that it is merely a function of the underlying distribution D, as n and θ̂ are given parameters. The bootstrap procedure, which we will elaborate on shortly, yields an arbitrarily accurate estimate of σ(·) evaluated at D̂, the empirical distribution of the realizations of X_1, X_2, ..., X_n. Since D̂ is the non-parametric maximum likelihood estimate of D according to Efron [11], σ(D̂) serves as an estimate for σ(D). The estimate is not necessarily unbiased, but yields good results in practice [11].

The function σ(·) usually cannot be easily evaluated analytically. However, the bootstrap procedure below generates arbitrarily close estimates using the Law of Large Numbers.

1. Calculate the empirical distribution D̂ of the given sample, with distribution function

       D̂(t) = (1/n) Σ_{i=1}^{n} 1_{X_i ≤ t}.    (3.1)

2. Draw an i.i.d. bootstrap sample

       X*_1, X*_2, ..., X*_n ~ D̂    (3.2)

   and calculate θ̂* = θ̂(X*_1, X*_2, ..., X*_n).

3. Repeat step 2 a large number B of times independently. This yields bootstrap replications θ̂*_1, θ̂*_2, ..., θ̂*_B. Finally, calculate

       σ̂(D̂) = ( (1/(B−1)) Σ_{b=1}^{B} [θ̂*_b − θ̂*_M]^2 )^{1/2},    (3.3)

   where θ̂*_M refers to the mean value of the bootstrap replications.

As (3.3) is the sample standard deviation of the random variable θ̂* from step 2, the Law of Large Numbers guarantees that it converges D̂-a.s. to the standard deviation σ(D̂) of θ̂* as B grows. This proof is omitted.
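A minimal sketch of this procedure (illustration only; the data-generating distribution and the statistic are invented) estimates the standard deviation of the sample standard deviation by resampling:

```python
# Sketch of the bootstrap procedure above: estimate the standard deviation of a
# statistic (here the sample standard deviation) by resampling from the
# empirical distribution D-hat. The underlying distribution is made up.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100)     # realizations of X_1, ..., X_n

def statistic(x):
    return np.std(x, ddof=1)                      # theta-hat: sample standard deviation

B = 5000
replications = np.empty(B)
for b in range(B):
    # Step 2: an i.i.d. sample from D-hat is a sample drawn with replacement.
    resample = rng.choice(sample, size=sample.size, replace=True)
    replications[b] = statistic(resample)

# Step 3 / equation (3.3): the sample standard deviation of the replications.
print("bootstrap estimate of sigma(D):", np.std(replications, ddof=1))
```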

3.2 Randomized trees for character recognition

This section examines the classification problem of character recognition and discusses results from Amit and Geman [3], who applied feature randomization to classify images of characters with tree classifiers. Their method has two advantages: it makes tree classification possible when the set of features is unmanageably large, and it reduces the dependence between distinct trees (measured by (3.7)), making it more useful to combine multiple trees.

3.2.1 Introduction

Images of handwritten or machine-printed characters commonly have to be recognized in order to process text automatically. Examples include the automatic reading of addresses on letters, the scanning of books, and assistive technology for visually impaired users. A classifier program analyzes the features of an image of a character and classifies it as a particular character, so that a digital text is produced. Artificial neural networks, boosting algorithms such as AdaBoost, and tree classifiers are well-known methods for character recognition. Amit and Geman introduced a new approach in 1997 [3] by inducing a tree classifier and randomly selecting the features to consider splitting on during the induction.

Character recognition has various properties that are relevant for choosing the right classification approach. Including special symbols, there may be hundreds of classes and substantial variation of features within a class. Given that the images are binary with M pixels, the space of objects has 2^M elements and the feature space V is correspondingly large. Due to the large feature space, the challenge is to navigate it efficiently when searching for optima. This section starts with a description of the particular features that have to be extracted from binary images (section 3.2.2). Sections 3.2.3 and 3.2.4 then introduce the particular form of selecting splits for character recognition and explain how and why the splitting features are randomized.

3.2.2 Feature extraction

One of the main challenges for the classification of character images is the selection of the right kind of features used to distinguish the images. Amit and Geman extract particular features from pixel images by assigning a label to each pixel and letting the feature to split on be a particular geometric arrangement of tags. Each pixel is labelled with the 4x4 pixel grid which contains the pixel at its top left corner. Since there are 2^16 possible subimage types, they use a decision tree on a sample of images to narrow the space of subimages down to a set S of 62 tag types. The criterion for splitting at a node is dividing the 4x4 pixel subimages at the node as equally as possible into two groups. The resulting tags can loosely be described as "all black or white at the top", "black at the bottom", etc.

The eventual classification tree uses splits on geometric arrangements of such tags. These tag arrangements constitute the feature space V, which is constructed as follows. Any feature is a tag arrangement, which is a labelled directed graph. Each vertex is labelled with a tag s ∈ S and each edge is labelled with a type of relation that represents the relative position of two tags. The eight types of relations correspond to the compass headings north, northeast, east, etc. More formally, two vertexes are connected by an edge labelled k ∈ {1, 2, ..., 8} if the angle of the vector u − v is in [(k − 1/2) π/4, (k + 1/2) π/4], where u and v are the two locations in the image. Let A be the set of all possible arrangements with at most 30 tags; then V = {v_A : A ∈ A} is the feature space, where v_A : Ω → {0, 1} indicates whether a given object contains the arrangement A. This still leaves an unmanageably large number of features to consider splitting on at each node. The next section explains how we can navigate the feature space more efficiently.

3.2.3 Exploring shape space

As the feature space consists of graphs, there is a natural partial ordering on it: a graph precedes any of its extensions, where extensions are made by successively adding vertexes or labelled relations. Small graphs produce rough splits of shape space. More complex ones contain more information about the image, but few images contain them. Therefore, Amit and Geman start the tree induction by only considering small graphs, which they call binary arrangements, and search for the best split among their extensions. As we will see shortly, this makes sure that at each node only those more complex arrangements are considered which are likely to be contained in an image at that node.

A binary arrangement is a graph with only one edge and two connected vertexes. The set of binary arrangements will be denoted B ⊂ V. A minimal extension of an arrangement A is any graph created by adding exactly one edge, or one vertex and an edge connecting the vertex to A. Now the tree is induced following the recipe from section 2.2.1. At the root, the tree is split on the binary arrangement A_0 that most reduces the average impurity I. Amit and Geman use the Shannon entropy as the uncertainty measure, as opposed to the classification error or the Gini index. At the child node which answered "no" to the chosen arrangement, B is searched again, and so forth. At the other child node,

the minimal extension of A_0 which most reduces the uncertainty is chosen as the split. Continuing this pattern, at each node a minimal extension of the arrangement that last answered "yes" is chosen as the splitting criterion. When a stopping criterion is satisfied, the algorithm stops. Amit and Geman stop the process when the number of examples at a leaf is below a threshold which they do not further specify.

3.2.4 Feature randomization

Both pre-pruning and considering only binary arrangements and minimal extensions at each node restrict the computational resources needed considerably, but not sufficiently. At each node the number of features to consider is still very large - both the number of new binary arrangements and the number of extensions of arrangements. Amit and Geman suggest simply selecting a uniformly drawn random subset of the minimal extensions (or of the binary arrangements, if none has been picked yet) to consider at each node.

This randomization also makes it possible to construct multiple different random trees from the same training data l. Let the trees t_1, t_2, ..., t_N be a random sample from a probability space (T, A, P_T) of trees with some probability distribution P_T. If there were a norm on T, the Law of Large Numbers would guarantee that the sample mean converges to the population mean of the trees constructed with l, as long as the method of random feature selection generates independent, representative samples from this distribution.

Randomization also explains why the variance of the expected loss decreases as N increases. Let G(t) denote the generalization error of a classifier t. This is the expected loss as given in Definition 2.2. Assuming that the t_k are i.i.d. realizations of a random classifier T, representatively sampled from T,

    (1/N) Σ_{k=1}^{N} G(t_k) → E_T G(T)    (3.4)

as N → ∞, by the Law of Large Numbers. The same holds for any other function H(T), as long as its expected value exists and is finite. Although this does not prove that the combination of multiple trees has a lower expected generalization error than a single tree, the reduction in variance is suggestive, because much of the error of single trees is due to high variance [3].
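The following sketch (illustrative only; it uses a generic entropy-based split score over plain binary features rather than Amit and Geman's tag arrangements, and the data are invented) shows the core of the feature randomization: at each node, only a uniformly drawn random subset of the candidate features is scored, and the best feature of that subset is used as the split.

```python
# Sketch of feature randomization at a single node: instead of scoring every
# candidate feature, score only a uniformly drawn random subset and split on
# the best one. Split score and data are placeholders.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(X, y, feature):
    """Entropy reduction when splitting on one binary feature (column of X)."""
    mask = X[:, feature] == 1
    if mask.all() or (~mask).all():
        return 0.0
    n = len(y)
    return entropy(y) - (mask.sum() / n) * entropy(y[mask]) \
                      - ((~mask).sum() / n) * entropy(y[~mask])

def choose_split(X, y, n_candidates, rng):
    """Pick the best feature among a random subset of the candidate features."""
    candidates = rng.choice(X.shape[1], size=n_candidates, replace=False)
    return max(candidates, key=lambda f: information_gain(X, y, f))

# Invented binary data: 200 objects, 50 binary features, 3 classes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = rng.integers(0, 3, size=200)
print("split on feature", choose_split(X, y, n_candidates=8, rng=rng))
```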

3.3 Random Forests

As discussed in section 2.2.3, tree classifiers are prone to overfitting, i.e. to lowering bias at the expense of high variance. The larger the tree, the more likely overfitting becomes. Breiman [5] offers a solution to this problem by combining multiple randomized trees and having them vote on the classification. He terms this algorithm Random Forest. Breiman uses two forms of randomization to create multiple trees given a training set: firstly, the same feature randomization discussed in the previous section, and secondly, a different random subset of the training data for each new tree - a form of bootstrapping. The next subsections explain the random forest algorithm and its relation to bootstrapping, and give proofs of its convergence as well as an upper bound for its generalization error.

3.3.1 The random forest algorithm

Random forests work by growing N randomized trees, each using a random subset of the training set l. The latter randomization is a form of bootstrapping which Breiman calls bagging. For each tree, his algorithm selects an i.i.d. sample from the training data which is two thirds the size of the training data. The algorithm can be described as follows.

Definition 3.1. A random forest is a classifier consisting of a collection of tree-structured classifiers {T(x, l_k) : k = 1, ..., N}, where the l_k are independent, identically distributed random vectors drawn uniformly with replacement from the training data l, and each of the trees casts a unit vote for the most popular class at input x.

Since the random forest algorithm involves multiple forms of randomization, it is difficult to develop a mathematical notation that does justice to all its aspects. Neither Amit and Geman nor Breiman nor any other paper studied gives a full mathematical account of the algorithm. The following is an attempt to give such an account.

Let (Ω, F, P) be the probability space of objects to classify and let C = {1, 2, ..., c} be the space of classes. Furthermore, let Y : Ω → C be the classification function. Let L = (X_1, ..., X_n) be the training data, drawn i.i.d. from Ω with known class labels. Let T be the space of functions Ω → C and suppose there is a σ-algebra F_T on T. Then there is a probability measure P_T that assigns probabilities to measurable subsets of T according to a randomization procedure such as the one outlined in section 3.2.4. Given the distribution P on Ω and the probabilities over T conditioned on each possible realization of L, P_T is uniquely defined. In other words, we can derive a distribution P_T over the space of all trees if we know, for each possible set of training data, which trees can be created from the set and with which probability. Hence, the set of functions in T which cannot be created via any training set according to Amit and Geman's tree induction process is a P_T-null set.

Let l_k be a realization of a random vector L_k of length z < n, drawn uniformly i.i.d. and with replacement from l. The tree T(·, l_k(l)) ∈ T, k = 1, ..., N, is a T-valued random element, drawn i.i.d. from T conditioned on a single realization l of L. Lastly, each realization t_k then feeds into a random forest as follows. Let (T^N, F_T^N, P_T^N) be the product probability space of vectors of length N with random entries in T. Then the random forest is a random element F : T^N → T. Each realization f = F(t_1(·, l_1(l)), ..., t_N(·, l_N(l))) is defined by letting f(x) be the maximizer over a ∈ C of Σ_{k=1}^{N} 1(t_k(x, l_k) = a) for each x ∈ Ω. In other words, each tree casts one vote and the most popular class is selected.

The randomization of training data works on the same principle as bootstrapping. Namely, we sample training data from the sample distribution of an i.i.d. sample from
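To tie the pieces together, here is a minimal sketch of the algorithm in Definition 3.1 (not Breiman's reference implementation; scikit-learn trees with random feature subsets via max_features stand in for the randomized tree induction, the 2/3 subsample size follows the description above, and the data are synthetic): bag the training data, grow one randomized tree per bag, and classify by majority vote.

```python
# Sketch of the random forest of Definition 3.1: draw a bootstrap subsample of
# the training data for every tree, grow a randomized tree on it, and let the
# trees vote for the most popular class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, subsample=2/3, rng=None):
    rng = rng or np.random.default_rng(0)
    trees = []
    for _ in range(n_trees):
        # l_k: indices drawn uniformly with replacement, 2/3 the size of the data.
        idx = rng.integers(0, len(y), size=int(subsample * len(y)))
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Each tree casts a unit vote; the most popular class wins.
    votes = np.stack([tree.predict(X) for tree in trees])
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
forest = fit_forest(X[:800], y[:800])
accuracy = np.mean(predict_forest(forest, X[800:]) == y[800:])
print("hold-out accuracy of the voting forest:", accuracy)
```

In practice essentially the same combination of bagging, per-node feature subsampling and voting is available directly as sklearn.ensemble.RandomForestClassifier; the hand-rolled version above only serves to expose the two randomization steps and the vote.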


More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Supervised Learning via Decision Trees

Supervised Learning via Decision Trees Supervised Learning via Decision Trees Lecture 4 1 Outline 1. Learning via feature splits 2. ID3 Information gain 3. Extensions Continuous features Gain ratio Ensemble learning 2 Sequence of decisions

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Statistical Learning. Philipp Koehn. 10 November 2015

Statistical Learning. Philipp Koehn. 10 November 2015 Statistical Learning Philipp Koehn 10 November 2015 Outline 1 Learning agents Inductive learning Decision tree learning Measuring learning performance Bayesian learning Maximum a posteriori and maximum

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos Contents Decision Trees: Definition + Motivation Algorithm for Learning Decision Trees Entropy, Mutual Information, Information

More information

Dyadic Classification Trees via Structural Risk Minimization

Dyadic Classification Trees via Structural Risk Minimization Dyadic Classification Trees via Structural Risk Minimization Clayton Scott and Robert Nowak Department of Electrical and Computer Engineering Rice University Houston, TX 77005 cscott,nowak @rice.edu Abstract

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1 Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes. CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?

More information

Midterm, Fall 2003

Midterm, Fall 2003 5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Fall 2015 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m ) CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4] 1 DECISION TREE LEARNING [read Chapter 3] [recommended exercises 3.1, 3.4] Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting Decision Tree 2 Representation: Tree-structured

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Data classification (II)

Data classification (II) Lecture 4: Data classification (II) Data Mining - Lecture 4 (2016) 1 Outline Decision trees Choice of the splitting attribute ID3 C4.5 Classification rules Covering algorithms Naïve Bayes Classification

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Econ 2148, spring 2019 Statistical decision theory

Econ 2148, spring 2019 Statistical decision theory Econ 2148, spring 2019 Statistical decision theory Maximilian Kasy Department of Economics, Harvard University 1 / 53 Takeaways for this part of class 1. A general framework to think about what makes a

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Lecture notes on statistical decision theory Econ 2110, fall 2013

Lecture notes on statistical decision theory Econ 2110, fall 2013 Lecture notes on statistical decision theory Econ 2110, fall 2013 Maximilian Kasy March 10, 2014 These lecture notes are roughly based on Robert, C. (2007). The Bayesian choice: from decision-theoretic

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

Question of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning

Question of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning Question of the Day Machine Learning 2D1431 How can you make the following equation true by drawing only one straight line? 5 + 5 + 5 = 550 Lecture 4: Decision Tree Learning Outline Decision Tree for PlayTennis

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Machine Learning & Data Mining

Machine Learning & Data Mining Group M L D Machine Learning M & Data Mining Chapter 7 Decision Trees Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University Top 10 Algorithm in DM #1: C4.5 #2: K-Means #3: SVM

More information

Classification and regression trees

Classification and regression trees Classification and regression trees Pierre Geurts p.geurts@ulg.ac.be Last update: 23/09/2015 1 Outline Supervised learning Decision tree representation Decision tree learning Extensions Regression trees

More information

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Decision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology

Decision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology Decision trees Special Course in Computer and Information Science II Adam Gyenge Helsinki University of Technology 6.2.2008 Introduction Outline: Definition of decision trees ID3 Pruning methods Bibliography:

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

day month year documentname/initials 1

day month year documentname/initials 1 ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi

More information

Generalization Error on Pruning Decision Trees

Generalization Error on Pruning Decision Trees Generalization Error on Pruning Decision Trees Ryan R. Rosario Computer Science 269 Fall 2010 A decision tree is a predictive model that can be used for either classification or regression [3]. Decision

More information

UVA CS 4501: Machine Learning

UVA CS 4501: Machine Learning UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Randomized Decision Trees

Randomized Decision Trees Randomized Decision Trees compiled by Alvin Wan from Professor Jitendra Malik s lecture Discrete Variables First, let us consider some terminology. We have primarily been dealing with real-valued data,

More information

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Lecture 1: Bayesian Framework Basics

Lecture 1: Bayesian Framework Basics Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning Lecture 3: Decision Trees p. Decision

More information

ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation. Jean-Michel Marin ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 14, 2015 Today: The Big Picture Overfitting Review: probability Readings: Decision trees, overfiting

More information

Chapter 6: Classification

Chapter 6: Classification Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant

More information

Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information