
Random Forests

Sören Mindermann
September 21, 2016

Bachelor project
Supervision: dr. B. Kleijn

Korteweg-de Vries Instituut voor Wiskunde
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Universiteit van Amsterdam

Abstract

Tree-based classification methods are widely applied tools in machine learning that create a taxonomy of the space of objects to classify. Nonetheless, their workings are not documented to a high degree of formalism. The present paper places them into the mathematical context of decision theory, examines them from the perspective of Bayesian classification, where the probability of misclassification is minimized, and covers the example of handwritten character recognition. Tree classifiers exhibit low bias but high variance. Pruning of trees reduces variance by decreasing the complexity of the classification model. The random forest algorithm further reduces variance by combining multiple trees and aggregating their votes. To make use of the Law of Large Numbers, a sample from the space of tree classifiers is taken. Both the tree construction and the selection of training data are randomized for this purpose. Given a measure of the strength of individual trees and of the dependence between them, an upper bound for the probability of misclassification of a random forest with an infinite number of trees can be derived.

Title: Random Forests
Author: Sören Mindermann, soeren.mindermann@gmail.com
Supervision: dr. B. Kleijn
Completion date: September 21, 2016

Korteweg-de Vries Instituut voor Wiskunde
Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam

Contents

1 Introduction to Decision Theory
  1.1 Decision Theory
    1.1.1 Frequentist Decision Theory
    1.1.2 Bayesian Decision Theory
  1.2 Classification

2 Classification with Decision Trees
  2.1 Supervised learning
  2.2 Tree classifiers
    2.2.1 Construction of tree classifiers
    2.2.2 Bias and Variance
    2.2.3 Model overfitting
    2.2.4 Pruning

3 Random Trees and Forests
  3.1 Bootstrapping
  3.2 Randomized trees for character recognition
    3.2.1 Introduction
    3.2.2 Feature extraction
    3.2.3 Exploring shape space
    3.2.4 Feature randomization
  3.3 Random Forests
    3.3.1 The random forest algorithm
    3.3.2 Convergence of random forests

4 Conclusion

5 Popular summary

Bibliography

1 Introduction to Decision Theory

Many situations require us to make decisions under uncertainty. The first chapter builds a mathematical framework for such decision-theoretic problems, first from a frequentist and then from a Bayesian point of view, which lays the foundation for the theory of classification. In chapter 2, we introduce tree classifiers, an applied algorithm used to classify objects. Chapter 3 then turns to an application of tree classifiers and examines random forests, a combination of randomized tree classifiers that improves the accuracy of classification.

1.1 Decision Theory

Consider, as an example, the diagnosis of an illness: decisions can be rated according to certain optimality criteria. For a severe illness such as cancer, a false positive diagnosis can be much more harmful than a false negative, which is different from a less severe illness. Such differences are not taken into account in the classical framework of statistical inference. In that framework, we work with concepts such as Type-I and Type-II errors as well as confidence intervals, but these don't allow us to directly assign a value to how unfavorable certain errors are. Statistical decision theory solves this by introducing a loss-function, which quantifies the impact of an incorrect decision.

The mathematical set-up for decision-theoretic problems is similar to that of a regular problem of statistical inference, with mostly semantic differences. Instead of a model we speak of a state-space Θ and an unknown state θ ∈ Θ. The observation Z takes a value in the sample space Z, which is measurable with σ-algebra B. Z is a random variable with distribution P_θ : B → [0, 1] for a given state θ ∈ Θ. We take a decision a from the decision-space A based on the realization z of Z. There is an optimal decision for each state θ of the system, but since θ is unknown the decision made can be suboptimal. Statistical decision theory attempts to decide in an optimal way given only the observation Z. The decision procedure can be seen as a way of estimating the optimal decision. The main feature that distinguishes decision theory from statistical inference is the loss-function.

Definition 1.1. A loss-function can be any lower-bounded function L : Θ × A → R. The function −L is called the utility-function.

Given a particular state θ of the system, a loss L(θ, a) is incurred for an action a. The loss can be positive or negative. In the latter case it is a profit, which usually signifies that the decision was optimal or near-optimal. Since we do not have θ but only the data Z to form the decision, a decision rule that maps observations to decisions is needed.

Definition 1.2. Let A be a measurable space with σ-algebra H. A measurable function δ : Z → A is called a decision rule. The space of decision rules under consideration is denoted by Δ.

Clearly, the decision theorist's goal is to find a δ that is optimal at minimizing the loss according to some principle. Again, the decision rule can be compared to the estimator in the case of statistical inference, which maps observations to estimates of a parameter in the frequentist case or to beliefs about a parameter in the Bayesian case. Having set up the common foundation for decision-theoretic problems, we now examine both cases separately, starting with frequentist decision theory.

1.1.1 Frequentist Decision Theory

The frequentist assumes that Z is distributed according to a true distribution P_θ_0 for some state θ_0 ∈ Θ. A key concept for analyzing optimality of decision rules is the expected loss.

Definition 1.3. The risk-function R : Θ × Δ → R is defined as the expected loss under P_θ when δ is the decision rule,

    R(θ, δ) = ∫_Z L(θ, δ(z)) dP_θ(z).    (1.1)

Of course, only the expected loss under the true distribution is of interest to the frequentist. Still, we need a way to compare the risks of two decision rules for any parameter θ. This motivates the following definition.

Definition 1.4. Given are the state-space Θ, the states P_θ (θ ∈ Θ), the decision space A and the loss L. For δ_1, δ_2 ∈ Δ, δ_1 is called R-better than δ_2 if for all θ ∈ Θ

    R(θ, δ_1) < R(θ, δ_2).    (1.2)

A decision rule δ is admissible if there exists no δ′ which is R-better than δ (and inadmissible otherwise).

The ordering induced on Δ by this definition is merely partial. It is possible that two decision rules δ_1, δ_2 exist such that neither is R-better than the other. Therefore we define the minimax decision principle, which picks the decision rule that is optimal in the worst possible case.

Definition 1.5. Let Θ be the state-space, P_θ (θ ∈ Θ) the states, A the decision space and L the loss. The minimax risk is given by the function

    δ ↦ sup_{θ ∈ Θ} R(θ, δ).    (1.3)

Let δ_1, δ_2 ∈ Δ be given. Then δ_1 is minimax-preferred to δ_2 if

    sup_{θ ∈ Θ} R(θ, δ_1) < sup_{θ ∈ Θ} R(θ, δ_2).    (1.4)

If δ_M minimizes δ ↦ sup_{θ ∈ Θ} R(θ, δ), then it is called a minimax decision rule.

The existence of such a δ_M is guaranteed by the minimax theorem under very general conditions, which can be found in [1].

1.1.2 Bayesian Decision Theory

In Bayesian decision theory we incorporate the prior by integrating over Θ. This approach is more balanced than minimizing the maximum of the risk function over Θ: because a prior over Θ is given, taking it into account does not only consider the worst possible state, which makes it less pessimistic than the minimax decision rule.

Definition 1.6. Let Θ be the state-space, P_θ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. The Bayesian risk function is given by

    r(Π, δ) = ∫_Θ R(θ, δ) dΠ(θ).    (1.5)

Furthermore, let δ_1, δ_2 ∈ Δ be given. Then δ_1 is Bayes-preferred to δ_2 if

    r(Π, δ_1) < r(Π, δ_2).    (1.6)

If δ_Π minimizes δ ↦ r(Π, δ), then δ_Π is called a Bayes rule. r(Π, δ_Π) is called the Bayes risk.

The Bayes risk is always upper bounded by the minimax risk, because the minimax risk equals the Bayes risk for a pessimistic prior that puts all probability mass on a state θ ∈ Θ that maximizes the risk function.

Definition 1.7. Let Θ be the state-space, P_θ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. We define the decision rule δ* : Z → A to be such that for all z ∈ Z,

    ∫_Θ L(θ, δ*(z)) dΠ(θ | Z = z) = inf_{a ∈ A} ∫_Θ L(θ, a) dΠ(θ | Z = z).    (1.7)

In other words, δ*(z) minimizes the posterior expected loss point-wise for every z. Given that δ* is defined as a point-wise minimizer, the question arises whether it exists and is unique. These cannot be proven in the general case. However, given the existence of δ*, the following theorem establishes that it is indeed a Bayes rule.

Theorem 1.8. Let Θ be the state-space, P_θ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. Assume there is a σ-finite measure µ : B → R such that P_θ is dominated by µ for all θ ∈ Θ. If the decision rule δ* is well-defined, then δ* is a Bayes rule.

The proof uses the Radon-Nikodym theorem as well as Fubini's theorem and is given in Kleijn [1].
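To make these definitions concrete, the following sketch (not part of the original text; the diagnosis problem, its probabilities and its losses are invented purely for illustration) computes the risk function, the minimax rule of Definition 1.5 and the Bayes rule of Definition 1.6 by brute force for a finite state space, a finite sample space and the finite set of all decision rules.

```python
# Illustrative sketch: brute-force computation of the minimax rule and the
# Bayes rule for a finite decision problem. All numbers below are made up.
from itertools import product

states = ["healthy", "ill"]               # Theta
observations = ["negative", "positive"]   # Z (a diagnostic test result)
actions = ["no treatment", "treatment"]   # A

# P_theta(z): distribution of the observation given the state.
P = {"healthy": {"negative": 0.9, "positive": 0.1},
     "ill":     {"negative": 0.2, "positive": 0.8}}

# Loss L(theta, a): treating a healthy patient is bad, missing an illness is worse.
L = {("healthy", "no treatment"): 0.0, ("healthy", "treatment"): 1.0,
     ("ill", "no treatment"): 5.0,     ("ill", "treatment"): 0.0}

# A decision rule delta maps each observation to an action; enumerate all of them.
rules = [dict(zip(observations, choice))
         for choice in product(actions, repeat=len(observations))]

def risk(theta, delta):
    """R(theta, delta) = expected loss under P_theta, as in equation (1.1)."""
    return sum(P[theta][z] * L[(theta, delta[z])] for z in observations)

# Minimax rule: minimize the worst-case risk over states, as in (1.3).
minimax_rule = min(rules, key=lambda d: max(risk(th, d) for th in states))

# Bayes rule for a prior Pi on the states: minimize the integrated risk (1.5).
prior = {"healthy": 0.95, "ill": 0.05}
bayes_rule = min(rules, key=lambda d: sum(prior[th] * risk(th, d) for th in states))

print("minimax rule:", minimax_rule)
print("Bayes rule:  ", bayes_rule)
```

With a very skewed prior the Bayes rule may accept a higher worst-case risk than the minimax rule, which illustrates the difference between the two principles.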

1.2 Classification

A problem of classification arises in the context where there is a probability space (Ω, F, P) containing objects to classify, a set C = {1, 2, ..., c} of classes and a random variable Y : Ω → C that maps each object to a class. Furthermore, each object has observable features, which gives rise to a random variable V : Ω → V, where V = {0, 1}^M is the feature space and M the total number of features. In what follows, the feature space can be ignored in most cases. In the language of decision theory, C is the state-space.

The problem to solve is that the class Y(x) is not known for each object x ∈ Ω. Therefore, a decision rule Ŷ : Ω → C, normally called a classifier, is required. Ŷ can be seen as an estimate for the true class function Y. Ŷ can also be written as a function V → C. A loss function L : C × C → R is given. When not specified, we define it by default as

    L(k, l) = 1_{k ≠ l},    (1.8)

i.e. there is a loss of 1 for each misclassification. According to the minimax decision principle we look for a classifier δ_M : Z → C that minimizes

    δ ↦ sup_{k ∈ C} ∫_Z L(k, δ(z)) dP(z | Y = k) = sup_{k ∈ C} P(δ(Z) ≠ k | Y = k);    (1.9)

in other words, the principle demands that the probability of misclassification is minimized uniformly over all classes.

The Bayesian analogue requires a prior on the state-space, which here is the class space C. The prior is given by the marginal distribution of Y. For practical purposes, it can often be approximated by the relative frequencies of the classes {1, 2, ..., c} in a sample. Otherwise, equal probability is assigned to each class. Given the probabilities P(Y = k), the prior is thus defined with respect to the counting measure as

    π(k) = P(Y = k).    (1.10)

For classification problems, the Bayes rule is the minimizer of

    δ ↦ Σ_{k ∈ C} L(k, δ(z)) Π(Y = k | Z = z) = Π(δ(z) ≠ Y | Z = z)    (1.11)

for every z ∈ Z. By theorem 1.8, δ* minimizes the Bayes risk, which in this case is the probability of misclassification:

    r(Π, δ) = Σ_{k ∈ C} R(k, δ) π(k)
            = Σ_{k ∈ C} ∫_Z L(k, δ(z)) dP(z | Y = k) π(k)
            = Σ_{k ∈ C} P(δ(Z) ≠ k | Y = k) P(Y = k)
            = P(δ(Z) ≠ Y).

Thus, in Bayesian classification, as opposed to minimax classification, we minimize the overall probability of misclassification without referring to the class of the object.
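As an illustration (not from the thesis; the class-conditional distributions below are invented), under 0-1 loss the Bayes classifier simply picks the class with the largest posterior probability, δ*(z) = argmax_k π(k) P(z | Y = k):

```python
# Sketch: the Bayes classifier under 0-1 loss maximizes the posterior class
# probability. Prior and likelihoods are made up for illustration.
import numpy as np

classes = [0, 1, 2]
prior = np.array([0.5, 0.3, 0.2])              # pi(k) = P(Y = k)

# P(z | Y = k) for a single binary observation z in {0, 1}; rows are classes.
likelihood = np.array([[0.9, 0.1],
                       [0.4, 0.6],
                       [0.2, 0.8]])

def bayes_classify(z):
    """Return the class that minimizes the posterior probability of error."""
    joint = prior * likelihood[:, z]           # proportional to the posterior
    return classes[int(np.argmax(joint))]

for z in (0, 1):
    print(f"observation z={z} -> class {bayes_classify(z)}")

# The Bayes risk P(delta(Z) != Y) of this rule:
risk = sum(prior[k] * likelihood[k, z]
           for z in (0, 1) for k in classes if bayes_classify(z) != k)
print("Bayes risk:", risk)
```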

2 Classification with Decision Trees

Classification trees are classifiers, i.e. they map an object to a class based on the features of that object. In the context of this chapter, a (rooted) tree is built from a set of labelled training data l = {x_1, x_2, ..., x_n} (where x_k ∈ Ω for k ≤ n) with known class labels Y(x_1), Y(x_2), ..., Y(x_n). This allows it to classify data from the general population Ω from which the training data stem. A tree classifies objects by creating a taxonomy, i.e. a partition of the feature space V. At each node, the training data are split into a subset that does and a subset that does not have a certain feature.

We work in a Bayesian set-up, i.e. we try to minimize the probability of misclassification. However, no Bayesian updating procedure will be considered. Note that there is usually an unmanageably large number of possible decision rules, and there is no general procedure for finding the minimizer of the misclassification probability in a reasonable amount of time, so it is usually intractable to find the Bayes classifier. This chapter describes the construction of tree classifiers as an example of supervised learning, introduces the trade-off of bias and variance as well as overfitting, and then examines methods of reducing variance.

2.1 Supervised learning

In supervised learning, a set of training data l = {x_1, ..., x_n}, where x_1, ..., x_n are assumed to be an i.i.d. sample from Ω, is given alongside class labels Y(x_1), ..., Y(x_n). Then a function T analyzes l and maps it to a classifier t : Ω → C such that t can generalize to different input, i.e. such that it can classify new examples. Many techniques exist to construct such classifiers. In what follows, we will focus on decision trees.

2.2 Tree classifiers

A tree classifier, or decision tree t, built from training data l, classifies an object x by asking a series of queries about the object's feature vector. In the case at hand, each object is assumed to be fully defined by a vector of binary features. Thus, Ω = {0, 1}^M, where M is the total number of features. The classification function Y : Ω → C is deterministic, contrary to the default set-up in 1.2. Each node in the tree represents a query, and depending on the answer the object moves to one of the (usually two) children, where another query is posed. The leaves thus represent a partition of the feature space. This recursive partitioning is also called a taxonomy. An object x ends up at a leaf as it descends down the tree t. Since each leaf is assigned a class label, the tree can then be seen as a random variable t : Ω → C.
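To make the query-by-query descent concrete, the following sketch (an invented toy example, not taken from the thesis) represents a small tree over binary features as nested nodes and classifies a feature vector by descending to a leaf:

```python
# Toy sketch of a tree classifier over binary feature vectors: each internal
# node queries one feature, each leaf carries a class label. The tree and the
# feature indices are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None   # index of the binary feature queried here
    yes: Optional["Node"] = None    # child followed when the feature is 1
    no: Optional["Node"] = None     # child followed when the feature is 0
    label: Optional[int] = None     # class label; set only at leaves

def classify(tree: Node, x: list) -> int:
    """Descend from the root to a leaf and return that leaf's class label."""
    node = tree
    while node.label is None:
        node = node.yes if x[node.feature] == 1 else node.no
    return node.label

# A hand-built tree over M = 3 binary features and classes {0, 1}.
tree = Node(feature=0,
            yes=Node(label=1),
            no=Node(feature=2,
                    yes=Node(label=1),
                    no=Node(label=0)))

print(classify(tree, [1, 0, 0]))  # -> 1
print(classify(tree, [0, 0, 0]))  # -> 0
```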

2.2.1 Construction of tree classifiers

As we will see later on, the number of possible trees grows exponentially with the number of features in Ω. Therefore, it is normally infeasible to find the optimal tree given a set of training data l. Nonetheless, algorithms have been developed to find reasonably accurate trees. A general class of algorithms, Hunt's algorithm, works according to the following recursive definition for a given node m and the subset l_m of l which arrives at the node:

Step 1: If all objects in l_m have the same class, m is a leaf.

Step 2: If not all objects in l_m belong to the same class, a query is selected which further splits l_m into child nodes, based on the answer to the query.

It is possible that one of the child nodes created in step 2 is empty, i.e. none of the training data is associated with the node. In this case, the node will be a leaf. Hunt's algorithm leaves two questions open which are crucial for the design of a tree classifier. Firstly, a splitting criterion has to be selected in step 2. Secondly, a stopping criterion other than the one in step 1 should be defined. It is possible that a large number of splits is needed before the remaining training data belong to the same class. N splits would mean that the tree has at least 2^N − 1 nodes if the process is not stopped at any branch. Depending on the number of training data, the amount of available computing power and the accuracy needed, the tree can quickly grow too large and become computationally unmanageable.

Splitting criterion

We will therefore set up a framework for analyzing the probability distributions at each node. The joint probability distribution over Ω × C at node m is taken to be known, as it can be estimated by the empirical distribution of the examples at m. Then there is a marginal distribution P(k | m) over the classes. Let A be a binary feature to split on, which can take values a_0 = 0 or a_1 = 1. This gives a new distribution

    P(k | m, A = a_i) = P(A = a_i | m, k) P(k | m) / P(A = a_i | m)    (2.1)

over the classes at each child node (by Bayes' theorem). All of these probabilities are estimated by the empirical distributions of the training data. It should be noted that trees can split on binary, nominal, ordinal (discrete) or continuous features. For simplicity we focus on binary features, although the analysis can be extended to other attribute types. Note that nominal features and (finite-dimensional) ordinal and continuous features can be approximated by multiple binary splits.

Let P(k | m) denote the probability that x has class k given that it ends up at node m. To classify x with maximal accuracy we would like this probability to be as close as

possible to 0 or to 1, all else being equal. The least favorable case would be a uniform distribution over both classes. This intuition is formalized via impurity measures of a distribution. A common requirement for a measure of impurity is that it is zero when one class has all probability mass and maximal when the class distribution is uniform. Three common measures are the Shannon entropy

    H(m) = − Σ_{k=1}^{c} P(k | m) log_2 P(k | m),    (2.2)

where we define 0 log_2 0 = 0, the Gini index

    G(m) = Σ_{i ≠ j} P(i | m) P(j | m) = 1 − Σ_{k=1}^{c} [P(k | m)]^2,    (2.3)

and the classification error

    P_E(m) = 1 − max_k P(k | m).    (2.4)

The entropy can be understood as the expected amount of information (for some information measure) contained in a realization k of a random object drawn from the population at m. The Gini index can be viewed as the expected error when the class is randomly chosen from C with distribution P(· | m). The feature to split on is then selected at each node in a locally optimal way, namely such that it minimizes the remaining impurity according to one of these measures. Rather than merely minimizing the impurity of the data at one child node of m, the splitting criterion should minimize the expected impurity across all child nodes.

Minimizing the entropy or the Gini index is not the optimal way of splitting just before the leaves. As we have seen in chapter 1, the Bayesian classifier picks the class k with the highest probability P(k | m). If the child nodes of the node m will be leaves, it makes sense to minimize the classification error itself. However, the classification error only maximizes the largest class probability, whereas the other measures offer a more general reduction of impurity across classes. Entropy and Gini index are therefore more useful for also minimizing impurity several splits onward. It is easy to see that all measures are indeed zero when the probability is concentrated on one class. Figure 2.1 shows that they are maximal for a uniform distribution when there are two classes.

Finally, the feature A with possible values a_1, ..., a_d to split on is selected as the maximizer of the expected reduction in the impurity measure i, the information gain

    i(P(· | m)) − Σ_{k=1}^{d} P(a_k | m) i(P(· | a_k, m)).    (2.5)

When the classification problem is binary and the training data are split into two children at each split, the impurity measures can also be visualized as a function of p_i, the fraction of objects with class i, as shown in figure 2.1.
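The following sketch (illustration only; the class counts at the node are made up) computes the three impurity measures (2.2)-(2.4) and the information gain (2.5) for a candidate binary split from empirical class counts:

```python
# Sketch: impurity measures and information gain for a binary split, all
# estimated from class counts at a node. The counts below are invented.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    return 1.0 - np.max(p)

def information_gain(parent_counts, children_counts, impurity=entropy):
    """i(P(.|m)) - sum_k P(a_k|m) * i(P(.|a_k, m)), all estimated empirically."""
    parent = np.asarray(parent_counts, dtype=float)
    gain = impurity(parent / parent.sum())
    for child in children_counts:
        child = np.asarray(child, dtype=float)
        gain -= (child.sum() / parent.sum()) * impurity(child / child.sum())
    return gain

# Node m with 40 objects of class 1 and 40 of class 2, split by a binary feature A.
parent = [40, 40]
children = [[30, 10],   # objects with A = 0
            [10, 30]]   # objects with A = 1

for name, imp in [("entropy", entropy), ("gini", gini), ("error", classification_error)]:
    print(name, "gain:", round(information_gain(parent, children, imp), 4))
```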

Figure 2.1: Comparison of impurity measures for binary classification problems [6].

Note that the uniform distribution is indeed the value p_1 = 0.5 which maximizes all three measures, and both p_1 = 1 and p_2 = 1 indeed minimize all measures, as required. Empirical studies have demonstrated that different measures of impurity yield similar results. This is due to the fact that they are quite consistent with each other [4], as figure 2.1 suggests. Another feature revealed here is that these functions are concave. The following proposition assures that the information gain is always non-negative when the impurity i(P) is strictly concave.

Proposition 2.1. Let P be the space of probability measures on the space C, endowed with the σ-algebra P(C), the power set of C. Let i : P → R be a strictly concave impurity measure. Then a split on any feature A results in a non-negative information gain. The information gain is zero if and only if the distribution over classes P(· | A = a_i) is the same in all children.

Proof. We denote by P(· | m) the distribution over classes at m. Now Jensen's inequality assures that for concave i,

    Σ_{i=0,1} P(a_i | m) i(P(· | a_i)) = E_{A|m} i(P(· | A))
                                       ≤ i(E_{A|m} P(· | A))
                                       = i(Σ_{i=0,1} P(a_i | m) P(· | a_i))
                                       = i(P(· | m)),    (2.6)

with equality if and only if P(c | a_i) = P(c | m) for all i and c, i.e. if and only if the distribution over classes is the same in all children.

Impurity can also be viewed at the level of the entire tree as

    I(T) = Σ_{m ∈ M(T)} P(m) i(P(· | m)),    (2.7)

where M(T) is the set of leaves of T. If the impurity measure taken is the classification error, then this gives the classification error on the training set, or the general probability of misclassification if P is not estimated by the empirical distribution of the training data at each node.

2.2.2 Bias and Variance

In supervised learning two types of error are distinguished. In the context of classification, the training error describes the share of misclassifications made on the training data,

    E_T = (1 / |l|) Σ_{x ∈ l} 1_{T(x) ≠ Y(x)},    (2.8)

where l = {X_1, X_2, ..., X_n} are the training data, drawn i.i.d. from Ω. By contrast, the generalization error refers to the expected share of misclassifications on new data.

Definition 2.2. The generalization error of a classifier T : Ω → C, denoted P_E, is

    P_E = P(T(X) ≠ Y(X)) = E_X L(Y(X), T(X)),    (2.9)

where L is the loss function L(k, l) = 1_{k ≠ l}.

Bias and variance can be explained in the context of the space of classifiers when there is a norm on this space. Usually the model T, which includes all classifiers under consideration, does not include the true classification function Y. Assuming that T is closed and convex, there exists a T* ∈ T which minimizes ‖T − Y‖. The model bias is then defined as ‖T* − Y‖. Since the algorithm for finding a classifier does not necessarily find T*, the chosen classifier T̂ varies around T*. This allows us to define variance as E‖T̂ − T*‖^2. A natural consequence of this perspective is that larger models will exhibit a smaller bias, as ‖T* − Y‖ is likely to be smaller. However, they also allow for greater fluctuation of T̂ around T*, increasing variance. A large model is given when, e.g., there is a large number of parameters. As this is the case for deep trees, they tend to have high variance.

Since no default norm on T is actually given, bias and variance are hard to quantify. Following the usual definition, the bias would be defined as E_X d(Y(X), T(X)), the expected distance between the classification function and the classifier in some metric d. Neither Amit and Geman nor Breiman specify the metric, and instead the bias is taken to be the expectation of the training error, E_{X_1, X_2, ..., X_n} E_T.

The error due to variance is the amount by which the prediction based on one training set differs from the expected prediction over all possible training sets. In cases where T(X) takes a numerical value this can be written as E_X (T(X) − E_L T(X))^2, where E_L denotes the expectation over training sets. However, given that the class label is not generally a numerical variable, neither the expected class nor the difference between classes is clearly defined. Amit and Geman [3] as well as Breiman [5] therefore do not quantify variance, or simply define it as P_E − E_T. Variance can thus be seen as the component of the generalization error that stems from feeding the classifier previously unknown data.

2.2.3 Model overfitting

When variance increases too much due to a large model, the result is considered overfitting. When grown too large, tree classifiers are prone to overfitting, as they have many parameters, which contribute to a large model. On the one hand, the tree-building procedure can continue until every unique example in the training set ends up in its own leaf. By assigning each leaf the class of its particular example, the tree can classify the whole training set correctly. On the other hand, a tree built this way will depend strongly on the training data used, so variance will be high. Thus, when a tree is grown larger, the training error (or bias) decreases while variance tends to increase, as displayed in figure 2.2.

Figure 2.2: Relationship of bias, variance, generalization error and model complexity [6].

From this observation we can conclude that there is an optimal amount of model complexity which minimizes the generalization error. While low bias is attained by sufficiently large trees, the increase in variance cannot be rigorously shown in the general case, and there is no general agreement on the precise reasons for overfitting [3].
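As an illustration of this trade-off (not part of the thesis; it relies on scikit-learn and a synthetic dataset as assumptions), the gap between the training error and the hold-out error typically widens as a tree is grown deeper:

```python
# Sketch: training error keeps falling as a tree grows, while the hold-out
# (generalization) error eventually rises again. Synthetic data, scikit-learn
# trees; purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for depth in (1, 2, 4, 8, 16, None):   # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    print(f"max_depth={depth}: training error {train_err:.3f}, "
          f"hold-out error {test_err:.3f}")
```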

2.2.4 Pruning

As variance is a key challenge in decision tree learning, methods to reduce a tree's complexity while preserving accuracy on the training set are highly sought after. Pruning is one such method. It works by removing the branches of a fully-grown tree which add little to reducing bias but do increase variance. We will discuss pre-pruning and cost-complexity pruning (the most well-known form of pruning).

Pre-pruning reduces the size of a tree by specifying a stopping criterion (for an overview, see [12]). By default, a node is not split further if all examples at the node belong to the same class; with pre-pruning, induction stops when either this or the additional stopping criterion is met, whichever comes first. Pre-pruning is more computationally parsimonious because it prevents the further induction of a tree and does not have a post-pruning step. Ideally, pre-pruning methods stop the induction when no further accuracy gain is expected. However, in practice they filter out both relevant and irrelevant splits. For this reason, they have mostly been abandoned, except for large-scale applications where the training set is too large to grow a full tree. Various stopping criteria are possible; e.g. Amit and Geman [3] stop the induction when the number of examples at a leaf drops below a threshold.

Regular pruning (also called post-pruning) reduces a fully-grown tree T to a smaller tree T′ by removing sub-trees and replacing them with leaves. Minimal cost-complexity pruning (MCCP), proposed by Breiman et al. in 1984 [9], is an early, interesting and widely used method which we will examine in detail.

Let T be the space of possible trees, let L ⊂ Ω be any set of test examples used to evaluate the performance of a tree, and assume that the class labels of the examples in L are known. Furthermore, let s(T) be the size of a tree, measured by the number of its leaves. The map s is proportional to the number of nodes, which equals 2s − 1. Then, let R : T → R be a measure of a tree's performance that sums the contributions from the leaves,

    R(T) = Σ_{m ∈ M(T)} R(m),    (2.10)

where M(T) is the set of leaves of T. Breiman et al. let R be the number of misclassifications on the training set or on a test set. These measures too can be expressed in the form of (2.10) by letting R(m) be the number of examples that are misclassified at node m. When estimating P_E, a separate test set of labelled data is used. Such a set may be called either a pruning set or a cross-validation set, depending on whether it is also used to estimate the generalization error of the eventual pruned tree. R(m) is then defined as the number of examples from the test set arriving at node m that are misclassified. Similarly, if R were the entropy, then R(m) is the entropy of the distribution P(k | m), estimated by P_L(k | m) as usual. The cost-complexity function is then defined as

    R_α(T) = R(T) + α s(T),    (2.11)

where α > 0 scales the penalty assigned to more complex trees in terms of the number of leaves s(T). A method to choose α will be discussed below. MCCP chooses a subtree T′ of the full tree T_0 which minimizes R_α. Naturally, α = 0 should be used if R(T) is the generalization error estimated with a test set, since R_α is supposed to approximate P_E; the size of a tree need only be penalized because it increases the generalization error.

Breiman et al. showed that for the full tree T_0 there is a set of nested trees T_k which minimize R_α on the intervals [α_k, α_{k+1}), for some partition of [0, ∞) into such intervals (with {α_k} ⊂ [0, ∞) increasing). Moreover, as we will show, a simple algorithm can find such a sequence.

There may be multiple subtrees of T_0 which minimize R_α. If one of these is a sub-tree of all the others, call it T(α). We will see an algorithm to find T(α). Let T be a tree with more than one node and let T_m be the sub-tree of T with root m. We write R(m) for the R-measure of the node m when T is pruned at m. Similarly, s(m) is always 1. Define the function g, which compares the increase in R with the reduction in size when pruning at m, as

    g(m, T) = (R(m) − R(T_m)) / (s(T_m) − s(m)).    (2.12)

Note that when g(m, T) > α then R_α(m) > R_α(T_m), and vice versa. For a rooted tree T, define the set of children of a node m as all the nodes which are connected to the root via m and are one edge further away from the root than m. Then, for a rooted tree T_m, the set of branches of T_m is defined as B(T_m) = {T_b : b is a child of m}. From (2.10) it follows that we can write R_α(T) = Σ_{T_b ∈ B(T)} R_α(T_b). For the proof of the following proposition it is important to note that any node in the original tree T_0 has either more than one child or none, by Hunt's algorithm (see 2.2.1). The same is true for a pruned sub-tree of T_0, because pruning at a node removes all of its branches. This guarantees that any subtree T′ of a tree T with the same root node as T has as many branches as T if it has been created by pruning T.

Proposition 2.3. Let α ∈ R and create an enumeration of the nodes of a tree T such that each parent node comes after its child nodes. Consider each node in order and prune at node m if R_α(m) ≤ R_α(T′_m) for the remaining tree T′. The resulting tree is T(α).

Proof. Firstly, a tree T with root node m will be called optimally pruned if any sub-tree that can be created by pruning T and that is also rooted at m has a strictly larger value of R_α; this is what we denote by T(α). Now let T be a tree and let m_n, n = 1, ..., 2s(T) − 1, be an enumeration of its nodes such that each node precedes its parent. We prove the proposition by induction on n. T_{m_1} is optimally pruned since it is a single node. Let n > 1, write T′ for the current tree, and assume that all trees T′_{m_k} are optimally pruned for k < n. There are two options. If R_α(m_n) ≤ R_α(T′_{m_n}) we prune at m_n, otherwise T′_{m_n} remains. In the first case the resulting tree is trivial and therefore optimally pruned. In the second case, suppose a strict sub-tree T″_{m_n} of T′_{m_n} could be created by pruning T′_{m_n} such that R_α(T″_{m_n}) ≤ R_α(T′_{m_n}). T″_{m_n} would not be trivial, since otherwise we would have pruned at m_n. Thus

T″_{m_n} has the same number of branches as T′_{m_n}. Note that R_α(T″_{m_n}) = Σ_{b ∈ B(T″_{m_n})} R_α(b), because R_α(T) is a sum over the leaves of T. But then there must be a branch b″ ∈ B(T″_{m_n}) with R_α(b″) ≤ R_α(b′), where b′ denotes the corresponding branch of T′_{m_n}. By the assumption that T″_{m_n} can be created by pruning T′_{m_n}, b″ can be created by pruning b′. This is a contradiction with the assumption that all branches of T′_{m_n} were optimally pruned. It follows that T′_{m_n} is optimally pruned, and by induction the algorithm yields an optimally pruned tree.

Given a value of α, this allows us to find T(α). We now show how to find the sequence α_k and the tree sequence T_k. From now on we assume that s is monotonically increasing in the number of nodes of T.

Proposition 2.4. Let T be a tree and let α_1 be the minimum of g(m, T) over all nodes m of T that are not leaves. T is optimally pruned whenever α < α_1. When pruning every node m with g(m, T) = α_1, the result is T_1 = T(α_1). Furthermore, g(m, T_1) > α_1 for every non-leaf node of T_1.

Proof. That T is optimally pruned for α < α_1 follows from Proposition 2.3: if g(m, T) > α for every non-leaf m, then R_α(m) > R_α(T_m), so no node is pruned and T is already optimal. Now let α = α_1 and prune according to Proposition 2.3. Each time we prune at a node m, R_α(T_c) is not changed for any node c upstream of m. Thus R_α(m) ≤ R_α(T′_m) for the current tree T′ if and only if R_α(m) ≤ R_α(T_m) for the original tree, which is equivalent to g(m, T) ≤ α_1, i.e. g(m, T) = α_1. So the algorithm in Proposition 2.3 is equivalent to pruning each node with g(m, T) = α_1. Hence, the latter also results in T(α_1). Lastly, let m be a non-leaf node that remains after pruning with α = α_1. Then g(m, T_1) > α_1 follows from

    R(m) − R((T_1)_m) = R(m) − R(T_m) + [R(T_m) − R((T_1)_m)]
                      = g(m, T)[s(T_m) − s(m)] − α_1[s(T_m) − s((T_1)_m)]
                      > α_1[s(T_m) − s(m)] − α_1[s(T_m) − s((T_1)_m)]
                      = α_1[s((T_1)_m) − s(m)],

where the second equality uses R(m) − R(T_m) = g(m, T)[s(T_m) − s(m)] and the fact that pruning at α = α_1 does not change R_α_1, so that R(T_m) − R((T_1)_m) = −α_1[s(T_m) − s((T_1)_m)].

For the following proposition, recall that given an initial tree T_0, T(α) is a sub-tree that minimizes R_α and is a sub-tree of all other sub-trees of T_0 that also minimize R_α.

Proposition 2.5. Let β > α. Then T(β) is a sub-tree of T(α), and T(β) = T(α)(β).

Proof. Enumerate the nodes of T as in the proof of Proposition 2.3. We show by induction that T_{m_n}(β) is (weakly) a subtree of T_{m_n}(α) for every n, and therefore T(β) is a subtree of T(α). With Proposition 2.3 we then conclude that T(α)(β) = T(β).

For n = 1, T_{m_1} is a leaf, so the claim holds. Let n > 1 and assume that for k < n, T_{m_k}(β) is a subtree of T_{m_k}(α). At node m_n we prune T_{m_n}(α) if R_α(m_n) ≤ R_α(T_{m_n}(α)), and equivalently for T_{m_n}(β) with R_β. We consider two cases. If R_α(m_n) > R_α(T_{m_n}(α)), then T_{m_n}(β) is automatically a subtree of T_{m_n}(α), because either T_{m_n}(β) is trivial or all of its branches are subtrees of the corresponding branches in T_{m_n}(α) by the induction hypothesis. If R_α(m_n) ≤ R_α(T_{m_n}(α)), we need to show that R_β(m_n) ≤ R_β(T_{m_n}(β)), so that both T_{m_n}(α) and T_{m_n}(β) are trivial. Now T_{m_n}(α) minimizes R_α over subtrees of T_{m_n}, so R_α(T_{m_n}(α)) ≤ R_α(T_{m_n}(β)). Thus we have

    R_β(m_n) = R_α(m_n) + (β − α)s(m_n)
             ≤ R_α(T_{m_n}(α)) + (β − α)s(m_n)
             ≤ R_α(T_{m_n}(β)) + (β − α)s(m_n)
             = R_β(T_{m_n}(β)) − (β − α)[s(T_{m_n}(β)) − s(m_n)]
             ≤ R_β(T_{m_n}(β)).

Finally, T(β) = T(α)(β), because T(β) minimizes R_β over all rooted subtrees of T, which include the rooted subtrees of T(α).

These propositions suffice to find the sequence α_k and the corresponding trees T(α_k). Having found T(α_1), the algorithm in Proposition 2.3 can be applied to this tree to find T(α_2) (because g(m, T(α_1)) > α_1 for each non-leaf of T(α_1) by Proposition 2.4), and so forth. Eventually, T(α_p) will be the trivial tree for some p. For α_{i−1} < α < α_i, i = 1, ..., p, we have T(α) = T(α_{i−1}) by Propositions 2.4 and 2.5. Similarly, Proposition 2.5 shows that T(α) = T(α_p) for α > α_p.

Now that we can find the correct tree for each value of α, a reasonable value has to be found. In general, we seek to minimize the generalization error (although in some cases computation time should be minimized simultaneously). If we use a test set which is drawn i.i.d. from Ω, independently of the training set, and let R be the error rate, then R(T) already forms an unbiased estimate of the generalization error. We penalize size only because it contributes to variance, which in turn contributes to the generalization error, so in this case we can leave α = 0.

If a test set were drawn i.i.d. from Ω, the expectation of R(T(α)) would be the generalization error of T(α). Absent a test set, we therefore need to estimate this expectation with minimal bias and select α to minimize it. Breiman et al. suggest using cross-validation. The training set is split into a partition J, and for each part j ∈ J a tree is constructed from the remaining |J| − 1 parts, along with the corresponding sequence α_k and the trees T(α_k). This gives a piecewise constant function r_j(α) = R(j, T(α)). They then compute the average function α ↦ Σ_{j ∈ J} r_j(α) / |J| and select α as the minimizer of that function. Since the training set is disjoint from the test set for each j, r_j(α) is expected to be a reasonably unbiased estimate of the expectation of R(T(α)) (where the training set and test set are drawn i.i.d. from Ω). Averaging |J| cross-validation experiments reduces the variance of this estimate. If |J| is small, the size of the training set is considerably smaller (e.g. 2/3 of its original size for |J| = 3, as used by Breiman et al.), so the resulting trees

may have a higher error rate on the test set. But this does not mean that the relative estimates of R(T(α)) for different α are seriously biased. Therefore, the minimizing value of α does not necessarily change.

For reference, surveys of decision-tree pruning methods to avoid overfitting are given by Breslow and Aha [12] and Esposito et al. [13]. Some of the other typical pruning methods include reduced error pruning, pessimistic error pruning, minimum error pruning, critical value pruning, cost-complexity pruning, and error-based pruning. Quinlan and Rivest proposed using the minimum description length principle for decision tree pruning in [7].
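As an aside (not from the thesis), scikit-learn ships an implementation of minimal cost-complexity pruning. The sketch below, on a synthetic dataset, computes the sequence of effective α_k values, fits the corresponding nested pruned trees, and selects α by cross-validation, much as discussed above:

```python
# Sketch: minimal cost-complexity pruning with scikit-learn. Synthetic data,
# used only to illustrate the alpha_k sequence and the cross-validated choice.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The sequence of effective alphas at which the optimally pruned tree changes.
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# For each alpha_k, fit the pruned tree and estimate its error by cross-validation.
scores = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    scores.append(cross_val_score(tree, X, y, cv=5).mean())

best = alphas[int(np.argmax(scores))]
best_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best).fit(X, y)
print(f"chosen alpha = {best:.5f}, leaves = {best_tree.get_n_leaves()}")
```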

3 Random Trees and Forests

The final chapter starts with section 3.1 on bootstrapping, a commonly used technique in applied statistics that illustrates the principle behind the random selection of training data used in section 3.3. We continue with section 3.2, which introduces an application of trees and goes into the randomization of the feature selection. Finally, section 3.3 covers the random forest algorithm.

3.1 Bootstrapping

We start this chapter with an introduction to bootstrapping, a technique that will later be used on the training data of random forests. We illustrate with an example. Given i.i.d. random variables X_1, X_2, ..., X_n ~ D, we want to estimate the standard deviation of a statistic θ̂(X_1, X_2, ..., X_n). In a typical case, θ̂ might for example be the sample standard deviation, whose distribution is often hard to derive analytically. Writing the standard deviation of θ̂ as σ(D, n, θ̂) = σ(D) shows that it is merely a function of the underlying distribution D, as n and θ̂ are given parameters. The bootstrap procedure, which we will elaborate on shortly, yields an arbitrarily accurate estimate of σ(·) evaluated at D̂, the empirical distribution of the realizations of X_1, X_2, ..., X_n. Since D̂ is the non-parametric maximum likelihood estimate of D according to Efron [11], σ(D̂) serves as an estimate for σ(D). The estimate is not necessarily unbiased, but yields good results in practice [11].

The function σ(·) usually cannot be easily evaluated analytically. However, the bootstrap procedure below generates arbitrarily close estimates using the Law of Large Numbers.

1. Calculate the empirical distribution D̂ of the given sample, with distribution function

       D̂(t) = (1/n) Σ_{i=1}^{n} 1_{X_i ≤ t}.    (3.1)

2. Draw an i.i.d. bootstrap sample

       X*_1, X*_2, ..., X*_n ~ D̂    (3.2)

   and calculate θ̂* = θ̂(X*_1, X*_2, ..., X*_n).

3. Repeat step 2 a large number B of times independently. This yields bootstrap replications θ̂*_1, θ̂*_2, ..., θ̂*_B. Finally, calculate

       σ̂(D̂) = ( (1/(B−1)) Σ_{b=1}^{B} [θ̂*_b − θ̂*_M]^2 )^{1/2},    (3.3)

   where θ̂*_M refers to the mean value of the bootstrap replications.

As (3.3) is the sample standard deviation of the random variable θ̂* from step 2, the Law of Large Numbers guarantees that it converges D̂-a.s. to the standard deviation σ(D̂) of θ̂* as B grows. This proof is omitted.
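A minimal sketch of this procedure (illustration only; the data-generating distribution and the statistic are invented) estimates the standard deviation of the sample standard deviation by resampling:

```python
# Sketch of the bootstrap procedure above: estimate the standard deviation of a
# statistic (here the sample standard deviation) by resampling from the
# empirical distribution D-hat. The underlying distribution is made up.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100)     # realizations of X_1, ..., X_n

def statistic(x):
    return np.std(x, ddof=1)                      # theta-hat: sample standard deviation

B = 5000
replications = np.empty(B)
for b in range(B):
    # Step 2: an i.i.d. sample from D-hat is a sample drawn with replacement.
    resample = rng.choice(sample, size=sample.size, replace=True)
    replications[b] = statistic(resample)

# Step 3 / equation (3.3): the sample standard deviation of the replications.
print("bootstrap estimate of sigma(D):", np.std(replications, ddof=1))
```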

3.2 Randomized trees for character recognition

This section examines the classification problem of character recognition and discusses results from Amit and Geman [3], who applied feature randomization to classify images of characters with tree classifiers. Their method has two advantages: it makes tree classification possible when the set of features is unmanageably large, and it reduces the dependence between distinct trees (measured by (3.7)), making it more useful to combine multiple trees.

3.2.1 Introduction

Images of handwritten or machine-printed characters commonly have to be recognized in order to process text automatically. Examples include the automatic reading of addresses on letters, the scanning of books, and assistive technology for visually impaired users. A classifier program analyzes the features of an image of a character and classifies it as a particular character, so that a digital text is produced. Artificial neural networks, boosting algorithms such as AdaBoost, and tree classifiers are well-known methods for character recognition. Amit and Geman introduced a new approach in 1997 [3] by inducing a tree classifier and randomly selecting the features to consider splitting on during the induction.

Character recognition has various properties that are relevant for choosing the right classification approach. Including special symbols, there may be hundreds of classes and substantial variation of features within a class. Given that the images are binary with M pixels, the space of objects has 2^M elements and the feature space V is correspondingly large. Due to the large feature space, the challenge is to navigate it efficiently when searching for optima. This section starts with a description of the particular features that have to be extracted from binary images (section 3.2.2). Sections 3.2.3 and 3.2.4 then introduce the particular form of selecting splits for character recognition and explain how and why the splitting features are randomized.

3.2.2 Feature extraction

One of the main challenges for the classification of character images is the selection of the right kind of features used to distinguish the images. Amit and Geman extract particular features from pixel images by assigning a label to each pixel and letting the feature to split on be a particular geometric arrangement of tags. Each pixel is labelled with the 4x4 pixel grid which contains the pixel at its top left corner. Since there are 2^16 possible subimage types, they use a decision tree on a sample of images to narrow the space of subimages down to a set S of 62 tag types. The criterion for splitting at a node is dividing the 4x4 pixel subimages at the node as equally as possible into two groups. The resulting tags can loosely be described as "all black or white at the top", "black at the bottom", etc.

The eventual classification tree uses splits on geometric arrangements of such tags. These tag arrangements constitute the feature space V, which is constructed as follows. Any feature is a tag arrangement, which is a labelled directed graph. Each vertex is labelled with a tag s ∈ S and each edge is labelled with a type of relation that represents the relative position of two tags. The eight types of relations correspond to the compass headings north, northeast, east, etc. More formally, two vertexes are connected by an edge labelled k ∈ {1, 2, ..., 8} if the angle of the vector u − v is in [(k − 1/2) π/4, (k + 1/2) π/4], where u and v are the two locations in the image. Let A be the set of all possible arrangements with at most 30 tags; then V = {v_A : A ∈ A} is the feature space, where v_A : Ω → {0, 1} indicates whether a given object contains the arrangement A. This still leaves an unmanageably large number of features to consider splitting on at each node. The next section explains how we can navigate the feature space more efficiently.

3.2.3 Exploring shape space

As the feature space consists of graphs, there is a natural partial ordering on it: a graph precedes any of its extensions, where extensions are made by successively adding vertexes or labelled relations. Small graphs produce rough splits of shape space. More complex ones contain more information about the image, but few images contain them. Therefore, Amit and Geman start the tree induction by only considering small graphs, which they call binary arrangements, and search for the best split among their extensions. As we will see shortly, this makes sure that at each node only those more complex arrangements are considered which are likely to be contained in an image at that node.

A binary arrangement is a graph with only one edge and two connected vertexes. The set of binary arrangements will be denoted B ⊂ V. A minimal extension of an arrangement A is any graph created by adding exactly one edge, or one vertex and an edge connecting the vertex to A. Now the tree is induced following the recipe from section 2.2.1. At the root, the tree is split on the binary arrangement A_0 that most reduces the average impurity I. Amit and Geman use the Shannon entropy as the uncertainty measure, as opposed to the classification error or the Gini index. At the child node which answered "no" to the chosen arrangement, B is searched again, and so forth. At the other child node,

the minimal extension of A_0 which most reduces the uncertainty is chosen as the split. Continuing this pattern, at each node a minimal extension of the arrangement that last answered "yes" is chosen as the splitting criterion. When a stopping criterion is satisfied, the algorithm stops. Amit and Geman stop the process when the number of examples at a leaf is below a threshold which they do not further specify.

3.2.4 Feature randomization

Both pre-pruning and considering only binary arrangements and minimal extensions at each node restrict the computational resources needed considerably, but not sufficiently. At each node the number of features to consider is still very large - both the number of new binary arrangements and the number of extensions of arrangements. Amit and Geman suggest simply selecting a uniformly drawn random subset of the minimal extensions (or of the binary arrangements, if none has been picked yet) to consider at each node.

This randomization also makes it possible to construct multiple different random trees from the same training data l. Let the trees t_1, t_2, ..., t_N be a random sample from a probability space (T, A, P_T) of trees with some probability distribution P_T. If there were a norm on T, the Law of Large Numbers would guarantee that the sample mean converges to the population mean of the trees constructed with l, as long as the method of random feature selection generates independent, representative samples from this distribution.

Randomization also explains why the variance of the expected loss decreases as N increases. Let G(t) denote the generalization error of a classifier t. This is the expected loss as given in Definition 2.2. Assuming that the t_k are i.i.d. realizations of a random classifier T, representatively sampled from T,

    (1/N) Σ_{k=1}^{N} G(t_k) → E_T G(T)    (3.4)

as N → ∞, by the Law of Large Numbers. The same holds for any other function H(T), as long as its expected value exists and is finite. Although this does not prove that the combination of multiple trees has a lower expected generalization error than a single tree, the reduction in variance is suggestive, because much of the error of single trees is due to high variance [3].
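The following sketch (illustrative only; it uses a generic entropy-based split score over plain binary features rather than Amit and Geman's tag arrangements, and the data are invented) shows the core of the feature randomization: at each node, only a uniformly drawn random subset of the candidate features is scored, and the best feature of that subset is used as the split.

```python
# Sketch of feature randomization at a single node: instead of scoring every
# candidate feature, score only a uniformly drawn random subset and split on
# the best one. Split score and data are placeholders.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(X, y, feature):
    """Entropy reduction when splitting on one binary feature (column of X)."""
    mask = X[:, feature] == 1
    if mask.all() or (~mask).all():
        return 0.0
    n = len(y)
    return entropy(y) - (mask.sum() / n) * entropy(y[mask]) \
                      - ((~mask).sum() / n) * entropy(y[~mask])

def choose_split(X, y, n_candidates, rng):
    """Pick the best feature among a random subset of the candidate features."""
    candidates = rng.choice(X.shape[1], size=n_candidates, replace=False)
    return max(candidates, key=lambda f: information_gain(X, y, f))

# Invented binary data: 200 objects, 50 binary features, 3 classes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = rng.integers(0, 3, size=200)
print("split on feature", choose_split(X, y, n_candidates=8, rng=rng))
```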

3.3 Random Forests

As discussed in section 2.2.3, tree classifiers are prone to overfitting, i.e. to lowering bias at the expense of high variance. The larger the tree, the more likely overfitting becomes. Breiman [5] offers a solution to this problem by combining multiple randomized trees and having them vote on the classification. He terms this algorithm Random Forest. Breiman uses two forms of randomization to create multiple trees given a training set: firstly, the same feature randomization discussed in the previous section, and secondly, a different random subset of the training data for each new tree - a form of bootstrapping. The next subsections explain the random forest algorithm and its relation to bootstrapping, and give proofs of its convergence as well as an upper bound for its generalization error.

3.3.1 The random forest algorithm

Random forests work by growing N randomized trees, each using a random subset of the training set l. The latter randomization is a form of bootstrapping which Breiman calls bagging. For each tree, his algorithm selects an i.i.d. sample from the training data which is two thirds the size of the training data. The algorithm can be described as follows.

Definition 3.1. A random forest is a classifier consisting of a collection of tree-structured classifiers {T(x, l_k) : k = 1, ..., N}, where the l_k are independent, identically distributed random vectors drawn uniformly with replacement from the training data l, and each of the trees casts a unit vote for the most popular class at input x.

Since the random forest algorithm involves multiple forms of randomization, it is difficult to develop a mathematical notation that does justice to all its aspects. Neither Amit and Geman nor Breiman nor any other paper studied gives a full mathematical account of the algorithm. The following is an attempt to give such an account.

Let (Ω, F, P) be the probability space of objects to classify and let C = {1, 2, ..., c} be the space of classes. Furthermore, let Y : Ω → C be the classification function. Let L = (X_1, ..., X_n) be the training data, drawn i.i.d. from Ω with known class labels. Let T be the space of functions Ω → C and suppose there is a σ-algebra F_T on T. Then there is a probability measure P_T that assigns probabilities to measurable subsets of T according to a randomization procedure such as the one outlined in section 3.2.4. Given the distribution P on Ω and the probabilities over T conditioned on each possible realization of L, P_T is uniquely defined. In other words, we can derive a distribution P_T over the space of all trees if we know, for each possible set of training data, which trees can be created from the set and with which probability. Hence, the set of functions in T which cannot be created via any training set according to Amit and Geman's tree induction process is a P_T-null set.

Let l_k be a realization of a random vector L_k of length z < n, drawn uniformly i.i.d. and with replacement from l. The tree T(·, l_k(l)) ∈ T, k = 1, ..., N, is a T-valued random element, drawn i.i.d. from T conditioned on a single realization l of L. Lastly, each realization t_k then feeds into a random forest as follows. Let (T^N, F_T^N, P_T^N) be the product probability space of vectors of length N with random entries in T. Then the random forest is a random element F : T^N → T. Each realization f = F(t_1(·, l_1(l)), ..., t_N(·, l_N(l))) is defined by letting f(x) be the maximizer over a ∈ C of Σ_{k=1}^{N} 1(t_k(x, l_k) = a) for each x ∈ Ω. In other words, each tree casts one vote and the most popular class is selected.

The randomization of training data works on the same principle as bootstrapping. Namely, we sample training data from the sample distribution of an i.i.d. sample from
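To tie the pieces together, here is a minimal sketch of the algorithm in Definition 3.1 (not Breiman's reference implementation; scikit-learn trees with random feature subsets via max_features stand in for the randomized tree induction, the 2/3 subsample size follows the description above, and the data are synthetic): bag the training data, grow one randomized tree per bag, and classify by majority vote.

```python
# Sketch of the random forest of Definition 3.1: draw a bootstrap subsample of
# the training data for every tree, grow a randomized tree on it, and let the
# trees vote for the most popular class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, subsample=2/3, rng=None):
    rng = rng or np.random.default_rng(0)
    trees = []
    for _ in range(n_trees):
        # l_k: indices drawn uniformly with replacement, 2/3 the size of the data.
        idx = rng.integers(0, len(y), size=int(subsample * len(y)))
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Each tree casts a unit vote; the most popular class wins.
    votes = np.stack([tree.predict(X) for tree in trees])
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
forest = fit_forest(X[:800], y[:800])
accuracy = np.mean(predict_forest(forest, X[800:]) == y[800:])
print("hold-out accuracy of the voting forest:", accuracy)
```

In practice essentially the same combination of bagging, per-node feature subsampling and voting is available directly as sklearn.ensemble.RandomForestClassifier; the hand-rolled version above only serves to expose the two randomization steps and the vote.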


More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Supervised Learning via Decision Trees

Supervised Learning via Decision Trees Supervised Learning via Decision Trees Lecture 4 1 Outline 1. Learning via feature splits 2. ID3 Information gain 3. Extensions Continuous features Gain ratio Ensemble learning 2 Sequence of decisions

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Statistical Learning. Philipp Koehn. 10 November 2015

Statistical Learning. Philipp Koehn. 10 November 2015 Statistical Learning Philipp Koehn 10 November 2015 Outline 1 Learning agents Inductive learning Decision tree learning Measuring learning performance Bayesian learning Maximum a posteriori and maximum

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos Contents Decision Trees: Definition + Motivation Algorithm for Learning Decision Trees Entropy, Mutual Information, Information

More information

Dyadic Classification Trees via Structural Risk Minimization

Dyadic Classification Trees via Structural Risk Minimization Dyadic Classification Trees via Structural Risk Minimization Clayton Scott and Robert Nowak Department of Electrical and Computer Engineering Rice University Houston, TX 77005 cscott,nowak @rice.edu Abstract

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1 Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes. CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?

More information

Midterm, Fall 2003

Midterm, Fall 2003 5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Fall 2015 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m ) CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4] 1 DECISION TREE LEARNING [read Chapter 3] [recommended exercises 3.1, 3.4] Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting Decision Tree 2 Representation: Tree-structured

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Data classification (II)

Data classification (II) Lecture 4: Data classification (II) Data Mining - Lecture 4 (2016) 1 Outline Decision trees Choice of the splitting attribute ID3 C4.5 Classification rules Covering algorithms Naïve Bayes Classification

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Econ 2148, spring 2019 Statistical decision theory

Econ 2148, spring 2019 Statistical decision theory Econ 2148, spring 2019 Statistical decision theory Maximilian Kasy Department of Economics, Harvard University 1 / 53 Takeaways for this part of class 1. A general framework to think about what makes a

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Lecture notes on statistical decision theory Econ 2110, fall 2013

Lecture notes on statistical decision theory Econ 2110, fall 2013 Lecture notes on statistical decision theory Econ 2110, fall 2013 Maximilian Kasy March 10, 2014 These lecture notes are roughly based on Robert, C. (2007). The Bayesian choice: from decision-theoretic

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

Question of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning

Question of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning Question of the Day Machine Learning 2D1431 How can you make the following equation true by drawing only one straight line? 5 + 5 + 5 = 550 Lecture 4: Decision Tree Learning Outline Decision Tree for PlayTennis

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Machine Learning & Data Mining

Machine Learning & Data Mining Group M L D Machine Learning M & Data Mining Chapter 7 Decision Trees Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University Top 10 Algorithm in DM #1: C4.5 #2: K-Means #3: SVM

More information

Classification and regression trees

Classification and regression trees Classification and regression trees Pierre Geurts p.geurts@ulg.ac.be Last update: 23/09/2015 1 Outline Supervised learning Decision tree representation Decision tree learning Extensions Regression trees

More information

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Decision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology

Decision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology Decision trees Special Course in Computer and Information Science II Adam Gyenge Helsinki University of Technology 6.2.2008 Introduction Outline: Definition of decision trees ID3 Pruning methods Bibliography:

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

day month year documentname/initials 1

day month year documentname/initials 1 ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi

More information

Generalization Error on Pruning Decision Trees

Generalization Error on Pruning Decision Trees Generalization Error on Pruning Decision Trees Ryan R. Rosario Computer Science 269 Fall 2010 A decision tree is a predictive model that can be used for either classification or regression [3]. Decision

More information

UVA CS 4501: Machine Learning

UVA CS 4501: Machine Learning UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Randomized Decision Trees

Randomized Decision Trees Randomized Decision Trees compiled by Alvin Wan from Professor Jitendra Malik s lecture Discrete Variables First, let us consider some terminology. We have primarily been dealing with real-valued data,

More information

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Lecture 1: Bayesian Framework Basics

Lecture 1: Bayesian Framework Basics Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning Lecture 3: Decision Trees p. Decision

More information

ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation. Jean-Michel Marin ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 14, 2015 Today: The Big Picture Overfitting Review: probability Readings: Decision trees, overfiting

More information

Chapter 6: Classification

Chapter 6: Classification Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant

More information

Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information