Decision trees COMS 4771


1. Prediction functions (again)

Learning prediction functions

IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples).
- X takes values in 𝒳, e.g., 𝒳 = R^d.
- Y takes values in 𝒴, e.g., (regression problems) 𝒴 = R; (classification problems) 𝒴 = {1, ..., K} or 𝒴 = {0, 1} or 𝒴 = {−1, +1}.

1. We observe (X_1, Y_1), ..., (X_n, Y_n) and then choose a prediction function (i.e., predictor) f̂ : 𝒳 → 𝒴. This is called learning or training.
2. At prediction time, we observe X and form the prediction f̂(X).
3. The outcome is Y, and the
   squared loss is (f̂(X) − Y)^2 (regression problems);
   zero-one loss is 1{f̂(X) ≠ Y} (classification problems).

Note: the expected zero-one loss is E[1{f̂(X) ≠ Y}] = P(f̂(X) ≠ Y), which we also call the error rate.
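To make the two losses concrete, here is a minimal sketch (not from the slides) of estimating the empirical squared loss and error rate of a fixed predictor on a sample; the predictor and the three data points below are hypothetical placeholders.

import numpy as np

def empirical_squared_loss(f, X, Y):
    # Mean of (f(X_i) - Y_i)^2 over the sample: estimates E[(f(X) - Y)^2].
    preds = np.array([f(x) for x in X])
    return np.mean((preds - Y) ** 2)

def empirical_error_rate(f, X, Y):
    # Mean of 1{f(X_i) != Y_i} over the sample: estimates P(f(X) != Y).
    preds = np.array([f(x) for x in X])
    return np.mean(preds != Y)

# Hypothetical example: a constant classifier on three labeled points.
X = np.array([[0.5], [1.2], [2.0]])
Y = np.array([0, 1, 1])
f = lambda x: 1
print(empirical_error_rate(f, X, Y))  # 0.333...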

Distributions over labeled examples

𝒳: space of possible side-information (feature space).
𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:
1. The marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
2. The conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.

Optimal classifier

For binary classification, which function f : 𝒳 → {0, 1} has the smallest risk (i.e., error rate) R(f) := P(f(X) ≠ Y)?

Conditional on X = x, the minimizer of the conditional risk ŷ ↦ P(ŷ ≠ Y | X = x) is
  ŷ := 1 if P(Y = 1 | X = x) > 1/2,  and  ŷ := 0 if P(Y = 1 | X = x) ≤ 1/2.

Therefore, the function f* : 𝒳 → {0, 1} with
  f*(x) = 1 if P(Y = 1 | X = x) > 1/2,  and  f*(x) = 0 if P(Y = 1 | X = x) ≤ 1/2,  for all x ∈ 𝒳,
has the smallest risk. f* is called the Bayes (optimal) classifier.

For 𝒴 = {1, ..., K}: f*(x) = arg max_{y ∈ 𝒴} P(Y = y | X = x), for all x ∈ 𝒳.

Optimal classifiers from discrete probability tables

What is the Bayes optimal classifier when the distribution of (X, Y) is given by a discrete probability table?

[Slide tables: a first example with two values of X and Y ∈ {1, 2, 3}, whose Bayes classifier has error rate 50%; a second example with ten values of X and Y ∈ {1, 2, 3}, whose Bayes classifier has error rate 31.1%. The numeric entries of both tables were not recovered in transcription.]

Finally, suppose the rows of the table are indexed by cells C_1, ..., C_10 that partition 𝒳 (rather than by individual values of X). What is the optimal classifier?
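Since the slide's table entries were lost in transcription, here is a small sketch, with made-up probabilities, of how the Bayes optimal classifier and its error rate are read off a discrete joint probability table P(X = i, Y = y).

import numpy as np

# Hypothetical joint probability table: rows index values of X, columns index Y in {1, 2, 3}.
# Entries sum to 1; these numbers are illustrative only (not the slide's).
P = np.array([
    [0.20, 0.15, 0.15],   # X = 1
    [0.10, 0.25, 0.15],   # X = 2
])

# Bayes classifier: for each x, predict the y maximizing P(Y = y | X = x),
# which is the same as maximizing the joint P(X = x, Y = y) over y.
bayes_pred = P.argmax(axis=1) + 1          # labels are 1-based
bayes_error = 1.0 - P.max(axis=1).sum()    # probability mass not captured by the row-wise argmax

print(bayes_pred)   # [1 2]
print(bayes_error)  # 0.55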

2. Decision trees

Decision trees

A decision tree is a function f : 𝒳 → 𝒴, represented by a binary tree in which:
- each tree node is associated with a splitting rule g : 𝒳 → {0, 1};
- each leaf node is associated with a label y ∈ 𝒴.
The tree nodes partition 𝒳 into cells; each cell corresponds to a leaf, and f is constant within each cell.

When 𝒳 = R^d, we typically only consider splitting rules of the form
  h(x) = 1{x_i > t}  for some i ∈ [d] and t ∈ R,
i.e., axis-aligned splits. (Notation: [d] := {1, ..., d}.)

[Slide figure: a small example tree with root split x_1 > 1.7, a leaf ŷ = 1, an internal node splitting on x_2 > 2.8, and leaves ŷ = 2 and ŷ = 3.]
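As an illustration of this representation (my own sketch, not code from the course), here is a minimal binary decision tree with axis-aligned splits h(x) = 1{x[i] > t} and a prediction function that routes a point to its leaf.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal node: split on feature index i with threshold t (rule 1{x[i] > t}).
    # Leaf node: i and t are None, and `label` holds the prediction for the cell.
    i: Optional[int] = None
    t: Optional[float] = None
    left: Optional["Node"] = None    # followed when x[i] <= t (rule evaluates to 0)
    right: Optional["Node"] = None   # followed when x[i] > t (rule evaluates to 1)
    label: Optional[int] = None

def predict(node, x):
    # Walk down the tree until a leaf is reached; f is constant on each leaf's cell.
    while node.label is None:
        node = node.right if x[node.i] > node.t else node.left
    return node.label

# Hypothetical tree mirroring the slide's shape: split on x_1 > 1.7, then x_2 > 2.8.
# (Which label sits on which leaf is an assumption here, not taken from the slide.)
tree = Node(i=0, t=1.7,
            left=Node(label=1),
            right=Node(i=1, t=2.8, left=Node(label=3), right=Node(label=2)))
print(predict(tree, [1.0, 3.0]))  # 1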

Decision tree example

Classifying irises by sepal and petal measurements: 𝒳 = R^2, 𝒴 = {1, 2, 3}, with
  x_1 = ratio of sepal length to width,
  x_2 = ratio of petal length to width.

[Slide figures: a scatter plot of the data in the (sepal length/width, petal length/width) plane, with the tree grown one split at a time: first a single leaf ŷ = 2; then a split on x_1 > 1.7 with leaves ŷ = 1 and ŷ = 3; then the impure leaf split on x_2 > 2.8, giving leaves ŷ = 2 and ŷ = 3.]

Basic decision tree learning algorithm

[Slide figure: the sequence of trees grown on the iris data, from a single leaf ŷ = 2 to the final two-split tree.]

Top-down greedy algorithm for decision trees
1: Initially, the tree is a single leaf node containing all (training) data.
2: repeat
3:   Pick the leaf ℓ and rule h that maximally reduce uncertainty.
4:   Split the data in ℓ using h, and grow the tree accordingly.
5: until some stopping condition is met.
6: Set the label of each leaf to (an estimate of) the optimal prediction for the leaf's cell.
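A minimal sketch of this loop (my own, not the course's code). To stay short it tracks only the partition of the training data into leaf cells, not the split rules themselves, and it assumes an uncertainty function u(S) and a helper best_split(S, u) like the ones sketched after the uncertainty slides below; the maximum-number-of-leaves stopping condition is one of the choices discussed later.

from collections import Counter

def grow(S, u, best_split, max_leaves=8):
    # S is a list of (x, y) pairs; initially the tree is a single leaf containing all data.
    leaves = [S]
    while len(leaves) < max_leaves:           # stopping condition: pre-specified size
        best = None                           # (leaf index, reduction, S_L, S_R)
        for idx, data in enumerate(leaves):
            split = best_split(data, u)       # best rule for this leaf, or None
            if split is not None and (best is None or split[1] > best[1]):
                best = (idx, split[1], split[2], split[3])
        if best is None:                      # no split reduces uncertainty anywhere
            break
        idx, _, S_L, S_R = best
        leaves[idx:idx + 1] = [S_L, S_R]      # grow the tree: replace the leaf by its children
    # Label each leaf with (an estimate of) the optimal prediction for its cell:
    # here, the majority label among the training examples reaching it.
    return [(data, Counter(y for _, y in data).most_common(1)[0][0]) for data in leaves]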

Notions of uncertainty

Suppose S is the set of labeled examples reaching a leaf ℓ.

Classification (𝒴 = {1, ..., K})
  Gini index: u(S) := 1 − Σ_{y ∈ 𝒴} p_y^2,
  where p_y is the fraction of examples in S with label y (for each y ∈ 𝒴).

Regression (𝒴 = R)
  Variance: u(S) := (1/|S|) Σ_{(x,y) ∈ S} (y − μ(S))^2,  where μ(S) := (1/|S|) Σ_{(x,y) ∈ S} y.

Both are minimized when every example in S has the same label.
(Other popular notions of uncertainty exist, such as entropy.)
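A direct transcription of these two uncertainty notions into code (a sketch, assuming S is a list of (x, y) pairs):

from collections import Counter

def gini(S):
    # u(S) = 1 - sum_y p_y^2, where p_y is the fraction of examples in S with label y.
    n = len(S)
    counts = Counter(y for _, y in S)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def variance(S):
    # u(S) = (1/|S|) * sum over S of (y - mu(S))^2, with mu(S) the mean label in S.
    ys = [y for _, y in S]
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys) / len(ys)

# Both are 0 exactly when every example in S has the same label.
print(gini([((0,), 1), ((1,), 1)]))          # 0.0
print(gini([((0,), 1), ((1,), 2)]))          # 0.5
print(variance([((0,), 3.0), ((1,), 5.0)]))  # 1.0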

Uncertainty reduction

Suppose the data S at a leaf ℓ is split by a rule h into S_L and S_R, where w_L := |S_L|/|S| and w_R := |S_R|/|S|: the w_L fraction of S with h(x) = 0 goes to S_L, and the w_R fraction with h(x) = 1 goes to S_R.

The reduction in uncertainty from using rule h at leaf ℓ is
  u(S) − (w_L · u(S_L) + w_R · u(S_R)).
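A sketch of searching over axis-aligned rules 1{x[i] > t} for the split maximizing this reduction. Using the observed feature values as candidate thresholds is a common choice and an assumption of mine, not something specified on the slide; the return convention matches the grow skeleton above.

def best_split(S, u):
    # Return ((i, t), reduction, S_L, S_R) for the axis-aligned rule 1{x[i] > t}
    # maximizing u(S) - (w_L * u(S_L) + w_R * u(S_R)), or None if no split helps.
    n, d = len(S), len(S[0][0])
    base = u(S)
    best = None
    for i in range(d):
        for t in sorted({x[i] for x, _ in S}):
            S_L = [(x, y) for x, y in S if x[i] <= t]   # h(x) = 0 side
            S_R = [(x, y) for x, y in S if x[i] > t]    # h(x) = 1 side
            if not S_L or not S_R:
                continue
            reduction = base - (len(S_L) / n * u(S_L) + len(S_R) / n * u(S_R))
            if best is None or reduction > best[1]:
                best = ((i, t), reduction, S_L, S_R)
    return best if best is not None and best[1] > 0 else None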

Uncertainty reduction: example

Continuing the iris example after the first split (x_1 > 1.7):
- One leaf (with ŷ = 1) already has zero uncertainty (a pure leaf).
- The other leaf (with ŷ = 3) has a positive Gini index u(S) = 1 − p_1^2 − p_2^2 − p_3^2 (the class fractions and the resulting numeric value were lost in transcription).

[Slide figures: plots of the reduction in uncertainty as a function of the threshold t, for candidate splits 1{x_1 > t} and 1{x_2 > t} of this leaf's data S into S_L, S_R.]

For the best split of the form 1{x_2 > t}, the slide computes u(S_L) and u(S_R) from the class fractions in each part, and the resulting reduction in uncertainty u(S) − (w_L u(S_L) + w_R u(S_R)); the numeric values were likewise lost in transcription.

Limitations of uncertainty notions

Suppose 𝒳 = R^2 and 𝒴 = {red, blue}, and the data is as in the slide's figure.

[Slide figure: a two-class scatter plot in the (x_1, x_2) plane.]

Every split of the form 1{x_i > t} provides no reduction in uncertainty.

Upshot: zero reduction in uncertainty may not be a good stopping condition.
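To see how this can happen concretely, here is a check on an assumed XOR-style arrangement of the two classes (my own example; the slide's actual data points were not recovered) showing that every axis-aligned split leaves the Gini index unchanged.

from collections import Counter

def gini(S):
    n = len(S)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y for _, y in S).values())

# Assumed XOR-like data: opposite corners share a color.
S = [((0.0, 0.0), "red"), ((1.0, 1.0), "red"),
     ((0.0, 1.0), "blue"), ((1.0, 0.0), "blue")]
n = len(S)

for i in (0, 1):
    for t in (0.5,):   # the only threshold separating the two coordinate values
        S_L = [(x, y) for x, y in S if x[i] <= t]
        S_R = [(x, y) for x, y in S if x[i] > t]
        reduction = gini(S) - (len(S_L) / n * gini(S_L) + len(S_R) / n * gini(S_R))
        print(i, t, reduction)   # reduction is 0.0 for every axis-aligned split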

Basic decision tree learning algorithm (recap)

(The slide repeats the top-down greedy algorithm shown above before turning to stopping conditions.)

Stopping condition

Many alternatives; two common choices are:
1. Stop when the tree reaches a pre-specified size.
   Involves setting additional tuning parameters.
2. Stop when every leaf is pure (i.e., only one training example reaches each leaf).
   Serious danger of overfitting: spurious structure due to sampling.

[Slide figures: a two-class scatter plot in the (x_1, x_2) plane, partitioned by increasingly fine axis-aligned splits as the tree is grown until every leaf is pure.]

Overfitting

[Slide figure: risk versus number of nodes in the tree, with curves for the training risk and the true risk.]

Training risk goes to zero as the number of nodes in the tree increases. True risk decreases initially, but eventually increases due to overfitting.

What can be done about overfitting?

Common strategy: split the training data S into two parts, S_1 and S_2.
- Use the first part S_1 to grow the tree until all leaves are pure.
- Use the second part S_2 to choose a good pruning of the tree.

Pruning algorithm
  Loop: replace a tree node by a leaf node if doing so improves the empirical risk on S_2,
  until no more such improvements are possible.
This can be done in a bottom-up traversal of the tree.

Independence of S_1 and S_2 makes it unlikely for spurious structures in each to perfectly align.
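A sketch of this bottom-up pruning pass, reusing the Node/predict representation sketched earlier. The recursion, helper names, and the tie-breaking rule (also prune when the validation risk is unchanged, preferring a smaller tree) are my own choices for illustration.

from collections import Counter

def error_rate(node, S):
    # Empirical zero-one risk of the (sub)tree rooted at `node` on a set S of (x, y) pairs.
    return sum(predict(node, x) != y for x, y in S) / len(S)

def prune(node, S2):
    # S2: validation examples (from the held-out part) reaching this node.
    if node.label is not None or not S2:
        return node                          # leaf, or no validation data to judge by
    # Bottom-up: prune the children first, routing the validation data down the split.
    S_L = [(x, y) for x, y in S2 if x[node.i] <= node.t]
    S_R = [(x, y) for x, y in S2 if x[node.i] > node.t]
    node.left = prune(node.left, S_L)
    node.right = prune(node.right, S_R)
    # Candidate replacement: a leaf predicting the majority label of S2 at this node.
    majority = Counter(y for _, y in S2).most_common(1)[0][0]
    leaf = Node(label=majority)
    if error_rate(leaf, S2) <= error_rate(node, S2):
        return leaf                          # replacing the node does not hurt validation risk
    return node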

Example: Spam filtering

Data
- Email messages; 𝒴 = {spam, not spam} (39.4% are spam).
- Emails represented by 57 features:
  48: percentage of words that are a specific word (e.g., "free", "business");
  6: percentage of characters that are a specific character (e.g., "!");
  3: other features (e.g., average length of ALL-CAPS words).

Results
The final decision tree has just 17 leaves; the test risk is 9.3%.

                 ŷ = not spam    ŷ = spam
  y = not spam       57.3%          4.0%
  y = spam            5.3%         33.4%

Example: Spam filtering (continued)

[Slide figure: the learned decision tree, with internal nodes labeled by threshold rules on features such as ch$, remove, ch!, george, hp, CAPMAX, receive, edu, our, CAPAVE, free, and business, and leaves labeled spam / not spam together with the counts of examples reaching them.]

Final remarks

- Decision trees are very flexible classifiers.
- They are very easy to overfit to training data.
- It is NP-hard (i.e., computationally intractable, in general) to find the smallest decision tree consistent with the data.

Key takeaways

1. Structure of decision tree classifiers.
2. Greedy learning algorithm based on notions of uncertainty; limitations of the greedy algorithm.
3. High-level idea of overfitting and ways to deal with it.
