Decision trees COMS 4771
1. Prediction functions (again)
Learning prediction functions

IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples).
- X takes values in 𝒳. E.g., 𝒳 = R^d.
- Y takes values in 𝒴. E.g., (regression problems) 𝒴 = R; (classification problems) 𝒴 = {1, ..., K} or 𝒴 = {0, 1} or 𝒴 = {−1, +1}.

1. We observe (X_1, Y_1), ..., (X_n, Y_n), and then choose a prediction function (i.e., predictor) f̂ : 𝒳 → 𝒴. This is called learning or training.
2. At prediction time, we observe X and form the prediction f̂(X).
3. The outcome is Y, and the
   - squared loss is (f̂(X) − Y)^2 (regression problems);
   - zero-one loss is 1{f̂(X) ≠ Y} (classification problems).

Note: the expected zero-one loss is E[1{f̂(X) ≠ Y}] = P(f̂(X) ≠ Y), which we also call the error rate.
Distributions over labeled examples

𝒳: space of possible side-information (feature space).
𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:
1. The marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
2. The conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.
Optimal classifier

For binary classification, what function f : 𝒳 → {0, 1} has the smallest risk (i.e., error rate) R(f) := P(f(X) ≠ Y)?

Conditional on X = x, the minimizer of the conditional risk ŷ ↦ P(ŷ ≠ Y | X = x) is

    ŷ := 1 if P(Y = 1 | X = x) > 1/2,  and  ŷ := 0 if P(Y = 1 | X = x) ≤ 1/2.

Therefore, the function f* : 𝒳 → {0, 1} given by

    f*(x) = 1 if P(Y = 1 | X = x) > 1/2,  and  f*(x) = 0 otherwise,  for all x ∈ 𝒳,

has the smallest risk. f* is called the Bayes (optimal) classifier.

More generally, for 𝒴 = {1, ..., K},

    f*(x) = arg max_{y ∈ 𝒴} P(Y = y | X = x),  for all x ∈ 𝒳.
Optimal classifiers from discrete probability tables

What is the Bayes optimal classifier under the following distribution?

[Table: joint probabilities P(X = x, Y = y) for X ∈ {1, 2} and Y ∈ {1, 2, 3}; the numeric entries did not survive transcription.] Error rate: 50%.

A second, finer-grained table:

[Table: joint probabilities for X ∈ {1, ..., 10} and Y ∈ {1, 2, 3}; entries likewise lost.] Error rate: 31.1%.

Variant: the same table with rows indexed by cells C_1, ..., C_10. Assume C_1, ..., C_10 partition 𝒳. Optimal classifier?
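As a concrete sketch of reading a Bayes classifier off such a table, the code below picks arg max_y P(X = x, Y = y) within each row. The numbers are hypothetical (the slide's own entries were lost in transcription), but they are chosen so the Bayes error comes out to the slide's 50%.

```python
# Bayes-optimal classifier from a joint probability table.
# joint[x][y] = P(X = x, Y = y); entries here are made-up stand-ins.
joint = {
    1: {1: 0.10, 2: 0.25, 3: 0.15},
    2: {1: 0.20, 2: 0.05, 3: 0.25},
}

def bayes_classifier(x):
    # argmax_y P(Y = y | X = x) = argmax_y P(X = x, Y = y)
    return max(joint[x], key=lambda y: joint[x][y])

# Bayes error rate: for each x, every label except the most probable is wrong.
error_rate = sum(sum(row.values()) - max(row.values())
                 for row in joint.values())

print(bayes_classifier(1), bayes_classifier(2), round(error_rate, 2))  # -> 2 3 0.5
```

Note that no classifier can do better: conditioned on X = x, any prediction other than the row maximizer is wrong with higher probability.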
2. Decision trees
Decision trees

A decision tree is a function f : 𝒳 → 𝒴, represented by a binary tree in which:
- each internal node is associated with a splitting rule g : 𝒳 → {0, 1};
- each leaf node is associated with a label ŷ ∈ 𝒴.

The splits partition 𝒳 into cells; each cell corresponds to a leaf, and f is constant within each cell.

When 𝒳 = R^d, one typically considers only splitting rules of the form h(x) = 1{x_i > t} for some i ∈ [d] and t ∈ R, i.e., axis-aligned splits. (Notation: [d] := {1, ..., d}.)

[Figure: an example tree with root split x_1 > 1.7; the left leaf predicts ŷ = 1, and the right child splits on x_2 > 2.8 with leaves ŷ = 2 and ŷ = 3.]
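Reading the example tree as code, a decision tree is just nested if-then-else over axis-aligned tests (branch labels as inferred from the slides):

```python
def tree_predict(x):
    """The slides' example tree as nested if-then-else (axis-aligned splits)."""
    if x[0] > 1.7:        # root splitting rule: 1{x_1 > 1.7}
        if x[1] > 2.8:    # second splitting rule: 1{x_2 > 2.8}
            return 3      # leaf: y-hat = 3
        return 2          # leaf: y-hat = 2
    return 1              # leaf: y-hat = 1

print(tree_predict((1.5, 5.0)), tree_predict((2.0, 2.0)), tree_predict((2.0, 3.0)))
# -> 1 2 3
```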
Decision tree example

Classifying irises by sepal and petal measurements: 𝒳 = R^2, 𝒴 = {1, 2, 3}, with
- x_1 = ratio of sepal length to width,
- x_2 = ratio of petal length to width.

[Figure: scatter plot of the training data, sepal length/width on the horizontal axis and petal length/width on the vertical axis, with the splits x_1 > 1.7 and x_2 > 2.8 drawn in as the tree is grown over several slides.]

The resulting tree: the root tests x_1 > 1.7; if false, predict ŷ = 1; if true, test x_2 > 2.8 and predict ŷ = 3 if true, ŷ = 2 if false.
Basic decision tree learning algorithm

[Figure: a one-split tree (root x_1 > 1.7, leaves ŷ = 1 and ŷ = 3) grown into the two-split tree with second split x_2 > 2.8 and leaves ŷ = 2 and ŷ = 3.]

Top-down greedy algorithm for decision trees
1: Initially, the tree is a single leaf node containing all (training) data.
2: repeat
3:   Pick the leaf ℓ and rule h that maximally reduce uncertainty.
4:   Split the data in ℓ using h, and grow the tree accordingly.
5: until some stopping condition is met.
6: Set the label of each leaf to (an estimate of) the optimal prediction for the leaf's cell.
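A minimal sketch of this greedy loop, using the Gini index introduced later in the lecture and a depth cap as the stopping condition. The dict-based tree representation and the four-point dataset are illustrative assumptions, not the lecture's own code.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(S, depth=0, max_depth=3):
    """Top-down greedy growing (sketch). S is a list of (x, y) pairs with
    x a tuple of reals; stop at a depth cap or a pure leaf."""
    labels = [y for _, y in S]
    if depth == max_depth or gini(labels) == 0.0:
        return max(set(labels), key=labels.count)   # leaf: majority label
    # Greedy step: pick the axis-aligned rule 1{x_i > t} that maximally
    # reduces uncertainty, scanning observed values as candidate thresholds.
    best, best_red = None, 0.0
    for x, _ in S:
        for i in range(len(x)):
            t = x[i]
            L = [y for z, y in S if z[i] <= t]
            R = [y for z, y in S if z[i] > t]
            if not L or not R:
                continue
            red = gini(labels) - (len(L) / len(S)) * gini(L) \
                               - (len(R) / len(S)) * gini(R)
            if red > best_red:
                best, best_red = (i, t), red
    if best is None:        # no split reduces uncertainty: stop here
        return max(set(labels), key=labels.count)
    i, t = best
    return {'i': i, 't': t,
            'left':  grow([p for p in S if p[0][i] <= t], depth + 1, max_depth),
            'right': grow([p for p in S if p[0][i] > t], depth + 1, max_depth)}

S = [((1.0, 0.5), 1), ((1.2, 0.6), 1), ((2.0, 3.0), 3), ((2.1, 2.9), 3)]
print(grow(S))  # -> {'i': 0, 't': 1.2, 'left': 1, 'right': 3}
```

One split suffices here because the two classes are separable along the first coordinate; real data usually needs more depth.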
Notions of uncertainty

Suppose S is the set of labeled examples reaching a leaf ℓ.

Classification (𝒴 = {1, ..., K}). Gini index:

    u(S) := 1 − Σ_{y ∈ 𝒴} p_y^2,

where p_y is the fraction of examples in S with label y (for each y ∈ 𝒴).

Regression (𝒴 = R). Variance:

    u(S) := (1/|S|) Σ_{(x,y) ∈ S} (y − μ(S))^2,  where  μ(S) := (1/|S|) Σ_{(x,y) ∈ S} y.

Both are minimized when every example in S has the same label. (Other popular notions of uncertainty exist, such as entropy.)
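Both notions translate directly into code; a minimal sketch:

```python
from collections import Counter

def gini(labels):
    """u(S) = 1 - sum_y p_y^2, where p_y is the fraction with label y."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(ys):
    """u(S) = (1/|S|) sum_(x,y) (y - mu(S))^2, the regression notion."""
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys) / len(ys)

print(gini([1, 1, 1]))        # -> 0.0 (pure leaf)
print(gini([1, 2, 2, 3]))     # -> 0.625 (mixed leaf)
print(variance([4.0, 4.0]))   # -> 0.0 (all labels equal)
```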
Uncertainty reduction

Suppose the data S at a leaf ℓ is split by a rule h into S_L and S_R, and let w_L := |S_L|/|S| and w_R := |S_R|/|S|.

[Diagram: the data S at leaf ℓ, with uncertainty u(S); the w_L fraction with h(x) = 0 goes to S_L (uncertainty u(S_L)), and the w_R fraction with h(x) = 1 goes to S_R (uncertainty u(S_R)).]

The reduction in uncertainty from using rule h at leaf ℓ is

    u(S) − (w_L · u(S_L) + w_R · u(S_R)).
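A worked numeric instance of this formula, using the two-label Gini index; the counts below are made up for illustration.

```python
def gini2(p):
    # Gini index of a set in which a fraction p has label 1 (and 1 - p label 0)
    return 1.0 - p * p - (1.0 - p) ** 2

# Hypothetical leaf: 100 examples, 60 with label 1. A rule h sends 40
# examples (all label 1) left and 60 examples (20 with label 1) right.
u_S = gini2(60 / 100)
w_L, w_R = 40 / 100, 60 / 100
reduction = u_S - (w_L * gini2(40 / 40) + w_R * gini2(20 / 60))
print(round(u_S, 3), round(reduction, 3))  # -> 0.48 0.213
```

The pure left child contributes zero weighted uncertainty, which is what makes the reduction large.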
Uncertainty reduction: iris example

[Figure: the iris scatter plot with the one-split tree x_1 > 1.7 and leaves ŷ = 1, ŷ = 3.]

One leaf (with ŷ = 1) already has zero uncertainty (a pure leaf). The other leaf (with ŷ = 3) has Gini index u(S) = 1 − p_1^2 − p_2^2 − p_3^2 (the slide's numeric values did not survive transcription).

[Figure: reduction in uncertainty of splitting S with 1{x_1 > t}, and with 1{x_2 > t}, plotted as a function of the threshold t.]

Splitting S with 1{x_2 > t} at the best threshold yields S_L and S_R; their Gini indices and the resulting reduction in uncertainty are computed the same way (again, the numeric values were lost in transcription).
Limitations of uncertainty notions

Suppose 𝒳 = R^2 and 𝒴 = {red, blue}, and the data is as follows:

[Figure: scatter plot with axes x_1, x_2; the two classes are interleaved so that no axis-aligned split separates them.]

Every split of the form 1{x_i > t} provides no reduction in uncertainty.

Upshot: zero reduction in uncertainty may not be a good stopping condition.
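This failure is easy to verify numerically on four points in an XOR-style arrangement (a hypothetical dataset in the spirit of the slide's picture):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# XOR-style data: the label depends on the two coordinates jointly,
# so no single coordinate carries information on its own.
S = [((0, 0), 'red'), ((1, 1), 'red'), ((0, 1), 'blue'), ((1, 0), 'blue')]

def reduction(i, t):
    """Uncertainty reduction of the rule 1{x_i > t} on S."""
    L = [y for x, y in S if x[i] <= t]
    R = [y for x, y in S if x[i] > t]
    if not L or not R:
        return 0.0
    return gini([y for _, y in S]) \
        - (len(L) / 4) * gini(L) - (len(R) / 4) * gini(R)

# Every nontrivial axis-aligned split leaves both sides 50/50 red/blue:
print([reduction(i, 0) for i in (0, 1)])  # -> [0.0, 0.0]
```

A greedy learner that stops at zero reduction would give up here, even though a depth-two tree classifies this data perfectly.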
Basic decision tree learning algorithm (recap of the top-down greedy algorithm above).
Stopping condition

Many alternatives; two common choices are:
1. Stop when the tree reaches a pre-specified size. (Involves setting additional tuning parameters.)
2. Stop when every leaf is pure (i.e., every training example reaching a leaf has the same label). (Serious danger of overfitting: spurious structure due to sampling.)

[Figure: a two-class scatter plot in (x_1, x_2), partitioned into finer and finer cells over successive slides until every leaf is pure.]
Overfitting

[Figure: risk vs. number of nodes in the tree; the training risk curve decreases to zero, while the true risk curve decreases initially and then turns upward.]

Training risk goes to zero as the number of nodes in the tree increases. True risk decreases initially, but eventually increases due to overfitting.
What can be done about overfitting?

Common strategy:
- Split the training data S into two parts, S_1 and S_2.
- Use the first part S_1 to grow the tree until all leaves are pure.
- Use the second part S_2 to choose a good pruning of the tree.

Pruning algorithm: repeatedly replace a tree node by a leaf node whenever doing so improves the empirical risk on S_2, until no more such improvements are possible. This can be done in a bottom-up traversal of the tree.

The independence of S_1 and S_2 makes it unlikely that spurious structures in each will perfectly align.
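A sketch of this bottom-up pruning pass, under an assumed leaf-or-dict tree representation; the tree and held-out set below are hypothetical, chosen so the inner split looks spurious on the held-out data.

```python
def predict(node, x):
    # A node is either a leaf label or a dict {'i', 't', 'left', 'right'}.
    while isinstance(node, dict):
        node = node['right'] if x[node['i']] > node['t'] else node['left']
    return node

def prune(node, S2):
    """Bottom-up reduced-error pruning: replace a subtree by a majority-label
    leaf whenever that does not increase the empirical risk on S2."""
    if not isinstance(node, dict) or not S2:
        return node
    node['left'] = prune(node['left'],
                         [(x, y) for x, y in S2 if x[node['i']] <= node['t']])
    node['right'] = prune(node['right'],
                          [(x, y) for x, y in S2 if x[node['i']] > node['t']])
    labels = [y for _, y in S2]
    majority = max(set(labels), key=labels.count)
    leaf_errors = sum(y != majority for _, y in S2)
    tree_errors = sum(predict(node, x) != y for x, y in S2)
    return majority if leaf_errors <= tree_errors else node

# The inner split x_2 > 2.8 hurts on the held-out data, so it is pruned
# away; the root split survives.
tree = {'i': 0, 't': 1.7, 'left': 1,
        'right': {'i': 1, 't': 2.8, 'left': 2, 'right': 3}}
S2 = [((1.0, 0.0), 1), ((2.0, 2.0), 2), ((2.0, 3.0), 2)]
print(prune(tree, S2))  # -> {'i': 0, 't': 1.7, 'left': 1, 'right': 2}
```

Ties (`leaf_errors <= tree_errors`) prune in favor of the smaller tree, which matches the goal of removing structure the held-out data does not support.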
Example: spam filtering

Data: e-mail messages, 𝒴 = {spam, not spam} (39.4% are spam). E-mails are represented by 57 features:
- 48: percentage of words that are a specific word (e.g., "free", "business");
- 6: percentage of characters that are a specific character (e.g., "!");
- 3: other features (e.g., average length of ALL-CAPS words).

Results: the final decision tree has just 17 leaves; the test risk is 9.3%.

                  ŷ = not spam   ŷ = spam
    y = not spam      57.3%        4.0%
    y = spam           5.3%       33.4%
[Figure: the learned spam-filtering tree, with node annotations of the form 600/1536 and splits on features such as ch$, remove (< 0.06), ch! (< 0.191), george (< 0.005), hp (< 0.03), CAPMAX (< 10.5), receive (< 0.125), edu (< 0.045), our (< 1.2), CAPAVE, free (< 0.065), business (< 0.145), and 1999; some thresholds did not survive transcription.]
Final remarks
- Decision trees are very flexible classifiers.
- It is very easy to overfit the training data.
- It is NP-hard (i.e., computationally intractable, in general) to find the smallest decision tree consistent with data.
Key takeaways
1. Structure of decision tree classifiers.
2. Greedy learning algorithm based on notions of uncertainty; limitations of the greedy algorithm.
3. High-level idea of overfitting and ways to deal with it.
More information12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016
12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses
More informationDecision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1
Decision Trees Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, 2018 Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Roadmap Classification: machines labeling data for us Last
More informationDecision Tree Learning
Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationChapter 6. Ensemble Methods
Chapter 6. Ensemble Methods Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Introduction
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative
More informationDan Roth 461C, 3401 Walnut
CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn
More informationEnsemble Methods and Random Forests
Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationLecture 3: Introduction to Complexity Regularization
ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationUVA CS 4501: Machine Learning
UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationClassification and Regression Trees
Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Lecture 06 - Regression & Decision Trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom
More informationCSE 151 Machine Learning. Instructor: Kamalika Chaudhuri
CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of
More informationDeconstructing Data Science
econstructing ata Science avid Bamman, UC Berkeley Info 290 Lecture 6: ecision trees & random forests Feb 2, 2016 Linear regression eep learning ecision trees Ordinal regression Probabilistic graphical
More informationBINARY TREE-STRUCTURED PARTITION AND CLASSIFICATION SCHEMES
BINARY TREE-STRUCTURED PARTITION AND CLASSIFICATION SCHEMES DAVID MCDIARMID Abstract Binary tree-structured partition and classification schemes are a class of nonparametric tree-based approaches to classification
More informationDecision Trees. Lewis Fishgold. (Material in these slides adapted from Ray Mooney's slides on Decision Trees)
Decision Trees Lewis Fishgold (Material in these slides adapted from Ray Mooney's slides on Decision Trees) Classification using Decision Trees Nodes test features, there is one branch for each value of
More informationC4.5 - pruning decision trees
C4.5 - pruning decision trees Quiz 1 Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can have? A: No. Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can
More informationCMSC858P Supervised Learning Methods
CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationLinear regression COMS 4771
Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between
More informationVariable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning
Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning Robert B. Gramacy University of Chicago Booth School of Business faculty.chicagobooth.edu/robert.gramacy
More informationCOMS 4771 Lecture Boosting 1 / 16
COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?
More informationLeast Squares Classification
Least Squares Classification Stephen Boyd EE103 Stanford University November 4, 2017 Outline Classification Least squares classification Multi-class classifiers Classification 2 Classification data fitting
More informationLearning Decision Trees
Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy
More informationLoss Functions, Decision Theory, and Linear Models
Loss Functions, Decision Theory, and Linear Models CMSC 678 UMBC January 31 st, 2018 Some slides adapted from Hamed Pirsiavash Logistics Recap Piazza (ask & answer questions): https://piazza.com/umbc/spring2018/cmsc678
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationChapter 6: Classification
Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationBoosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13
Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationday month year documentname/initials 1
ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi
More informationChapter ML:III. III. Decision Trees. Decision Trees Basics Impurity Functions Decision Tree Algorithms Decision Tree Pruning
Chapter ML:III III. Decision Trees Decision Trees Basics Impurity Functions Decision Tree Algorithms Decision Tree Pruning ML:III-34 Decision Trees STEIN/LETTMANN 2005-2017 Splitting Let t be a leaf node
More informationGeneralization, Overfitting, and Model Selection
Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How
More information15-388/688 - Practical Data Science: Decision trees and interpretable models. J. Zico Kolter Carnegie Mellon University Spring 2018
15-388/688 - Practical Data Science: Decision trees and interpretable models J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Decision trees Training (classification) decision trees Interpreting
More informationPerformance Evaluation
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Example:
More informationPredictive Modeling: Classification. KSE 521 Topic 6 Mun Yi
Predictive Modeling: Classification Topic 6 Mun Yi Agenda Models and Induction Entropy and Information Gain Tree-Based Classifier Probability Estimation 2 Introduction Key concept of BI: Predictive modeling
More informationLecture 3: Decision Trees
Lecture 3: Decision Trees Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning Lecture 3: Decision Trees p. Decision
More informationRandom Forests. Sören Mindermann September 21, Bachelorproject Begeleiding: dr. B Kleijn
Random Forests Sören Mindermann September 21, 2016 Bachelorproject Begeleiding: dr. B Kleijn Korteweg-de Vries Instituut voor Wiskunde Faculteit der Natuurwetenschappen, Wiskunde en Informatica Universiteit
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationCART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )
CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive
More informationLearning Decision Trees
Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?
More informationCART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions
CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions f (x) = M c m 1(x R m ) m=1 with R 1,..., R m R p disjoint. The CART algorithm is a heuristic, adaptive
More informationDiscriminative v. generative
Discriminative v. generative Naive Bayes 2 Naive Bayes P (x ij,y i )= Y i P (y i ) Y j P (x ij y i ) P (y i =+)=p MLE: max P (x ij,y i ) a j,b j,p p = 1 N P [yi =+] P (x ij =1 y i = ) = a j P (x ij =1
More informationLecture 7 Decision Tree Classifier
Machine Learning Dr.Ammar Mohammed Lecture 7 Decision Tree Classifier Decision Tree A decision tree is a simple classifier in the form of a hierarchical tree structure, which performs supervised classification
More information