Decision trees COMS 4771


1. Prediction functions (again)

Learning prediction functions

IID model for supervised learning: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs (i.e., labeled examples).

- $X$ takes values in $\mathcal{X}$; e.g., $\mathcal{X} = \mathbb{R}^d$.
- $Y$ takes values in $\mathcal{Y}$; e.g., (regression problems) $\mathcal{Y} = \mathbb{R}$; (classification problems) $\mathcal{Y} = \{1, \ldots, K\}$ or $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, +1\}$.

1. We observe $(X_1, Y_1), \ldots, (X_n, Y_n)$ and then choose a prediction function (i.e., predictor) $\hat{f} \colon \mathcal{X} \to \mathcal{Y}$. This is called learning or training.
2. At prediction time, we observe $X$ and form the prediction $\hat{f}(X)$.
3. The outcome is $Y$, and the loss is
   - the squared loss $(\hat{f}(X) - Y)^2$ (regression problems);
   - the zero-one loss $\mathbf{1}\{\hat{f}(X) \neq Y\}$ (classification problems).

Note: the expected zero-one loss is $\mathbb{E}[\mathbf{1}\{\hat{f}(X) \neq Y\}] = \mathbb{P}(\hat{f}(X) \neq Y)$, which we also call the error rate.
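As a tiny numerical illustration of these losses (a sketch of our own, not from the lecture), the snippet below evaluates a squared loss on one example and an empirical error rate, i.e., the average zero-one loss, on a few toy predictions:

```python
import numpy as np

def squared_loss(y_hat, y):
    # Squared loss for regression: (f_hat(X) - Y)^2.
    return (y_hat - y) ** 2

def zero_one_loss(y_hat, y):
    # Zero-one loss for classification: 1{f_hat(X) != Y}.
    return float(y_hat != y)

print(squared_loss(2.5, 3.0))   # 0.25

# Averaging the zero-one loss over several examples gives the empirical error rate.
y_hat = np.array([1, 0, 1, 1])
y     = np.array([1, 1, 1, 0])
print(np.mean(y_hat != y))      # 0.5
```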

Distributions over labeled examples

- $\mathcal{X}$: space of possible side-information (feature space).
- $\mathcal{Y}$: space of possible outcomes (label space or output space).

The distribution $P$ of a random pair $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ can be thought of in two parts:

1. The marginal distribution $P_X$ of $X$: $P_X$ is a probability distribution on $\mathcal{X}$.
2. The conditional distribution $P_{Y \mid X = x}$ of $Y$ given $X = x$, for each $x \in \mathcal{X}$: $P_{Y \mid X = x}$ is a probability distribution on $\mathcal{Y}$.
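To make this factorization concrete, here is a small sketch (our own Python, not from the slides) that recovers $P_X$ and $P_{Y \mid X = x}$ from a table of joint probabilities; it uses the two-row table that appears on the next slide:

```python
from collections import defaultdict

# Joint probabilities P(X = x, Y = y), taken from the small table on the next slide.
joint = {
    (1, 1): 0.1, (1, 2): 0.3, (1, 3): 0.2,
    (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1,
}

# Marginal: P_X(x) = sum over y of P(X = x, Y = y).
P_X = defaultdict(float)
for (x, y), p in joint.items():
    P_X[x] += p

# Conditional: P(Y = y | X = x) = P(X = x, Y = y) / P_X(x).
P_Y_given_X = {(x, y): p / P_X[x] for (x, y), p in joint.items()}

print({x: round(p, 3) for x, p in P_X.items()})   # {1: 0.6, 2: 0.4}
print(round(P_Y_given_X[(1, 2)], 3))              # 0.5
```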

Optimal classifier

For binary classification, what function $f \colon \mathcal{X} \to \{0, 1\}$ has the smallest risk (i.e., error rate) $R(f) := \mathbb{P}(f(X) \neq Y)$?

Conditional on $X = x$, the minimizer of the conditional risk $\hat{y} \mapsto \mathbb{P}(\hat{y} \neq Y \mid X = x)$ is
$$\hat{y} := \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \mathbb{P}(Y = 1 \mid X = x) \le 1/2. \end{cases}$$

Therefore, the function $f^\star \colon \mathcal{X} \to \{0, 1\}$ with
$$f^\star(x) = \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \mathbb{P}(Y = 1 \mid X = x) \le 1/2, \end{cases} \qquad x \in \mathcal{X},$$
has the smallest risk. $f^\star$ is called the Bayes (optimal) classifier.

For $\mathcal{Y} = \{1, \ldots, K\}$,
$$f^\star(x) = \arg\max_{y \in \mathcal{Y}} \mathbb{P}(Y = y \mid X = x), \qquad x \in \mathcal{X}.$$

Optimal classifiers from discrete probability tables

What is the Bayes optimal classifier under the following distribution (entries are joint probabilities $P(X = x, Y = y)$)?

         Y = 1   Y = 2   Y = 3
X = 1    0.1     0.3     0.2
X = 2    0.2     0.1     0.1

Its error rate is 50%.

What about the following distribution?

          Y = 1    Y = 2    Y = 3
X = 1     0.0390   0.0170   0.0010
X = 2     0.0060   0.0170   0.0500
X = 3     0.1190   0.0290   0.0230
X = 4     0.0230   0.0630   0.0040
X = 5     0.0300   0.0120   0.0310
X = 6     0.0270   0.0940   0.0060
X = 7     0.0800   0.0070   0.0050
X = 8     0.0110   0.0500   0.0540
X = 9     0.0940   0.0020   0.0130
X = 10    0.0070   0.0210   0.0650

Here the Bayes error rate is 31.1%.

Finally, suppose the rows are instead indexed by sets $C_1, \ldots, C_{10}$ that partition $\mathcal{X}$, so the table gives $P(X \in C_i, Y = y)$. What is the optimal classifier in that case?
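A small sketch (our own, not from the slides) of how the Bayes classifier and its error rate fall out of such a table: maximizing the joint probability $P(X = x, Y = y)$ over $y$ is the same as maximizing $P(Y = y \mid X = x)$, since $P_X(x)$ does not depend on $y$.

```python
# Bayes optimal classifier for a finite joint distribution: for each x, predict
# the label y with the largest joint probability P(X = x, Y = y); the error
# rate is 1 minus the total probability mass the classifier gets right.
joint = {  # first table above: P(X = x, Y = y)
    (1, 1): 0.1, (1, 2): 0.3, (1, 3): 0.2,
    (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1,
}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

bayes = {x: max(ys, key=lambda y: joint[(x, y)]) for x in xs}
error_rate = 1.0 - sum(joint[(x, bayes[x])] for x in xs)

print(bayes)                 # {1: 2, 2: 1}
print(round(error_rate, 3))  # 0.5, matching the 50% stated above
```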

2. Decision trees

Decision trees

A decision tree is a function $f \colon \mathcal{X} \to \mathcal{Y}$ represented by a binary tree in which:

- each internal tree node is associated with a splitting rule $g \colon \mathcal{X} \to \{0, 1\}$;
- each leaf node is associated with a label $\hat{y} \in \mathcal{Y}$.

The tree nodes partition $\mathcal{X}$ into cells; each cell corresponds to a leaf, and $f$ is constant within each cell.

When $\mathcal{X} = \mathbb{R}^d$, we typically only consider splitting rules of the form $h(x) = \mathbf{1}\{x_i > t\}$ for some $i \in [d]$ and $t \in \mathbb{R}$, i.e., axis-aligned splits. (Notation: $[d] := \{1, \ldots, d\}$.)

[Figure: an example tree that first splits on $x_1 > 1.7$; one side is a leaf with $\hat{y} = 1$, and the other side splits on $x_2 > 2.8$ into leaves with $\hat{y} = 2$ and $\hat{y} = 3$.]
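As a sketch of this representation (our own minimal Python, not a library API; the left/right orientation of the example tree is inferred from the iris example that follows), a tree can be stored as nested nodes, each holding either an axis-aligned splitting rule or a leaf label:

```python
# Minimal decision tree with axis-aligned splits. Internal nodes hold
# (feature index i, threshold t) for the rule 1{x_i > t}; leaves hold a label.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None     # index i of the split 1{x_i > t}; None for a leaf
    threshold: Optional[float] = None
    left: Optional["Node"] = None     # subtree for x_i <= t
    right: Optional["Node"] = None    # subtree for x_i >  t
    label: Optional[int] = None       # prediction, for leaves

def predict(node: Node, x) -> int:
    """Route x down the tree until a leaf is reached; return its label."""
    while node.label is None:
        node = node.right if x[node.feature] > node.threshold else node.left
    return node.label

# The example tree from the slide: split on x_1 > 1.7, then on x_2 > 2.8.
tree = Node(feature=0, threshold=1.7,
            left=Node(label=1),
            right=Node(feature=1, threshold=2.8,
                       left=Node(label=3),
                       right=Node(label=2)))

print(predict(tree, [1.5, 3.0]))   # 1
print(predict(tree, [2.0, 3.0]))   # 2
print(predict(tree, [2.0, 2.0]))   # 3
```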

Decision tree example

Classifying irises by sepal and petal measurements:

- $\mathcal{X} = \mathbb{R}^2$, $\mathcal{Y} = \{1, 2, 3\}$;
- $x_1$ = ratio of sepal length to width;
- $x_2$ = ratio of petal length to width.

[Figure: scatter plot of the iris data, petal length/width versus sepal length/width, overlaid with the partition induced by the tree as it grows: first a single leaf predicting $\hat{y} = 2$; then a split on $x_1 > 1.7$ with leaves $\hat{y} = 1$ and $\hat{y} = 3$; finally the $\hat{y} = 3$ leaf is further split on $x_2 > 2.8$ into leaves $\hat{y} = 2$ and $\hat{y} = 3$.]

Basic decision tree learning algorithm

[Figure: the iris example trees, before and after splitting a leaf: a single leaf $\hat{y} = 2$; the tree with split $x_1 > 1.7$ and leaves $\hat{y} = 1$, $\hat{y} = 3$; and the tree in which the second leaf is split on $x_2 > 2.8$ into $\hat{y} = 2$, $\hat{y} = 3$.]

Top-down greedy algorithm for decision trees
1: Initially, the tree is a single leaf node containing all (training) data.
2: repeat
3:     Pick the leaf ℓ and rule h that maximally reduce uncertainty.
4:     Split the data in ℓ using h, and grow the tree accordingly.
5: until some stopping condition is met.
6: Set the label of each leaf to (an estimate of) the optimal prediction for the leaf's cell.

Notions of uncertainty

Suppose $S$ is the set of labeled examples reaching a leaf ℓ.

Classification ($\mathcal{Y} = \{1, \ldots, K\}$). Gini index:
$$u(S) := 1 - \sum_{y \in \mathcal{Y}} p_y^2,$$
where $p_y$ is the fraction of examples in $S$ with label $y$ (for each $y \in \mathcal{Y}$).

Regression ($\mathcal{Y} = \mathbb{R}$). Variance:
$$u(S) := \frac{1}{|S|} \sum_{(x,y) \in S} (y - \mu(S))^2, \quad \text{where } \mu(S) := \frac{1}{|S|} \sum_{(x,y) \in S} y.$$

Both are minimized when every example in $S$ has the same label. (Other popular notions of uncertainty exist, such as entropy.)
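Both notions are easy to compute from the labels reaching a leaf; here is a small sketch of our own:

```python
from collections import Counter

def gini(labels) -> float:
    """Gini index: 1 - sum over y of p_y^2, with p_y the fraction of labels equal to y."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(values) -> float:
    """Variance uncertainty for regression: mean squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((y - mu) ** 2 for y in values) / n

print(gini([1, 1, 1, 1]))     # 0.0: a pure leaf has zero uncertainty
print(gini([1, 2, 1, 2]))     # 0.5
print(variance([3.0, 3.0]))   # 0.0
print(variance([1.0, 3.0]))   # 1.0
```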

Uncertainty reduction

Suppose the data $S$ at a leaf ℓ is split by a rule $h$ into $S_L$ and $S_R$, where $w_L := |S_L|/|S|$ and $w_R := |S_R|/|S|$: the $w_L$ fraction of examples with $h(x) = 0$ form $S_L$ (uncertainty $u(S_L)$), and the $w_R$ fraction with $h(x) = 1$ form $S_R$ (uncertainty $u(S_R)$).

The reduction in uncertainty from using rule $h$ at leaf ℓ is
$$u(S) - \big( w_L\, u(S_L) + w_R\, u(S_R) \big).$$
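A small sketch (ours) of this quantity for a candidate axis-aligned rule $h(x) = \mathbf{1}\{x_i > t\}$, using the Gini index as the uncertainty $u$:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def reduction(S, i, t):
    """Uncertainty reduction of splitting S (a list of (x, y) pairs) with 1{x[i] > t}."""
    S_L = [(x, y) for x, y in S if x[i] <= t]
    S_R = [(x, y) for x, y in S if x[i] > t]
    if not S_L or not S_R:                 # degenerate split: nothing changes
        return 0.0
    w_L, w_R = len(S_L) / len(S), len(S_R) / len(S)
    u = lambda part: gini([y for _, y in part])
    return u(S) - (w_L * u(S_L) + w_R * u(S_R))

# Toy data: one feature separates the two labels perfectly.
S = [([0.2], 0), ([0.4], 0), ([0.6], 1), ([0.9], 1)]
print(reduction(S, i=0, t=0.5))   # 0.5: from Gini 0.5 down to two pure halves
```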

Uncertainty reduction: iris example

[Figure: the iris scatter plot, together with the current tree: split on $x_1 > 1.7$ with leaves $\hat{y} = 1$ and $\hat{y} = 3$.]

One leaf (with $\hat{y} = 1$) already has zero uncertainty (a pure leaf). The other leaf (with $\hat{y} = 3$) has Gini index
$$u(S) = 1 - \left(\tfrac{1}{101}\right)^2 - \left(\tfrac{50}{101}\right)^2 - \left(\tfrac{50}{101}\right)^2 = 0.5098.$$

Splitting $S$ with $\mathbf{1}\{x_1 > t\}$ into $S_L$, $S_R$: [plot of the reduction in uncertainty as a function of $t$; the reductions stay below about 0.02].

Splitting $S$ with $\mathbf{1}\{x_2 > t\}$ into $S_L$, $S_R$: [plot of the reduction in uncertainty as a function of $t$; the reductions reach about 0.2].

Splitting $S$ with $\mathbf{1}\{x_2 > 2.7222\}$ into $S_L$, $S_R$:
$$u(S_L) = 1 - \left(\tfrac{0}{30}\right)^2 - \left(\tfrac{1}{30}\right)^2 - \left(\tfrac{29}{30}\right)^2 = 0.0605, \qquad u(S_R) = 1 - \left(\tfrac{1}{71}\right)^2 - \left(\tfrac{49}{71}\right)^2 - \left(\tfrac{21}{71}\right)^2 = 0.4197.$$

Reduction in uncertainty:
$$0.5098 - \left( \tfrac{30}{101} \cdot 0.0605 + \tfrac{71}{101} \cdot 0.4197 \right) = 0.2039.$$

Limitations of uncertainty notions

Suppose $\mathcal{X} = \mathbb{R}^2$ and $\mathcal{Y} = \{\text{red}, \text{blue}\}$, and the data is as follows:

[Figure: scatter plot of the data in the $(x_1, x_2)$ plane.]

Every split of the form $\mathbf{1}\{x_i > t\}$ provides no reduction in uncertainty.

Upshot: zero reduction in uncertainty may not be a good stopping condition.
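The claim can be checked directly in code. The sketch below is ours and assumes an XOR-like arrangement of the points (red in two diagonally opposite quadrants, blue in the other two), which is the standard example of this phenomenon:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def reduction(S, i, t):
    S_L = [y for x, y in S if x[i] <= t]
    S_R = [y for x, y in S if x[i] > t]
    if not S_L or not S_R:
        return 0.0
    n = len(S)
    parent = gini([y for _, y in S])
    return parent - (len(S_L) / n * gini(S_L) + len(S_R) / n * gini(S_R))

# XOR-like data: "red" iff the two coordinates agree in sign.
S = [((-1, -1), "red"), ((+1, +1), "red"), ((-1, +1), "blue"), ((+1, -1), "blue")]

for i in (0, 1):
    for t in (-2.0, 0.0, 2.0):           # thresholds between and outside the points
        print(i, t, reduction(S, i, t))  # every reduction is 0.0
```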

Basic decision tree learning algorithm (recap)

[Figure: the iris example trees from before.]

Top-down greedy algorithm for decision trees
1: Initially, the tree is a single leaf node containing all (training) data.
2: repeat
3:     Pick the leaf ℓ and rule h that maximally reduce uncertainty.
4:     Split the data in ℓ using h, and grow the tree accordingly.
5: until some stopping condition is met.
6: Set the label of each leaf to (an estimate of) the optimal prediction for the leaf's cell.
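A compact sketch of this greedy algorithm (our own Python with hypothetical names, not a library implementation), using the Gini index, axis-aligned splits, and a maximum depth standing in for "some stopping condition":

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(S):
    """Return (i, t, reduction) for the axis-aligned split that most reduces Gini."""
    labels = [y for _, y in S]
    parent, n, d = gini(labels), len(S), len(S[0][0])
    best = (None, None, 0.0)
    for i in range(d):
        for t in sorted({x[i] for x, _ in S}):
            L = [y for x, y in S if x[i] <= t]
            R = [y for x, y in S if x[i] > t]
            if not L or not R:
                continue
            red = parent - (len(L) / n * gini(L) + len(R) / n * gini(R))
            if red > best[2]:
                best = (i, t, red)
    return best

def grow(S, max_depth):
    labels = [y for _, y in S]
    majority = Counter(labels).most_common(1)[0][0]   # estimate of the optimal leaf prediction
    i, t, red = best_split(S) if max_depth > 0 else (None, None, 0.0)
    if red <= 0.0:
        return {"label": majority}                    # leaf node
    left  = grow([(x, y) for x, y in S if x[i] <= t], max_depth - 1)
    right = grow([(x, y) for x, y in S if x[i] > t], max_depth - 1)
    return {"feature": i, "threshold": t, "left": left, "right": right}

def predict(node, x):
    while "label" not in node:
        node = node["right"] if x[node["feature"]] > node["threshold"] else node["left"]
    return node["label"]

# Tiny demo on made-up data.
S = [((1.5, 3.0), 1), ((1.6, 4.0), 1), ((2.0, 3.0), 2), ((2.2, 2.5), 3)]
tree = grow(S, max_depth=2)
print(predict(tree, (1.4, 5.0)), predict(tree, (2.0, 3.1)), predict(tree, (2.3, 2.0)))  # 1 2 3
```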

Stopping condition

Many alternatives; two common choices are:

1. Stop when the tree reaches a pre-specified size. (This involves setting additional tuning parameters.)
2. Stop when every leaf is pure (i.e., only one training example reaches each leaf). (Serious danger of overfitting to spurious structure that arises from sampling.)

[Figure: a two-class data set in the $(x_1, x_2)$ plane, with the induced partition growing ever finer as the tree is grown until every leaf is pure.]

Overfitting

[Figure: risk versus number of nodes in the tree, showing a training risk curve and a true risk curve.]

Training risk goes to zero as the number of nodes in the tree increases. True risk decreases initially, but eventually increases due to overfitting.

What can be done about overfitting?

Common strategy:
- Split the training data $S$ into two parts, $S_1$ and $S_2$.
- Use the first part $S_1$ to grow the tree until all leaves are pure.
- Use the second part $S_2$ to choose a good pruning of the tree.

Pruning algorithm (a sketch in code follows below)
Loop: replace a tree node by a leaf node if doing so improves the empirical risk on $S_2$ ... until no more such improvements are possible. This can be done in a bottom-up traversal of the tree.

The independence of $S_1$ and $S_2$ makes it unlikely for spurious structures in each to perfectly align.
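A minimal sketch (ours) of this reduced-error pruning loop, assuming the same dict-based tree representation as the growing sketch earlier; a node is replaced by a leaf only if the leaf strictly improves the empirical risk on $S_2$:

```python
from collections import Counter

def predict(node, x):
    while "label" not in node:
        node = node["right"] if x[node["feature"]] > node["threshold"] else node["left"]
    return node["label"]

def errors(node, S):
    return sum(predict(node, x) != y for x, y in S)

def prune(node, S2):
    """Bottom-up pruning: prune the children first, then consider collapsing this node."""
    if "label" in node or not S2:   # leaf, or no held-out data reaches this node
        return node
    i, t = node["feature"], node["threshold"]
    node["left"]  = prune(node["left"],  [(x, y) for x, y in S2 if x[i] <= t])
    node["right"] = prune(node["right"], [(x, y) for x, y in S2 if x[i] > t])
    majority = Counter(y for _, y in S2).most_common(1)[0][0]
    leaf = {"label": majority}
    # Replace the subtree by a leaf only if that improves the risk on S_2.
    return leaf if errors(leaf, S2) < errors(node, S2) else node

# A deliberately overgrown tree and a small hypothetical validation set S_2.
tree = {"feature": 0, "threshold": 1.7,
        "left": {"label": 1},
        "right": {"feature": 1, "threshold": 2.8,
                  "left": {"label": 3}, "right": {"label": 2}}}
S2 = [((1.5, 3.0), 1), ((2.0, 3.0), 3), ((2.2, 2.5), 3), ((2.4, 2.0), 3)]
print(prune(tree, S2))   # the right subtree collapses to the leaf {'label': 3}
```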

Example: Spam filtering

Data
- 4601 e-mail messages; $\mathcal{Y}$ = {spam, not spam} (39.4% are spam).
- E-mails are represented by 57 features:
  - 48: percentage of e-mail words that are a specific word (e.g., "free", "business");
  - 6: percentage of e-mail characters that are a specific character (e.g., "!");
  - 3: other features (e.g., average length of ALL-CAPS words).

Results
The final decision tree has just 17 leaves; its test risk is 9.3%.

                   ŷ = not spam   ŷ = spam
  y = not spam     57.3%          4.0%
  y = spam         5.3%           33.4%

Example: Spam filtering (continued)

[Figure: the learned decision tree for the spam data. Internal nodes split on features such as ch$ < 0.0555, remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPMAX < 10.5, receive < 0.125, edu < 0.045, our < 1.2, CAPAVE < 2.7505, free < 0.065, business < 0.145, george < 0.15, hp < 0.405, CAPAVE < 2.907, and 1999 < 0.58; each leaf is labeled "spam" or "email" and annotated with its error counts.]

Final remarks

- Decision trees are very flexible classifiers.
- It is very easy to overfit the training data.
- It is NP-hard (i.e., computationally intractable, in general) to find the smallest decision tree consistent with the data.

Key takeaways

1. Structure of decision tree classifiers.
2. Greedy learning algorithm based on notions of uncertainty; limitations of the greedy algorithm.
3. High-level idea of overfitting and ways to deal with it.