Decision trees COMS 4771


1. Prediction functions (again)

Learning prediction functions

IID model for supervised learning: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs (i.e., labeled examples).

- $X$ takes values in $\mathcal{X}$; e.g., $\mathcal{X} = \mathbb{R}^d$.
- $Y$ takes values in $\mathcal{Y}$; e.g., (regression problems) $\mathcal{Y} = \mathbb{R}$; (classification problems) $\mathcal{Y} = \{1, \ldots, K\}$ or $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, +1\}$.

1. We observe $(X_1, Y_1), \ldots, (X_n, Y_n)$ and then choose a prediction function (i.e., predictor) $\hat{f} \colon \mathcal{X} \to \mathcal{Y}$. This is called learning or training.
2. At prediction time, we observe $X$ and form the prediction $\hat{f}(X)$.
3. The outcome is $Y$, and the loss is
   - the squared loss $(\hat{f}(X) - Y)^2$ (regression problems);
   - the zero-one loss $\mathbf{1}\{\hat{f}(X) \neq Y\}$ (classification problems).

Note: the expected zero-one loss is $\mathbb{E}[\mathbf{1}\{\hat{f}(X) \neq Y\}] = \mathbb{P}(\hat{f}(X) \neq Y)$, which we also call the error rate.
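As a tiny numerical illustration of these losses (a sketch of our own, not from the lecture), the snippet below evaluates a squared loss on one example and an empirical error rate, i.e., the average zero-one loss, on a few toy predictions:

```python
import numpy as np

def squared_loss(y_hat, y):
    # Squared loss for regression: (f_hat(X) - Y)^2.
    return (y_hat - y) ** 2

def zero_one_loss(y_hat, y):
    # Zero-one loss for classification: 1{f_hat(X) != Y}.
    return float(y_hat != y)

print(squared_loss(2.5, 3.0))   # 0.25

# Averaging the zero-one loss over several examples gives the empirical error rate.
y_hat = np.array([1, 0, 1, 1])
y     = np.array([1, 1, 1, 0])
print(np.mean(y_hat != y))      # 0.5
```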

Distributions over labeled examples

- $\mathcal{X}$: space of possible side-information (feature space).
- $\mathcal{Y}$: space of possible outcomes (label space or output space).

The distribution $P$ of a random pair $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$ can be thought of in two parts:

1. The marginal distribution $P_X$ of $X$: $P_X$ is a probability distribution on $\mathcal{X}$.
2. The conditional distribution $P_{Y \mid X = x}$ of $Y$ given $X = x$, for each $x \in \mathcal{X}$: $P_{Y \mid X = x}$ is a probability distribution on $\mathcal{Y}$.
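To make this factorization concrete, here is a small sketch (our own Python, not from the slides) that recovers $P_X$ and $P_{Y \mid X = x}$ from a table of joint probabilities; it uses the two-row table that appears on the next slide:

```python
from collections import defaultdict

# Joint probabilities P(X = x, Y = y), taken from the small table on the next slide.
joint = {
    (1, 1): 0.1, (1, 2): 0.3, (1, 3): 0.2,
    (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1,
}

# Marginal: P_X(x) = sum over y of P(X = x, Y = y).
P_X = defaultdict(float)
for (x, y), p in joint.items():
    P_X[x] += p

# Conditional: P(Y = y | X = x) = P(X = x, Y = y) / P_X(x).
P_Y_given_X = {(x, y): p / P_X[x] for (x, y), p in joint.items()}

print({x: round(p, 3) for x, p in P_X.items()})   # {1: 0.6, 2: 0.4}
print(round(P_Y_given_X[(1, 2)], 3))              # 0.5
```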

Optimal classifier

For binary classification, what function $f \colon \mathcal{X} \to \{0, 1\}$ has the smallest risk (i.e., error rate) $R(f) := \mathbb{P}(f(X) \neq Y)$?

Conditional on $X = x$, the minimizer of the conditional risk $\hat{y} \mapsto \mathbb{P}(\hat{y} \neq Y \mid X = x)$ is
$$\hat{y} := \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \mathbb{P}(Y = 1 \mid X = x) \le 1/2. \end{cases}$$

Therefore, the function $f^\star \colon \mathcal{X} \to \{0, 1\}$ with
$$f^\star(x) = \begin{cases} 1 & \text{if } \mathbb{P}(Y = 1 \mid X = x) > 1/2, \\ 0 & \text{if } \mathbb{P}(Y = 1 \mid X = x) \le 1/2, \end{cases} \qquad x \in \mathcal{X},$$
has the smallest risk. $f^\star$ is called the Bayes (optimal) classifier.

For $\mathcal{Y} = \{1, \ldots, K\}$,
$$f^\star(x) = \arg\max_{y \in \mathcal{Y}} \mathbb{P}(Y = y \mid X = x), \qquad x \in \mathcal{X}.$$

Optimal classifiers from discrete probability tables

What is the Bayes optimal classifier under the following distribution (entries are joint probabilities $P(X = x, Y = y)$)?

         Y = 1   Y = 2   Y = 3
X = 1    0.1     0.3     0.2
X = 2    0.2     0.1     0.1

Its error rate is 50%.

What about the following distribution?

          Y = 1    Y = 2    Y = 3
X = 1     0.0390   0.0170   0.0010
X = 2     0.0060   0.0170   0.0500
X = 3     0.1190   0.0290   0.0230
X = 4     0.0230   0.0630   0.0040
X = 5     0.0300   0.0120   0.0310
X = 6     0.0270   0.0940   0.0060
X = 7     0.0800   0.0070   0.0050
X = 8     0.0110   0.0500   0.0540
X = 9     0.0940   0.0020   0.0130
X = 10    0.0070   0.0210   0.0650

Here the Bayes error rate is 31.1%.

Finally, suppose the rows are instead indexed by sets $C_1, \ldots, C_{10}$ that partition $\mathcal{X}$, so the table gives $P(X \in C_i, Y = y)$. What is the optimal classifier in that case?
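A small sketch (our own, not from the slides) of how the Bayes classifier and its error rate fall out of such a table: maximizing the joint probability $P(X = x, Y = y)$ over $y$ is the same as maximizing $P(Y = y \mid X = x)$, since $P_X(x)$ does not depend on $y$.

```python
# Bayes optimal classifier for a finite joint distribution: for each x, predict
# the label y with the largest joint probability P(X = x, Y = y); the error
# rate is 1 minus the total probability mass the classifier gets right.
joint = {  # first table above: P(X = x, Y = y)
    (1, 1): 0.1, (1, 2): 0.3, (1, 3): 0.2,
    (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1,
}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

bayes = {x: max(ys, key=lambda y: joint[(x, y)]) for x in xs}
error_rate = 1.0 - sum(joint[(x, bayes[x])] for x in xs)

print(bayes)                 # {1: 2, 2: 1}
print(round(error_rate, 3))  # 0.5, matching the 50% stated above
```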

2. Decision trees

Decision trees

A decision tree is a function $f \colon \mathcal{X} \to \mathcal{Y}$ represented by a binary tree in which:

- each internal tree node is associated with a splitting rule $g \colon \mathcal{X} \to \{0, 1\}$;
- each leaf node is associated with a label $\hat{y} \in \mathcal{Y}$.

The tree nodes partition $\mathcal{X}$ into cells; each cell corresponds to a leaf, and $f$ is constant within each cell.

When $\mathcal{X} = \mathbb{R}^d$, we typically only consider splitting rules of the form $h(x) = \mathbf{1}\{x_i > t\}$ for some $i \in [d]$ and $t \in \mathbb{R}$, i.e., axis-aligned splits. (Notation: $[d] := \{1, \ldots, d\}$.)

[Figure: an example tree that first splits on $x_1 > 1.7$; one side is a leaf with $\hat{y} = 1$, and the other side splits on $x_2 > 2.8$ into leaves with $\hat{y} = 2$ and $\hat{y} = 3$.]
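As a sketch of this representation (our own minimal Python, not a library API; the left/right orientation of the example tree is inferred from the iris example that follows), a tree can be stored as nested nodes, each holding either an axis-aligned splitting rule or a leaf label:

```python
# Minimal decision tree with axis-aligned splits. Internal nodes hold
# (feature index i, threshold t) for the rule 1{x_i > t}; leaves hold a label.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None     # index i of the split 1{x_i > t}; None for a leaf
    threshold: Optional[float] = None
    left: Optional["Node"] = None     # subtree for x_i <= t
    right: Optional["Node"] = None    # subtree for x_i >  t
    label: Optional[int] = None       # prediction, for leaves

def predict(node: Node, x) -> int:
    """Route x down the tree until a leaf is reached; return its label."""
    while node.label is None:
        node = node.right if x[node.feature] > node.threshold else node.left
    return node.label

# The example tree from the slide: split on x_1 > 1.7, then on x_2 > 2.8.
tree = Node(feature=0, threshold=1.7,
            left=Node(label=1),
            right=Node(feature=1, threshold=2.8,
                       left=Node(label=3),
                       right=Node(label=2)))

print(predict(tree, [1.5, 3.0]))   # 1
print(predict(tree, [2.0, 3.0]))   # 2
print(predict(tree, [2.0, 2.0]))   # 3
```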

Decision tree example

Classifying irises by sepal and petal measurements:

- $\mathcal{X} = \mathbb{R}^2$, $\mathcal{Y} = \{1, 2, 3\}$;
- $x_1$ = ratio of sepal length to width;
- $x_2$ = ratio of petal length to width.

[Figure: scatter plot of the iris data, petal length/width versus sepal length/width, overlaid with the partition induced by the tree as it grows: first a single leaf predicting $\hat{y} = 2$; then a split on $x_1 > 1.7$ with leaves $\hat{y} = 1$ and $\hat{y} = 3$; finally the $\hat{y} = 3$ leaf is further split on $x_2 > 2.8$ into leaves $\hat{y} = 2$ and $\hat{y} = 3$.]

Basic decision tree learning algorithm

[Figure: the iris example trees, before and after splitting a leaf: a single leaf $\hat{y} = 2$; the tree with split $x_1 > 1.7$ and leaves $\hat{y} = 1$, $\hat{y} = 3$; and the tree in which the second leaf is split on $x_2 > 2.8$ into $\hat{y} = 2$, $\hat{y} = 3$.]

Top-down greedy algorithm for decision trees
1: Initially, the tree is a single leaf node containing all (training) data.
2: repeat
3:     Pick the leaf ℓ and rule h that maximally reduce uncertainty.
4:     Split the data in ℓ using h, and grow the tree accordingly.
5: until some stopping condition is met.
6: Set the label of each leaf to (an estimate of) the optimal prediction for the leaf's cell.

Notions of uncertainty

Suppose $S$ is the set of labeled examples reaching a leaf ℓ.

Classification ($\mathcal{Y} = \{1, \ldots, K\}$). Gini index:
$$u(S) := 1 - \sum_{y \in \mathcal{Y}} p_y^2,$$
where $p_y$ is the fraction of examples in $S$ with label $y$ (for each $y \in \mathcal{Y}$).

Regression ($\mathcal{Y} = \mathbb{R}$). Variance:
$$u(S) := \frac{1}{|S|} \sum_{(x,y) \in S} (y - \mu(S))^2, \quad \text{where } \mu(S) := \frac{1}{|S|} \sum_{(x,y) \in S} y.$$

Both are minimized when every example in $S$ has the same label. (Other popular notions of uncertainty exist, such as entropy.)
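Both notions are easy to compute from the labels reaching a leaf; here is a small sketch of our own:

```python
from collections import Counter

def gini(labels) -> float:
    """Gini index: 1 - sum over y of p_y^2, with p_y the fraction of labels equal to y."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(values) -> float:
    """Variance uncertainty for regression: mean squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((y - mu) ** 2 for y in values) / n

print(gini([1, 1, 1, 1]))     # 0.0: a pure leaf has zero uncertainty
print(gini([1, 2, 1, 2]))     # 0.5
print(variance([3.0, 3.0]))   # 0.0
print(variance([1.0, 3.0]))   # 1.0
```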

Uncertainty reduction

Suppose the data $S$ at a leaf ℓ is split by a rule $h$ into $S_L$ and $S_R$, where $w_L := |S_L|/|S|$ and $w_R := |S_R|/|S|$: the $w_L$ fraction of examples with $h(x) = 0$ form $S_L$ (uncertainty $u(S_L)$), and the $w_R$ fraction with $h(x) = 1$ form $S_R$ (uncertainty $u(S_R)$).

The reduction in uncertainty from using rule $h$ at leaf ℓ is
$$u(S) - \big( w_L\, u(S_L) + w_R\, u(S_R) \big).$$
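A small sketch (ours) of this quantity for a candidate axis-aligned rule $h(x) = \mathbf{1}\{x_i > t\}$, using the Gini index as the uncertainty $u$:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def reduction(S, i, t):
    """Uncertainty reduction of splitting S (a list of (x, y) pairs) with 1{x[i] > t}."""
    S_L = [(x, y) for x, y in S if x[i] <= t]
    S_R = [(x, y) for x, y in S if x[i] > t]
    if not S_L or not S_R:                 # degenerate split: nothing changes
        return 0.0
    w_L, w_R = len(S_L) / len(S), len(S_R) / len(S)
    u = lambda part: gini([y for _, y in part])
    return u(S) - (w_L * u(S_L) + w_R * u(S_R))

# Toy data: one feature separates the two labels perfectly.
S = [([0.2], 0), ([0.4], 0), ([0.6], 1), ([0.9], 1)]
print(reduction(S, i=0, t=0.5))   # 0.5: from Gini 0.5 down to two pure halves
```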

Uncertainty reduction: iris example

[Figure: the iris scatter plot, together with the current tree: split on $x_1 > 1.7$ with leaves $\hat{y} = 1$ and $\hat{y} = 3$.]

One leaf (with $\hat{y} = 1$) already has zero uncertainty (a pure leaf). The other leaf (with $\hat{y} = 3$) has Gini index
$$u(S) = 1 - \left(\tfrac{1}{101}\right)^2 - \left(\tfrac{50}{101}\right)^2 - \left(\tfrac{50}{101}\right)^2 = 0.5098.$$

Splitting $S$ with $\mathbf{1}\{x_1 > t\}$ into $S_L$, $S_R$: [plot of the reduction in uncertainty as a function of $t$; the reductions stay below about 0.02].

Splitting $S$ with $\mathbf{1}\{x_2 > t\}$ into $S_L$, $S_R$: [plot of the reduction in uncertainty as a function of $t$; the reductions reach about 0.2].

Splitting $S$ with $\mathbf{1}\{x_2 > 2.7222\}$ into $S_L$, $S_R$:
$$u(S_L) = 1 - \left(\tfrac{0}{30}\right)^2 - \left(\tfrac{1}{30}\right)^2 - \left(\tfrac{29}{30}\right)^2 = 0.0605, \qquad u(S_R) = 1 - \left(\tfrac{1}{71}\right)^2 - \left(\tfrac{49}{71}\right)^2 - \left(\tfrac{21}{71}\right)^2 = 0.4197.$$

Reduction in uncertainty:
$$0.5098 - \left( \tfrac{30}{101} \cdot 0.0605 + \tfrac{71}{101} \cdot 0.4197 \right) = 0.2039.$$

Limitations of uncertainty notions

Suppose $\mathcal{X} = \mathbb{R}^2$ and $\mathcal{Y} = \{\text{red}, \text{blue}\}$, and the data is as follows:

[Figure: scatter plot of the data in the $(x_1, x_2)$ plane.]

Every split of the form $\mathbf{1}\{x_i > t\}$ provides no reduction in uncertainty.

Upshot: zero reduction in uncertainty may not be a good stopping condition.
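The claim can be checked directly in code. The sketch below is ours and assumes an XOR-like arrangement of the points (red in two diagonally opposite quadrants, blue in the other two), which is the standard example of this phenomenon:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def reduction(S, i, t):
    S_L = [y for x, y in S if x[i] <= t]
    S_R = [y for x, y in S if x[i] > t]
    if not S_L or not S_R:
        return 0.0
    n = len(S)
    parent = gini([y for _, y in S])
    return parent - (len(S_L) / n * gini(S_L) + len(S_R) / n * gini(S_R))

# XOR-like data: "red" iff the two coordinates agree in sign.
S = [((-1, -1), "red"), ((+1, +1), "red"), ((-1, +1), "blue"), ((+1, -1), "blue")]

for i in (0, 1):
    for t in (-2.0, 0.0, 2.0):           # thresholds between and outside the points
        print(i, t, reduction(S, i, t))  # every reduction is 0.0
```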

Basic decision tree learning algorithm (recap)

[Figure: the iris example trees from before.]

Top-down greedy algorithm for decision trees
1: Initially, the tree is a single leaf node containing all (training) data.
2: repeat
3:     Pick the leaf ℓ and rule h that maximally reduce uncertainty.
4:     Split the data in ℓ using h, and grow the tree accordingly.
5: until some stopping condition is met.
6: Set the label of each leaf to (an estimate of) the optimal prediction for the leaf's cell.
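A compact sketch of this greedy algorithm (our own Python with hypothetical names, not a library implementation), using the Gini index, axis-aligned splits, and a maximum depth standing in for "some stopping condition":

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(S):
    """Return (i, t, reduction) for the axis-aligned split that most reduces Gini."""
    labels = [y for _, y in S]
    parent, n, d = gini(labels), len(S), len(S[0][0])
    best = (None, None, 0.0)
    for i in range(d):
        for t in sorted({x[i] for x, _ in S}):
            L = [y for x, y in S if x[i] <= t]
            R = [y for x, y in S if x[i] > t]
            if not L or not R:
                continue
            red = parent - (len(L) / n * gini(L) + len(R) / n * gini(R))
            if red > best[2]:
                best = (i, t, red)
    return best

def grow(S, max_depth):
    labels = [y for _, y in S]
    majority = Counter(labels).most_common(1)[0][0]   # estimate of the optimal leaf prediction
    i, t, red = best_split(S) if max_depth > 0 else (None, None, 0.0)
    if red <= 0.0:
        return {"label": majority}                    # leaf node
    left  = grow([(x, y) for x, y in S if x[i] <= t], max_depth - 1)
    right = grow([(x, y) for x, y in S if x[i] > t], max_depth - 1)
    return {"feature": i, "threshold": t, "left": left, "right": right}

def predict(node, x):
    while "label" not in node:
        node = node["right"] if x[node["feature"]] > node["threshold"] else node["left"]
    return node["label"]

# Tiny demo on made-up data.
S = [((1.5, 3.0), 1), ((1.6, 4.0), 1), ((2.0, 3.0), 2), ((2.2, 2.5), 3)]
tree = grow(S, max_depth=2)
print(predict(tree, (1.4, 5.0)), predict(tree, (2.0, 3.1)), predict(tree, (2.3, 2.0)))  # 1 2 3
```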

Stopping condition

Many alternatives; two common choices are:

1. Stop when the tree reaches a pre-specified size. (This involves setting additional tuning parameters.)
2. Stop when every leaf is pure (i.e., only one training example reaches each leaf). (Serious danger of overfitting to spurious structure that arises from sampling.)

[Figure: a two-class data set in the $(x_1, x_2)$ plane, with the induced partition growing ever finer as the tree is grown until every leaf is pure.]

Overfitting

[Figure: risk versus number of nodes in the tree, showing a training risk curve and a true risk curve.]

Training risk goes to zero as the number of nodes in the tree increases. True risk decreases initially, but eventually increases due to overfitting.

What can be done about overfitting?

Common strategy:
- Split the training data $S$ into two parts, $S_1$ and $S_2$.
- Use the first part $S_1$ to grow the tree until all leaves are pure.
- Use the second part $S_2$ to choose a good pruning of the tree.

Pruning algorithm (a sketch in code follows below)
Loop: replace a tree node by a leaf node if doing so improves the empirical risk on $S_2$ ... until no more such improvements are possible. This can be done in a bottom-up traversal of the tree.

The independence of $S_1$ and $S_2$ makes it unlikely for spurious structures in each to perfectly align.
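A minimal sketch (ours) of this reduced-error pruning loop, assuming the same dict-based tree representation as the growing sketch earlier; a node is replaced by a leaf only if the leaf strictly improves the empirical risk on $S_2$:

```python
from collections import Counter

def predict(node, x):
    while "label" not in node:
        node = node["right"] if x[node["feature"]] > node["threshold"] else node["left"]
    return node["label"]

def errors(node, S):
    return sum(predict(node, x) != y for x, y in S)

def prune(node, S2):
    """Bottom-up pruning: prune the children first, then consider collapsing this node."""
    if "label" in node or not S2:   # leaf, or no held-out data reaches this node
        return node
    i, t = node["feature"], node["threshold"]
    node["left"]  = prune(node["left"],  [(x, y) for x, y in S2 if x[i] <= t])
    node["right"] = prune(node["right"], [(x, y) for x, y in S2 if x[i] > t])
    majority = Counter(y for _, y in S2).most_common(1)[0][0]
    leaf = {"label": majority}
    # Replace the subtree by a leaf only if that improves the risk on S_2.
    return leaf if errors(leaf, S2) < errors(node, S2) else node

# A deliberately overgrown tree and a small hypothetical validation set S_2.
tree = {"feature": 0, "threshold": 1.7,
        "left": {"label": 1},
        "right": {"feature": 1, "threshold": 2.8,
                  "left": {"label": 3}, "right": {"label": 2}}}
S2 = [((1.5, 3.0), 1), ((2.0, 3.0), 3), ((2.2, 2.5), 3), ((2.4, 2.0), 3)]
print(prune(tree, S2))   # the right subtree collapses to the leaf {'label': 3}
```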

Example: Spam filtering

Data
- 4601 e-mail messages; $\mathcal{Y}$ = {spam, not spam} (39.4% are spam).
- E-mails are represented by 57 features:
  - 48: percentage of e-mail words that are a specific word (e.g., "free", "business");
  - 6: percentage of e-mail characters that are a specific character (e.g., "!");
  - 3: other features (e.g., average length of ALL-CAPS words).

Results
The final decision tree has just 17 leaves; its test risk is 9.3%.

                   ŷ = not spam   ŷ = spam
  y = not spam     57.3%          4.0%
  y = spam         5.3%           33.4%

Example: Spam filtering (continued)

[Figure: the learned decision tree for the spam data. Internal nodes split on features such as ch$ < 0.0555, remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPMAX < 10.5, receive < 0.125, edu < 0.045, our < 1.2, CAPAVE < 2.7505, free < 0.065, business < 0.145, george < 0.15, hp < 0.405, CAPAVE < 2.907, and 1999 < 0.58; each leaf is labeled "spam" or "email" and annotated with its error counts.]

Final remarks

- Decision trees are very flexible classifiers.
- It is very easy to overfit the training data.
- It is NP-hard (i.e., computationally intractable, in general) to find the smallest decision tree consistent with the data.

Key takeaways

1. Structure of decision tree classifiers.
2. Greedy learning algorithm based on notions of uncertainty; limitations of the greedy algorithm.
3. High-level idea of overfitting and ways to deal with it.