Decision Trees: Overfitting


Decision Trees: Overfitting
Emily Fox, University of Washington, January 30, 2017

Decision tree recap
[Tree diagram: a loan-status tree with Root counts 22 / 18, splitting on Credit? (excellent / fair / poor), with Term? and Income? splits below; each node shows counts of the two loan classes.]
For each leaf node, set ŷ = majority value

Greedy decision tree learning
Step 1: Start with an empty tree
Step 2: Select a feature to split the data (pick the feature split leading to the lowest classification error)
For each split of the tree:
  Step 3: If there is nothing more to do, make predictions (stopping conditions 1 & 2)
  Step 4: Otherwise, go to Step 2 & continue (recurse) on this split
(A code sketch of this recursion follows the next slide.)

Scoring a loan application
x_i = (Credit = poor, Income = high, Term = 5 years)
[Tree diagram: Credit? at the root (excellent / fair / poor), with Term? and Income? splits below; following the poor branch, then Income = high and Term = 5 years, gives the prediction ŷ_i.]
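To make the recursion concrete, here is a minimal Python sketch of greedy tree learning under the stopping conditions above. It assumes categorical features stored in dicts, and the helpers (majority_label, classification_error) and the dict tree representation are illustrative choices, not the course's reference implementation.

```python
# Minimal sketch of greedy decision tree learning (helper names are assumptions).
from collections import Counter

def majority_label(labels):
    # Majority vote; ties broken arbitrarily.
    return Counter(labels).most_common(1)[0][0]

def classification_error(labels):
    # Fraction of points a majority-vote leaf would misclassify.
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

def build_tree(data, labels, features):
    # Stopping condition 1: all examples have the same target value.
    # Stopping condition 2: no more features to split on.
    if len(set(labels)) == 1 or not features:
        return {'leaf': True, 'prediction': majority_label(labels)}

    # Step 2: pick the feature whose split gives the lowest classification error.
    def split_error(f):
        err = 0.0
        for v in set(x[f] for x in data):
            subset = [y for x, y in zip(data, labels) if x[f] == v]
            err += classification_error(subset) * len(subset) / len(labels)
        return err

    best = min(features, key=split_error)
    children = {}
    for v in set(x[best] for x in data):
        sub_data = [x for x in data if x[best] == v]
        sub_labels = [y for x, y in zip(data, labels) if x[best] == v]
        # Step 4: recurse on each branch with the remaining features.
        children[v] = build_tree(sub_data, sub_labels,
                                 [f for f in features if f != best])
    return {'leaf': False, 'feature': best, 'children': children}
```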

Decision trees vs logistic regression: Example
Logistic regression learned weights:
  h_0(x) = 1 (constant): weight 0.22
  h_1(x) = x[1]: weight 1.12
  h_2(x) = x[2]: weight -1.07

Depth 1: Split on x[1]
y values (- / +): Root: 18 / 13
  x[1] < -0.07: 13 / 3
  x[1] >= -0.07: 4 / 11

Depth 2
y values (- / +): Root: 18 / 13
  x[1] < -0.07: 13 / 3, split again on x[1]:
    x[1] < -1.66: 7 / 0
    x[1] >= -1.66: 6 / 3
  x[1] >= -0.07: 4 / 11, split on x[2]:
    x[2] < 1.55: 1 / 11
    x[2] >= 1.55: 3 / 0

Threshold split caveat
For threshold splits, the same feature can be used multiple times (here x[1] is split on again below the root split on x[1]).
[Tree diagram: the same depth-2 tree as above.]

Decision boundaries
[Plots of the decision boundary at depth 1, depth 2, and depth 10.]

Comparing decision boundaries
[Plots: decision tree boundaries at depth 1, 3, and 10 vs. logistic regression boundaries with degree 1, 2, and 6 features.]

Predicting probabilities with decision trees
[Tree diagram: loan status, Root 18 / 12, split on Credit?: excellent 9 / 2, fair 6 / 9, poor 3 / 1.]
E.g., in the poor leaf: P(y = majority value | x) = 3 / (3 + 1) = 0.75

Depth 1 probabilities
y values (- / +): Root: 18 / 13
  x[1] < -0.07: 13 / 3
  x[1] >= -0.07: 4 / 11

Depth 2 probabilities
y values (- / +): Root: 18 / 13
  x[1] < -0.07: 13 / 3, split on x[1]: x[1] < -1.66: 7 / 0; x[1] >= -1.66: 6 / 3
  x[1] >= -0.07: 4 / 11, split on x[2]: x[2] < 1.55: 1 / 11; x[2] >= 1.55: 3 / 0

Comparison with logistic regression
[Plots: class probability surfaces for logistic regression with degree-2 features vs. a depth-2 decision tree.]

Overfitting in decision trees

What happens when we increase depth?
Training error reduces with depth:
  depth = 1: training error 0.22
  depth = 2: 0.13
  depth = 3: 0.10
  depth = 5: 0.03
  depth = 10: 0.00
[Plots of the corresponding decision boundaries.]

Two approaches to picking simpler trees
1. Early stopping: Stop the learning algorithm before the tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates

Technique 1: Early stopping
Stopping conditions (recap):
1. All examples have the same target value
2. No more features to split on
Early stopping conditions (a code sketch follows below):
1. Limit tree depth (choose max_depth using a validation set)
2. Do not consider splits that do not cause a sufficient decrease in classification error
3. Do not split an intermediate node that contains too few data points

Challenge with early stopping condition 1
Hard to know exactly when to stop.
[Plot: classification error vs. tree depth; training error keeps falling from simple trees to complex trees while true error starts rising past the best max_depth.]
Also, we might want some branches of the tree to go deeper while others remain shallow.
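A hedged sketch of how the three early-stopping conditions could be checked while growing a node; the parameter names (max_depth, min_error_reduction, min_node_size) are illustrative assumptions, not names used in the slides.

```python
# Illustrative early-stopping checks; parameter names are assumptions for this sketch.
def should_stop_early(num_points, depth, current_error, best_split_error,
                      max_depth=10, min_error_reduction=0.0, min_node_size=10):
    if depth >= max_depth:                                        # 1. limit tree depth
        return True
    if current_error - best_split_error <= min_error_reduction:  # 2. split barely reduces error
        return True
    if num_points <= min_node_size:                              # 3. node has too few points
        return True
    return False

# Example: a node with 8 points whose best split only reduces error by 0.01.
print(should_stop_early(num_points=8, depth=3, current_error=0.20, best_split_error=0.19,
                        max_depth=10, min_error_reduction=0.05, min_node_size=10))  # True
```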

Early stopping condition 2: Pros and cons
Pros:
- A reasonable heuristic for early stopping that avoids useless splits
Cons:
- Too short-sighted: we may miss good splits that occur right after useless splits
- We saw this with the XOR example

Two approaches to picking simpler trees
1. Early stopping: Stop the learning algorithm before the tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates
   Complements early stopping

Pruning: Intuition
Train a complex tree, simplify later: complex tree, then simplify, then simpler tree.

Pruning motivation
[Plot: classification error vs. tree depth; training error keeps falling while true error rises for complex trees.]
Simplify after the tree is built; don't stop too early.

Scoring trees: Desired total quality format
Want to balance:
i. How well the tree fits the data: measure of fit (classification error); a large value means a bad fit to the training data
ii. Complexity of the tree: measure of complexity; a large value means the tree is likely to overfit
Total cost = measure of fit + measure of complexity

Simple measure of complexity of a tree
L(T) = # of leaf nodes
[Tree diagram: a single split on Credit? into excellent / fair / poor leaves, so L(T) = 3.]

Balance simplicity & predictive power
Too complex, risk of overfitting: the deep tree (Credit? with Term? and Income? splits below).
Too simple, high classification error: a single split on Term?.

Balancing fit and complexity
Total cost C(T) = Error(T) + λ L(T), where λ is a tuning parameter.
If λ = 0: only the fit matters, and we recover standard decision tree learning.
If λ = ∞: only the complexity matters, so the best tree is a single leaf predicting the majority class.
If λ is in between: we trade off fit against complexity, which is the regime we want.
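A small numeric check of the total cost C(T) = Error(T) + λ L(T), using the error and leaf counts from the pruning example on the next slides (with λ = 0.03).

```python
# Total cost of a tree: C(T) = Error(T) + lambda * L(T), with L(T) = number of leaves.
def total_cost(error, num_leaves, lam):
    return error + lam * num_leaves

# Values from the pruning example below (lambda = 0.03):
print(total_cost(0.25, 6, 0.03))   # ~0.43 for the full tree T
print(total_cost(0.26, 5, 0.03))   # ~0.41 for T_smaller: lower cost, so prune
```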

Tree pruning algorithm
Step 1: Consider a split
[Tree T diagram: Credit? at the root (excellent / fair / poor), Term? and Income? splits below; the lowest Term? split is the candidate for pruning.]

Step 2: Compute the total cost C(T) of the split
[Tree T diagram, with the candidate Term? split highlighted.]
With λ = 0.03 and C(T) = Error(T) + λ L(T):
  Tree T: Error 0.25, 6 leaves, Total 0.43

Step 3: Undo the splits on T_smaller (replace the candidate split by a leaf node)
[Tree T_smaller diagram: the candidate Term? split replaced by a leaf.]
  Tree T: Error 0.25, 6 leaves, Total 0.43
  Tree T_smaller: Error 0.26, 5 leaves, Total 0.41

Step 4: Prune if the total cost is lower: C(T_smaller) ≤ C(T)
  Tree T: Error 0.25, 6 leaves, Total 0.43
  Tree T_smaller: Error 0.26, 5 leaves, Total 0.41
T_smaller has worse training error but lower overall cost, so replace the split by a leaf node: yes, prune!

Step 5: Repeat Steps 1-4 for every split
Decide whether each split can be pruned.
[Tree diagram: the full tree, with each split considered in turn.]

Summary of overfitting in decision trees
What you can do now:
- Identify when overfitting occurs in decision trees
- Prevent overfitting with early stopping
  - Limit tree depth
  - Do not consider splits that do not reduce classification error
  - Do not split intermediate nodes with too few points
- Prevent overfitting by pruning complex trees
  - Use a total cost formula that balances classification error and tree complexity
  - Use the total cost to merge potentially complex trees into simpler ones

Boosting
Emily Fox, University of Washington, January 30, 2017

Simple (weak) classifiers are good!
Examples: logistic regression with simple features, shallow decision trees, decision stumps (e.g., Income > $100K? Yes / No).
Low variance, and learning is fast! But high bias.

Finding a classifier that's just right
[Plot: classification error vs. model complexity; the true error has a sweet spot between a weak learner (too simple) and an overly complex model.]
A weak learner needs to become a stronger learner:
Option 1: add more features or depth
Option 2: boosting (this lecture)

Boosting question
"Can a set of weak learners be combined to create a stronger learner?" Kearns and Valiant (1988)
"Yes!" Schapire (1990): Boosting
Amazing impact: a simple approach that is widely used in industry and wins most Kaggle competitions.

Ensemble classifier
A single classifier:
  Input: x (e.g., Income > $100K? Yes / No)
  Output: ŷ = f(x), either +1 or -1

Ensemble methods: Each classifier votes on the prediction
x_i = (Income = $120K, Credit = Bad, Savings = $50K, Market = Good)
  Income > $100K?       f_1(x_i) = +1
  Credit history?       f_2(x_i) = -1
  Savings > $100K?      f_3(x_i) = -1
  Market conditions?    f_4(x_i) = +1
Ensemble model: combine the votes by learning coefficients
F(x_i) = sign(w_1 f_1(x_i) + w_2 f_2(x_i) + w_3 f_3(x_i) + w_4 f_4(x_i))

Prediction with ensemble
  Income > $100K?       f_1(x_i) = +1    w_1 = 2
  Credit history?       f_2(x_i) = -1    w_2 = 1.5
  Savings > $100K?      f_3(x_i) = -1    w_3 = 1.5
  Market conditions?    f_4(x_i) = +1    w_4 = 0.5
F(x_i) = sign(w_1 f_1(x_i) + w_2 f_2(x_i) + w_3 f_3(x_i) + w_4 f_4(x_i))
       = sign(2 - 1.5 - 1.5 + 0.5) = sign(-0.5) = -1

Ensemble classifier in general
Goal:
- Predict output y (either +1 or -1) from input x
Learn ensemble model:
- Classifiers: f_1(x), f_2(x), ..., f_T(x)
- Coefficients: ŵ_1, ŵ_2, ..., ŵ_T
Prediction: ŷ = sign( sum_{t=1}^{T} ŵ_t f_t(x) )
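A short sketch of the weighted-vote prediction F(x) = sign(sum_t w_t f_t(x)); the four hard-coded classifiers and coefficients just reproduce the slide's example and are not a real model.

```python
# Weighted-vote ensemble prediction: F(x) = sign(sum_t w_t * f_t(x)).
def ensemble_predict(classifiers, coefficients, x):
    score = sum(w * f(x) for f, w in zip(classifiers, coefficients))
    return +1 if score > 0 else -1

# The four-classifier example: votes +1, -1, -1, +1 with coefficients 2, 1.5, 1.5, 0.5
# give sign(2 - 1.5 - 1.5 + 0.5) = sign(-0.5) = -1.
classifiers = [lambda x: +1, lambda x: -1, lambda x: -1, lambda x: +1]
coefficients = [2, 1.5, 1.5, 0.5]
print(ensemble_predict(classifiers, coefficients, None))   # -1
```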

Boosting

Training a classifier
Training data, then learn classifier f(x), then predict ŷ = sign(f(x)).

Learning a decision stump
[Data table: 11 loans with Credit ∈ {A, B, C} and Income between $30K and $130K; the labels y were shown as icons in the original slide.]
Split on Income > $100K?: the "yes" branch contains 3 / 1 points and the "no" branch 4 / 3 points; each leaf predicts ŷ = the majority value.

Boosting = Focus learning on hard points
Training data, learn classifier f(x), predict ŷ = sign(f(x)), then evaluate.
Boosting: focus the next classifier on places where f(x) does less well; learn where f(x) makes mistakes.

Learning on weighted data: More weight on hard or more important points
Weighted dataset:
- Each (x_i, y_i) is weighted by α_i
- A more important point gets a higher weight α_i
Learning:
- Data point i counts as α_i data points (e.g., α_i = 2 counts the point twice)

Learning a decision stump on weighted data
Increase the weight α of harder / misclassified points:
  Credit  Income  Weight α
  A       $130K   0.5
  B       $80K    1.5
  C       $110K   1.2
  A       $110K   0.8
  A       $90K    0.6
  B       $120K   0.7
  C       $30K    3
  C       $60K    2
  B       $95K    0.8
  A       $60K    0.7
  A       $98K    0.9
(the labels y were shown as icons in the original slide)
Split on Income > $100K?: the "yes" branch now has total weights 2 / 1.2 and the "no" branch 3 / 6.5; each leaf predicts ŷ = the class with the larger total weight.

Learning from weighted data in general
Usually it is straightforward to adapt learning to weighted data:
- Data point i counts as α_i data points
E.g., in gradient ascent for logistic regression, the sum over data points is weighted: each point's contribution is multiplied by α_i (see the sketch below).

Boosting = Greedy learning of ensembles from data
Training data, learn classifier f_1(x), predict ŷ = sign(f_1(x)).
Give higher weight to points where f_1(x) is wrong, then on the weighted data learn classifier f_2(x) and coefficient ŵ_2.
Predict ŷ = sign(ŵ_1 f_1(x) + ŵ_2 f_2(x)).
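As one concrete, hypothetical instance of weighted learning, here is a sketch of a single gradient-ascent step for logistic regression in which each point's contribution is scaled by its weight α_i; the feature matrix H, the step size, and the function name are assumptions made for illustration, not the course's exact update.

```python
import numpy as np

def weighted_logistic_gradient_step(w, H, y, alpha, step_size=0.1):
    # H: (N, D) feature matrix h(x_i); y: labels in {+1, -1}; alpha: per-point weights.
    p = 1.0 / (1.0 + np.exp(-H.dot(w)))          # P(y = +1 | x_i, w)
    indicator = (y == +1).astype(float)
    # Each point's contribution to the gradient is scaled by its weight alpha_i.
    gradient = H.T.dot(alpha * (indicator - p))
    return w + step_size * gradient
```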

AdaBoost

AdaBoost: learning an ensemble [Freund & Schapire 1999]
Start with the same weight for all points: α_i = 1/N
For t = 1, ..., T:
- Learn f_t(x) with data weights α_i
- Compute coefficient ŵ_t
- Recompute weights α_i
Final model predicts by ŷ = sign( sum_{t=1}^{T} ŵ_t f_t(x) )

Computing coefficient ŵ_t

AdaBoost: Computing coefficient ŵ_t of classifier f_t(x)
Is f_t(x) good?
- Yes: ŵ_t large
- No: ŵ_t small
f_t(x) is good when f_t has low training error.
Measuring error on weighted data? Just use the weighted # of misclassified points.

Weighted classification error
Learned classifier ŷ = f̂(x). Hide each label, predict, and check:
- Data point (Sushi was great, α = 1.2): correct, so it contributes 1.2 to the weight of correct points and 0 to the weight of mistakes
- Data point (Food was OK, α = 0.5): mistake, so it contributes 0.5 to the weight of mistakes

Weighted classification error
Weighted error measures the fraction of weight on mistakes:
weighted_error = (total weight of mistakes) / (total weight of all points)
The best possible value is 0.0.
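A tiny sketch of the weighted classification error, checked against the two-point example above (one correct point with α = 1.2, one mistake with α = 0.5).

```python
# Weighted classification error: (total weight of mistakes) / (total weight of all points).
def weighted_error(alpha, y, predictions):
    mistakes = sum(a for a, yi, pi in zip(alpha, y, predictions) if yi != pi)
    return mistakes / sum(alpha)

# One correct point (weight 1.2) and one mistake (weight 0.5):
print(weighted_error([1.2, 0.5], [+1, -1], [+1, +1]))   # 0.5 / 1.7 ≈ 0.29
```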

AdaBoost: Formula for computing coefficient ŵ_t of classifier f_t(x)
ŵ_t = (1/2) ln( (1 - weighted_error(f_t)) / weighted_error(f_t) ), using weighted_error(f_t) on the training data.
Is f_t(x) good?
- Yes: weighted_error(f_t) is small, so ŵ_t is large
- No: weighted_error(f_t) is close to 0.5, so ŵ_t is small
(A short numeric check follows this slide.)

AdaBoost: learning ensemble
Start with the same weight for all points: α_i = 1/N
For t = 1, ..., T:
- Learn f_t(x) with data weights α_i
- Compute coefficient ŵ_t
- Recompute weights α_i
Final model predicts by ŷ = sign( sum_{t=1}^{T} ŵ_t f_t(x) )
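A one-line sketch of the coefficient formula; with weighted_error = 0.2 it reproduces the ŵ_t = 0.69 used in the stump example later, and a coin-flip classifier with weighted_error = 0.5 gets coefficient 0.

```python
import math

def adaboost_coefficient(weighted_err):
    # w_t = 1/2 * ln((1 - weighted_error) / weighted_error)
    return 0.5 * math.log((1 - weighted_err) / weighted_err)

print(round(adaboost_coefficient(0.2), 2))   # 0.69, as in the stump example below
print(round(adaboost_coefficient(0.5), 2))   # 0.0: a random classifier gets no vote
```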

Recomputing weights α_i

AdaBoost: Updating weights α_i based on where classifier f_t(x) makes mistakes
Did f_t get x_i right?
- Yes: decrease α_i
- No: increase α_i

AdaBoost: Formula for updating weights α_i
  α_i ← α_i e^(-ŵ_t)  if f_t(x_i) = y_i  (f_t got x_i right: decrease α_i)
  α_i ← α_i e^(ŵ_t)   if f_t(x_i) ≠ y_i  (f_t got x_i wrong: increase α_i)
(A sketch of this update follows this slide.)

AdaBoost: learning ensemble
Start with the same weight for all points: α_i = 1/N
For t = 1, ..., T:
- Learn f_t(x) with data weights α_i
- Compute coefficient ŵ_t
- Recompute weights α_i: α_i ← α_i e^(-ŵ_t) if f_t(x_i) = y_i, α_i ← α_i e^(ŵ_t) if f_t(x_i) ≠ y_i
Final model predicts by ŷ = sign( sum_{t=1}^{T} ŵ_t f_t(x) )
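A sketch of the multiplicative weight update; with ŵ_t = 0.69, e^0.69 ≈ 2, so correct points have their weight roughly halved and mistakes roughly doubled.

```python
import math

def update_weight(alpha_i, w_t, correct):
    # Multiply by e^{-w_t} if f_t got the point right, by e^{+w_t} if it made a mistake.
    return alpha_i * math.exp(-w_t if correct else w_t)

print(round(update_weight(1.0, 0.69, correct=True), 2))    # ≈ 0.5
print(round(update_weight(1.0, 0.69, correct=False), 2))   # ≈ 2 (1.99)
```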

AdaBoost: Normalizing weights α_i
If x_i is often a mistake, its weight α_i gets very large; if x_i is often correct, α_i gets very small. This can cause numerical instability after many iterations, so normalize the weights to add up to 1 after every iteration:
  α_i ← α_i / ( sum_{j=1}^{N} α_j )

AdaBoost: learning ensemble
Start with the same weight for all points: α_i = 1/N
For t = 1, ..., T:
- Learn f_t(x) with data weights α_i
- Compute coefficient ŵ_t
- Recompute weights α_i: α_i ← α_i e^(-ŵ_t) if f_t(x_i) = y_i, α_i ← α_i e^(ŵ_t) if f_t(x_i) ≠ y_i
- Normalize weights α_i
Final model predicts by ŷ = sign( sum_{t=1}^{T} ŵ_t f_t(x) )
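Putting the recap above together, here is a hedged end-to-end sketch of the AdaBoost loop. The weak_learner interface (a callable taking (X, y, alpha) and returning an object with a predict method) is an assumption made for illustration, and the sketch assumes the weighted error stays strictly between 0 and 1.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    # y: numpy array of labels in {+1, -1}.
    N = len(y)
    alpha = np.full(N, 1.0 / N)              # start with the same weight for every point
    classifiers, coefficients = [], []
    for t in range(T):
        f = weak_learner(X, y, alpha)        # learn f_t(x) with data weights alpha_i
        pred = f.predict(X)
        err = np.sum(alpha[pred != y]) / np.sum(alpha)   # weighted classification error
        w_t = 0.5 * np.log((1 - err) / err)              # coefficient w_t
        alpha *= np.exp(-w_t * y * pred)     # e^{-w_t} if correct, e^{+w_t} if mistake
        alpha /= alpha.sum()                 # normalize weights to add up to 1
        classifiers.append(f)
        coefficients.append(w_t)
    return classifiers, coefficients

def adaboost_predict(classifiers, coefficients, X):
    scores = sum(w * f.predict(X) for f, w in zip(classifiers, coefficients))
    return np.sign(scores)
```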

AdaBoost example
t = 1: Just learn a classifier on the original data.
[Plots: the original data and the learned decision stump f_1(x).]

Updating weights α_i
[Plots: the boundary of the learned decision stump f_1(x), and the new data weights α_i with the weight of misclassified points increased.]

t = 2: Learn a classifier on the weighted data
[Plots: f_1(x), the data weighted by the α_i chosen in the previous iteration, and the decision stump f_2(x) learned on the weighted data.]

Ensemble becomes a weighted sum of learned classifiers
f̂(x) = ŵ_1 f_1(x) + ŵ_2 f_2(x) = 0.61 f_1(x) + 0.53 f_2(x)
[Plots: the two stumps and their weighted combination.]

Decision boundary of the ensemble classifier after 30 iterations
[Plot: the learned boundary; training_error = 0.]

Boosted decision stumps

Boosted decision stumps
Start with the same weight for all points: α_i = 1/N
For t = 1, ..., T:
- Learn f_t(x): pick the decision stump with lowest weighted training error according to α_i
- Compute coefficient ŵ_t
- Recompute weights α_i
- Normalize weights α_i
Final model predicts by ŷ = sign( sum_{t=1}^{T} ŵ_t f_t(x) )

Finding the best next decision stump f_t(x)
Consider splitting on each feature (see the sketch below):
  Income > $100K?       weighted_error = 0.2
  Credit history?       weighted_error = 0.35
  Savings > $100K?      weighted_error = 0.3
  Market conditions?    weighted_error = 0.4
Pick the split with the lowest weighted error: f_t = "Income > $100K?", so
ŵ_t = (1/2) ln( (1 - 0.2) / 0.2 ) = 0.69

Boosted decision stumps
Start with the same weight for all points: α_i = 1/N
For t = 1, ..., T:
- Learn f_t(x): pick the decision stump with lowest weighted training error according to α_i
- Compute coefficient ŵ_t
- Recompute weights α_i
- Normalize weights α_i
Final model predicts by ŷ = sign( sum_{t=1}^{T} ŵ_t f_t(x) )
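A sketch of the stump-selection step: among a set of candidate stumps (assumed here to expose a predict method, which is an illustrative interface), pick the one with the lowest weighted training error.

```python
# Pick the candidate decision stump with the lowest weighted training error.
def best_stump(candidate_stumps, X, y, alpha):
    def w_err(stump):
        pred = stump.predict(X)
        mistakes = sum(a for a, yi, pi in zip(alpha, y, pred) if yi != pi)
        return mistakes / sum(alpha)
    return min(candidate_stumps, key=w_err)
```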

Updating weights α_i
With ŵ_t = 0.69 for the stump "Income > $100K?": α_i e^(-0.69) ≈ α_i / 2 if f_t(x_i) = y_i, and α_i e^(0.69) ≈ 2 α_i if f_t(x_i) ≠ y_i.
  Credit  Income  Previous weight α_i   New weight α_i
  A       $130K   0.5                   0.5 / 2 = 0.25
  B       $80K    1.5                   0.75
  C       $110K   1.5                   2 × 1.5 = 3
  A       $110K   2                     1
  A       $90K    1                     2
  B       $120K   2.5                   1.25
  C       $30K    3                     1.5
  C       $60K    2                     1
  B       $95K    0.5                   1
  A       $60K    1                     2
  A       $98K    0.5                   1

Summary of boosting

Variants of boosting and related algorithms
There are hundreds of variants of boosting; the most important is gradient boosting: like AdaBoost, but useful beyond basic classification.
There are many other approaches to learning ensembles; the most important is random forests:
- Bagging: pick random subsets of the data, learn a tree on each subset, and average the predictions
- Simpler than boosting and easier to parallelize
- Typically higher error than boosting for the same # of trees (# of iterations T)

Impact of boosting (spoiler alert... HUGE IMPACT)
Among the most useful ML methods ever created.
Extremely useful in computer vision: the standard approach for face detection, for example.
Used by most winners of ML competitions (Kaggle, KDD Cup, ...).
Applications include malware classification, credit fraud detection, ads click-through rate estimation, sales forecasting, ranking web pages for search, Higgs boson detection, ...
Most deployed ML systems use model ensembles, with coefficients chosen manually, with boosting, with bagging, or by other methods.

What you can do now
- Identify the notion of ensemble classifiers
- Formalize ensembles as weighted combinations of simpler classifiers
- Outline the boosting framework: sequentially learn classifiers on weighted data
- Describe the AdaBoost algorithm:
  - Learn each classifier on weighted data
  - Compute the coefficient of the classifier
  - Recompute the data weights
  - Normalize the weights
- Implement AdaBoost to create an ensemble of decision stumps