Decision Trees: Overfitting
Emily Fox, University of Washington, January 30, 2017

Decision tree recap
[Figure: a learned loan-status tree. The root node (22 safe, 18 risky) splits on Credit; the "poor" branch splits on Income, the "fair" branch splits on Term, and one Income branch splits on Term again. For each leaf node, set ŷ = the majority value of the training points in that leaf.]
Greedy decision tree learning
Step 1: Start with an empty tree
Step 2: Select a feature to split the data on
For each split of the tree:
Step 3: If there is nothing more to do (stopping conditions 1 & 2), make predictions
Step 4: Otherwise, go to Step 2 and continue (recurse) on this split
Pick the feature split leading to the lowest classification error.

Scoring a loan application
x_i = (Credit = poor, Income = high, Term = 5 years)
[Figure: the recap tree traversed for x_i: Credit = poor → Income = high → Term = 5 years → prediction ŷ_i.]
Decision trees vs. logistic regression: Example
Logistic regression:

Feature | Value | Learned weight
h0(x)   | 1     | 0.22
h1(x)   | x[1]  | 1.12
h2(x)   | x[2]  | -1.07
Depth 1: Split on x[1]
[Figure: root (18 −, 13 +) split on x[1] at −0.07: the branch x[1] < −0.07 holds (13 −, 3 +); the branch x[1] ≥ −0.07 holds (4 −, 11 +).]

Depth 2
[Figure: each depth-1 branch is split again. The left branch splits on x[1] at −1.66, giving (7 −, 0 +) and (6 −, 3 +); the right branch splits on x[2] at 1.55, giving (1 −, 11 +) for x[2] < 1.55 and (3 −, 0 +) for x[2] ≥ 1.55.]
Threshold split caveat
For threshold splits, the same feature can be used multiple times (e.g., x[1] is split at −0.07 and again at −1.66 in the depth-2 tree above).

Decision boundaries
[Figure: decision boundaries of trees of depth 1, 2, and 10.]
Comparing decision boundaries
[Figure: decision tree boundaries at depth 1, 3, and 10 vs. logistic regression boundaries with degree 1, 2, and 6 features.]

Predicting probabilities with decision trees
[Figure: tree with root (18 safe, 12 risky) split on Credit: excellent → (9, 2), fair → (6, 9), poor → (3, 1).]
For the "poor" leaf: P(y = safe | x) = 3 / (3 + 1) = 0.75
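The per-leaf probability is just the fraction of positive training points in the leaf. A minimal Python sketch (my own helper, not from the slides), using the "poor" leaf's counts:

```python
def leaf_probability(n_pos, n_neg):
    """P(y = +1 | x reaches this leaf) = fraction of positive training points."""
    return n_pos / (n_pos + n_neg)

# "poor" credit leaf from the slide: 3 safe, 1 risky
print(leaf_probability(3, 1))  # 0.75
```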
Depth 1 probabilities
[Figure: the depth-1 tree with leaf counts (13 −, 3 +) and (4 −, 11 +); each leaf's probability is its fraction of + points.]

Depth 2 probabilities
[Figure: the depth-2 tree with leaf counts (7 −, 0 +), (6 −, 3 +), (1 −, 11 +), and (3 −, 0 +), and the corresponding per-leaf probabilities.]
Comparison with logistic regression
[Figure: class-probability surfaces of logistic regression (degree 2) vs. a decision tree (depth 2).]

Overfitting in decision trees
What happens when we increase depth?
Training error decreases with depth:

Tree depth     | 1    | 2    | 3    | 5    | 10
Training error | 0.22 | 0.13 | 0.10 | 0.03 | 0.00

[Figure: the decision boundary grows more complex with depth.]

Two approaches to picking simpler trees
1. Early stopping: Stop the learning algorithm before the tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates
Technique 1: Early stopping
Stopping conditions (recap):
1. All examples have the same target value
2. No more features to split on
Early stopping conditions:
1. Limit tree depth (choose max_depth using a validation set)
2. Do not consider splits that do not cause a sufficient decrease in classification error
3. Do not split an intermediate node that contains too few data points

Challenge with early stopping condition 1
Hard to know exactly when to stop.
[Figure: training error keeps falling with tree depth, while true error is U-shaped; max_depth should sit near the minimum of the true error.]
Also, we might want some branches of the tree to go deeper while others remain shallow.
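The stopping and early-stopping conditions above can be sketched as a small recursive tree builder for categorical features. This is an illustrative Python sketch, not the course's reference implementation; all helper names (`majority`, `split_error`, `build_tree`) and the dict-based tree representation are my own:

```python
from collections import Counter

def majority(labels):
    """Majority label among a list of +1/-1 labels."""
    return Counter(labels).most_common(1)[0][0]

def classification_error(labels):
    """Fraction of points a majority-vote leaf would misclassify."""
    if not labels:
        return 0.0
    return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

def split_error(data, feature):
    """Error of predicting the majority in each branch, weighted by branch size."""
    total = len(data)
    err = 0.0
    for value in set(x[feature] for x, _ in data):
        subset = [y for x, y in data if x[feature] == value]
        err += classification_error(subset) * len(subset) / total
    return err

def build_tree(data, features, depth=0,
               max_depth=5, min_error_decrease=0.0, min_node_size=2):
    labels = [y for _, y in data]
    leaf = {"leaf": majority(labels)}

    if len(set(labels)) == 1:      # stopping condition 1: node is pure
        return leaf
    if not features:               # stopping condition 2: no features left
        return leaf
    if depth >= max_depth:         # early stopping 1: depth limit
        return leaf
    if len(data) < min_node_size:  # early stopping 3: too few points in node
        return leaf

    best = min(features, key=lambda f: split_error(data, f))
    # early stopping 2: split must sufficiently decrease classification error
    if classification_error(labels) - split_error(data, best) <= min_error_decrease:
        return leaf

    children = {}
    rest = [f for f in features if f != best]
    for value in set(x[best] for x, _ in data):
        subset = [(x, y) for x, y in data if x[best] == value]
        children[value] = build_tree(subset, rest, depth + 1,
                                     max_depth, min_error_decrease, min_node_size)
    return {"split": best, "children": children}
```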
Early stopping condition 2: Pros and cons
Pros:
- A reasonable heuristic for early stopping that avoids useless splits
Cons:
- Too short-sighted: we may miss good splits that occur right after useless splits
- We saw this with the XOR example

Two approaches to picking simpler trees
1. Early stopping: Stop the learning algorithm before the tree becomes too complex
2. Pruning: Simplify the tree after the learning algorithm terminates; complements early stopping
Pruning: Intuition
Train a complex tree, then simplify it.
[Figure: complex tree → simplify → simpler tree.]

Pruning motivation
[Figure: training error falls with tree depth while true error is U-shaped. Simplify after the tree is built; don't stop too early.]
Scoring trees: Desired total quality format
Want to balance:
i. How well the tree fits the data
ii. The complexity of the tree
Total cost = measure of fit + measure of complexity
(measure of fit = classification error, where a large value means a bad fit to the training data; a large complexity means the tree is likely to overfit)

A simple measure of the complexity of a tree:
L(T) = # of leaf nodes
[Figure: a tree with a single split on Credit (excellent/fair/poor), with its leaves labeled good or bad.]
Balancing simplicity & predictive power
[Figure: a deep tree (splitting on Credit, Term, and Income) is too complex and risks overfitting; a single split on Term is too simple and has high classification error.]
© 2015-2016 Emily Fox & Carlos Guestrin

Balancing fit and complexity
Total cost: C(T) = Error(T) + λ L(T), with tuning parameter λ ≥ 0
- If λ = 0: we minimize training error only, so the learned tree can become arbitrarily complex (overfitting)
- If λ = ∞: the complexity penalty dominates, so the best tree is a single leaf predicting the majority class
- If λ is in between: we trade off fit against complexity
Tree pruning algorithm
Step 1: Consider a split of tree T as a candidate for pruning.
[Figure: the full loan tree (splits on Credit, Term, and Income); the lower Term? split is the candidate for pruning.]
Step 2: Compute the total cost C(T) of the tree with the split
C(T) = Error(T) + λ L(T), with λ = 0.03 (the slide prints λ = 0.3, but the totals below correspond to λ = 0.03)

Tree | Error | #Leaves | Total
T    | 0.25  | 6       | 0.43

Step 3: Undo the split to get T_smaller (replace the split by a leaf node) and compute its total cost

Tree      | Error | #Leaves | Total
T         | 0.25  | 6       | 0.43
T_smaller | 0.26  | 5       | 0.41
Step 4: Prune if the total cost is lower: C(T_smaller) ≤ C(T)
Here T_smaller has worse training error (0.26 vs. 0.25) but lower overall cost (0.41 vs. 0.43). Replace the split by a leaf node? YES!

Step 5: Repeat Steps 1-4 for every split
Decide whether each split can be pruned.
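The pruning decision in Steps 2-4 reduces to comparing two total costs. A small Python sketch using the slide's numbers (note that the totals 0.43 and 0.41 correspond to λ = 0.03; the λ = 0.3 printed on the slide appears to be a typo):

```python
def total_cost(error, num_leaves, lam):
    """C(T) = Error(T) + lambda * L(T)"""
    return error + lam * num_leaves

lam = 0.03
c_full    = total_cost(0.25, 6, lam)   # tree T: 0.43
c_smaller = total_cost(0.26, 5, lam)   # T_smaller (split replaced by a leaf): 0.41
prune = c_smaller <= c_full            # True -> replace the split by a leaf node
```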
Summary of overfitting in decision trees
What you can do now:
- Identify when overfitting occurs in decision trees
- Prevent overfitting with early stopping
  - Limit tree depth
  - Do not consider splits that do not reduce classification error
  - Do not split intermediate nodes with only a few points
- Prevent overfitting by pruning complex trees
  - Use a total cost formula that balances classification error and tree complexity
  - Use total cost to merge potentially complex trees into simpler ones
Boosting
Emily Fox, University of Washington, January 30, 2017

Simple (weak) classifiers are good!
Examples: logistic regression with simple features, shallow decision trees, decision stumps (e.g., "Income > $100K?").
Low variance, and learning is fast! But high bias.
Finding a classifier that's just right
[Figure: true error vs. model complexity is U-shaped; a weak learner sits on the high-bias (too simple) side, so we need a stronger learner.]
Option 1: add more features or depth
Option 2: ?

Boosting question
"Can a set of weak learners be combined to create a stronger learner?" - Kearns and Valiant (1988)
"Yes!" - Schapire (1990): Boosting
Amazing impact: a simple approach, widely used in industry, that wins most Kaggle competitions.
Ensemble classifier
A single classifier:
Input x → classifier (e.g., "Income > $100K?") → output ŷ = f(x), either +1 or -1
Ensemble methods: each classifier votes on the prediction
x_i = (Income = $120K, Credit = Bad, Savings = $50K, Market = Good)
Four decision stumps vote:
- Income > $100K?    → f_1(x_i) = +1
- Credit history?    → f_2(x_i) = -1
- Savings > $100K?   → f_3(x_i) = -1
- Market conditions? → f_4(x_i) = +1
How to combine? Learn coefficients:
F(x_i) = sign(ŵ_1 f_1(x_i) + ŵ_2 f_2(x_i) + ŵ_3 f_3(x_i) + ŵ_4 f_4(x_i))
Prediction with the ensemble
With learned coefficients ŵ_1 = 2, ŵ_2 = 1.5, ŵ_3 = 1.5, ŵ_4 = 0.5 and votes f_1(x_i) = +1, f_2(x_i) = -1, f_3(x_i) = -1, f_4(x_i) = +1:
F(x_i) = sign(2 - 1.5 - 1.5 + 0.5) = sign(-0.5) = -1

Ensemble classifier in general
Goal: predict output y (either +1 or -1) from input x
Learn the ensemble model:
- Classifiers: f_1(x), f_2(x), ..., f_T(x)
- Coefficients: ŵ_1, ŵ_2, ..., ŵ_T
Prediction: ŷ = sign( Σ_{t=1}^{T} ŵ_t f_t(x) )
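The sign-of-weighted-votes prediction can be written directly. A minimal sketch, using the coefficients and votes from the slide:

```python
def ensemble_predict(weights, votes):
    """F(x) = sign(sum_t w_t * f_t(x)); returns +1 or -1."""
    score = sum(w * f for w, f in zip(weights, votes))
    return 1 if score > 0 else -1

# coefficients w_1..w_4 and stump votes f_1(x_i)..f_4(x_i) from the slide
weights = [2.0, 1.5, 1.5, 0.5]
votes = [+1, -1, -1, +1]
print(ensemble_predict(weights, votes))  # -1 (score = -0.5)
```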
Boosting
Training a classifier:
Training data → learn classifier f(x) → predict ŷ = sign(f(x))
Learning a decision stump

Credit | Income
A | $130K
B | $80K
C | $110K
A | $110K
A | $90K
B | $120K
C | $30K
C | $60K
B | $95K
A | $60K
A | $98K

Splitting on Income > $100K gives leaf counts (3, 1) and (4, 3); each leaf predicts ŷ = the majority value.

Boosting = focus learning on "hard" points
Training data → learn classifier f(x) → predict ŷ = sign(f(x)) → evaluate.
Learn where f(x) makes mistakes, and focus the next classifier on the places where f(x) does less well.
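A decision stump on the income column can be scored by the majority-vote error on each side of the threshold. A Python sketch; the ±1 labels below are hypothetical stand-ins (the slide's y column did not survive extraction), chosen so that the $100K split reproduces the slide's (3, 1) / (4, 3) leaf counts:

```python
def stump_error(data, threshold):
    """Majority-vote classification error of splitting income at `threshold`."""
    left = [y for income, y in data if income > threshold]
    right = [y for income, y in data if income <= threshold]
    mistakes = 0
    for side in (left, right):
        if side:
            # the leaf predicts the majority label; the minority are mistakes
            mistakes += len(side) - max(side.count(+1), side.count(-1))
    return mistakes / len(data)

# incomes (in $K) from the slide; labels are hypothetical (+1 = safe loan)
data = [(130, +1), (80, -1), (110, +1), (110, +1), (90, -1), (120, -1),
        (30, -1), (60, -1), (95, +1), (60, +1), (98, +1)]
print(stump_error(data, 100))  # 4/11: one mistake on the left leaf, three on the right
```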
Learning on weighted data: more weight on hard or more important points
Weighted dataset:
- Each (x_i, y_i) is weighted by α_i
- More important point = higher weight α_i
Learning:
- Data point i counts as α_i data points (e.g., α_i = 2 → count the point twice)

Learning a decision stump on weighted data
Increase the weight α_i of harder/misclassified points:

Credit | Income | Weight α_i
A | $130K | 0.5
B | $80K  | 1.5
C | $110K | 1.2
A | $110K | 0.8
A | $90K  | 0.6
B | $120K | 0.7
C | $30K  | 3
C | $60K  | 2
B | $95K  | 0.8
A | $60K  | 0.7
A | $98K  | 0.9

Splitting on Income > $100K now gives weighted leaf counts (2, 1.2) and (3, 6.5); each leaf predicts ŷ = the weighted majority value.
Learning from weighted data in general
Usually straightforward: data point i counts as α_i data points.
E.g., in gradient ascent for logistic regression, weigh each point's term in the sum over data points by α_i.

Boosting = greedily learning ensembles from data
Training data → learn classifier f_1(x), predict ŷ = sign(f_1(x)).
Give higher weight to the points where f_1(x) is wrong → weighted data → learn classifier and coefficient ŵ_2, f_2(x) → predict ŷ = sign(ŵ_1 f_1(x) + ŵ_2 f_2(x)).
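For logistic regression, "counting point i as α_i points" just scales its term in the log-likelihood gradient. A sketch under that interpretation (my own notation, not the course's code):

```python
import math

def weighted_gradient(alphas, X, y, w):
    """Gradient of the weighted log-likelihood for logistic regression.
    Each point i contributes alpha_i * x_i * (1[y_i = +1] - P(y = +1 | x_i, w))."""
    grad = [0.0] * len(w)
    for alpha, x, label in zip(alphas, X, y):
        score = sum(wj * xj for wj, xj in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-score))          # P(y = +1 | x, w)
        indicator = 1.0 if label == +1 else 0.0
        for j, xj in enumerate(x):
            grad[j] += alpha * xj * (indicator - p)  # weighted by alpha_i
    return grad
```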
AdaBoost
AdaBoost: learning an ensemble [Freund & Schapire 1999]
- Start with the same weight for all points: α_i = 1/N
- For t = 1, ..., T:
  - Learn f_t(x) with data weights α_i
  - Compute coefficient ŵ_t
  - Recompute weights α_i
- Final model predicts by: ŷ = sign( Σ_{t=1}^{T} ŵ_t f_t(x) )
Computing coefficient ŵ_t
AdaBoost sets the coefficient ŵ_t of classifier f_t(x) by asking: is f_t(x) good?
- Yes → ŵ_t large
- No → ŵ_t small
"f_t(x) is good" means f_t has low training error.
Measuring error on weighted data: just use the weighted count of misclassified points.
Weighted classification error
Example (hide the label, let the learned classifier predict):
- Data point ("Sushi was great", α = 1.2): correct → contributes 1.2 to the weight of correct points and 0 to the weight of mistakes
- Data point ("Food was OK", α = 0.5): mistake → contributes 0.5 to the weight of mistakes and 0 to the weight of correct points

Weighted classification error
Weighted error measures the fraction of the total weight that is misclassified:
weighted_error = (total weight of mistakes) / (total weight of all points)
The best possible value is 0.0.
AdaBoost: formula for computing the coefficient ŵ_t of classifier f_t(x), using weighted_error(f_t) on the training data:
ŵ_t = (1/2) ln( (1 - weighted_error(f_t)) / weighted_error(f_t) )
A good classifier (low weighted error) gets a large coefficient; a random-guess classifier (weighted error 0.5) gets coefficient 0.

AdaBoost: learning ensemble (recap)
- Start with the same weight for all points: α_i = 1/N
- For t = 1, ..., T: learn f_t(x) with data weights α_i; compute coefficient ŵ_t; recompute weights α_i
- Final model predicts by: ŷ = sign( Σ_t ŵ_t f_t(x) )
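The coefficient formula as code; its value at weighted error 0.2 matches the ŵ_t ≈ 0.69 used for the income stump later in the lecture:

```python
import math

def coefficient(weighted_error):
    """w_t = 1/2 * ln((1 - weighted_error) / weighted_error)"""
    return 0.5 * math.log((1 - weighted_error) / weighted_error)

print(round(coefficient(0.2), 2))  # 0.69: a good stump gets a large vote
print(round(coefficient(0.5), 2))  # 0.0: a random-guess classifier gets no vote
```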
Recomputing weights α_i
AdaBoost updates each weight α_i based on whether classifier f_t(x) got x_i right:
- Yes → decrease α_i
- No → increase α_i
AdaBoost: formula for updating the weights α_i:
α_i ← α_i e^{-ŵ_t}, if f_t(x_i) = y_i
α_i ← α_i e^{ŵ_t},  if f_t(x_i) ≠ y_i
That is, multiply α_i by e^{-ŵ_t} (shrinking the weight) when f_t gets x_i right, and by e^{ŵ_t} (growing the weight) when it gets x_i wrong.

AdaBoost: learning ensemble (recap, with the update rule)
- Start with the same weight for all points: α_i = 1/N
- For t = 1, ..., T: learn f_t(x) with data weights α_i; compute coefficient ŵ_t; recompute weights α_i via the rule above
- Final model predicts by: ŷ = sign( Σ_t ŵ_t f_t(x) )
AdaBoost: normalizing the weights α_i
- If x_i is often misclassified, its weight α_i gets very large
- If x_i is often correct, its weight α_i gets very small
- This can cause numerical instability after many iterations
- Fix: normalize the weights to add up to 1 after every iteration

AdaBoost: learning ensemble (full algorithm)
- Start with the same weight for all points: α_i = 1/N
- For t = 1, ..., T:
  - Learn f_t(x) with data weights α_i
  - Compute coefficient ŵ_t
  - Recompute weights: α_i ← α_i e^{-ŵ_t} if f_t(x_i) = y_i, and α_i ← α_i e^{ŵ_t} if f_t(x_i) ≠ y_i
  - Normalize the weights α_i
- Final model predicts by: ŷ = sign( Σ_t ŵ_t f_t(x) )
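One reweight-and-normalize step as code. A sketch with my own argument names: `predictions` holds the f_t(x_i) and `labels` the y_i:

```python
import math

def update_weights(alpha, predictions, labels, w_t):
    """Multiply alpha_i by e^{-w_t} if f_t got x_i right, by e^{w_t} otherwise,
    then normalize so the weights sum to 1."""
    new = [a * math.exp(-w_t if p == y else w_t)
           for a, p, y in zip(alpha, predictions, labels)]
    total = sum(new)
    return [a / total for a in new]
```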
AdaBoost example
t = 1: Just learn a classifier on the original data.
[Figure: the original data and the learned decision stump f_1(x).]
Updating weights α_i
[Figure: the boundary of the learned decision stump f_1(x); the misclassified points get their weights α_i increased.]

t = 2: Learn a classifier on the weighted data.
[Figure: the weighted data, using the α_i chosen in the previous iteration, and the decision stump f_2(x) learned on it.]
The ensemble becomes a weighted sum of the learned classifiers:
F(x) = sign(ŵ_1 f_1(x) + ŵ_2 f_2(x)), with ŵ_1 = 0.61 and ŵ_2 = 0.53
[Figure: the two stump boundaries combine into a more expressive decision boundary.]

Decision boundary of the ensemble classifier after 30 iterations: training_error = 0.
[Figure: a complex boundary that classifies every training point correctly.]
Boosted decision stumps
- Start with the same weight for all points: α_i = 1/N
- For t = 1, ..., T:
  - Learn f_t(x): pick the decision stump with the lowest weighted training error according to α_i
  - Compute coefficient ŵ_t
  - Recompute weights α_i
  - Normalize weights α_i
- Final model predicts by: ŷ = sign( Σ_t ŵ_t f_t(x) )
Finding the best next decision stump f_t(x)
Consider splitting on each feature:
- Income > $100K?: weighted_error = 0.2
- Credit history?: weighted_error = 0.35
- Savings > $100K?: weighted_error = 0.3
- Market conditions?: weighted_error = 0.4
Pick the lowest weighted error: f_t = "Income > $100K?", giving ŵ_t = (1/2) ln(0.8/0.2) ≈ 0.69.
This is the "Learn f_t(x)" step of the boosted decision stumps algorithm above.
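Putting the pieces together, the boosting loop can be sketched end to end. This is an illustrative Python skeleton, not the course's reference code: `learn_stump` is a caller-supplied weak learner, and the early break on zero weighted error is my own guard, not part of the slide's pseudocode:

```python
import math

def adaboost(X, y, learn_stump, T):
    """AdaBoost skeleton following the slides. learn_stump(X, y, alpha)
    must return a predict(x) function with low weighted training error."""
    N = len(y)
    alpha = [1.0 / N] * N              # start with the same weight for all points
    ensemble = []                      # list of (coefficient w_t, stump f_t)
    for _ in range(T):
        stump = learn_stump(X, y, alpha)
        preds = [stump(x) for x in X]
        # weighted error = weight of mistakes / total weight
        err = sum(a for a, p, yi in zip(alpha, preds, y) if p != yi) / sum(alpha)
        if err == 0.0:                 # perfect stump: give it a big vote, stop (my guard)
            ensemble.append((10.0, stump))
            break
        w_t = 0.5 * math.log((1 - err) / err)
        ensemble.append((w_t, stump))
        # down-weight correct points, up-weight mistakes, then normalize
        alpha = [a * math.exp(-w_t if p == yi else w_t)
                 for a, p, yi in zip(alpha, preds, y)]
        total = sum(alpha)
        alpha = [a / total for a in alpha]

    def predict(x):
        return 1 if sum(w * f(x) for w, f in ensemble) > 0 else -1
    return predict
```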
Updating weights α_i
With ŵ_t = 0.69, multiply α_i by e^{-0.69} ≈ 1/2 if f_t(x_i) = y_i, and by e^{0.69} ≈ 2 if f_t(x_i) ≠ y_i. For the "Income > $100K?" stump:

Credit | Income | Previous weight α_i | New weight α_i
A | $130K | 0.5 | 0.5/2 = 0.25
B | $80K  | 1.5 | 0.75
C | $110K | 1.5 | 2 × 1.5 = 3
A | $110K | 2   | 1
A | $90K  | 1   | 2
B | $120K | 2.5 | 1.25
C | $30K  | 3   | 1.5
C | $60K  | 2   | 1
B | $95K  | 0.5 | 1
A | $60K  | 1   | 2
A | $98K  | 0.5 | 1

Summary of boosting
Variants of boosting and related algorithms
There are hundreds of variants of boosting; the most important:
- Gradient boosting: like AdaBoost, but useful beyond basic classification
Many other approaches learn ensembles; the most important:
- Random forests (bagging): pick random subsets of the data, learn a tree on each subset, and average the predictions
  - Simpler than boosting and easier to parallelize
  - Typically higher error than boosting for the same number of trees (# iterations T)

Impact of boosting (spoiler alert... HUGE IMPACT)
- Among the most useful ML methods ever created
- Extremely useful in computer vision: the standard approach for face detection, for example
- Used by most winners of ML competitions (Kaggle, KDD Cup, ...)
- Malware classification, credit fraud detection, ad click-through-rate estimation, sales forecasting, ranking web pages for search, Higgs boson detection, ...
- Most deployed ML systems use model ensembles, with coefficients chosen manually, with boosting, with bagging, or by other means
What you can do now
- Identify the notion of ensemble classifiers
- Formalize ensembles as weighted combinations of simpler classifiers
- Outline the boosting framework: sequentially learn classifiers on weighted data
- Describe the AdaBoost algorithm:
  - Learn each classifier on weighted data
  - Compute the coefficient of the classifier
  - Recompute the data weights
  - Normalize the weights
- Implement AdaBoost to create an ensemble of decision stumps