Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos
Contents: Decision Trees: Definition + Motivation; Algorithm for Learning Decision Trees; Entropy, Mutual Information, Information Gain; Generalizations; Regression Trees; Overfitting; Pruning; Regularization. Many of these slides are taken from Aarti Singh, Eric Xing, Carlos Guestrin, Russ Greiner, and Andrew Moore.
Decision Trees
Decision Tree: Motivation Learn decision rules from a dataset: do we want to play tennis? Four discrete-valued attributes (Outlook, Temperature, Humidity, Wind); Play tennis?: a Yes/No classification problem.
Decision Tree: Motivation We want to learn a good decision tree from the data, for example this tree:
Function Approximation Formal problem setting: a set of possible instances X (the set of all possible feature vectors); an unknown target function f : X → Y; a set of function hypotheses H = { h | h : X → Y } (here H = the possible decision trees). Input: training examples {<x^(i), y^(i)>} of the unknown target function f. Output: a hypothesis h ∈ H that best approximates the target function f. In decision tree learning we are doing function approximation, where the hypothesis space H is the set of decision trees.
Decision Tree: The Hypothesis Space Each internal node is labeled with some feature x_j; each arc from x_j is labeled with a result of the test on x_j; leaf nodes specify the class h(x). One instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong is classified as No (Temperature and Wind are irrelevant on this path). Decision trees are easy to use in classification and give interpretable rules.
Generalizations Features can be continuous; the output can be continuous too (regression trees); and instead of a single feature, a node can test a set of features. We will discuss these in more detail later.
Continuous Features If a feature is continuous, internal nodes may test its value against a threshold.
Example: Mixed Discrete and Continuous Features Tax Fraud Detection: the goal is to predict who is cheating on taxes using the Refund, Marital Status, and Taxable Income features. Build a tree that matches the data:

Refund  Marital status  Taxable income  Cheat
Yes     Married         50K             No
No      Married         90K             No
No      Single          60K             No
No      Divorced        100K            Yes
Yes     Married         110K            No
Decision Tree for Tax Fraud Detection Data:

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

Each internal node tests one feature X_i (continuous features are tested against a threshold); each branch from a node selects one value (or a set of values) for X_i; each leaf node predicts Y.
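To make this concrete, here is a minimal sketch (not from the lecture) of the tax-fraud tree as a nested Python dict, with a function that walks it; the dict layout and branch names are illustrative assumptions.

```python
# A minimal sketch of the slide's tax-fraud tree as nested dicts
# (this representation is illustrative, not from the lecture).

def classify(tree, x):
    """Walk the tree until a leaf (a plain string) is reached."""
    while isinstance(tree, dict):
        feature, branches = tree["test"], tree["branches"]
        value = x[feature]
        if feature == "TaxInc":                 # continuous: threshold test
            tree = branches["<80K"] if value < 80 else branches[">=80K"]
        else:                                   # discrete: branch on the value
            tree = branches[value]
    return tree

tax_tree = {
    "test": "Refund",
    "branches": {
        "Yes": "No",                            # refund filed -> predict No
        "No": {
            "test": "MarSt",
            "branches": {
                "Married": "No",
                "Single": {"test": "TaxInc",
                           "branches": {"<80K": "No", ">=80K": "Yes"}},
                "Divorced": {"test": "TaxInc",
                             "branches": {"<80K": "No", ">=80K": "Yes"}},
            },
        },
    },
}

print(classify(tax_tree, {"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # No
```

The same `classify` loop handles both discrete branches and the continuous threshold test.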
Given a decision tree, how do we assign a label to a test point?
Decision Tree for Tax Fraud Detection Query: Refund = No, Marital Status = Married, Taxable Income = 80K; Cheat = ? Start at the root: Refund = No, so follow the No branch to MarSt. MarSt = Married, so we reach a leaf: assign Cheat = No. (The TaxInc test is never reached for this query.)
What do decision trees do in the feature space?
Decision Tree Decision Boundaries Decision trees divide the feature space into axis-parallel rectangles and label each rectangle with one class (illustrated with two features, x_1 and x_2).
Some functions cannot be represented with binary splits: if we want to learn such a function, we either need more complex tests in the nodes than single-feature binary splits, or we need to break the function into smaller pieces that can each be represented with binary splits.
How do we learn a decision tree from training data?
What Boolean functions can be represented with decision trees? How would you represent Y = X_2 and X_5? Y = X_2 or X_5? How would you represent X_2 X_5 ∨ X_3 X_4 (¬X_1)?
Decision trees can represent any Boolean/discrete function: n Boolean features (x_1, …, x_n) ⇒ 2^n possible different instances ⇒ 2^(2^n) possible different functions if the class label Y is Boolean too.
Option 1: Just store the training data. Trees can represent any Boolean (and discrete) function, e.g. (A v B) & (C v not D v E): just produce a path for each example (i.e., store the training data). But this may require exponentially many nodes, and what generalization capability does it have on instances not in the training data? Moreover, it is NP-hard to find the smallest tree that fits the data. Intuition: we want SMALL trees, to capture regularities in the data, and because they are easier to understand and faster to execute.
Expressiveness of General Decision Trees Example: learn A xor B (Boolean features and labels). There is a decision tree which perfectly classifies a training set, with one path to a leaf for each example.
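The XOR claim can be checked directly; a small sketch (the nested-dict representation is my assumption, not the lecture's):

```python
# A xor B has an exact depth-2 tree: split on A, then on B in each branch.
xor_tree = {"test": "A",
            0: {"test": "B", 0: 0, 1: 1},
            1: {"test": "B", 0: 1, 1: 0}}

def predict(tree, x):
    """Follow the branch matching x's value for the tested feature."""
    while isinstance(tree, dict):
        tree = tree[x[tree["test"]]]
    return tree

for a in (0, 1):
    for b in (0, 1):
        assert predict(xor_tree, {"A": a, "B": b}) == a ^ b
print("xor represented exactly")
```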
Example of Overfitting 1000 patients: 25% have butterfly-itis (250), 75% are healthy (750). Use 10 silly features that are not related to the class label: half of the patients have F1 = 1 ("odd birthday"), half have F2 = 1 ("even SSN"), etc.
Typical results for a standard decision tree learner: error rate 0% on the training data, 37% on new data. The optimal decision tree (always predict healthy) has a 25% error rate on both the training data and new data. Regularization is important!
How to learn a decision tree: top-down induction (many algorithms: ID3, C4.5, CART, …) grows the tree from the root to the leaves. We will focus on the ID3 algorithm. Repeat: 1. Select the best feature (X_1, X_2, or X_3) to split on. 2. For each value that feature takes, sort the training examples into the leaf nodes. 3. Stop if a leaf contains only training examples with the same label, or if all features are used up. 4. Assign each leaf the majority vote of the labels of its training examples.
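The loop above can be sketched in Python; this is a simplified ID3 for discrete features (the function names and the tiny demo dataset are mine, not the lecture's):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) in bits for a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feat):
    """H(Y) - H(Y | feat) for a discrete feature."""
    n = len(labels)
    rem = 0.0
    for v in set(r[feat] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[feat] == v]
        rem += len(sub) / n * entropy(sub)
    return entropy(labels) - rem

def id3(rows, labels, feats):
    if len(set(labels)) == 1:                 # pure leaf: stop
        return labels[0]
    if not feats:                             # features used up: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(feats, key=lambda f: info_gain(rows, labels, f))
    tree = {}
    for v in set(r[best] for r in rows):      # one subtree per feature value
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        srows, slabels = [r for r, _ in sub], [y for _, y in sub]
        tree[(best, v)] = id3(srows, slabels, [f for f in feats if f != best])
    return tree

demo = id3([{"x": "T"}, {"x": "T"}, {"x": "F"}], ["+", "+", "-"], ["x"])
print(demo)   # splits on "x", giving two pure leaves
```

Real implementations add stopping/pruning rules; this sketch only implements the four steps on the slide.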
First Split?
Which feature is best to split on? A good split leaves us less uncertain about the classification. 80 training people (50 Genuine, 30 Cheats). Splitting on Refund: the Yes branch gets 40 Genuine, 0 Cheats (absolutely sure); the No branch gets 10 Genuine, 30 Cheats (kind of sure). Splitting on Marital Status: the Single/Divorced branch gets 30 Genuine, 10 Cheats (kind of sure); the Married branch gets 20 Genuine, 20 Cheats (absolutely unsure). Refund gives more information about the labels than Marital Status.
Which feature is best to split on? Pick the attribute/feature which yields the maximum information gain, where H(Y) is the entropy of Y and H(Y | X_i) is the conditional entropy of Y given X_i. The feature which yields the maximum reduction in entropy provides the maximum information about Y.
Entropy of a random variable Y: the larger the uncertainty, the larger the entropy! For Y ~ Bernoulli(p), the entropy H(Y) is maximal for the uniform distribution (p = 1/2) and zero for a deterministic variable (p = 0 or 1). Information-theoretic interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).
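The Bernoulli case is easy to verify numerically; a minimal sketch (the function name is mine):

```python
from math import log2

def bernoulli_entropy(p):
    """H(Y) in bits for Y ~ Bernoulli(p); H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(bernoulli_entropy(0.5))   # uniform: maximal entropy, 1 bit
print(bernoulli_entropy(1.0))   # deterministic: 0 bits
```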
Information Gain: the advantage of an attribute is the decrease in uncertainty. Take the entropy of Y before the split, and the entropy of Y after splitting based on X_i (the conditional entropy, weighting each branch by the probability of following it); we want the latter to be small. Information gain is the difference of the two: maximum information gain = minimum conditional entropy.
First Split? Which feature splits the data the best into + and − instances?
First Split? The Outlook feature looks great, because the Overcast branch is perfectly separated.
Statistics: if we split on x_i, we produce 2 children: (1) the #(x_i = t) examples follow the TRUE branch, with data [#(x_i = t, Y = +), #(x_i = t, Y = −)]; (2) the #(x_i = f) examples follow the FALSE branch, with data [#(x_i = f, Y = +), #(x_i = f, Y = −)]. Calculate the mutual information between x_i and Y!
Information gain of the Outlook feature. Outlook splits the 14 examples (9+, 5−) into Sunny [2+, 3−], Overcast [4+, 0−], and Rain [3+, 2−]:
H = −(9/14·log2(9/14) + 5/14·log2(5/14)) = 0.9403
H1 = −(2/5·log2(2/5) + 3/5·log2(3/5)) = 0.9710 (Sunny)
H2 = −(4/4·log2(4/4) + 0/4·log2(0/4)) = 0 (Overcast)
H3 = −(3/5·log2(3/5) + 2/5·log2(2/5)) = 0.9710 (Rain)
I(Y, Outlook) = 0.940 − (5/14·H1 + 4/14·H2 + 5/14·H3) = 0.2465
Information gain of the Humidity feature. Humidity splits the 14 examples (9+, 5−) into High [3+, 4−] and Normal [6+, 1−]:
H = −(9/14·log2(9/14) + 5/14·log2(5/14)) = 0.9403
H_High = −(3/7·log2(3/7) + 4/7·log2(4/7)) = 0.9852
H_Normal = −(6/7·log2(6/7) + 1/7·log2(1/7)) = 0.5917
I(Y, Humidity) = 0.940 − 7/14·0.985 − 7/14·0.592 = 0.151
Information gain of the Wind feature. Wind splits the 14 examples (9+, 5−) into Weak [6+, 2−] and Strong [3+, 3−]:
H = −(9/14·log2(9/14) + 5/14·log2(5/14)) = 0.9403
H_Weak = −(6/8·log2(6/8) + 2/8·log2(2/8)) = 0.811
H_Strong = −(3/6·log2(3/6) + 3/6·log2(3/6)) = 1
I(Y, Wind) = 0.940 − 8/14·0.811 − 6/14·1 = 0.048
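These three calculations can be checked numerically; a small sketch (the helper names H and gain are mine):

```python
from math import log2

def H(counts):
    """Entropy in bits of a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

H_root = H([9, 5])                                   # 14 examples: 9+, 5-

def gain(splits):
    """Information gain of a split given per-branch [pos, neg] counts."""
    n = sum(sum(s) for s in splits)
    return H_root - sum(sum(s) / n * H(s) for s in splits)

gain_outlook  = gain([[2, 3], [4, 0], [3, 2]])       # Sunny / Overcast / Rain
gain_humidity = gain([[3, 4], [6, 1]])               # High / Normal
gain_wind     = gain([[6, 2], [3, 3]])               # Weak / Strong
# ~0.247, 0.152, 0.048 -- matching the slides up to rounding
print(round(gain_outlook, 3), round(gain_humidity, 3), round(gain_wind, 3))
```

Outlook wins, which is why it becomes the root in the next slide.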
Repeat and build the tree: a similar calculation handles the Temperature feature. The Outlook feature is the best root node among all features; recursing on each branch, Humidity is the best split under the Sunny branch.
Tree Learning App: http://www.cs.ualberta.ca/%7eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html
More general trees
Decision/Classification Trees more generally: features can be discrete or continuous. Each internal node tests some set of features {X_i}; each branch from a node selects a set of values for {X_i}; each leaf node predicts Y, by majority vote of the class labels (classification) or by an average or polynomial fit (regression).
Regression trees: e.g., split on "Num Children ≥ 2" vs. "< 2"; at each leaf, fit a constant, i.e., predict the average of the training responses reaching the leaf.
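A one-split regression tree (a "stump") makes the idea concrete; this sketch is illustrative, not the lecture's algorithm: it tries every threshold on a single feature, minimizes squared error, and predicts the training mean at each leaf.

```python
# A minimal regression-stump sketch: pick the threshold on x minimizing
# squared error, and fit a constant (the mean) at each leaf.

def fit_stump(xs, ys):
    best = None
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

predict = fit_stump([1, 2, 3, 10, 11, 12], [0, 0, 0, 6, 6, 6])
print(predict(2), predict(11))   # 0.0 6.0
```

A full regression tree applies this split search recursively to each leaf.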
Regression (Constant) trees
Overfitting
When to Stop? Many strategies for picking simpler trees: pre-pruning (fixed depth, fixed number of leaves); post-pruning (chi-square test); model selection by complexity penalization.
Model Selection: penalize complex models by introducing a cost term, choosing the tree that maximizes (log likelihood of the fit) − cost. The fit term is measured by squared error for regression and by classification loss for classification; the cost term penalizes trees with more leaves.
Pre-Pruning
PAC bound and Bias-Variance tradeoff: with probability at least 1 − δ, error_true(h) ≤ error_train(h) + √((ln |H| + ln(1/δ)) / (2m)). For a fixed sample size m: a complex hypothesis space H gives a small training-error (bias) term but a large complexity (variance) term; a simple H gives a large bias term but a small variance term.
Sample complexity: what about the size of the hypothesis space? ⇒ How large is the hypothesis space of decision trees?
Number of decision trees of depth k, a recursive solution: given n attributes, let H_k = the number of decision trees of depth k. H_0 = 2 (the "Yes" and the "No" tree). H_k = (#choices of root attribute) × (#possible left subtrees) × (#possible right subtrees) = n · H_{k−1} · H_{k−1}. Write L_k = log2 H_k; then L_0 = 1 and L_k = log2 n + 2·L_{k−1} = log2 n + 2(log2 n + 2·L_{k−2}) = log2 n + 2·log2 n + 2²·log2 n + … + 2^{k−1}·log2 n + 2^k·L_0. So L_k = (2^k − 1)·log2 n + 2^k (summing the geometric series).
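The closed form can be checked against the recursion directly; a small sketch (function names are mine):

```python
from math import log2

# Check the slide's closed form L_k = (2^k - 1) * log2(n) + 2^k against
# the recursion L_0 = 1, L_k = log2(n) + 2 * L_{k-1}  (n = #attributes).

def L_rec(k, n):
    return 1 if k == 0 else log2(n) + 2 * L_rec(k - 1, n)

def L_closed(k, n):
    return (2 ** k - 1) * log2(n) + 2 ** k

for k in range(6):
    assert abs(L_rec(k, 8) - L_closed(k, 8)) < 1e-9
print(L_closed(5, 8))   # 125.0 = log2 H_5 for n = 8 attributes
```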
PAC bound for decision trees of depth k: L_k = (2^k − 1)·log2 n + 2^k ⇒ bad! The number of points needed is exponential in the depth k. In contrast, the number of leaves is never more than the number of data points, so let us regularize with the number of leaves instead of the depth!
Number of decision trees with k leaves: let H_k = the number of decision trees with k leaves. H_1 = 2 (the "Yes" tree or the "No" tree). H_k = (#choices of root attribute) × [(#left subtrees with 1 leaf)·(#right subtrees with k−1 leaves) + (#left subtrees with 2 leaves)·(#right subtrees with k−2 leaves) + … + (#left subtrees with k−1 leaves)·(#right subtrees with 1 leaf)] = n^{k−1}·C_{k−1}, where C_{k−1} is a Catalan number. A loose bound follows using Stirling's approximation.
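The count is easy to tabulate; a sketch (function names mine; note the formula n^{k−1}·C_{k−1} ignores the slide's H_1 = 2 base case, which also counts the two leaf labels):

```python
from math import comb

def catalan(i):
    """C_i = (2i choose i) / (i + 1)."""
    return comb(2 * i, i) // (i + 1)

def num_trees(k, n):
    """Slide's count of decision trees with k leaves over n attributes."""
    return n ** (k - 1) * catalan(k - 1)

# log2 of this count grows roughly linearly in k (times log2 n),
# unlike the depth-k count, which is exponential in k.
print([num_trees(k, 2) for k in range(1, 5)])   # [1, 2, 8, 40]
```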
Number of decision trees: with k leaves, log2 H_k is linear in k, so the number of points m needed is linear in the number of leaves. With depth k, log2 H_k = (2^k − 1)·log2 n + 2^k is exponential in k, so m is exponential in the depth. (n is the number of features.)
PAC bound for decision trees with k leaves, bias-variance revisited: with probability at least 1 − δ, error_true(h) ≤ error_train(h) + √((ln |H_k| + ln(1/δ)) / (2m)), where m is the number of training points, k is the number of leaves, and ln |H_k| = O(k·ln n) is linear in k. With k = m (one leaf per training point), the training-error (bias) term is 0 but the complexity (variance) term is large (roughly > 1/2); with k ≪ m, the bias term is > 0 but the variance term is small (roughly < 1/2).
What did we learn from decision trees? The bias-variance tradeoff formalized: complexity k ≈ m gives no bias but lots of variance; k ≪ m gives some bias but less variance.
Post-Pruning (Bottom-Up pruning)
Chi-Squared independence test. Observed data:

             Republican  Democrat  Independent  Row total
Male         200         150       50           400
Female       250         300       50           600
Col. total   450         450       100          1000

H_0: gender and voting preferences are independent. H_a: gender and voting preferences are not independent. Expected numbers under H_0 (independence): E_{r,c} = (n_r · n_c) / n.
Expected numbers under H_0 (independence), E_{r,c} = (n_r · n_c) / n:
E_{1,1} = (400 · 450) / 1000 = 180
E_{1,2} = (400 · 450) / 1000 = 180
E_{1,3} = (400 · 100) / 1000 = 40
E_{2,1} = (600 · 450) / 1000 = 270
E_{2,2} = (600 · 450) / 1000 = 270
E_{2,3} = (600 · 100) / 1000 = 60
Χ² = Σ (O_{r,c} − E_{r,c})² / E_{r,c}
   = (200−180)²/180 + (150−180)²/180 + (50−40)²/40 + (250−270)²/270 + (300−270)²/270 + (50−60)²/60 = 16.2
Chi-Squared independence test: degrees of freedom DF = (r − 1)·(c − 1) = (2 − 1)·(3 − 1) = 2, where r = #rows, c = #columns. P(Χ² > 16.2) = 0.0003 < 0.05 (p-value) ⇒ we reject the null hypothesis: the evidence shows that there is a relationship between gender and voting preference.
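The statistic above can be reproduced in a few lines (variable names are mine):

```python
# Reproduce the slide's chi-squared statistic for the gender/voting table.
observed = [[200, 150, 50],
            [250, 300, 50]]

row_tot = [sum(r) for r in observed]          # [400, 600]
col_tot = [sum(c) for c in zip(*observed)]    # [450, 450, 100]
n = sum(row_tot)                              # 1000

chi2 = 0.0
for r in range(2):
    for c in range(3):
        e = row_tot[r] * col_tot[c] / n       # E_rc = n_r * n_c / n
        chi2 += (observed[r][c] - e) ** 2 / e

print(round(chi2, 1))   # 16.2, with DF = (2-1)*(3-1) = 2
```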
Chi-Square Pruning: 1. Build a complete tree. 2. Consider each node X whose children are leaves, and perform a chi-square independence test. Let s = #instances entering the node, p = #positive and n = #negative instances (s = p + n). The false branch receives s_f = p_f + n_f instances and the true branch s_t = p_t + n_t. The expected numbers under independence are s_f·p/s and s_t·p/s for the positives, and s_f·n/s and s_t·n/s for the negatives. If after splitting the expected numbers are close to the measured ones, then there is no point in splitting the node: delete the leaves!
Example data (the learned tree splits on X1; the X1 = T branch is a pure leaf Y = T; the X1 = F branch splits on X2, with s = 6, p = 1, n = 5; X2 = F → Y = T with s_f = 1, p_f = 1, n_f = 0; X2 = T → Y = F with s_t = 5, p_t = 0, n_t = 5):

X1  X2  Y  Count
T   T   T  2
T   F   T  2
F   T   F  5
F   F   T  1

Real vs. expected counts of Y = T: X2 = F: 1 vs. 1/6 (s_f·p/s); X2 = T: 0 vs. 5/6 (s_t·p/s). Real vs. expected counts of Y = F: X2 = F: 0 vs. 5/6 (s_f·n/s); X2 = T: 5 vs. 25/6 (s_t·n/s).
If the label Y and the feature X2 were independent, the expected counts would be close to the real counts. Degrees of freedom: DF = (#Y labels − 1)·(#X2 labels − 1) = (2 − 1)·(2 − 1) = 1.
Z = Σ (O_{r,c} − E_{r,c})² / E_{r,c} = (1 − 1/6)²/(1/6) + (0 − 5/6)²/(5/6) + (0 − 5/6)²/(5/6) + (5 − 25/6)²/(25/6) = 25/6 + 5/6 + 5/6 + 1/6 = 6
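This arithmetic can be verified directly (variable names are mine):

```python
# Verify the slide's pruning statistic Z for splitting on X2:
# s = 6 instances enter the node (p = 1 positive, n = 5 negative);
# the X2 = F branch gets 1 instance, the X2 = T branch gets 5.

cells = [            # (observed, expected) for each (branch, label) cell
    (1, 1 * 1 / 6),  # X2=F, Y=T : expected s_f * p / s
    (0, 5 * 1 / 6),  # X2=T, Y=T : expected s_t * p / s
    (0, 1 * 5 / 6),  # X2=F, Y=F : expected s_f * n / s
    (5, 5 * 5 / 6),  # X2=T, Y=F : expected s_t * n / s
]
Z = sum((o - e) ** 2 / e for o, e in cells)
print(round(Z, 6))   # 6.0 > 3.8415, so the split is kept
```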
Chi-Squared independence test: P(Z > c) is the probability that we see this large a deviation by chance under the H_0 independence assumption. P(Z > 3.8415) = 0.05, P(Z ≤ 3.8415) = 0.95. The smaller Z is, the more likely the feature is independent of the label (there is no evidence of dependence). In our case Z = 6 > 3.8415 ⇒ we reject the independence hypothesis and keep the node X2.
What you should know: decision trees are one of the most popular data mining tools (simplicity of design, interpretability, ease of implementation, good performance in practice for small dimensions); information gain is used to select attributes (ID3, C4.5, …); they can be used for classification, regression, and density estimation; decision trees will overfit, so you must use tricks to find simple trees, e.g. pre-pruning (fixed depth / fixed number of leaves), post-pruning (chi-square test of independence), or complexity-penalized model selection.
Thanks for the Attention!