Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos

Contents: Decision Trees: definition and motivation; algorithm for learning decision trees; entropy, mutual information, information gain; generalizations; regression trees; overfitting; pruning; regularization. Many of these slides are taken from Aarti Singh, Eric Xing, Carlos Guestrin, Russ Greiner, and Andrew Moore. 2

Decision Trees 3

Decision Tree: Motivation Learn decision rules from a dataset: Do we want to play tennis? 4 discrete-valued attributes (Outlook, Temperature, Humidity, Wind) Play tennis?: Yes/No classification problem 4

Decision Tree: Motivation We want to learn a good decision tree from the data. For example, this tree: 5

Function Approximation. Formal problem setting: Set of possible instances X (the set of all possible feature vectors). Unknown target function f : X → Y. Set of function hypotheses H = { h | h : X → Y } (here H = the set of possible decision trees). Input: training examples {<x(i), y(i)>} of the unknown target function f. Output: a hypothesis h ∈ H that best approximates the target function f. In decision tree learning, we are doing function approximation, where the hypothesis space H = the set of decision trees. 6

Decision Tree: The Hypothesis Space. Each internal node is labeled with some feature x_j. Each arc out of x_j is labeled with one outcome of the test on x_j. Leaf nodes specify the class h(x). One instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong is classified as No (Temperature and Wind are irrelevant on this path). Easy to use for classification; interpretable rules. 7

Generalizations. Features can be continuous. The output can be continuous too (regression trees). Instead of testing a single feature at each node, we can also test sets of features. We will discuss these in more detail later. 8

Continuous Features. If a feature is continuous, internal nodes may test its value against a threshold. 9

Example: Mixed Discrete and Continuous Features. Tax Fraud Detection: the goal is to predict who is cheating on their taxes using the Refund, Marital Status, and Taxable Income features.

Refund  Marital Status  Taxable Income  Cheat
Yes     Married         50K             No
No      Married         90K             No
No      Single          60K             No
No      Divorced        100K            Yes
Yes     Married         110K            No

Build a tree that matches the data. 10

Decision Tree for Tax Fraud Detection Data. The tree: Refund? Yes → NO; No → MarSt: Married → NO; Single, Divorced → TaxInc: < 80K → NO, > 80K → YES. Each internal node tests one feature X_i (continuous features are tested against a threshold). Each branch from a node selects one value (or set of values) for X_i. Each leaf node predicts Y. 11

Given a decision tree, how do we assign a label to a test point? 12

Decision Tree for Tax Fraud Detection. Query record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ? Walk the tree from the root: Refund is No, so follow the No branch to the MarSt node; Marital Status is Married, so follow the Married branch and reach the leaf NO. Assign Cheat = No. 13-18
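To make the traversal concrete, here is a minimal sketch (not from the slides) of the same tree as a Python function. The record field names are illustrative, and how the boundary case TaxInc = 80K is handled is a convention choice, since the slide only shows "< 80K" and "> 80K" branches.

```python
def classify(record):
    """Walk the tax-fraud tree sketched above for one record (a dict)."""
    if record["Refund"] == "Yes":
        return "No"                      # leaf: NO
    if record["MaritalStatus"] == "Married":
        return "No"                      # leaf: NO
    # Single or Divorced: test taxable income against the 80K threshold
    # (treating exactly 80K as the ">= 80K" side is a convention choice here)
    return "Yes" if record["TaxableIncome"] >= 80_000 else "No"

# The query record from the slide: Refund = No, Married, 80K  ->  Cheat = No
print(classify({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80_000}))
```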

What do decision trees do in the feature space? 19

Decision Tree Decision Boundaries. Decision trees divide the feature space into axis-parallel rectangles and label each rectangle with one class. (The illustration uses two features only: x_1 and x_2.) 20

Some functions cannot be represented with binary splits. If we want to learn such a function, we either need more complex tests in the nodes than binary splits, or we need to break the function into smaller parts that can each be represented with binary splits. (Figure: a two-dimensional example with + and − regions whose boundary is not axis-parallel.) 21

How do we learn a decision tree from training data? 22

What Boolean functions can be represented with decision trees? How would you represent Y = X_2 AND X_5? Y = X_2 OR X_5? How would you represent X_2 X_5 ∨ X_3 X_4 (¬X_1)? 23

Decision trees can represent any Boolean/discrete function. With n Boolean features (x_1, …, x_n) there are 2^n possible different instances, and 2^(2^n) possible different functions if the class label Y is Boolean too; for example, n = 2 gives 4 instances and 16 functions. (Figure: a tree that splits on X_1 at the root and on X_2 below it, with leaves +, −, −, +.) 24

Option 1: Just store the training data. Trees can represent any Boolean (and discrete) function, e.g. (A ∨ B) ∧ (C ∨ ¬D ∨ E): just produce one path per example (i.e., store the training data)... but this may require exponentially many nodes... and what generalization capability does it have for instances that are not in the training data? It is NP-hard to find the smallest tree that fits the data. Intuition: we want SMALL trees, to capture regularities in the data, and because they are easier to understand and faster to execute. 25

Expressiveness of General Decision Trees. Example: learn A XOR B (Boolean features and labels). There is a decision tree which perfectly classifies a training set, with one path to a leaf for each example. 26

Example of Overfitting. 1000 patients: 25% have butterfly-itis (250), 75% are healthy (750). Use 10 silly features that are not related to the class label: half of the patients have F1 = 1 ("odd birthday"), half have F2 = 1 ("even SSN"), etc. 27

Typical Results. Standard decision tree learner: error rate 0% on training data, 37% on new data. Optimal decision tree (a single leaf predicting the majority class "healthy"): error rate 25% on training data, 25% on new data. Regularization is important. 28

How to learn a decision tree: top-down induction [many algorithms: ID3, C4.5, CART, …] (grow the tree from the root to the leaves). We will focus on the ID3 algorithm. Repeat: 1. Select the best feature (e.g., X_1, X_2, or X_3) to split on. 2. For each value that feature takes, sort the training examples into the leaf nodes. 3. Stop if a leaf contains only training examples with the same label, or if all features are used up. 4. Assign each leaf the majority vote of the labels of its training examples. (A minimal code sketch of this loop follows.) 29
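To make the loop concrete, here is a minimal Python sketch of ID3-style top-down induction on discrete features. It is not from the slides; the dictionary-of-dictionaries tree representation and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, feature, target):
    """H(Y) minus the weighted entropy of Y after splitting on `feature`."""
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in set(ex[feature] for ex in examples):
        subset = [ex[target] for ex in examples if ex[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, features, target):
    """Top-down induction: pick the best feature, split, and recurse."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:            # all examples share one label: stop
        return labels[0]
    if not features:                     # features used up: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(examples, f, target))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, [f for f in features if f != best], target)
    return tree
```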

First Split? 30

Which feature is best to split on? A good split leaves us less uncertain about the classification after the split. Example with 80 training people (50 Genuine, 30 Cheats): Split on Refund: the Yes branch gets 40 Genuine and 0 Cheats (absolutely sure), the No branch gets 10 Genuine and 30 Cheats (kind of sure). Split on Marital Status: one branch gets 30 Genuine and 10 Cheats (kind of sure), the other 20 Genuine and 20 Cheats (absolutely unsure). Refund gives more information about the labels than Marital Status. 31

Which feature is best to split on? Pick the attribute/feature which yields the maximum information gain, where H(Y) is the entropy of Y and H(Y | X_i) is the conditional entropy of Y given X_i. The feature which yields the maximum reduction in entropy provides the maximum information about Y. 32

Entropy of a random variable Y: H(Y) = −Σ_y P(Y = y) log_2 P(Y = y). Larger uncertainty means larger entropy. For Y ~ Bernoulli(p), the entropy H(Y) as a function of p is zero when Y is deterministic (p = 0 or 1) and maximal when Y is uniform (p = 1/2). Information-theoretic interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code). 33

Information Gain. The advantage of an attribute is the decrease in uncertainty: the entropy of Y before the split minus the entropy of Y after splitting on X_i (we want the latter to be small), where each branch is weighted by the probability of following it. Information gain is the difference; maximizing the information gain is the same as minimizing the conditional entropy. (The formulas are written out explicitly below.) 34
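Written out explicitly (these are the standard definitions; the slide's own rendered formulas did not survive the transcription):

```latex
H(Y) = -\sum_{y} P(Y=y)\,\log_2 P(Y=y), \qquad
H(Y \mid X_i) = \sum_{v} P(X_i=v)\, H(Y \mid X_i=v),
\qquad
IG(Y, X_i) = H(Y) - H(Y \mid X_i).
```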

First Split? Which feature best splits the data into + and − instances? 35

First Split? The Outlook feature looks great, because its Overcast branch is perfectly separated. 36

Statistics. If we split on x_i, we produce 2 children: (1) #(x_i = t) examples follow the TRUE branch, with data [#(x_i = t, Y = +), #(x_i = t, Y = −)]; (2) #(x_i = f) examples follow the FALSE branch, with data [#(x_i = f, Y = +), #(x_i = f, Y = −)]. Calculate the mutual information between x_i and Y! 37

Information gain of the Outlook feature. Root: 14 examples (9+, 5−), H = −(9/14·log2(9/14) + 5/14·log2(5/14)) = 0.9403. Outlook splits this into Sunny [2+, 3−] with H1 = −(2/5·log2(2/5) + 3/5·log2(3/5)) = 0.9710, Overcast [4+, 0−] with H2 = 0, and Rain [3+, 2−] with H3 = 0.9710. I(Y, Outlook) = 0.940 − (5/14·H1 + 4/14·H2 + 5/14·H3) = 0.2465. 38

Information gain of the Humidity feature. Root: 14 examples (9+, 5−), H = 0.9403. Humidity splits this into High [3+, 4−] with H = −(3/7·log2(3/7) + 4/7·log2(4/7)) = 0.9852 and Normal [6+, 1−] with H = −(6/7·log2(6/7) + 1/7·log2(1/7)) = 0.5917. I(Y, Humidity) = 0.940 − 7/14·0.985 − 7/14·0.591 = 0.151. 39

Information gain of the Wind feature. Root: 14 examples (9+, 5−), H = 0.9403. Wind splits this into Weak [6+, 2−] with H = −(6/8·log2(6/8) + 2/8·log2(2/8)) = 0.811 and Strong [3+, 3−] with H = 1. I(Y, Wind) = 0.940 − 8/14·0.811 − 6/14·1 = 0.048. 40
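The numbers above can be checked with a few lines of Python (a sketch, not from the slides). The Temperature counts on the last line come from the standard PlayTennis table and are an assumption here, since the slides only mention that calculation in passing.

```python
import math

def entropy(counts):
    """Entropy (bits) of a label distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """Information gain: parent entropy minus the weighted child entropies."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - remainder

root = [9, 5]  # 9 positive, 5 negative PlayTennis examples
print(gain(root, [[2, 3], [4, 0], [3, 2]]))  # Outlook      -> ~0.247
print(gain(root, [[3, 4], [6, 1]]))          # Humidity     -> ~0.152
print(gain(root, [[6, 2], [3, 3]]))          # Wind         -> ~0.048
print(gain(root, [[2, 2], [4, 2], [3, 1]]))  # Temperature  -> ~0.029 (assumed counts)
```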

Repeat and build the tree. A similar calculation for the Temperature feature gives a smaller gain, so the Outlook feature is the best root node among all features. Within the Outlook = Sunny branch, Humidity is the best next split. 41

Tree Learning App: http://www.cs.ualberta.ca/%7eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html 42

More general trees 43

Decision/Classification Tree, more generally: Features can be discrete or continuous. Each internal node tests some set of features {X_i}. Each branch from a node selects a set of values for {X_i}. Each leaf node predicts Y: majority vote (classification), or average / polynomial fit (regression). (The figure shows a tree whose leaves carry 0/1 class labels.) 44

Regression trees. Features X_1, …, X_p; for example, an internal node may test "Num Children ≥ 2?" versus "< 2". At each leaf, predict the average of (i.e., fit a constant to) the training data reaching that leaf. 45
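A minimal sketch (not from the slides) of the "fit a constant at the leaves" idea, for a single-feature regression stump that chooses the threshold minimizing squared error; the toy data and function name are illustrative.

```python
def best_constant_split(x, y):
    """Pick the threshold on one feature that minimizes squared error when
    each side predicts the mean of its training targets (constant leaves)."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t,
                    sum(left) / len(left) if left else None,
                    sum(right) / len(right) if right else None)
    return best  # (squared error, threshold, left-leaf mean, right-leaf mean)

# Toy data: the stump should split between the two clusters and predict averages.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]
print(best_constant_split(x, y))
```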

Regression (Constant) trees 46

Overfitting 47

When to Stop? Many strategies for picking simpler trees: Pre-pruning: fixed depth, fixed number of leaves. Post-pruning: chi-square test. Model selection by complexity penalization. 48

Model Selection. Penalize complex models by introducing a cost term: trade the fit to the data (log likelihood; squared error for regression, classification loss for classification) off against a cost that penalizes trees with more leaves. 49
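A generic complexity-penalized score of the kind this slide describes (the exact expression on the slide did not survive extraction; the penalty weight λ and the notation below are illustrative):

```latex
\text{Score}(T) = \text{fit}(T) - \lambda \cdot \#\text{leaves}(T),
\qquad
\text{fit}(T) =
\begin{cases}
-\sum_i \bigl(y^{(i)} - \hat{y}_T(x^{(i)})\bigr)^2 & \text{(regression)}\\[4pt]
\sum_i \log P\bigl(y^{(i)} \mid x^{(i)}, T\bigr) & \text{(classification)}
\end{cases}
```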

Pre-Pruning 50

PAC bound and Bias-Variance tradeoff. Equivalently, with probability at least 1 − δ: error_true(h) ≤ error_train(h) + sqrt((ln|H| + ln(1/δ)) / (2m)). For a fixed sample size m: a complex hypothesis space H makes the training-error (bias) term small but the complexity (variance) term large; a simple H makes the bias term large but the variance term small.

Sample complexity: what about the size of the hypothesis space? ⇒ How large is the hypothesis space of decision trees?

Number of decision trees of depth k. Recursive solution, given n attributes: let H_k = the number of decision trees of depth k. H_0 = 2 (the "Yes" tree and the "No" tree). H_k = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = n · H_{k-1} · H_{k-1}. Write L_k = log_2 H_k. Then L_0 = 1 and L_k = log_2 n + 2 L_{k-1} = log_2 n + 2(log_2 n + 2 L_{k-2}) = log_2 n + 2 log_2 n + 2^2 log_2 n + … + 2^{k-1}(log_2 n + 2 L_0). So L_k = (2^k − 1) log_2 n + 2^k (summing the geometric series). 53
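A quick sanity check (not from the slides) that the closed form matches the recursion:

```python
import math

def L_recursive(k, n):
    """L_k = log2(H_k) from the recursion L_0 = 1, L_k = log2(n) + 2*L_{k-1}."""
    return 1.0 if k == 0 else math.log2(n) + 2 * L_recursive(k - 1, n)

def L_closed(k, n):
    """Closed form derived above: L_k = (2^k - 1) * log2(n) + 2^k."""
    return (2 ** k - 1) * math.log2(n) + 2 ** k

for n in (2, 5, 10):          # a few feature counts
    for k in range(6):        # a few depths
        assert abs(L_recursive(k, n) - L_closed(k, n)) < 1e-9
print("closed form matches the recursion")
```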

PAC bound for decision trees of depth k. With L_k = (2^k − 1) log_2 n + 2^k, this is bad: the number of points needed is exponential in the depth k! In contrast, the number of leaves is never more than the number of data points, so let us regularize with the number of leaves instead of the depth. 54

Number of decision trees with k leaves. Let H_k = the number of decision trees with k leaves. H_1 = 2 (the "Yes" tree or the "No" tree). H_k = (# choices of root attribute) × [(# left subtrees with 1 leaf) × (# right subtrees with k−1 leaves) + (# left subtrees with 2 leaves) × (# right subtrees with k−2 leaves) + … + (# left subtrees with k−1 leaves) × (# right subtrees with 1 leaf)] = n^{k−1} C_{k−1}, where C_{k−1} is the (k−1)-st Catalan number. A loose bound (using Stirling's approximation) shows that log_2 H_k grows only linearly in k (roughly k log_2 n). 55

Number of decision trees. With k leaves: log_2 H_k is linear in k, so the number of points m needed is linear in the number of leaves k. With depth k: log_2 H_k = (2^k − 1) log_2 n + 2^k is exponential in k, so the number of points m needed is exponential in the depth k. (n is the number of features.) 56

PAC bound for decision trees with k leaves: Bias-Variance revisited. With probability 1 − δ, error_true(h) ≤ error_train(h) + sqrt((ln H_k + ln(1/δ)) / (2m)), where m is the number of training points and k is the number of leaves; since ln H_k grows only linearly in k, the complexity term scales roughly like sqrt(k/m). If k ≈ m: the training error (bias) term is 0 but the complexity (variance) term is large (~ > 1/2). If k ≪ m: the training error is > 0 but the complexity term is small (~ < 1/2). 57

What did we learn from decision trees? The bias-variance tradeoff, formalized through complexity: k ≈ m gives no bias but lots of variance; k ≪ m gives some bias but less variance. 58

Post-Pruning (Bottom-Up pruning) 59

Chi-Squared independence test. OBSERVED DATA (voting preferences):

               Republican  Democrat  Independent  Row total
Male               200        150         50         400
Female             250        300         50         600
Column total       450        450        100        1000

H_0: gender and voting preference are independent. H_a: gender and voting preference are not independent. Expected numbers under H_0 (independence): E_{r,c} = (n_r · n_c) / n. 60

Chi-Squared independence test (continued). Expected numbers under H_0 (independence), E_{r,c} = (n_r · n_c) / n, for the observed data above:
E_{1,1} = (400 · 450) / 1000 = 180
E_{1,2} = (400 · 450) / 1000 = 180
E_{1,3} = (400 · 100) / 1000 = 40
E_{2,1} = (600 · 450) / 1000 = 270
E_{2,2} = (600 · 450) / 1000 = 270
E_{2,3} = (600 · 100) / 1000 = 60
Χ² = Σ (O_{r,c} − E_{r,c})² / E_{r,c} = (200−180)²/180 + (150−180)²/180 + (50−40)²/40 + (250−270)²/270 + (300−270)²/270 + (50−60)²/60 = 16.2 61

Chi-Squared independence test. Degrees of freedom: DF = (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2, where r = # rows, c = # columns. P(Χ² > 16.2) = 0.0003 < 0.05 (p-value) ⇒ we reject the null hypothesis: the evidence shows that there is a relationship between gender and voting preference. 62
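A short Python check of this calculation (a sketch, not from the slides); for DF = 2 the chi-square survival function is exactly exp(−x/2), so no statistics library is needed.

```python
import math

observed = [[200, 150, 50],   # Male
            [250, 300, 50]]   # Female

row = [sum(r) for r in observed]          # row totals: [400, 600]
col = [sum(c) for c in zip(*observed)]    # column totals: [450, 450, 100]
n = sum(row)                              # grand total: 1000

# Expected counts under independence: E[r][c] = row_total * col_total / n
expected = [[row[r] * col[c] / n for c in range(3)] for r in range(2)]

chi2 = sum((observed[r][c] - expected[r][c]) ** 2 / expected[r][c]
           for r in range(2) for c in range(3))
p_value = math.exp(-chi2 / 2)   # exact survival function of chi-square with DF = 2

print(round(chi2, 1), round(p_value, 4))   # -> 16.2 0.0003
```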

Chi-Square Pruning. 1. Build a complete tree. 2. Consider each split just above the leaves and perform a chi-square independence test. Let s be the number of instances entering the node, p the number of + instances and n the number of − instances, so s = p + n, i.e., (p+, n−). The split on feature X sends s_f = p_f + n_f instances down the false branch (p_f positives, n_f negatives) and s_t = p_t + n_t instances down the true branch (p_t positives, n_t negatives). The expected numbers under independence are s_f·p/s and s_t·p/s for the positives, and s_f·n/s and s_t·n/s for the negatives. If after splitting the expected numbers are the same as the measured ones, then there is no point in splitting the node: delete the leaves! 63

Example data:

X1  X2  Y  Count
T   T   T    2
T   F   T    2
F   T   F    5
F   F   T    1

Consider the node reached by X1 = F, which is split on X2: s = 6, p = 1, n = 5. The X2 = F branch gets s_f = 1 (p_f = 1, n_f = 0; predict Y = T); the X2 = T branch gets s_t = 5 (p_t = 0, n_t = 5; predict Y = F).

Variable assignment   Real counts of Y=T   Expected counts of Y=T
X2 = F                1                    1/6  (s_f·p/s)
X2 = T                0                    5/6  (s_t·p/s)

Variable assignment   Real counts of Y=F   Expected counts of Y=F
X2 = F                0                    5/6  (s_f·n/s)
X2 = T                5                    25/6 (s_t·n/s)
64

If the label Y and the feature X2 are independent, then the expected counts should be close to the real counts. Degrees of freedom: DF = (# Y labels − 1) × (# X2 labels − 1) = (2 − 1)(2 − 1) = 1. Z = Σ (O_{r,c} − E_{r,c})² / E_{r,c} = (1 − 1/6)²/(1/6) + (0 − 5/6)²/(5/6) + (0 − 5/6)²/(5/6) + (5 − 25/6)²/(25/6) = 25/6 + 5/6 + 5/6 + 1/6 = 6 65

Chi-Squared independence test. P(Z > c) is the probability that we see this large a deviation by chance under the H_0 independence assumption. For DF = 1, P(Z > 3.8415) = 0.05 and P(Z ≤ 3.8415) = 0.95. The smaller Z is, the more likely it is that the feature is independent of the label (there is no evidence of dependence). In our case Z = 6 > 3.8415 ⇒ we reject the independence hypothesis and keep the node X2. 66
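The pruning decision for this example can be reproduced with a small sketch (a hypothetical helper, not from the slides), following exactly the expected-count formulas above.

```python
def chi_square_split_statistic(p, n, branch_counts):
    """Chi-square statistic for one candidate split, as on the slides:
    p and n are the +/- counts entering the node, and branch_counts is a
    list of (p_branch, n_branch) pairs, one per child branch."""
    s = p + n
    z = 0.0
    for p_b, n_b in branch_counts:
        s_b = p_b + n_b
        for observed, expected in ((p_b, s_b * p / s), (n_b, s_b * n / s)):
            z += (observed - expected) ** 2 / expected
    return z

# Node from the example: 1 positive and 5 negatives enter the node;
# the X2=F branch gets (1+, 0-), the X2=T branch gets (0+, 5-).
z = chi_square_split_statistic(1, 5, [(1, 0), (0, 5)])
print(z)                                          # -> 6.0
print("keep split" if z > 3.8415 else "prune")    # 3.8415 = 95% cutoff for DF = 1
```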

What you should know. Decision trees are one of the most popular data mining tools: simplicity of design, interpretability, ease of implementation, and good performance in practice (for small dimensions). Information gain is used to select attributes (ID3, C4.5, …). They can be used for classification, regression, and density estimation too. Decision trees will overfit! You must use tricks to find simple trees, e.g.: pre-pruning (fixed depth / fixed number of leaves), post-pruning (chi-square test of independence), or complexity-penalized model selection. 67

Thanks for your attention! 68