Decision Trees. CSC411/2515: Machine Learning and Data Mining, Winter 2018. Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore

Decision Trees. (Image: Claude Monet, The Mulberry Tree.) Slides from Pedro Domingos, Luke Zettlemoyer, Carlos Guestrin, Andrew Moore, Michael Guerzhoy, and Lisa Zhang. CSC411/2515: Machine Learning and Data Mining, Winter 2018.

Orange vs Lemon? What classifiers can we use to classify fruit as oranges or lemons? kNN, logistic regression, neural networks.

Decision Trees (figure)

Decision Trees (figure: a decision tree classifying fruit, with leaves labelled Yes: orange and No: lemon)

Decision Trees. Internal nodes test the value of particular features x_j and branch according to the result of the test. Leaf nodes specify the class h(x). Simpler example: predicting whether we'll play tennis (outputs: Yes, No). Features: Outlook (x_1), Temperature (x_2), Humidity (x_3), and Wind (x_4). The example x = (Sunny, Hot, High, Strong) will be classified as No; the Temperature feature is irrelevant.
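
As a concrete illustration, here is a minimal Python sketch of how such a tree classifies an example. The feature names and the classification of (Sunny, Hot, High, Strong) come from the slide; the exact tree structure (Outlook at the root, then Humidity or Wind) is the usual PlayTennis tree and is an assumption here.

    # A hand-coded PlayTennis decision tree (sketch; the structure is assumed,
    # only the feature names and the example below appear on the slide).
    def predict_play_tennis(x):
        outlook, temperature, humidity, wind = x   # Temperature is never tested
        if outlook == "Sunny":
            return "No" if humidity == "High" else "Yes"
        elif outlook == "Overcast":
            return "Yes"
        else:                                      # Rain
            return "No" if wind == "Strong" else "Yes"

    # The slide's example is classified as No, regardless of Temperature.
    print(predict_play_tennis(("Sunny", "Hot", "High", "Strong")))   # -> No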

Decision Trees. As close as it gets to an off-the-shelf classifier. Random forests (averages of multiple decision tree classifiers) often perform the best on Kaggle (Kaggle.com: a website that hosts machine learning bake-offs), though carefully engineered neural networks and other methods win as well.

Decision Trees: Continuous Features. If the features are continuous, internal nodes may test the value of a feature against a threshold.

Decision Trees: Decision Boundaries. Decision trees divide the feature space into axis-parallel rectangles, and each rectangle is labelled with one of the K classes.

Decision Trees: Model Capacity. Any Boolean function can be represented, but the tree might need exponentially many nodes to represent the function.

Decision Trees: Model Capacity. As the number of nodes in the tree / the depth of the tree increases, the hypothesis space grows. Depth 1 (a "decision stump"): can represent any Boolean function of one feature. Depth 2: any Boolean function of two features, plus some Boolean functions of three features (e.g. (x_1 AND x_2) OR (NOT x_1 AND x_3)). Etc. Hypothesis space: the set of all possible functions h_θ(x).
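
If the reconstruction of the example above as (x_1 AND x_2) OR (NOT x_1 AND x_3) is right, a depth-2 tree for it is easy to write down; the sketch below (hypothetical code, not from the slides) checks the claim on all eight inputs.

    # Sketch: a depth-2 tree computing (x1 and x2) or ((not x1) and x3).
    from itertools import product

    def depth2_tree(x1, x2, x3):
        if x1:           # root: test x1
            return x2    # left child: a single test of x2
        else:
            return x3    # right child: a single test of x3

    def target(x1, x2, x3):
        return (x1 and x2) or ((not x1) and x3)

    # The tree agrees with the Boolean function on all 8 inputs.
    assert all(depth2_tree(*x) == target(*x) for x in product([False, True], repeat=3))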

Learning the Parity Function. Suppose we want to learn to distinguish strings of parity 0 from strings of parity 1 (parity = number of 1's in the string, mod 2). All splits will look equally good! We need 2^n examples to learn the function correctly, and if there are extra random features, we cannot do anything.
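
A quick way to see why all splits look equally good: over the full set of n-bit strings, fixing any single bit leaves each branch with a 50/50 class split. A small Python check (hypothetical code; n = 4 chosen arbitrarily):

    # For parity labels, splitting on any single bit is useless: each branch
    # still contains equally many parity-0 and parity-1 strings.
    from itertools import product
    from collections import Counter

    n = 4
    data = [(bits, sum(bits) % 2) for bits in product([0, 1], repeat=n)]

    for j in range(n):
        for value in (0, 1):
            branch = Counter(y for x, y in data if x[j] == value)
            print(f"split on bit {j}, branch x_{j}={value}: class counts {dict(branch)}")
    # Every branch is a 50/50 mix, so no single-feature split reduces the
    # training error or looks better than any other.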

Learning Decision Trees. Learning the simplest (smallest) decision tree is an NP-complete problem; see Hyafil and Rivest, "Constructing Optimal Binary Decision Trees is NP-complete", Information Processing Letters 5(1), 1976. So we use a greedy heuristic to construct the tree: start from an empty decision tree, select the best attribute/feature to split on, and recurse. But what does "best" mean? We'll come back to this.

Learning Decision Trees (Binary Features)

GrowTree(S):
    if y = 0 for all <x, y> in S: return new leaf(0)
    else if y = 1 for all <x, y> in S: return new leaf(1)
    else:
        choose the best attribute x_j
        S_0 = {<x, y> in S s.t. x_j = 0}
        S_1 = {<x, y> in S s.t. x_j = 1}
        return new node(x_j, GrowTree(S_0), GrowTree(S_1))
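
A runnable Python version of this pseudocode might look as follows (a sketch, assuming S is a list of (x, y) pairs with x a tuple of 0/1 features and y in {0, 1}; the choice of "best attribute" is passed in as a function, and one concrete choice is sketched after the ChooseBestAttribute slide below).

    def grow_tree(S, choose_best_attribute):
        """Recursively grow a decision tree over binary features (sketch)."""
        labels = {y for x, y in S}
        if labels == {0}:
            return ("leaf", 0)
        if labels == {1}:
            return ("leaf", 1)
        j = choose_best_attribute(S)
        S0 = [(x, y) for x, y in S if x[j] == 0]
        S1 = [(x, y) for x, y in S if x[j] == 1]
        if not S0 or not S1:                 # the split makes no progress:
            majority = max(labels, key=[y for _, y in S].count)
            return ("leaf", majority)        # fall back to a majority-vote leaf
        return ("node", j, grow_tree(S0, choose_best_attribute),
                           grow_tree(S1, choose_best_attribute))

    def predict(tree, x):
        """Classify x by walking the tuple tree built by grow_tree."""
        if tree[0] == "leaf":
            return tree[1]
        _, j, left, right = tree
        return predict(left, x) if x[j] == 0 else predict(right, x)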

Threshold Splits. For continuous features, we need to decide on a threshold t. Branches: x_j < t and x_j >= t. We want to allow repeated splits on the same feature along a path. Why?

Set of All Possible Thresholds. Branches: x_j < t and x_j >= t. We can't try all real t, but only a finite number of thresholds are important: sort the values of x_j into z_1, ..., z_m and consider split points of the form z_i + (z_{i+1} - z_i)/2. Only splits between examples of different classes matter.
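
A sketch of this threshold-enumeration step in Python (hypothetical helper name; midpoints are kept only where both the value and the class change):

    def candidate_thresholds(values, labels):
        """Midpoints between consecutive sorted values of a continuous feature,
        kept only where the class label changes (sketch)."""
        pairs = sorted(zip(values, labels))
        thresholds = []
        for (z_i, y_i), (z_next, y_next) in zip(pairs, pairs[1:]):
            if z_i != z_next and y_i != y_next:
                thresholds.append(z_i + (z_next - z_i) / 2)
        return thresholds

    print(candidate_thresholds([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))   # -> [2.5]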

Choosing the Best Attribute. Most straightforward idea: do a 1-step lookahead, and choose the attribute such that, if we split on it, we get the lowest error rate on the training data (taking a majority vote at a leaf if not all y's agree).

ChooseBestAttribute(S):
    choose j such that J_j is minimized, where
    S_0 = {<x, y> in S s.t. x_j = 0}
    S_1 = {<x, y> in S s.t. x_j = 1}
    y_0 = the most common value of y in S_0
    y_1 = the most common value of y in S_1
    J_0 = #{<x, y> in S_0 : y != y_0}
    J_1 = #{<x, y> in S_1 : y != y_1}
    J_j = J_0 + J_1   (the total number of errors if we split on x_j)
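
In Python, the same 1-step lookahead might be sketched like this (a hypothetical implementation that can be passed to the grow_tree sketch above):

    from collections import Counter

    def errors_after_split(S, j):
        """Training errors if we split on x_j and take a majority vote per branch."""
        total = 0
        for value in (0, 1):
            ys = [y for x, y in S if x[j] == value]
            if ys:
                total += len(ys) - Counter(ys).most_common(1)[0][1]
        return total

    def choose_best_attribute(S):
        n_features = len(S[0][0])
        return min(range(n_features), key=lambda j: errors_after_split(S, j))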

Choosing the Best Attribute (example). (Figure with groups labelled "four 0's, four 1's" and "one 0, three 1's".) Splitting on x_1 produces just two errors; splitting on the other attributes produces four errors.

Choosing the Best Attribute. The number of errors won't always tell us that we're making progress: the split shown in the figure leaves the same number of errors as before the split.

A Digression on Information Theory. Suppose V is a random variable with some probability distribution over the values 0, 1, ..., 6 (a table of P(V = v) is shown, with values such as 0.1, 0.002, and 0.52). The surprise S(V = v) for each value v is defined as S(V = v) = -log_2 P(V = v). The smaller the probability of the event, the larger the surprise when we observe it: 0 surprise for events with probability 1, infinite surprise for events with probability 0.

Surprise and Message Length. Suppose we want to communicate the value of v to a receiver. It makes sense to use longer binary codes for rarer values of V: we can use -log_2 P(V = v) bits (the amount of information) to communicate v. Check that this makes sense when P(V = 0) = 1 (no need to transmit any information) and when P(V = 0) = P(V = 1) = 1/2 (we need one bit to transmit v). Fractional bits only make sense for longer messages. Example: UTF-8 uses more bytes for rare symbols. We won't go into further detail here; we would also need to make sure that the receiver can decode the message even though different symbols take up different numbers of bits.

Entropy: Average/Expected Surprise. The entropy of V, H(V), is defined as H(V) = -Σ_v P(V = v) log_2 P(V = v). It is the average surprise for one draw of V, the average message length when communicating the outcome v, and the average amount of information we get by seeing one value of V (in bits).
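
For concreteness, a tiny Python sketch of the definition (hypothetical helper; probabilities are passed in directly):

    import math

    def entropy(probs):
        """H(V) = -sum_v P(V=v) log2 P(V=v); terms with probability 0 contribute 0."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
    print(entropy([0.9, 0.1]))   # biased coin: about 0.47 bits
    print(entropy([1.0]))        # certain outcome: 0.0 bits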

Entropy: How Spread Out the Distribution Is. High entropy of V means we cannot predict what the value of V might be; low entropy means we are pretty sure we know what the value of V will be every time. The entropy of a Bernoulli variable is maximized when p = 0.5.

Entropy of Coin Flips (figure)

Entropy of Coin Flips. H(V) = -Σ_v P(V = v) log_2 P(V = v). Higher entropy means more uncertainty about the outcome.

Three Views of Entropy. We are considering a random variable V and a sample v from it. The entropy is: 1. the average surprise at v; 2. the average message length when transmitting v in an efficient way; 3. a measure of the spread-out-ness of the distribution of V.

Conditional Entropy. The amount of information needed to communicate the outcome of B given that we know A: H(B|A) = Σ_a P(A = a) H(B|A = a) = -Σ_a P(A = a) Σ_b P(B = b|A = a) log_2 P(B = b|A = a). H(B|A) = H(B) if A and B are independent; H(B|A) = 0 if A = B always.
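
A sketch of the same quantity computed from paired samples of A and B (hypothetical helper names):

    from collections import Counter
    import math

    def entropy_of_samples(samples):
        """Empirical entropy of a list of observed values."""
        n = len(samples)
        return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

    def conditional_entropy(a_samples, b_samples):
        """H(B|A) = sum_a P(A=a) H(B | A=a), estimated from paired samples."""
        n = len(a_samples)
        h = 0.0
        for a in set(a_samples):
            b_given_a = [b for a_i, b in zip(a_samples, b_samples) if a_i == a]
            h += len(b_given_a) / n * entropy_of_samples(b_given_a)
        return h

    A = [0, 0, 1, 1]
    print(conditional_entropy(A, A))              # A = B always       -> 0.0
    print(conditional_entropy(A, [0, 1, 0, 1]))   # B independent of A -> 1.0 = H(B)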

Mutual Information. The amount of information we learn about the value of B by knowing the value of A: I(A; B) = H(B) - H(B|A) = H(A) - H(A|B). In words, the number of bits of information we know about B if we know A equals the number of bits needed to communicate the value of B minus the number of extra bits needed to communicate the value of B if we already know A. Also called Information Gain.

Mutual Information. Suppose the class Y of each training example and the value of feature x_1 are random variables. The mutual information quantifies how much x_1 tells us about the value of Y; lower H(Y|X), i.e. higher I(Y; X), is better. I(Y; X) = H(Y) - H(Y|X) = H(Y) - Σ_x P(X = x) H(Y|X = x).

Mutual Information. What is the mutual information of the split shown in the figure? (Exercise)

Mutual Information Heuristic. Pick the attribute x_j such that I(x_j; Y) is as high as possible.
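
A sketch of this heuristic in Python, on the same (x, y)-pair representation used in the grow_tree sketch above (hypothetical helper names):

    from collections import Counter
    import math

    def H(samples):
        """Empirical entropy of a list of observed values."""
        n = len(samples)
        return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

    def information_gain(S, j):
        """I(x_j; Y) = H(Y) - H(Y | x_j), estimated from the training set S."""
        ys = [y for x, y in S]
        gain = H(ys)
        for value in set(x[j] for x, y in S):
            ys_v = [y for x, y in S if x[j] == value]
            gain -= len(ys_v) / len(S) * H(ys_v)
        return gain

    def choose_attribute_by_information_gain(S):
        return max(range(len(S[0][0])), key=lambda j: information_gain(S, j))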

Mutual Information Heuristic. If we had a correct rate of 0.7 and split the data into two groups where the correct rates were 0.6 and 0.8, we would not make progress on the number of errors, but we would make progress on the average conditional entropy H(Y|X). (Figure: H(Y|X = x) and the error rate plotted against P(Y = 1|X = x).) In fact, we could use any concave function of p in place of the conditional entropy in I(Y; X) = H(Y) - H(Y|X) = H(Y) - Σ_x P(X = x) H(Y|X = x).
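
The 0.7 vs 0.6/0.8 example can be checked numerically; the sketch below assumes, for concreteness, two equal-sized branches (hypothetical code).

    import math

    def binary_entropy(p):
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def error_rate(p):
        return min(p, 1 - p)

    # One node with P(correct) = 0.7 split into two equal-sized branches
    # with P(correct) = 0.6 and 0.8.
    print(error_rate(0.7), 0.5 * error_rate(0.6) + 0.5 * error_rate(0.8))
    # 0.3 vs 0.3: no progress on the number of errors
    print(binary_entropy(0.7), 0.5 * binary_entropy(0.6) + 0.5 * binary_entropy(0.8))
    # ~0.881 vs ~0.846: the average conditional entropy does decrease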

Learning Decision Trees: Summary (worked example across several figures)

Aside: Cross Entropy. The cost function we used when training classifiers is called the cross entropy: H(P, Q) = -Σ_v P(V = v) log Q(V = v). For each value v, we transmit -log Q(V = v) bits with probability P(V = v): the amount of information we need to transmit if we are using a coding scheme optimized for distribution Q when the actual distribution over V is P.

Aside: Cross Entropy. H(P, Q) = -Σ_y P(Y = y) log Q(Y = y). When used as a cost function (change of notation: the random variable is now y), P is the observed distribution (we know the answer), e.g. P(Y = 0|x) = 0, P(Y = 1|x) = 1, P(Y = 2|x) = 0, P(Y = 3|x) = 0, and Q is what the classifier actually outputs, e.g. Q(Y = 0|x) = 0.1, Q(Y = 1|x) = 0.7, Q(Y = 2|x) = 0.15, Q(Y = 3|x) = 0.15. Smaller H(P, Q) means the distributions P and Q are more similar. Note that there is a different conditional distribution for every training example x. We previously explained this cost function as the negative log-likelihood of the datapoint; that view also works.
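
A sketch of the computation for the example above (hypothetical helper; the slide does not specify the log base, so natural log is used here, matching the negative log-likelihood view):

    import math

    def cross_entropy(p, q):
        """H(P, Q) = -sum_v P(v) log Q(v); terms with P(v) = 0 contribute 0."""
        return -sum(p_v * math.log(q_v) for p_v, q_v in zip(p, q) if p_v > 0)

    P = [0.0, 1.0, 0.0, 0.0]       # observed: the true class is y = 1
    Q = [0.1, 0.7, 0.15, 0.15]     # the classifier's predicted probabilities
    print(cross_entropy(P, Q))     # = -log 0.7, about 0.357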

Missing Attribute Values. We can still use examples with missing attribute values. If node n tests an attribute A whose value is missing, we can: assign the example the most common value of A among the other examples in node n; assign the most common value of A among examples with the same target value; or assign probability p_i to each possible value v_i of A and send a fraction p_i of the example down each branch of the tree.

Avoiding Overfitting. Stop growing the tree early, or grow the full tree and then prune it. To pick the best tree: measure performance on a validation set, or measure performance on the training data but add a penalty term that grows with the size of the tree.

Reduced-Error Pruning. Repeat: evaluate the impact on the validation set of pruning each possible node (and all those below it), then greedily remove the node whose removal improves validation set accuracy the most.
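
A compact sketch of this loop on the ("leaf", c) / ("node", j, left, right) tuple trees from the grow_tree sketch above (hypothetical code; majority labels for collapsed nodes are taken from the training examples reaching them, which is one common choice).

    from collections import Counter

    def predict(tree, x):
        if tree[0] == "leaf":
            return tree[1]
        _, j, left, right = tree
        return predict(left, x) if x[j] == 0 else predict(right, x)

    def accuracy(tree, data):
        return sum(predict(tree, x) == y for x, y in data) / len(data)

    def pruned_candidates(tree, data):
        """Yield every tree obtained by collapsing one internal node to a leaf."""
        if tree[0] == "leaf" or not data:
            return
        yield ("leaf", Counter(y for _, y in data).most_common(1)[0][0])
        _, j, left, right = tree
        d0 = [(x, y) for x, y in data if x[j] == 0]
        d1 = [(x, y) for x, y in data if x[j] == 1]
        for new_left in pruned_candidates(left, d0):
            yield ("node", j, new_left, right)
        for new_right in pruned_candidates(right, d1):
            yield ("node", j, left, new_right)

    def reduced_error_pruning(tree, train, valid):
        # Greedy loop: in each pass, adopt the single collapse that most
        # improves validation accuracy; stop when no collapse helps.
        while True:
            candidates = list(pruned_candidates(tree, train))
            if not candidates:
                return tree
            best = max(candidates, key=lambda t: accuracy(t, valid))
            if accuracy(best, valid) <= accuracy(tree, valid):
                return tree
            tree = best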

Effect of Reduced-Error Pruning (figure)

Converting a Tree to Rules (figure)

Rule Post-Pruning. Convert the tree into an equivalent set of rules, e.g. "if sunny and warm, there will be a tennis match"; "if rainy, there will not be a tennis match". Prune each rule independently of the others (does removing it improve validation performance?). Finally, sort the rules into a good sequence for use.

Scaling Up. Decision tree algorithms like ID3 and C4.5 assume that random access to memory is fast, which is fine for up to hundreds of thousands of examples. SPRINT and SLIQ use multiple sequential scans of the data and are OK for millions of examples. VFDT uses at most one sequential scan of the data (stream mode).