CSC 411 Lecture 3: Decision Trees


CSC 411 Lecture 3: Decision Trees
Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla
University of Toronto

Today
- Decision Trees
  - Simple but powerful learning algorithm
  - One of the most widely used learning algorithms in Kaggle competitions
- Lets us introduce ensembles (Lectures 4-5), a key idea in ML more broadly
- Useful information-theoretic concepts (entropy, mutual information, etc.)

Decision Trees
(Figure: an example decision tree with yes/no branches.)

Decision Trees

Decision Trees
Decision trees make predictions by recursively splitting on different attributes according to a tree structure.

Example with Discrete Inputs
What if the attributes are discrete?
Attributes: (table of attributes for the restaurant-waiting example)

Decision Tree: Example with Discrete Inputs
The tree to decide whether to wait (T) or not (F).

Decision Trees
(Figure: an example decision tree with yes/no branches.)
- Internal nodes test attributes
- Branching is determined by attribute value
- Leaf nodes are outputs (predictions)

Decision Tree: Classification and Regression
- Each path from root to a leaf defines a region $R_m$ of input space.
- Let $\{(x^{(m_1)}, t^{(m_1)}), \ldots, (x^{(m_k)}, t^{(m_k)})\}$ be the training examples that fall into $R_m$.
- Classification tree: discrete output; leaf value $y^m$ typically set to the most common value in $\{t^{(m_1)}, \ldots, t^{(m_k)}\}$.
- Regression tree: continuous output; leaf value $y^m$ typically set to the mean value in $\{t^{(m_1)}, \ldots, t^{(m_k)}\}$.
- Note: We will focus on classification.
[Slide credit: S. Russell]
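A minimal sketch of the two leaf rules in Python (assuming NumPy; `targets` stands for the training targets that fall into a region $R_m$, and the example data below is hypothetical):

```python
from collections import Counter
import numpy as np

def classification_leaf(targets):
    """Leaf value for a classification tree: the most common target in the region."""
    return Counter(targets).most_common(1)[0][0]

def regression_leaf(targets):
    """Leaf value for a regression tree: the mean target in the region."""
    return float(np.mean(targets))

print(classification_leaf(["T", "F", "T", "T"]))  # -> "T"
print(regression_leaf([1.0, 2.0, 4.0]))           # -> 2.333...
```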

Expressiveness
- Discrete-input, discrete-output case:
  - Decision trees can express any function of the input attributes.
  - E.g., for Boolean functions, each truth table row maps to a path to a leaf.
- Continuous-input, continuous-output case:
  - Can approximate any function arbitrarily closely.
- Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless $f$ is nondeterministic in $x$), but it probably won't generalize to new examples.
[Slide credit: S. Russell]

How do we Learn a Decision Tree?
How do we construct a useful decision tree?

Learning Decision Trees
- Learning the simplest (smallest) decision tree is an NP-complete problem [if you are interested, check: Hyafil & Rivest '76].
- Resort to a greedy heuristic:
  - Start from an empty decision tree
  - Split on the best attribute
  - Recurse
- Which attribute is the best? Choose based on accuracy?

Choosing a Good Split
- Why isn't accuracy a good measure?
- Is this split good? Zero accuracy gain.
- Instead, we will use techniques from information theory.
- Idea: Use counts at leaves to define probability distributions, so we can measure uncertainty.

Choosing a Good Split
Which attribute is better to split on, $X_1$ or $X_2$?
- Deterministic: good (all are true or false; just one class in the leaf)
- Uniform distribution: bad (all classes in the leaf equally probable)
- What about distributions in between?
Note: Let's take a slight detour and remember concepts from information theory.
[Slide credit: D. Sontag]

We Flip Two Different Coins
Sequence 1: 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...?
Sequence 2: 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 ...?
(Figure: histograms of the two sequences; the first coin shows 16 zeros and 2 ones, the second 8 zeros and 10 ones.)

Quantifying Uncertainty
Entropy is a measure of expected "surprise":
$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$
For the two coins above (empirical distributions 8/9 vs. 1/9 and 4/9 vs. 5/9):
$$-\tfrac{8}{9}\log_2\tfrac{8}{9} - \tfrac{1}{9}\log_2\tfrac{1}{9} \approx \tfrac{1}{2} \qquad\qquad -\tfrac{4}{9}\log_2\tfrac{4}{9} - \tfrac{5}{9}\log_2\tfrac{5}{9} \approx 0.99$$
- Measures the information content of each observation
- Unit = bits
- A fair coin flip has 1 bit of entropy
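A minimal sketch of this formula in Python (assuming NumPy), reproducing the fair-coin value and the two coin distributions above:

```python
import numpy as np

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability outcomes are ignored."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([8/9, 1/9]))    # first coin: ~0.50 bits
print(entropy([4/9, 5/9]))    # second coin: ~0.99 bits
```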

Quantifying Uncertainty
$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$
(Figure: entropy of a coin flip plotted against the probability $p$ of heads; it is 0 at $p = 0$ and $p = 1$ and peaks at 1 bit at $p = 0.5$.)

Entropy
- High entropy:
  - Variable has a uniform-like distribution
  - Flat histogram
  - Values sampled from it are less predictable
- Low entropy:
  - Distribution of the variable has many peaks and valleys
  - Histogram has many lows and highs
  - Values sampled from it are more predictable
[Slide credit: Vibhav Gogate]

Entropy of a Joint Distribution
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy   Not Cloudy
Raining       24/100   1/100
Not Raining   25/100   50/100

$$H(X, Y) = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log_2 p(x, y) = -\tfrac{24}{100}\log_2\tfrac{24}{100} - \tfrac{1}{100}\log_2\tfrac{1}{100} - \tfrac{25}{100}\log_2\tfrac{25}{100} - \tfrac{50}{100}\log_2\tfrac{50}{100} \approx 1.56 \text{ bits}$$
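A quick numerical check of the joint entropy above (a sketch assuming NumPy; the array `joint` encodes the 2x2 rain/cloud table):

```python
import numpy as np

# p(x, y): rows = {raining, not raining}, columns = {cloudy, not cloudy}
joint = np.array([[24, 1],
                  [25, 50]]) / 100.0

def entropy_bits(probs):
    """H = -sum p log2 p over the nonzero entries, in bits."""
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits(joint))  # H(X, Y) ~ 1.56 bits
```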

Specific Conditional Entropy
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy} (same table as above)
What is the entropy of cloudiness Y, given that it is raining?
$$H(Y \mid X = x) = -\sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) = -\tfrac{24}{25}\log_2\tfrac{24}{25} - \tfrac{1}{25}\log_2\tfrac{1}{25} \approx 0.24 \text{ bits}$$
We used $p(y \mid x) = \frac{p(x, y)}{p(x)}$ and $p(x) = \sum_y p(x, y)$ (sum over a row).

Conditional Entropy
(Same rain/cloud table as above.)
The expected conditional entropy:
$$H(Y \mid X) = \sum_{x \in X} p(x)\, H(Y \mid X = x) = -\sum_{x \in X}\sum_{y \in Y} p(x, y) \log_2 p(y \mid x)$$

Conditional Entropy
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy} (same table as above)
What is the entropy of cloudiness, given the knowledge of whether or not it is raining?
$$H(Y \mid X) = \sum_{x \in X} p(x)\, H(Y \mid X = x) = \tfrac{1}{4} H(\text{cloudy} \mid \text{is raining}) + \tfrac{3}{4} H(\text{cloudy} \mid \text{not raining}) \approx 0.75 \text{ bits}$$
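A small numerical check of the specific and expected conditional entropies above (a sketch assuming NumPy; `joint` is the 2x2 rain/cloud table):

```python
import numpy as np

joint = np.array([[24, 1],
                  [25, 50]]) / 100.0  # p(x, y): rows = raining / not raining

def entropy_bits(probs):
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_x = joint.sum(axis=1)  # p(raining) = 1/4, p(not raining) = 3/4

# Specific conditional entropies H(Y | X = x), one per row.
H_y_given_raining = entropy_bits(joint[0] / p_x[0])      # ~0.24 bits
H_y_given_not_raining = entropy_bits(joint[1] / p_x[1])  # ~0.92 bits

# Expected conditional entropy H(Y | X) = sum_x p(x) H(Y | X = x).
print(p_x[0] * H_y_given_raining + p_x[1] * H_y_given_not_raining)  # ~0.75 bits
```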

Conditional Entropy
Some useful properties:
- $H$ is always non-negative
- Chain rule: $H(X, Y) = H(X \mid Y) + H(Y) = H(Y \mid X) + H(X)$
- If $X$ and $Y$ are independent, then $X$ doesn't tell us anything about $Y$: $H(Y \mid X) = H(Y)$
- But $Y$ tells us everything about $Y$: $H(Y \mid Y) = 0$
- By knowing $X$, we can only decrease uncertainty about $Y$: $H(Y \mid X) \leq H(Y)$
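The chain rule and the last inequality can be sanity-checked numerically on the rain/cloud table (a self-contained sketch assuming NumPy):

```python
import numpy as np

joint = np.array([[24, 1],
                  [25, 50]]) / 100.0  # p(x, y)

def H(probs):
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
H_y_given_x = sum(p_x[i] * H(joint[i] / p_x[i]) for i in range(2))
H_x_given_y = sum(p_y[j] * H(joint[:, j] / p_y[j]) for j in range(2))

print(np.isclose(H(joint), H_y_given_x + H(p_x)))  # chain rule: H(X, Y) = H(Y | X) + H(X)
print(np.isclose(H(joint), H_x_given_y + H(p_y)))  # chain rule: H(X, Y) = H(X | Y) + H(Y)
print(H_y_given_x <= H(p_y))                       # knowing X can only decrease uncertainty about Y
```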

Information Gain
(Same rain/cloud table as above.)
How much information about cloudiness do we get by discovering whether it is raining?
$$IG(Y \mid X) = H(Y) - H(Y \mid X) \approx 0.25 \text{ bits}$$
- This is called the information gain in $Y$ due to $X$, or the mutual information of $Y$ and $X$
- If $X$ is completely uninformative about $Y$: $IG(Y \mid X) = 0$
- If $X$ is completely informative about $Y$: $IG(Y \mid X) = H(Y)$
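Putting the pieces together, the information gain for the rain/cloud table (a sketch assuming NumPy):

```python
import numpy as np

def H(probs):
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

joint = np.array([[24, 1],
                  [25, 50]]) / 100.0   # p(x, y)
p_x = joint.sum(axis=1)                # p(raining), p(not raining)
p_y = joint.sum(axis=0)                # p(cloudy), p(not cloudy)

H_y_given_x = sum(p_x[i] * H(joint[i] / p_x[i]) for i in range(2))  # ~0.75 bits
print(H(p_y) - H_y_given_x)            # IG(Y | X) ~ 0.25 bits
```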

Revisiting Our Original Example
- Information gain measures the informativeness of a variable, which is exactly what we desire in a decision tree attribute!
- What is the information gain of this split?
- Root entropy: $H(Y) = -\tfrac{49}{149}\log_2\tfrac{49}{149} - \tfrac{100}{149}\log_2\tfrac{100}{149} \approx 0.91$
- Leaf entropies: $H(Y \mid \text{left}) = 0$, $H(Y \mid \text{right}) \approx 1$
- $IG(\text{split}) \approx 0.91 - \left(\tfrac{1}{3}\cdot 0 + \tfrac{2}{3}\cdot 1\right) \approx 0.24 > 0$

Constructing Decision Trees
(Figure: an example decision tree with yes/no branches.)
At each level, one must choose:
1. Which variable to split.
2. Possibly where to split it.
Choose them based on how much information we would gain from the decision! (Choose the attribute that gives the highest gain.)

Decision Tree Construction Algorithm
Simple, greedy, recursive approach; builds up the tree node by node:
1. Pick an attribute to split at a non-terminal node.
2. Split examples into groups based on attribute value.
3. For each group:
   - if there are no examples, return the majority class from the parent
   - else if all examples are in the same class, return that class
   - else loop to step 1
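A minimal sketch of this greedy procedure for discrete attributes, using information gain as the split criterion (the representation of examples as dicts and the helper names are illustrative assumptions, not the lecture's reference code):

```python
from collections import Counter
import numpy as np

def entropy(labels):
    """Entropy (in bits) of the empirical label distribution."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(examples, labels, attr):
    """Information gain of splitting `examples` (a list of dicts) on discrete attribute `attr`."""
    n = len(labels)
    cond = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def build_tree(examples, labels, attrs, parent_majority=None):
    """Greedy, recursive construction; the tree is a nested dict, leaves are class labels."""
    if not examples:                       # no examples: majority class from the parent
        return parent_majority
    if len(set(labels)) == 1:              # all examples in the same class
        return labels[0]
    if not attrs:                          # no attributes left (extra base case): majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(examples, labels, a))
    majority = Counter(labels).most_common(1)[0][0]
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = build_tree([examples[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best],
                                       majority)
    return tree
```

On the restaurant data from the slides, this procedure should choose Patrons at the root, consistent with the attribute-selection computation below.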

Back to Our Example
Attributes: (table of attributes) [from: Russell & Norvig]

Attribute Selection
$$IG(Y \mid X) = H(Y) - H(Y \mid X)$$
$$IG(\text{Type}) = 1 - \left[\tfrac{2}{12} H(Y \mid \text{Fr.}) + \tfrac{2}{12} H(Y \mid \text{It.}) + \tfrac{4}{12} H(Y \mid \text{Thai}) + \tfrac{4}{12} H(Y \mid \text{Bur.})\right] = 0$$
$$IG(\text{Patrons}) = 1 - \left[\tfrac{2}{12} H(0, 1) + \tfrac{4}{12} H(1, 0) + \tfrac{6}{12} H\!\left(\tfrac{2}{6}, \tfrac{4}{6}\right)\right] \approx 0.541$$
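Both numbers can be reproduced from the class counts in each branch (a sketch; the per-branch counts below are taken from the 12-example restaurant dataset of Russell & Norvig, which has 6 "wait" and 6 "don't wait" examples at the root):

```python
import numpy as np

def H(counts):
    """Entropy (bits) of a distribution given as counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def info_gain(branches, n_total):
    """branches: list of (wait, not_wait) counts, one pair per attribute value."""
    root = H([6, 6])  # H(Y) = 1 bit at the root
    cond = sum((w + nw) / n_total * H([w, nw]) for w, nw in branches)
    return root - cond

# Patrons: None -> (0, 2), Some -> (4, 0), Full -> (2, 4)
print(info_gain([(0, 2), (4, 0), (2, 4)], 12))          # ~0.541
# Type: French -> (1, 1), Italian -> (1, 1), Thai -> (2, 2), Burger -> (2, 2)
print(info_gain([(1, 1), (1, 1), (2, 2), (2, 2)], 12))  # 0.0
```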

Which Tree is Better?
(Figure: two candidate decision trees for the example.)

What Makes a Good Tree?
- Not too small: need to handle important but possibly subtle distinctions in the data
- Not too big:
  - Computational efficiency (avoid redundant, spurious attributes)
  - Avoid over-fitting the training examples
  - Human interpretability
- "Occam's Razor": find the simplest hypothesis that fits the observations
  - Useful principle, but hard to formalize (how to define simplicity?)
  - See Domingos, 1999, "The Role of Occam's Razor in Knowledge Discovery"
- We desire small trees with informative nodes near the root

Decision Tree Miscellany
- Problems:
  - You have exponentially less data at lower levels
  - Too big of a tree can overfit the data
  - Greedy algorithms don't necessarily yield the global optimum
- Handling continuous attributes: split based on a threshold, chosen to maximize information gain (see the sketch after this slide)
- Decision trees can also be used for regression on real-valued outputs: choose splits to minimize squared error, rather than maximize information gain.
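For continuous attributes, one common recipe (sketched here under the assumption that candidate thresholds are placed midway between adjacent sorted values) is to score every candidate threshold by the information gain of the resulting binary split:

```python
from collections import Counter
import numpy as np

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_threshold(values, labels):
    """Pick the threshold on a continuous attribute that maximizes information gain."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], [labels[i] for i in order]
    root = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2.0  # midpoint between adjacent sorted values
        left, right = labels[:i], labels[i:]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = root - cond
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Hypothetical example: a threshold near 35 separates the two classes perfectly.
print(best_threshold([22, 25, 31, 38, 42, 47], ["F", "F", "F", "T", "T", "T"]))  # (34.5, 1.0)
```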

Comparison to k-nn Advantages of decision trees over KNN Good when there are lots of attributes, but only a few are important Good with discrete attributes Easily deals with missing values (just treat as another value) Robust to scale of inputs Fast at test time More interpretable Advantages of KNN over decision trees Few hyperparameters Able to handle attributes/features that interact in complex ways (e.g. pixels) Can incorporate interesting distance measures (e.g. shape contexts) Typically make better predictions in practice As we ll see next lecture, ensembles of decision trees are much stronger. But they lose many of the advantages listed above. UofT CSC 411: 03-Decision Trees 33 / 33