Learning Decision Trees

Learning Decision Trees. CS194-10, Fall 2011, Lecture 8.

Outline: decision tree models; tree construction; tree pruning; continuous input features.

Restaurant example. Discrete inputs (some with static discretizations), Boolean output.

Example  Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est    WillWait
x1       Yes  No   No   Yes  Some   $$$    No    Yes  French   0-10   y1  = Yes
x2       Yes  No   No   Yes  Full   $      No    No   Thai     30-60  y2  = No
x3       No   Yes  No   No   Some   $      No    No   Burger   0-10   y3  = Yes
x4       Yes  No   Yes  Yes  Full   $      Yes   No   Thai     10-30  y4  = Yes
x5       Yes  No   Yes  No   Full   $$$    No    Yes  French   >60    y5  = No
x6       No   Yes  No   Yes  Some   $$     Yes   Yes  Italian  0-10   y6  = Yes
x7       No   Yes  No   No   None   $      Yes   No   Burger   0-10   y7  = No
x8       No   No   No   Yes  Some   $$     Yes   Yes  Thai     0-10   y8  = Yes
x9       No   Yes  Yes  No   Full   $      Yes   No   Burger   >60    y9  = No
x10      Yes  Yes  Yes  Yes  Full   $$$    No    Yes  Italian  10-30  y10 = No
x11      No   No   No   No   None   $      No    No   Thai     0-10   y11 = No
x12      Yes  Yes  Yes  Yes  Full   $      No    No   Burger   30-60  y12 = Yes

Decision trees. A popular representation for hypotheses, even among humans. E.g., the "true" tree for deciding whether to wait tests Patrons? at the root, then WaitEstimate?, with Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining? at deeper nodes. [Tree figure not reproduced.]

Classification and regression. Each path from the root to a leaf defines a region R_m of input space. Let X_m be the training examples that fall into R_m. Classification tree: discrete output; the leaf value ŷ_m is typically set to the most common output value in X_m. Regression tree: continuous output; the leaf value ŷ_m is typically set to the mean output value in X_m (the L2-optimal choice).
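A minimal sketch of the two leaf-value rules above, assuming the leaf's training outputs have already been collected into a Python list (function names are my own):

    from collections import Counter

    def classification_leaf_value(outputs):
        """Most common output value among the leaf's training examples."""
        return Counter(outputs).most_common(1)[0][0]

    def regression_leaf_value(outputs):
        """Mean output value; minimizes squared (L2) error on the leaf."""
        return sum(outputs) / len(outputs)

    # Hypothetical leaf contents:
    print(classification_leaf_value(["Yes", "No", "Yes"]))  # -> "Yes"
    print(regression_leaf_value([2.0, 4.0, 9.0]))           # -> 5.0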

Discrete and continuous inputs. Simplest case: discrete inputs with small ranges (e.g., Boolean): one branch for each value, and the attribute is used up (a "complete split"). For a continuous attribute, the test is X_j > c for some split point c: two branches, and the attribute may be split further in each subtree. Large discrete ranges can also be split into two or more subsets.

Expressiveness. Discrete-input, discrete-output case: decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row corresponds to a path to a leaf:

A  B  A xor B
F  F  F
F  T  T
T  F  T
T  T  F

[Figure: the corresponding tree tests A at the root and B in each subtree.]

Continuous-input, continuous-output case: decision trees can approximate any function arbitrarily closely. Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples. We need some kind of regularization to obtain more compact decision trees.
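As an illustration of the truth-table-to-tree correspondence above, a tiny sketch (my own, not from the lecture) of the XOR tree written as nested attribute tests:

    def xor_tree(a: bool, b: bool) -> bool:
        """Decision tree for A xor B: test A at the root, then B in each subtree."""
        if not a:            # A = F
            return b         # leaves: F if B = F, T if B = T
        else:                # A = T
            return not b     # leaves: T if B = F, F if B = T

    # Reproduces the truth table above:
    for a in (False, True):
        for b in (False, True):
            print(a, b, xor_tree(a, b))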

Digression: Hypothesis spaces. How many distinct decision trees are there with n Boolean attributes? This equals the number of Boolean functions of n inputs, i.e., the number of distinct truth tables with 2^n rows, which is 2^(2^n). E.g., with 6 Boolean attributes there are 18,446,744,073,709,551,616 trees. How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)? Each attribute can be in (positive), in (negative), or out, giving 3^n distinct conjunctive hypotheses.
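A quick sanity check of these counts (a throwaway calculation, not part of the lecture):

    n = 6
    num_trees = 2 ** (2 ** n)      # distinct Boolean functions of n attributes
    num_conjunctions = 3 ** n      # each attribute: positive, negative, or absent

    print(num_trees)          # 18446744073709551616
    print(num_conjunctions)   # 729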

Digression: Hypothesis spaces, continued. A more expressive hypothesis space increases the chance that the target function can be expressed, but also increases the number of hypotheses consistent with the training set; many consistent hypotheses have large test error, so we may get worse predictions. Of the 2^(2^n) hypotheses, all but an exponentially small fraction require O(2^n) bits to describe in any representation (decision trees, SVMs, English, brains). Conversely, any given hypothesis representation can provide compact descriptions for only an exponentially small fraction of functions. E.g., decision trees are bad for additive "threshold"-type functions such as "at least 6 of the 10 inputs must be true", but linear separators are good for these.

Decision tree learning. Aim: find a small tree consistent with the training examples. Idea: (recursively) choose the most significant attribute as the root of the (sub)tree.

function Decision-Tree-Learning(examples, attributes, parent_examples) returns a tree
  if examples is empty then return Plurality-Value(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return Plurality-Value(examples)
  else
    A ← argmax_{a ∈ attributes} Importance(a, examples)
    tree ← a new decision tree with root test A
    for each value v_k of A do
      exs ← {e : e ∈ examples and e.A = v_k}
      subtree ← Decision-Tree-Learning(exs, attributes − {A}, examples)
      add a branch to tree with label (A = v_k) and subtree subtree
    return tree
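A compact, runnable translation of this pseudocode into Python (my own sketch, not the lecture's code). It assumes each example is a dict of attribute values with the class stored under the key "WillWait" (as in the restaurant data), and takes Importance as a caller-supplied function, e.g. an information-gain or 0/1-loss measure like those sketched later:

    from collections import Counter

    def plurality_value(examples, target="WillWait"):
        """Most common classification among the examples."""
        return Counter(e[target] for e in examples).most_common(1)[0][0]

    def decision_tree_learning(examples, attributes, parent_examples,
                               importance, target="WillWait"):
        if not examples:
            return plurality_value(parent_examples, target)
        classes = {e[target] for e in examples}
        if len(classes) == 1:
            return classes.pop()
        if not attributes:
            return plurality_value(examples, target)
        A = max(attributes, key=lambda a: importance(a, examples))
        tree = {"test": A, "branches": {}}
        # Simplification: branch only on values of A seen in these examples
        # (the pseudocode branches on all possible values of A).
        for v in sorted({e[A] for e in examples}):
            exs = [e for e in examples if e[A] == v]
            subtree = decision_tree_learning(exs,
                                             [a for a in attributes if a != A],
                                             examples, importance, target)
            tree["branches"][v] = subtree
        return tree

The returned tree is a nested dict of {"test": attribute, "branches": {value: subtree-or-leaf-label}}, which is enough to classify a new example by walking the branches.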

Choosing an attribute: minimizing loss. Idea: if we can test only one more attribute, which one results in the smallest empirical loss on the examples that come through this branch? Compare splitting the 12 restaurant examples on Patrons (None/Some/Full) versus Type (French/Italian/Thai/Burger). Assuming we pick the majority value in each new leaf, the 0/1 loss is 0 + 0 + 2 = 2 for Patrons and 1 + 1 + 2 + 2 = 6 for Type. For continuous output, assume a least-squares fit (mean value) in each leaf and pick the attribute that minimizes total squared error across the new leaves.
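A small sketch of this 0/1-loss calculation, with the per-leaf (positive, negative) counts from the restaurant data hard-coded as assumptions:

    def zero_one_loss(split_counts):
        """split_counts: list of (positives, negatives) per child leaf.
        Each leaf predicts its majority class, so its loss is its minority count."""
        return sum(min(p, n) for p, n in split_counts)

    patrons = [(0, 2), (4, 0), (2, 4)]             # None, Some, Full
    type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]     # French, Italian, Thai, Burger

    print(zero_one_loss(patrons))  # 2
    print(zero_one_loss(type_))    # 6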

Choosing an attribute: information gain. Idea: measure the contribution the attribute makes to increasing the purity of the Y-values in each subset of examples. Comparing the same two splits, Patrons is the better choice: it gives information about the classification (equivalently, it reduces the entropy of the distribution of output values).

Information answers questions. The more clueless I am about the answer initially, the more information is contained in the answer. Scale: 1 bit = the answer to a Boolean question with prior ⟨0.5, 0.5⟩. The information in an answer when the prior is ⟨P_1, ..., P_n⟩ is H(⟨P_1, ..., P_n⟩) = −Σ_{i=1}^{n} P_i log_2 P_i (also called the entropy of the prior). Convenient notation: B(p) = H(⟨p, 1−p⟩).

Information, continued. Suppose we have p positive and n negative examples at the root; then B(p/(p+n)) bits are needed to classify a new example. E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit. An attribute splits the examples E into subsets E_k, each of which (we hope) needs less information to complete the classification. Let E_k have p_k positive and n_k negative examples; then B(p_k/(p_k+n_k)) bits are needed to classify a new example there, and the expected number of bits per example over all branches is Σ_k ((p_k + n_k)/(p + n)) B(p_k/(p_k + n_k)). For Patrons this is 0.459 bits; for Type it is (still) 1 bit. Choose the attribute that minimizes the remaining information needed.
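A minimal sketch of these formulas (function names B, remainder, information_gain are my own), checked against the Patrons/Type numbers above with the counts hard-coded as assumptions:

    from math import log2

    def B(q):
        """Entropy of a Boolean variable that is true with probability q."""
        if q in (0.0, 1.0):
            return 0.0
        return -(q * log2(q) + (1 - q) * log2(1 - q))

    def remainder(split_counts):
        """Expected bits per example after the split; split_counts = [(p_k, n_k), ...]."""
        total = sum(p + n for p, n in split_counts)
        return sum((p + n) / total * B(p / (p + n)) for p, n in split_counts)

    def information_gain(p, n, split_counts):
        return B(p / (p + n)) - remainder(split_counts)

    patrons = [(0, 2), (4, 0), (2, 4)]
    type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]
    print(round(remainder(patrons), 3))               # 0.459
    print(round(remainder(type_), 3))                 # 1.0
    print(round(information_gain(6, 6, patrons), 3))  # 0.541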

Empirical loss vs. information gain. Information gain tends to be less greedy. Suppose X_j splits a 10/6 parent into 2/0 and 8/6 leaves: the 0/1 loss is 6 for the parent and 6 for the children, so no progress; and the 0/1 loss is still 6 even with 4/0 and 6/6 leaves! The information gain of the 4/0 + 6/6 split, however, is 0.2044 bits: we have peeled off an easy category. Most experimental investigations indicate that information gain leads to better results (smaller trees, better predictive accuracy).

Results on restaurant data. The decision tree learned from the 12 examples tests Patrons? at the root (None and Some lead directly to leaves), then Hungry?, then Type?, then Fri/Sat?. [Learned-tree figure not reproduced.] It is substantially simpler than the true tree: a more complex hypothesis isn't justified by the small amount of data.

Learning curve for restaurant data. [Plot: % correct on the test set (y-axis, 0.4 to 1.0) versus training set size (x-axis, 0 to 100).]
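A sketch of how such a learning curve could be produced, using scikit-learn and a generic built-in dataset as stand-ins (the lecture's restaurant data and exact experimental protocol are not reproduced here):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    rng = np.random.default_rng(0)
    for m in range(10, len(X_train) + 1, 40):
        # Average test accuracy over several random training subsets of size m
        accs = []
        for _ in range(20):
            idx = rng.choice(len(X_train), size=m, replace=False)
            clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
            accs.append(clf.score(X_test, y_test))
        print(m, round(float(np.mean(accs)), 3))

Plotting mean accuracy against m gives a curve of the same shape as the slide's: accuracy generally rises with training set size.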

Pruning. The tree grows until it fits the data perfectly or runs out of attributes; in the non-realizable case it will severely overfit. Pruning methods trim the tree back, removing weak tests. A test is weak if its ability to separate positive and negative examples is not significantly different from that of a completely irrelevant attribute. E.g., X_j splits a 4/8 parent into 2/2 and 2/6 leaves; the gain is 0.04411 bits. How likely is it that randomly assigning the examples to leaves of the same sizes leads to a gain of at least this amount? Test the deviation from a zero-gain split using a χ² statistic, prune the leaves of nodes that fail the test, and repeat until done. The regularization parameter is the χ² significance threshold (10%, 5%, 1%, etc.).
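One standard way to carry out the significance test described above (a sketch following the usual χ²-pruning recipe; the lecture's exact statistic is assumed, not quoted). The deviation sums, over the children, the squared difference between observed and "irrelevant-attribute" expected class counts:

    from scipy.stats import chi2

    def chi_squared_deviation(split_counts):
        """split_counts: list of (p_k, n_k) per child of the node being tested."""
        p = sum(pk for pk, nk in split_counts)
        n = sum(nk for pk, nk in split_counts)
        dev = 0.0
        for pk, nk in split_counts:
            size = pk + nk
            p_hat = p * size / (p + n)   # expected positives if attribute irrelevant
            n_hat = n * size / (p + n)   # expected negatives
            dev += (pk - p_hat) ** 2 / p_hat + (nk - n_hat) ** 2 / n_hat
        return dev

    split = [(2, 2), (2, 6)]                        # the 4/8 parent example above
    dev = chi_squared_deviation(split)              # 0.75
    threshold = chi2.ppf(0.95, df=len(split) - 1)   # 5% significance level, ~3.84
    print("prune" if dev < threshold else "keep")   # -> prune

Here the deviation is well below the 5% threshold, so the test is indistinguishable from an irrelevant split and the node's leaves would be pruned.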

Optimal splits for continuous attributes. Are there infinitely many possible split points c to define the node test X_j > c? No! Moving the split point within the empty space between two observed values has no effect on information gain or empirical loss, so just use the midpoint. Moreover, only splits between examples from different classes can be optimal for information gain or empirical-loss reduction. [Figure: a number line of observed X_j values with candidate split points c_1 and c_2.]
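A minimal sketch of generating candidate split points by this rule (midpoints between consecutive observed values whose class labels differ); names are my own:

    def candidate_splits(values, labels):
        """Midpoints between consecutive distinct values where the class changes."""
        pairs = sorted(zip(values, labels))
        splits = []
        for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
            if v1 != v2 and y1 != y2:
                splits.append((v1 + v2) / 2)
        return splits

    print(candidate_splits([1.0, 2.0, 3.0, 5.0], ["No", "No", "Yes", "No"]))
    # -> [2.5, 4.0]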

Optimal splits, continued. Lemma: the information gain is minimized (zero) when each child has the same output-label distribution as the parent. [Figure: a split X_j > c whose two children mirror the parent's class distribution.]

Optimal splits, continued. Lemma: if c splits X_j between two positive examples, and the examples to the left of c are more positive than the parent distribution, then the examples to the right are more negative than the parent distribution; moving c right by one example takes both children further from the parent distribution, and the information gain necessarily increases. [Figure: the two candidate splits c_1 and c_2 on X_j.]

Gain for c_1: B(4/12) − [ (4/12) B(2/4) + (8/12) B(2/8) ] = 0.04411 bits
Gain for c_2: B(4/12) − [ (5/12) B(3/5) + (7/12) B(1/7) ] = 0.16859 bits

Optimal splits for discrete attributes. A discrete attribute may have many possible values, e.g., ResidenceState has 50 values. Intuitively, this is likely to result in overfitting; the extreme case is an identifier such as an SSN. Solutions: don't use such attributes; use them and hope for the best; group values by hand (e.g., NorthEast, South, MidWest, etc.); or find the best split into two subsets, but there are 2^K subsets!

Optimal splits for discrete attributes, continued. The following algorithm gives an optimal binary split: let p_jk be the proportion of positive examples among the examples that have X_j = x_jk; order the values x_jk by p_jk; then choose the best split point just as for continuous attributes. E.g., p_j,CA = 7/10, p_j,NY = 4/10, p_j,TX = 5/20, p_j,IL = 3/20; the ascending order is IL, TX, NY, CA, giving 3 possible split points.
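A sketch of that ordering step, with the example proportions above hard-coded; scoring each candidate split would reuse an information-gain or loss function such as the one sketched earlier:

    # Proportion of positive examples for each value of attribute X_j
    p_jk = {"CA": 7/10, "NY": 4/10, "TX": 5/20, "IL": 3/20}

    # Order the attribute's values by their positive proportion
    ordered = sorted(p_jk, key=p_jk.get)
    print(ordered)  # ['IL', 'TX', 'NY', 'CA']

    # The 3 candidate binary splits: a prefix of the ordering vs. the rest
    for i in range(1, len(ordered)):
        left, right = set(ordered[:i]), set(ordered[i:])
        print(left, "vs", right)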

Additional tricks. Blended selection criterion: information gain near the root, empirical loss near the leaves. "Oblique" splits: the node test is a linear inequality w^T x + b > 0. Full regressions at the leaves: ŷ(x) = x^T (X_l^T X_l)^{-1} X_l^T Y_l, a least-squares fit on the leaf's examples (X_l, Y_l). Cost-sensitive classification. Lopsided outputs: one class very rare (e.g., direct-mail targeting). Scaling up: disk-resident data.

Summary. "Of all the well-known learning methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining." [Hastie et al.] They have an efficient learning algorithm; handle both discrete and continuous inputs and outputs; are robust against any monotonic input transformation and against outliers; automatically ignore irrelevant features, so no separate feature selection is needed; and are usually interpretable.