Decision Trees. CS 341 Lectures 8/9 Dan Sheldon


Review: Linear Methods

- So far, we've looked at linear methods
- Linear regression: fit a line/plane/hyperplane
- Logistic regression: the decision boundary is a line/plane/hyperplane

[Figure: a fitted regression line for Y vs. X, and a two-class scatter plot in the $(x_1, x_2)$ plane with a linear decision boundary]

Review: Feature Expansions

- Non-linearity by trickery: map inputs through fixed basis functions, e.g. $x \mapsto (1, x, x^2, x^3, \dots)$ or $(x_1, x_2) \mapsto (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)$
- Drawback: not automatic; hard work for us

[Figure: a cubic polynomial fit to one-dimensional data, and a curved decision boundary in the $(x_1, x_2)$ plane]

Next Up: Non-Linear Methods

- Hypothesis class is intrinsically non-linear
- Next few topics: decision trees, neural networks, kernels
- Today: decision trees

Decision Tree Topics

- What is a decision tree?
- Learning a tree from data
- Good vs. bad trees
- Greedy top-down learning
- Entropy
- Pruning
- Geometric interpretation

What Is a Decision Tree?

- Example from R&N: wait for a table at a restaurant?

[Figure: decision tree rooted at Patrons? (None / Some / Full), with further tests on WaitEstimate? (>60 / 30-60 / 10-30 / 0-10), Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining?, and T/F leaves. Figure: Russell & Norvig]

What Is a Decision Tree?

- Internal nodes test an attribute (feature); assume discrete attributes for now
- Branches correspond to the different values of the attribute
- Leaf nodes predict a class label

[Figure: restaurant decision tree from Russell & Norvig, as above]

What Is a Decision Tree?

- Predict by following a path down the tree
- E.g., example X4 (Alt=T, Bar=F, Fri=T, Hun=T, Pat=Full, Price=$, Rain=T, Res=F, Type=Thai, Est=10-30) is routed to a T leaf: predict Wait? = T

[Figure: restaurant decision tree with X4's root-to-leaf path highlighted]
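To make "follow the path" concrete, here is a minimal Python sketch (not the lecture's code; the dict representation, the predict function, and the truncated tree fragment are all illustrative assumptions):

    # A tree is either a class label (a leaf) or a dict:
    #   {"attribute": name, "branches": {value: subtree, ...}}

    def predict(tree, example):
        """Predict by following a path down the tree to a leaf."""
        while isinstance(tree, dict):
            value = example[tree["attribute"]]  # test the attribute at this node
            tree = tree["branches"][value]      # follow the matching branch
        return tree                             # reached a leaf: the class label

    # Truncated fragment of the restaurant tree, for illustration only.
    tree = {"attribute": "Pat",
            "branches": {"None": "F",
                         "Some": "T",
                         "Full": {"attribute": "Hun",
                                  "branches": {"Yes": "T", "No": "F"}}}}

    x4 = {"Alt": "Yes", "Bar": "No", "Fri": "Yes", "Hun": "Yes", "Pat": "Full",
          "Price": "$", "Rain": "Yes", "Res": "No", "Type": "Thai", "Est": "10-30"}
    print(predict(tree, x4))  # T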

Decision Trees in the Wild

- Decision trees are common decision-making tools; you have all seen them!
- R&N: car service manuals
- Viburnum ID: http://www.hort.cornell.edu/vlb/key/index.html

[Image: Clemson cooperative extension]

Decision Trees in the Wild

- C-section risk (learned from data) [Example: X. Fern, Oregon State]

Decision Trees in the Wild

Why Decision Trees?

- Easy to understand; match human decision-making processes
- Excellent performance: one of the best out-of-the-box methods when used carefully
- Very flexible hypothesis space: can learn complex decision boundaries
- Handle messy data well: missing features, continuous and discrete attributes, etc.

How to Learn a Tree From Data

- Training data for the R&N example (ten attributes; target Wait?):

    Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    Wait?
    X1       Yes  No   No   Yes  Some  $$$    No    Yes  French   0-10   T
    X2       Yes  No   No   Yes  Full  $      No    No   Thai     30-60  F
    X3       No   Yes  No   No   Some  $      No    No   Burger   0-10   T
    X4       Yes  No   Yes  Yes  Full  $      Yes   No   Thai     10-30  T
    X5       Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    F
    X6       No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0-10   T
    X7       No   Yes  No   No   None  $      Yes   No   Burger   0-10   F
    X8       No   No   No   Yes  Some  $$     Yes   Yes  Thai     0-10   T
    X9       No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    F
    X10      Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10-30  F
    X11      No   No   No   No   None  $      No    No   Thai     0-10   F
    X12      Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30-60  T

Exercise

- Design an algorithm
- Input: a single training example
- Output: a decision tree that matches that example

    Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type  Est    Wait?
    X4       Yes  No   Yes  Yes  Full  $      Yes   No   Thai  10-30  T

Exercise (cont'd)

    Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type  Est    Wait?
    X4       Yes  No   Yes  Yes  Full  $      Yes   No   Thai  10-30  T

Exercise

- Design a tree to match all examples
- Hint: use the previous exercise

(Training data table repeated from "How to Learn a Tree From Data" above.)

A Correct Answer

- For each training example, create a path from root to leaf
- What is wrong with this?

What Is a Good Tree?

- Find the smallest tree that fits the training data
- A very hard problem: the number of distinct trees on $d$ binary attributes is at least $2^{2^d}$
- E.g., with $d = 6$ attributes: $2^{2^6} = 2^{64} = 18{,}446{,}744{,}073{,}709{,}551{,}616$
- Effectively impossible to find the best tree

Top-Down Greedy Heuristic

- Idea: find the best attribute to test at the root of the tree, then recursively apply the algorithm to each branch
- Best attribute? Ideally, one that splits the examples into subsets that are "all positive" or "all negative"

[Figure: two candidate root splits: Patrons? (None / Some / Full) vs. Type? (French / Italian / Thai / Burger)]

Base Cases

- Keep splitting until:
  - All examples have the same label
  - We run out of examples
  - We run out of attributes

Base Case I

- All examples have the same classification (e.g., true)
- Return that classification (e.g., true)

[Figure: Patrons? split (None / Some / Full) with one branch's examples all of one class]

Base Case II

- No more examples
- Return the majority class of the parent (e.g., false)

[Figure: Price? split ($ / $$ / $$$) beneath Patrons? = Full, with one value receiving no examples]

Base Case III

- No more attributes
- Return the majority class (e.g., false)

[Figure: path Patrons? = Full, Price? = $, Type? = Thai ending at Hungry? = yes with no attributes left to test]

The Algorithm

- Put it (almost) all together:

    function DTL(examples, attributes, default) returns a decision tree
        if examples is empty then return default
        else if all examples have the same classification then return the classification
        else if attributes is empty then return Mode(examples)
        else
            best <- Choose-Attribute(attributes, examples)
            tree <- a new decision tree with root test best
            for each value v_i of best do
                examples_i <- {elements of examples with best = v_i}
                subtree <- DTL(examples_i, attributes - best, Mode(examples))
                add a branch to tree with label v_i and subtree subtree
            return tree
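The pseudocode translates almost line for line into Python. A sketch, assuming examples are (attribute-dict, label) pairs and that the attribute-selection rule is passed in as a function (an entropy-based choice appears later):

    from collections import Counter

    def mode(examples):
        """Most common class label among the examples."""
        return Counter(y for _, y in examples).most_common(1)[0][0]

    def dtl(examples, attributes, default, choose_attribute):
        """Greedy top-down decision tree learning, following the DTL pseudocode."""
        if not examples:
            return default                       # Base Case II: no examples
        labels = {y for _, y in examples}
        if len(labels) == 1:
            return labels.pop()                  # Base Case I: one classification
        if not attributes:
            return mode(examples)                # Base Case III: no attributes
        best = choose_attribute(attributes, examples)
        tree = {"attribute": best, "branches": {}}
        # The pseudocode branches on every possible value of best; branching
        # only on the values observed in the data is a simplification.
        for v in {x[best] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[best] == v]
            tree["branches"][v] = dtl(subset,
                                      [a for a in attributes if a != best],
                                      mode(examples), choose_attribute)
        return tree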

Splitting Criteria

- An obvious idea is error rate: the total number of errors we would make if we stopped growing the tree now

[Figure: a (15+, 5-) node with errors = 5; Attribute 1 splits it into (13+, 1-) and (2+, 4-), Attribute 2 into (7+, 3-) and (8+, 2-)]

Splitting Criteria

- Problem: error rate fails to differentiate these two splits
- HW problem: show that the error rate does not decrease unless the children have different majority classes

[Figure: a (15+, 5-) node with errors = 5; Attribute 3 splits it into (10+, 0-) and (5+, 5-), Attribute 2 into (7+, 3-) and (8+, 2-); both splits still make 5 errors]
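The claim is easy to check numerically. A small sketch (the counts follow the figure; a child that predicts its majority class errs on its minority class):

    def split_errors(children):
        """Total errors if each child predicts its majority class.
        Each child is a (positives, negatives) pair."""
        return sum(min(pos, neg) for pos, neg in children)

    print(split_errors([(15, 5)]))          # unsplit node: 5 errors
    print(split_errors([(10, 0), (5, 5)]))  # Attribute 3: still 5 errors
    print(split_errors([(7, 3), (8, 2)]))   # Attribute 2: still 5 errors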

Entropy

- A measure of uncertainty from the field of Information Theory; Claude Shannon (1948)
- Consider a biased coin: $\Pr(X = \text{heads}) = q$ and $\Pr(X = \text{tails}) = 1 - q$
- Define the surprise of an outcome as $S(X = x) = -\log_2 \Pr(X = x)$

                    q = 0     q = 1/4   q = 1/2
    S(X = heads)    infinite  2         1
    S(X = tails)    0         0.4150    1

Entropy

- Entropy = average surprise: $H(X) = -\sum_{x} \Pr(X = x) \log_2 \Pr(X = x)$, summing over the values $x$
- For the biased coin: $B(q) := H(X) = -q \log_2 q - (1 - q) \log_2 (1 - q)$
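Both definitions translate directly into code. A minimal sketch (the function names are ours):

    import math

    def surprise(p):
        """S(X = x) = -log2 Pr(X = x); infinite for a zero-probability outcome."""
        return math.inf if p == 0 else -math.log2(p)

    def B(q):
        """Entropy, in bits, of a biased coin with Pr(heads) = q."""
        if q in (0.0, 1.0):
            return 0.0  # no uncertainty; by convention 0 log 0 = 0
        return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

    print(surprise(1 / 4), round(surprise(3 / 4), 4))  # 2.0 0.415
    print(B(0.5), round(B(0.25), 2))                   # 1.0 0.81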

Entropy

[Plot: entropy $B(q)$ of a biased coin as a function of $q$: zero at $q = 0$ and $q = 1$, with maximum value 1 at $q = 1/2$]

Entropy as Attribute Test

- Look for the split with the biggest reduction in entropy
- Information gain = entropy $-$ (average entropy of children)

[Figure: a (15+, 5-) node split into children (10+, 0-) and (5+, 5-)]

Entropy as Attribute Test

- Back to our previous example

[Figure: a (15+, 5-) node; Attribute 3 splits it into (10+, 0-) and (5+, 5-), Attribute 2 into (7+, 3-) and (8+, 2-)]

A More General Example

- A three-way split of 20 examples into children (10+, 0-), (4+, 1-), and (2+, 3-)
- Parent entropy: $B(3/4) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.81$
- Child entropies: $B(1) = 0$; $B(4/5) = -\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5} \approx 0.72$; $B(2/5) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.97$
- Average entropy of children, weighting by size ($\tfrac{10}{20} = \tfrac{1}{2}$, $\tfrac{1}{4}$, $\tfrac{1}{4}$): $\tfrac{1}{2} \cdot 0 + \tfrac{1}{4} \cdot 0.72 + \tfrac{1}{4} \cdot 0.97 \approx 0.42$
- Reduction: $0.81 - 0.42 \approx 0.39$
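The same computation in code, as a self-contained sketch (B is redefined here so the snippet stands alone):

    import math

    def B(q):
        """Entropy (bits) of a biased coin with Pr(heads) = q."""
        return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

    def information_gain(parent, children):
        """Entropy reduction of a split; parent and children are (pos, neg) counts."""
        n = sum(parent)
        avg_child = sum((pos + neg) / n * B(pos / (pos + neg)) for pos, neg in children)
        return B(parent[0] / n) - avg_child

    print(round(information_gain((15, 5), [(10, 0), (4, 1), (2, 3)]), 2))  # 0.39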

Pruning

- It is still possible to grow very large trees: overfitting!
- Pruning: first grow a full tree, then remove some nodes
- (Why not just stop growing the tree earlier?)
- Different pruning criteria:
  - R&N: chi-squared test. How different are the children from the parent?
  - An easier possibility: use a test set to see whether removing a node hurts generalization
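One way to realize the test-set idea is reduced-error pruning: route held-out examples down the tree and collapse a subtree to a single leaf whenever the leaf classifies them at least as well. A sketch using the dict tree representation from earlier; as a simplification, the candidate leaf predicts the majority label of the held-out examples reaching the node:

    from collections import Counter

    def predict(tree, example):
        while isinstance(tree, dict):
            tree = tree["branches"][example[tree["attribute"]]]
        return tree

    def prune(tree, held_out):
        """Reduced-error pruning; held_out is the list of (example, label) pairs
        that reach this node. Assumes every attribute value appearing in
        held_out has a branch."""
        if not isinstance(tree, dict) or not held_out:
            return tree
        for v in list(tree["branches"]):  # prune the children bottom-up,
            subset = [(x, y) for x, y in held_out
                      if x[tree["attribute"]] == v]
            tree["branches"][v] = prune(tree["branches"][v], subset)
        majority = Counter(y for _, y in held_out).most_common(1)[0][0]
        keep = sum(predict(tree, x) == y for x, y in held_out)
        collapse = sum(y == majority for _, y in held_out)
        return majority if collapse >= keep else tree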

Continuous Attributes

- Split on any threshold value $x$: e.g., WaitEstimate $\le x$ vs. WaitEstimate $> x$
- Select the attribute/threshold combination with the best reduction in entropy

[Figure: a (15+, 5-) node split into (13+, 1-) with WaitEstimate <= x and (2+, 4-) with WaitEstimate > x]
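For one continuous attribute, the threshold search can look like this. A sketch (names ours); a standard choice of candidates is the midpoint between each pair of adjacent distinct sorted values:

    import math

    def B(q):
        """Entropy (bits) of a biased coin with Pr(heads) = q."""
        return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

    def threshold_gain(values, labels, x):
        """Information gain of splitting on value <= x vs. value > x (labels are 0/1)."""
        left = [y for v, y in zip(values, labels) if v <= x]
        right = [y for v, y in zip(values, labels) if v > x]
        def part(ys):
            return len(ys) / len(labels) * B(sum(ys) / len(ys)) if ys else 0.0
        return B(sum(labels) / len(labels)) - part(left) - part(right)

    def best_threshold(values, labels):
        """Return the candidate threshold with the largest entropy reduction."""
        vs = sorted(set(values))
        candidates = [(a + b) / 2 for a, b in zip(vs, vs[1:])]
        return max(candidates, key=lambda x: threshold_gain(values, labels, x))

    # Toy wait-time estimates (minutes) with wait (1) / leave (0) labels.
    print(best_threshold([5, 10, 15, 40, 50, 70], [1, 1, 1, 0, 1, 0]))  # 27.5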

Miscellaneous

- Issues to think/read about on your own:
  - The entropy criterion tends to favor multi-valued attributes over binary attributes. Why? Various fixes have been proposed in the literature; can you think of one?
  - Missing attributes, e.g., in medical testing. How should these be handled during prediction? During training?

Geometric Interpretation

- Axis-parallel splits

Geometric Interpretation

- Can learn simple and complex boundaries

[Figure: two scatter plots in the $(x_1, x_2)$ plane showing a simple and a more complex axis-parallel decision boundary]

What You Should Know

- Top-down greedy learning for decision trees
- Entropy splitting criterion
- Continuous attributes