Supervised Learning via Decision Trees


Supervised Learning via Decision Trees Lecture 4 1

Outline
1. Learning via feature splits
2. ID3
   - Information gain
3. Extensions
   - Continuous features
   - Gain ratio
   - Ensemble learning

Decision Trees
- A sequence of decisions at choice nodes, from the root to a leaf node
- Each choice node splits on a single feature
- Can be used for classification or regression
- Explicit and easy for humans to understand
- Typically very fast at testing/prediction time
https://en.wikipedia.org/wiki/Decision_tree_learning

Weather Example 4

IRIS Example 5
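The slides do not include code for the Iris example; the following is a minimal, hypothetical sketch using scikit-learn (the train/test split and max_depth are my own choices). criterion="entropy" makes the splits use information gain, which the rest of the lecture develops.

# Hypothetical illustration of the Iris example, using scikit-learn (not from the slides).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# criterion="entropy" chooses splits by information gain.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=iris.feature_names))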

Training Issues
- Approximation: optimal tree-building is NP-complete, so training is typically greedy and top-down
- Bias vs. variance: Occam's Razor vs. splitting on something like a CC/SSN; addressed by pruning and ensemble methods
- Splitting metric: information gain, gain ratio, Gini impurity

Iterative Dichotomiser 3 (ID3)
- Invented by Ross Quinlan in 1986
- Precursor to C4.5/C5.0
- Handles categorical data only
- Greedily consumes features: subtrees cannot split again on a feature used above them
- Typically produces shallow trees

ID3: Algorithm Sketch
  If all examples are the same, return f(examples)
  If there are no more features, return f(examples)
  A = best feature
  For each distinct value of A:
      branch = ID3(attributes - {A})
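As an illustration (not code from the lecture), here is one way the sketch above can be written in Python. The splitting criterion is passed in as a score function so that "best feature" can be plugged in later (e.g., information gain); the helper names are my own.

from collections import Counter

def id3(examples, features, target, score):
    """examples: list of dicts mapping feature/target names to values.
    score(examples, feature, target) says how good a split on `feature` is."""
    labels = [ex[target] for ex in examples]
    # Base case 1: all examples have the same label.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no features left to split on -> return the majority label.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Recursive step: greedily pick the best feature and branch on each of its values.
    best = max(features, key=lambda f: score(examples, f, target))
    tree = {best: {}}
    for value in sorted(set(ex[best] for ex in examples)):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(subset, remaining, target, score)
    return tree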

Details
- Classification: "same" = same class; f(examples) = majority class
- Regression: "same" = standard deviation < ε; f(examples) = average

Recursion
- A method of programming in which a function refers to itself in order to solve a problem
- Example: ID3 calls itself to build subtrees
- Never strictly necessary (any recursive solution can be rewritten iteratively)
- In some situations, results in simpler and/or easier-to-write code
- Can often be more expensive in terms of memory and time

Example
Consider the factorial function:
n! = \prod_{k=1}^{n} k = 1 \cdot 2 \cdot 3 \cdots n

Iterative Implementation

def factorial(n):
    result = 1
    for i in range(n):
        result *= (i + 1)
    return result

Consider a Recursive Definition
0! = 1                              (base case)
n! = n \cdot (n-1)!  when n \geq 1  (recursive step)

Recursive Implementation

def factorial_r(n):
    if n == 0:
        return 1
    else:
        return n * factorial_r(n - 1)

How the Code Executes
Calling factorial_r(4) pushes one stack frame per recursive call until the base case is reached (most recent frame on top, main at the bottom of the function stack):

  factorial_r(0): return 1
  factorial_r(1): return 1 * factorial_r(0)
  factorial_r(2): return 2 * factorial_r(1)
  factorial_r(3): return 3 * factorial_r(2)
  factorial_r(4): return 4 * factorial_r(3)
  main:           print factorial_r(4)

Each frame then returns its value to the caller below it and is popped off the stack:

  factorial_r(1): return 1 * 1  -> 1
  factorial_r(2): return 2 * 1  -> 2
  factorial_r(3): return 3 * 2  -> 6
  factorial_r(4): return 4 * 6  -> 24
  main:           print 24

ID3: Algorithm Sketch
  If all examples are the same, return f(examples)      <- base case
  If there are no more features, return f(examples)     <- base case
  A = best feature                                      <- recursive step
  For each distinct value of A:
      branch = ID3(attributes - {A})

Splitting Metric: The "Best" Feature
- Classification: information gain. Goal: choose splits that move from much uncertainty to little uncertainty
- Regression: standard deviation reduction
http://www.saedsayad.com/decision_tree_reg.htm
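For the regression case, here is a small sketch of standard deviation reduction (the function name and the example numbers are hypothetical, not from the slides): the reduction is the parent's standard deviation minus the size-weighted standard deviations of the children.

import statistics

def std_dev_reduction(parent, children):
    """Parent std. dev. minus the size-weighted std. dev. of the child subsets."""
    n = len(parent)
    weighted = sum(len(c) / n * statistics.pstdev(c) for c in children)
    return statistics.pstdev(parent) - weighted

# Hypothetical regression targets: a split separating small from large values
# removes most of the spread, so its reduction is large.
parent = [10, 12, 9, 30, 28, 31]
children = [[10, 12, 9], [30, 28, 31]]
print(round(std_dev_reduction(parent, children), 2))  # ~8.5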

Shannon Entropy
- A measure of impurity or uncertainty
- Intuition: the less likely the event, the more information is transmitted when it occurs

Entropy Range (figure): examples of small vs. large entropy

Quantifying Entropy
Entropy is the expected value of information:
H(X) = E[I(X)] = \sum_i P(x_i) I(x_i)   (discrete)
H(X) = \int P(x) I(x) dx                (continuous)

Intuition for Information: what should I(X) be?
- I(X) \geq 0: information shouldn't be negative
- I(1) = 0: events that always occur communicate no information
- I(X_1, X_2) = I(X_1) + I(X_2): information from independent events is additive

Quantifying Information
I(X) = \log_b (1 / P(X)) = -\log_b P(X)
The log base sets the units: b = 2 gives bits (binary digits), b = 3 trits, b = e nats.
H(X) = -\sum_i P(x_i) \log_b P(x_i)
For entropy, base 2 gives shannons (bits).

Example: Fair Coin Toss
I(heads) = \log_2 (1 / 0.5) = \log_2 2 = 1 bit
I(tails) = \log_2 (1 / 0.5) = \log_2 2 = 1 bit
H(fair toss) = (0.5)(1) + (0.5)(1) = 1 shannon

Example: Double-Headed Coin
H(double head) = (1) \cdot I(heads) = (1) \cdot \log_2 (1 / 1) = (1)(0) = 0 shannons

Exercise: Weighted Coin Compute the entropy of a coin that will land on heads about 25% of the time, and tails the remaining 75%. 30

Answer
H(weighted toss) = (0.25) \cdot I(heads) + (0.75) \cdot I(tails)
                 = (0.25) \log_2 (1 / 0.25) + (0.75) \log_2 (1 / 0.75)
                 \approx 0.81 shannons
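A quick numeric check of the three coin examples (an illustrative sketch; the helper names are mine, not from the slides):

import math

def information(p, base=2):
    """Self-information of an event with probability p: I = -log_b(p)."""
    return -math.log(p, base)

def entropy(probs, base=2):
    """Shannon entropy of a distribution: H = -sum_i p_i log_b(p_i)."""
    return sum(p * information(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # fair coin: 1.0 shannon
print(entropy([1.0]))                   # double-headed coin: 0.0 shannons
print(round(entropy([0.25, 0.75]), 2))  # weighted coin: ~0.81 shannons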

Entropy vs. P (figure: entropy of a two-outcome variable as a function of P, maximized at P = 0.5)

Exercise
Calculate the entropy of the following data (figure: a collection of 16 green circles and 14 purple crosses).

Answer
H(data) = (16/30) I(green circle) + (14/30) I(purple cross)
        = (16/30) \log_2 (30/16) + (14/30) \log_2 (30/14)
        = 0.99679 shannons

Bounds on Entropy
H(X) \geq 0
H(X) = 0 \iff \exists x \in X : P(x) = 1
H_b(X) \leq \log_b |X|, where |X| denotes the number of elements in the range of X
H_b(X) = \log_b |X| \iff X has a uniform distribution over X

Information Gain
To use entropy as a splitting metric, we consider the information gain of an action a as the resulting change in entropy:
IG(T, a) = H(T) - H(T \mid a) = H(T) - \sum_i (|T_i| / |T|) H(T_i)
The second term is the weighted average entropy of the children.

Example Split
Parent class proportions: {16/30, 14/30}
Child 1: {4/17, 13/17}
Child 2: {12/13, 1/13}

Example Information Gain
H_1 = (4/17) \log_2 (17/4) + (13/17) \log_2 (17/13) \approx 0.79
H_2 = (12/13) \log_2 (13/12) + (1/13) \log_2 (13/1) \approx 0.39
IG = H(T) - ((17/30) H_1 + (13/30) H_2) = 0.99679 - 0.62 = 0.38 shannons
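A numeric check of this example (an illustrative sketch; the function name is mine), computing entropies from the class counts:

import math

def entropy_from_counts(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

parent = [16, 14]              # 16 green circles, 14 purple crosses
children = [[4, 13], [12, 1]]  # class counts in each child after the split

h_parent = entropy_from_counts(parent)
remainder = sum(sum(c) / sum(parent) * entropy_from_counts(c) for c in children)
print(round(h_parent, 5), round(remainder, 2), round(h_parent - remainder, 2))
# 0.99679 0.62 0.38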

Exercise
Consider the following dataset. Compute the information gain for each of the non-target attributes, and decide which attribute is the best to split on.

X  Y  Z  Class
1  1  1  A
1  1  0  A
0  0  1  B
1  0  0  B

H(C)
H(C) = -(0.5) \log_2 0.5 - (0.5) \log_2 0.5 = 1 shannon

IG(C, X)
H(C \mid X) = (3/4) [ (2/3) \log_2 (3/2) + (1/3) \log_2 (3/1) ] + (1/4) [0] = 0.689 shannons
IG(C, X) = 1 - 0.689 = 0.311 shannons

IG(C, Y)
H(C \mid Y) = (1/2) [0] + (1/2) [0] = 0 shannons
IG(C, Y) = 1 - 0 = 1 shannon

IG(C, Z)
H(C \mid Z) = (1/2) [1] + (1/2) [1] = 1 shannon
IG(C, Z) = 1 - 1 = 0 shannons

Feature Split Choice
Gains: IG(C, X) = 0.311, IG(C, Y) = 1.0, IG(C, Z) = 0.0
Split on Y: Y = 1 -> A, Y = 0 -> B
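The exercise can also be checked in code (an illustrative sketch; the helper names are mine):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target="Class"):
    base = entropy([r[target] for r in rows])
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        part = [r[target] for r in rows if r[attr] == value]
        remainder += len(part) / n * entropy(part)
    return base - remainder

data = [
    {"X": 1, "Y": 1, "Z": 1, "Class": "A"},
    {"X": 1, "Y": 1, "Z": 0, "Class": "A"},
    {"X": 0, "Y": 0, "Z": 1, "Class": "B"},
    {"X": 1, "Y": 0, "Z": 0, "Class": "B"},
]
for attr in ("X", "Y", "Z"):
    print(attr, round(information_gain(data, attr), 3))
# X 0.311, Y 1.0, Z 0.0 -> split on Y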

ID3: Algorithm Sketch (reminder)
  If all examples are the same, return f(examples)
  If there are no more features, return f(examples)
  A = best feature
  For each distinct value of A:
      branch = ID3(attributes - {A})

Example (from Machine Learning in Action)

No Surfacing  Flippers?  Fish?
Yes           Yes        Yes
Yes           Yes        Yes
Yes           No         No
No            Yes        No
No            Yes        No

0. Preliminaries
The examples are not all the same class and features remain, so we split.
H(Fish?) = 0.971

1a: No Surfacing
H(Fish? \mid No Surfacing) = 0.55
IG(Fish?, No Surfacing) = 0.42

1b: Flippers?
H(Fish? \mid Flippers?) = 0.8
IG(Fish?, Flippers?) = 0.17

2: Split on No Surfacing
Left (No Surfacing = No):
  Flippers?  Fish?
  Yes        No
  Yes        No
Right (No Surfacing = Yes):
  Flippers?  Fish?
  Yes        Yes
  Yes        Yes
  No         No
Recurse(left)

2. Left
  Flippers?  Fish?
  Yes        No
  Yes        No
All examples are the same class: return a class leaf node.

2: Split on No Surfacing (continued)
Left (No Surfacing = No): leaf "No"
Right (No Surfacing = Yes):
  Flippers?  Fish?
  Yes        Yes
  Yes        Yes
  No         No
Recurse(right)

2. Right
  Flippers?  Fish?
  Yes        Yes
  Yes        Yes
  No         No
The examples are not all the same class and one feature remains: split!

3. Split on Flippers?
Left (Flippers? = No): Fish? = No
Right (Flippers? = Yes): Fish? = Yes, Yes
Recurse(left)

3. Left
  Fish? = No
All examples are the same class: return a class leaf node.

3. Split on Flippers? (continued)
Left (Flippers? = No): leaf "No"
Right (Flippers? = Yes): Fish? = Yes, Yes
Recurse(right)

3. Right
  Fish? = Yes, Yes
All examples are the same class: return a class leaf node.

3. Split on Flippers? (complete)
Flippers? = No -> "No", Flippers? = Yes -> "Yes"
Return!

2: Split on No Surfacing (complete)
Final tree:
  No Surfacing?
    No  -> "No"
    Yes -> Flippers?
             No  -> "No"
             Yes -> "Yes"
Done!
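Putting the pieces together, here is a self-contained sketch (in the spirit of, but not copied from, the MLiA example; helper names are mine) that reproduces this tree as a nested dictionary:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    base = entropy([r[target] for r in rows])
    n = len(rows)
    remainder = sum(
        len(part) / n * entropy([r[target] for r in part])
        for part in ([r for r in rows if r[attr] == v]
                     for v in set(r[attr] for r in rows)))
    return base - remainder

def id3(rows, features, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                      # base case: pure node
        return labels[0]
    if not features:                               # base case: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, f, target))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [f for f in features if f != best], target)
                   for v in set(r[best] for r in rows)}}

fish_data = [
    {"No Surfacing": "Yes", "Flippers?": "Yes", "Fish?": "Yes"},
    {"No Surfacing": "Yes", "Flippers?": "Yes", "Fish?": "Yes"},
    {"No Surfacing": "Yes", "Flippers?": "No",  "Fish?": "No"},
    {"No Surfacing": "No",  "Flippers?": "Yes", "Fish?": "No"},
    {"No Surfacing": "No",  "Flippers?": "Yes", "Fish?": "No"},
]
print(id3(fish_data, ["No Surfacing", "Flippers?"], "Fish?"))
# {'No Surfacing': {'No': 'No', 'Yes': {'Flippers?': {'No': 'No', 'Yes': 'Yes'}}}}
# (branch order may vary)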

Additional Base Case
What should ID3 do given the following input, with no additional features upon which to split?
  Fish? = Yes, Yes, No, No, No, No
For a categorical target, take a majority vote (here, "No").

Extensions
- Generalization
- Continuous features
- Ensemble learning

Generalization
- Information gain is biased towards features with many distinct values (consider the "value" of splitting on a CC/SSN)
- Approaches to mitigate this:
  - Gain ratio divides each IG term by SplitInfo, which is large for features with many partitions (used in C4.5)
  - Several pruning techniques replace subtrees with leaves
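A sketch of the gain-ratio idea (helper names and the toy rows are mine): an ID-like feature with a unique value per row gets the maximum information gain, but its large SplitInfo pulls the gain ratio back down.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target="Class"):
    n = len(rows)
    return entropy([r[target] for r in rows]) - sum(
        len(part) / n * entropy([r[target] for r in part])
        for part in ([r for r in rows if r[attr] == v]
                     for v in set(r[attr] for r in rows)))

def gain_ratio(rows, attr, target="Class"):
    # SplitInfo is the entropy of the partition sizes themselves; it grows with
    # the number of distinct values, penalizing ID-like features.
    split_info = entropy([r[attr] for r in rows])
    return info_gain(rows, attr, target) / split_info

rows = [
    {"ID": 1, "Y": 1, "Class": "A"},
    {"ID": 2, "Y": 1, "Class": "A"},
    {"ID": 3, "Y": 0, "Class": "B"},
    {"ID": 4, "Y": 0, "Class": "B"},
]
print(info_gain(rows, "ID"), gain_ratio(rows, "ID"))  # IG = 1.0, gain ratio = 0.5
print(info_gain(rows, "Y"), gain_ratio(rows, "Y"))    # IG = 1.0, gain ratio = 1.0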

Continuous Features
- You can always discretize/bin the values yourself, but you run the risk of a suboptimal binning depending on where the feature ends up in the tree
- Simple approach: binary splits, where the left branch takes values <= a threshold
- Consider each distinct value as a candidate threshold and calculate the gain for each; this is computationally expensive for large numbers of distinct values
- C4.5 applies a penalty similar to the one for features with many distinct values
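A sketch of the simple binary-split approach for a continuous feature (the function name and the data are hypothetical): try each distinct value as a "<= t" threshold and keep the one with the highest information gain.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best '<= t' / '> t' binary split."""
    base = entropy(labels)
    n = len(labels)
    best = (None, -1.0)
    for t in sorted(set(values))[:-1]:   # the largest value leaves the right side empty
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical continuous feature with binary labels.
xs = [2.0, 3.5, 4.1, 6.0, 7.2, 8.8]
ys = ["A", "A", "A", "B", "B", "B"]
print(best_threshold(xs, ys))  # (4.1, 1.0): the clean split between 4.1 and 6.0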

Ensemble Learning
- The Random Forest algorithm is an exemplar of using multiple trees
- Each tree is trained on bootstrapped data (i.e., sampled with replacement)
- At each choice node, the candidate features are a random subset of the overall feature set
- Decisions are bagged (i.e., aggregated over many trees)
- A validation set can be used to weight each tree by its expected accuracy
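For completeness, a minimal scikit-learn sketch of the Random Forest idea described above (not from the slides; the dataset and hyperparameters are my own choices): bootstrap=True resamples the data with replacement, and max_features="sqrt" restricts each split to a random subset of features.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,     # number of bagged trees
    bootstrap=True,       # each tree sees a bootstrap sample of the data
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())  # mean 5-fold CV accuracy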

Checkup
- ML task(s)? Classification: binary or multi-class?
- Feature type(s)?
- Implicit/explicit?
- Parametric?
- Online?

Summary: ID3/Decision Trees
- Practicality: easy and generally applicable; requires no knowledge of the underlying process; very popular and easy to understand
- Efficiency: training is relatively fast (batch); testing is typically very fast
- Performance: it is possible to get stuck in suboptimal trees; methods exist to help, but the problem is hard in general