
Decision Trees

Decision Tree for Playing Tennis. Prediction is done by sending the example down the tree till a class assignment is reached.

Definitions. Internal nodes: each tests a feature, and we branch according to the feature's values. For discrete features the branching is naturally defined; for continuous features we branch by comparing to a threshold. Leaf nodes: each assigns a classification.
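To make this vocabulary concrete, here is a tiny hand-built tree in Python. This is a sketch of my own rather than code from the lecture; the dictionary representation and the standard PlayTennis feature names are assumptions for illustration. Prediction sends an example down the tree, branching on each tested feature's value, until a leaf's class label is reached.

# Internal nodes test a feature; branches correspond to feature values; leaves hold a class label.
tree = {"test": "Outlook",
        "children": {"Sunny":    {"test": "Humidity",
                                  "children": {"High": {"leaf": "No"},
                                               "Normal": {"leaf": "Yes"}}},
                     "Overcast": {"leaf": "Yes"},
                     "Rain":     {"test": "Wind",
                                  "children": {"Strong": {"leaf": "No"},
                                               "Weak": {"leaf": "Yes"}}}}}

def predict(node, example):
    """Send the example down the tree until a class assignment (leaf) is reached."""
    while "leaf" not in node:
        node = node["children"][example[node["test"]]]
    return node["leaf"]

print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes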

Decision Tree Decision Boundaries. Decision trees divide the feature space into axis-parallel rectangles and label each rectangle with one of the K classes.

Hypothesis Space of Decision Trees. Decision trees provide a very popular and efficient hypothesis space. They deal with both discrete and continuous features. The hypothesis space has variable size: as the number of nodes or the depth of the tree increases, the hypothesis space grows. A depth-1 decision stump can represent any Boolean function of one feature. Depth 2: any Boolean function of two features, and some Boolean functions involving three features. In general, decision trees can represent any Boolean function.

Decision Trees Can Represent Any Boolean Function (e.g., OR). If a target Boolean function has n inputs, there always exists a decision tree representing that target function. However, in the worst case, exponentially many nodes will be needed. Why? There are 2^n possible inputs to the function, and in the worst case we need one leaf node to represent each possible input.

Learning Decision Trees. Goal: find a decision tree h that achieves minimum misclassification error on the training data. A trivial solution: just create a decision tree with one path from root to leaf for each training example. Bug: such a tree would just memorize the training data; it would not generalize to new data points. Solution 2: find the smallest tree h that minimizes error. Bug: this is NP-hard.

Top-down Induction of Decision Trees. There are different ways to construct trees from data. We will focus on the top-down, greedy search approach. Basic idea: 1. Choose the best feature a* for the root of the tree. 2. Separate the training set S into subsets {S_1, S_2, ..., S_k}, where each subset S_i contains examples having the same value for a*. 3. Recursively apply the algorithm on each new subset until all examples have the same class label.
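The following is a minimal Python sketch of this top-down, greedy procedure, assuming examples are (feature_dict, label) pairs and that choose_best_feature implements whichever selection criterion is used (classification error or information gain, discussed next). It is an illustration, not the lecture's code.

from collections import Counter

def build_tree(examples, features, choose_best_feature):
    """Top-down, greedy decision tree induction (ID3-style sketch).

    examples: list of (feature_dict, label) pairs
    features: list of feature names still available for testing
    choose_best_feature: callable(examples, features) -> feature name
    """
    labels = [y for _, y in examples]
    # Stop when all examples share one class label (or no features remain).
    if len(set(labels)) == 1 or not features:
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    # 1. Choose the best feature a* for the root of this subtree.
    best = choose_best_feature(examples, features)

    # 2. Separate the examples into subsets, one per value of a*.
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[best], []).append((x, y))

    # 3. Recursively apply the algorithm on each subset.
    remaining = [f for f in features if f != best]
    return {"test": best,
            "children": {v: build_tree(s, remaining, choose_best_feature)
                         for v, s in subsets.items()}}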

Choosing a Feature Based on Classification Error. Perform a 1-step look-ahead search and choose the attribute that gives the lowest error rate on the training data.

Unfortunately, this measure does not always work well, because it does not detect cases where we are making progress toward a good tree. [Figure: a node with 21 positive and 10 negative examples is split into children with 11+/9- and 10+/1-; the training error is 10 both before and after the split, even though the second child is nearly pure. A further split of the 11+/9- child into 11+/1- and 0+/8- then reduces the error sharply, so the first split was real progress that classification error failed to detect.]

Entropy. Let X be a random variable with the following probability distribution: P(X = 0) = 0.2, P(X = 1) = 0.8. The entropy of X, denoted H(X), is defined as H(X) = -Σ_x P(X = x) log2 P(X = x). Entropy measures the uncertainty of a random variable: the larger the entropy, the more uncertain we are about the value of X. If P(X = 0) = 0 or 1, there is no uncertainty about the value of X, and the entropy is 0. If P(X = 0) = P(X = 1) = 0.5, the uncertainty is maximized and the entropy is 1.
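As a quick sanity check of this definition, here is a small Python helper (my own sketch, not part of the slides); for the distribution above it returns about 0.72 bits, and exactly 1 bit for the uniform 0.5/0.5 case.

import math

def entropy(probs):
    """H(X) = -sum_x P(x) * log2 P(x); terms with P(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.2, 0.8]))  # ~0.722 bits: mild uncertainty
print(entropy([0.5, 0.5]))  # 1.0 bit: maximum uncertainty for a binary variable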

Entropy. Entropy is a concave (downward) function.

More About Entropy. Joint entropy: H(X, Y) = -Σ_x Σ_y P(x, y) log2 P(x, y). Conditional entropy: H(Y | X) = Σ_x P(X = x) H(Y | X = x) = -Σ_x P(x) Σ_y P(y | x) log2 P(y | x), the average surprise of Y when we know the value of X. Entropy is additive: H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y).

Mutual Information. The mutual information between two random variables X and Y is defined as I(X; Y) = H(Y) - H(Y | X) = H(X) - H(X | Y). It is the amount of information that we learn about Y by knowing X, and also the amount of information that we learn about X by knowing Y. Consider the class Y of each example and the outcome of a test X_j to be random variables. The mutual information quantifies how much information about Y can be gained by testing on X_j.

Example. A node contains 21 positive and 10 negative examples, so H(Y) = 0.907. A candidate test splits it into children with 11+/9- and 10+/1-: P(left) = 0.645 with entropy 0.993, and P(right) = 0.355 with entropy 0.439. The expected remaining entropy is 0.645 × 0.993 + 0.355 × 0.439 = 0.796, so the mutual information is 0.907 - 0.796 = 0.111.

Choosing the Best Feature. Choose the feature j that has the highest mutual information with Y, often referred to as the information gain criterion: j* = arg max_j I(Y; X_j) = arg max_j [H(Y) - H(Y | X_j)] = arg min_j H(Y | X_j). Define J_j to be the expected remaining uncertainty about Y after testing x_j: J_j = H(Y | X_j) = Σ_x P(X_j = x) H(Y | X_j = x).
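Here is a short sketch of this computation (the function names are mine): H over class counts, the expected remaining entropy J_j over a candidate split, and their difference, the information gain. Run on the 21+/10- node and the 11+/9- versus 10+/1- split from the example slide, it reproduces 0.907, 0.796 and 0.111.

import math

def H(counts):
    """Entropy of a class-count vector, in bits."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def expected_remaining_entropy(children):
    """J_j = sum_x P(x_j = x) * H(Y | x_j = x); children given as count vectors."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * H(c) for c in children)

root = [21, 10]                      # 21 positive, 10 negative examples
children = [[11, 9], [10, 1]]        # counts after the candidate test
J = expected_remaining_entropy(children)
print(H(root))                       # ~0.907
print(J)                             # ~0.796
print(H(root) - J)                   # information gain I(Y; X_j) ~ 0.111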

Choosing the Best Feature: A General View. Benefit of split = US - [P_L × US_L + P_R × US_R], where US is the uncertainty score (impurity) at the node, P_L and P_R are the fractions of examples sent to the left and right children, and US_L and US_R are the children's uncertainty scores. The bracketed term is the expected remaining uncertainty. Measures of uncertainty (impurity): classification error, entropy, Gini index.
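To illustrate the general view, this sketch (again my own, with assumed helper names) plugs the three impurity measures into the same benefit-of-split formula. On the 21+/10- node split into 11+/9- and 10+/1-, classification error reports no benefit while entropy and the Gini index both detect the progress, which is exactly the weakness of the error criterion noted earlier.

import math

def error(counts):
    """Misclassification-error impurity: fraction not in the majority class."""
    return 1 - max(counts) / sum(counts)

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def benefit_of_split(parent, children, impurity):
    """US - [P_L * US_L + P_R * US_R], generalized to any number of children."""
    n = sum(parent)
    remaining = sum(sum(c) / n * impurity(c) for c in children)
    return impurity(parent) - remaining

parent, children = [21, 10], [[11, 9], [10, 1]]
for measure in (error, entropy, gini):
    print(measure.__name__, round(benefit_of_split(parent, children, measure), 3))
# error ~0, entropy ~0.111, gini ~0.059 -- error misses the progress made by the split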

Example

Selecting the root test using information gain. The full training set has 9 positive and 5 negative examples. Splitting on Humidity: High gives 3+/4-, Normal gives 6+/1-. Splitting on Outlook: Sunny gives 2+/3-, Overcast gives 4+/0-, Rain gives 3+/2-.
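Plugging these class counts into the information-gain computation (a small self-contained check of my own) shows why Outlook is preferred as the root: its gain is about 0.247 bits versus about 0.152 bits for Humidity.

import math

def H(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    return H(parent) - sum(sum(c) / n * H(c) for c in children)

root = [9, 5]  # 9 positive, 5 negative training examples
print(info_gain(root, [[3, 4], [6, 1]]))           # Humidity (High, Normal)        ~0.152
print(info_gain(root, [[2, 3], [4, 0], [3, 2]]))   # Outlook (Sunny, Overcast, Rain) ~0.247
# Outlook has the higher gain, so it is placed at the root.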

Continue building the tree. With Outlook at the root (9+/5- overall): the Sunny branch receives 2+/3-, the Overcast branch receives 4+/0- and becomes a Yes leaf, and the Rain branch receives 3+/2-. Which test should be placed at the Sunny node? Splitting the Sunny subset (2+/3-) on Humidity: High gives 0+/3-, Normal gives 2+/0-.

Multinomial Features. For discrete features with multiple values. Method 1: construct a multiway split. Information gain will tend to prefer multiway splits; to avoid this bias, we can rescale the information gain (the gain ratio), choosing arg max_j I(Y; X_j) / H(X_j). Method 2: test one value versus the others. Each value leads to one possible test; compare all the different tests to pick the one with the highest information gain.
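A brief sketch of the rescaling (the gain ratio, as used in C4.5; the function names are mine): divide the information gain by the entropy of the partition sizes themselves, so that splits with many branches pay a penalty.

import math

def H(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    return H(parent) - sum(sum(c) / n * H(c) for c in children)

def gain_ratio(parent, children):
    split_info = H([sum(c) for c in children])   # entropy of the split itself
    return info_gain(parent, children) / split_info

root = [9, 5]
print(gain_ratio(root, [[3, 4], [6, 1]]))          # Humidity: gain ~0.152 / split info 1.0   ~0.152
print(gain_ratio(root, [[2, 3], [4, 0], [3, 2]]))  # Outlook:  gain ~0.247 / split info ~1.577 ~0.156
# The three-way Outlook split is divided by a larger split entropy than the
# two-way Humidity split, reducing the bias toward many-valued features.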

Continuous Features. Test against a threshold. How do we compute the best threshold θ_j for feature x_j? Sort the examples according to x_j, move the threshold θ from the smallest to the largest value, and select the θ that gives the best information gain. We only need to compute the information gain where the class label changes.

Continuous Features. Example: the information gain for threshold θ = 1.2 is 0.2294. The information gain only needs to be computed where the class label changes.
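Here is a minimal sketch of the threshold search (my own code, run on a small made-up dataset rather than the one behind the 0.2294 example): sort by the feature, consider candidate thresholds only where the class label changes, and keep the threshold with the best information gain.

import math

def H(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def best_threshold(values, labels):
    """Return (threshold, information gain) for a single continuous feature."""
    data = sorted(zip(values, labels))
    classes = sorted(set(labels))
    total = [sum(1 for _, y in data if y == c) for c in classes]
    best = (None, -1.0)
    left = [0] * len(classes)
    for i in range(len(data) - 1):
        left[classes.index(data[i][1])] += 1
        # Only candidate thresholds where the class label changes matter.
        if data[i][1] == data[i + 1][1]:
            continue
        right = [t - l for t, l in zip(total, left)]
        n = len(data)
        remaining = (i + 1) / n * H(left) + (n - i - 1) / n * H(right)
        gain = H(total) - remaining
        theta = (data[i][0] + data[i + 1][0]) / 2   # midpoint threshold
        if gain > best[1]:
            best = (theta, gain)
    return best

print(best_threshold([0.5, 1.0, 1.4, 2.0, 2.5], ["-", "-", "+", "+", "-"]))  # -> (1.2, ~0.42) on this toy data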


Over-fitting. A decision tree has a very flexible hypothesis space: as the number of nodes increases, we can represent arbitrarily complex decision boundaries, so the training set error can always be driven to zero. This can lead to over-fitting: a few examples may be just noise, but the tree is grown larger to capture them.

Over-fitting

Avoiding Overfitting. Early stopping: stop growing the tree when a data split does not differ significantly from a random split. Post-pruning: separate the training data into a training set and a validation set; evaluate the impact on the validation set of pruning each possible node; greedily remove the node whose removal most improves validation-set performance.
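Below is a compact sketch of the post-pruning loop just described (reduced-error pruning). The dictionary tree format matches the earlier build_tree sketch and, as before, this is my illustration rather than the lecture's code. Each internal node is tentatively replaced by a leaf predicting the majority label beneath it, the replacement that most improves validation accuracy is kept, and the loop stops when no replacement helps.

from collections import Counter
import copy

def classify(tree, x):
    """Send an example down the tree until a class assignment is reached."""
    while "leaf" not in tree:
        tree = tree["children"][x[tree["test"]]]
    return tree["leaf"]

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def leaf_labels(tree):
    if "leaf" in tree:
        return [tree["leaf"]]
    return [lab for child in tree["children"].values() for lab in leaf_labels(child)]

def internal_nodes(tree):
    """Yield every internal (non-leaf) node of the tree."""
    if "leaf" not in tree:
        yield tree
        for child in tree["children"].values():
            yield from internal_nodes(child)

def reduced_error_pruning(tree, validation):
    """Greedily replace internal nodes by leaves while validation accuracy improves."""
    tree = copy.deepcopy(tree)
    improved = True
    while improved:
        improved = False
        best_node, best_acc = None, accuracy(tree, validation)
        for node in list(internal_nodes(tree)):
            saved = dict(node)                       # remember the subtree
            # Majority of the leaf labels below, as a proxy for the majority class here.
            majority = Counter(leaf_labels(node)).most_common(1)[0][0]
            node.clear(); node["leaf"] = majority    # tentatively prune in place
            acc = accuracy(tree, validation)
            node.clear(); node.update(saved)         # undo the tentative prune
            if acc > best_acc:
                best_node, best_acc = node, acc
        if best_node is not None:
            majority = Counter(leaf_labels(best_node)).most_common(1)[0][0]
            best_node.clear(); best_node["leaf"] = majority
            improved = True
    return tree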

Effect of Pruning

Decision Tree Summary. Decision trees are one of the most popular machine learning tools: easy to understand, easy to implement, easy to use, and computationally cheap. Information gain is used to select features (ID3, C4.5). Decision trees over-fit: the training error can always reach zero. To avoid overfitting: early stopping or pruning.