Decision Trees

Decision Trees. Decision trees are one of the simplest methods for supervised learning. They can be applied to both regression and classification. Example: a decision tree for deciding whether to wait for a table at a restaurant. The target WillWait can be True or False. [Figure: decision tree with root Patrons? (None/Some/Full); the Full branch tests WaitEstimate? (>60, 30-60, 10-30, 0-10), followed by tests on Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining?, each with No/Yes branches leading to Yes/No leaves.]

Decision Trees (cont.) At each node of the tree, a test is applied which sends the query sample down one of the branches of the node. This continues until the query sample arrives at a terminal or leaf node. Each leaf node is associated with a value: a class label in classification, or a numeric value in regression. The value of the leaf node reached by the query sample is returned as the output of the tree. Example: Patrons = Full, Price = Expensive, Rain = Yes, Reservation = Yes, Hungry = Yes, Fri = No, Bar = Yes, Alternate = Yes, Type = Thai, WaitEstimate = 0-10. What is WillWait?
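This traversal is straightforward to write down in code. A minimal sketch, assuming internal nodes are stored as nested dictionaries keyed by attribute name and branch value; the tree fragment below is an illustrative, simplified piece of the restaurant tree, not the exact tree from the slides:

```python
# Illustrative fragment: an internal node is {"attribute": ..., "branches": {value: subtree}};
# a leaf is just the class label (here a Boolean WillWait value).
tree = {
    "attribute": "Patrons",
    "branches": {
        "None": False,
        "Some": True,
        "Full": {"attribute": "Hungry", "branches": {"No": False, "Yes": True}},
    },
}

def predict(node, sample):
    """Send the query sample down the tree until it reaches a leaf node."""
    while isinstance(node, dict):            # still at an internal node
        value = sample[node["attribute"]]    # apply the node's test to the sample
        node = node["branches"][value]       # follow the matching branch
    return node                              # leaf value = class label (or a number in regression)

query = {"Patrons": "Some"}
print(predict(tree, query))                  # -> True
```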

The Training Set. Training set for a decision tree (each row is a sample). An attribute or feature is a characteristic of the situation that we can measure. The samples could be observations of previous decisions taken.

The Training Set (cont.) We can re-arrange the samples as follows:

x_1 = (Some, $$$, French, 0-10, ...),  y_1 = True   (sample 1)
x_2 = (Full, $, Thai, 30-60, ...),     y_2 = False  (sample 2)

i.e., each sample consists of a 10-dimensional vector x_i of discrete feature values, and a target label y_i which is Boolean.
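For concreteness, the same arrangement can be written directly in code. A small sketch, using the restaurant attribute names and showing only the two rows listed above rather than the full training set:

```python
# Each sample: a vector x_i of discrete feature values plus a Boolean label y_i.
# Only the attribute values shown on the slide are filled in; the full table has 10 attributes.
X = [
    {"Patrons": "Some", "Price": "$$$", "Type": "French", "WaitEstimate": "0-10"},   # sample 1
    {"Patrons": "Full", "Price": "$",   "Type": "Thai",   "WaitEstimate": "30-60"},  # sample 2
]
y = [True, False]   # WillWait labels y_1, y_2 for the two samples above
```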

Learning Decision Trees from Observations. Trivial solution for building decision trees: construct a path to a leaf for each sample. Merely memorising the observations (not extracting any patterns from the data). Produces a consistent but excessively complex hypothesis (recall Occam's Razor). Unlikely to generalise well to new/unseen observations. We should aim to build the smallest tree that is consistent with the observations. One way to do this is to recursively test the most important attributes first.

Learning Decision Trees from Observations (cont.) Example: test Patrons or Type first? [Figure: candidate splits on Patrons? (None/Some/Full) and on Type? (French/Italian/Thai/Burger).]

Learning Decision Trees from Observations (cont.) Example: test Patrons or Type first? An important (or good) attribute splits the samples into groups that are (ideally) all positive or all negative. Therefore Patrons is more important than Type. Testing good attributes first allows us to minimise the tree depth.

Learning Decision Trees from Observations (cont.) After the first attribute splits the samples, the remaining samples become decision tree problems themselves (or subtrees), but with fewer samples and one less attribute. This suggests a recursive approach to build decision trees.

Decision Tree Learning (DTL) Algorithm. Aim: Find the smallest tree consistent with the training samples. Idea: Recursively choose the most significant attribute as the root of the (sub)tree.

Decision Tree Learning (DTL) Algorithm (cont.) Four basic underlying ideas of the algorithm: 1. If there are some positive and some negative samples, then choose the best attribute to split them, e.g., test Patrons at the root.

Decision Tree Learning (DTL) Algorithm (cont.) 2. If the remaining samples are all positive or all negative, we have reached a leaf node. Assign the label positive (or negative) accordingly.

Decision Tree Learning (DTL) Algorithm (cont.) 3. If there are no samples left, it means that no such sample has been observed. Return a default value calculated from the majority classification at the node's parent.

Decision Tree Learning (DTL) Algorithm (cont.) 4. If there are no attributes left, but there are both positive and negative samples, it means that these samples have exactly the same feature values but different classifications. This may happen because some of the data could be incorrect, or the attributes do not give enough information to describe the situation fully (i.e., we lack other useful attributes), or the problem is truly non-deterministic, i.e., given two samples describing exactly the same conditions, we may make different decisions. Solution: call it a leaf node and assign the majority vote as the label.
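The four cases above translate almost directly into a recursive procedure. A minimal sketch, assuming samples stored as attribute-value dictionaries and a choose_attribute helper (a trivial placeholder here; an information-gain version is sketched after the Gain(A) slide). None of these names come from the slides:

```python
from collections import Counter

def majority(labels):
    """Majority classification among the samples (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def choose_attribute(samples, labels, attributes):
    """Placeholder: pick the first attribute; the information-gain version replaces this later."""
    return attributes[0]

def dtl(samples, labels, attributes, default):
    """samples: list of attribute->value dicts; labels: parallel list of class labels."""
    if not samples:                                  # case 3: no samples left
        return default                               #   fall back to the parent's majority label
    if len(set(labels)) == 1:                        # case 2: all positive or all negative
        return labels[0]
    if not attributes:                               # case 4: no attributes left
        return majority(labels)                      #   majority-vote leaf
    best = choose_attribute(samples, labels, attributes)   # case 1: split on the best attribute
    tree = {"attribute": best, "branches": {}}
    # Iterating only over values seen in the data; with the attribute's full value domain,
    # an empty subset for an unseen value is what triggers case 3.
    for value in sorted({s[best] for s in samples}):
        pairs = [(s, l) for s, l in zip(samples, labels) if s[best] == value]
        sub_samples = [s for s, _ in pairs]
        sub_labels = [l for _, l in pairs]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = dtl(sub_samples, sub_labels, remaining, majority(labels))
    return tree

# Usage with the training set sketched earlier:
# learned = dtl(X, y, ["Patrons", "Price", "Type", "WaitEstimate"], default=majority(y))
```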

Decision Tree Learning (DTL) Algorithm (cont.) Result of applying DTL to the observations: [Figure: learned tree with root Patrons? (None/Some/Full); the Full branch tests Hungry? (Yes/No), then Type? (French/Italian/Thai/Burger), then Fri/Sat? (No/Yes).] Simpler than the tree shown earlier (page 2 of the original slides).

Choosing the Best Attribute. We need a measure of good and bad for attributes. One way to do this is to compute the information content at a node, i.e., at node R,

I(R) = -\sum_{i=1}^{L} P(c_i) \log_2 P(c_i)

where {c_1, ..., c_L} are the L class labels present at the node, and P(c_i) is the probability of getting class c_i at the node. A unit of information is called a bit.

Choosing the Best Attribute (cont.) Example: At the root node of the restaurant problem, c_1 = True, c_2 = False, and there are 6 True samples and 6 False samples. Therefore

P(c_1) = (no. of samples in c_1) / (total no. of samples) = 6/(6+6) = 0.5
P(c_2) = (no. of samples in c_2) / (total no. of samples) = 6/(6+6) = 0.5
I(Root) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1 bit

In general, the amount of information will be maximal when all classes are equally likely, and minimal when the node is homogeneous (all samples have the same label). What are the maximum and minimum attainable values for the information content?
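A direct transcription of I(R) into code helps answer that question. A short sketch, assuming class counts are passed in as a simple list (the function name is this sketch's own):

```python
from math import log2

def information(counts):
    """I(R) = -sum_i P(c_i) * log2 P(c_i), computed from the class counts at a node."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)  # 0*log2(0) taken as 0

print(information([6, 6]))    # root node, 6 True / 6 False -> 1.0 bit
print(information([12, 0]))   # homogeneous node            -> 0.0 bits
```

For two classes the maximum attainable value is 1 bit (a uniform split) and the minimum is 0 bits (a homogeneous node); in general the maximum is \log_2 L.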

Choosing the Best Attribute (cont.) An attribute A will divide the samples at a node into different subsets (or child nodes) E_{A=v_1}, ..., E_{A=v_M}, where A has M distinct values {v_1, ..., v_M}. Generally each subset E_{A=v_i} will have samples of different labels, so if we go along that branch we will need an additional I(E_{A=v_i}) bits of information. Example: In the restaurant problem, at the branch Patrons = Full of the root node, we have c_1 = True with 2 samples and c_2 = False with 4 samples, therefore

I(E_{Patrons=Full}) = -\frac{1}{3} \log_2 \frac{1}{3} - \frac{2}{3} \log_2 \frac{2}{3} = 0.9183 bits

Choosing the Best Attribute (cont.) Denote by P(A = v_i) the probability that a sample follows the branch A = v_i. Example: For the restaurant problem, at the root node,

P(Patrons = Full) = (no. of samples where Patrons = Full) / (no. of samples at the root node) = 6/12 = 0.5

Choosing the Best Attribute (cont.) After testing attribute A, we need a remainder of

Remainder(A) = \sum_{i=1}^{M} P(A = v_i) I(E_{A=v_i})

bits to classify the samples. The information gain in testing attribute A is the difference between the original information content and the new information content, i.e.

Gain(A) = I(R) - Remainder(A) = I(R) - \sum_{i=1}^{M} P(A = v_i) I(E_{A=v_i})

where {E_{A=v_1}, ..., E_{A=v_M}} are the child nodes of R after testing attribute A.

Choosing the Best Attribute (cont.) The key idea behind the CHOOSE-ATTRIBUTE function in the DTL algorithm is to choose the attribute that gives the maximum information gain. Example: In the restaurant problem, at the root node,

Gain(Patrons) = 1 - \frac{2}{12}(-\log_2 1) - \frac{4}{12}(-\log_2 1) - \frac{6}{12}\left(-\frac{2}{6} \log_2 \frac{2}{6} - \frac{4}{6} \log_2 \frac{4}{6}\right) \approx 0.5409 bits.

With similar calculations, Gain(Type) = 0 (try this yourself!), confirming that Patrons is a better attribute than Type. In fact, at the root, Patrons gives the highest information gain.
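Putting Remainder(A) and Gain(A) into code gives a concrete CHOOSE-ATTRIBUTE, replacing the placeholder used in the DTL sketch earlier. A sketch under the same assumptions (the counts-based helpers and their names are this sketch's own; the last line reproduces the root-node Gain(Patrons) calculation):

```python
from collections import Counter
from math import log2

def information(counts):
    """I(R) = -sum_i P(c_i) * log2 P(c_i), from the class counts at a node (as in the earlier sketch)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def remainder(split_counts):
    """Remainder(A) = sum_i P(A = v_i) * I(E_{A=v_i}); split_counts maps branch value -> class counts."""
    total = sum(sum(c) for c in split_counts.values())
    return sum(sum(c) / total * information(c) for c in split_counts.values())

def gain(node_counts, split_counts):
    """Gain(A) = I(R) - Remainder(A)."""
    return information(node_counts) - remainder(split_counts)

def choose_attribute(samples, labels, attributes):
    """CHOOSE-ATTRIBUTE: pick the attribute with the maximum information gain."""
    def counts_for(attr):
        by_value = {}
        for s, l in zip(samples, labels):
            by_value.setdefault(s[attr], Counter())[l] += 1
        return {v: list(c.values()) for v, c in by_value.items()}
    node_counts = list(Counter(labels).values())
    return max(attributes, key=lambda a: gain(node_counts, counts_for(a)))

# Root node of the restaurant problem: 6 True / 6 False samples overall.
patrons_split = {"None": [0, 2], "Some": [4, 0], "Full": [2, 4]}  # [True, False] counts per branch
print(round(gain([6, 6], patrons_split), 4))                      # -> 0.5409 bits
```

At the root, calling choose_attribute over all the attributes would select Patrons, since it gives the highest gain there.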

Concluding Remarks. A different point of view is to regard I(node) as a measure of impurity. The more mixed the samples are at a node (i.e., equal proportions of all class labels), the higher the impurity value. On the other hand, a homogeneous node (i.e., one with samples of a single class only) will have zero impurity. The Gain(A) value can thus be viewed as the amount of reduction in impurity if we split according to A. This affords the intuitive idea that we grow trees by recursively trying to obtain leaf nodes which are as pure as possible.