Decision Tree Learning. Dr. Xiaowei Huang

Decision Tree Learning Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/

After two weeks, reminders: You should work harder to enjoy the learning procedure. Slides: read the slides before coming to class and think them through afterwards; the slides will be adapted according to the prior classes and updated as soon as possible after class. Lab: the first assignment will be released this week. Learning experience: type in the code and try adapting it to see how the results change.

Decision trees up to now: decision tree representation, a general top-down learning algorithm, and how to do splitting on numeric features.

Top-down decision tree learning (the slide's diagram marks which parts of the algorithm were explained in the last lecture, which are the focus of this lecture, and which come in the next lecture).

Topics: Occam's razor; entropy and information gain; types of decision-tree splits.

Finding the best split How should we select the best feature to split on at each step? Key hypothesis: the simplest tree that classifies the training instances accurately will work well on previously unseen instances

Occam's razor, attributed to the 14th-century William of Ockham: "Nunquam ponenda est pluralitas sine necessitate" ("Entities should not be multiplied beyond necessity"). When you have two competing theories that make exactly the same predictions, the simpler one is the better.

But a thousand years earlier, Ptolemy had said, "We consider it a good principle to explain the phenomena by the simplest hypothesis possible."

Occam's razor and decision trees. Why is Occam's razor a reasonable heuristic for decision tree learning? There are fewer short models (i.e., small trees) than long ones, so a short model is unlikely to fit the training data well by chance, whereas a long model is more likely to fit the training data well coincidentally.

Finding the best splits. Can we find and return the smallest possible decision tree that accurately classifies the training set? This is an NP-hard problem [Hyafil & Rivest, Information Processing Letters, 1976]. Instead, we'll use an information-theoretic heuristic to greedily choose splits.

Expected Value (Finite Case). Let X be a random variable with a finite number of finite outcomes x_1, x_2, ..., x_k occurring with probabilities p_1, p_2, ..., p_k, respectively. The expectation of X is defined as E[X] = p_1 x_1 + p_2 x_2 + ... + p_k x_k. Expectation is a weighted average.

Expected Value Example. Let X represent the outcome of a roll of a fair six-sided die. The possible values for X are {1, 2, 3, 4, 5, 6}, each with probability 1/6. The expected value is E[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
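
A minimal sketch (not part of the slides) of this weighted-average view of expectation, checked on the fair-die example:

# E[X] = p_1*x_1 + ... + p_k*x_k, a probability-weighted average of the outcomes.
def expected_value(values, probs):
    return sum(p * x for x, p in zip(values, probs))

print(expected_value([1, 2, 3, 4, 5, 6], [1/6] * 6))  # 3.5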

Information theory background. Consider a problem in which you are using a code to communicate information to a receiver. Example: as bikes go past, you are communicating the manufacturer of each bike.

Information theory background. Suppose there are only four types of bikes; we could use a fixed-length code (the codeword table is shown on the slide). Expected number of bits we have to communicate: 2 bits/bike.

Information theory background. We can do better if the bike types aren't equiprobable, by assigning shorter codewords to the more probable types.

Information theory background. Expected number of bits we have to communicate with the variable-length code: the sum over bike types of P(type) × (codeword length for that type), which comes out below 2 bits/bike once the shorter codewords go to the more frequent types.

Information theory background. An optimal code uses -log_2 P(y) bits for an event with probability P(y).
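
A small sketch of the expected-bits calculation; the skewed distribution below is an illustrative assumption, not necessarily the one on the slide.

import math

def expected_bits(probs, lengths):
    # expected number of bits per symbol = sum_y P(y) * (codeword length for y)
    return sum(p * l for p, l in zip(probs, lengths))

# four equiprobable bike types with a fixed 2-bit code
print(expected_bits([0.25] * 4, [2, 2, 2, 2]))    # 2.0 bits/bike

# a skewed distribution with the idealized lengths -log2 P(y)
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [-math.log2(p) for p in probs]          # 1, 2, 3, 3
print(expected_bits(probs, lengths))              # 1.75 bits/bike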

Entropy. Entropy is a measure of the uncertainty associated with a random variable, defined as the expected number of bits required to communicate the value of the variable: H(Y) = -Σ_y P(Y=y) log_2 P(Y=y).
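
A minimal sketch (not from the slides) of this definition, estimating H(Y) from a list of observed labels; the 9-yes/5-no labels below are purely illustrative.

import math
from collections import Counter

# Entropy H(Y) = -sum_y P(Y=y) * log2 P(Y=y), with probabilities
# estimated from the observed label frequencies.
def entropy(labels):
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy(['yes'] * 9 + ['no'] * 5))  # about 0.94 bits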

Conditional entropy. Conditional entropy (or equivocation) quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known. What's the entropy of Y if we condition on some other variable X? H(Y | X) = Σ_x P(X=x) H(Y | X=x), where the outer sum over x is a weighted average (similar to an expected value) and each term H(Y | X=x) = -Σ_y P(Y=y | X=x) log_2 P(Y=y | X=x) is similar to the entropy, restricted to the instances with X=x.
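
A companion sketch, reusing the entropy helper from the sketch above; the tiny feature/label data at the end is again illustrative.

from collections import defaultdict

# Conditional entropy H(Y | X) = sum_x P(X=x) * H(Y | X=x),
# where each group holds the labels of the instances with X = x.
def conditional_entropy(xs, ys):
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

x = ['a', 'a', 'b', 'b']
y = ['yes', 'no', 'yes', 'yes']
print(conditional_entropy(x, y))  # 0.5: group 'a' has 1 bit of entropy, group 'b' has 0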

Information gain (a.k.a. mutual information). Choosing splits in ID3: select the split S that most reduces the conditional entropy of Y for training set D, i.e., the split maximizing InfoGain(D, S) = H_D(Y) - H_D(Y | S).
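
As a sketch, the gain computation is then one line on top of the entropy and conditional_entropy helpers sketched above (the function names are mine, not from the slides).

# InfoGain(D, S) = H_D(Y) - H_D(Y | S): how much the split S reduces
# the uncertainty about the class labels Y on the training data.
def information_gain(xs, ys):
    return entropy(ys) - conditional_entropy(xs, ys)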

Relations between the concepts

Information gain example

Information gain example. What's the information gain of splitting on Humidity?

Information gain example

Information gain example. H_D(Y | Humidity) = P(Humidity=high) H_D(Y | Humidity=high) + P(Humidity=normal) H_D(Y | Humidity=normal)

Information gain example

Information gain example Is it better to split on Humidity or Wind?
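
A self-contained sketch of this comparison, assuming the classic PlayTennis counts from Mitchell (9 yes / 5 no overall; Humidity=high 3+/4-, normal 6+/1-; Wind=weak 6+/2-, strong 3+/3-), which may differ from the exact table on the slides:

import math

def entropy_from_counts(counts):
    # entropy of a class distribution given per-class counts
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def info_gain(total_counts, branch_counts):
    # branch_counts: per-branch class counts after the split
    n = sum(total_counts)
    h_before = entropy_from_counts(total_counts)
    h_after = sum((sum(b) / n) * entropy_from_counts(b) for b in branch_counts)
    return h_before - h_after

print(info_gain([9, 5], [[3, 4], [6, 1]]))  # Humidity: about 0.151
print(info_gain([9, 5], [[6, 2], [3, 3]]))  # Wind: about 0.048, so Humidity is the better split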

One limitation of information gain. Information gain is biased towards tests with many outcomes. E.g., consider a feature that uniquely identifies each training instance: splitting on this feature would result in many branches, each of which is pure (has instances of only one class), giving maximal information gain!
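
A quick check of this bias, reusing the entropy and information_gain sketches above with made-up data: a feature with one distinct value per instance drives the conditional entropy to zero, so its gain equals the full label entropy.

ids = list(range(14))                 # a hypothetical feature that uniquely identifies each instance
labels = ['yes'] * 9 + ['no'] * 5     # illustrative class labels
print(information_gain(ids, labels))  # equals entropy(labels), about 0.94: maximal gain
print(entropy(labels))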

Gain ratio. To address this limitation, C4.5 uses a splitting criterion called gain ratio. Gain ratio normalizes the information gain by the entropy of the split being considered: GainRatio(D, S) = InfoGain(D, S) / H_D(S), where H_D(S) is the entropy of the partition induced by S.
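
A sketch of the normalization, reading "entropy of the split" as the entropy of the partition sizes (SplitInfo in C4.5 terminology); the helper names here are mine.

import math

def split_info(branch_sizes):
    # entropy of the partition induced by the split, based on branch sizes
    n = sum(branch_sizes)
    return -sum((s / n) * math.log2(s / n) for s in branch_sizes if s)

def gain_ratio(info_gain, branch_sizes):
    return info_gain / split_info(branch_sizes)

# a two-way split into branches of 7 and 7 instances has SplitInfo = 1 bit,
# so the gain ratio equals the information gain in this case
print(gain_ratio(0.151, [7, 7]))  # 0.151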