Decision Trees. Tirgul 5


Using Decision Trees. It could be difficult to decide which pet is right for you. We'll find a nice algorithm that helps us decide what to choose without having to think about it.

Using Decision Trees. Another example: the tree helps a student decide what to do in the evening. Features: <Party? (Yes, No), Deadline (Urgent, Near, None), Lazy? (Yes, No)>. Labels: Party / Study / TV / Pub.

Definition. A decision tree is a flowchart-like structure that provides a useful way to describe a hypothesis h from a domain X to a label set Y = {0, 1, ..., k}. Given x in X, the prediction h(x) of a decision tree h corresponds to a path from the root of the tree to a leaf. Most of the time we do not distinguish between the representation (the tree) and the hypothesis. An internal node corresponds to a question, a branch corresponds to an answer, and a leaf corresponds to a label.
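
To make the root-to-leaf path concrete, here is a minimal sketch in Python (not from the slides; the nested-dict representation and the example tree for the student scenario are illustrative assumptions):

# A decision tree as nested dicts: internal node = question (attribute),
# branches = answers, leaves = labels. Illustrative only.
student_tree = {
    "attribute": "Party?",
    "branches": {
        "Yes": "Party",                                  # leaf: label
        "No": {
            "attribute": "Deadline",
            "branches": {
                "Urgent": "Study",
                "Near": {"attribute": "Lazy?",
                         "branches": {"Yes": "TV", "No": "Study"}},
                "None": "Pub",
            },
        },
    },
}

def predict(tree, example):
    """Follow a path from the root to a leaf and return the leaf's label."""
    node = tree
    while isinstance(node, dict):            # internal node: ask the question
        answer = example[node["attribute"]]  # the example's answer
        node = node["branches"][answer]      # follow the matching branch
    return node                              # leaf: the predicted label

print(predict(student_tree, {"Party?": "No", "Deadline": "Urgent", "Lazy?": "Yes"}))  # Study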

Should I study? Equivalent to: (Party == No AND Deadline == Urgent) OR (Party == No AND Deadline == Near AND Lazy == No).

Constructing the Tree. Features: Party, Deadline, Lazy. Based on these features, how do we construct the tree? The DT algorithm uses the following principle: build the tree in a greedy manner; starting at the root, choose the most informative feature at each step.

Constructing the Tree. Informative features? Choosing which feature to use next in the decision tree can be thought of as playing the game "20 Questions": at each stage, you choose the question that gives you the most information given what you know already. Thus, you would ask "Is it an animal?" before you ask "Is it a cat?".

Constructing the Tree. A "20 Questions" example: Akinator's questions vs. my character.

Constructing the Tree. The idea: quantify how much information is provided. Mathematically: information theory.

Pivot example: the weather dataset (Outlook, Temperature, Humidity, Wind), shown as a table.

Quick Aside: Information Theory

Entropy. S is a sample of training examples; p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. Entropy measures the impurity of S: Entropy(S) = -p+ log2(p+) - p- log2(p-). The smaller, the better (we define 0 log 0 = 0). In general: Entropy(p) = -Σ_i p_i log2(p_i).
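
A small helper that implements the formula above (a sketch; the function name and the class-counts interface are our own choice):

import math

def entropy(counts):
    """Entropy of a label distribution given class counts; 0*log(0) is taken as 0."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]     # skip zero counts (0 log 0 = 0)
    return -sum(p * math.log2(p) for p in probs)

print(entropy([1, 1]))   # fair coin toss: 1.0, maximal uncertainty
print(entropy([1, 0]))   # certain outcome: 0.0
print(entropy([9, 5]))   # 9 positive / 5 negative, as in the example below: ~0.940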

Entropy. Generally, entropy refers to disorder or uncertainty; entropy = 0 if the outcome is certain. E.g., consider a coin toss where the probability of heads equals the probability of tails: the entropy of the coin toss is as high as it could be. This is because there is no way to predict the outcome ahead of time: the best we can do is predict that the coin will come up heads, and our prediction will be correct with probability 1/2.

Entropy: Example. The dataset has 14 examples: 9 positive and 5 negative. Entropy(S) = -p(yes) log2 p(yes) - p(no) log2 p(no) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.409 + 0.530 = 0.939.

Information Gain. Important idea: find how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step. This is called Information Gain, defined as the entropy of the whole set minus the entropy when a particular feature is chosen.

Information Gain. The information gain is the expected reduction in entropy caused by partitioning the examples with respect to an attribute. Given S, the set of examples at the current node, an attribute A, and S_v, the subset of S for which attribute A has value v: IG(S, A) = Entropy(S) - Σ_{v in Values(A)} (|S_v| / |S|) Entropy(S_v). That is, the current entropy minus the new (weighted) entropy.
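
The same formula as a Python sketch (it assumes the entropy helper above, and takes the class counts of the parent node and of each subset S_v):

def information_gain(parent_counts, child_counts_list):
    """IG(S, A) = Entropy(S) - sum over v of (|S_v|/|S|) * Entropy(S_v)."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted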

Information Gain: Example. Attribute A: Outlook, Values(A) = {Sunny, Overcast, Rain}. Splitting S = [9+, 5-] on Outlook gives: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-].

Information Gain: Example. Entropy(S) = 0.939 and Values(Outlook) = {Sunny, Overcast, Rain}, so we need Entropy(S_sunny), Entropy(S_overcast), and Entropy(S_rain), where S = [9+, 5-] splits into Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-].

Information Gain: Example. The weighted terms are (|S_sunny|/|S|) Entropy(S_sunny) = (5/14) Entropy(S_sunny), (4/14) Entropy(S_overcast), and (5/14) Entropy(S_rain).

Information Gain: Example. (5/14) Entropy(S_sunny) = (5/14) [-(2/5) log2(2/5) - (3/5) log2(3/5)] = (5/14) * 0.970; (4/14) Entropy(S_overcast) = (4/14) [-(4/4) log2(4/4) - (0/4) log2(0/4)] = (4/14) * 0; (5/14) Entropy(S_rain) = (5/14) [-(3/5) log2(3/5) - (2/5) log2(2/5)] = (5/14) * 0.970.

Information Gain: Example. (5/14) * 0.970 = 0.346, (4/14) * 0 = 0, (5/14) * 0.970 = 0.346. Therefore IG(S, Outlook) = 0.939 - (0.346 + 0 + 0.346) = 0.247.

Information Gain: Example. Attribute A: Wind, Values(A) = {Weak, Strong}. Splitting S = [9+, 5-] on Wind gives: Weak [6+, 2-], Strong [3+, 3-].

Information Gain: Example. (8/14) Entropy(S_weak) = (8/14) [-(6/8) log2(6/8) - (2/8) log2(2/8)] = (8/14) * 0.811; (6/14) Entropy(S_strong) = (6/14) [-(3/6) log2(3/6) - (3/6) log2(3/6)] = (6/14) * 1.

Information Gain: Example. (8/14) * 0.811 = 0.463 and (6/14) * 1 = 0.428, so IG(S, Wind) = 0.939 - (0.463 + 0.428) = 0.048.
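
The two computations above can be reproduced with the helpers sketched earlier (small differences from the slides come from rounding):

print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # Outlook: ~0.247
print(information_gain([9, 5], [[6, 2], [3, 3]]))          # Wind:    ~0.048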

Information Gain. The smaller the value of Σ_v (|S_v|/|S|) Entropy(S_v) is, the larger IG becomes; |S_v|/|S| is the probability of reaching the node for value v, so IG(S, A) compares the entropy before the split with the weighted entropy after it. The ID3 algorithm computes this IG for each attribute and chooses the one that produces the highest value (greedy).

ID3 Algorithm. (Pseudocode from Artificial Intelligence: A Modern Approach.)

ID3 Algorithm. Majority Value function: returns the label of the majority of the training examples in the current subtree. Choose Attribute function: chooses the attribute that maximizes the Information Gain (measures other than IG could also be used).
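
For concreteness, here is a compact sketch of ID3 (our own rendering, not the exact pseudocode shown on the slide; examples are dicts mapping attribute names to values, and it reuses the entropy helper from earlier):

from collections import Counter

def majority_value(examples, target):
    """Label of the majority of the training examples in the current subtree."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def choose_attribute(examples, attributes, target):
    """The attribute that maximizes the information gain."""
    parent = list(Counter(e[target] for e in examples).values())
    def gain(attr):
        groups = {}
        for e in examples:
            groups.setdefault(e[attr], []).append(e[target])
        children = [list(Counter(labels).values()) for labels in groups.values()]
        return entropy(parent) - sum(sum(c) / len(examples) * entropy(c) for c in children)
    return max(attributes, key=gain)

def id3(examples, attributes, target, default=None):
    """Greedy, recursive tree growing."""
    if not examples:
        return default                                   # no examples: fall back to default
    if len({e[target] for e in examples}) == 1:
        return examples[0][target]                       # pure node: return the label
    if not attributes:
        return majority_value(examples, target)          # no attributes left: majority label
    best = choose_attribute(examples, attributes, target)
    tree = {"attribute": best, "branches": {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree["branches"][value] = id3(subset, rest, target, majority_value(examples, target))
    return tree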

Back to the example. IG(S, Outlook) = 0.247 (the maximal value); IG(S, Wind) = 0.048; IG(S, Temperature) = 0.028; IG(S, Humidity) = 0.151. So Outlook is chosen first. The tree (for now): a single split on Outlook.

Decision Tree after the first step: the root tests Outlook, and the 14 training examples are partitioned among the Sunny, Overcast, and Rain branches.

Decision Tree after the first step: the Overcast branch contains only positive examples, so it immediately becomes a Yes leaf.

The second step: the Sunny branch holds [2+, 3-]. Splitting it on Temperature gives Hot [0+, 2-], Mild [1+, 1-], Cool [1+, 0-]; splitting on Humidity gives High [0+, 3-], Normal [2+, 0-] (both subsets have entropy 0, so its IG is maximal); splitting on Wind gives Weak [1+, 2-], Strong [1+, 1-].

Decision Tree after the second step: the Sunny branch splits on Humidity.

Next: the Rain branch is split in the same way (on Wind).

Final decision tree: Outlook at the root; Sunny -> Humidity (High: False, Normal: True); Overcast -> True; Rain -> Wind (Weak: True, Strong: False).

Minimal Description Length. The attribute we choose is the one with the highest information gain, i.e., we minimize the amount of information that is left. Thus, the algorithm is biased towards smaller trees, consistent with the well-known principle that short solutions are usually better than longer ones.

Minimal Description Length. MDL: the shortest description of something, i.e., the most compressed one, is the best description. A.k.a. Occam's Razor: among competing hypotheses, the one with the fewest assumptions should be selected.

New data: <Rain, Mild, High, Weak> with label False. Following the tree (Outlook: Sunny -> Humidity (High: False, Normal: True); Overcast -> True; Rain -> Wind (Weak: True, Strong: False)), the prediction is True. Wait, what!?

Overfitting. Learning was performed for too long, or the learner adjusts to very specific random features of the training data that have no causal relation to the target function.

Overfitting

Overfitting. The performance on the training examples still increases, while the performance on unseen data becomes worse.

Pruning

ID3 for real-valued features. Until now, we assumed that the splitting rules are of the form I[x_i = 1]. For real-valued features, use threshold-based splitting rules: I[x_i < θ]. (I(boolean expression) is the indicator function: it equals 1 if the expression is true and 0 otherwise.)
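
One common way to pick θ (a sketch under our own assumptions, not necessarily the rule used in the course): sort the feature values and try the midpoints between consecutive distinct values, keeping the threshold with the highest information gain. It reuses the information_gain helper from earlier.

from collections import Counter

def best_threshold(values, labels):
    """Return (theta, gain) for the best split I[x_i < theta] of one real-valued feature."""
    parent = list(Counter(labels).values())
    pairs = sorted(zip(values, labels))
    best_theta, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                    # identical values: no threshold here
        theta = (pairs[i - 1][0] + pairs[i][0]) / 2     # midpoint between consecutive values
        left = [l for v, l in pairs if v < theta]
        right = [l for v, l in pairs if v >= theta]
        gain = information_gain(parent, [list(Counter(left).values()),
                                         list(Counter(right).values())])
        if gain > best_gain:
            best_theta, best_gain = theta, gain
    return best_theta, best_gain

print(best_threshold([1.0, 2.0, 3.0, 4.0], ["no", "no", "yes", "yes"]))  # (2.5, 1.0)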

Random Forest

Random Forest. A collection of decision trees. Prediction: a majority vote over the predictions of the individual trees. Constructing the trees: take a random subsample (of size m) from S, with replacement; construct a sequence I_1, I_2, ..., where I_t is a subset of the attributes (of size k); the algorithm grows a DT (using ID3) based on the sample, and at each splitting stage it chooses the feature that maximizes IG from I_t.
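
A rough sketch of this procedure (our own simplification: the attribute subset I_t is drawn once per tree rather than re-drawn at every split; it reuses the id3, majority_value, and predict sketches from earlier):

import random
from collections import Counter

def random_forest(examples, attributes, target, n_trees=100, k=2):
    """Grow n_trees trees, each on a bootstrap sample with a random attribute subset."""
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(examples) for _ in examples]         # sample with replacement
        subset = random.sample(attributes, min(k, len(attributes)))  # I_t: k random attributes
        forest.append(id3(sample, subset, target,
                          default=majority_value(examples, target)))
    return forest

def forest_predict(forest, example):
    """Majority vote over the individual trees' predictions."""
    votes = [predict(tree, example) for tree in forest]
    return Counter(votes).most_common(1)[0][0]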

Weka: http://www.cs.waikato.ac.nz/ml/weka/

Summary. Intro: Decision Trees. Constructing the tree: Information Theory (Entropy, IG). The ID3 Algorithm. MDL. Overfitting and Pruning. ID3 for real-valued features. Random Forests. Weka.