Learning Decision Trees

Learning Decision Trees Machine Learning Spring 2018 1

This lecture: Learning Decision Trees
1. Representation: What are decision trees?
2. Algorithm: Learning decision trees (the ID3 algorithm: a greedy heuristic)
3. Some extensions

History of Decision Tree Research
Full-search decision tree methods to model human concept learning: Hunt et al., 1960s (psychology).
Quinlan developed the ID3 algorithm, with the information gain heuristic, to learn expert systems from examples (late 1970s).
Breiman, Friedman and colleagues in statistics developed CART (Classification And Regression Trees).
A variety of improvements in the 1980s: coping with noise, continuous attributes, missing data, etc.
Quinlan's updated algorithms, C4.5 (1993) and C5, are more commonly used.
Boosting (or bagging) over decision trees is a very good general-purpose algorithm.

Will I play tennis today?
Features
  Outlook: {Sun, Overcast, Rain}
  Temperature: {Hot, Mild, Cool}
  Humidity: {High, Normal, Low}
  Wind: {Strong, Weak}
Labels
  Binary classification task: Y = {+, -}

Will I play tennis today?

   #  O  T  H  W  Play?
   1  S  H  H  W   -
   2  S  H  H  S   -
   3  O  H  H  W   +
   4  R  M  H  W   +
   5  R  C  N  W   +
   6  R  C  N  S   -
   7  O  C  N  S   +
   8  S  M  H  W   -
   9  S  C  N  W   +
  10  R  M  N  W   +
  11  S  M  N  S   +
  12  O  M  H  S   +
  13  O  H  N  W   +
  14  R  M  H  S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

Basic Decision Tree Learning Algorithm
Data is processed in batch (i.e., all the data is available).
Recursively build a decision tree top down. For the PlayTennis data above, one resulting tree is:

Outlook?
  Sunny -> Humidity?
             High -> No
             Normal -> Yes
  Overcast -> Yes
  Rain -> Wind?
            Strong -> No
            Weak -> Yes
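To make the tree's use concrete: classification walks from the root down, following the branch that matches each attribute value, until a leaf label is reached. Below is a minimal Python sketch of that idea; the nested-tuple representation, the tennis_tree name, and the classify function are my own illustration, not code from the slides.

# Each internal node is ("attribute", {value: subtree, ...}); each leaf is a label.
tennis_tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, example):
    # Walk down the tree, following the example's attribute values, until a leaf.
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

print(classify(tennis_tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # -> Yes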

Basic Decision Tree Algorithm: ID3
ID3(S, Attributes, Label)
Input: S is the set of examples; Label is the target attribute (the prediction); Attributes is the set of measured attributes.
1. If all examples have the same label: return a leaf node with that label. If Attributes is empty: return a leaf node with the most common label (why? for generalization at test time).
2. Otherwise:
   2.1 Create a Root node for the tree.
   2.2 Let A = the attribute in Attributes that best splits S.
   2.3 For each possible value v that A can take:
       2.3.1 Add a new tree branch corresponding to A = v.
       2.3.2 Let S_v be the subset of examples in S with A = v.
       2.3.3 If S_v is empty: add a leaf node with the most common value of Label in S (why? for generalization at test time). Else: below this branch, add the subtree ID3(S_v, Attributes - {A}, Label).
   2.4 Return the Root node.
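A compact Python sketch of this procedure (my own illustration, not the slides' code), using information gain, the splitting criterion introduced later in this lecture, as the notion of "best splits S":

import math
from collections import Counter

def entropy(labels):
    # H = -sum_k p_k log2 p_k over the empirical label distribution.
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    # Expected reduction in entropy from partitioning the examples on `attribute`.
    total = len(examples)
    after = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        after += len(subset) / total * entropy(subset)
    return entropy(labels) - after

def id3(examples, labels, attributes):
    # examples: list of {attribute: value} dicts; labels: parallel list of class labels.
    # Returns a leaf label, or an internal node (attribute, {value: subtree}).
    if len(set(labels)) == 1:            # all examples have the same label
        return labels[0]
    if not attributes:                   # no attributes left: most common label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    branches = {}
    for value in {ex[best] for ex in examples}:
        keep = [i for i, ex in enumerate(examples) if ex[best] == value]
        branches[value] = id3([examples[i] for i in keep],
                              [labels[i] for i in keep],
                              [a for a in attributes if a != best])
    return (best, branches)

The returned (attribute, {value: subtree}) nodes match the representation used in the classification sketch above. Because this sketch only creates branches for values that actually occur in S, the "S_v is empty" case never arises here; a fuller implementation would also add branches for unseen values, labeled with the most common label in S, exactly as the pseudocode says.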

How to Pick the Root Attribute
Goal: have the resulting decision tree be as small as possible (Occam's Razor).
But finding the minimal decision tree consistent with the data is NP-hard.
The recursive algorithm is a greedy heuristic search for a simple tree, and it cannot guarantee optimality.
The main decision in the algorithm is the selection of the next attribute to split on.

How to Pick the Root Attribute
Consider data with two Boolean attributes (A, B):
  <(A=0, B=0), ->: 50 examples
  <(A=0, B=1), ->: 50 examples
  <(A=1, B=0), ->: 0 examples
  <(A=1, B=1), +>: 100 examples
What should be the first attribute we select?
Splitting on A: we get purely labeled nodes (A=1 is all +, A=0 is all -).
Splitting on B: we don't get purely labeled nodes (B=1 mixes 100 + with 50 -).
What if, instead, we have <(A=1, B=0), ->: 3 examples?

How to Pick the Root Attribute
Consider the same data, but now with three negative examples at (A=1, B=0):
  <(A=0, B=0), ->: 50 examples
  <(A=0, B=1), ->: 50 examples
  <(A=1, B=0), ->: 3 examples
  <(A=1, B=1), +>: 100 examples
Which attribute should we choose?
Splitting on A: A=1 gives 100 + / 3 -, A=0 gives 0 + / 100 -.
Splitting on B: B=1 gives 100 + / 50 -, B=0 gives 0 + / 53 -.
Still A. But we need a way to quantify things.
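To preview the quantification that the next slides develop: once entropy is defined, the two splits can be compared by the weighted entropy left in their children. A small Python check (mine, not from the slides; the function names are illustrative):

import math

def node_entropy(pos, neg):
    # Binary entropy of a node containing `pos` positive and `neg` negative examples.
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def weighted_entropy(children):
    # Expected entropy after a split; children is a list of (pos, neg) counts.
    total = sum(p + n for p, n in children)
    return sum((p + n) / total * node_entropy(p, n) for p, n in children)

print(weighted_entropy([(100, 3), (0, 100)]))   # split on A: about 0.10
print(weighted_entropy([(100, 50), (0, 53)]))   # split on B: about 0.68

The split on A leaves far less uncertainty, which is exactly what information gain (introduced below) measures.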

How to Pick the Root Attribute
Goal: have the resulting decision tree be as small as possible (Occam's Razor).
The main decision in the algorithm is the selection of the next attribute for splitting the data.
We want attributes that split the instances into sets that are relatively pure in one label; this way we are closer to a leaf node.
The most popular heuristic is information gain, which originated with Quinlan's ID3 system.

Review: Entropy
Entropy (a measure of impurity) of a set of examples S with respect to binary labels is
  H(S) = -p+ log2(p+) - p- log2(p-)
where p+ is the proportion of positive examples and p- is the proportion of negative examples in S.
In general, for a discrete probability distribution over K possible labels with probabilities {p_1, p_2, ..., p_K}, the entropy is
  H = -Σ_k p_k log2(p_k)

Review: Entropy
If all examples belong to the same label, entropy = 0; if p+ = p- = 0.5, entropy = 1.
Entropy can be viewed as the number of bits required, on average, to encode the labels: if the probability of + is 0.5, a single bit is required for each label; if it is 0.8, we can use less than one bit on average.
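As a quick numerical check of these claims, a tiny Python sketch (mine, not from the slides):

import math

def binary_entropy(p_pos):
    # H(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 taken as 0.
    return -sum(p * math.log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0)

print(binary_entropy(0.5))     # 1.0   (maximum uncertainty: one full bit per label)
print(binary_entropy(0.8))     # ~0.72 (less than one bit per label)
print(binary_entropy(9 / 14))  # ~0.94 (the value used for the tennis data later on)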

Review: Entropy
High entropy means a high level of uncertainty; low entropy means little or no uncertainty.
The uniform distribution has the highest entropy.
(Plots on the slide: entropy of a binary label distribution, maximized at p+ = p- = 0.5.)

Information Gain
The information gain of an attribute A is the expected reduction in entropy caused by partitioning on this attribute:
  Gain(S, A) = Entropy(S) - Σ_{v ∈ values(A)} (|S_v| / |S|) Entropy(S_v)
where S_v is the subset of examples in which attribute A takes the value v.
The entropy of the partitioned data is calculated by weighting the entropy of each partition by its size relative to the original set.
Partitions of low entropy (imbalanced class proportions) lead to high gain.
Go back and check which of the A, B splits is better.

Will I play tennis today?
Current entropy of the data: p+ = 9/14, p- = 5/14
  H(Play?) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94

Information Gain: Outlook
Outlook = Sunny: 5 of 14 examples; p+ = 2/5, p- = 3/5; H_Sunny = 0.971
Outlook = Overcast: 4 of 14 examples; p+ = 4/4, p- = 0; H_Overcast = 0
Outlook = Rainy: 5 of 14 examples; p+ = 3/5, p- = 2/5; H_Rainy = 0.971
Expected entropy: (5/14) · 0.971 + (4/14) · 0 + (5/14) · 0.971 = 0.694
Information gain: 0.940 - 0.694 = 0.246

Information Gain: Humidity
Humidity = High: 7 of 14 examples; p+ = 3/7, p- = 4/7; H_High = 0.985
Humidity = Normal: 7 of 14 examples; p+ = 6/7, p- = 1/7; H_Normal = 0.592
Expected entropy: (7/14) · 0.985 + (7/14) · 0.592 = 0.7885
Information gain: 0.940 - 0.7885 = 0.1515

Which feature to split on?
Information gain: Outlook: 0.246, Humidity: 0.151, Wind: 0.048, Temperature: 0.029
Split on Outlook, the attribute with the highest information gain.
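These numbers are easy to reproduce. A small, self-contained Python check (mine, not from the slides) that recomputes each attribute's information gain on the 14-example table:

import math
from collections import Counter

# The 14 examples from the table: (Outlook, Temperature, Humidity, Wind, Play?).
data = [
    ("S", "H", "H", "W", "-"), ("S", "H", "H", "S", "-"), ("O", "H", "H", "W", "+"),
    ("R", "M", "H", "W", "+"), ("R", "C", "N", "W", "+"), ("R", "C", "N", "S", "-"),
    ("O", "C", "N", "S", "+"), ("S", "M", "H", "W", "-"), ("S", "C", "N", "W", "+"),
    ("R", "M", "N", "W", "+"), ("S", "M", "N", "S", "+"), ("O", "M", "H", "S", "+"),
    ("O", "H", "N", "W", "+"), ("R", "M", "H", "S", "-"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def gain(col):
    # Information gain of the attribute stored in column `col` of the table.
    labels = [row[-1] for row in data]
    expected = 0.0
    for value in {row[col] for row in data}:
        subset = [row[-1] for row in data if row[col] == value]
        expected += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - expected

for col, name in enumerate(attributes):
    print(name, round(gain(col), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (matching the slide's 0.246 / 0.029 / 0.151 / 0.048 up to intermediate rounding)

Restricting the same two functions to the five Sunny rows also reproduces the Gain(S_sunny, ·) values that appear a few slides below.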

An Illustrative Example
Gain(S, Outlook) = 0.246, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029
Outlook becomes the root of the tree.

An Illustrative Example
Outlook
  Sunny: examples 1, 2, 8, 9, 11 (2+, 3-) -> ?
  Overcast: examples 3, 7, 12, 13 (4+, 0-) -> Yes
  Rain: examples 4, 5, 6, 10, 14 (3+, 2-) -> ?
Continue to split until all examples in a leaf have the same label.

An Illustrative Example
Within the Sunny branch (examples 1, 2, 8, 9, 11; 2+, 3-; entropy .97):
  Gain(S_sunny, Humidity) = .97 - (3/5)·0 - (2/5)·0 = .97
  Gain(S_sunny, Temperature) = .97 - 0 - (2/5)·1 = .57
  Gain(S_sunny, Wind) = .97 - (2/5)·1 - (3/5)·.92 = .02

Day  Outlook  Temperature  Humidity  Wind    PlayTennis
  1  Sunny    Hot          High      Weak    No
  2  Sunny    Hot          High      Strong  No
  8  Sunny    Mild         High      Weak    No
  9  Sunny    Cool         Normal    Weak    Yes
 11  Sunny    Mild         Normal    Strong  Yes

So the Sunny branch is split on Humidity.

An Illustrative Example
Splitting the Sunny branch on Humidity and then the Rain branch on Wind gives the final tree:

Outlook
  Sunny -> Humidity
             High -> No
             Normal -> Yes
             Low -> No
  Overcast -> Yes
  Rain -> Wind
            Strong -> No
            Weak -> Yes

Summary: Learning Decision Trees
1. Representation: What are decision trees? A hierarchical data structure that represents data.
2. Algorithm: Learning decision trees. The ID3 algorithm: a greedy heuristic.
   If all the instances have the same label, create a leaf with that label.
   Otherwise, find the best attribute, split the data by the different values of that attribute, and recurse on the resulting splits.