Decision Tree Learning (9/7/2017)


Introduction
- Decision tree learning is a practical method for inductive inference.
- It approximates discrete-valued functions, is robust to noisy data, and is capable of learning disjunctive expressions.
- ID3 searches a completely expressive hypothesis space.
- Its inductive bias is a preference for small trees over large trees.

Outline
- Decision Tree Definition
- Appropriate Problems for Decision Tree Learning
- Decision Tree Learning Algorithm
- Overfitting Problem

Decision Tree Definition
- A decision tree classifies instances by sorting them down the tree from the root to some leaf node (see the sketch below).
- Each node in the tree specifies a test of some attribute of the instance.
- Each branch corresponds to one of the possible values of this attribute.
- Each leaf is associated with a classification.
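To make the sorting-down process concrete, here is a minimal Python sketch that walks an instance from the root to a leaf. The tuple-based node representation and the small PlayTennis fragment are illustrative assumptions, not part of the original slides.

```python
# Assumed representation: an internal node is (attribute, {value: subtree, ...}),
# a leaf is just a class label string.
def classify(tree, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(tree, tuple):            # internal node: test an attribute
        attribute, branches = tree
        tree = branches[instance[attribute]]  # follow the branch for this value
    return tree                               # leaf: the class label

# Fragment of the PlayTennis tree discussed on the slides
play_tennis_tree = (
    "Outlook", {
        "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain":     ("Wind", {"Weak": "Yes", "Strong": "No"}),
    },
)

print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```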

Decision Tree
- Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. For example, the PlayTennis tree corresponds to:
  (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Appropriate Problems for Decision Tree Learning
- Instances are represented by attribute-value pairs.
- The target function has discrete output values.
- Disjunctive descriptions may be required.
- The training data may contain errors.
- The training data may contain missing attribute values.
- Example classification problems: classifying medical patients by their disease; classifying loan applicants by their likelihood of defaulting on payments.

Decision Tree Algorithms
- Most decision tree algorithms are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees.
- ID3 is a typical example of such an algorithm.

ID3 Algorithm
- Top-down: determine which attribute should be tested first; the best attribute becomes the root.
- Greedy: the algorithm never backtracks.
- For each branch, repeat the ID3 process on the corresponding subset of examples.

ID3 Algorithm
ID3(Examples, Target_attribute, Attributes)
- Create a Root node for the tree.
- If all Examples are positive, return the single-node tree Root with label = +.
- If all Examples are negative, return the single-node tree Root with label = -.
- If Attributes is empty, return the single-node tree Root with label = the most common value of Target_attribute in Examples.
- Otherwise:
  - A ← the attribute from Attributes that best classifies Examples.
  - The decision attribute for Root ← A.
  - For each possible value vi of A:
    - Add a new tree branch below Root corresponding to the test A = vi.
    - Let Examples_vi be the subset of Examples that have value vi for A.
    - If Examples_vi is empty, then below this new branch add a leaf node with label = the most common value of Target_attribute in Examples.
    - Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A}).
- Return Root.

Which Attribute is the Best Classifier?
- Information gain measures how well a given attribute separates the training examples.
- It is used to select among candidate attributes at each step while growing the tree.

Entropy
- Entropy characterizes the purity of an arbitrary collection of examples.
- Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is
  Entropy(S) = -p_+ log2(p_+) - p_- log2(p_-)
- More generally, for c classes:
  Entropy(S) = -sum_{i=1}^{c} p_i log2(p_i)
- Example: suppose S is a collection of 14 examples of some Boolean concept, including 9 positive and 5 negative examples. Then the entropy of S relative to this Boolean classification is
  Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Information Gain
- The expected reduction in entropy caused by partitioning the examples according to attribute A:
  Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) Entropy(S_v)
  (A Python sketch of these quantities and of ID3 follows.)
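The following is a minimal Python sketch of entropy, information gain, and the ID3 recursion above. It assumes examples are dictionaries mapping attribute names to values and uses the same tuple-based tree representation as the earlier classification sketch; the function names are illustrative, not from the slides.

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the proportions of target values in S."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    gain = entropy(examples, target)
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset, target)
    return gain

def id3(examples, target, attributes):
    """Return a leaf label, or an internal node (attribute, {value: subtree})."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                 # all examples share one class
        return labels[0]
    if not attributes:                        # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    # greedily pick the attribute that best classifies the examples
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # branch only on values observed in the data, so the "empty subset" case
    # of the pseudocode does not arise in this simplified sketch
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = id3(subset, target, remaining)
    return (best, branches)
```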

Example of Information Gain
- For S = [9+, 5-] with Entropy(S) = 0.940:
  Humidity: High -> [3+, 4-], E = 0.985; Normal -> [6+, 1-], E = 0.592
  Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
  Wind: Values(Wind) = {Weak, Strong}; Weak -> [6+, 2-], E = 0.811; Strong -> [3+, 3-], E = 1.00
  Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048

ID3 Example
Training examples:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

- For the full set S = {D1, ..., D14} [9+, 5-]:
  Gain(S, Outlook) = 0.246, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029
  Outlook has the highest gain and becomes the root. Its branches partition S into:
  Sunny:    {D1, D2, D8, D9, D11} [2+, 3-] (must be split further)
  Overcast: {D3, D7, D12, D13} [4+, 0-] -> Yes
  Rain:     {D4, D5, D6, D10, D14} [3+, 2-] (must be split further, shown as "???" on the slide)
- For the Sunny branch, S_sunny = {D1, D2, D8, D9, D11} with Entropy(S_sunny) = 0.970:
  Gain(S_sunny, Humidity) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
  Gain(S_sunny, Temperature) = 0.970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.570
  Gain(S_sunny, Wind) = 0.970 - (3/5)(0.918) - (2/5)(1.0) = 0.019
  Humidity has the highest gain, so it is tested next below the Sunny branch.
  (A usage run reproducing these numbers follows.)

Hypothesis Space Search in Decision Tree Learning
- ID3 searches a space of hypotheses for one that fits the training examples.
- The hypothesis space of decision trees is a complete space of finite discrete-valued functions.
- ID3 maintains only a single current hypothesis.
- ID3 in its pure form performs no backtracking in its search.
- ID3 uses all training examples at each step in the search to make statistically based decisions about how to refine its current hypothesis.
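For concreteness, here is a short usage run on the PlayTennis table; it reuses entropy, information_gain, and id3 from the sketch above. The dataset literal is transcribed from the table, and the printed gains match the slide up to rounding.

```python
columns = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No"),
]
examples = [dict(zip(columns, row)) for row in rows]

for attribute in columns[:-1]:
    print(attribute, round(information_gain(examples, attribute, "PlayTennis"), 3))
# Outlook has the largest gain (about 0.25; the slide truncates it to 0.246),
# ahead of Humidity, Wind, and Temperature, so it becomes the root.

tree = id3(examples, "PlayTennis", columns[:-1])
print(tree)
# Learned tree (branch order may vary):
# ('Outlook', {'Overcast': 'Yes',
#              'Sunny': ('Humidity', {'High': 'No', 'Normal': 'Yes'}),
#              'Rain':  ('Wind', {'Weak': 'Yes', 'Strong': 'No'})})
```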

Inductive Bias in Decision Tree Learning
- Inductive bias of ID3: shorter trees are preferred over larger trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not.
- ID3 can only reach a locally optimal tree, not a globally optimal one.

Why Prefer Short Hypotheses?
- Occam's razor: prefer the simplest hypothesis that fits the data. (A 500-node decision tree or a 5-node decision tree?)
- It is less likely that one will find a short hypothesis that coincidentally fits the training data.

Issues in Decision Tree Learning
- Determining how deeply to grow the decision tree
- Handling training data with missing attribute values
- Handling attributes with differing costs
- Improving computational efficiency

Avoiding Overfitting the Data
- Overfitting: given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.

Avoiding Overfitting the Data
- Training examples may contain random errors or noise, e.g. the mislabeled example <Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No>.
- Some attribute may then happen to partition the examples perfectly, whereas the simpler hypothesis h will not, so the more complex tree fits the training data better.
- Approaches to avoiding overfitting in decision tree learning:
  - Stop growing the tree earlier.
  - Post-pruning: allow the tree to overfit the data and then post-prune it.
- Stopping earlier is more straightforward, but it is hard to estimate precisely when to stop growing the tree; the post-pruning approach has been found to be more successful in practice.

Reduced-Error Pruning
- Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node (see the sketch below).
- Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.

Rule Post-Pruning
- Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible.
- Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
- Prune each rule by removing any precondition whose removal improves the rule's estimated accuracy.
- Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
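Below is a minimal sketch of reduced-error pruning over the same tuple-based tree representation used in the earlier sketches. The helper names, and the choice to evaluate each subtree on the validation examples that reach it (equivalent to comparing whole-tree validation accuracy), are my own simplifications rather than the slides' exact procedure.

```python
from collections import Counter

def classify(tree, instance):
    """Follow branches until a leaf label is reached (unseen values predict None)."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches.get(instance[attribute])
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def majority_label(examples, target):
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation, target):
    """Bottom-up: collapse a subtree into a leaf labelled with the majority class of
    its training examples whenever that does not hurt accuracy on the validation
    examples reaching this node."""
    if not isinstance(tree, tuple) or not validation:
        return tree
    attribute, branches = tree
    pruned = {}
    for value, subtree in branches.items():
        train_v = [ex for ex in train if ex[attribute] == value]
        val_v = [ex for ex in validation if ex[attribute] == value]
        pruned[value] = reduced_error_prune(subtree, train_v, val_v, target) if train_v else subtree
    candidate = (attribute, pruned)
    leaf = majority_label(train, target)
    if accuracy(leaf, validation, target) >= accuracy(candidate, validation, target):
        return leaf          # pruning did not hurt validation accuracy
    return candidate
```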

Rule Post-Pruning (example)
- if (Outlook = Sunny) ∧ (Humidity = High) then PlayTennis = No
- See whether removing each precondition of this rule (or the rule as a whole) improves the overall performance.

Incorporating Continuous-Valued Attributes
- ID3 is restricted to attributes that take on a discrete set of values: the target attribute must be discrete-valued, and the tested attributes must be discrete-valued.
- The second restriction can be removed by dynamically defining new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Example:

  Temperature   40   48   60   72   80   90
  PlayTennis    No   No   Yes  Yes  Yes  No

- Two candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85. The information gain can then be computed for each candidate attribute (e.g. Temperature > 54) and the best can be selected (a sketch follows).

Handling Training Examples with Missing Attribute Values
- Assign the value that is most common among the training examples.
- Alternatively, use a missing-value estimation method.
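A minimal sketch of the dynamic discretisation just described: candidate thresholds are taken at midpoints between adjacent sorted values where the class label changes, and each induced Boolean attribute is scored by information gain. The function names and the entropy-over-labels helper are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy_of_labels(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]) if l1 != l2]

def threshold_gain(values, labels, threshold):
    """Information gain of the Boolean attribute 'value > threshold'."""
    below = [l for v, l in zip(values, labels) if v <= threshold]
    above = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return (entropy_of_labels(labels)
            - len(below) / n * entropy_of_labels(below)
            - len(above) / n * entropy_of_labels(above))

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
for t in candidate_thresholds(temperature, play_tennis):    # 54.0 and 85.0, as on the slide
    print(t, threshold_gain(temperature, play_tennis, t))   # pick the threshold with the larger gain
```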

Handling Attributes with Differing Costs
- Lower-cost attributes would be preferred.
- Assign a weight to the attribute-selection measure so that attribute cost is taken into account (a sketch follows the table below).

Training examples (the PlayTennis data extended with D15, the noisy example from the overfitting discussion):

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
D15  Sunny     Hot          Normal    Strong  No
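As one possible way to assign such a weight, the sketch below selects attributes by gain divided by cost rather than by raw gain. The slide does not specify a formula, so both the weighting and the cost table are illustrative assumptions; it reuses information_gain from the earlier sketch.

```python
# Hypothetical costs per attribute (e.g. the price of measuring each one)
costs = {"Outlook": 1.0, "Temperature": 1.0, "Humidity": 2.0, "Wind": 4.0}

def cost_weighted_best(examples, attributes, target, costs):
    """Prefer cheap attributes: maximise Gain(S, A) / Cost(A) instead of raw gain."""
    return max(attributes,
               key=lambda a: information_gain(examples, a, target) / costs[a])
```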