Decision Tree Learning


Based on Machine Learning, T. Mitchell, McGraw-Hill, 1997, ch. 3. Acknowledgement: the present slides are an adaptation of slides drawn by T. Mitchell.

PLAN
1. Concept learning: an example
2. Decision tree representation
3. ID3 learning algorithm
4. Statistical measures in decision tree learning: Entropy, Information gain
5. Issues in DT learning:
   5.1 Inductive bias in ID3
   5.2 Avoiding overfitting of data
   5.3 Incorporating continuous-valued attributes
   5.4 Alternative measures for selecting attributes
   5.5 Handling training examples with missing attribute values
   5.6 Handling attributes with different costs

1. Concept learning: an example

Given the data:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

predict the value of PlayTennis for Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.

2. Decision tree representation

Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.

Example: decision tree for PlayTennis

Outlook
├── Sunny    → Humidity
│              ├── High   → No
│              └── Normal → Yes
├── Overcast → Yes
└── Rain     → Wind
               ├── Strong → No
               └── Weak   → Yes

Another example: a tree to predict C-section risk, learned from medical records of 1000 women. Negative examples are C-sections.

[833+,167-]  .83+ .17-
Fetal_Presentation = 1: [822+,116-]  .88+ .12-
    Previous_Csection = 0: [767+,81-]  .90+ .10-
        Primiparous = 0: [399+,13-]  .97+ .03-
        Primiparous = 1: [368+,68-]  .84+ .16-
            Fetal_Distress = 0: [334+,47-]  .88+ .12-
                Birth_Weight < 3349:  [201+,10.6-]  .95+ .05-
                Birth_Weight >= 3349: [133+,36.4-]  .78+ .22-
            Fetal_Distress = 1: [34+,21-]  .62+ .38-
    Previous_Csection = 1: [55+,35-]  .61+ .39-
Fetal_Presentation = 2: [3+,29-]  .11+ .89-
Fetal_Presentation = 3: [8+,22-]  .27+ .73-

When to Consider Decision Trees

- Instances describable by attribute-value pairs
- Target function is discrete valued
- Disjunctive hypothesis may be required
- Possibly noisy training data

Examples:
- Equipment or medical diagnosis
- Credit risk analysis
- Modeling calendar scheduling preferences

3. ID3 Algorithm: Top-Down Induction of Decision Trees

START: create the root node; assign all examples to the root.
Main loop:
1. A ← the best decision attribute for the next node
2. For each value of A, create a new descendant of the node
3. Sort training examples to the leaf nodes
4. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
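The loop above can be sketched as a short recursive Python function over the PlayTennis data from the first slide. This is a minimal illustration, not Mitchell's reference implementation; the helper names (`entropy`, `gain`, `id3`) and the tuple-based data layout are ours.

```python
import math
from collections import Counter

# PlayTennis training data: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    i = ATTRS[attr]
    g = entropy([e[-1] for e in examples])
    for v in set(e[i] for e in examples):
        sub = [e[-1] for e in examples if e[i] == v]
        g -= len(sub) / len(examples) * entropy(sub)
    return g

def id3(examples, attrs):
    labels = [e[-1] for e in examples]
    if len(set(labels)) == 1:              # examples perfectly classified: STOP
        return labels[0]
    if not attrs:                          # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))   # step 1
    i = ATTRS[best]
    # steps 2-4: one descendant per value of `best`, recurse on sorted examples
    return {best: {v: id3([e for e in examples if e[i] == v],
                          [a for a in attrs if a != best])
                   for v in set(e[i] for e in examples)}}

tree = id3(DATA, list(ATTRS))
```

On this data the sketch reproduces the tree shown on the earlier slide: Outlook at the root, Humidity under Sunny, Wind under Rain, and Yes at Overcast.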

4. Statistical measures in DT learning: Entropy, Information Gain

Which attribute is best?

A1=? splits [29+,35-] into t: [21+,5-] and f: [8+,30-]
A2=? splits [29+,35-] into t: [18+,33-] and f: [11+,2-]

Entropy

Let S be a sample of training examples:
- p+ is the proportion of positive examples in S
- p- is the proportion of negative examples in S

Entropy measures the impurity of S.

Information theory: Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code). The optimal-length code for a message having probability p is -log2(p) bits. So:

Entropy(S) = p+ (-log2 p+) + p- (-log2 p-) = -p+ log2 p+ - p- log2 p-
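The formula above can be sketched as a small Python function (a minimal illustration; the function name and signature are ours, and 0·log2(0) is taken to be 0 as usual):

```python
import math

def entropy(pos, neg):
    """Entropy of a boolean-labelled sample with `pos` positive
    and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:                 # skip empty classes: 0 * log2(0) := 0
            p = count / total
            h -= p * math.log2(p)
    return h
```

For the PlayTennis sample S with [9+,5-], `entropy(9, 5)` gives roughly 0.940; a pure sample has entropy 0 and an even split has entropy 1.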

[Figure: Entropy(S) = -p+ log2 p+ - p- log2 p- plotted as a function of p+. The curve is 0 at p+ = 0 and p+ = 1, and reaches its maximum of 1.0 at p+ = 0.5.]

Information Gain

Gain(S, A) = expected reduction in entropy due to sorting on A:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

A1=? splits [29+,35-] into t: [21+,5-] and f: [8+,30-]
A2=? splits [29+,35-] into t: [18+,33-] and f: [11+,2-]

Selecting the Next Attribute

Which attribute is the best classifier?

S: [9+,5-], E = 0.940

Humidity: High [3+,4-] (E = 0.985), Normal [6+,1-] (E = 0.592)
Wind:     Weak [6+,2-] (E = 0.811), Strong [3+,3-] (E = 1.00)

Gain(S, Humidity) = .940 - (7/14)(.985) - (7/14)(.592) = .151
Gain(S, Wind)     = .940 - (8/14)(.811) - (6/14)(1.0)  = .048
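These two gains can be recomputed directly from the data. A minimal sketch, using only the Humidity, Wind, and PlayTennis columns of the 14 training examples (the dict-based layout is ours); note the slide's .151 reflects rounding of the intermediate entropies, the unrounded value is about .152.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(v) / n * math.log2(labels.count(v) / n)
                for v in set(labels))

def gain(examples, attr):
    """Information gain of splitting `examples` (list of dicts) on `attr`."""
    labels = [e["PlayTennis"] for e in examples]
    g = entropy(labels)
    for v in set(e[attr] for e in examples):
        sub = [e["PlayTennis"] for e in examples if e[attr] == v]
        g -= len(sub) / len(examples) * entropy(sub)
    return g

# (Humidity, Wind, PlayTennis) for D1..D14
ROWS = [("High", "Weak", "No"),     ("High", "Strong", "No"),
        ("High", "Weak", "Yes"),    ("High", "Weak", "Yes"),
        ("Normal", "Weak", "Yes"),  ("Normal", "Strong", "No"),
        ("Normal", "Strong", "Yes"),("High", "Weak", "No"),
        ("Normal", "Weak", "Yes"),  ("Normal", "Weak", "Yes"),
        ("Normal", "Strong", "Yes"),("High", "Strong", "Yes"),
        ("Normal", "Weak", "Yes"),  ("High", "Strong", "No")]
S = [{"Humidity": h, "Wind": w, "PlayTennis": p} for h, w, p in ROWS]
```

`gain(S, "Humidity")` comes out near .152 and `gain(S, "Wind")` near .048, so Humidity is the better classifier of the two, as on the slide.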

Partially learned tree

{D1, D2, ..., D14} [9+,5-]
Outlook
├── Sunny    → {D1,D2,D8,D9,D11}  [2+,3-]  ?
├── Overcast → {D3,D7,D12,D13}    [4+,0-]  Yes
└── Rain     → {D4,D5,D6,D10,D14} [3+,2-]  ?

Which attribute should be tested here? S_sunny = {D1,D2,D8,D9,D11}

Gain(S_sunny, Humidity)    = .970 - (3/5)(0.0) - (2/5)(0.0) = .970
Gain(S_sunny, Temperature) = .970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = .570
Gain(S_sunny, Wind)        = .970 - (2/5)(1.0) - (3/5)(.918) = .019
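The same computation applies recursively to the Sunny branch. A quick check on just the five S_sunny examples (column indices and tuple layout are ours):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(v) / n * math.log2(labels.count(v) / n)
                for v in set(labels))

def gain(examples, idx):
    """Information gain of splitting on the attribute at column `idx`;
    the class label is the last column."""
    g = entropy([e[-1] for e in examples])
    for v in set(e[idx] for e in examples):
        sub = [e[-1] for e in examples if e[idx] == v]
        g -= len(sub) / len(examples) * entropy(sub)
    return g

# S_sunny = {D1, D2, D8, D9, D11}: (Temperature, Humidity, Wind, PlayTennis)
S_SUNNY = [("Hot",  "High",   "Weak",   "No"),   # D1
           ("Hot",  "High",   "Strong", "No"),   # D2
           ("Mild", "High",   "Weak",   "No"),   # D8
           ("Cool", "Normal", "Weak",   "Yes"),  # D9
           ("Mild", "Normal", "Strong", "Yes")]  # D11
```

`gain(S_SUNNY, 1)` (Humidity) ≈ .97, `gain(S_SUNNY, 0)` (Temperature) ≈ .57, and `gain(S_SUNNY, 2)` (Wind) ≈ .02, matching the slide to rounding, so Humidity is tested next under Sunny.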

Hypothesis Space Search by ID3

[Figure: ID3 searches the space of decision trees from simple to complex: the empty tree, then single-attribute trees (A1, A2, ...), then progressively larger trees that extend the most promising candidates.]

Hypothesis Space Search by ID3

- Hypothesis space is complete! The target function is surely in there...
- Outputs a single hypothesis. Which one?
- Inductive bias: approximately "prefer the shortest tree"
- No backtracking → local minima...
- Statistically-based search choices → robust to noisy data...

5. Issues in DT Learning

5.1 Inductive Bias in ID3

Note: H is the power set of instances X. Unbiased? Not really...
- Preference for short trees, and for those with high-information-gain attributes near the root
- Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
- Occam's razor: prefer the shortest hypothesis that fits the data

Occam's Razor

Why prefer short hypotheses?

Argument in favor:
- There are fewer short hypotheses than long hypotheses
- A short hypothesis that fits the data is unlikely to be a coincidence
- A long hypothesis that fits the data might be a coincidence

Argument opposed:
- There are many ways to define small sets of hypotheses (e.g., all trees with a prime number of nodes that use attributes beginning with "Z")
- What's so special about small sets based on the size of the hypothesis?

5.2 Overfitting in Decision Trees

Consider adding noisy training example #15: (Sunny, Hot, Normal, Strong, PlayTennis = No). What effect does it produce on the earlier tree?

Outlook
├── Sunny    → Humidity
│              ├── High   → No
│              └── Normal → Yes
├── Overcast → Yes
└── Rain     → Wind
               ├── Strong → No
               └── Weak   → Yes

Overfitting: Definition

Consider the error of hypothesis h over
- the training data: error_train(h)
- the entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

error_train(h) < error_train(h')   and   error_D(h) > error_D(h')

Overfitting in Decision Tree Learning

[Figure: accuracy vs. size of tree (number of nodes), from 0 to 100. Accuracy on the training data keeps rising as nodes are added, while accuracy on the test data first rises, then declines: the signature of overfitting.]

Avoiding Overfitting

How can we avoid overfitting?
- Stop growing when the data split is no longer statistically significant
- Grow the full tree, then post-prune

How to select the best tree:
- Measure performance over the training data
- Measure performance over a separate validation data set
- MDL: minimize size(tree) + size(misclassifications(tree))

Reduced-Error Pruning

Split the data into a training set and a validation set.
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the one that most improves validation set accuracy

Effect: produces the smallest version of the most accurate subtree.
Question: what if data is limited?
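Step 1 above amounts to comparing validation accuracy before and after replacing a subtree with a leaf. A toy sketch under invented assumptions: a hypothetical two-attribute tree whose B-subtree fits a noisy training example, plus a tiny hand-made validation set (the nested-dict tree encoding and all names here are ours).

```python
def classify(tree, example):
    """Walk a nested-dict tree (attribute -> {value: subtree}) down to a leaf label."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][example[attr]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, e) == e["label"] for e in examples) / len(examples)

# A tree whose B-subtree memorizes noise, and the candidate from pruning node B
full   = {"A": {"t": "yes", "f": {"B": {"t": "no", "f": "yes"}}}}
pruned = {"A": {"t": "yes", "f": "yes"}}

validation = [{"A": "t", "B": "t", "label": "yes"},
              {"A": "f", "B": "t", "label": "yes"},
              {"A": "f", "B": "t", "label": "yes"},
              {"A": "f", "B": "t", "label": "yes"}]
```

Here `accuracy(full, validation)` is 0.25 while `accuracy(pruned, validation)` is 1.0, so the greedy loop would accept this pruning and then re-evaluate the remaining nodes.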

Effect of Reduced-Error Pruning

[Figure: accuracy vs. size of tree (number of nodes) for training data, test data, and test data during pruning; pruning removes the decline in test accuracy seen on the unpruned tree.]

Rule Post-Pruning

1. Convert the tree to an equivalent set of rules
2. Prune each rule independently of the others
3. Sort the final rules into the desired sequence for use

It is perhaps the most frequently used method (e.g., in C4.5).

Converting a Tree to Rules

Outlook
├── Sunny    → Humidity
│              ├── High   → No
│              └── Normal → Yes
├── Overcast → Yes
└── Rain     → Wind
               ├── Strong → No
               └── Weak   → Yes

IF   (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No

IF   (Outlook = Sunny) ∧ (Humidity = Normal)
THEN PlayTennis = Yes

...
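The conversion is just one rule per root-to-leaf path. A minimal sketch over the same nested-dict tree encoding used in the ID3 sketch above (the encoding and function name are ours):

```python
def tree_to_rules(tree, conditions=()):
    """Collect one (conditions, class) rule per root-to-leaf path
    of a nested-dict tree (attribute -> {value: subtree})."""
    if not isinstance(tree, dict):              # leaf: emit accumulated rule
        return [(list(conditions), tree)]
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

PLAYTENNIS_TREE = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

rules = tree_to_rules(PLAYTENNIS_TREE)
```

This yields the five rules of the PlayTennis tree, including IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No; each rule can then be pruned independently.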

5.3 Continuous-Valued Attributes

Create a discrete attribute to test a continuous one, e.g. for Temperature = 82.5 define (Temperature > 72.3) ∈ {t, f}:

Temperature:  40  48  60  72  80  90
PlayTennis:   No  No  Yes Yes Yes No
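A common way to pick the threshold (as in C4.5) is to try midpoints between adjacent sorted values where the class changes, then choose the one with the best Gain. A sketch of the candidate-generation step (the function name is ours):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
```

For the Temperature data above this yields the candidates 54.0 and 85.0; the learner would then compare Gain for (Temperature > 54) and (Temperature > 85).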

5.4 Attributes with Many Values

Problem: if an attribute has many values, Gain will select it. Imagine using Date = Jun_3_1996 as an attribute.

One approach: use GainRatio instead:

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) = - Σ_{i=1}^{c} (|S_i| / |S|) log2(|S_i| / |S|)

where S_i is the subset of S for which A has the value v_i.
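Both measures depend only on the subset sizes |S_i|, so they can be sketched directly (function names are ours; `gain` is taken as already computed):

```python
import math

def split_information(sizes):
    """SplitInformation(S, A) from the sizes |S_i| of the subsets induced by A."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes)

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)
```

For Wind on the 14 PlayTennis examples (8 Weak, 6 Strong), SplitInformation ≈ 0.985. A Date-like attribute splitting the 14 examples into 14 singletons has SplitInformation = log2(14) ≈ 3.807, so even its maximal Gain of .940 yields a GainRatio of only about .247: the denominator penalizes many-valued attributes.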

5.5 Attributes with Costs

Consider:
- medical diagnosis: BloodTest has cost $150
- robotics: Width_from_1ft has cost 23 sec.

Question: how to learn a consistent tree with low expected cost?

One approach: replace Gain by
- Gain²(S, A) / Cost(A)                  (Tan and Schlimmer, 1990)
- (2^Gain(S, A) - 1) / (Cost(A) + 1)^w   (Nunez, 1988)

where w ∈ [0, 1] determines the importance of cost.
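The two cost-sensitive criteria are simple rescalings of Gain. A sketch with illustrative (made-up) gain and cost values; the function names are ours:

```python
def tan_schlimmer(gain, cost):
    """Gain^2(S, A) / Cost(A)   (Tan and Schlimmer, 1990)"""
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w   (Nunez, 1988), w in [0, 1]"""
    return (2 ** gain - 1) / (cost + 1) ** w
```

For example, an attribute with gain 0.5 and cost 2 scores 0.125 under the first measure; with w = 0, Nunez's measure ignores cost entirely, while w = 1 penalizes it fully.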

5.6 Unknown Attribute Values

Question: what if an example is missing the value of an attribute A? Use the training example anyway, and sort it through the tree. If node n tests A:
- assign the most common value of A among the other examples sorted to node n, or
- assign the most common value of A among the other examples with the same target value, or
- assign probability p_i to each possible value v_i of A, and assign the fraction p_i of the example to each descendant in the tree

Classify new examples in the same fashion.
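The third (fractional) option can be sketched as splitting the example into weighted copies, one per observed value of A at the node (the function name and dict-based example layout are ours):

```python
from collections import Counter

def distribute_missing(example, attr, known_examples):
    """Split an example missing `attr` into weighted copies, one per observed
    value v_i of attr, with weight p_i = frequency of v_i among `known_examples`."""
    counts = Counter(e[attr] for e in known_examples)
    n = sum(counts.values())
    return [(dict(example, **{attr: v}), c / n) for v, c in counts.items()]

# At a node testing Wind, suppose the other examples are 8 Weak / 6 Strong
known = [{"Wind": "Weak"}] * 8 + [{"Wind": "Strong"}] * 6
parts = distribute_missing({"Outlook": "Sunny"}, "Wind", known)
```

Here the example is sent down the Weak branch with weight 8/14 and the Strong branch with weight 6/14; the weights sum to 1, so each training example still counts once in total.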