Decision Trees
Data Science: Jordan Boyd-Graber, University of Maryland
March 11, 2018

Roadmap
Classification: machines labeling data for us.
Last time: naïve Bayes. This time: decision trees, which are simple, nonlinear, and interpretable.
Discussion: which classifier should I use for my problem?

Decision Trees: Trees
Suppose we want to construct a set of rules to represent the data. We can represent the data as a series of if-then statements: the "if" splits the inputs into two categories, and the "then" assigns a value. When if statements are nested, the structure is called a tree.
[Figure: example tree splitting on X1 > 5, X1 > 8, X2 < -2, and X3 > 0, with TRUE/FALSE leaves]

Decision Trees: Trees
Example: data (X1, X2, X3, Y), where X1, X2, X3 are real-valued and Y is Boolean.
First, see if X1 > 5:
  if TRUE, see if X1 > 8: if TRUE, return FALSE; if FALSE, return TRUE
  if FALSE, see if X2 < -2:
    if TRUE, see if X3 > 0: if TRUE, return TRUE; if FALSE, return FALSE
    if FALSE, return TRUE
[Figure: the same tree drawn with X1 > 5 at the root]

Decision Trees: Trees
Tracing the tree above:
Example 1: (X1, X2, X3) = (1, 1, 1) -> TRUE (X1 <= 5 and X2 >= -2)
Example 2: (X1, X2, X3) = (10, -3, 0) -> FALSE (X1 > 8)
A Python sketch of this tree follows.
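As a concrete illustration, here is a minimal Python sketch of the tree above written as nested if-then statements; the function name and the True/False return convention are mine, not from the slides.

```python
def example_tree(x1, x2, x3):
    """Evaluate the example tree from the slides on one input."""
    if x1 > 5:
        if x1 > 8:
            return False
        return True            # 5 < x1 <= 8
    if x2 < -2:
        if x3 > 0:
            return True
        return False
    return True                # x1 <= 5 and x2 >= -2

print(example_tree(1, 1, 1))    # Example 1 -> True
print(example_tree(10, -3, 0))  # Example 2 -> False (x1 > 8 decides; x2 and x3 are never tested)
```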

Decision Trees: Trees
Terminology:
- branches: one side of a split
- leaves: terminal nodes that return values
Why trees?
- Trees can be used for regression or classification: regression returns a real number, classification returns a class.
- Unlike linear regression, SVMs, naive Bayes, etc., trees fit local models; in large spaces, a single global model may be hard to fit and its results hard to interpret.
- Trees give fast, interpretable predictions.

Decision Trees: Example: Predicting Electoral Results
2008 Democratic primary: Hillary Clinton vs. Barack Obama.
Given historical data, how will a county vote?
- Can extrapolate to state-level data
- Might identify regions to focus on for increasing voter turnout
- Would like to know how variables interact

Decision Trees: Example: Predicting Electoral Results
Figure 1: Classification tree for county-level outcomes in the 2008 Democratic Party primary (as of April 16), by Amanda Cox for the New York Times.

Decision Trees: Example: Predicting Electoral Results
[Figure: closer views of the New York Times classification tree from Figure 1]

Decision Trees: Decision Trees
Decision tree representation:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
How would we represent, as a tree over Boolean inputs X and Y:
- X AND Y (both must be true)
- X OR Y (either can be true)
- X XOR Y (one and only one is true)
(A sketch of one set of answers follows.)
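One possible set of answers, sketched in Python with each Boolean function written as a tree of nested single-attribute tests (the function names are mine):

```python
def tree_and(x, y):       # test X at the root, then Y on its True branch
    if x:
        return y
    return False

def tree_or(x, y):        # the True branch of X is already decided
    if x:
        return True
    return y

def tree_xor(x, y):       # XOR must test Y under *both* outcomes of X
    if x:
        return not y
    return y
```

XOR is the interesting case: no single test separates the classes, so the tree has to branch on Y under both outcomes of X, giving a depth-two tree with four leaves.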

Decision Trees: When to Consider Decision Trees
- Instances describable by attribute-value pairs
- Target function is discrete-valued
- Disjunctive hypotheses may be required
- Possibly noisy training data
Examples: equipment or medical diagnosis, credit risk analysis, modeling calendar scheduling preferences.

Learning Decision Trees: Top-Down Induction of Decision Trees
Main loop:
1. A <- the best decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort training examples to the leaf nodes
5. If training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
Which attribute is best?
- A1 splits [29+, 35-] into t: [21+, 5-] and f: [8+, 30-]
- A2 splits [29+, 35-] into t: [18+, 33-] and f: [11+, 2-]

Learning Decision Trees: Entropy: Reminder
[Figure: Entropy(S) as a function of p+, rising from 0 at p+ = 0 to 1.0 at p+ = 0.5 and back to 0 at p+ = 1]
- S is a sample of training examples
- p+ is the proportion of positive examples in S
- p- is the proportion of negative examples in S
- Entropy measures the impurity of S:
  $\text{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

Learning Decision Trees: Entropy
How spread out is the distribution of S?
$p_+(-\log_2 p_+) + p_-(-\log_2 p_-)$
$\text{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$
A small code sketch follows.
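A minimal Python sketch of this entropy function for a Boolean sample, using the usual convention that 0 * log 0 = 0 (the helper name is mine):

```python
import math

def entropy(p_pos):
    """Entropy (in bits) of a Boolean sample with proportion p_pos of positives."""
    h = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:                   # 0 * log2(0) contributes nothing
            h -= p * math.log2(p)
    return h

print(entropy(0.5))                 # 1.0: an even split is maximally impure
print(entropy(1.0))                 # 0.0: a pure sample
print(round(entropy(29 / 64), 2))   # ~0.99 for a [29+, 35-] sample
```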

Learning Decision Trees: Information Gain
Which feature A would be a more useful rule in our decision tree?
Gain(S, A) = expected reduction in entropy due to sorting on A:
$\text{Gain}(S, A) \equiv \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \text{Entropy}(S_v)$
Candidate splits of S = [29+, 35-]:
- A1: t -> [21+, 5-], f -> [8+, 30-]
- A2: t -> [18+, 33-], f -> [11+, 2-]

Learning Decision Trees
Computing the gains for the two candidate splits of S = [29+, 35-]:
$H(S) = -\tfrac{29}{64}\log_2\tfrac{29}{64} - \tfrac{35}{64}\log_2\tfrac{35}{64} \approx 0.994$
$\text{Gain}(S, A_1) = H(S) - \tfrac{26}{64}\left(-\tfrac{21}{26}\log_2\tfrac{21}{26} - \tfrac{5}{26}\log_2\tfrac{5}{26}\right) - \tfrac{38}{64}\left(-\tfrac{8}{38}\log_2\tfrac{8}{38} - \tfrac{30}{38}\log_2\tfrac{30}{38}\right) \approx 0.994 - 0.287 - 0.441 \approx 0.27$
$\text{Gain}(S, A_2) = H(S) - \tfrac{51}{64}\left(-\tfrac{18}{51}\log_2\tfrac{18}{51} - \tfrac{33}{51}\log_2\tfrac{33}{51}\right) - \tfrac{13}{64}\left(-\tfrac{11}{13}\log_2\tfrac{11}{13} - \tfrac{2}{13}\log_2\tfrac{2}{13}\right) \approx 0.994 - 0.746 - 0.126 \approx 0.12$
A1 gives the larger reduction in entropy, so it is the better attribute. A quick numerical check follows.
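To double-check these numbers, here is a small self-contained Python sketch (the helper names are mine) that recomputes the entropy and both gains directly from the class counts:

```python
import math

def entropy_counts(pos, neg):
    """Entropy of a sample given its positive and negative counts."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def gain(parent, children):
    """Information gain; parent and each child are (pos, neg) count pairs."""
    n = sum(parent)
    remainder = sum((p + q) / n * entropy_counts(p, q) for p, q in children)
    return entropy_counts(*parent) - remainder

S = (29, 35)
print(round(entropy_counts(*S), 2))             # ~0.99
print(round(gain(S, [(21, 5), (8, 30)]), 2))    # Gain(S, A1) ~ 0.27
print(round(gain(S, [(18, 33), (11, 2)]), 2))   # Gain(S, A2) ~ 0.12
```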

Learning Decision Trees: Training Examples
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Learning Decision Trees: Selecting the Next Attribute
Which attribute is the better classifier for S = [9+, 5-] (E = 0.940)?
- Humidity: High -> [3+, 4-] (E = 0.985), Normal -> [6+, 1-] (E = 0.592)
  Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151
- Wind: Weak -> [6+, 2-] (E = 0.811), Strong -> [3+, 3-] (E = 1.00)
  Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048
Humidity provides the larger information gain; these gains can also be recomputed directly from the table, as sketched below.
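The same gains can be recomputed from the 14 training examples above; this is a sketch with my own helper names, not code from the slides:

```python
import math
from collections import Counter

# The PlayTennis examples from the table, as (attributes, label) pairs.
data = [
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong"}, "No"),
]

def entropy(examples):
    counts = Counter(label for _, label in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(examples, attribute):
    n = len(examples)
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

for attribute in ("Outlook", "Temperature", "Humidity", "Wind"):
    print(attribute, round(gain(data, attribute), 3))
# Humidity ~0.152 and Wind ~0.048 (the slide's 0.151 and 0.048, up to rounding);
# Outlook ~0.247 has the largest gain of all four attributes.
```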

Learning Decision Trees: ID3 Algorithm
- Start at the root; look for the best attribute.
- Repeat for the subtrees at each attribute outcome.
- Stop when the information gain is below a threshold.
Bias: ID3 prefers shorter trees (Occam's Razor):
- a short hypothesis that fits the data is unlikely to be a coincidence
- a long hypothesis that fits the data might be a coincidence
This helps prevent overfitting (more later). A recursive sketch of the algorithm follows.
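A compact, self-contained Python sketch of this top-down procedure (the names are mine; a real ID3 implementation would also handle unseen attribute values at prediction time and usually adds pruning):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting the examples on attribute attr."""
    n = len(labels)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        remainder += len(idx) / n * entropy([labels[i] for i in idx])
    return entropy(labels) - remainder

def id3(rows, labels, attrs, min_gain=1e-9):
    """Grow a tree top-down: pick the highest-gain attribute, split, recurse."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:        # pure node, or no attributes left
        return majority
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    if gain(rows, labels, best) <= min_gain:      # stop when the gain is negligible
        return majority
    branches = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                              [a for a in attrs if a != best], min_gain)
    return (best, branches)                       # internal node: (attribute, {value: subtree})

# Example usage with the PlayTennis data defined above:
#   tree = id3([x for x, _ in data], [y for _, y in data],
#              ["Outlook", "Temperature", "Humidity", "Wind"])
```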

Recap: Text Classification
Many commercial applications: text classification is used on corporate intranets, by government departments, and by Internet publishers.
Often there are greater performance gains from exploiting domain-specific text features than from changing from one machine learning method to another.
Representing features is often a big challenge (e.g., scaling them to zero mean and unit variance).

Recap: Choosing What Kind of Classifier to Use
When building a text classifier, the first question is: how much training data is currently available?
- None? Hand-write rules or use active learning.
- Very little? Naïve Bayes.
- A fair amount? SVM (later).
- A huge amount? Doesn't matter; use whatever works.

Recap
Is there a learning method that is optimal for all text classification problems? No, because there is a tradeoff between bias and variance.
Factors to take into account:
- How much training data is available?
- How simple/complex is the problem? (linear vs. nonlinear decision boundary)
- How noisy is the problem?
- How stable is the problem over time? For an unstable problem, it's better to use a simple and robust classifier.