Classification: Decision Trees

Classification: Decision Trees These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Last Time A common decomposition of machine learning, based on differences in inputs and outputs: Supervised Learning: learn a function mapping inputs to outputs using labeled training data (you get instances/examples with both inputs and the ground-truth output). Unsupervised Learning: learn something about the data alone, without any labels (harder!), for example clustering instances that are similar. Reinforcement Learning: learn how to make decisions given a sparse reward. ML is an interaction between: the input data fed to the function (the features/attributes of the data), the function (model) you choose, and the optimization algorithm you use to explore the space of functions. We are new at this, so let's start by learning if/then rules for classification!

Function Approximation Problem Setting Set of possible instances X. Set of possible labels Y. Unknown target function f : X → Y. Set of function hypotheses H = {h | h : X → Y}. Input: training examples of the unknown target function f, {(x_i, y_i)}_{i=1}^n = {(x_1, y_1), ..., (x_n, y_n)}. Output: hypothesis h ∈ H that best approximates f. Based on slide by Tom Mitchell

Sample Dataset (was Tennis Played?) Columns denote features X_i. Rows denote labeled instances (x_i, y_i). The class label denotes whether a tennis game was played.

Decision Tree A possible decision tree for the data: each internal node tests one attribute X_i; each branch from a node selects one value for X_i; each leaf node predicts Y. Based on slide by Tom Mitchell

Decision Tree A possible decision tree for the data: What prediction would we make for <outlook=sunny, temperature=hot, humidity=high, wind=weak>? Based on slide by Tom Mitchell
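Since the tree figure itself is not in the transcription, here is a minimal sketch of the usual PlayTennis tree for this dataset (an assumption: Outlook at the root, Humidity tested under Sunny, Wind tested under Rain), which answers the question above:

```python
def predict_play_tennis(outlook, temperature, humidity, wind):
    """One plausible PlayTennis tree: Outlook at the root, Humidity tested
    under Sunny, Wind tested under Rain. Temperature is never tested."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    if outlook == "Overcast":
        return "Yes"
    # outlook == "Rain"
    return "No" if wind == "Strong" else "Yes"

# The query from the slide: <outlook=sunny, temperature=hot, humidity=high, wind=weak>
print(predict_play_tennis("Sunny", "Hot", "High", "Weak"))  # -> "No" under this tree
```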

Decision Tree Decision Boundary Decision trees divide the feature space into axis-parallel (hyper-)rectangles. Each rectangular region is labeled with one label or a probability distribution over labels.

Expressiveness Given a particular space of functions, you may not be able to represent everything. What functions can decision trees represent? Decision trees can represent any function of the input attributes! Boolean operations (and, or, xor, etc.)? Yes! All Boolean functions? Yes!

Stages of (Batch) Machine Learning Given: labeled training data X, Y = {(x_i, y_i)}_{i=1}^n; assumes each x_i ~ D(X) with y_i = f_target(x_i). Train the model: model ← classifier.train(X, Y). Apply the model to new data: given a new unlabeled instance x ~ D(X), y_prediction ← model.predict(x).
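As one concrete instance of this train-then-predict workflow (an assumption: the slides name no particular library), a minimal scikit-learn sketch with made-up integer-encoded features:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy encoded dataset: rows are instances x_i, columns are features X_i.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]   # hypothetical feature encodings
y_train = [0, 0, 1, 1]                       # labels y_i

model = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits
model.fit(X_train, y_train)                  # "train the model"

x_new = [[1, 0]]                             # new unlabeled instance
print(model.predict(x_new))                  # "apply the model to new data"
```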

Basic Algorithm for Top-Down Learning of Decision Trees [ID3, C4.5 by Quinlan] node = root of decision tree. Main loop: 1. A ← the best decision attribute for the next node. 2. Assign A as the decision attribute for node. 3. For each value of A, create a new descendant of node. 4. Sort training examples to leaf nodes. 5. If training examples are perfectly classified, stop. Else, recurse over new leaf nodes. How do we choose which attribute is best?
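A minimal Python sketch of this loop (not from the slides; `best_attribute` is a hypothetical stand-in for the attribute-selection heuristic discussed next, and examples are dicts with a 'label' key):

```python
from collections import Counter

def id3(examples, attributes, best_attribute):
    """Recursive top-down tree induction in the style of ID3."""
    labels = [ex["label"] for ex in examples]
    if len(set(labels)) == 1:                # perfectly classified: make a leaf
        return labels[0]
    if not attributes:                       # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    A = best_attribute(examples, attributes)          # 1. pick the best attribute
    tree = {A: {}}                                    # 2. A becomes the decision node
    for v in {ex[A] for ex in examples}:              # 3. one branch per value of A
        subset = [ex for ex in examples if ex[A] == v]  # 4. sort examples to branches
        remaining = [a for a in attributes if a != A]
        tree[A][v] = id3(subset, remaining, best_attribute)  # 5. recurse
    return tree
```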

Choosing the Best Attribute Key problem: choosing which attribute to split a given set of examples Some possibilities are: Random: Select any attribute at random Least-Values: Choose the attribute with the smallest number of possible values Most-Values: Choose the attribute with the largest number of possible values Max-Gain: Choose the attribute that has the largest expected information gain i.e., attribute that results in smallest expected size of subtrees rooted at its children The ID3 algorithm uses the Max-Gain method of selecting the best attribute

Information Gain Which test is more informative: a split on whether Balance exceeds 50K (branches: less than or equal to 50K / over 50K), or a split on whether the applicant is employed (branches: unemployed / employed)? Based on slide by Pedro Domingos 12

Information Gain Impurity/Entropy (informal) Measures the level of impurity in a group of examples Based on slide by Pedro Domingos 13

Impurity (figure showing three groups): a very impure group, a less impure group, and a group with minimum impurity. Based on slide by Pedro Domingos 14

Entropy: a common way to measure impurity Entropy = −Σ_i p_i log₂ p_i, where p_i is the probability of class i, computed as the proportion of class i in the set. Entropy comes from information theory: the higher the entropy, the more the information content. What does that mean for learning from examples? Based on slide by Pedro Domingos 15

Entropy 2-Class Cases: H(X) = −Σ_{i=1}^n P(x = i) log₂ P(x = i). What is the entropy of a group in which all examples belong to the same class? entropy = −1 log₂ 1 = 0 (minimum impurity): not a good training set for learning. What is the entropy of a group with 50% in either class? entropy = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1 (maximum impurity): a good training set for learning. Based on slide by Pedro Domingos 16
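A short sketch (not in the original deck) that computes this entropy and checks the two boundary cases above:

```python
import math

def entropy(probs):
    """H = -sum_i p_i * log2(p_i), using the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))        # 0.0 -> minimum impurity (all examples in one class)
print(entropy([0.5, 0.5]))   # 1.0 -> maximum impurity for two classes
```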

Sample Entropy (figure not transcribed). Slide by Tom Mitchell

Information Gain We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned. Information gain tells us how important a given attribute of the feature vectors is. We will use it to decide the ordering of attributes in the nodes of a decision tree. Based on slide by Pedro Domingos 18

From Entropy to Information Gain Entropy H(X) of a random variable X. Specific conditional entropy H(X | Y = v) of X given Y = v. Conditional entropy H(X | Y) of X given Y. Mutual information (aka information gain) of X and Y. Slide by Tom Mitchell
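The formulas on this slide did not survive the transcription; the standard definitions being referred to are:

```latex
H(X)            = -\sum_{x} P(X = x)\,\log_2 P(X = x)
H(X \mid Y = v) = -\sum_{x} P(X = x \mid Y = v)\,\log_2 P(X = x \mid Y = v)
H(X \mid Y)     = \sum_{v} P(Y = v)\, H(X \mid Y = v)
I(X; Y)         = H(X) - H(X \mid Y)
```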

Information Gain Information Gain is the mutual information between input attribute A and target variable Y Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A Slide by Tom Mitchell

Calculating Information Gain Information Gain = entropy(parent) − [weighted average entropy(children)]. Entire population (30 instances) with a 14/16 class split: parent entropy = −(14/30) log₂(14/30) − (16/30) log₂(16/30) = 0.996. First child (17 instances, 13/4 split): child entropy = −(13/17) log₂(13/17) − (4/17) log₂(4/17) = 0.787. Second child (13 instances, 1/12 split): child entropy = −(1/13) log₂(1/13) − (12/13) log₂(12/13) = 0.391. (Weighted) average entropy of children = (17/30)(0.787) + (13/30)(0.391) = 0.615. Information Gain = 0.996 − 0.615 = 0.38. Based on slide by Pedro Domingos 21
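A quick script (added here, not part of the deck) that reproduces the numbers on this slide from the stated counts:

```python
import math

def H(a, b):
    """Two-class entropy from class counts a and b."""
    n = a + b
    return -sum(c / n * math.log2(c / n) for c in (a, b) if c > 0)

parent = H(14, 16)                              # ~0.996
left, right = H(13, 4), H(1, 12)                # ~0.787 and ~0.391
children = (17 / 30) * left + (13 / 30) * right # ~0.615 weighted average
print(round(parent - children, 2))              # information gain ~0.38
```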


Which Tree Should We Output? ID3 performs a heuristic search through the space of decision trees. It stops at the smallest acceptable tree. Why? Occam's razor: prefer the simplest hypothesis that fits the data. Slide by Tom Mitchell

Preference bias: Ockham's Razor Principle stated by William of Ockham (1285-1347): non sunt multiplicanda entia praeter necessitatem ("entities are not to be multiplied beyond necessity"). AKA Occam's Razor, the Law of Economy, or the Law of Parsimony. Idea: the simplest consistent explanation is the best. Therefore, the smallest decision tree that correctly classifies all of the training examples is best. Finding the provably smallest decision tree is NP-hard, so instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small.

Overfitting in Decision Trees Many kinds of noise can occur in the examples: two examples may have the same attribute/value pairs but different classifications; some attribute values may be incorrect because of errors in the data acquisition process or the preprocessing phase; an instance may have been labeled incorrectly (+ instead of -). Also, some attributes are irrelevant to the decision-making process, e.g., the color of a die is irrelevant to its outcome. Based on Slide from M. desjardins & T. Finin 27

Overfitting in Decision Trees Irrelevant attributes can result in overfitting the training example data If hypothesis space has many dimensions (large number of attributes), we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features If we have too little training data, even a reasonable hypothesis space will overfit Based on Slide from M. desjardins & T. Finin 28

Overfitting in Decision Trees Consider adding a noisy training example to the following tree: What would be the effect of adding: <outlook=sunny, temperature=hot, humidity=normal, wind=strong, playtennis=no>? 29 Based on Slide by Pedro Domingos

Overfitting in Decision Trees (figure slides; plots not transcribed). Slides by Pedro Domingos 30-31

Avoiding Overfitting in Decision Trees How can we avoid overfitting? Stop growing when the data split is not statistically significant; acquire more training data; remove irrelevant attributes (a manual process, not always possible); or grow the full tree, then post-prune. How to select the best tree: measure performance over the training data; measure performance over a separate validation data set; or add a complexity penalty to the performance measure (heuristic: simpler is better). Based on Slide by Pedro Domingos 32
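For reference, a sketch of how a few of these strategies map onto scikit-learn's decision-tree parameters (an assumption about tooling; the slides do not mention a library): early stopping via depth and leaf-size limits, and post-pruning via cost-complexity pruning.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing early by capping depth and requiring a minimum leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5)

# Post-pruning: grow the full tree, then prune with a complexity penalty
# (larger ccp_alpha prunes more aggressively).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```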

Reduced-Error Pruning Split training data further into training and validation sets Grow tree based on training set Do until further pruning is harmful: 1. Evaluate impact on validation set of pruning each possible node (plus those below it) 2. Greedily remove the node that most improves validation set accuracy Slide by Pedro Domingos 33
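A minimal sketch of this greedy loop (hypothetical helpers: accuracy(tree, data), internal_nodes(tree), and prune(tree, node), which returns a copy with node's subtree replaced by a majority-class leaf):

```python
def reduced_error_prune(tree, validation, accuracy, internal_nodes, prune):
    """Greedily prune the node whose removal most improves validation accuracy;
    stop when no pruning helps. Helpers are placeholders for a concrete
    tree representation."""
    while True:
        base = accuracy(tree, validation)
        best_gain, best_node = 0.0, None
        for node in internal_nodes(tree):                 # try pruning each node
            gain = accuracy(prune(tree, node), validation) - base
            if gain > best_gain:
                best_gain, best_node = gain, node
        if best_node is None:                             # further pruning is harmful
            return tree
        tree = prune(tree, best_node)                     # keep the best pruning step
```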

Pruning Decision Trees Pruning of the decision tree is done by replacing a whole subtree by a leaf node. The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf. For example, consider a subtree that splits on Color. Training set: the red branch has 1 positive / 0 negative, the blue branch has 0 positive / 2 negative. Validation set: the red branch has 1 positive / 3 negative, the blue branch has 1 positive / 1 negative. On the validation set the subtree gets 2 correct and 4 incorrect; if we had simply predicted the majority class (negative), we would make 2 errors instead of 4. Based on Example from M. desjardins & T. Finin 34
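A small check of the counts in this example (added here): from the training counts the subtree predicts positive for red and negative for blue, which is what produces 4 validation errors versus 2 for a single negative leaf.

```python
# Validation counts reaching each branch: (positives, negatives).
val = {"red": (1, 3), "blue": (1, 1)}

# Subtree learned from training: red -> positive, blue -> negative.
subtree_errors = val["red"][1] + val["blue"][0]   # 3 + 1 = 4 incorrect
leaf_errors = val["red"][0] + val["blue"][0]      # always predict negative: 1 + 1 = 2 errors
print(subtree_errors, leaf_errors)                # 4 2 -> replace the subtree with a leaf
```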

Effect of Reduced-Error Pruning On the training data it looks great, but that's not the case for the test (validation) data. Based on Slide by Pedro Domingos 35

Effect of Reduced-Error Pruning The tree is pruned back to the red line where it gives more accurate results on the test data 36 Based on Slide by Pedro Domingos

Summary: Decision Tree Learning Widely used in practice. Strengths include: fast and simple to implement; can convert to rules; handles noisy data. Weaknesses include: univariate splits/partitioning using only one attribute at a time, which limits the types of possible trees; large decision trees may be hard to understand; requires fixed-length feature vectors; non-incremental (i.e., a batch method).

Summary: Decision Tree Learning Representation: decision trees Bias: prefer small decision trees Search algorithm: greedy Heuristic function: information gain or information content or others Overfitting / pruning Slide by Pedro Domingos 38