CSE 473, Chapter 18: Decision Trees and Ensemble Learning


Recall: Learning Decision Trees

Example: when should I wait for a table at a restaurant? Attributes (features) relevant to the Wait? decision:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Example Decision Tree

A decision tree for Wait? based on personal rules of thumb (this tree was used to generate the input data). [Figure: hand-built decision tree for the Wait? decision.]

Input Data for Learning

Past examples where I did/did not wait for a table. The classification of each example is positive (T) or negative (F). [Table: 12 training examples with their attribute values and Wait? labels.]
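To make the later calculations concrete, here is one possible way such training examples could be encoded in Python. This is my own sketch: the attribute names follow the slide, but the two rows shown are hypothetical placeholders, not the actual table from the lecture.

```python
# A hedged sketch of a data representation for the restaurant examples.
# Attribute names follow the slides; the rows are illustrative placeholders.
examples = [
    {"Alternate": True, "Bar": False, "FriSat": False, "Hungry": True,
     "Patrons": "Some", "Price": "$$$", "Raining": False, "Reservation": True,
     "Type": "French", "WaitEstimate": "0-10", "WillWait": True},
    {"Alternate": True, "Bar": False, "FriSat": False, "Hungry": True,
     "Patrons": "Full", "Price": "$", "Raining": False, "Reservation": False,
     "Type": "Thai", "WaitEstimate": "30-60", "WillWait": False},
    # ... the remaining rows would follow the same pattern
]
```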

Decision Tree Learning

Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree (a sketch of this recursion follows below).

Choosing an attribute to split on

Idea: a good attribute should reduce uncertainty, e.g., it splits the examples into subsets that are (ideally) "all positive" or "all negative". In the restaurant data, Patrons? is a better choice than Type?: after splitting on Type?, to wait or not to wait is still at 50% in every branch.
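Here is a minimal sketch of that greedy recursion, assuming the dictionary encoding above. The function and helper names (learn_decision_tree, plurality_value, choose_attribute) are mine, not from the lecture; choose_attribute stands for whatever criterion picks the "most significant" attribute, e.g., the information gain defined on the next slides.

```python
from collections import Counter

def plurality_value(examples, target="WillWait"):
    """Most common classification among the examples (ties broken arbitrarily)."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def learn_decision_tree(examples, attributes, parent_examples, choose_attribute,
                        target="WillWait"):
    """Greedy recursive tree learning in the spirit of the slide.
    choose_attribute(attributes, examples) is assumed to return the
    "most significant" attribute. Trees are nested dicts: {attr: {value: subtree}}."""
    if not examples:                          # no examples left in this branch
        return plurality_value(parent_examples, target)
    classes = {e[target] for e in examples}
    if len(classes) == 1:                     # all positive or all negative
        return classes.pop()
    if not attributes:                        # no attributes left to split on
        return plurality_value(examples, target)
    best = choose_attribute(attributes, examples)
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = learn_decision_tree(subset, remaining, examples,
                                                choose_attribute, target)
    return tree
```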

How do we quantify uncertainty?

We use information theory: entropy measures the amount of uncertainty in a probability distribution. The entropy (or information content) of an answer to a question with possible answers v_1, ..., v_n is

\[ I(P(v_1), \dots, P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i) \]
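As a quick sanity check of this formula, here is a tiny entropy helper (my own sketch, not code from the lecture):

```python
import math

def entropy(probs):
    """Entropy in bits: -sum_i P(v_i) * log2 P(v_i). Zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair yes/no question
print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty at all
print(entropy([2/6, 4/6]))   # ~0.918 bits: the "Full" branch used later
```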

Using information theory

Imagine we have p examples with Wait = true (positive) and n examples with Wait = false (negative). Our best estimates of the probabilities of Wait = true and Wait = false are

\[ P(\text{true}) \approx \frac{p}{p+n}, \qquad P(\text{false}) \approx \frac{n}{p+n} \]

Hence the entropy of Wait is given by

\[ I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} \]

[Figure: entropy I as a function of P(Wait = T), rising from 0 at P = 0 (all Wait = F) to a maximum of 1.0 at P = 0.5 and falling back to 0 at P = 1 (all Wait = T). Entropy is highest when uncertainty is greatest.]

Choosing an attribute to split on

Idea: a good attribute should reduce uncertainty and result in a gain in information. How much information do we gain if we disclose the value of some attribute? Answer: (uncertainty before) − (uncertainty after).

Back at the Restaurant

Before choosing an attribute (6 positive and 6 negative examples):

\[ \text{Entropy} = -\tfrac{6}{12}\log_2\tfrac{6}{12} - \tfrac{6}{12}\log_2\tfrac{6}{12} = -\log_2\tfrac{1}{2} = \log_2 2 = 1 \text{ bit} \]

There is 1 bit of information to be discovered.

Back at the Restaurant

If we choose Type: along the French branch the entropy is 1 bit, and similarly for the other branches, so the information gain is 1 − 1 = 0 along any branch.

If we choose Patrons: in the None and Some branches the entropy is 0; in the Full branch the entropy is −(2/6) log2(2/6) − (4/6) log2(4/6) ≈ 0.92, so the information gain is (1 − 0) or (1 − 0.92) bits, i.e., > 0 in both cases. So choosing Patrons gains more information!

Entropy across branches

How do we combine the entropies of different branches? Answer: compute the average entropy, weighting each branch's entropy according to the probability of that branch. We enter None 2/12 of the time, so the weight for None is 1/6; Some has weight 4/12 = 1/3; Full has weight 6/12 = 1/2.

\[ \text{AvgEntropy}(A) = \sum_{i=1}^{n} \underbrace{\frac{p_i + n_i}{p + n}}_{\text{weight for each branch}} \; \underbrace{I\!\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)}_{\text{entropy for each branch}} \]
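A small sketch of this weighted-average entropy in code, using the branch counts from the Patrons split; the helper names are mine:

```python
import math

def branch_entropy(p, n):
    """Entropy I(p/(p+n), n/(p+n)) in bits of a branch with p positive and n negative examples."""
    total = p + n
    return -sum(c / total * math.log2(c / total) for c in (p, n) if c > 0)

def avg_entropy(branches):
    """Weighted average entropy: sum_i (p_i+n_i)/(p+n) * I(p_i/(p_i+n_i), n_i/(p_i+n_i)).
    branches is a list of (p_i, n_i) counts, one pair per value of the attribute."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * branch_entropy(p, n) for p, n in branches)

# Patrons: None has (0+, 2-), Some has (4+, 0-), Full has (2+, 4-)
print(avg_entropy([(0, 2), (4, 0), (2, 4)]))   # ~0.459 bits
```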

Information gain

Information gain (IG), the reduction in entropy from using attribute A:

IG(A) = (Entropy before) − (AvgEntropy after choosing A)

Choose the attribute with the largest IG.

Information gain in our example

\[ IG(\text{Patrons}) = 1 - \left[\tfrac{2}{12} I(0,1) + \tfrac{4}{12} I(1,0) + \tfrac{6}{12} I\!\left(\tfrac{2}{6},\tfrac{4}{6}\right)\right] \approx 0.541 \text{ bits} \]

\[ IG(\text{Type}) = 1 - \left[\tfrac{2}{12} I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{2}{12} I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right)\right] = 0 \text{ bits} \]

Patrons has the highest IG of all attributes and is chosen by the DTL algorithm as the root.
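The two numbers above can be reproduced directly. This is a self-contained sketch (the helper names are mine), with the branch counts read off the 12 restaurant examples:

```python
import math

def I(*probs):
    """Entropy in bits of a distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(branches, before=1.0):
    """IG(A) = entropy before splitting - weighted average entropy after splitting on A.
    branches is a list of (p_i, n_i) counts, one pair per value of attribute A;
    before defaults to 1.0 bit because the full set has 6 positive and 6 negative examples."""
    total = sum(p + n for p, n in branches)
    after = sum((p + n) / total * I(p / (p + n), n / (p + n)) for p, n in branches)
    return before - after

print(info_gain([(0, 2), (4, 0), (2, 4)]))          # Patrons: ~0.541 bits
print(info_gain([(1, 1), (1, 1), (2, 2), (2, 2)]))  # Type:     0.0 bits
```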

Should I stay or should I go? The learned decision tree

The decision tree learned from the 12 examples is substantially simpler than the hand-built tree: a more complex hypothesis is not justified by such a small amount of data. [Figure: decision tree learned from the 12 examples.]

Performance Measurement

How do we know that the learned tree h approximates the target function f? Answer: try h on a new test set of examples. The learning curve plots the % correct on the test set as a function of training-set size.
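A hedged sketch of how such a learning curve could be measured. The learn and predict arguments are hypothetical helpers supplied by the caller (e.g., the DTL sketch above plus a tree-walking predictor); none of this is lecture code.

```python
import random

def learning_curve(examples, learn, predict, test_fraction=0.25, trials=20,
                   target="WillWait"):
    """% correct on a held-out test set as a function of training-set size.
    learn(training_examples) returns a hypothesis h; predict(h, example) returns a label.
    Results are averaged over several random train/test splits."""
    n_test = max(1, int(len(examples) * test_fraction))
    curve = {}
    for size in range(1, len(examples) - n_test + 1):
        scores = []
        for _ in range(trials):
            shuffled = random.sample(examples, len(examples))
            test, train = shuffled[:n_test], shuffled[n_test:n_test + size]
            h = learn(train)
            scores.append(sum(predict(h, e) == e[target] for e in test) / n_test)
        curve[size] = sum(scores) / trials
    return curve
```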

Ensemble Learning

Sometimes each learning technique yields a different hypothesis (or function), but no perfect hypothesis. Could we combine several imperfect hypotheses to get a better hypothesis?

Example 1: Othello Project

Many brains better than one?

Example 2: Combining 3 linear classifiers

[Figure: three lines in the plane; each line is one simple classifier saying that everything to the left is + and everything to the right is −. Together they form a more complex classifier.]

Ensemble Learning: Motivation

Analogies:
- Elections combine voters' choices to pick a good candidate (hopefully).
- Committees combine experts' opinions to make better decisions.
- Students working together on the Othello project.

Intuitions:
- Individuals make mistakes, but the majority may be less likely to (true for Othello? We shall see...).
- Individuals often have partial knowledge; a committee can pool expertise to make better decisions.

Technique 1: Bagging

Combine hypotheses via majority voting.

Bagging: Analysis

The error probability went down from 0.1 to 0.01!
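The slide does not show the calculation behind the 0.1 → 0.01 claim. One standard way to get a number in that range is to assume independent hypotheses, each wrong with probability 0.1, and ask how often the majority of them is wrong; the ensemble size of 5 below is my assumption, not something stated on the slide.

```python
from math import comb

def majority_error(n, eps):
    """Probability that a majority of n independent classifiers, each wrong with
    probability eps, is wrong on a given input (n assumed odd)."""
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_error(5, 0.1))    # ~0.0086, i.e. roughly 0.01
print(majority_error(11, 0.1))   # ~0.0003: more voters help further, if truly independent
```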

Weighted Majority Voting

In practice, hypotheses are rarely independent, and some hypotheses make fewer errors than others, so all votes are not equal! Idea: let's take a weighted majority.

Technique 2: Boosting

- The most popular ensemble learning technique.
- Computes a weighted majority of hypotheses.
- Can boost the performance of a weak learner.
- Operates on a weighted training set: each training example (instance) has a weight, and the learning algorithm takes the weight of each input into account.
- Idea: when an input is misclassified by a hypothesis, increase its weight so that the next hypothesis is more likely to classify it correctly.

Boosting Example with Decision Trees (DTs)

[Figure: several boosting rounds with decision trees; the annotations mark a training case that is correctly classified, a training case that has large weight in this round, and a DT that has a strong vote.] The output of h_final is the weighted majority of the outputs of h_1, ..., h_4.

AdaBoost Algorithm

[Figure: the AdaBoost algorithm.]
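Since the algorithm itself appears only as a figure, here is a hedged sketch of binary AdaBoost with a one-dimensional decision stump as the weak learner. Labels are +1/−1, and all names (stump_learner, adaboost, the toy data) are mine, not from the lecture.

```python
import math

def stump_learner(xs, ys, w):
    """Weak learner: the threshold/sign decision stump with lowest weighted error.
    xs: 1-D inputs, ys: labels in {+1, -1}, w: normalized example weights."""
    best = None
    for thresh in sorted(set(xs)):
        for sign in (+1, -1):
            h = lambda x, t=thresh, s=sign: s if x <= t else -s
            err = sum(wi for xi, yi, wi in zip(xs, ys, w) if h(xi) != yi)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1], best[0]

def adaboost(xs, ys, rounds=3):
    """Returns a weighted-majority classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
    n = len(xs)
    w = [1.0 / n] * n                         # D_1: equal weights on all training inputs
    ensemble = []
    for _ in range(rounds):
        h, eps = stump_learner(xs, ys, w)     # minimize error w.r.t. the weighted set
        eps = max(eps, 1e-10)                 # guard against a perfect weak hypothesis
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Increase weights of misclassified inputs, decrease the rest, renormalize.
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    def H(x):
        return 1 if sum(alpha * h(x) for alpha, h in ensemble) > 0 else -1
    return H

# Tiny 1-D illustration: no single stump fits this labeling, but the boosted vote does.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [+1, +1, -1, -1, -1, +1, +1, +1]
H = adaboost(xs, ys, rounds=5)
print([H(x) for x in xs] == ys)               # True on the training data
```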

AdaBoost Example

Original training set, D_1: equal weights on all training inputs. Goal: in round t, learn a classifier h_t that minimizes error with respect to the weighted training set; h_t maps each input to True (+1) or False (−1). (Taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire.)

Round 1: [Figure: the weak classifier h_1; the misclassified points have their weights increased.] z_1 = 0.42

AdaBoost Example, Round 2: [Figure: weak classifier h_2 on the reweighted training set.] z_2 = 0.65

AdaBoost Example, Round 3: [Figure: weak classifier h_3 on the reweighted training set.] z_3 = 0.92
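These z values are consistent with AdaBoost's usual hypothesis weight α_t = ½ ln((1 − ε_t)/ε_t), where ε_t is the weighted error of h_t. For example, a round-1 weighted error of 0.30 gives z_1 ≈ 0.42; the 0.30 is my inference (it matches the Freund-Schapire tutorial example), since the slides only show the z values.

```python
import math

# Round-1 hypothesis weight, assuming a weighted error of 0.30
# (an assumed value, not stated on the slide).
eps1 = 0.30
alpha1 = 0.5 * math.log((1 - eps1) / eps1)
print(round(alpha1, 2))    # 0.42, matching z_1 above
```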

AdaBoost Example: Final Hypothesis

h_final is the weighted majority (the sign of the weighted sum) of h_1, h_2, h_3, where sign(x) = +1 if x > 0 and −1 otherwise. [Figure: the combined classifier.]

Next Time

Classification using: Nearest Neighbors, Neural Networks.