Information Gain, Decision Trees and Boosting
10-701 ML recitation 9, Feb 2006, by Jure

Entropy and Information Gain

Entropy & Bits. You are watching a set of independent random samples of X. X has 4 possible values: P(X=A)=1/4, P(X=B)=1/4, P(X=C)=1/4, P(X=D)=1/4. You get a string of symbols, e.g. ACBABBCDADDC. To transmit the data over a binary link you can encode each symbol with 2 bits (A=00, B=01, C=10, D=11), so you need 2 bits per symbol.

Fewer Bits, example 1. Now someone tells you the probabilities are not equal: P(X=A)=1/2, P(X=B)=1/4, P(X=C)=1/8, P(X=D)=1/8. Now it is possible to find a coding that uses only 1.75 bits on average. How?
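One possible answer (a worked sketch; this particular prefix code is my example, not given on the slide): encode A=0, B=10, C=110, D=111. The expected code length is then 1/2*1 + 1/4*2 + 1/8*3 + 1/8*3 = 1.75 bits, which matches H(X) for this distribution.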

Fewer Bits, example 2. Suppose there are three equally likely values: P(X=A)=1/3, P(X=B)=1/3, P(X=C)=1/3. The naïve coding A=00, B=01, C=10 uses 2 bits per symbol. Can you find a coding that uses 1.6 bits per symbol? In theory it can be done with log2(3) ≈ 1.58496 bits.

Entropy, General Case. Suppose X takes n values V_1, V_2, ..., V_n, with P(X=V_1)=p_1, P(X=V_2)=p_2, ..., P(X=V_n)=p_n. What is the smallest number of bits, on average per symbol, needed to transmit symbols drawn from the distribution of X? It is

H(X) = -p_1 log2 p_1 - p_2 log2 p_2 - ... - p_n log2 p_n = -Σ_i p_i log2 p_i

H(X) is called the entropy of X.
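As a quick check of the examples above, here is a minimal Python sketch (not part of the original slides) that evaluates the entropy formula:

    import math

    def entropy(probs):
        # H(X) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (four equally likely symbols)
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits (Fewer Bits, example 1)
    print(entropy([1/3, 1/3, 1/3]))            # ~1.585 bits (Fewer Bits, example 2)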

High vs. Low Entropy. High entropy: X is from a uniform-like distribution, the histogram is flat, and values sampled from it are less predictable. Low entropy: X is from a varied (peaks and valleys) distribution, the histogram has many lows and highs, and values sampled from it are more predictable.

Specific Conditional Entropy, H(Y | X=v). X = College Major, Y = Likes Gladiator. The data:

    X         Y
    Math      Yes
    History   No
    CS        Yes
    Math      No
    Math      No
    CS        Yes
    History   No
    Math      Yes

I have input X and want to predict Y. From the data we estimate probabilities: P(LikeG=Yes) = 0.5, P(Major=Math & LikeG=No) = 0.25, P(Major=Math) = 0.5, P(Major=History & LikeG=Yes) = 0. Note that H(X) = 1.5 and H(Y) = 1.

Specific Conditional Entropy, H(Y | X=v) (same College Major / Likes Gladiator data as above). Definition of specific conditional entropy: H(Y | X=v) = the entropy of Y among only those records in which X has value v. Example: H(Y | X=Math) = 1, H(Y | X=History) = 0, H(Y | X=CS) = 0.

Conditional Entropy, H(Y | X) (same College Major / Likes Gladiator data as above). Definition of conditional entropy: H(Y | X) = the average specific conditional entropy of Y = Σ_i P(X=v_i) H(Y | X=v_i). Example:

    v_i       P(X=v_i)   H(Y | X=v_i)
    Math      0.5        1
    History   0.25       0
    CS        0.25       0

H(Y | X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5.

Information Gain (same College Major / Likes Gladiator data as above). Definition of information gain: IG(Y | X) = "I must transmit Y. How many bits, on average, would it save me if both ends of the line knew X?" That is, IG(Y | X) = H(Y) - H(Y | X). Example: H(Y) = 1 and H(Y | X) = 0.5, thus IG(Y | X) = 1 - 0.5 = 0.5.
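A minimal Python sketch (not part of the original slides) that reproduces these numbers from the eight (Major, LikesGladiator) records:

    import math
    from collections import Counter

    data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
            ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

    def entropy(labels):
        # entropy of a list of class labels, from their empirical frequencies
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def conditional_entropy(pairs):
        # H(Y | X) = sum_v P(X=v) * H(Y | X=v)
        n = len(pairs)
        h = 0.0
        for v in set(x for x, _ in pairs):
            subset = [y for x, y in pairs if x == v]
            h += (len(subset) / n) * entropy(subset)
        return h

    H_Y = entropy([y for _, y in data])          # 1.0
    H_Y_given_X = conditional_entropy(data)      # 0.5
    print(H_Y, H_Y_given_X, H_Y - H_Y_given_X)   # 1.0 0.5 0.5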

Decision Trees

When do I play tennis?

Decision Tree

Is the decision tree correct? Let's check whether the split on the Wind attribute is correct. We need to show that the Wind attribute has the highest information gain.

When do I play tennis?

Wind attribute: 5 records match. Note: calculate the entropy only on the examples that got routed into our branch of the tree (Outlook=Rain).

Calculation. Let S = {D4, D5, D6, D10, D14}. Entropy: H(S) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971. Information gain: IG(S, Temp) = H(S) - H(S | Temp) = 0.01997, IG(S, Humidity) = H(S) - H(S | Humidity) = 0.01997, IG(S, Wind) = H(S) - H(S | Wind) = 0.971.
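A sketch (not part of the original slides) that verifies these numbers in Python. The five Outlook=Rain rows are assumed to be the ones from the standard PlayTennis table (Mitchell, Chapter 3), which the slides use but which does not appear in this transcript:

    import math
    from collections import Counter

    # assumed Outlook=Rain rows; columns: Temp, Humidity, Wind, PlayTennis
    S = {
        "D4":  ("Mild", "High",   "Weak",   "Yes"),
        "D5":  ("Cool", "Normal", "Weak",   "Yes"),
        "D6":  ("Cool", "Normal", "Strong", "No"),
        "D10": ("Mild", "Normal", "Weak",   "Yes"),
        "D14": ("Mild", "High",   "Strong", "No"),
    }

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(attr_index):
        labels = [row[3] for row in S.values()]
        n = len(S)
        h_cond = 0.0
        for v in set(row[attr_index] for row in S.values()):
            subset = [row[3] for row in S.values() if row[attr_index] == v]
            h_cond += (len(subset) / n) * entropy(subset)
        return entropy(labels) - h_cond

    for name, idx in [("Temp", 0), ("Humidity", 1), ("Wind", 2)]:
        print(name, round(info_gain(idx), 5))   # Temp 0.01997, Humidity 0.01997, Wind 0.97095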

More about Decision Trees. How do I determine the classification at a leaf? If Outlook=Rain were a leaf, what would the classification rule be? Classification example: we have N boolean attributes, all of which are needed for classification; how many IG calculations do we need? Strength of decision trees (boolean attributes): they can represent all boolean functions. Further topic: handling continuous attributes.

Boosting

Booosting. Boosting is a way of combining weak learners (also called base learners) into a more accurate classifier. We learn in iterations; each iteration focuses on the hard-to-learn parts of the attribute space, i.e. the examples that were misclassified by the previous weak learners. Note: there is nothing inherently weak about the weak learners; we just think of them this way. In fact, any learning algorithm can be used as a weak learner in boosting.

Boooosting, AdaBoost

(The slide shows the AdaBoost algorithm; its annotations point out that the weighted error counts misclassifications with respect to the weights D, and that α is the influence (importance) of the weak learner.)
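Since the algorithm itself appears only as a figure, here is a minimal AdaBoost sketch in Python (an illustration, not the slide's code; it assumes labels y_i in {-1, +1} and a hypothetical fit_stump(X, y, D) that trains a weak learner on data weighted by D):

    import math

    def adaboost(X, y, fit_stump, T):
        n = len(X)
        D = [1.0 / n] * n                  # start with uniform weights
        ensemble = []                      # list of (alpha_t, h_t)
        for t in range(T):
            h = fit_stump(X, y, D)         # weak learner trained on weighted data
            # weighted misclassification rate eps_t, w.r.t. the weights D
            eps = sum(D[i] for i in range(n) if h(X[i]) != y[i])
            if eps == 0 or eps >= 0.5:     # perfect or useless weak learner: stop
                break
            alpha = 0.5 * math.log((1 - eps) / eps)   # influence of this weak learner
            # increase weights of misclassified examples, decrease the others
            D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(n)]
            Z = sum(D)
            D = [d / Z for d in D]         # renormalize so D stays a distribution
            ensemble.append((alpha, h))
        # final classifier: sign of the alpha-weighted vote of the weak learners
        return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1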

Booooosting Decision Stumps

Boooooosting. The weights D_t are uniform. The first weak learner is the stump that splits on Outlook (since the weights are uniform). It makes 4 misclassifications out of the 14 examples, so ε_1 = 4/14 ≈ 0.29 and α_1 = ½ ln((1 - ε_1)/ε_1) ≈ 0.46. The misclassified examples then determine how the weights D_t are updated.
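The update shown on the slide is presumably the standard AdaBoost reweighting (assuming labels y_i in {-1, +1}): D_{t+1}(i) = D_t(i) * exp(-α_t * y_i * h_t(x_i)) / Z_t, where Z_t renormalizes the weights so they sum to 1. Misclassified examples thus have their weight multiplied by e^{α_t} and correctly classified examples by e^{-α_t}.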

Booooooosting Decision Stumps: misclassifications made by the 1st weak learner.

Boooooooosting, round 1. The 1st weak learner misclassifies 4 examples (D6, D9, D11, D14). Now update the weights D_t: the weights of examples D6, D9, D11, D14 increase, and the weights of the other (correctly classified) examples decrease. How do we calculate IGs for the 2nd round of boosting?

Booooooooosting, round 2. Now use D_t instead of counts (D_t is a distribution). So when calculating information gain we compute probabilities using the weights D_t (not counts), e.g. P(Temp=Mild) = D_t(D4) + D_t(D8) + D_t(D10) + D_t(D11) + D_t(D12) + D_t(D14), which is more than 6/14 (Temp=Mild occurs 6 times). Similarly, P(Tennis=Yes | Temp=Mild) = (D_t(D4) + D_t(D10) + D_t(D11) + D_t(D12)) / P(Temp=Mild), and no extra magic is needed for IG.
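A minimal Python sketch (not part of the original slides) of how weighted counts replace raw counts when computing entropies and IG in later rounds; the rows and weights below are illustrative, not the actual D_2 from the example:

    import math
    from collections import defaultdict

    pairs = [("Mild", "Yes"), ("Mild", "No"), ("Cool", "Yes"), ("Cool", "No")]
    D = [0.4, 0.3, 0.2, 0.1]                # a distribution over the examples (sums to 1)

    def weighted_entropy(pairs, weights):
        # entropy of the labels, with each example counted by its weight
        total = sum(weights)
        by_label = defaultdict(float)
        for (_, y), w in zip(pairs, weights):
            by_label[y] += w
        return -sum((w / total) * math.log2(w / total) for w in by_label.values() if w > 0)

    def weighted_info_gain(pairs, weights):
        h = weighted_entropy(pairs, weights)
        h_cond = 0.0
        for v in set(x for x, _ in pairs):
            idx = [i for i, (x, _) in enumerate(pairs) if x == v]
            w_v = sum(weights[i] for i in idx)          # P(attribute = v) under D_t
            h_cond += w_v * weighted_entropy([pairs[i] for i in idx],
                                             [weights[i] for i in idx])
        return h - h_cond

    print(weighted_info_gain(pairs, D))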

Boooooooooosting, even more. Boosting does not easily overfit. You have to determine a stopping criterion; this is not obvious, but also not that important. Boosting is greedy: it always chooses the currently best weak learner, and once it has chosen a weak learner and its alpha, they remain fixed; no changes are possible in later rounds of boosting.

Acknowledgement: part of the slides on Information Gain was borrowed from Andrew Moore.