
Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan

Outline: Tree representation; Brief information theory; Learning decision trees; Bagging; Random forests.

Decision trees: a non-linear classifier. Easy to use and easy to interpret. Susceptible to overfitting, but this can be avoided.

Anatomy of a decision tree. Each node is a test on one attribute; the branches are the possible attribute values of the node, and the leaves are the decisions. The root tests Outlook (sunny, overcast, rain): the sunny branch leads to a test on Humidity (high -> No, normal -> Yes), the overcast branch leads directly to Yes, and the rain branch leads to a test on Windy (false -> No, true -> Yes).

Anatomy of a decision tree (with sample sizes). Each node is still a test on one attribute, with branches for its possible values and decisions at the leaves; note that as examples are passed down the branches, your data gets smaller at each node.

To play tennis or not. A new test example: (Outlook == rain) and (not Windy == false). Pass it down the tree: it takes the rain branch and then the Windy == true branch, so the decision is Yes.

To play tennis or not. Reading the paths that lead to Yes off the tree: (Outlook == overcast) -> yes; (Outlook == rain) and (not Windy == false) -> yes; (Outlook == sunny) and (Humidity == normal) -> yes.

Decision trees. Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances: (Outlook == overcast) OR ((Outlook == rain) AND (not Windy == false)) OR ((Outlook == sunny) AND (Humidity == normal)) => yes, play tennis.
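As a quick illustration (not part of the original slides), the tree above can be written directly as nested conditionals; a minimal Python sketch, with attribute names taken from the slide:

```python
def play_tennis(outlook, humidity, windy):
    # Each nested test is one internal node of the tree; return values are the leaves.
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":
        # The rain branch tests Windy; "not Windy == false" means Windy is true.
        return "yes" if windy else "no"
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"

print(play_tennis("rain", humidity="high", windy=True))     # yes
print(play_tennis("sunny", humidity="normal", windy=False)) # yes
```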

Representation. The Boolean function Y = ((A and B) or ((not A) and C)) can be represented as a tree: the root tests A; the A = 0 branch tests C (C = 0 -> false, C = 1 -> true) and the A = 1 branch tests B (B = 0 -> false, B = 1 -> true).

Same concept, different representation. The same function Y = ((A and B) or ((not A) and C)) can also be represented by a different tree, for example one that tests C at the root and tests A and B below it; several distinct trees can represent the same concept.

Which attribute to select for splitting? The counts at each node show the distribution of each class (not of the attribute). Starting from 16+ / 16- at the root, a split whose children are 8+ / 8-, then 4+ / 4-, then 2+ / 2- keeps an equal mix of the two classes at every node. This is bad splitting.

How do we choose the test? Which attribute should be used as the test? Intuitively, you would prefer the one that separates the training examples as much as possible.

Information Gain. Information gain is one criterion for deciding on the attribute to split on.

Information. Imagine: 1. Someone is about to tell you your own name. 2. You are about to observe the outcome of a die roll. 3. You are about to observe the outcome of a coin flip. 4. You are about to observe the outcome of a biased coin flip. Each situation has a different amount of uncertainty as to what outcome you will observe.

Information. Information is the reduction in uncertainty (the amount of surprise in the outcome): I(E) = log2(1 / p(E)) = -log2 p(E). If the probability of an event is small and it happens, the information is large. 1. Observing that the outcome of a coin flip is heads: I = -log2(1/2) = 1. 2. Observing that the outcome of a die roll is 6: I = -log2(1/6) = 2.58.

Entropy. The expected amount of information when observing the output of a random variable X: H(X) = E[I(X)] = sum_i p(x_i) I(x_i) = -sum_i p(x_i) log2 p(x_i). If X can have 8 outcomes and all are equally likely, H(X) = -sum_i (1/8) log2(1/8) = 3 bits.
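A minimal Python sketch of the two quantities just defined (self-information and entropy), using base-2 logarithms; the last line reproduces the 3-bit result for 8 equally likely outcomes. This is illustrative code, not from the recitation:

```python
import math

def information(p):
    # Self-information I(E) = -log2 p(E): the rarer the event, the more information.
    return -math.log2(p)

def entropy(probs):
    # H(X) = -sum_i p(x_i) log2 p(x_i); terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information(1 / 2))    # 1.0 bit: a fair coin comes up heads
print(information(1 / 6))    # ~2.58 bits: a die shows a 6
print(entropy([1 / 8] * 8))  # 3.0 bits: 8 equally likely outcomes
```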

Entropy. If there are k possible outcomes, H(X) <= log2 k. Equality holds when all outcomes are equally likely; the more the probability distribution deviates from uniformity, the lower the entropy.

Entropy and purity. Entropy measures the impurity of a node. Compare a node with 4+ / 4- to one with 8+ / 0-: in the second node the distribution is less uniform, the entropy is lower, and the node is purer.

Conditional entropy. H(X) = -sum_i p(x_i) log2 p(x_i). H(X | Y) = sum_j p(y_j) H(X | Y = y_j) = -sum_j p(y_j) sum_i p(x_i | y_j) log2 p(x_i | y_j).

Information gain. IG(X, Y) = H(X) - H(X | Y): the reduction in uncertainty about X from knowing Y. Information gain = (information before the split) - (information after the split).

Information Gain. Information gain = (information before the split) - (information after the split).

Example. Attributes X1, X2 and label Y:

X1  X2  Y  Count
T   T   +  2
T   F   +  2
F   T   -  5
F   F   +  1

Which one do we choose, X1 or X2? IG(X1, Y) = H(Y) - H(Y | X1). H(Y) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1. H(Y | X1) = P(X1 = T) H(Y | X1 = T) + P(X1 = F) H(Y | X1 = F) = 4/10 * (-1 log2 1 - 0 log2 0) + 6/10 * (-(5/6) log2(5/6) - (1/6) log2(1/6)) = 0.39. Information gain: IG(X1, Y) = 1 - 0.39 = 0.61.

Which one do we choose? For the same table, IG(X1, Y) = 0.61 and IG(X2, Y) = 0.12. Pick the variable that provides the most information gain about Y: pick X1.
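A short sketch of the same computation on the table above, using only the standard library; it reproduces IG(X1, Y) of about 0.61 and shows that X1 has the larger gain. This is an illustrative reimplementation, not code from the recitation:

```python
import math
from collections import Counter

# The example table, expanded to one row per instance: (X1, X2, Y).
data = ([("T", "T", "+")] * 2 + [("T", "F", "+")] * 2
        + [("F", "T", "-")] * 5 + [("F", "F", "+")] * 1)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_index, rows):
    labels = [r[-1] for r in rows]
    gain = entropy(labels)                                  # H(Y)
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)   # - P(X = v) H(Y | X = v)
    return gain

print(round(information_gain(0, data), 2))  # IG(X1, Y) ~ 0.61
print(round(information_gain(1, data), 2))  # IG(X2, Y) is smaller, so X1 is chosen
```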

Recurse on branches. After splitting on X1, apply the same procedure separately to the examples that fall into one branch (X1 = T) and to those in the other branch (X1 = F).

Caveats The number of possible values influences the information gain. The more possible values, the higher the gain (the more likely it is to form small, but pure partitions)

Purity (diversity) measures: Gini index (population diversity), information gain, chi-square test.
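As an aside, the Gini index mentioned above is easy to compute from the class counts at a node; a small illustrative sketch (not from the slides):

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum_i p_i^2, where p_i is the fraction of class i at the node.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["+"] * 4 + ["-"] * 4))  # 0.5: a maximally mixed node
print(gini(["+"] * 8))              # 0.0: a pure node
```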

Overfitting. You can perfectly fit a tree to any training data (zero bias, high variance). Two approaches to avoid this: 1. Stop growing the tree when further splitting the data does not yield an improvement. 2. Grow a full tree, then prune it by eliminating nodes.
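A hedged sketch of the two approaches using scikit-learn's DecisionTreeClassifier; the library, dataset, and parameter values are my illustration, not the recitation's:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Approach 1: stop growing early by limiting depth / minimum samples per split.
early_stopped = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X, y)

# Approach 2: grow the full tree, then prune; a positive ccp_alpha removes
# subtrees whose improvement does not justify the added complexity.
pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(early_stopped.get_depth(), pruned.get_depth())
```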

Bagging. Bagging, or bootstrap aggregation, is a technique for reducing the variance of an estimated prediction function. For classification, a committee of trees is grown and each tree casts a vote for the predicted class.

Bootstrap The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set
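A minimal numpy sketch of drawing one bootstrap sample (same size as the training set, sampled with replacement); the array names and sizes are placeholders, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30                             # number of training examples (placeholder)
X = rng.normal(size=(N, 5))        # placeholder feature matrix
y = rng.integers(0, 2, size=N)     # placeholder binary labels

idx = rng.integers(0, N, size=N)   # N indices drawn with replacement
X_boot, y_boot = X[idx], y[idx]    # one bootstrap sample, same size as the original
```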

Bagging. Create bootstrap samples from the training data (N examples, M features).

Random Forest Classifier. Construct a decision tree (N examples, M features).

Random Forest Classifier (N examples, M features). Take the majority vote of the trees.

Bagging. Given training data Z = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, draw bootstrap samples Z*b for b = 1, ..., B. Let f*b(x) be the prediction at input x when bootstrap sample b is used for training; the bagged estimate is f_bag(x) = (1/B) sum_{b=1..B} f*b(x). See http://www-stat.stanford.edu/~hastie/papers/eslii.pdf (Chapter 8.7).
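An illustrative sketch of the bagged classifier just described: fit B trees, one per bootstrap sample Z*b, and combine them by majority vote. Scikit-learn trees are assumed as the base learner; this is not code from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, B=25, seed=0):
    # Fit one tree per bootstrap sample Z*b, b = 1, ..., B.
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    # Each tree casts one vote; the committee predicts the majority class.
    # (Assumes integer-coded class labels.)
    votes = np.stack([tree.predict(X) for tree in trees])   # shape (B, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```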

Bagging: a simulated example. Generate a sample of size N = 30, with two classes and p = 5 features, each having a standard Gaussian distribution with pairwise correlation 0.95. The response Y is generated according to Pr(Y = 1 | x1 <= 0.5) = 0.2 and Pr(Y = 1 | x1 > 0.5) = 0.8.
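A sketch of that data-generating process with numpy; the random seed and function choices are mine, while the sizes, correlation, and probabilities come from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, rho = 30, 5, 0.95

# p standard Gaussian features with pairwise correlation rho.
cov = np.full((p, p), rho)
np.fill_diagonal(cov, 1.0)
X = rng.multivariate_normal(np.zeros(p), cov, size=N)

# Pr(Y = 1 | x1 <= 0.5) = 0.2,  Pr(Y = 1 | x1 > 0.5) = 0.8
prob_one = np.where(X[:, 0] <= 0.5, 0.2, 0.8)
y = rng.binomial(1, prob_one)
```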

Bagging. Notice that the bootstrap trees are different from the original tree.

Bagging. Treat the voting proportions as probabilities. See Hastie et al., http://www-stat.stanford.edu/~hastie/papers/eslii.pdf, Example 8.7.1.

Random forest classifier. The random forest classifier is an extension of bagging which uses decorrelated trees.

Random Forest Classifier. Start from the training data (N examples, M features).

Random Forest Classifier. Create bootstrap samples from the training data (N examples, M features).

Random Forest Classifier. Construct a decision tree (N examples, M features).

Random Forest Classifier. At each node, when choosing the split feature, choose only among m < M features.

Random Forest Classifier. Create a decision tree from each bootstrap sample (N examples, M features).

Random Forest Classifier (N examples, M features). Take the majority vote of the trees.
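For reference, the same procedure is available off the shelf; a minimal scikit-learn sketch (an alternative to the package linked on the next slide; the dataset and parameter values are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each of the n_estimators trees is grown on a bootstrap sample, and at every
# node only max_features (m < M) randomly chosen features are candidate splits.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
rf.fit(X, y)
print(rf.predict(X[:5]))   # prediction = majority vote over the trees
```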

Random forest. Available package: http://www.stat.berkeley.edu/~breiman/randomforests/cc_home.htm. To read more: http://www-stat.stanford.edu/~hastie/papers/eslii.pdf