Decision Trees: Entropy, Information Gain, Gain Ratio


Lecture 3: Part 2
Marina Santini
Acknowledgements: slides borrowed and adapted from Data Mining by I. H. Witten, E. Frank and M. A. Hall
Changelog: 14 Oct, 30 Oct

Outline
- Entropy
- Information gain
- Gain ratio

Trees
- The divide-and-conquer approach produces a tree.
- Nodes involve testing a particular attribute.
  - Usually, the attribute value is compared to a constant.
  - Other possibilities: comparing the values of two attributes, or using a function of one or more attributes.
- Leaves assign a classification, a set of classifications, or a probability distribution to instances.
- An unknown instance is routed down the tree.

Nominal and numeric attributes
- Nominal: the number of children is usually equal to the number of values, so the attribute won't get tested more than once.
  - Other possibility: division into two subsets.
- Numeric: test whether the value is greater or less than a constant, so the attribute may get tested several times.
  - Other possibility: a three-way split (or multi-way split).
    - Integer: less than, equal to, greater than.
    - Real: below, within, above.

Missing values
- Does the absence of a value have some significance?
  - If yes: "missing" is treated as a separate value.
  - If no: "missing" must be treated in a special way.
    - Solution A: assign the instance to the most popular branch.
    - Solution B: split the instance into pieces.
      - Pieces receive weight according to the fraction of training instances that go down each branch.
      - Classifications from leaf nodes are combined using the weights that have percolated down to them.

Constructing decision trees
- Strategy: top down, in recursive divide-and-conquer fashion (a minimal sketch follows below).
  - First: select an attribute for the root node and create a branch for each possible attribute value.
  - Then: split the instances into subsets, one for each branch extending from the node.
  - Finally: repeat recursively for each branch, using only the instances that reach that branch.
- Stop if all instances have the same class.
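To make the divide-and-conquer recipe concrete, here is a minimal Python sketch of top-down tree construction. It is not the course's or Weka's implementation: the list-of-dicts data layout and the helper names (entropy, information_gain, build_tree) are illustrative assumptions, and the split criterion used is the information gain defined later in these slides.

```python
import math
from collections import Counter

def entropy(class_counts):
    """Entropy (in bits) of a distribution given as a list of class counts."""
    total = sum(class_counts)
    return sum((c / total) * math.log2(total / c) for c in class_counts if c)

def information_gain(instances, attribute, target):
    """Information before splitting minus expected information after splitting."""
    def class_counts(subset):
        return list(Counter(inst[target] for inst in subset).values())
    total = len(instances)
    after = 0.0
    for value in set(inst[attribute] for inst in instances):
        subset = [inst for inst in instances if inst[attribute] == value]
        after += len(subset) / total * entropy(class_counts(subset))
    return entropy(class_counts(instances)) - after

def build_tree(instances, attributes, target):
    """Top-down, recursive divide-and-conquer construction of a decision tree."""
    classes = [inst[target] for inst in instances]
    if len(set(classes)) == 1:        # stop: all instances have the same class
        return classes[0]
    if not attributes:                # no attributes left: predict the majority class
        return Counter(classes).most_common(1)[0][0]
    # First: select an attribute for this node (greatest information gain)
    # and create one branch per observed value.
    attr = max(attributes, key=lambda a: information_gain(instances, a, target))
    remaining = [a for a in attributes if a != attr]   # nominal: tested at most once
    tree = {attr: {}}
    # Then: split the instances into subsets, one per branch;
    # finally: recurse using only the instances that reach each branch.
    for value in set(inst[attr] for inst in instances):
        subset = [inst for inst in instances if inst[attr] == value]
        tree[attr][value] = build_tree(subset, remaining, target)
    return tree
```

Called on the weather data as build_tree(instances, ["Outlook", "Temperature", "Humidity", "Windy"], "Play"), this sketch should reproduce the Outlook/Humidity/Windy tree discussed later in the lecture.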

Which attribute to select?
- (The slides show the four candidate one-level splits of the weather data: on Outlook, Temperature, Humidity and Windy.)

Criterion for attribute selection
- Which is the best attribute? We want to get the smallest tree.
- Heuristic: choose the attribute that produces the "purest" nodes.
- Popular impurity criterion: information gain.
  - Information gain increases with the average purity of the subsets.
- Strategy: choose the attribute that gives the greatest information gain.

Computing information
- Measure information in bits.
  - Given a probability distribution, the information required to predict an event is the distribution's entropy.
  - Entropy gives the information required in bits (this can involve fractions of bits!).
- Formula for computing the entropy (logarithms in base 2):
  entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn

Example: attribute Outlook
- (The slide works through the entropy of each Outlook branch: info([2,3]) for sunny, info([4,0]) for overcast and info([3,2]) for rainy, which combine to info([2,3],[4,0],[3,2]) = 0.693 bits.)

Computing information gain
- Information gain = information before splitting - information after splitting:
  gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes of the weather data (reproduced in the sketch below):
  gain(Outlook) = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity) = 0.152 bits
  gain(Windy) = 0.048 bits
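The Outlook numbers above can be checked with a few lines of Python. This is only an illustrative sketch of the formula, not code from the course; the info helper mimics the slide notation info([...]) and its name is my own.

```python
import math

def info(*count_lists):
    """info([...]) in the slide notation: the entropy of one class-count list,
    or the weighted average entropy over several lists (one per branch)."""
    def entropy(counts):
        total = sum(counts)
        return sum((c / total) * math.log2(total / c) for c in counts if c)
    grand_total = sum(sum(counts) for counts in count_lists)
    return sum(sum(counts) / grand_total * entropy(counts) for counts in count_lists)

before = info([9, 5])                    # 0.940 bits: the full weather data
after = info([2, 3], [4, 0], [3, 2])     # 0.693 bits: after splitting on Outlook
print(round(before, 3), round(after, 3), round(before - after, 3))
# 0.94 0.693 0.247  ->  gain(Outlook) = 0.247 bits
```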

Continuing to split
- Gains for the Outlook = sunny branch:
  gain(Temperature) = 0.571 bits
  gain(Humidity) = 0.971 bits
  gain(Windy) = 0.020 bits

Final decision tree
- The final tree splits on Outlook at the root, tests Humidity under sunny, has a pure "yes" leaf under overcast, and tests Windy under rainy.
- Note: not all leaves need to be pure; sometimes identical instances have different classes.
- Splitting stops when the data can't be split any further.

Wishlist for a purity measure
- Properties we require from a purity measure:
  - When a node is pure, the measure should be zero.
  - When impurity is maximal (i.e. all classes are equally likely), the measure should be maximal.
  - The measure should obey the multistage property (i.e. decisions can be made in several stages), e.g. measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4]).
- Entropy is the only function that satisfies all three properties! (See the numeric check below.)

Properties of the entropy
- The multistage property: entropy(p, q, r) = entropy(p, q + r) + (q + r) x entropy(q / (q + r), r / (q + r)).

Highly-branching attributes
- Problematic: attributes with a large number of values (extreme case: ID code).
- Subsets are more likely to be pure if there is a large number of values.
  - Information gain is biased towards choosing attributes with a large number of values.
  - This may result in overfitting (selection of an attribute that is non-optimal for prediction).
- Another problem: fragmentation.

Weather data with ID code
- (The slide shows the standard weather table with attributes Outlook, Temp., Humidity, Windy and Play, plus an extra ID code attribute whose values A to N are unique to each of the 14 instances.)
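The three wishlist properties, and the bias towards highly-branching attributes, can be verified numerically. The snippet below is a hedged illustration: the helper name entropy_of_counts and the final ID-code check are my own framing, using the same entropy definition as in the earlier sketches.

```python
import math

def entropy_of_counts(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c)

# 1. A pure node has measure zero.
print(entropy_of_counts([4, 0]))          # 0.0

# 2. With all classes equally likely the measure is maximal (1 bit for 2 classes).
print(entropy_of_counts([7, 7]))          # 1.0

# 3. Multistage property: the decision can be made in stages, e.g.
#    measure([2,3,4]) = measure([2,7]) + (7/9) * measure([3,4])
lhs = entropy_of_counts([2, 3, 4])
rhs = entropy_of_counts([2, 7]) + (7 / 9) * entropy_of_counts([3, 4])
print(round(lhs, 6) == round(rhs, 6))     # True

# Highly-branching attributes: an ID code splits the 14 instances into 14 pure
# singleton branches, so the information after the split is 0 and the gain is
# the full 0.940 bits.
after_id_split = sum((1 / 14) * entropy_of_counts([1]) for _ in range(14))
print(round(entropy_of_counts([9, 5]) - after_id_split, 3))   # 0.94
```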

Tree stump for ID code attribute
- Entropy of the split: info([0,1]) + info([0,1]) + ... + info([0,1]) = 0 bits.
- Therefore the information gain is maximal for ID code (namely 0.940 bits).

Gain ratio
- Gain ratio: a modification of the information gain that reduces its bias.
- The gain ratio takes the number and size of branches into account when choosing an attribute.
  - It corrects the information gain by taking the intrinsic information of a split into account.
- Intrinsic information: the entropy of the distribution of instances into branches (i.e. how much information we need to tell which branch an instance belongs to).

Computing the gain ratio
- Gain ratio = information gain / intrinsic information of the split.
- Example (ID code): the intrinsic information of a 14-way split into singletons is info([1,1,...,1]) = log2(14) = 3.807 bits, so gain ratio(ID code) = 0.940/3.807 = 0.247.

Gain ratios for weather data (reproduced in the sketch below):

  Attribute     Info   Gain                   Split info              Gain ratio
  Outlook       0.693  0.940 - 0.693 = 0.247  info([5,4,5]) = 1.577   0.247/1.577 = 0.157
  Temperature   0.911  0.940 - 0.911 = 0.029  info([4,6,4]) = 1.557   0.029/1.557 = 0.019
  Humidity      0.788  0.940 - 0.788 = 0.152  info([7,7]) = 1.000     0.152/1.000 = 0.152
  Windy         0.892  0.940 - 0.892 = 0.048  info([8,6]) = 0.985     0.048/0.985 = 0.049

More on the gain ratio
- "Outlook" still comes out top.
- However: "ID code" has an even greater gain ratio.
  - Standard fix: an ad hoc test to prevent splitting on that type of attribute.
- Problem with the gain ratio: it may overcompensate.
  - It may choose an attribute just because its intrinsic information is very low.
  - Standard fix: only consider attributes with greater than average information gain.

Discussion
- Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan.
  - The gain ratio is just one modification of this basic algorithm.
  - C4.5 deals with numeric attributes, missing values and noisy data.
- Similar approach: CART.
- There are many other attribute selection criteria! (But they make little difference in the accuracy of the result.)
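The gain-ratio table can be reproduced by computing the split information (intrinsic information) from the branch sizes and dividing each gain by it. The sketch below is illustrative only: the gains and branch sizes are copied from the slides, while the function and variable names are assumptions of mine.

```python
import math

def entropy(counts):
    """Entropy in bits of a distribution given as counts (here: branch sizes)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c)

# (information gain, branch sizes) for each weather-data attribute, from the slides.
splits = {
    "Outlook":     (0.247, [5, 4, 5]),
    "Temperature": (0.029, [4, 6, 4]),
    "Humidity":    (0.152, [7, 7]),
    "Windy":       (0.048, [8, 6]),
}

for attr, (gain, branch_sizes) in splits.items():
    split_info = entropy(branch_sizes)     # intrinsic information of the split
    print(f"{attr}: split info = {split_info:.3f}, gain ratio = {gain / split_info:.3f}")
# Outlook: split info = 1.577, gain ratio = 0.157
# Temperature: split info = 1.557, gain ratio = 0.019
# Humidity: split info = 1.000, gain ratio = 0.152
# Windy: split info = 0.985, gain ratio = 0.049
```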

Covering algorithms
- Converting a decision tree into a rule set is straightforward, but the rule set is overly complex; more effective conversions are not trivial.
- Instead, a rule set can be generated directly: for each class in turn, find the rule set that covers all instances in it (excluding instances not in the class).
- This is called a covering approach: at each stage a rule is identified that covers some of the instances.

Example: generating a rule
- Successively refined rules for class "a" (the x and y tests refer to the two-dimensional instance space shown on the slide):
  If true then class = a
  If x > 1.2 then class = a
  If x > 1.2 and y > 2.6 then class = a
- Possible rule set for class "b":
  If x <= 1.2 then class = b
  If x > 1.2 and y <= 2.6 then class = b
- Could add more rules to get a "perfect" rule set.

Classification rules
- A popular alternative to decision trees.
- Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree).
  - The tests are usually logically ANDed together (but may also be general logical expressions).
- Consequent (conclusion): the class, set of classes, or probability distribution assigned by the rule.
- Individual rules are often logically ORed together.
  - Conflicts arise if different conclusions apply.

From trees to rules
- Easy: converting a tree into a set of rules (see the sketch below).
- One rule for each leaf:
  - The antecedent contains a condition for every node on the path from the root to the leaf.
  - The consequent is the class assigned by the leaf.
- This produces rules that are unambiguous: it doesn't matter in which order they are executed.
- But: the resulting rules are unnecessarily complex, so pruning is needed to remove redundant tests/rules.

From rules to trees
- More difficult: transforming a rule set into a tree.
  - A tree cannot easily express the disjunction between rules.
- Example: rules that test different attributes:
  If a and b then x
  If c and d then x
- In the corresponding decision tree (which produces exactly the same predictions), symmetry needs to be broken, and the tree contains identical subtrees (the "replicated subtree problem").

Rules vs. trees
- Rule sets can be more perspicuous when decision trees suffer from replicated subtrees.
- Also: in multiclass situations, a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account.
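As a sketch of the "one rule per leaf" conversion, the snippet below walks a nested-dict tree (the same hypothetical format as the earlier build_tree sketch; the function name and the reconstructed weather-data tree are my own illustrative assumptions) and prints one classification rule per leaf.

```python
def tree_to_rules(tree, conditions=()):
    """Yield one (antecedent, class) pair per leaf of a nested-dict decision tree.

    The antecedent is the list of attribute=value tests on the path from the
    root to the leaf; the consequent is the class label stored at the leaf."""
    if not isinstance(tree, dict):                 # reached a leaf: emit the rule
        yield list(conditions), tree
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

# A reconstruction of the final weather-data tree from earlier in the lecture.
weather_tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rainy":    {"Windy": {"True": "No", "False": "Yes"}},
}}

for antecedent, label in tree_to_rules(weather_tree):
    tests = " and ".join(f"{a} = {v}" for a, v in antecedent)
    print(f"If {tests} then Play = {label}")
# If Outlook = Sunny and Humidity = High then Play = No
# ... five rules in total, one per leaf
```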

The end