Information Theory & Decision Trees


Information Theory & Decision Trees Jihoon ang Sogang University Email: yangjh@sogang.ac.kr

Decision tree classifiers
Decision tree representation for modeling dependencies among input variables using elements of information theory
How to learn decision trees from data
Over-fitting and how to minimize it
How to deal with missing values in the data
Learning decision trees from distributed data
Learning decision trees at multiple levels of abstraction

Decision tree representation
In the simplest case:
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node corresponds to a class label
In general:
Each internal node corresponds to a test on input instances, with mutually exclusive and exhaustive outcomes; tests may be univariate or multivariate
Each branch corresponds to an outcome of a test
Each leaf node corresponds to a class label

Decision tree representation
[Figure: a small data set with attributes x and y and class c, together with two decision trees (Tree 1 and Tree 2) that are both consistent with it. Should we choose Tree 1 or Tree 2? Why?]

Decision tree representation
Any boolean function can be represented by a decision tree.
More generally, any function f : A_1 × A_2 × ... × A_n → C, where each A_i is the domain of the i-th attribute and C is a discrete set of values (class labels), can be represented by a decision tree.
In general, the inputs need not be discrete valued.

Learning decision tree classifiers
Decision trees are especially well suited for representing simple rules for classifying instances that are described by discrete attribute values.
Decision tree learning algorithms:
Implement Ockham's razor as a preference bias: simpler decision trees are preferred over more complex trees
Are relatively efficient: linear in the size of the decision tree and the size of the data set
Produce comprehensible results
Are often among the first to be tried on a new data set

Learning decision tree classifiers
Ockham's razor recommends that we pick the simplest decision tree that is consistent with the training set.
The simplest tree is the one that takes the fewest bits to encode (why? information theory).
There are far too many trees that are consistent with a training set, and searching for the simplest consistent tree is typically not computationally feasible.
Solution:
Use a greedy algorithm: not guaranteed to find the simplest tree, but works well in practice
Or restrict the hypothesis space to a subset of simple trees

Information: some intuitions
Information reduces uncertainty.
Information is relative to what you already know.
The information content of a message is related to how surprising the message is.
Information depends on context.

Information and Uncertainty
[Diagram: a Sender passes a Message to a Receiver.]
You are stuck inside. You send me out to report back to you on what the weather is like. I do not lie, so you trust me. You and I are both generally familiar with the weather in Korea.
On a July afternoon in Korea, I walk into the room and tell you it is hot outside.
On a January afternoon in Korea, I walk into the room and tell you it is hot outside.

Information and Uncertainty
How much information does a message contain?
If my message to you describes a scenario that you expect with certainty, the information content of the message for you is zero.
The more surprising the message is to the receiver, the greater the amount of information conveyed by the message.
What does it mean for a message to be surprising?

Information and Uncertainty
Suppose I have a coin with heads on both sides, and you know that I have a coin with heads on both sides. I toss the coin and, without showing you the outcome, tell you that it came up heads. How much information did I give you?
Suppose I have a fair coin, and you know that I have a fair coin. I toss the coin and, without showing you the outcome, tell you that it came up heads. How much information did I give you?

Information & Coding Theory Without loss of generality, assume that messages are binary made of 0s and s Conveying the outcome of a fair coin toss requires bit of information need to identify one out of two equally likely outcomes Conveying the outcome of an experiment with 8 equally likely outcomes requires 3 bits.. Conveying an outcome of that is certain takes 0 bits In general, if an outcome has a probability p, the information content of the corresponding message is I p log p, I0 0

Information is Subjective
Suppose there are 3 agents, Kim, Lee, and Park, in a world where a die has been tossed. Kim observes that the outcome is a 6 and whispers to Lee that the outcome is even, but Park knows nothing about the outcome.
The probability assigned by Lee to the event "6" is a subjective measure of Lee's belief about the state of the world.
Information gained by Kim by looking at the outcome of the die = log₂ 6 bits
Information conveyed by Kim to Lee = log₂ 6 − log₂ 3 = 1 bit
Information conveyed by Kim to Park = 0 bits

Information and Shannon Entropy
Suppose we have a message that conveys the result of a random experiment with m possible discrete outcomes, with probabilities p_1, p_2, ..., p_m.
The expected information content of such a message is called the entropy of the probability distribution (p_1, p_2, ..., p_m):
H(p_1, ..., p_m) = Σ_{i=1..m} p_i I(p_i)
where I(p_i) = −log₂ p_i provided p_i > 0, and p_i I(p_i) is taken to be 0 otherwise.

Entropy and Coding Theory
Entropy is a measure of the uncertainty associated with a random variable.
Noiseless coding theorem: the Shannon entropy is the minimum average message length, in bits, that must be sent to communicate the true value of a random variable to a recipient.

Shannon's entropy as a measure of information
Let P = (p_1, p_2, ..., p_n) be a discrete probability distribution. The entropy of the distribution P is given by
H(P) = −Σ_{i=1..n} p_i log₂ p_i, with p_i log₂ p_i taken to be 0 when p_i = 0.
Examples:
H(1, 0) = −1·log₂ 1 − 0·log₂ 0 = 0 bits
H(1/2, 1/2) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1 bit
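
A small Python sketch of this definition (the helper name entropy is my own) that reproduces the two examples above:

```python
import math

def entropy(probs):
    """Shannon entropy H(P) = -sum_i p_i log2 p_i, in bits; 0 log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))   # 0.0 bits: the outcome is certain
print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([0.25] * 4))   # 2.0 bits: four equally likely outcomes
```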

The entropy of a binary variable
[Figure: plot of the binary entropy function H(p) = −p log₂ p − (1 − p) log₂(1 − p), which is 0 at p = 0 and p = 1 and reaches its maximum of 1 bit at p = 1/2.]

Properties of Shannon's entropy
a. H(P) ≥ 0, and if there are N possible outcomes, H(P) ≤ log₂ N
b. If p_i = 1/N for all i, then H(P) = log₂ N
c. If there exists an i such that p_i = 1, then H(P) = 0
d. H(P) is a continuous function of P

Shannon's entropy as a measure of information
For any distribution P, H(P) is the optimal number of binary questions required, on average, to determine an outcome drawn from P.
We can extend these ideas to talk about how much information the observation of the outcome of one experiment conveys about the possible outcomes of another (mutual information).
We can also quantify the difference between two probability distributions (Kullback-Leibler divergence, or relative entropy).

Coding theory perspective
Suppose you and I both know the distribution P, and I choose an outcome according to P.
Suppose I want to send you a message about the outcome.
You and I could agree in advance on the questions; I can then simply send you the answers.
The optimal message length, on average, is H(P).
This generalizes to noisy communication.

Entropy and Coding Theory
x      a    b    c    d     e       f       g       h
P(x)   1/2  1/4  1/8  1/16  1/64    1/64    1/64    1/64
code   0    10   110  1110  111100  111101  111110  111111

H(X) = (1/2) log₂ 2 + (1/4) log₂ 4 + (1/8) log₂ 8 + (1/16) log₂ 16 + 4·(1/64) log₂ 64 = 2 bits
Average code length = 1·(1/2) + 2·(1/4) + 3·(1/8) + 4·(1/16) + 4·6·(1/64) = 2 bits
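
A quick check of this example in Python, assuming the code-word lengths reconstructed above (1, 2, 3, 4, 6, 6, 6, 6 bits):

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]   # lengths of the prefix code above

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * l for p, l in zip(probs, code_lengths))

print(entropy)   # 2.0 bits
print(avg_len)   # 2.0 bits: this code meets the entropy bound exactly
```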

Entropy of random variables and sets of random variables
For a random variable X taking values a_1, ..., a_n:
H(X) = −Σ_{i=1..n} P(X = a_i) log₂ P(X = a_i)
If X is a set of random variables:
H(X) = −Σ_x P(x) log₂ P(x), where x ranges over the joint assignments to the variables in X

Joint entropy and conditional entropy
For random variables X and Y, the joint entropy is
H(X, Y) = −Σ_x Σ_y P(x, y) log₂ P(x, y)
The conditional entropy of Y given X is
H(Y | X) = Σ_x P(X = x) H(Y | X = x) = −Σ_x Σ_y P(x, y) log₂ P(y | x)

Joint entropy and conditional entropy
Chain rule for entropy: H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)
Some useful results: H(Y | X) ≤ H(Y), i.e., the entropy never increases after conditioning: observing the value of X reduces our uncertainty about Y, unless X and Y are independent.
When do we have equality?

Example of entropy calculations
P(X=H, Y=H) = 0.2; P(X=H, Y=T) = 0.4; P(X=T, Y=H) = 0.3; P(X=T, Y=T) = 0.1
H(X, Y) = −0.2 log₂ 0.2 − 0.4 log₂ 0.4 − 0.3 log₂ 0.3 − 0.1 log₂ 0.1 = 1.85 bits
P(X=H) = 0.6, so H(X) = 0.97 bits
P(Y=H) = 0.5, so H(Y) = 1.0 bit
P(Y=H | X=H) = 0.2/0.6 = 0.333; P(Y=T | X=H) = 1 − 0.333 = 0.667
P(Y=H | X=T) = 0.3/0.4 = 0.75; P(Y=T | X=T) = 0.1/0.4 = 0.25
H(Y | X) = 0.6·H(0.333, 0.667) + 0.4·H(0.75, 0.25) ≈ 0.88 bits
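
These numbers can be verified with a short Python sketch (the variable names X and Y and the helper H are my own):

```python
import math

def H(probs):
    """Shannon entropy in bits; 0 log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution P(X, Y) from the example above.
joint = {('H', 'H'): 0.2, ('H', 'T'): 0.4, ('T', 'H'): 0.3, ('T', 'T'): 0.1}

H_XY = H(joint.values())                        # ~1.85 bits
p_x = {'H': 0.2 + 0.4, 'T': 0.3 + 0.1}          # marginal of X: (0.6, 0.4)
p_y = {'H': 0.2 + 0.3, 'T': 0.4 + 0.1}          # marginal of Y: (0.5, 0.5)
H_X, H_Y = H(p_x.values()), H(p_y.values())     # ~0.97 and 1.0 bits

# Conditional entropy H(Y|X) = sum_x P(X=x) * H(Y | X=x)
H_Y_given_X = sum(p_x[x] * H([joint[(x, y)] / p_x[x] for y in ('H', 'T')])
                  for x in ('H', 'T'))

print(round(H_XY, 2), round(H_X, 2), round(H_Y, 2), round(H_Y_given_X, 2))
print(round(H_X + H_Y_given_X, 2) == round(H_XY, 2))   # chain rule: True
```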

Mutual information
For random variables X and Y, the average mutual information between X and Y is
I(X; Y) = H(Y) − H(Y | X) = H(X) − H(X | Y)   (using the chain rule)
or, in terms of the probability distributions,
I(X; Y) = Σ_a Σ_b P(a, b) log₂ [ P(a, b) / (P(a) P(b)) ]
Mutual information measures the extent to which observing one variable will reduce the uncertainty in another.
Mutual information is nonnegative, and equal to zero iff X and Y are independent.

Relative entropy
Let P and Q be two distributions over a random variable X. The relative entropy (Kullback-Leibler distance) is a measure of the "distance" from P to Q:
D(P ‖ Q) = Σ_x P(x) log₂ [ P(x) / Q(x) ]
Note: D(P ‖ Q) ≠ D(Q ‖ P), D(P ‖ Q) ≥ 0, and D(P ‖ P) = 0.
KL divergence is not a true distance measure in that it is not symmetric.
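
A tiny Python illustration of the asymmetry (the example distributions P and Q are my own):

```python
import math

def kl(p, q):
    """D(P || Q) = sum_i p_i log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl(P, Q), kl(Q, P))   # two different values: D(P||Q) != D(Q||P)
print(kl(P, P))             # 0.0: a distribution has zero divergence from itself
```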

Learning decision tree classifiers
[Diagram: Nature generates instances; a classifier trained on the training data S = S_1 ∪ ... ∪ S_m assigns class labels.]
On average, the information needed to convey the class membership of a random instance drawn from nature is H(P), where the class label is a random variable with distribution P.
H(P̂) = −Σ_{i=1..m} p̂_i log₂ p̂_i, where P̂ is an estimate of P.
S_i is the multi-set of examples belonging to class C_i.

Learning decision tree classifiers
The task of the learner is then to extract the needed information from the training set and store it in the form of a decision tree for classification.
Information-gain-based decision tree learner:
Start with the entire training set at the root.
Recursively add nodes to the tree corresponding to tests that yield the greatest expected reduction in entropy (the largest expected information gain), until some termination criterion is met (e.g., the training data at every leaf node has zero entropy). A sketch of this procedure is given below.
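
The following Python sketch illustrates this greedy, information-gain-based procedure. It is my own illustration of the idea, not code from the slides: examples are assumed to be dicts mapping attribute names to discrete values, and no pruning or stopping rule beyond pure/empty nodes is implemented.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of the class labels of a (multi)set of examples."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Expected reduction in entropy obtained by splitting on attribute `attr`."""
    total = len(examples)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def grow_tree(examples, labels, attributes):
    """Greedy information-gain tree learner (an ID3-style sketch, no pruning)."""
    if entropy(labels) == 0.0:                    # pure node: make it a leaf
        return labels[0]
    if not attributes:                            # no tests left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = grow_tree([examples[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [a for a in attributes if a != best])
    return tree
```

The recursion stops when a node is pure or no attributes remain; a practical learner would add a pruning step such as those discussed later in these slides.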


Which attribute to choose?
[Figure: candidate attribute tests and the class distributions and entropies of the subsets they produce.]


Learning decision tree classifiers: Example
Instances: ordered 3-tuples of attribute values corresponding to
Height ∈ {tall, short}, Hair ∈ {dark, blonde, red}, Eye ∈ {blue, brown}

Training Data
Instance  Height  Hair    Eye    Class label
I1        tall    dark    blue   +
I2        short   dark    blue   +
I3        tall    blonde  blue   −
I4        tall    red     blue   −
I5        short   blonde  blue   −
I6        tall    blonde  brown  +
I7        tall    dark    brown  +
I8        short   blonde  brown  +

Learning decision tree classifiers: Example
H(S) = −(3/8) log₂(3/8) − (5/8) log₂(5/8) = 0.954 bits
S_tall = {I1, I3, I4, I6, I7}, S_short = {I2, I5, I8}
H(S | Height = tall) = −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.971 bits
H(S | Height = short) = −(2/3) log₂(2/3) − (1/3) log₂(1/3) = 0.918 bits
H(S | Height) = (5/8)·H(S | Height = tall) + (3/8)·H(S | Height = short) = (5/8)(0.971) + (3/8)(0.918) = 0.951 bits
Similarly, H(S | Hair) = (3/8)·H(S | Hair = dark) + (4/8)·H(S | Hair = blonde) + (1/8)·H(S | Hair = red) = 0.5 bits, and H(S | Eye) = 0.607 bits
Hair is the most informative attribute because it yields the largest reduction in entropy; a test on the value of Hair is chosen to correspond to the root of the decision tree.
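
A short, self-contained Python check of these numbers (the tuple encoding of the training data is mine, following the table above):

```python
import math

def H(pos, neg):
    """Entropy, in bits, of a node with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

# (Height, Hair, Eye, class) for I1..I8, as in the table above.
data = [('t', 'd', 'blue', '+'), ('s', 'd', 'blue', '+'), ('t', 'b', 'blue', '-'),
        ('t', 'r', 'blue', '-'), ('s', 'b', 'blue', '-'), ('t', 'b', 'brown', '+'),
        ('t', 'd', 'brown', '+'), ('s', 'b', 'brown', '+')]

for name, col in (('Height', 0), ('Hair', 1), ('Eye', 2)):
    remainder = 0.0
    for value in set(row[col] for row in data):
        subset = [row for row in data if row[col] == value]
        pos = sum(1 for row in subset if row[3] == '+')
        remainder += len(subset) / len(data) * H(pos, len(subset) - pos)
    print(name, round(H(5, 3) - remainder, 3))   # information gain; Hair is largest
```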

Learning decision tree classifiers: Example
The resulting tree: the root tests Hair; Hair = dark → +, Hair = red → −, and Hair = blonde → test Eye, with Eye = blue → − and Eye = brown → +.
Compare the result with Naïve Bayes.
In practice, we need some way to prune the tree to avoid overfitting the training data.

Learning, generalization, overfitting
Consider the error of a hypothesis h over
the training data: error_train(h)
the entire distribution D of data: error_D(h)
Hypothesis h overfits the training data if there is an alternative hypothesis h′ such that
error_train(h) < error_train(h′) and error_D(h) > error_D(h′)

Overfitting in decision tree learning
[Figure: accuracy on training and test data as a function of tree size, e.g. on the diabetes dataset.]

Causes of overfitting
As we move further away from the root, the data set used to choose the best test becomes smaller, leading to poor estimates of entropy.
Noisy examples can further exacerbate overfitting.

Minimizing overfitting
Use roughly the same size sample at every node to estimate entropy, when there is a large data set from which we can sample.
Stop when a further split fails to yield a statistically significant information gain (estimated from a validation set).
Grow the full tree, then prune.
Minimize size(tree) + size(exceptions(tree)).

Reduced error pruning
Each decision node in the tree is considered as a candidate for pruning.
Pruning a decision node consists of
removing the subtree rooted at that node,
making it a leaf node, and
assigning it the most common label at that node (or storing the class distribution at that node, for probabilistic classification).

Reduced error pruning
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each candidate node.
2. Greedily select the node which most improves performance on the validation set when the subtree rooted at that node is pruned.
Drawback: holding back the validation set limits the amount of training data available; this is not desirable when the data set is small.

Reduced error pruning: Example
[Figure: a table of the validation-set accuracy gain obtained by pruning each candidate node (pruning node A reduces accuracy, pruning node B improves it), together with the tree before pruning and the tree after the subtree rooted at B is replaced by a leaf.]

Reduced error pruning
[Figure: the pruned tree.]

χ² pruning
Evaluate a candidate split to decide if the resulting information gain is significantly greater than zero, as determined using a suitable hypothesis test at a desired significance level.
Example: χ² statistic, 2-class, binary [L (left), R (right)] split case
n_1 of class 1, n_2 of class 2; N = n_1 + n_2
Assume a candidate split s sends pN instances to L and (1 − p)N to R.
A random split having this probability (i.e., the null hypothesis) would send p·n_1 of class 1 to L and p·n_2 of class 2 to L.
Deviation of the results due to s from the random split:
χ² = Σ_i (n_iL − n_iE)² / n_iE
n_iL: number of patterns in class i sent to the left under decision s
n_iE = p·n_i: number expected under the random rule

χ² pruning
The critical value of χ² depends on the number of degrees of freedom, which is 1 in this case: for a given p, n_1L fully specifies n_2L, n_1R, and n_2R.
In general, the number of degrees of freedom can be > 1.

Pruning based on whether information gain is significantly greater than zero
A node receives 100 instances, 50 of class 1 and 50 of class 2. The candidate split sends 50 instances to L (all of class 1) and 50 to R (all of class 2), so n_1L = 50, n_2L = 0, n_L = 50, and p = 0.5.
Under the null hypothesis: n_1E = p·n_1 = 25 and n_2E = p·n_2 = 25.
χ² = (50 − 25)²/25 + (0 − 25)²/25 = 25 + 25 = 50
This split is significantly better than random with confidence > 99%, because χ² = 50 > 6.64 (the 1% critical value for 1 degree of freedom).
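
A sketch of this computation in Python; scipy (assumed to be available) is used only to look up the critical value:

```python
from scipy.stats import chi2

n1, n2 = 50, 50             # class counts at the node
n1_left, n2_left = 50, 0    # observed counts sent to the left branch by the split
p = (n1_left + n2_left) / (n1 + n2)   # fraction of all instances sent left: 0.5

# Chi-squared deviation of the observed split from the random split, summed over
# the left-branch counts as in the two-class formula above.
chi_sq = sum((obs - p * n) ** 2 / (p * n) for obs, n in [(n1_left, n1), (n2_left, n2)])
critical = chi2.ppf(0.99, df=1)       # ~6.63 for 1 degree of freedom at the 1% level

print(chi_sq, critical, chi_sq > critical)   # 50.0 6.63... True: keep the split
```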

χ² critical value table
[Table of χ² critical values by significance level and degrees of freedom; e.g., for 1 degree of freedom the 1% critical value is 6.64.]

Pruning based on whether information gain is significantly greater than zero
In general, for |Classes| classes and |Branches| branches:
χ² = Σ_{i=1..|Classes|} Σ_{j=1..|Branches|} (n_ij − n_ijE)² / n_ijE,  where n_ijE = p_j · n_i
and p_j is the fraction of instances sent down branch j (Σ_j p_j = 1).
Degrees of freedom = (|Classes| − 1)(|Branches| − 1)
The greater the value of χ², the less likely it is that the split is random; for a sufficiently high value of χ², the difference between the expected (random) and the observed split is statistically significant, and we reject the null hypothesis that the split is random.

Rule post-pruning
Convert the tree to an equivalent set of rules:
IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
...

Rule post-pruning
1. Convert the tree to an equivalent set of rules.
2. Prune each rule independently of the others, by dropping one condition at a time if doing so does not reduce the estimated accuracy at the desired confidence level.
3. Sort the final rules in order of lowest to highest error for classifying new instances.
Advantage: can potentially correct bad choices made close to the root.
Post-pruning based on a validation set is the most commonly used method in practice.

Classification of instances
Unique classification: possible when each leaf has zero entropy and there are no missing attribute values.
Most likely or probabilistic classification: based on the distribution of classes at a node, when there are no missing attribute values.

Handling different types of attribute values
Types of attributes:
Nominal: values are names
Ordinal: values are ordered
Cardinal (numeric): values are numbers, hence ordered

Handling numeric attributes
Attribute T   40  48  50  54  60  70
Class         N   N   Y   Y   Y   N
Candidate splits: T < (48+50)/2 = 49?  T < (60+70)/2 = 65?
E(S, T: 49?) = (2/6)·0 + (4/6)·[−(3/4) log₂(3/4) − (1/4) log₂(1/4)] = 0.541 bits
Sort the instances by the value of the numeric attribute under consideration.
For each attribute, find the test which yields the lowest entropy.
Greedily choose the best test across all attributes.
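
A small Python sketch of the threshold search (for simplicity it tries the midpoint between every adjacent pair of distinct values, rather than only the points where the class changes):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted entropy) of the best binary split on a numeric attribute."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                               # no threshold between equal values
        thr = (pairs[i][0] + pairs[i + 1][0]) / 2  # midpoint candidate
        left = [lab for v, lab in pairs if v < thr]
        right = [lab for v, lab in pairs if v >= thr]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or e < best[1]:
            best = (thr, e)
    return best

# The temperature example above: the best threshold is 49, with entropy ~0.54 bits.
print(best_threshold([40, 48, 50, 54, 60, 70], ['N', 'N', 'Y', 'Y', 'Y', 'N']))
```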

Handling numeric attributes
[Figure: a two-dimensional data set with classes C1 and C2, comparing an axis-parallel split with an oblique split.]
Oblique splits cannot be realized by univariate tests.

Two-way versus multi-way splits
The entropy criterion favors many-valued attributes.
Pathological behavior: what if, in a medical diagnosis data set, social security number is one of the candidate attributes?
Solutions:
Only two-way splits (CART): A = value versus A ≠ value
Gain ratio (C4.5):
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = −Σ_{i ∈ Values(A)} (|S_i| / |S|) log₂ (|S_i| / |S|)
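
A minimal sketch of the gain-ratio correction; the gain value 0.454 for Hair is taken from the earlier worked example, and the subset sizes 3, 4, 1 are the sizes of its dark, blonde, and red branches:

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum_i (|S_i|/|S|) log2(|S_i|/|S|)."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(subset_sizes)

print(round(gain_ratio(0.454, [3, 4, 1]), 3))   # ~0.32: gain normalized by split info
```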

Alternative split criteria
Consider a split of the set S based on attribute A:
Impurity(S, A) = Σ_{j ∈ Values(A)} (|S_j| / |S|) · Impurity(S_j)
Entropy impurity: Impurity(Z) = −Σ_{i ∈ Classes} (|Z_i| / |Z|) log₂ (|Z_i| / |Z|)
Gini impurity: Impurity(Z) = Σ_{i ≠ j} (|Z_i| / |Z|)(|Z_j| / |Z|) = 1 − Σ_{i ∈ Classes} (|Z_i| / |Z|)²
The Gini impurity is the expected rate of error if the class label is picked randomly according to the distribution of instances in the set.
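
Both impurity measures are a few lines of Python (my own helpers, written over class counts):

```python
import math

def entropy_impurity(class_counts):
    """Entropy(Z) = -sum_i (|Z_i|/|Z|) log2(|Z_i|/|Z|)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def gini_impurity(class_counts):
    """Gini(Z) = 1 - sum_i (|Z_i|/|Z|)^2: expected error rate of a random labeling."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(entropy_impurity([5, 3]), gini_impurity([5, 3]))   # an impure node
print(entropy_impurity([8, 0]), gini_impurity([8, 0]))   # a pure node: both are 0
```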

Incorporating attribute costs
Not all attribute measurements are equally costly or risky. Example, in medical diagnosis:
A blood test has a cost of $150
Exploratory surgery may have a cost of $3000
Goal: learn a decision tree classifier which minimizes the cost of classification.
Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost

Incorporating different misclassification costs for different classes
Not all misclassifications are equally costly: an occasional false alarm about a nuclear power plant meltdown is less costly than the failure to alert when there is a chance of a meltdown.
Weighted Gini impurity:
Impurity(S) = Σ_{i,j} λ_ij (|S_i| / |S|)(|S_j| / |S|)
where λ_ij is the cost of wrongly assigning an instance belonging to class i to class j.

Dealing with missing attribute values
Solution 1:
Sometimes, the fact that an attribute value is missing might itself be informative: a missing blood sugar level might imply that the physician had reason not to measure it.
Introduce a new value (one per attribute) to denote a missing value.
Decision tree construction and use of the tree for classification proceed as before.

Dealing with missing attribute values
Solution 2:
During decision tree construction: replace a missing attribute value in a training example with the most frequent value found among the instances at the node.
During use of the tree for classification: replace a missing attribute value in an instance to be classified with the most frequent value found among the training instances at the node.

Dealing with missing attribute values
Solution 3:
During decision tree construction: replace a missing attribute value in a training example with the most frequent value found among the instances at the node that have the same class label as the training example.
During use of the tree for classification: same as solution 2.

Dealing with missing attribute values
Solution 4:
During decision tree construction: generate several fractionally weighted training examples based on the distribution of values for the corresponding attribute at the node.
During use of the tree for classification: generate multiple instances by assigning candidate values for the missing attribute based on the distribution of instances at the node; sort each such instance through the tree to generate candidate labels, and assign the most probable class label (or probabilistically assign a class label).
Used in C4.5.

Dealing with missing attribute values
[Tree: the root tests A. A = 1 leads to a + leaf; A = 0 leads to a test on B, with B = 1 leading to a − leaf and B = 0 leading to a + leaf. Counts: n+ = 60, n− = 40; n+(A=1) = 50; n+(A=0) = 10, n−(A=0) = 40; n−(A=0, B=1) = 40; n+(A=0, B=0) = 10.]
Suppose B is missing:
Replacement with the most frequent value at the node: B = 1
Replacement with the most frequent value if the class is +: B = 0

Dealing with missing attribute values
[Same tree as above: n+ = 60, n− = 40; n+(A=1) = 50; n+(A=0) = 10, n−(A=0) = 40; n−(A=0, B=1) = 40; n+(A=0, B=0) = 10.]
Suppose B is missing:
Fractional instances based on the distribution at the node: weight 4/5 for B = 1, 1/5 for B = 0
Fractional instances based on the distribution at the node for class +: weight 1 for B = 0, 0 for B = 1

See5/C5.0 (1997)
Boosting
New data types (e.g., dates), N/A values, variable misclassification costs, attribute pre-filtering
Unordered rulesets: all applicable rules are found and voted
Improved scalability: multi-threading, multi-core/CPUs

Recent developments in decision trees
Learning decision trees from distributed data
Learning decision trees from attribute value taxonomies and partially specified data
Learning decision trees from relational data
Decision trees for multi-label classification tasks
Learning decision trees from very large data sets
Learning decision trees from data streams
Other research issues:
Stable trees (e.g., against leave-one-out cross-validation)
Decomposing complex trees for comprehensibility
Gradient Boosted Decision Trees (GBDT)

Summary of decision trees
Simple
Fast: linear in the size of the tree, linear in the size of the training set, linear in the number of attributes
Produce easy-to-interpret rules
Good for generating simple predictive rules from data with lots of attributes