CS 6375 Machine Learning - PDF Free Download

CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1

Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short} Class: Country = {G, P} Decision Tree Example (I) 3 The class of a new input can be classified by following the tree all the way down to a leaf and by reporting the output of the leaf. For example: (B,T) is classified as (D,S) is classified as 4 2

Decision Trees Decision Trees are classifiers. Instances represented as feature vectors. Nodes are (equality and inequality) tests for feature values. There is one branch for each value of the feature. Leaves specify the categories (labels). Can categorize instances into multiple disjoint categories. 5 General Case (Discrete Attributes) We have R observations from training data Each observation has M attributes X 1,..,X M Each X i can take N i distinct discrete values Each observation has a class attribute Y with C distinct (discrete) values Problem: Construct a sequence of tests on the attributes such that, given a new input (x 1,..,x M ), the class attribute y is correctly predicted X=attributes of training data (RxM) Y=Class of training data (R) Note: There are other notations used for data representation 6 3

General Decision Tree (Discrete Attributes) 7 Decision Tree Example (II) 8 4

The class of a new input can be classified by following the tree all the way down to a leaf and by reporting the output of the leaf. For example: (0.2,0.8) is classified as (0.8,0.2) is classified as 9 General Case (Continuous Attributes) We have R observations from training data Each observation has M attributes X 1,..,X M Each X i can take continuous values Each observation has a class attribute Y with C distinct (discrete) values Problem: Construct a sequence of tests of the form X i < t i? on the attributes such that, given a new input (x 1,..,x M ), the class attribute y is correctly predicted X=attributes of training data (RxM) Y=Class of training data (R) 10 5

General Decision Tree (Continuous Attributes) 11 Side: Decision Tree Decision Boundaries Decision trees divide the feature space into axis-parallel rectangles, and label each rectangle with one of the class labels. 12 6

Decision Boundaries Also called decision surfaces (high dimensional space) Take binary classification as an example: it is a hypersurface that partitions the feature space into two sets, one for each class. If the decision surface is a hyperplane, then the classification problem is linear, and the classes are linearly separable. Instances on one side of the hyperplane belong to one class, those on the other side belong to the other class. - + 13 Basic Questions How to choose the attribute/value to split at each level of the tree? When to stop splitting? When should a node be declared a leaf? If a leaf node is impure, how should the class label be assigned? 14 7

How to choose the attribute/value to split on at each level of the tree? Two classes (red circles/green crosses) Two attributes: X 1 and X 2 11 points in training data Idea: Construct a decision tree such that the leaf nodes predict correctly the class for all the training examples 15 How to choose the attribute/value to split on at each level of the tree? 16 8

This node is pure because there is only one class left No ambiguity in the class label This node is almost pure Little ambiguity in the class label These nodes contain a mixture of classes Do not disambiguate between the classes 17 We want to find the most compact, smallest size tree (Occam s razor), that classifies the training data correctly. We want to find the split choices that will get us the fastest to pure nodes This node is pure because there is only one class left No ambiguity in the class label This node is almost pure Little ambiguity in the class label These nodes contain a mixture of classes Do not disambiguate between the classes 18 9

Entropy Entropy is a measure of the impurity of a distribution, defined as: P i = probability of occurrence of value i High entropy All the classes are (nearly) equally likely Low entropy A few classes are likely; most of the classes are rarely observed assume 0 log 2 0 = 0 19 Entropy The entropy captures the degree of purity of the distribution 20 10

Example Entropy Calculation 21 Conditional Entropy Entropy before splitting: H After splitting, a fraction P L of the data goes to the left node, which has entropy H L After splitting, a fraction P R of the data goes to the right node, which has entropy H R The average entropy (or conditional entropy ) after splitting is: 22 11

Information Gain We want nodes as pure as possible We want to reduce the entropy as much as possible We want to maximize the difference between the entropy of the parent node and the expected entropy of the children Maximize: 23 Notations Entropy: H(Y) = Entropy of the distribution of classes at a node Conditional Entropy: Discrete: H(Y X j ) = Entropy after splitting with respect to variable j Continuous: H(Y X j,t) = Entropy after splitting with respect to variable j with threshold t Information gain: Discrete: IG(Y X j ) = H(Y) - H(Y X j ) Continuous: IG(Y X j,t) = H(Y) - H(Y X j,t) 24 12

27 Another Illustrative Example Day Outlook Temperature Humidity Wind PlayTennis 1 Sunny Hot High Weak No 2 Sunny Hot High Strong No 3 Overcast Hot High Weak Yes 4 Rain Mild High Weak Yes 5 Rain Cool Normal Weak Yes 6 Rain Cool Normal Strong No 7 Overcast Cool Normal Strong Yes 8 Sunny Mild High Weak No 9 Sunny Cool Normal Weak Yes 10 Rain Mild Normal Weak Yes 11 Sunny Mild Normal Strong Yes 12 Overcast Mild High Strong Yes 13 Overcast Hot Normal Weak Yes 14 Rain Mild High Strong No 28 14

Another Illustrative Example Day Outlook Temperature Humidity Wind PlayTennis Entropy(S) = 9 14 5 14 = 0.94 log( 9 ) 14 log( 5 ) 14 1 Sunny Hot High Weak No 2 Sunny Hot High Strong No 3 Overcast Hot High Weak Yes 4 Rain Mild High Weak Yes 5 Rain Cool Normal Weak Yes 6 Rain Cool Normal Strong No 7 Overcast Cool Normal Strong Yes 8 Sunny Mild High Weak No 9 Sunny Cool Normal Weak Yes 10 Rain Mild Normal Weak Yes 11 Sunny Mild Normal Strong Yes 12 Overcast Mild High Strong Yes 13 Overcast Hot Normal Weak Yes 14 Rain Mild High Strong No 29 Another Illustrative Example Humidity Wind PlayTennis High Weak No High Strong No High Weak Yes High Weak Yes Normal Weak Yes Normal Strong No Normal Strong Yes High Weak No Normal Weak Yes Normal Weak Yes Normal Strong Yes High Strong Yes Normal Weak Yes High Strong No 9+,5-9+,5- E=.94 30 15

Another Illustrative Example Humidity Wind PlayTennis High Weak No Humidity Wind High Strong No High Weak Yes High Weak Yes Normal Weak Yes High Normal Weak Strong Normal Strong No 3+,4-6+,1-6+2-3+,3- Normal Strong Yes High Weak No E=.985 E=.592 E=.811 E=1.0 Normal Weak Yes Normal Weak Yes IG(S Humidity)= IG(S Wind)= Normal Strong Yes.94-7/14 0.985.94-8/14 0.811 High Strong Yes - 7/14 0.592= - 6/14 1.0 = Normal Weak Yes 0.151 0.048 High Strong No IG(S a) = Entropy(S) v values(a) S v Entropy(S v ) S 9+,5- E=.94 31 Another Illustrative Example Outlook IG(S Humidity)=0.151 IG(S Wind)=0.048 IG(S Temperature)=0.029 IG(S Outlook)=0.246 32 16

Another Illustrative Example Sunny Outlook Overcast Rain? Yes? Continue until: Every attribute is included in path, or, All examples in the leaf have same label Day Outlook PlayTennis 1 Sunny No 2 Sunny No 3 Overcast Yes 4 Rain Yes 5 Rain Yes 6 Rain No 7 Overcast Yes 8 Sunny No 9 Sunny Yes 10 Rain Yes 11 Sunny Yes 12 Overcast Yes 13 Overcast Yes 14 Rain No 33 Another Illustrative Example Outlook IG(S sunny Humidity) =.97-(3/5) 0-(2/5) 0 =.97 IG(S sunny Temp) =.97-0-(2/5) 1 =.57 Sunny 3,7,12,13 4,5,6,10,14 4+,0-3+,2-1,2,8,9,11 2+,3- Overcast Rain 3,7,12,13 4,5,6,10,14 4+,0-3+,2-1,2,8,9,11 2+,3-? Yes? IG(S sunny Wind) =.97-(2/5) 1 - (3/5).92=.02 Day Outlook Temperature Humidity Wind PlayTennis 1 Sunny Hot High Weak No 2 Sunny Hot High Strong No 8 Sunny Mild High Weak No 9 Sunny Cool Normal Weak Yes 11 Sunny Mild Normal Strong Yes 34 17

Another Illustrative Example Outlook Sunny Overcast Rain? Yes? 35 Another Illustrative Example Outlook Sunny 3,7,12,13 4,5,6,10,14 4+,0-3+,2-1,2,8,9,11 2+,3- Overcast Rain 3,7,12,13 4,5,6,10,14 4+,0-3+,2-1,2,8,9,11 2+,3- Humidity Yes? High No Normal Yes 36 18

Another Illustrative Example Outlook Overcast Rain 3,7,12,13 4+,0-4,5,6,10,14 3+,2- Yes Wind Sunny 1,2,8,9,11 2+,3- Humidity High No Normal Yes Strong No Weak Yes 37 Basic Questions How to choose the attribute/value to split on at each level of the tree? When to stop splitting? When should a node be declared a leaf? If a leaf node is impure, how should the class label be assigned? If the tree is too large, how can it be pruned? 38 19

Pure and Impure Leaves and When to Stop Splitting All the data in the node comes from a single class. We declare the node to be a leaf and stop splitting. This leaf will output the class of the data it contains. Several data points have exactly the same attributes even though they are not from the same class. We cannot split any further We still declare the node to be a leaf, but it will output the class that is the majority of the classes in the node 39 ID3 Decision Tree Algorithm (Discrete Attributes) LearnTree(X,Y) Input: Set X of R training vectors, each containing the values (x 1,..,x M ) of M attributes (X 1,..,X M ) A vector Y of R elements, where y j = class of the j th datapoint If all the datapoints in X have the same class value y Return a leaf node that predicts y as output If all the datapoints in X have the same attribute value (x 1,..,x M ) Return a leaf node that predicts the majority of the class values in Y as output Try all the possible attributes X j and choose the one, j*, for which IG(Y X j ) is maximum For every possible value v of X j* : X v,y v = set of datapoints for which x j* = v and corresponding classes Child v LearnTree(X v,y v ) 40 20

ID3 Decision Tree Algorithm (Continuous Attributes) LearnTree(X,Y) Input: Set X of R training vectors, each containing the values (x 1,..,x M ) of M attributes (X 1,..,X M ) A vector Y of R elements, where y j = class of the j th datapoint If all the datapoints in X have the same class value y Return a leaf node that predicts y as output If all the datapoints in X have the same attribute value (x 1,..,x M ) Return a leaf node that predicts the majority of the class values in Y as output Try all the possible attributes X j and threshold t and choose the one, j*, for which IG(Y X j,t) is maximum X L,Y L = set of datapoints for which x j * < t and corresponding classes X H,Y H = set of datapoints for which x j * >= t and corresponding classes Left Child LearnTree(X L,Y L ) Right Child LearnTree(X H,Y H ) 41 Expressiveness of Decision Trees Can represent any Boolean function. Can be rewritten as rules in Disjunctive Normal Form (DNF) No: ((outlook=sunny)^(humidity=high)) V ((outlook=rain)^(wind=strong) ) Yes: (outlook=overcast) V ((outlook=sunny)^humidity=normal) V ((outlook=rain)^(wind=weak)) 42 21

Decision Trees So Far Given R observations from training data, each with M attributes X and a class attribute Y, construct a sequence of tests (decision tree) to predict the class attribute Y from the attributes X Basic strategy for defining the tests ( when to split ) maximize the information gain on the training data set at each node of the tree Problem (next): Evaluating the tree on training data is dangerous overfitting 43 The Overfitting Problem (Example) Suppose that, in an ideal world, class B is everything such that X 2 >= 0.5 and class A is everything with X 2 < 0.5 Note that attribute X 1 is irrelevant Seems like generating a decision tree would be trivial 44 22

The Overfitting Problem (Example) However, we collect training examples from the perfect world through some imperfect observation device As a result, the training data is corrupted by noise. 45 The Overfitting Problem: Example Because of the noise, the resulting decision tree is far more complicated than it should be This is because the learning algorithm tries to classify all of the training set perfectly. This is a fundamental problem in learning: overfitting 46 23

The Overfitting Problem: Example The effect of overfitting is that the tree is guaranteed to classify the training data perfectly, but it may do a terrible job at classifying new test data. Example: (0.6,0.9) is classified as A 47 The Overfitting Problem: Example It would be nice to identify automatically that splitting this node is stupid. Possible criterion: figure out that splitting this node will lead to a complicated tree suggesting noisy data The effect of overfitting is that the tree is guaranteed to classify the training data perfectly, but it may do a terrible job at classifying new test data. Example: (0.6,0.9) is classified as A 48 24

The Overfitting Problem: Example Note that, even though the attribute X 1 is completely irrelevant in the original distribution, it is used to make the decision at that node The effect of overfitting is that the tree is guaranteed to classify the training data perfectly, but it may do a terrible job at classifying new test data. Example: (0.6,0.9) is classified as A 49 Possible Overfitting Solution: post-pruning Grow tree based on training data (unpruned tree) Prune the tree by removing useless nodes based on additional test data (not used for training) 50 25

Unpruned decision tree from training data 51 Training data with the partitions induced by the decision tree (Notice the tiny regions at the top necessary to correctly classify the A outliers!) Unpruned decision tree from training data 52 26

Unpruned decision tree from training data Performance (% correctly classified) Training: 100% Test: 77.5% 53 Pruned decision tree from training data Performance (% correctly classified) Training: 95% Test: 80% 54 27

Pruned decision tree from training data Performance (% correctly classified) Training: 80% Test: 97.5% 55 56 28

Using Test Data General principle: As the complexity of the classifier increases (depth of the decision tree), the performance on the training data increases and the performance on the test data decreases when the classifier overfits the training data. 57 Decision Tree Pruning Construct the entire tree as before Starting at the leaves, recursively eliminate splits: Evaluate performance of the tree on test data (also called validation data, or held-out data set) Prune the tree if the classification performance increases by removing the split There are other methods (early stopping, significance test) 58 29

Inductive Learning The decision tree approach is one example of an inductive learning technique: Suppose that data x is related to output y by an unknown function y = f(x) Suppose that we have observed training examples {(x 1,y 1 ),..,(x n,y n )} Inductive learning problem: Recover a function h (the hypothesis ) such that h(x) f(x) y = h(x) predicts y from the input data x The challenge: The hypothesis space (the space of all hypothesis h of a given form; for example the space of all of the possible decision trees for a set of M attributes) is huge + many different hypotheses may agree with the training data. 59 Inductive Learning What property should h have? It should agree with the training data 60 30

Inductive Learning Two hypotheses that fit the training data perfectly What property should h have? It should agree with the training data But that can lead to arbitrarily complex hypotheses and there are many of them; which one should we choose? 61 Inductive Learning What property should h have? It should agree with the training data But that can lead to arbitrarily complex hypotheses Which leads to completely wrong prediction on new test data The model does not generalize beyond the training data. It overfits the training data 62 31

Inductive Learning Simplicity principle (Occam s razor): entities are not to be multiplied beyond necessity The simpler hypothesis is preferred Compromise between: Error on data under hypothesis h Complexity of hypothesis h 63 Inductive Learning Different illustration, same concept. 64 32

Inductive Learning Decision tree is one example of inductive learning In many supervised learning algorithms, the goal is to minimize: Error on data + complexity of model 65 Summary: Decision Trees Information Gain (IG) criterion for choosing splitting criteria at each level of the tree. Versions with continuous attributes and with discrete (categorical) attributes Basic tree learning algorithm leads to overfitting of the training data Pruning with: Additional test data (not used for training) Statistical significance tests Example of inductive learning 66 33

Decision Trees on Real Problems Must consider the following issues: 1. Assessing the performance of a learning algorithm 2. Inadequate attributes 3. Noise in the data 4. Missing values 5. Attributes with numeric values 6. Bias in attribute selection 67 Assessing the Performance of a Learning Algorithm Performance task: predict the classifications of unseen examples Assessing prediction quality after tree construction: check the classifier s predictions on a test set (development set) 68 34

Learning Curve 69 Evaluation Methodology 1. Collect a large set of examples 2. Divide it into two disjoint sets: the training set and the test set 3. Use the learning algorithm with the training set to generate a hypothesis H 4. Measure the percentage of examples in the test set that are classified correctly by H 5. Repeat steps 1 to 4 for different sizes of training sets and different randomly selected training sets of each size. 70 35

Inadequate Attributes Cause inconsistent instances Lead to larger decision trees as more splits are required 71 Dealing with Noise: The Problem of Overfitting Definition: finding meaningless regularity in the data This problem is general to all learning algorithms Solution for decision trees: post-pruning. Stop growing the tree when the splits are not statistically significant 72 36

Noisy Data Incorrect attribute values. Incorrect class labels. The decision tree algorithm may fail to find a tree consistent with all training examples. A further complication: may or may not know whether data is noisy. 73 Unknown Attribute Values 1. Throw away instances during training; during testing, try all paths, letting leaves vote. 2. Take class average. 3. Build another classifier to fill in the missing value. 4. Use a probabilistic approach (allowing a sample to go to different branches, with probabilities) Day Outlook Temperature Humidity Wind PlayTennis 1 Sunny Hot High Weak No 2 Sunny Hot? Strong No 3 Overcast Hot High Weak Yes 4 Rain Mild High Weak Yes 74 37

Attributes with Numeric Values Look for best splits. 1. Sort values 2. Create Boolean variables out of mid points 3. Evaluate all of these using information gain formula Example: Length (L): 10 15 21 28 32 40 50 Class: - + + - + + - 75 Bias in Attribute Selection Problem: Metric chooses higher branching attributes (reducing uncertainty more) Solution: Take into account the branching factor 76 38

Representational Restrictions Just consider boolean concept learning. Concept description: Problems: Inefficient representation for some functions. Can t test two features simultaneously. 77 Decision Tree Recap Internal decision nodes Univariate: uses a single attribute, x i Numeric x i : binary split Discrete x i : n-way split for n possible values Leaves Classification: class labels, or proportions Regression: numeric Learning is greedy, find the best split recursively See textbooks for inductive bias Hypothesis representation Ways to find the best hypothesis 78 39

Advantages Relatively fast (only process all the data at root) Can be converted to rules Easy interpretation Often yields compact model Can do feature selection 79 Disadvantages Large or complex trees are not intelligible Trees don t easily represent some basic concepts such as M-of-N, parity, Trees don t handle real-valued features as well as categorical ones Not good at regression Recursive partitions fragment data (not enough data as tree descends) Similar rules may be represented by very different trees Can t model dependency among attributes 80 40

Decision Tree Remarks When to use DT? classification, model interpretation is important, not many features, missing values, linear combination of feature is not critical, Further improvement: increasing representation power incorporating background knowledge optimized for cost or other metrics, not accuracy better search methods multivariate trees 81 Popular DT packages ID3 [Quinlan] CART [Breiman] C4.5 [Quinlan] IND [Buntine] Weka 82 41