Class #05: Mutual Information & Decision Trees
Machine Learning (CS 49/59): M. Allen, 14 Sept. 2018

Review: Decisions Based on Attributes
} Training set: cases where patrons have decided to wait or not, along with the associated attributes for each case
} We now want to learn a tree that agrees with the decisions already made, in hopes that it will allow us to predict future decisions

Friday, 14 Sep. 2018, Machine Learning (CS 49/59)

Review: Decision Tree Functions
} For the examples given, here is a true tree (one that will lead from the inputs to the same outputs)
[Figure: the restaurant decision tree, rooted at Patrons? (None / Some / Full), with a WaitEstimate? test (>60, 30-60, 10-30, 0-10) and further tests on Alternate?, Bar?, Reservation?, Fri/Sat?, Hungry?, and Raining?]

Decision Tree Learning Algorithm

function DECISION-TREE-LEARNING(examples, attributes, parent_examples) returns a tree
  if examples is empty then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return PLURALITY-VALUE(examples)
  else
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value v_k of A do
      exs ← {e : e ∈ examples and e.A = v_k}
      subtree ← DECISION-TREE-LEARNING(exs, attributes − A, examples)
      add a branch to tree with label (A = v_k) and subtree subtree
    return tree

} PLURALITY-VALUE(): returns the output decision-value for the majority of the examples
} IMPORTANCE(): rates attributes for their importance in making decisions for the given set of examples (the only actually complex part)
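The pseudocode above maps almost line-for-line onto a short recursive implementation. The sketch below is illustrative rather than the course's reference code: it assumes each example is a dictionary with an "Output" key, and it takes the IMPORTANCE rating as a function argument so that any attribute-scoring rule can be plugged in.

```python
from collections import Counter

def plurality_value(examples):
    """Most common Output among the examples (ties broken arbitrarily)."""
    return Counter(e["Output"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples, importance):
    # Base case 1: no examples left on this branch.
    if not examples:
        return plurality_value(parent_examples)
    # Base case 2: all remaining examples agree on the classification.
    classes = {e["Output"] for e in examples}
    if len(classes) == 1:
        return classes.pop()
    # Base case 3: no attributes left to test.
    if not attributes:
        return plurality_value(examples)
    # Recursive case: split on the most important attribute.
    A = max(attributes, key=lambda a: importance(a, examples))
    tree = {A: {}}
    for v in {e[A] for e in examples}:
        exs = [e for e in examples if e[A] == v]
        tree[A][v] = decision_tree_learning(
            exs, [a for a in attributes if a != A], examples, importance)
    return tree
```

With a constant importance function, `max` simply takes attributes in list order, which is exactly the "choose attributes in a fixed order" behavior explored in the next example; an information-gain IMPORTANCE can be substituted without changing this code.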
} The precise tree we build will depend upon the order in which the algorithm chooses attributes and splits up the examples
} Suppose we have the following training set of 6 examples, defined by the boolean attributes A, B, C, with outputs as shown:
[Table: examples 1-6, each with values for A, B, and C and its Output]
} We will consider two possible orders for the attributes when we build our tree: {A, B, C} and {C, B, A}

} Suppose we use the order {A, B, C}: start by dividing up cases based on variable A
[Figure: root test A?; cases 1 and 3 follow one branch, and cases 2, 4, 5, and 6 the other. Each leaf lists the cases for which attribute A has the corresponding value, along with the appropriate Output value for each case.]
} On one branch, all Outputs are the same, so we can replace it with a simple leaf node with that value: this is an example of the second base-case stopping condition of the recursive algorithm

} Order {A, B, C}: next, divide the un-decided cases based on variable B
[Figure: the remaining cases split on B?, with cases 2 and 6 on one branch and cases 4 and 5 on the other]
} Again, all Outputs are the same on one branch, so it too becomes a leaf

} Order {A, B, C}: last, divide the un-decided cases based on variable C
[Figure: the final split on C? separates case 4 from case 5]
} Now we can replace the last nodes with the relevant decision Outputs
} Order {A, B, C}: the final decision tree for our data-set
[Figure: the finished tree, testing A?, then B?, then C?]
} If we reverse the order of attributes and do the same process, we get a different, somewhat larger tree (although both will give the same decision results on our set)
[Figure: the {A, B, C} tree beside the larger {C, B, A} tree, which tests C? at the root and then repeats B? and A? tests on multiple branches]

Choosing Attributes
[Figure: the 12 restaurant examples split by Patrons? (None / Some / Full) and, alternatively, by Type? (French / Italian / Thai / Burger)]
} Intuitively, a good choice of the attribute to use is one that gives us the most information about how output decisions are made
} Ideally, it would divide our outputs perfectly, telling us everything we needed to know to make our decision
} Often, a single attribute only tells us part of what we need to know, so we prefer those that tell us the most
} In the example, Patrons gives us more information than Type, since some values of the first attribute predict the decision perfectly, while no values of the second do the same

Entropy for Decision Trees
} For a binary (yes/no) decision problem, we can treat a training set with p positive examples and n negative examples as if it were a random variable with two values and probabilities:

  P(Pos) = p / (p + n)        P(Neg) = n / (p + n)

} We can then use the definition of entropy to measure the information gained by finding out whether an example is positive or negative:

  H(Examples) = −( P(Pos) log₂ P(Pos) + P(Neg) log₂ P(Neg) )
              = −( p/(p+n) log₂ p/(p+n) + n/(p+n) log₂ n/(p+n) )
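The entropy formula above is easy to check numerically. A minimal sketch (the function name `entropy` is my own; the slides just write H), using the standard convention that 0 · log₂ 0 = 0:

```python
import math

def entropy(p, n):
    """Entropy in bits of a set with p positive and n negative examples,
    using the convention 0 * log2(0) = 0."""
    total = p + n
    h = 0.0
    for count in (p, n):
        if count > 0:
            q = count / total
            h -= q * math.log2(q)
    return h

# entropy(6, 6) → 1.0 (an evenly split set is maximally uncertain)
# entropy(0, 2) → 0.0 (a pure set tells us nothing new)
```

A pure branch thus contributes zero remaining entropy, which is why attributes whose values predict the decision perfectly look so good under this measure.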
Information Gain
} When we choose an attribute A with d values, we divide our training set into sub-sets E_1, …, E_d
} Each set E_k has its own number of positive and negative examples, p_k and n_k, and entropy H(E_k)
} The total remaining entropy after dividing on A is thus:

  Remainder(A) = Σ_{k=1}^{d} (p_k + n_k)/(p + n) · H(E_k)

} And the total information gain (entropy reduction) if we do choose to use A as the dividing-branch variable is:

  Gain(A) = H(Examples) − Remainder(A)

Choosing Variables Using the Information Gain
[Figure: the 12 restaurant examples split by Patrons? (None / Some / Full) and by Type? (French / Italian / Thai / Burger)]
} Now we can be precise about how Patrons gives us more information than Type. With 6 positive and 6 negative examples:

  H(Examples) = −( 6/12 log₂ 6/12 + 6/12 log₂ 6/12 )
              = −( 1/2 log₂ 1/2 + 1/2 log₂ 1/2 ) = 1.0

} For Patrons:

  Gain(Patrons) = H(Examples) − Remainder(Patrons)
                = 1.0 − ( 2/12 H(E_1) + 4/12 H(E_2) + 6/12 H(E_3) )

  Thus, since we have:

  H(E_1) = −( 0/2 log₂ 0/2 + 2/2 log₂ 2/2 ) = 0
  H(E_2) = −( 4/4 log₂ 4/4 + 0/4 log₂ 0/4 ) = 0
  H(E_3) = −( 2/6 log₂ 2/6 + 4/6 log₂ 4/6 ) ≈ 0.918

  Gain(Patrons) = 1.0 − (1/2 · 0.918) ≈ 0.541

} For Type:

  Gain(Type) = H(Examples) − Remainder(Type)
             = 1.0 − ( 2/12 H(E_1) + 2/12 H(E_2) + 4/12 H(E_3) + 4/12 H(E_4) )

  Thus, since we have:

  H(E_1) = H(E_2) = H(E_3) = H(E_4) = 1.0

  Gain(Type) = 1.0 − 1.0 = 0

} And so we would choose to split on Patrons, since:

  Gain(Patrons) ≈ 0.541 > Gain(Type) = 0
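The Patrons-versus-Type calculation above can be reproduced in a few lines. In this sketch a split is described just by its per-branch (p_k, n_k) counts; the helper names are my own, not from the slides.

```python
import math

def entropy(p, n):
    """Entropy in bits of a p-positive / n-negative set (0 log 0 = 0)."""
    h = 0.0
    for count in (p, n):
        if count > 0:
            q = count / (p + n)
            h -= q * math.log2(q)
    return h

def gain(subsets):
    """Information gain of a split, given (p_k, n_k) counts per branch:
    Gain(A) = H(Examples) - sum_k (p_k + n_k)/(p + n) * H(E_k)."""
    p = sum(pk for pk, nk in subsets)
    n = sum(nk for pk, nk in subsets)
    remainder = sum((pk + nk) / (p + n) * entropy(pk, nk)
                    for pk, nk in subsets)
    return entropy(p, n) - remainder

# Restaurant example: Patrons splits the 12 cases into (0,2), (4,0), (2,4);
# Type splits them into (1,1), (1,1), (2,2), (2,2).
print(round(gain([(0, 2), (4, 0), (2, 4)]), 3))          # → 0.541
print(round(gain([(1, 1), (1, 1), (2, 2), (2, 2)]), 3))  # → 0.0
```

This matches the hand calculation: the two pure Patrons branches contribute nothing to the remainder, while every Type branch is a 50/50 split and so gains nothing at all.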
Learning with Information Gain
} If we use this information-gain measure to rate the IMPORTANCE of an attribute, and always split based on the one that gives us the greatest gain, we can learn the following, more compact tree for the restaurant example:
[Figure: the learned tree, rooted at Patrons? (None / Some / Full), with further tests on Hungry?, Type? (French / Italian / Thai / Burger), and Fri/Sat?]

Performance of Learning
[Figure: learning curve, plotting the proportion correct on the test set (0.4 to 1.0) against the training-set size (0 to 100)]
} If we start with a set of 100 random examples of the restaurant problem, we can see that the accuracy of the learning increases relative to the size of the training set

Improving Decision Trees
} One well-known drawback of decision trees is that they tend to overfit to the training set
} That is, they give very good (often exact) performance on the training set, but don't generalize well to new cases
} To improve on this, various randomization steps can be added to generate Decision Forests:
  1. Build multiple different decision trees
  2. Given an input case, run it through all of the trees, and return the decision given by the majority of those trees

Bootstrapping with Multiple Trees
} When building our different trees, one way this is done is to build each one using a different, randomly chosen subset of the original training set:
  } Random subsets may or may not overlap
  } Each tree is built on its own subset, and learns a decision function only for that subset
  } Each may thus give different decision outputs for the same input, if that input is not in one or the other subset (or both)
[Figure: the original training set S is sampled into subsets S_1, S_2, …, S_N, which are used to build Tree 1, Tree 2, …, Tree N]
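The two-step forest recipe just described (bootstrap subsets, then a majority vote) fits in a few lines. This is only a sketch under stated assumptions: `build_tree` and `classify` stand in for any tree learner and its prediction function, and sampling is done with replacement, so subsets may overlap, as the slides note.

```python
import random
from collections import Counter

def bootstrap_forest(examples, n_trees, build_tree, subset_size=None, seed=0):
    """Build n_trees classifiers, each trained on a random subset
    (sampled with replacement) of the original training set."""
    rng = random.Random(seed)
    k = subset_size or len(examples)
    return [build_tree([rng.choice(examples) for _ in range(k)])
            for _ in range(n_trees)]

def forest_predict(trees, classify, x):
    """Return the decision given by the majority of the trees."""
    votes = Counter(classify(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]
```

Because each tree sees a different sample, individual trees may disagree on inputs outside their own subsets, but the vote smooths out much of that variance.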
Random Forests
} If some of our features give us most of the information about our data, however, the prior random process may not be as random as we would like
} The same features may be used, over and over, in all of our trees, and the trees will tend to act the same way, eliminating the variation we are trying to achieve
} We can modify the procedure to generate a more random forest of trees: again split into random subsets of examples, but when we build the trees, also build each one using a random subset of the features
[Figure: training set S sampled into subsets S_1, S_2, …, S_N, each paired with its own feature subset (Features 1, …, Features N) to build Tree 1, …, Tree N]

This Week
} Information Theory & Decision Trees
} Readings:
  } Blog post on Information Theory (linked from the class schedule)
  } Section 18.3 from Russell & Norvig
} Office Hours: Wing 0
  } Monday/Wednesday/Friday, :00 PM – :00 PM
  } Tuesday/Thursday, :30 PM – 3:00 PM
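Extending the bootstrap idea with per-tree feature subsets takes only one extra sampling step. The sketch below makes the same assumptions as before and uses names of my own; a common heuristic (not stated on the slide) is to give each tree roughly the square root of the total number of features.

```python
import random

def random_forest_splits(examples, features, n_trees, n_features, seed=0):
    """For each tree, draw a bootstrap sample of examples AND a random
    subset of the features: the extra randomization the slides describe.
    Returns (example_subset, feature_subset) pairs; a tree learner
    (not shown) would then be run on each pair."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_trees):
        sample = [rng.choice(examples) for _ in range(len(examples))]
        feats = rng.sample(features, n_features)
        splits.append((sample, feats))
    return splits
```

Restricting each tree to its own feature subset keeps a few dominant features from steering every tree the same way, which is exactly the loss of variation the slide warns about.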