Class #04: Mutual Information & Decision Trees
Machine Learning (CS 419/519): M. Allen, 12 Sept. 2018

Review: Entropy and Information

$H(P) = -\sum_i p_i \log p_i$

• Entropy is the information gained on average when observing events that occur according to a probability distribution
• It is non-negative, and maximized by a uniform distribution
• Thus, for any possible distribution $P = \{p_1, p_2, \ldots, p_k\}$, we have: $0 \le H(P) \le \log k$ (a short numerical sketch of this appears below, after the probability review)

Review: Joint Probability & Independence

• If we have two events $e_1$ and $e_2$, the probability that both events occur, called the joint probability, is written: $P(e_1 \wedge e_2) = P(e_1, e_2)$
• We say that two events are independent if and only if: $P(e_1, e_2) = P(e_1)\,P(e_2)$
• Independent events tell us nothing about each other

Review: Conditional Probability

• Given two events $e_1$ and $e_2$, the probability that $e_1$ occurs, given that $e_2$ also occurs, called the conditional probability of $e_1$ given $e_2$, is written: $P(e_1 \mid e_2)$
• In general, the conditional probability of an event can be quite different from the basic probability that it occurs
• Thus, for our weather example, we might have:
  $W = \{R, \neg R\}$, $P_W = \{0.5, 0.5\}$
  $U = \{U, \neg U\}$, $P_U = \{0.2, 0.8\}$
  $P(U \mid R) = 0.8$, $P(U \mid \neg R) = 0.1$
  $P(\neg U \mid R) = 0.2$, $P(\neg U \mid \neg R) = 0.9$
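Before moving on, here is a quick computational check of the entropy review above: a minimal Python sketch (mine, not from the slides) that evaluates $H(P)$ for the weather and umbrella distributions just given; the function name `entropy` and the use of base-2 logarithms are my own choices.

```python
import math

def entropy(dist, base=2):
    """Shannon entropy H(P) = -sum_i p_i log p_i (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log(p, base) for p in dist if p > 0)

P_W = [0.5, 0.5]   # P(R), P(not R) from the slides
P_U = [0.2, 0.8]   # P(U), P(not U) from the slides

print(entropy(P_W))   # 1.0 bit: uniform, so H(P) reaches the upper bound log k
print(entropy(P_U))   # ~0.722 bits: non-uniform, so strictly below log 2 = 1
print(0 <= entropy(P_U) <= math.log(len(P_U), 2))   # True: 0 <= H(P) <= log k
```

Treating $0 \log 0$ as $0$ keeps the bound $0 \le H(P)$ valid even for degenerate distributions.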
Properties of Conditional Probability

• Conditional probability can be defined using joint probability:
  $P(e_1 \mid e_2) = \dfrac{P(e_1, e_2)}{P(e_2)}$, so $P(e_1, e_2) = P(e_1 \mid e_2)\,P(e_2)$
• Thus, if the events are actually independent, we get:
  $P(e_1 \mid e_2) = \dfrac{P(e_1, e_2)}{P(e_2)} = \dfrac{P(e_1)\,P(e_2)}{P(e_2)}$ (by definition of independence) $= P(e_1)$

Calculating Joint Probabilities

• We have the simple and conditional probabilities of rain and my umbrella-carrying behavior:
  $W = \{R, \neg R\}$, $P_W = \{0.5, 0.5\}$
  $U = \{U, \neg U\}$, $P_U = \{0.2, 0.8\}$
  $P(U \mid R) = 0.8$, $P(U \mid \neg R) = 0.1$
  $P(\neg U \mid R) = 0.2$, $P(\neg U \mid \neg R) = 0.9$
• This allows us to calculate various joint probabilities:
  $P(U, R) = P(U \mid R)\,P(R) = 0.8 \times 0.5 = 0.4$
  $P(U, \neg R) = P(U \mid \neg R)\,P(\neg R) = 0.1 \times 0.5 = 0.05$
  $P(\neg U, R) = P(\neg U \mid R)\,P(R) = 0.2 \times 0.5 = 0.1$
  $P(\neg U, \neg R) = P(\neg U \mid \neg R)\,P(\neg R) = 0.9 \times 0.5 = 0.45$
  The total set of probabilities sums to 1.0

Mutual Information

• Suppose we have two sets of possible events, each with its own probability distribution:
  $E = \{e_1, e_2, \ldots, e_m\}$, $P_E = \{p_1, p_2, \ldots, p_m\}$
  $E' = \{e'_1, e'_2, \ldots, e'_n\}$, $P_{E'} = \{p'_1, p'_2, \ldots, p'_n\}$
• We can define mutual information, the amount that one event tells us about the other:
  $I(E; E') = \sum_{e_i, e'_j} P(e_i, e'_j) \log \dfrac{P(e_i, e'_j)}{P(e_i)\,P(e'_j)}$
• Effectively, this measures how much knowing that $E'$ has happened reduces the entropy of $E$

Mutual Information

• This allows us to quantify exactly how much knowing whether or not it is raining tells us about whether or not I will be carrying an umbrella:
  $I(U; W) = P(U, R) \log \dfrac{P(U, R)}{P(U)\,P(R)} + P(U, \neg R) \log \dfrac{P(U, \neg R)}{P(U)\,P(\neg R)} + P(\neg U, R) \log \dfrac{P(\neg U, R)}{P(\neg U)\,P(R)} + P(\neg U, \neg R) \log \dfrac{P(\neg U, \neg R)}{P(\neg U)\,P(\neg R)}$
  $= 0.4 \log \dfrac{0.4}{0.2 \times 0.5} + 0.05 \log \dfrac{0.05}{0.2 \times 0.5} + 0.1 \log \dfrac{0.1}{0.8 \times 0.5} + 0.45 \log \dfrac{0.45}{0.8 \times 0.5}$
  $= 0.4 \log 4 + 0.05 \log 0.5 + 0.1 \log 0.25 + 0.45 \log 1.125$
  $= 0.8 - 0.05 - 0.2 + 0.0765 = 0.6265$ (with base-2 logarithms)
• Note: the final value doesn't matter so much (e.g., it would change if we used a different base for our logarithms). It does allow us to compare different combinations of variables, however, to see which tells us the most about another.
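The same calculation can be checked mechanically. The sketch below (mine, not from the slides) plugs in the four joint probabilities and the marginal values exactly as the slide does ($P(U) = 0.2$, $P(\neg U) = 0.8$, $P(R) = P(\neg R) = 0.5$), using base-2 logarithms, and reproduces the value of roughly 0.6265.

```python
import math

# Joint and marginal probabilities exactly as stated on the slides.
# Keys are (umbrella, rain), with True meaning U / R and False meaning not-U / not-R.
joint = {(True, True): 0.4, (True, False): 0.05,
         (False, True): 0.1, (False, False): 0.45}
P_U = {True: 0.2, False: 0.8}
P_R = {True: 0.5, False: 0.5}

def mutual_information(joint, p_x, p_y, base=2):
    """I(X; Y) = sum over (x, y) of P(x, y) * log[ P(x, y) / (P(x) P(y)) ]."""
    return sum(p * math.log(p / (p_x[x] * p_y[y]), base)
               for (x, y), p in joint.items() if p > 0)

print(mutual_information(joint, P_U, P_R))   # ~0.6265, matching the slide's arithmetic
```

Switching `base` rescales the number but not the ordering when comparing different variable pairs, which is the point made in the note above.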
Properties of Mutual Information

$I(E; E') = \sum_{e_i, e'_j} P(e_i, e'_j) \log \dfrac{P(e_i, e'_j)}{P(e_i)\,P(e'_j)}$

• As defined, mutual information is:
  1. Symmetric: $I(E; E') = I(E'; E)$, because $P(e_i, e'_j) = P(e'_j, e_i)$
  2. Non-negative: $I(E; E') \ge 0$ (because: it's complicated, but trust me)
  3. Zero when events are independent (i.e., when independent, one event tells us nothing about the other that we didn't already know):
     $I(E; E') = \sum_{e_i, e'_j} P(e_i, e'_j) \log \dfrac{P(e_i, e'_j)}{P(e_i)\,P(e'_j)} = \sum_{e_i, e'_j} P(e_i)\,P(e'_j) \log \dfrac{P(e_i)\,P(e'_j)}{P(e_i)\,P(e'_j)} = \sum_{e_i, e'_j} P(e_i)\,P(e'_j) \log 1 = \sum_{e_i, e'_j} P(e_i)\,P(e'_j) \cdot 0 = 0$

Review: Inductive Learning

• In its simplest form, induction is the task of learning a function on some inputs from examples of its outputs
• For a target function, f, each training example is a pair (x, f(x))
• We assume that we do not yet know the actual form of the function f (if we did, we wouldn't need to learn it)
• Learning problem: find a hypothesis function, h, such that h(x) = f(x) most of the time, based on a training set of example input-output pairs

Decision Trees

• A decision tree leads us from a set of attributes (features of the input) to some output
• For example, we have a database of customer records for restaurants
• These customers have made a number of decisions about whether to wait for a table, based on a number of attributes:
  1. Alternate: is there an alternative restaurant nearby?
  2. Bar: is there a comfortable bar area to wait in?
  3. Fri/Sat: is today Friday or Saturday?
  4. Hungry: are we hungry?
  5. Patrons: number of people in the restaurant (None, Some, Full)
  6. Price: price range ($, $$, $$$)
  7. Raining: is it raining outside?
  8. Reservation: have we made a reservation?
  9. Type: kind of restaurant (French, Italian, Thai, Burger)
  10. WaitEstimate: estimated wait time in minutes (0-10, 10-30, 30-60, >60)
• The function we want to learn is whether or not a (future) customer will decide to wait, given some particular set of attributes

Decisions Based on Attributes

• Training set: cases where patrons have decided to wait or not, along with the associated attributes for each case (one possible representation is sketched below)
• We now want to learn a tree that agrees with the decisions already made, in hopes that it will allow us to predict future decisions
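As a concrete, purely illustrative way to hold such a training set, one might represent each case as a Python dictionary of attribute values plus the observed decision. The two rows below are hypothetical placeholders in the spirit of the restaurant example, not rows copied from the actual data set in the readings.

```python
# Hypothetical training examples: each maps attribute names to values, plus the
# observed decision under the key "WillWait". These rows are illustrative only.
examples = [
    {"Alternate": True,  "Bar": False, "Fri/Sat": False, "Hungry": True,
     "Patrons": "Some", "Price": "$$", "Raining": False, "Reservation": True,
     "Type": "French", "WaitEstimate": "0-10", "WillWait": True},
    {"Alternate": True,  "Bar": False, "Fri/Sat": False, "Hungry": True,
     "Patrons": "Full", "Price": "$",  "Raining": False, "Reservation": False,
     "Type": "Thai",   "WaitEstimate": "30-60", "WillWait": False},
]

attributes = ["Alternate", "Bar", "Fri/Sat", "Hungry", "Patrons",
              "Price", "Raining", "Reservation", "Type", "WaitEstimate"]
```

The tree-building sketches later in these notes assume this representation.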
Decision Tree Functions

• For the examples given, here is a true tree (one that will lead from the inputs to the same outputs):
  [Figure: a decision tree whose root tests Patrons? (None / Some / Full), with further tests on WaitEstimate? (>60, 30-60, 10-30, 0-10), Alternate?, Hungry?, Reservation?, Bar?, Fri/Sat?, and Raining? along the deeper branches]

Decision Trees are Expressive

• Such trees can express any deterministic function we want
• For example, in boolean functions, each row of a truth table will correspond to a path in a tree
  [Figure: truth table for A, B, and A && !B, with the corresponding decision tree that tests A and then B]
• For any such function, there is always a tree: just make each example a different path to a correct leaf output
• A Problem: such trees most often do not generalize to new examples
• Another Problem: we want compact trees to simplify inference

Why Not Search for Trees?

• One thing we might consider would be to search through possible trees to find ones that are most compact and consistent with our inputs
• Exhaustive search is too expensive, however, due to the large number of possible functions (trees) that exist
• For n binary-valued attributes, and boolean decision outputs, there are $2^{2^n}$ possibilities
• For 5 such attributes, we have 4,294,967,296 trees!
• Even restricting our search to conjunctions over attributes, it is easy to get $3^n$ possible trees

Building Trees Top-Down

• Rather than search over all trees, we build our trees by:
  1. Choosing an attribute A from our set
  2. Dividing our examples according to the values of A
  3. Placing each subset of examples into a sub-tree below the node for attribute A
• This can be implemented in a number of ways, but is perhaps most easily understood recursively
• The main question becomes: how do we choose the attribute A that we use to split our examples? (One standard answer, sketched below, uses the information-theoretic ideas from earlier.)
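One standard way to rate attributes, in keeping with the information-theoretic theme of this class, is information gain: the entropy of the decision minus its expected entropy after splitting on the attribute, which is the mutual information between the attribute and the decision. The sketch below is mine, not from the slides; it assumes the dictionary representation of examples introduced earlier and the hypothetical label key "WillWait".

```python
import math
from collections import Counter, defaultdict

def label_entropy(examples, label="WillWait"):
    """Entropy of the decision variable over a set of examples."""
    counts = Counter(e[label] for e in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_by(examples, attribute):
    """Step 2 of the top-down recipe: divide the examples by the attribute's values."""
    groups = defaultdict(list)
    for e in examples:
        groups[e[attribute]].append(e)
    return groups

def information_gain(examples, attribute, label="WillWait"):
    """Entropy of the decision minus its expected entropy after splitting on the
    attribute -- i.e., the mutual information between attribute and decision."""
    groups = split_by(examples, attribute)
    remainder = sum(len(g) / len(examples) * label_entropy(g, label)
                    for g in groups.values())
    return label_entropy(examples, label) - remainder
```

Choosing the attribute that maximizes information_gain is one way to fill the IMPORTANCE role in the pseudocode on the next slide.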
Decision Tree Learning Algorithm

  function DECISION-TREE-LEARNING(examples, attributes, parent_examples) returns a tree
      if examples is empty then return PLURALITY-VALUE(parent_examples)
      else if all examples have the same classification then return the classification
      else if attributes is empty then return PLURALITY-VALUE(examples)
      else
          A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
          tree ← a new decision tree with root test A
          for each value v_k of A do
              exs ← {e : e ∈ examples and e.A = v_k}
              subtree ← DECISION-TREE-LEARNING(exs, attributes − A, examples)
              add a branch to tree with label (A = v_k) and subtree subtree
          return tree

• PLURALITY-VALUE(): returns the output decision-value of the majority of the examples
• IMPORTANCE(): rates attributes for their importance in making decisions for the given set of examples (the only actually complex part)
• (A short Python sketch of this procedure appears at the end of these notes.)

This Week

• Information Theory & Decision Trees
• Readings:
  • Blog post on Information Theory (linked from class schedule)
  • Section 18.3 from Russell & Norvig
• Office Hours: Wing 10
  • Monday/Wednesday/Friday, 12:00 PM – 1:00 PM
  • Tuesday/Thursday, 1:30 PM – 3:00 PM
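For completeness, here is a Python translation of the DECISION-TREE-LEARNING pseudocode above. It reuses split_by and information_gain from the earlier sketch as the IMPORTANCE measure, assumes the dictionary representation of examples, and is an illustrative sketch rather than the reference implementation from the readings.

```python
from collections import Counter

def plurality_value(examples, label="WillWait"):
    """PLURALITY-VALUE: the most common decision among the given examples."""
    return Counter(e[label] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples, label="WillWait"):
    """Mirror of the pseudocode: returns either a leaf decision value or a nested
    dict of the form {attribute: {value: subtree}}."""
    if not examples:
        return plurality_value(parent_examples, label)
    classifications = {e[label] for e in examples}
    if len(classifications) == 1:
        return classifications.pop()
    if not attributes:
        return plurality_value(examples, label)
    # IMPORTANCE: information gain, as sketched earlier.
    A = max(attributes, key=lambda a: information_gain(examples, a, label))
    tree = {A: {}}
    # For simplicity this branches only on attribute values that actually occur
    # in the examples; the pseudocode iterates over every possible value of A.
    for v_k, exs in split_by(examples, A).items():
        remaining = [a for a in attributes if a != A]
        tree[A][v_k] = decision_tree_learning(exs, remaining, examples, label)
    return tree
```

Called as decision_tree_learning(examples, attributes, examples) on the hypothetical training set above, this returns a nested dictionary that can be read off directly as a decision tree.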