Class #03: Information Theory
Machine Learning (CS 49/59): M. Allen, 10 Sept. 18

Uncertainty and Learning
} Often, when learning, we deal with uncertainty:
} Incomplete data sets, with missing information
} Noisy data sets, with unreliable information
} Stochasticity: causes and effects related non-deterministically
} And many more
} Probability theory gives us the mathematics for such cases
} A precise mathematical theory of chance and causality

Basic Elements of Probability
} Suppose we have some event, e: some fact about the world that may be true or false
} We write P(e) for the probability that e occurs: 0 ≤ P(e) ≤ 1
} We can understand this value as:
  1. P(e) = 1: e will certainly happen
  2. P(e) = 0: e will certainly not happen
  3. P(e) = k, 0 < k < 1: over an arbitrarily long stretch of time, we will observe the fraction
     (# of times event e occurs) / (total # of events) = k

Properties of Probability
} Every event must either occur, or not occur:
  P(e ∨ ¬e) = 1, so P(e) = 1 − P(¬e)
} Furthermore, suppose that we have a set of all possible events, each with its own probability:
  E = {e_1, e_2, ..., e_k}   P = {p_1, p_2, ..., p_k}
} This set of probabilities is called a probability distribution, and it must have the following property:
  Σ_i p_i = 1
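To make the frequency reading of P(e) and the sum-to-one property concrete, here is a minimal Python sketch (not from the lecture; the coin bias and the number of tosses are arbitrary choices) that estimates an empirical distribution from simulated tosses:

    import random

    def empirical_distribution(outcomes):
        """Tally the observed outcomes and normalize the counts into probabilities."""
        counts = {}
        for o in outcomes:
            counts[o] = counts.get(o, 0) + 1
        total = len(outcomes)
        return {o: c / total for o, c in counts.items()}

    # Simulate tossing a biased coin many times (P(Tails) = 0.75 is assumed here).
    tosses = random.choices(["Heads", "Tails"], weights=[0.25, 0.75], k=100_000)

    P = empirical_distribution(tosses)
    print(P)                # e.g. {'Tails': 0.7512, 'Heads': 0.2488}
    print(sum(P.values()))  # 1.0 (up to floating-point rounding): a valid distribution

As the number of tosses grows, the observed fraction for each event approaches its true probability, which is exactly reading 3 above.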

Probability Distributions
} A uniform distribution is one in which every event occurs with equal probability, which means that we have:
  ∀i, p_i = 1/k
} Such distributions are common in games of chance, e.g. where we have a fair coin-toss:
  E = {Heads, Tails}   P_1 = {0.5, 0.5}
} Not every distribution is uniform, and we might have a coin that comes up tails more often than heads (or even always!):
  P_2 = {0.25, 0.75}   P_3 = {0.0, 1.0}

Information Theory
} Claude Shannon created information theory in his 1948 paper, "A Mathematical Theory of Communication"
} A theory of the amount of information that can be carried by communication channels
} Has implications in networks, encryption, compression, and many other areas
} Also the source of the term "bit" (credited to John Tukey)
Photo source: Konrad Jacobs (https://opc.mfo.de/detail?photo_id=3807)

Information Carried by Events
} Information is relative to our uncertainty about an event
} If we do not know whether an event has happened or not, then learning that fact is a gain in information
} If we already know this fact, then there is no information gained when we see the outcome
} Thus, if we have a fixed coin that always comes up tails, actually flipping it tells us nothing we don't already know
} Flipping a fair coin does tell us something, on the other hand, since we can't predict the outcome ahead of time

Amount of Information
} From N. Abramson (1963): if an event e_i occurs with probability p_i, the amount of information carried is:
  I(e_i) = log(1/p_i)
} (The base of the logarithm doesn't really matter, but if we use base 2, we are measuring information in bits)
} Thus, if we flip a fair coin, and it comes up tails, we have gained information equal to:
  I(Tails) = log(1/P(Tails)) = log(1/0.5) = log 2 = 1.0
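The three coins above make good test cases for the self-information formula. A short sketch (mine, not the instructor's) computing I(e) = log_2(1/p) for each value of P(Tails):

    import math

    def information_bits(p):
        """Self-information of an event with probability p, in bits: log2(1/p)."""
        return math.log2(1.0 / p)

    for p_tails in (0.5, 0.75, 1.0):
        print(f"P(Tails) = {p_tails}: I(Tails) = {information_bits(p_tails):.3f} bits")
    # P(Tails) = 0.5  -> 1.000 bits (fair coin)
    # P(Tails) = 0.75 -> 0.415 bits (biased coin; see the next slide)
    # P(Tails) = 1.0  -> 0.000 bits (fixed coin: no information gained)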

Biased Data Carries Less Information
} While flipping a fair coin yields 1.0 bit of information, flipping one that is biased gives us less
} If we have a somewhat biased coin, then we get:
  E = {Heads, Tails}   P_2 = {0.25, 0.75}
  I(Tails) = log(1/P(Tails)) = log(1/0.75) = log 1.33 ≈ 0.415
} If we have a totally biased coin, then we get:
  P_3 = {0.0, 1.0}
  I(Tails) = log(1/P(Tails)) = log(1/1.0) = log 1.0 = 0.0

Entropy: Total Average Information
} Shannon defined the entropy of a probability distribution as the average amount of information carried by events:
  H(P) = Σ_i p_i log(1/p_i) = −Σ_i p_i log p_i
} This can be thought of in a variety of ways, including:
} How much uncertainty we have about the average event
} How much information we get when an average event occurs
} How many bits on average are needed to communicate about the events (Shannon was interested in finding the most efficient overall encodings to use in transmitting information)

Entropy: Total Average Information
} For a coin, C, the formula for entropy becomes:
  H(C) = −(P(Heads) log P(Heads) + P(Tails) log P(Tails))
} A fair coin, {0.5, 0.5}, has maximum entropy:
  H(C) = −(0.5 log 0.5 + 0.5 log 0.5) = 1.0
} A somewhat biased coin, {0.25, 0.75}, has less:
  H(C) = −(0.25 log 0.25 + 0.75 log 0.75) ≈ 0.81
} And a fixed coin, {0.0, 1.0}, has none:
  H(C) = −(1.0 log 1.0 + 0.0 log 0.0) = 0.0

A Mathematical Definition
  H(P) = −Σ_i p_i log p_i
} It is easy to show that for any distribution, entropy is always greater than or equal to 0 (never negative)
} Maximum entropy occurs with a uniform distribution
} In such cases, entropy is log k, where k is the number of different probabilistic outcomes
} Thus, for any possible distribution, we have:
  0 ≤ H(P) ≤ log k
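A small follow-up sketch (my own, reusing the coin distributions from the slides) that computes H(P) and checks the 0 ≤ H(P) ≤ log_2 k bound; the 0 log 0 term is treated as 0, as is standard:

    import math

    def entropy(probs):
        """Shannon entropy in bits: sum of p * log2(1/p), skipping zero-probability events."""
        return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

    coins = {"fair": [0.5, 0.5], "biased": [0.25, 0.75], "fixed": [0.0, 1.0]}
    for name, P in coins.items():
        H = entropy(P)
        assert 0.0 <= H <= math.log2(len(P)) + 1e-12  # 0 <= H(P) <= log2(k)
        print(f"{name:6s}: H = {H:.3f} bits")
    # fair  : H = 1.000 bits (maximum entropy: log2(2) = 1)
    # biased: H = 0.811 bits
    # fixed : H = 0.000 bits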

Joint Probability & Independence
} If we have two events e_1 and e_2, the probability that both events occur, called the joint probability, is written:
  P(e_1 ∧ e_2) = P(e_1, e_2)
} We say that two events are independent if and only if:
  P(e_1, e_2) = P(e_1) P(e_2)
} Independent events tell us nothing about each other

Joint Probability & Independence
} Independent events tell us nothing about each other:
} For example, suppose rainy weather is uniformly distributed
} Suppose further that we choose a day of the week, uniformly at random: that day is either on a weekend or not, giving us:
  W = {Rain, ¬Rain}   P_W = {0.5, 0.5}
  D = {Weekend, ¬Weekend}   P_D = {2/7, 5/7}
} If the weather on any day is independent of whether or not that day is a weekend, then we will have the following:
  P(Rain, Weekend) = P(Rain) P(Weekend) = 0.5 × 2/7 = 1/7
  P(¬Rain, Weekend) = P(¬Rain) P(Weekend) = 0.5 × 2/7 = 1/7
  P(Rain, ¬Weekend) = P(Rain) P(¬Weekend) = 0.5 × 5/7 = 5/14
  P(¬Rain, ¬Weekend) = P(¬Rain) P(¬Weekend) = 0.5 × 5/7 = 5/14

Lack of Independence
} Suppose we compare the probability that it rains to the probability that I bring an umbrella to work:
  W = {Rain, ¬Rain}   P_W = {0.5, 0.5}
  U = {Umbrella, ¬Umbrella}   P_U = {0.2, 0.8}
} Note: presumably, neither of these is really purely random; we can still treat them as random variables based upon observing how frequently they occur (this is sometimes called the empirical probability)
} Now, if these were independent events, then the probability, e.g., that I am carrying an umbrella and it is raining would be:
  P(Rain, Umbrella) = P(Rain) P(Umbrella) = 0.5 × 0.2 = 0.1
} Obviously, however, these are not independent; the actual probability of seeing me with my umbrella on rainy days could be much higher than just calculated

Conditional Probability
} Given two events e_1 and e_2, the probability that e_1 occurs, given that e_2 also occurs, called the conditional probability of e_1 given e_2, is written:
  P(e_1 | e_2)
} In general, the conditional probability of an event can be quite different from the basic probability that it occurs
} Thus, for our weather example, we might have:
  W = {R, ¬R}   P_W = {0.5, 0.5}
  U = {U, ¬U}   P_U = {0.2, 0.8}
  P(U | R) = 0.8    P(U | ¬R) = 0.1
  P(¬U | R) = 0.2   P(¬U | ¬R) = 0.9
} Note that P(e_1 | e_2) + P(¬e_1 | e_2) = 1.0, but P(e_1 | e_2) + P(e_1 | ¬e_2) ≠ 1.0 (they can be equal, but not necessarily)
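To check the independence arithmetic above, here is a brief sketch (mine; the rain/weekend probabilities are the ones from the slides, and the labels "NoRain" and "Weekday" stand in for ¬Rain and ¬Weekend) that builds the joint distribution under the independence assumption:

    from itertools import product

    P_W = {"Rain": 0.5, "NoRain": 0.5}          # weather, assumed uniform
    P_D = {"Weekend": 2 / 7, "Weekday": 5 / 7}  # day type, from choosing a day at random

    # Under independence, each joint probability is just the product of the marginals.
    joint = {(w, d): P_W[w] * P_D[d] for w, d in product(P_W, P_D)}

    for (w, d), p in joint.items():
        print(f"P({w}, {d}) = {p:.4f}")  # 1/7 ≈ 0.1429 and 5/14 ≈ 0.3571, as above
    print(sum(joint.values()))           # the four joint probabilities sum to 1 (up to rounding)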

Properties of Conditional Probability
} Conditional probability can be defined using joint probability:
  P(e_1 | e_2) = P(e_1, e_2) / P(e_2)
  P(e_1, e_2) = P(e_1 | e_2) P(e_2)
} Thus, if the events are actually independent, we get:
  P(e_1 | e_2) = P(e_1, e_2) / P(e_2)
              = P(e_1) P(e_2) / P(e_2)   (by definition of independence)
              = P(e_1)
  (see the short sketch at the end of these notes)

This Week
} Information Theory & Decision Trees
} Readings:
} Blog post on Information Theory (linked from class schedule)
} Section 18.3 from Russell & Norvig
} Office Hours: Wing 0
} Monday/Wednesday/Friday, :00 PM – :00 PM
} Tuesday/Thursday, :30 PM – 3:00 PM
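Returning to the definition P(e_1 | e_2) = P(e_1, e_2) / P(e_2), here is the short sketch referred to above (my own; the joint table is invented so that its marginals match the slides' P_W = {0.5, 0.5} and P_U = {0.2, 0.8}, but the exact dependence between rain and umbrella is purely illustrative). It shows the conditional differing from the marginal when the events are not independent:

    # Hypothetical joint distribution over (weather, umbrella). The marginals match
    # the slides' P_W and P_U, but the dependence itself is made up for illustration.
    joint = {
        ("Rain", "Umbrella"):     0.15,
        ("Rain", "NoUmbrella"):   0.35,
        ("NoRain", "Umbrella"):   0.05,
        ("NoRain", "NoUmbrella"): 0.45,
    }

    def marginal(index, value):
        """P(value), obtained by summing the joint over the other variable."""
        return sum(p for key, p in joint.items() if key[index] == value)

    def conditional(u_value, w_value):
        """P(u_value | w_value) = P(w_value, u_value) / P(w_value)."""
        return joint[(w_value, u_value)] / marginal(0, w_value)

    print(f"P(Umbrella)        = {marginal(1, 'Umbrella'):.2f}")            # 0.20
    print(f"P(Umbrella | Rain) = {conditional('Umbrella', 'Rain'):.2f}")    # 0.30, not 0.20
    print(f"{conditional('Umbrella', 'Rain') + conditional('NoUmbrella', 'Rain'):.2f}")  # 1.00

If rain and umbrella were independent, the conditional would instead come out equal to the marginal, exactly as derived on the slide above.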