CS 687, Jana Kosecka
Uncertainty and Bayesian Networks
Russell and Norvig: Chapter 13; Chapter 14, Sections 14.1-14.3
Outline
- Uncertainty
- Probability
- Syntax and Semantics
- Inference
- Independence and Bayes' Rule
Syntax
Basic element: the random variable. Similar to propositional logic: possible worlds are defined by assignments of values to random variables.
- Boolean random variables, e.g. Cavity (do I have a cavity?), with domain <true, false>
- Discrete random variables, e.g. Weather is one of <sunny, rainy, cloudy, snow>
Domain values must be exhaustive and mutually exclusive.
An elementary proposition is constructed by assigning a value to a random variable: e.g. Weather = sunny; Cavity = false (abbreviated as ¬cavity).
Syntax
Atomic event: a complete specification of the state of the world about which the agent is uncertain.
E.g., if the world consists of only two Boolean variables, Cavity and Toothache, then there are 4 distinct atomic events:
- Cavity = false ∧ Toothache = false
- Cavity = false ∧ Toothache = true
- Cavity = true ∧ Toothache = false
- Cavity = true ∧ Toothache = true
Atomic events are mutually exclusive and exhaustive.
Axioms of probability
For any propositions A, B:
- 0 ≤ P(A) ≤ 1
- P(true) = 1 and P(false) = 0
- P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Prior probability
Prior or unconditional probabilities of propositions, e.g. P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72, correspond to belief prior to the arrival of any new evidence.
A probability distribution gives values for all possible assignments: P(Weather) = <0.72, 0.1, 0.08, 0.1> (normalized, i.e. sums to 1).
A joint probability distribution for a set of random variables gives the probability of every atomic event on those random variables. P(Weather, Cavity) is a 4 × 2 matrix of values:

Weather:        sunny   rainy   cloudy  snow
Cavity = true   0.144   0.02    0.016   0.02
Cavity = false  0.576   0.08    0.064   0.08
Joint Distribution

Weather:        sunny   rainy   cloudy  snow
Cavity = true   0.144   0.02    0.016   0.02
Cavity = false  0.576   0.08    0.064   0.08

Every question about the domain can be answered from the joint probability distribution.
Conditional probability
Conditional or posterior probabilities, e.g. P(cavity | toothache) = 0.8, i.e. given that toothache is all I know.
Notation for conditional distributions: P(Cavity | Toothache) is a 2-element vector of 2-element vectors.
If we know more, e.g. cavity is also given, then we have P(cavity | toothache, cavity) = 1.
New evidence may be irrelevant, allowing simplification: e.g. P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8.
Conditional probability
Definition of conditional probability: P(a | b) = P(a ∧ b) / P(b), if P(b) > 0.
The product rule gives an alternative formulation: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a).
A general version holds for whole distributions, e.g. P(Weather, Cavity) = P(Weather | Cavity) P(Cavity). View this as a set of 4 × 2 equations, not matrix multiplication.
This is analogous to logical reasoning, where a logical agent cannot simultaneously believe A ∨ B, ¬A, and ¬B.
Where do probabilities come from? Frequentist, objectivist, subjectivist (Bayesian).
Inference by enumeration
Start with the joint probability distribution (over Toothache, Catch, Cavity).
For any proposition φ, sum the atomic events where it is true: P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
P(toothache ∨ cavity) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
The process of summing out (marginalization): sum out all possible values of the other variables:
P(Y) = Σ_z P(Y, Z = z), e.g. P(Cavity) = Σ_z P(Cavity, z), z ∈ {Catch, Toothache}
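The summations above can be sketched in Python; the joint entries are the toothache/catch/cavity values quoted on this slide.

```python
# Inference by enumeration over the toothache/catch/cavity joint
# distribution whose entries appear on this slide.
joint = {
    # (toothache, catch, cavity): probability of the atomic event
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

def prob(holds):
    """P(phi): sum the atomic events where the proposition phi is true."""
    return sum(p for w, p in joint.items() if holds(w))

p_toothache = prob(lambda w: w[0])             # marginalizes out Catch, Cavity
p_tooth_or_cav = prob(lambda w: w[0] or w[2])
print(round(p_toothache, 3))     # 0.2
print(round(p_tooth_or_cav, 3))  # 0.28
```

The same `prob` helper answers any proposition over the three variables, which is exactly the sense in which the full joint answers every query about the domain.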
Inference by enumeration
Start with the joint probability distribution. We can also compute conditional probabilities:
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                       = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4
Normalization
The denominator can be viewed as a normalization constant α:
P(Cavity | toothache) = α P(Cavity, toothache)
  = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
  = α [<0.108, 0.016> + <0.012, 0.064>]
  = α <0.12, 0.08> = <0.6, 0.4>
General idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables.
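A minimal sketch of the normalization step, using the joint entries quoted on this slide:

```python
# P(Cavity | toothache) = alpha * P(Cavity, toothache): sum out the hidden
# variable Catch, then normalize. Entries are from the slide's joint table.
unnorm = [0.108 + 0.012,   # P(cavity, toothache), summed over catch
          0.016 + 0.064]   # P(~cavity, toothache), summed over catch
alpha = 1.0 / sum(unnorm)                  # alpha = 1 / P(toothache) = 1 / 0.2
posterior = [alpha * v for v in unnorm]
print([round(v, 3) for v in posterior])    # [0.6, 0.4]
```

Note that α never has to be computed from P(toothache) separately: normalizing the unnormalized vector to sum to 1 yields the same result.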
Inference by enumeration, contd.
Typically, we are interested in the posterior joint distribution of the query variables X given specific values e for the evidence variables E.
Let the hidden variables be Y = all variables − X − E. Then the required summation of joint entries is done by summing out the hidden variables:
P(X | E = e) = α P(X, E = e) = α Σ_y P(X, E = e, Y = y)
The terms in the summation are joint entries, because X, E, and Y together exhaust the set of random variables.
Obvious problems:
1. Worst-case time complexity O(d^n), where d is the largest arity
2. Space complexity O(d^n) to store the joint distribution
3. How to find the numbers for O(d^n) entries?
Independence
A and B are independent iff P(A | B) = P(A), or P(B | A) = P(B), or P(A, B) = P(A) P(B).
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
32 entries reduced to 12; for n independent biased coins, O(2^n) → O(n).
Absolute independence is powerful but rare. Dentistry is a large field with hundreds of variables, none of which are independent.
Bayes' Rule
The product rule P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a) gives Bayes' rule:
P(a | b) = P(b | a) P(a) / P(b)
or in distribution form: P(Y | X) = P(X | Y) P(Y) / P(X)
Useful for assessing diagnostic probability from causal probability:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
Bayes' Rule
Bayes' rule: P(Y | X) = P(X | Y) P(Y) / P(X).
A more general version, conditionalized on some evidence e: P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e).
E.g., let m be meningitis and s be stiff neck:
P(m | s) = P(s | m) P(m) / P(s) = 0.8 × 0.001 / 0.1 = 0.008
Note: the posterior probability of meningitis is still very small!
Normalization: P(s) is the same for m and ¬m, so P(Y | X) = α P(X | Y) P(Y).
Bayes' Rule and combining evidence
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)
We can assume independence of the evidence in the presence of Cavity:
  = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
Given Cavity, toothache and catch are independent.
In general, conditional independence: P(X, Y | Z) = P(X | Z) P(Y | Z).
Conditional independence
P(Toothache, Catch, Cavity) has 2^3 − 1 = 7 independent entries.
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Conditional independence, contd.
Write out the full joint distribution using the conditional independence assumption:
P(Toothache, Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
  = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
I.e., 2 + 2 + 1 = 5 independent numbers.
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n.
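This factorization can be checked numerically on the toothache/catch/cavity joint used earlier in the lecture: estimating the 5 numbers from the table and multiplying them back reproduces all 8 joint entries exactly.

```python
# Verify P(Toothache, Catch, Cavity) = P(T|Cav) P(C|Cav) P(Cav)
# on the lecture's 8-entry joint table: 5 numbers reconstruct all 8 entries.
joint = {
    # (toothache, catch, cavity): probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

def marginal(pred):
    return sum(p for w, p in joint.items() if pred(w))

p_cav = marginal(lambda w: w[2])   # P(cavity) = 0.2        (1 number)
p_t = {c: marginal(lambda w: w[0] and w[2] == c) / marginal(lambda w: w[2] == c)
       for c in (True, False)}     # P(toothache | Cavity)  (2 numbers)
p_c = {c: marginal(lambda w: w[1] and w[2] == c) / marginal(lambda w: w[2] == c)
       for c in (True, False)}     # P(catch | Cavity)      (2 numbers)

for (t, c, cav), p in joint.items():
    recon = ((p_t[cav] if t else 1 - p_t[cav])
             * (p_c[cav] if c else 1 - p_c[cav])
             * (p_cav if cav else 1 - p_cav))
    assert abs(recon - p) < 1e-9   # factorization matches every joint entry
print("5 numbers reconstruct all 8 joint entries")
```

This joint was constructed to satisfy the conditional independence exactly; real data would only satisfy it approximately.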
Naïve Bayes
Naïve Bayes model: the effects are independent given the cause:
P(Cause, Effect_1, ..., Effect_n) = P(Cause) Π_i P(Effect_i | Cause)
This simplifying assumption often works well.
Example: SPAM classification.
Slide from Dan Klein
Probabilistic Classification
MAP classification rule (MAP: Maximum A Posteriori): assign x to c* if
P(C = c* | X = x) > P(C = c_i | X = x), for all c_i ≠ c*, c_i = c_1, ..., c_L
Generative classification with the MAP rule: apply Bayes' rule to convert likelihoods into posterior probabilities,
P(C = c_i | X = x) = P(X = x | C = c_i) P(C = c_i) / P(X = x),
then apply the MAP rule.
Naïve Bayes
Bayes classification difficulty: learning the joint probability P(X_1, ..., X_n | C) is infeasible.
Naïve Bayes classification: assume that all input attributes are conditionally independent given the class:
P(X_1, X_2, ..., X_n | C) = P(X_1 | C) P(X_2 | C) ... P(X_n | C)
MAP classification rule: for x = (x_1, x_2, ..., x_n), assign label c* if
[P(x_1 | c*) ... P(x_n | c*)] P(c*) > [P(x_1 | c) ... P(x_n | c)] P(c), for all c ≠ c*, c = c_1, ..., c_L
Naïve Bayes
Naïve Bayes algorithm for discrete input attributes.
Learning phase: given a training set S,
- for each target value c_i (i = 1, ..., L), estimate P̂(C = c_i) with examples in S;
- for every attribute value x_jk of each attribute X_j (j = 1, ..., n; k = 1, ..., N_j), estimate P̂(X_j = x_jk | C = c_i) with examples in S.
Output: conditional probability tables; for each X_j, N_j × L elements.
Test phase: given an unknown instance x' = (a'_1, ..., a'_n), look up the tables and assign the label c* to x' if
[P̂(a'_1 | c*) ... P̂(a'_n | c*)] P̂(c*) > [P̂(a'_1 | c) ... P̂(a'_n | c)] P̂(c), for all c ≠ c*, c = c_1, ..., c_L
Example: Play Tennis
Example: Learning phase
P(Play = Yes) = 9/14, P(Play = No) = 5/14

Outlook    Play=Yes  Play=No     Temperature  Play=Yes  Play=No
Sunny      2/9       3/5         Hot          2/9       2/5
Overcast   4/9       0/5         Mild         4/9       2/5
Rain       3/9       2/5         Cool         3/9       1/5

Humidity   Play=Yes  Play=No     Wind         Play=Yes  Play=No
High       3/9       4/5         Strong       3/9       3/5
Normal     6/9       1/5         Weak         6/9       2/5
Test phase
Given a new instance x = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong).
Look up the tables:
P(Outlook = Sunny | Play = Yes) = 2/9       P(Outlook = Sunny | Play = No) = 3/5
P(Temperature = Cool | Play = Yes) = 3/9    P(Temperature = Cool | Play = No) = 1/5
P(Humidity = High | Play = Yes) = 3/9       P(Humidity = High | Play = No) = 4/5
P(Wind = Strong | Play = Yes) = 3/9         P(Wind = Strong | Play = No) = 3/5
P(Play = Yes) = 9/14                        P(Play = No) = 5/14
MAP rule:
P(Yes | x) ∝ [P(Sunny | Yes) P(Cool | Yes) P(High | Yes) P(Strong | Yes)] P(Play = Yes) = 0.0053
P(No | x)  ∝ [P(Sunny | No) P(Cool | No) P(High | No) P(Strong | No)] P(Play = No) = 0.0206
Since P(Yes | x) < P(No | x), we label x as No.
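The test phase above can be sketched in a few lines of Python, plugging in the conditional probability tables from the learning-phase slide:

```python
# Naive Bayes test phase for the Play-Tennis example, using the
# learned conditional probability tables from the slides.
p_prior = {'Yes': 9/14, 'No': 5/14}
cpt = {
    'Yes': {'Outlook=Sunny': 2/9, 'Temperature=Cool': 3/9,
            'Humidity=High': 3/9, 'Wind=Strong': 3/9},
    'No':  {'Outlook=Sunny': 3/5, 'Temperature=Cool': 1/5,
            'Humidity=High': 4/5, 'Wind=Strong': 3/5},
}
x = ['Outlook=Sunny', 'Temperature=Cool', 'Humidity=High', 'Wind=Strong']

def map_score(label):
    """Unnormalized posterior: P(c) * prod_j P(x_j | c)."""
    s = p_prior[label]
    for attr in x:
        s *= cpt[label][attr]
    return s

s_yes = map_score('Yes')   # ~0.0053
s_no = map_score('No')     # ~0.0206
print('label:', 'No' if s_no > s_yes else 'Yes')  # prints "label: No"
```

Since only the comparison matters, the scores are never normalized by P(x); the argmax over the unnormalized products is the MAP label.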
Relevant Issues
Violation of the independence assumption: it fails for many real-world tasks. Nevertheless, naïve Bayes works surprisingly well anyway!
Zero conditional probability problem: if no training example contains the attribute value x_jk, then P̂(X_j = x_jk | C = c_i) = 0, and hence the whole product P̂(x_1 | c_i) ... P̂(x_jk | c_i) ... P̂(x_n | c_i) = 0 at test time.
As a remedy, conditional probabilities are estimated with Laplacian smoothing:
P̂(X_j = x_jk | C = c_i) = (n_c + m p) / (n + m)
- n: number of training examples for which C = c_i
- n_c: number of training examples for which X_j = x_jk and C = c_i
- p: prior estimate, usually p = 1/t for t possible values of X_j
- m: weight given to the prior (number of "virtual" examples, m ≥ 1)
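A minimal sketch of the m-estimate above, illustrated on Outlook = Overcast under Play = No, which has zero count (0/5) in the Play-Tennis training set:

```python
# Laplacian (m-estimate) smoothing: P^(X_j = a_jk | C = c_i) = (n_c + m*p) / (n + m).
# Example counts are the Play-Tennis case Outlook=Overcast, Play=No (0 of 5).
def m_estimate(n_c, n, p, m=1.0):
    """n_c matching examples out of n in the class; prior p; m virtual examples."""
    return (n_c + m * p) / (n + m)

p_uniform = 1 / 3   # uniform prior over the 3 Outlook values
smoothed = m_estimate(0, 5, p_uniform)   # instead of the fatal 0/5 = 0
print(round(smoothed, 4))  # 0.0556
```

With m = 1 and a uniform prior, the zero count becomes a small positive probability, so a single unseen attribute value no longer forces the whole class posterior to zero.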
Relevant Issues
Continuous-valued input attributes: numberless values for an attribute. The conditional probability is modeled with the normal distribution:
P̂(X_j | C = c_i) = (1 / (√(2π) σ_ji)) exp(−(X_j − μ_ji)² / (2 σ_ji²))
- μ_ji: mean (average) of attribute values X_j of examples for which C = c_i
- σ_ji: standard deviation of attribute values X_j of examples for which C = c_i
Learning phase: for X = (X_1, ..., X_n) and C = c_1, ..., c_L. Output: n × L normal distributions and P(C = c_i), i = 1, ..., L.
Test phase: for x' = (x'_1, ..., x'_n), calculate the conditional probabilities with all the normal distributions, then apply the MAP rule to make a decision.
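A sketch of one such class-conditional Gaussian; the temperature readings below are hypothetical values invented for illustration, not from the slides:

```python
# Gaussian class-conditional density for a continuous attribute:
# P^(X_j | C = c_i) modeled as N(mu_ji, sigma_ji^2), with (mu, sigma)
# estimated per class from that class's training examples.
import math

def fit(values):
    """Estimate mean and sample standard deviation from one class's examples."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

temps_when_yes = [69.0, 70.0, 72.0, 75.0, 68.0]  # hypothetical Play=Yes temperatures
mu, sigma = fit(temps_when_yes)
print(round(gaussian_pdf(71.0, mu, sigma), 3))
```

At test time, the density value gaussian_pdf(x'_j, mu_ji, sigma_ji) simply takes the place of the table lookup P̂(a'_j | c_i) in the MAP product.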
Conclusions
Naïve Bayes is based on the independence assumption.
- Training is very easy and fast: it just requires considering each attribute in each class separately.
- Testing is straightforward: just look up tables or calculate conditional probabilities with normal distributions.
A popular generative model:
- Performance competitive with most state-of-the-art classifiers, even when the independence assumption is violated.
- Many successful applications, e.g. spam mail filtering.
- A good candidate for a base learner in ensemble learning.
- Apart from classification, naïve Bayes can do more.