Elementary manipulations of probabilities


Elementary manipulations of probabilities
Set probability of a multi-valued r.v.: P({X = Odd}) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2. More generally, $P(X \in \{x_i, \ldots, x_j\}) = \sum_{k=i}^{j} P(X = x_k)$.
Multivariate distribution. Joint probability: P(X = true ∧ Y = true). Marginal probability: $P(X = x_i) = \sum_{j} P(X = x_i, Y = y_j)$.

Joint probability
A joint probability distribution for a set of RVs gives the probability of every atomic event (sample point). P(Flu, DrinkBeer) = a 2 x 2 matrix of values (normalized, i.e., sums to 1):

                  DrinkBeer = t    DrinkBeer = f
    Flu = t           0.005            0.02
    Flu = f           0.195            0.78

P(Flu, DrinkBeer, Headache) = ? Every question about a domain can be answered by the joint distribution, as we will see later.
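As a minimal sketch (the dictionary layout and variable names below are illustrative choices; the four probabilities are the table values above), the joint can be stored keyed by assignments and marginals obtained by summing out the other variable:

```python
# Joint distribution P(Flu, DrinkBeer), keyed by (flu, drink_beer); values from the table above.
joint = {
    (True, True): 0.005,
    (True, False): 0.02,
    (False, True): 0.195,
    (False, False): 0.78,
}

# Marginals: sum out the variable we do not care about.
p_flu = sum(p for (flu, _), p in joint.items() if flu)      # P(Flu = true)       -> 0.025
p_beer = sum(p for (_, beer), p in joint.items() if beer)   # P(DrinkBeer = true) -> 0.2
print(p_flu, p_beer)
```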

Conditional probability
P(X|Y) = the fraction of worlds in which X is true that also have Y true.
H = "having a headache", F = "coming down with flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2 = the fraction of flu-inflicted worlds in which you have a headache.
Definition: P(X|Y) = P(X ∧ Y) / P(Y). Corollary (the chain rule): P(X ∧ Y) = P(X|Y) P(Y).
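A quick numeric check of the definition and the chain rule with the slide's numbers (a sketch; the Python variable names are just for illustration):

```python
p_h = 1 / 10          # P(Headache)
p_f = 1 / 40          # P(Flu)
p_h_given_f = 1 / 2   # P(Headache | Flu)

# Chain rule: P(H and F) = P(H | F) * P(F)
p_h_and_f = p_h_given_f * p_f

# Definition of conditional probability recovers P(H | F) from the joint:
assert abs(p_h_and_f / p_f - p_h_given_f) < 1e-12
print(p_h_and_f)  # 0.0125
```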

MLE
Objective function: $\ell(\theta; D) = \log P(D \mid \theta) = \log\big(\theta^{n_h}(1-\theta)^{n_t}\big) = n_h \log\theta + n_t \log(1-\theta)$.
We need to maximize this w.r.t. $\theta$. Take the derivative w.r.t. $\theta$ and set it to zero: $\frac{\partial \ell}{\partial \theta} = \frac{n_h}{\theta} - \frac{n_t}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta}_{MLE} = \frac{n_h}{n}$, or $\hat{\theta}_{MLE} = \frac{1}{n}\sum_i x_i$.
Sufficient statistics (frequency as sample mean): the counts $n_h$ and $n_t$, where $n_h = \sum_i x_i$, are sufficient statistics of the data D.
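A minimal sketch of the closed-form estimator, using a made-up sequence of coin tosses (not the slide's data):

```python
data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # 1 = head, 0 = tail (illustrative)

n = len(data)
n_h = sum(data)       # sufficient statistic: number of heads
theta_mle = n_h / n   # theta_MLE = n_h / n, the sample mean of the 0/1 outcomes
print(theta_mle)      # 0.6
```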

The Bayes rule
What we have just done leads to the following general expression: $P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$. This is Bayes' rule.

More general forms of Bayes rule
$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X \mid Y)\,P(Y) + P(X \mid \neg Y)\,P(\neg Y)}$
With background evidence Z: $P(Y \mid X, Z) = \frac{P(X \mid Y, Z)\,P(Y \mid Z)}{P(X \mid Z)}$, e.g., P(Flu | Headache, DrinkBeer).
For a multi-valued Y: $P(Y = y_i \mid X) = \frac{P(X \mid Y = y_i)\,P(Y = y_i)}{\sum_{j=1}^{S} P(X \mid Y = y_j)\,P(Y = y_j)}$.

Probabilistic inference
H = "having a headache", F = "coming down with flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2.
One day you wake up with a headache. You come up with the following reasoning: "Since 50% of flus are associated with headaches, I must have a 50-50 chance of coming down with flu."
Is this reasoning correct?

Probabilistic inference
H = "having a headache", F = "coming down with flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2.
The problem: P(F|H) = ?
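Plugging the slide's numbers into Bayes' rule gives P(F|H) = P(H|F) P(F) / P(H) = (1/2)(1/40)/(1/10) = 1/8, so the 50-50 reasoning above is wrong. A one-line check:

```python
p_h, p_f, p_h_given_f = 1 / 10, 1 / 40, 1 / 2

p_f_given_h = p_h_given_f * p_f / p_h   # Bayes' rule
print(p_f_given_h)                      # 0.125, far from 0.5
```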

Prior distribution
Suppose that our propositions about the possible worlds have a "causal flow", e.g., Virus → Flu → Headache ← DrinkBeer.
Prior or unconditional probabilities of propositions, e.g., P(Flu = true) = 0.025 and P(DrinkBeer = true) = 0.2, correspond to belief prior to the arrival of any (new) evidence.
A probability distribution gives values for all possible assignments, e.g., P(DrinkBeer) = [0.01, 0.09, 0.1, 0.8] (normalized, i.e., sums to 1).

Posterior (conditional) probability
Conditional or posterior (see later) probabilities, e.g., P(Flu | Headache) = 0.178: valid given that headache is all I know, NOT "if headache, then 17.8% chance of flu".
Representation of conditional distributions: P(Flu | Headache) = a 2-element vector of 2-element vectors.
If we know more, e.g., DrinkBeer is also given, then we have P(Flu | Headache, DrinkBeer) = 0.070. This effect is known as "explaining away"!
P(Flu | Headache, Flu) = 1. Note: the less or more certain belief remains valid after more evidence arrives, but it is not always useful.
New evidence may be irrelevant, allowing simplification, e.g., P(Flu | Headache, SteelersWin) = P(Flu | Headache). This kind of inference, sanctioned by domain knowledge, is crucial.

Inference by enumeration
Start with a joint distribution. Building a joint distribution of M = 3 variables: make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows), and for each combination of values, say how probable it is (normalized, i.e., sums to 1).

    Flu  DrinkBeer  Headache   Prob
     0       0         0       0.4
     0       0         1       0.
     0       1         0       0.7
     0       1         1       0.
     1       0         0       0.05
     1       0         1       0.05
     1       1         0       0.05
     1       1         1       0.05

Inference with the joint
Once you have the joint distribution (JD), you can ask for the probability of any event E consistent with your query by summing the matching atomic events: $P(E) = \sum_{\text{rows } i \text{ consistent with } E} P(\text{row}_i)$ (using the joint table above).

Inference with the joint: compute marginals
P(Flu ∧ Headache) = the sum of the entries of the joint table above in which Flu = 1 and Headache = 1.

Inference with the joint: compute marginals
P(Headache) = the sum of the entries of the joint table above in which Headache = 1.

Inference with the joint: compute conditionals
$P(E_1 \mid E_2) = \frac{P(E_1 \wedge E_2)}{P(E_2)} = \frac{\sum_{\text{rows consistent with } E_1 \wedge E_2} P(\text{row})}{\sum_{\text{rows consistent with } E_2} P(\text{row})}$.

Inference with the joint: compute conditionals
P(Flu | Headache) = P(Flu ∧ Headache) / P(Headache), with both terms read off the joint table above.
General idea: compute the distribution of the query variable by fixing the evidence variables and summing over the hidden variables.

Summary: inference by enumeration
Let X be all the variables. Typically, we want the posterior joint distribution of the query variables Y given specific values e for the evidence variables E. Let the hidden variables be H = X − Y − E.
Then the required summation of joint entries is done by summing out the hidden variables: $P(Y \mid E = e) = \alpha\, P(Y, E = e) = \alpha \sum_{h} P(Y, E = e, H = h)$.
The terms in the summation are joint entries, because Y, E, and H together exhaust the set of random variables.
Obvious problems: worst-case time complexity $O(d^n)$, where d is the largest arity; space complexity $O(d^n)$ to store the joint distribution; how to find the numbers for the $O(d^n)$ entries???
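A small sketch of inference by enumeration over a three-variable Boolean joint. The table values here are hypothetical (the slide's own entries are not fully legible above); `query` fixes the evidence, sums over everything else, and normalizes:

```python
from itertools import product

VARS = ("Flu", "DrinkBeer", "Headache")
# Hypothetical joint table: one probability per truth-table row, summing to 1.
probs = [0.60, 0.10, 0.05, 0.05, 0.05, 0.05, 0.04, 0.06]
joint = dict(zip(product([False, True], repeat=3), probs))

def query(target, evidence):
    """Return P(target | evidence) by enumeration: fix evidence, sum out hidden vars, normalize."""
    def matches(assignment, constraints):
        return all(assignment[VARS.index(v)] == val for v, val in constraints.items())
    numerator = sum(p for a, p in joint.items() if matches(a, {**evidence, **target}))
    denominator = sum(p for a, p in joint.items() if matches(a, evidence))
    return numerator / denominator

print(query({"Flu": True}, {"Headache": True}))   # P(Flu | Headache) under the toy numbers
```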

Conditional independence
Write out the full joint distribution using the chain rule:
P(Headache, Flu, Virus, DrinkBeer)
  = P(Headache | Flu, Virus, DrinkBeer) P(Flu, Virus, DrinkBeer)
  = P(Headache | Flu, Virus, DrinkBeer) P(Flu | Virus, DrinkBeer) P(Virus | DrinkBeer) P(DrinkBeer)
Assume independence and conditional independence:
  = P(Headache | Flu, DrinkBeer) P(Flu | Virus) P(Virus) P(DrinkBeer)
I.e., how many independent parameters?
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n. Conditional independence is our most basic and robust form of knowledge about uncertain environments.

Rules of independence --- by examples
P(Virus | DrinkBeer) = P(Virus) iff Virus is independent of DrinkBeer.
P(Flu | Virus, DrinkBeer) = P(Flu | Virus) iff Flu is independent of DrinkBeer, given Virus.
P(Headache | Flu, Virus, DrinkBeer) = P(Headache | Flu, DrinkBeer) iff Headache is independent of Virus, given Flu and DrinkBeer.

Marginal and conditional independence
Recall that for events E (i.e., X = x) and F (say, Y = y), the conditional probability of E given F, written as P(E|F), is P(E and F) / P(F) = the probability that both E and F are true, given that F is true.
E and F are statistically independent if P(E) = P(E|F), i.e., the probability that E is true doesn't depend on whether F is true; or equivalently, P(E and F) = P(E) P(F).
E and F are conditionally independent given H if P(E | F, H) = P(E | H), or equivalently, P(F | E, H) = P(F | H).

Why knowledge of independence is useful
Lower complexity (time, space, search) when working with tables such as the joint above.
Motivates efficient inference for all kinds of queries. Stay tuned!!
Structured knowledge about the domain: easy to learn (both from an expert and from data), easy to grow.

Where do probability distributions come from?
Idea One: humans, domain experts.
Idea Two: simpler probability facts and some algebra, e.g., combining marginals and conditionals via the product rule to fill in a joint table like the one above.
Idea Three: learn them from data! A good chunk of this course is essentially about various ways of learning various forms of them!

Density estimation
A density estimator learns a mapping from a set of attributes to a probability.
Often known as parameter estimation if the distribution form is specified (binomial, Gaussian, ...).
Important issues:
  Nature of the data (iid, correlated, ...)
  Objective function (MLE, MAP, ...)
  Algorithm (simple algebra, gradient methods, EM, ...)
  Evaluation scheme (likelihood on test data, predictability, consistency, ...)

Parameter learning from iid data
Goal: estimate distribution parameters θ from a dataset of independent, identically distributed (iid), fully observed training cases D = {x_1, ..., x_N}.
Maximum likelihood estimation (MLE):
1. One of the most common estimators.
2. With the iid and full-observability assumptions, write L as the likelihood of the data: $L(\theta) = P(x_1, x_2, \ldots, x_N; \theta) = \prod_{i=1}^{N} P(x_i; \theta)$.
3. Pick the setting of parameters most likely to have generated the data we saw: $\theta^{*} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log L(\theta)$.

Example 1: Bernoulli model
Data: we observed N iid coin tosses: D = {1, 0, 1, 1, 0}.
Representation: binary r.v. x ∈ {0, 1}.
Model: P(x) = 1 − θ for x = 0 and θ for x = 1, i.e., $P(x) = \theta^{x}(1-\theta)^{1-x}$.
How to write the likelihood of a single observation x_i? $P(x_i) = \theta^{x_i}(1-\theta)^{1-x_i}$.
The likelihood of the dataset D = {x_1, ..., x_N}: $P(x_1, \ldots, x_N \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta) = \theta^{\sum_i x_i}\,(1-\theta)^{\sum_i (1 - x_i)} = \theta^{\#\text{heads}}\,(1-\theta)^{\#\text{tails}}$.
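A small sketch evaluating this likelihood for the tosses shown above; among the sampled values it is largest at the MLE $n_h / N = 0.6$:

```python
D = [1, 0, 1, 1, 0]   # coin tosses from the example above (1 = head, 0 = tail)
n_h = sum(D)
n_t = len(D) - n_h

def likelihood(theta):
    return theta ** n_h * (1 - theta) ** n_t   # L(theta) = theta^{n_h} (1 - theta)^{n_t}

for theta in (0.2, 0.4, 0.6, 0.8):
    print(theta, likelihood(theta))            # largest value at theta = 0.6 = n_h / N
```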

MLE for discrete joint distributions
More generally, it is easy to show that $\hat{P}(\text{event } i) = \dfrac{\#\,\text{records in which event } i \text{ is true}}{\text{total number of records}}$, i.e., each entry of a joint table like the one above is estimated by its relative frequency in the data.
This is an important but sometimes not so effective learning algorithm!
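A tiny sketch of this counting estimator over some made-up records (the record values are illustrative, not from the slide):

```python
from collections import Counter

# Each record is an assignment to (Flu, Headache); illustrative data only.
records = [(False, False), (False, True), (True, True), (False, False), (False, False)]

counts = Counter(records)
p_hat = {event: c / len(records) for event, c in counts.items()}   # relative frequencies
print(p_hat[(True, True)])   # #records with Flu = Headache = true / total = 0.2
```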

Example 2: univariate normal
Data: we observed N iid real samples, e.g., D = {−0.1, 0, 1, −5.2, 2, 3}.
Model: $p(x) = (2\pi\sigma^2)^{-1/2} \exp\{-(x-\mu)^2 / 2\sigma^2\}$.
Log likelihood: $\ell(\mu, \sigma; D) = \log P(D \mid \mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$.
MLE: take derivatives and set to zero: $\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_i (x_i - \mu) = 0$ and $\frac{\partial \ell}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (x_i - \mu)^2 = 0$, which give $\mu_{MLE} = \frac{1}{N}\sum_i x_i$ and $\sigma^2_{MLE} = \frac{1}{N}\sum_i (x_i - \mu_{MLE})^2$.
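A sketch of the closed-form estimates on a sample like the one above (note the divide-by-N, not N−1, in the variance, which is what the MLE derivation gives):

```python
import numpy as np

D = np.array([-0.1, 0.0, 1.0, -5.2, 2.0, 3.0])   # illustrative iid sample

mu_mle = D.mean()                          # (1/N) * sum_i x_i
sigma2_mle = ((D - mu_mle) ** 2).mean()    # (1/N) * sum_i (x_i - mu)^2
print(mu_mle, sigma2_mle)
```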

Overfitting
Recall that for the Bernoulli distribution, we have $\hat{\theta}^{ML}_{head} = \frac{n_{head}}{n_{head} + n_{tail}}$.
What if we tossed too few times, so that we saw zero heads? We would have $\hat{\theta}^{ML}_{head} = 0$, and we would predict that the probability of seeing a head next is zero!!!
The rescue: $\hat{\theta}'_{head} = \frac{n_{head} + n'}{n_{head} + n' + n_{tail} + n'}$, where n' is known as the pseudo- (imaginary) count.
But can we make this more formal?
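A minimal sketch of the pseudo-count rescue; with zero observed heads it predicts a small but nonzero head probability instead of exactly zero:

```python
def smoothed_head_probability(n_head, n_tail, n_prime=1):
    # (n_head + n') / (n_head + n' + n_tail + n'), with n' imaginary counts per outcome
    return (n_head + n_prime) / (n_head + n_prime + n_tail + n_prime)

print(smoothed_head_probability(0, 3))   # 0.2, instead of the MLE's 0.0
```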

The Bayesian theory
The Bayesian theory: e.g., for data D and model M, P(M|D) = P(D|M) P(M) / P(D).
The posterior equals the likelihood times the prior, up to a (normalizing) constant.
This allows us to capture uncertainty about the model in a principled way.

Hierarchical Bayesian models
θ are the parameters for the likelihood p(x | θ); α are the parameters for the prior p(θ | α).
We can have hyper-hyper-parameters, etc. We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically we make the top-level hyperparameters constants.
Where do we get the prior? Intelligent guesses; empirical Bayes (Type-II maximum likelihood), i.e., computing point estimates of α: $\hat{\alpha}_{MLE} = \arg\max_{\alpha} p(D \mid \alpha)$.

Bayesian estimation for Bernoulli
Beta distribution (the prior): $P(\theta; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = B(\alpha, \beta)\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$.
Posterior distribution of θ: $p(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \theta^{n_h}(1-\theta)^{n_t} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{n_h + \alpha - 1}(1-\theta)^{n_t + \beta - 1}$.
Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior.

Bayesian estimation for Bernoulli, cont'd
Posterior distribution of θ: $p(\theta \mid x_1, \ldots, x_N) \propto \theta^{n_h + \alpha - 1}(1-\theta)^{n_t + \beta - 1}$.
Maximum a posteriori (MAP) estimation: $\theta_{MAP} = \arg\max_{\theta} \log p(\theta \mid x_1, \ldots, x_N)$.
Posterior mean estimation: $\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C \int \theta \cdot \theta^{n_h + \alpha - 1}(1-\theta)^{n_t + \beta - 1}\, d\theta = \frac{n_h + \alpha}{N + \alpha + \beta}$.
Prior strength: A = α + β. A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts; the Beta parameters can be understood as pseudo-counts.
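A short sketch of the conjugate update and the two point estimates; the hyperparameters and counts below are illustrative, and the MAP formula is the standard Beta posterior mode:

```python
alpha, beta = 2.0, 2.0     # Beta prior hyperparameters (illustrative)
n_h, n_t = 3, 1            # observed heads / tails (illustrative)

# Conjugacy: Beta(alpha, beta) prior + Bernoulli data -> Beta(alpha + n_h, beta + n_t) posterior.
a_post, b_post = alpha + n_h, beta + n_t

theta_map = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
theta_mean = a_post / (a_post + b_post)            # posterior mean: (n_h + alpha) / (N + alpha + beta)
print(theta_map, theta_mean)
```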

Effect of prior strength
Suppose we have a uniform-style prior (α = β = A'/2), and we observe n_h = 2, n_t = 8.
Weak prior, A' = 2. Posterior prediction: $p(x = h \mid n_h = 2, n_t = 8, \alpha = \beta = 1) = \frac{2 + 1}{10 + 2} = 0.25$.
Strong prior, A' = 20. Posterior prediction: $p(x = h \mid n_h = 2, n_t = 8, \alpha = \beta = 10) = \frac{2 + 10}{10 + 20} = 0.40$.
However, if we have enough data, it washes away the prior. E.g., with n_h = 200, n_t = 800, the estimates under the weak and strong priors are $\frac{200 + 1}{1000 + 2}$ and $\frac{200 + 10}{1000 + 20}$, respectively, both of which are close to 0.2.
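The two predictions above, and the large-data case, in a few lines (the posterior-predictive formula is the posterior mean from the previous slide):

```python
def posterior_predictive(n_h, n_t, alpha, beta):
    # P(next toss = head | data) = (n_h + alpha) / (n_h + n_t + alpha + beta)
    return (n_h + alpha) / (n_h + n_t + alpha + beta)

print(posterior_predictive(2, 8, 1, 1))        # weak prior,   A' = 2  -> 0.25
print(posterior_predictive(2, 8, 10, 10))      # strong prior, A' = 20 -> 0.4
print(posterior_predictive(200, 800, 1, 1))    # ~0.2006: enough data washes the prior away
print(posterior_predictive(200, 800, 10, 10))  # ~0.2059
```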

Bayesian estimation for the normal distribution
Normal prior: $P(\mu) = (2\pi\tau^2)^{-1/2} \exp\{-(\mu - \mu_0)^2 / 2\tau^2\}$.
Joint probability: $P(x, \mu) = (2\pi\sigma^2)^{-N/2} \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2\Big\} \times (2\pi\tau^2)^{-1/2} \exp\{-(\mu - \mu_0)^2 / 2\tau^2\}$.
Posterior: $P(\mu \mid x) = (2\pi\tilde{\sigma}^2)^{-1/2} \exp\{-(\mu - \tilde{\mu})^2 / 2\tilde{\sigma}^2\}$, where $\tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\tau^2}\,\bar{x} + \frac{1/\tau^2}{N/\sigma^2 + 1/\tau^2}\,\mu_0$ and $\tilde{\sigma}^2 = \big(\frac{N}{\sigma^2} + \frac{1}{\tau^2}\big)^{-1}$, with sample mean $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.
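A sketch of this posterior update for the mean with known variance; the data, sigma^2, and prior parameters below are illustrative, not from the slide:

```python
import numpy as np

def gaussian_mean_posterior(data, sigma2, mu0, tau2):
    """Posterior over mu for a N(mu, sigma2) likelihood (known sigma2) and N(mu0, tau2) prior."""
    n = len(data)
    xbar = float(np.mean(data))
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)               # sigma_tilde^2
    post_mean = post_var * (n * xbar / sigma2 + mu0 / tau2)   # mu_tilde
    return post_mean, post_var

print(gaussian_mean_posterior([1.1, 0.7, 1.4, 0.9, 1.2], sigma2=1.0, mu0=0.0, tau2=10.0))
```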