School of Computer Science
Probabilistic Graphical Models

Approximate Inference: Markov Chain Monte Carlo

Eric Xing
Lecture 17, March 19, 2014
© Eric Xing @ CMU, 2005-2014
Recap of Monte Carlo
Monte Carlo methods are algorithms that:
- Generate samples from a given probability distribution P(x)
- Estimate expectations of functions, E_P[f(x)], under a distribution P(x)
Why is this useful?
- Can use samples of P(x) to approximate P(x) itself; this allows us to do graphical model inference when we can't compute P(x)
- Expectations reveal interesting properties about P(x), e.g. means and variances of P(x)
Limitations of Monte Carlo
Direct sampling:
- Hard to get rare events in high-dimensional spaces
- Infeasible for MRFs, unless we know the normalizer Z
Rejection sampling and importance sampling:
- Do not work well if the proposal Q(x) is very different from P(x)
- Yet constructing a Q(x) similar to P(x) can be difficult
- Making a good proposal usually requires knowledge of the analytic form of P(x); but if we had that, we wouldn't even need to sample!
Intuition: instead of a fixed proposal Q(x), what if we could use an adaptive proposal?
Markov Chain Monte Carlo
MCMC algorithms feature adaptive proposals:
- Instead of Q(x'), they use Q(x'|x), where x' is the new state being sampled and x is the previous sample
- As x changes, Q(x'|x) can also change, as a function of x
(Figure: importance sampling with a bad fixed proposal Q(x), vs. MCMC with an adaptive proposal Q(x'|x) that tracks the previous sample.)
Metropolis-Hastings
Let's see how MCMC works in practice; later, we'll look at the theoretical aspects.
The Metropolis-Hastings algorithm:
- Draws a sample x' from Q(x'|x), where x is the previous sample
- The new sample x' is accepted or rejected with some probability A(x'|x):
      A(x'|x) = min(1, P(x')Q(x|x') / (P(x)Q(x'|x)))
- This acceptance probability is like a ratio of importance sampling weights:
  - P(x')/Q(x'|x) is the importance weight for x'; P(x)/Q(x|x') is the importance weight for x
  - We divide the importance weight for x' by that of x
  - Notice that we only need to compute P(x')/P(x), rather than P(x') or P(x) separately
- A(x'|x) ensures that, after sufficiently many draws, our samples will come from the true distribution P(x); we shall learn why later in this lecture
The MH Algorithm
1. Initialize starting state x(0); set t = 0
2. Burn-in: while samples have not converged:
       x = x(t); t = t + 1
       sample x* ~ Q(x*|x)        // draw from proposal
       sample u ~ Uniform(0,1)    // draw acceptance threshold
       if u < A(x*|x) = min(1, P(x*)Q(x|x*) / (P(x)Q(x*|x))):
           x(t) = x*              // transition
       else:
           x(t) = x               // stay in current state
3. Take samples from P(x): reset t = 0; for t = 1:N, draw sample x(t) as above
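The algorithm above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the bimodal target, the proposal width, and the seed are illustrative choices, not from the lecture. The proposal is a symmetric Gaussian, so the ratio Q(x|x*)/Q(x*|x) = 1 and the acceptance probability simplifies to min(1, P(x*)/P(x)).

```python
import math
import random

def metropolis_hastings(log_p, x0, proposal_std, n_samples, burn_in=1000):
    """MH with a symmetric Gaussian proposal Q(x*|x) = N(x*; x, proposal_std^2)."""
    x = x0
    samples = []
    for t in range(burn_in + n_samples):
        x_star = random.gauss(x, proposal_std)      # sample x* ~ Q(x*|x)
        # Q symmetric, so A = min(1, P(x*)/P(x)); work in log space for stability
        log_A = min(0.0, log_p(x_star) - log_p(x))
        if random.random() < math.exp(log_A):       # accept: transition to x*
            x = x_star                              # (else: stay in current state)
        if t >= burn_in:                            # keep only post-burn-in samples
            samples.append(x)
    return samples

# Bimodal target: unnormalized mixture of two unit-variance Gaussians at -3 and +3
def log_p(x):
    return math.log(math.exp(-0.5 * (x + 3) ** 2) + math.exp(-0.5 * (x - 3) ** 2))

random.seed(0)
samples = metropolis_hastings(log_p, x0=0.0, proposal_std=2.5, n_samples=20000)
left = sum(1 for s in samples if s < 0) / len(samples)  # fraction in the left mode
```

With a proposal wide enough to jump between the modes, the chain spends mass in both, which is the behavior the bimodal example on the next slides illustrates.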
The MH Algorithm: Example
    A(x'|x) = min(1, P(x')Q(x|x') / (P(x)Q(x'|x)))
Example: let Q(x'|x) be a Gaussian centered on x. We're trying to sample from a bimodal distribution P(x).
- Initialize x(0)
- Draw x(1); accept
- Draw x(2); accept
- Draw x', but reject; set x(3) = x(2)
  (We reject because P(x')/Q(x'|x(2)) is small while P(x(2))/Q(x(2)|x') is large, hence A(x'|x(2)) is close to zero!)
- Draw x(4); accept
- Draw x(5); accept
The adaptive proposal Q(x'|x) allows us to sample both modes of P(x)!
(Figure: a sequence of Gaussian proposals sliding along with the chain, eventually crossing from one mode of P(x) to the other.)
Theoretical Aspects of MCMC
- The MH algorithm has a "burn-in" period. Why do we throw away samples from burn-in?
- Why are the MH samples guaranteed to be from P(x)? The proposal Q(x'|x) keeps changing with the value of x; how do we know the samples will eventually come from P(x)?
- What is the connection between Markov chains and MCMC?
Markov Chains
A Markov chain is a sequence of random variables x(1), ..., x(n) with the Markov property:
    P(x(n) | x(1), ..., x(n-1)) = P(x(n) | x(n-1))
- P(x(n) | x(n-1)) is known as the transition kernel
- The next state depends only on the preceding state (recall HMMs!)
- Note: the RVs x(t) can be vectors. We define x(t) to be the t-th sample of all variables in a graphical model; x(t) represents the entire state of the graphical model at time t
We study homogeneous Markov chains, in which the transition kernel P(x(t+1) | x(t)) is fixed with time. To emphasize this, we will call the kernel T(x'|x), where x is the previous state and x' is the next state.
MC Concepts
To understand MCs, we need to define a few concepts:
- Probability distributions over states: pi(t)(x) is a distribution over the state of the system x at time t
  - When dealing with MCs, we don't think of the system as being in one state, but as having a distribution over states
  - For graphical models, remember that x represents all variables
- Transitions: recall that states transition from x(t) to x(t+1) according to the transition kernel T(x'|x). We can also transition entire distributions:
      pi(t+1)(x') = sum_x pi(t)(x) T(x'|x)
  At time t, state x has probability mass pi(t)(x); the transition probability redistributes this mass to other states x'
- Stationary distributions: pi(x) is stationary if it does not change under the transition kernel:
      pi(x') = sum_x pi(x) T(x'|x), for all x'
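The distribution update above can be sketched numerically. The 3-state kernel T below is an illustrative choice, not from the lecture; T[x][x2] stands for T(x2|x), so each row sums to 1.

```python
# Illustrative 3-state homogeneous Markov chain
T = [
    [0.5,  0.5,  0.0],
    [0.25, 0.5,  0.25],
    [0.0,  0.5,  0.5],
]

def step(pi, T):
    """One distribution transition: pi(t+1)(x') = sum_x pi(t)(x) T(x'|x)."""
    n = len(T)
    return [sum(pi[x] * T[x][x2] for x in range(n)) for x2 in range(n)]

pi = [1.0, 0.0, 0.0]            # start with all mass on state 0
for _ in range(100):            # iterate the kernel
    pi = step(pi, T)

# pi has converged to the stationary distribution: applying T no longer changes it
pi_next = step(pi, T)
assert all(abs(a - b) < 1e-9 for a, b in zip(pi, pi_next))
print([round(x, 3) for x in pi])   # [0.25, 0.5, 0.25]
```

This chain is ergodic, so the same stationary distribution is reached from any initial pi(0).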
MC Concepts
Stationary distributions are of great importance in MCMC. To understand them, we need to define some notions:
- Irreducible: an MC is irreducible if you can get from any state x to any other state x' with probability > 0 in a finite number of steps, i.e. there are no unreachable parts of the state space
- Aperiodic: an MC is aperiodic if you can return to any state x at any time; periodic MCs have states that can only be revisited in cycles, at multiples of some period of time steps
- Ergodic (or regular): an MC is ergodic if it is irreducible and aperiodic
Ergodicity is important: it implies you can reach the stationary distribution pi_st(x) no matter the initial distribution pi(0)(x). All good MCMC algorithms must satisfy ergodicity, so that you can't initialize in a way that will never converge.
MC Concepts
Reversible (detailed balance): an MC is reversible if there exists a distribution pi(x) such that the detailed balance condition is satisfied:
    pi(x') T(x|x') = pi(x) T(x'|x)
- The probabilities of moving x -> x' and x' -> x can differ, but the joint probability of being at one state and transitioning to the other is the same in either direction
Reversible MCs always have a stationary distribution! Proof:
    sum_x pi(x) T(x'|x) = sum_x pi(x') T(x|x')
                        = pi(x') sum_x T(x|x')
                        = pi(x')
The last line is the definition of a stationary distribution!
Why does Metropolis-Hastings work?
- Recall that we draw a sample x' according to Q(x'|x), and then accept/reject according to A(x'|x). In other words, the transition kernel is
      T(x'|x) = Q(x'|x) A(x'|x), for x' != x
- We can prove that MH satisfies detailed balance. Recall that
      A(x'|x) = min(1, P(x')Q(x|x') / (P(x)Q(x'|x)))
- Notice this implies the following:
      if A(x'|x) < 1, then P(x')Q(x|x') / (P(x)Q(x'|x)) < 1, and thus A(x|x') = 1
Why does Metropolis-Hastings work?
Now suppose A(x'|x) < 1 and A(x|x') = 1. We have:
    P(x) T(x'|x) = P(x) Q(x'|x) A(x'|x)
                 = P(x) Q(x'|x) * [P(x')Q(x|x') / (P(x)Q(x'|x))]
                 = P(x') Q(x|x')
                 = P(x') Q(x|x') A(x|x')
                 = P(x') T(x|x')
The last line is exactly the detailed balance condition! In other words, the MH algorithm leads to a stationary distribution P(x). Recall we defined P(x) to be the true distribution of x; thus, the MH algorithm eventually converges to the true distribution!
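The derivation above can be checked numerically for a discrete chain. The 3-state target p and proposal Q below are illustrative choices; the rejected mass is placed on the diagonal of the kernel so each row of T still sums to 1.

```python
# Discrete target p over 3 states, and an arbitrary proposal Q (rows sum to 1)
p = [0.2, 0.3, 0.5]
Q = [
    [0.1,  0.6,  0.3],
    [0.4,  0.2,  0.4],
    [0.5,  0.25, 0.25],
]

def mh_kernel(p, Q):
    """Build T(x'|x) = Q(x'|x) A(x'|x) for x' != x; leftover (rejection) mass stays at x."""
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for x2 in range(n):
            if x2 != x:
                A = min(1.0, (p[x2] * Q[x2][x]) / (p[x] * Q[x][x2]))
                T[x][x2] = Q[x][x2] * A
        T[x][x] = 1.0 - sum(T[x])       # probability of staying in the current state
    return T

T = mh_kernel(p, Q)
# Detailed balance: p(x) T(x'|x) = p(x') T(x|x') for all pairs
for x in range(3):
    for x2 in range(3):
        assert abs(p[x] * T[x][x2] - p[x2] * T[x2][x]) < 1e-12
# Hence p is stationary: p T = p
pT = [sum(p[x] * T[x][j] for x in range(3)) for j in range(3)]
assert all(abs(a - b) < 1e-12 for a, b in zip(pT, p))
```

Note that Q itself does not satisfy detailed balance with respect to p; it is the acceptance step that enforces it.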
Caveats
- Although MH eventually converges to the true distribution P(x), we have no guarantees as to when this will occur
- The burn-in period represents the un-converged part of the Markov chain; that is why we throw those samples away!
- Knowing when to halt burn-in is an art; we will look at some techniques later in this lecture
Gibbs Sampling
Gibbs sampling (GS) is an MCMC algorithm that samples each random variable of a graphical model, one at a time. GS is a special case of the MH algorithm.
GS algorithms:
- Are fairly easy to derive for many graphical models (e.g. mixture models, Latent Dirichlet allocation)
- Have reasonable computation and memory requirements, because they sample one RV at a time
- Can be Rao-Blackwellized (integrate out some RVs) to decrease the sampling variance
Gibbs Sampling
The GS algorithm:
1. Suppose the graphical model contains variables x_1, ..., x_n
2. Initialize starting values for x_1, ..., x_n
3. Do until convergence:
   - Pick an ordering of the n variables (can be fixed or random)
   - For each variable x_i in order:
     - Sample x from P(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n), i.e. the conditional distribution of x_i given the current values of all other variables
     - Update x_i <- x
When we update x_i, we immediately use its new value for sampling other variables x_j.
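The loop above can be sketched on a toy model: a 2-D Gaussian with unit variances and correlation rho, whose full conditionals P(x_i | x_-i) are the Gaussians N(rho * x_other, 1 - rho^2). This is a minimal sketch; the rho value and seed are illustrative choices, not from the lecture.

```python
import math
import random

rho = 0.8                           # illustrative correlation of the 2-D Gaussian target
cond_sd = math.sqrt(1 - rho ** 2)   # std dev of each full conditional

def gibbs(n_samples, burn_in=500):
    x1, x2 = 0.0, 0.0               # step 2: initialize starting values
    samples = []
    for t in range(burn_in + n_samples):
        # step 3: sample each variable from its conditional, immediately
        # reusing the new value when sampling the next variable
        x1 = random.gauss(rho * x2, cond_sd)   # x1 | x2
        x2 = random.gauss(rho * x1, cond_sd)   # x2 | x1
        if t >= burn_in:
            samples.append((x1, x2))
    return samples

random.seed(0)
samples = gibbs(20000)
mean_x1 = sum(s[0] for s in samples) / len(samples)          # should be near 0
corr_hat = sum(s[0] * s[1] for s in samples) / len(samples)  # should be near rho
```

Note how every proposal is "accepted": sampling exactly from the conditional is what makes GS a special case of MH with acceptance probability 1, as shown later.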
Markov Blankets
The conditional P(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) looks intimidating, but recall Markov blankets:
- Let MB(x_i) be the Markov blanket of x_i; then
      P(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i | MB(x_i))
- For a BN, the Markov blanket of x_i is the set containing its parents, children, and co-parents
- For an MRF, the Markov blanket of x_i is its immediate neighbors
Gibbs Sampling: An Example
Consider the alarm network. Assume we sample variables in the order B, E, A, J, M. Initialize all variables at t = 0 to False.
    t | B E A J M
    0 | F F F F F
Gibbs Sampling: An Example
Sampling P(B | A, E) at t = 1: using Bayes' rule, with (A, E) = (F, F), we compute
    P(B | A=F, E=F) ∝ P(A=F | B, E=F) P(B)
for B = T and B = F, and sample B = F.
    t | B E A J M
    0 | F F F F F
    1 | F
Gibbs Sampling: An Example
Sampling P(E | A, B): using Bayes' rule, with (A, B) = (F, F), we compute
    P(E | A=F, B=F) ∝ P(A=F | E, B=F) P(E)
for E = T and E = F, and sample E = T.
    t | B E A J M
    0 | F F F F F
    1 | F T
Gibbs Sampling: An Example
Sampling P(A | B, E, J, M): using Bayes' rule, with (B, E, J, M) = (F, T, F, F), we compute
    P(A | B=F, E=T, J=F, M=F) ∝ P(J=F | A) P(M=F | A) P(A | B=F, E=T)
for A = T and A = F, and sample A = F.
    t | B E A J M
    0 | F F F F F
    1 | F T F
Gibbs Sampling: An Example
Sampling P(J | A): no need to apply Bayes' rule. With A = F, we read off
    P(J=T | A=F) = 0.05,  P(J=F | A=F) = 0.95
and sample J = T.
    t | B E A J M
    0 | F F F F F
    1 | F T F T
Gibbs Sampling: An Example
Sampling P(M | A): no need to apply Bayes' rule. With A = F, we read off
    P(M=T | A=F) = 0.01,  P(M=F | A=F) = 0.99
and sample M = F.
    t | B E A J M
    0 | F F F F F
    1 | F T F T F
Gibbs Sampling: An Example
Now t = 2, and we repeat the procedure to sample new values of B, E, A, J, M.
    t | B E A J M
    0 | F F F F F
    1 | F T F T F
    2 | F T T T T
Gibbs Sampling: An Example
And similarly for t = 3, 4, etc.
    t | B E A J M
    0 | F F F F F
    1 | F T F T F
    2 | F T T T T
    3 | T F T F T
    4 | T F T F F
Topic Models: Collapsed Gibbs (Tom Griffiths & Mark Steyvers)
Collapsed Gibbs sampling is a popular inference algorithm for topic models:
- Integrate out the topic vectors pi and topics B; only need to sample the word-topic assignments z
- Algorithm: for all variables z = (z_1, z_2, ..., z_n), draw z_i(t+1) from P(z_i | z_-i, w), where
      z_-i = (z_1(t+1), ..., z_{i-1}(t+1), z_{i+1}(t), ..., z_n(t))
(Plate diagram: alpha -> pi -> z -> w <- B <- beta, with plates over the N words in each of M documents and the K topics.)
Collapsed Gibbs Sampling
What is P(z_i | z_-i, w)? It is a product of two Dirichlet-Multinomial conditional distributions, a word-topic term and a doc-topic term:
    P(z_i = j | z_-i, w) ∝ [ (n_{-i,j}^{(w_i)} + beta) / (n_{-i,j}^{(.)} + W*beta) ] * [ (n_{-i,j}^{(d_i)} + alpha) / (n_{-i,.}^{(d_i)} + K*alpha) ]
where the counts exclude position i:
- n_{-i,j}^{(w_i)}: # word positions a (excluding i) such that w_a = w_i and z_a = j
- n_{-i,j}^{(.)}: # word positions a (excluding i) such that z_a = j
- n_{-i,j}^{(d_i)}: # word positions a in the current document d_i (excluding i) such that z_a = j
- n_{-i,.}^{(d_i)}: # word positions a in the current document d_i (excluding i)
(W is the vocabulary size; K is the number of topics.)
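This sampler can be sketched compactly, assuming symmetric Dirichlet priors. The tiny corpus and the K, alpha, beta values below are invented for illustration; the constant doc-topic denominator (n_{-i,.}^{(d_i)} + K*alpha) is dropped since it does not depend on the topic being sampled.

```python
import random

random.seed(0)
docs = [[0, 1, 2, 2], [2, 3, 3, 1], [0, 0, 1, 2]]   # toy corpus: word ids per document
V, K, alpha, beta = 4, 2, 0.5, 0.1                  # vocab size, topics, priors

# Count tables: n_wt[w][k] (word-topic), n_dt[d][k] (doc-topic), n_t[k] (topic totals)
n_wt = [[0] * K for _ in range(V)]
n_dt = [[0] * K for _ in range(len(docs))]
n_t = [0] * K
z = []
for d, doc in enumerate(docs):                      # random initialization of z
    zd = []
    for w in doc:
        k = random.randrange(K)
        zd.append(k); n_wt[w][k] += 1; n_dt[d][k] += 1; n_t[k] += 1
    z.append(zd)

for it in range(200):                               # Gibbs sweeps over all positions
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                             # remove the current assignment
            n_wt[w][k] -= 1; n_dt[d][k] -= 1; n_t[k] -= 1
            # P(z_i = k | z_-i, w): word-topic term * doc-topic term
            p = [(n_wt[w][k2] + beta) / (n_t[k2] + V * beta) * (n_dt[d][k2] + alpha)
                 for k2 in range(K)]
            r = random.random() * sum(p)            # sample from the unnormalized p
            k = 0
            while r > p[k]:
                r -= p[k]; k += 1
            z[d][i] = k                             # add the new assignment back
            n_wt[w][k] += 1; n_dt[d][k] += 1; n_t[k] += 1
```

Because the counts are updated in place, each z_i is sampled given the most recent values of all other assignments, exactly as in ordinary Gibbs sampling.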
Collapsed Gibbs Illustration
(Animation: a table of word positions i = 1, ..., 50, with columns w_i (words such as MATHEMATICS, KNOWLEDGE, RESEARCH, WORK, SCIENTIFIC, JOY), document index d_i, and topic assignment z_i. Starting from the assignments at iteration 1, each step resamples one z_i from P(z_i | z_-i, w); after 1000 iterations, the topic assignments have stabilized.)
Gibbs Sampling is a special case of MH
The GS proposal distribution is
    Q(x_i', x_-i | x_i, x_-i) = P(x_i' | x_-i)
where x_-i denotes all variables except x_i. Applying MH to this proposal, we find that samples are always accepted, which is exactly what GS does:
    A(x_i', x_-i | x_i, x_-i) = min(1, [P(x_i', x_-i) Q(x_i, x_-i | x_i', x_-i)] / [P(x_i, x_-i) Q(x_i', x_-i | x_i, x_-i)])
                              = min(1, [P(x_i', x_-i) P(x_i | x_-i)] / [P(x_i, x_-i) P(x_i' | x_-i)])
                              = min(1, [P(x_i' | x_-i) P(x_-i) P(x_i | x_-i)] / [P(x_i | x_-i) P(x_-i) P(x_i' | x_-i)])
                              = min(1, 1)
                              = 1
GS is simply MH with a proposal that is always accepted!
Practical Aspects of MCMC
- How do we know if our proposal Q(x'|x) is any good?
  - Monitor the acceptance rate
  - Plot the autocorrelation function
- How do we know when to stop burn-in?
  - Plot the sample values vs. time
  - Plot the log-likelihood vs. time
Acceptance Rate
(Figures: a narrow, low-variance proposal Q(x'|x) vs. a wide, high-variance proposal Q(x'|x) on a multimodal P(x).)
Choosing the proposal Q(x'|x) is a tradeoff:
- Narrow, low-variance proposals have high acceptance, but take many iterations to explore P(x) fully, because the proposed x' are too close to x
- Wide, high-variance proposals have the potential to explore much of P(x), but many proposals are rejected, which slows down the sampler
- A good Q(x'|x) proposes distant samples x' with a sufficiently high acceptance rate
Acceptance Rate
The acceptance rate is the fraction of samples that MH accepts.
- General guideline: proposals should have a ~0.5 acceptance rate [1]
- Gaussian special case: if both P(x) and Q(x'|x) are Gaussian, the optimal acceptance rate is ~0.45 for D = 1 dimension, and approaches ~0.23 as D tends to infinity [2]
[1] Muller, P. (1993). A Generic Approach to Posterior Integration and Gibbs Sampling.
[2] Roberts, G.O., Gelman, A., and Gilks, W.R. (1994). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms.
Autocorrelation Function
MCMC chains always show autocorrelation (AC); AC means that adjacent samples in time are highly correlated. We quantify AC with the autocorrelation function of an RV x:
    R_x(k) = [ sum_{t=1}^{n-k} (x_t - x_bar)(x_{t+k} - x_bar) ] / [ sum_{t=1}^{n-k} (x_t - x_bar)^2 ]
(Figures: a low-autocorrelation trace vs. a high-autocorrelation trace.)
Autocorrelation Function
    R_x(k) = [ sum_{t=1}^{n-k} (x_t - x_bar)(x_{t+k} - x_bar) ] / [ sum_{t=1}^{n-k} (x_t - x_bar)^2 ]
The first-order AC, R_x(1), can be used to estimate the Sample Size Inflation Factor (SSIF):
    s_x = (1 + R_x(1)) / (1 - R_x(1))
If we took n samples with SSIF s_x, then the effective sample size is n / s_x. High autocorrelation leads to a smaller effective sample size! We want proposals Q(x'|x) with low autocorrelation.
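The R_x(k) and SSIF formulas above can be sketched on a synthetic trace. The AR(1) process below merely stands in for an autocorrelated MCMC chain; its coefficient and the seed are illustrative choices.

```python
import random

def autocorr(xs, k):
    """R(k): correlation between the trace and itself shifted by lag k."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[t] - m) * (xs[t + k] - m) for t in range(n - k))
    den = sum((xs[t] - m) ** 2 for t in range(n - k))
    return num / den

# An AR(1) process x_t = a*x_{t-1} + noise mimics an autocorrelated chain
random.seed(0)
a = 0.9
xs = [0.0]
for _ in range(50000):
    xs.append(a * xs[-1] + random.gauss(0, 1))

r1 = autocorr(xs, 1)               # first-order autocorrelation, close to a = 0.9
ssif = (1 + r1) / (1 - r1)         # sample size inflation factor s_x
effective_n = len(xs) / ssif       # high autocorrelation shrinks the effective sample size
```

With r1 near 0.9, the SSIF is roughly 19, so 50,000 correlated draws carry only about as much information as a few thousand independent ones.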
Sample Values vs. Time
(Figures: well-mixed chains vs. poorly-mixed chains.)
Monitor convergence by plotting samples of RVs from multiple MH runs ("chains"):
- If the chains are well-mixed (left), they are probably converged
- If the chains are poorly-mixed (right), we should continue burn-in
Log-likelihood vs. Time
(Figures: a not-yet-converged log-likelihood trace vs. a converged one.)
Many graphical models are high-dimensional, so it is hard to visualize all RV chains at once. Instead, plot the complete log-likelihood vs. time:
- The complete log-likelihood is an RV that depends on all model RVs
- Generally, the log-likelihood will climb, then eventually plateau
Summary
- Markov Chain Monte Carlo methods use adaptive proposals Q(x'|x) to sample from the true distribution P(x)
- Metropolis-Hastings allows you to specify any proposal Q(x'|x); but choosing a good Q(x'|x) requires care
- Gibbs sampling sets the proposal Q(x_i' | x_-i) to the conditional distribution P(x_i' | x_-i)
  - Acceptance rate is always 1! But remember that high acceptance usually entails slow exploration
  - In fact, there are better MCMC algorithms for certain models
- Knowing when to halt burn-in is an art