Probabilistic Graphical Models

1 School of Computer Science
Probabilistic Graphical Models
Approximate Inference: Markov Chain Monte Carlo
Eric Xing
Lecture 7 March 9 04
Eric CMU

2 Recap of Monte Carlo
Monte Carlo methods are algorithms that:
- Generate samples from a given probability distribution $p(x)$
- Estimate expectations of functions $E_p[f(x)]$ under a distribution $p(x)$
Why is this useful?
- Can use samples of $p(x)$ to approximate $p(x)$ itself. This allows us to do graphical model inference when we can't compute $p(x)$.
- Expectations $E_p[f(x)]$ reveal interesting properties about $p(x)$, e.g. means and variances.
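
A minimal sketch of the expectation estimate described above, assuming we can already sample from $p(x)$. The Gaussian target and the helper name `mc_expectation` are illustrative choices, not part of the lecture.

```python
import random

def mc_expectation(f, sampler, n, seed=0):
    """Estimate E_p[f(x)] as the average of f over n samples drawn from p."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

def draw(rng):
    """Illustrative target (an assumption): p(x) = Normal(mean=2, std=1)."""
    return rng.gauss(2.0, 1.0)

mean_est = mc_expectation(lambda x: x, draw, 50_000)
second_moment = mc_expectation(lambda x: x * x, draw, 50_000)
var_est = second_moment - mean_est ** 2   # Var[x] = E[x^2] - E[x]^2
print(mean_est, var_est)                  # should land near 2 and 1
```

Both the mean and the variance are recovered as expectations of simple functions of the samples.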

3 Limitations of Monte Carlo
Direct sampling
- Hard to get rare events in high-dimensional spaces
- Infeasible for MRFs, unless we know the normalizer $Z$
Rejection sampling, importance sampling
- Do not work well if the proposal $Q(x)$ is very different from $P(x)$
- Yet constructing a $Q(x)$ similar to $P(x)$ can be difficult: making a good proposal usually requires knowledge of the analytic form of $P(x)$, but if we had that, we wouldn't even need to sample!
Intuition: instead of a fixed proposal $Q(x)$, what if we could use an adaptive proposal?

4 Markov Chain Monte Carlo
MCMC algorithms feature adaptive proposals
- Instead of $Q(x')$, they use $Q(x'|x)$, where $x'$ is the new state being sampled and $x$ is the previous sample
- As $x$ changes, $Q(x'|x)$ can also change (as a function of $x'$)
[Figure: importance sampling with a bad proposal $Q(x)$; MCMC with adaptive proposal $Q(x'|x)$]

5 Metropolis-Hastings
Let's see how MCMC works in practice; later, we'll look at the theoretical aspects.
Metropolis-Hastings algorithm:
- Draws a sample $x'$ from $Q(x'|x)$, where $x$ is the previous sample
- The new sample $x'$ is accepted or rejected with some probability $A(x'|x)$
- This acceptance probability is
$$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$$
- $A(x'|x)$ is like a ratio of importance sampling weights: $P(x')/Q(x'|x)$ is the importance weight for $x'$, $P(x)/Q(x|x')$ is the importance weight for $x$, and we divide the importance weight for $x'$ by that of $x$
- Notice that we only need to compute $P(x')/P(x)$ rather than $P(x')$ or $P(x)$ separately
- $A(x'|x)$ ensures that, after sufficiently many draws, our samples will come from the true distribution $P(x)$; we shall learn why later in this lecture
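
A small sketch of the acceptance probability, showing that only the ratio $P(x')/P(x)$ matters: multiplying the target by a constant leaves $A(x'|x)$ unchanged. The function names and the Gaussian target and proposal are assumptions made for illustration.

```python
import math

def mh_acceptance(p_tilde, q_density, x, x_new):
    """A(x'|x) = min(1, P(x') Q(x|x') / (P(x) Q(x'|x))).

    p_tilde may be unnormalized: any constant factor cancels in the ratio.
    q_density(a, b) returns Q(a | b).
    """
    num = p_tilde(x_new) * q_density(x, x_new)
    den = p_tilde(x) * q_density(x_new, x)
    return min(1.0, num / den)

def gauss_q(a, b, sigma=1.0):
    """Density of a Gaussian proposal Q(a | b) centred on b."""
    return math.exp(-0.5 * ((a - b) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_unnorm(x):
    """Standard normal shape without its normalizing constant."""
    return math.exp(-0.5 * x * x)

def p_scaled(x):
    """Same shape as p_unnorm, multiplied by an arbitrary constant."""
    return 123.0 * p_unnorm(x)

a_down = mh_acceptance(p_unnorm, gauss_q, 0.0, 1.0)  # move away from the mode
a_up = mh_acceptance(p_unnorm, gauss_q, 1.0, 0.0)    # move toward the mode
a_down_scaled = mh_acceptance(p_scaled, gauss_q, 0.0, 1.0)
print(a_down, a_up, a_down_scaled)
```

Moves toward higher probability are always accepted, while moves away from the mode are accepted with probability equal to the density ratio.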

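6 The MH Algorithm
1. Initialize starting state $x^{(0)}$, set $t = 0$
2. Burn-in: while samples have "not converged"
   - $x = x^{(t)}$
   - $t = t + 1$
   - sample $x^* \sim Q(x^*|x)$  // draw from proposal
   - sample $u \sim \mathrm{Uniform}(0,1)$  // draw acceptance threshold
   - if $u < A(x^*|x) = \min\left(1, \frac{P(x^*)\,Q(x|x^*)}{P(x)\,Q(x^*|x)}\right)$: $x^{(t)} = x^*$  // transition
   - else: $x^{(t)} = x$  // stay in current state
3. Take samples from $P(x)$: reset $t = 0$; for $t = 1:N$, $x^{(t+1)} \leftarrow$ Draw sample($x^{(t)}$), where the function Draw sample($x^{(t)}$) is the loop body of step 2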
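
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's own code: the bimodal target `p_tilde`, the proposal width `sigma`, and a fixed-length burn-in (in place of a convergence check) are all choices made here.

```python
import math
import random

def p_tilde(x):
    """Unnormalized bimodal target: Gaussian bumps at -2 and +2 (illustrative)."""
    return math.exp(-0.5 * (x + 2.0) ** 2) + math.exp(-0.5 * (x - 2.0) ** 2)

def draw_sample(x, rng, sigma):
    """One MH transition with a Gaussian proposal Q(x*|x) centred on x."""
    x_star = rng.gauss(x, sigma)               # draw from proposal
    u = rng.random()                           # draw acceptance threshold
    # The Gaussian proposal is symmetric, so Q(x|x*) / Q(x*|x) = 1 and drops out.
    a = min(1.0, p_tilde(x_star) / p_tilde(x))
    return x_star if u < a else x              # transition, or stay in current state

def metropolis_hastings(n_samples, burn_in=1000, x0=0.0, sigma=2.0, seed=0):
    rng = random.Random(seed)
    x = x0
    for _ in range(burn_in):                   # burn-in: these states are discarded
        x = draw_sample(x, rng, sigma)
    samples = []
    for _ in range(n_samples):                 # keep the states visited after burn-in
        x = draw_sample(x, rng, sigma)
        samples.append(x)
    return samples

samples = metropolis_hastings(20_000)
frac_right = sum(s > 0 for s in samples) / len(samples)
print(frac_right)  # fraction of samples in the right-hand mode
```

Because the target is symmetric, roughly half of the samples should fall in each mode if the chain mixes between them.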

7 The MH Algorithm
$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$
Example: let $Q(x'|x)$ be a Gaussian centered on $x$. We're trying to sample from a bimodal distribution $P(x)$.
- Initialize $x^{(0)}$

8 The MH Algorithm
$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$
Example: let $Q(x'|x)$ be a Gaussian centered on $x$. We're trying to sample from a bimodal distribution $P(x)$.
- Initialize $x^{(0)}$
- Draw, accept $x^{(1)}$

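9 The MH Algorithm
$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$
Example: let $Q(x'|x)$ be a Gaussian centered on $x$. We're trying to sample from a bimodal distribution $P(x)$.
- Initialize $x^{(0)}$
- Draw, accept $x^{(1)}$
- Draw, accept $x^{(2)}$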

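10 The MH Algorithm
$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$
Example: let $Q(x'|x)$ be a Gaussian centered on $x$. We're trying to sample from a bimodal distribution $P(x)$.
- Initialize $x^{(0)}$
- Draw, accept $x^{(1)}$
- Draw, accept $x^{(2)}$
- Draw $x'$ but reject; set $x^{(3)} = x^{(2)}$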

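11 The MH Algorithm
$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$
Example: let $Q(x'|x)$ be a Gaussian centered on $x$. We're trying to sample from a bimodal distribution $P(x)$.
- Initialize $x^{(0)}$
- Draw, accept $x^{(1)}$
- Draw, accept $x^{(2)}$
- Draw $x'$ but reject; set $x^{(3)} = x^{(2)}$
We reject because $P(x')/Q(x'|x^{(2)}) < 1$ and $P(x^{(2)})/Q(x^{(2)}|x') > 1$, hence $A(x'|x^{(2)})$ is close to zero!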

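12 The MH Algorithm
$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$
Example: let $Q(x'|x)$ be a Gaussian centered on $x$. We're trying to sample from a bimodal distribution $P(x)$.
- Initialize $x^{(0)}$
- Draw, accept $x^{(1)}$
- Draw, accept $x^{(2)}$
- Draw $x'$ but reject; set $x^{(3)} = x^{(2)}$
- Draw, accept $x^{(4)}$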

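13 The MH Algorithm
$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$
Example: let $Q(x'|x)$ be a Gaussian centered on $x$. We're trying to sample from a bimodal distribution $P(x)$.
- Initialize $x^{(0)}$
- Draw, accept $x^{(1)}$
- Draw, accept $x^{(2)}$
- Draw $x'$ but reject; set $x^{(3)} = x^{(2)}$
- Draw, accept $x^{(4)}$
- Draw, accept $x^{(5)}$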

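14 The MH Algorithm
$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$
Example: let $Q(x'|x)$ be a Gaussian centered on $x$. We're trying to sample from a bimodal distribution $P(x)$.
- Initialize $x^{(0)}$
- Draw, accept $x^{(1)}$
- Draw, accept $x^{(2)}$
- Draw $x'$ but reject; set $x^{(3)} = x^{(2)}$
- Draw, accept $x^{(4)}$
- Draw, accept $x^{(5)}$
The adaptive proposal $Q(x'|x)$ allows us to sample both modes of $P(x)$!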

15 Theoretical aspects of MCMC
- The MH algorithm has a "burn-in" period. Why do we throw away samples from burn-in?
- Why are the MH samples guaranteed to be from $P(x)$? The proposal $Q(x'|x)$ keeps changing with the value of $x$; how do we know the samples will eventually come from $P(x)$?
- What is the connection between Markov Chains and MCMC?

16 Markov Chains
A Markov Chain is a sequence of random variables $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ with the Markov Property:
$$P(x^{(n)} = x \mid x^{(1)}, \ldots, x^{(n-1)}) = P(x^{(n)} = x \mid x^{(n-1)})$$
- $P(x^{(n)} = x \mid x^{(n-1)})$ is known as the transition kernel
- The next state depends only on the preceding state (recall HMMs!)
- Note: the r.v.s $x^{(i)}$ can be vectors. We define $x^{(t)}$ to be the $t$-th sample of all variables in a graphical model; $x^{(t)}$ represents the entire state of the graphical model at time $t$
We study homogeneous Markov Chains, in which the transition kernel $P(x^{(t)} = x' \mid x^{(t-1)} = x)$ is fixed with time. To emphasize this, we will call the kernel $T(x'|x)$, where $x$ is the previous state and $x'$ is the next state.

17 MC Concepts
To understand MCs, we need to define a few concepts:
- Probability distributions over states: $\pi^{(t)}(x)$ is a distribution over the state of the system $x$ at time $t$. When dealing with MCs, we don't think of the system as being in one state, but as having a distribution over states. For graphical models, remember that $x$ represents all variables.
- Transitions: recall that states transition from $x^{(t)}$ to $x^{(t+1)}$ according to the transition kernel $T(x'|x)$. We can also transition entire distributions:
$$\pi^{(t+1)}(x') = \sum_x \pi^{(t)}(x)\, T(x'|x)$$
  At time $t$, state $x$ has probability mass $\pi^{(t)}(x)$. The transition probability redistributes this mass to other states $x'$.
- Stationary distributions: $\pi(x)$ is stationary if it does not change under the transition kernel:
$$\pi(x') = \sum_x \pi(x)\, T(x'|x) \quad \text{for all } x'$$
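
To make the distribution-level transition concrete, here is a small sketch that repeatedly applies $\pi^{(t+1)}(x') = \sum_x \pi^{(t)}(x) T(x'|x)$. The 3-state kernel `T` is a hypothetical example chosen so that the stationary distribution is easy to check by hand.

```python
def step(pi, T):
    """pi'(x') = sum_x pi(x) T(x'|x), with T[x][x_next] = T(x_next | x)."""
    n = len(pi)
    return [sum(pi[x] * T[x][x_next] for x in range(n)) for x_next in range(n)]

# Hypothetical 3-state kernel; each row sums to 1.
T = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

pi = [1.0, 0.0, 0.0]      # all probability mass starts on state 0
for _ in range(100):
    pi = step(pi, T)

stationary = [0.25, 0.5, 0.25]
print(pi)                  # approaches the stationary distribution
print(step(stationary, T)) # unchanged by one more transition
```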

18 MC Concepts
Stationary distributions are of great importance in MCMC. To understand them, we need to define some notions:
- Irreducible: an MC is irreducible if you can get from any state $x$ to any other state $x'$ with probability > 0 in a finite number of steps, i.e. there are no unreachable parts of the state space
- Aperiodic: an MC is aperiodic if you can return to any state $x$ at any time. Periodic MCs have states that need ≥ 2 time steps to return to (cycles)
- Ergodic (or regular): an MC is ergodic if it is irreducible and aperiodic
Ergodicity is important: it implies you can reach the stationary distribution $\pi_{st}(x)$, no matter the initial distribution $\pi^{(0)}(x)$. All good MCMC algorithms must satisfy ergodicity, so that you can't initialize in a way that will never converge.
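
A toy illustration of why aperiodicity matters, using two hypothetical 2-state kernels: one that always flips state (period 2) and one that sometimes stays put. Both are irreducible, but only the second forgets its initial distribution.

```python
def evolve(pi, T, n_steps):
    """Apply pi'(x') = sum_x pi(x) T(x'|x) for n_steps transitions."""
    for _ in range(n_steps):
        pi = [sum(pi[x] * T[x][y] for x in range(len(pi))) for y in range(len(pi))]
    return pi

periodic = [[0.0, 1.0], [1.0, 0.0]]    # always flips: returns only at even times
aperiodic = [[0.1, 0.9], [0.9, 0.1]]   # can also stay put, so no fixed cycle

start = [1.0, 0.0]
print(evolve(start, periodic, 50), evolve(start, periodic, 51))   # keeps oscillating
print(evolve(start, aperiodic, 200))                              # settles near [0.5, 0.5]
```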

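19 MC Concepts
Reversible (detailed balance): an MC is reversible if there exists a distribution $\pi(x)$ such that the detailed balance condition is satisfied:
$$\pi(x')\, T(x|x') = \pi(x)\, T(x'|x)$$
The probability of $x \to x'$ and of $x' \to x$ can be different, but the joint of $x$ and $x'$ remains the same no matter which direction we go.
Reversible MCs always have a stationary distribution! Proof:
$$\pi(x')\, T(x|x') = \pi(x)\, T(x'|x)$$
$$\sum_x \pi(x')\, T(x|x') = \sum_x \pi(x)\, T(x'|x)$$
$$\pi(x') \sum_x T(x|x') = \sum_x \pi(x)\, T(x'|x)$$
$$\pi(x') = \sum_x \pi(x)\, T(x'|x)$$
The last line is the definition of a stationary distribution!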
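
A numerical check of the statement above, on two hypothetical 3-state kernels. The first satisfies detailed balance and is therefore stationary; the second (a deterministic cycle) shows that the converse need not hold, since the uniform distribution is stationary for it without detailed balance.

```python
def satisfies_detailed_balance(pi, T, tol=1e-12):
    """Check pi(x) T(x'|x) == pi(x') T(x|x') for every pair, with T[x][x'] = T(x'|x)."""
    n = len(pi)
    return all(abs(pi[x] * T[x][y] - pi[y] * T[y][x]) <= tol
               for x in range(n) for y in range(n))

def is_stationary(pi, T, tol=1e-12):
    """Check pi(x') == sum_x pi(x) T(x'|x) for every x'."""
    n = len(pi)
    return all(abs(sum(pi[x] * T[x][y] for x in range(n)) - pi[y]) <= tol
               for y in range(n))

reversible_T = [[0.50, 0.50, 0.00],
                [0.25, 0.50, 0.25],
                [0.00, 0.50, 0.50]]
reversible_pi = [0.25, 0.5, 0.25]

cycle_T = [[0.0, 1.0, 0.0],     # 0 -> 1 -> 2 -> 0, never backwards
           [0.0, 0.0, 1.0],
           [1.0, 0.0, 0.0]]
uniform = [1 / 3, 1 / 3, 1 / 3]

print(satisfies_detailed_balance(reversible_pi, reversible_T), is_stationary(reversible_pi, reversible_T))
print(satisfies_detailed_balance(uniform, cycle_T), is_stationary(uniform, cycle_T))
```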

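20 Why does Metropolis-Hastings work?
Recall that we draw a sample $x'$ according to $Q(x'|x)$, and then accept/reject according to $A(x'|x)$. In other words, the transition kernel (for $x' \neq x$; rejected proposals add to the probability of staying at $x$) is
$$T(x'|x) = Q(x'|x)\, A(x'|x)$$
We can prove that MH satisfies detailed balance. Recall that
$$A(x'|x) = \min\left(1, \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}\right)$$
Notice this implies the following:
if $A(x'|x) < 1$ then $\frac{P(x)\,Q(x'|x)}{P(x')\,Q(x|x')} > 1$ and thus $A(x|x') = 1$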

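21 Why does Metropolis-Hastings work?
Recall: if $A(x'|x) < 1$ then $\frac{P(x)\,Q(x'|x)}{P(x')\,Q(x|x')} > 1$ and thus $A(x|x') = 1$.
Now suppose $A(x'|x) < 1$ and $A(x|x') = 1$. We have
$$A(x'|x) = \frac{P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}$$
$$P(x)\,Q(x'|x)\,A(x'|x) = P(x')\,Q(x|x')$$
$$P(x)\,Q(x'|x)\,A(x'|x) = P(x')\,Q(x|x')\,A(x|x')$$
$$P(x)\,T(x'|x) = P(x')\,T(x|x')$$
The last line is exactly the detailed balance condition. In other words, the MH algorithm leads to a stationary distribution $P(x)$. Recall we defined $P(x)$ to be the true distribution of $x$. Thus, the MH algorithm eventually converges to the true distribution!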
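
The argument above can be checked numerically on a small discrete state space. The sketch below builds the MH kernel for a hypothetical 4-state target `p` and an asymmetric proposal `Q` (both invented for illustration) and measures how far it is from detailed balance.

```python
def mh_kernel(p, Q):
    """T(x'|x) = Q(x'|x) A(x'|x) for x' != x; rejected proposals stay at x.

    Q[x][y] is Q(y | x) and the returned T[x][y] is T(y | x).
    """
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x and Q[x][y] > 0:
                a = min(1.0, (p[y] * Q[y][x]) / (p[x] * Q[x][y]))
                T[x][y] = Q[x][y] * a
        T[x][x] = 1.0 - sum(T[x])    # remaining mass: proposal of x itself plus rejections
    return T

p = [0.1, 0.2, 0.3, 0.4]             # target distribution (illustrative)
Q = [[0.10, 0.60, 0.20, 0.10],       # asymmetric proposal, rows sum to 1
     [0.30, 0.10, 0.30, 0.30],
     [0.20, 0.20, 0.20, 0.40],
     [0.25, 0.25, 0.25, 0.25]]

T = mh_kernel(p, Q)
n = len(p)
max_gap = max(abs(p[x] * T[x][y] - p[y] * T[y][x]) for x in range(n) for y in range(n))
next_p = [sum(p[x] * T[x][y] for x in range(n)) for y in range(n)]
print(max_gap)   # detailed-balance violation, zero up to rounding
print(next_p)    # one transition applied to p gives p back
```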

22 Caveats
- Although MH eventually converges to the true distribution $P(x)$, we have no guarantees as to when this will occur
- The burn-in period represents the un-converged part of the Markov Chain; that's why we throw those samples away!
- Knowing when to halt burn-in is an art. We will look at some techniques later in this lecture.

23 Gibbs Sampling
Gibbs Sampling is an MCMC algorithm that samples each random variable of a graphical model, one at a time. GS is a special case of the MH algorithm.
GS algorithms:
- Are fairly easy to derive for many graphical models (e.g. mixture models, Latent Dirichlet allocation)
- Have reasonable computation and memory requirements, because they sample one r.v. at a time
- Can be Rao-Blackwellized (integrate out some r.v.s) to decrease the sampling variance

24 Gibbs Sampling
The GS algorithm:
1. Suppose the graphical model contains variables $x_1, \ldots, x_n$
2. Initialize starting values for $x_1, \ldots, x_n$
3. Do until convergence:
   1. Pick an ordering of the $n$ variables (can be fixed or random)
   2. For each variable $x_i$ in order:
      1. Sample $x \sim P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$, i.e. the conditional distribution of $x_i$ given the current values of all other variables
      2. Update $x_i \leftarrow x$
When we update $x_i$, we immediately use its new value for sampling other variables $x_j$.
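
A minimal sketch of the loop above for a model where the conditionals are known in closed form: a zero-mean, unit-variance bivariate Gaussian with correlation `rho`, for which $x_1 \mid x_2 \sim N(\rho x_2, 1 - \rho^2)$. The model, the fixed burn-in length, and the function name are assumptions made for illustration.

```python
import math
import random

def gibbs_bivariate_normal(n_samples, rho, burn_in=500, seed=0):
    rng = random.Random(seed)
    cond_std = math.sqrt(1.0 - rho * rho)
    x1, x2 = 0.0, 0.0                          # initialize starting values
    samples = []
    for t in range(burn_in + n_samples):
        x1 = rng.gauss(rho * x2, cond_std)     # sample x1 ~ P(x1 | x2)
        x2 = rng.gauss(rho * x1, cond_std)     # immediately uses the new x1
        if t >= burn_in:
            samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(20_000, rho=0.8)
n = len(samples)
m1 = sum(a for a, _ in samples) / n
m2 = sum(b for _, b in samples) / n
cov = sum((a - m1) * (b - m2) for a, b in samples) / n
var1 = sum((a - m1) ** 2 for a, _ in samples) / n
var2 = sum((b - m2) ** 2 for _, b in samples) / n
corr = cov / math.sqrt(var1 * var2)
print(m1, m2, corr)   # means near 0, correlation near rho
```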

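25 Markov Blankets
The conditional $P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ looks intimidating, but recall Markov Blankets:
- Let $MB(x_i)$ be the Markov Blanket of $x_i$; then
$$P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid MB(x_i))$$
- For a BN, the Markov Blanket of $x_i$ is the set containing its parents, children, and co-parents
- For an MRF, the Markov Blanket of $x_i$ is its immediate neighbors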
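
For a BN, the Markov Blanket can be read off the parent sets. The sketch below uses a hypothetical `parents` dictionary; the edges are those of the alarm network (B and E are parents of A, which is the parent of J and M), consistent with the conditionals used in the Gibbs example of this lecture.

```python
def markov_blanket(node, parents):
    """Parents, children, and co-parents of `node` in a Bayesian network."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])    # co-parents share a child with `node`
    blanket.discard(node)
    return blanket

alarm_parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

for v in "BEAJM":
    print(v, sorted(markov_blanket(v, alarm_parents)))
```

This matches the conditionals sampled in the example: B depends on A and E, A on all four other variables, and J and M only on A.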

26 Gibbs Sampling: An Example
Consider the alarm network. Assume we sample variables in the order B, E, A, J, M. Initialize all variables at t = 0 to False.

t  B  E  A  J  M
0  F  F  F  F  F
1
2
3
4

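27 Gibbs Sampling: An Example
Sampling $P(B \mid A, E)$ at t = 1: using Bayes Rule,
$$P(B \mid A, E) \propto P(A \mid B, E)\, P(B)$$
(A, E) = (F, F), so we compute the following, and sample B = F:
$$P(B = T \mid A = F, E = F) \propto P(A = F \mid B = T, E = F)\, P(B = T)$$
$$P(B = F \mid A = F, E = F) \propto P(A = F \mid B = F, E = F)\, P(B = F)$$

t  B  E  A  J  M
0  F  F  F  F  F
1  F
2
3
4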

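28 Gibbs Sampling: An Example
Sampling $P(E \mid A, B)$: using Bayes Rule,
$$P(E \mid A, B) \propto P(A \mid B, E)\, P(E)$$
(A, B) = (F, F), so we compute the following, and sample E = T:
$$P(E = T \mid A = F, B = F) \propto P(A = F \mid B = F, E = T)\, P(E = T)$$
$$P(E = F \mid A = F, B = F) \propto P(A = F \mid B = F, E = F)\, P(E = F)$$

t  B  E  A  J  M
0  F  F  F  F  F
1  F  T
2
3
4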

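29 Gibbs Sampling: An Example
Sampling $P(A \mid B, E, J, M)$: using Bayes Rule,
$$P(A \mid B, E, J, M) \propto P(J \mid A)\, P(M \mid A)\, P(A \mid B, E)$$
(B, E, J, M) = (F, T, F, F), so we compute the following, and sample A = F:
$$P(A = T \mid B = F, E = T, J = F, M = F) \propto P(J = F \mid A = T)\, P(M = F \mid A = T)\, P(A = T \mid B = F, E = T)$$
$$P(A = F \mid B = F, E = T, J = F, M = F) \propto P(J = F \mid A = F)\, P(M = F \mid A = F)\, P(A = F \mid B = F, E = T)$$

t  B  E  A  J  M
0  F  F  F  F  F
1  F  T  F
2
3
4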

30 Gibbs Sampling: An Example
Sampling $P(J \mid A)$: no need to apply Bayes Rule. A = F, so we compute the following, and sample J = T:
$$P(J = T \mid A = F) = 0.05$$
$$P(J = F \mid A = F) = 0.95$$

t  B  E  A  J  M
0  F  F  F  F  F
1  F  T  F  T
2
3
4

31 Gibbs Sampling: An Example
Sampling $P(M \mid A)$: no need to apply Bayes Rule. A = F, so we compute the following, and sample M = F:
$$P(M = T \mid A = F) = 0.01$$
$$P(M = F \mid A = F) = 0.99$$

t  B  E  A  J  M
0  F  F  F  F  F
1  F  T  F  T  F
2
3
4

32 Gibbs Sampling: An Example
Now t = 2, and we repeat the procedure to sample new values of B, E, A, J, M.

t  B  E  A  J  M
0  F  F  F  F  F
1  F  T  F  T  F
2  F  T  T  T  T
3
4

33 Gibbs Sampling: An Example
Now t = 2, and we repeat the procedure to sample new values of B, E, A, J, M. And similarly for t = 3, 4, etc.

t  B  E  A  J  M
0  F  F  F  F  F
1  F  T  F  T  F
2  F  T  T  T  T
3  T  F  T  F  T
4  T  F  T  F  F
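
The walkthrough above can be sketched in code. Only $P(J = T \mid A = F) = 0.05$ and $P(M = T \mid A = F) = 0.01$ appear in the example; every other CPT entry below is an assumption (the commonly used textbook values for the alarm network), so the sampled table will generally differ from the one shown.

```python
import random

# CPTs store P(var = True | parents). Entries not given in the example are assumed.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # keyed by (B, E)
P_J = {True: 0.90, False: 0.05}                      # keyed by A
P_M = {True: 0.70, False: 0.01}                      # keyed by A

def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def blanket_factors(var, s):
    """Product of the CPT factors that mention `var` in state s."""
    a_term = bern(P_A[(s["B"], s["E"])], s["A"])
    if var == "B":
        return bern(P_B, s["B"]) * a_term
    if var == "E":
        return bern(P_E, s["E"]) * a_term
    if var == "A":
        return a_term * bern(P_J[s["A"]], s["J"]) * bern(P_M[s["A"]], s["M"])
    if var == "J":
        return bern(P_J[s["A"]], s["J"])
    return bern(P_M[s["A"]], s["M"])

def conditional_true(var, s):
    """P(var = True | all other variables), by normalizing the two Bayes-rule weights."""
    w_true = blanket_factors(var, {**s, var: True})
    w_false = blanket_factors(var, {**s, var: False})
    return w_true / (w_true + w_false)

rng = random.Random(0)
state = {v: False for v in "BEAJM"}       # t = 0: all variables False
history = [dict(state)]
for t in range(1, 5):
    for var in "BEAJM":                   # fixed ordering B, E, A, J, M
        state[var] = rng.random() < conditional_true(var, state)
    history.append(dict(state))

for t, row in enumerate(history):
    print(t, " ".join("T" if row[v] else "F" for v in "BEAJM"))
```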

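34 Topic Models: Collapsed Gibbs (Tom Griffiths & Mark Steyvers)
Collapsed Gibbs sampling
- Popular inference algorithm for topic models
- Integrate out topic vectors $\pi$ and topics $B$
- Only need to sample word-topic assignments $z$
Algorithm: for all variables $z = z_1, z_2, \ldots, z_n$, draw $z_i^{(t+1)}$ from $P(z_i \mid z_{-i}, w)$, where
$$z_{-i} = z_1^{(t+1)}, z_2^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_n^{(t)}$$
[Figure: plate diagram of the topic model with $\alpha$, $\pi$, $z$, $w$, $\beta$, $B$ and plates $K$, $N$, $M$]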

35 Collapsed Gibbs sampling
What is $P(z_i \mid z_{-i}, w)$? It is a product of two Dirichlet-Multinomial conditional distributions: a "word-topic" term and a "doc-topic" term.

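36 Collapsed Gibbs sampling
What is $P(z_i \mid z_{-i}, w)$? It is a product of two Dirichlet-Multinomial conditional distributions. In the notation of Griffiths & Steyvers, with $W$ the vocabulary size and $T$ the number of topics:
$$P(z_i = j \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$
- $n^{(w_i)}_{-i,j}$: # word positions $a$ (excluding $w_i$) such that $w_a = w_i$, $z_a = j$
- $n^{(d_i)}_{-i,j}$: # word positions $a$ in the current document $d_i$ (excluding $w_i$) such that $z_a = j$
- $n^{(\cdot)}_{-i,j}$: # word positions $a$ (excluding $w_i$) such that $z_a = j$
- $n^{(d_i)}_{-i,\cdot}$: # word positions $a$ in the current document $d_i$ (excluding $w_i$)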
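
A compact sketch of one collapsed Gibbs sweep built from the four counts above. The toy corpus (integer word and document ids), the hyperparameter values, and all function names are hypothetical; the update itself is the product of the word-topic and doc-topic terms.

```python
import random

def build_counts(words, docs, z, n_words, n_docs, n_topics):
    n_wt = [[0] * n_topics for _ in range(n_words)]   # word-topic counts
    n_dt = [[0] * n_topics for _ in range(n_docs)]    # doc-topic counts
    n_t = [0] * n_topics                              # positions per topic
    n_d = [0] * n_docs                                # positions per document
    for w, d, j in zip(words, docs, z):
        n_wt[w][j] += 1
        n_dt[d][j] += 1
        n_t[j] += 1
        n_d[d] += 1
    return n_wt, n_dt, n_t, n_d

def gibbs_sweep(words, docs, z, counts, alpha, beta, rng):
    """Resample every z_i in turn; z and the count tables are updated in place."""
    n_wt, n_dt, n_t, n_d = counts
    n_words, n_topics = len(n_wt), len(n_t)
    for i, (w, d) in enumerate(zip(words, docs)):
        old = z[i]
        n_wt[w][old] -= 1; n_dt[d][old] -= 1          # exclude position i from all counts
        n_t[old] -= 1; n_d[d] -= 1
        weights = [
            (n_wt[w][j] + beta) / (n_t[j] + n_words * beta)        # word-topic term
            * (n_dt[d][j] + alpha) / (n_d[d] + n_topics * alpha)   # doc-topic term
            for j in range(n_topics)
        ]
        new = rng.choices(range(n_topics), weights=weights)[0]
        z[i] = new
        n_wt[w][new] += 1; n_dt[d][new] += 1          # add position i back with its new topic
        n_t[new] += 1; n_d[d] += 1

# Toy corpus: 10 word positions, 4 word types, 2 documents, 2 topics.
words = [0, 1, 2, 3, 0, 2, 3, 1, 0, 3]
docs = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
rng = random.Random(0)
z = [rng.randrange(2) for _ in words]
counts = build_counts(words, docs, z, n_words=4, n_docs=2, n_topics=2)
for _ in range(50):
    gibbs_sweep(words, docs, z, counts, alpha=0.5, beta=0.1, rng=rng)
print(z)
```

Decrementing the counts before computing the weights is what implements "excluding $w_i$" in each definition.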

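37 Collapsed Gibbs illustration
[Table: word tokens $w_i$ (MATHEMATICS, KNOWLEDGE, RESEARCH, WORK, MATHEMATICS, RESEARCH, WORK, SCIENTIFIC, MATHEMATICS, WORK, SCIENTIFIC, KNOWLEDGE, JOY) with their document index $d_i$ and topic assignment $z_i$ at the first iteration]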

38 Collapsed Gibbs illustration
[Table: the same word tokens $w_i$ and document indices $d_i$, with the previous topic assignments $z_i$ and the next iteration's assignments ("?") being filled in one word position at a time]

45 Collapsed Gibbs illustration
[Table: the same word tokens $w_i$ and document indices $d_i$, with topic assignments $z_i$ shown for successive iterations up to the final one]

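46 Gibbs Sampling is a special case of MH
The GS proposal distribution is
$$Q(x_i', x_{-i} \mid x_i, x_{-i}) = P(x_i' \mid x_{-i})$$
where $x_{-i}$ denotes all variables except $x_i$.
Applying MH to this proposal, we find that samples are always accepted (which is exactly what GS does):
$$A(x_i', x_{-i} \mid x_i, x_{-i}) = \min\left(1, \frac{P(x_i', x_{-i})\, Q(x_i, x_{-i} \mid x_i', x_{-i})}{P(x_i, x_{-i})\, Q(x_i', x_{-i} \mid x_i, x_{-i})}\right)$$
$$= \min\left(1, \frac{P(x_i', x_{-i})\, P(x_i \mid x_{-i})}{P(x_i, x_{-i})\, P(x_i' \mid x_{-i})}\right)$$
$$= \min\left(1, \frac{P(x_i' \mid x_{-i})\, P(x_{-i})\, P(x_i \mid x_{-i})}{P(x_i \mid x_{-i})\, P(x_{-i})\, P(x_i' \mid x_{-i})}\right)$$
$$= \min(1, 1) = 1$$
GS is simply MH with a proposal that is always accepted!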
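
The derivation can be verified numerically. The sketch below uses a hypothetical joint table over two discrete variables and evaluates the MH acceptance probability for the Gibbs proposal on $x_1$, for every combination of old value, new value, and $x_2$.

```python
# Hypothetical joint P(x1, x2): x1 in {0, 1} indexes rows, x2 in {0, 1, 2} indexes columns.
joint = [[0.10, 0.20, 0.05],
         [0.30, 0.05, 0.30]]

def cond_x1(x1, x2):
    """P(x1 | x2), obtained by normalizing a column of the joint table."""
    return joint[x1][x2] / sum(joint[a][x2] for a in range(len(joint)))

def gibbs_acceptance(x1_new, x1_old, x2):
    """MH acceptance when the proposal is Q(x1_new, x2 | x1_old, x2) = P(x1_new | x2)."""
    num = joint[x1_new][x2] * cond_x1(x1_old, x2)
    den = joint[x1_old][x2] * cond_x1(x1_new, x2)
    return min(1.0, num / den)

acceptances = [gibbs_acceptance(new, old, x2)
               for new in range(2) for old in range(2) for x2 in range(3)]
print(acceptances)   # every entry equals 1 up to floating-point rounding
```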

47 Practical Aspects of MCMC
How do we know if our proposal $Q(x'|x)$ is any good?
- Monitor the acceptance rate
- Plot the autocorrelation function
How do we know when to stop burn-in?
- Plot the sample values vs time
- Plot the log-likelihood vs time

48 Acceptance Rate
[Figure: low-variance proposal $Q(x'|x)$; high-variance proposal $Q(x'|x)$]
Choosing the proposal $Q(x'|x)$ is a tradeoff:
- "Narrow", low-variance proposals have high acceptance, but take many iterations to explore $P(x)$ fully because the proposed $x'$ are too close
- "Wide", high-variance proposals have the potential to explore much of $P(x)$, but many proposals are rejected, which slows down the sampler
A good $Q(x'|x)$ proposes distant samples $x'$ with a sufficiently high acceptance rate.

49 Acceptance Rate
[Figure: low-variance proposal $Q(x'|x)$; high-variance proposal $Q(x'|x)$]
Acceptance rate is the fraction of samples that MH accepts.
- General guideline: proposals should have ~0.5 acceptance rate [1]
- Gaussian special case: if both $P(x)$ and $Q(x'|x)$ are Gaussian, the optimal acceptance rate is ~0.45 for D=1 dimension and approaches ~0.23 as D tends to infinity [2]
[1] Muller, P. (1993), "A Generic Approach to Posterior Integration and Gibbs Sampling"
[2] Roberts, G.O., Gelman, A., and Gilks, W.R. (1994), "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms"
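
A sketch of how the acceptance rate can be monitored, assuming a standard Gaussian target and a Gaussian random-walk proposal of width `sigma` (both choices are illustrative). It runs the same sampler with a narrow, a moderate, and a wide proposal.

```python
import math
import random

def acceptance_rate(sigma, n_steps=20_000, seed=0):
    """Fraction of accepted proposals for random-walk MH on a standard Gaussian target."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_new = rng.gauss(x, sigma)
        # Symmetric proposal, so A = min(1, P(x')/P(x)); computed in log space.
        log_ratio = -0.5 * x_new * x_new + 0.5 * x * x
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = x_new
            accepted += 1
    return accepted / n_steps

narrow = acceptance_rate(0.1)    # tiny steps: almost everything is accepted
moderate = acceptance_rate(2.4)  # steps comparable to the width of the target
wide = acceptance_rate(50.0)     # huge steps: almost everything is rejected
print(narrow, moderate, wide)
```

The narrow proposal accepts nearly every move but barely travels, while the wide proposal rarely moves at all; the moderate width sits near the guideline range quoted above.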

50 Autocorrelation function
[Figure: low autocorrelation; high autocorrelation]
MCMC chains always show autocorrelation (AC). AC means that adjacent samples in time are highly correlated.
We quantify AC with the autocorrelation function of an r.v. $x$:
$$R_x(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n-k} (x_t - \bar{x})^2}$$

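51 Autocorrelation function
$$R_x(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n-k} (x_t - \bar{x})^2}$$
[Figure: low autocorrelation; high autocorrelation]
The first-order AC $R_x(1)$ can be used to estimate the Sample Size Inflation Factor (SSIF):
$$s_x = \frac{1 + R_x(1)}{1 - R_x(1)}$$
If we took $n$ samples with SSIF $s_x$, then the effective sample size is $n / s_x$. High autocorrelation leads to smaller effective sample size! We want proposals $Q(x'|x)$ with low autocorrelation.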
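
The two formulas above translate directly into code. In this sketch the "chains" are synthetic stand-ins rather than MCMC output: an i.i.d. Gaussian sequence and an AR(1) sequence with coefficient 0.9, both assumptions chosen because their first-order autocorrelations are known (about 0 and 0.9).

```python
import random

def autocorrelation(xs, k):
    """R_x(k) as defined above, using the mean of the whole sequence."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t + k] - mean) for t in range(n - k))
    den = sum((xs[t] - mean) ** 2 for t in range(n - k))
    return num / den

def ssif(xs):
    """Sample Size Inflation Factor s_x = (1 + R_x(1)) / (1 - R_x(1))."""
    r1 = autocorrelation(xs, 1)
    return (1 + r1) / (1 - r1)

rng = random.Random(0)
n = 20_000
iid = [rng.gauss(0.0, 1.0) for _ in range(n)]   # low autocorrelation
ar = [0.0]
for _ in range(n - 1):                          # high autocorrelation: x_t = 0.9 x_{t-1} + noise
    ar.append(0.9 * ar[-1] + rng.gauss(0.0, 1.0))

ess_iid = n / ssif(iid)                         # effective sample size n / s_x
ess_ar = n / ssif(ar)
print(autocorrelation(iid, 1), ess_iid)
print(autocorrelation(ar, 1), ess_ar)
```

The strongly correlated sequence is worth far fewer effective samples than its length suggests.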

52 Sample Values vs Time
[Figure: well-mixed chains (left); poorly-mixed chains (right)]
Monitor convergence by plotting samples (of r.v.s) from multiple MH runs (chains).
- If the chains are well-mixed (left), they are probably converged
- If the chains are poorly-mixed (right), we should continue burn-in

53 Log-likelihood vs Time
[Figure: not converged; converged]
Many graphical models are high-dimensional, so it is hard to visualize all r.v. chains at once.
Instead, plot the complete log-likelihood vs. time.
- The complete log-likelihood is an r.v. that depends on all model r.v.s
- Generally, the log-likelihood will climb, then eventually plateau

54 Summary
- Markov Chain Monte Carlo methods use adaptive proposals $Q(x'|x)$ to sample from the true distribution $P(x)$
- Metropolis-Hastings allows you to specify any proposal $Q(x'|x)$, but choosing a good $Q(x'|x)$ requires care
- Gibbs sampling sets the proposal $Q(x'|x)$ to the conditional distribution $P(x_i' \mid x_{-i})$. Acceptance rate always 1! But remember that high acceptance usually entails slow exploration. In fact, there are better MCMC algorithms for certain models
- Knowing when to halt burn-in is an art
