Particle-based Approximate Inference on Graphical Models. References: Probabilistic Graphical Models, Ch. 12, Koller & Friedman; CMU 10-708 Probabilistic Graphical Models, Fall 2009, Lectures 8-9, Eric Xing; Pattern Recognition & Machine Learning, Ch. 11, Bishop.
In terms of difficulty, there are 3 types of inference problems: (1) inference easily solved with the Bayes rule; (2) inference tractable using some dynamic programming technique, e.g. Variable Elimination or the Junction-tree algorithm; (3) (today's focus) inference that is provably intractable and must be solved with some approximate method, e.g. approximation via optimization or via sampling techniques.
Agenda: When to use Particle-based Approximate Inference? / Forward Sampling & Importance Sampling / Markov Chain Monte Carlo (MCMC) / Collapsed Particles
Example 1: General Factorial HMM. Moralizing the factorial HMM (hidden chains V, W, Y, Z with observations X) yields cliques of size 5, intractable most of the time: no tractable elimination order exists. [Figure: the factorial HMM before and after moralization.]
Some models are intractable for exact inference. Example: a grid MRF. Generally, an N*N grid yields cliques of size N during elimination, which is indeed intractable, so approximate inference is needed.
General idea of Particle-based (Monte Carlo) Approximation. Most queries we want can be formed as $E_P[f] = \sum_{x_1,\dots,x_K} P(x_1,\dots,x_K)\, f(x_1,\dots,x_K)$, which is intractable most of the time when K is large. Assuming we can generate i.i.d. samples $x^{(n)}$ from P, we can approximate the above using $\hat{f} = \frac{1}{N}\sum_n f(x^{(n)})$. This is an unbiased estimator whose variance converges to 0 as $N \to \infty$: $E[\hat{f}] = \frac{1}{N}\sum_n E[f(x^{(n)})] = E_P[f]$, and $\mathrm{Var}[\hat{f}] = \frac{1}{N^2}\sum_n \mathrm{Var}[f] = \mathrm{Var}[f]/N$. Note the variance is not related to the dimension of x, and $\mathrm{Var}[\hat f] \to 0$ as $N \to \infty$.
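A minimal sketch of this estimator in Python (the distribution P, the function f, and all numbers are made up for illustration; in a real graphical model each sample would be a full joint assignment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution P over a single discrete variable X (values 0, 1, 2).
p = np.array([0.2, 0.5, 0.3])          # hypothetical P(X = x)
f = lambda x: float(x == 1)            # f picks out the event {X = 1}

N = 100_000
samples = rng.choice(3, size=N, p=p)   # i.i.d. samples x^(n) ~ P

f_hat = np.mean([f(x) for x in samples])
print(f_hat)                             # ~= E_P[f] = P(X = 1) = 0.5
print(p @ np.array([f(0), f(1), f(2)]))  # exact value for comparison
```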
Which problems can use Particle-based Monte Carlo Approximation? Types of queries: 1. likelihood of evidence/assignments on variables; 2. conditional probability of some variables given others; 3. most probable assignment for some variables given others. Any problem that can be written in the form $E_P[f] = \sum_{x_1,\dots,x_K} P(x_1,\dots,x_K)\, f(x_1,\dots,x_K)$.
Marginal Distribution via Monte Carlo. To compute the marginal distribution on $X_k$: $P(X_k = x_k) = \sum_{x_1,\dots,x_K} P(x_1,\dots,x_K)\cdot 1\{X_k = x_k\} = E_P[1\{X_k = x_k\}]$. Particle-based approximation: $\hat f = \frac{1}{N}\sum_n 1\{x_k^{(n)} = x_k\}$, i.e. just count the proportion of samples in which $X_k = x_k$.
Marginal Joint Distribution via Monte Carlo. To compute the marginal distribution on $X_i, X_j$: $P(X_i = x_i, X_j = x_j) = E_P[1\{X_i = x_i \ \&\ X_j = x_j\}]$. Particle-based approximation: $\hat f = \frac{1}{N}\sum_n 1\{x_i^{(n)} = x_i,\, x_j^{(n)} = x_j\}$, i.e. just count the proportion of samples in which $X_i = x_i$ and $X_j = x_j$.
So what's the problem? Note that what we can do is evaluate the probability/likelihood $P(X_1 = x_1, \dots, X_K = x_K)$. What we cannot do is summation/integration in high-dimensional space, $\sum_{x_1,\dots,x_K} P(x_1,\dots,x_K)$. What we want to do for approximation is draw samples from $P(x_1,\dots,x_K)$. Two questions follow: how to make better use of samples, and how to know we've sampled enough?
How to draw samples from P? Forward Sampling: draw from ancestors to descendants in a BN. Rejection Sampling: create samples using forward sampling, then reject those inconsistent with the evidence. Importance Sampling: sample from a proposal distribution Q, but give larger weights to samples with higher likelihood under P. Markov Chain Monte Carlo: define a transition distribution T such that the samples get closer and closer to P.
Agenda: When to use Particle-based Approximate Inference? / Forward Sampling & Importance Sampling / Markov Chain Monte Carlo (MCMC) / Collapsed Particles
Forward Sampling. The classic alarm network: Burglary, P(B) = 0.001; Earthquake, P(E) = 0.002; Alarm, P(A | B,E): (t,t) 0.95, (t,f) 0.94, (f,t) 0.29, (f,f) 0.001; JohnCalls, P(J | A): t 0.90, f 0.05; MaryCalls, P(M | A): t 0.70, f 0.01.
Forward Sampling. Follow the topological order: draw B, then E, then A given (B,E), then J and M given A. E.g. one i.i.d. sample: (~b, e, a, j, ~m).
Forward Sampling. Samples (~b, e, a, j, ~m), (~b, ~e, a, ~j, ~m), ... form a particle-based representation of the joint distribution P(B,E,A,J,M); e.g. $P(B = b, M = \neg m) \approx \frac{1}{N}\sum_n 1\{B^{(n)} = b,\, M^{(n)} = \neg m\}$. What if we want samples from P(B,E,A | J=j, M=~m)? 1. Collect all samples in which J=j, M=~m. 2. Those samples form the particle-based representation of P(B,E,A | J=j, M=~m). Disadvantage?
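A runnable sketch of forward sampling plus rejection for this alarm network (CPT values taken from the slide above; the function and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
bern = lambda p: rng.random() < p

def forward_sample():
    """One ancestral pass through the alarm network (CPTs from the slide)."""
    b = bern(0.001)
    e = bern(0.002)
    a = bern({(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}[(b, e)])
    j = bern(0.90 if a else 0.05)
    m = bern(0.70 if a else 0.01)
    return b, e, a, j, m

# Rejection sampling for P(B, E, A | J = 1, M = 0):
# keep only the forward samples consistent with the evidence.
N = 200_000
kept = [(b, e, a) for b, e, a, j, m in (forward_sample() for _ in range(N))
        if j == 1 and m == 0]
print(len(kept) / N)                      # fraction kept ~= P(J=1, M=0)
print(np.mean([b for b, e, a in kept]))   # ~= P(B=1 | J=1, M=0)
```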
Forward Sampling from P(Z | Data)? Consider a chain Z_1, Z_2, ..., Z_K with observations X_1 = 1, X_2 = 0, X_3 = 1, ..., X_K = 0. 1. Forward sample N times. 2. Collect all samples Z^{(n)} in which X_1 = 1, X_2 = 0, X_3 = 1, ..., X_K = 0. 3. Those samples form the particle-based representation of P(Z | Data). How many such samples can we get? About N * P(Data)!! Less than 1 if N is not large enough. Solutions?
Importance Sampling to the Rescue. We need not draw from P to compute $E_P[f]$: $E_P[f] = \sum_x P(x) f(x) = \sum_x Q(x)\frac{P(x)}{Q(x)} f(x) = E_Q\!\left[f(x)\frac{P(x)}{Q(x)}\right]$. That is, we can draw from an arbitrary distribution Q, but give larger weights to samples having higher probability under P: $\hat E[f] = \frac{1}{N}\sum_n f(x^{(n)})\frac{P(x^{(n)})}{Q(x^{(n)})}$, with $x^{(n)} \sim Q$.
Importance Sampling to the Rescue. Sometimes we can only evaluate an unnormalized distribution $\tilde P(x)$, where $P(x) = \tilde P(x)/Z$. Then $E_P[f] = \frac{1}{Z} E_Q\!\left[\frac{\tilde P(x)}{Q(x)} f(x)\right]$, and we can estimate Z as $\hat Z = \frac{1}{N}\sum_n \frac{\tilde P(x^{(n)})}{Q(x^{(n)})} = \frac{1}{N}\sum_n w_n$, where $w_n = \tilde P(x^{(n)})/Q(x^{(n)})$. Combining the two gives $\hat E[f] = \frac{\sum_n f(x^{(n)})\, w_n}{\sum_n w_n}$. Note that we can compute $\hat Z$ only if we can evaluate a normalized Q, e.g. a Q defined by a BN.
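A small self-normalized importance-sampling sketch (the unnormalized target P~, the proposal Q, and all constants here are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Unnormalized target P~(x) = exp(-(x-1)^2 / 2); true Z = sqrt(2*pi), E_P[X] = 1.
p_tilde = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)
f = lambda x: x

# Proposal Q = N(0, 2^2): easy to sample from and to evaluate (normalized).
N = 100_000
x = rng.normal(0.0, 2.0, size=N)
w = p_tilde(x) / norm.pdf(x, loc=0.0, scale=2.0)   # w_n = P~(x_n) / Q(x_n)

Z_hat = w.mean()                     # estimates Z = sqrt(2*pi) ~= 2.5066
E_hat = (w * f(x)).sum() / w.sum()   # self-normalized estimate of E_P[f]
print(Z_hat, E_hat)                  # ~ 2.507, ~ 1.0
```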
Importance Sampling from P(Z | Data)? 1. Sample from Q(Z), a normalized distribution obtained from the BN by truncating the parts with evidence. 2. Give each sample Z^{(n)} a weight $w_n = \frac{\tilde P(Z)}{Q(Z)} = \frac{P(Z, \text{Data})}{P(Z)} = P(\text{Data} \mid Z)$.
Importance Sampling from P(Z | Data)? 3. The effective number of samples is $N_{\text{eff}} = \sum_n w_n$, and $\hat P(\text{Data}) = N_{\text{eff}}/N$. E.g. samples (A,B,A,...,A) with w = 0.01, (A,B,A,...,B) with w = 0.3, and (B,B,B,...,A) with w = 1.0 give $N_{\text{eff}} = 1.31$ and $\hat P(\text{Data}) = 1.31/3$.
Importance Sampling from P(Z | Data)? With the same three weighted samples, $\hat P(Z_1 = B \mid \text{Data}) = \frac{0.01\cdot 0 + 0.3\cdot 0 + 1.0\cdot 1}{1.31} = 0.76$, and $\hat P(Z_1 = A, Z_K = B \mid \text{Data}) = \frac{0.01\cdot 0 + 0.3\cdot 1 + 1.0\cdot 0}{1.31} = 0.23$. Any joint distribution can be estimated this way: no out-of-clique problem.
Bayesian Treatment with Importance Sampling. E.g. parameters θ of P(Y = 1 | x) = logistic(θ₁x + θ₀). Often the posterior on parameters, $P(\theta \mid \text{Data}) = \frac{P(\text{Data}\mid\theta)\,P(\theta)}{\int P(\text{Data}\mid\theta)\,P(\theta)\, d\theta}$, is intractable because many types of $P(\text{Data}\mid\theta)P(\theta)$ cannot be integrated analytically. With importance sampling we need not evaluate the integral: draw $\theta^{(n)} \sim Q$, set $w_n = \frac{P(\text{Data}\mid\theta^{(n)})\,P(\theta^{(n)})}{Q(\theta^{(n)})}$, and approximate $\hat P(\theta = a \mid \text{Data}) = \frac{\sum_n 1\{\theta^{(n)} = a\}\, w_n}{\sum_n w_n}$.
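A hypothetical sketch of this idea for the logistic-regression example, taking the prior as the proposal Q so the weights reduce to the likelihood (the data and prior here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for P(Y=1|x) = logistic(theta1 * x + theta0); values are made up.
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
Y = np.array([0, 0, 1, 1, 1])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def log_lik(theta):                      # log P(Data | theta)
    p = sigmoid(theta[1] * X + theta[0])
    return np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

# Proposal Q = prior P(theta) = N(0, 2^2 I); then w_n ∝ P(Data | theta_n).
N = 50_000
thetas = rng.normal(0.0, 2.0, size=(N, 2))
w = np.exp([log_lik(t) for t in thetas])
w /= w.sum()

post_mean = w @ thetas                   # estimate of E[theta | Data]
print(post_mean)
```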
What's the problem? If P and Q are not matched properly, only a small number of samples will fall in the region where P is high, so a very large N is needed to get a good picture of P. [Figure: P and Q over the domain of X, with Q placing little mass where P is large.]
How do P(Z) and Q(Z) match? When the evidence is close to the root, forward sampling gives a good Q(Z): it generates samples with high likelihood under P(Z), so Q(Z) is close to P(Z).
When the evidence is on the leaves, forward sampling gives a bad Q(Z): it yields samples with very low likelihood $\tilde P(Z)$, so Q(Z) is far from P(Z) and we need a very large sample size to get a good picture of P(Z). Can we improve with time, drawing from a distribution more and more like the desired P(Z)? MCMC tries to draw from a distribution closer and closer to P(Z), and it applies equally well to BNs and MRFs.
Agenda: When to use Particle-based Approximate Inference? / Forward Sampling & Importance Sampling / Markov Chain Monte Carlo (MCMC) / Collapsed Particles
What is a Markov Chain (MC)? A set of random variables X = (X_1, ..., X_K) whose configuration changes with time, $x^{(t)} = (x_1^{(t)}, \dots, x_K^{(t)})$, taking transitions according to $P(x^{(t+1)} = x' \mid x^{(t)} = x) = T(x \to x')$. A stationary distribution $\pi_T$ for transition T satisfies $\pi_T(X = x') = \sum_x \pi_T(X = x)\, T(x \to x')$: after a transition, the distribution over configurations is unchanged. E.g. an MC with a single variable taking values $\{x^1, x^2, x^3\}$ and transition matrix T has a $\pi_T$ with $\pi_T = \pi_T T$. [Figure: a 3-state chain with its transition probabilities and stationary distribution.]
What is MCMC (Markov Chain Monte Carlo)? Importance sampling is efficient only if Q matches P well, and finding such a Q is difficult. Instead, MCMC finds a transition distribution T such that the chain tends to transit into states with high P(x), and eventually x follows the stationary distribution $\pi_T = P$. Setting $x^{(0)}$ to any initial value, we sample $x^{(1)}, \dots, x^{(M)}$ following T and hope that $x^{(M)}$ follows $\pi_T = P$; if it does, we have obtained a sample $x^{(M)}$ from P. Why does the MC converge to the stationary distribution? There is a simple, useful sufficient condition. Regular Markov chain (for a finite state space): any state can reach any other state with probability > 0 (e.g. all entries of the potentials/CPDs are > 0); then $x^{(M)}$ follows a unique $\pi_T$ once M is large enough.
Example Result. [Figure omitted.]
How to define T? ---- Gibbs Sampling. Gibbs sampling is the most popular MCMC method used in graphical models. In a graphical model it is easy to draw a sample of each individual variable given the others, $P(X_k \mid x_{-k})$, while drawing from the joint distribution $P(X_1, \dots, X_K)$ is difficult. So in Gibbs sampling we define T by taking transitions of $X_1 \sim X_K$ in turn with transition distributions $T_1, \dots, T_K$, where $T_k((x_k, x_{-k}) \to (x_k', x_{-k})) = P(X_k = x_k' \mid x_{-k})$: redraw $X_k$ from its conditional distribution given all the others. In a graphical model, $P(X_k = x_k \mid x_{-k}) = P(X_k = x_k \mid \text{MarkovBlanket}(X_k))$.
Gibbs Sampling for an MRF (t = 1). Gibbs sampling: 1. Initialize all variables randomly. For t = 1 ~ M: for every variable X_i: 2. draw $x_i^{(t)}$ from $P(X_i \mid x_{-i})$. Here the grid MRF has pairwise potential φ(X,Y) with φ(0,0) = 5, φ(0,1) = φ(1,0) = 1, φ(1,1) = 9.
Gibbs Sampling for an MRF (t = 2). For the central node with neighbor values (0, 1, 1, 0): $P(X = 1 \mid \text{neighbors}) = \frac{1\cdot 9\cdot 9\cdot 1}{1\cdot 9\cdot 9\cdot 1 + 5\cdot 1\cdot 1\cdot 5} = 0.76$.
Gibbs Sampling for an MRF (t = 3). For the central node with all four neighbors equal to 1: $P(X = 1 \mid \text{neighbors}) = \frac{9\cdot 9\cdot 9\cdot 9}{9\cdot 9\cdot 9\cdot 9 + 1\cdot 1\cdot 1\cdot 1} = 0.99$.
Gibbs Sampling for an MRF. When M is large enough, $x^{(M)}$ follows the stationary distribution $P(x) = \frac{1}{Z}\prod_C \phi_C(x_C)$. Regularity holds here because all entries in the potential are positive.
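A minimal Gibbs sampler for this grid MRF (assuming the potential table φ(0,0)=5, φ(1,1)=9, off-diagonal 1 reconstructed from the arithmetic above; the grid size and sweep count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pairwise potential from the slides: phi[x, y].
phi = np.array([[5.0, 1.0],
                [1.0, 9.0]])

n = 10
X = rng.integers(0, 2, size=(n, n))          # 1. random initialization

def neighbors(i, j):
    """Values of the (up to 4) grid neighbors of cell (i, j)."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < n and 0 <= j + dj < n:
            yield X[i + di, j + dj]

M = 200
for t in range(M):                           # Gibbs sweeps
    for i in range(n):
        for j in range(n):
            # P(X_ij = x | rest) ∝ prod over neighbor values y of phi(x, y)
            score = np.array([np.prod([phi[x, y] for y in neighbors(i, j)])
                              for x in (0, 1)])
            X[i, j] = rng.random() < score[1] / score.sum()   # 2. redraw

print(X.mean())   # fraction of 1s after M sweeps
```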
Why does Gibbs sampling have $\pi_T = P$? To prove that P is the stationary distribution, we show P is invariant under each $T_k$. Assume $(x_1, \dots, x_K)$ currently follows $P = P(x_k \mid x_{-k})\, P(x_{-k})$. 1. After $T_k$, $x_{-k}$ still follows $P(x_{-k})$ because it is unchanged. 2. After $T_k$ (the new $x_k'$ is drawn from $P(X_k \mid x_{-k})$, independent of the current value $x_k$), $x_k$ still follows $P(x_k \mid x_{-k})$. So after $T_1, \dots, T_K$, $X = (x_1, \dots, x_K)$ still follows P. Uniqueness and convergence are guaranteed by the regularity of the MC.
Gibbs Sampling does not always work: sometimes drawing from the individual conditionals is not possible, i.e. we can evaluate $\tilde P(X \mid Y)$ but cannot sample $P(X_k \mid x_{-k}, Y)$. Non-linear dependency: e.g. $P(Y = 1 \mid x) = \text{logistic}(w_2 x^2 + w_1 x + w_0)$, or a kernelized model; the conditional $P(x_k \mid x_{-k}, y)$ then involves an intractable integration over the continuous variables. Large state space: e.g. in structure learning, $G \in \{G_1, G_2, G_3, \dots\}$ and $P(G \mid \text{Data}) = \frac{P(\text{Data}\mid G)\, P(G)}{\sum_{G'} P(\text{Data}\mid G')\, P(G')}$ has too large a state space to do the summation. Other MCMC methods, like Metropolis-Hastings, are needed (see references).
Metropolis-Hastings ---- MCMC. Metropolis-Hastings (M-H) is a general MCMC method to sample $P(X \mid Y)$ whenever we can evaluate the unnormalized $\tilde P(X \mid Y)$; evaluation of the normalized $P(X \mid Y)$ is not needed. In M-H, instead of drawing from $P(X \mid Y)$, we draw a proposal from another proposal distribution $T^Q(x \to x')$ based on the current sample, and accept the proposal with probability $A(x \to x') = \min\!\left[1, \frac{\tilde P(x')\, T^Q(x' \to x)}{\tilde P(x)\, T^Q(x \to x')}\right]$; otherwise we stay at x.
Example 1: P(X) = N(μ, σ²), with proposal distribution $T^Q(x \to x') = N(x';\, x, 0.2^2)$ (red: reject; green: accept). Since this proposal is symmetric, $T^Q(x' \to x) = T^Q(x \to x')$, and the acceptance probability simplifies to $\min[1, \tilde P(x')/\tilde P(x)]$ in this case.
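A sketch of this random-walk Metropolis example (the target mean/variance and the proposal width are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: P~(x) ∝ exp(-(x - mu)^2 / (2 sigma^2)), here mu=3, sigma=1.
mu, sigma = 3.0, 1.0
p_tilde = lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2)

x = 0.0                                    # arbitrary starting point
samples = []
for t in range(50_000):
    x_prop = rng.normal(x, 0.2)            # symmetric proposal N(x, 0.2^2)
    # Symmetric proposal => T^Q ratio is 1, acceptance = min[1, P~(x')/P~(x)].
    if rng.random() < min(1.0, p_tilde(x_prop) / p_tilde(x)):
        x = x_prop                         # accept
    samples.append(x)                      # otherwise keep the current x

burned = samples[5_000:]                   # discard burn-in
print(np.mean(burned), np.std(burned))     # ~ (3.0, 1.0)
```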
Example 2: structure posterior P(G | Data). Proposal distribution T(G → G'): add/remove a randomly chosen edge of G. Accept the move from G to G' with probability $\min\!\left[1, \frac{P(\text{Data}\mid G')\, P(G')\, T(G' \to G)}{P(\text{Data}\mid G)\, P(G)\, T(G \to G')}\right]$; here $T(G' \to G) = T(G \to G')$, so the proposal ratio is 1 in this case.
Why does Metropolis-Hastings have $\pi_T = P$? Detailed balance (a sufficient condition): if $\pi_T(x)\, T(x \to x') = \pi_T(x')\, T(x' \to x)$, then $\pi_T$ is stationary under T. Given the desired $\pi_T = P$ and a proposal distribution $T^Q$, we can make detailed balance hold using the acceptance probability: define $A(x \to x') = \min\!\left[1, \frac{P(x')\, T^Q(x' \to x)}{P(x)\, T^Q(x \to x')}\right]$, so the effective transition is $T(x \to x') = T^Q(x \to x')\, A(x \to x')$ for $x' \neq x$. Then $P(x)\, T(x \to x') = \min[P(x)\, T^Q(x \to x'),\ P(x')\, T^Q(x' \to x)]$, which is symmetric in x and x', hence equals $P(x')\, T(x' \to x)$: detailed balance holds.
How to collect N samples? Assume we want to collect N samples. Option 1: run N chains of MCMC and collect their M-th samples. Option 2: run one chain of MCMC and collect its (M+1)-th ~ (M+N)-th samples. What's the difference?
How to collect N samples? Option 1 (N chains, taking the M-th sample of each) costs M*N samples and yields independent samples. Option 2 (one chain, taking samples M+1 ~ M+N) costs only M+N samples but yields correlated samples.
Comparison: cost M*N with independent samples (left) vs. cost M+N with correlated samples (right). In both cases the estimator is unbiased: $E[\hat f] = E[\frac{1}{N}\sum_n f(x^{(n)})] = \frac{1}{N}\sum_n E[f(x^{(n)})] = E[f]$; no independence assumption is used. For a simple analysis, take N = 2. Independent samples: $\mathrm{Var}[\hat f] = \mathrm{Var}[\tfrac{1}{2}(f(x^{(1)}) + f(x^{(2)}))] = \tfrac{1}{4}(\mathrm{Var}[f] + \mathrm{Var}[f]) = \tfrac{1}{2}\mathrm{Var}[f]$. Correlated samples: $\mathrm{Var}[\hat f] = \tfrac{1}{4}(\mathrm{Var}[f] + \mathrm{Var}[f] + 2\,\mathrm{Cov}[f(x^{(1)}), f(x^{(2)})]) = \tfrac{1}{2}\mathrm{Var}[f](1 + \rho)$, where ρ is the correlation between $f(x^{(1)})$ and $f(x^{(2)})$. Practically, many correlated samples (right) often outperform few independent samples (left).
How to check convergence? Multiple chains should be consistent with each other if they have converged to $\pi_T$: check that the ratio B/W is close enough to 1. Assume K chains, each with N samples: $\bar f = \frac{1}{K}\sum_k \bar f_k$; variance between chains $B = \frac{N}{K-1}\sum_k (\bar f_k - \bar f)^2$; variance within chains $W = \frac{1}{K(N-1)}\sum_k \sum_n (f_{k,n} - \bar f_k)^2$.
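A sketch of the B/W check on K parallel chains (Gelman-Rubin style; the fake chain data are for illustration only):

```python
import numpy as np

def convergence_ratio(chains):
    """chains: K x N array of f(x) values, one row per independent MCMC run.
    Returns B/W; values near 1 suggest the chains agree."""
    K, N = chains.shape
    means = chains.mean(axis=1)                      # f-bar_k per chain
    grand = means.mean()                             # overall f-bar
    B = N / (K - 1) * np.sum((means - grand) ** 2)   # between-chain variance
    W = np.mean(chains.var(axis=1, ddof=1))          # within-chain variance
    return B / W

# Fake chains: well-mixed vs. chains stuck at two different modes.
rng = np.random.default_rng(0)
good = rng.normal(0, 1, size=(4, 1000))
bad = good + np.array([[0.0], [0.0], [5.0], [5.0]])
print(convergence_ratio(good), convergence_ratio(bad))  # ~1 vs. >> 1
```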
The critical problem of MCMC: when $\rho \to 1$, $M \to \infty$ and Var[·] does not decrease with N, so MCMC cannot yield acceptable results in reasonable time. [Figure: two well-separated modes connected only by tiny transition probabilities, e.g. 0.0001 and 0.0002.] A very large M is needed to converge to $\pi_T$.
How to reduce the correlation ρ among samples? Take larger steps in the sample space: Block Gibbs Sampling, and Collapsed-Particle Sampling.
Problem of Gibbs sampling: the correlation ρ between samples is high when the correlation among the variables $X_1 \sim X_K$ is high, e.g. φ(X,Y) with φ(0,0) = φ(1,1) = 99 and φ(0,1) = φ(1,0) = 1. A very large M is needed to converge to $\pi_T$.
Block Gibbs Sampling: draw a block of variables jointly, e.g. draw (X, Y) by first drawing X from its marginal (here proportional to (100, 100)) and then Y from P(Y | X = x). This converges to $\pi_T$ much more quickly.
Block Gibbs Sampling: divide X into several tractable blocks $X_1, \dots, X_B$. Each block $X_b$ can be drawn jointly given the variables in the other blocks: given samples of the other blocks, drawing a block jointly from $P(X_b \mid x_{-b})$ is tractable. [Figure: in the factorial HMM, each chain (V, W, Y, Z) forms one block.]
Block Gibbs Sampling by VE: drawing a block $X_b$ jointly may need one pass of VE. Consider a chain A - B - C - D - E with factors f(A,B), f(B,C), f(C,D), f(D,E). First compute the backward messages: $M_D(D) = \sum_E f(D,E)$, $M_C(C) = \sum_D f(C,D)\, M_D(D)$, $M_B(B) = \sum_C f(B,C)\, M_C(C)$. Then sample forward: draw A from $\propto \sum_B f(A,B)\, M_B(B)$; draw B from $\propto f(A{=}a, B)\, M_B(B)$; draw C from $\propto f(B{=}b, C)\, M_C(C)$; draw D from $\propto f(C{=}c, D)\, M_D(D)$; draw E from $\propto f(D{=}d, E)$. Then draw another block. A sketch of this procedure follows.
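A sketch of this backward-messages-then-forward-sampling pass for a binary chain (the factor tables here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Chain A - B - C - D - E; factors[k] couples variables k and k+1.
factors = [rng.random((2, 2)) + 0.1 for _ in range(4)]

# Backward VE pass: M_D(d) = sum_e f(D,E); M_C(c) = sum_d f(C,D) M_D(d); ...
messages = [None] * 4
msg = np.ones(2)
for k in range(3, -1, -1):
    messages[k] = msg            # message arriving at variable k+1 from its right
    msg = factors[k] @ msg       # sum out the variable to the right

# Forward sampling pass: draw A ∝ sum_B f(A,B) M_B(B), then B ∝ f(a,B) M_B(B), ...
probs = msg / msg.sum()          # P(A) with all other variables summed out
sample = [rng.choice(2, p=probs)]
for k in range(4):
    scores = factors[k][sample[-1]] * messages[k]
    sample.append(rng.choice(2, p=scores / scores.sum()))

print(sample)                    # one joint draw (a, b, c, d, e) from the block
```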
Example: a KC (knowledge component) model with student types S, KC types, knowledge states K over time, and problem types T. 1. It is intractable with exact inference. 2. Plain Gibbs sampling may converge too slowly. Divide the variables into 4 blocks: 1. student types, 2. KC types, 3. K states, 4. problem types. Student types are independent of each other given samples of the other blocks, so they are easy to draw jointly; the same holds for KC types and for problem types. The K states are chain-structured given samples of the other blocks, so they can be drawn jointly using VE.
Agenda: When to use Particle-based Approximate Inference? / Forward Sampling & Importance Sampling / Markov Chain Monte Carlo (MCMC) / Collapsed Particles
Collapsed Particles. Exact: $E_P[f] = \sum_x P(x)\, f(x)$. Particle-based: $\hat f = \frac{1}{N}\sum_n f(x^{(n)})$. Collapsed-particle: divide X into two parts $\{X_p, X_d\}$ such that $X_d$ allows exact inference given $X_p$; then $E_P[f] = \sum_{x_p} P(x_p) \sum_{x_d} P(x_d \mid x_p)\, f(x_p, x_d)$, and with samples $x_p^{(n)} \sim P(x_p)$, $\hat E[f] = \frac{1}{N}\sum_n \sum_{x_d} P(x_d \mid x_p^{(n)})\, f(x_p^{(n)}, x_d)$. If $X_p$ contains only a few variables, the variance can be much reduced!
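A toy collapsed-particle (Rao-Blackwellized) estimator, with a made-up two-variable joint so the exact answer is checkable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint P(xp, xd) over two binary variables (made-up numbers):
P = np.array([[0.30, 0.10],     # rows: xp = 0, 1
              [0.15, 0.45]])    # cols: xd = 0, 1
P_xp = P.sum(axis=1)            # marginal P(xp)
f = lambda xp, xd: xp + 2 * xd  # any query function

# Collapsed particles: sample only xp, sum xd out exactly for each sample.
N = 10_000
xp_samples = rng.choice(2, size=N, p=P_xp)
est = np.mean([sum(P[xp, xd] / P_xp[xp] * f(xp, xd) for xd in (0, 1))
               for xp in xp_samples])

exact = sum(P[a, b] * f(a, b) for a in (0, 1) for b in (0, 1))
print(est, exact)               # collapsed estimate vs. exact E_P[f]
```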
KC example: divide into {p, d}, with p = {student type, KC type, problem type} and d = {K states}; $X_d$ can be exactly inferred given $X_p$.
Collapsed Particles with VE. To draw one variable of $X_p$, condition on all the other variables in $X_p$ and sum out all the variables in $X_d$ with a pass of VE over the K chain: e.g. draw S given KC = kc and T = t from $P(S \mid kc, t) \propto \sum_{K_1, K_2, \dots} f(S, KC{=}kc, K_1, K_2, \dots)$, combining the messages from the summed-out K variables; the analogous computations draw KC given S = s and T = t, and T given S = s and KC = kc.
Collect samples. Each collapsed particle pairs a sample of $X_p$ = (S, KC, T1, T2, T3) — e.g. (Intel, Quick, Hard, Easy, Hard), (Intel, Slow, Easy, Easy, Hard), ..., (Dull, Slow, Easy, Easy, Hard) — with the exact conditional distributions over $X_d$ = (K1, K2, K3), e.g. {1/3, 1/3, 1/3}, {1/4, 1/4, 1/2}, {1/2, 1/2, 0}. Averaging gives $\hat E[f] = \frac{1}{N}\sum_n \sum_{x_d} P(x_d \mid x_p^{(n)})\, f(x_p^{(n)}, x_d)$.