Variational Scoring of Graphical Model Structures


1 Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003

2 Overview
- Bayesian model selection
- Approximations using Variational Bayesian EM
- Annealed Importance Sampling
- Structure scoring in discrete Directed Acyclic Graphs
- Cheeseman-Stutz vs. Variational Bayes (if time)

3 Bayesian Model Selection
Select the model class m_j with the highest probability given the data y:

    p(m_j | y) = p(m_j) p(y | m_j) / p(y),    p(y | m_j) = \int d\theta_j \, p(\theta_j | m_j) p(y | \theta_j, m_j)

Interpretation: the marginal likelihood is the probability that randomly selected parameter values from the model class would generate the data set y.

[Figure: P(Data) plotted against possible data sets Y for "too simple", "just right", and "too complex" model classes.]

Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
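The "random parameters generating the data" reading suggests a brute-force check: draw parameters from the prior and average the likelihood of the data under each draw. Below is a minimal Python sketch for a Beta-Bernoulli coin model (not one of the models in the talk; the data, prior and sample sizes are made up), where the exact marginal likelihood is available for comparison.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=20)            # toy binary data
a, b = 1.0, 1.0                               # Beta(a, b) prior on the coin bias

# Exact log marginal likelihood of the Beta-Bernoulli model
n1, n0 = y.sum(), len(y) - y.sum()
exact = betaln(a + n1, b + n0) - betaln(a, b)

# "Prior sampling" estimate: average the likelihood over parameters drawn from the prior
thetas = rng.beta(a, b, size=100_000)
loglik = n1 * np.log(thetas) + n0 * np.log(1 - thetas)
estimate = np.log(np.mean(np.exp(loglik)))

print(exact, estimate)    # the two numbers should agree closely
```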

4 Intractable Likelihoods
The marginal likelihood is often a difficult integral

    p(y | m) = \int d\theta \, p(\theta | m) p(y | \theta)

because of the high dimensionality of the parameter space (analytical intractability), and also due to the presence of hidden variables:

    p(y | m) = \int d\theta \, p(\theta | m) p(y | \theta) = \int d\theta \, p(\theta | m) \int dx \, p(y, x | \theta, m)

[Figure: graphical model with parameter node \theta, hidden variables x, and observations y.]

5 Practical Bayesian methods
Laplace approximations: appeal to the Central Limit Theorem, making a Gaussian approximation about the maximum a posteriori parameter estimate \hat{\theta}:

    \ln p(y | m) \approx \ln p(\hat{\theta} | m) + \ln p(y | \hat{\theta}) + (d/2) \ln 2\pi - (1/2) \ln |H|

Large sample approximations, e.g. BIC: as n \to \infty,

    \ln p(y | m) \approx \ln p(y | \hat{\theta}) - (d/2) \ln n

Markov chain Monte Carlo (MCMC): guaranteed to converge in the limit. Many samples required for accurate results. Hard to assess convergence.

Variational approximations: ... this changes the cost function.

Other deterministic approximations are also available now, e.g. Bethe approximations (Yedidia, Freeman & Weiss, 2000) and Expectation Propagation (Minka, 2001).
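To make the Laplace and BIC formulas concrete, the following sketch evaluates both for the same toy Beta-Bernoulli model, where the exact log evidence is known in closed form; the data, prior, and model choice are illustrative, not from the talk.

```python
import numpy as np
from scipy.special import betaln

y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
n1, n0, n = y.sum(), (y == 0).sum(), len(y)
a, b = 2.0, 2.0                       # Beta(2, 2) prior so the MAP is interior

# Exact log evidence for reference
exact = betaln(a + n1, b + n0) - betaln(a, b)

# Laplace: Gaussian around the MAP of the unnormalised log posterior
theta_map = (n1 + a - 1) / (n + a + b - 2)
log_joint = ((a - 1 + n1) * np.log(theta_map)
             + (b - 1 + n0) * np.log(1 - theta_map) - betaln(a, b))
hessian = -(a - 1 + n1) / theta_map**2 - (b - 1 + n0) / (1 - theta_map)**2
d = 1
laplace = log_joint + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(-hessian)

# BIC: maximum likelihood plus a dimensionality penalty
theta_ml = n1 / n
bic = n1 * np.log(theta_ml) + n0 * np.log(1 - theta_ml) - 0.5 * d * np.log(n)

print(exact, laplace, bic)
```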

6 Lower Bounding the Marginal Likelihood: Variational Bayesian Learning
Let the hidden states be x, the data y, and the parameters \theta. We can lower bound the marginal likelihood (Jensen's inequality):

    \ln p(y | m) = \ln \int dx \, d\theta \, p(y, x, \theta | m)
                 = \ln \int dx \, d\theta \, q(x, \theta) \frac{p(y, x, \theta | m)}{q(x, \theta)}
                 \geq \int dx \, d\theta \, q(x, \theta) \ln \frac{p(y, x, \theta | m)}{q(x, \theta)}.

Use a simpler, factorised approximation q(x, \theta) \approx q_x(x) q_\theta(\theta):

    \ln p(y | m) \geq \int dx \, d\theta \, q_x(x) q_\theta(\theta) \ln \frac{p(y, x, \theta | m)}{q_x(x) q_\theta(\theta)} = F_m(q_x(x), q_\theta(\theta), y).

7 Updating the variational approximation...
Maximizing this lower bound, F_m, leads to EM-like updates:

    q_x^*(x) \propto \exp[ \int d\theta \, q_\theta(\theta) \ln p(x, y | \theta) ]          (E-like step)
    q_\theta^*(\theta) \propto p(\theta) \exp[ \int dx \, q_x(x) \ln p(x, y | \theta) ]     (M-like step)

Maximizing F_m is equivalent to minimizing the KL divergence between the approximate posterior, q_\theta(\theta) q_x(x), and the true posterior, p(\theta, x | y, m):

    \ln p(y | m) - F_m(q_x(x), q_\theta(\theta), y)
        = \int dx \, d\theta \, q_x(x) q_\theta(\theta) \ln \frac{q_x(x) q_\theta(\theta)}{p(x, \theta | y, m)} = KL(q || p),

where \ln p(y | m) is the desired quantity, F_m is computable, and KL(q || p) measures the inaccuracy of the approximation.

In the limit as n \to \infty, for identifiable models, the variational lower bound approaches Schwarz's (1978) BIC criterion.

8 Conjugate-Exponential models
Let's focus on conjugate-exponential (CE) models, which satisfy conditions (1) and (2):

Condition (1). The joint probability over variables is in the exponential family:

    p(x, y | \theta) = f(x, y) \, g(\theta) \exp\{ \phi(\theta)^T u(x, y) \}

where \phi(\theta) is the vector of natural parameters and u(x, y) are the sufficient statistics.

Condition (2). The prior over parameters is conjugate to this joint probability:

    p(\theta | \eta, \nu) = h(\eta, \nu) \, g(\theta)^\eta \exp\{ \phi(\theta)^T \nu \}

where \eta and \nu are hyperparameters of the prior.

Conjugate priors are computationally convenient and have an intuitive interpretation:
- \eta: number of pseudo-observations
- \nu: values of pseudo-observations
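The pseudo-observation reading of (\eta, \nu) is easiest to see for the Dirichlet-multinomial pair, the CE model used for the discrete DAGs later on. A minimal sketch (the function name and the counts are illustrative):

```python
import numpy as np

def dirichlet_posterior(prior_counts, data):
    """Conjugate update: posterior hyperparameters are the prior pseudo-counts
    plus the observed sufficient statistics (category counts)."""
    counts = np.bincount(data, minlength=len(prior_counts))
    return prior_counts + counts

prior = np.array([1.0, 1.0, 1.0])            # Dirichlet(1,1,1): three pseudo-observations
data = np.array([0, 2, 2, 1, 2, 2, 0])        # observed categories
posterior = dirichlet_posterior(prior, data)
print(posterior)                              # [3. 2. 5.] -- prior counts plus data counts
```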

9 Conjugate-Exponential examples
In the CE family:
- Gaussian mixtures
- factor analysis, probabilistic PCA
- hidden Markov models and factorial HMMs
- linear dynamical systems and switching models
- discrete-variable belief networks
Other as yet undreamt-of models can combine Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and others.

Not in the CE family:
- Boltzmann machines, MRFs (no conjugacy)
- logistic regression (no conjugacy)
- sigmoid belief networks (not exponential)
- independent components analysis (not exponential)
One can often approximate these models with models in the CE family, e.g. IFA (Attias, 1998).

10 A very useful result
Theorem. Given an iid data set y = (y_1, ..., y_n), if the model is CE then:

(a) q_\theta(\theta) is also conjugate, i.e.

    q_\theta(\theta) = h(\tilde{\eta}, \tilde{\nu}) \, g(\theta)^{\tilde{\eta}} \exp\{ \phi(\theta)^T \tilde{\nu} \}

(b) q_x(x) = \prod_{i=1}^n q_{x_i}(x_i) is of the same form as in the E step of regular EM, but using pseudo-parameters computed by averaging over q_\theta(\theta):

    q_{x_i}(x_i) \propto f(x_i, y_i) \exp\{ \bar{\phi}^T u(x_i, y_i) \} = p(x_i | y_i, \tilde{\theta}),
    where \bar{\phi} = \langle \phi(\theta) \rangle_{q_\theta(\theta)} = \phi(\tilde{\theta}).

KEY points:
(a) the approximate parameter posterior is of the same form as the prior;
(b) the approximate hidden-variable posterior, averaging over all parameters, is of the same form as the exact hidden-variable posterior under \tilde{\theta}.

11 The Variational Bayesian EM algorithm

EM for MAP estimation
  Goal: maximize p(\theta | y, m) w.r.t. \theta
  E step:  q_x^{(t+1)}(x) = p(x | y, \theta^{(t)})
  M step:  \theta^{(t+1)} = \arg\max_\theta \int dx \, q_x^{(t+1)}(x) \ln p(x, y, \theta)

Variational Bayesian EM
  Goal: lower bound p(y | m)
  VB-E step:  q_x^{(t+1)}(x) = p(x | y, \bar{\phi}^{(t)})
  VB-M step:  q_\theta^{(t+1)}(\theta) \propto \exp[ \int dx \, q_x^{(t+1)}(x) \ln p(x, y, \theta) ]

Properties:
- Reduces to the EM algorithm if q_\theta(\theta) = \delta(\theta - \theta^*).
- F_m increases monotonically, and incorporates the model complexity penalty.
- Analytical parameter distributions (but not constrained to be Gaussian).
- The VB-E step has the same complexity as the corresponding E step.
- We can use the junction tree, belief propagation, Kalman filter, etc. algorithms in the VB-E step of VB-EM, but using expected natural parameters, \bar{\phi}.
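As an illustration of the two steps, here is a minimal VB-EM loop for a mixture of categorical distributions with Dirichlet priors, a CE model in which the VB-E step only needs the expected natural parameters (digammas of the Dirichlet counts). This is a sketch under assumed names and toy data, not the talk's implementation.

```python
import numpy as np
from scipy.special import digamma

def vbem_mixture(x, K, V, alpha=1.0, beta=1.0, iters=100):
    """Minimal VB-EM for a mixture of categorical distributions.
    x: (n,) integer data in {0..V-1}; K mixture components;
    Dirichlet(alpha) prior on mixing weights, Dirichlet(beta) on emissions."""
    n = len(x)
    rng = np.random.default_rng(0)
    r = rng.dirichlet(np.ones(K), size=n)           # responsibilities q_x(x_i)
    X = np.eye(V)[x]                                 # one-hot data, (n, V)
    for _ in range(iters):
        # VB-M step: Dirichlet posteriors = prior pseudo-counts + expected counts
        a = alpha + r.sum(axis=0)                    # (K,)
        b = beta + r.T @ X                           # (K, V)
        # VB-E step: use expected natural parameters E[ln pi], E[ln theta]
        ln_pi = digamma(a) - digamma(a.sum())
        ln_th = digamma(b) - digamma(b.sum(axis=1, keepdims=True))
        log_r = ln_pi[None, :] + X @ ln_th.T         # (n, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
    return a, b, r

# Toy data from two clearly different categorical sources
x = np.array([0, 0, 1, 0, 0, 1, 3, 4, 4, 3, 4, 4])
a, b, r = vbem_mixture(x, K=2, V=5)
print(np.round(r, 2))    # responsibilities should separate the two halves
```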

12 Variational Bayesian EM
The Variational Bayesian EM algorithm has been used to approximate Bayesian learning in a wide range of models, such as:
- probabilistic PCA and factor analysis (Bishop, 1999)
- mixtures of Gaussians (Attias, 1999)
- mixtures of factor analysers (Ghahramani & Beal, 1999)
- state-space models (Ghahramani & Beal, 2000; Beal, 2003)
- ICA, IFA (Attias, 1999; Miskin & MacKay, 2000; Valpola, 2000)
- mixtures of experts (Ueda & Ghahramani, 2000)
- hidden Markov models (MacKay, 1995; Beal, 2003)

The main advantage is that it can be used to automatically do model selection and does not suffer from overfitting to the same extent as ML methods do. Also, it is about as computationally demanding as the usual EM algorithm.

See:

13 Graphical Models (Bayesian Networks / Directed Acyclic Graphical Models)

[Figure: a five-node Bayesian network over A, B, C, D, E.]

A Bayesian network corresponds to a factorization of the joint distribution:

    p(A, B, C, D, E) = p(A) p(B) p(C | A, B) p(D | B, C) p(E | C, D)

In general:

    p(X_1, ..., X_n) = \prod_{i=1}^n p(X_i | X_{pa(i)})

where pa(i) are the parents of node i.

Semantics: X \perp Y | V if V d-separates X from Y.

Two advantages: interpretability and efficiency.
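The efficiency advantage is that the joint is specified by a handful of small conditional tables rather than one table over all 2^5 configurations. A sketch assembling the joint for binary A-E from made-up CPTs:

```python
import numpy as np

# Illustrative CPTs for binary variables; indices are [parent values..., value]
pA = np.array([0.6, 0.4])                                      # p(A)
pB = np.array([0.7, 0.3])                                      # p(B)
pC = np.random.default_rng(0).dirichlet([1, 1], size=(2, 2))   # p(C | A, B)
pD = np.random.default_rng(1).dirichlet([1, 1], size=(2, 2))   # p(D | B, C)
pE = np.random.default_rng(2).dirichlet([1, 1], size=(2, 2))   # p(E | C, D)

# Joint p(A,B,C,D,E) assembled from the factorization
joint = np.einsum('a,b,abc,bcd,cde->abcde', pA, pB, pC, pD, pE)
print(joint.sum())    # 1.0: a valid distribution built from 2+2+8+8+8 numbers
```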

14 Model Selection Task
Which of the following graphical models is the data generating process?

Discrete directed acyclic graphical models, with data y = (A, B, C, D, E)^n, all observed:
[Figure: three candidate structures over A, B, C, D, E: the true structure, a more complex one, and a simpler one.]
With all variables observed, the marginal likelihood is tractable.

If the data are just y = (C, D, E)^n, and s = (A, B)^n are hidden variables...?
[Figure: the same three candidate structures, now with A and B hidden.]
With hidden variables, the marginal likelihood is intractable.

15 Case study: Discrete DAGs
Let the hidden and observed variables be denoted z = {z_1, ..., z_n} = {s_1, y_1, ..., s_n, y_n}, where for each data point i the variables j \in H are hidden and j \in V are observed, i.e. s_i = {z_{ij}}_{j \in H} and y_i = {z_{ij}}_{j \in V}.

Complete-data likelihood:

    p(z | \theta) = \prod_{i=1}^n \prod_{j=1}^{|z_i|} p(z_{ij} | z_{i,pa(j)}, \theta)

Complete-data marginal likelihood:

    p(z | m) = \int d\theta \, p(\theta | m) \prod_{i=1}^n \prod_{j=1}^{|z_i|} p(z_{ij} | z_{i,pa(j)}, \theta)

Incomplete-data likelihood:

    p(y | \theta) = \prod_{i=1}^n p(y_i | \theta) = \prod_{i=1}^n \sum_{\{z_{ij}\}_{j \in H}} \prod_{j=1}^{|z_i|} p(z_{ij} | z_{i,pa(j)}, \theta)

Incomplete-data marginal likelihood:

    p(y | m) = \int d\theta \, p(\theta | m) \prod_{i=1}^n \sum_{\{z_{ij}\}_{j \in H}} \prod_{j=1}^{|z_i|} p(z_{ij} | z_{i,pa(j)}, \theta)     (intractable!)
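For the bipartite structures used in the experiments below, the sum over hidden configurations is tiny (four joint states of two binary hidden variables), so the incomplete-data likelihood for a fixed \theta can be computed by direct enumeration. A sketch, assuming for simplicity that both hidden variables are parents of every observed variable; the shapes and data are made up:

```python
import numpy as np

def incomplete_loglik(Y, p_s, p_y):
    """Incomplete-data log likelihood for a bipartite discrete DAG.
    Y: (n, 4) observed values; p_s: (4,) prior over the 4 joint states of
    (s1, s2); p_y: (4, 4, V) with p_y[j, k] = p(y_j = . | hidden state k)."""
    total = 0.0
    for i in range(Y.shape[0]):
        # sum over the 4 joint hidden configurations
        per_state = p_s.copy()
        for j in range(4):
            per_state *= p_y[j, :, Y[i, j]]
        total += np.log(per_state.sum())
    return total

rng = np.random.default_rng(0)
p_s = rng.dirichlet(np.ones(4))                   # joint prior over (s1, s2)
p_y = rng.dirichlet(np.ones(5), size=(4, 4))      # 4 observed vars, 5 values each
Y = rng.integers(0, 5, size=(100, 4))             # toy observations
print(incomplete_loglik(Y, p_s, p_y))
```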

16 The BIC and CS approximations
BIC - Bayesian Information Criterion:

    p(y | m) = \int d\theta \, p(\theta | m) p(y | \theta),
    \ln p(y | m) \approx \ln p(y | m)_{BIC} = \ln p(y | \hat{\theta}) - (d/2) \ln n,

where EM is used to find \hat{\theta}.

CS - Cheeseman-Stutz criterion:

    p(y | m) = \frac{p(y | m)}{p(z | m)} p(z | m) = p(s, y | m) \frac{\int d\theta \, p(\theta | m) p(y | \theta)}{\int d\theta \, p(\theta | m) p(s, y | \theta)},     (*)

    \ln p(y | m) \approx \ln p(y | m)_{CS} = \ln p(\hat{s}, y | m) + \ln p(y | \hat{\theta}) - \ln p(\hat{s}, y | \hat{\theta}).

(*) is correct for any completion s of the hidden variables, so what completion \hat{s} should we use? [CS uses the result of the E step of the EM algorithm.]
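Both scores are cheap once EM has been run. The sketch below just spells out the two formulas; the helper names in the usage comment (run_em, log_likelihood, complete_data_log_evidence, ...) are placeholders for what a discrete-DAG implementation would provide, not functions from the talk.

```python
import numpy as np

def bic_score(loglik_at_mle, d, n):
    # ln p(y | m) ~ ln p(y | theta_hat) - (d/2) ln n
    return loglik_at_mle - 0.5 * d * np.log(n)

def cs_score(complete_log_evidence, loglik_at_mle, complete_loglik_at_mle):
    # ln p(y | m) ~ ln p(s_hat, y | m) + ln p(y | theta_hat) - ln p(s_hat, y | theta_hat),
    # where s_hat is the soft completion from the final E step of EM.
    return complete_log_evidence + loglik_at_mle - complete_loglik_at_mle

# Hypothetical usage, given an EM fit of a discrete DAG (all helper names are placeholders):
#   theta_hat, q_s = run_em(Y, structure)
#   bic = bic_score(log_likelihood(Y, theta_hat), num_free_params(structure), len(Y))
#   cs  = cs_score(complete_data_log_evidence(Y, q_s, structure),
#                  log_likelihood(Y, theta_hat),
#                  complete_data_log_likelihood(Y, q_s, theta_hat))
```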

17 Experiments
Bipartite structure: only hidden variables can be parents of observed variables. Two binary hidden variables, and four five-valued discrete observed variables.

[Figure: hidden variables s_{i1}, s_{i2} with arrows to observed variables y_{i1}, ..., y_{i4}, for i = 1, ..., n.]

The conjugate prior is Dirichlet; this is a conjugate-exponential model, so the VB-EM algorithm is a straightforward modification of EM.

Experiment: there are 136 distinct structures (out of 256) with 2 latent variables as potential parents of 4 conditionally independent observed variables. Score each structure for twenty data sets of varying size,

    n \in {10, 20, 40, 80, 110, 160, 230, 320, 400, 430, 480, 560, 640, 800, 960, 1120, 1280, 2560, 5120, 10240},

using 4 methods: BIC, CS, VB, and a gold standard, AIS. This gives 2720 graphs to score; approximate time per score: BIC (1.5s), CS (1.5s), VB (4s), AIS (400s).
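The count of 136 (out of 4^4 = 256 choices of parent set for the four observed variables) is consistent with identifying structures that differ only by swapping the labels of the two hidden variables; a quick enumeration confirming the arithmetic (a sketch, not the authors' code):

```python
from itertools import product

# Each of the 4 observed variables has a parent set drawn from
# {}, {s1}, {s2}, {s1, s2}; encode each as a sorted tuple of hidden-variable labels.
parent_choices = [(), (1,), (2,), (1, 2)]

def swap_labels(parents):
    """Relabel s1 <-> s2 in one parent set."""
    return tuple(sorted(3 - p for p in parents))

def canonical(structure):
    """Canonical form under swapping the labels of the two hidden variables."""
    swapped = tuple(swap_labels(p) for p in structure)
    return min(structure, swapped)

all_structures = list(product(parent_choices, repeat=4))
distinct = {canonical(s) for s in all_structures}
print(len(all_structures), len(distinct))    # 256 136
```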

18 Annealed Importance Sampling (AIS)
AIS is a state-of-the-art method for estimating marginal likelihoods, by breaking a difficult integral into a series of easier ones. It combines ideas from importance sampling, Markov chain Monte Carlo, and annealing.

Define

    Z_k = \int d\theta \, p(\theta | m) p(y | \theta)^{\tau(k)} = \int d\theta \, f_k(\theta)

with \tau(0) = 0, so Z_0 = \int d\theta \, p(\theta | m) = 1 (the normalisation of the prior), and \tau(K) = 1, so Z_K = p(y | m) (the marginal likelihood).

Schedule {\tau(k)}_{k=1}^K:

    Z_K / Z_0 = (Z_1 / Z_0)(Z_2 / Z_1) \cdots (Z_K / Z_{K-1}).

Importance sample from f_{k-1}(\theta) as follows: with \theta^{(r)} \sim f_{k-1}(\theta) / Z_{k-1},

    Z_k / Z_{k-1} = \int d\theta \, \frac{f_k(\theta)}{f_{k-1}(\theta)} \frac{f_{k-1}(\theta)}{Z_{k-1}}
                  \approx \frac{1}{R} \sum_{r=1}^R \frac{f_k(\theta^{(r)})}{f_{k-1}(\theta^{(r)})}
                  = \frac{1}{R} \sum_{r=1}^R p(y | \theta^{(r)}, m)^{\tau(k) - \tau(k-1)}

How reliable is AIS? How tight are the variational bounds?
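A minimal AIS sketch for the toy Beta-Bernoulli model used earlier, where the exact log evidence is available as a check. The talk's experiments anneal over Dirichlet parameters with a tuned Metropolis-Hastings kernel; here the transition is a simple random-walk M-H on one parameter, and all settings are illustrative.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=30)
n1, n0 = y.sum(), len(y) - y.sum()
a, b = 1.0, 1.0
exact = betaln(a + n1, b + n0) - betaln(a, b)

def loglik(theta):
    return n1 * np.log(theta) + n0 * np.log(1 - theta)

def log_prior(theta):
    return (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta) - betaln(a, b)

K, R = 200, 20                           # annealing steps, independent runs
taus = np.linspace(0, 1, K + 1)
log_weights = np.zeros(R)
for r in range(R):
    theta = rng.beta(a, b)               # start from a prior sample (tau = 0)
    for k in range(1, K + 1):
        # accumulate the importance weight for moving from f_{k-1} to f_k
        log_weights[r] += (taus[k] - taus[k - 1]) * loglik(theta)
        # M-H transition leaving f_k invariant (random-walk proposal)
        prop = theta + 0.1 * rng.normal()
        if 0 < prop < 1:
            log_acc = (log_prior(prop) + taus[k] * loglik(prop)
                       - log_prior(theta) - taus[k] * loglik(theta))
            if np.log(rng.uniform()) < log_acc:
                theta = prop
# each run yields one importance weight; average the weights across runs
ais_estimate = np.log(np.mean(np.exp(log_weights)))
print(exact, ais_estimate)               # should be close for this easy problem
```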

19 How reliable is the AIS for this problem?
Varying the annealing schedule with random initialisation, at n = 480.

[Figure: AIS marginal likelihood estimates plotted against the duration of annealing (number of samples), compared with the VB and BIC scores.]

20 Scoring all structures by every method

[Figure (data set of size 400): log marginal likelihood estimate of every structure under MAP, AIS, VB, and BIC.]

21 Every method scoring every structure

[Figure: scores of every structure under MAP, BIC, BICp, CS, VB, and AIS (5 runs).]

22 Ranking of the true structure by each method

[Table: rank of the true structure under MAP, BIC, BICp, CS, VB, and AIS (5 runs), for each data set size n.]

23 Ranking the true structure
The VB score finds the correct structure earlier, and more reliably.

[Figure: rank of the true structure against n, for AIS, VB, CS, BICp, and BIC.]

24 Acceptance rate of the AIS sampler
Metropolis-Hastings acceptance fraction, measured over each of four quarters of the annealing schedule.

[Figure: acceptance fraction against n for the 1st, 2nd, 3rd, and 4th quarters of the schedule.]

25 How true are the various scores?

[Figure: difference in scores between the true and top-ranked structures, against n, for AIS, VB, CS, BICp, and BIC.]

26 Average Rank of the true structure
Averaged over true structure parameters drawn from the prior (106 instances).

[Figure: median rank of the true structure against n, for VB, CS, BICp, and BIC.]

27 True vs. top-ranked score differences
Averaged over true structure parameters drawn from the prior (106 instances).

[Figure: median score difference against n, for VB, CS, BICp, and BIC.]

28 Overall Success Rate of each method
Averaged over true structure parameters drawn from the prior (106 instances).

[Figure: success rate at selecting the true structure against n, for VB, CS, BICp, and BIC.]

29 Results summary
- VB outperforms BIC on ranking, over the 20 data sets and 95 instances.
  [Table: percentage of times that VB ranks worse than, the same as, or better than BIC*, BICp*, CS*, BIC, BICp, and CS.]
- The AIS gold standard can break down at high n, violating VB's strict lower bound when scoring the 136 structures.
  [Table: for each n, the percentage of M-H rejections, the number of structures where a single AIS run (1) scores below VB, and the number where the average of 5 AIS runs scores below VB.]
- AIS has many parameters to tune: Metropolis-Hastings proposal widths/shapes, (non-linear) annealing schedules, number of samples, reaching equilibrium... VB has none!

30 Cheeseman-Stutz is a lower bound

    p(y | m)_{CS} = p(\hat{s}, y | m) \frac{p(y | \hat{\theta})}{p(\hat{s}, y | \hat{\theta})} \leq p(y | m).

Take \hat{q}_{s_i}(s_i) \equiv p(s_i | y_i, \hat{\theta}) (the E step at \hat{\theta}), and define the completion \hat{s}_i by

    \ln p(\hat{s}_i, y_i | \theta) = \sum_{s_i} \hat{q}_{s_i}(s_i) \ln p(s_i, y_i | \theta).

By Jensen's inequality, for any choice of the \hat{q}_{s_i},

    p(y | m) = \int d\theta \, p(\theta) \prod_{i=1}^n p(y_i | \theta)
             \geq \int d\theta \, p(\theta) \prod_{i=1}^n \exp\Big\{ \sum_{s_i} \hat{q}_{s_i}(s_i) \ln \frac{p(s_i, y_i | \theta)}{\hat{q}_{s_i}(s_i)} \Big\},

while at \hat{\theta} the corresponding bound is tight:

    p(y | \hat{\theta}) = \prod_{i=1}^n p(y_i | \hat{\theta}) = \prod_{i=1}^n \exp\Big\{ \sum_{s_i} \hat{q}_{s_i}(s_i) \ln \frac{p(s_i, y_i | \hat{\theta})}{\hat{q}_{s_i}(s_i)} \Big\}.

Therefore

    p(y | m) \geq \prod_{i=1}^n \exp\Big\{ -\sum_{s_i} \hat{q}_{s_i}(s_i) \ln \hat{q}_{s_i}(s_i) \Big\} \int d\theta \, p(\theta) \prod_{i=1}^n \exp\Big\{ \sum_{s_i} \hat{q}_{s_i}(s_i) \ln p(s_i, y_i | \theta) \Big\}
             = \frac{p(y | \hat{\theta})}{\prod_{i=1}^n \exp\{ \sum_{s_i} \hat{q}_{s_i}(s_i) \ln p(s_i, y_i | \hat{\theta}) \}} \int d\theta \, p(\theta) \prod_{i=1}^n p(\hat{s}_i, y_i | \theta)
             = \frac{p(y | \hat{\theta})}{\prod_{i=1}^n p(\hat{s}_i, y_i | \hat{\theta})} \int d\theta \, p(\theta) \prod_{i=1}^n p(\hat{s}_i, y_i | \theta)
             = p(y | m)_{CS}.

31 VB can be made universally tighter than CS

    \ln p(y | m)_{CS} \leq F_m(q_s(s), q_\theta(\theta)) \leq \ln p(y | m).

Let's approach this result by considering the following forms for q_s(s) and q_\theta(\theta):

    q_s(s) = \prod_{i=1}^n q_{s_i}(s_i), with q_{s_i}(s_i) = p(s_i | y_i, \hat{\theta}),
    q_\theta(\theta) \propto \exp \langle \ln p(\theta) p(s, y | \theta) \rangle_{q_s(s)}.

We write the form for q_\theta(\theta) explicitly:

    q_\theta(\theta) = \frac{ p(\theta) \prod_{i=1}^n \exp\{ \sum_{s_i} q_{s_i}(s_i) \ln p(s_i, y_i | \theta) \} }{ \int d\theta' \, p(\theta') \prod_{i=1}^n \exp\{ \sum_{s_i} q_{s_i}(s_i) \ln p(s_i, y_i | \theta') \} },

and then substitute q_s(s) and q_\theta(\theta) into the variational lower bound F_m.

32 Substituting these choices into the lower bound:

    F_m(q_s(s), q_\theta(\theta))
      = \int d\theta \, q_\theta(\theta) \sum_{i=1}^n \sum_{s_i} q_{s_i}(s_i) \ln \frac{p(s_i, y_i | \theta)}{q_{s_i}(s_i)} + \int d\theta \, q_\theta(\theta) \ln \frac{p(\theta)}{q_\theta(\theta)}
      = \sum_{i=1}^n \sum_{s_i} q_{s_i}(s_i) \ln \frac{1}{q_{s_i}(s_i)} + \int d\theta \, q_\theta(\theta) \sum_{i=1}^n \sum_{s_i} q_{s_i}(s_i) \ln p(s_i, y_i | \theta) + \int d\theta \, q_\theta(\theta) \ln \frac{p(\theta)}{q_\theta(\theta)}
      = \sum_{i=1}^n \sum_{s_i} q_{s_i}(s_i) \ln \frac{1}{q_{s_i}(s_i)} + \ln \int d\theta \, p(\theta) \prod_{i=1}^n \exp\Big\{ \sum_{s_i} q_{s_i}(s_i) \ln p(s_i, y_i | \theta) \Big\},

where the last line follows because the final two terms of the previous line combine into the log normaliser of the explicit form of q_\theta(\theta).

With this choice of q_\theta(\theta) and q_s(s) we achieve equality between the CS and VB approximations. We complete the proof by noting that the very next step of VBEM (the VB-E step) is guaranteed to increase or leave unchanged F_m, and hence surpass the CS bound.

33 Summary & Conclusions
- Bayesian learning avoids overfitting and can be used to do model selection.
- Variational Bayesian EM for CE models and propagation algorithms.
- These methods have advantages over MCMC in that they can provide fast approximate Bayesian inference. Especially important in machine learning applications with large data sets.

Results:
- VB outperforms BIC and CS in scoring discrete DAGs.
- VB approaches the capability of AIS sampling, at little computational cost.
- Finds the true structure consistently, whereas AIS needs tuning (e.g. large n).

Compute time:
                                        BIC     CS      VB      AIS
  time to compute each graph score      1.5s    1.5s    4s      400s
  total time for all 2720 graphs        1hr     1hr     3hrs    13days

No need to use CS because VB is provably better!

34 Future Work
Comparison to other methods:
- Laplace method
- Other more sophisticated MCMC variants? (e.g. slice sampling)
- Bethe/Kikuchi approximations to the marginal likelihood for discrete models (Heskes)

- Incorporate into a local search algorithm over structures (the exhaustive enumeration done here is only of academic interest!).
- Extend to Gaussian and other non-discrete graphical models.
- Apply to real-world data sets.

VB tools development:
- Overlay an AIS module into a general variational inference engine, such as VIBES (Winn et al.)
- Automated algorithm derivation, such as AutoBayes (Gray et al.)

35 no more slides... coffee

36-55 Appendix: Scoring all structures by every method, one slide per data set size

[Figures: log marginal likelihood estimate of every structure under MAP, AIS, VB, and BIC, for each data set size n (10, 20, 40, 80, 110, 160, 230, 320, 400, 430, 480, 560, 640, 800, 960, 1120, 1280, 2560, 5120, ...).]
