Minimizing D(Q,P) def = Q(h)

Size: px

Start display at page:

Download "Minimizing D(Q,P) def = Q(h)"

Leo Johns
5 years ago
Views:

1 Inference Lecture 20: Variational Metods Kevin Murpy 29 November 2004 Inference means computing P( i v), were are te idden variables v are te visible variables. For discrete (eg binary) idden nodes, exact inference takes O(2 w ) time, were w is te induced widt of te grap. For continuous idden nodes, exact (closed-form) inference is only possible in rare circumstances eg. jointly Gaussian models (Kalman filters, etc.). We will first consider various approximations for approximate inference for discrete variables. For continuous or mixed discrete/cts variables Extend ADF/ PF from online inference in cains to offline inference in general graps: (ADF EP, PF NBP) Or use MCMC (eg Gibbs) Variational inference Let us try to find an approximation Q() wic is as close as possible to P( v). We usually measure closeness using Kullback Leibler divergence: D(Q,P) def = Q() log Q() P( v) = E H Q log Q(H) P(H v) Tis is different tan minimizing D(P,Q) = E H P log Q(H) P(H v) wic is intractable. D(q, p) D(p, q) We will endow Q wit free (variational) parameters, and minimize min ξ D(Q(,ξ),P( v)). Variational Free Energy Minimizing D(Q,P) def = Q() Q() log is ard, since P( v) is P( v) intractable. But for a Bayes net, P(,v) is easy (product of CPDs). So we minimize te free energy: F(Q,P) def = D(Q,P) log P(v) = Q() log Q() P( v) Q() log P(v) = Q() log Q() P(,v) Since D(Q,P) 0, we ave F(Q,P) log P(v), so minimizing F is maximizing an upper bound on te log likeliood. Alternative derivation: use Jensen s inequality: log P(v) = log P(,v) = log Q() P(,v) Q() Q() log P(,v) = F(Q,P) Q()

2 Exact inference Let us minimize F(Q,P) subject only to te constraint tat Q(H) = 1: J = Q() log Q() Q() log P( v) + λ( Q() 1) Derivative: J Q( ) = log Q( ) + Q( ) Q( ) log P( v) + λ Solving J Q( = 0 yields Q() = P( v). ) Pairwise MRFs For ease of explanation, I will often assume te model can be written as an MRF wit pairwise potentials (one per edge): P(x y) = 1 Z ij (x i,x j <ij>ψ ) ψ ii (x i ) i Any Bayes net/ Markov net/ factor grap can be converted into tis form, by creating extra meganodes. Viterbi approximation Te Viterbi approximation is to assume tat all te posterior probability mass is assigned to a single (MAP) assignment ĥ: Q() = δ(,ĥ). i.e., we associate every idden variable wit a single value. For GMs wit low treewidt, we can find ĥ efficiently. In general, we can use iterative tecniques. Iterative Conditional Modes (ICM) ICM assigns eac variable to its MAP estimate, olding all te oters constant: i := arg maxp( i \ i,v) = arg max ψ ii ( i ) ψ ij ( i, j ) i i were te j s are in i s Markov blanket. K-means clustering is an example of ICM, were are te assignment variables for eac data point to a cluster, and te value of te cluster centers (means). ICM is very greedy and often gets stuck in local optima.

3 Gibbs sampling Gibbs sampling is a stocastic version of ICM, were instead of picking te best state, we sample a state: i P( i \ i,v) = F( i) i F( i ) were F( i ) = ψ ii ( i ) ψ ij ( i, j ) Tis is less greedy tan ICM, but can be muc slower. Mean field metod Te mean field metod is like a deterministic version of Gibbs sampling, were we replace samples wit expected values. We make a fully factorized approximation: Q(x) = i b i(x i ). So te mean field free energy is F MF ({b i }) = b i (x i )b j (x j ) log ψ ij (x i,x j ) <ij> x i,x j + b i (x i )[log b i (x i ) log ψ ii (x i )] i x i We want to minimize F MF (b i ) subject to x i b i (x i ) = 1. Hence we iteratively update b i (x i ) ψ ii (x i ) exp b j (x j ) log ψ ij (x i,x j ) x j Mean field Boltzmann macines Te Boltzmann macine (stocastic Hopfield network) is a pairwise MRF were nodes are binary (eiter S i {0, 1} or i { 1, +1}), and potentials ave te restricted form ψ ij (S i,s j ) = exp θ ij S i S j and φ ii (S i ) = exp θ i0 S i : P(s) = 1 Z exp θ ij S i S j + i<j i θ i0 S i Te mean field approximation is Q( v) = i µs i i (1 µ i) 1 S i, were µ i = E(S i ) = P(S i = 1 v). Minimizing D(P,Q) yields te mean field update equations: µ i = σ( j θ ij µ j + θ i0 ) Structured variational approximations Meanfield assumes Q is fully factorized. We can model correlations by exploiting tractable substructure. e.g., decompose factorial HMM into product of cains Q(X 1:N N 1:T ) = Q(X1:N i ) i=1 X (1) 1 X (1) 2 X (1) 3 X (2) 1 X (2) 2 X (2) 3 X (3) 1 X (3) 2 X (3) 3 Y1 Y2 Y3

4 Loopy belief propagation Structured variational approximations remove some edges from te grap and replace teir effect wit variational parameters. An alternative is to leave all te original edges intact, but only capture teir effect locally. Recall tat te free energy is F = Q() log Q() k Ck Q( Ck ) log ψ k ( Ck,v Ck ) Te second term is te expected value of a local factor, and is easy to compute for any Q(). But te entropy term is intractable for general Q(). We will sow ow loopy belief propagation can minimize an approximation to tis. Bete free energy Te Bete approximation is <ij> Q() b ij( i, j ) i b i( i ) d i 1 were d i is te degree of i (ie., number of factors it appears in). So te Bete free energy is F MF ({b i,b ij }) = b ij (x i,x j )[log b ij (x i,x j ) log φ ij ( <ij> x i,x j (d i 1) b i (x i )[log b i (x i ) log ψ i (x i )] i x i were φ ij (x i,x j ) = ψ ij (x i,x j )ψ ii (x i )ψ jj (x j ). Loopy belief propagation is a way to minimize tis subject to te constraints x i b ij (x i,x j ) = b j (x j ) and x i b i (x i ) = 1. Loopy belief propagation minimizes Bete free energy In LBP, we iteratively update our beliefs by message passing m ij (x j ) ψ ij (x i,x j )ψ ii (x i ) m ki (x i ) x i b i (x i ) ψ ii (x i ) k N i m ki (x i ) k N i \j Te messages m ij are exp(λ ij ), were λ ij is te Lagrange multiplier enforcing te marginalization constraint wile minimizing F bete. LBP sometimes called sum-product algoritm. Discrete Message passing/ belief propagation Consider an MRF wit one potential per edge P(X) = 1 Z ij (X i,x j <ij>ψ ) φ i (X i ) i We can generalize te forwards-backwards algoritm as follows: m ij (x j ) = φ i (x i )ψ ij (x i,x j ) m ji (x i ) x i b i (x i ) φ i (x i ) m ji (x i ) k N i \{j} If all potentials, messages and beliefs are discrete: m ij = ψij T φ i. m ki, b i φ i. k m ji

5 Complexity of discrete BP If all potentials, messages and beliefs are discrete: m ij = ψij T φ i. m ki, b i φ i. k m ji If tere are K states, eac message takes O(K 2 ) time to compute. For certain kinds of potentials (e.g., ψ ij (i,j) = exp( u i u j 2 ) were u i,u j IR), te messages can be computed in O(K log K) or even O(K) time. For general potentials, once can use multipole metods and fancy data structures (like kd-trees) to do tis in O(K) or O(K log K) time. See Nando s NIPS worksop on fast metods. Loopy belief propagation Te BP equations are exact if te grap is a cain or a tree (assuming we can implement sum and product operators analytically). Wat appens if BP is applied to graps wit loops? If may not convergence, and even if it does, it may be wrong. However, in practice, it often works well (e.g., error correcting codes). Msg Type Algo Correct if conv? Suff cond for conv? Discrete No No Discrete max Strong local opt. No Gaussian Means - yes, covs - no Yes General?? Summary so far For discrete state spaces, we ave te following ranking of algoritms from best to worst (in terms of accuracy/ speed): Loopy belief propagation Mean field Iterative conditional modes Gibbs sampling Wat about continuous state spaces? Message passing for general state spaces Filtering on cains is equivalent to message passing in a left-to-rigt fasion. X1 Y1 Smooting on cains involves a forward and a backwards pass. Inference on trees involves an upwards and a downwards pass. Inference on loopy graps involves parallel message passing. X2 Y2 X3 Y3 X4 Y4 H1 H2 H3 H4 x 1,n x 3,n x 2,n x 4,n O1 O2 O3 O4 x 5,n x 6,n

6 Gaussian Message passing/ belief propagation Consider an MRF wit one potential per edge P(X) = 1 Z ij (X i,x j <ij>ψ ) φ i (X i ) i were ψ ij = exp(x T i V ijx j ) and φ i = exp( 1 σ i (X i µ i ) 2 ). Te BP equations are as before: m ij (x j ) = x i φ i (x i )ψ ij (x i,x j ) b i (x i ) φ i (x i ) m ji (x i ) k N i \{j} m ji (x i ) Since a Gaussian times a Gaussian is a Gaussian, and te marginal of a Gaussian is anoter Gaussian, we can implement tese equations in closed form (generalization of te Kalman filter). General Message passing/ belief propagation In general, ow can we implement tese equations? m ij (x j ) = φ i (x i )ψ ij (x i,x j ) m ji (x i ) x i b i (x i ) φ i (x i ) m ji (x i ) k N i \{j} Tis depends on te form of te potentials ψ and φ, and te form of te beliefs b (from wic te form of te messages m can be inferred), just as in filtering for state-space models. Expectation Propagation (EP) Cross between ADF (assumed density filtering) and BP. Suppose potentials/ beliefs are mixtures of K Gaussians. Number of mixture components of posterior belief is K d for a node wit d neigbors; need to project back to K Gaussians (moment matcing). Te tractable messages are inferred by dividing te new belief by te old belief. EP is iterated ADF. Advantages of iterating: Errors made earlier in te sequence can be recovered from. Less dependence on te order in wic data arrives. Multiple forward-backwards passes are necessary, even for cains/ trees, because te message computations are not exact. Non-parametric belief propagation (NPBP) Cross between particle filtering and BP.

EM If all te idden variables are discrete, and all te parameters are continuous, we can use te approximation Q(,θ) = Q()δ(θ, ˆθ), were Q() is a general posterior on and a delta function on te

7 EM If all te idden variables are discrete, and all te parameters are continuous, we can use te approximation Q(,θ) = Q()δ(θ, ˆθ), were Q() is a general posterior on and a delta function on te parameters. Te EM algoritm consists of minimizing F(Q,P θ ) using coordinate ascent. E-step: minimize wrt Q() = computing P( v, ˆθ). M-step: minimize wrt δ(θ, ˆθ). EM variants Standard EM: Q(,θ) = P( v, ˆθ)δ(θ, ˆθ). Variational EM: use a variational approximation (eg mean field) in te E-step. Stocastic EM: use Monte Carlo in te E-step. Incremental EM: update parameters after eac training case (online) instead of after all data (batc). Generalized EM : do a partial M-step (eg. gradient step). Variational Bayes EM (ensemble learning): replace point estimates of parameters wit distributions in te M-step. CG-EM: alternate between conjugate gradient and EM. Comparison of metods Layered model of foreground + background A comparison of algoritms for inference and learning in PGMs, Frey and Jojic, PAMI 2004 to appear

8 Multiple layers plus continuous deformations Comparison of EM: exact, mean field, ICM, BP Comparison of EM: exact, mean field, ICM, BP

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks

Reading Group on Deep Learning Session 4 Unsupervised Neural Networks Jakob Verbeek & Daan Wynen 206-09-22 Jakob Verbeek & Daan Wynen Unsupervised Neural Networks Outline Autoencoders Restricted) Boltzmann