Variational Bayesian Theory


Chapter 2

Variational Bayesian Theory

2.1 Introduction

This chapter covers the majority of the theory for variational Bayesian learning that will be used in the rest of this thesis. It is intended to give the reader a context for the use of variational methods as well as an insight into their general applicability and usefulness.

In a model selection task the role of a Bayesian is to calculate the posterior distribution over a set of models given some a priori knowledge and some new observations (data). The knowledge is represented in the form of a prior over model structures $p(m)$, and over their parameters $p(\theta \mid m)$, which define the probabilistic dependencies between the variables in the model. By Bayes' rule, the posterior over models $m$ having seen data $y$ is given by:

$$p(m \mid y) = \frac{p(m)\, p(y \mid m)}{p(y)} . \qquad (2.1)$$

The second term in the numerator is the marginal likelihood or evidence for a model $m$, and is the key quantity for Bayesian model selection:

$$p(y \mid m) = \int d\theta \; p(\theta \mid m)\, p(y \mid \theta, m) . \qquad (2.2)$$

For each model structure we can compute the posterior distribution over parameters:

$$p(\theta \mid y, m) = \frac{p(\theta \mid m)\, p(y \mid \theta, m)}{p(y \mid m)} . \qquad (2.3)$$

We might also be interested in calculating other related quantities, such as the predictive density of a new datum $y'$ given a data set $y = \{y_1, \ldots, y_n\}$:

$$p(y' \mid y, m) = \int d\theta \; p(\theta \mid y, m)\, p(y' \mid \theta, y, m) , \qquad (2.4)$$

which can be simplified into

$$p(y' \mid y, m) = \int d\theta \; p(\theta \mid y, m)\, p(y' \mid \theta, m) \qquad (2.5)$$

if $y'$ is conditionally independent of $y$ given $\theta$. We may also be interested in calculating the posterior distribution of a hidden variable, $x'$, associated with the new observation $y'$:

$$p(x' \mid y', y, m) \propto \int d\theta \; p(\theta \mid y, m)\, p(x', y' \mid \theta, m) . \qquad (2.6)$$

The simplest way to approximate the above integrals is to estimate the value of the integrand at a single point estimate of $\theta$, such as the maximum likelihood (ML) or the maximum a posteriori (MAP) estimate, which aim to maximise respectively the second term and both terms of the integrand in (2.2):

$$\theta_{\mathrm{ML}} = \arg\max_\theta \; p(y \mid \theta, m) , \qquad (2.7)$$
$$\theta_{\mathrm{MAP}} = \arg\max_\theta \; p(\theta \mid m)\, p(y \mid \theta, m) . \qquad (2.8)$$

ML and MAP examine only probability density, rather than mass, and so can neglect potentially large contributions to the integral. A more principled approach is to estimate the integral numerically by evaluating the integrand at many different $\theta$ via Monte Carlo methods. In the limit of an infinite number of samples of $\theta$ this produces an accurate result, but despite ingenious attempts to curb the curse of dimensionality in $\theta$ using methods such as Markov chain Monte Carlo, these methods remain prohibitively computationally intensive for interesting models. These methods were reviewed in the last chapter, and the bulk of this chapter concentrates on a third way of approximating the integral, using variational methods. The key to the variational method is to approximate the integral with a simpler form that is tractable, forming a lower or upper bound. The integration then translates into the implementationally simpler problem of bound optimisation: making the bound as tight as possible to the true value.

We begin in section 2.2 by describing how variational methods can be used to derive the well-known expectation-maximisation (EM) algorithm for learning the maximum likelihood (ML) parameters of a model. In section 2.3 we concentrate on the Bayesian methodology, in which priors are placed on the parameters of the model, and their uncertainty integrated over to give the marginal likelihood (2.2). We then generalise the variational procedure to yield the variational Bayesian EM (VBEM) algorithm, which iteratively optimises a lower bound on this marginal likelihood.
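As a concrete illustration of the density-versus-mass distinction above, the short sketch below (an added example, not from the thesis; the conjugate Beta-Bernoulli model is chosen only because its exact marginal likelihood is available for comparison) estimates $p(y \mid m)$ by simple Monte Carlo over prior samples of $\theta$ and contrasts it with the likelihood evaluated at $\theta_{\mathrm{ML}}$.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

# Toy data: n coin flips with k heads (Bernoulli likelihood, Beta(1,1) prior).
n, k = 20, 14

# Point estimate: likelihood evaluated at the ML parameter (the "density" view).
theta_ml = k / n
lik_at_ml = theta_ml**k * (1 - theta_ml)**(n - k)

# Simple Monte Carlo estimate of p(y|m) = E_{p(theta)}[p(y|theta)] (the "mass" view).
thetas = rng.uniform(0.0, 1.0, size=100_000)       # samples from the Beta(1,1) prior
marg_lik_mc = np.mean(thetas**k * (1 - thetas)**(n - k))

# Exact marginal likelihood for this conjugate model: the Beta(k+1, n-k+1) normaliser.
marg_lik_exact = 1.0 / ((n + 1) * comb(n, k))

print(f"likelihood at theta_ML : {lik_at_ml:.3e}")
print(f"marginal likelihood MC : {marg_lik_mc:.3e}")
print(f"marginal likelihood    : {marg_lik_exact:.3e}")
```

The likelihood at the ML point is typically much larger than the marginal likelihood, since the latter averages over parameter values that explain the data poorly.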

In analogy to the EM algorithm, the iterations consist of a variational Bayesian E (VBE) step in which the hidden variables are inferred using an ensemble of models according to their posterior probability, and a variational Bayesian M (VBM) step in which a posterior distribution over model parameters is inferred. In section 2.4 we specialise this algorithm to a large class of models which we call conjugate-exponential (CE): we present the variational Bayesian EM algorithm for CE models and discuss the implications for both directed graphs (Bayesian networks) and undirected graphs (Markov networks) in section 2.5. In particular we show that we can incorporate existing propagation algorithms into the variational Bayesian framework and that the complexity of inference for the variational Bayesian treatment is approximately the same as for the ML scenario. In section 2.6 we compare VB to the BIC and Cheeseman-Stutz criteria, and finally summarise in section 2.7.

2.2 Variational methods for ML / MAP learning

In this section we review the derivation of the EM algorithm for probabilistic models with hidden variables. The algorithm is derived using a variational approach, and has exact and approximate versions. We investigate themes of convexity, computational tractability, and the Kullback-Leibler divergence to give a deeper understanding of the EM algorithm. The majority of the section concentrates on maximum likelihood (ML) learning of the parameters; at the end we present the simple extension to maximum a posteriori (MAP) learning. The hope is that this section provides a good stepping-stone on to the variational Bayesian EM algorithm that is presented in the subsequent sections and used throughout the rest of this thesis.

2.2.1 The scenario for parameter learning

Consider a model with hidden variables $x$ and observed variables $y$. The parameters describing the (potentially) stochastic dependencies between variables are given by $\theta$. In particular consider the generative model that produces a dataset $y = \{y_1, \ldots, y_n\}$ consisting of $n$ independent and identically distributed (i.i.d.) items, generated using a set of hidden variables $x = \{x_1, \ldots, x_n\}$, such that the likelihood can be written as a function of $\theta$ in the following way:

$$p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) = \prod_{i=1}^{n} \int dx_i \; p(x_i, y_i \mid \theta) . \qquad (2.9)$$

The integration over hidden variables $x$ is required to form the likelihood of the parameters, as a function of just the observed data $y$. We have assumed that the hidden variables are continuous as opposed to discrete (hence an integral rather than a summation), but we do so without loss of generality. As a point of nomenclature, note that we use $x_i$ and $y_i$ to denote collections of $|x_i|$ hidden and $|y_i|$ observed variables respectively: $x_i = \{x_{i1}, \ldots, x_{i|x_i|}\}$ and $y_i = \{y_{i1}, \ldots, y_{i|y_i|}\}$. We use the $|\cdot|$ notation to denote the size of a collection of variables.

ML learning seeks to find the parameter setting $\theta_{\mathrm{ML}}$ that maximises this likelihood, or equivalently the logarithm of this likelihood,

$$\mathcal{L}(\theta) \equiv \ln p(y \mid \theta) = \sum_{i=1}^{n} \ln p(y_i \mid \theta) = \sum_{i=1}^{n} \ln \int dx_i \; p(x_i, y_i \mid \theta) , \qquad (2.10)$$

so defining

$$\theta_{\mathrm{ML}} \equiv \arg\max_\theta \; \mathcal{L}(\theta) . \qquad (2.11)$$

To keep the derivations clear, we write $\mathcal{L}$ as a function of $\theta$ only; the dependence on $y$ is implicit. In Bayesian networks without hidden variables and with independent parameters, the log-likelihood decomposes into local terms on each $y_{ij}$, and so finding the setting of each parameter of the model that maximises the likelihood is straightforward. Unfortunately, if some of the variables are hidden this will in general induce dependencies between all the parameters of the model and so make maximising (2.10) difficult. Moreover, for models with many hidden variables, the integral (or sum) over $x$ can be intractable.

We simplify the problem of maximising $\mathcal{L}(\theta)$ with respect to $\theta$ by introducing an auxiliary distribution over the hidden variables. Any probability distribution $q_x(x)$ over the hidden variables gives rise to a lower bound on $\mathcal{L}$. In fact, for each data point $y_i$ we use a distinct distribution $q_{x_i}(x_i)$ over the hidden variables to obtain the lower bound:

$$\mathcal{L}(\theta) = \sum_i \ln \int dx_i \; p(x_i, y_i \mid \theta) \qquad (2.12)$$
$$= \sum_i \ln \int dx_i \; q_{x_i}(x_i) \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)} \qquad (2.13)$$
$$\geq \sum_i \int dx_i \; q_{x_i}(x_i) \ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)} \qquad (2.14)$$
$$= \sum_i \int dx_i \; q_{x_i}(x_i) \ln p(x_i, y_i \mid \theta) - \sum_i \int dx_i \; q_{x_i}(x_i) \ln q_{x_i}(x_i) \qquad (2.15)$$
$$\equiv \mathcal{F}(q_{x_1}(x_1), \ldots, q_{x_n}(x_n), \theta) , \qquad (2.16)$$

where we have made use of Jensen's inequality (Jensen, 1906), which follows from the fact that the log function is concave. $\mathcal{F}(q_x(x), \theta)$ is a lower bound on $\mathcal{L}(\theta)$ and is a functional of the free distributions $q_{x_i}(x_i)$ and of $\theta$ (the dependence on $y$ is left implicit). Here we use $q_x(x)$ to mean the set $\{q_{x_i}(x_i)\}_{i=1}^{n}$. Defining the energy of a global configuration $(x, y)$ to be $-\ln p(x, y \mid \theta)$, the lower bound $\mathcal{F}(q_x(x), \theta) \leq \mathcal{L}(\theta)$ is the negative of a quantity known in statistical physics as the free energy: the expected energy under $q_x(x)$ minus the entropy of $q_x(x)$ (Feynman, 1972; Neal and Hinton, 1998).
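To make the bound (2.14) concrete, the sketch below (an illustration added here, not part of the original text) evaluates $\mathcal{F}$ and $\mathcal{L}$ numerically for a two-component Gaussian mixture with a discrete hidden variable $x_i \in \{0, 1\}$, once with an arbitrary (deliberately poor) choice of $q_{x_i}$ and once with the exact posterior; the gap $\mathcal{L} - \mathcal{F}$ is the sum of KL divergences discussed in section 2.2.3.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Two-component Gaussian mixture: p(x_i=k) = pi[k], p(y_i | x_i=k) = N(mu[k], 1).
pi, mu = np.array([0.3, 0.7]), np.array([-2.0, 2.0])
y = np.concatenate([rng.normal(mu[0], 1, 30), rng.normal(mu[1], 1, 70)])

# Joint p(x_i=k, y_i | theta) for every datum i and component k: shape (n, 2).
joint = pi * norm.pdf(y[:, None], loc=mu, scale=1.0)

# Exact log likelihood L(theta) = sum_i ln sum_k p(x_i=k, y_i | theta).
L = np.sum(np.log(joint.sum(axis=1)))

def free_energy(q):
    """F(q, theta) = sum_i sum_k q_i(k) [ln p(x_i=k, y_i | theta) - ln q_i(k)]."""
    return np.sum(q * (np.log(joint) - np.log(q)))

# An arbitrary (suboptimal) responsibility assignment: the same q_i for every datum.
q_bad = np.tile([0.5, 0.5], (len(y), 1))
# The optimal choice: the exact posterior p(x_i | y_i, theta), which makes the bound tight.
q_opt = joint / joint.sum(axis=1, keepdims=True)

print(f"L(theta)              = {L:.3f}")
print(f"F(q_bad, theta)       = {free_energy(q_bad):.3f}   (strictly below L)")
print(f"F(q_posterior, theta) = {free_energy(q_opt):.3f}   (equals L)")
```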

2.2.2 EM for unconstrained (exact) optimisation

The Expectation-Maximisation (EM) algorithm (Baum et al., 1970; Dempster et al., 1977) alternates between an E step, which infers posterior distributions over hidden variables given a current parameter setting, and an M step, which maximises $\mathcal{L}(\theta)$ with respect to $\theta$ given the statistics gathered from the E step. Such a set of updates can be derived using the lower bound: at each iteration, the E step maximises $\mathcal{F}(q_x(x), \theta)$ with respect to each of the $q_{x_i}(x_i)$, and the M step does so with respect to $\theta$. Mathematically speaking, using a superscript $(t)$ to denote iteration number, and starting from some initial parameters $\theta^{(0)}$, the update equations are:

E step: $\quad q_{x_i}^{(t+1)} \leftarrow \arg\max_{q_{x_i}} \mathcal{F}(q_x(x), \theta^{(t)}) , \quad \forall i \in \{1, \ldots, n\} , \qquad (2.17)$

M step: $\quad \theta^{(t+1)} \leftarrow \arg\max_{\theta} \mathcal{F}(q_x^{(t+1)}(x), \theta) . \qquad (2.18)$

For the E step, it turns out that the maximum over $q_{x_i}(x_i)$ of the bound (2.14) is obtained by setting

$$q_{x_i}^{(t+1)}(x_i) = p(x_i \mid y_i, \theta^{(t)}) , \quad \forall i , \qquad (2.19)$$

at which point the bound becomes an equality. This can be proven by direct substitution of (2.19) into (2.14):

$$\mathcal{F}(q_x^{(t+1)}(x), \theta^{(t)}) = \sum_i \int dx_i \; q_{x_i}^{(t+1)}(x_i) \ln \frac{p(x_i, y_i \mid \theta^{(t)})}{q_{x_i}^{(t+1)}(x_i)} \qquad (2.20)$$
$$= \sum_i \int dx_i \; p(x_i \mid y_i, \theta^{(t)}) \ln \frac{p(x_i, y_i \mid \theta^{(t)})}{p(x_i \mid y_i, \theta^{(t)})} \qquad (2.21)$$
$$= \sum_i \int dx_i \; p(x_i \mid y_i, \theta^{(t)}) \ln \frac{p(x_i \mid y_i, \theta^{(t)})\, p(y_i \mid \theta^{(t)})}{p(x_i \mid y_i, \theta^{(t)})} \qquad (2.22)$$
$$= \sum_i \int dx_i \; p(x_i \mid y_i, \theta^{(t)}) \ln p(y_i \mid \theta^{(t)}) \qquad (2.23)$$
$$= \sum_i \ln p(y_i \mid \theta^{(t)}) = \mathcal{L}(\theta^{(t)}) , \qquad (2.24)$$

where the last line follows as $\ln p(y_i \mid \theta)$ is not a function of $x_i$. After this E step the bound is tight. The same result can be obtained by functionally differentiating $\mathcal{F}(q_x(x), \theta)$ with respect to $q_{x_i}(x_i)$, and setting to zero, subject to the normalisation constraints:

$$\int dx_i \; q_{x_i}(x_i) = 1 , \quad \forall i . \qquad (2.25)$$
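The following sketch (added for illustration; not from the original text) runs the exact E and M steps of (2.17) and (2.18) for a two-component Gaussian mixture with known unit variances: the E step computes the exact posterior responsibilities (2.19), the M step re-estimates the mixing weights and means, and the printed log likelihood never decreases.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Data from a two-component mixture of unit-variance Gaussians.
y = np.concatenate([rng.normal(-2.0, 1, 40), rng.normal(3.0, 1, 60)])

# Initial parameter guesses theta^{(0)} = (pi, mu).
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])

for t in range(10):
    # E step (2.19): q_i(k) = p(x_i = k | y_i, theta), i.e. the responsibilities.
    joint = pi * norm.pdf(y[:, None], loc=mu, scale=1.0)        # p(x_i=k, y_i | theta)
    resp = joint / joint.sum(axis=1, keepdims=True)

    # Log likelihood L(theta), guaranteed not to decrease over combined EM steps.
    print(f"iter {t}: L = {np.sum(np.log(joint.sum(axis=1))):.3f}")

    # M step (2.18): maximise the expected complete-data log likelihood.
    Nk = resp.sum(axis=0)
    pi = Nk / len(y)
    mu = (resp * y[:, None]).sum(axis=0) / Nk
```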

The constraints on each $q_{x_i}(x_i)$ can be implemented using Lagrange multipliers $\{\lambda_i\}_{i=1}^{n}$, forming the new functional:

$$\tilde{\mathcal{F}}(q_x(x), \theta) = \mathcal{F}(q_x(x), \theta) + \sum_i \lambda_i \left[ \int dx_i \; q_{x_i}(x_i) - 1 \right] . \qquad (2.26)$$

We then take the functional derivative of this expression with respect to each $q_{x_i}(x_i)$ and equate to zero, obtaining the following:

$$\frac{\partial}{\partial q_{x_i}(x_i)} \tilde{\mathcal{F}}(q_x(x), \theta^{(t)}) = \ln p(x_i, y_i \mid \theta^{(t)}) - \ln q_{x_i}(x_i) - 1 + \lambda_i = 0 \qquad (2.27)$$
$$\Longrightarrow \quad q_{x_i}^{(t+1)}(x_i) = \exp(-1 + \lambda_i)\, p(x_i, y_i \mid \theta^{(t)}) \qquad (2.28)$$
$$= p(x_i \mid y_i, \theta^{(t)}) , \quad \forall i , \qquad (2.29)$$

where each $\lambda_i$ is related to the normalisation constant:

$$\lambda_i = 1 - \ln \int dx_i \; p(x_i, y_i \mid \theta^{(t)}) , \quad \forall i . \qquad (2.30)$$

In the remaining derivations in this thesis we always enforce normalisation constraints using Lagrange multiplier terms, although they may not always be explicitly written. The M step is achieved by simply setting derivatives of (2.14) with respect to $\theta$ to zero, which is the same as optimising the expected energy term in (2.15), since the entropy of the hidden state distribution $q_x(x)$ is not a function of $\theta$:

M step: $\quad \theta^{(t+1)} \leftarrow \arg\max_\theta \sum_i \int dx_i \; p(x_i \mid y_i, \theta^{(t)}) \ln p(x_i, y_i \mid \theta) . \qquad (2.31)$

Note that the optimisation is over the second $\theta$ in the integrand, whilst holding $p(x_i \mid y_i, \theta^{(t)})$ fixed. Since $\mathcal{F}(q_x^{(t+1)}(x), \theta^{(t)}) = \mathcal{L}(\theta^{(t)})$ at the beginning of each M step, and since the E step does not change the parameters, the likelihood is guaranteed not to decrease after each combined EM step. This is the well-known lower bound interpretation of EM: $\mathcal{F}(q_x(x), \theta)$ is an auxiliary function which lower bounds $\mathcal{L}(\theta)$ for any $q_x(x)$, attaining equality after each E step. These steps are shown schematically in figure 2.1. Here we have expressed the E step as obtaining the full distribution over the hidden variables for each data point. However we note that, in general, the M step may require only a few statistics of the hidden variables, so only these need be computed in the E step.

2.2.3 EM with constrained (approximate) optimisation

Unfortunately, in many interesting models the data are explained by multiple interacting hidden variables, which can result in intractable posterior distributions (Williams and Hinton, 1991; Neal, 1992; Hinton and Zemel, 1994; Ghahramani and Jordan, 1997; Ghahramani and Hinton, 2000).

Figure 2.1: The variational interpretation of EM for maximum likelihood learning. In the E step the hidden variable variational posterior is set to the exact posterior $p(x \mid y, \theta^{(t)})$, making the bound tight. In the M step the parameters are set to maximise the lower bound $\mathcal{F}(q_x^{(t+1)}, \theta)$ while holding the distribution over hidden variables $q_x^{(t+1)}(x)$ fixed.

In the variational approach we can constrain the posterior distributions to be of a particular tractable form, for example factorised over the variables $x_i = \{x_{ij}\}_{j=1}^{|x_i|}$. Using calculus of variations we can still optimise $\mathcal{F}(q_x(x), \theta)$ as a functional of constrained distributions $q_{x_i}(x_i)$. The M step, which optimises $\theta$, is conceptually identical to that described in the previous subsection, except that it is based on sufficient statistics calculated with respect to the constrained posterior $q_{x_i}(x_i)$ instead of the exact posterior.

We can write the lower bound $\mathcal{F}(q_x(x), \theta)$ as

$$\mathcal{F}(q_x(x), \theta) = \sum_i \int dx_i \; q_{x_i}(x_i) \ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)} \qquad (2.32)$$
$$= \sum_i \int dx_i \; q_{x_i}(x_i) \ln p(y_i \mid \theta) + \sum_i \int dx_i \; q_{x_i}(x_i) \ln \frac{p(x_i \mid y_i, \theta)}{q_{x_i}(x_i)} \qquad (2.33)$$
$$= \sum_i \ln p(y_i \mid \theta) - \sum_i \int dx_i \; q_{x_i}(x_i) \ln \frac{q_{x_i}(x_i)}{p(x_i \mid y_i, \theta)} . \qquad (2.34)$$

Thus in the E step, maximising $\mathcal{F}(q_x(x), \theta)$ with respect to $q_{x_i}(x_i)$ is equivalent to minimising the following quantity:

$$\int dx_i \; q_{x_i}(x_i) \ln \frac{q_{x_i}(x_i)}{p(x_i \mid y_i, \theta)} \;\equiv\; \mathrm{KL}\left[ q_{x_i}(x_i) \,\|\, p(x_i \mid y_i, \theta) \right] \qquad (2.35)$$
$$\geq 0 , \qquad (2.36)$$

which is the Kullback-Leibler divergence between the variational distribution $q_{x_i}(x_i)$ and the exact hidden variable posterior $p(x_i \mid y_i, \theta)$. As is shown in figure 2.2, the E step does not generally result in the bound becoming an equality, unless of course the exact posterior lies in the family of constrained posteriors $q_x(x)$.

Figure 2.2: The variational interpretation of constrained EM for maximum likelihood learning. In the E step the hidden variable variational posterior is set to that which minimises $\mathrm{KL}\left[ q_x(x) \,\|\, p(x \mid y, \theta^{(t)}) \right]$, subject to $q_x(x)$ lying in the family of constrained distributions. In the M step the parameters are set to maximise the lower bound $\mathcal{F}(q_x^{(t+1)}, \theta)$ given the current distribution over hidden variables.

The M step looks very similar to (2.31), but is based on the current variational posterior over hidden variables:

M step: $\quad \theta^{(t+1)} \leftarrow \arg\max_\theta \sum_i \int dx_i \; q_{x_i}^{(t+1)}(x_i) \ln p(x_i, y_i \mid \theta) . \qquad (2.37)$

One can choose $q_{x_i}(x_i)$ to be in a particular parameterised family:

$$q_{x_i}(x_i) = q_{x_i}(x_i \mid \lambda_i) , \qquad (2.38)$$

where $\lambda_i = \{\lambda_{i1}, \ldots, \lambda_{ir}\}$ are $r$ variational parameters for each datum. If we constrain each $q_{x_i}(x_i \mid \lambda_i)$ to have easily computable moments (e.g. a Gaussian), and especially if $\ln p(x_i \mid y_i, \theta)$ is polynomial in $x_i$, then we can compute the KL divergence up to a constant and, more importantly, can take its derivatives with respect to the set of variational parameters $\lambda_i$ of each $q_{x_i}(x_i)$ distribution to perform the constrained E step. The E step of the variational EM algorithm therefore consists of a sub-loop in which each of the $q_{x_i}(x_i \mid \lambda_i)$ is optimised by taking derivatives with respect to each $\lambda_{is}$, for $s = 1, \ldots, r$.
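As a minimal instance of (2.38) (an illustration added here; the target density is an arbitrary choice, not drawn from the thesis), the sketch below takes a log-density that is polynomial in $x$ and a Gaussian variational family $q(x \mid \lambda)$ with $\lambda = (m, \rho)$ and variance $v = e^{2\rho}$. Because Gaussian moments are available in closed form, the KL divergence is computable up to a constant and its derivatives with respect to $\lambda$ drive a simple gradient-descent version of the constrained E step.

```python
import numpy as np

# Target posterior (known only up to a constant): ln p(x) = -x**4/4 - x**2/2 + const,
# i.e. a polynomial log-density, as assumed in the text.
# Variational family: q(x | lambda) = N(m, v) with lambda = (m, rho), v = exp(2*rho).

def kl_up_to_const(m, v):
    """E_q[ln q - ln p] + const, using closed-form Gaussian moments
    E_q[x^2] = m^2 + v and E_q[x^4] = m^4 + 6 m^2 v + 3 v^2."""
    return (-0.5 * np.log(v)
            + (m**4 + 6 * m**2 * v + 3 * v**2) / 4.0
            + (m**2 + v) / 2.0)

def grads(m, rho):
    """Analytic derivatives of the objective w.r.t. the variational parameters."""
    v = np.exp(2 * rho)
    dJ_dm = m**3 + 3 * m * v + m
    dJ_dv = -0.5 / v + 1.5 * (m**2 + v) + 0.5
    return dJ_dm, dJ_dv * 2 * v              # chain rule for rho

m, rho, lr = 1.0, 0.0, 0.05                  # initial variational parameters, step size
for step in range(500):
    gm, grho = grads(m, rho)
    m, rho = m - lr * gm, rho - lr * grho

v = np.exp(2 * rho)
print(f"fitted q: mean = {m:.3f}, variance = {v:.3f}, KL+const = {kl_up_to_const(m, v):.3f}")
```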

The mean field approximation

The mean field approximation is the case in which each $q_{x_i}(x_i)$ is fully factorised over the hidden variables:

$$q_{x_i}(x_i) = \prod_{j=1}^{|x_i|} q_{x_{ij}}(x_{ij}) . \qquad (2.39)$$

In this case the expression for $\mathcal{F}(q_x(x), \theta)$ given by (2.32) becomes:

$$\mathcal{F}(q_x(x), \theta) = \sum_i \int dx_i \prod_{j=1}^{|x_i|} q_{x_{ij}}(x_{ij}) \Big[ \ln p(x_i, y_i \mid \theta) - \sum_{j=1}^{|x_i|} \ln q_{x_{ij}}(x_{ij}) \Big] \qquad (2.40)$$
$$= \sum_i \Big[ \int dx_i \prod_{j=1}^{|x_i|} q_{x_{ij}}(x_{ij}) \ln p(x_i, y_i \mid \theta) - \sum_{j=1}^{|x_i|} \int dx_{ij}\, q_{x_{ij}}(x_{ij}) \ln q_{x_{ij}}(x_{ij}) \Big] . \qquad (2.41)$$

Using a Lagrange multiplier to enforce normalisation of each of the approximate posteriors, we take the functional derivative of this form with respect to each $q_{x_{ij}}(x_{ij})$ and equate to zero, obtaining:

$$q_{x_{ij}}(x_{ij}) = \frac{1}{\mathcal{Z}_{ij}} \exp \left[ \int dx_{i/j} \; \prod_{j'/j} q_{x_{ij'}}(x_{ij'}) \ln p(x_i, y_i \mid \theta) \right] , \qquad (2.42)$$

for each data point $i \in \{1, \ldots, n\}$, and each variational factorised component $j \in \{1, \ldots, |x_i|\}$. We use the notation $dx_{i/j}$ to denote the element of integration for all items in $x_i$ except $x_{ij}$, and the notation $\prod_{j'/j}$ to denote a product of all terms excluding $j$. For the $i$th datum, it is clear that the update equation (2.42) applied to each hidden variable $j$ in turn represents a set of coupled equations for the approximate posterior over each hidden variable. These fixed point equations are called mean-field equations by analogy to such methods in statistical physics. Examples of these variational approximations can be found in the following: Ghahramani (1995); Saul et al. (1996); Jaakkola (1997); Ghahramani and Jordan (1997).

EM for maximum a posteriori learning

In MAP learning the parameter optimisation includes prior information about the parameters $p(\theta)$, and the M step seeks to find

$$\theta_{\mathrm{MAP}} \equiv \arg\max_\theta \; p(\theta)\, p(y \mid \theta) . \qquad (2.43)$$
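A standard textbook illustration of the coupled mean-field equations (2.42) (an added example, not from the thesis) is a factorised approximation $q(x_1)\, q(x_2)$ to a correlated bivariate Gaussian posterior: each factor turns out to be Gaussian with a mean that depends on the current mean of the other factor, so the updates below are iterated to a fixed point.

```python
import numpy as np

# Target posterior: a correlated bivariate Gaussian with mean mu and precision Lam.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])          # precision (inverse covariance) matrix

# Mean-field approximation q(x1) q(x2): each factor is Gaussian with fixed
# variance 1/Lam[j, j] and a mean m[j] coupled to the other factor's mean.
m = np.array([0.0, 0.0])              # initial guesses for the factor means

for sweep in range(20):
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])   # update q(x1)
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])   # update q(x2)

print("mean-field means    :", m)                  # converge to the true mean
print("mean-field variances:", 1 / np.diag(Lam))   # under-estimate the true marginals
print("true marginal vars  :", np.diag(np.linalg.inv(Lam)))
```

The fixed point recovers the true posterior mean but under-estimates the marginal variances, a well-known property of fully factorised approximations.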

In the case of an exact E step, the M step is simply augmented to:

M step: $\quad \theta^{(t+1)} \leftarrow \arg\max_\theta \left[ \ln p(\theta) + \sum_i \int dx_i \; p(x_i \mid y_i, \theta^{(t)}) \ln p(x_i, y_i \mid \theta) \right] . \qquad (2.44)$

In the case of a constrained approximate E step, the M step is given by

M step: $\quad \theta^{(t+1)} \leftarrow \arg\max_\theta \left[ \ln p(\theta) + \sum_i \int dx_i \; q_{x_i}^{(t+1)}(x_i) \ln p(x_i, y_i \mid \theta) \right] . \qquad (2.45)$

However, as mentioned in section 1.3.1, we reiterate that an undesirable feature of MAP estimation is that it is inherently basis-dependent: it is always possible to find a basis in which any particular $\theta'$ is the MAP solution, provided $\theta'$ has non-zero prior probability.

2.3 Variational methods for Bayesian learning

In this section we show how to extend the above treatment to use variational methods to approximate the integrals required for Bayesian learning. By treating the parameters as unknown quantities as well as the hidden variables, there are now correlations between the parameters and hidden variables in the posterior. The basic idea in the VB framework is to approximate the distribution over both hidden variables and parameters with a simpler distribution, usually one which assumes that the hidden states and parameters are independent given the data. There are two main goals in Bayesian learning. The first is approximating the marginal likelihood $p(y \mid m)$ in order to perform model comparison. The second is approximating the posterior distribution over the parameters of a model, $p(\theta \mid y, m)$, which can then be used for prediction.

2.3.1 Deriving the learning rules

As before, let $y$ denote the observed variables, $x$ denote the hidden variables, and $\theta$ denote the parameters. We assume a prior distribution over parameters $p(\theta \mid m)$, conditional on the model $m$. The marginal likelihood of a model, $p(y \mid m)$, can be lower bounded by introducing any distribution over both latent variables and parameters which has support where $p(x, \theta \mid y, m)$ does, by appealing to Jensen's inequality once more:

$$\ln p(y \mid m) = \ln \int d\theta \, dx \; p(x, y, \theta \mid m) \qquad (2.46)$$
$$= \ln \int d\theta \, dx \; q(x, \theta) \frac{p(x, y, \theta \mid m)}{q(x, \theta)} \qquad (2.47)$$
$$\geq \int d\theta \, dx \; q(x, \theta) \ln \frac{p(x, y, \theta \mid m)}{q(x, \theta)} . \qquad (2.48)$$

Maximising this lower bound with respect to the free distribution $q(x, \theta)$ results in $q(x, \theta) = p(x, \theta \mid y, m)$, which when substituted above turns the inequality into an equality (in exact analogy with (2.19)). This does not simplify the problem, since evaluating the exact posterior distribution $p(x, \theta \mid y, m)$ requires knowing its normalising constant, the marginal likelihood. Instead we constrain the posterior to be a simpler, factorised (separable) approximation $q(x, \theta) \approx q_x(x)\, q_\theta(\theta)$:

$$\ln p(y \mid m) \geq \int d\theta \, dx \; q_x(x)\, q_\theta(\theta) \ln \frac{p(x, y, \theta \mid m)}{q_x(x)\, q_\theta(\theta)} \qquad (2.49)$$
$$= \int d\theta \; q_\theta(\theta) \left[ \int dx \; q_x(x) \ln \frac{p(x, y \mid \theta, m)}{q_x(x)} + \ln \frac{p(\theta \mid m)}{q_\theta(\theta)} \right] \qquad (2.50)$$
$$= \mathcal{F}_m(q_x(x), q_\theta(\theta)) \qquad (2.51)$$
$$= \mathcal{F}_m(q_{x_1}(x_1), \ldots, q_{x_n}(x_n), q_\theta(\theta)) , \qquad (2.52)$$

where the last equality is a consequence of the data $y$ arriving i.i.d. (this is shown in theorem 2.1 below). The quantity $\mathcal{F}_m$ is a functional of the free distributions $q_x(x)$ and $q_\theta(\theta)$.

The variational Bayesian algorithm iteratively maximises $\mathcal{F}_m$ in (2.51) with respect to the free distributions $q_x(x)$ and $q_\theta(\theta)$, which is essentially coordinate ascent in the function space of variational distributions. The following very general theorem provides the update equations for variational Bayesian learning.

Theorem 2.1: Variational Bayesian EM (VBEM). Let $m$ be a model with parameters $\theta$ giving rise to an i.i.d. data set $y = \{y_1, \ldots, y_n\}$ with corresponding hidden variables $x = \{x_1, \ldots, x_n\}$. A lower bound on the model log marginal likelihood is

$$\mathcal{F}_m(q_x(x), q_\theta(\theta)) = \int d\theta \, dx \; q_x(x)\, q_\theta(\theta) \ln \frac{p(x, y, \theta \mid m)}{q_x(x)\, q_\theta(\theta)} , \qquad (2.53)$$

and this can be iteratively optimised by performing the following updates, using superscript $(t)$ to denote iteration number:

VBE step:
$$q_{x_i}^{(t+1)}(x_i) = \frac{1}{\mathcal{Z}_{x_i}} \exp \left[ \int d\theta \; q_\theta^{(t)}(\theta) \ln p(x_i, y_i \mid \theta, m) \right] , \qquad (2.54)$$

where

$$q_x^{(t+1)}(x) = \prod_{i=1}^{n} q_{x_i}^{(t+1)}(x_i) , \qquad (2.55)$$

and

VBM step:
$$q_\theta^{(t+1)}(\theta) = \frac{1}{\mathcal{Z}_\theta}\, p(\theta \mid m) \exp \left[ \int dx \; q_x^{(t+1)}(x) \ln p(x, y \mid \theta, m) \right] . \qquad (2.56)$$

Moreover, the update rules converge to a local maximum of $\mathcal{F}_m(q_x(x), q_\theta(\theta))$.

Proof of $q_{x_i}(x_i)$ update: using variational calculus. Take functional derivatives of $\mathcal{F}_m(q_x(x), q_\theta(\theta))$ with respect to $q_x(x)$, and equate to zero:

$$\frac{\partial}{\partial q_x(x)} \mathcal{F}_m(q_x(x), q_\theta(\theta)) = \frac{\partial}{\partial q_x(x)} \left[ \int d\theta \; q_\theta(\theta) \int dx \; q_x(x) \ln \frac{p(x, y \mid \theta, m)}{q_x(x)} \right] \qquad (2.57)$$
$$= \int d\theta \; q_\theta(\theta) \left[ \ln p(x, y \mid \theta, m) - \ln q_x(x) - 1 \right] \qquad (2.58)$$
$$= 0 , \qquad (2.59)$$

which implies

$$\ln q_x^{(t+1)}(x) = \int d\theta \; q_\theta^{(t)}(\theta) \ln p(x, y \mid \theta, m) - \ln \mathcal{Z}_x^{(t+1)} , \qquad (2.60)$$

where $\mathcal{Z}_x$ is a normalisation constant (from a Lagrange multiplier term enforcing normalisation of $q_x(x)$, omitted for brevity). As a consequence of the i.i.d. assumption, this update can be broken down across the $n$ data points:

$$\ln q_x^{(t+1)}(x) = \int d\theta \; q_\theta^{(t)}(\theta) \sum_{i=1}^{n} \ln p(x_i, y_i \mid \theta, m) - \ln \mathcal{Z}_x^{(t+1)} , \qquad (2.61)$$

which implies that the optimal $q_x^{(t+1)}(x)$ is factorised in the form $q_x^{(t+1)}(x) = \prod_{i=1}^{n} q_{x_i}^{(t+1)}(x_i)$, with

$$\ln q_{x_i}^{(t+1)}(x_i) = \int d\theta \; q_\theta^{(t)}(\theta) \ln p(x_i, y_i \mid \theta, m) - \ln \mathcal{Z}_{x_i}^{(t+1)} , \qquad (2.62)$$

with

$$\mathcal{Z}_x = \prod_{i=1}^{n} \mathcal{Z}_{x_i} . \qquad (2.63)$$

Thus for a given $q_\theta(\theta)$, there is a unique stationary point for each $q_{x_i}(x_i)$.

Proof of $q_\theta(\theta)$ update: using variational calculus.

Figure 2.3: The variational Bayesian EM (VBEM) algorithm. In the VBE step, the variational posterior over hidden variables $q_x(x)$ is set according to (2.60). In the VBM step, the variational posterior over parameters is set according to (2.56). Each step is guaranteed to increase (or leave unchanged) the lower bound on the marginal likelihood. (Note that the exact log marginal likelihood is a fixed quantity, and does not change with VBE or VBM steps; it is only the lower bound which increases.)

Proceeding as above, take functional derivatives of $\mathcal{F}_m(q_x(x), q_\theta(\theta))$ with respect to $q_\theta(\theta)$ and equate to zero, yielding:

$$\frac{\partial}{\partial q_\theta(\theta)} \mathcal{F}_m(q_x(x), q_\theta(\theta)) = \frac{\partial}{\partial q_\theta(\theta)} \left[ \int d\theta \; q_\theta(\theta) \int dx \; q_x(x) \ln p(x, y \mid \theta, m) \right. \qquad (2.64)$$
$$\left. \qquad + \int d\theta \; q_\theta(\theta) \ln \frac{p(\theta \mid m)}{q_\theta(\theta)} \right] \qquad (2.65)$$
$$= \int dx \; q_x(x) \ln p(x, y \mid \theta) + \ln p(\theta \mid m) - \ln q_\theta(\theta) + c \qquad (2.66)$$
$$= 0 , \qquad (2.67)$$

which upon rearrangement produces

$$\ln q_\theta^{(t+1)}(\theta) = \ln p(\theta \mid m) + \int dx \; q_x^{(t+1)}(x) \ln p(x, y \mid \theta) - \ln \mathcal{Z}_\theta^{(t+1)} , \qquad (2.68)$$

where $\mathcal{Z}_\theta$ is the normalisation constant (related to the Lagrange multiplier which has again been omitted for succinctness). Thus for a given $q_x(x)$, there is a unique stationary point for $q_\theta(\theta)$.

At this point it is well worth noting the symmetry between the hidden variables and the parameters. The individual VBE steps can be written as one batch VBE step:

$$q_x^{(t+1)}(x) = \frac{1}{\mathcal{Z}_x} \exp \left[ \int d\theta \; q_\theta^{(t)}(\theta) \ln p(x, y \mid \theta, m) \right] , \qquad (2.69)$$

with

$$\mathcal{Z}_x = \prod_{i=1}^{n} \mathcal{Z}_{x_i} . \qquad (2.70)$$

On the surface, it seems that the variational update rules (2.60) and (2.56) differ only in the prior term $p(\theta \mid m)$ over the parameters. There actually also exists a prior term over the hidden variables as part of $p(x, y \mid \theta, m)$, so this does not distinguish the two. The distinguishing feature between hidden variables and parameters is that the number of hidden variables increases with data set size, whereas the number of parameters is assumed fixed.

Re-writing (2.53), it is easy to see that maximising $\mathcal{F}_m(q_x(x), q_\theta(\theta))$ is simply equivalent to minimising the KL divergence between $q_x(x)\, q_\theta(\theta)$ and the joint posterior over hidden states and parameters $p(x, \theta \mid y, m)$:

$$\ln p(y \mid m) - \mathcal{F}_m(q_x(x), q_\theta(\theta)) = \int d\theta \, dx \; q_x(x)\, q_\theta(\theta) \ln \frac{q_x(x)\, q_\theta(\theta)}{p(x, \theta \mid y, m)} \qquad (2.71)$$
$$= \mathrm{KL}\left[ q_x(x)\, q_\theta(\theta) \,\|\, p(x, \theta \mid y, m) \right] \qquad (2.72)$$
$$\geq 0 . \qquad (2.73)$$

Note the similarity between expressions (2.35) and (2.72): while we minimise the former with respect to hidden variable distributions and the parameters, the latter we minimise with respect to the hidden variable distribution and a distribution over parameters.

The variational Bayesian EM algorithm reduces to the ordinary EM algorithm for ML estimation if we restrict the parameter distribution to a point estimate, i.e. a Dirac delta function, $q_\theta(\theta) = \delta(\theta - \theta^*)$, in which case the M step simply involves re-estimating $\theta^*$. Note that the same cannot be said in the case of MAP estimation, which is inherently basis dependent, unlike both VB and ML algorithms. By construction, the VBEM algorithm is guaranteed to monotonically increase an objective function $\mathcal{F}$, as a function of a distribution over parameters and hidden variables. Since we integrate over model parameters there is a naturally incorporated model complexity penalty. It turns out that for a large class of models (see section 2.4) the VBE step has approximately the same computational complexity as the standard E step in the ML framework, which makes it viable as a Bayesian replacement for the EM algorithm.
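The self-contained sketch below (an added illustration; the noisily observed coin model is an arbitrary choice, not from the thesis) implements theorem 2.1 literally for a model small enough that all integrals become sums: $\theta$ is discretised onto a grid, $q_\theta$ is a vector of grid weights, each $q_{x_i}$ is a Bernoulli distribution, and the bound $\mathcal{F}_m$ of (2.53) is evaluated after every VBE and VBM step to confirm that it increases monotonically.

```python
import numpy as np

rng = np.random.default_rng(3)

# Model: theta ~ uniform prior (discretised grid), x_i ~ Bernoulli(theta),
# y_i | x_i is a noisy observation of x_i: p(y_i = x_i) = 0.8.
G = 200
theta = (np.arange(G) + 0.5) / G                  # grid of parameter values
log_prior = np.full(G, -np.log(G))                # uniform prior mass on the grid
lik_y = np.array([[0.8, 0.2],                     # p(y | x): rows x=0,1; cols y=0,1
                  [0.2, 0.8]])
y = rng.binomial(1, 0.75, size=25)                # observed data

# Variational distributions: q_theta over the grid, q_x1[i] = q(x_i = 1).
q_theta = np.full(G, 1.0 / G)
q_x1 = np.full(len(y), 0.5)

def bound(q_theta, q_x1):
    """The lower bound F_m of (2.53), with integrals replaced by grid/label sums."""
    e_ln_th, e_ln_1mth = q_theta @ np.log(theta), q_theta @ np.log(1 - theta)
    qx = np.stack([1 - q_x1, q_x1], axis=1)                       # shape (n, 2)
    e_joint = (qx[:, 1] * e_ln_th + qx[:, 0] * e_ln_1mth
               + (qx * np.log(lik_y[:, y].T)).sum(axis=1))
    entropy_x = -(qx * np.log(qx)).sum()
    return (e_joint.sum() + entropy_x
            + q_theta @ log_prior - q_theta @ np.log(q_theta))

for t in range(8):
    # VBE step (2.54): ln q(x_i) = <ln p(x_i, y_i | theta)>_{q_theta} + const.
    e_ln_th, e_ln_1mth = q_theta @ np.log(theta), q_theta @ np.log(1 - theta)
    log_r1 = e_ln_th + np.log(lik_y[1, y])
    log_r0 = e_ln_1mth + np.log(lik_y[0, y])
    q_x1 = 1.0 / (1.0 + np.exp(log_r0 - log_r1))
    print(f"iter {t}: F = {bound(q_theta, q_x1):.4f} after VBE", end="")

    # VBM step (2.56): ln q(theta) = ln p(theta) + <ln p(x, y | theta)>_{q_x} + const.
    log_q = log_prior + q_x1.sum() * np.log(theta) + (1 - q_x1).sum() * np.log(1 - theta)
    q_theta = np.exp(log_q - log_q.max())
    q_theta /= q_theta.sum()
    print(f", F = {bound(q_theta, q_x1):.4f} after VBM")
```

Restricting q_theta to a single grid point and re-maximising it in the VBM step would recover the ordinary EM algorithm of section 2.2, as noted above.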

2.3.2 Discussion

The impact of the $q(x, \theta) \approx q_x(x)\, q_\theta(\theta)$ factorisation

Unless we make the assumption that the posterior over parameters and hidden variables factorises, we will not generally obtain the further hidden variable factorisation over $n$ that we have in equation (2.55). In that case, the distributions of $x_i$ and $x_j$ will be coupled for all cases $\{i, j\}$ in the data set, greatly increasing the overall computational complexity of inference. This further factorisation is depicted in figure 2.4 for the case of $n = 3$, where we see: (a) the original directed graphical model, where $\theta$ is the collection of parameters governing prior distributions over the hidden variables $x_i$ and the conditional probability $p(y_i \mid x_i, \theta)$; (b) the moralised graph given the data $\{y_1, y_2, y_3\}$, which shows that the hidden variables are now dependent in the posterior through the uncertain parameters; (c) the effective graph after the factorisation assumption, which not only removes arcs between the parameters and hidden variables, but also removes the dependencies between the hidden variables. This latter independence falls out from the optimisation as a result of the i.i.d. nature of the data, and is not a further approximation.

Whilst this factorisation of the posterior distribution over hidden variables and parameters may seem drastic, one can think of it as replacing stochastic dependencies between $x$ and $\theta$ with deterministic dependencies between relevant moments of the two sets of variables. The advantage of ignoring how fluctuations in $x$ induce fluctuations in $\theta$ (and vice-versa) is that we can obtain analytical approximations to the log marginal likelihood. It is these same ideas that underlie mean-field approximations from statistical physics, from where these lower-bounding variational approximations were inspired (Feynman, 1972; Parisi, 1988). In later chapters the consequences of the factorisation for particular models are studied in some detail; in particular we will use sampling methods to estimate by how much the variational bound falls short of the marginal likelihood.

What forms for $q_x(x)$ and $q_\theta(\theta)$?

One might need to approximate the posterior further than simply the hidden-variable / parameter factorisation. A common reason for this is that the parameter posterior may still be intractable despite the hidden-variable / parameter factorisation. The free-form extremisation of $\mathcal{F}$ normally provides us with a functional form for $q_\theta(\theta)$, but this may be unwieldy; we therefore need to assume some simpler space of parameter posteriors. The most commonly used distributions are those with just a few sufficient statistics, such as the Gaussian or Dirichlet distributions. Taking a Gaussian example, $\mathcal{F}$ is then explicitly extremised with respect to a set of variational parameters $\zeta_\theta = (\mu_\theta, \nu_\theta)$ which parameterise the Gaussian $q_\theta(\theta \mid \zeta_\theta)$. We will see examples of this approach in later chapters. There may also exist intractabilities in the hidden variable posterior, for which further approximations need be made (some examples are mentioned below).

2.3. Varatonal methods for Bayesan learnng x 1 x 2 x 3 x 1 x 2 x 3 y 1 y 2 y 3 y 1 y 2 y 3 (a) The generatve graphcal model. (b) Graph representng the exact posteror. x 1 x 2 x 3 (c) Posteror graph after the varatonal approxmaton. Fgure 2.4: Graphcal depcton of the hdden-varable / parameter factorsaton. (a) The orgnal generatve model for n = 3. (b) The exact posteror graph gven the data. Note that for all case pars {, j}, x and x j are not drectly coupled, but nteract through. That s to say all the hdden varables are condtonally ndependent of one another, but only gven the parameters. (c) the posteror graph after the varatonal approxmaton between parameters and hdden varables, whch removes arcs between parameters and hdden varables. Note that, on assumng ths factorsaton, as a consequence of the..d. assumpton the hdden varables become ndependent. 59

There is something of a dark art in discovering a factorisation amongst the hidden variables and parameters such that the approximation remains faithful at an acceptable level. Of course it does not make sense to use a posterior form which holds fewer conditional independencies than those implied by the moral graph (see section 1.1). The key to a good variational approximation is then to remove as few arcs as possible from the moral graph such that inference becomes tractable. In many cases the goal is to find tractable substructures (structured approximations) such as trees or mixtures of trees, which capture as many of the arcs as possible. Some arcs may capture crucial dependencies between nodes and so need be kept, whereas other arcs might induce a weak local correlation at the expense of a long-range correlation which to first order can be ignored; removing such an arc can have dramatic effects on the tractability.

The advantage of the variational Bayesian procedure is that any factorisation of the posterior yields a lower bound on the marginal likelihood. Thus in practice it may pay to approximately evaluate the computational cost of several candidate factorisations, and implement those which can return a completed optimisation of $\mathcal{F}$ within a certain amount of computer time. One would expect the more complex factorisations to take more computer time but also to yield progressively tighter lower bounds on average, the consequence being that the marginal likelihood estimate improves over time. An interesting avenue of research in this vein would be to use the variational posterior resulting from a simpler factorisation as the initialisation for a slightly more complicated factorisation, and move in a chain from simple to complicated factorisations to help avoid local free energy minima in the optimisation. Having proposed this, it remains to be seen if it is possible to form a coherent closely-spaced chain of distributions that are of any use, as compared to starting from the fullest posterior approximation from the start.

Using the lower bound for model selection and averaging

The log ratio of posterior probabilities of two competing models $m$ and $m'$ is given by

$$\ln \frac{p(m \mid y)}{p(m' \mid y)} = \ln p(m) + \ln p(y \mid m) - \ln p(m') - \ln p(y \mid m') \qquad (2.74)$$
$$= \ln p(m) + \mathcal{F}(q_{x,\theta}) + \mathrm{KL}\left[ q(x, \theta) \,\|\, p(x, \theta \mid y, m) \right] - \ln p(m') - \mathcal{F}'(q'_{x,\theta}) - \mathrm{KL}\left[ q'(x, \theta) \,\|\, p(x, \theta \mid y, m') \right] , \qquad (2.75)$$

where we have used the form in (2.72), which is exact regardless of the quality of the bound used, or how tightly that bound has been optimised. The lower bounds for the two models, $\mathcal{F}$ and $\mathcal{F}'$, are calculated from VBEM optimisations, providing us for each model with an approximation to the posterior over the hidden variables and parameters of that model, $q_{x,\theta}$ and $q'_{x,\theta}$; these may in general be functionally very different (we leave aside for the moment local maxima problems in the optimisation process, which can be overcome to an extent by using several differently initialised optimisations or in some models by employing heuristics tailored to exploit the model structure).

When we perform model selection by comparing the lower bounds, $\mathcal{F}$ and $\mathcal{F}'$, we are assuming that the KL divergences in the two approximations are the same, so that we can use just these lower bounds as a guide. Unfortunately it is non-trivial to predict how tight in theory any particular bound can be; if this were possible we could more accurately estimate the marginal likelihood from the start.

Taking an example, we would like to know whether the bound for a model with $S$ mixture components is similar to that for $S + 1$ components, and if not then how badly this inconsistency affects the posterior over this set of models. Roughly speaking, let us assume that every component in our model contributes a (constant) KL divergence penalty of $\mathrm{KL}_s$. For clarity we use the notation $\mathcal{L}(S)$ and $\mathcal{F}(S)$ to denote the exact log marginal likelihood and lower bounds, respectively, for a model with $S$ components. The difference in log marginal likelihoods, $\mathcal{L}(S+1) - \mathcal{L}(S)$, is the quantity we wish to estimate, but if we base this on the lower bounds the difference becomes

$$\mathcal{L}(S+1) - \mathcal{L}(S) = \left[ \mathcal{F}(S+1) + (S+1)\,\mathrm{KL}_s \right] - \left[ \mathcal{F}(S) + S\,\mathrm{KL}_s \right] \qquad (2.76)$$
$$= \mathcal{F}(S+1) - \mathcal{F}(S) + \mathrm{KL}_s \qquad (2.77)$$
$$\neq \mathcal{F}(S+1) - \mathcal{F}(S) , \qquad (2.78)$$

where the last line is the result we would have basing the difference on lower bounds. Therefore there exists a systematic error when comparing models if each component contributes independently to the KL divergence term. Since the KL divergence is strictly positive, and we are basing our model selection on (2.78) rather than (2.77), this analysis suggests that there is a systematic bias towards simpler models. We will in fact see this in chapter 4, where we find an importance sampling estimate of the KL divergence showing this behaviour.

Optimising the prior distributions

Usually the parameter priors are functions of hyperparameters, $a$, so we can write $p(\theta \mid a, m)$. In the variational Bayesian framework the lower bound can be made higher by maximising $\mathcal{F}_m$ with respect to these hyperparameters:

$$a^{(t+1)} = \arg\max_a \; \mathcal{F}_m(q_x(x), q_\theta(\theta), y, a) . \qquad (2.79)$$

A simple depiction of this optimisation is given in figure 2.5. Unlike earlier in section 2.3.1, the marginal likelihood of model $m$ can now be increased with hyperparameter optimisation. As we will see in later chapters, there are examples where these hyperparameters themselves have governing hyperpriors, such that they can be integrated over as well. The result is that we can infer distributions over these as well, just as for parameters.

Figure 2.5: The variational Bayesian EM algorithm with hyperparameter optimisation. The VBEM step consists of VBE and VBM steps, as shown in figure 2.3. The hyperparameter optimisation increases the lower bound and also improves the marginal likelihood.

The reason for abstracting from the parameters this far is that we would like to integrate out all variables whose cardinality increases with model complexity; this standpoint will be made clearer in the following chapters.

Previous work, and general applicability of VBEM

The variational approach for lower bounding the marginal likelihood (and similar quantities) has been explored by several researchers in the past decade, and has received a lot of attention recently in the machine learning community. It was first proposed for one-hidden-layer neural networks (which have no hidden variables) by Hinton and van Camp (1993), where $q_\theta(\theta)$ was restricted to be Gaussian with diagonal covariance. This work was later extended to show that tractable approximations were also possible with a full covariance Gaussian (Barber and Bishop, 1998) (which in general will have the mode of the posterior at a different location than in the diagonal case). Neal and Hinton (1998) presented a generalisation of EM which made use of Jensen's inequality to allow partial E-steps; in this paper the term ensemble learning was used to describe the method since it fits an ensemble of models, each with its own parameters. Jaakkola (1997) and Jordan et al. (1999) review variational methods in a general context (i.e. non-Bayesian). Variational Bayesian methods have been applied to various models with hidden variables and no restrictions on $q_\theta(\theta)$ and $q_{x_i}(x_i)$ other than the assumption that they factorise in some way (Waterhouse et al., 1996; Bishop, 1999; Ghahramani and Beal, 2000; Attias, 2000). Of particular note is the variational Bayesian HMM of MacKay (1997), in which free-form optimisations are explicitly undertaken (see chapter 3); this work was the inspiration for the examination of Conjugate-Exponential (CE) models, discussed in the next section. An example of a constrained optimisation for a logistic regression model can be found in Jaakkola and Jordan (2000).

Several researchers have investigated using mixture distributions for the approximate posterior, which allows for more flexibility whilst maintaining a degree of tractability (Lawrence et al., 1998; Bishop et al., 1998; Lawrence and Azzouzi, 1999). The lower bound in these models is a sum of two terms: a first term which is a convex combination of bounds from each mixture component, and a second term which is the mutual information between the mixture labels and the hidden variables of the model. The first term offers no improvement over a naive combination of bounds, but the second (which is non-negative) has to improve on the simple bounds. Unfortunately this term contains an expectation over all configurations of the hidden states and so has to be itself bounded with a further use of Jensen's inequality, in the form of a convex bound on the log function ($\ln(x) \leq \lambda x - \ln(\lambda) - 1$) (Jaakkola and Jordan, 1998). Despite this approximation drawback, empirical results in a handful of models have shown that the approximation does improve the simple mean field bound and improves monotonically with the number of mixture components.

A related method for approximating the integrand for Bayesian learning is based on an idea known as assumed density filtering (ADF) (Bernardo and Girón, 1988; Stephens, 1997; Boyen and Koller, 1998; Barber and Sollich, 2000; Frey et al., 2001), and is called the Expectation Propagation (EP) algorithm (Minka, 2001a). This algorithm approximates the integrand of interest with a set of terms, and through a process of repeated deletion-inclusion of term expressions, the integrand is iteratively refined to resemble the true integrand as closely as possible. Therefore the key to the method is to use terms which can be tractably integrated. This has the same flavour as the variational Bayesian method described here, where we iteratively update the approximate posterior over a hidden state $q_{x_i}(x_i)$ or over the parameters $q_\theta(\theta)$.

The key difference between EP and VB is that in the update process (i.e. deletion-inclusion) EP seeks to minimise the KL divergence which averages according to the true distribution, $\mathrm{KL}\left[ p(x, \theta \mid y) \,\|\, q(x, \theta) \right]$ (which is simply a moment-matching operation for exponential family models), whereas VB seeks to minimise the KL divergence according to the approximate distribution, $\mathrm{KL}\left[ q(x, \theta) \,\|\, p(x, \theta \mid y) \right]$. Therefore, EP is at least attempting to average according to the correct distribution, whereas VB has the wrong cost function at heart. However, in general the KL divergence in EP can only be minimised separately one term at a time, while the KL divergence in VB is minimised globally over all terms in the approximation. The result is that EP may still not result in representative posterior distributions (for example, see Minka, 2001a, figure 3.6, p. 6). Having said that, it may be that more generalised deletion-inclusion steps can be derived for EP, for example removing two or more terms at a time from the integrand, and this may alleviate some of the local restrictions of the EP algorithm. As in VB, EP is constrained to use particular parametric families with a small number of moments for tractability. An example of EP used with an assumed Dirichlet density for the term expressions can be found in Minka and Lafferty (2002).
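The practical effect of the two KL directions can be seen in a few lines of code (an added illustration; the bimodal target below is arbitrary): fitting a single Gaussian to a two-component mixture by minimising $\mathrm{KL}[p \,\|\, q]$, as EP attempts, reduces to moment matching and produces a broad distribution covering both modes, whereas minimising $\mathrm{KL}[q \,\|\, p]$, as VB does, selects one mode.

```python
import numpy as np
from scipy.stats import norm

# Bimodal "posterior" p: the mixture 0.5 N(-3, 1) + 0.5 N(3, 1).
w, means, sds = np.array([0.5, 0.5]), np.array([-3.0, 3.0]), np.array([1.0, 1.0])
grid = np.linspace(-10, 10, 4001)
p = (w * norm.pdf(grid[:, None], means, sds)).sum(axis=1)
p /= np.trapz(p, grid)

# EP-style direction KL[p || q]: for a Gaussian q this is minimised by moment matching.
mu_ep = np.sum(w * means)
var_ep = np.sum(w * (sds**2 + means**2)) - mu_ep**2

# VB-style direction KL[q || p]: minimise numerically over (mu, sigma) on a grid.
def kl_q_p(mu, sigma):
    q = norm.pdf(grid, mu, sigma)
    q /= np.trapz(q, grid)
    return np.trapz(q * (np.log(q + 1e-300) - np.log(p + 1e-300)), grid)

candidates = [(kl_q_p(m, s), m, s)
              for m in np.linspace(-6, 6, 121) for s in np.linspace(0.5, 5, 46)]
_, mu_vb, sd_vb = min(candidates)

print(f"KL[p||q] (EP-like): mu = {mu_ep:.2f}, sd = {np.sqrt(var_ep):.2f}  (covers both modes)")
print(f"KL[q||p] (VB-like): mu = {mu_vb:.2f}, sd = {sd_vb:.2f}  (locks onto one mode)")
```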

In the next section we take a closer look at the variational Bayesian EM equations, (2.54) and (2.56), and ask the following questions:

- To which models can we apply VBEM? I.e., which forms of data distributions $p(x_i, y_i \mid \theta)$ and priors $p(\theta \mid m)$ result in tractable VBEM updates?
- How does this relate formally to conventional EM?
- When can we utilise existing belief propagation algorithms in the VB framework?

2.4 Conjugate-Exponential models

2.4.1 Definition

We consider a particular class of graphical models with latent variables, which we call conjugate-exponential (CE) models. In this section we explicitly apply the variational Bayesian method to these parametric families, deriving a simple general form of VBEM for the class.

Conjugate-exponential models satisfy two conditions:

Condition (1). The complete-data likelihood is in the exponential family:

$$p(x_i, y_i \mid \theta) = g(\theta)\, f(x_i, y_i)\, e^{\phi(\theta)^\top u(x_i, y_i)} , \qquad (2.80)$$

where $\phi(\theta)$ is the vector of natural parameters, $u$ and $f$ are the functions that define the exponential family, and $g$ is a normalisation constant:

$$g(\theta)^{-1} = \int dx_i \, dy_i \; f(x_i, y_i)\, e^{\phi(\theta)^\top u(x_i, y_i)} . \qquad (2.81)$$

The natural parameters for an exponential family model $\phi$ are those that interact linearly with the sufficient statistics of the data $u$. For example, for a univariate Gaussian in $x$ with mean $\mu$ and standard deviation $\sigma$, the necessary quantities are obtained from:

$$p(x \mid \mu, \sigma) = \exp \left\{ -\frac{x^2}{2\sigma^2} + \frac{x\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \ln(2\pi\sigma^2) \right\} , \qquad (2.82)$$
$$\theta = (\sigma^2, \mu) , \qquad (2.83)$$

and are:

$$\phi(\theta) = \left( \frac{1}{\sigma^2}, \frac{\mu}{\sigma^2} \right) \qquad (2.84)$$
$$u(x) = \left( -\frac{x^2}{2}, x \right) \qquad (2.85)$$
$$f(x) = 1 \qquad (2.86)$$
$$g(\theta) = \exp \left\{ -\frac{\mu^2}{2\sigma^2} - \frac{1}{2} \ln(2\pi\sigma^2) \right\} . \qquad (2.87)$$

Note that whilst the parameterisation for $\theta$ is arbitrary, e.g. we could have let $\theta = (\sigma, \mu)$, the natural parameters $\phi$ are unique up to a multiplicative constant.

Condition (2). The parameter prior is conjugate to the complete-data likelihood:

$$p(\theta \mid \eta, \nu) = h(\eta, \nu)\, g(\theta)^{\eta}\, e^{\phi(\theta)^\top \nu} , \qquad (2.88)$$

where $\eta$ and $\nu$ are hyperparameters of the prior, and $h$ is a normalisation constant:

$$h(\eta, \nu)^{-1} = \int d\theta \; g(\theta)^{\eta}\, e^{\phi(\theta)^\top \nu} . \qquad (2.89)$$

Condition 1 (2.80) in fact usually implies the existence of a conjugate prior which satisfies condition 2 (2.88). The prior $p(\theta \mid \eta, \nu)$ is said to be conjugate to the likelihood $p(x_i, y_i \mid \theta)$ if and only if the posterior

$$p(\theta \mid \eta', \nu') \propto p(\theta \mid \eta, \nu)\, p(x, y \mid \theta) \qquad (2.90)$$

is of the same parametric form as the prior. In general the exponential families are the only classes of distributions that have natural conjugate prior distributions, because they are the only distributions with a fixed number of sufficient statistics, apart from some irregular cases (see Gelman et al., 1995, p. 38). From the definition of conjugacy, we see that the hyperparameters of a conjugate prior can be interpreted as the number ($\eta$) and values ($\nu$) of pseudo-observations under the corresponding likelihood.

We call models that satisfy conditions 1 (2.80) and 2 (2.88) conjugate-exponential. The list of latent-variable models of practical interest with complete-data likelihoods in the exponential family is very long, for example: Gaussian mixtures, factor analysis, principal components analysis, hidden Markov models and extensions, switching state-space models, and discrete-variable belief networks. Of course there are also many as yet undreamt-of models combining Gaussian, gamma, Poisson, Dirichlet, Wishart, multinomial, and other distributions in the exponential family.
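The decomposition (2.82)-(2.87) can be checked numerically; the short sketch below (added here as a sanity check, not part of the original text) confirms that $g(\theta)\, f(x)\, \exp\{\phi(\theta)^\top u(x)\}$ reproduces the Gaussian density for several arbitrary values of $x$, $\mu$ and $\sigma$.

```python
import numpy as np
from scipy.stats import norm

def gaussian_exp_family(x, mu, sigma):
    """Evaluate N(x; mu, sigma^2) via its exponential-family factors (2.84)-(2.87)."""
    phi = np.array([1.0 / sigma**2, mu / sigma**2])     # natural parameters
    u = np.array([-x**2 / 2.0, x])                      # sufficient statistics
    f = 1.0
    g = np.exp(-mu**2 / (2 * sigma**2) - 0.5 * np.log(2 * np.pi * sigma**2))
    return g * f * np.exp(phi @ u)

for x, mu, sigma in [(0.3, 1.0, 0.5), (-2.0, 0.7, 2.3), (4.1, -1.5, 1.1)]:
    assert np.isclose(gaussian_exp_family(x, mu, sigma), norm.pdf(x, mu, sigma))
print("exponential-family factorisation matches the Gaussian density")
```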

However there are some notable outcasts which do not satisfy the conditions for membership of the CE family, namely: Boltzmann machines (Ackley et al., 1985), logistic regression and sigmoid belief networks (Bishop, 1995), and independent components analysis (ICA) (as presented in Comon, 1994; Bell and Sejnowski, 1995), all of which are widely used in the machine learning community. As an example let us see why logistic regression is not in the conjugate-exponential family: for $y_i \in \{-1, 1\}$, the likelihood under a logistic regression model is

$$p(y_i \mid x_i, \theta) = \frac{e^{y_i \theta^\top x_i}}{e^{\theta^\top x_i} + e^{-\theta^\top x_i}} , \qquad (2.91)$$

where $x_i$ is the regressor for data point $i$ and $\theta$ is a vector of weights, potentially including a bias. This can be rewritten as

$$p(y_i \mid x_i, \theta) = e^{y_i \theta^\top x_i - f(\theta, x_i)} , \qquad (2.92)$$

where $f(\theta, x_i)$ is a normalisation constant. To belong in the exponential family the normalising constant must split into functions of only $\theta$ and only $(x_i, y_i)$. Expanding $f(\theta, x_i)$ yields a series of powers of $\theta^\top x_i$, which could be assimilated into the $\phi(\theta)^\top u(x_i, y_i)$ term by augmenting the natural parameter and sufficient statistics vectors, if it were not for the fact that the series is infinite, meaning that there would need to be an infinity of natural parameters. This means we cannot represent the likelihood with a finite number of sufficient statistics.

Models whose complete-data likelihood is not in the exponential family can often be approximated by models which are in the exponential family and have been given additional hidden variables. A very good example is the Independent Factor Analysis (IFA) model of Attias (1999a). In conventional ICA, one can think of the model as using non-Gaussian sources, or using Gaussian sources passed through a non-linearity to make them non-Gaussian. For most non-linearities commonly used (such as the logistic), the complete-data likelihood becomes non-CE. Attias recasts the model as a mixture of Gaussian sources being fed into a linear mixing matrix. This model is in the CE family and so can be tackled with the VB treatment. It is an open area of research to investigate how best to bring models into the CE family, such that inferences in the modified model resemble the original as closely as possible.

2.4.2 Variational Bayesian EM for CE models

In Bayesian inference we want to determine the posterior over parameters and hidden variables $p(x, \theta \mid y, \eta, \nu)$. In general this posterior is neither conjugate nor in the exponential family. In this subsection we see how the properties of the CE family make it especially amenable to the VB approximation, and derive the VBEM algorithm for CE models.

Theorem 2.2: Variational Bayesian EM for Conjugate-Exponential Models. Given an i.i.d. data set $y = \{y_1, \ldots, y_n\}$, if the model satisfies conditions (1) and (2), then the following (a), (b) and (c) hold:

(a) the VBE step yields:

$$q_x(x) = \prod_{i=1}^{n} q_{x_i}(x_i) , \qquad (2.93)$$

and $q_{x_i}(x_i)$ is in the exponential family:

$$q_{x_i}(x_i) \propto f(x_i, y_i)\, e^{\bar{\phi}^\top u(x_i, y_i)} = p(x_i \mid y_i, \bar{\phi}) , \qquad (2.94)$$

with a natural parameter vector

$$\bar{\phi} = \int d\theta \; q_\theta(\theta)\, \phi(\theta) \equiv \langle \phi(\theta) \rangle_{q_\theta(\theta)} \qquad (2.95)$$

obtained by taking the expectation of $\phi(\theta)$ under $q_\theta(\theta)$ (denoted using angle-brackets $\langle \cdot \rangle$). For invertible $\phi$, defining $\tilde{\theta}$ such that $\phi(\tilde{\theta}) = \bar{\phi}$, we can rewrite the approximate posterior as

$$q_{x_i}(x_i) = p(x_i \mid y_i, \tilde{\theta}) . \qquad (2.96)$$

(b) the VBM step yields that $q_\theta(\theta)$ is conjugate and of the form:

$$q_\theta(\theta) = h(\tilde{\eta}, \tilde{\nu})\, g(\theta)^{\tilde{\eta}}\, e^{\phi(\theta)^\top \tilde{\nu}} , \qquad (2.97)$$

where

$$\tilde{\eta} = \eta + n , \qquad (2.98)$$
$$\tilde{\nu} = \nu + \sum_{i=1}^{n} \bar{u}(y_i) , \qquad (2.99)$$

and

$$\bar{u}(y_i) = \langle u(x_i, y_i) \rangle_{q_{x_i}(x_i)} \qquad (2.100)$$

is the expectation of the sufficient statistic $u$. We have used $\langle \cdot \rangle_{q_{x_i}(x_i)}$ to denote expectation under the variational posterior over the latent variable(s) associated with the $i$th datum.

(c) parts (a) and (b) hold for every iteration of variational Bayesian EM.

Proof of (a): by direct substitution.

Starting from the variational extrema solution (2.60) for the VBE step:

$$q_x(x) = \frac{1}{\mathcal{Z}_x}\, e^{\langle \ln p(x, y \mid \theta, m) \rangle_{q_\theta(\theta)}} , \qquad (2.101)$$

substitute the parametric form for $p(x_i, y_i \mid \theta, m)$ in condition 1 (2.80), which yields (omitting iteration superscripts):

$$q_x(x) = \frac{1}{\mathcal{Z}_x}\, e^{\sum_{i=1}^{n} \langle \ln g(\theta) + \ln f(x_i, y_i) + \phi(\theta)^\top u(x_i, y_i) \rangle_{q_\theta(\theta)}} \qquad (2.102)$$
$$= \frac{1}{\mathcal{Z}'_x} \left[ \prod_{i=1}^{n} f(x_i, y_i) \right] e^{\sum_{i=1}^{n} \bar{\phi}^\top u(x_i, y_i)} , \qquad (2.103)$$

where $\mathcal{Z}'_x$ has absorbed constants independent of $x$, and we have defined without loss of generality:

$$\bar{\phi} = \langle \phi(\theta) \rangle_{q_\theta(\theta)} . \qquad (2.104)$$

If $\phi$ is invertible, then there exists a $\tilde{\theta}$ such that $\bar{\phi} = \phi(\tilde{\theta})$, and we can rewrite (2.103) as:

$$q_x(x) = \frac{1}{\mathcal{Z}'_x} \left[ \prod_{i=1}^{n} f(x_i, y_i) \right] e^{\sum_{i=1}^{n} \phi(\tilde{\theta})^\top u(x_i, y_i)} \qquad (2.105)$$
$$\propto \prod_{i=1}^{n} p(x_i, y_i \mid \tilde{\theta}, m) \qquad (2.106)$$
$$= \prod_{i=1}^{n} q_{x_i}(x_i) \qquad (2.107)$$
$$= p(x \mid y, \tilde{\theta}, m) . \qquad (2.108)$$

Thus the result of the approximate VBE step, which averages over the ensemble of models $q_\theta(\theta)$, is exactly the same as an exact E step, calculated at the variational Bayes point estimate $\tilde{\theta}$.

Proof of (b): by direct substitution.

Starting from the variational extrema solution (2.56) for the VBM step:

$$q_\theta(\theta) = \frac{1}{\mathcal{Z}_\theta}\, p(\theta \mid m)\, e^{\langle \ln p(x, y \mid \theta, m) \rangle_{q_x(x)}} , \qquad (2.109)$$

substitute the parametric forms for $p(\theta \mid m)$ and $p(x_i, y_i \mid \theta, m)$ as specified in conditions 2 (2.88) and 1 (2.80) respectively, which yields (omitting iteration superscripts):

$$q_\theta(\theta) = \frac{1}{\mathcal{Z}_\theta}\, h(\eta, \nu)\, g(\theta)^{\eta}\, e^{\phi(\theta)^\top \nu}\, e^{\sum_{i=1}^{n} \langle \ln g(\theta) + \ln f(x_i, y_i) + \phi(\theta)^\top u(x_i, y_i) \rangle_{q_x(x)}} \qquad (2.110)$$
$$= \frac{1}{\mathcal{Z}_\theta}\, h(\eta, \nu)\, g(\theta)^{\eta + n}\, e^{\phi(\theta)^\top \left[ \nu + \sum_{i=1}^{n} \bar{u}(y_i) \right]} \underbrace{e^{\sum_{i=1}^{n} \langle \ln f(x_i, y_i) \rangle_{q_x(x)}}}_{\text{has no } \theta \text{ dependence}} \qquad (2.111)$$
$$= h(\tilde{\eta}, \tilde{\nu})\, g(\theta)^{\tilde{\eta}}\, e^{\phi(\theta)^\top \tilde{\nu}} , \qquad (2.112)$$

where

$$h(\tilde{\eta}, \tilde{\nu}) = \frac{1}{\mathcal{Z}_\theta}\, h(\eta, \nu)\, e^{\sum_{i=1}^{n} \langle \ln f(x_i, y_i) \rangle_{q_x(x)}} . \qquad (2.113)$$

Therefore the variational posterior $q_\theta(\theta)$ in (2.112) is of conjugate form, according to condition 2 (2.88).

Proof of (c): by induction. Assume conditions 1 (2.80) and 2 (2.88) are met (i.e. the model is in the CE family). From part (a), the VBE step produces a posterior distribution $q_x(x)$ in the exponential family, preserving condition 1 (2.80); the parameter distribution $q_\theta(\theta)$ remains unaltered, preserving condition 2 (2.88). From part (b), the VBM step produces a parameter posterior $q_\theta(\theta)$ that is of conjugate form, preserving condition 2 (2.88); $q_x(x)$ remains unaltered from the VBE step, preserving condition 1 (2.80). Thus under both the VBE and VBM steps, conjugate-exponentiality is preserved, which makes the theorem applicable at every iteration of VBEM.

As before, since $q_\theta(\theta)$ and $q_{x_i}(x_i)$ are coupled, (2.97) and (2.94) do not provide an analytic solution to the minimisation problem, so the optimisation problem is solved numerically by iterating between the fixed point equations given by these equations. To summarise briefly:

VBE Step: Compute the expected sufficient statistics $\{\bar{u}(y_i)\}_{i=1}^{n}$ under the hidden variable distributions $q_{x_i}(x_i)$, for all $i$.

VBM Step: Compute the expected natural parameters $\bar{\phi} = \langle \phi(\theta) \rangle$ under the parameter distribution given by $\tilde{\eta}$ and $\tilde{\nu}$.

2.4.3 Implications

In order to really understand what the conjugate-exponential formalism buys us, let us reiterate the main points of theorem 2.2 above. The first result is that in the VBM step the analytical form of the variational posterior $q_\theta(\theta)$ does not change during iterations of VBEM: e.g. if the posterior is Gaussian at iteration $t = 1$, then only a Gaussian need be represented at future iterations. If it were able to change, which is the case in general (theorem 2.1), the posterior could quickly become unmanageable, and (further) approximations would be required to prevent the algorithm becoming too complicated.
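To see the two summarised steps in action (an added sketch; the noisy-coin model and its 0.8 observation accuracy are arbitrary choices, not from the thesis), the code below runs VBEM for a conjugate-exponential model with a Beta prior on a Bernoulli parameter: the VBM step only adds the expected sufficient statistics $\sum_i \bar{u}(y_i)$ to the prior's pseudo-counts, and the VBE step is an ordinary E step evaluated at the expected natural parameters $\langle \ln\theta \rangle$ and $\langle \ln(1-\theta) \rangle$, computed with digamma functions.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(4)

# Conjugate-exponential model: theta ~ Beta(a0, b0), x_i ~ Bernoulli(theta),
# y_i | x_i a noisy observation of x_i with accuracy 0.8.
a0, b0 = 1.0, 1.0
lik_y = np.array([[0.8, 0.2],          # p(y | x): rows x = 0, 1
                  [0.2, 0.8]])
y = rng.binomial(1, 0.7, size=50)

a, b = a0, b0                          # hyperparameters of q_theta = Beta(a, b)
q_x1 = np.full(len(y), 0.5)            # q(x_i = 1)

for t in range(10):
    # VBE step: use the expected natural parameters <ln theta>, <ln(1 - theta)>
    # under q_theta (theorem 2.2a), then do an ordinary E step at those values.
    e_ln_th = digamma(a) - digamma(a + b)
    e_ln_1mth = digamma(b) - digamma(a + b)
    log_r1 = e_ln_th + np.log(lik_y[1, y])
    log_r0 = e_ln_1mth + np.log(lik_y[0, y])
    q_x1 = 1.0 / (1.0 + np.exp(log_r0 - log_r1))

    # VBM step: conjugate update in the spirit of (2.98)-(2.99); the expected
    # sufficient statistic for datum i is u_bar(y_i) = <x_i> = q_x1[i].
    a = a0 + q_x1.sum()
    b = b0 + (1.0 - q_x1).sum()

print(f"q_theta = Beta({a:.2f}, {b:.2f}),  E[theta] = {a / (a + b):.3f}")
```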

The second result is that the posterior over hidden variables calculated in the VBE step is exactly the posterior that would be calculated had we been performing an ML/MAP E step. That is, the inferences using an ensemble of models $q_\theta(\theta)$ can be represented by the effect of a point parameter, $\tilde{\theta}$. The task of performing many inferences, each of which corresponds to a different parameter setting, can be replaced with a single inference step; it is possible to infer the hidden states in a conjugate-exponential model tractably while integrating over an ensemble of model parameters.

Comparison to EM for ML/MAP parameter estimation

We can draw a tight parallel between the EM algorithm for ML/MAP estimation, and our VBEM algorithm applied specifically to conjugate-exponential models. These are summarised in table 2.1.

Table 2.1: Comparison of EM for ML/MAP estimation against variational Bayesian EM for CE models.

EM for MAP estimation
  Goal: maximise $p(\theta \mid y, m)$ w.r.t. $\theta$
  E step: compute $q_x^{(t+1)}(x) = p(x \mid y, \theta^{(t)})$
  M step: $\theta^{(t+1)} = \arg\max_\theta \int dx \; q_x^{(t+1)}(x) \ln p(x, y, \theta)$

Variational Bayesian EM
  Goal: lower bound $p(y \mid m)$
  VBE step: compute $q_x^{(t+1)}(x) = p(x \mid y, \bar{\phi}^{(t)})$
  VBM step: $q_\theta^{(t+1)}(\theta) \propto \exp \int dx \; q_x^{(t+1)}(x) \ln p(x, y, \theta)$

This general result of VBEM for CE models was reported in Ghahramani and Beal (2001), and generalises the well-known EM algorithm for ML estimation (Dempster et al., 1977). It is a special case of the variational Bayesian algorithm (theorem 2.1) used in Ghahramani and Beal (2000) and in Attias (2000), yet encompasses many of the models that have been so far subjected to the variational treatment. Its particular usefulness is as a guide for the design of models, to make them amenable to efficient approximate Bayesian inference.

The VBE step has about the same time complexity as the E step, and is in all ways identical except that it is re-written in terms of the expected natural parameters. In particular, we can make use of all relevant propagation algorithms such as junction tree, Kalman smoothing, or belief propagation. The VBM step computes a distribution over parameters (in the conjugate family) rather than a point estimate. Both ML/MAP EM and VBEM algorithms monotonically increase an objective function, but the latter also incorporates a model complexity penalty by integrating over parameters.