CS 6347 Lecture 19. Exponential Families & Expectation Propagation


Discrete State Spaces

We have been focusing on the case of MRFs over discrete state spaces. Probability distributions over discrete spaces correspond to vectors of probabilities, one for each element of the space, such that the vector sums to one. The partition function is simply a sum over all possible assignments to the variables, and the entropy of the distribution is nonnegative and is also computed by summing over the state space.

Continuous State Spaces

$$p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)$$

For continuous state spaces, the partition function is now an integral,

$$Z = \int \prod_C \psi_C(x_C)\, dx,$$

and the entropy becomes

$$H(x) = -\int p(x) \log p(x)\, dx.$$

Differential Entropy

$$H(x) = -\int p(x) \log p(x)\, dx$$

This is called the differential entropy. Unlike the discrete entropy, it is not always greater than or equal to zero, and it is easy to construct distributions for which it is negative. Let $q(x)$ be the uniform distribution over the interval $[a, b]$; what is the entropy of $q(x)$?

$$H(q) = -\int_a^b \frac{1}{b-a} \log \frac{1}{b-a}\, dx = \log(b - a)$$

This is negative whenever $b - a < 1$.
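As a quick sanity check (our own example, not from the lecture), the sketch below numerically approximates the differential entropy of a uniform distribution on $[0, 0.5]$; the interval, grid size, and variable names are illustrative choices.

```python
import numpy as np

# Numerically approximate H(q) = -∫ q(x) log q(x) dx for q uniform on [a, b].
# With b - a = 0.5 < 1, the differential entropy should be log(0.5) ≈ -0.693.
a, b = 0.0, 0.5
x = np.linspace(a, b, 100001)
q = np.full_like(x, 1.0 / (b - a))   # uniform density on [a, b]
dx = x[1] - x[0]
H = -np.sum(q * np.log(q)) * dx      # Riemann-sum approximation of the integral
print(H, np.log(b - a))              # both ≈ -0.6931
```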

KL Divergence

$$D(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)}\, dx$$

The KL divergence is still nonnegative, even though it contains the differential entropy. This means that all of the observations that we made for finite state spaces carry over to the continuous case: the EM algorithm, mean-field methods, etc. Most importantly,

$$\log Z \geq H(q) + \sum_C \int q_C(x_C) \log \psi_C(x_C)\, dx_C.$$
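For completeness, here is the one-line derivation of this bound, following the same argument as in the discrete case, for $p(x) = \frac{1}{Z}\prod_C \psi_C(x_C)$:

$$0 \le D(q \,\|\, p) = \int q(x) \log q(x)\, dx - \int q(x) \log p(x)\, dx = -H(q) - \sum_C \int q_C(x_C) \log \psi_C(x_C)\, dx_C + \log Z,$$

and rearranging gives the inequality above; note that each term involving $\psi_C$ depends only on the marginal $q_C$.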

Continuous State Spaces

Examples of probability distributions over continuous state spaces:

The uniform distribution over the interval $[a, b]$:

$$q(x) = \frac{1}{b-a}\, \mathbf{1}_{[a,b]}(x)$$

The multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$:

$$q(x) = \frac{1}{\sqrt{(2\pi)^k \det(\Sigma)}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

Continuous State Spaces

What makes continuous distributions so difficult to deal with?

- They may not be compactly representable.
- Families of continuous distributions need not be closed under marginalization. (The marginal distributions of multivariate normal distributions, however, are again (multivariate) normal distributions.)
- Integration problems of interest (e.g., the partition function or marginal distributions) may not have closed-form solutions.
- The integrals may not even exist!

The Exponential Family

$$p(x \mid \theta) = h(x) \exp\left(\langle \theta, \phi(x) \rangle - \log Z(\theta)\right)$$

A distribution is a member of the exponential family if its probability density function can be expressed as above for some choice of parameters $\theta$ and potential functions $\phi(x)$. We are only interested in models for which $Z(\theta)$ is finite. The family of log-linear models that we have been focusing on in the discrete case belongs to the exponential family.
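As an illustration (not from the slides), the sketch below writes the Bernoulli distribution in this form, with $h(x) = 1$, $\phi(x) = x$, $\theta = \log\frac{p}{1-p}$, and $\log Z(\theta) = \log(1 + e^\theta)$; the function name is ours.

```python
import numpy as np

# Bernoulli(p) as an exponential family:
#   p(x | theta) = h(x) * exp(theta * phi(x) - log Z(theta)),
# with h(x) = 1, phi(x) = x, theta = log(p / (1 - p)), log Z(theta) = log(1 + e^theta).

def bernoulli_exp_family(x, theta):
    log_Z = np.log1p(np.exp(theta))
    return np.exp(theta * x - log_Z)

p = 0.3
theta = np.log(p / (1 - p))
print(bernoulli_exp_family(1, theta), bernoulli_exp_family(0, theta))  # ≈ 0.3, 0.7
```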

The Exponential Family

$$p(x \mid \theta) = h(x) \exp\left(\langle \theta, \phi(x) \rangle - \log Z(\theta)\right)$$

As in the discrete case, there is not necessarily a unique way to express a distribution in this form. We say that the representation is minimal if there does not exist a vector $a \neq 0$ such that $\langle a, \phi(x) \rangle$ is constant. In this case, there is a unique parameter vector associated with each member of the family. The $\phi$ are called the sufficient statistics of the distribution.

The Multivariate Normal

$$q(x \mid \mu, \Sigma) = \frac{1}{Z} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

The multivariate normal distribution is a member of the exponential family:

$$q(x \mid \theta) = \frac{1}{Z(\theta)} \exp\left(\sum_i \theta_i x_i + \sum_{i,j} \theta_{ij} x_i x_j\right)$$

The mean and the covariance matrix (which must be positive semidefinite) are sufficient statistics of the multivariate normal distribution.
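To make the correspondence concrete, here is a small sketch (our own, not from the slides) converting between the moment parameters $(\mu, \Sigma)$ and the natural parameters $\theta = \Sigma^{-1}\mu$ and $\Theta = -\frac{1}{2}\Sigma^{-1}$ of a multivariate normal.

```python
import numpy as np

# Moment parameters (mu, Sigma)  <->  natural parameters (theta, Theta), where
#   q(x) ∝ exp( theta^T x + x^T Theta x ),  theta = Sigma^{-1} mu,  Theta = -0.5 * Sigma^{-1}.

def to_natural(mu, Sigma):
    Lambda = np.linalg.inv(Sigma)       # precision matrix
    return Lambda @ mu, -0.5 * Lambda

def to_moment(theta, Theta):
    Sigma = np.linalg.inv(-2.0 * Theta)
    return Sigma @ theta, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
theta, Theta = to_natural(mu, Sigma)
print(to_moment(theta, Theta))          # recovers (mu, Sigma)
```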

The Exponential Family

Many of the distributions that you have seen before, both discrete and continuous, are members of the exponential family: Binomial, Poisson, Bernoulli, Gamma, Beta, Laplace, Categorical, etc. The exponential family, while not the most general parametric family, is one of the easiest to work with and captures a variety of different distributions.

Continuous Bethe Approximation

Recall that, from the nonnegativity of the KL divergence,

$$\log Z \geq H(q) + \sum_C \int q_C(x_C) \log \psi_C(x_C)\, dx_C$$

for any probability distribution $q$ with marginals $q_C$. We can make the same approximations that we did in the discrete case to approximate $Z(\theta)$ in the continuous case.

Continuous Bethe Approximation

$$\max_{\tau \in T}\; H_B(\tau) + \sum_C \int \tau_C(x_C) \log \psi_C(x_C)\, dx_C$$

where

$$H_B(\tau) = -\sum_{i \in V} \int \tau_i(x_i) \log \tau_i(x_i)\, dx_i - \sum_C \int \tau_C(x_C) \log \frac{\tau_C(x_C)}{\prod_{i \in C} \tau_i(x_i)}\, dx_C$$

and $T$ is the set of vectors $\tau$ of locally consistent marginals. This approximation is exact on trees.

Continuous Belief Propagation

$$p(x) = \frac{1}{Z} \prod_{i \in V} \phi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j)$$

The messages passed by belief propagation are

$$m_{i \to j}(x_j) \propto \int \phi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i)\, dx_i$$

Depending on the functional form of the potential functions, the message update may not have a closed-form solution. We can't necessarily compute the correct marginal distributions or the partition function, even in the case of a tree!

Gaussian Belief Propagation

When $p(x)$ is a multivariate normal distribution, the message updates can be computed in closed form, and in this case max-product and sum-product are equivalent. Note that computing the mode of a multivariate normal is equivalent to solving a linear system of equations. This algorithm is called Gaussian belief propagation (GaBP). It does not converge for all multivariate normal distributions: the messages can have an inverse covariance matrix that is not positive definite.
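Below is a minimal sketch of GaBP for a pairwise Gaussian MRF written in information form, $p(x) \propto \exp(-\frac{1}{2} x^T J x + h^T x)$; it is our own illustration rather than the lecture's code, and the example matrix and iteration count are arbitrary.

```python
import numpy as np

# Gaussian belief propagation (GaBP) on a pairwise Gaussian MRF in information form:
#   p(x) ∝ exp(-0.5 x^T J x + h^T x).
# Each message i -> j is a 1-D Gaussian in x_j, parameterized by a precision and a potential.

def gabp(J, h, iters=100):
    n = len(h)
    P = np.zeros((n, n))   # P[i, j]: precision of message i -> j
    m = np.zeros((n, n))   # m[i, j]: potential (h-term) of message i -> j
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if i != j and J[i, j] != 0:
                    # Combine the local term with all incoming messages except the one from j
                    P_cav = J[i, i] + sum(P[k, i] for k in range(n) if k != i and k != j)
                    h_cav = h[i] + sum(m[k, i] for k in range(n) if k != i and k != j)
                    P[i, j] = -J[i, j] ** 2 / P_cav
                    m[i, j] = -J[i, j] * h_cav / P_cav
    # Beliefs: combine local terms with all incoming messages
    prec = np.array([J[i, i] + P[:, i].sum() for i in range(n)])
    pot = np.array([h[i] + m[:, i].sum() for i in range(n)])
    return pot / prec, 1.0 / prec   # approximate means and variances

# Example on a 3-node chain; on a tree, the GaBP means match the exact solution of J x = h.
J = np.array([[2.0, -0.5, 0.0], [-0.5, 2.0, -0.5], [0.0, -0.5, 2.0]])
h = np.array([1.0, 0.0, 1.0])
means, variances = gabp(J, h)
print(means, np.linalg.solve(J, h))
```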

Properties of Exponential Families

Exponential families are closed under multiplication but not closed under marginalization. For example, it is easy to get mixtures of Gaussians when a model has both discrete and continuous variables: let $p(x, y)$ be such that $x \in \mathbb{R}^n$ and $y \in \{1, \ldots, k\}$, with $p(x \mid y)$ normally distributed and $p(y)$ multinomially distributed. The marginal $p(x)$ is then a Gaussian mixture, and mixtures of exponential family distributions are not generally in the exponential family.
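A tiny illustration of this point (ours, with made-up parameters): marginalizing out the discrete $y$ leaves a Gaussian mixture density over $x$.

```python
import numpy as np

# p(y) multinomial, p(x | y) = N(mu_y, sigma_y^2); the marginal
# p(x) = sum_y p(y) N(x; mu_y, sigma_y^2) is a Gaussian mixture.
weights = np.array([0.3, 0.7])   # p(y)
mus = np.array([-2.0, 1.5])
sigmas = np.array([0.5, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def marginal(x):
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

print(marginal(0.0))   # a bimodal density, not a single exponential-family member
```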

Properties of Exponential Families

Derivatives of the log-partition function correspond to expectations of the sufficient statistics:

$$\nabla_\theta \log Z(\theta) = \int p(x \mid \theta)\, \phi(x)\, dx$$

So do second derivatives:

$$\frac{\partial^2}{\partial \theta_k\, \partial \theta_l} \log Z(\theta) = \int p(x \mid \theta)\, \phi(x)_k\, \phi(x)_l\, dx - \int p(x \mid \theta)\, \phi(x)_k\, dx \int p(x \mid \theta)\, \phi(x)_l\, dx$$

That is, the Hessian of $\log Z(\theta)$ is the covariance matrix of the sufficient statistics.
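As a quick numerical check (our own example), for a Bernoulli distribution written in exponential-family form with $\phi(x) = x$, we have $\log Z(\theta) = \log(1 + e^\theta)$, so the derivative should equal $\mathbb{E}[\phi(x)] = \mathbb{E}[x] = \sigma(\theta)$:

```python
import numpy as np

# For the Bernoulli family, log Z(theta) = log(1 + e^theta), and
# d/dtheta log Z(theta) = sigmoid(theta) = E[phi(x)] = E[x].
theta = 0.8
eps = 1e-6
log_Z = lambda t: np.log1p(np.exp(t))
finite_diff = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)
expected_phi = 1.0 / (1.0 + np.exp(-theta))   # P(x = 1) under the model
print(finite_diff, expected_phi)              # both ≈ 0.6900
```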

KL-Divergence and Exponential Families

Minimizing KL divergence to an exponential family is equivalent to moment matching. Let $q(x \mid \theta) = h(x) \exp\left(\langle \theta, \phi(x) \rangle - \log Z(\theta)\right)$ and let $p(x)$ be an arbitrary distribution:

$$D(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x \mid \theta)}\, dx$$

This KL divergence is minimized (as a function of $\theta$) when

$$\int p(x)\, \phi(x)_k\, dx = \int q(x \mid \theta)\, \phi(x)_k\, dx \quad \text{for all } k.$$
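As an illustration (our own, with made-up parameters), projecting a two-component Gaussian mixture $p$ onto a single Gaussian $q$ by minimizing $D(p \,\|\, q)$ simply matches the first and second moments of $p$:

```python
import numpy as np

# Moment-match a single Gaussian q to a two-component mixture p: the minimizer of
# D(p || q) over Gaussians q has the same mean and variance as p.
weights = np.array([0.4, 0.6])
mus = np.array([-1.0, 2.0])
sigmas = np.array([0.8, 0.5])

# Exact moments of the mixture
mean_p = np.sum(weights * mus)
second_moment = np.sum(weights * (sigmas ** 2 + mus ** 2))
var_p = second_moment - mean_p ** 2

print("q = N(%.3f, %.3f)" % (mean_p, var_p))   # the KL-optimal Gaussian approximation
```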

Expectation Propagation

Key idea: given $p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)$, approximate it by a simpler distribution

$$\tilde{p}(x) = \frac{1}{\tilde{Z}} \prod_C \tilde{\psi}_C(x_C).$$

We could just replace each factor with a member of some exponential family that best describes it, but this can result in a poor approximation unless each $\psi_C$ is essentially a member of the exponential family already. Instead, we construct the approximating distribution by performing a series of optimizations.

Expectation Propagation

Initialize the approximate distribution $\tilde{p}(x) = \frac{1}{\tilde{Z}} \prod_C \tilde{\psi}_C(x_C)$ so that each $\tilde{\psi}_C(x_C)$ is a member of some exponential family.

Repeat until convergence: for each $C$,

- Let $\tilde{p}^{\setminus C}(x) = \tilde{p}(x) / \tilde{\psi}_C(x_C)$.
- Set $q^* = \arg\min_q D\big(\tilde{p}^{\setminus C}\, \psi_C \,\|\, q\big)$, where the minimization is over all distributions $q$ of the chosen exponential family form.
- Set $\tilde{\psi}_C(x_C) = q^*(x) / \tilde{p}^{\setminus C}(x)$.

A concrete sketch of these updates is given below.
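The following is a minimal sketch of these updates for a one-dimensional model whose factors are approximated by Gaussian sites; it is our illustration, not code from the lecture, and the example factors, grid, and iteration count are arbitrary assumptions. The moment-matching step is done by numerical integration on a grid.

```python
import numpy as np

# Expectation propagation for p(x) ∝ prod_c psi_c(x) over a scalar x, approximating
# each factor by a Gaussian "site" psi~_c(x) = exp(a_c x - 0.5 b_c x^2).

def tilted_moments(cav_a, cav_b, psi, grid):
    """Mean and variance of the tilted distribution cavity(x) * psi_c(x)."""
    log_t = cav_a * grid - 0.5 * cav_b * grid ** 2 + np.log(psi(grid) + 1e-300)
    w = np.exp(log_t - log_t.max())
    dx = grid[1] - grid[0]
    w /= w.sum() * dx
    mean = np.sum(w * grid) * dx
    var = np.sum(w * (grid - mean) ** 2) * dx
    return mean, var

def expectation_propagation(factors, grid, iters=50):
    a = np.zeros(len(factors))           # site potentials
    b = np.full(len(factors), 1e-3)      # site precisions (small initial values)
    for _ in range(iters):
        for c, psi in enumerate(factors):
            cav_a, cav_b = a.sum() - a[c], b.sum() - b[c]   # remove site c (the cavity)
            m, v = tilted_moments(cav_a, cav_b, psi, grid)  # moment matching
            a[c], b[c] = m / v - cav_a, 1.0 / v - cav_b     # new site = matched Gaussian / cavity
    prec = b.sum()
    return a.sum() / prec, 1.0 / prec    # mean and variance of the approximation

# Example factors (assumed, for illustration): a Gaussian prior and two non-Gaussian terms.
grid = np.linspace(-10.0, 10.0, 4001)
factors = [
    lambda x: np.exp(-0.5 * x ** 2),                    # N(0, 1) prior
    lambda x: 1.0 / (1.0 + np.exp(-4.0 * (x - 1.0))),   # sigmoidal factor
    lambda x: np.exp(-np.abs(x - 2.0)),                 # Laplace-shaped factor
]
print(expectation_propagation(factors, grid))
```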

Expectation Propagation

This is one approach used to handle continuous distributions in practice. Other methods include discretization and sampling schemes that approximate the BP messages and then sample to compute the message updates: nonparametric BP, particle BP, and stochastic BP.