Variational Inference and Learning. Sargur N. Srihari

Sargur N. Srihari, srihari@cedar.buffalo.edu

Topics in Approximate Inference

Task of Inference; Intractability in Inference
1. Inference as Optimization
2. Expectation Maximization
3. MAP Inference and Sparse Coding
4. Variational Inference and Learning
5. Learned Approximate Inference

4. Variational Inference

Variational: Discrete vs Continuous

Specifying how to factorize means:
- Discrete case: use traditional optimization techniques to optimize a finite number of variables describing the q distributions.
- Continuous case: use calculus of variations over a space of functions to determine which function should be used to represent q.

Although calculus of variations is not used in the discrete case, the approach is still called variational. Calculus of variations is powerful: we need only specify the factorization, not design q itself.

Maximizing L = Minimizing D_KL

Maximizing L with respect to q is equivalent to minimizing D_KL(q(h|v) ‖ p(h|v)). This is because

L(v, θ, q) = log p(v; θ) − D_KL(q(h|v) ‖ p(h|v))

Maximum likelihood encourages the model to have high probability everywhere that the data has high probability; optimization-based inference encourages q to have low probability everywhere the true posterior has low probability. This is illustrated in the figure on the next slide.
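To make the identity concrete, here is a minimal numerical sketch on an assumed toy model with one binary visible variable and one binary hidden variable (the joint table, the observed value, and the choice of q are all made up for illustration). It checks that E_q[log p(v,h)] + H(q) equals log p(v; θ) − D_KL(q(h|v) ‖ p(h|v)).

```python
import numpy as np

# Assumed toy joint p(v, h): rows index v = 0, 1; columns index h = 0, 1
p_vh = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

v = 1                                   # an observed value of v
p_v = p_vh[v].sum()                     # p(v), marginalizing out h
p_h_given_v = p_vh[v] / p_v             # exact posterior p(h|v)

q = np.array([0.7, 0.3])                # an arbitrary approximate posterior q(h|v)

elbo = np.sum(q * np.log(p_vh[v])) - np.sum(q * np.log(q))  # E_q[log p(v,h)] + H(q)
kl = np.sum(q * np.log(q / p_h_given_v))                    # D_KL(q(h|v) || p(h|v))

print(elbo, np.log(p_v) - kl)           # the two quantities agree
print(np.log(p_v) >= elbo)              # and L lower-bounds log p(v)
```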

Asymmetry of KL divergence

We wish to approximate p with q, and we have a choice of using D_KL(p ‖ q) or D_KL(q ‖ p). Consider a p that is a mixture of two Gaussians and a q restricted to a single Gaussian.
(a) Effect of minimizing D_KL(p ‖ q): select a q that has high probability wherever p has high probability; q blurs the two modes together.
(b) Effect of minimizing D_KL(q ‖ p): select a q that has low probability wherever p has low probability; q avoids putting probability mass in the low-probability region between the modes.
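A small sketch of this asymmetry under assumed settings (an equal-weight two-component Gaussian mixture for p, a single Gaussian for q, and KL divergences evaluated by numerical integration on a grid): minimizing the two divergences with scipy typically gives a broad, mode-covering q in case (a) and a single-mode q in case (b).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Target p: an assumed equal-weight mixture of N(-3, 1) and N(3, 1)
xs = np.linspace(-12, 12, 4001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -3, 1) + 0.5 * norm.pdf(xs, 3, 1)

def kl(a, b):
    """Numerical D_KL(a || b) on the grid, ignoring points where a is ~0."""
    mask = a > 1e-12
    return np.sum(a[mask] * (np.log(a[mask]) - np.log(b[mask] + 1e-300))) * dx

def fit(direction):
    """Fit a single Gaussian q(m, s) by minimizing the chosen KL direction."""
    def objective(params):
        m, log_s = params
        q = norm.pdf(xs, m, np.exp(log_s))
        return kl(p, q) if direction == "forward" else kl(q, p)
    res = minimize(objective, x0=[1.0, 0.0], method="Nelder-Mead")
    m, log_s = res.x
    return m, np.exp(log_s)

print("argmin D_KL(p||q):", fit("forward"))  # roughly mean 0, std ~sqrt(10): covers both modes
print("argmin D_KL(q||p):", fit("reverse"))  # typically locks onto a single mode
```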

Topics in Variational Inference

1. Discrete Latent Variables
2. Calculus of Variations
3. Continuous Latent Variables
4. Interactions between Learning and Inference

We discuss a subset of these topics here.

Calculus of Variations

A function of a function f is known as a functional J[f]. We can take functional derivatives with respect to individual values of the function f(x) at any specific value of x; the functional derivative is denoted δJ/δf(x).

Functional Derivative Identity

For differentiable functions f(x) and differentiable functions g(y, x) with continuous derivatives:

δ/δf(x) ∫ g(f(x), x) dx = ∂/∂y g(f(x), x)

Intuition: f(x) is an infinite-dimensional vector indexed by x. The identity is analogous to the one for a vector θ ∈ R^n indexed by positive integers:

∂/∂θ_i Σ_j g(θ_j, j) = ∂/∂θ_i g(θ_i, i)

More general is the Euler-Lagrange equation, which allows g to depend on the derivatives of f as well as the value of f, but this is not needed for deep learning.
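The "infinite vector indexed by x" intuition can be checked numerically: discretize f on a grid, approximate J[f] = ∫ g(f(x), x) dx by a sum, and compare a finite-difference derivative (rescaled by 1/Δx) with ∂g/∂y. The choices g(y, x) = x·y² and f(x) = sin x below are arbitrary test functions, not anything from the text.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 1001)
dx = xs[1] - xs[0]
f = np.sin(xs)                          # an arbitrary test function f

def J(fvals):
    # J[f] = ∫ g(f(x), x) dx with g(y, x) = x * y**2, approximated on the grid
    return np.sum(xs * fvals**2) * dx

# Perturb the value of f at a single grid point and take a finite difference;
# dividing by dx approximates the functional derivative at that point.
i, eps = 400, 1e-6
f_plus = f.copy()
f_plus[i] += eps
numeric = (J(f_plus) - J(f)) / eps / dx

analytic = 2 * xs[i] * f[i]             # ∂g/∂y = 2xy evaluated at (f(x_i), x_i)
print(numeric, analytic)                # the two values agree closely
```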

Optimization wrt a Vector vs a Functional

Procedure to optimize a function with respect to a vector:
- Take the gradient of the function with respect to the vector.
- Solve for the point where every element of the gradient is equal to zero.

Procedure to optimize a functional:
- Solve for the function where the functional derivative at every point is equal to zero.

Example of Optimizing a Functional

Find the probability distribution over x ∈ R that has maximum entropy. The entropy of a probability distribution p(x) is defined as H[p] = −E_x[log p(x)]. For continuous values the expectation is an integral:

H[p] = −∫ p(x) log p(x) dx
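As a quick check of the definition, the sketch below evaluates H[p] = −∫ p(x) log p(x) dx on a grid for a Gaussian and compares it with the known closed form ½ log(2πeσ²); the grid and σ are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

sigma = 1.5                             # an arbitrary standard deviation
xs = np.linspace(-10 * sigma, 10 * sigma, 20001)
dx = xs[1] - xs[0]
p = norm.pdf(xs, 0.0, sigma)

H_numeric = -np.sum(p * np.log(p)) * dx
H_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(H_numeric, H_closed)              # both ≈ 1.824 for σ = 1.5
```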

Lagrangian constraints for maxent

1. To ensure that the result is a distribution, we add the constraint that p(x) integrates to 1.
2. Since entropy increases without bound as the variance increases, we seek the distribution with variance σ².
3. Since the problem is underdetermined (the distribution can be shifted arbitrarily without changing its entropy), we fix the mean to be μ.

The Lagrangian functional is

L[p] = λ₁(∫ p(x) dx − 1) + λ₂(E[x] − μ) + λ₃(E[(x − μ)²] − σ²) + H[p]
     = ∫ (λ₁ p(x) + λ₂ p(x) x + λ₃ p(x)(x − μ)² − p(x) log p(x)) dx − λ₁ − μλ₂ − σ²λ₃

Functional derivative of the Lagrangian

To minimize the Lagrangian with respect to p, we set the functional derivative equal to 0 at every point. Applying the functional derivative identity term by term, with

g₁(y, x) = y,  g₂(y, x) = yx,  g₃(y, x) = y(x − μ)²,  g₄(y, x) = y log y,  and d/dy (y log y) = 1 + log y,

gives the condition

δL/δp(x) = λ₁ + λ₂ x + λ₃(x − μ)² − 1 − log p(x) = 0

This condition tells us the functional form of p(x). By algebraic rearrangement,

p(x) = exp(λ₁ + λ₂ x + λ₃(x − μ)² − 1)

We never assumed directly that p(x) has this form.

Choosing λ values

We must choose the λ values so that all the constraints are satisfied. We may set

λ₁ = 1 − log(σ√(2π)),  λ₂ = 0,  λ₃ = −1/(2σ²)

to obtain p(x) = N(x; μ, σ²). Because the normal distribution has the maximum entropy, we impose the least possible amount of structure by using it.
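A short sanity check, with μ and σ chosen arbitrarily: plugging these λ values into p(x) = exp(λ₁ + λ₂x + λ₃(x − μ)² − 1) reproduces the Gaussian density.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.7, 2.0                    # arbitrary mean and standard deviation
lam1 = 1.0 - np.log(sigma * np.sqrt(2 * np.pi))
lam2 = 0.0
lam3 = -1.0 / (2 * sigma**2)

xs = np.linspace(-8, 8, 9)
p_derived = np.exp(lam1 + lam2 * xs + lam3 * (xs - mu)**2 - 1.0)
p_gauss = norm.pdf(xs, mu, sigma)
print(np.allclose(p_derived, p_gauss))  # True
```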

Continuous Latent Variables

When our graphical model contains continuous latent variables, we can still perform variational inference and learning by maximizing L. We must now use calculus of variations when maximizing L with respect to q(h|v). It is not necessary for practitioners to solve calculus of variations problems themselves; instead, there is a general equation for mean-field fixed-point updates.

Optimal q(h_i | v) for Mean Field

q(h_i | v) ∝ exp( E_{h_{−i} ∼ q(h_{−i}|v)} [ log p(v, h) ] )

This holds so long as p does not assign zero probability to any joint configuration of the variables. Carrying out the expectation yields the correct functional form of q(h_i | v). The update is designed to be applied iteratively, for each value of i in turn, until convergence.
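The sketch below applies this fixed-point update to an assumed toy posterior over two binary latent variables with log p̃(h) = a₁h₁ + a₂h₂ + w·h₁h₂ (the coefficients are made up and the conditioning on v is left implicit). Taking the expectation under the other factor gives each update the closed form μᵢ = sigmoid(aᵢ + w·μⱼ), and the result can be compared against the exact marginals obtained by enumeration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy model: log p̃(h) = a1*h1 + a2*h2 + w*h1*h2 with binary h1, h2.
# q factorizes as q(h) = q1(h1) q2(h2); mu_i denotes q_i(h_i = 1).
a1, a2, w = 0.5, -1.0, 2.0

mu1, mu2 = 0.5, 0.5
for _ in range(100):
    # q1(h1) ∝ exp(E_{h2~q2}[log p̃(h)])  =>  mu1 = sigmoid(a1 + w*mu2), and similarly for mu2
    mu1 = sigmoid(a1 + w * mu2)
    mu2 = sigmoid(a2 + w * mu1)

# Exact marginals by enumerating the four joint configurations, for comparison
h = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
logp = a1 * h[:, 0] + a2 * h[:, 1] + w * h[:, 0] * h[:, 1]
p = np.exp(logp - logp.max())
p /= p.sum()
print("mean field:", mu1, mu2)
print("exact     :", p[h[:, 0] == 1].sum(), p[h[:, 1] == 1].sum())
```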

Example: Latent h ∈ R² and one visible variable v