Learning MN Parameters with Approximation. Sargur Srihari


Learning MN Parameters with Approximation. Sargur Srihari, srihari@cedar.buffalo.edu

Topics
- Iterative exact learning of MN parameters
- Difficulty with exact methods
- Approximate methods
  - Approximate inference: belief propagation, MAP-based learning
  - Alternative objective functions: pseudo-likelihood, contrastive objectives (contrastive divergence, margin-based)

Exact Learning of MN Parameters θ
The model is the log-linear Markov network
$P_\theta(\chi) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^k \theta_i f_i(\chi) \right\}$, with partition function
$Z(\theta) = \sum_\xi \exp\left\{ \sum_i \theta_i f_i(\xi) \right\}$
The learning loop:
1. Initialize θ.
2. Run inference: compute $Z(\theta)$ and the expected feature counts
   $E_\theta[f_i] = \frac{1}{Z(\theta)} \sum_\xi f_i(\xi) \exp\left\{ \sum_j \theta_j f_j(\xi) \right\}$
3. Compute the gradient of the log-likelihood ℓ:
   $\frac{\partial}{\partial \theta_i} \ell(\theta; D) = E_D[f_i(\chi)] - E_\theta[f_i]$
4. Update $\theta^{t+1} \leftarrow \theta^t + \eta\, \nabla \ell(\theta^t; D)$.
5. If the optimum is not reached ($\|\theta^{t+1} - \theta^t\| > \delta$), return to step 2; otherwise stop.
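As a concrete illustration of this loop, here is a minimal Python sketch for a toy log-linear model over three binary variables. The features, data set, and step size are hypothetical, and "inference" is brute-force enumeration rather than a real inference engine:

```python
import itertools
import numpy as np

# Hypothetical features f_i(xi) over assignments xi of 3 binary variables.
features = [
    lambda xi: float(xi[0] == xi[1]),   # X0 and X1 agree
    lambda xi: float(xi[1] == xi[2]),   # X1 and X2 agree
]

def feature_vec(xi):
    return np.array([f(xi) for f in features])

data = [(0, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 0)]      # toy data set D
assignments = list(itertools.product([0, 1], repeat=3))
F = np.array([feature_vec(xi) for xi in assignments])

emp = np.mean([feature_vec(xi) for xi in data], axis=0)  # E_D[f_i(chi)]
theta = np.zeros(len(features))                          # initialize theta
eta, delta = 0.5, 1e-6

for t in range(100_000):
    # "Run inference": exact Z(theta) and E_theta[f_i] by enumeration.
    w = np.exp(F @ theta)
    Z = w.sum()
    model = (w[:, None] * F).sum(axis=0) / Z
    step = eta * (emp - model)          # eta * gradient of log-likelihood
    theta = theta + step                # gradient-ascent update
    if np.linalg.norm(step) <= delta:   # ||theta^{t+1} - theta^t|| <= delta
        break

print(theta)   # at convergence, E_theta[f_i] matches E_D[f_i]
```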

Difficulty with Exact Methods
Exact parameter estimation methods assume the ability to compute
1. the partition function Z(θ), and
2. the expectations $E_{P_\theta}[f_i]$.
In many applications the structure of the network does not allow exact computation of these terms. In image segmentation, for example, grid networks lead to exponential-size clusters for exact inference: the cluster graph is a clique tree with overlapping factors.

Approximate Methods for Learning θ
1. Use approximate inference for queries of $P_\theta$.
   - This decouples inference from learning: inference is a black box.
   - But the approximation may interfere with learning: non-convergence of inference can lead to oscillating estimates of the gradient and no convergence of learning.
2. Use an approximate objective function whose optimization doesn't require much inference.
   - Approximately optimizing the likelihood function can be reformulated as exactly optimizing an approximate objective.

Approximate Methods (Taxonomy)
1. Approximate inference methods
   1. Belief propagation
   2. MAP-based learning
2. Alternative objective functions
   1. Pseudo-likelihood and its generalizations
   2. Contrastive optimization criteria
      1. Contrastive divergence
      2. Margin-based training

Approximate Inference: Belief Propagation
A popular approach to approximate inference is belief propagation. Given a model resulting from a learning procedure, an algorithm from this family is used to answer queries. Notably, a model trained with the same inference algorithm performs better than a model trained with exact inference! BP is run within each iteration of gradient ascent to compute the expected feature counts $E_{P_\theta}[f_i]$.

Example: Direct (Gibbs) vs. BP (Clique Tree)
1. Gibbs distribution over the loop A-B-C-D:
$\tilde{P}_\Phi(A,B,C,D) = \phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A)$
$P(A,B,C,D) = \frac{1}{Z}\,\phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A)$, where
$Z = \sum_{A,B,C,D} \phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A) = 7{,}201{,}840$
2. Clique tree (triangulated): clusters $C_1 = \{A,B,D\}$ and $C_2 = \{B,C,D\}$ with sepset $S_{1,2} = \{B,D\}$. Initial potentials (each ψ has every factor involving its arguments):
$\psi_1(A,B,D)\,\psi_2(B,C,D) = \phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A)$
Computing clique beliefs $\beta_i$ and sepset beliefs $\mu_{i,j}$:
$\beta_1(A,B,D) = \tilde{P}_\Phi(A,B,D) = \sum_C \phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A)$, e.g., $\beta_1(a^0,b^0,d^0) = 300{,}000 + 300{,}000 = 600{,}000$
$\mu_{1,2}(B,D) = \sum_A \beta_1(A,B,D)$, e.g., $\mu_{1,2}(b^0,d^0) = 600{,}000 + 200 = 600{,}200$
$\beta_2(B,C,D) = \tilde{P}_\Phi(B,C,D) = \sum_A \tilde{P}_\Phi(A,B,C,D)$, e.g., $\beta_2(b^0,c^0,d^0) = 300{,}000 + 100 = 300{,}100$
Verifying inference:
$\tilde{P}_\Phi(a^1,b^0,c^1,d^0) = \frac{\beta_1(a^1,b^0,d^0)\,\beta_2(b^0,c^1,d^0)}{\mu_{1,2}(b^0,d^0)} = 100$
(The slide's table listing all clique and sepset beliefs is not reproduced here.)
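To make the clique-belief equations concrete, here is a brute-force check in Python. The factor tables are hypothetical placeholders (random integers), not the slide's actual tables; the identity being verified is the one above:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Placeholder factor tables phi1(A,B), ..., phi4(D,A) with random entries.
phi1, phi2, phi3, phi4 = (rng.integers(1, 10, (2, 2)) for _ in range(4))

def p_tilde(a, b, c, d):
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Clique and sepset beliefs as unnormalized marginals of p_tilde.
beta1 = {(a, b, d): sum(p_tilde(a, b, c, d) for c in (0, 1))
         for a, b, d in itertools.product((0, 1), repeat=3)}
beta2 = {(b, c, d): sum(p_tilde(a, b, c, d) for a in (0, 1))
         for b, c, d in itertools.product((0, 1), repeat=3)}
mu12 = {(b, d): sum(beta1[a, b, d] for a in (0, 1))
        for b, d in itertools.product((0, 1), repeat=2)}

# Verify: p_tilde(a,b,c,d) = beta1(a,b,d) * beta2(b,c,d) / mu12(b,d),
# which holds here because A and C are separated given {B, D}.
for a, b, c, d in itertools.product((0, 1), repeat=4):
    assert np.isclose(p_tilde(a, b, c, d),
                      beta1[a, b, d] * beta2[b, c, d] / mu12[b, d])
```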

Difficulty with Belief Propagation
In principle, BP is run in each gradient-ascent iteration to compute the $E_{P_\theta}[f_i]$ used in the gradient computation. By the family preservation property, each feature must be a subset of a cluster $C_i$ in the cluster graph; hence to compute $E_{P_\theta}[f_i]$ we can use the BP marginals over $C_i$. But BP often does not converge: the derived marginals oscillate, and the final results depend on where we stop. As a result the computed gradient is unstable, which hurts the convergence properties of gradient ascent. The problem is even more severe with line search.

Convergent Alternatives to Belief Propagation
Three methods:
1. Pseudo-moment matching: reformulate the task of learning with approximate inference as optimizing an alternative objective.
2. Maximum-entropy approximation (CAMEL): a general derivation that reformulates maximum likelihood with BP as a unified optimization problem with an approximate objective.
3. MAP-based learning: approximate the expected feature counts with their counts in the single MAP assignment of the current MN.

Pseudo-moment Matching
Begin with an analysis of the fixed points of learning. The convergence point is a set of beliefs that match the data: converged BP beliefs must satisfy
$\beta_i(c_i) = \hat{P}(c_i)$, or equivalently $E_{\beta_i(C_i)}[f_{C_i}] = E_D[f_{C_i}]$
We then define, for each sepset $S_{i,j}$ between $C_i$ and $C_j$,
$\phi_i \leftarrow \frac{\beta_i}{\mu_{i,j}}$
and use the final set of potentials as the parameterization of the Markov network. Example: $C_1 = \{A,B,D\}$ and $C_2 = \{B,C,D\}$ with $S_{1,2} = \{B,D\}$.
This provides a closed-form solution for both inference and learning, but it cannot be used with parameter regularization, non-table factors, or CRFs.
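A small numerical check of this closed form, assuming for convenience that the empirical distribution $\hat{P}$ is given as a full table over (A,B,C,D) (in practice one would only have its clique and sepset marginals):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
p_hat = rng.random((2, 2, 2, 2))   # stand-in empirical distribution hat-P
p_hat /= p_hat.sum()

# Empirical clique and sepset marginals play the role of converged beliefs.
beta1 = p_hat.sum(axis=2)          # hat-P(A,B,D): sum out C
beta2 = p_hat.sum(axis=0)          # hat-P(B,C,D): sum out A
mu12 = p_hat.sum(axis=(0, 2))      # hat-P(B,D)

# Closed-form potentials: phi1 = beta1 / mu12 on C1, phi2 = beta2 on C2.
q = np.zeros_like(p_hat)
for a, b, c, d in itertools.product((0, 1), repeat=4):
    q[a, b, c, d] = beta1[a, b, d] / mu12[b, d] * beta2[b, c, d]

# Q matches hat-P on both cliques (though not the full joint in general).
assert np.allclose(q.sum(axis=2), beta1)
assert np.allclose(q.sum(axis=0), beta2)
```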

BP and Maximum Entropy
The maximum-entropy dual of maximum likelihood:
Find $Q(\chi)$ maximizing $H_Q(\chi)$ subject to $E_Q[f_i] = E_D[f_i]$, $i = 1, \ldots, k$.
This is made tractable by approximating both $H_Q$ and $E_Q$:
- $H_Q$: given a cluster graph $U$ with clusters $C_i$ and sepsets $S_{i,j}$,
$H_Q(\chi) \approx \sum_{C_i \in U} H_{\beta_i}(C_i) - \sum_{(C_i - C_j) \in U} H_{\mu_{i,j}}(S_{i,j})$
- $E_Q[f_i]$: replace by $E_{\beta_i}[f_i]$.
The approximation is exact when the cluster graph is a tree. The method is known as CAMEL (Constrained Approximate Maximum Entropy Learning).

Example of Max-Ent Learning
Pairwise Markov network over binary variables A, B, C with three clusters: $C_1 = \{A,B\}$, $C_2 = \{B,C\}$, $C_3 = \{C,A\}$. Log-linear model with features (for x, y an instance of $C_i$):
$f_{00}(x,y) = 1$ if $x = 0, y = 0$, and 0 otherwise
$f_{11}(x,y) = 1$ if $x = 1, y = 1$, and 0 otherwise
Three data instances of (A,B,C): (0,0,0), (0,1,0), (1,0,0). The unnormalized feature counts, pooled over all clusters, are
$E_{\hat{P}}[f_{00}] = (3 + 1 + 1)/3 = 5/3$
$E_{\hat{P}}[f_{11}] = (0 + 0 + 0)/3 = 0$
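The pooled counts above can be reproduced directly; a short sketch (the variable names are mine):

```python
# Pooled feature counts for the three instances of (A, B, C).
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0)]
clusters = [(0, 1), (1, 2), (2, 0)]     # C1={A,B}, C2={B,C}, C3={C,A}

def pooled_count(data, value):
    # Count cluster instantiations equal to (value, value), pooled over
    # all three clusters, averaged over the data set.
    total = sum(1 for inst in data for (i, j) in clusters
                if inst[i] == value and inst[j] == value)
    return total / len(data)

print(pooled_count(data, 0))   # E[f00] = (3 + 1 + 1) / 3 = 5/3
print(pooled_count(data, 1))   # E[f11] = 0
```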

CAMEL Optimization Problem
The optimization problem takes the form of maximizing the approximate entropy subject to two types of constraints:
- Type 1: moment matching, $E_Q[f_i] = E_D[f_i]$, $i = 1, \ldots, k$.
- Type 2: marginal consistency constraints from the cluster-graph approximation.

CAMEL Solutions
CAMEL optimization is a constrained maximization problem with linear constraints and a non-concave objective. There are several solution algorithms; one of them introduces Lagrange multipliers for all the constraints and optimizes over the resulting new variables.

Sampling-based Learning
The partition function Z(θ) is a summation over an exponentially large space. Reformulate it as an expectation with respect to a distribution Q(χ):
$Z(\theta) = \sum_\xi \exp\left\{ \sum_i \theta_i f_i(\xi) \right\} = \sum_\xi Q(\xi)\, \frac{\exp\left\{ \sum_i \theta_i f_i(\xi) \right\}}{Q(\xi)} = E_Q\left[ \frac{1}{Q(\chi)} \exp\left\{ \sum_i \theta_i f_i(\chi) \right\} \right]$
This is precisely the form of an importance sampling estimator: we can generate samples from Q and correct by weights. Simplify by choosing Q to be $P_{\theta^0}$ for some $\theta^0$:
$Z(\theta) = Z(\theta^0)\, E_{P_{\theta^0}}\left[ \exp\left\{ \sum_i (\theta_i - \theta^0_i)\, f_i(\chi) \right\} \right]$

Importance Sampling
Samples $\{z^{(l)}\}$ are drawn from a simpler proposal distribution q(z):
$E[f] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \approx \frac{1}{L} \sum_{l=1}^L \frac{p(z^{(l)})}{q(z^{(l)})}\, f(z^{(l)})$
Unlike rejection sampling, all of the samples are retained. Samples are weighted by the ratios $r_l = p(z^{(l)})/q(z^{(l)})$, known as importance weights, which correct the bias introduced by sampling from the wrong distribution.
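A minimal sketch of this estimator, with a hypothetical 1-D target p = N(1,1), proposal q = N(0,4), and f(z) = z²:

```python
import numpy as np

rng = np.random.default_rng(0)

p = lambda z: np.exp(-0.5 * (z - 1.0) ** 2) / np.sqrt(2 * np.pi)    # target N(1,1)
q = lambda z: np.exp(-0.125 * z ** 2) / (2.0 * np.sqrt(2 * np.pi))  # proposal N(0,4)
f = lambda z: z ** 2

L = 100_000
z = rng.normal(0.0, 2.0, size=L)   # samples z^(l) drawn from q
r = p(z) / q(z)                    # importance weights r_l
print(np.mean(r * f(z)))           # ~ E_p[f] = Var + mean^2 = 2
```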

Samples to Approximate ln Z(θ)
Choose Q to be $P_{\theta^0}$ for some parameters $\theta^0$. Sample instances $\xi_1, \ldots, \xi_K$ from $P_{\theta^0}$ to approximate the log-partition function as
$\ln Z(\theta) \approx \ln Z(\theta^0) + \ln \frac{1}{K} \sum_{k=1}^K \exp\left\{ \sum_i (\theta_i - \theta^0_i)\, f_i(\xi_k) \right\}$
Plug this into
$\frac{1}{M}\, \ell(\theta : D) = \sum_i \theta_i\, E_D[f_i] - \ln Z(\theta)$
and optimize: use a sampling procedure to generate samples from the current parameter set $\theta^t$, then use gradient ascent to find a $\theta^{t+1}$ that improves the log-likelihood based on those samples.
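A sketch of this approximation for a toy three-variable model. Because the model is tiny, exact samples from $P_{\theta^0}$ are drawn by enumeration as a stand-in for an MCMC sampler; the features and parameter values are hypothetical:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
assignments = np.array(list(itertools.product([0, 1], repeat=3)))

def feats(x):   # hypothetical features: pairwise agreements
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

F = np.array([feats(x) for x in assignments])
theta0 = np.array([0.2, -0.1])   # sampling parameters theta^0
theta = np.array([0.8, 0.5])     # parameters whose ln Z we want

w0 = np.exp(F @ theta0)
Z0 = w0.sum()                    # exact Z(theta^0), used for the comparison
idx = rng.choice(len(F), size=50_000, p=w0 / Z0)   # xi_k ~ P_theta0

# ln Z(theta) ~ ln Z(theta0) + ln (1/K) sum_k exp{(theta - theta0)^T f(xi_k)}
estimate = np.log(Z0) + np.log(np.mean(np.exp(F[idx] @ (theta - theta0))))
exact = np.log(np.exp(F @ theta).sum())
print(estimate, exact)           # the two values should be close
```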

MAP-based Learning
Another approach to inference in learning: approximate the expected feature counts with the counts in the single MAP assignment of the current MN. The approximate gradient at θ is
$E_D[f_i(\chi)] - f_i\!\left(\xi^{MAP}(\theta)\right)$
where $\xi^{MAP}(\theta) = \arg\max_\xi P(\xi \mid \theta)$ is the MAP assignment given the current set of parameters θ. This approach is also called Viterbi training. It is equivalent to exact optimization of the approximate objective
$\frac{1}{M} \sum_m \ln \tilde{P}(\xi[m] \mid \theta) - \ln \tilde{P}\!\left(\xi^{MAP}(\theta) \mid \theta\right)$
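A sketch of one approximate-gradient evaluation, again on a toy model where the MAP assignment can be found by enumeration (features and data are hypothetical):

```python
import itertools
import numpy as np

def feats(x):   # hypothetical features over 3 binary variables
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

assignments = list(itertools.product([0, 1], repeat=3))
data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
theta = np.array([0.3, -0.2])

emp = np.mean([feats(x) for x in data], axis=0)            # E_D[f_i(chi)]
xi_map = max(assignments, key=lambda x: theta @ feats(x))  # MAP assignment
grad = emp - feats(xi_map)     # approximate gradient E_D[f_i] - f_i(xi_MAP)
print(xi_map, grad)
```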

Alternative Objectives
A class of approximations is obtained by replacing the objective with a more tractable one. In the case of a single instance ξ:
$\ell(\theta : \xi) = \ln \tilde{P}(\xi \mid \theta) - \ln Z(\theta)$
Expanding the partition function:
$\ell(\theta : \xi) = \ln \tilde{P}(\xi \mid \theta) - \ln \sum_{\xi'} \tilde{P}(\xi' \mid \theta)$
Maximizing ℓ means increasing the distance (contrast) between the two terms:
- First term: the unnormalized measure (log-measure) of ξ.
- Second term: an aggregate measure of all instances, summing over all possible values of the dummy variable ξ'.

Contrastive Objective
Since the log-measure increases with the parameters, we increase the parameters with positive empirical expectations in ξ and decrease the parameters with negative empirical expectations. However, the second term balances the first. The key difficulty: the second term sums over exponentially many instances in Val(χ) and requires inference in the network. Approach: increase the log-measure of the data instances relative to a more tractable set of other instances, one not requiring summation over an exponential space.

Two Approaches to Increasing the Probability Gap
1. Pseudolikelihood and its generalizations: simplify the likelihood by replacing it with a product of local conditional likelihoods.
2. Contrastive optimization: contrast the data with a randomly perturbed set of neighbors.

Pseudolikelihood for Tractability
Consider the likelihood of a single instance ξ. By the product rule (e.g., $P(x_1, x_2) = P(x_1)\,P(x_2 \mid x_1)$ and $P(x_1, x_2, x_3) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)$),
$P(\xi) = \prod_{j=1}^n P(x_j \mid x_1, \ldots, x_{j-1})$
Approximate this by conditioning each $x_j$ on all the other variables:
$P(\xi) \approx \prod_{j=1}^n P(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n) = \prod_j P(x_j \mid x_{-j})$
where $x_{-j}$ stands for $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n$. This gives the pseudolikelihood objective
$\ell_{PL}(\theta : D) = \frac{1}{M} \sum_m \sum_j \ln P\!\left(x_j[m] \mid x_{-j}[m], \theta\right)$
which eliminates the exponential summation, since
$P(x_j \mid x_{-j}) = \frac{P(x_j, x_{-j})}{P(x_{-j})} = \frac{\tilde{P}(x_j, x_{-j})}{\sum_{x_j'} \tilde{P}(x_j', x_{-j})}$
requires only a summation over $X_j$.
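A minimal sketch of the pseudolikelihood objective for a toy log-linear model; each conditional is computed with a sum over the single variable $X_j$ only (features and data are hypothetical):

```python
import numpy as np

def feats(x):   # hypothetical features over 3 binary variables
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

def log_p_tilde(x, theta):
    return theta @ feats(x)      # ln of the unnormalized measure

def pseudo_ll(data, theta):
    total = 0.0
    for x in data:
        for j in range(len(x)):
            # ln P(x_j | x_-j) = ln p~(x) - ln sum_{x_j'} p~(x_j', x_-j)
            alts = []
            for v in (0, 1):     # only a sum over the single variable X_j
                x_alt = list(x)
                x_alt[j] = v
                alts.append(log_p_tilde(x_alt, theta))
            total += log_p_tilde(x, theta) - np.logaddexp(alts[0], alts[1])
    return total / len(data)

data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
print(pseudo_ll(data, np.array([0.5, 0.5])))
```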

Pseudolikelihood is Concave
The pseudolikelihood objective of a single data instance ξ:
$\sum_j \ln P(x_j \mid x_{-j}) = \sum_j \left[ \ln \tilde{P}(x_j, x_{-j}) - \ln \sum_{x_j'} \tilde{P}(x_j', x_{-j}) \right] = \sum_j \left[ \ln \tilde{P}(\xi) - \ln \sum_{x_j'} \tilde{P}(x_j', x_{-j}) \right]$
where the second term plays the role of the partition function. Each term in the summation simplifies to
$\ln P(x_j \mid x_{-j}) = \sum_{i: X_j \in Scope[f_i]} \theta_i f_i(x_j, x_{-j}) - \ln \sum_{x_j'} \exp\left\{ \sum_{i: X_j \in Scope[f_i]} \theta_i f_i(x_j', x_{-j}) \right\}$
Each term is a log-conditional-likelihood term for a MN over a single variable $X_j$ conditioned on the rest, so the function is concave in the parameters θ. Its gradient is
$\frac{\partial}{\partial \theta_i} \ln P(x_j \mid x_{-j}) = f_i(x_j, x_{-j}) - E_{x_j' \sim P_\theta(X_j \mid x_{-j})}\left[ f_i(x_j', x_{-j}) \right]$

Contrastive Optimization
Recall the likelihood of a single instance:
$\ell(\theta : \xi) = \ln \tilde{P}(\xi \mid \theta) - \ln \sum_{\xi'} \tilde{P}(\xi' \mid \theta)$
Both likelihood and pseudolikelihood attempt to increase the log-probability gap between D and the probability of a set of other instances. Based on this intuition, a range of methods has been developed: by driving the probability of the observed data higher relative to other instances, we tune the parameters to predict the data better.

Contrastive Optimization Criteria
These criteria aim to maximize the log-probability gap. Consider a single training instance ξ and maximize the contrastive objective
$\ln \tilde{P}(\xi \mid \theta) - \ln \tilde{P}(\xi' \mid \theta)$
where ξ' is some other instance. This takes the simple form
$\theta^T \left[ f(\xi) - f(\xi') \right]$
which is a linear function of θ and hence unbounded, so the choice of ξ' has to change throughout the optimization. Two methods for choosing ξ':
1. Contrastive divergence
2. Margin-based training

Contrastive Divergence
The popularity of this method has grown; it is used in deep learning for training layers of RBMs. Contrast the data instances D with a set of randomly perturbed neighbors D⁻: maximize
$\ell_{CD}(\theta : D \,\|\, D^-) = E_{\xi \sim \tilde{P}_D}\left[ \ln \tilde{P}_\theta(\xi) \right] - E_{\xi \sim \tilde{P}_{D^-}}\left[ \ln \tilde{P}_\theta(\xi) \right]$
where $\tilde{P}_D$ and $\tilde{P}_{D^-}$ are the empirical distributions relative to D and D⁻. We want the model to give high probability to the instances in D relative to the perturbed instances in D⁻.

Generating D⁻ for Contrastive Divergence
The contrasted instances D⁻ differ at different stages of the search. Given the current parameters θ, generate samples D⁻ from $P_\theta$ using Gibbs sampling: initialize from the instances in D and run the Markov chain only a few steps to define D⁻. The gradient of the objective is
$\frac{\partial}{\partial \theta_i}\, \ell_{CD}(\theta : D \,\|\, D^-) = E_{\tilde{P}_D}\left[ f_i(\chi) \right] - E_{\tilde{P}_{D^-}}\left[ f_i(\chi) \right]$
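A sketch of this procedure on a toy model: D⁻ is produced by one Gibbs sweep started from the data under the current θ, and the CD gradient is the difference of feature averages (features, data, and step size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def feats(x):   # hypothetical features over 3 binary variables
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

def gibbs_sweep(x, theta):
    # Resample each X_j given the rest under P_theta (one sweep).
    x = list(x)
    for j in range(len(x)):
        logit = {}
        for v in (0, 1):
            x[j] = v
            logit[v] = theta @ feats(x)
        p1 = 1.0 / (1.0 + np.exp(logit[0] - logit[1]))  # P(X_j=1 | x_-j)
        x[j] = int(rng.random() < p1)
    return tuple(x)

data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
theta = np.array([0.2, 0.2])

for _ in range(100):   # CD learning loop
    neg = [gibbs_sweep(x, theta) for x in data]   # D-: Gibbs steps from D
    grad = (np.mean([feats(x) for x in data], axis=0)
            - np.mean([feats(x) for x in neg], axis=0))  # E_PD[f]-E_PD-[f]
    theta = theta + 0.1 * grad

print(theta)
```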

Margin-Based Training
When the goal is MAP assignment, an SVM-based method can be used. The training set consists of pairs $D = \{(y[m], x[m])\}_{m=1}^M$. Given observation x[m], we would like the learned model to give the highest probability to y[m]. Maximize the margin
$\ln P_\theta(y[m] \mid x[m]) - \max_{y \neq y[m]} \ln P_\theta(y \mid x[m])$
i.e., the difference between the log-probability of the target assignment y[m] and that of the next-best assignment.
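A small sketch of the margin computation for one pair (y[m], x[m]): since the log-partition function ln Z(x[m]) is shared by every y, it cancels in the margin, so only unnormalized scores are needed (the feature function here is hypothetical):

```python
import itertools
import numpy as np

def score(y, x, theta):
    # Hypothetical joint features over binary y = (y1, y2) and scalar x.
    f = np.array([float(y[0] == y[1]), y[0] * x, y[1] * x])
    return theta @ f         # ln P_theta(y | x) up to the constant ln Z(x)

theta = np.array([0.5, 1.0, -0.3])
x_m, y_m = 1.0, (1, 1)       # one training pair (y[m], x[m])

others = [y for y in itertools.product((0, 1), repeat=2) if y != y_m]
margin = score(y_m, x_m, theta) - max(score(y, x_m, theta) for y in others)
print(margin)   # positive iff y[m] outscores the next-best assignment
```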