Learning MN Parameters with Approximation. Sargur Srihari


Learning MN Parameters with Approximation. Sargur Srihari, srihari@cedar.buffalo.edu

Topics
- Iterative exact learning of MN parameters
- Difficulty with exact methods
- Approximate methods
  - Approximate inference: belief propagation, MAP-based learning
  - Alternative objective functions: pseudo-likelihood, contrastive objectives (contrastive divergence, margin-based)

Exact Learning of MN Parameters θ
The model is the log-linear Markov network
$P_\theta(\chi) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^k \theta_i f_i(\chi) \right\}$, with partition function
$Z(\theta) = \sum_\xi \exp\left\{ \sum_i \theta_i f_i(\xi) \right\}$
The learning loop:
1. Initialize θ.
2. Run inference: compute $Z(\theta)$ and the expected feature counts
   $E_\theta[f_i] = \frac{1}{Z(\theta)} \sum_\xi f_i(\xi) \exp\left\{ \sum_j \theta_j f_j(\xi) \right\}$
3. Compute the gradient of the log-likelihood ℓ:
   $\frac{\partial}{\partial \theta_i} \ell(\theta; D) = E_D[f_i(\chi)] - E_\theta[f_i]$
4. Update $\theta^{t+1} \leftarrow \theta^t + \eta\, \nabla \ell(\theta^t; D)$.
5. If the optimum is not reached ($\|\theta^{t+1} - \theta^t\| > \delta$), return to step 2; otherwise stop.
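As a concrete illustration of this loop, here is a minimal Python sketch for a toy log-linear model over three binary variables. The features, data set, and step size are hypothetical, and "inference" is brute-force enumeration rather than a real inference engine:

```python
import itertools
import numpy as np

# Hypothetical features f_i(xi) over assignments xi of 3 binary variables.
features = [
    lambda xi: float(xi[0] == xi[1]),   # X0 and X1 agree
    lambda xi: float(xi[1] == xi[2]),   # X1 and X2 agree
]

def feature_vec(xi):
    return np.array([f(xi) for f in features])

data = [(0, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 0)]      # toy data set D
assignments = list(itertools.product([0, 1], repeat=3))
F = np.array([feature_vec(xi) for xi in assignments])

emp = np.mean([feature_vec(xi) for xi in data], axis=0)  # E_D[f_i(chi)]
theta = np.zeros(len(features))                          # initialize theta
eta, delta = 0.5, 1e-6

for t in range(100_000):
    # "Run inference": exact Z(theta) and E_theta[f_i] by enumeration.
    w = np.exp(F @ theta)
    Z = w.sum()
    model = (w[:, None] * F).sum(axis=0) / Z
    step = eta * (emp - model)          # eta * gradient of log-likelihood
    theta = theta + step                # gradient-ascent update
    if np.linalg.norm(step) <= delta:   # ||theta^{t+1} - theta^t|| <= delta
        break

print(theta)   # at convergence, E_theta[f_i] matches E_D[f_i]
```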

Difficulty with Exact Methods
Exact parameter estimation methods assume the ability to compute
1. the partition function Z(θ), and
2. the expectations $E_{P_\theta}[f_i]$.
In many applications the structure of the network does not allow exact computation of these terms. In image segmentation, for example, grid networks lead to exponential-size clusters for exact inference: the cluster graph is a clique tree with overlapping factors.

Approximate Methods for Learning θ
1. Use approximate inference for queries of $P_\theta$.
   - This decouples inference from learning: inference is a black box.
   - But the approximation may interfere with learning: non-convergence of inference can lead to oscillating estimates of the gradient and no convergence of learning.
2. Use an approximate objective function whose optimization doesn't require much inference.
   - Approximately optimizing the likelihood function can be reformulated as exactly optimizing an approximate objective.

Approximate Methods (Taxonomy)
1. Approximate inference methods
   1. Belief propagation
   2. MAP-based learning
2. Alternative objective functions
   1. Pseudo-likelihood and its generalizations
   2. Contrastive optimization criteria
      1. Contrastive divergence
      2. Margin-based training

Approximate Inference: Belief Propagation
A popular approach to approximate inference is belief propagation. Given a model resulting from a learning procedure, an algorithm from this family is used to answer queries. Notably, a model trained with the same inference algorithm performs better than a model trained with exact inference! BP is run within each iteration of gradient ascent to compute the expected feature counts $E_{P_\theta}[f_i]$.

Example: Direct (Gibbs) vs. BP (Clique Tree)
1. Gibbs distribution over the loop A-B-C-D:
$\tilde{P}_\Phi(A,B,C,D) = \phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A)$
$P(A,B,C,D) = \frac{1}{Z}\,\phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A)$, where
$Z = \sum_{A,B,C,D} \phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A) = 7{,}201{,}840$
2. Clique tree (triangulated): clusters $C_1 = \{A,B,D\}$ and $C_2 = \{B,C,D\}$ with sepset $S_{1,2} = \{B,D\}$. Initial potentials (each ψ has every factor involving its arguments):
$\psi_1(A,B,D)\,\psi_2(B,C,D) = \phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A)$
Computing clique beliefs $\beta_i$ and sepset beliefs $\mu_{i,j}$:
$\beta_1(A,B,D) = \tilde{P}_\Phi(A,B,D) = \sum_C \phi_1(A,B)\,\phi_2(B,C)\,\phi_3(C,D)\,\phi_4(D,A)$, e.g., $\beta_1(a^0,b^0,d^0) = 300{,}000 + 300{,}000 = 600{,}000$
$\mu_{1,2}(B,D) = \sum_A \beta_1(A,B,D)$, e.g., $\mu_{1,2}(b^0,d^0) = 600{,}000 + 200 = 600{,}200$
$\beta_2(B,C,D) = \tilde{P}_\Phi(B,C,D) = \sum_A \tilde{P}_\Phi(A,B,C,D)$, e.g., $\beta_2(b^0,c^0,d^0) = 300{,}000 + 100 = 300{,}100$
Verifying inference:
$\tilde{P}_\Phi(a^1,b^0,c^1,d^0) = \frac{\beta_1(a^1,b^0,d^0)\,\beta_2(b^0,c^1,d^0)}{\mu_{1,2}(b^0,d^0)} = 100$
(The slide's table listing all clique and sepset beliefs is not reproduced here.)
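To make the clique-belief equations concrete, here is a brute-force check in Python. The factor tables are hypothetical placeholders (random integers), not the slide's actual tables; the identity being verified is the one above:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Placeholder factor tables phi1(A,B), ..., phi4(D,A) with random entries.
phi1, phi2, phi3, phi4 = (rng.integers(1, 10, (2, 2)) for _ in range(4))

def p_tilde(a, b, c, d):
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Clique and sepset beliefs as unnormalized marginals of p_tilde.
beta1 = {(a, b, d): sum(p_tilde(a, b, c, d) for c in (0, 1))
         for a, b, d in itertools.product((0, 1), repeat=3)}
beta2 = {(b, c, d): sum(p_tilde(a, b, c, d) for a in (0, 1))
         for b, c, d in itertools.product((0, 1), repeat=3)}
mu12 = {(b, d): sum(beta1[a, b, d] for a in (0, 1))
        for b, d in itertools.product((0, 1), repeat=2)}

# Verify: p_tilde(a,b,c,d) = beta1(a,b,d) * beta2(b,c,d) / mu12(b,d),
# which holds here because A and C are separated given {B, D}.
for a, b, c, d in itertools.product((0, 1), repeat=4):
    assert np.isclose(p_tilde(a, b, c, d),
                      beta1[a, b, d] * beta2[b, c, d] / mu12[b, d])
```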

Difficulty with Belief Propagation
In principle, BP is run in each gradient-ascent iteration to compute the $E_{P_\theta}[f_i]$ used in the gradient computation. By the family preservation property, each feature must be a subset of a cluster $C_i$ in the cluster graph; hence to compute $E_{P_\theta}[f_i]$ we can use the BP marginals over $C_i$. But BP often does not converge: the derived marginals oscillate, and the final results depend on where we stop. As a result the computed gradient is unstable, which hurts the convergence properties of gradient ascent. The problem is even more severe with line search.

Convergent Alternatives to Belief Propagation
Three methods:
1. Pseudo-moment matching: reformulate the task of learning with approximate inference as optimizing an alternative objective.
2. Maximum-entropy approximation (CAMEL): a general derivation that reformulates maximum likelihood with BP as a unified optimization problem with an approximate objective.
3. MAP-based learning: approximate the expected feature counts with their counts in the single MAP assignment of the current MN.

Pseudo-moment Matching
Begin with an analysis of the fixed points of learning. The convergence point is a set of beliefs that match the data: converged BP beliefs must satisfy
$\beta_i(c_i) = \hat{P}(c_i)$, or equivalently $E_{\beta_i(C_i)}[f_{C_i}] = E_D[f_{C_i}]$
We then define, for each sepset $S_{i,j}$ between $C_i$ and $C_j$,
$\phi_i \leftarrow \frac{\beta_i}{\mu_{i,j}}$
and use the final set of potentials as the parameterization of the Markov network. Example: $C_1 = \{A,B,D\}$ and $C_2 = \{B,C,D\}$ with $S_{1,2} = \{B,D\}$.
This provides a closed-form solution for both inference and learning, but it cannot be used with parameter regularization, non-table factors, or CRFs.
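A small numerical check of this closed form, assuming for convenience that the empirical distribution $\hat{P}$ is given as a full table over (A,B,C,D) (in practice one would only have its clique and sepset marginals):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
p_hat = rng.random((2, 2, 2, 2))   # stand-in empirical distribution hat-P
p_hat /= p_hat.sum()

# Empirical clique and sepset marginals play the role of converged beliefs.
beta1 = p_hat.sum(axis=2)          # hat-P(A,B,D): sum out C
beta2 = p_hat.sum(axis=0)          # hat-P(B,C,D): sum out A
mu12 = p_hat.sum(axis=(0, 2))      # hat-P(B,D)

# Closed-form potentials: phi1 = beta1 / mu12 on C1, phi2 = beta2 on C2.
q = np.zeros_like(p_hat)
for a, b, c, d in itertools.product((0, 1), repeat=4):
    q[a, b, c, d] = beta1[a, b, d] / mu12[b, d] * beta2[b, c, d]

# Q matches hat-P on both cliques (though not the full joint in general).
assert np.allclose(q.sum(axis=2), beta1)
assert np.allclose(q.sum(axis=0), beta2)
```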

BP and Maximum Entropy
The maximum-entropy dual of maximum likelihood:
Find $Q(\chi)$ maximizing $H_Q(\chi)$ subject to $E_Q[f_i] = E_D[f_i]$, $i = 1, \ldots, k$.
This is made tractable by approximating both $H_Q$ and $E_Q$:
- $H_Q$: given a cluster graph $U$ with clusters $C_i$ and sepsets $S_{i,j}$,
$H_Q(\chi) \approx \sum_{C_i \in U} H_{\beta_i}(C_i) - \sum_{(C_i - C_j) \in U} H_{\mu_{i,j}}(S_{i,j})$
- $E_Q[f_i]$: replace by $E_{\beta_i}[f_i]$.
The approximation is exact when the cluster graph is a tree. The method is known as CAMEL (Constrained Approximate Maximum Entropy Learning).

Example of Max-Ent Learning
Pairwise Markov network over binary variables A, B, C with three clusters: $C_1 = \{A,B\}$, $C_2 = \{B,C\}$, $C_3 = \{C,A\}$. Log-linear model with features (for x, y an instance of $C_i$):
$f_{00}(x,y) = 1$ if $x = 0, y = 0$, and 0 otherwise
$f_{11}(x,y) = 1$ if $x = 1, y = 1$, and 0 otherwise
Three data instances of (A,B,C): (0,0,0), (0,1,0), (1,0,0). The unnormalized feature counts, pooled over all clusters, are
$E_{\hat{P}}[f_{00}] = (3 + 1 + 1)/3 = 5/3$
$E_{\hat{P}}[f_{11}] = (0 + 0 + 0)/3 = 0$
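The pooled counts above can be reproduced directly; a short sketch (the variable names are mine):

```python
# Pooled feature counts for the three instances of (A, B, C).
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0)]
clusters = [(0, 1), (1, 2), (2, 0)]     # C1={A,B}, C2={B,C}, C3={C,A}

def pooled_count(data, value):
    # Count cluster instantiations equal to (value, value), pooled over
    # all three clusters, averaged over the data set.
    total = sum(1 for inst in data for (i, j) in clusters
                if inst[i] == value and inst[j] == value)
    return total / len(data)

print(pooled_count(data, 0))   # E[f00] = (3 + 1 + 1) / 3 = 5/3
print(pooled_count(data, 1))   # E[f11] = 0
```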

CAMEL Optimization Problem
The optimization problem takes the form of maximizing the approximate entropy subject to two types of constraints:
- Type 1: moment matching, $E_Q[f_i] = E_D[f_i]$, $i = 1, \ldots, k$.
- Type 2: marginal consistency constraints from the cluster-graph approximation.

CAMEL Solutions
CAMEL optimization is a constrained maximization problem with linear constraints and a non-concave objective. There are several solution algorithms; one of them introduces Lagrange multipliers for all the constraints and optimizes over the resulting new variables.

Sampling-based Learning
The partition function Z(θ) is a summation over an exponentially large space. Reformulate it as an expectation with respect to a distribution Q(χ):
$Z(\theta) = \sum_\xi \exp\left\{ \sum_i \theta_i f_i(\xi) \right\} = \sum_\xi Q(\xi)\, \frac{\exp\left\{ \sum_i \theta_i f_i(\xi) \right\}}{Q(\xi)} = E_Q\left[ \frac{1}{Q(\chi)} \exp\left\{ \sum_i \theta_i f_i(\chi) \right\} \right]$
This is precisely the form of an importance sampling estimator: we can generate samples from Q and correct by weights. Simplify by choosing Q to be $P_{\theta^0}$ for some $\theta^0$:
$Z(\theta) = Z(\theta^0)\, E_{P_{\theta^0}}\left[ \exp\left\{ \sum_i (\theta_i - \theta^0_i)\, f_i(\chi) \right\} \right]$

Importance Sampling
Samples $\{z^{(l)}\}$ are drawn from a simpler proposal distribution q(z):
$E[f] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \approx \frac{1}{L} \sum_{l=1}^L \frac{p(z^{(l)})}{q(z^{(l)})}\, f(z^{(l)})$
Unlike rejection sampling, all of the samples are retained. Samples are weighted by the ratios $r_l = p(z^{(l)})/q(z^{(l)})$, known as importance weights, which correct the bias introduced by sampling from the wrong distribution.
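A minimal sketch of this estimator, with a hypothetical 1-D target p = N(1,1), proposal q = N(0,4), and f(z) = z²:

```python
import numpy as np

rng = np.random.default_rng(0)

p = lambda z: np.exp(-0.5 * (z - 1.0) ** 2) / np.sqrt(2 * np.pi)    # target N(1,1)
q = lambda z: np.exp(-0.125 * z ** 2) / (2.0 * np.sqrt(2 * np.pi))  # proposal N(0,4)
f = lambda z: z ** 2

L = 100_000
z = rng.normal(0.0, 2.0, size=L)   # samples z^(l) drawn from q
r = p(z) / q(z)                    # importance weights r_l
print(np.mean(r * f(z)))           # ~ E_p[f] = Var + mean^2 = 2
```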

Samples to Approximate ln Z(θ)
Choose Q to be $P_{\theta^0}$ for some parameters $\theta^0$. Sample instances $\xi_1, \ldots, \xi_K$ from $P_{\theta^0}$ to approximate the log-partition function as
$\ln Z(\theta) \approx \ln Z(\theta^0) + \ln \frac{1}{K} \sum_{k=1}^K \exp\left\{ \sum_i (\theta_i - \theta^0_i)\, f_i(\xi_k) \right\}$
Plug this into
$\frac{1}{M}\, \ell(\theta : D) = \sum_i \theta_i\, E_D[f_i] - \ln Z(\theta)$
and optimize: use a sampling procedure to generate samples from the current parameter set $\theta^t$, then use gradient ascent to find a $\theta^{t+1}$ that improves the log-likelihood based on those samples.
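A sketch of this approximation for a toy three-variable model. Because the model is tiny, exact samples from $P_{\theta^0}$ are drawn by enumeration as a stand-in for an MCMC sampler; the features and parameter values are hypothetical:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
assignments = np.array(list(itertools.product([0, 1], repeat=3)))

def feats(x):   # hypothetical features: pairwise agreements
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

F = np.array([feats(x) for x in assignments])
theta0 = np.array([0.2, -0.1])   # sampling parameters theta^0
theta = np.array([0.8, 0.5])     # parameters whose ln Z we want

w0 = np.exp(F @ theta0)
Z0 = w0.sum()                    # exact Z(theta^0), used for the comparison
idx = rng.choice(len(F), size=50_000, p=w0 / Z0)   # xi_k ~ P_theta0

# ln Z(theta) ~ ln Z(theta0) + ln (1/K) sum_k exp{(theta - theta0)^T f(xi_k)}
estimate = np.log(Z0) + np.log(np.mean(np.exp(F[idx] @ (theta - theta0))))
exact = np.log(np.exp(F @ theta).sum())
print(estimate, exact)           # the two values should be close
```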

MAP-based Learning
Another approach to inference in learning: approximate the expected feature counts with the counts in the single MAP assignment of the current MN. The approximate gradient at θ is
$E_D[f_i(\chi)] - f_i\!\left(\xi^{MAP}(\theta)\right)$
where $\xi^{MAP}(\theta) = \arg\max_\xi P(\xi \mid \theta)$ is the MAP assignment given the current set of parameters θ. This approach is also called Viterbi training. It is equivalent to exact optimization of the approximate objective
$\frac{1}{M} \sum_m \ln \tilde{P}(\xi[m] \mid \theta) - \ln \tilde{P}\!\left(\xi^{MAP}(\theta) \mid \theta\right)$
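A sketch of one approximate-gradient evaluation, again on a toy model where the MAP assignment can be found by enumeration (features and data are hypothetical):

```python
import itertools
import numpy as np

def feats(x):   # hypothetical features over 3 binary variables
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

assignments = list(itertools.product([0, 1], repeat=3))
data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
theta = np.array([0.3, -0.2])

emp = np.mean([feats(x) for x in data], axis=0)            # E_D[f_i(chi)]
xi_map = max(assignments, key=lambda x: theta @ feats(x))  # MAP assignment
grad = emp - feats(xi_map)     # approximate gradient E_D[f_i] - f_i(xi_MAP)
print(xi_map, grad)
```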

Alternative Objectives
A class of approximations is obtained by replacing the objective with a more tractable one. In the case of a single instance ξ:
$\ell(\theta : \xi) = \ln \tilde{P}(\xi \mid \theta) - \ln Z(\theta)$
Expanding the partition function:
$\ell(\theta : \xi) = \ln \tilde{P}(\xi \mid \theta) - \ln \sum_{\xi'} \tilde{P}(\xi' \mid \theta)$
Maximizing ℓ means increasing the distance (contrast) between the two terms:
- First term: the unnormalized measure (log-measure) of ξ.
- Second term: an aggregate measure of all instances, summing over all possible values of the dummy variable ξ'.

Contrastive Objective
Since the log-measure increases with the parameters, we increase the parameters with positive empirical expectations in ξ and decrease the parameters with negative empirical expectations. However, the second term balances the first. The key difficulty: the second term sums over exponentially many instances in Val(χ) and requires inference in the network. Approach: increase the log-measure of the data instances relative to a more tractable set of other instances, one not requiring summation over an exponential space.

Two Approaches to Increasing the Probability Gap
1. Pseudolikelihood and its generalizations: simplify the likelihood by replacing it with a product of local conditional likelihoods.
2. Contrastive optimization: contrast the data with a randomly perturbed set of neighbors.

Pseudolikelihood for Tractability
Consider the likelihood of a single instance ξ. By the product rule (e.g., $P(x_1, x_2) = P(x_1)\,P(x_2 \mid x_1)$ and $P(x_1, x_2, x_3) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)$),
$P(\xi) = \prod_{j=1}^n P(x_j \mid x_1, \ldots, x_{j-1})$
Approximate this by conditioning each $x_j$ on all the other variables:
$P(\xi) \approx \prod_{j=1}^n P(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n) = \prod_j P(x_j \mid x_{-j})$
where $x_{-j}$ stands for $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n$. This gives the pseudolikelihood objective
$\ell_{PL}(\theta : D) = \frac{1}{M} \sum_m \sum_j \ln P\!\left(x_j[m] \mid x_{-j}[m], \theta\right)$
which eliminates the exponential summation, since
$P(x_j \mid x_{-j}) = \frac{P(x_j, x_{-j})}{P(x_{-j})} = \frac{\tilde{P}(x_j, x_{-j})}{\sum_{x_j'} \tilde{P}(x_j', x_{-j})}$
requires only a summation over $X_j$.
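A minimal sketch of the pseudolikelihood objective for a toy log-linear model; each conditional is computed with a sum over the single variable $X_j$ only (features and data are hypothetical):

```python
import numpy as np

def feats(x):   # hypothetical features over 3 binary variables
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

def log_p_tilde(x, theta):
    return theta @ feats(x)      # ln of the unnormalized measure

def pseudo_ll(data, theta):
    total = 0.0
    for x in data:
        for j in range(len(x)):
            # ln P(x_j | x_-j) = ln p~(x) - ln sum_{x_j'} p~(x_j', x_-j)
            alts = []
            for v in (0, 1):     # only a sum over the single variable X_j
                x_alt = list(x)
                x_alt[j] = v
                alts.append(log_p_tilde(x_alt, theta))
            total += log_p_tilde(x, theta) - np.logaddexp(alts[0], alts[1])
    return total / len(data)

data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
print(pseudo_ll(data, np.array([0.5, 0.5])))
```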

Pseudolikelihood is Concave
The pseudolikelihood objective of a single data instance ξ:
$\sum_j \ln P(x_j \mid x_{-j}) = \sum_j \left[ \ln \tilde{P}(x_j, x_{-j}) - \ln \sum_{x_j'} \tilde{P}(x_j', x_{-j}) \right] = \sum_j \left[ \ln \tilde{P}(\xi) - \ln \sum_{x_j'} \tilde{P}(x_j', x_{-j}) \right]$
where the second term plays the role of the partition function. Each term in the summation simplifies to
$\ln P(x_j \mid x_{-j}) = \sum_{i: X_j \in Scope[f_i]} \theta_i f_i(x_j, x_{-j}) - \ln \sum_{x_j'} \exp\left\{ \sum_{i: X_j \in Scope[f_i]} \theta_i f_i(x_j', x_{-j}) \right\}$
Each term is a log-conditional-likelihood term for a MN over a single variable $X_j$ conditioned on the rest, so the function is concave in the parameters θ. Its gradient is
$\frac{\partial}{\partial \theta_i} \ln P(x_j \mid x_{-j}) = f_i(x_j, x_{-j}) - E_{x_j' \sim P_\theta(X_j \mid x_{-j})}\left[ f_i(x_j', x_{-j}) \right]$

Contrastive Optimization
Recall the likelihood of a single instance:
$\ell(\theta : \xi) = \ln \tilde{P}(\xi \mid \theta) - \ln \sum_{\xi'} \tilde{P}(\xi' \mid \theta)$
Both likelihood and pseudolikelihood attempt to increase the log-probability gap between D and the probability of a set of other instances. Based on this intuition, a range of methods has been developed: by driving the probability of the observed data higher relative to other instances, we tune the parameters to predict the data better.

Contrastive Optimization Criteria
These criteria aim to maximize the log-probability gap. Consider a single training instance ξ and maximize the contrastive objective
$\ln \tilde{P}(\xi \mid \theta) - \ln \tilde{P}(\xi' \mid \theta)$
where ξ' is some other instance. This takes the simple form
$\theta^T \left[ f(\xi) - f(\xi') \right]$
which is a linear function of θ and hence unbounded, so the choice of ξ' has to change throughout the optimization. Two methods for choosing ξ':
1. Contrastive divergence
2. Margin-based training

Contrastive Divergence
The popularity of this method has grown; it is used in deep learning for training layers of RBMs. Contrast the data instances D with a set of randomly perturbed neighbors D⁻: maximize
$\ell_{CD}(\theta : D \,\|\, D^-) = E_{\xi \sim \tilde{P}_D}\left[ \ln \tilde{P}_\theta(\xi) \right] - E_{\xi \sim \tilde{P}_{D^-}}\left[ \ln \tilde{P}_\theta(\xi) \right]$
where $\tilde{P}_D$ and $\tilde{P}_{D^-}$ are the empirical distributions relative to D and D⁻. We want the model to give high probability to the instances in D relative to the perturbed instances in D⁻.

Generating D⁻ for Contrastive Divergence
The contrasted instances D⁻ differ at different stages of the search. Given the current parameters θ, generate samples D⁻ from $P_\theta$ using Gibbs sampling: initialize from the instances in D and run the Markov chain only a few steps to define D⁻. The gradient of the objective is
$\frac{\partial}{\partial \theta_i}\, \ell_{CD}(\theta : D \,\|\, D^-) = E_{\tilde{P}_D}\left[ f_i(\chi) \right] - E_{\tilde{P}_{D^-}}\left[ f_i(\chi) \right]$
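A sketch of this procedure on a toy model: D⁻ is produced by one Gibbs sweep started from the data under the current θ, and the CD gradient is the difference of feature averages (features, data, and step size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def feats(x):   # hypothetical features over 3 binary variables
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

def gibbs_sweep(x, theta):
    # Resample each X_j given the rest under P_theta (one sweep).
    x = list(x)
    for j in range(len(x)):
        logit = {}
        for v in (0, 1):
            x[j] = v
            logit[v] = theta @ feats(x)
        p1 = 1.0 / (1.0 + np.exp(logit[0] - logit[1]))  # P(X_j=1 | x_-j)
        x[j] = int(rng.random() < p1)
    return tuple(x)

data = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]
theta = np.array([0.2, 0.2])

for _ in range(100):   # CD learning loop
    neg = [gibbs_sweep(x, theta) for x in data]   # D-: Gibbs steps from D
    grad = (np.mean([feats(x) for x in data], axis=0)
            - np.mean([feats(x) for x in neg], axis=0))  # E_PD[f]-E_PD-[f]
    theta = theta + 0.1 * grad

print(theta)
```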

Margin-Based Training
When the goal is MAP assignment, an SVM-based method can be used. The training set consists of pairs $D = \{(y[m], x[m])\}_{m=1}^M$. Given observation x[m], we would like the learned model to give the highest probability to y[m]. Maximize the margin
$\ln P_\theta(y[m] \mid x[m]) - \max_{y \neq y[m]} \ln P_\theta(y \mid x[m])$
i.e., the difference between the log-probability of the target assignment y[m] and that of the next-best assignment.
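A small sketch of the margin computation for one pair (y[m], x[m]): since the log-partition function ln Z(x[m]) is shared by every y, it cancels in the margin, so only unnormalized scores are needed (the feature function here is hypothetical):

```python
import itertools
import numpy as np

def score(y, x, theta):
    # Hypothetical joint features over binary y = (y1, y2) and scalar x.
    f = np.array([float(y[0] == y[1]), y[0] * x, y[1] * x])
    return theta @ f         # ln P_theta(y | x) up to the constant ln Z(x)

theta = np.array([0.5, 1.0, -0.3])
x_m, y_m = 1.0, (1, 1)       # one training pair (y[m], x[m])

others = [y for y in itertools.product((0, 1), repeat=2) if y != y_m]
margin = score(y_m, x_m, theta) - max(score(y, x_m, theta) for y in others)
print(margin)   # positive iff y[m] outscores the next-best assignment
```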