Learning Parameters of Undirected Models. Sargur N. Srihari, srihari@cedar.buffalo.edu

Topics
- Difficulties due to Global Normalization
- Likelihood Function
- Maximum Likelihood Parameter Estimation
- Simple and Conjugate Gradient Ascent
- Conditionally Trained Models
- Learning with Missing Data: EM with Gradient Ascent
- Maximum Entropy and Maximum Likelihood
- Parameter Priors and Regularization

Local vs. Global Normalization. A BN uses local normalization within each CPD:
$$P(X_1,\dots,X_n) = \prod_{i=1}^n P(X_i \mid \mathrm{Pa}_i)$$
An MN uses global normalization via the partition function:
$$P(X_1,\dots,X_n) = \frac{1}{Z}\prod_{i=1}^K \phi_i(D_i), \qquad Z = \sum_{X_1,\dots,X_n}\prod_{i=1}^K \phi_i(D_i)$$
The global factor $Z$ couples all parameters, preventing decomposition. This has significant computational ramifications: maximum-likelihood parameter estimation has no closed-form solution, and iterative methods are needed.
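To make the coupling concrete, here is a minimal Python sketch (not from the slides; the factor values are made up) that computes $Z$ for a small chain network A - B - C by brute-force enumeration. Because $Z$ sums over all joint assignments, changing any entry of either factor changes every probability.

```python
# A minimal sketch: brute-force partition function for a 3-variable chain
# A - B - C with two potentials. Factor values are illustrative only.
import itertools
import numpy as np

phi1 = np.array([[10.0, 1.0], [1.0, 5.0]])   # phi1(A,B); rows: a, cols: b
phi2 = np.array([[2.0, 1.0], [1.0, 8.0]])    # phi2(B,C); rows: b, cols: c

# Z sums the unnormalized measure over ALL joint assignments,
# which is why it couples the parameters of phi1 and phi2.
Z = sum(phi1[a, b] * phi2[b, c]
        for a, b, c in itertools.product([0, 1], repeat=3))

def P(a, b, c):
    """Globally normalized Gibbs distribution P(a,b,c)."""
    return phi1[a, b] * phi2[b, c] / Z

print(Z, sum(P(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3)))
```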

Issues in Parameter Estimation. Simple ML parameter estimation (even with complete data) cannot be solved in closed form; we need iterative methods such as gradient ascent. Good news: the likelihood function is concave, so these methods converge to the global optimum. Bad news: each step of the iterative algorithm requires inference, making even simple parameter estimation expensive or intractable. Bayesian estimation is practically infeasible.

Discriminative Training. A common use of MNs is in settings such as image segmentation where we have a particular inference task in mind. We can train the network discriminatively to get good performance on that inference task.

Likelihood Function. The likelihood function is the basis for all discussion of learning: we study how it can be optimized to find maximum-likelihood parameter estimates. We begin with the form of the likelihood function for Markov networks, its properties, and their computational implications.

Example of a Coupled Likelihood Function. Consider a simple Markov network A - B - C with two potentials $\phi_1(A,B)$ and $\phi_2(B,C)$, so the Gibbs distribution is $P(A,B,C) = \frac{1}{Z}\phi_1(A,B)\,\phi_2(B,C)$. The log-likelihood of an instance $(a,b,c)$ is $\ln P(a,b,c) = \ln\phi_1(a,b) + \ln\phi_2(b,c) - \ln Z$. The log-likelihood of a data set $D$ with $M$ instances is
$$\ell(\theta:D) = \sum_{m=1}^M \big(\ln\phi_1(a[m],b[m]) + \ln\phi_2(b[m],c[m]) - \ln Z(\theta)\big) = \sum_{a,b} M[a,b]\,\ln\phi_1(a,b) + \sum_{b,c} M[b,c]\,\ln\phi_2(b,c) - M\ln Z(\theta)$$
where the parameter vector $\theta$ consists of all entries of the factors $\phi_1$ and $\phi_2$; the first sum is over instances, and the remaining sums are over values of the variables. The first term involves only $\phi_1$ and the second only $\phi_2$, but the third,
$$\ln Z(\theta) = \ln\sum_{a,b,c}\phi_1(a,b)\,\phi_2(b,c),$$
couples the two potentials in the likelihood function: when we change one potential $\phi_1$, $Z(\theta)$ changes, possibly changing the value of $\phi_2$ that maximizes $-\ln Z(\theta)$.

Illustration of the Coupled Likelihood. [Figure: log-likelihood surface for the network A - B - C as a function of two of the parameters, $f_1(a^1,b^1)$ and $f_2(b^0,c^1)$, with all other parameters set to 1; the data set has $M = 100$ with counts $M[a^1,b^1] = 40$ and $M[b^0,c^1] = 40$. With binary variables there are 8 factor entries in total.] The surface shows that when $\phi_1$ changes, the optimal $\phi_2$ also changes. In this particular case the problem is avoidable: the network is equivalent to the BN $A \to B \to C$, and we can estimate the parameters as $\phi_1(A,B) = P(A)P(B \mid A)$ and $\phi_2(B,C) = P(C \mid B)$. In general, however, we cannot convert learned BN parameters into an equivalent MN: the optimal likelihood achievable by the two representations is not the same.

Form of the Likelihood Function. Instead of the Gibbs parameterization, use the log-linear framework. For a joint distribution over $n$ variables $X_1,\dots,X_n$ with $k$ features $F = \{f_i(D_i)\}_{i=1,\dots,k}$, where each $D_i$ is a sub-graph and $f_i$ maps $D_i$ to $\mathbb{R}$:
$$P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\left\{\sum_{i=1}^k \theta_i f_i(D_i)\right\}$$
The parameters $\theta_i$ are weights we put on the features. If we have a sample $\xi$, its features are $f_i(\xi(D_i))$, with the shorthand $f_i(\xi)$. This representation is general: it can capture Markov networks with both global and local structure.

Parameters θ, Factors ϕ, and Binary Features f. Variables A, B, C, D are arranged in a loop A - B - C - D - A. Define a feature for every entry in every table: the $f_i(D_i)$ are sixteen indicator functions defined over the clusters AB, BC, CD, DA, e.g.
$$f_{a^0,b^0}(A,B) = I\{A = a^0\}\,I\{B = b^0\}$$
i.e., $f_{a^0,b^0} = 1$ if $a = a^0$ and $b = b^0$, and 0 otherwise, with $\mathrm{Val}(A) = \{a^0, a^1\}$ and $\mathrm{Val}(B) = \{b^0, b^1\}$. With the log-linear representation
$$P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\left\{\sum_{i=1}^k \theta_i f_i(D_i)\right\}$$
we set $\theta_{a^0,b^0} = \ln\phi_1(a^0,b^0)$, and so on: the parameters $\theta$ are the log-potentials, the weights placed on the features.

A    B    ϕ1(A,B)
a0   b0   ϕ1(a0,b0)
a0   b1   ϕ1(a0,b1)
a1   b0   ϕ1(a1,b0)
a1   b1   ϕ1(a1,b1)
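A minimal sketch (with an assumed encoding, not from the slides) of how indicator features plus weights $\theta = \ln\phi$ reproduce the table parameterization:

```python
# One binary indicator feature per entry of the table phi1(A,B), so that
# theta_{a,b} = ln phi1(a,b) recovers the table value for any assignment.
import numpy as np

def make_indicator(a_val, b_val):
    # f_{a,b}(A,B) = I{A=a_val} * I{B=b_val}
    return lambda a, b: float(a == a_val and b == b_val)

features = {(a, b): make_indicator(a, b) for a in (0, 1) for b in (0, 1)}

phi1 = np.array([[10.0, 1.0], [1.0, 5.0]])       # illustrative table
theta = {k: np.log(phi1[k]) for k in features}   # theta_{a,b} = ln phi1(a,b)

# exp(sum_i theta_i f_i(a,b)) recovers phi1(a,b):
a, b = 1, 0
val = np.exp(sum(theta[k] * features[k](a, b) for k in features))
print(val, phi1[a, b])   # both equal phi1(a1, b0)
```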

Log-Likelihood and Sufficient Statistics. Joint probability distribution:
$$P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\left\{\sum_{i=1}^k \theta_i f_i(D_i)\right\}$$
where $\theta = \{\theta_1,\dots,\theta_k\}$ are the table entries and the $f_i$ are features over instances of the $D_i$. Let $D$ be a data set of $M$ samples $\xi[m]$, $m = 1,\dots,M$. The log-likelihood, the log of the product of the probabilities of $M$ independent instances, is
$$\ell(\theta:D) = \sum_i \theta_i \sum_m f_i(\xi[m]) - M\ln Z(\theta)$$
Dividing by the number of samples,
$$\frac{1}{M}\ell(\theta:D) = \sum_i \theta_i\, E_D[f_i(d_i)] - \ln Z(\theta)$$
where $E_D[f_i(d_i)]$ is the average of $f_i$ in the data set. The likelihood depends on the data only through these sufficient statistics.
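For illustration, a minimal sketch (toy data with made-up counts) of computing a sufficient statistic $E_D[f_i]$ as an empirical mean:

```python
# The sufficient statistics E_D[f_i] are just the empirical means of the
# features over the data set D.

# Toy data set of M samples of binary (a, b); counts are made up.
D = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 40
M = len(D)

def f_00(a, b):          # indicator feature I{a=0} I{b=0}
    return float(a == 0 and b == 0)

E_D = sum(f_00(a, b) for a, b in D) / M
print(E_D)   # 0.4 -- the empirical frequency of the event (a0, b0)
```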

Properties of the Log-Likelihood. The log-likelihood is a sum of two functions:
$$\ell(\theta:D) = \underbrace{\sum_i \theta_i \sum_m f_i(\xi[m])}_{\text{first term}} \; - \; \underbrace{M\ln Z(\theta)}_{\text{second term}}$$
The first term is linear in the parameters $\theta$: increasing the parameters increases it. But the likelihood has an upper bound (probabilities are at most 1), and the second term balances the first. The partition function
$$\ln Z(\theta) = \ln\sum_\xi \exp\left\{\sum_i \theta_i f_i(\xi)\right\}$$
is convex, as shown next, so its negative is concave. The sum of a linear function and a concave function is concave, so there are no local optima and we can use gradient ascent.

Proof that ln Z(θ) Is Convex. A function $f(\vec{x})$ is convex if for every $0 \le \alpha \le 1$,
$$f(\alpha\vec{x} + (1-\alpha)\vec{y}) \le \alpha f(\vec{x}) + (1-\alpha)f(\vec{y})$$
The function is bowl-like: every interpolation between the images of two points lies above the image of their interpolation. One way to show that a function is convex is to show that its Hessian (the matrix of the function's second derivatives) is positive semi-definite. The Hessian of $\ln Z(\theta)$ is computed next.

Hessian of ln Z(θ). Given a set of features $F$, with $\ln Z(\theta) = \ln\sum_\xi \exp\{\sum_i \theta_i f_i(\xi)\}$:
$$\frac{\partial}{\partial\theta_i}\ln Z(\theta) = E_\theta[f_i], \qquad \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j]$$
where $E_\theta[f_i]$ is shorthand for $E_{P(\chi;\theta)}[f_i]$. Proof, taking partial derivatives with respect to $\theta_i$ and $\theta_j$:
$$\frac{\partial}{\partial\theta_i}\ln Z(\theta) = \frac{1}{Z(\theta)}\frac{\partial}{\partial\theta_i}\sum_\xi \exp\left\{\sum_j \theta_j f_j(\xi)\right\} = \frac{1}{Z(\theta)}\sum_\xi f_i(\xi)\exp\left\{\sum_j \theta_j f_j(\xi)\right\} = E_\theta[f_i]$$
$$\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \frac{\partial}{\partial\theta_j}E_\theta[f_i] = \mathrm{Cov}_\theta[f_i; f_j]$$
Since the covariance matrix of the features is positive semi-definite, $\ln Z(\theta)$ is convex and $-\ln Z(\theta)$ is a concave function of $\theta$. Corollary: the log-likelihood function is concave.
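The identity $\frac{\partial}{\partial\theta_i}\ln Z(\theta) = E_\theta[f_i]$ can be checked numerically. The following sketch (a toy model with two made-up features, not from the slides) compares a finite-difference gradient of $\ln Z$ against the expected feature vector:

```python
# Verify d/d theta_i ln Z(theta) = E_theta[f_i] by finite differences
# for a toy log-linear model over two binary variables.
import itertools
import numpy as np

def features(xi):                       # xi = (a, b); two toy features
    a, b = xi
    return np.array([float(a == b), float(a == 1)])

def logZ(theta):
    vals = [np.exp(theta @ features(xi))
            for xi in itertools.product([0, 1], repeat=2)]
    return np.log(sum(vals))

def expected_features(theta):
    xs = list(itertools.product([0, 1], repeat=2))
    w = np.array([np.exp(theta @ features(xi)) for xi in xs])
    w /= w.sum()                        # P_theta(xi)
    return sum(p * features(xi) for p, xi in zip(w, xs))

theta = np.array([0.5, -1.0])
eps = 1e-6
grad_fd = np.array([(logZ(theta + eps * e) - logZ(theta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(grad_fd, expected_features(theta))   # should agree closely
```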

Non-unique Solution. Since $\ln Z(\theta)$ is convex, $-\ln Z(\theta)$ is concave, which implies the log-likelihood is unimodal and has no local optima. However, this does not imply uniqueness of the global optimum: multiple parameterizations can result in the same distribution. A feature for every entry in a table is always redundant, e.g. $f_{a^0,b^0} = 1 - f_{a^0,b^1} - f_{a^1,b^0} - f_{a^1,b^1}$, so there is a continuum of equally optimal parameterizations.

Maximum (Conditional) Likelihood Parameter Estimation. Task: estimate the parameters of a Markov network with fixed structure given a fully observable data set $D$. The simplest variant of the problem is maximum-likelihood parameter estimation. The log-likelihood given features $F = \{f_i,\ i = 1,\dots,k\}$ is
$$\ell(\theta:D) = \sum_{i=1}^k \theta_i \sum_m f_i(\xi[m]) - M\ln Z(\theta)$$

Gradient of the Log-Likelihood. The gradient of the log-likelihood is zero at its maximum points. The gradient of the average log-likelihood is
$$\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta:D) = E_D[f_i(\chi)] - E_\theta[f_i]$$
The first term is the average value of $f_i$ in the data $D$; the second is its expected value under the model distribution. This provides a precise characterization of the maximum-likelihood parameters $\hat\theta$:
$$E_D[f_i(\chi)] = E_{\hat\theta}[f_i]$$
i.e., the expected value of each feature relative to $P_{\hat\theta}$ matches its empirical expectation in $D$.

Need for an Iterative Method. Although the function is concave, there is no analytical form for the maximum. Since there is no closed-form solution, we use iterative methods, e.g., gradient ascent, as shown next.

Simple Gradient Descent Procedure.

Procedure Gradient-Descent(θ1, f, δ)
  // θ1: initial starting point
  // f: function to be minimized
  // δ: convergence threshold
  t ← 1
  do
    θ_{t+1} ← θ_t − η ∇f(θ_t)
    t ← t + 1
  while ||θ_t − θ_{t−1}|| > δ
  return θ_t

Intuition: the Taylor expansion of $f(\theta)$ in the neighborhood of $\theta_t$ is
$$f(\theta) \approx f(\theta_t) + (\theta - \theta_t)^T \nabla f(\theta_t)$$
Let $\theta = \theta_{t+1} = \theta_t + h$, so $f(\theta_{t+1}) \approx f(\theta_t) + h^T\nabla f(\theta_t)$. The slope $\nabla f(\theta_t)$ points in the direction of steepest ascent; taking a step of size $\eta$ in the opposite direction decreases the value of $f$.

One-dimensional example: let $f(\theta) = \theta^2$, which has its minimum at $\theta = 0$; we want to find it using gradient descent. We have $f'(\theta) = 2\theta$, and the update is $\theta_{t+1} = \theta_t - \eta f'(\theta_t)$. If $\theta_t > 0$ then $f'(\theta_t) > 0$ and $\theta_{t+1} < \theta_t$; if $\theta_t < 0$ then $f'(\theta_t) = 2\theta_t$ is negative and $\theta_{t+1} > \theta_t$. Either way the update moves toward the minimum. A minimal code rendering follows.
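A minimal Python rendering of the procedure above (not from the slides), applied to the one-dimensional example $f(\theta) = \theta^2$:

```python
# Simple gradient descent on f(theta) = theta^2, whose gradient is 2*theta.
def gradient_descent(theta, grad, eta=0.1, delta=1e-8, max_iter=10000):
    for _ in range(max_iter):
        theta_next = theta - eta * grad(theta)   # step against the gradient
        if abs(theta_next - theta) <= delta:     # convergence test
            return theta_next
        theta = theta_next
    return theta

print(gradient_descent(theta=5.0, grad=lambda t: 2 * t))   # approx. 0
```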

Difficulties with Simple Gradient Descent. Performance depends on the choice of learning rate $\eta$: a large learning rate overshoots, while a small one is extremely slow. A common solution is to start with a large $\eta$ and settle on an optimal value, which requires a schedule for shrinking $\eta$.

Improvements to Gradient Descent: Line Search. Rather than fixing $\eta$ in the procedure above, choose it adaptively at each step. Define a line in the direction of the gradient,
$$g(\eta) = \theta_t + \eta\,\nabla f(\theta_t)$$
and find three points $\eta_1 < \eta_2 < \eta_3$ that bracket the optimum, i.e., such that $f(g(\eta_2))$ is better than at both of the others; then narrow the bracket. Brent's method fits a quadratic through the three bracketing points and jumps to its optimum. [Figure: the solid line is $f(g(\eta))$, three points bracket its maximum, and the dashed line shows the quadratic fit.]
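A simplified sketch of one quadratic-fit line-search step (a stand-in for, not an implementation of, the full Brent's method; the objective and bracketing points are made up):

```python
# One quadratic-fit line-search step along the descent ray
# g(eta) = theta_t - eta * grad, for a toy quadratic objective.
import numpy as np

def quad_fit_step(f, theta_t, grad, etas=(0.0, 1.0, 2.0)):
    """Fit a parabola to f(g(eta)) at three bracketing points and
    return the eta at the parabola's vertex."""
    g = lambda eta: theta_t - eta * grad
    e1, e2, e3 = etas                  # f(g(e2)) should be below both others
    f1, f2, f3 = f(g(e1)), f(g(e2)), f(g(e3))
    a, b, _ = np.polyfit([e1, e2, e3], [f1, f2, f3], deg=2)  # a*x^2+b*x+c
    return -b / (2 * a)                # vertex of the fitted parabola

f = lambda th: np.sum(th ** 2)         # toy objective
theta = np.array([3.0, -2.0])
print(quad_fit_step(f, theta, grad=2 * theta))   # 0.5 is exact here
```

For a quadratic objective the fit is exact, so a single step lands on the line minimum; for general objectives the step is repeated as the bracket shrinks.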

Conjugate Gradient Ascent. In simple gradient ascent, two consecutive gradients are orthogonal: $\nabla f_{obj}(\theta_{t+1})$ is orthogonal to $\nabla f_{obj}(\theta_t)$. Progress is therefore zig-zag, and progress toward the maximum slows down. [Figure: zig-zag trajectories on a quadratic objective $f_{obj}(x,y) = -(x^2 + 10y^2)$ and an exponential one $f_{obj}(x,y) = \exp[-(x^2 + 10y^2)]$.] Solution: remember past directions and bias the new direction $h_t$ as a combination of the current gradient $g_t$ and the past direction $h_{t-1}$. This converges faster than simple gradient ascent.

Computing the Gradient. The gradient of the log-likelihood is
$$E_D[f_i(\chi)] - E_\theta[f_i]$$
the difference between the feature's empirical count in the data $D$ and its expected count relative to our current parameterization. Computing the empirical count $E_D[f_i(\chi)]$ is easy: for example, for the feature $f_{a^0,b^0}(a,b) = I\{a = a^0\}\,I\{b = b^0\}$, it is simply the empirical frequency in $D$ of the event $(a^0, b^0)$.

Computing the Expected Count. We need to compute probabilities of the form $P_{\theta^t}(a,b)$, since the expectation is a probability-weighted average. Computing these probabilities requires running inference over the network, so the computational cost of parameter estimation is very high and plain gradient ascent is not efficient. Much faster convergence is obtained with second-order methods based on the Hessian.

A Standard Iterative Solution.
1. Initialize θ.
2. Run inference (compute $Z(\theta)$).
3. Compute the gradient of $\ell$: $E_D[f_i(\chi)] - E_\theta[f_i]$.
4. Update θ: $\theta_{t+1} \leftarrow \theta_t + \eta\,\nabla f_{obj}(\theta_t)$.
5. If the optimum is not reached, return to step 2; otherwise stop. (See the sketch below.)
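Putting the pieces together, a minimal sketch of this loop (not from the slides) for a toy log-linear MN over two binary variables, using brute-force enumeration as the "run inference" step; the data and features are made up:

```python
# Gradient-ascent ML estimation for a toy log-linear MN,
# with exact inference by enumerating all assignments.
import itertools
import numpy as np

xs = list(itertools.product([0, 1], repeat=2))

def feats(xi):
    a, b = xi
    return np.array([float(a == b), float(b == 1)])   # two toy features

D = [(0, 0)] * 30 + [(1, 1)] * 50 + [(0, 1)] * 20     # toy data, M = 100
E_D = np.mean([feats(xi) for xi in D], axis=0)        # empirical counts

theta = np.zeros(2)                                   # 1. initialize
eta = 0.5
for t in range(2000):
    # 2. "run inference": compute P_theta exactly by enumeration
    w = np.array([np.exp(theta @ feats(xi)) for xi in xs])
    p = w / w.sum()
    E_theta = sum(pi * feats(xi) for pi, xi in zip(p, xs))
    grad = E_D - E_theta                              # 3. gradient of avg ll
    theta += eta * grad                               # 4. ascent update
    if np.max(np.abs(grad)) < 1e-8:                   # 5. optimum reached?
        break

print(theta, E_theta, E_D)
```

At convergence the gradient vanishes, so $E_\theta[f_i]$ matches $E_D[f_i]$: exactly the moment-matching condition derived earlier.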

Iterative Methods for MRF Parameters. Gradient ascent over the parameter space. Good news: the likelihood function is concave, so the method is guaranteed to converge to the global optimum. Bad news: each step needs inference, so even simple parameter estimation is intractable. Bayesian parameter estimation is even harder, with the integration done using MCMC.

Newton's Method. Newton's method finds zeros of a function using its derivatives and is more efficient than simple gradient descent. Since we are solving for a zero of the derivative of $\ell(\theta:D)$, we need the second derivative (the Hessian). A quasi-Newton method instead uses an approximation to the Hessian.

Hessian of the Log-Likelihood. The likelihood has the form
$$\ell(\theta:D) = \sum_i \theta_i \sum_m f_i(\xi[m]) - M\ln Z(\theta)$$
and its Hessian is
$$\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta:D) = -M\,\mathrm{Cov}_\theta[f_i; f_j]$$
This requires the joint expectation of pairs of features, which is often computationally infeasible. L-BFGS (a quasi-Newton algorithm) uses line search to avoid computing the Hessian.
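As an illustration (not from the slides), the same toy objective can be handed to an off-the-shelf quasi-Newton optimizer. The sketch below uses SciPy's L-BFGS-B, supplying only the value and gradient of the negative average log-likelihood:

```python
# Maximize the toy likelihood with L-BFGS (scipy's L-BFGS-B), which
# needs only gradients, never the Hessian. Features/data are made up.
import itertools
import numpy as np
from scipy.optimize import minimize

xs = list(itertools.product([0, 1], repeat=2))
feats = lambda xi: np.array([float(xi[0] == xi[1]), float(xi[1] == 1)])
E_D = np.array([0.8, 0.7])                 # toy empirical expectations

def neg_avg_ll(theta):                     # minimize the NEGATIVE likelihood
    w = np.array([np.exp(theta @ feats(xi)) for xi in xs])
    logZ = np.log(w.sum())
    E_theta = (w[:, None] * np.array([feats(xi) for xi in xs])).sum(0) / w.sum()
    return logZ - theta @ E_D, E_theta - E_D   # value, gradient

res = minimize(neg_avg_ll, np.zeros(2), jac=True, method="L-BFGS-B")
print(res.x)
```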

Conditionally Trained Models. Often we want to perform an inference task where we have a known set of variables, or features, X, and we want to query a pre-determined set of variables Y. Then we prefer discriminative training: train the network as a Conditional Random Field (CRF) that encodes a conditional distribution P(Y | X).

CRF Training. Train the network as a CRF encoding a conditional distribution $P(Y \mid X)$. The training set consists of pairs $D = \{(y[m], x[m])\}$, $m = 1,\dots,M$ (for example, $y$ = word category, $x$ = word). The objective function is the conditional log-likelihood, the joint probability of the observed pairs:
$$\ell_{Y|X}(\theta:D) = \ln P(y[1,\dots,M] \mid x[1,\dots,M], \theta) = \sum_{m=1}^M \ln P(y[m] \mid x[m], \theta)$$
In the large-sample limit this corresponds to maximizing $E_{(x,y)\sim P^*}[\ln P(y \mid x)]$. The objective is concave, since each term is concave.

Gradient of the Conditional Likelihood. A reduced MN (the result of conditioning on $x$) is itself an MN, and we use a log-linear representation with features $f_i$ and parameters $\theta$. Analogous to the gradient for the full MN,
$$\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta:D) = E_D[f_i(\chi)] - E_\theta[f_i]$$
we can write the gradient for the reduced MN:
$$\frac{\partial}{\partial\theta_i}\ell_{Y|X}(\theta:D) = \sum_{m=1}^M \Big( f_i(y[m], x[m]) - E_\theta[f_i \mid x[m]] \Big)$$
The first term is the empirical count conditioned on $x[m]$; the second is based on running inference on each data case.
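A minimal sketch of one term of this conditional gradient for a toy pair of binary labels, brute-forcing the reduced network over Y (the features and values are made up, not from the slides):

```python
# One data case's contribution to the conditional gradient:
# empirical features minus expected features under P_theta(Y | x[m]).
import itertools
import numpy as np

def feats(y, x):
    # toy features over a pair of binary labels y = (y1, y2) and input x
    return np.array([float(y[0] == y[1]), float(y[0] == x)])

def conditional_grad(theta, y_m, x_m):
    ys = list(itertools.product([0, 1], repeat=2))
    w = np.array([np.exp(theta @ feats(y, x_m)) for y in ys])
    p = w / w.sum()                           # P_theta(y | x[m])
    E_cond = sum(pi * feats(y, x_m) for pi, y in zip(p, ys))
    return feats(y_m, x_m) - E_cond           # one term of the gradient

print(conditional_grad(np.array([0.2, -0.1]), y_m=(1, 1), x_m=1))
```

Note that the inference here is over the (small) reduced network for a single $x[m]$, which is exactly why conditional training can be cheaper per step than training the full joint model.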

Conditional Training Is Simpler. Consider a sequence model over $X_1,\dots,X_5$ and $Y_1,\dots,Y_5$. The full MN encodes
$$\tilde P(X,Y) = \prod_{i=1}^4 \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^5 \phi_i(Y_i, X_1, X_2, X_3, X_4, X_5)$$
while the conditional model encodes
$$\tilde P(Y \mid X) = \prod_{i=1}^4 \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^5 \phi_i(Y_i \mid X_1, X_2, X_3, X_4, X_5)$$
Edges disappear in a reduced Markov network: after conditioning on X, the remaining edges form a simple chain. The next slide compares different sequential models (CRF, HMM, and MEMM) with respect to parameter-learning complexity.

Learning Models for Sequence Labeling. Given a sequence of observations $X = \{X_1,\dots,X_k\}$, we need a joint label $Y = \{Y_1,\dots,Y_k\}$. The models trade off expressive power against learnability.

CRF (discriminative): directly obtains
$$P(Y \mid X) = \frac{1}{Z(X)}\tilde P(Y,X), \qquad \tilde P(Y,X) = \prod_{i=1}^{k-1}\phi_i(Y_i, Y_{i+1}) \prod_{i=1}^k \phi_i(Y_i, X_i), \qquad Z(X) = \sum_Y \tilde P(Y,X)$$
Note that $Z(X)$ is the marginal of the un-normalized measure.

MEMM (conditional directed): does not model a distribution over X;
$$P(Y \mid X) = \prod_{i=1}^k P(Y_i \mid X_i, Y_{i-1})$$
By d-separation, $Y_1 \perp X_2$ if $Y_2$ is not given; more generally, $Y_i \perp X_j \mid X_{-j}$ for $j > i$. This causes the label bias problem: a later observation has no effect on the posterior probability of an earlier state. Example: in activity recognition in a video sequence, frames are labeled as running/walking; earlier frames may be blurry and later ones clearer, yet the clear frames cannot correct the earlier labels.

HMM (generative): needs the joint probability
$$P(X,Y) = \prod_{i=1}^k P(X_i \mid Y_i)\,P(Y_i \mid Y_{i-1}), \qquad P(Y \mid X) = \frac{P(X,Y)}{P(X)}$$

CRFs require gradient-based learning, which is expensive and difficult with large data sets. MEMMs and HMMs are more easily learned: as pure directed models, their parameters can be computed in closed form using maximum likelihood.

Parameter Estimation with Missing Data. Examples of missing data: some data fields omitted or not collected, or some variables hidden. The learning problem then has added difficulties: parameters may not be identifiable, there is coupling between different parameters, and the likelihood is not concave (it has local maxima). Available methods: 1. gradient ascent (assuming data is missing at random); 2. expectation-maximization.

Gradient Ascent with Missing Data. Assume data is missing at random in $D$. In the $m$-th instance, let $o[m]$ be the observed entries and $h[m]$ the random variables that are missing. The log-likelihood has the form
$$\frac{1}{M}\ln P(D \mid \theta) = \frac{1}{M}\sum_{m=1}^M \ln\sum_{h[m]} P(o[m], h[m] \mid \theta)$$
and the gradient for feature $f_i$ has the form
$$\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta:D) = \frac{1}{M}\sum_{m=1}^M E_{h[m]\sim P(h[m] \mid o[m],\theta)}[f_i] - E_\theta[f_i]$$
which is a difference between two expectations of $f_i$: an expectation over the data and hidden variables, and an expectation under the current distribution $P(\chi \mid \theta)$.

Comparison with the Full-Data Case. With full data, the gradient of the log-likelihood is
$$\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta:D) = E_D[f_i(\chi)] - E_\theta[f_i]$$
For the second term we need inference over the current distribution $P(\chi \mid \theta)$; the first term is an aggregate over the data $D$. With missing data we have
$$\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta:D) = \frac{1}{M}\sum_{m=1}^M E_{h[m]\sim P(h[m] \mid o[m],\theta)}[f_i] - E_\theta[f_i]$$
so we must run inference separately for every instance $m$, conditioning on $o[m]$. The cost is much higher than with full data.

Use of Expectation-Maximization. As for any probabilistic model, an alternative method of parameter estimation with missing data is EM: the E-step uses the current parameters to estimate the missing values, and the M-step re-estimates the parameters. For BNs, EM has significant advantages: the M-step has a closed-form solution. For MNs the benefit is less clear-cut, because the M-step requires running inference multiple times, as seen next.

EM for MN Parameter Learning. E-step: at iteration $t$, use the current parameters $\theta^t$ to compute the expected sufficient statistics, i.e., the expected feature counts
$$\bar M_{\theta^t}[f_i] = \frac{1}{M}\sum_{m=1}^M E_{h[m]\sim P(h[m] \mid o[m],\theta^t)}[f_i]$$
M-step: maximum-likelihood parameter estimation, done using gradient ascent. This requires running inference multiple times, once for each iteration of the gradient-ascent procedure. At step $k$ of this inner-loop optimization, the gradient has the form
$$\bar M_{\theta^t}[f_i] - E_{\theta^{(t,k)}}[f_i]$$

Maximum Entropy and Maximum Likelihood. Returning to basic maximum-likelihood estimation, an alternative formulation provides insight. Consider the log-linear model
$$P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\left\{\sum_{i=1}^k \theta_i f_i(D_i)\right\}$$
We relate it to finding the distribution of maximum entropy subject to certain constraints.

Motivation for Maximum Entropy. Suppose we have some summary statistics of an empirical distribution but no access to the full data set. Example 1: statistics published in a census report, such as marginal distributions of single variables or of certain pairs, or other events of interest. Example 2: the average final grade of students in a class, and the correlation of final grades with homework scores. We want to find a "typical" distribution that satisfies these constraints.

Maximum Entropy Formulation. Goal: select a distribution that (1) satisfies the given constraints and (2) has no additional structure or information. We formulate the following problem: find $Q(\chi)$ maximizing $H_Q(\chi)$ subject to $E_Q[f_i] = E_D[f_i]$, $i = 1,\dots,k$. These are called expectation constraints.

Solution to Maximum-Entropy Estimation. Somewhat surprisingly, the maximum-entropy distribution $Q^*$ satisfying the expectation constraints $E_Q[f_i] = E_D[f_i]$ is a Gibbs distribution: $Q^* = P_{\hat\theta}$, where
$$P_{\hat\theta}(\chi) = \frac{1}{Z(\hat\theta)}\exp\left\{\sum_i \hat\theta_i f_i(\chi)\right\}$$
and $\hat\theta$ is the maximum-likelihood parameterization relative to $D$.

Duality of Maximum Entropy and Maximum Likelihood. The two problems, (1) maximizing entropy subject to expectation constraints and (2) maximizing likelihood given structural constraints on the distribution, are convex duals of each other. For the maximum-likelihood parameters $\hat\theta$,
$$H_{P_{\hat\theta}}(\chi) = -\frac{1}{M}\,\ell(\hat\theta : D)$$

Parameter Priors and Regularization. Maximum likelihood is prone to over-fitting. We can introduce a prior distribution $P(\theta)$, but due to non-decomposability a fully Bayesian approach is infeasible. Instead we perform MAP estimation: find the parameters that maximize $P(\theta)P(D \mid \theta)$, where $\ln P(D \mid \theta)$ expressed in log-space is
$$\ell(\theta:D) = \sum_i \theta_i \sum_m f_i(\xi[m]) - M\ln Z(\theta)$$
which in turn is derived from the joint distribution
$$P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\left\{\sum_{i=1}^k \theta_i f_i(D_i)\right\}$$

Choice of Prior P(θ). Only a few priors are used in practice: the Gaussian (L2 regularization, a quadratic penalty on the weights) and the Laplacian (L1 regularization). Both priors penalize parameters whose magnitude (positive or negative) is large.

Gaussian Prior and L2 Regularization. The most common prior is a Gaussian on the log-linear parameters θ:
$$P(\theta \mid \sigma^2) = \prod_{i=1}^k \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{\theta_i^2}{2\sigma^2}\right\}$$
for some choice of hyper-parameter $\sigma^2$. Converting the MAP objective $P(\theta)P(D \mid \theta)$ to log-space gives $\ln P(\theta) + \ell(\theta:D)$, whose first term
$$\ln P(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^k \theta_i^2 + \text{const}$$
is called an L2-regularization term.
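A minimal sketch (not from the slides; the per-sample scaling of the prior is absorbed into $\sigma^2$ here as a simplifying assumption) of how the L2 penalty modifies the earlier gradient:

```python
# MAP gradient = likelihood gradient + gradient of the log Gaussian prior.
# The prior term -theta/sigma2 shrinks each parameter toward zero.
import numpy as np

sigma2 = 1.0                                  # hyper-parameter sigma^2

def map_grad(theta, E_D, E_theta):
    # gradient of avg log-likelihood plus log-prior (scaling absorbed)
    return (E_D - E_theta) - theta / sigma2

print(map_grad(np.array([2.0, -1.0]),         # toy values for illustration
               np.array([0.8, 0.7]),
               np.array([0.5, 0.5])))
```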

Laplacian Prior and L1 Regularization. A different prior used in practice is the Laplacian:
$$P_{\text{Laplacian}}(\theta) = \prod_{i=1}^k \frac{1}{2\beta}\exp\left\{-\frac{|\theta_i|}{\beta}\right\}$$
Taking the log, we obtain the term
$$\ln P(\theta) = -\frac{1}{\beta}\sum_{i=1}^k |\theta_i| + \text{const}$$
This is generally called L1 regularization. Both forms of regularization penalize parameters whose magnitude is large. [Figure: densities of the Laplacian distribution (β = 1) and the Gaussian distribution (σ² = 1).]

Why Prefer Low-Magnitude Parameters? The prior serves to pull the distribution toward an uninformed one and to smooth fluctuations in the data. A distribution is smooth if the probabilities it assigns to different assignments are not radically different. Consider two assignments $\xi$ and $\xi'$ (an assignment is an instance of the variables $X_1,\dots,X_n$); we consider the ratio of their probabilities next.

Smoothness Resulting from Small θ. Given two assignments $\xi$ and $\xi'$, their relative probability is
$$\frac{P(\xi)}{P(\xi')} = \frac{\tilde P(\xi)/Z(\theta)}{\tilde P(\xi')/Z(\theta)} = \frac{\tilde P(\xi)}{\tilde P(\xi')}$$
where the un-normalized probabilities are $\tilde P(\xi) = \exp\{\sum_{i=1}^k \theta_i f_i(\xi)\}$. In log-space, the log-probability ratio is
$$\ln\frac{P(\xi)}{P(\xi')} = \sum_{i=1}^k \theta_i f_i(\xi) - \sum_{i=1}^k \theta_i f_i(\xi') = \sum_{i=1}^k \theta_i\big(f_i(\xi) - f_i(\xi')\big)$$
When the $\theta_i$ have small magnitude, this log-ratio is bounded, i.e., the probabilities are similar. This results in a smooth distribution.

Comparison of L1 and L2 Regularization. Both penalize the magnitude of the parameters. In the Gaussian case (L2), the penalty grows quadratically with the parameters: an increase in $\theta_i$ from 3 to 3.1 is penalized more than an increase from 0 to 0.1. This leads to many small parameters. In the Laplacian case (L1), the penalty grows linearly, which drives many parameters to exactly zero; the result is fewer edges in the model, which is more tractable.

Efficiency of Optimization. Both the L1 and L2 regularization terms are concave, and because the log-likelihood is also concave, the resulting posterior objective is concave as well. It can therefore be optimized using gradient methods. Moreover, introducing the penalty terms eliminates the multiple equivalent optima noted earlier.

Choice of Hyper-parameters. The regularization parameters ($\sigma^2$ in the L2 case and $\beta$ in the L1 case) encode our belief that the model weights should be close to zero. The larger these parameters are, the broader our parameter prior. The choice of prior affects the learned model; the standard method of selecting these parameters is cross-validation.