Learning Parameters of Undirected Models Sargur srihari@cedar.buffalo.edu 1
Topics Log-linear Parameterization Likelihood Function Maximum Likelihood Parameter Estimation Simple and Conjugate Gradient Ascent Conditionally Trained Models Learning with Missing Data EM with Gradient Ascent Maximum Entropy and Maximum Likelihood Parameter Priors and Regularization 2
Markov Network Parameters Potentials A D B C (a) φ(d) = exp( ε(d)) ε(d) = lnφ(d) Energy Functions 3
Log-linear MN parameterizations Log-linear (potentials to energy by taking log) Sum of features Good both for hand-coded models and for learning P(X 1,.., X n ) = 1 k Z(θ) exp θ i f i (D i ) i=1 4
Difference bet. BN,MN Parameter Est. MN differs from BN learning since: BN: Local normalization for CPDs n p(x) = ln p(x i pa i ) i=1 MN: Global normalization (partition function) P(X 1,.., X n ) = 1 k Z(θ) exp θ f (D ) k i i i Z(θ) = exp θ i f i (D i ) i=1 i=1 Global factor couples all parameters preventing decomposition Significant computational ramifications M.L. parameter estimation has no closed-form soln. Need iterative methods X 1,..X n 5
Example of Coupled Likelihood Function Simple Markov network A B C Two potentials ϕ 1 (A,B), ϕ 2 (B,C) Log-likelihood of instance (a,b,c) is ln P(a,b,c)=ln ϕ 1 (A,B)+ln ϕ 2 (B,C)-ln Z Log-likelihood of data set D with M samples: l(θ : D) = M m=1 ( lnφ 1 (a[m]) + lnφ 2 (b[m]) ln Z(θ) ) = M[a,b] lnφ 1 (a,b) + M[b,c] lnφ 2 (a,b) M ln Z(θ) a,b b,c First term involves only ϕ 1. Second involves only ϕ 2. But the third term is the log partition function Z(θ) = a,b,c φ 1 (a, b)φ 2 (b,c) Which couples the two potentials in the likelihood function 6
Illustration of Coupled Likelihood Log-likelihood surface when ϕ 1 changes, ϕ 2 also changes We can convert MN to BN: Aà Bà C & estimate parameters of ϕ 1 (A,B)=P(A)P(B A), ϕ 2 (B,C)=P(C B) In general, cannot convert learned BN parameters into equivalent MN Optimal likelihood achievable by the two representations is not the same f 1 (a 1, b 1 ) f 2 (b 0, c 1 ) 7
Log-linear Representation Instead of Gibbs, use log-linear framework Joint distribution of n variables X 1,..X n k features F = { f i (D i ) } i=1,.. k where D i is a sub-graph and f i maps D i to R P(X 1,..X n ;θ) = 1 k Z(θ) exp θ f (D ) i i i i=1 Parameters θ i are weights we put on features If we have a sample ξ then its features are f i (ξ(d i )) which has the shorthand f i (ξ). Representation is general can capture Markov networks with global structure 8 and local structure
Example of binary features Variables: A B C D A feature for every entry in every table f i (D i ) are sixteen indicator functions defined over clusters, AB, BC,CD,DA f a (A, B) = I { A = a 0 0 b }I { B = b 0 } 0 etc. P(X 1,..X n ;θ) = 1 k Z(θ) exp θ i f i (D i ) i=1 With this representation θ a = lnφ ( 0 b 0 1 a 0,b 0 ) A B ϕ 1 (A,B) a 0 b 0 ϕ 1 a0, b0 a 0 b 1 ϕ 1 a0,b1 a 1 b 0 ϕ 1 a1, b0 a 1 b 1 ϕ 1 a1,b1 Val(A)={a 0,a 1 } Val(B)={b 0,b 1 } f a0,b0 =1 if a=a 0,b=b 0 0 otherwise, etc. Parameters θ are potentials which are weights put on features D A C (a) B
Features in Text Analysis Named-entity Recognition BIO: beginning,inside,other Two types of factors: ϕ t1 (Y t,y t+1 ), ϕ t2 (Y t,x 1,..X T ) Factors are derived from a no. of feature functions f t (Y t,x t )=I(Y t =B-organization, X t = Times ) Word is capitalized, does it appear in list of person names Appear on atlas of locations, end with ton, exactly York Is the following word Times Does sentence have more than two sports-related words Hundred of thousands of features
Log Likelihood and Sufficient Statistics Joint probability distribution: P(X 1,..X n ;θ) = 1 k Z(θ) exp θ i f i (D i ) i=1 θ={θ 1,..θ k } are table entries f i are features over instances of D i Let D be a data set of M samples ξ [m] m =1,..M Log-likelihood product of log-probabilities of samples in data set: l(θ : D) = θ i f i ξ[m] i m ( ) M ln Z(θ) Sufficient statistics (likelihood depends only on this) 1 l(θ : D) = M θ ( [ ]) ln Z(θ) i E D f i (d i ) i E D [f i (d i )] is the average in the data set
Properties of Log-Likelihood Log-likelihood is a sum of two functions l(θ : D) = First term is linear in the parameters Increasing parameters increases this term Second term: θ i f i ξ[m] i m balances first term ( ) since likelihood has upper-bound of probability 1 Partition function Z(θ) is convex as seen next» Its negative is concave Sum of linear fn. and concave is concave, So There are no local optima Can use gradient ascent First Term ln Z θ M ln Z(θ) Second Term ( ) = ln exp θ i f i ( ξ) ξ i 12
Proof that Z(θ) is Convex A convex function is bowl-shaped, defined as: A function f (! x) is convex if for every 0 α 1 f (α! x + (1 α )! y) α f (! x) + (1 α ) f (! y) Function is convex if its Hessian is positive-semi-definite Hessian (2 nd derivative) of ln Z(θ) computed as: ln Z θ ( ) = ln exp θ i f i ( ξ) Partial derivatives wrt θ i and θ j ξ i ln Z(θ) = 1 θ i Z(θ) 2 ln Z(θ) = 1 θ i θ j θ j Z(θ) Since covariance matrix of features is pos. semidefinite, we have -ln Z(θ) is a concave function of θ exp θ j f j ( ξ) = 1 f i (ξ)exp θ j f j ( ξ) = E θ [ f i ] ξ θ i j Z(θ) ξ j ξ exp θ k f k ( ξ) = Cov θ [ f i ; f j ] θ i k
Non-unique Solution Since ln Z(θ) is convex, -ln Z(θ) is concave Implies that log-likelihood is unimodal Has no local optima However does not imply uniqueness of global optimum Multiple parameterizations can result in same distribution A feature for every entry in the table is always redundant, e.g., f a0,b0 = 1- f a0,b1 - f a1,b0 - f a1,b1 A continuum of parameterizations 14
Maximum Likelihood Parameter Estimation Task: Estimate parameters of a Markov network with a fixed structure given a fully observable data set D Simplest variant of the problem is maximum likelihood parameter estimation Log-likelihood given features F={ f i, i=1,.. k} is l(θ : D) = θ i f i ξ m i=1 m ( [ ]) M ln Z ( θ ) 15
Gradient of Log-likelihood Gradient of log-likelihood is zero at its maximum points Log-likelihood is l(θ : D) = θ i f i ξ m i=1 ( [ ]) Gradient of the average log-likelihood is m 1 θ i M l(θ : D) = E [ f D i ( χ )] E θ [ f i ] M ln Z ( θ ) Provides a precise characterization of m.l. parameters θ ( ) E D f i χ = E ˆ θ [ f i ] First term is average value of f i in data D. Second term is expected value from distribution Expected value of each feature relative to P θ matches its empirical expectation in D
Need for Iterative Method Although function is concave, there is no analytical form for the maximum Since no closed-form solution Can use iterative methods, e.g. gradient ascent as shown next 17
Simple Gradient Ascent Procedure Gradient-Ascent ( θ 1 f obj δ ) 1 t 1 2 do //Initial starting point //Function to be optimized //Convergence threshold ( ) 3 θ t+1 θ t + η f obj θ t 4 t t +1 5 while θ t θ t 1 > δ ( ) 6 return θ t Intuition Using Taylor s expansion of a function in neighborhood of θ 0, function can be approximated by the linear equation f obj (θ) f obj (θ 0 )+(θ θ 0 ) T f obj (θ 0 ) The slope of this function is f obj (θ 0 ) points to the direction of steepest ascent. If we take a step η in this direction we increase the value of f obj. Line Search Adaptively choose η at each step Define a line in the direction of gradient g ( η) = θ! t + η f obj ( θ t ) Find three points η 1 < η 2 < η 3 so that f obj (g(η 2 )) is larger than at both others η =(η 1 +η 2 )/2 Brent s method: Solid line is f obj (g(η)) Three points bracket its maximum Dashed line shows quadratic fit
Conjugate Gradient Ascent In simple gradient ascent: two consecutive gradients are orthogonal: is orthogonal to f obj (θ t+1 ) f obj (θ t ) Progress is zig-zag: progress to maximum slows down Quadratic: f obj (x,y)=-(x 2 +10y 2 ) Exponential: f obj (x,y)=exp [-(x 2 +10y 2 )] Solution: Remember past directions and bias new direction h t as combination of current g t and past h t-1 Faster convergence than simple gradient ascent
Computing the gradient Gradient of log-likelihood is E D [ f i ( χ )] E θ [ f i ] It is the difference between the feature s empirical count in data D and the expected count relative to our current parameterization Computing the empirical count Ex: for feature f a (a,b) = I { a = a 0 0 b }I b = b 0 0 { } E D [ f i is easy it is the empirical frequency in D of the event a 0,b 0 20 D ( χ )] A C (a) B
Computing the expected count We need to compute the different probabilities of the form P ( θ t a,b) Since expectation is a probability-weighted average Computing probability requires running inference over the network Thus computational cost of parameter estimation is very high Gradient ascent is not efficient Much faster convergence using second order methods based on Hessian
A standard iterative solution Initialize. θ Run inference (compute Z(θ) ) Compute gradient of l. Update θ. E D [ f i ( χ )] E θ [ f i ] θ t+1 θ t + η f obj ( θ t ) No Optimum Reached? Yes Stop 22
Iterative methods for MRF parameters Gradient ascent over parameter space Good news: likelihood function is concave Guaranteed to converge to global optimum Bad news: each step needs inference Simple parameter estimation is intractable Bayesian parameter estimation even harder Integration done using MCMC 23
Newton s method Newton s method finds zeroes of a function using derivatives More efficient than simple gradient descent Quasi Newton method uses an approximation to the gradient Since we are solving for derivative of l(θ,d) need second derivative (Hessian) Newton s Method
Hessian of the log-likelihood Likelihood has the form l(θ : D) = θ i f i ξ m i=1 m Hessian ( [ ]) θ i θ j l θ, D ( ) = M Cov θ f i, f j ( ) Requires joint expectation of two features, often computationally infeasible L-BFGS (a quasi-newton algorithm) uses line 25 search to avoid computing the Hessian M ln Z ( θ )
Conditionally Trained Models Often we want to perform an inference task where we have a known set of variables, or features, X We want to query a pre-determined set of variables Y We prefer to use discriminative training Train the network as a Conditional Random Field (CRF) that encodes a conditional distribution P(Y X) 26
CRF Training Train the network as a CRF that encodes a conditional distribution P(Y X) Conditional log-likelihood Training set consists of pairs D={y[m], x[m]}, m =1,.., M Objective function is conditional log-likelihood l Y X ( ) ( θ : D) = ln P y[ 1,.., M ] x[ 1,.., M ],θ M ( ) = ln P y[ m] x[ m],θ m=1 E (x,y)~p* ( ) log P! y x Example: word category, word It is concave since each term is concave 27 Joint probability of observed pairs
Gradient of Conditional Likelihood A reduced MN is itself an MN. We use log-linear representation with features f i and parameters θ Analogous to gradient for full MN 1 θ i M l(θ : D) = E [ f D i ( χ )] E θ [ f i ] we can write gradient for reduced MN θ i l Y X ( ) M ( θ : D) = f i ( y m,x m ) E θ f i x m m=1 First term is empirical count conditioned on x[m] Second is based on running inference on each data case
Conditional Training is simpler X 1 X 2 X 3 X 4 X 5 Full MN encodes Y 1 Y 2 4 Y 3!P(X,Y) = φ i Y i,y i+1 i=1 ( ) Edges disappear in a reduced Markov network After conditioning on X, remaining edges form a simple chain 5 Y 4 Next slide compares different sequential models, CRF, HMM and MEMM wrt parameter learning complexity Y 5 5 φ i Y i,x 1,X 2,X 3,X 4,X 5 i=1 ( ) ( )!P(Y X) = φ i Y i,y i+1 X 1,X 2,X 3,X 4,X 5 i=1 29
Learning Models for Sequence Labeling Given: sequence of observations X={X 1,..X k }. Need: a joint label Y={Y 1,..Y k } Model Trade-offs: expressive power, learnability CRF is a discriminative model MEMM is a conditional directed model Y 1 X 2 if not given Y 2, by D-separation More generally, Y i X j X -j j > i Has label bias problem: Later observation has no effect on posterior probability of current state. In activity recognition in video sequence: frames are labeled as running/walking. Earlier frames may be blurry but later ones clearer. HMM is a generative model That needs joint probability P(X,Y) CRF requires gradient-based learning which is expensive, difficult with large data sets MEMM and HMM are more easily learned Pure directed models: parameters computed in closed-form using maximum likelihood CRF Directly obtains P(Y X) Note: Z(X) is marginal of un-normalized measure MEMM Does not model distribution over X HMM Needs joint distribution X 1 Y 1 X 1 Y 1 X 1 Y 1 X 2 Y 2 X 2 Y 2 X 2 Y 2 X 3 Y 3 X 3 Y 3 X 3 Y 3 P(Y X) = 1!P(Y, X) Z(X) X 4 Y 4 X 4 Y 4 X 4 Y 4!P(Y, X) = φ i (Y i,y i+1 ) φ i (Y i, X i ) k 1 i=1 Z(X) =!P(Y, X) Y P(Y X) = P(Y i X i,y i 1 ) k X 5 Y 5 X 5 Y 5 P(X,Y) = P(X i /Y i )P(Y i Y i 1 ) i=1 i=1 P(Y / X) = P(X,Y) P(X) k k i=1 X 5 Y 5
Parameter Estimation with Missing Data Missing data examples Some data fields omitted or not collected Some hidden variables Learning problem has difficulties Parameters may not be identifiable Coupling between different parameters Likelihood is not concave (has local maxima) Available methods 1. Gradient Ascent (assume missing data is random) 2. Expectation-Maximization 31
Gradient Ascent with Missing Data Assume data is missing at random in data D In the m th instance, let o[m] be observed entries and h[m] random variables that are missing Log-likelihood has the form 1 M ( ) = 1 M ln P ( o[m], h[m] θ ) ln P D θ Gradient for feature f i has the form M m=1 1 θ i M ( θ : D) = 1 M M m=1 h[m] [ ] E h[m]~p( h[m] o[m],θ ) f i ln Z E θ [ f i ] Which is a difference between two expectations of f i Expectation over data and hidden variables Expectation in current distribution P(χ θ) 32
Comparison with Full Data Case With full data, gradient of log-likelihood is 1 θ i M l(θ : D) = E D[ f i ( χ )] E θ [ f i ] For second term we need inference over current distribution P(χ θ) First term is aggregate over data D. With missing data we have 1 θ i M l ( θ : D ) = 1 M M m=1 E fi h[m ]~P ( h[m ] o[m ],θ ) E θ We have to run inference separately for every instance m conditioning on o[m] Cost is much higher than with full data f i 33
Use of Expectation Maximization As for any probabilistic model, an alternative method of parameter estimation with missing data is EM E step is to use current parameters to estimate missing values M step is to re-estimate the parameters For BN it has significant advantages M-step has closed-form solution For MN, the answer is not clear-cut M-step requires running inference multiple times As seen next 34
E-step EM for MN parameter learning using current parameters θ t to compute expected sufficient statistics, i.e., expected feature counts M-step At iteration t expected sufficient statistic for feature f i is M θ (t ) f i = 1 M E h[m ]~P h[m ] o[m ],θ ( ) done using maximum likelihood parameter estimation (using gradient ascent!) Requires running inference multiple times, once for each iteration of gradient ascent procedure At step k of this inner loop optimization, we have a gradient of the form 35 f Mθ ( t) i E θ t,k ( ) f i fi
Maximum Entropy and Maximum Likelihood Return to basic maximum likelihood estimation Alternative formulation provides insight Consider log-linear model P(X 1,..X n ;θ) = 1 k Z(θ) exp θ i f i (D i ) i=1 Relate it to finding the distribution of maximum entropy subject to certain constraints 36
Motivation of Max Entropy Suppose we have some summary statistics of an empirical distribution Ex 1: published in a census report Marginal distributions of single variables or certain pairs, or other events of interest Ex 2: average final grade of students in class, correlation of final grades with homework scores But no access to full data set Find a typical distribution that satisfies these constraints 37
Maximum Entropy Formulation Goal is to select a distribution that: 1. Satisfies given constraints 2. Has no additional structure or information We formulate the following problem: Find Q(χ) Maximizing H Q (χ) Subject to E Q [f i ]=E D [f i ], i=1,.., k These are called expectation constraints 38
Solution to Max Entropy Estimation Somewhat surprisingly The distribution Q*, the maximum entropy distribution satisfying the expectation constraints, E Q [f i ]=E D [f i ], is a Gibbs distribution Q* = Pˆ θ where Pˆ θ ( χ) = 1 Z θ ˆ ( ) exp ˆ i θi f i ( χ) where ˆθ is the m.l.e. parameterization relative to D 39
Duality of Max Ent and Max Likelihood The two problems 1. Maximizing entropy subject to expectation constraints 2. Maximizing likelihood given structural constraints on the distribution are convex duals of each other For the maximum likelihood parameters ˆθ H P θ ( χ) = 1 M l ( ˆθ : D) 40
Parameter Priors and Regularization Maximum likelihood is prone to over-fitting Can introduce prior distribution P(θ) Due to non-decomposability a fully Bayesian approach is infeasible Instead, we perform MAP estimation Find parameters that maximize P(θ)P(D θ) Where we have ln P(D θ) expressed in log-space as l(θ : D) = θ i f i ξ m i=1 m ( [ ]) Which in turn is derived from the joint distribution P(X 1,..X n ;θ) = 1 k Z(θ) exp θ f (D ) i i i i=1 M ln Z ( θ ) 41
Choice of Prior P(θ) Only a few priors used in practice L 2 regularization, quadratic penalty on weights Laplacian (L 1 regularization) Both priors penalize parameters whose magnitude (positive or negative) is large 42
Gaussian Prior and L 2 -Regularization Most common is Gaussian prior on log-linear parameters θ P ( θ σ 2 ) = 1 exp θ 2 i 2πσ 2σ 2 For some choice of hyper-parameter σ 2 k i=1 Converting MAP objective P(θ)P(D θ) to logspace, gives ln P( θ ) + l(θ : D) whose first term lnp ( θ) = 1 2σ 2 k i=1 2 θ i is called an L 2 - regularization term 43
Laplacian Prior and L 1 -Regularization A different prior used is the Laplacian P Laplacian ( θ) = 1 2β θ exp β Taking log we obtain a term ln P θ ( ) = 1 β which we wish to minimize Generally called L 1 -regularization k θ i i=1 Both forms of regularization penalize parameters whose magnitude is large 44 0.5 0.4 0.3 0.2 0.1 0 10 5 0 5 10 Laplacian distribution β=1 Gaussian distribution σ 2 =1
Why prefer low magnitude parameters? Properties of prior To pull distribution towards an uninformed one To smooth fluctuations in the data A distribution is smooth if Probabilities assigned to different assignments are not radically different Consider two assignments ξ and ξ An assignment is an instance of variables X 1,..X n We consider ratio of their probabilities next 45
Smoothness resulting from small θ Given two assignments ξ and ξ Their relative probability is ( ) ( ) = P ξ P ξ '!P ( ξ)/ Z θ!p ( ξ ')/ Z θ ( ) ( ) = where the un-normalized probabilities are In log-space, log-probability ratio is ln P ξ P ξ '!P(ξ) = exp k ( ) ( ) = θ f ξ i i ( ) i=1 k i=1 When θ i s have small magnitude, this log-ratio is also bounded, i.e., probabilities are similar This results in a smooth distribution ( ) ( )!P ξ!p ξ ' θ i f i (ξ) k θ i f i ( ξ ') = θ i f i ( ξ) f i ξ ' i=1 k i=1 ( ( )) 46
Comparison of L 1 and L 2 Regularization In both L 1 and L 2 we penalize magnitude of parameters In Gaussian case (L 2 ), penalty grows quadratically with parameters An increase in θ i from 3 to 3.1 is penalized more than θ i from 0 to 0.1 Leads to many small parameters In Laplacian case (L 1 ), penalty grows linearly Results in fewer edges and is more tractable 47
Efficiency of Optimization Both L 1 and L 2 Regularization terms are Concave Because Log-likelihood is also Concave, resulting posterior is also concave Can be optimized using gradient descent methods Introduction of penalty terms eliminates multiple equivalent minima 48
Choice of Hyper-parameters Regularization parameters σ 2 in case of L 2 and β in the L 1 case encode our beliefs that model weights should be close to zero Larger these parameters are broader our parameter prior Choice of prior has effect on learned model Standard method of selecting this parameter is via cross-validation 49