Probabilistic Graphical Models


1 Probabilistic Graphical Models
David Sontag, New York University
Lecture 12, April 19, 2012
Acknowledgement: partially based on slides by Eric Xing at CMU and Andrew McCallum at UMass Amherst

2 Today: learning undirected graphical models
1. Learning MRFs
   a. Feature-based (log-linear) representation of MRFs
   b. Maximum likelihood estimation
   c. Maximum entropy view
2. Getting around the complexity of inference
   a. Using approximate inference (e.g., TRW) within learning
   b. Pseudo-likelihood
3. Conditional random fields

3 Recall: ML estimation in Bayesian networks
Maximum likelihood estimation: $\max_\theta \ell(\theta; \mathcal{D})$, where

$$\ell(\theta; \mathcal{D}) = \log p(\mathcal{D}; \theta) = \sum_{x \in \mathcal{D}} \log p(x; \theta) = \sum_i \sum_{\hat{x}_{pa(i)}} \sum_{x \in \mathcal{D} :\, x_{pa(i)} = \hat{x}_{pa(i)}} \log p(x_i \mid \hat{x}_{pa(i)})$$

In Bayesian networks, we have the closed-form ML solution:

$$\theta^{ML}_{x_i \mid x_{pa(i)}} = \frac{N_{x_i, x_{pa(i)}}}{\sum_{\hat{x}_i} N_{\hat{x}_i, x_{pa(i)}}}$$

where $N_{x_i, x_{pa(i)}}$ is the number of times that the (partial) assignment $x_i, x_{pa(i)}$ is observed in the training data.
We were able to estimate each CPD independently because the objective decomposes by variable and parent assignment.
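
As a quick illustration of this count-and-normalize solution, here is a minimal Python sketch (not part of the original slides; the data layout and function name are assumptions):

```python
import numpy as np
from collections import Counter

def mle_cpd(data, child, parents, k):
    """Closed-form ML estimate of p(x_child | x_parents) from counts:
    theta = N(x_i, x_pa) / sum over x_i' of N(x_i', x_pa)."""
    counts = Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        counts[pa, row[child]] += 1
    cpd = {}
    for (pa, xi), n in counts.items():
        cpd.setdefault(pa, np.zeros(k))[xi] = n
    for pa in cpd:                     # normalize each parent assignment
        cpd[pa] /= cpd[pa].sum()       # independently of all the others
    return cpd

# Example: three binary variables; estimate p(x2 | x0, x1).
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(100, 3))
print(mle_cpd(data, child=2, parents=[0, 1], k=2))
```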

4 Bad news for Markov networks
The global normalization constant $Z(\theta)$ kills decomposability:

$$\theta^{ML} = \arg\max_\theta \sum_{x \in \mathcal{D}} \log p(x; \theta) = \arg\max_\theta \sum_{x \in \mathcal{D}} \Big( \sum_c \log \phi_c(x_c; \theta) - \log Z(\theta) \Big) = \arg\max_\theta \sum_{x \in \mathcal{D}} \sum_c \log \phi_c(x_c; \theta) - |\mathcal{D}| \log Z(\theta)$$

The log-partition function prevents us from decomposing the objective into a sum over terms for each potential.
Solving for the parameters becomes much more complicated.

5 What are the parameters?
How do we parameterize $\phi_c(x_c; \theta)$? Use a log-linear parameterization:
- Introduce weights $w \in \mathbb{R}^d$ that are used globally
- For each potential, a vector-valued feature function $f_c(x_c) \in \mathbb{R}^d$
- Then, $\phi_c(x_c; w) = \exp(w \cdot f_c(x_c))$
Example: a discrete-valued MRF with only edge potentials, where each variable takes $k$ states.
- Let $d = k^2 |E|$, and let $w_{i,j,x_i,x_j} = \log \phi_{ij}(x_i, x_j)$
- Let $f_{i,j}(x_i, x_j)$ have a 1 in the dimension corresponding to $(i, j, x_i, x_j)$ and 0 elsewhere
The joint distribution is in the exponential family! $p(x; w) = \exp\{w \cdot f(x) - \log Z(w)\}$, where $f(x) = \sum_c f_c(x_c)$ and $Z(w) = \sum_x \exp\{\sum_c w \cdot f_c(x_c)\}$.
This formulation allows for parameter sharing.
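
To make the indicator-feature construction concrete, here is a small sketch (an illustration, not from the slides; the flat indexing scheme is an assumption):

```python
import numpy as np

def joint_feature(x, edges, k):
    """f(x) = sum_c f_c(x_c) for the pairwise example: edge e contributes a
    one-hot vector with a 1 in the dimension corresponding to (e, x_i, x_j)."""
    f = np.zeros(len(edges) * k * k)
    for e, (i, j) in enumerate(edges):
        f[e * k * k + x[i] * k + x[j]] = 1.0
    return f

# With this encoding, w[e*k*k + xi*k + xj] plays the role of
# log phi_ij(xi, xj), and the unnormalized log-probability is w . f(x):
edges = [(0, 1), (1, 2)]
x = np.array([0, 1, 1])
w = np.log(np.full(len(edges) * 2 * 2, 0.5))   # k = 2; arbitrary potentials
print(w @ joint_feature(x, edges, k=2))
```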

6 Log-likelihood for log-linear models

$$\theta^{ML} = \arg\max_\theta \sum_{x \in \mathcal{D}} \sum_c \log \phi_c(x_c; \theta) - |\mathcal{D}| \log Z(\theta) = \arg\max_w \sum_{x \in \mathcal{D}} \sum_c w \cdot f_c(x_c) - |\mathcal{D}| \log Z(w) = \arg\max_w w \cdot \Big( \sum_{x \in \mathcal{D}} \sum_c f_c(x_c) \Big) - |\mathcal{D}| \log Z(w)$$

The first term is linear in $w$.
The second term is also a function of $w$: $\log Z(w) = \log \sum_x \exp\big( w \cdot \sum_c f_c(x_c) \big)$.

7 Log-likelihood for log-linear models

$$\log Z(w) = \log \sum_x \exp\Big( w \cdot \sum_c f_c(x_c) \Big)$$

$\log Z(w)$ does not decompose. There is no closed-form solution; even computing the likelihood requires inference.
Recall Problem 4 ("Exponential families") from Problem Set 2. Letting $f(x) = \sum_c f_c(x_c)$, you showed that

$$\nabla_w \log Z(w) = \mathbb{E}_{p(x;w)}[f(x)] = \sum_c \mathbb{E}_{p(x_c;w)}[f_c(x_c)]$$

Thus, the gradient of the log-partition function can be computed by inference, computing marginals with respect to the current parameters $w$.
We also claimed that the second derivative of the log-partition function gives the second-order moments, i.e., $\nabla^2 \log Z(w) = \mathrm{Cov}[f(x)]$.
Since covariance matrices are always positive semi-definite, this proves that $\log Z(w)$ is convex (so $-\log Z(w)$ is concave).
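
The identity $\nabla_w \log Z(w) = \mathbb{E}_{p(x;w)}[f(x)]$ is easy to verify numerically on a model small enough to enumerate. A minimal sketch (random features and brute-force enumeration; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4
F = rng.normal(size=(2 ** n, d))     # f(x) for each of the 2^n configurations
w = rng.normal(size=d)

def logZ(w):
    logits = F @ w
    m = logits.max()                 # log-sum-exp, stabilized
    return m + np.log(np.exp(logits - m).sum())

p = np.exp(F @ w - logZ(w))          # exact p(x; w) by enumeration
expected_f = F.T @ p                 # E_{p(x;w)}[f(x)]

# Central finite differences of log Z(w) recover the same vector:
eps = 1e-6
num_grad = np.array([(logZ(w + eps * e) - logZ(w - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
print(np.allclose(num_grad, expected_f, atol=1e-6))   # True
```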

8 Solving the maximum likelihood problem in MRFs

$$\ell(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} w \cdot \sum_c f_c(x_c) - \log Z(w)$$

First, note that the weights $w$ are unconstrained, i.e., $w \in \mathbb{R}^d$.
The objective function is jointly concave. Apply any convex optimization method to learn! One can use gradient ascent, stochastic gradient ascent, or quasi-Newton methods such as limited-memory BFGS (L-BFGS).
The gradient of the log-likelihood is:

$$\frac{\partial \ell(w; \mathcal{D})}{\partial w_k} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \sum_c (f_c(x_c))_k - \sum_c \mathbb{E}_{p(x_c;w)}[(f_c(x_c))_k]$$

9 The gradient of the log-likelihood

$$\frac{\partial \ell(w; \mathcal{D})}{\partial w_k} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \sum_c (f_c(x_c))_k - \sum_c \mathbb{E}_{p(x_c;w)}[(f_c(x_c))_k]$$

A difference of expectations!
Consider the earlier pairwise MRF example. The gradient then reduces to:

$$\frac{\partial \ell(w; \mathcal{D})}{\partial w_{i,j,\hat{x}_i,\hat{x}_j}} = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} 1[x_i = \hat{x}_i, x_j = \hat{x}_j] - p(\hat{x}_i, \hat{x}_j; w)$$

Setting the derivative to zero, we see that for the maximum likelihood parameters $w_{ML}$ we have

$$p(\hat{x}_i, \hat{x}_j; w_{ML}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} 1[x_i = \hat{x}_i, x_j = \hat{x}_j]$$

for all edges $ij \in E$ and states $\hat{x}_i, \hat{x}_j$.
Model marginals for each clique equal the empirical marginals! This is called moment matching, and is a property of maximum likelihood learning in exponential families.
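
Putting the pieces together, here is a toy sketch of maximum likelihood learning by gradient ascent, using brute-force enumeration as the inference step (illustrative only, feasible just for tiny models); it also lets us check the moment-matching property at convergence:

```python
import itertools
import numpy as np

def feat(x, edges, k):
    """One-hot edge features: a 1 in the dimension for (e, x_i, x_j)."""
    f = np.zeros(len(edges) * k * k)
    for e, (i, j) in enumerate(edges):
        f[e * k * k + x[i] * k + x[j]] = 1.0
    return f

def learn_mrf(data, edges, k, iters=2000, lr=0.5):
    n = data.shape[1]
    all_x = list(itertools.product(range(k), repeat=n))
    F = np.array([feat(x, edges, k) for x in all_x])
    emp = np.mean([feat(x, edges, k) for x in data], axis=0)
    w = np.zeros(len(edges) * k * k)
    for _ in range(iters):
        logits = F @ w
        p = np.exp(logits - logits.max())
        p /= p.sum()                    # exact p(x; w) by enumeration
        w += lr * (emp - F.T @ p)       # gradient: empirical - model moments
    return w, F.T @ p, emp

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(200, 3))
w, model_marg, emp_marg = learn_mrf(data, [(0, 1), (1, 2)], k=2)
print(np.abs(model_marg - emp_marg).max())   # ~0: moment matching
```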

10 Gradient ascent requires repeated marginal inference, which in many models is hard! We will return to this shortly.

11 Maximum entropy (MaxEnt)
We can approach the modeling task from an entirely different point of view. Suppose we know some expectations with respect to a (fully general) distribution $p(x)$:

$$\text{(true)} \quad \sum_x p(x) f_i(x), \qquad \text{(empirical)} \quad \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} f_i(x) = \alpha_i$$

Assuming that the expectations are consistent with one another, there may exist many distributions which satisfy them. Which one should we select? The most uncertain or flexible one, i.e., the one with maximum entropy. This yields a new optimization problem:

$$\max_p \; H(p(x)) = -\sum_x p(x) \log p(x) \quad \text{(strictly concave w.r.t. } p(x))$$
$$\text{s.t.} \quad \sum_x p(x) f_i(x) = \alpha_i, \qquad \sum_x p(x) = 1$$
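
For intuition, the MaxEnt program can be solved directly as a constrained optimization over a small state space. A sketch using scipy (the feature choices and target expectations below are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

xs = np.arange(4)                                  # x in {0, 1, 2, 3}
F = np.stack([xs.astype(float), (xs == 0).astype(float)])
alpha = np.array([1.2, 0.3])                       # target expectations

def neg_entropy(p):
    """-H(p); the floor inside the log avoids log(0) at the boundary."""
    return np.sum(p * np.log(np.maximum(p, 1e-12)))

constraints = [
    {"type": "eq", "fun": lambda p: F @ p - alpha},    # E_p[f_i] = alpha_i
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},    # normalization
]
res = minimize(neg_entropy, np.full(4, 0.25), method="SLSQP",
               bounds=[(0.0, 1.0)] * 4, constraints=constraints)
print(res.x)    # the MaxEnt distribution over the four states
```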

12 What does the MaxEnt solution look like?
To solve the MaxEnt problem, we form the Lagrangian:

$$L = -\sum_x p(x) \log p(x) - \sum_i \lambda_i \Big( \sum_x p(x) f_i(x) - \alpha_i \Big) - \mu \Big( \sum_x p(x) - 1 \Big)$$

Then, taking the derivative of the Lagrangian,

$$\frac{\partial L}{\partial p(x)} = -1 - \log p(x) - \sum_i \lambda_i f_i(x) - \mu$$

and setting it to zero, we obtain:

$$p^*(x) = \exp\Big( -1 - \mu - \sum_i \lambda_i f_i(x) \Big) = e^{-1-\mu} \, e^{-\sum_i \lambda_i f_i(x)}$$

From the constraint $\sum_x p(x) = 1$ we obtain $e^{1+\mu} = \sum_x e^{-\sum_i \lambda_i f_i(x)} = Z(\lambda)$.
We conclude that the maximum entropy distribution has the form (substituting $w_i = -\lambda_i$)

$$p^*(x) = \frac{1}{Z(w)} \exp\Big( \sum_i w_i f_i(x) \Big)$$

13 Equivalence of maximum likelihood and maximum entropy
Feature constraints + MaxEnt $\Rightarrow$ exponential family!
We have seen a case of convex duality:
- In one case, we assume an exponential family and show that ML implies the model expectations must match the empirical expectations.
- In the other case, we assume the model expectations must match the empirical feature counts and show that MaxEnt implies an exponential family distribution.
One can show that each problem is the dual of the other, and thus both obtain the same value of the objective at optimality (no duality gap).
Besides providing insight into the ML solution, this also gives an alternative way to (approximately) solve the learning problem.

14 How can we get around the complexity of inference during learning?

15 Monte Carlo methods
Recall the original learning objective:

$$\ell(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} w \cdot \sum_c f_c(x_c) - \log Z(w)$$

Use any of the sampling approaches (e.g., Gibbs sampling) that we discussed in Lecture 9.
All we need for learning (i.e., to compute the derivative of $\ell(w; \mathcal{D})$) are marginals of the distribution.
There is no need to ever estimate $\log Z(w)$.
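
A sketch of how sampling plugs into learning: estimate the model-expectation term of the gradient with a few Gibbs sweeps from a persistent chain instead of exact marginals. This is illustrative pairwise-MRF code with assumed names; in practice one would average over several chains and a minibatch of data points:

```python
import numpy as np

def feat(x, edges, k):
    """One-hot edge features, as in the log-linear parameterization."""
    f = np.zeros(len(edges) * k * k)
    for e, (i, j) in enumerate(edges):
        f[e * k * k + x[i] * k + x[j]] = 1.0
    return f

def gibbs_sweep(state, w, edges, k, rng):
    """Resample each x_i from p(x_i | x_{-i}; w) in turn."""
    W = w.reshape(len(edges), k, k)
    for i in range(len(state)):
        logits = np.zeros(k)
        for e, (a, b) in enumerate(edges):
            if a == i:
                logits += W[e, :, state[b]]
            elif b == i:
                logits += W[e, state[a], :]
        p = np.exp(logits - logits.max())
        state[i] = rng.choice(k, p=p / p.sum())
    return state

def sgd_step(w, x_data, state, edges, k, rng, lr=0.1, sweeps=5):
    """One stochastic gradient step: empirical features minus a one-sample
    Monte Carlo estimate of E_{p(x;w)}[f(x)] from the Gibbs chain."""
    for _ in range(sweeps):
        state = gibbs_sweep(state, w, edges, k, rng)
    w = w + lr * (feat(x_data, edges, k) - feat(state, edges, k))
    return w, state
```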

16 Using approximations of the log-partition function
We can substitute the original learning objective

$$\ell(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} w \cdot \sum_c f_c(x_c) - \log Z(w)$$

with one that uses a tractable approximation of the log-partition function:

$$\tilde{\ell}(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} w \cdot \sum_c f_c(x_c) - \log \tilde{Z}(w)$$

Recall from Lecture 8 that we came up with a convex relaxation that provides an upper bound on the log-partition function, $\log Z(w) \le \log \tilde{Z}(w)$ (e.g., tree-reweighted belief propagation, the log-determinant relaxation).
Using this, we obtain a lower bound on the learning objective, $\tilde{\ell}(w; \mathcal{D}) \le \ell(w; \mathcal{D})$.
Again, to compute the derivatives we only need the pseudo-marginals from the variational inference algorithm.

17 Pseudo-likelihood
Alternatively, can we come up with a different objective function (i.e., a different estimator) which succeeds at learning while avoiding inference altogether?
The pseudo-likelihood method (Besag 1971) yields an exact solution if the data is generated by a model in our model family $p(x; \theta^*)$ and $|\mathcal{D}| \to \infty$ (i.e., it is consistent).
Note that, via the chain rule, $p(x; w) = \prod_i p(x_i \mid x_1, \ldots, x_{i-1}; w)$.
We consider the following approximation:

$$p(x; w) \approx \prod_i p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n; w) = \prod_i p(x_i \mid x_{-i}; w)$$

where we have added conditioning on additional variables.

18 Pseudo-likelihood
The pseudo-likelihood method replaces the likelihood,

$$\ell(\theta; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \log p(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{m=1}^{|\mathcal{D}|} \log p(x^m; \theta)$$

with the following approximation:

$$\ell_{PL}(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{m=1}^{|\mathcal{D}|} \sum_{i=1}^{n} \log p(x_i^m \mid x_{N(i)}^m; w)$$

(we replaced $x_{-i}$ with $x_{N(i)}$, variable $i$'s Markov blanket).
For example, suppose we have a pairwise MRF. Then

$$p(x_i^m \mid x_{N(i)}^m; w) = \frac{1}{Z(x_{N(i)}^m; w)} e^{\sum_{j \in N(i)} \theta_{ij}(x_i^m, x_j^m)}, \qquad Z(x_{N(i)}^m; w) = \sum_{\hat{x}_i} e^{\sum_{j \in N(i)} \theta_{ij}(\hat{x}_i, x_j^m)}$$

More generally, using the log-linear parameterization, we have:

$$\log p(x_i^m \mid x_{N(i)}^m; w) = \sum_{c :\, i \in c} w \cdot f_c(x_c^m) - \log Z(x_{N(i)}^m; w)$$
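
A sketch of this objective for a pairwise MRF, where each term needs only the local partition function $Z(x_{N(i)}; w)$, a sum over the $k$ states of $x_i$ (illustrative code; the one-hot parameter layout matches the earlier sketches):

```python
import numpy as np

def pseudo_loglik(w, data, edges, k):
    """(1/|D|) sum_m sum_i log p(x_i^m | x_{N(i)}^m; w) for a pairwise MRF,
    with w[e, x_i, x_j] playing the role of theta_ij(x_i, x_j)."""
    W = w.reshape(len(edges), k, k)
    total = 0.0
    for x in data:
        for i in range(data.shape[1]):
            logits = np.zeros(k)     # sum_{j in N(i)} theta_ij(., x_j)
            for e, (a, b) in enumerate(edges):
                if a == i:
                    logits += W[e, :, x[b]]
                elif b == i:
                    logits += W[e, x[a], :]
            m = logits.max()         # local log-partition: sum over x_i only
            log_Zi = m + np.log(np.exp(logits - m).sum())
            total += logits[x[i]] - log_Zi
    return total / len(data)
```

Since each term is a standard softmax log-likelihood, the objective is concave and can be maximized with any gradient method.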

19 Pseudo-likelihood
This objective only involves summation over $x_i$ and is tractable.
It has many small partition functions (one for each variable and each setting of its neighbors) instead of one big one.
It is still concave in $w$ and thus has no local maxima.
Assuming the data is drawn from an MRF with parameters $w^*$, one can show that as the number of data points gets large, $w_{PL} \to w^*$.

20 Conditional random fields
Recall from Lecture 4 that a CRF is a Markov network on variables $X \cup Y$, which specifies the conditional distribution

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C} \phi_c(x_c, y_c), \qquad \text{with partition function} \quad Z(x) = \sum_{\hat{y}} \prod_{c \in C} \phi_c(x_c, \hat{y}_c)$$

The feature functions now depend on $x$ in addition to $y$:
- For each potential, a vector-valued feature function $f_c(x_c, y_c) \in \mathbb{R}^d$
- Then, $\phi_c(x_c, y_c; w) = \exp(w \cdot f_c(x_c, y_c))$

21 Learning with conditional random fields
Exactly the same as learning with MRFs, except that we have a different partition function for each data point:

$$\theta^{ML} = \arg\max_\theta \sum_{(x,y) \in \mathcal{D}} \Big( \sum_c \log \phi_c(x_c, y_c; \theta) - \log Z(x; \theta) \Big) = \arg\max_w \sum_{(x,y) \in \mathcal{D}} w \cdot f(x, y) - \sum_{(x,y) \in \mathcal{D}} \log Z(x; w)$$
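
Finally, a sketch of the CRF objective and gradient for a tiny label space, making the per-example partition function $Z(x; w)$ explicit by enumerating all label vectors $\hat{y}$ (illustrative; `feat(x, y)` stands for the joint feature map $f(x, y)$ and is an assumed helper):

```python
import itertools
import numpy as np

def crf_loglik_and_grad(w, data, feat, k, n_label_vars):
    """Average conditional log-likelihood and its gradient; note that each
    training example (x, y) gets its own partition function Z(x; w)."""
    all_y = [np.array(y) for y in itertools.product(range(k), repeat=n_label_vars)]
    ll, grad = 0.0, np.zeros_like(w)
    for x, y in data:
        Fy = np.array([feat(x, yy) for yy in all_y])   # f(x, y') for all y'
        logits = Fy @ w
        m = logits.max()
        logZx = m + np.log(np.exp(logits - m).sum())   # log Z(x; w)
        p = np.exp(logits - logZx)                     # p(y' | x; w)
        ll += feat(x, y) @ w - logZx
        grad += feat(x, y) - Fy.T @ p                  # observed - expected
    return ll / len(data), grad / len(data)
```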
