8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

Size: px

Start display at page:

Download "8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF"

Rhoda Harrison
6 years ago
Views:

1 10-708: Probablstc Graphcal Models , Sprng : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the prevous lectures, we have talked about the structure and parameter learnng for the completely observed BNs. In the drected models, the potentals are restrcted to condtonal dstrbutons whch model the dependence of a varable on ts parents. However, sometmes an undrected assocaton graph s more nformatve and natural to do the modelng. For example, n domans such as computer vson, the nfluences of pxels n an mage are ntrnscally symmetrc. In bology, the gene expressons may be nfluenced by the unobserved factors where we don t have enough nformaton to know the drecton. In ths lecture, we wll cover the technques of structural learnng and parameter estmaton for fully observed MRF. 2 Structural Learnng for Completely Observed MRF 2.1 Gaussan Graphcal Models In ths secton, we wll ntroduce the neghborhood selecton for undrected structural learnng. To gve the background of ths method, let s frst look at a typcal MRF: Gaussan Graphcal Model. As what we have known, when the varables follows the multvarate Gaussan densty, t can be expressed as 1 p(x µ, Σ) = exp{ 1 (2π) n 2 Σ (x µ)t Σ 1 (x µ)} where x = [x 1, x 2,, x p ] T. Wthout loss of generalty, let µ = 0 and Q = Σ 1, then we have p(x 1, x 2,, x p µ = 0, Q) = Q 1 2 exp{ 1 q (2π) n (x ) 2 q j x x j } 2 2 <j Now by observaton, we can fnd that whle the lefthand sde of the equaton s stll a jont dstrbuton, the rghthand sde now descrbes the structure of a MRF. 1 Z exp{ φ(x ) + <j φ(x, x j )} where 1 2 q (x ) 2 can be vewed as the node potental φ(x ), and q j x x j can be vewed as the edge potental φ(x, x j ). 1

2 2 8 : Learnng n Fully Observed Markov Networks 2.2 The Covarance and the Precson Matrces Gven the covarance matrx Σ and the precson matrx Q, we want to nvestgate the dfferences of ther probablty and graphcal model nterpretaton. 1. For Σ Wth the assumpton that x 1,, x p follows multvarate Gaussan dstrbuton, x and x j are uncorrelated means they are ndependent. Therefore, Σ,j = 0 x x j or P (x, x j ) = P (x )P (x j ). x and x j are margnally ndependent. Through Σ, we solely concentrate on the relatonshp between x and x j regardless the other nodes n the graph. 2. For Q Q,j = 0 x x j x j or P (x, x j x j ) = P (x x j )P (x j x j ) x and x j are condtonally ndependent gven the rest of the graph. Ths s what we need to decde the structure of a graph. Specfcally, every non-zero entry n Q corresponds to an edge n the MRF. To better understand the condtonal ndependence nduced by Q, we can look at P (x, x j x j, Q). When q j = 0, P (x, x j x j, Q) = Q 1 2 exp{ 1 q (2π) n kk (x k ) 2 q hg x h x g } exp{ (q (x ) 2 + q jj (x j ) 2 )} k,j h,g,j,h<g }{{} constant = P (x x j, Q)P (x j x j, Q) Fgure 1 shows a chan graphcal model. The Q matrx well captures the structure of the chan, where zero ndcates that there s no edge between the two nodes, and non-zero ndcates an edge exsts. However, f we represent a graph accordng to the Σ matrx, we wll get a clque. Fgure 1: Q and Σ Matrces for a Chan Graphcal Model 2.3 The Neghborhood Selecton Algorthm Now we know that whether the entres of Q are zero or non-zero can completely encode the structure of the graph. The next questons s how to learn the precson matrx Q. In the deal stuaton, Σ s nvertble, then we can use MLE to learn Σ. However, n the real world, an common scenaro s that p n where p s the number of features, and n s the number of samples. In ths case, Σ s not nvertble. Thus we are gong to learn a sparse GM by fndng the non-zero entres n the sparse Q drectly from data. The method we used

3 8 : Learnng n Fully Observed Markov Networks 3 to solve ths problem s called neghborhood selecton algorthm (Fgure 2). Here, we apply LASSO regresson to each varable teratvely to fnd ts neghbors. In the th teraton, one varable s ndcated as y, and all the other varables are represented by vector x. We want to compute a vector θ n whch the non-zero entry θ j ndcates an edge between the correspondng varable xj and y. And θ s just the th row or column n Q. Therefore we need to solve where l(θ ) = logp (y x, θ ), and Y = θ T X. ˆθ = arg mn θ l(θ ) + λ θ 1 (a) Pck One Node (b) Iteraton Fgure 2: The Neghborhood Selecton Algorthm Havng known the structure of the graph, we put Q back to the equaton and perform MLE to estmate the parameters. For the dscrete nodes, we can also perform ths procedure and smply use L1 regularzed logstc regresson nstead of L1 lnear regresson. Under fnte data and hgh dmenson condton, the graphcal regresson algorthm s ensured to be consstency. 3 MLE for Decomposable Undrected Graphcal Models Estmatng the parameters n UGM s more challengng than n DGM for the reason that the partton functon Z nvolves all parameters log-lkelhood, thus log-lkelhood s no more decomposable. In some cases, we need to do nferences (.e. margnalzaton) to learn parameters even n the fully observed case. In ths secton, let s begn wth a relatvely smple case, the decomposable (trangulated) UGM. 3.1 Log Lkelhood wth Tabular Clque Potentals To remove Z n the log lkelhood, we ntroduce a notaton for counts. The number of tmes a confguraton x s observed n the dataset D can be represented as m(x) = n δ(x, x n )

4 4 8 : Learnng n Fully Observed Markov Networks and m(x C ) = x V \C m(x) s the count for clque C. In terms of the counts, the lkelhood s gven by p(d θ) = p(x θ) δ(x,xn) n x The log lkelhood s l = log p(d θ) = n = x = x = C δ(x, x n ) log p(x θ) x δ(x, x n ) log p(x θ) n m(x) log( 1 ψ C (x C )) Z l 1 C m(x C ) log ψ C (x C ) N log Z }{{}}{{} x C l 2 We can see that the margnal counts m(x C ) are the suffcent statstcs for our model. 3.2 Maxmum Lkelhood Estmaton To fnd MLE, we take the dervatves of the log lkelhood wth respect to ψ C (x C ) and set t to zero. The dervatve of the frst term can be obtaned mmedately. l 1 ψ C (x C ) = m(x C) ψ C (x C ) Then we turn to the second term l 2 ψ C (x C ) = 1 Z ψ C (x C ) ( x ψ D ( x D )) D = 1 δ( x C, x C ) Z ψ C (x C ) ( ψ D ( x D )) x D = 1 δ( x C, x C ) ψ D ( x D ) Z x = x 1 1 δ( x C, x C ) ψ D ( x D ) ψ C ( x C ) Z D 1 = δ( x C, x C )p( x) ψ C (x C ) = p(x C) ψ C (x C ) Note that when takng dervatve of l 2, x C s fxed. x

5 8 : Learnng n Fully Observed Markov Networks 5 Thus, The MLE of the parameters s l ψ C (x C ) = m(x C) ψ C (x C ) N p(x C) ψ C (x C ) = 0 ˆp MLE (x C ) = m(x C) N = p(x C) Ths tells us an mportant characterzaton of maxmum lkelhood estmates: for each clque, the model margnals must be equal to the observed margnals (emprcal counts). However, t doesn t tell us the MLE of the parameters, ψ C (x C ) themselves appear mplctly n these equatons. 3.3 Decomposable Models When a model G s decomposable, ts jont dstrbuton can be represented as ψ C (x C ) C p(x) = ϕ S (x S ) S If G s potentals are defned on maxmal clques, we can fnd maxmum lkelhood estmates by nspecton. That s, to compute the clque potentals, just set them to the emprcal margnals or condtonals,.e., the separator must be dvded nto one of ts neghbors. Fgure 3 gves us an example of a three node chan model. Its probablty can be wrtten as p(x 1, x 2, x 3 ) = 1 Z ψ 12(x 1, x 2 )ψ 23 (x 2, x 3 ) The maxmum lkelhood estmates of ts clque potentals can be gven as whch also mples that Z = 1. ψ 12,MLE (x 1, x 2 ) = p(x 1, x 2 ) ψ 23,MLE (x 2, x 3 ) = p(x 2, x 3 ) p(x 2 ) Fgure 3: A Three Node Markov Chan 4 MLE for Non-decomposable Undrected Graphcal Models For now we know how to do MLE for decomposable graphcal model, however f the clque potental does not drectly correspond to the clque margnal, then the graph s non-decomposable. How can we do MLE for

6 6 8 : Learnng n Fully Observed Markov Networks non-decomposable graphcal model? If the potentals are tabular, we can use Iteratve Proportonal Fttng (IPF) algorthm and f the potentals are themselves functons of ther own parameters, then we can use Generalzed Iteratve Scalng (GIS) algorthm. 4.1 Iteratve Proportonal Fttng Iteratve Proportonal Fttng (IPF) can be used when the potentals n the undrected graphcal model are tabular. It s an teratve algorthm, and hopng that the teratons can converge to a fxed pont the soluton for the orgnal mplct equatons. One nce thng about IPF s that t s not only a fxed-pont algorthm, but also a a coordnate ascent algorthm, so t s guaranteed to converge. Let s frst derve the update rule for each teraton. Set the dervatve of lkelhood functon equal to 0, we can get m(x c ) Nψ c (x c ) = p(x c) ψ c (x c ) we can rewrte m(xc) N as p(x c), the emprcal margnal of x c, so p(x c ) ψ c (x c ) = p(x c) ψ c (x c ) Note that our goal ψ c (x c ) s on both sde of the equaton, so we can not solvng t n closed-form from ths equaton, however we can fx ψ c (x c ) on the rght sde and solve for t on the left hand sde. At the end we create a rule for teratve update: ψ c (t+1) (x c ) = ψ c (t) (x c ) p(x c) p (t) (x c ) Also note that n the equaton above, p (t) (x c ) s the clque margnal, not the emprcal margnal, so we have to do nference for p (t) (x c ) n each teraton. We do the update for all the clques n each teraton. and t can be proved that t s a coordnate ascent algorthm, where the coordnates are parameters of clque potentals: Frst we take the dervatve of the log lkelhood wth respect to the coordnate ψ c (x c ), for fxed c and varyng x c. l ψ c (x c ) = m(x c) ψ c (x c ) N δ( c, x c ) ψ D ( D ) Z To reflect that D s beng fxed, we can add an teraton superscrpt to ψ D on the rght sde. l ψ c (x c ) = m(x c) ψ c (t+1) (x c ) N δ( Z (t+1) c, x c ) ψ (t) D ( D) Also, one of IPF s propertes s that Z remans constant durng the teraton, so Z (t+1) = Z (t), so l ψ c (x c ) = m(x c) ψ c (t+1) (x c ) N δ( Z (t+1) c, x c ) ψ (t) D ( D) = m(x c) ψ c (t+1) (x c ) N δ( Z (t) c, x c ) ψ (t) D ( D) = m(x c) ψ c (t+1) (x c ) N δ( ψ (t) c, x c ) 1 ψ (t) (x c ) Z (t) D ( D) = m(x c) ψ (t+1) c (x c ) N ψ (t) (x c ) p(t) (x c ). D

7 8 : Learnng n Fully Observed Markov Networks 7 Now we can see that the IPF update functon would set the dervatve of log-lkelhood above to zero, so the algorthm s coordnate ascent, t wll ncrease the log-lkelhood and fnally converge to a global maxmum. 4.2 Feature Based Model For most of the graphcal model we deal wth, the potentals wll not be tabular, but wll themselves be functons of some parameters. Ths s because for large clques, defnng tabular potental functons are exponentally costly for nference and would have exponental numbers of parameters to learn. Wth lmted data and tme, t s mpossble to learn when clques become very large. Feature-based Clque potentals solve ths problem by usng a less general parameterzaton of the clque potentals, that s, for each potental, t defnes a set of feature on t, so the parameter space s now the space of features, whch can be controlled by ourselves. For example, consder a clque: three consecutve characters n a strng of Englsh text. The full jont clque potental would be , because we have 26 letters n Englsh, and ths s a huge potental space. But we can defne features based on the three characters, such as whether they are ng or on, and each feature has only 2 possble values as they are bnary. So suppose we defne n features, we only have to estmate n parameters. Of course we can defne more complcated feature than bnary feature, such as contnuous features. Let s represent clque potentals as ψ c (x c ) = exp( k θ kf k (x c )), we can treat each feature functon as a mcropotental functon. Also note that f the feature functons are ndcator functon per combnaton of x c, we can recover the standard tabular potental. To combne feature nto the probablty model: we can smplfy ths form to p(x) = 1 Z(θ) ψ c (X c ) c = 1 Z(θ) exp c θ k f k (x c ) k p(x) = 1 Z(θ) exp θ f (x c ) Ths s the form of exponental famly model, and features are suffcent statstcs. So now our goal s to fnd MLE under the above form. 4.3 Generalzed Iteratve Scalng One method to solve ths problem s Generalzed Iteratve Scalng (GIS) algorthm. It s also a teratve algorthm and try to attack the lower bound of the scaled lkelhood functon. Scaled lkelhood functon of UGM can be expressed as: ˆl(θ; D) = l(θ; D)/N = x ˆp(x)logp(x θ) = x ˆp(x) θ f (x) logz(θ)

8 8 8 : Learnng n Fully Observed Markov Networks Z(θ) s n the log, and we want to get rd of the log, so we can use the lnear upper bound of logarthm logz(θ) µz(θ) logµ 1. Ths bound holds for all µ, so we can set µ = Z 1 (θ (t) ). Now we have: ˆl(θ; D) ˆp(x) x θ f (x) Z(θ) Z(θ (t) ) logz(θ(t) ) + 1 We can see that now the frst part of the equaton s a lnear combnaton of coeffcent that we want to learn, and the second part Z(θ) s now not n logarthm form. We defne θ (t) = θ θ (t), that s the dfference between old and new verson of θ. Then we can plug θ (t) nto the lower bound: ˆl(θ; D) ˆp(x) θ f (x) x = θ ˆp(x)f (x) x x Z(θ) Z(θ (t) ) logz(θ(t) ) + 1 p(x θ (t) )exp θ (t) f (x) logz(θ (t) ) + 1 We assume f (x) 0 and f (x) = 1, then we can have a nequalty whch has smlar form as Jense s nequalty: exp( π x ) π exp(x ). so we can get the f (x) out of the exp n the above equaton, so we have: ˆl(θ; D) θ ˆp(x)f (x) p(x θ (t) ) f (x)exp θ (t) logz(θ (t) ) + 1 x x Take the dervatve and set t to zero we can have the closed-form soluton of θ: e θ(t) = x ˆp(x)f (x) x p(t) (x)f (x) Z(θ(t) ) We also have a relatonshp between the update functon of θ and total dstrbuton: θ (t+1) p (t+1) (x) = p (t) = θ (t) + θ (t) e θ(t) f (x) so the update rule for GIS s: p (t+1) (x) = p (t) (x) θ (t+1) ( x ˆp(x)f (x) (x)) x p(t) (x)f (x) )(f = θ (t) + log( x ˆp(x)f (x) x p(t) (x)f (x) ) Smlar to IPF, GIS also has to do nference n each teraton, that s to compute the expectaton over the feature functon x p(x) (x)f (x), so estmate a fully observed MRF can be very dffcult.

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models