Learning Undirected Models. Instructor: Su-In Lee, University of Washington, Seattle. Mean Field Approximation


Readings: K&F 20.3, 20.4, 20.6, 20.7
Learning Undirected Models
Lecture 18, June 2011. CSE 515, Statistical Methods, Spring 2011.
Instructor: Su-In Lee, University of Washington, Seattle

Mean Field Approximation
- Is the energy functional convex in the parameters of Q?
  - The entropy term (−x ln x) is concave in x
  - But the expectation term contains products of marginals (terms of the form x·y), which are not jointly concave or convex in (x, y)
  - So the energy functional is not concave in the parameters of Q, and mean field optimization can converge to local optima
- The energy functional is easy to compute, even for networks where inference is complex
- The energy functional for a fully factored distribution Q can be rewritten simply as a sum of expectations, each one over a small set of variables:
  F[P̃_F, Q] = Σ_{φ ∈ F} E_Q[ln φ] + H_Q(U)
  E_Q[ln φ] = Σ_{u_φ} Q(u_φ) ln φ(u_φ) = Σ_{u_φ} ( Π_{X_i ∈ Scope[φ]} Q(x_i) ) ln φ(u_φ)
  H_Q(U) = Σ_i H_Q(X_i)
- The complexity of this expression depends on the size of the factors in P_F, and not on the topology of the network
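
To make the last point concrete, the fully factored energy functional can be evaluated by looping over each factor's scope and adding the marginal entropies; no inference in the original network is needed. The following is a minimal sketch (not from the lecture), with made-up factor values and mean field marginals for a toy three-variable pairwise network:

```python
import numpy as np

# Toy pairwise Markov network over binary variables X0, X1, X2 with factors
# phi_01(X0, X1) and phi_12(X1, X2). Values are arbitrary, for illustration only.
factors = {
    (0, 1): np.array([[1.0, 0.5], [0.5, 2.0]]),
    (1, 2): np.array([[1.5, 1.0], [0.2, 1.0]]),
}

# Fully factored (mean field) distribution Q(X) = prod_i Q_i(X_i).
Q = {0: np.array([0.6, 0.4]), 1: np.array([0.3, 0.7]), 2: np.array([0.5, 0.5])}

def energy_functional(factors, Q):
    """F[P_F, Q] = sum_phi E_Q[ln phi] + sum_i H_Q(X_i) for a fully factored Q."""
    value = 0.0
    # Each expectation only touches the variables in the factor's scope.
    for (i, j), phi in factors.items():
        for xi in range(2):
            for xj in range(2):
                value += Q[i][xi] * Q[j][xj] * np.log(phi[xi, xj])
    # The entropy of a fully factored Q is the sum of the marginal entropies.
    for qi in Q.values():
        value -= np.sum(qi * np.log(qi))
    return value

print(energy_functional(factors, Q))
```

Each expectation touches only the variables in its own factor, which is why the cost scales with factor size rather than with the topology of the network.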

Learning Undirected Graphs
- The likelihood function
- Learning parameters
  - Collective classification with HMM, MEMM, CRF
  - Generative vs. discriminative models
  - Directed vs. undirected models
  - Learning with incomplete data
- Learning with priors
  - Maximum a posteriori (MAP) estimation
- Learning with alternative objectives
  - Pseudo-likelihood objective
  - Max-margin learning
- Structure learning

Collective Classification
- Taking a set of interrelated instances and jointly labeling them
- Sequential labeling: labeling instances organized in a sequence
- Example: handwriting recognition
  - A sequence of observations (features): images of the handwritten letters b, r, a, c, e
  - Label them with some joint label, using local information and exploiting correlations between neighboring letters
- Model-based approach
  - Training data: fully labeled (both Y and X are observed)
  - Test data: only X is observed

Collective Classification
- Trade-offs between different models
  - Hidden Markov Model (HMM)
  - Maximum Entropy Markov Model (MEMM)
  - Conditional Random Field (CRF)
- [Figure: the HMM, MEMM, and CRF graphical structures over Y and X]
- Y: joint label; X: a sequence of observations (features)

Hidden Markov Model
- For each classification task i:
  - A single (hidden) state variable Y_i (e.g., the label)
  - A single (observed) observation variable X_i (e.g., the image)
- Observation probability P(X_i | Y_i), for example P(X_i = image of 'b' | Y_i = 'b')
- Transition probability P(Y_i | Y_{i−1}): statistical dependencies between the neighboring Y_i's
- [Figure: the unrolled network Y_0 → Y_1 → … with each X_i attached to its Y_i]

Hidden Markov Model
- For each classification task i:
  - A single (hidden) state variable Y_i
  - A single (observed) observation variable X_i
- Observation probability P(X_i | Y_i)
- Transition probability P(Y_i | Y_{i−1}), assumed to be sparse
  - Usually encoded by a state transition graph
- [Figure: a state transition graph over states y_1, …, y_4 with the corresponding transition table P(Y' | Y)]

Learning: Hidden Markov Model
- Generative models
  - Define a joint probability P(Y, X) over the paired label Y and observation X
  - Parameters are trained to maximize the joint log-likelihood log P(Y, X)
- HMM joint distribution: P(X, Y) = ?
- We can label new observations x by inferring P(Y | X = x)
- To make inference tractable, there are typically no long-range dependencies (Markov assumption)
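
As a concrete answer to the "P(X, Y) = ?" prompt: the standard HMM factorization is P(X, Y) = P(Y_1) Π_i P(Y_i | Y_{i−1}) Π_i P(X_i | Y_i). The sketch below (toy numbers, not from the lecture) evaluates this joint in log space for a short sequence:

```python
import numpy as np

# Toy HMM with 2 hidden states and 2 observation symbols (numbers are made up).
pi = np.array([0.6, 0.4])                      # P(Y_1)
A  = np.array([[0.7, 0.3], [0.2, 0.8]])        # A[i, j] = P(Y_t = j | Y_{t-1} = i)
B  = np.array([[0.9, 0.1], [0.3, 0.7]])        # B[i, k] = P(X_t = k | Y_t = i)

def hmm_joint_log_prob(y, x):
    """log P(X = x, Y = y) = log P(Y_1) + sum_t log P(Y_t | Y_{t-1}) + sum_t log P(X_t | Y_t)."""
    logp = np.log(pi[y[0]]) + np.log(B[y[0], x[0]])
    for t in range(1, len(y)):
        logp += np.log(A[y[t - 1], y[t]]) + np.log(B[y[t], x[t]])
    return logp

print(hmm_joint_log_prob(y=[0, 0, 1], x=[0, 1, 1]))
```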

Discriminative (Conditional) Models
- Specify the probability of possible label sequences given the observations, P(Y | X)
- X is always observed
- Key advantage: does not waste parameters on modeling P(X)
  - The distribution over Y can depend on non-independent features of X without modeling the feature dependences
  - Transition probabilities can depend on past and future observations
- Two representations
  - Maximum Entropy Markov Models (MEMMs)
  - Conditional Random Fields (CRFs)

Max Entropy Markov Models
- Model the probability of the next state given the previous state and the observations
- Discriminative model: provides a model for P(Y | X)
- Weakness: the label bias problem
  - (Y_i ⊥ X_j | X_1, …, X_i) for any j > i: an observation from later in the sequence has absolutely no effect on the probability of the current state
- [Figure: the HMM and MEMM graphical structures]
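
The MEMM's "per-state model" is typically a log-linear conditional P(Y_i | Y_{i−1}, X_i), normalized separately for each previous state. A minimal sketch with illustrative (made-up) weights and features; this per-previous-state normalization is exactly what gives rise to the label bias problem discussed next:

```python
import numpy as np

def memm_local_conditional(w, feats, y_prev, x_i):
    """P(Y_i | Y_{i-1} = y_prev, X_i = x_i) as a softmax over candidate next labels.

    w[y_prev] is a weight matrix (one row per candidate next label);
    feats(x_i) is the feature vector of the current observation.
    Normalization is carried out separately for each previous state.
    """
    scores = w[y_prev] @ feats(x_i)
    scores -= scores.max()                      # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Toy example: 2 labels, 3 observation features (all numbers are illustrative).
w = {0: np.array([[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]),
     1: np.array([[0.1, 0.1, 0.1], [0.2, -0.1, 0.6]])}
feats = lambda x_i: np.array(x_i, dtype=float)

print(memm_local_conditional(w, feats, y_prev=0, x_i=[1.0, 0.0, 1.0]))
```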

Label-Bias Problem: Example
- Weakness: the current label is not affected by future observations X
- A model for distinguishing 'rob' from 'rib'
  - Suppose we get an input sequence X = 'rib'
  - First step: 'r' matches both possible states, equally likely
  - Next, 'i' is observed, but since both y_1 and y_4 have only one outgoing state, they both give probability 1 to the next state
  - Note: if one word is more likely in the training data, it will win
  - This does not happen in HMMs
- [Figure: a state transition graph with start state y_0, one branch through y_1, y_2 and another through y_4, y_5, both ending at y_3, spelling 'r-i-b' and 'r-o-b', with the corresponding transition table P(Y' | Y)]

Conditional Random Fields
- The advantages of MEMMs, without the label bias problem
- Key difference
  - MEMMs use a per-state model for the conditional probability of the next state given the current state
  - CRFs have a single model for the joint probability of the entire sequence of labels given the observations
  - Thus, weights of different features at different states can trade off against each other
- CRF training
  - Maximum likelihood estimation or MAP (a little later)
  - The objective function is concave, guaranteeing convergence to the global optimum
- [Figure: the MEMM and CRF graphical structures]

Conditional Random Fields
- Let G = (V, E) be a graph with vertices V and edges E, such that Y = (Y_v), v ∈ V
- Then (X, Y) is a CRF if the random variables Y_v obey the Markov property with respect to the graph:
  P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v)
  where w ~ v means that Y_w is a neighbor of Y_v
- and if it models only P(Y | X)
- [Figure: two CRF graphical structures over Y and X]

Conditional Random Fields
- Joint probability distribution, for trees over Y (the cliques, and thus the potentials, are the edges and vertices):
  p_θ(y | x) ∝ exp( Σ_{e ∈ E, k} λ_k f_k(e, y[e], x) + Σ_{v ∈ V, k} μ_k g_k(v, y[v], x) )
- x are the observed variables, y are the state variables
- y[S] denotes the components of y associated with the vertices in S
- f_k is an edge feature with weight λ_k; g_k is a vertex feature with weight μ_k
- Note that the features can be over all of the variables in x
- [Figure: a chain CRF over Y and X]
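
The chain-CRF formula above can be checked on a very small example by scoring a labeling with weighted edge and vertex features and normalizing over all labelings of the chain. The sketch below uses brute-force enumeration for the partition function (real implementations use forward-backward dynamic programming); the features and weights are made up for illustration:

```python
import itertools
import numpy as np

LABELS = [0, 1]

def vertex_feats(v, y_v, x):
    # g_k(v, y[v], x): simple indicator-times-observation features.
    return np.array([float(y_v == 0) * x[v], float(y_v == 1) * x[v]])

def edge_feats(y_u, y_v):
    # f_k(e, y[e], x): indicator features on the label pair (ignores x for brevity).
    return np.array([float(y_u == y_v), float(y_u != y_v)])

def score(y, x, lam, mu):
    """Log of the unnormalized probability: sum_e lam.f + sum_v mu.g."""
    s = sum(mu @ vertex_feats(v, y[v], x) for v in range(len(y)))
    s += sum(lam @ edge_feats(y[v - 1], y[v]) for v in range(1, len(y)))
    return s

def crf_log_prob(y, x, lam, mu):
    """log p(y | x) with Z(x) computed by enumerating all labelings of the chain."""
    logZ = np.logaddexp.reduce(
        [score(list(yy), x, lam, mu) for yy in itertools.product(LABELS, repeat=len(x))])
    return score(y, x, lam, mu) - logZ

lam, mu = np.array([0.8, -0.3]), np.array([0.5, -0.5])   # illustrative weights
print(crf_log_prob([0, 0, 1], x=[1.0, 0.2, -0.7], lam=lam, mu=mu))
```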

Comparison 1/3
- Computational perspective
  - Purely directed models (HMMs and MEMMs) are much more easily learned
    - Their parameters can be computed in closed form using MLE or Bayesian estimation
  - A CRF requires an iterative gradient-based approach, and inference must be run for every training instance
- [Figure: the HMM, MEMM, and CRF graphical structures]

Comparison 2/3
- Ability to use a rich feature set
  - Success in a classification task often depends strongly on the quality of our features
  - In an HMM, we must explicitly model the distribution over the features X, including the interactions between them
    - Depending on the features, this type of model is very hard and often impossible to construct correctly
  - MEMMs and CRFs are both discriminative models, so they avoid this challenge entirely
- [Figure: the HMM, MEMM, and CRF graphical structures]

Comparison: Summary
- Independence assumptions made by the model
  - In MEMMs, (Y_i ⊥ X_j | X_1, …, X_i) for any j > i: the current label is not affected by future observations (the label bias problem)
- Summary
  - In cases where there are many correlated features, discriminative models are probably better
  - If only limited data are available, the stronger bias of the generative model (modeling P(X)) may dominate and allow learning with fewer samples
  - Among the discriminative models, MEMMs should probably be avoided in cases where many transitions are close to deterministic (the label bias problem)
  - In many cases, CRFs are likely to be a safer choice, but the computational cost may be prohibitive for large datasets

Learning Undirected Graphs
- The likelihood function
- Learning parameters
  - Collective classification with HMM, MEMM, CRF
  - Generative vs. discriminative models
  - Directed vs. undirected models
  - Learning with incomplete data
- Learning with priors
  - Maximum a posteriori (MAP) estimation
- Learning with alternative objectives
  - Pseudo-likelihood objective
  - Max-margin learning
- Structure learning

Learning with Missing Data
- In MLE with complete data, the gradient is
  (1/M) ∂ℓ(θ : D)/∂θ_i = E_D[f_i(d)] − E_θ[f_i]
  - The first term is the (average) number of times feature f_i is true in the data D
  - The second term is the expected number of times feature f_i is true according to the model
- With missing data (Y hidden, X observed), the gradient of the likelihood is now a difference of two expectations:
  (1/M) ∂ℓ(θ : D)/∂θ_i = E_θ[f_i | observed data] − E_θ[f_i]
  - The first term is the expected number of times feature f_i is true given the observed data (hidden variables filled in by the current model)
  - The second term is the expected number of times feature f_i is true according to the model
- Can use gradient descent or EM

Learning Undirected Graphs
- The likelihood function
- Learning parameters
  - Collective classification with HMM, MEMM, CRF
  - Generative vs. discriminative models
  - Directed vs. undirected models
  - Learning with incomplete data
- Learning with priors
  - Maximum a posteriori (MAP) estimation
- Learning with alternative objectives
  - Pseudo-likelihood objective
  - Max-margin learning
- Structure learning
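
The "empirical counts minus model expectations" form of the complete-data gradient can be verified by brute force on a tiny log-linear model. This sketch (illustrative features and data, not from the lecture) enumerates all configurations to obtain the model expectation, which is precisely the step that requires inference in a real network:

```python
import itertools
import numpy as np

def features(x):
    """Feature vector f(x) for a configuration of 3 binary variables (illustrative)."""
    return np.array([x[0] * x[1], x[1] * x[2], x[0], x[2]], dtype=float)

CONFIGS = list(itertools.product([0, 1], repeat=3))

def log_likelihood_gradient(theta, data):
    """(1/M) d l(theta : D) / d theta = E_D[f] - E_theta[f]."""
    emp = np.mean([features(d) for d in data], axis=0)            # E_D[f]
    scores = np.array([theta @ features(x) for x in CONFIGS])
    p = np.exp(scores - np.logaddexp.reduce(scores))              # P_theta(x) by enumeration
    model = sum(p_x * features(x) for p_x, x in zip(p, CONFIGS))  # E_theta[f]
    return emp - model

theta = np.zeros(4)
data = [(1, 1, 0), (1, 0, 0), (1, 1, 1)]                          # toy fully observed instances
print(log_likelihood_gradient(theta, data))
```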

Maximum a Posteriori (MAP) Estimation
- Introduce a prior distribution P(θ) over the model parameters
- Bayesian approach: given D = {x[1], …, x[M]},
  P(x[M+1] | D) = ∫ P(x[M+1] | θ) P(θ | D) dθ
- Maximum a posteriori (MAP) estimation:
  argmax_θ P(θ | D) = argmax_θ P(θ) P(D | θ)
- Maximum likelihood estimation (MLE):
  argmax_θ P(D | θ)

Gaussian Prior
- MAP estimation: argmax_θ P(θ | D) = argmax_θ P(θ) P(D | θ)
- Converting to log-space: log P(θ | D) = log P(D | θ) + log P(θ) + const
- Gaussian prior:
  P(θ | σ²) = Π_k 1/(√(2π) σ) exp( −θ_k² / (2σ²) )
- This gives L2 regularization:
  log P(θ) = −Σ_k θ_k² / (2σ²) + const
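
Putting the two pieces above together: with a Gaussian prior, maximizing the log-posterior is the same as maximizing the log-likelihood minus an L2 penalty,

  log P(θ | D) = log P(D | θ) − Σ_k θ_k² / (2σ²) + const,

so MAP estimation with this prior is L2-regularized maximum likelihood, with the penalty strength controlled by σ² (a smaller σ² pulls the parameters more strongly toward 0).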

Laplacian Prior
- Laplacian prior:
  P(θ | β) = Π_k 1/(2β) exp( −|θ_k| / β )
- Converting to log-space, this gives L1 regularization:
  log P(θ) = −Σ_k |θ_k| / β + const
- [Figure: the Laplacian distribution (β = 1) and the Gaussian distribution (σ² = 1)]

Why Regularization?
- Both forms of regularization penalize parameters whose magnitude is large
- Why is a bias in favor of parameters of low magnitude a reasonable one?
  - A prior often serves to pull the distribution toward an uninformed one, smoothing out fluctuations in the data
  - A distribution is smooth if the probabilities assigned to different assignments are not radically different
- Consider two assignments ξ and ξ'. The log of their relative probability is
  ln [ P(ξ) / P(ξ') ] = Σ_{k=1}^K θ_k f_k(ξ) − Σ_{k=1}^K θ_k f_k(ξ') = Σ_{k=1}^K θ_k ( f_k(ξ) − f_k(ξ') )
- When all θ_k have small magnitude, this log-ratio is also bounded, resulting in a smooth distribution

L1 vs L2 Regularization
- Gaussian prior (L2): penalty −Σ_k θ_k² / (2σ²)
- Laplacian prior (L1): penalty −Σ_k |θ_k| / β
- Key differences:
  - In L2, the penalty grows quadratically with the parameter magnitude; in L1, the penalty is linear in the parameter magnitude
  - In L2, as a parameter gets close to 0, the effect of the penalty diminishes, whereas in the L1 case the penalty stays linear all the way until the parameter value is 0
  - The models learned with L1 regularization therefore tend to be much sparser than in the L2 case
  - The strength of the L1 penalty depends on the hyper-parameter β
- [Figure: the L1 and L2 penalty functions around 0]

Learning Undirected Graphs
- The likelihood function
- Learning parameters
  - Collective classification with HMM, MEMM, CRF
  - Generative vs. discriminative models
  - Directed vs. undirected models
  - Learning with incomplete data
- Learning with priors
  - Maximum a posteriori (MAP) estimation
- Learning with alternative objectives
  - Pseudo-likelihood objective
  - Max-margin learning
- Structure learning
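
The sparsity difference can be seen on a one-dimensional example: with an L2 penalty the optimum is shrunk toward 0 but is never exactly 0, while the L1 penalty zeroes the parameter outright once the penalty outweighs the data term (soft-thresholding). A minimal sketch with an arbitrary quadratic data term (all numbers are illustrative):

```python
import numpy as np

# Data term: 0.5 * (theta - a)^2, i.e., the unpenalized optimum is theta = a.
a, lam = 0.8, 1.0   # 'a' and the regularization strength are arbitrary

# L2 (Gaussian prior): minimize 0.5*(theta - a)^2 + 0.5*lam*theta^2
theta_l2 = a / (1.0 + lam)          # closed form: shrunk toward 0, never exactly 0

# L1 (Laplacian prior): minimize 0.5*(theta - a)^2 + lam*|theta|
theta_l1 = np.sign(a) * max(abs(a) - lam, 0.0)   # soft-thresholding: exactly 0 here

print(theta_l2, theta_l1)   # e.g. 0.4 vs 0.0 -- only L1 produces an exact zero
```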

Why Alternative Objectives?
- The log-likelihood objective, in the case of a single data instance ξ:
  ℓ(θ : ξ) = ln P̃(ξ) − ln Z(θ) = ln P̃(ξ) − ln Σ_{ξ'} P̃(ξ')
- MLE can be viewed as aiming to increase the distance between the log of the un-normalized probability (log-measure) of ξ and the aggregate of the measures of all instances
- Key difficulty: the 2nd term involves a summation over the exponentially many instances in Val(X)
  - In MLE, we have to compute the log-likelihood in every iteration (approximate inference)
- Alternative objectives
  - Aim to increase the difference between the log-measure of the data instance and a more tractable set of other instances ('contrastive' objectives)

Pseudo-likelihood
- For a data instance ξ, using the chain rule, we can write
  P(ξ) = Π_{i=1}^n P(x_i | x_1, …, x_{i−1})
- We can approximate this formulation by replacing each term with the conditional probability of x_i given all the other variables x_{−i}:
  P(ξ) ≈ Π_{i=1}^n P(x_i | x_1, …, x_{i−1}, x_{i+1}, …, x_n)
- This approximation leads to the pseudolikelihood objective. Given D with M training instances:
  ℓ_PL(θ : D) = (1/M) Σ_m Σ_i ln P(x_i[m] | x_{−i}[m], θ)
  (the sums run over each instance m and each variable i)
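
The computational point is that each term of the pseudolikelihood normalizes over the values of a single variable X_i, with all other variables clamped to their observed values. A minimal sketch for a small pairwise binary model (parameters and data are made up for illustration):

```python
import numpy as np

# Pairwise log-linear model over binary variables: theta_pair[(i, j)] rewards X_i == X_j.
theta_pair = {(0, 1): 1.2, (1, 2): -0.5, (0, 2): 0.3}

def local_score(x, i, value):
    """Sum of the log-factors that involve X_i, with X_i set to `value` and the rest clamped."""
    s = 0.0
    for (a, b), w in theta_pair.items():
        if i in (a, b):
            xa = value if a == i else x[a]
            xb = value if b == i else x[b]
            s += w * float(xa == xb)
    return s

def pseudolikelihood(data):
    """(1/M) sum_m sum_i log P(x_i[m] | x_{-i}[m]); each term normalizes over one variable only."""
    total = 0.0
    for x in data:
        for i in range(len(x)):
            scores = np.array([local_score(x, i, v) for v in (0, 1)])
            total += scores[x[i]] - np.logaddexp.reduce(scores)
    return total / len(data)

data = [(1, 1, 0), (0, 0, 0), (1, 0, 1)]    # toy fully observed instances
print(pseudolikelihood(data))
```

No term ever sums over joint configurations of more than one variable, which is why no global inference is needed.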

Gradient of Pseudolikelihood
- Pseudolikelihood objective:
  ℓ_PL(θ : D) = (1/M) Σ_m Σ_i ln P(x_i[m] | x_{−i}[m], θ)
- Each term is a log-conditional-likelihood term over a single variable X_i, conditioned on all the remaining variables:
  ln P(x_i | x_{−i}) = Σ_{k: X_i ∈ Scope[f_k]} θ_k f_k[x_i, u] − ln Σ_{x_i'} exp( Σ_{k: X_i ∈ Scope[f_k]} θ_k f_k[x_i', u] )
- The 2nd term involves a summation over the values of only a single variable X_i (so it does not require inference at each step)
- Widely used in vision, spatial statistics, etc.
- Jointly concave over all parameters
- Consistent estimator: as the number of data instances M goes to infinity, with probability 1, the true parameter Θ* (the maximizer of the log-likelihood objective) is a global optimum of the pseudolikelihood objective

Pseudolikelihood vs Likelihood
- When does the pseudolikelihood not work well?
  - It depends on the types of queries for which we intend to use the model
- The pseudolikelihood objective is a better training objective if we plan to run queries where we condition on most of the variables and query the values of only a few; then the pseudolikelihood objective is a very close match to the type of predictions we would like to make
- Any example? Netflix collaborative filtering
- [Figure: a movie-rating matrix (Star Wars I, II, VI, Harry Potter I, II, The Matrix, Indiana Jones) in which a new user's missing ratings are filled in by probabilistic inference]

Pseudolikelihood vs Likelihood
- The likelihood objective is a better training objective if a typical query involves most or all of the variables in the model
  - E.g., given E = image, what is P(X = labels | E = image)?
  - [Figure: image segmentation with a grid-structured Markov network; images from the website of Prof. Daphne Koller's lab]
- However, even in cases where the likelihood is the more appropriate objective, we may have to resort to the pseudolikelihood for computational reasons
  - In many cases, this objective performs surprisingly well

Max-margin Training
- Say that we want to use the model for predicting a MAP assignment (e.g., image segmentation)
- In this setting, our training set consists of a set of pairs D = {(y[m], x[m])}, m = 1, …, M
- Given an observation x[m], we want our learned model to give the highest probability to y[m]
  - In other words, we want the probability P_θ(y[m] | x[m]) to be higher than any other probability P_θ(y | x[m]) for y ≠ y[m]
- To increase our confidence in the prediction, we would like to make the log-probability gap as large as possible by increasing
  ln P_θ(y[m] | x[m]) − max_{y ≠ y[m]} ln P_θ(y | x[m])
- This difference between the log-probability of the target assignment y[m] and that of the next best assignment is called the margin; the higher the margin, the more confident the model is
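
One useful consequence of the log-linear form: the partition function ln Z(x[m]) appears in both terms of the margin and cancels, so

  ln P_θ(y[m] | x[m]) − ln P_θ(y | x[m]) = Σ_k θ_k f_k(y[m], x[m]) − Σ_k θ_k f_k(y, x[m]),

and maximizing the margin only asks the feature score of the true labeling to beat that of every competing labeling; Z never has to be computed.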

Handwriting Recognition Example
- Margin: ln P_θ(y[m] | x[m]) − max_{y ≠ y[m]} ln P_θ(y | x[m])
- For the CRF over the handwritten word 'brace':
  - We want: ln P_θ('brace' | x) to be higher than ln P_θ(y | x) for every other labeling y ('aaaaa', 'aaaab', …, 'zzzzz')
  - Equivalently: ln P_θ('brace' | x) − max_{y ≠ 'brace'} ln P_θ(y | x) should be large ("a lot!")

Learning Undirected Graphs
- The likelihood function
- Learning parameters
  - Collective classification with HMM, MEMM, CRF
  - Generative vs. discriminative models
  - Directed vs. undirected models
  - Learning with incomplete data
- Learning with priors
  - Maximum a posteriori (MAP) estimation
- Learning with alternative objectives
- Structure learning
  - Structure learning via L1 regularization

Structure Learning
- Start with atomic features
- Greedily conjoin features to improve the score
- Problem: need to re-estimate the weights for each new candidate
- Approximation: keep the weights of the previous features constant

Structure Learning via L1 Regularization*
- Treat the structure learning problem as a parameter estimation problem in a fully connected network
- Use L1 regularization to obtain a sparse representation
- [Figure: a fully connected network over X_1, …, X_5 and the sparse network recovered after L1 regularization]
- Likelihood or pseudolikelihood objective
- Convex optimization problem
* Lee 07, Wainwright 07, Hoefling
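
In practice, the L1-regularized (pseudo)likelihood is optimized with gradient methods in which the penalty is handled by soft-thresholding the edge weights at each step, so many of them become exactly zero and the corresponding edges drop out of the network. A minimal sketch of one such proximal-gradient (ISTA-style) update; the gradient function here is a made-up placeholder standing in for, e.g., the pseudolikelihood gradient:

```python
import numpy as np

def proximal_gradient_step(theta_edges, grad_fn, step, lam):
    """One ISTA-style update for an L1-regularized objective over edge weights.

    theta_edges: dict mapping candidate edge (i, j) -> weight (fully connected to start)
    grad_fn:     function returning the gradient of the negative (pseudo)likelihood
    step:        gradient step size
    lam:         L1 regularization strength
    """
    grad = grad_fn(theta_edges)
    new_theta = {}
    for edge, w in theta_edges.items():
        w = w - step * grad[edge]                       # gradient step on the smooth part
        w = np.sign(w) * max(abs(w) - step * lam, 0.0)  # soft-threshold: produces exact zeros
        new_theta[edge] = w
    # Edges whose weight is exactly zero are effectively removed from the structure.
    return new_theta

# Toy illustration with a fake gradient (all numbers are made up).
theta = {(0, 1): 0.9, (0, 2): 0.05, (1, 2): -0.4}
fake_grad = lambda th: {e: 0.1 for e in th}
print(proximal_gradient_step(theta, fake_grad, step=0.5, lam=0.2))
```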
