Language and Statistics II


1 Language and Statistics II Lecture 19: EM for Models of Structure Noah Smith

2 Expectation-Maximization
E step: for every example $x_i$ and every hidden value $y$,
$$q(y \mid x_i) \leftarrow p_{\theta^{(t)}}(y \mid x_i) = \frac{p_{\theta^{(t)}}(x_i, y)}{\sum_{y'} p_{\theta^{(t)}}(x_i, y')}$$
("soft assignment," or "voting")
M step:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_\theta(x, y)$$
(pretend $\tilde{p} \cdot q$ is fully-observed data; do MLE)
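
As a concrete picture of these two steps, here is a minimal Python sketch of the EM loop for a model with a small, enumerable hidden space. The function names (joint_prob, mle_from_weighted_data) are hypothetical stand-ins, not anything from the lecture, and real structure models replace the enumeration with dynamic programming (see the later slides).

```python
# Minimal EM sketch for a model with an enumerable hidden space Y.
# `joint_prob(theta, x, y)` returns p_theta(x, y); `mle_from_weighted_data`
# does the fully-observed MLE on (x, y, weight) triples. Both are hypothetical.
def em(examples, theta, Y, joint_prob, mle_from_weighted_data, iterations=20):
    for t in range(iterations):
        weighted_data = []
        # E step: q(y | x_i) proportional to p_theta(x_i, y)  (soft assignment)
        for x in examples:
            scores = {y: joint_prob(theta, x, y) for y in Y}
            z = sum(scores.values())
            for y, s in scores.items():
                weighted_data.append((x, y, s / z))
        # M step: MLE, pretending the weighted (x, y) pairs are observed data
        theta = mle_from_weighted_data(weighted_data)
    return theta
```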

3 Proof that EM = Partial-Data MLE
Claim: EM iterations improve likelihood, converging to a local optimum.
Maximizing likelihood:
$$\max_\theta \prod_{i=1}^n p_\theta(x_i) \;\equiv\; \max_\theta \sum_{i=1}^n \log \sum_y p_\theta(x_i, y) \;\equiv\; \max_\theta \sum_x \tilde{p}(x) \log \sum_y p_\theta(x, y)$$
The M step:
$$\max_\theta \sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_\theta(x, y)$$

4 MLE Objective - M-Step Objective
what MLE wants maximized: $\sum_x \tilde{p}(x) \log \sum_y p_\theta(x, y)$
what the M step maximizes: $\sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_\theta(x, y)$
Their difference:
$$\sum_x \tilde{p}(x) \log \sum_y p_\theta(x, y) \;-\; \sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_\theta(x, y)$$
$$= \sum_x \tilde{p}(x) \sum_y q(y \mid x) \log \sum_{y'} p_\theta(x, y') \;-\; \sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_\theta(x, y)$$
$$= \sum_x \tilde{p}(x) \sum_y q(y \mid x) \log \frac{\sum_{y'} p_\theta(x, y')}{p_\theta(x, y)} \;=\; -\sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_\theta(y \mid x)$$

5 Central Claim
$$\sum_x \tilde{p}(x) \log \sum_y p_{\theta^{(t+1)}}(x, y) \;\ge\; \sum_x \tilde{p}(x) \log \sum_y p_{\theta^{(t)}}(x, y)$$
The difference between the two sides splits into two parts, one handled by each step.
Part 1 (M step): the M-step objective cannot decrease, because $\theta^{(t+1)}$ maximizes it:
$$\sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_{\theta^{(t+1)}}(x, y) \;=\; \max_\theta \sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_\theta(x, y) \;\ge\; \sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_{\theta^{(t)}}(x, y)$$
Part 2 (E step): because the E step set $q(y \mid x) = p_{\theta^{(t)}}(y \mid x)$, the gap from the previous slide changes by
$$\sum_x \tilde{p}(x) \sum_y p_{\theta^{(t)}}(y \mid x) \log \frac{p_{\theta^{(t)}}(y \mid x)}{p_{\theta^{(t+1)}}(y \mid x)} \;=\; \mathrm{E}_{\tilde{p}}\!\left[ D\!\left( p_{\theta^{(t)}}(\cdot \mid x) \,\Big\|\, p_{\theta^{(t+1)}}(\cdot \mid x) \right) \right] \;\ge\; 0$$
Both parts are nonnegative, so the likelihood cannot decrease.

6 Central Claim
(Same decomposition as the previous slide.) The M step guarantees an increase in Φ; the E step guarantees a decrease in Δ.
what MLE wants maximized: $\sum_x \tilde{p}(x) \log \sum_y p_\theta(x, y)$
what the M step maximizes: $\sum_x \tilde{p}(x) \sum_y q(y \mid x) \log p_\theta(x, y)$

7 Convergence
EM iterations will never decrease the likelihood. Under some conditions, EM converges to a saddle point; generally it is assumed that EM will converge to a local maximum. Convergence is linear (i.e., slow); the rate depends on how much information is missing.

8 Expectation-Maximization
E step: for every example $x_i$ and every hidden value $y$,
$$q(y \mid x_i) \leftarrow p_{\theta^{(t)}}(y \mid x_i) = \frac{p_{\theta^{(t)}}(x_i, y)}{\sum_{y'} p_{\theta^{(t)}}(x_i, y')}$$
("soft assignment," or "voting")
M step:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_\theta(x, y)$$
(pretend $\tilde{p} \cdot q$ is fully-observed data; do MLE)

9 Toward Structure Models
There are way too many values of Y to sum over! Two key points: never need to sum over Y by enumeration; never need q to be computed explicitly.

10 Consider the M Step
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_\theta(x, y)$$
(pretend $\tilde{p} \cdot q$ is fully-observed data)
To maximize likelihood, what do we need? For multinomial-based models (HMMs, PCFGs, etc.), we need counts. For log-linear models in general, we need counts.

11 Simplifying the M Step (multinomials)
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_\theta(x, y)$$
For a multinomial-based model, $p_\theta(x, y) = \prod_e \theta_e^{\mathrm{count}(e;\, x, y)}$, where $e$ ranges over events, so
$$\sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_\theta(x, y) = \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \sum_e \mathrm{count}(e; x, y) \log \theta_e$$
$$= \sum_e \log(\theta_e) \sum_{x, y} \tilde{p}(x)\, q(y \mid x)\, \mathrm{count}(e; x, y) = \frac{1}{n} \sum_e \log(\theta_e) \sum_{i=1}^n \sum_y q(y \mid x_i)\, \mathrm{count}(e; x_i, y)$$
The inner sum is the expected count of $e$ given $x_i$.
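
To make the "expected counts, then normalize" recipe concrete, here is a small Python sketch of the multinomial M step; events and q are hypothetical callables (an event generator for a completed (x, y) pair and the E-step posterior), and the inner loop over Y stands in for the dynamic programs discussed on the following slides.

```python
from collections import defaultdict

# Multinomial M step sketch: accumulate expected event counts under q, then
# renormalize within each conditioning context. `events(x, y)` yields
# hypothetical (context, event, count) triples; `q(y, x)` returns q(y | x).
def m_step_multinomial(examples, q, Y, events):
    expected = defaultdict(lambda: defaultdict(float))
    for x in examples:
        for y in Y:                       # in practice: forward-backward / inside-outside
            w = q(y, x)
            for context, event, count in events(x, y):
                expected[context][event] += w * count
    return {c: {e: v / sum(d.values()) for e, v in d.items()}
            for c, d in expected.items()}
```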

12 Simplifying the M Step (log-linear models)
$$\theta^{(t+1)} \leftarrow \arg\max_\theta \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_\theta(x, y)$$
For a log-linear model, $p_\theta(x, y) = \frac{\exp(\theta \cdot f(x, y))}{Z_\theta}$ with $Z_\theta = \sum_{x', y'} \exp(\theta \cdot f(x', y'))$, so
$$\sum_{x, y} \tilde{p}(x)\, q(y \mid x) \log p_\theta(x, y) = \sum_{x, y} \tilde{p}(x)\, q(y \mid x) \left[ \theta \cdot f(x, y) - \log Z_\theta \right] = \frac{1}{n} \sum_{i=1}^n \sum_y q(y \mid x_i)\, \theta \cdot f(x_i, y) \;-\; \log Z_\theta$$
($Z_\theta$: next time!)

13 Sufficient Statistics
A statistic $S$ is sufficient for a parameter $\theta$ when $p_\theta(\mathrm{data} \mid S(\mathrm{data})) = p(\mathrm{data} \mid S(\mathrm{data}))$, i.e., the conditional distribution of the data given $S$ does not depend on $\theta$.
The M step only requires sufficient statistics under $q$. For NLP models, this usually means expected counts.

14 HMM Forward and Backward Probabilities
$$\alpha(i, c) = p(s_1^i,\, C_i = c) \quad \text{(forward probability)} \qquad \beta(i, c) = p(s_{i+1}^n \mid C_i = c) \quad \text{(backward probability)}$$
$$\alpha(i, c)\,\beta(i, c) = p(s_1^n,\, C_i = c)$$
$$\frac{\alpha(i, c)\,\beta(i, c)}{\alpha(n+1, \mathrm{stop})} = p(C_i = c \mid s_1^n) \quad \text{(posterior probability that } s_i \text{ is labeled with class } c\text{)}$$
$$\sum_{i=1}^n \frac{\alpha(i, c)\,\beta(i, c)}{\alpha(n+1, \mathrm{stop})} = \mathrm{E}\left[\, \left|\{i : C_i = c\}\right| \;\middle|\; s_1^n \,\right] \quad \text{(expected count of class } c\text{)}$$
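
A compact Python sketch of these recurrences (with NumPy), under the assumption that the HMM is parameterized by start, transition, stop, and emission tables indexed as shown in the comments; it returns the posterior label marginals and the expected class counts used in the M step.

```python
import numpy as np

# Forward-backward sketch. start[c], trans[c, c2], stop[c] are probabilities of
# starting in c, moving c -> c2, and stopping from c; emit[c][word] emits word
# from class c. All of these names are assumptions, not from the lecture.
def forward_backward(sentence, start, trans, stop, emit):
    n, K = len(sentence), len(start)
    alpha = np.zeros((n, K))   # alpha[i, c] ~ p(s_1..s_{i+1}, C_{i+1} = c)  (0-indexed)
    beta = np.zeros((n, K))    # beta[i, c]  ~ p(s_{i+2}..s_n, stop | C_{i+1} = c)
    alpha[0] = start * np.array([emit[c][sentence[0]] for c in range(K)])
    for i in range(1, n):
        e = np.array([emit[c][sentence[i]] for c in range(K)])
        alpha[i] = (alpha[i - 1] @ trans) * e
    beta[n - 1] = stop
    for i in range(n - 2, -1, -1):
        e = np.array([emit[c][sentence[i + 1]] for c in range(K)])
        beta[i] = trans @ (e * beta[i + 1])
    Z = alpha[n - 1] @ stop                  # p(s_1..s_n)
    posterior = alpha * beta / Z             # posterior[i, c] = p(C_{i+1} = c | s)
    return posterior, posterior.sum(axis=0)  # marginals, expected class counts
```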

15 CKY Inside and Outside Probabilities
$$\alpha(i, j, N) = p(s_1^{i-1},\, N_{ij},\, s_{j+1}^n) \quad \text{(outside probability)} \qquad \beta(i, j, N) = p(s_i^j \mid N_{ij}) \quad \text{(inside probability)}$$
$$\alpha(i, j, N)\,\beta(i, j, N) = p(s_1^n,\, N_{ij}) \qquad \frac{\alpha(i, j, N)\,\beta(i, j, N)}{\beta(1, n, S)} = p(N_{ij} \mid s_1^n)$$
$$\frac{\alpha(i, k, N)\; p(N \to N'\, N'')\; \beta(i, j, N')\; \beta(j{+}1, k, N'')}{\beta(1, n, S)} = p\!\left(N_{ik} \to N'_{ij}\, N''_{(j+1)k} \mid s_1^n\right)$$
$$\sum_{i=1}^n \sum_{j=i}^n \sum_{k=j+1}^n p\!\left(N_{ik} \to N'_{ij}\, N''_{(j+1)k} \mid s_1^n\right) = \mathrm{E}\left[\, \left|\{(i, j, k) : N_{ik} \to N'_{ij}\, N''_{(j+1)k}\}\right| \;\middle|\; s_1^n \,\right] \quad \text{(expected count of the rule)}$$
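
For reference, here is a sketch of the inside pass only, for a PCFG in Chomsky normal form; the rule and lexicon tables are assumed dictionaries, and the outside pass and expected rule counts would be built on top of this same chart.

```python
from collections import defaultdict

# Inside-probability sketch for a CNF PCFG. `lex[(N, w)]` is p(N -> w) and
# `rules[(N, A, B)]` is p(N -> A B); both tables are assumptions for this
# sketch. chart[(i, k)][N] plays the role of beta(i, k, N) = p(s_i..s_k | N).
def inside(sentence, lex, rules, start_symbol='S'):
    n = len(sentence)
    chart = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(sentence, start=1):
        for (N, word), p in lex.items():
            if word == w:
                chart[(i, i)][N] += p
    for width in range(1, n):                  # span end minus span start
        for i in range(1, n - width + 1):
            k = i + width
            for j in range(i, k):              # split into (i..j) and (j+1..k)
                for (N, A, B), p in rules.items():
                    s = p * chart[(i, j)][A] * chart[(j + 1, k)][B]
                    if s:
                        chart[(i, k)][N] += s
    return chart[(1, n)][start_symbol], chart  # p(sentence), full chart
```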

16 In General
Don't compute q directly in the E step. Just get the sufficient statistics. Inside and outside algorithms can help for some models!
Other alternatives (less common in NLP): sample from q to get sufficient statistics; use a variational approximation to q (this should remind you of the factored dual in structured maximum-margin training!); use statistics on structure pieces instead of whole structures.

17 Pereira and Schabes (1992)
Suppose you have a partially bracketed corpus. We want to constrain re-estimation to respect the known bracketings; everything else is hidden.
(Democrats took control of both houses)   [no information]
((Democrats) took control of (both houses))   [base NPs]
((Democrats) took (control of (both houses)))   [all NPs]
(Democrats (took control of both houses))   [VP]

18 CKY (diagram)
Chart items: X over span (i, i) from a lexical rule X -> w_i; X over span (i, k) built from Y over (i, j) and Z over (j+1, k) via a rule X -> Y Z; goal item S over (1, n).

19 CKY with bracketing constraints (diagram)
Same chart as before, except that an item X over span (i, k) is allowed only if (i, k) doesn't cross any known constituents.
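
The constraint itself is just a span-compatibility test; a minimal sketch (span and bracket names are assumptions) that a CKY implementation could call before filling a chart cell:

```python
# A span (i, k) is compatible with a partial bracketing if it crosses none of
# the known brackets: two spans cross when they overlap but neither contains
# the other. Spans are inclusive (start, end) pairs over word positions.
def crosses(span, brackets):
    i, k = span
    for a, b in brackets:
        if (a < i <= b < k) or (i < a <= k < b):
            return True
    return False

def allowed(span, brackets):
    return not crosses(span, brackets)
```

In the inside pass sketched earlier, the effect is simply to skip building chart[(i, k)] whenever allowed((i, k), brackets) is false.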

20 Pereira and Schabes (1992)
When compared with unconstrained EM: better fit to the data (cross-entropy), better accuracy, faster convergence.
Later result (Hwa, 1999): high-level structure helps more than low-level structure.

21 Merialdo (1994)
Suppose you have some tagged text and some untagged text. You could train a tagger on the tagged text. Can you use the untagged text to help?
Merialdo: vary the amount of tagged text, use the tagged text to initialize, then run EM.

22 Merialdo (1994)

23 Merialdo (1994)
Similar results by Elworthy (1994). Another way to combine the data:
$$\max_\theta \sum_{i \in L} \log p_\theta(x_i, y_i) + \sum_{i \in U} \log p_\theta(x_i)$$
Equivalently, augment the E-step counts with the observed counts before each M step. (Same effect, anecdotally.)
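
A sketch of that equivalent recipe, reusing the expected-count tables from the multinomial M step above; count_events is again a hypothetical event generator for fully observed pairs:

```python
from collections import defaultdict

# Before each M step, add observed counts from the labeled set L to the
# expected counts gathered from the unlabeled set U in the E step.
def combine_counts(expected_counts, labeled_data, count_events):
    combined = defaultdict(lambda: defaultdict(float))
    for context, d in expected_counts.items():
        for event, v in d.items():
            combined[context][event] += v          # soft counts from U
    for x, y in labeled_data:
        for context, event, c in count_events(x, y):
            combined[context][event] += c          # observed counts from L
    return combined                                # then normalize as usual
```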

24 The Two Main Problems With EM
Marginal likelihood is not accuracy: the objective EM improves is not the evaluation measure we care about.
Local optima.
Plot from Smith (2006); similar results in Charniak (1993).

25 Variants
MAP instead of MLE: add a prior (smoothing).
For speed: Viterbi approximation (mode instead of expectation in the E step); incremental EM (M step after every example).
To improve search quality: deterministic annealing (gradually relax an entropy constraint on q; affects the E step only); random restarts; random reweighting of examples; really good initialization; alternative objectives (next week).

26 Klein & Manning (2002)
A highly deficient grammatical model that predicts POS tag sequences: the constituent-context model (CCM).
Best published unsupervised parsing results on WSJ-10 (in 2002).
Trained using EM with an interesting initializer: Pr_split.

27 Constituent-Context Model
$t = (t_1 \cdots t_n)$ is the tag sequence. Let $C_{i,j}$ be a r.v., equal to 1 if $t_i \cdots t_j$ is a constituent, 0 if not. $C$ is the set of all $C_{i,j}$ (an upper-triangular matrix). The valid values of $C$ are the ones that form binary trees.
$$\Pr(t, C) = \Pr(C) \Pr(t \mid C) = \Pr(C) \prod_{i,j:\, 1 \le i \le j \le n} \underbrace{\Pr(t_i \cdots t_j \mid C_{i,j})}_{\text{constituent}}\; \underbrace{\Pr(t_{i-1}, t_{j+1} \mid C_{i,j})}_{\text{context}} = \frac{1}{\tau(n)} \prod_{i,j:\, 1 \le i \le j \le n} \Pr(t_i \cdots t_j \mid C_{i,j}) \Pr(t_{i-1}, t_{j+1} \mid C_{i,j})$$
where $1/\tau(n)$ is the uniform distribution over binary trees.
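
A sketch of this product as code, mainly to pin down the indexing of spans and contexts; the probability tables (p_const, p_context) and log_tau are assumed inputs, not part of the original slides:

```python
from math import log

# CCM joint log-probability of a tag sequence and a full span assignment C.
# C maps (i, j) -> 1 for constituent spans; missing spans default to 0.
# log_tau is log of the number of binary trees over n terminals.
def ccm_log_prob(tags, C, p_const, p_context, log_tau):
    n = len(tags)
    padded = ['<S>'] + list(tags) + ['</S>']     # padded[1..n] are the tags
    total = -log_tau                             # uniform Pr(C) over binary trees
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            c = C.get((i, j), 0)
            span_yield = tuple(padded[i:j + 1])          # t_i .. t_j
            left, right = padded[i - 1], padded[j + 1]   # t_{i-1}, t_{j+1}
            total += log(p_const[(span_yield, c)]) + log(p_context[(left, right, c)])
    return total
```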

28 Example: the whole sentence t_1 ... t_6 is a constituent (C_{1,6} = 1); its factor is Pr(t_1 t_2 t_3 t_4 t_5 t_6 | 1) Pr(<S>, </S> | 1).

29 Example: the span t_1 ... t_5 is not a constituent (C_{1,5} = 0); its factor is Pr(t_1 t_2 t_3 t_4 t_5 | 0) Pr(<S>, t_6 | 0).

30 Example: the span t_1 ... t_4 is a constituent (C_{1,4} = 1); its factor is Pr(t_1 t_2 t_3 t_4 | 1) Pr(<S>, t_5 | 1).

31 Example: the span t_5 t_6 is a constituent (C_{5,6} = 1); its factor is Pr(t_5 t_6 | 1) Pr(t_4, </S> | 1).

32 About CCM
Only four multinomials to estimate. Need two features for every substring of tags! (This is why they used WSJ-10.) Highly deficient!

33 Pr_split
Given the length of the sentence, n:
Choose N_1 u.a.r. from {1, 2, ..., n - 1}, splitting the sentence into pieces of lengths N_1 and n - N_1.
Choose N_2 u.a.r. from {1, 2, ..., N_1 - 1}.
Choose N_3 u.a.r. from {1, 2, ..., n - N_1 - 1}.
Continue until no further splits can be made.
This model gives a closed form for the expectations; K&M use it to generate initial expectations, then start with an M step.
*See also Cover & Thomas problem 4.3 (p. 72).
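
A sketch of this generative process as a sampler, assuming (as reconstructed above) that each split point of a length-m span is chosen uniformly at random among its m - 1 positions; sampling is only for illustration, since K&M use the closed-form expectations directly:

```python
import random

# Sample one binary bracketing of the span (i, k) under the uniform-split
# process and return the set of bracket spans it produces.
def sample_split_brackets(i, k, brackets=None):
    if brackets is None:
        brackets = set()
    brackets.add((i, k))
    if k > i:                               # length >= 2: keep splitting
        cut = random.randint(i, k - 1)      # u.a.r. cut between cut and cut + 1
        sample_split_brackets(i, cut, brackets)
        sample_split_brackets(cut + 1, k, brackets)
    return brackets

# Averaging span indicators over many samples of sample_split_brackets(1, n)
# approximates the initial expectations used before the first M step.
```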

34 Importance of Pr_split
(Table: rows are right-branching trees, EM (Pr_split) + CCM, EM (uniform init.) + CCM, and the upper bound (binary trees); columns are unlabeled precision UP (%), unlabeled recall UR (%), average crossing brackets, percent perfectly parsed, 0 CB, 2 CB, iterations, and cross-entropy in bits.)

35 Next Time
Contrastive estimation as an alternative to EM; application to unsupervised tagging and parsing.
