Efficient Higher-Order CRFs for Morphological Tagging
Efficient Higher-Order CRFs for Morphological Tagging
Thomas Müller, Helmut Schmid and Hinrich Schütze
Center for Information and Language Processing, University of Munich
Outline
1 Contributions
2 Motivation
3 Model
4 Experiments
Contributions
- Fast approximate CRF tagger for big tag sets
- Allows training CRFs with high orders
- Training time reductions from several days to several hours
- Accuracy improvements due to higher orders
Motivation
Introduction
Example: Die Rebellen haben kein Lösegeld verlangt
(The rebels have no ransom demanded)
Tags: ART NN VAF PI NN VVP, with morphological features such as nom.pl.masc for "Die Rebellen", pl.3 for "haben" and acc.sg.neut for "kein Lösegeld"
- Assign coarse POS and fine MORPH tags
- Tagging works for short-distance dependencies
- Higher orders are needed for long-distance dependencies, e.g.:
  weil er kein Lösegeld verlangt
  (because he no ransom demands)
  KO PP PI NN VVF
- Higher-order tagging is expensive, hence the Coarse-To-Fine approach
Coarse-To-Fine Decoding
Basic Idea:
1 Create a 0-order lattice
2 Calculate posterior probabilities
3 Prune states with low posteriors
4 Increase lattice order
5 Go to 2
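The 0-order stage of the loop above can be sketched on its own: with no transition features the model factorizes over positions, so the posterior of a tag is just a per-position softmax over its score λ·φ(x̄, y, t), and pruning keeps the tags above a threshold τ. A minimal Python sketch (the scores and the threshold are made-up illustration values, not taken from the slides):

```python
import math

def zero_order_posteriors(scores):
    """Posterior p(y | x, t) under a 0-order model: a per-position softmax,
    since without transition features the positions are independent."""
    posteriors = []
    for position_scores in scores:
        m = max(position_scores.values())           # subtract max for stability
        exp_scores = {tag: math.exp(s - m) for tag, s in position_scores.items()}
        z = sum(exp_scores.values())
        posteriors.append({tag: e / z for tag, e in exp_scores.items()})
    return posteriors

def prune(posteriors, tau):
    """Keep only the tags whose posterior reaches the threshold tau."""
    return [{tag for tag, p in pos.items() if p >= tau} for pos in posteriors]

# toy per-position scores for "Die" and "verlangt" (illustrative only)
scores = [{"ART": 2.0, "PDS": -0.5, "PRO": -4.0},
          {"VVF": 0.2, "VVP": 0.0, "VVI": -5.0}]
post = zero_order_posteriors(scores)
kept = prune(post, 0.05)   # low-posterior candidates such as PRO and VVI drop out
```

The surviving candidate sets are what the next, higher-order lattice is built from.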
Coarse-To-Fine Decoding II
Walkthrough on the example sentence "Die Rebellen haben kein Lösegeld verlangt":
- Create a 0-order lattice: every position holds its candidate tags (e.g. Die: ART, PDS, PRO; verlangt: VVP, VVF, VVI, VAP)
- Calculate posterior probabilities, e.g. Die: ART 0.90, PDS 0.10; haben: VAF 0.93, VAI 0.07; verlangt: VVF 0.54, VVP 0.46
- Prune states with low posteriors: candidates such as PRO 0.00, ADJ 0.01 and FM 0.01 are removed
- Increase lattice order to 1: the posteriors sharpen, e.g. verlangt: VVF 0.69, VVP 0.31
- Increase lattice order to 2: states become tag bigrams, e.g. (NN VVP) 0.55 vs. (NN VVF) 0.45
- Increase lattice order to 3: states become tag trigrams, e.g. (PI NN VVP) 0.63 vs. (PI NN VVF) 0.37
Model
Conditional Random Fields
Model:

$$p_{\text{ME}}(\bar{y} \mid \bar{x}) = \frac{1}{Z_{\text{ME}}(\lambda, \bar{x})} \exp \sum_t \lambda \cdot \phi(\bar{x}, y_t, t)$$

$$p_{\text{CRF}}(\bar{y} \mid \bar{x}) = \frac{1}{Z_{\text{CRF}}(\lambda, \bar{x})} \exp \sum_t \lambda \cdot \phi(\bar{x}, y_t, y_{t-1}, t)$$

Prune using the per-position posteriors:

$$p_{\text{ME}}(y \mid \bar{x}, t) = \frac{1}{Z_{\text{ME}}(\lambda, \bar{x})} \exp \lambda \cdot \phi(\bar{x}, y, t)$$

$$p_{\text{CRF}}(y \mid \bar{x}, t) = \frac{\sum_{\bar{y}' : y'_t = y} \exp \sum_{t'} \lambda \cdot \phi(\bar{x}, y'_{t'}, y'_{t'-1}, t')}{Z_{\text{CRF}}(\bar{x})}$$

Train using L1-regularized Stochastic Gradient Descent (SGD) [Tsuruoka et al., 2009]
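The normalizer Z_CRF(λ, x̄) in the CRF equation is computed with the forward algorithm rather than by summing over all tag sequences explicitly. A NumPy sketch under my own naming conventions (array shapes and function names are not from the paper), checked against brute-force enumeration on a tiny input:

```python
import numpy as np
from itertools import product

def log_partition(emissions, transitions):
    """log Z_CRF(lambda, x) by the forward algorithm.
    emissions[t, y]   = lambda . phi(x, y, t)   (per-position scores)
    transitions[p, y] = transition score from previous tag p to tag y."""
    alpha = emissions[0].copy()                  # log forward scores at t = 0
    for t in range(1, len(emissions)):
        # alpha_new[y] = logsumexp_p(alpha[p] + transitions[p, y]) + emissions[t, y]
        alpha = np.logaddexp.reduce(alpha[:, None] + transitions, axis=0) + emissions[t]
    return np.logaddexp.reduce(alpha)

def log_partition_brute_force(emissions, transitions):
    """The same quantity by enumerating every tag sequence (tiny inputs only)."""
    T, K = emissions.shape
    scores = []
    for seq in product(range(K), repeat=T):
        s = emissions[0, seq[0]]
        for t in range(1, T):
            s += transitions[seq[t - 1], seq[t]] + emissions[t, seq[t]]
        scores.append(s)
    return np.logaddexp.reduce(np.array(scores))
```

The forward pass is O(T·K²) while enumeration is O(Kᵀ), which is exactly why large tag sets make exact higher-order CRFs expensive: at order n the effective state space grows to Kⁿ.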
Coarse-To-Fine Decoding
We could now do the following:
- Train a 0-order model and filter
- Train a 1st-order model on the filtered data and filter
- ...
We do not want to do that, because:
- We would need to train multiple models
- We would get multiple weights for the same features
- We do not need optimal lower-order models
Train a single joint model instead!
Lattice Generation

function GetSumLattice(sentence, τ, n)
    candidates ← GetAllCandidates(sentence)
    lattice ← ZeroOrderLattice(candidates)
    for i = 1 to n do
        candidates ← lattice.prune(τ_{i-1})
        lattice ← SequenceLattice(candidates, i)
    end for
    return lattice
end function
Lattice Generation II
- With the current lattice generation we never do lower-order updates
- We thus never force the model to keep the gold tags in the lower-order lattices
- If a gold tag gets pruned, do an Early Update [Collins and Roark, 2004]
Lattice Generation III

function GetSumLattice(sentence, τ, n)
    gold-tags ← GetTags(sentence)
    candidates ← GetAllCandidates(sentence)
    lattice ← ZeroOrderLattice(candidates)
    for i = 1 to n do
        candidates ← lattice.prune(τ_{i-1})
        if gold-tags ⊄ candidates then
            return lattice
        end if
        lattice ← SequenceLattice(candidates, i)
    end for
    return lattice
end function
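The early-update check can be illustrated in isolation: walk the increasingly fine lattices and stop as soon as pruning has dropped a gold tag, returning a lattice in which the gold sequence is still scored so the model can be updated against it. A toy sketch (the list-of-candidate-sets representation is my own simplification, not the paper's lattice data structure):

```python
def early_update_lattice(candidates_per_order, gold_tags):
    """candidates_per_order[i][t] is the set of tags surviving at position t
    after pruning for the order-i lattice. Returns (lattice, order): the last
    lattice in which every gold tag is still reachable, so training can do an
    early update [Collins and Roark, 2004] instead of continuing to refine."""
    previous, previous_order = None, -1
    for order, candidates in enumerate(candidates_per_order):
        gold_reachable = all(g in c for g, c in zip(gold_tags, candidates))
        if not gold_reachable:
            # a gold tag was pruned away: stop early and update here
            return previous, previous_order
        previous, previous_order = candidates, order
    return previous, previous_order

# toy data: pruning at order 1 drops VVP, so gold ("ART", "VVP") triggers an early stop
lattices = [
    [{"ART", "PDS"}, {"VVF", "VVP"}],   # order 0, after pruning
    [{"ART"}, {"VVF"}],                 # order 1, after pruning
]
```

This is the behaviour the slide's note describes: without it, nothing in training penalizes the lower-order models for pruning away the gold path.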
Lattice Generation IV
[figure: fraction of unreachable gold candidates (up to about 0.2) over training epochs]
Lattice Generation V
How do we set τ_i, the pruning threshold at order i?
Fixed τ_i do not work, because:
- p(y | x̄, t) decreases with increasing tag set size
- During training, we start with uniform models and end with sparse models
Solution: set τ_i dynamically to achieve a certain average number of tags μ:

$$\tau_i \leftarrow \begin{cases} 1.1\,\tau_i & \text{if } \hat{\mu}_i > \mu \\ 0.9\,\tau_i & \text{if } \hat{\mu}_i < \mu \end{cases}$$
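The dynamic-threshold rule amounts to a small multiplicative controller. I assume the usual pruning semantics here, in which states with posterior below τ_i are removed, so raising τ_i keeps fewer tags; the 10% step mirrors the update rule on the slide:

```python
def update_threshold(tau, mu_hat, mu, step=0.1):
    """Nudge the order-i pruning threshold so the measured average number of
    surviving tags per position (mu_hat) drifts toward the target mu."""
    if mu_hat > mu:
        return tau * (1.0 + step)   # too many candidates survive: prune harder
    if mu_hat < mu:
        return tau * (1.0 - step)   # too few candidates survive: relax
    return tau
```

In training one would apply this periodically per order, using the measured average candidate count μ̂_i, which also adapts the thresholds as the model moves from near-uniform to sparse.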
Experiments
Languages

Language    Sentences   POS
Arabic      15,
Czech       38,
English     38,
Spanish     14,
German      40,
Hungarian   61,034      57
Baseline Taggers
- SVMTool [Giménez and Màrquez, 2004]: SVM-based left-to-right tagger
- CRFSuite [Okazaki, 2007]: first-order Conditional Random Field (trained with SGD)
- Morfette [Chrupała et al., 2008]: averaged perceptron
- Stanford Tagger [Toutanova et al., 2003]: bidirectional Maximum Entropy Markov Model
POS Experiments
[table: training time (TT) and accuracy (ACC) for CRF and PCRF per language and order n, for Arabic, Czech, Spanish, German, Hungarian and English; * marks significant differences]
- CRF training is fast for small tag sets (Czech and Spanish) but slow for big tag sets
- First order: PCRF is two to thirty times as fast as CRF
- Higher orders: small but significant improvements in accuracy for all languages
POS Experiments II - Accuracy
[table: accuracies for SVMTool, Morfette, CRFSuite, Stanford and PCRF orders 1-3 per language; * marks significant differences]
- PCRF outperforms the best baseline for 3 out of 6 languages
- Never significantly worse than the best baseline
POS Experiments II - Training Times
[table: training times for Morfette, CRFSuite and PCRF orders 1-2 per language]
- CRFSuite is the fastest baseline tagger for all languages
- PCRF is faster for bigger tag sets (> 38)
Languages

Language    Sentences   POS   POS+MORPH
Arabic      15,
Czech       38,
Spanish     14,
German      40,
Hungarian   61,
POS + MORPH Experiments
[table: accuracies for the Oracle and PCRF orders 1-3 per language; * marks significant differences]
- The Oracle is a first-order PCRF, but gold tags get reinserted when pruned
- Small losses for Spanish and Hungarian, greater losses for Arabic, English and German
- Higher-order models outperform the Oracle
POS + MORPH Experiments II - Accuracy
[table: accuracies for SVMTool, Morfette, CRFSuite and PCRF orders 1-3 per language]
- PCRF outperforms the best baselines
- Moderate improvements for less ambiguous languages (Spanish, Hungarian)
- Large improvements for more ambiguous languages (Arabic, Czech, German)
POS + MORPH Experiments II - Training Times
[table: training times for Morfette, CRFSuite and PCRF orders 1-3 per language]
- CRFSuite is slower than Morfette (by an order of magnitude for Hungarian and Czech)
- PCRF is usually twice as fast as Morfette
- Czech training takes a week for CRFSuite and 5 hours for PCRF
Conclusion
- Approximate CRF tagger for big tag sets
- Fast due to coarse-to-fine decoding (speedups of up to 30)
- Supports high-order CRFs
- Higher accuracy than a number of baselines thanks to high order
MarMoT - Morphological Tagger
Our open-source implementation MarMoT is available online.
Thank you for your attention!
References

Chrupała, G., Dinu, G., and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of LREC.

Collins, M. and Roark, B. (2004). Incremental parsing with the perceptron algorithm. In Proceedings of ACL.

Giménez, J. and Màrquez, L. (2004). SVMTool: A general POS tagger generator based on Support Vector Machines. In Proceedings of LREC.

Okazaki, N. (2007). CRFsuite: A fast implementation of conditional random fields (CRFs).

Schmid, H. and Laws, F. (2008). Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of COLING.

Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL.

Tsuruoka, Y., Tsujii, J., and Ananiadou, S. (2009). Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. In Proceedings of ACL.
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationFeature-Frequency Adaptive On-line Training for Fast and Accurate Natural Language Processing
Feature-Frequency Adaptive On-line Training for Fast and Accurate Natural Language Processing Xu Sun Peking University Wenjie Li Hong Kong Polytechnic University Houfeng Wang Peking University Qin Lu Hong
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationBetter! Faster! Stronger*!
Jason Eisner Jiarong Jiang He He Better! Faster! Stronger*! Learning to balance accuracy and efficiency when predicting linguistic structures (*theorems) Hal Daumé III UMD CS, UMIACS, Linguistics me@hal3.name
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationMachine Learning Basics
Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each
More informationNotes on AdaGrad. Joseph Perla 2014
Notes on AdaGrad Joseph Perla 2014 1 Introduction Stochastic Gradient Descent (SGD) is a common online learning algorithm for optimizing convex (and often non-convex) functions in machine learning today.
More informationTuning as Linear Regression
Tuning as Linear Regression Marzieh Bazrafshan, Tagyoung Chung and Daniel Gildea Department of Computer Science University of Rochester Rochester, NY 14627 Abstract We propose a tuning method for statistical
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationStructured Prediction Models via the Matrix-Tree Theorem
Structured Prediction Models via the Matrix-Tree Theorem Terry Koo Amir Globerson Xavier Carreras Michael Collins maestro@csail.mit.edu gamir@csail.mit.edu carreras@csail.mit.edu mcollins@csail.mit.edu
More informationProbabilistic Context Free Grammars. Many slides from Michael Collins
Probabilistic Context Free Grammars Many slides from Michael Collins Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar
More informationIntegrating Morphology in Probabilistic Translation Models
Integrating Morphology in Probabilistic Translation Models Chris Dyer joint work with Jon Clark, Alon Lavie, and Noah Smith January 24, 2011 lti das alte Haus the old house mach das do that 2 das alte
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationA Discriminative Model for Semantics-to-String Translation
A Discriminative Model for Semantics-to-String Translation Aleš Tamchyna 1 and Chris Quirk 2 and Michel Galley 2 1 Charles University in Prague 2 Microsoft Research July 30, 2015 Tamchyna, Quirk, Galley
More informationNatural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu
Natural Language Processing Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Projects Project descriptions due today! Last class Sequence to sequence models Attention Pointer networks Today Weak
More informationDual Decomposition for Natural Language Processing. Decoding complexity
Dual Decomposition for atural Language Processing Alexander M. Rush and Michael Collins Decoding complexity focus: decoding problem for natural language tasks motivation: y = arg max y f (y) richer model
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationlecture 6: modeling sequences (final part)
Natural Language Processing 1 lecture 6: modeling sequences (final part) Ivan Titov Institute for Logic, Language and Computation Outline After a recap: } Few more words about unsupervised estimation of
More informationNeural networks. Chapter 19, Sections 1 5 1
Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10
More informationHMM part 1. Dr Philip Jackson
Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. HMM part 1 Dr Philip Jackson Probability fundamentals Markov models State topology diagrams Hidden Markov models -
More informationConditional Random Field
Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions
More informationLinear discriminant functions
Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative
More informationA Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister
A Syntax-based Statistical Machine Translation Model Alexander Friedl, Georg Teichtmeister 4.12.2006 Introduction The model Experiment Conclusion Statistical Translation Model (STM): - mathematical model
More informationFeedforward Neural Networks
Feedforward Neural Networks Michael Collins 1 Introduction In the previous notes, we introduced an important class of models, log-linear models. In this note, we describe feedforward neural networks, which
More informationClassification with Perceptrons. Reading:
Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters
More informationFeatures of Statistical Parsers
Features of tatistical Parsers Preliminary results Mark Johnson Brown University TTI, October 2003 Joint work with Michael Collins (MIT) upported by NF grants LI 9720368 and II0095940 1 Talk outline tatistical
More informationAN ABSTRACT OF THE DISSERTATION OF
AN ABSTRACT OF THE DISSERTATION OF Kai Zhao for the degree of Doctor of Philosophy in Computer Science presented on May 30, 2017. Title: Structured Learning with Latent Variables: Theory and Algorithms
More informationINF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Hidden Markov Models
INF4820: Algorithms for Artificial Intelligence and Natural Language Processing Hidden Markov Models Murhaf Fares & Stephan Oepen Language Technology Group (LTG) October 27, 2016 Recap: Probabilistic Language
More informationReducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning
More information