Prenominal Modifier Ordering via Multiple Sequence Alignment
Aaron Dunlop¹, Margaret Mitchell², Brian Roark¹
¹Oregon Health & Science University, Portland, OR
²University of Aberdeen, Aberdeen, Scotland, U.K.
NAACL, June 3, 2010
Outline
1. Introduction: Noun Phrase Ordering; Multiple Sequence Alignment (MSA)
2. MSA Training: Biological MSA; Linguistic MSA
3. Results
4. Conclusion
Noun Phrase Ordering
A natural language generation task, with applications in summarization and machine translation.
We want to generate natural-sounding text:
big clumsy brown bear   vs.   ?? brown clumsy big bear
Previous Work
                                  Genre                        Accuracy
Shaw and Hatzivassiloglou (1999)  Medical, adjectives          94.9%
                                  Medical, w/ noun modifiers   90.7%
                                  WSJ, adjectives              80.8%
                                  WSJ, w/ noun modifiers       71.0%
Malouf (2000)                     BNC, adjectives              91.9%
Mitchell (2009)                   Multi-genre, w/ noun mods    77.1%

Nouns as modifiers: executive vice president, state teacher cadet program
Multiple Sequence Alignment (DNA)

G A C T C - A T
- A G T G T A T
- C G T - T A T
- A G T G T A T
- A C T - T - T

Aligning a new sequence adds a row:

G C C T - - A T

Bases: A Adenine, C Cytosine, G Guanine, T Thymine, - Gap
Multiple Sequence Alignment (MSA)

small      clumsy  black  bear
big        -       black  cow
two-story  -       brown  house
big        clumsy  -      bull
valuable   14k     gold   watch

Test sequence: big clumsy brown bear
Align each permutation of the test sequence (n! of them) and choose the highest-scoring alignment.
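The test-time step above can be sketched in a few lines, assuming a toy column model of raw word counts in place of a full PSSM (`TRAINING`, `column_counts`, and `best_order` are illustrative names, not from the deck):

```python
from itertools import permutations

# Aligned training rows (modifier columns only; '-' marks a gap).
TRAINING = [
    ["small",     "clumsy", "black"],
    ["big",       "-",      "black"],
    ["two-story", "-",      "brown"],
    ["big",       "clumsy", "-"],
    ["valuable",  "14k",    "gold"],
]

def column_counts(rows):
    """How often each word fills each alignment column."""
    counts = []
    for col in zip(*rows):
        tally = {}
        for w in col:
            tally[w] = tally.get(w, 0) + 1
        counts.append(tally)
    return counts

def best_order(words, counts):
    """Score every permutation (n! of them) against the column
    model and keep the highest-scoring one."""
    def score(order):
        return sum(counts[i].get(w, 0) for i, w in enumerate(order))
    return max(permutations(words), key=score)

counts = column_counts(TRAINING)
print(best_order(["brown", "clumsy", "big"], counts))  # ('big', 'clumsy', 'brown')
```

Factorial enumeration is affordable here because noun phrases rarely have more than four prenominal modifiers.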
Biological MSA Training
- Begin with a substitution matrix
- Calculate a distance matrix
- Align the 2 closest sequences
- Repeatedly align and incorporate the closest sequence not already in the MSA
- Induce a Position Specific Score Matrix (PSSM)
- Align unseen sequences with Viterbi search

[Substitution matrix: 5×5 over {A, C, G, T, -}, zero on the diagonal]
Biological MSA Training (continued)

Distance matrix:
     s1  s2  s3  s4
s1    0
s2    3   0
s3    3   4   0
s4    5   2   4   0

Align the 2 closest sequences:
C G T - T A
A G T G T A

Incorporate the next-closest sequence:
- C G T - T A
- A G T G T A
- A C T - T -

...and the last:
- C G T - T A
- A G T G T A
- A C T - T -
G A C T C - A
Biological MSA Training (continued)

Induce a PSSM over columns 1-7 of the finished alignment:
- C G T - T A
- A G T G T A
- A C T - T -
G A C T C - A
[Per-column scores for A, C, G, T, and gap]

Align an unseen sequence with Viterbi search:
G C C T - - A
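The "align the 2 closest sequences" step is a global pairwise alignment, which can be sketched with Needleman-Wunsch dynamic programming. The gap and substitution costs below are illustrative stand-ins, not the deck's matrix values:

```python
def align(a, b, sub=lambda x, y: 0 if x == y else 2, gap=1):
    """Needleman-Wunsch: minimum-cost global alignment of two sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # match/substitute
                cost[i - 1][j] + gap,                          # gap in b
                cost[i][j - 1] + gap,                          # gap in a
            )
    # Trace back to recover the aligned strings.
    out_a, out_b, i, j = [], [], n, m
    while i or j:
        if i and j and cost[i][j] == cost[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(align("CGTTA", "AGTGTA"))  # ('CGT-TA', 'AGTGTA')
```

This reproduces the first alignment in the build above: a gap is inserted into the shorter sequence rather than paying two mismatches.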
Linguistic MSA
What is the distance between ambling black bear and big hungry grizzly bear? What is the cost of substituting executive for two-story? Of a gap in another sequence?
We don't want to assume that knowledge a priori, so we look for linguistic features that might influence the probability of substituting, e.g., ambling for big or executive for two-story.
Feature Set
Identity features:
- Word
- Stem, derived by the Porter stemmer
- Binned length indicators (word length in letters): 1, 2, 3, 4, 5-6, 7-8, 9-12, 13-18, >18
Indicator features:
- Word begins with a capital
- Entire word is capitalized
- Hyphenated
- Numeric (e.g. 1234)
- Begins with a numeral (e.g. 2-sided)
- Ends with -al, -ble, -ed, -er, -est, -ic, -ing, -ive, -ly
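These features can be sketched as a simple extraction function (the Porter stemmer is omitted here, and the feature names are illustrative):

```python
# Length bins from the feature set; anything longer falls into ">18".
LENGTH_BINS = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 6), (7, 8), (9, 12), (13, 18)]
ENDINGS = ("al", "ble", "ed", "er", "est", "ic", "ing", "ive", "ly")

def length_bin(n):
    for lo, hi in LENGTH_BINS:
        if lo <= n <= hi:
            return str(lo) if lo == hi else f"{lo}-{hi}"
    return ">18"

def extract_features(word):
    """Identity and indicator features for one modifier word."""
    feats = {f"word={word}", f"len={length_bin(len(word))}"}
    if word[:1].isupper():
        feats.add("init-cap")
    if word.isupper():
        feats.add("all-caps")
    if "-" in word:
        feats.add("hyphenated")
    if word.isdigit():
        feats.add("numeric")
    if word[:1].isdigit():
        feats.add("init-numeral")
    for s in ENDINGS:
        if word.endswith(s):
            feats.add(f"suffix=-{s}")
    return feats

print(sorted(extract_features("two-story")))
```

For example, two-story fires the hyphenation indicator and the 9-12 length bin, while valuable fires the -ble suffix indicator.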
Maximum Likelihood (Generative Model)
- Treat features as classes: words, stems, lengths; each indicator feature in its own class
- Make the (clearly false) assumption that feature classes are independent
- Similar to the independence assumption in Naïve Bayes
ML Training
- Incorporate sequences in order of occurrence
- Re-induce a PSSM after each sequence is incorporated
- Iterate, re-incorporating sequences into the MSA

Vocabulary: 9 words; feature classes include Hyphenated and Ends-with -ble

small      clumsy  black
big        -       black
two-story  -       brown
big        clumsy  -
valuable   14k     gold
ML Training Example
After each sequence is incorporated, per-column feature counts are re-tallied and the probabilities smoothed:

small      clumsy  black
big        -       black
two-story  -       brown
big        clumsy  -
valuable   14k     gold

[Step-by-step tables of counts, raw probabilities, and smoothed estimates for the column-1 words (small, big, two-story, valuable) and for the Hyphenated and -ble indicators]
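The per-column estimates above can be sketched as follows. The deck's exact smoothing scheme is not reproduced here; add-one (Laplace) smoothing over the 10-symbol alphabet (9 words plus the gap) is a stand-in:

```python
from collections import Counter

def column_model(rows, smooth=1.0):
    """Per-column relative frequencies with add-`smooth` smoothing
    over the alphabet of all symbols seen in the alignment."""
    vocab = {w for row in rows for w in row}
    models = []
    for col in zip(*rows):
        counts = Counter(col)
        total = len(col) + smooth * len(vocab)
        models.append({w: (counts[w] + smooth) / total for w in vocab})
    return models

ROWS = [
    ["small",     "clumsy", "black"],
    ["big",       "-",      "black"],
    ["two-story", "-",      "brown"],
    ["big",       "clumsy", "-"],
    ["valuable",  "14k",    "gold"],
]
m = column_model(ROWS)
# "big" has been seen twice in column 1; "brown" never has,
# but smoothing keeps its probability non-zero.
print(m[0]["big"], m[0]["brown"])
```

Smoothing matters because a test sequence may place a known word in a column where it was never observed during training.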
Discriminative Model: Averaged Perceptron
- Uses the same features as the generative model
- Does not require the independence assumption
- With each training sequence:
  - Align each permutation of the sequence and compute its alignment cost
  - If the correct ordering does not score highest, perform a perceptron update on the correct ordering and the highest-scoring incorrect ordering
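The update rule can be sketched as below. This simplification scores orderings with a plain linear feature score rather than an alignment cost, and omits the weight averaging of the averaged perceptron; the feature map is a toy stand-in:

```python
from itertools import permutations
from collections import defaultdict

def features(order):
    """Toy feature map: (position, word) indicators."""
    return [(i, w) for i, w in enumerate(order)]

def score(order, w):
    return sum(w[f] for f in features(order))

def update(w, correct):
    """One perceptron step: if some incorrect permutation scores at least
    as high as the observed ordering, promote the observed ordering's
    features and demote the rival's."""
    correct = tuple(correct)
    rival = max((p for p in permutations(correct) if p != correct),
                key=lambda p: score(p, w))
    if score(rival, w) >= score(correct, w):
        for f in features(correct):
            w[f] += 1
        for f in features(rival):
            w[f] -= 1

w = defaultdict(int)
update(w, ["big", "clumsy", "brown"])
# The observed ordering now outscores every other permutation.
```

Because updates fire only on mistakes, weights stop changing once the training orderings are all ranked first.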
Discriminative Training Example
Per-column (1-3) alignment costs for gold, 14k, and valuable, comparing the correct ordering valuable 14k gold against the incorrect ordering gold 14k valuable; the perceptron update shifts cost in favor of the correct ordering.
[Cost tables before and after the update, including the -ble indicator]
Corpus
From Mitchell (2009), including 10-fold splits.
Composition: combination of the Penn Treebank, Brown Corpus, and Switchboard; all corpora have hand-annotated trees. Extracted NPs including nouns and adjectives.
- 74% Penn Treebank (financial text)
- 13% Brown (literary text)
- 13% Switchboard (conversational)
Evaluation
Token accuracy:
- Rewards correct prediction of common sequences
- Penalizes sets of modifiers which occur in multiple orders
Precision / recall:
- Does not require predictions for all sets
- Applicable to types as well as tokens

[Worked example: token accuracy vs. precision and recall over orderings of brown two-story / two-story brown and fuzzy brown / brown fuzzy]
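The token-accuracy ceiling imposed by modifier sets that occur in multiple orders can be seen in a tiny sketch (the counts here are illustrative, mirroring the worked example):

```python
def token_accuracy(observed, predicted):
    """observed: {ordering: count in the test data}; predicted: the single
    ordering the model outputs for this modifier set."""
    return observed.get(predicted, 0) / sum(observed.values())

# 'brown two-story' appears 3 times and the reverse once: a model that
# predicts one fixed order can score at most 3/4 on these tokens.
acc = token_accuracy({("brown", "two-story"): 3, ("two-story", "brown"): 1},
                     ("brown", "two-story"))
print(acc)  # 0.75
```

Type-based precision and recall sidestep this ceiling, since they score each modifier set once rather than once per occurrence.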
Pairwise Ordering Results
Token accuracy and type-based precision and recall:
                 Accuracy  Precision  Recall  F
Mitchell (2009)  N/A       90.3%      67.2%   77.1%
ML               85.5%     84.6%      84.7%   84.7%
Perceptron       88.9%     88.2%      88.1%   88.2%
Previous results: 71.0% (WSJ w/ noun modifiers), 91.9% (BNC adjectives)
Full Noun Phrase Results
Token accuracy and token-based precision and recall:
                 Accuracy  Precision  Recall  F
Mitchell (2009)  N/A       94.4%      78.6%   85.7%
ML               76.9%     76.5%      76.5%   76.5%
Perceptron       86.7%     86.7%      86.7%   86.7%
Cross-domain Generalization
Type-based precision and recall:
Training:        Brown+WSJ  Swbd+WSJ  Swbd+Brown
Testing:         Swbd       Brown     WSJ
Mitchell (2009)  72.0%      64.5%     40.9%
ML               75.0%      74.8%     71.7%
Perceptron       77.9%      76.5%     77.4%
Summary
- Applied MSA techniques to NP ordering
- Introduced 2 novel methods of MSA training which require neither gold-standard alignments nor a hand-tuned substitution matrix
- Accuracy competitive with or superior to the best previously reported results
Future Work
- Train on a larger automatically parsed corpus
- Other learning methods
- Add additional features: richer morphological features; semantic class information derived from WordNet, OntoNotes, etc.
Questions?
Full NP accuracies by modifier count
Modifiers  Frequency  Token Accuracy  Pairwise Accuracy
2          89.1%      89.7%           89.7%
3          10.0%      64.5%           84.4%
4          0.9%       37.2%           80.7%
Ablation Tests
Feature(s)       Gain/Loss     Feature(s)               Gain/Loss
Word             0.0           -ed                      -0.4
Stem             0.0           -er                      0.0
Capitalization   -0.1          -est                     -0.1
All-caps         0.0           -ic                      +0.1
Numeric          -0.2          -ing                     0.0
Initial-numeral  0.0           -ive                     -0.1
Length           -0.1          -ly                      0.0
Hyphen           0.0           Word and stem            -22.9
-al              0.0           Word, stem, and endings  -24.2
-ble             -0.4
Example Sequences
few quaint old characters
instrument-jammed bomber cockpits
American nuclear strike
Italian state-owned holding company
executive vice president
monthly mortgage payments
great Japanese investment machine