Noisy Channel and Hidden Markov Models
1 Noisy Channel and Hidden Markov Models Natural Language Processing CS 6120 Spring 2014 Northeastern University David Smith with material from Jason Eisner & Andrew McCallum
2 Warren Weaver to Norbert Wiener, 4 March 1947 One thing I wanted to ask you about is this. A most serious problem, for UNESCO and for the constructive and peaceful future of the planet, is the problem of translation, as it unavoidably affects the communication between peoples. Huxley has recently told me that they are appalled by the magnitude and the importance of the translation job. Recognizing fully, even though necessarily vaguely, the semantic difficulties because of multiple meanings, etc., I have wondered if it were unthinkable to design a computer which would translate. Even if it would translate only scientific material (where the semantic difficulties are very notably less), and even if it did produce an inelegant (but intelligible) result, it would seem to me worth while. Also knowing nothing official about, but having guessed and inferred considerable about, powerful new mechanized methods in cryptography (methods which I believe succeed even when one does not know what language has been coded), one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.
3 Word Segmentation theprophetsaidtothecity What does this say? And what other words are substrings? Given L = a lexicon FSA that matches all English words, how do we apply it to this problem? What if the lexicon is weighted? From unigrams to bigrams? Smooth L to include unseen words?
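The weighted-lexicon idea on this slide can be sketched as a dynamic program: best[i] holds the highest log-probability of any segmentation of the first i characters. The toy lexicon and its probabilities below are invented for illustration; a real L would be an FSA with weights estimated from a corpus.

```python
import math

# Toy weighted lexicon standing in for the FSA L: word -> unigram probability.
# (Hypothetical numbers; a real L would be learned from corpus counts.)
LEXICON = {"the": 0.08, "prophet": 0.0001, "said": 0.002, "to": 0.03,
           "city": 0.001, "he": 0.01, "prop": 0.00005}

def segment(text, lexicon):
    """Best unigram segmentation by dynamic programming:
    best[i] = max log-prob of any segmentation of text[:i]."""
    n = len(text)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            w = text[j:i]
            if w in lexicon and best[j] + math.log(lexicon[w]) > best[i]:
                best[i] = best[j] + math.log(lexicon[w])
                back[i] = j
    if best[n] == float("-inf"):
        return None  # no full segmentation; a smoothed L would avoid this
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

Smoothing L (the last bullet on the slide) would correspond to giving every unseen string a small probability so that segment never returns None.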
4 Spelling correction Spelling correction also needs a lexicon L, but there is distortion. Let T be a transducer that models common typos and other spelling errors: ance → ence (deliverance, ...), e → ε (deliverance, ...), ε → e // Cons _ Cons (athlete, ...), rr → r (embarrass, occurrence, ...), ge → dge (privilege, ...), etc. Now what can you do with L .o. T? Should T and L have probabilities? Want T to include all possible errors
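A minimal sketch of the same idea without FST machinery: generate every string one edit away from the observed word (a crude stand-in for T), intersect with the lexicon (L), and rank by p(x). The lexicon probabilities are invented, and the error model here is uniform, unlike a real weighted T.

```python
import string

# Toy p(x) values for a handful of words; purely illustrative.
LEXICON = {"occurrence": 0.001, "privilege": 0.0005, "embarrass": 0.0008,
           "deliverance": 0.0001, "athlete": 0.0006}

def one_edit_candidates(word):
    """All strings one edit away: T restricted to at most one error."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in letters}
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return deletes | inserts | subs | swaps

def correct(word, lexicon):
    """Noisy channel: argmax_x p(x) over in-lexicon x within one edit of y.
    (Uniform p(y | x); a real T would weight typo types differently.)"""
    if word in lexicon:
        return word
    candidates = one_edit_candidates(word) & lexicon.keys()
    return max(candidates, key=lexicon.get) if candidates else word
```

Giving T per-edit probabilities, as the slide asks, would replace the uniform error model with a weighted one.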
5 Noisy Channel Model real language X noisy channel X → Y yucky language Y want to recover X from Y
6 Noisy Channel Model real language X correct spelling noisy channel X → Y typos yucky language Y misspelling want to recover X from Y
7 Noisy Channel Model real language X (lexicon space)* noisy channel X → Y delete spaces yucky language Y text w/o spaces want to recover X from Y
8 Noisy Channel Model real language X (lexicon space)* noisy channel X → Y pronunciation yucky language Y speech want to recover X from Y
9 Noisy Channel Model real language X (lexicon space)* language model noisy channel X → Y acoustic model pronunciation yucky language Y speech want to recover X from Y
10 Noisy Channel Model real language X target language noisy channel X → Y translation yucky language Y source language want to recover X from Y
11 Noisy Channel Model real language X target language language model noisy channel X → Y translation model translation yucky language Y source language want to recover X from Y
12 Noisy Channel Model real language X tree noisy channel X → Y delete everything but terminals yucky language Y text want to recover X from Y
13 Noisy Channel Model real language X tree probabilistic CFG noisy channel X → Y delete everything but terminals yucky language Y text want to recover X from Y
14 Noisy Channel Model real language X p(x) * noisy channel X → Y p(y | x) = yucky language Y p(x, y)
15 Noisy Channel Model real language X p(x) * noisy channel X → Y p(y | x) = yucky language Y p(x, y) want to recover x ∈ X from y ∈ Y
16 Noisy Channel Model real language X p(x) * noisy channel X → Y p(y | x) = yucky language Y p(x, y) want to recover x ∈ X from y ∈ Y choose x that maximizes p(x | y) or equivalently p(x, y)
17 Noisy Channel Model p(x) * p(y | x) = p(x, y)
18 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) * p(y | x) = p(x, y)
19 Noisy Channel Model a:a/0.7 b:b/0.3 a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(x) * p(y | x) = p(x, y)
20 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) = p(x, y)
21 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) = a:c/0.07 a:d/0.63 b:c/0.24 b:d/0.06 p(x, y)
22 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) = a:c/0.07 a:d/0.63 b:c/0.24 b:d/0.06 p(x, y) Note p(x, y) sums to 1.
23 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) = a:c/0.07 a:d/0.63 b:c/0.24 b:d/0.06 p(x, y) Note p(x, y) sums to 1. Suppose y = c; what is best x?
24 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) = a:c/0.07 a:d/0.63 b:c/0.24 b:d/0.06 p(x, y) Suppose y = c; what is best x?
25 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) = a:c/0.07 b:c/0.24 p(x, y)
26 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) restrict just to paths compatible with output c = a:c/0.07 b:c/0.24 p(x, y)
27 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) .o. restrict just to paths compatible with output c c:c/1 (Y = y)? = a:c/0.07 b:c/0.24 p(x, y)
28 Noisy Channel Model a:a/0.7 b:b/0.3 p(x) .o. a:c/0.1 a:d/0.9 b:c/0.8 b:d/0.2 p(y | x) .o. restrict just to paths compatible with output c c:c/1 (Y = y)? = a:c/0.07 b:c/0.24 best path p(x, y)
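The toy composition on these slides can be checked numerically. This sketch multiplies out p(x) and p(y | x) to get the joint, restricts to the observed y, and takes the best path; all the probabilities are the ones shown on the slides.

```python
# The slides' toy machines: p(x) over {a, b} and channel p(y | x) over {c, d}.
p_x = {"a": 0.7, "b": 0.3}
p_y_given_x = {("a", "c"): 0.1, ("a", "d"): 0.9,
               ("b", "c"): 0.8, ("b", "d"): 0.2}

# Composition: the joint p(x, y) = p(x) * p(y | x).
p_xy = {(x, y): p_x[x] * p for (x, y), p in p_y_given_x.items()}

def decode(y):
    """Restrict to paths compatible with the observed y, then take the
    best path: argmax_x p(x, y), which equals argmax_x p(x | y)."""
    return max(p_x, key=lambda x: p_xy[(x, y)])
```

For y = c, the best x is b (0.24 beats 0.07), even though a is the more probable input a priori: the channel rarely turns a into c.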
29 Morpheme Segmentation Let Lexicon be a machine that matches all Turkish words. Same problem as word segmentation (in, e.g., Chinese), just at a lower level: morpheme segmentation. Turkish word: uygarlaştıramadıklarımızdanmışsınızcasına = uygar+laş+tır+ma+dık+ları+mız+dan+mış+sınız+ca+sı+na "(behaving) as if you are among those whom we could not cause to become civilized" Some constraints on morpheme sequence: bigram probs. Generative model: concatenate then fix up joints (stop + -ing = stopping, fly + -s = flies, vowel harmony). Use a cascade of transducers to handle all the fixups. But this is just morphology! Can use probabilities here too (but people often don't)
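The concatenate-then-fix-up-joints story can be mimicked with a couple of ordered rewrite rules standing in for the transducer cascade. The two rules below are toy versions of English consonant doubling and y-replacement only, not a real morphological analyzer, and the consonant class is a simplification.

```python
import re

def join_morphemes(stem, suffix):
    """Concatenate with a joint marker, then apply ordered fixup rules at the
    joint (a toy stand-in for the slide's cascade of transducers)."""
    w = stem + "+" + suffix
    # consonant doubling before a vowel-initial suffix: stop + ing -> stopping
    w = re.sub(r"([aeiou])([bdglmnprt])\+([aeiou])", r"\1\2\2\3", w)
    # y -> ie before -s: fly + s -> flies
    w = re.sub(r"y\+s$", "ies", w)
    return w.replace("+", "")
```

Composing one small transducer per fixup rule, in order, is the FST analogue of applying these regex rewrites in sequence.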
30 Edit Distance Transducer O(k) deletion arcs (a:ε, b:ε), O(k²) substitution arcs (a:b, b:a, ...), O(k) insertion arcs (ε:a, ε:b), O(k) no-change arcs (a:a, b:b)
31 Stochastic Edit Distance Transducer same arcs, now weighted: O(k) deletion arcs, O(k²) substitution arcs, O(k) insertion arcs, O(k) identity arcs. Likely edits = high-probability arcs
32 Stochastic Edit Distance Transducer clara .o. [edit transducer] .o. caca = [composed lattice on the slide: deletion arcs c:ε, l:ε, a:ε, r:ε; identity arcs c:c, a:a; substitution arcs c:a, l:c, l:a, r:c, r:a, a:c; insertion arcs ε:c, ε:a]
33 Stochastic Edit Distance Transducer Best path (by Dijkstra's algorithm) through the composed lattice clara .o. [edit transducer] .o. caca
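With unit arc costs, the best path through s .o. EditTransducer .o. t is ordinary Levenshtein distance, which the classic dynamic program computes without building the composition explicitly. A sketch:

```python
def edit_distance(s, t, sub=1, ins=1, dele=1):
    """Min-cost path through the composed machine s .o. EditTransducer .o. t,
    by dynamic programming over a (len(s)+1) x (len(t)+1) table."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + dele,  # x:eps  (deletion arc)
                d[i][j - 1] + ins,   # eps:y  (insertion arc)
                d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub),  # x:y
            )
    return d[m][n]
```

The stochastic version replaces min over unit costs with min over negative log arc probabilities, so likely edits become cheap arcs.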
34 Speech Recognition by FST Composition (Pereira & Riley 1996) trigram language model p(word seq) .o. pronunciation model p(phone seq | word seq) .o. acoustic model p(acoustics | phone seq)
35 Speech Recognition by FST Composition (Pereira & Riley 1996) trigram language model p(word seq) .o. pronunciation model p(phone seq | word seq) .o. acoustic model p(acoustics | phone seq) .o. observed acoustics
36 Speech Recognition by FST Composition (Pereira & Riley 1996) trigram language model p(word seq) .o. CAT:k æ t p(phone seq | word seq) .o. phone context ə: phone context p(acoustics | phone seq)
37 Speech Recognition by FST Composition (Pereira & Riley 1996) trigram language model p(word seq) .o. CAT:k æ t p(phone seq | word seq) .o. phone context ə: phone context p(acoustics | phone seq) .o.
38 Transliteration (Knight & Graehl, 1998) Angela Johnson (anjira jyonson), New York Times (nyuuyooku taimuzu), ice cream (aisukuriimu), Omaha Beach (omahabiitchi), pro soccer (purosakkaa), Tonya Harding (toonya haadingu), ramp/lamp (ranpu), casual fashion (kajyuaruhasshyon), team leader (chiimuriidaa). P(w) generates written English word sequences. P(e|w) pronounces English word sequences. P(j|e) converts English sounds into Japanese sounds. P(k|j) converts Japanese sounds to katakana writing. P(o|k) introduces misspellings caused by optical character recognition (OCR).
39 Part-of-Speech Tagging
40 Bigram LM as FSM fox brown quick jumped the
41 Bigram LM as FSM V states fox brown quick jumped the
42 Bigram LM as FSM V states fox brown quick jumped the O(V²) arcs (& parameters)
43 Bigram LM as FSM V states fox What about a trigram model? brown quick jumped the O(V²) arcs (& parameters)
44 Bigram LM as FSM V states fox What about a trigram model? brown quick jumped the O(V²) arcs (& parameters) What about backoff?
45 Grammatical Categories Parts of speech (partes orationis); Some Cool Kids call them word classes. Folk definitions: Nouns: people, places, concepts, things, ... Verbs: expressive of action. Adjectives: properties of nouns. In linguistics, defined by role in syntax: the substitution test, e.g. The intelligent / green / fat / sad one is in the corner.
46 The Tagging Task
47 The Tagging Task Input: the lead paint is unsafe
48 The Tagging Task Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj
50 The Tagging Task Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj Uses:
51 The Tagging Task Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj Uses: text-to-speech (how do we pronounce lead?)
52 The Tagging Task Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj Uses: text-to-speech (how do we pronounce lead?) can write regexps like (Det) Adj* N+ over the output
53 The Tagging Task Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj Uses: text-to-speech (how do we pronounce lead?) can write regexps like (Det) Adj* N+ over the output preprocessing to speed up parser (but a little dangerous)
54 The Tagging Task Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj Uses: text-to-speech (how do we pronounce lead?) can write regexps like (Det) Adj* N+ over the output preprocessing to speed up parser (but a little dangerous) if you know the tag, you can back off to it in other tasks
55 Why Do We Care? Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj
56 Why Do We Care? Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj The first statistical NLP task
57 Why Do We Care? Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj The first statistical NLP task Been done to death by different methods
58 Why Do We Care? Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj The first statistical NLP task Been done to death by different methods Easy to evaluate (how many tags are correct?)
59 Why Do We Care? Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj The first statistical NLP task Been done to death by different methods Easy to evaluate (how many tags are correct?) Canonical finite-state task (in English)
60 Why Do We Care? Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj The first statistical NLP task Been done to death by different methods Easy to evaluate (how many tags are correct?) Canonical finite-state task (in English) Can be done well with methods that look at local context
61 Why Do We Care? Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj The first statistical NLP task Been done to death by different methods Easy to evaluate (how many tags are correct?) Canonical finite-state task (in English) Can be done well with methods that look at local context Though should really do it by parsing!
62 Tagged Data Sets Brown Corpus Designed to be a representative sample from 1961 news, poetry, belles lettres, short stories 87 different tags Penn Treebank 45 different tags Currently most widely used for English Now a paradigm in lots of other languages Chinese Treebank has over 200 tags
63 Penn Treebank POS Tags
  Adjective                    JJ    happy, bad
  Adjective, comparative       JJR   happier, worse
  Adjective, cardinal number   CD    3, fifteen
  Adverb                       RB    often, particularly
  Conjunction, coordinating    CC    and, or
  Conjunction, subordinating   IN    although, when
  Determiner                   DT    this, each, other, the, a, some
  Determiner, postdeterminer   JJ    many, same
  Noun                         NN    aircraft, data
  Noun, plural                 NNS   women, books
  Noun, proper, singular       NNP   London, Michael
  Noun, proper, plural         NNPS  Australians, Methodists
  Pronoun, personal            PRP   you, we, she, it
  Pronoun, question            WP    who, whoever
  Verb, base present form      VBP   take, live
64 Word Class Classes Importantly for predicting POS tags, there are two broad classes. Closed class words belong to classes that don't accept new members: determiners (the, a, an, this, ...), prepositions (in, on, of, ...). Open class words: nouns, verbs, adjectives, adverbs, ... "Closed" is relative: these words are born and die over longer time scales (e.g., "regarding").
65 Ambiguity in Language Fed raises interest rates 0.5% in effort to control inflation (NY Times headline, 17 May 2000) [slide shows a parse tree of the headline: S → NP (NNP Fed) VP (V raises, NP interest rates, 0.5%, PP in effort to control inflation)]
66 Part-of-speech Ambiguity Fed raises interest rates 0.5 % in effort to control inflation [slide shows candidate tags stacked above each word, e.g. NNP, VBZ, NNS, VB, CD, NN, ...]
67 Degree of Supervision
68 Degree of Supervision Supervised: Training corpus is tagged by humans
69 Degree of Supervision Supervised: Training corpus is tagged by humans Unsupervised: Training corpus isn't tagged
70 Degree of Supervision Supervised: Training corpus is tagged by humans Unsupervised: Training corpus isn't tagged Partly supervised: Training corpus isn't tagged, but you have a dictionary giving possible tags for each word
72 Degree of Supervision Supervised: Training corpus is tagged by humans Unsupervised: Training corpus isn't tagged Partly supervised: Training corpus isn't tagged, but you have a dictionary giving possible tags for each word We'll start with the supervised case and move to decreasing levels of supervision.
73 Current Performance Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj
74 Current Performance Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj How many tags are correct?
75 Current Performance Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj How many tags are correct? About 97% currently
76 Current Performance Input: the lead paint is unsafe Output: the/det lead/n paint/n is/v unsafe/adj How many tags are correct? About 97% currently But baseline is already 90% Baseline is performance of stupidest possible method Tag every word with its most frequent tag Tag unknown words as nouns
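The 90% baseline described here is easy to implement: tag each known word with its most frequent training tag, and every unknown word as a noun. The tiny corpus below is made up for illustration; real numbers come from a tagged corpus like the Penn Treebank.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Most-frequent-tag baseline: tag each known word with its most common
    training tag; tag unknown words as nouns."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    def tag(sentence):
        return [(w, table.get(w, "N")) for w in sentence]
    return tag

# Tiny made-up corpus; "lead" occurs more often as N than as V.
corpus = [("the", "Det"), ("lead", "N"), ("paint", "N"), ("is", "V"),
          ("unsafe", "Adj"), ("lead", "N"), ("lead", "V"), ("the", "Det")]
tagger = train_baseline(corpus)
```

That this ignores all context is exactly why it tops out around 90%: it can never tag "lead" as a verb once "lead" is more often a noun in training.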
77 What Should We Look At? Bill directed a cortege of autos through the dunes
78 What Should We Look At? correct tags PN Verb Det Noun Prep Noun Prep Det Noun Bill directed a cortege of autos through the dunes
79 What Should We Look At? correct tags PN Verb Det Noun Prep Noun Prep Det Noun Bill directed a cortege of autos through the dunes PN Adj Det Noun Prep Noun Prep Det Noun Verb Verb Noun Verb Adj Prep? some possible tags for each word (maybe more)
83 What Should We Look At? correct tags PN Verb Det Noun Prep Noun Prep Det Noun Bill directed a cortege of autos through the dunes PN Adj Det Noun Prep Noun Prep Det Noun Verb Verb Noun Verb Adj Prep? some possible tags for each word (maybe more) Each unknown tag is constrained by its word
84 What Should We Look At? PN Verb Det Noun Prep Noun Prep Det Noun Bill directed a correct tags cortege of autos through the dunes PN Adj Det Noun Prep Noun Prep Det Noun Verb Verb Noun Verb Adj Prep? some possible tags for each word (maybe more) Each unknown tag is constrained by its word and by the tags to its immediate left and right.
85 What Should We Look At? PN Verb Det Noun Prep Noun Prep Det Noun Bill directed a correct tags cortege of autos through the dunes PN Adj Det Noun Prep Noun Prep Det Noun Verb Verb Noun Verb Adj Prep? some possible tags for each word (maybe more) Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too
88 Finite-State Approaches Noisy Channel Model (statistical) real language X noisy channel X → Y yucky language Y part-of-speech tags (n-gram model) replace tags with words text want to recover X from Y
89 Review: Noisy Channel real language X p(x) * noisy channel X → Y p(y | x) = yucky language Y p(x, y)
90 Review: Noisy Channel real language X p(x) * noisy channel X → Y p(y | x) = yucky language Y p(x, y) want to recover x ∈ X from y ∈ Y
91 Review: Noisy Channel real language X p(x) * noisy channel X → Y p(y | x) = yucky language Y p(x, y) want to recover x ∈ X from y ∈ Y choose x that maximizes p(x | y) or equivalently p(x, y)
92 Noisy Channel for Tagging acceptor: p(tag sequence) (Markov Model) .o. transducer: tags → words (Unigram Replacement) .o. acceptor: the observed words (straight line) = transducer that scores candidate tag seqs on their joint probability with the observed words; pick best path. p(x) .o. p(y | x) .o. (Y = y)? = p(x, y)
93 Markov Model (bigrams) Det Verb Start Prep Adj Noun Stop
101 Markov Model Det Verb Start Prep Adj Noun Stop
104 Markov Model 0.8 Det Verb Start Prep Adj Noun 0.2 Stop 0.1
105 Markov Model p(tag seq) 0.8 Det Verb Start Prep Adj Noun 0.2 Stop 0.1 Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
106 Markov Model as an FSA p(tag seq) 0.8 Det Verb Start Prep Adj Noun 0.2 Stop 0.1 Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
107 Markov Model as an FSA p(tag seq) 0.8 Det Verb Start Prep Adj Noun 0.2 Stop 0.1 Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
108 Markov Model as an FSA p(tag seq) Det 0.8 Det Verb Start Adj 0.3 Noun 0.7 Prep Adj 0.4 Adj Noun 0.5 Noun ε 0.2 Stop ε 0.1 Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
109 Markov Model (tag bigrams) p(tag seq) Det 0.8 Start Det Adj 0.3 Adj 0.4 Adj Noun 0.5 Noun ε 0.2 Stop Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
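As the slide's example shows, p(tag seq) is just the product of arc probabilities along the path through the FSA. A sketch using the arc weights shown on the slides:

```python
# Bigram tag probabilities read off the arcs of the slides' FSA.
P_BIGRAM = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Det", "Noun"): 0.7,
            ("Adj", "Adj"): 0.4, ("Adj", "Noun"): 0.5, ("Noun", "Stop"): 0.2}

def p_tag_seq(tags):
    """p(tag seq) = product of bigram arc probabilities along the path."""
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):
        p *= P_BIGRAM[(prev, cur)]
    return p
```

The slide's path Start Det Adj Adj Noun Stop comes out to 0.8 * 0.3 * 0.4 * 0.5 * 0.2 = 0.0096.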
110 Noisy Channel for Tagging automaton: p(tag sequence) (Markov Model) .o. transducer: tags → words (Unigram Replacement) .o. automaton: the observed words (straight line) = transducer that scores candidate tag seqs on their joint probability with the observed words; pick best path. p(x) .o. p(y | x) .o. (Y = y)? = p(x, y)
111 Noisy Channel for Tagging [tag-bigram FSA: Det 0.8, Adj 0.3, Noun 0.7, Adj 0.4, Noun 0.5, ε 0.1, ε 0.2] .o. Noun:cortege/... Noun:autos/0.001 Noun:Bill/0.002 Det:a/0.6 Det:the/0.4 Adj:cool/0.003 Adj:directed/... Adj:cortege/... .o. the cool directed autos = transducer that scores candidate tag seqs on their joint probability with the observed words; we should pick best path. p(x) * p(y | x) = p(x, y)
112 Unigram Replacement Model p(word seq | tag seq) Noun:cortege/... Noun:autos/0.001 Noun:Bill/0.002 (sums to 1) Det:a/0.6 Det:the/0.4 (sums to 1) Adj:cool/0.003 Adj:directed/... Adj:cortege/...
113 Compose p(tag seq) [the tag-bigram FSA: Det 0.8, Adj 0.3, Adj 0.4, Noun 0.7, Noun 0.5, ε 0.1, ε 0.2] with the unigram replacement model [Noun:cortege/... Noun:autos/0.001 Noun:Bill/0.002 Det:a/0.6 Det:the/0.4 Adj:cool/0.003 Adj:directed/... Adj:cortege/...]
114 Compose p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq) [composed machine: Det:a 0.48, Det:the 0.32, arcs Adj:cool, Adj:directed, Adj:cortege, N:cortege, N:autos, ε to Stop]
115 Observed Words as Straight-Line FSA word seq the cool directed autos
116 Compose with the cool directed autos p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq) [full composed machine: Det:a 0.48, Det:the 0.32, Adj:cool, Adj:directed, Adj:cortege, N:cortege, N:autos, ε]
117 Compose with the cool directed autos p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq) [restricted machine: Det:the 0.32, Adj:cool, Adj:directed, N:autos, ε to Stop]
118 Compose with the cool directed autos p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq) [restricted machine: Det:the 0.32, Adj:cool, Adj:directed, N:autos, ε to Stop] why did this loop go away?
119 The best path: Start Det Adj Adj Noun Stop = 0.32 * ... the cool directed autos p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq) [Det:the 0.32, Adj:cool, Adj:directed, N:autos, ε]
120 In Fact, Paths Form a Trellis p(word seq, tag seq) [trellis arcs: Det:the 0.32, Adj:cool, Adj:directed, Noun:autos, ε 0.2 to Stop] The best path: Start Det Adj Adj Noun Stop = 0.32 * ... the cool directed autos
121 In Fact, Paths Form a Trellis p(word seq, tag seq) [adds Noun:cool to the trellis] The best path: Start Det Adj Adj Noun Stop = 0.32 * ... the cool directed autos
122 In Fact, Paths Form a Trellis p(word seq, tag seq) [adds a second Adj:directed arc] The best path: Start Det Adj Adj Noun Stop = 0.32 * ... the cool directed autos
123 The Trellis Shape Emerges from the Cross-Product Construction [grid of states (0,0) ... (4,4) from composing a 4-word straight-line automaton with the tagging machine]
124 The Trellis Shape Emerges from the Cross-Product Construction [same grid, with ε arcs]
125 The Trellis Shape Emerges from the Cross-Product Construction All paths here are 4 words
126 The Trellis Shape Emerges from the Cross-Product Construction All paths here are 4 words, so all paths here must have 4 words on the output side
127 Actually, Trellis Isn't Complete p(word seq, tag seq) [trellis as before] The best path: Start Det Adj Adj Noun Stop = 0.32 * ... the cool directed autos
128 Actually, Trellis Isn't Complete p(word seq, tag seq) Trellis has no Det → Det or Det → Stop arcs; why? The best path: Start Det Adj Adj Noun Stop = 0.32 * ... the cool directed autos
129 Actually, Trellis Isn't Complete p(word seq, tag seq) Lattice is missing some other arcs; why? The best path: Start Det Adj Adj Noun Stop = 0.32 * ... the cool directed autos
130 Actually, Trellis Isn't Complete p(word seq, tag seq) Lattice is missing some states; why? The best path: Start Det Adj Adj Noun Stop = 0.32 * ... the cool directed autos
131 Find best path from Start to Stop [trellis: Det:the 0.32, Adj:cool, Noun:cool, Adj:directed, Noun:autos, ε 0.2]
132 Find best path from Start to Stop Use dynamic programming: What is best path from Start to each node? Work from left to right Each node stores its best path from Start (as probability plus one backpointer)
133 Find best path from Start to Stop Use dynamic programming: What is best path from Start to each node? Work from left to right Each node stores its best path from Start (as probability plus one backpointer) Special acyclic case of Dijkstra's shortest-path alg.
134 Find best path from Start to Stop Use dynamic programming: What is best path from Start to each node? Work from left to right Each node stores its best path from Start (as probability plus one backpointer) Special acyclic case of Dijkstra's shortest-path alg. Faster if some arcs/states are absent
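The left-to-right dynamic program with one backpointer per node is the Viterbi algorithm. The sketch below uses the slides' tag-bigram probabilities; the emission values other than Det:the/0.4 are cut off on the slides, so the ones here are invented stand-ins.

```python
import math

# Tag bigram probabilities from the slides' FSA.
P_TRANS = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Det", "Noun"): 0.7,
           ("Adj", "Adj"): 0.4, ("Adj", "Noun"): 0.5, ("Noun", "Stop"): 0.2}
# Emissions: Det:the/0.4 is from the slides; the rest are hypothetical.
P_EMIT = {("Det", "the"): 0.4, ("Adj", "cool"): 0.003, ("Noun", "cool"): 0.001,
          ("Adj", "directed"): 0.002, ("Noun", "directed"): 0.001,
          ("Noun", "autos"): 0.001}

def viterbi(words, tags, p_trans, p_emit):
    """Best tag path through the trellis: work left to right, storing at each
    node its best log-probability from Start plus one backpointer."""
    delta, back = {"Start": 0.0}, []
    for w in words:
        nxt, ptr = {}, {}
        for t in tags:
            e = p_emit.get((t, w), 0.0)
            if e == 0.0:
                continue  # absent arcs keep the trellis sparse (and fast)
            scored = [(s + math.log(p_trans[(prev, t)]) + math.log(e), prev)
                      for prev, s in delta.items()
                      if p_trans.get((prev, t), 0.0) > 0.0]
            if scored:
                nxt[t], ptr[t] = max(scored)
        delta, back = nxt, back + [ptr]
    # best way to reach Stop, then follow backpointers right to left
    score, last = max((s + math.log(p_trans[(t, "Stop")]), t)
                      for t, s in delta.items()
                      if p_trans.get((t, "Stop"), 0.0) > 0.0)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1][1:]  # drop Start
```

On "the cool directed autos" this recovers the slides' best path Det Adj Adj Noun, and skipping zero-probability arcs is exactly the "faster if some arcs/states are absent" point above.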
135 In Summary
136 In Summary We are modeling p(word seq, tag seq)
137 In Summary We are modeling p(word seq, tag seq) The tags are hidden, but we see the words
138 In Summary We are modeling p(word seq, tag seq) The tags are hidden, but we see the words Is tag sequence X likely with these words?
139 In Summary We are modeling p(word seq, tag seq) The tags are hidden, but we see the words Is tag sequence X likely with these words? Noisy channel model is a Hidden Markov Model :
140 In Summary We are modeling p(word seq, tag seq) The tags are hidden, but we see the words Is tag sequence X likely with these words? Noisy channel model is a Hidden Markov Model : Start PN Verb Det Noun Prep Noun Prep Noun Stop Bill directed a cortege of autos through ...
142 In Summary We are modeling p(word seq, tag seq) The tags are hidden, but we see the words Is tag sequence X likely with these words? Noisy channel model is a Hidden Markov Model : probs from tag bigram model Start PN Verb Det Noun Prep Noun Prep Noun Stop Bill directed a cortege of autos through ...
143 In Summary We are modeling p(word seq, tag seq) The tags are hidden, but we see the words Is tag sequence X likely with these words? Noisy channel model is a Hidden Markov Model : probs from tag bigram model probs from unigram replacement Start PN Verb Det Noun Prep Noun Prep Noun Stop Bill directed a cortege of autos through ...
144 In Summary We are modeling p(word seq, tag seq) The tags are hidden, but we see the words Is tag sequence X likely with these words? Noisy channel model is a Hidden Markov Model : probs from tag bigram model probs from unigram replacement Start PN Verb Det Noun Prep Noun Prep Noun Stop Bill directed a cortege of autos through ... Find X that maximizes probability product
145 Another Viewpoint
146 Another Viewpoint We are modeling p(word seq, tag seq)
147 Another Viewpoint We are modeling p(word seq, tag seq) Why not use chain rule + some kind of backoff?
148 Another Viewpoint We are modeling p(word seq, tag seq) Why not use chain rule + some kind of backoff? Actually, we are!
149 Another Viewpoint We are modeling p(word seq, tag seq) Why not use chain rule + some kind of backoff? Actually, we are! Start PN Verb Det Bill directed a p(...) = p(Start) * p(PN | Start) * p(Verb | Start PN) * p(Det | Start PN Verb) * ... * p(Bill | Start PN Verb ...) * p(directed | Bill, Start PN Verb Det ...) * p(a | Bill directed, Start PN Verb Det ...) * ...
150 Another Viewpoint We are modeling p(word seq, tag seq) Why not use chain rule + some kind of backoff? Actually, we are! Start PN Verb Det Bill directed a p(...) = p(Start) * p(PN | Start) * p(Verb | Start PN) * p(Det | Start PN Verb) * ... * p(Bill | Start PN Verb ...) * p(directed | Bill, Start PN Verb Det ...) * p(a | Bill directed, Start PN Verb Det ...) * ... Start PN Verb Det Noun Prep Noun Prep Det Noun Stop Bill directed a cortege of autos through the dunes
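The chain-rule view amounts to a drastic backoff: condition each tag on just the previous tag, and each word on just its own tag. A sketch of the resulting joint probability (emission values other than Det:the/0.4 are hypothetical, since the slides cut them off):

```python
# Slide tag-bigram probabilities; emissions beyond Det:the/0.4 are invented.
P_TRANS = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Adj", "Adj"): 0.4,
           ("Adj", "Noun"): 0.5, ("Noun", "Stop"): 0.2}
P_EMIT = {("Det", "the"): 0.4, ("Adj", "cool"): 0.003,
          ("Adj", "directed"): 0.002, ("Noun", "autos"): 0.001}

def p_joint(words, tags, p_trans, p_emit):
    """Chain rule with the HMM's backoff assumptions:
    p(t_i | everything so far) backs off to p(t_i | t_{i-1}),
    and p(w_i | everything so far) backs off to p(w_i | t_i)."""
    p, prev = 1.0, "Start"
    for w, t in zip(words, tags):
        p *= p_trans[(prev, t)] * p_emit[(t, w)]
        prev = t
    return p * p_trans[(prev, "Stop")]
```

Dropping the backoff (conditioning on full histories) would give the exact chain rule, but the parameters would be unlearnable; the HMM's two independence assumptions are what make estimation practical.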
153 Variations
154 Variations Multiple tags per word
155 Variations Multiple tags per word Transformations to knock some of them out
156 Variations Multiple tags per word Transformations to knock some of them out How to encode multiple tags and knockouts?
158 Variations Multiple tags per word Transformations to knock some of them out How to encode multiple tags and knockouts? Use the above for partly supervised learning
159 Variations Multiple tags per word Transformations to knock some of them out How to encode multiple tags and knockouts? Use the above for partly supervised learning Supervised: You have a tagged training corpus
160 Variations Multiple tags per word Transformations to knock some of them out How to encode multiple tags and knockouts? Use the above for partly supervised learning Supervised: You have a tagged training corpus Unsupervised: You have an untagged training corpus
161 Variations
Multiple tags per word
Transformations to knock some of them out
How to encode multiple tags and knockouts?
Use the above for partly supervised learning:
Supervised: you have a tagged training corpus
Unsupervised: you have an untagged training corpus
Here: you have an untagged training corpus and a dictionary giving the possible tags for each word
162 Standard View of HMMs
163 Applications of HMMs
NLP: part-of-speech tagging, word segmentation, information extraction, optical character recognition
Speech recognition: modeling acoustics, with continuous emissions
Computer vision: gesture recognition
Biology: gene finding, protein structure prediction
Economics, climatology, robotics, etc.
164 Recipe for NLP
Input: "the lead paint is unsafe" (observations)
Output: "the/Det lead/N paint/N is/V unsafe/Adj" (tags)
1) Data: notation, representation
2) Problem: write down the problem in notation
3) Model: make some assumptions, define a parametric model (often a generative model of the data)
4) Inference: how to search through possible answers to find the best one
5) Learning: how to estimate parameters
6) Implementation: engineering considerations for an efficient implementation
165 An HMM Tagger
View the sequence of tags as a Markov chain. Assumptions: limited horizon, time invariance (stationarity). We assume that a word's tag depends only on the previous tag (limited horizon) and that this dependency does not change over time (time invariance).
A state (part of speech) generates a word. We assume the word depends only on the state.
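As a tiny illustration of these assumptions (not the lecture's code; the probabilities below are invented for the example), the joint probability of a tag sequence and word sequence factors into transition terms p(tag | previous tag) and emission terms p(word | tag):

```python
# Toy transition and emission tables with made-up probabilities.
trans = {("Start", "PN"): 0.1, ("PN", "Verb"): 0.4, ("Verb", "Det"): 0.3}
emit = {("PN", "Bill"): 0.02, ("Verb", "directed"): 0.01, ("Det", "a"): 0.4}

def joint_prob(tags, words):
    """p(tags, words) = prod_t p(tag_t | tag_{t-1}) * p(word_t | tag_t)."""
    p = 1.0
    prev = "Start"
    for tag, word in zip(tags, words):
        p *= trans[(prev, tag)] * emit[(tag, word)]
        prev = tag
    return p

print(joint_prob(["PN", "Verb", "Det"], ["Bill", "directed", "a"]))  # tiny product of six factors
```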
166 The Markov Property
A stochastic process has the Markov property if the conditional probability distribution of future states of the process, given the current state, depends only upon the current state, and is conditionally independent of the past states (the path of the process) given the current state. A process with the Markov property is usually called a Markov process, and may be described as Markovian.
167 HMM w/ State Emissions
[Figure: states such as NN, DT, VBP, JJ, IN, with transition probabilities P(x_{t+1} | x_t) between states and emission probabilities P(o_t | x_t) generating words such as "for", "above", "in"]
168 HMM as Bayes Net
Top row is the unobserved states, interpreted as POS tags. Bottom row is the observed outputs (words).
169 (One) Standard HMM Formalism
(X, O, x_s, A, B) are all variables; model µ = (A, B).
X is the state sequence of length T; O is the observation sequence.
x_s is a designated start state (with no incoming transitions). (The start state can also be separated out into an initial state distribution π, as in the book.)
A is the matrix of transition probabilities (each row is a conditional probability table (CPT)).
B is the matrix of output probabilities (vertical CPTs).
An HMM is a probabilistic (nondeterministic) finite-state automaton, with probabilistic outputs (from vertices, not arcs, in the simple case).
170 HMM Inference Problems
Given an observation sequence, find the most likely state sequence (tagging).
Compute the probability of the observations when the state sequence is hidden (language modeling).
Given observations and (optionally) their corresponding states, find the parameters that maximize the probability of the observations (parameter estimation).
171 Most Likely State Sequence
Given O = (o_1, ..., o_T) and model µ = (A, B),
we want to find arg max_X P(O, X | µ), where P(O, X | µ) = P(O | X, µ) P(X | µ)
P(O | X, µ) = b[o_1 | x_1] b[o_2 | x_2] ... b[o_T | x_T]
P(X | µ) = a[x_2 | x_1] a[x_3 | x_2] ... a[x_T | x_{T-1}]
arg max_X P(O, X | µ) = arg max_{x_1, x_2, ..., x_T} P(O, X | µ)
Problem: the naive arg max is exponential in the sequence length!
174 Paths in a Trellis
[Figure: trellis of states x_1, x_2, x_3, x_4 over time 1 ... T; an example edge into state x_4 at time 4 carries weight a[x_4 | x_2] b[o_4]]
δ_i(t) = probability of the most likely path that ends at state i at time t.
175 Dynamic Programming
Efficient computation of the max over all state sequences.
Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences.
Define a forward score and compute it recursively from the beginning.
(Then we must remember the best paths to recover the arg max.)
176 The Viterbi Algorithm (1967)
Used to efficiently find the state sequence that gives the highest probability to the observed outputs.
Maintains two dynamic programming tables: the probability of the best path (max) and the state transitions of the best path (arg).
Note that this is different from finding the most likely tag for each time t!
177 Viterbi Recipe
Initialization
Induction (store backtrace)
Termination and path readout (probability of the entire best sequence)
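The steps above can be sketched directly in code. This is a minimal illustration, not the course's implementation; the toy two-tag model and all its probabilities are invented for the example:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # Initialization: delta[0][s] = start_p[s] * b[obs[0] | s]
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    psi = [{}]  # backtrace: best predecessor of each state at each time
    # Induction: delta[t][s] = max_r delta[t-1][r] * a[s | r] * b[obs[t] | s]
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for s in states:
            best = max(states, key=lambda r: delta[t - 1][r] * trans_p[r][s])
            delta[t][s] = delta[t - 1][best] * trans_p[best][s] * emit_p[s][obs[t]]
            psi[t][s] = best
    # Termination and path readout
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return path[::-1], delta[-1][last]

# Toy model (all numbers invented for illustration)
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"Fed": 0.5, "raises": 0.1}, "V": {"Fed": 0.1, "raises": 0.6}}
path, prob = viterbi(["Fed", "raises"], states, start_p, trans_p, emit_p)
# best path is ["N", "V"] with probability about 0.126 under this toy model
```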
178 HMMs: Maxing and Summing
185 Markov vs. Hidden Markov Models
[Figure contrasting a Markov chain whose states are the words themselves ("Fed", "raises", "interest", "rates", ...) with a hidden Markov model whose states are tags (NN, NNS, NNP, VB, VBZ) that emit those words]
186 Unrolled into a Trellis
[Figure: trellis with one column per word of "Fed raises interest rates" and one row per tag (NN, NNS, NNP, VB, VBZ)]
187 HMM Inference Problems
Given an observation sequence, find the most likely state sequence (tagging).
Compute the probability of the observations when the state sequence is hidden (language modeling).
Given observations and (optionally) their corresponding states, find the parameters that maximize the probability of the observations (parameter estimation).
188 Tagging
Given an observation sequence, find the most likely state sequence:
arg max_X P(X | O, µ) = arg max_X P(X, O | µ) / P(O | µ) = arg max_X P(X, O | µ) = arg max_{x_1, x_2, ..., x_T} P(x_1, x_2, ..., x_T, O | µ)
Last time: use dynamic programming to find the highest-probability sequence (i.e., the best path, as in Dijkstra's algorithm).
189 Language Modeling
Compute the probability of the observations when the state sequence is hidden.
P(X, O | µ) = P(O | X, µ) P(X | µ)
Therefore P(O | µ) = Σ_X P(O | X, µ) P(X | µ) = Σ_{x_1, x_2, ..., x_T} P(x_1, x_2, ..., x_T, O | µ)
Suspiciously similar to max_{x_1, x_2, ..., x_T} P(x_1, x_2, ..., x_T, O | µ)
197 Viterbi Algorithm (Tagging)
[Figure: the trellis over "Fed raises interest rates" with rows NN, NNS, NNP, VB, VBZ. The δ entry for VB at "interest" is the max over predecessor tags of the incoming terms δ_tag · a[VB | tag] · b[interest | VB], e.g. δ · a[VB | NN] · b[interest | VB], δ · a[VB | NNS] · b[interest | VB], ...]
205 Forward Algorithm (LM)
[Figure: the same trellis, but the α entry for VB at "interest" sums the incoming terms α_tag · a[VB | tag] · b[interest | VB] over predecessor tags instead of maxing]
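The forward computation differs from Viterbi only in replacing max with sum. A minimal sketch, using an emission-inclusive convention α_t(s) = P(o_1 ... o_t, x_t = s); the toy model and its numbers are invented for illustration:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    # Same trellis recurrence as Viterbi, with sum in place of max.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())  # P(O | mu), summed over final states

# Toy model (invented numbers, same shape as the Viterbi sketch)
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"Fed": 0.5, "raises": 0.1}, "V": {"Fed": 0.1, "raises": 0.6}}
total = forward(["Fed", "raises"], states, start_p, trans_p, emit_p)
```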
209 What Do These Greek Letters Mean?
δ_j(t) = max_{x_1 ... x_{t-1}} P(x_1 ... x_{t-1}, o_1 ... o_{t-1}, x_t = j | µ)
— the probability of the best path from the beginning to time t such that the tag at time t is j.
α_j(t) = Σ_{x_1 ... x_{t-1}} P(x_1 ... x_{t-1}, o_1 ... o_{t-1}, x_t = j | µ) = P(o_1 ... o_{t-1}, x_t = j | µ)
— the probability of all paths from the beginning to time t such that the tag at time t is j.
NOT the probability of the best tag at time t.
210 HMM Language Modeling
Probability of the observations, summed over all possible ways of tagging them: P(O | µ) = Σ_i α_i(T)
This is the sum of all path probabilities in the trellis.
211 HMM Parameter Estimation
Supervised: train on tagged text, test on plain text.
Maximum likelihood (can be smoothed):
a[VBZ | NN] = C(NN, VBZ) / C(NN)
b[rates | VBZ] = C(VBZ, rates) / C(VBZ)
Unsupervised: train and test on plain text. What can we do?
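The supervised count/normalize estimation can be sketched as follows (unsmoothed; the sentence format of (word, tag) pairs and the "Start" symbol are assumptions of this sketch):

```python
from collections import Counter

def train_supervised(tagged_sents):
    # Unsmoothed maximum-likelihood estimates from tagged text:
    #   a[t2 | t1] = C(t1, t2) / C(t1)   (transition)
    #   b[w  | t ] = C(t, w)  / C(t)     (emission)
    bigram, context, emission, tag_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sents:
        prev = "Start"
        for word, tag in sent:
            bigram[(prev, tag)] += 1
            context[prev] += 1
            emission[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    a = {(t1, t2): c / context[t1] for (t1, t2), c in bigram.items()}
    b = {(t, w): c / tag_count[t] for (t, w), c in emission.items()}
    return a, b
```

Without smoothing, any unseen tag bigram or (tag, word) pair gets probability zero, which is why the slide notes the counts "can be smoothed".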
222 Forward-Backward Algorithm
[Figure: the trellis over "Fed raises interest rates" again; each state carries a forward probability α (built left to right from incoming terms such as α · a[VB | tag] · b[interest | VB]) and a backward probability β (built right to left from outgoing terms such as a[tag | VB] · b[rates | tag])]
223 Forward-Backward Algorithm
P(o_1 ... o_{t-1}, x_t = j | µ) = α_j(t)
P(o_t ... o_T | x_t = j, µ) = β_j(t)
P(o_1 ... o_T, x_t = j | µ) = α_j(t) β_j(t)
P(x_t = j | O, µ) = P(x_t = j, O | µ) / P(O | µ) = α_j(t) β_j(t) / P(O | µ)
P(x_t = i, x_{t+1} = j | O, µ) = P(x_t = i, x_{t+1} = j, O | µ) / P(O | µ) = α_i(t) a[j | i] b[o_t | j] β_j(t + 1) / P(O | µ)
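These posteriors can be computed together. A minimal sketch (note: this code uses the emission-inclusive convention α_t(s) = P(o_1 ... o_t, x_t = s), which differs by one emission factor from the slide's convention; the toy model format matches the earlier sketches and its numbers are invented):

```python
def posteriors(obs, states, start_p, trans_p, emit_p):
    T = len(obs)
    # Forward pass: alpha[t][s] = P(o_1 .. o_t, x_t = s)
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, T):
        alpha.append({s: sum(alpha[t - 1][r] * trans_p[r][s] for r in states)
                         * emit_p[s][obs[t]] for s in states})
    # Backward pass: beta[t][s] = P(o_{t+1} .. o_T | x_t = s)
    beta = [None] * T
    beta[T - 1] = {s: 1.0 for s in states}
    for t in range(T - 2, -1, -1):
        beta[t] = {s: sum(trans_p[s][r] * emit_p[r][obs[t + 1]] * beta[t + 1][r]
                          for r in states) for s in states}
    total = sum(alpha[T - 1][s] for s in states)  # P(O | mu)
    # Posterior tag probabilities: P(x_t = s | O, mu) = alpha * beta / P(O | mu)
    return [{s: alpha[t][s] * beta[t][s] / total for s in states} for t in range(T)]

# Toy model (invented numbers)
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"Fed": 0.5, "raises": 0.1}, "V": {"Fed": 0.1, "raises": 0.6}}
gamma = posteriors(["Fed", "raises"], states, start_p, trans_p, emit_p)
```

The rows of `gamma` are the per-position tag posteriors used as expected counts in the E-step of EM.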
224 Expectation Maximization (EM)
Iterative algorithm to maximize the likelihood of observed data in the presence of hidden data (e.g., tags).
Choose an initial model µ.
Expectation step: find the expected value of the hidden variables given the current µ.
Maximization step: choose a new µ to maximize the probability of the hidden and observed data.
Guaranteed to increase likelihood; not guaranteed to find the global maximum.
225 Supervised vs. Unsupervised
Supervised                  Unsupervised
Annotated training text     Plain text
Simple count/normalize      EM
Fixed tag set               Set during training
Training reads data once    Training needs multiple passes
226 Logarithms for Precision
P(Y) = p(y_1) p(y_2) ... p(y_T)
log P(Y) = log p(y_1) + log p(y_2) + ... + log p(y_T)
Increased dynamic range: from [0, 1] to [−∞, 0]
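In log space, products become sums, but the forward algorithm's sums of probabilities need a stable "log-add". A common sketch:

```python
import math

def log_sum_exp(log_vals):
    # Stable addition in log space: log(sum_i exp(l_i)).
    # Shifting by the max avoids underflow for very negative log-probs.
    m = max(log_vals)
    if m == float("-inf"):  # all inputs are log(0)
        return m
    return m + math.log(sum(math.exp(l - m) for l in log_vals))
```

Naively computing `math.log(math.exp(-1000) + math.exp(-1000))` underflows to log(0); the shifted version returns −1000 + log 2 as expected.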
227 Semirings
Semiring        Set             ⊕ (plus)                ⊗ (times)   0    1
Prob            R₊              +                       ×           0    1
Max             R₊              max                     ×           0    1
Log             R ∪ {±∞}        log-add                 +           −∞   0
Tropical        R ∪ {±∞}        max                     +           −∞   0
Shortest path   R ∪ {±∞}        min                     +           +∞   0
Boolean         {T, F}          ∨                       ∧           F    T
String          Σ* ∪ {∞}        longest common prefix   concat      ∞    ε
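The semiring view can be made concrete: one trellis recurrence, parameterized by the semiring's ⊕ and ⊗, computes either the forward probability or the Viterbi score. This is a sketch, not a library API; the toy model below is invented for illustration:

```python
from operator import mul

def trellis_total(obs, states, start_p, trans_p, emit_p, plus, times):
    # plus=sum, times=* -> forward probability P(O | mu)  (Prob semiring)
    # plus=max, times=* -> Viterbi best-path score        (Max semiring)
    vals = {s: times(start_p[s], emit_p[s][obs[0]]) for s in states}
    for o in obs[1:]:
        vals = {s: times(plus(times(vals[r], trans_p[r][s]) for r in states),
                         emit_p[s][o]) for s in states}
    return plus(vals.values())

# Toy model (invented numbers)
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"Fed": 0.5, "raises": 0.1}, "V": {"Fed": 0.1, "raises": 0.6}}
lm_prob = trellis_total(["Fed", "raises"], states, start_p, trans_p, emit_p, sum, mul)
viterbi_score = trellis_total(["Fed", "raises"], states, start_p, trans_p, emit_p, max, mul)
```

Swapping in a log-add ⊕ with + as ⊗ would give the log-space forward algorithm from the same recurrence.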
228 Reading
Barzilay & Lee. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization. HLT-NAACL.
Ritter, Cherry & Dolan. Unsupervised Modeling of Twitter Conversations. HLT-NAACL.
Background: Jurafsky & Martin, ch. 5 and
More informationStatistical methods in NLP, lecture 7 Tagging and parsing
Statistical methods in NLP, lecture 7 Tagging and parsing Richard Johansson February 25, 2014 overview of today's lecture HMM tagging recap assignment 3 PCFG recap dependency parsing VG assignment 1 overview
More informationSequences and Information
Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols
More informationHidden Markov Models in Language Processing
Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications
More informationTnT Part of Speech Tagger
TnT Part of Speech Tagger By Thorsten Brants Presented By Arghya Roy Chaudhuri Kevin Patel Satyam July 29, 2014 1 / 31 Outline 1 Why Then? Why Now? 2 Underlying Model Other technicalities 3 Evaluation
More informationCSC401/2511 Spring CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto
CSC401/2511 Natural Language Computing Spring 2019 Lecture 5 Frank Rudzicz and Chloé Pou-Prom University of Toronto Revisiting PoS tagging Will/MD the/dt chair/nn chair/?? the/dt meeting/nn from/in that/dt
More informationToday s Agenda. Need to cover lots of background material. Now on to the Map Reduce stuff. Rough conceptual sketch of unsupervised training using EM
Today s Agenda Need to cover lots of background material l Introduction to Statistical Models l Hidden Markov Models l Part of Speech Tagging l Applying HMMs to POS tagging l Expectation-Maximization (EM)
More informationDept. of Linguistics, Indiana University Fall 2009
1 / 14 Markov L645 Dept. of Linguistics, Indiana University Fall 2009 2 / 14 Markov (1) (review) Markov A Markov Model consists of: a finite set of statesω={s 1,...,s n }; an signal alphabetσ={σ 1,...,σ
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationParsing with Context-Free Grammars
Parsing with Context-Free Grammars CS 585, Fall 2017 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp2017 Brendan O Connor College of Information and Computer Sciences
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other
More informationAdvanced Natural Language Processing Syntactic Parsing
Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm
More informationLanguage Processing with Perl and Prolog
Language Processing with Perl and Prolog es Pierre Nugues Lund University Pierre.Nugues@cs.lth.se http://cs.lth.se/pierre_nugues/ Pierre Nugues Language Processing with Perl and Prolog 1 / 12 Training
More informationLecture 15. Probabilistic Models on Graph
Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how
More informationSequential Data Modeling - The Structured Perceptron
Sequential Data Modeling - The Structured Perceptron Graham Neubig Nara Institute of Science and Technology (NAIST) 1 Prediction Problems Given x, predict y 2 Prediction Problems Given x, A book review
More informationNgram Review. CS 136 Lecture 10 Language Modeling. Thanks to Dan Jurafsky for these slides. October13, 2017 Professor Meteer
+ Ngram Review October13, 2017 Professor Meteer CS 136 Lecture 10 Language Modeling Thanks to Dan Jurafsky for these slides + ASR components n Feature Extraction, MFCCs, start of Acoustic n HMMs, the Forward
More informationSequential Supervised Learning
Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationRoger Levy Probabilistic Models in the Study of Language draft, October 2,
Roger Levy Probabilistic Models in the Study of Language draft, October 2, 2012 224 Chapter 10 Probabilistic Grammars 10.1 Outline HMMs PCFGs ptsgs and ptags Highlight: Zuidema et al., 2008, CogSci; Cohn
More informationHidden Markov Models and Gaussian Mixture Models
Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian
More informationConditional Random Fields: An Introduction
University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science 2-24-2004 Conditional Random Fields: An Introduction Hanna M. Wallach University of Pennsylvania
More informationN-gram Language Modeling
N-gram Language Modeling Outline: Statistical Language Model (LM) Intro General N-gram models Basic (non-parametric) n-grams Class LMs Mixtures Part I: Statistical Language Model (LM) Intro What is a statistical
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationPart-of-Speech Tagging
Part-of-Speech Tagging Informatics 2A: Lecture 17 Shay Cohen School of Informatics University of Edinburgh 26 October 2018 1 / 48 Last class We discussed the POS tag lexicon When do words belong to the
More informationMACHINE LEARNING 2 UGM,HMMS Lecture 7
LOREM I P S U M Royal Institute of Technology MACHINE LEARNING 2 UGM,HMMS Lecture 7 THIS LECTURE DGM semantics UGM De-noising HMMs Applications (interesting probabilities) DP for generation probability
More informationA gentle introduction to Hidden Markov Models
A gentle introduction to Hidden Markov Models Mark Johnson Brown University November 2009 1 / 27 Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence
More informationNatural Language Processing SoSe Words and Language Model
Natural Language Processing SoSe 2016 Words and Language Model Dr. Mariana Neves May 2nd, 2016 Outline 2 Words Language Model Outline 3 Words Language Model Tokenization Separation of words in a sentence
More informationHidden Markov Modelling
Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models
More informationUnit 1: Sequence Models
CS 562: Empirical Methods in Natural Language Processing Unit 1: Sequence Models Lecture 5: Probabilities and Estimations Lecture 6: Weighted Finite-State Machines Week 3 -- Sep 8 & 10, 2009 Liang Huang
More informationCS532, Winter 2010 Hidden Markov Models
CS532, Winter 2010 Hidden Markov Models Dr. Alan Fern, afern@eecs.oregonstate.edu March 8, 2010 1 Hidden Markov Models The world is dynamic and evolves over time. An intelligent agent in such a world needs
More information