DT2118 Speech and Speaker Recognition


1 DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH VT / 56

2 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 2 / 56

3 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 3 / 56

4 Components of an ASR System: the speech signal is processed by spectral analysis and feature extraction (representation); the decoder searches for the best match under the constraints of the acoustic models, lexical models and language models (knowledge) and outputs the recognised words. 4 / 56

5 Why do we need language models? Bayes rule: P(words | sounds) = P(sounds | words) P(words) / P(sounds), where P(words) is the a priori probability of the words (Language Model). We could use non-informative priors (P(words) = 1/N), but... 5 / 56

6 Branching Factor: if we have N words in the dictionary, at every word boundary we have to consider N equally likely alternatives (word1, word2, ..., wordN), and N can be in the order of millions. 6 / 56

7 Ambiguity: ice cream vs I scream /ai s k r i: m/ 7 / 56

8 Language Models in ASR. We want to: 1. limit the branching factor in the recognition network; 2. augment and complete the acoustic probabilities. We are only interested in whether the sequence of words is grammatically plausible or not; this kind of grammar is integrated in the recognition network prior to decoding. 8 / 56

9 Language Models in Dialogue Systems: we want to assign a class to each word (noun, verb, attribute... parts of speech); parsing is usually performed on the output of a speech recogniser. The grammar is used twice in a Dialogue System! 9 / 56

10 Language Models in ASR. Small vocabulary: often a formal grammar specified by hand (example: a loop of digits, as in the HTK exercise). Large vocabulary: often a stochastic grammar estimated from data. 10 / 56

11 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 11 / 56

12 Formal Language Theory. Grammar: formal specification of the permissible structures of the language. Parser: algorithm that can analyse a sentence and determine whether its structure is compliant with the grammar. 12 / 56

13 Chomsky's formal grammar. Noam Chomsky: linguist, philosopher,... A grammar is defined as G = (V, T, P, S), where V: set of non-terminal constituents, T: set of terminals (lexical items), P: set of production rules, S: start symbol. 13 / 56

15 Example. S = sentence; V = {NP (noun phrase), NP1, VP (verb phrase), NAME, ADJ, V (verb), N (noun)}; T = {Mary, person, loves, that,... }; P = {S → NP VP, NP → NAME, NP → ADJ NP1, NP1 → N, VP → V NP, NAME → Mary, V → loves, N → person, ADJ → that}. The slide also shows the resulting parse tree for "Mary loves that person": S → NP(NAME Mary) VP(V loves, NP(ADJ that, NP1(N person))). 14 / 56
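Not on the original slides: the example grammar above translates directly into a small data structure. A minimal Python sketch (the names GRAMMAR, START and TERMINALS are illustrative):

    # Minimal sketch of the example grammar G = (V, T, P, S) as Python data.
    # Non-terminals (V) are the dictionary keys, terminals (T) appear only in right-hand sides.
    GRAMMAR = {
        "S":    [["NP", "VP"]],
        "NP":   [["NAME"], ["ADJ", "NP1"]],
        "NP1":  [["N"]],
        "VP":   [["V", "NP"]],
        "NAME": [["Mary"]],
        "V":    [["loves"]],
        "N":    [["person"]],
        "ADJ":  [["that"]],
    }
    START = "S"
    TERMINALS = {"Mary", "loves", "that", "person"}

    print(GRAMMAR[START])   # [['NP', 'VP']]: the single production for the start symbol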

17 Chomsky's hierarchy. Greek letters denote sequences of terminals or non-terminals, upper-case Latin letters a single non-terminal, lower-case Latin letters a single terminal.
Type | Constraints | Automata
Phrase structure grammar | α → β (this is the most general grammar) | Turing machine
Context-sensitive grammar | length of α ≤ length of β | Linear bounded automaton
Context-free grammar | A → β (equivalent to A → w, A → BC) | Push-down automaton
Regular grammar | A → w, A → w B | Finite-state automaton
Context-free and regular grammars are used in practice. 15 / 56

19 Are languages context-free? Mostly true, with exceptions. Swiss German:... das mer d chind em Hans es huus lönd hälfe aastriiche. Word-by-word:... that we the children Hans the house let help paint. Translation:... that we let the children help Hans paint the house. 16 / 56

20 Parsers assign each word in a sentence to a part of speech. Originally developed for programming languages (no ambiguities); only available for context-free and regular grammars. Top-down: start with S and generate rules until you reach the words (terminal symbols). Bottom-up: start with the words and work your way up until you reach S. 17 / 56

21 Example: Top-down parser (derivation steps, with the rule applied at each step):
S
NP VP (S → NP VP)
NAME VP (NP → NAME)
Mary VP (NAME → Mary)
Mary V NP (VP → V NP)
Mary loves NP (V → loves)
Mary loves ADJ NP1 (NP → ADJ NP1)
Mary loves that NP1 (ADJ → that)
Mary loves that N (NP1 → N)
Mary loves that person (N → person)
18 / 56

31 Example: Bottom-up parser (the sentence is reduced step by step, with the rule applied at each step):
Mary loves that person
NAME loves that person (NAME → Mary)
NAME V that person (V → loves)
NAME V ADJ person (ADJ → that)
NAME V ADJ N (N → person)
NP V ADJ N (NP → NAME)
NP V ADJ NP1 (NP1 → N)
NP V NP (NP → ADJ NP1)
NP VP (VP → V NP)
S (S → NP VP)
19 / 56

41 Top-down vs bottom-up parsers. Top-down characteristics: + very predictive; + only considers grammatical combinations; - predicts constituents that do not have a match in the text. Bottom-up characteristics: + checks the input text only once; + suitable for robust language processing; - may build trees that do not lead to a full parse. All in all, similar performance. 20 / 56

42 Chart parsing (dynamic programming): for the sentence "Mary loves that person", the chart is filled left to right with the completed constituents (NAME, V, ADJ, N, NP, NP1, VP) and the rules they trigger (NP → NAME, NP → ADJ NP1, VP → V NP, S → NP VP), until an S covering the whole sentence is found. 21 / 56
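As an illustration of the dynamic-programming idea, here is a minimal CKY-style recognizer for the toy grammar, not the exact chart parser of the slides; the rule tables and the function cky_recognize are assumptions made for this sketch (unary rules are closed explicitly, since the grammar is not in Chomsky normal form):

    from collections import defaultdict

    # Hypothetical CKY-style recognizer for the toy grammar of the previous slides
    # (lexical, unary and binary rules listed explicitly).
    LEXICAL = {"Mary": {"NAME"}, "loves": {"V"}, "that": {"ADJ"}, "person": {"N"}}
    UNARY = {"NAME": {"NP"}, "N": {"NP1"}}                 # child -> parent non-terminals
    BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("ADJ", "NP1"): {"NP"}}

    def close_unary(cell):
        # Repeatedly add parents of unary rules until nothing changes.
        changed = True
        while changed:
            changed = False
            for child in list(cell):
                for parent in UNARY.get(child, ()):
                    if parent not in cell:
                        cell.add(parent)
                        changed = True

    def cky_recognize(words, start="S"):
        n = len(words)
        chart = defaultdict(set)                           # chart[(i, j)]: categories spanning words[i:j]
        for i, w in enumerate(words):                      # fill the diagonal with lexical categories
            chart[(i, i + 1)] |= LEXICAL.get(w, set())
            close_unary(chart[(i, i + 1)])
        for span in range(2, n + 1):                       # combine adjacent spans bottom-up
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for left in chart[(i, k)]:
                        for right in chart[(k, j)]:
                            chart[(i, j)] |= BINARY.get((left, right), set())
                close_unary(chart[(i, j)])
        return start in chart[(0, n)]

    print(cky_recognize("Mary loves that person".split()))   # True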

47 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 22 / 56

48 Stochastic Language Models (SLM). 1. Formal grammars lack coverage (for general domains); 2. spoken language does not strictly follow the grammar. Model sequences of words statistically: P(W) = P(w_1 w_2 ... w_n). 23 / 56

49 Probabilistic Context-free grammars (PCFGs). Assign probabilities to the generative rules: P(A → α | G). Then calculate the probability of generating a word sequence w_1 w_2 ... w_n as the probability of the rules necessary to go from S to w_1 w_2 ... w_n: P(S ⇒ w_1 w_2 ... w_n | G). 24 / 56

50 Training PCFGs. With an annotated corpus, Maximum Likelihood estimate: P(A → α_j) = C(A → α_j) / Σ_{i=1}^{m} C(A → α_i). With a non-annotated corpus: inside-outside algorithm (similar to HMM training, forward-backward). 25 / 56
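The maximum-likelihood estimate above is just a per-left-hand-side normalisation of rule counts; a minimal sketch with made-up treebank counts (rule_counts is hypothetical):

    from collections import defaultdict

    # Hypothetical rule counts C(A -> alpha) collected from an annotated corpus.
    rule_counts = {("NP", ("NAME",)): 30, ("NP", ("ADJ", "NP1")): 10, ("VP", ("V", "NP")): 25}

    totals = defaultdict(int)                       # total count per left-hand side A
    for (lhs, rhs), c in rule_counts.items():
        totals[lhs] += c
    rule_prob = {rule: c / totals[rule[0]] for rule, c in rule_counts.items()}

    print(rule_prob[("NP", ("NAME",))])             # 0.75 = 30 / (30 + 10)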

51 Independence assumption: in a PCFG the expansion of each non-terminal is assumed independent of the rest of the parse tree (illustrated on the slide with the parse tree of "Mary loves that person"). 26 / 56

52 Inside-outside probabilities. Chomsky normal form: A_i → A_m A_n or A_i → w_l. inside(s, A_i, t) = P(A_i ⇒ w_s w_{s+1} ... w_t); outside(s, A_i, t) = P(S ⇒ w_1 ... w_{s-1} A_i w_{t+1} ... w_T). 27 / 56

53 Probabilistic Context-free grammars: limitations. Probabilities help sorting alternative explanations, but there is still a problem with coverage: the production rules P(A → α | G) are hand-made. 28 / 56

54 N-gram Language Models. Flat model: no hierarchical structure. P(W) = P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, w_2, ..., w_{n-1}) = Π_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1}). Approximations: P(w_i | w_1, w_2, ..., w_{i-1}) = P(w_i) (Unigram); = P(w_i | w_{i-1}) (Bigram); = P(w_i | w_{i-2}, w_{i-1}) (Trigram); = P(w_i | w_{i-N+1}, ..., w_{i-1}) (N-gram). 29 / 56

56 Example (Bigram): P(Mary, loves, that, person) = P(Mary | <s>) P(loves | Mary) P(that | loves) P(person | that) P(</s> | person). 30 / 56

57 N-gram estimation (Maximum Likelihood). P(w_i | w_{i-N+1}, ..., w_{i-1}) = C(w_{i-N+1}, ..., w_{i-1}, w_i) / C(w_{i-N+1}, ..., w_{i-1}) = C(w_{i-N+1}, ..., w_{i-1}, w_i) / Σ_{w_i} C(w_{i-N+1}, ..., w_{i-1}, w_i). Problem: data sparseness. 31 / 56

59 N-gram estimation example. Corpus: 1: John read her book; 2: I read a different book; 3: John read a book by Mulan.
P(John | <s>) = C(<s>, John) / C(<s>) = 2/3
P(read | John) = C(John, read) / C(John) = 2/2
P(a | read) = C(read, a) / C(read) = 2/3
P(book | a) = C(a, book) / C(a) = 1/2
P(</s> | book) = C(book, </s>) / C(book) = 2/3
P(John, read, a, book) = P(John | <s>) P(read | John) P(a | read) P(book | a) P(</s> | book) ≈ 0.148
P(Mulan, read, a, book) = P(Mulan | <s>) ... = 0
32 / 56
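The numbers above can be reproduced by counting adjacent word pairs in the three-sentence corpus; a minimal bigram sketch (function names are illustrative):

    from collections import Counter

    corpus = ["John read her book", "I read a different book", "John read a book by Mulan"]
    sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]

    unigrams = Counter(w for s in sentences for w in s[:-1])          # history counts
    bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))    # adjacent word pairs

    def p_ml(w, h):                       # maximum-likelihood bigram estimate C(h, w) / C(h)
        return bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0

    def p_sentence(words):
        s = ["<s>"] + words + ["</s>"]
        p = 1.0
        for h, w in zip(s, s[1:]):
            p *= p_ml(w, h)
        return p

    print(round(p_sentence("John read a book".split()), 3))   # 0.148
    print(p_sentence("Mulan read a book".split()))            # 0.0: unseen bigram (<s>, Mulan)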

62 N-gram Smoothing. Problem: many perfectly possible word sequences may have been observed zero times or very few times in the training data. This leads to extremely low probabilities, effectively disabling these word sequences, no matter how strong the acoustic evidence is. Solution: smoothing produces more robust probabilities for unseen data, at the cost of modelling the training data slightly worse. 33 / 56

63 Simplest Smoothing technique. Instead of the ML estimate P(w_i | w_{i-N+1}, ..., w_{i-1}) = C(w_{i-N+1}, ..., w_{i-1}, w_i) / Σ_{w_i} C(w_{i-N+1}, ..., w_{i-1}, w_i), use P(w_i | w_{i-N+1}, ..., w_{i-1}) = (1 + C(w_{i-N+1}, ..., w_{i-1}, w_i)) / Σ_{w_i} (1 + C(w_{i-N+1}, ..., w_{i-1}, w_i)). This prevents zero probabilities, but still gives very low probabilities. 34 / 56

64 N-gram simple smoothing example. Corpus: 1: John read her book; 2: I read a different book; 3: John read a book by Mulan.
P(John | <s>) = (1 + C(<s>, John)) / (11 + C(<s>)) = 3/14
P(read | John) = (1 + C(John, read)) / (11 + C(John)) = 3/13
...
P(Mulan | <s>) = (1 + C(<s>, Mulan)) / (11 + C(<s>)) = 1/14
P(John, read, a, book) = P(John | <s>) P(read | John) P(a | read) P(book | a) P(</s> | book) (cf. 0.148 with the unsmoothed estimate)
P(Mulan, read, a, book) = P(Mulan | <s>) P(read | Mulan) P(a | read) P(book | a) P(</s> | book) (cf. 0 with the unsmoothed estimate)
35 / 56
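A sketch of the same example with add-one smoothing, assuming a vocabulary of 11 symbols (the 9 distinct words plus <s> and </s>) as in the denominators above; it also computes the smoothed sentence probabilities that the slide compares against the unsmoothed values:

    from collections import Counter

    corpus = ["John read her book", "I read a different book", "John read a book by Mulan"]
    sents = [["<s>"] + s.split() + ["</s>"] for s in corpus]
    unigrams = Counter(w for s in sents for w in s[:-1])
    bigrams = Counter(b for s in sents for b in zip(s, s[1:]))
    V = len({w for s in sents for w in s})               # 11 = 9 words + <s> + </s>

    def p_add1(w, h):                                    # add-one (Laplace) smoothed bigram
        return (1 + bigrams[(h, w)]) / (V + unigrams[h])

    def p_sent(words):
        s = ["<s>"] + words + ["</s>"]
        p = 1.0
        for h, w in zip(s, s[1:]):
            p *= p_add1(w, h)
        return p

    print(p_sent("John read a book".split()))    # ~3.5e-4 (0.148 without smoothing)
    print(p_sent("Mulan read a book".split()))   # ~4.2e-5 (0 without smoothing)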

67 Interpolation vs Backoff smoothing. Interpolation models: linear combination with lower-order n-grams; modifies the probabilities of both nonzero- and zero-count n-grams. Backoff models: use lower-order n-grams when the requested n-gram has a zero or very low count in the training data; nonzero-count n-grams are unchanged. Discounting: reduce the probability of seen n-grams and distribute it among unseen ones. 36 / 56

68 Interpolation vs Backoff smoothing. Interpolation models: P_smooth(w_i | w_{i-N+1}, ..., w_{i-1}) = λ P_ML(w_i | w_{i-N+1}, ..., w_{i-1}) + (1 - λ) P_smooth(w_i | w_{i-N+2}, ..., w_{i-1}). Backoff models: P_smooth(w_i | w_{i-N+1}, ..., w_{i-1}) = α P(w_i | w_{i-N+1}, ..., w_{i-1}) if C(w_{i-N+1}, ..., w_{i-1}, w_i) > 0, and γ P_smooth(w_i | w_{i-N+2}, ..., w_{i-1}) if C(w_{i-N+1}, ..., w_{i-1}, w_i) = 0. 37 / 56
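The two schemes differ only in whether the lower-order estimate is always mixed in or used only for unseen n-grams; a minimal sketch with hand-picked λ, α and γ (a real backoff model chooses α and γ so that the probabilities sum to one):

    from collections import Counter

    sents = [["<s>", "john", "read", "a", "book", "</s>"],
             ["<s>", "john", "read", "her", "book", "</s>"]]
    uni = Counter(w for s in sents for w in s)
    bi = Counter(b for s in sents for b in zip(s, s[1:]))
    n_tokens = sum(uni.values())

    def p_uni(w):
        return uni[w] / n_tokens

    def p_bi(w, h):
        return bi[(h, w)] / uni[h] if uni[h] else 0.0

    def p_interp(w, h, lam=0.7):
        # Interpolation: always mix the higher- and lower-order estimates.
        return lam * p_bi(w, h) + (1 - lam) * p_uni(w)

    def p_backoff(w, h, alpha=0.9, gamma=0.4):
        # Backoff: keep a (discounted) bigram estimate if the bigram was seen,
        # otherwise fall back to a scaled unigram estimate.
        return alpha * p_bi(w, h) if bi[(h, w)] > 0 else gamma * p_uni(w)

    print(p_interp("book", "a"), p_backoff("book", "a"))        # seen bigram
    print(p_interp("book", "john"), p_backoff("book", "john"))  # unseen bigram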

69 Deleted interpolation smoothing. Recursively interpolate with n-grams of lower order: if history_n = w_{i-n+1}, ..., w_{i-1}, then P_I(w_i | history_n) = λ_{history_n} P(w_i | history_n) + (1 - λ_{history_n}) P_I(w_i | history_{n-1}). It is hard to estimate λ_{history_n} for every history, so the histories are clustered into a moderate number of weights. 38 / 56

70 Backoff smoothing. Use P(w_i | history_{n-1}) only if you lack data for P(w_i | history_n). 39 / 56

71 Good-Turing estimate. Partition n-grams into groups depending on their frequency in the training data. Change the number of occurrences of an n-gram according to r* = (r + 1) n_{r+1} / n_r, where r is the occurrence count and n_r is the number of n-grams that occur r times. 40 / 56
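The adjusted counts only require the counts-of-counts n_r; a minimal sketch on made-up bigram counts (falling back to the raw count when n_r or n_{r+1} is unavailable is a simplification of this sketch, not part of the original formulation):

    from collections import Counter

    # Hypothetical bigram counts; the adjusted count is r* = (r + 1) * n_{r+1} / n_r.
    counts = {"ab": 1, "ac": 1, "ad": 1, "ba": 2, "bc": 2, "ca": 3}
    n = Counter(counts.values())            # n_r: how many n-grams occur exactly r times

    def good_turing(r):
        # Adjusted count; falls back to r itself when n_r or n_{r+1} is unavailable.
        if n[r] == 0 or n[r + 1] == 0:
            return float(r)
        return (r + 1) * n[r + 1] / n[r]

    for r in sorted(set(counts.values())):
        print(r, "->", good_turing(r))      # e.g. 1 -> 2 * 2 / 3 ≈ 1.33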

72 Katz smoothing. Based on Good-Turing: combine higher- and lower-order n-grams. For every N-gram: 1. if the count r is large (> 5 or 8), do not change it; 2. if the count r is small but non-zero, discount it with r*; 3. if the count r = 0, reassign the discounted counts with the lower-order N-gram: C*(w_{i-1}, w_i) = α(w_{i-1}) P(w_i). 41 / 56

73 Kneser-Ney smoothing: motivation. Background: lower-order n-grams are often used as a backoff model if the count of a higher-order n-gram is too low (e.g. unigram instead of bigram). Problem: some words with relatively high unigram probability only occur in a few bigrams, e.g. Francisco, which is mainly found in San Francisco. However, infrequent word pairs, such as New Francisco, will be given too high a probability if the unigram probabilities of New and Francisco are used. Maybe instead, the Francisco unigram should have a lower value to prevent it from occurring in other contexts. I can't see without my reading... 42 / 56

74 Kneser-Ney intuition. If a word has been seen in many contexts, it is more likely to be seen in new contexts as well. Instead of backing off to the lower-order n-gram, use the continuation probability. Example: instead of the unigram P(w_i), use P_CONTINUATION(w_i) = |{w_{i-1} : C(w_{i-1} w_i) > 0}| / Σ_{w_i} |{w_{i-1} : C(w_{i-1} w_i) > 0}|. I can't see without my reading... glasses. 43 / 56
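A minimal sketch of the continuation probability on a made-up corpus in which "francisco" is frequent but appears after a single context, while "new" appears after several:

    from collections import Counter, defaultdict

    text = ("san francisco is sunny . new york is big . "
            "san francisco is hilly . i like new ideas .").split()
    bigrams = list(zip(text, text[1:]))

    left_contexts = defaultdict(set)                   # word -> set of distinct predecessors
    for h, w in bigrams:
        left_contexts[w].add(h)

    total_types = sum(len(s) for s in left_contexts.values())   # number of distinct bigram types

    def p_continuation(w):
        # Fraction of distinct bigram types that end in w.
        return len(left_contexts[w]) / total_types

    unigram = Counter(text)
    # Same unigram count (2), but "francisco" has one left context and "new" has two,
    # so "francisco" gets a lower continuation probability.
    print(unigram["francisco"], p_continuation("francisco"))
    print(unigram["new"], p_continuation("new"))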

76 Class N-grams. 1. Group words into semantic or grammatical classes; 2. build n-grams for class sequences: P(w_i | c_{i-N+1} ... c_{i-1}) = P(w_i | c_i) P(c_i | c_{i-N+1} ... c_{i-1}). Rapid adaptation, small training sets, small models; works on limited domains; classes can be rule-based or data-driven. 44 / 56
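A minimal sketch of a class bigram with hypothetical classes and probabilities; the decomposition P(w_i | c_i) P(c_i | c_{i-1}) follows the formula above:

    # Hypothetical word classes and class-bigram / class-membership probabilities.
    word_class = {"monday": "DAY", "tuesday": "DAY", "three": "NUM", "four": "NUM"}
    p_word_given_class = {("monday", "DAY"): 0.5, ("tuesday", "DAY"): 0.5,
                          ("three", "NUM"): 0.5, ("four", "NUM"): 0.5}
    p_class_bigram = {("DAY", "NUM"): 0.2, ("NUM", "DAY"): 0.3}

    def p_class_ngram(w, prev_w):
        # P(w_i | c_{i-1}) = P(w_i | c_i) * P(c_i | c_{i-1})
        c, c_prev = word_class[w], word_class[prev_w]
        return p_word_given_class[(w, c)] * p_class_bigram.get((c_prev, c), 0.0)

    print(p_class_ngram("three", "monday"))   # 0.5 * 0.2 = 0.1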

77 Combining PCFGs and N-grams. Only N-grams: "Meeting at three with Zhou Li" and "Meeting at four PM with Derek" require P(Zhou | three, with) and P(Derek | PM, with). N-grams + CFGs: "Meeting {at three: TIME} with {Zhou Li: NAME}" and "Meeting {at four PM: TIME} with {Derek: NAME}" only require P(NAME | TIME, with). 45 / 56

78 Adaptive Language Models. The conversational topic is not stationary, but it is stationary over some period of time, so build more specialised models that can adapt in time. Techniques: Cache Language Models, Topic-Adaptive Models, Maximum Entropy Models. 46 / 56

79 Cache Language Models 1. build a full static n-gram model 2. during conversation accumulate low order n-grams 3. interpolate between 1 and 2 47 / 56
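A minimal sketch of the idea, with a hypothetical static unigram model and a unigram cache; real cache models use a higher-order static model and low-order cache counts as in steps 1-3 above:

    from collections import Counter

    static_p = {"the": 0.05, "meeting": 0.001, "agenda": 0.0005}   # hypothetical static model
    cache = Counter()                                              # words seen so far in the dialogue

    def observe(word):
        cache[word] += 1

    def p_cached(word, lam=0.9):
        # Interpolate the static model (1) with the cache estimate (2).
        p_static = static_p.get(word, 1e-6)
        p_cache = cache[word] / sum(cache.values()) if cache else 0.0
        return lam * p_static + (1 - lam) * p_cache

    print(p_cached("agenda"))          # before the topic comes up
    for w in ["the", "meeting", "agenda", "agenda"]:
        observe(w)
    print(p_cached("agenda"))          # boosted after "agenda" has been used in the dialogue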

80 Topic-Adaptive Models 1. cluster documents into topics (manually or data-driven) 2. use information retrieval techniques with the current recognition output to select the right cluster 3. if off-line, run recognition in several passes 48 / 56

81 Maximum Entropy Models. Instead of a linear combination: 1. reformulate the information sources into constraints; 2. choose the maximum entropy distribution that satisfies the constraints. General form of the constraints: Σ_X P(X) f_i(X) = E_i. Example for a unigram: f_{w_i}(X) = 1 if w = w_i, 0 otherwise. 49 / 56

83 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language Models N-gram Smoothing Class N-grams Adaptive Language Models Language Model Evaluation 50 / 56

84 Language Model Evaluation. Evaluation in combination with a speech recogniser: it is hard to separate the contributions of the two. Evaluation based on the probabilities assigned to text in the training and test set. 51 / 56

85 Information, Entropy, Perplexity. Information: I(x_i) = log 1/P(x_i). Entropy: H(X) = E[I(X)] = - Σ_i P(x_i) log P(x_i). Perplexity: PP(X) = 2^{H(X)}. 52 / 56
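These three definitions translate directly into code; a minimal sketch for a discrete distribution (the uniform distribution over ten digits gives perplexity 10, matching the branching-factor interpretation a few slides below):

    import math

    def information(p):                 # I(x) = log2 1/P(x), in bits
        return -math.log2(p)

    def entropy(dist):                  # H(X) = -sum_i P(x_i) log2 P(x_i)
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def perplexity(dist):               # PP(X) = 2^H(X)
        return 2 ** entropy(dist)

    uniform_digits = {str(d): 0.1 for d in range(10)}
    print(information(0.1))             # ~3.32 bits per equally likely digit
    print(entropy(uniform_digits))      # ~3.32 bits
    print(perplexity(uniform_digits))   # ~10: the branching factor of digit strings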

86 Perplexity of a model. We do not know the true distribution p(w_1, ..., w_n), but we have a model m(w_1, ..., w_n). The cross-entropy is: H(p, m) = - Σ_{w_1,...,w_n} p(w_1, ..., w_n) log m(w_1, ..., w_n). Cross-entropy is an upper bound to the entropy: H ≤ H(p, m). The better the model, the lower the cross-entropy and the lower the perplexity (on the same data). 53 / 56

87 Test-set Perplexity. Estimate the distribution p(w_1, ..., w_n) on the training data, then evaluate it on the test data: H = - Σ_{w_1,...,w_n ∈ test set} p(w_1, ..., w_n) log p(w_1, ..., w_n), PP = 2^H. 54 / 56
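In practice the test-set perplexity is usually computed from the average per-word log probability that the model assigns to held-out text, a per-word normalised version of the formula above; a minimal sketch (p_model is a placeholder for any conditional word model, e.g. the smoothed bigram sketched earlier):

    import math

    def test_set_perplexity(test_sentences, p_model):
        # 2^H with H = average negative log2 probability per predicted token.
        log_prob, n_tokens = 0.0, 0
        for words in test_sentences:
            s = ["<s>"] + words + ["</s>"]
            for h, w in zip(s, s[1:]):
                log_prob += math.log2(p_model(w, h))
                n_tokens += 1
        return 2 ** (-log_prob / n_tokens)

    # Toy model: uniform over a 10-word vocabulary, so the perplexity must be 10.
    print(test_set_perplexity([["one", "two", "three"]], lambda w, h: 0.1))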

88 Perplexity and branching factor. Perplexity is roughly the geometric mean of the branching factor (the number of words that can follow a given word). Shannon: 2.39 for English letters and 130 for English words. Digit strings: 10. N-gram English on a Wall Street Journal test set: 180 (bigram), 91 (trigram). 55 / 56

89 Performance of N-gram Models: perplexity and word error rate are compared for Unigram Katz, Unigram Kneser-Ney, Bigram Katz, Bigram Kneser-Ney, Trigram Katz and Trigram Kneser-Ney models. Wall Street Journal database. Dictionary: words. Training set: words. 56 / 56
