Natural Language Processing SoSe 2015: Language Modelling (based on the slides of Dr. Saeedeh Momtazi)


Natural Language Processing SoSe 2015: Language Modelling. Dr. Mariana Neves, April 20th, 2015 (based on the slides of Dr. Saeedeh Momtazi)

Outline: Motivation, Estimation, Evaluation, Smoothing

Language Modelling: finding the probability of a sentence or a sequence of words, P(S) = P(w_1, w_2, w_3, ..., w_n). Example: "all of a sudden I notice three guys standing on the sidewalk" should receive a much higher probability than the scrambled sequence "on guys all I of notice sidewalk three a sudden standing the".

Language Modelling applications: word prediction, speech recognition, handwriting recognition, machine translation, spell checking.

Applications: word prediction. Example: "natural language ..." with candidate continuations such as "processing", "management", or "orange".

Applications: speech recognition. A language model helps to choose between acoustically similar hypotheses, e.g. "Computers can recognize speech." vs. "Computers can wreck a nice peach." (http://worldsgreatestsmile.com/html/phonological_ambiguity.html)

Applications: handwriting recognition (e.g. https://play.google.com/store/apps/details?id=com.metamoji.mazecen)

Applications: handwriting recognition. In Woody Allen's "Take the Money and Run", the hold-up note is misread as "I have a gub." instead of "I have a gun." (https://www.youtube.com/watch?v=-uhogkdbvqc)

Applications: machine translation. "The cat eats ..." can be translated as "Die Katze frisst ..." or "Die Katze isst ...", and a language model over the target language helps pick the natural variant. Chinese-to-English candidates: "He briefed to reporters on the chief contents of the statements", "He briefed reporters on the chief contents of the statements", "He briefed to reporters on the main contents of the statements", "He briefed reporters on the main contents of the statements".

Applications: spell checking. "I want to adver this project": adverb (noun) or advert (verb)? "They are leaving in about fifteen minuets to go to her house." (minuets → minutes). "The design an construction of the system will take more than a year." (an → and).

Outline: Motivation, Estimation, Evaluation, Smoothing

Language Modelling: finding the probability of a sentence or a sequence of words, P(S) = P(w_1, w_2, w_3, ..., w_n). Example: "Computers can recognize speech." → P(Computer, can, recognize, speech)

Conditional Probability:
P(B | A) = P(A, B) / P(A)
P(A, B) = P(A) · P(B | A)
P(A, B, C, D) = P(A) · P(B | A) · P(C | A, B) · P(D | A, B, C)

Conditional Probability: interactive visualization at http://setosa.io/conditional/

Conditional Probability (chain rule):
P(S) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) · ... · P(w_n | w_1, w_2, ..., w_{n-1})
P(S) = ∏_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1})
P(Computer, can, recognize, speech) = P(Computer) · P(can | Computer) · P(recognize | Computer, can) · P(speech | Computer, can, recognize)
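
As a quick illustration of the chain rule, the following sketch multiplies per-word conditional probabilities; the numeric values are invented purely for illustration and would in practice come from an estimated model.

```python
# Chain rule: P(S) is the product of each word's probability given its history.
# The probability values below are made up for illustration only.
cond_probs = [
    ("Computer", (), 0.01),
    ("can", ("Computer",), 0.2),
    ("recognize", ("Computer", "can"), 0.05),
    ("speech", ("Computer", "can", "recognize"), 0.3),
]

p_sentence = 1.0
for word, history, p in cond_probs:
    p_sentence *= p

print(p_sentence)  # 0.01 * 0.2 * 0.05 * 0.3 = 3e-05 (up to floating-point rounding)
```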

Corpus: probabilities are based on counting things, so we need a computer-readable collection of text or speech. The Brown Corpus: a million-word collection of samples, 500 written texts from different genres (newspaper, fiction, non-fiction, academic, ...), assembled at Brown University in 1963-1964. Corpora can also be used for evaluation and comparison purposes.

Corpus example: http://weaver.nlplab.org/~brat/demo/latest/#/not-editable/conll-00-chunking/train.txt-doc-1

Text corpora: the Corpus of Contemporary American English, the British National Corpus, the International Corpus of English, the Google N-gram Corpus (https://books.google.com/ngrams), the WBI repository for the biomedical domain (http://corpora.informatik.hu-berlin.de/).

Word occurrence: a language consists of a set of V words (the vocabulary), and a text is a sequence of words from the vocabulary; a word can occur several times in a text. Word token: each occurrence of a word in the text. Word type: each unique word in the text.

Word occurrence. Example: "This is a sample text from a book that is read every day."

Word tokens: 13; word types: 11.
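
A minimal way to reproduce these counts, assuming whitespace tokenization with the sentence-final period stripped:

```python
# Count word tokens and word types for the example sentence.
text = "This is a sample text from a book that is read every day."
tokens = [w.strip(".") for w in text.split()]

print("word tokens:", len(tokens))      # 13
print("word types:", len(set(tokens)))  # 11
```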

Counting. The Brown corpus: 1,015,945 word tokens, 47,218 word types. Google N-gram corpus: 1,024,908,267,229 word tokens, 13,588,391 word types. Why so many word types? Large English dictionaries have only around 500k word types.

Counting: corpora include numbers, misspellings, names, acronyms, etc. (see http://weaver.nlplab.org/~brat/demo/latest/#/not-editable/conll-00-chunking/train.txt-doc-1)

Word frequency (figure)

Zipf's Law: the frequency of any word is inversely proportional to its rank in the frequency table. Given a corpus of natural language utterances, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. The rank of a word times its (relative) frequency is approximately a constant: rank × freq ≈ c, with c ≈ 0.1 for English.
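
A rough way to check Zipf's law empirically on any large plain-text corpus ("corpus.txt" is a placeholder file name): rank the words by frequency and inspect rank times relative frequency, which should stay roughly constant.

```python
# Empirical check of Zipf's law: rank * relative frequency should be roughly constant.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
    tokens = f.read().lower().split()

counts = Counter(tokens)
total = sum(counts.values())

for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(f"{rank:2d}  {word:15s}  rank*rel_freq = {rank * freq / total:.3f}")
```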

Word frequency: Zipf's Law is not very accurate for very frequent and very infrequent words.

Maximum Likelihood Estimation:
P(speech | Computer can recognize) = #(Computer can recognize speech) / #(Computer can recognize)
Problems: too many phrases, limited text for estimating the probabilities → a simplification assumption is needed.

Markov assumption:
P(S) = ∏_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1})  ≈  ∏_{i=1}^{n} P(w_i | w_{i-1})

Markov assumption:
Chain rule: P(Computer, can, recognize, speech) = P(Computer) · P(can | Computer) · P(recognize | Computer, can) · P(speech | Computer, can, recognize)
Bigram approximation: P(Computer, can, recognize, speech) ≈ P(Computer) · P(can | Computer) · P(recognize | can) · P(speech | recognize)
P(speech | recognize) = #(recognize speech) / #(recognize)

N-gram models:
Unigram: P(S) = ∏_{i=1}^{n} P(w_i)
Bigram: P(S) = ∏_{i=1}^{n} P(w_i | w_{i-1})
Trigram: P(S) = ∏_{i=1}^{n} P(w_i | w_{i-1}, w_{i-2})
N-gram: P(S) = ∏_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1})
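
A small helper for extracting the n-grams a model of each order is estimated from (a sketch assuming pre-tokenized, padded sentences):

```python
# Extract n-grams of a given order from a padded token sequence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "<s> Computers can recognize speech </s>".split()
print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams, e.g. ('Computers', 'can')
print(ngrams(sentence, 3))  # trigrams, e.g. ('Computers', 'can', 'recognize')
```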


Google N-gram corpus: https://books.google.com/ngrams

WebTrigrams visualization: http://www.chrisharrison.net/index.php/visualizations/webtrigrams

Maximum Likelihood Estimation. Training corpus:
<s> I saw the boy </s>
<s> the man is working </s>
<s> I walked in the street </s>
Vocabulary: V = {I, saw, the, boy, man, is, working, walked, in, street}

Maximum Likelihood. Test sentence: <s> I saw the man </s>
P(S) = P(I | <s>) · P(saw | I) · P(the | saw) · P(man | the)
     = #(<s> I)/#(<s>) · #(I saw)/#(I) · #(saw the)/#(saw) · #(the man)/#(the)
     = 2/3 · 1/2 · 1/1 · 1/3
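
The worked example can be reproduced with a few lines of code (a sketch; the training sentences and test sentence are exactly those from the slides, and, as on the slide, the sentence-end marker is not scored):

```python
# Reproduce the MLE example: P(S) = 2/3 * 1/2 * 1/1 * 1/3 = 1/9.
from collections import Counter

training = [
    "<s> I saw the boy </s>",
    "<s> the man is working </s>",
    "<s> I walked in the street </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in training:
    words = sent.split()
    unigrams.update(words[:-1])             # history counts: every token except the final </s>
    bigrams.update(zip(words, words[1:]))   # bigram counts

test = "<s> I saw the man </s>".split()
prob = 1.0
for history, w in zip(test, test[1:]):
    if w == "</s>":
        continue  # the slide's example does not score the sentence-end marker
    prob *= bigrams[(history, w)] / unigrams[history]

print(prob)  # 0.111... = 1/9
```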

Unknown words. Example: <s> I saw the woman </s> ("woman" is not in the training vocabulary). Closed vocabulary: the test set can only contain words from this lexicon. Open vocabulary: the test set can contain unknown, out-of-vocabulary (OOV) words. Handling OOV words: choose a vocabulary, convert unknown (OOV) words to an <UNK> word token, and estimate probabilities for <UNK>. Alternatively, replace the first occurrence of every word type in the training data by <UNK>.
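
A sketch of the alternative strategy, replacing the first occurrence of every word type in the training data by <UNK> so that the model learns a probability for unknown words:

```python
# Replace the first occurrence of each word type in the training data with <UNK>.
def replace_first_occurrences(sentences, unk="<UNK>"):
    seen = set()
    result = []
    for words in sentences:
        new_words = []
        for w in words:
            new_words.append(w if w in seen else unk)
            seen.add(w)
        result.append(new_words)
    return result

corpus = [["I", "saw", "the", "boy"], ["the", "man", "is", "working"]]
print(replace_first_occurrences(corpus))
# [['<UNK>', '<UNK>', '<UNK>', '<UNK>'], ['the', '<UNK>', '<UNK>', '<UNK>']]
```

At test time, any word outside the chosen vocabulary is mapped to the same <UNK> token before scoring.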

Outline: Motivation, Estimation, Evaluation, Smoothing

Branching Factor: the branching factor is the number of possible words that can be used in each position of a text. The maximum branching factor for a language is V. A good language model should minimize this number, i.e. give a higher probability to the words that actually occur in real texts.

Branching Factor. Example: "John eats an ...", candidates: computer, book, apple, banana, umbrella, orange, desk. A good language model concentrates the probability mass on the plausible continuations (e.g. "apple" or "orange").

Evaluation: divide the corpus into two parts, training and test. Build a language model from the training set (word frequencies, etc.), then estimate the probability of the test set and calculate its average branching factor.

Perplexity. Goal: give higher probability to frequent texts, i.e. minimize the perplexity of frequent texts.
P(S) = P(w_1, w_2, ..., w_n)
Perplexity(S) = P(w_1, w_2, ..., w_n)^(-1/n) = (1 / P(w_1, w_2, ..., w_n))^(1/n) = (∏_{i=1}^{n} 1 / P(w_i | w_1, w_2, ..., w_{i-1}))^(1/n)

Perplexity: the maximum branching factor for a language is V.
Perplexity(S) = (∏_{i=1}^{n} 1 / P(w_i | w_1, w_2, ..., w_{i-1}))^(1/n)
Example: predicting the next characters instead of the next words, with V = 26 and probability 1/26 for each of five next characters: Perplexity(S) = ((1/26)^5)^(-1/5) = 26.
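
A small sketch of computing perplexity from per-word model probabilities, working in log space to avoid numerical underflow; the slide's character example (five characters, each with probability 1/26) serves as a check:

```python
# Perplexity(S) = (prod_i 1 / P(w_i | history))^(1/n), computed in log space.
import math

def perplexity(word_probs):
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

print(perplexity([1 / 26] * 5))  # 26.0 (up to floating-point rounding)
```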

Perplexity on the Wall Street Journal corpus (19,979-word vocabulary; training set of 38 million word tokens, test set of 1.5 million word tokens): unigram 962, bigram 170, trigram 109.

Outline: Motivation, Estimation, Evaluation, Smoothing

Maximum Likelihood (recap). Training corpus: <s> I saw the boy </s>, <s> the man is working </s>, <s> I walked in the street </s>. Test sentence: <s> I saw the man </s>
P(S) = P(I | <s>) · P(saw | I) · P(the | saw) · P(man | the)
     = #(<s> I)/#(<s>) · #(I saw)/#(I) · #(saw the)/#(saw) · #(the man)/#(the)
     = 2/3 · 1/2 · 1/1 · 1/3

Zero probability. Test sentence: <s> I saw the man in the street </s>
P(S) = P(I | <s>) · P(saw | I) · P(the | saw) · P(man | the) · P(in | man) · P(the | in) · P(street | the)
     = #(<s> I)/#(<s>) · #(I saw)/#(I) · #(saw the)/#(saw) · #(the man)/#(the) · #(man in)/#(man) · #(in the)/#(in) · #(the street)/#(the)
     = 2/3 · 1/2 · 1/1 · 1/3 · 0/1 · 1/1 · 1/3 = 0

Zero probability: the bigram "man in" does not occur in our corpus (<s> I saw the boy </s>, <s> the man is working </s>, <s> I walked in the street </s>), so the maximum-likelihood estimate assigns the whole sentence a probability of zero.

Smoothing: giving a small probability to all unseen n-grams. Laplace smoothing: add one to all counts (add-one).

Smoothing: giving a small probability to all unseen n-grams. Laplace smoothing: add one to all counts (add-one).
MLE: P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})
Add-one: P(w_i | w_{i-1}) = (#(w_{i-1}, w_i) + 1) / (#(w_{i-1}) + V)
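
A sketch of the add-one estimate, using plain count dictionaries; the toy numbers assume the 10-word vocabulary of the earlier example, where the bigram "man in" was never observed:

```python
# Add-one (Laplace) smoothing:
# P(w_i | w_{i-1}) = (#(w_{i-1}, w_i) + 1) / (#(w_{i-1}) + V)
def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, w_prev, w):
    return ((bigram_counts.get((w_prev, w), 0) + 1)
            / (unigram_counts.get(w_prev, 0) + vocab_size))

unigram_counts = {"man": 1}
bigram_counts = {}  # "man in" unseen
print(laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size=10, w_prev="man", w="in"))
# (0 + 1) / (1 + 10) ≈ 0.0909 instead of zero
```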

Smoothing: giving a small probability to all unseen n-grams. Interpolation and back-off smoothing use a background probability P_BG.
Back-off:
P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})   if #(w_{i-1}, w_i) > 0
P(w_i | w_{i-1}) = P_BG                           otherwise

Smoothing: giving a small probability to all unseen n-grams. Interpolation and back-off smoothing use a background probability P_BG.
Interpolation:
P(w_i | w_{i-1}) = λ_1 · #(w_{i-1}, w_i) / #(w_{i-1}) + λ_2 · P_BG
Parameter tuning: the weights λ_1, λ_2 of the foreground model and the background probability, with Σ λ_i = 1.
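
A sketch of linear interpolation of a maximum-likelihood bigram with a unigram background probability; the λ values are placeholders that would normally be tuned on held-out data, with λ_1 + λ_2 = 1 (see the tuning sketch further below):

```python
# Interpolation: P(w_i | w_{i-1}) = lambda1 * P_ML(w_i | w_{i-1}) + lambda2 * P_BG(w_i)
def interpolated_prob(bigram_counts, unigram_counts, total_tokens,
                      w_prev, w, lambda1=0.7, lambda2=0.3):
    prev_count = unigram_counts.get(w_prev, 0)
    p_bigram = bigram_counts.get((w_prev, w), 0) / prev_count if prev_count else 0.0
    p_background = unigram_counts.get(w, 0) / total_tokens  # unigram background probability
    return lambda1 * p_bigram + lambda2 * p_background
```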

Background probability: lower-order n-grams can be used as the background probability. Trigram → bigram; bigram → unigram; unigram → zerogram (1/V).
Back-off:
P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})   if #(w_{i-1}, w_i) > 0
P(w_i | w_{i-1}) = α(w_{i-1}) · P(w_i)            otherwise
P(w_i) = #(w_i) / N                               if #(w_i) > 0
P(w_i) = α · (1/V)                                otherwise

Background probability: lower-order n-grams as background probability (trigram → bigram → unigram → zerogram = 1/V).
Interpolation:
P(w_i | w_{i-1}) = λ_1 · #(w_{i-1}, w_i) / #(w_{i-1}) + λ_2 · P(w_i)
P(w_i) = λ_1 · #(w_i) / N + λ_2 · (1/V)
Combined: P(w_i | w_{i-1}) = λ_1 · #(w_{i-1}, w_i) / #(w_{i-1}) + λ_2 · #(w_i) / N + λ_3 · (1/V)

Parameter tuning: use a held-out dataset (development set), e.g. an 80% (training) / 10% (dev-set) / 10% (test) split, and choose the parameters that minimize the perplexity of the held-out dataset.
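
A sketch of tuning the interpolation weights by grid search on the development set, keeping the pair with the lowest perplexity; `interpolated_prob` refers to the interpolation sketch above, and the grid is purely illustrative:

```python
# Grid search over (lambda1, lambda2) with lambda1 + lambda2 = 1,
# choosing the pair that minimizes perplexity on the development set.
import math

def dev_perplexity(dev_bigrams, prob_fn):
    log_sum = sum(math.log(max(prob_fn(w_prev, w), 1e-12)) for w_prev, w in dev_bigrams)
    return math.exp(-log_sum / len(dev_bigrams))

def tune_lambdas(dev_bigrams, bigram_counts, unigram_counts, total_tokens):
    best = None
    for l1 in (i / 10 for i in range(1, 10)):
        l2 = 1.0 - l1
        prob_fn = lambda w_prev, w, l1=l1, l2=l2: interpolated_prob(
            bigram_counts, unigram_counts, total_tokens, w_prev, w, l1, l2)
        pp = dev_perplexity(dev_bigrams, prob_fn)
        if best is None or pp < best[0]:
            best = (pp, l1, l2)
    return best  # (dev perplexity, lambda1, lambda2)
```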

Advanced smoothing (add-k, add-δ smoothing):
Add-one: P(w_i | w_{i-1}) = (#(w_{i-1}, w_i) + 1) / (#(w_{i-1}) + V)
Add-k: P(w_i | w_{i-1}) = (#(w_{i-1}, w_i) + k) / (#(w_{i-1}) + kV)

Advanced smoothing: absolute discounting. High counts give good estimates, so a small discount won't affect them much; lower counts are not very trustworthy anyway.
Back-off (as before):
P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})   if #(w_{i-1}, w_i) > 0,   P_BG otherwise
With absolute discounting:
P(w_i | w_{i-1}) = (#(w_{i-1}, w_i) − δ) / #(w_{i-1})   if #(w_{i-1}, w_i) > 0,   α(w_{i-1}) · P_BG(w_i) otherwise

Advanced smoothing: estimation based on the lower-order n-gram. Example: "I cannot see without my reading ___". Backing off to the unigram suggests: Francisco, glasses, ... Observations: "Francisco" is more common than "glasses", but "Francisco" always follows "San"; "Francisco" is not a novel continuation for a text.

Advanced smoothing: solution. Instead of P(w) (how likely is w to appear in a text?), use P_continuation(w): how likely is w to appear as a novel continuation? Count the number of word types after which w appears:
P_continuation(w) ∝ |{w_{i-1} : #(w_{i-1}, w) > 0}|

Advanced smoothing: how often does w appear as a novel continuation?
P_continuation(w) ∝ |{w_{i-1} : #(w_{i-1}, w) > 0}|
Normalized by the total number of bigram types:
P_continuation(w) = |{w_{i-1} : #(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : #(w_{j-1}, w_j) > 0}|
Alternatively, normalized by the number of word types preceding all words:
P_continuation(w) = |{w_{i-1} : #(w_{i-1}, w) > 0}| / Σ_{w'} |{w'_{i-1} : #(w'_{i-1}, w') > 0}|

Advanced smoothing: Kneser-Ney discounting. Use absolute discounting, but with the continuation probability as the background distribution:
P(w_i | w_{i-1}) = max(#(w_{i-1}, w_i) − δ, 0) / #(w_{i-1}) + α · P_BG
P(w_i | w_{i-1}) = max(#(w_{i-1}, w_i) − δ, 0) / #(w_{i-1}) + α · P_continuation(w_i)
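
A sketch of bigram Kneser-Ney smoothing with a fixed discount δ, plugging in the continuation probability from the previous slide as the background distribution (toy code; a full implementation would also handle higher orders and unseen histories more carefully):

```python
# Bigram Kneser-Ney:
# P_KN(w | w_prev) = max(#(w_prev, w) - delta, 0) / #(w_prev)
#                    + alpha(w_prev) * P_continuation(w)
from collections import defaultdict

def kneser_ney_bigram(bigram_counts, delta=0.75):
    history_totals = defaultdict(int)   # #(w_prev), summed over continuations
    left_contexts = defaultdict(set)    # distinct words seen before w
    right_contexts = defaultdict(set)   # distinct words seen after w_prev
    for (w_prev, w), c in bigram_counts.items():
        history_totals[w_prev] += c
        left_contexts[w].add(w_prev)
        right_contexts[w_prev].add(w)
    total_bigram_types = len(bigram_counts)

    def p_continuation(w):
        return len(left_contexts[w]) / total_bigram_types

    def prob(w_prev, w):
        h = history_totals[w_prev]
        if h == 0:
            return p_continuation(w)  # unseen history: fall back to the continuation probability
        discounted = max(bigram_counts.get((w_prev, w), 0) - delta, 0) / h
        alpha = delta * len(right_contexts[w_prev]) / h  # mass freed by discounting
        return discounted + alpha * p_continuation(w)

    return prob
```

With counts of this kind, a word like "Francisco" receives a low continuation probability because it is preceded by few distinct word types, which is exactly the behaviour motivated above.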

Class-based n-grams. Compute an estimate for the bigram "to Shanghai" when the training data contains only "to London", "to Beijing", "to Denver". Classes: CITY_NAME, AIRLINE, DAY_OF_WEEK, MONTH, etc.
P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i)

Further reading: Chapter 4; Michael Collins' language modelling notes: http://www.cs.columbia.edu/~mcollins/lm-spring2013.pdf