
Natural Language Processing SoSe 2016: Words and Language Model
Dr. Mariana Neves, May 2nd, 2016

Outline
- Words
- Language Model

Tokenization
- Separation of words in a sentence
- Original: "Latest figures from the US government show the trade deficit with China reached an all-time high of $365.7bn (£250.1bn) last year. By February this year it had already reached $57bn."
- Tokenized: "Latest figures from the US government show the trade deficit with China reached an all time high of $ 365.7 bn ( £ 250.1 bn ) last year. By February this year it had already reached $ 57 bn."
(http://www.bbc.com/news/election-us-2016-36185012)

Tokenization
- Issues related to tokenization:
  - Separators: punctuation
  - Exceptions: m.p.h., Ph.D.
  - Expansions: we're = we are
  - Multi-word expressions: New York, doghouse

Segmentation
- Word segmentation: separation of the morphemes, but also tokenization for languages without a 'space' character
(http://www.basistech.com/better-multilingual-search/)

Sentence separation (splitting)
- Also usually based on punctuation (. ? !)
- Exceptions: Mr., 4.5

Approaches for Tokenization
- Based on rules or machine learning
- Binary classifiers that decide whether a certain punctuation mark is part of a word or not
- Based on regular expressions

Approaches for Segmentation
- Maximum matching approach (a sketch follows below):
  - Based on a dictionary
  - Greedily take the longest sequence of letters that forms a word
  - Palmer (2000) example: thetabledownthere, segmented step by step by repeatedly taking the longest dictionary match
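
A minimal sketch of the greedy maximum-matching idea described above; the toy dictionary, the function name and the character-by-character fallback are illustrative assumptions, not the algorithm from Palmer (2000) verbatim:

```python
# Greedy maximum matching (MaxMatch) segmentation sketch.
# The dictionary below is a toy example chosen to segment the slide's input.
DICTIONARY = {"the", "table", "down", "there", "them", "than"}

def max_match(text, dictionary=DICTIONARY):
    """Greedily take the longest dictionary word starting at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest possible substring first
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # no dictionary word found: emit a single character and move on
            tokens.append(text[i])
            i += 1
    return tokens

print(max_match("thetabledownthere"))  # ['the', 'table', 'down', 'there']
```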

Outline
- Words
- Language Model

Language model
- Finding the probability of a sentence or a sequence of words: P(S) = P(w1, w2, w3, ..., wn)
- Example: "all of a sudden I notice three guys standing on the sidewalk ..." vs. the scrambled "on guys all I of notice sidewalk three a sudden standing the ..."

Motivation: Speech recognition
- "Computers can recognize speech." vs. "Computers can wreck a nice peach."
- "Give peace a chance." vs. "Give peas a chance."
- Ambiguity in speech: "Friday" vs. "fry day", "ice cream" vs. "I scream"
(http://worldsgreatestsmile.com/html/phonological_ambiguity.html)

Motivation: Handwriting recognition
(https://play.google.com/store/apps/details?id=com.metamoji.mazecen)

Motivation: Handwriting recognition
- "Take the Money and Run" (Woody Allen):
  - "Abt naturally." vs. "Act naturally."
  - "I have a gub." vs. "I have a gun."
(https://www.youtube.com/watch?v=-uhogkdbvqc)

Motivation: Machine Translation
- "The cat eats ...": "Die Katze frisst ..." vs. "Die Katze isst ..."
- Chinese to English:
  - "He briefed to reporters on the chief contents of the statements"
  - "He briefed reporters on the chief contents of the statements"
  - "He briefed to reporters on the main contents of the statements"
  - "He briefed reporters on the main contents of the statements"

Motivation: Spell Checking
- "I want to adver this project": advert (verb), not adverb (noun)
- "They are leaving in about fifteen minuets to go to her house.": minutes
- "The design an construction of the system will take more than a year.": and

Language model
- Finding the probability of a sentence or a sequence of words: P(S) = P(w1, w2, w3, ..., wn)
- "Computers can recognize speech." → P(Computer, can, recognize, speech)

Conditional Probability
- P(B|A) = P(A,B) / P(A)
- P(A,B) = P(A) P(B|A)
- P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

Conditional Probability
- P(S) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1, w2, ..., wn-1)
- P(S) = Π_{i=1..n} P(wi | w1, w2, ..., wi-1)
- P(Computer, can, recognize, speech) = P(Computer) P(can|Computer) P(recognize|Computer can) P(speech|Computer can recognize)

Corpus
- Probabilities are based on counting things
- A corpus is a computer-readable collection of text or speech:
  - Corpus of Contemporary American English
  - The British National Corpus
  - The International Corpus of English
  - The Google N-gram Corpus (https://books.google.com/ngrams)
- But also many small corpora for particular domains/tasks

Word occurrence
- A language consists of a set of V words (vocabulary)
- A word can occur several times in a text:
  - Word token: each occurrence of a word in the text
  - Word type: each unique word in the text
- "This is a sample text from a book that is read every day."
  - # word tokens: 13
  - # word types: 11
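
A small illustration of the token/type distinction on the example sentence above; the regular-expression tokenizer is a simplifying assumption:

```python
import re

text = "This is a sample text from a book that is read every day."

# Tokens: every occurrence of a word; types: distinct words (case-folded).
tokens = re.findall(r"[a-z]+", text.lower())
types = set(tokens)

print(len(tokens))  # 13 word tokens
print(len(types))   # 11 word types ('is' and 'a' each occur twice)
```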

Word occurrence
- Google N-gram corpus:
  - 1,024,908,267,229 word tokens
  - 13,588,391 word types
- Why so many word types? Large English dictionaries have around 500k word types

Word frequency (figure)

Zipf's Law
- The frequency of any word is inversely proportional to its rank in the frequency table
- Given a corpus of natural language utterances, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, ...
- The rank of a word times its (relative) frequency is approximately a constant: rank × freq ≈ c, with c ≈ 0.1 for English

Zipf's Law (figure)

Zipf's Law
- Zipf's Law is not very accurate for very frequent and very infrequent words

Zipf's Law
- But very precise for intermediate ranks
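
A rough way to check the rank × relative-frequency claim above on a corpus; 'corpus.txt' is a placeholder file name and the letters-only regex tokenizer is a simplification:

```python
from collections import Counter
import re

# Rough check of Zipf's law: rank * relative frequency should be roughly
# constant for mid-rank words. 'corpus.txt' stands for any large plain-text file.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

counts = Counter(tokens)
total = sum(counts.values())

for rank, (word, count) in enumerate(counts.most_common(1000), start=1):
    if rank in (1, 10, 100, 1000):
        rel_freq = count / total
        print(f"rank {rank:5d}  {word:15s}  rank*freq = {rank * rel_freq:.3f}")
```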

Back to Conditional Probability
- P(S) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1, w2, ..., wn-1)
- P(S) = Π_{i=1..n} P(wi | w1, w2, ..., wi-1)
- P(Computer, can, recognize, speech) = P(Computer) P(can|Computer) P(recognize|Computer can) P(speech|Computer can recognize)

Maximum Likelihood Estimation
- P(speech | Computer can recognize) = #(Computer can recognize speech) / #(Computer can recognize)
- Problems: too many possible phrases, and only limited text for estimating their probabilities
- Therefore a simplification assumption is needed

Markov assumption
- P(S) = Π_{i=1..n} P(wi | w1, w2, ..., wi-1)
- ≈ Π_{i=1..n} P(wi | wi-1)

Markov assumption
- P(Computer, can, recognize, speech) = P(Computer) P(can|Computer) P(recognize|Computer can) P(speech|Computer can recognize)
- With the Markov assumption: P(Computer, can, recognize, speech) ≈ P(Computer) P(can|Computer) P(recognize|can) P(speech|recognize)
- P(speech | recognize) = #(recognize speech) / #(recognize)

N-gram model
- Unigram: P(S) = Π_{i=1..n} P(wi)
- Bigram: P(S) = Π_{i=1..n} P(wi | wi-1)
- Trigram: P(S) = Π_{i=1..n} P(wi | wi-1, wi-2)
- N-gram: P(S) = Π_{i=1..n} P(wi | wi-n+1, ..., wi-1)
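
A small sketch of how the n-grams used by these models are read off a token sequence; the function name is illustrative:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["<s>", "I", "saw", "the", "boy", "</s>"]
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams, e.g. ('<s>', 'I'), ('I', 'saw'), ...
print(ngrams(tokens, 3))  # trigrams
```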

N-gram model (figure)

Maximum Likelihood Estimation
- Corpus:
  <s> I saw the boy </s>
  <s> the man is working </s>
  <s> I walked in the street </s>
- Vocabulary: V = {I, saw, the, boy, man, is, working, walked, in, street}

Maximum Likelihood Estimation
- Estimating the likelihood of a new sentence: <s> I saw the man </s>
- P(S) = P(I|<s>) P(saw|I) P(the|saw) P(man|the)
       = #(<s> I)/#(<s>) × #(I saw)/#(I) × #(saw the)/#(saw) × #(the man)/#(the)
       = 2/3 × 1/2 × 1/1 × 1/3
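
A sketch of the bigram maximum likelihood estimation above on the toy corpus; it reproduces only the four factors shown on the slide (the end-of-sentence factor is not included):

```python
from collections import Counter

corpus = [
    "<s> I saw the boy </s>",
    "<s> the man is working </s>",
    "<s> I walked in the street </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """Bigram maximum likelihood estimate P(word | prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# P(<s> I saw the man), restricted to the factors shown on the slide
sentence = "<s> I saw the man".split()
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= p_mle(word, prev)
print(p)  # 2/3 * 1/2 * 1/1 * 1/3 ≈ 0.111
```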

Unknown words
- Example: <s> I saw the woman </s> ('woman' is not in the vocabulary)
- Possible solutions:
  - Closed vocabulary: the test set can only contain words from this lexicon
  - Open vocabulary: the test set can contain unknown words
- Out-of-vocabulary (OOV) words:
  - Choose a vocabulary
  - Convert unknown (OOV) words to the <UNK> word token
  - Estimate probabilities for <UNK>
  - Or: replace the first occurrence of every word type in the training data by <UNK> (see the sketch below)
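
A sketch of the last recipe above (replacing the first occurrence of every word type by <UNK>); the helper name and the example sentence are illustrative:

```python
def mark_first_occurrences(tokens):
    """Replace the first occurrence of each word type with <UNK>,
    so the model learns probabilities for unknown words."""
    seen = set()
    result = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            result.append("<UNK>")
        else:
            result.append(token)
    return result

print(mark_first_occurrences("the cat saw the dog and the cat".split()))
# ['<UNK>', '<UNK>', '<UNK>', 'the', '<UNK>', '<UNK>', 'the', 'cat']
```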

Evaluation
- Divide the corpus into two parts: training and test
- Build a language model from the training set (word frequencies, etc.)
- Estimate the probability of the test set
- Calculate the average branching factor of the test set

Branching factor
- The number of possible words that can be used in each position of a text
- The maximum branching factor for each language is V
- A good language model should be able to minimize this number, i.e., give a higher probability to the words that actually occur in real texts

Perplexity
- Goals: give higher probability to frequent texts, i.e., minimize the perplexity of the frequent texts
- Perplexity(S) = P(w1, w2, ..., wn)^(-1/n)
               = (1 / P(w1, w2, ..., wn))^(1/n)
               = (Π_{i=1..n} 1 / P(wi | w1, w2, ..., wi-1))^(1/n)
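
A sketch of the perplexity formula above, computed in log space to avoid numerical underflow; the probabilities fed to it are placeholders:

```python
import math

def perplexity(word_probabilities):
    """Perplexity of a text given the model probability of each word in
    context: (prod 1/p_i)^(1/n), computed in log space for stability."""
    n = len(word_probabilities)
    log_sum = sum(math.log2(p) for p in word_probabilities)
    return 2 ** (-log_sum / n)

# Illustrative numbers only: a model that assigns every word probability
# 1/170 has perplexity 170 (compare the bigram figure on the next slide).
print(perplexity([1 / 170] * 1000))  # ≈ 170
```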

Perplexity
- Wall Street Journal (19,979-word vocabulary)
- Training set: 38 million words; test set: 1.5 million words
- Perplexity: unigram 962, bigram 170, trigram 109

Unknown n-grams
- Corpus:
  <s> I saw the boy </s>
  <s> the man is working </s>
  <s> I walked in the street </s>
- New sentence: <s> I saw the man in the street </s>
- P(S) = P(I|<s>) P(saw|I) P(the|saw) P(man|the) P(in|man) P(the|in) P(street|the)
       = #(<s> I)/#(<s>) × #(I saw)/#(I) × #(saw the)/#(saw) × #(the man)/#(the) × #(man in)/#(man) × #(in the)/#(in) × #(the street)/#(the)
       = 2/3 × 1/2 × 1/1 × 1/3 × 0/1 × 1/1 × 1/3 = 0
- A single unseen bigram ('man in') makes the whole sentence probability zero

Smoothing: Laplace (Add-one)
- Give a small probability to all unseen n-grams: add one to all counts
- MLE: P(wi | wi-1) = #(wi-1, wi) / #(wi-1)
- Add-one: P(wi | wi-1) = (#(wi-1, wi) + 1) / (#(wi-1) + V)
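
A sketch of the add-one formula above; the example counts come from the earlier toy corpus, and V = 12 assumes <s> and </s> are counted as vocabulary entries:

```python
from collections import Counter

def laplace_bigram(word, prev, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed bigram probability:
    P(word | prev) = (#(prev, word) + 1) / (#(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# Tiny illustration with counts from the toy corpus: the previously unseen
# bigram 'man in' no longer gets probability zero.
unigram_counts = Counter({"man": 1, "the": 3})
bigram_counts = Counter({("the", "man"): 1})
print(laplace_bigram("in", "man", bigram_counts, unigram_counts, vocab_size=12))
# (0 + 1) / (1 + 12) ≈ 0.077
```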

Smoothing: Back-off
- Use a background probability P_BG:
  P(wi | wi-1) = #(wi-1, wi) / #(wi-1)    if #(wi-1, wi) > 0
               = P_BG                      otherwise

Smoothing: Interpolation
- Use a background probability P_BG:
  P(wi | wi-1) = λ1 × #(wi-1, wi)/#(wi-1) + λ2 × P_BG,  with Σλ = 1

Background probability
- Lower-order n-grams can be used as the background probability:
  - Trigram → Bigram
  - Bigram → Unigram
  - Unigram → Zerogram (1/V)

Background probability: Back-off
- P(wi | wi-1) = #(wi-1, wi) / #(wi-1)    if #(wi-1, wi) > 0
               = α(wi-1) × P(wi)           otherwise
- P(wi) = #(wi) / N        if #(wi) > 0
        = α × (1/V)         otherwise

Background probability: Interpolation
- P(wi | wi-1) = λ1 × #(wi-1, wi)/#(wi-1) + λ2 × P(wi)
- P(wi) = λ1 × #(wi)/N + λ2 × 1/V
- Combined: P(wi | wi-1) = λ1 × #(wi-1, wi)/#(wi-1) + λ2 × #(wi)/N + λ3 × 1/V
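
A sketch of the combined interpolation formula above; the lambda values are arbitrary placeholders and would be tuned on a held-out set as described on the next slide:

```python
def interpolated_bigram(word, prev, bigram_counts, unigram_counts,
                        total_tokens, vocab_size,
                        l1=0.6, l2=0.3, l3=0.1):
    """Interpolate the bigram MLE, the unigram MLE and the zerogram 1/V.
    The lambda weights must sum to 1; the defaults are placeholders only."""
    prev_count = unigram_counts.get(prev, 0)
    bigram = bigram_counts.get((prev, word), 0) / prev_count if prev_count else 0.0
    unigram = unigram_counts.get(word, 0) / total_tokens
    zerogram = 1.0 / vocab_size
    return l1 * bigram + l2 * unigram + l3 * zerogram
```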

Parameter Tuning
- Held-out dataset (development set): 80% training, 10% dev-set, 10% test
- Choose the parameters that minimize the perplexity of the held-out dataset

Advanced Smoothing: Add-k
- Add-one: P(wi | wi-1) = (#(wi-1, wi) + 1) / (#(wi-1) + V)
- Add-k: P(wi | wi-1) = (#(wi-1, wi) + k) / (#(wi-1) + kV)
- (also called add-k or add-δ smoothing)

Advanced Smoothing: Absolute discounting
- Good estimates for high counts: a small discount won't affect them much
- Lower counts are not trustworthy anyway
- P(wi | wi-1) = (#(wi-1, wi) − δ) / #(wi-1)    if #(wi-1, wi) > 0
               = α(wi-1) × P_BG(wi)              otherwise
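
A sketch of the absolute-discounting rule above; the normalization of the back-off weight alpha is only schematic here, and p_background is assumed to be supplied by a lower-order model:

```python
def discounted_bigram(word, prev, bigram_counts, unigram_counts,
                      p_background, delta=0.5):
    """Absolute discounting sketch: subtract a fixed delta from every seen
    bigram count and hand the freed probability mass to a background
    distribution for unseen bigrams. (A complete implementation must
    normalize alpha so each context sums to exactly 1; this only shows
    the shape of the computation.)"""
    history_count = unigram_counts.get(prev, 0)
    count = bigram_counts.get((prev, word), 0)
    if count > 0:
        return (count - delta) / history_count
    # mass reserved by discounting all bigrams that start with `prev`
    seen_types = sum(1 for (p, _) in bigram_counts if p == prev)
    alpha = delta * seen_types / history_count if history_count else 1.0
    return alpha * p_background(word)
```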

Advanced Smoothing: novel continuation
- Estimation based on the lower-order n-gram
- "I cannot see without my reading ___"
  - the unigram model suggests: Francisco, glasses, ...
- Observations:
  - "Francisco" is more common than "glasses"
  - but "Francisco" always follows "San"
  - "Francisco" is not a novel continuation for a text

Advanced Smoothing: novel continuation
- Solution: instead of P(w) (how likely is w to appear in a text?), use P_continuation(w): how likely is w to appear as a novel continuation?
- Count the number of word types after which w appears:
  P_continuation(w) ∝ |{wi-1 : #(wi-1, w) > 0}|
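
A sketch of the continuation-count idea above (the quantity behind Kneser-Ney smoothing); normalizing by the total number of bigram types is one common convention and an assumption here:

```python
from collections import defaultdict

def continuation_probabilities(bigram_counts):
    """P_continuation(w): the number of distinct left contexts w appears in,
    normalized by the total number of distinct bigram types."""
    left_contexts = defaultdict(set)
    for (prev, word) in bigram_counts:
        left_contexts[word].add(prev)
    total_bigram_types = len(bigram_counts)
    return {w: len(ctxs) / total_bigram_types
            for w, ctxs in left_contexts.items()}

# 'Francisco' may be frequent but follows almost only 'San', so its
# continuation probability stays small compared to 'glasses'.
```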

Class-based n-grams
- Estimate probabilities for classes, based on named-entity recognition: CITY_NAME, AIRLINE, DAY_OF_WEEK, MONTH, etc.
- Training data: "to London", "to Beijing", "to Denver", etc.
- P(wi | wi-1) ≈ P(ci | ci-1) × P(wi | ci)
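
A sketch of the class-based decomposition above; the class lexicon and the two component models are assumed to be given (e.g. from named-entity annotation), and the function signature is illustrative:

```python
def class_based_bigram(word, prev_word, word_to_class,
                       p_class_bigram, p_word_given_class):
    """Class-based bigram sketch: P(wi | wi-1) ≈ P(ci | ci-1) * P(wi | ci).
    word_to_class maps words to classes such as CITY_NAME; words without a
    class are treated as their own class (an assumption of this sketch)."""
    c_prev = word_to_class.get(prev_word, prev_word)
    c_word = word_to_class.get(word, word)
    return p_class_bigram(c_word, c_prev) * p_word_given_class(word, c_word)
```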

Summary
- Words: tokenization, segmentation
- Language Model:
  - Word occurrence (word type and word token)
  - Zipf's Law
  - Maximum Likelihood Estimation
  - Markov assumption: n-grams
  - Evaluation: perplexity
  - Smoothing methods

Further reading
- Book: Jurafsky & Martin, Speech and Language Processing, Chapter 3 (section 3.9) and Chapter 4