N-grams

Now:
- Simple (Unsmoothed) N-grams
- Smoothing
  - Add-one Smoothing
  - Backoff
  - Deleted Interpolation

Reading: Jurafsky & Martin Ch. 6, up to and including 6.6

Word-prediction Applications

- Augmentative Communication Systems
  - Helping the disabled communicate
  - Spelling is too slow
  - Menus are limited
- Context-sensitive spelling error correction
- Speech recognition language modelling

For example (each sentence contains a real-word error):
- They are leaving in about fifteen minuets to go to her house.
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of [this problem.]
- He is trying to fine out.

Simple N-grams

Suppose the vocabulary has size V, every word has an equal probability of occurring, and w_i has an equal probability of following any other word w_j.

- What is P(w_i)?
  Answer: 1/V
- What is P(w_j | w_i)?
  Answer: 1/V
- What is P(w_i w_j), i.e. w_i followed by w_j?
  Answer: P(w_i) P(w_j | w_i) = 1/V^2

But this is far too simple:
- Use a training corpus for counting
- And go further than P(w_i), e.g. P(the) ≈ 0.07

P(word sequence)

P(w_1 w_2 ... w_{n-1} w_n) = P(w_1^n)

P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})
         = \prod_{k=1}^{n} P(w_k | w_1^{k-1})

- Unigram
  - Assume P(w_n | w_1^{n-1}) = P(w_n)
  - So P(w_1^n) = \prod_{k=1}^{n} P(w_k)
- Bigram
  - Assume P(w_n | w_1^{n-1}) = P(w_n | w_{n-1})
  - So P(w_1^n) = \prod_{k=1}^{n} P(w_k | w_{k-1})
- Trigram
  - Assume P(w_n | w_1^{n-1}) = P(w_n | w_{n-2}^{n-1})
  - So P(w_1^n) = \prod_{k=1}^{n} P(w_k | w_{k-2}^{k-1})
- N-gram
  - Assume P(w_n | w_1^{n-1}) = P(w_n | w_{n-N+1}^{n-1})
  - So P(w_1^n) = \prod_{k=1}^{n} P(w_k | w_{k-N+1}^{k-1})
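As a minimal sketch (not from the slides), the Markov approximation above can be computed by padding the history with start symbols and multiplying conditional probabilities. The table ngram_p and its numbers are hypothetical, purely for illustration.

from math import prod

def ngram_sentence_prob(words, ngram_p, N=2, start="<s>"):
    """P(w_1^n) ~ prod_k P(w_k | w_{k-N+1}^{k-1}), padding the history with <s>."""
    padded = [start] * (N - 1) + list(words)
    return prod(
        ngram_p.get((tuple(padded[i:i + N - 1]), w), 0.0)
        for i, w in enumerate(words)
    )

# A bigram (N=2) table keyed by (history tuple, word); invented numbers.
bigram_p = {(("<s>",), "I"): 0.25, (("I",), "want"): 0.32, (("want",), "to"): 0.65}
print(ngram_sentence_prob(["I", "want", "to"], bigram_p, N=2))  # ~0.052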

Bigram Example

Berkeley Restaurant Project
- Speech-based restaurant consultant
- Limited domain

Examples
- I'm looking for Chinese food.
- Is Cafe Venezia open for lunch?

Bigram probabilities for the word eat:

  eat on      .16    eat Thai      .03
  eat some    .06    eat breakfast .03
  eat lunch   .06    eat in        .02
  eat dinner  .05    eat Chinese   .02
  eat at      .04    eat Mexican   .02
  eat a       .04    eat tomorrow  .01
  eat Indian  .04    eat dessert   .007
  eat today   .03    eat British   .001

P(I want to eat British food)

  <s> I     .25    I want    .32    want to    .65    to eat    .26    British food       .60
  <s> I'd   .06    I would   .29    want a     .05    to have   .14    British restaurant .15
  <s> Tell  .04    I don't   .08    want some  .04    to spend  .09    British cuisine    .01
  <s> I'm   .02    I have    .04    want thai  .01    to be     .02    British lunch      .01

<s> means "start of sentence"

P(British | eat) = 0.001

P(I want to eat British food)
  = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British)
  = 0.25 × 0.32 × 0.65 × 0.26 × 0.001 × 0.60
  = 0.0000081
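The same arithmetic can be checked mechanically. This small sketch just multiplies the bigram probabilities quoted above; the numbers come from the slide, the code itself does not.

from math import prod

bigram_p = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
}
sentence = ["<s>", "I", "want", "to", "eat", "British", "food"]
# Multiply P(w_k | w_{k-1}) over consecutive word pairs.
p = prod(bigram_p[pair] for pair in zip(sentence, sentence[1:]))
print(p)  # ~0.0000081, as on the slide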

Training

- Count
- Normalize, so that probabilities lie between 0 and 1

Bigrams:

  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / \sum_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})
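A minimal count-and-normalise sketch of the formula above. The toy corpus and the <s>/</s> markers are illustrative assumptions, not the Berkeley data.

from collections import Counter

def train_bigrams(sentences):
    """Maximum-likelihood bigrams: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigram_c, bigram_c = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]  # end marker keeps each history's row summing to 1
        unigram_c.update(tokens[:-1])       # count histories
        bigram_c.update(zip(tokens, tokens[1:]))
    return {(h, w): c / unigram_c[h] for (h, w), c in bigram_c.items()}

probs = train_bigrams([["I", "want", "to", "eat"], ["I", "want", "Chinese", "food"]])
print(probs[("I", "want")])  # 1.0 in this toy corpus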

Training Bigrams
Berkeley RP, V = 1616

Bigram counts for seven words:

           I    want   to    eat   Chinese  food  lunch
  I        8    1087   0     13    0        0     0
  want     3    0      786   0     6        8     6
  to       3    0      10    860   3        0     12
  eat      0    0      2     0     19       2     52
  Chinese  2    0      0     0     0        120   1
  food     19   0      17    0     0        0     0
  lunch    4    0      0     0     0        1     0

Unigram counts:
  I 3437   want 1215   to 3256   eat 938   Chinese 213   food 1506   lunch 459

Bigram probabilities:

           I       want    to      eat     Chinese  food    lunch
  I        .0023   .32     0       .0038   0        0       0
  want     .0025   0       .65     0       .0049    .0066   .0049
  to       .00092  0       .0031   .26     .00092   0       .0037
  eat      0       0       .0021   0       .020     .0021   .055
  Chinese  .0094   0       0       0       0        .56     .0047
  food     .013    0       .011    0       0        0       0
  lunch    .0087   0       0       0       0        .0022   0

Training & Testing

- Choose a corpus for training
  - Is it too specific to the task?
  - Is it too general?
- Divide it into training and testing sets
- Don't choose the test set from the training set
- Use the test set to evaluate architectures
- Cross-validation is often used (see the sketch after this list):
  - Choose a portion of the corpus, say 9/10, for training
  - Leave the remainder (1/10) as testing data
  - Evaluate
  - Now choose a different split, so the old testing set becomes part of the new training set
  - Re-evaluate and repeat until ...
  - Take averages as the indicator of performance
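As a rough sketch of this rotation, assuming the corpus is a list of sentences and some scoring function (a hypothetical evaluate) is defined elsewhere:

def cross_validation_folds(sentences, k=10):
    """Each fold takes a turn as the test set; the rest is the training set."""
    folds = [sentences[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Hypothetical usage: average the per-fold scores as the performance indicator.
# scores = [evaluate(train_bigrams(train), test)
#           for train, test in cross_validation_folds(corpus)]
# performance = sum(scores) / len(scores)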

Add-one Smoothing

We don't want zero-probability N-grams.

For unigram probabilities:

  P(w_n) = C(w_n) / \sum_w C(w) = C(w_n) / N

Add-one smoothing:

  P(w_n) = (C(w_n) + 1) / \sum_w (C(w) + 1) = (C(w_n) + 1) / (N + V)

For bigram probabilities:

  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

Add-one smoothing:

  P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
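A sketch of the add-one bigram formula, checked against the Berkeley Restaurant numbers above (the counts and V come from the slides; the function itself does not).

def add_one_bigram_prob(bigram_c, unigram_c, V, prev, word):
    """Add-one smoothing: P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return (bigram_c.get((prev, word), 0) + 1) / (unigram_c.get(prev, 0) + V)

# Using C(want to) = 786, C(want) = 1215, V = 1616 from the tables above:
p = add_one_bigram_prob({("want", "to"): 786}, {"want": 1215}, 1616, "want", "to")
print(round(p, 2))  # 0.28, matching the smoothed probability table below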

Smoothed Bigrams
Berkeley RP, V = 1616

Smoothed bigram counts (each count + 1) for seven words:

           I    want   to    eat   Chinese  food  lunch
  I        9    1088   1     14    1        1     1
  want     4    1      787   1     7        9     7
  to       4    1      11    861   4        1     13
  eat      1    1      3     1     20       3     53
  Chinese  3    1      1     1     1        121   2
  food     20   1      18    1     1        1     1
  lunch    5    1      1     1     1        2     1

Unigram counts (each count + V):
  I 5053   want 2831   to 4872   eat 2554   Chinese 1829   food 3122   lunch 2075

Smoothed bigram probabilities:

           I       want    to      eat     Chinese  food    lunch
  I        .0018   .22     .00020  .0028   .00020   .00020  .00020
  want     .0014   .00035  .28     .00035  .0025    .0032   .0025
  to       .00082  .00021  .0023   .18     .00082   .00021  .0027
  eat      .00039  .00039  .0012   .00039  .0078    .0012   .021
  Chinese  .0016   .00055  .00055  .00055  .00055   .066    .0011
  food     .0064   .00032  .0058   .00032  .00032   .00032  .00032
  lunch    .0024   .00048  .00048  .00048  .00048   .00096  .00048

Unsmoothed bigram probabilities:

           I       want    to      eat     Chinese  food    lunch
  I        .0023   .32     0       .0038   0        0       0
  want     .0025   0       .65     0       .0049    .0066   .0049
  to       .00092  0       .0031   .26     .00092   0       .0037
  eat      0       0       .0021   0       .020     .0021   .055
  Chinese  .0094   0       0       0       0        .56     .0047
  food     .013    0       .011    0       0        0       0
  lunch    .0087   0       0       0       0        .0022   0

Smoothed Counts

- Counts become adjusted by smoothing
- Some "weight" is given to zero counts
- This comes from reducing the weight given to non-zero counts

The adjusted count comes from adding one to the count and multiplying by the normalisation factor N / (N + V):

  c*_i = (c_i + 1) N / (N + V)

We define the discount as:

  d_i = c*_i / c_i
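A sketch of the adjusted-count and discount formulas just given, checked against the "want to" bigram (c = 786, N = C(want) = 1215, V = 1616 from the slides).

def adjusted_count(c, N, V):
    """Reconstituted add-one count: c* = (c + 1) * N / (N + V)."""
    return (c + 1) * N / (N + V)

def discount(c, N, V):
    """Discount d = c* / c."""
    return adjusted_count(c, N, V) / c

print(round(adjusted_count(786, 1215, 1616), 1))  # ~337.8 (the slide's table, which
                                                  #  rounds the factor to .42, shows 331)
print(round(discount(786, 1215, 1616), 2))        # ~0.43 (listed as 0.42 on the slide)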

Smoothed (adjusted) bigram counts:

           I     want   to    eat    Chinese  food  lunch
  I        6     740    .68   10     .68      .68   .68
  want     2     .42    331   .42    3        4     3
  to       3     .69    8     594    3        .69   9
  eat      .37   .37    1     .37    7.4      1     20
  Chinese  .36   .12    .12   .12    .12      15    .24
  food     10    .48    9     .48    .48      .48   .48
  lunch    1.1   .22    .22   .22    .22      .44   .22

Unsmoothed bigram counts:

           I    want   to    eat   Chinese  food  lunch
  I        8    1087   0     13    0        0     0
  want     3    0      786   0     6        8     6
  to       3    0      10    860   3        0     12
  eat      0    0      2     0     19       2     52
  Chinese  2    0      0     0     0        120   1
  food     19   0      17    0     0        0     0
  lunch    4    0      0     0     0        1     0

Discounts:
  I 0.68   want 0.42   to 0.69   eat 0.37   Chinese 0.12   food 0.48   lunch 0.22

Problem

- Too much or too little weight can be given to zero counts
- In general, add-one smoothing is a poor smoothing method

Witten-Bell Discounting is a relatively simple but more "sensible" approach to smoothing: it assigns more appropriate weight to zero counts. It is beyond the scope of this module, but details are in Jurafsky & Martin.

Backoff

- If we have no examples of a particular trigram w_{n-2} w_{n-1} w_n with which to calculate P(w_n | w_{n-2} w_{n-1})
- Then "back off" and simply use the bigram probability P(w_n | w_{n-1})
- What if there are no examples of the bigram w_{n-1} w_n? Just use P(w_n)!
- But remember that, for any context w_i w_j, \sum_{w_n} P(w_n | w_i w_j) = 1, so we must use some discounting to adjust the probabilities of lower-order models when we back off to them.

  P(w_n | w_{n-2} w_{n-1}) =
    \tilde{P}(w_n | w_{n-2} w_{n-1})                  if C(w_{n-2} w_{n-1} w_n) > 0
    \alpha(w_{n-2}^{n-1}) \tilde{P}(w_n | w_{n-1})    else if C(w_{n-1} w_n) > 0
    \alpha(w_{n-1}) \tilde{P}(w_n)                    otherwise
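A simplified sketch of this backoff scheme. For brevity the discounted probabilities (p3, p2, p1) and the alpha weights are assumed to be precomputed elsewhere; a real implementation would derive them so that everything still sums to one.

def backoff_trigram_prob(w1, w2, w3, p3, p2, p1, alpha2, alpha1):
    """Trigram -> bigram -> unigram backoff with precomputed alpha weights."""
    if (w1, w2, w3) in p3:                            # trigram seen in training
        return p3[(w1, w2, w3)]
    if (w2, w3) in p2:                                # back off to the bigram
        return alpha2.get((w1, w2), 1.0) * p2[(w2, w3)]
    return alpha1.get(w2, 1.0) * p1.get(w3, 0.0)      # back off to the unigram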

Deleted Interpolation

Rather than backing off, use a linear combination of trigram, bigram, and unigram probabilities:

  \hat{P}(w_n | w_{n-2} w_{n-1}) = \lambda_1 P(w_n | w_{n-2} w_{n-1}) + \lambda_2 P(w_n | w_{n-1}) + \lambda_3 P(w_n)

such that \sum_i \lambda_i = 1.

- The \lambda_i can be learned automatically from corpus training data (using HMMs)
- The \lambda_i can be made to vary according to the particular trigram, so:

  \hat{P}(w_n | w_{n-2} w_{n-1}) = \lambda_1(w_{n-2}^{n-1}) P(w_n | w_{n-2} w_{n-1})
                                 + \lambda_2(w_{n-2}^{n-1}) P(w_n | w_{n-1})
                                 + \lambda_3(w_{n-2}^{n-1}) P(w_n)
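A sketch of the fixed-weight version of the interpolation formula. The lambda values here are illustrative, not learned; deleted interpolation proper would estimate them from held-out data.

def interpolated_prob(w1, w2, w3, p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates,
    with lambda_1 + lambda_2 + lambda_3 = 1."""
    l1, l2, l3 = lambdas
    return (l1 * p3.get((w1, w2, w3), 0.0)
            + l2 * p2.get((w2, w3), 0.0)
            + l3 * p1.get(w3, 0.0))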