Introduction to N-grams


Lecture Notes, Artificial Intelligence Course, University of Birzeit, Palestine, Spring Semester 2014. (Advanced) Artificial Intelligence: Probabilistic Language Modeling, Introduction to N-grams. Dr. Mustafa Jarrar, Sina Institute, University of Birzeit. mjarrar@birzeit.edu, www.jarrar.info

Watch this lecture and download the slides from http://jarrar-courses.blogspot.com/2011/11/artificial-intelligence-fall-2011.html

Outline: Probabilistic Language Models, the Chain Rule, the Markov Assumption, N-gram examples, available language models, and evaluating probabilistic language models. Keywords: Natural Language Processing, NLP, language model, probabilistic language models, chain rule, Markov assumption, unigram, bigram, N-gram, corpus, computational linguistics, linguistic ambiguity, statistical language analysis, language applications, automatic processing of natural languages. Acknowledgement: This lecture is largely based on Dan Jurafsky's online course on NLP, which can be accessed at http://www.youtube.com/watch?v=s3kkluba3b0

Why Probabilistic Language Models? Goal: assign a probability to a sentence (as used by native speakers). Why? Machine translation: P(high winds tonite) > P(large winds tonite). Spell correction: "The office is about fifteen minuets from my house"; P(about fifteen minutes from) > P(about fifteen minuets from). Speech recognition: P(I saw a van) >> P(eyes awe of an). Also summarization, question answering, and many other applications.

Probabilistic Language Modeling. Goal: given a corpus, compute the probability of a sentence W, i.e. a sequence of words w_1 w_2 w_3 w_4 w_5 ... w_n: P(W) = P(w_1, w_2, w_3, w_4, w_5, ..., w_n). For example, P(How to cook rice) = P(How, to, cook, rice). A related task is the probability of an upcoming word: given the prefix w_1, w_2, w_3, w_4, what is the probability that w_5 is the next word? A model that computes P(w_5 | w_1, w_2, w_3, w_4), e.g. P(rice | how, to, cook), is closely related to one that computes P(w_1, w_2, w_3, w_4, w_5), e.g. P(how, to, cook, rice). A model that computes P(W) or P(w_n | w_1, w_2, ..., w_{n-1}) is called a language model. "Grammar" might be a better name, but "language model" or LM is the standard term. Intuition: let's rely on the Chain Rule of Probability.

Reminder: The Chain Rule. Recall the definition of conditional probability: P(A|B) = P(A,B) / P(B). Rewriting: P(A|B) P(B) = P(A,B), or P(A,B) = P(A|B) P(B). With more variables: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C). Example: P("its water is so transparent") = P(its) P(water | its) P(is | its water) P(so | its water is) P(transparent | its water is so). The Chain Rule in general: P(w_1, w_2, w_3, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1}), i.e. P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_1 w_2 ... w_{i-1}).

How to estimate these probabilities. Given a large corpus of English (one that represents the language), should we just count word sequences and divide? P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that). No! There are too many possible sentences, and we will never have enough data (counts of all possible sentences) for estimating these.

Markov Assumption. Instead, we apply a simplifying assumption, due to Andrei Markov (1856-1922), Russian mathematician. Markov's suggestion: instead of counting all possible full histories, it is enough to condition on only the last few words. In other words, approximate each component in the chain-rule product: P(w_1 w_2 ... w_n) ≈ ∏_i P(w_i | w_{i-k} ... w_{i-1}), i.e. P(w_i | w_1 w_2 ... w_{i-1}) ≈ P(w_i | w_{i-k} ... w_{i-1}). Example: P(the | its water is so transparent that) ≈ P(the | that), or maybe P(the | its water is so transparent that) ≈ P(the | transparent that).

Unigram Model: the simplest case of a Markov model. Estimate the probability of a whole sequence of words as the product of the probabilities of the individual words: P(w_1 w_2 ... w_n) ≈ ∏_i P(w_i). P(its water is so transparent that the) ≈ P(its) x P(water) x P(is) x P(so) x P(transparent) x P(that) x P(the). Examples of sentences generated automatically from a unigram model (words are independent): "fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass"; "thrift, did, eighty, said, hard, 'm, july, bullish"; "that, or, limited, the". This is not really a useful model.

Bigram Model: condition on the previous word. Approximate the probability of a word given the entire prefix (from the beginning up to the previous word) by the probability given only the previous word: P(w_i | w_1 w_2 ... w_{i-1}) ≈ P(w_i | w_{i-1}). P(its water is so transparent that the) ≈ P(its) x P(water | its) x P(is | water) x P(so | is) x P(transparent | so) x P(that | transparent) x P(the | that). Examples of sentences generated automatically from a bigram model (each word depends only on the previous word): "texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen"; "outside, new, car, parking, lot, of, the, agreement, reached"; "this, would, be, a, record, november". The bigram conditioning is better, but it still produces sentences that are clearly wrong. A generation sketch for both models follows below.
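The contrast between the two models is easy to see in code. Below is a minimal Python sketch, not from the slides, that estimates a unigram and a bigram model from the small "I am Sam" corpus used later in this lecture and samples a sentence from each; the corpus choice, function names, and the simple sampling loop are illustrative assumptions.

```python
# Minimal sketch (assumption, not from the slides): sampling sentences from
# unigram and bigram models whose probabilities are estimated from a toy corpus.
import random
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>".split(),
    "<s> Sam I am </s>".split(),
    "<s> I do not like green eggs and ham </s>".split(),
]

# Unigram counts: words are treated as independent draws.
unigrams = Counter(w for sent in corpus for w in sent)
total = sum(unigrams.values())

# Bigram counts: each word is conditioned on the previous word.
bigrams = defaultdict(Counter)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigrams[prev][cur] += 1

def sample_unigram(max_len=10):
    """Draw max_len words independently, proportional to their counts."""
    words = list(unigrams)
    weights = [unigrams[w] / total for w in words]
    return [random.choices(words, weights)[0] for _ in range(max_len)]

def sample_bigram(max_len=20):
    """Start at <s> and repeatedly draw the next word given the previous one."""
    word, out = "<s>", []
    while word != "</s>" and len(out) < max_len:
        nxt = bigrams[word]
        word = random.choices(list(nxt), list(nxt.values()))[0]
        out.append(word)
    return out

print("unigram:", " ".join(sample_unigram()))
print("bigram: ", " ".join(sample_bigram()))
```

With so little data both samplers produce nonsense, but the bigram sampler at least respects local word order, which is exactly the difference the generated-sentence examples above illustrate.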

N-gram models. We can extend this to trigrams, 4-grams, 5-grams. In general this is an insufficient model of language, because language has long-distance dependencies: "The computer which I had just put into the machine room on the fifth floor crashed." Capturing such dependencies would mean conditioning on very long histories. But in practice we can often get away with N-gram models.

Estimating Bigram Probabilities. The Maximum Likelihood Estimate: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}), i.e. given word w_{i-1}, how often was it followed by word w_i? Sample corpus: <s> I am Sam </s>; <s> Sam I am </s>; <s> I do not like green eggs and ham </s>. A counting sketch follows below.
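As a sanity check, here is a minimal sketch, assuming the three-sentence sample corpus above, that computes a few maximum likelihood bigram estimates by dividing bigram counts by unigram counts; the function name p_mle is an illustrative choice.

```python
# Minimal sketch (assumption, not from the slides): MLE bigram estimates
# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}) from the sample corpus.
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in sentences:
    words = sent.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def p_mle(word, prev):
    """MLE bigram probability of `word` given the previous word `prev`."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3: <s> is followed by I in 2 of the 3 sentences
print(p_mle("am", "I"))      # 2/3: I occurs 3 times, followed by am twice
print(p_mle("</s>", "Sam"))  # 1/2: Sam occurs twice, once at the end of a sentence
```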

Another example. Given this larger corpus: "can you tell me about any good cantonese restaurants close by"; "mid priced thai food is what i'm looking for"; "tell me about chez panisse"; "can you give me a listing of the kinds of food that are available"; "i'm looking for a good place to eat breakfast"; "when is caffe venezia open during the day". Bigram counts (out of 9222 sentences) are shown in a table on the slide; many counts are zero.

Raw bigram probabilities. Normalize the bigram counts in the previous table by the unigram counts: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}). The resulting probability table is shown on the slide.

Bigram estimates of sentence probabilities: P(<s> I want english food </s>) = P(I | <s>) P(want | I) P(english | want) P(food | english) P(</s> | food) = .000031
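To make the multiplication concrete, here is a minimal sketch of bigram sentence scoring. Only P(i | <s>) = .25 and P(english | want) = .0011 appear in this lecture; the other three probabilities are placeholder values, chosen so the product lands near the .000031 quoted above.

```python
# Minimal sketch (assumption): scoring a sentence with a bigram model by
# multiplying conditional probabilities. Values marked "placeholder" are not
# from this lecture's table.
bigram_p = {
    ("<s>", "I"): 0.25,         # from the "What kinds of knowledge?" slide
    ("I", "want"): 0.33,        # placeholder value
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,   # placeholder value
    ("food", "</s>"): 0.68,     # placeholder value
}

sentence = ["<s>", "I", "want", "english", "food", "</s>"]
prob = 1.0
for prev, cur in zip(sentence, sentence[1:]):
    prob *= bigram_p[(prev, cur)]

print(prob)  # about 3.1e-05, i.e. on the order of the .000031 quoted on the slide
```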

What kinds of knowledge? P(english | want) = .0011; P(chinese | want) = .0065; P(to | want) = .66; P(eat | to) = .28; P(food | to) = 0; P(want | spend) = 0; P(i | <s>) = .25. These numbers reflect how English is used in practice (in our corpus).

Practical Issues. In practice we don't store these probabilities as raw probabilities; we store them as log probabilities. We do everything in log space for two reasons: to avoid underflow (multiplying many small numbers yields arithmetic underflow), and because adding is faster than multiplying: log(p_1 p_2 p_3 p_4) = log p_1 + log p_2 + log p_3 + log p_4. See the log-space sketch below.
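A minimal sketch of the underflow problem and the log-space fix, using illustrative probability values:

```python
# Minimal sketch (assumption): why log space helps. Multiplying many small
# probabilities underflows to 0.0 in floating point; summing their logs does not.
import math

probs = [1e-5] * 80          # 80 tiny bigram probabilities (illustrative)

product = 1.0
for p in probs:
    product *= p
print(product)               # 0.0 -- underflowed; the true value is 1e-400

log_prob = sum(math.log(p) for p in probs)
print(log_prob)              # about -921.03, a perfectly representable number
```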

Available Resources. There are many publicly available language models and N-gram collections that you can try, e.g. the Google N-Gram Release, August 2006.

Google N-Gram Release, sample entries and counts: serve as the incoming 92; serve as the incubator 99; serve as the independent 794; serve as the index 223; serve as the indication 72; serve as the indicator 120; serve as the indicators 45; serve as the indispensable 111; serve as the indispensible 40; serve as the individual 234. http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Other Models. Google Books N-gram Viewer: http://ngrams.googlelabs.com/. SRILM, the SRI Language Modeling Toolkit: http://www.speech.sri.com/projects/srilm/

How do we know if a language model is good? Does the language model prefer good sentences to bad ones, i.e. does it assign higher probability to real or frequently observed sentences than to ungrammatical or rarely observed ones? Train the parameters of the model on a training set, then test the model's performance on data it hasn't seen. A test set is an unseen dataset, different from the training set and totally unused during training. An evaluation metric tells us how well the model does on the test set. There are two ways to evaluate a language model: extrinsic evaluation (in vivo) and intrinsic evaluation (perplexity).

Extrinsic evaluation of N-gram models. This is the best evaluation for comparing models A and B: put each model into a task (a spelling corrector, speech recognizer, or MT system), run the task, and get an accuracy for A and for B, e.g. how many misspelled words are corrected properly, or how many words are translated correctly. Then compare the accuracies of A and B. Extrinsic evaluation is time-consuming; it can take days or weeks.

Intuition of Perplexity (intrinsic evaluation). How well can we predict the next word? "I always order pizza with cheese and ___"; "The 33rd President of the US was ___"; "I saw a ___". For the first sentence, a model might assign mushrooms 0.1, pepperoni 0.1, anchovies 0.01, ..., fried rice 0.0001, ..., and 1e-100. A better model of a text is one that assigns a higher probability to the word that actually occurs, i.e. gives the highest P(sentence). Perplexity is the inverse probability of the test set, normalized by the number of words; minimizing perplexity is the same as maximizing probability. A small sketch follows below.
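A minimal sketch, under the assumption of made-up per-word probabilities, of how perplexity is computed as the inverse probability of the test set normalized by the number of words, PP(W) = P(w_1 ... w_N)^(-1/N):

```python
# Minimal sketch (assumption): perplexity from per-word probabilities,
# computed in log space as exp(-(1/N) * sum of log probabilities).
# The probability values below are illustrative, not from a real model.
import math

word_probs = [0.2, 0.1, 0.05, 0.3, 0.25]   # P(w_i | history) for a 5-word test text

N = len(word_probs)
log_likelihood = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_likelihood / N)

print(perplexity)  # about 6.7 -- roughly how many choices the model faces per word
```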