The Noisy Channel Model and Markov Models


1/24 The Noisy Channel Model and Markov Models
Mark Johnson
September 3, 2014

2/24 The big ideas
The story so far:
- machine learning classifiers learn a function that maps a data item X to a label Y
- we handle large item spaces X by decomposing each data item X into a collection of features F
- but the label spaces Y have to be small to avoid sparse-data problems
Where we're going from here:
- many important problems involve large label spaces Y:
  - Part-Of-Speech (POS) tagging, where Y is a vector of POS tags
  - syntactic parsing, where Y is a syntactic parse tree
- the basic approach is to decompose Y into parts or features G
- if each feature G_j is independent, we can just learn a separate classifier for each G_j
- but often there are important dependencies between the G_j (in POS tagging, adjectives typically precede nouns), so more sophisticated models are required to capture these dependencies

3/24 Outline: The noisy channel model; Bigram language models; Learning bigram models from text; Markov Chains and higher-order n-grams; n-gram language models as Bayes nets; Summary

4/24 Machine learning in computational linguistics
Many problems involve predicting a complex label Y from data X:
- in automatic speech recognition, X is an acoustic waveform and Y is a transcript
- in machine translation, X is source-language text and Y is its target-language translation
- in spelling correction, X is a source text with spelling mistakes and Y is a target text without spelling mistakes
- in automatic summarisation, X is a document and Y is a summary of that document
Suppose we can estimate P(Y | X) somehow. Then we should compute:

    ŷ(x) = argmax_{y ∈ Y} P(Y=y | X=x)

Problems we have to solve:
- Y is astronomical ⇒ computing the argmax may be intractable
- Y is astronomical ⇒ how can we estimate P(Y | X)?

5/24 The noisy channel model
The noisy channel model uses Bayes rule to invert P(Y | X):

    P(Y | X) = P(X | Y) P(Y) / P(X)

[Diagram: the language model generates Y, which passes through the noisy channel to produce the observation X]

We can ignore P(X) if our goal is to compute

    ŷ(x) = argmax_{y ∈ Y} P(X=x | Y=y) P(Y=y)

because P(X=x) is a constant.
- P(X | Y) is called the channel model or the distortion model
- P(Y) is called the source model or the language model; it can often be learnt from cheap, readily available data

6/24 The noisy channel model in spelling correction, speech recognition and machine translation

    ŷ(x) = argmax_{y ∈ Y} P(X=x | Y=y) P(Y=y)

where P(X=x | Y=y) is the channel model and P(Y=y) is the source model.
The channel models are task-specific:
- for spelling correction, P(X | Y) maps words to their likely mis-spellings
- for speech recognition, P(X | Y) maps words or phonemes to acoustic waveforms
- for machine translation, P(X | Y) maps words or phrases to their translations
The source model P(Y), which is also called a language model, is the same across these tasks:
- P(Y) is the probability of a sentence Y
- it can be learned from readily available text corpora
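To make the decision rule concrete, here is a minimal Python sketch of noisy-channel decoding for spelling correction. Everything in it (the candidate list, channel_logprob, source_logprob and their toy numbers) is a hypothetical stand-in rather than anything from the lecture; it only illustrates the argmax over channel score times language-model score, computed in log space.

```python
import math

def noisy_channel_decode(x, candidates, channel_logprob, source_logprob):
    """Return argmax over y of log P(X=x | Y=y) + log P(Y=y).

    Working in log space avoids underflow and picks the same y as
    maximising the product P(X=x | Y=y) * P(Y=y)."""
    return max(candidates,
               key=lambda y: channel_logprob(x, y) + source_logprob(y))

# Hypothetical toy models: the channel model prefers candidates that look
# like the observed string, the source (language) model prefers frequent words.
def channel_logprob(x, y):
    shared = sum(a == b for a, b in zip(x, y))     # crude similarity score
    return shared - max(len(x), len(y))

def source_logprob(y):
    unigram = {"the": 0.05, "then": 0.002, "than": 0.003}
    return math.log(unigram.get(y, 1e-6))

print(noisy_channel_decode("thne", ["the", "then", "than"],
                           channel_logprob, source_logprob))   # -> the
```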

7/24 Outline: The noisy channel model; Bigram language models; Learning bigram models from text; Markov Chains and higher-order n-grams; n-gram language models as Bayes nets; Summary

8/24 Language models
- A language model calculates the probability of a sequence of words or phonemes.
- In many NLP applications the source model P(Y) in a noisy channel model is a language model:

      ŷ(x) = argmax_{y ∈ Y} P(X=x | Y=y) P(Y=y)

  where P(X=x | Y=y) is the channel model and P(Y=y) is the source model.
- In such applications, Y is a sequence of words or phonemes.
- The language model is used to calculate the probability of possible sequences Y of words or phonemes.
- The job of the language model is to distinguish likely sequences of words or phonemes from unlikely ones.
- It can be learnt from cheaply available text collections.

9/24 The bigram assumption for language models
- A language model estimates the probability P(Y) of a sentence Y = (Y_1, ..., Y_n), where Y_i is the ith word in the sentence.
- Recall the relationship between joint and conditional probabilities:

      P(U, V) = P(V) P(U | V)

- We can use this to rewrite P(Y) = P(Y_1, ..., Y_n):

      P(Y_1, ..., Y_n) = P(Y_1) P(Y_2, ..., Y_n | Y_1)
                       = P(Y_1) P(Y_2 | Y_1) P(Y_3 | Y_1, Y_2) ... P(Y_n | Y_1, ..., Y_{n-1})

- Now make the bigram assumption:

      P(Y_j | Y_1, ..., Y_{j-1}) ≈ P(Y_j | Y_{j-1}),

  i.e., word Y_j only depends on the preceding word Y_{j-1}.
- Then P(Y) simplifies to:

      P(Y_1, ..., Y_n) = P(Y_1) P(Y_2 | Y_1) ... P(Y_n | Y_{n-1})
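As a sketch of what the bigram assumption buys (and costs), the snippet below evaluates both factorisations for a three-word sentence; all probability values are made-up illustration numbers, not estimates from any corpus.

```python
# Hypothetical conditional probabilities, purely for illustration.
p_first = {"bow": 0.6}                                   # P(Y_1)
p_bigram = {("wow", "bow"): 0.9, ("woof", "wow"): 0.3}   # P(Y_j | Y_{j-1})
p_full = {("woof", ("bow", "wow")): 0.25}                # P(Y_j | Y_1, ..., Y_{j-1})

# Exact chain rule: P(bow, wow, woof) = P(bow) P(wow | bow) P(woof | bow, wow)
exact = p_first["bow"] * p_bigram[("wow", "bow")] * p_full[("woof", ("bow", "wow"))]

# Bigram assumption: replace P(woof | bow, wow) with P(woof | wow)
approx = p_first["bow"] * p_bigram[("wow", "bow")] * p_bigram[("woof", "wow")]

print(exact, approx)   # roughly 0.135 vs 0.162: the approximation changes the value
```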

10/24 Homogeneity assumption in language models
Using the bigram assumption we simplified

    P(Y_1, ..., Y_n) = P(Y_1) P(Y_2 | Y_1) ... P(Y_n | Y_{n-1})
                     = P(Y_1) ∏_{i=2}^{n} P(Y_i | Y_{i-1})

Homogeneity assumption: the conditional probabilities don't change with i, i.e., there is a matrix s such that

    P(Y_i = y | Y_{i-1} = y') = s_{y,y'}

Then the bigram model probability P(Y=y) of a sentence y = (y_1, ..., y_n) is:

    P(Y=y) = P(Y_1 = y_1) s_{y_2,y_1} s_{y_3,y_2} ... s_{y_n,y_{n-1}}
           = P(Y_1 = y_1) ∏_{i=2}^{n} s_{y_i,y_{i-1}}

11/24 Using end-markers to handle initial and final conditions
We simplify the model by assuming the string y is padded with end-markers $, i.e., Y_0 = $ and Y_{n+1} = $:

    y = ($, y_1, y_2, ..., y_n, $)

Then the bigram language model probability P(Y=y) is:

    P(Y=y) = s_{y_1,$} s_{y_2,y_1} ... s_{y_n,y_{n-1}} s_{$,y_n}
           = ∏_{i=1}^{n+1} s_{y_i,y_{i-1}}

12/24 Bigram language model example

    s =  y_i \ y_{i-1}    $     bow    wow    woof
         $                0     0      0.1    0.2
         bow              0.5   0      0.7    0.4
         wow              0     1.0    0      0
         woof             0.5   0      0.2    0.4

[State-transition diagram over {$, bow, wow, woof} with arcs labelled by the probabilities above]

    P($, bow, wow, woof, woof, $) = s_{bow,$} s_{wow,bow} s_{woof,wow} s_{woof,woof} s_{$,woof}
                                  = 0.5 × 1.0 × 0.2 × 0.4 × 0.2
                                  = 0.008
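The computation above is easy to check mechanically. Here is a small Python sketch (mine, not from the lecture) that stores the table as a dictionary s[(y_i, y_{i-1})], pads the sentence with $ end-markers as on the previous slide, and multiplies the transition probabilities.

```python
# Transition probabilities s[(y_i, y_{i-1})] from the table above; omitted entries are 0.
s = {
    ("$", "wow"): 0.1, ("$", "woof"): 0.2,
    ("bow", "$"): 0.5, ("bow", "wow"): 0.7, ("bow", "woof"): 0.4,
    ("wow", "bow"): 1.0,
    ("woof", "$"): 0.5, ("woof", "wow"): 0.2, ("woof", "woof"): 0.4,
}

def bigram_prob(words, s, pad="$"):
    """P(Y=y) = product over i of s[y_i, y_{i-1}] for the $-padded sentence."""
    padded = [pad] + list(words) + [pad]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= s.get((cur, prev), 0.0)    # zero entries of the table are simply missing
    return prob

print(bigram_prob(["bow", "wow", "woof", "woof"], s))
# 0.5 * 1.0 * 0.2 * 0.4 * 0.2 = 0.008 (up to floating-point rounding)
```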

13/24 Outline: The noisy channel model; Bigram language models; Learning bigram models from text; Markov Chains and higher-order n-grams; n-gram language models as Bayes nets; Summary

14/24 Estimating bigram models from text
We can estimate the bigram model s from a text corpus:
- Collect a vector of unigram counts m and a matrix of bigram counts n:
  - m_y = number of times y is followed by anything in the corpus
  - n_{y,y'} = number of times y follows y' in the corpus
  - make sure you count the beginning and end of sentence markers!
- Maximum likelihood estimates:

      ŝ_{y,y'} = n_{y,y'} / m_{y'}

- Add-1 smoothed estimates (a good idea!):

      ŝ_{y,y'} = (n_{y,y'} + 1) / (m_{y'} + |V|)

  where V is the vocabulary (set of words) of the corpus.

15/24 Estimating a bigram model example

    corpus = $ bow wow woof woof $   $ woof bow wow bow wow woof $   $ bow wow $

    V = {$, bow, wow, woof}

    m =   $    bow   wow   woof
          3    4     4     4

    n =  y_i \ y_{i-1}    $    bow   wow   woof
         $                0    0     1     2
         bow              2    0     1     1
         wow              0    4     0     0
         woof             1    0     2     1

    ŝ =  y_i \ y_{i-1}    $     bow   wow   woof
         $                1/7   1/8   2/8   3/8
         bow              3/7   1/8   2/8   2/8
         wow              1/7   5/8   1/8   1/8
         woof             2/7   1/8   3/8   2/8
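The counts and smoothed estimates above can be reproduced with a few lines of Python. This sketch (not from the lecture) pads each example sentence with $, accumulates m and n as defined on the previous slide, and prints a couple of the add-1 estimates for comparison with the table.

```python
from collections import Counter
from fractions import Fraction

corpus = [["bow", "wow", "woof", "woof"],
          ["woof", "bow", "wow", "bow", "wow", "woof"],
          ["bow", "wow"]]
V = {"$", "bow", "wow", "woof"}

m = Counter()   # m[y'] = number of times y' is followed by anything
n = Counter()   # n[(y, y')] = number of times y follows y'
for sent in corpus:
    padded = ["$"] + sent + ["$"]
    for prev, cur in zip(padded, padded[1:]):
        m[prev] += 1
        n[(cur, prev)] += 1

def s_hat(y, y_prev):
    """Add-1 smoothed estimate of P(Y_i = y | Y_{i-1} = y_prev)."""
    return Fraction(n[(y, y_prev)] + 1, m[y_prev] + len(V))

print(m["$"], m["bow"], m["wow"], m["woof"])     # 3 4 4 4
print(s_hat("bow", "$"), s_hat("wow", "bow"))    # 3/7 5/8
```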

16/24 Outline: The noisy channel model; Bigram language models; Learning bigram models from text; Markov Chains and higher-order n-grams; n-gram language models as Bayes nets; Summary

17/24 Markov Models
- A first-order Markov chain is a sequence of random variables Y_1, Y_2, Y_3, ..., where:

      P(Y_i | Y_1, ..., Y_{i-1}) = P(Y_i | Y_{i-1})

- I.e., Y_i is independent of Y_1, ..., Y_{i-2} given Y_{i-1}: the value Y_i at time i only depends on the value Y_{i-1} at time i-1.
- The bigram language model is a first-order Markov chain.
- Informally, the order of a Markov chain indicates how far back into the past the next state can depend.

18/24 Higher-order Markov chains and n-gram language models
- An n-gram language model uses adjacent n-word sequences to predict the probability of a sequence.
- E.g., the trigrams in y = (the, rain, in, spain) are ($, $, the), ($, the, rain), (the, rain, in), (rain, in, spain), (in, spain, $), (spain, $, $).
- In an mth-order Markov chain, Y_i depends only on Y_{i-m}, ..., Y_{i-1}.
- So an (m+1)-gram language model is an mth-order Markov chain:
  - a bigram language model is a first-order Markov chain
  - a trigram language model is a second-order Markov chain
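For concreteness, here is a small sketch of n-gram extraction with end-marker padding; with n = 3 it reproduces the trigram list above (the helper name ngrams is my own, not from the lecture).

```python
def ngrams(words, n, pad="$"):
    """Return the n-grams of the sentence, padded with n-1 end-markers on each side."""
    padded = [pad] * (n - 1) + list(words) + [pad] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(ngrams(["the", "rain", "in", "spain"], 3))
# [('$', '$', 'the'), ('$', 'the', 'rain'), ('the', 'rain', 'in'),
#  ('rain', 'in', 'spain'), ('in', 'spain', '$'), ('spain', '$', '$')]
```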

19/24 Outline: The noisy channel model; Bigram language models; Learning bigram models from text; Markov Chains and higher-order n-grams; n-gram language models as Bayes nets; Summary

20/24 n-gram language models
- Goal: estimate P(y), where y = (y_1, ..., y_m) is a sequence of words.
- n-gram models decompose P(y) into a product of conditional distributions:

      P(y) = P(y_1) P(y_2 | y_1) P(y_3 | y_1, y_2) ... P(y_m | y_1, ..., y_{m-1})

  E.g., P(wreck a nice beach) = P(wreck) P(a | wreck) P(nice | wreck a) P(beach | wreck a nice)
- n-gram assumption: no dependencies span more than n words, i.e.,

      P(y_i | y_1, ..., y_{i-1}) ≈ P(y_i | y_{i-n+1}, ..., y_{i-1})

- E.g., a bigram model is an n-gram model where n = 2:

      P(wreck a nice beach) ≈ P(wreck) P(a | wreck) P(nice | a) P(beach | nice)

21/24 n-gram language models as Markov models and Bayes nets
- An n-gram language model is a Markov model that factorises the distribution over sentences into a product of conditional distributions:

      P(y) = ∏_i P(y_i | y_{i-n+1}, ..., y_{i-1})

  where y is padded with end-markers, i.e., y = ($, y_1, y_2, ..., y_m, $), and the product runs over the words of the padded sentence.
- Bigram language model as a Bayes net: the chain $ → Y_1 → Y_2 → Y_3 → Y_4 → $, where each variable's only parent is the preceding variable.
- Trigram language model as a Bayes net: the same chain $, $ → Y_1 → Y_2 → Y_3 → Y_4 → $, but each variable has the two preceding variables as parents.
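This factorisation can be written as a generic scoring function. In the sketch below, cond_logprob is a hypothetical stand-in for whatever estimate of P(y_i | previous n-1 words) you have; here it is just a uniform distribution over an assumed 10,000-word vocabulary so the example runs.

```python
import math

def ngram_logprob(words, n, cond_logprob, pad="$"):
    """log P(y) = sum over i of log P(y_i | previous n-1 words),
    with the sentence padded by end-markers on both sides as in the slides."""
    padded = [pad] * (n - 1) + list(words) + [pad] * (n - 1)
    total = 0.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        total += cond_logprob(padded[i], context)
    return total

# Hypothetical stand-in: a uniform model over an assumed 10,000-word vocabulary.
def cond_logprob(word, context):
    return -math.log(10_000)

print(ngram_logprob(["the", "rain", "in", "spain"], n=3, cond_logprob=cond_logprob))
```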

22/24 The conditional word models in n-gram models
- An n-gram model factorises P(y) into a product of conditional models, each of the form:

      P(y_n | y_1, ..., y_{n-1})

- The performance of an n-gram model depends greatly on exactly how these conditional models are defined; there has been a huge amount of work on this.
- Random forest and deep learning methods for estimating these conditional distributions currently produce state-of-the-art language models.

23/24 Outline: The noisy channel model; Bigram language models; Learning bigram models from text; Markov Chains and higher-order n-grams; n-gram language models as Bayes nets; Summary

24/24 Summary
- In many NLP applications, the label space Y is astronomically large ⇒ decompose it into components (e.g., elements, features, etc.)
- The noisy channel model uses Bayes rule to invert a sequence labelling problem ⇒ enables us to use language models trained on large text collections
- n-gram language models are Markov models that predict the next word based on the preceding n-1 words