Foundations of Natural Language Processing Lecture 6: Spelling correction, edit distance, and EM

Alex Lascarides (slides from Alex Lascarides and Sharon Goldwater), 2 February 2019

Recap: noisy channel model

A general probabilistic framework, which helps us estimate something hidden (e.g., for spelling correction, the intended word) via two distributions:
- P(Y): Language model. The distribution over the words the user intended to type.
- P(X|Y): Noise model. The distribution describing what the user is likely to type, given what they meant to type.

Given some particular word(s) x (say, "no much effert"), we want to recover the most probable y that was intended.

Recap: noisy channel model

Mathematically, what we want is

argmax_y P(y|x) = argmax_y P(x|y) P(y)

Assume we have a way to compute P(x|y) and P(y). Can we do the following?
- Consider all possible intended words y.
- For each y, compute P(x|y) P(y).
- Return the y with the highest P(x|y) P(y) value.
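The equality at the top of this slide is just Bayes' rule with the constant denominator dropped (a restatement of the slide's formula, not new material):

```latex
\hat{y} \;=\; \arg\max_{y} P(y \mid x)
        \;=\; \arg\max_{y} \frac{P(x \mid y)\,P(y)}{P(x)}
        \;=\; \arg\max_{y} P(x \mid y)\,P(y)
```

since P(x) does not depend on y.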

Recap: noisy channel model

So, can we simply consider all possible intended words y, compute P(x|y) P(y) for each, and return the y with the highest value? No! Without constraints, there are an infinite number of possible ys.

Algorithm sketch

A very basic spelling correction system. Assume:
- we have a large dictionary of real words;
- we only correct non-words; and
- we only consider corrections that differ by a single character (insertion, deletion, or substitution) from the non-word.

Then we can do the following to correct each non-word x (see the sketch below):
- Generate a list of all words y that differ by 1 character from x.
- Compute P(x|y) P(y) for each y and return the y with the highest value.
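A minimal Python sketch of this procedure (not the lecture's code). It assumes a set `dictionary` of real words, a language model `lm` mapping words to P(y), and a noise model `channel(x, y)` returning P(x|y); all three names are hypothetical placeholders.

```python
import string

def one_edit_candidates(x, dictionary):
    """All dictionary words that differ from the non-word x by exactly one
    insertion, deletion, or substitution of a single character."""
    alphabet = string.ascii_lowercase
    splits = [(x[:i], x[i:]) for i in range(len(x) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    return (deletions | insertions | substitutions) & set(dictionary)

def correct(x, dictionary, lm, channel):
    """Return the candidate y maximising P(x | y) * P(y)."""
    candidates = one_edit_candidates(x, dictionary)
    if not candidates:
        return x  # no real word within one edit: leave x unchanged
    return max(candidates, key=lambda y: channel(x, y) * lm[y])
```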

A simple noise model

Suppose we have a corpus of alignments between actual and corrected spellings:

actual:   n o - m u u c h e f f e r t
intended: n o t m - u c h e f f o r t

This example has:
- one substitution (o → e)
- one deletion (t → -, where - is used to show the alignment, but nothing appears in the text)
- one insertion (- → u)

A simple noise model

Assume that typed character x_i depends only on intended character y_i (ignoring context). So, the substitution o → e is equally probable regardless of whether the word is effort, spoon, or whatever. Then we have

P(x|y) = ∏_{i=1}^n P(x_i | y_i)

For example, P(no | not) = P(n|n) P(o|o) P(-|t)

See Brill and Moore (2000) on the course page for an example of a better model.
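Under this independence assumption the noise model is just a product over aligned character pairs. A minimal sketch, where `p_char` is a hypothetical dictionary of estimated P(x_i | y_i) values and the example probabilities are invented for illustration:

```python
def channel_prob(alignment, p_char):
    """P(x | y) as a product of P(x_i | y_i) over aligned character pairs.
    `alignment` is a list of (typed, intended) pairs; '-' is the empty
    character used for insertions and deletions."""
    prob = 1.0
    for typed, intended in alignment:
        prob *= p_char[(typed, intended)]
    return prob

# P(no | not) = P(n|n) * P(o|o) * P(-|t); the numbers below are made up.
print(channel_prob([('n', 'n'), ('o', 'o'), ('-', 't')],
                   {('n', 'n'): 0.99, ('o', 'o'): 0.98, ('-', 't'): 0.01}))
```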

Estimating the probabilities

Using our corpus of alignments, we can easily estimate P(x_i | y_i) for each character pair:
- Simply count how many times each character (including the empty character for del/ins) was used in place of each other character.
- The table of these counts is called a confusion matrix.
- Then use MLE or smoothing to estimate the probabilities.

Example confusion matrix

y\x    A    B    C    D    E    F    G    H   ...
A    168    1    0    2    5    5    1    3   ...
B      0  136    1    0    3    2    0    4   ...
C      1    6  111    5   11    6   36    5   ...
D      1   17    4  157    6   11    0    5   ...
E      2   10    0    1   98   27    1    5   ...
F      1    0    0    1    9   73    0    6   ...
G      1    3   32    1    5    3  127    3   ...
H      2    0    0    0    3    3    0    4   ...
...

We saw G when the intended character was C 36 times.
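A sketch of how such counts could be collected and turned into probabilities, assuming the aligned corpus is available as a flat list of (typed, intended) character pairs; the function name and the add-alpha smoothing option are illustrative choices, not the lecture's.

```python
from collections import Counter, defaultdict

def estimate_noise_model(aligned_pairs, alphabet, alpha=0.0):
    """Estimate P(typed | intended) from (typed, intended) character pairs.
    alpha = 0 gives the MLE; alpha > 0 gives add-alpha smoothing.
    '-' stands for the empty character (deletions/insertions)."""
    confusion = defaultdict(Counter)      # confusion[intended][typed] = count
    for typed, intended in aligned_pairs:
        confusion[intended][typed] += 1
    symbols = set(alphabet) | {'-'}
    p_char = {}
    for y in symbols:
        total = sum(confusion[y].values()) + alpha * len(symbols)
        for x in symbols:
            if total > 0:
                p_char[(x, y)] = (confusion[y][x] + alpha) / total
    return p_char
```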

Big picture again

We now have a very simple spelling correction system, provided we have a corpus of aligned examples, and we can easily determine which real words are only one edit away from non-words. There are easy, fairly efficient ways to do the latter (see http://norvig.com/spell-correct.html).

But where do the alignments come from, and what if we want a more general algorithm that can compute edit distances between any two arbitrary words?

Alignments and edit distance

These two problems reduce to one: find the optimal character alignment between two words (the one with the fewest character changes: the minimum edit distance or MED).

Example: if all changes count equally, MED(stall, table) is 3:

S T A L L
T A L L       deletion
T A B L       substitution
T A B L E     insertion

Alignments and edit distance

The same example written as an alignment (d = deletion, s = substitution, i = insertion):

S T A L L -
d     s   i
- T A B L E

More alignments

There may be multiple best alignments. In this case, two:

S T A L L -      S T A - L L
d     s   i      d     i   s
- T A B L E      - T A B L E

And lots of non-optimal alignments, such as:

S T A - L - L      S T A L - L -
s d   i   i d      d d s s i   i
T - A B L E -      - - T A B L E

How to find an optimal alignment

Brute force: consider all possibilities, score each one, pick the best. How many possibilities must we consider?
- The first character could align to any of the letters of the other word, or to a gap before, between, or after them.
- The next character can align anywhere to its right.
- And so on... the number of alignments grows exponentially with the length of the sequences.

Maybe not such a good method...

A better idea

Instead we will use a dynamic programming algorithm (other DP, or memoization, algorithms include Viterbi and CKY). DP is used to solve problems where brute force ends up recomputing the same information many times. Instead, we:
- compute the solution to each subproblem once,
- store (memoize) the solution, and
- build up solutions to larger computations by combining the pre-computed parts.

Strings of length n and m require O(mn) time and O(mn) space.

Intuition

The minimum distance D(stall, table) must be the minimum of:
- D(stall, tabl) + cost(ins)
- D(stal, table) + cost(del)
- D(stal, tabl) + cost(sub)

and similarly for the smaller subproblems (the general recurrence is spelled out below). So proceed as follows:
- solve the smallest subproblems first,
- store solutions in a table (chart),
- use these to solve and store larger subproblems until we get the full solution.
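Written out in general, for a source prefix of length i and a target prefix of length j, the recurrence is:

```latex
D(i, j) = \min \begin{cases}
  D(i-1,\, j)   + \mathrm{cost}(\mathrm{del}) \\
  D(i,\, j-1)   + \mathrm{cost}(\mathrm{ins}) \\
  D(i-1,\, j-1) + \mathrm{cost}(\mathrm{sub}) \quad (\text{or } {+}\,0 \text{ if the characters match})
\end{cases}
```

with base cases D(0, 0) = 0, D(i, 0) = i · cost(del), and D(0, j) = j · cost(ins).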

A note about costs

Our first example had cost(ins) = cost(del) = cost(sub) = 1. But we can choose whatever costs we want; they can even depend on the particular characters involved. For example, choose cost(sub(c, c')) to reflect P(c'|c) from our spelling correction noise model (e.g., as -log P(c'|c), so that minimising the summed cost maximises the probability). Then we end up computing the most probable way to change one word into the other.

In the following example, we'll assume cost(ins) = cost(del) = 1 and cost(sub) = 2.
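A compact sketch of the chart-filling algorithm walked through on the next few slides (backpointers omitted for brevity); the costs are parameters, defaulting to the values used in the example.

```python
def med_chart(src, tgt, ins_cost=1, del_cost=1, sub_cost=2):
    """Fill the minimum edit distance chart for src -> tgt.
    chart[i][j] = MED between src[:i] and tgt[:j]."""
    n, m = len(src), len(tgt)
    chart = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):        # first column: deletions only
        chart[i][0] = chart[i - 1][0] + del_cost
    for j in range(1, m + 1):        # first row: insertions only
        chart[0][j] = chart[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else sub_cost
            chart[i][j] = min(chart[i - 1][j] + del_cost,   # move down: deletion
                              chart[i][j - 1] + ins_cost,   # move right: insertion
                              chart[i - 1][j - 1] + sub)    # diagonal: sub/identity
    return chart

print(med_chart("stall", "table")[-1][-1])   # 4, matching the completed chart below
```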

Chart: starting point

[Chart: rows labelled S, T, A, L, L; columns labelled T, A, B, L, E; a 0 in the top-left cell, all other cells empty.]

Chart[i, j] stores two things:
- D(stall[0..i], table[0..j]): the MED of the prefixes of length i and j.
- Backpointer(s): which sub-alignment(s) were used to create this one.

Moves through the chart:
- Deletion: move down, cost = 1.
- Insertion: move right, cost = 1.
- Substitution: move down and right, cost = 2 (or 0 if the characters are the same).

Sum costs as we expand out from cell (0,0) to populate the entire matrix.

Filling first cell

[Chart: the cell below the 0 (row S of the first column) now contains 1.]

Moving down in the chart means we had a deletion (of S). That is, we've aligned (S) with (-). Add the cost of deletion (1) and a backpointer.

Rest of first column

[Chart: the whole first column is now filled with 0, 1, 2, 3, 4, 5.]

Each move down the first column means another deletion:
D(ST, -) = D(S, -) + cost(del)
D(STA, -) = D(ST, -) + cost(del)
etc.

Start of second column: insertion

[Chart: the cell to the right of the 0 (first row, column T) now contains 1.]

Moving right in the chart (from [0,0]) means we had an insertion. That is, we've aligned (-) with (T). Add the cost of insertion (1) and a backpointer.

Substitution

[Chart: the first column (0-5) and the first cell of the top row (1) are filled; the cell at row S, column T now contains 2.]

Moving down and right: either a substitution or identity. Here, a substitution: we aligned (S) to (T), so the cost is 2. For identity (aligning a letter to itself), the cost is 0.

Multiple paths

However, we also need to consider other ways to get to this cell. Move down from [0,1]: deletion of S, total cost is D(-, T) + cost(del) = 2. Same cost, but add a new backpointer.

Multiple paths

Or move right from [1,0]: insertion of T, total cost is D(S, -) + cost(ins) = 2. Same cost again, so add another backpointer.

Single best path

[Chart: as before, with the cell at row T, column T now filled with 1.]

Now compute D(ST, T). Take the minimum of three possibilities:
- D(ST, -) + cost(ins) = 2 + 1 = 3
- D(S, T) + cost(del) = 2 + 1 = 3
- D(S, -) + cost(ident) = 1 + 0 = 1

Final completed chart

        T   A   B   L   E
    0   1   2   3   4   5
S   1   2   3   4   5   6
T   2   1   2   3   4   5
A   3   2   1   2   3   4
L   4   3   2   3   2   3
L   5   4   3   4   3   4

Exercises for you:
- How many different optimal alignments are there?
- Reconstruct all the optimal alignments (a code sketch you could use for these two exercises follows below).
- Redo the chart with all costs = 1 (Levenshtein distance).
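One possible self-contained sketch for checking the first two exercises (the helper name and interface are invented for illustration): it recomputes the distances recursively with memoisation and then backtracks through every optimal predecessor.

```python
from functools import lru_cache

def optimal_alignments(src, tgt, ins=1, dele=1, sub=2):
    """Return (minimum cost, list of all minimum-cost alignments) for src -> tgt.
    Alignments are lists of (src_char, tgt_char) pairs, with '-' marking a gap."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # Minimum edit distance between src[:i] and tgt[:j].
        if i == 0:
            return j * ins
        if j == 0:
            return i * dele
        sub_cost = 0 if src[i - 1] == tgt[j - 1] else sub
        return min(dist(i - 1, j) + dele,
                   dist(i, j - 1) + ins,
                   dist(i - 1, j - 1) + sub_cost)

    def backtrack(i, j):
        # Yield every alignment of src[:i] with tgt[:j] that achieves dist(i, j).
        if i == 0 and j == 0:
            yield []
            return
        if i > 0 and dist(i, j) == dist(i - 1, j) + dele:
            for a in backtrack(i - 1, j):
                yield a + [(src[i - 1], '-')]              # deletion
        if j > 0 and dist(i, j) == dist(i, j - 1) + ins:
            for a in backtrack(i, j - 1):
                yield a + [('-', tgt[j - 1])]              # insertion
        if i > 0 and j > 0:
            sub_cost = 0 if src[i - 1] == tgt[j - 1] else sub
            if dist(i, j) == dist(i - 1, j - 1) + sub_cost:
                for a in backtrack(i - 1, j - 1):
                    yield a + [(src[i - 1], tgt[j - 1])]   # substitution/identity

    return dist(len(src), len(tgt)), list(backtrack(len(src), len(tgt)))

cost, alignments = optimal_alignments("stall", "table")
print(cost, len(alignments))   # cost 4 with these costs; the count answers the exercise
```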

Alignment and MED: uses?

Computing distances and/or alignments between arbitrary strings can be used for:
- spelling correction (as here);
- morphological analysis: which words are likely to be related?
- other fields entirely: e.g., comparing DNA sequences in biology.

Related algorithms are also used in speech recognition and time-series data mining.

Getting rid of hand alignments

Using the MED algorithm, we can now produce the character alignments we need to estimate our error model, given only corrected words. Previously, we needed hand annotations like:

actual:   n o - m u u c h e f f e r t
intended: n o t m - u c h e f f o r t

Now, our annotation requires less effort:

actual:   no muuch effert
intended: not much effort

Catch-22

But wait! In my example, we used costs of 1 and 2 to compute alignments. We actually want to compute our alignments using the costs from our noise model: the most probable alignment under that model. But until we have the alignments, we can't estimate the noise model...

General formulation

This sort of problem actually happens a lot in NLP (and ML):
- We have some probabilistic model and want to estimate its parameters (here, the character rewrite probabilities: the prob of each typed character given each intended character).
- The model also contains variables whose value is unknown (here: the correct character alignments).
- We would be able to estimate the parameters if we knew the values of the variables...
- ...and conversely, we would be able to infer the values of the variables if we knew the values of the parameters.

EM to the rescue

Problems of this type can often be solved using a version of Expectation-Maximization (EM), a general algorithm schema:
1. Initialize parameters to arbitrary values (e.g., set all costs = 1).
2. Using these parameters, compute optimal values for the variables (run MED to get alignments).
3. Now, using those alignments, recompute the parameters (just pretend the alignments are hand annotations; estimate parameters as from an annotated corpus).
4. Repeat steps 2 and 3 until the parameters stop changing.

(A code sketch of this loop follows below.)
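A sketch of this hard-EM loop for the spelling model (schematic, not the lecture's code). The aligner and estimator are passed in as functions (for example, the MED algorithm above extended to return the best alignment via backpointers, and the confusion-matrix estimator sketched earlier); mapping probabilities back to costs via negative log probability is one common choice, not something the slide specifies.

```python
import math
from collections import defaultdict

def hard_em(word_pairs, alphabet, align, estimate, n_iters=10):
    """Hard EM for the character rewrite probabilities.

    word_pairs: list of (typed, intended) word pairs, e.g. ("effert", "effort").
    align(x, y, costs): best character alignment of x with y under the given
        costs, as a list of (typed, intended) character pairs.
    estimate(pairs, alphabet): P(typed | intended) estimates from aligned pairs.
    """
    costs = defaultdict(lambda: 1.0)     # step 1: initialise all costs to 1
    p_char = {}
    for _ in range(n_iters):             # in practice: loop until costs stop changing
        # Step 2: optimal alignments under the current parameters.
        alignments = [align(x, y, costs) for x, y in word_pairs]
        # Step 3: re-estimate, pretending the alignments are hand annotations.
        pairs = [p for a in alignments for p in a]
        p_char = estimate(pairs, alphabet)
        # Map probabilities back to costs (negative log probability).
        costs = defaultdict(lambda: 1.0,
                            {k: -math.log(v) for k, v in p_char.items() if v > 0})
    return p_char
```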

EM vs. hard EM

The algorithm on the previous slide is actually hard EM (meaning: no soft/fuzzy decisions). Step 2 of true EM does not choose optimal values for the variables; instead it computes expected values (we'll see this for HMMs).

True EM is guaranteed to converge to a local optimum of the likelihood function. Hard EM also converges, but not to anything nicely defined mathematically. However, it's usually easier to compute and may work fine in practice.

Likelihood function

Let's call the parameters of our model θ. So for our spelling error model, θ is the set of all character rewrite probabilities P(x_i | y_i). For any value of θ, we can compute the probability of our dataset P(data | θ). This is the likelihood.

If our data includes hand-annotated character alignments, then

P(data | θ) = ∏_{i=1}^n P(x_i | y_i)

If the alignments a are latent, sum over possible alignments:

P(data | θ) = Σ_a ∏_{i=1}^n P(x_i | y_i, a)

Likelihood function

The likelihood P(data | θ) is a function of θ, and can have multiple local optima. Schematically (but θ is really multidimensional):

[Figure: schematic curve of the likelihood as a function of θ, showing several local optima and one global optimum.]

EM will converge to one of these; hard EM won't necessarily. Neither is guaranteed to find the global optimum!

Summary

Our simple spelling corrector illustrated several important concepts:
- An example of a noise model in a noisy channel model.
- The difference between model definition and the algorithm used to perform inference.
- Confusion matrix: used here to estimate the parameters of the noise model, but can also be used as a form of error analysis.
- The minimum edit distance algorithm as an example of dynamic programming.
- (Hard) EM as a way to bootstrap better parameter values when we don't have complete annotation.