Probabilistic Spelling Correction CE-324: Modern Information Retrieval Sharif University of Technology


1 Probabilistic Spelling Correction
CE-324: Modern Information Retrieval, Sharif University of Technology
M. Soleymani, Fall 2016
Most slides have been adapted from Profs. Manning, Nayak & Raghavan lectures (CS-276, Stanford)

2 Applications of spelling correction

3 Spelling Tasks
Spelling error detection
Spelling error correction:
  Autocorrect (hte → the)
  Suggest a correction
  Suggestion lists

4 Types of spelling errors
Non-word errors
  graffe → giraffe
Real-word errors
  Typographical errors
    three → there
  Cognitive errors (homophones)
    piece → peace, too → two, your → you're
Real-word correction almost always needs to be context-sensitive

5 Spelling correction steps
For each word w, generate a candidate set:
  Find candidate words with similar pronunciations
  Find candidate words with similar spellings
Choose the best candidate:
  By weighted edit distance or the noisy channel approach
  Context-sensitive, so we have to consider whether the surrounding words make sense
    "Flying form Heathrow to LAX" → "Flying from Heathrow to LAX"

6 Candidate Testing: Damerau-Levenshtein edit distance
Minimal edit distance between two strings, where edits are:
  Insertion
  Deletion
  Substitution
  Transposition of two adjacent letters
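The four edit types above can be sketched as a dynamic program. This is a minimal illustration, not the course's implementation: it computes the restricted variant (often called optimal string alignment), in which each edit costs 1 and a transposed pair is never edited again. The function name `osa_distance` is an assumption made for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Unit cost for insertion, deletion, substitution, and transposition
    of two adjacent letters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent letters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

With this sketch, "hte" and "the" are one transposition apart, and "graffe" and "giraffe" are one insertion apart, so both pairs have distance 1.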


8 Noisy channel intuition

9 Noisy channel
We see an observation x of a misspelled word
Find the correct word w
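The usual noisy channel decision rule for this setup can be written with Bayes' rule. In this sketch, V stands for the vocabulary of candidate words and \hat{w} for the chosen correction; both symbols are assumptions introduced here.

```latex
\hat{w} = \arg\max_{w \in V} P(w \mid x)
        = \arg\max_{w \in V} \frac{P(x \mid w)\, P(w)}{P(x)}
        = \arg\max_{w \in V} P(x \mid w)\, P(w)
```

The denominator P(x) is the same for every candidate, so it drops out of the argmax. P(w) is the language model (prior) and P(x | w) is the channel model.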

10 Language Model
Take a big supply of words with T tokens:
  $P(w) = \frac{C(w)}{T}$, where $C(w)$ = number of occurrences of w
Supply of words = your document collection
In other applications, you can take the supply to be typed queries (suitably filtered) when a static dictionary is inadequate
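A minimal sketch of this unigram estimate, assuming a simple lowercase tokenizer. The function name `unigram_model` and the tokenization rule are choices made for this example, not part of the slides.

```python
import re
from collections import Counter


def unigram_model(text: str) -> dict[str, float]:
    """Estimate P(w) = C(w) / T from raw text."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)  # C(w)
    total = len(tokens)       # T
    return {word: count / total for word, count in counts.items()}
```

For the five-token text "the cat saw the dog", the estimate for "the" is 2/5.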

11 Unigram prior probability
Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

12 Channel model probability
Also called error model probability or edit probability
Misspelled word $x = x_1, x_2, x_3, \ldots, x_m$
Correct word $w = w_1, w_2, w_3, \ldots, w_n$
$P(x \mid w)$ = probability of the edit (deletion/insertion/substitution/transposition)

13 Calculating $P(x \mid w)$
Still a research question, but it can be estimated.
There are some simple ways, e.g., a confusion matrix:
  A square table that records how many times one letter was incorrectly used instead of another.
  Usually there are four confusion matrices: deletion, insertion, substitution, and transposition.

14 Computing error probability: Confusion matrix
del[x,y]: count(xy typed as x)
ins[x,y]: count(x typed as xy)
sub[x,y]: count(y typed as x)
trans[x,y]: count(xy typed as yx)
Insertion and deletion are conditioned on the previous character

15 Confusion matrix for substitution
The cell [o,e] in a substitution confusion matrix would give the count of times that e was substituted for o.

16 Channel model
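A commonly used way to turn the four confusion matrices into a single-edit channel probability is sketched below. It follows the matrix definitions given earlier (typed character first for sub, previous character first for del and ins). Here count[...] is the number of times a character or character pair occurs in the training corpus, and the index i marking the edit position is notation assumed for this sketch.

```latex
P(x \mid w) =
\begin{cases}
\dfrac{\mathrm{del}[w_{i-1}, w_i]}{\mathrm{count}[w_{i-1} w_i]} & \text{if deletion} \\[2ex]
\dfrac{\mathrm{ins}[w_{i-1}, x_i]}{\mathrm{count}[w_{i-1}]} & \text{if insertion} \\[2ex]
\dfrac{\mathrm{sub}[x_i, w_i]}{\mathrm{count}[w_i]} & \text{if substitution} \\[2ex]
\dfrac{\mathrm{trans}[w_i, w_{i+1}]}{\mathrm{count}[w_i w_{i+1}]} & \text{if transposition}
\end{cases}
```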

17 Smoothing probabilities: Add-1 smoothing
A: character alphabet
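A minimal sketch of add-1 smoothing for the substitution case, assuming the common form (sub[x,w] + 1) / (count[w] + |A|), so that edits never seen in training still get a small non-zero probability. The function name, the dictionary layout, and the default alphabet size of 26 are assumptions for illustration.

```python
def smoothed_sub_prob(
    sub_counts: dict[tuple[str, str], int],
    char_counts: dict[str, int],
    typed: str,
    intended: str,
    alphabet_size: int = 26,
) -> float:
    """Add-1 smoothed P(typed | intended) for a single substitution.

    sub_counts[(typed, intended)] follows sub[x,y] = count(y typed as x).
    char_counts[intended] is how often the intended character occurs.
    """
    numerator = sub_counts.get((typed, intended), 0) + 1
    denominator = char_counts.get(intended, 0) + alphabet_size
    return numerator / denominator
```

With hypothetical toy counts sub[e,o] = 3 and count[o] = 74, the smoothed probability is 4/100, and a substitution that was never observed for "o" gets 1/100 instead of zero.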

18 Channel model for "acress"


21 Noisy channel for real-word spell correction
Given a sentence $w_1, w_2, w_3, \ldots, w_n$
Generate a set of candidates for each word $w_i$:
  $\text{Candidate}(w_1) = \{w_1, w'_1, w''_1, w'''_1, \ldots\}$
  $\text{Candidate}(w_2) = \{w_2, w'_2, w''_2, w'''_2, \ldots\}$
  $\text{Candidate}(w_n) = \{w_n, w'_n, w''_n, w'''_n, \ldots\}$
Choose the sequence W that maximizes P(W)

22 Incorporating context words: Context-sensitive spelling correction
Determining whether "actress" or "across" is appropriate will require looking at the context of use
A bigram language model conditions the probability of a word on (just) the previous word:
  $P(w_1 \ldots w_n) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$

23 Incorporating context words
For unigram counts, $P(w_k)$ is always non-zero if our dictionary is derived from the document collection
This won't be true of $P(w_k \mid w_{k-1})$, so we need to smooth:
  Add-1 smoothing on this conditional distribution
  Interpolate a unigram and a bigram:
    $P_{li}(w_k \mid w_{k-1}) = \lambda P_{uni}(w_k) + (1 - \lambda) P_{bi}(w_k \mid w_{k-1})$
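A minimal sketch of interpolating a unigram and a bigram estimate, with the weight `lam` placed on the unigram term. The function names, the default `lam=0.5`, and the use of raw maximum-likelihood counts are assumptions for illustration; in practice the weight would be tuned on held-out data.

```python
from collections import Counter


def train_counts(tokens: list[str]) -> tuple[Counter, Counter]:
    """Collect unigram counts and adjacent-pair (bigram) counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def interpolated_prob(
    word: str,
    prev: str,
    unigrams: Counter,
    bigrams: Counter,
    lam: float = 0.5,
) -> float:
    """lam * P_uni(word) + (1 - lam) * P_bi(word | prev)."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total
    # An unseen previous word contributes a bigram estimate of zero.
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_uni + (1 - lam) * p_bi
```

For the toy token list ["a", "b", "a", "c"], the pair ("b", "c") never occurs, yet the interpolated probability of "c" after "b" is still 0.125, because the unigram term keeps it above zero.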

24 Using a bigram language model

25 Using a bigram language model (continued)

26 Noisy channel for real-word spell correction

27 Noisy channel for real-word spell correction (continued)

28 Simplification: One error per sentence
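Under this simplification, the search space shrinks from every combination of per-word candidates to the original sentence plus all sentences that differ from it in exactly one position. A minimal sketch follows; the function name and the candidate dictionary are hypothetical and used only for illustration.

```python
from typing import Iterator


def one_error_sentences(
    words: list[str], candidates: dict[str, list[str]]
) -> Iterator[list[str]]:
    """Yield the original sentence, then every variant with one word replaced."""
    words = list(words)
    yield list(words)  # the "no error" hypothesis
    for i, word in enumerate(words):
        for alt in candidates.get(word, ()):
            if alt != word:
                # Replace only position i; all other words stay as typed.
                yield words[:i] + [alt] + words[i + 1:]
```

For the sentence ["two", "of", "thew"] with hypothetical candidates {"two": ["to", "too"], "thew": ["the"]}, this yields four sentences to score, rather than the six produced by combining all candidates freely.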

29 Where to get the probabilities
Language model
  Unigram
  Bigram
Channel model
  Same as for non-word spelling correction
  Plus need probability for no error, $P(w \mid w)$

30 Probability of no error
What is the channel probability for a correctly typed word?
  P("the" | "the")
If you have a big corpus, you can estimate this percent correct
But this value depends strongly on the application

31 Peter Norvig's "thew" example

32 Improvements to channel model
Allow richer edits (Brill and Moore 2000)
  ent → ant
  ph → f
  le → al
Incorporate pronunciation into channel (Toutanova and Moore 2002)
Incorporate device into channel
  Not all Android phones need have the same error model
  But spell correction may be done at the system level
