CS2731/ISSP2230. Natural Language Processing


1 [Title slide; the background figure shows a phrase-structure parse and a dependency parse of "I saw a girl with a telescope".] CS2731/ISSP2230 Natural Language Processing Lecture 2: Regular Expressions, Text Normalization, and Minimum Edit Distance Instructor: Prof. Diane Litman (litman@cs.pitt.edu) TA: Yuhuan Jiang (yuhuan@cs.pitt.edu) Based on the slides of Dan Jurafsky & James H. Martin

2 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 2

3 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 3

4 Regular Expressions A regular expression specifies a regular language. Think of a language as a set of strings. In future lectures we will see what a regular language is and whether there are other types of languages. For now, we only need to know that there are some languages we cannot express with regular expressions. We'll come back to this later in this lecture. 4

5 Regular Expressions A regular expression defines a language: a set of strings, a pattern of strings. Regex examples: woodchucks, [A-Z]. Example patterns matched: "interesting links to woodchucks and lemurs", "Mary Ann stopped by Mona's!", "You've left the burglar behind again!" Useful for searching in text, when we have a pattern to search for. Example in UNIX: grep. 5

6 Basic Patterns Complicated regular expressions consist of basic regular expression patterns. We ll look at composition later. For now, we learn some basic patterns first. 6

7 Basic Patterns Simple characters Form: A sequence of simple characters. Effect: Matches any string containing the substring. Regex / Example Patterns Matched: woodchucks / "interesting links to woodchucks and lemurs"; a / "Mary Ann stopped by Mona's"; ! / "You've left the burglar behind again!" 7

8 Basic Patterns Character-level Disjunction Form: A sequence of characters inside brackets. Effect: Matches any single character from the bracket. Regex / Example Patterns Matched: [wW]oodchucks / "Woodchuck is the same as woodchuck."; [abc] / "Mary Ann stopped by Mona's"; [ ] / "I can count from 1 to 5." This can get very boring, as [ABCDEFGHIJKLMNOPQRSTUVWXYZ] is a very tedious way to match any capital letter. 8

9 Basic Patterns Range Form: [x-y] Effect: Matches any character in the range x to y. Regex Example Patterns Matched [A-Z] we should call it Drenched Blossoms [a-z] YOUR USER NAME IS yuj24 [0-9] YOUR USER NAME IS yuj24 9
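To make these patterns concrete, here is a small sketch using Python's re module; the patterns mirror the slides and the test strings are made up.

```python
import re

# Character-class disjunction and ranges, as on the slides above.
print(re.findall(r"[wW]oodchucks", "Woodchucks and woodchucks"))     # both casings
print(re.findall(r"[A-Z]", "we should call it Drenched Blossoms"))   # -> ['D', 'B']
print(re.findall(r"[0-9]", "YOUR USER NAME IS yuj24"))               # -> ['2', '4']
```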

10 Basic Patterns Negation Form: Add a caret ^ right after the left bracket ([). Effect: Matches any character not in the bracketed set. Regex / Pattern Description: [^A-Z] / not an upper-case letter; [^Ss] / neither S nor s; [^\.] / not a period; [e^] / either e or a caret; a^b / the pattern a^b 10

11 Basic Patterns Negation Form: Add a caret ^ right after the left bracket ([). Effect: Matches any character not in the bracketed set. Regex / Pattern Description: [^A-Z] / not an upper-case letter; [^Ss] / neither S nor s; [^\.] / not a period; [e^] / either e or a caret; a^b / the pattern a^b 11

12 Operators Optional, Kleene *, Kleene + Regex / Pattern Description: woodchucks? / woodchuck or woodchucks; colou?r / color or colour; a* / zero or more a's; aa* / one or more a's; a+ / same effect as above; [ab]* / zero or more a's or b's, e.g., aaaa, ababab, bbbb, abbababbbaab; [0-9]+ / a sequence of digits 12

13 Wildcard The period matches any character. Regex / Pattern Description: . / any character; beg.n / begin, began, begun, ...; <html>.+</html> / any string surrounded by the <html> tag 13
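A quick sketch of the optional, Kleene star, Kleene plus, and wildcard operators in Python's re module; again, the sample strings are made up.

```python
import re

# Optional (?), Kleene plus (+), and the wildcard (.)
print(re.findall(r"colou?r", "color or colour"))    # -> ['color', 'colour']
print(re.findall(r"[0-9]+", "count from 1 to 15"))  # -> ['1', '15']
print(re.findall(r"beg.n", "begin, began, begun"))  # -> ['begin', 'began', 'begun']
```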

14 Anchors Caret ^: start of line. Dollar $: end of line. \b: word boundary (a word = any sequence of digits/underscores/letters). The 3 uses of caret: start of a line; negating a bracketed pattern; just a literal caret. Regex / Pattern Description: ^The / matches the word The only at the start of a line; ' $' (a space then $) / matches a space at the end of a line; ^The dog\.$ / matches a line that contains only the phrase The dog; \bthe\b / matches the but not other 14

15 Disjunction Operator Q: If we want to search for cat or dog, can we do this? [catdog] A: No. This would match any single character that is c, a, t, d, o, or g. Solution: the disjunction operator, the vertical bar |. a|b matches a or b; cat|dog matches cat or dog. 15

16 Example: Constructing a Regex Goal: find cases of the English article the the 16

17 Example: Constructing a Regex Goal: find cases of the English article the [tT]he 17

18 Example: Constructing a Regex Goal: find cases of the English article the \b[tT]he\b 18

19 Example: Constructing a Regex Goal: find cases of the English article the [tT]he 19

20 Example: Constructing a Regex Goal: find cases of the English article the [^a-zA-Z][tT]he[^a-zA-Z] 20

21 Example: Constructing a Regex Goal: find cases of the English article the (^|[^a-zA-Z])[tT]he[^a-zA-Z] 21

22 Example: Constructing a Regex Goal: find cases of the English article the Solution: (^|[^a-zA-Z])[tT]he[^a-zA-Z] 22
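A quick way to check the final pattern with Python's re module; this is only a sketch, and the test sentence is made up.

```python
import re

# The final pattern from the slide: 'the' not embedded inside another word.
pattern = r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]"
text = "The dog chased the other dog there; the end."
for m in re.finditer(pattern, text):
    print(repr(m.group()))  # matches around 'The'/'the', but not 'other' or 'there'
```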

23 Error Types What did we just go through? We refined the regex by handling errors In fact, we handled two types of errors False positives: strings that we incorrectly matched E.g., other, or there False negatives: strings that we incorrectly missed E.g., The Precision and Recall By minimizing false positives, we increase precision By minimizing false negatives, we increase recall These two types of errors appear everywhere in NLP. 23

24 Quiz Write a regular expression for a^n b^n, i.e., any string consisting of some number of a's followed by the same number of b's. There is no regex for this one: the language a^n b^n is not regular. In future lectures, you will be taught about the Chomsky hierarchy of languages. 24

25 Why This Matters Let's try to parse an HTML document using regular expressions. An example HTML document: <div><div><div>...</div></div></div> This is basically <div>^n ... </div>^n, which is not regular. Test.html: <div> <div> <div>... </div> </div> </div> 25

26 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 26

27 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 27

28 Corpus Preprocessing Almost every NLP task needs to do some preprocessing on the corpus: Breaking a document into sentences Breaking the sentence into individual words Normalizing word formats 28

29-33 Corpus Preprocessing [Figure, built up over slides 29-33: a preprocessing pipeline. A document is broken into sentences by a sentence splitter, each sentence is broken into words by a word segmenter, the words are reduced to base forms by a lemmatizer, and the result is passed to the downstream system.] 33

34 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 34

35 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 35

36 Sentence Segmentation Simple! Just cut the document whenever you see a period, question mark, or exclamation mark. No! Consider using the period (.) as the sentence boundary: "This is a Ph.D. student from the U.S. He does experiments comparing rules vs. probabilities. He enjoys 99.9% of his life." It would be smarter to classify every period as EndOfSentence or NotEndOfSentence. Possible models: hand-crafted rules, regular expressions, machine learning; a rule-based sketch follows. 36
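As a rough baseline (not the classifier the slides describe), one can encode such a rule with a regular expression; a sketch in Python:

```python
import re

# A crude rule: split after . ! ? only when followed by whitespace
# and a capital letter.
text = ("This is a Ph.D. student from the U.S. He does experiments comparing "
        "rules vs. probabilities. He enjoys 99.9% of his life.")
for sentence in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text):
    print(sentence)
```

Such a rule still makes both kinds of errors: it can wrongly split after an abbreviation that happens to precede a capitalized word, and it can miss boundaries before lowercase text, which is why a learned classifier is preferred.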

37 Classification by Decision Tree Input: a period; the sentence in which the period is. Output: Yes/No Algorithm: 37

38 Classification by Decision Tree Input: a period; the sentence in which the period is. Output: Yes/No Algorithm: Features 38

39 Decision Tree Features Categorical features: those on the previous slide; the case of the word containing the '.' {Upper, Lower, Cap, Num}; the case of the word after the '.' {Upper, Lower, Cap, Num}. Numerical features: the length of the word containing the '.'; Prob(the word with '.' occurs at the end of a sentence); Prob(the word with '.' occurs at the beginning of a sentence). 39

40 Decision Tree Implementation A decision tree is just a nested if-then-else statement. The interesting research question is how to choose the features. Setting up the structure by hand is often too hard, for two reasons: hand-building is only possible for very simple features with a limited domain size, and thresholds for numerical features are hard to pick. Instead, we learn the structure of the decision tree by machine learning from a training corpus. Simplest: decision tree induction using information theory. Probabilistic classifiers also work: logistic regression, SVM, NN, etc. A learning sketch follows. 40
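A minimal sketch (not the course's code) of learning such a classifier with scikit-learn; the feature set and tiny training set below are hypothetical, simplified versions of the features listed above.

```python
from sklearn.tree import DecisionTreeClassifier

def features(word_with_period, next_word):
    # Encode one candidate period as a feature vector.
    return [
        int(word_with_period[:1].isupper()),   # case of the word containing '.'
        int(next_word[:1].isupper()),          # case of the word after '.'
        len(word_with_period),                 # length of the word containing '.'
        int(word_with_period.count(".") > 1),  # looks like an abbreviation (Ph.D.)
    ]

# Tiny, made-up training set: (word with '.', next word) -> end of sentence?
examples = [("Ph.D.", "student", 0), ("vs.", "probabilities", 0),
            ("U.S.", "He", 1), ("probabilities.", "He", 1)]
X = [features(w, nxt) for w, nxt, _ in examples]
y = [label for _, _, label in examples]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([features("life.", "He")]))  # most likely [1]: end of sentence
```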

41 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 41

42 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 42

43 Q: What Is a Word? Is he's one word? If not, how do we break it? he's → he 's / he is / he has... Do we really want to split entity names (New York)? 43

44 Word Segmentation Specifications E.g., the Stanford Chinese word segmenter provides segmentation specification options: the Chinese Penn Treebank standard and the Peking University standard. It also provides different choices, e.g., whether to separate surnames from first names (江泽民 → 江 泽民, i.e., Surname FirstName, or no change) and whether to separate surnames from job titles (江主席 → 江 主席, i.e., Surname JobTitle, or no change). 44

45 Tokens vs. Types A word type is an element of the vocabulary (think of the vocabulary as a set of distinct words). A token is an instance of a type in actual text. E.g., a simple corpus: "Here are some words. Here are some more words." How many tokens? 11. How many types? 6: {Here, are, some, words, ., more} 45
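For concreteness, a tiny sketch that counts tokens and types in the toy corpus above (with punctuation already separated by spaces):

```python
# Count tokens and types in the toy corpus from the slide.
corpus = "Here are some words . Here are some more words ."
tokens = corpus.split()
types = set(tokens)
print(len(tokens), len(types))  # -> 11 6
print(sorted(types))
```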

46 Tokens vs. Types Compare the number of tokens with the number of types in actual corpora: Shakespeare: 884,000 tokens, 31,000 types. Brown corpus: 1 million tokens, 38 thousand types. Switchboard telephone conversations: 2.4 million tokens, 20 thousand types. COCA: 440 million tokens, 2 million types. Google N-grams: 1 trillion tokens, 13 million types. When we talk about the number of words in a language, we are generally referring to word types. 46

47 An Empirical Law Heaps' Law (or Herdan's Law) describes the growth of the vocabulary size V (the number of word types) with the number of tokens n: V(n) = K · n^β It reflects the type-token relation. K and β are parameters estimated from data, affected by factors such as genre. For English corpora, K is usually between 10 and 100, and β is roughly 0.4 to 0.6 (roughly a square root). 47
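A quick sketch of what the law predicts; the parameter values below are only assumed to lie in the typical English range.

```python
# Herdan/Heaps' law: V(n) = K * n**beta, with assumed parameters K=30, beta=0.5.
K, beta = 30, 0.5
for n in [1_000, 100_000, 10_000_000]:
    print(f"{n:>10} tokens -> about {int(K * n ** beta):>7} types")
```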

48 Issues in Tokenization of English Finland's → Finland, Finlands, Finland's? what're → what are, what're, what re? Hewlett-Packard → Hewlett Packard? state-of-the-art → state of the art? lowercase → lower-case / lowercase / lower case? San Francisco: should we split it? m.p.h. → miles per hour? with respect to → one preposition? 48

49 Tokenization in Other Languages French: L'ensemble: is this one token or two? L? L'? Le? German: noun compounds are not segmented: Lebensversicherungsgesellschaftsangestellter 'life insurance company employee'. A German information retrieval system needs a compound splitter. 49

50 Tokenization in Other Languages Chinese: there are no spaces between words, so segmentation is needed. Different segmentation standards (e.g., different granularity) may yield different results in a downstream system that relies on word segmentation, and that is pretty much every Chinese NLP model. Unsegmented: 莎拉波娃现在居住在美国东南部的佛罗里达 Segmented: 莎拉波娃 现在 居住在 美国 东南部 的 佛罗里达 (Sharapova now lives in US southeastern Florida.) 50

51 Tokenization in Other Languages The problem is more complicated in Japanese, because multiple scripts exist and people use them interchangeably at will, producing alternative written forms of the same sentence. Glossed example: Here [loc] tobacco [obj] smoke don't. ('Don't smoke here.') 51

52 Tokenization in Other Languages Name tokenization can also be tricky If the person is from China: Surname FirstName If the person is from Japan: Surname FirstName 52

53 Word Segmentation This is what tokenization is called in languages such as Chinese, Japanese, Arabic, etc. Take Chinese as an example: Chinese words are composed of characters. Characters are generally one syllable and one morpheme (a morpheme is the smallest meaning-bearing unit). The average word is 2.4 characters long. Standard baseline for evaluating word segmentation algorithms: Maximum Matching. 53

54 Algorithm: Maximum Matching Input: a lexicon L of Chinese words; a Chinese string s. Output: the tokenized string. Algorithm: Step 1: Start a pointer p at the beginning of the string; Step 2: Find the longest word w in L that matches the string starting at p; Step 3: Move p past w in s; Step 4: Go to Step 2 (until p reaches the end of s). 54

55 (Source: D. Jurafsky) Algorithm: Maximum Matching Doesn't generally work in English! Thecatinthehat → The cat in the hat (correct). thetabledownthere → theta bled own there (instead of the table down there). But it works well in Chinese: 莎拉波娃现在居住在美国东南部的佛罗里达 → 莎拉波娃 现在 居住在 美国 东南部 的 佛罗里达 (Sharapova now lives in US southeastern Florida). Modern probabilistic approaches are even better. A sketch of the algorithm follows. 55
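A minimal sketch of MaxMatch in Python; the toy lexicon is made up, and it deliberately includes theta, bled, and own so the English failure case above is reproduced.

```python
# Greedy maximum matching over a lexicon.
def max_match(s, lexicon):
    words, p = [], 0
    while p < len(s):
        for end in range(len(s), p, -1):   # try the longest candidate first
            if s[p:end] in lexicon:
                words.append(s[p:end])
                p = end
                break
        else:                              # no word matches at p
            words.append(s[p])
            p += 1
    return words

lexicon = {"the", "theta", "table", "bled", "down", "own", "there",
           "cat", "in", "hat"}
print(max_match("thetabledownthere", lexicon))  # -> ['theta', 'bled', 'own', 'there']
print(max_match("thecatinthehat", lexicon))     # -> ['the', 'cat', 'in', 'the', 'hat']
```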

56 Tools for Tokenization Read Ken Church's UNIX for Poets for UNIX commands. Try libraries such as NLTK or Stanford CoreNLP to learn how to do sentence splitting and tokenization in English (see the sketch below). Try libraries such as NLTK or the Stanford Word Segmenter for word segmentation in Chinese and Arabic. 56
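For example, a minimal sketch with NLTK; this assumes the punkt tokenizer models have been downloaded, e.g. via nltk.download('punkt').

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "This is a Ph.D. student from the U.S. He enjoys 99.9% of his life."
print(sent_tokenize(text))  # sentence segmentation
print(word_tokenize(text))  # word tokenization
```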

57 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 57

58 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 58

59 Normalization Q: Why? A: For information retrieval purposes, the following two must have the same form: Indexed text (what we have in the system) Query terms (what the user types in a search box) E.g.: we want to match U.S.A. and USA. 59

60 Try to Differentiate Lemmatizing a word: removing all inflections. Stemming a word: reducing it to its stem. 60

61 Case Folding (Lowercasing) Convert all letters to lower case, since users tend to use lower case in the search box. Possible exception: upper case in mid-sentence? E.g., General Motors, Fed vs. fed. In some tasks, case is helpful: sentiment analysis (NO! vs. No.), machine translation (US vs. us). Avoid lowercasing in these cases. 61

62 Lemmatization Reduces inflections/variant forms to the base form: am, are, is, was, were → be; car, cars, car's, cars' → car. "The boy's cars are different colors." → "The boy car be different color." Sometimes lemmatization is unwanted. In Spanish-to-English MT, it is important to differentiate quiero (I want) and quieres (you want), but they have the same lemma, querer. 62
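A small sketch with NLTK's WordNet lemmatizer; this assumes the WordNet data has been downloaded, e.g. via nltk.download('wordnet'), and note that it needs the part of speech to handle verbs like "are".

```python
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("are", pos="v"))     # -> be
print(lem.lemmatize("cars", pos="n"))    # -> car
print(lem.lemmatize("colors", pos="n"))  # -> color
```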

63 Stemming Some terminology: A morpheme is the smallest meaningful unit in a language. The stem of a word is the core meaning-bearing unit (it may be composed of more than one morpheme). Affixes are bits and pieces that adhere to stems, often associated with grammatical functions. Stemming is crude chopping of affixes. E.g., automate, automates, automatic, automation are all reduced to automat. "for example compressed and compression are both accepted as equivalent to compress." → "for exampl compress and compress ar both accept as equival to compress" 63

64 Porter's Stemmer Porter's stemming algorithm is the most commonly used English stemmer. Step 1a: sses → ss (caresses → caress); ies → i (ponies → poni); ss → ss (caress → caress); s → ø (cats → cat). Step 1b: (*v*)ing → ø (walking → walk, but sing → sing); (*v*)ed → ø (plastered → plaster). Step 2 (for long stems): ational → ate (relational → relate); izer → ize (digitizer → digitize); ator → ate (operator → operate). Step 3 (for longer stems): al → ø (revival → reviv); able → ø (adjustable → adjust); ate → ø (activate → activ). 64

65 Porter's Stemmer Porter's stemming algorithm is the most commonly used English stemmer. Step 1b: strip ing only if there is a vowel in the (*v*) part: (*v*)ing → ø (walking → walk, but sing → sing); (*v*)ed → ø (plastered → plaster). 65
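A quick sketch of running NLTK's implementation of the Porter stemmer; its outputs for some words may differ slightly from the per-step illustrations above, since the full algorithm applies several steps in sequence.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["compressed", "compression", "caresses", "walking", "plastered"]:
    print(w, "->", ps.stem(w))  # e.g. compressed -> compress, compression -> compress
```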

66 Dealing with Complex Morphology Some languages require complex morpheme segmentation, such as Turkish: Uygarlastiramadiklarimizdanmissinizcasina Meaning: '(behaving) as if you are among those whom we could not civilize' Uygar = civilized + las = become + tir = cause + ama = not able + dik = past + lar = plural + imiz = [+1pl.] + dan = abl + mis = [past] + siniz = [+2pl.] + casina = as if 66

67 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 67

68 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 68

69 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Spelling correction: find a word similar to what the user typed. User types: graffe. What might be the user's intention? graf, graft, grail, giraffe, graffiti. Which one is closest to graffe? 69

70 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Evaluation of MT or ASR Reference Spokesman confirms senior government adviser was shot Hypothesis Spokesman said the senior adviser was shot dead 70

71 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Evaluation of MT or ASR Reference Spokesman confirms senior government adviser was shot Hypothesis Spokesman said the senior adviser was shot dead Substitute Ins. Delete Ins. 71

72 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Named entity extraction and entity coreference resolution: "IBM Inc. announced today" vs. "IBM profits"; "Stanford President John Hennessy announced yesterday" vs. "for Stanford University President John Hennessy". 72

73 Measuring Similarity In computational biology: Aligning two sequences of nucleobases AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC 73

74 Minimum Edit Distance The minimum edit distance between two strings is the minimum number of editing operations (e.g., deletion, insertion, substitution) needed to transform one string into the other. 74

75 Example: Edits I N T E N T I O N 75

76 Example: Edits N T E N T I O N delete 76

77 Example: Edits E T E N T I O N delete substitute 77

78 Example: Edits E X E N T I O N delete substitute substitute 78

79 Example: Edits E X E N U T I O N delete substitute substitute insert 79

80 Example: Edits E X E C U T I O N delete substitute substitute insert substitute 80

81 Sequence of Actions I N T E N T I O N N T E N T I O N E T E N T I O N E X E N T I O N E X E N U T I O N E X E C U T I O N delete substitute substitute insert substitute 81

82 Visualizing String Distances Alignment: a visualization of edits 82

83 Levenshtein Distance A typical cost assignment for minimum edit distance: deletion costs 1, insertion costs 1, substitution costs 2 (of course, substituting a letter with the same letter costs 0). With these costs, the Levenshtein distance between intention and execution is 8. We can also disallow substitution: this is equivalent to allowing substitution with a cost of 2, since a substitution = a deletion + an insertion. 83

84 MED as a Search Problem State: a word (a sequence of characters) which we transform. Actions: insert any character at any position; delete any character; substitute any character with any character. Costs: insertion costs 1; deletion costs 1; substitution costs 2 if substituting with a different character, otherwise 0. Initial state: the source word. Goal state: the target word. 84

85 MED as a Search Problem [Figure: from the state INTENTION, a delete leads to NTENTION, an insert leads to INTECNTION, and a substitute leads to INXENTION.] 85

86 Minimum Edit Distance Definition Given two strings, a source string x of length N and a target string y of length M, define the function D(i, j) as the minimum edit distance between the prefixes x_1:i and y_1:j. To obtain the MED between x and y, simply evaluate D(N, M). What are the subproblems? 86

87 Key: Find Subproblems Key idea: solve the current problem using subproblems. To solve tongue → lingua: If we already know how to do tongue → lingu, then we would only need to do an insertion of a. If we already know how to do tongu → lingua, then we would only need to do a deletion of e. If we already know how to do tongu → lingu, then we would only need to do a substitution of e with a. 87

88 Key: Find Subproblems Key idea: solve the current problem using subproblems. To compute the MED of tongue → lingua: If we already know the MED of tongue → lingu, then we would only need to pay the cost of inserting a. If we already know the MED of tongu → lingua, then we would only need to pay the cost of deleting e. If we already know the MED of tongu → lingu, then we would only need to pay the cost of substituting e with a. 88

89 Key: Find Subproblems Key idea: solve the current problem using subproblems. To compute the MED of tongue → lingua, use the answers to these subproblems: the MED of tongue → lingu (then pay the cost of inserting a), the MED of tongu → lingua (then pay the cost of deleting e), and the MED of tongu → lingu (then pay the cost of substituting e with a). 89

90 Why Does This Work? Let p be an optimal path in the state space beginning at the state INTENTION and ending at the state EXECUTION. If an intermediate state EXENTION is on p, then the portion of p from INTENTION to EXENTION must also be optimal. Proof: if there were a shorter path p' from INTENTION to EXENTION, we could use p' to form a shorter overall path, making p not optimal. Contradiction! The optimal solution to the subproblem is nested inside the optimal solution to the current problem. [Figure: the path p passes through INTENTION, EXENTION, and EXECUTION.] 90

91 Recursive Relation Given two strings, a source string x of length N and a target string y of length M, define the function D(i, j) as the minimum edit distance between the prefixes x_1:i and y_1:j. Recurrence: D(i, j) = min of: D(i-1, j) + C_delete; D(i, j-1) + C_insert; D(i-1, j-1) + C_substitute if x_i ≠ y_j, or D(i-1, j-1) + 0 if x_i = y_j. Base cases: D(i, 0) = i · C_delete; D(0, j) = j · C_insert. 91

92 Say No to Recursion So should we just implement the D function recursively, as the previous slide describes? No! Why was recursion not a problem in divide-and-conquer algorithms, then? Because in algorithms like Merge Sort the subproblem is half the size of the current problem (it reduces to the base case very quickly), while here the size of the subproblem is not substantially reduced. Naive recursion is very slow for MED! Why not memoize the recursive function? You pay the cost of maintaining call-stack frames, while the method we introduce here simply fills a 2D table. 92

93 Dynamic Programming Let D be an N-by-M matrix (with an extra row 0 and column 0). Initialization: D(i, 0) = i · C_delete for i = 1 to N; D(0, j) = j · C_insert for j = 1 to M. For i = 1 to N and j = 1 to M: D(i, j) = min of: D(i-1, j) + C_delete; D(i, j-1) + C_insert; D(i-1, j-1) + C_substitute if x_i ≠ y_j, or + 0 if x_i = y_j. Return D(N, M). 93
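A minimal Python sketch of this algorithm with Levenshtein costs (delete = insert = 1, substitute = 2, or 0 on a match):

```python
# Minimum edit distance by dynamic programming.
def min_edit_distance(source, target, c_del=1, c_ins=1, c_sub=2):
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * c_del
    for j in range(1, m + 1):
        D[0][j] = j * c_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else c_sub
            D[i][j] = min(D[i - 1][j] + c_del,    # delete source[i-1]
                          D[i][j - 1] + c_ins,    # insert target[j-1]
                          D[i - 1][j - 1] + sub)  # substitute (or match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # -> 8
```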

94 Filling the DP Table Topological order over the sub-problems: [Figure: the cell D(i, j) is computed from its three already-filled neighbors: D(i-1, j) plus a delete, D(i, j-1) plus an insert, and D(i-1, j-1) plus a substitute (or a match), taking the minimum.] 94

95-115 Filling the Table [Slides 95-115 step through filling the DP table for source INTENTION (rows) and target EXECUTION (columns). Row 0 and column 0 are first initialized to i · C_delete and j · C_insert (the values 0 through 9); the remaining cells are then filled one by one with the min-of-three recurrence, ending with D(9, 9) = 8, the Levenshtein distance between intention and execution.] 115

116 Dynamic Programming (recap) Let D be an N-by-M matrix. Initialization: D(i, 0) = i · C_delete; D(0, j) = j · C_insert. For i = 1 to N and j = 1 to M: D(i, j) = min of D(i-1, j) + C_delete, D(i, j-1) + C_insert, and D(i-1, j-1) + C_substitute (or + 0 if x_i = y_j). Return D(N, M). 116

117 Complexities Time: O(NM). Space: O(NM). (Both follow directly from the algorithm on the previous slide: every one of the N · M cells is filled once, in constant time.) 117

118-141 Obtaining Alignment (Backtracing) [Slides 118-141 step through backtracing on the DP table for source DOG (rows) and target FROG (columns). While each cell is filled, we also record which of the three moves (delete, insert, substitute/match) achieved the minimum; starting from the bottom-right cell, where D(3, 4) = 3, and following these records back to D(0, 0) reads off an optimal alignment, e.g. inserting F, substituting D with R, and keeping O and G.] 141
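A sketch of backtracing in Python: the table-filling code above is extended to record, for each cell, which move achieved the minimum, and the alignment is then read off by walking those pointers from D(N, M) back to D(0, 0).

```python
# Minimum edit distance plus alignment via backpointers.
def align(source, target, c_del=1, c_ins=1, c_sub=2):
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i * c_del, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j * c_ins, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else c_sub
            D[i][j], ptr[i][j] = min((D[i - 1][j] + c_del, "del"),
                                     (D[i][j - 1] + c_ins, "ins"),
                                     (D[i - 1][j - 1] + sub, "sub"))
    pairs, i, j = [], n, m          # follow the backpointers to the origin
    while i > 0 or j > 0:
        if ptr[i][j] == "del":
            pairs.append((source[i - 1], "-")); i -= 1
        elif ptr[i][j] == "ins":
            pairs.append(("-", target[j - 1])); j -= 1
        else:                       # substitution or match
            pairs.append((source[i - 1], target[j - 1])); i, j = i - 1, j - 1
    return D[n][m], list(reversed(pairs))

print(align("dog", "frog"))  # distance 3 and one optimal alignment
```

The backtrace itself takes O(N+M) steps, matching the cost noted on the next slide.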

142 Time Complexity Bookkeeping O(NM) Backtracing costs O(N+M) 142

143 A Valid Distance Laws that a valid distance measure should respect 143

144 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 144

145 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Q&A Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 145


More information

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they

More information

What we have done so far

What we have done so far What we have done so far DFAs and regular languages NFAs and their equivalence to DFAs Regular expressions. Regular expressions capture exactly regular languages: Construct a NFA from a regular expression.

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Extracting Information from Text

Extracting Information from Text Extracting Information from Text Research Seminar Statistical Natural Language Processing Angela Bohn, Mathias Frey, November 25, 2010 Main goals Extract structured data from unstructured text Training

More information

Automata Theory and Formal Grammars: Lecture 1

Automata Theory and Formal Grammars: Lecture 1 Automata Theory and Formal Grammars: Lecture 1 Sets, Languages, Logic Automata Theory and Formal Grammars: Lecture 1 p.1/72 Sets, Languages, Logic Today Course Overview Administrivia Sets Theory (Review?)

More information

Probabilistic Context-free Grammars

Probabilistic Context-free Grammars Probabilistic Context-free Grammars Computational Linguistics Alexander Koller 24 November 2017 The CKY Recognizer S NP VP NP Det N VP V NP V ate NP John Det a N sandwich i = 1 2 3 4 k = 2 3 4 5 S NP John

More information

Advanced Natural Language Processing Syntactic Parsing

Advanced Natural Language Processing Syntactic Parsing Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Language Models, Graphical Models Sameer Maskey Week 13, April 13, 2010 Some slides provided by Stanley Chen and from Bishop Book Resources 1 Announcements Final Project Due,

More information

Spatial Role Labeling CS365 Course Project

Spatial Role Labeling CS365 Course Project Spatial Role Labeling CS365 Course Project Amit Kumar, akkumar@iitk.ac.in Chandra Sekhar, gchandra@iitk.ac.in Supervisor : Dr.Amitabha Mukerjee ABSTRACT In natural language processing one of the important

More information

Statistical Substring Reduction in Linear Time

Statistical Substring Reduction in Linear Time Statistical Substring Reduction in Linear Time Xueqiang Lü Institute of Computational Linguistics Peking University, Beijing lxq@pku.edu.cn Le Zhang Institute of Computer Software & Theory Northeastern

More information

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata CISC 4090: Theory of Computation Chapter Regular Languages Xiaolan Zhang, adapted from slides by Prof. Werschulz Section.: Finite Automata Fordham University Department of Computer and Information Sciences

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department 1 Reference Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Computer Sciences Department 3 ADVANCED TOPICS IN C O M P U T A B I L I T Y

More information

CS1800: Strong Induction. Professor Kevin Gold

CS1800: Strong Induction. Professor Kevin Gold CS1800: Strong Induction Professor Kevin Gold Mini-Primer/Refresher on Unrelated Topic: Limits This is meant to be a problem about reasoning about quantifiers, with a little practice of other skills, too

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Classification & Information Theory Lecture #8

Classification & Information Theory Lecture #8 Classification & Information Theory Lecture #8 Introduction to Natural Language Processing CMPSCI 585, Fall 2007 University of Massachusetts Amherst Andrew McCallum Today s Main Points Automatically categorizing

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

how to *do* computationally assisted research

how to *do* computationally assisted research how to *do* computationally assisted research digital literacy @ comwell Kristoffer L Nielbo knielbo@sdu.dk knielbo.github.io/ March 22, 2018 1/30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 class Person(object):

More information