CS2731/ISSP2230. Natural Language Processing


1 [Title slide; the background figure shows a phrase-structure parse and a dependency parse of "I saw a girl with a telescope".] CS2731/ISSP2230 Natural Language Processing Lecture 2: Regular Expressions, Text Normalization, and Minimum Edit Distance Instructor: Prof. Diane Litman (litman@cs.pitt.edu) TA: Yuhuan Jiang (yuhuan@cs.pitt.edu) Based on the slides of Dan Jurafsky & James H. Martin

2 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 2

3 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 3

4 Regular Expressions A regular expression specifies a regular language. Think of a language as a set of strings. In future lectures we will see what a regular language is and whether there are other types of languages. For now, we only need to know that there are some languages we cannot express with regular expressions. We'll come back to this later in this lecture. 4

5 Regular Expressions A regular expression defines a language: a set of strings, a pattern of strings. Regex examples: woodchucks, [A-Z]. Example patterns matched: "interesting links to woodchucks and lemurs", "Mary Ann stopped by Mona's!", "You've left the burglar behind again!" Useful for searching in text, when we have a pattern to search for. Example in UNIX: grep. 5

6 Basic Patterns Complicated regular expressions consist of basic regular expression patterns. We ll look at composition later. For now, we learn some basic patterns first. 6

7 Basic Patterns Simple characters Form: A sequence of simple characters. Effect: Matches any string containing the substring. Regex / Example Patterns Matched: woodchucks / "interesting links to woodchucks and lemurs"; a / "Mary Ann stopped by Mona's"; ! / "You've left the burglar behind again!" 7

8 Basic Patterns Character-level Disjunction Form: A sequence of characters inside brackets. Effect: Matches any single character from the bracket. Regex / Example Patterns Matched: [wW]oodchucks / "Woodchuck is the same as woodchuck."; [abc] / "Mary Ann stopped by Mona's"; [ ] / "I can count from 1 to 5." This can get very boring, as [ABCDEFGHIJKLMNOPQRSTUVWXYZ] is a very tedious way to match any capital letter. 8

9 Basic Patterns Range Form: [x-y] Effect: Matches any character in the range x to y. Regex Example Patterns Matched [A-Z] we should call it Drenched Blossoms [a-z] YOUR USER NAME IS yuj24 [0-9] YOUR USER NAME IS yuj24 9
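To make these patterns concrete, here is a small sketch using Python's re module; the patterns mirror the slides and the test strings are made up.

```python
import re

# Character-class disjunction and ranges, as on the slides above.
print(re.findall(r"[wW]oodchucks", "Woodchucks and woodchucks"))     # both casings
print(re.findall(r"[A-Z]", "we should call it Drenched Blossoms"))   # -> ['D', 'B']
print(re.findall(r"[0-9]", "YOUR USER NAME IS yuj24"))               # -> ['2', '4']
```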

10 Basic Patterns Negation Form: Add a caret ^ right after the left bracket ([). Effect: Matches any character not in the bracketed set. Regex / Pattern Description: [^A-Z] / not an upper-case letter; [^Ss] / neither S nor s; [^\.] / not a period; [e^] / either e or a caret; a^b / the pattern a^b 10

11 Basic Patterns Negation Form: Add a caret ^ right after the left bracket ([). Effect: Matches any character not in the bracketed set. Regex / Pattern Description: [^A-Z] / not an upper-case letter; [^Ss] / neither S nor s; [^\.] / not a period; [e^] / either e or a caret; a^b / the pattern a^b 11

12 Operators Optional, Kleene *, Kleene + Regex / Pattern Description: woodchucks? / woodchuck or woodchucks; colou?r / color or colour; a* / zero or more a's; aa* / one or more a's; a+ / same effect as above; [ab]* / zero or more a's or b's, e.g., aaaa, ababab, bbbb, abbababbbaab; [0-9]+ / a sequence of digits 12

13 Wildcard The period matches any character. Regex / Pattern Description: . / any character; beg.n / begin, began, begun, ...; <html>.+</html> / any string surrounded by the <html> tag 13
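A quick sketch of the optional, Kleene star, Kleene plus, and wildcard operators in Python's re module; again, the sample strings are made up.

```python
import re

# Optional (?), Kleene plus (+), and the wildcard (.)
print(re.findall(r"colou?r", "color or colour"))    # -> ['color', 'colour']
print(re.findall(r"[0-9]+", "count from 1 to 15"))  # -> ['1', '15']
print(re.findall(r"beg.n", "begin, began, begun"))  # -> ['begin', 'began', 'begun']
```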

14 Anchors Caret ^: start of line. Dollar $: end of line. \b: word boundary (a word = any sequence of digits/underscores/letters). The 3 uses of caret: start of a line; negating a bracketed pattern; just a literal caret. Regex / Pattern Description: ^The / matches the word The only at the start of a line; ' $' (a space then $) / matches a space at the end of a line; ^The dog\.$ / matches a line that contains only the phrase The dog; \bthe\b / matches the but not other 14

15 Disjunction Operator Q: If we want to search for cat or dog, can we do this? [catdog] A: No. This would match any single character that is c, a, t, d, o, or g. Solution: the disjunction operator, the vertical bar |. a|b matches a or b; cat|dog matches cat or dog. 15

16 Example: Constructing a Regex Goal: find cases of the English article the the 16

17 Example: Constructing a Regex Goal: find cases of the English article the [tT]he 17

18 Example: Constructing a Regex Goal: find cases of the English article the \b[tT]he\b 18

19 Example: Constructing a Regex Goal: find cases of the English article the [tT]he 19

20 Example: Constructing a Regex Goal: find cases of the English article the [^a-zA-Z][tT]he[^a-zA-Z] 20

21 Example: Constructing a Regex Goal: find cases of the English article the (^|[^a-zA-Z])[tT]he[^a-zA-Z] 21

22 Example: Constructing a Regex Goal: find cases of the English article the Solution: (^|[^a-zA-Z])[tT]he[^a-zA-Z] 22
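A quick way to check the final pattern with Python's re module; this is only a sketch, and the test sentence is made up.

```python
import re

# The final pattern from the slide: 'the' not embedded inside another word.
pattern = r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]"
text = "The dog chased the other dog there; the end."
for m in re.finditer(pattern, text):
    print(repr(m.group()))  # matches around 'The'/'the', but not 'other' or 'there'
```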

23 Error Types What did we just go through? We refined the regex by handling errors In fact, we handled two types of errors False positives: strings that we incorrectly matched E.g., other, or there False negatives: strings that we incorrectly missed E.g., The Precision and Recall By minimizing false positives, we increase precision By minimizing false negatives, we increase recall These two types of errors appear everywhere in NLP. 23

24 Quiz Write a regular expression for a^n b^n, i.e., any string consisting of some number of a's followed by the same number of b's. There is no regex for this one: the language a^n b^n is not regular. In future lectures, you will be taught about the Chomsky hierarchy of languages. 24

25 Why This Matters Let's try to parse an HTML document using regular expressions. An example HTML document: <div><div><div>...</div></div></div> This is basically <div>^n ... </div>^n, which is not regular. Test.html: <div> <div> <div>... </div> </div> </div> 25

26 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 26

27 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 27

28 Corpus Preprocessing Almost every NLP task needs to do some preprocessing on the corpus: Breaking a document into sentences Breaking the sentence into individual words Normalizing word formats 28

29-33 Corpus Preprocessing [Figure, built up over slides 29-33: a preprocessing pipeline. A document is broken into sentences by a sentence splitter, each sentence is broken into words by a word segmenter, the words are reduced to base forms by a lemmatizer, and the result is passed to the downstream system.] 33

34 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 34

35 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 35

36 Sentence Segmentation Simple! Just cut the document whenever you see a period, question mark, or exclamation mark. No! Consider using the period (.) as the sentence boundary: "This is a Ph.D. student from the U.S. He does experiments comparing rules vs. probabilities. He enjoys 99.9% of his life." It would be smarter to classify every period as EndOfSentence or NotEndOfSentence. Possible models: hand-crafted rules, regular expressions, machine learning; a rule-based sketch follows. 36
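As a rough baseline (not the classifier the slides describe), one can encode such a rule with a regular expression; a sketch in Python:

```python
import re

# A crude rule: split after . ! ? only when followed by whitespace
# and a capital letter.
text = ("This is a Ph.D. student from the U.S. He does experiments comparing "
        "rules vs. probabilities. He enjoys 99.9% of his life.")
for sentence in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text):
    print(sentence)
```

Such a rule still makes both kinds of errors: it can wrongly split after an abbreviation that happens to precede a capitalized word, and it can miss boundaries before lowercase text, which is why a learned classifier is preferred.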

37 Classification by Decision Tree Input: a period; the sentence in which the period is. Output: Yes/No Algorithm: 37

38 Classification by Decision Tree Input: a period; the sentence in which the period is. Output: Yes/No Algorithm: Features 38

39 Decision Tree Features Categorical features: those on the previous slide; the case of the word containing the '.' {Upper, Lower, Cap, Num}; the case of the word after the '.' {Upper, Lower, Cap, Num}. Numerical features: the length of the word containing the '.'; Prob(the word with '.' occurs at the end of a sentence); Prob(the word with '.' occurs at the beginning of a sentence). 39

40 Decision Tree Implementation A decision tree is just a nested if-then-else statement. The interesting research question is how to choose the features. Setting up the structure by hand is often too hard, for two reasons: hand-building is only possible for very simple features with a limited domain size, and thresholds for numerical features are hard to pick. Instead, we learn the structure of the decision tree by machine learning from a training corpus. Simplest: decision tree induction using information theory. Probabilistic classifiers also work: logistic regression, SVM, NN, etc. A learning sketch follows. 40
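A minimal sketch (not the course's code) of learning such a classifier with scikit-learn; the feature set and tiny training set below are hypothetical, simplified versions of the features listed above.

```python
from sklearn.tree import DecisionTreeClassifier

def features(word_with_period, next_word):
    # Encode one candidate period as a feature vector.
    return [
        int(word_with_period[:1].isupper()),   # case of the word containing '.'
        int(next_word[:1].isupper()),          # case of the word after '.'
        len(word_with_period),                 # length of the word containing '.'
        int(word_with_period.count(".") > 1),  # looks like an abbreviation (Ph.D.)
    ]

# Tiny, made-up training set: (word with '.', next word) -> end of sentence?
examples = [("Ph.D.", "student", 0), ("vs.", "probabilities", 0),
            ("U.S.", "He", 1), ("probabilities.", "He", 1)]
X = [features(w, nxt) for w, nxt, _ in examples]
y = [label for _, _, label in examples]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([features("life.", "He")]))  # most likely [1]: end of sentence
```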

41 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 41

42 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 42

43 Q: What Is a Word? Is he's one word? If not, how do we break it? he's → he 's / he is / he has... Do we really want to split entity names (New York)? 43

44 Word Segmentation Specifications E.g., the Stanford Chinese word segmenter provides segmentation specification options: the Chinese Penn Treebank standard and the Peking University standard. It also provides different choices, e.g., whether to separate surnames from first names (江泽民 → 江 泽民, i.e., Surname FirstName, or no change) and whether to separate surnames from job titles (江主席 → 江 主席, i.e., Surname JobTitle, or no change). 44

45 Tokens vs. Types A word type is an element of the vocabulary (think of the vocabulary as a set of distinct words). A token is an instance of a type in actual text. E.g., a simple corpus: "Here are some words. Here are some more words." How many tokens? 11. How many types? 6: {Here, are, some, words, ., more} 45
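For concreteness, a tiny sketch that counts tokens and types in the toy corpus above (with punctuation already separated by spaces):

```python
# Count tokens and types in the toy corpus from the slide.
corpus = "Here are some words . Here are some more words ."
tokens = corpus.split()
types = set(tokens)
print(len(tokens), len(types))  # -> 11 6
print(sorted(types))
```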

46 Tokens vs. Types Compare the number of tokens with the number of types in actual corpora: Shakespeare: 884,000 tokens, 31,000 types. Brown corpus: 1 million tokens, 38 thousand types. Switchboard telephone conversations: 2.4 million tokens, 20 thousand types. COCA: 440 million tokens, 2 million types. Google N-grams: 1 trillion tokens, 13 million types. When we talk about the number of words in a language, we are generally referring to word types. 46

47 An Empirical Law Heaps' Law (or Herdan's Law) describes the growth of the vocabulary size V (the number of word types) with the number of tokens n: V(n) = K · n^β It reflects the type-token relation. K and β are parameters estimated from data, affected by factors such as genre. For English corpora, K is usually between 10 and 100, and β is roughly 0.4 to 0.6 (roughly a square root). 47
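A quick sketch of what the law predicts; the parameter values below are only assumed to lie in the typical English range.

```python
# Herdan/Heaps' law: V(n) = K * n**beta, with assumed parameters K=30, beta=0.5.
K, beta = 30, 0.5
for n in [1_000, 100_000, 10_000_000]:
    print(f"{n:>10} tokens -> about {int(K * n ** beta):>7} types")
```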

48 Issues in Tokenization of English Finland's → Finland, Finlands, Finland's? what're → what are, what're, what re? Hewlett-Packard → Hewlett Packard? state-of-the-art → state of the art? lowercase → lower-case / lowercase / lower case? San Francisco: should we split it? m.p.h. → miles per hour? with respect to → one preposition? 48

49 Tokenization in Other Languages French: L'ensemble: is this one token or two? L? L'? Le? German: noun compounds are not segmented: Lebensversicherungsgesellschaftsangestellter 'life insurance company employee'. A German information retrieval system needs a compound splitter. 49

50 Tokenization in Other Languages Chinese: there are no spaces between words, so segmentation is needed. Different segmentation standards (e.g., different granularity) may yield different results in a downstream system that relies on word segmentation, and that is pretty much every Chinese NLP model. Unsegmented: 莎拉波娃现在居住在美国东南部的佛罗里达 Segmented: 莎拉波娃 现在 居住在 美国 东南部 的 佛罗里达 (Sharapova now lives in US southeastern Florida.) 50

51 Tokenization in Other Languages The problem is more complicated in Japanese, because multiple scripts exist and people use them interchangeably at will, producing alternative written forms of the same sentence. Glossed example: Here [loc] tobacco [obj] smoke don't. ('Don't smoke here.') 51

52 Tokenization in Other Languages Name tokenization can also be tricky If the person is from China: Surname FirstName If the person is from Japan: Surname FirstName 52

53 Word Segmentation This is what tokenization is called in languages such as Chinese, Japanese, Arabic, etc. Take Chinese as an example: Chinese words are composed of characters. Characters are generally one syllable and one morpheme (a morpheme is the smallest meaning-bearing unit). The average word is 2.4 characters long. Standard baseline for evaluating word segmentation algorithms: Maximum Matching. 53

54 Algorithm: Maximum Matching Input: a lexicon L of Chinese words; a Chinese string s. Output: the tokenized string. Algorithm: Step 1: Start a pointer p at the beginning of the string; Step 2: Find the longest word w in L that matches the string starting at p; Step 3: Move p past w in s; Step 4: Go to Step 2 (until p reaches the end of s). 54

55 (Source: D. Jurafsky) Algorithm: Maximum Matching Doesn't generally work in English! Thecatinthehat → The cat in the hat (correct). thetabledownthere → theta bled own there (instead of the table down there). But it works well in Chinese: 莎拉波娃现在居住在美国东南部的佛罗里达 → 莎拉波娃 现在 居住在 美国 东南部 的 佛罗里达 (Sharapova now lives in US southeastern Florida). Modern probabilistic approaches are even better. A sketch of the algorithm follows. 55
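A minimal sketch of MaxMatch in Python; the toy lexicon is made up, and it deliberately includes theta, bled, and own so the English failure case above is reproduced.

```python
# Greedy maximum matching over a lexicon.
def max_match(s, lexicon):
    words, p = [], 0
    while p < len(s):
        for end in range(len(s), p, -1):   # try the longest candidate first
            if s[p:end] in lexicon:
                words.append(s[p:end])
                p = end
                break
        else:                              # no word matches at p
            words.append(s[p])
            p += 1
    return words

lexicon = {"the", "theta", "table", "bled", "down", "own", "there",
           "cat", "in", "hat"}
print(max_match("thetabledownthere", lexicon))  # -> ['theta', 'bled', 'own', 'there']
print(max_match("thecatinthehat", lexicon))     # -> ['the', 'cat', 'in', 'the', 'hat']
```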

56 Tools for Tokenization Read Ken Church's UNIX for Poets for UNIX commands. Try libraries such as NLTK or Stanford CoreNLP to learn how to do sentence splitting and tokenization in English (see the sketch below). Try libraries such as NLTK or the Stanford Word Segmenter for word segmentation in Chinese and Arabic. 56
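For example, a minimal sketch with NLTK; this assumes the punkt tokenizer models have been downloaded, e.g. via nltk.download('punkt').

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "This is a Ph.D. student from the U.S. He enjoys 99.9% of his life."
print(sent_tokenize(text))  # sentence segmentation
print(word_tokenize(text))  # word tokenization
```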

57 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 57

58 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 58

59 Normalization Q: Why? A: For information retrieval purposes, the following two must have the same form: Indexed text (what we have in the system) Query terms (what the user types in a search box) E.g.: we want to match U.S.A. and USA. 59

60 Try to Differentiate Lemmatizing a word: removing all inflections. Stemming a word: reducing it to its stem. 60

61 Case Folding (Lowercasing) Convert all letters to lower case, since users tend to use lower case in the search box. Possible exception: upper case in mid-sentence? E.g., General Motors, Fed vs. fed. In some tasks, case is helpful: sentiment analysis (NO! vs. No.), machine translation (US vs. us). Avoid lowercasing in these cases. 61

62 Lemmatization Reduces inflections/variant forms to the base form: am, are, is, was, were → be; car, cars, car's, cars' → car. "The boy's cars are different colors." → "The boy car be different color." Sometimes lemmatization is unwanted. In Spanish-to-English MT, it is important to differentiate quiero (I want) and quieres (you want), but they have the same lemma, querer. 62
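A small sketch with NLTK's WordNet lemmatizer; this assumes the WordNet data has been downloaded, e.g. via nltk.download('wordnet'), and note that it needs the part of speech to handle verbs like "are".

```python
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("are", pos="v"))     # -> be
print(lem.lemmatize("cars", pos="n"))    # -> car
print(lem.lemmatize("colors", pos="n"))  # -> color
```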

63 Stemming Some terminology: A morpheme is the smallest meaningful unit in a language. The stem of a word is the core meaning-bearing unit (it may be composed of more than one morpheme). Affixes are bits and pieces that adhere to stems, often associated with grammatical functions. Stemming is crude chopping of affixes. E.g., automate, automates, automatic, automation are all reduced to automat. "for example compressed and compression are both accepted as equivalent to compress." → "for exampl compress and compress ar both accept as equival to compress" 63

64 Porter's Stemmer Porter's stemming algorithm is the most commonly used English stemmer. Step 1a: sses → ss (caresses → caress); ies → i (ponies → poni); ss → ss (caress → caress); s → ø (cats → cat). Step 1b: (*v*)ing → ø (walking → walk, but sing → sing); (*v*)ed → ø (plastered → plaster). Step 2 (for long stems): ational → ate (relational → relate); izer → ize (digitizer → digitize); ator → ate (operator → operate). Step 3 (for longer stems): al → ø (revival → reviv); able → ø (adjustable → adjust); ate → ø (activate → activ). 64

65 Porter's Stemmer Porter's stemming algorithm is the most commonly used English stemmer. Step 1b: strip ing only if there is a vowel in the (*v*) part: (*v*)ing → ø (walking → walk, but sing → sing); (*v*)ed → ø (plastered → plaster). 65
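A quick sketch of running NLTK's implementation of the Porter stemmer; its outputs for some words may differ slightly from the per-step illustrations above, since the full algorithm applies several steps in sequence.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["compressed", "compression", "caresses", "walking", "plastered"]:
    print(w, "->", ps.stem(w))  # e.g. compressed -> compress, compression -> compress
```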

66 Dealing with Complex Morphology Some languages require complex morpheme segmentation, such as Turkish: Uygarlastiramadiklarimizdanmissinizcasina Meaning: '(behaving) as if you are among those whom we could not civilize' Uygar = civilized + las = become + tir = cause + ama = not able + dik = past + lar = plural + imiz = [+1pl.] + dan = abl + mis = [past] + siniz = [+2pl.] + casina = as if 66

67 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 67

68 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 68

69 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Spelling correction: find a word similar to what the user typed. User types: graffe. What might be the user's intention? graf, graft, grail, giraffe, graffiti. Which one is closest to graffe? 69

70 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Evaluation of MT or ASR Reference Spokesman confirms senior government adviser was shot Hypothesis Spokesman said the senior adviser was shot dead 70

71 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Evaluation of MT or ASR Reference Spokesman confirms senior government adviser was shot Hypothesis Spokesman said the senior adviser was shot dead Substitute Ins. Delete Ins. 71

72 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Named entity extraction and entity coreference resolution: "IBM Inc. announced today" vs. "IBM profits"; "Stanford President John Hennessy announced yesterday" vs. "for Stanford University President John Hennessy". 72

73 Measuring Similarity In computational biology: Aligning two sequences of nucleobases AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC 73

74 Minimum Edit Distance The minimum edit distance between two strings is the minimum number of editing operations (e.g., deletion, insertion, substitution) needed to transform one string into the other. 74

75 Example: Edits I N T E N T I O N 75

76 Example: Edits N T E N T I O N delete 76

77 Example: Edits E T E N T I O N delete substitute 77

78 Example: Edits E X E N T I O N delete substitute substitute 78

79 Example: Edits E X E N U T I O N delete substitute substitute insert 79

80 Example: Edits E X E C U T I O N delete substitute substitute insert substitute 80

81 Sequence of Actions I N T E N T I O N N T E N T I O N E T E N T I O N E X E N T I O N E X E N U T I O N E X E C U T I O N delete substitute substitute insert substitute 81

82 Visualizing String Distances Alignment: a visualization of edits 82

83 Levenshtein Distance A typical cost assignment for minimum edit distance: deletion costs 1, insertion costs 1, substitution costs 2 (of course, substituting a letter with the same letter costs 0). With these costs, the Levenshtein distance between intention and execution is 8. We can also disallow substitution: this is equivalent to allowing substitution with a cost of 2, since a substitution = a deletion + an insertion. 83

84 MED as a Search Problem State: a word (a sequence of characters) which we transform. Actions: insert any character at any position; delete any character; substitute any character with any character. Costs: insertion costs 1; deletion costs 1; substitution costs 2 if substituting with a different character, otherwise 0. Initial state: the source word. Goal state: the target word. 84

85 MED as a Search Problem [Figure: from the state INTENTION, a delete leads to NTENTION, an insert leads to INTECNTION, and a substitute leads to INXENTION.] 85

86 Minimum Edit Distance Definition Given two strings, a source string x of length N and a target string y of length M, define the function D(i, j) as the minimum edit distance between the prefixes x_1:i and y_1:j. To obtain the MED between x and y, simply evaluate D(N, M). What are the subproblems? 86

87 Key: Find Subproblems Key idea: solve the current problem using subproblems. To solve tongue → lingua: If we already know how to do tongue → lingu, then we would only need to do an insertion of a. If we already know how to do tongu → lingua, then we would only need to do a deletion of e. If we already know how to do tongu → lingu, then we would only need to do a substitution of e with a. 87

88 Key: Find Subproblems Key idea: solve the current problem using subproblems. To compute the MED of tongue → lingua: If we already know the MED of tongue → lingu, then we would only need to pay the cost of inserting a. If we already know the MED of tongu → lingua, then we would only need to pay the cost of deleting e. If we already know the MED of tongu → lingu, then we would only need to pay the cost of substituting e with a. 88

89 Key: Find Subproblems Key idea: solve the current problem using subproblems. To compute the MED of tongue → lingua, use the answers to these subproblems: the MED of tongue → lingu (then pay the cost of inserting a), the MED of tongu → lingua (then pay the cost of deleting e), and the MED of tongu → lingu (then pay the cost of substituting e with a). 89

90 Why Does This Work? Let p be an optimal path in the state space beginning at the state INTENTION and ending at the state EXECUTION. If an intermediate state EXENTION is on p, then the portion of p from INTENTION to EXENTION must also be optimal. Proof: if there were a shorter path p' from INTENTION to EXENTION, we could use p' to form a shorter overall path, making p not optimal. Contradiction! The optimal solution to the subproblem is nested inside the optimal solution to the current problem. [Figure: the path p passes through INTENTION, EXENTION, and EXECUTION.] 90

91 Recursive Relation Given two strings, a source string x of length N and a target string y of length M, define the function D(i, j) as the minimum edit distance between the prefixes x_1:i and y_1:j. Recurrence: D(i, j) = min of: D(i-1, j) + C_delete; D(i, j-1) + C_insert; D(i-1, j-1) + C_substitute if x_i ≠ y_j, or D(i-1, j-1) + 0 if x_i = y_j. Base cases: D(i, 0) = i · C_delete; D(0, j) = j · C_insert. 91

92 Say No to Recursion So should we just implement the D function recursively, as the previous slide describes? No! Why was recursion not a problem in divide-and-conquer algorithms, then? Because in algorithms like Merge Sort the subproblem is half the size of the current problem (it reduces to the base case very quickly), while here the size of the subproblem is not substantially reduced. Naive recursion is very slow for MED! Why not memoize the recursive function? You pay the cost of maintaining call-stack frames, while the method we introduce here simply fills a 2D table. 92

93 Dynamic Programming Let D be an N-by-M matrix (with an extra row 0 and column 0). Initialization: D(i, 0) = i · C_delete for i = 1 to N; D(0, j) = j · C_insert for j = 1 to M. For i = 1 to N and j = 1 to M: D(i, j) = min of: D(i-1, j) + C_delete; D(i, j-1) + C_insert; D(i-1, j-1) + C_substitute if x_i ≠ y_j, or + 0 if x_i = y_j. Return D(N, M). 93
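A minimal Python sketch of this algorithm with Levenshtein costs (delete = insert = 1, substitute = 2, or 0 on a match):

```python
# Minimum edit distance by dynamic programming.
def min_edit_distance(source, target, c_del=1, c_ins=1, c_sub=2):
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * c_del
    for j in range(1, m + 1):
        D[0][j] = j * c_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else c_sub
            D[i][j] = min(D[i - 1][j] + c_del,    # delete source[i-1]
                          D[i][j - 1] + c_ins,    # insert target[j-1]
                          D[i - 1][j - 1] + sub)  # substitute (or match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # -> 8
```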

94 Filling the DP Table Topological order over the sub-problems: [Figure: the cell D(i, j) is computed from its three already-filled neighbors: D(i-1, j) plus a delete, D(i, j-1) plus an insert, and D(i-1, j-1) plus a substitute (or a match), taking the minimum.] 94

95-115 Filling the Table [Slides 95-115 step through filling the DP table for source INTENTION (rows) and target EXECUTION (columns). Row 0 and column 0 are first initialized to i · C_delete and j · C_insert (the values 0 through 9); the remaining cells are then filled one by one with the min-of-three recurrence, ending with D(9, 9) = 8, the Levenshtein distance between intention and execution.] 115

116 Dynamic Programming (recap) Let D be an N-by-M matrix. Initialization: D(i, 0) = i · C_delete; D(0, j) = j · C_insert. For i = 1 to N and j = 1 to M: D(i, j) = min of D(i-1, j) + C_delete, D(i, j-1) + C_insert, and D(i-1, j-1) + C_substitute (or + 0 if x_i = y_j). Return D(N, M). 116

117 Complexities Time: O(NM). Space: O(NM). (Both follow directly from the algorithm on the previous slide: every one of the N · M cells is filled once, in constant time.) 117

118-141 Obtaining Alignment (Backtracing) [Slides 118-141 step through backtracing on the DP table for source DOG (rows) and target FROG (columns). While each cell is filled, we also record which of the three moves (delete, insert, substitute/match) achieved the minimum; starting from the bottom-right cell, where D(3, 4) = 3, and following these records back to D(0, 0) reads off an optimal alignment, e.g. inserting F, substituting D with R, and keeping O and G.] 141
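A sketch of backtracing in Python: the table-filling code above is extended to record, for each cell, which move achieved the minimum, and the alignment is then read off by walking those pointers from D(N, M) back to D(0, 0).

```python
# Minimum edit distance plus alignment via backpointers.
def align(source, target, c_del=1, c_ins=1, c_sub=2):
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i * c_del, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j * c_ins, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else c_sub
            D[i][j], ptr[i][j] = min((D[i - 1][j] + c_del, "del"),
                                     (D[i][j - 1] + c_ins, "ins"),
                                     (D[i - 1][j - 1] + sub, "sub"))
    pairs, i, j = [], n, m          # follow the backpointers to the origin
    while i > 0 or j > 0:
        if ptr[i][j] == "del":
            pairs.append((source[i - 1], "-")); i -= 1
        elif ptr[i][j] == "ins":
            pairs.append(("-", target[j - 1])); j -= 1
        else:                       # substitution or match
            pairs.append((source[i - 1], target[j - 1])); i, j = i - 1, j - 1
    return D[n][m], list(reversed(pairs))

print(align("dog", "frog"))  # distance 3 and one optimal alignment
```

The backtrace itself takes O(N+M) steps, matching the cost noted on the next slide.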

142 Time Complexity Bookkeeping O(NM) Backtracing costs O(N+M) 142

143 A Valid Distance Laws that a valid distance measure should respect 143

144 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 144

145 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Q&A Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 145


More information

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they

More information

What we have done so far

What we have done so far What we have done so far DFAs and regular languages NFAs and their equivalence to DFAs Regular expressions. Regular expressions capture exactly regular languages: Construct a NFA from a regular expression.

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Extracting Information from Text

Extracting Information from Text Extracting Information from Text Research Seminar Statistical Natural Language Processing Angela Bohn, Mathias Frey, November 25, 2010 Main goals Extract structured data from unstructured text Training

More information

Automata Theory and Formal Grammars: Lecture 1

Automata Theory and Formal Grammars: Lecture 1 Automata Theory and Formal Grammars: Lecture 1 Sets, Languages, Logic Automata Theory and Formal Grammars: Lecture 1 p.1/72 Sets, Languages, Logic Today Course Overview Administrivia Sets Theory (Review?)

More information

Probabilistic Context-free Grammars

Probabilistic Context-free Grammars Probabilistic Context-free Grammars Computational Linguistics Alexander Koller 24 November 2017 The CKY Recognizer S NP VP NP Det N VP V NP V ate NP John Det a N sandwich i = 1 2 3 4 k = 2 3 4 5 S NP John

More information

Advanced Natural Language Processing Syntactic Parsing

Advanced Natural Language Processing Syntactic Parsing Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Language Models, Graphical Models Sameer Maskey Week 13, April 13, 2010 Some slides provided by Stanley Chen and from Bishop Book Resources 1 Announcements Final Project Due,

More information

Spatial Role Labeling CS365 Course Project

Spatial Role Labeling CS365 Course Project Spatial Role Labeling CS365 Course Project Amit Kumar, akkumar@iitk.ac.in Chandra Sekhar, gchandra@iitk.ac.in Supervisor : Dr.Amitabha Mukerjee ABSTRACT In natural language processing one of the important

More information

Statistical Substring Reduction in Linear Time

Statistical Substring Reduction in Linear Time Statistical Substring Reduction in Linear Time Xueqiang Lü Institute of Computational Linguistics Peking University, Beijing lxq@pku.edu.cn Le Zhang Institute of Computer Software & Theory Northeastern

More information

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata CISC 4090: Theory of Computation Chapter Regular Languages Xiaolan Zhang, adapted from slides by Prof. Werschulz Section.: Finite Automata Fordham University Department of Computer and Information Sciences

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department 1 Reference Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Computer Sciences Department 3 ADVANCED TOPICS IN C O M P U T A B I L I T Y

More information

CS1800: Strong Induction. Professor Kevin Gold

CS1800: Strong Induction. Professor Kevin Gold CS1800: Strong Induction Professor Kevin Gold Mini-Primer/Refresher on Unrelated Topic: Limits This is meant to be a problem about reasoning about quantifiers, with a little practice of other skills, too

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Classification & Information Theory Lecture #8

Classification & Information Theory Lecture #8 Classification & Information Theory Lecture #8 Introduction to Natural Language Processing CMPSCI 585, Fall 2007 University of Massachusetts Amherst Andrew McCallum Today s Main Points Automatically categorizing

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

how to *do* computationally assisted research

how to *do* computationally assisted research how to *do* computationally assisted research digital literacy @ comwell Kristoffer L Nielbo knielbo@sdu.dk knielbo.github.io/ March 22, 2018 1/30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 class Person(object):

More information