CS2731/ISSP2230. Natural Language Processing
1 [Title slide; figure: constituency and dependency parses of "I saw a girl with a telescope"] CS2731/ISSP2230 Natural Language Processing. Lecture 2: Regular Expressions, Text Normalization, and Minimum Edit Distance. Instructor: Prof. Diane Litman (litman@cs.pitt.edu). TA: Yuhuan Jiang (yuhuan@cs.pitt.edu). Based on the slides of Dan Jurafsky & James H. Martin.
2 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 2
4 Regular Expressions A regular expression specifies a regular language. Think of a language as a set of strings. In future lectures we will learn what a regular language is and whether there are other types of languages. For now, we only need to know that there are some languages we cannot express with regular expressions; we'll come back to this later in this lecture.
5 Regular Expressions A regular expression defines a language: a set of strings, or a pattern of strings. Regex woodchucks — example pattern matched: "interesting links to woodchucks and lemurs". Regex [A-Z] — example patterns matched: "Mary Ann stopped by Mona's!", "You've left the burglar behind again!". Useful for searching in text, when we have a pattern to search for. Example in UNIX: grep.
6 Basic Patterns Complicated regular expressions are built from basic regular expression patterns. We'll look at composition later; for now, we learn some basic patterns first.
7 Basic Patterns: Simple characters. Form: a sequence of simple characters. Effect: matches any string containing that substring. Regex woodchucks — matched in "interesting links to woodchucks and lemurs". Regex a — matched in "Mary Ann stopped by Mona's!" and "You've left the burglar behind again!".
8 Basic Patterns: Character-level disjunction. Form: a sequence of characters inside brackets. Effect: matches any one character from the bracket. Regex [wW]oodchucks — "Woodchuck is the same as woodchuck." Regex [abc] — "Mary Ann stopped by Mona's". Regex [ ] — "I can count from 1 to 5." This can get very boring, as [ABCDEFGHIJKLMNOPQRSTUVWXYZ] is a very tedious way to match any capital letter.
9 Basic Patterns Range Form: [x-y] Effect: Matches any character in the range x to y. Regex Example Patterns Matched [A-Z] we should call it Drenched Blossoms [a-z] YOUR USER NAME IS yuj24 [0-9] YOUR USER NAME IS yuj24 9
10 Basic Patterns: Negation. Form: add a caret ^ just after the left bracket ([). Effect: matches any character not in the bracketed set. Regex [^A-Z] — not an upper-case letter. Regex [^Ss] — neither S nor s. Regex [^\.] — not a period. Regex [e^] — either e or a caret. Regex a^b — the string a^b.
12 Operators: Optional (?), Kleene *, Kleene +. Regex woodchucks? — woodchuck or woodchucks. colou?r — color or colour. a* — zero or more a's. aa* — one or more a's. a+ — one or more a's (same effect as aa*). [ab]* — zero or more a's or b's, e.g., aaaa, ababab, bbbb, abbababbbaab. [0-9]+ — a sequence of digits.
13 Wildcard: the period matches any character. Regex . — any character. beg.n — begin, began, begun, ... <html>.+</html> — any (non-empty) string surrounded by the <html> tags.
14 Anchors: Caret ^ — start of line. Dollar $ — end of line. \b — word boundary (a word = any sequence of digits/underscores/letters). The 3 uses of caret: start of a line; negate a bracketed pattern; just a caret. Regex ^The — matches the word The only at the start of a line. Regex ' $' (a space then $) — matches a space at the end of a line. ^The dog\.$ — matches a line that contains only the phrase The dog. \bthe\b — matches the but not other.
15 Disjunction Operator Q: If we want to search for cat or dog, can we do this? [catdog] A: No. This would match any character that is c, a, t, d, o, or g. Solution: the disjunction operator Vertical bar a b matches a or b cat dog matches cat or dog 15
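The difference is easy to see in Python's re module; a quick sketch (the example strings are invented for illustration):

```python
import re

text = "cats and dogs"

# [catdog] is a character class: it matches any ONE of the letters c, a, t, d, o, g
print(re.findall(r"[catdog]", "cod"))   # every letter of "cod" is in the class

# cat|dog is a true disjunction: it matches the whole string cat or the whole string dog
print(re.findall(r"cat|dog", text))
```

Note that | has low precedence, so cat|dog disjoins the entire strings, not just the adjacent letters.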
16 Example: Constructing a Regex Goal: find cases of the English article the — first attempt: the
17 Example: Constructing a Regex Goal: find cases of the English article the — [tT]he
18 Example: Constructing a Regex Goal: find cases of the English article the — \b[tT]he\b
19 Example: Constructing a Regex Goal: find cases of the English article the — [tT]he
20 Example: Constructing a Regex Goal: find cases of the English article the — [^a-zA-Z][tT]he[^a-zA-Z]
21 Example: Constructing a Regex Goal: find cases of the English article the — (^|[^a-zA-Z])[tT]he[^a-zA-Z]
22 Example: Constructing a Regex Goal: find cases of the English article the. Solution: (^|[^a-zA-Z])[tT]he[^a-zA-Z]
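The finished pattern can be tried out with Python's re module (the test sentences here are invented for illustration):

```python
import re

pattern = r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]"

examples = {
    "The dog barked.": True,   # sentence-initial "The", handled by the ^ branch
    "in the house":    True,   # lower-case "the" with non-letter context on both sides
    "other theory":    False,  # "the" appears only inside other words
}
for sentence, expected in examples.items():
    matched = re.search(pattern, sentence) is not None
    print(sentence, "->", matched)
```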
23 Error Types What did we just go through? We refined the regex by handling errors In fact, we handled two types of errors False positives: strings that we incorrectly matched E.g., other, or there False negatives: strings that we incorrectly missed E.g., The Precision and Recall By minimizing false positives, we increase precision By minimizing false negatives, we increase recall These two types of errors appear everywhere in NLP. 23
24 Quiz Write a regular expression for aⁿbⁿ, i.e., any string consisting of some number of a's followed by the same number of b's. There is no regex for this one: the language aⁿbⁿ is not regular. In future lectures, you will learn about the Chomsky hierarchy of languages.
25 Why This Matters Let's try to parse an HTML document using regular expressions. An example HTML document (Test.html): <div> <div> <div> ... </div> </div> </div> This is basically <div>ⁿ ... </div>ⁿ, which is not regular.
26 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 26
28 Corpus Preprocessing Almost every NLP task needs to do some preprocessing on the corpus: Breaking a document into sentences Breaking the sentence into individual words Normalizing word formats 28
29-33 Corpus Preprocessing [pipeline figure, built up across slides] Document → Sentence Splitter → sentences → Word Segmenter → words → Lemmatizer → normalized words → downstream system
34 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 34
36 Sentence Segmentation Simple! Just cut the document whenever you see a period, question mark, or exclamation mark. No! Consider using the period (.) as a sentence boundary: "This is a Ph.D. student from the U.S. He does experiments comparing rules vs. probabilities. He enjoys 99.9% of his life." It would be smarter to classify every period as EndOfSentence or NotEndOfSentence. Possible models: hand-crafted rules, regular expressions, machine learning.
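A regex baseline makes the slide's point concrete. This is only a sketch: the naive rule cuts after every period, and even the capital-letter refinement below still fails on abbreviations followed by names (e.g. "Mr. Smith"):

```python
import re

text = "This is a Ph.D. student from the U.S. He enjoys 99.9% of his life."

# Naive rule: cut after every sentence-final punctuation mark followed by whitespace.
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)   # wrongly splits after "Ph.D."

# Refinement: only cut when the next word starts with a capital letter.
better = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
print(better)  # "Ph.D. student" survives; "U.S. He" is still split correctly
```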
37-38 Classification by Decision Tree [tree figure omitted] Input: a period and the sentence in which the period occurs. Output: Yes/No (end of sentence?). Algorithm: a decision tree over features such as those on the next slide.
39 Decision Tree Features Categorical features (those on the previous slide): case of the word containing the '.': {Upper, Lower, Cap, Num}; case of the word after the '.': {Upper, Lower, Cap, Num}. Numerical features: length of the word containing the '.'; Prob(word with '.' occurs at the end of a sentence); Prob(word with '.' occurs at the beginning of a sentence).
40 Decision Tree Implementation A decision tree is just an if-then-else statement; the interesting research is in choosing the features. Setting up the structure by hand is often too hard, for two reasons: hand-building is only possible for very simple features with limited domain size, and thresholds for numerical features are hard to pick. Instead, we learn the structure of the decision tree by machine learning from a training corpus. Simplest: decision tree induction using information theory. Alternatives: probabilistic classifiers such as logistic regression, SVMs, neural networks, etc.
41 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 41
43 Q: What Is a Word? Is he's one word? If not, how do we break it? he's → he s? he is? he has? ... Do we really want to split entity names (New York)?
44 Word Segmentation Specifications E.g., the Stanford Chinese word segmenter provides segmentation specification options: the Chinese Penn Treebank standard and the Peking University standard. It also provides different choices, e.g., whether to separate surnames from first names (江泽民 → 江 泽民 (Surname FirstName), or no change) and surnames from job titles (江主席 → 江 主席 (Surname JobTitle), or no change).
45 Tokens vs. Types A word type is an element of the vocabulary (think of the vocabulary as a set of distinct words). A token is an instance of a type in actual text. E.g., a simple corpus: "Here are some words. Here are some more words." How many tokens? 11. How many types? 6: {Here, are, some, words, ., more}
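The count is easy to reproduce; a sketch in which the toy corpus is already tokenized (punctuation separated by spaces):

```python
corpus = "Here are some words . Here are some more words ."

tokens = corpus.split()   # every occurrence counts as a token
types = set(tokens)       # distinct forms only

print(len(tokens))  # 11
print(len(types))   # 6
print(sorted(types))
```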
46 Tokens vs. Types Compare the number of tokens with the number of types in actual corpora:
Corpus — Tokens — Types
Shakespeare — 884 thousand — 31 thousand
Brown corpus — 1 million — 38 thousand
Switchboard telephone conversations — 2.4 million — 20 thousand
COCA — 440 million — 2 million
Google N-grams — 1 trillion — 13 million
When we talk about the number of words in a language, we are generally referring to word types.
47 An Empirical Law Heaps' Law (or Herdan's Law) describes the growth of the vocabulary size V (number of word types) with the number of tokens n: V(n) = k · n^β. It reflects the type-token relation. k and β are parameters estimated from data, affected by factors such as genre. For English corpora, k is usually between 10 and 100, and β is roughly 0.4 to 0.6 (~square root).
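As a sketch, the law is a one-line function; the constants below (k = 44, β = 0.49) are illustrative picks inside the ranges quoted on the slide, not fitted values:

```python
def heaps_vocab_size(n_tokens: int, k: float = 44.0, beta: float = 0.49) -> float:
    """Heaps'/Herdan's law: predicted vocabulary size V(n) = k * n**beta."""
    return k * n_tokens ** beta

# With these illustrative constants, a 1-million-token corpus is predicted to
# have a vocabulary in the tens of thousands, roughly the size of the Brown
# corpus vocabulary in the table above.
print(round(heaps_vocab_size(1_000_000)))
```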
48 Issues in Tokenization of English Finland's → Finland, Finlands, or Finland's? what're → what are, what re, or what're? Hewlett-Packard → Hewlett Packard? state-of-the-art → state of the art? lowercase / lower-case / lower case? San Francisco — should we split it? m.p.h. → miles per hour? with respect to → one preposition?
49 Tokenization in Other Languages French: L'ensemble — is this one token or two? L? L'? Le? German: noun compounds are not segmented: Lebensversicherungsgesellschaftsangestellter ("life insurance company employee"). A German information retrieval system needs a compound splitter.
50 Tokenization in Other Languages Chinese: there are no spaces between words, so segmentation is needed. Different segmentation standards (e.g., different granularity) may yield different results in a downstream system that relies on word segmentation — and that is pretty much every Chinese NLP model. 莎拉波娃现在居住在美国东南部的佛罗里达 → 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达 ("Sharapova now lives in US southeastern Florida").
51 Tokenization in Other Languages The problem is more complicated in Japanese, because multiple scripts exist and people use them interchangeably at will. Alternative written forms of the same sentence, glossed: "Here [loc] tobacco [obj] smoke don't."
52 Tokenization in Other Languages Name tokenization can also be tricky If the person is from China: Surname FirstName If the person is from Japan: Surname FirstName 52
53 Word Segmentation This is what tokenization is called in languages such as Chinese, Japanese, and Arabic. Take Chinese as an example: Chinese words are composed of characters; characters are generally one syllable and one morpheme (a morpheme is the smallest meaning-bearing unit); the average word is 2.4 characters long. The standard baseline for evaluating word segmentation algorithms is Maximum Matching.
54 Algorithm: Maximum Matching Input: a lexicon L of Chinese words; a Chinese string s. Output: the tokenized string. Algorithm: Step 1: start a pointer p at the beginning of the string. Step 2: find the longest word w in L that matches the string starting at p. Step 3: move p past w in s. Step 4: go to Step 2.
55 (Source: D. Jurafsky) Algorithm: Maximum Matching Doesn't generally work in English! Thecatinthehat → The cat in the hat, but thetabledownthere → theta bled own there. It works well in Chinese, though: 莎拉波娃现在居住在美国东南部的佛罗里达 → 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达 ("Sharapova now lives in US southeastern Florida"). Modern probabilistic approaches are even better.
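A minimal implementation of the four steps, with a toy lexicon (the word list is invented for illustration; a real segmenter would use a large dictionary):

```python
def max_match(s, lexicon):
    """Greedy maximum matching: at each position take the longest lexicon
    word starting there; fall back to a single character if none matches."""
    tokens, p = [], 0
    while p < len(s):
        for end in range(len(s), p, -1):           # longest candidate first
            if s[p:end] in lexicon or end == p + 1:
                tokens.append(s[p:end])
                p = end
                break
    return tokens

lex = {"the", "cat", "in", "hat", "theta", "bled", "own", "there", "table", "down"}
print(max_match("thecatinthehat", lex))      # ['the', 'cat', 'in', 'the', 'hat']
print(max_match("thetabledownthere", lex))   # ['theta', 'bled', 'own', 'there']
```

The second call reproduces the failure on the slide: the greedy longest match grabs "theta" and never recovers.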
56 Tools for Tokenization Read Ken Church's "UNIX for Poets" for UNIX commands. Try libraries such as NLTK or Stanford CoreNLP to learn how to do sentence splitting and tokenization in English. Try libraries such as NLTK or the Stanford Word Segmenter for word segmentation in Chinese and Arabic.
57 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 57
59 Normalization Q: Why? A: For information retrieval purposes, the following two must have the same form: Indexed text (what we have in the system) Query terms (what the user types in a search box) E.g.: we want to match U.S.A. and USA. 59
60 Try to Differentiate Lemmatizing a word Removing all inflections Stemming a word Reduce to the stem 60
61 Case Folding (Lowercasing) Convert all letters to lower case, since users tend to use lower case in the search box. Possible exception: upper case in mid-sentence? E.g., General Motors; Fed vs. fed. In some tasks case is helpful: sentiment analysis (NO! vs. No.), machine translation (US vs. us). Avoid lowercasing in these cases.
62 Lemmatization Reduces inflections/variant forms to the base form: am, are, is, was, were → be; car, cars, car's, cars' → car. "The boy's cars are different colors" → "The boy car be different color". Sometimes lemmatization is unwanted: in Spanish-to-English MT it is important to differentiate quiero (I want) and quieres (you want), but they have the same lemma, querer.
63 Stemming Some terminology: a morpheme is the smallest meaningful unit in a language; the stem of a word is the core meaning-bearing unit (it may be composed of more than one morpheme); affixes are bits and pieces that adhere to stems, often associated with grammatical functions. Stemming is crude chopping of affixes. E.g., automate, automates, automatic, automation are all reduced to automat. "for example compressed and compression are both accepted as equivalent to compress" → "for exampl compress and compress ar both accept as equival to compress"
64 Porter's Stemmer The Porter stemming algorithm is the most commonly used English stemmer.
Step 1a: sses → ss (caresses → caress); ies → i (ponies → poni); ss → ss (caress → caress); s → ø (cats → cat)
Step 1b: (*v*)ing → ø (walking → walk, but sing → sing); (*v*)ed → ø (plastered → plaster)
Step 2 (for long stems): ational → ate (relational → relate); izer → ize (digitizer → digitize); ator → ate (operator → operate)
Step 3 (for longer stems): al → ø (revival → reviv); able → ø (adjustable → adjust); ate → ø (activate → activ)
65 Porter s Stemmer The Porter s stemming algorithm is the most commonly used English stemmer. Step 1b Strip ing only if there is a vowel in the v part (*v*)ing ø walking walk sing sing (*v*)ed ø plastered plaster 65
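A tiny sketch of just these first rule groups, not the full Porter algorithm (for real use, go through a library implementation such as NLTK's PorterStemmer):

```python
import re

def step_1a(word):
    # Ordered suffix rules; the first match wins: sses->ss, ies->i, ss->ss, s->0.
    for suffix, repl in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

def step_1b(word):
    # (*v*)ing -> 0 and (*v*)ed -> 0: strip only if the remaining stem has a vowel.
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if re.search(r"[aeiou]", stem):
                return stem
    return word

print(step_1a("caresses"), step_1a("ponies"), step_1a("cats"))    # caress poni cat
print(step_1b("walking"), step_1b("sing"), step_1b("plastered"))  # walk sing plaster
```

The vowel check is exactly what saves "sing": its stem "s" has no vowel, so -ing is kept.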
66 Dealing with Complex Morphology Some languages require complex morpheme segmentation, such as Turkish: Uygarlastiramadiklarimizdanmissinizcasina, meaning "(behaving) as if you are among those whom we could not civilize". Uygar "civilized" + las "become" + tir "cause" + ama "not able" + dik "past" + lar "plural" + imiz "+1pl" + dan "abl" + mis "past" + siniz "+2pl" + casina "as if".
67 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 67
69 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Spelling correction: find a word similar to what the user typed. User types: graffe. What might be the user's intention? graf, graft, grail, giraffe, graffiti — which one is closest to graffe?
70-71 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Evaluation of MT or ASR: Reference: "Spokesman confirms senior government adviser was shot". Hypothesis: "Spokesman said the senior adviser was shot dead" (one substitution, two insertions, one deletion).
72 Measuring Similarity Much of NLP is concerned with measuring how similar two strings are. Named entity extraction and entity coreference resolution: "IBM Inc. announced today" vs. "IBM profits"; "Stanford President John Hennessy announced yesterday" vs. "Stanford University President John Hennessy".
73 Measuring Similarity In computational biology: Aligning two sequences of nucleobases AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC 73
74 Minimum Edit Distance The minimum edit distance between two strings is the minimum number of editing operations (deletion, insertion, substitution) needed to transform one string into the other.
75 Example: Edits I N T E N T I O N 75
76 Example: Edits N T E N T I O N delete 76
77 Example: Edits E T E N T I O N delete substitute 77
78 Example: Edits E X E N T I O N delete substitute substitute 78
79 Example: Edits E X E N U T I O N delete substitute substitute insert 79
80 Example: Edits E X E C U T I O N delete substitute substitute insert substitute 80
81 Sequence of Actions I N T E N T I O N N T E N T I O N E T E N T I O N E X E N T I O N E X E N U T I O N E X E C U T I O N delete substitute substitute insert substitute 81
82 Visualizing String Distances Alignment: a visualization of edits 82
83 Levenshtein Distance A typical cost choice for minimum edit distance: deletion costs 1, insertion costs 1, substitution costs 2 (of course, substituting a letter with the same letter costs 0). With these costs, the Levenshtein distance between intention and execution is 8. We can also disallow substitution — equivalent to allowing substitution with cost 2, since a substitution = a deletion + an insertion.
84 MDE as a Search Problem State: a word (seq. of chars) which we transform Actions: Insert any character at any position Delete any character Substitute any character with any character Costs: Insertion costs 1 Deletion costs 1 Substitution costs 2 if substituting with a different character; otherwise 0 Initial state: the source word Goal state: the target word 84
85 MDE as a Search Problem I N T E N T I O N del ins sub N T E N T I O N I N T E C N T I O N I N X E N T I O N 85
86 Minimum Edit Distance Definition Given two strings — a source string x of length N and a target string y of length M — define the function D(i, j) as the minimum edit distance between the prefixes x[1:i] and y[1:j]. To obtain the MED between x and y, simply evaluate D(N, M). What are the subproblems?
87 Key: Find Subproblems Key idea: solve the current problem using subproblems. To solve tongue → lingua: if we already know how to do tongue → lingu, we would only need an insertion of a; if we already know how to do tongu → lingua, we would only need a deletion of e; if we already know how to do tongu → lingu, we would only need a substitution of e with a.
88 Key: Find Subproblems Key idea: solve the current problem using subproblems. To compute the MED of tongue → lingua: if we already know the MED of tongue → lingu, we only need to pay the cost of inserting a; if we already know the MED of tongu → lingua, we only need to pay the cost of deleting e; if we already know the MED of tongu → lingu, we only need to pay the cost of substituting e with a.
90 Why This Works Let p be an optimal path in the state space beginning at the state INTENTION and ending at the state EXECUTION. If an intermediate state EXENTION is on p, then the portion p′ of p from INTENTION to EXENTION must also be optimal. Proof: if there were a shorter path p″ from INTENTION to EXENTION, we could use p″ to form a shorter overall path, making p not optimal — contradiction! The optimal solution to the subproblem is nested inside the optimal solution to the current problem.
91 Recursive Relation Given a source string x of length N and a target string y of length M, define D(i, j) as the minimum edit distance between the prefixes x[1:i] and y[1:j]. Then
D(i, j) = min( D(i-1, j) + C_delete,
               D(i, j-1) + C_insert,
               D(i-1, j-1) + (0 if x_i = y_j, else C_substitute) )
Base cases: D(i, 0) = i · C_delete and D(0, j) = j · C_insert.
92 Say No to Recursion So do we just implement the D function recursively, as the previous slide describes? No! Why was recursion not a problem in divide-and-conquer algorithms, then? Because in algorithms like Merge Sort the subproblem is half the size of the current problem (it reduces to the base case very quickly), while here the size of the subproblem is not substantially reduced — very slow for MED! Why not memoize the recursive function? You pay the cost of maintaining call-stack frames, while the method we introduce here simply fills a 2D table.
93 Dynamic Programming Let D be an (N+1)-by-(M+1) matrix.
Initialization: for i = 1 to N, D[i, 0] = i · C_delete; for j = 1 to M, D[0, j] = j · C_insert.
For i = 1 to N, for j = 1 to M:
  D[i, j] = min( D[i-1, j] + C_delete,
                 D[i, j-1] + C_insert,
                 D[i-1, j-1] + (0 if x_i = y_j, else C_substitute) )
Return D[N, M].
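The pseudocode above translates almost line for line into Python; a sketch with the Levenshtein costs (delete 1, insert 1, substitute 2):

```python
def min_edit_distance(x, y, c_del=1, c_ins=1, c_sub=2):
    """Fill the (N+1) x (M+1) DP table bottom-up and return D[N][M]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * c_del            # base case: delete all of x[:i]
    for j in range(1, m + 1):
        D[0][j] = j * c_ins            # base case: insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else c_sub
            D[i][j] = min(D[i - 1][j] + c_del,
                          D[i][j - 1] + c_ins,
                          D[i - 1][j - 1] + sub)
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8
```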
94 Filling the DP Table Topological order over the subproblems: each cell D(i, j) is computed as the minimum over its three already-filled neighbors — D(i-1, j) + delete, D(i, j-1) + insert, and D(i-1, j-1) + substitute.
95-115 Filling the Table (worked example) [table animation across slides] The table for intention (rows) vs. execution (columns) is filled cell by cell: row 0 and column 0 are initialized to i · C_delete and j · C_insert (1 through 9), and every other cell takes the minimum of its three neighbors under the recurrence. The final cell D(9, 9) = 8, the Levenshtein distance between intention and execution.
116 Dynamic Programming (recap) Let D be an (N+1)-by-(M+1) matrix. Initialize D[i, 0] = i · C_delete and D[0, j] = j · C_insert; fill each cell with the minimum of D[i-1, j] + C_delete, D[i, j-1] + C_insert, and D[i-1, j-1] + (0 if x_i = y_j, else C_substitute); return D[N, M].
117 Complexities Time: O(NM). Space: O(NM).
118-141 Obtaining Alignment (Backtracing) [table animation across slides] Worked example: dog (rows) vs. frog (columns). While filling each cell, we also record which neighbor produced the minimum (a backpointer). Starting from the final cell, whose value 3 is the edit distance, and following backpointers back to (0, 0) recovers an optimal alignment, e.g. f/–, r/d, o/o, g/g: delete f, substitute r with d, and keep o and g.
142 Time Complexity Bookkeeping (filling the N x M table) costs O(NM); backtracing the alignment costs O(N + M) 142
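Both costs can be seen in a short sketch (a minimal implementation, not the lecture's own code): filling the table visits all N x M cells, while the backtrace follows at most N + M pointers. Costs follow the convention used in the lecture: insert and delete cost 1, substitution costs 2.

```python
def min_edit_distance(source, target):
    """Minimum edit distance (insert/delete cost 1, substitution cost 2)
    with backpointers for recovering an alignment."""
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j] -- O(NM) cells
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            choices = [(D[i - 1][j - 1] + sub, "sub"),  # substitute / copy
                       (D[i - 1][j] + 1, "del"),        # delete source char
                       (D[i][j - 1] + 1, "ins")]        # insert target char
            D[i][j], ptr[i][j] = min(choices, key=lambda c: c[0])
    # Backtrace: each step decreases i or j, so at most N + M steps
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        op = ptr[i][j]
        if op == "sub":
            pairs.append((source[i - 1], target[j - 1])); i -= 1; j -= 1
        elif op == "del":
            pairs.append((source[i - 1], "*")); i -= 1
        else:
            pairs.append(("*", target[j - 1])); j -= 1
    return D[n][m], pairs[::-1]

dist, align = min_edit_distance("dog", "frog")
print(dist)   # -> 3, matching the bottom-right cell of the slides' table
print(align)  # one optimal alignment; ties in the table permit several
```

Note that ties in the minimization mean several alignments can share the optimal cost; the backtrace recovers one of them, which is why only the distance (not a particular alignment) is unique.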
143 A Valid Distance Laws that a valid distance measure should respect: d(x, y) ≥ 0 (non-negativity); d(x, y) = 0 iff x = y; d(x, y) = d(y, x) (symmetry); d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality) 143
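These laws can be checked empirically for edit distance (a sketch assuming the standard metric axioms: non-negativity, zero only for identical strings, symmetry, and the triangle inequality; unit-cost Levenshtein is used here for brevity):

```python
from itertools import product

def lev(a, b):
    """Unit-cost Levenshtein distance (insert = delete = substitute = 1),
    computed with a single rolling row."""
    D = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, D[0] = D[0], i
        for j, cb in enumerate(b, 1):
            prev, D[j] = D[j], min(D[j] + 1,          # delete from a
                                   D[j - 1] + 1,      # insert into a
                                   prev + (ca != cb)) # substitute / copy
    return D[-1]

words = ["frog", "dog", "fog", "graffe", "giraffe", ""]
for x, y, z in product(words, repeat=3):
    assert lev(x, y) >= 0                      # non-negativity
    assert (lev(x, y) == 0) == (x == y)        # d(x, y) = 0 iff x = y
    assert lev(x, y) == lev(y, x)              # symmetry
    assert lev(x, z) <= lev(x, y) + lev(y, z)  # triangle inequality
print("all metric laws hold on the sample")
```

Passing on a sample is of course not a proof; edit distance satisfies the laws in general because each operation is reversible at equal cost (symmetry) and concatenating an edit script for x → y with one for y → z yields a script for x → z (triangle inequality).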
144 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 144
145 Topics Today Regular Expressions Basic Regular Expressions Composing Regular Expressions Q&A Text Processing Sentence Segmentation Word Tokenization Word Normalization and Stemming Minimum Edit Distance Levenshtein Distance Backtracing for Alignment 145