LING 473: Day 10. START THE RECORDING. Coding for Probability, Hidden Markov Models, Formal Grammars


Issues with Projects
1. *.sh files must have #!/bin/sh at the top (to run on Condor).
2. If run.sh is supposed to write to stdout, write to stdout. Do *not* put the > output redirection inside the .sh file. The instructions request that the script write to stdout. (My scripts are expecting output on stdout.)
3. run.sh (and other files) must be made executable: chmod u+x <filename>
4. Code must run on Patas. This includes run.sh, as turned in.

Issues with Projects
5. Name files correctly: output, not output.txt, not out.put

Reminder: Project 3 is due tonight at 11:45pm PDT. (As always, slides adapted from Glenn Slayden's version of this class.)

Simplistic tagger. Given a sentence $S = w_1, w_2, \ldots, w_n$, tag each word with $\hat{t} = \operatorname{argmax}_t P(t_i \mid w_i)$. This is surely the function we want to maximize, but it's not clear how to calculate the probabilities $P(t \mid w)$. Simplistic tagger: why don't we use probabilities calculated from a corpus, as in Assignment 3?

Simplistic tagger: $\operatorname{argmax}_t P(t \mid \text{the}) = \text{DT}$ and $\operatorname{argmax}_t P(t \mid \text{cold}) = \text{JJ}$.
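To make the simplistic tagger concrete, here is a minimal Python sketch that estimates $P(t \mid w)$ from per-word tag counts and returns the most frequent tag for each word. The tiny corpus and all names are hypothetical, not taken from the assignment code.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus: a list of (word, tag) pairs.
corpus = [("the", "DT"), ("cold", "JJ"), ("wind", "NN"),
          ("the", "DT"), ("cold", "NN"), ("hurts", "VBZ")]

# count(w, t) for every word, broken down by tag.
tag_counts_per_word = defaultdict(Counter)
for word, tag in corpus:
    tag_counts_per_word[word][tag] += 1

def simplistic_tag(word):
    """Return argmax_t P(t | w), estimated as count(w, t) / count(w)."""
    tags = tag_counts_per_word.get(word)
    if tags is None:
        return None          # unseen word: this tagger has nothing to say
    return tags.most_common(1)[0][0]

print(simplistic_tag("the"))   # DT
print(simplistic_tag("cold"))  # JJ -- context is ignored entirely
```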

How well does the simplistic tagger work? Such a POS tagger is not really usable. For example: a word's most probable POS tag may be JJ while the correct tag in context is NN, or its most probable tag may be VBG while the correct tag is JJ.

Use Bayes' Theorem. Of course, you have this memorized: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. Remember, this was our objective function: $\hat{t} = \operatorname{argmax}_t P(t_i \mid w_i)$. Applying Bayes' theorem: $\hat{t} = \operatorname{argmax}_t \frac{P(w_i \mid t_i)\,P(t_i)}{P(w_i)}$. This is one of the most important slides of this entire class.

For each candidate tag $t_i$ we evaluate, $P(w_i)$ will be the same, so we can cancel it: $\hat{t} = \operatorname{argmax}_t \frac{P(w_i \mid t_i)\,P(t_i)}{P(w_i)} = \operatorname{argmax}_t P(w_i \mid t_i)\,P(t_i)$. The best sequence of tags is determined by the probability of each word given its tag and also the probability of that tag.

$\hat{t} = \operatorname{argmax}_t \underbrace{P(w_i \mid t_i)}_{\text{likelihood}}\,\underbrace{P(t_i)}_{\text{prior}}$. We compute the most probable tag sequence by multiplying the likelihood and the prior probability for each tag sequence and choosing the tag sequence for which this product is greatest. Unfortunately, this is still too hard to compute directly. (Paraphrased from Jurafsky & Martin, p. 140.)

We still need to make some assumptions (we'll come back to this part later): $\hat{t} = \operatorname{argmax}_t P(w_i \mid t_i)\,P(t_i)$. Assumption 1: if we want to use corpus probabilities to estimate $P(w_i \mid t_i)$, we need to formally note that we're assuming it.

$P(w_1^n \mid t_1^n) = \prod_i P(w_i \mid t_i, t_{i-1}, t_{i-2}, \ldots) \approx \prod_i P(w_i \mid t_i)$. The only POS tag a word depends on is its own. Is this true?

Any progress? So wait: if we're assuming the only POS tag a word depends on is its own, how is this going to be better than the simplistic tagger from before, which assumed that the only word a POS tag depends on is its own? In other words, why is $P(w \mid t)$ going to work better than $P(t \mid w)$? Hint: think about the sample space Ω. Hint: compare the number of distinct tags with the number of distinct words. Answer: because there are a lot more distinct words than tags, conditioning on tags rather than words increases the resolution of the corpus measurements.

Example: $P(\text{cold} \mid \text{NN}) = .00002$ and $P(\text{cold} \mid \text{JJ}) = .00040$, but $P(\text{JJ} \mid \text{cold}) = .97$ and $P(\text{NN} \mid \text{cold}) = .03$. That $.97$ would drown out our calculation, and we'd never tag cold as a noun!

$\hat{t} = \operatorname{argmax}_t P(w_i \mid t_i)\,P(t_i)$. Assumption 2: the only tags that a tag $t_i$ depends on are the previous $n$ tags, $t_{i-n}, \ldots, t_{i-1}$. For example, in a POS bigram model: $P(t_1^n) = \prod_i P(t_i \mid t_{i-1}, t_{i-2}, t_{i-3}, \ldots) \approx \prod_i P(t_i \mid t_{i-1})$. This is known as the bigram assumption: the only POS tag(s) a POS tag depends on are the ones immediately preceding it.

Putting it together: $\hat{t} = \operatorname{argmax}_t P(w_i \mid t_i)\,P(t_i)$, with $P(w_1^n \mid t_1^n) \approx \prod_i P(w_i \mid t_i)$ and $P(t_1^n) \approx \prod_i P(t_i \mid t_{i-1})$. So $\hat{t} = \operatorname{argmax}_t \prod_i P(w_i \mid t_i) \prod_i P(t_i \mid t_{i-1}) = \operatorname{argmax}_t \prod_i P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$.

Reminder: estimating $P(w_i \mid t_i)$ from a corpus. Definition of conditional probability: $P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{\mathrm{count}(A, B)/|\Omega|}{\mathrm{count}(B)/|\Omega|}$. Word likelihood: $P(w_i \mid t_i) = \frac{\mathrm{count}(w_i, t_i)}{\mathrm{count}(t_i)}$.

Reminder: estimating $P(t_i \mid t_{i-1})$ from a corpus. Definition of conditional probability: $P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{\mathrm{count}(A, B)/|\Omega|}{\mathrm{count}(B)/|\Omega|}$. Tag transition probability: $P(t_i \mid t_{i-1}) = \frac{\mathrm{count}(t_{i-1}, t_i)}{\mathrm{count}(t_{i-1})}$.
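Both estimates are just relative frequencies, so they can be collected in one pass over a tagged corpus. Here is a minimal sketch, assuming the corpus is simply a flat list of (word, tag) pairs; the names and toy data are illustrative, not the course's actual data structures.

```python
from collections import Counter

# Hypothetical tagged corpus: a flat list of (word, tag) pairs.
tagged = [("the", "DT"), ("cold", "NN"), ("hurts", "VBZ"),
          ("the", "DT"), ("cold", "JJ"), ("wind", "NN")]

tags = [t for _, t in tagged]
emit_counts = Counter(tagged)                # count(w_i, t_i)
tag_counts = Counter(tags)                   # count(t_i)
trans_counts = Counter(zip(tags, tags[1:]))  # count(t_{i-1}, t_i)

def p_word_given_tag(w, t):
    """Word likelihood: count(w_i, t_i) / count(t_i)."""
    return emit_counts[(w, t)] / tag_counts[t]

def p_tag_given_prev(t, prev):
    """Tag transition: count(t_{i-1}, t_i) / count(t_{i-1})."""
    return trans_counts[(prev, t)] / tag_counts[prev]

print(p_word_given_tag("cold", "NN"))  # 0.5 in this toy corpus
print(p_tag_given_prev("NN", "DT"))    # 0.5 in this toy corpus
```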

POS tagging objective function: $\hat{t} = \operatorname{argmax}_t \prod_i \frac{\mathrm{count}(w_i, t_i)}{\mathrm{count}(t_i)} \cdot \frac{\mathrm{count}(t_{i-1}, t_i)}{\mathrm{count}(t_{i-1})}$. Here $\hat{t}$ is the best POS tag sequence; the first factor asks how often word $w_i$ occurs with tag $t_i$ in the corpus, and the second asks how often $t_i$ follows $t_{i-1}$ in the corpus. This might seem a little backwards (especially if you aren't familiar with Bayes' theorem): we're trying to find the best tag sequence, but we're using $P(w \mid t)$, which seems to be predicting words. This compares "If we are expecting an adjective (based on the tag sequence), how likely is it that the adjective will be cold?" versus "If we are expecting a noun, how likely is it that the noun will be cold?"

If we are expecting an adjective, how likely is it that the adjective will be cold? (high) WEIGHTED BY our chance of seeing the sequence DT JJ (medium)
versus
If we are expecting a noun, how likely is it that the noun will be cold? (medium) WEIGHTED BY our chance of seeing the sequence DT NN (very high)
THE WINNER: NN
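To make the weighting concrete, suppose (with purely illustrative numbers, not measurements from any corpus) that $P(\text{cold} \mid \text{JJ}) = 0.004$, $P(\text{cold} \mid \text{NN}) = 0.002$, $P(\text{JJ} \mid \text{DT}) = 0.2$, and $P(\text{NN} \mid \text{DT}) = 0.6$. Then:

$P(\text{cold} \mid \text{JJ}) \cdot P(\text{JJ} \mid \text{DT}) = 0.004 \times 0.2 = 8 \times 10^{-4}$

$P(\text{cold} \mid \text{NN}) \cdot P(\text{NN} \mid \text{DT}) = 0.002 \times 0.6 = 1.2 \times 10^{-3}$

Even though cold is the more likely realization of an adjective than of a noun, the much higher chance of seeing the tag sequence DT NN tips the product toward NN.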

Multiplying probabilities. We're multiplying a whole lot of probabilities together. What do we know about probability values? $0 \le p \le 1$. What happens when you multiply a lot of these together? This is an important consideration in computational linguistics: we need to worry about underflow.

Underflow. When multiplying many probability terms together, we need to prevent underflow. Due to limitations in the computer's internal representation of floating-point numbers, the product quickly becomes zero. We usually work with the logarithm of the probability values. This is known as the log-prob: $\log_{10} p$.
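A quick way to see the problem: multiply a thousand probabilities of 0.1 and the product underflows to exactly 0.0 in an IEEE 754 double, while the sum of log-probs stays perfectly representable. A small Python illustration:

```python
import math

# One thousand probabilities of 0.1: the true product is 1e-1000,
# far below what an IEEE 754 double can represent (roughly 1e-308).
probs = [0.1] * 1000

product = 1.0
for p in probs:
    product *= p
print(product)                               # 0.0 -- underflow

log_prob = sum(math.log10(p) for p in probs)
print(log_prob)                              # approximately -1000.0
```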

IEEE 754 floating point: a 32-bit single float ranges from about $1.2 \times 10^{-38}$ to $3.4 \times 10^{38}$; a 64-bit double reaches about $\pm 1.8 \times 10^{308}$.

Logarithms refresher. Definition: $\log_b x = y$ means $x = b^y$. Since $b^x b^y = b^{x+y}$, we have $\log xy = \log x + \log y$, and more generally $\log \prod_i x_i = \sum_i \log x_i$.

Similarly, $b^x / b^y = b^{x-y}$, so $\log \frac{x}{y} = \log x - \log y$. Exercise: write an expression for Bayes' theorem as log-probabilities.

Bayes' theorem as log-probs: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, so $\log P(A \mid B) = \log P(B \mid A) + \log P(A) - \log P(B)$.

Remember this? $\hat{t} = \operatorname{argmax}_t \prod_i \frac{\mathrm{count}(w_i, t_i)}{\mathrm{count}(t_i)} \cdot \frac{\mathrm{count}(t_{i-1}, t_i)}{\mathrm{count}(t_{i-1})}$, which becomes $\hat{t} = \operatorname{argmax}_t \sum_i \left[ \log \frac{\mathrm{count}(w_i, t_i)}{\mathrm{count}(t_i)} + \log \frac{\mathrm{count}(t_{i-1}, t_i)}{\mathrm{count}(t_{i-1})} \right]$. Wait, how can we do this? There was no log in the original equation outside the product!

argmax magic: it doesn't matter. Since argmax doesn't care about the actual value, but rather just the sequence that gives it, we can drop the overall log. This is valid so long as $\log x$ is a monotonically increasing function: argmax will find the same best tag sequence when looking at either probabilities or log-probs, because both functions peak at the same point. $\hat{t} = \operatorname{argmax}_t \sum_i \left[ \log P(w_i \mid t_i) + \log P(t_i \mid t_{i-1}) \right]$

Hidden Markov Model. This is the foundation for the Hidden Markov Model (HMM) for POS tagging. To proceed further and solve the argmax is still a challenge: $\hat{t} = \operatorname{argmax}_t \sum_i \left[ \log P(w_i \mid t_i) + \log P(t_i \mid t_{i-1}) \right]$. Computing this naively is still $O(T^n)$.

Dynamic programming. The Viterbi algorithm is typically used to decode Hidden Markov Models. (You probably will implement it in Ling 570.) It is a dynamic programming technique: we maintain a trellis of partial computations. This approach reduces the problem to $O(T^2 n)$ time.
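For reference, here is a minimal Viterbi sketch in log space. The dictionary-based parameter format and all names are assumptions made for illustration, not the Ling 570 starter code: it fills the trellis left to right and then follows backpointers to recover the best tag sequence.

```python
def viterbi(words, tags, log_emit, log_trans, log_init):
    """Minimal Viterbi sketch in log space.

    log_emit[t][w]  ~ log P(w | t)       emission log-probs
    log_trans[s][t] ~ log P(t | s)       transition log-probs
    log_init[t]     ~ log P(t | start)   initial log-probs
    (plain nested dicts, assumed to be supplied by the caller)
    """
    NEG_INF = float("-inf")
    # trellis[i][t] = best log-prob of a tag sequence for words[:i+1] ending in t
    trellis = [{t: log_init[t] + log_emit[t].get(words[0], NEG_INF) for t in tags}]
    backptr = [{}]
    for i in range(1, len(words)):
        row, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda s: trellis[i - 1][s] + log_trans[s][t])
            row[t] = (trellis[i - 1][best] + log_trans[best][t]
                      + log_emit[t].get(words[i], NEG_INF))
            ptr[t] = best
        trellis.append(row)
        backptr.append(ptr)
    # Follow backpointers from the best final tag to recover the sequence.
    last = max(tags, key=lambda t: trellis[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))
```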

POS trigram model. Recall the bigram assumption: $P(t_1^n) \approx \prod_i P(t_i \mid t_{i-1})$. We can improve the tagging accuracy by extending to a trigram (or larger) model: $P(t_1^n) \approx \prod_i P(t_i \mid t_{i-2}, t_{i-1})$. Question: why not increase to 8-, 10-, 12-grams?

Data sparsity. However, we might start having a problem if we try to get a value for $P(t_i \mid t_{i-2}, t_{i-1})$ by counting in the corpus: $\frac{\mathrm{count}(t_{i-2}, t_{i-1}, t_i)}{\mathrm{count}(t_{i-2}, t_{i-1})}$. Consider a sequence like "it was a butterfly in distress that she ...": the count of this in our training set is likely to be zero.

Unseens. Our model will predict zero probability for something that we actually encounter. This counts as a failure of the model (why?). This is a pervasive problem in corpus linguistics: at runtime, how do you deal with observations that you never encountered during training (unseen data)?

Smoothing. We don't want our model to have a discontinuity between something infrequent and something unseen. Various techniques address this problem: add-one smoothing; the Good-Turing method; assuming unseens have the probability of the rarest observation. Ideally, smoothing preserves the validity of your probability space.
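As one example, add-one (Laplace) smoothing pretends every event was seen once more than it actually was; the denominator grows by the number of possible events, so the probabilities still sum to one. A minimal sketch for the tag-transition distribution follows; the names and counts are illustrative only.

```python
from collections import Counter

def add_one_transition_prob(bigram_counts, tag_counts, num_tags):
    """Add-one (Laplace) smoothed estimate of P(t_i | t_{i-1}).

    bigram_counts : Counter of (prev_tag, tag) pairs
    tag_counts    : Counter of individual tags
    num_tags      : size of the tag set, |T|
    """
    def p(tag, prev):
        return (bigram_counts[(prev, tag)] + 1) / (tag_counts[prev] + num_tags)
    return p

# Illustrative counts: DT was followed by NN 8 times and by JJ 2 times.
bigrams = Counter({("DT", "NN"): 8, ("DT", "JJ"): 2})
tags = Counter({"DT": 10, "NN": 8, "JJ": 2})
p = add_one_transition_prob(bigrams, tags, num_tags=3)

print(p("NN", "DT"))  # (8 + 1) / (10 + 3) ~= 0.692
print(p("JJ", "DT"))  # (2 + 1) / (10 + 3) ~= 0.231
print(p("DT", "DT"))  # (0 + 1) / (10 + 3) ~= 0.077, instead of zero
```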

Formal Grammars: what is a grammar? A formal grammar (a mathematical definition) is a system that generates and recognizes strings present in some formal language. A formal language is a (possibly infinite) set of strings. Examples of formal languages: cricket language: {chirp, chirp-chirp}; sheep language: {ba, baa, baaa, baaaa, ...}.

Formal Languages. Some terms: Alphabet (Σ): a finite set of symbols; a symbol could be written, auditory, visual, etc. String: a (possibly empty) ordered sequence of symbols from Σ. Formal language (L): a set of strings defined over the alphabet Σ. Any language will always be a subset of Σ*: $L \subseteq \Sigma^*$.

Formal Languages. One of the simplest languages is Σ*, all possible strings pulled from the alphabet. It's often useful to constrain the language a bit more. We can use generative rules to constrain the language. These have the nice effect of generating constituents.

Constituency. Groups of words behave as a single unit or phrase. Sentences have parts, which have subparts, etc. E.g., noun phrases: kermit the frog; he; August 22; the failure of the advisory committee to solve the problem.

Constituent Phrases. For constituents, we usually name them as phrases based on the word that heads the constituent.

Evidence for constituency. They appear in similar environments (before a verb):
Kermit the frog comes on stage. / He comes to Massachusetts every summer. / August 22 is when your project 3 is due. / The failure of the advisory committee to solve the problem has resulted in lost revenue.
But not each individual word in the constituent: *The comes our... / *is comes out... / *for comes out...
The constituent can be placed in a number of different locations:
On August 22 I'd like to fly to Seattle. / I'd like to fly on August 22 to Seattle. / I'd like to fly to Seattle on August 22.
But not split apart: *On August I'd like to fly 22 to Seattle. / *On I'd like to fly August 22 to Seattle.

Substitution test.
Adjective: The {sad, intelligent, green, fat, ...} one is in the corner.
Noun: The {cat, mouse, dog} ate the bug.
Verb: Kim {loves, eats, makes, buys, moves} potato chips.

Traditional parts of speech (word categories). Most of these form phrases (noun phrase, adverb phrase, etc.).

Formal Grammars. Here's how we'll constrain a language from all possible strings Σ*: constituents can only be juxtaposed as permitted by R. Each rule generates a constituent, including the top constituent, the sentence. Σ is a set of symbols (words, letters, ...); R is a set of rules. Our formal grammar is G = (Σ, R, ...).

Formal Grammars. Formal grammars can be viewed in two useful ways. Given a configuration of rules in grammar G, find the string(s) S that can be formed: this is called generation. Given a string S in a language L, find a rule configuration that generates S: this is called parsing.

Formal Grammars. A grammar is fully defined by the tuple $G = (V, \Sigma, S, R)$:
V is a finite set of variables or preterminals (our constituents, incl. POS);
Σ is a finite set of symbols, called terminals (words);
S is in V and is called the start symbol;
R is a finite set of productions, which are rules of the form $\alpha \rightarrow \beta$, where α and β are strings consisting of terminals and variables.

Types of formal grammars. We can rank grammars by expressiveness (the Chomsky hierarchy): unrestricted, recursively enumerable (type 0); context-sensitive (type 1); context-free grammars (type 2); regular grammars (type 3).

Types of formal grammars.
Unrestricted, recursively enumerable (type 0): $\alpha \rightarrow \beta$, where α and β are any strings of terminals and nonterminals.
Context-sensitive (type 1): $\alpha X \beta \rightarrow \alpha\gamma\beta$, where X is a nonterminal; α, β, γ are any strings of terminals and nonterminals; γ may not be empty.
Context-free grammars (type 2): $X \rightarrow \gamma$, where X is a nonterminal and γ is any string of terminals and nonterminals.
Regular grammars (type 3): $X \rightarrow \alpha Y$ (i.e., a right-regular grammar), where X and Y are nonterminals, α is any string of terminals, and Y may be absent.

Equivalence with automata. Formal grammar classes are defined according to the type of automaton that can accept the language. There are various types of automata, and many correspond to certain types of formal grammars:
Recursively enumerable: accepted by a Turing machine.
Context-sensitive: accepted by a non-deterministic linear bounded automaton (LBA).
Context-free: accepted by a non-deterministic pushdown automaton (NDPA).
Regular: accepted by a finite state automaton.

Regular languages. Let's look at the most restricted case. A regular language (over an alphabet Σ) is any language for which there exists a finite state machine (finite automaton) that recognizes (accepts) it. This class is equivalent to regular expressions (but without capture groups).

Regular Expressions and Regular Languages. There are simple finite automata corresponding to the simple regular expressions: the empty set, the empty string λ, a single symbol $a \in \Sigma$, and compound expressions such as a*(a+b)c*. Each of these has an initial state and one accepting state. (The slide shows the corresponding automaton diagrams.)

Regular grammars. Regular grammars can be generated by FSMs (equivalence with RegEx). Definition of a right-regular grammar (type 3): every rule in R is of the form $A \rightarrow aB$, or $A \rightarrow a$, or $S \rightarrow \lambda$ (to allow λ to be in the language), where A and B are variables in V (perhaps the same, but B can't be S) and a is any terminal symbol.

Regular grammars (type 3). Example:
V = {S, A}
Σ = {a, b, c}
R = {S → aS, S → bA, A → λ, A → cA}
start symbol: S
RegEx: a*bc*. A regular grammar cannot express $a^n b^n$.
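To connect the grammar to its regular expression, here is a small illustrative Python sketch (the rule encoding is ours, not from the course materials): it repeatedly expands the single rightmost nonterminal until the λ-rule ends the derivation, then checks each derived string against a*bc*.

```python
import random
import re

# The grammar from this slide, with each rule's right-hand side as a string:
# S -> aS | bA      A -> cA | lambda (written here as "")
rules = {"S": ["aS", "bA"], "A": ["cA", ""]}

def generate():
    """Derive one string: each right-regular rule emits terminals and leaves
    at most one nonterminal, which is always the rightmost symbol."""
    out, nonterminal = "", "S"
    while nonterminal is not None:
        expansion = random.choice(rules[nonterminal])
        nonterminal = None
        for symbol in expansion:
            if symbol.isupper():
                nonterminal = symbol   # continue expanding from here
            else:
                out += symbol          # terminals accumulate left to right
    return out

# Every string this grammar derives matches the equivalent regex a*bc*.
for _ in range(5):
    s = generate()
    assert re.fullmatch(r"a*bc*", s)
    print(s)   # e.g. "b", "ab", "bcc", "aabccc", ...
```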

Parsing regular grammars (type 3): parsing space $O(1)$, parsing time $O(n)$.

Next time: continue exploring the Chomsky hierarchy and regular languages; Project 3 review; Project 5; self quiz.