
Part-of-Speech Tagging
Informatics 2A: Lecture 17
Adam Lopez
School of Informatics, University of Edinburgh
27 October 2016

Last class
We discussed the POS tag lexicon:
- When do words belong to the same class? Three criteria.
- What tagset should we use?
- What are the sources of ambiguity for POS tagging?

1. Methods for tagging
   - Unigram tagging
   - Bigram tagging
   - Tagging using Hidden Markov Models: the Viterbi algorithm

Reading: Jurafsky & Martin, chapters (5 and) 6.

Before we get started
Question you should always ask yourself: How hard is this problem?

For POS tagging, this boils down to: How ambiguous are parts of speech, really? If most words have an unambiguous POS, then we can probably write a simple program that solves POS tagging with just a lookup table. E.g. whenever I see the word "the", output DT.

This is an empirical question. To answer it, we need data.

Corpus annotation
A corpus (plural corpora) is a computer-readable collection of natural language text (or speech) used as a source of information about the language: e.g. what words or constructions can occur in practice, and with what frequencies?

To answer the question about POS ambiguity, we use corpora in which each word has been annotated with its POS tag, e.g.

  Our/PRP$ enemies/NNS are/VBP innovative/JJ and/CC resourceful/JJ ,/, and/CC so/RB are/VB we/PRP ./.
  They/PRP never/RB stop/VB thinking/VBG about/IN new/JJ ways/NNS to/TO harm/VB our/PRP$ country/NN and/CC our/PRP$ people/NN ,/, and/CC neither/DT do/VB we/PRP ./.
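To work with such annotated text programmatically, you typically split each token on its final slash to recover (word, tag) pairs. A minimal sketch, using the example sentence above (the simple word/TAG convention is assumed; real corpora usually ship with their own readers):

    from collections import defaultdict

    annotated = ("Our/PRP$ enemies/NNS are/VBP innovative/JJ and/CC "
                 "resourceful/JJ ,/, and/CC so/RB are/VB we/PRP ./.")

    # Split on the last slash so words containing "/" are not broken up.
    pairs = [tok.rsplit("/", 1) for tok in annotated.split()]

    # Count how many distinct tags each word type receives -- the basis for
    # the ambiguity statistics on the following slides.
    tags_per_type = defaultdict(set)
    for word, tag in pairs:
        tags_per_type[word.lower()].add(tag)

    for word, tags in sorted(tags_per_type.items()):
        print(word, sorted(tags))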

Empirical upper bounds: inter-annotator agreement
Even for humans, tagging sometimes poses difficult decisions. E.g. words in -ing: adjectives (JJ), or verbs in gerund form (VBG)?

  a boring/JJ lecture       a very boring lecture       ?a lecture that bores
  the falling/VBG leaves    *the very falling leaves    the leaves that fall
  a revolving/VBG? door     *a very revolving door      a door that revolves      *the door seems revolving
  sparkling/JJ? lemonade    ?very sparkling lemonade    lemonade that sparkles    the lemonade seems sparkling

In view of such problems, we can't expect 100% accuracy from an automatic tagger. In the Penn Treebank, annotators disagree around 3.5% of the time. Put another way: if we assume that one annotator tags perfectly, and then measure the accuracy of another annotator by comparing with the first, they will only be right about 96.5% of the time. We can hardly expect a machine to do better!

Word types and tokens
Need to distinguish word tokens (particular occurrences in a text) from word types (distinct vocabulary items). We'll count different inflected or derived forms (e.g. break, breaks, breaking) as distinct word types.

A single word type (e.g. still) may appear with several POS, but most words have a clear most frequent POS.

Question: How many tokens and types in the following? Ignore case and punctuation.

  Esau sawed wood. Esau Wood would saw wood. Oh, the wood Wood would saw!

1. 14 tokens, 6 types
2. 14 tokens, 7 types
3. 14 tokens, 8 types
4. None of the above
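A quick way to check your answer (a minimal sketch: lower-case the text and strip punctuation, as the question instructs, then count):

    import re

    text = "Esau sawed wood. Esau Wood would saw wood. Oh, the wood Wood would saw!"

    # Ignore case and punctuation, then split into word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    types = set(tokens)

    print(len(tokens), "tokens")
    print(len(types), "types:", sorted(types))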

Extent of POS Ambiguity
The Brown corpus (1,000,000 word tokens) has 39,440 different word types:
- 35,340 have only 1 POS tag anywhere in the corpus (89.6%)
- 4,100 (10.4%) have 2 to 7 POS tags

So why does just 10.4% POS-tag ambiguity by word type lead to difficulty? Because of the Zipfian distribution: many high-frequency words have more than one POS tag. In fact, more than 40% of the word tokens are ambiguous.

  He wants to/TO go.
  He went to/IN the store.
  He wants that/DT hat.
  It is obvious that/CS he wants a hat.
  He wants a hat that/WPS fits.

Word Frequencies in Different Languages
Ambiguity by part-of-speech tags:

  Language    Type-ambiguous   Token-ambiguous
  English         13.2%            56.2%
  Greek           <1%              19.14%
  Japanese         7.6%            50.2%
  Czech           <1%              14.5%
  Turkish          2.5%            35.2%

Word Frequency: Properties of Words in Use
Take any corpus of English like the Brown Corpus or Tom Sawyer and sort its words by how often they occur.

  word      Freq. (f)   Rank (r)   f × r
  the         3332          1       3332
  and         2972          2       5944
  a           1775          3       5235
  he           877         10       8770
  but          410         20       8400
  be           294         30       8820
  there        222         40       8880
  one          172         50       8600
  about        158         60       9480
  more         138         70       9660
  never        124         80       9920
  Oh           116         90      10440
  two          104        100      10400
  turned        51        200      10200
  you'll        30        300       9000
  name          21        400       8400
  comes         16        500       8000
  group         13        600       7800
  lead          11        700       7700
  friends       10        800       8000
  begin          9        900       8100
  family         8       1000       8000
  brushed        4       2000       8000
  sins           2       3000       6000

Zipf's law
Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table (an observation made by Harvard linguist George Kingsley Zipf). Zipf's law states that

  f ∝ 1/r

i.e. there is a constant k such that f · r = k.
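A minimal sketch of how you might check this on a corpus of your own: count word frequencies, rank them, and inspect the product f · r (the file name and the crude tokenisation are placeholders):

    from collections import Counter
    import re

    # Hypothetical corpus file; substitute any plain-text corpus.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())

    # Rank 1 is the most frequent word.
    ranked = Counter(tokens).most_common()

    for rank in (1, 2, 3, 10, 100, 1000):
        if rank <= len(ranked):
            word, freq = ranked[rank - 1]
            print(f"rank {rank:5d}  {word:12s}  f={freq:6d}  f*r={freq * rank}")
    # If Zipf's law holds, the f*r column stays roughly constant.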

Zipf's law for the Brown corpus
[figure: plot of word frequency against rank for the Brown corpus]

Zipf's law
According to Zipf's law:
- There is a very small number of very common words.
- There is a small-to-medium number of middle-frequency words.
- There is a very large number of words that are infrequent.
(It's not fully understood why Zipf's law works so well for word frequencies.)

In fact, many other kinds of data conform closely to a Zipfian distribution:
- Populations of cities.
- Sizes of earthquakes.
- Amazon sales rankings.

Some tagging strategies
Now we have a sense that POS tagging is non-trivial. We'll look at several methods or strategies for automatic tagging.

One simple strategy: just assign to each word its most common tag. (So still will always get tagged as an adverb, never as a noun, verb or adjective.) Call this unigram tagging, since we only consider one token at a time.

Very, very important: compute the unigram frequencies on different data from the data you test the tagger on. Why?

Surprisingly, even this crude approach typically gives around 90% accuracy. (State of the art is 96-98%.) Can we do better? We'll look briefly at bigram tagging, then at Hidden Markov Model tagging.
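A minimal unigram-tagger sketch, assuming training data given as a list of (word, tag) pairs that is kept separate from the test data (as the slide insists); the toy data here is made up:

    from collections import Counter, defaultdict

    def train_unigram(tagged_train):
        """Map each word to its most frequent tag in the training data."""
        counts = defaultdict(Counter)
        for word, tag in tagged_train:
            counts[word.lower()][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_unigram(words, model, default="NN"):
        # Unknown words get a default tag; NN is a common crude choice.
        return [(w, model.get(w.lower(), default)) for w in words]

    train = [("the", "DT"), ("still", "RB"), ("still", "RB"), ("still", "NN"),
             ("water", "NN"), ("runs", "VBZ")]
    model = train_unigram(train)
    print(tag_unigram(["the", "still", "water"], model))
    # -> [('the', 'DT'), ('still', 'RB'), ('water', 'NN')]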

Bigram tagging
Remember: linguistic tests often look at adjacent POS (e.g. nouns are often preceded by determiners). Let's use this idea: we'll look at pairs of adjacent POS. For each word (e.g. still), tabulate the frequencies of each possible POS given the POS of the preceding word.

Example (with made-up numbers), for the word still:

  tag of still   after DT   after MD   after JJ  ...
  NN                  8          0          6
  JJ                 23          0         14
  VB                  1         12          2
  RB                  6         45          3

Given a new text, tag the words from left to right, assigning each word the most likely tag given the preceding one.

Could also consider trigram (or more generally n-gram) tagging, etc. But the frequency matrices would quickly get very large, and also (for realistic corpora) too sparse to be really useful.
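A greedy bigram-tagging sketch under the same assumptions as the unigram example above: moving left to right, pick the tag that was most frequent for the current word given the tag just chosen (the training sentences are again made up):

    from collections import Counter, defaultdict

    def train_bigram(tagged_sents):
        """counts[(prev_tag, word)] -> Counter of tags seen for word."""
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            prev = "<s>"
            for word, tag in sent:
                counts[(prev, word.lower())][tag] += 1
                prev = tag
        return counts

    def tag_bigram(words, counts, default="NN"):
        tags, prev = [], "<s>"
        for w in words:
            options = counts.get((prev, w.lower()))
            # Greedy choice; back off to a default tag for unseen pairs.
            tag = options.most_common(1)[0][0] if options else default
            tags.append(tag)
            prev = tag
        return list(zip(words, tags))

    train = [[("the", "DT"), ("still", "JJ"), ("water", "NN")],
             [("he", "PRP"), ("will", "MD"), ("still", "RB"), ("go", "VB")]]
    print(tag_bigram(["the", "still", "water"], train_bigram(train)))
    # -> [('the', 'DT'), ('still', 'JJ'), ('water', 'NN')]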

Problems with bigram tagging
One incorrect tagging choice might have unintended effects:

  The still smoking remains of the campfire
  Intended:  DT RB VBG NNS IN DT NN
  Bigram:    DT JJ NN VBZ ...

No lookahead: choosing the most probable tag at one stage might lead to a highly improbable choice later.

  The still was smashed
  Intended:  DT NN VBD VBN
  Bigram:    DT JJ VBD ?

We'd prefer to find the overall most likely tag sequence given the bigram frequencies. This is what the Hidden Markov Model (HMM) approach achieves.

Hidden Markov Models
The idea is to model the process by which the words were generated as a probabilistic process. Think of the output as visible to us, but the internal states of the process (which contain POS information) as hidden.

For some outputs, there might be several possible ways of generating them, i.e. several sequences of internal states. Our aim is to compute the sequence of hidden states with the highest probability.

Specifically, our processes will be FSTs with probabilities. Simple, though not a very flattering model of human language users!

Generating a Sequence
Hidden Markov models can be thought of as devices that generate sequences with hidden states. Consider generating "Edinburgh has a very rich history." one step at a time: at each step the model moves to a new hidden state (a POS tag) and emits a word from that state.

  Edinburgh/NNP  has/VBZ  a/DT  very/RB  rich/JJ  history/NN

The probability of this derivation is a product of one transition and one emission probability per step:

    p(NNP | <s>)  · p(Edinburgh | NNP)
  × p(VBZ | NNP)  · p(has | VBZ)
  × p(DT | VBZ)   · p(a | DT)
  × p(RB | DT)    · p(very | RB)
  × p(JJ | RB)    · p(rich | JJ)
  × p(NN | JJ)    · p(history | NN)
  × p(stop | NN)

Definition of Hidden Markov Models
For our purposes, a Hidden Markov Model (HMM) consists of:
- A set Q = {q_0, q_1, ..., q_T} of states, with q_0 the start state. (Our non-start states will correspond to parts of speech.)
- A transition probability matrix A = (a_ij), for 0 <= i <= T and 1 <= j <= T, where a_ij is the probability of jumping from q_i to q_j. For each i, we require Σ_{j=1..T} a_ij = 1.
- For each non-start state q_i and word type w, an emission probability b_i(w) of outputting w upon entry into q_i. (Ideally, for each i, we'd have Σ_w b_i(w) = 1.)

We also suppose we're given an observed sequence w_1, w_2, ..., w_n of word tokens generated by the HMM.

Transition Probabilities
[figure]

Emission Probabilities
[figure]

Transition and Emission Probabilities

Transition probabilities (from row tag to column tag):

          VB      TO       NN       PRP
  <s>    .019    .0043    .041     .067
  VB     .0038   .035     .047     .0070
  TO     .83     0        .00047   0
  NN     .0040   .016     .087     .0045
  PRP    .23     .00079   .0012    .00014

Emission probabilities (word given tag):

          I       want      to     race
  VB      0       .0093     0      .00012
  TO      0       0         .99    0
  NN      0       .000054   0      .00057
  PRP     .37     0         0      0

How Do We Search for the Best Tag Sequence?
We have defined an HMM, but how do we use it? We are given a word sequence and must find its most likely tag sequence.

It's easy to compute the probability of generating a word sequence w_1 ... w_n via a specific tag sequence t_1 ... t_n: let t_0 denote the start state, and compute

  Π_{i=1..n} P(t_i | t_{i-1}) · P(w_i | t_i)    (1)

using the transition and emission probabilities. But how do we find the most likely tag sequence?
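For example, with the transition and emission tables above stored as nested dictionaries, the probability of any particular tagging of "I want to race" is just this product. A minimal sketch (the dictionary layout is my own choice, not part of the lecture):

    # Transition and emission probabilities from the tables above, stored as
    # nested dictionaries A[prev_tag][tag] and B[tag][word].
    A = {"<s>": {"VB": .019,  "TO": .0043,  "NN": .041,   "PRP": .067},
         "VB":  {"VB": .0038, "TO": .035,   "NN": .047,   "PRP": .0070},
         "TO":  {"VB": .83,   "TO": 0,      "NN": .00047, "PRP": 0},
         "NN":  {"VB": .0040, "TO": .016,   "NN": .087,   "PRP": .0045},
         "PRP": {"VB": .23,   "TO": .00079, "NN": .0012,  "PRP": .00014}}
    B = {"VB":  {"I": 0,   "want": .0093,   "to": 0,   "race": .00012},
         "TO":  {"I": 0,   "want": 0,       "to": .99, "race": 0},
         "NN":  {"I": 0,   "want": .000054, "to": 0,   "race": .00057},
         "PRP": {"I": .37, "want": 0,       "to": 0,   "race": 0}}

    def sequence_prob(words, tags):
        """Product of P(t_i | t_{i-1}) * P(w_i | t_i), with t_0 = <s>."""
        p, prev = 1.0, "<s>"
        for w, t in zip(words, tags):
            p *= A[prev][t] * B[t][w]
            prev = t
        return p

    words = ["I", "want", "to", "race"]
    print(sequence_prob(words, ["PRP", "VB", "TO", "VB"]))  # ~1.8e-10
    print(sequence_prob(words, ["PRP", "NN", "TO", "VB"]))  # much smaller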

Question
Given n word tokens and a tagset with T choices per token, how many tag sequences do we have to evaluate?

1. T tag sequences
2. n tag sequences
3. T · n tag sequences
4. T^n tag sequences

Bad news: there are T^n sequences.
Good news: we can do this efficiently using dynamic programming and the Viterbi algorithm.

The HMM trellis
[figure: a trellis with a START state followed by a column of candidate states NN, TO, VB, PRP for each of the observed words "I want to race"]

The Viterbi Algorithm
Keep a chart of the form Table(POS, i), where POS ranges over the POS tags and i ranges over the positions in the sentence. Initialise with the start transitions:

  Table(T, 1) = p(T | <s>) · p(w_1 | T)

and then, for each position i and each tag T:

  Table(T, i+1) = max_{T'} Table(T', i) · p(T | T') · p(w_{i+1} | T)

Table(·, n) will contain the probability of the most likely sequence. To get the actual sequence, we need backpointers.

The Viterbi algorithm
Let's now tag the newspaper headline: "deal talks fail". Note that each token here could be a noun (N) or a verb (V). We'll use a toy HMM given as follows:

Transitions:
              to N   to V
  from start   .8     .2
  from N       .4     .6
  from V       .8     .2

Emissions:
       deal   fail   talks
  N     .2    .05     .2
  V     .3    .3      .3

The Viterbi matrix
We fill in a matrix with one row per tag (N, V) and one column per word (deal, talks, fail), using the transition and emission tables above and the recurrence

  Table(T, i+1) = max_{T'} Table(T', i) · p(T | T') · p(w_{i+1} | T)

The Viterbi matrix

       deal          talks                            fail
  N    .8×.2 = .16   .16×.4×.2 = .0128                .0288×.8×.05 = .001152
                     (since .16×.4 > .06×.8)          (since .0128×.4 < .0288×.8)
  V    .2×.3 = .06   .16×.6×.3 = .0288                .0128×.6×.3 = .002304
                     (since .16×.6 > .06×.2)          (since .0128×.6 > .0288×.2)

Looking at the highest-probability entry in the final column (.002304, in the V row) and chasing the backpointers, we see that the tagging N N V wins.
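A minimal Viterbi sketch in Python for the toy HMM above; it reproduces the matrix entries and recovers the tagging N N V via backpointers:

    # Toy HMM from the slides above: transition and emission probabilities.
    trans = {"<s>": {"N": .8, "V": .2},
             "N":   {"N": .4, "V": .6},
             "V":   {"N": .8, "V": .2}}
    emit = {"N": {"deal": .2, "fail": .05, "talks": .2},
            "V": {"deal": .3, "fail": .3,  "talks": .3}}

    def viterbi(words, trans, emit):
        tags = list(emit)
        # table[i][t]: probability of the best tag sequence for words[:i+1]
        # ending in tag t; back[i][t] remembers the best previous tag.
        table = [{t: trans["<s>"][t] * emit[t][words[0]] for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            row, ptr = {}, {}
            for t in tags:
                prev = max(tags, key=lambda tp: table[i - 1][tp] * trans[tp][t])
                row[t] = table[i - 1][prev] * trans[prev][t] * emit[t][words[i]]
                ptr[t] = prev
            table.append(row)
            back.append(ptr)
        # Follow backpointers from the best tag in the final column.
        best = max(tags, key=lambda t: table[-1][t])
        path = [best]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path)), table[-1][best]

    print(viterbi(["deal", "talks", "fail"], trans, emit))
    # -> (['N', 'N', 'V'], ~0.002304)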

The Viterbi Algorithm: second example
Now let's tag "I want to race" with the HMM defined earlier. The matrix has one row per state and one column per observation; initially only the start state has probability 1.0:

             <s>    I (w_1)   want (w_2)   to (w_3)   race (w_4)
  q_4 NN      0
  q_3 TO      0
  q_2 VB      0
  q_1 PRP     0
  q_0 start   1.0

For each state q_j at time i, compute

  v_i(j) = max_k v_{i-1}(k) · a_kj · b_j(w_i)

where k ranges over the states.

The Viterbi Algorithm
1. Create the probability matrix, with one column for each observation (i.e., word token) and one row for each non-start state (i.e., POS tag).
2. We proceed by filling cells, column by column.
3. The entry in column i, row j will be the probability of the most probable route to state q_j that emits w_1 ... w_i.

The Viterbi Algorithm
Filling the first column (the word "I"): each cell is v_0(start) = 1.0 times the transition from the start state times the emission of "I":

             <s>     I
  q_4 NN      0      1.0 × .041 × 0    = 0
  q_3 TO      0      1.0 × .0043 × 0   = 0
  q_2 VB      0      1.0 × .019 × 0    = 0
  q_1 PRP     0      1.0 × .067 × .37  = .025
  q_0 start   1.0

For each state q_j at time i, compute

  v_i(j) = max_k v_{i-1}(k) · a_kj · b_j(w_i)

where v_{i-1}(k) is the previous Viterbi path probability, a_kj is the transition probability, and b_j(w_i) is the emission probability. There is also an (implicit) backpointer from cell (i, j) to the relevant (i-1, k), where k maximizes v_{i-1}(k) · a_kj.

The Viterbi Algorithm
Filling the remaining columns in the same way (each cell takes the maximum over the possible predecessor states) gives:

             <s>     I       want          to          race
  q_4 NN      0      0       .000000002    0           4.8222e-13
  q_3 TO      0      0       0             .0000018    0
  q_2 VB      0      0       .000053       0           1.7928e-10
  q_1 PRP     0      .025    0             0           0
  q_0 start   1.0

For example, in the "want" column: v_2(VB) = v_1(PRP) · a_{PRP,VB} · b_VB(want) = .025 × .23 × .0093 ≈ .000053, and v_2(NN) = .025 × .0012 × .000054 ≈ .000000002; TO and PRP get 0 because their emission probabilities for "want" are 0.

The highest entry in the final column is v_4(VB) = .0000018 × .83 × .00012 ≈ 1.7928e-10; chasing the backpointers gives the tagging PRP VB TO VB for "I want to race".

Connection between HMMs and finite state machines
Hidden Markov models are finite state machines with probabilities added to them. If we think of a finite-state automaton as generating a string while randomly moving through states (instead of scanning a string), then hidden Markov models are such FSMs where there is a specific probability of generating each symbol at each state, and a specific probability of transitioning from one state to another.

As such, the Viterbi algorithm can be used to find the most likely sequence of states in a probabilistic FSM, given a specific input string.

Question: where do the probabilities come from?
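In practice the probabilities are usually estimated from a tagged corpus by relative-frequency counts (with smoothing for unseen events). A minimal sketch, assuming tagged sentences in the same (word, tag) format as the earlier examples:

    from collections import Counter, defaultdict

    def estimate_hmm(tagged_sents):
        """Relative-frequency (maximum-likelihood) estimates of transition and
        emission probabilities from tagged sentences; no smoothing."""
        trans_counts = defaultdict(Counter)
        emit_counts = defaultdict(Counter)
        for sent in tagged_sents:
            prev = "<s>"
            for word, tag in sent:
                trans_counts[prev][tag] += 1
                emit_counts[tag][word.lower()] += 1
                prev = tag
        trans = {p: {t: n / sum(c.values()) for t, n in c.items()}
                 for p, c in trans_counts.items()}
        emit = {t: {w: n / sum(c.values()) for w, n in c.items()}
                for t, c in emit_counts.items()}
        return trans, emit

    # Hypothetical training data; real estimates would come from a large
    # tagged corpus, with smoothing for unseen words and tag pairs.
    sents = [[("the", "DT"), ("deal", "NN"), ("fails", "VBZ")],
             [("talks", "NNS"), ("fail", "VBP")]]
    print(estimate_hmm(sents))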

Example Demo
http://nlp.stanford.edu:8080/parser/
- Relies both on distributional and morphological criteria
- Uses a model similar to hidden Markov models