Dynamic Programming: Hidden Markov Models

University of Oslo: Department of Informatics
Dynamic Programming: Hidden Markov Models
Rebecca Dridan, 16 October 2013
INF4820: Algorithms for AI and NLP

Topics
Recap:
- n-grams
- parts-of-speech
- Hidden Markov Models
Today:
- dynamic programming
- the Viterbi algorithm
- the Forward algorithm
- tagger evaluation

n-grams
- Previous context can help predict the next element in a sequence.
- Rather than using the whole previous context, the Markov assumption says that the whole history can be approximated by the last n-1 elements.
- An n-gram language model predicts the n-th word, conditioned on the n-1 previous words.
- Maximum Likelihood Estimation uses relative frequencies to approximate the conditional probabilities needed for an n-gram model (a small bigram sketch follows after the parts-of-speech recap).
- Smoothing alters the probabilities to save some probability mass for unseen events.

Parts-of-speech
- Lexical categories defined by their distributional properties.
- These properties can be inherent to the lemma, or a function of a word's use in a sentence.
- Open-class categories: new words are frequently added; usually content-bearing words.
- Closed-class categories: more static; often function words.
- The distinctions made vary with the particular tag set.
- The correct tag is context dependent.
- POS tagging is a frequent pre-processing step for other NLP algorithms, but is also directly useful for text-to-speech systems, etc.
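As a quick illustration of the MLE point above, here is a minimal Python sketch (my own, not from the slides; the function name bigram_mle and the toy sentence are made up) estimating bigram probabilities as relative frequencies:

from collections import Counter

def bigram_mle(tokens):
    """MLE bigram estimates: P(w | prev) = count(prev, w) / count(prev)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(prev, w): count / unigrams[prev] for (prev, w), count in bigrams.items()}

probs = bigram_mle("the cat sat on the mat".split())
print(probs[("the", "cat")])   # 0.5: "the" occurs twice, once followed by "cat"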

Hidden Markov Models
For a bigram HMM over an observation sequence O = o_1 ... o_N:

P(S, O) = \prod_{i=1}^{N+1} P(s_i | s_{i-1}) P(o_i | s_i)

where s_0 = <S>, s_{N+1} = </S>, and P(o_{N+1} | s_{N+1}) = 1.
- The transition probabilities model the probability of moving from state to state.
- The emission probabilities model the probability that a state emits a particular observation.

Using HMMs
The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
- compute P(S, O), given S and O
- compute P(O), given O
- find the S that maximises P(S | O), given O
- compute P(s_x | O), given O
We can also learn the model parameters, given a set of observations.

Ice cream and global warming
The weather records for Baltimore in Summer 2007 are missing. Jason likes to eat ice cream, and he records his daily ice cream consumption in his diary. The number of ice creams he ate was influenced, but not entirely determined, by the weather. Today's weather is partially predictable from yesterday's.

This is a Hidden Markov Model, with:
- Hidden states: {H, C} (plus pseudo-states <S> and </S>)
- Observations: {1, 2, 3}

Model parameters (from the state-transition diagram on the slide):
Transitions: P(H|<S>) = 0.8, P(C|<S>) = 0.2; P(H|H) = 0.6, P(C|H) = 0.2, P(</S>|H) = 0.2; P(H|C) = 0.3, P(C|C) = 0.5, P(</S>|C) = 0.2
Emissions: P(1|H) = 0.2, P(2|H) = 0.4, P(3|H) = 0.4; P(1|C) = 0.5, P(2|C) = 0.4, P(3|C) = 0.1
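To make this concrete, here is a minimal Python sketch of the ice-cream HMM (the names TRANS, EMIT and joint_probability are my own, not from the slides). It stores the tables as dictionaries and multiplies out the joint probability formula P(S, O) = \prod P(s_i | s_{i-1}) P(o_i | s_i) from the previous slide:

# Ice-cream HMM from the lecture; names are illustrative.
TRANS = {                       # P(next state | previous state)
    "<S>": {"H": 0.8, "C": 0.2},
    "H":   {"H": 0.6, "C": 0.2, "</S>": 0.2},
    "C":   {"H": 0.3, "C": 0.5, "</S>": 0.2},
}
EMIT = {                        # P(observation | state)
    "H": {1: 0.2, 2: 0.4, 3: 0.4},
    "C": {1: 0.5, 2: 0.4, 3: 0.1},
}

def joint_probability(states, observations):
    """P(S, O): product of transition and emission probabilities,
    with <S> and </S> as pseudo-states and P(o_{N+1} | </S>) = 1."""
    p = 1.0
    prev = "<S>"
    for s, o in zip(states, observations):
        p *= TRANS[prev][s] * EMIT[s][o]
        prev = s
    return p * TRANS[prev]["</S>"]      # final transition into </S>

print(joint_probability(["H", "H", "H"], [3, 1, 3]))   # ~0.0018432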

Using HMMs (recap)
Of the tasks listed earlier, we now turn to finding the S that maximises P(S | O), given O.

Part-of-Speech Tagging
We want to find the tag sequence, given a word sequence. With tags as our states and words as our observations, we know:

P(S, O) = \prod_{i=1}^{N+1} P(s_i | s_{i-1}) P(o_i | s_i)

What we want is:

P(S | O) = P(S, O) / P(O)

Actually, we want the state sequence that maximises P(S | O):

S_best = argmax_S P(S, O) / P(O)

Since P(O) will be the same for every candidate S, we can drop the denominator and maximise P(S, O) directly.

Decoding
Task: what is the most likely state sequence S, given an observation sequence O and an HMM?

HMM parameters (as before):
P(H|<S>) = 0.8    P(C|<S>) = 0.2
P(H|H) = 0.6      P(C|H) = 0.2
P(H|C) = 0.3      P(C|C) = 0.5
P(</S>|H) = 0.2   P(</S>|C) = 0.2
P(1|H) = 0.2      P(1|C) = 0.5
P(2|H) = 0.4      P(2|C) = 0.4
P(3|H) = 0.4      P(3|C) = 0.1

With O = 3 1 3, enumerating every state sequence and its joint probability P(S, O) gives:
<S> H H H </S>   0.0018432
<S> H H C </S>   0.0001536
<S> H C H </S>   0.0007680
<S> H C C </S>   0.0003200
<S> C H H </S>   0.0000576
<S> C H C </S>   0.0000048
<S> C C H </S>   0.0001200
<S> C C C </S>   0.0000500
(A brute-force sketch that reproduces this table is given below.)

Dynamic Programming
For 2 states and an observation sequence of length 3 this is OK, but...
- for N observations and L states, there are L^N sequences
- we do the same calculations over and over again
Enter dynamic programming:
- record sub-problem solutions for further re-use
- useful when a complex problem can be described recursively
- examples: Dijkstra's shortest path, minimum edit distance, longest common subsequence, Viterbi
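As a sanity check on the table above, a small brute-force sketch (my own illustration, reusing the TRANS, EMIT and joint_probability definitions from the earlier sketch) enumerates all L^N state sequences and scores each one:

from itertools import product

def brute_force_decode(observations, states=("H", "C")):
    """Score every possible state sequence; exponential in the sequence length."""
    scored = [(seq, joint_probability(list(seq), observations))
              for seq in product(states, repeat=len(observations))]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for seq, p in brute_force_decode([3, 1, 3]):
    print(" ".join(seq), f"{p:.7f}")
# Best: H H H with probability 0.0018432, as in the table above.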

Viterbi algorithm
Recall our problem: maximise

P(s_1 ... s_n, o_1 ... o_n) = P(s_1|s_0) P(o_1|s_1) P(s_2|s_1) P(o_2|s_2) ...

Our recursive sub-problem:

v_i(x) = \max_{k=1}^{L} [ v_{i-1}(k) P(x|k) P(o_i|x) ]

The variable v_i(x) represents the maximum probability that the i-th state is x, given that we have seen the observations o_1 ... o_i. At each step, we record backpointers showing which previous state led to the maximum probability.

An example of the Viterbi algorithm
For O = 3 1 3 (o_1 o_2 o_3), the trellis over states H and C is filled in as follows:

v_1(H) = P(H|<S>) P(3|H) = 0.8 * 0.4 = 0.32
v_1(C) = P(C|<S>) P(3|C) = 0.2 * 0.1 = 0.02
v_2(H) = max(v_1(H) P(H|H) P(1|H), v_1(C) P(H|C) P(1|H)) = max(.32 * .12, .02 * .06) = .0384
v_2(C) = max(v_1(H) P(C|H) P(1|C), v_1(C) P(C|C) P(1|C)) = max(.32 * .1, .02 * .25) = .032
v_3(H) = max(v_2(H) P(H|H) P(3|H), v_2(C) P(H|C) P(3|H)) = max(.0384 * .24, .032 * .12) = .009216
v_3(C) = max(v_2(H) P(C|H) P(3|C), v_2(C) P(C|C) P(3|C)) = max(.0384 * .02, .032 * .05) = .0016
v_f(</S>) = max(v_3(H) P(</S>|H), v_3(C) P(</S>|C)) = max(.009216 * .2, .0016 * .2) = .0018432

Following the backpointers from </S> gives the most likely state sequence: H H H.

Pseudocode for the Viterbi algorithm
Input: observations of length N, state set of size L
Output: best path

create a path probability matrix viterbi[N, L+2]
create a path backpointer matrix backpointer[N, L+2]
foreach state s from 1 to L do
    viterbi[1, s] <- trans(<S>, s) * emit(o_1, s)
    backpointer[1, s] <- 0
foreach time step i from 2 to N do
    foreach state s from 1 to L do
        viterbi[i, s] <- max_{s'=1..L} viterbi[i-1, s'] * trans(s', s) * emit(o_i, s)
        backpointer[i, s] <- argmax_{s'=1..L} viterbi[i-1, s'] * trans(s', s)
viterbi[N, L+1] <- max_{s=1..L} viterbi[N, s] * trans(s, </S>)
backpointer[N, L+1] <- argmax_{s=1..L} viterbi[N, s] * trans(s, </S>)
return the path found by following backpointers from backpointer[N, L+1]

Diversion: Complexity and O(N)
Big-O notation describes the complexity of an algorithm:
- it describes the worst-case order of growth in terms of the size of the input
- only the largest-order term is represented
- constant factors are ignored
- it is determined by looking at the loops in the code
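The pseudocode translates fairly directly into Python. The sketch below is my own rendering (not the lecture's reference code); it reuses the TRANS and EMIT dictionaries from the earlier sketch and uses per-timestep dictionaries instead of an [N, L+2] matrix. On the observation sequence 3 1 3 it reproduces the hand-computed result: path H H H with probability 0.0018432.

def viterbi(observations, states=("H", "C")):
    """Most likely state sequence and its probability under the HMM."""
    # v[i][s]: best probability of any path that ends in state s after i+1 observations
    v = [{s: TRANS["<S>"][s] * EMIT[s][observations[0]] for s in states}]
    backpointer = [{s: "<S>" for s in states}]
    for i in range(1, len(observations)):                      # recursion step
        v.append({})
        backpointer.append({})
        for s in states:
            best_prev = max(states, key=lambda k: v[i - 1][k] * TRANS[k][s])
            v[i][s] = v[i - 1][best_prev] * TRANS[best_prev][s] * EMIT[s][observations[i]]
            backpointer[i][s] = best_prev
    # termination: transition into the final pseudo-state </S>
    last = max(states, key=lambda k: v[-1][k] * TRANS[k]["</S>"])
    best_prob = v[-1][last] * TRANS[last]["</S>"]
    path = [last]                                               # follow backpointers
    for i in range(len(observations) - 1, 0, -1):
        path.append(backpointer[i][path[-1]])
    path.reverse()
    return path, best_prob

print(viterbi([3, 1, 3]))   # (['H', 'H', 'H'], ~0.0018432)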

Complexity of the Viterbi algorithm
Annotating the loops in the pseudocode: the initialisation loop runs over the L states; the main nest runs over N time steps and L states, with a max over L predecessors inside each cell, giving L * L * N operations; the termination step adds another L. Keeping only the largest-order term, the Viterbi algorithm is O(L^2 N).

Using HMMs (recap)
The HMM models the process of generating the labelled sequence. Having found the S that maximises P(S | O), we now look at computing P(O), given O.

Computing likelihoods
Task: given an observation sequence O, determine the likelihood P(O) according to the HMM.

Compute the sum over all possible state sequences:

P(O) = \sum_S P(O, S)

For example, for the ice cream sequence 3 1 3:

P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot) + ... + P(3 1 3, hot hot cold) + ...

Done this way, the cost is O(L^N * N).

The Forward algorithm
Again, we use dynamic programming, storing and reusing the results of partial computations in a trellis α. Each cell in the trellis stores the probability of being in state x after seeing the first i observations:

α_i(x) = P(o_1 ... o_i, s_i = x) = \sum_{k=1}^{L} α_{i-1}(k) P(x|k) P(o_i|x)

Note the sum, instead of the max in Viterbi.

An example of the Forward algorithm
For O = 3 1 3, with the same model parameters, the forward trellis is filled in as follows:

α_1(H) = P(H|<S>) P(3|H) = 0.8 * 0.4 = 0.32
α_1(C) = P(C|<S>) P(3|C) = 0.2 * 0.1 = 0.02
α_2(H) = α_1(H) P(H|H) P(1|H) + α_1(C) P(H|C) P(1|H) = .32 * .12 + .02 * .06 = .0396
α_2(C) = α_1(H) P(C|H) P(1|C) + α_1(C) P(C|C) P(1|C) = .32 * .1 + .02 * .25 = .037
α_3(H) = α_2(H) P(H|H) P(3|H) + α_2(C) P(H|C) P(3|H) = .0396 * .24 + .037 * .12 = .013944
α_3(C) = α_2(H) P(C|H) P(3|C) + α_2(C) P(C|C) P(3|C) = .0396 * .02 + .037 * .05 = .002642
α_f(</S>) = α_3(H) P(</S>|H) + α_3(C) P(</S>|C) = .013944 * .2 + .002642 * .2 = .0033172

So P(3 1 3) = 0.0033172.

Pseudocode for the Forward algorithm
Input: observations of length N, state set of size L
Output: forward probability

create a probability matrix forward[N, L+2]
foreach state s from 1 to L do
    forward[1, s] <- trans(<S>, s) * emit(o_1, s)
foreach time step i from 2 to N do
    foreach state s from 1 to L do
        forward[i, s] <- sum_{s'=1..L} forward[i-1, s'] * trans(s', s) * emit(o_i, s)
forward[N, L+1] <- sum_{s=1..L} forward[N, s] * trans(s, </S>)
return forward[N, L+1]
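A runnable version of the forward pseudocode, again a sketch of my own that reuses the TRANS and EMIT dictionaries defined earlier. Replacing Viterbi's max with a sum is the only substantive change, and on 3 1 3 it returns the likelihood 0.0033172 computed by hand above.

def forward(observations, states=("H", "C")):
    """Total likelihood P(O) of the observation sequence under the HMM."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: TRANS["<S>"][s] * EMIT[s][observations[0]] for s in states}
    for o in observations[1:]:                                  # recursion step
        alpha = {s: sum(alpha[k] * TRANS[k][s] for k in states) * EMIT[s][o]
                 for s in states}
    # termination: sum over transitions into the final pseudo-state </S>
    return sum(alpha[s] * TRANS[s]["</S>"] for s in states)

print(forward([3, 1, 3]))   # ~0.0033172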

Tagger Evaluation
To evaluate a part-of-speech tagger (or any classification system) we:
- train on a labelled training set
- test on a separate test set
For a POS tagger, the standard evaluation metric is tag accuracy:

Acc = (number of correct tags) / (number of words)

The other metric sometimes used is the error rate:

error rate = 1 - Acc

Summary
Understand:
- Why does dynamic programming save time, and what type of problems can it be used for?
- What is the complexity of the Viterbi algorithm?
Coming Up
- Context-free grammars
- Most likely trees