Statistical Processing of Natural Language


Statistical Processing of Natural Language. DMKM, Universitat Politècnica de Catalunya

Outline
1. Markov Models
2. Hidden Markov Models
3. Fundamental Questions
   1. Observation Probability
   2. Best State Sequence
   3. Parameter Estimation
4. References

Generative and conditional models
Generative models: Bayes rule, independence assumptions. Able to generate data.
Conditional models: no independence assumptions. Unable to generate data.
Most algorithms of both kinds make assumptions about the nature of the data-generating process, predefining a fixed model structure and acquiring from the data only the distributional information.

Usual statistical models in NLP
Generative models:
  Graphical: HMM (Rabiner 1990), IOHMM (Bengio 1996).
  Automata-learning algorithms (no assumptions about model structure): VLMM (Rissanen 1983), Suffix Trees (Galil & Giancarlo 1988), CSSR (Shalizi & Shalizi 2004).
  Non-graphical: Stochastic Grammars (Lari & Young 1990).
Conditional models:
  Graphical: discriminative MM (Bottou 1991), MEMM (McCallum et al. 2000), CRF (Lafferty et al. 2001).
  Non-graphical: Maximum Entropy (Berger et al. 1996).


(Visible) Markov Models
X = (X_1, ..., X_T): sequence of random variables taking values in S = {s_1, ..., s_N}.
Properties:
  Limited Horizon: P(X_{t+1} = s_k | X_1, ..., X_t) = P(X_{t+1} = s_k | X_t)
  Time Invariant (Stationary): P(X_{t+1} = s_k | X_t) = P(X_2 = s_k | X_1)
Transition matrix: a_{ij} = P(X_{t+1} = s_j | X_t = s_i); a_{ij} ≥ 0 for all i, j; Σ_{j=1}^{N} a_{ij} = 1 for all i.
Initial probabilities (or an extra state s_0): π_i = P(X_1 = s_i); Σ_{i=1}^{N} π_i = 1.

MM Example
[Figure: a Markov chain over the states h, a, p, e, t, i with transition probabilities on the arcs.]
Sequence probability:
P(X_1, ..., X_T) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) ... P(X_T | X_1, ..., X_{T-1})
                 = P(X_1) P(X_2 | X_1) P(X_3 | X_2) ... P(X_T | X_{T-1})
                 = π_{X_1} ∏_{t=1}^{T-1} a_{X_t X_{t+1}}
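As a quick illustration of the sequence-probability formula above, here is a minimal Python sketch of a first-order Markov chain. The state set matches the example's letters, but the initial and transition probabilities are placeholder values, since the figure's exact numbers are not fully recoverable from the transcript.

```python
import numpy as np

# Toy first-order Markov chain (placeholder numbers, not the ones in the
# slide's figure): states indexed 0..N-1.
states = ["h", "a", "p", "e", "t", "i"]
N = len(states)
pi = np.array([0.4, 0.1, 0.1, 0.2, 0.1, 0.1])   # initial probabilities, sum to 1
A = np.full((N, N), 1.0 / N)                    # transition matrix a_ij, rows sum to 1

def sequence_probability(seq, pi, A):
    """P(X_1..X_T) = pi_{X_1} * prod_t a_{X_t X_{t+1}} (Markov assumption)."""
    p = pi[seq[0]]
    for t in range(len(seq) - 1):
        p *= A[seq[t], seq[t + 1]]
    return p

x = [states.index(c) for c in "hat"]
print(sequence_probability(x, pi, A))
```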

Hidden Markov Models (HMM)
States and observations. Emission probability: b_{ik} = P(O_t = k | X_t = s_i).
Used when underlying events probabilistically generate surface events:
  PoS tagging (hidden states: PoS tags; observations: words)
  ASR (hidden states: phonemes; observations: sounds)
  ...
Trainable with unannotated data: Expectation Maximization (EM) algorithm.
Arc-emission vs. state-emission.

Example: PoS Tagging
Emission probabilities:
  <FF>   1.0
  Dt     the 0.6, this 0.4
  N      cat 0.6, kid 0.1, fish 0.3
  V      eats 0.7, runs 0.3
  Adj    fresh 0.3, little 0.3, big 0.4
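The emission table can be encoded directly as a matrix. The sketch below follows one plausible reading of the table above (Dt emits the/this, N emits cat/kid/fish, V emits eats/runs, Adj emits fresh/little/big); the transition matrix and initial probabilities are invented for illustration, since the slide only gives emissions, and the <FF> boundary tag is left out for simplicity.

```python
import numpy as np

states = ["Dt", "N", "V", "Adj"]
vocab  = ["the", "this", "cat", "kid", "eats", "runs", "fish", "fresh", "little", "big"]

# Emission matrix b_ik = P(O_t = k | X_t = s_i), rows follow the slide's table.
B = np.array([
    # the  this  cat  kid  eats runs fish fresh little big
    [0.6,  0.4,  0.0, 0.0, 0.0, 0.0, 0.0, 0.0,  0.0,   0.0],  # Dt
    [0.0,  0.0,  0.6, 0.1, 0.0, 0.0, 0.3, 0.0,  0.0,   0.0],  # N
    [0.0,  0.0,  0.0, 0.0, 0.7, 0.3, 0.0, 0.0,  0.0,   0.0],  # V
    [0.0,  0.0,  0.0, 0.0, 0.0, 0.0, 0.0, 0.3,  0.3,   0.4],  # Adj
])

# The slide gives no transitions for this example; these are made-up
# illustrative values (each row sums to 1).
A = np.array([
    [0.0, 0.7, 0.0, 0.3],   # Dt
    [0.1, 0.1, 0.7, 0.1],   # N
    [0.5, 0.3, 0.0, 0.2],   # V
    [0.0, 0.8, 0.0, 0.2],   # Adj
])
pi = np.array([0.8, 0.1, 0.05, 0.05])   # made-up initial probabilities

assert np.allclose(B.sum(axis=1), 1) and np.allclose(A.sum(axis=1), 1)

# P(words | tags) for "the cat eats fresh fish" tagged Dt N V Adj N:
words = ["the", "cat", "eats", "fresh", "fish"]
tags  = ["Dt", "N", "V", "Adj", "N"]
p = np.prod([B[states.index(t), vocab.index(w)] for t, w in zip(tags, words)])
print(p)   # 0.6 * 0.6 * 0.7 * 0.3 * 0.3
```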


Fundamental Questions
1. Observation probability (decoding): Given a model µ = (A, B, π), how do we efficiently compute how likely a certain observation is? That is, P_µ(O).
2. Classification: Given an observed sequence O and a model µ, how do we choose the state sequence (X_1, ..., X_T) that best explains the observations?
3. Parameter estimation: Given an observed sequence O and a space of possible models, each with different parameters (A, B, π), how do we find the model that best explains the observed data?

1. Observation Probability

Question 1. Observation probability
Let O = (o_1, ..., o_T) be the observation sequence. For any state sequence X = (X_1, ..., X_T), we have:
P_µ(O | X) = ∏_{t=1}^{T} P_µ(o_t | X_t) = b_{X_1 o_1} b_{X_2 o_2} ... b_{X_T o_T}
P_µ(X) = π_{X_1} a_{X_1 X_2} a_{X_2 X_3} ... a_{X_{T-1} X_T}
P_µ(O) = Σ_X P_µ(O, X) = Σ_X P_µ(O | X) P_µ(X) = Σ_{X_1...X_T} π_{X_1} b_{X_1 o_1} ∏_{t=2}^{T} a_{X_{t-1} X_t} b_{X_t o_t}
Complexity: O(T N^T).
Dynamic programming over a trellis/lattice: O(T N^2).
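The O(T N^T) direct computation can be written down literally as a sum over all state sequences; the sketch below does exactly that on a tiny two-state HMM with made-up parameters. It is only practical for very short sequences, which is why the trellis-based O(T N^2) procedures on the next slides are used instead.

```python
import itertools
import numpy as np

# A tiny HMM with made-up numbers, just to illustrate the direct sum over
# all N^T state sequences.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.5],      # b_ik = P(O_t = k | X_t = s_i)
               [0.1, 0.9]])

def brute_force_likelihood(obs, pi, A, B):
    """P(O) = sum over all state sequences X of P(O | X) P(X)."""
    N, T = len(pi), len(obs)
    total = 0.0
    for X in itertools.product(range(N), repeat=T):
        p = pi[X[0]] * B[X[0], obs[0]]
        for t in range(1, T):
            p *= A[X[t - 1], X[t]] * B[X[t], obs[t]]
        total += p
    return total

obs = [0, 1, 1, 0]
print(brute_force_likelihood(obs, pi, A, B))   # same value the forward pass gives in O(T N^2)
```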

Trellis
[Figure: a trellis with states s_1, ..., s_N on the vertical axis ("State") and time steps 1, 2, 3, ..., T+1 on the horizontal axis ("Time t").]
The trellis is fully connected: one can move from any state to any other at each step. A node {s_i, t} of the trellis stores information about state sequences which include X_t = i.

Forward & Backward computation
Forward procedure, O(T N^2). We store α_i(t) at each trellis node {s_i, t}:
α_i(t) = P_µ(o_1 ... o_t, X_t = i)
Probability of emitting o_1 ... o_t and reaching state s_i at time t.
1. Initialization: α_i(1) = π_i b_{i o_1},  i = 1 ... N
2. Induction: for t: 1 ≤ t < T,  α_j(t+1) = [Σ_{i=1}^{N} α_i(t) a_{ij}] b_{j o_{t+1}},  j = 1 ... N
3. Total: P_µ(O) = Σ_{i=1}^{N} α_i(T)
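A minimal NumPy sketch of the forward procedure, using the same made-up two-state toy HMM as above (rows of A and B are indexed by state, columns of B by observation symbol):

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward procedure: alpha[t, i] = P(o_1..o_{t+1}, X_{t+1} = s_i), with 0-indexed t."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                         # initialization: alpha_i(1) = pi_i b_{i o_1}
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]] # induction
    return alpha, alpha[-1].sum()                        # total: P(O) = sum_i alpha_i(T)

# Same made-up toy HMM as in the previous sketch.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
alpha, likelihood = forward([0, 1, 1, 0], pi, A, B)
print(likelihood)
```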

Forward computation
[Figure: close-up of the computation of the forward probabilities at one trellis node.]
The forward probability α_j(t+1) is calculated by summing the product of the probabilities on each incoming arc with the forward probability of the originating node.

Forward & Backward computation
Backward procedure, O(T N^2). We store β_i(t) at each trellis node {s_i, t}:
β_i(t) = P_µ(o_{t+1} ... o_T | X_t = i)
Probability of emitting o_{t+1} ... o_T given we are in state s_i at time t.
1. Initialization: β_i(T) = 1,  i = 1 ... N
2. Induction: for t: 1 ≤ t < T,  β_i(t) = Σ_{j=1}^{N} a_{ij} b_{j o_{t+1}} β_j(t+1),  i = 1 ... N
3. Total: P_µ(O) = Σ_{i=1}^{N} π_i b_{i o_1} β_i(1)
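A matching sketch of the backward procedure, again on the made-up toy HMM; its total agrees with the forward total:

```python
import numpy as np

def backward(obs, pi, A, B):
    """Backward procedure: beta[t, i] = P(o_{t+2}..o_T | X_{t+1} = s_i), with 0-indexed t."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                         # initialization: beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])     # induction
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()       # total: P(O) = sum_i pi_i b_{i o_1} beta_i(1)

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
beta, likelihood = backward([0, 1, 1, 0], pi, A, B)
print(likelihood)   # matches the forward total
```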

Forward & Backward computation
Combination:
P_µ(O, X_t = i) = P_µ(o_1 ... o_{t-1}, X_t = i, o_t ... o_T) = α_i(t) β_i(t)
P_µ(O) = Σ_{i=1}^{N} α_i(t) β_i(t),  for any t: 1 ≤ t ≤ T
The Forward and Backward totals are particular cases of this equation for t = T and t = 1, respectively.
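The combination identity can be checked numerically: on the toy HMM, Σ_i α_i(t) β_i(t) evaluates to the same P_µ(O) at every time step:

```python
import numpy as np

# Check that sum_i alpha_i(t) beta_i(t) gives P(O) at every t (toy HMM, made-up numbers).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
obs = [0, 1, 1, 0]
N, T = len(pi), len(obs)

alpha = np.zeros((T, N)); beta = np.ones((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

print((alpha * beta).sum(axis=1))   # the same value P(O) at every time step
```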

2. Best State Sequence

Question 2. Best state sequence
Most likely path for a given observation O:
argmax_X P_µ(X | O) = argmax_X P_µ(X, O) / P_µ(O) = argmax_X P_µ(X, O)   (since O is fixed)
Compute the best sequence with the same recursive approach as in the Forward-Backward computation: Viterbi algorithm, O(T N^2).
δ_j(t) = max_{X_1...X_{t-1}} P_µ(X_1 ... X_{t-1}, X_t = s_j, o_1 ... o_t)
Highest probability of any sequence reaching state s_j at time t after emitting o_1 ... o_t.
ψ_j(t) = last(argmax_{X_1...X_{t-1}} P_µ(X_1 ... X_{t-1}, X_t = s_j, o_1 ... o_t))
Last state (X_{t-1}) in the highest-probability sequence reaching state s_j at time t after emitting o_1 ... o_t.

Viterbi algorithm
1. Initialization: for j = 1 ... N: δ_j(1) = π_j b_{j o_1}; ψ_j(1) = 0
2. Induction: for t: 1 ≤ t < T:
   δ_j(t+1) = max_{1 ≤ i ≤ N} [δ_i(t) a_{ij}] b_{j o_{t+1}},  j = 1 ... N
   ψ_j(t+1) = argmax_{1 ≤ i ≤ N} δ_i(t) a_{ij},  j = 1 ... N
3. Termination and backwards path readout:
   X̂_T = argmax_{1 ≤ i ≤ N} δ_i(T)
   X̂_t = ψ_{X̂_{t+1}}(t+1)
   P(X̂) = max_{1 ≤ i ≤ N} δ_i(T)
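A NumPy sketch of the Viterbi recursion with backpointers, on the same made-up toy HMM; it returns the most likely state sequence and its joint probability with the observations:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence argmax_X P(X, O), with backpointers psi."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                  # delta_j(1) = pi_j b_{j o_1}
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # element (i, j) = delta_i(t-1) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination: read the best path backwards through the backpointers.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi([0, 1, 1, 0], pi, A, B))
```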

3. Parameter Estimation

Question 3. Parameter estimation
Obtain the model parameters (A, B, π) for the model µ that maximize the probability of a given observation O:
(A, B, π) = argmax_µ P_µ(O)

Baum-Welch algorithm
Baum-Welch algorithm (a.k.a. Forward-Backward):
1. Start with an initial model µ_0 (uniform, random, MLE, ...).
2. Compute the observation probabilities (Forward & Backward computation) using the current model µ.
3. Use the obtained probabilities as data to re-estimate the model, computing µ̂.
4. Let µ = µ̂ and repeat until no significant improvement.
Iterative hill-climbing: local maxima.
A particular application of the Expectation Maximization (EM) algorithm.
EM property: P_µ̂(O) ≥ P_µ(O)

Definitions
γ_i(t) = P_µ(X_t = i | O) = P_µ(X_t = i, O) / P_µ(O) = α_i(t) β_i(t) / Σ_{k=1}^{N} α_k(t) β_k(t)
Probability of being in state s_i at time t, given observation O.
ϕ_t(i, j) = P_µ(X_t = i, X_{t+1} = j | O) = P_µ(X_t = i, X_{t+1} = j, O) / P_µ(O) = α_i(t) a_{ij} b_{j o_{t+1}} β_j(t+1) / Σ_{k=1}^{N} α_k(t) β_k(t)
Probability of moving from state s_i at time t to state s_j at time t+1, given the observation sequence O.
Note that γ_i(t) = Σ_{j=1}^{N} ϕ_t(i, j).
Σ_{t=1}^{T-1} γ_i(t): expected number of transitions from state s_i in O.
Σ_{t=1}^{T-1} ϕ_t(i, j): expected number of transitions from state s_i to s_j in O.

Arc probability
[Figure: trellis fragment illustrating, for a given observation O and model µ, the probability ϕ_t(i, j) of moving from state s_i at time t to state s_j at time t+1.]

Reestimation
Iterative re-estimation:
π̂_i = expected frequency in state s_i at time t = 1 = γ_i(1)
â_ij = (expected number of transitions from s_i to s_j) / (expected number of transitions from s_i) = Σ_{t=1}^{T-1} ϕ_t(i, j) / Σ_{t=1}^{T-1} γ_i(t)
b̂_jk = (expected number of emissions of k from s_j) / (expected number of visits to s_j) = Σ_{t: 1 ≤ t ≤ T, o_t = k} γ_j(t) / Σ_{t=1}^{T} γ_j(t)
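Putting the definitions and the re-estimation formulas together, here is a sketch of one Baum-Welch iteration for a single observation sequence (the slide's ϕ_t(i, j) is called xi in the code); the toy parameters are made up, and in practice one would iterate until the likelihood stops improving noticeably:

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One EM (Baum-Welch) re-estimation step for a single observation sequence."""
    N, T = len(pi), len(obs)
    # E-step: forward and backward probabilities.
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    gamma = alpha * beta / likelihood          # gamma_i(t) = P(X_t = i | O)
    xi = np.zeros((T - 1, N, N))               # xi_t(i, j) = P(X_t = i, X_{t+1} = j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / likelihood

    # M-step: re-estimate (A, B, pi) from the expected counts.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, likelihood

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.1, 0.9]])
obs = [0, 1, 1, 0, 1, 1]
for _ in range(10):                 # iterate until convergence in practice
    pi, A, B, ll = baum_welch_step(obs, pi, A, B)
    print(ll)                       # P(O) is non-decreasing across iterations
```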


References
C. Manning & H. Schütze, Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, May 1999.
S.L. Lauritzen, Graphical Models. Oxford University Press, 1996.
L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.