COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning, Semester 2, 2017
Lecturer: Trevor Cohn
Lecture 24: Hidden Markov Models & message passing

Looking back
* Representation of joint distributions: conditional/marginal independence; directed vs undirected
* Probabilistic inference: computing other distributions from the joint
* Statistical inference: learning parameters from (missing) data
Today: putting these all into practice

Hidden Markov Models
Model of choice for sequential data. A form of clustering (or dimensionality reduction) for discrete time series.

The HMM (and Kalman Filter)
Sequential observed outputs from a hidden state
* states take discrete values (i.e., clusters)
* assumes discrete time steps 1, 2, ..., T
The Kalman filter is the same but with continuous Gaussian r.v.s
* i.e., dimensionality reduction, but with temporal dynamics

HMM Applications
* NLP part-of-speech tagging: given the words in a sentence, infer the hidden parts of speech, e.g. "I love Machine Learning" → noun, verb, noun, noun
* Speech recognition: given a waveform, determine the phonemes
* Biological sequences: classification, search, alignment
* Computer vision: identify who's walking in a video, tracking

Formulation
Formulated as a directed PGM
* therefore the joint is expressed as
$$P(\mathbf{o}, \mathbf{q}) = P(q_1) P(o_1 \mid q_1) \prod_{i=2}^{T} P(q_i \mid q_{i-1}) P(o_i \mid q_i)$$
* bold variables are shorthand for the vector of T values
Parameters (for a homogeneous HMM): transition probabilities A, emission probabilities B, initial state distribution Π
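To make this factorisation concrete, here is a minimal sketch that evaluates the joint for one state/observation sequence. The parameters pi, A and B below are hypothetical toy values, not from the lecture.

```python
import numpy as np

# Hypothetical toy parameters (not from the lecture): 2 hidden states, 3 output symbols.
pi = np.array([0.6, 0.4])               # initial distribution, pi[i] = P(q_1 = i)
A = np.array([[0.7, 0.3],               # transitions, A[i, j] = P(q_t = j | q_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],          # emissions, B[i, k] = P(o_t = k | q_t = i)
              [0.1, 0.3, 0.6]])

def joint_prob(obs, states):
    """P(o, q) = P(q_1) P(o_1|q_1) * prod_{i=2}^T P(q_i|q_{i-1}) P(o_i|q_i)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

print(joint_prob(obs=[0, 2, 1], states=[0, 1, 1]))
```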

Independence
The graph encodes independence between RVs
* conditional independence: $o_i \perp \mathbf{o}_{\setminus i} \mid q_i$, i.e., given its state, $o_i$ is independent of all the other o's excluding i
* the state $q_i$ must encode all sequential context
The Markov blanket is local
* for $o_i$ the blanket is $\{q_i\}$
* for $q_i$ the blanket is $\{o_i, q_{i-1}, q_{i+1}\}$

Fundamental HMM Tasks (HMM task → corresponding PGM task)
* Evaluation. Given an HMM μ and an observation sequence o, determine the likelihood Pr(o | μ) → probabilistic inference
* Decoding. Given an HMM μ and an observation sequence o, determine the most probable hidden state sequence q → MAP point estimate
* Learning. Given an observation sequence o and a set of states, learn the parameters A, B, Π → statistical inference

Evaluation a.k.a. marginalisation
Compute the probability of the observations o by summing out q:
$$P(\mathbf{o} \mid \mu) = \sum_{\mathbf{q}} P(\mathbf{o}, \mathbf{q} \mid \mu) = \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(q_1) P(o_1 \mid q_1) P(q_2 \mid q_1) P(o_2 \mid q_2) \cdots P(q_T \mid q_{T-1}) P(o_T \mid q_T)$$
Make this more efficient by moving the sums inside:
$$P(\mathbf{o} \mid \mu) = \sum_{q_1} P(q_1) P(o_1 \mid q_1) \sum_{q_2} P(q_2 \mid q_1) P(o_2 \mid q_2) \cdots \sum_{q_T} P(q_T \mid q_{T-1}) P(o_T \mid q_T)$$
Déjà vu? Maybe we could do variable elimination.
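As a sanity check only, the naive marginalisation above can be coded directly as a sum over all $l^T$ state sequences; it is exponential in T, which is exactly what elimination avoids. This sketch reuses the hypothetical toy parameters pi, A, B defined earlier.

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def likelihood_bruteforce(obs):
    """P(o | mu) = sum over every state sequence q of P(o, q | mu); O(l^T) time."""
    n_states, total = A.shape[0], 0.0
    for states in itertools.product(range(n_states), repeat=len(obs)):
        p = pi[states[0]] * B[states[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
        total += p
    return total

print(likelihood_bruteforce([0, 2, 1]))
```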

Elimination = Backward Algorithm
$$P(\mathbf{o} \mid \mu) = \sum_{q_1} P(q_1) P(o_1 \mid q_1) \sum_{q_2} P(q_2 \mid q_1) P(o_2 \mid q_2) \cdots \sum_{q_T} P(q_T \mid q_{T-1}) P(o_T \mid q_T)$$
Working from the innermost sum outwards, eliminate $q_T$ first (producing the message $m_{T \to T-1}(q_{T-1})$), then $q_{T-1}$, and so on down to $q_2$ (producing $m_{2 \to 1}(q_1)$), leaving
$$P(\mathbf{o} \mid \mu) = \sum_{q_1} P(q_1) P(o_1 \mid q_1)\, m_{2 \to 1}(q_1)$$
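A minimal sketch of this elimination order, again with the hypothetical toy parameters; each message is a vector with one entry per value of the state it is passed to.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def backward_likelihood(obs):
    """Eliminate q_T, then q_{T-1}, ..., down to q_2; finally sum out q_1."""
    T = len(obs)
    m = np.ones(A.shape[0])                   # message before any elimination
    for t in range(T - 1, 0, -1):
        m = A @ (B[:, obs[t]] * m)            # eliminate the state at position t: m_new[i] = sum_j A[i, j] * B[j, obs[t]] * m[j]
    return np.sum(pi * B[:, obs[0]] * m)      # P(o) = sum_{q_1} P(q_1) P(o_1|q_1) m_{2->1}(q_1)

print(backward_likelihood([0, 2, 1]))         # matches the brute-force value
```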

Elimination = Forward Algorithm
$$P(\mathbf{o} \mid \mu) = \sum_{q_T} P(o_T \mid q_T) \sum_{q_{T-1}} P(q_T \mid q_{T-1}) P(o_{T-1} \mid q_{T-1}) \cdots \sum_{q_1} P(q_2 \mid q_1) P(q_1) P(o_1 \mid q_1)$$
Working from the innermost sum outwards, eliminate $q_1$ first (producing the message $m_{1 \to 2}(q_2)$), then $q_2$, and so on up to $q_{T-1}$ (producing $m_{T-1 \to T}(q_T)$), leaving
$$P(\mathbf{o} \mid \mu) = \sum_{q_T} P(o_T \mid q_T)\, m_{T-1 \to T}(q_T)$$
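The same likelihood in the other elimination order, as a minimal sketch with the hypothetical toy parameters from before.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def forward_likelihood(obs):
    """Eliminate q_1, then q_2, ..., up to q_{T-1}; finally sum out q_T."""
    m = pi.copy()                             # distribution over q_1 before elimination
    for t in range(len(obs) - 1):
        m = A.T @ (B[:, obs[t]] * m)          # eliminate the state at position t: m_new[j] = sum_i A[i, j] * B[i, obs[t]] * m[i]
    return np.sum(B[:, obs[-1]] * m)          # P(o) = sum_{q_T} P(o_T|q_T) m_{T-1->T}(q_T)

print(forward_likelihood([0, 2, 1]))          # same value as the backward pass
```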

Forward-Backward
Both algorithms are just variable elimination using different orderings
* $q_T, \ldots, q_1$ → backward algorithm
* $q_1, \ldots, q_T$ → forward algorithm
* both have time complexity $O(T l^2)$, where $l$ is the size of the label set
Can use either to compute P(o)
* but even better, can use the m values to compute marginals (and pairwise marginals over $q_i, q_{i+1}$):
$$P(q_i \mid \mathbf{o}) = \frac{1}{P(\mathbf{o})}\, \underbrace{m_{i-1 \to i}(q_i)}_{\text{forward}}\, P(o_i \mid q_i)\, \underbrace{m_{i+1 \to i}(q_i)}_{\text{backward}}$$
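Combining the two passes gives the state marginals from the formula above. The sketch below reuses the hypothetical toy parameters and stores, for each position, the message arriving from the left and the message arriving from the right.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def state_marginals(obs):
    """P(q_t | o) = forward message * P(o_t | q_t) * backward message, normalised by P(o)."""
    T, S = len(obs), A.shape[0]
    fwd = np.zeros((T, S))                    # fwd[t]: message arriving at position t from the left
    bwd = np.zeros((T, S))                    # bwd[t]: message arriving at position t from the right
    fwd[0] = pi                               # the "message" into q_1 is just P(q_1)
    for t in range(1, T):
        fwd[t] = A.T @ (B[:, obs[t - 1]] * fwd[t - 1])
    bwd[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        bwd[t] = A @ (B[:, obs[t + 1]] * bwd[t + 1])
    unnorm = fwd * B[:, obs].T * bwd          # unnorm[t, s] = P(q_t = s, o)
    p_o = unnorm[0].sum()                     # P(o); the same value at every position
    return unnorm / p_o, p_o

gamma, p_o = state_marginals([0, 2, 1])
print(gamma)                                  # each row sums to 1
print(p_o)
```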

Statistical Inference (Learning)
Learn the parameters μ = (A, B, π), given an observation sequence o.
This is the Baum-Welch algorithm, which uses EM to approximate the MLE, argmax_μ P(o | μ):
1. initialise μ_1, let i = 1
2. compute the expected marginal distributions P(q_t | o, μ_i) for all t, and P(q_{t-1}, q_t | o, μ_i) for t = 2..T
3. fit model μ_{i+1} based on these expectations
4. repeat from step 2, with i = i + 1
The expectations are computed using forward-backward.
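A minimal single-sequence Baum-Welch sketch under hypothetical toy settings (2 states, 3 symbols, made-up observation sequence). A practical implementation would work in log space or rescale the alpha/beta values, and would accumulate expected counts over many sequences.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Unscaled alpha/beta recursions; only suitable for short toy sequences."""
    T, S = len(obs), A.shape[0]
    alpha, beta = np.zeros((T, S)), np.ones((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (A.T @ alpha[t - 1]) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()                              # last value is P(o | mu)

def baum_welch(obs, n_states, n_symbols, n_iter=50, seed=0):
    """EM for one observation sequence: E-step = forward-backward, M-step = normalised expected counts."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    obs = np.asarray(obs)
    for _ in range(n_iter):
        alpha, beta, p_o = forward_backward(obs, pi, A, B)
        gamma = alpha * beta / p_o                                   # P(q_t | o, mu_i)
        xi = (alpha[:-1, :, None] * A[None] *                        # P(q_{t-1}, q_t | o, mu_i)
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_o
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for k in range(n_symbols):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

pi_hat, A_hat, B_hat = baum_welch([0, 1, 2, 2, 1, 0, 0, 2], n_states=2, n_symbols=3)
print(A_hat)
```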

Message Passing
The sum-product algorithm efficiently computes marginal distributions over trees. It is an extension of the variable elimination algorithm.

Inference as message passing
Each m can be considered as a message which summarises the effect of the rest of the graph on the current node's marginal.
* Inference = passing messages between all nodes
(Figure: chain q_1 – q_2 – q_3 – q_4, with messages m_{1→2} and m_{3→2} arriving at q_2.)

Inference as message passing
Messages are vector valued, i.e., a function of the target node's label.
Messages are defined recursively: left to right, or right to left.
(Figure: the same chain, with m_{4→3} feeding into m_{3→2}, and m_{1→2} passed in the other direction.)

Sum-product algorithm
Application of message passing to more general graphs
* applies to chains, trees and poly-trees (directed PGMs with >1 parent)
* "sum-product" derives from: product = product of incoming messages; sum = summing out the effect of RV(s), a.k.a. elimination
The algorithm supports other operations (semi-rings)
* e.g., max-product, swapping the sum for a max
* the Viterbi algorithm is the max-product variant of the forward algorithm for HMMs; it solves argmax_q P(q | o)
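For instance, a minimal Viterbi sketch with the hypothetical toy parameters: the forward recursion with the sum replaced by a max (in log space), plus backpointers to recover the best state sequence.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def viterbi(obs):
    """argmax_q P(q | o): max-product version of the forward pass, in log space, plus backtracking."""
    T, S = len(obs), A.shape[0]
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, S))                  # best log score of any path ending in state s at time t
    back = np.zeros((T, S), dtype=int)        # backpointers
    delta[0] = np.log(pi) + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA # scores[i, j]: best path ending in i, then moving to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):             # follow backpointers from the end
        path.append(back[t, path[-1]])
    return path[::-1]

print(viterbi([0, 2, 1]))
```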

Application to Directed PGMs
(Figure: one network over the RVs CTL, FG, FA, GRL, AS shown three ways: as a directed PGM, as the undirected moralised PGM, and as a factor graph.)

Factor graphs
A factor graph is a bipartite graph over factors (functions) and RVs. Directed PGMs result in tree-structured factor graphs.
Example factors for the network above: $f_1(CTL) = P(CTL)$ and $f_2(CTL, GRL, FG) = P(GRL \mid CTL, FG)$.
(Figure: factor graph with factor nodes f_1, ..., f_5 connected to the RVs CTL, FG, FA, GRL, AS.)

Factor graph for the HMM
(Figure: chain of state RVs q_1, q_2, q_3, q_4 joined by pairwise factors P(q_2 | q_1), P(q_3 | q_2), P(q_4 | q_3), with unary factors P(q_1) P(o_1 | q_1), P(o_2 | q_2), P(o_3 | q_3), P(o_4 | q_4) attached to each state.)
The effect of the observed nodes is incorporated into the unary factors.
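As a concrete illustration of folding the observations in, a small sketch (same hypothetical toy parameters) that builds the factor tables of this chain for a fixed observation sequence.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def hmm_factor_graph(obs):
    """Unary factors over each q_t with the observed o_t folded in, plus pairwise transition factors."""
    unary = [pi * B[:, obs[0]]]               # f(q_1) = P(q_1) P(o_1 | q_1)
    unary += [B[:, o] for o in obs[1:]]       # f(q_t) = P(o_t | q_t) for t >= 2
    pairwise = [A] * (len(obs) - 1)           # f(q_{t-1}, q_t) = P(q_t | q_{t-1})
    return unary, pairwise

unary, pairwise = hmm_factor_graph([0, 2, 1])
print(unary[0], len(pairwise))
```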

Sum-Product over Factor Graphs
Two types of messages:
* between factors and RVs, and between RVs and factors
* each summarises a complete sub-graph
E.g., $m_{f_2 \to GRL}(GRL) = \sum_{CTL} \sum_{FG} f_2(GRL, CTL, FG)\, m_{CTL \to f_2}(CTL)\, m_{FG \to f_2}(FG)$
Structure inference as gather-and-distribute
* gather messages from the leaves of the tree towards the root
* then propagate messages back down from the root to the leaves
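A minimal sketch of the factor-to-variable message written above. The factor table for f_2 and the incoming messages are made-up illustrative values, not from the lecture; the point is that the factor is summed out over every argument except the target variable.

```python
import numpy as np

# Hypothetical factor f2(GRL, CTL, FG) over binary RVs, and made-up incoming messages.
f2 = np.random.default_rng(0).random((2, 2, 2))   # axes ordered GRL, CTL, FG
m_ctl_to_f2 = np.array([0.3, 0.7])
m_fg_to_f2 = np.array([0.8, 0.2])

# m_{f2 -> GRL}(GRL) = sum_{CTL, FG} f2(GRL, CTL, FG) * m_{CTL -> f2}(CTL) * m_{FG -> f2}(FG)
m_f2_to_grl = np.einsum('gcf,c,f->g', f2, m_ctl_to_f2, m_fg_to_f2)
print(m_f2_to_grl)
```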

Undirected PGM analogue: CRFs
Conditional Random Field: the same model applied to sequences
* observed outputs are words, speech, amino acids, etc.
* states are tags: part-of-speech, phone, alignment
* shared inference algorithms, i.e., sum-product / max-product
CRFs are discriminative, modelling P(q | o)
* versus HMMs, which are generative, modelling P(q, o)
* the undirected PGM is more general and expressive
(Figure: undirected chain over states q_1, ..., q_4, each connected to its observation o_1, ..., o_4.)
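To contrast the two, a minimal linear-chain sketch with hypothetical potential tables (illustrative numbers only, not the lecture's notation): a CRF-style model replaces the probabilities along the chain with arbitrary non-negative potentials that may condition on the whole observation, and normalises once per sequence, reusing the same forward recursion to compute the normaliser.

```python
import numpy as np

def chain_posterior(unary, pairwise):
    """P(q | o) on a chain from non-negative potentials; unary[t, s] may condition on all of o."""
    T, S = unary.shape
    alpha = np.zeros((T, S))
    alpha[0] = unary[0]
    for t in range(1, T):                      # same forward recursion as for the HMM
        alpha[t] = (pairwise.T @ alpha[t - 1]) * unary[t]
    Z = alpha[-1].sum()                        # partition function (per-sequence normaliser)
    def seq_prob(q):
        score = unary[0, q[0]]
        for t in range(1, T):
            score *= pairwise[q[t - 1], q[t]] * unary[t, q[t]]
        return score / Z
    return seq_prob

# Hypothetical potentials for 3 positions and 2 tags (illustrative numbers only).
seq_prob = chain_posterior(unary=np.array([[2.0, 0.5], [1.0, 1.5], [0.3, 2.2]]),
                           pairwise=np.array([[1.2, 0.4], [0.6, 1.5]]))
print(seq_prob([0, 1, 1]))
```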

Summary
HMMs as example PGMs
* formulation as a PGM
* independence assumptions
* probabilistic inference using forward-backward
* statistical inference using expectation maximisation
Message passing: a general inference method for undirected PGMs
* sum-product & max-product
* factor graphs