
CM229S: Machine Learning for Bioinformatics                         Lecture 12 - 05/05/2016

Hidden Markov Models

Lecturer: Sriram Sankararaman        Scribe: Akshay Dattatray Shinde        Edited by: TBD

1 Introduction

For a directed graph $G$ we can write conditional independence relationships of the form $X_A \perp X_B \mid X_C$. We are looking for factorizations of the following type:

$G$ --(d-separation / Bayes Ball)--> $I(G)$
$P$ --(factorization)--> $I(P)$

$$P(X_A, X_B \mid X_C) = P(X_A \mid X_C) \, P(X_B \mid X_C)$$

If for a given distribution $P$ we can write $I(G) \subseteq I(P)$, then $P$ is Markov w.r.t. $G$. A fully connected graph has no conditional independencies. Also, if $P$ is Markov w.r.t. $G$, then we can write

$$P(x_{1:n}) = \prod_{i=1}^{n} P(x_i \mid x_{pa(i)}),$$

i.e. each $X_i$ is independent of its non-descendants given its parents $X_{pa(i)}$.

Markov Blanket: The Markov blanket of a node is the only knowledge needed to predict the behavior of that node:

$$P(x_i \mid x_{\setminus i}) = P(x_i \mid x_{mb(i)})$$

Figure 1: Markov Blanket

In Figure 1, $mb(5) = \{3, 2, 6\}$. The Markov blanket of a node consists of its parents, children and co-parents:

$$mb(i) = pa(i) \cup \mathrm{children}(i) \cup \mathrm{coparents}(i).$$

This relation is important because once we know the Markov blanket of a node, we can ignore the rest of the nodes. In the directed graph above,

$$P(x_5 \mid x_{\setminus 5}) = \frac{P(x_5, x_{\setminus 5})}{P(x_{\setminus 5})} \propto P(x_5 \mid x_3) \, P(x_6 \mid x_2, x_5).$$
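As a quick check of this property (not from the lecture: Figure 1 is not reproduced here, so the network below is a hypothetical six-node binary network chosen so that $mb(5) = \{2, 3, 6\}$, with arbitrary conditional probability tables), a brute-force computation confirms that conditioning node 5 on its Markov blanket gives the same answer as conditioning on all remaining nodes:

```python
import itertools
import random

# Hypothetical 6-node binary Bayesian network chosen so that mb(5) = {2, 3, 6}:
# edges 1 -> 2, 1 -> 3, 3 -> 5, 2 -> 6, 5 -> 6; node 4 is isolated.
random.seed(0)

parents = {1: (), 2: (1,), 3: (1,), 4: (), 5: (3,), 6: (2, 5)}

def rand_cpt(n_parents):
    """P(x_i = 1 | each parent configuration), filled with arbitrary numbers."""
    return {cfg: random.random() for cfg in itertools.product((0, 1), repeat=n_parents)}

cpt = {i: rand_cpt(len(pa)) for i, pa in parents.items()}

def joint(x):
    """P(x_1, ..., x_6) = prod_i P(x_i | x_pa(i))."""
    p = 1.0
    for i, pa in parents.items():
        p1 = cpt[i][tuple(x[j] for j in pa)]
        p *= p1 if x[i] == 1 else 1.0 - p1
    return p

def conditional(i, x, given):
    """P(x_i = x[i] | x_given) by brute-force summation over the remaining nodes."""
    free = [j for j in parents if j != i and j not in given]
    num = den = 0.0
    for vals in itertools.product((0, 1), repeat=len(free)):
        y = dict(x)
        y.update(zip(free, vals))
        num += joint(y)                       # x_i stays fixed at x[i]
        for xi in (0, 1):                     # denominator also sums over x_i
            den += joint({**y, i: xi})
    return num / den

x = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0}
print(conditional(5, x, given={1, 2, 3, 4, 6}))   # P(x_5 | x_{-5})
print(conditional(5, x, given={2, 3, 6}))         # P(x_5 | x_{mb(5)}), identical value
```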

We are primarily interested in doing inference on Bayesian networks, i.e. calculating the posterior probability $P(x_n \mid x_v)$ of a node $n$ given a set of observed variables $v$.

2 Exact Inference

A particular class of inference algorithms we will be focusing on is called Sum-Product or Belief Propagation (for chains or trees). In the particular case of a chain, this is the Forwards-Backwards Algorithm.

2.1 Chain Structured Graph

We will use this simplest case to see how the algorithm works. This is a first-order Markov chain, and the distribution corresponding to it is

$$P(z_{1:m}) = P(z_1) \prod_{i=2}^{m} P(z_i \mid z_{i-1}),$$

where each $z_i \in [K]$. Here $P(z_1)$ is a size-$K$ vector and each $P(z_i \mid z_{i-1})$ is a $K \times K$ matrix, so the model has $O(mK^2)$ parameters. The Markov blanket of an interior node is $mb(j) = \{j-1, j+1\}$.
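As an illustration of the $O(mK^2)$ cost (this sketch is not from the lecture and the transition probabilities are arbitrary Dirichlet draws), the marginals of such a chain can be computed by passing a single $K$-vector forward through the transition matrices, one matrix-vector product per step:

```python
import numpy as np

K, m = 3, 5                                     # K states, chain of length m
rng = np.random.default_rng(0)

pi = rng.dirichlet(np.ones(K))                  # P(z_1), a size-K vector
A = [rng.dirichlet(np.ones(K), size=K)          # one K x K transition matrix per step;
     for _ in range(m - 1)]                     # rows index z_{i-1}, columns index z_i

# Each marginal P(z_i) is one K x K matrix-vector product away from P(z_{i-1}),
# so computing all of them costs O(m K^2) rather than O(K^m) brute force.
marginals = [pi]
for At in A:
    marginals.append(marginals[-1] @ At)

for i, p in enumerate(marginals, start=1):
    print(f"P(z_{i}) =", np.round(p, 3))        # each marginal sums to 1
```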

To describe a concrete example of this, we will look at admixture models.

2.2 Hidden Markov Model (Extension of the Chain Structure)

$$P(x_{1:m}, z_{1:m}) = P(z_{1:m}) \, P(x_{1:m} \mid z_{1:m}) = \underbrace{P(z_1) \prod_{i=2}^{m} P(z_i \mid z_{i-1})}_{\text{transition}} \; \underbrace{\prod_{i=1}^{m} P(x_i \mid z_i)}_{\text{emission / observation}}$$

In the chain structure we have only local dependence among the observations, which is not desired; here we obtain long-range dependence. Another advantage of working with this model is that we do not have to maintain $m$ separate models.

Admixture:

Figure 2: Admixture Model

2.3 Inference Problems on HMMs

2.3.1 Filtering

We receive data up to the current time point and want to infer the hidden variable at that point: $P(z_t \mid x_{1:t})$. This is an online method.

2.3.2 Smoothing

We have observed all the data and want to infer the hidden variables: $P(z_t \mid x_{1:m})$. This is a batch prediction task.

2.3.3 Prediction

Given data up to some point, what will the next hidden state be: $P(z_{t+1} \mid x_{1:t})$.

2.3.4 MAP estimation

Compute $z^*_{1:m} = \arg\max_{z_{1:m}} P(z_{1:m} \mid x_{1:m})$.

2.3.5 Marginal Likelihood / Evidence

Compute $P(x_{1:t})$.
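Before deriving algorithms for these inference problems, it may help to see the generative process itself. The following sketch (not from the lecture; the parameters $\pi$, $A$, $B$ and the dimensions are made-up values) samples a sequence from a discrete-emission HMM using exactly the transition/emission factorization above:

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, m = 2, 3, 10                          # K hidden states, V observation symbols, length m

pi = np.array([0.6, 0.4])                   # P(z_1)
A  = np.array([[0.9, 0.1],                  # A[i, j] = P(z_t = j | z_{t-1} = i)
               [0.2, 0.8]])
B  = np.array([[0.7, 0.2, 0.1],             # B[j, v] = P(x_t = v | z_t = j)
               [0.1, 0.3, 0.6]])

def sample_hmm(pi, A, B, m, rng):
    """Draw (z_1..z_m, x_1..x_m) from the joint P(z_{1:m}) P(x_{1:m} | z_{1:m})."""
    z = np.empty(m, dtype=int)
    x = np.empty(m, dtype=int)
    z[0] = rng.choice(K, p=pi)
    for t in range(m):
        if t > 0:
            z[t] = rng.choice(K, p=A[z[t - 1]])     # transition
        x[t] = rng.choice(V, p=B[z[t]])             # emission / observation
    return z, x

z, x = sample_hmm(pi, A, B, m, rng)
print("hidden:  ", z)
print("observed:", x)
```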

Let us look at one concrete example of the computations involved, for a chain over $z_1, z_2, z_3$, each taking one of $K$ values:

$$P(z_1) = \sum_{z_2} \sum_{z_3} P(z_1, z_2, z_3) \qquad O(K^3)$$
$$= P(z_1) \sum_{z_2} P(z_2 \mid z_1) \sum_{z_3} P(z_3 \mid z_2) \qquad O(K^2) \text{ per sum}$$

Now we introduce observations; this is a smoothing problem.

$$P(z_1 \mid x_{1:3}) = \sum_{z_2} \sum_{z_3} P(z_1, z_2, z_3 \mid x_{1:3}) = \sum_{z_2} \sum_{z_3} \frac{P(x_{1:3} \mid z_{1:3}) \, P(z_{1:3})}{P(x_{1:3})}$$
$$\propto \sum_{z_2} \sum_{z_3} P(z_1) P(x_1 \mid z_1) P(z_2 \mid z_1) P(x_2 \mid z_2) P(z_3 \mid z_2) P(x_3 \mid z_3)$$

ignoring the denominator, as it is just a normalization constant,

$$= \sum_{z_2} P(z_1) P(x_1 \mid z_1) P(z_2 \mid z_1) P(x_2 \mid z_2) \underbrace{\sum_{z_3} P(z_3 \mid z_2) P(x_3 \mid z_3)}_{M_{3 \to 2}(z_2)}$$
$$= \sum_{z_2} P(z_1) P(x_1 \mid z_1) P(z_2 \mid z_1) P(x_2 \mid z_2) \, M_{3 \to 2}(z_2) = M_{2 \to 1}(z_1) \, P(z_1) \, P(x_1 \mid z_1)$$
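A small numeric sanity check (not from the lecture; the distributions below are arbitrary Dirichlet draws) that the message-passing computation of $P(z_1 \mid x_{1:3})$ above matches brute-force enumeration over $z_2$ and $z_3$:

```python
import numpy as np

rng = np.random.default_rng(2)
K, V = 3, 4
pi = rng.dirichlet(np.ones(K))              # P(z_1)
A  = rng.dirichlet(np.ones(K), size=K)      # A[i, j] = P(z_{t+1} = j | z_t = i)
B  = rng.dirichlet(np.ones(V), size=K)      # B[j, v] = P(x_t = v | z_t = j)
x  = [1, 0, 3]                              # observed x_1, x_2, x_3

# Brute force: sum the full joint over z_2 and z_3, then normalize.  O(K^3)
brute = np.zeros(K)
for z1 in range(K):
    for z2 in range(K):
        for z3 in range(K):
            brute[z1] += (pi[z1] * B[z1, x[0]] * A[z1, z2] * B[z2, x[1]]
                          * A[z2, z3] * B[z3, x[2]])
brute /= brute.sum()

# Message passing: eliminate z_3, then z_2.  Each step is O(K^2).
M_3to2 = A @ B[:, x[2]]                     # M_{3->2}(z_2) = sum_{z_3} P(z_3|z_2) P(x_3|z_3)
M_2to1 = A @ (B[:, x[1]] * M_3to2)          # M_{2->1}(z_1) = sum_{z_2} P(z_2|z_1) P(x_2|z_2) M_{3->2}(z_2)
mp = pi * B[:, x[0]] * M_2to1
mp /= mp.sum()

print(np.allclose(brute, mp))               # True: both give P(z_1 | x_{1:3})
```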

Let us now apply these ideas to the HMM.

Forwards-Backwards Algorithm: The goal is to compute the smoothing probability for every instant $t$:

$$\gamma_t(j) = P(z_t = j \mid x_{1:m}) \propto P(x_{1:m} \mid z_t = j) \, P(z_t = j) = P(x_{1:t} \mid z_t = j) \, P(x_{t+1:m} \mid z_t = j) \, P(z_t = j)$$
$$\propto \underbrace{P(z_t = j \mid x_{1:t})}_{\text{forward pass } \alpha_t(j)} \; \underbrace{P(x_{t+1:m} \mid z_t = j)}_{\text{backward pass } \beta_t(j)}$$

Forward pass. Define $\alpha_t(j) \triangleq P(z_t = j \mid x_{1:t})$. We want to write this as a function of $\alpha_{t-1}$, which will allow us to use dynamic programming:

$$\alpha_t(j) = \frac{P(x_t \mid x_{1:t-1}, z_t = j) \, P(z_t = j \mid x_{1:t-1})}{P(x_t \mid x_{1:t-1})} \propto P(x_t \mid z_t = j) \sum_i P(z_t = j, z_{t-1} = i \mid x_{1:t-1})$$
$$= P(x_t \mid z_t = j) \sum_i P(z_t = j \mid z_{t-1} = i, x_{1:t-1}) \underbrace{P(z_{t-1} = i \mid x_{1:t-1})}_{\alpha_{t-1}(i)}$$

so that

$$\alpha_t(j) \propto P(x_t \mid z_t = j) \sum_i P(z_t = j \mid z_{t-1} = i) \, \alpha_{t-1}(i) = \psi_t(j) \underbrace{\sum_i \psi_{t-1,t}(i, j) \, \alpha_{t-1}(i)}_{M_{t-1 \to t}(j)} \qquad O(K^2) \text{ per step}$$

Initialization: $\alpha_1(j) = P(z_1 = j \mid x_1) \propto P(x_1 \mid z_1 = j) \, P(z_1 = j) = \psi_1(j) \, \psi_{0,1}(j)$.

Backward pass. Define $\beta_t(j) \triangleq P(x_{t+1:m} \mid z_t = j)$, so that $\beta_{t-1}(j) = P(x_{t:m} \mid z_{t-1} = j)$. Then

$$\beta_{t-1}(j) = \sum_k P(x_{t:m}, z_t = k \mid z_{t-1} = j) = \sum_k P(x_{t:m} \mid z_t = k, z_{t-1} = j) \, P(z_t = k \mid z_{t-1} = j)$$
$$= \sum_k P(x_t \mid z_t = k) \underbrace{P(x_{t+1:m} \mid z_t = k)}_{\beta_t(k)} P(z_t = k \mid z_{t-1} = j) = \sum_k \beta_t(k) \, \psi_t(k) \, \psi_{t-1,t}(j, k) \qquad O(K^2) \text{ per step}$$

Initialization: $\beta_m(j) = 1$. The total time complexity is $O(mK^2)$, and the smoothing probabilities are

$$\gamma_t(j) \propto \alpha_t(j) \, \beta_t(j).$$
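A minimal, self-contained sketch of these recursions (this is not the course's reference implementation; the parameters and observation sequence are made up). It uses the normalized $\alpha_t(j) = P(z_t = j \mid x_{1:t})$ exactly as defined above, keeps $\beta$ unnormalized with $\beta_m(j) = 1$, and normalizes $\gamma_t$ at the end:

```python
import numpy as np

# Example parameters (made up): K = 2 hidden states, V = 3 observation symbols.
pi = np.array([0.6, 0.4])                       # P(z_1)
A  = np.array([[0.9, 0.1],                      # A[i, j] = P(z_t = j | z_{t-1} = i)   (psi_{t-1,t})
               [0.2, 0.8]])
B  = np.array([[0.7, 0.2, 0.1],                 # B[j, v] = P(x_t = v | z_t = j)       (psi_t)
               [0.1, 0.3, 0.6]])
x  = np.array([0, 0, 2, 1, 2])                  # observed sequence x_{1:m}
m, K = len(x), len(pi)

# Forward pass: alpha_t(j) = P(z_t = j | x_{1:t}), normalized at every step.
alpha = np.zeros((m, K))
alpha[0] = pi * B[:, x[0]]
alpha[0] /= alpha[0].sum()
for t in range(1, m):
    alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)    # psi_t(j) * sum_i psi_{t-1,t}(i,j) alpha_{t-1}(i)
    alpha[t] /= alpha[t].sum()

# Backward pass: beta_t(j) = P(x_{t+1:m} | z_t = j), with beta_m(j) = 1.
beta = np.ones((m, K))
for t in range(m - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])  # sum_k psi_{t,t+1}(j,k) psi_{t+1}(k) beta_{t+1}(k)

# Smoothing: gamma_t(j) propto alpha_t(j) * beta_t(j).
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print(np.round(gamma, 3))                         # each row sums to 1
```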

Another problem that can be solved in the same way is MAP estimation (the Viterbi Algorithm). This is similar to the forwards-backwards algorithm. We want

$$z^* = \arg\max_{z_{1:m}} P(z_{1:m} \mid x_{1:m}),$$

and to compute it we define the partial computation

$$\delta_t(j) = \max_{z_{1:t-1}} P(z_{1:t-1}, z_t = j \mid x_{1:t}) = \max_{z_{1:t-2}, i} P(z_{1:t-2}, z_{t-1} = i, z_t = j \mid x_{1:t-1}, x_t)$$
$$\propto \max_{z_{1:t-2}, i} P(x_{1:t-1}, x_t \mid z_t = j, z_{t-1} = i, z_{1:t-2}) \, P(z_t = j, z_{t-1} = i, z_{1:t-2})$$
$$= \max_{z_{1:t-2}, i} \{ P(x_t \mid z_t = j) \, P(x_{1:t-1} \mid z_{t-1} = i, z_{1:t-2}) \, P(z_t = j \mid z_{t-1} = i) \, P(z_{t-1} = i, z_{1:t-2}) \}$$
$$= \max_i \{ \underbrace{\max_{z_{1:t-2}} P(z_{t-1} = i, z_{1:t-2} \mid x_{1:t-1})}_{\delta_{t-1}(i)} \, P(x_t \mid z_t = j) \, P(z_t = j \mid z_{t-1} = i) \}$$

so that

$$\delta_t(j) \propto \max_i \delta_{t-1}(i) \, \psi_t(j) \, \psi_{t-1,t}(i, j) \qquad O(K^2) \text{ per step}$$

(the normalization constant does not affect the argmax). First observation: $\delta_1(j) = P(z_1 = j \mid x_1)$. Last observation: $\max_j \delta_m(j) = \max_{z_{1:m}} P(z_{1:m} \mid x_{1:m})$, and backtracking the maximizing indices recovers $z^*_{1:m}$.
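A self-contained sketch of this recursion (again not the course's code; it reuses the same made-up parameters as the forward-backward sketch), working in log space and storing back-pointers so that $z^*$ can be recovered by backtracking:

```python
import numpy as np

# Same made-up parameters and observations as in the forward-backward sketch above.
pi = np.array([0.6, 0.4])
A  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
B  = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.3, 0.6]])
x  = np.array([0, 0, 2, 1, 2])
m, K = len(x), len(pi)

# Viterbi in log space: delta[t, j] = max over z_{1:t-1} of log P(z_{1:t-1}, z_t = j, x_{1:t}),
# which has the same argmax as the conditional delta_t(j) used in the notes.
log_A, log_B = np.log(A), np.log(B)
delta = np.zeros((m, K))
back = np.zeros((m, K), dtype=int)
delta[0] = np.log(pi) + log_B[:, x[0]]
for t in range(1, m):
    scores = delta[t - 1][:, None] + log_A          # scores[i, j] = delta_{t-1}(i) + log psi_{t-1,t}(i, j)
    back[t] = scores.argmax(axis=0)                 # best previous state i for each current state j
    delta[t] = scores.max(axis=0) + log_B[:, x[t]]  # add log psi_t(j)

# Backtrack the maximizing indices to recover z* (0-based state labels).
z_star = np.zeros(m, dtype=int)
z_star[-1] = delta[-1].argmax()
for t in range(m - 2, -1, -1):
    z_star[t] = back[t + 1, z_star[t + 1]]
print("z* =", z_star)
```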

Learning / Parameter Estimation: Given observed data $\{x^{(1)}, \ldots, x^{(n)}\}$ and parameters $\theta = (\pi, A, B)$, the log likelihood is

$$\ell(\theta) = \sum_{i=1}^{n} \log P(x^{(i)} \mid \theta), \qquad \hat{\theta} = \arg\max_{\theta} \ell(\theta).$$

E-M Algorithm (Baum-Welch): The E-step involves computing the posterior probabilities $P(z_t \mid x_{1:m}, \theta^{(t)})$ and the posterior probabilities at two adjacent instants in time, $P(z_{t-1}, z_t \mid x_{1:m}, \theta^{(t)})$, both of which we can calculate easily given $\alpha$ and $\beta$. The M-step then re-estimates the parameters from this soft classification. This is a non-convex problem.

One special case of the HMM (useful for imputation) is the Factorial HMM:

Figure 3: Factorial HMM

$$z_j^{(t)} \in \{0, 1\}, \qquad X = Z^{(1)} + Z^{(2)}$$

Initially the two Markov chains are independent, but they become dependent after observing $x$, so exact inference becomes difficult in this case.

3 Conclusion

Hidden Markov models are generative models, in which the joint distribution of observations and hidden states, or equivalently both the prior distribution of hidden states (the transition probabilities) and the conditional distribution of observations given states (the emission probabilities), are modeled. HMMs are useful in settings where the flexibility of this decision process can be exploited to achieve better performance.