Hidden Markov Models & The Multivariate Gaussian (10/26/04)

CS281A/Stat241A: Statistical Learning Theory

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Lecturer: Michael I. Jordan    Scribes: Jonathan W. Hu

1 Hidden Markov Models

As a brief review, hidden Markov models (HMMs) are appropriate for modeling sequential data. Thus, HMMs have been applied to speech recognition, gene finding, and other applications which may involve, but are not restricted to, parsing or segmenting. The formal structure of an HMM is shown below, where the representation can be viewed as a chain of mixture models. The hidden states are denoted by q_t and the observed values are denoted by y_t, where t is a specific point in time.

[Figure: graphical model of an HMM: a chain of hidden states q_0 -> q_1 -> ... -> q_T linked by the transition matrix A, with each state q_t emitting an observation y_t.]

2 HMM Parameter Estimation

Recall from last lecture the complete log likelihood for the HMM:

    l_c(\theta) = \sum_{i=1}^{M} q_0^i \log \pi_i + \sum_{t=0}^{T-1} \sum_{i,j=1}^{M} q_t^i q_{t+1}^j \log a_{ij} + \sum_{t=0}^{T} \log p(y_t \mid q_t, \eta)    (1)

where p(y_t | q_t, \eta) is the probability distribution of each output node. We observe that the complete probability distribution is in the exponential family, where q_0^i is the sufficient statistic for \pi_i and \sum_{t=0}^{T-1} q_t^i q_{t+1}^j is the sufficient statistic for a_{ij} (the sufficient statistic for \eta depends on the distribution chosen for p(y_t | q_t, \eta)).

Note: In this discussion, we leave the distribution on the output values arbitrary and thus ignore the \log p(y_t | q_t, \eta) term. Refer to chapter 12 of the text for an example where the outputs y_t are multinomial variables.
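Because the state indicators q_t^i are one-hot, each term of equation (1) is a single table lookup once a concrete emission model is fixed. As an illustration only (not from the notes), here is a minimal NumPy sketch that evaluates the complete log likelihood for a hypothetical multinomial emission matrix B; all variable names are mine:

    import numpy as np

    def complete_log_likelihood(states, obs, pi, A, B):
        # states: length T+1 array of state indices q_0..q_T (the "complete" data)
        # obs:    length T+1 array of observation indices y_0..y_T
        # pi:     (M,) initial distribution, A: (M, M) transitions, B: (M, K) emissions
        ll = np.log(pi[states[0]])                          # q_0^i log pi_i term
        ll += np.log(A[states[:-1], states[1:]]).sum()      # transition terms of eq. (1)
        ll += np.log(B[states, obs]).sum()                  # emission terms of eq. (1)
        return ll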

2.1 E Step

For the E step, we take the expected value of the complete log likelihood, conditioning on the parameters at iteration p, \theta^{(p)}:

    \langle l_c(\theta; q, y) \rangle_{y, \theta^{(p)}}
      = \left\langle \sum_{i=1}^{M} q_0^i \log \pi_i + \sum_{t=0}^{T-1} \sum_{i,j=1}^{M} q_t^i q_{t+1}^j \log a_{ij} + \sum_{t=0}^{T} \log p(y_t \mid q_t, \eta) \right\rangle    (2)
      = \sum_{i=1}^{M} \langle q_0^i \rangle \log \pi_i + \sum_{t=0}^{T-1} \sum_{i,j=1}^{M} \langle q_t^i q_{t+1}^j \rangle \log a_{ij} + \sum_{t=0}^{T} \langle \log p(y_t \mid q_t, \eta) \rangle    (3)

Thus, we must compute \langle q_0^i \rangle_{y,\theta^{(p)}} and \langle q_t^i q_{t+1}^j \rangle_{y,\theta^{(p)}}. However, these are just marginal distributions:

    \langle q_0^i \rangle_{y,\theta^{(p)}} = E(q_0^i \mid y, \theta^{(p)})    (4)
                                           = p(q_0^i = 1 \mid y, \theta^{(p)})    (5)
    \langle q_t^i q_{t+1}^j \rangle_{y,\theta^{(p)}} = E(q_t^i q_{t+1}^j \mid y, \theta^{(p)})    (6)
                                                     = p(q_t^i = 1, q_{t+1}^j = 1 \mid y, \theta^{(p)})    (7)

We can compute these values via the SUM-PRODUCT algorithm. While the marginals can be computed via the SUM-PRODUCT algorithm, we will remain consistent with the HMM literature and show how to calculate these values via the alpha-beta algorithm (also called forward-backward). We first show that the calculation of the β's in the alpha-beta algorithm is identical to the SUM-PRODUCT algorithm. Consider the fragment of the graphical model representation of the HMM below:

[Figure: fragment of the HMM graph: the edge from q_t to q_{t+1} (transition matrix A), with emissions y_t and y_{t+1}.]

The β's of the alpha-beta algorithm are given by:

    \beta(q_t) = \sum_{q_{t+1}} p(y_{t+1} \mid q_{t+1}) \, \beta(q_{t+1}) \, a_{q_t, q_{t+1}}    (8)

By the SUM-PRODUCT algorithm, the message sent from q_{t+1} to q_t is given by:

    m_{q_{t+1}}(q_t) = \sum_{q_{t+1}} m_{q_{t+2}}(q_{t+1}) \, p(y_{t+1} \mid q_{t+1}) \, a_{q_t, q_{t+1}}    (9)
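Equation (8) translates directly into code. The following NumPy sketch of the backward recursion assumes the same hypothetical multinomial emission matrix B as above and the usual initialization \beta(q_T) = 1:

    import numpy as np

    def backward(obs, A, B):
        # beta[t, i] = p(y_{t+1}, ..., y_T | q_t = i), computed by the recursion of eq. (8)
        T = len(obs) - 1
        M = A.shape[0]
        beta = np.ones((T + 1, M))                 # beta(q_T) = 1 by convention
        for t in range(T - 1, -1, -1):
            # sum over q_{t+1} of p(y_{t+1} | q_{t+1}) * beta(q_{t+1}) * a_{q_t, q_{t+1}}
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta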

We see that this is exactly the same as the calculation of the β's, where \beta(q_t) = m_{q_{t+1}}(q_t). Note that we drop the q_{t+1} notation in the β's since the chain structure of the HMM already implies that the message is sent by q_{t+1}. We can also write \beta(q_t) as follows:

    \beta(q_t) \triangleq p(y_{t+1}, \ldots, y_T \mid q_t)    (10)

which is the probability of emitting the partial sequence of outputs y_{t+1}, ..., y_T given that the system starts in state q_t. In the alpha-beta algorithm, the α's are defined to be:

    \alpha(q_t) \triangleq p(y_0, \ldots, y_t, q_t)    (11)

which is the probability of emitting the partial sequence of outputs y_0, ..., y_t and ending up in state q_t.

2.2 M Step

We define \gamma_t^i to be equal to \langle q_t^i \rangle, and \xi_{t,t+1}^{ij} to be equal to \langle q_t^i q_{t+1}^j \rangle. We can write the expected complete log likelihood as:

    \langle l_c(\theta) \rangle = \sum_{i=1}^{M} \gamma_0^i \log \pi_i + \sum_{t=0}^{T-1} \sum_{i,j=1}^{M} \xi_{t,t+1}^{ij} \log a_{ij} + \sum_{t=0}^{T} \langle \log p(y_t \mid q_t, \eta) \rangle    (12)

In deriving the M step for the HMM, we use a lemma that is useful throughout this class, and thus we present it here. Given J(\pi) as follows:

    J(\pi) = \sum_i a_i \log \pi_i    (13)

we would like to maximize J(\pi) such that \sum_i \pi_i = 1 and \pi_i > 0. The solution is \hat{\pi}_i = a_i / \sum_j a_j. To see that this is true, we simply take the derivative with respect to \pi_i and set it to zero. We use a Lagrange multiplier to represent the constraint that the \pi_i's must sum to one:

    J(\pi) = \sum_i a_i \log \pi_i + \lambda \left( 1 - \sum_i \pi_i \right)    (14)
    \frac{\partial J}{\partial \pi_i} = \frac{a_i}{\pi_i} - \lambda    (15)
    \lambda = \frac{a_i}{\pi_i}    (16)
    \hat{\pi}_i = \frac{a_i}{\lambda} = \frac{a_i}{\sum_j a_j}    (17)

(Setting the derivative to zero gives \pi_i = a_i / \lambda; summing over i and using the constraint then gives \lambda = \sum_j a_j.)

Using this lemma, we derive the following equations for the M step:

    \hat{\pi}_i^{(p+1)} = \frac{\gamma_0^i}{\sum_j \gamma_0^j}    (19)

    \hat{a}_{ij}^{(p+1)} = \frac{\sum_{t=0}^{T-1} \xi_{t,t+1}^{ij}}{\sum_{k=1}^{M} \sum_{t=0}^{T-1} \xi_{t,t+1}^{ik}}    (20)
                         = \frac{\sum_{t=0}^{T-1} \xi_{t,t+1}^{ij}}{\sum_{t=0}^{T-1} \gamma_t^i}    (21)
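Once the E step has produced the expected sufficient statistics \gamma and \xi, the updates (19)-(21) are just normalized counts. A minimal NumPy sketch (array names are mine):

    import numpy as np

    def m_step(gamma, xi):
        # gamma: (T+1, M) with gamma[t, i] = p(q_t = i | y, theta^(p))
        # xi:    (T, M, M) with xi[t, i, j] = p(q_t = i, q_{t+1} = j | y, theta^(p))
        pi_hat = gamma[0] / gamma[0].sum()                # eq. (19)
        A_hat = xi.sum(axis=0)                            # numerator of eq. (20)
        A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)  # eq. (20); equals eq. (21) since sum_j xi^{ij} = gamma^i
        return pi_hat, A_hat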

In the case of HMMs, these equations are also known as the Baum-Welch updates.

Note: In some cases, we would like to calculate the configuration of states of the HMM that has the highest probability given observed values for y_t. We can solve this by using the well-known Viterbi algorithm, which essentially is the MAX-PRODUCT algorithm.

2.3 Concrete Example

To give a concrete example, we compute the expected complete log likelihood when the probability distribution on the output values is Gaussian (dropping the Gaussian normalization constants, which do not involve \mu_i):

    \langle l_c(\theta) \rangle = \cdots + \sum_{t=0}^{T} \langle \log p(y_t \mid q_t, \eta) \rangle    (22)
        = \cdots + \sum_{t=0}^{T} \left\langle \log \prod_{i=1}^{M} p(y_t \mid q_t^i = 1, \eta)^{q_t^i} \right\rangle    (23)
        = \cdots + \sum_{t=0}^{T} \sum_{i=1}^{M} \langle q_t^i \rangle \log p(y_t \mid q_t^i = 1, \eta)    (24)
        = \cdots + \sum_{t=0}^{T} \sum_{i=1}^{M} \langle q_t^i \rangle \left( -\frac{1}{2\sigma_i^2} (y_t - \mu_i)^2 \right)    (25)
        = \cdots + \sum_{t=0}^{T} \sum_{i=1}^{M} \gamma_t^i \left( -\frac{1}{2\sigma_i^2} (y_t - \mu_i)^2 \right)    (26)

Note that this is a weighted least squares problem.

2.4 Numerical Issues

When implementing an HMM on a computer, special care has to be taken with regard to numerical issues. Specifically, the forward and backward recursions involve repeated multiplications of probabilities (i.e., numbers less than one). These repeated multiplications of small numbers generally lead to underflow. However, we can simply get around this problem by normalizing after each recursion step. See chapter 12 for more details on how to derive these normalization update equations.

3 Factor Analysis Models and HMMs

So far, we've viewed HMMs as a chain of mixture models where the states q_t are based on a discrete latent variable. We can also consider factor analysis models, which are based on continuous latent variables. The underlying graphs of these models are identical. Roughly, factor analysis can be considered a probabilistic form of principal component analysis (PCA).
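One common way to carry out the normalization mentioned in Section 2.4 is to rescale \alpha(q_t) to sum to one at each step of the forward recursion and keep the scale factors; this is a standard scheme, though not necessarily the exact derivation given in chapter 12. A sketch, again assuming a hypothetical multinomial emission matrix B:

    import numpy as np

    def forward_scaled(obs, pi, A, B):
        # Scaled forward recursion: alpha[t] is normalized to sum to one at each step,
        # and the scale factors c[t] are stored so that nothing underflows.
        T = len(obs) - 1
        M = len(pi)
        alpha = np.zeros((T + 1, M))
        c = np.zeros(T + 1)
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T + 1):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        return alpha, c

Under this scheme the log likelihood of the observed sequence is np.log(c).sum(), and the same factors c[t] can be reused to scale the backward recursion.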

[Figure: the two-node factor analysis model, with latent variable X and observed variable Y.]

In the dynamical generalization of factor analysis, called the Kalman filter, the hidden states are represented by x_t and the observed values by y_t. To represent the transition between nodes, we allow the mean of the state at time t+1 to be a linear function of the state at time t, plus Gaussian error \epsilon_t with mean 0. The initial state, x_0, is endowed with a Gaussian distribution with mean 0 and covariance \Sigma_0. The state space model is shown below.

[Figure: state-space model: a chain of hidden states x_0 -> x_1 -> ... -> x_T linked by the transition matrix A, with each x_t generating an observation y_t through the output matrix C.]

This dynamical generalization of factor analysis yields time series analysis methods known as the Kalman filter and the Rauch-Tung-Striebel smoother.

4 Multivariate Gaussians

We often express the multivariate Gaussian distribution using the parameters \mu and \Sigma, where \mu is a d × 1 vector and \Sigma is a d × d symmetric matrix. Using these parameters, we have the following form for the density function:

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}    (27)

where x is a vector in R^d. Alternatively, we can use a different parameterization, the canonical parameterization. We define the canonical parameters as follows:

    \Lambda = \Sigma^{-1}    (28)
    \eta = \Sigma^{-1} \mu    (29)
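To make equations (27)-(29) concrete, here is a small NumPy sketch (illustrative only, with my own function names) that evaluates the moment-form density and computes the canonical parameters:

    import numpy as np

    def gaussian_density(x, mu, Sigma):
        # Moment-form density of eq. (27)
        d = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm

    def canonical_parameters(mu, Sigma):
        # Canonical parameters of eqs. (28)-(29): Lambda = Sigma^{-1}, eta = Sigma^{-1} mu
        Lambda = np.linalg.inv(Sigma)
        return Lambda, Lambda @ mu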

Note that these are invertible and that we can calculate the moment parameters as follows:

    \mu = \Lambda^{-1} \eta    (30)
    \Sigma = \Lambda^{-1}    (31)

Using the canonical parameterization, we obtain the following density function:

    p(x \mid \eta, \Lambda) = \exp\left\{ \eta^T x - \frac{1}{2} x^T \Lambda x + a(\eta, \Lambda) \right\}    (32)

We now introduce the trace trick. We define the trace of a square matrix A to be the sum of the diagonal elements a_{ii} of A:

    \mathrm{tr}(A) \triangleq \sum_i a_{ii}    (33)

An important property is its invariance to cyclical permutations of matrix products:

    \mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA)    (34)

Using the trace trick, we have the following:

    x^T \Lambda x = \mathrm{tr}(x^T \Lambda x) = \mathrm{tr}(\Lambda x x^T)    (35)

Thus, we can rewrite the density function as follows:

    p(x \mid \eta, \Lambda) = \exp\left\{ \eta^T x - \frac{1}{2} \mathrm{tr}(\Lambda x x^T) + a(\eta, \Lambda) \right\}    (36)

We now see that the sufficient statistic is (x, xx^T) and the canonical parameter is (\eta, -\frac{1}{2}\Lambda).

We can partition the d × 1 random vector x into a p × 1 sub-vector x_1 and a q × 1 sub-vector x_2, where d = p + q. The corresponding partitions of the \mu and \Sigma parameters are:

    \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}    (37)

    \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}    (38)

In the next class, we will talk about inverses of partitioned matrices. Given a block matrix:

    M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}    (39)

the Schur complement of the matrix M with respect to H, denoted M\H, is defined to be E - F H^{-1} G. We will also obtain an important result involving the determinant of M:

    |M| = |H| \, |M\H|    (40)
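The determinant identity (40) is easy to check numerically. A throwaway NumPy verification (block sizes chosen arbitrarily; H must be invertible):

    import numpy as np

    rng = np.random.default_rng(0)
    p, q = 3, 2
    M = rng.standard_normal((p + q, p + q))
    E, F = M[:p, :p], M[:p, p:]
    G, H = M[p:, :p], M[p:, p:]

    schur = E - F @ np.linalg.solve(H, G)        # Schur complement M\H = E - F H^{-1} G
    lhs = np.linalg.det(M)
    rhs = np.linalg.det(H) * np.linalg.det(schur)
    print(np.isclose(lhs, rhs))                  # True, matching eq. (40)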