Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Similar documents
Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

STA 4273H: Statistical Machine Learning

Hidden Markov Models Part 2: Algorithms

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named

STA 414/2104: Machine Learning

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Brief Introduction of Machine Learning Techniques for Content Analysis

Introduction to Machine Learning CMU-10701

Linear Dynamical Systems (Kalman filter)

Dynamic Approaches: The Hidden Markov Model

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

Machine Learning Summer School

Advanced Data Science

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

Introduction to Artificial Intelligence (AI)

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208

Lecture 6: Graphical Models: Learning

Note Set 5: Hidden Markov Models

Bayesian Hidden Markov Models and Extensions

CPSC 540: Machine Learning

Intelligent Systems (AI-2)

Data Analyzing and Daily Activity Learning with Hidden Markov Model

Master 2 Informatique Probabilistic Learning and Data Analysis

A Higher-Order Interactive Hidden Markov Model and Its Applications Wai-Ki Ching Department of Mathematics The University of Hong Kong

Statistical Processing of Natural Language

Machine Learning for OR & FE

STATS 306B: Unsupervised Learning Spring Lecture 5 April 14

Intelligent Systems (AI-2)

13: Variational inference II

LEARNING DYNAMIC SYSTEMS: MARKOV MODELS

Bayesian Networks BY: MOHAMAD ALSABBAGH

Probabilistic Machine Learning

Graphical Models and Kernel Methods

Conditional Random Field

The Origin of Deep Learning. Lili Mou Jan, 2015

Math 350: An exploration of HMMs through doodles.

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

Hidden Markov models

Statistical Methods for NLP

Statistical NLP: Hidden Markov Models. Updated 12/15

order is number of previous outputs

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

CSCE 471/871 Lecture 3: Markov Chains and

15-381: Artificial Intelligence. Hidden Markov Models (HMMs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Probabilistic Graphical Models

Basic math for biology

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Parametric Models Part III: Hidden Markov Models

Learning from Sequential and Time-Series Data

Hidden Markov Models

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391

Robert Collins CSE586 CSE 586, Spring 2015 Computer Vision II

Undirected Graphical Models

L23: hidden Markov models

Bayesian Networks Inference with Probabilistic Graphical Models

Pattern Recognition and Machine Learning

MACHINE LEARNING 2 UGM,HMMS Lecture 7

Hidden Markov Models Hamid R. Rabiee

Lecture 11: Hidden Markov Models

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Introduction to Machine Learning

Recall: Modeling Time Series. CSE 586, Spring 2015 Computer Vision II. Hidden Markov Model and Kalman Filter. Modeling Time Series

STA 4273H: Statistical Machine Learning

Probabilistic Models for Sequence Labeling

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

MCMC and Gibbs Sampling. Kayhan Batmanghelich

Lecture 13 : Variational Inference: Mean Field Approximation

Inference in Explicit Duration Hidden Markov Models

COMP90051 Statistical Machine Learning

10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course)

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

Hidden Markov Models and Gaussian Mixture Models

Multiscale Systems Engineering Research Group

Sequence modelling. Marco Saerens (UCL) Slides references

Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013

CS 7180: Behavioral Modeling and Decision- making in AI

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CS711008Z Algorithm Design and Analysis

Unsupervised Learning

LECTURE 15 Markov chain Monte Carlo

Expectation Maximization (EM)

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Directed and Undirected Graphical Models

Tutorial on Probabilistic Programming with PyMC3

Chapter 05: Hidden Markov Models

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Time-Sensitive Dirichlet Process Mixture Models

The Ising model and Markov chain Monte Carlo

Hybrid HMM/MLP models for time series prediction

Household Energy Disaggregation based on Difference Hidden Markov Model

Lecture 13: Structured Prediction

Transcription:

Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang

Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check the dependences and independences Define local conditional probability distributions Mixture models Latent factor analysis

Mixture Model Cancer is a general disease category consisting of many different types of diseases. For example, breast cancer has four major subtypes: normal-like, basal, luminal A and luminal B, with distinct clinical outcomes. Each subtype has different gene expression patterns. We need to infer the subtypes based on the observed gene expressions. Variables: subtypes (s), gene expressions (e) Relation: subtypes define the patterns of expressions s e N 1) Circles represent variables. Shadow/white circles represent observed/unobserved variables. 2) Rectangle indicates the variables in the same logic layer. The number (N) at the right-bottom corner represents the number of samples. 3

Variables: subtypes (s), gene expressions (e) Relation: subtypes define the patterns of expressions Important: carefully check the independences in your model s e N π μ s e Σ N Consider the local CPDs Different subtypes have different occurrence probability (s) K P s k, 1 i k1 Each expression pattern (indicated by i) independently follows different Gaussian distributions given the cancer subtype (e s) e, p s k k i k 4

Latent Factor Model Cancer is usually caused by multiple driving processes (factors), such as sustainable proliferation, resistance to cell death, immune escape and promoted vascular growth, etc. These driving processes have combined effects on the gene expressions patterns. We want to infer the driving processes based on large-scale gene expression datasets. Variables: driving factors (z i ), gene expressions (e j ) Relations: z determines the distribution of e Local CPDs Driving process z i follows a standard Gaussian distribution p z 0,1, i 1,, K i Expression e j also follows a Gaussian distribution but its mean is determined by a linear model of z p e z z z i1 K 2 j 1,..., k 0 i i, j 5

Variables: independent driving processes (z i ), gene expressions (e j ) Relations: z determines the distribution of e Local CPDs Driving process z i follows a standard Gaussian distribution p z 0,1, i 1,, K e 2 i Expression e j also follows a Gaussian distribution but its mean is determined by a linear model of z e 1 α 1 τ 1 z 1 z K e L N α 2 τ 2 α L τ L p e z z z i1 K 2 j 1,..., k 0j ij i, j 6

Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check the dependences and independences Define local conditional probability distributions Questions A vector or scalar for a variable? how to deal with the variables with unknown independences? (Recall assignment #2)

Outlines Stochastic Process and Markov property Dynamic Bayesian Networks Hidden Markov Models (HMMs) State-Observation Models A Few Examples of HMM Representations Calculations for HMMs Generate Simulated Data Extended Models

Markov / Memoryless Property Add the time dimension to variables X X (t) Assumptions X 0 X 1 X 2 X 3 Time can be discretized into interesting points t 1,t 2,...t n ( ) n P X P( X ( t) X (1),... X ( t 1)) Markov assumption holds P( X ) t 1 n P( X ( t) X ( t t Distribution is stationary (time invariant or homogeneous) 1 1)) i, j : P( X ( i) X ( i 1)) P( X ( j) X ( j 1))

Example: Random Walk Define X (t) as the coordinate of particle at time point t. The particle can randomly move to neighbor variables (nodes) with fixed probability. Then, P(X (t+1) ) is only determined by X (t). http://www.theosysbio.bio.ic.ac.uk/research/ana lysing-in-vivo-cell-migration-data/

Dynamic Bayesian Networks for Time Invariant Process/System

Consider a smoke detection tracking application, where we have 3 rooms connected in a row. Each room has a true smoke level (X) and a measured smoke level (Y) by a smoke detector situated in the middle of the room. Which of the below are the proper DBN structure(s) for this problem? X X 1 1 X 1 X 1 Y 1 Y 1 X 2 X 2 X 2 X 2 Y 2 Y 2 X 3 X 3 X 3 X 3 Y 3 Y 3 X 1 X 1 X 1 X 1 Y 1 Y 1 X 2 X 2 X 2 X 2 Y 2 Y 2 X 3 X 3 X 3 X 3 Y 3 Y 3

The Parameterization of Dynamic BNs Usually assume time-invariant systems (parameters are constant over time) The parameters in each time slice The parameters between internal state variables The emission or observation parameters between state variables and observation variables The transition parameters between two consecutive time slices

Dice problem You stay in a casino for a long time. You find that the probability of dice is equal for each of its six faces (1-6). But you doubt that there must be some trick on the dice. So you record the series of the dice number of each play, and want to find the problem... Hidden Markov Models Probabilistic Graphic Models, Fall 2011 15

Dice problem Hidden Markov Models Probabilistic Graphic Models, Fall 2011 16

Hidden Markov Models S 0 S 1 S 2 S 3 State variable S (dice index 1~K) O 1 O 2 O 3 Observation variable O (dice number 1~6) Local structure #1 Hidden states (cannot be observed) are linked as Markov chain: P(S T S t<t ) = P(S T S T-1 ) Local structure #2 Observable variables are only determined by the hidden state: P(O T S t<=t ) = P(O T S T )

State-Observation Models If the hidden variables can be discretized into several states, HMM can be represented as State- Observation Models. S 0 S 1 S 2 O 1 O 2 S 3 O 3 Hidden states (cannot be observed) are linked as Markov chain: P(S T S t<t ) = P(S T S T-1 ) Observable variables are only determined by the hidden state: P(O T S t<=t ) = P(O T S T )

State-Observation Models State-State Transmission Matrix T (n,n) S t = TS t-1 State-Observation Emission Matrix E (m,n) O t = ES t

Example: Stock Market Index We can define three different states: UP/DOWN/UNCHANGE Hidden Markov Models Probabilistic Graphic Models, Fall 2011 20

Calculations in HMMs Problem 1: P X θ, given the model and the observation sequence, infer the probability of getting that observation sequence from the model Problem 2: argmax Y P(X, Y θ), given the model and the observation sequence, infer the hidden labels of the sequence Problem 3: argmax θ P(X θ), if parameters are un known, learn them from the observation sequence

Question #1: P X θ Inference

Formal Model Representation HMMs (discretized) have the below parameters and the observation sequences X T ti, j, i, j 1... N, E ei, j, i 1... N, j 1... K,, 1... i i N X x, t 1,..., T t Hidden Markov Models Probabilistic Graphic Models, Fall 2011 23

Compute P(X θ) i P x,...,, 1 x y i t t t Forward/Backward algorithm Initialization: Induction: i e i i x1 1, N t 1 i t j t j, i ei, x j1 Termination: t 1 P X N T i1 i Y 1 X 1 Y 2 X 2 Y 3 X 3

i P x,,, 1 x y i Compute P(X θ) t t T t Forward/Backward algorithm Initialization: Induction: Termination: T N i 1 i t e j t i, j j, xt 1 t1 j1 Y 1 Y 2 X 1 X 2 P X i e j N 0 j j, x1 1 j1 Y 3 X 3

Full Chain Local Transition (forward) Local Transition (Backward) Figures from Rabiner LR. Proceedings of the IEEE 1989, 77(2):257-286.

Question #2: argmax Y P(X, Y θ) Inference

Full Chain argmax Y P(X, Y θ) Figures from Rabiner LR. Proceedings of the IEEE 1989, 77(2):257-286. Infer the most possible states

Compute argmax Y P(X, Y θ) HMM inference (Viterbi algorithm) Known transition matrix and emission matrix Infer the maximum probability STATE series The probability at time 0 with observation x 0 For Y i e 1 : 1, i i i, x The probability at time 1 with observation x 1 e max e t e max t 2, i i, x y y, x y, i i, x 1, y y, i y 1,, N y 1,, N 2 1 1 1 1 2 1 1 1 1 i arg max y ty i 2 1, 1 1, y 1,, N 1 Hidden Markov Models Probabilistic Graphic Models, Fall 2011 29 1

Compute argmax Y P(X, Y θ) HMM inference (Viterbi algorithm) The probability at time t with observation x t t, i i, x t1, y y, i y 1,, N i arg max 1, t, For the last observation e t t1 t1 t1 t t yt1 yt1 i y 1,, N t1 max t e max t T, i i, x 1,, 1,, T y N y y i T T 1 T 1 T 1 i arg max 1, t, T T yt 1 yt 1 i y 1,, N T 1 y * T arg max y T 1,, N T, y T Hidden Markov Models Probabilistic Graphic Models, Fall 2011 30

Compute argmax Y P(X, Y θ) HMM inference (Viterbi algorithm) After y t is inferred, trace back to get other y i y * T arg max y T 1,, N We can infer other y i * * y i y arg max t T T Hidden Markov Models Probabilistic Graphic Models, Fall 2011 31 T, y i arg max 1, t, t t yt 1 yt 1 i y 1,, N t1 T 1 T T T 1, y 1 y 1, i y 1,, N T 1 arg max y i y t * * t1 t t y 1,, N t 1, yt 1 yt 1, i t1 T

Question #3: argmax θ P(X θ) Learning

Compute argmax θ P(θ X) Baum Welch algorithm Problem 3: if all parameters are unknown, we need to learn the model from observations Unknown variables The probability of y=i for time point t and y=j for the following time point t+1 i, j P y i, y j X, t t t1 Hidden Markov Models Probabilistic Graphic Models, Fall 2011 33

Baum Welch algorithm T Use forward and backward algorithm to calculate probability i, j P y i, y j X, t t t1 i j t1, t Y Y i1 j1 i t e t i, j j, x t1 i t, E, 0 0 0 0 t i, j t1 j, x j j e t1 0 i P x 1,, x, y i t t t 0 i P x 1,, x y i, t t T t Hidden Markov Models Probabilistic Graphic Models, Fall 2011 34

Compute argmax θ P(X θ) Baum Welch algorithm The probability of Y=i for time point t: t i Y t j1 i, j So expected times for state stayed in Y=i and the expected times for state transition i-j: T 1 t i, t i j t0 T 1 t0 Hidden Markov Models Probabilistic Graphic Models, Fall 2011 35

Compute argmax θ P(X θ) Baum Welch algorithm Re-estimate all parameters (maximize) t T 1 t t0 i, j T 1 t0 t i, j i t0 Repeat above steps until convergence e ix, T x x i T t t0 0 P X P X log log t i t Hidden Markov Models Probabilistic Graphic Models, Fall 2011 36

Gibbs Sampling for Learning You can read textbooks and supplementary materials. Understand the prior setting of unknown parameters. Tobias Ryden. EM verus Markov chain Monte Carlo for Estimation of Hidden Markov Models: A Computational Perspective. Bayesian Analysis 2008, 3(4):659-688. Zoubin Grahramani, Michael I Jordan. Factorial Hidden Markov Models. Machine Learning 1997, 29(2-3):245-273.

MCMC & Gibbs Sampling In statistics and in statistical physics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of random samples from multivariate probability distribution (i.e. from the joint probability distribution of two or more random variables), when direct sampling is difficult. Its theory and detailed methods will be introduced in the following sections.

Generate the hidden variables using current parameter Y t ~P Y X, θ t Step 1: Inference (Gibbs Sampling) Initialization: generate the labeling sequencing according to the prior probability Y y, 1 t T t Re-generate Y t according to the initial setting P y Y, X, P y y, y, x, t t t1 t1 t P y, x y, y, P y y, t1 t t t1 t t1 t t1 t t t1 t t1 t t1 t1 t t t t t1 y, y y, e y, y P y, x y, P y y, P x y, P y y, t e t i We get a score proportional to the probability. So we can randomly generate Y t based on these scores.

Step 2: Parameter Re-Estimation Simply update the parameter in transition and emission probability matrix t i,j = e k,j = T t=1 T t=1 I Y t =j,y t 1 =i T I Y t =j,x t =k T Note: I is an indicator function If the chain is not long enough, run multiple runs of step 1 and then do step 2

Generate simulated data using HMMs Find the initial probability for hidden states Markov chain Monte Carlo (MCMC) They are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the desired distribution. Typical use of MCMC sampling can only approximate the target distribution, as there is always some residual effect of the starting position. More sophisticated MCMC-based algorithms such as coupling from the past can produce exact samples. A good chain will have rapid mixing (the stationary distribution is reached quickly starting from an arbitrary position) described further under Markov chain mixing time. Hidden Markov Models Probabilistic Graphic Models, Fall 2011 41

Extended Models Factorial HMMs Hierarchical HMMs (GeneScan) Bayesian HMMs Two HMMs comparisons Max Entropy Markov Models Conditional Random Fields Hidden Markov Models Probabilistic Graphic Models, Fall 2011 42

Factorial HMMs Hierarchical HMMs What ever complex structure if you can solve it. Tree Structured HMMs

Max Entropy Markov Models In machine learning, a maximum-entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for sequence labeling that combines features of hidden Markov models (HMMs) and maximum entropy (MaxEnt) models. An MEMM is a discriminative model that extends a standard maximum entropy classifier by assuming that the unknown values to be learned are connected in a Markov chain rather than being conditionally independent of each other. Y 0 Y 1 Y 2 Y 3 Y 0 Y 1 Y 2 Y 3 X 1 X 2 X 3 X 1 X 2 X 3 HMM MEMM Hidden Markov Models Probabilistic Graphic Models, Fall 2011 44

Max Entropy Markov Models Hidden Markov Models Probabilistic Graphic Models, Fall 2011 45

Supplementary Materials [1] Lawrence R Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE 1989, 77(2):257-286. [2] Zoubin Ghahramani. An introduction to Hidden Markov Models and Bayesian Networks. International Journal of Pattern Recognition and Artificial Intelligence 2001, 15(1):9-42.

The End of Chapter 4 HMM is a generative model HMM is powerful for modeling hidden mechanisms Hidden Markov Models Probabilistic Graphic Models, Fall 2011 47

Practice for HMMs Two different parts: Part I is recommended to be finished in three weeks); and, Part II can be finished later The full report should be submitted before December 5

Literature Survey Neural networks & deep learning Deep belief networks Convolutional neural networks Recurrent neural networks Residual neural networks Bayesian induction for concept learning Latent Dirichlet allocation & Dirichlet process Conditional random fields

Course Project 1 st choice: make your own proposal before Oct 17 (recommended!) Please send emails to Dongfang Wang Requirements Clearly describe your problems & aims The availability of data The source of your project (by your supervisor or an international competition or your own interest?) 50

Course Project 2 nd choice: object recognition of sundial ( 日晷 ). Do image segmentation and target object recognition. (Major model: MRFs) 3 rd choice: molecular subtyping of liver cancer. Identify consensus gene expression patterns from multiple datasets. (Major model: hierarchical models / LDA) You will be organized into groups, but every member needs to write his own project report