STATS 306B: Unsupervised Learning                                Spring 2014
Lecture 5: April 14
Lecturer: Lester Mackey                      Scribe: Brian Do and Robin Jia

5.1 Discrete Hidden Markov Models

5.1.1 Recap

In the last lecture, we introduced discrete HMMs as a way of modeling a sequence of datapoints $x = (x_0, x_1, \ldots, x_T)$ as emissions from dependent hidden states $z = (z_0, z_1, \ldots, z_T)$ with each $z_t \in \{1, \ldots, k\}$. More precisely, we developed the following generative model for HMMs:

1. Sample the first state: $z_0 \sim \mathrm{Mult}(\pi, 1)$.
2. Sample each new state from a transition distribution determined by the prior state: $z_t \mid z_{t-1} = j \overset{\mathrm{ind}}{\sim} \mathrm{Mult}(a_j, 1)$.
3. Sample datapoints $x_t \mid z_t \overset{\mathrm{ind}}{\sim} p(x_t \mid z_t, \eta)$ from a state-dependent emission distribution.

Recall that $\pi \in \mathbb{R}^k$ is the initial state probability vector, each $a_j \in \mathbb{R}^k$ is the vector of transition probabilities from state $j$ to all other states, and $\eta$ represents the parameters of the conditional emission distributions given $z_t$.

Last time, we assumed that all model parameters $\theta = (\pi, A, \eta)$ were known and developed recursive schemes for efficient probabilistic inference in this model. Specifically, we developed $\alpha$ and $\beta$ recursions to compute the single-state conditional probabilities
\[
p(z_t \mid x) = \frac{p(x_{0:t}, z_t)\, p(x_{t+1:T} \mid z_t)}{p(x)} = \frac{\alpha(z_t)\,\beta(z_t)}{p(x)}
\]
for all timepoints $t$ in $O(k^2 T)$ total time. While this permits us to make inferences about hidden states in isolation, it is not sufficient for making joint inferences about several hidden states. For example, since the hidden states are dependent conditioned on $x$, typically $p(z_t, z_{t-1} \mid x) \neq p(z_t \mid x)\, p(z_{t-1} \mid x)$. In this lecture, we will learn to leverage our $\alpha$-$\beta$ recursions to compute these co-occurrence probabilities and other inferential quantities of interest.

5.1.2 Calculating co-occurrence probabilities

Let us try to calculate the co-occurrence probability $p(z_t, z_{t+1} \mid x)$ using Bayes' rule and knowledge of the HMM conditional independence structure:
\[
p(z_t, z_{t+1} \mid x)
= \frac{p(x \mid z_t, z_{t+1})\, p(z_{t+1}, z_t)}{p(x)}
= \frac{p(x \mid z_t, z_{t+1})\, p(z_{t+1} \mid z_t)\, p(z_t)}{p(x)}
= \frac{p(x_{0:t} \mid z_t)\, p(x_{t+1} \mid z_{t+1})\, p(x_{t+2:T} \mid z_{t+1})\, a_{z_t, z_{t+1}}\, p(z_t)}{p(x)}.
\]
In the final equation, we have broken $p(x \mid z_t, z_{t+1})$ into $p(x_{0:t} \mid z_t)$ (the probability of the past and present observations given the present state), $p(x_{t+1} \mid z_{t+1})$ (the probability of the next observation given the next state), and $p(x_{t+2:T} \mid z_{t+1})$ (the probability of all future observations given the next state). We notice then that $p(x_{0:t} \mid z_t)\, p(z_t) = p(x_{0:t}, z_t) = \alpha(z_t)$ and $p(x_{t+2:T} \mid z_{t+1}) = \beta(z_{t+1})$. Therefore, we can compute the co-occurrence probabilities
\[
p(z_t, z_{t+1} \mid x) = \frac{\alpha(z_t)\, \beta(z_{t+1})\, p(x_{t+1} \mid z_{t+1})\, a_{z_t, z_{t+1}}}{p(x)}
\]
using only the results of our $\alpha$-$\beta$ recursion and the known transition and emission probabilities. This follows as, for any $t$, $p(x) = \sum_{z_t} \alpha(z_t)\, \beta(z_t)$. A derivation of the conditional distribution of any subsequence of $z$ given all observations $x$ proceeds similarly.
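To make the $\alpha$-$\beta$ recursions and the resulting single-state and pairwise conditionals concrete, here is a minimal NumPy sketch. It assumes a discrete emission model with a $k \times m$ emission matrix $B$ (so $p(x_t = v \mid z_t = j) = B[j, v]$), and all names and parameter values are illustrative choices; the unscaled recursions mirror the definitions above, though in practice one would rescale or work in log space to avoid underflow on long sequences.

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """Unscaled alpha-beta recursions for a discrete-emission HMM.

    x  : length-(T+1) array of observed symbols in {0, ..., m-1}
    pi : (k,)   initial state probabilities
    A  : (k, k) transition matrix, A[j, l] = p(z_{t+1}=l | z_t=j)
    B  : (k, m) emission matrix,   B[j, v] = p(x_t=v | z_t=j)
    """
    T1, k = len(x), len(pi)
    alpha = np.zeros((T1, k))   # alpha[t, j] = p(x_{0:t}, z_t=j)
    beta = np.zeros((T1, k))    # beta[t, j]  = p(x_{t+1:T} | z_t=j)

    alpha[0] = pi * B[:, x[0]]
    for t in range(T1 - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, x[t + 1]]

    beta[-1] = 1.0
    for t in range(T1 - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

    p_x = alpha[-1].sum()                  # p(x) = sum_{z_T} alpha(z_T)
    gamma = alpha * beta / p_x             # gamma[t, j] = p(z_t=j | x)
    # xi[t, j, l] = alpha_t(j) a_{jl} p(x_{t+1} | l) beta_{t+1}(l) / p(x)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, x[1:]].T * beta[1:])[:, None, :]) / p_x
    return alpha, beta, gamma, xi, p_x

# Tiny usage example with made-up parameters (k = 2 states, m = 2 symbols).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
x = np.array([0, 0, 1, 1, 0])
_, _, gamma, xi, p_x = forward_backward(x, pi, A, B)
print(gamma.sum(axis=1))   # each row sums to 1
print(xi[0].sum())         # each pairwise table sums to 1
```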

5.1.3 Soft inference for the full hidden sequence z

Suppose that our object of inference was instead the entire sequence of hidden states. Interestingly, calculation of $p(z \mid x)$ is even more straightforward. Observe that, by Bayes' rule,
\[
p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}
= \frac{\pi_{z_0} \prod_{t=0}^{T-1} a_{z_t, z_{t+1}} \prod_{t=0}^{T} p(x_t \mid z_t)}{p(x)}.
\]
Since $p(x)$ is a function of $\alpha$ and $\beta$, this expression depends only on known quantities.
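As a small worked companion to this formula, the sketch below evaluates $\log p(z, x)$ for a fixed candidate sequence $z$; subtracting $\log p(x)$ (available from the forward-backward sketch above as the log of alpha[-1].sum()) then gives $\log p(z \mid x)$. The discrete emission matrix $B$ and all parameter values are the same illustrative choices as before.

```python
import numpy as np

def log_joint(z, x, pi, A, B):
    """log p(z, x) = log pi_{z_0} + sum_t log a_{z_t, z_{t+1}} + sum_t log p(x_t | z_t)."""
    lp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for t in range(len(x) - 1):
        lp += np.log(A[z[t], z[t + 1]]) + np.log(B[z[t + 1], x[t + 1]])
    return lp

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
x = np.array([0, 0, 1, 1, 0])
z = np.array([0, 0, 1, 1, 0])   # a candidate hidden sequence
print(log_joint(z, x, pi, A, B))
```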

5.1.4 Hard inference for z via the Viterbi algorithm

Thus far, we have discussed soft inference, computing the conditional probabilities of hidden states, but often we are interested in hard assignments that associate each datapoint with a single state. In particular, as in the mixture modeling setting, we would like to find the hidden sequence that is most likely given the observations $x$. However, unlike in the mixture model setting where hidden states were independent, the most probable sequence of states $\operatorname{argmax}_z p(z \mid x)$ is not in general equal to the sequence of most probable individual states $(\operatorname{argmax}_{z_t} p(z_t \mid x))_{t=0}^{T}$. The distinction can be illustrated in the handwriting recognition setting: an ambiguous set of characters considered in isolation may seem most likely to correspond to the word "the" but, in the context of the surrounding sentence "I put on a ___ today," be more sensibly decoded as "tie." Thus, we must account for the dependence. Since we know how to compute $p(z \mid x)$, we could use brute force and consider all $k^T$ possible sequences $z$, but that quickly becomes intractable. Instead, we will compute $\max_z p(z \mid x)$ and its maximizer in an efficient recursive manner analogous to our $\alpha$ recursion. Note that since $p(z \mid x) = p(z, x)/p(x)$ and $p(x)$ does not depend on the specific sequence $z$, the maximizer of the conditional probability is the same as the maximizer of the joint probability $p(z, x)$ (over $z$), and so we will focus on finding $\max_z p(z, x)$ and its maximizer using a recursive procedure called the Viterbi algorithm. Significantly, this algorithm is identical to the forward-pass $\alpha$ recursion except that we replace summation over states with maximization over states.

The Viterbi algorithm

We begin by defining our targets for recursion
\[
v(z_t) = \max_{z_{0:t-1}} p(z_{0:t}, x_{0:t}) = \max_{z_{0:t-1}} p(z_{0:t-1}, x_{0:t}, z_t).
\]
Note the similarity to $\alpha(z_t)$: instead of summing over past states that may have led to the present hidden state $z_t$ and observation $x_t$, we are maximizing over them. Note moreover that
\[
\max_{z_T} v(z_T) = \max_{z_T} \max_{z_{0:T-1}} p(z_{0:T-1}, x_{0:T}, z_T) = \max_{z} p(z, x),
\]
so if we can compute $v(z_T)$ efficiently, we can also compute our target maximum probability. The Viterbi algorithm accomplishes this computation via a recursive forward pass through time:
\[
\begin{aligned}
v(z_0) &= p(x_0, z_0) = p(z_0)\, p(x_0 \mid z_0) && \text{(Base case)} \\
v(z_{t+1}) &= \max_{z_{0:t}} p(z_{0:t}, x_{0:t+1}, z_{t+1})
 = \max_{z_t} \Big( \max_{z_{0:t-1}} p(z_{0:t-1}, x_{0:t}, z_t) \Big)\, p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid z_t) \\
 &= \max_{z_t} v(z_t)\, a_{z_t, z_{t+1}}\, p(x_{t+1} \mid z_{t+1}) && \text{(Recursion for all } t \geq 0\text{)} \\
\max_z p(z, x) &= \max_{z_T} v(z_T) && \text{(Final step)}.
\end{aligned}
\]
The expression for $v(z_0)$ is just the joint probability of the initial state and initial observation occurring together; there are no prior states over which to maximize. For the derivation of $v(z_{t+1})$, we can break up the max over all past states into a max over the proximal state $z_t$ and a max over the more distant past states. These distant past states ($z_{0:t-1}$) are accounted for in $v(z_t)$, which we calculated in the prior recursive step. Altogether, these establish the base case and the recursive step. As with the $\alpha$ recursion, calculating $v(z_t)$ for all $t$ takes $O(k^2 T)$ work.

To recover the maximizing sequence $\operatorname{argmax}_z p(z, x)$ as well, it suffices to store pointers to maximizing states in each recursive step of the Viterbi algorithm. That is, the recursive computation of $v(z_{t+1})$ involves only a maximization over $z_t$, so one can keep track of the maximizer $m(z_{t+1}) = \operatorname{argmax}_{z_t} v(z_t)\, a_{z_t, z_{t+1}}$. Once the Viterbi computation is complete, one can follow these pointers backwards from $z_T^{\ast} = \operatorname{argmax}_{z_T} v(z_T)$ to obtain the entire sequence that maximizes $p(z, x)$ and hence $p(z \mid x)$.
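As a concrete companion to these recursions, here is a minimal Viterbi sketch in NumPy. It works in log space for numerical stability (a small deviation from the unscaled recursions above) and again assumes the illustrative discrete emission matrix $B$ from the earlier forward-backward sketch.

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most probable state sequence argmax_z p(z, x) for a discrete-emission HMM.

    Works in log space; v[t, j] = max_{z_{0:t-1}} log p(z_{0:t-1}, x_{0:t}, z_t = j).
    """
    T1, k = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    v = np.zeros((T1, k))
    back = np.zeros((T1, k), dtype=int)   # back[t+1, l] = maximizing z_t for state l

    v[0] = np.log(pi) + logB[:, x[0]]     # base case: log p(x_0, z_0)
    for t in range(T1 - 1):
        scores = v[t][:, None] + logA     # scores[j, l] = v_t(j) + log a_{jl}
        back[t + 1] = scores.argmax(axis=0)
        v[t + 1] = scores.max(axis=0) + logB[:, x[t + 1]]

    # Follow the back-pointers from the maximizing final state.
    z = np.zeros(T1, dtype=int)
    z[-1] = v[-1].argmax()
    for t in range(T1 - 2, -1, -1):
        z[t] = back[t + 1, z[t + 1]]
    return z, v[-1].max()                 # hard sequence and max_z log p(z, x)

# Usage with the same made-up parameters as before.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
x = np.array([0, 0, 1, 1, 0])
print(viterbi(x, pi, A, B))
```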

5.1.5 EM for Hidden Markov Models

Our discussion of HMMs so far has assumed that the parameters $\theta = (\pi, A, \eta)$ are known, but, typically, we do not know the model parameters in advance. As usual (and as is most often done in practice), we will turn to EM to learn model parameters that approximately maximize the likelihood of our observations.[1] Along the way, we will see that many of the inference routines we have derived will be useful for implementing the EM algorithm.

[1] In the HMM setting, the EM algorithm is also called the Baum-Welch algorithm.

Note that we defined our HMMs in terms of generic emission distributions, and the exact M-step updates will depend on the emission distributions chosen. To provide a concrete example of a complete EM algorithm, we will adopt a Gaussian emission model,
\[
p(x_t \mid z_t = j; \eta) = \phi(x_t; \mu_j, \Sigma_j).
\]

Complete log likelihood

As usual, we begin our EM derivation by writing down the complete log likelihood. Let $\theta = (\pi, A, \eta)$ denote all the parameters of the model. We have
\[
\begin{aligned}
\ell_c(\theta) = \log p(x, z; \theta)
&= \log \pi_{z_0} + \sum_{t=0}^{T-1} \log a_{z_t z_{t+1}} + \sum_{t=0}^{T} \log \phi(x_t; \mu_{z_t}, \Sigma_{z_t}) \\
&= \sum_{j=1}^{k} \mathbb{I}[z_0 = j] \log \pi_j
 + \sum_{t=0}^{T-1} \sum_{j=1}^{k} \sum_{l=1}^{k} \mathbb{I}[z_t = j, z_{t+1} = l] \log a_{jl}
 + \sum_{t=0}^{T} \sum_{j=1}^{k} \mathbb{I}[z_t = j] \log \phi(x_t; \mu_j, \Sigma_j).
\end{aligned}
\]

E-Step

In the E-Step, we compute the expected complete log likelihood under the distribution $z \sim p(z \mid x; \theta)$. We can see from the form of the complete log likelihood that it is sufficient to compute two types of statistics:

1. $\gamma_t(j) = \mathbb{E}[\mathbb{I}[z_t = j] \mid x; \theta] = p(z_t = j \mid x; \theta)$ for each $t \in \{0, \ldots, T\}$ and $j \in \{1, \ldots, k\}$;
2. $\xi_t(j, l) = \mathbb{E}[\mathbb{I}[z_t = j, z_{t+1} = l] \mid x; \theta] = p(z_t = j, z_{t+1} = l \mid x; \theta)$ for each $t \in \{0, \ldots, T-1\}$ and $(j, l) \in \{1, \ldots, k\}^2$.

Fortunately, we have already learned how to compute these quantities using the $\alpha$ and $\beta$ recursions. This completes the E-Step.

M-Step

According to our E-Step computations, the expected complete log likelihood takes the form
\[
\mathbb{E}[\ell_c(\theta)]
= \sum_{j=1}^{k} \gamma_0(j) \log \pi_j
+ \sum_{t=0}^{T-1} \sum_{j=1}^{k} \sum_{l=1}^{k} \xi_t(j, l) \log a_{jl}
+ \sum_{t=0}^{T} \sum_{j=1}^{k} \gamma_t(j) \log \phi(x_t; \mu_j, \Sigma_j).
\]
In the M-Step, we maximize the expected complete log likelihood with respect to $\theta$. As it turns out, there are closed-form update formulas:
\[
\pi_j = \gamma_0(j), \qquad
a_{jl} = \frac{\sum_{t=0}^{T-1} \xi_t(j, l)}{\sum_{t=0}^{T-1} \gamma_t(j)}.
\]
The $\mu_j$ and $\Sigma_j$ updates, as you might expect, are exactly the same as in the GMM case, with $\gamma_t(j)$ in place of $\tau_{ij}$. The formulas for $\pi_j$ and $a_{jl}$ are also intuitive. For $\pi$, $\gamma_0(j)$ is just the conditional probability that $z_0 = j$. For $A$, the numerator $\sum_{t=0}^{T-1} \xi_t(j, l)$ is the (conditionally) expected number of transitions from state $j$ to state $l$, while the denominator $\sum_{t=0}^{T-1} \gamma_t(j)$ is the (conditionally) expected number of times the chain is in hidden state $j$ before time $T$.
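To tie the E-step back to the earlier recursions, here is a hedged sketch of one EM (Baum-Welch) iteration. It assumes the forward_backward function from the first sketch is in scope to produce $\gamma$ and $\xi$, and it keeps that sketch's discrete emission matrix $B$ rather than the Gaussian emissions of the notes, so the emission update below replaces the $\mu_j, \Sigma_j$ formulas with the analogous discrete-emission maximum-likelihood update; all names are illustrative.

```python
import numpy as np
# Assumes forward_backward(x, pi, A, B) from the earlier sketch is available.

def baum_welch_step(x, pi, A, B):
    """One EM iteration for a discrete-emission HMM (a stand-in for the
    Gaussian-emission M-step discussed in the notes).

    E-step: gamma[t, j] = p(z_t=j | x), xi[t, j, l] = p(z_t=j, z_{t+1}=l | x).
    M-step: pi_j = gamma_0(j),
            a_jl = sum_t xi_t(j, l) / sum_t gamma_t(j)      (t = 0, ..., T-1),
            b_jv = sum_{t: x_t = v} gamma_t(j) / sum_t gamma_t(j).
    """
    _, _, gamma, xi, _ = forward_backward(x, pi, A, B)      # E-step

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):
        new_B[:, v] = gamma[x == v].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

# Usage: iterate the updates from an initial guess; log p(x) should not decrease.
pi = np.array([0.6, 0.4]); A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
x = np.array([0, 0, 1, 1, 0, 1, 0, 0])
for _ in range(5):
    pi, A, B = baum_welch_step(x, pi, A, B)
    print(np.log(forward_backward(x, pi, A, B)[4]))
```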

5.1.6 Beyond HMMs

HMMs are widely used and widely applicable, but they are still limited to a specific form of dependence structure. Fortunately, efficient generalizations of the $\alpha$-$\beta$ recursions (known as belief propagation or the sum-product algorithm) exist for carrying out probabilistic inference in models with more general tree-like dependence structures. For many dependence structures (e.g., those with cyclic dependencies), exact probabilistic inference is computationally intractable. However, approximate inference can be carried out efficiently by repeatedly applying related loopy belief propagation procedures.

5.2 Beyond Flat Clustering

Thus far in the course we have considered only flat or structureless approaches to clustering. Each method has been designed to capture similarity within a cluster, but none has modeled the relationships that may exist among clusters. Moreover, these methods all share the often undesirable feature that clusters may change arbitrarily as the number of clusters $k$ varies. That is, it is not necessarily the case that the clusters produced when $k = 4$ are subsets of those produced when $k = 3$.

Hierarchical clustering techniques address these issues by producing a nested series of clusterings that reflect the similarity between clusters. The most ubiquitous versions of hierarchical clustering are model-free and depend on a user-specified dissimilarity measure $d$ between any two datapoints $x_i$ and $x_j$. Common choices include the Euclidean distance $\|x_i - x_j\|_2$ and negative Pearson correlation, but any dissimilarity measure is supported. The measure is used by the hierarchical clustering algorithm to induce dissimilarity measures between and within clusters. As with k-medoids, the standard hierarchical clustering algorithms require access only to the matrix of pairwise datapoint dissimilarities $d_{ij} = d(x_i, x_j)$; the original data need not be retained.

The standard output of a hierarchical clustering run is a dendrogram. A dendrogram is a binary tree in which each node represents a cluster. Each leaf (terminal) node is a cluster containing a single datapoint, each parent node represents the cluster formed by merging its two daughter clusters, and the root node represents a single cluster containing the entire dataset. One useful characteristic of a dendrogram is that the height of each node represents the dissimilarity between its two daughter clusters; all terminal nodes lie at height 0 because they have no daughter nodes. This yields an interpretable visualization of the clustering sequence. Moreover, a user can slice a dendrogram at any intercluster dissimilarity level to obtain a single clustering, with the number of clusters ranging from 1 (at the root node level) to $n$, the number of datapoints (at the zero-dissimilarity level).

Remark. The dendrogram height interpretation is only appropriate when the hierarchical clustering algorithm maintains a certain monotonicity property ensuring that the dissimilarity associated with a parent node is never smaller than the dissimilarity associated with either of its daughter nodes. However, this property is maintained by the most widely used hierarchical clustering approaches.
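To make the dendrogram construction and slicing concrete, here is a small sketch using SciPy's hierarchical clustering routines on a toy pairwise Euclidean dissimilarity matrix. The data, the complete-linkage choice, and the cut height are illustrative assumptions, not recommendations from the notes; other dissimilarities or linkages slot in the same way.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups in the plane (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(5, 2)),
               rng.normal(3, 0.5, size=(5, 2))])

# Pairwise dissimilarities d_ij; only this matrix is needed, not X itself.
d = pdist(X, metric="euclidean")

# Bottom-up (agglomerative) clustering with complete linkage; Z encodes the
# dendrogram: each row records the two clusters merged and the dissimilarity
# at which they merge (the height of the corresponding parent node).
Z = linkage(d, method="complete")

# "Slicing" the dendrogram at a dissimilarity threshold yields a flat clustering.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.
```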

There are two principal hierarchical clustering paradigms: agglomerative (bottom-up, starting from individual datapoints and merging similar clusters) and divisive (top-down, starting with the entire dataset and gradually dividing it into dissimilar clusters). In the next lecture, these two paradigms will be discussed in detail.