Collapsed Variational Bayesian Inference for Hidden Markov Models


Collapsed Variational Bayesian Inference for Hidden Markov Models
Pengyu Wang, Phil Blunsom
Department of Computer Science, University of Oxford
International Conference on Artificial Intelligence and Statistics (AISTATS) 2013
Presented by Yan Kaganovsky, Duke University
1 / 20

Hidden Markov Models (HMM)
Sequence of observations x = x_1, x_2, ..., x_T (e.g., the index t is time).
Sequence of hidden states z = z_0, z_1, z_2, ..., z_T (z_0 = initial state).
x_t ∈ {1, ..., W}, z_t ∈ {1, ..., K}.
Parameters θ = {A, B}, with A = transition matrix and B = emission matrix:
A_{k,k'} = probability of transitioning into state k' from state k
B_{k,w} = probability of emitting observation w from state k
[Figure: a two-state HMM (S1, S2) with binary emissions 0/1.]
2 / 20

Example of the HMM Considered in the Paper
In part-of-speech (POS) tagging, the observations are the words in a sentence and the index t is simply the location of the word in the sentence, e.g., "the dog saw a cat" with x_1 = the, x_2 = dog, x_3 = saw, x_4 = a, x_5 = cat. The hidden state of a word represents its function in the sentence: determiner (D), noun (N), verb (V), etc. In this example, z_1 = D, z_2 = N, z_3 = V, z_4 = D, z_5 = N. For unsupervised tagging, a dictionary is given which prescribes the possible tags for each word.
3 / 20

Motivation/Background
Variational Bayes (VB) (MacKay, Tech. Report, 1997)
- Independence between the parameters and the state variables is assumed, i.e., p(z, θ | x) ≈ q(z | x) q(θ | x).
- This assumption leads to efficient iterative solutions, but may be a poor approximation.
- Can be competitive with GS in terms of performance for large data sets, and is faster. [1]
Gibbs sampling (GS)
- Draws samples from the true posterior.
- How do we assess convergence? What is the burn-in? What is the distance between independent draws?
- Slow for large data sets.
- For HMMs, the updates are strongly coupled, which leads to poor mixing.
[1] "A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers" by Gao and Johnson (EMNLP 2008).
4 / 20

Motivation/Background
Collapsed Gibbs sampling (CGS)
- Samples from the true posterior after integrating out θ.
- Reduces the number of computations compared to GS.
- Converges faster than GS for some models (e.g., LDA [2]), since marginalization induces dependencies that are spread over many latent variables, so the dependency between any two variables should be weaker.
- For HMMs, however, CGS is slower to converge than GS: collapsing couples the state at any time to the states at both the following and the previous times.
[2] Goldwater and Griffiths, ACL, 2007.
5 / 20

Motivation/Background
Collapsed Variational Bayes (CVB)
- Applies variational inference in the same collapsed space as CGS.
- Faster convergence than CGS is expected.
- Weaker approximations to the posterior than in standard VB (fewer independence assumptions).
- Derived for LDA (Teh et al., 2007), but that work is not immediately applicable to HMMs.
6 / 20

Overview
Two new collapsed variational Bayes (CVB) algorithms are presented:
- Algorithm I models exactly the dependence between the parameters and the latent variables, but assumes mutual independence between the latent variables in the collapsed space.
- Algorithm II relaxes the latter assumption to mutual independence between groups of latent variables, but introduces additional assumptions (weaker than those in Algorithm I).
- Comparison to the collapsed Gibbs sampler, standard VB, and the celebrated Baum-Welch algorithm.
- Performance is demonstrated on part-of-speech (POS) tagging.
7 / 20

Generative Model for the HMM

p(x, z, θ | α, β) = p(A | α) p(B | β) ∏_{t=0}^{T-1} p(z_{t+1} | z_t, A) p(x_{t+1} | z_{t+1}, B)

with
(z_{t+1} | z_t = k, A) ~ Cat(K, A_{k,·}),  t = 0, 1, ..., T-1
(x_t | z_t = k, B) ~ Cat(W, B_{k,·}),  t = 1, 2, ..., T
A_{k,·} | α ~ Dir(α),  B_{k,·} | β ~ Dir(β)

Assumptions:
- Given θ, a state depends only on the state one time step earlier, and the transition probabilities do not depend on t (first-order HMM, i.e., a bi-gram model).
- Given θ, the output at each time step depends only on the state at that time step.
8 / 20
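
To make the generative process above concrete, here is a minimal sampling sketch in Python/NumPy. It is not taken from the paper: the symmetric Dirichlet concentrations, fixing z_0 to state 0, and all function and variable names are assumptions made purely for illustration.

```python
import numpy as np

def sample_hmm(T, K, W, alpha, beta, rng=None):
    """Draw parameters and one length-T sequence from the generative model above.

    Symmetric Dirichlet priors: each row of A ~ Dir(alpha), each row of B ~ Dir(beta).
    z_0 is fixed to state 0 here purely for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    A = rng.dirichlet(np.full(K, alpha), size=K)   # A[k] = p(next state | current state k)
    B = rng.dirichlet(np.full(W, beta), size=K)    # B[k] = p(word | state k)
    z = np.zeros(T + 1, dtype=int)                 # z[0] is the initial state
    x = np.zeros(T, dtype=int)                     # x[t-1] stores observation x_t
    for t in range(T):
        z[t + 1] = rng.choice(K, p=A[z[t]])        # z_{t+1} | z_t, A
        x[t] = rng.choice(W, p=B[z[t + 1]])        # x_{t+1} | z_{t+1}, B
    return x, z, A, B
```

For example, sample_hmm(T=20, K=5, W=50, alpha=0.1, beta=0.1) returns a word-id sequence, the corresponding hidden states, and the sampled parameter matrices.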

Collapsed Gibbs Sampling
Produces a hidden state sequence z sampled from

p(z | x, α, β) ∝ ∫ p(x, z | θ) p(θ | α, β) dθ

Dirichlet priors are conjugate to discrete distributions, so the integration is available in closed form:

p(z_t = k | x, z_{¬t}, α, β) ∝
  (C^{¬t}_{k,x_t} + β) / (C^{¬t}_{k,·} + Wβ)
  × (C^{¬t}_{z_{t-1},k} + α) / (C^{¬t}_{z_{t-1},·} + Kα)
  × (C^{¬t}_{k,z_{t+1}} + α + δ(z_{t-1} = k = z_{t+1})) / (C^{¬t}_{k,·} + Kα + δ(z_{t-1} = k))

where C^{¬t}_{k,z_{t+1}} is the number of occurrences of state k immediately before state z_{t+1} over all time steps except t, and C^{¬t}_{k,·} is the number of occurrences of state k excluding time t.
9 / 20
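
For intuition, the sketch below performs one collapsed-Gibbs sweep over a single sequence using the conditional above, with transition and emission counts kept in two arrays. The count bookkeeping and the boundary handling are simplifications assumed for this sketch; they are not details from the slide or the paper.

```python
import numpy as np

def cgs_sweep(x, z, trans, emit, alpha, beta, K, W, rng):
    """One collapsed-Gibbs sweep over one sequence (boundary handling simplified).

    x     : word ids x_1..x_T stored as x[0..T-1]; z has length T+1, with z[0] fixed.
    trans : (K, K) transition counts C_{k,k'} over the corpus, including z_0 -> z_1.
    emit  : (K, W) emission counts C_{k,w}; both arrays include the current assignments.
    """
    T = len(x)
    for t in range(1, T + 1):
        prev_s = z[t - 1]
        nxt = z[t + 1] if t < T else None
        # remove all counts that involve z_t
        trans[prev_s, z[t]] -= 1
        emit[z[t], x[t - 1]] -= 1
        if nxt is not None:
            trans[z[t], nxt] -= 1
        # conditional p(z_t = k | x, z_{-t}) from the slide, for all candidate states k
        p = ((emit[:, x[t - 1]] + beta) / (emit.sum(axis=1) + W * beta)
             * (trans[prev_s, :] + alpha) / (trans[prev_s, :].sum() + K * alpha))
        if nxt is not None:
            d_both = ((np.arange(K) == prev_s) & (prev_s == nxt)).astype(float)
            d_prev = (np.arange(K) == prev_s).astype(float)
            p *= (trans[:, nxt] + alpha + d_both) / (trans.sum(axis=1) + K * alpha + d_prev)
        z[t] = rng.choice(K, p=p / p.sum())
        # add the counts back in with the new assignment
        trans[prev_s, z[t]] += 1
        emit[z[t], x[t - 1]] += 1
        if nxt is not None:
            trans[z[t], nxt] += 1
```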

Collapsed VB for I.I.D. Hidden Variables
Formally, CVB models the dependencies between the parameters and the hidden variables in an exact fashion:

p(z, θ | x) = q(θ | z, x) q(z | x)

The mean-field method requires independent variables, and thus the induced weak dependencies among the hidden variables are ignored:

q(z | x) ≈ ∏_{t=1}^{T} q(z_t | x)

Minimizing the KL divergence, q(θ | z, x) is equal to the true posterior, and the update for each q(z_t) in q(z) is

q(z_t | x) ∝ exp(E_{q(z_{¬t})}[log p(z, x)]) ∝ exp(E_{q(z_{¬t})}[log p(z_t | x, z_{¬t})])
10 / 20

Algorithm 1
- The first CVB algorithm for HMMs follows the theory for i.i.d. models.
- It assumes that the hidden variables are mutually independent in the collapsed space.
- This is a strong assumption for an HMM.
- The assumption in standard VB (independence between the parameters and the hidden variables) is also strong.
- It is not immediately clear which assumption is stronger.
11 / 20

Algorithm 2
- The strong independence assumption in Algorithm 1 has the potential to lead to inaccurate results.
- The mean-field method requires partitioning the latent variables into disjoint, independent groups.
- In natural language processing, a common feature is many short sequences (i.e., sentences), where each sequence is drawn i.i.d. given the same set of parameters.
12 / 20

Algorithm 2
Let (x_i, z_i) denote the i-th sequence of observations and hidden states, and let I be the number of sequences.

q(z, θ | x) = q(θ | z, x) q(z | x),  with q(θ | z, x) kept exact.

As with any i.i.d. model,

q(z | x) ≈ ∏_{i=1}^{I} q(z_i | x)

q(z_i | x) ∝ exp(E_{q(z_{¬i})}[log p(z_i | x, z_{¬i})]) ∝ exp(E_{q(z_{¬i})}[log p(x_i, z_i | x_{¬i}, z_{¬i})])

The last expression is given in closed form in the paper.
13 / 20

Algorithm 2
Exact computation of p(x_i, z_i | x_{¬i}, z_{¬i}) involves expensive functions. It is approximated by assuming that the hidden variables within a sequence exhibit only first-order Markov dependencies and output independence:

p(x_i, z_i | x_{¬i}, z_{¬i}) ≈ ∏_{t=0}^{T-1} p(z_{i,t+1} | z_{i,t}, x_{¬i}, z_{¬i}) p(x_{i,t+1} | z_{i,t+1}, x_{¬i}, z_{¬i})

This approximation ignores the contributions from other parts of the same sequence to the global counts. Compared with the contributions from all the other sequences, the impact of these local counts is assumed to be small.
14 / 20
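
Under this approximation, updating q(z_i) for a sequence reduces to standard HMM marginal inference with surrogate transition and emission probabilities built from the (expected) counts of the other sequences; the paper gives their closed form. Purely as an illustration, here is a generic scaled forward-backward routine that would compute the per-position posteriors once such surrogate matrices A_hat, B_hat and an initial distribution pi_hat are available (those inputs, and all names, are assumptions of this sketch, not quantities defined in the paper).

```python
import numpy as np

def forward_backward(obs, A_hat, B_hat, pi_hat):
    """Posterior state marginals for one sequence under fixed surrogate parameters.

    obs    : length-T int array of word ids for this sequence
    A_hat  : (K, K) surrogate transition probabilities (rows sum to 1)
    B_hat  : (K, W) surrogate emission probabilities (rows sum to 1)
    pi_hat : (K,)  surrogate initial-state distribution
    Returns gamma, where gamma[t, k] approximates q(z_{i,t+1} = k).
    """
    T, K = len(obs), A_hat.shape[0]
    fwd = np.zeros((T, K))     # scaled forward messages
    bwd = np.zeros((T, K))     # scaled backward messages
    scale = np.zeros(T)

    fwd[0] = pi_hat * B_hat[:, obs[0]]
    scale[0] = fwd[0].sum()
    fwd[0] /= scale[0]
    for t in range(1, T):
        fwd[t] = (fwd[t - 1] @ A_hat) * B_hat[:, obs[t]]
        scale[t] = fwd[t].sum()
        fwd[t] /= scale[t]

    bwd[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        bwd[t] = (A_hat @ (B_hat[:, obs[t + 1]] * bwd[t + 1])) / scale[t + 1]

    gamma = fwd * bwd
    return gamma / gamma.sum(axis=1, keepdims=True)
```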

Results - Varying Corpus Size
Validation of the above inference algorithms for HMMs on part-of-speech tagging.
- Unsupervised learning: given a raw corpus and a tag dictionary that defines the legal parts of speech for each word, tag each token in the corpus.
- The goal is to maximize accuracy against a reference tagged corpus.
- Simple bi-gram taggers are used, rather than a state-of-the-art tagger.
- The data set is the Wall Street Journal (WSJ) treebank.
- The tag dictionary is constructed by collecting all the tags found for each word type in the entire corpus (see the sketch below).
- Experiments are run for different corpus sizes, from 1K sentences to the entire treebank.
15 / 20
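
A minimal sketch of how such a complete tag dictionary could be collected. The corpus format (a list of sentences, each a list of (word, tag) pairs) and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def build_tag_dictionary(tagged_corpus):
    """Map each word type to the set of all tags it receives anywhere in the corpus."""
    tag_dict = defaultdict(set)
    for sentence in tagged_corpus:        # sentence = [(word, tag), ...]
        for word, tag in sentence:
            tag_dict[word].add(tag)
    return dict(tag_dict)
```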

Results - Varying Corpus Size 16 / 20

Results - Varying Dictionary Knowledge
- In practice it is not always possible to build a complete tag dictionary, especially for infrequent words, so the authors investigate the effect of reducing the dictionary information.
- Randomly select 1K unlabeled sentences from the treebank for training.
- A word is frequent if its tokens appear at least d times in the training corpus; otherwise it is infrequent.
- For frequent word types the standard tag dictionary is available, whereas for infrequent word types all tags are considered legal (see the sketch below).
17 / 20
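
A sketch of the frequency cutoff described above, under the same assumed data format; all_tags stands for the full tag set, and the names are hypothetical.

```python
from collections import Counter

def prune_dictionary(tag_dict, training_sentences, all_tags, d):
    """Restrict dictionary knowledge to words with at least d training tokens.

    Frequent words keep their dictionary entry; infrequent (or unseen) words
    are allowed every tag, as described on the slide.
    """
    counts = Counter(tok for sent in training_sentences for tok in sent)
    return {w: tag_dict[w] if counts[w] >= d and w in tag_dict else set(all_tags)
            for w in counts}
```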

Results - Varying Dictionary Knowledge 18 / 20

Results - Test Perplexity
- Without a tag dictionary the tag types are interchangeable, so there is a label identifiability issue: tagging results cannot be evaluated directly against the reference tagged corpus.
- Randomly withhold 10 percent of the sentences for testing and use the remaining 90 percent for training.
- The algorithms are evaluated by their per-token test perplexities on the withheld test set:

perplexity(x_test) = 2^( −Σ_i log_2 p(x_i) / Σ_i |x_i| )
19 / 20
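
A small helper corresponding to the formula above. The negative sign in the exponent follows the usual convention and is assumed here; how the per-sentence log probabilities log_2 p(x_i) are obtained depends on the inference algorithm.

```python
import numpy as np

def test_perplexity(log2_probs, lengths):
    """Per-token perplexity: 2 ** ( - sum_i log2 p(x_i) / sum_i |x_i| )."""
    return 2.0 ** (-np.sum(log2_probs) / np.sum(lengths))

# e.g., two held-out sentences with log2 p(x_i) = -120.5 and -88.0 and 30 + 22 tokens:
# test_perplexity([-120.5, -88.0], [30, 22])  ->  2 ** (208.5 / 52), about 16.1
```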

Results - Test Perplexity 20 / 20