Fast Approximate MAP Inference for Bayesian Nonparametrics

Fast Approximate MAP Inference for Bayesian Nonparametrics
Y. Raykov, A. Boukouvalas, M. A. Little
Department of Mathematics, Aston University
10th Conference on Bayesian Nonparametrics, 2015

Outline
1. Iterated Conditional Modes
2. Dirichlet Process Mixtures: MAP-DP; Experiments and Results
3. Infinite Hidden Markov Model (HDP-iHMM)

Iterated Conditional Modes (ICM)
For probabilistic graphical models (PGMs), ICM is a deterministic algorithm that maximizes the conditional distribution of each random variable in turn while holding all the others fixed.
- Finds an approximation to the MAP solution of the joint distribution over all the random variables in the PGM
- A cheap alternative to sampling approaches
- Exactly equivalent to simulated annealing at zero temperature
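As a concrete illustration (not taken from the slides), the coordinate-wise structure of ICM can be sketched as follows; the `conditional_argmax` callables are hypothetical placeholders standing in for whichever model is being fitted.

```python
import numpy as np

def icm(variables, conditional_argmax, max_iters=100):
    """Generic ICM sketch: repeatedly set each variable to the mode of its
    conditional distribution given the current values of all the others.

    `variables` is a dict of current values; `conditional_argmax[name]` is a
    (hypothetical) function returning argmax_v p(v | all other variables).
    """
    for _ in range(max_iters):
        changed = False
        for name in variables:
            new_value = conditional_argmax[name](variables)
            if not np.array_equal(np.asarray(new_value), np.asarray(variables[name])):
                variables[name] = new_value
                changed = True
        if not changed:   # a full sweep with no change: a local MAP point was reached
            break
    return variables
```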

Notation and model
Spherical Bayesian GMM:
$\mu_k \sim \mathcal{N}(\mu_0, \sigma_0 I)$, $\pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$, $z_1, \dots, z_N \sim \mathrm{Categorical}(\pi)$, $x_1, \dots, x_N \sim \mathcal{N}(\mu_{z_i}, \sigma I)$
Figure: Bayesian mixture model
Negative log joint likelihood:
$-\log p(X, Z, \mu, \pi) = \sum_{i=1}^{N} \sum_{k : z_i = k} \left[ \frac{\|x_i - \mu_k\|_2^2}{2\sigma^2} - \log \pi_k \right] - \sum_{k=1}^{K} P_k^0$
with prior and constant terms $P_k^0 = \log \alpha_k + \log p(\mu_k \mid \mu_0) + C$.

MAP-DP
MAP problem via Iterated Conditional Modes:
$\arg\min_{Z, \mu, \pi} \sum_{i=1}^{N} \sum_{k : z_i = k} \left[ \frac{\|x_i - \mu_k\|_2^2}{2\sigma^2} - \log \pi_k \right] - \sum_{k=1}^{K} P_k^0$
Compute the assignments:
$q_{i,k} = -\log \pi_k - \log p(x_i \mid \mu_k, \sigma)$, then $z_i = \arg\min_{k \in \{1, \dots, K\}} q_{i,k}$
Update the cluster means $\mu_1, \dots, \mu_K$ by taking the mode of their posterior.
Update the cluster weights from $\pi_k = \frac{N_k + \alpha_k - 1}{N}$ for $k = 1, \dots, K$.
Compare with K-means: $\arg\min_{Z, \mu} \sum_{i=1}^{N} \sum_{k : z_i = k} \|x_i - \mu_k\|_2^2$
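A minimal sketch of these ICM updates for the finite spherical model, assuming a fixed known $\sigma$ and a conjugate Gaussian prior on the means (so the posterior-mode update is the usual precision-weighted average); all names are illustrative, not the authors' reference code.

```python
import numpy as np

def map_icm_finite_gmm(X, K, alpha, mu0, sigma0, sigma, n_iters=50):
    """Sketch of ICM / MAP updates for the finite Bayesian spherical GMM above."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    rng = np.random.default_rng(0)
    z = rng.integers(K, size=N)                      # random initial assignments
    mu = X[rng.choice(N, K, replace=False)]          # initialise means at data points
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # Assignment step: q_{i,k} = -log pi_k - log N(x_i | mu_k, sigma^2 I)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        q = -np.log(pi)[None, :] + sq / (2.0 * sigma ** 2)
        z = q.argmin(axis=1)
        # Mean step: mode of the Gaussian posterior (precision-weighted average)
        for k in range(K):
            Xk = X[z == k]
            prec = 1.0 / sigma0 ** 2 + len(Xk) / sigma ** 2
            mu[k] = (mu0 / sigma0 ** 2 + Xk.sum(0) / sigma ** 2) / prec
        # Weight step: pi_k = (N_k + alpha_k - 1) / N, as on the slide
        Nk = np.bincount(z, minlength=K)
        pi = (Nk + alpha - 1.0) / N
        pi = np.clip(pi, 1e-12, None); pi /= pi.sum()  # guard against empty clusters
    return z, mu, pi
```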


MAP-DP: Underlying model
Fully collapsed DP mixture model:
$z_1, \dots, z_N \sim \mathrm{CRP}(\alpha, N)$
$x_i \sim F(\theta_{z_i})$ for all $i = 1, \dots, N$
Figure: Collapsed DP mixture model

MAP-DP for DP mixtures
Objective function:
$\arg\min_{Z, K^+} \left[ -\sum_{i=1}^{N} \sum_{k : z_i = k} \log p(x_i \mid \theta_{z_i}) - \sum_{k=1}^{K^+} \log \Gamma(N_k) - \sum_{k=1}^{K^+} P_k^0 \right]$
with prior term $P_k^0 = \log \alpha + \log p(\theta_k \mid \theta_0)$.
Keeping $N_{k,-i}$ and $\theta^{-i}$ updated, compute for each observation:
$q_{i,k} = -\log N_{k,-i} - \log p(x_i \mid \theta_k^{-i})$ for existing clusters $k = 1, \dots, K^+$
$q_{i, K^+ + 1} = -\log \alpha - \log p(x_i \mid \theta_0)$ for a new cluster
$z_i = \arg\min_{k \in \{1, \dots, K^+ + 1\}} q_{i,k}$
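A rough sketch of one MAP-DP assignment sweep for the special case of spherical Gaussian clusters with known variance, where the collapsed predictive is again Gaussian. The Gaussian choice and all names here are assumptions made for brevity; the slide's $F(\theta)$ is generic.

```python
import numpy as np

def neg_log_pred(x, Xk, mu0, sigma0, sigma):
    """-log p(x | other points in the cluster), spherical Gaussian likelihood
    with a Gaussian prior N(mu0, sigma0^2 I) on the cluster mean (assumed model)."""
    n = len(Xk)
    prec = 1.0 / sigma0 ** 2 + n / sigma ** 2
    m = (mu0 / sigma0 ** 2 + Xk.sum(0) / sigma ** 2) / prec
    v = sigma ** 2 + 1.0 / prec                    # predictive variance per dimension
    D = x.shape[0]
    return ((x - m) ** 2).sum() / (2 * v) + 0.5 * D * np.log(2 * np.pi * v)

def map_dp_sweep(X, z, alpha, mu0, sigma0, sigma):
    """One MAP-DP sweep: reassign each point to the cluster (or a new one) that
    minimizes q_{i,k} = -log N_{k,-i} - log p(x_i | cluster), or
    q_{i,K+1} = -log alpha - log p(x_i | prior). `z` is an integer numpy array."""
    N = len(X)
    for i in range(N):
        # existing clusters, excluding a cluster that would become empty without i
        labels = [k for k in np.unique(z) if k != z[i] or (z == z[i]).sum() > 1]
        costs, cands = [], []
        for k in labels:
            mask = (z == k); mask[i] = False
            costs.append(-np.log(mask.sum()) + neg_log_pred(X[i], X[mask], mu0, sigma0, sigma))
            cands.append(k)
        costs.append(-np.log(alpha) + neg_log_pred(X[i], X[:0], mu0, sigma0, sigma))
        cands.append(int(z.max()) + 1)             # open a new cluster
        z[i] = cands[int(np.argmin(costs))]
    return z
```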


Small-variance asymptotics: DP-means
Objective function:
$\arg\min_{Z, \mu, K} \sum_{i=1}^{N} \sum_{k : z_i = k} \|x_i - \mu_k\|_2^2 + \lambda K$
Compute for each observation:
$q_{i,k} = \|x_i - \mu_k\|_2^2$ for $k = 1, \dots, K$, and $q_{i, K+1} = \lambda$
$z_i = \arg\min_{k \in \{1, \dots, K+1\}} q_{i,k}$
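A compact sketch of the DP-means updates described above (the algorithm of Kulis and Jordan cited in the further reading; variable names are illustrative):

```python
import numpy as np

def dp_means(X, lam, n_iters=50):
    """DP-means sketch: assign each point to the nearest mean unless the squared
    distance exceeds lambda, in which case open a new cluster at that point."""
    mu = [X.mean(axis=0)]                          # start with a single cluster
    N = len(X)
    z = np.zeros(N, dtype=int)
    for _ in range(n_iters):
        for i in range(N):
            d = np.array([((X[i] - m) ** 2).sum() for m in mu])
            if d.min() > lam:
                mu.append(X[i].copy())             # the q_{i,K+1} = lambda option wins
                z[i] = len(mu) - 1
            else:
                z[i] = int(d.argmin())
        # Update means as cluster averages (the small-variance limit of the mean step)
        mu = [X[z == k].mean(axis=0) for k in range(len(mu)) if np.any(z == k)]
        # Relabel so cluster indices stay contiguous after dropping empty clusters
        kept = [k for k in range(z.max() + 1) if np.any(z == k)]
        z = np.array([kept.index(zi) for zi in z])
    return z, np.array(mu)
```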

Comparison of MAP-DP and DP-means
Similarities:
- Both provide approximately optimal clustering
- Both are fast and scalable
- Both are non-parametric
Advantages of MAP-DP over DP-means:
- Retains the reinforcement effect
- No degeneracy in the likelihood
- Rigorous way of choosing the concentration parameter $\alpha$
- The prior keeps its influence on the objective function
- Principled way to handle non-spherical and missing data

Figure: Association chart of ICM and SVA algorithms


Experiments and Results: Synthetic study with CRP mixture data
Sample cluster indicators: $z_1, \dots, z_N \sim \mathrm{CRP}(\alpha, N)$
Sample $K^+$ cluster parameters: $\{\mu_k, \Sigma_k\} \sim \mathrm{NW}(\theta_0)$
For each $k$, sample $N_k$ observations: $x_i \sim \mathcal{N}(\mu_k, \Sigma_k)$
Figure: Synthetically-generated CRP mixture data
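For illustration, CRP mixture data of this kind can be generated roughly as follows. As a simplifying assumption of this sketch, the cluster covariances are drawn as scaled identity matrices rather than from a Normal-Wishart prior, and all parameter values are placeholders.

```python
import numpy as np

def sample_crp(N, alpha, rng):
    """Sample cluster indicators z_1..z_N from a Chinese Restaurant Process."""
    z = np.zeros(N, dtype=int)
    counts = []
    for i in range(N):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                       # new table / cluster
        else:
            counts[k] += 1
        z[i] = k
    return z

def sample_crp_mixture(N=500, alpha=1.0, D=2, seed=0):
    rng = np.random.default_rng(seed)
    z = sample_crp(N, alpha, rng)
    K = z.max() + 1
    mu = rng.normal(0.0, 5.0, size=(K, D))         # cluster means (illustrative prior)
    sigma = rng.uniform(0.5, 1.5, size=K)          # spherical scales instead of NW draws
    X = mu[z] + sigma[z, None] * rng.normal(size=(N, D))
    return X, z
```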

Experiments and Results: Synthetic study
             Gibbs        MAP-DP      DP-means
NMI          0.81 (0.1)   0.82 (0.1)  0.68 (0.1)
Iterations   1395 (651)   10 (3)      18 (7)
ΔK           3.6 (3.0)    6.6 (2.9)   0.0

Experiments and Results: Case study on the Parkinson's Disease (PD) Data Organizing Center database
Aim of the study: exploring PD sub-types using the PD-DOC database.
Data from 527 patients and 285 features, with missing data.
Categorical, Poisson and Binomial data.

Experiments and Results: Case study results
Three main, equally-sized clusters, suggesting different PD sub-types.
Examples of features that separate the clusters:
Feature              Cluster 1   Cluster 2   Cluster 3
Sleep disturbance*   0.41        0.78        0.70
Right leg agility*   1.05        1.84        0.58
Risk of stroke       4%          15%         6%
* Ratio of affected to non-affected patients.

Infinite Hidden Markov Model (iHMM)
Each row of the transition matrix is a DP:
$p(x_t \mid z_{t-1}) = \sum_{z_t} \pi_{z_t, z_{t-1}} \, p(x_t \mid z_t)$
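One common way to realize "each row is a DP" is the truncated HDP construction: a global stick-breaking weight vector is shared across all rows, and each row is drawn from a Dirichlet distribution centred on it. The weak-limit sketch below is an assumption made here for illustration, not a construction the slides spell out.

```python
import numpy as np

def stick_breaking(gamma, L, rng):
    """Truncated GEM(gamma) stick-breaking weights of length L."""
    v = rng.beta(1.0, gamma, size=L)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta = v * remaining
    return beta / beta.sum()                       # renormalize the truncation

def ihmm_transition_matrix(L=20, gamma=3.0, alpha=5.0, seed=0):
    """Weak-limit HDP-iHMM transition matrix: every row pi_j ~ Dir(alpha * beta),
    so all rows share the same global set of states with weights beta."""
    rng = np.random.default_rng(seed)
    beta = stick_breaking(gamma, L, rng)
    Pi = rng.dirichlet(alpha * beta, size=L)       # one row per current state
    return beta, Pi
```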

Synthetic study: HMM with spherical Gaussian emissions
Sample 4000 data points from an HMM with spherical emissions $\mathcal{N}(\mu_1, \sigma I_3), \dots, \mathcal{N}(\mu_5, \sigma I_3)$, with 0.96 probability of self-transition and 0.01 probability for each of the remaining transitions.
             Gibbs   MAP-iHMM   SVA-iHMM
NMI          0.86    0.62       0.58
Iterations   1270    13         12
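A small sketch of how such a synthetic sequence can be generated; the specific emission means and σ below are arbitrary placeholders, since the slides fix only the transition structure and the number of points.

```python
import numpy as np

def sample_sticky_hmm(T=4000, K=5, D=3, sigma=1.0, self_p=0.96, seed=0):
    """Sample T observations from a K-state HMM with spherical Gaussian emissions
    N(mu_k, sigma^2 I_D), self-transition probability self_p, and the remaining
    probability mass split evenly over the other states."""
    rng = np.random.default_rng(seed)
    off_p = (1.0 - self_p) / (K - 1)
    A = np.full((K, K), off_p) + np.eye(K) * (self_p - off_p)   # rows sum to 1
    mu = rng.normal(0.0, 5.0, size=(K, D))         # placeholder emission means
    z = np.zeros(T, dtype=int)
    z[0] = rng.integers(K)
    for t in range(1, T):
        z[t] = rng.choice(K, p=A[z[t - 1]])
    X = mu[z] + sigma * rng.normal(size=(T, D))
    return X, z
```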

Summary
ICM sacrifices many of the Bayesian advantages of BNP models:
- It does not average over the uncertain variables
- It obtains only a point estimate of the joint posterior
- It underestimates the variance and fails to extract information from the tails of the true underlying distribution
Nevertheless, the suggested methods obtain a statistically principled approximate solution of the MAP problem with little computational effort. Results are easy to interpret and convergence to a local solution is guaranteed.
The MAP schemes suggest a way to fit complex BNP models to at least moderately big problems.
Applying ICM to the non-degenerate likelihood function preserves some of the essential properties of the model.

Appendix: For Further Reading
Relevant work:
- Simple approximate MAP inference for Dirichlet processes (Y. Raykov, A. Boukouvalas and M. A. Little) http://arxiv.org/abs/1411.0939
- Fast search for Dirichlet process mixture models (H. Daume III, 2007)
- Scaling the Indian Buffet Process via Submodular Maximization (C. Reed and Z. Ghahramani, 2013)
- Fast Bayesian Inference in Dirichlet Process Mixture Models (L. Wang and D. B. Dunson, 2011)
- Revisiting k-means: New Algorithms via Bayesian Nonparametrics (B. Kulis and M. I. Jordan, 2012)
- MAD-Bayes: MAP-based Asymptotic Derivations from Bayes (T. Broderick, B. Kulis and M. I. Jordan, 2013)