MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Tamara Broderick, Brian Kulis, Michael I. Jordan

Clusters vs. features

[Slides 1-3: motivating figures. Pictures 1-7 are first grouped into disjoint clusters by animal (cat, dog, mouse, lizard, sheep), then allowed to share overlapping features; many other latent structures are possible in data.]

How do we learn latent structure?

K-means
- Fast
- Can parallelize
- Straightforward
- But only works for K clusters

Nonparametric Bayes
- Modular (general latent structure)
- Flexible (K can grow as the data grow)
- Coherent treatment of uncertainty
- But... e.g., in Silicon Valley one can have petabytes of data, and practitioners turn to what runs

MAD-Bayes

Perspectives
- Bayesian nonparametrics assists the optimization-based inference community
  - New, modular, flexible, nonparametric objectives and regularizers
  - Alternative perspective: fast initialization

Inspiration
- Consider a finite Gaussian mixture model
- The steps of the EM algorithm limit to the steps of the K-means algorithm as the Gaussian variance is taken to 0 (see the sketch after this list)
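
To see why that limit holds (a brief sketch added here, not part of the slides): for a K-component Gaussian mixture with equal weights and shared covariance $\sigma^2 I$, the E-step responsibility of cluster $k$ for point $x_n$ is

$$r_{nk} = \frac{\exp\{-\|x_n - \mu_k\|^2 / (2\sigma^2)\}}{\sum_{j=1}^{K} \exp\{-\|x_n - \mu_j\|^2 / (2\sigma^2)\}}.$$

As $\sigma^2 \to 0$, $r_{nk} \to 1$ for the cluster with the nearest center and $\to 0$ for all others, so the E-step becomes the hard K-means assignment step; the M-step mean update then coincides with the K-means center update.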

MAD-Bayes

The MAD-Bayes idea
- Start with a nonparametric Bayes model
- Take a similar limit to get a K-means-like objective

K-means

The K-means clustering problem: minimize the sum of squared distances from data points to their cluster centers,

$$\min \; \sum_{n=1}^{N} \| x_n - \mathrm{center}_n \|^2,$$

where $\mathrm{center}_n$ denotes the center of the cluster to which point $n$ is assigned.

Lloyd's algorithm

Iterate until no changes:
1. For n = 1, ..., N: assign point n to a cluster
2. Update cluster means

(A minimal code sketch follows below.)
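
A minimal NumPy sketch of Lloyd's algorithm (my own illustration, not from the slides; initializing at K random data points is one common choice, not the only one):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Lloyd's algorithm: alternate hard assignments and cluster-mean updates."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=K, replace=False)]  # initialize at K random data points
        for _ in range(n_iter):
            # Assignment step: each point goes to its nearest center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            z = d2.argmin(axis=1)
            # Update step: each center becomes the mean of the points assigned to it.
            new_centers = np.array([X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
                                    for k in range(K)])
            if np.allclose(new_centers, centers):  # no change: converged
                break
            centers = new_centers
        return z, centers

For example, z, centers = kmeans(np.random.randn(500, 2), K=3) clusters 500 random 2-D points into 3 groups.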

MAD-Bayes

The MAD-Bayes idea (revisited)
- Start with a nonparametric Bayes model
- Take a similar limit to get a K-means-like objective

Bayesian model

[Slides 10-11: graphical-model figures.]

Nonparametric: the number of parameters can grow with the number of data points.


MAD-Bayes
- Maximum a posteriori (MAP) estimation is an optimization problem: argmax over the parameters of P(parameters | data)
- We take a limit of the objective (the posterior) and get one like K-means: small-variance asymptotics (worked through below for the Dirichlet process mixture case)
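
To make the limit concrete, here is a sketch for the Dirichlet process mixture case (the scaling of the concentration parameter follows Broderick et al. 2013 and Kulis & Jordan 2012; it is not spelled out on this slide). Take Gaussian likelihoods with covariance $\sigma^2 I$ and let the concentration parameter scale as $\exp\{-\lambda^2/(2\sigma^2)\}$. Multiplying the negative log posterior by $2\sigma^2$ and letting $\sigma^2 \to 0$ leaves, up to an additive constant,

$$\sum_{n=1}^{N} \|x_n - \mu_{z_n}\|^2 + \lambda^2 K^+,$$

where $K^+$ is the number of instantiated clusters: MAP inference degenerates into K-means with a penalty of $\lambda^2$ for each cluster used (the DP-means objective).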

MAD-Bayes

Bayesian posterior                  ->  K-means-like objective
Mixture of K Gaussians              ->  K-means
Dirichlet process mixture           ->  unbounded number of clusters
Hierarchical Dirichlet process      ->  multiple data sets share cluster centers
Beta process                        ->  features
...                                 ->  ...

(The Dirichlet-process row corresponds to the DP-means algorithm, sketched below.)
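
Since the Dirichlet-process row of this table corresponds to the DP-means algorithm of Kulis & Jordan (2012), here is a minimal sketch of it (my own illustration; the parameter lam2 plays the role of the squared penalty and the other details are assumptions, not taken from the slides):

    import numpy as np

    def dp_means(X, lam2, n_iter=50):
        """DP-means sketch: K-means that opens a new cluster whenever a point is
        farther than lam2 (in squared distance) from every existing center."""
        centers = [X.mean(axis=0)]                        # start with one global cluster
        for _ in range(n_iter):
            z = np.zeros(len(X), dtype=int)
            for n, x in enumerate(X):
                d2 = [np.sum((x - c) ** 2) for c in centers]
                if min(d2) > lam2:                        # too far from all centers: new cluster
                    centers.append(x.copy())
                    z[n] = len(centers) - 1
                else:
                    z[n] = int(np.argmin(d2))
            # Update each center to the mean of its assigned points (keep it if unused).
            centers = [X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
                       for k in range(len(centers))]
        return z, np.array(centers)

The number of clusters is not fixed in advance; it is determined by the data and the penalty lam2.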

Features

[Slide 14: a binary matrix Z with rows Point 1-7 and columns Feature 1-5; Z_{nk} = 1 when point n possesses feature k. A is the matrix whose rows are the corresponding feature means.]
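
As a toy illustration of the linear-Gaussian feature model behind this slide (invented numbers, meant only to show the roles of Z and A): each observation is the sum of the feature means it possesses, plus noise.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, D = 7, 5, 3                             # 7 points, 5 features, 3-dimensional data
    Z = rng.integers(0, 2, size=(N, K))           # binary feature ownership, like the slide's matrix
    A = rng.normal(size=(K, D))                   # feature means, one row per feature
    X = Z @ A + 0.1 * rng.normal(size=(N, D))     # observed data: owned features summed, plus noise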

MAD-Bayes

Bayesian posterior for the linear-Gaussian Indian buffet process (beta process) feature model:

$$P(Z, A \mid X) \propto \frac{1}{(2\pi\sigma^2)^{ND/2}} \exp\left\{ -\frac{1}{2\sigma^2} \mathrm{tr}\big((X - ZA)^\top (X - ZA)\big) \right\} \cdot \frac{1}{(2\pi\sigma_A^2)^{K^+ D/2}} \exp\left\{ -\frac{1}{2\sigma_A^2} \mathrm{tr}(A^\top A) \right\} \cdot \frac{\gamma^{K^+}}{\prod_{h=1}^{H} K_h!} \exp\left\{ -\gamma \sum_{n=1}^{N} \frac{1}{n} \right\} \prod_{k=1}^{K^+} \frac{(S_{N,k} - 1)!\,(N - S_{N,k})!}{N!}$$

Here $K^+$ is the number of instantiated features, $S_{N,k} = \sum_n Z_{nk}$, $K_h$ counts features with identical history $h$, and $\gamma$ is the Indian buffet process mass parameter.
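
How this posterior turns into the next slide's objective (a brief derivation sketch; the specific choice $\gamma = \exp\{-\lambda^2/(2\sigma^2)\}$ is taken from the MAD-Bayes paper rather than this slide): multiply $-\log P(Z, A \mid X)$ by $2\sigma^2$ and let $\sigma^2 \to 0$. The likelihood term contributes $\mathrm{tr}\big((X - ZA)^\top(X - ZA)\big)$, the $\gamma^{K^+}$ term contributes $\lambda^2 K^+$, and every remaining term is killed by the factor $2\sigma^2$, leaving the BP-means objective below.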

MAD-Bayes

BP-means objective:

$$\underset{K^+,\, Z,\, A}{\operatorname{argmin}} \;\; \mathrm{tr}\big[(X - ZA)^\top (X - ZA)\big] + K^+ \lambda^2$$

BP-means algorithm. Iterate until no changes:
1. For n = 1, ..., N: assign point n to features; create a new feature if it lowers the objective
2. Update feature means: A <- (Z'Z)^{-1} Z'X

(A rough code sketch follows below.)
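
A rough NumPy sketch of this alternation (my own rendering of the two steps above; the initialization and the greedy per-feature updates are assumptions, not taken from the slides):

    import numpy as np

    def bp_means(X, lam2, n_iter=50):
        """BP-means-style sketch: lam2 plays the role of lambda^2 in the objective."""
        N, D = X.shape
        Z = np.ones((N, 1))                          # start with one feature shared by all points
        A = np.linalg.lstsq(Z, X, rcond=None)[0]     # feature means via least squares
        for _ in range(n_iter):
            for n in range(N):
                # Assign point n to features: set each z_{nk} in {0, 1} to reduce its error.
                for k in range(Z.shape[1]):
                    errs = []
                    for v in (0.0, 1.0):
                        Z[n, k] = v
                        errs.append(np.sum((X[n] - Z[n] @ A) ** 2))
                    Z[n, k] = float(np.argmin(errs))
                # Create a new feature if it lowers the objective: a feature equal to the
                # residual removes ||residual||^2 of error at a cost of lam2.
                resid = X[n] - Z[n] @ A
                if np.sum(resid ** 2) > lam2:
                    Z = np.hstack([Z, np.zeros((N, 1))])
                    Z[n, -1] = 1.0
                    A = np.vstack([A, resid[None, :]])
            # Update feature means A <- (Z'Z)^{-1} Z'X (pseudoinverse guards against unused features).
            A = np.linalg.pinv(Z) @ X
        return Z, A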

MAD-Bayes

Timing on the Griffiths & Ghahramani (2006) computer-vision "tabletop" data:
- Gibbs sampler on the Bayesian posterior: 8.5 x 10^3 sec
- BP-means algorithm: 0.36 sec

BP-means remains faster by an order of magnitude even if restarted 1000 times.

MAD-Bayes

Parallelism and optimistic concurrency control:

                       DP-means alg.   BP-means alg.
# data points          134M            8M
time per iteration     5.5 min         4.3 min

Recap: the same small-variance limit maps each Bayesian posterior (mixture of K Gaussians, Dirichlet process mixture, hierarchical Dirichlet process, beta process, ...) to a corresponding K-means-like objective.

MAD-Bayes conclusions
- We provide new optimization objectives and regularizers
  - In fact, a general means of obtaining more
  - Straightforward, fast algorithms
- Can the promise of nonparametric Bayes be fulfilled?

References

T. Broderick, B. Kulis, and M. I. Jordan. MAD-Bayes: MAP-based asymptotic derivations from Bayes. In International Conference on Machine Learning, 2013.

X. Pan, J. E. Gonzalez, S. Jegelka, T. Broderick, and M. I. Jordan. Optimistic concurrency control for distributed unsupervised learning. In Neural Information Processing Systems, 2013.

T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Neural Information Processing Systems, 2013.

Further References

T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Neural Information Processing Systems, 2006.

N. L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259-1294, 1990.

J. F. C. Kingman. The representation of partition structures. Journal of the London Mathematical Society, 2(2):374, 1978.

B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In International Conference on Machine Learning, 2012.

J. Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2):145-158, 1995.

R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In International Conference on Artificial Intelligence and Statistics, 2007.