Fast Approximate MAP Inference for Bayesian Nonparametrics

Y. Raykov, A. Boukouvalas, M.A. Little
Department of Mathematics, Aston University
10th Conference on Bayesian Nonparametrics, 2015

Outline

1. Iterated Conditional Modes
2. Dirichlet Process Mixtures
   - MAP-DP
   - Experiments and Results
3. Infinite Hidden Markov Model (HDP-iHMM)

Iterated Conditional Modes

Iterated Conditional Modes (ICM) for probabilistic graphical models (PGMs):

- is a deterministic algorithm that maximizes the conditional distribution of each random variable in turn while holding the rest fixed;
- finds an approximation of the MAP solution for the joint distribution over all the random variables in the PGM;
- is a cheap alternative to sampling approaches;
- is exactly equivalent to simulated annealing run at zero temperature.

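Notation and model

Spherical model (Bayesian spherical GMM):

$$\mu_k \sim \mathcal{N}(\mu_0, \sigma_0 I), \qquad \pi \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$$

$$z_1, \dots, z_N \sim \mathrm{Categorical}(\pi), \qquad x_i \sim \mathcal{N}(\mu_{z_i}, \sigma I), \quad i = 1, \dots, N$$

Figure: Bayesian mixture model

Negative log-likelihood:

$$-\log p(X, Z \mid \dots) = \sum_{i=1}^{N} \sum_{k : z_i = k} \frac{1}{2\sigma} \|x_i - \mu_k\|_2^2 - \log \pi_k - P_0^k$$

with prior and constant terms $P_0^k = \log \alpha_k + \log p(\mu_0^k) + C$.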

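MAP-DP: MAP problem

Iterated Conditional Modes:

$$\arg\min_{Z, \mu, \pi} \sum_{i=1}^{N} \sum_{k : z_i = k} \frac{1}{2\sigma} \|x_i - \mu_k\|_2^2 - \log \pi_k - P_0^k$$

- Compute the assignments: $q_{i,k} = -\log \pi_k + \frac{1}{2\sigma} \|x_i - \mu_k\|_2^2$, then $z_i = \arg\min_{k \in \{1, \dots, K\}} q_{i,k}$;
- Update the cluster means $\mu_1, \dots, \mu_K$ by taking the mode of the posterior;
- Update the cluster weights from $\pi_k = \frac{N_k + \alpha_k - 1}{N}$ for $k = 1, \dots, K$.

K-means:

$$\arg\min_{Z, \mu} \sum_{i=1}^{N} \sum_{k : z_i = k} \|x_i - \mu_k\|_2^2$$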
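
The three ICM steps for the Bayesian spherical GMM can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `icm_spherical_gmm`, the argument `mu_init`, and all default values are hypothetical; `sigma` and `sigma0` are treated as variances; the weights use the fully normalised Dirichlet posterior mode, which differs from the slide's expression only by a constant factor that does not change the arg min over k.

```python
import numpy as np

def icm_spherical_gmm(X, mu_init, sigma=1.0, sigma0=10.0, mu0=None,
                      alpha=None, n_iter=50):
    """ICM sketch for a Bayesian spherical GMM (sigma, sigma0 are variances)."""
    X = np.asarray(X, dtype=float)
    mu = np.array(mu_init, dtype=float)
    N, D = X.shape
    K = mu.shape[0]
    mu0 = np.zeros(D) if mu0 is None else np.asarray(mu0, dtype=float)
    # Assumption: alpha_k = 2 keeps every weight strictly positive.
    alpha = np.full(K, 2.0) if alpha is None else np.asarray(alpha, dtype=float)
    pi = np.full(K, 1.0 / K)
    z = np.full(N, -1)
    for _ in range(n_iter):
        # Assignment step: q[i, k] = -log pi_k + ||x_i - mu_k||^2 / (2 sigma)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        q = sq / (2.0 * sigma) - np.log(pi)[None, :]
        z_new = q.argmin(axis=1)
        if np.array_equal(z_new, z):
            break  # assignments are stable, so means and weights are too
        z = z_new
        Nk = np.bincount(z, minlength=K)
        # Mean step: mode (= mean) of the Gaussian posterior over mu_k
        sums = np.zeros((K, D))
        np.add.at(sums, z, X)
        mu = (mu0 / sigma0 + sums / sigma) / (1.0 / sigma0 + Nk / sigma)[:, None]
        # Weight step: mode of the Dirichlet posterior over pi
        pi = (Nk + alpha - 1.0) / (N + alpha.sum() - K)
    return z, mu, pi
```

Dropping the $-\log \pi_k$ term and the prior on the means reduces the loop to ordinary K-means, which is the comparison the slide draws.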

MAP-DP: Underlying model

Fully collapsed DP mixture model:

$$z_1, \dots, z_N \sim \mathrm{CRP}(\alpha, N)$$

$$x_i \sim F(\theta_{z_i}^{-i}) \quad \text{for all } i = 1, \dots, N$$

Figure: Collapsed DP mixture model

MAP-DP: DP mixtures

MAP-DP objective function:

$$\arg\min_{Z, K} \sum_{i=1}^{N} \sum_{k : z_i = k} -\log p(x_i \mid \theta_{z_i}^{-i}) - \sum_{k=1}^{K^+} \log \Gamma(N_k) - P_0^k$$

with prior term $P_0^k = \log \alpha + \log p(\theta_0^k)$.

Keeping $N_{k,-i}$ and $\theta^{-i}$ updated, compute for each observation:

$$q_{i,k} = -\log N_{k,-i} - \log p(x_i \mid \theta_k^{-i})$$

$$q_{i,K^+ + 1} = -\log \alpha - \log p(x_i \mid \theta_0)$$

$$z_i = \arg\min_{k \in \{1, \dots, K^+ + 1\}} q_{i,k}$$
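
A minimal sketch of the MAP-DP assignment sweep, assuming a concrete choice of $F$ that the slide leaves general: spherical Gaussian clusters with known variance `sigma` and a conjugate prior $\mathcal{N}(\mu_0, \sigma_0 I)$ on each mean, so that $p(x_i \mid \theta_k^{-i})$ is the Gaussian posterior predictive. The function name `map_dp_spherical`, the single-cluster initialisation, and the defaults are illustrative assumptions.

```python
import numpy as np

def map_dp_spherical(X, alpha=1.0, sigma=1.0, sigma0=10.0, mu0=None, n_iter=100):
    """MAP-DP sketch: collapsed spherical Gaussian clusters, known variance."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    mu0 = np.zeros(D) if mu0 is None else np.asarray(mu0, dtype=float)

    def neg_log_pred(x, n, s):
        # -log posterior predictive of x for a cluster holding n points with sum s
        prec = 1.0 / sigma0 + n / sigma
        mean = (mu0 / sigma0 + s / sigma) / prec
        var = sigma + 1.0 / prec
        return 0.5 * D * np.log(2.0 * np.pi * var) + ((x - mean) ** 2).sum() / (2.0 * var)

    z = np.zeros(N, dtype=int)          # assumption: start with one cluster
    counts = [N]
    sums = [X.sum(axis=0)]
    for _ in range(n_iter):
        changed = False
        for i in range(N):
            x, k_old = X[i], z[i]
            counts[k_old] -= 1          # remove x_i: gives N_{k,-i}, theta^{-i}
            sums[k_old] = sums[k_old] - x
            q = [np.inf if n == 0 else -np.log(n) + neg_log_pred(x, n, s)
                 for n, s in zip(counts, sums)]
            # Cost of opening a new cluster: -log alpha - log p(x_i | theta_0)
            q.append(-np.log(alpha) + neg_log_pred(x, 0, np.zeros(D)))
            k_new = int(np.argmin(q))
            if k_new == len(counts):
                counts.append(0)
                sums.append(np.zeros(D))
            counts[k_new] += 1
            sums[k_new] = sums[k_new] + x
            z[i] = k_new
            changed = changed or (k_new != k_old)
        if not changed:
            break
    # Relabel so that emptied clusters disappear from the output
    _, z = np.unique(z, return_inverse=True)
    return z
```

The $-\log N_{k,-i}$ term makes large clusters cheaper to join, so the number of clusters is not fixed in advance and can grow during a sweep.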

MAP-DP: Small Variance Asymptotics

DP-means objective function:

$$\arg\min_{Z, \mu, K} \sum_{i=1}^{N} \sum_{k : z_i = k} \|x_i - \mu_k\|_2^2 + \lambda K$$

Compute for each observation:

$$q_{i,k} = \|x_i - \mu_k\|_2^2, \qquad q_{i,K+1} = \lambda, \qquad z_i = \arg\min_{k \in \{1, \dots, K+1\}} q_{i,k}$$
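
For comparison, the DP-means update can be sketched in a few lines. The function name `dp_means`, the initialisation with a single cluster at the global mean, and the choice to seed a new cluster at the point that triggered it are assumptions made for this illustration.

```python
import numpy as np

def dp_means(X, lam, n_iter=100):
    """DP-means sketch: K-means with a penalty lam for each new cluster."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    mu = [X.mean(axis=0)]               # assumption: start from the global mean
    z = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        changed = False
        for i in range(N):
            # q[k] = ||x_i - mu_k||^2 for existing clusters, q[K] = lam
            q = [((X[i] - m) ** 2).sum() for m in mu]
            q.append(lam)
            k = int(np.argmin(q))
            if k == len(mu):
                mu.append(X[i].copy())  # open a new cluster at x_i
            if k != z[i]:
                z[i] = k
                changed = True
        # Drop empty clusters and recompute the means
        labels, z = np.unique(z, return_inverse=True)
        mu = [X[z == k].mean(axis=0) for k in range(len(labels))]
        if not changed:
            break
    return z, np.array(mu)
```

Unlike the MAP-DP update, the cost of joining a cluster here does not depend on the cluster size, and the prior enters only through the single threshold `lam`.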

MAP-DP: Comparison of MAP-DP and DP-means

Similarities:

- Both provide approximately optimal clustering
- Both are fast and scalable
- Both are non-parametric

Advantages of MAP-DP over DP-means:

- Retains the reinforcement effect
- No degeneracy in the likelihood
- Rigorous way of choosing the concentration parameter $\alpha$
- Prior keeps influence on the objective function
- Principled way to handle non-spherical and missing data

Figure: Association chart of ICM and SVA algorithms

Experiments and Results: Synthetic study

CRP mixture data:

- Sample cluster indicators: $z_1, \dots, z_N \sim \mathrm{CRP}(\alpha, N)$
- Sample $K^+$ cluster parameters: $\{\mu_k, \Sigma_k\} \sim \mathrm{NW}(\theta_0)$
- For each $k$, sample $N_k$ observations: $x_i \sim \mathcal{N}(\mu_k, \Sigma_k)$

Figure: Synthetically generated CRP mixture data
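
The generative recipe above can be sketched as follows. To keep the example dependency-free it makes a labelled simplification: instead of drawing $\{\mu_k, \Sigma_k\}$ from a Normal-Wishart prior, it draws only the means from a Gaussian and fixes $\Sigma_k = I$. The function names and the `prior_var` parameter are hypothetical.

```python
import numpy as np

def sample_crp(N, alpha, rng):
    """Draw cluster indicators z_1..z_N from a Chinese restaurant process."""
    z = np.zeros(N, dtype=int)
    counts = [1]                        # the first point opens the first cluster
    for i in range(1, N):
        # Existing cluster k with prob. N_k / (i + alpha), new with alpha / (i + alpha)
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        z[i] = k
    return z, np.array(counts)

def sample_crp_mixture(N, alpha, D=2, prior_var=25.0, seed=0):
    """Simplified CRP mixture: Gaussian prior on means, identity covariances."""
    rng = np.random.default_rng(seed)
    z, counts = sample_crp(N, alpha, rng)
    mu = rng.normal(0.0, np.sqrt(prior_var), size=(len(counts), D))
    X = mu[z] + rng.normal(size=(N, D))  # x_i ~ N(mu_{z_i}, I)
    return X, z, mu
```

Because existing clusters are chosen in proportion to their size, large clusters tend to keep growing, which is the reinforcement effect referred to in the comparison with DP-means.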

Experiments and Results: Synthetic study

              Gibbs         MAP-DP        DP-means
NMI           0.81 (0.1)    0.82 (0.1)    0.68 (0.1)
Iterations    1395 (651)    10 (3)        18 (7)
ΔK            3.6 (3.0)     6.6 (2.9)     0.0
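
For readers who want to reproduce the NMI row on their own clusterings, normalised mutual information between two labelings can be computed as in the sketch below. The slides do not say which normalisation was used; this version assumes the arithmetic mean of the two entropies, and the function name `nmi` is illustrative.

```python
import numpy as np

def nmi(a, b):
    """Normalised mutual information between two labelings of the same points."""
    a = np.unique(a, return_inverse=True)[1]   # relabel to 0..Ka-1
    b = np.unique(b, return_inverse=True)[1]   # relabel to 0..Kb-1
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)              # contingency table
    joint /= len(a)
    pa = joint.sum(axis=1)
    pb = joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa * np.log(pa)).sum()
    hb = -(pb * np.log(pb)).sum()
    if ha + hb == 0.0:
        return 1.0                             # both labelings are a single cluster
    # Assumption: arithmetic-mean normalisation
    return 2.0 * mi / (ha + hb)
```

The score is invariant to permutations of the cluster labels, so it can compare an estimated clustering with the ground-truth indicators directly.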

Experiments and Results: Case study

Parkinson's Disease (PD) Data Organizing Center Database

- Aim of the study: exploring PD sub-types using the PD-DOC database.
- Data from 527 patients, 285 features with missing data
- Categorical, Poisson and Binomial data

Experiments and Results: Case study results

3 main equally-sized clusters suggesting different PD sub-types.

Examples of features that separate clusters:

Feature               Cluster 1    Cluster 2    Cluster 3
Sleep Disturbance*
Right leg agility*
Risk of stroke        4%           15%          6%

* Ratio of affected to non-affected patients.

Infinite Hidden Markov Model (iHMM)

Each row in the transition matrix is a DP:

$$p(x_t \mid z_{t-1}) = \sum_{z_t} \pi_{z_t, z_{t-1}} \, p(x_t \mid z_t)$$

Synthetic study: HMM with spherical Gaussian emissions

Sample 4000 data points from an HMM with spherical emissions $\mathcal{N}(\mu_1, \sigma I_3), \dots, \mathcal{N}(\mu_5, \sigma I_3)$, with 0.96 probability of self-transition and 0.01 probability for each of the remaining transitions.

              Gibbs    MAP-iHMM    SVA-iHMM
NMI
Iterations
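
The synthetic HMM described above can be sampled with the short sketch below. The five states, three emission dimensions, and transition probabilities follow the slide; the emission means, the default `sigma`, the uniform initial state, and the function name `sample_hmm` are assumptions, since the slide does not give them.

```python
import numpy as np

def sample_hmm(T=4000, K=5, D=3, p_self=0.96, sigma=1.0, seed=0):
    """Sample from a K-state HMM with spherical Gaussian emissions N(mu_k, sigma I)."""
    rng = np.random.default_rng(seed)
    # Transition matrix: p_self on the diagonal, the rest shared equally
    A = np.full((K, K), (1.0 - p_self) / (K - 1))
    np.fill_diagonal(A, p_self)
    mu = rng.normal(0.0, 5.0, size=(K, D))   # hypothetical emission means
    z = np.zeros(T, dtype=int)
    z[0] = rng.integers(K)                   # assumption: uniform initial state
    for t in range(1, T):
        z[t] = rng.choice(K, p=A[z[t - 1]])
    # sigma is a variance, so the noise is scaled by its square root
    X = mu[z] + np.sqrt(sigma) * rng.normal(size=(T, D))
    return X, z, A, mu
```

With `K=5` and `p_self=0.96`, each off-diagonal entry is 0.01, matching the stated setup.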

Summary

ICM gives up many of the Bayesian advantages of BNP models:

- It does not average over the uncertain variables.
- It obtains only a point estimate of the joint posterior.
- It underestimates the variance and fails to extract information from the tails of the true underlying distribution.

Nevertheless:

- The suggested methods obtain a statistically principled approximate solution of the MAP problem with little computational effort.
- Results are easy to interpret, and convergence to a local solution is guaranteed.
- The MAP schemes suggest a way to fit complex BNP models to at least moderately big problems.
- Applying ICM to the non-degenerate likelihood function preserves some of the essential properties of the model.

Appendix: For Further Reading

Relevant work:

- Simple approximate MAP Inference for Dirichlet processes (Y. Raykov, A. Boukouvalas and M. A. Little)
- Fast search for Dirichlet process mixture models (H. Daume III, 2007)
- Scaling the Indian Buffet Process via Submodular Maximization (C. Reed and Z. Ghahramani, 2013)
- Fast Bayesian Inference in Dirichlet Process Mixture Models (L. Wang and D. B. Dunson, 2011)
- Revisiting k-means: New Algorithms via Bayesian Nonparametrics (B. Kulis and M. I. Jordan, 2012)
- MAD-Bayes: MAP-based Asymptotic Derivations from Bayes (T. Broderick, B. Kulis and M. I. Jordan, 2013)
