Fast Approximate MAP Inference for Bayesian Nonparametrics
Y. Raykov, A. Boukouvalas, M. A. Little
Department of Mathematics, Aston University
10th Conference on Bayesian Nonparametrics, 2015
1. Iterated Conditional Modes
2. Dirichlet Process Mixtures
   - MAP-DP
   - Experiments and Results
3. Infinite Hidden Markov Model (HDP-iHMM)
Iterated Conditional Modes
ICM for PGMs is a deterministic algorithm that maximizes the conditional distribution of each random variable while holding the rest fixed.
- Finds an approximation to the MAP solution for the joint distribution over all the random variables in the PGM
- A cheap alternative to sampling approaches
- Exactly equivalent to simulated annealing at the zero-temperature state
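As a concrete illustration, here is a minimal ICM sketch in Python; the `icm` function, the toy chain MRF, and all parameters are hypothetical, not from the talk:

```python
import numpy as np

def icm(log_joint, n_vars, n_states, max_iters=100, seed=0):
    """Coordinate-wise maximization: each variable is set to the mode of its
    conditional given the rest, until no assignment changes."""
    rng = np.random.default_rng(seed)
    z = rng.integers(n_states, size=n_vars)       # random initial configuration
    for _ in range(max_iters):
        changed = False
        for i in range(n_vars):
            old = z[i]
            scores = np.empty(n_states)
            for s in range(n_states):
                z[i] = s                          # argmax of the conditional
                scores[s] = log_joint(z)          # equals argmax of the joint
            z[i] = int(np.argmax(scores))
            changed |= (z[i] != old)
        if not changed:                           # local mode reached
            break
    return z

# Toy usage: a chain MRF whose log-joint rewards equal neighbouring states.
print(icm(lambda z: float(np.sum(z[:-1] == z[1:])), n_vars=6, n_states=3))
```

Because each coordinate update can only increase the joint log-probability, the sweep terminates at a local mode.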
Notation and model
Bayesian spherical GMM:
$$\mu_k \sim \mathcal{N}(\mu_0, \sigma_0 I), \quad \pi \sim \mathrm{Dir}(a_1, \ldots, a_K), \quad z_1, \ldots, z_N \sim \mathrm{Categorical}(\pi), \quad x_i \sim \mathcal{N}(\mu_{z_i}, \sigma I)$$
Figure: Bayesian mixture model
Negative log-likelihood:
$$-\log p(X, Z, \ldots) = \sum_{i=1}^{N} \sum_{k : z_i = k} \left( \frac{\|x_i - \mu_k\|_2^2}{2\sigma} - \log \pi_k - P^0_k \right)$$
with prior and constant terms $P^0_k = \log a_k + \log p(\mu_k \mid \mu_0) + C$.
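A minimal generative sketch of this Bayesian spherical GMM; the dimension, the variances `sigma0` and `sigma`, and the Dirichlet weights are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 500, 4, 2
sigma0, sigma = 5.0, 0.5          # prior and likelihood variances (illustrative)
a = np.ones(K)                    # symmetric Dirichlet weights
mu0 = np.zeros(D)

mu = rng.normal(mu0, np.sqrt(sigma0), size=(K, D))  # mu_k ~ N(mu_0, sigma_0 I)
pi = rng.dirichlet(a)                               # pi ~ Dir(a_1, ..., a_K)
z = rng.choice(K, size=N, p=pi)                     # z_i ~ Categorical(pi)
x = rng.normal(mu[z], np.sqrt(sigma))               # x_i ~ N(mu_{z_i}, sigma I)
```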
MAP-DP
MAP problem via Iterated Conditional Modes:
$$\arg\min_{Z, \mu, \pi} \sum_{i=1}^{N} \sum_{k : z_i = k} \left( \frac{\|x_i - \mu_k\|_2^2}{2\sigma} - \log \pi_k - P^0_k \right)$$
Compute the assignments:
$$q_{i,k} = -\log \pi_k + \frac{\|x_i - \mu_k\|_2^2}{2\sigma}, \qquad z_i = \arg\min_{k \in \{1, \ldots, K\}} q_{i,k}$$
Update cluster means $\mu_1, \ldots, \mu_K$ by taking the mode of the posterior. Update cluster weights from $\pi_k = \frac{N_k + a_k - 1}{N}$ for $k = 1, \ldots, K$.
Compare K-means:
$$\arg\min_{Z, \mu} \sum_{i=1}^{N} \sum_{k : z_i = k} \|x_i - \mu_k\|_2^2$$
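A hedged sketch of one ICM sweep for this finite model, using the slide's notation; the posterior-mode update for `mu` assumes the conjugate spherical Gaussian prior above, and the variable names are illustrative:

```python
import numpy as np

def icm_sweep(X, mu, pi, sigma, sigma0, mu0, a):
    """One ICM sweep: assignments, then cluster means, then weights."""
    N, K = X.shape[0], mu.shape[0]
    # Assignments: q_{i,k} = -log pi_k + ||x_i - mu_k||^2 / (2 sigma)
    q = -np.log(pi)[None, :] + \
        np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2) / (2 * sigma)
    z = np.argmin(q, axis=1)
    # Means: mode of the conjugate Gaussian posterior for each cluster
    for k in range(K):
        Xk = X[z == k]
        mu[k] = (mu0 / sigma0 + Xk.sum(axis=0) / sigma) / \
                (1 / sigma0 + len(Xk) / sigma)
    # Weights: pi_k = (N_k + a_k - 1) / N, as on the slide
    pi = (np.bincount(z, minlength=K) + a - 1) / N
    return z, mu, pi
```

Iterating the sweep until the assignments stop changing gives the MAP counterpart of the K-means objective shown above.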
MAP-DP
1. Iterated Conditional Modes
2. Dirichlet Process Mixtures
   - MAP-DP
   - Experiments and Results
3. Infinite Hidden Markov Model (HDP-iHMM)
MAP-DP: Underlying model
Fully collapsed DP mixture model:
$$z_1, \ldots, z_N \sim \mathrm{CRP}(\alpha, N), \qquad x_i \sim F(\theta_{z_i}) \text{ for all } i = 1, \ldots, N$$
Figure: Collapsed DP mixture model
MAP-DP for DP mixtures
Objective function:
$$\arg\min_{Z, K^+} \; -\sum_{i=1}^{N} \sum_{k : z_i = k} \log p(x_i \mid \theta_{z_i}) - \sum_{k=1}^{K^+} \left( \log \Gamma(N_k) + P^0_k \right)$$
with prior term $P^0_k = \log \alpha + \log p(\theta_k \mid \theta_0)$.
Keeping $N_{k,-i}$ and $\theta_k^{-i}$ updated, compute for each observation:
$$q_{i,k} = -\log N_{k,-i} - \log p\left(x_i \mid \theta_k^{-i}\right), \qquad q_{i,K^+ + 1} = -\log \alpha - \log p(x_i \mid \theta_0)$$
$$z_i = \arg\min_{k \in \{1, \ldots, K^+ + 1\}} q_{i,k}$$
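A sketch of the MAP-DP assignment step for spherical Gaussian clusters; for simplicity the collapsed predictive densities are approximated by Gaussians at the current cluster means, and `alpha`, `sigma`, `mu0`, `sigma0` are illustrative hyperparameters:

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log density of a spherical Gaussian N(mu, var * I)."""
    return -0.5 * np.sum((x - mu) ** 2) / var \
           - 0.5 * x.shape[0] * np.log(2 * np.pi * var)

def map_dp_assign(x_i, mus, counts, alpha, sigma, mu0, sigma0):
    """Return z_i; len(mus) == current K+; counts exclude x_i itself."""
    # Existing clusters: q_{i,k} = -log N_{k,-i} - log p(x_i | theta_k)
    q = [-np.log(nk) - log_gauss(x_i, mu_k, sigma)
         for mu_k, nk in zip(mus, counts)]
    # New cluster: q_{i,K+ + 1} = -log alpha - log p(x_i | theta_0),
    # scored under the prior predictive N(mu_0, (sigma_0 + sigma) I)
    q.append(-np.log(alpha) - log_gauss(x_i, mu0, sigma0 + sigma))
    return int(np.argmin(q))                  # z_i = argmin_k q_{i,k}
```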
MAP-DP: Small Variance Asymptotics (DP-means)
Objective function:
$$\arg\min_{Z, \mu, K} \sum_{i=1}^{N} \sum_{k : z_i = k} \|x_i - \mu_k\|_2^2 + \lambda K$$
Compute for each observation:
$$q_{i,k} = \|x_i - \mu_k\|_2^2, \qquad q_{i,K+1} = \lambda, \qquad z_i = \arg\min_{k \in \{1, \ldots, K+1\}} q_{i,k}$$
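A minimal DP-means sketch following this slide; the variable names and the convergence test are illustrative:

```python
import numpy as np

def dp_means(X, lam, max_iters=100):
    """DP-means: nearest-mean assignment with a new-cluster penalty lambda."""
    mus = [X.mean(axis=0)]                         # start with one global cluster
    for _ in range(max_iters):
        k_start = len(mus)
        z = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            d = [np.sum((x - mu) ** 2) for mu in mus]   # q_{i,k}
            if min(d) > lam:                            # q_{i,K+1} = lambda wins:
                mus.append(x.copy())                    # seed a new cluster at x
                z[i] = len(mus) - 1
            else:
                z[i] = int(np.argmin(d))
        new_mus = [X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                   for k in range(len(mus))]
        converged = (len(mus) == k_start and
                     all(np.allclose(m, nm) for m, nm in zip(mus, new_mus)))
        mus = new_mus
        if converged:
            break
    return z, np.array(mus)
```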
Comparison of MAP-DP and DP-means
Similarities:
- Both provide approximately optimal clustering
- Both are fast and scalable
- Both are nonparametric
Advantages of MAP-DP over DP-means:
- Retains the reinforcement effect
- No degeneracy in the likelihood
- Rigorous way of choosing the concentration parameter $\alpha$
- The prior keeps its influence on the objective function
- Principled way to handle non-spherical clusters and missing data
Figure: Association chart of ICM and SVA algorithms
Experiments and Results
1. Iterated Conditional Modes
2. Dirichlet Process Mixtures
   - MAP-DP
   - Experiments and Results
3. Infinite Hidden Markov Model (HDP-iHMM)
Synthetic study: CRP mixture data
- Sample cluster indicators: $z_1, \ldots, z_N \sim \mathrm{CRP}(\alpha, N)$
- Sample $K^+$ cluster parameters: $\{\mu_k, \Sigma_k\} \sim \mathrm{NW}(\theta_0)$
- For each $k$, sample $N_k$ observations: $x_i \sim \mathcal{N}(\mu_k, \Sigma_k)$
Figure: Synthetically generated CRP mixture data
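A hedged sketch of this generator; the covariances come from inverting Wishart draws as a stand-in for the Normal-Wishart base measure $\mathrm{NW}(\theta_0)$, and `alpha`, the degrees of freedom, and the scales are illustrative:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
N, D, alpha = 500, 2, 1.0

# z_1, ..., z_N ~ CRP(alpha, N): sequential seating scheme
z, counts = np.empty(N, dtype=int), []
for i in range(N):
    probs = np.array(counts + [alpha], dtype=float)   # existing tables + new table
    k = rng.choice(len(probs), p=probs / probs.sum())
    if k == len(counts):
        counts.append(0)
    counts[k] += 1
    z[i] = k

# {mu_k, Sigma_k} from the base measure, then x_i ~ N(mu_{z_i}, Sigma_{z_i})
K_plus = len(counts)
Sigmas = [np.linalg.inv(wishart.rvs(df=D + 2, scale=np.eye(D), random_state=rng))
          for _ in range(K_plus)]
mus = [rng.multivariate_normal(np.zeros(D), 10 * np.eye(D)) for _ in range(K_plus)]
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
```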
Synthetic study: Results

              Gibbs         MAP-DP       DP-means
NMI           0.81 (0.1)    0.82 (0.1)   0.68 (0.1)
Iterations    1395 (651)    10 (3)       18 (7)
ΔK            3.6 (3.0)     6.6 (2.9)    0.0
Case study: Parkinson's Disease (PD) Data Organizing Center database
Aim of the study: exploring PD sub-types using the PD-DOC database.
- Data from 527 patients; 285 features with missing data
- Categorical, Poisson and binomial data
Case study: Results
Three main, roughly equally sized clusters suggesting different PD sub-types. Examples of features that separate the clusters:

Feature               Cluster 1   Cluster 2   Cluster 3
Sleep disturbance*    0.41        0.78        0.70
Right leg agility*    1.05        1.84        0.58
Risk of stroke        4%          15%         6%

* Ratio of affected to non-affected patients.
Infinite Hidden Markov Model (iHMM)
Each row of the transition matrix is a DP:
$$p(x_t \mid z_{t-1}) = \sum_{z_t} \pi_{z_{t-1}, z_t} \, p(x_t \mid z_t)$$
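A tiny numeric illustration of this marginalization, with a truncated (finite) transition matrix standing in for the DP-distributed rows; all numbers here are made up:

```python
import numpy as np
from scipy.stats import norm

pi = np.array([[0.7, 0.2, 0.1],    # row z_{t-1}: probabilities pi_{z_{t-1}, z_t}
               [0.1, 0.8, 0.1],
               [0.3, 0.3, 0.4]])
mus = np.array([-2.0, 0.0, 2.0])   # unit-variance Gaussian emission means per state

def pred_density(x_t, z_prev):
    # p(x_t | z_{t-1}) = sum over z_t of pi_{z_{t-1}, z_t} * p(x_t | z_t)
    return float(np.sum(pi[z_prev] * norm.pdf(x_t, loc=mus, scale=1.0)))

print(pred_density(0.3, z_prev=1))
```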
Synthetic study: HMM with spherical Gaussian emissions
Sample 4000 data points from an HMM with spherical emissions $\mathcal{N}(\mu_1, \sigma I_3), \ldots, \mathcal{N}(\mu_5, \sigma I_3)$, with probability 0.96 of self-transition and probability 0.01 for each of the remaining transitions.

              Gibbs   MAP-iHMM   SVA-iHMM
NMI           0.86    0.62       0.58
Iterations    1270    13         12
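A sketch reproducing this synthetic set-up; only the sizes and transition probabilities come from the slide, while the emission means and the variance `sigma` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
T, K, D, sigma = 4000, 5, 3, 1.0

A = np.full((K, K), 0.01) + 0.95 * np.eye(K)   # 0.96 self-transition, 0.01 elsewhere
mus = rng.normal(0.0, 5.0, size=(K, D))        # illustrative emission means

z = np.empty(T, dtype=int)
z[0] = rng.integers(K)
for t in range(1, T):
    z[t] = rng.choice(K, p=A[z[t - 1]])        # Markov chain over hidden states
X = rng.normal(mus[z], np.sqrt(sigma))         # x_t ~ N(mu_{z_t}, sigma * I_3)
```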
Summary
ICM sacrifices many of the Bayesian advantages of BNP models:
- It does not average over the uncertain variables
- It obtains only a point estimate of the joint posterior
- It underestimates the variance and fails to extract information from the tails of the true underlying distribution
Nevertheless, the suggested methods obtain a statistically principled approximate solution of the MAP problem with little computational effort. Results are easy to interpret, and convergence to a local solution is guaranteed.
- The MAP schemes suggest a way to fit complex BNP models to at least moderately big problems
- Applying ICM to the non-degenerate likelihood function preserves some of the essential properties of the model
Appendix: For Further Reading
Relevant work:
- Simple Approximate MAP Inference for Dirichlet Processes (Y. Raykov, A. Boukouvalas and M. A. Little), http://arxiv.org/abs/1411.0939
- Fast Search for Dirichlet Process Mixture Models (H. Daume III, 2007)
- Scaling the Indian Buffet Process via Submodular Maximization (C. Reed and Z. Ghahramani, 2013)
- Fast Bayesian Inference in Dirichlet Process Mixture Models (L. Wang and D. B. Dunson, 2011)
- Revisiting k-means: New Algorithms via Bayesian Nonparametrics (B. Kulis and M. I. Jordan, 2012)
- MAD-Bayes: MAP-based Asymptotic Derivations from Bayes (T. Broderick, B. Kulis and M. I. Jordan, 2013)