Deterministic annealing variant of variational Bayes method


International Workshop on Statistical-Mechanical Informatics 2007 (IW-SMI 2007)
Journal of Physics: Conference Series 95 (2008) 012015, doi:10.1088/1742-6596/95/1/012015

Deterministic Annealing Variant of Variational Bayes Method

Kentaro Katahira 1,2, Kazuho Watanabe 1 and Masato Okada 1,2
1 Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
2 RIKEN Brain Science Institute, Saitama, Japan
E-mail: katahira@mns.k.u-tokyo.ac.jp

Abstract. The Variational Bayes (VB) method is widely used as an approximation of the Bayesian method. Because the VB method is a gradient algorithm, it is often trapped by poor local optimal solutions. We introduce deterministic annealing to the VB method to overcome this local optimum problem. A temperature parameter is introduced into the free energy to control the annealing process deterministically. Applying the method to a mixture of Gaussians model and to hidden Markov models, we show that it can obtain the global optimum of the free energy and discover the optimal model structure.

1. Introduction
The Variational Bayes (VB) method has been widely used as an approximation of the Bayesian method for statistical models that have hidden variables [1, 2]. The VB method approximates the true Bayesian posterior distribution with a factored distribution, using an iterative algorithm similar to the expectation-maximization (EM) algorithm. In many applications it has shown better generalization performance than maximum likelihood estimation [3]. Because VB is a gradient algorithm, it suffers from a serious local optimum problem in practice; that is, the algorithm is often trapped by poor local solutions near the initial parameter value. To overcome this problem, Ueda and Ghahramani [4] applied a split-and-merge procedure to VB. However, their approach is limited to mixture models; it cannot be applied to other models, e.g., hidden Markov models (HMMs).

In this paper, we introduce a new deterministic annealing (DA) method for VB to overcome the local optimum problem. We call the method Deterministic Annealing VB (DA-VB). This method is general and can easily be applied to a wide class of statistical models. Such a deterministic annealing approach has been used for EM algorithms in the maximum likelihood context [5] and has been successfully applied to practical models [6]. Ueda and Nakano [5] reformulated maximizing the log-likelihood function as minimizing a free energy, defined as an effective cost function that depends on a temperature parameter. The objective function of VB is originally of the free energy form. Therefore, we can naturally use a statistical mechanics analogy to add a temperature parameter to it.

We applied the DA-VB method to a simple mixture of Gaussians model and to hidden Markov models (HMMs). In both models, DA-VB yields a better estimate than conventional VB. Unlike maximum likelihood estimation, VB can automatically eliminate redundant components in the model, so it has the potential to discover the appropriate model structure [1, 2].

However, the local optimum problem undermines this advantage. We demonstrate the ability of DA-VB to eliminate redundant components and find the optimal model structure.

The paper is organized as follows. In section 2, we introduce the DA-VB method. In section 3, we apply DA-VB to a simple mixture of Gaussians model as an illustrative example. In section 4, we apply DA-VB to hidden Markov models and show the experimental results. A conclusion follows in section 5.

2. The DA-VB method
We consider learning models p(x, y | θ), where x is an observable variable, y is a hidden variable, and θ is a set of parameters. Let X^n = {x_1, ..., x_n} be a set of observable data, Y^n = {y_1, ..., y_n} be the corresponding set of unobservable data, and φ(θ) be the prior distribution on the parameters. By introducing a distribution q(Y^n, θ), the marginal log-likelihood log p(X^n) can be lower bounded by appealing to Jensen's inequality:

  \log p(X^n) = \log \sum_{Y^n} \int d\theta \, p(X^n, Y^n, \theta)   (1)
              = \log \sum_{Y^n} \int d\theta \, q(Y^n, \theta) \frac{p(X^n, Y^n, \theta)}{q(Y^n, \theta)}   (2)
              \geq \langle \log p(X^n, Y^n, \theta) \rangle_{q(Y^n, \theta)} + H_{q(Y^n, \theta)}   (3)
              \equiv \mathcal{F}(q),   (4)

where H_{p(x)} is the entropy of the distribution p(x) and ⟨ · ⟩_{p(x)} denotes the expectation over p(x). We define the energy of a global configuration (Y^n, θ) to be the average of −log p(X^n, Y^n, θ) over the distribution q(Y^n, θ). Then, according to the basic equation of statistical physics, F = E − TS (F is the free energy, T the temperature, and S the entropy), F(q) corresponds to the negative free energy at temperature T = 1. Using this statistical mechanics analogy, we introduce the inverse temperature parameter β = 1/T into this free energy for annealing:

  \mathcal{F}_\beta(q) \equiv \langle \log p(X^n, Y^n, \theta) \rangle_{q(Y^n, \theta)} + \frac{1}{\beta} H_{q(Y^n, \theta)}.   (5)

Using the factorized form for q(Y^n, θ),

  q(Y^n, \theta) = Q(Y^n) \, r(\theta),   (6)

the free energy with inverse temperature parameter β becomes

  \mathcal{F}_\beta(Q, r) = \langle \log p(X^n, Y^n, \theta) \rangle_{Q(Y^n) r(\theta)} + \frac{1}{\beta} H_{Q(Y^n)} + \frac{1}{\beta} H_{r(\theta)}.   (7)

The VB method iteratively maximizes F(q) with respect to Q(Y^n) and r(θ) until it converges to a local maximum. The maximization of F(q) with respect to Q(Y^n) is called the VB-E step, and the maximization of F(q) with respect to r(θ) is called the VB-M step. The DA-VB method uses the free energy function with the temperature parameter, F_β(Q, r), as its objective function instead of F(q).
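As a concrete reading of equation (7), the following Python sketch evaluates F_β(Q, r) for a toy model in which both the hidden configuration and the parameter take finitely many values, so that the expectations and entropies are plain sums. The discretization and the function name are ours, purely for illustration, not part of the paper.

```python
import numpy as np

def free_energy_beta(log_joint, Q, r, beta):
    """F_beta(Q, r) of eq. (7) for a toy discrete model.

    log_joint[k, m] : log p(X^n, Y^n = k, theta = m), with the data X^n absorbed
                      into the table (hypothetical toy setup, not from the paper)
    Q : factor Q(Y^n), a probability vector over the K hidden configurations
    r : factor r(theta), a probability vector over the M parameter values
    """
    expected_log_joint = Q @ log_joint @ r       # <log p(X, Y, theta)>_{Q(Y) r(theta)}
    entropy_Q = -np.sum(Q * np.log(Q))           # H_{Q(Y^n)}
    entropy_r = -np.sum(r * np.log(r))           # H_{r(theta)}
    return expected_log_joint + (entropy_Q + entropy_r) / beta
```

Small β inflates the entropy terms, so maximizing F_β favors broad, high-entropy posteriors and smooths away shallow local maxima; as β grows toward 1 the expected log-joint takes over and F_β approaches the ordinary VB objective F(q).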

By adding an annealing loop to the original VB procedure, the following DA-VB method is obtained.

[DA-VB method]
1. Set β ← β_init (0 < β_init < 1) and t ← 0.
2. Initialize the hyperparameters of r^(0)(θ).
3. Perform the following VB-EM steps until convergence:

   VB-E step:
     Q^{(t+1)}(Y^n) = \frac{1}{C_Q} \exp\!\left( \beta \, \langle \log p(X^n, Y^n \mid \theta) \rangle_{r^{(t)}(\theta)} \right)   (8)

   VB-M step:
     r^{(t+1)}(\theta) = \frac{1}{C_r} \, \varphi(\theta)^{\beta} \exp\!\left( \beta \, \langle \log p(X^n, Y^n \mid \theta) \rangle_{Q^{(t)}(Y^n)} \right)   (9)

   Set t ← t + 1.
4. Increase the inverse temperature: β_new ← β_current × const.
5. If β_new < 1, repeat from step 3; otherwise stop.

Here, C_Q and C_r are normalization constants, and Q^{(t)} and r^{(t)} denote the current estimates of the posteriors after the t-th iteration. When β = 1, these VB-EM steps are the same as those of the original VB method.

The VB-E and VB-M steps in the above algorithm are derived as follows. Introducing Lagrange multipliers λ and λ′, the functional derivative of F_β with respect to Q(Y^n) is

  \langle \log p(X^n, Y^n, \theta) \rangle_{r(\theta)} - \frac{1}{\beta} \left( \log Q(Y^n) + 1 \right) + \lambda,   (10)

for fixed r(θ), and the functional derivative of F_β with respect to r(θ) is

  \langle \log p(X^n, Y^n, \theta) \rangle_{Q(Y^n)} - \frac{1}{\beta} \left( \log r(\theta) + 1 \right) + \lambda',   (11)

for fixed Q(Y^n). Equating these derivatives to zero gives equations (8) and (9).

Algorithmically, DA-VB is similar to the deterministic annealing variant of the EM algorithm (DAEM) [5]. The key difference is that in the DAEM algorithm the temperature parameter β modifies only the weight of the entropy of the hidden-variable posterior in the objective function, while in the DA-VB method β changes the weight of the entropy of the parameter posterior r(θ) as well as that of the hidden-variable posterior Q(Y^n). Beal described an annealing procedure for the VB of conjugate-exponential models [2], although its effects have not been examined yet. His annealing procedure corresponds to a special case of our DA-VB method that updates only the second β in the VB-M step while the first β in the VB-M step and the β in the VB-E step are kept fixed. As we show later, Beal's annealing method was ineffective at training hidden Markov models in our experiment.
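Before moving on to the examples, here is a minimal Python sketch of the outer annealing loop of the algorithm above. The callables `vb_e_step` and `vb_m_step` are hypothetical placeholders for the model-specific updates of equations (8) and (9); the inner loop uses a fixed number of sweeps instead of a convergence test, and a final pass at β = 1 is included so that the last VB-EM sweep coincides with the standard VB objective.

```python
def da_vb(data, r_init, vb_e_step, vb_m_step,
          beta_init=0.6, beta_factor=1.1, n_inner=100):
    """Sketch of the DA-VB outer loop (steps 1-5 above).

    vb_e_step(data, r, beta) -> Q    # eq. (8): tempered hidden-variable posterior
    vb_m_step(data, Q, beta) -> r    # eq. (9): tempered parameter posterior
    Both are model-specific and supplied by the caller.
    """
    beta = beta_init                         # step 1: 0 < beta_init < 1
    r = r_init                               # step 2: hyperparameters of r^(0)(theta)
    Q = None
    while True:
        for _ in range(n_inner):             # step 3: VB-EM steps at fixed beta
            Q = vb_e_step(data, r, beta)
            r = vb_m_step(data, Q, beta)
        if beta >= 1.0:                      # step 5: stop once beta has reached 1
            return Q, r
        beta = min(beta * beta_factor, 1.0)  # step 4: raise the inverse temperature
```

Setting beta_init = 1.0 makes the loop run the plain VB iteration once, which is the sense in which conventional VB is the special case β = 1.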

3. Illustrative example
To demonstrate the effect of β on the free energy F_β, we consider a simple one-dimensional, two-component mixture of Gaussians model [5]:

  p(x \mid \theta) = \sum_{k=1}^{2} \frac{\alpha_k}{\sqrt{2\pi}} \exp\!\left\{ -\frac{1}{2}(x - \mu_k)^2 \right\}.   (12)

The complete likelihood of the model is given by

  p(X^n, Y^n \mid \theta) = \prod_{k=1}^{2} \prod_{i=1}^{n} \left[ \frac{\alpha_k}{\sqrt{2\pi}} \exp\!\left\{ -\frac{1}{2}(x_i - \mu_k)^2 \right\} \right]^{y_i^k},   (13)

where y_i^k is not observed and represents the component from which the datum x_i was generated: if x_i was generated from the k-th component, then y_i^k = 1; otherwise y_i^k = 0. The prior over the means is a normal distribution with mean 0 and variance σ²:

  \varphi(\mu) = \prod_{k=1}^{2} N(\mu_k \mid 0, \sigma^2).   (14)

We fix the mixing coefficients to α_1 = 0.3 and α_2 = 0.7. Then the only parameter posteriors to be estimated are r(μ_1) and r(μ_2). Because r(μ_1) and r(μ_2) are parameterized by their own means (μ̄_1, μ̄_2), we can visualize the free energy surface F_β on the (μ̄_1, μ̄_2) plane. The VB-M step consists of updating the posterior according to

  r(\mu_k) = N\!\left( \mu_k \,\Big|\, \bar{\mu}_k, \; \frac{\sigma^2}{\beta(1 + \sigma^2 \bar{N}_k)} \right),   (15)

where

  \bar{\mu}_k = \frac{\sigma^2 \bar{N}_k}{1 + \sigma^2 \bar{N}_k} \bar{x}_k, \qquad
  \bar{y}_i^k = \frac{\exp\{\beta \gamma_i^k\}}{\sum_{j=1}^{2} \exp\{\beta \gamma_i^j\}}, \qquad
  \bar{N}_k = \sum_{i=1}^{n} \bar{y}_i^k, \qquad
  \bar{x}_k = \frac{1}{\bar{N}_k} \sum_{i=1}^{n} \bar{y}_i^k x_i,

  \gamma_i^k = \log \alpha_k - \log \sqrt{2\pi} - \frac{1}{2}(x_i - \bar{\mu}_k)^2 - \frac{\sigma^2}{2\beta(1 + \sigma^2 \bar{N}_k)}.

Five hundred samples were generated from this model with (μ_1, μ_2) = (4, −4); therefore, the global optimal posterior means are (μ̄_1, μ̄_2) = (4, −4). Figure 1 shows the learning process of the conventional VB method, that is, the special case of DA-VB with β_init = 1.0. In the trial with the initial value (μ̄_1, μ̄_2) = (0, 6), close to the local maximum, the algorithm converged to the local maximum, whereas it converged to the global maximum from the initial value (μ̄_1, μ̄_2) = (6, 0).

Figure 2 shows the learning process of the DA-VB method. We set the hyperparameter of the prior to σ² = 1. Note that the σ² of the prior is independent of the variance of the Gaussians of the model (in our case, 1). The choice of the prior σ² affects the estimation; e.g., for small σ², the probability that μ_k lies far from zero is small. However, as can be seen in the equation for μ̄_k, if the data size n is large enough and σ² is not too close to zero, the effect of the prior σ² is negligible. The starting point for each β is the convergence point of the previous annealing step. We set β_init = 0.6 and β_new = β_current × 1.1. When β is small (β = 0.6), F_β has only one maximum. Hence, for arbitrary initial parameter values, the VB-EM steps converge to that maximum (figure 2(a)). As β increases, two maxima appear, corresponding to the global optimal solution (μ̄_1, μ̄_2) = (4, −4) and a local optimal solution (μ̄_1, μ̄_2) = (−4, 4) (figure 2(b)). The form of the free energy surface F_β changes gradually as β increases. The maximum near the global optimal solution is always larger than the one near the local optimal solution. Therefore, the DA-VB method can reach the global optimal solution without being trapped by the local optimal solution.
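The updates above can be run directly. The following sketch implements the tempered VB-E and VB-M steps of equation (15) and the quantities below it for this two-component example; the data-generating means, mixing weights, and annealing schedule follow the text as reconstructed, while the initial effective counts, the fixed number of inner sweeps in place of a convergence test, and the random seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data as in section 3: two unit-variance components, fixed weights.
n = 500
alpha = np.array([0.3, 0.7])                  # fixed mixing coefficients
true_mu = np.array([4.0, -4.0])               # data-generating means
z = rng.choice(2, size=n, p=alpha)
x = rng.normal(true_mu[z], 1.0)

sigma2 = 1.0                                  # prior variance of mu_k
mu_bar = np.array([0.0, 6.0])                 # initialization near the local optimum
N_bar = n * alpha                             # initial effective counts (our choice)

beta, beta_factor = 0.6, 1.1                  # beta_init and the annealing factor
while True:
    for _ in range(200):                      # VB-EM sweeps at this temperature
        post_var = sigma2 / (beta * (1.0 + sigma2 * N_bar))   # variance in eq. (15)
        # VB-E step: tempered responsibilities y_bar[i, k]
        gamma = (np.log(alpha)
                 - 0.5 * np.log(2.0 * np.pi)
                 - 0.5 * (x[:, None] - mu_bar) ** 2
                 - 0.5 * post_var)            # last term: sigma^2 / (2 beta (1 + sigma^2 N_bar))
        logits = beta * gamma
        y_bar = np.exp(logits - logits.max(axis=1, keepdims=True))
        y_bar /= y_bar.sum(axis=1, keepdims=True)
        # VB-M step: effective counts, weighted sample means, posterior means of mu_k
        N_bar = y_bar.sum(axis=0)
        x_bar = y_bar.T @ x / N_bar
        mu_bar = sigma2 * N_bar * x_bar / (1.0 + sigma2 * N_bar)
    if beta >= 1.0:
        break
    beta = min(beta * beta_factor, 1.0)

print("estimated posterior means:", mu_bar)   # with annealing, should end up near (4, -4)
```

Note that the correction term in γ_i^k uses the effective counts from the previous sweep, which is one natural way to order the coupled updates.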

Figure 1. Learning process of the conventional VB method with two different initial values, (μ̄_1, μ̄_2) = (0, 6) and (6, 0). The algorithm starting from the initial point near the global maximum converges to the global maximum, while the one starting from the initial point near a local maximum converges to the local maximum.

4. DA-VB method for HMM
Here we derive the DA-VB method for hidden Markov models by extending the conventional VB method for HMMs (VB HMM) [2, 7]. Suppose a discrete sequence of L-valued symbols X_T = {x_1, ..., x_T}, where T is the length of the time series, was observed. We assume that x_t was produced by a K-valued discrete hidden state y_t and that the sequence of hidden states Y_T = {y_1, ..., y_T} was generated by a first-order Markov process. We represent the observed data by L-dimensional binary vectors with components x_{t,m} such that x_{t,m} = 1 if the output symbol at time t is m, and 0 otherwise. The hidden states y_t are likewise represented by K-dimensional binary vectors with components y_{t,k} such that y_{t,k} = 1 if the hidden state at time t is k, and 0 otherwise. The HMM has the following parameters:

  initial hidden-state prior:  π = {π_i},  π_i = p(y_1 = i)                 (K × 1)
  state transition matrix:     A = {a_{ij}},  a_{ij} = p(y_t = j | y_{t-1} = i)   (K × K)
  symbol emission matrix:      C = {c_{im}},  c_{im} = p(x_t = m | y_t = i)       (K × L)

where all parameters are non-negative and obey the normalization constraints Σ_{i=1}^K π_i = 1, Σ_{j=1}^K a_{ij} = 1, and Σ_{m=1}^L c_{im} = 1.

The VB-E step of DA-VB computes the expectations required for the VB-M step,

  \langle y_{t,k} \rangle_{Q(Y_T)} = \frac{ p(y_{t,k} = 1 \mid X_T)^{\beta} }{ \sum_{k=1}^{K} p(y_{t,k} = 1 \mid X_T)^{\beta} },   (16)

  \langle y_{t-1,i} \, y_{t,j} \rangle_{Q(Y_T)} = \frac{ p(y_{t-1,i} = 1, y_{t,j} = 1 \mid X_T)^{\beta} }{ \sum_{i=1}^{K} \sum_{j=1}^{K} p(y_{t-1,i} = 1, y_{t,j} = 1 \mid X_T)^{\beta} }.   (17)

These quantities are calculated with the well-known forward-backward algorithm.
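Equations (16) and (17) simply raise the usual smoothed posteriors to the power β and renormalize, so, given the marginals from an ordinary forward-backward pass, the tempered expectations are only a few lines. A minimal sketch, where `gamma_fb` and `xi_fb` (our names) hold p(y_t = k | X_T) and p(y_{t-1} = i, y_t = j | X_T):

```python
import numpy as np

def tempered_expectations(gamma_fb, xi_fb, beta):
    """Tempered VB-E statistics for DA-VB HMM, eqs. (16)-(17).

    gamma_fb : array (T, K), smoothed state marginals p(y_t = k | X_T)
    xi_fb    : array (T-1, K, K), pairwise marginals p(y_{t-1} = i, y_t = j | X_T)
    """
    g = gamma_fb ** beta
    g /= g.sum(axis=1, keepdims=True)              # eq. (16): <y_{t,k}>
    xi = xi_fb ** beta
    xi /= xi.sum(axis=(1, 2), keepdims=True)       # eq. (17): <y_{t-1,i} y_{t,j}>
    return g, xi
```

At β = 1 the forward-backward output passes through unchanged, recovering the conventional VB-E step.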

Figure 2. Learning process of the DA-VB method with the initial value (μ̄_1, μ̄_2) = (0, 6). The method successively finds the global optimal solution. Panels (a)-(f) show the free energy surfaces F_β corresponding to particular values of β, with the starting point and convergence point of the VB-EM steps marked on each surface.

The parameter priors over π, the rows of A, and the rows of C are Dirichlet distributions:

  \varphi(\pi) = \mathrm{Dir}(\{\pi_1, ..., \pi_K\} \mid u^{(\pi)}),   (18)

  \varphi(A) = \prod_{j=1}^{K} \mathrm{Dir}(\{a_{j1}, ..., a_{jK}\} \mid u^{(A)}),   (19)

  \varphi(C) = \prod_{j=1}^{K} \mathrm{Dir}(\{c_{j1}, ..., c_{jL}\} \mid u^{(C)}),   (20)

where

  \mathrm{Dir}(\{a_1, ..., a_K\} \mid u) = \frac{ \Gamma\!\left( \sum_{j=1}^{K} u_j \right) }{ \prod_{j=1}^{K} \Gamma(u_j) } \prod_{j=1}^{K} a_j^{u_j - 1}.   (21)

The VB-M step of DA-VB is then given by

  r(\pi) = \mathrm{Dir}(\{\pi_1, ..., \pi_K\} \mid \{w_1^{\pi}, ..., w_K^{\pi}\}),   (22)

  r(A) = \prod_{i=1}^{K} \mathrm{Dir}(\{a_{i1}, ..., a_{iK}\} \mid \{w_{i1}^{A}, ..., w_{iK}^{A}\}),   (23)

  r(C) = \prod_{i=1}^{K} \mathrm{Dir}(\{c_{i1}, ..., c_{iL}\} \mid \{w_{i1}^{C}, ..., w_{iL}^{C}\}),   (24)

where

  w_j^{\pi} = \beta\!\left( u_j^{(\pi)} + \langle y_{1,j} \rangle_{Q(Y_T)} - 1 \right) + 1,   (25)

  w_{ij}^{A} = \beta\!\left( u_j^{(A)} + \sum_{t=2}^{T} \langle y_{t-1,i} \, y_{t,j} \rangle_{Q(Y_T)} - 1 \right) + 1,   (26)

  w_{ij}^{C} = \beta\!\left( u_j^{(C)} + \sum_{t=1}^{T} x_{t,j} \langle y_{t,i} \rangle_{Q(Y_T)} - 1 \right) + 1.   (27)

5. Experiments
To demonstrate the ability of the DA-VB HMM to find globally optimal solutions, we performed an experiment on synthetic data. The sample data set was generated as follows: using standard regular-expression notation, the three types of sequences, (abc), (acb), and (a b ), are concatenated, switching to another sequence type with probability 0.1 and repeating with probability 0.8. The true model structure for these sequences is shown in figure 3(d). The training data consisted of 5 sequences of length 5 symbols. They are synthetic data modified from those used in [2]. We chose an HMM with K = 15 hidden states to allow for some redundancy. VB HMMs can learn by pruning redundant hidden states [2]; in other words, the posterior means of the transition probabilities into the redundant hidden states automatically converge to zero. For the DA-VB method, we set β_init = 0.6 and β_new ← β_current × 1.2. The prior hyperparameters were set to u_j^{(π)} = u_j^{(A)} = 1/K and u_j^{(C)} = 1/L for all j.

Figure 3(a-c) shows the histograms of the free energies after convergence across 5 random initializations. The conventional VB method reaches a maximum free energy of 113 in less than 1% of the trials. On the other hand, the DA-VB method reaches a maximum free energy of about 118, which was never reached by the conventional VB method, in about 62% of the trials. Figures 3(e) and (f) show typical model structures obtained by DA-VB and conventional VB, respectively. The model structure obtained most frequently by DA-VB corresponds to the true model structure, whereas the model structure obtained most frequently by conventional VB is too complex. DA-VB eliminated the redundant hidden states appropriately and used seven hidden states, which is the optimal number, while conventional VB used more than seven hidden states in most trials.

Beal's annealing method [2] corresponds to replacing equations (25)-(27) with w_j^{π} = u_j^{(π)} + β ⟨y_{1,j}⟩_{Q(Y_T)}, w_{ij}^{A} = u_j^{(A)} + β Σ_{t=2}^{T} ⟨y_{t-1,i} y_{t,j}⟩_{Q(Y_T)}, and w_{ij}^{C} = u_j^{(C)} + β Σ_{t=1}^{T} x_{t,j} ⟨y_{t,i}⟩_{Q(Y_T)}. As can be seen in figure 3(c), this method was ineffective in our experiment.
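The contrast drawn above between equations (25)-(27) and Beal's annealing comes down to where β enters the Dirichlet weights, which the following sketch makes explicit. Here `y1`, `xi_sum`, and `emit_sum` are our names for the tempered expectations ⟨y_{1,j}⟩, Σ_{t=2}^T ⟨y_{t-1,i} y_{t,j}⟩, and Σ_{t=1}^T x_{t,j} ⟨y_{t,i}⟩ delivered by the VB-E step, and the hyperparameters u may be scalars or arrays.

```python
def davb_dirichlet_weights(u_pi, u_a, u_c, y1, xi_sum, emit_sum, beta):
    """Eqs. (25)-(27): the whole (u + expected count - 1) is scaled by beta."""
    w_pi = beta * (u_pi + y1 - 1.0) + 1.0        # eq. (25), shape (K,)
    w_a = beta * (u_a + xi_sum - 1.0) + 1.0      # eq. (26), shape (K, K)
    w_c = beta * (u_c + emit_sum - 1.0) + 1.0    # eq. (27), shape (K, L)
    return w_pi, w_a, w_c

def beal_dirichlet_weights(u_pi, u_a, u_c, y1, xi_sum, emit_sum, beta):
    """Beal's annealing [2]: only the expected counts are scaled by beta."""
    return u_pi + beta * y1, u_a + beta * xi_sum, u_c + beta * emit_sum
```

At β = 1 both reduce to the standard VB-M update, hyperparameters plus expected counts; for β < 1 the DA-VB weights also flatten the prior contribution, which matches the φ(θ)^β factor in equation (9).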

Figure 3. (a)-(c): Histograms of the converged free energies across random initializations for (a) the VB method, (b) the DA-VB method with β_init = 0.6, and (c) VB with Beal's annealing method. (d) The true HMM structure of the synthetic data. The HMM structures obtained by (e) DA-VB (F = 118) and (f) conventional VB (F = 122). The transition probabilities and emission probabilities are obtained from the posterior means ⟨a_{ij}⟩_{r(A)} and ⟨c_{im}⟩_{r(C)}. Unused hidden states and the initial hidden-state prior ⟨π_i⟩_{r(π)} are omitted for the sake of viewability.

6. Conclusion
We developed a novel deterministic annealing method for VB by introducing an inverse temperature parameter β based on a statistical mechanics analogy. Applying the method to a simple mixture of Gaussians model and to hidden Markov models, we showed that it improves the ability of the VB method to find optimal solutions. Our method can easily be applied to other statistical models with only minor modification of the conventional VB method, i.e., by adding the parameter β. We believe that our method will be a powerful tool for finding the statistical structure underlying real-world data.

Acknowledgments
This work was supported in part by Grants-in-Aid for Scientific Research on Priority Areas No. 1827 and No. 18793, a Grant-in-Aid for Scientific Research (C) No. 16593, and the Director's fund of BSI, RIKEN.

References
[1] Attias H 1999 Proc. 15th Conf. on Uncertainty in Artificial Intelligence 21
[2] Beal M J 2003 PhD thesis, University College London
[3] Watanabe S, Minami Y, Nakamura A and Ueda N 2002 Advances in Neural Information Processing Systems vol 15 (Cambridge, MA: MIT Press) 1261
[4] Ueda N and Ghahramani Z 2002 Neural Netw. 15 1223
[5] Ueda N and Nakano R 1998 Neural Netw. 11 271
[6] Ghahramani Z and Hinton G E 2000 Neural Comput. 12 963
[7] MacKay D 1997 Unpublished manuscript