Deterministic annealing variant of variational Bayes method


International Workshop on Statistical-Mechanical Informatics 2007 (IW-SMI 2007)
Journal of Physics: Conference Series 95 (2008) 012015, doi:10.1088/1742-6596/95/1/012015

Deterministic Annealing Variant of Variational Bayes Method

Kentaro Katahira 1,2, Kazuho Watanabe 1 and Masato Okada 1,2
1 Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
2 RIKEN Brain Science Institute, Saitama, Japan
E-mail: katahira@mns.k.u-tokyo.ac.jp

Abstract. The Variational Bayes (VB) method is widely used as an approximation of the Bayesian method. Because the VB method is a gradient algorithm, it is often trapped by poor local optimal solutions. We introduce deterministic annealing to the VB method to overcome this local optimum problem. A temperature parameter is introduced into the free energy to control the annealing process deterministically. Applying the method to a mixture of Gaussians model and to hidden Markov models, we show that it can obtain the global optimum of the free energy and discover the optimal model structure.

1. Introduction
The Variational Bayes (VB) method has been widely used as an approximation of the Bayesian method for statistical models that have hidden variables [1, 2]. The VB method approximates the true Bayesian posterior distribution with a factored distribution, using an iterative algorithm similar to the expectation-maximization (EM) algorithm. In many applications it has shown better generalization performance than maximum likelihood estimation [3]. Because VB is a gradient algorithm, it suffers from a serious local optimum problem in practice; that is, the algorithm is often trapped by poor local solutions near the initial parameter value. To overcome this problem, Ueda and Ghahramani [4] applied a split-and-merge procedure to VB. However, their approach is limited to mixture models; it cannot be applied to other models, e.g., hidden Markov models (HMMs).

In this paper, we introduce a new deterministic annealing (DA) method for VB to overcome the local optimum problem. We call the method Deterministic Annealing VB (DA-VB). This method is general and can easily be applied to a wide class of statistical models. Such a deterministic annealing approach has been used for EM algorithms in the maximum likelihood context [5] and has been successfully applied to practical models [6]. Ueda and Nakano [5] reformulated maximizing the log-likelihood function as minimizing a free energy, defined as an effective cost function that depends on a temperature parameter. The objective function of VB is originally of the free energy form. Therefore, we can naturally use a statistical mechanics analogy to add a temperature parameter to it.

We applied the DA-VB method to a simple mixture of Gaussians model and to hidden Markov models (HMMs). In both models, DA-VB yields a better estimate than conventional VB. Unlike maximum likelihood estimation, VB can automatically eliminate redundant components in the model, so it has the potential to discover the appropriate model structure [1, 2].

However, the local optimum problem undermines this advantage. We demonstrate the ability of DA-VB to eliminate redundant components and find the optimal model structure.

The paper is organized as follows. In section 2, we introduce the DA-VB method. In section 3, we apply DA-VB to a simple mixture of Gaussians model as an illustrative example. In section 4, we apply DA-VB to hidden Markov models and show the experimental results. A conclusion follows in section 5.

2. The DA-VB method
We consider learning models p(x, y | θ), where x is an observable variable, y is a hidden variable, and θ is a set of parameters. Let X^n = {x_1, ..., x_n} be a set of observable data, Y^n = {y_1, ..., y_n} be the corresponding set of unobservable data, and φ(θ) be the prior distribution on the parameters. By introducing a distribution q(Y^n, θ), the marginal log-likelihood log p(X^n) can be lower bounded by appealing to Jensen's inequality:

  \log p(X^n) = \log \sum_{Y^n} \int d\theta \, p(X^n, Y^n, \theta)   (1)
              = \log \sum_{Y^n} \int d\theta \, q(Y^n, \theta) \frac{p(X^n, Y^n, \theta)}{q(Y^n, \theta)}   (2)
              \geq \langle \log p(X^n, Y^n, \theta) \rangle_{q(Y^n, \theta)} + H_{q(Y^n, \theta)}   (3)
              \equiv \mathcal{F}(q),   (4)

where H_{p(x)} is the entropy of the distribution p(x) and ⟨ · ⟩_{p(x)} denotes the expectation over p(x). We define the energy of a global configuration (Y^n, θ) to be the average of −log p(X^n, Y^n, θ) over the distribution q(Y^n, θ). Then, according to the basic equation of statistical physics, F = E − TS (F is the free energy, T the temperature, and S the entropy), F(q) corresponds to the negative free energy at temperature T = 1. Using this statistical mechanics analogy, we introduce the inverse temperature parameter β = 1/T into this free energy for annealing:

  \mathcal{F}_\beta(q) \equiv \langle \log p(X^n, Y^n, \theta) \rangle_{q(Y^n, \theta)} + \frac{1}{\beta} H_{q(Y^n, \theta)}.   (5)

Using the factorized form for q(Y^n, θ),

  q(Y^n, \theta) = Q(Y^n) \, r(\theta),   (6)

the free energy with inverse temperature parameter β becomes

  \mathcal{F}_\beta(Q, r) = \langle \log p(X^n, Y^n, \theta) \rangle_{Q(Y^n) r(\theta)} + \frac{1}{\beta} H_{Q(Y^n)} + \frac{1}{\beta} H_{r(\theta)}.   (7)

The VB method iteratively maximizes F(q) with respect to Q(Y^n) and r(θ) until it converges to a local maximum. The maximization of F(q) with respect to Q(Y^n) is called the VB-E step, and the maximization of F(q) with respect to r(θ) is called the VB-M step. The DA-VB method uses the free energy function with the temperature parameter, F_β(Q, r), as its objective function instead of F(q).
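As a concrete reading of equation (7), the following Python sketch evaluates F_β(Q, r) for a toy model in which both the hidden configuration and the parameter take finitely many values, so that the expectations and entropies are plain sums. The discretization and the function name are ours, purely for illustration, not part of the paper.

```python
import numpy as np

def free_energy_beta(log_joint, Q, r, beta):
    """F_beta(Q, r) of eq. (7) for a toy discrete model.

    log_joint[k, m] : log p(X^n, Y^n = k, theta = m), with the data X^n absorbed
                      into the table (hypothetical toy setup, not from the paper)
    Q : factor Q(Y^n), a probability vector over the K hidden configurations
    r : factor r(theta), a probability vector over the M parameter values
    """
    expected_log_joint = Q @ log_joint @ r       # <log p(X, Y, theta)>_{Q(Y) r(theta)}
    entropy_Q = -np.sum(Q * np.log(Q))           # H_{Q(Y^n)}
    entropy_r = -np.sum(r * np.log(r))           # H_{r(theta)}
    return expected_log_joint + (entropy_Q + entropy_r) / beta
```

Small β inflates the entropy terms, so maximizing F_β favors broad, high-entropy posteriors and smooths away shallow local maxima; as β grows toward 1 the expected log-joint takes over and F_β approaches the ordinary VB objective F(q).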

By adding an annealing loop to the original VB procedure, the following DA-VB method is obtained.

[DA-VB method]
1. Set β ← β_init (0 < β_init < 1) and t ← 0.
2. Initialize the hyperparameters of r^(0)(θ).
3. Perform the following VB-EM steps until convergence:

   VB-E step:
     Q^{(t+1)}(Y^n) = \frac{1}{C_Q} \exp\!\left( \beta \, \langle \log p(X^n, Y^n \mid \theta) \rangle_{r^{(t)}(\theta)} \right)   (8)

   VB-M step:
     r^{(t+1)}(\theta) = \frac{1}{C_r} \, \varphi(\theta)^{\beta} \exp\!\left( \beta \, \langle \log p(X^n, Y^n \mid \theta) \rangle_{Q^{(t)}(Y^n)} \right)   (9)

   Set t ← t + 1.
4. Increase the inverse temperature: β_new ← β_current × const.
5. If β_new < 1, repeat from step 3; otherwise stop.

Here, C_Q and C_r are normalization constants, and Q^{(t)} and r^{(t)} denote the current estimates of the posteriors after the t-th iteration. When β = 1, these VB-EM steps are the same as those of the original VB method.

The VB-E and VB-M steps in the above algorithm are derived as follows. Introducing Lagrange multipliers λ and λ′, the functional derivative of F_β with respect to Q(Y^n) is

  \langle \log p(X^n, Y^n, \theta) \rangle_{r(\theta)} - \frac{1}{\beta} \left( \log Q(Y^n) + 1 \right) + \lambda,   (10)

for fixed r(θ), and the functional derivative of F_β with respect to r(θ) is

  \langle \log p(X^n, Y^n, \theta) \rangle_{Q(Y^n)} - \frac{1}{\beta} \left( \log r(\theta) + 1 \right) + \lambda',   (11)

for fixed Q(Y^n). Equating these derivatives to zero gives equations (8) and (9).

Algorithmically, DA-VB is similar to the deterministic annealing variant of the EM algorithm (DAEM) [5]. The key difference is that in the DAEM algorithm the temperature parameter β modifies only the weight of the entropy of the hidden-variable posterior in the objective function, while in the DA-VB method β changes the weight of the entropy of the parameter posterior r(θ) as well as that of the hidden-variable posterior Q(Y^n). Beal described an annealing procedure for the VB of conjugate-exponential models [2], although its effects have not been examined yet. His annealing procedure corresponds to a special case of our DA-VB method that updates only the second β in the VB-M step while the first β in the VB-M step and the β in the VB-E step are kept fixed. As we show later, Beal's annealing method was ineffective at training hidden Markov models in our experiment.
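Before moving on to the examples, here is a minimal Python sketch of the outer annealing loop of the algorithm above. The callables `vb_e_step` and `vb_m_step` are hypothetical placeholders for the model-specific updates of equations (8) and (9); the inner loop uses a fixed number of sweeps instead of a convergence test, and a final pass at β = 1 is included so that the last VB-EM sweep coincides with the standard VB objective.

```python
def da_vb(data, r_init, vb_e_step, vb_m_step,
          beta_init=0.6, beta_factor=1.1, n_inner=100):
    """Sketch of the DA-VB outer loop (steps 1-5 above).

    vb_e_step(data, r, beta) -> Q    # eq. (8): tempered hidden-variable posterior
    vb_m_step(data, Q, beta) -> r    # eq. (9): tempered parameter posterior
    Both are model-specific and supplied by the caller.
    """
    beta = beta_init                         # step 1: 0 < beta_init < 1
    r = r_init                               # step 2: hyperparameters of r^(0)(theta)
    Q = None
    while True:
        for _ in range(n_inner):             # step 3: VB-EM steps at fixed beta
            Q = vb_e_step(data, r, beta)
            r = vb_m_step(data, Q, beta)
        if beta >= 1.0:                      # step 5: stop once beta has reached 1
            return Q, r
        beta = min(beta * beta_factor, 1.0)  # step 4: raise the inverse temperature
```

Setting beta_init = 1.0 makes the loop run the plain VB iteration once, which is the sense in which conventional VB is the special case β = 1.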

3. Illustrative example
To demonstrate the effect of β on the free energy F_β, we consider a simple one-dimensional, two-component mixture of Gaussians model [5]:

  p(x \mid \theta) = \sum_{k=1}^{2} \frac{\alpha_k}{\sqrt{2\pi}} \exp\!\left\{ -\frac{1}{2}(x - \mu_k)^2 \right\}.   (12)

The complete likelihood of the model is given by

  p(X^n, Y^n \mid \theta) = \prod_{k=1}^{2} \prod_{i=1}^{n} \left[ \frac{\alpha_k}{\sqrt{2\pi}} \exp\!\left\{ -\frac{1}{2}(x_i - \mu_k)^2 \right\} \right]^{y_i^k},   (13)

where y_i^k is not observed and represents the component from which the datum x_i was generated: if x_i was generated from the k-th component, then y_i^k = 1; otherwise y_i^k = 0. The prior over the means is a normal distribution with mean 0 and variance σ²:

  \varphi(\mu) = \prod_{k=1}^{2} N(\mu_k \mid 0, \sigma^2).   (14)

We fix the mixing coefficients to α_1 = 0.3 and α_2 = 0.7. Then the only parameter posteriors to be estimated are r(μ_1) and r(μ_2). Because r(μ_1) and r(μ_2) are parameterized by their own means (μ̄_1, μ̄_2), we can visualize the free energy surface F_β on the (μ̄_1, μ̄_2) plane. The VB-M step consists of updating the posterior according to

  r(\mu_k) = N\!\left( \mu_k \,\Big|\, \bar{\mu}_k, \; \frac{\sigma^2}{\beta(1 + \sigma^2 \bar{N}_k)} \right),   (15)

where

  \bar{\mu}_k = \frac{\sigma^2 \bar{N}_k}{1 + \sigma^2 \bar{N}_k} \bar{x}_k, \qquad
  \bar{y}_i^k = \frac{\exp\{\beta \gamma_i^k\}}{\sum_{j=1}^{2} \exp\{\beta \gamma_i^j\}}, \qquad
  \bar{N}_k = \sum_{i=1}^{n} \bar{y}_i^k, \qquad
  \bar{x}_k = \frac{1}{\bar{N}_k} \sum_{i=1}^{n} \bar{y}_i^k x_i,

  \gamma_i^k = \log \alpha_k - \log \sqrt{2\pi} - \frac{1}{2}(x_i - \bar{\mu}_k)^2 - \frac{\sigma^2}{2\beta(1 + \sigma^2 \bar{N}_k)}.

Five hundred samples were generated from this model with (μ_1, μ_2) = (4, −4); therefore, the global optimal posterior means are (μ̄_1, μ̄_2) = (4, −4). Figure 1 shows the learning process of the conventional VB method, that is, the special case of DA-VB with β_init = 1.0. In the trial with the initial value (μ̄_1, μ̄_2) = (0, 6), close to the local maximum, the algorithm converged to the local maximum, whereas it converged to the global maximum from the initial value (μ̄_1, μ̄_2) = (6, 0).

Figure 2 shows the learning process of the DA-VB method. We set the hyperparameter of the prior to σ² = 1. Note that the σ² of the prior is independent of the variance of the Gaussians of the model (in our case, 1). The choice of the prior σ² affects the estimation; e.g., for small σ², the probability that μ_k lies far from zero is small. However, as can be seen in the equation for μ̄_k, if the data size n is large enough and σ² is not too close to zero, the effect of the prior σ² is negligible. The starting point for each β is the convergence point of the previous annealing step. We set β_init = 0.6 and β_new = β_current × 1.1. When β is small (β = 0.6), F_β has only one maximum. Hence, for arbitrary initial parameter values, the VB-EM steps converge to that maximum (figure 2(a)). As β increases, two maxima appear, corresponding to the global optimal solution (μ̄_1, μ̄_2) = (4, −4) and a local optimal solution (μ̄_1, μ̄_2) = (−4, 4) (figure 2(b)). The form of the free energy surface F_β changes gradually as β increases. The maximum near the global optimal solution is always larger than the one near the local optimal solution. Therefore, the DA-VB method can reach the global optimal solution without being trapped by the local optimal solution.
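The updates above can be run directly. The following sketch implements the tempered VB-E and VB-M steps of equation (15) and the quantities below it for this two-component example; the data-generating means, mixing weights, and annealing schedule follow the text as reconstructed, while the initial effective counts, the fixed number of inner sweeps in place of a convergence test, and the random seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data as in section 3: two unit-variance components, fixed weights.
n = 500
alpha = np.array([0.3, 0.7])                  # fixed mixing coefficients
true_mu = np.array([4.0, -4.0])               # data-generating means
z = rng.choice(2, size=n, p=alpha)
x = rng.normal(true_mu[z], 1.0)

sigma2 = 1.0                                  # prior variance of mu_k
mu_bar = np.array([0.0, 6.0])                 # initialization near the local optimum
N_bar = n * alpha                             # initial effective counts (our choice)

beta, beta_factor = 0.6, 1.1                  # beta_init and the annealing factor
while True:
    for _ in range(200):                      # VB-EM sweeps at this temperature
        post_var = sigma2 / (beta * (1.0 + sigma2 * N_bar))   # variance in eq. (15)
        # VB-E step: tempered responsibilities y_bar[i, k]
        gamma = (np.log(alpha)
                 - 0.5 * np.log(2.0 * np.pi)
                 - 0.5 * (x[:, None] - mu_bar) ** 2
                 - 0.5 * post_var)            # last term: sigma^2 / (2 beta (1 + sigma^2 N_bar))
        logits = beta * gamma
        y_bar = np.exp(logits - logits.max(axis=1, keepdims=True))
        y_bar /= y_bar.sum(axis=1, keepdims=True)
        # VB-M step: effective counts, weighted sample means, posterior means of mu_k
        N_bar = y_bar.sum(axis=0)
        x_bar = y_bar.T @ x / N_bar
        mu_bar = sigma2 * N_bar * x_bar / (1.0 + sigma2 * N_bar)
    if beta >= 1.0:
        break
    beta = min(beta * beta_factor, 1.0)

print("estimated posterior means:", mu_bar)   # with annealing, should end up near (4, -4)
```

Note that the correction term in γ_i^k uses the effective counts from the previous sweep, which is one natural way to order the coupled updates.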

Figure 1. Learning process of the conventional VB method with two different initial values, (μ̄_1, μ̄_2) = (0, 6) and (6, 0). The algorithm starting from the initial point near the global maximum converges to the global maximum, while the one starting from the initial point near a local maximum converges to the local maximum.

4. DA-VB method for HMM
Here we derive the DA-VB method for hidden Markov models by extending the conventional VB method for HMMs (VB HMM) [2, 7]. Suppose a discrete sequence of L-valued symbols X_T = {x_1, ..., x_T}, where T is the length of the time series, was observed. We assume that x_t was produced by a K-valued discrete hidden state y_t and that the sequence of hidden states Y_T = {y_1, ..., y_T} was generated by a first-order Markov process. We represent the observed data by L-dimensional binary vectors with components x_{t,m} such that x_{t,m} = 1 if the output symbol at time t is m, and 0 otherwise. The hidden states y_t are likewise represented by K-dimensional binary vectors with components y_{t,k} such that y_{t,k} = 1 if the hidden state at time t is k, and 0 otherwise. The HMM has the following parameters:

  initial hidden-state prior:  π = {π_i},  π_i = p(y_1 = i)                 (K × 1)
  state transition matrix:     A = {a_{ij}},  a_{ij} = p(y_t = j | y_{t-1} = i)   (K × K)
  symbol emission matrix:      C = {c_{im}},  c_{im} = p(x_t = m | y_t = i)       (K × L)

where all parameters are non-negative and obey the normalization constraints Σ_{i=1}^K π_i = 1, Σ_{j=1}^K a_{ij} = 1, and Σ_{m=1}^L c_{im} = 1.

The VB-E step of DA-VB computes the expectations required for the VB-M step,

  \langle y_{t,k} \rangle_{Q(Y_T)} = \frac{ p(y_{t,k} = 1 \mid X_T)^{\beta} }{ \sum_{k=1}^{K} p(y_{t,k} = 1 \mid X_T)^{\beta} },   (16)

  \langle y_{t-1,i} \, y_{t,j} \rangle_{Q(Y_T)} = \frac{ p(y_{t-1,i} = 1, y_{t,j} = 1 \mid X_T)^{\beta} }{ \sum_{i=1}^{K} \sum_{j=1}^{K} p(y_{t-1,i} = 1, y_{t,j} = 1 \mid X_T)^{\beta} }.   (17)

These quantities are calculated with the well-known forward-backward algorithm.
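Equations (16) and (17) simply raise the usual smoothed posteriors to the power β and renormalize, so, given the marginals from an ordinary forward-backward pass, the tempered expectations are only a few lines. A minimal sketch, where `gamma_fb` and `xi_fb` (our names) hold p(y_t = k | X_T) and p(y_{t-1} = i, y_t = j | X_T):

```python
import numpy as np

def tempered_expectations(gamma_fb, xi_fb, beta):
    """Tempered VB-E statistics for DA-VB HMM, eqs. (16)-(17).

    gamma_fb : array (T, K), smoothed state marginals p(y_t = k | X_T)
    xi_fb    : array (T-1, K, K), pairwise marginals p(y_{t-1} = i, y_t = j | X_T)
    """
    g = gamma_fb ** beta
    g /= g.sum(axis=1, keepdims=True)              # eq. (16): <y_{t,k}>
    xi = xi_fb ** beta
    xi /= xi.sum(axis=(1, 2), keepdims=True)       # eq. (17): <y_{t-1,i} y_{t,j}>
    return g, xi
```

At β = 1 the forward-backward output passes through unchanged, recovering the conventional VB-E step.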

Figure 2. Learning process of the DA-VB method with the initial value (μ̄_1, μ̄_2) = (0, 6). The method successively finds the global optimal solution. Panels (a)-(f) show the free energy surfaces F_β corresponding to particular values of β, with the starting point and convergence point of the VB-EM steps marked on each surface.

The parameter priors over π, the rows of A, and the rows of C are Dirichlet distributions:

  \varphi(\pi) = \mathrm{Dir}(\{\pi_1, ..., \pi_K\} \mid u^{(\pi)}),   (18)

  \varphi(A) = \prod_{j=1}^{K} \mathrm{Dir}(\{a_{j1}, ..., a_{jK}\} \mid u^{(A)}),   (19)

  \varphi(C) = \prod_{j=1}^{K} \mathrm{Dir}(\{c_{j1}, ..., c_{jL}\} \mid u^{(C)}),   (20)

where

  \mathrm{Dir}(\{a_1, ..., a_K\} \mid u) = \frac{ \Gamma\!\left( \sum_{j=1}^{K} u_j \right) }{ \prod_{j=1}^{K} \Gamma(u_j) } \prod_{j=1}^{K} a_j^{u_j - 1}.   (21)

The VB-M step of DA-VB is then given by

  r(\pi) = \mathrm{Dir}(\{\pi_1, ..., \pi_K\} \mid \{w_1^{\pi}, ..., w_K^{\pi}\}),   (22)

  r(A) = \prod_{i=1}^{K} \mathrm{Dir}(\{a_{i1}, ..., a_{iK}\} \mid \{w_{i1}^{A}, ..., w_{iK}^{A}\}),   (23)

  r(C) = \prod_{i=1}^{K} \mathrm{Dir}(\{c_{i1}, ..., c_{iL}\} \mid \{w_{i1}^{C}, ..., w_{iL}^{C}\}),   (24)

where

  w_j^{\pi} = \beta\!\left( u_j^{(\pi)} + \langle y_{1,j} \rangle_{Q(Y_T)} - 1 \right) + 1,   (25)

  w_{ij}^{A} = \beta\!\left( u_j^{(A)} + \sum_{t=2}^{T} \langle y_{t-1,i} \, y_{t,j} \rangle_{Q(Y_T)} - 1 \right) + 1,   (26)

  w_{ij}^{C} = \beta\!\left( u_j^{(C)} + \sum_{t=1}^{T} x_{t,j} \langle y_{t,i} \rangle_{Q(Y_T)} - 1 \right) + 1.   (27)

5. Experiments
To demonstrate the ability of the DA-VB HMM to find globally optimal solutions, we performed an experiment on synthetic data. The sample data set was generated as follows: using standard regular-expression notation, the three types of sequences, (abc), (acb), and (a b ), are concatenated, switching to another sequence type with probability 0.1 and repeating with probability 0.8. The true model structure for these sequences is shown in figure 3(d). The training data consisted of 5 sequences of length 5 symbols. They are synthetic data modified from those used in [2]. We chose an HMM with K = 15 hidden states to allow for some redundancy. VB HMMs can learn by pruning redundant hidden states [2]; in other words, the posterior means of the transition probabilities into the redundant hidden states automatically converge to zero. For the DA-VB method, we set β_init = 0.6 and β_new ← β_current × 1.2. The prior hyperparameters were set to u_j^{(π)} = u_j^{(A)} = 1/K and u_j^{(C)} = 1/L for all j.

Figure 3(a-c) shows the histograms of the free energies after convergence across 5 random initializations. The conventional VB method reaches a maximum free energy of 113 in less than 1% of the trials. On the other hand, the DA-VB method reaches a maximum free energy of about 118, which was never reached by the conventional VB method, in about 62% of the trials. Figures 3(e) and (f) show typical model structures obtained by DA-VB and conventional VB, respectively. The model structure obtained most frequently by DA-VB corresponds to the true model structure, whereas the model structure obtained most frequently by conventional VB is too complex. DA-VB eliminated the redundant hidden states appropriately and used seven hidden states, which is the optimal number, while conventional VB used more than seven hidden states in most trials.

Beal's annealing method [2] corresponds to replacing equations (25)-(27) with w_j^{π} = u_j^{(π)} + β ⟨y_{1,j}⟩_{Q(Y_T)}, w_{ij}^{A} = u_j^{(A)} + β Σ_{t=2}^{T} ⟨y_{t-1,i} y_{t,j}⟩_{Q(Y_T)}, and w_{ij}^{C} = u_j^{(C)} + β Σ_{t=1}^{T} x_{t,j} ⟨y_{t,i}⟩_{Q(Y_T)}. As can be seen in figure 3(c), this method was ineffective in our experiment.
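The contrast drawn above between equations (25)-(27) and Beal's annealing comes down to where β enters the Dirichlet weights, which the following sketch makes explicit. Here `y1`, `xi_sum`, and `emit_sum` are our names for the tempered expectations ⟨y_{1,j}⟩, Σ_{t=2}^T ⟨y_{t-1,i} y_{t,j}⟩, and Σ_{t=1}^T x_{t,j} ⟨y_{t,i}⟩ delivered by the VB-E step, and the hyperparameters u may be scalars or arrays.

```python
def davb_dirichlet_weights(u_pi, u_a, u_c, y1, xi_sum, emit_sum, beta):
    """Eqs. (25)-(27): the whole (u + expected count - 1) is scaled by beta."""
    w_pi = beta * (u_pi + y1 - 1.0) + 1.0        # eq. (25), shape (K,)
    w_a = beta * (u_a + xi_sum - 1.0) + 1.0      # eq. (26), shape (K, K)
    w_c = beta * (u_c + emit_sum - 1.0) + 1.0    # eq. (27), shape (K, L)
    return w_pi, w_a, w_c

def beal_dirichlet_weights(u_pi, u_a, u_c, y1, xi_sum, emit_sum, beta):
    """Beal's annealing [2]: only the expected counts are scaled by beta."""
    return u_pi + beta * y1, u_a + beta * xi_sum, u_c + beta * emit_sum
```

At β = 1 both reduce to the standard VB-M update, hyperparameters plus expected counts; for β < 1 the DA-VB weights also flatten the prior contribution, which matches the φ(θ)^β factor in equation (9).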

Figure 3. (a)-(c): Histograms of the converged free energies across random initializations for (a) the VB method, (b) the DA-VB method with β_init = 0.6, and (c) VB with Beal's annealing method. (d) The true HMM structure of the synthetic data. The HMM structures obtained by (e) DA-VB (F = 118) and (f) conventional VB (F = 122). The transition probabilities and emission probabilities are obtained from the posterior means ⟨a_{ij}⟩_{r(A)} and ⟨c_{im}⟩_{r(C)}. Unused hidden states and the initial hidden-state prior ⟨π_i⟩_{r(π)} are omitted for the sake of viewability.

6. Conclusion
We developed a novel deterministic annealing method for VB by introducing an inverse temperature parameter β based on a statistical mechanics analogy. Applying the method to a simple mixture of Gaussians model and to hidden Markov models, we showed that it improves the ability of the VB method to find optimal solutions. Our method can easily be applied to other statistical models with only minor modification of the conventional VB method, i.e., by adding the parameter β. We believe that our method will be a powerful tool for finding the statistical structure underlying real-world data.

Acknowledgments
This work was supported in part by Grants-in-Aid for Scientific Research on Priority Areas No. 1827 and No. 18793, a Grant-in-Aid for Scientific Research (C) No. 16593, and the Director's fund of BSI, RIKEN.

References
[1] Attias H 1999 Proc. 15th Conf. on Uncertainty in Artificial Intelligence 21
[2] Beal M J 2003 PhD thesis, University College London
[3] Watanabe S, Minami Y, Nakamura A and Ueda N 2002 Advances in Neural Information Processing Systems vol 15 (Cambridge, MA: MIT Press) 1261
[4] Ueda N and Ghahramani Z 2002 Neural Netw. 15 1223
[5] Ueda N and Nakano R 1998 Neural Netw. 11 271
[6] Ghahramani Z and Hinton G E 2000 Neural Comput. 12 963
[7] MacKay D 1997 Unpublished manuscript