Factorized Asymptotic Bayesian Inference for Mixture Modeling


Factorized Asymptotic Bayesian Inference for Mixture Modeling

Ryohei Fujimaki, NEC Laboratories America, Department of Media Analytics
Satoshi Morinaga, NEC Corporation, Information and Media Processing Research Laboratories

Abstract

This paper proposes a novel Bayesian approximation inference method for mixture modeling. Our key idea is to factorize the marginal log-likelihood using a variational distribution over latent variables. An asymptotic approximation, a factorized information criterion (FIC), is obtained by applying the Laplace method to each of the factorized components. In order to evaluate FIC, we propose factorized asymptotic Bayesian inference (FAB), which maximizes an asymptotically-consistent lower bound of FIC. FIC and FAB have several desirable properties: 1) asymptotic consistency with the marginal log-likelihood, 2) automatic component selection on the basis of an intrinsic shrinkage mechanism, and 3) parameter identifiability in mixture modeling. Experimental results show that FAB outperforms state-of-the-art VB methods.

1 Introduction

Model selection is one of the most difficult and important challenges in machine learning problems, in which we optimize a model representation as well as model parameters. Bayesian learning provides a natural and sophisticated way to address the issue [1, 3]. A central task in Bayesian model selection is the evaluation of the marginal log-likelihood. Since exact evaluation is often computationally and analytically intractable, a number of approximation algorithms have been studied. One well-known precursor is the Bayes information criterion (BIC) [18], an asymptotic second-order approximation using the Laplace method.

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.
Since BIC assumes regularity conditions that ensure the asymptotic normality of the maximum likelihood (ML) estimator, it loses its theoretical justification for non-regular models, including mixture models, hidden Markov models, multi-layer neural networks, etc. Further, it is known that the ML estimation does not have a unique solution for non-identifiable models, in which the mapping between parameters and functions is not one-to-one [22, 23] (such equivalent models are said to be in an equivalent class). For such non-regular and non-identifiable models, the generalization error of the ML estimator significantly degrades [24].

Among recent advanced Bayesian methods, we focus on variational Bayesian (VB) inference. A VB method maximizes a computationally-tractable lower bound of the marginal log-likelihood by applying variational distributions on latent variables and parameters. Previous studies have investigated VB methods for a variety of models [2, 6, 5, 12, 15, 16, 20] and have demonstrated their practicality. Further, theoretical studies [17, 21] have shown that VB methods can resolve the non-regularity and non-identifiability issues. One disadvantage, however, is that latent variables and model parameters are partitioned into sub-groups that are independent of each other in the variational distributions. While this independence makes computation tractable, the lower bound can be loose, since their dependency is essential in the true distribution.

This paper proposes a novel Bayesian approximation inference method for a family of mixture models, which is one of the most significant examples of non-regular and non-identifiable model families. Our ideas and contributions are summarized as follows:

Factorized Information Criterion. We derive a new approximation of the marginal log-likelihood and refer to it as a factorized information criterion (FIC). A key observation is that, using a variational distribution over latent variables, the marginal log-likelihood can

be rewritten in a factorized representation in which the Laplace approximation is applicable to each of the factorized components. FIC is unique in the sense that it takes dependencies among latent variables and parameters into account, and FIC is asymptotically consistent with the marginal log-likelihood.

Factorized Asymptotic Bayesian Inference. We propose a new approximate inference algorithm and refer to it as factorized asymptotic Bayesian inference (FAB). Since FIC is defined on the basis of both observed and latent variables, the latter of which are not available, FAB maximizes a lower bound of FIC through an iterative optimization similar to the EM algorithm [8] and VB methods [2]. This FAB optimization is proved to monotonically increase the lower bound of FIC. It is worth noting that FAB provides a natural way to control model complexity not only in terms of the number of components but also in terms of the types of the individual components (e.g., the degrees of the components in a polynomial curve mixture (PCM)). FIC and FAB offer the following desirable properties:

1. The FIC lower bound is asymptotically consistent with FIC, and therefore with the marginal log-likelihood.

2. FAB automatically selects components on the basis of an intrinsic shrinkage mechanism different from prior-based regularization. FAB therefore mitigates overfitting even though it ignores the prior effect in the asymptotic sense, or even if we apply a non-informative prior [13].

3. FAB addresses the non-identifiability issue. In an equivalent class, FAB automatically selects the model which maximizes the entropy of the distributions over latent variables.

2 Preliminaries

This paper considers mixture models p(X|θ) = Σ_{c=1}^C α_c p_c(X|φ_c) for a D-dimensional random variable X. C and α = (α_1, ..., α_C) are the number of components and the mixture ratio. θ is θ = (α, φ_1, ..., φ_C). We allow different components p_c(X|φ_c) to differ in their model representations from one another¹. We assume that p_c(X|φ_c) satisfies the regularity conditions² while p(X|θ) is non-regular (A1). Assumption A1 is less stringent than the regularity conditions on p(X|θ), and many models (e.g., Gaussian mixture models (GMM), PCM, autoregressive mixture models) satisfy A1.

A model M of p(X|θ) is specified by C and the models S_c of p_c(X|φ_c), i.e., M = (C, S_1, ..., S_C). Although the representations of θ and φ_c depend on M and S_c, respectively, we omit them for notational simplicity. Let us denote the latent variable of X as Z = (Z_1, ..., Z_C). Z is a component assignment vector, and Z_c = 1 if X is generated from the c-th component, and Z_c = 0 otherwise. The marginal distribution on Z and the conditional distribution on X given Z can be described as p(Z|α) = Π_{c=1}^C α_c^{Z_c} and p(X|Z, φ) = Π_{c=1}^C (p_c(X|φ_c))^{Z_c}. The observed data and their latent variables are denoted as x^N = x_1, ..., x_N and z^N = z_1, ..., z_N, respectively, where z_n = (z_{n1}, ..., z_{nC}). We make another assumption that |log p(X, Z|θ)| < ∞ holds (A2). Condition A2 is discussed in Section 4.5.

¹ So-called heterogeneous mixture models [11]. In the example of PCM, the degree of p_1(X|φ_1) may be two while that of p_2(X|φ_2) may be one.
² The Fisher information matrix of p_c(X|φ_c) is non-singular around the maximum likelihood estimator.

3 Factorized Information Criterion for Mixture Models

A Bayesian selects the model which maximizes the model posterior p(M|x^N) ∝ p(M)p(x^N|M). With a uniform model prior p(M), we are particularly interested in p(x^N|M), which is referred to as the marginal likelihood or Bayesian evidence. VB methods for latent variable models [2, 5, 6, 21] consider a lower bound of the marginal log-likelihood:

log p(x^N|M) ≥ Σ_{z^N} ∫ q(z^N) q_θ(θ) log [p(x^N, z^N|θ) p(θ|M) / (q(z^N) q_θ(θ))] dθ,   (1)

where q and q_θ are variational distributions on z^N and θ, respectively.
On q and q_θ, z^N and θ are assumed to be independent of each other in order to make the lower bound computationally and analytically tractable. This, however, ignores significant dependency between z^N and θ, and in general the equality does not hold. In contrast, we consider the lower bound with a variational distribution on z^N only:

log p(x^N|M) ≥ Σ_{z^N} q(z^N) log [p(x^N, z^N|M) / q(z^N)]   (2)
            ≡ VLB(q, x^N, M).   (3)

Lemma 1 [2] guarantees that max_q {VLB(q, x^N, M)} is exactly consistent with log p(x^N|M).

Lemma 1 The inequality (3) holds for an arbitrary distribution q on z^N, and the equality is satisfied by q(z^N) = p(z^N|M, x^N).

Since this paper handles mixture models, we further assume mutual independence over z^N, i.e., q(z^N) = Π_{n=1}^N q(z_n) and q(z_n) = Π_{c=1}^C q(z_{nc})^{z_{nc}}. The lower bound (2) is the same as the one considered by the collapsed variational Bayesian (CVB) method [15, 20]. A key difference is that FAB employs an asymptotic second-order approximation of (2), which produces the desirable properties of FAB described in Section 1, while CVB methods employ second-order approximations of the variational parameters in each iterative update step.

Note that the numerator of (3) has the form of a parameter integration of p(x^N, z^N|M):

p(x^N, z^N|M) = ∫ p(z^N|α) Π_{c=1}^C p_c(x^N|z^N, φ_c) p(θ|M) dθ.   (4)

A key idea in FIC is to apply the Laplace method to the individual factorized distributions p(z^N|α) and p_c(x^N|z^N, φ_c) as follows³:

log p(x^N, z^N|θ) ≈ log p(x^N, z^N|θ̄) − (N/2)[F̄_Z, (α − ᾱ)] − Σ_{c=1}^C ((Σ_n z_{nc})/2)[F̄_c, (φ_c − φ̄_c)],   (5)

where [A, a] represents the quadratic form aᵀAa for a matrix A and a vector a. Here we denote the ML estimator of log p(x^N, z^N|θ) as θ̄ = (ᾱ, φ̄_1, ..., φ̄_C). We discuss application of the Laplace method around the maximum a posteriori (MAP) estimator in Section 4.5.

F̄_Z and F̄_c are factorized sample approximations of Fisher information matrices, defined as follows:

F̄_Z = −(1/N) ∂² log p(z^N|α) / ∂α∂αᵀ |_{α=ᾱ},   (6)
F̄_c = −(1/Σ_n z_{nc}) ∂² log p_c(x^N|z^N, φ_c) / ∂φ_c∂φ_cᵀ |_{φ_c=φ̄_c}.

The following lemma is satisfied for F̄_Z and F̄_c.

Lemma 2 As N → ∞, F̄_Z and F̄_c respectively converge to the Fisher information matrices described as:

F_Z = −∫ [∂² log p(Z|α) / ∂α∂αᵀ]_{α=ᾱ} p(Z|α) dZ,   (7)
F_c = −∫ [∂² log p_c(X|φ_c) / ∂φ_c∂φ_cᵀ]_{φ_c=φ̄_c} p_c(X|φ_c) dX.   (8)

The proof follows from the law of large numbers, taking into account that the effective number of samples for the c-th component is Σ_n z_{nc}.

Then, by applying a prior with log p(θ|M) = O(1), p(x^N, z^N|M) can be asymptotically approximated as:

p(x^N, z^N|M) ≈ p(x^N, z^N|θ̄) · (2π)^{D_α/2} N^{−D_α/2} |F̄_Z|^{−1/2} · Π_{c=1}^C (2π)^{D_c/2} (Σ_n z_{nc})^{−D_c/2} |F̄_c|^{−1/2}.   (9)

Here, D_α ≡ D(α) = C − 1 and D_c ≡ D(φ_c), in which D(·) denotes dimensionality. On the basis of A1 and Lemma 2, both log |F̄_c|^{−1/2} and log |F̄_Z|^{−1/2} are O(1). By substituting (9) into (3) and ignoring asymptotically small terms, we obtain an asymptotic approximation of log p(x^N|M) = max_q {VLB(q, x^N, M)} (i.e., FIC) as follows:

FIC(x^N, M) ≡ max_q {J(q, θ̄, x^N)},   (10)
J(q, θ̄, x^N) = Σ_{z^N} q(z^N) { log p(x^N, z^N|θ̄) − (D_α/2) log N − Σ_{c=1}^C (D_c/2) log(Σ_n z_{nc}) − log q(z^N) }.

The following theorem justifies FIC as an approximation of the marginal log-likelihood.

Theorem 3 FIC(x^N, M) is asymptotically consistent with log p(x^N|M) under A1 and log p(θ|M) = O(1).

Sketch of Proof 3 Since both p_c(x^N|z^N, φ_c) and p(z^N|α) satisfy the regularity conditions (A1 and A2), their Laplace approximations in (9) are asymptotically consistent (for the same reason as with BIC [7, 18, 19]). Therefore, their product (9) asymptotically agrees with p(x^N, z^N|M). Applying Lemma 1 proves the theorem.

Despite our ignoring the prior effect, FIC still has the regularization term Σ_c (D_c/2) Σ_{z^N} q(z^N) log(Σ_n z_{nc}). This term has several interesting properties. First, a dependency between z^N and θ, which most VB methods ignore, explicitly appears. As in BIC, it is not the parameter or function representation of p_c(X|φ_c) but only its dimensionality D_c that plays an important role in the bridge between the latent variables and the parameters. Second, the complexity of the c-th component is adjusted by the value of Σ_{z^N} q(z^N) log(Σ_n z_{nc}). Therefore, components of different sizes are automatically regularized at their individually appropriate levels. Third, roughly speaking, with respect to these terms, it is preferable for the entropy of q(z^N) to be large. This makes the parameter estimation in FAB identifiable. More details are discussed in Sections 4.4 and 4.5.

³ The Laplace approximation is not applicable to p(x^N|θ) because of its non-regularity.
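To make these penalty terms concrete, here is a minimal numerical sketch (Python; not from the paper — the helper name is hypothetical, and using the responsibilities Σ_n q(z_{nc}) in place of the expectation over log(Σ_n z_{nc}) is our simplifying assumption) of the model-complexity part of (10):

```python
import numpy as np

def fic_penalty(q, d_c, n):
    """Model-complexity terms of FIC, Eq. (10), with E[z_nc] approximated
    by the responsibilities q[n, c].

    q   : (N, C) array of responsibilities q(z_nc)
    d_c : (C,) array of component parameter dimensionalities D_c
    n   : number of observations N
    Returns -(D_alpha / 2) log N - sum_c (D_c / 2) log(sum_n q_nc).
    """
    n_components = q.shape[1]
    d_alpha = n_components - 1          # dimensionality of the mixture ratio
    eff = q.sum(axis=0)                 # effective sample size per component
    return -0.5 * d_alpha * np.log(n) - 0.5 * np.sum(d_c * np.log(eff))
```

For comparison, BIC would charge every component the full (D_c/2) log N; FIC instead charges each component by the logarithm of its own effective sample size, which is what drives the shrinkage mechanism discussed in Section 4.4.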

4 Factorized Asymptotic Bayesian Inference Algorithm

4.1 Lower bound of FIC

Since θ̄ is not available in practice, we cannot evaluate FIC itself. Instead, FAB maximizes an asymptotically-consistent lower bound of FIC. We first derive the lower bound as follows:

Lemma 4 Let us define L(a, b) ≡ log b + (a − b)/b. FIC(x^N, M) is lower bounded as follows:

FIC(x^N, M) ≥ G(q, q̃, θ, x^N),   (11)
G(q, q̃, θ, x^N) ≡ Σ_{z^N} q(z^N) { log p(x^N, z^N|θ) − (D_α/2) log N − Σ_{c=1}^C (D_c/2) L(Σ_n z_{nc}, Σ_n q̃(z_{nc})) − log q(z^N) },

with arbitrary choices of q, θ, and a distribution q̃ on z^N (Σ_n q̃(z_{nc}) > 0).

Sketch of Proof 4 Since θ̄ is the ML estimator of log p(x^N, z^N|θ), log p(x^N, z^N|θ̄) ≥ log p(x^N, z^N|θ) is satisfied. Further, on the basis of the concavity of the logarithm function, log(Σ_n z_{nc}) ≤ L(Σ_n z_{nc}, Σ_n q̃(z_{nc})) is satisfied. Substituting these two inequalities into (10) yields (11).

In addition to the replacement of the unavailable estimator θ̄ with an arbitrary θ, the computationally intractable log(Σ_n z_{nc}) is linearized around Σ_n q̃(z_{nc}) with a new parameter (distribution) q̃. Now our problem is to solve the following maximization problem (note that q, θ, and q̃ are functions of M):

M*, q*, θ*, q̃* = arg max_{M, q, θ, q̃} G(q, q̃, θ, x^N).   (12)

The following lemma gives us a guide for optimizing the newly introduced distribution q̃. We omit the proof because of space limitations.

Lemma 5 If we fix q and θ, then q̃ = q maximizes G(q, q̃, θ, x^N).

With a finite number of model candidates M, we can use a two-stage algorithm in which we first solve the maximization of (12) for all candidates in M and then select the best model. When we optimize only the number of components C (e.g., as in a GMM), we need to solve the inner maximization C_max times (the maximum number of components searched). However, if we intend to optimize S_1, ..., S_C (e.g., as in PCM or an autoregressive mixture), we must avoid a combinatorial scalability issue.
FAB provides a natural way to do this because the objective function separates into independent parts in terms of S_1, ..., S_C.

4.2 Iterative Optimization Algorithm

Let us first fix C and consider the following optimization problem:

S*, q*, θ*, q̃* = arg max_{S, q, θ, q̃} G(q, q̃, θ, x^N),   (13)

where S = (S_1, ..., S_C). As the inner maximization, FAB solves (13) on the basis of iterations of two sub-steps (V-step and M-step). Let the superscript (t) represent the t-th iteration.

V-step The V-step optimizes the variational distribution q as follows:

q^(t) = arg max_q { G(q, q̃ = q^(t−1), θ^(t−1), x^N) }.   (14)

More specifically, q^(t) is obtained as follows:

q^(t)(z_{nc}) ∝ α_c^(t−1) p_c(x_n|φ_c^(t−1)) exp(−D_c / (2 α_c^(t−1) N)),   (15)

where we use Σ_n q^(t−1)(z_{nc}) = α_c^(t−1) N. The regularization terms which we discussed in Section 3 appear here as exponentiated update terms, i.e., exp(−D_c / (2 α_c^(t−1) N)). Roughly speaking, (15) indicates that smaller and more complex components are likely to become smaller through the iterations. The V-step differs from the E-step of the EM algorithm in the term exp(−D_c / (2 α_c^(t−1) N)). This makes an essential difference between FAB inference and ML estimation with the EM algorithm (Sections 4.4 and 4.5).

M-step The M-step optimizes the models of the individual components S_c and the parameter θ as follows:

S^(t), θ^(t) = arg max_{S, θ} G(q^(t), q̃ = q^(t), θ, x^N).   (16)

G(q, q̃, θ, x^N) has a significant advantage in avoiding the combinatorial scalability issue in S selection. Since G(q, q̃, θ, x^N) has no cross term between components (if we fix q), we can separately optimize S_c and θ as follows:

α_c^(t) = Σ_n q^(t)(z_{nc}) / N,   (17)
S_c^(t), φ_c^(t) = arg max_{S_c, φ_c} H_c(q^(t), q̃^(t), φ_c, x^N),   (18)
H_c(q^(t), q̃^(t), φ_c, x^N) = Σ_n q^(t)(z_{nc}) log p_c(x_n|φ_c) − (D_c/2) L(Σ_n q^(t)(z_{nc}), Σ_n q̃^(t)(z_{nc})).
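As an illustration, the V-step update (15) for a one-dimensional GMM can be sketched as follows (Python; function and variable names are hypothetical, not from the paper):

```python
import numpy as np

def fab_v_step(x, alpha, mu, var, d_c):
    """FAB V-step, Eq. (15), for a 1-D GMM: the EM E-step responsibilities
    multiplied by the shrinkage factor exp(-D_c / (2 alpha_c N)).

    x : (N,) data; alpha, mu, var, d_c : (C,) per-component parameters.
    Returns the (N, C) matrix of normalized responsibilities q(z_nc).
    """
    n = len(x)
    # Gaussian densities p_c(x_n | phi_c), one column per component
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    # unnormalized update with the exponentiated regularization term
    unnorm = alpha * dens * np.exp(-d_c / (2.0 * alpha * n))
    return unnorm / unnorm.sum(axis=1, keepdims=True)   # normalize over c
```

Setting d_c = 0 recovers the ordinary EM E-step; a positive d_c exponentially down-weights components whose mixture ratio α_c is small relative to D_c/N.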

H_c(q^(t), q̃^(t), φ_c, x^N) is the part of G(q^(t), q̃^(t), θ, x^N) related to the c-th component. With a finite set of component candidates, in (18), we first optimize φ_c for each element of a fixed S_c and then select the optimal one by comparing them. Note that, if we use a single component type (e.g., GMM), our M-step eventually becomes the same as that in the EM algorithm.

4.3 Convergence and Consistency

After the t-th iteration, FIC is lower bounded as:

FIC(x^N, M) ≥ FIC_LB^(t)(x^N, M) ≡ G(q^(t), q̃^(t), θ^(t), x^N).   (19)

We have the following theorem, which guarantees that FAB monotonically increases FIC_LB^(t)(x^N, M) over the iterative optimization and converges to a local maximum under certain regularity conditions.

Theorem 6 For the iteration of the V-step and the M-step, the following inequality is satisfied:

FIC_LB^(t)(x^N, M) ≥ FIC_LB^(t−1)(x^N, M).   (20)

Sketch of Proof 6 The theorem is proved as follows: G(q^(t), q̃^(t), θ^(t), x^N) ≥ G(q^(t), q̃^(t), θ^(t−1), x^N) ≥ G(q^(t), q̃^(t−1), θ^(t−1), x^N) ≥ G(q^(t−1), q̃^(t−1), θ^(t−1), x^N). The first and the third inequalities arise from (16) and (14), respectively. The second inequality arises from Lemma 5.

We then use the following stopping criterion for the FAB iteration steps:

FIC_LB^(t)(x^N, M) − FIC_LB^(t−1)(x^N, M) ≤ ε,   (21)

where ε is an optimization tolerance parameter and is set to 1e-6 in Section 5.

In addition to the monotonic property of FAB, we present the theorem below, which asymptotically supports FAB. Let us denote M^(t) = (C, S^(t)), and let the superscripts T and * represent the number of steps at convergence and the true model/parameters, respectively.

Theorem 7 FIC_LB^(T)(x^N, M^(T)) is asymptotically consistent with FIC(x^N, M^(T)).

Sketch of Proof 7 1) In (15), exp(−D_c / (2 α_c^(T) N)) converges to one, and S^(T−1) = S^(T) holds without loss of generality. Therefore, the FAB algorithm asymptotically reduces to the EM algorithm.
This means that θ^(T) converges to the ML estimator θ̂ of log p(x^N|θ). Then, Σ_{z^N} q^(T)(z^N) (log p(x^N, z^N|θ̄) − log p(x^N, z^N|θ^(T))) / N → 0 holds. 2) On the basis of the law of large numbers, Σ_n z_{nc}/N converges to α_c*. Then, Σ_{z^N} q^(T)(z^N) Σ_{c=1}^C (log(Σ_n z_{nc}) − log(Σ_n q^(T)(z_{nc}))) / N → 0 holds. Substituting 1) and 2) into (11) proves the theorem.

The following theorem arises from Theorems 3 and 7 and guarantees that FAB is asymptotically capable of evaluating the marginal log-likelihood itself:

Theorem 8 FIC_LB^(T)(x^N, M^(T)) is asymptotically consistent with log p(x^N|M^(T)).

4.4 Shrinkage Mechanism

As noted in Sections 3 and 4.2, the term exp(−D_c / (2 α_c^(t) N)) in (15), which arises from Σ_c (D_c/2) Σ_{z^N} q(z^N) log(Σ_n z_{nc}) in (10), plays a significant role in the control of model complexity. Fig. 1 illustrates the shrinkage effects of FAB in the case of a GMM. For D_c = 10 and N = 50, for example, a component with α_c < 0.2 is severely regularized, and such components are automatically shrunk (see the left graph). With increasing N, the shrinkage effect decreases, and small components manage to survive in the FAB optimization process. The right graph shows the shrinkage effect in terms of the dimensionality. In a low dimensional space (e.g., D_c = 1, 3), small components are not strictly regularized. For D_c = 20 and N = 100, however, a component with α_c < 0.4 may be shrunk, and only one component might survive because of the constraint Σ_{c=1}^C α_c = 1.

Let us consider the gradient of exp(−D_c / (2 α_c N)), which is derived as ∂ exp(−D_c / (2 α_c N)) / ∂α_c = (D_c / (2 α_c² N)) exp(−D_c / (2 α_c N)) ≥ 0. This indicates that the iteration of FAB accelerates the shrinkage in the order of O(α_c²). More intuitively speaking, the smaller a component is, the more likely it is to be shrunk. The above observations explain the overfitting mitigation mechanism in FAB. The shrinkage effect of FAB is different from the prior-based regularization of the MAP estimation.
In fact, it appears despite our asymptotically ignoring the prior effect, and even if a non-informative prior or a flat prior is applied. In terms of practical implementation, α_c does not reach exactly zero because (15) takes an exponentiated update. Therefore, in our FAB procedure, we apply a thresholding-based shrinkage of small components after the V-step, as follows:

q^(t)(z_{nc}) ← 0 if Σ_n q^(t)(z_{nc}) < δ, and q^(t)(z_{nc}) ← q^(t)(z_{nc}) / Q_n^(t) otherwise,   (22)
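A minimal sketch of the thresholding operation (22) (Python; names hypothetical — as a design choice, shrunk components are dropped from the responsibility matrix and the remaining rows renormalized, which is equivalent to setting their q(z_nc) to zero):

```python
import numpy as np

def shrink_components(q, delta):
    """Thresholding-based shrinkage, Eq. (22): remove components whose
    effective size sum_n q_nc falls below delta, then renormalize each row.

    q : (N, C) responsibilities; delta : threshold (e.g., 0.01 * N).
    Returns the pruned (N, C') responsibilities and the boolean keep mask.
    """
    keep = q.sum(axis=0) >= delta            # effective sizes vs. threshold
    q = q[:, keep]                           # drop shrunk components
    return q / q.sum(axis=1, keepdims=True), keep
```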

Figure 1: Shrinkage effects of FAB in a GMM (D_c = D + D(D+1)/2, where D and D_c are, respectively, the data dimensionality and the parameter dimensionality). The horizontal and vertical axes represent α_c and exp(−D_c / (2 α_c N)), respectively. The effect is evident when the number of data is not sufficient for the parameter dimensionality.

Algorithm 1 FAB_two: Two-stage FAB
input: x^N, C_max, S, ε
output: C*, S*, θ*, q*(z^N), FIC*_LB
1: FIC*_LB = −∞
2: for C = 1, ..., C_max do
3:   Calculate FIC_LB^(T), C, S^(T), θ^(T), and q^(T)(z^N) by FAB_shrink(x^N, C_max = C, S, ε, δ = 0).
4: end for
5: Choose C*, S*, θ*, and q*(z^N) by (12).

δ and Q_n^(t) are a threshold value and a normalization constant ensuring Σ_{c=1}^C q^(t)(z_{nc}) = 1, respectively. We used the threshold value δ = 0.01N in Section 5. With the shrinkage operation, FAB does not require the two-stage optimization in (12). Starting from a sufficiently large number of components C_max, FAB iteratively and simultaneously optimizes all of M = (C, S), θ, and q. Since FAB cannot revive shrunk components, it is a greedy algorithm, like the least angle regression method [9], which does not revive shrunk features. The two-stage algorithm and the shrinkage algorithm are shown here as Algorithms 1 and 2. S is the set of component candidates (e.g., S = {Gaussian} for a GMM; S = {0, ..., K_max} for PCM, where K_max is the maximum degree of the curves).

4.5 Identifiability

A well-known difficulty in ML estimation for mixture models is non-identifiability, as we have noted in Section 1. Let us consider a simple mixture model, p(X|a, b, c) = a p(X|b) + (1 − a) p(X|c). All models p(X|a, b, c = b) with arbitrary a are equivalent (i.e., in an equivalent class). Therefore, the ML estimator is not unique. Theoretical studies have shown that Bayesian estimators and VB estimators avoid the above issue and significantly outperform the ML estimator in terms of generalization error [17, 21, 24].
Algorithm 2 FAB_shrink: Shrinkage FAB
input: x^N, C_max, S, ε, δ
output: FIC_LB^(T), C, S^(T), θ^(T), q^(T)(z^N)
1: t = 0, FIC_LB^(0) = −∞
2: randomly initialize q^(0)(z^N).
3: while not converged do
4:   Calculate S^(t) and θ^(t) by (17) and (18).
5:   Check convergence by (21).
6:   Calculate q^(t+1) by (15).
7:   Shrink components by (22).
8:   t = t + 1
9: end while
10: T = t

In FAB, at the stationary (convergence) point of (15), the following equality is satisfied: Σ_n q*(z_{nc}) = α_c* N = Σ_n α_c* p_c(x_n|φ_c*) exp(−D_c / (2 α_c* N)) / Z_n, where Z_n = Σ_{c=1}^C α_c* p_c(x_n|φ_c*) exp(−D_c / (2 α_c* N)) is the normalization constant. For all pairs α_c*, α_{c'}* > 0, we have the following condition: Σ_n p_c(x_n|φ_c*) exp(−D_c / (2 α_c* N)) / Z_n = Σ_n p_{c'}(x_n|φ_{c'}*) exp(−D_{c'} / (2 α_{c'}* N)) / Z_n. Then, the necessary condition for φ_c = φ_{c'} is α_c = α_{c'}. Therefore, within an equivalent class, FAB chooses the model which maximizes the entropy of q and addresses the non-identifiability issue.

Unfortunately, FIC and FAB do not resolve another non-identifiability issue, i.e., divergence of the criterion. Without assumption A2, (10) and (11) can diverge to infinity (e.g., a GMM with a zero-variance component⁴). We can address this issue by applying the Laplace approximation around the MAP estimator θ_MAP of p(x^N, z^N|θ)p(θ) rather than around the ML estimator θ̄. In such a case, a flat conjugate prior might be used in order to avoid the divergence. While such priors do not provide regularization effects, FAB has an overfitting mitigation mechanism, as discussed above. Because of space limitations, we omit a detailed discussion of FIC and FAB with prior distributions.

5 Experiments

5.1 Polynomial Curve Mixture

We first evaluated FAB_shrink on PCM to demonstrate how its model selection works. We used the settings N = 300 and C_max = K_max = 10 in this evaluation⁵. Let Y be a random dependent variable of X.

⁴ Practically speaking, with the shrinkage (22), FAB does not have the divergence issue because small components are automatically removed.
⁵ Larger values of C_max and K_max did not make a large difference in the results, but they were hard to visualize.
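Putting the pieces together, a compact sketch of FAB_shrink (Algorithm 2) for a one-dimensional GMM might look as follows (Python; all names hypothetical; as simplifications, we monitor the observed log-likelihood rather than FIC_LB^(t) in the stopping rule (21), and each Gaussian component has D_c = 2):

```python
import numpy as np

def fab_gmm_1d(x, c_max=10, eps=1e-6, delta_frac=0.01, max_iter=500, seed=0):
    """Sketch of FAB_shrink (Algorithm 2) for a 1-D GMM with D_c = 2."""
    rng = np.random.default_rng(seed)
    n = len(x)
    q = rng.dirichlet(np.ones(c_max), size=n)   # random init of q(z_n)
    d_c = 2.0
    prev = -np.inf
    for _ in range(max_iter):
        # M-step (17)-(18): weighted ML fit per component (a single component
        # type, so this coincides with the EM M-step, as noted in Section 4.2)
        w = q.sum(axis=0)
        alpha = w / n
        mu = (q * x[:, None]).sum(axis=0) / w
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / w + 1e-8
        # V-step (15): E-step responsibilities times the shrinkage factor
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        unnorm = alpha * dens * np.exp(-d_c / (2 * alpha * n))
        q = unnorm / unnorm.sum(axis=1, keepdims=True)
        # shrinkage (22): drop components below the threshold delta = 0.01 N
        keep = q.sum(axis=0) >= delta_frac * n
        q = q[:, keep]
        q = q / q.sum(axis=1, keepdims=True)
        # stopping rule (21), using the log-likelihood as a simple surrogate
        ll = np.log((alpha[keep] * dens[:, keep]).sum(axis=1)).sum()
        if abs(ll - prev) <= eps:
            break
        prev = ll
    return q.shape[1], alpha[keep], mu[keep], var[keep]
```

Starting from C_max components, C, θ, and q are optimized simultaneously; components that lose effective mass are greedily removed and never revived, mirroring the algorithm's greedy character.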

Figure 2: Model selection procedure of FAB_shrink applied to PCM. The true curves are: 1) Y = 5 + ε, 2) Y = X² + ε, 3) Y = X + ε, and 4) Y = 0.5X³ + 2X + ε, where ε ∼ N(0, 1). The top shows the change in FIC_LB^(t) over iterations t. The bottom three graphs are the estimated curves at t = 1, 6, and 44.

PCM has mixed components p_c(Y|X, φ_c) = N(Y | X(S_c)β_c, σ_c²), where X(S_c) = (1, X, ..., X^{S_c}). The true model has the four curves shown in Fig. 2. We used the setting α* = (0.3, 0.2, 0.3, 0.2) for curves 1) - 4). x^N was uniformly sampled from [−5, 5]. Fig. 2 illustrates FIC_LB^(t) (top) and intermediate estimation results (bottom). From the top figure, we can confirm that FAB monotonically increases FIC_LB^(t), except at the points (dashed circles) of the shrinkage operation. In the initial state t = 1 (bottom left), q^(0)(z^N) is randomly initialized, and therefore the degrees of all ten curves are zero (S_c = 0). Over the iterations, FAB_shrink simultaneously searches over C, S, and θ. For example, at t = 15 (bottom center), we have one curve with S_c = 9, two curves with S_c = 2, and six curves with S_c = 0. Here two curves have already been shrunk. Finally, FAB_shrink selects the model consistent with the true model at t = 44 (bottom right). These results demonstrate the powerful simultaneous model selection procedure of FAB_shrink.

5.2 Gaussian Mixture Model

We applied FAB_two and FAB_shrink to GMM on artificial datasets and UCI datasets [10], and compared them with the variational Bayesian EM algorithm for GMM (VBEM) described in Section 10.2 of [3] and the variational Bayesian Dirichlet process GMM (VBDP) [5]. The former and the latter use a Dirichlet prior and a Dirichlet process prior for α, respectively. We used the implementations by M. E. Khan⁶ for VBEM and by K. Kurihara⁷ for VBDP. We used the default hyperparameter values in the software.
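The per-component model selection in the M-step (18), which the PCM experiment exercises over polynomial degrees, can be sketched as follows (Python; names hypothetical; we evaluate the penalty at q̃ = q, where by Lemma 5 the L-term reduces to log Σ_n q(z_nc)):

```python
import numpy as np

def select_degree(x, y, w, k_max):
    """M-step model selection, Eq. (18), for one PCM component: fit a
    weighted polynomial for each candidate degree k in S = {0, ..., K_max}
    and keep the degree maximizing the weighted log-likelihood minus the
    (D_c / 2) log(sum_n w) penalty (the L-term at q~ = q)."""
    n_eff = w.sum()
    best, best_score = 0, -np.inf
    for k in range(k_max + 1):
        phi = np.vander(x, k + 1, increasing=True)     # (1, x, ..., x^k)
        # weighted least squares for the coefficient vector beta
        wphi = phi * w[:, None]
        beta = np.linalg.solve(phi.T @ wphi + 1e-9 * np.eye(k + 1), wphi.T @ y)
        resid = y - phi @ beta
        var = (w * resid ** 2).sum() / n_eff           # weighted noise variance
        ll = -0.5 * n_eff * (np.log(2 * np.pi * var) + 1.0)
        d_c = k + 2                                    # k+1 coefficients + variance
        score = ll - 0.5 * d_c * np.log(n_eff)
        if score > best_score:
            best, best_score = k, score
    return best
```

Because the objective has no cross terms between components, this search runs independently per component, which is how FAB avoids the combinatorial scalability issue noted in Section 4.1.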
In the following experiments, C_max was set to C_max = 20 for all methods.

5.2.1 Artificial Data

We generated the true model with D = 15, C* = 5, and φ_c* = (µ_c*, Σ_c*) (mean and covariance), where the superscript * denotes the true model. The parameter values were randomly sampled as α_c* ∈ [0.4, 0.6] (before normalization) and µ_c* ∈ [−5, 5]. Σ_c* was generated as a_c Σ_c, where a_c ∈ [0.5, 1.5] is a scale parameter and Σ_c is generated using the gallery function in Matlab. The results are averages over ten runs.

The sufficient number of data is an important measure for evaluating asymptotic methods, and it is usually measured in terms of the empirical convergence of their criteria. Table 1 shows how the values of FIC_LB^(T)/N converge over N. Roughly speaking, N = … was sufficient for convergence (depending on the data dimensionality D). Our results indicate that FAB can be applied in actual practice to recent data analysis scenarios with large N values.

Table 1: Convergence of FIC_LB over data size N. The standard deviations are in parentheses.
N          | …      | …      | …      | …      | …      | …      | …
FAB_two    | (3.51) | (2.51) | (1.19) | (1.05) | (1.10) | (0.50) | (0.39)
FAB_shrink | (2.61) | (1.56) | (1.24) | (1.02) | (0.85) | (0.74) | (0.39)

We next compared the four methods on the basis of three evaluation metrics: 1) the Kullback-Leibler divergence (KL) between the true distribution and the estimated distribution (estimation error), 2) |C − C*| (model selection error), and 3) CPU time. For N ≤ 500, VBDP performed better than or comparably with FAB_two and FAB_shrink w.r.t. KL and |C − C*| (left and middle graphs of Fig. 3). With increasing N, the errors of the FABs significantly decrease while those of VBDP do not. This might be because the lower bound in (1) generally does not reach the marginal log-likelihood, as has been noted. VBEM was significantly inferior on both metrics. Another observation is that VBDP and VBEM were likely to respectively under-fit and over-fit their models, while we did not find such biases in the FABs.

Figure 3: Comparisons of the four methods in terms of KL (left), |C − C*| (middle), and CPU time (right).

While the KLs of the FABs are similar, FAB_shrink performed better in terms of |C − C*| because small components survived in FAB_two. Further, both FABs had a significant advantage over VBDP and VBEM in CPU time (right graph). In particular, while the CPU time of VBDP grew explosively with N, those of the FABs remained in a practical range. Since FAB_shrink does not use a loop to optimize C, it was significantly faster than FAB_two.

5.2.2 UCI Data

For the UCI datasets summarized in Table 2, we do not know the true distributions and therefore employed predictive log-likelihood as the evaluation metric. For the large-scale datasets, we randomly selected two thousand data points for training and used the rest of the data for prediction because of the scalability issue in VBDP.

Table 2: Predictive log-likelihood per datum. The standard deviations are in parentheses. The maximum number of training samples was limited to 2000 (parentheses in the N column), and the rest were used for testing. The best results and those not significantly worse than them are highlighted in boldface (one-sided t-test with 95% confidence).
data                 | N           | D | FAB_two  | FAB_shrink | VBEM        | VBDP
iris                 | …           | … | … (0.76) | … (0.68)   | … (0.64)    | … (0.57)
yeast                | …           | … | … (1.81) | … (3.38)   | 3.52 (3.80) | … (0.72)
cloud                | …           | … | … (2.81) | 9.09 (2.62)| 5.47 (2.62) | 2.67 (2.28)
wine quality         | 6497 (2000) | … | … (0.22) | … (0.17)   | … (0.17)    | … (0.06)
color moments        | … (2000)    | … | … (0.20) | … (0.21)   | … (0.21)    | … (0.11)
cooc texture         | … (2000)    | … | … (0.22) | … (0.46)   | … (0.13)    | 6.78 (0.28)
spoken arabic digits | … (2000)    | … | … (0.11) | … (0.09)   | … (0.09)    | … (0.19)
As shown in Table 2, the FABs gave better performance than VBDP and VBEM with a sufficient number of data (the scale O(10³) agrees with the results in the previous section). The trend of VBDP (under-fitting) and VBEM (over-fitting) was the same as in the previous section. Also, while the predictive log-likelihoods of FAB_two and FAB_shrink are competitive, FAB_shrink obtained more compact models (smaller C values) than FAB_two. Unfortunately, we have no space to show those results here.

6 Summary and Future Work

We have proposed an approximation of the marginal log-likelihood (FIC) and an inference method (FAB) for Bayesian model selection for mixture models, as an alternative to VB inference. We have given their justifications (asymptotic consistency, convergence, etc.) and analyzed the FAB mechanisms in terms of overfitting mitigation (shrinkage) and identifiability. Experimental results have shown that FAB outperforms state-of-the-art VB methods for a practical number of data in terms of both model selection performance and computational efficiency.

Our next step is to extend FAB to more general non-regular and non-identifiable models which contain continuous latent variables (e.g., factor analyzer models [12] and matrix factorization models [17]), time-dependent latent variables (e.g., hidden Markov models [16]), hierarchical latent variables [14, 4], etc. We believe that our idea, a factorized representation of the marginal log-likelihood in which the asymptotic approximation works, is widely applicable to them.

References

[1] T. Ando. Bayesian Model Selection and Statistical Modeling. Chapman and Hall/CRC, 2010.
[2] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
[4] C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 1998.
[5] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1), 2006.
[6] A. Corduneanu and C. M. Bishop. Variational Bayesian model selection for mixture distributions. In Proceedings of Artificial Intelligence and Statistics, pages 27-34, 2001.
[7] I. Csiszar and P. C. Shields. The consistency of the BIC Markov order estimator. Annals of Statistics, 28(6), 2000.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1):1-38, 1977.
[9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2), 2004.
[10] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[11] R. Fujimaki, Y. Sogawa, and S. Morinaga. Online heterogeneous mixture modeling with marginal and copula selection. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2011.
[12] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.
[13] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, 186, 1946.
[14] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[15] K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007.
[16] D. J. MacKay. Ensemble learning for hidden Markov models. Technical report, University of Cambridge, 1997.
[17] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12, 2011.
[18] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
[19] M. Stone. Comments on model selection criteria of Akaike and Schwarz. Journal of the Royal Statistical Society: Series B, 41(2), 1979.
[20] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, 2007.
[21] K. Watanabe and S. Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7:625-644, 2006.
[22] S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13:899-933, 2001.
[23] S. Watanabe. Algebraic Geometry and Statistical Learning. Cambridge University Press, 2009.
[24] K. Yamazaki and S. Watanabe. Mixture models and upper bounds of stochastic complexity. Neural Networks, 16, 2003.


More information

Taste for variety and optimum product diversity in an open economy

Taste for variety and optimum product diversity in an open economy Taste for variety and optimum produt diversity in an open eonomy Javier Coto-Martínez City University Paul Levine University of Surrey Otober 0, 005 María D.C. Garía-Alonso University of Kent Abstrat We

More information

SINCE Zadeh s compositional rule of fuzzy inference

SINCE Zadeh s compositional rule of fuzzy inference IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 14, NO. 6, DECEMBER 2006 709 Error Estimation of Perturbations Under CRI Guosheng Cheng Yuxi Fu Abstrat The analysis of stability robustness of fuzzy reasoning

More information

The gravitational phenomena without the curved spacetime

The gravitational phenomena without the curved spacetime The gravitational phenomena without the urved spaetime Mirosław J. Kubiak Abstrat: In this paper was presented a desription of the gravitational phenomena in the new medium, different than the urved spaetime,

More information

Simplification of Network Dynamics in Large Systems

Simplification of Network Dynamics in Large Systems Simplifiation of Network Dynamis in Large Systems Xiaojun Lin and Ness B. Shroff Shool of Eletrial and Computer Engineering Purdue University, West Lafayette, IN 47906, U.S.A. Email: {linx, shroff}@en.purdue.edu

More information

Market Segmentation for Privacy Differentiated Free Services

Market Segmentation for Privacy Differentiated Free Services 1 Market Segmentation for Privay Differentiated Free Servies Chong Huang, Lalitha Sankar arxiv:1611.538v [s.gt] 18 Nov 16 Abstrat The emerging marketplae for online free servies in whih servie providers

More information

Exploring the feasibility of on-site earthquake early warning using close-in records of the 2007 Noto Hanto earthquake

Exploring the feasibility of on-site earthquake early warning using close-in records of the 2007 Noto Hanto earthquake Exploring the feasibility of on-site earthquake early warning using lose-in reords of the 2007 Noto Hanto earthquake Yih-Min Wu 1 and Hiroo Kanamori 2 1. Department of Geosienes, National Taiwan University,

More information

Solving Constrained Lasso and Elastic Net Using

Solving Constrained Lasso and Elastic Net Using Solving Constrained Lasso and Elasti Net Using ν s Carlos M. Alaı z, Alberto Torres and Jose R. Dorronsoro Universidad Auto noma de Madrid - Departamento de Ingenierı a Informa tia Toma s y Valiente 11,

More information

A Functional Representation of Fuzzy Preferences

A Functional Representation of Fuzzy Preferences Theoretial Eonomis Letters, 017, 7, 13- http://wwwsirporg/journal/tel ISSN Online: 16-086 ISSN Print: 16-078 A Funtional Representation of Fuzzy Preferenes Susheng Wang Department of Eonomis, Hong Kong

More information