On Hyper-Parameter Estimation in Empirical Bayes: A Revisit of the MacKay Algorithm


Chune Li^1, Yongyi Mao^2, Richong Zhang^1 and Jinpeng Huai^1
1 School of Computer Science and Engineering, Beihang University, Beijing, P. R. China
2 School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada

This work is supported partly by the China 973 program (No. 2014CB340305), by the National Natural Science Foundation of China (No. , ), and by the Beijing Advanced Innovation Center for Big Data and Brain Computing. Richong Zhang is the corresponding author of this work (email: zhangrc@act.buaa.edu.cn).

Abstract

An iterative procedure introduced in MacKay's evidence framework is often used for estimating the hyper-parameter in empirical Bayes. Despite its effectiveness, the procedure has stayed primarily as a heuristic to date. This paper formally investigates the mathematical nature of this procedure and justifies it as a well-principled algorithmic framework. This framework, which we call the MacKay algorithm, is shown to be closely related to the EM algorithm under a certain Gaussian assumption.

1 INTRODUCTION

As a bridge between full Bayesian models and completely frequentist models, the empirical Bayesian method (also known as empirical Bayes in short, or type-II maximum likelihood) has been applied to many learning, inference or prediction applications (see, e.g., [Schäfer and Strimmer, 2005, Efron, 2012, Heskes, 2000, Yang et al., 2004, Frost and Savarino, 1986, DuMouchel and Pregibon, 2001]). The generic setup of empirical Bayes consists of the observed data D, the model parameter z that parametrizes the data likelihood function p(D|z), and the prior distribution p(z) of the model parameter. In the parametric version of empirical Bayes (Figure 1), the prior distribution is parameterized by a certain hyper-parameter α, namely, as p(z|α), and the philosophy of empirical Bayes is to estimate the hyper-parameter α from the observed data D. As empirical Bayes treats the model parameter z as a latent random variable, the estimation of the hyper-parameter naturally fits in the framework of the EM algorithm [Dempster et al., 1977, Carlin and Louis, 1997], and EM-based solutions have been developed to solve this problem in various application domains (see, e.g., [Inoue and Tanaka, 2001, Clyde and George, 2000]).

Figure 1: The generic model of the empirical Bayesian method.

Among other approaches to this problem, a technique introduced by MacKay is also widely adopted in practice [Bishop, 1999, Tipping, 2001, Tipping et al., 2003, Wipf and Nagarajan, 2008, Tan and Févotte, 2009]. In his evidence framework [MacKay, 1992b, MacKay and Neal, 1994, MacKay, 1995], MacKay considers a hierarchical Bayesian model similar to that in Figure 1 but with one distinction: an additional hyper-prior p(α|H), which depends on the choice H of model, is placed on the hyper-parameter α. In this setting, the evidence framework addresses three levels of inference problems: 1) given the hyper-parameter α, inferring z, 2) given the model H, inferring α, and 3) evaluating the model H. MacKay shows [MacKay, 1995] that the three levels of inference may be combined for prediction and for automatic shrinkage of parameter spaces (namely, Automatic Relevance Determination, or ARD) for neural network regression models. The second-level inference in the evidence framework is closely related to empirical Bayes. In particular, when a flat hyper-prior p(α|H) is placed on α, the objective of the second-level inference coincides with the objective of empirical Bayes.
For the second-level inference, MacKay introduces a procedure that alternates between inferring z given α (first-level inference) and inferring α given z. This procedure, which is well appreciated in some classical papers (e.g., [Bishop, 1999]) and highly cited in the ARD-related literature (e.g., [Bishop, 1999, Tipping, 2001, Tan and Févotte, 2009]), is called the MacKay algorithm in this paper.

Since its birth, the MacKay algorithm has been applied to various empirical Bayes models and its performance is often compared with the EM algorithm. For example, the MacKay algorithm is applied to the Bayesian PCA model [Bishop, 1999] and a non-negative matrix factorization model [Tan and Févotte, 2009] for automated shrinkage of the latent-space dimensions. In [Tipping, 2001], the MacKay algorithm is applied to SVM regression models for promoting sparsity, and it is shown to converge faster than the EM algorithm. Despite its effectiveness, the mathematical principle and optimization objective of the MacKay algorithm are however not well characterized in the literature to date. In MacKay's original exposition [MacKay and Neal, 1994, MacKay, 1995], the second-level inference task is clearly stated, but the justification of the iterative procedure (i.e., the MacKay algorithm) is mainly heuristic. In addition, since the MacKay algorithm is often implemented with a particular update procedure, known as the fixed-point iteration [Solomon, 2015, Hyvärinen, 1999], or the MacKay update, the boundary between the framework of the MacKay algorithm and MacKay's fixed-point update rule is often blurred in the literature. This makes the MacKay algorithm often understood in a narrow sense as this specific fixed-point update rule, rather than as an algorithmic framework.

In this paper, under a generic formulation of the empirical Bayes model (Figure 1), we re-formulate the MacKay algorithm as a coordinate-ascent procedure for solving a well-defined optimization problem. This optimization problem shares some similarity with the optimization problem underlying the EM algorithm: its objective function is a lower bound of the true objective function defining the optimization objective of empirical Bayes. Also similar to the EM algorithm, this lower bound is not far from the true objective function, and one of the two update steps in the coordinate-ascent procedure guarantees that the lower bound meets the true objective function. This understanding justifies the MacKay algorithm (whether or not implemented with the MacKay update) as a well-principled algorithmic framework, juxtaposed on equal footing with the EM framework.

Under a specific linear regression model, it has been observed that the MacKay update and the EM algorithm are closely related [Murphy, 2012]. It is then curious to investigate the relationship between the two algorithms in a more general setting. To that end, we show that as long as the posterior distribution p(z|D, α) is a Gaussian distribution, the objective function for the MacKay algorithm is simply a restriction of the EM objective function where two of the three variables are restricted to a curve. Under this Gaussian condition, we show that the MacKay optimization problem is a relaxation of the original optimization problem in empirical Bayes, and that the EM optimization problem is a relaxation of the MacKay optimization problem. In addition, the three problems attain their optimum at the same configuration of the hyper-parameter. These understandings then help to explain why the MacKay algorithm converges faster than the EM algorithm.

The objective of this paper is to rigorously formulate the MacKay algorithm and to investigate its connection to the EM algorithm. We have made an effort to be pedagogical in our presentation. In particular, we use a linear regression model and the Bayesian PCA model as running examples throughout the paper.
2 SETUP

The generic model for empirical Bayes is given in Figure 1, where D is the observed data, z is the model parameter, and α is the hyper-parameter. We note that both z and α can be a scalar, a vector, a matrix, or of an arbitrary form. The model is specified by the likelihood function p(D|z) and the prior distribution p(z|α). The objective of empirical Bayes is then to estimate the hyper-parameter α from the data D. Let

l(α) := log p(D|α) = log ∫ p(D|z) p(z|α) dz    (1)

be the log-marginal likelihood or the log-evidence [MacKay, 1992b] of the hyper-parameter α. Then the estimation of α can be naturally formulated as solving the following optimization problem.

Opt-I: Find α that maximizes l(α).

We now use the examples of linear regression and Bayesian PCA [Bishop, 1999] to illustrate this. Throughout the paper, we will use N(x; µ, Λ) to denote the Gaussian density function with variable x, mean µ and covariance matrix Λ, and we will use I_d to denote the d × d identity matrix, Tr(·) to denote the trace operator, Det(·) to denote the determinant operator, ||·|| to denote the L2 norm, and E_q[·] to denote expectation under the distribution q.

Linear Regression Example 1. Let D := {(x^(i), y^(i)) : i = 1, 2, ..., n} be the observed data, where each x^(i) is a vector in R^d, and each y^(i) is a scalar in R. The dependency of y^(i) on x^(i) is modelled as y^(i) = z^T x^(i) + ε^(i). Here ε^(i) is a zero-mean Gaussian noise with variance σ², and z is the model parameter, which is modelled as a d-dimensional spherical Gaussian variable with zero mean and variance 1/α. For simplicity, we assume that the parameter σ² is known and the objective of empirical Bayes is to estimate the hyper-parameter α. Then the objective function in Opt-I is:

l(α) = log ∫ N(z; 0, (1/α) I_d) ∏_{i=1}^n N(y^(i); z^T x^(i), σ²) dz
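For this example, z can in fact be marginalized out in closed form: writing X := [x^(1), ..., x^(n)] ∈ R^{d×n} and Y := [y^(1), ..., y^(n)]^T ∈ R^n (the same notation used in Linear Regression Example 2 below), standard Gaussian marginalization gives l(α) = log N(Y; 0, (1/α) X^T X + σ² I_n). The following minimal numpy/scipy sketch of this evaluation is ours and not part of the original exposition.

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_evidence_linreg(alpha, X, Y, sigma2):
        # l(alpha) for the linear regression example: marginalizing z out of
        # y^(i) = z^T x^(i) + eps^(i) gives Y ~ N(0, (1/alpha) X^T X + sigma2 I_n).
        n = X.shape[1]
        cov = X.T @ X / alpha + sigma2 * np.eye(n)
        return multivariate_normal.logpdf(Y, mean=np.zeros(n), cov=cov)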

Bayesian PCA Example 1. Following [Bishop, 1999], let D := {t^(i) : i = 1, 2, ..., n} be the observed data, where each t^(i) is a vector in R^m. Each observed vector t^(i) depends on a latent variable x^(i) ∈ R^d via t^(i) = z x^(i) + ε^(i). Here x^(i) is a zero-mean Gaussian variable with covariance I_d, ε^(i) is a spherical Gaussian noise with zero mean and known variance σ², and the parameter z ∈ R^{m×d} (d < m) is modelled as a random matrix whose kth column z_k is drawn from a spherical Gaussian distribution with zero mean and variance 1/α_k. Let α := {α_k : k = 1, 2, ..., d}. Then α is the hyper-parameter on z and Opt-I has the following objective function:

l(α) = log ∫ ( ∏_{k=1}^d N(z_k; 0, (1/α_k) I_m) ) ( ∏_{i=1}^n ∫ N(x^(i); 0, I_d) N(t^(i); z x^(i), σ² I_m) dx^(i) ) dz

The optimization problem Opt-I can sometimes be solved easily, for instance, in the above linear regression setting. In practice, however, this problem is usually difficult and requires special algorithmic techniques. The above Bayesian PCA setting is one such example. The EM approach to Opt-I is well known. In the remainder of this paper, we develop the MacKay algorithm for this problem. To compare and relate to EM, we also present the EM algorithm in parallel. The above linear regression and Bayesian PCA settings will be carried along our development as illustrative examples.

3 THE TWO ALGORITHMS

In this section, we will show that both the EM algorithm and the MacKay algorithm can be formulated as optimizing a lower bound of the objective function l(α) via coordinate ascent. While this is well known for the EM algorithm, it has been quite obscure for the MacKay algorithm.

3.1 THE EM ALGORITHM

The Expectation-Maximization (EM) algorithm [Dempster et al., 1977] is a classical method for maximizing the log-likelihood function or log-posterior density function in which certain latent variables have been integrated over. When applied to the optimization problem Opt-I, the EM algorithm implicitly constructs a lower bound F_EM of the objective function l(α):

F_EM(q, α) := E_q[ log ( p(D|z) p(z|α) / q(z) ) ] = ∫ q(z) log ( p(D|z) p(z|α) / q(z) ) dz    (2)

where q(·) is an arbitrary probability distribution on the space of z. By Jensen's inequality [Jensen, 1906], the following result is well known in the literature of the EM algorithm [Dempster et al., 1977].

Lemma 1. F_EM(q, α) ≤ l(α), where the equality is achieved if and only if q(z) is the posterior distribution p(z|D, α).

Instead of optimizing the original objective function l(α), we now define an alternative optimization problem.

OptEM: Find α and a distribution q that maximize F_EM(q, α).

The EM algorithm is then the coordinate-ascent solver for OptEM. More precisely, the update rule at the t-th iteration of the coordinate ascent is given below.

EM Algorithm
E-Step: q^(t) := argmax_q F_EM(q, α^(t)) = p(z|D, α^(t))
M-Step: α^(t+1) := argmax_α F_EM(q^(t), α) = argmax_α E_{q^(t)}[log p(z|α)]

We note that by Lemma 1, at the end of the E-Step,

F_EM(q^(t), α^(t)) = l(α^(t)).    (3)

This gives rise to the following lemma [Dempster et al., 1977].

Lemma 2. The iteration of the EM algorithm continuously increases the log-evidence function l(α) and therefore is guaranteed to converge.
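Viewed purely as the coordinate-ascent solver for OptEM, the framework fits in a few lines of generic code. The sketch below is ours; posterior and maximize_expected_log_prior are placeholders for the model-specific E-Step and M-Step computations.

    def em(alpha0, posterior, maximize_expected_log_prior, num_iters=100):
        # Coordinate ascent on F_EM(q, alpha): alternate the E-Step and the M-Step.
        alpha = alpha0
        for t in range(num_iters):
            q = posterior(alpha)                     # E-Step: q^(t) = p(z | D, alpha^(t))
            alpha = maximize_expected_log_prior(q)   # M-Step: argmax_alpha E_q[log p(z | alpha)]
        return alpha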

Linear Regression Example 2. For the linear regression model, let [x^(1), x^(2), ..., x^(n)] be denoted by a matrix X ∈ R^{d×n} and [y^(1), y^(2), ..., y^(n)]^T be denoted by a vector Y ∈ R^n. The objective function (2) is

F_EM(q, α) = (1/σ²) Σ_{i=1}^n y^(i) x^(i)T E_q[z] − (1/(2σ²)) E_q[ z^T ( Σ_{i=1}^n x^(i) x^(i)T + σ²α I_d ) z ] − E_q[log q(z)] − ((n+d)/2) log 2π − (n/2) log σ² + (d/2) log α − (1/(2σ²)) Σ_{i=1}^n (y^(i))²

It is easy to verify that the posterior distribution of z is also Gaussian and the E-Step update becomes

q^(t)(z) = N( z; µ_LR(α^(t)), K_LR(α^(t)) )    (4)

where

µ_LR(α) := (1/σ²) ( (1/σ²) X X^T + α I_d )^{-1} X Y,    (5)

and

K_LR(α) := ( (1/σ²) X X^T + α I_d )^{-1}.    (6)

For the M-Step update, noting that

F_EM(q^(t), α) = −(α/2) ( ||µ_LR(α^(t))||² + Tr(K_LR(α^(t))) ) + (d/2) log α + const,

it is possible to express the maximizing α for this function directly in terms of α^(t) as:

α^(t+1) = d / ( ||µ_LR(α^(t))||² + Tr(K_LR(α^(t))) )    (7)

That is, the updates in the E-Step and M-Step can be combined into the single update equation (7).

Bayesian PCA Example 2. For the BPCA model, the objective function (2) is

F_EM(q, α) = −(n/2) E_q[ log Det(z z^T + σ² I_m) ] − (1/2) Σ_{i=1}^n E_q[ t^(i)T (z z^T + σ² I_m)^{-1} t^(i) ] − Σ_{k=1}^d (α_k/2) E_q[ ||z_k||² ] − E_q[log q(z)] + (m/2) Σ_{k=1}^d log α_k + const,

where the constant collects terms that depend on neither q nor α. The M-Step update is then

α_k^(t+1) = m / E_{q^(t)}[ ||z_k||² ].    (8)

However, the E-Step update of q^(t) cannot be expressed in explicit form and one usually relies on various approximation techniques. For example, a sampling approach [Neal, 1993] may be used for this purpose. Later in this paper, we will discuss the approach that approximates the posterior as a Gaussian.

3.2 MACKAY ALGORITHM

In [MacKay, 1992a, MacKay, 1992c, MacKay, 1992b, MacKay, 1995], MacKay presented the influential evidence framework that addresses inference at three levels. A heuristic iterative procedure is introduced for the second-level inference, namely, for inferring the hyper-parameter α. Here, we re-formulate this procedure as a well-principled algorithmic framework, and call it the MacKay algorithm. To begin, note that the objective function l(α) can be expressed as

l(α) = log p(D, z|α) − log p(z|D, α), for any z (with p(z|D, α) non-zero).

Define

F_MacKay(z, α) := log p(D, z|α) − max_z log p(z|D, α).    (9)

The following lemma is easy to verify.

Lemma 3. F_MacKay(z, α) ≤ l(α), where the equality is achieved if and only if z = argmax_z log p(D, z|α).

We now introduce another optimization problem.

OptMacKay: Find α and z that maximize F_MacKay(z, α).

The MacKay algorithm is then defined as the following coordinate-ascent procedure for optimizing F_MacKay.

MacKay Algorithm
z-Step: z^(t) := argmax_z F_MacKay(z, α^(t)) = argmax_z log p(D, z|α^(t))
α-Step: α^(t+1) := argmax_α F_MacKay(z^(t), α)

At the end of the z-Step, by Lemma 3,

F_MacKay(z^(t), α^(t)) = l(α^(t)).    (10)

This property clearly parallels Equation (3) of the EM algorithm. That is, although the MacKay algorithm maximizes the lower bound F_MacKay of the true objective function l(α), the lower bound F_MacKay is in fact not far below l(α), and at the end of each z-Step update the lower bound meets l(α).
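The MacKay algorithm admits an equally short generic rendering, mirroring the EM sketch given earlier; again the sketch and the placeholder names (maximize_joint_in_z, alpha_step) are ours.

    def mackay(alpha0, maximize_joint_in_z, alpha_step, num_iters=100):
        # Coordinate ascent on F_MacKay(z, alpha): alternate the z-Step and the alpha-Step.
        alpha = alpha0
        for t in range(num_iters):
            z = maximize_joint_in_z(alpha)   # z-Step: argmax_z log p(D, z | alpha^(t))
            alpha = alpha_step(z)            # alpha-Step: argmax_alpha F_MacKay(z^(t), alpha)
        return alpha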

Then by the coordinate-ascent nature of the MacKay algorithm, we have the following lemma, parallel to Lemma 2 of the EM algorithm.

Lemma 4. The iteration of the MacKay algorithm continuously increases the log-evidence function l(α) and therefore is guaranteed to converge.

In the MacKay algorithm, it is worth noting that the update in the α-Step is usually performed with a fixed-point iteration procedure [Solomon, 2015, MacKay, 1992a, Bishop, 1999, Murphy, 2012], which we describe next for self-containedness.

Fixed-Point Iteration. Suppose that the equation ∂F_MacKay(z, α)/∂α = 0 can be reduced to the form α = h(α, z). The fixed-point iteration approach for the α-Step update in the MacKay algorithm is the following update rule:

α^(t+1) = h(α^(t), z^(t)).

Linear Regression Example 3. In the linear regression model, the lower bound F_MacKay is

F_MacKay(z, α) = (1/σ²) Σ_{i=1}^n y^(i) x^(i)T z − (1/(2σ²)) z^T ( Σ_{i=1}^n x^(i) x^(i)T + σ²α I_d ) z − (1/(2σ²)) Σ_{i=1}^n (y^(i))² − ((n+d)/2) log 2π − (n/2) log σ² + (d/2) log α + (1/2) log Det(2π K_LR(α)).

The z-Step turns out to be z^(t) = µ_LR(α^(t)). Note that

∂F_MacKay(z, α)/∂α = d/(2α) − (1/2) ||z||² − (1/2) Tr(K_LR(α)).

When setting ∂F_MacKay(z, α)/∂α = 0, we obtain

α = ( d − α Tr(K_LR(α)) ) / ||z||².

This gives rise to the fixed-point iteration of the α-Step:

α^(t+1) = ( d − α^(t) Tr(K_LR(α^(t))) ) / ||z^(t)||².    (11)

Bayesian PCA Example 3. For the BPCA model, the objective function F_MacKay is

F_MacKay(z, α) = −(n/2) log Det(z z^T + σ² I_m) − (1/2) Σ_{i=1}^n t^(i)T (z z^T + σ² I_m)^{-1} t^(i) − Σ_{k=1}^d (α_k/2) ||z_k||² − max_z log p(z|D, α) + (m/2) Σ_{k=1}^d log α_k + const.

The z-Step and α-Step updates then become

z^(t) = argmax_z p(z|D, α^(t)),
α^(t+1) = argmax_α [ − Σ_{k=1}^d (α_k/2) ||z_k^(t)||² − max_z log p(z|D, α) + (m/2) Σ_{k=1}^d log α_k ].

Since in general there does not exist a closed-form solution for the z-Step update, the two update equations cannot be made more explicit. In practice, a Gaussian approximation is applied to the posterior function p(z|D, α^(t)) in order to derive these update equations (see the next section).

It is perhaps worth noting that the z-Step update of the MacKay algorithm resembles the E-step update of an approximate version of the EM algorithm, known as Hard EM (or Viterbi EM in the context of hidden Markov models) [Allahverdyan and Galstyan, 2011]. However, the M-step of Hard EM/Viterbi EM is different from the α-Step of the MacKay algorithm, due to the fact that OptEM and OptMacKay have different objective functions. It is not clear whether there is a more direct connection between Hard EM and the MacKay algorithm bypassing the generic EM algorithm, although we suspect that the answer is no.
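To make Linear Regression Example 3 concrete, the following numpy sketch (ours, using the X, Y, σ² conventions of Example 2) alternates the z-Step z^(t) = µ_LR(α^(t)) with the fixed-point α-Step (11).

    import numpy as np

    def mackay_linreg(X, Y, sigma2, alpha0=1.0, num_iters=50):
        # MacKay algorithm for the linear regression example, Eqs. (5), (6) and (11).
        d = X.shape[0]
        alpha = alpha0
        for t in range(num_iters):
            K = np.linalg.inv(X @ X.T / sigma2 + alpha * np.eye(d))  # K_LR(alpha), Eq. (6)
            mu = K @ X @ Y / sigma2                                  # mu_LR(alpha), Eq. (5); z-Step
            alpha = (d - alpha * np.trace(K)) / (mu @ mu)            # fixed-point alpha-Step, Eq. (11)
        return alpha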

4 GAUSSIAN APPROXIMATION

As seen above, in both the EM algorithm and the MacKay algorithm, it is desirable to compute the posterior distribution of the model parameter, namely, p(z|D, α). In the case of EM, this is for updating q in the E-Step, and in the case of MacKay, this is for updating z in the z-Step. For some models, such as Bayesian PCA, it is difficult to carry out explicit computation of the posterior. A commonly used technique is to approximate the posterior as a Gaussian density function, namely,

p(z|D, α) ≈ N(z; µ, K), for some µ, K.    (12)

Clearly, the mean vector µ and the covariance matrix K of the Gaussian density depend on the hyper-parameter α and will be denoted by µ(α) and K(α) respectively. When p(z|D, α) is a continuous function of z, a common technique for obtaining such an approximation (12) is the following [MacKay, 1995]. Let ẑ maximize p(z|D, α). By Taylor-expanding log p(z|D, α) at z = ẑ, up to the second-order terms, it is easy to see that

log p(z|D, α) ≈ log p(ẑ|D, α) − (1/2) (z − ẑ)^T H(α) (z − ẑ),

where H(α) denotes the negative Hessian matrix of the function log p(z|D, α) at z = ẑ. This gives the customary approximation of p(z|D, α) as [MacKay, 1995]

p(z|D, α) ≈ N(z; ẑ, H(α)^{-1}).

Then µ(α) and K(α) in the Gaussian approximation (12) can be taken as

µ(α) = ẑ,  K(α) = H(α)^{-1}.    (13)

We note that in (12), the approximation is sometimes exact, namely, the strict equality is satisfied. In such cases, the Gaussian approximation as stated in (12) and (13) in fact holds precisely.

Linear Regression Example 4. As seen in (4), (5) and (6), the posterior distribution p(z|D, α) is indeed a Gaussian density function. That is, the Gaussian approximation (12) holds with equality, where µ(α) = µ_LR(α) and K(α) = K_LR(α) (defined in (5) and (6) respectively).

Bayesian PCA Example 4. Let Z be the vector representation of the matrix z, namely, Z is a length-dm vector obtained by stacking the columns of the matrix z. That is, Z := (z_1^T, z_2^T, ..., z_d^T)^T. Let Ẑ denote the maximizing configuration for the function p(Z|D, α), and similarly let H(α) denote the negative Hessian of log p(Z|D, α) at Z = Ẑ. The Gaussian approximation (12) then becomes

p(Z|D, α) ≈ N(Z; µ_BPCA(α), K_BPCA(α))    (14)

where µ_BPCA(α) := Ẑ and K_BPCA(α) := H(α)^{-1}. We note that in this case, (12) is only an approximation. In addition, since Ẑ and H(α)^{-1} are difficult to compute analytically, numerical solutions are usually sought.

In the remainder of this section, we assume that (12) holds with equality and further investigate the optimization problems in the EM and MacKay algorithms.
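When the posterior is not Gaussian, as in the Bayesian PCA example, µ(α) and K(α) in (13) are typically obtained numerically. The rough scipy sketch below is ours: it locates ẑ by minimizing −log p(z|D, α) and reuses the optimizer's BFGS approximation of the inverse Hessian as K(α); a more careful implementation would compute the exact Hessian at ẑ.

    from scipy.optimize import minimize

    def laplace_approximation(neg_log_posterior, z0):
        # Gaussian approximation (13): mu(alpha) = z_hat, K(alpha) = H(alpha)^{-1},
        # with the inverse Hessian approximated by BFGS rather than computed exactly.
        res = minimize(neg_log_posterior, z0, method="BFGS")
        return res.x, res.hess_inv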

4.1 EM

Recall that with the EM algorithm, the objective function in the optimization problem is F_EM in (2). Since the optimizing distribution q for any given α is the posterior p(z|D, α), the Gaussian assumption (12) on the posterior allows us to restrict q to the form N(z; u, S), parametrized by a mean vector u and a covariance S. The objective function F_EM(q, α) can then be re-expressed as F_EM(u, S, α). That is,

F_EM(u, S, α) = E_{N(z;u,S)}[ log ( p(D|z) p(z|α) / N(z; u, S) ) ] = E_{N(z;u,S)}[ log p(D|z) p(z|α) ] + (J/2) log 2π + (1/2) log Det(S) + J/2    (15)

where J is the length of the vector z. The following lemma is established by noting the following two-way factorization of p(D, z|α):

p(D, z|α) = p(D|z) p(z|α) = p(D|α) p(z|D, α).    (16)

Lemma 5. When the Gaussian approximation (12) holds with equality, the function F_EM can be re-expressed as

F_EM(u, S, α) = log N(u; µ(α), K(α)) − (1/2) Tr(K(α)^{-1} S) + log p(D|α) + (J/2) log 2π + (1/2) log Det(S) + J/2.

To derive the update rule for the EM algorithm under the Gaussian approximation (12), we prove the following results.

Lemma 6. For any given S and α, argmax_u F_EM(u, S, α) = µ(α).

Proof: Based on Lemma 5, we express F_EM(u, S, α) further:

F_EM(u, S, α) = −(J/2) log 2π − (1/2) log Det(K(α)) − (1/2) (u − µ(α))^T K(α)^{-1} (u − µ(α)) − (1/2) Tr(K(α)^{-1} S) + log p(D|α) + (J/2) log 2π + (1/2) log Det(S) + J/2.

Then

∂F_EM/∂u = −(1/2) ( K(α)^{-1} + (K(α)^{-1})^T ) (u − µ(α)).

By setting ∂F_EM/∂u to zero, we prove the result.

Lemma 7. For any u and α, argmax_S F_EM(u, S, α) = K(α).

Proof: By the expression of F_EM(u, S, α) in Lemma 5,

∂F_EM/∂S = −(1/2) ∂Tr(K(α)^{-1} S)/∂S + (1/2) ∂ log Det(S)/∂S.

By Tr(AB) = Σ_i Σ_j A_ij B_ji, we have Tr(K(α)^{-1} S) = Σ_i Σ_j (K(α)^{-1})_ij S_ji, so that

∂Tr(K(α)^{-1} S)/∂S_ij = (K(α)^{-1})_ji,  and hence  ∂Tr(K(α)^{-1} S)/∂S = (K(α)^{-1})^T.

On the other hand, since ∂Det(S)/∂S_ij = Det(S) (S^{-1})_ji, we have

∂ log Det(S)/∂S_ij = (S^{-1})_ji,  and hence  ∂ log Det(S)/∂S = (S^{-1})^T.

Then

∂F_EM/∂S = −(1/2) (K(α)^{-1})^T + (1/2) (S^{-1})^T.

The lemma is then proved by setting this derivative to zero.

As a corollary of Lemma 6, Lemma 7 and (15), the update rule of the EM algorithm can be established.

Lemma 8. When the Gaussian approximation (12) holds with equality, the EM algorithm becomes the following EM-Gauss Procedure.

EM-Gauss Procedure
E-Step: u^(t) := µ(α^(t)),  S^(t) := K(α^(t))
M-Step: α^(t+1) := argmax_α E_{N(z; u^(t), S^(t))}[ log p(z|α) ]

Linear Regression Example 5. In the linear regression model, the Gaussian assumption (12) holds true. The E-Step then reduces to (4), which can be integrated into the M-Step. The EM update can then be expressed as a single update equation (7), the same as that in Linear Regression Example 2.
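For comparison with the MacKay sketch given after Section 3, the EM-Gauss procedure for the same linear regression model collapses to the single update (7). The numpy sketch below is ours, under the same X, Y, σ² conventions.

    import numpy as np

    def em_linreg(X, Y, sigma2, alpha0=1.0, num_iters=50):
        # EM algorithm for the linear regression example, combined update Eq. (7).
        d = X.shape[0]
        alpha = alpha0
        for t in range(num_iters):
            K = np.linalg.inv(X @ X.T / sigma2 + alpha * np.eye(d))  # K_LR(alpha), Eq. (6)
            mu = K @ X @ Y / sigma2                                  # mu_LR(alpha), Eq. (5)
            alpha = d / (mu @ mu + np.trace(K))                      # Eq. (7)
        return alpha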

Bayesian PCA Example 5. Note that µ_BPCA(α) is a vector of length dm and K_BPCA(α) is a dm × dm matrix. Let µ_BPCA,k(α) denote the component of µ_BPCA(α) corresponding to the z_k component of Ẑ, and let K_BPCA,k(α) denote the sub-matrix of K_BPCA(α) that serves as the covariance matrix of z_k. With the Gaussian approximation (14) holding with equality, the update (8) of α becomes

α_k^(t+1) = m / ( ||µ_BPCA,k(α^(t))||² + Tr(K_BPCA,k(α^(t))) ).

4.2 MACKAY

Lemma 9. When the Gaussian approximation (12) holds with equality, the function F_MacKay in (9) becomes

F_MacKay(z, α) = log p(D, z|α) + (1/2) log Det(2π K(α)).

Proof: This lemma follows from max_z log p(z|D, α) = −(1/2) log Det(2π K(α)).

Lemma 10. When the Gaussian approximation (12) holds with equality, for any α, argmax_z log p(D, z|α) = µ(α).

The proof of this lemma follows the same line as that of Lemma 6. The MacKay algorithm under the Gaussian approximation (12) can then be established from Lemma 10 and Lemma 9.

Lemma 11. When the Gaussian approximation (12) holds with equality, the MacKay algorithm becomes the following MacKay-Gauss Procedure.

MacKay-Gauss Procedure
z-Step: z^(t) := µ(α^(t))
α-Step: α^(t+1) := argmax_α [ log p(z^(t)|α) + (1/2) log Det(K(α)) ]

Linear Regression Example 6. Since the Gaussian assumption (12) holds true in linear regression, the updates of the MacKay algorithm are those in Linear Regression Example 3.

Bayesian PCA Example 6. Assuming that the Gaussian approximation (12) holds with equality, the z-Step and α-Step updates become

Z^(t) = µ_BPCA(α^(t)),
α^(t+1) = argmax_α [ − Σ_{k=1}^d (α_k/2) ||z_k^(t)||² + (1/2) log Det(K_BPCA(α)) + (m/2) Σ_{k=1}^d log α_k ].

When setting ∂F_MacKay(z, α)/∂α_k = 0, we can obtain

α_k = h(α_k, z) = ( m − α_k Tr(K_BPCA,k(α)) ) / ||z_k||².

This gives rise to the fixed-point iteration of the α-Step:

α_k^(t+1) = ( m − α_k^(t) Tr(K_BPCA,k(α^(t))) ) / ||z_k^(t)||².

4.3 THE RELATIONSHIP BETWEEN MACKAY AND EM

As shown earlier, the MacKay and EM algorithms correspond to solving two different optimization problems. However, we will show next that when the Gaussian approximation (12) holds exactly, the two algorithms are closely related. First note that F_EM is a trivariate function whereas F_MacKay is a bivariate function. The theorem below suggests that if the Gaussian approximation of the posterior distribution p(z|D, α) is exact, then by setting the covariance variable S of F_EM to the covariance matrix of the posterior, the function F_EM reduces to F_MacKay.

Theorem 1. When the Gaussian approximation (12) holds with equality, F_EM(u, K(α), α) = F_MacKay(u, α), where K(α) is defined in (13).

Proof: Suppose that the Gaussian approximation (12) holds with equality. Invoking (16), we have

F_MacKay(z, α) = log p(D, z|α) + (1/2) log Det(2π K(α)) = log N(z; µ(α), K(α)) + log p(D|α) + (1/2) log Det(2π K(α)).

But by Lemma 5, we have

F_EM(z, K(α), α) = log N(z; µ(α), K(α)) − (1/2) Tr(K(α)^{-1} K(α)) + log p(D|α) + (J/2) log 2π + (1/2) log Det(K(α)) + J/2 = log N(z; µ(α), K(α)) + log p(D|α) + (1/2) log Det(2π K(α)) = F_MacKay(z, α).

This proves the theorem.

Theorem 1 suggests that the function F_MacKay is a restriction of the function F_EM. Denote by C the set of all (S, α) configurations with S = K(α). That is, C is the curve on the (S, α) plane specified by S = K(α). Under this notation, F_MacKay is the function F_EM with the variables (S, α) restricted to the curve C.

Theorem 2. When the Gaussian approximation (12) holds with equality,

F_MacKay(u, α) = max_S F_EM(u, S, α),    (17)
l(α) = max_u F_MacKay(u, α).    (18)

Proof: Denote S* := argmax_S F_EM(u, S, α) = K(α). Thus

max_S F_EM(u, S, α) = F_EM(u, S*, α) = F_EM(u, K(α), α).

Then equation (17) holds by Theorem 1. On the other hand, by Lemma 3, F_MacKay(z, α) ≤ l(α) and the equality can be achieved. We thus obtain equation (18).

The following result follows immediately from Theorem 2.

Corollary 1. The optimizing configurations for Opt-I, OptEM and OptMacKay are identical in α.

Theorem 2 and Corollary 1 essentially suggest that OptMacKay is a relaxation of Opt-I, that OptEM is a relaxation of OptMacKay, and that such successive relaxations do not alter the solution of the original problem Opt-I.

Lemma 12. The EM-Gauss Procedure is identical to the 3-way coordinate ascent on F_EM, namely, iterating over the following three steps:

u^(t) := argmax_u F_EM(u, S^(t−1), α^(t)),
S^(t) := argmax_S F_EM(u^(t), S, α^(t)),
α^(t+1) := argmax_α F_EM(u^(t), S^(t), α).

Proof: This follows from the fact that in the EM-Gauss Procedure, the update of u is independent of S and the update of S is independent of u.

Since OptEM is a relaxation of OptMacKay, it has higher degrees of freedom during optimization. This extra degree of freedom is fully explored in the three-way coordinate ascent of EM-Gauss, making its convergence slower than that of MacKay-Gauss. This slower convergence of EM-Gauss can also be understood from another perspective, in which MacKay-Gauss and EM-Gauss are both considered as optimizing the function F_EM.
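Lemma 12's reading of EM-Gauss as a three-way coordinate ascent can be written schematically as follows; the sketch is ours, with the three argmax routines left abstract. The two-way sweep performed by MacKay-Gauss is stated in the lemma that follows.

    def em_gauss_sweep(u, S, alpha, argmax_u, argmax_S, argmax_alpha):
        # One pass of the 3-way coordinate ascent on F_EM of Lemma 12.
        u = argmax_u(S, alpha)        # u^(t)       = argmax_u F_EM(u, S^(t-1), alpha^(t))
        S = argmax_S(u, alpha)        # S^(t)       = argmax_S F_EM(u^(t), S, alpha^(t))
        alpha = argmax_alpha(u, S)    # alpha^(t+1) = argmax_alpha F_EM(u^(t), S^(t), alpha)
        return u, S, alpha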

Lemma 13. The MacKay-Gauss Procedure is equivalent to the following two-way coordinate ascent on F_EM:

u^(t) := argmax_u F_EM(u, S^(t−1), α^(t−1)),
(S^(t), α^(t)) := argmax_{(S, α) ∈ C} F_EM(u^(t), S, α).

Following directly from Lemma 11 and Theorem 1, this lemma suggests that MacKay-Gauss can be viewed as optimizing the same objective function F_EM as EM-Gauss, but taking a particular coordinate-ascent path, namely, alternating between maximization over u and maximization over (S, α) along the curve C. This allows MacKay-Gauss to take a shorter path than EM-Gauss.

Figure 2: The convergence of both the EM algorithm and the MacKay algorithm from the initial configuration α = 1 to the final configuration α ≈ 0.1. (a) The (α, u)-trajectories of the EM and MacKay algorithms, where an arbitrary component (in this case, the second component of u(α)) of the vector u is taken as a representative for u. (b) The function F_EM evaluated along the EM trajectory and the function F_MacKay evaluated along the MacKay trajectory, together with the log-evidence function l(α) plotted using its closed-form expression.

4.4 Experiments

Experiments are performed to study the dynamics of the MacKay algorithm and the EM algorithm. We generate a simulated dataset D for a linear regression model according to the setup in Linear Regression Example 1 with n = 300, d = 200, σ² = 10, α = 0.1, where each x^(i) is drawn uniformly at random from the open interval (0, 1). Both the EM algorithm (in Linear Regression Example 2) and the MacKay algorithm (in Linear Regression Example 3) are used to estimate α from D. For both algorithms, α is initialized to 1 and σ² is treated as known. The optimization trajectories of the two algorithms in Figure 2(a) show that the MacKay algorithm converges faster towards the fixed point (top right corner) than the EM algorithm. In Figure 2(b), we see that both the MacKay algorithm and the EM algorithm increase their respective objective functions along their optimization paths, but MacKay achieves a higher value of the log-evidence function l(α) than EM at every iteration step.
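For readers who wish to approximately reproduce this experiment, the sketch below is ours: it generates data under our reading of the stated setup (each component of x^(i) drawn uniformly from the open interval (0, 1)) and runs the two linear regression sketches, mackay_linreg and em_linreg, given earlier; the random seed and variable names are our choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma2, alpha_true = 300, 200, 10.0, 0.1
    X = rng.uniform(0.0, 1.0, size=(d, n))                   # columns are the x^(i)
    z = rng.normal(0.0, 1.0 / np.sqrt(alpha_true), size=d)   # z ~ N(0, (1/alpha) I_d)
    Y = X.T @ z + rng.normal(0.0, np.sqrt(sigma2), size=n)   # y^(i) = z^T x^(i) + eps^(i)

    alpha_mackay = mackay_linreg(X, Y, sigma2, alpha0=1.0)
    alpha_em = em_linreg(X, Y, sigma2, alpha0=1.0)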
5 Concluding Remarks

In his influential evidence framework, MacKay presented practical Bayesian methods for inference at the parameter level, at the hyper-parameter level and at the model level. For inference at the hyper-parameter level, MacKay introduced a heuristic procedure that iterates between estimating the parameter for a given hyper-parameter setting and estimating the hyper-parameter for the previous parameter setting. Although this procedure is widely adopted in empirical Bayesian methods, its mathematical principle had not been carefully explored prior to this work. In this paper, we formulate this procedure as a well-principled algorithmic framework, and call it the MacKay algorithm. We show that the MacKay algorithm, like the EM algorithm, can be understood as a coordinate-ascent solution to optimizing a lower bound of the objective function in empirical Bayes. Although this lower bound is different from the lower bound that is optimized by the EM algorithm, we show that as long as the posterior distribution of the parameter is a Gaussian density function, the two algorithms are closely related. In particular, the EM optimization problem, the MacKay optimization problem, and the original empirical Bayes optimization problem all have the same solution. In addition, the MacKay problem is a relaxation of the original problem, and the EM problem is a relaxation of the MacKay problem. This understanding provides an intuitive explanation as to why the MacKay algorithm converges faster than the EM algorithm.

We believe that the close relationship between the MacKay algorithm and the EM algorithm revealed in this paper strongly depends on the Gaussian condition. Although this paper does not show the necessity of this condition, we believe that, in the case of a non-Gaussian posterior or non-Gaussian models, this relationship will break down, and the MacKay algorithm will diverge from the EM algorithm towards its own optimization objective and along its own optimization path. This makes the MacKay algorithm a framework in its own right. We hope that this paper inspires more applications of the MacKay algorithm to more general models, a territory that appears completely unexplored.

References

[Allahverdyan and Galstyan, 2011] Allahverdyan, A. and Galstyan, A. (2011). Comparative analysis of Viterbi training and maximum likelihood estimation for HMMs. In Advances in Neural Information Processing Systems.

[Bishop, 1999] Bishop, C. M. (1999). Bayesian PCA. In Advances in Neural Information Processing Systems.

[Carlin and Louis, 1997] Carlin, B. P. and Louis, T. A. (1997). Bayes and empirical Bayes methods for data analysis. Statistics and Computing, 7.

[Clyde and George, 2000] Clyde, M. and George, E. I. (2000). Flexible empirical Bayes estimation for wavelets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4).

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological).

[DuMouchel and Pregibon, 2001] DuMouchel, W. and Pregibon, D. (2001). Empirical Bayes screening for multi-item associations. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

[Efron, 2012] Efron, B. (2012). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, volume 1. Cambridge University Press.

[Frost and Savarino, 1986] Frost, P. A. and Savarino, J. E. (1986). An empirical Bayes approach to efficient portfolio selection. Journal of Financial and Quantitative Analysis, 21(3).

[Heskes, 2000] Heskes, T. (2000). Empirical Bayes for learning to learn. In ICML.

[Hyvärinen, 1999] Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3).

[Inoue and Tanaka, 2001] Inoue, J.-i. and Tanaka, K. (2001). Dynamics of the maximum marginal likelihood hyperparameter estimation in image restoration: Gradient descent versus expectation and maximization algorithm. Phys. Rev. E, 65.

[Jensen, 1906] Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30(1).

[MacKay, 1992a] MacKay, D. J. (1992a). Bayesian interpolation. Neural Computation, 4(3).

[MacKay, 1992b] MacKay, D. J. (1992b). The evidence framework applied to classification networks. Neural Computation, 4(5).

[MacKay, 1992c] MacKay, D. J. (1992c). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3).

[MacKay, 1995] MacKay, D. J. (1995). Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3).

[MacKay and Neal, 1994] MacKay, D. J. and Neal, R. M. (1994). Automatic relevance determination for neural networks. Technical report in preparation, Cambridge University.

[Murphy, 2012] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

[Neal, 1993] Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto.

[Schäfer and Strimmer, 2005] Schäfer, J. and Strimmer, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6).

[Solomon, 2015] Solomon, J. (2015). Numerical Algorithms: Methods for Computer Vision, Machine Learning, and Graphics. CRC Press.

[Tan and Févotte, 2009] Tan, V. Y. and Févotte, C. (2009). Automatic relevance determination in nonnegative matrix factorization. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations.

[Tipping, 2001] Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1.

[Tipping et al., 2003] Tipping, M. E., Faul, A. C., et al. (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In AISTATS.

[Wipf and Nagarajan, 2008] Wipf, D. P. and Nagarajan, S. S. (2008). A new view of automatic relevance determination. In Advances in Neural Information Processing Systems.

[Yang et al., 2004] Yang, D., Zakharkin, S. O., Page, G. P., Brand, J. P., Edwards, J. W., Bartolucci, A. A., and Allison, D. B. (2004). Applications of Bayesian statistical methods in microarray data analysis. American Journal of Pharmacogenomics, 4(1):53-62.


More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

paper prepared for the 1996 PTRC Conference, September 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL

paper prepared for the 1996 PTRC Conference, September 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL paper prepared for the 1996 PTRC Conference, Septeber 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL Nanne J. van der Zijpp 1 Transportation and Traffic Engineering Section Delft University

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

An l 1 Regularized Method for Numerical Differentiation Using Empirical Eigenfunctions

An l 1 Regularized Method for Numerical Differentiation Using Empirical Eigenfunctions Journal of Matheatical Research with Applications Jul., 207, Vol. 37, No. 4, pp. 496 504 DOI:0.3770/j.issn:2095-265.207.04.0 Http://jre.dlut.edu.cn An l Regularized Method for Nuerical Differentiation

More information

When Short Runs Beat Long Runs

When Short Runs Beat Long Runs When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long

More information

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo

More information

Bayesian Approach for Fatigue Life Prediction from Field Inspection

Bayesian Approach for Fatigue Life Prediction from Field Inspection Bayesian Approach for Fatigue Life Prediction fro Field Inspection Dawn An and Jooho Choi School of Aerospace & Mechanical Engineering, Korea Aerospace University, Goyang, Seoul, Korea Srira Pattabhiraan

More information

Data-Driven Imaging in Anisotropic Media

Data-Driven Imaging in Anisotropic Media 18 th World Conference on Non destructive Testing, 16- April 1, Durban, South Africa Data-Driven Iaging in Anisotropic Media Arno VOLKER 1 and Alan HUNTER 1 TNO Stieltjesweg 1, 6 AD, Delft, The Netherlands

More information

Research in Area of Longevity of Sylphon Scraies

Research in Area of Longevity of Sylphon Scraies IOP Conference Series: Earth and Environental Science PAPER OPEN ACCESS Research in Area of Longevity of Sylphon Scraies To cite this article: Natalia Y Golovina and Svetlana Y Krivosheeva 2018 IOP Conf.

More information

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup) Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Statistica Sinica 6 016, 1709-178 doi:http://dx.doi.org/10.5705/ss.0014.0034 AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Nilabja Guha 1, Anindya Roy, Yaakov Malinovsky and Gauri

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Effective joint probabilistic data association using maximum a posteriori estimates of target states

Effective joint probabilistic data association using maximum a posteriori estimates of target states Effective joint probabilistic data association using axiu a posteriori estiates of target states 1 Viji Paul Panakkal, 2 Rajbabu Velurugan 1 Central Research Laboratory, Bharat Electronics Ltd., Bangalore,

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

IN modern society that various systems have become more

IN modern society that various systems have become more Developent of Reliability Function in -Coponent Standby Redundant Syste with Priority Based on Maxiu Entropy Principle Ryosuke Hirata, Ikuo Arizono, Ryosuke Toohiro, Satoshi Oigawa, and Yasuhiko Takeoto

More information

arxiv: v2 [cs.lg] 30 Mar 2017

arxiv: v2 [cs.lg] 30 Mar 2017 Batch Renoralization: Towards Reducing Minibatch Dependence in Batch-Noralized Models Sergey Ioffe Google Inc., sioffe@google.co arxiv:1702.03275v2 [cs.lg] 30 Mar 2017 Abstract Batch Noralization is quite

More information

A Monte Carlo scheme for diffusion estimation

A Monte Carlo scheme for diffusion estimation A onte Carlo schee for diffusion estiation L. artino, J. Plata, F. Louzada Institute of atheatical Sciences Coputing, Universidade de São Paulo (ICC-USP), Brazil. Dep. of Electrical Engineering (ESAT-STADIUS),

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

OPTIMIZATION in multi-agent networks has attracted

OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3

More information

Ph 20.3 Numerical Solution of Ordinary Differential Equations

Ph 20.3 Numerical Solution of Ordinary Differential Equations Ph 20.3 Nuerical Solution of Ordinary Differential Equations Due: Week 5 -v20170314- This Assignent So far, your assignents have tried to failiarize you with the hardware and software in the Physics Coputing

More information