Type II variational methods in Bayesian estimation

J. A. Palmer, D. P. Wipf, and K. Kreutz-Delgado
Department of Electrical and Computer Engineering
University of California San Diego, La Jolla, CA 92093

Abstract

We consider type-II variational methods to be estimation methods that use an EM-type algorithm to maximize the variational free energy lower bound on the likelihood, and in so doing employ a set of variational parameters or hyperparameters in a variational Gaussian approximate posterior. In particular we focus on convex bounding algorithms and conditionally Normal (N/I) hyperprior algorithms. We contrast these type-II methods with ensemble learning or variational Bayes algorithms, which assume that the approximating posterior is separable in the latent variables and hyperparameters in order to update the separate posteriors by functional optimization. We consider MAP and ML estimation in non-Gaussian linear and kernel nonlinear models, and we derive general algorithms for each case, showing how existing variational algorithms fit into this framework. We compare the methods in a simple Monte Carlo experiment.

1 INTRODUCTION

Variational methods have become increasingly popular over the past decade in Bayesian approaches to machine learning, particularly for estimation in graphical models and belief networks [14, 27, 16, 15, 17, 2, 4, 3], sparse Bayesian learning [29, 6, 11], ICA [18, 19, 9], and learning overcomplete representations [12]. Variational Bayesian methods are primarily used to lower bound the model evidence p(y|m), and to approximate the posterior distribution over hidden or latent variables p(x|y, m), where (given m),

    p(y) = ∫ p(y|x) p(x) dx ,     p(x|y) = p(y|x) p(x) / p(y)

The following decomposition of the log likelihood is generally employed in the derivation of a lower bound,

    log p(y) = ∫ q(z|y) log [ p(z, y) / q(z|y) ] dz + D( q(z|y) || p(z|y) )

The first term on the right hand side is called the (negative) free energy [27, 21]. From the non-negativity of the Kullback-Leibler divergence, the negative free energy is a lower bound on the log likelihood.¹

Variational methods generally employ variational parameters, or hyperparameters, ξ, in addition to the (non-Gaussian) latent variables, x, in the approximating posterior,

    q(z|y) = q(x, ξ|y)

So far we have described features that are characteristic of variational Bayesian methods in general. We can distinguish two main branches of variational methods according to how the approximating posterior q(x, ξ|y) is handled. In the ensemble learning approach, including Variational Bayes [4, 2], the approximating posterior distribution is taken to be factorial,

    q(x, ξ|y) = q(x|y) q(ξ|y)

This allows the conditional maximization of the lower bound with respect to q(x) and q(ξ) separately. In contrast, type-II methods, including hyperprior algorithms [20, 29, 11] and convex bounding algorithms [27, 16, 17], use an explicit variational Gaussian approximation to the posterior, which has the general form,

    q(x, ξ|y) = q(x|y, ξ) q(ξ) = N(x; μ_x(ξ), Σ_x(ξ)) q(ξ) ∝ p(y|x) N(x; 0, Λ_ξ) q(ξ)     (1)

where Λ_ξ = diag(ξ).

¹ In maximizing the negative free energy, we are not guaranteed to find a local maximum of the true likelihood, but we are guaranteed that the true likelihood of our estimate is greater than whatever value we find for the optimal negative free energy. For example, if the negative free energy at an estimate is greater than some threshold value, then we can be sure that the true likelihood of the estimate is also greater than that threshold.
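This decomposition is easy to verify numerically on a toy model. The sketch below (our own illustration; the scalar conjugate Gaussian model, its parameter values, and the variable names are assumptions, not anything from the paper) checks that the negative free energy plus the KL divergence recovers log p(y) exactly for an arbitrary Gaussian approximating posterior:

    import numpy as np

    # Toy model (illustrative, not the paper's): x ~ N(0,1), y|x ~ N(x, s2).
    # For any Gaussian q(x|y) = N(mq, vq), check  log p(y) = F(q) + KL(q || p(x|y)).
    y, s2 = 1.3, 0.5
    mq, vq = 0.2, 0.7                     # an arbitrary approximating posterior

    # Exact marginal likelihood and exact posterior
    log_py = -0.5 * np.log(2 * np.pi * (1 + s2)) - y**2 / (2 * (1 + s2))
    v_post = 1.0 / (1.0 + 1.0 / s2)
    m_post = v_post * y / s2

    # Negative free energy F(q) = E_q[log p(y|x)] + E_q[log p(x)] + H(q)
    E_loglik = -0.5 * np.log(2 * np.pi * s2) - ((y - mq)**2 + vq) / (2 * s2)
    E_logprior = -0.5 * np.log(2 * np.pi) - (mq**2 + vq) / 2
    H_q = 0.5 * np.log(2 * np.pi * np.e * vq)
    F = E_loglik + E_logprior + H_q

    # KL( N(mq, vq) || N(m_post, v_post) )
    KL = 0.5 * (np.log(v_post / vq) + (vq + (mq - m_post)**2) / v_post - 1)

    print(F + KL, log_py)                 # identical up to floating point error
    assert np.isclose(F + KL, log_py)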

A further distinction can be made within type-II methods between convex bounding methods, and hyperprior or conditionally Normal (N/I) methods. In the following subsections we present abstract derivations of the descent properties of ensemble learning, convex bounding, and hyperprior methods.

1.1 ENSEMBLE LEARNING

The general ensemble algorithm is derived in [2] and [4, 3] using (relatively simple) calculus of variations to optimize the negative free energy functional. The following (apparently new) derivation avoids the variational calculus, and relies only on simple properties of the KL divergence. For fixed q(ξ), maximizing the negative free energy with respect to q(x), we have,

    max_{q(x)} ∫∫ q(x) q(ξ) log [ p(x, ξ, y) / ( q(x) q(ξ) ) ] dξ dx
        = max_{q(x)} ∫ q(x) [ ⟨log p(x, ξ, y)⟩_ξ − log q(x) ] dx
        = min_{q(x)} ∫ q(x) log [ q(x) / ( K e^{⟨log p(x, ξ, y)⟩_ξ} ) ] dx
        = min_{q(x)} D( q(x) || K e^{⟨log p(x, ξ, y)⟩_ξ} )

where ⟨·⟩_ξ denotes expectation with respect to q(ξ), and K is a normalizing constant. The minimum of the KL divergence in the last expression is attained when q(x) ∝ exp( ⟨log p(x, ξ, y)⟩_ξ ). An identical derivation yields the optimal q(ξ) ∝ exp( ⟨log p(x, ξ, y)⟩_x ) when q(x) is fixed. Thus if we can perform the expectations, we can monotonically increase the negative free energy by alternately updating q(x) and q(ξ).

The ensemble method makes no prior assumption on the form of the approximating posterior q(x, ξ|y) beyond separability, i.e. conditional independence of x and ξ given y. Intuitively, however, we would expect the latent variables and the variational parameters to be intimately related given the observation y, as there is typically an assumed Markov chain dependence between the variational parameters ξ, the latent variables x, and the observation y. This anomaly may be mitigated by other considerations [3]. Of course, tractability considerations are made in the a priori determination of the joint density p(x, ξ, y), which in effect determines the forms of q(x) and q(ξ).
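As a concrete instance of these alternating updates, the following sketch applies them to a standard toy conjugate model that is not the one considered in the paper: observations with unknown mean μ (playing the role of the latent variable x) and unknown precision τ (playing the role of the hyperparameter ξ), for which both expectations are available in closed form.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(2.0, 1.0, size=50)          # data
    N, ybar = len(y), y.mean()

    # Priors (illustrative): mu ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0)
    mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

    # Mean-field factors q(mu) = N(m, 1/lam), q(tau) = Gamma(a, b)
    E_tau = a0 / b0
    for _ in range(50):
        # q(mu) proportional to exp( <log p(mu, tau, y)>_q(tau) )
        m = (lam0 * mu0 + N * ybar) / (lam0 + N)
        lam = (lam0 + N) * E_tau
        # q(tau) proportional to exp( <log p(mu, tau, y)>_q(mu) )
        a = a0 + 0.5 * (N + 1)
        E_sq = np.sum((y - m) ** 2) + N / lam          # E[ sum_i (y_i - mu)^2 ]
        b = b0 + 0.5 * (lam0 * ((m - mu0) ** 2 + 1 / lam) + E_sq)
        E_tau = a / b

    print("posterior mean of mu:", m, "   E[tau]:", E_tau)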
1.2 TYPE-II MAXIMUM LIKELIHOOD

The convex bounding and hyperprior methods use the variational Gaussian approximation (1). Convex bounding methods use the convexity properties of certain densities, namely strongly super-Gaussian densities, to formulate a variational algorithm for maximizing the negative free energy. Hyperprior algorithms, on the other hand, employ a class of densities that are conditionally Normal (i.e. Gaussian scale mixtures [1], or N/I densities [10]), exploiting this representation to derive an EM algorithm.

1.2.1 Convex bounding

Convex bounding algorithms employ the following variational form of the prior on x,

    p(x) = sup_ξ q(x; ξ) = sup_ξ N(x; 0, Λ_ξ) φ(ξ)

where Λ_ξ = diag(ξ). Note that q(x; ξ) is not necessarily a normalized density in x. The family of approximating posteriors, parameterized by ξ̂, is defined by,

    q(x|y; ξ̂) = p(y|x) N(x; 0, Λ_ξ̂) / p̂(y; ξ̂)

where p̂(y; ξ̂) is the corresponding normalizing constant. The approximating posterior will be Gaussian in x in the standard Bayesian linear model with Gaussian errors. The negative free energy is bounded using,

    ∫ q(x|y; ξ̂) log [ p(x, y) / q(x|y; ξ̂) ] dx
        = ∫ q(x|y; ξ̂) log [ p(y|x) p(x) / q(x|y; ξ̂) ] dx
        = ∫ q(x|y; ξ̂) sup_ξ log [ p(y|x) q(x; ξ) / q(x|y; ξ̂) ] dx
        ≥ sup_ξ ∫ q(x|y; ξ̂) log [ p(y|x) q(x; ξ) / q(x|y; ξ̂) ] dx     (3)
        = sup_ξ ∫ q(x|y; ξ̂) log [ q(x|y; ξ) p̂(y; ξ) φ(ξ) / q(x|y; ξ̂) ] dx
        = sup_ξ [ log p̂(y; ξ) φ(ξ) − D( q(x|y; ξ̂) || q(x|y; ξ) ) ]

This last expression is maximized by alternately finding the supremum over ξ, say ξ*, which will generally be possible in closed form, and then maximizing over the family of approximating distributions q(x|y; ξ̂) by letting ξ̂ = ξ* to minimize the KL divergence. Note that the inequality (3), which will generally be strict, implies an approximation to the negative free energy, and thus a further approximation to the true log likelihood. The distance at the optimum from the true log likelihood is thus in fact greater than the D( q(x|y) || p(x|y) ) imposed by the free energy approximation.
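For a concrete example of the variational form of the prior, the following sketch checks numerically that the Laplacian p(x) = (1/2) exp(−|x|) is recovered as sup_ξ N(x; 0, ξ) φ(ξ) with φ(ξ) = √(πξ/2) exp(−ξ/2). Here ξ is used as the Gaussian variance and this particular φ is our own derivation for the illustration (the supremum is attained at ξ = |x|); it is a sketch of the representation, not the paper's parameterization.

    import numpy as np

    # Variational (convex bounding) form of the Laplacian prior, checked numerically:
    #   (1/2) exp(-|x|) = sup_xi  N(x; 0, xi) * phi(xi),  phi(xi) = sqrt(pi*xi/2) * exp(-xi/2)
    # (xi plays the role of the Gaussian variance; the supremum is attained at xi = |x|).
    def laplacian(x):
        return 0.5 * np.exp(-np.abs(x))

    def bound(x, xi):
        gauss = np.exp(-x**2 / (2 * xi)) / np.sqrt(2 * np.pi * xi)
        phi = np.sqrt(np.pi * xi / 2) * np.exp(-xi / 2)
        return gauss * phi

    xs = np.linspace(-4, 4, 9)
    xis = np.linspace(1e-3, 10, 200000)          # dense grid over the variational parameter
    sup_vals = np.max(bound(xs[:, None], xis[None, :]), axis=1)
    print(np.max(np.abs(sup_vals - laplacian(xs))))   # approximately zero
    assert np.allclose(sup_vals, laplacian(xs), atol=1e-3)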

1.2.2 Hyperpriors and the Evidence framework

The evidence methods [20, 29] and hyperprior algorithms actually approximate the evidence p(y|m) rather than bounding it as in the ensemble and convex bounding methods. However, given their similarity to the convex bounding methods, as demonstrated in the rest of the paper, it is natural to discuss them together. The hyperprior methods use the following representation (when possible) of the non-Gaussian prior p(x),

    p(x) = ∫ N(x; 0, Λ) p(ξ) dξ

where Λ = diag(ξ). We then have for the model evidence,

    p(y) = ∫ p(y|x) p(x) dx = ∫ [ ∫ p(y|x) p(x|ξ) dx ] p(ξ) dξ
         = ∫ N(y; 0, Σ_y(ξ)) p(ξ) dξ ≈ N(y; 0, Σ_y(ξ_MAP))     (2)

and we approximate the posterior over the latent variables by,

    p(x|y) ≈ p(y|x) p(x|ξ_MAP) / p(y|ξ_MAP)

An EM-type algorithm can be used to find ξ_MAP, as we show later.³

We will be concerned in this paper primarily with type-II methods. Our main focus will be on the theoretical basis for the convex bounding and hyperprior algorithms, formulating them in the most abstract terms possible, and examining the relationship between them. Section 2 discusses the convex bounding methods in greater detail, and Section 3 treats the hyperprior algorithms. In Section 4 we discuss the relationship between these methods, deriving conditions under which they may be applied, and the relationship between them when both can be applied. Finally, Section 5 compares the methods in some simple Monte Carlo experiments.

2 CONVEX BOUNDING

The convex bounding methods we discuss use the concept of concavity in x² [15, 12], or square-concavity [22]. We define the class of strongly super-Gaussian densities to be those densities p(x) such that p(x) = exp(−f(x)) = exp(−g(x²)), with g concave.

2.1 STRONG SUPER-GAUSSIANITY

The most widely used criterion for super-Gaussianity, particularly in ICA, is positive kurtosis. However, a definition equivalent to ours is used in [5], and cited in [13] as the proper definition of super-Gaussianity, though we have not found it used outside of [5]. The definition given in [5] defines p(x) = exp(−f(x)) to be super-Gaussian (sub-Gaussian) if f′(x)/x is decreasing (increasing) on (0, ∞). This condition is equivalent to f(x) = g(x²) with g concave, i.e. g′ decreasing. The square-concavity requirement may seem stringent at first, but all of the symmetric parameterized densities that we are aware of are either strongly super-Gaussian or strongly sub-Gaussian. Examples of strongly super-Gaussian densities include (leaving out normalizing constants): (i) the Generalized Gaussian, exp(−γ|x|^p), p ≤ 2; (ii) the Logistic, d/dx (1 + exp(−x))^{−1}; (iii) Student's t; and (iv) symmetric α-stable densities, with characteristic function exp(−|ω|^α), 0 < α ≤ 2. The main property of these densities that we shall use is the inequality,

    f(y) − f(x) ≤ [ f′(x) / (2x) ] ( y² − x² )     (4)

which is the differential criterion for the concavity of g. This inequality allows a simple proof of descent for a class of IRLS algorithms for MAP estimation [22].

³ The EM algorithm is usually used for Maximum Likelihood estimation of a non-random parameter, but the same derivation of convergence holds for random parameters as well.
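The criterion is easy to check for specific priors. The sketch below (our own illustration, with f = −log p up to additive constants and analytic derivatives supplied by hand) confirms that f′(x)/x is non-increasing on (0, ∞) for the Laplacian, logistic, and Student's t densities:

    import numpy as np

    # f(x) = -log p(x) (up to constants) and its derivative, for some example priors.
    # Strong super-Gaussianity  <=>  f'(x)/x is non-increasing on (0, inf).
    priors = {
        "Laplacian (gen. Gaussian, p=1)": lambda x: np.ones_like(x),      # f(x) = |x|,  f'(x) = 1
        "logistic":                       lambda x: np.tanh(x / 2),       # f(x) = x + 2*log(1 + e^-x)
        "Student-t (nu=3)":               lambda x: 4 * x / (3 + x**2),   # f(x) = 2*log(1 + x^2/3)
    }

    x = np.linspace(0.05, 20, 4000)
    for name, fprime in priors.items():
        ratio = fprime(x) / x
        assert np.all(np.diff(ratio) <= 1e-12), name
        print(f"{name}: f'(x)/x is non-increasing  ->  strongly super-Gaussian")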

In particular, consider the linear model, y = Ax + ν, with independent strongly super-Gaussian priors on the components of x. From,

    x̂_MAP = arg max_x p(x|y) = arg min_x [ −log p(x) − log p(y|x) ]
          = arg min_x [ f(x) + d(y − Ax) ]
          = arg min_{x,e} [ f(x) + d(e) ]   s.t.   Ax + e = y

we see that the original unconstrained problem can be rephrased as a constrained optimization problem. Thus without loss of generality we can consider linearly constrained optimization of a single function f. Suppose for simplicity that f is separable (more general formulations in Rⁿ are possible [22]), with each f_i concave in x². Using (4), we have for arbitrary z,

    f(z) − f(x) = Σ_i [ f_i(z_i) − f_i(x_i) ]
                ≤ ½ Σ_i [ f_i′(x_i) / x_i ] ( z_i² − x_i² )
                = ½ zᵀ Π(x) z − ½ xᵀ Π(x) x     (5)

Thus, taking,

    x_new = arg min_z  zᵀ Π(x_old) z   s.t.   A z = y     (6)

guarantees that f(x_new) ≤ f(x_old), and we have an iterative reweighted least-squares (IRLS) algorithm with provable descent. The diagonal weight matrix Π(x) is given by,

    [ Π(x) ]_{i,i} = f_i′(x_i) / x_i     (7)

Given the MAP estimate of x, it is possible to approximate the posterior p(x|y) using the Laplace approximation (using the curvature, i.e. the Hessian, at the mode to define the covariance of a Gaussian with mean x_MAP). Problems with this method include: (i) the MAP estimate may not be representative of the location of probability mass [20]; (ii) even if the mass is located at the mode, the curvature at the mode (the Hessian) may not be representative of the actual mass distribution; and (iii) many super-Gaussian priors have infinite curvature at the mode, making second-order algorithms unstable. Alternatively we can use a variational evidence framework, which employs variational parameters or hyperparameters to form the posterior approximation.

2.2 Convex bounding EM

Suppose the prior p(x) is Normal with zero mean and diagonal covariance Λ⁻¹, where Λ = diag(ξ), and consider the ordinary Maximum Likelihood estimate of the (non-random) prior component inverse variances ξ = [ξ_1 ... ξ_n]ᵀ,

    ξ̂_ML = arg max_ξ p(y; ξ) = arg max_ξ ∫ p(y|x) p(x; ξ) dx
          = arg max_ξ ∫ N(y; Ax, Σ) N(x; 0, Λ⁻¹) dx

We can easily define an EM algorithm for estimating ξ, treating the x as hidden or latent variables. The complete log likelihood is then,

    log p(y, x; ξ) = −½ xᵀAᵀΣ⁻¹Ax + yᵀΣ⁻¹Ax + ½ log|Λ| − ½ xᵀΛx + const     (8)

In the EM algorithm we take the expectation of the complete log likelihood with respect to the posterior distribution using the current parameters ξ^(k), where k denotes the iteration. For the posterior, we have,

    p(x|y; ξ) = p(y|x) p(x; ξ) / p(y) = N(x; μ_x, Σ_x)

where we define,

    μ_x = Λ⁻¹Aᵀ( AΛ⁻¹Aᵀ + Σ )⁻¹ y     (9)
    Σ_x = ( AᵀΣ⁻¹A + Λ )⁻¹     (10)

Taking the expectation of the terms in the complete log likelihood (8) that involve the parameter ξ, we have,

    E_{x|y; ξ^(k)} [ ½ Σ_i ( ξ_i x_i² − log ξ_i ) ] = ½ Σ_i ( ξ_i E_{x|y; ξ^(k)}[ x_i² ] − log ξ_i )

Then, minimizing with respect to ξ, we get,

    [ ξ^(k+1) ]_i⁻¹ = E_{x|y; ξ^(k)}[ x_i² ] = [μ_x]_i² + [Σ_x]_{i,i}     (11)

Thus the EM algorithm consists of alternately updating the inverse variance parameters according to (11), and updating the mean and covariance of the posterior according to (9) and (10) with Λ = diag(ξ^(k+1)).
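The following numpy transcription of this loop is a sketch under our own toy problem sizes and noise level (the matrix A, the dimensions, and all variable names are our own illustrative choices, not the paper's):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, noise_var = 10, 25, 1e-2
    A = rng.standard_normal((n, m))
    x_true = np.zeros(m); x_true[[2, 7, 11]] = [1.0, -2.0, 1.5]   # sparse ground truth
    y = A @ x_true + np.sqrt(noise_var) * rng.standard_normal(n)
    Sigma = noise_var * np.eye(n)                                  # noise covariance

    xi = np.ones(m)                                                # prior inverse variances
    for _ in range(200):
        Lam_inv = np.diag(1.0 / xi)
        # E-step: posterior moments, eqs. (9)-(10)
        mu = Lam_inv @ A.T @ np.linalg.solve(A @ Lam_inv @ A.T + Sigma, y)
        Sig = np.linalg.inv(A.T @ np.linalg.inv(Sigma) @ A + np.diag(xi))
        # M-step: eq. (11),  1/xi_i = E[x_i^2]
        xi = 1.0 / (mu**2 + np.diag(Sig))

    print(np.round(mu, 2))     # posterior mean should concentrate near the generating support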

We can generalize this algorithm for strongly super-Gaussian component priors by employing a variational form of the prior, and using an EM-type algorithm to estimate the variational parameters. For a strongly super-Gaussian prior p(x) = exp(−f(x)), we have f(x) = g(x²) with g concave and increasing on (0, ∞). By definition of the concave conjugate of g, we have,

    f(x) = g(x²) = inf_ξ [ (ξ/2) x² − g*(ξ/2) ]

and thus,

    exp(−f(x)) = sup_ξ exp( −(ξ/2) x² + g*(ξ/2) ) = sup_ξ N(x; 0, 1/ξ) φ(ξ)     (12)

where,

    φ(ξ) = √(2π/ξ) exp( g*(ξ/2) )

Thus we have for the complete likelihood,

    p(y, x) = p(y|x) p(x) = sup_ξ p(y|x) N(x; 0, Λ_ξ⁻¹) Π_i φ(ξ_i) ≡ sup_ξ L(y, x; ξ)

Hence, we can lower bound the log likelihood,

    log p(y) ≥ E_Q[ log p(y, x) ] + H(Q) ≥ E_Q[ log L(y, x; ξ) ] + H(Q)

The first inequality is the ensemble inequality and holds for all distributions Q(x|y). The second inequality is a consequence of the strong super-Gaussianity of p(x). The algorithm performs an EM-type iteration to maximize this lower bound with respect to the parameter vector ξ, or equivalently minimize an upper bound on the free energy,

    −log L(y, x; ξ) = ½ ‖y − Ax‖²_Σ + ½ xᵀΛ_ξ x − ½ log|Λ_ξ| − Σ_i log φ(ξ_i) + const
                    = ½ ‖y − Ax‖²_Σ + ½ Σ_i [ ξ_i x_i² − 2 g*(ξ_i/2) ] + const     (13)

where ‖e‖²_Σ denotes eᵀΣ⁻¹e. Taking the expected value with respect to the approximate posterior Q_k = N(x; μ_x^(k), Σ_x^(k)) and minimizing with respect to ξ, we have,

    ξ_i = 2 g′( E_{Q_k}[ x_i² ] ) = f′(σ_i) / σ_i

where σ_i² = E_{Q_k}[ x_i² ]. Thus we can write the weight matrix as,

    [ Λ ]_{i,i} = f′(σ_i) / σ_i     (14)

The form of Λ is clearly similar to that of Π in (7), which is noted in [12] for the case of the Laplacian prior. As in (11), we have,

    σ_i² = [μ_x]_i² + [Σ_x]_{i,i}     (15)

The update (7) can be seen as taking σ_i² = [μ_x]_i², neglecting the variance term [Σ_x]_{i,i}.

Following [12], we can generate an approximate Maximum Likelihood estimate of A by minimizing (13) for A along with the {x_k}. This leads to the standard EM update for A,

    A_ML = arg min_A Σ_{k=1}^N E_{Q_k}[ ‖ y_k − A x_k ‖²_Σ ]
         = ( Σ_{k=1}^N y_k μ_{x_k}ᵀ ) ( Σ_{k=1}^N E[ x_k x_kᵀ ] )⁻¹
         = ( Σ_{k=1}^N y_k μ_{x_k}ᵀ ) ( Σ_{k=1}^N [ μ_{x_k} μ_{x_k}ᵀ + Σ_{x_k} ] )⁻¹

where μ_{x_k} and Σ_{x_k} are given by (9) and (10), and Λ is given by (14). The algorithm in [12] is a special case of the one given here. It is straightforward to extend this method to include estimation of the noise covariance, or of mixture coefficients in the case where the sources are variational Gaussian mixtures, as in [9], though they take the ensemble approach.
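For a Laplacian prior, f(x) = |x| gives f′(σ)/σ = 1/σ, so relative to the EM sketch shown earlier only the hyperparameter update changes. A hedged sketch of the resulting loop, again on our own toy data (names and sizes are ours):

    import numpy as np

    def variational_em_laplacian(A, y, Sigma, n_iter=200):
        """EM-type iteration of Section 2.2 with a Laplacian prior:
        the weight update (14) becomes Lambda_ii = f'(sigma_i)/sigma_i = 1/sigma_i."""
        m = A.shape[1]
        lam = np.ones(m)                                   # diagonal of Lambda (precisions)
        for _ in range(n_iter):
            Lam_inv = np.diag(1.0 / lam)
            mu = Lam_inv @ A.T @ np.linalg.solve(A @ Lam_inv @ A.T + Sigma, y)   # eq. (9)
            Sig = np.linalg.inv(A.T @ np.linalg.inv(Sigma) @ A + np.diag(lam))   # eq. (10)
            sigma = np.sqrt(mu**2 + np.diag(Sig))                                # eq. (15)
            lam = 1.0 / sigma                                                    # eq. (14), f(x) = |x|
        return mu

    rng = np.random.default_rng(2)
    A = rng.standard_normal((10, 25))
    x_true = np.zeros(25); x_true[[3, 9, 20]] = [2.0, -1.0, 1.5]
    y = A @ x_true + 0.1 * rng.standard_normal(10)
    print(np.round(variational_em_laplacian(A, y, 1e-2 * np.eye(10)), 2))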

3 HYPERPRIORS

Suppose the prior p(x) can be written,

    p(x) = ∫ p(x|ξ) p(ξ) dξ = ∫ N(x; 0, 1/ξ) p(ξ) dξ     (16)

so that we can define a random variable ξ such that, given ξ, x is conditionally Gaussian with variance a function of ξ.

3.1 N/I MAP

First, consider MAP estimation of x, where ξ is taken as the hidden data,

    x̂ = arg max_x log p(x|y)

As before, the free energy is,

    −log p(x, ξ, y) = −log p(y|x) − log p(x|ξ) − log p(ξ)

The only term related to the maximization of x is Σ_i ½ ξ_i x_i², so we need to find E[ξ_i | x_i]. Following [10], we have,

    p′(x) = d/dx ∫ N(x; 0, 1/ξ) p(ξ) dξ = −x ∫ ξ N(x; 0, 1/ξ) p(ξ) dξ
          = −x ∫ ξ p(x, ξ) dξ = −x p(x) ∫ ξ p(ξ|x) dξ

Thus, we see that,

    E[ ξ | x ] = ∫ ξ p(ξ|x) dξ = −p′(x) / ( x p(x) ) = f′(x) / x     (17)

Interestingly, this is equivalent to (7), which was derived from a convexity inequality, while (17) follows from the N/I representation.

3.2 N/I EM

Now consider the reverse case, MAP estimation of the vector ξ where x is taken as hidden,

    ξ̂ = arg max_ξ log p(ξ|y)

We can define an EM-type algorithm to find a locally optimal ξ. Using the fact that,

    log p(ξ|y) = E_{x|ξ, y} log [ p(x, ξ|y) / p(x|ξ, y) ]

we have,

    log p(ξ^(k+1)|y) − log p(ξ^(k)|y)
        = E_{x|ξ^(k), y}[ log p(x, ξ^(k+1)|y) ] − E_{x|ξ^(k), y}[ log p(x, ξ^(k)|y) ]
          + D( p(x|ξ^(k), y) || p(x|ξ^(k+1), y) )

Thus we can treat log p(x, ξ|y) as a complete log-likelihood, analogous to that in the standard EM algorithm, and maximize E_{x|ξ^(k), y}[ log p(x, ξ|y) ] with respect to ξ, which may be possible in closed form. The posterior p(x|ξ, y) is given by,

    p(x|ξ, y) = p(y|x, ξ) p(x|ξ) / p(y|ξ) = N(x; μ_x, Σ_x)

where again, μ_x = Λ⁻¹Aᵀ(AΛ⁻¹Aᵀ + Σ)⁻¹y, Σ_x = (AᵀΣ⁻¹A + Λ)⁻¹, and Λ = diag(ξ). For the complete log-likelihood term, we have,

    −log p(x, ξ|y) = −log p(y|x) − log p(x|ξ) − log p(ξ) + const
        = ½ ‖y − Ax‖²_Σ + ½ xᵀΛx − ½ log|Λ| − log p(ξ) + const
        = ½ ‖y − Ax‖²_Σ + Σ_i [ ½ ξ_i x_i² − ½ log ξ_i − log p(ξ_i) ] + const

The EM-type algorithm proceeds by alternately updating the mean and covariance of p(x|ξ, y), and setting ξ_i^(k+1) to satisfy,

    1/ξ_i + 2 p′(ξ_i) / p(ξ_i) = E_{x|ξ^(k), y}[ x_i² ]     (18)

If the conditional inverse variance is an arbitrary invertible function v(ξ) rather than ξ itself, then ξ_i^(k+1) must satisfy,

    1/v(ξ_i) + 2 p′(ξ_i) / [ p(ξ_i) v′(ξ_i) ] = E_{x|ξ^(k), y}[ x_i² ]     (19)

Alternative algorithms can be derived through reparameterization of the prior p(ξ). As an example, the Laplacian density can be written as conditionally Normal with standard deviation distributed Rayleigh,

    ½ exp(−|x|) = ∫₀^∞ N(x; 0, σ²) p(σ) dσ
                = ∫₀^∞ [ (1/√(2πσ²)) exp(−x²/(2σ²)) ] [ σ exp(−σ²/2) ] dσ

Deriving formula (19) in terms of σ, with conditional variance σ², we have,

    σ_i² − σ_i³ p′(σ_i) / p(σ_i) = E_{x|σ^(k), y}[ x_i² ]     (20)

For the Rayleigh distribution,

    p′(σ_i) / p(σ_i) = 1/σ_i − σ_i

Substituting this in (20), we get,

    σ_i⁴ = E_{x|σ^(k), y}[ x_i² ]

This differs from the case of assuming Gaussian components and estimating the variance parameter as in (11): here the square of the conditional variance is set to the posterior second moment, whereas in (11) the variance itself is set to the posterior second moment. The Laplacian can also be written as conditionally Gaussian with random variance distributed exponentially, which gives a different optimal hyperparameter update, so the algorithms depend on the form of the conditional Gaussianization.

In the case of Maximum Likelihood estimation of A in the linear model y = Ax + ν, given N independent observations of y, we can extend the variational algorithms for estimating the components by simply updating A according to the usual EM update for the Gaussian linear model.
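The Rayleigh scale-mixture representation of the Laplacian used above can be checked numerically; the following sketch (our own illustration) compares the mixture integral against (1/2) exp(−|x|) at a few points:

    import numpy as np
    from scipy.integrate import quad

    # Gaussian scale mixture (N/I) representation of the Laplacian:
    #   (1/2) exp(-|x|) = int_0^inf N(x; 0, sigma^2) * [sigma * exp(-sigma^2 / 2)] dsigma
    # where sigma * exp(-sigma^2 / 2) is the standard Rayleigh density.
    def mixture(x):
        integrand = lambda s: (np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
                               * s * np.exp(-s**2 / 2))
        val, _ = quad(integrand, 0, np.inf)
        return val

    for x in [0.0, 0.5, 1.0, 2.5]:
        print(x, mixture(x), 0.5 * np.exp(-abs(x)))   # the two columns agree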

4 THEORETICAL COMPARISON

In this section we present criteria that can be used to determine when the convex bounding and hyperprior methods are applicable, and examine their relationship to each other. We have seen that the convex bounding and hyperprior algorithms have some interesting similarities, and we pursue this further here.

4.1 CRITERIA FOR APPLICATION

As mentioned in Section 2, the convex bounding methods are based on the notion of square-concavity, or concavity in x², where f is square-concave if f(x) = g(x²) with g concave and increasing over the positive orthant. If p(x) ∝ exp(−f(x)) with f square-concave, then we say that p(x) is strongly super-Gaussian. Criteria for strong super-Gaussianity are equivalent to criteria for the concavity of g. Expressed in terms of f(x) = −log p(x), we have p(x) strongly super-Gaussian if f′(x)/x is decreasing on the positive orthant, or if,

    x f″(x) / f′(x) ≤ 1

The evidence and hyperprior methods are applicable when p(x) is a scale mixture of Gaussians; equivalently, if the random variable X distributed according to p(x) can be written X = N/I, with N a Normal random variable and I a non-negative random variable independent of N, then p(x) is a Gaussian scale mixture, i.e. an N/I density. According to the well-known theorem of Schoenberg [28], a density p is a Gaussian scale mixture if and only if p(‖x‖) is positive definite on Hilbert space. Using Bernstein's theorem on completely monotonic functions, a function p(x) is then a Gaussian scale mixture, i.e.,

    p(x) = ∫ e^{−x²ξ/2} p̃(ξ) dξ

for some non-negative p̃, if and only if p(√t), viewed as a function of t > 0, is completely monotonic, that is, if,

    (−1)^n (d^n/dt^n) p(√t) ≥ 0   for all n ≥ 0

4.2 RELATIONSHIP BETWEEN THE METHODS

The relationship between the convex bounding and hyperprior criteria can be seen from a theorem of Bochner [7, Thm 4.1.5]:

Theorem. If g(t) > 0, then e^{−u g(t)} is completely monotonic for every u > 0 if and only if g′(t) is completely monotonic.

In particular, g′(t) is then decreasing, so that all Gaussian scale mixtures are seen to be strongly super-Gaussian. Thus the class of strongly super-Gaussian densities includes the class of Gaussian scale mixtures, though the reverse is not true. That is, an arbitrary strongly super-Gaussian density cannot in general be expressed as a Gaussian scale mixture. Finally, we note that the algorithms derived for strongly super-Gaussian densities can also be applied to strongly sub-Gaussian densities by solving the dual problem, as the Fenchel-Legendre conjugate of a strongly super-Gaussian density is strongly sub-Gaussian [22, 23].

5 EXPERIMENTS

In the particular application of subset selection, or sparse estimation, algorithms can be compared based on the percentage of the time they find known sparse generating solutions. We present the results of a simple experiment that demonstrates the superiority of the hyperprior (evidence, ARD) methods in subset selection with overcomplete dictionaries, i.e. solving an underdetermined linear system with a known sparse solution. More details can be found in [26, 25].

We generated a random N × M dictionary Φ whose entries were each drawn from a standardized Gaussian distribution. The columns were then normalized to unit l2-norm. Sparse weight vectors w_0 were generated with div(w_0) = D_0, where div(·) is the diversity, or number of non-zero elements. We randomly selected the locations of the nonzero entries and gave them random amplitudes. The vector of target values is then computed as,

    y = Φ w_0     (21)

Each algorithm is then presented with y and Φ and attempts to find w_0. Under this construction (i.e., no noise, randomly generated dictionaries, and random weight amplitudes), all spurious local minima almost surely have a suboptimal diversity of div(w) = N, so w_0 is maximally sparse. Thus, we can be certain that when an algorithm finds w_0, it has found the maximally sparse solution.

Initially, we chose D_0 = 7, N = 20, and allowed M to vary from 20 to 80, i.e., an overcompleteness ratio ranging from 1.0 to 4.0. These results are shown in Figure 1. We compare the FOCUSS algorithms of [24, 26] and the basis pursuit method of [8], which represent MAP estimation of the components, with the evidence-based method of [29].

[Figure 1: Error rate in recovering the generative weights w_0 as a function of the overcompleteness ratio M/N, for FOCUSS (p = 0.001), FOCUSS (p = 0.9), Basis Pursuit (p = 1.0), and SBL, using N = 20, D_0 = 7, and M ranging from 20 to 80.]
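For concreteness, here is a minimal sketch of the data-generation protocol described above (our own transcription of the experimental setup, not the authors' code; function and variable names are ours):

    import numpy as np

    def generate_problem(N=20, M=40, D0=7, seed=0):
        """Random overcomplete dictionary with unit-norm columns and a known sparse solution."""
        rng = np.random.default_rng(seed)
        Phi = rng.standard_normal((N, M))
        Phi /= np.linalg.norm(Phi, axis=0)          # normalize columns to unit l2-norm
        w0 = np.zeros(M)
        support = rng.choice(M, size=D0, replace=False)
        w0[support] = rng.standard_normal(D0)       # random amplitudes on a random support
        y = Phi @ w0                                # noiseless targets, eq. (21)
        return Phi, y, w0

    Phi, y, w0 = generate_problem()
    # An algorithm "succeeds" on a trial if its estimate matches w0 up to a small tolerance.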
When the basis set, or dictionary, is complete, or in the case of sparse estimation for kernel methods (having a square design matrix) [29, 11], we have found that the convex bounding and evidence methods are very similar in their performance, which is to be expected in the low-noise case.

References

[1] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. J. Roy. Statist. Soc. Ser. B, 36:99-102, 1974.

[2] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.
[3] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[4] M. J. Beal and Z. Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7. Oxford University Press, 2002.
[5] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, 1990.
[6] C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 2000.
[7] S. Bochner. Harmonic Analysis and the Theory of Probability. University of California Press, Berkeley and Los Angeles, 1955.
[8] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
[9] R. A. Choudrey and S. J. Roberts. Variational mixture of Bayesian independent component analysers. Neural Computation, 15(1):213-252, 2002.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Iteratively reweighted least squares for linear regression when errors are Normal/Independent distributed. In P. R. Krishnaiah, editor, Multivariate Analysis V. North-Holland Publishing Company, 1980.
[11] M. Figueiredo. Adaptive sparseness using Jeffreys prior. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
[12] M. Girolami. A variational method for learning sparse and overcomplete representations. Neural Computation, 13:2517-2532, 2001.
[13] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall.
[14] G. E. Hinton and D. van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory. ACM Press, 1993.
[15] T. S. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. PhD thesis, Massachusetts Institute of Technology, 1997.
[16] T. S. Jaakkola and M. I. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Proceedings of the 1997 Conference on Artificial Intelligence and Statistics, 1997.
[17] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publishers, 1998.
[18] H. Lappalainen. Ensemble learning for independent component analysis. In Proceedings of the First International Workshop on Independent Component Analysis, 1999.
[19] N. D. Lawrence and C. M. Bishop. Variational Bayesian independent component analysis. Technical report, 2000.
[20] D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11(5):1035-1068, 1999.
[21] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[22] J. A. Palmer and K. Kreutz-Delgado. A globally convergent algorithm for maximum likelihood estimation with non-Gaussian priors. In Proceedings of the 36th Asilomar Conference on Signals and Systems. IEEE, 2002.
[23] J. A. Palmer and K. Kreutz-Delgado. A general framework for component estimation. In Proceedings of the 4th International Symposium on Independent Component Analysis, 2003.
[24] B. D. Rao and I. F. Gorodnitsky. Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 45(3):600-616, 1997.
[25] B. D. Rao and K. Kreutz-Delgado. An affine scaling methodology for best basis selection. IEEE Transactions on Signal Processing, 47(1):187-200, 1999.
[26] B. D. Rao, K. Engan, S. F. Cotter, J. Palmer, and K. Kreutz-Delgado. Subset selection in noise based on diversity measure minimization. IEEE Transactions on Signal Processing, 51(3), 2003.
[27] L. K. Saul, T. S. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.
[28] I. J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics, 39(4):811-841, 1938.
[29] M. E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211-244, 2001.
