Type II variational methods in Bayesian estimation

J. A. Palmer, D. P. Wipf, and K. Kreutz-Delgado
Department of Electrical and Computer Engineering
University of California San Diego, La Jolla, CA 92093

Abstract

We consider type-II variational methods to be estimation methods that use an EM-type algorithm to maximize the variational free energy lower bound on the likelihood, and in so doing employ a set of variational parameters or hyperparameters in a variational Gaussian approximate posterior. In particular we focus on convex bounding algorithms and conditionally Normal (N/I) hyperprior algorithms. We contrast these type-II methods with ensemble learning or variational Bayes algorithms, which assume that the approximating posterior is separable in the latent variables and hyperparameters in order to update the separate posteriors by functional optimization. We consider MAP and ML estimation in non-Gaussian linear and kernel nonlinear models, and we derive general algorithms for each case, showing how existing variational algorithms fit into this framework. We compare the methods in a simple Monte Carlo experiment.

1 INTRODUCTION

Variational methods have become increasingly popular over the past decade in Bayesian approaches to machine learning, particularly for estimation in graphical models and belief networks [14, 27, 16, 15, 17, 2, 4, 3], sparse Bayesian learning [29, 6, 11], ICA [18, 19, 9], and learning overcomplete representations [12]. Variational Bayesian methods are primarily used to lower bound the model evidence p(y|m), and to approximate the posterior distribution over hidden or latent variables p(x|y, m), where (given m),

    p(y) = ∫ p(y|x) p(x) dx ,     p(x|y) = p(y|x) p(x) / p(y)

The following decomposition of the log likelihood is generally employed in the derivation of a lower bound,

    log p(y) = ∫ q(z|y) log [ p(z, y) / q(z|y) ] dz + D( q(z|y) || p(z|y) )

The first term on the right hand side is called the (negative) free energy [27, 21]. From the non-negativity of the Kullback-Leibler divergence, the negative free energy is a lower bound on the log likelihood.¹

Variational methods generally employ variational parameters, or hyperparameters, ξ, in addition to the (non-Gaussian) latent variables, x, in the approximating posterior,

    q(z|y) = q(x, ξ|y)

So far we have described features that are characteristic of variational Bayesian methods in general. We can distinguish two main branches of variational methods according to how the approximating posterior q(x, ξ|y) is handled. In the ensemble learning approach, including Variational Bayes [4, 2], the approximating posterior distribution is taken to be factorial,

    q(x, ξ|y) = q(x|y) q(ξ|y)

This allows the conditional maximization of the lower bound with respect to q(x) and q(ξ) separately. In contrast, type-II methods, including hyperprior algorithms [20, 29, 11] and convex bounding algorithms [27, 16, 17], use an explicit variational Gaussian approximation to the posterior, which has the general form,

    q(x, ξ|y) = q(x|y, ξ) q(ξ) = N(x; μ_x(ξ), Σ_x(ξ)) q(ξ) ∝ p(y|x) N(x; 0, Λ_ξ) q(ξ)     (1)

where Λ_ξ = diag(ξ).

¹ In maximizing the negative free energy, we are not guaranteed to find a local maximum of the true likelihood, but we are guaranteed that the true likelihood of our estimate is greater than whatever value we find for the optimal negative free energy. For example, if the negative free energy at an estimate is greater than some threshold value, then we can be sure that the true likelihood of the estimate is also greater than that threshold.
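This decomposition is easy to verify numerically on a toy model. The sketch below (our own illustration; the scalar conjugate Gaussian model, its parameter values, and the variable names are assumptions, not anything from the paper) checks that the negative free energy plus the KL divergence recovers log p(y) exactly for an arbitrary Gaussian approximating posterior:

    import numpy as np

    # Toy model (illustrative, not the paper's): x ~ N(0,1), y|x ~ N(x, s2).
    # For any Gaussian q(x|y) = N(mq, vq), check  log p(y) = F(q) + KL(q || p(x|y)).
    y, s2 = 1.3, 0.5
    mq, vq = 0.2, 0.7                     # an arbitrary approximating posterior

    # Exact marginal likelihood and exact posterior
    log_py = -0.5 * np.log(2 * np.pi * (1 + s2)) - y**2 / (2 * (1 + s2))
    v_post = 1.0 / (1.0 + 1.0 / s2)
    m_post = v_post * y / s2

    # Negative free energy F(q) = E_q[log p(y|x)] + E_q[log p(x)] + H(q)
    E_loglik = -0.5 * np.log(2 * np.pi * s2) - ((y - mq)**2 + vq) / (2 * s2)
    E_logprior = -0.5 * np.log(2 * np.pi) - (mq**2 + vq) / 2
    H_q = 0.5 * np.log(2 * np.pi * np.e * vq)
    F = E_loglik + E_logprior + H_q

    # KL( N(mq, vq) || N(m_post, v_post) )
    KL = 0.5 * (np.log(v_post / vq) + (vq + (mq - m_post)**2) / v_post - 1)

    print(F + KL, log_py)                 # identical up to floating point error
    assert np.isclose(F + KL, log_py)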

A further distinction can be made within type-II methods between convex bounding methods, and hyperprior or conditionally Normal (N/I) methods. In the following subsections we present abstract derivations of the descent properties of ensemble learning, convex bounding, and hyperprior methods.

1.1 ENSEMBLE LEARNING

The general ensemble algorithm is derived in [2] and [4, 3] using (relatively simple) calculus of variations to optimize the negative free energy functional. The following (apparently new) derivation avoids the variational calculus, and relies only on simple properties of the KL divergence. For fixed q(ξ), maximizing the negative free energy with respect to q(x), we have,

    max_{q(x)} ∫∫ q(x) q(ξ) log [ p(x, ξ, y) / ( q(x) q(ξ) ) ] dξ dx
        = max_{q(x)} ∫ q(x) [ ⟨log p(x, ξ, y)⟩_ξ − log q(x) ] dx
        = min_{q(x)} ∫ q(x) log [ q(x) / ( K e^{⟨log p(x, ξ, y)⟩_ξ} ) ] dx
        = min_{q(x)} D( q(x) || K e^{⟨log p(x, ξ, y)⟩_ξ} )

where ⟨·⟩_ξ denotes expectation with respect to q(ξ), and K is a normalizing constant. The minimum of the KL divergence in the last expression is attained when q(x) ∝ exp( ⟨log p(x, ξ, y)⟩_ξ ). An identical derivation yields the optimal q(ξ) ∝ exp( ⟨log p(x, ξ, y)⟩_x ) when q(x) is fixed. Thus if we can perform the expectations, we can monotonically increase the negative free energy by alternately updating q(x) and q(ξ).

The ensemble method makes no prior assumption on the form of the approximating posterior q(x, ξ|y) beyond separability, i.e. conditional independence of x and ξ given y. Intuitively, however, we would expect the latent variables and the variational parameters to be intimately related given the observation y, as there is typically an assumed Markov chain dependence between the variational parameters ξ, the latent variables x, and the observation y. This anomaly may be mitigated by other considerations [3]. Of course, tractability considerations are made in the a priori determination of the joint density p(x, ξ, y), which in effect determines the forms of q(x) and q(ξ).
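As a concrete instance of these alternating updates, the following sketch applies them to a standard toy conjugate model that is not the one considered in the paper: observations with unknown mean μ (playing the role of the latent variable x) and unknown precision τ (playing the role of the hyperparameter ξ), for which both expectations are available in closed form.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(2.0, 1.0, size=50)          # data
    N, ybar = len(y), y.mean()

    # Priors (illustrative): mu ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0)
    mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

    # Mean-field factors q(mu) = N(m, 1/lam), q(tau) = Gamma(a, b)
    E_tau = a0 / b0
    for _ in range(50):
        # q(mu) proportional to exp( <log p(mu, tau, y)>_q(tau) )
        m = (lam0 * mu0 + N * ybar) / (lam0 + N)
        lam = (lam0 + N) * E_tau
        # q(tau) proportional to exp( <log p(mu, tau, y)>_q(mu) )
        a = a0 + 0.5 * (N + 1)
        E_sq = np.sum((y - m) ** 2) + N / lam          # E[ sum_i (y_i - mu)^2 ]
        b = b0 + 0.5 * (lam0 * ((m - mu0) ** 2 + 1 / lam) + E_sq)
        E_tau = a / b

    print("posterior mean of mu:", m, "   E[tau]:", E_tau)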
1.2 TYPE-II MAXIMUM LIKELIHOOD

The convex bounding and hyperprior methods use the variational Gaussian approximation (1). Convex bounding methods use the convexity properties of certain densities, namely strongly super-Gaussian densities, to formulate a variational algorithm for maximizing the negative free energy. Hyperprior algorithms, on the other hand, employ a class of densities that are conditionally Normal (i.e. Gaussian scale mixtures [1], or N/I densities [10]), exploiting this representation to derive an EM algorithm.

1.2.1 Convex bounding

Convex bounding algorithms employ the following variational form of the prior on x,

    p(x) = sup_ξ q(x; ξ) = sup_ξ N(x; 0, Λ_ξ) φ(ξ)

where Λ_ξ = diag(ξ). Note that q(x; ξ) is not necessarily a normalized density in x. The family of approximating posteriors, parameterized by ξ̂, is defined by,

    q(x|y; ξ̂) = p(y|x) N(x; 0, Λ_ξ̂) / p̂(y; ξ̂)

where p̂(y; ξ̂) is the corresponding normalizing constant. The approximating posterior will be Gaussian in x in the standard Bayesian linear model with Gaussian errors. The negative free energy is bounded using,

    ∫ q(x|y; ξ̂) log [ p(x, y) / q(x|y; ξ̂) ] dx
        = ∫ q(x|y; ξ̂) log [ p(y|x) p(x) / q(x|y; ξ̂) ] dx
        = ∫ q(x|y; ξ̂) sup_ξ log [ p(y|x) q(x; ξ) / q(x|y; ξ̂) ] dx
        ≥ sup_ξ ∫ q(x|y; ξ̂) log [ p(y|x) q(x; ξ) / q(x|y; ξ̂) ] dx     (3)
        = sup_ξ ∫ q(x|y; ξ̂) log [ q(x|y; ξ) p̂(y; ξ) φ(ξ) / q(x|y; ξ̂) ] dx
        = sup_ξ [ log p̂(y; ξ) φ(ξ) − D( q(x|y; ξ̂) || q(x|y; ξ) ) ]

This last expression is maximized by alternately finding the supremum over ξ, say ξ*, which will generally be possible in closed form, and then maximizing over the family of approximating distributions q(x|y; ξ̂) by letting ξ̂ = ξ* to minimize the KL divergence. Note that the inequality (3), which will generally be strict, implies an approximation to the negative free energy, and thus a further approximation to the true log likelihood. The distance at the optimum from the true log likelihood is thus in fact greater than the D( q(x|y) || p(x|y) ) imposed by the free energy approximation.
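For a concrete example of the variational form of the prior, the following sketch checks numerically that the Laplacian p(x) = (1/2) exp(−|x|) is recovered as sup_ξ N(x; 0, ξ) φ(ξ) with φ(ξ) = √(πξ/2) exp(−ξ/2). Here ξ is used as the Gaussian variance and this particular φ is our own derivation for the illustration (the supremum is attained at ξ = |x|); it is a sketch of the representation, not the paper's parameterization.

    import numpy as np

    # Variational (convex bounding) form of the Laplacian prior, checked numerically:
    #   (1/2) exp(-|x|) = sup_xi  N(x; 0, xi) * phi(xi),  phi(xi) = sqrt(pi*xi/2) * exp(-xi/2)
    # (xi plays the role of the Gaussian variance; the supremum is attained at xi = |x|).
    def laplacian(x):
        return 0.5 * np.exp(-np.abs(x))

    def bound(x, xi):
        gauss = np.exp(-x**2 / (2 * xi)) / np.sqrt(2 * np.pi * xi)
        phi = np.sqrt(np.pi * xi / 2) * np.exp(-xi / 2)
        return gauss * phi

    xs = np.linspace(-4, 4, 9)
    xis = np.linspace(1e-3, 10, 200000)          # dense grid over the variational parameter
    sup_vals = np.max(bound(xs[:, None], xis[None, :]), axis=1)
    print(np.max(np.abs(sup_vals - laplacian(xs))))   # approximately zero
    assert np.allclose(sup_vals, laplacian(xs), atol=1e-3)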

1.2.2 Hyperpriors and the Evidence framework

The evidence methods [20, 29] and hyperprior algorithms actually approximate the evidence p(y|m) rather than bounding it as in the ensemble and convex bounding methods. However, given their similarity to the convex bounding methods, as demonstrated in the rest of the paper, it is natural to discuss them together. The hyperprior methods use the following representation (when possible) of the non-Gaussian prior p(x),

    p(x) = ∫ N(x; 0, Λ) p(ξ) dξ

where Λ = diag(ξ). We then have for the model evidence,

    p(y) = ∫ p(y|x) p(x) dx = ∫ [ ∫ p(y|x) p(x|ξ) dx ] p(ξ) dξ
         = ∫ N(y; 0, Σ_y(ξ)) p(ξ) dξ ≈ N(y; 0, Σ_y(ξ_MAP))     (2)

and we approximate the posterior over the latent variables by,

    p(x|y) ≈ p(y|x) p(x|ξ_MAP) / p(y|ξ_MAP)

An EM-type algorithm can be used to find ξ_MAP, as we show later.³

We will be concerned in this paper primarily with type-II methods. Our main focus will be on the theoretical basis for the convex bounding and hyperprior algorithms, formulating them in the most abstract terms possible, and examining the relationship between them. Section 2 discusses the convex bounding methods in greater detail, and Section 3 treats the hyperprior algorithms. In Section 4 we discuss the relationship between these methods, deriving conditions under which they may be applied, and the relationship between them when both can be applied. Finally, Section 5 compares the methods in some simple Monte Carlo experiments.

2 CONVEX BOUNDING

The convex bounding methods we discuss use the concept of concavity in x² [15, 12], or square-concavity [22]. We define the class of strongly super-Gaussian densities to be those densities p(x) such that p(x) = exp(−f(x)) = exp(−g(x²)), with g concave.

2.1 STRONG SUPER-GAUSSIANITY

The most widely used criterion for super-Gaussianity, particularly in ICA, is positive kurtosis. However, a definition equivalent to ours is used in [5], and cited in [13] as the proper definition of super-Gaussianity, though we have not found it used outside of [5]. The definition given in [5] defines p(x) = exp(−f(x)) to be super-Gaussian (sub-Gaussian) if f′(x)/x is decreasing (increasing) on (0, ∞). This condition is equivalent to f(x) = g(x²) with g concave, i.e. g′ decreasing. The square-concavity requirement may seem stringent at first, but all of the symmetric parameterized densities that we are aware of are either strongly super-Gaussian or strongly sub-Gaussian. Examples of strongly super-Gaussian densities include (leaving out normalizing constants): (i) the Generalized Gaussian, exp(−γ|x|^p), p ≤ 2; (ii) the Logistic, d/dx (1 + exp(−x))^{−1}; (iii) Student's t; and (iv) symmetric α-stable densities, with characteristic function exp(−|ω|^α), 0 < α ≤ 2. The main property of these densities that we shall use is the inequality,

    f(y) − f(x) ≤ [ f′(x) / (2x) ] ( y² − x² )     (4)

which is the differential criterion for the concavity of g. This inequality allows a simple proof of descent for a class of IRLS algorithms for MAP estimation [22].

³ The EM algorithm is usually used for Maximum Likelihood estimation of a non-random parameter, but the same derivation of convergence holds for random parameters as well.
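The criterion is easy to check for specific priors. The sketch below (our own illustration, with f = −log p up to additive constants and analytic derivatives supplied by hand) confirms that f′(x)/x is non-increasing on (0, ∞) for the Laplacian, logistic, and Student's t densities:

    import numpy as np

    # f(x) = -log p(x) (up to constants) and its derivative, for some example priors.
    # Strong super-Gaussianity  <=>  f'(x)/x is non-increasing on (0, inf).
    priors = {
        "Laplacian (gen. Gaussian, p=1)": lambda x: np.ones_like(x),      # f(x) = |x|,  f'(x) = 1
        "logistic":                       lambda x: np.tanh(x / 2),       # f(x) = x + 2*log(1 + e^-x)
        "Student-t (nu=3)":               lambda x: 4 * x / (3 + x**2),   # f(x) = 2*log(1 + x^2/3)
    }

    x = np.linspace(0.05, 20, 4000)
    for name, fprime in priors.items():
        ratio = fprime(x) / x
        assert np.all(np.diff(ratio) <= 1e-12), name
        print(f"{name}: f'(x)/x is non-increasing  ->  strongly super-Gaussian")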

In particular, consider the linear model, y = Ax + ν, with independent strongly super-Gaussian priors on the components of x. From,

    x̂_MAP = arg max_x p(x|y) = arg min_x [ −log p(x) − log p(y|x) ]
          = arg min_x [ f(x) + d(y − Ax) ]
          = arg min_{x,e} [ f(x) + d(e) ]   s.t.   Ax + e = y

we see that the original unconstrained problem can be rephrased as a constrained optimization problem. Thus without loss of generality we can consider linearly constrained optimization of a single function f. Suppose for simplicity that f is separable (more general formulations in Rⁿ are possible [22]), with each f_i concave in x². Using (4), we have for arbitrary z,

    f(z) − f(x) = Σ_i [ f_i(z_i) − f_i(x_i) ]
                ≤ ½ Σ_i [ f_i′(x_i) / x_i ] ( z_i² − x_i² )
                = ½ zᵀ Π(x) z − ½ xᵀ Π(x) x     (5)

Thus, taking,

    x_new = arg min_z  zᵀ Π(x_old) z   s.t.   A z = y     (6)

guarantees that f(x_new) ≤ f(x_old), and we have an iterative reweighted least-squares (IRLS) algorithm with provable descent. The diagonal weight matrix Π(x) is given by,

    [ Π(x) ]_{i,i} = f_i′(x_i) / x_i     (7)

Given the MAP estimate of x, it is possible to approximate the posterior p(x|y) using the Laplace approximation (using the curvature, i.e. the Hessian, at the mode to define the covariance of a Gaussian with mean x_MAP). Problems with this method include: (i) the MAP estimate may not be representative of the location of probability mass [20]; (ii) even if the mass is located at the mode, the curvature at the mode (the Hessian) may not be representative of the actual mass distribution; and (iii) many super-Gaussian priors have infinite curvature at the mode, making second-order algorithms unstable. Alternatively we can use a variational evidence framework, which employs variational parameters or hyperparameters to form the posterior approximation.

2.2 Convex bounding EM

Suppose the prior p(x) is Normal with zero mean and diagonal covariance Λ⁻¹, where Λ = diag(ξ), and consider the ordinary Maximum Likelihood estimate of the (non-random) prior component inverse variances ξ = [ξ_1 ... ξ_n]ᵀ,

    ξ̂_ML = arg max_ξ p(y; ξ) = arg max_ξ ∫ p(y|x) p(x; ξ) dx
          = arg max_ξ ∫ N(y; Ax, Σ) N(x; 0, Λ⁻¹) dx

We can easily define an EM algorithm for estimating ξ, treating the x as hidden or latent variables. The complete log likelihood is then,

    log p(y, x; ξ) = −½ xᵀAᵀΣ⁻¹Ax + yᵀΣ⁻¹Ax + ½ log|Λ| − ½ xᵀΛx + const     (8)

In the EM algorithm we take the expectation of the complete log likelihood with respect to the posterior distribution using the current parameters ξ^(k), where k denotes the iteration. For the posterior, we have,

    p(x|y; ξ) = p(y|x) p(x; ξ) / p(y) = N(x; μ_x, Σ_x)

where we define,

    μ_x = Λ⁻¹Aᵀ( AΛ⁻¹Aᵀ + Σ )⁻¹ y     (9)
    Σ_x = ( AᵀΣ⁻¹A + Λ )⁻¹     (10)

Taking the expectation of the terms in the complete log likelihood (8) that involve the parameter ξ, we have,

    E_{x|y; ξ^(k)} [ ½ Σ_i ( ξ_i x_i² − log ξ_i ) ] = ½ Σ_i ( ξ_i E_{x|y; ξ^(k)}[ x_i² ] − log ξ_i )

Then, minimizing with respect to ξ, we get,

    [ ξ^(k+1) ]_i⁻¹ = E_{x|y; ξ^(k)}[ x_i² ] = [μ_x]_i² + [Σ_x]_{i,i}     (11)

Thus the EM algorithm consists of alternately updating the inverse variance parameters according to (11), and updating the mean and covariance of the posterior according to (9) and (10) with Λ = diag(ξ^(k+1)).
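The following numpy transcription of this loop is a sketch under our own toy problem sizes and noise level (the matrix A, the dimensions, and all variable names are our own illustrative choices, not the paper's):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, noise_var = 10, 25, 1e-2
    A = rng.standard_normal((n, m))
    x_true = np.zeros(m); x_true[[2, 7, 11]] = [1.0, -2.0, 1.5]   # sparse ground truth
    y = A @ x_true + np.sqrt(noise_var) * rng.standard_normal(n)
    Sigma = noise_var * np.eye(n)                                  # noise covariance

    xi = np.ones(m)                                                # prior inverse variances
    for _ in range(200):
        Lam_inv = np.diag(1.0 / xi)
        # E-step: posterior moments, eqs. (9)-(10)
        mu = Lam_inv @ A.T @ np.linalg.solve(A @ Lam_inv @ A.T + Sigma, y)
        Sig = np.linalg.inv(A.T @ np.linalg.inv(Sigma) @ A + np.diag(xi))
        # M-step: eq. (11),  1/xi_i = E[x_i^2]
        xi = 1.0 / (mu**2 + np.diag(Sig))

    print(np.round(mu, 2))     # posterior mean should concentrate near the generating support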

We can generalize this algorithm for strongly super-Gaussian component priors by employing a variational form of the prior, and using an EM-type algorithm to estimate the variational parameters. For a strongly super-Gaussian prior p(x) = exp(−f(x)), we have f(x) = g(x²) with g concave and increasing on (0, ∞). By definition of the concave conjugate of g, we have,

    f(x) = g(x²) = inf_ξ [ (ξ/2) x² − g*(ξ/2) ]

and thus,

    exp(−f(x)) = sup_ξ exp( −(ξ/2) x² + g*(ξ/2) ) = sup_ξ N(x; 0, 1/ξ) φ(ξ)     (12)

where,

    φ(ξ) = √(2π/ξ) exp( g*(ξ/2) )

Thus we have for the complete likelihood,

    p(y, x) = p(y|x) p(x) = sup_ξ p(y|x) N(x; 0, Λ_ξ⁻¹) Π_i φ(ξ_i) ≡ sup_ξ L(y, x; ξ)

Hence, we can lower bound the log likelihood,

    log p(y) ≥ E_Q[ log p(y, x) ] + H(Q) ≥ E_Q[ log L(y, x; ξ) ] + H(Q)

The first inequality is the ensemble inequality and holds for all distributions Q(x|y). The second inequality is a consequence of the strong super-Gaussianity of p(x). The algorithm performs an EM-type iteration to maximize this lower bound with respect to the parameter vector ξ, or equivalently minimize an upper bound on the free energy,

    −log L(y, x; ξ) = ½ ‖y − Ax‖²_Σ + ½ xᵀΛ_ξ x − ½ log|Λ_ξ| − Σ_i log φ(ξ_i) + const
                    = ½ ‖y − Ax‖²_Σ + ½ Σ_i [ ξ_i x_i² − 2 g*(ξ_i/2) ] + const     (13)

where ‖e‖²_Σ denotes eᵀΣ⁻¹e. Taking the expected value with respect to the approximate posterior Q_k = N(x; μ_x^(k), Σ_x^(k)) and minimizing with respect to ξ, we have,

    ξ_i = 2 g′( E_{Q_k}[ x_i² ] ) = f′(σ_i) / σ_i

where σ_i² = E_{Q_k}[ x_i² ]. Thus we can write the weight matrix as,

    [ Λ ]_{i,i} = f′(σ_i) / σ_i     (14)

The form of Λ is clearly similar to that of Π in (7), which is noted in [12] for the case of the Laplacian prior. As in (11), we have,

    σ_i² = [μ_x]_i² + [Σ_x]_{i,i}     (15)

The update (7) can be seen as taking σ_i² = [μ_x]_i², neglecting the variance term [Σ_x]_{i,i}.

Following [12], we can generate an approximate Maximum Likelihood estimate of A by minimizing (13) for A along with the {x_k}. This leads to the standard EM update for A,

    A_ML = arg min_A Σ_{k=1}^N E_{Q_k}[ ‖ y_k − A x_k ‖²_Σ ]
         = ( Σ_{k=1}^N y_k μ_{x_k}ᵀ ) ( Σ_{k=1}^N E[ x_k x_kᵀ ] )⁻¹
         = ( Σ_{k=1}^N y_k μ_{x_k}ᵀ ) ( Σ_{k=1}^N [ μ_{x_k} μ_{x_k}ᵀ + Σ_{x_k} ] )⁻¹

where μ_{x_k} and Σ_{x_k} are given by (9) and (10), and Λ is given by (14). The algorithm in [12] is a special case of the one given here. It is straightforward to extend this method to include estimation of the noise covariance, or of mixture coefficients in the case where the sources are variational Gaussian mixtures, as in [9], though they take the ensemble approach.
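For a Laplacian prior, f(x) = |x| gives f′(σ)/σ = 1/σ, so relative to the EM sketch shown earlier only the hyperparameter update changes. A hedged sketch of the resulting loop, again on our own toy data (names and sizes are ours):

    import numpy as np

    def variational_em_laplacian(A, y, Sigma, n_iter=200):
        """EM-type iteration of Section 2.2 with a Laplacian prior:
        the weight update (14) becomes Lambda_ii = f'(sigma_i)/sigma_i = 1/sigma_i."""
        m = A.shape[1]
        lam = np.ones(m)                                   # diagonal of Lambda (precisions)
        for _ in range(n_iter):
            Lam_inv = np.diag(1.0 / lam)
            mu = Lam_inv @ A.T @ np.linalg.solve(A @ Lam_inv @ A.T + Sigma, y)   # eq. (9)
            Sig = np.linalg.inv(A.T @ np.linalg.inv(Sigma) @ A + np.diag(lam))   # eq. (10)
            sigma = np.sqrt(mu**2 + np.diag(Sig))                                # eq. (15)
            lam = 1.0 / sigma                                                    # eq. (14), f(x) = |x|
        return mu

    rng = np.random.default_rng(2)
    A = rng.standard_normal((10, 25))
    x_true = np.zeros(25); x_true[[3, 9, 20]] = [2.0, -1.0, 1.5]
    y = A @ x_true + 0.1 * rng.standard_normal(10)
    print(np.round(variational_em_laplacian(A, y, 1e-2 * np.eye(10)), 2))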

3 HYPERPRIORS

Suppose the prior p(x) can be written,

    p(x) = ∫ p(x|ξ) p(ξ) dξ = ∫ N(x; 0, 1/ξ) p(ξ) dξ     (16)

so that we can define a random variable ξ such that, given ξ, x is conditionally Gaussian with variance a function of ξ.

3.1 N/I MAP

First, consider MAP estimation of x, where ξ is taken as the hidden data,

    x̂ = arg max_x log p(x|y)

As before, the free energy is,

    −log p(x, ξ, y) = −log p(y|x) − log p(x|ξ) − log p(ξ)

The only term related to the maximization of x is Σ_i ½ ξ_i x_i², so we need to find E[ξ_i | x_i]. Following [10], we have,

    p′(x) = d/dx ∫ N(x; 0, 1/ξ) p(ξ) dξ = −x ∫ ξ N(x; 0, 1/ξ) p(ξ) dξ
          = −x ∫ ξ p(x, ξ) dξ = −x p(x) ∫ ξ p(ξ|x) dξ

Thus, we see that,

    E[ ξ | x ] = ∫ ξ p(ξ|x) dξ = −p′(x) / ( x p(x) ) = f′(x) / x     (17)

Interestingly, this is equivalent to (7), which was derived from a convexity inequality, while (17) follows from the N/I representation.

3.2 N/I EM

Now consider the reverse case, MAP estimation of the vector ξ where x is taken as hidden,

    ξ̂ = arg max_ξ log p(ξ|y)

We can define an EM-type algorithm to find a locally optimal ξ. Using the fact that,

    log p(ξ|y) = E_{x|ξ, y} log [ p(x, ξ|y) / p(x|ξ, y) ]

we have,

    log p(ξ^(k+1)|y) − log p(ξ^(k)|y)
        = E_{x|ξ^(k), y}[ log p(x, ξ^(k+1)|y) ] − E_{x|ξ^(k), y}[ log p(x, ξ^(k)|y) ]
          + D( p(x|ξ^(k), y) || p(x|ξ^(k+1), y) )

Thus we can treat log p(x, ξ|y) as a complete log-likelihood, analogous to that in the standard EM algorithm, and maximize E_{x|ξ^(k), y}[ log p(x, ξ|y) ] with respect to ξ, which may be possible in closed form. The posterior p(x|ξ, y) is given by,

    p(x|ξ, y) = p(y|x, ξ) p(x|ξ) / p(y|ξ) = N(x; μ_x, Σ_x)

where again, μ_x = Λ⁻¹Aᵀ(AΛ⁻¹Aᵀ + Σ)⁻¹y, Σ_x = (AᵀΣ⁻¹A + Λ)⁻¹, and Λ = diag(ξ). For the complete log-likelihood term, we have,

    −log p(x, ξ|y) = −log p(y|x) − log p(x|ξ) − log p(ξ) + const
        = ½ ‖y − Ax‖²_Σ + ½ xᵀΛx − ½ log|Λ| − log p(ξ) + const
        = ½ ‖y − Ax‖²_Σ + Σ_i [ ½ ξ_i x_i² − ½ log ξ_i − log p(ξ_i) ] + const

The EM-type algorithm proceeds by alternately updating the mean and covariance of p(x|ξ, y), and setting ξ_i^(k+1) to satisfy,

    1/ξ_i + 2 p′(ξ_i) / p(ξ_i) = E_{x|ξ^(k), y}[ x_i² ]     (18)

If the conditional inverse variance is an arbitrary invertible function v(ξ) rather than ξ itself, then ξ_i^(k+1) must satisfy,

    1/v(ξ_i) + 2 p′(ξ_i) / [ p(ξ_i) v′(ξ_i) ] = E_{x|ξ^(k), y}[ x_i² ]     (19)

Alternative algorithms can be derived through reparameterization of the prior p(ξ). As an example, the Laplacian density can be written as conditionally Normal with standard deviation distributed Rayleigh,

    ½ exp(−|x|) = ∫₀^∞ N(x; 0, σ²) p(σ) dσ
                = ∫₀^∞ [ (1/√(2πσ²)) exp(−x²/(2σ²)) ] [ σ exp(−σ²/2) ] dσ

Deriving formula (19) in terms of σ, with conditional variance σ², we have,

    σ_i² − σ_i³ p′(σ_i) / p(σ_i) = E_{x|σ^(k), y}[ x_i² ]     (20)

For the Rayleigh distribution,

    p′(σ_i) / p(σ_i) = 1/σ_i − σ_i

Substituting this in (20), we get,

    σ_i⁴ = E_{x|σ^(k), y}[ x_i² ]

This differs from the case of assuming Gaussian components and estimating the variance parameter as in (11): here the square of the conditional variance is set to the posterior second moment, whereas in (11) the variance itself is set to the posterior second moment. The Laplacian can also be written as conditionally Gaussian with random variance distributed exponentially, which gives a different optimal hyperparameter update, so the algorithms depend on the form of the conditional Gaussianization.

In the case of Maximum Likelihood estimation of A in the linear model y = Ax + ν, given N independent observations of y, we can extend the variational algorithms for estimating the components by simply updating A according to the usual EM update for the Gaussian linear model.
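The Rayleigh scale-mixture representation of the Laplacian used above can be checked numerically; the following sketch (our own illustration) compares the mixture integral against (1/2) exp(−|x|) at a few points:

    import numpy as np
    from scipy.integrate import quad

    # Gaussian scale mixture (N/I) representation of the Laplacian:
    #   (1/2) exp(-|x|) = int_0^inf N(x; 0, sigma^2) * [sigma * exp(-sigma^2 / 2)] dsigma
    # where sigma * exp(-sigma^2 / 2) is the standard Rayleigh density.
    def mixture(x):
        integrand = lambda s: (np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
                               * s * np.exp(-s**2 / 2))
        val, _ = quad(integrand, 0, np.inf)
        return val

    for x in [0.0, 0.5, 1.0, 2.5]:
        print(x, mixture(x), 0.5 * np.exp(-abs(x)))   # the two columns agree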

4 THEORETICAL COMPARISON

In this section we present criteria that can be used to determine when the convex bounding and hyperprior methods are applicable, and examine their relationship to each other. We have seen that the convex bounding and hyperprior algorithms have some interesting similarities, and we pursue this further here.

4.1 CRITERIA FOR APPLICATION

As mentioned in Section 2, the convex bounding methods are based on the notion of square-concavity, or concavity in x², where f is square-concave if f(x) = g(x²) with g concave and increasing over the positive orthant. If p(x) ∝ exp(−f(x)) with f square-concave, then we say that p(x) is strongly super-Gaussian. Criteria for strong super-Gaussianity are equivalent to criteria for the concavity of g. Expressed in terms of f(x) = −log p(x), we have p(x) strongly super-Gaussian if f′(x)/x is decreasing on the positive orthant, or if,

    x f″(x) / f′(x) ≤ 1

The evidence and hyperprior methods are applicable when p(x) is a scale mixture of Gaussians; equivalently, if the random variable X distributed according to p(x) can be written X = N/I, with N a Normal random variable and I a non-negative random variable independent of N, then p(x) is a Gaussian scale mixture, i.e. an N/I density. According to the well-known theorem of Schoenberg [28], a density p is a Gaussian scale mixture if and only if p(‖x‖) is positive definite on Hilbert space. Using Bernstein's theorem on completely monotonic functions, a function p(x) is then a Gaussian scale mixture, i.e.,

    p(x) = ∫ e^{−x²ξ/2} p̃(ξ) dξ

for some non-negative p̃, if and only if p(√t), viewed as a function of t > 0, is completely monotonic, that is, if,

    (−1)^n (d^n/dt^n) p(√t) ≥ 0   for all n ≥ 0

4.2 RELATIONSHIP BETWEEN THE METHODS

The relationship between the convex bounding and hyperprior criteria can be seen from a theorem of Bochner [7, Thm 4.1.5]:

Theorem. If g(t) > 0, then e^{−u g(t)} is completely monotonic for every u > 0 if and only if g′(t) is completely monotonic.

In particular, g′(t) is then decreasing, so that all Gaussian scale mixtures are seen to be strongly super-Gaussian. Thus the class of strongly super-Gaussian densities includes the class of Gaussian scale mixtures, though the reverse is not true. That is, an arbitrary strongly super-Gaussian density cannot in general be expressed as a Gaussian scale mixture. Finally, we note that the algorithms derived for strongly super-Gaussian densities can also be applied to strongly sub-Gaussian densities by solving the dual problem, as the Fenchel-Legendre conjugate of a strongly super-Gaussian density is strongly sub-Gaussian [22, 23].

5 EXPERIMENTS

In the particular application of subset selection, or sparse estimation, algorithms can be compared based on the percentage of the time they find known sparse generating solutions. We present the results of a simple experiment that demonstrates the superiority of the hyperprior (evidence, ARD) methods in subset selection with overcomplete dictionaries, i.e. solving an underdetermined linear system with a known sparse solution. More details can be found in [26, 25].

We generated a random N × M dictionary Φ whose entries were each drawn from a standardized Gaussian distribution. The columns were then normalized to unit l2-norm. Sparse weight vectors w_0 were generated with div(w_0) = D_0, where div(·) is the diversity, or number of non-zero elements. We randomly selected the locations of the nonzero entries and gave them random amplitudes. The vector of target values is then computed as,

    y = Φ w_0     (21)

Each algorithm is then presented with y and Φ and attempts to find w_0. Under this construction (i.e., no noise, randomly generated dictionaries, and random weight amplitudes), all spurious local minima almost surely have a suboptimal diversity of div(w) = N, so w_0 is maximally sparse. Thus, we can be certain that when an algorithm finds w_0, it has found the maximally sparse solution.

Initially, we chose D_0 = 7, N = 20, and allowed M to vary from 20 to 80, i.e., an overcompleteness ratio ranging from 1.0 to 4.0. These results are shown in Figure 1. We compare the FOCUSS algorithms of [24, 26] and the basis pursuit method of [8], which represent MAP estimation of the components, with the evidence-based method of [29].

[Figure 1: Error rate in recovering the generative weights w_0 as a function of the overcompleteness ratio M/N, for FOCUSS (p = 0.001), FOCUSS (p = 0.9), Basis Pursuit (p = 1.0), and SBL, using N = 20, D_0 = 7, and M ranging from 20 to 80.]
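For concreteness, here is a minimal sketch of the data-generation protocol described above (our own transcription of the experimental setup, not the authors' code; function and variable names are ours):

    import numpy as np

    def generate_problem(N=20, M=40, D0=7, seed=0):
        """Random overcomplete dictionary with unit-norm columns and a known sparse solution."""
        rng = np.random.default_rng(seed)
        Phi = rng.standard_normal((N, M))
        Phi /= np.linalg.norm(Phi, axis=0)          # normalize columns to unit l2-norm
        w0 = np.zeros(M)
        support = rng.choice(M, size=D0, replace=False)
        w0[support] = rng.standard_normal(D0)       # random amplitudes on a random support
        y = Phi @ w0                                # noiseless targets, eq. (21)
        return Phi, y, w0

    Phi, y, w0 = generate_problem()
    # An algorithm "succeeds" on a trial if its estimate matches w0 up to a small tolerance.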
When the basis set, or dictionary, is complete, or in the case of sparse estimation for kernel methods (having a square design matrix) [29, 11], we have found that the convex bounding and evidence methods are very similar in their performance, which is to be expected in the low-noise case.

References

[1] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. J. Roy. Statist. Soc. Ser. B, 36:99-102, 1974.

[2] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.
[3] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[4] M. J. Beal and Z. Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7. Oxford University Press, 2002.
[5] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, 1990.
[6] C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 2000.
[7] S. Bochner. Harmonic Analysis and the Theory of Probability. University of California Press, Berkeley and Los Angeles, 1955.
[8] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
[9] R. A. Choudrey and S. J. Roberts. Variational mixture of Bayesian independent component analysers. Neural Computation, 15(1):213-252, 2002.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Iteratively reweighted least squares for linear regression when errors are Normal/Independent distributed. In P. R. Krishnaiah, editor, Multivariate Analysis V. North-Holland Publishing Company, 1980.
[11] M. Figueiredo. Adaptive sparseness using Jeffreys prior. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
[12] M. Girolami. A variational method for learning sparse and overcomplete representations. Neural Computation, 13:2517-2532, 2001.
[13] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall.
[14] G. E. Hinton and D. van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory. ACM Press, 1993.
[15] T. S. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. PhD thesis, Massachusetts Institute of Technology, 1997.
[16] T. S. Jaakkola and M. I. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Proceedings of the 1997 Conference on Artificial Intelligence and Statistics, 1997.
[17] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publishers, 1998.
[18] H. Lappalainen. Ensemble learning for independent component analysis. In Proceedings of the First International Workshop on Independent Component Analysis, 1999.
[19] N. D. Lawrence and C. M. Bishop. Variational Bayesian independent component analysis. Technical report, 2000.
[20] D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation, 11(5):1035-1068, 1999.
[21] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[22] J. A. Palmer and K. Kreutz-Delgado. A globally convergent algorithm for maximum likelihood estimation with non-Gaussian priors. In Proceedings of the 36th Asilomar Conference on Signals and Systems. IEEE, 2002.
[23] J. A. Palmer and K. Kreutz-Delgado. A general framework for component estimation. In Proceedings of the 4th International Symposium on Independent Component Analysis, 2003.
[24] B. D. Rao and I. F. Gorodnitsky. Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 45(3):600-616, 1997.
[25] B. D. Rao and K. Kreutz-Delgado. An affine scaling methodology for best basis selection. IEEE Transactions on Signal Processing, 47(1):187-200, 1999.
[26] B. D. Rao, K. Engan, S. F. Cotter, J. Palmer, and K. Kreutz-Delgado. Subset selection in noise based on diversity measure minimization. IEEE Transactions on Signal Processing, 51(3), 2003.
[27] L. K. Saul, T. S. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.
[28] I. J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics, 39(4):811-841, 1938.
[29] M. E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211-244, 2001.
