Variational inference in the conjugate-exponential family
|
|
- Randell Hood
- 5 years ago
- Views:
Transcription
1 Variational inference in the conjugate-exponential family Matthew J. Beal Work with Zoubin Ghahramani Gatsby Computational Neuroscience Unit August 2000
2 Abstract Variational inference in the conjugate-exponential family When learning models from data there is always the problem of over-fitting/generalisation performance. In a Bayesian approach this can be resolved by averaging over all possible settings of the model parameters. However for most interesting problems, averaging over models leads to intractable integrals and so approximations are necessary. Variational methods are becoming a widespread tool to approximate inference and learning in many graphical models: they are deterministic, generally fast, and can monotonically increase an objective function which transparently incorporates model complexity cost. I ll provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models, and show how well-known algorithms (e.g. belief-propagation) can be readily incorporated in the variational updates. Some examples to illustrate the ideas: determining the most probable number of clusters and their intrinsic latent-space dimensionality in a Mixture of Factor Analysers model; and recovering the hidden state-space dimensionality in a Linear Dynamical System model. This is work with Zoubin Ghahramani.
3 Outline ffl Briefly review variational Bayesian learning. ffl Concentrate on conjugate-exponential models. ffl Variational Expectation-Maximisation (VEM). ffl Examples in Mixture of Factor Analysers (MFA) and Linear Dynamical Systems (LDS).
4 Bayesian approach Motivation ffl Maximum likelihood (ML) does not penalise complex models, which can a priori model a larger range of data sets. ffl Bayes does not suffer from overfitting if we integrated out all the parameters. ffl We should be able to do principled model comparison. Method ffl Express distributions over all parameters in the model. ffl Calculate the Evidence P (Y jm) = Z d P (Y j ;M)P ( jm): too simple P(Data) Data Sets Y "just right" too complex (adapted from D.J.C. MacKay) Problem ffl These integrals are computationally intractable.
5 Practical approaches ffl Laplace approximations: Appeal to Central Limit Theorem, making a Gaussian approximation about the maximum a posteriori parameter estimate. ffl Large sample approximations: e.g. BIC. ffl Markov chain Monte Carlo (MCMC): Guaranteed to converge in the limit. Many samples required for accurate results. Hard to assess convergence. ffl Variational approximations.
6 The variational method Let the hidden states be X and the parameters, ln P (Y ) = ln Z dx d P (Y; X; ) Z dx d Q(X; )ln P (Y; X; ) Q(X; ) : Approximate the intractable distribution over hidden states (X) and parameters ( ) with a simpler distribution. ln P (Y ) = ln Z dx d P (Y; X; ) Z dx d Q X (X)Q ( )ln = F(Q; Y ): P (Y; X; ) Q X (X)Q ( ) Maximisation of this lower bound leads to EM-like updates: E step Q Λ X (X) / exp hln P (X;Y j )i Q ( ) M step Q Λ ( ) / P ( )exphln P (X;Y j )i Q X (X) Equivalent to minimizing the KL-divergence between the approximating and true posteriors.
7 a bit more explanation F(Q; Y ) = = Z Z dx d Q X (X)Q ( )ln d Q ( ) Z " ln P ( ) Q ( ) + dx Q X (X)ln P (X; Y; ) Q X (X)Q ( ) P (X; Y j ) Q X (X) # : e.g. setting a derivative to Y ( ) / ln P ( ) 1 ln Q ( ) + hln P (X; Y j )i QX (X) E step Q Λ X (X) / exp hln P (X;Y j )i Q ( ) M step Q Λ ( ) / P ( )exphln P (X;Y j )i Q X (X) Question: What distributions can we make use of in our models?
8 Conjugate-Exponential models Consider variational Bayesian learning in models that satisfy: Condition (1). The complete data likelihood is in the exponential family: P (x; yj ) = f(x; y) g( )exp n o ffi( ) > u(x; y) where ffi( ) is the vector of natural parameters, and u and f and g are the functions that define the exponential family. Condition (2). The parameter prior is conjugate to the complete data likelihood: P ( j ; ν) = h( ; ν) g( ) exp n o ffi( ) > ν where and ν are hyperparameters of the prior. We call models that satify conditions (1) and (2) conjugate-exponential.
9 Exponential family models Some models that are in the exponential family: ffl Gaussian mixtures, ffl factor analysis, probabilistic PCA, ffl hidden Markov models and factorial HMMs, ffl linear dynamical systems and switching state-space models, ffl Boltzmann machines, ffl discrete-variable belief networks. Other as yet undreamt-of models can combine Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and others. Some which are not in exponential family: ffl sigmoid belief networks, ffl independent components analysis Make approximations/changes to move into exponential family...
10 Theoretical Results Theorem 1 Given an iid data set Y = (y 1 ;:::y n ), if the model satisfies conditions (1) and (2), then at the maxima of F(Q; Y ) (minima of KL(QkP )): (a) Q ( ) is conjugate and of the form: Q ( ) = h(~ ; ~ν)g( ) ~ exp n o ffi( ) > ~ν where ~ = + n, ~ν = ν + P n i=1 u(x i ; y i ), and u(x i ; y i ) = hu(x i ; y i )i Q, using h i Q to denote expectation under Q. (b) Q X (X) = Q n i=1 Q xi (x i ) and Q xi (x i ) is of the same form as the known parameter posterior: Q xi (x i ) / f(x i ; y i )exp = P (x i jy i ; ffi( )) where ffi( ) = hffi( )i Q. n ffi( ) > u(x i ; y i ) o KEY points: (a) the approximate parameter posterior is of the same form as the prior; (b) the approximate hidden variable posterior is of the same form as the hidden variable posterior for known parameters.
11 Variational EM algorithm Since Q ( ) and Q xi (x i ) are coupled, (a) and (b) do not provide an analytic solution to the minimisation problem. VE Step: Compute the expected sufficient statistics t(y ) = P i u(x i ; y i ) under the hidden variable distributions Q xi (x i ). VM Step: Compute the expected natural parameters ffi( ) under the parameter distribution given by ~ and ~ν. Properties: ffl VE step has same complexity as corresponding E step. ffl Reduces to the EM algorithm if Q ( ) = ffi( Λ ).M step then involves re-estimation of Λ. ffl F increases monotonically, incorporates model complexity penalty.
12 Belief networks Corollary 1: Conjugate-Exponential Belief Networks. Let M be a conjugate-exponential model with hidden and visible variables z = (x; y) that satisfy a belief network factorisation. That is, each variable z j has parents z pj and P (zj ) = Q j P (z jjz pj ; ). Then the approximating joint distribution for M satisfies the same belief network factorisation: Q z (z) = Y j Q(z j jz pj ; ~ ) where the conditional distributions have exactly the same form as those in the original model but with natural parameters ffi(~ ) = ffi( ). Furthermore, with the modified parameters ~, the expectations under the approximating posterior Q x (x) / Q z (z) required for the VE Step can be obtained by applying the belief propagation algorithm if the network is singly connected and the junction tree algorithm if the network is multiply-connected.
13 Markov networks Theorem 2: Markov Networks. Let M be a model with hidden and visible variables z = (x; y) that satisfy a Markov network factorisation. That is, the joint density can be written as a product of clique-potentials ψ j, P (zj ) = g( ) Q j ψ j(c j ; ), where each clique C j is a subset of the variables in z. Then the approximating joint distribution for M satisfies the same Markov network factorisation: Q z (z) = ~g Y j ψ j (C j ) where ψ j (C j ) = exp n hln ψj (C j ; )i Q o are new clique potentials obtained by averaging over Q ( ), and ~g is a normalisation constant. Furthermore, the expectations under the approximating posterior Q x (x) required for the VE Step can be obtained by applying the junction tree algorithm. Corollary 2: Conjugate-Exponential Markov Networks. Let M be a conjugate-exponential Markov network over the variables in z. Then the approximating joint distribution for M is given by Q z (z) = ~g Q j ψ j(c j ; ~ ), where the clique potentials have exactly the same form as those in the original model but with natural parameters ffi(~ ) = ffi( ).
14 Mixture of factor analysers p-dim. vector y generated from k unobserved independent Gaussian sources, x, then corrupted with Gaussian noise, r, with diagonal covariance matrix Ψ: y = Λx + r. Marginal density of y is Gaussian with zero mean and covariance ΛΛ T + Ψ. a,b ν 1 ν 2... ν S Λ 1 Λ 2... Λ S α s n π Ψ y n n=1...n x n ffl Complete data likelihood is in exponential family. ffl Choose conjugate priors: (ν ο Gamma, Λ οgaussian, ß οdirichlet). VE: posteriors over hidden states calculated as usual. VM: posteriors over parameters have same form as priors.
15 Synthetic example True data: 6 Gaussian clusters with dimensions: ( ) embedded in 10 dimensions. Inferred structure: ffl Finds the clusters and their dimensionalities. ffl Model complexity reduces in line with lack of data support. number of points intrinsic dimensionalities per cluster
16 Linear Dynamical Systems X 1 X 2 X 3 X T Y 1 Y 2 Y 3 Y T ffl Sequence of D-dimensional real-valued observed vectors y 1:T. ffl Assume at each time step t, y t was generated from a k-dimensional real-valued hidden state variable x t, and that the sequence of x 1:T define a first order Markov process. ffl The joint probability is given by P (x 1:T ; y 1:T ) = P (x 1 )P (y 1 jx 1 ) TY t=2 P (x t jx t 1 )P (y t jx t ): ffl If transition and output functions are linear, time-invariant, and noise distributions are Gaussian, this is a Linear-Gaussian state-space model: x t = Ax t 1 + w t ; y t = Cx t + r t :
17 Model selection using ARD EM underlying Dynamics matrix, A Output matrix, C ffl Gaussian priors on the entries in the columns of these matrices. P (a i jff) = N (0; diag(ff) 1 ); P (c i jfi) = N (0; diag(ρ i fi) 1 ) ffl ff and fi are vectors of hyperparameters, to be optimised during learning. Some ff,fi! 1. ffl Empty columns signify extinct hidden dimensions for: modelling hidden dynamics, for modelling output covariance structure.
18 LDS variational approximation Parameters of the model = (A; C; ρ), hidden states x 1:T. ln P (Y ) = ln Z da dc dρ dx 1:t P (A; C; ρ; x 1:T ; y 1:T ): Lower bound: Jensen s inequality ln P (Y ) F(Q; Y ) = Z da dc dρ dx 1:T Q(A; C; ρ; x 1:T ) ln P (A; C; ρ; x 1:T ; y 1:T ) : Q(A; C; ρ; x 1:T ) Factored approximation to the true posterior Q(A; C; ρ; x 1:T ) = Q A (A)Q Cρ (C; ρ)q x (x 1:T ): This factorisation falls out from the initial assumptions and the conditional independencies in the model. Priors: ffl Sensor precisions, ρ, given conjugate gamma priors. ffl Transition and output matrices given conjugate zero-mean Gaussian priors ARD.
19 Optimal Q( ) distributions: LDS Complete data likelihood is of conjugate-exponential form: iterative analytic maximisations. Dynamics matrix Q A (A): A ο N (a i ; a i ; ± A i ) a i = S > ± A i ; ± A i = (diag(ff)+w ) 1 Noise covariance Q ρ (ρ): ρ i ο Gamma(ρ i ;~a;~b) ~a = a + T=2; ~b = b + g i =2 Output matrix Q C (Cjρ i ): C ο N (c i ; c i ; ± C i ) c i = ρ i U i ± C i ; ± C i = (diag(fi)+w 0 ) 1 =ρ i Hidden state Q x (x 1:T ): x t ο N (x t ; (T ) t ; Ψ (T ) t ) f (T ) t ; Ψ (T ) t g = KalmanSmoother(Q ACρ ; y 1:T ) Hyperparameters ff; fi: 1=ff = ha ai QA 1=fi = hc ci QCρ S = g i = TX TX t=2 X T 1 hx t 1 x > i; W = t hx t x > i; t t=1 y 2 ti U i(diag(fi) + W 0 )U > i ; U i = t=1 t=1 W 0 = W + hx t x > t i TX y ti hx > t i;
20 Method overview 1. Randomly initialise the approximate posteriors; 2. Update each posterior according to variational E- & M- steps ffl Q A (A) Q Cρ (Cjρ) Qρ(ρ) Q x (x 1:T ); 3. Update the hyperparameters: ff and fi; 4. Repeat steps 2-3 until convergence.
21 Variational E Step for LDS X 1 X 2 X 3 X T Y 1 Y 2 Y 3 Y T ffl Conjugate-exponential singly connected belief network, ffl (C1) use Belief Propagation, which for LDSs is the Kalman smoother: Forward: infer state x t given past observations, Backward: infer state x t given future observations. ffl Inference using an ensemble of parameters is possible, using just a single setting of the parameters. Required averages: hρ i c i i; hρ i c i c i> i; hai; ha > Ai ffl The result is a recursive algorithm very similar to the Kalman smoothing propagation.
22 Synthetic experiments Generated a variety of state-space models and looked the recovered hidden structure. Observed data is ffl 10-dim output ffl 200 time-steps ffl True model: 3-dim static state-space. Inferred model: A C X X Y ffl True model: 3-dim dynamical state-space. Inferred model: X X Y ffl True model: 4-dim state-space, of which 3 dynamical. Inferred model: X X Y
23 Degradation with data support True model: ffl 6-dim state-space ffl 10-dim output Complexity of recovered structure decreases with decreasing data support (400 to 10 time-steps). Inferred models: X X Y X X Y X X Y X X Y X X Y X X Y X X Y
24 Summary ffl Bayesian learning avoids overfitting and can be used to learn model structure. ffl Tractable learning using variational approximations. ffl Conjugate-exponential familes of models. ffl Variational EM and integration of existing algorithms. ffl Examples learning structure in MFA and LDS models. Issues & Extensions ffl Quality of the variational approximations? ffl How best to bring a model into exponential family? ffl Extensions to other models, e.g. switching state-space models [Ueda], is straightforward. ffl Problems integrating existing algorithms into the VE step. ffl Implementation on a general graphical model.
25 Optimal Q( ) distributions: MFA Q(x n js n ) sn (x n;s ; ± s ) : ± s 1 = hλ st Ψ 1 Λ s i Q(Λs ) + I Q(Λ s q ) sn (Λs q; ± q;s ) : Λ s q = " ± q;s 1 = Ψ 1 qq Ψ 1 N X n=1 NX n=1 x n;s = ± s Λ st Ψ 1 y n Q(s n )y n x n;st ± q;s # Q(s n )hx n x nt i Q(xn js n ) + diaghν s i Q(ν s ) q Q(ν s l ) sgamma(as l ;b s l ) : as l = a + p 2 b s l = b px q=1 hλ s ql2 iq(λs ) Q(ß) sdirichlet(!u) :!u s = ff S + N X n=1 Q(s n ) ln Q(s n ) = [ψ(!u s ) ψ(!)] ln j±s j + hln P (y n jx n ;s n ; Λ s ; Ψ)i Q(xn js n )Q(Λ s ) + c Note that the optimal distributions Q(Λ s ) have block diagonal covariance structure. So even though each Λ s is a p q matrix, its covariance only has O(pq 2 ) parameters. Differentiating F with respect to the parameters, a and b, of the precision prior we get fixed point equations ψ(a) = hln νi + ln b and b = a=hνi. Similarly the fixed point for the hyperparameters for the Dirichlet prior is ψ(ff) ψ(ff=s) + P [ψ(!u s ) ψ(!)] =S = 0.
26 Sampling from Variational Approximations Sampling m s Q( ) gives us estimates of: ffl The Exact Predictive Density: P (yjy ) = = Z Z ß M X m=1 d P (yj )P ( jy ) d Q( )P (yj ) P ( jy ) Q( ) P (yj m)!m weights:!m = Ω 1 P ( m ;Y ) Q( m ) ; withω P s:t:! m = 1 ffl The True Evidence: P (Y jm) = ffl The KL Divergence: Z d Q( ) P ( ; Y ) Q( ) = hω!i KL(Q( )kp ( jy )) = lnh!i hln!i:
27 Variational Bayes & Ensemble Learning ffl multilayer perceptrons (Hinton & van Camp, 1993) ffl mixture of experts (Waterhouse, MacKay & Robinson, 1996) ffl hidden Markov models (MacKay, 1995) ffl other work by Jaakkola, Barber, Bishop, Tipping, etc Examples of VB Learning Model Structure Model learning has been treated with variational Bayesian techniques for: ffl mixtures of factor analysers (Ghahramani & Beal, 1999) ffl mixtures of Gaussians (Attias, 1999) ffl independent components analysis (Attias, 1999; Miskin & MacKay, 2000; Valpola 2000) ffl principal components analysis (Bishop, 1999) ffl linear dynamical systems (Ghahramani & Beal, 2000) ffl mixture of experts (Ueda & Ghahramani, 2000) ffl hidden Markov models (Ueda & Ghahramani, in prep) (compiled by Zoubin)
28 0.3 Evolution of F, true evidence and KL-divergence x Train and Test fractional error train error test error lnp(y) train F(π,Λ) train F(θ) F Iteration
29 Required expectations for linear-gaussian state-space models a + T=2 defining G 0 = diag b + G=2 hc > R 1 Ci = diagfi + W 0 1 hc > R 1 i = diagfi + W 0 1 UG 0 hai = S > (diagff + W ) 1 ha > Ai = k (diagff + W ) 1 + hai > hai hp + UG 0 U > diagfi + W 0 1 i P Ω ff P T > T 1 Ω ff P > T where S = t=2 xt 1 x t, W = t=1 xt x t, U i = t=1 y tihx > t i and W 0 = W + Ω x T x T > ff. Kalman smoothing propagation Forward recursion: ± Λ t = ± 1 t Backward recursion: + ha > Ai 1 μ t = ± t hc> R 1 iy t + hai± Λ t 1 ± 1 t 1 μ t 1 ± t = I + hc > R 1 Ci hai± Λ t 1 hai>λ 1 t = Ψ t h± 1 t μ t + hai > Ψ 1 t+1 + hai> 1 hai±λ t Ψ 1 Ψ t = h ± Λ 1 t hai > Ψ 1 t+1 t+1 hai±λ t ± 1 t Λ μ t Λ t+1 + hai±λ t hai> 1 hai i 1 cov(x t ; x t+1 ) = Ψ 1 t+1 + hai±λ t hai> hai > ± Λ t haiλ 1
Unsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationCSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection
CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More informationStatistical Approaches to Learning and Discovery
Statistical Approaches to Learning and Discovery Bayesian Model Selection Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationVariational Principal Components
Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings
More informationp L yi z n m x N n xi
y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory
More informationVariational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures
17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationThe Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures
BAYESIAN STATISTICS 7, pp. 000 000 J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (Eds.) c Oxford University Press, 2003 The Variational Bayesian EM
More informationBayesian Inference Course, WTCN, UCL, March 2013
Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information
More informationInference and estimation in probabilistic time series models
1 Inference and estimation in probabilistic time series models David Barber, A Taylan Cemgil and Silvia Chiappa 11 Time series The term time series refers to data that can be represented as a sequence
More informationBayesian Machine Learning - Lecture 7
Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1
More informationThe Variational Kalman Smoother
The Variational Kalman Smoother Matthew J. Beal and Zoubin Ghahramani Gatsby Computational Neuroscience Unit {m.beal,zoubin}@gatsby.ucl.ac.uk http://www.gatsby.ucl.ac.uk May 22, 2000. last revision April
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationVariational Learning : From exponential families to multilinear systems
Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationQuantitative Biology II Lecture 4: Variational Methods
10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate
More informationVariational Methods in Bayesian Deconvolution
PHYSTAT, SLAC, Stanford, California, September 8-, Variational Methods in Bayesian Deconvolution K. Zarb Adami Cavendish Laboratory, University of Cambridge, UK This paper gives an introduction to the
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationMachine Learning Techniques for Computer Vision
Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM
More informationU-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models
U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models Jaemo Sung 1, Sung-Yang Bang 1, Seungjin Choi 1, and Zoubin Ghahramani 2 1 Department of Computer Science, POSTECH,
More informationVariational Message Passing. By John Winn, Christopher M. Bishop Presented by Andy Miller
Variational Message Passing By John Winn, Christopher M. Bishop Presented by Andy Miller Overview Background Variational Inference Conjugate-Exponential Models Variational Message Passing Messages Univariate
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationTowards a Bayesian model for Cyber Security
Towards a Bayesian model for Cyber Security Mark Briers (mbriers@turing.ac.uk) Joint work with Henry Clausen and Prof. Niall Adams (Imperial College London) 27 September 2017 The Alan Turing Institute
More informationBayesian Learning in Undirected Graphical Models
Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationIntegrated Non-Factorized Variational Inference
Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014
More informationVariational Bayesian Linear Dynamical Systems
Chapter 5 Variational Bayesian Linear Dynamical Systems 5.1 Introduction This chapter is concerned with the variational Bayesian treatment of Linear Dynamical Systems (LDSs), also known as linear-gaussian
More informationStatistical Approaches to Learning and Discovery
Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University
More informationVariational Message Passing
Journal of Machine Learning Research 5 (2004)?-? Submitted 2/04; Published?/04 Variational Message Passing John Winn and Christopher M. Bishop Microsoft Research Cambridge Roger Needham Building 7 J. J.
More informationBayesian Hidden Markov Models and Extensions
Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling
More informationVariational Inference (11/04/13)
STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further
More information2 XAVIER GIANNAKOPOULOS AND HARRI VALPOLA In many situations the measurements form a sequence in time and then it is reasonable to assume that the fac
NONLINEAR DYNAMICAL FACTOR ANALYSIS XAVIER GIANNAKOPOULOS IDSIA, Galleria 2, CH-6928 Manno, Switzerland z AND HARRI VALPOLA Neural Networks Research Centre, Helsinki University of Technology, P.O.Box 5400,
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning
More informationPart 1: Expectation Propagation
Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud
More informationVariational Bayesian Hidden Markov Models
Chapter 3 Variational Bayesian Hidden Markov Models 3.1 Introduction Hidden Markov models (HMMs) are widely used in a variety of fields for modelling time series data, with applications including speech
More informationThe Variational Gaussian Approximation Revisited
The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much
More informationExpectation Propagation Algorithm
Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,
More informationMachine Learning 4771
Machine Learning 4771 Instructor: ony Jebara Kalman Filtering Linear Dynamical Systems and Kalman Filtering Structure from Motion Linear Dynamical Systems Audio: x=pitch y=acoustic waveform Vision: x=object
More informationThe connection of dropout and Bayesian statistics
The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University
More informationLecture 1a: Basic Concepts and Recaps
Lecture 1a: Basic Concepts and Recaps Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationoutput dimension input dimension Gaussian evidence Gaussian Gaussian evidence evidence from t +1 inputs and outputs at time t x t+2 x t-1 x t+1
To appear in M. S. Kearns, S. A. Solla, D. A. Cohn, (eds.) Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 999. Learning Nonlinear Dynamical Systems using an EM Algorithm Zoubin
More informationSequential Monte Carlo Methods for Bayesian Computation
Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationLecture 6: April 19, 2002
EE596 Pat. Recog. II: Introduction to Graphical Models Spring 2002 Lecturer: Jeff Bilmes Lecture 6: April 19, 2002 University of Washington Dept. of Electrical Engineering Scribe: Huaning Niu,Özgür Çetin
More informationComputer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization
Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions
More informationLecture 16 Deep Neural Generative Models
Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed
More informationMachine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014
Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,
More informationChris Bishop s PRML Ch. 8: Graphical Models
Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular
More informationLecture 1b: Linear Models for Regression
Lecture 1b: Linear Models for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced
More informationGraphical Models and Kernel Methods
Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.
More informationECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4
ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationWeek 3: The EM algorithm
Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent
More informationAn introduction to Variational calculus in Machine Learning
n introduction to Variational calculus in Machine Learning nders Meng February 2004 1 Introduction The intention of this note is not to give a full understanding of calculus of variations since this area
More informationVariational Message Passing
Journal of Machine Learning Research 5 (2004)?-? Submitted 2/04; Published?/04 Variational Message Passing John Winn and Christopher M. Bishop Microsoft Research Cambridge Roger Needham Building 7 J. J.
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationStudy Notes on the Latent Dirichlet Allocation
Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection
More informationAnother Walkthrough of Variational Bayes. Bevan Jones Machine Learning Reading Group Macquarie University
Another Walkthrough of Variational Bayes Bevan Jones Machine Learning Reading Group Macquarie University 2 Variational Bayes? Bayes Bayes Theorem But the integral is intractable! Sampling Gibbs, Metropolis
More informationVIBES: A Variational Inference Engine for Bayesian Networks
VIBES: A Variational Inference Engine for Bayesian Networks Christopher M. Bishop Microsoft Research Cambridge, CB3 0FB, U.K. research.microsoft.com/ cmbishop David Spiegelhalter MRC Biostatistics Unit
More informationLecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models
Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance
More informationAn Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models
An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models Harri Valpola and Juha Karhunen Helsinki University of Technology, Neural Networks Research Centre P.O.Box 5400, FIN-02015
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationBayesian Learning in Undirected Graphical Models
Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and
More informationGraphical models: parameter learning
Graphical models: parameter learning Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London London WC1N 3AR, England http://www.gatsby.ucl.ac.uk/ zoubin/ zoubin@gatsby.ucl.ac.uk
More informationan introduction to bayesian inference
with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationBayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine
Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA Contents in latter part Linear Dynamical Systems What is different from HMM? Kalman filter Its strength and limitation Particle Filter
More informationArtificial Intelligence
Artificial Intelligence Probabilities Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: AI systems need to reason about what they know, or not know. Uncertainty may have so many sources:
More informationFactor Analysis (10/2/13)
STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.
More informationWidths. Center Fluctuations. Centers. Centers. Widths
Radial Basis Functions: a Bayesian treatment David Barber Bernhard Schottky Neural Computing Research Group Department of Applied Mathematics and Computer Science Aston University, Birmingham B4 7ET, U.K.
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s
More informationType II variational methods in Bayesian estimation
Type II variational methods in Bayesian estimation J. A. Palmer, D. P. Wipf, and K. Kreutz-Delgado Department of Electrical and Computer Engineering University of California San Diego, La Jolla, CA 9093
More informationMarkov Chain Monte Carlo Methods for Stochastic Optimization
Markov Chain Monte Carlo Methods for Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge U of Toronto, MIE,
More informationbound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o
Category: Algorithms and Architectures. Address correspondence to rst author. Preferred Presentation: oral. Variational Belief Networks for Approximate Inference Wim Wiegerinck David Barber Stichting Neurale
More informationParticle Filtering Approaches for Dynamic Stochastic Optimization
Particle Filtering Approaches for Dynamic Stochastic Optimization John R. Birge The University of Chicago Booth School of Business Joint work with Nicholas Polson, Chicago Booth. JRBirge I-Sim Workshop,
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationNON-LINEAR NOISE ADAPTIVE KALMAN FILTERING VIA VARIATIONAL BAYES
2013 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING NON-LINEAR NOISE ADAPTIVE KALMAN FILTERING VIA VARIATIONAL BAYES Simo Särä Aalto University, 02150 Espoo, Finland Jouni Hartiainen
More informationProbabilistic and Bayesian Machine Learning
Probabilistic and Bayesian Machine Learning Day 4: Expectation and Belief Propagation Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 12: Gaussian Belief Propagation, State Space Models and Kalman Filters Guest Kalman Filter Lecture by
More information