# STA 4273H: Statistical Machine Learning

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! Sidney Smith Hall, Room 6002 Lecture 7

2 Approximate Inference When using probabilistic graphical models, we will be interested in evaluating the posterior distribution p(z X) of the latent variables Z given the observed data X. For example, in the EM algorithm, we need to evaluate the expectation of the complete-data log-likelihood with respect to the posterior distribution over the latent variables. For more complex models, it may be infeasible to evaluate the posterior distribution, or compute expectations with respect to this distribution. This typically occurs when working with high-dimensional latent spaces, or when the posterior distribution has a complex form, for which expectations are not analytically tractable (e.g. Boltzmann machines). We will examine a range of deterministic approximation schemes, some of which scale well to large application.

3 Computational Challenges Remember: the big challenge is computing the posterior distribution. There are several main approaches: Analytical integration: If we use conjugate priors, the posterior distribution can be computed analytically (we saw this in case of Bayesian linear regression). Gaussian (Laplace) approximation: Approximate the posterior distribution with a Gaussian. Works well when there is a lot of data compared to the model complexity (as posterior is close to Gaussian). Monte Carlo integration: The dominant current approach is Markov Chain Monte Carlo (MCMC) -- simulate a Markov chain that converges to the posterior distribution. It can be applied to a wide variety of problems. Variational approximation: A cleverer way to approximate the posterior. It often works much faster, but not as general as MCMC.

4 Probabilistic Model Suppose that we have a fully Bayesian model in which all parameters are given prior distributions. The model may have latent variables and parameters, and we will denote the set of all latent variables and parameters by Z. We will also denote the set of all observed variables by X. For example, we may be given a set of N i.i.d data points, so that X ={x 1,,x N } and Z = {z 1,,z N } (as we saw in previous class). Our probabilistic model specifies the joint distribution P(X,Z). Our goal is to find approximate posterior distribution P(Z X) and the model evidence p(x).

5 Variational Bound As in our previous lecture, we can decompose the marginal log-probability as: where Note that parameters are now stochastic variables and are absorbed into Z. We can maximize the variational lower bound with respect to the distribution q(z), which is equivalent to minimizing the KL divergence. If we allow any possible choice of q(z), then the maximum of the lower bound occurs when: In this case KL divergence becomes zero.

6 Variational Bound As in our previous lecture, we can decompose the marginal log-probability as: We will assume that the true posterior distribution is intractable. We can consider a restricted family of distributions q(z) and then find the member of this family for which KL is minimized. Our goal is to restrict the family of distributions so that it contains only tractable distributions. At the same time, we want to allow the family to be sufficiently rich and flexible, so that it can provide a good approximation to the posterior. One option is to use parametric distributions q(z!), governed by parameters!. The lower bound then becomes a function of!, and can optimize the lowerbound to determine the optimal values for the parameters.

7 Example One option is to use parametric distributions q(z!), governed by parameters!. An example, in which the variational distribution is Gaussian, and we optimize with respect to its mean and variance. The original distribution (yellow), along with Laplace (red), and variational (green) approximations.

8 Mean-Field We now consider restricting the family of distributions. Partition the elements of Z into M disjoint groups, denoted by Z i, i=1,,m. We assume that the q distribution factorizes with respect to these groups: Note that we place no restrictions on the functional form of the individual factors q i (we will often denote q i (Z i ) as simply q i ). This approximation framework, developed in physics, is called mean-field theory.

9 Factorized Distributions Among all factorized distributions, we look for a distribution for which the variational lower bound is maximized. Denoting q i (Z i ) as simply q I, we have: where we denote a new distribution:

10 Factorized Distributions Among all factorized distributions, we look for a distribution for which the variational lower bound is maximized. Denoting q i (Z i ) as simply q I, we have: where Here we take an expectation with respect to the q distribution over all variables Z i for i j, so that:

11 Maximizing Lower Bound Now suppose that we keep fixed, and optimize the lower bound with respect to all possible forms of the distribution q j (Z j ). This optimization is easily done by recognizing that: constant: does not depend on q. so the minimum occurs when or Observe: the log of the optimum solution for factor q j is given by: - Considering the log of the joint distribution over all hidden and visible variables - Taking the expectation with respect to all other factors {q i } for i j.

12 Maximizing Lower Bound Exponentiating and normalizing, we obtain: The set of these equations for j=1,,m represent the set of consistency conditions for the maximum of the lower bound subject to factorization constraint. To obtain a solution, we initialize all of the factors and then cycle through factors, replacing each in tern with a revised estimate. Convergence is guaranteed because the bound is convex with respect to each of the individual factors.

13 Factroized Gaussian Consider a problem of approximating a general distribution by a factorized distribution. To get some insight, let us look at the problem of approximating a Gaussian distribution using a factorized Gaussian distribution. Consider a Gaussian distribution over two correlated variables z = (z 1,z 2 ). Let us approximate this distribution using a factorized Gaussian of the form:

14 Factroized Gaussian Remember: Consider an expression for the optimal factor q 1 : Note that we have a quadratic function of z 1, and so we can identify q 1 (z 1 ) as a Gaussian distribution:

15 By symmetry, we also obtain: Factroized Gaussian There are two observations to make: - We did not assume that is Gaussian, but rather we derived this result by optimizing variational bound over all possible distributions. - The solutions are coupled. The optimal depends on expectation computed with respect to One option is to cycle through the variables in turn and update them until convergence.

16 By symmetry, we also obtain: Factroized Gaussian However, in our case, The green contours correspond to 1,2, and 3 standard deviations of the correlated Gaussian. The red contours correspond to the factorial approximation q(z) over the same two variables. Observe that a factorized variational approximation tends to give approximations that are too compact.

17 Alternative Form of KL Divergence We have looked at the variational approximation that minimizes KL(q p). For comparison, suppose that we were minimizing KL(p q). It is easy to show that: constant: does not depend on q. The optimal factor is given by the marginal distribution of p(z).

18 Comparison of two KLs Comparison of two the alternative forms for the KL divergence. KL(q p) KL(p q) Approximation is too compact. Approximation is too spread.

19 Comparison of two KLs The difference between these two approximations can be understood as follows: KL(q p) There is a large positive contribution to the KL divergence from regions of Z space in which: - p(z) is near zero, - unless q(z) is also close to zero. Minimizing KL(q p) leads to distributions q(z) that avoid regions in which p(z) is small.

20 Comparison of two KLs Similar arguments apply for the alternative KL divergence: KL(p q) There is a large positive contribution to the KL divergence from regions of Z space in which: - q(z) is near zero, - unless p(z) is also close to zero. Minimizing KL(p q) leads to distributions q(z) that are nonzero in regions where p(z) is nonzero.

21 Approximating Multimodal Distribution Consider approximating multimodal distribution with a unimodal one. Blue contours show bimodal distribution p(z), red contours show a single Gaussian distribution that best approximates q(z) that best approximates p(z). KL(p q) KL(q p) KL(q p) In practice, the true posterior will often be mutlimodal. KL(q p) will tend to find a single mode, whereas KL(p q) will average across all of the modes.

22 Alpha-family of Divergences The two forms of KL are members of the alpha-family divergences: Observe three points: - KL(p q) corresponds to the limit! 1. - KL(q p) corresponds to the limit! D (p q) 0, for all, and D (p q)=0 iff q(x) = p(x). Suppose p(x) is fixed and we minimize D (p q) with respect to q distribution. For < -1, the divergence is zero-forcing: q(x) will underestimate the support of p(x). For > 1, the divergence is zero-avoiding: q(x) will stretch to cover all of p(x). For = 0, we obtain a symmetric divergence which is related to Hellinger Distance:

23 Univariate Gaussian Consider a factorized approximation using a Gaussian distribution over a single variable x. Given a dataset over the mean µ and precision. The likelihood term is given: we would like to infer posterior distribution The conjugate prior is given by the Normal-Gamma prior: For this simple problem, the posterior also takes the form of Normal-Gamma distribution and hence has a closed form solution. However, let us consider a variational approximation to the posterior.

24 Approximating Mean We now consider a factorized variational approximation to the posterior: Note that the true posterior does not factorize this way! Remember: Hence: So: Depends on expectation with respect to q( ).

25 Approximating Mean We now consider a factorized variational approximation to the posterior: As N! 1, this gives Maximum Likelihood result becomes infinite. and the precision

26 Approximating Precision We now consider a factorized variational approximation to the posterior: For optimal solution for the precision factor: Hence the optimal factor is a Gamma distribution: Depends on expectation with respect to q(µ).

27 Approximating Precision We now consider a factorized variational approximation to the posterior: Hence the optimal factor is a Gamma distribution: As N! 1, the variational posterior q * ( ) has a mean given by the inverse of the maximum likelihood estimator for the variance of the data, and a variance that goes to zero. Note that we did not assume specific functional forms for the optimal q distributions. They were derived by optimizing variational bound over all possible distributions.

28 Iterative Procedure The optimal distributions for mean and precision terms depend on moments evaluated with respect to the other distributions. One option is to cycle through the mean and precision in turn and update them until convergence. Variational inference for the mean and precision. Green contours represent the true posterior, blue contours represent variational approximation.

29 Mixture of Gaussians We will look at the Bayesian mixture of Gaussians model and apply the variational inference to approximate the posterior. Note: Many models, corresponding to much more sophisticated distributions, can be solved by straightforward extensions of this analysis. Remember the Gaussian mixture model: The log-likelihood takes form:

30 Bayesian Mixture of Gaussians We next introduce priors over parameters ¼, µ, and. From now on, we assume that each component has the same parameter: for k=1,,k. Note that parameters 0 can be viewed as the effective prior number of observations associated with each component in the mixture, If 0 is small, the posterior will be influenced primarily by the data, rather than by the prior.

31 Bayesian Mixture of Gaussians We next introduce priors over parameters ¼, µ, and. We also place Gaussian-Wishart prior: Typically, we would choose, m 0 = 0 (by symmetry), and W 0 = I, and Notice the distinction between latent variables and parameters.

32 Bayesian Matrix Factorization One of the popular matrix factorization model used in collaborative filtering / recommender systems. As pointed before: Many models, corresponding to much more sophisticated distributions, can be solved by straightforward extensions of presented analysis.

33 Variational Distribution We can write down the joint distribution over all random variables: Consider a variational distribution that factorizes between the latent variables and model parameters: Remarkably, this is the only assumption we need in order to obtain a tractable practical solution to our Bayesian mixture model. The functional form of the factors will be determined automatically by optimization of the variational distribution.

34 Variational Distribution Using our general result, we can obtain the optimal factor for q(z): Using decomposition of and retaining terms that depend on Z: Substituting, we obtain: where expectations are taken with respect to

35 Variational Distribution Using our general result, we can obtain the optimal factor for q(z): So far, we have: Exponentiating and normalizing, we have: plays the role of responsibility. The optimal solution takes the same function form as the prior p(z ¼) (multinomial), and Note that the optimal solution depends on moments evaluated with respect to distributions of other variables.

36 Variational Distribution Using our general result, consider the optimal factor for q(¼,µ, ): Substituting, we obtain: where expectations are taken with respect to The result decomposes into a sum of terms involving only ¼ and {µ k, k }, k=1,..,k.

37 Substituting we obtain: Variational Distribution So the optimal q * (¼) takes form: Exponentiating, we have a Dirichlet distribution: where has components:

38 Variational Distribution It will be convenient to define three statistics of the observed dataset with respect to the responsibilities: Effective number of points in component k. The mean of component k. Covariance of component k. These are analogous to quantities evaluated in the maximum likelihood EM for the Gaussian mixture models.

39 Substituting we obtain: Variational Distribution It is easy to verify that optimal q * (µ k, k ) is a Gaussian- Wishart distribution: These update equations are quite intuitive. But they depend on responsibilities!

40 Iterative Optimization The optimization of variational posterior amounts to cycling between two stages, analogous to the E and M steps of the maximum likelihood EM. In the variational E-step, we use the current distribution over parameters q(¼,µ, ) to evaluate responsibilities: In the variational M-step, we use the current distribution over latents q(z), or responsibilities, to compute the variational distribution over parameters. Each step improves (does not decrease) the variational lower bound on the logprobability of the data. The variational posterior has the same function form as the corresponding factor in the joint distribution.

41 Example Variational Bayesian mixture of K=6 Gaussians. Components whose expected mixing coefficients are numerically indistinguishable from zero are not shown. The posterior over latents is given by Dirichlet: Consider a component for which N k ' 0 and 0 ' 0. If the prior is broad, so that 0! 0, the E[¼ k ]! 0, and the component plays no role in the model.

42 Variational Lower Bound It is straightforward to evaluate the variational lower bound for this model. We can monitor during the re-estimation in order to test for convergence. It also provides a valuable check on the mathematical updates and software implementation. At each step of the iterative procedure, the variational lower bound should not decrease. For the variational mixture of Gaussian model, the lower bound can be evaluated as: The various terms in the bound can be easily evaluated.

43 Predictive Density In general, we will be interested in the predictive density for a new value of the observed variable. We will also have a corresponding latent variable The predictive density takes form: probability of a new data point given latent component and parameters. probability of a latent component that is summed out. posterior probability over parameters conditioned on the entire dataset. Summing out latent variable we obtain:

44 Predictive density takes form: Predictive Density We now approximate the true posterior distribution p(¼,µ, ) with its variational approximation q(¼)q(µ, ). This gives an approximation: The integration can now be performed analytically, giving a mixture of Student s t-distribution: where and the precision is given by: As N becomes large, the predictive distribution reduces to a mixture of Gaussians.

45 Determining the Number of Components Plot of the variational lower bound (including multimodility factor K!) vs. the number K of components. For each value of K, the model is trained from 100 different random starts. Maximum likelihood would increase monotonically with K (assuming no singular solutions).

46 Induced Factorizations In our variational mixture of Gaussians model, we assumed a particular factorization: The optimal solutions for various factors exhibit additional factorizations: We call these induced factorizations. In a numerical implementation of the variational approach it is important to take into account additional factorizations. These additional factorizations can be detected using d-separation.

47 Induced Factorizations Let us partition the latent variables into A,B,C, and assume the the following factorization that would approximate the true posterior: The solution for q(a,b) takes form: We now test whether the resulting solution factorizes between A and B: This will happen iff: Or if A is independent of B conditioned on C and X.

48 Induced Factorizations In case of Bayesian mixture of Gaussians,we can immediately see that the variational posterior over parameters must factorize between ¼ and (µ, ). All paths that connecting µ or must pass through one of the nodes z n, all of which are in our conditioning set.

49 Variational Linear Regression We now look at the Bayesian linear regression model as another example. Interating over parameters and hyperparameters is often intractable. We can find a tractable approximation using variational methods. Bayesian Linear Regression model: where We next place a conjugate Gamma prior over :

50 Variational Linear Regression The joint distribution takes form: Our goal is to find an approximation to the posterior: Using variational framework, we assume approximate posterior factorizes: Consider distribution over :

51 Variational Linear Regression The distribution over : We can easily recognize this as the log of a Gamma distribution:

52 Variational Linear Regression The optimal factor for model parameters takes form: Hence the distribution q * (w) is Gaussian (due to quadratic form): Same form when alpha was treated as a fixed parameter. The difference is that fixed is replaced by its expectation under the variational approximation q( ).

53 Variational Lower Bound Once we have identified the optimal q distributions, it is easy to compute: We can also easily evaluate the variational lower bound on the log-probability of the data: Expected completedata log-likelihood Negative entropy of approximate q distribution

54 Predictive Distribution We can also easily evaluate the predictive distribution over t given a new input x. where the input-dependent variance is given by: This takes exactly the same form as we considered before with fixed. The difference is that fixed is replaced by its expectation under the variational approximation q( ).

55 Predictive Distribution Plot of the lower bound vs. the order M of the polynomail: So far we have looked at Bayesian models where we are interested in approximating the posterior. The same variational framework can also be applied when learning other models, including undirected models with latent variables (e.g. Deep Boltzmann Machines).

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

### STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

### STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

### Variational Inference. Sargur Srihari

Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of Discussion Functionals Calculus of Variations Maximizing a Functional Finding Approximation to a Posterior Minimizing K-L divergence Factorized

### STA 414/2104: Machine Learning

STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

### 13: Variational inference II

10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

### Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

### Lecture 13 : Variational Inference: Mean Field Approximation

10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

### Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

### A graph contains a set of nodes (vertices) connected by links (edges or arcs)

BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

### Pattern Recognition and Machine Learning

Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

### Variational Inference. Sargur Srihari

Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

### Deep Variational Inference. FLARE Reading Group Presentation Wesley Tansey 9/28/2016

Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is Variational Inference? What is Variational Inference? Want to estimate some distribution, p*(x) p*(x) What is

### Unsupervised Learning

Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

### Nonparametric Bayesian Methods (Gaussian Processes)

[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

### 13 : Variational Inference: Loopy Belief Propagation and Mean Field

10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

### Expectation Propagation Algorithm

Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

### Latent Variable Models

Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:

### The Expectation-Maximization Algorithm

1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

### p L yi z n m x N n xi

y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

### STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

### Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures

### Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

### Part 1: Expectation Propagation

Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud

### IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s

### Variational Message Passing. By John Winn, Christopher M. Bishop Presented by Andy Miller

Variational Message Passing By John Winn, Christopher M. Bishop Presented by Andy Miller Overview Background Variational Inference Conjugate-Exponential Models Variational Message Passing Messages Univariate

### Machine Learning Summer School

Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

### Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

### Variational Principal Components

Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

### Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

### CPSC 540: Machine Learning

CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

### Understanding Covariance Estimates in Expectation Propagation

Understanding Covariance Estimates in Expectation Propagation William Stephenson Department of EECS Massachusetts Institute of Technology Cambridge, MA 019 wtstephe@csail.mit.edu Tamara Broderick Department

### CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection

### Bayesian Machine Learning

Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

### Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

### Probabilistic Graphical Models

Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15

### Variational Inference (11/04/13)

STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

### The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

### Variational Mixture of Gaussians. Sargur Srihari

Variational Mixture of Gaussians Sargur srihari@cedar.buffalo.edu 1 Objective Apply variational inference machinery to Gaussian Mixture Models Demonstrates how Bayesian treatment elegantly resolves difficulties

### Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

### Chapter 16. Structured Probabilistic Models for Deep Learning

Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

### Machine Learning using Bayesian Approaches

Machine Learning using Bayesian Approaches Sargur N. Srihari University at Buffalo, State University of New York 1 Outline 1. Progress in ML and PR 2. Fully Bayesian Approach 1. Probability theory Bayes

### Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

### Density Estimation. Seungjin Choi

Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

### Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

### Stochastic Variational Inference

Stochastic Variational Inference David M. Blei Princeton University (DRAFT: DO NOT CITE) December 8, 2011 We derive a stochastic optimization algorithm for mean field variational inference, which we call

### Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

### Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian

### Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes

### Mixtures of Gaussians. Sargur Srihari

Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm

### ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

### Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

### Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization

Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

### Gaussian Mixture Models

Gaussian Mixture Models David Rosenberg, Brett Bernstein New York University April 26, 2017 David Rosenberg, Brett Bernstein (New York University) DS-GA 1003 April 26, 2017 1 / 42 Intro Question Intro

### Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures

### Variational Autoencoders

Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly

### The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

### PROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

PROBABILITY DISTRIBUTIONS Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Parametric Distributions 3 Basic building blocks: Need to determine given Representation:

### Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

### Variational inference

Simon Leglaive Télécom ParisTech, CNRS LTCI, Université Paris Saclay November 18, 2016, Télécom ParisTech, Paris, France. Outline Introduction Probabilistic model Problem Log-likelihood decomposition EM

### Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

### Probabilistic Graphical Models

Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

### Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

### Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017

Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

### Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

### Expectation Maximization

Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

### Data Mining Techniques

Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of

### Bayesian Models in Machine Learning

Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

### CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

### Based on slides by Richard Zemel

CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

### ECE285/SIO209, Machine learning for physical applications, Spring 2017

ECE285/SIO209, Machine learning for physical applications, Spring 2017 Peter Gerstoft, 534-7768, gerstoft@ucsd.edu We meet Wednesday from 5 to 6:20pm in Spies Hall 330 Text Bishop Grading A or maybe S

### Approximate Inference using MCMC

Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

### Non-Parametric Bayes

Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

### Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

### K-Means and Gaussian Mixture Models

K-Means and Gaussian Mixture Models David Rosenberg New York University October 29, 2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 42 K-Means Clustering K-Means Clustering David

### Quantitative Biology II Lecture 4: Variational Methods

10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate

### Large-scale Ordinal Collaborative Filtering

Large-scale Ordinal Collaborative Filtering Ulrich Paquet, Blaise Thomson, and Ole Winther Microsoft Research Cambridge, University of Cambridge, Technical University of Denmark ulripa@microsoft.com,brmt2@cam.ac.uk,owi@imm.dtu.dk

### Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

### Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

### Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan

Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always

### Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

### Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

### Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

### Expectation Propagation in Dynamical Systems

Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex

### 26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.

10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with

### Variational Learning : From exponential families to multilinear systems

Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and