# STA 4273H: Statistical Machine Learning

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! Sidney Smith Hall, Room 6002 Lecture 7

2 Approximate Inference When using probabilistic graphical models, we will be interested in evaluating the posterior distribution p(z X) of the latent variables Z given the observed data X. For example, in the EM algorithm, we need to evaluate the expectation of the complete-data log-likelihood with respect to the posterior distribution over the latent variables. For more complex models, it may be infeasible to evaluate the posterior distribution, or compute expectations with respect to this distribution. This typically occurs when working with high-dimensional latent spaces, or when the posterior distribution has a complex form, for which expectations are not analytically tractable (e.g. Boltzmann machines). We will examine a range of deterministic approximation schemes, some of which scale well to large application.

3 Computational Challenges Remember: the big challenge is computing the posterior distribution. There are several main approaches: Analytical integration: If we use conjugate priors, the posterior distribution can be computed analytically (we saw this in case of Bayesian linear regression). Gaussian (Laplace) approximation: Approximate the posterior distribution with a Gaussian. Works well when there is a lot of data compared to the model complexity (as posterior is close to Gaussian). Monte Carlo integration: The dominant current approach is Markov Chain Monte Carlo (MCMC) -- simulate a Markov chain that converges to the posterior distribution. It can be applied to a wide variety of problems. Variational approximation: A cleverer way to approximate the posterior. It often works much faster, but not as general as MCMC.

4 Probabilistic Model Suppose that we have a fully Bayesian model in which all parameters are given prior distributions. The model may have latent variables and parameters, and we will denote the set of all latent variables and parameters by Z. We will also denote the set of all observed variables by X. For example, we may be given a set of N i.i.d data points, so that X ={x 1,,x N } and Z = {z 1,,z N } (as we saw in previous class). Our probabilistic model specifies the joint distribution P(X,Z). Our goal is to find approximate posterior distribution P(Z X) and the model evidence p(x).

5 Variational Bound As in our previous lecture, we can decompose the marginal log-probability as: where Note that parameters are now stochastic variables and are absorbed into Z. We can maximize the variational lower bound with respect to the distribution q(z), which is equivalent to minimizing the KL divergence. If we allow any possible choice of q(z), then the maximum of the lower bound occurs when: In this case KL divergence becomes zero.

6 Variational Bound As in our previous lecture, we can decompose the marginal log-probability as: We will assume that the true posterior distribution is intractable. We can consider a restricted family of distributions q(z) and then find the member of this family for which KL is minimized. Our goal is to restrict the family of distributions so that it contains only tractable distributions. At the same time, we want to allow the family to be sufficiently rich and flexible, so that it can provide a good approximation to the posterior. One option is to use parametric distributions q(z!), governed by parameters!. The lower bound then becomes a function of!, and can optimize the lowerbound to determine the optimal values for the parameters.

7 Example One option is to use parametric distributions q(z!), governed by parameters!. An example, in which the variational distribution is Gaussian, and we optimize with respect to its mean and variance. The original distribution (yellow), along with Laplace (red), and variational (green) approximations.

8 Mean-Field We now consider restricting the family of distributions. Partition the elements of Z into M disjoint groups, denoted by Z i, i=1,,m. We assume that the q distribution factorizes with respect to these groups: Note that we place no restrictions on the functional form of the individual factors q i (we will often denote q i (Z i ) as simply q i ). This approximation framework, developed in physics, is called mean-field theory.

9 Factorized Distributions Among all factorized distributions, we look for a distribution for which the variational lower bound is maximized. Denoting q i (Z i ) as simply q I, we have: where we denote a new distribution:

10 Factorized Distributions Among all factorized distributions, we look for a distribution for which the variational lower bound is maximized. Denoting q i (Z i ) as simply q I, we have: where Here we take an expectation with respect to the q distribution over all variables Z i for i j, so that:

11 Maximizing Lower Bound Now suppose that we keep fixed, and optimize the lower bound with respect to all possible forms of the distribution q j (Z j ). This optimization is easily done by recognizing that: constant: does not depend on q. so the minimum occurs when or Observe: the log of the optimum solution for factor q j is given by: - Considering the log of the joint distribution over all hidden and visible variables - Taking the expectation with respect to all other factors {q i } for i j.

12 Maximizing Lower Bound Exponentiating and normalizing, we obtain: The set of these equations for j=1,,m represent the set of consistency conditions for the maximum of the lower bound subject to factorization constraint. To obtain a solution, we initialize all of the factors and then cycle through factors, replacing each in tern with a revised estimate. Convergence is guaranteed because the bound is convex with respect to each of the individual factors.

13 Factroized Gaussian Consider a problem of approximating a general distribution by a factorized distribution. To get some insight, let us look at the problem of approximating a Gaussian distribution using a factorized Gaussian distribution. Consider a Gaussian distribution over two correlated variables z = (z 1,z 2 ). Let us approximate this distribution using a factorized Gaussian of the form:

14 Factroized Gaussian Remember: Consider an expression for the optimal factor q 1 : Note that we have a quadratic function of z 1, and so we can identify q 1 (z 1 ) as a Gaussian distribution:

15 By symmetry, we also obtain: Factroized Gaussian There are two observations to make: - We did not assume that is Gaussian, but rather we derived this result by optimizing variational bound over all possible distributions. - The solutions are coupled. The optimal depends on expectation computed with respect to One option is to cycle through the variables in turn and update them until convergence.

16 By symmetry, we also obtain: Factroized Gaussian However, in our case, The green contours correspond to 1,2, and 3 standard deviations of the correlated Gaussian. The red contours correspond to the factorial approximation q(z) over the same two variables. Observe that a factorized variational approximation tends to give approximations that are too compact.

17 Alternative Form of KL Divergence We have looked at the variational approximation that minimizes KL(q p). For comparison, suppose that we were minimizing KL(p q). It is easy to show that: constant: does not depend on q. The optimal factor is given by the marginal distribution of p(z).

18 Comparison of two KLs Comparison of two the alternative forms for the KL divergence. KL(q p) KL(p q) Approximation is too compact. Approximation is too spread.

19 Comparison of two KLs The difference between these two approximations can be understood as follows: KL(q p) There is a large positive contribution to the KL divergence from regions of Z space in which: - p(z) is near zero, - unless q(z) is also close to zero. Minimizing KL(q p) leads to distributions q(z) that avoid regions in which p(z) is small.

20 Comparison of two KLs Similar arguments apply for the alternative KL divergence: KL(p q) There is a large positive contribution to the KL divergence from regions of Z space in which: - q(z) is near zero, - unless p(z) is also close to zero. Minimizing KL(p q) leads to distributions q(z) that are nonzero in regions where p(z) is nonzero.

21 Approximating Multimodal Distribution Consider approximating multimodal distribution with a unimodal one. Blue contours show bimodal distribution p(z), red contours show a single Gaussian distribution that best approximates q(z) that best approximates p(z). KL(p q) KL(q p) KL(q p) In practice, the true posterior will often be mutlimodal. KL(q p) will tend to find a single mode, whereas KL(p q) will average across all of the modes.

22 Alpha-family of Divergences The two forms of KL are members of the alpha-family divergences: Observe three points: - KL(p q) corresponds to the limit! 1. - KL(q p) corresponds to the limit! D (p q) 0, for all, and D (p q)=0 iff q(x) = p(x). Suppose p(x) is fixed and we minimize D (p q) with respect to q distribution. For < -1, the divergence is zero-forcing: q(x) will underestimate the support of p(x). For > 1, the divergence is zero-avoiding: q(x) will stretch to cover all of p(x). For = 0, we obtain a symmetric divergence which is related to Hellinger Distance:

23 Univariate Gaussian Consider a factorized approximation using a Gaussian distribution over a single variable x. Given a dataset over the mean µ and precision. The likelihood term is given: we would like to infer posterior distribution The conjugate prior is given by the Normal-Gamma prior: For this simple problem, the posterior also takes the form of Normal-Gamma distribution and hence has a closed form solution. However, let us consider a variational approximation to the posterior.

24 Approximating Mean We now consider a factorized variational approximation to the posterior: Note that the true posterior does not factorize this way! Remember: Hence: So: Depends on expectation with respect to q( ).

25 Approximating Mean We now consider a factorized variational approximation to the posterior: As N! 1, this gives Maximum Likelihood result becomes infinite. and the precision

26 Approximating Precision We now consider a factorized variational approximation to the posterior: For optimal solution for the precision factor: Hence the optimal factor is a Gamma distribution: Depends on expectation with respect to q(µ).

27 Approximating Precision We now consider a factorized variational approximation to the posterior: Hence the optimal factor is a Gamma distribution: As N! 1, the variational posterior q * ( ) has a mean given by the inverse of the maximum likelihood estimator for the variance of the data, and a variance that goes to zero. Note that we did not assume specific functional forms for the optimal q distributions. They were derived by optimizing variational bound over all possible distributions.

28 Iterative Procedure The optimal distributions for mean and precision terms depend on moments evaluated with respect to the other distributions. One option is to cycle through the mean and precision in turn and update them until convergence. Variational inference for the mean and precision. Green contours represent the true posterior, blue contours represent variational approximation.

29 Mixture of Gaussians We will look at the Bayesian mixture of Gaussians model and apply the variational inference to approximate the posterior. Note: Many models, corresponding to much more sophisticated distributions, can be solved by straightforward extensions of this analysis. Remember the Gaussian mixture model: The log-likelihood takes form:

30 Bayesian Mixture of Gaussians We next introduce priors over parameters ¼, µ, and. From now on, we assume that each component has the same parameter: for k=1,,k. Note that parameters 0 can be viewed as the effective prior number of observations associated with each component in the mixture, If 0 is small, the posterior will be influenced primarily by the data, rather than by the prior.

31 Bayesian Mixture of Gaussians We next introduce priors over parameters ¼, µ, and. We also place Gaussian-Wishart prior: Typically, we would choose, m 0 = 0 (by symmetry), and W 0 = I, and Notice the distinction between latent variables and parameters.

32 Bayesian Matrix Factorization One of the popular matrix factorization model used in collaborative filtering / recommender systems. As pointed before: Many models, corresponding to much more sophisticated distributions, can be solved by straightforward extensions of presented analysis.

33 Variational Distribution We can write down the joint distribution over all random variables: Consider a variational distribution that factorizes between the latent variables and model parameters: Remarkably, this is the only assumption we need in order to obtain a tractable practical solution to our Bayesian mixture model. The functional form of the factors will be determined automatically by optimization of the variational distribution.

34 Variational Distribution Using our general result, we can obtain the optimal factor for q(z): Using decomposition of and retaining terms that depend on Z: Substituting, we obtain: where expectations are taken with respect to

35 Variational Distribution Using our general result, we can obtain the optimal factor for q(z): So far, we have: Exponentiating and normalizing, we have: plays the role of responsibility. The optimal solution takes the same function form as the prior p(z ¼) (multinomial), and Note that the optimal solution depends on moments evaluated with respect to distributions of other variables.

36 Variational Distribution Using our general result, consider the optimal factor for q(¼,µ, ): Substituting, we obtain: where expectations are taken with respect to The result decomposes into a sum of terms involving only ¼ and {µ k, k }, k=1,..,k.

37 Substituting we obtain: Variational Distribution So the optimal q * (¼) takes form: Exponentiating, we have a Dirichlet distribution: where has components:

38 Variational Distribution It will be convenient to define three statistics of the observed dataset with respect to the responsibilities: Effective number of points in component k. The mean of component k. Covariance of component k. These are analogous to quantities evaluated in the maximum likelihood EM for the Gaussian mixture models.

39 Substituting we obtain: Variational Distribution It is easy to verify that optimal q * (µ k, k ) is a Gaussian- Wishart distribution: These update equations are quite intuitive. But they depend on responsibilities!

40 Iterative Optimization The optimization of variational posterior amounts to cycling between two stages, analogous to the E and M steps of the maximum likelihood EM. In the variational E-step, we use the current distribution over parameters q(¼,µ, ) to evaluate responsibilities: In the variational M-step, we use the current distribution over latents q(z), or responsibilities, to compute the variational distribution over parameters. Each step improves (does not decrease) the variational lower bound on the logprobability of the data. The variational posterior has the same function form as the corresponding factor in the joint distribution.

41 Example Variational Bayesian mixture of K=6 Gaussians. Components whose expected mixing coefficients are numerically indistinguishable from zero are not shown. The posterior over latents is given by Dirichlet: Consider a component for which N k ' 0 and 0 ' 0. If the prior is broad, so that 0! 0, the E[¼ k ]! 0, and the component plays no role in the model.

42 Variational Lower Bound It is straightforward to evaluate the variational lower bound for this model. We can monitor during the re-estimation in order to test for convergence. It also provides a valuable check on the mathematical updates and software implementation. At each step of the iterative procedure, the variational lower bound should not decrease. For the variational mixture of Gaussian model, the lower bound can be evaluated as: The various terms in the bound can be easily evaluated.

43 Predictive Density In general, we will be interested in the predictive density for a new value of the observed variable. We will also have a corresponding latent variable The predictive density takes form: probability of a new data point given latent component and parameters. probability of a latent component that is summed out. posterior probability over parameters conditioned on the entire dataset. Summing out latent variable we obtain:

44 Predictive density takes form: Predictive Density We now approximate the true posterior distribution p(¼,µ, ) with its variational approximation q(¼)q(µ, ). This gives an approximation: The integration can now be performed analytically, giving a mixture of Student s t-distribution: where and the precision is given by: As N becomes large, the predictive distribution reduces to a mixture of Gaussians.

45 Determining the Number of Components Plot of the variational lower bound (including multimodility factor K!) vs. the number K of components. For each value of K, the model is trained from 100 different random starts. Maximum likelihood would increase monotonically with K (assuming no singular solutions).

46 Induced Factorizations In our variational mixture of Gaussians model, we assumed a particular factorization: The optimal solutions for various factors exhibit additional factorizations: We call these induced factorizations. In a numerical implementation of the variational approach it is important to take into account additional factorizations. These additional factorizations can be detected using d-separation.

47 Induced Factorizations Let us partition the latent variables into A,B,C, and assume the the following factorization that would approximate the true posterior: The solution for q(a,b) takes form: We now test whether the resulting solution factorizes between A and B: This will happen iff: Or if A is independent of B conditioned on C and X.

48 Induced Factorizations In case of Bayesian mixture of Gaussians,we can immediately see that the variational posterior over parameters must factorize between ¼ and (µ, ). All paths that connecting µ or must pass through one of the nodes z n, all of which are in our conditioning set.

49 Variational Linear Regression We now look at the Bayesian linear regression model as another example. Interating over parameters and hyperparameters is often intractable. We can find a tractable approximation using variational methods. Bayesian Linear Regression model: where We next place a conjugate Gamma prior over :

50 Variational Linear Regression The joint distribution takes form: Our goal is to find an approximation to the posterior: Using variational framework, we assume approximate posterior factorizes: Consider distribution over :

51 Variational Linear Regression The distribution over : We can easily recognize this as the log of a Gamma distribution:

52 Variational Linear Regression The optimal factor for model parameters takes form: Hence the distribution q * (w) is Gaussian (due to quadratic form): Same form when alpha was treated as a fixed parameter. The difference is that fixed is replaced by its expectation under the variational approximation q( ).

53 Variational Lower Bound Once we have identified the optimal q distributions, it is easy to compute: We can also easily evaluate the variational lower bound on the log-probability of the data: Expected completedata log-likelihood Negative entropy of approximate q distribution

54 Predictive Distribution We can also easily evaluate the predictive distribution over t given a new input x. where the input-dependent variance is given by: This takes exactly the same form as we considered before with fixed. The difference is that fixed is replaced by its expectation under the variational approximation q( ).

55 Predictive Distribution Plot of the lower bound vs. the order M of the polynomail: So far we have looked at Bayesian models where we are interested in approximating the posterior. The same variational framework can also be applied when learning other models, including undirected models with latent variables (e.g. Deep Boltzmann Machines).

### Variational Inference. Sargur Srihari

Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of Discussion Functionals Calculus of Variations Maximizing a Functional Finding Approximation to a Posterior Minimizing K-L divergence Factorized

### STA 414/2104: Machine Learning

STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

### Unsupervised Learning

Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

### p L yi z n m x N n xi

y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

### Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Variational Principal Components

Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

### Variational Message Passing. By John Winn, Christopher M. Bishop Presented by Andy Miller

Variational Message Passing By John Winn, Christopher M. Bishop Presented by Andy Miller Overview Background Variational Inference Conjugate-Exponential Models Variational Message Passing Messages Univariate

### CPSC 540: Machine Learning

CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

### Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian

### Density Estimation. Seungjin Choi

Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

### Approximate Inference using MCMC

Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

### Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures

### Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

### Mixtures of Gaussians. Sargur Srihari

Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm

### Based on slides by Richard Zemel

CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

### The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

### Basic Sampling Methods

Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

### Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

### Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

### Foundations of Statistical Inference

Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes

### Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

### Bayesian Machine Learning

Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

### STATS 306B: Unsupervised Learning Spring Lecture 3 April 7th

STATS 306B: Unsupervised Learning Spring 2014 Lecture 3 April 7th Lecturer: Lester Mackey Scribe: Jordan Bryan, Dangna Li 3.1 Recap: Gaussian Mixture Modeling In the last lecture, we discussed the Gaussian

### Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

### Week 3: The EM algorithm

Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent

### Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

### Review and Motivation

Review and Motivation We can model and visualize multimodal datasets by using multiple unimodal (Gaussian-like) clusters. K-means gives us a way of partitioning points into N clusters. Once we know which

### Latent Variable View of EM. Sargur Srihari

Latent Variable View of EM Sargur srihari@cedar.buffalo.edu 1 Examples of latent variables 1. Mixture Model Joint distribution is p(x,z) We don t have values for z 2. Hidden Markov Model A single time

### Series 7, May 22, 2018 (EM Convergence)

Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18

### Introduction to Graphical Models

Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

### Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

### Lecture 21: Spectral Learning for Graphical Models

10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

### Expectation Maximization Algorithm

Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters

### Clustering, K-Means, EM Tutorial

Clustering, K-Means, EM Tutorial Kamyar Ghasemipour Parts taken from Shikhar Sharma, Wenjie Luo, and Boris Ivanovic s tutorial slides, as well as lecture notes Organization: Clustering Motivation K-Means

### PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

### Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

### Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

### Least Mean Squares Regression

Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method

### Sampling Algorithms for Probabilistic Graphical models

Sampling Algorithms for Probabilistic Graphical models Vibhav Gogate University of Washington References: Chapter 12 of Probabilistic Graphical models: Principles and Techniques by Daphne Koller and Nir

### Expectation Propagation in Factor Graphs: A Tutorial

DRAFT: Version 0.1, 28 October 2005. Do not distribute. Expectation Propagation in Factor Graphs: A Tutorial Charles Sutton October 28, 2005 Abstract Expectation propagation is an important variational

### Approximate inference in Energy-Based Models

CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based

### Lecture 8: Graphical models for Text

Lecture 8: Graphical models for Text 4F13: Machine Learning Joaquin Quiñonero-Candela and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/

### Two Useful Bounds for Variational Inference

Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the

### Bayesian Mixtures of Bernoulli Distributions

Bayesian Mixtures of Bernoulli Distributions Laurens van der Maaten Department of Computer Science and Engineering University of California, San Diego Introduction The mixture of Bernoulli distributions

### Bayesian time series classification

Bayesian time series classification Peter Sykacek Department of Engineering Science University of Oxford Oxford, OX 3PJ, UK psyk@robots.ox.ac.uk Stephen Roberts Department of Engineering Science University

### Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

### Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

### Andriy Mnih and Ruslan Salakhutdinov

MATRIX FACTORIZATION METHODS FOR COLLABORATIVE FILTERING Andriy Mnih and Ruslan Salakhutdinov University of Toronto, Machine Learning Group 1 What is collaborative filtering? The goal of collaborative

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:

### Instructor: Dr. Volkan Cevher. 1. Background

Instructor: Dr. Volkan Cevher Variational Bayes Approximation ice University STAT 631 / ELEC 639: Graphical Models Scribe: David Kahle eviewers: Konstantinos Tsianos and Tahira Saleem 1. Background These

### Parametric Techniques

Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

### CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

### Machine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014

Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,

### Exponential Families

Exponential Families David M. Blei 1 Introduction We discuss the exponential family, a very flexible family of distributions. Most distributions that you have heard of are in the exponential family. Bernoulli,

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 12: Gaussian Belief Propagation, State Space Models and Kalman Filters Guest Kalman Filter Lecture by

### Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

### ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4 Dhruv Batra Virginia Tech Administrativia Poster Presentation:

### Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

### Scaling Neighbourhood Methods

Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)

### Lecture 14. Clustering, K-means, and EM

Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.

### Expectation Maximization (EM)

Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA Contents in latter part Linear Dynamical Systems What is different from HMM? Kalman filter Its strength and limitation Particle Filter

### VIBES: A Variational Inference Engine for Bayesian Networks

VIBES: A Variational Inference Engine for Bayesian Networks Christopher M. Bishop Microsoft Research Cambridge, CB3 0FB, U.K. research.microsoft.com/ cmbishop David Spiegelhalter MRC Biostatistics Unit

### Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

### Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

### COM336: Neural Computing

COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

### Outline Lecture 2 2(32)

Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

### Contrastive Divergence

Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer Science TU Graz June 9, 2010 Contents 1 Theory 2 Argument 3 Contrastive

### Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

### Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

### The Bayes classifier

The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal

### Linear Models for Classification

Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

### STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

### A Process over all Stationary Covariance Kernels

A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that

### STA 414/2104: Machine Learning

STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! h0p://www.cs.toronto.edu/~rsalakhu/ Lecture 2 Linear Least Squares From

### Variational Bayesian Inference Techniques

Advanced Signal Processing 2, SE Variational Bayesian Inference Techniques Johann Steiner 1 Outline Introduction Sparse Signal Reconstruction Sparsity Priors Benefits of Sparse Bayesian Inference Variational

### Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling

1 / 27 Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 27 Monte Carlo Integration The big question : Evaluate E p(z) [f(z)]

### Lecture 4 October 18th

Directed and undirected graphical models Fall 2017 Lecture 4 October 18th Lecturer: Guillaume Obozinski Scribe: In this lecture, we will assume that all random variables are discrete, to keep notations

### Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

### Restricted Boltzmann Machines

Restricted Boltzmann Machines Boltzmann Machine(BM) A Boltzmann machine extends a stochastic Hopfield network to include hidden units. It has binary (0 or 1) visible vector unit x and hidden (latent) vector

### Probabilistic Graphical Models

2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

### Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

### Probabilistic Graphical Models

Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially

### Introduction to Bayesian Learning

Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

### CSC 411: Lecture 04: Logistic Regression

CSC 411: Lecture 04: Logistic Regression Raquel Urtasun & Rich Zemel University of Toronto Sep 23, 2015 Urtasun & Zemel (UofT) CSC 411: 04-Prob Classif Sep 23, 2015 1 / 16 Today Key Concepts: Logistic

### Technical Details about the Expectation Maximization (EM) Algorithm

Technical Details about the Expectation Maximization (EM Algorithm Dawen Liang Columbia University dliang@ee.columbia.edu February 25, 2015 1 Introduction Maximum Lielihood Estimation (MLE is widely used

### Stat 451 Lecture Notes Numerical Integration

Stat 451 Lecture Notes 03 12 Numerical Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange 2 Updated: February 11, 2016 1 / 29

### Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

### A Review of Mathematical Techniques in Machine Learning. Owen Lewis

A Review of Mathematical Techniques in Machine Learning Owen Lewis May 26, 2010 Introduction This thesis is a review of an array topics in machine learning. The topics chosen are somewhat miscellaneous;

### MACHINE LEARNING AND PATTERN RECOGNITION Fall 2006, Lecture 8: Latent Variables, EM Yann LeCun

Y. LeCun: Machine Learning and Pattern Recognition p. 1/? MACHINE LEARNING AND PATTERN RECOGNITION Fall 2006, Lecture 8: Latent Variables, EM Yann LeCun The Courant Institute, New York University http://yann.lecun.com

### REINTERPRETING IMPORTANCE-WEIGHTED AUTOENCODERS

Worshop trac - ICLR 207 REINTERPRETING IMPORTANCE-WEIGHTED AUTOENCODERS Chris Cremer, Quaid Morris & David Duvenaud Department of Computer Science University of Toronto {ccremer,duvenaud}@cs.toronto.edu

### An Introduction to Reversible Jump MCMC for Bayesian Networks, with Application

An Introduction to Reversible Jump MCMC for Bayesian Networks, with Application, CleverSet, Inc. STARMAP/DAMARS Conference Page 1 The research described in this presentation has been funded by the U.S.

### Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several