10-708 Probabilistic Graphical Models
Homework 3 (v1.1.0)
Due Apr 14, 7:00 PM

Rules:

1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. The solution to each problem should start on a new page and be marked appropriately on Gradescope. For the policy on late submission, please see the course website.
2. We recommend that you typeset your homework using appropriate software such as LaTeX. If you are writing by hand, please make sure your homework is cleanly written up and legible. The TAs will not invest undue effort to decrypt bad handwriting.
3. Code submission: for programming questions, you must submit the complete source code of your implementation, also via Gradescope. Remember to include a small README file and a script that will help us execute your code.
4. Collaboration: you are allowed to discuss the homework, but you should write up your own solution and code. Please indicate anyone you collaborated with in your submission.

1 Variational Autoencoders (Maruan) [55 pts]

In this problem we ask you to (re-)derive and implement the Helmholtz machine (a.k.a. the Variational Autoencoder, VAE) and compare the two training procedures: the classical wake-sleep algorithm [1] and the modern auto-encoding variational (empirical) Bayes [2]. Please make sure you have read and understood both papers. Let's start with a few derivations that will help you better understand the principles behind the model and will be useful for your implementation.

1.1 Derivations (15 pts)

Figure 1: The latent variable model (plate notation with latent variable $z$, parameters $\theta$, and observation $x$, repeated over $N$ data points).

Our goal is to learn a directed latent variable model (Figure 1) that is able to represent a complex distribution over the data, $p(x)$, in the following form:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz. \qquad (1)$$

In this problem, we assume a Gaussian prior on $z$ and consider $x$ to be binary vectors, i.e., $p_\theta(x \mid z)$ can be modeled with a sigmoid belief net.

(3 pts) Assume that we would like to approximate the posterior distribution $p_\theta(z \mid x)$ using some variational distribution $q_\phi(z \mid x)$. Derive the evidence lower bound (ELBO) on the log-likelihood of the model for $N$ data points, $\{x^{(i)}\}_{i=1}^N$.

(4 pts) A tractable way to learn this model is to optimize the ELBO objective. The wake-sleep algorithm decomposes the optimization procedure into two phases: the wake phase and the sleep phase. Describe each phase of the wake-sleep algorithm. Derive the optimization objectives for each of the phases in our particular case. Briefly describe the advantages and disadvantages of the wake-sleep algorithm.

(4 pts) The auto-encoding variational Bayes (AEVB) approach avoids the two-stage optimization procedure and instead optimizes (a stochastic estimate of) the ELBO directly w.r.t. both the parameters of the generative model (generation network) and the parameters of the variational distribution (recognition network). Write down the stochastic estimate of the ELBO used as the objective. Derive the gradient of the objective by reparametrizing the latents as deterministic functions of the inputs and auxiliary noise variables. Briefly describe the advantages and disadvantages of the auto-encoding variational Bayes approach. (An illustrative sketch of the reparameterized estimate is given at the end of this subsection.)

(4 pts) To compare trained models, we could simply look at the values of the lower bound. However, the bound could be loose and hence the numbers could be misleading. Here, we derive and prove a tighter approximation of the lower bound on the marginal likelihood that will be used as the metric of quality in the experiments. In fact, the new bound is extremely simple:

$$\mathcal{L}_k(x) = \mathbb{E}_{z^{(1)},\ldots,z^{(k)} \sim q_\phi(z \mid x)} \left[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{p_\theta(x, z^{(i)})}{q_\phi(z^{(i)} \mid x)} \right]. \qquad (2)$$

Prove that the following inequalities hold for any $k$: $\log p(x) \ge \mathcal{L}_{k+1}(x) \ge \mathcal{L}_k(x)$.

(bonus 3 pts) The above inequalities do not guarantee that $\mathcal{L}_k(x) \to \log p(x)$ as $k \to \infty$. Provide a counterexample (i.e., a simple handcrafted $p$ and $q$) such that $\lim_{k \to \infty} \mathcal{L}_k(x) < \log p(x)$.
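For concreteness, here is a minimal sketch of a single-sample reparameterized ELBO estimate for a diagonal-Gaussian recognition model, a standard normal prior, and a Bernoulli (sigmoid) generation model. It is only an illustration of the estimator, not a reference solution: `encoder` and `decoder` are placeholder modules you would define yourself, PyTorch is just one of the recommended libraries, and parametrizing the recognition network by $\log\sigma$ (rather than $\sigma$ directly) is an assumption made here for numerical convenience.

```python
import torch

def elbo_estimate(x, encoder, decoder):
    """Single-sample reparameterized ELBO estimate for a mini-batch x of binary vectors.
    Assumed placeholder modules: encoder(x) -> (mu, log_sigma), decoder(z) -> logits."""
    mu, log_sigma = encoder(x)                   # parameters of q_phi(z | x)
    eps = torch.randn_like(mu)                   # auxiliary noise, eps ~ N(0, I)
    z = mu + torch.exp(log_sigma) * eps          # reparameterization: z = mu + sigma * eps

    logits = decoder(z)                          # parameters of p_theta(x | z)
    log_px_given_z = -torch.nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="none").sum(dim=1)  # Bernoulli log-likelihood log p_theta(x | z)

    # KL(q_phi(z | x) || N(0, I)) in closed form for diagonal Gaussians
    kl = 0.5 * (torch.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma).sum(dim=1)

    return (log_px_given_z - kl).mean()          # average over the mini-batch

# Because z is a deterministic function of (x, eps), gradients w.r.t. both the
# decoder parameters (theta) and the encoder parameters (phi) flow through z:
#   loss = -elbo_estimate(x, encoder, decoder); loss.backward()
```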

1.2 Implementation and Experiments (40 pts)

In this part, you need to implement the wake-sleep and the AEVB learning algorithms and use them to train a simple model on the MNIST handwritten digits dataset.

1.2.1 Model specification

Let both $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$ be parametrized by neural networks with one hidden layer of 512 ReLU neurons, and let the dimensionality of the latent space be 2. The weight matrices between the layers should be initialized randomly by sampling from $\mathcal{N}(0, 0.01)$, and the biases should be initially set to zero. Since the $x$'s are binary, the output layer of the generation network that represents $p_\theta(x \mid z)$ should consist of sigmoid neurons. The variational distribution $q_\phi(z \mid x)$ is represented by a Gaussian of the form $\mathcal{N}(z;\, \mu_\phi(x),\, \mathrm{diag}(\sigma_\phi(x)))$. The recognition network should parametrize $\mu_\phi(x)$ and $\sigma_\phi(x)$ and hence have two linear outputs corresponding to $\mu$ and $\sigma$, respectively.

A note on implementation: You are free to use any programming language for your implementation. We recommend that you additionally use a library that supports automatic reverse-mode differentiation, which will simplify model development (e.g., TensorFlow or PyTorch). You are NOT allowed to use any libraries that provide pre-cooked implementations of (variational) autoencoders; you need to implement the models and the training algorithms from basic building blocks yourself. For the optimization, we recommend one of the popular algorithms such as Adagrad [3] or Adam [4].

1.2.2 Experiments

Train and evaluate your models on the MNIST handwritten digits dataset.

(15 pts) Implement and train the specified simple model using the wake-sleep algorithm for about 100 epochs. Provide plots of $\mathcal{L}^{\mathrm{train}}_{1000}$ and $\mathcal{L}^{\mathrm{test}}_{1000}$ vs. the epoch number.

(15 pts) Implement and train the specified simple model using the AEVB algorithm for about 100 epochs. Provide plots of $\mathcal{L}^{\mathrm{train}}_{1000}$ and $\mathcal{L}^{\mathrm{test}}_{1000}$ vs. the epoch number. Does it converge faster or slower than wake-sleep? (A sketch of a numerically stable way to estimate $\mathcal{L}_k$ is given at the end of this section.)

(3 pts) Visualize a random sample of 100 MNIST digits on a 10 × 10 tile grid (i.e., 10 rows, 10 digits per row). Using your trained models, sample and visualize 100 digits from each of them in the same manner: sample 100 random $z$, then apply the generation network, $p_\theta(x \mid z)$, to produce digit samples.

(7 pts) Since we have specifically chosen the latent space to be 2-dimensional, you can easily visualize the learned latent manifold of digits:

- Using your pre-trained recognition networks, transform images from the test set to the latent space. Visualize the points in the latent space as a scatter plot, where the colors of the points correspond to the labels of the digits.
- From the previous point, determine the min and max values of $z_1$ and $z_2$. Create a 20 × 20 grid of $(z_1, z_2)$ values between the min and max. For each $z = (z_1, z_2)$, generate and visualize digits using each of your trained models, and plot each set on a 20 × 20 tile grid.
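A practical note on evaluating the bound in Eq. (2): for large $k$ (e.g., $\mathcal{L}_{1000}$), averaging raw importance weights is prone to numerical under/overflow, and the usual remedy is to work with log-weights and a log-sum-exp. Below is a minimal NumPy sketch under the assumption that you already have routines for sampling from $q_\phi(z \mid x)$ and evaluating $\log p_\theta(x, z)$ and $\log q_\phi(z \mid x)$; the helper names `sample_q`, `log_joint`, and `log_q` are placeholders, not part of the assignment.

```python
import numpy as np

def iw_bound(x, k, sample_q, log_joint, log_q):
    """Monte Carlo estimate of L_k(x) from Eq. (2) for a single data point x.

    Assumed (hypothetical) helpers:
      sample_q(x, k)  -> k latent samples z^(1..k) drawn from q_phi(z | x)
      log_joint(x, z) -> array of log p_theta(x, z^(i)) values, shape (k,)
      log_q(x, z)     -> array of log q_phi(z^(i) | x) values, shape (k,)
    """
    z = sample_q(x, k)                        # z^(1), ..., z^(k) ~ q_phi(z | x)
    log_w = log_joint(x, z) - log_q(x, z)     # log importance weights

    # log( (1/k) * sum_i w_i ) computed stably via log-sum-exp
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))
```

Averaging this quantity over the training or test set gives the $\mathcal{L}^{\mathrm{train}}_{1000}$ and $\mathcal{L}^{\mathrm{test}}_{1000}$ curves requested above.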

2 Markov Chain Monte Carlo (Maruan) [45 pts]

In this problem, we will study the behavior of two classical MCMC methods: Metropolis-Hastings (M-H) and Hamiltonian MCMC (HMC). These methods are often used for sampling from complex, often multimodal distributions. We assume that we want to sample from a Gaussian mixture of $m$ components:

$$P(x) = \sum_{i=1}^{m} \pi_i\, \mathcal{N}(x;\, \mu_i, \Sigma_i), \qquad (3)$$

where the $\pi_i$'s are the mixture component probabilities.

2.1 Metropolis-Hastings (5 pts)

(5 pts) Write down the M-H algorithm and derive the accept/reject ratio for a proposal distribution of your choice (e.g., an isotropic Gaussian).

2.2 Hamiltonian MCMC (10 pts)

(3 pts) Write down the Hamiltonian (potential + kinetic energy) for the Gaussian mixture model. Assume that the mass matrix is the identity.

(3 pts) Derive the gradients of the potential energy for the Gaussian mixture model that you will use in your HMC algorithm.

(4 pts) Write down the HMC algorithm (with either Euler- or leap-frog-based proposals; you will need to implement both for the experimental part). Describe the tunable parameters of your sampler.

2.3 Effective sample size (10 pts)

Assuming that the Markov chain has converged, computing MCMC expectations using this chain results in unbiased estimates. In particular, if we want to estimate $\mathbb{E}_P[f(x)]$, where $f$ is some function, we can produce samples $\{x^{(1)}, \ldots, x^{(L)}\}$ from the Markov chain (with a stationary distribution matching $P$) and use the following estimator:

$$E_{\mathrm{MCMC}}[f(x)] = \frac{1}{L} \sum_{l=1}^{L} f(x^{(l)}).$$

The samples produced by MCMC are non-i.i.d., and hence, depending on the correlation structure between the samples, the variance of this estimator might be high. It is common to analyze the effective sample size (ESS) of an MCMC sampler, i.e., the ratio between the number of samples produced by the chain and the number of perfect i.i.d. samples for which the two estimators would have the same variance.

(6 pts) Prove that, in the limit, the ratio of the size $S$ of the MCMC sample to its effective size has the following form:

$$\frac{S}{S_{\mathrm{eff}}} = 1 + 2 \sum_{k=1}^{\infty} \frac{\mathrm{Cov}\!\left(f(x^{(i)}), f(x^{(i+k)})\right)}{\mathrm{Var}\!\left(f(x^{(i)})\right)}. \qquad (4)$$

(Hint: use the Central Limit Theorem.)

(4 pts) To diagnose the ESS in practice, you obviously also need to estimate $S_{\mathrm{eff}}$ empirically from the MCMC sample. Assuming that you sampled $L$ points using MCMC, suggest a simple estimator for $S_{\mathrm{eff}}$.
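As a concrete illustration of the quantity in Eq. (4), here is a minimal sketch of one common style of empirical ESS estimate: plug in sample autocorrelations and truncate the sum once they become negligible. This is only one reasonable choice, not the required answer; the truncation rule used here is an assumption, and other rules exist.

```python
import numpy as np

def effective_sample_size(f_vals):
    """Crude empirical ESS estimate for a 1-D array of f(x^(l)) values from one chain,
    following Eq. (4) with sample autocorrelations. The sum over lags is truncated at
    the first non-positive autocorrelation (a simple, commonly used heuristic)."""
    f = np.asarray(f_vals, dtype=float)
    L = len(f)
    f = f - f.mean()
    var = f.var()

    rho_sum = 0.0
    for k in range(1, L):
        # sample autocorrelation at lag k
        rho_k = np.dot(f[:-k], f[k:]) / ((L - k) * var)
        if rho_k <= 0.0:
            break
        rho_sum += rho_k

    return L / (1.0 + 2.0 * rho_sum)
```

In the experiments below, applying an estimator of this kind to the collected samples (e.g., coordinate-wise) yields the ESS numbers you are asked to report.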

2.4 Experiments (20 pts)

In this part, we consider Gaussian mixture models with multiple components in 2 and 100 dimensions to better understand how these MCMC methods work (the 2D Gaussian mixture is visualized in Figure 2). The parameters of the distributions are provided in the data handout. You need to implement the M-H and HMC algorithms and perform the experiments outlined below.

A note on implementation: You are free to use any programming language for your implementation. You are NOT allowed to use any existing implementations of M-H and HMC in this problem.

2.4.1 2D mixture (10 pts)

Figure 2: The ground-truth 2D Gaussian mixture (left) and 10k samples drawn from it (right).

For both M-H and HMC, you need to do the following (a sketch of a generic burn-in/thinning driver loop is given at the end of this subsection):

- Use the starting point $x_0 = (10, 0)$ for your sampling.
- Run the MCMC chain for 1,000 steps to burn in and then collect 500 samples with $t$ steps in between (i.e., run M-H for $500t$ steps and collect only every $t$-th sample).
- For each parameter setting, produce a 2D plot (connect the dots with lines so that the MCMC paths are visible); also produce 1D plots for each coordinate separately.
- Estimate the reject ratio and the effective sample size of the final samples for each parameter setting. Report your results in a table.
- Comment on the results. Which parameter setting worked best for each of the algorithms?

Now, the details of the M-H and HMC parametrization:

Metropolis-Hastings. Use an isotropic Gaussian proposal distribution, $\mathcal{N}(0, \sigma^2 I)$, with $\sigma = 0.01, 0.1, 1, 10$. When collecting samples, use $t = 1, 10, 20, 50, 100$.

Hamiltonian MCMC. Use HMC with Euler- and leap-frog-based proposals (set $\epsilon = 0.25$). Vary the number of Euler and leap-frog steps, $t = 1, 10, 20, 50, 100$. Note that for HMC, when collecting samples, take each sample (i.e., do not keep only every $t$-th sample as you did for M-H). This ensures that the performance of M-H and HMC can be compared on the same scale.
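To make the sampling protocol above concrete, here is a minimal sketch of a driver loop with burn-in, thinning, and reject-ratio tracking. It assumes a hypothetical `step(x)` function that performs one M-H (or HMC) transition and returns the new state together with an accept flag; the function name and signature are illustrative placeholders, not part of the assignment.

```python
import numpy as np

def run_chain(step, x0, n_burn=1000, n_collect=500, thin=1):
    """Generic MCMC driver: burn in, then collect `n_collect` samples, keeping only
    every `thin`-th state. `step(x) -> (x_new, accepted)` is a placeholder for a
    single M-H or HMC transition."""
    x = np.asarray(x0, dtype=float)

    # burn-in phase: advance the chain without recording anything
    for _ in range(n_burn):
        x, _ = step(x)

    samples, n_proposed, n_rejected = [], 0, 0
    while len(samples) < n_collect:
        for _ in range(thin):               # take `thin` steps between kept samples
            x, accepted = step(x)
            n_proposed += 1
            n_rejected += int(not accepted)
        samples.append(x.copy())

    reject_ratio = n_rejected / n_proposed
    return np.array(samples), reject_ratio
```

Under the parametrization above, M-H would be run with `thin=t`, while HMC would use `thin=1`, since each HMC proposal already performs $t$ Euler or leap-frog steps internally.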

2.4.2 100D Gaussian (10 pts)

In this final part of the problem, you need to perform the same set of experiments as for the 2D Gaussian mixture, but for a 100-dimensional Gaussian distribution with zero mean and a covariance matrix of the following form:

$$\Sigma = \begin{pmatrix} 0.01^2 & 0 & \cdots & 0 & 0 \\ 0 & 0.02^2 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 0.99^2 & 0 \\ 0 & 0 & \cdots & 0 & 1.0 \end{pmatrix}.$$

For HMC, set $\epsilon = 0.01$; for M-H, set $\sigma = 0.025$. Select the number of steps (Euler or leap-frog steps for HMC, and the number of standard accept/reject steps for M-H) between 100 and 250. Run the same experiments as for the 2D case (use the starting point $x_0 = (0, 0, \ldots, 0, 1.5)$). Report your results for both M-H and HMC in the following form:

- Report the rejection ratio and the effective sample size.
- Plot a 1D plot of the sampling trajectories for the last coordinate.
- Plot a chart of the sample variance vs. the true variance for each coordinate (100 points on the chart).

References

[1] Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.

[2] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[3] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[4] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.