Generative models for missing value completion


Kousuke Ariga
Department of Computer Science and Engineering
University of Washington
Seattle, WA
koar8470/

Abstract

Deep generative models such as generative adversarial networks (GANs) and variational auto-encoders (VAEs) are attracting considerable attention these days, but their theory and their relation to simpler generative models are not well studied. In this work, we first reformulate the well-known statistical method PCA as a generative latent Gaussian variable model and derive the EM algorithm as the corresponding inference method. Then, we compare this model-inference pair, probabilistic PCA (PPCA), with VAEs, which consist of a deep generative latent Gaussian model and variational inference as the inference method. The comparison is done by fitting both models to the MNIST handwritten digit database and completing hidden parts of digits by maximum likelihood estimation. VAEs reconstruct more realistic digits, and we conclude that the improvement comes from the representational power of the multi-layer neural network and the robustness to overfitting of variational inference, both of which are used in VAEs.

1 Introduction

In the field of machine learning, discriminative models as typified by logistic regression, support vector machines, and convolutional neural networks have been studied intensively [10], and they have shown success especially in classification tasks. However, such models need a labeled dataset for training, which is not always available. Also, the uses of the conditional probability distribution p(y|x) given by these discriminative models are relatively limited. On the other hand, generative models such as naive Bayes, hidden Markov models, and restricted Boltzmann machines [10] model the joint probability distribution p(x, y), so the resulting model can be used for broader purposes including anomaly detection, data compression, feature extraction, clustering, and missing value completion as well as classification.

Generative models often assume a structure behind the given data and introduce latent variables to represent it. A common way to infer the parameters of probabilistic models with latent variables is the EM algorithm. However, the EM algorithm cannot be used when the model has multiple latent variables or involves computationally intractable expressions such as the integral of a probability distribution. Researchers have tackled this problem with mainly two kinds of methods: sampling and approximation. Sampling methods include many variations of Markov chain Monte Carlo (MCMC) [7][12], but they are computationally expensive if not intractable. The approximation methods are represented by variational inference [9][14]. Variational inference is more efficient than MCMC and usually gives a tight approximation.

In this work, we first review generative latent variable models by reformulating PCA as a generative latent Gaussian variable model, called PPCA [13]. Then, we introduce a recently proposed generative framework, VAEs [5], as a deep generative latent Gaussian variable model. We examine the relationship between VAEs and PPCA and compare their performance by applying both methods to an image completion task. Specifically, their performance is evaluated quantitatively by comparing how well each model reconstructs the hidden part of handwritten digits [15] in terms of the pixel-wise residual sum of squares. This is not a complete evaluation metric because it would take a huge value if a perfectly reconstructed sample were shifted to the right by just one pixel. Therefore, we also visualize the reconstructions and compare them qualitatively by eye.

2 Preliminaries

2.1 Generative latent variable models

Generative models model the actual probability distribution of the given data X. Specifically, as opposed to discriminative models, which calculate the conditional probability of a class y given data X, p(y|X), for classification or prediction tasks, generative models compute the joint probability of X and y, p(X, y). Because generative models model the generation process of the data, we are able to sample, or generate, data X by specifying its class y. This is why they are called generative models.

When we model data X in a generative manner, we often assume some structure behind the data and introduce latent variables to leverage it. Suppose we have a digit whose top half is occluded. For example, let's assume we see the bottom half of a digit 7. If we don't assume a structure in the data, the top half can be anything; it may be an 8 or even random noise. However, if we assume a structure, the top half is most likely a 1 or a 7. This is because digits are not random sets of dots, but lines drawn by following certain rules to express some information. The random variable which represents the structure behind the data is called a latent variable. For another example, by reading half of an article, we can guess what kind of words are used in the remaining half. Suppose words like perceptron, convolutional, and gpu appeared in the given half; then words such as deep learning and cloud would be more likely to appear than words like soccer or constitution in the rest. This prediction is possible because we assume that the topic of the article is, for instance, machine learning. The topic is the latent variable in this example, as opposed to the class of the digit in the previous example. Latent variables may sound like the class of the data, but the important difference is that latent variables are not explicitly given in the form of labels. Also, latent variables can encode much more complex structures that humans may not be able to put into words, such as the style of images.

Formally, generative models can be defined as follows. Let f(·) denote a deterministic function that maps the random latent variable z to the data space, like a topic to articles or a digit class to actual digit images:

X \approx \hat{X} = f(z, \theta).

The function f(·) differs depending on the model, but our motivation is to minimize the error between X and X̂. Therefore, the objective of generative models is to maximize

p(X \mid \theta) = \int p_\theta(X \mid z)\, p(z)\, dz

with respect to θ. This is a general maximum likelihood problem, although evaluating the integral is often computationally intractable.

2.2 Probabilistic principal component analysis

PCA is usually defined as the orthogonal projection of D-dimensional data X onto an M < D dimensional linear space such that the variance of the projected data is maximized [8]. Therefore, the principal components of some data are obtained by first evaluating the covariance matrix of the centered data and then taking its eigenvectors u_i corresponding to the M largest eigenvalues.
Here, the u_i are the principal components of X. The original data is approximated by a linear combination of the principal components as

X \approx \tilde{X} = \sum_{i=1}^{M} z_i u_i + \mu

where z_i is the weight of the principal component u_i and µ is the mean of the original data. Because of this property, PCA is used for applications such as dimensionality reduction and feature extraction.
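For concreteness, the following is a minimal NumPy sketch of this classical PCA procedure (covariance eigendecomposition followed by the linear reconstruction above). The toy data and variable names are illustrative assumptions; this is not the code used in the experiments.

```python
import numpy as np

def pca_fit(X, M):
    """Classical PCA: top-M eigenvectors of the covariance of the centered data.

    X : (N, D) data matrix, M : target dimensionality (M < D).
    Returns the mean mu (D,) and the principal components U (D, M).
    """
    mu = X.mean(axis=0)
    Xc = X - mu                              # centered data
    C = Xc.T @ Xc / X.shape[0]               # (D, D) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :M]              # eigenvectors of the M largest eigenvalues
    return mu, U

def pca_reconstruct(X, mu, U):
    """Approximate X by the linear combination X_tilde = sum_i z_i u_i + mu."""
    Z = (X - mu) @ U                         # projection weights z_i
    return Z @ U.T + mu

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # toy correlated data
    mu, U = pca_fit(X, M=2)
    X_tilde = pca_reconstruct(X, mu, U)
    print("reconstruction RSS:", ((X - X_tilde) ** 2).sum())
```

Projecting onto U and mapping back gives exactly the linear combination of principal components described above.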

Figure 1: The left two images show how the latent variable z is sampled from a Gaussian distribution and mapped to the actual data space X. The rightmost image is the resulting probability distribution of the observed variable X. (Bishop, Pattern Recognition and Machine Learning, Springer, 2009, p. 570.)

We now reformulate PCA as a generative model. Specifically, PCA is the maximum likelihood solution of a latent Gaussian variable model. This formulation is called probabilistic PCA (PPCA) [13]. We first introduce a latent variable z corresponding to the principal component subspace W. Then, the probability distribution of X conditioned on the latent variable is

p(X \mid z) = \mathcal{N}(X \mid Wz + \mu, \sigma^2 I)

where p(z) = \mathcal{N}(z \mid 0, I). Note that we can assume z to follow a standard Gaussian distribution, as opposed to a general Gaussian distribution, without loss of generality. Seeing this probabilistic model from a generative viewpoint, that is, assuming the data is generated by first sampling the value of the latent variable z and then mapping z to the actual data space X, we get the expression

X = Wz + \mu + \epsilon

where ε is Gaussian noise. Note that PPCA gives the same principal components as PCA when we take the limit σ² → 0. You will notice the similarity between PPCA and PCA if you see this expression as a linear combination of the column vectors of W and regard them as principal components. The likelihood of the given data X and the latent variables Z is, by the definition of the Gaussian distribution,

p(X, Z \mid \mu, W, \sigma^2) = \prod_{n=1}^{N} p(x_n \mid z_n)\, p(z_n)
                              = \prod_{n=1}^{N} (2\pi\sigma^2)^{-\frac{D}{2}} \exp\!\left( -\frac{\lVert x_n - W z_n - \mu \rVert^2}{2\sigma^2} \right) (2\pi)^{-\frac{M}{2}} \exp\!\left( -\frac{\lVert z_n \rVert^2}{2} \right).
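To make the generative view concrete, the sketch below draws samples from the PPCA model X = Wz + µ + ε with z ~ N(0, I). W, µ, and σ² here are arbitrary toy values rather than parameters fitted to MNIST, so this is only an illustration of the sampling process.

```python
import numpy as np

def sample_ppca(W, mu, sigma2, n_samples, rng=None):
    """Draw samples from the PPCA generative model X = W z + mu + eps.

    W : (D, M) linear map, mu : (D,) mean, sigma2 : noise variance.
    z ~ N(0, I_M) and eps ~ N(0, sigma2 * I_D), so X ~ N(mu, W W^T + sigma2 * I).
    """
    rng = rng or np.random.default_rng()
    D, M = W.shape
    Z = rng.standard_normal((n_samples, M))                       # latent variables
    eps = np.sqrt(sigma2) * rng.standard_normal((n_samples, D))   # isotropic noise
    return Z @ W.T + mu + eps

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(5, 2))      # hypothetical 5-dimensional data, 2-dim latent
    mu = np.zeros(5)
    X = sample_ppca(W, mu, sigma2=0.1, n_samples=5000, rng=rng)
    # The empirical covariance should approach W W^T + sigma2 * I.
    print(np.round(np.cov(X, rowvar=False) - (W @ W.T + 0.1 * np.eye(5)), 2))
```

The check at the end illustrates that the marginal covariance of the generated data is W W^T + σ²I, which is the source of the connection to PCA in the limit σ² → 0.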

2.3 Expectation-maximization algorithm

The likelihood of a model that includes a latent variable can be maximized by the EM algorithm. Given data X and latent variables z, the EM algorithm finds the maximum likelihood estimates of the model parameters θ. To compute the likelihood L(θ), we need to integrate z out:

L(\theta) = \log \sum_{z} p_\theta(X, z).

This equation is intractable to compute because it requires the enumeration of z. The EM algorithm, instead, maximizes a lower bound of the likelihood. We can derive the lower bound by

L(\theta) = \log \sum_{z} p_\theta(X, z)
          = \log \sum_{z} q(z)\, \frac{p_\theta(X, z)}{q(z)}
          \geq \sum_{z} q(z) \log \frac{p_\theta(X, z)}{q(z)}
          = \sum_{z} q(z) \log p_\theta(X, z) + H(q)
          = F(q, \theta).

Note that we introduced a probability distribution q(z) in the second line of the derivation by multiplying and dividing by the same quantity, so its actual form can be chosen arbitrarily. Then, we used Jensen's inequality to obtain the inequality in the third line.

Jensen's inequality: if p_1, ..., p_n are positive numbers which sum to 1 and f is a real continuous function that is concave, then

f\!\left( \sum_{i=1}^{n} p_i x_i \right) \geq \sum_{i=1}^{n} p_i f(x_i).

H(q) denotes the Shannon entropy −Σ_z q(z) log q(z), whose whole point is that it is independent of θ. We have now reduced the likelihood maximization problem to a lower bound maximization problem; however, before we maximize it, we need to choose q(z) properly so that F(q, θ) becomes a tight bound on L(θ). Their difference is

L(\theta) - F(q, \theta) = \log p_\theta(X) - \sum_{z} q(z) \log \frac{p_\theta(X, z)}{q(z)}
                         = -\sum_{z} q(z) \log \frac{p_\theta(z \mid X)\, p_\theta(X)}{q(z)} + \log p_\theta(X)
                         = -\sum_{z} q(z) \log \frac{p_\theta(z \mid X)}{q(z)}
                         = D_{KL}\big(q(z) \,\|\, p_\theta(z \mid X)\big)

where D_KL(p ‖ q) denotes the Kullback-Leibler divergence, which is often used as a distance metric between two probability distributions. It is non-negative and becomes zero iff p = q.

Kullback-Leibler divergence: for distributions P and Q of a continuous random variable, the Kullback-Leibler divergence is defined as

D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx.

Therefore, the lower bound F(q, θ) is tight when q(z) = p_θ(z|X). In the EM algorithm, we tighten the lower bound in the expectation step and then maximize the lower bound in the maximization step. Formally,

Expectation step:   q(z) = p_\theta(z \mid X)
Maximization step:  \theta = \arg\max_\theta \sum_{z} q(z) \log p_\theta(X, z).
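The identity L(θ) = F(q, θ) + D_KL(q(z) ‖ p_θ(z|X)) derived above can be checked numerically on a toy discrete latent variable. The joint probabilities in the sketch below are arbitrary made-up numbers used only to illustrate the decomposition.

```python
import numpy as np

# Toy joint p_theta(X = x_obs, z) over a discrete latent z with 4 states
# (arbitrary positive numbers; only the decomposition matters here).
p_joint = np.array([0.20, 0.05, 0.10, 0.15])   # p_theta(x_obs, z) for each z
p_x = p_joint.sum()                            # p_theta(x_obs)
posterior = p_joint / p_x                      # p_theta(z | x_obs)

q = np.array([0.4, 0.3, 0.2, 0.1])             # an arbitrary variational distribution

L = np.log(p_x)                                # true log-likelihood L(theta)
F = np.sum(q * np.log(p_joint / q))            # lower bound F(q, theta)
kl = np.sum(q * np.log(q / posterior))         # D_KL(q || p_theta(z | x_obs))

print(L, F + kl)          # identical up to floating point error
print(L >= F)             # Jensen's inequality: the bound never exceeds L

# Setting q to the exact posterior makes the bound tight.
F_tight = np.sum(posterior * np.log(p_joint / posterior))
print(np.isclose(F_tight, L))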

The computation of the EM algorithm in the context of PPCA is described below. Note that the expectation of the posterior is used instead of the full distribution, which is intractable to compute. Let N be the number of samples and D be the dimension of the data. Also, let X̂ be the centered data, (X − X̄). Then, the equations for the iterative updates are the following; the derivation is articulated by Tipping and Bishop [13].

Expectation step:

E[z_n] = (W^T W + \sigma^2 I)^{-1} W^T \hat{x}_n

Maximization step:

W_{new} = \left[ \sum_{n=1}^{N} \hat{x}_n\, E[z_n]^T \right] \left[ \sum_{n=1}^{N} E[z_n] E[z_n]^T \right]^{-1}

\sigma^2_{new} = \frac{1}{ND} \sum_{n=1}^{N} \left\{ \lVert \hat{x}_n \rVert^2 - 2\, E[z_n]^T W_{new}^T \hat{x}_n + \mathrm{Tr}\!\left( E[z_n] E[z_n]^T W_{new}^T W_{new} \right) \right\}

In the formulation of PPCA, we assumed that the dimensionality M of the principal components is given a priori, and M = 2 is usually used for visualization purposes. However, the choice of M is a critical problem when PPCA is used for dimensionality reduction. We could use cross-validation to optimize M, but it is computationally expensive. Instead, we can take a Bayesian approach by placing a prior, governed by a hyperparameter α, over W. This approach is called Bayesian PCA (BPCA) [2]. As Bayesian approaches often do, BPCA requires evaluating the marginal distribution over µ, W, and σ², which is intractable. To this end, we can use the variational framework [3], a natural extension of the EM algorithm, to approximate the marginalization. BPCA has several advantages, including robustness to overfitting as well as automatic dimensionality selection; however, we fix the dimensionality of the principal components in this work in order to compare the experimental results with VAEs. We discuss the variational framework in the context of VAEs in the next section.

3 Variational frameworks

Variational inference is a framework to approximate the true posterior P with the best Q from a set of distributions Q by minimizing a difference metric between P and Q as an optimization problem. The most common choice for the difference metric is the Kullback-Leibler divergence, as we have seen in the EM algorithm, but there are variations depending on how the true distribution is approximated and how the difference is measured.

3.1 Variational EM algorithm

Remember that in the EM algorithm for PPCA, we used a point estimate of z rather than the distribution q(z) because of the computational difficulty. However, this tends to overfit to the training data; also, the EM algorithm cannot be used when there are multiple latent variables. Hence, we extend the EM algorithm by approximating q(z) = p_θ(z|X), leading to the variational EM algorithm. The common approximation is q(z) ≈ ∏_i q_{φ_i}(z_i), where each q_φ(·) is a conjugate distribution. Then, the approximation is improved by minimizing the KL divergence between q_φ(z) and p_θ(z|X) with respect to the parameter φ, which is equivalent to maximizing the lower bound of the likelihood. It is formulated as

Expectation step:   \phi = \arg\max_\phi F(\phi, \theta)
Maximization step:  \theta = \arg\max_\theta F(\phi, \theta)

where

F(\phi, \theta) = \sum_{z} q_\phi(z) \log \frac{p_\theta(X, z)}{q_\phi(z)}.
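Before moving on to VAEs, the PPCA-specific updates from Section 2.3 can be written as a short NumPy sketch. It follows the E[z_n], W_new, and σ²_new equations above, using the point-estimate second moment E[z_n]E[z_n]^T as in the text, and it is an illustrative sketch under those assumptions rather than the exact code used in the experiments.

```python
import numpy as np

def ppca_em(X, M, n_iter=50, seed=0):
    """Fit PPCA by EM, following the update equations in Section 2.3.

    X : (N, D) data, M : latent dimensionality.
    Returns W (D, M), sigma2 (float), and the data mean mu (D,).
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                                   # centered data X_hat
    W = rng.normal(size=(D, M))
    sigma2 = 1.0

    for _ in range(n_iter):
        # E-step: E[z_n] = (W^T W + sigma2 I)^{-1} W^T x_hat_n  (stored as rows of Ez)
        A = W.T @ W + sigma2 * np.eye(M)
        Ez = np.linalg.solve(A, W.T @ Xc.T).T     # (N, M)

        # M-step: W_new and sigma2_new from the equations above.
        S1 = Xc.T @ Ez                            # sum_n x_hat_n E[z_n]^T  -> (D, M)
        S2 = Ez.T @ Ez                            # sum_n E[z_n] E[z_n]^T   -> (M, M)
        W = S1 @ np.linalg.inv(S2)
        resid = (np.sum(Xc ** 2)
                 - 2.0 * np.sum(Ez * (Xc @ W))
                 + np.trace(S2 @ W.T @ W))
        sigma2 = resid / (N * D)

    return W, sigma2, mu
```

In our setting M is fixed in advance (for example M = 2 for visualization), matching the discussion above.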

Figure 2: Karpathy et al., Generative Models.

The updating steps are now tractable and admit closed-form solutions because of the conjugacy assumption on q_φ.

3.2 Variational auto-encoder

The variational auto-encoder is an extension of the variational EM algorithm, but it approximates the distribution q(z) by a multi-layer neural network instead of a product of conjugate distributions. The expectation step and the maximization step are performed jointly by backpropagation on the network. The advantage of this model is that it does not assume a simple conjugate distribution; instead, the probability distribution can be learned conditioned on the data. As a result, the approximation accuracy may improve. Because of this dependency on the data, the lower bound is rewritten as

F(\phi, \theta) = \int q_\phi(z \mid X) \log \frac{p_\theta(X, z)}{q_\phi(z \mid X)}\, dz
                = \int q_\phi(z \mid X) \log \frac{p_\theta(X \mid z)\, p_\theta(z)}{q_\phi(z \mid X)}\, dz
                = -D_{KL}\big(q_\phi(z \mid X) \,\|\, p_\theta(z)\big) + \int q_\phi(z \mid X) \log p_\theta(X \mid z)\, dz.

Therefore, one objective of VAEs is to estimate the parameters φ of the approximated posterior q_φ(z|X) so that it matches p_θ(z). The parameters are estimated with a multi-layer neural network encoder by backpropagation. Then, p_θ(X|z) is computed with a decoder, which is also a multi-layer neural network. In other words, the function f(·) for VAEs that maps the latent variable z to the data space X is a multi-layer neural network, whereas PPCA mapped the latent variable by a simple linear transformation:

p(X \mid \theta) = \int \mathcal{N}(X \mid f(z, \theta), \sigma^2 I)\, p(z)\, dz.

Just like in PPCA, we assume the probability distribution of z is the standard Gaussian N(z|0, I). In spite of such a simple assumption, the model can represent fairly complex data because of the mapping by the neural network. In order to train the encoder by backpropagating the reconstruction error, we need a gradient estimator for φ. To this end, VAEs use the reparametrization trick, in which the random variable ẑ drawn from the approximated posterior q_φ(z|X) is replaced with a differentiable function of a noise variable ε,

\hat{z} = g_\phi(\epsilon, X)

where ε is sampled from a fixed distribution. In the Gaussian case,

\hat{z} = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1), \quad (\mu, \sigma) = \mathrm{NN}(X)

where NN denotes the multi-layer neural network. In this way we connect the reconstruction to the encoder through the decoder.
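The reparametrization trick and the Gaussian KL term can be written in a few lines of NumPy. In this schematic sketch, mu and log_var stand in for the encoder outputs NN(X) in the text; it is an illustration of the two ingredients, not the TensorFlow implementation used in the experiments.

```python
import numpy as np

def reparametrize(mu, log_var, rng=None):
    """Sample z_hat = mu + sigma * eps with eps ~ N(0, I), differentiable in (mu, sigma)."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(mu.shape)            # noise from a fixed distribution
    return mu + np.exp(0.5 * log_var) * eps

def kl_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([[0.5, -1.0]])         # hypothetical encoder outputs for one input X
    log_var = np.array([[-0.2, 0.3]])
    z = reparametrize(mu, log_var, rng)  # fed to the decoder p_theta(X | z)
    print(z, kl_standard_normal(mu, log_var))
```

Because the noise ε comes from a fixed distribution, the sampled ẑ is a deterministic, differentiable function of µ and σ, which is what allows the gradient to flow from the reconstruction error back into the encoder.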

In summary, we first compute the parameters µ, σ of the approximated posterior q_φ(z|X) with an encoder neural network, and compute the KL divergence and its gradient with respect to φ. Then, we sample z to compute a Monte Carlo estimate of the expected reconstruction error using the reparametrization trick, reconstruct the data with the decoder taking z as input, and compute the cross-entropy error and its gradients with respect to θ and φ. Finally, we update θ and φ.

4 Experiments

4.1 Method

We implemented PPCA and VAEs in Python using NumPy. TensorFlow is used to implement the multi-layer neural networks as part of the VAEs. We tested PPCA and VAEs on an image completion task using the MNIST database, which contains handwritten digits. We first trained the generative models on a subset of the MNIST images used as the training dataset. Specifically, in the case of PPCA, we fit the matrix W to the training data by the EM algorithm. Then, we reconstructed images by choosing the latent variable z by gradient descent so that it minimizes the residual sum of squares against the given half of 16 images picked randomly from the testing dataset. In the case of VAEs, we trained the encoder by giving it the full MNIST image plus the bottom half of the image, then reconstructed the top half by conditioning on the bottom half of the image. Since an MNIST image has 784 pixels, the input to the encoder is a (3/2)·784-dimensional vector. We set the number of nodes in the hidden layer to 100. The reconstructed image is evaluated against the ground truth quantitatively by the pixel-wise residual sum of squares between the reconstruction and the ground truth, and qualitatively by inspecting the reconstructed images by eye. Some of the results are shown below.

4.2 Results

The graph shows the average of the pixel-wise residual sum of squares between the ground truth image and the reconstructed image at each epoch. Precisely, the error of a reconstruction is

\sum_{i} \sum_{j} (X_{i,j} - \hat{X}_{i,j})^2

where X_{i,j} is the pixel value at (i, j), normalized to [0, 1]. In the diagram, the average loss over 16 reconstructions is reported. An epoch for PPCA consists of updates through all the training data, while an epoch of variational inference consists of a number of backpropagation steps. The graph indicates that PPCA might have overfitted to the training data after epoch 1, while the error of VAEs does not increase after the training saturates at epoch 1. Note that the pixel-wise residual sum of squares is not a great metric of the error. Suppose we had a reconstruction which was perfect except that it was shifted to the right by just one pixel. The reconstruction would look good to a human, but the residual sum of squares would become huge. To remedy this problem, we visualized the reconstructions as shown below.
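The PPCA completion step described in Section 4.1 (choosing z by gradient descent so that the reconstruction matches the given half, then reading off the hidden half) can be sketched as follows, together with the pixel-wise residual sum of squares used above. The mask convention, learning rate, and step count are illustrative assumptions, not the exact settings of the experiments.

```python
import numpy as np

def complete_ppca(x_visible, visible_mask, W, mu, n_steps=200, lr=1e-3):
    """Complete an occluded image under a fitted PPCA model.

    Choose z by gradient descent so that W z + mu matches the visible pixels,
    then return the full reconstruction W z + mu (hidden pixels included).
    x_visible : (D,) image with arbitrary values at hidden pixels,
    visible_mask : (D,) boolean, True where the pixel is observed.
    """
    Wv, muv, xv = W[visible_mask], mu[visible_mask], x_visible[visible_mask]
    z = np.zeros(W.shape[1])
    for _ in range(n_steps):
        resid = Wv @ z + muv - xv            # error on the visible half
        z -= lr * 2.0 * Wv.T @ resid         # gradient of the residual sum of squares
    return W @ z + mu                        # full image, hidden half filled in

def pixelwise_rss(x_true, x_recon):
    """Pixel-wise residual sum of squares, with pixels normalized to [0, 1]."""
    return np.sum((x_true - x_recon) ** 2)
```

With W and mu obtained from the EM sketch in Section 2.3, complete_ppca(occluded_image, mask, W, mu) returns a full 784-pixel reconstruction whose hidden half is the model's completion, and pixelwise_rss compares it against the ground truth.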

(a) PPCA completion. (b) VAE completion.

Figure 4: Each block consists of the ground truth in the top row, the occluded input in the second row, and the reconstructed image in the third row. The left column shows the results by PPCA and the right column the results by VAEs. The first row of blocks is the 0th epoch, i.e. initialization, followed by the 1st, 3rd, and 9th epochs.

5 Discussion

Both PPCA and VAEs are latent Gaussian variable models. However, the samples generated by VAEs look more realistic compared to the ones generated by PPCA. The difference may come from mainly two sources: the representational power of the model and the robustness of the inference. First, the mapping function used in VAEs to map latent variables from the latent space to the actual data space is a multi-layer neural network, which is a non-linear function. In contrast, the mapping function used in PPCA only applies a simple linear transformation to the latent variable. Second, the EM algorithm used in PPCA point-estimates the latent variable, which often leads to overfitting to the training data. On the other hand, the variational inference used in VAEs estimates the posterior distribution of the latent variable. Although the latter is an approximation method, it is more robust to overfitting than the EM algorithm.

6 Conclusion

Deep generative models including VAEs attract huge attention these days, but the same task can be done by very simple models like PCA, as we have shown with the example of the occluded handwritten digit completion task. However, deep models and the corresponding inference algorithms certainly have advantages such as representational power and robustness to overfitting. Rather than using deep models blindly, it is important to evaluate the complexity of the given task and validate the necessity of using such a complex model as VAEs. Also, the theoretical foundations of deep models should be studied further so that we can systematically choose the model to use without relying too much on empirical results.

References

[1] Carl Doersch. Tutorial on Variational Autoencoders, 2016; arXiv.
[2] Christopher M. Bishop. Bayesian PCA, 1998; Advances in Neural Information Processing Systems 11.
[3] Christopher M. Bishop. Variational Principal Components, 1999; In Proceedings of the Ninth International Conference on Artificial Neural Networks, ICANN'99, IEE, volume 1.
[4] David M. Blei, Alp Kucukelbir and Jon D. McAuliffe. Variational Inference: A Review for Statisticians, 2016; arXiv.
[5] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, 2013; arXiv.
[6] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling. Semi-supervised Learning with Deep Generative Models, 2014; Advances in Neural Information Processing Systems 27.
[7] Hastings, W. Monte Carlo sampling methods using Markov chains and their applications, 1970; Biometrika, 57.
[8] Hotelling, H. Analysis of a Complex of Statistical Variables Into Principal Components, 1933; Journal of Educational Psychology, 24.
[9] Jordan, M. I., Ghahramani, Z., Jaakkola, T., and Saul, L. Introduction to variational methods for graphical models, 1999; Machine Learning, 37.
[10] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective, 2013; MIT Press.
[11] Kihyuk Sohn, Xinchen Yan, Honglak Lee. Learning Structured Output Representation using Deep Conditional Generative Models, 2015; Advances in Neural Information Processing Systems 28.
[12] Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, M., and Teller, E. Equations of state calculations by fast computing machines, 1953; Journal of Chemical Physics, 21.
[13] Michael E. Tipping and Christopher M. Bishop. Probabilistic Principal Component Analysis, 1999; Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 61, No. 3.
[14] Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference, 2008; Foundations and Trends in Machine Learning, 1(1-2).
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition, 1998; Proceedings of the IEEE, 86(11), November.
