Generative models for missing value completion

Kousuke Ariga
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98105
koar8470@cs.washington.edu
http://homes.cs.washington.edu/ koar8470/

Abstract

Deep generative models such as generative adversarial networks (GANs) and variational auto-encoders (VAEs) are attracting considerable attention these days, but their theory and their relation to simpler generative models are not well studied. In this work, we first reformulate the well-known statistical method PCA as a generative latent Gaussian variable model and derive the EM algorithm as the corresponding inference method. Then, we compare this model-inference pair, probabilistic PCA (PPCA), to VAEs, which pair a deep generative latent Gaussian model with variational inference as the inference method. The comparison is done by fitting both models to the MNIST hand-written digit database and completing hidden parts of digits by maximum likelihood estimation. VAEs reconstruct more realistic digits, and we conclude that the improvement comes from the representational power of the multi-layer neural network and the robustness to overfitting of variational inference, both of which are used in VAEs.

1 Introduction

In the field of machine learning, discriminative models, as typified by logistic regression, support vector machines, and convolutional neural networks, have been studied intensively [10], and they have shown success especially in classification tasks. However, such models need a labeled dataset for training, which is not always available. Also, the uses of the conditional probability distribution p(y|x) given by these discriminative models are relatively limited. On the other hand, generative models such as naive Bayes, hidden Markov models, and restricted Boltzmann machines [10] model the joint probability distribution p(x, y), so the resulting model can be used for broader purposes including anomaly detection, data compression, feature extraction, clustering, and missing value completion, as well as classification.

Generative models often assume a structure behind the given data and introduce latent variables. A common way to infer the parameters of probabilistic models with latent variables is the EM algorithm. However, the EM algorithm cannot be used when the model has multiple latent variables or includes computationally intractable expressions such as the integral of a probability distribution. Researchers have tackled this problem by estimating the distribution with mainly two approaches: sampling and approximation. Sampling methods include many variations of Markov chain Monte Carlo (MCMC) [7][12], but they are computationally expensive if not intractable. Approximation methods are represented by variational inference [9][14]. Variational inference is more efficient than MCMC and usually gives a tight approximation.

In this work, we first review generative latent variable models by reformulating PCA as a generative latent Gaussian variable model, called PPCA [13]. Then, we introduce a recently proposed generative framework, VAEs [5], as a deep generative latent Gaussian variable model. We examine the relationship between VAEs and PPCA and compare their performance by applying both methods to an image completion task.

Specifically, their performance is evaluated quantitatively by comparing how well each model reconstructs the hidden part of hand-written digits [15] in terms of the pixel-wise residual sum of squares. This is not a complete evaluation metric because it would take a huge value if a perfectly reconstructed sample were shifted to the right by just one pixel. Therefore, we also visualize the reconstructions and compare them qualitatively by eye.

2 Preliminaries

2.1 Generative latent variable models

Generative models model the actual probability distribution of the given data X. Specifically, whereas discriminative models compute the conditional probability of class y given data X, p(y|X), for classification or prediction tasks, generative models compute the joint probability of X and y, p(X, y). Because generative models model the generation process of the data, we are able to sample, or generate, data X by specifying its class y. This is why they are called generative models.

When we model data X in a generative manner, we often assume some structure behind the data and introduce latent variables to leverage it. Suppose we have a digit whose top half is occluded. For example, let's assume we see the bottom half of a digit 7. If we do not assume a structure in the data, the top half can be anything; it may be 8 or even random noise. However, if we assume a structure, the top half is most likely 1 or 7. This is because digits are not random sets of dots, but lines drawn by following rules in order to express information. The random variable which represents the structure behind the data is called a latent variable. For another example, by reading half of an article, we can guess what kind of words are used in the remaining half. Suppose words like perceptron, convolutional, and gpu appeared in the given half; then words such as deep learning and cloud would be more likely to appear in the rest than words like soccer or constitution. This prediction is possible because we assume, for instance, that the topic of the article is machine learning. The topic is the latent variable in this example, as opposed to the class of the digit in the previous example. Latent variables may sound like the class of the data, but the important difference is that latent variables are not explicitly given in the form of labels. Also, latent variables can encode much more complex structure that humans may not be able to put into words, like the style of images.

Formally, generative models can be defined as follows. Let f(z, θ) denote a deterministic function that maps the random latent variable z to the data space, like a topic to articles or a digit class to actual digit images.

X ≈ X̂ = f(z, θ)

The function f differs depending on the model, but our motivation is to minimize the error between X and X̂. Therefore, the objective of generative models is to maximize

p(X|θ) = ∫ p_θ(X|z) p(z) dz

with respect to θ. This is a general maximum likelihood problem, although evaluating the integral is often computationally intractable.

2.2 Probabilistic principal component analysis

PCA is usually defined as the orthogonal projection of D-dimensional data X onto an M < D dimensional linear subspace such that the variance of the projected data is maximized [8]. Therefore, the principal components of some data are obtained by first evaluating the covariance matrix of the centralized data and then taking its eigenvectors u corresponding to the M largest eigenvalues. Here, the u are the principal components of X. The original data is approximated by a linear combination of the principal components as,

X ≈ X̃ = Σ_{i=1}^{M} z_i u_i + µ

where z_i is the weight of the principal component u_i and µ is the mean of the original data. For this property, PCA is used for applications such as dimensionality reduction and feature extraction.

We now reformulate PCA as a generative model. Specifically, PCA is the maximum likelihood solution of a latent Gaussian variable model. This formulation is called probabilistic PCA (PPCA) [13]. We first introduce a latent variable z corresponding to the principal component subspace W. Then, the probability distribution of X conditioned on the latent variable z is

p(X|z) = N(X | Wz + µ, σ² I)

where p(z) = N(z | 0, I). Note that we can assume z to be a standard Gaussian, as opposed to a general Gaussian distribution, without loss of generality. By seeing this probabilistic model from a generative viewpoint, that is, by assuming the data is generated by first sampling the value z of the latent variable and then mapping z to the actual data space X, we get the expression

X = Wz + µ + ϵ

where ϵ is Gaussian noise. Note that PPCA gives the same principal components as PCA when we take the limit σ² → 0. You will notice the similarity between PPCA and PCA if you see this expression as a linear combination of the column vectors of W and regard them as principal components.

Figure 1: The left two images show how the latent variable z is sampled from a Gaussian distribution and mapped to the actual data space X. The right-most image is the resulting probability distribution of the observable variable X. Bishop, Pattern Recognition and Machine Learning, Springer, 2009, p. 570.

The likelihood of the given data X and the latent variables Z = {z_n} is, by the definition of the Gaussian distribution,

p(X, Z | µ, W, σ²) = Π_{n=1}^{N} p(X_n | z_n) p(z_n)
                   = Π_{n=1}^{N} (2πσ²)^{-D/2} exp( -||X_n - W z_n - µ||² / (2σ²) ) (2π)^{-M/2} exp( -||z_n||² / 2 ).
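As a concrete illustration of the two views above, the following is a minimal NumPy sketch (our own illustration, not the code used in the experiments; sizes and variable names are arbitrary choices) that computes classical principal components by eigendecomposition and then draws a sample from the corresponding PPCA generative model X = Wz + µ + ϵ:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, M = 1000, 20, 3                  # samples, data dimension, latent dimension
    X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated toy data

    # Classical PCA: eigenvectors of the covariance of the centered data.
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / N
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :M]                     # top-M principal components, shape (D, M)

    # Reconstruction from M components: X_hat_n = sum_i z_i u_i + mu
    Z = Xc @ U                                      # weights z_i for each sample
    X_hat = Z @ U.T + mu

    # PPCA view: X = W z + mu + eps, with z ~ N(0, I) and eps ~ N(0, sigma^2 I).
    sigma2 = eigvals[:-M].mean()                    # ML sigma^2: mean of the discarded eigenvalues
    W = U * np.sqrt(np.maximum(eigvals[::-1][:M] - sigma2, 0.0))  # ML W (up to a rotation)
    z = rng.normal(size=M)
    x_sample = W @ z + mu + rng.normal(scale=np.sqrt(sigma2), size=D)

    print(np.square(X - X_hat).sum() / N)           # average PCA reconstruction error

The closed-form W used here is the maximum likelihood solution given by Tipping and Bishop [13]; in the experiments below, W is instead fitted iteratively by the EM algorithm of Section 2.3.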

2.3 Expectation-Maximization algorithm

The likelihood of a model that includes a latent variable can be maximized by the EM algorithm. Given data X and latent variables z, the EM algorithm finds the maximum likelihood estimates of the model parameters θ. To compute the log likelihood L(θ), we need to integrate out z:

L(θ) = log Σ_z p_θ(X, z)

This is intractable to compute because it requires the enumeration of z. The EM algorithm instead maximizes a lower bound of the likelihood. We can derive the lower bound as follows:

L(θ) = log Σ_z p_θ(X, z)
     = log Σ_z q(z) p_θ(X, z) / q(z)
     ≥ Σ_z q(z) log [ p_θ(X, z) / q(z) ]
     = Σ_z q(z) log p_θ(X, z) + H(q)
     = F(q, θ)

Note that we introduced a probability distribution q(z) in the second line of the derivation by multiplying and dividing by the same quantity, and therefore its actual form can be chosen arbitrarily. Then, we used Jensen's inequality to obtain the inequality in the third line.

Jensen's inequality: if p_1, ..., p_n are positive numbers which sum to 1 and f is a real continuous function that is concave, then

f( Σ_{i=1}^{n} p_i x_i ) ≥ Σ_{i=1}^{n} p_i f(x_i).

H(q) denotes the Shannon entropy -Σ_z q(z) log q(z), whose whole point is that it is independent of θ. We have now reduced the likelihood maximization problem to a lower bound maximization problem; however, before we maximize it, we need to choose q(z) properly so that F(q, θ) becomes a tight bound of L(θ). Their difference is

L(θ) - F(q, θ) = log Σ_z p_θ(X, z) - Σ_z q(z) log [ p_θ(X, z) / q(z) ]
               = log p_θ(X) - Σ_z q(z) log [ p_θ(z|X) p_θ(X) / q(z) ]
               = -Σ_z q(z) log [ p_θ(z|X) / q(z) ]
               = D_KL( q(z) || p_θ(z|X) )

where D_KL(p || q) denotes the Kullback-Leibler divergence, which is often used as a distance metric between two probability distributions. It is non-negative and becomes zero iff p = q.

Kullback-Leibler divergence: for distributions P and Q of a continuous random variable, the Kullback-Leibler divergence is defined as

D_KL(P || Q) = ∫ p(x) log [ p(x) / q(x) ] dx.

Therefore, the lower bound F(q, θ) is tight when q(z) = p_θ(z|X). In the EM algorithm, we set the lower bound in the expectation step, then maximize the lower bound in the maximization step. Formally,

Expectation step:   q(z) = p_θ(z|X)
Maximization step:  θ = argmax_θ Σ_z q(z) log p_θ(X, z)
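To make the decomposition L(θ) = F(q, θ) + D_KL(q(z) || p_θ(z|X)) concrete, here is a tiny NumPy check (our own illustration, with an arbitrary made-up joint distribution over a two-state latent variable z):

    import numpy as np

    # Arbitrary joint p_theta(X, z) for one fixed observation X and z in {0, 1}.
    p_joint = np.array([0.12, 0.03])     # p_theta(X, z=0), p_theta(X, z=1)
    q = np.array([0.6, 0.4])             # any distribution q(z)

    L = np.log(p_joint.sum())                        # log p_theta(X)
    F = np.sum(q * np.log(p_joint / q))              # lower bound F(q, theta)
    post = p_joint / p_joint.sum()                   # exact posterior p_theta(z | X)
    kl = np.sum(q * np.log(q / post))                # D_KL(q || posterior)

    print(L, F + kl)    # identical up to floating point error
    print(F <= L)       # Jensen's inequality: True

Setting q equal to post makes kl vanish and the bound tight, which is exactly the E-step.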

The computation of the EM algorithm in the context of PPCA is described below. Note that the expectation of the posterior is used instead of the full distribution, which is intractable to compute. Let N be the number of samples and D be the dimension of the data. Also, let X̂ denote the centralized data (X - X̄). Then, the equations for the iterative updates are the following; the derivation is articulated by Tipping and Bishop [13].

Expectation step:

E[z_n] = (W^T W + σ² I)^{-1} W^T X̂_n

Maximization step:

W_new = [ Σ_{n=1}^{N} X̂_n E[z_n]^T ] [ Σ_{n=1}^{N} E[z_n] E[z_n]^T ]^{-1}

σ²_new = (1 / ND) Σ_{n=1}^{N} { ||X̂_n||² - 2 E[z_n]^T W_new^T X̂_n + Tr( E[z_n] E[z_n]^T W_new^T W_new ) }

In the formulation of PPCA, we assumed that the dimensionality M of the principal component subspace is given a priori, and M = 2 is usually used for visualization purposes. However, the choice of M is a critical problem when PPCA is used for dimensionality reduction. We could use cross-validation to optimize M, but it is computationally expensive. Instead, we can take a Bayesian approach by placing a prior, governed by a hyperparameter α, over W. This approach is called Bayesian PCA (BPCA) [2]. As is often the case with Bayesian approaches, BPCA requires evaluating the marginal distribution over µ, W, and σ², which is intractable. To this end, we can use the variational framework [3], a natural extension of the EM algorithm, to approximate the marginalization. BPCA has several advantages, including robustness to overfitting as well as automatic dimensionality selection; however, we fix the dimensionality of the principal components in this work in order to compare the experimental results with VAEs. We discuss the variational framework in the context of VAEs in the next section.

3 Variational frameworks

Variational inference is a framework to approximate the true posterior P with the best Q from a set of distributions Q by minimizing a difference metric between P and Q as an optimization problem. The most common choice for the difference metric is the Kullback-Leibler divergence, as we have seen in the EM algorithm, but there are variations depending on how the true distribution is approximated and how the difference is measured.

3.1 Variational EM algorithm

Remember that in the EM algorithm for PPCA, we used a point estimate of z rather than the distribution q(z) because of the computational difficulty. However, this tends to overfit to the training data; also, the EM algorithm cannot be used when there are multiple latent variables. Hence, we extend the EM algorithm by approximating p_θ(z|X) with q(z), leading to the variational EM algorithm. The common approximation is q(z) ≈ Π_i q_{ϕ_i}(z_i), where each factor q_{ϕ_i}(·) is a conjugate distribution. The approximation is then improved by minimizing the KL divergence between q_ϕ(z) and p_θ(z|X) through the optimization of the parameters ϕ, which is equivalent to maximizing the lower bound of the likelihood. It is formulated as

Expectation step:   ϕ = argmax_ϕ F(ϕ, θ)
Maximization step:  θ = argmax_θ F(ϕ, θ)

where

F(ϕ, θ) = Σ_z q_ϕ(z) log [ p_θ(X, z) / q_ϕ(z) ].
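Before turning to VAEs, here is a minimal NumPy sketch of the PPCA EM updates from Section 2.3 (our own illustration, not the implementation used in the experiments; shapes, initialization, and the toy data are assumptions):

    import numpy as np

    def ppca_em(X, M, n_iter=50, seed=0):
        """Fit PPCA by EM. X has shape (N, D); M is the latent dimensionality."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        mu = X.mean(axis=0)
        Xc = X - mu                                  # centralized data X_hat
        W = rng.normal(size=(D, M))
        sigma2 = 1.0
        for _ in range(n_iter):
            # E-step: E[z_n] = (W^T W + sigma^2 I)^{-1} W^T x_n, for all n at once.
            Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
            Ez = Xc @ W @ Minv                       # shape (N, M)
            # M-step, using E[z_n] E[z_n]^T as in the text (posterior expectation only).
            W_new = (Xc.T @ Ez) @ np.linalg.inv(Ez.T @ Ez)
            sigma2 = np.mean(np.sum(np.square(Xc - Ez @ W_new.T), axis=1)) / D
            W = W_new
        return W, mu, sigma2

    # Toy usage on random, nearly low-rank data.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10)) + rng.normal(scale=0.1, size=(500, 10))
    W, mu, sigma2 = ppca_em(X, M=3)
    print(sigma2)    # roughly reflects the residual noise level

The σ² update here is the same expression as above, since ||X̂_n||² - 2 E[z_n]^T W^T X̂_n + Tr(E[z_n]E[z_n]^T W^T W) = ||X̂_n - W E[z_n]||².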

Figure 2: Karpathy et al., Generative Models, 2016.

The update steps are now tractable and admit closed-form solutions because of the conjugacy assumption on q_ϕ.

3.2 Variational auto-encoder

The variational auto-encoder is an extension of the variational EM algorithm, but it approximates the distribution q(z) with a multi-layer neural network instead of a product of conjugate distributions. The expectation step and the maximization step are performed jointly by backpropagation on the network. The advantage of this model is that it does not assume a simple conjugate distribution; rather, the probability distribution can be learned conditioned on the data. As a result, the approximation accuracy may improve. Because of this dependency on the data, the lower bound is rewritten as

F(ϕ, θ) = ∫ q_ϕ(z|X) log [ p_θ(X, z) / q_ϕ(z|X) ] dz
        = ∫ q_ϕ(z|X) log [ p_θ(X|z) p_θ(z) / q_ϕ(z|X) ] dz
        = -D_KL( q_ϕ(z|X) || p_θ(z) ) + E_{q_ϕ(z|X)}[ log p_θ(X|z) ].

Therefore, the objective of VAEs is to estimate the parameters ϕ of the approximate posterior q_ϕ(z|X) so that it matches p_θ(z) while explaining the data. The parameters are estimated with a multi-layer neural network encoder by backpropagation. Then, p_θ(X|z) is computed with a decoder, which is also a multi-layer neural network. In other words, the function f(z, θ) for VAEs that maps the latent variable z to the data space X is a multi-layer neural network, whereas PPCA mapped the latent variable by a simple linear transformation.

p(X|θ) = ∫ N(X | f(z, θ), σ² I) p(z) dz

Just like PPCA, we assume the probability distribution of z is the standard Gaussian N(z | 0, I). In spite of such a simple assumption, the model can represent fairly complex data because of the mapping by the neural network.

In order to train the encoder by backpropagating the reconstruction error, we need a gradient estimator for ϕ. To this end, VAEs use the reparametrization trick, in which the random variable ẑ drawn from the approximate posterior q_ϕ(z|X) is replaced with a differentiable function of a noise variable ϵ as

ẑ = g_ϕ(ϵ, X)

where ϵ is sampled from a fixed distribution. In the Gaussian case, ẑ = µ + σϵ, where ϵ ∼ N(0, 1) and µ, σ = NN(X); NN denotes the multi-layer neural network. In this way we connect the reconstruction to the encoder through the decoder.
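The following minimal NumPy sketch (our own illustration; the encoder outputs µ and log σ² are faked with random numbers here) shows the two quantities involved in this bound for a diagonal-Gaussian posterior: the reparametrized sample ẑ = µ + σϵ and the closed-form KL term D_KL(N(µ, σ²) || N(0, I)):

    import numpy as np

    rng = np.random.default_rng(0)
    M = 20                                   # latent dimensionality

    # Pretend these came from the encoder network NN(X).
    mu = rng.normal(size=M)
    log_var = rng.normal(scale=0.1, size=M)  # encoders usually output log sigma^2 for stability
    sigma = np.exp(0.5 * log_var)

    # Reparametrization trick: z_hat is a deterministic, differentiable function of (mu, sigma)
    # plus noise from a fixed distribution, so gradients can flow back into the encoder.
    eps = rng.normal(size=M)
    z_hat = mu + sigma * eps

    # Analytic KL divergence D_KL( N(mu, diag(sigma^2)) || N(0, I) ).
    kl = 0.5 * np.sum(mu**2 + sigma**2 - log_var - 1.0)

    print(z_hat[:5], kl)

In an actual VAE, z_hat would be passed through the decoder to obtain a Monte Carlo estimate of E_{q_ϕ(z|X)}[log p_θ(X|z)], and kl would be subtracted from it to form F(ϕ, θ).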

In summary, we first compute the parameters µ, σ of the approximate posterior q_ϕ(z|X) with an encoder neural network, along with the KL divergence and its gradient with respect to ϕ. Then, we sample z using the reparametrization trick to compute the Monte Carlo estimate of the expected reconstruction error, and reconstruct the data with a decoder taking z as the input. We compute the cross-entropy error and its gradients with respect to θ and ϕ, and finally update θ and ϕ.

4 Experiments

4.1 Method

We implemented PPCA and VAEs in Python using NumPy. TensorFlow is used to implement the multi-layer neural networks as a part of VAEs. We tested PPCA and VAEs on an image completion task using the MNIST database, which contains 70000 hand-written digits. We first trained the generative models on the MNIST database using 60000 of the 70000 images as a training dataset. Specifically, in the case of PPCA, we fit a 784 × 100 matrix W to the training data by the EM algorithm. Then, we reconstructed images by choosing the latent variable z by gradient descent so that it minimizes the residual sum of squares against the given half of 16 images picked randomly from the testing dataset. In the case of VAEs, we trained the encoder by giving the full MNIST image plus the bottom half of the image, then reconstructed the top half by conditioning on the bottom half of the image. Since an MNIST image has 784 pixels, the input to the encoder is a (3/2)·784-dimensional vector. We set the number of nodes in the hidden layer to 100. The reconstructed image is evaluated against the ground truth quantitatively by the pixel-wise residual sum of squares between the reconstruction and the ground truth, and qualitatively by inspecting the reconstructed images by eye. Some of the results are shown below.

4.2 Results

The graph shows the average of the pixel-wise residual sum of squares between the ground truth image and the reconstructed image at each epoch. Precisely, the error of a reconstruction is

Σ_i Σ_j (X_{i,j} - X̂_{i,j})²

where X_{i,j} is the pixel value at (i, j), normalized to [0, 1]. In the diagram, the average loss over 16 reconstructions is reported. An epoch for PPCA consists of updates through all the training data, while an epoch of the variational inference consists of 10000 steps of backpropagation. The graph indicates that PPCA might have overfitted to the training data after Epoch 1, while the error of VAEs does not increase after the training saturates at Epoch 1.

Note that the pixel-wise residual sum of squares is not a great error metric. Suppose we had a reconstruction that is perfect except that it is shifted to the right by just one pixel. The reconstruction would look good to a human, but the residual sum of squares would become huge. To remedy this problem, we visualized the reconstructions, shown in Figure 4 below.
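As an illustration of the PPCA completion step described in Section 4.1, the following NumPy sketch (our own, hypothetical code; it assumes a fitted model (W, mu), pixels ordered top-to-bottom, and an arbitrarily chosen step size) chooses z by gradient descent on the residual sum of squares over the visible bottom half and then fills in the occluded top half:

    import numpy as np

    def complete_top_half(x_bottom, W, mu, n_steps=200, lr=1e-3):
        """x_bottom: observed bottom-half pixels (392,); W: (784, M); mu: (784,)."""
        D_half = x_bottom.shape[0]
        W_bot, mu_bot = W[-D_half:], mu[-D_half:]     # model rows for the visible pixels
        z = np.zeros(W.shape[1])
        for _ in range(n_steps):
            residual = W_bot @ z + mu_bot - x_bottom  # prediction error on the visible half
            grad = 2.0 * W_bot.T @ residual           # gradient of the RSS with respect to z
            z -= lr * grad
        return W @ z + mu                             # reconstruct the whole digit from z

    # Hypothetical usage with a fitted model (W, mu) and a test image x of shape (784,):
    # x_hat = complete_top_half(x[392:], W, mu)
    # rss = np.sum((x_hat[:392] - x[:392]) ** 2)      # pixel-wise RSS on the hidden top half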

(a) PPCA completion. (b) VAE completion.

Figure 4: Each block consists of the ground truth in the top row, the occluded input in the second row, and the reconstructed image in the third row. The left column shows the results of PPCA and the right column the results of VAEs. The first row of blocks is the 0th epoch, i.e. the initialization, followed by the 1st, 3rd, and 9th epochs.

5 Discussion

Both PPCA and VAEs are latent Gaussian variable models. However, the samples generated by VAEs look more realistic compared to the ones generated by PPCA. The difference may come mainly from two sources: the representational power of the model and the robustness of the inference. First, the mapping function used in VAEs to map latent variables from the latent space to the actual data space is a multi-layer neural network, which is a non-linear function. In contrast, the mapping function used in PPCA applies only a simple linear transformation to the latent variable. Second, the EM algorithm used in PPCA relies on a point estimate of the latent variable, which often leads to overfitting to the training data. On the other hand, the variational inference used in VAEs estimates the posterior distribution of the latent variable. Although the latter is an approximation method, it is more robust to overfitting than the EM algorithm.

6 Conclusion

Deep generative models including VAEs attract huge attention these days, but the same task can be done by very simple models like PCA, as we have shown with the example of an occluded hand-written digit completion task. However, deep models and their corresponding inference algorithms certainly have advantages such as representational power and robustness to overfitting. Rather than using deep models blindly, it is important to evaluate the complexity of the given task and validate the necessity of using a model as complex as a VAE. Also, the theoretical foundation of deep models should be further studied so that we can systematically choose which model to use without relying too much on empirical results.

References

[1] Carl Doersch. Tutorial on Variational Autoencoders, 2016; arXiv:1606.05908.
[2] Christopher M. Bishop. Bayesian PCA, 1998; Advances in Neural Information Processing Systems 11.
[3] Christopher M. Bishop. Variational Principal Components, 1999; In Proceedings of the Ninth International Conference on Artificial Neural Networks, ICANN'99, IEE, volume 1, pages 509-514.
[4] David M. Blei, Alp Kucukelbir and Jon D. McAuliffe. Variational Inference: A Review for Statisticians, 2016; arXiv:1601.00670.
[5] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, 2013; arXiv:1312.6114.
[6] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed and Max Welling. Semi-supervised Learning with Deep Generative Models, 2014; Advances in Neural Information Processing Systems 27.
[7] Hastings, W. Monte Carlo sampling methods using Markov chains and their applications, 1970; Biometrika, 57:97-109.
[8] Hotelling, H. Analysis of a Complex of Statistical Variables Into Principal Components, 1933; Journal of Educational Psychology, 24:417-441.
[9] Jordan, M. I., Ghahramani, Z., Jaakkola, T., and Saul, L. Introduction to variational methods for graphical models, 1999; Machine Learning, 37:183-233.
[10] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective, 2013; MIT Press.
[11] Kihyuk Sohn, Xinchen Yan and Honglak Lee. Learning Structured Output Representation using Deep Conditional Generative Models, 2015; Advances in Neural Information Processing Systems 28.
[12] Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, M., and Teller, E. Equation of state calculations by fast computing machines, 1953; Journal of Chemical Physics, 21:1087-1092.
[13] Michael E. Tipping and Christopher M. Bishop. Probabilistic Principal Component Analysis, 1999; Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 61, No. 3, 611-622.
[14] Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference, 2008; Foundations and Trends in Machine Learning, 1(1-2):1-305.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition, 1998; Proceedings of the IEEE, 86(11):2278-2324, November.