Generative models for missing value completion

Kousuke Ariga
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98105
koar8470@cs.washington.edu
http://homes.cs.washington.edu/~koar8470/

Abstract

Deep generative models such as generative adversarial networks (GANs) and variational auto-encoders (VAEs) are attracting a great deal of attention, but their theory and their relation to simpler generative models are not well studied. In this work, we first reformulate the well-known statistical method PCA as a generative latent Gaussian variable model and derive the EM algorithm as the corresponding inference method. Then, we compare this model-inference pair, probabilistic PCA (PPCA) with EM, to VAEs, which consist of a deep generative latent Gaussian model with variational inference as the inference method. The comparison is done by fitting both models to the MNIST handwritten digit database and completing hidden parts of digits by maximum likelihood estimation. VAEs reconstruct more realistic digits, and we conclude that the improvement comes from the representational power of the multi-layer neural network and the robustness to overfitting of variational inference, both of which are used in VAEs.

1 Introduction

In the field of machine learning, discriminative models as typified by logistic regression, support vector machines, and convolutional neural networks have been studied intensively [10], and they have shown success especially in classification tasks. However, such models need a labeled dataset for training, which is not always available. Also, the use of the conditional probability distribution $p(y \mid x)$ given by these discriminative models is relatively limited.

On the other hand, generative models such as naive Bayes, hidden Markov models, and restricted Boltzmann machines [10] model the joint probability distribution $p(x, y)$, so the resulting model can be used for broader purposes including anomaly detection, data compression, feature extraction, clustering, and missing value completion as well as classification.

Generative models often assume a structure behind the given data and introduce latent variables. A common way to infer the parameters of probabilistic models with latent variables is the EM algorithm. However, the EM algorithm cannot be used when the model has multiple latent variables or includes computationally intractable expressions such as the integral of a probability distribution. Researchers have tackled this problem by estimating the distribution with mainly two kinds of methods: sampling and approximation. Sampling methods include many variations of Markov chain Monte Carlo (MCMC) [7][12], but they are computationally expensive, if not intractable. Approximation methods are represented by variational inference [9][14]. Variational inference is typically more efficient than MCMC and usually gives a tight approximation.

In this work, we first review generative latent variable models by reformulating PCA as a generative latent Gaussian variable model, called PPCA [13]. Then, we introduce a recently proposed generative framework, VAEs [5], as a deep generative latent Gaussian variable model. We examine the relationship between VAEs and PPCA and compare their performance by applying both methods to an image completion task. Specifically, their performance is evaluated quantitatively by comparing how well each model reconstructs the hidden part of handwritten digits [15] in terms of the pixel-wise residual sum of squares. This is not a complete evaluation metric because it would take a huge value if a perfectly reconstructed sample were shifted to the right by just one pixel. Therefore, we also visualize the reconstructions and compare them qualitatively by eye.

2 Preliminaries

2.1 Generative latent variable models

Generative models model the actual probability distribution of given data $X$. Specifically, whereas discriminative models calculate the conditional probability of class $y$ given data $X$, $p(y \mid X)$, for classification or prediction tasks, generative models compute the joint probability of $X$ and $y$, $p(X, y)$. Because generative models model the generation process of the data, we are able to sample, or generate, data $X$ by specifying its class $y$. This is why they are called generative models.

When we model data $X$ in a generative manner, we often assume some structure behind the data and introduce latent variables to leverage it. Suppose we have a digit whose top half is occluded. For example, let's assume we see the bottom half of a digit 7. If we don't assume a structure in the data, the top half can be anything. It may be the top of an 8 or even random noise. However, if we assume a structure, the top half is most likely that of a 1 or a 7. This is because digits are not random sets of dots, but lines drawn by following some rules to express some information. The random variable which represents the structure behind the data is called a latent variable. For another example, by reading half of an article, we can guess what kind of words are used in the other half.
Suppose words like perceptron, convolutional, and GPU appeared in the given half; then words such as deep learning and cloud would be more likely to appear in the rest than words like soccer or constitution. This prediction is possible because we assume that the topic of the article is, for instance, machine learning. The topic is the latent variable in this example, just as the class of the digit was in the previous example. Latent variables may sound like the class of the data, but the important difference is that latent variables are not explicitly given in the form of labels. Also, latent variables can encode much more complex structures that humans may not be able to put into words, like the style of images.

Formally, generative models can be defined as follows. Let $f(\cdot)$ denote a deterministic function that maps the random latent variable $z$ to the data space, like a topic to articles or a digit class to actual digit images:

$$X \approx \hat{X} = f(z, \theta)$$

The function $f(\cdot)$ differs depending on the model, but our motivation is to minimize the error between $X$ and $\hat{X}$. Therefore, the objective of generative models is to maximize

$$p(X \mid \theta) = \int p_\theta(X \mid z)\, p(z)\, dz$$

with respect to $\theta$. This is a general maximum likelihood problem, although evaluating the integral is often computationally intractable.

2.2 Probabilistic principal component analysis

PCA is usually defined as the orthogonal projection of $D$-dimensional data $X$ onto an $M$-dimensional linear space, $M < D$, such that the variance of the projected data is maximized [8]. Therefore, the principal components of some data are obtained by first evaluating the covariance matrix of the centered data and then taking its eigenvectors $u_i$ corresponding to the $M$ largest eigenvalues. Here, the $u_i$ are the principal components of $X$. The original data is approximated by a linear combination of the principal components as,

$$X \approx \tilde{X} = \sum_{i=1}^{M} z_i u_i + \mu$$

where $z_i$ is the weight of the principal component $u_i$ and $\mu$ is the mean of the original data. Because of this property, PCA is used for applications such as dimensionality reduction and feature extraction.

We now reformulate PCA as a generative model. Specifically, PCA is the maximum likelihood solution of a latent Gaussian variable model. This formulation is called probabilistic PCA (PPCA) [13]. We first introduce a latent variable $z$ corresponding to the principal component subspace $W$. Then, the probability distribution of $X$ conditioned on the latent variable is,

$$p(X \mid z) = \mathcal{N}(X \mid Wz + \mu, \sigma^2 I)$$

where $p(z) = \mathcal{N}(z \mid 0, I)$. Note that we can assume $z$ to follow a standard Gaussian distribution rather than a general Gaussian distribution without loss of generality, because a general mean and covariance of $z$ can be absorbed into $\mu$ and $W$. By viewing this probabilistic model from a generative viewpoint, that is, by assuming the data is generated by first sampling the value of the latent variable $z$ and then mapping it to the actual data space $X$, we get the expression,

$$X = Wz + \mu + \epsilon$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ is Gaussian noise. Note that PPCA gives the same principal components as PCA in the limit $\sigma^2 \to 0$. The similarity between PPCA and PCA becomes apparent if this expression is read as a linear combination of the column vectors of $W$, which play the role of the principal components.

Figure 1: The two left images show how the latent variable $z$ is sampled from a Gaussian distribution and mapped to the actual data space $X$. The rightmost image is the resulting probability distribution of the observable variable $X$. Bishop, Pattern Recognition and Machine Learning, Springer, 2009, p. 570.

The likelihood of the given data $X$ and the latent variables $Z$ is, by the definition of the Gaussian distribution,

$$p(X, Z \mid \mu, W, \sigma^2) = \prod_{n=1}^{N} p(X_n \mid z_n)\, p(z_n) = \prod_{n=1}^{N} (2\pi\sigma^2)^{-D/2} \exp\left(-\frac{\|X_n - W z_n - \mu\|^2}{2\sigma^2}\right) (2\pi)^{-M/2} \exp\left(-\frac{\|z_n\|^2}{2}\right).$$
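The generative expression $X = Wz + \mu + \epsilon$ can be checked numerically. The following is a minimal NumPy sketch, not the implementation used in this work: it draws samples from the PPCA model and compares their empirical covariance with $WW^T + \sigma^2 I$, the marginal covariance that this linear-Gaussian model implies. The sizes, the random seed, and the names `sample_ppca`, `W`, `mu`, and `sigma` are illustrative assumptions.

```python
import numpy as np

def sample_ppca(W, mu, sigma, n_samples, rng):
    """Draw samples X = W z + mu + eps with z ~ N(0, I), eps ~ N(0, sigma^2 I)."""
    D, M = W.shape
    Z = rng.standard_normal((n_samples, M))            # latent variables z_n
    eps = sigma * rng.standard_normal((n_samples, D))  # isotropic Gaussian noise
    return Z @ W.T + mu + eps

rng = np.random.default_rng(0)
D, M, sigma = 5, 2, 0.5          # illustrative sizes, not the MNIST setting
W = rng.standard_normal((D, M))  # hypothetical loading matrix
mu = rng.standard_normal(D)      # hypothetical data mean

X = sample_ppca(W, mu, sigma, n_samples=200_000, rng=rng)
empirical_cov = np.cov(X, rowvar=False)
model_cov = W @ W.T + sigma**2 * np.eye(D)

# Largest entry-wise gap between the sampled and the implied covariance.
max_abs_diff = np.max(np.abs(empirical_cov - model_cov))
print(max_abs_diff)
```

The gap shrinks as the number of samples grows, which reflects that the latent variable only enters the observable distribution through the low-rank term $WW^T$.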

2.3 Expectation-Maximization algorithm

The likelihood of a model that includes a latent variable can be maximized by the EM algorithm. Given data $X$ and latent variables $z$, the EM algorithm finds the maximum likelihood estimates of the model parameters $\theta$. To compute the log-likelihood $L(\theta)$, we need to sum out $z$:

$$L(\theta) = \log \sum_z p_\theta(X, z)$$

This expression is intractable to compute because it requires enumerating all values of $z$. The EM algorithm instead maximizes a lower bound of the log-likelihood. We can derive the lower bound as,

$$\begin{aligned}
L(\theta) = \log \sum_z p_\theta(X, z) &= \log \sum_z q(z) \frac{p_\theta(X, z)}{q(z)} \\
&\ge \sum_z q(z) \log \frac{p_\theta(X, z)}{q(z)} \\
&= \sum_z q(z) \log p_\theta(X, z) + H(z) \\
&= F(q, \theta)
\end{aligned}$$

Note that we introduced a probability distribution $q(z)$ in the first line of the derivation by multiplying and dividing by the same quantity, and therefore its actual form can be chosen arbitrarily. Then, we used Jensen's inequality to obtain the inequality in the second line.

Jensen's inequality: if $p_1, \dots, p_n$ are positive numbers which sum to 1 and $f$ is a real continuous function that is concave, then

$$f\left(\sum_{i=1}^{n} p_i x_i\right) \ge \sum_{i=1}^{n} p_i f(x_i).$$

$H(z)$ denotes the Shannon entropy $-\sum_z q(z) \log q(z)$; the important point is that it is independent of $\theta$. We have now reduced the likelihood maximization problem to a lower bound maximization problem. However, before we maximize it, we need to choose $q(z)$ properly so that $F(q, \theta)$ becomes a tight bound of $L(\theta)$. Their difference is,

$$\begin{aligned}
L(\theta) - F(q, \theta) &= \log \sum_z p_\theta(X, z) - \sum_z q(z) \log \frac{p_\theta(X, z)}{q(z)} \\
&= \log p_\theta(X) - \sum_z q(z) \log \frac{p_\theta(z \mid X)\, p_\theta(X)}{q(z)} \\
&= -\sum_z q(z) \log \frac{p_\theta(z \mid X)}{q(z)} \\
&= D_{KL}(q(z) \,\|\, p_\theta(z \mid X))
\end{aligned}$$

where $D_{KL}(p \,\|\, q)$ denotes the Kullback-Leibler divergence, which is often used as a measure of the difference between two probability distributions, although it is not symmetric. It is non-negative and becomes zero iff $p = q$.

Kullback-Leibler divergence: for distributions $P$ and $Q$ of a continuous random variable, the Kullback-Leibler divergence is defined as

$$D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx,$$

with the integral replaced by a sum for a discrete random variable.

Therefore, the lower bound $F(q, \theta)$ is tight when $q(z) = p_\theta(z \mid X)$.
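The relation between the log-likelihood, the lower bound, and the KL divergence can be verified on a toy model with a discrete latent variable. The sketch below is only an illustration under made-up numbers: the array `joint` holds hypothetical values of $p_\theta(X, z)$ for one fixed observation and three latent states, and the helper names `lower_bound` and `kl` are assumptions introduced here.

```python
import numpy as np

def lower_bound(q, joint):
    """F(q, theta) = sum_z q(z) log( p(X, z) / q(z) )."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(p, q):
    """D_KL(p || q) = sum_z p(z) log( p(z) / q(z) )."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical joint values p_theta(X, z) for one fixed X and z in {0, 1, 2}.
joint = np.array([0.10, 0.25, 0.05])
log_lik = float(np.log(joint.sum()))      # L(theta) = log sum_z p(X, z)
posterior = joint / joint.sum()           # p_theta(z | X)

q_arbitrary = np.array([0.6, 0.3, 0.1])   # any valid distribution over z

# The gap between L(theta) and the bound equals KL(q || posterior) ...
gap = log_lik - lower_bound(q_arbitrary, joint)
# ... and the bound touches L(theta) when q is the exact posterior.
tight = lower_bound(posterior, joint)

print(gap, kl(q_arbitrary, posterior), tight, log_lik)
```

Any other choice of `q_arbitrary` with positive entries summing to one gives a non-negative gap, which is the sense in which $F(q, \theta)$ is a lower bound.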
In the EM algorithm, we set the lower bound in the expectation step, then maximize the lower bound in the maximization step. Formally,

Expectation step: $q(z) = p_\theta(z \mid X)$

Maximization step: $\arg\max_\theta \sum_z q(z) \log p_\theta(X, z)$
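As a concrete instance of these two steps, the following NumPy sketch applies them to PPCA using the point-estimate updates given in the remainder of this section. It is a sketch under stated assumptions rather than the implementation used in the experiments: the function name `ppca_em_point_estimate`, the random initialization, the iteration count, and the synthetic data are all illustrative. Note that the full EM of Tipping and Bishop [13] also carries the posterior covariance in $E[z_n z_n^T]$; that term is dropped here to match the point-estimate form.

```python
import numpy as np

def ppca_em_point_estimate(X, M, n_iter=50, seed=0):
    """Point-estimate EM for PPCA on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)                 # centered data
    W = rng.standard_normal((D, M))         # random initialization (assumption)
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: E[z_n] = (W^T W + sigma^2 I)^{-1} W^T x_n for every n.
        A = W.T @ W + sigma2 * np.eye(M)
        Z = np.linalg.solve(A, W.T @ Xc.T).T            # shape (N, M)
        # M-step: W_new = [sum x_n E[z_n]^T] [sum E[z_n] E[z_n]^T]^{-1}.
        W = np.linalg.solve(Z.T @ Z, Z.T @ Xc).T        # shape (D, M)
        # sigma^2_new: the three terms of the update collapse to the
        # mean squared residual ||x_n - W_new E[z_n]||^2 / (N D).
        sigma2 = np.sum((Xc - Z @ W.T) ** 2) / (N * D)
    return W, sigma2

# Synthetic low-rank data plus noise (not MNIST).
rng = np.random.default_rng(1)
true_W = rng.standard_normal((5, 2))
X = rng.standard_normal((500, 2)) @ true_W.T + 0.1 * rng.standard_normal((500, 5))

W, sigma2 = ppca_em_point_estimate(X, M=2)
total_var = np.mean((X - X.mean(axis=0)) ** 2)
print(sigma2, total_var)
```

Comparing `sigma2` with `total_var` shows how much of the per-pixel variance is left unexplained by the two-dimensional latent space.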

The computation of the EM algorithm in the context of PPCA is described below. Note that the posterior expectation $E[z_n]$ is used as a point estimate instead of the full posterior distribution. Let $N$ be the number of samples and $D$ the dimension of the data. Also, let $\hat{X}$ be the centered data, $X - \bar{X}$. Then, the equations for the iterative updates are the following. The derivation is detailed by Tipping and Bishop [13].

Expectation step:

$$E[z_n] = (W^T W + \sigma^2 I)^{-1} W^T \hat{X}_n$$

Maximization step:

$$W_{new} = \left[\sum_{n=1}^{N} \hat{X}_n E[z_n]^T\right] \left[\sum_{n=1}^{N} E[z_n] E[z_n]^T\right]^{-1}$$

$$\sigma^2_{new} = \frac{1}{ND} \sum_{n=1}^{N} \left\{ \|\hat{X}_n\|^2 - 2 E[z_n]^T W_{new}^T \hat{X}_n + \mathrm{Tr}\left(E[z_n] E[z_n]^T W_{new}^T W_{new}\right) \right\}$$

In the formulation of PPCA, we assumed that the dimensionality $M$ of the principal subspace is given a priori, and $M = 2$ is often used for visualization purposes. However, the choice of $M$ is a critical problem when PPCA is used for dimensionality reduction. We could use cross-validation to optimize $M$, but it is computationally expensive. Instead, we can take a Bayesian approach by placing a prior over $W$ governed by a hyperparameter $\alpha$. This approach is called Bayesian PCA (BPCA) [2]. As Bayesian approaches often do, BPCA requires evaluating the marginal distribution over $\mu$, $W$, and $\sigma^2$, which is intractable. To this end, we can use a variational framework [3], which is a natural extension of the EM algorithm, to approximate the marginalization. BPCA has several advantages including robustness to overfitting and automatic dimensionality selection; however, we fix the dimensionality of the principal components in this work in order to compare the experimental results with VAEs. We discuss the variational framework in the context of VAEs in the next section.

3 Variational frameworks

Variational inference is a framework for approximating the true posterior $P$ with the best $Q$ from a family of distributions $\mathcal{Q}$ by minimizing a measure of the difference between $P$ and $Q$ as an optimization problem.
The most common choice for the difference measure is the Kullback-Leibler divergence, as we have seen in the EM algorithm, but there are variations depending on how the true distribution is approximated and how the difference is measured.

3.1 Variational EM algorithm

Remember that in the EM algorithm for PPCA, we used a point estimate of $z$ rather than the distribution $q(z)$ because of the computational difficulty. However, this tends to overfit the training data; also, the EM algorithm cannot be used when there are multiple latent variables. Hence, we extend the EM algorithm by approximating the exact choice $q(z) = p_\theta(z \mid X)$, leading to the variational EM algorithm. The common approximation is $q(z) \approx \prod_i q_{\phi_i}(z_i)$, where each $q_{\phi_i}$ is a conjugate distribution. Then, the approximation is improved by minimizing the KL divergence between $q_\phi(z)$ and $p_\theta(z \mid X)$ with respect to the parameter $\phi$, which is equivalent to maximizing the lower bound of the likelihood. It is formulated as,

Expectation step: $\phi = \arg\max_\phi F(\phi, \theta)$

Maximization step: $\theta = \arg\max_\theta F(\phi, \theta)$

where

$$F(\phi, \theta) = \sum_z q_\phi(z) \log \frac{p_\theta(X, z)}{q_\phi(z)}.$$

The updating steps of the variational EM algorithm are now tractable and have closed-form solutions because of the conjugacy assumption on $q_\phi$.

Figure 2: Karpathy et al, Generative Models, 2016.

3.2 Variational auto-encoder

The variational auto-encoder is an extension of the variational EM algorithm, but it approximates the distribution $q(z)$ by a multi-layer neural network instead of a product of conjugate distributions. The expectation step and the maximization step are performed jointly by backpropagation through the network. The advantage of this model is that it does not assume a simple conjugate distribution; instead, the probability distribution is learned conditioned on the data. As a result, the approximation accuracy may improve. Because of this dependency on the data, the lower bound is rewritten as,

$$\begin{aligned}
F(\phi, \theta) &= \int q_\phi(z \mid X) \log \frac{p_\theta(X, z)}{q_\phi(z \mid X)}\, dz \\
&= \int q_\phi(z \mid X) \log \frac{p_\theta(X \mid z)\, p_\theta(z)}{q_\phi(z \mid X)}\, dz \\
&= -D_{KL}(q_\phi(z \mid X) \,\|\, p_\theta(z)) + \int q_\phi(z \mid X) \log p_\theta(X \mid z)\, dz.
\end{aligned}$$

The entropy of $q_\phi$ does not appear as a separate term here because it is already contained in the KL divergence. Therefore, the objective of VAEs is to estimate the parameter $\phi$ of the approximate posterior $q_\phi(z \mid X)$ so that it matches $p_\theta(z)$ while the expected reconstruction log-likelihood stays high. The parameter is estimated with a multi-layer neural network encoder by backpropagation. Then, $p_\theta(X \mid z)$ is computed with a decoder, which is also a multi-layer neural network. In other words, the function $f(\cdot)$ for VAEs that maps the latent variable $z$ to the data space $X$ is a multi-layer neural network, whereas PPCA maps the latent variable by a simple linear transformation.

$$P(X \mid \theta) = \int \mathcal{N}(X \mid f(z, \theta), \sigma^2 I)\, p(z)\, dz$$

Just like PPCA, we assume the probability distribution of $z$ is the standard Gaussian $\mathcal{N}(z \mid 0, I)$. In spite of such a simple assumption, the model can represent fairly complex data because of the mapping by the neural network.

In order to train the encoder by backpropagating the reconstruction error, we need a gradient estimator for $\phi$.
To this end, VAEs use a reparametrization trick in which the random variable $\hat{z}$ drawn from the approximate posterior $q_\phi(z \mid X)$ is replaced with a differentiable function of a noise variable $\epsilon$,

$$\hat{z} = g_\phi(\epsilon, X)$$

where $\epsilon$ is sampled from a fixed distribution. In the Gaussian case, $\hat{z} = \mu + \sigma\epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$ and $\mu, \sigma = \mathrm{NN}(X)$. NN denotes the multi-layer neural network. In this way we connect the reconstruction error to the encoder through the decoder.
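The effect of the reparametrization can be illustrated without a neural network. In the minimal NumPy sketch below, fixed scalars stand in for the encoder outputs $\mu, \sigma = \mathrm{NN}(X)$, and a toy objective $E[\hat{z}^2]$ replaces the reconstruction error so that the exact gradients are known; both choices, and the name `reparameterize`, are assumptions made for illustration only.

```python
import numpy as np

def reparameterize(mu, sigma, eps):
    """z_hat = g_phi(eps, X) = mu + sigma * eps, with mu, sigma from the encoder."""
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                       # stand-ins for encoder outputs NN(X)
eps = rng.standard_normal(1_000_000)       # eps ~ N(0, 1), independent of phi
z_hat = reparameterize(mu, sigma, eps)

# Toy objective E[z^2] = mu^2 + sigma^2, so the exact gradients are
# d/dmu = 2 * mu and d/dsigma = 2 * sigma. Because z_hat is a differentiable
# function of (mu, sigma), the chain rule gives Monte Carlo estimates:
grad_mu = np.mean(2.0 * z_hat * 1.0)       # uses dz/dmu = 1
grad_sigma = np.mean(2.0 * z_hat * eps)    # uses dz/dsigma = eps

print(grad_mu, 2 * mu, grad_sigma, 2 * sigma)
```

Since the randomness lives entirely in `eps`, the same pattern lets an automatic differentiation framework pass gradients of the reconstruction error back to the encoder parameters.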

In summary, we first compute the parameters $\mu, \sigma$ of the approximate posterior $q_\phi(z \mid X)$ with an encoder neural network, and compute the KL divergence and its gradient with respect to $\phi$. Then, we sample $z$ using the reparametrization trick to compute the Monte Carlo estimate of the expected reconstruction error. We reconstruct the data with a decoder that takes $z$ as its input, compute the cross-entropy error and its gradient with respect to $\theta$ and $\phi$, and finally update $\theta$ and $\phi$.

4 Experiments

4.1 Method

We implemented PPCA and VAEs in Python using NumPy. TensorFlow is used to implement the multi-layer neural networks as part of the VAEs. We tested PPCA and VAEs on an image completion task using the MNIST database, which contains 70000 handwritten digits. We first trained the generative models on the MNIST database using 60000 of the 70000 images as the training dataset. Specifically, in the case of PPCA, we fit a 784 × 100 matrix $W$ to the training data by the EM algorithm. Then, we reconstructed images by choosing the latent variable by gradient descent so that it minimizes the residual sum of squares with the given half of each of 16 images that were picked randomly from the testing dataset. In the case of VAEs, we trained the encoder by giving it the full MNIST image plus the bottom half of the image, then reconstructed the top half by conditioning on the bottom half of the image. Since an MNIST image has 784 pixels, the input to the encoder is a $(3/2) \times 784 = 1176$-dimensional vector. We set the number of nodes in the hidden layer to 100. The reconstructed image is evaluated against the ground truth quantitatively by the pixel-wise residual sum of squares between the reconstruction and the ground truth, and qualitatively by visual inspection. Some of the results are shown below.

4.2 Results

The graph shows the average of the pixel-wise residual sum of squares between the ground truth image and the reconstructed image at each epoch.
Precisely, the error of a reconstruction is

$$\sum_i \sum_j (X_{i,j} - \hat{X}_{i,j})^2$$

where $X_{i,j}$ is the pixel value at $(i, j)$, normalized to $[0, 1]$. In the diagram, the average loss of 16 reconstructions is reported. An epoch for PPCA consists of updates through all the training data, while an epoch of variational inference consists of 10000 steps of backpropagation. The graph indicates that PPCA might have overfitted the training data after epoch 1, while the error of VAEs does not increase after training saturates at epoch 1.

Note that the pixel-wise residual sum of squares is not a great metric of the error. Suppose we had a reconstruction which is perfect except that it is shifted to the right by just one pixel. The reconstruction would look good to a human, but the residual sum of squares becomes huge. To remedy this problem, we visualized the reconstructions as below.
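The shift sensitivity of this metric can be reproduced on a synthetic image. The sketch below is an illustration only and uses no MNIST data: the 28 × 28 image with a single vertical stroke and the helper name `rss` are assumptions introduced here.

```python
import numpy as np

def rss(x, x_hat):
    """Pixel-wise residual sum of squares between two images in [0, 1]."""
    return float(np.sum((x - x_hat) ** 2))

# Synthetic 28x28 "digit": a vertical stroke one pixel wide and 20 pixels tall.
image = np.zeros((28, 28))
image[4:24, 14] = 1.0

shifted = np.roll(image, shift=1, axis=1)   # same stroke, one pixel to the right
blank = np.zeros_like(image)                # reconstruction with no stroke at all

rss_shifted = rss(image, shifted)
rss_blank = rss(image, blank)
print(rss_shifted, rss_blank)               # prints 40.0 20.0
```

In this example the visually perfect but shifted reconstruction scores twice as badly as an empty image, which is why the quantitative metric is complemented by visual inspection.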

(a) PPCA completion. (b) VAEs completion.

Figure 4: Each block consists of the ground truth in the top row, the occluded input in the second row, and the reconstructed image in the third row. The left column shows the results by PPCA and the right column shows the results by VAEs. The first row of blocks is the 0th epoch, i.e. initialization, followed by the 1st, 3rd, and 9th epochs.

5 Discussion

Both PPCA and VAEs are latent Gaussian variable models. However, the samples generated by VAEs look more realistic than the ones generated by PPCA. The difference may come from mainly two sources: the representational power of the model and the robustness of the inference. First, the mapping function used in VAEs to map latent variables from the latent space to the actual data space is a multi-layer neural network, which is a non-linear function. In contrast, the mapping function used in PPCA only applies a simple linear transformation to the latent variable. Second, the EM algorithm used for PPCA makes a point estimate of the latent variable, which often leads to overfitting the training data. On the other hand, the variational inference used in VAEs estimates the posterior distribution of the latent variable. Although the latter is an approximation method, it is more robust to overfitting than the EM algorithm.

6 Conclusion

Deep generative models including VAEs attract huge attention these days, but the same task can be done by very simple models like PCA, as we have shown with the example of completing occluded handwritten digits. However, the deep models and corresponding inference algorithms certainly have advantages such as representational power and robustness to overfitting. Rather than using deep models blindly, it is important to evaluate the complexity of the given task and validate the necessity of using a model as complex as VAEs.
Also, the theoretical foundation of deep models should be studied further so that we can choose the model to use systematically, without relying too much on empirical results.

References

[1] Carl Doersch. Tutorial on Variational Autoencoders, 2016; arXiv:1606.05908.
[2] Christopher M. Bishop. Bayesian PCA, 1998; Advances in Neural Information Processing Systems 11.
[3] Christopher M. Bishop. Variational Principal Components, 1999; In Proceedings Ninth International Conference on Artificial Neural Networks, ICANN 99, IEE, volume 1, pages 509-514.
[4] David M. Blei, Alp Kucukelbir and Jon D. McAuliffe. Variational Inference: A Review for Statisticians, 2016; arXiv:1601.00670.
[5] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, 2013; arXiv:1312.6114.
[6] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling. Semi-supervised Learning with Deep Generative Models, 2014; Advances in Neural Information Processing Systems 27.
[7] Hastings, W. Monte Carlo sampling methods using Markov chains and their applications, 1970; Biometrika, 57:97-109.
[8] Hotelling, H. Analysis of a Complex of Statistical Variables Into Principal Components, 1933; Journal of Educational Psychology, 24, 417-441.
[9] Jordan, M. I., Ghahramani, Z., Jaakkola, T., and Saul, L. Introduction to variational methods for graphical models, 1999; Machine Learning, 37:183-233.
[10] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective, 2013; MIT. Print.
[11] Kihyuk Sohn, Xinchen Yan, Honglak Lee. Learning Structured Output Representation using Deep Conditional Generative Models, 2015; Advances in Neural Information Processing Systems 28.
[12] Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. Equation of state calculations by fast computing machines, 1953; Journal of Chemical Physics, 21:1087-1092.
[13] Michael E. Tipping and Christopher M. Bishop. Probabilistic Principal Component Analysis, 1999; Journal of the Royal Statistical Society. Series B (Statistical Methodology), Vol. 61, No. 3, 611-622.
[14] Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference, 2008; Foundations and Trends in Machine Learning, 1(1-2):1-305.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition, 1998; Proceedings of the IEEE, 86(11):2278-2324, November.