arXiv: v3 [cs.LG] 14 Oct 2015


Learning Deep Generative Models with Doubly Stochastic MCMC

Chao Du, Jun Zhu, Bo Zhang
Dept. of Comp. Sci. & Tech., State Key Lab of Intell. Tech. & Sys., TNList Lab, Center for Bio-Inspired Computing Research, Tsinghua University, Beijing, China

Abstract

We present doubly stochastic gradient MCMC, a simple and generic method for (approximate) Bayesian inference of deep generative models in the collapsed continuous parameter space. At each MCMC sampling step, the algorithm randomly draws a minibatch of data samples to estimate the gradient of the log-posterior and further estimates the intractable expectation over latent variables via a Gibbs sampler or a neural adaptive importance sampler. We demonstrate the effectiveness on learning deep sigmoid belief networks (DSBNs). Compared to the state-of-the-art methods using Gibbs sampling with data augmentation, our algorithm is much more efficient and manages to learn DSBNs on large datasets.

1 Introduction

Learning deep models that consist of multi-layered representations has proven effective, with state-of-the-art performance in many tasks [5, 13, 28]. To fit such deep models, a small dataset is often insufficient. It is therefore important to address the computational challenge of learning such deep models on large-scale datasets, which often lie in high-dimensional spaces as well. Deep generative models (DGMs) [13, 28] represent an important family of deep models that can answer a wide range of queries by performing probabilistic inference, such as inferring the missing values of input data, which is beyond the scope of recognition networks such as deep neural networks. However, probabilistic inference with DGMs is challenging, especially when a Bayesian framework is employed. Such a framework is desirable because it offers various possibilities, such as robustness to overfitting, sparse Bayesian inference [11], and nonparametric inference [1] to learn the graph structure.
For Bayesian methods in general, posterior inference often involves intractable integrals because of several potential factors, such as the space being extremely high-dimensional and the Bayesian model being non-conjugate. To address these challenges, approximate methods have to be adopted, including variational [14, 30] and Markov chain Monte Carlo (MCMC) methods [27]. Although much progress has been made on stochastic variational methods for DGMs [16, 25], under some mean-field or parameterization assumptions, little work has been done on extending MCMC methods, which are often more accurate, to learn DGMs in a Bayesian setting. A few exceptions exist. Gan et al. [11] present a Gibbs sampler for deep sigmoid belief networks with a sparsity-inducing prior via data augmentation, and Adams et al. [1] present a Metropolis-Hastings method for the cascading Indian buffet process.

In this paper, we present a simple and generic method, named doubly stochastic gradient MCMC, to improve the efficiency of performing Bayesian inference on DGMs. By drawing samples in the collapsed parameter space, our method extends the recent work on stochastic gradient MCMC [32, 2, 24] to deal with the challenging task of posterior inference with DGMs. Besides the stochasticity of randomly drawing a minibatch of samples in stochastic approximation, our algorithm introduces an extra dimension of stochasticity to estimate the intractable gradients by randomly drawing the latent variables in DGMs. The sampling can be done via a simple Gibbs sampler. We further develop a neural adaptive importance sampler, where the adaptive proposal is parameterized by a recognition network and the parameters are optimized by descending the inclusive KL-divergence. By combining the two types of stochasticity, we construct an unbiased estimate of the gradient in the continuous parameter space.
Then, a stochastic gradient MCMC method is applied, with a guarantee of (approximate) convergence to the target posterior when the learning rates follow a proper annealing scheme. Our method can be widely applied to DGMs with either discrete or continuous latent variables. As an example, we demonstrate the efficacy on learning deep sigmoid belief networks on large-scale datasets, achieving significant speedup compared to the very recent Gibbs sampler with data augmentation [11].

Finally, note that, independently of our work, Gan et al. [10] also adopted a Monte Carlo estimate via Gibbs sampling of the intractable gradients under a stochastic MCMC method, specifically for topic models. Besides offering a general perspective that is applicable to various types of DGMs, we propose a neural adaptive importance sampler, which is more efficient than Gibbs sampling and in most cases leads to better estimates.

2 Doubly Stochastic Gradient MCMC for Deep Generative Models

We now present the doubly stochastic gradient MCMC for deep generative models.

2.1 Variational MLE for DGMs

Let $X = \{x_n\}_{n=1}^N$ be a given dataset with $N$ i.i.d. samples. A deep generative model (DGM) assumes that each $x_n \in \mathbb{R}^J$ is generated from a vector of latent variables $z_n \in \mathbb{R}^H$, which itself follows some distribution $p(z \mid \alpha)$. Let $p(x \mid z, \beta)$ be the likelihood model. The joint probability of a DGM is as follows:

$$p(X, Z \mid \theta) = \prod_{n=1}^{N} p(z_n \mid \alpha)\, p(x_n \mid z_n, \beta), \tag{1}$$

where $\theta := (\alpha, \beta)$. Depending on the structure of $z$, e.g., directed or undirected graphs, various DGMs have been developed, such as deep belief networks [13], deep Boltzmann machines [28] and deep sigmoid belief networks [20], which are our focus in this paper.

Learning DGMs is often very challenging due to the intractability of posterior inference. The state-of-the-art methods resort to stochastic variational methods under the maximum likelihood estimation (MLE) framework, $\hat{\theta} = \operatorname{argmax}_\theta \log p(X \mid \theta)$. Specifically, let $q_\phi(Z)$ be some variational distribution to approximate the true posterior $p(Z \mid X, \theta)$. A variational bound of the log-likelihood $\log p(X \mid \theta)$ can be derived. Then, we optimize the variational bound with respect to the variational parameters.
However, the variational bound is often intractable to compute analytically for DGMs. To address this challenge, one possible method is to bound the intractable parts with tractable ones by introducing extra variational parameters [30]; but these methods increase the gap between the bound being optimized and the log-likelihood, potentially leading to poorer estimates. Another way is to adopt the recent progress [16, 25, 20] on hybrid Monte Carlo and variational methods, which approximate the intractable expectations and their gradients over the parameters $(\theta, \phi)$ via some unbiased Monte Carlo estimates. Furthermore, to handle large-scale datasets, stochastic optimization [26, 7] of the variational objective can be used with a suitable learning rate annealing scheme. Note that variance reduction is a key part of these methods in order to have fast and stable convergence. The reweighted wake-sleep (RWS) [6] represents another type of method that directly estimates the log-likelihood (as well as its gradient) via importance sampling, where the proposal distribution can be characterized by a recognition model (or inference network). We draw inspiration from these variational methods to build our MCMC samplers, as explained below.

2.2 Doubly Stochastic Gradient MCMC

We consider the Bayesian setting to infer the posterior distribution $p(\theta, Z \mid X) \propto p_0(\theta)\, p(Z \mid \theta)\, p(X \mid Z, \theta)$ or its marginal distribution $p(\theta \mid X)$, by assuming some prior $p_0(\theta)$. Except for a handful of special examples, the posterior distribution is intractable to infer. Though variational methods can be developed as in [16, 25, 20], under some mean-field or parameterization assumptions, they often require non-trivial model-specific derivations and may lead to inaccurate approximations when the assumptions are not properly made. Here, we consider MCMC methods, which are more generally applicable and can asymptotically approach the target posterior. A straightforward application of MCMC methods can be Gibbs sampling or stochastic gradient MCMC [32, 2, 24].
However, a Gibbs sampler can suffer from random-walk behavior in high-dimensional spaces. Furthermore, a Gibbs sampler would need to process all data at each iteration, which is prohibitive when dealing with large-scale datasets. The stochastic gradient MCMC methods can lead to significant speedup by exploiting statistical redundancy in large datasets; but they require that the sample space is continuous, which is not true for many DGMs, such as deep sigmoid belief networks that have discrete latent variables.

Below, we present a doubly stochastic gradient MCMC with general applicability. We make the mildest assumption that the parameter space is continuous and the log joint distribution $\log p(x, z \mid \theta)$ is differentiable with respect to the model parameters $\theta$ almost everywhere except on a zero-mass set. Such an assumption is true for almost all existing DGMs. Then, our method draws samples in a collapsed space that involves the model parameters $\theta$ only, by integrating out the latent variables $z$:

$$p(\theta \mid X) = \frac{1}{p(X)}\, p_0(\theta) \prod_{n=1}^{N} \int p(x_n, z_n \mid \theta)\, dz_n, \tag{2}$$

where for discrete variables the integral becomes a summation. Then the gradient of the log-posterior is

$$\nabla_\theta \log p(\theta \mid X) = \nabla_\theta \log p_0(\theta) + \sum_{n=1}^{N} \nabla_\theta \log p(x_n \mid \theta),$$

where the second term can be calculated as:

$$\nabla_\theta \log p(x \mid \theta) = \frac{1}{p(x \mid \theta)} \nabla_\theta \int p(x, z \mid \theta)\, dz = \int \frac{p(x, z \mid \theta)}{p(x \mid \theta)} \nabla_\theta \log p(x, z \mid \theta)\, dz = \mathbb{E}_{p(z \mid x, \theta)}\big[\nabla_\theta \log p(x, z \mid \theta)\big]. \tag{3}$$

With the above gradient, we can adopt a stochastic gradient MCMC method to draw samples of $\theta$. We consider the second-order stochastic gradient Hamiltonian Monte Carlo (SGHMC) [8]; but our method can be naturally extended to the first-order stochastic gradient Langevin dynamics [32] and stochastic gradient thermostats [9] algorithms. SGHMC defines a potential energy $U(\theta) = -\log p(\theta)$, where $p(\theta)$ is the target distribution, and builds a Markov chain using the gradient of $U(\theta)$. In our case, the target distribution is $p(\theta \mid X)$ and the gradient can be written as $\nabla_\theta U(\theta) = -\nabla_\theta \log p_0(\theta) - \sum_{n=1}^{N} \nabla_\theta \log p(x_n \mid \theta)$. Let $B_t$ be a random mini-batch sampled from the training data $D$. We get the SGHMC method:

$$\gamma_{t,i} = (1 - \xi)\, \gamma_{t,i-1} + \lambda_t \Big( \nabla_\theta \log p_0(\theta_{t,i-1}) + \frac{|D|}{|B_t|} \sum_{n \in B_t} \nabla_\theta \log p(x_n \mid \theta_{t,i-1}) \Big) + \mathcal{N}(0, 2\xi\lambda_t),$$
$$\theta_{t,i} = \theta_{t,i-1} + \gamma_{t,i}, \tag{4}$$

where $\gamma_t$ are some augmented momentum variables, $\xi$ is the momentum decay and the subscript $i$ is the inner step of the discretization when simulating the Hamiltonian dynamics at iteration $t$. With a proper annealing scheme over the learning rate $\lambda_t$, the Hamiltonian dynamics will converge to the target posterior.

The remaining challenge is to compute the gradient, as the expectation in Eq. (3) is often intractable for DGMs. Here, we construct an unbiased estimate of the gradient by a set of samples $\{z^{(s)}\}_{s=1}^{L}$ from the posterior $p(z \mid x, \theta)$:

$$\nabla_\theta \log p(x \mid \theta) \approx \frac{1}{L} \sum_{s=1}^{L} \nabla_\theta \log p(x, z^{(s)} \mid \theta). \tag{5}$$

To draw the samples $z^{(s)}$, we consider two strategies. The first one is a Gibbs sampler, which alternately samples each dimension of $z$ given the others. A Gibbs sampler is simple and applicable to both discrete and continuous latent variables. The second strategy is neural adaptive importance sampling, which again applies to both discrete and continuous latent variables, as detailed below.
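The combination of Eq. (4) and Eq. (5) can be sketched on a toy problem. The following Python sketch is an illustration only, not the authors' implementation: it assumes a hypothetical scalar model with $z \sim \mathcal{N}(0,1)$ and $x \mid z \sim \mathcal{N}(\theta + z, 1)$, chosen because the latent posterior $p(z \mid x, \theta) = \mathcal{N}((x-\theta)/2, 1/2)$ can be sampled exactly, and a Gaussian prior on $\theta$. All function names and default values are assumptions made for the example.

```python
import math
import random


def grad_log_joint(x, z, theta):
    # d/dtheta of log p(x, z | theta) for the assumed toy model
    # z ~ N(0, 1), x | z ~ N(theta + z, 1)
    return x - theta - z


def sample_posterior_z(x, theta, rng):
    # Exact latent posterior of the toy model: N((x - theta) / 2, 1/2)
    return rng.gauss(0.5 * (x - theta), math.sqrt(0.5))


def doubly_stochastic_grad(theta, data, batch_size, num_z, prior_var, rng):
    # First source of randomness: a random mini-batch B_t
    batch = rng.sample(data, batch_size)
    total = 0.0
    for x in batch:
        # Second source of randomness: Monte Carlo over latent variables, Eq. (5)
        zs = [sample_posterior_z(x, theta, rng) for _ in range(num_z)]
        total += sum(grad_log_joint(x, z, theta) for z in zs) / num_z
    grad_prior = -theta / prior_var  # gradient of log N(theta; 0, prior_var)
    return grad_prior + len(data) / batch_size * total


def sghmc(data, steps, lr=1e-4, decay=0.1, batch_size=50, num_z=3,
          prior_var=100.0, seed=0):
    rng = random.Random(seed)
    theta, gamma = 0.0, 0.0
    samples = []
    for _ in range(steps):
        g = doubly_stochastic_grad(theta, data, batch_size, num_z, prior_var, rng)
        noise = rng.gauss(0.0, math.sqrt(2.0 * decay * lr))  # N(0, 2 * xi * lambda)
        gamma = (1.0 - decay) * gamma + lr * g + noise       # Eq. (4), momentum
        theta = theta + gamma                                # Eq. (4), position
        samples.append(theta)
    return samples


def exact_posterior_mean(data, prior_var=100.0):
    # Marginally x ~ N(theta, 2), so the posterior over theta is Gaussian
    precision = 1.0 / prior_var + len(data) / 2.0
    return (sum(data) / 2.0) / precision
```

In this sketch the learning rate is held constant for simplicity, whereas the text calls for an annealing scheme; averaging the samples after a burn-in period should land close to `exact_posterior_mean(data)`.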
Let $q(z \mid x; \phi)$ be a proposal distribution which satisfies $q(z \mid x; \phi) > 0$ wherever $p(z \mid x, \theta) > 0$. We then have

$$\nabla_\theta \log p(x \mid \theta) = \mathbb{E}_{q(z \mid x; \phi)}\Big[ \frac{p(z \mid x, \theta)}{q(z \mid x; \phi)} \nabla_\theta \log p(x, z \mid \theta) \Big],$$

from which an unbiased importance sampling estimator can be derived with the sample weights being $\frac{p(z \mid x, \theta)}{q(z \mid x; \phi)}$. However, computing $p(z \mid x, \theta)$ is often hard for most DGMs. By noticing that $p(z \mid x, \theta) \propto p(x, z \mid \theta)$ and that computing $p(x, z \mid \theta)$ is easy, we derive a self-normalized importance sampling estimate as follows:

$$\nabla_\theta \log p(x \mid \theta) \approx \frac{\sum_{s=1}^{L} \omega^{(s)}\, \nabla_\theta \log p(x, z^{(s)} \mid \theta)}{\sum_{s=1}^{L} \omega^{(s)}}, \tag{6}$$

where $\{z^{(s)}\}_{s=1}^{L}$ is a set of samples drawn from the proposal $q(z \mid x; \phi)$ and $\omega^{(s)} = \frac{p(x, z^{(s)} \mid \theta)}{q(z^{(s)} \mid x; \phi)}$ is the unnormalized likelihood ratio. This estimate is asymptotically consistent [23], and its slight bias decreases as more samples are drawn.

Neural Adaptive Proposals: To reduce the variance of the estimator in Eq. (6) and get accurate gradient estimates, $q(z \mid x; \phi)$ should be as close to $p(z \mid x, \theta)$ as possible. Here, we draw inspiration from variational methods and learn adaptive proposals by minimizing some criterion. Specifically, we build a recognition model (or inference network) to represent the proposal distribution $q(z \mid x; \phi)$ of latent variables, as in the variational methods [16, 6]. Such a recognition model takes $x$ as input and outputs $\{z^{(s)}\}$ as samples from $q(z \mid x; \phi)$. We optimize the quality of the proposal distribution by minimizing the inclusive KL-divergence between the target posterior distribution and the proposal, $\mathbb{E}_{p(z \mid x, \theta)}\big[\log \frac{p(z \mid x, \theta)}{q(z \mid x; \phi)}\big]$ [6], or equivalently maximizing the expected log-likelihood of the recognition model

$$\mathcal{J}(\phi; \theta, x) = \mathbb{E}_{p(z \mid x, \theta)}\big[\log q(z \mid x; \phi)\big]. \tag{7}$$

We choose this objective for the following reasons. If the target posterior belongs to the family of proposal distributions, maximizing $\mathcal{J}(\phi; \theta, x)$ leads to the optimal solution that is the target posterior; otherwise, minimizing the inclusive KL-divergence tends to find proposal distributions that have higher entropy than the target posterior.
Such a property is advantageous for importance sampling, as we require that $q(z \mid x; \phi) > 0$ wherever $p(z \mid x, \theta) > 0$. In contrast, the exclusive KL-divergence $\mathcal{L}(\phi; \theta, x) := \mathbb{E}_{q(z \mid x; \phi)}\big[\log \frac{q(z \mid x; \phi)}{p(z \mid x, \theta)}\big]$, as widely adopted in the variational methods [16, 25, 20], does not have such a property. It can happen that $q(z \mid x; \phi) = 0$ when $p(z \mid x, \theta) > 0$; it is therefore unsuitable for importance sampling.

The gradient of $\mathcal{J}(\phi; \theta, x)$ with respect to the parameters of the proposal distribution is

$$\nabla_\phi \mathcal{J}(\phi; \theta, x) = \mathbb{E}_{p(z \mid x, \theta)}\big[\nabla_\phi \log q(z \mid x; \phi)\big], \tag{8}$$

which can be estimated using importance sampling, similarly to Eq. (6):

$$\nabla_\phi \mathcal{J}(\phi; \theta, x) \approx \frac{\sum_{s=1}^{L} \omega^{(s)}\, \nabla_\phi \log q(z^{(s)} \mid x; \phi)}{\sum_{s=1}^{L} \omega^{(s)}}, \tag{9}$$

where $\{z^{(s)}\}_{s=1}^{L}$ are samples from the latest proposal distribution $q(z \mid x; \phi)$ and the weights are the same as in Eq. (6). To improve the efficiency, we adopt stochastic gradient descent methods to optimize the objective $\mathcal{J}(\phi; \theta, X) := \sum_{n=1}^{N} \mathcal{J}(\phi; \theta, x_n)$, with the gradient being estimated by a random mini-batch of data points at each iteration:

$$\nabla_\phi \mathcal{J}(\phi; \theta, X) \approx \frac{|D|}{|B_t|} \sum_{n \in B_t} \nabla_\phi \mathcal{J}(\phi; \theta, x_n), \tag{10}$$

where each term $\nabla_\phi \mathcal{J}(\phi; \theta, x_n)$ is further estimated by samples as in Eq. (9). This stochastic gradient descent method naturally fits into our stochastic gradient MCMC with little cost. In fact, in practice we can use the same set of samples $\{z^{(s)}\}_{s=1}^{L}$ to estimate the gradients in both Eq. (6) and Eq. (10) without losing accuracy, which saves the cost of drawing samples.

Algorithm 1 Doubly Stochastic Gradient MCMC with Neural Adaptive Proposals
Input: $X$
1: initialize $\theta$, $\phi$
2: for iteration $t = 1, 2, \ldots$ do
3:   sample a mini-batch $B_t \subset \{1, \ldots, N\}$
4:   for $i = 1$ to $m$ of SGHMC steps do
5:     sample $\{z_n^{(s)}\}$ from $q(z \mid x_n; \phi)$, $\forall n \in B_t$
6:     estimate $\nabla_\theta \log p(x_n \mid \theta)$ by Eq. (6), $\forall n \in B_t$
7:     update $\theta$ by Eq. (4)
8:     update $\phi$ with the gradient in Eq. (10)
Output: $\theta$, $\phi$

With the above gradient estimates, we get the overall algorithm with neural adaptive importance sampling, as outlined in Algorithm 1, where we adaptively update the proposal distribution by performing one step of recognition model update after each step of Hamiltonian dynamics simulation.

3 Sigmoid Belief Networks

We now apply the doubly stochastic gradient MCMC method to learn deep sigmoid belief networks. A sigmoid belief network (SBN) [22] is a directed graphical model that has a generative process for binary data. Let $x \in \{0, 1\}^J$ be a $J$-dimensional binary vector.
The binary data is modeled in terms of a vector of binary hidden variables $z \in \{0, 1\}^H$ by a factorized likelihood $p(x \mid z) = \prod_{j=1}^{J} p(x_j \mid z)$:

$$p(x_j = 1 \mid z) = \sigma(w_j^\top z + c_j), \tag{11}$$

where $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoid function and $W = [w_1, \ldots, w_J]^\top \in \mathbb{R}^{J \times H}$, $c \in \mathbb{R}^J$ are model parameters. The prior of hidden variables is also assumed to be a factorized distribution

$$p(z) = \prod_{h=1}^{H} p(z_h), \qquad p(z_h = 1) = \sigma(b_h), \tag{12}$$

where $b \in \mathbb{R}^H$ are the parameters. If we consider layers of an SBN as ordered sequences of observed variables or hidden variables, and assume directed links within layers, we obtain an autoregressive version of SBN (ARSBN), whose prior and likelihood models are defined as follows:

$$p(z_h = 1 \mid z_{<h}) = \sigma(u_{h,<h}^\top z_{<h} + b_h), \tag{13}$$
$$p(x_j = 1 \mid x_{<j}, z) = \sigma(s_{j,<j}^\top x_{<j} + w_j^\top z + c_j), \tag{14}$$

where $U = [u_1, \ldots, u_H]^\top$ and $S = [s_1, \ldots, s_J]^\top$ are lower triangular matrices, denoting the autoregressive weights within a layer.

SBNs have been widely used within deep architectures to learn latent representations [31, 21]. Recent work has mainly focused on developing scalable variational methods. The work [11] is an exception, which learns SBNs under a Bayesian setting and employs a sparsity-inducing prior (e.g., Student's t or the three-parameter beta normal prior [3]) on the model weights. Adopting a sparsity-inducing prior biases the model toward learning sparse latent representations. However, it also makes posterior inference challenging, as the prior is non-conjugate to the sigmoid likelihood in SBNs. To deal with the non-conjugacy, very recent data augmentation techniques were applied to derive a Gibbs sampler [11]. However, this method is computationally intensive when the dataset is large and/or the number of hidden units is large. Furthermore, sampling in an augmented space can potentially lead to slow mixing rates. We will compare extensively with this strong baseline. In this paper, we use the Student's t prior on all model parameters.
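To make Eq. (6) concrete for the one-layer SBN of Eqs. (11) and (12), the following Python sketch estimates the gradient of $\log p(x \mid \theta)$ with respect to the prior bias $b$ by self-normalized importance sampling, and compares it with the exact value obtained by enumerating all $2^H$ latent configurations. It is a minimal illustration under stated assumptions: the proposal is a fixed factorized Bernoulli with probabilities `q_probs` standing in for a recognition network output, and all function names are hypothetical.

```python
import itertools
import math
import random


def softplus(a):
    # Numerically stable log(1 + exp(a))
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def sigmoid(a):
    return math.exp(-softplus(-a))


def log_joint(x, z, W, c, b):
    # log p(z) from Eq. (12) plus log p(x | z) from Eq. (11)
    lp = sum(zh * bh - softplus(bh) for zh, bh in zip(z, b))
    for xj, wj, cj in zip(x, W, c):
        a = sum(w * zh for w, zh in zip(wj, z)) + cj
        lp += xj * a - softplus(a)
    return lp


def grad_b_log_joint(z, b):
    # d/db_h of log p(x, z | theta) = z_h - sigma(b_h)
    return [zh - sigmoid(bh) for zh, bh in zip(z, b)]


def weighted_grad(zs, log_w, b):
    # Normalize weights in log space, then average the per-sample gradients
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    grad = [0.0] * len(b)
    for z, wi in zip(zs, w):
        for h, g in enumerate(grad_b_log_joint(z, b)):
            grad[h] += wi * g / total
    return grad


def snis_grad_b(x, W, c, b, q_probs, num_samples, rng):
    # Eq. (6): samples from q, weights omega = p(x, z | theta) / q(z | x)
    zs, log_w = [], []
    for _ in range(num_samples):
        z = [1 if rng.random() < qh else 0 for qh in q_probs]
        log_q = sum(math.log(qh if zh else 1.0 - qh) for zh, qh in zip(z, q_probs))
        zs.append(z)
        log_w.append(log_joint(x, z, W, c, b) - log_q)
    return weighted_grad(zs, log_w, b)


def exact_grad_b(x, W, c, b):
    # Eq. (3) by brute force: expectation under the exact posterior p(z | x, theta)
    zs = list(itertools.product([0, 1], repeat=len(b)))
    log_p = [log_joint(x, z, W, c, b) for z in zs]
    return weighted_grad(zs, log_p, b)
```

Working with log-weights and subtracting their maximum before exponentiating is a design choice of this sketch to avoid underflow; it leaves the ratio in Eq. (6) unchanged.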
A deep sigmoid belief network (DSBN) is constructed by stacking SBN layers, with joint probability

$$p(x, z^{(1)}, \ldots, z^{(K)}) = p(z^{(K)}) \prod_{l=1}^{K-1} p(z^{(l)} \mid z^{(l+1)})\, p(x \mid z^{(1)}),$$

where the probability of the last (top) layer $p(z^{(K)})$ follows Eq. (12) and all other factor distributions are defined as in Eq. (11) with layer-specific parameters $\theta^{(l)} = \{W^{(l)}, c^{(l)}\}$. Let $J^{(l)}$ be the number of units at the $l$-th layer (with $x$ being the 0-th layer). The model parameters for the DSBN are $\theta = \{\theta^{(l)}, b\}_{l=1}^{K}$.
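The generative process implied by this joint probability is ancestral sampling from the top layer down to the data. A minimal Python sketch follows, assuming the parameters are stored as a list `layers` of `(W, c)` pairs ordered from $l = 1$ to $K$, with `W` of shape $J^{(l-1)} \times J^{(l)}$; this storage layout and the function names are assumptions of the example, not part of the paper.

```python
import math
import random


def sigmoid(a):
    # Stable sigmoid via log(1 + exp(-a))
    s = max(-a, 0.0) + math.log1p(math.exp(-abs(a)))
    return math.exp(-s)


def sample_bernoulli(p, rng):
    return 1 if rng.random() < p else 0


def sample_dsbn(layers, b, rng):
    """Draw (x, z^(1), ..., z^(K)) from a DSBN by ancestral sampling.

    layers[l - 1] = (W, c) holds the parameters of p(z^(l-1) | z^(l)),
    with z^(0) = x; b is the bias of the top-layer prior, Eq. (12).
    """
    z = [sample_bernoulli(sigmoid(bh), rng) for bh in b]  # top layer z^(K)
    sampled = [z]
    for W, c in reversed(layers):
        # Each unit below is Bernoulli with probability sigma(w_j^T z + c_j), Eq. (11)
        z = [sample_bernoulli(sigmoid(sum(w * zh for w, zh in zip(wj, z)) + cj), rng)
             for wj, cj in zip(W, c)]
        sampled.append(z)
    sampled.reverse()  # sampled[0] is x, sampled[l] is z^(l)
    return sampled
```

For example, with two hidden layers of sizes 3 and 2 and a 4-dimensional observation, `sample_dsbn` returns three binary lists of lengths 4, 3 and 2.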

It is clear that the DSBN satisfies our mildest assumption of differentiability in the collapsed parameter space; therefore our doubly stochastic gradient MCMC method applies, as detailed below for both the Gibbs sampler and the neural adaptive importance sampler used to estimate the gradient.

For the Gibbs sampler, the goal is to draw from $p(z^{(1)}, \ldots, z^{(K)} \mid x, \theta)$. Note that we define $z^{(0)} = x$ for convenience. It iteratively draws samples from the conditional distribution of a single unit by conditioning on all other variables:

$$p(z_h^{(l)} \mid z_{-h}^{(l)}, z^{(-l)}) \propto \exp\Big[ \big(W_{:h}^{(l)\top} z^{(l-1)} + b_h^{(l+1)}\big)\, z_h^{(l)} - \sum_{j=1}^{J^{(l-1)}} \log\big(1 + e^{w_j^{(l)\top} z^{(l)} + c_j^{(l)}}\big) \Big], \tag{15}$$

where $b_h^{(l)} = w_h^{(l)\top} z^{(l)} + c_h^{(l)}$, $b^{(K+1)} = b$, $W_{:h}^{(l)} = [w_{1h}^{(l)}, \ldots, w_{J^{(l-1)}h}^{(l)}]^\top$, $z_{-h}^{(l)}$ denotes all the variables in $z^{(l)}$ except $z_h^{(l)}$, and $z^{(-l)}$ denotes all the layers except $z^{(l)}$. A detailed derivation is included in the Appendix.

For the neural adaptive importance sampler, we construct a recognition model with the same structure as the DSBN but with all layers inverted to approximate the posterior of the latent layers. Specifically, we build

$$q(z^{(1)}, \ldots, z^{(K)}; x) = q(z^{(1)}; x) \prod_{l=1}^{K-1} q(z^{(l+1)} \mid z^{(l)}),$$

where $x$ is the input and all $z^{(l)}$ are sampled from this inverted stack of SBN layers. The parameters of the recognition model are $\phi = \{W^{(l)}, c^{(l)}\}_{l=1}^{K}$. Note that a more complicated recognition model can lead to better performance [6]. In this paper, we restrict ourselves to the simple recognition network.

4 Experiments

We now present experimental results of our doubly stochastic MCMC method on four public datasets: MNIST, Caltech-101 Silhouettes [19], OCR letters [17] and MNIST8M [18]. The MNIST and MNIST8M datasets are binarized by stochastically setting each pixel to 1 in proportion to its intensity, following [29]. Table 1 summarizes the statistics of the datasets. We use the doubly stochastic gradient Hamiltonian Monte Carlo with both a Gibbs sampler (DSGHMC-Gibbs) and an adaptive importance sampler (DSGHMC-AIS) in the experiments.
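As a small illustration of the Gibbs conditional in Eq. (15), the Python sketch below specializes it to a one-layer SBN ($K = 1$, so $z^{(0)} = x$ and $b^{(K+1)} = b$) and computes the log-odds of $z_h = 1$ given everything else. The helper names are hypothetical; `log_joint` is included only so that the log-odds from Eq. (15) can be cross-checked against the difference of log joint probabilities.

```python
import math
import random


def softplus(a):
    # Numerically stable log(1 + exp(a))
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def sigmoid(a):
    return math.exp(-softplus(-a))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def log_joint(x, z, W, c, b):
    # log p(x, z | theta) for the one-layer SBN of Eqs. (11)-(12)
    lp = sum(zh * bh - softplus(bh) for zh, bh in zip(z, b))
    for xj, wj, cj in zip(x, W, c):
        a = dot(wj, z) + cj
        lp += xj * a - softplus(a)
    return lp


def gibbs_logit(h, x, z, W, c, b):
    # Log-odds of z_h = 1 versus z_h = 0, read off Eq. (15) with K = 1
    z1 = list(z)
    z1[h] = 1
    z0 = list(z)
    z0[h] = 0
    drive = sum(W[j][h] * x[j] for j in range(len(x))) + b[h]  # W_{:h}^T x + b_h
    penalty = sum(softplus(dot(wj, z1) + cj) - softplus(dot(wj, z0) + cj)
                  for wj, cj in zip(W, c))
    return drive - penalty


def gibbs_sweep(x, z, W, c, b, rng):
    # Resample every hidden unit once, each conditioned on the current others
    z = list(z)
    for h in range(len(z)):
        p_one = sigmoid(gibbs_logit(h, x, z, W, c, b))
        z[h] = 1 if rng.random() < p_one else 0
    return z
```

Repeating `gibbs_sweep` a few times per data point (the experiments below mention a short burn-in) yields the samples $z^{(s)}$ that enter Eq. (5).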
4.1 Setups

We consider five SBN models with different architectures on all the datasets. The first three are the standard SBN models: one hidden layer with 200 hidden units, two hidden layers with 200 hidden units at each layer, and three hidden layers with 200 hidden units at each layer. The other two are the autoregressive SBN (ARSBN) models: one has a single hidden layer with 200 units and the other has two hidden layers with 200 units at each layer. We compare with the recent Gibbs sampling algorithm [11] using the authors' code, which was shown to be the best MCMC method for learning deep SBN models under a Bayesian setting.

Table 1: Descriptions of the four datasets.

             MNIST     Caltech-101   OCR      MNIST8M
dimension    ?         ?             ?        ?
train size   50,000    4,?           ?,152    8,100,000
valid size   10,000    2,?           ?,?      ?,100
test size    10,000    2,?           ?,?      ?,000

The priors of the model parameters are set as stated in Section 3. The model parameters are randomly initialized by sampling from a zero-mean normal distribution with small variance. The learning rate $\lambda_t$ is set among {0.1, 0.02, 0.01}, from which we report the experiment with the best performance on the validation set. Following the suggestion of [8], the momentum decay parameter $\xi$ is chosen from {0.1, 0.05}. The minibatch size $|B_t|$ is set to 100 and the number of samples $L$ during training is set to 5, which is sufficiently large for all the tested models. For Gibbs sampling, we follow the settings in [11] (with the initial 40 epochs as burn-in). For DSGHMC-Gibbs, we use deterministic termination criteria: the number of burn-in iterations is set to 140 for one-hidden-layer, 240 for two-hidden-layer and 340 for three-hidden-layer models. The number of burn-in steps of the Gibbs sampler for the posterior of latent variables is set to 5. For DSGHMC-AIS, due to the efficiency of sampling hidden layers, we use early stopping for better convergence. The parameters of the recognition model are updated using the Adam [15] optimizer with step sizes of {0.001, }. ARSBN depends on the order of the input variables.
In our experiments, the ordering is simply determined by the original order in the dataset. For ARSBN, we use recognition models with ARSBN layers. See the Appendix for more details about the experimental setting.

4.2 Results

We present both quantitative results on predicting test data and qualitative results on generating samples. We also analyze the time efficiency.

4.2.1 Predictive Performance

We first present predictive performance to examine the quality of posterior inference by our DSGHMC methods. To assess the quality, we report the average log-likelihood of the test data. For simplicity, one sample drawn from the posterior after convergence is used to evaluate the results. The log-likelihoods of models trained by DSGHMC-AIS are evaluated by importance sampling according to [6] with $L = 100{,}000$ samples. To evaluate the models trained by DSGHMC-Gibbs, which do not have a recognition model, we adopt the Annealed Importance Sampling [29] method, which provides an unbiased estimator of the likelihood. We observe that the two estimators always give similar results when applied to the same learned model (with a difference of less than 0.7). Details of the evaluation are provided in the Appendix.

Table 2: Predictive results on various datasets, where Dim denotes the number of latent variables in each layer, with the layer closest to the data on the left. Values in brackets are variational lower bounds; values without brackets are average test log-likelihoods. The results marked by * are taken from [11]. ( ) Discriminative fine-tuning is performed, which probably leads to better results.

Model                  Dim   MNIST   Caltech-101   OCR        MNIST8M
SBN (Gibbs)            ?     ?       ?             ?          ?
SBN (Gibbs)            ?     ?       ?             ?          ?
SBN (VB)*              200   (?)     (?)           (-48.20)
SBN (VB)*              ?     (?)     (?)           (-47.84)
SBN (DSGHMC-Gibbs)     ?     ?       ?             ?          ?
SBN (DSGHMC-Gibbs)     ?     ?       ?             ?          ?
SBN (DSGHMC-Gibbs)     ?     ?       ?             ?          ?
SBN (DSGHMC-AIS)       ?     ?       ?             ?          ?
SBN (DSGHMC-AIS)       ?     ?       ?             ?          ?
SBN (DSGHMC-AIS)       ?     ?       ?             ?          ?
ARSBN (DSGHMC-Gibbs)   ?     ?       ?             ?          ?
ARSBN (DSGHMC-Gibbs)   ?     ?       ?             ?          ?
ARSBN (DSGHMC-AIS)     ?     ?       ?             ?          ?
ARSBN (DSGHMC-AIS)     ?     ?       ?             ?          ?
ARSBN (VB)*            200   (?)     (-96.78)      (-37.97)
ARSBN (VB)*            ?     (?)     (-97.57)      (-38.56)

Table 2 shows the average test log-likelihood on each of the four datasets, in comparison with the Gibbs sampler via data augmentation as well as the variational Bayesian (VB) [11] method for SBN and ARSBN.¹ We cite the results of VB methods from [11], which are reported as variational lower bounds.

We first examine the results for the SBN model with one hidden layer. One can observe that our method achieves similar or better results on all the datasets, compared to the Gibbs sampler. On MNIST, our method gives an average test log-likelihood of -102.9, which brings a 12 nats improvement. Then, we utilize a second layer for SBN.
A greedy layer-wise pre-training requires generating samples of lower hidden layers as the input data for upper hidden layers. However, our method only generates a few samples for the mini-batch at each iteration, which is not enough for pre-training the next layer. Thus pre-training is not performed for any of the deep models. As can be seen, utilizing a second hidden layer improves the performance on all datasets. Furthermore, our method achieves consistently better results than the Gibbs sampler.

¹ For ARSBN, the work [11] only reported the results with variational Bayes.

We also explore ARSBNs with one hidden layer and two hidden layers. Our method on autoregressive structures gives improvements compared to the standard SBN, as expected, which suggests that our method works on different deep generative models.

We further investigate the performance of SBN models with three hidden layers. Training such models using Gibbs sampling is too time-consuming and thus is not included in Table 2. We can see that our method continues improving the results as the model grows deeper, which shows that our methods scale better to large or deep structures than Gibbs sampling.

Table 3 shows a comparison of MNIST results for various training methods, including Neural Variational Inference and Learning (NVIL) [20], wake-sleep [12] and Reweighted Wake-Sleep (RWS) [6]. Our method gives better results than wake-sleep and achieves results comparable with RWS(Q:SBN), which uses the same recognition model as ours to approximate the posterior of latent variables.

Table 3: MNIST results of various training methods on SBNs with various architectures. The results of NVIL are cited from [20]; the results of wake-sleep and RWS are cited from [6].

             | Variational Bayes                       | MCMC
Model   Dim  | NVIL       wake-sleep       RWS(Q:SBN)  | Gibbs   DSGHMC-Gibbs   DSGHMC-AIS
SBN     200  | (-113.1)   -116.3 (-120.7)  ?           | ?       ?              ?
SBN     ?    | (-99.8)    -106.9 (-109.4)  ?           | ?       ?              ?
SBN     ?    | (-96.7)    -101.3 (-104.4)  ?           | ?       ?              ?

Figure 1: Visualization: (a) Training data. (b) Samples from the SBN( ) model. (The probabilities for sampling each pixel are shown.) (c) Features at the bottom layer learned with a sparse prior. (d) Features at the bottom layer learned with a normal prior. (Top) MNIST. (Bottom) Caltech-101.

We observe that on MNIST and OCR (see Table 2), DSGHMC-AIS gives better results than DSGHMC-Gibbs, while on Caltech-101 the opposite holds. Note that, depending on the dataset being modeled, a Gibbs sampler for the posterior of latent variables can give poor samples when it has a low mixing rate, while a neural adaptive importance sampler can give poor samples when the true posterior is far from what the recognition model can approximate. Thus, depending on the dataset being modeled, either of these effects may dominate, leading to the different performance of the two proposed methods. Using a more powerful recognition model may further improve the performance of DSGHMC-AIS [6]; we leave a systematic investigation as future work.

Finally, our efficient DSGHMC methods manage to learn the deep SBN models on the large-scale MNIST8M [18] dataset, which consists of 8.1 million training digit images generated by applying transformations to the standard MNIST training examples and has the same test set as the standard MNIST dataset. MNIST8M is too large to be processed by the batch Gibbs sampler. Again, we can see in Table 2 that increasing the model depth improves the test log-likelihoods for SBN models. The slightly worse results for ARSBN may be due to overfitting.

4.2.2 Generative Performance

Fig. 1 shows the generative ability of the learned SBN models. In Fig. 1(a), we show randomly sampled training data from MNIST and Caltech-101. Fig. 1(b) presents some random examples generated from 3-layer SBNs learned by DSGHMC-AIS.
One advantage of the Bayesian framework is that we can explicitly specify sparsity-encouraging priors on the model parameters, e.g., the Student's t prior in our experiments. Fig. 1(c) and Fig. 1(d) demonstrate the difference between features learned with sparse priors and non-sparse priors. We can see that the features learned with sparse priors appear more localized.

We further examine the ability of the learned models to predict missing data. For each test image, the lower half is assumed missing and the upper half is used to infer the hidden units [11]. Then, given the hidden units, the lower half is reconstructed. Prediction is done by repeating this procedure and adopting a majority vote for each pixel. Fig. 2 demonstrates some example completions for the missing data on MNIST.

[Figure 3 panels: average test log-likelihood of Gibbs, DSGHMC-Gibbs(M), DSGHMC-Gibbs(T) and DSGHMC-AIS against (a, c) time (s), and of Gibbs, DSGHMC-Gibbs and DSGHMC-AIS against (b, d) #data visited.]

Figure 3: Convergence curves with respect to training time and number of training data visited on (a-b) MNIST and (c-d) OCR letters datasets. In (a) and (c), DSGHMC-Gibbs(T) denotes DSGHMC-Gibbs implemented in Theano with GPU acceleration and DSGHMC-Gibbs(M) denotes DSGHMC-Gibbs in MATLAB code.

Table 4: Average training time per iteration using Gibbs and DSGHMC-Gibbs for a one-hidden-layer SBN (200 hidden units). Experiments in this table are conducted using MATLAB code for both methods on a PC with eight Intel Core i CPUs (3.40GHz).

Datasets      train-size   Gibbs   DSGHMC-Gibbs
MNIST         50,000       ?s      25.4s
Caltech-101   4,?          ?s      25.4s
OCR           32,?         ?s      5.44s
MNIST8M       8,100,000    > ?s    ?

Figure 2: Missing data prediction: (Top) Original data. (Middle) Hollowed data. (Bottom) Reconstructed data.

4.2.3 Time Efficiency

We compare the efficiency of Gibbs, DSGHMC-Gibbs and DSGHMC-AIS in this section. We implement both DSGHMC-Gibbs and DSGHMC-AIS in Theano [4] with GPU acceleration. For a fair comparison with Gibbs (MATLAB code by [11]), we also implement DSGHMC-Gibbs in MATLAB code.

Table 4 presents the efficiency of Gibbs and DSGHMC-Gibbs on different datasets. We can see that DSGHMC-Gibbs is faster than the Gibbs sampler on all four datasets, especially for the large-scale MNIST8M dataset, which is too time-consuming for the Gibbs sampler. Note that our DSGHMC-Gibbs spends almost identical time per iteration on the MNIST, Caltech-101 and MNIST8M datasets, because all these datasets have the same dimensionality, which shows the scalability of the proposed method. We gain significant acceleration by using a GPU.
For example, on MNIST, the average training time per iteration (one-hidden-layer SBN) of DSGHMC-Gibbs and DSGHMC-AIS is 5.2s and 0.18s, respectively.

Fig. 3 shows the convergence curves on the MNIST and OCR datasets with respect to the training time and the number of training data visited. We can see that: 1) both DSGHMC-Gibbs and DSGHMC-AIS converge faster and to better results than Gibbs; 2) both methods visit fewer data points to converge, compared to Gibbs; and 3) DSGHMC-AIS converges faster than DSGHMC-Gibbs with respect to training time.

5 Conclusions and Future Work

We present doubly stochastic gradient MCMC, a simple and general method to learn deep generative models in a Bayesian setting. When applied to deep sigmoid belief networks, our method manages to learn on large-scale datasets with good inference accuracy.

For future work, we would like to apply this method to learn even deeper belief networks. We are also interested in investigating the performance on learning sparse Bayesian models, which often involve intractable gradients that can be estimated by our doubly stochastic strategy. Finally, learning nonparametric Bayesian DGMs is another interesting challenge.

References

[1] R. Adams, H. Wallach, and Z. Ghahramani. Learning the structure of deep sparse graphical models. In AISTATS.
[2] S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.

[3] A. Armagan, D. Dunson, and M. Clyde. Generalized beta mixtures of Gaussians. In NIPS.
[4] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
[5] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In ICML.
[6] J. Bornschein and Y. Bengio. Reweighted wake-sleep. CoRR, abs/.
[7] L. Bottou. Online Algorithms and Stochastic Approximations. Online Learning and Neural Networks, edited by David Saad, Cambridge University Press, Cambridge, UK.
[8] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML.
[9] N. Ding, Y. Fang, R. Babbush, C. Chen, R. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS.
[10] Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML.
[11] Z. Gan, R. Henao, D. Carlson, and L. Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS.
[12] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214).
[13] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18.
[14] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. MLJ, 37(2).
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/.
[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR.
[17] M. Lichman. UCI machine learning repository.
[18] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, Cambridge, MA.
[19] B. Marlin, K. Swersky, B. Chen, and N. Freitas. Inductive principles for restricted Boltzmann machine learning. In AISTATS.
[20] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML.
[21] A. Mohamed, G. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. IEEE Trans. on Audio, Speech, and Language Processing, 20(1):14-22.
[22] R. Neal. Connectionist learning of belief networks. Artif. Intell., 56(1):71-113.
[23] A. B. Owen. Monte Carlo theory, methods and examples.
[24] S. Patterson and Y. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS.
[25] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML.
[26] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3).
[27] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer.
[28] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS.
[29] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In ICML.
[30] L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of AI Research, 4:61-76.
[31] I. Titov and J. Henderson. Constituent parsing with incremental sigmoid belief networks. In ACL.
[32] M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

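Appendix

In this appendix, we first provide the derivation of the Gibbs sampler for DSGHMC-Gibbs. We then give details about how we evaluate the learned models. Finally, we list some additional experimental settings.

A Derivations

A.1 Gibbs sampler for DSBN latent layers

We sample the latent variables layer by layer and dimension-wise within each layer. Note that we define $z^{(0)} = x$ for convenience. Then $p(z_i^{(l)} \mid z_{-i}^{(l)}, x, z^{(-l)})$ can be written as $p(z_i^{(l)} \mid z_{-i}^{(l)}, z^{(-l)})$. We have

$$
\begin{aligned}
p(z_i^{(l)} \mid z_{-i}^{(l)}, z^{(-l)})
&= p(z_i^{(l)} \mid z_{-i}^{(l)}, z^{(<l)}, z^{(>l)}) \\
&\propto p(z^{(<l)} \mid z_i^{(l)}, z_{-i}^{(l)}, z^{(>l)}) \, p(z_i^{(l)} \mid z_{-i}^{(l)}, z^{(>l)}) \\
&= p(z^{(<l)} \mid z^{(l)}) \, p(z_i^{(l)} \mid z^{(l+1)}) \\
&= p(z^{(l-1)} \mid z^{(l)}) \, p(z^{(<(l-1))} \mid z^{(l-1)}, z^{(l)}) \, p(z_i^{(l)} \mid z^{(l+1)}) \\
&\propto p(z^{(l-1)} \mid z^{(l)}) \, p(z_i^{(l)} \mid z^{(l+1)}) \\
&= \prod_{j=1}^{J^{(l-1)}} \exp\Big[ (w_j^{(l)\top} z^{(l)} + c_j^{(l)}) z_j^{(l-1)} - \log\big(1 + e^{w_j^{(l)\top} z^{(l)} + c_j^{(l)}}\big) \Big] \\
&\quad \times \exp\Big[ (w_i^{(l+1)\top} z^{(l+1)} + c_i^{(l+1)}) z_i^{(l)} - \log\big(1 + e^{w_i^{(l+1)\top} z^{(l+1)} + c_i^{(l+1)}}\big) \Big] \\
&\propto \exp\Big[ \big( W_{:i}^{(l)\top} z^{(l-1)} + (w_i^{(l+1)\top} z^{(l+1)} + c_i^{(l+1)}) \big) z_i^{(l)} - \sum_{j=1}^{J^{(l-1)}} \log\big(1 + e^{w_j^{(l)\top} z^{(l)} + c_j^{(l)}}\big) \Big].
\end{aligned}
$$

The last step keeps only the factors that depend on $z_i^{(l)}$: the sum of log-partition terms remains because each term depends on $z_i^{(l)}$ through $z^{(l)}$, whereas the log-partition term of the upper layer does not and is dropped.

For $l = K$, there are no upper layers, and the term $w_i^{(l+1)\top} z^{(l+1)} + c_i^{(l+1)}$, which denotes the confidence from the upper layer, is replaced with the prior of the top layer, $b_i$. We define $W_{:i}^{(l)} = [w_{1i}^{(l)}, \dots, w_{J^{(l-1)} i}^{(l)}]^\top$, $b_j^{(l)} = w_j^{(l)\top} z^{(l)} + c_j^{(l)}$ and $b^{(K+1)} = b$ for notational convenience. Finally we get

$$
p(z_i^{(l)} \mid z_{-i}^{(l)}, z^{(-l)}) \propto \exp\Big[ \big( W_{:i}^{(l)\top} z^{(l-1)} + b_i^{(l+1)} \big) z_i^{(l)} - \sum_{j=1}^{J^{(l-1)}} \log\big(1 + e^{w_j^{(l)\top} z^{(l)} + c_j^{(l)}}\big) \Big], \qquad (15)
$$

for $l = 1$ to $K$.

A Gibbs sampler for DARSBN can be derived in a similar way.

B Evaluation

In our experiments, all models are evaluated in terms of average test log-likelihood. We now present the details of the evaluation.

B.1 Importance sampling

For the models trained by DSGHMC-AIS, we can evaluate the log-likelihood using importance sampling [6]. Specifically, we have an unbiased estimator

$$
p(x \mid \theta) = \mathbb{E}_{p(z \mid x, \theta)}\big[ p(x \mid \theta) \big]
= \mathbb{E}_{q(z \mid x; \phi)}\left[ \frac{p(z \mid x, \theta)}{q(z \mid x; \phi)} \, p(x \mid \theta) \right]
\approx \frac{1}{L} \sum_{s=1}^{L} \frac{p(x, z^{(s)} \mid \theta)}{q(z^{(s)} \mid x; \phi)}, \qquad (16)
$$

where $\{z^{(s)}\}_{s=1}^{L}$ are samples from the proposal distribution $q(z \mid x; \phi)$.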
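As a hedged illustration of the estimator in Eq. (16), the sketch below evaluates its logarithm on a toy one-hidden-layer sigmoid belief network and compares it with brute-force enumeration. Everything in it is an assumption made for illustration rather than the authors' code: the function names, the tiny layer sizes, the random parameters, and in particular the fixed factorized Bernoulli proposal `q_probs`, which stands in for a trained recognition model $q(z \mid x; \phi)$.

```python
import itertools
import numpy as np


def log_sigmoid(a):
    """Numerically stable log of the logistic function."""
    return -np.logaddexp(0.0, -a)


def log_joint(x, z, W, c, b):
    """log p(x, z | theta) for a toy one-hidden-layer sigmoid belief network.

    Assumed model: p(z_i = 1) = sigmoid(b_i) and
    p(x_j = 1 | z) = sigmoid(W[j] @ z + c_j).
    x has shape (J0,), z has shape (L, J1); the result has shape (L,).
    """
    a = z @ W.T + c
    log_px = np.sum(x * log_sigmoid(a) + (1 - x) * log_sigmoid(-a), axis=-1)
    log_pz = np.sum(z * log_sigmoid(b) + (1 - z) * log_sigmoid(-b), axis=-1)
    return log_px + log_pz


def log_mean_exp(v):
    """log of the mean of exp(v), shifted by max(v) to avoid overflow."""
    m = np.max(v)
    return m + np.log(np.mean(np.exp(v - m)))


def is_log_likelihood(x, W, c, b, q_probs, num_samples, rng):
    """Logarithm of the Eq. (16) estimator with a factorized Bernoulli proposal."""
    z = (rng.random((num_samples, q_probs.size)) < q_probs).astype(float)
    log_q = np.sum(z * np.log(q_probs) + (1 - z) * np.log1p(-q_probs), axis=-1)
    log_weights = log_joint(x, z, W, c, b) - log_q  # log p(x, z) - log q(z | x)
    return log_mean_exp(log_weights)


def exact_log_likelihood(x, W, c, b):
    """Brute-force log p(x | theta): sum the joint over all 2**J1 latent states."""
    zs = np.array(list(itertools.product([0.0, 1.0], repeat=b.size)))
    return log_mean_exp(log_joint(x, zs, W, c, b)) + np.log(len(zs))


# Toy problem (hypothetical sizes and parameters).
rng = np.random.default_rng(0)
J0, J1 = 5, 3
W, c, b = rng.normal(size=(J0, J1)), rng.normal(size=J0), rng.normal(size=J1)
x = (rng.random(J0) < 0.5).astype(float)
q_probs = np.full(J1, 0.5)  # placeholder proposal; a recognition model would depend on x

print("importance sampling:", is_log_likelihood(x, W, c, b, q_probs, 20000, rng))
print("exact enumeration:  ", exact_log_likelihood(x, W, c, b))
```

The weights are averaged in log space (log-mean-exp) because individual ratios $p(x, z^{(s)} \mid \theta) / q(z^{(s)} \mid x; \phi)$ can underflow for high-dimensional $x$; the enumeration check is only feasible because the toy latent layer has three units.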
In the experiments, we take the logarithm of the above estimator to compute the log-likelihood. Note that taking the logarithm makes the estimator biased [6], but both the bias and the variance decrease as the number of samples $L$ increases. In our experiments, we use $L = 100{,}000$, which is shown to be sufficiently large.

B.2 Annealed importance sampling

For the models trained by DSGHMC-Gibbs or by Gibbs sampling [11], the above estimator is not applicable since no recognition model is trained. Noticing that $p(z \mid x, \theta) = p(x, z \mid \theta) / p(x \mid \theta)$, we can view $p(x, z \mid \theta)$ as an unnormalized version of $p(z \mid x, \theta)$ and $p(x \mid \theta)$ as its normalizing constant. We adopt the Annealed Importance Sampling (AIS) method [29], which can evaluate the ratio of normalizing constants accurately by reducing the variance of the estimate via a sequence of intermediate distributions. See [29] for details.

For our DSBN, we use AIS to compute the likelihood $p(x \mid \theta)$ by computing the ratio of $p(x \mid \theta)$ and $p(x \mid \theta = 0)$, where $p(x \mid \theta = 0)$ is the likelihood with all model parameters set to 0. Note that $\theta = 0$ implies the likelihood is a constant, $p(x \mid \theta = 0) = 2^{-J^{(0)}}$, where $J^{(0)}$ is the dimension of $x$, since each dimension of $x$ is equally likely to be 0 or 1. Thus computing the ratio $p(x \mid \theta) / p(x \mid \theta = 0)$ is equivalent to computing the likelihood $p(x \mid \theta)$. To compute the ratio, we introduce a sequence of intermediate distributions

$$
p_0(z) = p(z, x \mid \theta = 0), \; \dots, \; p_k(z) = p\big(z, x \mid \tfrac{k}{K}\theta\big), \; \dots, \; p_K(z) = p(x, z \mid \theta), \qquad k = 0, \dots, K,
$$

where $K$ is the number of intermediate distributions and $\tfrac{k}{K}\theta$ denotes the model parameters multiplied by $\tfrac{k}{K}$. Then we estimate $p(x \mid \theta) / p(x \mid \theta = 0)$ by

$$
\frac{p_1(z_1)}{p_0(z_1)} \, \frac{p_2(z_2)}{p_1(z_2)} \cdots \frac{p_{K-1}(z_{K-1})}{p_{K-2}(z_{K-1})} \, \frac{p_K(z_K)}{p_{K-1}(z_K)}, \qquad (17)
$$

where $z_0$ is sampled from $p_0(z)$, and $z_{k+1}$ is sampled from the Gibbs sampler of Eq. (15), taking the current latent layers to be $z_k$ and the model to be $\tfrac{k}{K}\theta$. In our experiments we choose $K = 1000$ and evaluate the log-likelihood by computing Eq. (17) repeatedly and taking the average.

C More Detailed Experimental Settings

In Eq. (4), the gradient $\nabla_\theta U(\theta)$ scales with the size of the dataset, so different datasets would require different learning rates. In our experiments, instead of summing the gradient over all the data $x_n$, we compute the per-data gradient $\nabla_\theta U(\theta) / |D|$ so that the same learning rate can be used across datasets. For the majority of our experiments, a learning rate of 0.01 gives the best results. One parameter not mentioned in the main paper is the number of discretization steps used when simulating the dynamics; in our experiments, we fix it to 10. The Student-t prior used in the experiments has scale parameter $\sigma = 0.09$, location parameter $\mu = 0$ and degrees of freedom $\nu = 2.2$.
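Returning to the AIS evaluation of Appendix B.2, the following is a minimal sketch of Eq. (17) on a toy one-hidden-layer sigmoid belief network, small enough that the result can be checked against brute-force enumeration. The function names, layer sizes, random parameters, and the choice of 200 temperatures and 500 chains are all assumptions for illustration (the paper uses $K = 1000$), and the Gibbs conditional is computed from differences of the log joint rather than from the closed form of Eq. (15), which gives the same conditional distribution.

```python
import itertools
import numpy as np


def log_sigmoid(a):
    """Numerically stable log of the logistic function."""
    return -np.logaddexp(0.0, -a)


def log_joint(x, z, W, c, b, scale=1.0):
    """log p(x, z | scale * theta) for a toy one-hidden-layer SBN.

    Assumed model: p(z_i = 1) = sigmoid(b_i), p(x_j = 1 | z) = sigmoid(W[j] @ z + c_j).
    """
    a = scale * (z @ W.T + c)
    prior = scale * b
    log_px = np.sum(x * log_sigmoid(a) + (1 - x) * log_sigmoid(-a), axis=-1)
    log_pz = np.sum(z * log_sigmoid(prior) + (1 - z) * log_sigmoid(-prior), axis=-1)
    return log_px + log_pz


def gibbs_sweep(x, z, W, c, b, scale, rng):
    """One dimension-wise Gibbs sweep that leaves p(z | x, scale * theta) invariant."""
    z = z.copy()
    for i in range(z.shape[1]):
        z[:, i] = 1.0
        log_p1 = log_joint(x, z, W, c, b, scale)
        z[:, i] = 0.0
        log_p0 = log_joint(x, z, W, c, b, scale)
        prob_one = np.exp(log_sigmoid(log_p1 - log_p0))  # p(z_i = 1 | rest)
        z[:, i] = (rng.random(z.shape[0]) < prob_one).astype(float)
    return z


def ais_log_likelihood(x, W, c, b, num_temps, num_chains, rng):
    """Estimate log p(x | theta) from the product of ratios in Eq. (17)."""
    J0, J1 = W.shape
    z = (rng.random((num_chains, J1)) < 0.5).astype(float)  # z_0 ~ p_0 (uniform)
    log_w = np.zeros(num_chains)
    for k in range(1, num_temps + 1):
        prev, cur = (k - 1) / num_temps, k / num_temps
        z = gibbs_sweep(x, z, W, c, b, prev, rng)  # z_k given z_{k-1}, model prev * theta
        log_w += log_joint(x, z, W, c, b, cur) - log_joint(x, z, W, c, b, prev)
    m = np.max(log_w)
    log_ratio = m + np.log(np.mean(np.exp(log_w - m)))  # average over chains
    return log_ratio - J0 * np.log(2.0)  # add log p(x | theta = 0) = -J0 log 2


def exact_log_likelihood(x, W, c, b):
    """Brute-force log p(x | theta) by enumerating all 2**J1 latent states."""
    zs = np.array(list(itertools.product([0.0, 1.0], repeat=b.size)))
    lj = log_joint(x, zs, W, c, b)
    return np.max(lj) + np.log(np.sum(np.exp(lj - np.max(lj))))


# Toy problem (hypothetical sizes and parameters).
rng = np.random.default_rng(0)
W, c, b = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
x = (rng.random(5) < 0.5).astype(float)
print("AIS:  ", ais_log_likelihood(x, W, c, b, num_temps=200, num_chains=500, rng=rng))
print("exact:", exact_log_likelihood(x, W, c, b))
```

Running independent chains in parallel and averaging their weights plays the role of "computing Eq. (17) repeatedly and taking the average"; for a multi-layer DSBN the sweep would instead visit every latent layer using Eq. (15).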
