arxiv: v3 [cs.lg] 14 Oct 2015


Learning Deep Generative Models with Doubly Stochastic MCMC

Chao Du, Jun Zhu, Bo Zhang
Dept. of Comp. Sci. & Tech., State Key Lab of Intell. Tech. & Sys., TNList Lab, Center for Bio-Inspired Computing Research, Tsinghua University, Beijing, China

Abstract

We present doubly stochastic gradient MCMC, a simple and generic method for (approximate) Bayesian inference of deep generative models in the collapsed continuous parameter space. At each MCMC sampling step, the algorithm randomly draws a minibatch of data samples to estimate the gradient of the log-posterior and further estimates the intractable expectation over latent variables via a Gibbs sampler or a neural adaptive importance sampler. We demonstrate the effectiveness on learning deep sigmoid belief networks (DSBNs). Compared to the state-of-the-art methods using Gibbs sampling with data augmentation, our algorithm is much more efficient and manages to learn DSBNs on large datasets.

1 Introduction

Learning deep models that consist of multi-layered representations has proven effective, with state-of-the-art performance in many tasks [5, 13, 28]. To fit such deep models, a small dataset is often insufficient. Therefore, it is important to address the computational challenge of learning such deep models on large-scale datasets, which are often also in high-dimensional spaces.

Deep generative models (DGMs) [13, 28] represent an important family of deep models that can answer a wide range of queries by performing probabilistic inference, such as inferring the missing values of input data, which is beyond the scope of recognition networks such as deep neural networks. However, probabilistic inference with DGMs is challenging, especially when a Bayesian framework is employed, which is desirable as it can offer various possibilities, such as robustness to overfitting, sparse Bayesian inference [11] and nonparametric inference [1] to learn the graph structure.
For Bayesian methods in general, posterior inference often involves intractable integrals because of several potential factors, such as that the space is extremely high-dimensional and that the Bayesian model is non-conjugate. To address these challenges, approximate methods have to be adopted, including variational [14, 30] and Markov chain Monte Carlo (MCMC) methods [27]. Although much progress has been made on stochastic variational methods for DGMs [16, 25], under some mean-field or parameterization assumptions, little work has been done on extending MCMC methods, which are often more accurate, to learn DGMs in a Bayesian setting. A few exceptions exist. Gan et al. [11] present a Gibbs sampler for deep sigmoid belief networks with a sparsity-inducing prior via data augmentation, and Adams et al. [1] present a Metropolis-Hastings method for the cascading Indian buffet process.

In this paper, we present a simple and generic method, named doubly stochastic gradient MCMC, to improve the efficiency of performing Bayesian inference on DGMs. By drawing samples in the collapsed parameter space, our method extends the recent work on stochastic gradient MCMC [32, 2, 24] to deal with the challenging task of posterior inference with DGMs. Besides the stochasticity of randomly drawing a minibatch of samples in stochastic approximation, our algorithm introduces an extra dimension of stochasticity to estimate the intractable gradients by randomly drawing the latent variables in DGMs. The sampling can be done via a simple Gibbs sampler. We further develop a neural adaptive importance sampler, where the adaptive proposal is parameterized by a recognition network and the parameters are optimized by descending the inclusive KL-divergence. By combining the two types of stochasticity, we construct an unbiased estimate of the gradient in the continuous parameter space.
Then, a stochastic gradient MCMC method is applied with a guarantee to (approximately) converge to the target posterior when the learning rates are set under a proper annealing scheme. Our method can be widely applied to DGMs with either discrete or continuous latent variables. As an example, we demonstrate the efficacy on learning deep sigmoid belief networks on large-scale datasets, achieving significant speedup compared to the very recent Gibbs sampler with data augmentation [11].

Finally, note that, independently of our work, Gan et al. [10] also adopted a Monte Carlo estimate via Gibbs sampling of the intractable gradients under a stochastic MCMC method, specifically for topic models. Besides offering a general perspective that is applicable to various types of DGMs, we propose a neural adaptive importance sampler, which is more efficient than Gibbs sampling and in most cases leads to better estimates.

2 Doubly Stochastic Gradient MCMC for Deep Generative Models

We now present the doubly stochastic gradient MCMC for deep generative models.

2.1 Variational MLE for DGMs

Let $X = \{x_n\}_{n=1}^N$ be a given dataset with $N$ i.i.d. samples. A deep generative model (DGM) assumes that each $x_n \in \mathbb{R}^J$ is generated from a vector of latent variables $z_n \in \mathbb{R}^H$, which itself follows some distribution $p(z \mid \alpha)$. Let $p(x \mid z, \beta)$ be the likelihood model. The joint probability of a DGM is as follows:

$$p(X, Z \mid \theta) = \prod_{n=1}^{N} p(z_n \mid \alpha)\, p(x_n \mid z_n, \beta), \tag{1}$$

where $\theta := (\alpha, \beta)$. Depending on the structure of $z$, e.g., directed or undirected graphs, various DGMs have been developed, such as deep belief networks [13], deep Boltzmann machines [28] and deep sigmoid belief networks [20], which are our focus in this paper.

Learning DGMs is often very challenging due to the intractability of posterior inference. The state-of-the-art methods resort to stochastic variational methods under the maximum likelihood estimation (MLE) framework, $\hat{\theta} = \arg\max_\theta \log p(X \mid \theta)$. Specifically, let $q_\phi(Z)$ be some variational distribution to approximate the true posterior $p(Z \mid X, \theta)$. A variational bound of the log-likelihood $\log p(X \mid \theta)$ can be derived. Then, we optimize the variational bound with respect to the variational parameters.
However, the variational bound is often intractable to compute analytically for DGMs. To address this challenge, one possible method is to bound the intractable parts with tractable ones by introducing extra variational parameters [30]; but these methods increase the gap between the bound being optimized and the log-likelihood, potentially leading to poorer estimates. Another way is to adopt the recent progress [16, 25, 20] on hybrid Monte Carlo and variational methods, which approximate the intractable expectations and their gradients over the parameters $(\theta, \phi)$ via unbiased Monte Carlo estimates. Furthermore, to handle large-scale datasets, stochastic optimization [26, 7] of the variational objective can be used with a suitable learning rate annealing scheme. Note that variance reduction is a key part of these methods in order to have fast and stable convergence.

The reweighted wake-sleep (RWS) [6] represents another type of method that directly estimates the log-likelihood (as well as its gradient) via importance sampling, where the proposal distribution can be characterized by a recognition model (or inference network). We draw inspiration from these variational methods to build our MCMC samplers, as explained below.

2.2 Doubly Stochastic Gradient MCMC

We consider the Bayesian setting to infer the posterior distribution $p(\theta, Z \mid X) \propto p_0(\theta)\, p(Z \mid \theta)\, p(X \mid Z, \theta)$ or its marginal distribution $p(\theta \mid X)$, by assuming some prior $p_0(\theta)$. Except for a handful of special examples, the posterior distribution is intractable to infer. Though variational methods can be developed as in [16, 25, 20], under some mean-field or parameterization assumptions, they often require non-trivial model-specific derivations and may lead to inaccurate approximation when the assumptions are not properly made.

Here, we consider MCMC methods, which are more generally applicable and can asymptotically approach the target posterior. A straightforward application of MCMC methods can be Gibbs sampling or stochastic gradient MCMC [32, 2, 24].
However, a Gibbs sampler can suffer from random-walk behavior in high-dimensional spaces. Furthermore, a Gibbs sampler would need to process all data at each iteration, which is prohibitive when dealing with large-scale datasets. The stochastic gradient MCMC methods can lead to significant speedup by exploiting statistical redundancy in large datasets; but they require that the sample space is continuous, which is not true for many DGMs, such as deep sigmoid belief networks that have discrete latent variables.

Below, we present a doubly stochastic gradient MCMC with general applicability. We make only the mild assumption that the parameter space is continuous and the log joint distribution $\log p(x, z \mid \theta)$ is differentiable with respect to the model parameters $\theta$ almost everywhere, except on a zero-mass set. Such an assumption is true for almost all existing DGMs. Then, our method draws samples in a collapsed space that involves the model parameters $\theta$ only, by integrating out the latent variables $z$:

$$p(\theta \mid X) = \frac{1}{p(X)}\, p_0(\theta) \prod_{n=1}^{N} \int p(x_n, z_n \mid \theta)\, dz_n, \tag{2}$$

where for discrete variables the integral is a summation. Then the gradient of the log-posterior is

$$\nabla_\theta \log p(\theta \mid X) = \nabla_\theta \log p_0(\theta) + \sum_{n=1}^{N} \nabla_\theta \log p(x_n \mid \theta),$$

where the second term can be calculated as:

$$\nabla_\theta \log p(x \mid \theta) = \frac{1}{p(x \mid \theta)} \nabla_\theta \int p(x, z \mid \theta)\, dz = \int \frac{p(x, z \mid \theta)}{p(x \mid \theta)} \nabla_\theta \log p(x, z \mid \theta)\, dz = E_{p(z \mid x, \theta)}\big[\nabla_\theta \log p(x, z \mid \theta)\big]. \tag{3}$$

With the above gradient, we can adopt a stochastic gradient MCMC method to draw samples of $\theta$. We consider the second-order stochastic gradient Hamiltonian Monte Carlo (SGHMC) [8]; but our method can be naturally extended to the first-order stochastic gradient Langevin dynamics [32] and stochastic gradient thermostats [9] algorithms. SGHMC defines a potential energy $U(\theta) = -\log p(\theta)$, where $p(\theta)$ is the target distribution, and builds a Markov chain using the gradient of $U(\theta)$. In our case, the target distribution is $p(\theta \mid X)$ and the gradient can be written as $\nabla_\theta U(\theta) = -\nabla_\theta \log p_0(\theta) - \sum_{n=1}^{N} \nabla_\theta \log p(x_n \mid \theta)$. Let $B_t$ be a random mini-batch sampled from the training data $D$. We get the SGHMC method:

$$\gamma_{t,i} = (1 - \xi)\,\gamma_{t,i-1} + \lambda_t \Big( \nabla_\theta \log p_0(\theta_{t,i-1}) + \frac{|D|}{|B_t|} \sum_{n \in B_t} \nabla_\theta \log p(x_n \mid \theta_{t,i-1}) \Big) + \mathcal{N}(0, 2\xi\lambda_t),$$
$$\theta_{t,i} = \theta_{t,i-1} + \gamma_{t,i}, \tag{4}$$

where $\gamma_t$ are augmented momentum variables, $\xi$ is the momentum decay and the subscript $i$ indexes the inner step of the discretization when simulating the Hamiltonian dynamics at iteration $t$. With a proper annealing scheme over the learning rate $\lambda_t$, the Hamiltonian dynamics will converge to the target posterior.

The remaining challenge is to compute the gradient, as the expectation in Eq. (3) is often intractable for DGMs. Here, we construct an unbiased estimate of the gradient from a set of samples $\{z^{(s)}\}_{s=1}^{L}$ drawn from the posterior $p(z \mid x, \theta)$:

$$\nabla_\theta \log p(x \mid \theta) \approx \frac{1}{L} \sum_{s=1}^{L} \nabla_\theta \log p(x, z^{(s)} \mid \theta). \tag{5}$$

To draw the samples $z^{(s)}$, we consider two strategies. The first one is a Gibbs sampler, which alternately samples each dimension of $z$ given the others. A Gibbs sampler is simple and applicable to both discrete and continuous latent variables. The second strategy is neural adaptive importance sampling, which again applies to both discrete and continuous latent variables, as detailed below.
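To make the update in Eq. (4) concrete, here is a minimal sketch of one inner SGHMC step in plain Python. The function names `sghmc_step` and `run_toy_chain`, and the one-dimensional standard normal target, are assumptions chosen for illustration only. In the method described above, the `grad_log_post` argument would instead be the prior gradient plus the rescaled mini-batch sum of Monte Carlo gradient estimates.

```python
import math
import random


def sghmc_step(theta, gamma, grad_log_post, lr, xi, rng):
    """One inner update in the form of Eq. (4).

    theta, gamma  : lists of floats (parameters and momentum variables)
    grad_log_post : estimate of the log-posterior gradient at theta
    lr, xi        : learning rate (lambda_t) and momentum decay
    """
    noise_std = math.sqrt(2.0 * xi * lr)  # injected noise N(0, 2 * xi * lambda_t)
    new_gamma = [
        (1.0 - xi) * g + lr * d + rng.gauss(0.0, noise_std)
        for g, d in zip(gamma, grad_log_post)
    ]
    new_theta = [t + g for t, g in zip(theta, new_gamma)]
    return new_theta, new_gamma


def run_toy_chain(num_steps=200_000, lr=0.01, xi=0.1, seed=0):
    """Toy target (an assumption for illustration): theta ~ N(0, 1),
    so the gradient of the log-density is simply -theta."""
    rng = random.Random(seed)
    theta, gamma = [0.0], [0.0]
    samples = []
    for _ in range(num_steps):
        theta, gamma = sghmc_step(theta, gamma, [-theta[0]], lr, xi, rng)
        samples.append(theta[0])
    return samples
```

With a small fixed learning rate, the empirical mean and variance of the toy chain should land close to 0 and 1. The sketch uses a constant learning rate rather than the annealing scheme mentioned in the text.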
Let $q(z \mid x; \phi)$ be a proposal distribution which satisfies $q(z \mid x; \phi) > 0$ wherever $p(z \mid x, \theta) > 0$. We then have

$$\nabla_\theta \log p(x \mid \theta) = E_{q(z \mid x; \phi)}\left[ \frac{p(z \mid x, \theta)}{q(z \mid x; \phi)} \nabla_\theta \log p(x, z \mid \theta) \right],$$

from which an unbiased importance sampling estimator can be derived with the sample weights being $\frac{p(z \mid x, \theta)}{q(z \mid x; \phi)}$. However, computing $p(z \mid x, \theta)$ is hard for most DGMs. By noticing that $p(z \mid x, \theta) \propto p(x, z \mid \theta)$ and that computing $p(x, z \mid \theta)$ is easy, we derive a self-normalized importance sampling estimate as follows:

$$\nabla_\theta \log p(x \mid \theta) \approx \frac{\sum_{s=1}^{L} \omega^{(s)}\, \nabla_\theta \log p(x, z^{(s)} \mid \theta)}{\sum_{s=1}^{L} \omega^{(s)}}, \tag{6}$$

where $\{z^{(s)}\}_{s=1}^{L}$ is a set of samples drawn from the proposal $q(z \mid x; \phi)$ and $\omega^{(s)} = \frac{p(x, z^{(s)} \mid \theta)}{q(z^{(s)} \mid x; \phi)}$ is the unnormalized likelihood ratio. This estimate is asymptotically consistent [23], and its slight bias decreases as more samples are drawn.

Neural Adaptive Proposals: To reduce the variance of the estimator in Eq. (6) and get accurate gradient estimates, $q(z \mid x; \phi)$ should be as close to $p(z \mid x, \theta)$ as possible. Here, we draw inspiration from variational methods and learn adaptive proposals by minimizing some criterion. Specifically, we build a recognition model (or inference network) to represent the proposal distribution $q(z \mid x; \phi)$ of latent variables, as in the variational methods [16, 6]. Such a recognition model takes $x$ as input and outputs $\{z^{(s)}\}$ as samples from $q(z \mid x; \phi)$. We optimize the quality of the proposal distribution by minimizing the inclusive KL-divergence between the target posterior distribution and the proposal, $E_{p(z \mid x, \theta)}\big[\log \frac{p(z \mid x, \theta)}{q(z \mid x; \phi)}\big]$ [6], or equivalently maximizing the expected log-likelihood of the recognition model

$$J(\phi; \theta, x) = E_{p(z \mid x, \theta)}\big[\log q(z \mid x; \phi)\big]. \tag{7}$$

We choose this objective for the following reasons. If the target posterior belongs to the family of proposal distributions, maximizing $J(\phi; \theta, x)$ leads to the optimal solution, which is the target posterior; otherwise, minimizing the inclusive KL-divergence tends to find proposal distributions that have higher entropy than the target posterior.
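The self-normalized estimator in Eq. (6) can be checked on a model small enough to enumerate. The sketch below assumes a toy model with a single binary latent variable and a single binary observation, with parameters $\theta = (b, w, c)$, and a fixed proposal probability `q1` standing in for a learned recognition network. All function names are hypothetical and the model is chosen only so that the exact gradient in Eq. (3) is available for comparison.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def log_joint(x, z, theta):
    """log p(x, z | theta) with p(z=1) = sigmoid(b), p(x=1 | z) = sigmoid(w*z + c)."""
    b, w, c = theta
    a = w * z + c
    return z * b - math.log1p(math.exp(b)) + x * a - math.log1p(math.exp(a))


def grad_log_joint(x, z, theta):
    """Gradient of log p(x, z | theta) with respect to (b, w, c)."""
    b, w, c = theta
    residual = x - sigmoid(w * z + c)
    return [z - sigmoid(b), residual * z, residual]


def exact_grad(x, theta):
    """Eq. (3) by enumeration: posterior expectation of the joint gradient."""
    weights = [math.exp(log_joint(x, z, theta)) for z in (0, 1)]
    grads = [grad_log_joint(x, z, theta) for z in (0, 1)]
    total = sum(weights)
    return [sum(wt * g[k] for wt, g in zip(weights, grads)) / total for k in range(3)]


def snis_grad(x, theta, q1, num_samples, rng):
    """Eq. (6): self-normalized importance sampling with proposal q(z=1) = q1."""
    numerator, denominator = [0.0, 0.0, 0.0], 0.0
    for _ in range(num_samples):
        z = 1 if rng.random() < q1 else 0
        q = q1 if z == 1 else 1.0 - q1
        omega = math.exp(log_joint(x, z, theta)) / q  # unnormalized weight
        grad = grad_log_joint(x, z, theta)
        numerator = [n + omega * g for n, g in zip(numerator, grad)]
        denominator += omega
    return [n / denominator for n in numerator]
```

For example, `snis_grad(1, (0.3, -1.2, 0.5), 0.5, 20000, random.Random(1))` should agree closely with `exact_grad(1, (0.3, -1.2, 0.5))`, and the gap shrinks as `num_samples` grows.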
The higher-entropy tendency of the inclusive KL-divergence is advantageous for importance sampling, as we require that $q(z \mid x; \phi) > 0$ wherever $p(z \mid x, \theta) > 0$. In contrast, the exclusive KL-divergence $L(\phi; \theta, x) := E_{q(z \mid x; \phi)}\big[\log \frac{q(z \mid x; \phi)}{p(z \mid x, \theta)}\big]$, as widely adopted in the variational methods [16, 25, 20], does not have such a property: it can happen that $q(z \mid x; \phi) = 0$ where $p(z \mid x, \theta) > 0$, which makes it unsuitable for importance sampling.

The gradient of $J(\phi; \theta, x)$ with respect to the parameters of the proposal distribution is

$$\nabla_\phi J(\phi; \theta, x) = E_{p(z \mid x, \theta)}\big[\nabla_\phi \log q(z \mid x; \phi)\big], \tag{8}$$

which can be estimated using importance sampling, similarly to Eq. (6):

$$\nabla_\phi J(\phi; \theta, x) \approx \frac{\sum_{s=1}^{L} \omega^{(s)}\, \nabla_\phi \log q(z^{(s)} \mid x; \phi)}{\sum_{s=1}^{L} \omega^{(s)}}, \tag{9}$$

where $\{z^{(s)}\}_{s=1}^{L}$ are samples from the latest proposal distribution $q(z \mid x; \phi)$ and the weights are the same as in Eq. (6). To improve the efficiency, we adopt stochastic gradient methods to optimize the objective $J(\phi; \theta, X) := \sum_{n=1}^{N} J(\phi; \theta, x_n)$, with the gradient being estimated from a random mini-batch of data points at each iteration:

$$\nabla_\phi J(\phi; \theta, X) \approx \frac{|D|}{|B_t|} \sum_{n \in B_t} \nabla_\phi J(\phi; \theta, x_n), \tag{10}$$

where each term $\nabla_\phi J(\phi; \theta, x_n)$ is further estimated by samples as in Eq. (9). This stochastic gradient method naturally fits into our stochastic gradient MCMC with little cost. In fact, we can use the same set of samples $\{z^{(s)}\}_{s=1}^{L}$ to estimate the gradients in both Eq. (6) and Eq. (10) in practice without losing accuracy, which saves the cost of drawing samples.

With the above gradient estimates, we get the overall algorithm with neural adaptive importance sampling, as outlined in Algorithm 1, where we adaptively update the proposal distribution by performing one step of recognition model update after each step of Hamiltonian dynamics simulation.

Algorithm 1 Doubly Stochastic Gradient MCMC with Neural Adaptive Proposals
Input: $X$
1: initialize $\theta$, $\phi$
2: for iteration $t = 1, 2, \dots$ do
3:     sample a mini-batch $B_t \subset \{1, \dots, N\}$
4:     for $i = 1$ to $m$ of SGHMC steps do
5:         sample $\{z_n^{(s)}\}$ from $q(z \mid x_n; \phi)$, $n \in B_t$
6:         estimate $\nabla_\theta \log p(x_n \mid \theta)$ by Eq. (6), $n \in B_t$
7:         update $\theta$ by Eq. (4)
8:         update $\phi$ with the gradient in Eq. (10)
Output: $\theta$, $\phi$

3 Sigmoid Belief Networks

We now apply the doubly stochastic gradient MCMC method to learn deep sigmoid belief networks. A sigmoid belief network (SBN) [22] is a directed graphical model that has a generative process for binary data. Let $x \in \{0, 1\}^J$ be a $J$-dimensional binary vector.
The binary data is modeled in terms of a vector of binary hidden variables $z \in \{0, 1\}^H$ by a factorized likelihood $p(x \mid z) = \prod_{j=1}^{J} p(x_j \mid z)$ with

$$p(x_j = 1 \mid z) = \sigma(w_j^\top z + c_j), \tag{11}$$

where $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoid function and $W = [w_1, \dots, w_J]^\top \in \mathbb{R}^{J \times H}$, $c \in \mathbb{R}^J$ are model parameters. The prior of the hidden variables is also assumed to be a factorized distribution $p(z) = \prod_{h=1}^{H} p(z_h)$ with

$$p(z_h = 1) = \sigma(b_h), \tag{12}$$

where $b \in \mathbb{R}^H$ are the parameters. If we consider layers of an SBN as ordered sequences of observed variables or hidden variables, and assume directed links within layers, we obtain an autoregressive version of SBN (ARSBN), whose prior and likelihood models are defined as follows:

$$p(z_h = 1 \mid z_{<h}) = \sigma(u_{h,<h}^\top z_{<h} + b_h), \tag{13}$$
$$p(x_j = 1 \mid x_{<j}, z) = \sigma(s_{j,<j}^\top x_{<j} + w_j^\top z + c_j), \tag{14}$$

where $U = [u_1, \dots, u_H]^\top$ and $S = [s_1, \dots, s_J]^\top$ are lower triangular matrices, denoting the autoregressive weights within a layer.

SBNs have been widely used within deep architectures to learn latent representations [31, 21]. Recent work has mainly focused on developing scalable variational methods. The work [11] is an exception, which learns SBNs under a Bayesian setting and employs a sparsity-inducing prior (e.g., Student's t or the three-parameter beta normal prior [3]) on the model weights. Adopting a sparsity-inducing prior biases the model toward learning sparse latent representations. However, it also makes the posterior inference challenging, as the prior is non-conjugate to the sigmoid likelihood in SBN. To deal with the non-conjugacy, very recent data augmentation techniques were applied to derive a Gibbs sampler [11]. However, this method is computationally intensive when the dataset is large and/or the number of hidden units is large. Furthermore, sampling in an augmented space can potentially lead to slow mixing rates. We will compare extensively with this strong baseline. In this paper, we use the Student's t prior on all model parameters.
A deep sigmoid belief network (DSBN) is constructed by stacking SBN layers, with joint probability

$$p(x, z^{(1)}, \dots, z^{(K)}) = p(z^{(K)}) \prod_{l=1}^{K-1} p(z^{(l)} \mid z^{(l+1)})\; p(x \mid z^{(1)}),$$

where the probability of the last (top) layer $p(z^{(K)})$ follows Eq. (12) and all other factor distributions are defined as in Eq. (11) with layer-specific parameters $\theta^{(l)} = \{W^{(l)}, c^{(l)}\}$. Let $J^{(l)}$ be the number of units at the $l$-th layer (with $x$ being the 0-th layer). The model parameters for DSBN are $\theta = \{\theta^{(l)}, b\}_{l=1}^{K}$.
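The DSBN generative process above can be sketched as ancestral sampling from the top layer down, together with the corresponding log joint probability. The data layout is an assumption made for this illustration: `layers[l - 1]` holds the pair $(W^{(l)}, c^{(l)})$ with one weight row per unit of layer $l-1$, and `states` is the list $[x, z^{(1)}, \dots, z^{(K)}]$. Function names are hypothetical.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def log_bernoulli(value, logit):
    """log p(value) for a Bernoulli variable with p(value = 1) = sigmoid(logit)."""
    return value * logit - math.log1p(math.exp(logit))


def sample_dsbn(top_bias, layers, rng):
    """Ancestral sampling: draw z^(K) from Eq. (12), then each lower layer from Eq. (11)."""
    states = [[1 if rng.random() < sigmoid(b) else 0 for b in top_bias]]
    for weights, biases in reversed(layers):
        upper = states[0]
        lower = []
        for row, c in zip(weights, biases):
            logit = sum(w * u for w, u in zip(row, upper)) + c
            lower.append(1 if rng.random() < sigmoid(logit) else 0)
        states.insert(0, lower)
    return states  # [x, z^(1), ..., z^(K)]


def log_joint(states, top_bias, layers):
    """log p(x, z^(1), ..., z^(K)) for states = [x, z^(1), ..., z^(K)]."""
    total = sum(log_bernoulli(v, b) for v, b in zip(states[-1], top_bias))
    for l, (weights, biases) in enumerate(layers, start=1):
        upper, lower = states[l], states[l - 1]
        for row, c, v in zip(weights, biases, lower):
            logit = sum(w * u for w, u in zip(row, upper)) + c
            total += log_bernoulli(v, logit)
    return total
```

A quick sanity check for such a sketch is that `exp(log_joint(...))` summed over every binary configuration of a tiny network equals one.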

It is clear that DSBN satisfies our mild assumption of differentiability in the collapsed parameter space; therefore our doubly stochastic gradient MCMC method applies, as detailed below for both the Gibbs sampler and the neural adaptive importance sampler used to estimate the gradient.

For the Gibbs sampler, the goal is to draw from $p(z^{(1)}, \dots, z^{(K)} \mid x, \theta)$. Note that we define $z^{(0)} = x$ for convenience. It iteratively draws samples from the conditional distribution of a single unit by conditioning on all other variables:

$$p(z_h^{(l)} \mid z_{-h}^{(l)}, z^{(-l)}) \propto \exp\Big[ \big(W_{:h}^{(l)\top} z^{(l-1)} + b_h^{(l+1)}\big) z_h^{(l)} - \sum_{j=1}^{J^{(l-1)}} \log\big(1 + e^{w_j^{(l)\top} z^{(l)} + c_j^{(l)}}\big) \Big], \tag{15}$$

where $b_h^{(l)} = w_h^{(l)\top} z^{(l)} + c_h^{(l)}$, $b^{(K+1)} = b$, $W_{:h}^{(l)} = [w_{1h}^{(l)}, \dots, w_{J^{(l-1)}h}^{(l)}]^\top$, $z_{-h}^{(l)}$ denotes all the variables in $z^{(l)}$ except $z_h^{(l)}$, and $z^{(-l)}$ denotes all the layers except $z^{(l)}$. A detailed derivation is included in the Appendix.

For the neural adaptive importance sampler, we construct a recognition model with the same structure as DSBN but with all layers inverted to approximate the posterior of the latent layers. Specifically, we build

$$q(z^{(1)}, \dots, z^{(K)}; x) = q(z^{(1)}; x) \prod_{l=1}^{K-1} q(z^{(l+1)} \mid z^{(l)}),$$

where $x$ is the input and all $z^{(l)}$ are sampled from this inverted stack of SBN layers. The parameters of the recognition model are $\phi = \{W^{(l)}, c^{(l)}\}_{l=1}^{K}$. Note that a more complicated recognition model can lead to better performance [6]. In this paper, we restrict ourselves to the simple recognition network.

4 Experiments

We now present experimental results of our doubly stochastic MCMC method on four public datasets: MNIST, Caltech-101 Silhouettes [19], OCR letters [17] and MNIST8M [18]. The MNIST and MNIST8M datasets are binarized by stochastically setting each pixel to 1 in proportion to its intensity, following [29]. Table 1 summarizes the statistics of the datasets. We use the doubly stochastic gradient Hamiltonian Monte Carlo with both a Gibbs sampler (DSGHMC-Gibbs) and an adaptive importance sampler (DSGHMC-AIS) in the experiments.
Table 1: The descriptions of the four datasets (columns: MNIST, Caltech-101, OCR, MNIST8M). Only fragments of the entries survive in this transcription and are listed in source order.
dimension: (no values survive)
train size: 50,000 | 4, , 152 | 8,100,000
valid size: 10,000 | 2, , , 100
test size: 10,000 | 2, , , 000

4.1 Setups

We consider five SBN models with different architectures on all the datasets. The first three are the standard SBN models: one hidden layer with 200 hidden units, two hidden layers with 200 hidden units at each layer, and three hidden layers with 200 hidden units at each layer. The other two are autoregressive SBN (ARSBN) models: one has a single hidden layer with 200 units and the other has two hidden layers with 200 units at each layer. We compare with the recent Gibbs sampling algorithm [11] using the authors' code, which was shown to be the best MCMC method for learning deep SBN models under a Bayesian setting.

The priors of the model parameters are set as stated in Section 3. The model parameters are randomly initialized by sampling from a zero-mean normal with small variance. The learning rate $\lambda_t$ is chosen from {0.1, 0.02, 0.01}, and we report the experiment with the best performance on the validation set. Following the suggestion of [8], the momentum decay parameter $\xi$ is chosen from {0.1, 0.05}. The minibatch size $|B_t|$ is set to 100 and the number of samples $L$ during training is set to 5, which is sufficiently large for all the tested models.

For Gibbs sampling, we follow the settings in [11] (with the initial 40 epochs as burn-in). For DSGHMC-Gibbs, we use deterministic termination criteria: the number of burn-in iterations is set to 140 for one-hidden-layer, 240 for two-hidden-layer and 340 for three-hidden-layer models. The number of burn-in steps of the Gibbs sampler for the posterior of latent variables is set to 5. For DSGHMC-AIS, due to the efficiency of sampling hidden layers, we use early stopping for better convergence. The parameters of the recognition model are updated using the Adam [15] optimizer with stepsizes of {0.001, }. ARSBN depends on the order of the input variables.
In our experiments, the ordering is simply determined by the original order in the dataset. For ARSBN, we use recognition models with ARSBN layers. See the Appendix for more details about the experimental setting.

4.2 Results

We present both quantitative results on predicting testing data and qualitative results on generating samples. We also analyze the time efficiency.

Predictive Performance

We first present predictive performance to examine the quality of posterior inference by our DSGHMC methods. To assess the quality, we report the average log-likelihood of the test data. For simplicity, one sample drawn from the posterior after convergence is used to evaluate the results. The log-likelihoods of models trained by DSGHMC-AIS are evaluated by importance sampling according to [6] with $L = 100{,}000$ samples. To evaluate the models trained by DSGHMC-Gibbs, which do not have a recognition model, we adopt the Annealed Importance Sampling [29] method, which provides an unbiased estimator of the log-likelihood. We observe that the two estimators always give similar results when applied to the same learned model (with a difference of less than 0.7). Details of the evaluation are provided in the Appendix.

Table 2: Predictive results on various datasets, where Dim denotes the number of latent variables in each layer, with the layer closest to the data on the left. Values surrounded by brackets are variational lower bounds; values without brackets are average test log-likelihoods. The results marked by * are taken from [11]. ( ) Discriminative fine-tuning is performed, which probably leads to better results. Most numeric entries are missing from this transcription; rows are listed in source order with the surviving values.
Model | Dim | MNIST | Caltech-101 | OCR | MNIST8M
SBN (Gibbs) | | | | |
SBN (Gibbs) | | | | |
SBN (VB)* | 200 | ( ) | ( ) | (-48.20) |
SBN (VB)* | | ( ) | ( ) | (-47.84) |
SBN (DSGHMC-Gibbs) | | | | |
SBN (DSGHMC-Gibbs) | | | | |
SBN (DSGHMC-Gibbs) | | | | |
SBN (DSGHMC-AIS) | | | | |
SBN (DSGHMC-AIS) | | | | |
SBN (DSGHMC-AIS) | | | | |
ARSBN (DSGHMC-Gibbs) | | | | |
ARSBN (DSGHMC-Gibbs) | | | | |
ARSBN (DSGHMC-AIS) | | | | |
ARSBN (DSGHMC-AIS) | | | | |
ARSBN (VB)* | 200 | ( ) | (-96.78) | (-37.97) |
ARSBN (VB)* | | ( ) | (-97.57) | (-38.56) |

Table 2 shows the average test log-likelihood on each of the four datasets, in comparison with the Gibbs sampler via data augmentation as well as the variational Bayesian (VB) [11] method for SBN and ARSBN [footnote 1]. We cite the results of VB methods from [11], which are reported as variational lower bounds.

We first examine the results for the SBN model with one hidden layer. One can observe that our method achieves similar or better results on all the datasets, compared to the Gibbs sampler. On MNIST, our method gives an average test log-likelihood of -102.9, which is a 12 nats improvement. Then, we utilize a second layer for SBN.
A greedy layer-wise pre-training requires generating samples of lower hidden layers as the input data for upper hidden layers. However, our method only generates a few samples for the mini-batch at each iteration, which is not enough for pre-training the next layer. Thus pre-training is not performed for any of the deep models. As can be seen, utilizing a second hidden layer improves the performance on all datasets. Furthermore, our method achieves consistently better results than the Gibbs sampler.

Footnote 1: For ARSBN, the work [11] only reported the results with variational Bayes.

We also explore ARSBNs with one hidden layer and two hidden layers. Our method on autoregressive structures gives improvements compared to standard SBN as expected, which suggests that our method works on different deep generative models.

We further investigate the performance of SBN models with three hidden layers. Training such models using Gibbs sampling is too time-consuming and thus is not included in Table 2. We can see that our method continues to improve the results as the model grows deeper, which shows that our methods scale better to large or deep structures than Gibbs sampling.

Table 3 shows a comparison of MNIST results for various training methods, including Neural Variational Inference and Learning (NVIL) [20], wake-sleep [12] and Reweighted Wake-Sleep (RWS) [6]. Our method gives better results than wake-sleep and achieves results comparable to RWS(Q:SBN), which uses the same recognition model as ours to approximate the posterior of latent variables.

We observe that on MNIST and OCR (see Table 2), DSGHMC-AIS gives better results than DSGHMC-Gibbs, while on Caltech-101 the opposite holds. Note that depending on the dataset being modeled, a Gibbs sampler for the posterior of latent variables can give poor samples when it has a low mixing rate,

while a neural adaptive importance sampler can give poor samples when the true posterior is far from what the recognition model can approximate. Thus, depending on the dataset being modeled, either of these effects may dominate, leading to the different performance of the two proposed methods. Using a more powerful recognition model may further improve the performance of DSGHMC-AIS [6]; we leave a systematic investigation as future work.

Finally, our efficient DSGHMC methods manage to learn the deep SBN models on the large-scale MNIST8M [18] dataset, which consists of 8.1 million training digit images generated by applying transformations to the standard MNIST training examples and has the same testing set as the standard MNIST dataset. MNIST8M is too large to be processed by the batch Gibbs sampler. Again, we can see in Table 2 that the testing log-likelihoods improve for SBN models as the model depth increases. The slightly worse results for ARSBN may be due to overfitting.

Table 3: MNIST results of various training methods on SBNs with various architectures. The results of NVIL are cited from [20]; the results of wake-sleep and RWS are cited from [6]. NVIL, wake-sleep and RWS(Q:SBN) are grouped under Variational Bayes; Gibbs, DSGHMC-Gibbs and DSGHMC-AIS under MCMC. Entries missing from this transcription are left blank.
Model | Dim | NVIL | wake-sleep | RWS(Q:SBN) | Gibbs | DSGHMC-Gibbs | DSGHMC-AIS
SBN | 200 | (-113.1) | -116.3 (-120.7) | | | |
SBN | | (-99.8) | -106.9 (-109.4) | | | |
SBN | | (-96.7) | -101.3 (-104.4) | | | |

Figure 1: Visualization: (a) Training data. (b) Samples from the SBN( ) model. (The probabilities for sampling each pixel are shown.) (c) Features at the bottom layer learned with a sparse prior. (d) Features at the bottom layer learned with a normal prior. (Top) MNIST. (Bottom) Caltech-101.

Generative Performance

Fig. 1 shows the generative ability of the learned SBN models. In Fig. 1(a), we show randomly sampled training data of MNIST and Caltech-101. Fig. 1(b) presents some random examples generated from 3-layer SBNs learned by DSGHMC-AIS.
One advantage of the Bayesian framework is that we can explicitly specify sparsity-encouraging priors on the model parameters, e.g., the Student's t prior in our experiments. Fig. 1(c) and Fig. 1(d) demonstrate the difference between features learned with sparse priors and non-sparse priors. We can see that the features learned with sparse priors appear more localized.

We further examine the ability of the learned models to predict missing data. For each test image, the lower half is assumed missing and the upper half is used to infer the hidden units [11]. Then, with the hidden units, the lower half is reconstructed. Prediction is done by repeating this procedure and adopting a majority vote for each pixel. Fig. 2 demonstrates some example completions for the missing data on MNIST.

Figure 2: Missing data prediction: (Top) Original data. (Middle) Hollowed data. (Bottom) Reconstructed data.

Figure 3: Convergence curves (average test log-likelihood) with respect to training time in seconds (a, c) and number of training data visited (b, d) on (a-b) MNIST and (c-d) OCR letters datasets, for Gibbs, DSGHMC-Gibbs and DSGHMC-AIS. In (a) and (c), DSGHMC-Gibbs(T) denotes DSGHMC-Gibbs implemented in Theano with GPU acceleration and DSGHMC-Gibbs(M) denotes DSGHMC-Gibbs in MATLAB code.

Table 4: Average training time per iteration using Gibbs and DSGHMC-Gibbs for a one-hidden-layer SBN (200 hidden units). Experiments in this table are conducted using MATLAB code for both methods on a PC with eight Intel Core i CPUs (3.40GHz). Entries missing from this transcription are left blank or truncated as in the source.
Datasets | train-size | Gibbs | DSGHMC-Gibbs
MNIST | 50, | s | 25.4s
Caltech-101 | 4, | s | 25.4s
OCR | 32, | s | 5.44s
MNIST8M | 8,100,000 | > s |

Time Efficiency

We compare the efficiency of Gibbs, DSGHMC-Gibbs and DSGHMC-AIS in this section. We implement both DSGHMC-Gibbs and DSGHMC-AIS in Theano [4] with GPU acceleration. For a fair comparison with Gibbs (MATLAB code by [11]), we also implement DSGHMC-Gibbs in MATLAB code.

Table 4 presents the efficiency of Gibbs and DSGHMC-Gibbs on different datasets. We can see that DSGHMC-Gibbs is faster than the Gibbs sampler on all four datasets, especially on the large-scale MNIST8M dataset, which is too time-consuming for the Gibbs sampler. Note that our DSGHMC-Gibbs spends almost identical time at each iteration on the MNIST, Caltech-101 and MNIST8M datasets, because all these datasets have the same dimensionality and the mini-batch size is fixed, so the per-iteration cost does not grow with the dataset size. This shows the scalability of the proposed method. We gain significant acceleration by using a GPU.
For example, on MNIST, the average training time (one-hidden-layer SBN) per iteration of DSGHMC-Gibbs and DSGHMC-AIS is 5.2s and 0.18s, respectively.

Fig. 3 shows the convergence curves on the MNIST and OCR datasets with respect to the training time and the number of training data visited. We can see that: 1) both DSGHMC-Gibbs and DSGHMC-AIS converge faster and to better results than Gibbs; 2) both methods visit fewer data points to converge, compared to Gibbs; and 3) DSGHMC-AIS converges faster than DSGHMC-Gibbs with respect to training time.

5 Conclusions and Future Work

We present doubly stochastic gradient MCMC, a simple and general method, to learn deep generative models in a Bayesian setting. When applied to deep sigmoid belief networks, our method manages to learn on large-scale datasets with good inference accuracy.

For future work, we would like to apply this method to learn even deeper belief networks. We are also interested in investigating the performance on learning sparse Bayesian models, which often involve intractable gradients that can be estimated by our doubly stochastic strategy. Finally, learning nonparametric Bayesian DGMs is another interesting challenge.

References

[1] R. Adams, H. Wallach, and Z. Ghahramani. Learning the structure of deep sparse graphical models. In AISTATS.
[2] S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.

[3] A. Armagan, D. Dunson, and M. Clyde. Generalized beta mixtures of Gaussians. In NIPS.
[4] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
[5] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In ICML.
[6] J. Bornschein and Y. Bengio. Reweighted wake-sleep. CoRR.
[7] L. Bottou. Online Algorithms and Stochastic Approximations. Online Learning and Neural Networks, edited by David Saad, Cambridge University Press, Cambridge, UK.
[8] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML.
[9] N. Ding, Y. Fang, R. Babbush, C. Chen, R. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS.
[10] Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML.
[11] Z. Gan, R. Henao, D. Carlson, and L. Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS.
[12] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214).
[13] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18.
[14] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. MLJ, 37(2).
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR.
[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR.
[17] M. Lichman. UCI machine learning repository.
[18] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, Cambridge, MA.
[19] B. Marlin, K. Swersky, B. Chen, and N. Freitas. Inductive principles for restricted Boltzmann machine learning. In AISTATS.
[20] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML.
[21] A. Mohamed, G. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. IEEE Trans. on Audio, Speech, and Language Processing, 20(1):14-22.
[22] R. Neal. Connectionist learning of belief networks. Artif. Intell., 56(1):71-113.
[23] A. B. Owen. Monte Carlo theory, methods and examples.
[24] S. Patterson and Y. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS.
[25] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML.
[26] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3).
[27] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer.
[28] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS.
[29] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In ICML.
[30] L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of AI Research, 4:61-76.
[31] I. Titov and J. Henderson. Constituent parsing with incremental sigmoid belief networks. In ACL.
[32] M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

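Appendix

In this appendix, we first provide the derivation of the Gibbs sampler for DSGHMC-Gibbs. We then give details on how we evaluate the learned models. Finally, we describe some additional experimental settings.

A Derivations

A.1 Gibbs sampler for DSBN latent layers

We sample the latent variables layer by layer and dimension-wise within each layer. For convenience we define $z^{(0)} = x$, so that $p(z_j^{(l)} \mid z_{-j}^{(l)}, x, z^{(-l)})$ can be written as $p(z_j^{(l)} \mid z_{-j}^{(l)}, z^{(-l)})$. We have

$$
\begin{aligned}
p(z_j^{(l)} \mid z_{-j}^{(l)}, z^{(-l)})
&= p(z_j^{(l)} \mid z_{-j}^{(l)}, z^{(<l)}, z^{(>l)}) \\
&\propto p(z^{(<l)} \mid z_j^{(l)}, z_{-j}^{(l)}, z^{(>l)}) \, p(z_j^{(l)} \mid z_{-j}^{(l)}, z^{(>l)}) \\
&= p(z^{(<l)} \mid z^{(l)}) \, p(z_j^{(l)} \mid z^{(l+1)}) \\
&= p(z^{(l-1)} \mid z^{(l)}) \, p(z^{(<(l-1))} \mid z^{(l-1)}, z^{(l)}) \, p(z_j^{(l)} \mid z^{(l+1)}) \\
&\propto p(z^{(l-1)} \mid z^{(l)}) \, p(z_j^{(l)} \mid z^{(l+1)}) \\
&= \prod_{i=1}^{J^{(l-1)}} \exp\Big[ (w_i^{(l)\top} z^{(l)} + c_i^{(l)}) z_i^{(l-1)} - \log\big(1 + e^{w_i^{(l)\top} z^{(l)} + c_i^{(l)}}\big) \Big] \\
&\quad \times \exp\Big[ (w_j^{(l+1)\top} z^{(l+1)} + c_j^{(l+1)}) z_j^{(l)} - \log\big(1 + e^{w_j^{(l+1)\top} z^{(l+1)} + c_j^{(l+1)}}\big) \Big] \\
&\propto \exp\Big[ \big( W_{:j}^{(l)\top} z^{(l-1)} + (w_j^{(l+1)\top} z^{(l+1)} + c_j^{(l+1)}) \big) z_j^{(l)} - \sum_{i=1}^{J^{(l-1)}} \log\big(1 + e^{w_i^{(l)\top} z^{(l)} + c_i^{(l)}}\big) \Big].
\end{aligned}
$$

The last step drops every factor that does not depend on $z_j^{(l)}$. For $l = K$ there are no upper layers, and the term $w_j^{(l+1)\top} z^{(l+1)} + c_j^{(l+1)}$, which expresses the confidence coming from the upper layer, is replaced by the top-layer prior parameter $b_j$. For notational convenience we define $W_{:j}^{(l)} = [w_{1j}^{(l)}, \dots, w_{J^{(l-1)} j}^{(l)}]^\top$, $b_j^{(l+1)} = w_j^{(l+1)\top} z^{(l+1)} + c_j^{(l+1)}$ for $l < K$, and $b^{(K+1)} = b$. Finally we get

$$
p(z_j^{(l)} \mid z_{-j}^{(l)}, z^{(-l)}) \propto \exp\Big[ \big( W_{:j}^{(l)\top} z^{(l-1)} + b_j^{(l+1)} \big) z_j^{(l)} - \sum_{i=1}^{J^{(l-1)}} \log\big(1 + e^{w_i^{(l)\top} z^{(l)} + c_i^{(l)}}\big) \Big], \tag{15}
$$

for $l = 1, \dots, K$.

A Gibbs sampler for DARSBN can be derived in a similar way.

B Evaluation

In our experiments, all models are evaluated in terms of average test log-likelihood. We now present the details of the evaluation.

B.1 Importance sampling

For the models trained by DSGHMC-AIS, we can evaluate the log-likelihood using importance sampling [6]. Specifically, we have the unbiased estimator

$$
p(x \mid \theta) = \mathbb{E}_{p(z \mid x, \theta)}\big[ p(x \mid \theta) \big] = \mathbb{E}_{q(z \mid x; \phi)}\Big[ \frac{p(z \mid x, \theta)}{q(z \mid x; \phi)} \, p(x \mid \theta) \Big] \approx \frac{1}{L} \sum_{s=1}^{L} \frac{p(x, z^{(s)} \mid \theta)}{q(z^{(s)} \mid x; \phi)}, \tag{16}
$$

where $\{z^{(s)}\}_{s=1}^{L}$ are samples from the proposal distribution $q(z \mid x; \phi)$.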
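The estimator in Eq. (16) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a one-layer sigmoid belief network with hand-picked parameters, and it replaces the learned recognition model $q(z \mid x; \phi)$ with a fixed factorized Bernoulli proposal. All function names (`log_joint`, `is_log_likelihood`, and so on) are hypothetical. The average of the importance weights is taken in log space to avoid underflow.

```python
import math
import random


def log1pexp(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def log_mean_exp(values):
    """Stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def log_joint(x, z, W, c, b):
    """log p(x, z | theta) for a one-layer SBN (an assumed toy model).

    z_j ~ Bernoulli(sigmoid(b_j)); x_i ~ Bernoulli(sigmoid(w_i . z + c_i)),
    where W is a list of rows w_i.
    """
    lp = sum(bj * zj - log1pexp(bj) for bj, zj in zip(b, z))
    for xi, w_i, ci in zip(x, W, c):
        a = sum(w * zj for w, zj in zip(w_i, z)) + ci
        lp += a * xi - log1pexp(a)
    return lp


def is_log_likelihood(x, W, c, b, q_probs, L, rng):
    """Logarithm of the Eq. (16) estimator.

    q_probs[j] is q(z_j = 1 | x); a fixed factorized proposal stands in
    for the recognition network (an assumption of this sketch).
    """
    log_weights = []
    for _ in range(L):
        z = [1 if rng.random() < q else 0 for q in q_probs]
        log_q = sum(math.log(q if zj else 1.0 - q) for zj, q in zip(z, q_probs))
        log_weights.append(log_joint(x, z, W, c, b) - log_q)
    return log_mean_exp(log_weights)


if __name__ == "__main__":
    W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]  # 3 visible units, 2 latent units
    c = [0.1, -0.2, 0.0]
    b = [0.3, -0.4]
    x = [1, 0, 1]
    print(is_log_likelihood(x, W, c, b, [0.5, 0.5], 1000, random.Random(0)))
```

For a model this small the result can be compared against the exact value obtained by summing `exp(log_joint(...))` over all four latent configurations.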
In the experiments, we take the logarithm of the above estimator to compute the log-likelihood. Taking the logarithm makes the estimator biased [6], but both the bias and the variance decrease as the number of samples $L$ increases. In our experiments we use $L = 100{,}000$, which is shown to be sufficiently large.

B.2 Annealed importance sampling

For the models trained by DSGHMC-Gibbs or by the Gibbs sampling method of [11], the above estimator is not applicable since no recognition model is trained. Noting that $p(z \mid x, \theta) = p(x, z \mid \theta) / p(x \mid \theta)$, we can view $p(x, z \mid \theta)$ as an unnormalized version of $p(z \mid x, \theta)$, with $p(x \mid \theta)$ as the normalizing constant. We adopt annealed importance sampling (AIS) [29], which estimates a ratio of normalizing constants accurately by reducing the variance of the estimate via a sequence of intermediate distributions. See [29] for details.

For our DSBN, we use AIS to compute the likelihood $p(x \mid \theta)$ by computing the ratio of $p(x \mid \theta)$ to $p(x \mid \theta = 0)$, where $p(x \mid \theta = 0)$ is the likelihood with all model parameters set to 0. Note that $\theta = 0$ implies that the likelihood is the constant $p(x \mid \theta = 0) = 2^{-J^{(0)}}$, where $J^{(0)}$ is the dimension of $x$, since each dimension of $x$ is 0 or 1 with equal probability. Thus computing the ratio $p(x \mid \theta) / p(x \mid \theta = 0)$ is equivalent to computing the likelihood $p(x \mid \theta)$. To compute the ratio, we introduce a sequence of intermediate distributions $p_0(z) = p(z, x \mid \theta = 0), \dots, p_k(z) = p(z, x \mid \tfrac{k}{K}\theta), \dots, p_K(z) = p(x, z \mid \theta)$ for $k = 0, \dots, K$, where $K$ is the number of intermediate distributions and $\tfrac{k}{K}\theta$ denotes the model parameters multiplied by $\tfrac{k}{K}$. We then estimate $p(x \mid \theta) / p(x \mid \theta = 0)$ by

$$
\frac{p_1(z_1)}{p_0(z_1)} \, \frac{p_2(z_2)}{p_1(z_2)} \cdots \frac{p_{K-1}(z_{K-1})}{p_{K-2}(z_{K-1})} \, \frac{p_K(z_K)}{p_{K-1}(z_K)}, \tag{17}
$$

where $z_0$ is sampled from $p_0(z)$, and $z_{k+1}$ is sampled from the Gibbs sampler of Eq. (15), assuming that the current latent layers are $z_k$ and the model is $\tfrac{k}{K}\theta$. In our experiments we choose $K = 1000$ and evaluate the log-likelihood by computing Eq. (17) repeatedly and taking the average.

C More Detailed Experimental Settings

In Eq. (4), the gradient $\nabla_\theta U(\theta)$ depends on the size of the data set, so different datasets would require different learning rates. In our experiments, instead of summing the gradient over all the data $x_n$, we compute the per-datum gradient $\nabla_\theta U(\theta) / |D|$ so that the same learning rate can be used. For the majority of our experiments a learning rate of 0.01 gives the best results. One parameter not mentioned in the main paper is the number of discretization steps used when simulating the dynamics; in our experiments we fix it to 10. The Student-t prior used in the experiments has scale parameter $\sigma = 0.09$, location parameter $\mu = 0$ and degrees of freedom $\nu = 2.2$.
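The AIS procedure of Appendix B.2 can be sketched for the simplest case of a single latent layer, where Eq. (15) uses $b^{(K+1)} = b$ directly. This is a toy illustration under stated assumptions rather than the authors' code: the parameters are hand-picked, all function names are hypothetical, and `K` in the code is the number of intermediate distributions (not the number of layers). The source does not say whether the ratios or their logarithms are averaged over runs; this sketch averages the ratios, using log-mean-exp for stability.

```python
import math
import random


def log1pexp(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def log_mean_exp(values):
    """Stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def log_joint(x, z, W, c, b, beta):
    """log p(x, z | beta * theta) for a one-layer SBN with theta = (W, c, b)."""
    lp = sum(beta * bj * zj - log1pexp(beta * bj) for bj, zj in zip(b, z))
    for xi, w_i, ci in zip(x, W, c):
        a = beta * (sum(w * zj for w, zj in zip(w_i, z)) + ci)
        lp += a * xi - log1pexp(a)
    return lp


def gibbs_logit(j, z, x, W, c, b, beta):
    """Log-odds of z_j = 1 given the rest, read off Eq. (15) with one layer."""
    def exponent(zj):
        z_try = z[:j] + [zj] + z[j + 1:]
        linear = beta * (sum(W[i][j] * x[i] for i in range(len(x))) + b[j]) * zj
        norm = sum(
            log1pexp(beta * (sum(w * zk for w, zk in zip(W[i], z_try)) + c[i]))
            for i in range(len(x))
        )
        return linear - norm
    return exponent(1) - exponent(0)


def gibbs_sweep(z, x, W, c, b, beta, rng):
    """One dimension-wise Gibbs sweep; leaves p(z | x, beta * theta) invariant."""
    z = list(z)
    for j in range(len(z)):
        p_one = math.exp(-log1pexp(-gibbs_logit(j, z, x, W, c, b, beta)))
        z[j] = 1 if rng.random() < p_one else 0
    return z


def ais_log_ratio(x, W, c, b, K, rng):
    """Log of one draw of the product in Eq. (17)."""
    z = [rng.randint(0, 1) for _ in b]  # exact sample from p_0 (uniform)
    log_w = 0.0
    for k in range(1, K + 1):
        log_w += log_joint(x, z, W, c, b, k / K) - log_joint(x, z, W, c, b, (k - 1) / K)
        if k < K:
            z = gibbs_sweep(z, x, W, c, b, k / K, rng)
    return log_w


def ais_log_likelihood(x, W, c, b, K, runs, rng):
    """Estimate log p(x | theta) = log ratio + log p(x | theta = 0)."""
    log_ratios = [ais_log_ratio(x, W, c, b, K, rng) for _ in range(runs)]
    return log_mean_exp(log_ratios) - len(x) * math.log(2.0)
```

The last line adds $\log p(x \mid \theta = 0) = -J^{(0)} \log 2$ back to the estimated log-ratio. On a model with two latent units the result can be checked against exact enumeration of all latent configurations.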


More information

New Streamfunction Approach for Magnetohydrodynamics

New Streamfunction Approach for Magnetohydrodynamics New Streamfunction Approac for Magnetoydrodynamics Kab Seo Kang Brooaven National Laboratory, Computational Science Center, Building 63, Room, Upton NY 973, USA. sang@bnl.gov Summary. We apply te finite

More information

lecture 26: Richardson extrapolation

lecture 26: Richardson extrapolation 43 lecture 26: Ricardson extrapolation 35 Ricardson extrapolation, Romberg integration Trougout numerical analysis, one encounters procedures tat apply some simple approximation (eg, linear interpolation)

More information

Online Learning: Bandit Setting

Online Learning: Bandit Setting Online Learning: Bandit Setting Daniel asabi Summer 04 Last Update: October 0, 06 Introduction [TODO Bandits. Stocastic setting Suppose tere exists unknown distributions ν,..., ν, suc tat te loss at eac

More information

Learning to Reject Sequential Importance Steps for Continuous-Time Bayesian Networks

Learning to Reject Sequential Importance Steps for Continuous-Time Bayesian Networks Learning to Reject Sequential Importance Steps for Continuous-Time Bayesian Networks Jeremy C. Weiss University of Wisconsin-Madison Madison, WI, US jcweiss@cs.wisc.edu Sriraam Natarajan Indiana University

More information

A Nonparametric Prior for Simultaneous Covariance Estimation

A Nonparametric Prior for Simultaneous Covariance Estimation WEB APPENDIX FOR A Nonparametric Prior for Simultaneous Covariance Estimation J. T. Gaskins and M. J. Daniels Appendix : Derivation of Teoretical Properties Tis appendix contains proof for te properties

More information

Variational Dropout and the Local Reparameterization Trick

Variational Dropout and the Local Reparameterization Trick Variational ropout and the Local Reparameterization Trick iederik P. Kingma, Tim Salimans and Max Welling Machine Learning Group, University of Amsterdam Algoritmica University of California, Irvine, and

More information

An Empirical Bayesian interpretation and generalization of NL-means

An Empirical Bayesian interpretation and generalization of NL-means Computer Science Tecnical Report TR2010-934, October 2010 Courant Institute of Matematical Sciences, New York University ttp://cs.nyu.edu/web/researc/tecreports/reports.tml An Empirical Bayesian interpretation

More information

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Statistica Sinica 24 2014, 395-414 doi:ttp://dx.doi.org/10.5705/ss.2012.064 EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Jun Sao 1,2 and Seng Wang 3 1 East Cina Normal University,

More information

Handling Missing Data on Asymmetric Distribution

Handling Missing Data on Asymmetric Distribution International Matematical Forum, Vol. 8, 03, no. 4, 53-65 Handling Missing Data on Asymmetric Distribution Amad M. H. Al-Kazale Department of Matematics, Faculty of Science Al-albayt University, Al-Mafraq-Jordan

More information

Boosting Kernel Density Estimates: a Bias Reduction. Technique?

Boosting Kernel Density Estimates: a Bias Reduction. Technique? Boosting Kernel Density Estimates: a Bias Reduction Tecnique? Marco Di Marzio Dipartimento di Metodi Quantitativi e Teoria Economica, Università di Cieti-Pescara, Viale Pindaro 42, 65127 Pescara, Italy

More information

Deep Generative Models. (Unsupervised Learning)

Deep Generative Models. (Unsupervised Learning) Deep Generative Models (Unsupervised Learning) CEng 783 Deep Learning Fall 2017 Emre Akbaş Reminders Next week: project progress demos in class Describe your problem/goal What you have done so far What

More information

MATH 1020 Answer Key TEST 2 VERSION B Fall Printed Name: Section #: Instructor:

MATH 1020 Answer Key TEST 2 VERSION B Fall Printed Name: Section #: Instructor: Printed Name: Section #: Instructor: Please do not ask questions during tis exam. If you consider a question to be ambiguous, state your assumptions in te margin and do te best you can to provide te correct

More information

Sparse Additive Matrix Factorization for Robust PCA and Its Generalization

Sparse Additive Matrix Factorization for Robust PCA and Its Generalization JMLR: Worksop and Conference Proceedings 25:1 16 2012 CML 2012 Sparse dditive Matrix Factorization for Robust PC and Its Generalization Sinici Nakajima Nikon Corporation Tokyo 140-8601 Japan Masasi Sugiyama

More information

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient

More information

Learning Deep Architectures for AI. Part II - Vijay Chakilam

Learning Deep Architectures for AI. Part II - Vijay Chakilam Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam Limitations of Perceptron x1 W, b 0,1 1,1 y x2 weight plane output =1 output =0 There is no value for W and b such that the model

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

Cubic Functions: Local Analysis

Cubic Functions: Local Analysis Cubic function cubing coefficient Capter 13 Cubic Functions: Local Analysis Input-Output Pairs, 378 Normalized Input-Output Rule, 380 Local I-O Rule Near, 382 Local Grap Near, 384 Types of Local Graps

More information