Supplementary Material of High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models
Chunyuan Li 1, Changyou Chen 1, Kai Fan 2 and Lawrence Carin 1
1 Department of Electrical and Computer Engineering, Duke University
2 Computational Biology and Bioinformatics, Duke University
chunyuan.li@duke.edu, cchangyou@gmail.com, kai.fan@duke.edu, lcarin@duke.edu

A The proof of Lemma 1

Proof. In mSGNHT, $X_t = (\theta_t, p_t, \xi_t)$. The update equations with a Euler integrator are:

$$\theta_{t+1} = \theta_t + p_t h,$$
$$p_{t+1} = p_t - \nabla_\theta \tilde{U}_t(\theta_{t+1})\, h - \mathrm{diag}(\xi_t)\, p_t h + \sqrt{2Dh}\, \zeta_{t+1},$$
$$\xi_{t+1} = \xi_t + (p_{t+1} \odot p_{t+1} - 1)\, h.$$

Based on the update equations, it is easily seen that the corresponding Kolmogorov operator $P_h^l$ for mSGNHT is

$$P_h^l = e^{h L_1}\, e^{h L_2}\, e^{h L_3}, \qquad (1)$$

where $L_1 \triangleq (p \odot p - 1) \cdot \nabla_\xi$, $L_2 \triangleq -\mathrm{diag}(\xi)\, p_t \cdot \nabla_p - \nabla_\theta \tilde{U}_l(\theta) \cdot \nabla_p + 2D\, I : \nabla_p \nabla_p^\top$, and $L_3 \triangleq p \cdot \nabla_\theta$. Using the BCH formula, we have

$$P_h^l = e^{h (L_1 + L_2 + L_3)} + O(h^2). \qquad (2)$$

On the other hand, the local generator of mSGNHT at the $t$-th iteration can be seen to be

$$\tilde{L}_t = L_1 + \tilde{L}_2 + L_3, \qquad (3)$$

where $L_1$ and $L_3$ are defined previously, and $\tilde{L}_2 = -\mathrm{diag}(\xi)\, p \cdot \nabla_p - \nabla_\theta \tilde{U}_l(\theta) \cdot \nabla_p + 2D\, I : \nabla_p \nabla_p^\top$. According to Kolmogorov's backward equation, we have

$$\mathbb{E}[f(X_{t+1})] = e^{h \tilde{L}_t} f(X_t). \qquad (4)$$

Substituting (4) into (2) and using the fact that $p = p_t + O(h)$ by Taylor expansion, it is easily seen that

$$P_h^l = e^{h \tilde{L}_t + O(h^2)} + O(h^2) = e^{h \tilde{L}_t} + O(h^2).$$

This completes the proof.

B The proof of Lemma 2

Proof. In the symmetric splitting scheme for mSGNHT, according to the splitting in (4) of the main text, the generator $\tilde{L}_l$ is split into the following sub-generators, each of which can be solved analytically: $\tilde{L}_l = L_A + L_B + L_{O_l}$, where

$$A:\ L_A = p \cdot \nabla_\theta + (p \odot p - 1) \cdot \nabla_\xi,$$
$$B:\ L_B = -\mathrm{diag}(\xi)\, p \cdot \nabla_p,$$
$$O_l:\ L_{O_l} = -\nabla_\theta \tilde{U}_l(\theta) \cdot \nabla_p + 2D\, I : \nabla_p \nabla_p^\top.$$

The corresponding Kolmogorov operator $P_h^l$ for the splitting integrator can be seen to be

$$P_h^l \triangleq e^{\frac{h}{2} L_A}\, e^{\frac{h}{2} L_B}\, e^{h L_{O_l}}\, e^{\frac{h}{2} L_B}\, e^{\frac{h}{2} L_A}.$$

In the following we use the Baker–Campbell–Hausdorff (BCH) formula (Rossmann 2002) to show that $P_h^l$ is a 2nd-order integrator. Write $A \triangleq L_A$, $B \triangleq L_B$, and $O_l \triangleq L_{O_l}$. The BCH formula gives

$$e^{\frac{h}{2} A} e^{\frac{h}{2} B} = e^{\frac{h}{2} A + \frac{h}{2} B + \frac{h^2}{8}[A,B] + \frac{h^3}{96}\left([A,[A,B]] + [B,[B,A]]\right) + \cdots} \qquad (5)$$
$$= e^{\frac{h}{2} A + \frac{h}{2} B + \frac{h^2}{8}[A,B]} + O(h^3), \qquad (6)$$

where $[X, Y] \triangleq XY - YX$ is the commutator of $X$ and $Y$; (5) follows from the BCH formula, and (6) follows by moving the high-order terms $O(h^3)$ out of the exponential map using a Taylor expansion. Composing the factors of $P_h^l$ from the right and applying (5)–(6) repeatedly, we have

$$e^{h O_l}\, e^{\frac{h}{2} B}\, e^{\frac{h}{2} A} = e^{h O_l}\left(e^{\frac{h}{2} A + \frac{h}{2} B + \frac{h^2}{8}[B,A]} + O(h^3)\right) = e^{h O_l + \frac{h}{2} A + \frac{h}{2} B + \frac{h^2}{8}[B,A] + \frac{h^2}{4}[O_l,A] + \frac{h^2}{4}[O_l,B]} + O(h^3),$$

then

$$e^{\frac{h}{2} B}\, e^{h O_l}\, e^{\frac{h}{2} B}\, e^{\frac{h}{2} A} = e^{h O_l + \frac{h}{2} A + h B + \frac{h^2}{4}[B,A] + \frac{h^2}{4}[O_l,A]} + O(h^3),$$

and finally

$$P_h^l = e^{\frac{h}{2} A}\left(e^{h O_l + \frac{h}{2} A + h B + \frac{h^2}{4}[B,A] + \frac{h^2}{4}[O_l,A]} + O(h^3)\right) = e^{h(L_A + L_B + L_{O_l})} + O(h^3),$$

where all the $O(h^2)$ commutator terms cancel by the symmetry of the composition. As a result,

$$P_h^l = e^{h(L + \Delta V_l)} + O(h^3) = e^{h \tilde{L}_l} + O(h^3),$$

where $L$ is the generator with the exact gradient and $\Delta V_l \triangleq (\nabla_\theta \tilde{U}_l - \nabla_\theta U) \cdot \nabla_p$ accounts for the stochastic-gradient noise (see (7) below). This completes the proof.
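To make the splitting concrete, below is a minimal NumPy sketch of a single update under the $A$–$B$–$O_l$ decomposition above: the $A$ and $B$ sub-dynamics are solved exactly, while the $O_l$ part uses the stochastic gradient. The function name and the `grad_u_tilde` callback are illustrative choices of ours, not code released with the paper.

```python
import numpy as np

def msgnht_splitting_step(theta, p, xi, grad_u_tilde, h, D, rng=np.random.default_rng()):
    """One mSGNHT update with the symmetric A-B-O_l-B-A splitting.

    theta, p, xi : parameter, momentum, and thermostat arrays of equal shape.
    grad_u_tilde : callable returning the stochastic gradient of U at theta.
    h, D         : step size and diffusion factor.
    """
    # A (half step): d(theta) = p dt and d(xi) = (p*p - 1) dt, with p fixed.
    theta = theta + p * (h / 2)
    xi = xi + (p * p - 1.0) * (h / 2)
    # B (half step): d(p) = -diag(xi) p dt has the exact solution below.
    p = np.exp(-xi * (h / 2)) * p
    # O_l (full step): stochastic-gradient force plus injected noise.
    p = p - grad_u_tilde(theta) * h + np.sqrt(2.0 * D * h) * rng.standard_normal(p.shape)
    # B and A again (half steps), mirroring the first two sub-steps.
    p = np.exp(-xi * (h / 2)) * p
    theta = theta + p * (h / 2)
    xi = xi + (p * p - 1.0) * (h / 2)
    return theta, p, xi
```

Because the $B$ sub-step is solved exactly rather than by an Euler approximation, the palindromic composition above is what yields the $O(h^3)$ local error established in the proof.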
C More details on Lemma 3

Our justification of the symmetric splitting integrator is based on Lemma 3, which is a simplification of the main theorems in (Chen, Ding, and Carin 2015). For completeness, we give details of their main theorems in this section. To recap notation, for an Itô diffusion with an invariant measure $\rho(x)$, the posterior average is defined as $\bar{\phi} \triangleq \int_{\mathcal{X}} \phi(x)\, \rho(x)\, dx$ for some test function $\phi(x)$ of interest. Given samples $(x_t)_{t=1}^{T}$ from an SG-MCMC algorithm, we use the sample average $\hat{\phi} = \frac{1}{T} \sum_{t=1}^{T} \phi(x_t)$ to approximate $\bar{\phi}$. In addition, we define an operator for the $t$-th iteration as

$$\Delta V_t \triangleq (\nabla_\theta \tilde{U}_t - \nabla_\theta U) \cdot \nabla_p. \qquad (7)$$

Theorem 1 and Theorem 2 summarize the convergence of an SG-MCMC algorithm with a $K$-th order integrator with respect to the bias and MSE, under certain assumptions. Please refer to (Chen, Ding, and Carin 2015) for detailed proofs.

Theorem 1. Let $\|\cdot\|$ be the operator norm. The bias of an SG-MCMC algorithm with a $K$-th order integrator at time $\mathcal{T} = Th$ can be bounded as:

$$\left|\mathbb{E}\hat{\phi} - \bar{\phi}\right| = O\!\left(\frac{1}{Th} + \frac{\sum_t \|\mathbb{E}\Delta V_t\|}{T} + h^K\right).$$

Theorem 2. For a smooth test function $\phi$, the MSE of an SG-MCMC algorithm with a $K$-th order integrator at time $\mathcal{T} = Th$ is bounded, for some $C > 0$ independent of $(T, h)$, as

$$\mathbb{E}\left(\hat{\phi} - \bar{\phi}\right)^2 \le C\left(\frac{\sum_t \mathbb{E}\|\Delta V_t\|^2}{T^2} + \frac{1}{Th} + h^{2K}\right).$$

We simplify Theorem 1 and Theorem 2 to Lemma 3, where the functions $B_{\mathrm{bias}} \triangleq O\!\big(\frac{1}{Th} + \frac{\sum_t \|\mathbb{E}\Delta V_t\|}{T}\big)$ and $B_{\mathrm{mse}} \triangleq C\big(\frac{\sum_t \mathbb{E}\|\Delta V_t\|^2}{T^2} + \frac{1}{Th}\big)$ are independent of the order of the integrator. In addition, from Theorem 1 and Theorem 2 we can obtain the optimal convergence rate of an SG-MCMC algorithm with respect to the bias and MSE by optimizing the bounds. Specifically, the optimal convergence rate of the bias for the Euler integrator is $T^{-1/2}$, attained with optimal stepsize $h \propto T^{-1/2}$, while it is $T^{-2/3}$ for the symmetric splitting integrator, with optimal stepsize $h \propto T^{-1/3}$. For the MSE, the rate for the Euler integrator is $T^{-2/3}$ with optimal stepsize $h \propto T^{-1/3}$, compared to $T^{-4/5}$ with optimal stepsize $h \propto T^{-1/5}$ for the symmetric splitting integrator.

D Latent Dirichlet Allocation

Following (Ding et al. 2014), this section describes the semi-collapsed posterior of the LDA model and the Expanded-Natural representation of the probability simplexes used in (Patterson and Teh 2013). Let $W = \{w_{jv}\}$ be the observed words and $Z = \{z_{jv}\}$ the topic indicator variables, where $j$ indexes the documents and $v$ indexes the words. Let $\pi = (\pi_{kw})$ be the topic-word distributions, and $n_{jkw}$ the number of occurrences of word $w$ in document $j$ allocated to topic $k$; a dot denotes a marginal sum, i.e., $n_{jk\cdot} = \sum_w n_{jkw}$. The semi-collapsed posterior of the LDA model is

$$p(W, Z, \pi \mid \alpha, \tau) = p(\pi \mid \tau) \prod_{j=1}^{J} p(w_j, z_j \mid \alpha, \pi), \qquad (8)$$

where $J$ is the number of documents, $\alpha$ is the parameter of the Dirichlet prior on the topic distribution of each document, $\tau$ is the parameter of the prior on $\pi$, and

$$p(w_j, z_j \mid \alpha, \pi) = \prod_{k=1}^{K} \frac{\Gamma(\alpha + n_{jk\cdot})}{\Gamma(\alpha)} \prod_{w=1}^{W} \pi_{kw}^{n_{jkw}}. \qquad (9)$$

The Expanded-Natural representation of the simplexes $\pi_k$ from (Patterson and Teh 2013) is used, where

$$\pi_{kw} = \frac{e^{\theta_{kw}}}{\sum_{w'} e^{\theta_{kw'}}}. \qquad (10)$$

Following (Ding et al. 2014), a Gaussian prior on $\theta_{kw}$ is adopted, $p(\theta_{kw} \mid \tau = \{\beta, \sigma\}) = \mathcal{N}(\theta_{kw}; \beta, \sigma^2)$. The stochastic gradient of the log-posterior with respect to $\theta_{kw}$ on a mini-batch $S_t$ becomes

$$\frac{\partial \tilde{U}(\theta)}{\partial \theta_{kw}} = -\frac{\partial \log p(\theta \mid W, \tau, \alpha)}{\partial \theta_{kw}} = \frac{\theta_{kw} - \beta}{\sigma^2} - \frac{J}{|S_t|} \sum_{j \in S_t} \mathbb{E}_{z_j \mid w_j, \theta, \alpha}\left[n_{jkw} - \pi_{kw}\, n_{jk\cdot}\right] \qquad (11)$$

for the $t$-th iteration, where $S_t \subset \{1, 2, \ldots, J\}$ and $|\cdot|$ is the cardinality of a set. When $\sigma = 1$, we obtain the same stochastic gradient as in (Patterson and Teh 2013), derived there using the Riemannian manifold. To calculate the expectation term, we use the same Gibbs sampling method as in (Patterson and Teh 2013),

$$p(z_{jv} = k \mid w_j, \theta, \alpha) = \frac{\left(\alpha + n_{jk\cdot}^{\backslash v}\right) e^{\theta_{k w_{jv}}}}{\sum_{k'} \left(\alpha + n_{jk'\cdot}^{\backslash v}\right) e^{\theta_{k' w_{jv}}}}, \qquad (12)$$

where $\backslash v$ denotes the count excluding the $v$-th topic assignment variable. The expectation is estimated from the samples.
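To illustrate how (10) and (11) combine in practice, here is a minimal NumPy sketch of the mini-batch gradient, assuming the expectations $\mathbb{E}[n_{jkw}]$ and $\mathbb{E}[n_{jk\cdot}]$ have already been estimated from Gibbs samples drawn according to (12); the function name and argument layout are ours, for illustration only.

```python
import numpy as np

def lda_theta_grad(theta, beta, sigma2, E_njkw, E_njk, J):
    """Stochastic gradient of U w.r.t. theta (K x V), following Eq. (11).

    E_njkw : (batch, K, V) expected counts E[n_jkw] from Gibbs samples, Eq. (12).
    E_njk  : (batch, K)    expected marginal counts E[n_jk.].
    J      : total number of documents (the batch term is rescaled by J / |S_t|).
    """
    # Eq. (10): row-wise softmax, stabilized by subtracting the row maximum.
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    # Data term of Eq. (11), summed over the mini-batch.
    data_term = (E_njkw - E_njk[:, :, None] * pi[None, :, :]).sum(axis=0)
    batch_size = E_njkw.shape[0]
    return (theta - beta) / sigma2 - (J / batch_size) * data_term
```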
Running time. For the results of LDA on the ICML dataset reported in the main text, mSGNHT, SGNHT, and Gibbs sampling are run for the same number of iterations; the running times are 6.99, 8.55 and 56.2 seconds, respectively.

E Logistic Regression

For each datum $\{x, y\}$, $x \in \mathbb{R}^P$ is the input and $y \in \{0, 1\}$ is the label. Logistic regression models the probability

$$p(y = 1 \mid x) \triangleq g_\theta(x) = \frac{1}{1 + \exp\left(-(Wx + c)\right)}. \qquad (13)$$

A Gaussian prior $\mathcal{N}(0, \sigma^2 I)$ is placed on the model parameters $\theta = \{W, c\}$; we set $\sigma^2 = 1$ in our experiments. We study the effectiveness of mSGNHT for different $h$ and $D$. Learning curves of test errors and training log-likelihoods are shown in Fig. 1. Generally, the performance of the proposed mSGNHT-S is consistently more stable than that of mSGNHT-E and SGHMC, across varying $h$ and $D$.
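As a worked example of the potential used here, the following sketch computes the mini-batch gradient of $U$ (the negative log-posterior) for the model (13); the names are ours, and the gradient is written for a weight vector $W \in \mathbb{R}^P$ and a scalar bias $c$.

```python
import numpy as np

def logreg_grad_u(W, c, X, y, sigma2, J):
    """Mini-batch gradient of U = -log posterior for Eq. (13).

    X : (n, P) mini-batch inputs; y : (n,) labels in {0, 1};
    J : full dataset size, so the likelihood term is rescaled by J / n.
    """
    prob = 1.0 / (1.0 + np.exp(-(X @ W + c)))   # g_theta(x) for each row of X
    err = prob - y                              # gradient of -log p(y|x) w.r.t. the logit
    grad_W = (J / len(y)) * (X.T @ err) + W / sigma2   # likelihood term + Gaussian prior
    grad_c = (J / len(y)) * err.sum() + c / sigma2
    return grad_W, grad_c
```

Gradients of this form are what the sampler sketched after the proof of Lemma 2 consumes as `grad_u_tilde`, with $(W, c)$ flattened into a single parameter vector.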
Figure 1: Learning curves for logistic regression on the a9a dataset for varying $h$ and $D$. Column 1 shares the same $h$; columns 2–4 share the same $D$. Top row: test error (%); bottom row: training log-likelihood.

Figure 2: Learning curves for FNN (ReLU link) on the MNIST dataset for varying stepsize $h \in \{2 \times 10^{-4}, 10^{-4}, 5 \times 10^{-5}\}$. Step size decreases top-down.

Figure 3: Learning curves for FNN (sigmoid link) on the MNIST dataset for varying network depth. Depth increases top-down.

Furthermore, mSGNHT-S converges faster than mSGNHT-E, especially at the beginning of learning. Comparing within column 1 of Fig. 1 (fixing $h$, varying $D$), mSGNHT-S is more robust to the choice of the diffusion factor $D$. Comparing columns 2–4 (fixing $D$, varying $h$), larger $h$ potentially brings larger gradient-estimation errors and numerical errors; mSGNHT-S significantly outperforms the other methods when $h$ is large.

F Deep Neural Networks

F.1 Sensitivity to Step-size

For feedforward neural networks (FNN) with a 2-layer model of 400-400 ReLU units, we study the performance of mSGNHT for different $h$, with $D = 5$. We test a wide range of $h$, and show in Fig. 2 the learning curves of test accuracy and training negative log-likelihood for $h = 2 \times 10^{-4}$, $10^{-4}$, and $5 \times 10^{-5}$. mSGNHT-S is less sensitive to the step size: it maintains fast convergence when $h$ is large, while mSGNHT-E significantly slows down.
Figure 4: Learning curves (test accuracy, %) of CNN for different step sizes.

F.2 Sigmoid Activation Function

We compare the different methods in the case of the sigmoid link. Similar to the main paper, we test FNNs of varying depth, e.g., $\{1, 2, 3\}$ layers. We fix the diffusion factor $D$ and set $h = 10^{-3}$. Fig. 3 displays the learning curves of test accuracy and training negative log-likelihood. The gap between mSGNHT-S and mSGNHT-E becomes larger in deeper models. When a 3-layer network is employed, mSGNHT-E fails in the first 5 epochs, whilst mSGNHT-S converges well. Moreover, mSGNHT-S converges more accurately, yielding an accuracy of 97.36%, while mSGNHT-E gives 96.86%. This manifests the importance of numerical accuracy in SG-MCMC for practical applications, and mSGNHT-S is desirable in this regard.

F.3 Comparison with Other Methods

Based on a network of size 400-400 with ReLU units, we also compare with a recent state-of-the-art inference method for neural networks, Bayes by Backprop (BBB) (Blundell et al. 2015), using $h = 2 \times 10^{-4}$ and $D = 6$. The comparison is given in Table 1; the BBB and SGD results are taken from (Blundell et al. 2015).

Table 1: Classification accuracy on MNIST.

Method   Accuracy (%)
BBB      98.18
SGD      98.17

F.4 Convolutional Neural Networks

We also perform a comparison with a standard convolutional neural network, LeNet (Jarrett et al. 2009), on the MNIST dataset. It is a 2-layer convolutional network followed by a 2-layer fully-connected FNN, each fully-connected layer containing 200 hidden nodes with ReLU activations. Both convolutional layers use a 5×5 filter size with 32 and 64 channels, respectively, and 2×2 max pooling is used after each convolutional layer. 40 epochs are used, and $L$ is set to 2. In Fig. 4, we test the step sizes $h = 10^{-4}$ and $h = 5 \times 10^{-4}$ for mSGNHT-S and mSGNHT-E, with $D = 5$. Again, under the same network architecture, the CNN trained with mSGNHT-S converges faster than with mSGNHT-E. In particular, when a large step size is used, mSGNHT-S yields a significant improvement over mSGNHT-E.

G Deep Poisson Factor Analysis

G.1 Model Specification

We first provide the model details of Deep Poisson Factor Analysis (DPFA) (Gan et al. 2015). Given a discrete matrix $W \in \mathbb{Z}_+^{V \times J}$ containing counts from $J$ documents and $V$ words, Poisson factor analysis (PFA) (Zhou et al. 2012) assumes the entries of $W$ are summations of $K$ latent counts, each produced by a latent factor (in the case of topic modeling, a hidden topic). For $W$, the generative process of DPFA with an $L$-layer Sigmoid Belief Network (SBN) is as follows:

$$p\big(h_{k_L}^{(L)} = 1\big) = g\big(b_{k_L}^{(L)}\big), \qquad (14)$$
$$p\big(h_n^{(l)} = 1 \mid h_n^{(l+1)}\big) = g\big((w^{(l)})^\top h_n^{(l+1)} + c^{(l)}\big), \qquad (15)$$
$$W \sim \mathrm{Pois}\big(\Phi (\Psi \odot h^{(1)})\big), \qquad (16)$$

where $g(\cdot)$ is the sigmoid link. Equations (14) and (15) define the Deep Sigmoid Belief Network (DSBN). $\Phi$ is the factor loading matrix; each column of $\Phi$, $\phi_k \in \Delta_V$, encodes the relative importance of each word in topic $k$, with $\Delta_V$ representing the $V$-dimensional simplex. $\Psi \in \mathbb{R}_+^{K \times J}$ is the factor score matrix; each column $\psi_j$ contains the relative topic intensities specific to document $j$. $h^{(l)} \in \{0, 1\}^K$ is a latent binary feature vector, which defines whether certain topics are associated with a document. The factor scores for document $j$ at the bottom layer are the element-wise product $\psi_j \odot h_j^{(1)}$. DPFA is constructed by placing Dirichlet priors on $\phi_k$ and gamma priors on $\psi_j$. This is summarized as

$$x_{vj} = \sum_{k=1}^{K} x_{vjk}, \qquad x_{vjk} \sim \mathrm{Pois}\big(\phi_{vk}\, \psi_{kj}\, h_{kj}\big), \qquad (17)$$

with the priors specified as $\phi_k \sim \mathrm{Dir}(a_\phi, \ldots, a_\phi)$, $\psi_{kj} \sim \mathrm{Gamma}\big(r_k,\, p_j / (1 - p_j)\big)$, $r_k \sim \mathrm{Gamma}(\gamma_0, 1/c_0)$, and $\gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0)$.
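The generative process (14)–(15) is straightforward to state in code; below is a minimal sketch that draws the bottom-layer binary features $h^{(1)}$ from the DSBN prior. The weight shapes follow our reading of (15), with $w^{(l)}$ of shape $(K_{l+1}, K_l)$; all names are illustrative.

```python
import numpy as np

def sample_dsbn(Ws, cs, b_top, rng=np.random.default_rng()):
    """Draw h^(1) from the DSBN prior, Eqs. (14)-(15).

    Ws    : weight matrices [w^(1), ..., w^(L-1)], w^(l) of shape (K_{l+1}, K_l).
    cs    : bias vectors    [c^(1), ..., c^(L-1)], c^(l) of shape (K_l,).
    b_top : top-layer bias b^(L) of shape (K_L,).
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    # Eq. (14): the top layer has no parent, only the bias b^(L).
    h = (rng.random(b_top.shape) < sigmoid(b_top)).astype(float)
    # Eq. (15): propagate down through layers L-1, ..., 1.
    for w, c in zip(reversed(Ws), reversed(cs)):
        h = (rng.random(c.shape) < sigmoid(w.T @ h + c)).astype(float)
    return h
```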
G.2 Model Inference

Following (Gan et al. 2015), we use stochastic gradient Riemannian Langevin dynamics (SGRLD) (Patterson and Teh 2013) to sample the topic-word distributions $\{\phi_k\}$; mSGNHT is used to sample the parameters of the DSBN, i.e., $\theta = (w^{(l)}, c^{(l)}, b^{(L)})$, where $l = 1, \ldots, L$. Specifically, the stochastic gradients with respect to $w^{(l)}$ and $c^{(l)}$, evaluated on a mini-batch of data (denote $S$ as the index set of the mini-batch), are calculated as

$$\frac{\partial \tilde{U}}{\partial w^{(l)}} = \frac{J}{|S|} \sum_{i \in S} \mathbb{E}_{h_i^{(l)}, h_i^{(l+1)}}\left[\big(\sigma_i^{(l)} - h_i^{(l)}\big)\big(h_i^{(l+1)}\big)^\top\right], \qquad (18)$$
$$\frac{\partial \tilde{U}}{\partial c^{(l)}} = \frac{J}{|S|} \sum_{i \in S} \mathbb{E}_{h_i^{(l)}, h_i^{(l+1)}}\left[\sigma_i^{(l)} - h_i^{(l)}\right], \qquad (19)$$

where $\sigma_i^{(l)} = \sigma\big((w^{(l)})^\top h_i^{(l+1)} + c^{(l)}\big)$, and the expectation is taken over the posterior. Monte Carlo integration is used to approximate this quantity.
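A Monte Carlo estimate of (18)–(19) can be sketched as follows, with posterior samples of $h_i^{(l)}$ and $h_i^{(l+1)}$ assumed to be given (e.g., from the Gibbs conditionals of the DSBN); the gradient for $w^{(l)}$ is returned in the $(K_{l+1}, K_l)$ storage layout used in the sketch after Section G.1, and all names are ours.

```python
import numpy as np

def sbn_grads(w, c, h_l, h_lp1, J):
    """Monte Carlo estimate of Eqs. (18)-(19) for one DSBN layer l.

    w     : (K_{l+1}, K_l) weights; c : (K_l,) biases.
    h_l   : (n, M, K_l)     M posterior samples of h_i^(l) for n mini-batch items.
    h_lp1 : (n, M, K_{l+1}) matching samples of h_i^(l+1).
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    sig = sigmoid(h_lp1 @ w + c)        # sigma_i^(l) per sample, shape (n, M, K_l)
    diff = sig - h_l                    # integrand of Eqs. (18)-(19)
    n, M = h_l.shape[0], h_l.shape[1]
    # Average over the M samples, sum over the batch, rescale by J / |S|.
    grad_w = (J / n) * np.einsum('imk,imj->jk', diff, h_lp1) / M
    grad_c = (J / n) * diff.mean(axis=1).sum(axis=0)
    return grad_w, grad_c
```

With these gradients for the DSBN block and the SGRLD updates for $\{\phi_k\}$, the full DPFA sampler alternates between the two parameter blocks.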
References

Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. In ICML.

Chen, C.; Ding, N.; and Carin, L. 2015. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS.

Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R. D.; and Neven, H. 2014. Bayesian sampling using stochastic gradient thermostats. In NIPS.

Gan, Z.; Chen, C.; Henao, R.; Carlson, D.; and Carin, L. 2015. Scalable deep Poisson factor analysis for topic modeling. In ICML.

Jarrett, K.; Kavukcuoglu, K.; Ranzato, M.; and LeCun, Y. 2009. What is the best multi-stage architecture for object recognition? In ICCV.

Patterson, S., and Teh, Y. W. 2013. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS.

Rossmann, W. 2002. Lie Groups: An Introduction Through Linear Groups. Oxford Graduate Texts in Mathematics, Oxford Science Publications.

Zhou, M.; Hannah, L.; Dunson, D. B.; and Carin, L. 2012. Beta-negative binomial process and Poisson factor analysis. In AISTATS.