Appendices: Stochastic Backpropagation and Approximate Inference in Deep Generative Models
Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra
Google DeepMind, London, United Kingdom

A. Additional Model Details

In equation (6) we showed an alternative form of the joint log-likelihood that explicitly separates the deterministic and stochastic parts of the generative model and corroborates the view that the generative model works by applying a complex non-linear transformation to a spherical Gaussian distribution $\mathcal{N}(\xi \mid 0, I)$ such that the transformed distribution best matches the empirical distribution. We provide more details on this view here for clarity.

From the model description in equations (3) and (4), we can interpret the variables $h_l$ as deterministic functions of the noise variables $\xi_l$. This can be formally introduced as a coordinate transformation of the probability density in equation (5): we perform a change of coordinates $h_l \rightarrow \xi_l$. The density of the transformed variables $\xi_l$ can be expressed in terms of the density (5) times the determinant of the Jacobian of the transformation, $p(\xi_l) = p(h_l(\xi_l))\,|\partial h_l / \partial \xi_l|$. Since the coordinate transformation is linear we have $\partial h_l / \partial \xi_l = G_l$, and the distribution of $\xi_l$ is obtained as follows:

$p(\xi_L) = p(h_L)\,|G_L|, \qquad p(\xi_l) = p_l(h_l \mid h_{l+1})\,|G_l| = |G_l|\,|S_l|^{-1/2}\,\mathcal{N}(\xi_l \mid 0, I) = |G_l|\,|G_l G_l^T|^{-1/2}\,\mathcal{N}(\xi_l \mid 0, I) = \mathcal{N}(\xi_l \mid 0, I), \quad l \le L-1.$   (22)

Combining this equation with the distribution of the visible layer we obtain equation (6).

A.1. Examples

Below we provide simple, explicit examples of generative and recognition models. In the case of a two-layer model the activation $h_1(\xi_1, \xi_2)$ in equation (6) can be explicitly written as

$h_1(\xi_1, \xi_2) = W_1 f(G_2 \xi_2) + G_1 \xi_1 + b_1.$   (23)

Similarly, a simple recognition model consists of a single deterministic layer and a stochastic Gaussian layer with the rank-one covariance structure, and is constructed as:

$q(\xi_l \mid v) = \mathcal{N}\!\left(\xi_l \mid \mu, (\mathrm{diag}(d) + u u^T)^{-1}\right)$   (24)
$\mu = W_\mu z + b_\mu$   (25)
$\log d = W_d z + b_d; \qquad u = W_u z + b_u$   (26)
$z = f(W_v v + b_v)$   (27)

where the function $f$ is a rectified linearity (but other non-linearities such as tanh can be used).
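As a concrete illustration of the two-layer example and the rank-one recognition model above, the following NumPy sketch draws one sample from each. All dimensions, the rectifier $f$, and the randomly initialised weights are illustrative assumptions and not settings from our experiments; a practical implementation would also avoid the explicit matrix inversion by using the matrix-inversion lemma.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the settings used in the experiments).
d_v, d_h1, d_h2, d_z = 784, 100, 50, 100

def f(x):
    # rectified linearity used in both the generative and recognition models
    return np.maximum(x, 0.0)

# Generative parameters, randomly initialised purely for illustration.
W1 = rng.normal(0.0, 0.01, (d_h1, d_h2))
G1 = rng.normal(0.0, 0.01, (d_h1, d_h1))
G2 = rng.normal(0.0, 0.01, (d_h2, d_h2))
b1 = np.zeros(d_h1)

# Equation (23): h1(xi1, xi2) = W1 f(G2 xi2) + G1 xi1 + b1.
xi1, xi2 = rng.standard_normal(d_h1), rng.standard_normal(d_h2)
h1 = W1 @ f(G2 @ xi2) + G1 @ xi1 + b1

# Recognition model, equations (24)-(27): Gaussian with rank-one precision.
W_v,  b_v  = rng.normal(0.0, 0.01, (d_z, d_v)),  np.zeros(d_z)
W_mu, b_mu = rng.normal(0.0, 0.01, (d_h1, d_z)), np.zeros(d_h1)
W_d,  b_d  = rng.normal(0.0, 0.01, (d_h1, d_z)), np.zeros(d_h1)
W_u,  b_u  = rng.normal(0.0, 0.01, (d_h1, d_z)), np.zeros(d_h1)

v  = rng.random(d_v)                    # a dummy observation
z  = f(W_v @ v + b_v)                   # deterministic layer, eq. (27)
mu = W_mu @ z + b_mu                    # eq. (25)
d  = np.exp(W_d @ z + b_d)              # eq. (26), log d parameterisation
u  = W_u @ z + b_u                      # eq. (26)

# Covariance C = (diag(d) + u u^T)^{-1}; direct inversion is used here only for clarity.
C  = np.linalg.inv(np.diag(d) + np.outer(u, u))
xi = mu + np.linalg.cholesky(C) @ rng.standard_normal(d_h1)
print(h1.shape, xi.shape)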
B. Proofs for the Gaussian Gradient Identities

Here we review the derivations of Bonnet's and Price's theorems that were presented in section 3.

Theorem B.1 (Bonnet's theorem). Let $f(\xi): \mathbb{R}^d \rightarrow \mathbb{R}$ be an integrable and twice differentiable function. The gradient of the expectation of $f(\xi)$ under a Gaussian distribution $\mathcal{N}(\xi \mid \mu, C)$ with respect to the mean $\mu$ can be expressed as the expectation of the gradient of $f(\xi)$:

$\nabla_{\mu_i} \mathbb{E}_{\mathcal{N}(\mu, C)}[f(\xi)] = \mathbb{E}_{\mathcal{N}(\mu, C)}[\nabla_{\xi_i} f(\xi)].$

Proof.

$\nabla_{\mu_i} \mathbb{E}_{\mathcal{N}(\mu,C)}[f(\xi)] = \int \nabla_{\mu_i} \mathcal{N}(\xi \mid \mu, C)\, f(\xi)\, d\xi = -\int \nabla_{\xi_i} \mathcal{N}(\xi \mid \mu, C)\, f(\xi)\, d\xi = -\big[\mathcal{N}(\xi \mid \mu, C) f(\xi)\big]_{\xi_i=-\infty}^{\xi_i=+\infty} + \int \mathcal{N}(\xi \mid \mu, C)\, \nabla_{\xi_i} f(\xi)\, d\xi = \mathbb{E}_{\mathcal{N}(\mu,C)}[\nabla_{\xi_i} f(\xi)],$   (28)

where we have used the identity $\nabla_{\mu_i} \mathcal{N}(\xi \mid \mu, C) = -\nabla_{\xi_i} \mathcal{N}(\xi \mid \mu, C)$ in moving from step 1 to 2. From step 2 to 3 we have used the product rule for integrals, with the first term evaluating to zero.

Theorem B.2 (Price's theorem). Under the same conditions as before, the gradient of the expectation of $f(\xi)$ under a Gaussian distribution $\mathcal{N}(\xi \mid 0, C)$ with respect to the covariance $C$ can be expressed in terms of the expectation of the Hessian of $f(\xi)$ as

$\nabla_{C_{i,j}} \mathbb{E}_{\mathcal{N}(0,C)}[f(\xi)] = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{N}(0,C)}\big[\nabla_{\xi_i, \xi_j} f(\xi)\big].$

Proof.

$\nabla_{C_{i,j}} \mathbb{E}_{\mathcal{N}(0,C)}[f(\xi)] = \int \nabla_{C_{i,j}} \mathcal{N}(\xi \mid 0, C)\, f(\xi)\, d\xi = \tfrac{1}{2} \int \nabla_{\xi_i, \xi_j} \mathcal{N}(\xi \mid 0, C)\, f(\xi)\, d\xi = \tfrac{1}{2} \int \mathcal{N}(\xi \mid 0, C)\, \nabla_{\xi_i, \xi_j} f(\xi)\, d\xi = \tfrac{1}{2}\, \mathbb{E}_{\mathcal{N}(0,C)}\big[\nabla_{\xi_i, \xi_j} f(\xi)\big].$   (29)

In moving from step 1 to 2, we have used the identity $\nabla_{C_{i,j}} \mathcal{N}(\xi \mid \mu, C) = \tfrac{1}{2} \nabla_{\xi_i, \xi_j} \mathcal{N}(\xi \mid \mu, C)$, which can be verified by taking the derivatives on both sides and comparing the resulting expressions. From step 2 to 3 we have used the product rule for integrals twice.

C. Deriving Stochastic Back-propagation Rules

In section 3 we described two ways in which to derive stochastic back-propagation rules. We show specific examples and provide some more discussion in this section.

C.1. Using the Product Rule for Integrals

We can derive rules for stochastic back-propagation for many distributions by finding an appropriate non-linear function $B(x; \theta)$ that allows us to express the gradient with respect to the parameters of the distribution as a gradient with respect to the random variable directly. The approach we described in the main text was:

$\nabla_\theta \mathbb{E}_p[f(x)] = \int \nabla_\theta p(x \mid \theta)\, f(x)\, dx = \int \nabla_x p(x \mid \theta)\, B(x) f(x)\, dx = \big[B(x) f(x) p(x \mid \theta)\big]_{\mathrm{supp}(x)} - \int p(x \mid \theta)\, \nabla_x [B(x) f(x)]\, dx = -\mathbb{E}_{p(x \mid \theta)}\big[\nabla_x [B(x) f(x)]\big],$   (30)

where we have introduced the non-linear function $B(x; \theta)$ to allow for the transformation of the gradients and have applied the product rule for integrals (the rule for integration by parts) to rewrite the integral in two parts in the second line; $\mathrm{supp}(x)$ indicates that the term is evaluated at the boundaries of the support. To use this approach, we require that the density we are analysing be zero at the boundaries of the support to ensure that the first term in the second line is zero.

As an alternative, we can also write this differently and find a non-linear function of the form:

$\nabla_\theta \mathbb{E}_p[f(x)] = \mathbb{E}_{p(x \mid \theta)}[B(x)\, \nabla_x f(x)].$   (31)

Consider general exponential family distributions of the form:

$p(x \mid \theta) = h(x) \exp\!\big(\eta(\theta)^T \phi(x) - A(\theta)\big),$   (32)

where $h(x)$ is the base measure, $\theta$ is the set of mean parameters of the distribution, $\eta$ is the set of natural parameters, and $A(\theta)$ is the log-partition function. We can express the non-linear function in (30) using these quantities as:

$B(x) = \dfrac{\nabla_\theta \eta(\theta)\, \phi(x) - \nabla_\theta A(\theta)}{\nabla_x \log h(x) + \eta(\theta)^T \nabla_x \phi(x)}.$   (33)

This can be derived for a number of distributions such as the Gaussian, inverse Gamma, Log-Normal, Wald (inverse Gaussian) and other distributions. We show some of these below.
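As a concrete check of (33), the following sketch evaluates $B(x;\theta)$ for the inverse Gamma distribution by finite differences on $\log p(x \mid \theta)$ and compares the result with the closed forms listed in the table that follows. The parameter values and the evaluation point are arbitrary choices made only for this check.

import numpy as np
from math import lgamma

def log_p(x, alpha, beta):
    # log-density of the inverse Gamma distribution Inv-Gamma(x | alpha, beta)
    return alpha * np.log(beta) - lgamma(alpha) - (alpha + 1.0) * np.log(x) - beta / x

def num_grad(g, z, h=1e-6):
    # central finite difference, used here in place of analytic derivatives
    return (g(z + h) - g(z - h)) / (2.0 * h)

x, alpha, beta = 1.7, 2.3, 0.9      # arbitrary evaluation point and parameters

# B(x; theta) from (33) is the ratio of the theta-derivatives to the x-derivative of log p.
dlog_dx = num_grad(lambda z: log_p(z, alpha, beta), x)
B_alpha = num_grad(lambda a: log_p(x, a, beta), alpha) / dlog_dx
B_beta  = num_grad(lambda b: log_p(x, alpha, b), beta) / dlog_dx

# Closed forms as listed in the table (psi is the digamma function, itself
# approximated here by differentiating lgamma).
psi_alpha = (lgamma(alpha + 1e-6) - lgamma(alpha - 1e-6)) / 2e-6
B_alpha_closed = x**2 * (np.log(beta) - psi_alpha - np.log(x)) / (beta - (alpha + 1.0) * x)
B_beta_closed  = x * (alpha * x - beta) / (beta * (beta - (alpha + 1.0) * x))

print(B_alpha, B_alpha_closed)
print(B_beta,  B_beta_closed)

The two numerically obtained components agree with the closed forms to the accuracy of the finite-difference approximation.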
Family        θ           B(x; θ)
Gaussian      (µ, σ²)     ( −1 ,  (σ² − (x − µ)²) / (2σ²(x − µ)) )
Inv. Gamma    (α, β)      ( x²(ln β − ψ(α) − ln x) / (β − (α+1)x) ,  x(αx − β) / (β(β − (α+1)x)) )
Log-Normal    (µ, σ²)     ( −x(ln x − µ) / (σ² + ln x − µ) ,  x(σ² − (ln x − µ)²) / (2σ²(σ² + ln x − µ)) )

The $B(x; \theta)$ corresponding to the second formulation (31) can also be derived and may be useful in certain situations, though it requires the solution of a first-order differential equation. This approach of searching for non-linear transformations leads us to the second approach for deriving stochastic back-propagation rules.

C.2. Using Alternative Coordinate Transformations

There are many distributions outside the exponential family that we would like to consider using. A simpler approach is to search for a co-ordinate transformation that allows us to separate the deterministic and stochastic parts of the distribution. We described the case of the Gaussian in section 3. Other distributions also have this property. As an example, consider the Levy distribution (which is a special case of the inverse Gamma considered above). Due to the self-similarity property of this distribution, if we draw $X$ from a Levy distribution with known parameters, $X \sim \mathrm{Levy}(\mu, c)$, we can obtain any other Levy distribution by rescaling and shifting this base distribution: $kX + b \sim \mathrm{Levy}(k\mu + b, kc)$. Many other distributions share this property, allowing stochastic back-propagation rules to be determined for distributions such as the Student's t-distribution, the Logistic distribution, the class of stable distributions and the class of generalised extreme value distributions (GEV). Examples of co-ordinate transformations $T(\cdot)$ and the resulting distributions are shown below for variates $X$ drawn from the standard distribution listed in the first column.

Std Distr.      T(X)                      Gen. Distr.
GEV(µ, σ, 0)    mX + b                    GEV(mµ + b, mσ, 0)
Exp(1)          µ + β ln(exp(X) − 1)      Logistic(µ, β)
Exp(1)          λ X^{1/k}                 Weibull(λ, k)

D. Variance Reduction using Control Variates

An alternative approach for stochastic gradient computation is commonly based on the method of control variates. We analyse the variance properties of various estimators in a simple example using a univariate function. We then show the correspondence of the widely-known REINFORCE algorithm to the general control variate framework.

D.1. Variance discussion for REINFORCE

The REINFORCE estimator is based on

$\nabla_\theta \mathbb{E}_p[f(\xi)] = \mathbb{E}_p\big[(f(\xi) - b)\, \nabla_\theta \log p(\xi \mid \theta)\big],$   (34)

where $b$ is a baseline typically chosen to reduce the variance of the estimator.

The variance of (34) scales poorly with the number of random variables (Dayan et al., 1995). To see this limitation, consider functions of the form $f(\xi) = \sum_{i=1}^K f(\xi_i)$, where each individual term and its gradient has a bounded variance, i.e., $\kappa_l \le \mathrm{Var}[f(\xi_i)] \le \kappa_u$ and $\kappa_l \le \mathrm{Var}[\nabla_{\xi_i} f(\xi_i)] \le \kappa_u$ for some $0 \le \kappa_l \le \kappa_u$, and assume independent or weakly correlated random variables. Given these assumptions, the variance of GBP (7) scales as $\mathrm{Var}[\nabla_{\xi_i} f(\xi)] \sim O(1)$, while the variance for REINFORCE (34) scales as $\mathrm{Var}\!\left[\frac{\xi_i - \mu_i}{\sigma_i^2}\,\big(f(\xi) - \mathbb{E}[f(\xi)]\big)\right] \sim O(K)$.

For the variance of GBP above, all terms in $f(\xi)$ that do not depend on $\xi_i$ have zero gradient, whereas for REINFORCE the variance involves a summation over all $K$ terms. Even if most of these terms have zero expectation, they still contribute to the variance of the estimator. Thus, the REINFORCE estimator has the undesirable property that its variance scales linearly with the number of independent random variables in the target function, while the variance of GBP is bounded by a constant. The assumption of weakly correlated terms is relevant for variational learning in larger generative models, where independence assumptions and structure in the variational distribution result in free energies that are summations over weakly correlated or independent terms.
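This $O(1)$ versus $O(K)$ scaling is easy to see empirically. The following sketch compares the variance of the pathwise (GBP) estimator and of REINFORCE (with a simple constant baseline) for the gradient with respect to $\mu_1$ of $\mathbb{E}[f(\xi)]$ with $f(\xi) = \sum_i \sin(\xi_i)$; the test function, parameter ranges and sample sizes are arbitrary choices made only for this illustration.

import numpy as np

rng = np.random.default_rng(2)
S, sigma = 10_000, 1.0

for K in [1, 10, 100, 1000]:
    mu = rng.uniform(-1.0, 1.0, K)
    xi = mu + sigma * rng.standard_normal((S, K))
    fx = np.sin(xi).sum(axis=1)                       # f(xi) = sum_i sin(xi_i)
    baseline = fx.mean()                              # a simple constant baseline b
    gbp = np.cos(xi[:, 0])                            # pathwise / GBP estimator of d/dmu_1
    reinforce = (xi[:, 0] - mu[0]) / sigma**2 * (fx - baseline)
    print(f"K={K:5d}  Var[GBP]={gbp.var():.3f}  Var[REINFORCE]={reinforce.var():.3f}")

The printed GBP variance stays roughly constant as $K$ grows, while the REINFORCE variance grows approximately linearly with $K$.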
D.2. Univariate variance analysis

In analysing the variance properties of many estimators, we discussed the general scaling of likelihood-ratio approaches in appendix D.1. As an example to further emphasise the high-variance nature of these alternative approaches, we present a short analysis in the univariate case.

Consider a random variable $p(\xi) = \mathcal{N}(\xi \mid \mu, \sigma^2)$ and a simple quadratic function of the form

$f(\xi) = \tfrac{c}{2}\, \xi^2.$   (35)

For this function we immediately obtain the following variances:

$\mathrm{Var}[\nabla_\xi f(\xi)] = c^2 \sigma^2$   (36)
$\mathrm{Var}[\nabla^2_\xi f(\xi)] = 0$   (37)
$\mathrm{Var}[(\xi - \mu)\, \nabla_\xi f(\xi)] = 2 c^2 \sigma^4 + \mu^2 c^2 \sigma^2$   (38)
$\mathrm{Var}\!\left[\frac{\xi - \mu}{\sigma^2}\big(f(\xi) - \mathbb{E}[f(\xi)]\big)\right] = 2 c^2 \mu^2 + \tfrac{5}{2} c^2 \sigma^2$   (39)

Equations (36), (37) and (38) correspond to the variance of the estimators based on (7), (8) and (10) respectively, whereas equation (39) corresponds to the variance of the REINFORCE algorithm for the gradient with respect to $\mu$.
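These expressions can be confirmed by a direct Monte Carlo simulation, as in the sketch below; $c$, $\mu$ and $\sigma$ are arbitrary values chosen only for this check, and the estimators are written exactly as in equations (36)-(39) above.

import numpy as np

rng = np.random.default_rng(3)
c, mu, sigma, S = 1.5, 0.7, 1.3, 1_000_000

xi  = mu + sigma * rng.standard_normal(S)
f   = 0.5 * c * xi**2                  # f(xi) = (c/2) xi^2
df  = c * xi                           # first derivative of f
d2f = np.full(S, c)                    # second derivative of f (constant)

print(np.var(df),  c**2 * sigma**2)                                      # eq. (36)
print(np.var(d2f), 0.0)                                                  # eq. (37)
print(np.var((xi - mu) * df), 2*c**2*sigma**4 + mu**2*c**2*sigma**2)     # eq. (38)
print(np.var((xi - mu) / sigma**2 * (f - f.mean())),
      2*c**2*mu**2 + 2.5*c**2*sigma**2)                                  # eq. (39)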
From these relations we see that, for any parameter configuration, the variance of the REINFORCE estimator is strictly larger than the variance of the estimator based on (7). Additionally, the ratio between the variances of the former and latter estimators is lower-bounded by 5/2. We can also see that the variance of the estimator based on equation (8) is zero for this specific function, whereas the variance of the estimator based on equation (10) is not.

E. Estimating the Marginal Likelihood

We compute the marginal likelihood by importance sampling, generating $S$ samples from the recognition model and using the following estimator:

$p(v) \approx \frac{1}{S} \sum_{s=1}^S \frac{p(v \mid h(\xi^{(s)}))\, p(\xi^{(s)})}{q(\xi^{(s)} \mid v)}, \qquad \xi^{(s)} \sim q(\xi \mid v).$   (40)

F. Missing Data Imputation

Image completion can be approximately achieved by a simple iterative procedure which consists of (i) initialising the non-observed pixels with random values; (ii) sampling from the recognition distribution given the resulting image; (iii) reconstructing the image given the sample from the recognition model; (iv) iterating the procedure.

We denote the observed and missing entries in an observation as $v_o$ and $v_m$, respectively. The observed $v_o$ is fixed throughout, therefore all the computations in this section will be conditioned on $v_o$. The imputation procedure can be written formally as a Markov chain on the space of missing entries $v_m$ with transition kernel $T^q(v_m' \mid v_m, v_o)$ given by

$T^q(v_m' \mid v_m, v_o) = \iint p(v_m', v_o' \mid \xi)\, q(\xi \mid v)\, dv_o'\, d\xi,$   (41)

where $v = (v_m, v_o)$.

Provided that the recognition model $q(\xi \mid v)$ constitutes a good approximation of the true posterior $p(\xi \mid v)$, (41) can be seen as an approximation of the kernel

$T(v_m' \mid v_m, v_o) = \iint p(v_m', v_o' \mid \xi)\, p(\xi \mid v)\, dv_o'\, d\xi.$   (42)

The kernel (42) has two important properties: (i) it has as its eigen-distribution the marginal $p(v_m \mid v_o)$; (ii) $T(v_m' \mid v_m, v_o) > 0$ for all $v_o$, $v_m$, $v_m'$. Property (i) can be derived by applying the kernel (42) to the marginal $p(v_m \mid v_o)$ and noting that it is a fixed point. Property (ii) is an immediate consequence of the smoothness of the model. We apply the fundamental theorem for Markov chains (Neal, 1993, p. 38) and conclude that, given the above properties, a Markov chain generated by (42) is guaranteed to generate samples from the correct marginal $p(v_m \mid v_o)$.

In practice, the stationary distribution of the completed pixels will not be exactly the marginal $p(v_m \mid v_o)$, since we use the approximate kernel (41). Even in this setting we can provide a bound on the $L^1$ norm of the difference between the resulting stationary marginal and the target marginal $p(v_m \mid v_o)$.

Proposition F.1 ($L^1$ bound on marginal error). If the recognition model $q(\xi \mid v)$ is such that for all $\xi$ there exists $\varepsilon > 0$ with

$\int \left| \frac{q(\xi \mid v)\, p(v)}{p(\xi)} - p(v \mid \xi) \right| dv \le \varepsilon,$   (43)

then the marginal $p(v_m \mid v_o)$ is a weak fixed point of the kernel (41) in the following sense:

$\iint \left| T^q(v_m' \mid v_m, v_o) - T(v_m' \mid v_m, v_o) \right| p(v_m \mid v_o)\, dv_m\, dv_m' < \varepsilon.$   (44)

Proof.

$\iint \left| T^q(v_m' \mid v_m, v_o) - T(v_m' \mid v_m, v_o) \right| p(v_m \mid v_o)\, dv_m\, dv_m'$
$\le \iiint p(v_m', v_o' \mid \xi)\, p(v_m, v_o)\, \big| q(\xi \mid v_m, v_o) - p(\xi \mid v_m, v_o) \big|\, dv_m\, d\xi\, dv_m'$
$= \iiint p(v' \mid \xi)\, p(v)\, \big| q(\xi \mid v) - p(\xi \mid v) \big|\, \frac{p(\xi)}{p(\xi)}\, dv\, d\xi\, dv'$
$= \iiint p(v' \mid \xi)\, p(\xi)\, \left| \frac{q(\xi \mid v)\, p(v)}{p(\xi)} - p(v \mid \xi) \right| dv\, d\xi\, dv' \le \varepsilon,$

where we apply the condition (43) to obtain the last statement. That is, if the recognition model is sufficiently close to the true posterior to guarantee that (43) holds for some acceptable error $\varepsilon$, then (44) guarantees that the fixed point of the Markov chain induced by the kernel (41) is no further than $\varepsilon$ from the true marginal with respect to the $L^1$ norm.
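The iterative completion procedure of steps (i)-(iv) above amounts to running the Markov chain with kernel (41). Below is a minimal sketch of that loop; `encode` and `decode` are hypothetical stand-ins for a trained recognition model (returning a sample $\xi \sim q(\xi \mid v)$) and generative model (returning a reconstruction of $v$ from $\xi$), and are not part of any particular implementation.

import numpy as np

def impute(v, observed_mask, encode, decode, num_steps=100, rng=None):
    """Pseudo-Gibbs imputation following steps (i)-(iv).

    `encode(v)` is assumed to return a sample xi ~ q(xi | v) from the recognition
    model, and `decode(xi)` a sample (or mean) of v ~ p(v | h(xi)) from the
    generative model; both are placeholders for a trained model.
    """
    rng = rng or np.random.default_rng()
    v = np.where(observed_mask, v, rng.random(v.shape))   # (i) random init of missing pixels
    for _ in range(num_steps):
        xi = encode(v)                                     # (ii) sample xi ~ q(xi | v)
        v_hat = decode(xi)                                 # (iii) reconstruct v from xi
        v = np.where(observed_mask, v, v_hat)              # (iv) keep v_o fixed, update v_m only
    return v

With a trained model plugged in, each pass of the loop is one application of the approximate transition kernel $T^q$ of equation (41), with the observed entries $v_o$ held fixed throughout.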
G. Variational Bayes for Deep Directed Models

In the main text we focussed on the variational problem of specifying a posterior on the latent variables only. It is natural to consider the variational Bayes problem in which we specify an approximate posterior for both the latent variables and the model parameters. Following the same construction and considering a Gaussian approximate distribution on the model parameters $\theta^g$, the free energy becomes:

$\mathcal{F}(V) = -\sum_n \mathbb{E}_q[\log p(v_n \mid h(\xi_n))] + \frac{1}{2}\sum_{n,l}\Big[\|\mu_{n,l}\|^2 + \mathrm{Tr}(C_{n,l}) - \log|C_{n,l}| - 1\Big] + \frac{1}{2}\Big[\frac{\|m\|^2}{\kappa} + \frac{\tau}{\kappa} + \log\kappa - \log\tau - 1\Big],$   (45)

where the three terms are the reconstruction error, the latent regularization term and the parameter regularization term, respectively. The free energy now includes an additional term for the cost of using parameters and their regularisation. We must now compute the additional set of gradients with respect to the parameters' mean $m$ and variance $\tau$:

$\nabla_m \mathcal{F}(v) = -\mathbb{E}_q\big[\nabla_{\theta^g} \log p(v \mid h(\xi))\big] + \frac{m}{\kappa}$   (46)
$\nabla_\tau \mathcal{F}(v) = -\frac{1}{2}\,\mathbb{E}_q\Big[\frac{\theta^g - m}{\tau}\, \nabla_{\theta^g} \log p(v \mid h(\xi))\Big] + \frac{1}{2}\Big[\frac{1}{\kappa} - \frac{1}{\tau}\Big]$   (47)

H. Additional Simulation Details

We use training data of various types, including binary and real-valued data sets. In all cases, we train using mini-batches, which requires the introduction of scaling terms in the free energy objective function (13) in order to maintain the correct scale between the prior over the parameters and the remaining terms (Ahn et al., 2012; Welling & Teh, 2011). We make use of the objective:

$\mathcal{F}(V) = -\lambda \sum_n \mathbb{E}_q[\log p(v_n \mid h(\xi_n))] + \frac{1}{2\kappa}\|\theta^g\|^2 + \frac{\lambda}{2}\sum_{n,l}\Big[\|\mu_{n,l}\|^2 + \mathrm{Tr}(C_{n,l}) - \log|C_{n,l}| - 1\Big],$   (48)

where $n$ is an index over observations in the mini-batch and $\lambda$ is equal to the ratio of the data-set size and the mini-batch size. At each iteration, a random mini-batch of size 200 observations is chosen. All parameters of the model were initialised using samples from a zero-mean Gaussian distribution; the prior variance of the parameters was $\kappa$. We compute the marginal likelihood on the test data by importance sampling using samples from the recognition model; we describe our estimator in appendix E.

References

Ahn, S., Balan, A. K., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. The Helmholtz machine. Neural Computation, 7(5):889-904, September 1995.

Neal, R. M. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688, 2011.