Afternoon Meeting on Bayesian Computation 2018 University of Reading
1 Gabriele Abbati 1, Alessandra Tosi 2, Seth Flaxman 3, Michael A. Osborne 1; 1 University of Oxford, 2 Mind Foundry Ltd, 3 Imperial College London. Afternoon Meeting on Bayesian Computation 2018, University of Reading
2 High-dimensional Problems: gradient-based optimization and MCMC. Issues arising from high dimensionality: non-convexity, strong correlations, multimodality.
3 Related Work. Gradient-based optimization: AdaGrad, AdaDelta, Adam, RMSProp. MCMC: Hamiltonian Monte Carlo, Particle Monte Carlo, stochastic gradient Langevin dynamics. All of these methods focus on computing clever updates for optimization algorithms or for Markov chains. Novelty: to the best of our knowledge, no dimensionality reduction approaches have been applied in this direction before.
4 The Manifold Idea. After t steps of optimization or sampling, we assume the points obtained in the parameter space to lie on a manifold. We then feed them to a dimensionality reduction method to find a lower-dimensional representation. 3D example: if the sampler/optimizer keeps returning proposals on the surface of a sphere, that information might be used to our advantage. Can we perform better if the algorithm acts with knowledge of the manifold?
5 Latent Variable Models. Latent variable models describe a set Θ through a lower-dimensional latent set Ω via a mapping f, where Θ = {θ_1, ..., θ_N} ⊂ R^D and Ω = {ω_1, ..., ω_N} ⊂ R^Q, with Q < D. Here θ: observed variables/parameters; ω: latent variables; f: mapping; D, Q: dimensionalities of Θ and Ω respectively.
6 Latent Variable Models. Latent variable models describe a set Θ through a lower-dimensional latent set Ω, with Θ = {θ_1, ..., θ_N} ⊂ R^D, Ω = {ω_1, ..., ω_N} ⊂ R^Q and Q < D. Mapping: θ = f(ω) + η, with η ~ N(0, β⁻¹ I). Dimensionality reduction as manifold identification: the lower-dimensional manifold on which the samples lie is characterized through the latent set.
7 Gaussian Process Latent Variable Model. As dimensionality reduction method we choose the Gaussian Process Latent Variable Model (GPLVM) [1]. GPLVM: Gaussian Process prior over the mapping f in θ = f(ω) + η. Motivation: an analytically sound mathematical tool; a full distribution over the mapping f; a full distribution over the derivatives of the mapping f. [1] Lawrence, N., Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research (2005)
8 Gaussian Process. A Gaussian Process [2] is a collection of random variables, any finite number of which have a joint Gaussian distribution. If a real-valued stochastic process f is a GP, it will be denoted as f(·) ~ GP(m(·), k(·,·)). A Gaussian Process is fully specified by a mean function m(·) and a covariance function k(·,·), where m(ω) = E[f(ω)] and k(ω, ω') = E[(f(ω) − m(ω))(f(ω') − m(ω'))]. [Figure: GP regression example showing training data, the GP regression fit and the real function.] [2] Rasmussen, C. E., Williams, C. K. I., Gaussian Processes for Machine Learning, The MIT Press (2006)
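A minimal numpy sketch of the two defining ingredients above: drawing functions from a zero-mean GP prior with a squared-exponential covariance, and computing a GP regression posterior mean from noisy observations. The kernel hyperparameters and data are illustrative, not taken from the talk.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(a, b) = s^2 exp(-|a - b|^2 / (2 l^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)

# Draw three functions from the zero-mean GP prior on a 1-D grid.
grid = np.linspace(-3, 3, 100)[:, None]
K = rbf_kernel(grid, grid)
prior_draws = rng.multivariate_normal(np.zeros(len(grid)), K + 1e-8 * np.eye(len(grid)), size=3)

# GP regression posterior mean on the grid, given noisy observations (X, y).
X = rng.uniform(-3, 3, size=(10, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(10)
beta_inv = 0.01                                    # observation noise variance
K_xx = rbf_kernel(X, X) + beta_inv * np.eye(len(X))
posterior_mean = rbf_kernel(grid, X) @ np.linalg.solve(K_xx, y)
```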
9 Gaussian Process Latent Variable Model. GPLVM: Gaussian Process prior over the mapping f in θ = f(ω) + η. The likelihood of the data Θ given the latent Ω is obtained by (1) marginalizing the mapping and (2) optimizing the latent variables. Resulting likelihood: p(Θ | Ω, β) = ∏_{j=1}^{D} N(θ_{:,j} | 0, K + β⁻¹ I), with the resulting noise model being θ_{i,j} = k(ω_i, Ω) (K + β⁻¹ I)⁻¹ Θ_{:,j} + η_j.
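A minimal sketch, assuming an RBF kernel, of the GPLVM marginal likelihood above; in practice Ω, β and the kernel hyperparameters would be chosen by maximizing this quantity. Names and values are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gplvm_log_likelihood(Theta, Omega, beta):
    """log p(Theta | Omega, beta) = sum_j log N(theta_{:,j} | 0, K + beta^{-1} I)."""
    N, D = Theta.shape
    K_noise = rbf_kernel(Omega, Omega) + np.eye(N) / beta
    return sum(multivariate_normal.logpdf(Theta[:, j], mean=np.zeros(N), cov=K_noise)
               for j in range(D))

# Toy usage: score a random latent configuration for 20 points with D = 5, Q = 2.
rng = np.random.default_rng(0)
Theta = rng.standard_normal((20, 5))
Omega = rng.standard_normal((20, 2))
print(gplvm_log_likelihood(Theta, Omega, beta=100.0))
```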
10 Gaussian Process Latent Variable Model. GPLVM: Gaussian Process prior over the mapping f in θ = f(ω) + η. For differentiable kernels k(·,·), the Jacobian J of the mapping f can be computed analytically: J_{ij} = ∂f_i/∂ω_j. But, as previously said, the GPLVM can yield the full (Gaussian) distribution over the Jacobian. If the rows of J are assumed to be independent: p(J | Ω, β) = ∏_{i=1}^{D} N(J_{i,:} | µ_{J_{i,:}}, Σ_J).
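A minimal numpy sketch, assuming an RBF kernel, of the mean Jacobian µ_J obtained by differentiating the GP posterior mean k(ω*, Ω)(K + β⁻¹ I)⁻¹ Θ with respect to ω*; all names and values are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def mean_jacobian(omega_star, Omega, Theta, beta, lengthscale=1.0, variance=1.0):
    """Mean Jacobian mu_J (D x Q) of the GP posterior mapping at omega_star.

    The posterior mean of f is m(w) = k(w, Omega) (K + beta^{-1} I)^{-1} Theta; for the
    RBF kernel, d k(w, w_n) / d w_q = -(w_q - w_{n,q}) / l^2 * k(w, w_n).
    """
    N, Q = Omega.shape
    K_noise = rbf_kernel(Omega, Omega, lengthscale, variance) + np.eye(N) / beta
    alpha = np.linalg.solve(K_noise, Theta)                                     # N x D
    k_star = rbf_kernel(omega_star[None, :], Omega, lengthscale, variance)[0]   # N
    dk = -(omega_star[None, :] - Omega) / lengthscale**2 * k_star[:, None]      # N x Q
    return alpha.T @ dk                                                         # D x Q

rng = np.random.default_rng(1)
Omega = rng.standard_normal((30, 2))          # latent points, Q = 2
Theta = rng.standard_normal((30, 5))          # observed points, D = 5
mu_J = mean_jacobian(np.zeros(2), Omega, Theta, beta=100.0)
print(mu_J.shape)                             # (5, 2)
```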
11 Recap. (1) After t iterations, the optimization or sampling algorithm has yielded a set of observed points Θ = {θ_1, ..., θ_N} ⊂ R^D in the parameter space. (2) A GPLVM is trained on Θ in order to build a latent space Ω that describes the lower-dimensional manifold on which the optimization/sampling is allegedly taking place. We can: move from the latent space Ω to the observed space Θ via θ = f(ω) + η, but not vice versa (f is not invertible); bring the gradients of a generic function g: Θ → R from the observed space Θ to the latent space Ω via ∇_ω g(f(ω)) = µ_Jᵀ ∇_θ g(θ). In this case a point estimate of J is given by the mean µ_J of its distribution.
15 AdaGeo Gradient-based Optimization. Minimization problem: θ* = arg min_θ g(θ). Iterative scheme (e.g. (stochastic) gradient descent): θ_{t+1} = θ_t + Δθ_t(∇_θ g), where Δθ_t is the update computed by the chosen optimizer (e.g. −λ_t ∇_θ g for gradient descent). We propose, after having learned a latent representation with the GPLVM, to move the problem onto the latent space Ω. Minimization problem: ω* = arg min_ω g(f(ω)). Iterative scheme (e.g. (stochastic) gradient descent): ω_{t+1} = ω_t + Δω_t(∇_ω g).
17 AdaGeo Gradient-based Optimization.
Algorithm 1 AdaGeo gradient-based optimization (minimization)
1: while convergence is not reached do
2: Perform T_θ iterations with classic updates on the parameter space Θ: Δθ_t = Δθ_t(∇_θ g(θ)), θ_{t+1} = θ_t + Δθ_t
3: Train the GP-LVM model on the samples Θ = {θ_1, ..., θ_{T_θ}}
4: Continue performing T_ω iterations using the AdaGeo optimizer: Δω_t = Δω_t(∇_ω g(f(ω))), ω_{t+1} = ω_t + Δω_t, moving back to the parameter space with θ_{t+1} = f(ω_{t+1})
5: end while
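A minimal, illustrative sketch of the alternation in Algorithm 1. For brevity PCA stands in for the GPLVM (so the learned map is linear and its Jacobian is a constant matrix) and the objective g is a toy ill-conditioned quadratic; this is a sketch of the structure, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA

def g(theta):                      # toy objective: ill-conditioned quadratic
    scales = np.logspace(0, 2, theta.size)
    return 0.5 * np.sum(scales * theta**2)

def grad_g(theta):
    scales = np.logspace(0, 2, theta.size)
    return scales * theta

D, Q, lr = 50, 3, 1e-3
T_theta, T_omega = 100, 100
theta = np.ones(D)

for _ in range(5):                 # "while convergence is not reached" in Algorithm 1
    # 1) T_theta classic gradient-descent updates in the parameter space
    history = []
    for _ in range(T_theta):
        theta = theta - lr * grad_g(theta)
        history.append(theta.copy())

    # 2) Learn a latent representation of the visited points
    #    (PCA stands in for the GPLVM; f(omega) = mean + omega @ W, Jacobian J = W^T)
    pca = PCA(n_components=Q).fit(np.array(history))
    W = pca.components_                        # Q x D
    omega = pca.transform(theta[None, :])[0]   # current point in latent coordinates

    # 3) T_omega AdaGeo updates in the latent space, transporting gradients with J
    for _ in range(T_omega):
        theta = pca.mean_ + omega @ W          # theta = f(omega)
        grad_omega = W @ grad_g(theta)         # J^T grad_theta (here J = W^T)
        omega = omega - lr * grad_omega
    theta = pca.mean_ + omega @ W              # move back to the parameter space

print(g(theta))
```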
18 Experiment: Logistic Regression on MNIST. A single-layer neural network implementing logistic regression on MNIST, trained with SGD. Dimension of parameter space: D = 7850. Dimension of latent space: Q = 9. Iterations: T_θ = 20, T_ω = 30. [Figure: neural network training loss as a function of the number of epochs, SGD vs. AdaGeo-SGD.]
19 Experiment: Gaussian Process Training. Concrete compressive strength dataset [3]: regression task with 8 real variables. Composite kernel (RBF, Matérn, linear and bias). Dimension of parameter space: D = 9. Dimension of latent space: Q = 3. Iterations: T_θ = 15, T_ω = … [Figure: GP negative log-likelihood as a function of the number of iterations, gradient descent vs. AdaGeo gradient descent.] [3] Lichman, M., UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (2013)
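The transcription does not give the exact kernel composition or training details; the following is a hedged scikit-learn sketch of a composite RBF + Matérn + linear + bias kernel fitted by maximizing the log marginal likelihood (scikit-learn's internal optimizer, not the plain gradient descent used in the experiment), with placeholder data standing in for the concrete dataset.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, DotProduct, ConstantKernel

# Composite covariance: RBF + Matern + linear (DotProduct) + bias (ConstantKernel).
kernel = (RBF(length_scale=1.0) + Matern(length_scale=1.0, nu=2.5)
          + DotProduct(sigma_0=1.0) + ConstantKernel(constant_value=1.0))

# Placeholder data standing in for the concrete compressive strength regression task
# (8 real input variables, 1 real target).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(200)

gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, normalize_y=True)
gp.fit(X, y)                                             # optimizes kernel hyperparameters
print(-gp.log_marginal_likelihood(gp.kernel_.theta))     # negative log-likelihood
```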
20 AdaGeo Bayesian Sampling. Bayesian sampling framework: a dataset X = {x_1, ..., x_N} is given. X is modeled with a generative model whose likelihood is p(X | θ) = ∏_{i=1}^{N} p(x_i | θ), parameterized by the vector θ ∈ R^D, with prior p(θ). Performing statistical inference means getting insights on the posterior distribution p(θ | X) = p(X | θ) p(θ) / p(X), analytically or approximately through sampling.
21 AdaGeo Bayesian Sampling. Bayesian sampling framework: a dataset X = {x_1, ..., x_N} is given. X is modeled with a generative model whose likelihood is p(X | θ) = ∏_{i=1}^{N} p(x_i | θ), parameterized by the vector θ ∈ R^D, with prior p(θ). Unfortunately the denominator is often intractable. One possible approach is to approximate the integral p(X) = ∫_Θ p(X, θ) dθ = ∫_Θ p(X | θ) p(θ) dθ through Markov Chain Monte Carlo or similar methods.
22 Stochastic Gradient Langevin Dynamics. Stochastic gradient Langevin dynamics [4] combines stochastic optimization and the physical concept of Langevin dynamics to build a posterior sampler. At each time t a mini-batch is extracted and the parameters are updated as:
θ_{t+1} = θ_t + Δθ_t,
Δθ_t = (ε_t / 2) (∇_θ log p(θ_t) + (N/n) Σ_{i=1}^{n} ∇_θ log p(x_{ti} | θ_t)) + η_t, with η_t ~ N(0, ε_t I),
and with the learning rate ε_t satisfying Σ_{t=1}^{∞} ε_t = ∞ and Σ_{t=1}^{∞} ε_t² < ∞.
[4] Welling, M., Teh, Y. W., Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML) (2011)
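A minimal SGLD sketch on a toy conjugate model (x_i ~ N(θ, 1) with prior θ ~ N(0, 10)), using mini-batches of size n and a polynomially decaying step size; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, D = 10_000, 100, 1
data = rng.normal(2.0, 1.0, size=N)            # x_i ~ N(theta_true = 2, 1)

def grad_log_prior(theta):
    return -theta / 10.0                        # theta ~ N(0, 10)

def grad_log_lik(theta, batch):
    return np.sum(batch - theta)                # sum_i d/dtheta log N(x_i | theta, 1)

theta, samples = np.zeros(D), []
for t in range(1, 5_001):
    eps_t = 1e-4 * t ** (-0.55)                 # decaying step size
    batch = rng.choice(data, size=n, replace=False)
    drift = grad_log_prior(theta) + (N / n) * grad_log_lik(theta, batch)
    theta = theta + 0.5 * eps_t * drift + rng.normal(0.0, np.sqrt(eps_t), size=D)
    samples.append(theta.copy())

print(np.mean(samples[1000:]), np.std(samples[1000:]))   # approximate posterior over theta
```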
23 AdaGeo - Stochastic Gradient Langevin Dynamics. As before, we propose to: (1) pick your favourite sampler and produce the first t samples to build the set Θ = {θ_1, ..., θ_N}; (2) train a GPLVM on Θ to learn the latent space Ω; (3) move the updates onto the latent space with AdaGeo - Stochastic Gradient Langevin Dynamics:
ω_{t+1} = ω_t + Δω_t,
Δω_t = (ε_t / 2) (∇_ω log p(f(ω_t)) + (N/n) Σ_{i=1}^{n} ∇_ω log p(x_{ti} | f(ω_t))) + η_t, with η_t ~ N(0, ε_t I).
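A sketch of the latent-space update under simplifying assumptions: the trained GPLVM mapping and its mean Jacobian are replaced by one fixed linear map W, so gradient transport reduces to multiplication by Wᵀ; the model and step sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q, N, n = 20, 3, 10_000, 100
W = rng.standard_normal((D, Q))                 # stand-in for f and for mu_J
data = rng.normal(0.5, 1.0, size=(N, D))

def grad_log_post_theta(theta, batch):
    # x_i ~ N(theta, I), theta ~ N(0, 10 I): gradient of the mini-batch log posterior
    return -theta / 10.0 + (N / n) * np.sum(batch - theta, axis=0)

omega = np.zeros(Q)
for t in range(1, 1_001):
    eps_t = 1e-6 * t ** (-0.55)
    batch = data[rng.choice(N, size=n, replace=False)]
    theta = W @ omega                                         # theta = f(omega)
    grad_omega = W.T @ grad_log_post_theta(theta, batch)      # mu_J^T grad_theta
    omega = omega + 0.5 * eps_t * grad_omega + rng.normal(0.0, np.sqrt(eps_t), size=Q)

print(W @ omega)                                # current sample mapped back to Theta
```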
24 Experiment: Sampling from the Banana Distribution. The banana distribution has density p(θ) ∝ exp( −θ_1²/200 − ½ (θ_2 + b θ_1² − 100 b)² − ½ Σ_{j=3}^{D} θ_j² ). θ_1 and θ_2 present the interaction shown on the left, while the other variables produce Gaussian noise. [Figure: scatter plot of θ_1 vs. θ_2 showing the banana-shaped interaction.]
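A short sketch of the banana log-density and its gradient, assuming the standard parameterization above with curvature parameter b (the value of b used in the experiment is not stated in the transcription).

```python
import numpy as np

def banana_log_density(theta, b=0.1):
    """Log-density of the D-dimensional banana distribution, up to an additive constant."""
    bent = theta[1] + b * theta[0] ** 2 - 100.0 * b
    return -theta[0] ** 2 / 200.0 - 0.5 * bent ** 2 - 0.5 * np.sum(theta[2:] ** 2)

def banana_grad(theta, b=0.1):
    bent = theta[1] + b * theta[0] ** 2 - 100.0 * b
    grad = -theta.copy()                      # covers the Gaussian tail dimensions j >= 3
    grad[0] = -theta[0] / 100.0 - bent * 2.0 * b * theta[0]
    grad[1] = -bent
    return grad

theta = np.zeros(50)                          # D = 50, as in the experiment
print(banana_log_density(theta), banana_grad(theta)[:3])
```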
25 Experiment: Sampling from the Banana Distribution. A Metropolis-Hastings sampler returns the first 100 samples drawn from a 50-dimensional banana distribution. AdaGeo-SGLD is then employed to sample from a 5-dimensional latent space. [Figure: samples in the (θ_1, θ_2) and (θ_2, θ_3) planes, MH sampler vs. AdaGeo sampler.]
26 Bonus Round: Riemannian Extensions (theory only). If the covariance function of a Gaussian Process is differentiable, then it is straightforward to show that the mapping f is also differentiable. Under this assumption we can compute the latent metric tensor G, which gives further information about the geometry of the latent space (distances, geodesics, etc.). If J is the Jacobian of the mapping f, then G = Jᵀ J. This yields a distribution over the metric tensor [5]: G ~ W_Q(D, Σ_J, E[J]ᵀ E[J]), and a point estimate can be obtained with E[Jᵀ J] = E[J]ᵀ E[J] + D Σ_J. [5] Tosi, A., Hauberg, S., Vellido, A., Lawrence, N. D., Metrics for probabilistic geometries. Uncertainty in Artificial Intelligence (2014)
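A small numpy sketch of the point estimate E[Jᵀ J] = E[J]ᵀ E[J] + D Σ_J, checked against a Monte Carlo average over Jacobians drawn row-wise from the Gaussian distribution above; all matrices are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q = 5, 2
mean_J = rng.standard_normal((D, Q))        # E[J]: stacked mean rows mu_{J_i,:}
A = rng.standard_normal((Q, Q))
Sigma_J = A @ A.T                           # Q x Q covariance shared by the rows of J

# Closed-form point estimate of the expected metric tensor.
G_expected = mean_J.T @ mean_J + D * Sigma_J
print(G_expected)

# Monte Carlo check: draw Jacobians row-wise from N(mu_{J_i,:}, Sigma_J) and average J^T J.
draws = [
    np.vstack([rng.multivariate_normal(mean_J[i], Sigma_J) for i in range(D)])
    for _ in range(5_000)
]
print(np.mean([J.T @ J for J in draws], axis=0))   # approaches G_expected
```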
27 Bonus Round: Stochastic Gradient Riemannian Langevin Dynamics. Stochastic gradient Riemannian Langevin dynamics [6] puts together the advantages of exploiting a known Riemannian geometry with the scalability of the stochastic optimization approaches:
θ_{t+1} = θ_t + Δθ_t,
Δθ_t = (ε_t / 2) µ(θ_t) + G^{-1/2}(θ_t) η_t, with η_t ~ N(0, ε_t I),
where
µ(θ)_j = ( G⁻¹(θ) ( ∇_θ log p(θ) + (N/n) Σ_{i=1}^{n} ∇_θ log p(x_{ti} | θ) ) )_j − 2 Σ_{k=1}^{D} ( G⁻¹(θ) ∂G(θ)/∂θ_k G⁻¹(θ) )_{jk} + Σ_{k=1}^{D} ( G⁻¹(θ) )_{jk} Tr( G⁻¹(θ) ∂G(θ)/∂θ_k ).
[6] Patterson, S., Teh, Y. W., Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Advances in Neural Information Processing Systems (2013)
28 Bonus Round: AdaGeo - Stochastic Gradient Riemannian Langevin Dynamics. Analogously to the SGLD case, we can now move the update into the latent space with AdaGeo - Stochastic gradient Riemannian Langevin dynamics:
ω_{t+1} = ω_t + Δω_t,
Δω_t = (ε_t / 2) µ(ω_t) + G_ω^{-1/2}(ω_t) η_t, with η_t ~ N(0, ε_t I),
where
µ(ω)_j = ( G_ω⁻¹(ω) ( ∇_ω log p(f(ω)) + (N/n) Σ_{i=1}^{n} ∇_ω log p(x_{ti} | f(ω)) ) )_j − 2 Σ_{k=1}^{Q} ( G_ω⁻¹(ω) ∂G_ω(ω)/∂ω_k G_ω⁻¹(ω) )_{jk} + Σ_{k=1}^{Q} ( G_ω⁻¹(ω) )_{jk} Tr( G_ω⁻¹(ω) ∂G_ω(ω)/∂ω_k ).
29 Conclusions. We develop a generic framework for combining dimensionality reduction techniques with sampling and optimization methods. We contribute to gradient-based optimization methods by coupling them with appropriate dimensionality reduction techniques; in particular, we improve the performance of gradient descent and stochastic gradient descent when training, respectively, a Gaussian Process and a neural network. We contribute to Markov Chain Monte Carlo by developing an AdaGeo version of stochastic gradient Langevin dynamics; the information gathered through the latent space is employed to compute the steps of the Markov chain. We extend the approach to stochastic gradient Riemannian Langevin dynamics, thanks to the metric tensor naturally recovered by the GP-LVM model.
30 Thank you. Reference: Abbati, G., Tosi, A., Flaxman, S., Osborne, M. A. (2018). AdaGeo: Adaptive Geometric Learning for Optimization and Sampling. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), to appear.