Annealing Between Distributions by Averaging Moments
1 Annealing Between Distributions by Averaging Moments. Chris J. Maddison (Dept. of Computer Science, University of Toronto), Roger Grosse (CSAIL, MIT), Ruslan Salakhutdinov (University of Toronto)
2-3 Partition Functions
We usually specify distributions up to a normalizing constant, p(y) = f(y)/Z:
- MRFs: y = x, f = exp(−E(x, θ)), Z = Z(θ)
- Posteriors: y = θ, f = p(x | θ) p(θ), Z = p(x)
For Markov random fields (MRFs) the partition function Z(θ) = Σ_x exp(−E(x, θ)) is intractable.
Goal: estimate log Z(θ).
4 Estimating Partition Functions
Variational approximations and bounds on log Z (Yedidia et al., 2005; Wainwright et al., 2005). We want our models to reflect a highly dependent world, and this can hurt variational approaches as we assume more and more independence. The assumption is less costly for posterior inference over parameters.
Sampling methods, such as path sampling (Gelman and Meng, 1998), sequential Monte Carlo (e.g., Del Moral et al., 2006), simple importance sampling, and annealed importance sampling (Neal, 2001). Slow, finicky, and hard to diagnose, but in principle they can deal with multimodality.
5 Simple Importance Sampling (SIS)
Take two distributions p_a(x) = f_a(x)/Z_a and p_b(x) = f_b(x)/Z_b over X, where f_a has a tractable Z and is easy to sample, while f_b has an intractable Z and is hard to sample. Since
\[ \int \frac{f_b(x)}{f_a(x)}\, p_a(x)\, dx = \frac{Z_b}{Z_a}, \]
we can estimate
\[ \frac{Z_b}{Z_a} \approx \frac{1}{M} \sum_{i=1}^{M} \frac{f_b(x^{(i)})}{f_a(x^{(i)})}, \qquad x^{(i)} \sim p_a(x). \]
The variance is high (sometimes infinite) if p_a ≪ p_b in some regions.
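A minimal NumPy sketch of SIS under illustrative assumptions (the two 1-D Gaussians and all parameters below are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized densities. p_a = N(0, 1), so Z_a = sqrt(2*pi);
# p_b = N(1, 0.8^2), so Z_b = 0.8 * sqrt(2*pi) and Z_b / Z_a = 0.8.
f_a = lambda x: np.exp(-0.5 * x**2)
f_b = lambda x: np.exp(-0.5 * ((x - 1.0) / 0.8) ** 2)

M = 100_000
x = rng.standard_normal(M)        # x^(i) ~ p_a

print(np.mean(f_b(x) / f_a(x)))   # estimates Z_b / Z_a; true value is 0.8
```

Swapping the roles so that p_b has heavier tails than p_a makes the weights f_b/f_a unbounded under p_a, which is exactly the high (here, infinite) variance failure noted above.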
6-7 An Intuition
Think of β as an inverse temperature: move gradually from a hotter p_a to a colder p_b, as in annealing. Reduce variance by chaining importance samplers:
\[ \frac{Z_K}{Z_0} = \frac{Z_1}{Z_0}\,\frac{Z_2}{Z_1}\cdots\frac{Z_K}{Z_{K-1}} \approx \prod_{k=1}^{K} \frac{1}{M} \sum_{i=1}^{M} \frac{f_k(x_{k-1}^{(i)})}{f_{k-1}(x_{k-1}^{(i)})}, \]
where we independently draw x_k^{(i)} ~ p_k(x). This is hard to do.
8-14 Annealed Importance Sampling (AIS)
We can still do this with certain dependent samplers! (Neal, 2001.) Use transition operators T_k(x' | x) that leave p_k(x) invariant. For chain i:
\[ x_0 \sim p_0, \qquad x_k \sim T_k(\,\cdot \mid x_{k-1}) \quad \text{for } k = 1, \dots, K-1, \]
\[ w^{(i)} = \frac{f_1(x_0^{(i)})}{f_0(x_0^{(i)})}\, \frac{f_2(x_1^{(i)})}{f_1(x_1^{(i)})}\, \frac{f_3(x_2^{(i)})}{f_2(x_2^{(i)})} \cdots \frac{f_K(x_{K-1}^{(i)})}{f_{K-1}(x_{K-1}^{(i)})}, \qquad \frac{Z_K}{Z_0} \approx \frac{1}{M} \sum_{i=1}^{M} w^{(i)}. \]
Intuition: this is SIS on an extended state space, and remarkably it is unbiased!
15-16 Analyzing AIS Paths
Virtually the only scheme used is geometric averages,
\[ f_k(x) = f_a(x)^{1-\beta_k}\, f_b(x)^{\beta_k}. \]
More generally, let P be a family of distributions parameterized by θ. Define a path γ : [0, 1] → P and a schedule of points 0 = β_0 < β_1 < ... < β_K = 1, with intermediate distributions p_k = p_{θ(β_k)} running from p_a to p_b.
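A sketch of AIS on the geometric path, with Metropolis steps standing in for the T_k (the 1-D Gaussian endpoints, proposal width, and the values of K and M are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Anneal from p_a = N(0, 1) to p_b = N(4, 0.5^2) along the geometric path
# f_k = f_a^(1 - beta_k) * f_b^(beta_k).
log_f_a = lambda x: -0.5 * x ** 2
log_f_b = lambda x: -0.5 * ((x - 4.0) / 0.5) ** 2
log_f = lambda x, b: (1 - b) * log_f_a(x) + b * log_f_b(x)

K, M = 1000, 500
betas = np.linspace(0.0, 1.0, K + 1)

x = rng.standard_normal(M)       # x_0^(i) ~ p_0 = p_a
log_w = np.zeros(M)
for k in range(1, K + 1):
    # Weight update: log f_k(x_{k-1}) - log f_{k-1}(x_{k-1}).
    log_w += log_f(x, betas[k]) - log_f(x, betas[k - 1])
    if k < K:
        # One Metropolis step T_k leaving p_k invariant.
        prop = x + 0.5 * rng.standard_normal(M)
        accept = np.log(rng.random(M)) < log_f(prop, betas[k]) - log_f(x, betas[k])
        x = np.where(accept, prop, x)

# E[w] is unbiased for Z_K / Z_0; average in log space for stability.
print(np.logaddexp.reduce(log_w) - np.log(M))   # true log(Z_b / Z_a) = log 0.5
```

Setting K = 1 recovers plain SIS, and the estimate degrades sharply; that gap is the point of annealing.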
17-18 Analyzing AIS Paths
Assume perfect transitions, T_k(x' | x_{k-1}) = p_k(x'). Then
\[ \mathbb{E}\big[\log w^{(i)}\big] = \sum_{k=1}^{K} \mathbb{E}_{p_{k-1}}\big[\log f_k(x) - \log f_{k-1}(x)\big], \]
a finite difference approximation (plus assumptions and some algebra) to
\[ \int_0^1 \frac{d \log Z(\theta(\beta))}{d\beta}\, d\beta = \log \frac{Z_K}{Z_0}. \]
Intuition: the log w^{(i)} accumulate finite differences of log Z.
19-21 Analyzing AIS Paths
What is the error for a fixed number of intermediate distributions?
E[log w^{(i)}] is biased, unlike E[w^{(i)}]:
\[ \mathbb{E}\big[\log w^{(i)}\big] = \sum_{k=1}^{K} \mathbb{E}_{p_{k-1}}\big[\log f_k(x) - \log f_{k-1}(x)\big] = \log \frac{Z_K}{Z_0} - \underbrace{\sum_{k=1}^{K} D_{\mathrm{KL}}(p_{k-1} \,\|\, p_k)}_{\text{bias } \delta(\log w^{(i)})}. \]
With perfect transitions, the following are all monotonic in the sum of KL divergences: δ(log w^{(i)}), var[log w^{(i)}], and var[w^{(i)}].
Goal: minimize the sum of KL divergences.
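The identity follows in one line from f_k = Z_k p_k (a reasoning step filled in here, not spelled out on the slides):
\[ \mathbb{E}_{p_{k-1}}\!\left[\log \frac{f_k(x)}{f_{k-1}(x)}\right] = \mathbb{E}_{p_{k-1}}\!\left[\log \frac{Z_k\, p_k(x)}{Z_{k-1}\, p_{k-1}(x)}\right] = \log \frac{Z_k}{Z_{k-1}} - D_{\mathrm{KL}}(p_{k-1} \,\|\, p_k), \]
and summing over k telescopes the first term to log(Z_K/Z_0).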
22 Analyzing AIS Paths
Approach: approximate the sum of KLs with a functional. Let γ be a path and {β_k} a linearly spaced schedule. Then
\[ \sum_{k=1}^{K} D_{\mathrm{KL}}(p_{k-1} \,\|\, p_k) \approx \frac{\mathcal{F}(\gamma)}{K}, \qquad \mathcal{F}(\gamma) = \frac{1}{2} \int_0^1 \dot\theta(\beta)^T G_{\theta(\beta)}\, \dot\theta(\beta)\, d\beta, \]
where G_θ = cov_{p_θ}(∇_θ log p_θ(x)) is the Fisher information, the metric on the manifold it defines, and θ̇(β) = dθ(β)/dβ. This has ties to information geometry; an analogous functional arises in path sampling (Gelman and Meng, 1998).
23 Analyzing AIS Paths
Under the optimal schedule, the value of the functional is F(γ) = ℓ(γ)²/2, where
\[ \ell(\gamma) = \int_0^1 \sqrt{\dot\theta(\beta)^T G_{\theta(\beta)}\, \dot\theta(\beta)}\; d\beta \]
is the path length on the Riemannian manifold defined by G_{θ(β)}.
Intuition: spend more time on segments with high curvature.
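One way to see the ℓ(γ)²/2 value (a standard Cauchy-Schwarz step, added here for completeness):
\[ \ell(\gamma)^2 = \left( \int_0^1 \sqrt{\dot\theta^T G_\theta\, \dot\theta}\; d\beta \right)^{\!2} \le \int_0^1 1\, d\beta \int_0^1 \dot\theta^T G_\theta\, \dot\theta\, d\beta = 2\, \mathcal{F}(\gamma), \]
with equality exactly when the speed is constant in β; the optimal schedule is therefore the constant-speed reparameterization, and it achieves F(γ) = ℓ(γ)²/2.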
24-26 Paths for Exponential Family Distributions
Let us restrict ourselves to the exponential family,
\[ p(x) = \frac{1}{Z(\eta)}\, h(x) \exp(\eta^T g(x)). \]
Either the natural parameters η or the moments s = E[g(x)] completely specify a distribution; there is a one-to-one correspondence between η and s.
Old, geometric averages path: γ_GA(β) is the distribution with η(β) = (1 − β) η(0) + β η(1).
New, moment averages path: γ_MA(β) is the distribution with s(β) = (1 − β) s(0) + β s(1).
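To make the η ↔ s correspondence concrete, take a 1-D Gaussian, where η = (µ/σ², −1/(2σ²)) and s = (E[x], E[x²]); a sketch (function names are mine) showing that GA and MA midpoints genuinely differ:

```python
import numpy as np

def eta_to_s(eta):
    """1-D Gaussian: natural parameters (eta1, eta2) -> moments (E[x], E[x^2])."""
    var = -1.0 / (2.0 * eta[1])
    mu = eta[0] * var
    return np.array([mu, var + mu ** 2])

def s_to_eta(s):
    """Inverse map: moments -> natural parameters."""
    mu, ex2 = s
    var = ex2 - mu ** 2
    return np.array([mu / var, -1.0 / (2.0 * var)])

s_a, s_b = np.array([0.0, 1.0]), np.array([10.0, 101.0])   # N(0,1), N(10,1)
print(eta_to_s(0.5 * (s_to_eta(s_a) + s_to_eta(s_b))))     # GA midpoint: [5, 26]
print(0.5 * (s_a + s_b))                                   # MA midpoint: [5, 51]
```

The GA midpoint is N(5, 1), while the MA midpoint has variance 51 − 25 = 26: moment averaging stretches the distribution to cover both endpoints, which is the picture on the next slides.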
27-29 The Picture for Gaussians
Geometric averages are often unintuitive (figure: GA and MA paths between two Gaussians).
The GA path γ_GA places mass only where both pdfs agree: "veto" effects.
The MA path γ_MA interpolates the means and covariances, then stretches the covariance.
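For Gaussians, moment matching gives the MA path in closed form: interpolate the means and covariances, plus a rank-one stretch along the line between the means. A sketch (function name and example parameters are mine):

```python
import numpy as np

def ma_gaussian(beta, mu_a, Sigma_a, mu_b, Sigma_b):
    """Moment-averaged Gaussian: matches E[x] and E[x x^T] of the endpoints."""
    mu = (1 - beta) * mu_a + beta * mu_b
    d = mu_b - mu_a
    # Averaging second moments and subtracting mu mu^T leaves an extra
    # beta * (1 - beta) * d d^T term: the "stretch" between the means.
    Sigma = (1 - beta) * Sigma_a + beta * Sigma_b + beta * (1 - beta) * np.outer(d, d)
    return mu, Sigma

mu, Sigma = ma_gaussian(0.5, np.zeros(2), np.eye(2),
                        np.array([10.0, 0.0]), np.eye(2))
print(mu)      # [5. 0.]
print(Sigma)   # [[26. 0.], [0. 1.]]: stretched along the line between the means
```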
30 Path Properties: Variational Interpretation
For geometric averages, the intermediate distribution minimizes a weighted sum of KL divergences,
\[ p_\beta^{(GA)} = \arg\min_p\; (1 - \beta)\, D_{\mathrm{KL}}(p \,\|\, p_a) + \beta\, D_{\mathrm{KL}}(p \,\|\, p_b). \]
For moment averages, the same but with the reverse KLs,
\[ p_\beta^{(MA)} = \arg\min_p\; (1 - \beta)\, D_{\mathrm{KL}}(p_a \,\|\, p) + \beta\, D_{\mathrm{KL}}(p_b \,\|\, p). \]
(Figure: GA path vs. MA path.)
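A quick check of the GA case (a derivation added here; the MA case is analogous, since moment matching minimizes the reverse KL within the exponential family):
\[ (1 - \beta)\, D_{\mathrm{KL}}(p \,\|\, p_a) + \beta\, D_{\mathrm{KL}}(p \,\|\, p_b) = \mathbb{E}_p\!\left[\log \frac{p(x)}{f_a(x)^{1-\beta} f_b(x)^{\beta}}\right] + \text{const} = D_{\mathrm{KL}}\big(p \,\|\, p_\beta^{(GA)}\big) + \text{const}, \]
which is minimized by p = p_β^{(GA)}.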
31 Path Properties: Cost Functional
For the exponential family we can evaluate F(γ) in closed form. Importantly, this assumes linear schedules.
Both γ_GA and γ_MA give the same value of the functional:
\[ \mathcal{F}(\gamma_{GA}) = \mathcal{F}(\gamma_{MA}) = \tfrac{1}{2}\, (s(1) - s(0))^T (\eta(1) - \eta(0)). \]
If we partition the path at knot distributions p_j into piecewise linear segments, with K_j intermediate distributions on segment j, the cost becomes
\[ \sum_j \frac{(\eta_{j+1} - \eta_j)^T (s_{j+1} - s_j)}{2 K_j}, \]
so we can optimally allocate distributions from a total budget K = Σ_j K_j.
The biggest effect is the choice of path, not the schedule.
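Minimizing Σ_j F_j / K_j subject to Σ_j K_j = K gives K_j ∝ √F_j by a Lagrange multiplier argument; a sketch of the allocation (the helper name and the rounding scheme are mine):

```python
import numpy as np

def allocate_budget(etas, ss, K):
    """Split a budget of K intermediate distributions across the piecewise
    linear segments between knots (eta_j, s_j), proportionally to sqrt(F_j)."""
    F = np.array([0.5 * np.dot(etas[j + 1] - etas[j], ss[j + 1] - ss[j])
                  for j in range(len(etas) - 1)])
    w = np.sqrt(F)                      # K_j proportional to sqrt(F_j)
    return np.maximum(1, np.round(K * w / w.sum()).astype(int))
```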
32-33 Gaussians
(Figure: annealing between two example Gaussians.) The number of intermediate distributions required to anneal between two Gaussians with means µ_1 and µ_2 and common variance σ², as a function of d = (µ_2 − µ_1)/σ:
- GA, linear schedule: O(d²)
- MA, linear schedule: O(d²)
- GA, optimal schedule: O(d²)
- MA, optimal schedule: O((log d)²)
- Optimal path (Gelman and Meng, 1998): O((log d)²)
MA under its optimal schedule is within a constant factor of the optimal path, though there is no general proof yet. Mixing issues dominate performance in practice, and that is where MA shines.
34 Restricted Boltzmann Machines (RBMs)
RBMs are Markov random fields over coupled visible and hidden binary variables x = (v, h) ∈ {0,1}^D × {0,1}^F with a special bipartite structure: stochastic binary visible units connect to stochastic binary hidden units, with no connections within a layer. The energy of a joint configuration is
\[ E(v, h, \theta) = \underbrace{-v^T W h}_{\text{binary potentials}} \; \underbrace{-\, v^T c - h^T b}_{\text{unary potentials}}, \]
where θ = (W, c, b). The partition function
\[ Z(\theta) = \sum_v \sum_h \exp(-E(v, h, \theta)) \]
is generally intractable, except when we have a small number (< 25) of hidden units or few visible units.
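The bipartite structure lets us sum out one layer analytically, so a tiny RBM's log Z can be computed exactly by enumerating only the visible states; a sketch (dimensions and parameters are illustrative, and this is feasible only at toy scale, per the slide):

```python
import numpy as np
from itertools import product

def exact_log_z(W, c, b):
    """Exact log Z for an RBM with E(v,h) = -v^T W h - v^T c - h^T b:
    enumerate v and sum out h in closed form."""
    D, F = W.shape
    log_terms = []
    for bits in product([0, 1], repeat=D):
        v = np.array(bits, dtype=float)
        # sum_h exp(-E) = exp(v^T c) * prod_j (1 + exp(v^T W_j + b_j))
        log_terms.append(v @ c + np.sum(np.logaddexp(0.0, v @ W + b)))
    return np.logaddexp.reduce(np.array(log_terms))

rng = np.random.default_rng(0)
D, F = 6, 4                               # tiny on purpose: 2^D visible states
W = 0.1 * rng.standard_normal((D, F))
print(exact_log_z(W, rng.standard_normal(D), rng.standard_normal(F)))
```

For small models like the 20-hidden-unit RBM in the experiments below, this kind of exact computation provides ground truth.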
35 Restricted Boltzmann Machines (RBMs) Two different paths for an RBM trained on the MNIST digit dataset (60,000 B&W images of digits).
36 Restricted Boltzmann Machines (RBMs)
The MA path is infeasible for most RBMs: we can estimate the averaged moments,
\[ \mathbb{E}[v h^T]_\beta = (1 - \beta)\, \mathbb{E}[v h^T]_0 + \beta\, \mathbb{E}[v h^T]_1, \]
but solving for the natural parameters that realize them is hard. We can do it approximately for a few intermediate models.
Moment Averaged Spline path (γ_MAS): the knots are moment matched, and we anneal between them with GA (figure: spline path from p_a to p_b through the knots).
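A sketch of the estimate-the-moments step, assuming block Gibbs sampling from each endpoint RBM (the helper name, chain length, and the absence of burn-in handling are my choices):

```python
import numpy as np

def rbm_vh_moments(W, c, b, n_steps=2000, rng=None):
    """Estimate E[v h^T] under an RBM by block Gibbs sampling
    (a rough sketch: single chain, no burn-in discarded)."""
    rng = rng or np.random.default_rng()
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    D, F = W.shape
    v = (rng.random(D) < 0.5).astype(float)
    vh_sum = np.zeros((D, F))
    for _ in range(n_steps):
        h = (rng.random(F) < sigmoid(v @ W + b)).astype(float)
        v = (rng.random(D) < sigmoid(W @ h + c)).astype(float)
        vh_sum += np.outer(v, h)
    return vh_sum / n_steps

# Averaged moments at a knot beta, given endpoint parameters (W0, c0, b0)
# and (W1, c1, b1) (hypothetical names):
#   s_beta = (1 - beta) * rbm_vh_moments(W0, c0, b0) \
#            + beta * rbm_vh_moments(W1, c1, b1)
```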
37 Estimating Partition Functions of RBMs
Estimating the partition function of an RBM with 20 hidden units trained on MNIST with PCD (figure).
38 Estimating Partition Functions of RBMs
Estimating the partition function of an RBM with 500 hidden units trained on MNIST with CD1 (figure). Misestimating log Z by 20 nats is the difference between a test log-probability of −130 and −110 on MNIST! Since p(x) = exp(−E(x, θ))/Z(θ), overestimating Z(θ) means underestimating p(x), and vice versa.
39 Conclusions
- Theoretical foundations for studying AIS under perfect mixing
- A new annealing scheme for exponential family distributions, with practical approximations
- Improved estimates of partition functions for RBMs
Ongoing work:
- Diagnostics for AIS
- Extending this work to models that are harder for AIS
- MA intermediate distributions for other tempering-based methods, such as learning MRFs with MA parallel tempering
- Using the MA path in marginal likelihood estimation for directed models
40 Thanks!