Annealing Between Distributions by Averaging Moments


Chris J. Maddison (Dept. of Computer Science, University of Toronto), Roger Grosse (CSAIL, MIT), Ruslan Salakhutdinov (University of Toronto)

Partition Functions

We usually specify distributions only up to a normalizing constant, p(y) = f(y)/Z:

         MRFs                Posteriors
  y      x                   θ
  f      exp(−E(x, θ))       p(x | θ) p(θ)
  Z      Z(θ)                p(x)

For Markov random fields (MRFs), the partition function Z(θ) = Σ_x exp(−E(x, θ)) is intractable.

Goal: estimate log Z(θ).

Estimating Partition Functions

Variational approximations and bounds on log Z (Yedidia et al., 2005; Wainwright et al., 2005). We want our models to reflect a highly dependent world, but this hurts variational approaches, which assume more and more independence. The independence assumption is less costly for posterior inference over parameters.

Sampling methods, such as path sampling (Gelman and Meng, 1998), sequential Monte Carlo (e.g. del Moral et al., 2006), simple importance sampling, and annealed importance sampling (Neal, 2002). These are slow, finicky, and hard to diagnose, but in principle they can deal with multimodality.

Simple Importance Sampling (SIS)

Two distributions p_a(x) and p_b(x) over X:
- p_a(x) = f_a(x)/Z_a: tractable Z, easy to sample
- p_b(x) = f_b(x)/Z_b: intractable Z, hard to sample

Then

\frac{Z_b}{Z_a} = \int \frac{f_b(x)}{f_a(x)} \, p_a(x) \, dx \approx \frac{1}{M} \sum_{i=1}^{M} \frac{f_b(x^{(i)})}{f_a(x^{(i)})}, \qquad x^{(i)} \sim p_a(x).

Variance is high (sometimes infinite) if p_a ≪ p_b in some regions.
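A minimal numerical illustration (not from the talk): SIS between two 1-D Gaussian kernels, chosen so the true ratio Z_b/Z_a is known in closed form. The densities and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized densities (Gaussian kernels without their normalizers):
# p_a = N(0, 1), easy to sample; p_b = N(1, 0.5^2), pretend Z_b is unknown.
f_a = lambda x: np.exp(-0.5 * x**2)
f_b = lambda x: np.exp(-0.5 * ((x - 1.0) / 0.5)**2)

# Ground truth: Z = sqrt(2*pi) * sigma, hence Z_b / Z_a = 0.5.
M = 100_000
x = rng.standard_normal(M)      # x^(i) ~ p_a
w = f_b(x) / f_a(x)             # importance weights f_b / f_a
print("estimate:", w.mean(), " truth:", 0.5)
```

Reversing the roles (a broad target under a narrow proposal) puts p_a ≪ p_b over much of the space, and the same estimator becomes dominated by a few huge weights.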

An Intuition

Move gradually from a hotter p_a to a colder p_b — think of the interpolation parameter as an inverse temperature (if f_0 is the uniform distribution, the intermediate distributions are powers of f_1, which explains the term annealing). Reduce variance by chaining importance samplers through intermediate distributions f_0 = f_a, f_1, ..., f_K = f_b:

\frac{Z_K}{Z_0} = \frac{Z_1}{Z_0} \frac{Z_2}{Z_1} \cdots \frac{Z_K}{Z_{K-1}} \approx \frac{1}{M} \sum_{i=1}^{M} \frac{f_1(x_0^{(i)})}{f_0(x_0^{(i)})} \frac{f_2(x_1^{(i)})}{f_1(x_1^{(i)})} \cdots \frac{f_K(x_{K-1}^{(i)})}{f_{K-1}(x_{K-1}^{(i)})}

where we independently draw x_k^{(i)} ∼ p_k(x) — hard to do.

Annealed Importance Sampling (AIS)

We can still do this with certain dependent samplers! (Neal, 2002) Use transition operators T_k(x | x_{k-1}) that leave p_k(x) invariant.

For chain i:

x_0 ∼ p_0,  x_1 ∼ T_1(x | x_0),  x_2 ∼ T_2(x | x_1),  ...,  x_{K-1} ∼ T_{K-1}(x | x_{K-2})

w^{(i)} = \frac{f_1(x_0^{(i)})}{f_0(x_0^{(i)})} \frac{f_2(x_1^{(i)})}{f_1(x_1^{(i)})} \frac{f_3(x_2^{(i)})}{f_2(x_2^{(i)})} \cdots \frac{f_K(x_{K-1}^{(i)})}{f_{K-1}(x_{K-1}^{(i)})}

\frac{Z_K}{Z_0} \approx \frac{1}{M} \sum_{i=1}^{M} w^{(i)}

Intuition: SIS on an extended state space — remarkably, unbiased!
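A minimal runnable sketch of AIS as just described (not the authors' code): a geometric path between two 1-D Gaussian kernels, one Metropolis step per intermediate distribution, so the true answer log(Z_K/Z_0) = log 0.5 is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Geometric path between two unnormalized Gaussian kernels:
# f_a = N(0, 1) kernel (Z_a = sqrt(2*pi)), f_b = N(3, 0.5^2) kernel.
log_f_a = lambda x: -0.5 * x**2
log_f_b = lambda x: -0.5 * ((x - 3.0) / 0.5)**2
log_f = lambda x, b: (1 - b) * log_f_a(x) + b * log_f_b(x)

K, M = 1000, 200
betas = np.linspace(0.0, 1.0, K + 1)
x = rng.standard_normal(M)              # x_0^(i) ~ p_0
log_w = np.zeros(M)

for k in range(1, K + 1):
    # Accumulate log f_k(x_{k-1}) - log f_{k-1}(x_{k-1}).
    log_w += log_f(x, betas[k]) - log_f(x, betas[k - 1])
    # One Metropolis step leaving p_k invariant.
    prop = x + 0.5 * rng.standard_normal(M)
    accept = np.log(rng.random(M)) < log_f(prop, betas[k]) - log_f(x, betas[k])
    x = np.where(accept, prop, x)

# log(Z_K / Z_0) via log-mean-exp of the weights; truth is log 0.5.
est = np.logaddexp.reduce(log_w) - np.log(M)
print("estimate:", est, " truth:", np.log(0.5))
```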

Analyzing AIS Paths

Virtually the only scheme used is geometric averages,

f_k(x) = f_a(x)^{1 - \beta_k} f_b(x)^{\beta_k}

Let P be a family of distributions parameterized by θ. Define a path γ : [0, 1] → P and a schedule of points 0 = β_0 < β_1 < ... < β_K = 1, with p_k = p_{θ(β_k)}.

[Figure: a path on the manifold P from p_a to p_b, with the intermediate p_k spaced along it.]

Analyzing AIS Paths

Assume perfect transitions, T_k(x | x_{k-1}) = p_k(x). Then

E[\log w^{(i)}] = \sum_{k=1}^{K} E_{p_{k-1}}[\log f_k(x) - \log f_{k-1}(x)]

Each term is a finite-difference approximation; with assumptions plus some math, the sum approaches

\int_0^1 \frac{d \log Z(\theta(\beta))}{d\beta} \, d\beta = \log \frac{Z_K}{Z_0}

Intuition: the log w^{(i)}'s accumulate finite differences of log Z.
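For reference, the one-line identity behind the bias expression on the next slide (standard algebra, not verbatim from the talk): writing f_k = Z_k p_k,

\mathbb{E}_{p_{k-1}}\!\left[\log \frac{f_k(x)}{f_{k-1}(x)}\right]
= \mathbb{E}_{p_{k-1}}\!\left[\log \frac{Z_k \, p_k(x)}{Z_{k-1} \, p_{k-1}(x)}\right]
= \log \frac{Z_k}{Z_{k-1}} - D_{KL}(p_{k-1} \,\|\, p_k).

Summing over k telescopes the log Z terms to log(Z_K/Z_0) and leaves the sum of KL divergences as the bias.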

Analyzing AIS Paths

What is the error for a fixed number of intermediate distributions? E[log w^{(i)}] is biased, unlike E[w^{(i)}]:

E[\log w^{(i)}] = \sum_{k=1}^{K} E_{p_{k-1}}[\log f_k(x) - \log f_{k-1}(x)]
= \log \frac{Z_K}{Z_0} - \underbrace{\sum_{k=1}^{K} D_{KL}(p_{k-1} \| p_k)}_{\text{bias } \delta(\log w^{(i)})}

With perfect transitions, the following are all monotonic in the sum of KL divergences:

δ(log w^{(i)}),  var[log w^{(i)}],  var[w^{(i)}]

Goal: minimize the sum of KL divergences.

Analyzing AIS Paths

Approach: approximate the sum of KLs with a functional. Let γ be a path and {β_k} a linearly spaced schedule. Then, for large K,

\sum_{k=1}^{K} D_{KL}(p_{k-1} \| p_k) \approx \frac{F(\gamma)}{K}, \qquad F(\gamma) = \frac{1}{2} \int_0^1 \dot\theta(\beta)^T G_{\theta(\beta)} \dot\theta(\beta) \, d\beta,

where G_θ = cov_{p_θ}(∇_θ log p_θ(x)) is the Fisher information — a metric on the manifold — and \dot\theta(\beta) = dθ(β)/dβ. Ties to information geometry; an analogous functional appears in path sampling (Gelman and Meng, 1998).

Analyzing AIS Paths

Under the optimal schedule, the value of the functional is F(γ) = ℓ(γ)²/2, where

\ell(\gamma) = \int_0^1 \sqrt{\dot\theta(\beta)^T G_{\theta(\beta)} \dot\theta(\beta)} \, d\beta

is the path length on the Riemannian manifold defined by G_θ(β). Intuition: spend more time on segments with high curvature.

Paths for Exponential Family Distributions

Let us restrict ourselves to the exponential family,

p(x) = \frac{1}{Z(\eta)} h(x) \exp(\eta^T g(x)).

Either the natural parameters η or the moments s = E[g(x)] completely specify a distribution; there is a one-to-one correspondence between η and s.

Old — geometric averages path: γ_GA(β) is the distribution with
η(β) = (1 − β) η(0) + β η(1)

New — moment averages path: γ_MA(β) is the distribution with
s(β) = (1 − β) s(0) + β s(1)

The Picture for Gaussians

Often unintuitive...

[Figure: intermediate distributions of both paths for two Gaussians.]
- GA path: γ_GA places mass only where both pdfs agree — veto effects.
- MA path: γ_MA interpolates the means and covariances, then stretches the covariance.
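A small numerical sketch of the two path definitions for 1-D Gaussians (my notation; the endpoint parameters are illustrative), using the closed-form conversions between (μ, σ²), natural parameters, and moments:

```python
import numpy as np

def natural(mu, var):
    # Natural parameters of N(mu, var): (mu/var, -1/(2 var)).
    return np.array([mu / var, -0.5 / var])

def from_natural(eta):
    var = -0.5 / eta[1]
    return eta[0] * var, var            # (mu, var)

def moments(mu, var):
    # Moments s = (E[x], E[x^2]).
    return np.array([mu, var + mu**2])

def from_moments(s):
    return s[0], s[1] - s[0]**2         # (mu, var)

a, b = (-10.0, 1.0), (10.0, 1.0)        # endpoint (mu, var) pairs
for beta in [0.25, 0.5, 0.75]:
    ga = from_natural((1 - beta) * natural(*a) + beta * natural(*b))
    ma = from_moments((1 - beta) * moments(*a) + beta * moments(*b))
    print(f"beta={beta}: GA (mu, var)={ga}  MA (mu, var)={ma}")
```

At β = 0.5 this gives GA (μ, var) = (0, 1) but MA (μ, var) = (0, 101): GA slides a narrow mode across the gap, while MA stretches the variance to cover both endpoints.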

Path Properties: Variational Interpretation

For geometric averages, the intermediate distribution minimizes a weighted sum of KLs:

p_\beta^{(GA)} = \arg\min_p \, (1 - \beta) D_{KL}(p \| p_a) + \beta D_{KL}(p \| p_b)

For moment averages, the same but with the reverse KLs:

p_\beta^{(MA)} = \arg\min_p \, (1 - \beta) D_{KL}(p_a \| p) + \beta D_{KL}(p_b \| p)

[Figure: GA path vs MA path intermediates.]

Path Properties: Cost Functional

For the exponential family we can evaluate F(γ) — importantly, this assumes linear schedules. Both γ_GA and γ_MA have the same functional value:

F(\gamma_{GA}) = F(\gamma_{MA}) = \frac{1}{2} (s(1) - s(0))^T (\eta(1) - \eta(0))

If we partition the path at distributions p_j into piecewise-linear segments, we can optimally allocate distributions from a total budget K = Σ_j K_j, with

K_j \propto \sqrt{(\eta_{j+1} - \eta_j)^T (s_{j+1} - s_j)}

The biggest effect is the choice of path, not the schedule.
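A minimal sketch of that budget-allocation rule for a piecewise-linear path (the 1-D Gaussian knots here are made up for illustration). Minimizing Σ_j F_j/K_j subject to Σ_j K_j = K gives K_j ∝ √F_j:

```python
import numpy as np

def natural(mu, var):
    return np.array([mu / var, -0.5 / var])

def moments(mu, var):
    return np.array([mu, var + mu**2])

# Hypothetical knots (mu, var) along a piecewise-linear path.
knots = [(-10.0, 1.0), (0.0, 26.0), (10.0, 1.0)]

# Per-segment cost F_j = (1/2) (s_{j+1} - s_j)^T (eta_{j+1} - eta_j).
F = []
for (m0, v0), (m1, v1) in zip(knots[:-1], knots[1:]):
    F.append(0.5 * (moments(m1, v1) - moments(m0, v0))
                 @ (natural(m1, v1) - natural(m0, v0)))

# Allocate a total budget of K intermediate distributions: K_j ∝ sqrt(F_j).
K = 1000
alloc = K * np.sqrt(F) / np.sqrt(F).sum()
print("segment costs:", F, " allocation:", np.round(alloc))
```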

Gaussians

[Figure: intermediate distributions of the GA and MA paths between two 2-D Gaussians with means (−10, 0) and (10, 0); covariance entries lost in transcription.]

Gaussians

Number of intermediate distributions required to anneal between two Gaussians with means μ_1 and μ_2 and standard deviation σ, as a function of d = (μ_2 − μ_1)/σ:

  GA, linear schedule                     O(d²)
  MA, linear schedule                     O(d²)
  GA, optimal schedule                    O(d²)
  MA, optimal schedule                    O((log d)²)
  Optimal path (Gelman and Meng, 1998)    O((log d)²)

MA is within a constant factor of the optimal path under its optimal schedule — no general proof yet. Mixing issues dominate performance in practice, which is where MA shines.
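Why MA stretches the covariance — a quick derivation from the moment-averaging definition for equal-variance 1-D endpoints (my algebra, not from the slides):

\begin{aligned}
\mu(\beta) &= (1-\beta)\mu_1 + \beta\mu_2, \\
\sigma^2(\beta) &= \big[(1-\beta)(\sigma^2 + \mu_1^2) + \beta(\sigma^2 + \mu_2^2)\big] - \mu(\beta)^2
= \sigma^2 + \beta(1-\beta)(\mu_2 - \mu_1)^2.
\end{aligned}

At β = 1/2 the intermediate variance is σ²(1 + d²/4), so even widely separated endpoints are bridged by broad intermediates that overlap their neighbors — the intuition behind the O((log d)²) rate under an optimal schedule.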

Restricted Boltzmann Machines (RBMs)

RBMs are Markov random fields of coupled visible and hidden binary variables x = (v, h) ∈ {0,1}^D × {0,1}^F with a special bipartite structure: stochastic binary visible variables are connected to stochastic binary hidden variables, with no connections within a layer. The energy of a joint configuration is

E(v, h, \theta) = \underbrace{-v^T W h}_{\text{binary potentials}} \; \underbrace{- \, v^T c - h^T b}_{\text{unary potentials}}

where θ = (W, c, b) are the model parameters. The partition function

Z(\theta) = \sum_v \sum_h \exp(-E(v, h, \theta))

is generally intractable, except with a small number (< 25) of hidden units or few visible units.
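A brute-force check of that remark (a hypothetical tiny RBM with random parameters, not a model from the talk): enumerate the 2^F hidden configurations and sum the visibles out analytically, since Z = Σ_h exp(b^T h) Π_i (1 + exp(c_i + (W h)_i)).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D, F = 10, 4                      # visible and hidden counts (tiny on purpose)
W = 0.1 * rng.standard_normal((D, F))
c = 0.1 * rng.standard_normal(D)  # visible biases
b = 0.1 * rng.standard_normal(F)  # hidden biases

# Enumerate hidden configurations; log(1 + e^x) via logaddexp(0, x).
logZ_terms = []
for h in itertools.product([0, 1], repeat=F):
    h = np.array(h, dtype=float)
    logZ_terms.append(b @ h + np.sum(np.logaddexp(0.0, c + W @ h)))
print("log Z =", np.logaddexp.reduce(np.array(logZ_terms)))
```

The loop is exponential in F, which is exactly why this only works for a small number of hidden units.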

Restricted Boltzmann Machines (RBMs)

[Figure: two different paths for an RBM trained on the MNIST digit dataset (60,000 black-and-white images of digits).]

Restricted Boltzmann Machines (RBMs)

The MA path is infeasible for most RBMs: we must estimate moments and then solve for natural parameters,

\underbrace{E[vh^T]_\beta}_{\text{solve for natural parameters}} = (1 - \beta)\,\underbrace{E[vh^T]_0}_{\text{estimate moments}} + \beta\, E[vh^T]_1

We can do this approximately for a few intermediate models.

Moment Averaged Spline path γ_MAS: knots are moment matched; anneal between them with GA.

[Figure: spline path from p_a to p_b through moment-matched knots.]

Estimating Partition Functions of RBMs

[Figure: estimating the partition function of an RBM with 20 hidden units trained on MNIST with PCD.]

Estimating Partition Functions of RBMs

[Figure: estimating the partition function of an RBM with 500 hidden units trained on MNIST with CD1.]

Since p(x) = exp(−E(x, θ))/Z(θ), overestimating Z(θ) means underestimating p(x), and vice versa. Misestimating by 20 nats is the difference between a log probability of −130 and −110 on the MNIST test set!

Conclusions

- Theoretical foundations for studying AIS under perfect mixing
- A new annealing scheme for exponential family distributions, with practical approximations
- Improved estimates of partition functions for RBMs

Ongoing work:
- Diagnostics for AIS
- Extending this work to models that are harder for AIS
- MA intermediate distributions for other tempering-based methods, such as learning MRFs with MA parallel tempering
- Using the MA path in marginal likelihood estimation for directed models

Thanks!
