Cheng Soon Ong & Christian Walder. Canberra, February to June 2018.

Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February to June 2018.

Outline: Overview, Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear Classification 1, Linear Classification 2, Kernel Methods, Sparse Kernel Methods, Mixture Models and EM 1, Mixture Models and EM 2, Neural Networks 1, Neural Networks 2, Principal Component Analysis, Autoencoders, Graphical Models 1, Graphical Models 2, Graphical Models 3, Sequential Data 1, Sequential Data 2.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

Part XVI: Sampling

Approximating π - Buffon's Needle (1777)

Drop a needle of length c onto parallel lines a distance d apart (c ≤ d).

(i) Needle falls perpendicular to the lines (a = π/2): the probability of crossing a line is c/d.
(ii) Needle falls at an arbitrary angle a: the probability of crossing is c sin(a)/d.
(iii) Every angle is equally probable, so calculate the mean:

p(crossing) = (c/d) ∫₀^π sin(a) dp(a) = (1/π)(c/d) ∫₀^π sin(a) da = (2/π)(c/d)

(iv) n crossings in N experiments gives n/N ≈ (2/π)(c/d), hence π ≈ 2cN/(dn).

(Figure: needle of length c falling at angle a between lines a distance d apart.)
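The argument above translates directly into a small simulation. The following is an illustrative sketch (not part of the original slides), assuming NumPy, with needle length c = 1 and line spacing d = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
c, d = 1.0, 2.0          # needle length c <= line spacing d
N = 200_000

# Position of each dropped needle: distance of its centre to the
# nearest line, and its acute angle with the lines.
x = rng.uniform(0.0, d / 2, size=N)      # centre-to-line distance
a = rng.uniform(0.0, np.pi / 2, size=N)  # acute angle

# The needle crosses a line iff x <= (c/2) sin(a).
crossings = np.sum(x <= (c / 2) * np.sin(a))

# p(crossing) = 2c / (pi d)  =>  pi ~ 2 c N / (d * crossings)
pi_hat = 2 * c * N / (d * crossings)
print(pi_hat)
```

With 200,000 drops the estimate typically lands within a few hundredths of π.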

For most probabilistic models of practical interest, exact inference is intractable; we need approximation, e.g. numerical sampling (Monte Carlo methods).

Fundamental problem: find the expectation of some function f(z) w.r.t. a probability distribution p(z),

E[f] = ∫ f(z) p(z) dz

Key idea: draw independent samples z^(l), l = 1, ..., L, from p(z) and approximate the expectation by

f̂ = (1/L) Σ_{l=1}^L f(z^(l))

Problem: how to obtain independent samples from p(z)?
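As a concrete illustration (not from the slides): estimating E[z²] = 1 under p(z) = N(0, 1) from L independent samples, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
L = 100_000
z = rng.normal(size=L)      # independent samples from p(z) = N(0, 1)
f_hat = np.mean(z ** 2)     # Monte Carlo estimate of E[z^2] = 1
print(f_hat)
```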

Approximating the Expectation of f(z)

Samples must be independent, otherwise the effective sample size is much smaller than the apparent sample size.
If f(z) is small in regions where p(z) is large (or vice versa), large sample sizes are needed to catch contributions from all regions.

(Figure: p(z) and f(z) plotted over z.)

Sampling from the Uniform Distribution

In a computer, sampling is usually done via a pseudorandom number generator: an algorithm generating a sequence of numbers that approximates the properties of random numbers. Use a mathematically well crafted pseudorandom number generator.
From now on we will assume that we have a good pseudorandom number generator for uniformly distributed data available.
If you don't trust any algorithm: three carefully adjusted radio receivers picking up atmospheric noise provide real random numbers at http://www.random.org/

Sampling from Standard Distributions

Goal: sample from p(y), which is given in analytical form.
Suppose uniformly distributed samples z in the interval (0, 1) are available. Calculate the cumulative distribution function

h(y) = ∫_{-∞}^y p(x) dx

Transform the samples z ~ U(z | 0, 1) by y = h^{-1}(z) to obtain samples y distributed according to p(y).
Easy to check that h is then the CDF of y:

p(y < x) = p(h^{-1}(z) < x) = p(z < h(x)) = h(x).

Sampling from Standard Distributions

Goal: sample from p(y), which is given in analytical form. If a uniformly distributed random variable z is transformed using y = h^{-1}(z), then y will be distributed according to p(y).

(Figure: the CDF h(y) maps uniform samples on (0, 1) on the vertical axis to samples of p(y) on the horizontal axis.)

Sampling from the Exponential Distribution

Goal: sample from the exponential distribution

p(y) = λ e^{-λy} for y ≥ 0, and p(y) = 0 for y < 0,

with rate parameter λ > 0. Suppose uniformly distributed samples z in the interval (0, 1) are available. Calculate the cumulative distribution function

h(y) = ∫_{-∞}^y p(x) dx = ∫_0^y λ e^{-λx} dx = 1 - e^{-λy}

Transform the samples z ~ U(z | 0, 1) by

y = h^{-1}(z) = -(1/λ) ln(1 - z)

to obtain samples y distributed according to the exponential distribution.
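The transformation can be checked numerically; a sketch assuming NumPy, with an arbitrary rate λ = 1.5:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.5
z = rng.uniform(size=200_000)      # z ~ U(0, 1)
y = -np.log(1.0 - z) / lam         # y = h^{-1}(z), exponential with rate lam

# The exponential distribution with rate lam has mean 1/lam.
print(y.mean())
```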

Sampling from Multivariate Distributions

Generalisation to multiple variables is straightforward. Consider a change of variables via the Jacobian:

p(y_1, ..., y_M) = p(z_1, ..., z_M) |∂(z_1, ..., z_M)/∂(y_1, ..., y_M)|

Technical challenges: multiple integrals; inverting nonlinear functions of multiple variables.

Sampling from the Gaussian Distribution - Box-Muller

1. Generate pairs of uniformly distributed random numbers z_1, z_2 ∈ (-1, 1) (e.g. z_i = 2z - 1 for z from U(z | 0, 1)).
2. Discard any pair (z_1, z_2) unless z_1² + z_2² ≤ 1. This results in a uniform distribution inside the unit circle, p(z_1, z_2) = 1/π.
3. Evaluate r² = z_1² + z_2² and

y_1 = z_1 (-2 ln r² / r²)^{1/2},  y_2 = z_2 (-2 ln r² / r²)^{1/2}

4. y_1 and y_2 are independent with joint distribution

p(y_1, y_2) = p(z_1, z_2) |∂(z_1, z_2)/∂(y_1, y_2)| = (1/√(2π)) e^{-y_1²/2} · (1/√(2π)) e^{-y_2²/2}
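A direct implementation of the four steps (an illustrative sketch assuming NumPy; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def box_muller(n, rng):
    """Polar Box-Muller: map uniform points in the unit disc to N(0,1) pairs."""
    out = []
    while len(out) < n:
        z1, z2 = rng.uniform(-1.0, 1.0, size=2)   # step 1: pair in (-1, 1)^2
        r2 = z1 * z1 + z2 * z2
        if 0.0 < r2 <= 1.0:                       # step 2: keep only the unit disc
            s = np.sqrt(-2.0 * np.log(r2) / r2)   # step 3
            out.extend([z1 * s, z2 * s])          # step 4: two independent N(0,1)
    return np.array(out[:n])

y = box_muller(100_000, rng)
print(y.mean(), y.var())    # should be close to 0 and 1
```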

Sampling from the Multivariate Gaussian

Generate x ~ N(0, I) and let y = Ax + b. Then

E[y] = b
E[(y - E[y])(y - E[y])^T] = A E[x x^T] A^T = A A^T

We know y is Gaussian (why?). So to sample y ~ N(m, Σ), put

b = m,  A A^T = Σ

where A is the Cholesky factor of Σ.
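A sketch of this recipe (assuming NumPy; the values of m and Σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

A = np.linalg.cholesky(Sigma)          # lower-triangular A with A A^T = Sigma
x = rng.normal(size=(100_000, 2))      # rows x ~ N(0, I)
y = x @ A.T + m                        # rows y = A x + m ~ N(m, Sigma)

print(y.mean(axis=0))                  # close to m
print(np.cov(y.T))                     # close to Sigma
```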

Rejection Sampling

Assumption 1: sampling directly from p(z) is difficult, but we can evaluate p(z) up to some unknown normalisation constant Z_p:

p(z) = (1/Z_p) p̃(z)

Assumption 2: we can draw samples from a simpler distribution q(z), and for some constant k and all z,

k q(z) ≥ p̃(z)

(Figure: the scaled proposal k q(z) envelopes p̃(z); a sample z_0 is drawn with a height u_0 under k q(z_0).)

Rejection Sampling

1. Generate a random number z_0 from the distribution q(z).
2. Generate a number u_0 from the uniform distribution over [0, k q(z_0)].
3. If u_0 > p̃(z_0), reject the pair (z_0, u_0).
4. The remaining pairs are uniformly distributed under the curve p̃(z).
5. The z values are distributed according to p(z).
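An illustrative sketch of the procedure (not from the slides): the target is the unnormalised Beta(2, 2) density p̃(z) = z(1 - z) on [0, 1], the proposal is q = U(0, 1), and k = max p̃ = 1/4, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)

def p_tilde(z):
    # Unnormalised target: Beta(2,2) without its constant, p(z) ∝ z(1 - z)
    return z * (1.0 - z)

k = 0.25                              # k q(z) >= p_tilde(z) for q = U(0, 1)
samples = []
while len(samples) < 50_000:
    z0 = rng.uniform()                # step 1: draw from the proposal q
    u0 = rng.uniform(0.0, k)          # step 2: uniform height under k q(z0)
    if u0 <= p_tilde(z0):             # step 3: keep only pairs under p_tilde
        samples.append(z0)

samples = np.array(samples)
print(samples.mean())                 # Beta(2, 2) has mean 1/2
```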

Rejection Sampling - Example

Consider the Gamma distribution for a > 1:

Gam(z | a, b) = b^a z^{a-1} exp(-bz) / Γ(a)

A suitable q(z) is the Cauchy-like distribution

q(z) = k / (1 + (z - c)² / b²)

Sample z from q(z) by using uniformly distributed y and the transformation z = b tan y + c, with c = a - 1, b² = 2a - 1, and k as small as possible subject to k q(z) ≥ p̃(z).

(Figure: the envelope k q(z) over the Gamma density p(z) for z in [0, 30].)

Adaptive Rejection Sampling

A suitable form for the proposal distribution q(z) might be difficult to find. If p(z) is log-concave (ln p(z) has nonincreasing derivatives), use the derivatives to construct an envelope:

1. Start with an initial grid of points z_1, ..., z_M and construct the envelope using the tangents to ln p(z) at the z_i, i = 1, ..., M.
2. Draw a sample from the envelope function; if it is accepted, use it to calculate p(z). Otherwise, use it to refine the grid.

(Figure: ln p(z) with tangent lines at grid points z_1, z_2, z_3 forming a piecewise linear envelope.)

Adaptive Rejection Sampling - Example

The piecewise exponential distribution is defined as

p(z) = k_m λ_m e^{-λ_m (z - z_{m-1})},  ẑ_{m-1,m} < z ≤ ẑ_{m,m+1}

where ẑ_{m-1,m} is the point of intersection of the tangent lines at z_{m-1} and z_m, λ_m is the slope of the tangent at z_m, and k_m accounts for the corresponding offset.

(Figure: ln p(z) with tangents at z_1, z_2, z_3.)

Rejection Sampling - Problems

We need a proposal distribution q(z) which is a close upper bound to p(z); otherwise many samples are rejected.
Curse of dimensionality: for multivariate distributions the acceptance rate typically falls rapidly with dimension, making rejection sampling impractical.

Importance Sampling

Directly calculate E_p[f(z)] w.r.t. some p(z); p(z) is not sampled as an intermediate step. Again use samples z^(l) from some proposal q(z):

E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/L) Σ_{l=1}^L (p(z^(l))/q(z^(l))) f(z^(l))

Importance Sampling - Unnormalised Distributions

Consider unnormalised p(z) = p̃(z)/Z_p and q(z) = q̃(z)/Z_q. It follows that, with r_l = p̃(z^(l))/q̃(z^(l)),

E[f] = ∫ f(z) p(z) dz = (Z_q/Z_p) ∫ f(z) (p̃(z)/q̃(z)) q(z) dz ≈ (Z_q/Z_p) (1/L) Σ_{l=1}^L r_l f(z^(l))

Using the same samples (and f(z) = 1) gives

Z_p/Z_q ≈ (1/L) Σ_{l=1}^L r_l

and therefore

E[f] ≈ Σ_{l=1}^L ( r_l / Σ_{m=1}^L r_m ) f(z^(l))
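A sketch of this self-normalised estimator (illustrative, not from the slides): the target is an unnormalised N(3, 1), the proposal a broad N(0, 5²), and we estimate E[z] = 3, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(6)

def p_tilde(z):
    # Unnormalised N(3, 1): the constant 1/sqrt(2 pi) is deliberately dropped
    return np.exp(-0.5 * (z - 3.0) ** 2)

L = 200_000
z = rng.normal(0.0, 5.0, size=L)     # samples from the proposal q = N(0, 5^2)
q = np.exp(-0.5 * (z / 5.0) ** 2) / (5.0 * np.sqrt(2 * np.pi))

r = p_tilde(z) / q                   # importance weights r_l
E_f = np.sum(r * z) / np.sum(r)      # self-normalised estimate of E[z] = 3
print(E_f)
```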

Importance Sampling - Key Points

Importance weights r_l correct the bias introduced by sampling from the proposal distribution q(z) instead of the wanted distribution p(z).
Success depends on how well q(z) approximates p(z).
Wherever p(z) > 0, q(z) > 0 is necessary.

Markov Chain Monte Carlo

Goal: sample from p(z). Generate a sequence using a Markov chain.

1. Generate a new sample z* ~ q(z | z^(l)), conditional on the previous sample z^(l).
2. Accept or reject the new sample according to some appropriate criterion:

z^(l+1) = z* if accepted, z^(l) if rejected

3. For an appropriate proposal and corresponding acceptance criterion, as l → ∞, z^(l) approaches an independent sample from p(z).

Metropolis Algorithm

1. Choose a symmetric proposal distribution, q(z_A | z_B) = q(z_B | z_A).
2. Accept the new sample z* with probability

A(z*, z^(τ)) = min(1, p̃(z*)/p̃(z^(τ)))

e.g. let u ~ Uniform(0, 1) and accept if p̃(z*)/p̃(z^(τ)) > u.
3. Unlike rejection sampling, we keep the previous sample on rejection of the proposal:

z^(τ+1) = z* if accepted, z^(τ) if rejected
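A minimal random-walk Metropolis sampler for an unnormalised standard normal target (an illustrative sketch assuming NumPy, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(7)

def p_tilde(z):
    # Unnormalised standard normal target
    return np.exp(-0.5 * z * z)

z = 0.0
chain = []
for _ in range(200_000):
    z_new = z + rng.normal(0.0, 1.0)            # symmetric proposal q(z* | z)
    A = min(1.0, p_tilde(z_new) / p_tilde(z))   # Metropolis acceptance probability
    if rng.uniform() < A:
        z = z_new                               # accept the candidate
    chain.append(z)                             # keep the old state on rejection

chain = np.array(chain[10_000:])                # discard burn-in
print(chain.mean(), chain.var())                # close to 0 and 1
```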

Metropolis Algorithm - Illustration

Sampling from a Gaussian distribution (black contour shows one standard deviation). The proposal is q(z | z^(l)) = N(z | z^(l), σ² I) with σ = 0.2. 150 candidates were generated; 43 were rejected.

(Figure: the resulting random walk through the 2D state space; accepted steps and rejected steps shown separately.)

MCMC - Why it works

A first order Markov chain is a series of random variables z^(1), ..., z^(M) such that the following property holds:

p(z^(m+1) | z^(1), ..., z^(m)) = p(z^(m+1) | z^(m))

Marginal probability:

p(z^(m+1)) = Σ_{z^(m)} p(z^(m+1) | z^(m)) p(z^(m)) = Σ_{z^(m)} T_m(z^(m), z^(m+1)) p(z^(m))

where T_m(z^(m), z^(m+1)) are the transition probabilities.

(Figure: the chain z_1 → z_2 → ... → z_m → z_{m+1}.)

MCMC - Why it works

Marginal probability:

p(z^(m+1)) = Σ_{z^(m)} T_m(z^(m), z^(m+1)) p(z^(m))

A Markov chain is called homogeneous if the transition probabilities are the same for all m, denoted by T(z', z).
A distribution is invariant, or stationary, with respect to a Markov chain if each step leaves the distribution unchanged. For a homogeneous Markov chain, the distribution p*(z) is invariant if

p*(z) = Σ_{z'} T(z', z) p*(z')

(Note: there can be many invariant distributions. If T is the identity matrix, every distribution is invariant.)

MCMC - Why it works

Detailed balance,

p*(z) T(z, z') = p*(z') T(z', z),

is sufficient (but not necessary) for p*(z) to be invariant. (A Markov chain that respects detailed balance is called reversible.)
A Markov chain is ergodic if it converges to the invariant distribution irrespective of the choice of initial conditions. The invariant distribution is then called the equilibrium distribution. An ergodic Markov chain can have only one equilibrium distribution.
Why does it work? Choose the transition probabilities T to satisfy detailed balance for our goal distribution p(z).
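Detailed balance can be checked mechanically on a small discrete chain. The following sketch (illustrative, assuming NumPy) builds a 3-state transition matrix from Metropolis-style moves for an arbitrary target p* and verifies both detailed balance and invariance:

```python
import numpy as np

# Arbitrary target distribution over three states.
p_star = np.array([0.2, 0.3, 0.5])
n = len(p_star)

# Metropolis-style transitions: propose j uniformly among the other states,
# accept with probability min(1, p*(j) / p*(i)); leftover mass stays at i.
T = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            T[i, j] = (1.0 / (n - 1)) * min(1.0, p_star[j] / p_star[i])
    T[i, i] = 1.0 - T[i].sum()

# Detailed balance: p*(i) T(i, j) == p*(j) T(j, i) for all i, j.
flow = p_star[:, None] * T
db = np.allclose(flow, flow.T)

# Invariance follows: p* T == p*.
inv = np.allclose(p_star @ T, p_star)
print(db, inv)
```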

MCMC - Metropolis-Hastings

Generalisation of the Metropolis algorithm to nonsymmetric proposal distributions q_k. At step τ, draw a sample z* from the distribution q_k(z | z^(τ)), where k labels the set of possible transitions. Accept with probability

A_k(z*, z^(τ)) = min(1, [p̃(z*) q_k(z^(τ) | z*)] / [p̃(z^(τ)) q_k(z* | z^(τ))])

The choice of proposal distribution is critical. A common choice is a Gaussian centered on the current state:
small variance: high acceptance rate, but slow walk through the state space; samples not independent.
large variance: high rejection rate.

MCMC - Metropolis-Hastings

The transition probability of this Markov chain is

T(z, z') = q_k(z' | z) A_k(z', z)

To prove that p(z) is the invariant distribution, it suffices to show detailed balance, p(z) T(z, z') = T(z', z) p(z'). Using the symmetry min(a, b) = min(b, a), detailed balance holds:

p(z) q_k(z' | z) A_k(z', z) = min(p(z) q_k(z' | z), p(z') q_k(z | z'))
= min(p(z') q_k(z | z'), p(z) q_k(z' | z))
= p(z') q_k(z | z') A_k(z, z')

MCMC - Metropolis-Hastings

Consider an isotropic Gaussian proposal distribution (blue) for an elongated, correlated multivariate Gaussian target (red). To keep the rejection rate low, the proposal scale ρ should match the smallest standard deviation σ_min of the target. This leads to random walk behaviour and slow exploration of the state space: the number of steps separating states that are approximately independent is of order (σ_max/σ_min)².

(Figure: target with length scales σ_max and σ_min, isotropic proposal of scale ρ.)

Gibbs Sampling

Goal: sample from a joint distribution p(z) = p(z_1, ..., z_M).
How? Sample one variable at a time from its distribution conditioned on all the other variables.
Example: given p(z_1, z_2, z_3), at step τ we have samples z_1^(τ), z_2^(τ) and z_3^(τ). Get samples for the next step τ + 1 via

z_1^(τ+1) ~ p(z_1 | z_2^(τ), z_3^(τ))
z_2^(τ+1) ~ p(z_2 | z_1^(τ+1), z_3^(τ))
z_3^(τ+1) ~ p(z_3 | z_1^(τ+1), z_2^(τ+1))
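A sketch for a correlated bivariate Gaussian target (illustrative, not from the slides): its full conditionals are the univariate Gaussians z_1 | z_2 ~ N(ρ z_2, 1 - ρ²) and symmetrically for z_2 | z_1, so each Gibbs step is an exact draw, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(8)
rho = 0.8                 # target: N(0, [[1, rho], [rho, 1]])
sd = np.sqrt(1 - rho ** 2)

z1, z2 = 0.0, 0.0
samples = []
for _ in range(100_000):
    # Alternately sample each variable from its full conditional.
    z1 = rng.normal(rho * z2, sd)   # z1 | z2 ~ N(rho z2, 1 - rho^2)
    z2 = rng.normal(rho * z1, sd)   # z2 | z1 ~ N(rho z1, 1 - rho^2)
    samples.append((z1, z2))

samples = np.array(samples[5_000:])   # discard burn-in
corr = np.corrcoef(samples.T)[0, 1]
print(corr)                           # close to rho = 0.8
```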

Gibbs Sampling - Why does it work?

1. p(z) is invariant under each of the Gibbs sampling steps, and hence under the whole Markov chain: (a) the marginal distribution p(z_{\i}) is unchanged by the step, and (b) by definition each step samples correctly from the conditional p(z_i | z_{\i}).
2. Ergodicity: it is sufficient that none of the conditional distributions is zero anywhere.
3. Gibbs sampling is a Metropolis-Hastings sampler in which every step is accepted.

(Figure: Gibbs sampling alternately updating z_1 and z_2 within a correlated 2D distribution of length scales L and l.)