Basic Sampling Methods


Basic Sampling Methods. Sargur Srihari, srihari@cedar.buffalo.edu

Topics
1. Motivation: intractability in ML and how sampling can help
2. Ancestral sampling using BNs
3. Transforming a uniform distribution
4. Rejection sampling
5. Importance sampling
6. Sampling-Importance-Resampling

1. Motivation
When exact inference is intractable, we need some form of approximation. This is true of most probabilistic models of practical significance. Inference methods based on numerical sampling are known as Monte Carlo techniques. Most situations require evaluating expectations of unobserved variables, e.g., to make predictions, rather than the posterior distribution itself.

Common Task
Find the expectation of some function f(z) with respect to a probability distribution p(z). The components of z can be discrete, continuous, or a combination. In the continuous case we wish to evaluate
$E[f] = \int f(z)\,p(z)\,dz$
In the discrete case the integral is replaced by a summation.
Illustration: in Bayesian regression, from the sum and product rules,
$p(t \mid x) = \int p(t \mid x, w)\,p(w)\,dw$, where $p(t \mid x, w) = \mathcal{N}(t \mid y(x,w), \beta^{-1})$
In general such expectations are too complex to be evaluated analytically.

Sampling: main idea
Obtain a set of samples $z^{(l)}$, $l = 1,\ldots,L$, drawn independently from the distribution p(z). This allows the expectation
$E[f] = \int f(z)\,p(z)\,dz$
to be approximated by
$\hat{f} = \frac{1}{L}\sum_{l=1}^{L} f(z^{(l)})$
called an estimator. Then $E[\hat{f}] = E[f]$, i.e., the estimator has the correct mean, and
$\operatorname{var}[\hat{f}] = \frac{1}{L}\,E\big[(f - E[f])^2\big]$
which is the variance of the estimator. Accuracy is independent of the dimensionality of z, and high accuracy can be achieved with as few as ten or twenty samples. However, the samples may not be independent, so the effective sample size may be smaller than the apparent sample size. Also, if f(z) is small where p(z) is high and vice versa, the expectation may be dominated by regions of small probability, requiring large sample sizes.
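A minimal sketch of this estimator (not from the slides; numpy, the target density, and the choice of f are illustrative assumptions). It approximates E[f] for f(z) = z^2 under a standard normal, where the exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, L=20):
    """Approximate E[f(z)] by the sample mean of f over L i.i.d. draws from p(z)."""
    z = sampler(L)                 # z^(1), ..., z^(L) drawn independently from p(z)
    return np.mean(f(z))

# Example: f(z) = z^2 under p(z) = N(0, 1); the exact expectation is 1.
f_hat = mc_expectation(lambda z: z**2, lambda L: rng.standard_normal(L), L=20)
print(f_hat)  # close to 1; the estimator variance shrinks as 1/L
```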

2. Ancestral Sampling
If the joint distribution is represented by a BN with no observed variables, a straightforward method exists. The distribution is specified by
$p(\mathbf{z}) = \prod_{i=1}^{M} p(z_i \mid \mathrm{pa}_i)$
where $z_i$ is the set of variables associated with node i and $\mathrm{pa}_i$ is the set of variables associated with the parents of node i. To obtain samples from the joint we make one pass through the set of variables in the order $z_1,\ldots,z_M$, sampling each from the conditional distribution $p(z_i \mid \mathrm{pa}_i)$. After one pass through the graph we obtain one sample, and the frequency of different values defines the distribution, e.g., allowing us to determine marginals such as
$P(L,S) = \sum_{D,I,G} P(D,I,G,L,S) = \sum_{D,I,G} P(D)\,P(I)\,P(G \mid D,I)\,P(L \mid G)\,P(S \mid I)$
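A minimal sketch of ancestral sampling on this D, I, G, L, S network (the usual student-network reading of the variables; all CPT values below are invented for illustration and are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_student_network():
    """One ancestral pass: sample each node given its already-sampled parents."""
    d = rng.random() < 0.4                               # Difficulty ~ P(D)
    i = rng.random() < 0.3                               # Intelligence ~ P(I)
    p_good_grade = {(0, 0): 0.7, (0, 1): 0.9, (1, 0): 0.3, (1, 1): 0.6}
    g = rng.random() < p_good_grade[(int(d), int(i))]    # Grade ~ P(G | D, I)
    l = rng.random() < (0.8 if g else 0.2)               # Letter ~ P(L | G)
    s = rng.random() < (0.9 if i else 0.1)               # SAT ~ P(S | I)
    return d, i, g, l, s

# The marginal P(L = True, S = True) estimated from sample frequencies
samples = [sample_student_network() for _ in range(10_000)]
p_l_and_s = np.mean([l and s for *_, l, s in samples])
print(p_l_and_s)
```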

Ancestral sampling with some nodes instantiated
Consider a directed graph where some nodes are instantiated with observed values, e.g.,
$P(L = l^0, S = s^1) = \sum_{D,I,G} P(D)\,P(I)\,P(G \mid D,I)\,P(L = l^0 \mid G)\,P(S = s^1 \mid I)$
The corresponding procedure is called logic sampling. Use ancestral sampling, except when a sample is obtained for an observed variable: if it agrees with the observed value, the sampled value is retained and we proceed to the next variable; if it does not agree, the whole sample is discarded.

Properties of Logic Sampling
$P(L = l^0, S = s^1) = \sum_{D,I,G} P(D)\,P(I)\,P(G \mid D,I)\,P(L = l^0 \mid G)\,P(S = s^1 \mid I)$
Logic sampling samples correctly from the posterior distribution, since it corresponds to sampling from the joint distribution of hidden and data variables. But the probability of accepting a sample decreases as the number of variables increases and as the number of states the variables can take increases. It is a special case of importance sampling and is rarely used in practice.
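A sketch of logic sampling for this query, reusing the hypothetical sample_student_network and rng from the sketch above; samples that disagree with the evidence are discarded:

```python
def logic_sampling(n_samples=100_000, l0=False, s1=True):
    """Ancestral sampling; discard samples inconsistent with evidence L = l0, S = s1."""
    accepted = 0
    for _ in range(n_samples):
        d, i, g, l, s = sample_student_network()   # defined in the previous sketch
        if l == l0 and s == s1:
            accepted += 1
    # The acceptance fraction estimates P(L = l0, S = s1);
    # it shrinks as more evidence variables (or states) are added.
    return accepted / n_samples

print(logic_sampling())
```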

Undirected Graphs
There is no one-pass sampling strategy, even for the prior distribution with no observed variables. Computationally more expensive methods such as Gibbs sampling must be used: start with a sample, replace the first variable conditioned on the rest of the values, then the next variable, and so on.
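A minimal sketch of such single-variable Gibbs updates on a small hypothetical Ising-type chain (the coupling strength and chain length are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_ising_chain(n_nodes=5, coupling=0.8, n_sweeps=1000):
    """Repeatedly resample each +/-1 node conditioned on its current neighbors."""
    x = rng.choice([-1, 1], size=n_nodes)          # start with an arbitrary sample
    for _ in range(n_sweeps):
        for i in range(n_nodes):                   # replace one variable at a time
            s = (x[i - 1] if i > 0 else 0) + (x[i + 1] if i < n_nodes - 1 else 0)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * coupling * s))   # p(x_i = +1 | neighbors)
            x[i] = 1 if rng.random() < p_plus else -1
    return x

print(gibbs_ising_chain())
```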

3. Basic Sampling Algorithms
These are simple strategies for generating random samples from a given distribution. The samples will be pseudo-random numbers that pass a test for randomness. Assume that the algorithm is provided with a pseudo-random generator for the uniform distribution over (0,1). For standard distributions we can use the transformation method to generate non-uniformly distributed random numbers.

Transformation Method
Goal: generate random numbers from a simple non-uniform distribution, assuming we have a source of uniformly distributed random numbers. Let z be uniformly distributed over (0,1), i.e., p(z) = 1 on that interval.

Transformation for Standard Distributions
If we transform the values of z using a function f(), so that y = f(z), the distribution of y is governed by
$p(y) = p(z)\left|\frac{dz}{dy}\right|$   (1)
The goal is to choose f(z) such that the values of y have the desired distribution p(y). Integrating (1), and using p(z) = 1, gives
$z = h(y) \equiv \int_{-\infty}^{y} p(\hat{y})\,d\hat{y}$
which is the indefinite integral of p(y). Thus $y = h^{-1}(z)$: we have to transform the uniformly distributed random numbers using the function that is the inverse of the indefinite integral of the desired distribution.

Geometry of Transformation
We are interested in generating non-uniform random variables distributed as p(y). h(y) is the indefinite integral of the desired p(y); z ~ U(0,1) is transformed using $y = h^{-1}(z)$, which results in y being distributed as p(y).

Transformations for Exponential and Cauchy
Exponential distribution: we need samples from
$p(y) = \lambda \exp(-\lambda y)$, where $0 \le y < \infty$
In this case the indefinite integral is
$z = h(y) = \int_0^y p(\hat{y})\,d\hat{y} = 1 - \exp(-\lambda y)$
If we transform using $y = -\lambda^{-1}\ln(1 - z)$, then y will have an exponential distribution.
Cauchy distribution: we need samples from
$p(y) = \frac{1}{\pi}\,\frac{1}{1 + y^2}$
The inverse of the indefinite integral can be expressed as a tan function.
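A small sketch of these inverse-CDF transformations (numpy assumed; lambda = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.random(100_000)                 # z ~ U(0, 1)

lam = 2.0
y_exp = -np.log(1.0 - z) / lam          # y = -(1/lambda) ln(1 - z)  ->  Exponential(lambda)
y_cauchy = np.tan(np.pi * (z - 0.5))    # inverse CDF of the standard Cauchy is a tan function

print(y_exp.mean())                     # close to 1/lambda = 0.5
```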

Generalization of Transformation Method
Single-variable transformation:
$p(y) = p(z)\left|\frac{dz}{dy}\right|$
where z is uniformly distributed over (0,1).
Multiple-variable transformation:
$p(y_1,\ldots,y_M) = p(z_1,\ldots,z_M)\left|\frac{\partial(z_1,\ldots,z_M)}{\partial(y_1,\ldots,y_M)}\right|$

Transformation Method for Gaussian
The Box-Muller method for the Gaussian is an example for a bivariate Gaussian. First generate a uniform distribution inside the unit circle: generate pairs of uniformly distributed random numbers $z_1, z_2 \in (-1, 1)$, which can be done from U(0,1) using $z \rightarrow 2z - 1$, and discard each pair unless $z_1^2 + z_2^2 < 1$. This leads to a uniform distribution of points inside the unit circle with $p(z_1, z_2) = 1/\pi$.

Generating a Gaussian
For each pair $z_1, z_2$ evaluate the quantities
$y_1 = z_1\left(\frac{-2\ln r^2}{r^2}\right)^{1/2}, \quad y_2 = z_2\left(\frac{-2\ln r^2}{r^2}\right)^{1/2}$, where $r^2 = z_1^2 + z_2^2$
Then $y_1$ and $y_2$ are independent Gaussians with zero mean and unit variance.
For arbitrary mean and variance: if $y \sim N(0,1)$ then $\sigma y + \mu$ has distribution $N(\mu, \sigma^2)$.
In the multivariate case: if the components of $\mathbf{z}$ are independent and $N(0,1)$, then $\mathbf{y} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}$ will have distribution $N(\boldsymbol{\mu}, \Sigma)$, where $\mathbf{L}$ is obtained from the Cholesky decomposition $\Sigma = \mathbf{L}\mathbf{L}^{T}$.
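A sketch of the Box-Muller step and the Cholesky transformation (numpy assumed; the target mean and covariance are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n):
    """Return n pairs of independent N(0,1) samples via the polar Box-Muller method."""
    out = []
    while len(out) < n:
        z1, z2 = 2.0 * rng.random(2) - 1.0          # uniform in (-1, 1)
        r2 = z1**2 + z2**2
        if 0.0 < r2 < 1.0:                          # keep only points inside the unit circle
            scale = np.sqrt(-2.0 * np.log(r2) / r2)
            out.append((z1 * scale, z2 * scale))
    return np.array(out)

# Multivariate case: y = mu + L z with Sigma = L L^T (illustrative mu, Sigma)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)
z = box_muller(5000)                                 # shape (5000, 2), i.i.d. N(0,1)
y = mu + z @ L.T                                     # samples from N(mu, Sigma)
print(np.cov(y.T))                                   # should be close to Sigma
```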

Limitation of Transformation Method
We need to first calculate and then invert the indefinite integral of the desired distribution, which is feasible only for a small number of distributions, so alternate approaches are needed. Rejection sampling and importance sampling are applicable mainly to univariate distributions, but they are useful as components in more general strategies.

4. Rejection Sampling
Rejection sampling allows sampling from a relatively complex distribution. Consider the univariate case first, then extend to several variables. We wish to sample from a distribution p(z) that is not a simple standard distribution, so sampling from p(z) directly is difficult. Suppose, however, that we can easily evaluate p(z) for any given value of z, up to a normalizing constant $Z_p$:
$p(z) = \frac{1}{Z_p}\,\tilde{p}(z)$
where $\tilde{p}(z)$ can readily be evaluated but $Z_p$ is unknown; e.g., p(z) is a mixture of Gaussians. Note that we may know the mixture distribution but still need samples to compute expectations.

Rejection sampling: proposal distribution
Samples are drawn from a simple distribution q(z), called the proposal distribution. We introduce a constant k whose value is chosen such that $kq(z) \ge \tilde{p}(z)$ for all z; kq(z) is called the comparison function.

Rejection Sampling Intuition
Samples are drawn from the simple distribution q(z) and rejected if they fall in the grey area between the un-normalized distribution $\tilde{p}(z)$ and the scaled distribution kq(z). The resulting samples are distributed according to p(z), which is the normalized version of $\tilde{p}(z)$.

Determining whether a sample is in the shaded area
Generate two random numbers: $z_0$ from q(z), and $u_0$ from the uniform distribution over $[0, kq(z_0)]$. This pair has a uniform distribution under the curve of the function kq(z). If $u_0 > \tilde{p}(z_0)$ the pair is rejected; otherwise it is retained. The remaining pairs have a uniform distribution under the curve of $\tilde{p}(z)$, and hence the corresponding z values are distributed according to p(z), as desired.
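A sketch of this procedure for a hypothetical un-normalized two-component Gaussian mixture with a broad Gaussian proposal (numpy assumed; here k is estimated numerically on a grid with a safety margin, a convenience of this sketch rather than part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    """Un-normalized target: a two-component Gaussian mixture (illustrative choice)."""
    return np.exp(-0.5 * (z - 2.0)**2) + 0.5 * np.exp(-0.5 * (z + 2.0)**2)

def q_pdf(z, scale=3.0):
    """Proposal density q(z) = N(0, scale^2)."""
    return np.exp(-0.5 * (z / scale)**2) / (scale * np.sqrt(2.0 * np.pi))

# Choose k so that k q(z) >= p_tilde(z), estimated on a grid with a safety margin.
grid = np.linspace(-10.0, 10.0, 2001)
k = 1.1 * np.max(p_tilde(grid) / q_pdf(grid))

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        z0 = rng.normal(0.0, 3.0)                 # z0 ~ q(z)
        u0 = rng.uniform(0.0, k * q_pdf(z0))      # u0 ~ U[0, k q(z0)]
        if u0 <= p_tilde(z0):                     # retain pairs under the p_tilde curve
            samples.append(z0)
    return np.array(samples)

print(rejection_sample(1000).mean())
```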

Example of Rejection Sampling
Consider the task of sampling from the Gamma distribution
$\mathrm{Gam}(z \mid a, b) = \frac{b^{a}\, z^{a-1} \exp(-bz)}{\Gamma(a)}$
Since the Gamma is roughly bell-shaped, a suitable proposal distribution is the Cauchy. The Cauchy has to be slightly generalized to ensure it is nowhere smaller than the Gamma. (Figure: scaled Cauchy envelope over the Gamma density.)

Adaptive Rejection Sampling
Used when it is difficult to find a suitable analytic proposal distribution for p(z). Constructing an envelope is straightforward when p(z) is log concave, i.e., when ln p(z) has derivatives that are non-increasing functions of z. The function ln p(z) and its gradient are evaluated at an initial set of grid points, and the intersections of the resulting tangent lines are used to construct the envelope, a sequence of linear functions.

Dimensionality and Rejection Sampling
Gaussian example: the proposal distribution q(z) is a Gaussian, the scaled version is kq(z), and the true distribution is p(z). The acceptance rate is the ratio of the volumes under p(z) and kq(z), and it diminishes exponentially with dimensionality.

5. Importance Sampling
The principal reason for sampling from p(z) is to evaluate the expectation of some f(z):
$E[f] = \int f(z)\,p(z)\,dz$
For example, in Bayesian regression, $p(t \mid x) = \int p(t \mid x, w)\,p(w)\,dw$, where $p(t \mid x, w) = \mathcal{N}(t \mid y(x,w), \beta^{-1})$.
Given samples $z^{(l)}$, $l = 1,\ldots,L$, from p(z), the finite sum approximation is
$\hat{f} = \frac{1}{L}\sum_{l=1}^{L} f(z^{(l)})$
But drawing samples from p(z) may be impractical. Importance sampling uses a proposal distribution, like rejection sampling, but all samples are retained. It assumes that for any z, p(z) can be evaluated.

Determining Importance Weights
Samples $\{z^{(l)}\}$ are drawn from a simpler proposal distribution q(z):
$E[f] = \int f(z)\,p(z)\,dz = \int f(z)\,\frac{p(z)}{q(z)}\,q(z)\,dz \simeq \frac{1}{L}\sum_{l=1}^{L}\frac{p(z^{(l)})}{q(z^{(l)})}\,f(z^{(l)})$
Unlike rejection sampling, all of the samples are retained. The samples are weighted by the ratios $r_l = p(z^{(l)})/q(z^{(l)})$, known as importance weights, which correct the bias introduced by sampling from the wrong distribution.
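A sketch of this weighted estimator (numpy assumed; the target, proposal, and f are arbitrary illustrative choices for which both densities can be evaluated):

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

L = 10_000
z = rng.normal(0.0, 2.0, L)                        # z^(l) ~ q(z) = N(0, 2^2)
r = norm_pdf(z, 2.0, 1.0) / norm_pdf(z, 0.0, 2.0)  # importance weights r_l = p(z)/q(z)

f = z                                              # f(z) = z; E_p[f] = 2 for p = N(2, 1)
f_hat = np.mean(r * f)                             # (1/L) sum_l r_l f(z^(l))
print(f_hat)                                       # close to 2
```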

Likelihood-weighted Sampling
This is importance sampling of a graphical model using ancestral sampling: evidence variables are clamped to their observed values, the remaining variables are sampled ancestrally, and the weight of a sample z given the evidence variables E is
$r(\mathbf{z}) = \prod_{z_i \in E} p(z_i \mid \mathrm{pa}_i)$
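A sketch of likelihood weighting on the same hypothetical student network used above, with the SAT node clamped to an observed value (CPT numbers are again invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def likelihood_weighted_sample(s1=True):
    """Sample non-evidence nodes ancestrally; weight by p(evidence | parents)."""
    d = rng.random() < 0.4
    i = rng.random() < 0.3
    p_good_grade = {(0, 0): 0.7, (0, 1): 0.9, (1, 0): 0.3, (1, 1): 0.6}
    g = rng.random() < p_good_grade[(int(d), int(i))]
    l = rng.random() < (0.8 if g else 0.2)
    # Evidence node S is clamped, not sampled; its CPT entry becomes the weight.
    w = 0.9 if i else 0.1          # p(S = True | I = i)
    if not s1:
        w = 1.0 - w
    return (d, i, g, l), w

# Estimate P(L = True | S = True) as a weighted frequency
draws = [likelihood_weighted_sample() for _ in range(50_000)]
num = sum(w for (d, i, g, l), w in draws if l)
den = sum(w for _, w in draws)
print(num / den)
```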

6. Sampling-Importance-Resampling (SIR)
Rejection sampling depends on a suitable value of k such that $kq(z) \ge \tilde{p}(z)$. For many pairs of distributions p(z) and q(z) it is impractical to determine a value of k: if k is large enough to guarantee the bound, the acceptance rate becomes impractically small. The SIR method makes use of a sampling (proposal) distribution q(z) but avoids having to determine k. The name of the method reflects its steps: sampling from the proposal distribution, followed by determining importance weights, and then resampling. (Figure: rejection sampling with Gaussian proposal q(z), scaled version kq(z), and true distribution p(z).)

SIR Method
There are two stages. Stage 1: L samples $z^{(1)},\ldots,z^{(L)}$ are drawn from q(z). Stage 2: weights $w_1,\ldots,w_L$ are constructed as in importance sampling, $w_l = p(z^{(l)})/q(z^{(l)})$, and normalized to sum to one. Finally, a second set of L samples is drawn from the discrete distribution over $\{z^{(1)},\ldots,z^{(L)}\}$ with probabilities given by $\{w_1,\ldots,w_L\}$. If moments with respect to the distribution p(z) are needed, use
$E[f] = \int f(z)\,p(z)\,dz \simeq \sum_{l=1}^{L} w_l\, f(z^{(l)})$
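A sketch of SIR for the same hypothetical un-normalized mixture target and Gaussian proposal used in the rejection-sampling sketch; note that no constant k is required:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    """Un-normalized target (same illustrative two-component mixture as above)."""
    return np.exp(-0.5 * (z - 2.0)**2) + 0.5 * np.exp(-0.5 * (z + 2.0)**2)

def q_pdf(z, scale=3.0):
    return np.exp(-0.5 * (z / scale)**2) / (scale * np.sqrt(2.0 * np.pi))

L = 10_000
z = rng.normal(0.0, 3.0, L)              # stage 1: z^(l) ~ q(z)
w = p_tilde(z) / q_pdf(z)                # stage 2: importance weights
w /= w.sum()                             # normalize so the weights form a distribution

resampled = rng.choice(z, size=L, p=w)   # stage 3: resample with probabilities w_l
print(np.sum(w * z), resampled.mean())   # two estimates of E[z] under p(z)
```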