
Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh

Contents: Goal & Motivation; Sampling (Rejection, Importance); Markov Chains and their Properties; MCMC Sampling (Hastings-Metropolis, Gibbs)

Monte Carlo Methods

The importance of MCMC A recent survey places the Metropolis algorithm among the 10 algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century (Beichl & Sullivan, 2000). The Metropolis algorithm is an instance of a large class of sampling algorithms, known as Markov chain Monte Carlo (MCMC).

MCMC Applications MCMC plays a significant role in statistics, econometrics, physics, and computer science. Applications include sampling from high-dimensional, complicated distributions; Bayesian inference and learning (marginalization, normalization, expectation); and global optimization.

The Monte Carlo principle Our goal is to estimate the integral I(f) = ∫ f(x) p(x) dx. The idea of Monte Carlo simulation is to draw an i.i.d. set of samples {x^(i)} from a target density p(x) defined on a high-dimensional space X. Estimator: I_N(f) = (1/N) Σ_{i=1..N} f(x^(i)).
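As a small illustration (not part of the slides), here is a minimal sketch of this estimator in Python, assuming we can already sample from p; the Gaussian target and f(x) = x² are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Plain Monte Carlo: I_N(f) = (1/N) * sum_i f(x_i), with x_i i.i.d. from p."""
    x = sampler(n)
    return f(x).mean()

# Illustrative case: p = N(0, 1) and f(x) = x^2, so the true value is E[X^2] = 1.
print(mc_estimate(lambda x: x**2, lambda n: rng.standard_normal(n), 100_000))
```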

The Monte Carlo principle Theorems: the estimator I_N(f) is almost surely consistent and unbiased; its convergence rate is independent of the dimension d; and it is asymptotically normal.

The Monte Carlo principle One tiny problem: Monte Carlo methods need samples from the distribution p(x). When p(x) has a standard form, e.g. uniform or Gaussian, it is straightforward to sample from it using easily available routines. However, when this is not the case, we need to introduce more sophisticated sampling techniques, such as MCMC sampling.

Sampling: Rejection sampling, Importance sampling

Main Goal Sample from a distribution p(x) that is only known up to a proportionality constant, for example p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x−10)²).

Rejection Sampling

Rejection Sampling Conditions Suppose that p(x) is known up to a proportionality constant, p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x−10)²), and it is easy to sample from a q(x) that satisfies p(x) ≤ M q(x), M < ∞, where M is known.

Rejection Sampling Algorithm: draw x^(i) ~ q(x) and u ~ U(0,1); accept x^(i) if u < p(x^(i)) / (M q(x^(i))), otherwise reject and repeat.
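A minimal Python sketch of this algorithm for the bimodal target above; the proposal q = N(5, 10²) and the envelope constant M = 21 are my own illustrative choices (M was checked numerically to satisfy p(x) ≤ M q(x) over the real line):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized bimodal target from the slides."""
    return 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10)**2)

def q_pdf(x, mu=5.0, sigma=10.0):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

M = 21.0  # illustrative constant, checked numerically so p_tilde(x) <= M * q_pdf(x)

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.normal(5.0, 10.0)                # draw proposal x ~ q
        u = rng.uniform()                        # accept with prob p(x) / (M q(x))
        if u < p_tilde(x) / (M * q_pdf(x)):
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(10_000)
print(xs.mean())  # roughly 7: mixture of modes near 0 (weight 0.3) and 10 (weight 0.7)
```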

Rejection Sampling Theorem The accepted x^(i) can be shown to be sampled with probability p(x) (Robert & Casella, 1999, p. 49). Severe limitations: it is not always possible to bound p(x)/q(x) with a reasonable constant M over the whole space X; if M is too large, the acceptance probability is too small; and in high-dimensional spaces sampling can be exponentially slow (most points will be rejected).

Importance Sampling

Importance Sampling Goal: Sample from a distribution p(x) that is only known up to a proportionality constant. Importance sampling is an alternative classical solution that goes back to the 1940s. Let us introduce, again, an arbitrary importance (proposal) distribution q(x) such that its support includes the support of p(x). Then we can rewrite I(f) as I(f) = ∫ f(x) w(x) q(x) dx, where w(x) = p(x)/q(x) are the importance weights.

Importance Sampling Consequently, drawing x^(i) ~ q(x) gives the estimator I_N(f) = (1/N) Σ_{i=1..N} f(x^(i)) w(x^(i)).

Importance Sampling Theorem This estimator is unbiased, and under weak assumptions the strong law of large numbers applies: I_N(f) → I(f) almost surely. Some proposal distributions q(x) will obviously be preferable to others. Which one should we choose?

Importance Sampling Find a q(x) that minimizes the variance of the estimator! Theorem The variance is minimal when we adopt the optimal importance distribution q*(x) = |f(x)| p(x) / ∫ |f(x)| p(x) dx.

Importance Sampling The optimal proposal is not very useful in the sense that it is not easy to sample from |f(x)| p(x). High sampling efficiency is achieved when we focus on sampling from p(x) in the important regions where |f(x)| p(x) is relatively large; hence the name importance sampling. Importance sampling estimates can be super-efficient: for a given function f(x), it is possible to find a distribution q(x) that yields an estimate with a lower variance than when using q(x) = p(x)! In high dimensions, however, importance sampling is not efficient either.
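A hedged sketch of importance sampling for the same bimodal target; since p is only known up to a constant, this uses the self-normalized variant (weights normalized to sum to one), and q = N(5, 10²) is again an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized bimodal target from the slides."""
    return 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10)**2)

def q_pdf(x, mu=5.0, sigma=10.0):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

n = 100_000
x = rng.normal(5.0, 10.0, size=n)   # x ~ q
w = p_tilde(x) / q_pdf(x)           # importance weights w = p/q (up to a constant)
w /= w.sum()                        # self-normalize: the unknown constant cancels
print(np.sum(w * x))                # estimate of E_p[X], roughly 7
```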

MCMC sampling - Main idea Create a Markov chain that has the desired limiting distribution!

Markov Chains

Markov Chains Markov chain: a sequence of random variables X_1, X_2, … satisfying P(X_{t+1} = x | X_1, …, X_t) = P(X_{t+1} = x | X_t). Homogeneous Markov chain: the transition probabilities P(X_{t+1} = j | X_t = i) do not depend on t.

Markov Chains Assume that the state space is finite: {1, …, s}. 1-step state transition matrix: T(i,j) = P(X_{t+1} = j | X_t = i). Lemma: The state transition matrix is stochastic: Σ_j T(i,j) = 1 for every i. t-step state transition matrix: T_t = T^t (the t-th power of T). Lemma: T^t is stochastic as well.

Markov Chains Example: a Markov chain with three states (s = 3), given by its transition matrix and transition graph.

Markov Chains, stationary distribution Definition [stationary distribution, invariant distribution, steady-state distribution]: a distribution π with π T = π. The stationary distribution need not be unique (e.g. T = identity matrix).

Markov Chains, limit distributions Some Markov chains have a unique limit distribution: if the probability vector for the initial state is µ(x_1), then the distribution after t steps is µ(x_1)T^t, and after several iterations (multiplications by T) µ(x_1)T^t → π, no matter what the initial distribution µ(x_1) was. π is called the limit distribution: the chain has forgotten its past.
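To make this "forgetting" concrete, here is a tiny numpy sketch with an illustrative 3-state transition matrix (not the example from the slides): repeated multiplication by T drives any initial distribution to the same π:

```python
import numpy as np

# An illustrative 3-state stochastic matrix (rows sum to 1).
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu = np.array([1.0, 0.0, 0.0])         # start deterministically in state 1
for _ in range(50):
    mu = mu @ T                        # one step: mu_{t+1} = mu_t T
print(mu)                              # approximately the limit distribution pi

# A different starting state converges to the same pi: the chain forgets its past.
print(np.linalg.matrix_power(T, 50)[2])
```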

Markov Chains Our goal is to find conditions under which the Markov chain converges to a unique limit distribution (independently of its starting distribution). Observation: if this limit distribution exists, it has to be the stationary distribution.

Limit Theorem of Markov Chains Theorem: If the Markov chain is irreducible and aperiodic, then µT^t → π as t → ∞ for every initial distribution µ. That is, the chain converges to the unique stationary distribution.

Markov Chains Definition Irreducibility: for each pair of states (i,j), there is a positive probability, starting in state i, that the process will ever enter state j. Equivalently: the matrix T cannot be reduced to separate smaller matrices; the transition graph is connected; it is possible to get to any state from any state.

Markov Chains Definition Aperiodicity: The chain cannot get trapped in cycles. Definition A state i has period k if any return to state i must occur in multiples of k time steps. Formally, the period of state i is defined as k = gcd{n : P(X_n = i | X_0 = i) > 0} (where "gcd" is the greatest common divisor). For example, suppose it is possible to return to the state in {6, 8, 10, 12, …} time steps; then k = 2.

Markov Chains Definition Aperiodicity: The chain cannot get trapped in cycles. In other words, a state i is aperiodic if there exists n such that for all n' ≥ n, P(X_{n'} = i | X_0 = i) > 0. Definition A Markov chain is aperiodic if every state is aperiodic.

Markov Chains Example of a periodic Markov chain: let T = [[0, 1], [1, 0]]. In this case T^t alternates between T and the identity. If we start the chain from (1,0) or (0,1), then the chain gets trapped in a cycle; it doesn't forget its past. It has a stationary distribution, (1/2, 1/2), but no limit distribution!

Reversible Markov chains (Detailed Balance Property) How can we find the limit distribution of an irreducible and aperiodic Markov chain? Definition [reversibility / detailed balance condition]: π(i) T(i,j) = π(j) T(j,i) for all i, j. Theorem: A sufficient, but not necessary, condition to ensure that a particular π is the desired invariant distribution of the Markov chain is the detailed balance condition.

How fast can Markov chains forget the past? MCMC samplers are irreducible and aperiodic Markov chains that have the target distribution as their invariant distribution and satisfy the detailed balance condition. It is also important to design samplers that converge quickly.

Spectral properties Theorem: π is a left eigenvector of the matrix T with eigenvalue 1 (πT = π). The Perron-Frobenius theorem from linear algebra tells us that the remaining eigenvalues have absolute value less than 1. The second largest eigenvalue therefore determines the rate of convergence of the chain, and should be as small as possible.
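Continuing with the same illustrative matrix as before (my choice, not from the slides), a short numpy check of these spectral claims; left eigenvectors of T are right eigenvectors of the transpose:

```python
import numpy as np

T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

eigvals, eigvecs = np.linalg.eig(T.T)      # columns are left eigenvectors of T
order = np.argsort(-np.abs(eigvals))
print(np.abs(eigvals[order]))              # 1.0 first; |lambda_2| < 1 sets the mixing rate

pi = np.real(eigvecs[:, order[0]])
pi /= pi.sum()                             # scale the eigenvalue-1 eigenvector to a distribution
print(pi, pi @ T)                          # pi T = pi: the stationary distribution
```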

The Hastings-Metropolis Algorithm

The Hastings-Metropolis Algorithm Our goal: Generate samples from the discrete distribution π(j) = b(j)/B, j = 1, …, m, where the b(j) > 0 are known but the normalizing constant B = Σ_j b(j) is not. We don't know B! The main idea is to construct a time-reversible Markov chain with limit distribution (π_1, …, π_m). Later we will discuss what to do when the distribution is continuous.

The Hastings-Metropolis Algorithm Let {1, 2, …, m} be the state space of a Markov chain with proposal transition probabilities q(i,j) that we can simulate. When X_t = i, propose j with probability q(i,j) and accept it with probability α(i,j) = min(1, b(j) q(j,i) / (b(i) q(i,j))); otherwise stay at i. The unknown B cancels in the ratio. No rejection: we use all of X_1, X_2, …, X_n, …

Example for Large State Space Let {1, 2, …, m} be the state space of a Markov chain that we can simulate, e.g. a d-dimensional grid: at most 2d possible moves at each grid point (linear in d), but an exponentially large state space in the dimension d.

The Hastings-Metropolis Algorithm Theorem: the resulting chain is time-reversible and satisfies detailed balance with respect to π = (π_1, …, π_m); hence π is its stationary distribution, and if the chain is irreducible and aperiodic, π is also its limit distribution. (The proof checks the detailed balance condition directly from the definition of α.)

The Hastings-Metropolis Algorithm This is not rejection sampling: we use all the samples!

Continuous Distributions The same algorithm can be used for continuous distributions as well. In this case, the state space is continuous.

Experiment with HM An application to continuous distributions. Bimodal target distribution: p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x−10)²), proposal q(x | x^(i)) = N(x^(i), 100), 5000 iterations.
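A minimal Python sketch of this experiment; the random-walk proposal is symmetric, so the Hastings ratio reduces to p(y)/p(x), and the starting point x_0 = 0 is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized bimodal target from the slides."""
    return 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10)**2)

def metropolis_hastings(n_iter=5000, x0=0.0, proposal_std=10.0):
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + rng.normal(0.0, proposal_std)    # symmetric proposal q(y|x) = N(x, 100)
        # q cancels in the ratio, so alpha = min(1, p(y)/p(x))
        if rng.uniform() < p_tilde(y) / p_tilde(x):
            x = y                                # accept the proposed move
        chain[t] = x                             # no rejection step: we keep every X_t
    return chain

chain = metropolis_hastings()
print(chain.mean())  # roughly 7: the chain visits both modes (near 0 and near 10)
```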

A good proposal distribution is important.

HM on Combinatorial Sets Generate uniformly distributed samples from the set of permutations (x_1, …, x_n) of (1, …, n) that satisfy Σ_j j·x_j > a. Let n = 3 and a = 12: {1,2,3}: 1+4+9 = 14; {1,3,2}: 1+6+6 = 13; {2,3,1}: 2+6+3 = 11; {2,1,3}: 2+2+9 = 13; {3,1,2}: 3+2+6 = 11; {3,2,1}: 3+4+3 = 10. Only the first, second, and fourth permutations satisfy the constraint.

HM on Combinatorial Sets To define a simple Markov chain on permutations, we need the concept of neighboring elements. Definition: Two permutations are neighbors if one results from the interchange of two of the positions of the other: (1,2,3,4) and (1,2,4,3) are neighbors; (1,2,3,4) and (1,3,4,2) are not neighbors.

HM on Combinatorial Sets With the uniform target and the symmetric neighbor proposal, the Hastings-Metropolis chain accepts a proposed neighbor whenever it still satisfies the constraint, and its limit distribution is uniform on the target set. That is what we wanted!
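A small sketch of this sampler for the n = 3, a = 12 example, assuming the chain proposes a uniformly random transposition (a symmetric proposal, since every permutation has the same number of neighbors) and accepts iff the neighbor still satisfies Σ_j j·x_j > a:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

n, a = 3, 12

def score(perm):
    """sum_j j * x_j, with positions j = 1..n."""
    return sum((j + 1) * x for j, x in enumerate(perm))

def mh_permutations(n_iter=20_000):
    perm = list(range(1, n + 1))      # start from (1,...,n); its score 14 exceeds a
    counts = {}
    pairs = list(combinations(range(n), 2))
    for _ in range(n_iter):
        i, j = pairs[rng.integers(len(pairs))]
        proposal = perm.copy()
        proposal[i], proposal[j] = proposal[j], proposal[i]  # swap two positions
        if score(proposal) > a:       # uniform target + symmetric proposal:
            perm = proposal           # accept iff the neighbor is still in the set
        counts[tuple(perm)] = counts.get(tuple(perm), 0) + 1
    return counts

print(mh_permutations())  # ~equal counts for (1,2,3), (1,3,2), (2,1,3): scores 14, 13, 13
```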

Gibbs Sampling: The Problem Suppose that we can generate samples from the full conditional distributions p(x_j | x_1, …, x_{j−1}, x_{j+1}, …, x_d). Our goal is to generate samples from the joint distribution p(x_1, …, x_d).

Gibbs Sampling: Pseudo Code In each iteration, sample every coordinate in turn from its full conditional: x_1 ~ p(x_1 | x_2, …, x_d), then x_2 ~ p(x_2 | x_1, x_3, …, x_d), …, then x_d ~ p(x_d | x_1, …, x_{d−1}), always conditioning on the most recent values.
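A minimal sketch of this loop for a case where the full conditionals are known exactly: a two-dimensional Gaussian with correlation ρ (the target and ρ = 0.8 are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(n_iter=10_000, rho=0.8):
    """Gibbs sampling for a 2-D Gaussian: zero mean, unit variances, correlation rho.
    Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2."""
    x1, x2 = 0.0, 0.0
    chain = np.empty((n_iter, 2))
    s = np.sqrt(1.0 - rho**2)
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, s)   # sample x1 from p(x1 | x2)
        x2 = rng.normal(rho * x1, s)   # sample x2 from p(x2 | x1), using the new x1
        chain[t] = (x1, x2)
    return chain

chain = gibbs_bivariate_gaussian()
print(np.corrcoef(chain.T)[0, 1])  # close to rho = 0.8
```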

Gibbs Sampling: Theory Consider the following HM sampler: the proposal replaces one coordinate of the current state with a draw from its full conditional distribution. Observation: By construction, this HM sampler samples from p. We will prove that this HM sampler is exactly the Gibbs sampler.

Gibbs Sampling is a Special HM Theorem: Gibbs sampling is a special case of Hastings-Metropolis in which the proposal is the full conditional distribution and the acceptance probability is always α = 1. Proof: By definition, substituting the full conditional proposal into the acceptance ratio makes the ratio equal to 1.

Gibbs Sampling is a Special HM Proof (continued).

Gibbs Sampling in Practice

Simulated Annealing

Simulated Annealing Goal: Find x* = argmax_x V(x).

Simulated Annealing Theorem: Define P_λ(x) ∝ exp(λ V(x)). As λ → ∞, P_λ concentrates on the set of global maximizers of V. Proof: for every non-maximizer x, the ratio P_λ(x)/P_λ(x*) = exp(λ(V(x) − V(x*))) → 0.

Simulated Annealing Main idea: Let λ be big. Generate a Markov chain with limit distribution P_λ(x). In the long run, the Markov chain will jump among the maximum points of P_λ(x). Introduce the relationship of neighboring vectors: each state has a set of neighbors, and the chain moves between neighbors.

Simulated Annealing Proposal: choose a neighbor of the current state uniformly at random, and use Hastings-Metropolis sampling with target P_λ.

Simulated Annealing: Pseudo Code Propose a neighbor y of the current state x and compute α = min(1, exp(λ(V(y) − V(x)))). With probability α, accept the new state; with probability 1 − α, don't accept and stay.

Simulated Annealing: Special case If V(y) ≥ V(x), then α = 1: we always accept the new state, since we increased V.
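A compact sketch of the whole procedure, using the log of the bimodal density from the HM experiment as the objective V; the linear annealing schedule, the Gaussian step, and the best-state tracking are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def V(x):
    """Toy objective to maximize: global maximum near x = 10, a local one near x = 0."""
    return np.logaddexp(np.log(0.3) - 0.2 * x**2, np.log(0.7) - 0.2 * (x - 10)**2)

def simulated_annealing(n_iter=20_000, x0=0.0, step=2.0):
    x = x0
    best_x, best_v = x, V(x)
    for t in range(1, n_iter + 1):
        lam = 0.005 * t                        # schedule: temperature 1/lam decreases
        y = x + rng.normal(0.0, step)          # propose a neighboring state
        dv = V(y) - V(x)
        if dv >= 0 or rng.uniform() < np.exp(lam * dv):  # alpha = min(1, exp(lam * dv))
            x = y                              # accept (always, when V increased)
        if V(x) > best_v:                      # keep the best state seen so far
            best_x, best_v = x, V(x)
    return best_x

print(simulated_annealing())  # typically close to 10, the global maximizer
```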

Simulated Annealing: Problems

Simulated Annealing Temperature = 1/λ. Annealing gradually decreases the temperature (increases λ) during the run, so the chain explores broadly at first and settles on maximizers later.

Monte Carlo EM E-step: compute Q(θ, θ_old) = ∫ p(Z | X, θ_old) log p(X, Z | θ) dZ. Monte Carlo EM: draw samples Z^(i) ~ p(Z | X, θ_old), e.g. with MCMC; then the integral can be approximated by Q(θ, θ_old) ≈ (1/N) Σ_i log p(X, Z^(i) | θ)!

Thanks for the Attention!