Lecture 15: MCMC Sanjeev Arora Elad Hazan. COS 402 Machine Learning and Artificial Intelligence Fall 2016

Course progress: Learning from examples: definition + fundamental theorem of statistical learning, motivating efficient algorithms/optimization; convexity, greedy optimization, gradient descent; neural networks. Knowledge representation: NLP, logic, Bayes nets. Optimization: MCMC (TODAY). Next: reinforcement learning.

Goal: inference in Bayes networks. Values for the lung cancer example (network diagram: Pollution and Smoking point to Cancer; Cancer points to X-ray and Dyspnoea):
Node name   Type     Values
Pollution   Binary   {low, high}
Smoker      Boolean  {T, F}
Cancer      Boolean  {T, F}
Dyspnoea    Boolean  {T, F}
X-ray       Binary   {pos, neg}

How to sample from a distribution? How to generate a random number? Von Neumann's coin: given a biased coin that turns up heads w.p. p ≠ 1/2, how to generate a fair random bit?
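A minimal Python sketch of Von Neumann's trick (the code and function names are my own, not from the slides): flip the biased coin in pairs and keep only the unequal pairs, which are equally likely either way.

    import random

    def biased_coin(p):
        # returns 1 (heads) with probability p, 0 (tails) otherwise
        return 1 if random.random() < p else 0

    def von_neumann_fair_bit(p):
        # Flip the biased coin twice; HT -> 1, TH -> 0, equal -> discard and retry.
        # P[HT] = P[TH] = p*(1-p), so the returned bit is fair for any 0 < p < 1.
        while True:
            a, b = biased_coin(p), biased_coin(p)
            if a != b:
                return a

    print(sum(von_neumann_fair_bit(0.8) for _ in range(10000)) / 10000)  # ~ 0.5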

How to sample from a distribution? How to generate a random number? Von Neumann's coin. From now on: assume we have access to U[0,1]. Uniformly at random on an interval? Exponential?

Inverse transform method: the cumulative distribution function (CDF).

Inverse transform method. Let F: R → [0,1] be the CDF we want to sample from, and let F^{-1}: [0,1] → R be its inverse. Algorithm: sample Y ~ U[0,1] and return X = F^{-1}(Y). Theorem: X ~ F. Exponential distribution: F(x) = 1 − e^{−λx} for x ≥ 0, so sample y ~ U[0,1] and return −(1/λ) ln(1 − y).

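A short Python sketch of this inverse-transform recipe for the exponential distribution; the function name and the sanity check are illustrative, not from the lecture.

    import math
    import random

    def sample_exponential(lam):
        # F(x) = 1 - exp(-lam * x), so F^{-1}(y) = -(1/lam) * ln(1 - y)
        y = random.random()                 # Y ~ U[0, 1]
        return -math.log(1.0 - y) / lam     # X = F^{-1}(Y) is Exp(lam)-distributed

    samples = [sample_exponential(2.0) for _ in range(100000)]
    print(sum(samples) / len(samples))      # empirical mean, should be close to 1/lam = 0.5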

How to sample from a distribution? How to generate a random number? Von Neumann's coin. From now on: assume we have access to U[0,1]. Uniformly at random on an interval? Exponential? Gaussian/Normal?

Normal random variable (Gaussian). PDF and CDF: why can't we use the inverse transform?

Gaussian sample: the Box-Muller algorithm. Idea: convert to radial coordinates. Sample a radius r with r² ~ Exp(1/2) and an angle θ ~ U[0, 2π], and return the Cartesian coordinates X = r cos θ, Y = r sin θ.

Gaussian sample: the Box-Muller algorithm. Idea: convert to radial coordinates. Sample a radius r with r² ~ Exp(1/2) and an angle θ ~ U[0, 2π], and return the Cartesian coordinates X = r cos θ, Y = r sin θ. Theorem: X, Y ~ N(0,1) and are independent. Proof idea: a pair of i.i.d. normal RVs is rotationally symmetric, and its squared radius is distributed as Exp(1/2).
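A Python sketch of Box-Muller as described above (my own function name): the squared radius is drawn by inverse transform and the angle uniformly on [0, 2π].

    import math
    import random

    def box_muller():
        # Two independent N(0,1) samples from two U[0,1] draws.
        u1, u2 = random.random(), random.random()
        r = math.sqrt(-2.0 * math.log(1.0 - u1))  # squared radius r^2 ~ Exp(1/2), via inverse transform
        theta = 2.0 * math.pi * u2                # uniform angle on [0, 2*pi]
        return r * math.cos(theta), r * math.sin(theta)

    x, y = box_muller()
    print(x, y)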

How to sample from a distribution? Sampling from a multi-dimensional distribution? The inverse transform is often hard to compute, and other methods (importance sampling, etc.) degrade exponentially with the dimension. The problem is often provably computationally hard, but it is also very important!

Los Alamos simulations: the need to simulate complicated multi-particle experiments. 0-1 assignments; a valid configuration has no neighboring 1's.

Sampling in Bayes networks?? (Same lung cancer network and value table as before.) Queries: P[X_i = a_i | X_1 = a_1, …, X_k = a_k] = ?  P[X_1 = a_1 | X_k = a_k] = ?

The MCMC paradigm: to sample from a distribution p, design a Markov chain whose stationary distribution is π = p. Then simulate the Markov chain and sample from it after it has mixed (reached stationarity).

The MCMC paradigm: to sample from a distribution p, design a Markov chain whose stationary distribution is π = p. Then simulate the Markov chain and sample from it after it has mixed (reached stationarity). 1. What is a Markov chain and a stationary distribution? 2. When does it have a stationary distribution, and how can we find it / sample from it efficiently? 3. How do we design a Markov chain for a given distribution?

Markov Chain. A Markov chain with three states (s = 3): a directed graph, and a transition matrix giving, for each i, j, the probability of stepping to j when at i. [Figure: transition matrix and transition graph]

Markov chains: usage and examples. Common example: PageRank (Google's initial webpage ranking system). Web graph: nodes = webpages, edges = hyperlinks. T_ij = probability of moving from page i to page j = 1/d_i if (i, j) ∈ E, 0 otherwise, where d_i = out-degree (number of outgoing links) of page i. PageRank score of page i = π(i) = its probability in the stationary distribution!
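To make the construction concrete, here is a small Python/NumPy sketch (the example web graph is made up, not from the lecture): it builds T_ij = 1/d_i over the hyperlinks and approximates the stationary distribution by power iteration, i.e. the plain random walk, without the damping factor used in practical PageRank.

    import numpy as np

    # Hypothetical tiny web graph: page -> pages it links to.
    links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
    n = len(links)

    # T[i, j] = 1/d_i if page i links to page j, 0 otherwise (d_i = out-degree of i).
    T = np.zeros((n, n))
    for i, out in links.items():
        for j in out:
            T[i, j] = 1.0 / len(out)

    # PageRank score of page i = pi[i], the stationary distribution of the walk.
    pi = np.full(n, 1.0 / n)
    for _ in range(1000):
        pi = pi @ T
    print(pi)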

Random walks in a Markov chain. Markov chain with three states (s = 3). Starting from state i, the initial distribution is p_1 = e_i and after one step it is p_2 = e_i T. After n steps: p_n = e_i T T ··· T = e_i T^{n-1}. Let π = lim_{n→∞} e_i T^n.

Random walks in a Markov chain. Markov chain with three states (s = 3). Let π = lim_{n→∞} e_i T^n. Thus πT = π: a stationary distribution.

Random walks in a Markov chain. Markov chain with three states (s = 3). For this MC: p_1 = e_1 = (1, 0, 0), p_2 = e_1 T = (0, 1, 0), p_3 = p_2 T = (0, 0.1, 0.9), p_4 = p_3 T = (0.54, 0.37, 0.09), π = p_∞ ≈ (0.22, 0.41, 0.37).
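The transition matrix itself is in the (missing) figure, but it can be reconstructed from the p_t values above. Assuming that reconstruction is right, this NumPy sketch reproduces the numbers on the slide:

    import numpy as np

    # Transition matrix of the 3-state example, reconstructed from the p_t values
    # above (row i = distribution of the next state when currently at state i).
    T = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.1, 0.9],
                  [0.6, 0.4, 0.0]])

    p = np.array([1.0, 0.0, 0.0])        # p_1 = e_1: start at state 1
    for t in range(2, 6):
        p = p @ T                        # p_t = p_{t-1} T = e_1 T^(t-1)
        print("p_%d =" % t, np.round(p, 2))

    for _ in range(1000):                # keep walking: p converges to pi with pi T = pi
        p = p @ T
    print("pi  =", np.round(p, 2))       # ~ (0.22, 0.41, 0.37)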

Stationary distribution. A distribution π = (π_1, …, π_m) is stationary if π_i ≥ 0 for all i, Σ_i π_i = 1, and πT = π. (Taking one step according to the Markov chain leaves this distribution unchanged.)

Non-stationary Markov chains. [Figure: two states 1 and 2 with probability-1 transitions between them: periodic]

Non-stationary Markov chains. [Figure: a chain whose graph is not strongly connected: reducible]

Ergodic theorem. Amazingly, every irreducible and aperiodic Markov chain has a unique stationary distribution, and every random walk, starting from any node, converges to it! → implication for PageRank. Mixing time: the smallest n such that ||e_i T^n − π|| ≤ ε for every start state i. In general it depends polynomially on the number of nodes and is very hard to bound; many times in practice it depends only logarithmically on the number of nodes!
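A rough Python sketch of how one might measure mixing empirically for a small chain (assuming it is irreducible and aperiodic); the helper name, the use of total-variation distance, and the way π is approximated are my own choices, not from the lecture.

    import numpy as np

    def mixing_time(T, eps=1e-3, max_steps=100000):
        # Smallest n such that the walk from *every* start state is within
        # total-variation distance eps of the stationary distribution.
        pi = np.linalg.matrix_power(T, max_steps)[0]   # crude approximation of pi
        P = np.eye(T.shape[0])                         # row i of P will be e_i T^n
        for n in range(1, max_steps + 1):
            P = P @ T
            worst_tv = 0.5 * np.abs(P - pi).sum(axis=1).max()
            if worst_tv <= eps:
                return n
        return None

    print(mixing_time(np.array([[0.0, 1.0, 0.0],
                                [0.0, 0.1, 0.9],
                                [0.6, 0.4, 0.0]])))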

Designing a Markov chain: Metropolis-Hastings. Input: the distribution we wish to sample from; the probability of element i is p_i. Output: a sample from a Markov chain whose stationary distribution is π = p.
MH algorithm: for t = 1, 2, …, T:
1. Start in an arbitrary state i, and let s_1 = i.
2. At time t, pick a state j from [n] uniformly at random (or from some other reasonable proposal distribution).
3. Update according to the rule: s_{t+1} = j w.p. min{1, p_j / p_{s_t}}, and s_{t+1} = s_t otherwise.
4. Return to (2), unless t = T, in which case stop and return s_T.

Designing a Markov chain: Metropolis-Hastings. Theorem: the Markov chain defined by the MH algorithm above (with any reasonable proposal rule in step 2) has stationary distribution π = p.

Metropolis-Hastings algorithm: create the Markov chain, start from any state, simulate the MC for many, many steps (the mixing time), and return the final state. By the theorem, it is a sample from p! Next: applying it to Bayesian inference.
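A minimal Python sketch of this recipe for a distribution on finitely many states, with a uniform proposal; the target weights in the usage lines are made up. Note that only ratios p_j / p_i are used, so p may be unnormalized.

    import random

    def metropolis_hastings(p, T_steps):
        # p: unnormalized positive weights p[0..n-1] of the target distribution.
        # Proposal: pick j uniformly from [n]; accept with prob min{1, p[j]/p[i]}.
        n = len(p)
        s = random.randrange(n)                 # 1. start in an arbitrary state
        for _ in range(T_steps):
            j = random.randrange(n)             # 2. propose j uniformly at random
            if random.random() < min(1.0, p[j] / p[s]):
                s = j                           # 3. accept the move, otherwise stay
        return s                                # 4. return the state after T steps

    weights = [1.0, 2.0, 3.0, 4.0]              # target is proportional to these
    samples = [metropolis_hastings(weights, 500) for _ in range(2000)]
    print([round(samples.count(k) / len(samples), 2) for k in range(4)])  # ~ 0.1 0.2 0.3 0.4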

Sampling in Bayes networks by MCMC. (Same lung cancer network and value table as before.) Queries: P[X_i = a_i | X_1 = a_1, …, X_k = a_k] = ?  P[X_1 = a_1 | X_2 = a_2, X_k = a_k] = ?

Sampling in Bayes networks by MCMC. Goal: estimate P[X_1 = a_1 | X_2 = a_2, X_k = a_k]. Method: sample from P[X_1 | X_2 = a_2, X_k = a_k] and estimate the probability of the value X_1 = a_1. Assume binary values, a_i ∈ {0, 1}. Graph: all possible assignments to all variables (2^n nodes!).
The MCMC algorithm:
1. Start in an arbitrary state (b_1, b_2, …, b_n) such that b_2 = a_2 and b_k = a_k.
2. Pick a random variable X_j ∉ {X_2, X_k} and move to the state (b_1, b_2, …, 1 − b_j, …, b_n) with probability min{1, P[b_1, b_2, …, 1 − b_j, …, b_n] / P[b_1, b_2, …, b_n]}.
3. Return to (2), unless the step limit is reached, in which case return the current state.
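A Python sketch of this flip chain on a tiny hypothetical Bayes net (the structure and CPT numbers here are made up for illustration; this is not the lecture's lung cancer network). Only the ratio of joint probabilities is ever evaluated, which is the point of applying MCMC to Bayes nets.

    import random

    # Hypothetical 3-variable binary Bayes net X1 -> X2 -> X3 with made-up CPTs.
    def joint(b):
        # P[X1=b[0]] * P[X2=b[1] | X1] * P[X3=b[2] | X2], all entries nonzero
        p1 = 0.3 if b[0] == 1 else 0.7
        p2 = (0.8 if b[1] == 1 else 0.2) if b[0] == 1 else (0.1 if b[1] == 1 else 0.9)
        p3 = (0.9 if b[2] == 1 else 0.1) if b[1] == 1 else (0.2 if b[2] == 1 else 0.8)
        return p1 * p2 * p3

    def mcmc_sample(evidence, n_vars=3, steps=200):
        # evidence: dict {variable index: observed value}; these bits never change.
        b = [evidence.get(i, random.randint(0, 1)) for i in range(n_vars)]
        free = [i for i in range(n_vars) if i not in evidence]
        for _ in range(steps):
            j = random.choice(free)                 # pick a non-evidence variable
            b_new = list(b)
            b_new[j] = 1 - b_new[j]                 # propose flipping it
            if random.random() < min(1.0, joint(b_new) / joint(b)):
                b = b_new                           # accept with prob min{1, P[new]/P[old]}
        return b

    # usage: estimate P[X1 = 1 | X3 = 1] by averaging many (hopefully mixed) samples
    runs = [mcmc_sample({2: 1}) for _ in range(2000)]
    print(sum(r[0] for r in runs) / len(runs))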

A different search rule (in Metropolis-Hastings). Goal: estimate P[X_1 = a_1 | X_2 = a_2, X_k = a_k]. Method: sample from P[X_1 | X_2 = a_2, X_k = a_k] and estimate the probability of the value X_1 = a_1. Assume binary values, a_i ∈ {0, 1}. Graph: all possible assignments to all variables (2^n nodes!).
The MCMC algorithm:
1. Start in an arbitrary state (b_1, b_2, …, b_n) such that b_2 = a_2 and b_k = a_k.
2. Pick two variables X_j, X_j' ∉ {X_2, X_k} and move to the state (b_1, …, 1 − b_j, …, 1 − b_j', …, b_n) with probability min{1, P[b_1, …, 1 − b_j, …, 1 − b_j', …, b_n] / P[b_1, b_2, …, b_n]}.
3. Return to (2), unless the step limit is reached, in which case return the current state.

The MCMC paradigm - summary. To sample from a distribution p, design a Markov chain whose stationary distribution is π = p. Then simulate the Markov chain and sample from it after it has mixed (reached stationarity). 1. Metropolis-Hastings: a general methodology for designing Markov chains for a given distribution. 2. It can be applied to Bayes networks, since only ratios of local probabilities are needed. 3. The mixing time is the hard quantity to bound (usually polynomial in the graph size, which here is exponential in the number of variables).