Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo



Markov Chain Monte Carlo
In high-dimensional spaces, rejection sampling and importance sampling are very inefficient.
An alternative is Markov Chain Monte Carlo (MCMC).
It keeps a record of the current state, and the proposal depends on that state.
The most common algorithms are the Metropolis-Hastings algorithm and Gibbs sampling.

Markov Chains Revisited
A Markov chain is a distribution over discrete-state random variables $x_1, \dots, x_T$ so that
$p(x_1, \dots, x_T) = p(x_1)\, p(x_2 \mid x_1) \cdots = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})$
[Figure: graphical model of a Markov chain.]
We will denote $p(x_t \mid x_{t-1})$ as a row vector $\pi_t$.
A Markov chain can also be visualized as a state transition diagram.

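The State Transition Diagram
[Figure: state transition diagram unrolled over the time steps t-2, t-1, t, with the states k = 1, 2, 3 on the vertical axis and transition probabilities such as $A_{11}$ and $A_{33}$ on the edges.]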

The Stationary Distribution
The probability to reach state $k$ is
$\pi_{k,t} = \sum_{i=1}^{K} \pi_{i,t-1} A_{ik}$
Or, in matrix notation: $\pi_t = \pi_{t-1} A$
We say that $\pi_t$ is stationary if $\pi_t = \pi_{t-1}$.
To find the stationary distribution we need to solve the eigenvector problem $A^T v = v$.
The stationary distribution is then $\pi = v^T$, where $v$ is the eigenvector for which the eigenvalue is 1.
This eigenvector needs to be normalized so that it is a valid distribution.
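The eigenvector recipe above can be sketched in a few lines of NumPy. The 3-state transition matrix is a made-up example, not one from the lecture, and `stationary_distribution` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical 3-state transition matrix; row i holds p(x_t = j | x_{t-1} = i).
A = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
])

def stationary_distribution(A):
    """Solve A^T v = v and normalize v so that it sums to 1."""
    eigvals, eigvecs = np.linalg.eig(A.T)     # columns of eigvecs are eigenvectors of A^T
    idx = np.argmin(np.abs(eigvals - 1.0))    # pick the eigenvalue closest to 1
    v = np.real(eigvecs[:, idx])
    return v / v.sum()                        # normalization also fixes the sign

pi = stationary_distribution(A)
print(pi)        # stationary distribution as a row vector
print(pi @ A)    # should reproduce pi
```

Multiplying the result by `A` once more is a cheap sanity check that the vector is indeed stationary.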

Existence of a Stationary Distribution
A Markov chain can have many stationary distributions.
[Figure: transition diagram over the four states 1, 2, 3, 4 with edge probabilities 0.1, 0.9, 1.0 and 0.5.]
Sufficient for having a unique stationary distribution: we can reach every state from any other state in finite steps (the chain is regular).
A transition distribution $\pi_t$ satisfies the property of detailed balance if
$\pi_i A_{ij} = \pi_j A_{ji}$
The chain is then said to be reversible.

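Reversible Chain: Example
[Figure: states 1 and 3 at the time steps t-1 and t, illustrating detailed balance between the two states via the transition probabilities $A_{13}$ and $A_{31}$.]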

Existence of a Stationary Distribution
Theorem: If a Markov chain with transition matrix $A$ is regular and satisfies detailed balance with respect to the distribution $\pi$, then $\pi$ is a stationary distribution of the chain.
Proof:
$\sum_{i=1}^{K} \pi_i A_{ij} = \sum_{i=1}^{K} \pi_j A_{ji} = \pi_j \sum_{i=1}^{K} A_{ji} = \pi_j \quad \forall j$
It follows that $\pi = \pi A$.
Detailed balance is only a sufficient condition for stationarity; it is not necessary.
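The theorem can be checked numerically on a small chain. The sketch below assumes an arbitrary 4-state target and builds a reversible transition matrix from it (uniform proposal over the other states, accepted with probability min(1, pi_j / pi_i)); this construction is just one convenient way to obtain a reversible chain and is not taken from the lecture.

```python
import numpy as np

# Assumed target distribution over 4 states (illustrative values).
pi = np.array([0.1, 0.2, 0.3, 0.4])
K = len(pi)

# Reversible chain: propose one of the other states uniformly,
# accept the move with probability min(1, pi_j / pi_i).
A = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            A[i, j] = (1.0 / (K - 1)) * min(1.0, pi[j] / pi[i])
    A[i, i] = 1.0 - A[i].sum()   # remaining probability mass: stay in state i

flow = pi[:, None] * A                         # flow[i, j] = pi_i * A_ij
detailed_balance = np.allclose(flow, flow.T)   # pi_i A_ij == pi_j A_ji for all i, j
stationary = np.allclose(pi @ A, pi)           # pi == pi A
print(detailed_balance, stationary)
```

Both checks should print `True`: the flow matrix is symmetric, and summing its columns gives back `pi`, which is exactly the argument of the proof.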

Sampling with a Markov Chain
The main idea of MCMC is to sample state transitions based on a proposal distribution $q$.
The most widely used algorithm is the Metropolis-Hastings (MH) algorithm.
In MH, the decision whether to stay in a given state or to move to the proposed one is based on a given probability.
If the proposal distribution is $q(x' \mid x)$, then we move to the proposed state $x'$ with probability
$\min\left(1, \frac{\tilde{p}(x')\, q(x \mid x')}{\tilde{p}(x)\, q(x' \mid x)}\right)$
and stay in $x$ otherwise. Here $\tilde{p}$ is the unnormalized target distribution.

The Metropolis-Hastings Algorithm
Initialize $x^0$
for $s = 0, 1, 2, \dots$
    define $x = x^s$
    sample $x' \sim q(x' \mid x)$
    compute the acceptance probability $\alpha = \frac{\tilde{p}(x')\, q(x \mid x')}{\tilde{p}(x)\, q(x' \mid x)}$
    compute $r = \min(1, \alpha)$
    sample $u \sim U(0, 1)$
    set the new sample to $x^{s+1} = x'$ if $u < r$, and $x^{s+1} = x^s$ if $u \ge r$
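The loop above translates almost line by line into Python. This is a minimal sketch, not the lecture's implementation: the function name `metropolis_hastings`, the Gaussian example target with mean 3 and standard deviation 2, and the random-walk proposal with step size 2 are all assumptions made for illustration. Working with log-densities avoids numerical underflow.

```python
import numpy as np

def metropolis_hastings(log_p_tilde, sample_q, log_q, x0, n_samples, rng):
    """Generic MH loop.

    log_p_tilde(x):      log of the unnormalized target
    sample_q(x, rng):    draws x' ~ q(x' | x)
    log_q(x_to, x_from): log q(x_to | x_from), up to a constant
    """
    x = x0
    samples = np.empty(n_samples)
    for s in range(n_samples):
        x_prop = sample_q(x, rng)
        # log of alpha = p~(x') q(x | x') / (p~(x) q(x' | x))
        log_alpha = (log_p_tilde(x_prop) + log_q(x, x_prop)
                     - log_p_tilde(x) - log_q(x_prop, x))
        u = rng.uniform()
        if np.log(u) < log_alpha:    # equivalent to u < min(1, alpha)
            x = x_prop               # accept the proposal
        samples[s] = x               # on rejection the old state is repeated
    return samples

# Illustrative setup (assumed): unnormalized Gaussian target, random-walk proposal.
step = 2.0
log_p_tilde = lambda x: -0.5 * ((x - 3.0) / 2.0) ** 2
sample_q = lambda x, rng: x + step * rng.normal()
log_q = lambda x_to, x_from: -0.5 * ((x_to - x_from) / step) ** 2

rng = np.random.default_rng(0)
samples = metropolis_hastings(log_p_tilde, sample_q, log_q, 0.0, 20000, rng)
kept = samples[1000:]                # drop an initial burn-in phase
print(kept.mean(), kept.std())
```

For this symmetric proposal the two `log_q` terms cancel; they are kept so that the same loop also works with asymmetric proposals.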

Why Does This Work?
We have to prove that the transition probability of the MH algorithm satisfies detailed balance with respect to the target distribution.
Theorem: If $p_{MH}(x' \mid x)$ is the transition probability of the MH algorithm, then
$p(x)\, p_{MH}(x' \mid x) = p(x')\, p_{MH}(x \mid x')$
Note: All formulations are valid for discrete and for continuous variables!
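A sketch of the standard argument, written out for the case $x' \neq x$ (for $x' = x$ the statement is trivial). It assumes that $p = \tilde{p} / Z$ for some normalization constant $Z$, so that the ratio in the acceptance probability is the same for $p$ and $\tilde{p}$.

```latex
\begin{align*}
p_{MH}(x' \mid x)
  &= q(x' \mid x)\, \min\!\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)
  && \text{(propose } x' \text{, then accept)} \\
p(x)\, p_{MH}(x' \mid x)
  &= \min\bigl(p(x)\, q(x' \mid x),\; p(x')\, q(x \mid x')\bigr) \\
  &= \min\bigl(p(x')\, q(x \mid x'),\; p(x)\, q(x' \mid x)\bigr)
  && \text{(symmetric in } x, x' \text{)} \\
  &= p(x')\, p_{MH}(x \mid x')
\end{align*}
```

The middle expression does not change when $x$ and $x'$ are swapped, which is exactly the detailed balance condition.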

Choosing the Proposal
A proposal distribution is valid if it gives a non-zero probability of moving to the states that have a non-zero probability in the target.
A good proposal is the Gaussian, because it has a non-zero probability for all states.
However: the variance of the Gaussian is important!
- With low variance, the sampler does not explore sufficiently, e.g. it is fixed to a particular mode.
- With too high variance, the proposal is rejected too often, and the samples are a bad approximation.

Example
Target is a mixture of two 1D Gaussians. Proposal is a Gaussian with different variances.
[Figure: three panels, MH with $N(0, 1.000^2)$ proposal, MH with $N(0, 500.000^2)$ proposal and MH with $N(0, 8.000^2)$ proposal, each showing the samples (from -100 to 100) over 1000 iterations.]
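The effect of the proposal variance can be reproduced with a small experiment. The mixture parameters below (weights 0.3 and 0.7, means -20 and 20, standard deviation 10) are assumptions chosen for illustration, since the slide does not state the target; only the three proposal standard deviations are taken from the panel titles.

```python
import numpy as np

def log_target(x):
    """Log of an assumed two-component 1D Gaussian mixture (unnormalized)."""
    comp1 = 0.3 * np.exp(-0.5 * ((x + 20.0) / 10.0) ** 2)
    comp2 = 0.7 * np.exp(-0.5 * ((x - 20.0) / 10.0) ** 2)
    return np.log(comp1 + comp2 + 1e-300)    # small constant guards against log(0)

def random_walk_mh(proposal_std, n_iter=5000, seed=0):
    """Random-walk MH; returns the samples and the fraction of accepted proposals."""
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_iter)
    accepted = 0
    for s in range(n_iter):
        x_prop = x + proposal_std * rng.normal()
        # Symmetric proposal: the q terms cancel in the acceptance ratio.
        if np.log(rng.uniform()) < log_target(x_prop) - log_target(x):
            x = x_prop
            accepted += 1
        samples[s] = x
    return samples, accepted / n_iter

results = {std: random_walk_mh(std) for std in (1.0, 8.0, 500.0)}
for std, (samples, acc) in results.items():
    print(f"std={std:6.1f}  acceptance={acc:.2f}  sample mean={samples.mean():7.2f}")
```

A very small step is accepted almost always but moves slowly between the modes, while a very large step is rejected most of the time, so the acceptance rate alone is not a measure of sample quality.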

Gibbs Sampling
Initialize $\{z_i : i = 1, \dots, M\}$
For $\tau = 1, \dots, T$
    Sample $z_1^{(\tau+1)} \sim p(z_1 \mid z_2^{(\tau)}, \dots, z_M^{(\tau)})$
    Sample $z_2^{(\tau+1)} \sim p(z_2 \mid z_1^{(\tau+1)}, z_3^{(\tau)}, \dots, z_M^{(\tau)})$
    ...
    Sample $z_M^{(\tau+1)} \sim p(z_M \mid z_1^{(\tau+1)}, \dots, z_{M-1}^{(\tau+1)})$
Idea: sample from the full conditional.
This can be obtained, e.g., from the Markov blanket in graphical models.
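A minimal sketch of the scheme for M = 2, assuming a zero-mean bivariate Gaussian target with unit variances and correlation `rho` (an illustrative choice, not from the lecture). For this target the full conditionals are known in closed form: $z_1 \mid z_2 \sim N(\rho z_2, 1 - \rho^2)$ and vice versa.

```python
import numpy as np

rho = 0.8            # assumed correlation of the illustrative target
n_sweeps = 5000
rng = np.random.default_rng(0)

z = np.zeros(2)                          # initialize {z_1, z_2}
samples = np.empty((n_sweeps, 2))
cond_std = np.sqrt(1.0 - rho ** 2)       # std of each full conditional

for tau in range(n_sweeps):
    z[0] = rng.normal(rho * z[1], cond_std)   # z_1 ~ p(z_1 | z_2), old z_2
    z[1] = rng.normal(rho * z[0], cond_std)   # z_2 ~ p(z_2 | z_1), new z_1
    samples[tau] = z

burn_in = 500
est_rho = np.corrcoef(samples[burn_in:].T)[0, 1]
print(est_rho)       # should be close to rho
```

Note that the second update already uses the freshly sampled value of the first variable, as in the listing above.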

Gibbs Sampling: Example
Use an MRF on a binary image with edge potentials
$\psi(x_s, x_t) = \exp(J x_s x_t)$ ("Ising model")
and node potentials
$\psi(x_t) = N(y_t \mid x_t, \sigma^2)$, with $x_t \in \{-1, 1\}$
[Figure: graphical model with observed node $y_t$ attached to $x_t$, and $x_t$ connected to its neighbor $x_s$.]
Sample each pixel in turn.
[Figure: three panels, "sample 1, Gibbs" (after 1 sample), "sample 5, Gibbs" (after 5 samples) and "mean after 15 sweeps of Gibbs" (average after 15 samples).]
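The pixel-wise sampler can be sketched as follows. With the potentials above, the log-odds of $x_t = +1$ versus $x_t = -1$ given the neighbors and the observation are $2J \sum_s x_s + 2 y_t / \sigma^2$. The test image, the noise level, the coupling strength and the number of sweeps are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test image: left half -1, right half +1, plus Gaussian noise.
H, W = 16, 16
x_true = np.ones((H, W), dtype=int)
x_true[:, : W // 2] = -1
sigma = 1.0          # assumed noise standard deviation
J = 1.0              # assumed coupling strength
y = x_true + sigma * rng.normal(size=(H, W))

def gibbs_sweep(x, y, J, sigma, rng):
    """Resample every pixel once from its full conditional (4-neighborhood)."""
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            nb = 0
            if i > 0:
                nb += x[i - 1, j]
            if i < H - 1:
                nb += x[i + 1, j]
            if j > 0:
                nb += x[i, j - 1]
            if j < W - 1:
                nb += x[i, j + 1]
            # log-odds of x_ij = +1 versus x_ij = -1
            logit = 2.0 * J * nb + 2.0 * y[i, j] / sigma ** 2
            p_plus = 1.0 / (1.0 + np.exp(-logit))
            x[i, j] = 1 if rng.uniform() < p_plus else -1
    return x

x = np.where(y > 0, 1, -1)               # initialize from the thresholded noisy image
n_burn, n_keep = 5, 15
mean_x = np.zeros((H, W))
for sweep in range(n_burn + n_keep):
    x = gibbs_sweep(x, y, J, sigma, rng)
    if sweep >= n_burn:
        mean_x += x / n_keep             # running average of the kept samples

noisy_error = np.mean(np.where(y > 0, 1, -1) != x_true)
gibbs_error = np.mean(np.where(mean_x > 0, 1, -1) != x_true)
print(noisy_error, gibbs_error)
```

Averaging the samples gives an estimate of the posterior mean of each pixel; thresholding that mean at zero yields a denoised binary image.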

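Gibbs Sampling for GMMs
Again, we start with the full joint distribution:
$p(X, Z, \mu, \Sigma, \pi) = p(X \mid Z, \mu, \Sigma)\, p(Z \mid \pi)\, p(\pi) \prod_{k=1}^{K} p(\mu_k)\, p(\Sigma_k)$ (semi-conjugate prior)
It can be shown that the full conditionals are:
$p(z_i = k \mid x_i, \mu, \Sigma, \pi) \propto \pi_k\, N(x_i \mid \mu_k, \Sigma_k)$
$p(\pi \mid z) = \mathrm{Dir}\left(\{\alpha_k + \sum_{i=1}^{N} z_{ik}\}_{k=1}^{K}\right)$
$p(\mu_k \mid \Sigma_k, Z, X) = N(\mu_k \mid m_k, V_k)$ (linear-Gaussian)
$p(\Sigma_k \mid \mu_k, Z, X) = \mathrm{IW}(\Sigma_k \mid S_k, \nu_k)$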

Gibbs Sampling for GMMs
First, we initialize all variables.
Then we iterate over sampling from each conditional in turn.
In the end, we look at $\mu_k$ and $\Sigma_k$.
[Figure: two scatter plots of "data_gibbs.dat"; the second shows the points assigned to class 1 and class 2 together with mu1, mu2, sigma1 and sigma2.]
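The iteration can be sketched for a deliberately simplified case: 1D data, two components, and a component variance that is held fixed instead of being sampled from an inverse-Wishart conditional. The synthetic data, the prior parameters (`alpha`, `m0`, `v0`) and the initialization are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D data from two clusters (illustrative values).
x = np.concatenate([rng.normal(-3.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
N, K = len(x), 2
sigma2 = 1.0                  # component variance, held fixed in this sketch
alpha = np.ones(K)            # Dirichlet prior on the mixing weights
m0, v0 = 0.0, 100.0           # Gaussian prior N(m0, v0) on each mean

mu = np.array([x.min(), x.max()])    # spread-out initialization of the means
pi = np.full(K, 1.0 / K)
mu_trace = []

for it in range(200):
    # 1) z_i | x_i, mu, pi  proportional to  pi_k N(x_i | mu_k, sigma2)
    log_w = np.log(pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma2
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    u = rng.uniform(size=N)[:, None]
    z = (u > np.cumsum(w, axis=1)).sum(axis=1)    # inverse-CDF draw per data point
    z = np.minimum(z, K - 1)                      # guard against round-off in cumsum

    # 2) pi | z ~ Dir(alpha_k + n_k)
    counts = np.bincount(z, minlength=K)
    pi = rng.dirichlet(alpha + counts)

    # 3) mu_k | z, x ~ N(m_k, v_k)  (conjugate Gaussian update)
    for k in range(K):
        v_k = 1.0 / (1.0 / v0 + counts[k] / sigma2)
        m_k = v_k * (m0 / v0 + x[z == k].sum() / sigma2)
        mu[k] = rng.normal(m_k, np.sqrt(v_k))
    mu_trace.append(mu.copy())

mu_est = np.mean(mu_trace[50:], axis=0)   # average after discarding early rounds
print(np.sort(mu_est))
```

Extending the sketch to unknown covariances would add a fourth step that samples each $\Sigma_k$ from its inverse-Wishart conditional.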

How Often Do We Have To Sample?
[Figure: traces of mu1x, mu1y, mu2x and mu2y over 200 sample rounds.]
Here: after 50 sample rounds the values don't change any more.
In general, the mixing time $\tau_\epsilon$ is related to the eigen gap $\gamma = \lambda_1 - \lambda_2$ of the transition matrix:
$\tau_\epsilon \le O\left(\frac{1}{\gamma} \log \frac{n}{\epsilon}\right)$
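The eigen gap is easy to compute for a small transition matrix. The sketch below assumes that the gap is taken between the two largest eigenvalue magnitudes, and compares two made-up 2-state chains that share the stationary distribution (0.5, 0.5).

```python
import numpy as np

def eigen_gap(A):
    """gamma = lambda_1 - lambda_2, using the two largest eigenvalue magnitudes."""
    mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return mags[0] - mags[1]

# Hypothetical chains: "slow" rarely switches state, "fast" switches often.
slow = np.array([[0.95, 0.05],
                 [0.05, 0.95]])
fast = np.array([[0.60, 0.40],
                 [0.40, 0.60]])

print(eigen_gap(slow), eigen_gap(fast))
```

The chain that rarely leaves its current state has the smaller gap and, by the bound above, the larger mixing time.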

Gibbs Sampling is a Special Case of MH
The proposal distribution in Gibbs sampling is
$q(x' \mid x) = p(x'_i \mid x_{-i})\, I(x'_{-i} = x_{-i})$
This leads to an acceptance rate of:
$\alpha = \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)} = \frac{p(x'_i \mid x'_{-i})\, p(x'_{-i})\, p(x_i \mid x'_{-i})}{p(x_i \mid x_{-i})\, p(x_{-i})\, p(x'_i \mid x_{-i})} = 1$
Although the acceptance is 100%, Gibbs sampling does not converge faster, as it only updates one variable at a time.

Summary
Markov Chain Monte Carlo is a family of sampling algorithms that can sample from arbitrary distributions by moving in state space.
The most used methods are the Metropolis-Hastings (MH) and the Gibbs sampling method.
MH uses a proposal distribution and accepts a proposed state randomly.
Gibbs sampling does not use a proposal distribution, but samples from the full conditionals.

Computer Vision Group Prof. Daniel Cremers. 11. Evaluation and Model Selection

Evaluation of Learning Methods
Very often, machine learning tries to find parameters of a model for a given data set. But: Which parameters give a good model?
Intuitively, a good model behaves well on new, unseen data. This motivates the distinction of
- Training data: used to generate different models
- Validation data: used to adjust the model parameters so that their performance is optimal
- Test data: used to evaluate the models with the optimized parameters

Loss Function
A common way to evaluate a learning algorithm (e.g. regression, classification) is to define a loss function.
In the case of regression, a common choice is the squared loss.
In classification, where the prediction and the target t are natural numbers, we use the 0/1-loss.
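In one common notation, the two losses named above can be written as follows. The symbols are assumptions of this sketch: $y(x)$ denotes the prediction for the input $x$ and $t$ the target value.

```latex
\begin{align*}
L_{\text{sq}}\bigl(t, y(x)\bigr) &= \bigl(t - y(x)\bigr)^2
  && \text{(squared loss, regression)} \\
L_{0/1}\bigl(t, y(x)\bigr) &=
  \begin{cases}
    0 & \text{if } y(x) = t \\
    1 & \text{otherwise}
  \end{cases}
  && \text{(0/1-loss, classification)}
\end{align*}
```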

Some Loss Functions
[Figure: three panels plotting $|x|^{0.2}$, $|x|^{1.0}$ and $|x|^{2.0}$ for $x$ between -2 and 2.]
$L(y, a) = |y - a|^q$, where $a$ is the action, e.g. a labeling.
Quadratic loss is useful for continuous parameters. Its minimum is the posterior mean $E[y \mid x]$.
Absolute loss is less sensitive to outliers. Its minimum is the median of the posterior.
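Both statements about the minimizers can be checked numerically. The sketch below uses samples from an exponential distribution as a stand-in for posterior samples of $y$ (an arbitrary skewed choice, so that mean and median differ) and searches a grid of candidate actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed stand-in for posterior samples of y (illustrative choice).
y = rng.exponential(scale=1.0, size=2000)

actions = np.linspace(0.0, 5.0, 501)     # candidate actions a, grid spacing 0.01

def expected_loss(q):
    """Average of |y - a|^q over the samples, for every candidate action a."""
    return np.mean(np.abs(y[None, :] - actions[:, None]) ** q, axis=1)

a_quadratic = actions[np.argmin(expected_loss(2.0))]
a_absolute = actions[np.argmin(expected_loss(1.0))]

print(a_quadratic, y.mean())         # quadratic loss: minimizer close to the mean
print(a_absolute, np.median(y))      # absolute loss: minimizer close to the median
```

Because the distribution is skewed, the two minimizers differ noticeably, which makes the different sensitivity to large values of $y$ visible.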

Model Selection
Model selection is used to find model parameters to optimize the classification result.
Possible methods:
- Minimizing the training error
- Hold-out testing
- Cross-validation
- Leave-one-out rule
Evaluation is done using an appropriate loss function.

Minimizing the Training Error
The training error is defined as the loss evaluated over all input/target value pairs from the training data set.
Problem: A model that minimizes the training error does not (necessarily) generalize well. It only behaves well on the data it was trained with.
Example: Polynomial regression with high model complexity (see above)

Hold-out Testing
Hold-out testing splits the data set up into
- a validation data set
- a smaller training data set
The model parameters are then selected so that the error on the validation data set is minimized.
Problems:
- How big should the validation data set be?
- Which elements should be chosen for evaluation?
Also: Iteration of the model design may lead to overfitting on the validation data.

Hold-out Testing
[Figure: the data set split into a training set and a validation set.]

Cross Validation
Idea of cross validation: Perform hold-out testing several times on different evaluation (sub-)sets.
The error function combines the errors obtained on the individual validation subsets.
Minimizing this error function gives good results, but requires huge computational efforts: the training must be done once for every validation subset.
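A minimal sketch of the procedure, assuming polynomial regression as the model family, the mean squared error as loss, and the average over folds as the combined error function. The synthetic data, the function name `cross_val_error` and the candidate degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative): noisy samples of a sine curve.
N = 40
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=N)

def cross_val_error(x, t, degree, n_folds=5):
    """Mean squared validation error of a polynomial fit, averaged over the folds."""
    folds = np.array_split(np.arange(len(x)), n_folds)   # disjoint validation subsets
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[val_idx] = False                      # hold out this subset
        coeffs = np.polyfit(x[train_mask], t[train_mask], degree)
        pred = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((pred - t[val_idx]) ** 2))
    return float(np.mean(errors))

cv_errors = {d: cross_val_error(x, t, d) for d in (1, 3, 6)}
best_degree = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_degree)
```

Setting `n_folds=len(x)` turns the same function into leave-one-out validation, at the cost of one training run per data point.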

Cross-Validation
[Figure sequence: the data set split into a training set and a validation set, shown over three slides.]

Leave-one-out Rule
Idea: do cross-validation with validation subsets that each contain a single data point.
- Yields similarly good results compared to cross-validation
- Still requires doing the training once for every data point
- Useful if the data set is particularly scarce

Leave-one-out Rule
[Figure sequence: the data set split into a training set and a validation set, shown over two slides.]

BIC and AIC
Observation: There is a trade-off between the obtained data likelihood (i.e. the quality of the fit) and the complexity M of the model.
Idea: Define a score function that balances both. Two common examples are:
- Akaike Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
But: These criteria tend to favor too simple models.
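For reference, one common convention for the two scores, written so that larger values are better. The notation is an assumption of this sketch: $\mathcal{D}$ is the data set with $N$ points, $\theta_{ML}$ the maximum-likelihood parameters, and $M$ the number of adjustable parameters.

```latex
\begin{align*}
\text{AIC} &= \ln p(\mathcal{D} \mid \theta_{ML}) - M \\
\text{BIC} &= \ln p(\mathcal{D} \mid \theta_{ML}) - \tfrac{1}{2}\, M \ln N
\end{align*}
```

Both scores reward the fit through the log-likelihood and penalize complexity through $M$; BIC penalizes complexity more strongly as soon as $\ln N > 2$.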

Classifier Evaluation
Different methods to evaluate a classifier:
- True-positive / false-positive rates
- Precision-recall diagram
- Receiver operating characteristic (ROC curves)
- Loss function
The evaluation of a classifier is needed to find the best classification parameters (e.g. for the kernel function).

Classifier Evaluation
Number of all data points: sum of the two row totals below.

                     | Classification result = -1 | Classification result = 1 | Row total
Correct label = -1   | True negative rate         | False positive rate       | Number of negative examples
Correct label = 1    | False negative rate        | True positive rate        | Number of positive examples

Classifier Evaluation Terminology
We define the precision intuitively as the probability that the real label is positive in case the classifier returns positive.
The recall is defined intuitively as the probability that a positive labeled data point is detected as positive by the classifier.
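In terms of counts, these two definitions correspond to the standard formulas precision = TP / (TP + FP) and recall = TP / (TP + FN). The labels and predictions in the sketch below are made-up values.

```python
import numpy as np

# Hypothetical ground-truth labels and classifier outputs in {-1, +1}.
labels      = np.array([ 1,  1,  1,  1, -1, -1, -1, -1, -1, -1])
predictions = np.array([ 1,  1,  1, -1,  1,  1, -1, -1, -1, -1])

tp = np.sum((predictions == 1) & (labels == 1))     # true positives
fp = np.sum((predictions == 1) & (labels == -1))    # false positives
fn = np.sum((predictions == -1) & (labels == 1))    # false negatives
tn = np.sum((predictions == -1) & (labels == -1))   # true negatives

precision = tp / (tp + fp)   # fraction of positive predictions that are correct
recall = tp / (tp + fn)      # fraction of positive examples that are detected

print(tp, fp, fn, tn)
print(precision, recall)
```

A classifier that predicts "positive" more liberally typically gains recall and loses precision, which is why the two are usually reported together.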

Precision-Recall Curves
Usually, precision and recall are plotted into the same graph.
The optimal classifier parameters are obtained where precision equals recall (break-even point).

Receiver Operating Characteristics
Receiver Operating Characteristic (ROC) curves plot the true positive rate vs. the false positive rate.
To find a good classifier we can search for a point where the slope is 1.
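An ROC curve can be traced by sweeping a decision threshold over the classifier scores. The scores and labels below are made-up values, and `roc_points` is a hypothetical helper; the area under the curve is computed with the trapezoidal rule as an additional summary.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over the scores."""
    # Start above the highest score (nothing classified positive), then lower the threshold.
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == -1)
    tpr = np.array([np.sum((scores >= th) & (labels == 1)) / n_pos for th in thresholds])
    fpr = np.array([np.sum((scores >= th) & (labels == -1)) / n_neg for th in thresholds])
    return fpr, tpr

# Hypothetical classifier scores (higher means "more positive") and true labels.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([  1,   1,  -1,   1,   -1,   1,  -1,  -1])

fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)   # trapezoidal rule
print(np.column_stack([fpr, tpr]))
print(auc)
```

Each row of the printed array is one operating point of the classifier; choosing a threshold means choosing one of these points.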

Loss Function
Another method to evaluate a classifier is defined by evaluating its loss function.
The simplest loss function is the 0/1-loss, which compares the output of the classifier for a feature vector with the known class label of that feature vector.
Important: The pair of feature vector and class label should be taken from an evaluation data set that is different from the training set.