ELEC633: Graphical Models


Scribe: Tahira Isa Saleem, 7 October 2008

References:
Casella and George, "Explaining the Gibbs Sampler" (1992)
Chib and Greenberg, "Understanding the Metropolis-Hastings Algorithm" (1995)
Green, "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination" (1995)
Geman and Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images" (1984)

1 Importance Sampling

In statistics, importance sampling is a general technique for estimating the properties of a particular distribution while only having samples generated from a distribution other than the distribution of interest. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.

Consider samples x^(i) generated from p(x), a probability measure which is difficult to sample from. The expectation of f under p can be written as

    I[f] = E_p[f(x)] = ∫ f(x) p(x) dx

and we easily obtain the Monte Carlo empirical estimate of E_p[f(x)] from N samples:

    Î_N[f] = ∫ f(x) dP_N(x) = (1/N) Σ_{i=1}^{N} f(x^(i))

The basic idea of importance sampling is to draw from a distribution other than p, say q, and modify the formula above so that it still yields a consistent estimate of E_p[f(x)]. A second reason for the procedure is the potential to reduce the variance of Î[f] by an appropriate choice of q; hence the name importance sampling, as samples from q can be more important for the estimation of the integral. Consider a function q(x) which approximates p(x) and has the same support. Now we have:

    I[f] = ∫ f(x) (p(x)/q(x)) q(x) dx = ∫ f(x) w(x) q(x) dx

where w(x) = p(x)/q(x) is known as the importance weight, and the distribution q is frequently referred to as the sampling or proposal distribution.
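As an illustration (not from the original notes), the estimator I[f] ≈ (1/N) Σ w(x^(i)) f(x^(i)) can be sketched in NumPy. The target p = N(0, 1), the proposal q = N(0, 4), and the test function f(x) = x² are arbitrary choices made for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(0, 1); proposal q = N(0, 2^2), wider but with the same support.
N = 100_000
x = rng.normal(0.0, 2.0, size=N)                        # draws from q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # w(x) = p(x)/q(x)

f = x ** 2                       # f(x) = x^2, so I[f] = Var_p[x] = 1
I_hat = np.mean(w * f)           # (1/N) sum_i w(x_i) f(x_i)
print(I_hat)                     # close to 1.0
```

With 10^5 samples the estimate lands near the true value 1; shrinking the proposal so it no longer covers the tails of p would blow up the variance of the weights.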

FORM 1 (unnormalized):

    Î[f] = (1/N) Σ_{i=1}^{N} f(x^(i)) p(x^(i))/q(x^(i))

FORM 2 (normalized):

    Î[f] = Σ_{i=1}^{N} w̃_i f(x^(i)),  where w̃_i = w_i / Σ_{j=1}^{N} w_j

Why should we bother using FORM 2? Let's work it out:

    Î[f] = Σ_i w̃_i f(x^(i))
         = Σ_i [ w_i / Σ_j w_j ] f(x^(i))
         = Σ_i [ (p^(i)/q^(i)) / Σ_j (p^(j)/q^(j)) ] f(x^(i))

But this is not equal to FORM 1! The motivation for FORM 2 is that the weights only need to be known up to a multiplicative constant, since any constant cancels in the ratio. This matters, for example, when the target is a posterior p(x|y) whose normalizing constant is intractable:

    Î[f] = (1/N) Σ_{i=1}^{N} f(x^(i)) p(x^(i)|y)/q(x^(i)|y),   I[f] = ∫ f(x) (p(x)/q(x)) q(x) dx

If you make an approximation by α-divergence with α > 0, the proposal will cover the PDF. (It is necessary that the assumptions of the Central Limit Theorem hold.)

Note: The effective number of samples is obtained by the following formula:

    N_eff = N / (1 + var(w_i))

In summary: importance sampling (IS) is a variance-reduction technique that can be used in the Monte Carlo method. The idea behind IS is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. If these important values are emphasized by sampling them more frequently, the estimator variance can be reduced. Hence, the basic methodology in IS is to choose a distribution which encourages the important values. Using such a biased distribution directly in the simulation would result in a biased estimator; however, the simulation outputs are weighted to correct for the use of the biased distribution, and this ensures that the new IS estimator is unbiased. The weight is given by the likelihood ratio, that is, the Radon-Nikodym derivative of the true underlying distribution with respect to the biased simulation distribution. The fundamental issue in implementing IS is the choice of the biased distribution which emphasizes the important regions of the input variables. Choosing or designing a good biased distribution is the art of IS. The reward for a good distribution can be huge run-time savings; the penalty for a bad one can be longer run times than for a plain Monte Carlo simulation without importance sampling.
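A minimal sketch of FORM 2 in NumPy, showing that the unknown normalizing constant cancels; the unnormalized target (proportional to a standard normal density, with the 1/√(2π) factor deliberately omitted) and the proposal N(0, 4) are assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(1)

def unnorm_p(x):
    """Target known only up to a constant: proportional to exp(-x^2 / 2)."""
    return np.exp(-0.5 * x ** 2)

def q_pdf(x):
    """Proposal q = N(0, 2^2), evaluated exactly."""
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, size=N)
w = unnorm_p(x) / q_pdf(x)           # unnormalized weights w_i
w_tilde = w / w.sum()                # FORM 2: the missing constant cancels here

I_hat = np.sum(w_tilde * x ** 2)     # estimates E_p[x^2] = 1 for p = N(0, 1)

# Effective sample size, with weights rescaled to mean 1 before taking var.
n_eff = N / (1.0 + np.var(w / w.mean()))
```

FORM 2 is biased for finite N (it is a ratio estimator) but consistent, and it is the only option when p is available only up to its normalizing constant.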
2 Review of Markov Chains A Markov chain, named after Andrey Markov, is a stochastic process with the Markov property. Having the Markov property means that, given the present state, future states are independent of the past states.

Consider a k-state Markov chain where π_j(0) = p(x_0 = s_j) and π_j(t) = p(x_t = s_j). By d-separation, we know x_t ⊥ x_1, x_2, ..., x_{t-2} | x_{t-1}. We have the transition matrix

    P_{ij} = P(i|j) = P(x_t = s_i | x_{t-1} = s_j)

and the following:

    Π(t) = P Π(t-1)
    Π(t) = P^t Π(0)

The Perron-Frobenius theorem applies to positive stochastic matrices and asserts that the eigenvalue λ = 1 is simple and that every other eigenvalue λ of such a matrix A satisfies |λ| < 1. Also, in this case there exists a vector with positive entries, summing to 1, which is a positive eigenvector associated with the eigenvalue λ = 1. Both properties can then be used in combination to show that the limit A^∞ := lim_{k→∞} A^k exists and is a positive stochastic matrix of rank one. In other words, we have the following claim from the Perron-Frobenius theorem about the eigenvalues of P:

    1 = λ_1 > |λ_2| ≥ ... ≥ |λ_k|

Regardless of our choice of Π(0), if the chain is irreducible and aperiodic, Π(t) converges to a stationary distribution. For more information see the following references:

J. L. Doob. Stochastic Processes. New York: John Wiley and Sons, 1953. ISBN 0-471-52369-0.
S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. London: Springer-Verlag, 1993. ISBN 0-387-19832-6. Online: http://decision.csl.uiuc.edu/~meyn/pages/book.html. Second edition to appear, Cambridge University Press, 2008.

3 Relation between Singular Value Decomposition and Eigenvalue Decomposition

Statement of the SVD theorem: Suppose M is an m-by-n matrix whose entries come from the field K, which is either the field of real numbers or the field of complex numbers. Then there exists a factorization of the form

    M = U Σ V*

where U is an m-by-m unitary matrix over K, the matrix Σ is an m-by-n diagonal matrix with nonnegative real numbers on the diagonal, and V* denotes the conjugate transpose of V, an n-by-n unitary matrix over K. Such a factorization is called a singular value decomposition of M.
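The convergence of Π(t) = P^t Π(0) to the stationary distribution can be checked numerically. A minimal NumPy sketch, using the 3-state column-stochastic matrix from the example in Section 4:

```python
import numpy as np

# Column j holds P(x_t = s_i | x_{t-1} = s_j), so each column sums to 1.
P = np.array([[0.2, 0.1, 0.3],
              [0.4, 0.7, 0.3],
              [0.4, 0.2, 0.4]])

pi_t = np.array([1.0, 0.0, 0.0])   # Pi(0): start deterministically in state 1

# Pi(t) = P^t Pi(0): iterate the chain forward.
for _ in range(50):
    pi_t = P @ pi_t

# Stationary distribution: the eigenvector of P for the eigenvalue lambda = 1,
# rescaled so its entries sum to 1.
vals, vecs = np.linalg.eig(P)
k = np.argmin(np.abs(vals - 1.0))
pi_stat = np.real(vecs[:, k])
pi_stat = pi_stat / pi_stat.sum()

print(np.allclose(pi_t, pi_stat))   # True: Pi(t) has converged
```

Since |λ_2| ≈ 0.36 here, the non-stationary components shrink by roughly a factor of 3 per step, so 50 iterations are far more than enough for convergence to machine precision.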
- The matrix V thus contains a set of orthonormal "input" or "analysing" basis vector directions for M.
- The matrix U contains a set of orthonormal "output" basis vector directions for M.
- The matrix Σ contains the singular values, which can be thought of as scalar gain controls by which each corresponding input is multiplied to give a corresponding output.

A common convention is to order the values Σ_{i,i} in non-increasing fashion. In this case, the diagonal matrix Σ is uniquely determined by M (though the matrices U and V are not).
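These conventions can be verified numerically; NumPy's `svd` returns the singular values already in non-increasing order. A minimal sketch on an arbitrary 3-by-2 matrix chosen for this example:

```python
import numpy as np

M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(M)        # s holds the singular values, descending
assert np.all(s[:-1] >= s[1:])     # non-increasing ordering convention

# Rebuild M = U Sigma V*: Sigma must be m-by-n diagonal, here 3-by-2.
Sigma = np.zeros(M.shape)
np.fill_diagonal(Sigma, s)
print(np.allclose(U @ Sigma @ Vt, M))   # True: the factorization reproduces M
```

Note that `svd` returns V*, not V, and that Σ must be padded to the full m-by-n shape before the product is formed.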

The singular value decomposition is very general in the sense that it can be applied to any m-by-n matrix. The eigenvalue decomposition, on the other hand, can only be applied to certain classes of square matrices. Nevertheless, the two decompositions are related. Given an SVD of M, as described above, the following two relations hold:

    M* M = V Σ* U* U Σ V* = V (Σ* Σ) V*
    M M* = U Σ V* V Σ* U* = U (Σ Σ*) U*

The right-hand sides of these relations describe the eigenvalue decompositions of the left-hand sides. Consequently, the squares of the non-zero singular values of M are equal to the non-zero eigenvalues of either M*M or MM*. Furthermore, the columns of U (left singular vectors) are eigenvectors of MM*, and the columns of V (right singular vectors) are eigenvectors of M*M.

In the special case that M is a normal matrix, which by definition must be square, the spectral theorem says that it can be unitarily diagonalized using a basis of eigenvectors, so that it can be written M = U D U* for a unitary matrix U and a diagonal matrix D. When M is Hermitian positive semi-definite, the decomposition M = U D U* is also a singular value decomposition. However, the eigenvalue decomposition and the singular value decomposition differ for all other matrices M: the eigenvalue decomposition is M = U D U^{-1}, where U is not necessarily unitary and D is not necessarily positive semi-definite, while the SVD is M = U Σ V*, where Σ is diagonal and positive semi-definite, and U and V are unitary matrices that are not necessarily related except through the matrix M.

4 Example

Irreducibility and aperiodicity: P(x_t = s_j | x_0 = s_i) = [P^t]_{ji} > 0.

    P = | 0.2 0.1 0.3 |        P^2 = | 0.20 0.15 0.21 |
        | 0.4 0.7 0.3 |              | 0.48 0.59 0.45 |
        | 0.4 0.2 0.4 |              | 0.32 0.26 0.34 |

Perron-Frobenius theorem, claim: the largest eigenvalue is 1, i.e. λ_1 = 1 > |λ_2| ≥ |λ_3|. Writing the eigendecomposition with eigenvectors µ_1, µ_2, µ_3:

    P = [µ_1 µ_2 µ_3] diag(1, λ_2, λ_3) [µ_1 µ_2 µ_3]^{-1}

    P^n = [µ_1 µ_2 µ_3] diag(1, λ_2^n, λ_3^n) [µ_1 µ_2 µ_3]^{-1}

Since |λ_2|, |λ_3| < 1, the terms λ_2^n and λ_3^n vanish as n → ∞, leaving only the component along µ_1:

    lim_{n→∞} P^n = [µ_1 µ_2 µ_3] diag(1, 0, 0) [µ_1 µ_2 µ_3]^{-1}

The limit P^∞ is therefore a rank-one matrix built from µ_1, and the stationary distribution satisfies P Π = Π. If we calculate the eigenvalues of our matrix P, we find λ_1 = 1.0000, λ_2 = 0.3562, λ_3 = -0.0562, which adheres to the Perron-Frobenius theorem.
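The eigenvalue claim is easy to check in NumPy:

```python
import numpy as np

P = np.array([[0.2, 0.1, 0.3],
              [0.4, 0.7, 0.3],
              [0.4, 0.2, 0.4]])

vals = np.linalg.eigvals(P)
vals = vals[np.argsort(-np.abs(vals))]   # sort by modulus, largest first
print(np.round(np.real(vals), 4))        # approximately [1.0, 0.3562, -0.0562]
```

The leading eigenvalue is exactly 1 (P is stochastic) and the other two lie strictly inside the unit circle, as Perron-Frobenius requires; note that λ_3 is negative, and only its modulus enters the ordering |λ_2| ≥ |λ_3|.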