ELEC633: Graphical Models
Scribe notes from 7 October 2008, by Tahira Isa Saleem

References:
- Casella and George, "Explaining the Gibbs Sampler" (1992)
- Chib and Greenberg, "Understanding the Metropolis-Hastings Algorithm" (1995)
- Green, "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination" (1995)
- Geman and Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images" (1984)

1 Importance Sampling

In statistics, importance sampling is a general technique for estimating properties of a particular distribution while only having samples generated from a different distribution rather than the distribution of interest. Depending on the application, the term may refer to the process of sampling from this alternative distribution, to the process of inference, or to both.

Consider samples x^(1), ..., x^(N) generated from p(x), a probability measure that is difficult to sample from. The expectation of f under p is

  I[f] = E_p[f(x)] = ∫ f(x) p(x) dx,

and its Monte Carlo empirical estimate is

  Î_N[f] = ∫ f(x) dp̂_N(x) = (1/N) Σ_{i=1}^N f(x^(i)).

The basic idea of importance sampling is to draw from a distribution other than p, say q, and modify the estimator so that it still gives a consistent estimate of E_p[f(x)]. A second reason for the procedure is the potential to reduce the variance of Î[f] by an appropriate choice of q; hence the name importance sampling, as samples from q can be more important for the estimation of the integral. Consider a function q(x) that approximates p(x) and has the same support. Now we have

  I[f] = ∫ f(x) (p(x)/q(x)) q(x) dx = ∫ f(x) w(x) q(x) dx,

where w(x) = p(x)/q(x) is known as the importance weight, and the distribution q is frequently referred to as the sampling or proposal distribution.
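As a quick sanity check, the plain Monte Carlo estimate and the importance-weighted estimate can be sketched in a few lines of Python. The Gaussian target p = N(0, 1), proposal q = N(0, 2^2), and test function f(x) = x^2 below are illustrative choices, not from the lecture:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
N = 100_000
f = lambda x: x ** 2              # E_p[x^2] = 1 for p = N(0, 1)

# Plain Monte Carlo: sample directly from p.
mc = sum(f(random.gauss(0, 1)) for _ in range(N)) / N

# Importance sampling: sample from the wider proposal q = N(0, 2^2)
# and reweight each sample by w(x) = p(x) / q(x).
xs = [random.gauss(0, 2) for _ in range(N)]
is_est = sum(f(x) * normal_pdf(x, 0, 1) / normal_pdf(x, 0, 2) for x in xs) / N

print(mc, is_est)                 # both should be close to the true value 1.0
```

Both estimators target the same integral; the importance-sampling version only requires pointwise evaluation of p, not the ability to sample from it.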
FORM 1: Î[f] = (1/N) Σ_{i=1}^N f(x^(i)) p(x^(i)) / q(x^(i))

FORM 2: Î[f] = Σ_{i=1}^N w̃_i f(x^(i)), where w̃_i = w(x^(i)) / Σ_{j=1}^N w(x^(j))

Why should we bother using FORM 2? Let's work it out:

  FORM 2: Î[f] = Σ_i w_i f(x^(i)) / Σ_j w_j
               = Σ_i [p(x^(i))/q(x^(i))] f(x^(i)) / Σ_j [p(x^(j))/q(x^(j))]

But this is not equal to FORM 1! Let's find the motivation for having the two forms. Suppose the target is a posterior p(x|y) that we can evaluate only up to its normalizing constant. FORM 1,

  Î[f] = (1/N) Σ_i f(x^(i)) p(x^(i)|y) / q(x^(i)|y),

requires the normalized density, whereas in FORM 2 the unknown constant cancels between numerator and denominator, so p need only be known up to proportionality. FORM 2 is still consistent because its normalizing sum, divided by N, converges to

  ∫ (p(x)/q(x)) q(x) dx = 1.

If you choose q by minimizing an α-divergence with α > 0, the proposal will cover the target PDF. (It is necessary that the assumptions of the Central Limit Theorem hold true.)

Note: the effective number of samples is obtained by the following formula:

  N_eff = N / (1 + var(w_i)).

In summary: importance sampling (IS) is a variance reduction technique that can be used in the Monte Carlo method. The idea behind IS is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. If these important values are emphasized by sampling more frequently, then the estimator variance can be reduced. Hence, the basic methodology in IS is to choose a distribution which encourages the important values. This use of biased distributions will result in a biased estimator if it is applied directly in the simulation. However, the simulation outputs are weighted to correct for the use of the biased distribution, and this ensures that the new IS estimator is unbiased. The weight is given by the likelihood ratio, that is, the Radon-Nikodym derivative of the true underlying distribution with respect to the biased simulation distribution.

The fundamental issue in implementing IS simulation is the choice of the biased distribution which encourages the important regions of the input variables. Choosing or designing a good biased distribution is the art of IS. The rewards for a good distribution can be huge run-time savings; the penalty for a bad distribution can be longer run times than for a general Monte Carlo simulation without importance sampling.
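A short sketch of FORM 2 in Python, assuming a target known only up to a constant (the unnormalized Gaussian here is an illustrative stand-in for an intractable posterior). The effective-sample-size formula from the note is applied to the weights rescaled to mean 1:

```python
import math
import random

random.seed(1)
N = 50_000

# Target known only up to a normalizing constant (here p = N(0, 1) unnormalized).
p_tilde = lambda x: math.exp(-x ** 2 / 2)
# Proposal q = N(0, 1.5^2), with its fully normalized density.
q_pdf = lambda x: math.exp(-x ** 2 / (2 * 1.5 ** 2)) / (1.5 * math.sqrt(2 * math.pi))

xs = [random.gauss(0, 1.5) for _ in range(N)]
w = [p_tilde(x) / q_pdf(x) for x in xs]        # unnormalized importance weights
total = sum(w)
w_norm = [wi / total for wi in w]              # the unknown constant cancels here

# FORM 2: self-normalized estimate of E_p[x^2] (true value 1 for p = N(0, 1)).
est = sum(wn * x ** 2 for wn, x in zip(w_norm, xs))

# Effective number of samples, N_eff = N / (1 + var(w_i)),
# with the weights rescaled so their empirical mean is 1.
w_bar = [wi * N / total for wi in w]
var_w = sum((wi - 1) ** 2 for wi in w_bar) / N
n_eff = N / (1 + var_w)

print(est, n_eff)
```

Note that FORM 1 could not even be evaluated here, since p_tilde lacks its normalizing constant.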
2 Review of Markov Chains

A Markov chain, named after Andrey Markov, is a stochastic process with the Markov property. Having the Markov property means that, given the present state, future states are independent of the past states.
Consider a k-state Markov chain with

  π_j(0) = p(x_0 = s_j),  π_j(t) = p(x_t = s_j).

By d-separation, we know x_t ⊥ {x_1, x_2, ..., x_{t-2}} | x_{t-1}. We have the transition matrix

  P_ij = P(x_t = s_i | x_{t-1} = s_j),

and the following:

  Π(t) = P Π(t-1)
  Π(t) = P^t Π(0)

The Perron-Frobenius theorem applies to positive stochastic matrices A and asserts that the eigenvalue λ = 1 is simple and that all other eigenvalues λ of A satisfy |λ| < 1. Also, in this case there exists a vector having positive entries, summing to 1, which is a positive eigenvector associated to the eigenvalue λ = 1. Both properties can then be used in combination to show that the limit A^∞ := lim_{k→∞} A^k exists and is a positive stochastic matrix of rank one. In other words, we have the following claim from the Perron-Frobenius theorem: P has eigenvalues λ_1, λ_2, ..., λ_k with

  1 = λ_1 > |λ_2| ≥ ... ≥ |λ_k|.

Regardless of our Π(0), if the chain is irreducible and aperiodic, Π(t) converges to a stationary distribution.

For more information see the following references:
- J. L. Doob. Stochastic Processes. New York: John Wiley and Sons, 1953. ISBN 0-471-52369-0.
- S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. London: Springer-Verlag, 1993. ISBN 0-387-19832-6. Online: http://decision.csl.uiuc.edu/~meyn/pages/book.html. Second edition to appear, Cambridge University Press, 2008.

3 Relation between Singular Value Decomposition and Eigenvalue Decomposition

Statement of the SVD theorem: Suppose M is an m-by-n matrix whose entries come from the field K, which is either the field of real numbers or the field of complex numbers. Then there exists a factorization of the form

  M = U Σ V*,

where U is an m-by-m unitary matrix over K, Σ is an m-by-n diagonal matrix with nonnegative numbers on the diagonal, and V* denotes the conjugate transpose of V, an n-by-n unitary matrix over K. Such a factorization is called a singular value decomposition of M.
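The iteration Π(t) = P Π(t-1) and its convergence can be checked numerically in pure Python. The 2-state chain below is an illustrative example; its columns sum to 1, matching the column-vector convention used here:

```python
# Column-stochastic transition matrix: P[i][j] = P(x_t = s_i | x_{t-1} = s_j)
P = [[0.9, 0.5],
     [0.1, 0.5]]

pi = [1.0, 0.0]                     # arbitrary starting distribution Pi(0)
for _ in range(200):                # repeatedly apply Pi(t) = P Pi(t-1)
    pi = [sum(P[i][j] * pi[j] for j in range(2)) for i in range(2)]

print(pi)   # converges to the stationary distribution (5/6, 1/6)
```

Starting from any other Π(0) gives the same limit, as the irreducible-aperiodic claim predicts; solving PΠ = Π by hand for this matrix indeed gives Π = (5/6, 1/6).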
- The matrix V thus contains a set of orthonormal input or "analysing" basis vector directions for M.
- The matrix U contains a set of orthonormal output basis vector directions for M.
- The matrix Σ contains the singular values, which can be thought of as scalar gain controls by which each corresponding input is multiplied to give a corresponding output.

A common convention is to order the values Σ_ii in non-increasing fashion. In this case, the diagonal matrix Σ is uniquely determined by M (though the matrices U and V are not).
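For a small real matrix, the singular values can be computed by hand from the eigenvalues of M^T M. A minimal pure-Python sketch (the 2-by-2 matrix is an illustrative choice):

```python
import math

M = [[1.0, 2.0],
     [0.0, 1.0]]                    # illustrative 2x2 real matrix, det(M) = 1

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def sym_eigvals(S):
    """Eigenvalues of a symmetric 2x2 matrix via the quadratic formula."""
    tr = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    d = math.sqrt(tr ** 2 / 4 - det)
    return (tr / 2 + d, tr / 2 - d)

MtM = matmul(transpose(M), M)       # M^T M has eigenvalues sigma_i^2
MMt = matmul(M, transpose(M))       # M M^T shares the same nonzero eigenvalues
sigma = [math.sqrt(l) for l in sym_eigvals(MtM)]   # singular values of M

print(sigma)
print(sym_eigvals(MtM), sym_eigvals(MMt))
```

Since det(M) = 1, the product of the two singular values comes out to 1, and the eigenvalues of M^T M and M M^T agree, which is exactly the relation between the SVD and the eigenvalue decomposition discussed next.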
The singular value decomposition is very general in the sense that it can be applied to any m × n matrix. The eigenvalue decomposition, on the other hand, can only be applied to certain classes of square matrices. Nevertheless, the two decompositions are related. Given an SVD of M, as described above, the following two relations hold:

  M* M = V Σ* U* U Σ V* = V (Σ* Σ) V*
  M M* = U Σ V* V Σ* U* = U (Σ Σ*) U*

The right-hand sides of these relations describe the eigenvalue decompositions of the left-hand sides. Consequently, the squares of the non-zero singular values of M are equal to the non-zero eigenvalues of either M*M or MM*. Furthermore, the columns of U (left singular vectors) are eigenvectors of MM*, and the columns of V (right singular vectors) are eigenvectors of M*M.

In the special case that M is a normal matrix, which by definition must be square, the spectral theorem says that it can be unitarily diagonalized using a basis of eigenvectors, so that it can be written M = U D U* for a unitary matrix U and a diagonal matrix D. When M is Hermitian positive semi-definite, the decomposition M = U D U* is also a singular value decomposition. However, the eigenvalue decomposition and the singular value decomposition differ for all other matrices M: the eigenvalue decomposition is M = U D U^{-1}, where U is not necessarily unitary and D is not necessarily positive semi-definite, while the SVD is M = U Σ V*, where Σ is diagonal and positive semi-definite, and U and V are unitary matrices that are not necessarily related except through the matrix M.

4 Example

Irreducibility and aperiodicity: for every pair of states s_i, s_j there is some n with

  P(x_t = s_j | x_0 = s_i) = [P^t]_{ji} > 0 for all t ≥ n.

  P =
    [ 0.2  0.1  0.3 ]
    [ 0.4  0.7  0.3 ]
    [ 0.4  0.2  0.4 ]

  P^2 =
    [ 0.20  0.15  0.21 ]
    [ 0.48  0.59  0.45 ]
    [ 0.32  0.26  0.34 ]

Perron-Frobenius theorem, claim: the largest eigenvalue is 1, i.e.

  λ_1 = 1 > |λ_2| ≥ |λ_3|.

Writing the eigendecomposition with eigenvectors μ_1, μ_2, μ_3,

  P = [μ_1 μ_2 μ_3] diag(1, λ_2, λ_3) [μ_1 μ_2 μ_3]^{-1},

so that

  P^n = [μ_1 μ_2 μ_3] diag(1, λ_2^n, λ_3^n) [μ_1 μ_2 μ_3]^{-1}.

Since |λ_2|, |λ_3| < 1, the terms λ_2^n and λ_3^n vanish as n → ∞, and only the component along μ_1 survives.
In the limit, P^n → [μ_1 0 0][μ_1 μ_2 μ_3]^{-1}, a rank-one matrix each of whose columns is proportional to μ_1, and the stationary distribution satisfies PΠ = Π. If we calculate the eigenvalues of our matrix P, we find λ_1 = 1.0000, λ_2 = 0.3562, λ_3 = -0.0562, which adheres to the Perron-Frobenius theorem.
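The eigenvalues quoted above can be verified from the characteristic polynomial of P. Since λ = 1 is a root for any stochastic matrix, dividing it out leaves a quadratic for λ_2 and λ_3 (a quick pure-Python check):

```python
import math

# Column-stochastic P from the example
P = [[0.2, 0.1, 0.3],
     [0.4, 0.7, 0.3],
     [0.4, 0.2, 0.4]]

trace = P[0][0] + P[1][1] + P[2][2]
# Sum of principal 2x2 minors (coefficient of lambda in the characteristic polynomial)
minors = (P[1][1] * P[2][2] - P[1][2] * P[2][1]
        + P[0][0] * P[2][2] - P[0][2] * P[2][0]
        + P[0][0] * P[1][1] - P[0][1] * P[1][0])

# Characteristic polynomial: l^3 - trace*l^2 + minors*l - det(P).
# Dividing out the known root l = 1 (synthetic division) leaves
# l^2 + b*l + c with:
b = 1 - trace
c = minors + 1 - trace

disc = math.sqrt(b ** 2 - 4 * c)
lam2 = (-b + disc) / 2
lam3 = (-b - disc) / 2

print(lam2, lam3)   # approx 0.3562 and -0.0562
```

Both remaining eigenvalues have magnitude strictly below 1, so λ_2^n and λ_3^n indeed vanish and P^n converges to the rank-one limit described above.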