
1 More examples

1.1 Exponential families under conditioning

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η_1, η_2) ∈ R^k × R^{p−k} so that

    dP_η/dm_0 = exp(η_1^T t_1(x) + η_2^T t_2(x) − Λ(η_1, η_2)).

Then, we might ask about the family of conditional distributions

    P_η(t_1(X) ∈ A | t_2(X) = s_2).

If we suppose that t(X) = (t_1(X), t_2(X)) has a density f_{T_1,T_2} on R^p, then we see the conditional density of T_1 | T_2 = s_2 has the form

    f_{T_1,T_2}(t_1, s_2) / ∫_{R^k} f_{T_1,T_2}(s_1, s_2) ds_1
        = exp(η_1^T t_1 + η_2^T s_2) m̃_0(t_1, s_2) / ∫_{R^k} exp(η_1^T s_1 + η_2^T s_2) m̃_0(s_1, s_2) ds_1
        = exp(η_1^T t_1) m̃_0(t_1, s_2) / ∫_{R^k} exp(η_1^T s_1) m̃_0(s_1, s_2) ds_1,

where m̃_0 is the density of the push-forward of m_0 under t : Ω → R^p. This is a k-parameter exponential family:

1. The reference measure has density m̃_0(t_1, s_2) with respect to dt_1, Lebesgue measure on R^k.

2. The sufficient statistic is t_1.

3. The CGF is log ∫_{R^k} exp(η_1^T s_1) m̃_0(s_1, s_2) ds_1.

1.1.1 Example: the Poisson trick

Suppose we observe independent X_i ∼ Poisson(μ_i), 1 ≤ i ≤ k. This is a k-parameter exponential family of distributions on R^k:

    dP_{η(μ)}/dm_0 (x) = ∏_{i=1}^k exp(x_i log μ_i − μ_i) = exp(∑_{i=1}^k x_i η_i − ∑_{i=1}^k e^{η_i}),

with η_i = log μ_i. The reference measure has density ∏_{i=1}^k 1/x_i! with respect to counting measure on Z^k restricted to the non-negative orthant. From this family, we can form a new family of distributions on R^{k+1} by considering the push-forward under

    f(x_1, ..., x_k) = (x_1, ..., x_k, ∑_{i=1}^k x_i).
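As a quick numerical sanity check of the Poisson trick, we can compare the conditional p.m.f. of independent Poissons given their sum (computed directly from Poisson p.m.f.s) with the multinomial p.m.f. with cell probabilities μ_i / ∑_j μ_j. This is a minimal sketch in Python; the helper names are my own, not anything from the notes.

```python
from math import exp, factorial, prod

def poisson_pmf(x, mu):
    # Poisson p.m.f. with mean mu.
    return exp(-mu) * mu**x / factorial(x)

def conditional_pmf(xs, mus):
    # P(X_1 = x_1, ..., X_k = x_k | sum_i X_i = n), computed directly as the
    # joint Poisson p.m.f. divided by the Poisson(sum(mus)) p.m.f. of the sum.
    n = sum(xs)
    joint = prod(poisson_pmf(x, m) for x, m in zip(xs, mus))
    return joint / poisson_pmf(n, sum(mus))

def multinomial_pmf(xs, mus):
    # Multinomial p.m.f. with n = sum(xs) and cell probabilities mu_i / sum(mus).
    n = sum(xs)
    coef = factorial(n) / prod(factorial(x) for x in xs)
    total = sum(mus)
    return coef * prod((m / total)**x for x, m in zip(xs, mus))

xs = (2, 0, 3)
mus = (1.5, 0.7, 2.2)
# The two computations agree: conditioning Poissons on their sum is multinomial.
assert abs(conditional_pmf(xs, mus) - multinomial_pmf(xs, mus)) < 1e-12
```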

The push-forward of m_0 under f will be counting measure restricted to

    { (x_1, ..., x_{k+1}) : x_i ≥ 0, ∑_{i=1}^k x_i = x_{k+1} }

with the same density as above. From the general picture for conditioning, we see that (X_1, ..., X_k) | ∑_{i=1}^k X_i is a k-parameter exponential family with sufficient statistic (x_1, ..., x_k). The marginal distribution of ∑_{i=1}^k X_i is of course Poisson(∑_{i=1}^k μ_i) = Poisson(∑_{i=1}^k e^{η_i}). Hence, for (x_1, ..., x_k) ∈ A_{n,k} = { x : x_i ≥ 0, ∑_{i=1}^k x_i = n } we see

    P(X_1 = x_1, ..., X_k = x_k | ∑_{i=1}^k X_i = n) = (n choose x_1, ..., x_k) exp(∑_{i=1}^k η_i x_i) / (∑_{i=1}^k e^{η_i})^n.

This is just the multinomial p.m.f. with cell probabilities π_i = e^{η_i} / ∑_j e^{η_j}. No surprise here...

1.1.2 Example: the Dirichlet from independent Gammas

Suppose we observe independent X_i ∼ Gamma(1, α_i), 1 ≤ i ≤ k (i.e. scale 1 but shape parameter α_i). This is an exponential family of distributions on R^k:

    dP_η/dx = ∏_{i=1}^k exp(α_i log x_i − log Γ(α_i)) e^{−x_i} (1/x_i) 1_{[0,∞)}(x_i).

From this family, we can form a family of distributions on R^{k+1} by considering

    f(x_1, ..., x_k) = (x_1, ..., x_k, ∑_{i=1}^k x_i).

This is again an exponential family (see exercise). The marginal density of the sum of independent Gamma random variables of fixed scale is another Gamma with scale 1 and shape parameter ∑_{i=1}^k α_i. This leads to the conclusion that

    (X_1, ..., X_k) | ∑_{i=1}^k X_i = 1 ∼ Dirichlet(α).

1.1.3 Exercise: behaviour under affine transformation

Suppose we observe independent X_i ∼ Gamma(1, α_i), 1 ≤ i ≤ k (i.e. scale 1 but shape parameter α_i).

1. Set f(x_1, ..., x_k) = (x_1, ..., x_k, ∑_{i=1}^k x_i). Show that the push-forward of the original exponential family of distributions on R^k is an exponential family of distributions on R^{k+1}. What is the sufficient statistic?

2. What is the dimension of the natural parameter space, i.e. how many parameters are there?

3. What is the reference measure?

4. Suppose g(x) = Ax + b. Give a sufficient condition on (A, b) so that the push-forward of an exponential family is still an exponential family. Give an example of (A, b) for which the push-forward fails to be an exponential family.

1.1.4 Exercise: conditioning in the general case

1. Give a general formula for

    P_η(t_1(X) ∈ A | t_2(X)).

Show that it is an exponential family.

2. What is its sufficient statistic?

3. What is its reference measure? Be formal about it: what is the sample space? What is the measure?

1.2 Ising Models

The Ising model is an extensively studied object in statistical physics. In statistical settings, it has applications to image analysis. For us, it is an example. The sample space of the Ising model is based on a graph G = (V, E) specified by a set of vertices V and a set of undirected edges E. The sample space is

    Ω_V = {−1, 1}^V ⊂ Z^V,

with reference measure counting measure restricted to Ω_V. The density with respect to this reference is proportional to

    exp(∑_{i∈V} Q^1_i x_i + ∑_{(i,j)∈E} Q^2_{ij} x_i x_j).

The natural parameters are therefore (Q^1, Q^2) ∈ R^V × R^{V×V} and the sufficient statistics are (x, xx^T) ∈ Z^V × Z^{V×V}. We see that the CGF is

    Λ(Q^1, Q^2) = log ∑_{x ∈ {−1,1}^V} exp(∑_{i∈V} Q^1_i x_i + ∑_{(i,j)∈E} Q^2_{ij} x_i x_j).

In general, this is quite a complicated function, so we see that not all exponential families have tractable CGFs (not that we didn't know this already). While we won't touch on it too much right now, what makes things possible for this model, and other models with intractable CGFs, is that it is often possible to simulate from the distribution P_{Q^1,Q^2}, which makes it possible to compute an unbiased estimate of ∇Λ(Q^1, Q^2). This is the basis of stochastic optimization.

1. We are given a procedure that takes arguments (Q^1, Q^2) and produces a random vector D(Q^1, Q^2) with

    E_{Q^1,Q^2} D(Q^1, Q^2) = ∇Λ(Q^1, Q^2).

2. A stochastic optimization procedure has the form

    (Q^1_{k+1}, Q^2_{k+1}) = (Q^1_k, Q^2_k) − α_k [D(Q^1_k, Q^2_k) − t(x)],

where the α_k satisfy some growth assumptions, usually roughly of the form

    α_k → 0 as k → ∞,   ∑_{j=1}^k α_j → ∞ as k → ∞,   ∑_{j=1}^∞ α_j^2 < ∞.

1.3 Markov random fields

The Ising model is an example of something called a Markov random field. The descriptor "Markov" relates to a certain type of Markov property, that is, a type of conditional independence. In the Ising model, suppose we consider the distribution of x_i for some fixed i ∈ V, conditional on x_{V∖i} = x_{−i}. As above, this is an exponential family, but what are its parameters?

First of all, note that the sample space depends on x_{−i}:

    Ω_{x_{−i}} = { y ∈ {−1, 1}^V : y_{−i} = x_{−i} },

which is sort of equivalent to {−1, 1}^{{i}} but not quite the same. There are two points in Ω_{x_{−i}}, and we can take the reference measure to be counting measure on these two points. With that, we see that the CGF is

    log ∑_{y ∈ Ω_{x_{−i}}} exp(∑_{j∈V} Q^1_j y_j + ∑_{(j,k)∈E} Q^2_{jk} y_j y_k).

As a function on Ω_{x_{−i}},

    ∑_{j∈V} Q^1_j y_j = Q^1_i y_i + C_1,
    ∑_{(j,k)∈E} Q^2_{jk} y_j y_k = ∑_{(i,k)∈E} Q^2_{ik} y_i y_k + ∑_{(j,i)∈E} Q^2_{ji} y_j y_i + C_2,

where C_1, C_2 are constant on Ω_{x_{−i}}. If we assume Q^2 is symmetric (which we might as well), then

    ∑_{(j,k)∈E} Q^2_{jk} y_j y_k = 2 y_i ∑_{(i,k)∈E} Q^2_{ik} x_k + C.

Finally, for the Ising model we see that the CGF of x_i under this measure is

    log( exp(Q^1_i + 2 ∑_{(i,k)∈E} Q^2_{ik} x_k) + exp(−Q^1_i − 2 ∑_{(i,k)∈E} Q^2_{ik} x_k) ) + C.
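To make the recursion concrete, here is a toy sketch in Python for a single ±1-valued coordinate, where Λ(η) = log(e^η + e^{−η}) so that ∇Λ(η) = tanh(η) and everything can be checked in closed form. The update drives tanh(η) toward the observed statistic t(x), i.e. toward the MLE moment equation. The simulation scheme, step sizes, and names are my own illustration, not from the notes.

```python
import math
import random

random.seed(0)

def sample_spin(eta):
    # Draw X in {-1, 1} with density proportional to e^{eta * x}
    # with respect to counting measure, i.e. P(X = 1) = 1 / (1 + e^{-2 eta}).
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * eta))
    return 1 if random.random() < p_plus else -1

# Observed sufficient statistic (the sample mean we want to match):
t_obs = 0.5

# Robbins-Monro iteration: eta <- eta - alpha_k * (D - t_obs), where
# D = t(X) for X sampled from the current model, so E[D] = tanh(eta) = grad Lambda.
eta = 0.0
for k in range(1, 50001):
    alpha = 2.0 / (k + 20)       # satisfies the growth conditions above
    D = sample_spin(eta)
    eta -= alpha * (D - t_obs)

# At the fixed point, tanh(eta) should be close to t_obs.
assert abs(math.tanh(eta) - t_obs) < 0.1
```

The step-size sequence α_k = 2/(k + 20) is one arbitrary choice satisfying the conditions α_k → 0, ∑ α_j = ∞, ∑ α_j² < ∞.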

Let's write this CGF as Λ(Q^1, Q^2 | x_{−i}). This is the CGF of a {−1, 1}-valued random variable with natural parameter

    η(x_{−i}, Q^1, Q^2) = Q^1_i + 2 ∑_{(i,k)∈E} Q^2_{ik} x_k

and counting measure on {−1, 1} as reference measure.

1.3.1 Exercise: natural parameter under conditioning

The notation η(x_{−i}, Q^1, Q^2) suggests that the natural parameter corresponding to the sufficient statistic x_i, when conditioning on x_{−i}, has changed. Does this conflict with what we saw earlier about conditioning?

What we can take away from this picture is that conditioning on x_{−i} yielded a new exponential family whose natural parameter is a function of the original natural parameters (Q^1, Q^2) and the value of x_{−i}. Also, the natural parameter depends on x_{−i} only through its values at neighbours of i in G. This is the afore-mentioned Markov property.

1.3.2 Definition of a Markov random field

A Markov random field is a generalization of the Ising model to more general sample spaces and more complicated interactions. We will not dwell on them too much here but, in their most natural form, they are exponential families of distributions on something like R^V for some set of vertices (though they could take values other than R). The sufficient statistics are specified by subsets A ⊂ V:

    f_A(x) = g_A(x_i, i ∈ A),

i.e. f_A is measurable with respect to σ(x_i, i ∈ A). The general form of the density is

    dP_η/dm_0 = exp(∑_{A⊂V} η_A f_A(x) − Λ(η)),

with η = (η_A)_{A⊂V}, where m_0 is some reference measure on R^V. This is a huge natural parameter space, so obviously some restrictions are made by constraining many of the η_A to be 0. Let's call the resulting collection a model M, i.e. a set of A's such that η_A is not constrained to be 0. By default, we assume M is monotone, i.e.

    B ∈ M, A ⊂ B ⟹ A ∈ M.

1.3.3 Exercise: Ising model as random field

Write the Ising model above as a Markov random field. Be as specific as possible.

1. What are the f_A's?

2. What are the η_A's?

3. What is the reference measure?
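The formula for the conditional natural parameter can be checked by brute force on a small graph: enumerate the two configurations directly and compare with e^η / (e^η + e^{−η}). A sketch in Python, using the matrix convention above (interaction summed over all ordered pairs, Q^2 symmetric with zero diagonal); the triangle graph and parameter values are my own choices.

```python
from itertools import product
from math import exp

# A small Ising model on the triangle graph V = {0, 1, 2}.
Q1 = [0.3, -0.2, 0.5]
Q2 = [[0.0, 0.4, 0.1],
      [0.4, 0.0, -0.7],
      [0.1, -0.7, 0.0]]   # symmetric, zero diagonal
V = range(3)

def energy(x):
    # sum_i Q1_i x_i + sum_{i,j} Q2_ij x_i x_j  (sum over ordered pairs)
    s = sum(Q1[i] * x[i] for i in V)
    s += sum(Q2[i][j] * x[i] * x[j] for i in V for j in V)
    return s

def conditional_brute(i, x):
    # P(x_i = +1 | x_{-i}) by enumerating the two points of Omega_{x_{-i}}.
    xp, xm = list(x), list(x)
    xp[i], xm[i] = 1, -1
    wp, wm = exp(energy(xp)), exp(energy(xm))
    return wp / (wp + wm)

def conditional_formula(i, x):
    # Same probability via the conditional natural parameter
    #   eta(x_{-i}, Q1, Q2) = Q1_i + 2 * sum_k Q2_ik x_k,
    # for which P(x_i = +1 | x_{-i}) = e^eta / (e^eta + e^{-eta}).
    eta = Q1[i] + 2 * sum(Q2[i][k] * x[k] for k in V if k != i)
    return exp(eta) / (exp(eta) + exp(-eta))

# The two computations agree for every configuration and every site.
for x in product([-1, 1], repeat=3):
    for i in V:
        assert abs(conditional_brute(i, x) - conditional_formula(i, x)) < 1e-12
```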

The Markov property of these random fields can be expressed in terms of the Markov neighbourhood of i ∈ V. We define this (assuming monotonicity of M) as

    N(i, M) = { j : {i, j} ∈ M }.

Then, the Markov property can be stated as

    η(x_{−i}, η) ∈ σ(x_j, j ∈ N(i, M)).

That is, conditioning on x_{−i} yields an exponential family whose random natural parameter depends only on the neighbours of i in the model M and the true underlying η. This property has some very useful consequences, particularly when the resulting exponential family has a simple form.

1.3.4 Pseudo-likelihood

In the Ising model, the full CGF is very complicated if V is of any reasonable size. But the conditional CGFs of x_i | x_{−i} are particularly simple. This is the basis of the pseudo-log-likelihood for (Q^1, Q^2) in the Ising model:

    ℓ_pseudo(Q^1, Q^2) = ∑_{i∈V} [ η(x_{−i}, Q^1, Q^2) x_i − Λ(Q^1, Q^2 | x_{−i}) ].

1.3.5 Exercise: Ising model pseudolikelihood

1. Write out the pseudolikelihood for the Ising model as explicitly as possible.

2. Is it convex in (Q^1, Q^2)?

3. Describe a Newton-Raphson algorithm to estimate (Q^1, Q^2) based on maximizing the pseudolikelihood. Be as specific as possible, i.e. compute gradients and Hessians as explicitly as possible.

1.3.6 Gibbs sampler

The simple form of the conditional distributions is also the basis of one of the most natural MCMC algorithms, the Gibbs sampler. The Gibbs sampler for the Ising model, say, is a Markov chain on {−1, 1}^V whose stationary distribution is P_{Q^1,Q^2}. The algorithm continuously cycles through the coordinates of V in some (possibly random) order, and each step consists of drawing from P_{Q^1,Q^2}(X_i | X_{−i}). That is, if at the k-th step we are updating coordinate i, we set

    X^{(k+1)}_i ∼ P_{Q^1,Q^2}(X_i | X_{−i} = X^{(k)}_{−i}),   X^{(k+1)}_{−i} = X^{(k)}_{−i}.
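A minimal sketch of this systematic-scan Gibbs sampler in Python, using the closed form of P(X_i | X_{−i}) for the Ising model. On a graph this small the stationary distribution can be computed exactly by enumeration, so we can check that the empirical frequency of a state matches its exact probability. The 3-cycle graph, parameter values, sweep count, and names are my own illustrative choices.

```python
import math
import random
from itertools import product

random.seed(1)

# Ising model on a 3-cycle; matrix convention (interaction over ordered pairs).
Q1 = [0.2, -0.1, 0.0]
Q2 = [[0.0, 0.3, 0.1],
      [0.3, 0.0, 0.2],
      [0.1, 0.2, 0.0]]
V = range(3)

def gibbs_step(x, i):
    # Draw x_i from P(X_i | X_{-i}): a {-1, 1}-valued variable with natural
    # parameter eta = Q1_i + 2 * sum_k Q2_ik x_k.
    eta = Q1[i] + 2 * sum(Q2[i][k] * x[k] for k in V if k != i)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * eta))
    x[i] = 1 if random.random() < p_plus else -1

def gibbs_chain(n_sweeps):
    # Systematic scan: cycle through the coordinates in order, recording the
    # state once per full sweep.
    x = [random.choice([-1, 1]) for _ in V]
    samples = []
    for _ in range(n_sweeps):
        for i in V:
            gibbs_step(x, i)
        samples.append(tuple(x))
    return samples

samples = gibbs_chain(50000)

# Exact probability of the all-(+1) state by enumerating all 8 configurations.
def weight(x):
    s = sum(Q1[i] * x[i] for i in V)
    s += sum(Q2[i][j] * x[i] * x[j] for i in V for j in V)
    return math.exp(s)

Z = sum(weight(x) for x in product([-1, 1], repeat=3))
exact = weight((1, 1, 1)) / Z
empirical = sum(s == (1, 1, 1) for s in samples) / len(samples)
assert abs(empirical - exact) < 0.03
```

The agreement here reflects the fact that each coordinate update leaves P_{Q^1,Q^2} invariant, so the scan does too.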

1.3.7 Exercise: sampling from an Ising model

Consider an Ising model on L, the 100 × 100 lattice in Z^2, with Q^1 = α·1 and Q^2 = β·11^T.

1. For β = 0, α = 1: initialize the Gibbs sampler at some random initial condition. Run the Gibbs sampler Markov chain on {−1, 1}^L for some time. What do you expect the binary image to look like?

2. Repeat for β > 0 and β < 0.

Note: I am not asking for an exhaustive simulation; the goal is just to get the basic mechanics of a Gibbs sampler.
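For the basic mechanics, here is a sketch on a smaller lattice than the exercise asks for (20 × 20 rather than 100 × 100, so it runs quickly); the lattice size, sweep count, β values, and names are my own choices, and this is not meant as a full solution. For β = 0 the sites are independent, so the image should look like biased noise; for β > 0 neighbouring spins tend to agree, producing large same-colour patches; for β < 0 they tend to disagree, producing a checkerboard-like texture.

```python
import math
import random

random.seed(2)

L = 20  # lattice side; the exercise uses 100, smaller here for speed

def neighbours(i, j):
    # 4-nearest neighbours on the L x L lattice (free boundary conditions).
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < L and 0 <= nj < L:
            yield ni, nj

def run_gibbs(alpha, beta, n_sweeps=50):
    # Systematic-scan Gibbs sampler from a random initial configuration.
    # Each site update draws x_ij from its conditional: a {-1, 1} variable
    # with natural parameter alpha + 2 * beta * (sum of neighbouring spins).
    x = [[random.choice([-1, 1]) for _ in range(L)] for _ in range(L)]
    for _ in range(n_sweeps):
        for i in range(L):
            for j in range(L):
                eta = alpha + 2.0 * beta * sum(x[a][b] for a, b in neighbours(i, j))
                p_plus = 1.0 / (1.0 + math.exp(-2.0 * eta))
                x[i][j] = 1 if random.random() < p_plus else -1
    return x

def mean_spin(x):
    return sum(map(sum, x)) / L**2

def alignment(x):
    # Average of x_ij * x_kl over neighbouring pairs: near 0 for beta = 0
    # (with alpha = 0), positive for beta > 0, negative for beta < 0.
    num = den = 0
    for i in range(L):
        for j in range(L):
            for a, b in neighbours(i, j):
                num += x[i][j] * x[a][b]
                den += 1
    return num / den

# beta = 0, alpha = 1: independent sites, mostly +1 (mean spin near tanh(1)).
x1 = run_gibbs(alpha=1.0, beta=0.0)
assert mean_spin(x1) > 0.5

# beta > 0: neighbours agree much more often than in the independent case.
x2 = run_gibbs(alpha=0.0, beta=0.4)
x0 = run_gibbs(alpha=0.0, beta=0.0)
assert alignment(x2) > alignment(x0) + 0.3
```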