
1 More examples

1.1 Exponential families under conditioning

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η_1, η_2) ∈ R^k × R^{p−k} so that

    dP_η/dm_0 = exp(η_1^T t_1(x) + η_2^T t_2(x) − Λ(η_1, η_2)).

Then, we might ask about the family of conditional distributions

    P_η(t_1(X) ∈ A | t_2(X) = s_2).

If we suppose that t(X) = (t_1(X), t_2(X)) has a density f_{T_1,T_2} on R^p, then we see the conditional density of T_1 | T_2 = s_2 has the form

    f_{T_1,T_2}(t_1, s_2) / ∫_{R^k} f_{T_1,T_2}(s_1, s_2) ds_1
        = exp(η_1^T t_1 + η_2^T s_2) m̃_0(t_1, s_2) / ∫_{R^k} exp(η_1^T s_1 + η_2^T s_2) m̃_0(s_1, s_2) ds_1
        = exp(η_1^T t_1) m̃_0(t_1, s_2) / ∫_{R^k} exp(η_1^T s_1) m̃_0(s_1, s_2) ds_1,

where m̃_0 is the density of the push-forward of m_0 under t : Ω → R^p. This is a k-parameter exponential family:

1. The reference measure has density m̃_0(t_1, s_2) with respect to dt_1, Lebesgue measure on R^k.

2. The sufficient statistic is t_1.

3. The CGF is log ∫_{R^k} exp(η_1^T s_1) m̃_0(s_1, s_2) ds_1.

1.1.1 Example: the Poisson trick

Suppose we observe independent X_i ∼ Poisson(μ_i), 1 ≤ i ≤ k. This is a k-parameter exponential family of distributions on R^k:

    dP_{η(μ)}/dm_0 (x) = ∏_{i=1}^k exp(x_i log μ_i − μ_i) = exp(∑_{i=1}^k x_i η_i − ∑_{i=1}^k e^{η_i}),

with η_i = log μ_i. The reference measure has density ∏_{i=1}^k 1/x_i! with respect to counting measure on Z^k restricted to the non-negative orthant. From this family, we can form a new family of distributions on R^{k+1} by considering the push-forward under

    f(x_1, ..., x_k) = (x_1, ..., x_k, ∑_{i=1}^k x_i).
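As a quick numerical sanity check of the Poisson trick, we can compare the conditional p.m.f. of independent Poissons given their sum (computed directly from Poisson p.m.f.s) with the multinomial p.m.f. with cell probabilities μ_i / ∑_j μ_j. This is a minimal sketch in Python; the helper names are my own, not anything from the notes.

```python
from math import exp, factorial, prod

def poisson_pmf(x, mu):
    # Poisson p.m.f. with mean mu.
    return exp(-mu) * mu**x / factorial(x)

def conditional_pmf(xs, mus):
    # P(X_1 = x_1, ..., X_k = x_k | sum_i X_i = n), computed directly as the
    # joint Poisson p.m.f. divided by the Poisson(sum(mus)) p.m.f. of the sum.
    n = sum(xs)
    joint = prod(poisson_pmf(x, m) for x, m in zip(xs, mus))
    return joint / poisson_pmf(n, sum(mus))

def multinomial_pmf(xs, mus):
    # Multinomial p.m.f. with n = sum(xs) and cell probabilities mu_i / sum(mus).
    n = sum(xs)
    coef = factorial(n) / prod(factorial(x) for x in xs)
    total = sum(mus)
    return coef * prod((m / total)**x for x, m in zip(xs, mus))

xs = (2, 0, 3)
mus = (1.5, 0.7, 2.2)
# The two computations agree: conditioning Poissons on their sum is multinomial.
assert abs(conditional_pmf(xs, mus) - multinomial_pmf(xs, mus)) < 1e-12
```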

The push-forward of m_0 under f will be counting measure restricted to

    { (x_1, ..., x_{k+1}) : x_i ≥ 0, ∑_{i=1}^k x_i = x_{k+1} }

with the same density as above. From the general picture for conditioning, we see that (X_1, ..., X_k) | ∑_{i=1}^k X_i is a k-parameter exponential family with sufficient statistic (x_1, ..., x_k). The marginal distribution of ∑_{i=1}^k X_i is of course Poisson(∑_{i=1}^k μ_i) = Poisson(∑_{i=1}^k e^{η_i}). Hence, for (x_1, ..., x_k) ∈ A_{n,k} = { x : x_i ≥ 0, ∑_{i=1}^k x_i = n } we see

    P(X_1 = x_1, ..., X_k = x_k | ∑_{i=1}^k X_i = n) = (n choose x_1, ..., x_k) exp(∑_{i=1}^k η_i x_i) / (∑_{i=1}^k e^{η_i})^n.

This is just the multinomial p.m.f. with cell probabilities π_i = e^{η_i} / ∑_j e^{η_j}. No surprise here...

1.1.2 Example: the Dirichlet from independent Gammas

Suppose we observe independent X_i ∼ Gamma(1, α_i), 1 ≤ i ≤ k (i.e. scale 1 but shape parameter α_i). This is an exponential family of distributions on R^k:

    dP_η/dx = ∏_{i=1}^k exp(α_i log x_i − log Γ(α_i)) e^{−x_i} (1/x_i) 1_{[0,∞)}(x_i).

From this family, we can form a family of distributions on R^{k+1} by considering

    f(x_1, ..., x_k) = (x_1, ..., x_k, ∑_{i=1}^k x_i).

This is again an exponential family (see exercise). The marginal density of the sum of independent Gamma random variables of fixed scale is another Gamma with scale 1 and shape parameter ∑_{i=1}^k α_i. This leads to the conclusion that

    (X_1, ..., X_k) | ∑_{i=1}^k X_i = 1 ∼ Dirichlet(α).

1.1.3 Exercise: behaviour under affine transformation

Suppose we observe independent X_i ∼ Gamma(1, α_i), 1 ≤ i ≤ k (i.e. scale 1 but shape parameter α_i).

1. Set f(x_1, ..., x_k) = (x_1, ..., x_k, ∑_{i=1}^k x_i). Show that the push-forward of the original exponential family of distributions on R^k is an exponential family of distributions on R^{k+1}. What is the sufficient statistic?

2. What is the dimension of the natural parameter space, i.e. how many parameters are there?

3. What is the reference measure?

4. Suppose g(x) = Ax + b. Give a sufficient condition on (A, b) so that the push-forward of an exponential family is still an exponential family. Give an example of (A, b) for which the push-forward fails to be an exponential family.

1.1.4 Exercise: conditioning in the general case

1. Give a general formula for

    P_η(t_1(X) ∈ A | t_2(X)).

Show that it is an exponential family.

2. What is its sufficient statistic?

3. What is its reference measure? Be formal about it: what is the sample space? What is the measure?

1.2 Ising Models

The Ising model is an extensively studied object in statistical physics. In statistical settings, it has applications to image analysis. For us, it is an example. The sample space of the Ising model is based on a graph G = (V, E) specified by a set of vertices V and a set of undirected edges E. The sample space is

    Ω_V = {−1, 1}^V ⊂ Z^V,

with reference measure counting measure restricted to Ω_V. The density with respect to this reference is proportional to

    exp(∑_{i∈V} Q^1_i x_i + ∑_{(i,j)∈E} Q^2_{ij} x_i x_j).

The natural parameters are therefore (Q^1, Q^2) ∈ R^V × R^{V×V} and the sufficient statistics are (x, xx^T) ∈ Z^V × Z^{V×V}. We see that the CGF is

    Λ(Q^1, Q^2) = log ∑_{x ∈ {−1,1}^V} exp(∑_{i∈V} Q^1_i x_i + ∑_{(i,j)∈E} Q^2_{ij} x_i x_j).

In general, this is quite a complicated function, so we see that not all exponential families have tractable CGFs (not that we didn't know this already). While we won't touch on it too much right now, what makes things possible for this model, and other models with intractable CGFs, is that it is often possible to simulate from the distribution P_{Q^1,Q^2}, which makes it possible to compute an unbiased estimate of ∇Λ(Q^1, Q^2). This is the basis of stochastic optimization.

1. We are given a procedure that takes arguments (Q^1, Q^2) and produces a random vector D(Q^1, Q^2) with

    E_{Q^1,Q^2} D(Q^1, Q^2) = ∇Λ(Q^1, Q^2).

2. A stochastic optimization procedure has the form

    (Q^1_{k+1}, Q^2_{k+1}) = (Q^1_k, Q^2_k) − α_k [D(Q^1_k, Q^2_k) − t(x)],

where the α_k satisfy some growth assumptions, usually roughly of the form

    α_k → 0 as k → ∞,   ∑_{j=1}^k α_j → ∞ as k → ∞,   ∑_{j=1}^∞ α_j^2 < ∞.

1.3 Markov random fields

The Ising model is an example of something called a Markov random field. The descriptor "Markov" relates to a certain type of Markov property, that is, a type of conditional independence. In the Ising model, suppose we consider the distribution of x_i for some fixed i ∈ V, conditional on x_{V∖i} = x_{−i}. As above, this is an exponential family, but what are its parameters?

First of all, note that the sample space depends on x_{−i}:

    Ω_{x_{−i}} = { y ∈ {−1, 1}^V : y_{−i} = x_{−i} },

which is sort of equivalent to {−1, 1}^{{i}} but not quite the same. There are two points in Ω_{x_{−i}}, and we can take the reference measure to be counting measure on these two points. With that, we see that the CGF is

    log ∑_{y ∈ Ω_{x_{−i}}} exp(∑_{j∈V} Q^1_j y_j + ∑_{(j,k)∈E} Q^2_{jk} y_j y_k).

As a function on Ω_{x_{−i}},

    ∑_{j∈V} Q^1_j y_j = Q^1_i y_i + C_1,
    ∑_{(j,k)∈E} Q^2_{jk} y_j y_k = ∑_{(i,k)∈E} Q^2_{ik} y_i y_k + ∑_{(j,i)∈E} Q^2_{ji} y_j y_i + C_2,

where C_1, C_2 are constant on Ω_{x_{−i}}. If we assume Q^2 is symmetric (which we might as well), then

    ∑_{(j,k)∈E} Q^2_{jk} y_j y_k = 2 y_i ∑_{(i,k)∈E} Q^2_{ik} x_k + C.

Finally, for the Ising model we see that the CGF of x_i under this measure is

    log( exp(Q^1_i + 2 ∑_{(i,k)∈E} Q^2_{ik} x_k) + exp(−Q^1_i − 2 ∑_{(i,k)∈E} Q^2_{ik} x_k) ) + C.
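To make the recursion concrete, here is a toy sketch in Python for a single ±1-valued coordinate, where Λ(η) = log(e^η + e^{−η}) so that ∇Λ(η) = tanh(η) and everything can be checked in closed form. The update drives tanh(η) toward the observed statistic t(x), i.e. toward the MLE moment equation. The simulation scheme, step sizes, and names are my own illustration, not from the notes.

```python
import math
import random

random.seed(0)

def sample_spin(eta):
    # Draw X in {-1, 1} with density proportional to e^{eta * x}
    # with respect to counting measure, i.e. P(X = 1) = 1 / (1 + e^{-2 eta}).
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * eta))
    return 1 if random.random() < p_plus else -1

# Observed sufficient statistic (the sample mean we want to match):
t_obs = 0.5

# Robbins-Monro iteration: eta <- eta - alpha_k * (D - t_obs), where
# D = t(X) for X sampled from the current model, so E[D] = tanh(eta) = grad Lambda.
eta = 0.0
for k in range(1, 50001):
    alpha = 2.0 / (k + 20)       # satisfies the growth conditions above
    D = sample_spin(eta)
    eta -= alpha * (D - t_obs)

# At the fixed point, tanh(eta) should be close to t_obs.
assert abs(math.tanh(eta) - t_obs) < 0.1
```

The step-size sequence α_k = 2/(k + 20) is one arbitrary choice satisfying the conditions α_k → 0, ∑ α_j = ∞, ∑ α_j² < ∞.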

Let's write this CGF as Λ(Q^1, Q^2 | x_{−i}). This is the CGF of a {−1, 1}-valued random variable with natural parameter

    η(x_{−i}, Q^1, Q^2) = Q^1_i + 2 ∑_{(i,k)∈E} Q^2_{ik} x_k

and counting measure on {−1, 1} as reference measure.

1.3.1 Exercise: natural parameter under conditioning

The notation η(x_{−i}, Q^1, Q^2) suggests that the natural parameter corresponding to the sufficient statistic x_i, when conditioning on x_{−i}, has changed. Does this conflict with what we saw earlier about conditioning?

What we can take away from this picture is that conditioning on x_{−i} yielded a new exponential family whose natural parameter is a function of the original natural parameters (Q^1, Q^2) and the value of x_{−i}. Also, the natural parameter depends on x_{−i} only through its values at neighbours of i in G. This is the afore-mentioned Markov property.

1.3.2 Definition of a Markov random field

A Markov random field is a generalization of the Ising model to more general sample spaces and more complicated interactions. We will not dwell on them too much here but, in their most natural form, they are exponential families of distributions on something like R^V for some set of vertices (though they could take values other than R). The sufficient statistics are specified by subsets A ⊂ V:

    f_A(x) = g_A(x_i, i ∈ A),

i.e. f_A is measurable with respect to σ(x_i, i ∈ A). The general form of the density is

    dP_η/dm_0 = exp(∑_{A⊂V} η_A f_A(x) − Λ(η)),

with η = (η_A)_{A⊂V}, where m_0 is some reference measure on R^V. This is a huge natural parameter space, so obviously some restrictions are made by constraining many of the η_A to be 0. Let's call the resulting collection a model M, i.e. a set of A's such that η_A is not constrained to be 0. By default, we assume M is monotone, i.e.

    B ∈ M, A ⊂ B ⟹ A ∈ M.

1.3.3 Exercise: Ising model as random field

Write the Ising model above as a Markov random field. Be as specific as possible.

1. What are the f_A's?

2. What are the η_A's?

3. What is the reference measure?
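The formula for the conditional natural parameter can be checked by brute force on a small graph: enumerate the two configurations directly and compare with e^η / (e^η + e^{−η}). A sketch in Python, using the matrix convention above (interaction summed over all ordered pairs, Q^2 symmetric with zero diagonal); the triangle graph and parameter values are my own choices.

```python
from itertools import product
from math import exp

# A small Ising model on the triangle graph V = {0, 1, 2}.
Q1 = [0.3, -0.2, 0.5]
Q2 = [[0.0, 0.4, 0.1],
      [0.4, 0.0, -0.7],
      [0.1, -0.7, 0.0]]   # symmetric, zero diagonal
V = range(3)

def energy(x):
    # sum_i Q1_i x_i + sum_{i,j} Q2_ij x_i x_j  (sum over ordered pairs)
    s = sum(Q1[i] * x[i] for i in V)
    s += sum(Q2[i][j] * x[i] * x[j] for i in V for j in V)
    return s

def conditional_brute(i, x):
    # P(x_i = +1 | x_{-i}) by enumerating the two points of Omega_{x_{-i}}.
    xp, xm = list(x), list(x)
    xp[i], xm[i] = 1, -1
    wp, wm = exp(energy(xp)), exp(energy(xm))
    return wp / (wp + wm)

def conditional_formula(i, x):
    # Same probability via the conditional natural parameter
    #   eta(x_{-i}, Q1, Q2) = Q1_i + 2 * sum_k Q2_ik x_k,
    # for which P(x_i = +1 | x_{-i}) = e^eta / (e^eta + e^{-eta}).
    eta = Q1[i] + 2 * sum(Q2[i][k] * x[k] for k in V if k != i)
    return exp(eta) / (exp(eta) + exp(-eta))

# The two computations agree for every configuration and every site.
for x in product([-1, 1], repeat=3):
    for i in V:
        assert abs(conditional_brute(i, x) - conditional_formula(i, x)) < 1e-12
```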

The Markov property of these random fields can be expressed in terms of the Markov neighbourhood of i ∈ V. We define this (assuming monotonicity of M) as

    N(i, M) = { j : {i, j} ∈ M }.

Then, the Markov property can be stated as

    η(x_{−i}, η) ∈ σ(x_j, j ∈ N(i, M)).

That is, conditioning on x_{−i} yields an exponential family whose random natural parameter depends only on the neighbours of i in the model M and the true underlying η. This property has some very useful consequences, particularly when the resulting exponential family has a simple form.

1.3.4 Pseudo-likelihood

In the Ising model, the full CGF is very complicated if V is of any reasonable size. But the conditional CGFs of x_i | x_{−i} are particularly simple. This is the basis of the pseudo-log-likelihood for (Q^1, Q^2) in the Ising model:

    ℓ_pseudo(Q^1, Q^2) = ∑_{i∈V} [ η(x_{−i}, Q^1, Q^2) x_i − Λ(Q^1, Q^2 | x_{−i}) ].

1.3.5 Exercise: Ising model pseudolikelihood

1. Write out the pseudolikelihood for the Ising model as explicitly as possible.

2. Is it convex in (Q^1, Q^2)?

3. Describe a Newton-Raphson algorithm to estimate (Q^1, Q^2) based on maximizing the pseudolikelihood. Be as specific as possible, i.e. compute gradients and Hessians as explicitly as possible.

1.3.6 Gibbs sampler

The simple form of the conditional distributions is also the basis of one of the most natural MCMC algorithms, the Gibbs sampler. The Gibbs sampler for the Ising model, say, is a Markov chain on {−1, 1}^V whose stationary distribution is P_{Q^1,Q^2}. The algorithm continuously cycles through the coordinates of V in some (possibly random) order, and each step consists of drawing from P_{Q^1,Q^2}(X_i | X_{−i}). That is, if at the k-th step we are updating coordinate i, we set

    X^{(k+1)}_i ∼ P_{Q^1,Q^2}(X_i | X_{−i} = X^{(k)}_{−i}),   X^{(k+1)}_{−i} = X^{(k)}_{−i}.
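A minimal sketch of this systematic-scan Gibbs sampler in Python, using the closed form of P(X_i | X_{−i}) for the Ising model. On a graph this small the stationary distribution can be computed exactly by enumeration, so we can check that the empirical frequency of a state matches its exact probability. The 3-cycle graph, parameter values, sweep count, and names are my own illustrative choices.

```python
import math
import random
from itertools import product

random.seed(1)

# Ising model on a 3-cycle; matrix convention (interaction over ordered pairs).
Q1 = [0.2, -0.1, 0.0]
Q2 = [[0.0, 0.3, 0.1],
      [0.3, 0.0, 0.2],
      [0.1, 0.2, 0.0]]
V = range(3)

def gibbs_step(x, i):
    # Draw x_i from P(X_i | X_{-i}): a {-1, 1}-valued variable with natural
    # parameter eta = Q1_i + 2 * sum_k Q2_ik x_k.
    eta = Q1[i] + 2 * sum(Q2[i][k] * x[k] for k in V if k != i)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * eta))
    x[i] = 1 if random.random() < p_plus else -1

def gibbs_chain(n_sweeps):
    # Systematic scan: cycle through the coordinates in order, recording the
    # state once per full sweep.
    x = [random.choice([-1, 1]) for _ in V]
    samples = []
    for _ in range(n_sweeps):
        for i in V:
            gibbs_step(x, i)
        samples.append(tuple(x))
    return samples

samples = gibbs_chain(50000)

# Exact probability of the all-(+1) state by enumerating all 8 configurations.
def weight(x):
    s = sum(Q1[i] * x[i] for i in V)
    s += sum(Q2[i][j] * x[i] * x[j] for i in V for j in V)
    return math.exp(s)

Z = sum(weight(x) for x in product([-1, 1], repeat=3))
exact = weight((1, 1, 1)) / Z
empirical = sum(s == (1, 1, 1) for s in samples) / len(samples)
assert abs(empirical - exact) < 0.03
```

The agreement here reflects the fact that each coordinate update leaves P_{Q^1,Q^2} invariant, so the scan does too.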

1.3.7 Exercise: sampling from an Ising model

Consider an Ising model on L, the 100 × 100 lattice in Z^2, with Q^1 = α·1 and Q^2 = β·11^T.

1. For β = 0, α = 1: initialize the Gibbs sampler at some random initial condition. Run the Gibbs sampler Markov chain on {−1, 1}^L for some time. What do you expect the binary image to look like?

2. Repeat for β > 0 and β < 0.

Note: I am not asking for an exhaustive simulation; the goal is just to get the basic mechanics of a Gibbs sampler.
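For the basic mechanics, here is a sketch on a smaller lattice than the exercise asks for (20 × 20 rather than 100 × 100, so it runs quickly); the lattice size, sweep count, β values, and names are my own choices, and this is not meant as a full solution. For β = 0 the sites are independent, so the image should look like biased noise; for β > 0 neighbouring spins tend to agree, producing large same-colour patches; for β < 0 they tend to disagree, producing a checkerboard-like texture.

```python
import math
import random

random.seed(2)

L = 20  # lattice side; the exercise uses 100, smaller here for speed

def neighbours(i, j):
    # 4-nearest neighbours on the L x L lattice (free boundary conditions).
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < L and 0 <= nj < L:
            yield ni, nj

def run_gibbs(alpha, beta, n_sweeps=50):
    # Systematic-scan Gibbs sampler from a random initial configuration.
    # Each site update draws x_ij from its conditional: a {-1, 1} variable
    # with natural parameter alpha + 2 * beta * (sum of neighbouring spins).
    x = [[random.choice([-1, 1]) for _ in range(L)] for _ in range(L)]
    for _ in range(n_sweeps):
        for i in range(L):
            for j in range(L):
                eta = alpha + 2.0 * beta * sum(x[a][b] for a, b in neighbours(i, j))
                p_plus = 1.0 / (1.0 + math.exp(-2.0 * eta))
                x[i][j] = 1 if random.random() < p_plus else -1
    return x

def mean_spin(x):
    return sum(map(sum, x)) / L**2

def alignment(x):
    # Average of x_ij * x_kl over neighbouring pairs: near 0 for beta = 0
    # (with alpha = 0), positive for beta > 0, negative for beta < 0.
    num = den = 0
    for i in range(L):
        for j in range(L):
            for a, b in neighbours(i, j):
                num += x[i][j] * x[a][b]
                den += 1
    return num / den

# beta = 0, alpha = 1: independent sites, mostly +1 (mean spin near tanh(1)).
x1 = run_gibbs(alpha=1.0, beta=0.0)
assert mean_spin(x1) > 0.5

# beta > 0: neighbours agree much more often than in the independent case.
x2 = run_gibbs(alpha=0.0, beta=0.4)
x0 = run_gibbs(alpha=0.0, beta=0.0)
assert alignment(x2) > alignment(x0) + 0.3
```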