Normalising constants and maximum likelihood inference

Jakob G. Rasmussen
Department of Mathematics, Aalborg University, Denmark
March 9, 2011
Today

- Normalising constants
- Approximation of normalising constants:
  - Importance sampling
  - Path sampling (exponential family case)
- Maximum likelihood estimation:
  - Score function
  - Fisher information
- Interchange of differentiation and expectation
Maximum likelihood inference for point processes

Consider point processes specified by an unnormalised density $h_\theta(x)$:
\[ f_\theta(x) = \frac{1}{c(\theta)}\, h_\theta(x) \]
Problem: since $c(\theta)$ is unknown, the log-likelihood
\[ l(\theta) = \log h_\theta(x) - \log c(\theta) \]
is also unknown. Both maximum likelihood inference and Bayesian inference need this constant (it does not cancel out in, for example, MCMC-based inference, as it did for simulation).
Normalising constants and expectations

Normalising constant:
\[ c(\theta) = \sum_{n=0}^{\infty} \frac{e^{-|S|}}{n!} \int_{S^n} h_\theta(\{x_1,\ldots,x_n\})\, dx_1 \cdots dx_n = E\, h_\theta(Y), \]
where $Y \sim \mathrm{Poisson}(S,1)$ and $|S|$ is the area of $S$ (the last equality follows from the Poisson expansion). Note that mean values can also be expressed with respect to other point processes:
\[ E_\theta[g(X)] = E[g(Y) f_\theta(Y)], \]
where $X \sim f_\theta$, the unsubscripted expectation is with respect to $Y \sim \mathrm{Poisson}(S,1)$, and $g : N_{lf} \to \mathbb{R}^k$ is any suitable function.
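The identity $c(\theta) = E\, h_\theta(Y)$ can be turned directly into a plain Monte Carlo estimator. Below is a minimal Python sketch for a Strauss-type unnormalised density on a square window; the function names (`sample_unit_poisson`, `strauss_h`, `estimate_c`) are illustrative, not from the slides, and plain Monte Carlo like this can have high variance, which is why the following slides turn to importance and path sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit_poisson(side, rng):
    """Simulate a unit-rate Poisson process on [0, side]^2:
    a Poisson(side^2) number of points, placed uniformly."""
    n = rng.poisson(side**2)
    return rng.uniform(0.0, side, size=(n, 2))

def strauss_h(points, beta, gamma, R):
    """Unnormalised Strauss density h_theta(x) = beta^n(x) * gamma^s(x),
    where s(x) is the number of point pairs within distance R."""
    n = len(points)
    s = sum(np.linalg.norm(points[i] - points[j]) < R
            for i in range(n) for j in range(i + 1, n))
    return beta**n * gamma**s

def estimate_c(beta, gamma, R, side=1.0, n_sim=5000):
    """Plain Monte Carlo estimate of c(theta) = E h_theta(Y)."""
    vals = [strauss_h(sample_unit_poisson(side, rng), beta, gamma, R)
            for _ in range(n_sim)]
    return float(np.mean(vals))

print(estimate_c(beta=2.0, gamma=0.5, R=0.1))
```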
Importance sampling

$c(\theta)$ can be approximated using importance sampling, where $\theta_0$ is a fixed reference parameter. Importance sampling identity:
\[ \frac{c(\theta)}{c(\theta_0)} = E_{\theta_0}\!\left[\frac{h_\theta(X)}{h_{\theta_0}(X)}\right] \approx \frac{1}{n} \sum_{i=1}^{n} \frac{h_\theta(X_i)}{h_{\theta_0}(X_i)}, \]
where $X_1,\ldots,X_n$ is a sample from $f_{\theta_0}$ (ideally i.i.d. simulations, but simulations taken at regular spacing in an MCMC run will do just fine). Hence
\[ l(\theta) = \log h_\theta(x) - \log \frac{c(\theta)}{c(\theta_0)} - \log c(\theta_0), \]
where the last term does not depend on $\theta$ and can be ignored when maximising.
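As a concrete illustration, here is a small Python sketch of the ratio estimator. To keep it self-contained and checkable, it is run on a toy one-dimensional exponential family with a $N(0,1)$ base measure rather than a point process, so the true ratio is known analytically; in the point process setting, `samples` would instead hold point patterns simulated from $f_{\theta_0}$.

```python
import numpy as np

def ratio_estimate(h, theta, theta0, samples):
    """Unbiased importance sampling estimate of c(theta)/c(theta0),
    given samples X_1,...,X_n from f_{theta0} and the unnormalised
    density h(x, theta)."""
    weights = np.array([h(x, theta) / h(x, theta0) for x in samples])
    return weights.mean()

# Toy check: h(x, theta) = exp(theta*x) on a N(0,1) base measure, where
# c(theta) = exp(theta^2 / 2) can be computed analytically.
rng = np.random.default_rng(1)
theta0, theta = 0.0, 0.3
h = lambda x, th: np.exp(th * x)
samples = rng.normal(size=5000)   # f_{theta0} is N(0,1) when theta0 = 0
print(ratio_estimate(h, theta, theta0, samples))  # ~ exp(0.3**2/2) ≈ 1.046
```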
Importance sampling

Importance sampling can also be used to approximate mean values of other functions of $X$ than the density, say $k(\cdot)$, e.g. $k(x) = n(x)$ or $k(x) = s(x)$. Importance sampling formula:
\[ E_\theta\, k(X) = E_{\theta_0}\!\left[ k(X)\, \frac{h_\theta(X)}{h_{\theta_0}(X)} \right] \Big/ \left[ \frac{c(\theta)}{c(\theta_0)} \right] \]
This can be approximated by
\[ E_{\theta,\theta_0,n} k = \sum_{m=1}^{n} k(X_m)\, w_{\theta,\theta_0,n}(X_m), \quad \text{where} \quad w_{\theta,\theta_0,n}(X_m) = \frac{h_\theta(X_m)/h_{\theta_0}(X_m)}{\sum_{i=1}^{n} h_\theta(X_i)/h_{\theta_0}(X_i)} \]
and $X_1,\ldots,X_n$ is generated from $f_{\theta_0}$.
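The self-normalised weights can be coded in a few lines. The sketch below again uses the toy one-dimensional family so the answer can be verified; `is_mean` is an illustrative name, not notation from the slides.

```python
import numpy as np

def is_mean(k, h, theta, theta0, samples):
    """Self-normalised importance sampling estimate of E_theta[k(X)]:
    the weights w_{theta,theta0,n}(X_m) sum to one, so the unknown
    ratio c(theta)/c(theta0) cancels out."""
    w = np.array([h(x, theta) / h(x, theta0) for x in samples])
    w = w / w.sum()
    return sum(wi * k(x) for wi, x in zip(w, samples))

# Toy check with h(x, theta) = exp(theta*x) on a N(0,1) base measure,
# where f_theta is N(theta, 1), so E_theta[X] = theta:
rng = np.random.default_rng(2)
samples = rng.normal(size=5000)                     # draws from f_{theta0=0}
h = lambda x, th: np.exp(th * x)
print(is_mean(lambda x: x, h, 0.3, 0.0, samples))   # should be close to 0.3
```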
Importance sampling in practice

Theoretically,
\[ \frac{c(\theta)}{c(\theta_0)} \approx \frac{1}{n} \sum_{m=1}^{n} \frac{h_\theta(X_m)}{h_{\theta_0}(X_m)} \]
is an unbiased estimate of $c(\theta)/c(\theta_0)$. However, in practice it may be a very bad estimate, since most $X_m$ will be located where $h_{\theta_0}$ is high. There $h_\theta(X_m)/h_{\theta_0}(X_m)$ is typically low, so most terms count very little in the sum, while a few may count a lot! This is only a problem when $\theta$ and $\theta_0$ are far from each other, so we need a way of making a path between the two such that we never evaluate the ratio of densities at parameters far apart: path sampling.
Exponential families

An exponential family has densities of the form
\[ h_\theta(x) = \exp(t(x)\theta^\top) \]
Example: the Strauss process (with fixed $R$) has $t(x) = (n(x), s(x))$ and $\theta = (\log\beta, \log\gamma)$.
Log-likelihood:
\[ l(\theta) = t(x)\theta^\top - \log c(\theta) \]
Ratio of normalising constants used in importance sampling:
\[ \frac{c(\theta)}{c(\theta_0)} = E_{\theta_0} \exp\!\big(t(X)(\theta - \theta_0)^\top\big) \]
If $\theta - \theta_0$ is large, $\exp(t(X)(\theta - \theta_0)^\top)$ has very large variance in many cases, i.e. many samples are needed in importance sampling.
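For the Strauss example, the sufficient statistic is cheap to compute. A minimal sketch (illustrative function names, brute-force $O(n^2)$ pair counting):

```python
import numpy as np

def strauss_t(points, R):
    """Sufficient statistic t(x) = (n(x), s(x)) of the Strauss process:
    the number of points and the number of pairs within distance R."""
    points = np.asarray(points)
    n = len(points)
    s = sum(np.linalg.norm(points[i] - points[j]) < R
            for i in range(n) for j in range(i + 1, n))
    return np.array([n, s], dtype=float)

def strauss_log_h(points, theta, R):
    """log h_theta(x) = t(x) theta^T, with theta = (log beta, log gamma)."""
    return strauss_t(points, R) @ np.asarray(theta, float)
```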
Path sampling (exponential family case)

Let $\theta(s)$ be a differentiable path linking $\theta_0$ and $\theta_1$, with $\theta(0) = \theta_0$ and $\theta(1) = \theta_1$. Example of a path: $\theta(s) = \theta_0 + s(\theta_1 - \theta_0)$ (straight line).
Path sampling identity:
\[ \log \frac{c(\theta_1)}{c(\theta_0)} = \int_0^1 E_{\theta(s)}[t(X)] \left(\frac{d\theta(s)}{ds}\right)^{\!\top} ds \]
Approximate $E_{\theta(s)} t(X)$ by Monte Carlo and $\int_0^1$ by numerical quadrature (e.g. the trapezoidal rule).
Note: another advantage of path sampling is that Monte Carlo approximation on the log scale is often more stable.
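A sketch of the whole recipe, assuming a user-supplied function `E_t(theta)` that returns a (Monte Carlo) estimate of $E_{\theta(s)}[t(X)]$, e.g. an average of $t(X_m)$ over MCMC samples at that parameter value; the straight-line path and trapezoidal rule match the slide, and the names are illustrative:

```python
import numpy as np

def path_sampling_log_ratio(E_t, theta0, theta1, n_grid=21):
    """Path sampling estimate of log(c(theta1)/c(theta0)) along the straight
    line theta(s) = theta0 + s*(theta1 - theta0), with the trapezoidal rule.
    E_t(theta) should return a (Monte Carlo) estimate of E_theta[t(X)]."""
    theta0 = np.asarray(theta0, float)
    theta1 = np.asarray(theta1, float)
    dtheta = theta1 - theta0                 # d theta(s)/ds, constant here
    grid = np.linspace(0.0, 1.0, n_grid)
    integrand = np.array([np.asarray(E_t(theta0 + s * dtheta)) @ dtheta
                          for s in grid])
    # trapezoidal rule for the integral over s in [0, 1]
    return float(np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(grid)))

# Toy check: h(x, theta) = exp(theta*x) on a N(0,1) base measure, where
# E_theta[t(X)] = theta exactly and log c(theta) = theta^2 / 2:
print(path_sampling_log_ratio(lambda th: th, [0.0], [1.0]))  # exactly 0.5 here
```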
Maximum likelihood estimation

Maximum likelihood estimation is rarely possible analytically for point processes (the homogeneous Poisson process is a rare exception). Instead we maximise the likelihood function using Newton-Raphson. For this we need the score function and the observed Fisher information.
Estimation of the score function

Score function:
\[ u(\theta) = \frac{d}{d\theta}\, l(\theta) = V_\theta(x) - \frac{d}{d\theta} \log c(\theta) = V_\theta(x) - E_\theta\, V_\theta(X), \]
where $V_\theta(x) = \frac{d}{d\theta} \log h_\theta(x)$.
Note: we have assumed that differentiation and expectation can be interchanged.
Approximation of the score function:
\[ u(\theta) \approx V_\theta(x) - E_{\theta,\theta_0,n} V_\theta \]
Estimation of the observed Fisher information

Observed Fisher information:
\[ j(\theta) = -\frac{d}{d\theta}\, u(\theta) = -\frac{d}{d\theta} V_\theta(x) + \frac{d^2}{d\theta\, d\theta^\top} \log c(\theta) = -\frac{d}{d\theta} V_\theta(x) + E_\theta\!\left[\frac{d}{d\theta} V_\theta(X)\right] + \mathrm{Var}_\theta\, V_\theta(X) \]
Note: we have assumed that differentiation and expectation can be interchanged.
Approximation:
\[ j(\theta) \approx -\frac{d}{d\theta} V_\theta(x) + E_{\theta,\theta_0,n}\!\left[\frac{d}{d\theta} V_\theta\right] + E_{\theta,\theta_0,n}\!\left[V_\theta^\top V_\theta\right] - \big(E_{\theta,\theta_0,n} V_\theta\big)^{\!\top} E_{\theta,\theta_0,n} V_\theta \]
Maximisation of the likelihood

Score and observed information in the exponential family case:
\[ u(\theta) = t(x) - E_\theta\, t(X), \qquad j(\theta) = \mathrm{Var}_\theta\, t(X) \]
Since $j$ is a covariance matrix, it is positive semi-definite, so the log-likelihood is concave (and strictly concave, with a unique maximum, when $j$ is positive definite). To find the maximum we solve $u(\theta) = 0$.
Newton-Raphson iterations:
\[ \theta_{m+1} = \theta_m + u(\theta_m)\, j(\theta_m)^{-1} \]
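Putting the pieces together for the exponential family case: a hedged sketch of Newton-Raphson where $E_\theta t(X)$ and $\mathrm{Var}_\theta t(X)$ are approximated with the self-normalised importance weights from the earlier slides. The function name and interface are illustrative; as noted on the "Importance sampling in practice" slide, this is only reliable while the iterates stay reasonably close to $\theta_0$, and in practice one would re-simulate at the current estimate, or use path sampling, if they drift far away.

```python
import numpy as np

def newton_raphson_mle(t_obs, t_samples, theta0, n_iter=25):
    """Newton-Raphson for the exponential family MLE.  E_theta t(X) and
    Var_theta t(X) are approximated by importance sampling with weights
    proportional to exp(t(X_m)(theta - theta0)^T), using samples from
    f_{theta0}; t_samples holds t(X_1),...,t(X_n) as rows."""
    t_obs = np.asarray(t_obs, float)
    t_samples = np.asarray(t_samples, float)
    theta0 = np.asarray(theta0, float)
    theta = theta0.copy()
    for _ in range(n_iter):
        logw = t_samples @ (theta - theta0)
        logw -= logw.max()                  # stabilise before exponentiating
        w = np.exp(logw)
        w /= w.sum()                        # self-normalised weights
        mean_t = w @ t_samples              # approximates E_theta t(X)
        centred = t_samples - mean_t
        cov_t = centred.T @ (w[:, None] * centred)   # approx. Var_theta t(X)
        u = t_obs - mean_t                           # score u(theta)
        theta = theta + np.linalg.solve(cov_t, u)    # Newton-Raphson step
    return theta
```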
Interchange of differentiation and integration

A sufficient condition for
\[ \frac{d}{d\theta}\, E\, h_\theta(Y) = E\, \frac{d}{d\theta} h_\theta(Y) \]
is that $g_i(\theta, x) = \partial h_\theta(x)/\partial\theta_i$ is locally dominated integrable. This is the case if for all $\theta \in \Theta$ there is an $\varepsilon > 0$ and a function $H_{\theta,i}$ such that $b(\theta,\varepsilon) \subseteq \Theta$, $E\, H_{\theta,i}(Y) < \infty$ and $|g_i(\tilde\theta, \cdot)| \le H_{\theta,i}(\cdot)$ for all $\tilde\theta \in b(\theta,\varepsilon)$.