Normalising constants and maximum likelihood inference

Jakob G. Rasmussen
Department of Mathematics, Aalborg University, Denmark
March 9, 2011

Today

- Normalising constants
- Approximation of normalising constants:
  - Importance sampling
  - Path sampling (exponential family case)
- Maximum likelihood estimation:
  - Score function
  - Fisher information
- Interchange of differentiation and expectation

Maximum likelihood inference for point processes

Consider point processes specified by an unnormalised density $h_\theta(x)$,
$$ f_\theta(x) = \frac{1}{c(\theta)}\, h_\theta(x). $$
Problem: since $c(\theta)$ is unknown, the log likelihood
$$ l(\theta) = \log h_\theta(x) - \log c(\theta) $$
is also unknown. Both maximum likelihood inference and Bayesian inference need this constant (it does not cancel out in, for example, MCMC-based inference the way it did for simulation).

Normalising constants and expectations

Normalising constant:
$$ c(\theta) = \sum_{n=0}^{\infty} \frac{e^{-\mu(S)}}{n!} \int_{S^n} h_\theta(\{x_1,\dots,x_n\}) \, dx_1 \cdots dx_n = E\, h_\theta(Y), $$
where $Y \sim \mathrm{Poisson}(S,1)$ (the last equality follows from the Poisson expansion). Note that mean values can also be expressed with respect to other point processes:
$$ E_\theta[g(X)] = E[g(Y)\, f_\theta(Y)], $$
where $X \sim f_\theta$ and $g : N_{\mathrm{lf}} \to \mathbb{R}^k$ is any suitable function.
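The identity $c(\theta) = E\, h_\theta(Y)$ already suggests a naive Monte Carlo estimator: simulate unit-rate Poisson processes on $S$ and average $h_\theta$. Below is a minimal Python sketch (my own illustration, not from the slides; the window, the toy model $h_\theta(x) = \beta^{n(x)}$ on the unit square, and all names are assumptions chosen so that $c(\theta) = \exp(|S|(\beta-1))$ is available as a check).

```python
import numpy as np

rng = np.random.default_rng(0)
area = 1.0  # |S| = mu(S) for the unit square S

def simulate_unit_poisson(rng):
    """Simulate Y ~ Poisson(S, 1): a Poisson(|S|) number of points, uniform on S."""
    n = rng.poisson(area)
    return rng.uniform(0.0, 1.0, size=(n, 2))

def h(points, beta):
    """Toy unnormalised density w.r.t. the unit-rate Poisson: h_theta(x) = beta**n(x)."""
    return beta ** len(points)

beta = 2.0
n_sim = 100_000
est = np.mean([h(simulate_unit_poisson(rng), beta) for _ in range(n_sim)])

print("Monte Carlo estimate of c(theta):", est)
print("Exact value exp(|S|(beta - 1)):  ", np.exp(area * (beta - 1.0)))
```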

Importance sampling

$c(\theta)$ can be approximated using importance sampling. Let $\theta_0$ be a fixed reference parameter. Importance sampling identity:
$$ \frac{c(\theta)}{c(\theta_0)} = E_{\theta_0}\!\left[\frac{h_\theta(X)}{h_{\theta_0}(X)}\right], $$
hence
$$ \frac{c(\theta)}{c(\theta_0)} \approx \frac{1}{n} \sum_{i=1}^{n} \frac{h_\theta(X_i)}{h_{\theta_0}(X_i)}, $$
where $X_1,\dots,X_n$ is a sample from $f_{\theta_0}$ (ideally i.i.d. simulations, but simulations taken at regular spacings in an MCMC run will do just fine). Up to the additive constant $-\log c(\theta_0)$, which does not depend on $\theta$, the log likelihood is then
$$ l(\theta) = \log h_\theta(x) - \log\frac{c(\theta)}{c(\theta_0)}. $$
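A minimal sketch of this ratio estimator (again my own illustration, not from the slides), using the toy model $h_\theta(x) = \beta^{n(x)}$ for which sampling from $f_{\theta_0}$ is exact (a Poisson process with intensity $\beta_0$) and the true ratio $\exp(|S|(\beta-\beta_0))$ is available as a check:

```python
import numpy as np

rng = np.random.default_rng(1)
area = 1.0  # |S|, unit square

def simulate_poisson(beta0, rng):
    """Draw X ~ f_{theta_0}: a Poisson process with intensity beta0 on S."""
    n = rng.poisson(beta0 * area)
    return rng.uniform(0.0, 1.0, size=(n, 2))

def h(points, beta):
    """Toy unnormalised density beta**n(x) w.r.t. the unit-rate Poisson."""
    return beta ** len(points)

beta, beta0 = 2.0, 1.5
n_sim = 100_000
ratios = np.array([h(x, beta) / h(x, beta0)
                   for x in (simulate_poisson(beta0, rng) for _ in range(n_sim))])

print("IS estimate of c(theta)/c(theta_0):", ratios.mean())
print("Exact value exp(|S|(beta - beta_0)):", np.exp(area * (beta - beta0)))
```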

Importance sampling

Importance sampling can also be used to approximate mean values of functions of $X$ other than the density, say $k(\cdot)$, e.g. $k(x) = n(x)$ or $k(x) = s(x)$. Importance sampling formula:
$$ E_\theta[k(X)] = E_{\theta_0}\!\left[k(X)\, \frac{h_\theta(X)}{h_{\theta_0}(X)}\right] \Big/ \frac{c(\theta)}{c(\theta_0)}. $$
This can be approximated by
$$ E_{\theta,\theta_0,n}\, k = \sum_{m=1}^{n} k(X_m)\, w_{\theta,\theta_0,n}(X_m), \qquad w_{\theta,\theta_0,n}(X_m) = \frac{h_\theta(X_m)/h_{\theta_0}(X_m)}{\sum_{i=1}^{n} h_\theta(X_i)/h_{\theta_0}(X_i)}, $$
where $X_1,\dots,X_n$ is generated from $f_{\theta_0}$.
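The same toy model illustrates the self-normalised weights $w_{\theta,\theta_0,n}$; here $k(x) = n(x)$, so $E_\theta[k(X)] = \beta|S|$ can serve as a check. This is a sketch under those assumptions, not part of the slides; since both $h_\theta$ and $k$ depend on the pattern only through the point count, the code simulates counts directly.

```python
import numpy as np

rng = np.random.default_rng(2)
area = 1.0
beta, beta0 = 2.0, 1.5

# Sample X_1,...,X_n from f_{theta_0} (a Poisson process with intensity beta0 on S);
# only the point counts n(X_m) matter for this particular h and k.
counts = rng.poisson(beta0 * area, size=100_000)

raw = (beta / beta0) ** counts        # h_theta(X_m) / h_theta0(X_m)
w = raw / raw.sum()                   # self-normalised weights w_{theta,theta_0,n}
k = counts                            # k(x) = n(x)

print("IS estimate of E_theta[n(X)]:", np.sum(w * k))
print("Exact value beta*|S|:        ", beta * area)
```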

Importance sampling in practice

Theoretically,
$$ \frac{1}{n} \sum_{m=1}^{n} \frac{h_\theta(X_m)}{h_{\theta_0}(X_m)} $$
is an unbiased estimate of $c(\theta)/c(\theta_0)$. However, in practice it may be a very bad estimate, since most $X_m$ will be located where $h_{\theta_0}$ is high; there $h_\theta(X_m)/h_{\theta_0}(X_m)$ is typically low, so most terms count very little in the sum, while a few may count a lot! This is only a problem when $\theta$ and $\theta_0$ are far from each other, so we need a way of making a path between the two such that we never have to evaluate the ratio of densities for parameters that are far apart: path sampling.

Exponential families

An exponential family has densities of the form
$$ h_\theta(x) = \exp\!\big(t(x)\theta^\top\big). $$
Example, the Strauss process (with fixed $R$): $t(x) = (n(x), s(x))$, $\theta = (\log\beta, \log\gamma)$. Log likelihood:
$$ l(\theta) = t(x)\theta^\top - \log c(\theta). $$
Ratio of normalising constants used in importance sampling:
$$ \frac{c(\theta)}{c(\theta_0)} = E_{\theta_0} \exp\!\big(t(X)(\theta-\theta_0)^\top\big). $$
If $\theta - \theta_0$ is large, $\exp(t(X)(\theta-\theta_0)^\top)$ has very large variance in many cases, i.e. many samples are needed in importance sampling.
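For concreteness, here is a small sketch of the Strauss sufficient statistic and the corresponding $\log h_\theta$ (my own illustration; the function names are not from the slides, and $s(x)$ is taken to be the number of pairs of points at distance less than $R$):

```python
import numpy as np

def strauss_stats(points, R):
    """t(x) = (n(x), s(x)) for the Strauss process: the number of points n(x)
    and the number s(x) of pairs of points at distance less than R."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    if n < 2:
        return n, 0
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    s = int((dist[np.triu_indices(n, k=1)] < R).sum())
    return n, s

def log_h(points, theta, R):
    """log h_theta(x) = t(x) theta^T with theta = (log beta, log gamma)."""
    n, s = strauss_stats(points, R)
    return n * theta[0] + s * theta[1]

# Small example: three points, one R-close pair.
x = [[0.10, 0.20], [0.15, 0.22], [0.80, 0.50]]
print(strauss_stats(x, R=0.1))                                   # (3, 1)
print(log_h(x, theta=(np.log(100.0), np.log(0.5)), R=0.1))
```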

Path sampling (exponential family case)

Let $\theta(s)$ be a differentiable path linking $\theta_0$ and $\theta_1$, with $\theta(0) = \theta_0$ and $\theta(1) = \theta_1$. Example of a path: $\theta(s) = \theta_0 + s(\theta_1 - \theta_0)$ (a straight line). Path sampling identity:
$$ \log\frac{c(\theta_1)}{c(\theta_0)} = \int_0^1 E_{\theta(s)}[t(X)] \left(\frac{d\theta(s)}{ds}\right)^{\!\top} ds. $$
Approximate $E_{\theta(s)}[t(X)]$ by Monte Carlo and the integral $\int_0^1$ by numerical quadrature (e.g. the trapezoidal rule). Note: another advantage of path sampling is that the Monte Carlo approximation on the log scale is often more stable.
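A minimal sketch of the recipe "Monte Carlo inside, trapezoidal rule outside" (not from the slides), using the one-parameter toy model $h_\theta(x) = \exp(n(x)\theta)$ with $\theta = \log\beta$, for which sampling from $f_{\theta(s)}$ is exact and $\log[c(\theta_1)/c(\theta_0)] = |S|(\beta_1 - \beta_0)$ gives a check:

```python
import numpy as np

rng = np.random.default_rng(3)
area = 1.0  # |S|

def mc_mean_t(theta, n_sim, rng):
    """Monte Carlo estimate of E_{theta}[t(X)] with t(x) = n(x) for the toy
    model h_theta(x) = exp(n(x)*theta); here X is a Poisson process with
    intensity exp(theta), so the point count is Poisson(exp(theta)*|S|)."""
    return rng.poisson(np.exp(theta) * area, size=n_sim).mean()

theta0, theta1 = np.log(1.5), np.log(3.0)
s_grid = np.linspace(0.0, 1.0, 21)              # quadrature nodes on [0, 1]
path = theta0 + s_grid * (theta1 - theta0)      # straight-line path theta(s)
dtheta_ds = theta1 - theta0                     # derivative of the path (constant)

integrand = np.array([mc_mean_t(th, 20_000, rng) * dtheta_ds for th in path])

# Trapezoidal rule over s.
log_ratio = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(s_grid))

print("Path sampling estimate of log c(theta_1)/c(theta_0):", log_ratio)
print("Exact value |S|(beta_1 - beta_0):", area * (np.exp(theta1) - np.exp(theta0)))
```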

Maximum likelihood estimation

Maximum likelihood estimation is rarely possible analytically for point processes (the homogeneous Poisson process is a rare exception). Instead we maximise the likelihood function using Newton-Raphson. For this we need the score function and the observed Fisher information.

Estimation of score function

Score function:
$$ u(\theta) = \frac{d}{d\theta}\, l(\theta) = V_\theta(x) - \frac{d}{d\theta}\log c(\theta) = V_\theta(x) - E_\theta V_\theta(X), \qquad \text{where } V_\theta(x) = \frac{d}{d\theta}\log h_\theta(x). $$
Note: we have assumed that differentiation and expectation can be interchanged. Approximation of the score function:
$$ u(\theta) \approx V_\theta(x) - E_{\theta,\theta_0,n} V_\theta. $$

Estimation of observed Fisher information

Observed Fisher information:
$$ j(\theta) = -\frac{d}{d\theta}\, u(\theta) = -\frac{d}{d\theta} V_\theta(x) + \frac{d^2}{d\theta\, d\theta^\top}\log c(\theta) = -\frac{d}{d\theta} V_\theta(x) + E_\theta\!\left[\frac{d}{d\theta} V_\theta(X)\right] + \mathrm{Var}_\theta\, V_\theta(X). $$
Note: we have assumed that differentiation and expectation can be interchanged. Approximation:
$$ j(\theta) \approx -\frac{d}{d\theta} V_\theta(x) + E_{\theta,\theta_0,n}\!\left[\frac{d}{d\theta} V_\theta + V_\theta^\top V_\theta\right] - \big(E_{\theta,\theta_0,n} V_\theta\big)^{\!\top} \big(E_{\theta,\theta_0,n} V_\theta\big). $$
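For an exponential family (see the "Exponential families" slide), $V_\theta(x) = t(x)$ and $dV_\theta/d\theta = 0$, so these approximations reduce to a weighted mean and covariance of $t$ under the self-normalised weights $w_{\theta,\theta_0,n}$. A sketch under those assumptions (the function name and array layout are mine, not from the slides):

```python
import numpy as np

def is_score_and_info(t_obs, t_samples, log_h_ratio):
    """Importance-sampling approximations of the score u(theta) and the observed
    information j(theta) in the exponential family case, where V_theta(x) = t(x)
    and dV_theta/dtheta = 0.

    t_obs       : t(x) for the observed pattern, shape (p,)
    t_samples   : t(X_m) for samples X_m from f_{theta_0}, shape (n, p)
    log_h_ratio : log h_theta(X_m) - log h_theta0(X_m), shape (n,)
    """
    w = np.exp(log_h_ratio - log_h_ratio.max())       # stabilised unnormalised weights
    w = w / w.sum()                                    # self-normalised weights
    mean_t = w @ t_samples                             # approx. E_theta[t(X)]
    second = (t_samples * w[:, None]).T @ t_samples    # approx. E_theta[t(X)^T t(X)]
    cov_t = second - np.outer(mean_t, mean_t)          # approx. Var_theta t(X)
    return np.asarray(t_obs) - mean_t, cov_t           # (score, observed information)
```

Subtracting the maximum of the log ratios before exponentiating only rescales the weights, which cancels in the self-normalisation; it guards against overflow when $\theta$ is far from $\theta_0$.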

Maximisation of likelihood

Score and observed information in the exponential family case:
$$ u(\theta) = t(x) - E_\theta\, t(X), \qquad j(\theta) = \mathrm{Var}_\theta\, t(X). $$
Since $j$ is a covariance matrix, it is positive semi-definite, and thus the log-likelihood is concave, so any maximum is unique. To find it we need to solve $u(\theta) = 0$. Newton-Raphson iterations:
$$ \theta_{m+1} = \theta_m + u(\theta_m)\, j(\theta_m)^{-1}. $$
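A self-contained sketch of the Newton-Raphson iteration (my own toy example, not from the slides): the one-parameter model $t(x) = n(x)$, $\theta = \log\beta$, where $E_\theta\, t(X) = \mathrm{Var}_\theta\, t(X) = e^\theta|S|$ are known in closed form, so the iteration can be checked against the closed-form MLE $\hat\theta = \log(n(x)/|S|)$. In a real Strauss fit the two moments would instead come from the importance-sampling approximations above.

```python
import numpy as np

area = 1.0    # |S|
n_obs = 37    # observed number of points n(x)

def score_and_info(theta):
    """u(theta) = t(x) - E_theta t(X) and j(theta) = Var_theta t(X) for the toy
    model t(x) = n(x), theta = log(beta); both moments equal exp(theta)*|S|."""
    mean_t = np.exp(theta) * area
    return n_obs - mean_t, mean_t

theta = 0.0                        # crude starting value
for _ in range(200):               # Newton-Raphson: theta <- theta + u(theta)/j(theta)
    u, j = score_and_info(theta)
    step = u / j
    theta += step
    if abs(step) < 1e-10:
        break

print("Newton-Raphson estimate:       ", theta)
print("Closed-form MLE log(n(x)/|S|): ", np.log(n_obs / area))
```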

Interchange of differentiation and integration

A sufficient condition for
$$ \frac{d}{d\theta}\, E\, h_\theta(Y) = E\, \frac{d}{d\theta}\, h_\theta(Y) $$
is that $g_i(\theta,x) = \partial h_\theta(x)/\partial\theta_i$ is locally dominated integrable. This is the case if, for all $\theta \in \Theta$, there exist an $\varepsilon > 0$ and a function $H_{\theta,i}$ such that $b(\theta,\varepsilon) \subseteq \Theta$, $E\, H_{\theta,i}(Y) < \infty$ and $|g_i(\tilde\theta, \cdot)| \le H_{\theta,i}(\cdot)$ for all $\tilde\theta \in b(\theta,\varepsilon)$, where $b(\theta,\varepsilon)$ denotes the ball of radius $\varepsilon$ centred at $\theta$.
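As an illustration (my own, not on the slide), the condition is easy to check for the one-parameter toy model $h_\theta(x) = e^{n(x)\theta}$:
$$ g(\theta, x) = \frac{\partial}{\partial\theta}\, h_\theta(x) = n(x)\, e^{n(x)\theta}, $$
and for $\tilde\theta \in b(\theta,\varepsilon)$,
$$ |g(\tilde\theta, x)| \le n(x)\, e^{n(x)(\theta+\varepsilon)} =: H_\theta(x), \qquad E\, H_\theta(Y) = E\big[N e^{N(\theta+\varepsilon)}\big] < \infty, $$
since $N = n(Y)$ is Poisson$(\mu(S))$ and a Poisson variable has finite exponential moments of every order. Hence $g$ is locally dominated integrable and the interchange is justified for this model.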