Likelihood-Based Methods

1 Likelihood-Based Methods
Handbook of Spatial Statistics, Chapter 4
Susheela Singh
September 22, 2016

2 OVERVIEW
- Introduction
- Maximum likelihood estimation (ML)
- Restricted maximum likelihood estimation (REML)
- Numerical optimization
- Inference
- Non-Gaussian data

4 MODEL
Recall the usual geostatistical model for points $s \in D$:

$$Y(s) = X(s)\beta + e(s), \qquad e \sim GP\{0, C(\theta)\}.$$

Note that unlike Ch 3, we do not require stationarity here!

Suppose the process $Y(\cdot)$ has been observed at $n$ points in $D$, giving us $Y = [Y(s_1), \ldots, Y(s_n)]^T$, and let $\Sigma(\theta)$ denote the associated $n \times n$ covariance matrix.

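5 LIKELIHOOD FUNCTION
Because the process $Y(\cdot)$ is Gaussian, the joint distribution of the observed values is multivariate normal,

$$Y \sim N_n\{X\beta, \Sigma(\theta)\}.$$

Then the maximum likelihood (ML) estimates, $\hat\beta$ and $\hat\theta$, are the values of $\beta$ and $\theta$ that maximize the log-likelihood, which up to an additive constant is

$$l(\beta, \theta; Y) = -\tfrac{1}{2} \ln|\Sigma(\theta)| - \tfrac{1}{2} (Y - X\beta)^T \Sigma^{-1}(\theta) (Y - X\beta).$$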

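6 PROFILING
We know that for a fixed value $\theta = \theta_0$, the maximizer of $l(\beta; \theta_0, Y)$ is the generalized least squares estimator

$$\hat\beta = [X^T \Sigma^{-1}(\theta_0) X]^{-1} X^T \Sigma^{-1}(\theta_0) Y.$$

If we plug this expression in for $\beta$ in the objective function $l(\beta, \theta; Y)$, we end up with a function of $\theta$ only,

$$l(\theta; Y) = -\tfrac{1}{2} \ln|\Sigma(\theta)| - \tfrac{1}{2} Y^T P(\theta) Y,$$

again up to an additive constant, where $P(\theta) = \Sigma^{-1}(\theta) - \Sigma^{-1}(\theta) X [X^T \Sigma^{-1}(\theta) X]^{-1} X^T \Sigma^{-1}(\theta)$.

We maximize this profile log-likelihood function to get $\hat\theta$, then plug $\hat\theta$ into the expression above to get $\hat\beta$.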
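The profiling recipe can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponential covariance `exp_cov`, the parameterization $\theta = (\sigma^2, \phi)$, the simulated data, and the choice of Nelder-Mead are not from the slides, and all function names are hypothetical.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def exp_cov(coords, sigma2, phi):
    """Exponential covariance sigma2 * exp(-d / phi) (an assumed form of C(theta))."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Tiny diagonal jitter, for numerical stability only.
    return sigma2 * (np.exp(-d / phi) + 1e-10 * np.eye(len(coords)))

def gls_beta(X, Y, Sigma):
    """Generalized least squares estimate of beta for a fixed covariance matrix."""
    c = cho_factor(Sigma, lower=True)
    SiX, SiY = cho_solve(c, X), cho_solve(c, Y)
    return np.linalg.solve(X.T @ SiX, X.T @ SiY)

def neg_profile_loglik(log_theta, coords, X, Y):
    """Negative profile log-likelihood in theta = (sigma2, phi), up to a constant."""
    sigma2, phi = np.exp(log_theta)        # log scale keeps both parameters positive
    Sigma = exp_cov(coords, sigma2, phi)
    c = cho_factor(Sigma, lower=True)
    logdet = 2.0 * np.sum(np.log(np.diag(c[0])))
    r = Y - X @ gls_beta(X, Y, Sigma)      # GLS residuals
    quad = r @ cho_solve(c, r)             # equals Y^T P(theta) Y
    return 0.5 * logdet + 0.5 * quad

# Simulated data (illustrative values only).
rng = np.random.default_rng(0)
n = 80
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), coords[:, 0]])
Sigma_true = exp_cov(coords, 1.0, 0.2)
Y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(Sigma_true) @ rng.standard_normal(n)

# Step 1: maximize the profile log-likelihood over theta.
x0 = np.log([0.5, 0.1])
res = minimize(neg_profile_loglik, x0, args=(coords, X, Y), method="Nelder-Mead")
sigma2_hat, phi_hat = np.exp(res.x)

# Step 2: plug theta-hat into the GLS formula to get beta-hat.
beta_hat = gls_beta(X, Y, exp_cov(coords, sigma2_hat, phi_hat))
print(sigma2_hat, phi_hat, beta_hat)
```

Working with GLS residuals avoids forming $P(\theta)$ explicitly; the quadratic form in the residuals equals $Y^T P(\theta) Y$.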

9 MOTIVATION
ML is intuitive and its estimators have desirable properties, but they can be biased in finite samples.

For example, we all remember from Casella & Berger [1] that the ML estimator for $\sigma^2$ in the simple case of a $N(\mu, \sigma^2)$ model is $n^{-1} \sum_i (X_i - \bar{X})^2$, which is not an unbiased estimator.

This issue arises because we have to estimate $\mu$ with $\bar{X}$, rather than knowing it. The same logic applies in our Gaussian process model.

[1] Right?
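To make the size of the bias concrete, here is the standard calculation for the i.i.d. $N(\mu, \sigma^2)$ case (the symbol $\hat\sigma^2_{ML}$ is notation introduced here for the estimator above):

```latex
\begin{align*}
E\Big[\sum_{i=1}^n (X_i - \bar{X})^2\Big] &= (n-1)\,\sigma^2, \\
E\big[\hat\sigma^2_{ML}\big] &= \frac{n-1}{n}\,\sigma^2, \\
\text{bias} &= E\big[\hat\sigma^2_{ML}\big] - \sigma^2 = -\frac{\sigma^2}{n}.
\end{align*}
```

One degree of freedom is spent estimating $\mu$; dividing by $n-1$ instead of $n$ removes the bias.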

11 RESTRICTED LIKELIHOOD APPROACH
Restricted maximum likelihood (REML) avoids this issue by maximizing the log-likelihood function associated with error contrasts, rather than the observations themselves.

These error contrasts are a set of linearly independent vectors $a_1, \ldots, a_{n-p}$ such that $E[a_j^T Y] = 0$ for all $\beta, \theta$, where $p$ is the number of columns of $X$ (assumed to have full column rank).

Gathering these vectors into a matrix, $A = [a_1 \; a_2 \; \cdots \; a_{n-p}]$, it's straightforward to see that

$$A^T Y \sim N_{n-p}\{0, A^T \Sigma(\theta) A\}.$$

Now we have a likelihood that depends only on $\theta$, so the source of the bias is gone.
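One convenient (but not the only) way to build such a matrix is to take a basis of the null space of $X^T$, since $E[A^T Y] = A^T X \beta$ vanishes for every $\beta$ exactly when $A^T X = 0$. A minimal sketch, with a made-up design matrix:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
n, p = 6, 2
X = np.column_stack([np.ones(n), rng.uniform(size=n)])  # illustrative n x p design

# Columns of A form an orthonormal basis of {a : X^T a = 0}.
A = null_space(X.T)

print(A.shape)                   # n rows, n - p columns
print(np.abs(A.T @ X).max())     # numerically zero
```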

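12 RESTRICTED LIKELIHOOD FUNCTION
With some work, we can show that the log-likelihood of the error contrasts, $A^T Y$, can be written, up to an additive constant, as

$$l_R(\theta; Y) = -\tfrac{1}{2} \ln|\Sigma(\theta)| - \tfrac{1}{2} \ln|X^T \Sigma^{-1}(\theta) X| - \tfrac{1}{2} Y^T P(\theta) Y,$$

where $P(\theta) = \Sigma^{-1}(\theta) - \Sigma^{-1}(\theta) X [X^T \Sigma^{-1}(\theta) X]^{-1} X^T \Sigma^{-1}(\theta)$.

Then the REML estimate of $\theta$, let's call it $\tilde\theta$, is the value that maximizes $l_R$. As before, we plug this into the generalized least squares estimator to get $\tilde\beta$.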
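A quick numerical check of this identity, as a sketch: differences in $l_R$ between two parameter values should match differences in the Gaussian log-density of $A^T Y$, because the two functions differ only by a constant that does not depend on $\theta$. The exponential covariance, the parameter values, and the helper names are assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.stats import multivariate_normal

def exp_cov(coords, sigma2, phi):
    """Exponential covariance sigma2 * exp(-d / phi) (an assumed form of C(theta))."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / phi)

def restricted_loglik(theta, coords, X, Y):
    """l_R(theta; Y) up to an additive constant."""
    Sigma = exp_cov(coords, *theta)
    SiX = np.linalg.solve(Sigma, X)
    SiY = np.linalg.solve(Sigma, Y)
    XtSiX = X.T @ SiX
    XtSiY = X.T @ SiY
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_XtSiX = np.linalg.slogdet(XtSiX)
    quad = Y @ SiY - XtSiY @ np.linalg.solve(XtSiX, XtSiY)  # Y^T P(theta) Y
    return -0.5 * logdet_Sigma - 0.5 * logdet_XtSiX - 0.5 * quad

def contrast_loglik(theta, coords, A, Y):
    """Exact Gaussian log-density of the error contrasts A^T Y."""
    cov = A.T @ exp_cov(coords, *theta) @ A
    return multivariate_normal(mean=np.zeros(A.shape[1]), cov=cov).logpdf(A.T @ Y)

rng = np.random.default_rng(2)
n = 30
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), coords[:, 0]])
Y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(exp_cov(coords, 1.0, 0.2)) @ rng.standard_normal(n)
A = null_space(X.T)

theta1, theta2 = (1.0, 0.2), (2.0, 0.5)
diff_reml = restricted_loglik(theta1, coords, X, Y) - restricted_loglik(theta2, coords, X, Y)
diff_contrast = contrast_loglik(theta1, coords, A, Y) - contrast_loglik(theta2, coords, A, Y)
print(diff_reml, diff_contrast)  # the two differences should agree
```

Because the formula for $l_R$ does not involve $A$, the REML estimate does not depend on which set of error contrasts is chosen.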

13 EFFICACY
- There is no theoretical proof that either ML or REML is uniformly preferable.
- In some special cases, the two methods are equivalent.
- In large samples, they perform similarly.
- However, in small or medium sample problems, REML does effectively reduce bias.
- REML may also increase the estimation variance, unless the persistence of spatial dependence is weak.

14 NUMERICAL OPTIMIZATION METHODS
Regardless of which approach we use, maximizing either $l$ or $l_R$ is a constrained, non-linear optimization problem. In general, closed-form solutions don't exist, so we turn to numerical methods:
- Grid search
- Gradient-based algorithms (Newton-Raphson, Fisher scoring)
- Non-gradient-based iterative algorithms

There is no guarantee that a unique maximizer exists, and even when one does, we can't be sure that the solution an algorithm identifies is a global rather than a local maximum.

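15 INVERTIN' AIN'T EASY
Gradient-based iterative algorithms update the $m$-vector $\theta$ as

$$\theta^{(k+1)} = \theta^{(k)} + \rho^{(k)} M^{(k)} g^{(k)},$$

where $\rho^{(k)}$ is a scalar step size, $g^{(k)}$ is the gradient of the objective function evaluated at $\theta^{(k)}$, and $M^{(k)}$ is an $m \times m$ matrix inverse, usually of second derivatives.

Evaluating $M^{(k)}$, together with the inversion and determinant calculations required to evaluate the objective function at each iteration, can make numerical methods very computationally expensive.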
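To see the update rule in action without the cost of the spatial covariance matrix, here is a sketch of Fisher scoring on a deliberately simple stand-in: the i.i.d. $N(\mu, \sigma^2)$ model, with $\theta = (\mu, \sigma^2)$. In this toy case $M^{(k)}$ is the inverse of the expected Fisher information and $\rho^{(k)} = 1$; the function names are hypothetical, and this is not the geostatistical objective itself.

```python
import numpy as np

def score(theta, x):
    """Gradient g of the N(mu, s2) log-likelihood with respect to (mu, s2)."""
    mu, s2 = theta
    r = x - mu
    return np.array([r.sum() / s2,
                     -x.size / (2 * s2) + (r @ r) / (2 * s2**2)])

def fisher_info(theta, x):
    """Expected Fisher information for (mu, s2)."""
    _, s2 = theta
    n = x.size
    return np.array([[n / s2, 0.0],
                     [0.0, n / (2 * s2**2)]])

def fisher_scoring(x, theta0, rho=1.0, tol=1e-10, max_iter=50):
    """Iterate theta <- theta + rho * M * g with M the inverse Fisher information."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iter):
        g = score(theta, x)
        M = np.linalg.inv(fisher_info(theta, x))   # the m x m matrix inverse
        step = rho * (M @ g)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta, k + 1

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=200)
theta_hat, n_iter = fisher_scoring(x, theta0=[0.0, 1.0])
print(theta_hat, n_iter)
```

Here $m = 2$ and the inverse is trivial. In the spatial setting, each evaluation of the gradient and of $M^{(k)}$ also involves $\Sigma^{-1}(\theta)$, which is where the expense comes from.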

18 ALTERNATE APPROACHES
Rather than trying to speed up the maximization of the objective function $L$, we can replace the objective function with an approximation that is easier to maximize.

Approximate Likelihood
Rewrite $L$ as the product of conditional likelihoods. Replace the complete conditional likelihoods with conditional likelihoods based on a subvector of the conditioning set.

Composite Likelihood
Replace $L$ with the product of component likelihoods, as though the component likelihoods resulted from independent subvectors of the data.
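The approximate-likelihood idea can be sketched as follows, in the style of what is often called Vecchia's approximation. The assumptions here are not from the slides: a zero-mean process (as if the mean had been removed), the observations taken in their given order, and a conditioning set made of the `m` nearest earlier points. The function name is hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def approx_loglik(Y, Sigma, coords, m):
    """Sum of log f(Y_i | up to m nearest earlier observations), zero-mean Gaussian."""
    ll = norm.logpdf(Y[0], loc=0.0, scale=np.sqrt(Sigma[0, 0]))
    for i in range(1, Y.size):
        d = np.linalg.norm(coords[:i] - coords[i], axis=1)
        nb = np.argsort(d)[:m]                      # subvector of the conditioning set
        w = np.linalg.solve(Sigma[np.ix_(nb, nb)], Sigma[i, nb])
        cond_mean = w @ Y[nb]
        cond_var = Sigma[i, i] - w @ Sigma[i, nb]
        ll += norm.logpdf(Y[i], loc=cond_mean, scale=np.sqrt(cond_var))
    return ll

rng = np.random.default_rng(4)
n = 40
coords = rng.uniform(0, 1, size=(n, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = np.exp(-dist / 0.2)                         # assumed exponential covariance
Y = np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

exact = multivariate_normal(mean=np.zeros(n), cov=Sigma).logpdf(Y)
full = approx_loglik(Y, Sigma, coords, m=n - 1)     # conditions on everything earlier
small = approx_loglik(Y, Sigma, coords, m=5)        # only 5 x 5 systems to solve
print(exact, full, small)
```

With `m = n - 1` every conditioning set is complete, so the product of conditionals reproduces the exact log-likelihood. With a small `m`, each term requires solving only an `m`-by-`m` system.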

20 ALTERNATE APPROACHES
Covariance Tapering
Set elements of the covariance matrix corresponding to spatially distant pairs to zero and use sparse matrix methods.

Spectral Methods
Stay tuned!
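A small sketch of the tapering idea. One common way to zero out distant pairs while keeping the matrix positive definite is to multiply the covariance elementwise by a compactly supported correlation function; the spherical taper, the taper range `gamma`, the exponential covariance, and the use of SciPy's sparse LU factorization are all choices made here for illustration.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

def spherical_taper(d, gamma):
    """Spherical correlation: positive for d < gamma, exactly zero for d >= gamma."""
    x = np.clip(d / gamma, 0.0, 1.0)
    return (1.0 - x) ** 2 * (1.0 + x / 2.0)

rng = np.random.default_rng(6)
n = 300
coords = rng.uniform(0, 1, size=(n, 2))
# A dense distance matrix is fine for a demo; at scale one would find
# neighbouring pairs without ever forming it.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma = np.exp(-d / 0.2)                            # assumed exponential covariance
Y = np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

# Elementwise product with the taper; zero entries are dropped from storage.
Sigma_tap = sparse.csc_matrix(Sigma * spherical_taper(d, gamma=0.15))

lu = splu(Sigma_tap)                                # sparse LU factorization
logdet = np.sum(np.log(np.abs(lu.U.diagonal())))    # L has a unit diagonal
quad = Y @ lu.solve(Y)
loglik_tapered = -0.5 * logdet - 0.5 * quad         # zero-mean case, up to a constant

print(Sigma_tap.nnz / n**2, loglik_tapered)
```

The tapered log-likelihood is an approximation to the original one, so estimates based on it can differ from the ML estimates.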

21 ASYMPTOTICS
There are two asymptotic frameworks that are both reasonable to consider:

Increasing Domain Asymptotics
ML and REML estimators are consistent and asymptotically normal under certain regularity conditions.

Infill (Fixed Domain) Asymptotics
Available asymptotic results are limited, but ML estimators can behave quite differently in this framework.

23 MODEL COMPARISONS
Under ML:
- We can perform hypothesis tests on $\beta$ and $\theta$.
- Full and reduced model comparisons can be carried out using a likelihood ratio test, as usual, assuming asymptotic normality.
- Non-nested models can be compared using penalized likelihood criteria, like AIC or BIC.

Under REML:
- Hypotheses about $\theta$ can be tested as long as the compared models differ only in their covariance structure.
- Comparisons can be made using AIC or BIC, using the restricted likelihood in the computation.
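For reference, the usual forms of these quantities are as follows. The notation is introduced here, not taken from the slides: $\hat{l}$ denotes a maximized log-likelihood, $q$ the number of parameters fixed by the reduced model, $k$ the number of estimated parameters, and $n$ the number of observations.

```latex
\begin{align*}
\Lambda &= 2\,\big[\hat{l}_{\text{full}} - \hat{l}_{\text{reduced}}\big]
          \;\overset{\text{approx.}}{\sim}\; \chi^2_q
          \quad \text{under the reduced model}, \\
\text{AIC} &= -2\,\hat{l} + 2k, \\
\text{BIC} &= -2\,\hat{l} + k \ln n.
\end{align*}
```

Additive constants dropped from the log-likelihood cancel as long as all models are compared on the same data. Under REML, $\hat{l}$ is replaced by the maximized $l_R$, and the compared models must share the same mean structure.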

24 NON-GAUSSIAN DATA
For continuous data that are skewed or have bounded support, we can try to transform the data to make them more Gaussian and then use the methods we've discussed.

For discrete data, like counts or binary indicators, we don't often have that luxury. Instead, we rely on spatial generalized linear mixed models (GLMMs).

25 SPATIAL GENERALIZED LINEAR MIXED MODELS (GLMMS)
Assume that we have a latent, stationary Gaussian random field $e \sim GP\{0, C(h; \theta)\}$.

Then, conditional on $e(\cdot)$, $Y(\cdot)$ is an independent process with conditional mean $E[Y(s) \mid e(s)] = \mu(s)$.

Finally, for some link function $g$, we write

$$g[\mu(s)] = X(s)\beta + e(s).$$

We use this to form a likelihood function $L(\beta, \theta; Y)$ and proceed as before.
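To make the hierarchy concrete, here is a sketch that simulates count data from one such model. The Poisson response, the log link, the exponential covariance, and all parameter values are assumptions chosen for illustration; the slides do not specify them.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
coords = rng.uniform(0, 1, size=(n, 2))

# Latent stationary Gaussian field e with covariance C(h; theta) = 0.5 * exp(-h / 0.2).
h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
C = 0.5 * np.exp(-h / 0.2)
e = np.linalg.cholesky(C + 1e-10 * np.eye(n)) @ rng.standard_normal(n)

# Linear predictor and log link: g(mu) = log(mu) = X beta + e.
X = np.column_stack([np.ones(n), coords[:, 0]])
beta = np.array([0.5, 1.0])
mu = np.exp(X @ beta + e)

# Conditional on e, the observations are independent Poisson counts.
Y = rng.poisson(mu)
print(Y[:10], mu[:10].round(2))
```

Unlike the Gaussian case, the likelihood $L(\beta, \theta; Y)$ here requires integrating the latent field $e$ out of the joint density of $(Y, e)$, which generally has no closed form.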
