Discussion of Maximization by Parts in Likelihood Inference


David Ruppert
School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 14853
email: dr24@cornell.edu

I thank the authors for a stimulating paper on an important topic. Maximum likelihood is the most widely used estimation technique, though certainly Bayesian methods are catching up with the development of MCMC. Computational methods for the MLE are well understood in many important cases, e.g., nonlinear regression and generalized linear models, where software is readily available. However, for models that do not fit into these standard classes, there is surprisingly little discussion in the literature on practical issues. In particular, considering the importance of the topic, there is relatively little advice about which maximization methods are reliable and when. Maximization by parts might be an important new tool. The authors have done an excellent job describing its properties and showing by example that it can handle a variety of problems. The next step is to compare maximization by parts with other numerical methods for likelihood maximization.

In this comment, I will discuss my own (limited) experience with maximum likelihood computations. My experience with techniques such as quasi-Newton methods, numerical differentiation, and what I call below Gauss-Newton-type algorithms has been much more positive than what the authors' comments suggest. I have avoided EM in my own work, especially Monte Carlo EM, because there is much evidence in the literature that Monte Carlo EM algorithms are exceedingly slow. Thus, I was not surprised that the authors also found Monte Carlo EM to be very computationally expensive.

There are at least two issues that must always be addressed, plus a third that arises when there are latent variables:

1. finding a starting value for iterative algorithms,

2. calculation of derivatives and Hessians,

3. numerically integrating out latent variables (missing data) from their joint density with the observed data to obtain the likelihood of the latter.

Starting values can be found either using an inefficient, but easily calculated, estimator, or by maximizing the likelihood on some grid. There is a well-known result that, starting with a root-n consistent preliminary estimator, one step of Newton-Raphson is asymptotically efficient (see, for example, Lehmann, 1999, Theorem 7.3.3). Of course, more than one step might be preferable in finite samples, and one often does not use exact Newton-Raphson since it requires computation of second derivatives. But the principle is that a good starting value is a valuable commodity. The authors use the maximizer of l_w as a starting value for maximization by parts. This estimator could be used to start other algorithms, though there are many other choices to consider. Often there is no obvious preliminary estimator and then a grid search is needed. Computers are so fast nowadays that searching a rather large grid is feasible. In higher dimensions, a Latin hypercube can be used to control the size of the grid. Random searches are also possible. When bootstrapping, extensive grid searches might be too slow, but then there is an obvious preliminary estimator, the MLE from the original sample.

I have programmed in MATLAB a fairly standard maximum likelihood algorithm, currently called maxlik0, which I find to be rather reliable. This program does not require that the gradient and Hessian be programmed but rather uses two-sided (central) numerical gradients. The authors' claim that algorithms using numerical gradients can be very fragile does not agree with my experience; I have even had success with less accurate one-sided (forward) gradients, which I used in the past when computers were slower. maxlik0 approximates the Fisher information by B_n(θ) = n^{-1} Σ_{i=1}^{n} ∇l_i(θ) {∇l_i(θ)}^T, where l_i is the log-likelihood contribution of the i-th observation. I will call algorithms using B_n(θ_k) Gauss-Newton-type, because of similarities with the Gauss-Newton algorithm for nonlinear least-squares.
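To make this kind of computation concrete, here is a minimal sketch, in Python rather than the MATLAB of maxlik0, of a two-sided numerical gradient and of the outer-product approximation B_n(θ); the function names num_grad and outer_product_info, and the per-observation log-likelihood loglik_i they assume, are hypothetical names used only for illustration.

    import numpy as np

    def num_grad(f, theta, h=1e-5):
        """Two-sided (central) numerical gradient of a scalar function f at theta."""
        theta = np.asarray(theta, dtype=float)
        g = np.zeros_like(theta)
        for j in range(theta.size):
            e = np.zeros_like(theta)
            e[j] = h
            g[j] = (f(theta + e) - f(theta - e)) / (2.0 * h)
        return g

    def outer_product_info(loglik_i, theta, n):
        """B_n(theta) = n^{-1} sum_i grad l_i(theta) {grad l_i(theta)}^T, where
        loglik_i(theta, i) returns the log-likelihood contribution of observation i."""
        theta = np.asarray(theta, dtype=float)
        B = np.zeros((theta.size, theta.size))
        for i in range(n):
            g_i = num_grad(lambda th: loglik_i(th, i), theta)
            B += np.outer(g_i, g_i)
        return B / n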

The authors state that this approach can be very unstable due to variation in the estimated information matrix. I have not found this to be true, even for small samples. A rough approximation to the Fisher information is adequate, provided that it is positive definite and one uses step-halving to guarantee that each step increases the likelihood. To appreciate this, consider the algorithm

θ_k = θ_{k-1} + δ A n^{-1} ∇l(θ_{k-1}),

where A is positive definite and δ is positive. As δ → 0 (with A and n fixed), we have

l(θ_k) = l(θ_{k-1}) + δ n^{-1} {∇l(θ_{k-1})}^T A ∇l(θ_{k-1}) + o(δ).

Because A is positive definite, for δ sufficiently small,

l(θ_k) > l(θ_{k-1}),    (1)

unless ∇l(θ_{k-1}) = 0. Step-halving starts with δ = 1 and halves δ until (1) holds. Step-halving is a widely used technique and is an important component of my maxlik0 algorithm. Even if A^{-1} is the exact negative Hessian of l, step-halving may be needed to achieve an increase in l. Clearly, B_n(θ_k) must be positive semi-definite, but it is not guaranteed to be positive definite. However, even if B_n(θ_k) is only positive semi-definite, one can easily show that {∇l(θ_k)}^T B_n(θ_k) ∇l(θ_k) > 0 unless ∇l(θ_k) = 0.

After convergence, maxlik0 uses a numerical Hessian to approximate the observed Fisher information matrix. If one worries about the accuracy of numerical Hessians, then the bootstrap can be used. Bootstrapping is a more computationally intensive but generally more accurate way to compute standard errors, even if the exact Fisher information (observed or theoretical) can be found. For standard errors, I am not aware of anyone recommending B_n(θ_k) as an approximate observed Fisher information, and I do not recommend this myself; this is a case where the variability of B_n(θ_k) may cause a problem. However, I_n^{-1} B_n I_n^{-1}, where I_n is the observed Fisher information and I_n and B_n are evaluated at the final iterate, is the well-known sandwich formula, which is called robust since it consistently estimates standard errors even under model misspecification.
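To make the step-halving recursion above concrete, here is a minimal sketch, again in Python and not the actual maxlik0 code, of a Gauss-Newton-type ascent that scales the gradient by the inverse of an information approximation such as B_n and halves δ until the log-likelihood increases. The small ridge added to B_n is my own safeguard for the case where B_n is only positive semi-definite, not something taken from the paper, and the n^{-1} factor in the recursion above is simply absorbed by the step-halving on δ.

    import numpy as np

    def gauss_newton_type(loglik, grad, info, theta0, max_iter=200, tol=1e-8):
        """Maximize a log-likelihood by theta_k = theta_{k-1} + delta * A * grad(theta_{k-1}),
        with A the inverse of an information approximation (e.g., B_n) and step-halving
        on delta so that each accepted step increases the likelihood."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad(theta)
            B = info(theta) + 1e-10 * np.eye(theta.size)   # ridge guards against a singular B_n
            step = np.linalg.solve(B, g)                   # ascent direction B^{-1} grad l
            delta, ll_old = 1.0, loglik(theta)
            # Step-halving: start at delta = 1 and halve until the likelihood increases.
            while loglik(theta + delta * step) <= ll_old and delta > 1e-12:
                delta *= 0.5
            theta_new = theta + delta * step
            if np.max(np.abs(theta_new - theta)) < tol:
                return theta_new
            theta = theta_new
        return theta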

To test maxlik0 on a problem studied by the authors, I simulated data from the bivariate exponential copula model in Example 6.2 with n = 10, 1/α_2 = 10, and ρ = 0.7. As starting values I used α_j = ȳ_j, as suggested by the authors, and ρ equal to the correlation between Φ^{-1}{F_1(y_1; α_1)} and Φ^{-1}{F_2(y_2; α_2)}. To guarantee that the estimates satisfied their natural constraints, I used the reparameterization α_j = exp(θ_j), j = 1, 2, and ρ = 2H(θ_3) − 1, where H is the logistic function and θ_1, θ_2, θ_3 are unconstrained. Often, as here, the likelihood is either not defined or is complex-valued outside the parameter space, so reparameterization is useful to ensure that all iterates stay in the parameter space.

The upper-left plot in Figure 1 plots the starting value of ρ against the estimate from maxlik0 for 100 simulations. In the axis label, MLE is the estimator calculated by maxlik0. I also calculated the MLE using fminunc in the MATLAB Optimization Toolbox. The fminunc parameter LargeScale was turned off so that the BFGS quasi-Newton method was used. This estimator is called MLE2. The upper-right plot shows that the two MLEs are nearly identical. The lower plots in Figure 1 show results for estimation of α_1. In this case, the starting value is rather close to the MLEs. In this example, I did not encounter any of the numerical difficulties mentioned by the authors, even though the sample size was small, though perhaps these problems would show up in higher dimensions. I also tested a derivative-free method, the Nelder-Mead simplex algorithm, implemented in MATLAB's fminsearch, and it gave results that were indistinguishable from those of maxlik0 and fminunc. It would be interesting to compare maximization by parts and other algorithms such as maxlik0, fminunc, and fminsearch on high-dimensional copula models.
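For readers who want to reproduce this kind of experiment, the following sketch shows a reparameterized negative log-likelihood for a bivariate Gaussian-copula model with exponential margins, with scipy.optimize.minimize (BFGS and Nelder-Mead) playing the roles of fminunc and fminsearch. This is illustrative Python rather than my MATLAB code, and the choice of α_j as the exponential mean is an assumption made for the sketch, not a detail taken from the paper.

    import numpy as np
    from scipy.stats import norm, expon
    from scipy.optimize import minimize
    from scipy.special import expit

    def neg_loglik(psi, y1, y2):
        """Negative log-likelihood of a bivariate Gaussian-copula model with exponential
        margins, in the unconstrained parameterization alpha_j = exp(psi_j),
        rho = 2*expit(psi_3) - 1."""
        a1, a2 = np.exp(psi[0]), np.exp(psi[1])
        rho = 2.0 * expit(psi[2]) - 1.0
        # Marginal log-densities and probability transforms (alpha_j taken as the mean).
        u1, u2 = expon.cdf(y1, scale=a1), expon.cdf(y2, scale=a2)
        z1, z2 = norm.ppf(u1), norm.ppf(u2)
        log_marg = expon.logpdf(y1, scale=a1) + expon.logpdf(y2, scale=a2)
        # Log of the bivariate Gaussian copula density.
        log_cop = (-0.5 * np.log(1.0 - rho**2)
                   - (rho**2 * (z1**2 + z2**2) - 2.0 * rho * z1 * z2)
                   / (2.0 * (1.0 - rho**2)))
        return -np.sum(log_marg + log_cop)

    # Usage sketch: starting values from the marginal means and the correlation of the
    # normal scores, then quasi-Newton (BFGS) and Nelder-Mead as stand-ins for
    # fminunc and fminsearch.
    # y1, y2 = ...  # observed data
    # z1 = norm.ppf(expon.cdf(y1, scale=y1.mean()))
    # z2 = norm.ppf(expon.cdf(y2, scale=y2.mean()))
    # rho0 = np.corrcoef(z1, z2)[0, 1]
    # psi0 = np.array([np.log(y1.mean()), np.log(y2.mean()), np.log((1 + rho0) / (1 - rho0))])
    # fit_bfgs = minimize(neg_loglik, psi0, args=(y1, y2), method="BFGS")
    # fit_nm = minimize(neg_loglik, psi0, args=(y1, y2), method="Nelder-Mead")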

Integrating out latent variables is the most difficult numerical problem associated with maximum likelihood. The integration method depends heavily on the dimension of the latent variables. The conventional wisdom is that numerical integration is suitable for low dimension, say 3 or less. Importance sampling works for somewhat higher dimensions, and MCMC is effective in very high dimensions. One should use numerical integration when the dimension allows, as the discussion in Section 7 illustrates. My own experience has been with penalized splines (Ruppert, Wand, and Carroll, 2003), where the dimension is usually 10 or higher and numerical integration is not feasible. I have also worked on an application in environmental engineering (Crainiceanu, Stedinger, Ruppert, and Behr, 2003) where the model was a combination of two GLMMs, one with Poisson responses and normal random effects. There were thousands of random effects in a three-level hierarchical structure, and MCMC implemented in WinBUGS was quite satisfactory. Despite the size and complexity of this problem, we did not experience the type of problems suggested by the authors' statement that "One concern about the choice of normal random effects is the numerical difficulty of evaluating integrals for the marginal likelihood."

Monte Carlo methods such as importance sampling and MCMC are computationally intensive and not that well suited for combination with iterative algorithms. Although many theoretical papers have been written on this topic, practitioners seem happier with the single MCMC run of a Bayesian analysis than with an MCMC run at each step of an iteration to find the MLE. At least, that is my conclusion from reading applied papers, informal conversations with other statisticians, and looking at the software currently available, say in R or WinBUGS. One reason I like a Bayesian MCMC analysis is that it is often an alternative to Monte Carlo EM, which, as shown by the authors, can be exceedingly slow.

In Example 2 of Section 3, one could use the MLE from the working model where [α | θ] is Gaussian as a starting value for direct maximization of the likelihood, say using the quasi-Newton algorithm in fminunc. As the authors note, this preliminary estimator is consistent, in fact, root-n consistent. Direct maximization requires that one compute the integral

p(y | θ) = ∫ φ(y | θ, α) p(α | θ) dα,

but there are many ways of doing this, and the integral should be no more difficult to compute than l_e using (10). It would be interesting to know how maximization by parts compares with direct maximization of the likelihood in terms of computational accuracy and efficiency. An issue for consideration when computing l_e using (10) is that p(α | θ)/φ(α | θ) will be unbounded if p(α | θ) has heavier-than-Gaussian tails.
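As one concrete way to evaluate an integral of this form, here is a sketch of simple importance sampling with a Gaussian proposal. The function and argument names (log_joint, prop_mean, prop_cov) are hypothetical, and the code is meant only to illustrate the idea, including why the weights misbehave when p(α | θ) has heavier-than-Gaussian tails.

    import numpy as np
    from scipy.stats import multivariate_normal

    def marginal_loglik_is(log_joint, theta, prop_mean, prop_cov, n_draws=2000, rng=None):
        """Approximate log p(y | theta) = log of the integral of p(y, alpha | theta) over alpha
        by importance sampling from a Gaussian proposal N(prop_mean, prop_cov).
        log_joint(alpha, theta) must return log p(y, alpha | theta) for the observed y."""
        rng = np.random.default_rng() if rng is None else rng
        draws = rng.multivariate_normal(prop_mean, prop_cov, size=n_draws)
        draws = draws.reshape(n_draws, -1)
        proposal = multivariate_normal(mean=prop_mean, cov=prop_cov)
        # Log importance weights: log p(y, alpha | theta) - log q(alpha).
        # If p(alpha | theta) has heavier-than-Gaussian tails, these weights are unbounded
        # and the resulting estimate can be very unstable.
        log_w = np.array([log_joint(a, theta) for a in draws]) - proposal.logpdf(draws)
        m = log_w.max()
        return m + np.log(np.mean(np.exp(log_w - m)))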

One principle that I believe in strongly is that statisticians should be encouraged to explore alternative models and should avoid letting computational convenience dictate model choice. For this reason, I prefer very general algorithms such as quasi-Newton and Gauss-Newton-type that can be used with nearly any likelihood, at least if there are no insurmountable integration problems. (I like a Bayesian MCMC analysis even more since it handles integration problems so effectively.) I also prefer numerical gradients, despite widespread warnings from numerical analysts about their dangers. If one must program the gradient for each model under consideration, then there is a strong disincentive to explore alternative models.

This paper raises more questions than it answers, though good papers often do that. Is maximization by parts widely applicable, or is it a specialized tool because it requires a suitable decomposition, l_w + l_e, where l_w is easy to maximize? How stable and reliable are quasi-Newton and Gauss-Newton-type algorithms compared to maximization by parts? Should we use maximum likelihood, where an integration is needed at each step of an iterative algorithm, or, instead, Bayesian MCMC, where all integrals can be approximated by averaging over a single MCMC sample?

References

Crainiceanu, C., Stedinger, J., Ruppert, D., and Behr, C. (2003), "Modeling the United States National Distribution of Waterborne Pathogen Concentrations with Application to Cryptosporidium parvum," Water Resources Research, 39, no. 9, 1235-1249.

Lehmann, E. (1999), Elements of Large-Sample Theory, Springer, New York.

Ruppert, D., Wand, M., and Carroll, R. (2003), Semiparametric Regression, Cambridge University Press, Cambridge.

Figure 1: Comparison of starting estimates and two methods of computing the MLE for a bivariate exponential copula model. ρ-starting means the starting value of ρ, α_1-MLE is the MLE of α_1 from maxlik0, and so forth.