Discussion of "Maximization by Parts in Likelihood Inference"

David Ruppert
School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 14853
email: dr24@cornell.edu

I thank the authors for a stimulating paper on an important topic. Maximum likelihood is the most widely used estimation technique, though certainly Bayesian methods are catching up with the development of MCMC. Computational methods for the MLE are well understood in many important cases, e.g., nonlinear regression and generalized linear models, where software is readily available. However, for models that do not fit into these standard classes, there is surprisingly little discussion in the literature of practical issues. In particular, considering the importance of the topic, there is relatively little advice about which maximization methods are reliable and when. Maximization by parts might be an important new tool. The authors have done an excellent job describing its properties and showing by example that it can handle a variety of problems. The next step is to compare maximization by parts with other numerical methods for likelihood maximization.

In this comment, I will discuss my own (limited) experience with maximum likelihood computations. My experience with techniques such as quasi-Newton methods, numerical differentiation, and what I call below Gauss-Newton-type algorithms has been much more positive than what the authors' comments suggest. I have avoided EM in my own work, especially Monte Carlo EM, because there is much evidence in the literature that Monte Carlo EM algorithms are exceedingly slow. Thus, I was not surprised that the authors also found Monte Carlo EM to be very computationally expensive.

There are at least two issues that must always be addressed, plus a third that arises
when there are latent variables:

1. finding a starting value for iterative algorithms,
2. calculating derivatives and Hessians,
3. numerically integrating out latent variables (missing data) from their joint density with the observed data to obtain the likelihood of the latter.

Starting values can be found either by using an inefficient, but easily calculated, estimator or by maximizing the likelihood on some grid. There is a well-known result that, starting from a root-n consistent preliminary estimator, one step of Newton-Raphson is asymptotically efficient (see, for example, Lehmann (1999), Theorem 7.3.3). Of course, more than one step might be preferable in finite samples, and one often does not use exact Newton-Raphson since it requires computation of second derivatives. But the principle is that a good starting value is a valuable commodity. The authors use the maximizer of l_w as a starting value for maximization by parts. This estimator could be used to start other algorithms, though there are many other choices to consider. Often there is no obvious preliminary estimator, and then a grid search is needed. Computers are so fast nowadays that searching a rather large grid is feasible. In higher dimensions, a Latin hypercube can be used to control the size of the grid. Random searches are also possible. When bootstrapping, extensive grid searches might be too slow, but then there is an obvious preliminary estimator, the MLE from the original sample.

I have programmed in MATLAB a fairly standard maximum likelihood algorithm, currently called maxlik0, which I find to be rather reliable. This program does not require that the gradient and Hessian be programmed but instead uses two-sided (central) numerical gradients. The authors' claim that algorithms using numerical gradients can be very fragile does not agree with my experience; I have even had success with less accurate one-sided (forward) gradients, which I used in the past when computers were slower.
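The two kinds of numerical gradients mentioned above can be sketched as follows. This is a minimal Python illustration, not the MATLAB code of maxlik0 itself; the function names and the step size h are my own choices:

```python
import numpy as np

def num_gradient(f, theta, h=1e-5):
    """Two-sided (central) finite-difference gradient of a scalar function f:
    grad_j = (f(theta + h e_j) - f(theta - h e_j)) / (2h), error O(h^2)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (f(theta + e) - f(theta - e)) / (2.0 * h)
    return grad

def num_gradient_forward(f, theta, h=1e-5):
    """One-sided (forward) differences: less accurate (error O(h)) but
    roughly half the number of function evaluations."""
    theta = np.asarray(theta, dtype=float)
    f0 = f(theta)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (f(theta + e) - f0) / h
    return grad
```

The central version costs 2p function evaluations per gradient (p the parameter dimension) versus p + 1 for the forward version, which is why forward differences were attractive when computers were slower.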
maxlik0 approximates the Fisher information by B_n(θ) = Σ_{i=1}^n ∇l_i(θ){∇l_i(θ)}^T, where ∇l_i is the score vector of the i-th observation. I will call algorithms using
B_n(θ_k) Gauss-Newton-type, because of similarities with the Gauss-Newton algorithm for nonlinear least squares. The authors state that this approach can be very unstable due to variation in the estimated information matrix. I have not found this to be true, even for small samples. A rough approximation to the Fisher information is adequate, provided that it is positive definite and one uses step-halving to guarantee that each step increases the likelihood. To appreciate this, consider the algorithm

θ_{k+1} = θ_k + δ A ∇l(θ_k),

where A is positive definite and δ is positive. As δ → 0 (with A and n fixed), we have

l(θ_{k+1}) = l(θ_k) + δ {∇l(θ_k)}^T A ∇l(θ_k) + o(δ).

Because A is positive definite, for δ sufficiently small

l(θ_{k+1}) > l(θ_k),    (1)

unless ∇l(θ_k) = 0. Step-halving starts with δ = 1 and halves δ until (1) holds. Step-halving is a widely used technique and is an important component of my maxlik0 algorithm. Even if A is the inverse of the exact negative Hessian of l, step-halving may be needed to achieve an increase in l. Clearly, B_n(θ_k) must be positive semi-definite, but it is not guaranteed to be positive definite. However, even if B_n(θ_k) is only positive semi-definite, one can easily show that {∇l(θ_k)}^T B_n(θ_k)^{-} ∇l(θ_k) > 0 unless ∇l(θ_k) = 0, where B_n(θ_k)^{-} is a generalized inverse; the total score ∇l(θ_k) = Σ_i ∇l_i(θ_k) lies in the column space of B_n(θ_k).

After convergence, maxlik0 uses a numerical Hessian to approximate the observed Fisher information matrix. If one worries about the accuracy of numerical Hessians, then the bootstrap can be used. Bootstrapping is a more computationally intensive but generally more accurate way to compute standard errors, even if the exact Fisher information (observed or theoretical) can be found. For standard errors, I am not aware of anyone recommending B_n(θ_k) as an approximation to the observed Fisher information, and I do not recommend this myself; this is a case where the variability of B_n(θ_k) may cause a problem.
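To make the step-halving recipe concrete, here is a sketch of one Gauss-Newton-type iteration in Python (maxlik0 itself is MATLAB code; the function names and the tiny ridge added to guard against a merely semi-definite B_n are my own assumptions):

```python
import numpy as np

def gn_step(loglik, scores, theta, max_halvings=30):
    """One Gauss-Newton-type ascent step with step-halving.

    loglik(theta) -> scalar log-likelihood l(theta)
    scores(theta) -> (n, p) array whose i-th row is the score of observation i
    """
    S = scores(theta)
    g = S.sum(axis=0)                      # total score, the gradient of l
    B = S.T @ S                            # B_n(theta): sum of score outer products
    B = B + 1e-10 * np.eye(B.shape[0])     # assumed ridge, in case B_n is singular
    direction = np.linalg.solve(B, g)
    ell0 = loglik(theta)
    delta = 1.0                            # step-halving starts with delta = 1
    for _ in range(max_halvings):
        theta_new = theta + delta * direction
        if loglik(theta_new) > ell0:       # condition (1): the likelihood increased
            return theta_new
        delta /= 2.0                       # halve delta and try again
    return theta                           # score numerically zero; no increase found
```

For example, with loglik and scores for an N(μ, 1) sample, repeated calls to gn_step drive μ to the sample mean, with the likelihood increasing at every step.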
However, I_n^{-1} B_n I_n^{-1}, where I_n is the observed Fisher information and both I_n and B_n are evaluated at the final iterate, is the well-known sandwich formula, which is called robust because it consistently estimates standard errors even under model misspecification.

To test maxlik0 on a problem studied by the authors, I simulated data from the
bivariate exponential copula model in Example 6.2 with n = 10, 1/α_2 = 10, and ρ = 0.7. As starting values, I used α_j = ȳ_j, as suggested by the authors, and ρ equal to the correlation between Φ^{-1}{F_1(y_1; α_1)} and Φ^{-1}{F_2(y_2; α_2)}. To guarantee that the estimates satisfied their natural constraints, I used the reparameterization α_j = exp(θ_j), j = 1, 2, and ρ = 2H(θ_3) − 1, where H is the logistic function, so that θ_1, θ_2, θ_3 were unconstrained. Often, as here, the likelihood is either not defined or is complex-valued outside the parameter space, so reparameterization is useful to ensure that all iterates remain in the parameter space.

The upper-left plot in Figure 1 plots the starting value for ρ against the estimate from maxlik0 for 100 simulations. In the axis label, MLE is the estimator calculated by maxlik0. I also calculated the MLE using fminunc in MATLAB's Optimization Toolbox. The fminunc parameter LargeScale was turned off so that the BFGS quasi-Newton method was used. This estimator is called MLE2. The upper-right plot shows that the two MLEs are nearly identical. The lower plots in Figure 1 show results for estimation of α_1. In this case, the starting value is rather close to the MLEs. In this example, I did not encounter any of the numerical difficulties mentioned by the authors even though the sample size was small, though perhaps these problems would show up in higher dimensions. I also tested a derivative-free method, the Nelder-Mead simplex algorithm, implemented in MATLAB's fminsearch, and it gave results that were indistinguishable from those of maxlik0 and fminunc. It would be interesting to compare maximization by parts and other algorithms such as maxlik0, fminunc, and fminsearch on high-dimensional copula models.

Integrating out latent variables is the most difficult numerical problem associated with maximum likelihood. The integration method depends heavily on the dimension of the latent variables.
The conventional wisdom is that numerical integration is suitable in low dimensions, say three or fewer. Importance sampling works for somewhat higher dimensions, and MCMC is effective in very high dimensions. One should use numerical integration when the dimension allows, as the discussion in Section 7 illustrates. My own experience has been
with penalized splines (Ruppert, Wand, and Carroll, 2003), where the dimension is usually 10 or higher and numerical integration is not feasible. I have also worked on an application in environmental engineering (Crainiceanu, Stedinger, Ruppert, and Behr, 2003) where the model was a combination of two GLMMs, one with Poisson responses and normal random effects. There were thousands of random effects in a three-level hierarchical structure, and MCMC implemented in WinBUGS was quite satisfactory. Despite the size and complexity of this problem, we did not experience the type of problems suggested by the authors' statement that "One concern about the choice of normal random effects is the numerical difficulty of evaluating integrals for the marginal likelihood."

Monte Carlo methods such as importance sampling and MCMC are computationally intensive and not that well suited for combination with iterative algorithms. Although many theoretical papers have been written on this topic, practitioners seem happier with the single MCMC run of a Bayesian analysis than with an MCMC run at each step of an iteration to find the MLE. At least, that is my conclusion from reading applied papers, from informal conversations with other statisticians, and from looking at the software currently available, say in R or WinBUGS. One reason I like a Bayesian MCMC analysis is that it is often an alternative to Monte Carlo EM, which, as shown by the authors, can be exceedingly slow.

In Example 2 of Section 3, one could use the MLE from the working model where [α | θ] is Gaussian as a starting value for direct maximization of the likelihood, say using the quasi-Newton algorithm in fminunc. As the authors note, this preliminary estimator is consistent, in fact, root-n consistent. Direct maximization requires that one compute the integral p(y | θ) = ∫ φ(y | θ, α) p(α | θ) dα, but there are many ways of doing this, and the integral should be no more difficult to compute than l_e using (10).
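For a single Gaussian latent variable, one standard way to compute such an integral is Gauss-Hermite quadrature. The sketch below is in Python rather than MATLAB and assumes, purely for illustration, that the latent variable is N(0, σ²); the function names are mine:

```python
import numpy as np

def marginal_loglik(y, cond_density, sigma, n_nodes=30):
    """Approximate log p(y) = log INT p(y | alpha) p(alpha) d(alpha) by
    Gauss-Hermite quadrature, assuming one latent alpha ~ N(0, sigma^2).

    cond_density(y, alpha) -> p(y | alpha), vectorized over an array of alphas.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variables alpha = sqrt(2) * sigma * x turns the N(0, sigma^2)
    # density into the Gauss-Hermite weight exp(-x^2), up to 1/sqrt(pi).
    alpha = np.sqrt(2.0) * sigma * nodes
    integral = np.sum(weights * cond_density(y, alpha)) / np.sqrt(np.pi)
    return np.log(integral)
```

As a check, if y | α ~ N(α, 1) and α ~ N(0, 1), the marginal is N(0, 2), and the quadrature approximation matches the exact N(0, 2) log-density essentially to machine precision with a modest number of nodes.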
It would be interesting to know how maximization by parts compares with direct maximization of the likelihood in terms of computational accuracy and efficiency. An issue for consideration when computing l_e using (10) is that p(α | θ)/φ(α | θ) will be unbounded if p(α | θ) has heavier-than-Gaussian tails.
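A quick numerical illustration of this unboundedness (my own example, with a Student-t density standing in for a heavier-than-Gaussian p(α | θ)):

```python
import math

def weight(alpha, df=3):
    """Ratio p(alpha)/phi(alpha) with p a Student-t density (df degrees of
    freedom, heavier tails) and phi the standard normal density.
    The ratio grows without bound as |alpha| increases."""
    t_pdf = (math.gamma((df + 1) / 2)
             / (math.gamma(df / 2) * math.sqrt(df * math.pi))
             * (1 + alpha ** 2 / df) ** (-(df + 1) / 2))
    normal_pdf = math.exp(-alpha ** 2 / 2) / math.sqrt(2 * math.pi)
    return t_pdf / normal_pdf
```

The ratio is near 1 at α = 0 but explodes in the tails (it exceeds 10^6 well before α = 8 for df = 3), so importance weights under a Gaussian approximation to a heavy-tailed latent density have infinite variance.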
One principle that I believe in strongly is that statisticians should be encouraged to explore alternative models and should avoid letting computational convenience dictate model choice. For this reason, I prefer very general algorithms, such as quasi-Newton and Gauss-Newton-type algorithms, that can be used with nearly any likelihood, at least if there are no insurmountable integration problems. (I like a Bayesian MCMC analysis even more, since it handles integration problems so effectively.) I also prefer numerical gradients, despite widespread warnings from numerical analysts about their dangers. If one must program the gradient for each model under consideration, then there is a strong disincentive to explore alternative models.

This paper raises more questions than it answers, though good papers often do that. Is maximization by parts widely applicable, or is it a specialized tool because it requires a suitable decomposition, l_w + l_e, where l_w is easy to maximize? How stable and reliable are quasi-Newton and Gauss-Newton-type algorithms compared to maximization by parts? Should we use maximum likelihood where an integration is needed at each step of an iterative algorithm or, instead, Bayesian MCMC where all integrals can be approximated by averaging over a single MCMC sample?

References

Crainiceanu, C., Stedinger, J., Ruppert, D., and Behr, C. (2003), "Modeling the United States National Distribution of Waterborne Pathogen Concentrations with Application to Cryptosporidium parvum," Water Resources Research, 39, no. 9, 1235-1249.

Lehmann, E. (1999), Elements of Large-Sample Theory, Springer, New York.

Ruppert, D., Wand, M., and Carroll, R. (2003), Semiparametric Regression, Cambridge University Press, Cambridge.
[Figure 1 appears here: four scatter plots comparing ρ-starting with ρ-MLE, ρ-MLE with ρ-MLE2, α_1-starting with α_1-MLE, and α_1-MLE with α_1-MLE2.]

Figure 1: Comparison of starting estimates and two methods of computing the MLE for a bivariate exponential copula model. ρ-starting means the starting value of ρ, α_1-MLE is the MLE of α_1 from maxlik0, and so forth.