Submitted to the Brazilian Journal of Probability and Statistics

Multivariate normal approximation of the maximum likelihood estimator via the delta method

Andreas Anastasiou (The London School of Economics and Political Science) and Robert E. Gaunt (The University of Manchester)

Abstract. We use the delta method and Stein's method to derive, under regularity conditions, explicit upper bounds for the distributional distance between the distribution of the maximum likelihood estimator (MLE) of a $d$-dimensional parameter and its asymptotic multivariate normal distribution. Our bounds apply in situations in which the MLE can be written as a function of a sum of i.i.d. $t$-dimensional random vectors. We apply our general bound to establish a bound for the multivariate normal approximation of the MLE of the normal distribution with unknown mean and variance.

MSC 2010 subject classifications: Primary 60F05, 62E17, 62F12.
Keywords and phrases: Multi-parameter maximum likelihood estimation, multivariate normal distribution, Stein's method.

1 Introduction

The asymptotic normality of maximum likelihood estimators (MLEs), under regularity conditions, is one of the most well-known and fundamental results in mathematical statistics. In a very recent paper, Anastasiou (2016) obtained explicit upper bounds on the distributional distance between the distribution of the MLE of a $d$-dimensional parameter and its asymptotic multivariate normal distribution. In the present paper, we use a different approach to derive such bounds in the special situation where the MLE can be written as a function of a sum of i.i.d. $t$-dimensional random vectors. In this setting, our bounds improve on (or are at least as good as) those of Anastasiou (2016) in terms of sharpness, simplicity and strength of distributional distance. The process used to derive our bounds consists of a combination of Stein's method and the multivariate delta method. It is a natural generalisation of the approach used by Anastasiou and Ley (2017), who obtained bounds that improve on those of Anastasiou and Reinert (2017) for the asymptotic normality of the single-parameter MLE, but only in cases where the MLE can be expressed as a function of a sum of i.i.d. random variables.

Let us now introduce the notation and setting of this paper. Let $\theta = (\theta_1, \ldots, \theta_d)^\top$ denote the unknown parameter found in a statistical model. Let $\theta_0 = (\theta_{01}, \ldots, \theta_{0d})^\top$ be the true (still unknown) value and let $\Theta \subseteq \mathbb{R}^d$ denote the parameter space. Suppose $X = (X_1, \ldots, X_n)$ is a random sample of $n$ i.i.d. $t$-dimensional random vectors with joint probability density (or mass) function $f(x \mid \theta)$, where $x = (x_1, x_2, \ldots, x_n)$. The likelihood function is $L(\theta; x) = f(x \mid \theta)$, and its natural logarithm, called the log-likelihood function, is denoted by $\ell(\theta; x)$. We shall assume that the MLE $\hat{\theta}_n(X)$ of the parameter of interest $\theta$ exists and is unique. It should, however, be noted that existence and uniqueness of the MLE cannot be taken for granted; see, for example, Billingsley (1961) for an example of non-uniqueness. Assumptions that ensure existence and uniqueness of the MLE are given in Mäkeläinen et al. (1981).

In addition to assuming the existence and uniqueness of the MLE, we shall assume that the MLE takes the following form. Let $q : \Theta \rightarrow \mathbb{R}^d$ be such that all the entries of the function $q$ have continuous partial derivatives with respect to $\theta$ and that

$$ q\big(\hat{\theta}_n(x)\big) = \frac{1}{n}\sum_{i=1}^{n} g(x_i) = \frac{1}{n}\sum_{i=1}^{n} \big(g_1(x_i), \ldots, g_d(x_i)\big)^\top \qquad (1.1) $$

for some $g : \mathbb{R}^t \rightarrow \mathbb{R}^d$. This setting is a natural generalisation of that of Anastasiou and Ley (2017) to multi-parameter MLEs. The representation (1.1) for the MLE allows us to approximate the MLE using a different approach to that given in Anastasiou (2016), which results in a bound that improves on the one obtained there. As we shall see in Section 3, a simple and important example of an MLE of the form (1.1) is given by the normal distribution with unknown mean and variance:

$$ f(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}, \qquad x \in \mathbb{R}. $$

In this paper, we derive general bounds for the distributional distance between an MLE of the form (1.1) and its limiting multivariate normal distribution. The rest of the paper is organised as follows. In Section 2, we derive a general upper bound on the distributional distance between the distribution of the MLE of a $d$-dimensional parameter and the limiting multivariate normal distribution for situations in which (1.1) is satisfied. In Section 3, we apply our general bound to obtain a bound for the multivariate normal approximation of the MLE of the normal distribution with unknown mean and variance.
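As a quick numerical illustration of the representation (1.1), the following short Python sketch (not part of the original paper; the parameter values and function names are arbitrary) checks that, for the normal model with unknown mean and variance, the MLE satisfies (1.1) with the choices of $q$ and $g$ that are used later in Section 3.

```python
import numpy as np

# Minimal numerical check of the representation (1.1) for the N(mu, sigma^2)
# example worked out in Section 3 (illustrative values only).
rng = np.random.default_rng(0)
mu, sigma2, n = 1.0, 2.0, 500
x = rng.normal(mu, np.sqrt(sigma2), size=n)

# MLE of theta = (mu, sigma^2): sample mean and (biased) sample variance.
theta_hat = np.array([x.mean(), np.mean((x - x.mean()) ** 2)])

# q(theta) = (theta_1, theta_2 + (theta_1 - mu)^2) and g(x_i) = (x_i, (x_i - mu)^2),
# as in Section 3; q(theta_hat) should equal (1/n) * sum_i g(x_i), i.e. (1.1).
def q(theta):
    return np.array([theta[0], theta[1] + (theta[0] - mu) ** 2])

g = np.column_stack([x, (x - mu) ** 2])   # n x 2 matrix whose i-th row is g(x_i)
lhs = q(theta_hat)                        # q(theta_hat_n(x))
rhs = g.mean(axis=0)                      # (1/n) * sum_i g(x_i)

print(np.allclose(lhs, rhs))              # True: the MLE has the form (1.1)
```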

2 The general bound

In this section, we give quantitative results that complement the qualitative result for the multivariate normal approximation of the MLE, stated below as Theorem 2.1. In the statement of this theorem, and in the rest of the paper, $I(\theta_0)$ denotes the expected Fisher information matrix for one random vector, while $[I(\theta_0)]^{1/2}$ denotes the principal square root of $I(\theta_0)$. Also, $I_{d \times d}$ denotes the $d \times d$ identity matrix. To avoid confusion with the expected Fisher information matrix, the subscript will always be included in the notation for the $d \times d$ identity matrix. Following Davison (2008), the following assumptions are made in order for the asymptotic normality of the vector MLE to hold:

(R.C.1) The densities defined by any two different values of $\theta$ are distinct.

(R.C.2) $\ell(\theta; x)$ is three times differentiable with respect to the unknown vector parameter $\theta$, and the third partial derivatives are continuous in $\theta$.

(R.C.3) For any $\theta_0 \in \Theta$ and for $\mathcal{X}$ denoting the support of the data, there exist $\epsilon_0 > 0$ and functions $M_{rst}(x)$ (which may depend on $\theta_0$) such that, for $\theta = (\theta_1, \theta_2, \ldots, \theta_d)^\top$ and $r, s, t, j = 1, 2, \ldots, d$,

$$ \Big|\frac{\partial^3 \ell(\theta; x)}{\partial\theta_r\,\partial\theta_s\,\partial\theta_t}\Big| \le M_{rst}(x) \quad \text{for all } x \in \mathcal{X} \text{ and } |\theta_j - \theta_{0,j}| < \epsilon_0, $$

with $E[M_{rst}(X)] < \infty$.

(R.C.4) For all $\theta \in \Theta$, $E_\theta[\nabla \ell(\theta; X_i)] = 0$.

(R.C.5) The expected Fisher information matrix for one random vector, $I(\theta)$, is finite, symmetric and positive definite. For $r, s = 1, 2, \ldots, d$, its elements satisfy

$$ n[I(\theta)]_{rs} = E\Big[\frac{\partial \ell(\theta; X)}{\partial\theta_r}\,\frac{\partial \ell(\theta; X)}{\partial\theta_s}\Big] = -E\Big[\frac{\partial^2 \ell(\theta; X)}{\partial\theta_r\,\partial\theta_s}\Big]. $$

This implies that $nI(\theta)$ is the covariance matrix of $\nabla\ell(\theta; X)$.

These regularity conditions in the multi-parameter case resemble those in Anastasiou and Reinert (2017), where the parameter is scalar. The following theorem gives the qualitative result for the asymptotic normality of a vector MLE (see Davison (2008) for a proof); a short simulation sketch illustrating this convergence is given just below, before the formal statement.
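The following Python sketch (an illustration added here, with arbitrary placeholder parameter values; it is not taken from the paper) simulates the normal model of Section 3 and checks that the standardised MLE appearing in Theorem 2.1 has sample mean close to zero and sample covariance close to the identity matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 1.0, 2.0                 # placeholder parameter values
n, trials = 2_000, 2_000
# [I(theta_0)]^{1/2} for the N(mu, sigma^2) model (computed in Section 3).
I_sqrt = np.diag([1.0 / np.sqrt(sigma2), 1.0 / np.sqrt(2.0 * sigma2 ** 2)])

std_mle = np.empty((trials, 2))
for t in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    theta_hat = np.array([x.mean(), np.mean((x - x.mean()) ** 2)])  # MLE of (mu, sigma^2)
    std_mle[t] = np.sqrt(n) * I_sqrt @ (theta_hat - np.array([mu, sigma2]))

# By the asymptotic normality of the MLE, these should be close to 0 and I_{2x2}.
print(np.round(std_mle.mean(axis=0), 2))
print(np.round(np.cov(std_mle, rowvar=False), 2))
```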

Theorem 2.1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random vectors with probability density (or mass) functions $f(x_i \mid \theta)$, $1 \le i \le n$, where $\theta \in \Theta \subseteq \mathbb{R}^d$. Also let $Z \sim N_d(0, I_{d \times d})$, where $0$ is the $d \times 1$ zero vector and $I_{d \times d}$ is the $d \times d$ identity matrix. Then, under (R.C.1)-(R.C.5),

$$ \sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big) \xrightarrow{d} N_d(0, I_{d \times d}) \quad \text{as } n \to \infty. $$

Let us now introduce some notation. We let $C_b^n(\mathbb{R}^d)$ denote the space of bounded functions on $\mathbb{R}^d$ with bounded $k$-th order derivatives for $k \le n$. We abbreviate $|h|_1 := \sup_i \big\|\frac{\partial}{\partial x_i}h\big\|$ and $|h|_2 := \sup_{i,j} \big\|\frac{\partial^2}{\partial x_i \partial x_j}h\big\|$, where $\|\cdot\| = \|\cdot\|_\infty$ is the supremum norm. For ease of presentation, we let the subscript $(m)$ denote an index for which $|\hat{\theta}_n(x)_{(m)} - \theta_{0(m)}|$ is the largest among the $d$ components:

$$ (m) \in \{1, 2, \ldots, d\} : \; \big|\hat{\theta}_n(x)_{(m)} - \theta_{0(m)}\big| \ge \big|\hat{\theta}_n(x)_j - \theta_{0j}\big| \quad \text{for all } j \in \{1, 2, \ldots, d\}, $$

and also

$$ Q_{(m)} = Q_{(m)}(X, \theta_0) := \big|\hat{\theta}_n(X)_{(m)} - \theta_{0(m)}\big|. \qquad (2.1) $$

The following theorem gives the main result of this paper.

Theorem 2.2. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\mathbb{R}^t$-valued random vectors with probability density (or mass) function $f(x_i \mid \theta)$, for which the parameter space $\Theta$ is an open subset of $\mathbb{R}^d$. Let $Z \sim N_d(0, I_{d \times d})$. Assume that the MLE $\hat{\theta}_n(X)$ exists and is unique and that (R.C.1)-(R.C.5) are satisfied. Let $q : \Theta \rightarrow \mathbb{R}^d$ be twice differentiable and $g : \mathbb{R}^t \rightarrow \mathbb{R}^d$ be such that the special structure (1.1) is satisfied, and suppose that there exist $\epsilon = \epsilon(\theta_0) > 0$ and functions $M_{klm}(X)$, $k, l, m \in \{1, 2, \ldots, d\}$, such that $\big|\frac{\partial^2}{\partial\theta_m\partial\theta_l}q_k(\theta)\big| \le M_{klm}(X)$ for all $\theta \in \Theta$ with $|\theta_j - \theta_{0j}| < \epsilon$, $j = 1, 2, \ldots, d$. Let

$$ \xi_{ij} = \sum_{l=1}^{d}\big[K^{-1/2}\big]_{jl}\big(g_l(X_i) - q_l(\theta_0)\big), \quad \text{where} \quad K = \nabla q(\theta_0)\,[I(\theta_0)]^{-1}\,[\nabla q(\theta_0)]^\top \qquad (2.2) $$

and $\nabla q(\theta_0)$ denotes the Jacobian matrix of $q$ at $\theta_0$, with entries $[\nabla q(\theta_0)]_{jk} = \frac{\partial}{\partial\theta_k}q_j(\theta_0)$.

Also, let $Q_{(m)}$ be as in (2.1). Then, for all $h \in C_b^2(\mathbb{R}^d)$,

$$ \Big| E\Big[h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)\Big)\Big] - E[h(Z)] \Big| \le \frac{\sqrt{2\pi}\,|h|_2}{4 n^{3/2}}\sum_{i=1}^{n}\sum_{j,k,l=1}^{d}\Big\{E\big|\xi_{ij}\xi_{ik}\xi_{il}\big| + \big|E[\xi_{ij}\xi_{ik}]\big|\,E|\xi_{il}|\Big\} $$
$$ \qquad + \frac{2\|h\|}{\epsilon^2}\sum_{j=1}^{d}E\big(\hat{\theta}_n(X)_j - \theta_{0j}\big)^2 + |h|_1\sum_{j=1}^{d}\Big\{E\big[\big|A_j^{(1)}(\theta_0, X)\big| \,\big|\, Q_{(m)} < \epsilon\big] + E\big[\big|A_j^{(2)}(\theta_0, X)\big| \,\big|\, Q_{(m)} < \epsilon\big]\Big\}, \qquad (2.3) $$

where

$$ A_j^{(1)}(\theta_0, X) = \sqrt{n}\sum_{k=1}^{d}\Big([I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1} - K^{-1/2}\Big)_{jk}\Big(q_k\big(\hat{\theta}_n(X)\big) - q_k(\theta_0)\Big), $$
$$ A_j^{(2)}(\theta_0, X) = \frac{\sqrt{n}}{2}\sum_{k=1}^{d}\Big([I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1}\Big)_{jk}\sum_{m=1}^{d}\sum_{l=1}^{d}\big|\hat{\theta}_n(X)_m - \theta_{0m}\big|\,\big|\hat{\theta}_n(X)_l - \theta_{0l}\big|\,M_{klm}(X). $$

Remark 2.3. 1. For fixed $d$, it is clear that the first term is $O(n^{-1/2})$. For the second term, using Cox and Snell (1968) we know that the mean squared error (MSE) of the MLE is of order $n^{-1}$. This yields, for fixed $d$, $\sum_{j=1}^{d}E\big(\hat{\theta}_n(X)_j - \theta_{0j}\big)^2 = O(n^{-1})$. To establish that (2.3) is of order $n^{-1/2}$, it therefore suffices to show that, for the third term, the quantities $E\big[|A_j^{(1)}(\theta_0, X)| \mid Q_{(m)} < \epsilon\big]$ and $E\big[|A_j^{(2)}(\theta_0, X)| \mid Q_{(m)} < \epsilon\big]$ are $O(n^{-1/2})$. For $A_j^{(2)}(\theta_0, X)$, we use the Cauchy-Schwarz inequality to get that

$$ E\Big[\big|\hat{\theta}_n(X)_m - \theta_{0m}\big|\,\big|\hat{\theta}_n(X)_l - \theta_{0l}\big|\,M_{klm}(X)\,\Big|\,Q_{(m)} < \epsilon\Big] \le \Big[E\Big(\big(\hat{\theta}_n(X)_m - \theta_{0m}\big)^2\big(\hat{\theta}_n(X)_l - \theta_{0l}\big)^2\Big)\Big]^{1/2}\Big[E\big[(M_{klm}(X))^2 \mid Q_{(m)} < \epsilon\big]\Big]^{1/2}. \qquad (2.4) $$

Since from Cox and Snell (1968) the MSE is of order $n^{-1}$, we apply (2.4) to the expression for $A_j^{(2)}(\theta_0, X)$ to get that $E\big[|A_j^{(2)}(\theta_0, X)| \mid Q_{(m)} < \epsilon\big] = O(n^{-1/2})$.

Whilst we suspect that the quantities $E\big[|A_j^{(1)}(\theta_0, X)| \mid Q_{(m)} < \epsilon\big]$ are generally $O(n^{-1/2})$, we have been unable to establish this fact. In the univariate $d = 1$ case, this term vanishes (see Anastasiou and Ley (2017)), and this is also the case in our application in Section 3 (in which the matrices $I(\theta_0)$ and $\nabla q(\theta_0)$ are diagonal).

2. An upper bound on the quantity $\big|E[h(\sqrt{n}\,[I(\theta_0)]^{1/2}(\hat{\theta}_n(X) - \theta_0))] - E[h(Z)]\big|$ was given in Anastasiou (2016). A brief comparison of that bound and our bound follows. The bound in Anastasiou (2016) is more general than ours: it covers the case of independent but not necessarily identically distributed random vectors. In addition, the bound in Anastasiou (2016) is more general in the sense that it is applicable whatever the form of the MLE is, whereas our bound can be used only when (1.1) is satisfied. Furthermore, Anastasiou (2016) gives bounds even for cases where the MLE cannot be expressed analytically. On the other hand, our bound is preferable when the MLE has a representation of the form (1.1). Our bound (2.3) applies to a wider class of test functions $h$ (our class is $C_b^2(\mathbb{R}^d)$, as opposed to $C_b^3(\mathbb{R}^d)$) and has a better dependence on the dimension $d$: for fixed $n$, our bound is $O(d^4)$, whereas the bound of Anastasiou (2016) is $O(d^6)$. Moreover, in situations where the MLE is already a sum of independent terms our bound is easier to use, since applying $q(x) = x$ in (1.1) yields $R_2 = 0$ in (2.11) below. This means that our bound (2.3) simplifies to

$$ \Big| E\Big[h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)\Big)\Big] - E[h(Z)] \Big| \le \frac{\sqrt{2\pi}\,|h|_2}{4 n^{3/2}}\sum_{i=1}^{n}\sum_{j,k,l=1}^{d}\Big\{E\big|\xi_{ij}\xi_{ik}\xi_{il}\big| + \big|E[\xi_{ij}\xi_{ik}]\big|\,E|\xi_{il}|\Big\}. $$

Another simplification of the general bound is given in Section 3, where $X_1, X_2, \ldots, X_n$ are taken to be i.i.d. normal random variables with unknown mean and variance, a case which can also be treated by Anastasiou (2016). The simpler terms in our bound lead to a much improved bound for this example (see Remark 3.2 for more details).

3. If, for all $k, m, l \in \{1, 2, \ldots, d\}$, $\frac{\partial^2}{\partial\theta_m\partial\theta_l}q_k(\theta)$ is uniformly bounded in $\Theta$, then we do not need to use $\epsilon$ or conditional expectations in deriving an upper bound (we essentially apply Theorem 2.2 with $\epsilon \to \infty$). This leads to the following simpler bound:

$$ \Big| E\Big[h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)\Big)\Big] - E[h(Z)] \Big| \le \frac{\sqrt{2\pi}\,|h|_2}{4 n^{3/2}}\sum_{i=1}^{n}\sum_{j,k,l=1}^{d}\Big\{E\big|\xi_{ij}\xi_{ik}\xi_{il}\big| + \big|E[\xi_{ij}\xi_{ik}]\big|\,E|\xi_{il}|\Big\} + |h|_1\sum_{j=1}^{d}\Big\{E\big|A_j^{(1)}(\theta_0, X)\big| + E\big|A_j^{(2)}(\theta_0, X)\big|\Big\}. \qquad (2.5) $$

To prove Theorem 2.2, we employ Stein's method and the multivariate delta method. Our strategy consists of benefiting from the special form of $q(\hat{\theta}_n(X))$, which is a sum of random vectors. It is here that the multivariate delta method comes into play: instead of comparing $\hat{\theta}_n(X)$ to $Z \sim N_d(0, I_{d \times d})$, we compare $q(\hat{\theta}_n(X))$ to $Z$ and then find upper bounds on the distributional distance between the distributions of $\hat{\theta}_n(X)$ and $q(\hat{\theta}_n(X))$. As well as recalling the multivariate delta method, we state and prove two lemmas that will be used in the proof of Theorem 2.2.

Theorem 2.4 (Multivariate delta method). Let the parameter $\theta \in \Theta$, where $\Theta \subseteq \mathbb{R}^d$. Furthermore, let $(Y_n)$ be a sequence of random vectors in $\mathbb{R}^d$ which, for $Z \sim N_d(0, I_{d \times d})$, satisfies

$$ \sqrt{n}\,(Y_n - \theta) \xrightarrow{d} \Sigma^{1/2} Z \quad \text{as } n \to \infty, $$

where $\Sigma$ is a symmetric, positive definite covariance matrix. Then, for any function $b : \mathbb{R}^d \rightarrow \mathbb{R}^d$ for which the Jacobian matrix $\nabla b(\theta) \in \mathbb{R}^{d \times d}$ exists and is invertible, we have that, for $\Lambda := \nabla b(\theta)\,\Sigma\,[\nabla b(\theta)]^\top$,

$$ \sqrt{n}\,\big(b(Y_n) - b(\theta)\big) \xrightarrow{d} \Lambda^{1/2} Z \quad \text{as } n \to \infty. $$
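The following Python sketch illustrates Theorem 2.4 with an arbitrary choice of $b$ and $\Sigma$ that is not taken from the paper: the empirical covariance of $\sqrt{n}\,(b(Y_n) - b(\theta))$ is compared with the delta-method covariance $\Lambda = \nabla b(\theta)\,\Sigma\,[\nabla b(\theta)]^\top$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
n, trials = 2_000, 2_000

def b(y):            # an arbitrary smooth map R^2 -> R^2 with invertible Jacobian at theta
    return np.array([y[0] ** 2, y[0] + y[1]])

def jac_b(y):        # Jacobian matrix of b
    return np.array([[2.0 * y[0], 0.0], [1.0, 1.0]])

L = np.linalg.cholesky(Sigma)
vals = np.empty((trials, 2))
for t in range(trials):
    # Y_n is a sample mean of n i.i.d. N(theta, Sigma) vectors, so sqrt(n)(Y_n - theta) ~ N(0, Sigma).
    y_bar = theta + (L @ rng.standard_normal((2, n))).mean(axis=1)
    vals[t] = np.sqrt(n) * (b(y_bar) - b(theta))

Lambda = jac_b(theta) @ Sigma @ jac_b(theta).T   # limiting covariance from the delta method
print(np.round(np.cov(vals, rowvar=False), 2))   # empirical covariance
print(np.round(Lambda, 2))                       # should be close to the line above
```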

Lemma 2.5. Let

$$ W = \frac{1}{\sqrt{n}}K^{-1/2}\sum_{i=1}^{n}\big[g(X_i) - q(\theta_0)\big], \qquad (2.6) $$

where $K$ is as in (2.2). Then $EW = 0$ and $\operatorname{Cov}(W) = I_{d \times d}$.

Proof. By the multivariate central limit theorem,

$$ \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big(g(X_i) - Eg(X_1)\big) \xrightarrow{d} N_d\big(0, \operatorname{Cov}(g(X_1))\big) \quad \text{as } n \to \infty, \qquad (2.7) $$

and, by Theorem 2.4,

$$ \sqrt{n}\,\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big) \xrightarrow{d} N_d(0, K) \quad \text{as } n \to \infty. \qquad (2.8) $$

Since $q(\hat{\theta}_n(X)) = \frac{1}{n}\sum_{i=1}^{n}g(X_i)$ and $X_1, X_2, \ldots, X_n$ are i.i.d. random vectors, it follows that $q(\theta_0) = Eg(X_1)$. Furthermore, $K = \operatorname{Cov}(g(X_1))$, because the limiting multivariate normal distributions in (2.7) and (2.8) must have the same mean and covariance. It is now immediate that $EW = 0$, and a simple calculation using that $K = \operatorname{Cov}(g(X_1))$ yields that $\operatorname{Cov}(W) = I_{d \times d}$.

Lemma 2.6. Suppose that $X_{1,j}, \ldots, X_{n,j}$ are independent for each fixed $j$, but that the random variables $X_{i,1}, \ldots, X_{i,d}$ may be dependent for any fixed $i$. For $j = 1, \ldots, d$, let $W_j = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}X_{ij}$ and denote $W = (W_1, \ldots, W_d)^\top$. Suppose that $E[W] = 0$ and $\operatorname{Cov}(W) = I_{d \times d}$. Then, for all $h \in C_b^2(\mathbb{R}^d)$,

$$ \big|E[h(W)] - E[h(Z)]\big| \le \frac{\sqrt{2\pi}\,|h|_2}{4 n^{3/2}}\sum_{i=1}^{n}\sum_{j,k,l=1}^{d}\Big\{E\big|X_{ij}X_{ik}X_{il}\big| + \big|E[X_{ij}X_{ik}]\big|\,E|X_{il}|\Big\}. \qquad (2.9) $$

Proof. The following bound follows immediately from a lemma of Gaunt and Reinert (2016):

$$ \big|E[h(W)] - E[h(Z)]\big| \le \frac{1}{n^{3/2}}\sup_{i_1, i_2, i_3}\Big\|\frac{\partial^3 f(x)}{\partial x_{i_1}\partial x_{i_2}\partial x_{i_3}}\Big\|\sum_{i=1}^{n}\sum_{j,k,l=1}^{d}\Big\{E\big|X_{ij}X_{ik}X_{il}\big| + \big|E[X_{ij}X_{ik}]\big|\,E|X_{il}|\Big\}, \qquad (2.10) $$

where $f$ solves the so-called Stein equation

$$ \Delta f(w) - w \cdot \nabla f(w) = h(w) - E[h(Z)]. $$

From a bound on the solution of this Stein equation given in Gaunt (2016), we have

$$ \Big\|\frac{\partial^3 f(x)}{\partial x_{i_1}\partial x_{i_2}\partial x_{i_3}}\Big\| \le \frac{\sqrt{2\pi}}{4}\,|h|_2, $$

which, when applied to (2.10), yields (2.9).
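As a sanity check on Lemma 2.5, the following sketch (added here for illustration; parameter values are placeholders) simulates the vector $W$ of (2.6) for the normal model of Section 3, in which $K = \operatorname{diag}(\sigma^2, 2\sigma^4)$, and confirms that its sample mean is close to zero and its sample covariance close to $I_{2 \times 2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = 1.0, 2.0                      # placeholder parameter values
n, trials = 200, 20_000

# K^{-1/2} for K = diag(sigma^2, 2 sigma^4), and q(theta_0) = E g(X_1) = (mu, sigma^2).
K_inv_sqrt = np.diag([1.0 / np.sqrt(sigma2), 1.0 / np.sqrt(2.0 * sigma2 ** 2)])
q_theta0 = np.array([mu, sigma2])

W = np.empty((trials, 2))
for t in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    g_sum = np.array([x.sum(), ((x - mu) ** 2).sum()])        # sum_i g(x_i)
    W[t] = K_inv_sqrt @ (g_sum - n * q_theta0) / np.sqrt(n)   # the vector W of (2.6)

print(np.round(W.mean(axis=0), 3))              # ~ (0, 0), as in Lemma 2.5
print(np.round(np.cov(W, rowvar=False), 3))     # ~ identity matrix
```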

Proof of Theorem 2.2. The triangle inequality yields

$$ \Big| E\Big[h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)\Big)\Big] - E[h(Z)] \Big| \le R_1 + R_2, \qquad (2.11) $$

where

$$ R_1 := \Big| E\Big[h\Big(\sqrt{n}\,K^{-1/2}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)\Big)\Big] - E[h(Z)] \Big|, $$
$$ R_2 := \Big| E\Big[h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)\Big) - h\Big(\sqrt{n}\,K^{-1/2}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)\Big)\Big] \Big|. \qquad (2.12) $$

The rest of the proof focuses on bounding the terms $R_1$ and $R_2$.

Step 1: Upper bound for $R_1$. For ease of presentation, we write

$$ W = \sqrt{n}\,K^{-1/2}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big) = \frac{1}{\sqrt{n}}K^{-1/2}\sum_{i=1}^{n}\big[g(X_i) - q(\theta_0)\big], $$

as in (2.6). Thus, $W = (W_1, W_2, \ldots, W_d)^\top$ with

$$ W_j = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\sum_{l=1}^{d}\big[K^{-1/2}\big]_{jl}\big(g_l(X_i) - q_l(\theta_0)\big) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{ij}, \qquad j \in \{1, \ldots, d\}, $$

where, for each fixed $j$, the $\xi_{ij} = \sum_{l=1}^{d}\big[K^{-1/2}\big]_{jl}\big(g_l(X_i) - q_l(\theta_0)\big)$, $i = 1, \ldots, n$, are independent random variables, since the $X_i$ are assumed to be independent. Therefore, $W_j$ can be expressed as a sum of independent terms. This, together with Lemma 2.5, ensures that the assumptions of Lemma 2.6 are satisfied. Therefore, (2.9) yields

$$ R_1 = \big|E[h(W)] - E[h(Z)]\big| \le \frac{\sqrt{2\pi}\,|h|_2}{4 n^{3/2}}\sum_{i=1}^{n}\sum_{j,k,l=1}^{d}\Big\{E\big|\xi_{ij}\xi_{ik}\xi_{il}\big| + \big|E[\xi_{ij}\xi_{ik}]\big|\,E|\xi_{il}|\Big\}. $$

Step 2: Upper bound for $R_2$. For ease of presentation, we let

$$ C_1 := C_1(h, q, X, \theta_0) := \Big| h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)\Big) - h\Big(\sqrt{n}\,K^{-1/2}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)\Big) \Big|, $$

and our aim is to find an upper bound for $E[C_1]$. Let us denote by $[A]_{[j]}$ the $j$-th row of a matrix $A$. We begin by using a first order Taylor expansion to obtain

$$ h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(x) - \theta_0\big)\Big) = h\Big(\sqrt{n}\,K^{-1/2}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)\Big) + \sqrt{n}\sum_{j=1}^{d}\Big\{\big[[I(\theta_0)]^{1/2}\big]_{[j]}\big(\hat{\theta}_n(x) - \theta_0\big) - \big[K^{-1/2}\big]_{[j]}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)\Big\}\frac{\partial}{\partial x_j}h(t(x)), $$

where $t(x)$ lies between $\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(x) - \theta_0\big)$ and $\sqrt{n}\,K^{-1/2}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)$. Rearranging the terms gives

$$ C_1 \le |h|_1\,\sqrt{n}\sum_{j=1}^{d}\Big|\big[[I(\theta_0)]^{1/2}\big]_{[j]}\big(\hat{\theta}_n(x) - \theta_0\big) - \big[K^{-1/2}\big]_{[j]}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)\Big|. \qquad (2.13) $$

For $\theta_0^*$ between $\hat{\theta}_n(x)$ and $\theta_0$, a second order Taylor expansion of $q_j(\hat{\theta}_n(x))$ about $\theta_0$ gives that, for all $j \in \{1, 2, \ldots, d\}$,

$$ q_j\big(\hat{\theta}_n(x)\big) = q_j(\theta_0) + \sum_{k=1}^{d}\big(\hat{\theta}_n(x)_k - \theta_{0k}\big)\frac{\partial}{\partial\theta_k}q_j(\theta_0) + \frac{1}{2}\sum_{k=1}^{d}\sum_{l=1}^{d}\big(\hat{\theta}_n(x)_k - \theta_{0k}\big)\big(\hat{\theta}_n(x)_l - \theta_{0l}\big)\frac{\partial^2}{\partial\theta_k\partial\theta_l}q_j(\theta_0^*), $$

and after a small rearrangement of the terms we get that

$$ \big[\nabla q(\theta_0)\big]_{[j]}\big(\hat{\theta}_n(x) - \theta_0\big) = q_j\big(\hat{\theta}_n(x)\big) - q_j(\theta_0) - \frac{1}{2}\sum_{k=1}^{d}\sum_{l=1}^{d}\big(\hat{\theta}_n(x)_k - \theta_{0k}\big)\big(\hat{\theta}_n(x)_l - \theta_{0l}\big)\frac{\partial^2}{\partial\theta_k\partial\theta_l}q_j(\theta_0^*). $$

Since this holds for all $j \in \{1, 2, \ldots, d\}$, we have that

$$ \nabla q(\theta_0)\big(\hat{\theta}_n(x) - \theta_0\big) = q\big(\hat{\theta}_n(X)\big) - q(\theta_0) - \frac{1}{2}\sum_{k=1}^{d}\sum_{l=1}^{d}\big(\hat{\theta}_n(x)_k - \theta_{0k}\big)\big(\hat{\theta}_n(x)_l - \theta_{0l}\big)\frac{\partial^2}{\partial\theta_k\partial\theta_l}q(\theta_0^*), $$

which then leads to

$$ \sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big) = \sqrt{n}\,[I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big) - \frac{\sqrt{n}}{2}[I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1}\sum_{k=1}^{d}\sum_{l=1}^{d}\big(\hat{\theta}_n(x)_k - \theta_{0k}\big)\big(\hat{\theta}_n(x)_l - \theta_{0l}\big)\frac{\partial^2}{\partial\theta_k\partial\theta_l}q(\theta_0^*). \qquad (2.14) $$

Subtracting the quantity $\sqrt{n}\,K^{-1/2}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)$ from both sides of (2.14) now gives

$$ \sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big) - \sqrt{n}\,K^{-1/2}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big) = \sqrt{n}\,\Big([I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1} - K^{-1/2}\Big)\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big) - \frac{\sqrt{n}}{2}[I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1}\sum_{k=1}^{d}\sum_{l=1}^{d}\big(\hat{\theta}_n(x)_k - \theta_{0k}\big)\big(\hat{\theta}_n(x)_l - \theta_{0l}\big)\frac{\partial^2}{\partial\theta_k\partial\theta_l}q(\theta_0^*). $$

Componentwise we have that

$$ \sqrt{n}\,\Big\{\big[[I(\theta_0)]^{1/2}\big]_{[j]}\big(\hat{\theta}_n(X) - \theta_0\big) - \big[K^{-1/2}\big]_{[j]}\big(q(\hat{\theta}_n(X)) - q(\theta_0)\big)\Big\} = \sqrt{n}\sum_{k=1}^{d}\Big([I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1} - K^{-1/2}\Big)_{jk}\Big(q_k\big(\hat{\theta}_n(X)\big) - q_k(\theta_0)\Big) - \frac{\sqrt{n}}{2}\sum_{k=1}^{d}\Big([I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1}\Big)_{jk}\sum_{m=1}^{d}\sum_{l=1}^{d}\big(\hat{\theta}_n(x)_m - \theta_{0m}\big)\big(\hat{\theta}_n(x)_l - \theta_{0l}\big)\frac{\partial^2}{\partial\theta_m\partial\theta_l}q_k(\theta_0^*). \qquad (2.15) $$

We need to take into account that $\frac{\partial^2}{\partial\theta_k\partial\theta_l}q_j(\theta)$ is not necessarily uniformly bounded in $\theta$.

However, as in the statement of Theorem 2.2, it is assumed that for any $\theta_0 \in \Theta$ there exist $\epsilon = \epsilon(\theta_0) > 0$ and functions $M_{jkl}(X)$, $j, k, l \in \{1, 2, \ldots, d\}$, such that $\big|\frac{\partial^2}{\partial\theta_k\partial\theta_l}q_j(\theta)\big| \le M_{jkl}(X)$ for all $\theta \in \Theta$ such that $|\theta_j - \theta_{0j}| < \epsilon$, $j \in \{1, 2, \ldots, d\}$. Therefore, with $\epsilon > 0$ and $Q_{(m)}$ as in (2.1), the law of total expectation yields

$$ R_2 \le E[C_1] = E\big[C_1 \mid Q_{(m)} \ge \epsilon\big]\,P\big(Q_{(m)} \ge \epsilon\big) + E\big[C_1 \mid Q_{(m)} < \epsilon\big]\,P\big(Q_{(m)} < \epsilon\big). \qquad (2.16) $$

The Markov inequality applied to $P(Q_{(m)} \ge \epsilon)$ yields

$$ P\big(Q_{(m)} \ge \epsilon\big) = P\big(\big|\hat{\theta}_n(X)_{(m)} - \theta_{0(m)}\big| \ge \epsilon\big) \le \sum_{j=1}^{d}P\big(\big|\hat{\theta}_n(X)_j - \theta_{0j}\big| \ge \epsilon\big) \le \frac{1}{\epsilon^2}\sum_{j=1}^{d}E\big(\hat{\theta}_n(X)_j - \theta_{0j}\big)^2. \qquad (2.17) $$

In addition, $P(Q_{(m)} < \epsilon) \le 1$, which is a reasonable bound because the consistency of the MLE, a consequence of (R.C.1)-(R.C.5), ensures that $|\hat{\theta}_n(x)_{(m)} - \theta_{0(m)}|$ should be small, and therefore $P(Q_{(m)} < \epsilon) = P(|\hat{\theta}_n(x)_{(m)} - \theta_{0(m)}| < \epsilon)$ is close to one. Since $C_1 \le 2\|h\|$, this result, along with (2.17), applied to (2.16) gives

$$ R_2 \le \frac{2\|h\|}{\epsilon^2}\sum_{j=1}^{d}E\big(\hat{\theta}_n(X)_j - \theta_{0j}\big)^2 + E\big[C_1 \mid Q_{(m)} < \epsilon\big]. \qquad (2.18) $$

For an upper bound on $E[C_1 \mid Q_{(m)} < \epsilon]$, using (2.13) and (2.15) yields

$$ R_2 \le \frac{2\|h\|}{\epsilon^2}\sum_{j=1}^{d}E\big(\hat{\theta}_n(X)_j - \theta_{0j}\big)^2 + |h|_1\sum_{j=1}^{d}\Big\{E\big[\big|A_j^{(1)}(\theta_0, X)\big| \,\big|\, Q_{(m)} < \epsilon\big] + E\big[\big|A_j^{(2)}(\theta_0, X)\big| \,\big|\, Q_{(m)} < \epsilon\big]\Big\}, $$

with $A_j^{(1)}(\theta_0, X)$ and $A_j^{(2)}(\theta_0, X)$ given as in the statement of the theorem. Summing our bounds for $R_1$ and $R_2$ gives the assertion of the theorem.

3 Normally distributed random variables

In this section, we illustrate the application of the general bound of Theorem 2.2 by considering the important case in which $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ random variables with $\theta_0 = (\mu, \sigma^2)^\top$.

Here the density function is

$$ f(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}, \qquad x \in \mathbb{R}. \qquad (3.1) $$

It is well known that in this case the MLE exists, is unique and is equal to $\hat{\theta}_n(X) = \big(\bar{X}, \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\big)^\top$; see, for example, Davison (2008). As we shall see in the proof of the following corollary, the MLE has a representation of the form (1.1). Thus, Theorem 2.2 can be applied to derive the following bound. Empirical results are given after the proof.

Corollary 3.1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables and let $Z \sim N_2(0, I_{2 \times 2})$. Then, for all $h \in C_b^2(\mathbb{R}^2)$,

$$ \Big| E\Big[h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)\Big)\Big] - E[h(Z)] \Big| \le \frac{7\,|h|_2}{\sqrt{n}} + \frac{|h|_1}{\sqrt{2n}}. \qquad (3.2) $$

Remark 3.2. An upper bound on the distributional distance between the distribution of the MLE of the normal distribution with unknown mean and variance, treated under the canonical parametrisation, and its limiting normal distribution has also been obtained in Anastasiou (2016). That bound is also of order $O(n^{-1/2})$, but our numerical constants are two orders of magnitude smaller. Furthermore, our test functions $h$ are less restrictive, coming from the class $C_b^2(\mathbb{R}^2)$ rather than the class $C_b^3(\mathbb{R}^2)$. Finally, the bound in (3.2) does not depend on the parameter $\theta_0 = (\mu, \sigma^2)^\top$, while the bound given in Anastasiou (2016) depends on the natural parameters and blows up when the variance tends to zero or infinity.

Proof. Following the general approach of the previous section, we want to have $q : \Theta \rightarrow \mathbb{R}^2$ such that

$$ q\big(\hat{\theta}_n(x)\big) = \frac{1}{n}\sum_{i=1}^{n}\big(g_1(x_i), g_2(x_i)\big)^\top. $$

We take

$$ q(\theta) = q(\theta_1, \theta_2) = \big(\theta_1,\; \theta_2 + (\theta_1 - \mu)^2\big), \qquad (3.3) $$

which then yields

$$ q\big(\hat{\theta}_n(X)\big) = q\Big(\bar{X}, \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\Big) = \Big(\bar{X},\; \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 + \big(\bar{X} - \mu\big)^2\Big) = \Big(\frac{1}{n}\sum_{i=1}^{n}X_i,\; \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\Big), \qquad (3.4) $$

and therefore

$$ g_1(x_i) = x_i, \qquad g_2(x_i) = (x_i - \mu)^2, \qquad (3.5) $$

and thus the MLE has a representation of the form (1.1). We now note that the Jacobian matrix of $q(\theta)$ evaluated at $\theta_0 = (\mu, \sigma^2)^\top$ is

$$ \nabla q(\theta_0) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \qquad (3.6) $$

Furthermore, it is easily deduced from (3.3) that, for all $k, m, l \in \{1, 2\}$, we have $\big|\frac{\partial^2}{\partial\theta_m\partial\theta_l}q_k(\theta)\big| \le 2$. Hence, using Remark 2.3, we conclude that the general bound is as in (2.5) and no positive constant $\epsilon$ is necessary. Now, for $\ell(\theta_0; X)$ denoting the log-likelihood function, we obtain from (3.1) that the expected Fisher information matrix is

$$ I(\theta_0) = -\frac{1}{n}E\big[\nabla^2 \ell(\theta_0; X)\big] = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}. \qquad (3.7) $$

The results in (3.6) and (3.7) yield

$$ K = \nabla q(\theta_0)\,[I(\theta_0)]^{-1}\,[\nabla q(\theta_0)]^\top = [I(\theta_0)]^{-1} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{pmatrix}. \qquad (3.8) $$

Using (2.6), (3.4), (3.5) and (3.8), we get that

$$ W = \frac{1}{\sqrt{n}}K^{-1/2}\sum_{i=1}^{n}\big[g(X_i) - q(\theta_0)\big] = \Big(\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^{n}(X_i - \mu),\; \frac{1}{\sqrt{2}\sigma^2\sqrt{n}}\sum_{i=1}^{n}\big((X_i - \mu)^2 - \sigma^2\big)\Big)^\top, $$

and therefore, for $i \in \{1, \ldots, n\}$,

$$ \xi_{i1} = \frac{X_i - \mu}{\sigma}, \qquad \xi_{i2} = \frac{(X_i - \mu)^2 - \sigma^2}{\sqrt{2}\,\sigma^2}. $$

For the first term of the general upper bound in (2.5), we are required to bound certain expectations involving the $\xi_{ij}$. We first note that $\frac{X_i - \mu}{\sigma} \stackrel{d}{=} Z \sim N(0, 1)$. Using the standard formulas for the moments and absolute moments of the standard normal distribution (see Winkelbauer (2012)), we have, for all $i = 1, \ldots, n$,

$$ E|\xi_{i1}| = E|Z| = \sqrt{\frac{2}{\pi}}, \qquad E\xi_{i1}^2 = EZ^2 = 1, \qquad E|\xi_{i1}|^3 = E|Z|^3 = 2\sqrt{\frac{2}{\pi}}, $$

and, by Hölder's inequality,

$$ E|\xi_{i2}| = \frac{1}{\sqrt{2}}E|Z^2 - 1| \le \frac{1}{\sqrt{2}}\big(E(Z^2 - 1)^2\big)^{1/2} = 1, \qquad E\xi_{i2}^2 = \frac{1}{2}E(Z^2 - 1)^2 = 1, $$
$$ E|\xi_{i2}|^3 = \frac{1}{2^{3/2}}E|Z^2 - 1|^3 \le \frac{1}{2^{3/2}}\big(E(Z^2 - 1)^4\big)^{3/4} = \frac{1}{2^{3/2}}\big(EZ^8 - 4EZ^6 + 6EZ^4 - 4EZ^2 + 1\big)^{3/4} = \frac{1}{2^{3/2}}\big(105 - 60 + 18 - 4 + 1\big)^{3/4} = 15^{3/4}. $$

Therefore,

$$ \sum_{i=1}^{n}\sum_{j,k,l=1}^{2}E\big|\xi_{ij}\xi_{ik}\xi_{il}\big| \le \sum_{i=1}^{n}\Big(E|\xi_{i1}|^3 + 3E\xi_{i1}^2\,E|\xi_{i2}| + 3E|\xi_{i1}|\,E\xi_{i2}^2 + E|\xi_{i2}|^3\Big) \le n\Big(2\sqrt{\frac{2}{\pi}} + 3 + 3\sqrt{\frac{2}{\pi}} + 15^{3/4}\Big) < 14.62\,n, \qquad (3.9) $$

and

$$ \sum_{i=1}^{n}\sum_{j,k,l=1}^{2}\big|E[\xi_{ij}\xi_{ik}]\big|\,E|\xi_{il}| = \sum_{i=1}^{n}\sum_{j,l=1}^{2}E\xi_{ij}^2\,E|\xi_{il}| \le 2n\Big(\sqrt{\frac{2}{\pi}} + 1\Big). \qquad (3.10) $$

Applying the results of (3.9) and (3.10) to the first term of the general bound in (2.5) yields

$$ \frac{\sqrt{2\pi}\,|h|_2}{4 n^{3/2}}\sum_{i=1}^{n}\sum_{j,k,l=1}^{2}\Big\{E\big|\xi_{ij}\xi_{ik}\xi_{il}\big| + \big|E[\xi_{ij}\xi_{ik}]\big|\,E|\xi_{il}|\Big\} \le 6.833\,\frac{|h|_2}{\sqrt{n}} \le \frac{7\,|h|_2}{\sqrt{n}}. \qquad (3.11) $$

For the second term, we need to calculate $A_j^{(1)}(\theta_0, X)$ and $A_j^{(2)}(\theta_0, X)$ as in Theorem 2.2. The results of (3.6) and (3.8) yield

$$ [I(\theta_0)]^{1/2}[\nabla q(\theta_0)]^{-1} - K^{-1/2} = 0_{2 \times 2}, $$

where $0_{2 \times 2}$ denotes the $2 \times 2$ zero matrix. Therefore $A_j^{(1)}(\theta_0, X) = 0$ for $j = 1, 2$.

In order to calculate $A_j^{(2)}(\theta_0, X)$, we note that

$$ \frac{\partial^2}{\partial\theta_m\partial\theta_l}q_1(\theta) = 0 \;\; \text{for } m, l \in \{1, 2\}, \qquad \frac{\partial^2}{\partial\theta_2^2}q_2(\theta) = \frac{\partial^2}{\partial\theta_1\partial\theta_2}q_2(\theta) = \frac{\partial^2}{\partial\theta_2\partial\theta_1}q_2(\theta) = 0, \qquad \frac{\partial^2}{\partial\theta_1^2}q_2(\theta) = 2. $$

Using this result as well as (3.6) and (3.7), we get that

$$ A_1^{(2)}(\theta_0, X) = 0, \qquad A_2^{(2)}(\theta_0, X) = \frac{\sqrt{n}}{\sqrt{2}\,\sigma^2}\big(\bar{X} - \mu\big)^2. $$

Therefore the last term of the general upper bound in (2.5) becomes

$$ |h|_1\sum_{j=1}^{2}\Big\{E\big|A_j^{(1)}(\theta_0, X)\big| + E\big|A_j^{(2)}(\theta_0, X)\big|\Big\} = |h|_1\,E\big|A_2^{(2)}(\theta_0, X)\big| = |h|_1\,\frac{\sqrt{n}}{\sqrt{2}\,\sigma^2}\,E\big(\bar{X} - \mu\big)^2 = \frac{|h|_1}{\sqrt{2n}}. \qquad (3.12) $$

Finally, the bounds in (3.11) and (3.12) are applied to (2.5) to obtain the assertion of the corollary.

Here, we carry out a large-scale simulation study to investigate the accuracy of the bound in (3.2). For $n = 10^j$, $j = 3, 4, 5, 6$, we start by generating $10^4$ trials of $n$ independent random observations $X$ following the $N(\mu, \sigma^2)$ distribution, with the values of $\mu$ and $\sigma^2$ held fixed across the simulations. Then $\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)$ is evaluated in each trial, which in turn gives a vector of $10^4$ values. The function $h(x, y) = \frac{1}{x^2 + y^2 + 1}$, which belongs to $C_b^2(\mathbb{R}^2)$, is then applied to these values in order to obtain the sample mean, which we denote by $\hat{E}\big[h\big(\sqrt{n}\,[I(\theta_0)]^{1/2}(\hat{\theta}_n(X) - \theta_0)\big)\big]$. For this function $h$, it is straightforward to see that $|h|_1 = \frac{3\sqrt{3}}{8}$ and $|h|_2 = 2$. We use these values to calculate the bound in (3.2). We define

$$ Q_h(\theta_0) := \Big| \hat{E}\Big[h\Big(\sqrt{n}\,[I(\theta_0)]^{1/2}\big(\hat{\theta}_n(X) - \theta_0\big)\Big)\Big] - \tilde{E}[h(Z)] \Big|, $$

where $\tilde{E}[h(Z)] = 0.461$ is the approximation of $E[h(Z)]$ up to three decimal places.
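The following Python sketch mirrors the simulation procedure just described; it is not the authors' code, the parameter values of $\mu$ and $\sigma^2$ are placeholders, and, for brevity, only the two smallest sample sizes and fewer trials are used. The value $0.461$ is the approximation of $E[h(Z)]$ given above.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2, trials = 1.0, 2.0, 2_000                    # placeholders; the paper uses 10^4 trials
h = lambda x, y: 1.0 / (x ** 2 + y ** 2 + 1.0)
abs_h_1, abs_h_2 = 3.0 * np.sqrt(3.0) / 8.0, 2.0        # |h|_1 and |h|_2 for this h
E_h_Z = 0.461                                           # approximation of E[h(Z)], Z ~ N_2(0, I)
I_sqrt = np.diag([1.0 / np.sqrt(sigma2), 1.0 / np.sqrt(2.0 * sigma2 ** 2)])

for n in (10 ** 3, 10 ** 4):                            # the paper also uses n = 10^5, 10^6
    vals = np.empty(trials)
    for t in range(trials):
        x = rng.normal(mu, np.sqrt(sigma2), size=n)
        theta_hat = np.array([x.mean(), np.mean((x - x.mean()) ** 2)])
        z = np.sqrt(n) * I_sqrt @ (theta_hat - np.array([mu, sigma2]))
        vals[t] = h(z[0], z[1])
    Q_h = abs(vals.mean() - E_h_Z)                      # Q_h(theta_0) as defined above
    bound = 7.0 * abs_h_2 / np.sqrt(n) + abs_h_1 / np.sqrt(2.0 * n)   # the bound (3.2)
    print(n, round(Q_h, 3), round(bound, 3), round(bound - Q_h, 3))
```

For $n = 10^3$ the computed value of the bound (3.2) is $0.457$, in agreement with the first row of Table 1 below.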

We compare $Q_h(\theta_0)$ with the bound in (3.2), using the difference between their values as a measure of the error. The results from the simulations are shown in Table 1 below.

Table 1  Simulation results for the normal distribution

  n        Q_h(theta_0)   Upper bound   Error
  10^3     0.011          0.457         0.446
  10^4     0.010          0.145         0.135
  10^5     0.009          0.046         0.037
  10^6     0.006          0.014         0.008

We see that the bound and the error decrease as the sample size gets larger. To be more precise, each time we increase the sample size by a factor of ten, the value of the upper bound drops by a factor close to $\sqrt{10}$, which is expected since the order of the bound is $O(n^{-1/2})$, as can be seen from (3.2).

Acknowledgements

This research occurred whilst AA was studying for a DPhil at the University of Oxford, supported by a Teaching Assistantship Bursary from the Department of Statistics, University of Oxford, and the EPSRC grant EP/K503113/1. RG acknowledges support from EPSRC grant EP/K032402/1 and is currently supported by a Dame Kathleen Ollerenshaw Research Fellowship. We would like to thank the referee for their helpful comments and suggestions.

References

Anastasiou, A. (2016). Assessing the multivariate normal approximation of the maximum likelihood estimator from high-dimensional, heterogeneous data. https://arxiv.org/pdf/1510.03679.pdf

Anastasiou, A. and Ley, C. (2017). Bounds for the asymptotic normality of the maximum likelihood estimator using the Delta method. ALEA: Latin American Journal of Probability and Mathematical Statistics 14, 153-171.

Anastasiou, A. and Reinert, G. (2017). Bounds for the normal approximation of the maximum likelihood estimator. Bernoulli 23, 191-218.

Billingsley, P. (1961). Statistical Methods in Markov Chains. The Annals of Mathematical Statistics 32, 12-40.

Cox, D. R. and Snell, E. J. (1968). A General Definition of Residuals. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 30, 248-275.

Davison, A. C. (2008). Statistical Models (First ed.). Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Gaunt, R. E. (2016). Rates of Convergence in Normal Approximation Under Moment Conditions Via New Bounds on Solutions of the Stein Equation. Journal of Theoretical Probability 29, 231-247.

Gaunt, R. E. and Reinert, G. (2016). The rate of convergence of some asymptotically chi-square distributed statistics by Stein's method. https://arxiv.org/pdf/1603.01889.pdf

Mäkeläinen, T., Schmidt, K. and Styan, G. P. H. (1981). On the existence and uniqueness of the maximum likelihood estimate of a vector-valued parameter in fixed-size samples. The Annals of Statistics 9, 758-767.

Winkelbauer, A. (2012). Moments and absolute moments of the normal distribution. https://arxiv.org/pdf/1209.4340.pdf