Running head: Approximate REML sampling variances

Approximation of Sampling Variances and Confidence Intervals for Maximum Likelihood Estimates of Variance Components

K. Meyer¹ and W.G. Hill

Institute of Cell, Animal and Population Biology, Edinburgh University, West Mains Road, Edinburgh EH9 3JT, Scotland

Abstract

After reviewing pertinent literature on the estimation of sampling variances and confidence intervals in the maximum likelihood framework, a method is described to approximate these for individual parameters in a multi-parameter analysis. It is based on the profile likelihood, defined as the likelihood for a subset of parameter(s) of interest with the remaining parameters set equal to their maximum likelihood estimates given the former. The formation (= inverse of the information) matrix for the parameters of the profile likelihood is equal to the corresponding submatrix of the formation matrix for the full likelihood. The likelihood ratio test for composite hypotheses effectively compares points on the profile likelihood surface for the parameters tested. Hence sampling errors and confidence intervals for each parameter can be estimated from its profile likelihood, which can be approximated by fitting a one-dimensional quadratic or higher order polynomial function. Numerical examples for a balanced hierarchical full-sib design are given.

¹ Address for correspondence: Animal Genetics and Breeding Unit, University of New England, Armidale NSW 2351, Australia

Introduction

Over the last decade, statistical methodology used in estimating genetic parameters has changed progressively from Analysis of Variance type methods to Maximum Likelihood (ML) and related procedures; see Meyer (1990) for a recent review. For a wide range of problems in applied quantitative genetics, Restricted Maximum Likelihood (REML) estimation (Patterson and Thompson, 1971) in particular is now generally considered the method of choice. While considerable research effort has been directed towards the development of specialised algorithms, especially for problems and models of analysis common in animal breeding, questions concerning the accuracy of estimation and the power of hypothesis testing have by and large been neglected. In part, this has been due to the often considerable computational burden imposed by (RE)ML estimation of variance components. Some algorithms utilise second derivatives of the likelihood function (L) in locating its maximum, and thus provide estimates of sampling variances of parameters as a by-product. These are, however, generally the computationally most difficult and demanding algorithms, and simpler procedures are preferred in practice. It has therefore been suggested to approximate second derivatives using finite differences, or by fitting a quadratic approximation to the (log) likelihood function. While these numerical techniques have been found to perform well for models with a single variable or a few not too highly correlated variables, they have proved unreliable in other cases. In general, the fitting of a multi-dimensional, higher order polynomial function is frequently affected by numerical problems. Difficulties arise in particular for models with multiple random effects, for example an animal model analysis fitting a maternal genetic as well as an additive genetic effect for each animal, or multivariate analyses. In these cases large, negative sampling correlations between parameters cause long, narrow ridges on the likelihood surface, so that its shape is clearly not quadratic. In addition, the number of coefficients to be estimated, and thus the minimum number of points of log L around the maximum required, increases dramatically with the number of parameters. For a bivariate analysis, for instance, estimating 3 additive genetic and 3 error covariance components, the quadratic approximation involves 21 coefficients. Hence, unless the necessary points can be obtained as part of the estimation procedure, the additional computational effort required can be considerable.

This paper reviews pertinent properties of the likelihood framework of inference, and shows that sampling variances and confidence intervals in a multi-parameter analysis can be estimated one parameter at a time, using one-dimensional approximation techniques.

Theoretical Considerations

Large sample theory

Most statistical textbooks consider the properties and distribution of ML estimators for large samples. Let θ denote the vector of parameters to be estimated and L the likelihood function for θ given some vector of data y. Kendall and Stuart (1973, chapter 18), for instance, show that asymptotically the ML estimator of θ, θ̂, has a normal distribution with lower bound variance matrix equal to the inverse of the information matrix, i.e. of minus the matrix of expected values of the second partial derivatives of the log likelihood function:

I(θ̂) = { −E[ ∂² log L(θ) / ∂θᵢ ∂θⱼ ] }   (1)

Var(θ̂) = [ I(θ̂) ]⁻¹   (2)

Assuming normality of the ML estimator holds, the 100(1 − α)% confidence interval for a parameter θᵢ (with ML estimate θ̂ᵢ) can be estimated as θ̂ᵢ ± z_{α/2} s(θ̂ᵢ), where α is the error probability, s(θ̂ᵢ) is the sampling error of θ̂ᵢ derived from the inverse of the information matrix, and z_{α/2} is the truncation point of the standardised normal distribution for which the area in the upper tail is α/2.
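As a concrete illustration of (1) and (2), the sketch below estimates the information matrix by central finite differences and forms the large sample interval θ̂ᵢ ± z_{α/2} s(θ̂ᵢ). It assumes a toy normal log likelihood in place of the (RE)ML likelihoods considered here; the data, step size h and all names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch of eqs (1)-(2): observed information by central finite differences,
# inverted to give large sample standard errors and Wald-type limits.
rng = np.random.default_rng(1)
y = rng.normal(loc=40.0, scale=8.0, size=500)      # toy data, not the paper's

def loglik(theta):
    mu, var = theta
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var)

theta_hat = np.array([y.mean(), y.var()])          # ML estimates of (mu, var)

def information(f, x, h=1e-3):
    """Minus the Hessian of f at x, by central finite differences."""
    t = len(x)
    H = np.zeros((t, t))
    for i in range(t):
        for j in range(t):
            ei, ej = h * np.eye(t)[i], h * np.eye(t)[j]
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return -H

V = np.linalg.inv(information(loglik, theta_hat))  # formation matrix, eq. (2)
se = np.sqrt(np.diag(V))
z = 1.96                                           # z_{alpha/2} for alpha = 5%
for name, est, s in zip(("mu", "var"), theta_hat, se):
    print(f"{name}: {est:.3f}, 95% CI {est - z * s:.3f} to {est + z * s:.3f}")
```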

With the likelihood ratio (LR) test, as proposed by Neyman and Pearson (1928), the likelihood framework provides a conceptually simple and flexible method to test both simple and composite hypotheses about ML estimates. While for some applications the LR test is equivalent to one of the standard statistical tests, it accommodates cases where no optimum test exists. Chapter 24 of Kendall and Stuart (1973) examines LR tests and their properties in detail, and the following summarises its main points.

Partition the vector of parameters θ, of length t, into θ₁ of length t₁ (1 ≤ t₁ ≤ t), representing the parameters we want to examine, and θ₂ of length t₂ = t − t₁. Let the null hypothesis (H₀) to be tested be that θ₁ is equal to some specified vector t, and the alternative hypothesis (H_A) that it is not:

H₀ : θ₁ = t    H_A : θ₁ ≠ t

Furthermore, let L(θ̂₁, θ̂₂) denote the maximum of the likelihood function, pertaining to θ̂. To test H₀, we need to determine the ML estimate of θ₂ under the null hypothesis, i.e. given that θ₁ = t. Using the superscript 0 to denote this, the estimate is θ̂₂⁰, with corresponding likelihood L(t, θ̂₂⁰). The likelihood ratio is then

λ = L(t, θ̂₂⁰) / L(θ̂₁, θ̂₂)

Kendall and Stuart (1973) argue that λ is an intuitively reasonable test criterion for H₀, as it is the maximum likelihood under H₀ expressed as a fraction of the largest possible value, a large value of λ indicating that H₀ is acceptable. To determine the critical region for any test, we need to know the distribution of the test criterion. From the asymptotic normality of ML estimators (under certain regularity conditions), it follows directly that −2 log λ asymptotically has a χ² distribution with t₁ degrees of freedom. Under the null hypothesis this is a central χ² distribution; if H_A holds, however, it has a non-centrality parameter depending on the deviation of θ₁ from its value under H₀, t, and on its sampling variance matrix.
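In computational form the test is a one-liner once the two maximised log likelihoods are available. A hedged sketch (the two maximisations, over all parameters and over θ₂ with θ₁ fixed, are assumed to have been done elsewhere):

```python
from scipy.stats import chi2

# Sketch of the LR test of the composite hypothesis H0: theta1 = t.
# logL_full: log L(theta1_hat, theta2_hat), maximised over all parameters.
# logL_null: log L(t, theta2_hat0), maximised over theta2 with theta1 = t.
# t1: number of parameters tested.
def lr_test(logL_full, logL_null, t1):
    stat = -2.0 * (logL_null - logL_full)   # -2 log(lambda), non-negative
    return stat, chi2.sf(stat, df=t1)       # asymptotic chi^2_{t1} p-value

# Example: fixing one parameter costs 5.3 log likelihood units.
print(lr_test(logL_full=-1000.0, logL_null=-1005.3, t1=1))
```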

Approximations

There is extensive statistical literature on the approximation of likelihood functions and associated statistics, in particular LR tests and their power, for small or finite samples. Sprott (1973) emphasised that the shape of the likelihood function determines when large sample theory can be applied and when it is misleading. While the LR test criterion λ is bounded by 0 and 1, the variable

ω = −2c log λ   (3)

is, for any positive constant c, distributed over the interval from 0 to ∞. An obvious approximation to the distribution of ω is then through a χ² distribution, which spans the same interval, choosing c to make the approximation as close as possible (Bartlett, 1937). Lawley (1956) presented a general method, utilising third and higher order expansions of the likelihood function, to determine such a scaling factor; it yields a variable ω whose expected deviation from a χ² distribution with t₁ degrees of freedom is proportional to the reciprocal of the square of the sample size.

Asymptotically, the likelihood function is normal, i.e. it can be described accurately by a quadratic function (Sprott, 1973), and thus can be determined by a Taylor series expansion about the maximum involving only second order terms. Examination of the actual likelihood function and of its deviation from normality then gives an indication of how appropriate the use of large sample results is. Minkin (1983) outlined a technique to find an upper bound for the error made with a quadratic approximation of the likelihood. Bartlett (1953a, 1953b, 1955) attempted to correct for the skewness of the likelihood function in deriving confidence intervals. Sprott and Kalbfleisch (1969) and Sprott (1973, 1975) compared observed and normal likelihoods for several examples from the literature, and showed that in some cases a transformation of the parameters, aimed at reducing the third order term in the Taylor series expansion of L (which is ignored when invoking large sample theory), makes the likelihood function more normal and thus extends the domain for the application of large sample results.

Modified likelihood

Commonly, some modified form of the likelihood function, rather than the full likelihood, is adopted as the basis for ML estimation and statistical inference. Possible aims are to achieve robustness, to deal with problems for which the full likelihood is difficult or impossible to compute, to develop methods based on second-moment properties, or to reduce dimensionality (Cox, 1975). Another motivation has been to reduce bias: the likelihood employed in REML estimation of variance components is a modified likelihood, adjusted to account for the fixed effects in the model of analysis.

Nomenclature tends to be confusing. Cox (1975) defines a partial likelihood which encompasses both marginal and conditional likelihoods as special cases. Barndorff-Nielsen (1983) refers to such modified likelihood functions as quasi-likelihoods, and includes in this concept proper likelihoods, quasi-likelihoods in the sense of Wedderburn (1974), partial likelihoods as defined by Cox (1975), as well as marginal, conditional and profile likelihoods.

Let y denote the vector of observations for a random variable Y with distribution f_Y(y; θ). Suppose Y is transformed to two new random variables V and W, with corresponding data vectors v and w, the transformation not depending on the unknown parameters θ. The marginal likelihood based on V, with distribution f_V(v; θ), is then the likelihood which would have been obtained if only v rather than y had been observed. Conversely, the conditional likelihood is based on W given V = v, with density function f_{W|V}(w|v; θ) (Cox, 1975).

Often, we can divide the vector of parameters in the full likelihood into those of interest, often referred to as structural parameters, and the remainder, consisting of so-called incidental or nuisance parameters (Neyman and Scott, 1948). Consider a sequence of independent random variables Y₁, Y₂, ..., Yᵢ, ... whose distribution depends on a parameter θ, which is independent of i, and on parameters βᵢ, which depend on i; θ is a structural parameter and the βᵢ are incidental parameters. An example would be data sampled from normal distributions with the same variance (θ) but different means (βᵢ). The problem of obtaining consistent estimates of structural parameters in the presence of infinitely many incidental parameters was first discussed by Neyman and Scott (1948) and Kiefer and Wolfowitz (1956). Anderson (1970) described a conditional maximum likelihood estimator of θ, given minimal sufficient statistics for the βᵢ, and showed that the resulting estimates are consistent and asymptotically normally distributed in the regular case. Kalbfleisch and Sprott (1970) considered various procedures to eliminate the incidental parameters from the likelihood function for this case, namely integrating over the βᵢ (assuming some prior knowledge about their distribution), replacing them by their ML estimates for specific values of θ, and factoring the likelihood into two parts, one of which contains no information on θ in the absence of knowledge about the βᵢ, and then using the marginal or conditional likelihood of θ.

In other cases, ML estimation in multi-parameter models has been carried out as a stepwise procedure. Richards (1961) discussed the type of problem where most difficulties arise through the estimation of one (or more) of the parameters, whereas knowledge of these would simplify estimation of the remaining parameters considerably. As an example, he considered the fitting of a curve of the form y = α + βf(x, ρ), where the function f is known and α, β and ρ are to be estimated. If ρ were known, f(x, ρ) could be replaced by x and the curve fitting problem would reduce to that of a simple linear regression, y = α + βx. Richards (1961) thus suggested obtaining ML estimates of α and β for various values of ρ and then maximising over ρ by some numerical method, and showed that this approach could be made the basis of a method which yields both parameter estimates and their estimated covariance matrix. Considering the likelihood of the structural parameters with the nuisance parameters set equal to their ML estimates (one of the methods considered by Kalbfleisch and Sprott (1970) for eliminating the latter), Patefield (1977) arrived at the same results.

In REML estimation of variance components, this strategy has been utilised by Smith and Graser (1986) for univariate analyses for a class of models with two random factors and three variances, σ₁², σ₂² and σ². As analyses using an Expectation-Maximization (EM) type algorithm for models with one random factor could be carried out efficiently, exploiting a transformation of the coefficient matrix in the mixed model equations to tridiagonal form, they obtained REML estimates of σ₁² and σ² for given values of γ = σ₂²/σ², and maximised with respect to γ using a quadratic approximation of the likelihood function. This proved computationally simpler and more efficient than the EM algorithm attempting to estimate all three components simultaneously, described by Meyer (1987). Similarly, in multivariate estimation for an animal model with equal design matrices using a derivative-free algorithm, a reparameterisation to the elements of the canonical decomposition of the genetic and error covariance matrices, sequentially maximising with respect to the elements of the eigenvectors for given eigenvalues and then with respect to the eigenvalues, resulted in a dramatic reduction in computational requirements (Juga and Thompson, 1990; Meyer, 1991).

Profile likelihood

Let θ₁ be the vector of structural parameters, or of parameters which are difficult to estimate, and θ₂ the vector of nuisance parameters, or of parameters which are easy to estimate for given values of θ₁. The partially maximised or profile likelihood of θ₁ is then

L_P(θ₁) = L(θ₁, θ̂₂)

where θ̂₂ is the ML estimate of θ₂ for given values of θ₁. As stated by Patefield (1977), (log) L_P is essentially the projection of the complete (log) likelihood onto the axes pertaining to the elements of the subvector of parameters θ₁.
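Computationally, a point on the profile likelihood is obtained by an inner maximisation over θ₂ for fixed θ₁. The sketch below assumes a hypothetical user-supplied log likelihood `logL`; the derivative-free inner search mirrors, loosely, the derivative-free REML setting mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of L_P(theta1) = L(theta1, theta2_hat): maximise log L over the
# nuisance parameters theta2 for a fixed value of theta1. `logL` is an
# assumed user-supplied function; the inner search is derivative-free.
def profile_loglik(logL, theta1, theta2_start):
    res = minimize(lambda th2: -logL(theta1, th2), theta2_start,
                   method="Nelder-Mead")
    return -res.fun, res.x    # log L_P(theta1), and theta2_hat given theta1

# Toy check: a correlated bivariate quadratic log likelihood.
logL = lambda t1, t2: -0.5 * (t1 ** 2 - 1.6 * t1 * t2[0] + t2[0] ** 2)
print(profile_loglik(logL, 1.0, np.array([0.0])))  # ~(-0.18, [0.8])
```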

As outlined above, the inverse of the information matrix gives an estimate of the large sample covariance matrix of θ̂. The inverse of the information matrix is commonly referred to as the formation matrix (e.g. Edwards, 1966). An important property of the projection of log L(θ) onto log L_P(θ₁) is that the formation, or graphically the curvature of the likelihood surface, is preserved (Patefield, 1977). Let F(θ) denote the formation matrix pertaining to the complete parameter vector θ, and partition both I(θ) and F(θ) corresponding to the division of θ into θ₁ and θ₂:

I(θ) = [ I₁₁(θ)  I₁₂(θ) ]        F(θ) = [ F₁₁(θ)  F₁₂(θ) ]
       [ I₂₁(θ)  I₂₂(θ) ]               [ F₂₁(θ)  F₂₂(θ) ]

Define F_P(θ₁) as the formation matrix for θ₁ derived from the profile likelihood, and I_P(θ₁) as the corresponding information matrix, i.e.

F_P(θ₁) = [ I_P(θ₁) ]⁻¹ with I_P(θ₁) = { −E[ ∂² log L_P(θ₁) / ∂θᵢ ∂θⱼ ] } for i, j = 1, ..., t₁

As shown by Richards (1961) and Patefield (1977),

F_P(θ₁) = F₁₁(θ)   (4)

i.e. for each θ₁, with θ₂ replaced by θ̂₂, the ML estimate of θ₂ given θ₁, the formation matrix from the profile likelihood is equal to the corresponding submatrix of the formation matrix for the complete likelihood. Partitioned matrix results then give

I_P(θ₁) = I₁₁(θ) − I₁₂(θ) [ I₂₂(θ) ]⁻¹ I₂₁(θ)

i.e. the profile information for θ₁ is that available in the complete likelihood conditional on θ₂. Only if θ₁ and θ₂ are uncorrelated is I_P(θ₁) = I₁₁(θ), which implies that θ₂ does not contain any information on θ₁ and that the profile maximum likelihood estimator utilises all possible information on θ₁.
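Equation (4) and the partitioned matrix identity above are easy to verify numerically. In the sketch below an arbitrary positive definite matrix stands in for the information matrix; nothing here is specific to REML.

```python
import numpy as np

# Numerical check of eq. (4): the inverse of the profile information
# I_P = I11 - I12 I22^{-1} I21 (a Schur complement) equals the (1,1) block
# of the inverse of the full information, i.e. F_P(theta1) = F11(theta).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
I_full = A @ A.T + 4.0 * np.eye(4)    # stand-in information matrix
t1 = 2                                # the first t1 parameters are of interest

I11, I12 = I_full[:t1, :t1], I_full[:t1, t1:]
I21, I22 = I_full[t1:, :t1], I_full[t1:, t1:]

F_P = np.linalg.inv(I11 - I12 @ np.linalg.inv(I22) @ I21)
F11 = np.linalg.inv(I_full)[:t1, :t1]
print(np.allclose(F_P, F11))          # True
```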

The Method

Sampling Variances

From (4) it follows immediately that, in a multi-parameter problem, we can examine the sampling distribution of one or a few parameters at a time, based on their profile (log) likelihood. For algorithms which do not utilise second derivatives of log L in the search for its maximum, and thus do not provide an estimate of the information and formation matrices as a by-product, this implies that only selected second derivatives of the log likelihood function need to be calculated explicitly. Alternatively, the numerical approximation of the likelihood surface required to estimate sampling (co)variances is of low dimension, and thus less problematic and likely to be more accurate than an approximation involving all parameters.

In particular, sampling variances can be estimated for each parameter individually, i.e., in the above notation, t₁ = 1 and t₂ = t − 1 and, with θ the parameter of interest, θ₁ = {θ}. Let L₀ = log L(θ̂) denote the maximum of log L with respect to the complete parameter vector θ. An estimate of the formation matrix for θ₁ can be obtained through a quadratic approximation of log L_P(θ). For one parameter, this requires a minimum of three points, i.e., with L₀ known, two additional points on the profile likelihood curve need to be evaluated. These are chosen either side of the ML estimate θ̂ but close to it, so that the maximum is bracketed. Let θ̂ + Δ₁, with Δ₁ < 0, and θ̂ + Δ₂, with Δ₂ > 0, be the two points, with function values L_P1 = log L_P(θ̂ + Δ₁) and L_P2 = log L_P(θ̂ + Δ₂), respectively. The second order Taylor series expansion of log L_P(θ) about the maximum is, for any Δᵢ,

L_Pi = L₀ + Δᵢ q₁ + Δᵢ² q₂ / 2   (5)

Estimates of the coefficients in (5) are then

q₁ = [ Δ₂²(L_P1 − L₀) − Δ₁²(L_P2 − L₀) ] / [ Δ₁ Δ₂ (Δ₂ − Δ₁) ]   (6)

q₂ = 2 [ Δ₁(L_P2 − L₀) − Δ₂(L_P1 − L₀) ] / [ Δ₁ Δ₂ (Δ₂ − Δ₁) ]   (7)

If the two points are equally spaced about θ̂, i.e. Δ₂ = Δ and Δ₁ = −Δ, (6) and (7) simplify to

q₁ = (L_P2 − L_P1) / (2Δ)   (8)

q₂ = (L_P1 + L_P2 − 2L₀) / Δ²   (9)

q₂ is an estimate of ∂² log L_P(θ)/∂θ² at θ = θ̂, i.e. it measures the curvature of the profile log likelihood at its maximum. Hence, an estimate of the sampling variance of the ML estimator of θ is

Var(θ̂) = −1/q₂

Correspondingly, q₁ provides an estimate of the slope of the profile log likelihood function at the maximum; hence its expected value is zero. From (8) it follows that this is the case when L_P1 = L_P2, i.e. when the profile likelihood is symmetric over the range defined by Δ. The magnitude of q₁ gives some indication of the goodness of the approximation of log L_P(θ) by (5): if q₁ differs substantially from zero, q₂ or its reciprocal is unlikely to provide a reliable estimate of the sampling variance of θ̂.
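Equations (6) and (7), and the variance estimate −1/q₂, translate directly into code. The sketch below checks them against an exactly quadratic toy curve with a known sampling variance (an assumption for illustration only):

```python
# Sketch of eqs (6)-(7): quadratic coefficients of the profile log likelihood
# from its maximum L0 and two bracketing points at offsets d1 < 0 < d2 from
# the ML estimate, with values LP1 and LP2.
def quadratic_coefficients(L0, d1, LP1, d2, LP2):
    denom = d1 * d2 * (d2 - d1)
    q1 = (d2 ** 2 * (LP1 - L0) - d1 ** 2 * (LP2 - L0)) / denom   # eq. (6)
    q2 = 2.0 * (d1 * (LP2 - L0) - d2 * (LP1 - L0)) / denom       # eq. (7)
    return q1, q2

# Toy check: for L_P(theta_hat + d) = L0 - d^2 / (2 s^2) the sampling
# variance is s^2, recovered as -1/q2; q1 is zero by symmetry.
s, L0 = 3.0, -100.0
curve = lambda d: L0 - d ** 2 / (2.0 * s ** 2)
q1, q2 = quadratic_coefficients(L0, -0.4, curve(-0.4), 0.4, curve(0.4))
print(q1, -1.0 / q2)   # ~0.0 and ~9.0 = s^2
```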

For some problems, we are interested in estimates of the sampling correlations between parameters as well as in their sampling errors. These can be determined analogously, for one pair of parameters at a time, from their joint profile likelihood. Expanding (5) to two dimensions, we need to estimate two linear coefficients q₁ᵢ and three quadratic coefficients q₂ᵢⱼ (for i ≤ j = 1, 2), i.e. at least 5 additional points on the profile log likelihood surface for the two parameters of interest, θ₁ and θ₂, need to be evaluated. The coefficients q₂ᵢⱼ are the elements of the symmetric information matrix I_P(θ₁) (with t₁ = 2 and θ₁ = {θ₁ θ₂}), which upon inversion yields estimates of the sampling variances of, and the covariance between, the two parameters. Alternatively, we may consider reparameterising to a model which fits the sum of θ₁ and θ₂ as a single parameter, and estimating the variance of the sum as described above. Assuming the sampling variances for each parameter have already been obtained, the covariance between them can then be estimated from the variance of the sum, as Cov(θ₁, θ₂) = [Var(θ₁ + θ₂) − Var(θ₁) − Var(θ₂)]/2. If feasible, this appears computationally advantageous, especially if many sampling covariances are to be evaluated.

Confidence Intervals

Cox and Hinkley (1974, Chapter 7) consider interval estimation and likelihood based confidence intervals for single and multiple parameters, and for the case of a single parameter of interest and several nuisance parameters. Matthews (1988) outlines an application to a multi-parameter problem. As outlined above, hypothesis testing within the ML framework of estimation can be performed using the LR test. It was shown that for composite hypotheses, examining only a subvector θ₁ of the parameters, this requires the numerical value of the likelihood function with θ₁ equal to its assumed value and the remainder of the parameters equal to their ML estimates given θ₁. Clearly, this is a point on the profile likelihood for θ₁.

Again, consider one parameter θ, i.e. t₁ = 1 and θ₁ = {θ}. The likelihood ratio criterion to test the hypothesis that θ is equal to some value a is then

λ = L(a, θ̂₂) / L(θ̂, θ̂₂) = L_P(θ)|θ=a / L(θ̂)

and −2 log λ has an asymptotic χ² distribution with one degree of freedom. Hence the ML estimate θ̂ is significantly different from a if

−log λ = log L(θ̂) − log L_P(θ)|θ=a > χ²₁,α / 2

where χ²₁,α is the point of the cumulative χ² distribution with one degree of freedom pertaining to the error probability α. The 100(1 − α)% confidence interval for θ is then the range of values for which −log λ does not exceed χ²₁,α/2. Conversely, the upper and lower confidence limits for θ̂ are given by the points θ_L and θ_U on the profile likelihood curve for θ at which −log λ is equal to χ²₁,α/2.

A number of numerical techniques exist which allow θ_L and θ_U to be determined in a simple one-dimensional search; see for instance Gill et al. (1981). However, these strategies generally require repeated function evaluations, and are thus practically feasible only if the latter are computationally undemanding. In our case, the function is the log profile likelihood for the parameter of interest, i.e. each evaluation requires maximisation of L, for a given value of θ, with respect to the remaining t − 1 variance components. This tends to impose heavy computational requirements, and it is therefore desirable to derive estimates of θ_L and θ_U with the smallest possible number of evaluations.
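Such a direct one-dimensional search might be sketched as follows; `logLP` is an assumed callable for the profile log likelihood (each call hiding an inner maximisation), and the bracketing bounds `lo` and `hi` are the user's responsibility:

```python
from scipy.optimize import brentq
from scipy.stats import chi2

# Sketch of the direct search for the likelihood based limits: theta_L and
# theta_U are the points where the profile log likelihood has dropped
# chi^2_{1,alpha}/2 below its maximum L0. `logLP` is an assumed profile log
# likelihood; lo and hi must bracket the limits on either side.
def likelihood_limits(logLP, theta_hat, L0, lo, hi, alpha=0.05):
    drop = chi2.ppf(1.0 - alpha, df=1) / 2.0        # 3.84 / 2 = 1.92 for 5%
    g = lambda th: logLP(th) - (L0 - drop)          # zero at each limit
    return brentq(g, lo, theta_hat), brentq(g, theta_hat, hi)

# Toy check with a quadratic profile: limits are theta_hat -/+ 1.96 * s.
s, L0 = 3.0, -100.0
logLP = lambda th: L0 - (th - 40.0) ** 2 / (2.0 * s ** 2)
print(likelihood_limits(logLP, 40.0, L0, 20.0, 60.0))   # ~(34.12, 45.88)
```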

Hence it is suggested to replace the log profile likelihood function by a polynomial approximation, and to approximate θ_L and θ_U by the points for which the value of log L_P predicted from this function deviates from L₀ by χ²₁,α/2. As outlined above, a quadratic approximation of log L_P(θ) is equivalent to invoking large sample theory. While it may yield an accurate estimate of the curvature at the maximum even if L_P(θ) is clearly non-normal, it is likely to give an inappropriate estimate of the confidence interval. In that case, a third or higher order Taylor series expansion of log L_P(θ) about the maximum will provide a considerably better approximation to the likelihood curve:

L_Pi = L₀ + Σ_{k=1}^{h} Δᵢᵏ qₖ / k!   (10)

Thus, with a one-dimensional approximation, we require at least h points on log L_P(θ) to estimate the coefficients in (10). The confidence limits of θ̂ are the values θ_C = θ̂ + Δ_C, for C = L, U, which satisfy

−χ²₁,α / 2 = Σ_{k=1}^{h} Δ_Cᵏ qₖ / k!

These can be found easily using inverse interpolation or some numerical technique to find the roots of a polynomial function. For most cases, h = 3 or h = 4 will be sufficient to account for the deviation of log L_P(θ) from normal shape, at least over the range considered in estimating the confidence interval for θ̂.

To ensure that the points used in determining the coefficients qₖ approximately span the area of interest, the following strategy is suggested. First, obtain an estimate of the sampling variance of θ̂ through a quadratic approximation of log L_P(θ), as described above; this gives θ̂ ± z_{α/2}/√(−q₂) as initial guesses for θ_L and θ_U. Secondly, evaluate one or both of these points to obtain a cubic or fourth order approximation to log L_P(θ), and determine the confidence interval from it. While algebraic expressions for the roots of cubic and quartic functions exist (e.g. Weber, 1894), the two roots of interest can readily be determined numerically, knowing that they are in the vicinity of the initial estimates of θ_L and θ_U from the quadratic approximation.

As above, let the points and their pertaining function values be θ̂ + Δᵢ and L_Pi (i = 1, ..., h), respectively. For h = 3, for instance, choose the upper bound from the quadratic approximation as the third point (provided it is within the bounds of the parameter space), i.e. Δ₃ = z_{α/2}/√(−q₂). The coefficients of the cubic approximation are then

q₁ = −( Δ₂Δ₃ D₁ + Δ₁Δ₃ D₂ + Δ₁Δ₂ D₃ ) / D

q₂ = 2 ( (Δ₂ + Δ₃) D₁ + (Δ₁ + Δ₃) D₂ + (Δ₁ + Δ₂) D₃ ) / D

q₃ = −6 ( D₁ + D₂ + D₃ ) / D

with

D = Δ₁ Δ₂ Δ₃ (Δ₂ − Δ₁)(Δ₃ − Δ₂)(Δ₁ − Δ₃)
D₁ = (L_P1 − L₀) Δ₂ Δ₃ (Δ₃ − Δ₂)
D₂ = (L_P2 − L₀) Δ₁ Δ₃ (Δ₁ − Δ₃)
D₃ = (L_P3 − L₀) Δ₁ Δ₂ (Δ₂ − Δ₁)

The magnitude of the cubic coefficient q₃, as well as the deviation of L_P3 from its large sample predicted value L₀ − χ²₁,α/2 and the change in the estimated confidence interval (from the initial estimate for h = 2), then give an indication of how much the shape of the actual likelihood differs from normal. These, together with the computational requirements of evaluating a point on the profile likelihood curve, generally determine whether or not a quartic approximation (h = 4) needs to be fitted; a computational sketch of the cubic case follows this section.

Sprott (1980) extended the work of Sprott (1973, 1975) examining the normality of the likelihood curve to multi-parameter models, replacing the (relative) likelihood for a single parameter by the profile likelihood for the parameter of interest. Giving several examples, he emphasised the desirability of accounting for third and fourth standardised derivatives of the log likelihood when applying maximum likelihood theory to obtain confidence intervals.
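The cubic coefficients just given, and the subsequent root-finding for the confidence limits, amount to a few lines of code. In the sketch below the roots are found numerically via numpy rather than by the closed-form cubic formulas, and the skewed toy curve is an illustrative assumption:

```python
import numpy as np

# Sketch of the cubic (h = 3) approximation: coefficients from the D-formulas
# above, then the two roots of q1*d + q2*d^2/2 + q3*d^3/6 = -chi^2_{1,alpha}/2
# bracketing zero give the offsets of the confidence limits from theta_hat.
def cubic_coefficients(L0, d, LP):
    d1, d2, d3 = d
    D  = d1 * d2 * d3 * (d2 - d1) * (d3 - d2) * (d1 - d3)
    D1 = (LP[0] - L0) * d2 * d3 * (d3 - d2)
    D2 = (LP[1] - L0) * d1 * d3 * (d1 - d3)
    D3 = (LP[2] - L0) * d1 * d2 * (d2 - d1)
    q1 = -(d2 * d3 * D1 + d1 * d3 * D2 + d1 * d2 * D3) / D
    q2 = 2.0 * ((d2 + d3) * D1 + (d1 + d3) * D2 + (d1 + d2) * D3) / D
    q3 = -6.0 * (D1 + D2 + D3) / D
    return q1, q2, q3

def confidence_offsets(q1, q2, q3, drop=1.92):     # drop = chi^2_{1,alpha}/2
    roots = np.roots([q3 / 6.0, q2 / 2.0, q1, drop])
    real = np.sort(roots[np.isreal(roots)].real)
    return real[real < 0].max(), real[real > 0].min()

# Toy check with a skewed curve L_P = L0 - d^2/2 + d^3/30: a cubic fit
# reproduces it exactly, giving asymmetric lower and upper offsets.
L0 = -100.0
curve = lambda d: L0 - d ** 2 / 2.0 + d ** 3 / 30.0
d = (-1.0, 1.0, 2.0)
q = cubic_coefficients(L0, d, [curve(x) for x in d])
print(q)                       # ~(0.0, -1.0, 0.2)
print(confidence_offsets(*q))  # ~(-1.85, 2.11): asymmetric limits
```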

Examples

Consider data with a balanced hierarchical full-sib structure, i.e. s unrelated sires, each mated to d unrelated dams, with n offspring per dam, and with records on all animals (parents and offspring) available. The data can then be summarised in the form of three independent matrices of sums of squares (SS) and crossproducts (CP), as outlined by Thompson (1977). Let the variance components of interest be the additive genetic variance (σ_A²), the common environmental variance affecting full-sibs (σ_C²) and the error variance (σ_E²). For an animal model analysis not involving fixed effects, REML estimates of the variance components, as well as the associated log likelihood and its derivatives, can be obtained readily, as described by Thompson (1977). Replacing the matrices of SS/CP by their expected values, assuming the true (population) values of the variances are known, the shape of the likelihood curve and the sampling distribution of parameter estimates can be examined.

Let σ_A² = 40, σ_C² = 0 and σ_E² = 60, for a large data set with s = 1000, d = 5 and n = 2, i.e. 16,000 records in total. Fitting a simple animal model with animals as the only random effect, the REML estimates of σ_A² and σ_E² are equal to their true values, 40 and 60, respectively, at which the log likelihood attains its maximum L₀.

Approximating sampling variances

To approximate the sampling variance of σ̂_A², evaluate two additional points on the log profile likelihood for σ_A², each 1% from the estimate, i.e. Δ = ±0.4. Thus, giving both coordinates (σ_A², σ_E²), the points are (39.600, …) and (40.400, …), with corresponding function values L_P1 and L_P2. This gives coefficients q₁ and q₂ and, from −1/q₂, an estimate of the sampling error of σ̂_A² which agrees closely with the value obtained from the inverse of the information matrix. Corresponding estimates for values of Δ of 1.0, 0.1, 0.04 and 0.01 are virtually identical, i.e. the quadratic approximation of log L_P(σ_A²) appears robust against the choice of Δ, as long as Δ is reasonably small but not so small as to cause problems of numerical accuracy. Analogous calculations for σ̂_E², for the points (40.527, 59.400) and (39.480, 60.600) (Δ = ±0.6), give coefficients q₁ and q₂ yielding an estimate of the sampling error which again agrees well with the value obtained from the information matrix.

Likelihood derived confidence intervals

Figure 1 illustrates the relationship between the overall likelihood, the profile likelihood and the confidence limits for this example. For two parameters, contours of the likelihood surface are of elliptic form. The χ² table value for one degree of freedom and an error probability of 5% is 3.84, i.e. χ²₁,α/2 = 1.92. Shown in the main part of Figure 1 is the log likelihood contour for which the deviation from the maximum is 1.92, i.e. points on and within the ellipse give the joint 95% confidence region for σ_A² and σ_E². The subgraphs to the right and above depict the log profile likelihood curves for σ_E² and σ_A², respectively. Also shown in the main part are points on the profile likelihood for each parameter, i.e. the REML estimates of σ_E² for given values of σ_A² are plotted against the latter, and the estimates of σ_A² for fixed σ_E² are plotted against these values of σ_E². For this example, the resulting curves are almost linear. Obviously, they intersect at the overall maximum likelihood estimate. For each of these curves, the intersection with the 1.92 contour then gives the points representing the bounds of the 95% confidence interval for the corresponding parameter. Note that these are the points on the contour ellipse at which the tangents are orthogonal to the axis for the parameter of interest, i.e. the length of the confidence interval is given by the distance between the tangents. In turn, the tangents intersect the corresponding profile log likelihood curve at a value of L₀ − 1.92.

For the example, the confidence limits for σ_A² are given by the points (σ_A², σ_E²) = (36.772, …) and (43.360, …). For the large data set considered, this agrees well with the large sample limits, obtained as estimate ± z_{α/2} × sampling error with z_{2.5%} = 1.96. While the lengths of the observed (6.5882) and theoretical (6.5877) confidence intervals are almost identical, the observed interval is clearly not quite symmetric, the part below the estimate amounting to 96.1% of that above the estimate. Similarly, for σ_E² the curve of profile points intersects the 1.92 contour at (42.347, …) and (37.697, …), close to the predicted confidence limits.

Corresponding calculations for various data sets with a hierarchical full-sib structure are summarised in Table 1. Generally, observed confidence intervals for σ_E² agree better with their predictions, and are more symmetric, than those for σ_A². Unless the data set is very small, 3.92 × the lower bound sampling error, the length of the predicted interval, agrees well with the length of the observed interval. However, the likelihood and profile likelihood curves are clearly not symmetric but skewed.

For a very simple example, a sire model (s = 100, d = 10, n = 1; no records on parents), the curvature of the likelihood has been tabulated elsewhere (Visscher et al., 1992). As illustrated in Figure 2, the asymmetry over the range considered decreases rapidly with increasing sample size. The expected sampling correlation between σ_A² and σ_E² does not appear to depend on the number of sire families, but is affected by their structure.

In general, parameters in a multi-parameter problem are correlated, i.e. a change in one parameter will result in a corresponding change in the values of the other parameters. In the estimation of variance components in particular, these sampling correlations are often large and negative, which implies considerable cross-substitution between estimates of the correlated parameters before the value of the likelihood changes substantially. Hence we are interested not only in confidence intervals for individual parameters, but also in the numerical values of the other parameters at the bounds, i.e., graphically, in all coordinates of the points at which the χ²₁,α/2 contour on the likelihood surface has tangents orthogonal to the axis for the parameter in question. Gathering these for all parameters then yields a grid of values depicting the limits of the confidence ellipsoid for all parameters. In the example above, with a correlation of −0.71 between σ̂_A² and σ̂_E², the lower confidence limit for σ̂_A² of 36.772 has an associated value for σ_E² above its estimate; any further increase of σ_E² (given σ_A² = 36.772) would yield a point outside the 1.92 likelihood contour, i.e. outside the approximate 95% confidence region. Table 2 gives the 6 points on the 1.92 confidence contour for a three-parameter problem, for 3 examples. Obviously, the three-dimensional confidence ellipsoid is contained within the cube defined by the confidence limits, i.e. the ML estimate of a parameter θᵢ, given another parameter θⱼ equal to one of its confidence limits, is within the confidence limits for θ̂ᵢ.

Approximating confidence intervals

For the large example considered above, a quadratic approximation of log L_P(σ_A²) with Δ = ±0.4 gave an estimate of the confidence interval directly from the coefficients q₁ and q₂. Choosing the upper limit so obtained as third point, a cubic approximation of log L_P(σ_A²) with coefficients q₁, q₂ and q₃ gave revised estimates of the confidence limits of σ̂_A². Evaluating a fourth point of log L_P(σ_A²), the quartic approximation gave an approximate confidence interval which, for this case, is equal, to three decimal places, to the interval derived from the actual (log) profile likelihood curve.

Figure 3 shows the actual (log) profile likelihood and its quadratic and cubic approximations for σ_C² estimated from a small sample, i.e. s = 100, d = 2 and n = 5 (cf. Table 2). Using Δ = ±0.15, the quadratic approximation gives estimated confidence limits for σ̂_C² = 15. As above, the estimate of the quadratic coefficient is little affected by a higher order approximation; corresponding coefficients and confidence limits were obtained for the cubic (h = 3) and quartic (h = 4) approximations.

Simulation

Examples presented so far assumed that the population variances were known. Sampling the matrices of SS/CP from a central Wishart distribution (with the appropriate degrees of freedom) instead, expected results derived from the population values can be compared to their empirical counterparts. Table 3 gives means and standard deviations over 500 replicates of the confidence intervals derived from the information matrix (large sample), from quadratic, cubic and quartic approximations of the profile likelihood, and from the observed profile likelihood, for data with s = 200, d = 5 and n = 2. For this example, involving a sufficiently large number of sires and records (N = 3200), differences between large sample and observed values are only small. As noted above for one of the numerical examples, large sample values tend to approximate the interval length more closely than its bounds. The quartic approximation of the profile likelihood gave estimates of the confidence limits virtually identical to those from the actual likelihood for all three components. For the cubic approximation, estimates of the upper bounds agreed, while the lower bounds differed slightly, because the third point used to derive the coefficients of the cubic function was chosen equal to the upper bound predicted from the quadratic approximation. Simulation results agreed well with values expected from the population variances.

Counting the number of replicates for each component with estimates within the expected large sample and the likelihood derived confidence intervals gave equal proportions for both (94.8% for σ_A², 95.4% for σ_C² and 94.4% for σ_E²), indicating that, for this example, large sample theory gave a good description of the sampling distribution of estimates.

Discussion

In re-examining well known properties of the likelihood framework of inference, a pragmatic procedure has been suggested to assess the accuracy of estimates in a multi-parameter model. It appears especially suited to analyses involving several, highly correlated parameters, or to algorithms which do not yield an estimate of the information matrix as a by-product. Typical examples are REML analyses of several traits, or analyses fitting an animal model with additional random effects using a derivative-free algorithm. The examples given show that a one-dimensional polynomial approximation of the profile likelihood, for one parameter at a time, will yield an appropriate estimate of its sampling error and confidence interval. Fitting a simple quadratic function appears to provide a good estimate of the curvature of the likelihood, i.e. of the sampling variance, even in cases where its shape is not normal. For the estimation of confidence limits, a higher order approximation seems preferable. The strategy suggested ensures that the amount of extrapolation required beyond the range of points used in the cubic or quartic approximation of the profile likelihood is reasonably small. Though not discussed here, the polynomial function approximating the (log) profile likelihood can obviously also be utilised to predict the value of the LR test criterion for various hypotheses concerning the parameter of interest. Clearly, in that case the quality of the prediction depends not only on the order of the polynomial function but also on the degree of extrapolation involved.

A simple, univariate analysis for data with a balanced, hierarchical full-sib structure was chosen for ease of illustration. For the cases considered, there was little difference between information matrix and likelihood derived confidence intervals, i.e. large sample theory was quite appropriate. For unbalanced data, more complicated models of analysis or smaller data sets, differences are likely to be bigger. This is illustrated in Table 4, which shows the expected, approximated and likelihood based confidence intervals for an animal model analysis fitting a maternal genetic effect as an additional random effect, and estimating both the maternal variance and a direct-maternal genetic covariance. Considering the same data structure as in Table 3, but estimating an additional component and with much higher sampling correlations between parameters, confidence intervals are markedly wider and more asymmetric, indicating a considerably more skewed likelihood, especially for σ_A².

An implicit assumption in deriving confidence intervals as outlined has been that the likelihood ratio test criterion indeed has a limiting χ² distribution with degrees of freedom equal to the number of parameters tested. There is extensive statistical literature on the distribution of the likelihood ratio and on adjustments to improve it, i.e. to make it follow a χ² distribution more closely; see, for instance, Barndorff-Nielsen and Cox (1984). Jensen (1991) concluded that for many cases with continuous data, an appropriately chosen Bartlett type adjustment (see eq. (3)) would work well, even for small sample sizes. LR tests for hypotheses dealing with covariance matrices have been considered (e.g. Nagarsenker and Pillai, 1973; Porteous, 1985; Møller, 1986), but little is known about the properties of the LR test applied to estimates of (co)variance components in quantitative genetics. Estimating the additive and dominance genetic variance for a small sample, Shaw (1987) reported the empirical power of the LR test found in a simulation study to be considerably less than expected. Research is required to establish when the assumption of a χ² distribution is inappropriate. The methodology described, however, should be directly applicable even if some modification is required.

Acknowledgements

This study was supported by the Agricultural and Food Research Council. Part of this work was carried out at the Animal Genetics and Breeding Unit of the University of New England, Armidale, NSW, supported by MRC grant UNE15.

References

Anderson, E.B. (1970). Asymptotic properties of conditional maximum likelihood estimators. J. Roy. Statist. Soc. B 32.

Barndorff-Nielsen, O.E. (1983). On a formula for the distribution of the maximum likelihood estimator. Biometrika 70.

Barndorff-Nielsen, O.E. and Cox, D.R. (1984). Bartlett adjustments to the likelihood ratio statistic and the distribution of the maximum likelihood estimator. J. Roy. Statist. Soc. B 46.

Bartlett, M.S. (1937). Properties of sufficiency and statistical tests. Proc. Roy. Soc. A 160.

Bartlett, M.S. (1953a). Approximate confidence intervals. I. Biometrika 40.

Bartlett, M.S. (1953b). Approximate confidence intervals. II. More than one unknown parameter. Biometrika 40.

Bartlett, M.S. (1955). Approximate confidence intervals. III. A bias correction. Biometrika 42.

Cox, D.R. (1975). Partial likelihood. Biometrika 62.

Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall, London.

Edwards, A.W.F. Likelihood. Cambridge University Press, Cambridge.

Gill, P.E., Murray, W. and Wright, M.H. (1981). Practical Optimization. Academic Press, New York.

Jensen, J.L. (1991). A large deviation-type approximation of the Box class of likelihood ratio criteria. J. Amer. Statist. Assoc. 86.

Juga, J. and Thompson, R. (1990). Estimation of bivariate variance components. Proc. 4th World Congr. Genet. Appl. Livest. Prod. XIII.

Kalbfleisch, J.D. and Sprott, D.A. (1970). Application of likelihood methods to models involving large numbers of parameters. J. Roy. Statist. Soc. B 32.

Kendall, M.G. and Stuart, A. (1973). The Advanced Theory of Statistics. Vol. 2: Inference and Relationship. Charles Griffin & Co. Ltd., London.

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist. 27.

Lawley, D.N. (1956). A general method for approximating to the distribution of the likelihood ratio criteria. Biometrika 43.

Matthews, D.E. (1988). Likelihood-based confidence intervals for functions of many parameters. Biometrika 75.

Meyer, K. (1987). Restricted Maximum Likelihood to estimate variance components for mixed models with two random factors. Genet. Sel. Evol. 19.

Meyer, K. (1990). Present status of knowledge about statistical procedures and algorithms to estimate variance and covariance components. Proc. 4th World Congr. Genet. Appl. Livest. Prod. XIII.

Meyer, K. (1991). Estimating variances and covariances for multivariate animal models by Restricted Maximum Likelihood. Genet. Sel. Evol. 23: 67-83.

Minkin, S. (1983). Assessing the quadratic approximation to the log likelihood function in nonnormal linear models. Biometrika 70.

Møller, J. (1986). Bartlett adjustments for structured covariances. Scand. J. Stat. 13.

Nagarsenker, B.N. and Pillai, K.C.S. (1973). Distribution of the likelihood ratio criterion for testing a hypothesis specifying a covariance matrix. Biometrika 60.

Neyman, J. and Pearson, E.S. (1928). On the use and interpretation of certain test criteria for the purpose of statistical inference. Part I. Biometrika 20A.

Patefield, W.M. (1977). On the maximized likelihood function. Sankhya B 39.

Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika 58.

Porteous, B.T. (1985). Improved likelihood ratio statistics for covariance selection models. Biometrika 72.

Richards, F.S.G. (1961). A method of Maximum Likelihood estimation. J. Roy. Statist. Soc. B 23.

Shaw, R.G. (1987). Maximum-likelihood approaches applied to quantitative genetics of natural populations. Evolution 41.

Smith, S.P. and Graser, H.-U. (1986). Estimating variance components in a class of mixed models by Restricted Maximum Likelihood. J. Dairy Sci. 69.

Sprott, D.A. and Kalbfleisch, J.D. (1969). Examples of likelihoods and comparisons with point estimates and large sample approximations. J. Amer. Statist. Assoc. 64.

Sprott, D.A. (1973). Normal likelihoods and their relation to large sample theory of estimation. Biometrika 60.

Sprott, D.A. (1975). Application of maximum likelihood methods to finite samples. Sankhya 37.

Sprott, D.A. (1980). Maximum likelihood in small samples: estimation in the presence of nuisance parameters. Biometrika 67.

Thompson, R. (1977). Estimation of quantitative genetic parameters. In: Proc. Int. Conf. on Quantitative Genetics, E. Pollak, O. Kempthorne and T.B. Bailey, eds., Iowa State University Press, Ames.

Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models and the Gauss-Newton method. Biometrika 61.

Table 1: Observed and predicted approximate 95% confidence intervals for animal model variance component estimates from data with a hierarchical full-sib structure with s sires, d dams per sire and n offspring per dam, i.e. N = s(1 + d(1 + n)) records; estimates are 40 and 60 for the genetic and error variance, respectively.

[Columns: s, d, n, N, r^a, and for each of the genetic and error variance: s.e.^b, low^c, up^d, len^e, sym^f; the tabulated values did not survive transcription.]

a Expected sampling correlation between genetic and error variance
b Asymptotic (lower bound) sampling error
c Lower confidence limit
d Upper confidence limit
e Length of confidence interval
f Symmetry of confidence interval about the estimate (= (estimate − low)/(up − estimate))
g First line: observed value, derived from the profile likelihood; second line: expected value (= estimate ± z_{α/2} × s.e.)

Table 2: Approximate 95% confidence intervals for animal model estimates of the additive genetic (σ_A²), common environmental (σ_C²) and error (σ_E²) variance from a balanced hierarchical full-sib design with population values of 40, 15 and 45, respectively; values for all parameters are given for each point, with the confidence limit for the parameter in question (in bold) and the associated maximum likelihood estimates given that parameter on the first line, and the corresponding large sample predicted limits on the second line.

[Columns: lower limit (σ_A², σ_C², σ_E²), upper limit (σ_A², σ_C², σ_E²), length and symmetry, with rows for σ_A², σ_C² and σ_E² under each of three designs (100 sires, 2 dams per sire, 5 offspring per dam; … sires, 5 dams per sire, 2 offspring per dam; … sires, 5 dams per sire, 2 offspring per dam); the tabulated values did not survive transcription.]


More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

POWER AND TYPE I ERROR RATE COMPARISON OF MULTIVARIATE ANALYSIS OF VARIANCE

POWER AND TYPE I ERROR RATE COMPARISON OF MULTIVARIATE ANALYSIS OF VARIANCE POWER AND TYPE I ERROR RATE COMPARISON OF MULTIVARIATE ANALYSIS OF VARIANCE Supported by Patrick Adebayo 1 and Ahmed Ibrahim 1 Department of Statistics, University of Ilorin, Kwara State, Nigeria Department

More information

Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition

Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition Bias Correction of Cross-Validation Criterion Based on Kullback-Leibler Information under a General Condition Hirokazu Yanagihara 1, Tetsuji Tonda 2 and Chieko Matsumoto 3 1 Department of Social Systems

More information

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction Sankhyā : The Indian Journal of Statistics 2007, Volume 69, Part 4, pp. 700-716 c 2007, Indian Statistical Institute More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order

More information

PRINCIPLES OF STATISTICAL INFERENCE

PRINCIPLES OF STATISTICAL INFERENCE Advanced Series on Statistical Science & Applied Probability PRINCIPLES OF STATISTICAL INFERENCE from a Neo-Fisherian Perspective Luigi Pace Department of Statistics University ofudine, Italy Alessandra

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Loglikelihood and Confidence Intervals

Loglikelihood and Confidence Intervals Stat 504, Lecture 2 1 Loglikelihood and Confidence Intervals The loglikelihood function is defined to be the natural logarithm of the likelihood function, l(θ ; x) = log L(θ ; x). For a variety of reasons,

More information

MULTIVARIATE ANALYSIS OF VARIANCE

MULTIVARIATE ANALYSIS OF VARIANCE MULTIVARIATE ANALYSIS OF VARIANCE RAJENDER PARSAD AND L.M. BHAR Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 0 0 lmb@iasri.res.in. Introduction In many agricultural experiments,

More information

Greene, Econometric Analysis (6th ed, 2008)

Greene, Econometric Analysis (6th ed, 2008) EC771: Econometrics, Spring 2010 Greene, Econometric Analysis (6th ed, 2008) Chapter 17: Maximum Likelihood Estimation The preferred estimator in a wide variety of econometric settings is that derived

More information

Use of sparse matrix absorption in animal breeding

Use of sparse matrix absorption in animal breeding Original article Use of sparse matrix absorption in animal breeding B. Tier S.P. Smith University of New England, Anirreal Genetics and Breeding Unit, Ar!nidale, NSW 2351, Australia (received 1 March 1988;

More information

UTILIZING PRIOR KNOWLEDGE IN ROBUST OPTIMAL EXPERIMENT DESIGN. EE & CS, The University of Newcastle, Australia EE, Technion, Israel.

UTILIZING PRIOR KNOWLEDGE IN ROBUST OPTIMAL EXPERIMENT DESIGN. EE & CS, The University of Newcastle, Australia EE, Technion, Israel. UTILIZING PRIOR KNOWLEDGE IN ROBUST OPTIMAL EXPERIMENT DESIGN Graham C. Goodwin James S. Welsh Arie Feuer Milan Depich EE & CS, The University of Newcastle, Australia 38. EE, Technion, Israel. Abstract:

More information

Marginal Likelihood-based One-sided LR Test for Testing Higher Order Autocorrelation in Presence of Nuisance Parameters -A Distance Based Approach

Marginal Likelihood-based One-sided LR Test for Testing Higher Order Autocorrelation in Presence of Nuisance Parameters -A Distance Based Approach International Journal of Statistics and Probability; Vol 1, No 2; 2012 ISSN 1927-7032 E-ISSN 1927-7040 Published by Canadian Center of Science and Education Marginal Likelihood-based One-sided LR Test

More information

Akaike Information Criterion

Akaike Information Criterion Akaike Information Criterion Shuhua Hu Center for Research in Scientific Computation North Carolina State University Raleigh, NC February 7, 2012-1- background Background Model statistical model: Y j =

More information

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility American Economic Review: Papers & Proceedings 2016, 106(5): 400 404 http://dx.doi.org/10.1257/aer.p20161082 Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility By Gary Chamberlain*

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Canonical Correlation Analysis of Longitudinal Data

Canonical Correlation Analysis of Longitudinal Data Biometrics Section JSM 2008 Canonical Correlation Analysis of Longitudinal Data Jayesh Srivastava Dayanand N Naik Abstract Studying the relationship between two sets of variables is an important multivariate

More information

Wiley. Methods and Applications of Linear Models. Regression and the Analysis. of Variance. Third Edition. Ishpeming, Michigan RONALD R.

Wiley. Methods and Applications of Linear Models. Regression and the Analysis. of Variance. Third Edition. Ishpeming, Michigan RONALD R. Methods and Applications of Linear Models Regression and the Analysis of Variance Third Edition RONALD R. HOCKING PenHock Statistical Consultants Ishpeming, Michigan Wiley Contents Preface to the Third

More information

A MODIFICATION OF THE HARTUNG KNAPP CONFIDENCE INTERVAL ON THE VARIANCE COMPONENT IN TWO VARIANCE COMPONENT MODELS

A MODIFICATION OF THE HARTUNG KNAPP CONFIDENCE INTERVAL ON THE VARIANCE COMPONENT IN TWO VARIANCE COMPONENT MODELS K Y B E R N E T I K A V O L U M E 4 3 ( 2 0 0 7, N U M B E R 4, P A G E S 4 7 1 4 8 0 A MODIFICATION OF THE HARTUNG KNAPP CONFIDENCE INTERVAL ON THE VARIANCE COMPONENT IN TWO VARIANCE COMPONENT MODELS

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.0 Discrete distributions in statistical analysis Discrete models play an extremely important role in probability theory and statistics for modeling count data. The use of discrete

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

Application of Parametric Homogeneity of Variances Tests under Violation of Classical Assumption

Application of Parametric Homogeneity of Variances Tests under Violation of Classical Assumption Application of Parametric Homogeneity of Variances Tests under Violation of Classical Assumption Alisa A. Gorbunova and Boris Yu. Lemeshko Novosibirsk State Technical University Department of Applied Mathematics,

More information

LECTURE 10: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING. The last equality is provided so this can look like a more familiar parametric test.

LECTURE 10: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING. The last equality is provided so this can look like a more familiar parametric test. Economics 52 Econometrics Professor N.M. Kiefer LECTURE 1: NEYMAN-PEARSON LEMMA AND ASYMPTOTIC TESTING NEYMAN-PEARSON LEMMA: Lesson: Good tests are based on the likelihood ratio. The proof is easy in the

More information

Flat and multimodal likelihoods and model lack of fit in curved exponential families

Flat and multimodal likelihoods and model lack of fit in curved exponential families Mathematical Statistics Stockholm University Flat and multimodal likelihoods and model lack of fit in curved exponential families Rolf Sundberg Research Report 2009:1 ISSN 1650-0377 Postal address: Mathematical

More information

Statistics and econometrics

Statistics and econometrics 1 / 36 Slides for the course Statistics and econometrics Part 10: Asymptotic hypothesis testing European University Institute Andrea Ichino September 8, 2014 2 / 36 Outline Why do we need large sample

More information

Introduction to Estimation Methods for Time Series models Lecture 2

Introduction to Estimation Methods for Time Series models Lecture 2 Introduction to Estimation Methods for Time Series models Lecture 2 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 2 SNS Pisa 1 / 21 Estimators:

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Mohsen Pourahmadi. 1. A sampling theorem for multivariate stationary processes. J. of Multivariate Analysis, Vol. 13, No. 1 (1983),

Mohsen Pourahmadi. 1. A sampling theorem for multivariate stationary processes. J. of Multivariate Analysis, Vol. 13, No. 1 (1983), Mohsen Pourahmadi PUBLICATIONS Books and Editorial Activities: 1. Foundations of Time Series Analysis and Prediction Theory, John Wiley, 2001. 2. Computing Science and Statistics, 31, 2000, the Proceedings

More information

Chapter 4: Factor Analysis

Chapter 4: Factor Analysis Chapter 4: Factor Analysis In many studies, we may not be able to measure directly the variables of interest. We can merely collect data on other variables which may be related to the variables of interest.

More information

Some History of Optimality

Some History of Optimality IMS Lecture Notes- Monograph Series Optimality: The Third Erich L. Lehmann Symposium Vol. 57 (2009) 11-17 @ Institute of Mathematical Statistics, 2009 DOl: 10.1214/09-LNMS5703 Erich L. Lehmann University

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Ch. 5 Hypothesis Testing

Ch. 5 Hypothesis Testing Ch. 5 Hypothesis Testing The current framework of hypothesis testing is largely due to the work of Neyman and Pearson in the late 1920s, early 30s, complementing Fisher s work on estimation. As in estimation,

More information

Approximate Inference for the Multinomial Logit Model

Approximate Inference for the Multinomial Logit Model Approximate Inference for the Multinomial Logit Model M.Rekkas Abstract Higher order asymptotic theory is used to derive p-values that achieve superior accuracy compared to the p-values obtained from traditional

More information

Psychology 282 Lecture #4 Outline Inferences in SLR

Psychology 282 Lecture #4 Outline Inferences in SLR Psychology 282 Lecture #4 Outline Inferences in SLR Assumptions To this point we have not had to make any distributional assumptions. Principle of least squares requires no assumptions. Can use correlations

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

Likelihood Inference in the Presence of Nuisance Parameters

Likelihood Inference in the Presence of Nuisance Parameters PHYSTAT2003, SLAC, September 8-11, 2003 1 Likelihood Inference in the Presence of Nuance Parameters N. Reid, D.A.S. Fraser Department of Stattics, University of Toronto, Toronto Canada M5S 3G3 We describe

More information

The formal relationship between analytic and bootstrap approaches to parametric inference

The formal relationship between analytic and bootstrap approaches to parametric inference The formal relationship between analytic and bootstrap approaches to parametric inference T.J. DiCiccio Cornell University, Ithaca, NY 14853, U.S.A. T.A. Kuffner Washington University in St. Louis, St.

More information

Sample size determination for logistic regression: A simulation study

Sample size determination for logistic regression: A simulation study Sample size determination for logistic regression: A simulation study Stephen Bush School of Mathematical Sciences, University of Technology Sydney, PO Box 123 Broadway NSW 2007, Australia Abstract This

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) II Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 1 Compare Means from More Than Two

More information

The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series

The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series Willa W. Chen Rohit S. Deo July 6, 009 Abstract. The restricted likelihood ratio test, RLRT, for the autoregressive coefficient

More information

Simulating Properties of the Likelihood Ratio Test for a Unit Root in an Explosive Second Order Autoregression

Simulating Properties of the Likelihood Ratio Test for a Unit Root in an Explosive Second Order Autoregression Simulating Properties of the Likelihood Ratio est for a Unit Root in an Explosive Second Order Autoregression Bent Nielsen Nuffield College, University of Oxford J James Reade St Cross College, University

More information

Likelihood Inference in the Presence of Nuisance Parameters

Likelihood Inference in the Presence of Nuisance Parameters Likelihood Inference in the Presence of Nuance Parameters N Reid, DAS Fraser Department of Stattics, University of Toronto, Toronto Canada M5S 3G3 We describe some recent approaches to likelihood based

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

More information

Efficiency of Profile/Partial Likelihood in the Cox Model

Efficiency of Profile/Partial Likelihood in the Cox Model Efficiency of Profile/Partial Likelihood in the Cox Model Yuichi Hirose School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, New Zealand Summary. This paper shows

More information

Likelihood Methods. 1 Likelihood Functions. The multivariate normal distribution likelihood function is

Likelihood Methods. 1 Likelihood Functions. The multivariate normal distribution likelihood function is Likelihood Methods 1 Likelihood Functions The multivariate normal distribution likelihood function is The log of the likelihood, say L 1 is Ly = π.5n V.5 exp.5y Xb V 1 y Xb. L 1 = 0.5[N lnπ + ln V +y Xb

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

Exam 2. Jeremy Morris. March 23, 2006

Exam 2. Jeremy Morris. March 23, 2006 Exam Jeremy Morris March 3, 006 4. Consider a bivariate normal population with µ 0, µ, σ, σ and ρ.5. a Write out the bivariate normal density. The multivariate normal density is defined by the following

More information

Next is material on matrix rank. Please see the handout

Next is material on matrix rank. Please see the handout B90.330 / C.005 NOTES for Wednesday 0.APR.7 Suppose that the model is β + ε, but ε does not have the desired variance matrix. Say that ε is normal, but Var(ε) σ W. The form of W is W w 0 0 0 0 0 0 w 0

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

Modern Likelihood-Frequentist Inference. Donald A Pierce, OHSU and Ruggero Bellio, Univ of Udine

Modern Likelihood-Frequentist Inference. Donald A Pierce, OHSU and Ruggero Bellio, Univ of Udine Modern Likelihood-Frequentist Inference Donald A Pierce, OHSU and Ruggero Bellio, Univ of Udine Shortly before 1980, important developments in frequency theory of inference were in the air. Strictly, this

More information

Statistical Inference Using Maximum Likelihood Estimation and the Generalized Likelihood Ratio

Statistical Inference Using Maximum Likelihood Estimation and the Generalized Likelihood Ratio \"( Statistical Inference Using Maximum Likelihood Estimation and the Generalized Likelihood Ratio When the True Parameter Is on the Boundary of the Parameter Space by Ziding Feng 1 and Charles E. McCulloch2

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical

More information

Consistency of Test-based Criterion for Selection of Variables in High-dimensional Two Group-Discriminant Analysis

Consistency of Test-based Criterion for Selection of Variables in High-dimensional Two Group-Discriminant Analysis Consistency of Test-based Criterion for Selection of Variables in High-dimensional Two Group-Discriminant Analysis Yasunori Fujikoshi and Tetsuro Sakurai Department of Mathematics, Graduate School of Science,

More information

MIVQUE and Maximum Likelihood Estimation for Multivariate Linear Models with Incomplete Observations

MIVQUE and Maximum Likelihood Estimation for Multivariate Linear Models with Incomplete Observations Sankhyā : The Indian Journal of Statistics 2006, Volume 68, Part 3, pp. 409-435 c 2006, Indian Statistical Institute MIVQUE and Maximum Likelihood Estimation for Multivariate Linear Models with Incomplete

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Chapter 12 REML and ML Estimation

Chapter 12 REML and ML Estimation Chapter 12 REML and ML Estimation C. R. Henderson 1984 - Guelph 1 Iterative MIVQUE The restricted maximum likelihood estimator (REML) of Patterson and Thompson (1971) can be obtained by iterating on MIVQUE,

More information

What s New in Econometrics. Lecture 13

What s New in Econometrics. Lecture 13 What s New in Econometrics Lecture 13 Weak Instruments and Many Instruments Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Motivation 3. Weak Instruments 4. Many Weak) Instruments

More information

BARTLETT IDENTITIES AND LARGE DEVIATIONS IN LIKELIHOOD THEORY 1. By Per Aslak Mykland University of Chicago

BARTLETT IDENTITIES AND LARGE DEVIATIONS IN LIKELIHOOD THEORY 1. By Per Aslak Mykland University of Chicago The Annals of Statistics 1999, Vol. 27, No. 3, 1105 1117 BARTLETT IDENTITIES AND LARGE DEVIATIONS IN LIKELIHOOD THEORY 1 By Per Aslak Mykland University of Chicago The connection between large and small

More information

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability

More information

A Monte-Carlo study of asymptotically robust tests for correlation coefficients

A Monte-Carlo study of asymptotically robust tests for correlation coefficients Biometrika (1973), 6, 3, p. 661 551 Printed in Great Britain A Monte-Carlo study of asymptotically robust tests for correlation coefficients BY G. T. DUNCAN AND M. W. J. LAYAKD University of California,

More information