Likelihood Methods

1 Likelihood Functions

The multivariate normal distribution likelihood function is

$$ L(y) = (2\pi)^{-.5N}\,|V|^{-.5}\exp\{-.5(y - Xb)'V^{-1}(y - Xb)\}. $$

The log of the likelihood, say $L_1$, is

$$ L_1 = -.5[\,N\ln(2\pi) + \ln|V| + (y - Xb)'V^{-1}(y - Xb)\,]. $$

The term $N\ln(2\pi)$ is a constant that does not involve any of the unknown variances or effects in the model; therefore, it is commonly omitted during maximization computations. Maximizing the log likelihood maximizes the original likelihood function.

Previously,

$$ |V| = |R|\cdot|Z'R^{-1}Z + G^{-1}|\cdot|G|, $$

therefore,

$$ \ln|V| = \ln|R| + \ln|G| + \ln|Z'R^{-1}Z + G^{-1}|. $$

If $R = I\sigma_0^2$, then

$$ \ln|R| = \ln|I\sigma_0^2| = \ln[(\sigma_0^2)^N\,|I|] = N\ln\sigma_0^2. $$

Similarly, if $G = \sum^{+} I\sigma_i^2$ (a direct sum over factors), where $i = 1$ to $s$, then

$$ \ln|G| = \sum_i \ln|I\sigma_i^2| = \sum_i q_i\ln\sigma_i^2. $$

The exception is that in animal models one of the $G_i$ is equal to $A\sigma_i^2$. In that case,

$$ \ln|A\sigma_i^2| = \ln[(\sigma_i^2)^{q_i}\,|A|] = q_i\ln\sigma_i^2 + \ln|A|. $$

Recall that

$$ C = \begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{pmatrix}, $$

so that

$$ |C| = |Z'R^{-1}Z + G^{-1}|\cdot|X'V^{-1}X|, $$

which is

$$ \ln|C| = \ln|Z'R^{-1}Z + G^{-1}| + \ln|X'V^{-1}X|. $$
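The two determinant identities above can be checked numerically. Below is a small numpy sketch with arbitrary illustrative matrices; none of these names, dimensions, or values come from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, q = 12, 2, 5                         # records, fixed effects, random levels
X = rng.standard_normal((N, p))
Z = rng.standard_normal((N, q))
R = np.eye(N) * 2.0                        # R = I * sigma_0^2
G = np.eye(q) * 0.5                        # G = I * sigma_i^2
V = Z @ G @ Z.T + R

Rinv, Ginv = np.linalg.inv(R), np.linalg.inv(G)
ZRZ_G = Z.T @ Rinv @ Z + Ginv

# ln|V| = ln|R| + ln|G| + ln|Z'R^{-1}Z + G^{-1}|
print(np.linalg.slogdet(V)[1],
      np.linalg.slogdet(R)[1] + np.linalg.slogdet(G)[1]
      + np.linalg.slogdet(ZRZ_G)[1])       # identical up to rounding

# ln|C| = ln|Z'R^{-1}Z + G^{-1}| + ln|X'V^{-1}X|, with C the MME coefficient matrix
Vinv = np.linalg.inv(V)
C = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
              [Z.T @ Rinv @ X, ZRZ_G]])
print(np.linalg.slogdet(C)[1],
      np.linalg.slogdet(ZRZ_G)[1] + np.linalg.slogdet(X.T @ Vinv @ X)[1])
```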

2 Maximum Likelihood Method

Hartley and Rao (1967) described the maximum likelihood approach for the estimation of variance components. Let $L$ be equivalent to $L_1$ except for the constant involving $\pi$:

$$ L = -.5[\ln|V| + (y - Xb)'V^{-1}(y - Xb)]. $$

The derivatives of $L$ with respect to $b$ and to $\sigma_i^2$, for $i = 0, 1, \ldots, s$, are

$$ \partial L/\partial b = X'V^{-1}y - X'V^{-1}Xb, $$

$$ \partial L/\partial\sigma_i^2 = -.5\,tr[V^{-1}(\partial V/\partial\sigma_i^2)] + .5(y - Xb)'V^{-1}(\partial V/\partial\sigma_i^2)V^{-1}(y - Xb) = -.5\,tr[V^{-1}V_i] + .5(y - Xb)'V^{-1}V_iV^{-1}(y - Xb). $$

Equating the derivatives to zero gives

$$ \hat{b} = (X'V^{-1}X)^{-}X'V^{-1}y, \qquad tr[V^{-1}V_i] = (y - X\hat{b})'V^{-1}V_iV^{-1}(y - X\hat{b}). $$

Recall that

$$ Py = V^{-1}(y - X\hat{b}), $$

where $P$ is the projection matrix, and that $V_i = Z_iZ_i'$; then

$$ tr[V^{-1}Z_iZ_i'] = y'PV_iPy. $$

In usual mixed model theory, the solution vector for a random factor may be written as

$$ \hat{u}_i = G_iZ_i'Py, $$

so that

$$ y'PV_iPy = y'PZ_iG_iG_i^{-1}\,G_i^{-1}G_iZ_i'Py = \hat{u}_i'G_i^{-1}G_i^{-1}\hat{u}_i = \hat{u}_i'\hat{u}_i/\sigma_i^4. $$

Also,

$$ tr[V^{-1}V_i] = tr[(R^{-1} - R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1})Z_iZ_i']. $$

Let

$$ T = (Z'R^{-1}Z + G^{-1})^{-1}, \qquad R = I\sigma_0^2, \qquad G = \sum{}^{+}I\sigma_i^2. $$

3 then tr[v 1 V i ] = trz iz i σ 0 trz iztz Z i σ 4 0. If T can be partitioned into submatrices for each rom factor, then Tσ 0 Z Z + + Iαi = I, + TZ Zσ0 = I T Iσ i, TZ Z i σ 0 = I T ii σ i, which yields trz iztz Z i σ0 4 = trz iz i σ0 tri T ii σ i σi. Finally, Combining results gives for i = 1,,..., s, for i = 0 gives tr[v 1 V i ] = tri T ii σi σi = triσ i = q i σ i trt ii σ 4 i trt ii σ 4 i. ˆσ i = û iûi + trt iiˆσ 0/q i ˆσ 0 = y y ˆb X y û Z y/ N..1 The EM Algorithm EM sts for Expectation Maximization. The procedure alternates between calculating conditional expected values maximizing simplified likelihoods. The actual data y are called the incomplete data in the EM algorithm, the complete data are considered to be y the unobservable rom effects, u i. If the realized values of the unobservable rom effects were known, then their variance would be the average of their squared values, i.e., ˆσ i = u iu i /q i. However, in real life the realized values of the rom effects are unknown. The steps of the EM algorithm are as follows: Step 0. Decide on starting values for the variances set m = 0. 3

Step 1 (E-step). Calculate the conditional expectation of the sufficient statistics, conditional on the incomplete data:

$$ E(u_i'u_i \mid y) = \sigma_i^{4(m)}\,y'P^{(m)}Z_iZ_i'P^{(m)}y + tr[\sigma_i^{2(m)}I - \sigma_i^{4(m)}Z_i'V^{(m)-1}Z_i] = \hat{t}_i^{(m)}. $$

Step 2 (M-step). Maximize the likelihood of the complete data:

$$ \sigma_i^{2(m+1)} = \hat{t}_i^{(m)}/q_i, \qquad i = 0, 1, 2, \ldots, s. $$

Step 3. If convergence is reached, set $\hat{\sigma}^2 = \sigma^{2(m+1)}$; otherwise increase $m$ by one and return to Step 1.

This is equivalent to constructing and solving the mixed model equations with a given set of variances, $\sigma^{2(m)}$, and then

$$ \sigma_0^{2(m+1)} = [y'y - \hat{b}'X'y - \hat{u}'Z'y]/N, $$
$$ \sigma_i^{2(m+1)} = [\hat{u}_i'\hat{u}_i + \sigma_0^{2(m+1)}tr(T_{ii})]/q_i. $$
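A minimal numpy sketch of this EM (ML) iteration for a model with a single random factor; the function name, interface, and dense linear algebra are illustrative assumptions rather than part of the notes.

```python
import numpy as np

def em_ml(y, X, Z, s2u=1.0, s2e=1.0, rounds=100):
    """EM iterations for ML estimation of one random-factor variance (s2u)
    and the residual variance (s2e), via the mixed model equations."""
    y = np.asarray(y, float)
    N, p, q = len(y), X.shape[1], Z.shape[1]
    for _ in range(rounds):
        alpha = s2e / s2u
        # mixed model equations for the current variance ratio
        C = np.block([[X.T @ X, X.T @ Z],
                      [Z.T @ X, Z.T @ Z + alpha * np.eye(q)]])
        sol = np.linalg.solve(C, np.concatenate([X.T @ y, Z.T @ y]))
        b, u = sol[:p], sol[p:]
        # T_ii on the ratio scale, from the random-factor block
        Tii = np.linalg.inv(Z.T @ Z + alpha * np.eye(q))
        s2e = (y @ y - b @ (X.T @ y) - u @ (Z.T @ y)) / N   # divisor N for ML
        s2u = (u @ u + np.trace(Tii) * s2e) / q             # u'u + tr(T_ii)*s2e
    return s2u, s2e
```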

3 Restricted Maximum Likelihood Method

Restricted (or residual) maximum likelihood, REML, was first suggested by Thompson (1962) and described formally by Patterson and Thompson (1971). The procedure requires that $y$ has a multivariate normal distribution. The method is translation invariant. The maximum likelihood approach automatically keeps the estimator within the allowable parameter space (i.e., zero to plus infinity); therefore, REML is a biased procedure. REML was proposed as an improvement to ML in order to account for the degrees of freedom lost in estimating fixed effects.

The likelihood function used in REML is that for a set of error contrasts (i.e., residuals) assumed to have a multivariate normal distribution. The multivariate normal likelihood function for the error contrasts $K'y$, where $K'X = 0$ and $K'$ has rank equal to $N - r(X)$, is

$$ L(K'y) = (2\pi)^{-.5(N - r(X))}\,|K'VK|^{-.5}\exp\{-.5\,y'K(K'VK)^{-1}K'y\}. $$

The natural log of the likelihood function is

$$ L_3 = -.5(N - r(X))\ln(2\pi) - .5\ln|K'VK| - .5\,y'K(K'VK)^{-1}K'y. $$

Notice that $-.5(N - r(X))\ln(2\pi)$ is a constant that does not depend on the unknown variance components or factors in the model and can therefore be ignored, giving $L_4$. Searle (1979) showed that

$$ \ln|K'VK| = \ln|V| + \ln|X'V^{-1}X| $$

and

$$ y'K(K'VK)^{-1}K'y = y'Py = (y - X\hat{b})'V^{-1}(y - X\hat{b}) $$

for any $K'$ such that $K'X = 0$. Hence, $L_4$ can be written as

$$ L_4 = -.5\ln|V| - .5\ln|X'V^{-1}X| - .5(y - X\hat{b})'V^{-1}(y - X\hat{b}). $$

REML can be calculated a number of different ways.

1. The derivative-free approach is a search technique to find the parameters that maximize the log likelihood function. Two techniques will be described here.

2. First derivatives (EM) is where the first derivatives of the log likelihood are determined and set to zero in order to maximize the likelihood function. Solutions need to be obtained by iteration because the resulting equations are non-linear.

3. Second derivatives are generally more computationally demanding. Gradient methods are used to find the parameters that make the first derivatives equal to zero. Newton-Raphson (which involves the observed information matrix) and Fisher's Method of Scoring (which involves the expected information matrix) have been used. Lately, the average information algorithm (which averages the observed and expected information matrices) has been used to reduce the computational time.
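For small examples, $L_4$ can be evaluated directly from this definition. A dense numpy sketch, illustrative only; the brute-force inverses are practical only for small $N$.

```python
import numpy as np

def L4_direct(y, X, V):
    """Evaluate L4 = -0.5*( ln|V| + ln|X'V^{-1}X| + (y - Xb)'V^{-1}(y - Xb) )
    for a given V, using the GLS solution for b-hat."""
    y = np.asarray(y, float)
    Vinv = np.linalg.inv(V)
    XtVinv = X.T @ Vinv
    b = np.linalg.solve(XtVinv @ X, XtVinv @ y)      # b-hat
    r = y - X @ b
    return -0.5 * (np.linalg.slogdet(V)[1]
                   + np.linalg.slogdet(XtVinv @ X)[1]
                   + r @ Vinv @ r)
```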

3.1 Example Problem

All of the approaches attempt to maximize the log likelihood function of the error contrasts. To illustrate the methods, consider a single-trait model with three factors, F, A, and B, of which A and B are random factors. There were a total of 90 observations, and the total sum of squares was 356,000. The least squares equations for this small example involve the levels F1, F2, A1, A2, A3, B1, B2, B3, and B4.

3.1.1 Derivative Free REML

Imagine an $s$-dimensional array containing the values of the likelihood function for every possible set of values of the ratios of the components to the residual variance. The technique is to search this array and find the set of ratios for which the likelihood function is maximized. There is more than one way to conduct the search. Care must be taken to find the global maximum rather than one of possibly many local maxima. At the same time, the number of likelihood evaluations to be computed must also be minimized.

Various alternative forms of $L_4$ can be derived. Note that

$$ \ln|V| = \ln|R| + \ln|G| + \ln|G^{-1} + Z'R^{-1}Z| $$

and that

$$ \ln|X'V^{-1}X| = \ln|C| - \ln|Z'R^{-1}Z + G^{-1}|, $$

and combining these results gives

$$ L_4 = -.5\ln|R| - .5\ln|G| - .5\ln|C| - .5\,y'Py. $$

Now note that

$$ \ln|R| = \ln|I\sigma_0^2| = N\ln\sigma_0^2, $$
$$ \ln|G| = \sum_i q_i\ln\sigma_i^2, $$
$$ \ln|C| = \ln|X'R^{-1}X| + \ln|Z'SZ + G^{-1}|. $$

Then, since

$$ \ln|X'R^{-1}X| = \ln|X'X\sigma_0^{-2}| = \ln[(\sigma_0^{-2})^{r(X)}|X'X|] = \ln|X'X| - r(X)\ln\sigma_0^2 $$

and

$$ Z'SZ + G^{-1} = \sigma_0^{-2}Z'MZ + G^{-1} = \sigma_0^{-2}(Z'MZ + G^{-1}\sigma_0^2), $$

where $M = I - X(X'X)^{-}X'$, it follows that

$$ \ln|C| = \ln|X'X| - r(X)\ln\sigma_0^2 - q\ln\sigma_0^2 + \ln|Z'MZ + G^{-1}\sigma_0^2|, $$

with $q = \sum_i q_i$, and finally the log-likelihood function becomes

$$ L_4 = -.5(N - r(X) - q)\ln\sigma_0^2 - .5\sum_i q_i\ln\sigma_i^2 - .5\ln|C| - .5\,y'Py, $$

where

$$ C = \begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + G^{-1}\sigma_0^2 \end{pmatrix}. $$

Note that

$$ \sum_i q_i\ln\sigma_i^2 = \sum_i q_i\ln(\sigma_0^2/\alpha_i) = \sum_i q_i\ln\sigma_0^2 - \sum_i q_i\ln\alpha_i, $$

so that

$$ L_4 = -.5[(N - r(X))\ln\sigma_0^2 - \sum_i q_i\ln\alpha_i + \ln|C| + y'Py]. $$

The quantity $y'Py$ is $y'(y - X\hat{b} - Z\hat{u})/\sigma_0^2$. The computations are achieved by constructing the following matrix,

$$ \begin{pmatrix} X'X & X'Z & X'y \\ Z'X & Z'Z + G^{-1}\sigma_0^2 & Z'y \\ y'X & y'Z & y'y \end{pmatrix} = \begin{pmatrix} C & W'y \\ y'W & y'y \end{pmatrix}, $$

and then, by Gaussian elimination of one row at a time, the sum of the logs of the non-zero pivots (using the same ordering for each evaluation of the likelihood) gives $\ln|C|$, and the last pivot gives $y'(y - X\hat{b} - Z\hat{u})$. Gaussian elimination, using sparse matrix techniques, requires less computing time than inverting the coefficient matrix of the mixed model equations. The ordering of factors within the equations could be critical to the computational process, and some experimentation may be necessary to determine the best ordering. The likelihood function can be evaluated without calculating solutions to the mixed model equations, without inverting the coefficient matrix of the mixed model equations, and without computing any of the $\hat{\sigma}_i^2$. The formulations for more general models and multiple-trait models are more complex, but follow the same ideas.
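A dense numpy sketch of this pivot-based evaluation of $\ln|C|$ and $y'(y - X\hat{b} - Z\hat{u})$; the function and argument names are assumptions of the sketch, and a sparse factorization would replace the naive elimination loop in practice.

```python
import numpy as np

def lnC_and_sse(X, Z, y, Ginv_s0):
    """Return ln|C| and y'(y - Xb - Zu) by Gaussian elimination of the matrix
        [ X'X  X'Z                 X'y ]
        [ Z'X  Z'Z + G^{-1}s0^2    Z'y ]
        [ y'X  y'Z                 y'y ],
    where Ginv_s0 = G^{-1}*sigma_0^2 (e.g. a diagonal matrix of the alpha_i).
    Assumes a full-rank, positive-definite system; no pivoting is done."""
    X = np.asarray(X, float); Z = np.asarray(Z, float); y = np.asarray(y, float)
    W = np.hstack([X, Z])
    M = np.block([[W.T @ W, (W.T @ y)[:, None]],
                  [(W.T @ y)[None, :], np.array([[y @ y]])]])
    k = X.shape[1]
    M[k:-1, k:-1] += Ginv_s0               # add G^{-1}*sigma_0^2 to the Z'Z block
    n = M.shape[0]
    pivots = []
    for i in range(n):                     # eliminate one row at a time
        p = M[i, i]
        pivots.append(p)
        if i < n - 1:
            M[i + 1:, :] -= np.outer(M[i + 1:, i] / p, M[i, :])
    lnC = float(np.sum(np.log(pivots[:-1])))   # sum of logs of the pivots of C
    sse = float(pivots[-1])                    # y'y - b'X'y - u'Z'y
    return lnC, sse
```

$L_4$ is then assembled as $-.5[(N - r(X))\ln\sigma_0^2 - \sum_i q_i\ln\alpha_i + \ln|C| + y'(y - X\hat{b} - Z\hat{u})/\sigma_0^2]$.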

Searching the array of likelihood values for various values of $\alpha_i$ can be done in several different ways. One method is to fix the values of all but one of the $s$ ratios $\alpha_i$, and then evaluate $L_4$ at four or more different values of the $\alpha_i$ that was not fixed. Then one can use a quadratic regression analysis to determine the value of that one ratio which maximizes $L_4$, given that the other ratios are fixed. This is repeated for each of the $s$ ratios, and the process is repeated until a maximum likelihood is obtained. The calculations are demonstrated in the example that follows.

Begin by fixing the value of $\alpha_B = 10$ and letting the value of $\alpha_A$ take on the values 5, 10, 20, 30, and 40. Using $L_4$ to evaluate the likelihood at each of these five points gives a table of $(\alpha_A, L_4)$ pairs. For example, the likelihood value for $\alpha_A = 40$ would be

$$ L_4 = -.5[(N - r(X))\ln\sigma_0^2 - q_A\ln\alpha_A - q_B\ln\alpha_B + \ln|C| + y'(y - X\hat{b} - Z\hat{u})/\sigma_0^2], $$

where $N - r(X) = 88$, and $\ln|C|$, $y'(y - X\hat{b} - Z\hat{u})$, and $\sigma_0^2$ take their values from the example data.

To find the value of $\alpha_A$ that maximizes $L_4$ for $\alpha_B = 10$, let $Q$ be the matrix with a row $(1,\ \alpha_A,\ \alpha_A^2)$ for each of the five values of $\alpha_A$, and let $Y$ be the vector of the corresponding $L_4$ values; then

$$ \hat{\beta} = (Q'Q)^{-1}Q'Y. $$

From this a prediction equation for $L_4$ can be written as

$$ \hat{L}_4 = \hat{\beta}_0 + \hat{\beta}_1\alpha_A + \hat{\beta}_2\alpha_A^2. $$

This equation can be differentiated with respect to $\alpha_A$ and then equated to zero to find the value of the ratio that maximizes the prediction equation. This gives

$$ \alpha_A = -\hat{\beta}_1/(2\hat{\beta}_2). $$

Now keep $\alpha_A$ at that value and try a number of values of $\alpha_B$ from 2 to 10, which give a second set of likelihood evaluations. Applying the quadratic regression to these points gives a new value for $\alpha_B$. The next step would be to fix $\alpha_B = 1.65$ and to try new values for $\alpha_A$, such as 5 to 40 by units of 1. The range of values becomes finer and finer.

To insure that one has found the global maximum, the entire process could be started with vastly different starting values for the ratios, such as $\alpha_B = 50$ with values for $\alpha_A$ of 40, 50, 60, and 70. The more components there are to estimate, the more evaluations of the likelihood are going to be needed, and the more probable it is that convergence might be to a local maximum rather than to the global maximum. Please refer to the literature for specification of the log likelihood function for particular models and situations.
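A small numpy sketch of one quadratic-regression step of this search; `L4_at` is a hypothetical user-supplied evaluator of $L_4$ (for example, built on the pivot-based sketch shown earlier).

```python
import numpy as np

def quadratic_step(L4_at, grid, fixed_ratios):
    """Evaluate L4 on a grid of values for the ratio currently being varied
    (all other ratios held at fixed_ratios), fit a quadratic, and return the
    value of the ratio that maximizes the fitted quadratic.
    L4_at(ratio, fixed_ratios) is a user-supplied likelihood evaluator."""
    grid = np.asarray(grid, float)
    L_vals = np.array([L4_at(a, fixed_ratios) for a in grid])
    b2, b1, b0 = np.polyfit(grid, L_vals, 2)   # L4 ~ b0 + b1*a + b2*a^2
    return -b1 / (2.0 * b2)                    # vertex of the fitted parabola

# e.g.  alpha_A_new = quadratic_step(L4_at, [5, 10, 20, 30, 40], {"alpha_B": 10.0})
```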

3.1.2 The Simplex Method

The Simplex Method (Nelder and Mead, 1965) is a procedure for finding the minimum of a function (i.e., the minimum of $-L_4$, or equivalently the maximum of $L_4$) with respect to the unknown variances and covariances. The best way to describe the method is with the example data from the previous sections.

Begin by constructing a set of points for which $L_4$ is to be evaluated. A point is a vector of values for the unknowns $(\alpha_A, \alpha_B)$, for example $\theta_1$; then form two more points by changing one unknown at a time, giving three points $\theta_1$, $\theta_2$, and $\theta_3$. Now calculate $L_4$ for each point and arrange the points in a table from largest to lowest value of $L_4$.

The idea now is to find another point to replace the last one (lowest $L_4$). This is done by a process called reflection. Compute the mean, $\theta_m$, of all points excluding the one with the lowest $L_4$; then the reflection step is

$$ \theta_4 = \theta_m + r(\theta_m - \theta_{last}), $$

where $r$ is recommended by Nelder and Mead (1965) to be 1. The corresponding $L_4$ for this point, compared to those in the table, had the largest value; therefore, $\theta_4$ is a better point than the other three.

Given this success, the Simplex method calls for an expansion step, i.e., to make a bigger change. Thus,

$$ \theta_5 = \theta_m + E(\theta_4 - \theta_m), $$

where $E$ is suggested to be equal to 2. The expanded point $\theta_5$ gave a still larger $L_4$, so the expanded point is better yet. Now drop $\theta_1$ from the table and put $\theta_5$ at the top. This completes one iteration.

Begin the next iteration by computing the mean, $\theta_m$, of all points excluding the point with the lowest $L_4$. Another reflection step gives

$$ \theta_6 = \theta_m + r(\theta_m - \theta_{last}). $$

However, the resulting $L_4$ is between the values for $\theta_2$ and $\theta_4$, so $\theta_6$ pushes $\theta_2$ out of the table. Instead of an expansion step, a contraction step is needed, because $\theta_6$ did not give a greater $L_4$ than the first two points. Thus,

$$ \theta_7 = \theta_m + c(\theta_6 - \theta_m), $$

where $c = 0.5$ is recommended. The $L_4$ for $\theta_7$ is better than that given by $\theta_4$, but not better than that for $\theta_5$; thus the new table contains $\theta_5$, $\theta_7$, and $\theta_4$, in decreasing order of $L_4$.

The following steps were taken in the next iteration.

1. The mean of the two points with the highest $L_4$ is computed as $\theta_m$.

2. A reflection step gives $\theta_8 = \theta_m + r(\theta_m - \theta_{last})$, which gave an $L_4$ that is better than that for $\theta_7$.

3. Add $\theta_8$ to the table and drop $\theta_4$.

Because $L_4$ for $\theta_8$ was not larger than $L_4$ for $\theta_5$ nor smaller than $L_4$ for $\theta_7$, no expansion or contraction step is necessary. Begin the next iteration.

The Simplex method continues in this manner until all point entries in the table are equal. The constants recommended by Nelder and Mead (1965) for reflection, expansion, and contraction could be adjusted for a particular data set. This method may converge to a local maximum, so different starting values are needed to see if it converges to the same point. The Simplex method does not work well with a large number of parameters to be estimated.
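In practice one would usually call an existing Nelder-Mead implementation rather than hand-code the reflection, expansion, and contraction moves. Below is a sketch using scipy, with a stand-in objective where the real $-L_4$ evaluator would be supplied; all names here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def neg_L4(theta, L4_at):
    """-L4 at theta = (alpha_A, alpha_B); L4_at is a user-supplied evaluator."""
    return -L4_at(theta[0], theta[1])

# stand-in evaluator so that the sketch runs; replace with the real L4 evaluator
toy_L4 = lambda aA, aB: -((aA - 22.0) ** 2 + (aB - 1.6) ** 2)

res = minimize(neg_L4, x0=np.array([10.0, 5.0]), args=(toy_L4,),
               method="Nelder-Mead", options={"xatol": 1e-4, "fatol": 1e-6})
print(res.x)     # ratios at the (possibly local) maximum of L4
```

scipy's Nelder-Mead uses the same reflection, expansion, and contraction moves, by default with coefficients 1, 2, and 0.5, plus an additional shrink step.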

3.2 First Derivatives, EM Algorithm

To derive formulas for estimating the variance components, take the derivatives of $L_4$ with respect to the unknown components:

$$ \partial L_4/\partial\sigma_i^2 = -.5\,tr[V^{-1}(\partial V/\partial\sigma_i^2)] + .5(y - X\hat{b})'V^{-1}(\partial V/\partial\sigma_i^2)V^{-1}(y - X\hat{b}) - .5\,tr[(X'V^{-1}X)^{-}X'V^{-1}(\partial V/\partial\sigma_i^2)V^{-1}X]. $$

Combining the two terms involving the traces, and noting that $V^{-1}(y - X\hat{b}) = Py$, then

$$ \partial L_4/\partial\sigma_i^2 = -.5\,tr[(V^{-1} - V^{-1}X(X'V^{-1}X)^{-}X'V^{-1})(\partial V/\partial\sigma_i^2)] + .5\,y'P(\partial V/\partial\sigma_i^2)Py, $$

which is

$$ \partial L_4/\partial\sigma_i^2 = -.5\,tr(PZ_iZ_i') + .5\,y'PZ_iZ_i'Py $$

for $i = 1, \ldots, s$, or

$$ \partial L_4/\partial\sigma_0^2 = -.5\,tr(P) + .5\,y'PPy $$

for $i = 0$, the residual component. Using $P$ and the fact that

$$ V^{-1} = R^{-1} - R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}, $$

then

$$ tr(PZ_iZ_i') = q_i/\sigma_i^2 - tr(C^{ii})\sigma_0^2/\sigma_i^4 $$

and

$$ tr(P) = [N - r(X)]/\sigma_0^2 - \sum_i\hat{u}_i'\hat{u}_i/(\sigma_i^2\sigma_0^2). $$

The other terms, $y'PZ_iZ_i'Py$ and $y'PPy$, were simplified by Henderson (1973) to show that they could be calculated from the mixed model equations. Note that Henderson (1973) showed

$$ Py = V^{-1}(y - X\hat{b}), \qquad \hat{b} = (X'V^{-1}X)^{-}X'V^{-1}y, \qquad \hat{u}_i = G_iZ_i'Py, $$

then

$$ y'PZ_iZ_i'Py = y'PZ_i[G_iG_i^{-1}\,G_i^{-1}G_i]Z_i'Py = \hat{u}_i'G_i^{-1}G_i^{-1}\hat{u}_i, $$

which when $G_i = I\sigma_i^2$ gives $\hat{u}_i'\hat{u}_i/\sigma_i^4$. Similarly, for the residual component, Henderson showed that

$$ y'PPy = [y'y - \hat{b}'X'y - \sum_i\hat{u}_i'Z_i'y - \sum_i\hat{u}_i'\hat{u}_i\alpha_i]/\sigma_0^4, $$

where $\alpha_i = \sigma_0^2/\sigma_i^2$.

Equating the derivatives to zero and incorporating the above simplifications gives

$$ \hat{\sigma}_i^2 = [\hat{u}_i'\hat{u}_i + tr(C^{ii})\hat{\sigma}_0^2]/q_i, $$
$$ \hat{\sigma}_0^2 = [y'y - \hat{b}'X'y - \sum_i\hat{u}_i'Z_i'y]/[N - r(X)]. $$

As with ML, solutions using the EM algorithm must be computed iteratively. Convergence is usually very slow, if it occurs, and the process may also diverge.
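A dense numpy sketch of one EM-REML round implementing these updates for a model with several independent random factors; the function name and interface are assumptions of the sketch.

```python
import numpy as np

def em_reml_round(y, X, Z_list, s2, s2e):
    """One round of EM-REML.  Z_list[i] is the design matrix of factor i,
    s2[i] its current variance, s2e the current residual variance.
    Returns updated (s2, s2e).  Dense sketch: forms and inverts the MME."""
    y = np.asarray(y, float)
    N, p = len(y), X.shape[1]
    W = np.hstack([X] + list(Z_list)).astype(float)
    C = W.T @ W
    pos = p
    for Z, s2i in zip(Z_list, s2):
        qi = Z.shape[1]
        C[pos:pos + qi, pos:pos + qi] += np.eye(qi) * (s2e / s2i)   # + I*alpha_i
        pos += qi
    sol = np.linalg.solve(C, W.T @ y)
    Cinv = np.linalg.inv(C)
    s2_new, pos = [], p
    for Z, _ in zip(Z_list, s2):
        qi = Z.shape[1]
        u = sol[pos:pos + qi]
        Cii = Cinv[pos:pos + qi, pos:pos + qi]
        s2_new.append((u @ u + np.trace(Cii) * s2e) / qi)   # u'u + tr(C^ii)*s2e
        pos += qi
    s2e_new = (y @ y - sol @ (W.T @ y)) / (N - np.linalg.matrix_rank(X))
    return s2_new, s2e_new
```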

Notice the differences between REML and ML. The denominator for $\hat{\sigma}_0^2$ is $N - r(X)$ rather than $N$, and in $\hat{\sigma}_i^2$ the trace is of $C^{ii}$ rather than $T_{ii}$. The quadratic forms, however, are identical in REML and ML. Accounting for the degrees of freedom needed to estimate $b$ has resulted in the REML algorithm.

A major computing problem with the EM algorithm is the calculation of $tr(C^{ii})$, which requires the corresponding inverse elements of the mixed model equations for the $i$th random factor. With most applications in animal breeding, the order of the mixed model equations is too large to be inverted, and solutions to the equations are obtained by Gauss-Seidel iterations. There have been several attempts to approximate $tr(C^{ii})$, but these have not been totally suitable.

To demonstrate the EM algorithm, let $\alpha_A = 10$ and $\alpha_B = 5$ be the starting values of the ratios for factors A and B, respectively. There were $N = 90$ total observations and $r(X) = 2$. The mixed model equations are built and solved for the solution vector (levels F1, F2, A1, A2, A3, B1, B2, B3, B4), and the traces $tr(C^{AA})$ and $tr(C^{BB})$ are obtained from the inverse of the coefficient matrix. These give rise to the estimates

$$ \hat{\sigma}_0^2 = [356{,}000 - y'(X\hat{b} + Z\hat{u})]/88, $$
$$ \hat{\sigma}_A^2 = [\hat{u}_A'\hat{u}_A + tr(C^{AA})\hat{\sigma}_0^2]/3, $$
$$ \hat{\sigma}_B^2 = [\hat{u}_B'\hat{u}_B + tr(C^{BB})\hat{\sigma}_0^2]/4. $$

New ratios are formed as

$$ \alpha_A = \hat{\sigma}_0^2/\hat{\sigma}_A^2, \qquad \alpha_B = \hat{\sigma}_0^2/\hat{\sigma}_B^2. $$

These ratios are used to form the mixed model equations again, new solutions and traces are calculated, and so on, until the estimated ratios and the prior values of the ratios are equal. The estimates converge to fixed values of $\hat{\sigma}_0^2$, $\hat{\sigma}_A^2$, and $\hat{\sigma}_B^2$, or equivalently of the ratios $\alpha_A$ and $\alpha_B$.

3.3 Second Derivatives, Average Information

Second derivatives of the log likelihood lead to the expectations of the quadratic forms. One technique, MIVQUE (Minimum Variance Quadratic Unbiased Estimation), equates the quadratic forms to their expectations. The estimates are unbiased. If all variances remain positive, then convergence will be to the REML estimates. However, due to a shortage of data or an inappropriate model, the estimates derived in this manner can be negative. Computing the expectations of the quadratic forms requires the inverse of the mixed model equations coefficient matrix, and then products and crossproducts of various parts of the inverse.

A gradient method using first and second derivatives can be used (Hofer). The gradient $d$, the vector of first derivatives of the log likelihood, is used to determine the direction towards the parameters that give the maximum of the log likelihood, such that

$$ \theta^{t+1} = \theta^{t} + M^{t}d^{t}, $$

where $d^{t}$ contains the first derivatives evaluated at $\theta = \theta^{t}$, and $M^{t}$ in the Newton-Raphson (NR) algorithm is based on the observed information matrix, while in the Fisher Method of Scoring (FS) it is based on the expected information matrix. The first derivatives are as follows, from earlier in these notes:

$$ \partial L_4/\partial\sigma_i^2 = -.5\,tr(PZ_iZ_i') + .5\,y'PZ_iZ_i'Py = 0, \qquad i = 1, \ldots, s, $$

or

$$ \partial L_4/\partial\sigma_0^2 = -.5\,tr(P) + .5\,y'PPy = 0 $$

for the residual component. Then, from earlier results,

$$ tr(PZ_iZ_i') = q_i/\sigma_i^2 - tr(C^{ii})\sigma_0^2/\sigma_i^4, \qquad y'PZ_iZ_i'Py = \hat{u}_i'\hat{u}_i/\sigma_i^4, $$

which combined give

$$ .5[\hat{u}_i'\hat{u}_i/\sigma_i^4 - q_i/\sigma_i^2 + tr(C^{ii})\sigma_0^2/\sigma_i^4] = 0, \qquad i = 1, \ldots, s. $$

For the residual component,

$$ tr(P) = [N - r(X)]/\sigma_0^2 - \sum_i\hat{u}_i'\hat{u}_i/(\sigma_i^2\sigma_0^2) $$

and

$$ y'PPy = [y'y - \hat{b}'X'y - \sum_i\hat{u}_i'Z_i'y - \sum_i\hat{u}_i'\hat{u}_i\alpha_i]/\sigma_0^4, $$

which combined give

$$ .5\{[y'y - \hat{b}'X'y - \sum_i\hat{u}_i'Z_i'y - \sum_i\hat{u}_i'\hat{u}_i\alpha_i]/\sigma_0^4 - [N - r(X)]/\sigma_0^2 + \sum_i\hat{u}_i'\hat{u}_i/(\sigma_i^2\sigma_0^2)\} = 0, $$

which simplifies to

$$ .5\{[y'y - \hat{b}'X'y - \sum_i\hat{u}_i'Z_i'y]/\sigma_0^4 - [N - r(X)]/\sigma_0^2\} = 0. $$

The second derivatives give a matrix of quantities. Writing $\partial V/\partial\sigma_i^2 = Z_iZ_i'$ for $i = 1, \ldots, s$ and $\partial V/\partial\sigma_0^2 = I$, the elements of the observed information matrix (Gilmour et al.) are

$$ -\partial^2L_4/\partial\sigma_i^2\partial\sigma_j^2 = y'PZ_iZ_i'PZ_jZ_j'Py - .5\,tr(PZ_iZ_i'PZ_jZ_j'), $$
$$ -\partial^2L_4/\partial\sigma_i^2\partial\sigma_0^2 = y'PZ_iZ_i'PPy - .5\,tr(PZ_iZ_i'P), $$
$$ -\partial^2L_4/\partial\sigma_0^2\partial\sigma_0^2 = y'PPPy - .5\,tr(PP). $$

The elements of the expected information matrix (Gilmour et al.) are

$$ E[-\partial^2L_4/\partial\sigma_i^2\partial\sigma_j^2] = .5\,tr(PZ_iZ_i'PZ_jZ_j'), $$
$$ E[-\partial^2L_4/\partial\sigma_i^2\partial\sigma_0^2] = .5\,tr(PZ_iZ_i'P), $$
$$ E[-\partial^2L_4/\partial\sigma_0^2\partial\sigma_0^2] = .5\,tr(PP). $$

As the name Average Information implies, averaging the observed and expected information matrices causes the trace terms to cancel and gives the following matrix of elements:

$$ I[\sigma_i^2, \sigma_j^2] = .5\,y'PZ_iZ_i'PZ_jZ_j'Py, $$
$$ I[\sigma_i^2, \sigma_0^2] = .5\,y'PZ_iZ_i'PPy, $$
$$ I[\sigma_0^2, \sigma_0^2] = .5\,y'PPPy. $$

The first derivatives form the vector $d^{t}$, and $M^{t} = I[\sigma^2, \sigma^2]^{-1}$. The rest of this method is computational detail to simplify the requirements for inverse elements and solutions to the MME. The calculations cannot be illustrated very easily for the example data because the $y$-vector is not available.
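A dense numpy sketch of one average-information update built from these formulas; it forms $P$ explicitly and is therefore only feasible for small illustrative examples (all names are assumptions of the sketch).

```python
import numpy as np

def ai_reml_step(y, X, Z_list, s2, s2e):
    """One average-information update of theta = (s2_1, ..., s2_s, s2e),
    computed with an explicit dense P matrix (small examples only)."""
    y = np.asarray(y, float)
    N = len(y)
    V = np.eye(N) * s2e
    for Z, s2i in zip(Z_list, s2):
        V += s2i * (Z @ Z.T)
    Vinv = np.linalg.inv(V)
    P = Vinv - Vinv @ X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)
    Py = P @ y
    Vdots = [Z @ Z.T for Z in Z_list] + [np.eye(N)]        # dV/dtheta_k
    # first derivatives: dL4/dtheta_k = -0.5 tr(P Vdot_k) + 0.5 y'P Vdot_k P y
    d = np.array([-0.5 * np.trace(P @ Vd) + 0.5 * (Py @ Vd @ Py) for Vd in Vdots])
    # average information: AI[k, l] = 0.5 y'P Vdot_k P Vdot_l P y
    f = [Vd @ Py for Vd in Vdots]
    AI = 0.5 * np.array([[fk @ P @ fl for fl in f] for fk in f])
    theta = np.array(list(s2) + [s2e])
    return theta + np.linalg.solve(AI, d)        # theta_{t+1} = theta_t + AI^{-1} d
```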

3.4 Animal Models

The model commonly applied to the estimation of variance components in livestock genetics since 1989 has been an animal model. The animal model assumes a large, random mating population, an infinite number of loci each with a small and equal effect on the trait, only additive genetic effects, and all relationships among animals known and traceable to an unselected base population somewhere in the past. Animals may have more than one record each. The equation of the model is

$$ y = Xb + Za + Zp + e, $$

where $a$ is the vector of animal additive genetic effects (one per animal), and $p$ is a vector of permanent environmental (p.e.) effects associated with each animal. Then

$$ E(y) = Xb, \qquad Var\begin{pmatrix} a \\ p \\ e \end{pmatrix} = \begin{pmatrix} A\sigma_a^2 & 0 & 0 \\ 0 & I\sigma_p^2 & 0 \\ 0 & 0 & I\sigma_e^2 \end{pmatrix}. $$

The matrix $A$ is called the numerator relationship matrix. Wright defined relationships among animals as correlations, but $A$ is essentially relationships defined as covariances (the numerators of the correlation coefficients). Also, these represent only the additive genetic relationships between animals. The MME for this model are

$$ \begin{pmatrix} X'X & X'Z & X'Z \\ Z'X & Z'Z + A^{-1}k_a & Z'Z \\ Z'X & Z'Z & Z'Z + Ik_p \end{pmatrix}\begin{pmatrix} \hat{b} \\ \hat{a} \\ \hat{p} \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \\ Z'y \end{pmatrix}. $$

Note that $k_a$ is the ratio of residual to additive genetic variances, and $k_p$ is the ratio of residual to permanent environmental variances. Also, the inverse of $A$ is required in the MME.
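A dense numpy sketch of setting up and solving these MME; illustrative only, since real applications exploit sparsity and never form the full coefficient matrix.

```python
import numpy as np

def animal_model_mme(y, X, Z, Ainv, k_a, k_p):
    """Solve the MME for y = Xb + Za + Zp + e, with Var(a) = A*s2a and
    Var(p) = I*s2p; k_a = s2e/s2a and k_p = s2e/s2p.  Assumes full-rank X."""
    y = np.asarray(y, float)
    p, n = X.shape[1], Z.shape[1]
    C = np.block([
        [X.T @ X, X.T @ Z,               X.T @ Z],
        [Z.T @ X, Z.T @ Z + Ainv * k_a,  Z.T @ Z],
        [Z.T @ X, Z.T @ Z,               Z.T @ Z + np.eye(n) * k_p],
    ])
    rhs = np.concatenate([X.T @ y, Z.T @ y, Z.T @ y])
    sol = np.linalg.solve(C, rhs)
    return sol[:p], sol[p:p + n], sol[p + n:]    # b-hat, a-hat, p-hat
```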

The EM-REML procedure gives

$$ \hat{\sigma}_e^2 = [y'y - \hat{b}'X'y - \hat{a}'Z'y - \hat{p}'Z'y]/[N - r(X)], $$
$$ \hat{\sigma}_a^2 = [\hat{a}'A^{-1}\hat{a} + tr(A^{-1}C^{aa})\hat{\sigma}_e^2]/n, $$
$$ \hat{\sigma}_p^2 = [\hat{p}'\hat{p} + tr(C^{pp})\hat{\sigma}_e^2]/n, $$

where $n$ is the total number of animals, $N$ is the total number of records, $C^{aa}$ contains the inverse elements of the MME for the animal additive genetic effects, and $C^{pp}$ contains the inverse elements of the MME for the animal permanent environmental effects. An example of this model will be given in later notes.

3.4.1 Quadratic Forms in an Animal Model

A necessary quadratic form in an animal model is $\hat{a}'A^{-1}\hat{a}$, and this can be computed very easily. Note that the inverse of $A$ may be written as

$$ A^{-1} = (T^{-1})'D^{-1}T^{-1}, $$

where $T^{-1}$ is a triangular matrix, and the diagonal matrix $D^{-1}$ has elements equal to 1, 2, or 4/3 in noninbred situations, and values greater than 2 in inbred situations. In Henderson (1975), $T^{-1}$ was shown to be composed of just three numbers: 0, 1's on the diagonals, and $-.5$ corresponding to the parents of an animal. Then

$$ T^{-1}\hat{a} = \hat{m}, \qquad \hat{m}_i = \hat{a}_i - .5(\hat{a}_s + \hat{a}_d) $$

for the $i$th animal, where $\hat{a}_s$ and $\hat{a}_d$ are the sire and dam estimated breeding values, respectively. Consequently,

$$ \hat{a}'A^{-1}\hat{a} = \hat{a}'(T^{-1})'D^{-1}T^{-1}\hat{a} = \hat{m}'D^{-1}\hat{m} = \sum_{i=1}^{q}\hat{m}_i^2\,d^{ii}, $$

where $d^{ii}$ are the diagonal elements of $D^{-1}$, and $q$ is the number of animals.
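A short sketch of this computation from a pedigree; the argument names and the convention that unknown parents are coded as -1 are assumptions of the sketch.

```python
def quad_form_aAinva(a_hat, sire, dam, dinv):
    """a'A^{-1}a as sum_i dinv[i]*m_i^2, where m_i = a_i - 0.5*(a_sire + a_dam)
    and missing parents (coded -1) contribute nothing.  dinv[i] is the diagonal
    element of D^{-1} for animal i (1, 2, or 4/3 when there is no inbreeding)."""
    total = 0.0
    for i, ai in enumerate(a_hat):
        m = ai
        if sire[i] >= 0:
            m -= 0.5 * a_hat[sire[i]]
        if dam[i] >= 0:
            m -= 0.5 * a_hat[dam[i]]
        total += dinv[i] * m * m
    return total
```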

4 EXERCISES

Below are pedigrees and data on 20 animals (a table with columns Animal, Sire, Dam, Group, and Record).

1. Construct $A^{-1}$ and set up the MME.

2. Apply EM-REML to the model

$$ y_{ij} = \mu + g_i + a_j + e_{ij}, $$

where group, animal additive genetic, and residual effects are random. Let $\sigma_e^2/\sigma_a^2 = 1.5$ and $\sigma_e^2/\sigma_g^2 = 5.0$ to start the iterations. Do five iterations of EM-REML.

3. Apply EM-REML again to the model with $\sigma_e^2/\sigma_a^2 = 3.0$ and $\sigma_e^2/\sigma_g^2 = 0.5$, and do five iterations of REML from these starting values.

4. Do the results of the previous two questions tend to give similar answers? Comment on the results.

5. Use the derivative-free method and compute the log likelihoods for various sets of parameters.
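Relevant to Exercise 1 and to Section 3.4.1, the sketch below builds $A^{-1}$ directly from a pedigree using the rows of $T^{-1}$ and the diagonal of $D^{-1}$, assuming no inbreeding; the general inbred case would require the $D^{-1}$ elements computed from inbreeding coefficients.

```python
import numpy as np

def build_Ainv(sire, dam):
    """A^{-1} for a noninbred pedigree, accumulated animal by animal from the
    row of T^{-1} (1 for the animal, -0.5 for each known parent) and the
    diagonal element of D^{-1} (1, 4/3, or 2 for 0, 1, or 2 known parents).
    Unknown parents are coded as -1; parents must precede their offspring."""
    q = len(sire)
    Ainv = np.zeros((q, q))
    for i in range(q):
        parents = [p for p in (sire[i], dam[i]) if p >= 0]
        dinv = {0: 1.0, 1: 4.0 / 3.0, 2: 2.0}[len(parents)]
        idx = [i] + parents
        coef = [1.0] + [-0.5] * len(parents)       # row i of T^{-1}
        for a, ca in zip(idx, coef):
            for b, cb in zip(idx, coef):
                Ainv[a, b] += dinv * ca * cb
    return Ainv
```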
