Using the genomic relationship matrix to predict the accuracy of genomic selection


J. Anim. Breed. Genet. ISSN

ORIGINAL ARTICLE

M.E. Goddard 1,2, B.J. Hayes 2 & T.H.E. Meuwissen 3

1 Department of Agriculture and Food Systems, University of Melbourne, Melbourne, Vic., Australia
2 Biosciences Research Division, Victorian Department of Primary Industries, Bundoora, Vic., Australia
3 Norwegian University of Life Sciences, Ås, Norway

Keywords
Genomic selection; relationship matrix.

Correspondence
M. Goddard, Biosciences Research Division, Victorian Department of Agriculture, 1 Park Drive, Bundoora, Vic. 3083, Australia. Tel: ; Fax: ; mike.goddard@dpi.vic.gov.au

Received: 6 February 2011; accepted: 18 August 2011

Summary

Estimated breeding values (EBVs) using data from genetic markers can be predicted using a genomic relationship matrix, derived from animals' genotypes, and best linear unbiased prediction. However, if the accuracy of the EBVs is calculated in the usual manner (from the inverse element of the coefficient matrix), it is likely to be overestimated owing to sampling errors in elements of the genomic relationship matrix. We show here that the correct accuracy can be obtained by regressing the relationship matrix towards the pedigree relationship matrix so that it is an unbiased estimate of the relationships at the QTL controlling the trait. This method shows how the accuracy increases as the number of markers used increases because the regression coefficient (of genomic relationship towards pedigree relationship) increases. We also present a deterministic method for predicting the accuracy of such genomic EBVs before data on individual animals are collected. This method estimates the proportion of genetic variance explained by the markers, which is equal to the regression coefficient described above, and the accuracy with which marker effects are estimated.
The latter depends on the variance in relationship between pairs of animals, which equals the mean linkage disequilibrium over all pairs of loci. The theory was validated using simulated data and data on fat concentration in the milk of Holstein cattle.

© 2011 Blackwell Verlag GmbH, J. Anim. Breed. Genet. 128 (2011), doi: /j x

Introduction

The matrix of relationships among a group of individuals can be used to predict their breeding values, to manage inbreeding and in genetic conservation. This relationship matrix can be calculated from the pedigree, but it is also possible to calculate the relationship matrix from genotypes at genetic markers such as single-nucleotide polymorphisms (SNPs). Elements of the genomic relationship matrix are estimates of the realized proportion of the genome that two individuals share, whereas the pedigree-derived relationship matrix is the expectation of this proportion. This genomic relationship matrix can be used in genomic selection to estimate breeding values.

Genomic selection refers to the use of a large number of genetic markers, such as SNPs, covering the whole genome to predict the genetic value of individuals (Meuwissen et al. 2001). The individuals might be people whose genetic risk of developing a complex disease is being predicted, or they might be domestic animals or plants in which estimates of

their breeding value will be used to select parents to breed the next generation. In cattle, the availability of high-throughput, high-density genotyping with SNP chips has led to the widespread adoption of genomic selection in dairy cattle breeding programmes, where it is predicted to double the rate of genetic improvement (Schaeffer 2006; Dalton 2009).

Traditionally, livestock have been selected on the basis of estimated breeding values (EBVs) calculated from data on phenotype and pedigree using a statistical technique called best linear unbiased prediction (BLUP) (Henderson 1984). A desirable feature of this method is that the accuracy of the EBVs can be calculated as part of the statistical analysis. This is not the case with many methods used for genomic selection. Currently, the most trusted method for assessing the accuracy of genomic EBVs is an empirical test in which a sample of animals have genomic EBVs calculated, and then additional phenotypic data are collected to assess how accurately the EBVs predict these new data. This is time-consuming, fails to predict the accuracy of individual EBVs and is wasteful in that the new data are used only to estimate accuracy and not to improve the prediction of breeding value. Other cross-validation techniques can also be used, but they are also time-consuming and do not yield the accuracy of the final prediction, or individual accuracies. In practice, some authors have used the inverse of the mixed model or BLUP equations (e.g. VanRaden 2008; Hayes et al. 2009a,b) but, as we show in this paper, this can overestimate the accuracy.

It would be very useful to be able to predict the accuracy of EBVs calculated using genomic selection in two situations.
Firstly, after the data have been collected and are being analysed, it would be useful to calculate accuracies of EBVs as part of the statistical analysis, as is done for traditional EBVs, including for individuals without their own phenotypes. In this situation, we are interested in calculating the accuracy of the EBVs of individual animals. Secondly, when planning a selection programme using genomic selection, it would be useful to be able to predict the accuracy of alternative designs so that the best one could be implemented. In this situation, we wish to predict the accuracy of EBVs of classes of animals, which we might then use in deterministic simulations of alternative breeding programmes. This paper presents methods for both of these situations. After the data have been collected and are being analysed, it should be possible to predict the accuracy from the properties of the statistical method (as in VanRaden 2008 and Harris & Johnson 2010). However, this requires that the statistical model matches the true situation. For instance, if the model makes assumptions about the distribution of the effects of genes affecting the trait (QTL), this should match the real distribution. If we assume that there are a very large number of QTL whose effects follow a normal distribution with constant variance, the analysis (called BLUP by Meuwissen et al. 2001) is robust to departures from this assumption and the accuracy is little affected even if the distribution of QTL effects does not follow a normal distribution. However, empirical tests of the accuracy derived from the inverse of the BLUP equations often find that it is overestimated by this method, dramatically so if it is used to predict breeding values in one breed based on data from another breed (Hayes et al. 2009a). An anomaly of the method is that it does not predict increasing accuracy as the number of markers is increased and this explains why it overestimates accuracy as shown below. 
In this paper, we describe how to calculate the accuracy of genomic EBVs after the data have been collected, taking into account the number of markers used.

The accuracy of genomic EBVs expected before data are collected has been considered by Goddard (2009) for unrelated animals and by Hayes et al. (2009b) for simple family structures such as groups of full-sibs and half-sibs. Their deterministic method treats the genome as if it were a series of small chromosomal segments, each of which is inherited independently. Here, we treat chromosomes as continuous and show that a similar prediction results.

The objectives of this paper are twofold: (i) to derive a method for calculating the accuracy of EBVs calculated using a known genomic relationship matrix; and (ii) to derive a method to predict this accuracy before the data on individual animals are collected. In the Materials and Methods section, we first develop the theory to predict accuracy and then describe the simulation and real data in which it is tested. We only consider the genomic selection method called BLUP by Meuwissen et al. (2001).

Materials and methods

Theory

Calculation of accuracy after data are collected

Consider a group of $T$ animals whose breeding values are controlled by $Q$ QTL. At the $j$th QTL, the genotypes (00, 01, 11) have frequencies $(1 - p_j)^2$, $2p_j$

$(1 - p_j)$ and $p_j^2$, respectively, and are described for the $i$th animal by $w_{ij} = -2p_j$, $1 - 2p_j$ and $2 - 2p_j$, respectively, for the three genotypes. This coding causes the mean of $w_{ij}$ over animals to be zero. Let

$$\mathbf{y} = \text{fixed effects} + \mathbf{g} + \mathbf{e}, \qquad \mathbf{g} = \mathbf{W}\mathbf{u},$$

where $\mathbf{y}$ is a $T \times 1$ vector of phenotypic values, $\mathbf{g}$ is a $T \times 1$ vector of breeding values, $\mathbf{u}$ is a $Q \times 1$ vector of effects of QTL assumed $N(0, \mathbf{I}\sigma_u^2)$, $\mathbf{W}$ is a $T \times Q$ matrix describing the genotype of each animal $i$ at each QTL $j$ by $w_{ij}$, $\mathbf{e}$ is a $T \times 1$ vector of environmental effects, and

$$V(\mathbf{g}) = \mathbf{W}\mathbf{W}'\sigma_u^2, \qquad V(\mathbf{e}) = \mathbf{I}\sigma_e^2, \qquad V(\mathbf{y}) = \mathbf{W}\mathbf{W}'\sigma_u^2 + \mathbf{I}\sigma_e^2, \qquad h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.$$

Thus, ignoring fixed effects, the model $\mathbf{y} = \mathbf{W}\mathbf{u} + \mathbf{e}$ is equivalent to the conventional model $\mathbf{y} = \mathbf{g} + \mathbf{e}$ if

$$V(\mathbf{g}) = \mathbf{G}\sigma_g^2 = \mathbf{W}\mathbf{W}'\sigma_u^2.$$

That is, the genomic model is equivalent to a conventional animal model with the relationship matrix calculated from the QTL genotypes. This equivalence has been pointed out before but usually in terms of marker genotypes and effects rather than QTL genotypes and effects (e.g. Nejati-Javaremi et al. 1997; Fernando 1998; Villanueva et al. 2005; Habier et al. 2007; Goddard 2009; Strandén & Garrick 2009; VanRaden et al. 2009). In practice, we will use the markers to estimate $\mathbf{G}$, but this formulation makes it clear that it is the relationship matrix based on the QTL that we need to estimate.

Let $\mathbf{G} = \mathbf{A} + \mathbf{D}$, where $\mathbf{A}$ is the relationship matrix based on pedigree and $\mathbf{D}$ represents deviations from $\mathbf{A}$ owing to the observed segregation of alleles at the QTL. In a conventional BLUP, we use $\mathbf{A}$ instead of $\mathbf{G}$. This still leads to unbiased estimates of $\mathbf{g}$ and to appropriate estimates of accuracy because $E(\mathbf{G} \mid \mathbf{A}) = \mathbf{A}$. That is, $\mathbf{A}$ is an unbiased estimate of $\mathbf{G}$ in the same way that BLUP EBVs are an unbiased estimate of true breeding value. We need an estimate of $\mathbf{G}$ based on the marker genotypes that also has this property.
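The genotype coding and the equivalence $V(\mathbf{g}) = \mathbf{W}\mathbf{W}'\sigma_u^2$ described above can be illustrated with a small numerical sketch. This is not the authors' code: the sizes `T` and `Q`, the allele-frequency range, the value of `sigma2_u` and the use of NumPy with independent, Hardy-Weinberg genotypes are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, Q = 200, 50                      # animals and QTL (illustrative sizes, assumed)
p = rng.uniform(0.1, 0.9, size=Q)   # allele frequency p_j at each QTL (assumed range)

# Allele counts 0/1/2 drawn under Hardy-Weinberg proportions, then
# w_ij = count - 2 p_j, which yields the codes -2p_j, 1 - 2p_j and 2 - 2p_j.
counts = rng.binomial(2, p, size=(T, Q))
W = counts - 2.0 * p

sigma2_u = 0.01                                  # variance of QTL effects (assumed)
u = rng.normal(0.0, np.sqrt(sigma2_u), size=Q)   # u ~ N(0, I sigma2_u)
g = W @ u                                        # breeding values g = W u

# Covariance of g implied by the QTL model: V(g) = W W' sigma2_u
V_g = W @ W.T * sigma2_u

# The coding has expectation zero, so sample column means should be close to zero.
col_mean_abs = np.abs(W.mean(axis=0)).max()
```

Dividing `V_g` by the genetic variance would give the QTL-based relationship matrix $\mathbf{G}$ of the equivalent animal model.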
Let $\mathbf{G}_m = \mathbf{W}_m\mathbf{W}_m'/M^*$, where $\mathbf{W}_m$ is a matrix defined in the same way as $\mathbf{W}$ above but recording the genotypes at markers instead of QTL, and $M^* = \sum_j 2p_j(1 - p_j)$. Then

$$\mathbf{G}_m = \mathbf{A} + \mathbf{D} + \mathbf{E},$$

where the errors ($\mathbf{E}$) occur because the markers used are a sample of genomic positions and so $\mathbf{G}_m$ includes some sampling errors. $\mathbf{G}_m$ can be improved by taking a weighted average of the relationship estimated from each marker, where the weights are the inverse of the prediction error variance (PEV) (Powell et al. 2010; Yang et al. 2010). This can be achieved by defining $x_{ij}$ as $w_{ij}/\sqrt{2p_j(1 - p_j)}$ and calculating $\mathbf{G}_m$ as

$$\mathbf{G}_m = \mathbf{X}\mathbf{X}'/M,$$

where $M$ is the number of markers. However, $\mathbf{G}_m$ is still not unbiased in the sense we require, i.e. $E(\mathbf{G} \mid \mathbf{G}_m)$ is not $\mathbf{G}_m$. Instead, we use

$$\hat{\mathbf{G}} = \mathbf{A} + b(\mathbf{G}_m - \mathbf{A}), \qquad (1)$$

where $b$ is the regression of elements of $\mathbf{G} - \mathbf{A}$ on elements of $\mathbf{G}_m - \mathbf{A}$:

$$b = \frac{V(D)}{V(D) + V(E)}. \qquad (2)$$

$V(D)$ is the variance of $D_{ij}$ across all possible pairs of relationship, i.e. it is a scalar, and $V(E)$ is similarly defined. To indicate that $V(D)$ is a variance across the elements of $\mathbf{D}$, $D$ is not written in bold type.

The $V(D)$ could be predicted from theory, but it may be better to estimate it from the data. Assuming QTL have the same properties as the markers, we can assess how well the markers predict the relationship based on QTL by how well they predict the relationship based on other markers. If we randomly split the markers into two non-overlapping sets and calculate $\mathbf{G}_{m1}$ and $\mathbf{G}_{m2}$ for the two sets, then $c = \mathrm{cov}(\mathbf{G}_{m1} - \mathbf{A}, \mathbf{G}_{m2} - \mathbf{A})$ estimates $V(D)$ and $V(\mathbf{G}_{m1} - \mathbf{G}_{m2})$ estimates $2V(E)$, where $V(E)$ is the variance of the errors in $\mathbf{G}_{m1}$ (or $\mathbf{G}_{m2}$). [We exclude the diagonal elements of $\mathbf{G}$ and $\mathbf{A}$ from these calculations because they have

slightly different properties to the non-diagonal elements and it is the latter that are important.] Yang et al. (2010) show that $V(E) = 1/M$, where $M$ is the number of markers. Then, $b$ can be estimated as

$$b = 1 - \frac{1/M}{c + 1/M} = 1 - \frac{1}{cM + 1}. \qquad (2a)$$

The regression coefficient $b$ may vary among subsets of the data. For instance, relationships between animals from different breeds may have a small $V(D)$ and hence a smaller value of $b$ than relationships within a breed. It would be desirable to calculate $b$ separately for categories of relationship that differ greatly in $V(D)$, but we will not pursue that possibility in this paper.

If the QTL differ in a systematic way from the markers, then $V(E)$ will be greater than expected from the finite number of markers. An alternative derivation of $\hat{\mathbf{G}}$ shows how to estimate $b$ in this case. Let

$$\mathbf{y} = \mathbf{q} + \mathbf{a} + \mathbf{e},$$

where $\mathbf{a}$ is a vector of polygenic effects not captured by the markers, $N(0, \mathbf{A}\sigma_a^2)$; $\mathbf{q} = \mathbf{X}\mathbf{m}$ is a vector of breeding values explained by markers; and $\mathbf{m}$ is a vector of standardized marker effects, $N(0, \mathbf{I}\sigma_m^2)$, so that

$$V(\mathbf{q}) = \mathbf{G}_m\sigma_q^2 \quad \text{with} \quad \mathbf{G}_m = \mathbf{X}\mathbf{X}'/M,$$
$$\mathbf{g} = \mathbf{q} + \mathbf{a}, \qquad \sigma_g^2 = \sigma_q^2 + \sigma_a^2,$$
$$V(\mathbf{g}) = \mathbf{G}\sigma_g^2 = \mathbf{G}_m\sigma_q^2 + \mathbf{A}\sigma_a^2.$$

The variance components in this linear model can be estimated by REML, for example, and then $b = \sigma_q^2/\sigma_g^2$. Then

$$\hat{\mathbf{G}}\sigma_g^2 = \mathbf{G}_m\sigma_q^2 + \mathbf{A}\sigma_a^2 = [\mathbf{A} + b(\mathbf{G}_m - \mathbf{A})]\sigma_g^2, \qquad (1a)$$

so that $\hat{\mathbf{G}}$ is the same as defined above in (1) but this time with the regression coefficient estimated from the phenotype data rather than predicted from the marker data. We can now use $\hat{\mathbf{G}}$ in place of $\mathbf{A}$ in the normal mixed model equations (MME) and calculate EBVs for individual animals and the accuracy of these EBVs in the usual manner.

Prediction of accuracy before data are collected

In practice, the methodology described above for calculating EBVs and their accuracy from genetic markers would be implemented using Henderson's MME. If we ignore fixed effects, we can use the equivalent selection index approach.
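The split-marker estimate of $b$ in equations (2) and (2a), and the regressed matrix $\hat{\mathbf{G}}$ of equation (1), can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, allele frequencies are estimated from the sample (an assumption), and the test data are independent markers on unrelated animals, for which $V(D)$ and hence $b$ are expected to be close to zero.

```python
import numpy as np

def genomic_relationship(counts):
    """G_m = X X' / M with x_ij = (count - 2 p_j) / sqrt(2 p_j (1 - p_j)).
    Allele frequencies p_j are estimated from the sample (an assumption)."""
    p = counts.mean(axis=0) / 2.0
    X = (counts - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return X @ X.T / counts.shape[1]

def estimate_b(counts, A, rng):
    """Split-marker estimate of b (equation 2a), using off-diagonal elements only."""
    M = counts.shape[1]
    perm = rng.permutation(M)
    G1 = genomic_relationship(counts[:, perm[: M // 2]])
    G2 = genomic_relationship(counts[:, perm[M // 2:]])
    off = ~np.eye(A.shape[0], dtype=bool)            # mask that drops the diagonal
    c = np.cov((G1 - A)[off], (G2 - A)[off])[0, 1]   # c estimates V(D)
    return 1.0 - 1.0 / (c * M + 1.0), c

def regress_towards_pedigree(G_m, A, b):
    """Equation (1): G_hat = A + b (G_m - A)."""
    return A + b * (G_m - A)

rng = np.random.default_rng(2)
T, M = 300, 1000                                     # illustrative sizes (assumed)
counts = rng.binomial(2, rng.uniform(0.2, 0.8, size=M), size=(T, M))
A = np.eye(T)                                        # unrelated animals
G_m = genomic_relationship(counts)
b, c = estimate_b(counts, A, rng)
G_hat = regress_towards_pedigree(G_m, A, b)
```

With real genotypes, where linkage disequilibrium creates true variation in relationship, `c` would be positive and `b` would move towards 1 as the number of markers grows.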
We assume that the training data consist of $T$ unrelated animals ($\mathbf{A} = \mathbf{I}$) with phenotypes and marker genotypes. We wish to predict the accuracy of the EBVs for additional, unrelated test animals with marker genotypes but without phenotypes. We will use the model $\mathbf{y} = \mathbf{g} + \mathbf{e}$ and $\mathbf{g} = \mathbf{q} + \mathbf{a}$ introduced above. As $V(\mathbf{a}) = \mathbf{A}\sigma_a^2 = \mathbf{I}\sigma_a^2$, the only information about the EBV of the test animals comes from marker information, which allows us to estimate $\mathbf{q}$. Therefore, the reliability (accuracy squared) of the EBV is

$$R^2 = \frac{V(\hat{q})}{V(g)} = \frac{V(\hat{q})}{V(q)} \cdot \frac{V(q)}{V(g)} = \frac{V(\hat{q})}{V(q)}\, b. \qquad (3)$$

Therefore, we have to predict two quantities: the proportion of the genetic variance explained by the markers ($b$) and the accuracy with which the combined marker effects ($q$) are estimated (Dekkers 2007; Goddard 2009).

$b = V(q)/V(g)$

We will only consider the case where the QTL are not systematically different from the markers and therefore this can be predicted from the properties of the markers, which in turn can be predicted from the theory of linkage disequilibrium. Equation (2) shows that $b$ depends on the true variation in relationship between pairs of animals, $V(D)$, compared to the sampling error caused by a finite number of markers, $V(E)$. Appendix 1 shows that $V(D)$ equals the mean value of $r^2$ over all pairs of loci, where $r^2$ is a standard measure of LD (Hill & Robertson 1968). The appendix derives the expected value of this mean, $V(D) = 1/M_e$, where

$$M_e = \frac{2N_eLk}{\log(N_eL)}, \qquad (4)$$

$N_e$ is the effective population size, $L$ is the average length of a chromosome in Morgan, $k$ is the number of chromosomes and $M_e$ is called the effective number of chromosome segments segregating in the population (Goddard 2009; Hayes et al. 2009b).
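As a worked illustration of equation (4): substituting $V(D) = 1/M_e$ and $V(E) = 1/M$ into equation (2) gives $b = M/(M_e + M)$. The sketch below evaluates both quantities. The function names are hypothetical, the logarithm is assumed to be the natural logarithm, and the inputs ($N_e = 1000$, $L = 1$ Morgan, $k = 1$, $M = 1000$) are illustrative values.

```python
import math

def effective_segments(Ne, L, k):
    """Equation (4): M_e = 2 Ne L k / log(Ne L); natural log is assumed."""
    return 2.0 * Ne * L * k / math.log(Ne * L)

def proportion_explained(M, Me):
    """Equation (2) with V(D) = 1/M_e and V(E) = 1/M, i.e. b = M / (M_e + M)."""
    return M / (Me + M)

Me = effective_segments(Ne=1000, L=1.0, k=1)   # roughly 290 segments
b = proportion_explained(M=1000, Me=Me)        # roughly 0.78
```

Because $M_e$ grows almost linearly with $N_e$, a population with a larger effective size needs proportionally more markers to reach the same $b$.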
As $V(E) = 1/M$,

$$b = \frac{1/M_e}{1/M_e + 1/M} = \frac{M}{M_e + M}. \qquad (2b)$$

$V(\hat{q})/V(q)$

The selection index equation for the EBV of an animal without phenotype is $\hat{q} = \mathbf{g}_2'\mathbf{V}^{-1}\mathbf{y}$, where $\mathbf{g}_2$ is the vector of covariances between the target animal and the training animals, $\mathbf{V} = \mathbf{G}_m\sigma_q^2 + \mathbf{I}(\sigma_e^2 + \sigma_a^2)$, and $\mathbf{y}$ is the vector of phenotypic

values of animals in the training set, and $V(\hat{q}) = \mathbf{g}_2'\mathbf{V}^{-1}\mathbf{g}_2$. A convenient way to calculate $\mathbf{g}_2'\mathbf{V}^{-1}\mathbf{g}_2$ for many different animals is to expand the matrix $\mathbf{V}$ to include the target animal:

$$\mathbf{V}^* = \begin{pmatrix} \sigma_q^2 + \sigma_e^2 + \sigma_a^2 & \mathbf{g}_2' \\ \mathbf{g}_2 & \mathbf{V} \end{pmatrix}.$$

Then

$$\mathbf{g}_2'\mathbf{V}^{-1}\mathbf{g}_2 = V^*_{11} - 1/V^{*11}, \qquad (5)$$

where $V^*_{11}$ is the first diagonal element of $\mathbf{V}^*$ and $V^{*11}$ is the first diagonal element of the inverse of $\mathbf{V}^*$. By treating each animal in turn as the target animal, it is possible to calculate the accuracy of the EBVs for $T + 1$ possible target animals (i.e. each animal in turn is regarded as the target animal that does not have a phenotype recorded). In Appendix 2, we describe a heuristic approximation,

$$[V^{*11}]^{-1} \approx \sigma_e^2 + \sigma_a^2 + \frac{\sigma_q^2}{1 + \theta}, \qquad \text{where } \theta = Tbh^2/M_e,$$

so that

$$\frac{V(\hat{q})}{V(q)} = \frac{\theta}{1 + \theta}. \qquad (6)$$

This formula (6) is very similar to those of Goddard (2009) and Daetwyler et al. (2008), who derived it by considering the accuracy of estimating a single marker effect. It does not account for the reduction in the error variance when all markers are fitted simultaneously. Daetwyler et al. (2008) proposed the following correction: if the reliability (accuracy squared) calculated by (3) is called $R^2_{w/o}$, then the proposed estimate of the reliability ($R^2_D$) is

$$R^2_D = R^2_{w/o}\left(1 + \frac{R^4_{w/o}\,h^2}{2\theta}\right). \qquad (7)$$

Simulation

Populations

Twenty replicate populations were simulated as described in Meuwissen and Goddard (2010). Briefly, Fisher-Wright idealized populations were simulated with $N_e = 1000$ for generations in order to achieve a mutation-drift balance and linkage disequilibrium between the created SNPs. $N_e$ was assumed high here, which makes the probability that the training set contains close relatives of the predicted individual quite small, i.e. the predicted accuracies can be interpreted as accuracies of unrelated individuals. The genome consisted of one chromosome of 1 Morgan.
Mutations were simulated according to the infinite sites mutation model (Kimura 1969) at a rate of $10^{-8}$ per base-pair per meiosis. The recombination rate was also $10^{-8}$ per base-pair per meiosis. This resulted on average in segregating mutations (SNPs). One thousand SNPs were sampled at random (without replacement) to enter SNP-SET1, 1000 other SNPs were sampled for SNP-SET2, and 30 SNPs were sampled to act as QTL (QTL-SET). QTL effects were sampled from the Normal distribution. Thus, SNP-SET1, SNP-SET2 and QTL-SET are disjoint sets, and there are no systematic differences between the SNPs that enter each of the sets.

After the generations, the 20 populations were simulated for 10 more generations following the same population model, but now the sampled pedigree was recorded and used to set up a relationship matrix $\mathbf{A}$ of the last generation of animals. SNP-SET1 and SNP-SET2 were used to set up genomic relationship matrices $\mathbf{G}_{m1}$ and $\mathbf{G}_{m2}$, respectively, calculated as $\mathbf{X}\mathbf{X}'/M$ with $\mathbf{X}$ as defined above. This is very similar to Yang et al. (2010), except that Yang et al. calculated the diagonals of the $\mathbf{G}_{mx}$ matrices in a different manner in order to reduce their sampling error. The latter, however, resulted in negative eigenvalues of the $\mathbf{G}_{mx}$ matrix, whereas $\mathbf{X}\mathbf{X}'$ is positive semi-definite. Although Yang et al. (2010) found that such negative eigenvalues did not cause any problems, we have been cautious and used $\mathbf{X}\mathbf{X}'$.

Prediction of accuracy using $\mathbf{G}_m$

When using the GBLUP approach, the genomic relationship matrix, $\mathbf{G}_m$, can be used to predict the accuracy of individual genomic estimated breeding values (GEBV) following Henderson's (1984) mixed model theory. Equivalently, this prediction can be based on selection index theory. We will use selection index theory here and assume (arbitrarily) that genetic variance is 1, in which case the variance of the selection index equals its reliability.
We assume a phenotyped and genotyped training population of size $T$ and use selection index theory to calculate the accuracy of the EBV for an additional animal that has marker genotypes but no phenotype. As above, we do this by expanding the matrix $\mathbf{V} = V(\mathbf{y})$ to $\mathbf{V}^*$ by including the animal whose EBV reliability is required as the first animal. Then, the reliability calculated from the selection index, or equivalent MME, is $V^*_{11} - 1/V^{*11}$. By treating each animal in turn as the test animal whose EBV reliability is required, we can calculate the accuracy for all $T + 1$ animals with only one matrix inversion. By defining $\sigma_a^2$ as zero, we can calculate the accuracy predicted by the MME when $\mathbf{G}_m$ is used instead of $\hat{\mathbf{G}}$.
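The single-inversion calculation described above can be sketched numerically. This is a minimal illustration rather than the authors' implementation: the genotypes are simulated as independent markers, the sizes and variance components are assumed values (genetic variance 1, $\sigma_a^2 = 0$, $\sigma_e^2 = 1$), and the final lines check the result for one animal against the direct quadratic form $\mathbf{g}_2'\mathbf{V}^{-1}\mathbf{g}_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, M = 60, 400                                   # illustrative sizes (assumed)
counts = rng.binomial(2, rng.uniform(0.2, 0.8, size=M), size=(T + 1, M))
p = counts.mean(axis=0) / 2.0                    # sample allele frequencies
X = (counts - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
G_m = X @ X.T / M                                # genomic relationships, T + 1 animals

# Genetic variance 1; sigma2_a = 0 mimics using G_m itself in the MME.
sigma2_q, sigma2_a, sigma2_e = 1.0, 0.0, 1.0
V_star = G_m * sigma2_q + np.eye(T + 1) * (sigma2_a + sigma2_e)
V_star_inv = np.linalg.inv(V_star)               # the only inversion needed

# For animal i treated as the unphenotyped target:
# reliability_i = V*_ii - 1 / V*^{ii}
reliability = np.diag(V_star) - 1.0 / np.diag(V_star_inv)

# Direct check for animal 0: g2' V^{-1} g2 with the other T animals as training set.
g2 = V_star[1:, 0]
V = V_star[1:, 1:]
direct = g2 @ np.linalg.solve(V, g2)
```

The identity used here is the standard partitioned-inverse (Schur complement) result, which is why one inversion of the $(T + 1) \times (T + 1)$ matrix serves every animal.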

Estimation of true accuracy

True genetic values of all animals were calculated as

$$\mathrm{TBV}_i = \sum_{j=1}^{30} w_{ij}u_j,$$

where summation is over the QTL in QTL-SET, $w_{ij}$ is the genotype of QTL $j$ with 0, 1 and 2 denoting genotypes 0 0, 1 0 and 1 1, respectively, and $u_j$ is the normally distributed effect of the QTL. Within each data set, the variance of the $\mathrm{TBV}_i$ was standardized to 1. To obtain phenotypic records, an environmental effect was added to the TBV, which was sampled from $N(0, \sigma_e^2)$, where $\sigma_e^2$ was 4, 1 or to obtain heritabilities of 0.2, 0.5 and 0.9, respectively. To obtain estimates of breeding values, the phenotypic data were analysed by the GBLUP model

$$\mathbf{y} = \mu + \mathbf{Z}\mathbf{g} + \mathbf{e},$$

where $\mathbf{g}$ is the genetic value of 500 training and 500 evaluation animals; $\mathbf{Z}$ is an incidence matrix indicating which animals have records; and $V(\mathbf{g}) = \mathbf{G}_{m1}$ or $\hat{\mathbf{G}}$. Estimation of variance components pertaining to $\mathbf{g}$ and $\mathbf{e}$ and estimation of $\hat{\mathbf{g}}$ were by ASREML (Gilmour et al. 2002). True accuracy of the GEBV was calculated as the correlation between the $\hat{g}$ of the evaluation animals and their TBV.

Real data

The data set consisted of 1200 Australian Holstein bulls. The phenotype used for each bull was the mean of his daughters' fat percentage (fat%) in their milk. We chose this phenotype because it is known to be affected by a gene of large effect (diglyceride acyltransferase or DGAT) and we wished to test the theory under these conditions. To obtain this phenotype, we de-regressed the Australian Breeding Values (ABVs) to remove the contribution from relatives other than daughters (e.g. Pryce et al. 2010) while retaining the correction for non-genetic effects such as herd. All bulls with de-regressed EBVs had at least 80 daughters.

The bulls were genotyped using the Illumina Bovine50K array, which includes single-nucleotide polymorphism (SNP) markers (Matukumalli et al. 2009). The following criteria and checks were applied to the bulls' genotypes.
Mendelian consistency checks revealed a small number of sons who were discordant with their sires at many (>1000) SNPs or sires with many discordant sons. These animals (17) were removed from the data set. We omitted bulls if they had more than 20% of missing genotypes. A total of 1181 bulls passed these criteria. Criteria for selecting SNPs were <5% pedigree discordants (e.g. cases where a sire was homozygous for one allele and progeny were homozygous for the other allele), 90% call rate, minor allele frequencies (MAF) >2%, Hardy-Weinberg p > All of these criteria were met by SNPs. A small number of these were not assigned to any chromosome on Bovine Genome Build 4.0 and were omitted from the final data set, as were SNPs on the X chromosome. Parentage checking was then performed again, and any genotypes incompatible with pedigree were set to missing. To impute missing genotypes, the SNPs were ordered by chromosome position. All SNPs that could not be mapped or were on the X chromosome were excluded from the final data set, leaving SNPs. To impute missing genotypes, the genotype calls and missing genotype information were submitted to fastPHASE chromosome by chromosome (Scheet & Stephens 2006). The genotypes were taken as those filled in by fastPHASE.

The matrix $\mathbf{G}_m$ among the 1181 bulls was constructed as $\mathbf{G}_m = \mathbf{X}\mathbf{X}'/M$. The matrix $\mathbf{A}$ was constructed from the pedigree of the bulls, which had ancestors back to Then, $\hat{\mathbf{G}}$ was calculated as described above in equation (1a), as

$$\hat{\mathbf{G}} = \mathbf{A} + \hat{b}(\mathbf{G}_m - \mathbf{A}),$$

where $\hat{\sigma}_q^2$ and $\hat{\sigma}_g^2$ were estimated using ASREML (Gilmour et al. 2002) and $\hat{b} = \hat{\sigma}_q^2/\hat{\sigma}_g^2$.

The discovery data set consisted of bulls progeny tested before 2004 (n = 756). The bulls in the validation data set were progeny tested during or after 2004 (n = 400). GEBV for the validation set bulls were predicted using the normal BLUP equations with the $\mathbf{A}$ matrix replaced by $\hat{\mathbf{G}}$. Only phenotypes of the reference set bulls were used in this prediction.
Note that $\hat{b}$ was also derived using phenotypes from only the reference set bulls.

The accuracy of the GEBV was calculated in two ways. The $\hat{\mathbf{G}}$ realized accuracy was the correlation of the GEBV for the validation set bulls with their phenotypes divided by the accuracy of the phenotypes (0.9, from ADHIS). The $\hat{\mathbf{G}}$ theoretical accuracy was calculated from the diagonal elements of the inverse of the coefficient matrix in the usual way. To assess the value of correcting the $\mathbf{G}$ matrix for the proportion of variance explained by the markers, we also calculated the realized and theoretical accuracies with $\mathbf{G}_m$ in place of $\hat{\mathbf{G}}$. Calculations were performed with markers used to construct the genomic relationship matrices.

Results

Simulated data after the data are collected

Table 1 summarizes the properties of the relationship matrix ($\mathbf{G}_m$) calculated from the simulated marker data. Two separate sets of markers (1 and 2) are used so that the PEV in calculating elements of $\mathbf{G}_m$ and the true variance among elements of $\mathbf{G}_m$ can be assessed. In the simulated data, the PEV of elements of $\mathbf{G}_m$ is (Table 1). This is as predicted from our theory that PEV = $1/M$. As expected, the observed variance of $\mathbf{G}_m$ (0.0045) is the sum of the true variance (estimated as cov($\mathbf{G}_{m1}$, $\mathbf{G}_{m2}$)) and the PEV. Therefore, the regression coefficient $b$ in the calculation of $\hat{\mathbf{G}}$ in equation (2) is 0.78. Appendix 2 derives the result that the true $V(\mathbf{G}_m)$ is equal to the mean LD $r^2$ over all pairs of markers. To confirm this, we calculated the mean $r^2$ over the pairs of SNPs and found it to be

Table 2 compares the accuracy predicted using genomic relationship matrices in conventional MME with the true accuracy of the EBVs, which we can calculate from the correlation between EBV and true BV, which is known because these are simulated data. When $\mathbf{G}_m$ is used in traditional MME to calculate EBVs and the accuracies of those EBVs are calculated from the MME in the normal manner, the accuracies of the EBVs are overestimated (Table 2). However, when $\hat{\mathbf{G}}$ is used in the MME, the accuracies of the EBVs agree with the true accuracies calculated as the correlation of true BV with EBV, at least to within the standard error.

Simulated data before the data are collected

The aim in this section is to test the method of predicting the accuracy of genomic EBVs using information about the population (effective population size $N_e$), the genome length ($L$), the trait (heritability $h^2$) and the size of the discovery population ($T$).
The prediction method presented in the Methods section is in two parts: the proportion of variance explained by the markers [$b = V(q)/V(g)$] and the accuracy in estimating the marker effects [$V(\hat{q})/V(q)$]. We consider these two separately and then the combined prediction. We compare the predicted accuracies against the accuracy calculated using the MME because we have shown above that this predicts the true accuracy.

$V(q)/V(g)$

For $N_e = 1000$ and $L = 1$, equation (4) gives the effective number of chromosome segments as $M_e = 2N_eL/\log(N_eL) = 290$. Therefore, we predict that $V(D) = 1/M_e$ = close to the observed figure of (Table 1). Therefore, $b = 0.78$ from equation (2b).

$V(\hat{q})/V(q)$

Using this value of $M_e$ in formula (6), we predicted the accuracy of EBVs and compared that to the value expected from the MME using $\mathbf{G}_m$ (Table 3). [In calculating $\theta$, we set $b = 1$ because use of $\mathbf{G}_m$ in the MME implies that the markers explain all the variance.] Table 3 also contains the predicted accuracy before and after the correction of Daetwyler et al. (2008) using equation (7). Predicted accuracy increases with $T$ and $h^2$ and is approximately in agreement with the accuracy from the MME. As expected, the correction of Daetwyler et al. (2008) has little effect when reliabilities are low but a more marked effect as accuracies approach 1.0, where this correction improves the agreement with the accuracies calculated from the MME equations.

Table 1 Properties of the estimated genomic relationships (1)
Rows: Cov($\mathbf{G}_{m1} - \mathbf{A}$, $\mathbf{G}_{m2} - \mathbf{A}$) (2); PEV (3); $V(\mathbf{G}_{m1} - \mathbf{A})$ (2). [Table values not recovered.]
(1) Excluding self-relationships.
(2) The $\mathbf{A}$, $\mathbf{G}_{m1}$ and $\mathbf{G}_{m2}$ matrices are calculated as in the main text.
(3) Prediction error variance calculated as $V(\mathbf{G}_{m1} - \mathbf{G}_{m2})/2$.

Table 2 Comparison of predicted accuracy using genomic relationship matrices and true accuracy from correlation between true BV and EBV (1)
Columns: $h^2$; predicted accuracy using $\mathbf{G}_m$; predicted accuracy using $\hat{\mathbf{G}}$ [$V(\hat{q})/V(g)$]; true accuracy. [Table values not recovered.]
(1) Number of training records is $T = 500$.
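The two-part deterministic prediction, combining equations (2b), (3), (4) and (6), can be sketched as a short function. The function name and the `use_b` switch are hypothetical, the natural logarithm is assumed in equation (4), and the Daetwyler et al. (2008) correction of equation (7) is deliberately left out of this sketch.

```python
import math

def predicted_reliability(T, h2, Ne, L, k, M, use_b=True):
    """Deterministic reliability of genomic EBVs for unrelated test animals.

    use_b=False sets b = 1, mimicking the assumption that the markers
    explain all of the genetic variance (G_m used as if it were G).
    """
    Me = 2.0 * Ne * L * k / math.log(Ne * L)   # equation (4), natural log assumed
    b = M / (Me + M) if use_b else 1.0         # equation (2b)
    theta = T * b * h2 / Me                    # theta = T b h2 / M_e
    rel_q = theta / (1.0 + theta)              # equation (6): V(q_hat) / V(q)
    return b * rel_q                           # equation (3): R^2 = b * V(q_hat)/V(q)

# Illustrative inputs: Ne = 1000, one chromosome of 1 Morgan, 1000 markers.
r2 = predicted_reliability(T=500, h2=0.5, Ne=1000, L=1.0, k=1, M=1000)
accuracy = math.sqrt(r2)
```

Setting `use_b=False` gives a higher value, which mirrors the overestimation that arises when the sampling error in the genomic relationship matrix is ignored.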
The results in Table 3 ignore the errors in the $\mathbf{G}_m$ matrix both in predicting reliabilities and in calculating them in the MME. Table 2 shows that if the $\mathbf{G}_m$

matrix is regressed to $\hat{\mathbf{G}}$, the MME correctly predict the accuracy obtained by simulation. In Table 4, we combine the two aspects of predicting the accuracy by using equation (3) where $b$ = Note that $b$ is also needed in the formula for $\theta$. Table 4 shows that the predicted accuracies using equations (3) and (7) agree reasonably well with the accuracies calculated from the MME. The accuracies in Table 4 are lower than the comparable figures in Table 3 because in Table 3 we assumed that $\mathbf{G}_m$ correctly described the genetic relationship matrix, while in Table 4 we used $\hat{\mathbf{G}}$, which incorporates a residual polygenic variance not explained by the markers.

In summary, the simulation study shows that the proportion of genetic variance explained by the markers ($b$) can be predicted by equation (2b), the accuracy of estimating marker effects can be predicted by equation (6), these two estimates can be combined using equations (3) and (7) to predict the accuracy of EBVs, and these accuracies agree with those obtained from the MME and with the true accuracies.

Table 3 Predicted accuracy of EBVs before data are collected and from the mixed model equations (MME) after data collection using relationship matrix $\mathbf{G}_{m1}$
Columns: $h^2$; $T$ (2); predicted (1), $R_{w/o}$; MME. [Table values not recovered.]
(1) Prediction with and without the Daetwyler et al. (2008) correction.
(2) $T$ = the number of animals in the training population.

Table 4 Reliabilities (accuracy squared) predicted and from mixed model equations (MME) using $\hat{\mathbf{G}}$
Columns: $h^2$; $T$; $R_D$ predicted (Daetwyler); MME. [Table values not recovered.]

Real data

Figure 1 shows the accuracy calculated from the MME and realized for milk fat% in Holstein bulls. When $\mathbf{G}_m$ is used in the MME, the calculated accuracy is greater than that actually realized, but when $\hat{\mathbf{G}}$ is used, the calculated accuracy is close to that realized. As the number of markers increases, the accuracy predicted by $\hat{\mathbf{G}}$ and realized increases, whereas the accuracy calculated using $\mathbf{G}_m$ decreases.
This occurs because $V(\mathbf{G}_m)$ decreases as the number of markers is increased. In other words, when the number of markers is small, $\mathbf{G}_m$ is subject to large sampling errors, which cause the calculated accuracy to appear high but the real accuracy is reduced. As the number of markers increases, the sampling errors become small and $\mathbf{G}_m$ and $\hat{\mathbf{G}}$ converge.

Although BLUP methods of genomic selection assume normally distributed QTL effects, they are not very sensitive to departures from this assumption. This is illustrated here by the use of fat concentration in milk, a trait that is known to be influenced by a QTL with a large effect (DGAT). Despite the existence of this QTL, the theory correctly predicts the accuracy of EBVs calculated using BLUP methods.

Discussion

Equivalent models

The equivalence of a model based on marker effects and a conventional animal model with the relationship matrix estimated from the markers has been pointed out by several authors (Nejati-Javaremi et al. 1997; Fernando 1998; Villanueva et al. 2005; Habier et al. 2007; Goddard 2009; Strandén & Garrick 2009; VanRaden et al. 2009). The equality of the mean LD and the variance of relationships (shown in Appendix 1) is another aspect of this same equivalence. Both LD and relationships are caused by the inheritance, without recombination, of segments of chromosomes from a common ancestor. If the genome comprised an infinite number of loci all inherited independently (i.e. no linkage), there would be no LD or variation in relationship except that caused by variation in pedigree relationship. Linkage causes

9 M. E. Goddard et al. Predict the accuracy of genomic selection 0.9 Fat (%) Accuracy Gm_theoretical Gm_realised Ghat_theoretical Ghat_realised Number of SNPs Figure 1 The accuracy of estimated breeding values for milk fat concentration in Holstein bulls. points close together on a chromosome to have the same coalescence tree. As a consequence of this, there is a correlation between the relationship at one locus and that at neighbouring loci. This, in turn, causes variation in relationship in excess of that caused by variation in pedigree relationship. In the absence of LD, markers would not predict the genotypes at QTL, and the relationship at markers would not predict the relationship at QTL. If the relationship between all pairs of individuals were the same, then all individuals with no phenotype would receive the same EBV. This emphasizes the importance of variation in genomic relationship in driving the accuracy of genomic selection. Thus, it is meaningless to ask whether genomic selection works because it utilizes relationships over and above those due to pedigree or because it utilizes LD: the two explanations are equivalent. The equivalence between a model based on QTL effects and a conventional animal model would be invalidated if LD between QTL systematically increased or decreased total genetic variance. For instance, the Bulmer effect occurs when selection results in negative covariance between QTL because chromosomes tend to carry a mix of positive and negative QTL alleles. In this case, the total genetic variance will be less than that expected from the sum of the QTL variances (V(g) =WW r 2 u ). However, we usually define the genetic variance in a base population where there is assumed to be no Bulmer effect, and this genetic variance will agree with that calculated from the sum of the QTL variances. 
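The equivalence between the marker-effect model and the animal model with a marker-derived relationship matrix can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the model $g = Wu$ with $u \sim N(0, I\sigma_u^2)$, no fixed effects, and known variances, and it uses random normal numbers as a stand-in for centred genotypes and phenotypes. The function names and toy dimensions are hypothetical.

```python
import numpy as np

def gebv_marker_model(W, y, sigma_u2, sigma_e2):
    """BLUP of g = W u from the marker-effect model y = W u + e."""
    lam = sigma_e2 / sigma_u2                      # ridge parameter
    n_markers = W.shape[1]
    u_hat = np.linalg.solve(W.T @ W + lam * np.eye(n_markers), W.T @ y)
    return W @ u_hat

def gebv_animal_model(W, y, sigma_u2, sigma_e2):
    """BLUP of g from the animal model with V(g) = W W' sigma_u2."""
    n_animals = W.shape[0]
    V_g = W @ W.T * sigma_u2
    return V_g @ np.linalg.solve(V_g + sigma_e2 * np.eye(n_animals), y)

rng = np.random.default_rng(0)
n_animals, n_markers = 40, 200                     # toy sizes (assumed)
W = rng.standard_normal((n_animals, n_markers))    # stand-in for centred genotypes
y = rng.standard_normal(n_animals)                 # stand-in for phenotypes

g_marker = gebv_marker_model(W, y, sigma_u2=0.01, sigma_e2=1.0)
g_animal = gebv_animal_model(W, y, sigma_u2=0.01, sigma_e2=1.0)
print(np.max(np.abs(g_marker - g_animal)))         # difference at rounding-error level
```

The two routes differ only in which system is solved: one of order equal to the number of markers, the other of order equal to the number of animals.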
If the effective population size ($N_e$) is large, common ancestors tend to be in the distant past and so recombination will have broken up chromosomes into many small pieces that coalesce independently. Consequently, as $N_e$ increases, the variation in relationship decreases because the relationship between two individuals is an average over many independent chromosome segments. The derivation in Appendix 1 shows that the relationship is effectively an average over $M_e$ segments, where $M_e = 2N_eLk/\log(N_eL)$ [formula (4)].

In this paper, we point out another equivalence: that between a model using an unbiased estimate of the relationship matrix at the QTL ($\hat{G}$) and a model using a residual polygenic effect as well as a random effect described by a relationship matrix at the markers ($G_m$). This equivalence explains why the number of markers used is important. If the number of markers ($M$) is too small, $G_m$ estimates $G$ too imprecisely. The extent to which the markers track relationships at the QTL depends on $M/(M + M_e)$. This formula also describes the extent of LD between markers because $M/(M + M_e) = 1/(2 + 4N_ec/\log(N_eL))$, where $c$, the average distance between markers, is $Lk/M$. This is almost the same as the expectation of LD between neighbouring markers, $r^2 = 1/(2 + 4N_ec)$. The difference between the two formulae ($\log(N_eL)$) can be thought of as due to the LD between all other markers and a target marker, not just the nearest marker.

Knowledge of the variation in relationship led us to an approximation for the inverse of the matrix $V$, where $V = V(y)$, and this in turn led to an approximation for the accuracy of EBVs calculated from marker genotypes. This approximation, derived from a consideration of variation in relationships, is the same as that derived by Goddard (2009) and Daetwyler et al. (2008) from a consideration of the accuracy of estimating the effect of a single marker or, more correctly, an effective chromosome segment. Because the formulae are the same, it was easy for us to include the correction of Daetwyler et al. (2008), which accounts for the increased accuracy in estimating the effect of one marker when all other markers have been fitted and have hence reduced the residual variance.

The accuracy of genomic selection depends on the proportion of the genetic variance explained by the SNPs and the accuracy with which the SNP effects are estimated (Dekkers 2007; Goddard 2009). These two components of the accuracy are also used in this paper. The proportion of the genetic variance explained by the markers is $b = V(q)/V(g)$, and the accuracy of estimating marker effects is $V(\hat{q})/V(q)$.

The proportion of the genetic variance explained by the markers

To estimate $\hat{G}$ from $G_m$, we regress $G_m$ back towards $A$, and the regression coefficient ($b$) is the proportion of genetic variance explained by the markers. VanRaden (2008) proposed the same regression equation, but without a thorough derivation of its coefficient $b$. If QTL are not systematically different to markers, $b = M/(M + M_e)$, as shown in Table 4. However, if the QTL are systematically different to the markers, $b$ must be estimated from data on phenotypes. Research on human height (Yang et al. 2010) found that only half the genetic variance was explained by the SNPs owing to imperfect LD between the SNPs and the QTL. Of the remaining half, 10% was because of the finite number of SNPs used ( ) and 40% was because of systematic differences between QTL and SNPs. For instance, the QTL could have lower MAF than the SNPs. Within a breed of cattle such as Holsteins, recent $N_e$ is very small (100) compared with that in humans.
Consequently, the variation in relationships in cattle is large or, equivalently, LD is extensive, and so far fewer SNPs are necessary to explain most of the variance in relationship or, equivalently, most of the variation in QTL. Figure 1 shows that the accuracy of EBVs has reached an asymptote by markers but in humans this has not occurred even after markers.

While within a breed (of cattle) the accuracy may reach an asymptote after markers, for between-breed prediction, the accuracy is likely to reach an asymptote at a much higher number of markers. For example, Hayes et al. (2009a) found that the theoretical accuracy in a combined Holstein-Jersey population substantially overpredicted the actual accuracy when using $G_m$, even with markers. This is because, between breeds, the variation in relationship will be very small, so that large numbers of markers are required to predict these relationships accurately (and capture the limited LD that exists between breeds). So for multibreed prediction of GEBV, it is important to use $\hat{G}$ rather than $G_m$ when calculating reliabilities and to calculate the regression coefficient $b$ separately for within-breed and between-breed relationships. This parallels the finding of De Roos et al. (2008) that the phase of LD was not conserved between breeds when markers were used. In other words, SNPs are not enough to accurately detect relationships between cattle from different breeds such as Holstein and Jersey.

Unified approaches to utilize phenotypic, full pedigree, and genomic information for genetic evaluation, for both genotyped and un-genotyped individuals, have been proposed, which use a single relationship matrix in the BLUP equations (Aguilar et al. 2010). This matrix has both $G$ matrix and $A$ matrix components (e.g. sub-matrices based on relationships derived from genotypes and sub-matrices derived from pedigree relationships).
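The effect of regressing $G_m$ towards $A$ on the calculated reliability can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' procedure: the regression is taken to have the form $\hat{G} = A + b(G_m - A)$, the model is $y = g + e$ with no fixed effects, the animals are unrelated ($A = I$) with independent loci, and $b = 0.5$ is an arbitrary illustrative value. Function names and dimensions are hypothetical. Reliability is computed from the inverse of the MME coefficient matrix as $1 - \mathrm{PEV}_{ii}/(G_{ii}\sigma_g^2)$.

```python
import numpy as np

def genomic_relationship(X, p):
    """G_m = Z Z' / M, with genotypes (0/1/2) standardized by allele frequencies p."""
    Z = (X - 2 * p) / np.sqrt(2 * p * (1 - p))
    return Z @ Z.T / X.shape[1]

def regress_towards_pedigree(G_m, A, b):
    """G_hat = A + b (G_m - A): shrink G_m towards the pedigree matrix A."""
    return A + b * (G_m - A)

def mme_reliability(G, n_train, h2):
    """Reliability 1 - PEV_ii / (G_ii sigma_g^2) from the MME of y = g + e.

    Only the first n_train animals have records; no fixed effects are fitted.
    """
    n = G.shape[0]
    lam = (1 - h2) / h2                       # sigma_e^2 / sigma_g^2
    ZtZ = np.zeros((n, n))
    idx = np.arange(n_train)
    ZtZ[idx, idx] = 1.0                       # incidence of records on animals
    C_inv = np.linalg.inv(ZtZ + lam * np.linalg.inv(G))
    return 1 - lam * np.diag(C_inv) / np.diag(G)

rng = np.random.default_rng(1)
n, n_train, M = 60, 50, 300                   # toy sizes (assumed)
p = rng.uniform(0.1, 0.9, size=M)
X = rng.binomial(2, p, size=(n, M))           # unrelated animals, independent loci
A = np.eye(n)                                 # pedigree matrix of unrelated animals
G_m = genomic_relationship(X, p)
G_hat = regress_towards_pedigree(G_m, A, b=0.5)   # b = 0.5 is an arbitrary illustration

# Mean calculated reliability of the unphenotyped candidates under each matrix
rel_gm = mme_reliability(G_m, n_train, h2=0.5)[n_train:].mean()
rel_ghat = mme_reliability(G_hat, n_train, h2=0.5)[n_train:].mean()
print(rel_gm, rel_ghat)
```

In this toy setting the off-diagonal elements of $G_m$ are pure sampling error, so any reliability calculated for the candidates is spurious; shrinking towards $A$ reduces it.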
The $\hat{G}$ matrix proposed here would be the most suitable for describing (genomic) relationships among genotyped individuals in such a unified approach because, if $G_m$ was used, the accuracies might be overestimated. However, the relationships in $G$ and $A$ must be expressed to the same base before the matrices are combined (Meuwissen et al. 2011).

Further developments

Further developments of the methods presented here are desirable. For instance, how would the accuracy of genomic selection be predicted when there are pedigree relationships among the animals as well as relationships estimated from the markers? By analogy with the method used to combine the reliabilities from other sources of data, we suggest that $\theta = R^2/(1 - R^2)$ should be additive when independent sources of data are combined, but we have not investigated this suggestion in the present paper.

Conclusions

When the BLUP method of genomic selection is used, the results of this paper can be summarized as a series of recommendations for calculating the accuracy of genomic EBVs:

After the data have been collected:
- Fit an animal model but with the relationship matrix calculated as $\hat{G}$. The regression coefficients ($b$) would ideally be calculated by estimating the variance components associated with SNPs and with the residual polygenic variance. However, if QTL are assumed to have properties similar to SNPs, we can estimate $b$ as $M/(M + M_e)$. $M_e$ in turn can be estimated from observed variation in relationships [$V(G - A)$ minus the PEV] or from LD (mean $r^2$) or from $N_eL$ using equation (4).
- If $\hat{G}$ is used in the MME, the predicted accuracy of GEBV will also capture linkage and family information that is present among the reference population and selection candidates.

Before the data have been collected:
- Calculate the proportion of genetic variance explained by the SNPs ($b$) as above.
- Calculate the accuracy of estimating SNP effects as $\theta/(\theta + 1)$, where $\theta = Tbh^2/M_e$.
- Calculate the reliability (accuracy squared) as $R^2_{w/o} = b\,\theta/(\theta + 1)$.
- Apply the Daetwyler correction to $R^2_{w/o}$ to obtain $R^2_D$.

References

Aguilar I., Misztal I., Johnson D.L., Legarra A., Tsuruta S., Lawlor T.J. (2010) A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J. Dairy Sci., 93.
Daetwyler H.D., Villanueva B., Woolliams J.A., Weedon M.N. (2008) Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE, 3, e3395.
Dalton R. (2009) No bull: genes for better milk. Nature, 457, 369.
De Roos A.P.W., Hayes B.J., Spelman R., Goddard M.E. (2008) Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics, 179.
Dekkers J.C. (2007) Prediction of response to marker-assisted and genomic selection using selection index theory. J. Anim. Breed. Genet., 124.
Fernando R.L. (1998) Some true aspects of finite locus models. In: Proceedings of the 6th World Congress of Genetics Applied to Livestock Production, January, University of New England, Armidale, Australia, 26.
Gilmour A.R., Gogel B.J., Cullis B.R., Welham S.J., Thompson R. ASReml User Guide Release 1.0. VSN International Ltd., Hemel Hempstead, UK.
Goddard M.E. (2009) Genomic selection: prediction of accuracy and maximisation of long term response. Genetica, 136.
Habier D., Fernando R.L., Dekkers J.C. (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics, 177.
Harris B.L., Johnson D.L. (2010) Genomic predictions for New Zealand dairy bulls and integration with national genetic evaluation. J. Dairy Sci., 93.
Hayes B.J., Bowman P.J., Chamberlain A.C., Verbyla K., Goddard M.E. (2009a) Accuracy of genomic breeding values in multi-breed populations. Genet. Sel. Evol., 41, 51.
Hayes B.J., Visscher P.M., Goddard M.E. (2009b) Increased accuracy of artificial selection by using the realized relationship matrix. Genet. Res., 91.
Henderson C.R. (1984) Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario.
Hill W.G., Robertson A. (1968) Linkage disequilibrium in finite populations. Theor. Appl. Genet., 38.
Kimura M. (1969) The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61.
Matukumalli L.K., Lawley C.T., Schnabel R.D., Taylor J.F., Allan M.F., Heaton M.P., O'Connell J., Moore S.S., Smith T.P., Sonstegard T.S., Van Tassell C.P. (2009) Development and characterization of a high density SNP genotyping assay for cattle. PLoS ONE, 4, e5350.
Meuwissen T.H.E., Hayes B.J., Goddard M.E. (2001) Prediction of total genetic value using genome wide dense marker maps. Genetics, 157.
Meuwissen T.H.E., Luan T., Woolliams J.A. (2011) The unified approach to the use of genomic and pedigree information in genomic evaluations revisited. J. Anim. Breed. Genet., 128.
Nejati-Javaremi A., Smith C., Gibson J. (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. J. Anim. Sci., 75.
Powell J.E., Visscher P.M., Goddard M.E. (2010) Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet., 11.
Pryce J.E., Bolormaa S., Chamberlain A.J., Bowman P.J., Savin K., Goddard M.E., Hayes B.J. (2010) A validated genome-wide association study in two dairy cattle breeds for milk production and fertility traits using variable length haplotypes. J. Dairy Sci., 93.
Schaeffer L.R. (2006) Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet., 123.
Scheet P., Stephens M.A. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78.
Strandén I., Garrick D.J. (2009) Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J. Dairy Sci., 92.
Sved J.A. (1971) Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor. Popul. Biol., 2.

Tenesa A., Navarro P., Hayes B.J., Duffy D.L., Clarke G.M., Goddard M.E., Visscher P.M. (2007) Recent human effective population size estimated from linkage disequilibrium. Genome Res., 17.
VanRaden P.M. (2008) Efficient methods to compute genomic predictions. J. Dairy Sci., 91.
VanRaden P.M., Van Tassell C.P., Wiggans G.R., Sonstegard T.S., Schnabel R.D. et al. (2009) Invited Review: Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci., 92.
Villanueva B., Pong-Wong R., Fernandez J., Toro M.A. (2005) Benefits from marker-assisted selection under an additive polygenic genetic model. J. Anim. Sci., 83.
Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.F., Heath A.C., Martin N.G., Montgomery G.W., Goddard M.E., Visscher P.M. (2010) Missing heritability of human height explained by genomic relationships. Nat. Genet., 42.

Appendix 1

The properties of the genetic covariance matrix $WW'$

We use the same model as in the main text, that is $g = Wu$, and assume $u \sim N(0, I)$, so that $V(g) = WW'$ and the $ij$ element of this matrix is the covariance of breeding values between animals $i$ and $j$. The elements of $W$ ($w_{ik}$) describe the genotype of animal $i$ at marker $k$. Then

$$
\begin{aligned}
\text{$i$th diagonal element of } WW' &= w_i'w_i \\
E(w_i'w_i) &= \sum_j 2p_j(1-p_j) = \sigma_g^2 \\
\text{off-diagonal element $ij$ of } WW' &= w_i'w_j = \sum_k w_{ik}w_{jk} \\
E(w_i'w_j) &= 0 \quad \text{because the animals are unrelated} \\
V(w_i'w_j) &= E\Big[\Big(\sum_k w_{ik}w_{jk}\Big)\Big(\sum_l w_{il}w_{jl}\Big)\Big] \\
&= E\Big[\sum_k\sum_l (w_{ik}w_{jk})(w_{il}w_{jl})\Big] \\
&= \sum_k\sum_l E\big[(w_{ik}w_{jk})(w_{il}w_{jl})\big] \\
&= \sum_k\sum_l E(w_{ik}w_{il})\,E(w_{jk}w_{jl}) \\
&= \sum_k\sum_l \mathrm{Cov}(w_k, w_l)^2 \\
&= \sum_k\sum_l r_{kl}^2\,2p_k(1-p_k)\,2p_l(1-p_l) \\
&= \sum_k\sum_l 2p_k(1-p_k)\,2p_l(1-p_l)\,\sum_k\sum_l r_{kl}^2 \big/ [Q(1-Q)] \\
&= \sigma_g^4\,\mathrm{mean}(r^2)
\end{aligned}
$$

where $Q$ is the number of QTL and $r^2$ is the usual measure of LD. Thus, $G = WW'/\sigma_g^2$ is a relationship matrix with diagonal elements averaging 1.
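The derivation above implies that $G = WW'/\sigma_g^2$ has diagonal elements averaging 1 and off-diagonal elements with mean 0 and variance equal to the mean $r^2$. A simulation can check this in the simplest case. The sketch below assumes unrelated animals, independent loci (so that $r^2_{kl} = 0$ for $k \neq l$ and the mean $r^2$ over all pairs of loci is one over the number of loci), and an equal allele frequency of 0.5 at every locus; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n_animals, n_loci = 400, 500          # toy sizes (assumed)
p = 0.5                               # equal allele frequency at every locus (assumed)

X = rng.binomial(2, p, size=(n_animals, n_loci))   # independent loci, unrelated animals
W = X - 2 * p                                      # centred genotypes w_ik
sigma_g2 = n_loci * 2 * p * (1 - p)                # sum over loci of 2p(1-p)
G = W @ W.T / sigma_g2

off_diag = G[~np.eye(n_animals, dtype=bool)]       # all off-diagonal elements
diag_mean = np.diag(G).mean()                      # expected to be close to 1
off_mean = off_diag.mean()                         # expected to be close to 0
off_var = off_diag.var()                           # expected to be close to 1 / n_loci
print(diag_mean, off_mean, off_var, 1 / n_loci)
```

With linked loci the off-diagonal variance would exceed one over the number of loci, which is the sense in which linkage reduces the effective number of independent segments.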
The off-diagonal elements of $G$ have mean 0, and their variance is the mean of the $r^2$ measure of linkage disequilibrium over all pairs of loci. If we consider the QTL to be $M$ unlinked loci, then $r^2_{kk} = 1$ and $r^2_{kl} = 0$, so the mean of $r^2$ is $1/M$. However, if we assume that QTL are spread all along the chromosome, we can evaluate the mean by integrating: $\mathrm{mean}(r^2) = [\iint r^2_{kl}\,dk\,dl]/L^2$, where the limits of integration are 0 and $L$, the length of the chromosome. Assuming $E(r^2) = 1/(2 + 4Nc)$ (Tenesa et al. 2007), where $N$ is the effective population size and $c$ is the distance between loci in Morgans, then

$$
\begin{aligned}
\mathrm{mean}(r^2) &= \Big[\iint r^2_{kl}\,dk\,dl\Big] \Big/ L^2 \\
&= \iint \frac{1}{2 + 4N|l - k|}\,dk\,dl \Big/ L^2 \\
&= \big[(2 + 4NL)\log(2 + 4NL) - 4NL - 4NL\log 2 - 2\log 2\big] \big/ (8N^2L^2) \\
&\approx \log(NL)/(2NL) \quad \text{for large } NL
\end{aligned}
$$

If the genome is made up of $k$ chromosomes, each of length $L$, then $\mathrm{mean}(r^2) \approx \log(NL)/(2NLk)$ for large $NL$. Thus, the number of effective QTL ($M$) is $2NLk/\log(NL)$ even if the number of actual QTL is infinite, because linkage generates LD, which is a correlation between loci. If one prefers to use $E(r^2) = 1/(1 + 4Nc)$ (Sved 1971), because the LD is driven by inbreeding without new mutation, then $\mathrm{mean}(r^2) \approx \log(2NL)/(2NLk)$ for large $NL$.

Appendix 2

Heuristic approximation for $V^{*-1}$

From the main text, $V = G_m\sigma_q^2 + I(\sigma_e^2 + \sigma_a^2)$, where $G_m = XX'/M$. $G_m$ can be written as $I + D$, where $E(D) = 0$ and $V(D) = 1/M_e$. Then $(I + D)^{-1} = I - D + D^2 \ldots$ $I + D$
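The recommendations in the Conclusions for predicting reliability before data are collected, together with the Appendix 1 expressions for the mean $r^2$ and the effective number of segments, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation. The Daetwyler correction is not applied because its formula does not appear in this section. $N_e = 100$ is the Holstein value quoted in the Discussion; the chromosome length, chromosome number, marker number, training set sizes, and heritability are assumed values chosen only for the example, and the function names are hypothetical.

```python
import math

def mean_r2_exact(N, L):
    """Closed-form mean r^2 within one chromosome of length L Morgans,
    assuming E(r^2) = 1 / (2 + 4 N c) as in Appendix 1."""
    x = 4 * N * L
    num = (2 + x) * math.log(2 + x) - x - x * math.log(2) - 2 * math.log(2)
    return num / (8 * N**2 * L**2)

def effective_segments(N, L, k):
    """M_e = 2 N L k / log(N L): large-NL approximation for k chromosomes."""
    return 2 * N * L * k / math.log(N * L)

def predicted_reliability(T, h2, M, N, L, k):
    """Reliability before data collection, without the Daetwyler correction."""
    M_e = effective_segments(N, L, k)
    b = M / (M + M_e)                 # proportion of genetic variance explained by markers
    theta = T * b * h2 / M_e
    return b * theta / (theta + 1)    # b times accuracy of estimating marker effects

# Illustrative inputs: only N_e = 100 comes from the text; the rest are assumptions.
N_e, L, k = 100, 1.0, 30
print(mean_r2_exact(N_e, L), math.log(N_e * L) / (2 * N_e * L))  # exact vs approximation
for T in (1000, 5000, 20000):
    print(T, round(predicted_reliability(T, h2=0.5, M=50000, N=N_e, L=L, k=k), 3))
```

The first printed line compares the closed-form mean $r^2$ with its large-$NL$ approximation; the loop shows how the predicted reliability rises with the size of the training population when everything else is held fixed.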
