Enhancing Genome-Enabled Prediction by Bagging Genomic BLUP


Daniel Gianola 1,2,3*, Kent A. Weigel 2, Nicole Krämer 4, Alessandra Stella 5, Chris-Carolin Schön 4

1 Department of Animal Sciences, University of Wisconsin-Madison, Madison, Wisconsin, United States of America, 2 Department of Dairy Science, University of Wisconsin-Madison, Madison, Wisconsin, United States of America, 3 Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America, 4 Department of Plant Breeding, Technical University of Munich, Weihenstephan Center, Munich, Germany, 5 Bioinformatics and Statistics Unit, Parco Tecnologico Padano, Lodi, Italy

Abstract

We examined whether or not the predictive ability of genomic best linear unbiased prediction (GBLUP) could be improved via a resampling method used in machine learning: bootstrap aggregating sampling ("bagging"). In theory, bagging can be useful when the predictor has large variance or when the number of markers is much larger than sample size, preventing effective regularization. After presenting a brief review of GBLUP, bagging was adapted to the context of GBLUP, both at the level of the genetic signal and of marker effects. The performance of bagging was evaluated with four simulated case studies including known or unknown quantitative trait loci, and an application was made to real data on grain yield in wheat planted in four environments. A metric aimed at quantifying candidate-specific cross-validation uncertainty was proposed and assessed; as expected, model-derived theoretical reliabilities bore no relationship with cross-validation accuracy. It was found that bagging can ameliorate the predictive performance of GBLUP and make it more robust against overfitting. Seemingly, 25-50 bootstrap samples were enough to attain reasonable predictions as well as stable measures of individual predictive mean squared errors.

Citation: Gianola D, Weigel KA, Krämer N, Stella A, Schön C-C (2014) Enhancing Genome-Enabled Prediction by Bagging Genomic BLUP. PLoS ONE 9(4): e91693. doi:10.1371/journal.pone.0091693

Editor: Qin Zhang, China Agricultural University, China

Received December 18, 2013; Accepted February 13, 2014; Published April 10, 2014

Copyright: © 2014 Gianola et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Research was partially supported by the Federal Ministry of Education and Research (BMBF, Germany) within the AgroClustEr Synbreed - Synergistic plant and animal breeding (FKZ A), by a U.S. Department of Agriculture Hatch Grant (14-PRJ63CV) to DG, and by the Wisconsin Agriculture Experiment Station. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* gianola@ansci.wisc.edu

Introduction

A method for whole-genome enabled prediction of quantitative traits known as GBLUP, standing for genomic best linear unbiased prediction, was seemingly suggested first by VanRaden [1-2]. In GBLUP, a pedigree-based relationship matrix among individuals [3] is replaced by a matrix-valued measure of genomic similarities constructed using molecular markers, such as single nucleotide polymorphisms (SNPs) [4].
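For orientation, the following is a minimal sketch, not the authors' code, of one common way (a VanRaden-type scaling) to build such a genomic similarity matrix from SNP codes; sizes, frequencies and all object names are illustrative.

```r
# Minimal sketch (illustrative, not the paper's software): a VanRaden-type
# genomic relationship matrix from SNP genotypes coded 0/1/2.
set.seed(1)
n <- 100; p <- 500
M <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n)  # SNP codes

freq <- colMeans(M) / 2                              # estimated allele frequencies
Z    <- sweep(M, 2, 2 * freq)                        # center each marker column
G    <- tcrossprod(Z) / sum(2 * freq * (1 - freq))   # G = ZZ' / sum 2p(1-p)
G[1:3, 1:3]
```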
This genomic relationship matrix or variants thereof [5-6], referred to as G hereinafter, defines a covariance structure among individuals (even if genetically unrelated in the standard sense), stemming from molecular similarity in state at additive marker loci among members of a sample. Given G and values of some needed variance components, one can use the theory of best linear unbiased prediction (BLUP) to obtain point and interval (e.g., prediction error variances) estimates of the genetic values of a set of individuals as marked by the battery of SNPs. While G can be constructed in different manners, we do not address this issue here. However, we note that it is possible to separate genomic from residual variance components statistically even in the absence of genetic relatedness. Hence, care must be exercised when relating the genomic to the genetic variance; this is discussed in [7].

Using theory developed by Henderson [8-10], it can be shown that GBLUP is equivalent to a linear regression model on additive genotypic codes of markers, with the allelic substitution effects at marker loci treated as drawn independently from a distribution possessing a constant variance over markers [11-13]. There is also an equivalence between pedigree-based BLUP or GBLUP (or models using both pedigree and marker relationships) and nonparametric regression [14]. For instance, if the $n \times p$ marker matrix $X$ ($n$ = number of observations, $p$ = number of markers) is used to construct a kernel matrix $XX'$, it can be established that GBLUP is a reproducing kernel Hilbert spaces regression procedure [15-16]. Also, BLUP and GBLUP can be represented as linear neural networks where inputs are entries of the pedigree-based or G matrices, respectively [17]. Hence, GBLUP can be motivated from several different perspectives.

There are many competing procedures for genome-enabled prediction, such as the members of the growing Bayesian alphabet [14],[18-19], but most of these require a Bayesian Markov chain Monte Carlo (MCMC) implementation. On the other hand, GBLUP is simple, easy to understand, explain and compute, and there is software available for likelihood-based variance component estimation and for prediction of random effects. Also, GBLUP handles cross-sectional and longitudinal data flexibly and extends to multiple-trait settings in a direct manner. Further, GBLUP delivers a competitive predictive ability, since members of the Bayesian alphabet typically differ little in predictive performance and differences among methods are typically masked by cross-validation noise [19-23]. Last but not least, some members of this alphabet produce predictions that are sensitive to hyper-parameter specification [24].

Given these considerations, GBLUP or extensions thereof [25] are good candidates for routine whole-genome prediction in animal and plant breeding applications, and possibly for prediction of complex traits in humans as well [7],[26].

The closeness between predictions and realized values depends mainly on three factors: prediction bias, variance of prediction error and amount of noise associated with future observations. The latter cannot be reduced by any prediction machine based on training data, so it is impossible to construct predictors attaining a perfect predictive correlation, even if the model holds. Theoretical and empirical results [1], [27-30] indicate that the proportion of (cross-validation) variance explained by a linear predictor increases up to a point with training sample size, then reaching a plateau. However, when a small number of individuals is available, any prediction machine is bound to produce predictions with a large variance. In this context, it seems worthwhile to explore avenues for enhancing accuracy (i.e., reducing bias) and reliability (i.e., decreasing variance) of predictions when training size is small.

The question examined here is whether or not the predictive ability of GBLUP can be improved by recourse to resampling methods used in machine learning. These include bootstrap aggregating sampling ("bagging") and iterated bagging or debiasing [31-34]. Bagging uses bootstrap sampling to reduce variance and can enhance reliability and reduce mean squared error [31]. The conditional bias of GBLUP [19] cannot be removed by bagging, but iterated bagging has the potential of reducing variance while removing bias simultaneously. This study investigates bagging of GBLUP, with consideration of debiasing deferred to a future investigation.

The second section of this paper gives a review of GBLUP and of its inherent inaccuracy (bias). The third section describes bagging in the context of GBLUP, at the level of the genetic signal and of marker effects. The fourth section examines the performance of bagging in four simulated case studies, and the fifth section presents an application to real data on grain yield in wheat planted in four environments. The paper concludes with a discussion and with a proposal of a metric aimed at quantifying candidate-specific cross-validation uncertainty.

Materials and Methods

GBLUP

Idealized conditions. Assume that effects of nuisance factors (e.g., year to year variation) have been removed in a pre-processing stage (this can also be done in a single stage, but we ignore this for simplicity). GBLUP can be motivated by positing the linear regression model on markers

$$y = Xb + e, \tag{1}$$

where $y$ is an $n \times 1$ vector of observations or pre-processed data measured on a set of individuals or lines; $X$ is an $n \times p$ matrix of marker genotypes, with typical element $x_{ij}$ being the genotype code at locus $j$ observed in individual $i$, and with $\mathrm{rank}(X) \leq \min(n,p)$; $b$ is a $p \times 1$ vector of unknown allelic substitution effects when marker genotypes are coded, e.g., as $-1$, $0$ and $1$ for $aa$, $Aa$ and $AA$ at locus $A$, say, or when these coded values are deviated from the corresponding column means or standardized. Above, $e \sim N(0, N\sigma^2_e)$ is a vector of residuals, where $\sigma^2_e$ is the residual variance and $N$ is an $n \times n$ diagonal matrix with typical element $N_{ii}$; if $y$ consists of single measurements on individuals, $N_{ii} = 1$ for all $i = 1,2,\dots,n$, and if $y$ is a vector of means, $N_{ii}$ would be the number of observations entering into the mean.
In BLUP, a distribution is assigned to $b$, and the simplest one is $b\,|\,\sigma^2_b \sim N(0, I\sigma^2_b)$, where $\sigma^2_b$ is the variance of marker allele substitution effects. Using this assumption together with model (1) gives as marginal distribution of the data (after assuming that $X$ and $y$ have been centered)

$$y\,|\,X \sim N(0,\; XX'\sigma^2_b + N\sigma^2_e). \tag{2}$$

In BLUP, $\sigma^2_b$ and $\sigma^2_e$ are treated as known, but these parameters are typically estimated from the data at hand [35]. With markers, most often $n < p$, so it is convenient to form the best linear unbiased predictor of $b$ as $\hat b = \sigma^2_b X' V^{-1} y$, where $V = XX'\sigma^2_b + N\sigma^2_e$. If model (1) holds, it can be shown that $\hat b$ is unbiased in the sense that $E(\hat b\,|\,X) = E(b) = 0$. Its covariance matrix (given $X$) is $V_{\hat b} = \sigma^4_b X' V^{-1} X$, and its prediction error covariance matrix (also the covariance matrix of the conditional distribution of $b$ given $y$ and $X$ under normality assumptions) is

$$V_{b|y} = V_b - V_{\hat b} = \sigma^2_b\left[I - \sigma^2_b X' V^{-1} X\right], \tag{3}$$

where $V_b = I\sigma^2_b$. The diagonal elements of $V_{b|y}$ ($v_{jj}$) lead to a measure of reliability of prediction of marker effect $j$: $\mathrm{rel}_j = 1 - v_{jj}/\sigma^2_b$. A matrix of reliabilities and co-reliabilities is

$$R = I - V_{b|y} V_b^{-1} = V_{\hat b} V_b^{-1}. \tag{4}$$

If one wishes to predict a future vector $y_f = X_f b + e_f$, with future residuals independent of past ones and provided that future and past residuals stem from the same stochastic process, under normality assumptions the predictive distribution [19] is

$$y_f\,|\,y, X, X_f \sim N\!\left(X_f \hat b,\; X_f V_{b|y} X_f' + I_f\sigma^2_e\right). \tag{5}$$

Further, if $y_f$ is the predictand and $\mathrm{BLUP}(X_f b) = X_f\hat b = \hat y_f$ is the predictor, BLUP theory yields

$$\mathrm{Var}(y_f\,|\,y) = \mathrm{Var}(y_f) - \mathrm{Var}(\hat y_f), \tag{6}$$

so that $\mathrm{Var}(\hat y_f) = X_f V_{\hat b} X_f'$. If the marker effects could be estimated such that $V_{b|y} \to 0$, then $\mathrm{Var}(\hat y_f) \to X_f V_b X_f'$ and $\mathrm{Var}(y_f\,|\,y) \to I_f\sigma^2_e$. Hence, the distribution in (5) would still have variance, indicating that it is not possible to attain a predictive correlation equal to 1 even if a training set has an infinite size (more formally, if the training process conveys an infinite amount of information about markers); [7] gives a discussion of related matters.
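As a concrete illustration of equations (2)-(4), the snippet below computes the BLUP of marker effects and the associated reliabilities on a small simulated data set. Treating the variance components as known is an assumption of this sketch, not a recommendation; all names are ours.

```r
# Sketch of equations (2)-(4): BLUP of marker effects and their reliabilities.
# Variance components are taken as known; in practice they are estimated.
set.seed(2)
n <- 60; p <- 200
X   <- scale(matrix(rbinom(n * p, 2, 0.4), n, p), scale = FALSE)  # centered codes
b   <- rnorm(p, 0, sqrt(0.1))
y   <- drop(X %*% b + rnorm(n, 0, sqrt(5)))
s2b <- 0.1; s2e <- 5

V     <- tcrossprod(X) * s2b + diag(n) * s2e    # V = XX' s2b + N s2e (N = I here)
Vinv  <- solve(V)
bhat  <- drop(s2b * crossprod(X, Vinv %*% y))   # BLUP(b) = s2b X' V^-1 y
Vbhat <- s2b^2 * crossprod(X, Vinv %*% X)       # covariance matrix of BLUP
vjj   <- s2b - diag(Vbhat)                      # prediction error variances
rel   <- 1 - vjj / s2b                          # reliability of each marker effect
```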

BLUP is conditionally inaccurate

While BLUP theory is well established, quantitative geneticists tend to interpret the unbiasedness property of BLUP as if it pertained to the true unknown $b$, when in fact it applies to the average of the distribution of $b$, that is, 0. If $b$ in (1) were viewed as a model on unknown effects of known quantitative trait loci (QTL), it is obvious that one should think in terms of a fixed effects model, as per the standard finite-number-of-loci model of quantitative genetics [36]. Accordingly, if effects of markers are sought because these flag some genomic region of interest, the random sampling assumption made in BLUP is not relevant, although it might lead to a more stable estimator. In the fixed effects case, both $b$ and the marked genetic signal $Xb$ are estimated with bias by BLUP even if the model holds [19].

Markers are not QTL, and the latter are the causes generating a signal to phenotype. Suppose that the true model is linear on effects ($q$) of QTL relating to phenotypes via incidence matrix $Q$, that is,

$$y = Qq + e. \tag{7}$$

If the QTL effects $q$ are viewed as fixed entities (arguably, geneticists have this in mind in their quest of finding genes), $E(y\,|\,Qq) = Qq$. In this situation, BLUP produces the following average outcomes:

$$E(\hat b\,|\,q,Q,X) = E\!\left[\sigma^2_b X' V^{-1} y\,|\,Q\right] = \sigma^2_b X' V^{-1} Qq$$

and

$$E(X\hat b\,|\,q,Q,X) = E\!\left[\sigma^2_b XX' V^{-1} y\,|\,Q\right] = \sigma^2_b XX' V^{-1} Qq,$$

so the bias of the estimated signal is $Qq - \sigma^2_b XX' V^{-1} Qq = \left[I - \sigma^2_b XX' V^{-1}\right]Qq$. On the other hand, if $q$ is assigned a distribution, say $N(0, I\sigma^2_q)$, and the $p$ markers are treated as random as well, e.g., $b\,|\,\sigma^2_b \sim N(0, I\sigma^2_b)$, under normality assumptions one has

$$E(q\,|\,\hat b, Q, X) = E(q) + \mathrm{Cov}(q, \hat b')\,V_{\hat b}^{-1}\,\hat b = \hat q_{markers}, \tag{8}$$

where

$$\mathrm{Cov}(q, \hat b') = \mathrm{Cov}(q, \sigma^2_b y' V^{-1} X) = \sigma^2_b\,\mathrm{Cov}(q, q')\,Q' V^{-1} X = \sigma^2_b\sigma^2_q Q' V^{-1} X.$$

Using this in (8),

$$\hat q_{markers} = \sigma^2_b\sigma^2_q Q' V^{-1} X V_{\hat b}^{-1}\hat b,$$

where $Q' V^{-1} X$ conveys unknown linkage disequilibrium relationships between QTL and marker genotypes; similarly, the best statement about the signal is $E(Qq\,|\,\hat b, Q, X) = Q\hat q_{markers}$. Unfortunately, neither $\sigma^2_q$ nor $Q$ are known, so statements made about QTL from markers are based on untestable assumptions, including the view that the QTL effects and the genotypes are linearly related, as in (7).

Predictive correlation when markers are the QTL

Imagine a best case scenario where the markers are the QTL, and consider predicting $y_f = Q_f q + e_f$. Here, $Q_f$ is the incidence matrix relating QTL effects to yet-to-be-realized phenotypes $y_f$, and $e_f$ is a future vector of residuals. BLUP theory, using $Q_f\hat q$ (here, $\hat q$ is the BLUP of $q$ under the true model) as predictor, gives the following squared correlation between the $i$th elements of $y_f$ and $Q_f\hat q$ (below, $Q'_{f,i}$ is the $i$th row of $Q_f$):

$$r^2_{ii} = \frac{\mathrm{Var}(Q'_{f,i}\hat q)}{\mathrm{Var}(y_{f,i})} = \frac{Q'_{f,i}\,\mathrm{Var}(\hat q)\,Q_{f,i}}{\sigma^2_q Q'_{f,i} Q_{f,i} + \sigma^2_e}.$$

Let now $n \to \infty$, in which case BLUP theory yields $Q'_{f,i}\mathrm{Var}(\hat q) Q_{f,i} \to Q'_{f,i} Q_{f,i}\sigma^2_q$, so that

$$r^2_{ii} \to 1\big/\left[1 + \sigma^2_e/\!\left(\sigma^2_q Q'_{f,i} Q_{f,i}\right)\right].$$

This shows that it is impossible to attain a perfect predictive correlation even when the markers are the QTL. Further, $Q'_{f,i} Q_{f,i} = \sum_{j=1}^{n_q} Q^2_{f,ij}$, where $Q_{f,ij}$ is the genotype at QTL locus $j$ ($j = 1,\dots,n_q$) of individual $i$ in the testing set. If QTL genotypes are centered and assumed to be in Hardy-Weinberg equilibrium, $E(Q^2_{f,ij}) = 2p_j(1-p_j)$, so approximately

$$r^2_{ii} \approx \left[1 + \frac{\sigma^2_e}{\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q}\right]^{-1} = h^2, \tag{9}$$

where $p_j$ is the frequency of a reference allele, $\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q$ is the additive genetic variance and

$$h^2 = \frac{\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q}{\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q + \sigma^2_e}$$

is trait heritability. Hence, the predictive correlation has an upper bound at $h$. If, instead of predicting individual phenotypes, the problem is that of predicting an average, the upper bound for the predictive correlation is higher. The corresponding formula is easy to arrive at: it just requires replacing $\sigma^2_e$ in (9) by $\sigma^2_e/n_{Ave}$, where $n_{Ave}$ is the number of observations intervening in the average.
Then

$$r^2_{ii} \approx \left[1 + \frac{\sigma^2_e/n_{Ave}}{\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q}\right]^{-1} = \frac{n_{Ave}\,h^2}{1 + (n_{Ave}-1)h^2},$$

which is heritability as used in plant breeding, or heritability of a mean [36]. The predictive correlation never reaches 1 but can get close to 1 if $n_{Ave}$ is very large. Large squared predictive correlations have been attained in dairy cattle progeny tests, where predictands are processed averages of production records of cows sired by a bull [2].
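A quick numerical check of these two bounds, under parameter values that are purely illustrative:

```r
# Worked check of equation (9) and its n_Ave version; all values illustrative.
nq <- 200; pj <- rep(0.3, nq); s2q <- 0.5; s2e <- 10
varA <- sum(2 * pj * (1 - pj)) * s2q        # additive genetic variance
h2   <- varA / (varA + s2e)                 # bound on squared predictive corr.
nAve <- 50
h2m  <- nAve * h2 / (1 + (nAve - 1) * h2)   # bound when predicting a mean
c(h2 = h2, h = sqrt(h2), h2_mean = h2m)     # h2_mean approaches 1 as nAve grows
```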

Bagging GBLUP (BGBLUP) and marker effects

Bagging GBLUP. Bagging exploits the idea that predictors can be rendered more stable by repeated bootstrapping and averaging over bootstrap samples, thus reducing variance [31]. Bagging has been found to have advantages in cases where predictors are unstable, i.e., when small perturbations of the training set produce marked changes in model training [31], [34], [37]; for example, with ordinary least-squares under severe multicollinearity. An important application of bagging is in prediction using random forests [38]. Prediction methods that use regularization, such as those applied in genome-enabled selection, are often stable because penalties on model complexity reduce the effective number of parameters drastically, thus lowering variance. However, this is attained at the expense of bias with respect to marker effects and to the unknown function to be predicted (the marked genetic value). A priori, it would seem that bagging would not bring advantages in the context of a regularized method such as GBLUP. However, this issue has not been examined so far, and there may be cases, e.g., in small populations, where random variation in training sets of small size has a marked impact on predictive ability.

To motivate bagging, we recall that GBLUP is a regression of phenotypes on genomic relationships between individuals. Let $g = Xb$ be a vector of genomic values, $G \propto XX'$, and assume that $G^{-1}$ exists; then (1) can be written in the equivalent form

$$y = GG^{-1}g + e = Gg^* + e, \tag{10}$$

where $g \sim N(0, G\sigma^2_b)$ and $g^* = G^{-1}g \sim N(0, G^{-1}\sigma^2_b)$. In scalar form, the datum for individual $i$ is expressible under this parameterization as the regression

$$y_i = \sum_{j=1}^{n} G_{ij}\,g^*_j + e_i = G'_i g^* + e_i, \tag{11}$$

where $G_{ij}$ is an element of the $n \times n$ matrix $G$ and $G'_i$ is its $i$th row. From basic principles, $\mathrm{BLUP}(g^*) = \mathrm{Cov}(g^*, y')V^{-1}y = \sigma^2_b V^{-1}y = \hat g^*$, and since $Gg^* = g$, use of invariance gives

$$\mathrm{BLUP}(g) = G\,\mathrm{BLUP}(g^*) = \sigma^2_b G V^{-1} y = \hat g. \tag{12}$$

Note that $\sigma^2_b G V^{-1}$ is an $n \times n$ matrix of "heritabilities" and "co-heritabilities"; the diagonal elements of $G$ may be distinct from each other, so each individual may have a different heritability ascribed to it, as noted by de los Campos et al. (2013).

The intuitive idea behind bagging was outlined in [1]. Suppose there is a large number of training samples from the same population; by averaging over the predictions made from these samples, we would end up with a reduction of variance, but with the bias properties remaining the same as those of the predictor derived from a single training set. This large supply of training sets can be emulated by bootstrap sampling: by averaging over samples, one gets closer (in the mean squared error sense) to the true value, on average. Technical details of why this works are given in Appendix S1.

Let the predictor formed from a single training set of size $N_{Train}$ be $\hat w(G'_i) = G'_i\hat g^*$ ($i = 1,2,\dots,N_{Train}$). Its variance can be lowered by taking $B$ bootstrap copies of size $N_{Train}$ (i.e., sampling with replacement from the training set) and then averaging over copies. A given $(y_i, G'_i)$ may not appear at all or may be repeated several times over the $B$ bootstrap samples. The bagging algorithm, implemented in the sketch that follows, is:

- For each copy $b$ ($b = 1,2,\dots,B$), run GBLUP using estimates of variance components obtained from the entire data set (to simplify computations), find the regressions $\hat g^*_b$, and form a bootstrap draw for $\mathrm{BLUP}(g)$ as

$$\hat g_b = G_b\hat g^*_b; \quad b = 1,2,\dots,B. \tag{13}$$

- After running the $B$ GBLUP implementations, take the averages

$$\hat g^*_{Bagged} = \frac{\sum_{b=1}^{B}\hat g^*_b}{B} \quad \text{and} \quad \hat w_{bagged} = \frac{\sum_{b=1}^{B}\hat g_b}{B}. \tag{14}$$

- Predict the vector $y_{Test}$ in the testing set as $\hat w_{Test} = G_{Test,Train}\,\hat g^*_{Bagged}$, where $G_{Test,Train}$ is a matrix of genomic relationships between individuals in the testing and training sets. If $y_{Test}$ pertains to records on the same individuals, the predictor is $\hat w_{Test} = \hat w_{bagged}$.
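A compact R sketch of this algorithm is given below. It is our illustration, not the authors' code: variance components are treated as known, G is built from simulated markers, and test-set predictions are averaged over bootstrap copies, which is the operational form of equation (14) when each bootstrap copy resamples a different set of individuals.

```r
# Sketch of bagging GBLUP (BGBLUP); all sizes and names are illustrative.
set.seed(3)
n <- 300; p <- 1000
M <- matrix(rbinom(n * p, 2, 0.3), n, p)
Z <- scale(M, scale = FALSE)
G <- tcrossprod(Z) / p                               # genomic relationship matrix
g <- drop(t(chol(G + diag(1e-6, n))) %*% rnorm(n))   # true signal, cov ~ G
y <- g + rnorm(n, 0, 1)

train <- 1:250; test <- 251:300
s2g <- 1; s2e <- 1; B <- 25                          # variance components "known"
pred <- matrix(0, length(test), B)
for (b in 1:B) {
  idx <- sample(train, replace = TRUE)               # bootstrap copy, eq. (13)
  Vb  <- G[idx, idx] * s2g + diag(length(idx)) * s2e
  gstar_b <- s2g * solve(Vb, y[idx])                 # BLUP(g*) within the copy
  pred[, b] <- G[test, idx] %*% gstar_b              # test prediction, this copy
}
w_bagged <- rowMeans(pred)                           # bagged predictor, eq. (14)
cor(w_bagged, y[test])                               # predictive correlation
```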
To see how improvement arises, consider the following argument. We seek to learn the signal $g$ from the GBLUP predictor $\hat g$. Under the setting of BLUP (where $g$ and $y$ both vary at random over conceptual repeated sampling), the best predictor of $g$ in the mean squared error sense is $\hat g$, because this is the conditional expectation under normality [39-40]. However, if $g$ is the signal of a fixed target set of candidates, $\hat g$ is biased, as shown earlier. Thus, $E(\hat g\,|\,g) = g + d$, where $d$ is a vector of biases, and the mean squared error matrix of $\hat w_{bagged}$ is

$$\mathrm{MSE}(\hat w_{bagged}\,|\,g) = E(\hat w_{bagged} - g)\,E(\hat w_{bagged} - g)' + \mathrm{Var}(\hat w_{bagged}).$$

Now, if the $B$ bootstrap copies of GBLUP are drawn from the same distribution, $\hat w_{bagged}$ has the same expectation as $\hat g$ and, therefore, the same bias: $E(\hat w_{bagged} - g\,|\,g) = d$. Further, if the $B$ copies are viewed as drawn independently, the variance of $\hat w_{bagged}$ should be $B$ times smaller than that of any $\hat g_b$. Thus,

$$\mathrm{MSE}(\hat w_{bagged}) = dd' + \frac{\mathrm{Var}(\hat g)}{B};$$

a formula taking the correlation between samples into account is not available, but the reduction in MSE would obviously be smaller than in the idealized situation where samples are independent. Hence, $\hat w_{bagged}$ has the same bias as GBLUP but, at best, is $B$ times less variable.

If a prediction machine has little variance, as is the case for shrinkage methods with a large amount of regularization, predictive performance might be degraded [31], but whether or not this occurs in a particular problem can be assessed only empirically.

Bagging marker effects

Alternatively, one can work directly with the linear model (1), but this is more involved computationally because the problem becomes $p$-dimensional (in GBLUP there are $n < p$ unknowns). Assuming centered data, the ridge regression estimator (BLUP of marker effects) is

$$\hat b = \left(X'X + I\lambda\right)^{-1} X'y,$$

where some $\lambda = \sigma^2_e/\sigma^2_b$ (in the BLUP framework) is available, estimated from the data at hand or chosen over a cross-validation grid. Given the data matrix $D = [y, X]$, of order $n \times (1+p)$, draw $B$ bootstrap copies by randomly sampling $n$ rows from $D$ with replacement, such that a particular bootstrap sample is $D_b = [y_b, X_b]$. The bagged ridge regression estimator is

$$\hat b_{Bagged} = \frac{1}{B}\sum_{b=1}^{B}\left(X_b'X_b + I\lambda_b\right)^{-1} X_b' y_b.$$

Then, given an out-of-sample case with marker matrix $X_{Test}$, the yet-to-be-realized phenotypes are predicted as $\hat y_{Bagged} = X_{Test}\,\hat b_{Bagged}$. Using the relationship $g = Xb$, one can obtain indirect samples of the bootstrap distribution of $\hat g$ as $X_b\left(X_b'X_b + I\lambda_b\right)^{-1}X_b' y_b = H_b y_b$, where $H_b = X_b\left(X_b'X_b + I\lambda_b\right)^{-1}X_b'$ is the hat matrix for sample $b$, and form the indirect bagged GBLUP as

$$\hat g_{Bagged,indirect} = \frac{1}{B}\sum_{b=1}^{B} H_b y_b. \tag{15}$$
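The following sketch, again illustrative rather than the authors' implementation, bags the ridge-regression estimator of marker effects at a fixed λ:

```r
# Sketch of bagging at the level of marker effects (bagged ridge regression).
# lambda is taken as given; in practice it is estimated or chosen over a grid.
set.seed(4)
n <- 200; p <- 500
X    <- scale(matrix(rbinom(n * p, 2, 0.3), n, p), scale = FALSE)
beta <- rnorm(p, 0, sqrt(0.02))
y    <- drop(X %*% beta + rnorm(n))
lambda <- 50; B <- 25

bhat_bag <- rep(0, p)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)          # bootstrap rows of D = [y, X]
  Xb <- X[idx, ]; yb <- y[idx]
  bhat_bag <- bhat_bag +
    drop(solve(crossprod(Xb) + diag(lambda, p), crossprod(Xb, yb))) / B
}
# Out-of-sample prediction for a marker matrix X_test: X_test %*% bhat_bag
```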
Results

Simulated case studies

Case Study 1: model training with 200 known QTL and $N_{Train} = 500$. To evaluate the impact of bagging on model training, we simulated 200 QTL with known binary genotypes (as in inbred lines, where genotypes at a given locus are either $aa$ or $AA$). Genotypes were sampled with a frequency of 0.05 at all loci, and their effects were drawn independently from a $N(0, 1/2)$ distribution; sample size was $N_{Train} = 500$, so all QTL effects were likelihood-identified (estimable) in the regression model described below. Any resulting disequilibrium was due to finite sample size only. Phenotypes were formed by summing the product of QTL genotype codes (0,1) times their corresponding effects over the 200 QTL and then adding a random residual drawn from $N(0,10)$; the effective heritability (variance among simulated realized genetic values as a fraction of the total variance) attained in the simulation was about 0.34. The true model was employed in the training process using the true variance ratio $\lambda = 20$, and the effect of the number of bootstrap samples ($B$), each of size $N_{Train} = 500$, on the bagged predictor was examined by taking $B = 50$, 200 and 500. While $B = 25$ or 50 is often adequate [31], a larger number of bootstrap samples is not harmful.

Regressions of the 200 elements of $q$ on their ordinary least-squares estimates (OLS), on BLUP-ridge regression ($\lambda = 20$) and on a bagged mean, obtained by averaging over bootstrap sample estimates of the BLUPs of $q$, were calculated. This was also done for the regression of the 500 simulated genetic signals ($g = Qq$) on GBLUP and on the average of the GBLUP bootstrap samples, $\hat g_{Bagged}$. Table 1 presents results. As expected from BLUP theory [10], the regressions of either $q$ or $g$ on their corresponding BLUPs were near their expected value: 1. By construction, $\mathrm{Var}[\mathrm{BLUP}(q)] = \mathrm{Cov}(q, \mathrm{BLUP}(q))$, so the expected regression is necessarily 1. For QTL effects, OLS, even though it is an unbiased estimator of $q$, produced a regression of about 0.4. The bagging procedure produced regressions that exceeded 1, both at the level of the QTL effects and of the genetic signal $g$. From the point of view of goodness of fit to the data, ridge regression and bagging produced models that accounted for more variation of the training data than OLS: for OLS, $R^2$ was about 0.49, whereas it ranged between 0.51 and 0.53 for bagging and ridge regression. Increasing the number of bootstrap copies in the bag increased $R^2$ mildly, with no sizable gain resulting from increasing $B$ from 200 to 500. The regressions of $g$ on $\hat g_{Bagged}$ were larger than 1, but the bagged means accounted for the same proportion of variation in true $g$ values as $\hat g_{GBLUP}$ did, with $R^2 = 0.63$ for the latter and for bagged GBLUP with $B = 200$ or 500.

Figure 1 (left panel) gives the distribution of estimates of the 200 QTL effects within each of 4 bootstrap samples for the run with $B = 500$, as well as within the vector of bagged means. As expected, bagging produced less variability among individual effect estimates, as extreme values are tempered by the averaging procedure. The impact of bagging is seen in the right panel of Figure 1: the variance among bagged means of individual effects was much less than the variance within any of the 500 bootstrap copies. Figure 2 (left panel) shows the variance reduction when bagging was applied to the GBLUPs of individuals: the horizontal line at the bottom is the variance among the bagged GBLUPs of the 500 individuals in the training sample. The right panel of Figure 2 shows that bagged GBLUP (BGBLUP) produces an understatement of genetic values relative to what is predicted by GBLUP for individuals ranked as high by the latter, but an overstatement otherwise; however, the two procedures give aligned predictions of genetic values. Bagging regresses extreme GBLUP estimates towards their average, which might attenuate the influence of idiosyncratic samples on which GBLUP is trained. Hence, bagging enhances the shrinkage towards 0 inherent to GBLUP.

Case Study 2: model training with 1000 known QTL and $N_{Train} = 500$. As in case 1, sample size was $N_{Train} = 500$, but 1000 QTL with known binary genotypes and unknown effects were simulated. Here $n$ is smaller than the number of columns of $Q$, so least-squares cannot be used due to lack of estimability. The setting was as in case 1, but effective heritability was 0.68, that is, twice that attained when 200 QTL were simulated. The true model was employed in the training process using the true variance ratio $\lambda = 20$, and the number of bootstrap samples for bagging was $B = 25$, 50 or 100. Results are presented in Table 2. The regressions of $q$ on $\hat q_{Bagged}$ were much larger than 1, and the slope seemingly stabilized at about 1.3 with a bag consisting of 50 bootstrap copies. As expected, the regression of $q$ on $\hat q_{Ridge}$ was near 1. However, the proportion of variation in true $q$ accounted for by the variation in bagged or ridge regression estimates was smaller than in case 1, with $R^2 \approx 0.29$-$0.30$ for bagging and ridge regression versus $R^2 \approx 0.51$-$0.53$ in case 1.

Table 1. Regression coefficients of true QTL effects ($q$) on their ordinary least-squares ($\hat q_{OLS}$), ridge regression BLUP ($\hat q_{Ridge}$) and bagged ridge regression BLUP ($\hat q_{Bagged}$) estimates, and regressions of true genetic signal ($g$) on genomic BLUP ($\hat g_{GBLUP}$) and bagged genomic BLUP ($\hat g_{Bagged}$) at varying numbers of bootstrap samples ($B$).

  Regressions of q on:
    q_OLS:    0.4  (R² = 0.49)
    q_Ridge:  0.98 (R² = 0.53)
    q_Bagged: 1.05 (R² = 0.51) at B = 50;  1.07 (R² = 0.53) at B = 200;  1.08 (R² = 0.53) at B = 500
  Regressions of g on:
    g_GBLUP:  0.99 (R² = 0.63)
    g_Bagged: 1.05 (R² = 0.61) at B = 50;  1.08 (R² = 0.63) at B = 200;  1.08 (R² = 0.63) at B = 500

R² is the coefficient of determination of the regression fitted. The simulation involved a training set of 500 individuals with 200 true additive QTL fitted in the model.

On the other hand, bagged GBLUP accounted for about 73% of the variation of the true genetic values $g$, providing a fit to signal similar to that of GBLUP. The regression of $g$ on $\hat g_{GBLUP}$ was also near its expected value of 1, whereas true differences in genetic values among individuals were exaggerated by a factor of about 1.3 by bagging. The left panel in Figure 3 shows the alignment between 4 randomly chosen bootstrap samples and the 500 GBLUP estimates obtained in $N_{Train}$. The right panel illustrates clearly that bagging ($B = 100$) pulls down individuals with large GBLUP values and pulls up those placed at the left tail of the distribution of GBLUPs.

Case Study 3: model training with 20 unknown QTL, 200 markers and $N_{Train} = 500$. The setting was as in case 1 ($N_{Train} = 500$), but here the genetic signal was generated from $n_q = 20$ QTL with unknown binary genotypes that were in linkage disequilibrium (LD) with $p = 200$ binary markers, as specified below. The true model was $y = Qq + e$, where $q$ is a $20 \times 1$ vector of allelic substitution effects. QTL genotypes were equally frequent at all loci, and their effects were drawn independently from a $N(0, 1/2)$ distribution.

Figure 1. Simulation with 200 known QTL, 500 individuals in the training sample and 500 bootstrap copies for bagging. Left panel: distribution of 200 effects within each of 4 bootstrap samples (1-4), and within the average (bag) of 500 samples (5). Right panel: distribution of variance among 200 effects within each of 500 bootstrap samples and within their average (item 501, flagged with arrow).

Figure 2. Simulation with 200 known QTL, 500 individuals in the training sample and 500 bootstrap copies for bagging. Left panel: variance (VAR) among 500 GBLUPs in each of 500 bootstrap samples. The horizontal line gives the variance among bootstrap (bagged) means for the 500 individuals. Right panel: scatter plot of bagged GBLUP (BGBLUP) versus exact GBLUP for 500 individuals.

Figure 3. Simulation with 1000 known QTL, 500 individuals in the training sample and 100 bootstrap copies for bagging. Left panel: relationship between GBLUP with the entire sample and GBLUPs in 4 bootstrap samples of size 500. Right panel: relationship between bagged GBLUP (BGBLUP) and GBLUP of the 500 individuals.

Phenotypes were formed by summing the product of QTL genotype codes (0,1) times their corresponding effects over the 20 QTL, and adding a random residual drawn from $N(0,10)$. The method employed for training was ridge regression (BLUP) on the 200 markers using the true variance ratio $\lambda = 20$, and the number of bootstrap samples for bagging was $B = 100$. The simulation generated an effective heritability of about 0.2.

Table 2. Regression coefficients of true QTL effects ($q$) on their ridge regression BLUP ($\hat q_{Ridge}$) and bagged ridge regression BLUP ($\hat q_{Bagged}$) estimates, and regressions of true genetic signal ($g$) on genomic BLUP ($\hat g_{GBLUP}$) and bagged genomic BLUP ($\hat g_{Bagged}$) at varying numbers of bootstrap samples ($B$).

  Regressions of q on:
    q_Ridge:  0.97 (R² = 0.30)
    q_Bagged: 1.30 (R² = 0.29) at B = 25;  1.32 (R² = 0.29) at B = 50;  1.32 (R² = 0.29) at B = 100
  Regressions of g on:
    g_GBLUP:  0.99 (R² = 0.73)
    g_Bagged: 1.25 (R² = 0.73) at B = 25;  1.27 (R² = 0.73) at B = 50;  1.27 (R² = 0.73) at B = 100

R² is the coefficient of determination of the regression fitted. The simulation involved a training set of 500 individuals with 1000 true additive QTL fitted in the model.

LD was simulated statistically by introducing correlations among columns of the $Q$ (QTL genotypes) and $X$ (marker genotypes) matrices, respectively. This was done by drawing $N_{Train}$ independent Beta($a$,$b$) random variables corresponding to the rows of these matrices, and then sampling a Bernoulli random variable (i.e., 0 for $aa$ and 1 for $AA$, say) with probability of success given by the draw from the Beta distribution. Thus, columns of matrices $Q$ or $X$ (with entries 0 or 1) had a beta-binomial distribution (e.g., Casella and George, 1992) and an expected correlation equal to $1/(a+b+1)$; employing $a = 0.30$ and $b = 0.70$, a correlation equal to $1/2$ would be expected.
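A sketch of this device follows; the values of a, b and the block sizes match the description above, while everything else (including the marginal frequency of the independent columns) is illustrative.

```r
# Sketch of the beta-binomial device for inducing LD among binary genotype
# columns: one Beta(a, b) draw per individual serves as the Bernoulli success
# probability for every column of a correlated block, so two block columns
# have expected correlation 1 / (a + b + 1) = 1/2 for a = 0.3, b = 0.7.
set.seed(5)
nTrain <- 500; a <- 0.3; b <- 0.7
theta  <- rbeta(nTrain, a, b)             # row-specific success probabilities

ld_block <- function(ncols) {             # mutually correlated binary columns
  matrix(rbinom(nTrain * ncols, 1, rep(theta, ncols)), nrow = nTrain)
}
le_block <- function(ncols) {             # independent columns (LE)
  matrix(rbinom(nTrain * ncols, 1, 0.5), nrow = nTrain)
}
Q <- cbind(ld_block(10),  le_block(10))   # 20 QTL: first 10 in mutual LD
X <- cbind(ld_block(100), le_block(100))  # 200 markers: first 100 in mutual LD
round(cor(Q[, 1], Q[, 2]), 2)             # near 0.5 within the LD block
```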

Using this approach, the first 10 QTL were in LD among themselves as well as in LD with the first 100 markers; these markers were in mutual LD themselves, with a correlation of 1/2 also. On the other hand, QTL 11-20 were in mutual linkage equilibrium, as well as with all other markers. To illustrate, in the sample simulated, realized genotypes at QTL 1 and 2 had a correlation of 0.48, whereas QTL 1 and 15 had a correlation of 0.02. Likewise, the correlation between markers 1 and 2 was 0.50, that between markers 1 and 200 was 0.04, and QTL 2 was correlated with marker 105. QTL genotypes at each locus were multiple-regressed on the 200 marker genotypes: for QTL 1-10 (in LD with markers), the $R^2$ of the regression of QTL genotypes on the 200 marker genotypes was about 0.70 or larger. For the 10 QTL in LE with markers, $R^2$ fluctuated around a lower value. This last result illustrates that even a null association accounts for some variation, merely because the likelihood increases monotonically with model complexity (in this case there are 200 partial regressions of each QTL genotype on markers). The regression of phenotypes on QTL genotypes or on markers had an $R^2$ at 0.42 and 0.43, respectively, with the latter being larger simply because more parameters are fitted in the model. On the other hand, the squared correlation between true signal and fitted values was 0.87 for the QTL model versus 0.15 for the marker-based model: even though markers captured more variation (because of higher model complexity) than a regression on true genotypes, their ability to capture signal was much lower.

We measured similarity among individuals by constructing genomic correlation matrices. This was done by centering both $Q$ and $X$, calculating $Q_{Centered}Q'_{Centered}$ and $X_{Centered}X'_{Centered}$, and converting these into correlation matrices $R_Q$ and $R_M$, respectively. A plot of the off-diagonal elements of $R_M$ versus those of $R_Q$ is shown in Figure 4 for the corresponding pairs of individuals. Although there is an association between genomic correlations at the QTL and at the marker levels, the latter were smaller in absolute value. This association was not perfect: at a given level of correlation at the QTL level, there was much variation in relationships when measured by markers. Implications of this for accuracy of genome-enabled prediction are discussed in [7].

GBLUP and BGBLUP were calculated at each of $\lambda$, $2\lambda$ and $3\lambda$ (with $\lambda = 20$) as variance ratio, to investigate the impact of regularization on the ability to capture the true signal, $Qq$. The regression of signal on GBLUP was 0.48, 0.54 and 0.60 for the three values of the regularization parameter, respectively, whereas that for BGBLUP was 0.50, 0.59 and 0.65, respectively. This illustrates that the regression of true values on GBLUP predictions is not 1 when the model is incorrect, as is the case when markers are not QTL, as simulated here. The corresponding $R^2$ values were 0.25, 0.27 and 0.28 for GBLUP, and 0.24, 0.28 and 0.30 for BGBLUP.
While $\lambda = 20$ is the correct regularization to be exerted at the QTL level, this is not so for the model based on markers, where a stronger degree of regularization is needed (the number of markers was larger than the number of QTL). An approximation is that the correct variance ratio for the marker-based model should be $p/n_q = 10$ times larger than $\lambda$, where $n_q = 20$ is the number of QTL. We found (results not shown) that the regressions of signal on GBLUP and BGBLUP increased as larger values of the regularization parameter were applied to the marker-based model, with the optimum being near $10\lambda$, as expected.

Figure 4. Simulation with 20 unknown QTL, 200 markers and 500 individuals: genomic correlations among 500 individuals at QTL or marker loci.

Case Study 4: predictive cross-validation with 100 unknown QTL and 500 markers. The simulation posed 100 unknown QTL whose additive effects were drawn from a $N(0,1.5)$ distribution, and 500 individuals genotyped for 500 markers. The LD structure was similar to that of case 3: the first 50 QTL were in mutual linkage disequilibrium as well as in LD with the first 50 markers; the remaining QTL were in LE among themselves as well as with all markers; the first 50 markers were in mutual LD and the remaining markers were in LE. Residuals were drawn from $N(0,10)$. The 500 individuals were distributed at random into two non-overlapping training and testing sets with 250 members in each. This was done 100 times at random, producing 100 training-testing pairs enabling estimation of the cross-validation distribution. GBLUP and BGBLUP were fitted to the training data ($N_{Train} = 250$) for each of 13 values of the regularization parameter $\lambda$ in the sequence

$$\lambda_{Seq} = \left\{\tfrac{1}{16}, \tfrac{1}{8}, \tfrac{1}{4}, \tfrac{1}{2}, 1, 2, 4, 8, 16, 32, 64, 128, 256\right\}\lambda_{\text{"True"}}, \tag{16}$$

where $\lambda_{\text{"True"}} = \sigma^2_e/\sigma^2_b$ with $\sigma^2_b \approx \sigma^2_q n_q/p$. In each training instance, BGBLUP was implemented with $B = 25$ bootstrap samples of size $N_{Train}$ drawn from the training set of size 250. Three metrics were used to evaluate the two prediction methods: goodness of fit in the training set (correlation between fitted and observed values), predictive correlation (between predicted phenotypes in the testing set and realized values) and predictive mean squared error, that is, the average squared difference between predicted and realized values over the $N_{Test} = 250$ cases in the testing set.
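The grid in (16) and the repeated random partitions can be set up in a few lines; the scaffold below is our sketch (it uses the "true" λ of 400 referred to later in the text and omits the model fitting itself):

```r
# Scaffold for the cross-validation design of case study 4 (illustrative).
lambda_true <- 400                        # the "true" variance ratio used later
lambda_seq  <- c(1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16, 32, 64, 128, 256) *
               lambda_true                # the 13-point grid of equation (16)

n <- 500; n_train <- 250; reps <- 100
set.seed(7)
partitions <- replicate(reps, sample(n, n_train))  # training rows per replicate
# For each replicate and each lambda: fit GBLUP and BGBLUP (B = 25) on the 250
# training cases, then record training correlation, predictive correlation and
# predictive MSE on the 250 held-out cases.
```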

The design just described involved 13 BGBLUP and GBLUP implementations in each of the 100 random cross-validations, for a total of 1,300 comparisons. Overall (Figure 5), BGBLUP attained a better predictive performance than GBLUP, because predictive correlations were typically larger (in some cases more than twice as large as for GBLUP) and mean squared errors of prediction were lower as well. Thus, BGBLUP was more reliable (larger correlation) and more accurate (smaller mean squared error) than GBLUP. The superiority of BGBLUP over GBLUP became smaller when regularization was stronger over the grid defined by $\lambda_{Seq}$. However, it was not until $\log_{10}\lambda_{Seq}$ became 5 or more times larger than $\log_{10}\lambda_{\text{"True"}}$ that GBLUP caught up with bagging, and it was never better. Actually, it was not until the $\lambda$ value used for model training was 32 times larger than $\lambda_{True}$ that the two predictors delivered the same performance, suggesting a robustness property of BGBLUP that GBLUP seems to lack. Model complexity is reduced as $\lambda$ increases (the effective number of parameters decreases), so the variance of GBLUP decreases as well, in which case bagging offers little help. Contrary to what was stated by [7], we did not find evidence that bagging damaged predictive performance under any of the regularization regimes entertained. Results for selected settings are discussed in the following paragraphs.

Figure 6 shows training and predictive correlations and mean squared errors for BGBLUP (y-axis) and GBLUP (x-axis) at two levels of under-regularization: $\lambda = \lambda_{True}/16$ and $\lambda = \lambda_{True}/2$. Here, where shrinkage is less than it should be given the model, BGBLUP reduced overfitting (smaller training set correlations), increased predictive correlations and reduced mean squared errors, relative to GBLUP. Differences were marked: in the upper (lower) middle panels of Figure 6 it is seen that the largest predictive correlation attained by GBLUP was smaller than 0.30 (0.39), and that bagging increased the corresponding correlation to about 0.38 (0.41). The effect of bagging on reducing predictive mean squared error was also clear.

Figure 7 presents results when the true value of $\lambda$ (400) was used for training. GBLUP was again close to overfitting, and BGBLUP reduced training correlations, thus tempering the problem. Predictively, BGBLUP was better than GBLUP in all 100 comparisons, both in the correlation and MSE senses. Over the 100 cross-validation runs, the median predictive correlation was 0.28 for GBLUP and 0.330 for BGBLUP, and the median MSE of prediction was 38.47 for GBLUP and 35.01 for BGBLUP. Comparing the medians of the distributions, BGBLUP enhanced the predictive correlation by 17%, and its MSE was 91% of that of GBLUP.

Figure 8 depicts what was found when excessive shrinkage was applied in training: the regularization parameter values were $8\lambda_{True}$ (upper panel) and $16\lambda_{True}$ (lower panel). BGBLUP was only marginally better than GBLUP. Strong shrinkage rendered the training model exceedingly simple, so both methods delivered similar predictive ability. Differences between methods vanished when $256\lambda_{True}$ was used as shrinkage parameter.

We repeated the experiment with the setting of case study 4, but increasing training and testing sample sizes to 2,500 each. Here, training sample size was 5 times larger than the number of markers: no differences between BGBLUP and GBLUP were found at any level of regularization. The reason is that $n$ now exceeds $p$, making GBLUP fairly stable, in which case the variance reduction property of BGBLUP does not help much.

In summary, in our simulations BGBLUP was typically better than GBLUP at most points of the regularization grid considered. Its performance was better than that of GBLUP at values close to optimal regularization, and differences were large when shrinkage was small, because bootstrap sampling with averaging reduced variance. If $\lambda$ is small, GBLUP tends to overfit and to be variable, but BGBLUP alleviates these problems. In addition to helping with overfitting, BGBLUP was robust with respect to departures from optimal regularization, e.g., to errors in the variance ratio. The experiment with a much larger training sample size than the number of markers indicated that the performance of bagging depends on $p/N_{Train}$: we conjecture that, as this ratio increases (as will surely be the case with DNA sequence data), use of BGBLUP may enhance predictive ability in some real data situations, simply because overfitting and collinearity will be exacerbated by introducing a massive number of variates in the model.

Analysis of wheat data

The data set is available in the BLR package in R (http://cran.r-project.org/web/packages/BLR/index.html) and has been used by, e.g., [17],[42-43]. The data represent 599 wheat inbred lines, each genotyped with 1,279 DArT (Diversity Array Technology) markers and planted in 4 distinct environments. DArT markers may take one of two values, denoting presence or absence of an allele. Records came from several international trials conducted at the International Maize and Wheat Improvement Center (CIMMYT), Mexico. The trait considered was average grain yield for each line in each of the four environments. This response variable is an average over a balanced set of plots and replicates and, within environment, the residual variance is expected to be constant over lines.
The reason is that n now exceeds p, making GBLUP fairly stable, in which case the variance reduction property of BGBLUP does not help much. In summary, in our simulations BGBLUP was typically better than GBLUP at most points of the regularization grid considered. Its performance was better than that of GBLUP at values close to optimal regularization, and differences were large when shrinkage was small, because bootstrap sampling with averaging reduced variance. If l is small, GBLUP tends to overfit and to be variable but BGBLUP alleviates these problems. In addition to helping with overfitting, BGBLUP was robust with respect to departures from optimal regularization, e.g., to errors in the variance ratio. The experiment with a much larger training sample size than the number of markers indicated that the performance of bagging depends on p=n Train : we conjecture that as this ratio increases (as it will surely be the case with DNA sequence data) use of BGBLUP may enhance predictive ability in some real data situations, simply because overfitting and colinearity will be exacerbated by introducing a massive number of variates in the model. Analysis of wheat data The data set is at BLR/index.html, in the BLR package in R, and has been used by, e.g., [17],[4 43]. The data represents 599 wheat inbred lines each genotyped with 179 DArT (Diversity Array Technology) markers, and planted in 4 distinct environments. DArT markers may take on one of two values, denoting presence or absence of an allele. Records came from several international trials conducted at the International Maize and Wheat Improvement Center (CIMMYT), Mexico. The trait considered was average grain yield for each line in each of the four environments. This response variable is an average over a balanced set of plots and replicates and, within environment, the residual variance is expected to be constant over lines. To provide proof of concept, we used a ridge-regression BLUP model with 600 randomly chosen markers. All markers were not employed in order to facilitate calculations, since matrix inversion was used for every bootstrap sample and crossvalidation. With a large number of markers or individuals or for routine application, GBLUP can be computed in using iterative algorithms, but this is a numerical, as opposed to conceptual, issue. Further, N Train ~449 and N Test ~599{449~150; the model was trained using a grid with 10 values of l: 5, 10, 50, 100, 150, 00, 50, 300, 350 and 400. All lines were present in each partition of the data into the two disjoint training and testing sets, with the process repeated at random 100 times, to estimate the crossvalidation distribution. Each implementation of BGBLUP used 5 bootstrap samples and performance was evaluated via predictive correlation, predictive mean squared error and mean absolute deviation between predicted and realized phenotypes. We found (results not shown) that the optimum l was around in environments 1 and with stronger regularization needed in environments 3 and 4. BGBLUP was slightly better than GBLUP in environment 1 near optimum regularization, and clearly better at the lower values of l because GBLUP nearly overfitted; a similar picture emerged in environment. Overall, BGBLUP was better than GBLUP when l was below the optimum, sometimes slightly better at the optimum, with no difference if regularization was excessive, due to the fact that the two models were rendered effectively simpler as l grew. 
The upper and lower panels of Figure 9 give results for environments 3 and 4, respectively; results for environments 1 and 2 were similar. As found with the simulated data, over the 100 repetitions × 10 values of $\lambda$ = 1,000 comparisons, BGBLUP tended to produce larger predictive correlations and smaller mean squared errors than GBLUP. However, this superiority was not uniform and depended on whether the model was under- or over-regularized, with BGBLUP having slightly better performance in the former situation but slightly worse in the latter.

Figure 5. Simulation with 100 unknown QTL, 500 markers and 250 individuals in each training and testing set: training correlations, predictive correlations and predictive mean squared error (MSE) for 1,300 comparisons between bagged GBLUP (BGBLUP, 25 bootstrap samples) and GBLUP.

This is illustrated for environment 4 in the upper left panel of Figure 10, where average differences between BGBLUP and GBLUP over the grid of lambda values are shown for predictive correlations, predictive mean squared error and average absolute difference between predicted and realized values. The other panels of Figure 10 indicate that, over the 100 cross-validations, BGBLUP was better at low values of $\lambda$, slightly better at near-optimum values of $\lambda$ and mildly worse when regularization was extreme ($\lambda = 400$). Hence, it would seem that BGBLUP performs at least as well as GBLUP unless a gross error is made in assessing the value of the regularization parameter in model training. Such a large error is unlikely if the variances are estimated from training data (unless the sample is small) or evaluated over a grid of suitable candidate values. One should be cautious about elicitations of the regularization parameter based on simple theoretical arguments that may not hold.

Discussion

We examined whether or not bootstrap sampling in the context of GBLUP can enhance predictive ability in cross-validation. Simulations (with known or unknown QTL) and a wheat data set with grain yield information were used for this purpose. In the simulations, it was found that bagging BLUP estimates of marker effects or of genomic signal increased the slope of the regression of true marker effect or marked breeding value on the predictor, relative to what is expected under BLUP theory. When an individual was evaluated as extreme by GBLUP, bagging made the estimate less extreme. If the linear model entertained holds, the regression of true signal on GBLUP is expected to be 1, but the regression on BGBLUP is steeper, because the latter has smaller variance. This is easy to see: if $T$ is a predictand and $\hat T$ is its BLUP, then $\mathrm{Cov}(T, \hat T) = \mathrm{Var}(\hat T)$, so the slope of the regression of $T$ on $\hat T$ is one. Now, if $\hat T_{Bagged}$ is the average of $B$ bootstrap copies of $\hat T$, the variance of $\hat T_{Bagged}$ is about $B$ times smaller (assuming samples are mildly correlated) than that of $\hat T$, but $\mathrm{Cov}(T, \hat T) = \mathrm{Cov}(T, \hat T_{Bagged})$. Hence, the regression must be larger than 1.
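In symbols, and under the stated assumption that averaging leaves the covariance with the predictand unchanged while shrinking the variance, the argument condenses to

$$\frac{\mathrm{Cov}(T, \hat T_{Bagged})}{\mathrm{Var}(\hat T_{Bagged})} = \frac{\mathrm{Var}(\hat T)}{\mathrm{Var}(\hat T_{Bagged})} > \frac{\mathrm{Var}(\hat T)}{\mathrm{Var}(\hat T)} = 1, \qquad \text{since } \mathrm{Var}(\hat T_{Bagged}) < \mathrm{Var}(\hat T).$$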

Figure 6. Simulation with 100 unknown QTL, 500 markers and 250 individuals in each training and testing set: training correlations, predictive correlations and predictive mean squared error (MSE) for 100 comparisons between bagged GBLUP (BGBLUP, 25 bootstrap samples) and GBLUP at two levels of under-regularization.

It was also found that bagging conferred robustness to GBLUP, because it is less prone to over-fitting and often delivered better predictions in terms of correlation and mean squared error even when regularization was optimal. At least in simulated data, BGBLUP was not inferior to GBLUP when shrinkage was beyond what it should be.

Bagging allows estimating a cross-validation prediction error mean squared error for each subject tested. In theory, given training data, GBLUP in a testing set can be computed as

$$\hat g_{Train} = G_{Train}\left(G_{Train} + I_{N_{Train}}\frac{\sigma^2_e}{\sigma^2_b}\right)^{-1} y = \left(I_{N_{Train}} + G_{Train}^{-1}\lambda_{Optimum}\right)^{-1} y,$$

with $\lambda_{Optimum}$ being some optimum value of the regularization parameter. If the problem is that of predicting a future set of records $y_{Test}$ on the same individuals, the variance-covariance matrix of prediction errors (under normality assumptions) is

$$\mathrm{Var}(y_{Test}\,|\,y_{Train}) = \mathrm{Var}(g\,|\,y_{Train}) + I\sigma^2_e.$$

Given that the assumptions hold, there are two sources of uncertainty here. The first is uncertainty about signal (breeding value) given training data, and the second is noise variability associated with the yet-to-be-realized observations. The model-derived (expected) reliabilities of the predicted genomic values from the training data are given by the diagonals of the corresponding reliability matrix.


More information

GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL)

GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL) GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL) Paulino Pérez 1 José Crossa 2 1 ColPos-México 2 CIMMyT-México September, 2014. SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions

More information

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Lee H. Dicker Rutgers University and Amazon, NYC Based on joint work with Ruijun Ma (Rutgers),

More information

(Genome-wide) association analysis

(Genome-wide) association analysis (Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by

More information

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to 1 1 1 1 1 1 1 1 0 SUPPLEMENTARY MATERIALS, B. BIVARIATE PEDIGREE-BASED ASSOCIATION ANALYSIS Introduction We propose here a statistical method of bivariate genetic analysis, designed to evaluate contribution

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature25973 Power Simulations We performed extensive power simulations to demonstrate that the analyses carried out in our study are well powered. Our simulations indicate very high power for

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances Advances in Decision Sciences Volume 211, Article ID 74858, 8 pages doi:1.1155/211/74858 Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances David Allingham 1 andj.c.w.rayner

More information

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Xiaoquan Wen Department of Biostatistics, University of Michigan A Model

More information

Best Linear Unbiased Prediction: an Illustration Based on, but Not Limited to, Shelf Life Estimation

Best Linear Unbiased Prediction: an Illustration Based on, but Not Limited to, Shelf Life Estimation Libraries Conference on Applied Statistics in Agriculture 015-7th Annual Conference Proceedings Best Linear Unbiased Prediction: an Illustration Based on, but Not Limited to, Shelf Life Estimation Maryna

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Lecture 2: Linear and Mixed Models

Lecture 2: Linear and Mixed Models Lecture 2: Linear and Mixed Models Bruce Walsh lecture notes Introduction to Mixed Models SISG, Seattle 18 20 July 2018 1 Quick Review of the Major Points The general linear model can be written as y =

More information

Should genetic groups be fitted in BLUP evaluation? Practical answer for the French AI beef sire evaluation

Should genetic groups be fitted in BLUP evaluation? Practical answer for the French AI beef sire evaluation Genet. Sel. Evol. 36 (2004) 325 345 325 c INRA, EDP Sciences, 2004 DOI: 10.1051/gse:2004004 Original article Should genetic groups be fitted in BLUP evaluation? Practical answer for the French AI beef

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics: Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 20: Epistasis and Alternative Tests in GWAS Jason Mezey jgm45@cornell.edu April 16, 2016 (Th) 8:40-9:55 None Announcements Summary

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

APPENDIX 1 BASIC STATISTICS. Summarizing Data

APPENDIX 1 BASIC STATISTICS. Summarizing Data 1 APPENDIX 1 Figure A1.1: Normal Distribution BASIC STATISTICS The problem that we face in financial analysis today is not having too little information but too much. Making sense of large and often contradictory

More information

Estimation of Parameters in Random. Effect Models with Incidence Matrix. Uncertainty

Estimation of Parameters in Random. Effect Models with Incidence Matrix. Uncertainty Estimation of Parameters in Random Effect Models with Incidence Matrix Uncertainty Xia Shen 1,2 and Lars Rönnegård 2,3 1 The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden; 2 School

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

STAT 536: Genetic Statistics

STAT 536: Genetic Statistics STAT 536: Genetic Statistics Tests for Hardy Weinberg Equilibrium Karin S. Dorman Department of Statistics Iowa State University September 7, 2006 Statistical Hypothesis Testing Identify a hypothesis,

More information

Lecture 2. Fisher s Variance Decomposition

Lecture 2. Fisher s Variance Decomposition Lecture Fisher s Variance Decomposition Bruce Walsh. June 008. Summer Institute on Statistical Genetics, Seattle Covariances and Regressions Quantitative genetics requires measures of variation and association.

More information

Model Building: Selected Case Studies

Model Building: Selected Case Studies Chapter 2 Model Building: Selected Case Studies The goal of Chapter 2 is to illustrate the basic process in a variety of selfcontained situations where the process of model building can be well illustrated

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

arxiv: v1 [stat.me] 10 Jun 2018

arxiv: v1 [stat.me] 10 Jun 2018 Lost in translation: On the impact of data coding on penalized regression with interactions arxiv:1806.03729v1 [stat.me] 10 Jun 2018 Johannes W R Martini 1,2 Francisco Rosales 3 Ngoc-Thuy Ha 2 Thomas Kneib

More information

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics 1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More information

Distinctive aspects of non-parametric fitting

Distinctive aspects of non-parametric fitting 5. Introduction to nonparametric curve fitting: Loess, kernel regression, reproducing kernel methods, neural networks Distinctive aspects of non-parametric fitting Objectives: investigate patterns free

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

A simple method to separate base population and segregation effects in genomic relationship matrices

A simple method to separate base population and segregation effects in genomic relationship matrices Plieschke et al. Genetics Selection Evolution (2015) 47:53 DOI 10.1186/s12711-015-0130-8 Genetics Selection Evolution RESEARCH ARTICLE Open Access A simple method to separate base population and segregation

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Evolutionary quantitative genetics and one-locus population genetics

Evolutionary quantitative genetics and one-locus population genetics Evolutionary quantitative genetics and one-locus population genetics READING: Hedrick pp. 57 63, 587 596 Most evolutionary problems involve questions about phenotypic means Goal: determine how selection

More information

The Admixture Model in Linkage Analysis

The Admixture Model in Linkage Analysis The Admixture Model in Linkage Analysis Jie Peng D. Siegmund Department of Statistics, Stanford University, Stanford, CA 94305 SUMMARY We study an appropriate version of the score statistic to test the

More information

BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu February 12, 2015 Lecture 3:

More information

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical

More information

Case-Control Association Testing. Case-Control Association Testing

Case-Control Association Testing. Case-Control Association Testing Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits. Technological advances have made it feasible to perform case-control association studies

More information

A nonparametric two-sample wald test of equality of variances

A nonparametric two-sample wald test of equality of variances University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 211 A nonparametric two-sample wald test of equality of variances David

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies Statistical Power of Model Selection Strategies for Genome-Wide Association Studies Zheyang Wu 1, Hongyu Zhao 1,2 * 1 Department of Epidemiology and Public Health, Yale University School of Medicine, New

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Expression arrays, normalization, and error models

Expression arrays, normalization, and error models 1 Epression arrays, normalization, and error models There are a number of different array technologies available for measuring mrna transcript levels in cell populations, from spotted cdna arrays to in

More information

BAYESIAN GENOMIC PREDICTION WITH GENOTYPE ENVIRONMENT INTERACTION KERNEL MODELS. Universidad de Quintana Roo, Chetumal, Quintana Roo, México.

BAYESIAN GENOMIC PREDICTION WITH GENOTYPE ENVIRONMENT INTERACTION KERNEL MODELS. Universidad de Quintana Roo, Chetumal, Quintana Roo, México. G3: Genes Genomes Genetics Early Online, published on October 28, 2016 as doi:10.1534/g3.116.035584 1 BAYESIAN GENOMIC PREDICTION WITH GENOTYPE ENVIRONMENT INTERACTION KERNEL MODELS Jaime Cuevas 1, José

More information

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013 Lecture 9 Short-Term Selection Response: Breeder s equation Bruce Walsh lecture notes Synbreed course version 3 July 2013 1 Response to Selection Selection can change the distribution of phenotypes, and

More information

Chapter 5 Prediction of Random Variables

Chapter 5 Prediction of Random Variables Chapter 5 Prediction of Random Variables C R Henderson 1984 - Guelph We have discussed estimation of β, regarded as fixed Now we shall consider a rather different problem, prediction of random variables,

More information

Chapter 11 MIVQUE of Variances and Covariances

Chapter 11 MIVQUE of Variances and Covariances Chapter 11 MIVQUE of Variances and Covariances C R Henderson 1984 - Guelph The methods described in Chapter 10 for estimation of variances are quadratic, translation invariant, and unbiased For the balanced

More information

Lecture 3. Introduction on Quantitative Genetics: I. Fisher s Variance Decomposition

Lecture 3. Introduction on Quantitative Genetics: I. Fisher s Variance Decomposition Lecture 3 Introduction on Quantitative Genetics: I Fisher s Variance Decomposition Bruce Walsh. Aug 004. Royal Veterinary and Agricultural University, Denmark Contribution of a Locus to the Phenotypic

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Variance Components: Phenotypic, Environmental and Genetic

Variance Components: Phenotypic, Environmental and Genetic Variance Components: Phenotypic, Environmental and Genetic You should keep in mind that the Simplified Model for Polygenic Traits presented above is very simplified. In many cases, polygenic or quantitative

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. OEB 242 Exam Practice Problems Answer Key Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. First, recall

More information

Evolution of quantitative traits

Evolution of quantitative traits Evolution of quantitative traits Introduction Let s stop and review quickly where we ve come and where we re going We started our survey of quantitative genetics by pointing out that our objective was

More information

Repeated Records Animal Model

Repeated Records Animal Model Repeated Records Animal Model 1 Introduction Animals are observed more than once for some traits, such as Fleece weight of sheep in different years. Calf records of a beef cow over time. Test day records

More information

The Wright-Fisher Model and Genetic Drift

The Wright-Fisher Model and Genetic Drift The Wright-Fisher Model and Genetic Drift January 22, 2015 1 1 Hardy-Weinberg Equilibrium Our goal is to understand the dynamics of allele and genotype frequencies in an infinite, randomlymating population

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

2 Inference for Multinomial Distribution

2 Inference for Multinomial Distribution Markov Chain Monte Carlo Methods Part III: Statistical Concepts By K.B.Athreya, Mohan Delampady and T.Krishnan 1 Introduction In parts I and II of this series it was shown how Markov chain Monte Carlo

More information