Bayesian Multilocus Association Models for Prediction and Mapping of Genome-Wide Data
DOCTORAL THESIS IN ANIMAL SCIENCE

Hanni P. Kärkkäinen

ACADEMIC DISSERTATION

To be presented, with the permission of the Faculty of Agriculture and Forestry of the University of Helsinki, for public criticism in the Lecture Hall of Koetilantie 5, Helsinki, on November 15th 2013, at 12 o'clock noon.

Helsinki 2013

DEPARTMENT OF AGRICULTURAL SCIENCES PUBLICATIONS 20
Supervisor: Professor Mikko J. Sillanpää, University of Oulu, Department of Mathematical Sciences, Department of Biology and Biocenter Oulu, P.O. Box 3000, FIN Oulu, Finland

Co-supervisor: Adjunct Professor Jarmo Juga, University of Helsinki, Department of Agricultural Sciences, P.O. Box 27, FIN Helsinki, Finland

Reviewers: Professor Daniel Sorensen, Aarhus University, Department of Molecular Biology and Genetics, P.O. Box 50, DK 8830 Tjele, Denmark; Professor Otso Ovaskainen, University of Helsinki, Department of Biosciences, P.O. Box 56, FIN Helsinki, Finland

Opponent: Senior Researcher Luc Janss, Aarhus University, Department of Molecular Biology and Genetics, P.O. Box 50, DK 8830 Tjele, Denmark

ISBN (Paperback)
ISBN (PDF)
Electronic publication at
Unigrafia, Helsinki 2013
List of original publications

The following original papers are referred to in the text by their Roman numerals.

(I) Kärkkäinen, H. P. and M. J. Sillanpää, 2012 Back to basics for Bayesian model building in genomic selection. Genetics 191:
(II) Kärkkäinen, H. P. and M. J. Sillanpää, 2012 Robustness of Bayesian multilocus association models to cryptic relatedness. Ann. Hum. Genet. 76: Corrected by: Corrigendum. Ann. Hum. Genet. 77:275.
(III) Kärkkäinen, H. P. and M. J. Sillanpää, 2013 Fast genomic predictions via Bayesian G-BLUP and multilocus models of threshold traits including censored Gaussian data. G3 (Bethesda) 3:

The publications have been reprinted with the kind permission of their copyright holders. The contributions of the authors HPK and MJS can be detailed as follows:

I Both authors were involved in the conception and design of the study. HPK derived the fully conditional posterior distributions and the GEM algorithm, implemented the algorithm with Matlab, performed the data analyses and drafted the manuscript. Both authors participated in the interpretation of results and critically revised the manuscript.

II Both authors were involved in the conception and design of the study. HPK derived the fully conditional posterior distributions and the GEM algorithm, implemented the algorithm with Matlab, performed the data analyses and drafted the manuscript. Both authors participated in the interpretation of results and critically revised the manuscript.

III Both authors were involved in the conception and design of the study. HPK derived the fully conditional posterior distributions and the GEM algorithm, implemented the algorithm with Matlab, performed the data analyses and drafted the manuscript. Both authors participated in the interpretation of results and critically revised the manuscript.
Contents

1 Introduction
2 Objectives of the study
3 Hierarchical Bayesian model
  3.1 Gaussian likelihood
  3.2 Shrinkage inducing priors
    Hierarchical formulation of the prior densities
  3.3 Sub-models
    Polygenic component
    Indicator
    Hyperprior
  Student's t vs. Laplace prior
  Bayesian LASSO and its extensions
  Bayesian G-BLUP
  Fully conditional posterior densities
  Threshold model
    Binary response
    Censored Gaussian response
4 Parameter estimation
  Generalized expectation-maximization
  Prior selection in MAP estimation
  GEM-algorithm for a MAP estimate
5 Example analyses
  Data sets
    XIII QTL-MAS Workshop data
    Real pig (Sus scrofa) data
    Human HapMap data
  Discrete and censored data
  Pre-selection of the markers
  Genomic prediction
  Association mapping
  Decision making
  Diagnostics
  Of speed and convergence
6 Conclusions
  Current status
  What have we learned?
  What's next?

Foreword

Genome-wide marker data is used in animal and plant breeding for computing genomic breeding values, and in human genetics for identifying disease susceptibility genes, predicting unobserved phenotypes and assessing disease risks. While the tremendous number of markers available for easy and cost-effective genotyping is an invaluable asset in genetic research and in animal and plant breeding, the ever increasing data sets are placing heavy demands on the statistical analysis methodology. The statistical methods proposed for genomic selection are based either on traditional best linear unbiased prediction (BLUP) or on different Bayesian multilocus association models. In human genetics the most prevalent approach is a single-SNP association model. This thesis consists of three original articles that seek further understanding of the behavior of the different Bayesian multilocus association models and of the instances in which different methods work best, look for connections between the different Bayesian models, and develop a Bayesian multilocus association model framework, along with an efficient parameter estimation machinery, that can be utilized in phenotype prediction, genomic breeding value estimation and quantitative trait locus (QTL) location and effect estimation from a variety of genome-wide data.

1 Introduction

The advent of single nucleotide polymorphisms (SNPs), in conjunction with the utilization of microarray technology in high-throughput genotyping, has exploded the availability of genome-wide sets of molecular markers. Whole-genome SNP chips are available for a wide range of species, including humans, agriculturally important plant and animal species, and genetic model organisms.
In human genetics the common goal of a genome-wide association (GWA) study is to detect disease susceptibility genes, predict unobserved phenotypes, and assess disease risks at the individual level (Lee
et al. 2008; de los Campos et al. 2010). Animal and plant breeders, on the other hand, are mainly interested in estimating genomic breeding values for genomic selection (Eggen 2012; Nakaya and Isobe 2012). Genomic selection refers to marker-assisted selection that uses genome-wide marker information directly in predicting genomic breeding values, rather than first identifying the causal genes (Meuwissen et al. 2001). The basic principle of genomic selection includes a set of individuals, known as the training set or the reference population, with phenotypic records and genotypic information from a whole-genome SNP array, and a statistical model explaining the connection between the marker genotypes and the phenotypic observations. The training set data is employed in estimating the effects of the SNP markers or genotypes on the phenotype, that is, the parameters of the model. The acquired information is then used in predicting the heritable part of the phenotype, i.e. the genomic breeding value, of new individuals (the prediction set) that have only genotypic information available. In animal and plant breeding, the most commonly used approach to predict genomic breeding values based on molecular markers is genomic best linear unbiased prediction or G-BLUP, a direct descendant of the pedigree-based best linear unbiased prediction (BLUP) model (Henderson 1975). G-BLUP employs the marker information in estimating genomic relationships between the individuals, and utilizes the marker-estimated genomic relationship matrix in a mixed model context (e.g. VanRaden 2008; Powell et al. 2010). A relatively recent but promising contender for the BLUP type of model in the genomic selection field is to apply simultaneous estimation and variable selection or regularization in multilocus association models (e.g. Meuwissen et al. 2001; Xu 2003).
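The G-BLUP principle described above can be illustrated with a small numerical sketch. The thesis implementations were in Matlab; this is an illustrative NumPy version with invented dimensions and a simulated trait, and the residual-to-genetic variance ratio is treated as known rather than estimated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                                    # individuals, SNP markers

# Genotypes coded as rare-allele counts 0/1/2, then column-standardized
X = rng.integers(0, 3, size=(n, p)).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# VanRaden-style genomic relationship matrix estimated from the markers
G = X @ X.T / p

# Simulate a trait controlled by 5 QTL plus noise
beta_true = np.zeros(p)
beta_true[:5] = 1.0
g_true = X @ beta_true                            # true genomic values
y = g_true + rng.normal(0.0, 1.0, size=n)

# G-BLUP of genomic values: u_hat = G (G + lam*I)^-1 (y - ybar),
# where lam = sigma_e^2 / sigma_g^2 is assumed known here
lam = 1.0
u_hat = G @ np.linalg.solve(G + lam * np.eye(n), y - y.mean())
```

Note how the markers enter only through the relationship matrix G, so every marker contributes equally to the covariance structure; this is exactly the constant-impact assumption contrasted with multilocus models below.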
A multilocus association model uses the marker information directly by assigning different, possibly zero, effects to the marker alleles and quantifies the genomic breeding value of an individual as the sum of the marker effects. The advantage of a multilocus association model over G-BLUP is that the former allows the estimated effect size to vary over the set of markers, while the latter assumes a constant impact throughout the genome. In human genetics the genome-wide association methods are mainly used for mapping complex genetic traits. Association mapping utilizes the linkage disequilibrium (LD) between the markers and the causal loci in locating the actual causal genes by searching for associations between the markers and the phenotype. Population-based association analyses are more powerful than within-family analyses in detecting the genetic loci associated with the phenotype of interest. As a drawback, population-based studies often suffer from an inflated rate of false positives due to population stratification (i.e. model misspecification in the presence of hidden population structure) and cryptic relatedness (i.e. model misspecification in the presence of sample structure) (see Kang et al. 2010). For example, if two populations in Hardy-Weinberg proportions with divergent allele frequencies are combined, the combined population may have a large amount of linkage disequilibrium simply due to the combination (e.g. Ewens and Spielman 1995). Equivalently, the sample structure of the data may lead to allelic association caused by close relatedness between the individuals rather than true association between the marker and the trait. As e.g. PLINK (Purcell et al. 2007) omits the sample and population structure from the model, the artificial linkage disequilibrium is likely to cause false positive and negative signals for marker loci without any connection to the studied trait. Although some other heavily used association methods, including e.g. TASSEL (Bradbury et al. 2007), GenABEL (Aulchenko et al. 2007), EMMA (Kang et al. 2008) and EMMAX (Kang et al. 2010), provide a sample structure correction, they consider only one marker at a time, ignoring the possible effects of the other major loci. This is less than ideal in a genome-wide study of a complex trait, as such traits are assumed to be affected by a multitude of genes (Weeks and Lathrop 1995). The problem with a multilocus association model applied to a genome-wide data set is oversaturation: since the number of SNP markers is usually orders of magnitude greater than the number of individuals, there are far more explanatory variables than observations in the model.
This leads to a situation where some kind of selection or regularization of the predictors is required, either by selecting a subset of the variables that explains a large proportion of the variation, by using orthogonal or non-orthogonal combinations of the variables, or by shrinking the effects of the variables towards zero (e.g. Sillanpää and Bhattacharjee 2005; Hoggart et al. 2008; O'Hara and Sillanpää 2009; Wu et al. 2009; Ayers and Cordell 2010; Cho et al. 2010). The appeal of the shrunken estimates is that these methods keep the dimension constant across the possible models by not actually selecting a subset of variables, but instead setting the effects of unimportant ones to (or near) zero. The drawback is that the estimates tend to be biased towards too small values. The methods discarding markers irrelevant to the phenotype are often referred to as variable selection, while the ones assigning a penalty term to shrink the marker effects towards zero are considered variable regularization.
Contrary to the frequentist way of deriving a shrinkage estimator by subtracting a penalty from the gain function (in other words, by adding a penalty to the loss function), in the Bayesian context the regularization mechanism is included in the model by specifying an appropriate prior density for the regression coefficients. A penalized maximum likelihood estimate for the regression coefficients β is acquired by maximizing the penalized gain function

$$\hat{\beta}_{PML} = \arg\max_{\beta} \left[ \log p(\text{data} \mid \beta) - \lambda J(\beta) \right], \quad (1.1)$$

where $\log p(\text{data} \mid \beta)$ is the log likelihood and $J(\beta)$ a penalty function. Commonly used penalty functions are derived from the L2 and L1 norms of the regression coefficients,

$$J(\beta) = \|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2 \quad \text{and} \quad J(\beta) = \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|,$$

leading to Ridge Regression (Hoerl 1962) and LASSO (Tibshirani 1996) estimates, respectively. The frequentist penalty function is connected to the prior density of a Bayesian model, as the exponential of the function maximized in the frequentist method equals the product

$$\exp\left( \log p(\text{data} \mid \beta) - \lambda J(\beta) \right) = p(\text{data} \mid \beta)\, \exp(-\lambda J(\beta)), \quad (1.2)$$

where $p(\text{data} \mid \beta)$ is the likelihood and $\exp(-\lambda J(\beta))$ represents the prior density function. For example, it is easily seen that the Ridge Regression penalty corresponds to a Gaussian prior density, as $\exp(-\lambda \sum_{j=1}^{p} \beta_j^2)$ is the kernel of a Gaussian probability density function. Similarly the L1 penalty corresponds to a double exponential or Laplace density. Although it is clearly more logical to consider the assumptions about model sparseness as part of the model (the prior is part of the model) rather than part of the estimator (a penalty is part of the estimator), the difference may seem trivial in practice.
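The ridge-penalty/Gaussian-prior correspondence can be checked numerically: the closed-form ridge solution and a MAP estimate obtained by maximizing the log posterior directly should coincide. A minimal sketch with simulated data (not from the thesis; the noise variance is fixed at 1 so the penalty weight equals the prior precision):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
lam = 2.0

# Frequentist ridge: minimize ||y - Xb||^2 + lam*||b||^2,
# solved in closed form by the penalized normal equations
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Bayesian MAP: likelihood y | b ~ N(Xb, I), prior b_j ~ N(0, 1/lam).
# The log posterior is -0.5*||y - Xb||^2 - 0.5*lam*||b||^2 + const;
# maximize it by plain gradient ascent.
b_map = np.zeros(p)
for _ in range(5000):
    grad = X.T @ (y - X @ b_map) - lam * b_map   # gradient of log posterior
    b_map += 0.01 * grad
```

Both routes solve the same normal equations $(X'X + \lambda I)\beta = X'y$, which is the practical content of equation (1.2).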
However, the fact that in the Bayesian context the model includes all available information permits the estimator to always be the same, either the whole posterior density or a maximum a posteriori (MAP) point estimate, which in turn enables a straightforward translation of the model into an algorithm. In the Bayesian context variable regularization is included in the model by specifying a spike and slab prior for the regression coefficients, with the spike being the probability mass centered near zero and the slab the probability mass distributed over the nonzero values (see O'Hara and Sillanpää 2009). This prior represents the assumption that only a small proportion of the predictors have a non-negligible effect (the "slab"), while the majority of the effects are close to zero (the "spike"). The Bayesian models proposed in the literature differ with respect to the spike and slab prior densities given for the regression coefficients. The desired shape for the prior density may be acquired either as a mixture of two densities, in which case the model includes a dummy variable indicating whether the effect of a given explanatory variable comes from the spike or from the slab part of the prior, or alternatively a single prior density approximating the spike and slab shape may be assigned directly to the regression coefficients. In the latter case, the probability density functions commonly used for imitating the spike and slab shape are Student's t (e.g. BayesA by Meuwissen et al. 2001; Xu 2003; Yi and Banerjee 2009) and Laplace densities (e.g. Park and Casella 2008; Yi and Xu 2008; de los Campos et al. 2009; Xu 2010; Li et al. 2011). Due to the connection to the frequentist L1 penalty function, the models with a Laplace prior density are commonly denoted as Bayesian LASSO (Park and Casella 2008). Both Student's t and Laplace density functions possess several favorable features, including high kurtosis and heavy tails, that make them worthy candidates for shrinkage inducing priors. Compared to a Gaussian density, these functions place greater probability mass near zero and higher probability on large values, inducing strong shrinkage of intermediate-sized estimate values and proportionally less shrinkage of the large values and the values near zero. While a Gaussian prior density, or equivalently frequentist Ridge Regression, assigns the same penalty to all of the regression coefficients, the heavy-tailed functions produce a clearer distinction between large and small estimate values by pushing the intermediate-sized values in either direction.
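The tail comparison above is easy to verify numerically. The sketch below (illustrative, not from the thesis) compares a Gaussian and a Laplace density matched to the same variance and checks the three claims: more mass near zero, less mass at intermediate values, heavier tails:

```python
import numpy as np

# Laplace(0, b) has variance 2*b^2, so b = 1/sqrt(2) matches N(0, 1)
b = 1.0 / np.sqrt(2.0)

def gauss_pdf(x):
    return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

def laplace_pdf(x):
    return np.exp(-np.abs(x) / b) / (2.0 * b)

# More mass near zero (tiny effects are shrunk hard) ...
near_zero = laplace_pdf(0.0) > gauss_pdf(0.0)
# ... less mass at intermediate values (they get pushed either way) ...
intermediate = laplace_pdf(1.0) < gauss_pdf(1.0)
# ... and heavier tails (large effects are shrunk proportionally less)
tails = laplace_pdf(4.0) > gauss_pdf(4.0)
```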
For this reason the method is sometimes denoted adaptive shrinkage. Several modifications of the indicator-type methods have been introduced, differing with respect to the mixture components (the distributions that are used to form the mixture distribution) set for the regression coefficients and the hierarchical structure of the prior (the dependency between the indicator and the marker effect, and the participation of the indicator in the likelihood). While the stochastic search variable selection (SSVS) models consider the spike and slab as a mixture of two normal distributions (George and McCulloch 1993; Verbyla et al. 2009), or two Student's t distributions (e.g. Yi et al. 2003), the majority of the methods straightforwardly set the regression coefficient to zero when the indicator is zero (so the spike is in fact a point mass located at zero). A prior consisting of a mixture
of a Student's t density and a point mass at zero has been used in several methods, including BayesB (Meuwissen et al. 2001), Hayashi and Iwata (2010) and Habier et al. (2011). A similar mixture based on a Laplace density has been used by Meuwissen et al. (2009) and Shepherd et al. (2010). The simplest hierarchical structure of the prior density, proposed by Kuo and Mallick (1998), determines the effect of marker j on the phenotype as a product of the indicator γ_j and the effect size β_j, and considers these two to be a priori independent. Hence the joint prior of the marker effect γ_j β_j becomes simply p(γ_j, β_j) = p(γ_j) p(β_j), where p(γ_j) is a Bernoulli density with a prior probability for a marker to be linked to the trait and p(β_j) is the Gaussian, Student's t or Laplace prior density given for the effect size. Other types of hierarchical structures presented in the literature include BayesB (Meuwissen et al. 2001), where the marker effect is given by β_j alone, since the likelihood does not include the indicator; instead, the indicator acts through the effect variance. In Gibbs variable selection, on the other hand, the marker effect is considered as a product of the indicator and the effect size, but the prior density of the effect size is dependent on the indicator (Dellaportas et al. 2002). Whether the model is based on a Student's t, Laplace, or a mixture prior density, the intensity of the shrinkage produced by the prior is determined by the prior parameters (i.e. hyperparameters) defining the shape of the prior density function. The models proposed in the literature differ from each other in terms of the procedures they use to determine the prior parameters. In the original BayesA and BayesB the parameters of the Student's t prior density were defined to produce the desired genetic variance (Meuwissen et al. 2001).
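The Kuo and Mallick construction, an a priori independent Bernoulli indicator multiplied by a continuous effect size, can be sketched by sampling from the prior (an illustrative simulation; the inclusion probability and effect distribution are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
p, pi = 100_000, 0.05                  # markers, prior inclusion probability

gamma = rng.binomial(1, pi, size=p)    # indicator: is marker j in the model?
beta = rng.normal(0.0, 1.0, size=p)    # effect size, a priori independent
effect = gamma * beta                  # Kuo-Mallick marker effect gamma_j * beta_j

frac_zero = np.mean(effect == 0.0)     # the "spike": exact zeros
```

About 95% of the prior draws are exactly zero here, which is the point-mass spike; the remaining 5% follow the slab density assigned to β_j.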
The Xu (2003) method is otherwise similar to BayesA, except that the prior parameters are estimated instead of being set to constant values. Similar modifications of BayesB have been considered by e.g. Yi and Xu (2008) and Habier et al. (2011). Under the Bayesian LASSO the prior parameters are more commonly estimated from the data (e.g. Yi and Xu 2008; de los Campos et al. 2009; Sun et al. 2010; Shepherd et al. 2010) than given as constants (Xu 2010). While the Bayesian models have proven workable, efficient and flexible, the tremendous number of markers in modern genome-wide data sets makes the computational methods traditionally connected to Bayesian estimation, e.g. Markov chain Monte Carlo (MCMC), quite slow and cumbersome. For the same models, fast alternative estimation procedures have been proposed, most commonly based on estimation of the maximum point of the posterior density (the MAP estimate), rather than the whole posterior
distribution, by an expectation-maximization (EM) algorithm (Dempster et al. 1977; McLachlan and Krishnan 1997; for the methods see e.g. Yi and Banerjee 2009; Hayashi and Iwata 2010; Figueiredo 2003; Sun et al. 2010; Xu 2010; Meuwissen et al. 2009; Shepherd et al. 2010; Lee et al. 2010).

2 Objectives of the study

The objectives of this work are to 1) better understand the behavior of the different Bayesian multilocus association models, especially in the maximum a posteriori estimation context, and to obtain further information on the instances in which different methods work best, 2) seek connections between the different Bayesian models and try to see the different model variants as special cases or sub-models of a common model framework, 3) pay special attention to the significance of the parametrization and hierarchical structure of the model for elegant derivation and convergence properties of the estimation algorithm, and 4) develop a flexible and versatile Bayesian multilocus association model framework, along with an efficient parameter estimation machinery, that can be utilized in phenotype prediction, genomic breeding value estimation and QTL (quantitative trait locus) detection and effect estimation from a variety of genome-wide data. The original papers I–III contribute to the objectives in the following manner. In I we lay the foundation for our Bayesian model framework, examine the behavior and predictive performance of different sub-models and prior densities, including G-BLUP, and present a generalized expectation-maximization (GEM) algorithm for the parameter estimation. In II we apply selected parts of the model framework in a QTL mapping context and, in particular, consider the impact of an additional polygenic component on the performance of the model and the GEM algorithm. In III we generalize the model framework and the GEM algorithm to ordered categorical and censored Gaussian phenotypes.
3 Hierarchical Bayesian model

In Bayesian inference, learning from data is based on updating the prior belief concerning the model parameters into a posterior belief by applying Bayes' theorem. Let p(Θ) denote the joint prior density of the unknown parameters and p(data | Θ) the likelihood of the data given those parameters. The posterior density of the unknown parameters given the data is acquired from the Bayes formula

$$p(\Theta \mid \text{data}) = \frac{p(\text{data} \mid \Theta)\, p(\Theta)}{p(\text{data})} \propto p(\text{data} \mid \Theta)\, p(\Theta),$$

where the normalizing constant $p(\text{data}) = \int_{\Theta} p(\text{data} \mid \Theta)\, p(\Theta)\, d\Theta$ is the marginal likelihood of the data. As the marginal likelihood has a constant value, it is usually omitted from the computation, and the joint posterior density is considered to be proportional to the product of the likelihood and the joint prior density. In addition to the prior conception of the parameter values, the joint prior density expresses the mutual relationships of the parameters, e.g. whether the parameters are considered a priori independent or conditional on some other parameters. This definition is denoted the hierarchical structure of the Bayesian model. Let e.g. the parameter vector be Θ = (θ_1, θ_2), and let θ_1 be a priori dependent on θ_2. Then the joint prior is given by p(Θ) = p(θ_1 | θ_2) p(θ_2), and the dependent parameter θ_1 is said to be located on a lower layer of the model hierarchy. In its complete form our hierarchical Bayesian model framework, depicted as a directed acyclic graph in Figure 3.1, consists of two separate parts, the linear Gaussian model and the threshold model. Under the linear Gaussian model the phenotype measurements are assumed to be continuous and follow a Gaussian density, while the additional threshold model handles binary, ordinal and censored Gaussian observations. The hierarchical model has a total of six layers, two of which are optional.
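The prior-to-posterior update can be made concrete with the standard normal-normal conjugate example (an illustrative sketch, not a model from the thesis): observations y_i ~ N(θ, s²) with s² known and prior θ ~ N(m0, v0) give a normal posterior whose precision is the sum of the prior and data precisions:

```python
import numpy as np

# Posterior: theta | y ~ N(m1, v1) with
#   v1 = 1 / (1/v0 + n/s2),   m1 = v1 * (m0/v0 + sum(y)/s2)
rng = np.random.default_rng(3)
s2, m0, v0 = 1.0, 0.0, 10.0                 # known noise var, vague prior
y = rng.normal(2.0, np.sqrt(s2), size=50)   # simulated data, true mean 2
n = y.size

v1 = 1.0 / (1.0 / v0 + n / s2)
m1 = v1 * (m0 / v0 + y.sum() / s2)
```

With a vague prior the posterior mean m1 sits essentially at the sample mean, and the posterior variance v1 is much smaller than the prior variance v0: the data have updated the prior belief.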
The observed data, located on the 1st and 2nd layers in the graph, comprises phenotype and genotype information and, optionally, a known pedigree of a sample of related individuals. The continuous Gaussian phenotypes, denoted by a vector y, and the genetic data matrix X, consisting of the genotypes of biallelic SNP markers, are located on the observed data layer of the linear Gaussian model. As the binary, ordinal and censored Gaussian observations are handled via a latent variable parametrization, they are located on the optional observed layer of the threshold model in Figure 3.1. The possible pedigree information is given in the form of an additive genetic relationship matrix (Lange 1997), located on the optional observed layer
Figure 3.1: Hierarchical structure of the model framework. The ellipses represent random parameters and the rectangles fixed values, while the round-cornered rectangles may be either, depending on the selected model. Solid arrows indicate statistical dependency and dashed arrows functional relationship. The background boxes indicate the main modules of the model framework.
in the directed acyclic graph (Figure 3.1) to represent its non-compulsory nature. In the following sections we first consider the linear Gaussian model part, and only after that focus on the threshold model for discrete or censored data.

3.1 Gaussian likelihood

At the center of a Bayesian model is the likelihood function of the data given the model parameters. The likelihood is based on the probability model (sometimes called the sampling model) determining how the dependent variables or traits are connected to the explanatory variables. In our model framework the Gaussian phenotypes are connected to the marker and pedigree information with a linear Gaussian association model (see Figure 3.1)

$$y = \beta_0 + X\Gamma\beta + Zu + \varepsilon, \quad (3.1)$$

where y denotes the phenotypic records of n individuals, β_0 is the population intercept, and ε corresponds to the residuals, assumed normal and independent, ε ~ MVN(0, I_n σ_0²). If necessary, the intercept β_0 can easily be replaced with a vector of environmental variables. The second term on the right-hand side of equation (3.1) comprises the observed genotypes X and the allele substitution effects Γβ. The observed genotypes of the p biallelic SNP markers are coded with respect to the number of rare alleles (0, 1 and 2) and standardized to have zero mean and unit variance. In the complete model the allele substitution effect (see Marker effect in Figure 3.1) is modeled following Kuo and Mallick (1998) as a product of the size of the effect and a variable indicating whether the marker is linked to the phenotype. In equation (3.1), β denotes the additive effect sizes, and Γ is a diagonal matrix of indicator variables, whose jth diagonal element γ_j has value 1 if the jth SNP marker is included in the model, and 0 otherwise. As depicted in Figure 3.1, the indicator and the effect size are considered a priori independent.
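A forward simulation from model (3.1) makes the role of the indicator matrix Γ concrete (an illustrative NumPy sketch with invented dimensions; the optional polygenic term Zu is dropped here):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 60

# Standardized rare-allele counts, as in the text
X = rng.integers(0, 3, size=(n, p)).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)

beta0 = 1.5                             # population intercept
beta = rng.normal(0.0, 1.0, size=p)     # effect sizes beta_j
gamma = np.zeros(p)
gamma[:3] = 1.0                         # only the first 3 markers are linked
Gamma = np.diag(gamma)
eps = rng.normal(0.0, 0.5, size=n)      # residuals ~ N(0, I_n * sigma0^2)

# y = beta0 + X Gamma beta + eps (polygenic term Z u omitted)
y = beta0 + X @ Gamma @ beta + eps
```

Markers with γ_j = 0 contribute nothing to y regardless of their β_j, which is exactly how the point-mass spike removes markers without changing the model dimension.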
The term u in equation (3.1) denotes the additive polygenic effects due to the combined effect of an infinite number of loci, and Z is a design matrix connecting the polygenic effects to the observed phenotypes. The individuals, or their phenotypic values y_i, are assumed conditionally independent given the genotype information X and the polygenic effect u. This assumption and the described linear marker association model (3.1)
gives a multivariate normal likelihood

$$p(y \mid \beta_0, \sigma_0^2, \beta, \Gamma, u, X, Z) \propto \det(I_n \sigma_0^2)^{-1/2} \exp\left( -\tfrac{1}{2} (y - \beta_0 - X\Gamma\beta - Zu)' (I_n \sigma_0^2)^{-1} (y - \beta_0 - X\Gamma\beta - Zu) \right) \quad (3.2)$$

for the phenotypes given the parameter vector. Due to the independence of the observations, the likelihood can also be interpreted as a univariate normal $N(\beta_0 + \sum_{j=1}^{p} \gamma_j \beta_j x_{ij} + u_i,\; \sigma_0^2)$ given a single observation y_i and the appropriate parameters. The parameters of the multilocus association model that are present in the likelihood function are located in the model parameters layer of the linear Gaussian model in Figure 3.1.

3.2 Shrinkage inducing priors

The second essential component of a Bayesian model consists of the prior densities for the model parameters. The prior for a given parameter represents the a priori understanding of the plausibility of different parameter values. In some cases there is no reason to believe that one parameter value would be more plausible than another, a conception expressed with a flat or uninformative prior density, e.g. by setting p(β_0) ∝ 1 and p(σ_0²) ∝ 1/σ_0² (note the noninformative uniform priors at layer 5 in Figure 3.1). In some cases, however, the prior density plays a most important role in the operation of the model. A central feature of handling an oversaturated model is the selection or regularization of the excess predictors. In the Bayesian context the regularization is included in the model by specifying a prior density for the regression coefficients that represents the a priori understanding that the majority of the predictors have only a negligible effect, while there are a few predictors with possibly large effect sizes. A prior that would evince this idea should consist of a probability mass centered near zero and a probability mass distributed over the nonzero values, including a reasonably high probability for large values. The probability density functions we have used for imitating this spike and slab shape are Student's t (following e.g.
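The equivalence between the multivariate form (3.2) with covariance I_n σ_0² and the product of univariate normals is easy to confirm numerically (an illustrative check; the mean vector stands in for β_0 + XΓβ + Zu):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20
mu = rng.normal(size=n)          # stands in for beta0 + X*Gamma*beta + Z*u
s2 = 0.7                         # residual variance sigma0^2
y = mu + rng.normal(0.0, np.sqrt(s2), size=n)

# Multivariate form: y ~ MVN(mu, I_n * s2); det(I_n*s2) = s2^n
ll_mvn = -0.5 * n * np.log(2.0 * np.pi * s2) - 0.5 * np.sum((y - mu)**2) / s2

# Sum of independent univariate normal log densities N(mu_i, s2)
ll_uni = np.sum(-0.5 * np.log(2.0 * np.pi * s2) - 0.5 * (y - mu)**2 / s2)
```

The two log likelihoods agree because the covariance matrix is diagonal, which is exactly the conditional-independence assumption stated above.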
Meuwissen et al. 2001; Xu 2003) and Laplace densities (following e.g. Park and Casella 2008; de los Campos et al. 2009), either alone or combined with a point mass at zero (e.g. Meuwissen et al. 2001; Shepherd et al. 2010). In our full model framework, equation (3.1) and Figure 3.1, the mixture prior with the point mass at zero is accomplished by adding a dummy variable to indicate whether the effect of a given predictor variable is included in
the model or not. Following Kuo and Mallick (1998) the marker effects are modeled as a product of the indicator variable γ_j and the effect size β_j, which are considered a priori independent; hence the joint prior of the marker effect becomes simply p(γ_j, β_j) = p(γ_j) p(β_j), where p(γ_j) is a Bernoulli density with a prior probability π = P(γ_j = 1) for a marker to be linked to the trait and p(β_j) is the prior density for the effect size.

3.2.1 Hierarchical formulation of the prior densities

The Student's t and the Laplace distribution can both be expressed as a scale mixture of normal distributions with a common mean and effect-specific variances. The hierarchical formulation of a Student's t-distribution with ν degrees of freedom, location μ and scale τ² is a scale mixture of normal densities with mean μ and variances following a scaled inverse-χ² distribution with ν degrees of freedom and scale τ²,

$$\left.\begin{aligned} \beta_j \mid \sigma_j^2 &\sim N(\mu, \sigma_j^2) \\ \sigma_j^2 \mid \nu, \tau^2 &\sim \text{Inv-}\chi^2(\nu, \tau^2) \end{aligned}\right\} \;\Longrightarrow\; \beta_j \sim t_\nu(\mu, \tau^2),$$

while a Laplace density with location μ and rate λ can be presented in a similar manner, the mixing distribution now being an exponential one,

$$\left.\begin{aligned} \beta_j \mid \sigma_j^2 &\sim N(\mu, \sigma_j^2) \\ \sigma_j^2 \mid \lambda^2 &\sim \text{Exp}(\lambda^2/2) \end{aligned}\right\} \;\Longrightarrow\; \beta_j \sim \text{Laplace}(\mu, \lambda).$$

The hierarchical representation of the prior densities bears a twofold advantage (I). First, the derivation of the fully conditional posterior densities, and hence the derivation of the estimation algorithm, simplifies greatly. Within the MCMC world, the hierarchical formulation of the prior densities, also known as model or parameter expansion, is a well-known device for simplifying computations by transforming the prior into a conjugate one and thus enabling Gibbs sampling. Conjugacy of a prior distribution means that the fully conditional posterior probability distribution of a given parameter will be of the same type as the prior distribution of that parameter, and hence we are guaranteed to get a closed-form fully conditional posterior with a known probability density function.
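The Laplace case of the scale-mixture representation can be verified numerically by integrating the normal density against the exponential mixing density over the variance (an illustrative check with μ = 0, evaluated at a single point β = 1; a plain Riemann sum suffices):

```python
import numpy as np

# Mixing N(0, s) over s ~ Exp(lam^2 / 2) should give Laplace(0, lam),
# whose density is (lam/2) * exp(-lam * |beta|).
lam, beta = 1.0, 1.0
s = np.linspace(1e-6, 80.0, 400_000)                  # grid over variances
normal_pdf = np.exp(-beta**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
mixing_pdf = (lam**2 / 2.0) * np.exp(-lam**2 * s / 2.0)
marginal = np.sum(normal_pdf * mixing_pdf) * (s[1] - s[0])   # Riemann sum

laplace_val = (lam / 2.0) * np.exp(-lam * abs(beta))
```

The numerically integrated marginal matches the closed-form Laplace density, which is the identity the Gibbs sampler and the GEM algorithm exploit.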
The hierarchical formulation of a prior density is also known to accelerate the convergence of an MCMC sampler by adding more working parts and therefore more space for the random walk to move in (see e.g. Gilks et al. 1996; Gelman et al. 2004; Gelman 2004). In maximum a posteriori (MAP) estimation, on the other hand, a commonly adopted approach to simplifying the model is to integrate out the effect variances. However, the conjugacy maintained by preserving the intermediate variance
layer (layer 4 in Figure 3.1) is a valuable feature also for MAP estimation, as it enables the straightforward derivation of the fully conditional posterior density functions. Expressed as a scale mixture, the Student's t distribution leads to conjugate priors for the normal likelihood parameters, and hence is a perfect choice for a conjugate analysis. Although the decomposition of the Laplace prior does not provide conjugacy, it leads to a tractable fully conditional posterior density for the inverse of the effect variance. Second, the estimation algorithm is likely to behave better under a hierarchical model. Even though the marginal distributions of the marker effects are mathematically equivalent in hierarchical and non-hierarchical models, we noted in I that the parametrization and model structure alter the properties and behavior of the model, and thus influence the mixing and convergence properties of an estimation algorithm, and also the values of the actual estimates. We noted in I that in some cases the hierarchical Laplace model was clearly more accurate than its non-hierarchical counterpart. Also, contrary to the non-hierarchical version, the hierarchical Laplace model worked without the additional indicator variable, i.e. without a zero point mass in the prior of the marker effects. This simplification of the model leads not only to a more straightforward implementation and faster estimation, but also to easier and more accurate selection of the prior parameters.

3.3 Sub-models

As mentioned above, we like to consider the full model in Figure 3.1 as a framework incorporating a set of model variants, or sub-models, embodying different components of the model framework. In I we covered a multitude of such variants, and also showed how the model variants correspond to the Bayesian phenotype prediction and genomic breeding value estimation methods proposed in the literature.
The non-compulsory components of the multilocus association model comprise the polygenic component, the indicator variable, and the optional sixth hyperprior layer. The choice between the Student's t and the Laplace prior densities provides one means of modifying the prior assigned to the marker effects, while the inclusion or exclusion of the indicator and the hyperprior layer provides another. The polygenic component, on the other hand, is clearly an external addition to the multilocus association model.
3.3.1 Polygenic component

The polygenic component u is included in the model to represent genetic variation possibly not captured by the SNP markers and to account for putative residual dependencies between individuals (Yu et al. 2006). The sample or population structure is included in the model through the covariance matrix of the multivariate normal prior density given for the polygenic effect, u | (σ_u², A) ~ MVN(0, σ_u² A), where σ_u² is the polygenic variance component and A is the genetic relationship matrix. The genetic relationship matrix is either a pedigree-based additive genetic relationship matrix (see Lange 1997) (I and II) or, if no pedigree is available, a finite-locus approximation based on the markers not included in the actual multilocus association model, i.e. a genomic relationship matrix (II). The polygenic variance component σ_u² is given an Inv-χ²(ν_u, τ_u²) prior distribution with suitable data-specific parameter values. On the basis of the existing literature the need for an additional polygenic component within a multilocus association model is unclear. Many authors have found the polygenic component irrelevant (e.g. Calus and Veerkamp 2007; Pikkuhookana and Sillanpää 2009), while e.g. de los Campos et al. (2009) and Lund et al. (2009) see it as necessary. In I and II we examined the importance of the additional polygenic component in the genomic selection and association mapping contexts, respectively, with both simulated and real data. In these works the estimates of the polygenic component were negligible, and had no influence on either the prediction accuracy (I) or the gene localization ability (II) of the model. None of the Bayesian multilocus models seemed to benefit from the addition of the polygenic component with either simulated (I and II) or real data (I), the phenotype of the latter most likely being quite polygenic in nature.
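The multivariate normal prior above can be sampled from with a standard Cholesky construction. The sketch below is a toy illustration, not thesis code: the 3 × 3 relationship matrix A and the variance σ_u² are invented values, and the empirical covariance of many draws is checked against σ_u² A:

```python
import numpy as np

# Hypothetical sketch of drawing a polygenic effect u ~ MVN(0, sigma_u^2 * A)
# via the Cholesky factor of the covariance; the 3 x 3 relationship matrix A
# and the variance are invented toy values, not taken from the thesis data.
A = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.25],
              [0.5, 0.25, 1.0]])
sigma_u2 = 0.3

rng = np.random.default_rng(0)
L = np.linalg.cholesky(sigma_u2 * A)        # covariance = L @ L.T
u = L @ rng.standard_normal((3, 100_000))   # many draws to check the prior

emp_cov = np.cov(u)                          # empirical covariance of the draws
print(np.round(emp_cov, 2))                  # close to 0.3 * A
```

The same construction works with a pedigree-based A or a marker-based G; only the relationship matrix changes.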
The polygenic component did not capture extra information even when the task was made as easy as possible by generating the polygenic component of the data using the same relationship matrix that was also used in the analyses (II). Therefore, in our experience, the polygenic component can safely be omitted from the multilocus association model (3.1).

3.3.2 Indicator

The indicator variable is added to the model framework to act as a source of extra shrinkage in a mixture prior alongside the Student's t or the Laplace density. The usefulness of the indicator variable depends on the other source of shrinkage in the model. As mentioned above, the
hierarchical Laplace model does not seem to require the additional point mass at zero; on the contrary, the model's performance suffers if the indicator is added (Tables 2–5 in I). On the other hand, the Student's t model clearly benefits from the additional point mass. The latter observation is in strict concordance with the existing literature, as the superiority of BayesB (Student's t plus indicator) (Meuwissen et al. 2001) over BayesA (Student's t only) can be considered common knowledge. While the main purpose of the indicator variable within our model framework is to participate in the mixture prior with the Student's t or Laplace densities, in II we have also considered a pure indicator model. Under the Indicator model proposed in II, the prior for the effect sizes β_j is Gaussian with zero mean and a predetermined variance, and therefore the prior for the marker effects γ_j β_j is a mixture of a Gaussian density and a point mass at zero. As the Gaussian prior introduces a constant shrinkage to the estimates, variable selection relies solely on the indicator, and a Bayes factor based on the values of the indicators can be used to determine the significance of a marker effect. Contrary to phenotype or breeding value prediction, in gene mapping the significance of the individual marker effects is of importance. Nevertheless, the Indicator model in II is mainly considered a curiosity and a proof of the power of a multilocus association treatment, as even an extremely simple multilocus association method may exceed the performance of the most sophisticated single-marker method (Figure 1, A and B in II). The indicator has a Bernoulli prior with a prior probability π = P(γ_j = 1) that SNP j contributes to the trait. The value given for the probability π also represents our prior assumption of the proportion of SNP markers that are linked to the trait.
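The mixture prior on γ_j β_j described above can be sketched directly. The snippet below (toy parameter values of our own choosing, Gaussian slab as in the Indicator model) draws indicators and effect sizes independently and shows that the prior proportion of nonzero marker effects is π:

```python
import numpy as np

# Sketch of the Kuo-Mallick style mixture prior (assumed toy values): the
# indicator gamma_j ~ Bernoulli(pi) and the effect size beta_j ~ N(0, tau2)
# are a priori independent, so the marker effect gamma_j * beta_j has a
# point mass at zero with prior weight 1 - pi.
rng = np.random.default_rng(2)
p, pi, tau2 = 50_000, 0.01, 1.0

gamma = rng.binomial(1, pi, size=p)            # which markers are "linked"
beta = rng.normal(0.0, np.sqrt(tau2), size=p)  # effect sizes under the slab
effect = gamma * beta                          # spike-and-slab marker effects

print(np.mean(effect != 0.0))  # proportion of nonzero effects, close to pi
```

Setting π is thus an explicit statement about model sparseness: with p = 50 000 markers, π = 0.01 encodes a prior expectation of about 500 contributing loci.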
However, as the indicator affects the shrinkage of the marker effects concurrently with the shrinkage generated by the Student's t or the Laplace density, the parameters assigned to these densities affect the selection of π.

3.3.3 Hyperprior

The hyperprior layer (the 6th layer in Figure 3.1) constitutes another optional part of the model framework. The parameters of the prior densities (layer 5 in Figure 3.1) can be either predetermined or estimated simultaneously with the model parameters. As the prior densities for the effect size and the indicator are responsible for the regularization of the excess variables in the model, the impact of the parameter values of these priors is greater than that of the other prior densities in the model. Therefore the optional estimation of the prior parameters is limited to these two. The estimation of the prior parameters is depicted in Figure 3.1 by treating the priors for the indicator and the effect variance as random variables and adding the 6th layer to the model. If the parameters of the prior densities are considered fixed, the optional hyperprior layer is absent from the model. The fixed prior parameter values can be determined e.g. by cross-validation or by the Bayesian information criterion (see Sun et al. 2010). It is noteworthy that even if the prior parameters are estimated from the data, i.e. the 6th layer is present in the model, the need for predetermined values does not vanish but simply passes to the next layer of the model hierarchy. Hence, inevitably, at the very bottom of the model hierarchy the user has to determine some values prior to the actual parameter estimation. The hyperprior given for the effect size is a conjugate Gamma(κ, ξ) density for the scale τ² of the inverse-χ² density under the Student's t model, or, respectively, for the rate λ² of the exponential density under the Laplace model. There is neither a conjugate prior nor a closed form posterior density available for the degrees of freedom parameter of the Student's t model, and hence we have decided to consider it fixed (I). For the indicator variable, the prior probability π = P(γ_j = 1) of marker j being linked to the trait is estimated with either an uninformative uniform Beta(1,1) or an informative Beta(a, b) density. The informative beta prior embodies our a priori belief of the proportion of significant markers by considering a as the number of markers assumed to be linked to the trait and b as the number of markers assumed not to be linked (i.e. b = p − a, p being the number of markers in the data set).

3.3.4 Student's t vs.
Laplace prior

In the original work I, one of our main interests was to consider the pros and cons of the Student's t and Laplace prior densities. The advantage of the hierarchically formulated Student's t density as a prior is the extremely easy derivation of the fully conditional posterior densities. Although the hierarchical Laplace prior also leads to tractable fully conditional posterior functions, the derivation of the posterior for the effect variance is clearly more complicated than with the Student's t density. However, the Student's t prior has some shortcomings too. The first problem we encountered with the Student's t model was the estimation of the parameters of the prior densities (5th layer in Figure 3.1). We tried numerous hyperpriors for the effect variance and the indicator, but it appeared to be impossible to
select ones leading to a reasonable estimate. Hence, after several attempts, we decided to treat the prior parameters of the Student's t model as given. Under the Laplace model there were no such complications, and the prior parameters of the Laplace model are estimated from the data. Therefore, in the Laplace model the 6th layer of Figure 3.1 is always included in the model, while in the Student's t model it is always excluded. Due to its shape, the shrinking ability of the Student's t prior is weaker than that of the Laplace prior. While the hierarchical Laplace prior worked fine without the additional indicator variable, the Student's t prior required the additional point mass at zero in order to provide strong enough shrinkage (Tables 2–5 in I). As pointed out previously, a low number of parameters is a desirable characteristic in a model. Apart from a single data set (Table 2 in I), the prediction accuracy of the Laplace model was higher than that of the Student's t model (Tables 3–5 in I). The better performance of the Laplace model may be due partly to the easier, and hence more accurate, prior selection, and partly to the more favorable shape of the density itself. Also, as the prior parameters for the effect variance can be estimated, and hence there is an additional layer in the hierarchical model, the model may be more robust to the given hyperprior parameter values. Altogether, on the basis of our findings in I, we feel that the hierarchical Laplace model has an advantage over the Student's t model, and we therefore decided to concentrate on the former in II and III.

3.3.5 Bayesian LASSO and its extensions

The hierarchical Bayesian model with a Laplace prior density is commonly denoted the Bayesian LASSO (Park and Casella 2008), since it leads to a nearly identical estimate as the frequentist LASSO of Tibshirani (1996).
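The difference in shrinkage between the two priors can be made concrete by comparing the penalties they imply. The snippet below (values chosen for illustration, penalties written up to additive constants; not from the thesis) shows that the Laplace penalty grows linearly in the effect size while the Student's t penalty grows only logarithmically, so the t prior pulls large effects toward zero more weakly:

```python
import numpy as np

# Illustrative comparison of the shrinkage implied by the two priors: the
# "penalty" is the negative log prior density, up to additive constants.
beta = np.array([0.1, 1.0, 10.0])
df, scale = 4.0, 1.0   # arbitrary illustrative prior parameters

laplace_pen = np.abs(beta) / scale                  # -log Laplace density + c
t_pen = 0.5 * (df + 1.0) * np.log1p(beta**2 / df)   # -log Student t density + c

print(np.diff(laplace_pen))  # constant marginal pull: 0.9 then 9.0
print(np.diff(t_pen))        # the increase from 1.0 to 10.0 is smaller than Laplace's
```

This is the shape argument above in miniature: without a point mass at zero, the flat tails of the t penalty provide too little shrinkage, whereas the Laplace penalty keeps pulling at a constant rate.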
The Bayesian LASSO has been further modified by several authors, including Yi and Xu (2008), Mutshinda and Sillanpää (2010), Sun et al. (2010) and Fang et al. (2012). In II we considered a modification of the Bayesian LASSO introduced by Mutshinda and Sillanpää (2010), called the Extended Bayesian LASSO (EBL). Following the common hierarchical Bayesian LASSO, the Laplace prior is expressed as a scale mixture of normal densities with an exponential mixing distribution: the EBL assigns a normal prior with independent locus-specific variances to the regression parameters given the locus variances, β_j | σ_j² ~ N(0, σ_j²), and further an exponential prior to the variances, σ_j² | λ_j² ~ Exp(λ_j²/2). Unlike in the Bayesian LASSO, the regularization parameters λ_j² of
the EBL are locus-specific, and can be decomposed by setting λ_j = δη_j, where δ represents the model sparseness common to all loci, and η_j is a locus-specific deviation representing the shrinkage operating at locus j. The common Bayesian LASSO can now be seen as a special case of the EBL with the locus-specific components set to η_j = 1 for all j. Setting the common shrinkage parameter δ = 1 would lead to the Improved Bayesian LASSO proposed by Fang et al. (2012).

3.3.6 Bayesian G-BLUP

In addition to the multilocus association model, in I and III we have considered a Bayesian version of the genomic BLUP, or G-BLUP, a classical BLUP model where the numerator relationship matrix, estimated from the pedigree, is replaced by a genomic marker-based relationship matrix:

y = β_0 + Zu + ε.  (3.3)

In the model framework of Figure 3.1 the G-BLUP can be seen as a mirror image of the multilocus association model without the polygenic component, as here we have the polygene without the marker effects. The likelihood of the data under the G-BLUP is simply a multivariate normal with mean β_0 + Zu and covariance I_n σ_0². The priors for the genetic values u and the population intercept β_0 are a conjugate multivariate normal MVN(0, G σ_u²) and a uniform density, respectively, G being the genomic relationship matrix. The variances σ_0² and σ_u² have inverse-χ² priors, the uninformative p(σ_0²) ∝ 1/σ_0² and an Inv-χ²(ν_u, τ_u²) density, respectively. Under the G-BLUP the genetic marker data are incorporated into the model in the form of a genomic relationship matrix. There are numerous methods of generating the genomic relationship matrix; we have used the second method described in VanRaden (2008). This method is based on the identity by state (IBS) of the marker genotypes, and hence it measures the realized relationship between the individuals. The Bayesian approach differs from the frequentist G-BLUP in terms of handling the variance components.
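A marker-based relationship matrix in the spirit of VanRaden's (2008) second method can be sketched as follows. This is a hedged illustration, not the thesis implementation: genotypes are assumed coded 0/1/2, allele frequencies are taken from the sample itself, and monomorphic markers are simply dropped.

```python
import numpy as np

# Hedged sketch of a marker-based relationship matrix in the spirit of
# VanRaden's (2008) second method: each 0/1/2 genotype column is centred by
# twice its allele frequency and weighted by 1 / (2 p_j (1 - p_j)), then the
# cross-products are averaged over markers.
def genomic_relationship(X):
    """X: n x m matrix of genotype counts coded 0/1/2."""
    p = X.mean(axis=0) / 2.0               # allele frequency of each marker
    keep = (p > 0.0) & (p < 1.0)           # drop monomorphic markers
    Z = X[:, keep] - 2.0 * p[keep]         # centred genotypes
    w = 1.0 / (2.0 * p[keep] * (1.0 - p[keep]))
    return (Z * w) @ Z.T / Z.shape[1]

rng = np.random.default_rng(3)
X = rng.binomial(2, 0.4, size=(5, 2000)).astype(float)  # toy genotypes
G = genomic_relationship(X)
print(G.shape, np.allclose(G, G.T))  # a symmetric n x n matrix
```

The per-marker weighting is what distinguishes this variant from the simpler pooled scaling: rare alleles receive larger weight, so each marker contributes comparably to the realized relationships.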
While frequentist methods commonly estimate the genomic breeding values with known variance components, in the Bayesian approach the variance components are estimated simultaneously with the breeding values (Hallander et al. 2010). Therefore the Bayesian inference is always based on variances that are up to date and specific to the analyzed trait, and it also allows the uncertainty of the variance components to be incorporated into the estimates of the breeding values. Even though e.g. ASREML (Gilmour et al. 2009) estimates the variance
components from the data, and hence satisfies the up-to-date criterion, the variances are not estimated simultaneously with the breeding values; instead, the pre-estimated variance components are treated as constants while estimating the breeding values.

3.4 Fully conditional posterior densities

As depicted in Figure 3.1, the model parameters β_0, σ_0², β, γ and u, located at the 3rd layer, are considered a priori independent of each other. The prior independence of the indicator and the effect size, as suggested by Kuo and Mallick (1998), leads to the most straightforward parametrization of a mixture prior for the effects. In conjunction with conjugate, or otherwise well chosen, prior densities it enables an easy derivation of a closed form fully conditional posterior distribution for every parameter of the model framework. The joint posterior distribution of the parameters, given the data, is proportional to the product of the joint prior and the likelihood. We can extract the fully conditional posterior density of an individual parameter from the joint posterior by treating all other parameters as constants and retaining only the terms that include the parameter in question. For example, the fully conditional posterior distribution of a single regression coefficient β_j, given all other parameters and the data, is derived from the joint distribution simply by selecting only the terms including β_j, i.e. the likelihood and the conditional prior p(β_j | σ_j²).
Under the full multilocus association model (3.1) we get the following closed form fully conditional posterior distributions for the model parameters (for brevity, "·" denotes the data and all parameters except the one in question):

β_0 | · ~ N( (1/n) Σ_{i=1}^n ( y_i − Σ_{j=1}^p γ_j β_j x_ij − u_i ),  σ_0²/n ),  (3.4)

σ_0² | · ~ Inv-χ²( n, (1/n) Σ_{i=1}^n ( y_i − β_0 − Σ_{j=1}^p γ_j β_j x_ij − u_i )² ),  (3.5)

β_j | · ~ N(μ_j, s_j²),  (3.6)

where

μ_j = [ Σ_{i=1}^n γ_j x_ij ( y_i − β_0 − Σ_{l≠j} γ_l β_l x_il − u_i ) ] / [ Σ_{i=1}^n (γ_j x_ij)² + σ_0²/σ_j² ],

s_j² = σ_0² / [ Σ_{i=1}^n (γ_j x_ij)² + σ_0²/σ_j² ].
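The single-site update implied by (3.6) can be sketched as a Gibbs sweep over the coefficients. The snippet below is a minimal illustration on toy simulated data (our own invented values; intercept, indicators, polygenic part and variances held fixed), not the thesis implementation:

```python
import numpy as np

# Minimal sketch of one Gibbs sweep using the fully conditional posterior (3.6):
# each beta_j is drawn from N(mu_j, s_j^2), with mu_j and s_j^2 computed from
# partial residuals that exclude marker j's own contribution.
rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 1.0 + 0.8 * X[:, 0] + rng.normal(scale=0.5, size=n)  # one true effect

beta0, u = 1.0, np.zeros(n)            # intercept and polygenic part held fixed
beta, gamma = np.zeros(p), np.ones(p)  # start from zero effects, all included
sigma0_2, sigma_j2 = 0.25, 1.0         # residual and effect variances (fixed)

for j in range(p):
    # partial residual: remove everything except marker j's contribution
    r = y - beta0 - X @ (gamma * beta) + gamma[j] * beta[j] * X[:, j] - u
    denom = np.sum((gamma[j] * X[:, j]) ** 2) + sigma0_2 / sigma_j2
    mu_j = gamma[j] * X[:, j] @ r / denom
    s_j2 = sigma0_2 / denom
    beta[j] = rng.normal(mu_j, np.sqrt(s_j2))

print(beta[0])  # should land near the simulated effect of 0.8
```

Iterating such sweeps, together with the updates (3.4) and (3.5) for β_0 and σ_0², gives the full sampler; the GEM algorithm of I–III replaces the draws with conditional maximizations of the same densities.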
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More informationLecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017
Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping
More informationHERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)
BIRS 016 1 HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) Malka Gorfine, Tel Aviv University, Israel Joint work with Li Hsu, FHCRC, Seattle, USA BIRS 016 The concept of heritability
More informationMS-C1620 Statistical inference
MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents
More information(Genome-wide) association analysis
(Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by
More informationLarge-scale Ordinal Collaborative Filtering
Large-scale Ordinal Collaborative Filtering Ulrich Paquet, Blaise Thomson, and Ole Winther Microsoft Research Cambridge, University of Cambridge, Technical University of Denmark ulripa@microsoft.com,brmt2@cam.ac.uk,owi@imm.dtu.dk
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationBayesian model selection: methodology, computation and applications
Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program
More informationBayesian Grouped Horseshoe Regression with Application to Additive Models
Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne
More information1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics
1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee and Andrew O. Finley 2 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationThe Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations
The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture
More informationLecture 8 Genomic Selection
Lecture 8 Genomic Selection Guilherme J. M. Rosa University of Wisconsin-Madison Mixed Models in Quantitative Genetics SISG, Seattle 18 0 Setember 018 OUTLINE Marker Assisted Selection Genomic Selection
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationLecture 9. QTL Mapping 2: Outbred Populations
Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationLECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)
LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationBayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson
Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n
More informationSome models of genomic selection
Munich, December 2013 What is the talk about? Barley! Steptoe x Morex barley mapping population Steptoe x Morex barley mapping population genotyping from Close at al., 2009 and phenotyping from cite http://wheat.pw.usda.gov/ggpages/sxm/
More informationQTL model selection: key players
Bayesian Interval Mapping. Bayesian strategy -9. Markov chain sampling 0-7. sampling genetic architectures 8-5 4. criteria for model selection 6-44 QTL : Bayes Seattle SISG: Yandell 008 QTL model selection:
More informationDAG models and Markov Chain Monte Carlo methods a short overview
DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of
More informationHierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31
Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationLinear Models A linear model is defined by the expression
Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose
More informationMotivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University
Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined
More informationCSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection
CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection
More informationBayesian Linear Models
Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationIntegrated Non-Factorized Variational Inference
Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014
More informationBayesian data analysis in practice: Three simple examples
Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to
More informationMachine Learning Techniques for Computer Vision
Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM
More informationOutline Lecture 2 2(32)
Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic
More information