Aster Modeling of Chamaecrista fasciculata
Allen Clark
February 9, 2018

Abstract

Aster models are a type of graphical model in which each node is modeled by an exponential family distribution. In biological applications, aster models are well suited for estimating the fitness of a species. In this report, the species of interest is Chamaecrista fasciculata, commonly known as the partridge pea. Fitness is assessed by the number of seeds each plant produced. Our model includes both fixed effects for each parameter in the graphical model and random effects to account for the genetic variability of each plant's parentage. We fit a Bayesian aster model and sample from the posterior distribution via Markov chain Monte Carlo (MCMC). The prior distributions were elicited via subject matter expertise and additional data on Chamaecrista fasciculata from another growing location, both provided by collaborators in the evolutionary biology group.

Contents

1 Introduction
2 Aster Models
  2.1 Exponential Families
  2.2 Aster Graphs
    Requirements for the Aster Graph
    An Aster Graph Example
  2.3 Conditional and Unconditional Models
  2.4 Saturated Aster Models and Aster Submodels
  2.5 Aster Model Transformations
3 Random Effects
  3.1 Breeding Values
  3.2 Pedigrees
  3.3 Avoiding Inverting the Numerator Relationship Matrix
4 Data
  4.1 Why C. fasciculata?
  4.2 C. fasciculata Component of Fitness Data
  4.3 Pedigree Data
5 Bayesian Analysis
  5.1 Log Likelihood
  5.2 Log Priors
    5.2.1 Fixed Effect Parameters
    5.2.2 Random Effect Parameters
  5.3 Prior Elicitation
6 Computation via MCMC
  Why MCMC?
  Markov Chains
  Monte Carlo
  Markov Chain Monte Carlo
  Metropolis Algorithm
    Proposal Distributions
    Metropolis Ratio
    Metropolis Rejection
  MCMC spacing: saving memory
  Checkpointing
  MCMC Central Limit Theorem Variance
  MCMC Diagnostics
    Time Series Plots
    Auto-correlation Function Plots
7 Results

1 Introduction

In evolutionary biology, one way to measure an organism's success at passing down genetic information to future generations is by counting the lifetime number of offspring produced. Plants or animals that produce more offspring contribute more to the genetic makeup of the species in future generations; this counts as a reproductive success for the organism. On the other hand, organisms that do not produce any offspring will not directly contribute to the future of the species' genetics. Following the terminology of Shaw et al. (2008, p. E35), this report defines fitness as the lifetime number of offspring produced by an individual.¹

Note that two individuals with nearly identical genetics placed in the same environment can produce different numbers of offspring purely by random chance. Therefore, the fitness of an individual is a random variable. It can take on a range of values from zero upward and has an expected value, which we call expected fitness. An organism must survive to breeding periods in order to reproduce, and organisms that reproduce more often will have higher observed fitness. Therefore, fitness is influenced by:

1. Longevity - an organism's ability to survive
2. Fecundity - an organism's frequency of reproduction

Because fitness is count data, a naive researcher might estimate fitness with µ̂ = x̄ under a Poisson model, which has probability mass function f(x) = e^{−µ} µ^x / x!. Unfortunately, this easy way doesn't work. Figure 1 shows two problems with the Poisson approach: the observed data has a very heavy tail, and too many individuals have zero offspring.
Neither a Poisson distribution nor any other single well-known distribution can accommodate these features. The Poisson distribution considers fecundity (offspring produced), but longevity, another key aspect of fitness, has been left out of the equation.

¹ Lifetime offspring count is sometimes called observed fitness to distinguish between an organism's ability to fit into its environment and an observation of how well it does so. See Beatty (1992) for a discussion.
Figure 1: Comparison of Poisson data versus fitness data: Poisson is not a good fit. Top: simulated Poisson data. Bottom: real fitness data from Chamaecrista fasciculata. Note the heavy tail and the frequent occurrence of zero beyond what is expected from a Poisson distribution.
A better approach is to model fitness conditioned on the survival and development of an organism throughout its fertile life. The overall fitness, which is the total number of offspring, can be broken down into components of fitness, which represent key measurements related to overall fitness. Examples of components of fitness include survival to fertility, the presence or absence of offspring, and the number of offspring produced in a given breeding period. Breaking down an organism's reproduction into its biologically relevant parts provides the model with the statistical flexibility it needs to produce valid results. The approach works as follows:

1. Identify the most important life cycle stages for an organism as components of fitness.
2. Connect these components of fitness by forming a graphical dependence structure based on how they influence each other biologically.
3. Decide on statistical distributions and parameters for each component of fitness.
4. Use the data to estimate the chosen parameters.

Prior to the introduction of aster models (Geyer et al., 2007), the practice was to model components of fitness separately, conditioned on survival. The sticking point was how to combine these separate analyses to draw inference about fitness. Aster models offer a solution by directly modeling the joint distribution of all components of fitness, providing parameter estimates that directly correspond to fitness. Aster models are named for the first species analyzed with them, Echinacea angustifolia, a flowering plant in the aster family (Geyer et al., 2007). In aster models, the dependence structure between components of fitness can be represented visually in an aster graph, which for E. angustifolia is shown in figure 2. The dependence structure is the same across three years and consists of three layers: survival, flowering status, and flower count. If the plant survives, it has a chance to produce flowers.
If the plant produces flowers, it can have any number of them. Survival status also depends on the survival status in the previous year, so arrows run across the survival layer from left to right. The terminal nodes, flower count, are proxies for offspring, so the sum over all terminal nodes is used as a proxy for fitness.

Figure 2: Graphical dependence structure for E. angustifolia flower count data (Geyer et al., 2007). An initial node (Initial = 1) leads via Bernoulli arrows into the plant survival layer (Y_1, Y_2, Y_3, one node per year); each survival node leads via a Bernoulli arrow to a flowering status node (FS_1, FS_2, FS_3); and each flowering status node leads via a zero-truncated Poisson arrow to a flower count node (FC_1, FC_2, FC_3).
2 Aster Models

Aster models are exponential family graphical models. They are exponential family models in the sense that the conditional distribution of each component of fitness given its predecessor is an exponential family, and in the sense that the joint distribution of all components of fitness is also an exponential family (see section 2.1). Aster models are also graphical models in that each component of fitness depends on the previous component of fitness in the aster graph (see section 2.2). There are six parameterizations available for aster models (see sections 2.3, 2.4, 2.5).

2.1 Exponential Families

An exponential family is a category of distributions having a common form. Many well-known distributions fit into this category (e.g. normal, binomial, and Poisson). The advantage of exponential family distributions is shared theory: one only has to show that a given distribution is an exponential family, and all the properties proven about exponential family distributions apply. Suppose X is the raw data and Y(X) is a k-dimensional statistic calculated from the raw data X. An exponential family is defined as any distribution with probability density function (PDF), probability mass function (PMF), or probability mass-density function (PMDF)² of the form

    f_θ(x) = h(x) exp( Σ_{i=1}^k y_i(x) θ_i − c(θ) )    (1)

Here h(x) is a function of the data only. The statistic Y(X) is called the canonical statistic when the distribution is written in the form of equation 1. Likewise, the parameter θ is called the canonical parameter when it is in the form of equation 1. The function c(θ) is called the cumulant function. The key to exponential families is that the statistic Y(X), the parameter θ, and the functions h(·) and c(·) are not allowed to mix data and parameter. For example, an indicator function such as h(x) = I_{x<θ} violates the separation of data and parameter rule.
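As a quick sanity check of the canonical form, the Poisson distribution fits equation 1 with canonical statistic y(x) = x, canonical parameter θ = log µ, cumulant function c(θ) = e^θ, and h(x) = 1/x!. A minimal sketch (with an illustrative value of µ):

```python
import math

mu = 2.5                      # illustrative mean
theta = math.log(mu)          # canonical parameter of the Poisson family

def pois_pmf(x):
    """Poisson PMF in its usual form."""
    return math.exp(-mu) * mu**x / math.factorial(x)

def expfam_pmf(x):
    """The same PMF in exponential family form h(x) exp(y theta - c(theta))."""
    h = 1.0 / math.factorial(x)
    c = math.exp(theta)       # cumulant function c(theta) = exp(theta)
    return h * math.exp(x * theta - c)

# The two forms agree term by term.
for x in range(10):
    assert abs(pois_pmf(x) - expfam_pmf(x)) < 1e-12
```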
Following from equation 1, the log likelihood of an exponential family is

    l(θ) = log h(x) + Σ_{i=1}^k y_i θ_i − c(θ)
         = Σ_{i=1}^k y_i θ_i − c(θ)    (dropping terms without θ)    (2)

The cumulant function c(·) is useful because it allows calculation of the mean of Y. Assuming that θ ∈ Interior(Θ), so that the canonical parameter θ is in the interior of its parameter space, exponential families have the property that

    µ = E_θ{Y} = ∇c(θ)
    Var_θ(Y) = ∇²c(θ)    (3)

The parameter µ also parameterizes the exponential family (Geyer, 2016). Since µ is also a parameter, the exponential family can use µ as a parameter in place of θ. Of course, this transformation will then require a matching transformation of Y in the distribution. The parameterization involving µ is so useful that it gets its own name, the mean-value parameterization.

² When some components of a random vector are discrete and others are continuous, the distribution of the random vector is partly discrete and partly continuous.
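The derivative identities in equation 3 can be verified numerically. The sketch below uses the Bernoulli family, whose cumulant function is c(θ) = log(1 + e^θ), and checks the first and second derivatives by finite differences (the value of θ is illustrative):

```python
import math

def c(theta):
    """Bernoulli cumulant function c(theta) = log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))

theta, h = 0.7, 1e-5
p = math.exp(theta) / (1.0 + math.exp(theta))   # success probability

# Central finite differences approximate c'(theta) and c''(theta).
mean = (c(theta + h) - c(theta - h)) / (2 * h)
var = (c(theta + h) - 2 * c(theta) + c(theta - h)) / h**2

assert abs(mean - p) < 1e-8           # c'(theta) = E(Y) = p
assert abs(var - p * (1 - p)) < 1e-4  # c''(theta) = Var(Y) = p(1 - p)
```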
The mean-value parameter is useful for drawing inference in an applied problem. The canonical parameter is useful because it allows use of theorems about exponential families and can be used for maximum likelihood estimation. Aster model estimation is accomplished using the canonical parameter, which is then mapped to the mean-value parameter to draw conclusions about applied problems. Important exponential family distributions for aster models include

- Bernoulli: Ber(p)
- Normal: N(µ, Σ)
- Negative binomial: NegBin(r, p)
- Poisson: Pois(µ)
- Zero-truncated Poisson: 0-Pois(µ)

The zero-truncated Poisson distribution is a Poisson distribution with zero removed as a possible value. The PMF is

    f_µ(x) = ( 1 / (1 − e^{−µ}) ) ( e^{−µ} µ^x / x! ),    x = 1, 2, ...    (4)

Notice the second factor is the usual Poisson PMF. The first factor rescales the distribution by the probability that x ≠ 0. Models using zero-truncated Poisson distributions can avoid the inflated-zeros issue in figure 1 with a two-step setup: a Bernoulli variable models the zero values, and a zero-truncated Poisson models the count observed given that the Bernoulli variable is one.

2.2 Aster Graphs

Aster models have a graphical dependence structure that can be expressed visually in an aster graph. There are four rules for creating an aster graph (Geyer et al., 2007).

1. Nodes are random variables for components of fitness.
2. Edges are conditional distributions for components of fitness.
3. Predecessors are sample sizes.
4. Initial nodes are constant random variables.

An aster graph relationship like

    X ---f_θ(y|x)---> Y    (5)

means that component of fitness X has a direct influence on component of fitness Y. The conditional distribution is

    Y | X = Σ_{i=1}^X Y_i    where Y_i ~ f_θ(y_i), IID    (6)

where θ is the canonical parameter of the f_θ distribution. Exponential family distributions have a special formula for the sum of n independent and identically distributed (IID) random variables.
If Y_1, ..., Y_n are IID with the same exponential family distribution and cumulant function c(θ), then the sum Σ_{i=1}^n Y_i again has an exponential family distribution, with canonical statistic Σ_{i=1}^n Y_i, canonical parameter θ, and cumulant function n c(θ) (Geyer, 2013, deck 2, slide 22). This gives a convenient way to find the conditional distribution of Y | X in the aster graph. For many exponential family distributions, the resulting sum will be a well-known distribution. For example, if Y_1, Y_2, ..., Y_n are IID, then the distribution of Σ_{i=1}^n Y_i is
- Bin(n, p) if Y_i ~ Ber(p), IID
- Pois(nµ) if Y_i ~ Pois(µ), IID
- N(nµ, nσ²) if Y_i ~ N(µ, σ²), IID

For other exponential family distributions, such as the zero-truncated Poisson, the sum does not follow any well-known distribution. Regardless, the distribution of the sum of n IID exponential family random variables is always an exponential family distribution with cumulant function n c(θ).

Initial nodes must be constant to denote the sample size of the first non-degenerate random variable in the aster graph. For example, if an experimenter planted three seeds and then established a component of fitness to see if the seeds germinate, the first arrow in the graph would be

    3 ---Ber(p)---> Germinate

where Germinate is the sum of three independent Ber(p) random variables, which by the Bernoulli sum rule is a single Bin(3, p) random variable. Components of fitness are often zero. When a predecessor variable is zero, the successor is an empty sum of zero terms; by convention this is also zero. The nodes in the graph come together to form the joint distribution of all the components of fitness. Suppose we have n_nodes non-initial nodes X_1, X_2, ..., X_{n_nodes} in the aster graph. Then the joint distribution of X = {X_1, X_2, ..., X_{n_nodes}} is

    f(X_1, X_2, ..., X_{n_nodes}) = Π_{i=1}^{n_nodes} f(X_i | X_{p(i)})    (7)

where p(i) is the index of the predecessor node that comes immediately before X_i. Notice that the initial nodes are not among the factors in this product. Immediate successors of initial nodes have conditional distribution f(X_i | X_{p(i)}) = f(X_i).

Requirements for the Aster Graph

Aster models place some requirements on the aster graph (Geyer et al., 2007).

At most one predecessor: Each node has at most one predecessor. The initial node has no predecessors.

Acyclic: The aster graph must be acyclic. Without this property, the joint distribution cannot be factored into a product of conditionals as in equation 7. An acyclic graph is necessary to obtain closed-form densities and likelihoods.
Initial node is constant: An initial node must be constant. It represents the sample size of the succeeding random variable. Suppose we plant seeds in a garden. If we place X_initial = 3 seeds in a single slot, then there are three random variables (seeds) that can either sprout or not sprout. If X_initial = 1, we plant only one seed in the slot, and either one or zero plants can sprout from that seed.

An Aster Graph Example

To gain more intuition, let's apply these rules to a simple example. Suppose we want to know how many seeds a plant produces, and we use three components of fitness: an initial node X_initial, the number of flowers F, and the total number of seeds S the plant produces. For simplicity,
suppose we plant one plant in each growing slot, so X_initial = 1. Plants can't produce seeds without flowers, so it makes sense for F to be the predecessor of S. A plant can have any number 0, 1, 2, 3, ... of flowers, so F follows a Poisson distribution. Depending on the type of plant being modeled, each flower may bear at least one seed. If each flower must have at least one seed, then S | F follows a zero-truncated Poisson distribution. Using this model we obtain the aster graph in equation 8:

    1 ---Pois(µ_F)---> F ---0-Pois(µ_S)---> S    (8)

With the aster graph above fully specified, the conditional distribution of seeds given flower count, S | F, is

- concentrated at zero if F = 0
- a zero-truncated Poisson if F = 1
- the sum of n IID zero-truncated Poisson random variables if F = n > 1

In the first case, the plant has no flowers, so F is zero. Hence there are no seeds, and f(s | F = 0) is a degenerate distribution concentrated at the value zero. In the second case, the plant has one flower, so F = 1, and f(s | F = 1) is the distribution of a single zero-truncated Poisson random variable. In the third case, F > 1, so we get a proper summation: the sum of IID zero-truncated Poisson random variables over all the flowers that the plant produced.

2.3 Conditional and Unconditional Models

There are two different ways to parameterize the same saturated aster model (see section 2.4): through a conditional model or through an unconditional model. These conditional and unconditional models relate to equation 7, which we now repeat here.

    f_ϕ(X_1, X_2, ..., X_{n_nodes}) = Π_{i=1}^{n_nodes} f_{θ_i}(X_i | X_{p(i)})

The right-hand side is a product of conditional PMDFs. Each random variable X_i | X_{p(i)} has conditional distribution f_{θ_i}(X_i | X_{p(i)}), which is an exponential family with canonical parameter θ_i. The joint distribution, obtained by taking the product, is also an exponential family distribution.
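Returning for a moment to the seed-count example of section 2.2.2, the sum rule can be checked numerically: the distribution of S given F = n is the n-fold convolution of the zero-truncated Poisson PMF. The sketch below (illustrative parameter values, with the support truncated for computation) confirms this is a proper distribution whose mean is n times the single-flower mean:

```python
import math

mu_S = 2.0        # illustrative zero-truncated Poisson parameter
TRUNC = 60        # truncation point for the numerical support

def ztp_pmf(x):
    """Zero-truncated Poisson PMF on x = 1, 2, ..."""
    if x < 1:
        return 0.0
    pois = math.exp(-mu_S) * mu_S**x / math.factorial(x)
    return pois / (1.0 - math.exp(-mu_S))

def convolve(f, g):
    """Distribution of the sum of two independent counts on 0..TRUNC."""
    return [sum(f[j] * g[k - j] for j in range(k + 1)) for k in range(TRUNC + 1)]

single = [ztp_pmf(x) for x in range(TRUNC + 1)]

n = 4                                   # a plant with four flowers
dist = single
for _ in range(n - 1):
    dist = convolve(dist, single)       # S | F = n as an n-fold convolution

total = sum(dist)
mean = sum(x * px for x, px in enumerate(dist))
xi = mu_S / (1.0 - math.exp(-mu_S))     # mean of one zero-truncated Poisson

print(total)          # ~1: a proper distribution
print(mean, n * xi)   # mean of the sum is n times the single-flower mean
```

As the text notes, this n-fold sum is not itself a zero-truncated Poisson, but it is still an exponential family with cumulant function n c(θ).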
The joint parameter θ = (θ_1, θ_2, ..., θ_{n_nodes}) is a valid parameter for the joint distribution, but it is not the joint distribution's canonical parameter ϕ. The left-hand side is a single unconditional PMDF. The joint random vector (X_1, X_2, ..., X_{n_nodes}) has an unconditional joint distribution f_ϕ(X_1, X_2, ..., X_{n_nodes}), which is an exponential family with canonical parameter ϕ. The conditional distributions, obtained by factoring, are also exponential family distributions. Each conditional parameter ϕ_i is a valid parameter for the corresponding conditional distribution f_{ϕ_i}(X_i | X_{p(i)}), but it is not that conditional distribution's canonical parameter θ_i. The relationship between θ and ϕ is determined by a mapping called the aster transform: the aster transform maps θ to ϕ, and the inverse aster transform maps ϕ to θ. Section 2.5 presents this mapping in more detail. The conditional model allows biologists to model the relationships between components of fitness that specify the aster graph (see figures 2 and 6). Both the conditional model and the unconditional model may be useful for inference, depending on the application.
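A concrete sketch of the transform, consistent with the formulas given in section 2.5 and using illustrative parameter values: for a chain 1 → Ber → Pois, the terminal node's two canonical parameters agree, while the Bernoulli node's unconditional parameter absorbs the cumulant function of its Poisson successor. Transforming and then back-solving recovers the conditional parameters exactly.

```python
import math

# Conditional canonical parameters for a two-node chain:
# node 1 is Bernoulli (e.g. survival), node 2 is Poisson (count).
theta = [0.3, -0.5]                     # illustrative values

c2 = lambda t: math.exp(t)              # Poisson cumulant function

# Aster transform: a terminal node is unchanged; a non-terminal node's
# unconditional parameter subtracts its successor's cumulant function.
phi = [theta[0] - c2(theta[1]), theta[1]]

# Inverse aster transform: back-solve from the terminal node inward.
theta2_back = phi[1]
theta1_back = phi[0] + c2(theta2_back)

assert abs(theta1_back - theta[0]) < 1e-12
assert abs(theta2_back - theta[1]) < 1e-12
```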
2.4 Saturated Aster Models and Aster Submodels

So far we have discussed the aster model for a single organism. We now expand this to allow data on many individuals. In the most general case, each individual in the study may have its own aster graph. This can be handled mathematically by the framework presented in Geyer et al. (2007) and computationally with the tools included in the aster2 package (Geyer, 2017b). However, this is more general than necessary for either the E. angustifolia example (figure 2) or the C. fasciculata data (figure 6). If we assume all individuals share the same aster graph, the unconditional canonical statistic Y is a vector of length n_nodes × n_ind. Likewise, ϕ and θ are vectors of length n_nodes × n_ind. This means that the model given by the aster graph using θ or ϕ is saturated: it has as many parameters as there are data points and no degrees of freedom available. Aster models deal with the saturated model issue the same way that linear models and generalized linear models do; an affine submodel reduces the dimension of the problem and thus gives back degrees of freedom.

    Model Type            Saturated Model        Affine Submodel
    Linear regression     Y | X ~ N(µ, σ²I_n)    µ = o + Mβ
    Logistic regression   Y | X ~ Bin(n, p)      logit(p) = o + Mβ
    Aster models          Y | X ~ ExpFam(ϕ)      ϕ = o + Mβ

In each case, the saturated model, which is specified by giving a probability model to the target variable Y | X, has too many parameters to estimate directly. Rather, some function of the parameter in the saturated model is reduced in dimension by mapping to o + Mβ. The dimension is reduced to the number of columns of M, which gives back degrees of freedom. In linear or logistic regression, the model matrix M tracks covariate information. Continuous variables in X may become columns in M directly, categorical variables are converted to binary columns in M, and interaction columns or alternative basis functions may be included as desired.
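A minimal sketch of the dimension reduction, with made-up numbers: three individuals and two nodes each give a saturated parameter of length six, while a submodel with an intercept column plus a node-indicator column has only two coefficients.

```python
# phi = o + M beta for a toy problem: n_ind = 3 individuals and
# n_nodes = 2 nodes give a saturated parameter of length 6, while the
# submodel has only n_coef = 2 coefficients (all numbers illustrative).
n_ind, n_nodes = 3, 2

# Rows are ordered (individual, node); columns are an intercept and an
# indicator for the second node.
M = [[1, 1 if node == 1 else 0] for _ in range(n_ind) for node in range(n_nodes)]

o = [0.0] * (n_ind * n_nodes)            # offset, taken to be zero here
beta = [0.4, -1.1]                       # submodel canonical parameter

phi = [oi + sum(mij * bj for mij, bj in zip(row, beta))
       for oi, row in zip(o, M)]

print(len(phi), len(beta))   # 6 saturated components from 2 coefficients
```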
In aster models, the model matrix tracks component of fitness data in addition to other covariates. This is accomplished by treating the component of fitness as an indicator variable and appending n_nodes − 1 binary columns to the model matrix M. In the affine submodel for aster models,

    ϕ = o + Mβ    (9)

the offset vector o is a known vector that may optionally depend on covariate information or be left out entirely. The parameter β is called the canonical submodel parameter.

2.5 Aster Model Transformations

So far we have introduced three parameterizations for aster models: conditional, unconditional, and submodel. θ, ϕ, and β are the canonical parameters for the conditional model, unconditional model, and submodel respectively. Aster model parameters can also take mean-value form rather than canonical form. Like canonical parameters, mean-value parameters exist for all three model types: conditional, unconditional, and submodel. In total, there are 2 × 3 = 6 possible aster model parameterizations: first choose the parameter's role, mean-value or canonical; next choose the model type, unconditional, conditional, or submodel. The mean-value parameter for the conditional model is ξ, defined as

    ξ_i = E(Y_i | Y_{p(i)} = 1)    (10)
Figure 3: The six aster model parameterizations and the maps between them. The canonical parameters are θ (conditional model), ϕ (unconditional model), and β (submodel); the corresponding mean-value parameters are ξ, µ, and τ. The conditional and unconditional model parameters have dimension n_ind × n_nodes, and the submodel parameters have dimension n_coef. The aster transform maps θ to ϕ and the inverse aster transform maps back; ϕ = o + Mβ connects the submodel to the unconditional model; ξ_i = c_i′(θ_i) connects θ to ξ; µ = ∇c(ϕ) connects ϕ to µ; τ = Mᵀµ connects µ to τ; multiplication and division connect ξ and µ. The remaining maps have no closed form.

The ξ_i can be obtained from θ_i via ξ_i = c_i′(θ_i). The mean-value parameter for the unconditional model, µ, is defined similarly as

    µ = E(Y)    (11)

From exponential family theory, we have µ = E(Y) = ∇c(ϕ). Finally, the mean-value parameter for the submodel, τ, is the expected value of Y mapped through the submodel matrix. That is,

    τ = Mᵀ E(Y) = Mᵀ µ    (12)

The six parameters θ, ϕ, β, ξ, µ, τ are displayed in figure 3. The conditional model and unconditional model parameters all have dimension n_ind × n_nodes, since there is one component for each aster graph node per individual. The submodel parameters have dimension n_coef, the number of columns of the model matrix M. The relationships between the six parameters are indicated by the arrows in figure 3. The aster transform converts the conditional canonical parameter θ to the unconditional canonical parameter ϕ. When we express the factorization of the log likelihood using θ we get

    l(θ; y) = Σ_{i=1}^{n_nodes} l(θ_i; y_i)    (13)

Using the fact that the successor node Y_i is the sum of Y_{p(i)} IID random variables and applying the exponential family summation rule, this becomes

    l(θ; y) = log Π_{i=1}^{n_nodes} exp( y_i θ_i − y_{p(i)} c_i(θ_i) ) = Σ_{i=1}^{n_nodes} ( y_i θ_i − y_{p(i)} c_i(θ_i) )    (14)

This sum is linear in y. The y_{p(i)} terms can be grouped into non-random initial node data and random non-initial node data. Let J be the set of random non-initial nodes in the aster graph.
Then the result matches the unconditional canonical parameterization:

    l(θ; y) = Σ_{i∈J} y_i ( θ_i − Σ_{j∈J, p(j)=i} c_j(θ_j) ) − Σ_{j∈J, p(j)∉J} y_{p(j)} c_j(θ_j) = yᵀϕ − c(ϕ)    (15)

The last step determines the aster transform by setting the i-th component of ϕ to

    ϕ_i = θ_i − Σ_{j∈J, p(j)=i} c_j(θ_j)    (16)

and the unconditional cumulant function to

    c(ϕ) = Σ_{j∈J, p(j)∉J} y_{p(j)} c_j(θ_j)    (17)

The inverse aster transform relies on back-solving the aster transform for θ_i in terms of ϕ_i. At terminal nodes there are no successors, so ϕ_i = θ_i. The approach is to start at the terminal nodes and work towards the initial node, solving for θ_i in terms of the components of ϕ. This specifies the inverse aster transform via

    θ_i = ϕ_i + Σ_{j∈J, p(j)=i} c_j(θ_j)    (18)

where each θ_j has already been determined in a previous step of the back-solving process. The relationship between µ and ξ is one of multiplication and division:

    µ_i = E(Y_i) = E{E(Y_i | Y_{p(i)})} = E{Y_{p(i)} ξ_i} = ξ_i µ_{p(i)}    (19)

This recursion can be repeated until the evaluation reaches an initial node.

3 Random Effects

In aster models, the components of fitness already account for variability via the variance of the conditional models for each component of fitness in the aster graph. Additional sources of variability come from explicit random effects introduced into the model. Specifically, we extend the submodel to allow for random effects. Recall equation 9, ϕ = o + Mβ. We now revise this submodel to include random effects:

    ϕ = o + Mβ + Za    (20)

As before, o is an offset term, but now there are two model matrices, M and Z. M is the model matrix for fixed effects; Z is the model matrix for random effects. β and a are vectors
representing the fixed and random effects of the model. This formulation of random effect aster models was introduced in Geyer et al. (2013). Our purpose for random effects is to estimate the variability in fitness that arises from the genetic differences between individuals. The quantitative genetics literature (Lynch and Walsh, 1998; Wilson et al., 2010) provides a random effect called the breeding value for this purpose. Here we take a to be the vector of breeding values, with one component per individual.

3.1 Breeding Values

Breeding values carry a dependence structure specifying how much genetic information is shared between individuals. This structure is determined by the numerator relationship matrix N and a scalar variance component σ²_A known as the additive genetic variance. The vector of breeding values a then follows a normal distribution,

    a ~ N(0, σ²_A N)    (21)

The numerator relationship matrix indicates how much genetic similarity there is between two individuals. This is measured via family relationships between individuals, such as parent, child, sibling, and half-sibling. When there is no inbreeding (i.e. an individual's parents are unrelated), a parent and a child share half of their genetic information, since the child inherits half of its genetic information from each parent. Likewise, siblings also share half of their genetic information. The full rules for computing N under the most general circumstances can be found on page 763, equations (26.16a) and (26.16b), of Lynch and Walsh (1998). For our purposes, we rule out the possibility of inbreeding and limit the family structure to two generations. In the parent generation, all individuals are taken to be unrelated. In the offspring generation, covariance is determined by the family relationships discussed in section 3.2. Under these assumptions, the possible relationships between individuals reduce to parent, child, sibling, half-sibling, and unrelated.
With these assumptions, the amount of genetic information shared between individuals i and j is represented by n_ij and can be computed with the rules:

- Individual i with itself: n_ii = 1
- i is the parent of j: n_ij = 1/2
- i is a child of j: n_ij = 1/2
- i and j are siblings: n_ij = 1/2
- i and j are half-siblings: n_ij = 1/4
- i and j are unrelated: n_ij = 0

3.2 Pedigrees

The family relationships between all the individuals in the study are known as a pedigree and can be visualized with a family tree diagram. These diagrams use the terminology sire and dam for the father and mother, as is common in quantitative genetics. The mathematical notation follows from this, with s(i) and d(i) representing the father (sire) and mother (dam) of individual i. The pedigree for the data in this project has two generations: a parent generation without component of fitness data, and an offspring generation with component of fitness data. The parent generation is used only for the pedigree. There are two assumptions about the pedigree data:
Figure 4: An example pedigree with a parent generation and an offspring generation. Blue squares denote sires, red circles denote dams, and black diamonds denote offspring.

1. There is no inbreeding between individuals.
2. Individuals either have two parents in the pedigree or none.

An example pedigree is shown in figure 4. There are 11 individuals: 3 sires, 3 dams, and 5 offspring. The numerator relationship matrix N for all 11 individuals follows directly from the rules above: each parent has n_ii = 1 with itself, 0 with every other parent, and 1/2 with each of its offspring, while the offspring entries are determined by whether two offspring are full siblings, half-siblings, or unrelated. Since our data only include component of fitness data for the offspring, we can drop the parents from the matrix. The reduced numerator relationship matrix, which only includes information on the five offspring individuals 7, 8, 9, 10, 11, is the bottom-right 5 × 5 block of the original:

        ( 1    1/2  1/4  0    0 )
        ( 1/2  1    1/4  0    0 )
    N = ( 1/4  1/4  1    1/4  0 )
        ( 0    0    1/4  1    0 )
        ( 0    0    0    0    1 )

To summarize, the pedigree graph provides a visual tool with which to calculate the numerator relationship matrix N. This is then used to fully specify the breeding values as a ~ N(0, σ²_A N), where dim(a) = n_ind and dim(N) = n_ind × n_ind.
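The rules above are special cases of a mechanical recursion often called the tabular method (Lynch and Walsh, 1998): list parents before their offspring, set n_jj = 1 in the absence of inbreeding, and for i before j with parents s(j) and d(j) set n_ij = (n_{i,s(j)} + n_{i,d(j)})/2. The sketch below applies it to a hypothetical two-generation pedigree in the spirit of figure 4; the particular sire and dam assignments are made up for illustration.

```python
# Hypothetical pedigree: individuals 0-5 are unrelated parents
# (sires 0-2, dams 3-5); individuals 6-10 are offspring, recorded as
# (sire, dam). Parents precede their offspring in the ordering.
pedigree = {6: (0, 3), 7: (0, 3),   # 6 and 7 are full siblings
            8: (0, 4),              # 8 is a half-sibling of 6 and 7
            9: (1, 4),              # 9 is a half-sibling of 8
            10: (2, 5)}             # 10 is unrelated to the others

n_total = 11
N = [[0.0] * n_total for _ in range(n_total)]

for j in range(n_total):
    for i in range(j):
        if j in pedigree:           # tabular method recursion
            s, d = pedigree[j]
            N[i][j] = N[j][i] = 0.5 * (N[i][s] + N[i][d])
        # founders are unrelated to everyone earlier, so N[i][j] stays 0
    N[j][j] = 1.0                   # no inbreeding

print(N[6][7])    # full siblings share 1/2
print(N[6][8])    # half-siblings share 1/4
print(N[0][6])    # parent and child share 1/2
```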
3.3 Avoiding Inverting the Numerator Relationship Matrix

With the breeding values fully specified, this information can be incorporated into the log likelihood:

    l(ϕ) = l(o + Mβ + Za) + l(a)
         = l(o + Mβ + Za) − (1/(2σ²_A)) aᵀ N⁻¹ a − (1/2) log( σ_A^{2 n_ind} Det(N) )    (22)

The N⁻¹ in equation 22 poses a problem: inverting N is difficult when n_ind is large, and it usually is. This prompts a factorization of a that avoids inverting N. The key is that the dependencies in a involve offspring depending on their sire and dam. Since the joint distribution of a is normal, the univariate conditionals a_i | a_s(i), a_d(i) must also be univariate normal. In Geyer (2012, p. 3), it is shown that

    a_i | a_s(i), a_d(i) ~ N( (a_s(i) + a_d(i))/2, σ²_A/2 )    (23)

Thus the PDF of a can be factored as

    f(a) = Π_{i∈F} (2πσ²_A)^{−1/2} exp( −a_i² / (2σ²_A) ) × Π_{i∉F} (πσ²_A)^{−1/2} exp( −(2a_i − a_s(i) − a_d(i))² / (4σ²_A) )    (24)

where F is the first generation of parents in the pedigree data. If we assume that there is no component of fitness data on this generation, then all individuals in the likelihood come from offspring generations. This further simplifies the PDF of a to

    f(a) = Π_{i∉F} (πσ²_A)^{−1/2} exp( −(2a_i − a_s(i) − a_d(i))² / (4σ²_A) )    (25)

This avoids inverting the numerator relationship matrix. The additive genetic variance parameter σ²_A is sometimes divided into the variance of the genetic contributions coming from the sire, the dam, and the individual itself. This is expressed as

    σ²_A = σ²_ind + σ²_sire + σ²_dam    (26)

where

    σ²_ind = σ²_A / 2,  σ²_sire = σ²_A / 4,  σ²_dam = σ²_A / 4    (27)

4 Data

The primary data used in this project record information on Chamaecrista fasciculata grown at McCarthy Lake, MN. A second population of C. fasciculata grown at the Grey Cloud Dunes is used for prior elicitation. Both data sets feature component of fitness information and a pedigree used to compute the numerator relationship matrix N.

4.1 Why C. fasciculata?

There are several reasons why the C. fasciculata plant is well suited for a Darwinian fitness study.
Annual plants - C. fasciculata are annual plants, meaning each generation lives only one year. This allows experimenters to collect a full generation of data in a single year. In contrast, experimenters are often not able to collect complete data on perennial plants because the lifespan can be too long.

Primarily outcrossing - C. fasciculata are primarily outcrossing, as opposed to self-fertilizing. Outcrossing creates more genetic diversity by passing genes on to different plants, and in the case of C. fasciculata a mechanical mechanism is in place to prevent self-fertilization. Self-fertilization is a complication for individual model research because it can mask the pedigree of an individual: it can be difficult to determine whether an offspring plant was self-fertilized by a sole parent or crossed by a sire and a dam plant. Outcrossing is a desirable trait for individual model research because it avoids this confusion and offers a clear pedigree. In nature, bees land on C. fasciculata flowers to collect nectar; the buzzing of the bees shakes pollen free, which sticks to the bee, and the bee carries the pollen to another plant to fertilize the female ovules. In this study, researchers used a device similar to an electric toothbrush to shake the pollen free from the plants. The pollen was given to the desired dam in order to produce the desired pedigree.

Perfect flowers - C. fasciculata have perfect flowers, which means that a flower includes both male and female reproductive organs. This allows for easier determination of sire and dam plants.

Low seed dormancy tendency - C. fasciculata have little tendency for seed dormancy. In some species, seeds can remain dormant in the soil for multiple years before germinating. This adds a complication in determining the pedigree in individual models. It is possible that C.
fasciculata seeds from previous years found their way into the soil at the experiment growing sites. If these seeds were to sprout it would mix plants from an older generation and unknown parentage with the current generation of plants from the experiment. If this were to happen, the data set would not contain accurate records of the sire and dam. Fortunately, seed dormancy is uncommon for C. fasciculata plants so the recorded sire and dam in the data set can be trusted. Natural range - The natural habitat for C. fasciculata stretches into Minnesota at the northern limit of the range (figure 5). If this were not the case, estimates of Darwinian fitness would be artificially low, since plants would still need to adapt to the new environment. Furthermore, observing non-native plants would complicate genetics by environment interactions. As a native species, C. fasciculata escapes these concerns. 4.2 C. fasciculata Component of Fitness Data The components of fitness chosen for C. fasciculata are based on its life cycle and reproduction. The first (non-initial) component of fitness is germination, a Bernoulli random variable indicating whether a seed successfully germinates and sprouts out of the ground. As a plant continues to grow, it may produce flowers, giving a plant a chance to reproduce. The flower status of a plant is also taken as a Bernoulli random variable and serves as the second component of fitness. Each flower then has the potential to produce fruit. For C. fasciculata this fruit is a pea-pod and is referred to as a pod in this report. The number of pods produced must be at least one. Therefore, it is reasonable to model the pod count as a zero-truncated Poisson random variable. This becomes the third component of fitness. Like all fruit, the C. fasciculata pods contain the seeds of the plant. 
The total number of seeds across all the pods is called the total seed count and is modeled by the fourth and final component of fitness in the aster graph as a zero-truncated Poisson random variable. The resulting graph is visualized in figure 6 along with pictures of a C. fasciculata plant at each stage.

[Figure 5: Native Range of C. fasciculata. Image from plants.usda.gov.]

[Figure 6: C. fasciculata aster graph. Planted (1) -> Germination (G, Ber) -> Flowering Status (FS, Ber) -> Pod Count (PC, 0-Pois) -> Seed Count (SC, 0-Pois).]

4.3 Pedigree Data

The pedigree is divided into the parent generation, for which there is no component of fitness data, and the offspring generation, for which there is component of fitness data. The parent generation contains 48 sires (fathers) and 132 dams (mothers). The offspring generation contains component of fitness and pedigree data on 3445 individuals.
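The component-of-fitness chain in figure 6 can be simulated forward, which is a useful sanity check on the graph structure. The sketch below is Python (the report's software is R) and every rate and function name is hypothetical. Each arrow is a conditional distribution, and the seed-count node given k pods is taken here as the sum of k independent zero-truncated Poisson draws, following the usual aster-model convention that a successor is a sum of predecessor-many independent arrow distributions.

```python
import math
import random

def zero_trunc_poisson(lam, rng):
    # Rejection sampling from a zero-truncated Poisson: redraw until >= 1.
    while True:
        # Knuth's algorithm for an ordinary Poisson draw
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p < L:
                break
            k += 1
        if k >= 1:
            return k

def simulate_plant(p_germ, p_flower, lam_pods, lam_seeds_per_pod, rng):
    # Walk the aster graph of figure 6; a node is zero whenever its
    # predecessor is zero (a seed that never germinates produces nothing).
    germ = 1 if rng.random() < p_germ else 0
    flower = (1 if rng.random() < p_flower else 0) if germ else 0
    pods = zero_trunc_poisson(lam_pods, rng) if flower else 0
    # Seed count given pod count: one zero-truncated Poisson draw per pod.
    seeds = sum(zero_trunc_poisson(lam_seeds_per_pod, rng) for _ in range(pods))
    return germ, flower, pods, seeds
```

By construction a flowering plant has at least one pod, and every pod contributes at least one seed, matching the zero-truncation in the graph.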
5 Bayesian Analysis

Bayesian analysis works by Bayes rule (Bayes and Price, 1763). A prior distribution P(\theta) represents the belief about the parameter before data is observed. A likelihood L(\theta; X) specifies the model by which the data is generated. The prior and likelihood combine to make the posterior distribution, which gives an updated belief about \theta after the data has been observed.

P(\theta \mid X) = \frac{L(\theta; X)\, P(\theta)}{\int L(\theta; X)\, P(\theta)\, d\theta}   (28)

The normalizing constant \int L(\theta; X) P(\theta) d\theta is constant in \theta because \theta has been integrated out. Since the posterior distribution is a valid density, it must integrate to one, and the normalizing constant is exactly the constant factor that makes it do so. As is standard practice, we relax this equation by dropping multiplicative factors that do not involve \theta. The normalizing constant in equation 28 is one such factor, but the likelihood and prior may contain other multiplicative factors not involving \theta; any function of the data only can be dropped. The posterior is now unnormalized, but this will not cause difficulties.

P(\theta \mid X) \propto L(\theta; X)\, P(\theta)   (29)

Here the right-hand side, the likelihood times the unnormalized prior, is called the unnormalized posterior.

In simple cases, the prior may be chosen from the conjugate family of the likelihood. If so, the posterior distribution is in the same distribution family as the prior, and analysis can proceed analytically. In practice, conjugate priors are often undesirable or infeasible: they may not reflect the subject matter knowledge about a parameter, or the likelihood may be a more complicated function with no known conjugate prior. Markov chain Monte Carlo (MCMC) is an alternative approach that approximates the posterior distribution numerically.

Taking the log of the posterior can often reduce difficulties with computer arithmetic overflow.
On the log scale, the same multiplicative constants that were ignored in equation 29 become additive constants that may be safely ignored. The equation becomes

\log P(\theta \mid X) = l(\theta; X) + \log P(\theta)   (30)

where the left-hand side is the log unnormalized posterior, l(\theta; X) is the log likelihood, and \log P(\theta) is the log unnormalized prior.

Under the Bayesian viewpoint, uncertainty is handled by random variables. Since the parameters built into a model are uncertain, a Bayesian maintains that parameters are random variables. Likewise, latent variables and random effects hold uncertainty in their values; the Bayesian addresses this by treating random effects as unknown parameters which are themselves random variables. From this point onward, this report takes the Bayesian view, using the terminology random effect parameters for the breeding values a. With this clarification on the Bayesian perspective, we now turn to the likelihood and priors in more detail.
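As a concrete check that dropping constants is harmless, here is a small Python sketch using a toy Beta-Binomial model (not the aster model). The log unnormalized posterior of equation 30 differs from the exact log posterior only by an additive constant, so differences between any two parameter values agree.

```python
import math

def log_unnorm_posterior(theta, x, n, a, b):
    # log likelihood (Binomial, constants dropped) + log prior (Beta, constants dropped)
    loglik = x * math.log(theta) + (n - x) * math.log(1 - theta)
    logprior = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return loglik + logprior

def log_exact_posterior(theta, x, n, a, b):
    # Conjugacy: the posterior is Beta(a + x, b + n - x), normalized via lgamma.
    a2, b2 = a + x, b + n - x
    lognorm = math.lgamma(a2 + b2) - math.lgamma(a2) - math.lgamma(b2)
    return lognorm + (a2 - 1) * math.log(theta) + (b2 - 1) * math.log(1 - theta)
```

Any MCMC algorithm driven by differences (or ratios) of posterior values is therefore unaffected by the missing normalizing constant.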
5.1 Log Likelihood

The log likelihood l(\varphi) = l(o + M\beta + Za) is computed via the minimal function in the animal package.³ Since a \sim N(0, \sigma_A^2 N), the log likelihood can be expressed as

l(\varphi) = l(\beta, a, \sigma_A^2)   (31)

The minimal function is written to be used with additional random effects, but can be adapted to fit an aster model with only a breeding value random effect. minimal assumes that the model has three random effect variance parameters, one each for the individual, sire, and dam random effect components. Adapting this to a model with a single variance parameter for the breeding values requires the constraint that \sigma_A^2 = \sigma_{ind}^2 + \sigma_{sire}^2 + \sigma_{dam}^2, where \sigma_{ind}^2 = \sigma_A^2/2, \sigma_{sire}^2 = \sigma_A^2/4, and \sigma_{dam}^2 = \sigma_A^2/4.

The first random effect aster paper (Geyer et al., 2013) discusses another complication in the likelihood. We would like to allow random effect variance parameters to be zero. Modeling the standard deviation parameters instead allows the standard deviation to be negative or zero and removes the restriction that the variance be positive. In addition, the minimal function avoids problems with taking the square root of negative numbers by modeling the additive genetic standard deviation instead of the additive genetic variance (Geyer et al., 2013, p. 1783). The resulting log likelihood becomes

l(\varphi) = l(\beta, a, \sigma_A) = l(\beta, a, \sigma_{ind}, \sigma_{sire}, \sigma_{dam})   (32)

5.2 Log Priors

The parameters in this model come from the aster submodel, \varphi = o + M\beta + Za. \beta is the parameter vector for fixed effects; it includes both component of fitness parameters and block effect parameters. The random effect parameter vector a includes one breeding value for each individual in the data. a has one associated variance parameter, \sigma_A^2, the additive genetic variance. Rather than model \sigma_A^2 directly, we chose to model the standard deviation \sigma_A.
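The variance-splitting constraint can be checked numerically. A minimal Python sketch (the report's computation is in R, and the helper name here is hypothetical):

```python
import math

def split_sigma_A(sigma_A):
    # Split the additive genetic standard deviation into individual, sire,
    # and dam components so that the variances satisfy
    # sigma_ind^2 = sigma_A^2 / 2 and sigma_sire^2 = sigma_dam^2 = sigma_A^2 / 4.
    return sigma_A / math.sqrt(2.0), sigma_A / 2.0, sigma_A / 2.0
```

The three component variances then sum back to the additive genetic variance, as the constraint requires.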
Priors were chosen under the assumption of independence between the components.

5.2.1 Fixed Effect Parameters

Each component of \beta has an independent logistic distribution prior. The logistic distribution is a location-scale family. We use \mu_{hp} to represent the location parameter and \sigma_{hp} to represent the scale parameter, where the subscript hp indicates that \mu_{hp} and \sigma_{hp} are hyper-parameters and are not the same as the \mu or \sigma used elsewhere in this report.

f_{\mu_{hp}, \sigma_{hp}}(\beta_i) = \frac{e^{-(\beta_i - \mu_{hp})/\sigma_{hp}}}{\sigma_{hp} \left(1 + e^{-(\beta_i - \mu_{hp})/\sigma_{hp}}\right)^2}   (33)

Here \mu_{hp} really is the mean of \beta_i, but \sigma_{hp} is not the variance. Rather, Var(\beta_i) = \sigma_{hp}^2 \pi^2 / 3. Computation is performed via the dlogis(..., log = TRUE) function in R.

³ The minimal function computes what frequentists know as the complete data log likelihood, i.e. the log likelihood if the random effects a could be observed. Under the Bayesian perspective, this is just the log likelihood.
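The logistic log density of equation 33 is straightforward to evaluate directly. Below is a Python analogue of R's dlogis(..., log = TRUE) (a sketch; the report itself calls the R function), together with a numerical check of the variance relation Var(\beta_i) = \sigma_{hp}^2 \pi^2 / 3.

```python
import math

def log_dlogis(x, mu_hp, sigma_hp):
    # Log of equation 33. With z = (x - mu_hp) / sigma_hp the density is
    # exp(-z) / (sigma_hp * (1 + exp(-z))^2), so the log is:
    z = (x - mu_hp) / sigma_hp
    return -z - math.log(sigma_hp) - 2.0 * math.log1p(math.exp(-z))
```

The density is symmetric about mu_hp, so mu_hp really is the mean, while the scale enters the variance through the factor pi^2 / 3.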
5.2.2 Random Effect Parameters

The only random effect parameter vector is the vector of breeding values, specified by a \sim N(0, \sigma_A^2 N). Since N, the numerator relationship matrix, is known, the only other parameter for the random effects is the additive genetic variance, \sigma_A^2. In principle, the additive genetic variance could be zero or any positive number, and this must be reflected in the choice of prior distribution. The PDF of an exponential distribution, f_\lambda(x) = \lambda e^{-\lambda x}, has support on the positive real numbers. Moreover, \lim_{x \to 0} f_\lambda(x) = \lambda > 0, so it is realistic to observe samples from an exponential distribution with value arbitrarily close to zero. This matches the modeling assumptions.

Rather than model the additive genetic variance \sigma_A^2 directly, we use the additive genetic standard deviation \sigma_A. If

\sigma_A^2 \sim \mathrm{Exp}(\lambda)   (34)

then the change of variable formula gives

f_\lambda(\sigma_A) = 2 \sigma_A \lambda e^{-\lambda \sigma_A^2}   (35)

Taking the log and dropping additive constant terms gives

logprior(\sigma_A) = \log(\sigma_A) - \lambda \sigma_A^2   (36)

When the log prior for \sigma_A is programmed in R, care must be taken to ensure computer arithmetic errors do not occur. For example, if equation 36 were used as is, R would encounter an overflow error when \sigma_A is too large. R would compute equation 36 as

logprior(\sigma_A) = \log(\sigma_A) - \lambda \sigma_A^2 = \log(\mathrm{Inf}) - \lambda\, \mathrm{Inf}^2 = \mathrm{Inf} - \mathrm{Inf} = \mathrm{NaN}   (37)

but the desired behavior is logprior(\sigma_A) = -Inf, since \lim_{\sigma_A \to \infty} logprior(\sigma_A) = -\infty. One solution is to return -Inf whenever the result of logprior(\sigma_A) is NaN or Inf.

5.2.3 Prior Elicitation

Researchers grew C. fasciculata plants at four different growing locations in Minnesota. The McCarthy Lake growing site is the primary location of interest, so the analysis was performed using the data from that site. Biologists consider growing conditions at Grey Cloud Dunes to be most similar to those at McCarthy Lake. This makes the Grey Cloud Dunes data useful for prior elicitation.
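The NaN/Inf guard described above can be sketched in Python (the report implements this in R; the function name is hypothetical):

```python
import math

def log_prior_sigma_A(sigma_A, lam):
    # Equation 36 up to an additive constant: log(sigma_A) - lam * sigma_A^2.
    # Non-positive values are outside the support of the prior.
    if sigma_A <= 0.0:
        return -math.inf
    val = math.log(sigma_A) - lam * (sigma_A * sigma_A)
    # Guard (equation 37): for very large sigma_A the arithmetic can produce
    # inf - inf = nan, or overflow to -inf; the correct limiting value is -inf.
    if math.isnan(val) or math.isinf(val):
        return -math.inf
    return val
```

The quadratic penalty always dominates the logarithm for large sigma_A, so replacing NaN or Inf results with -Inf recovers the correct limit rather than poisoning the MCMC acceptance ratio.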
The means and variances of each fixed effect were calculated on the Grey Cloud Dunes data, then transformed into the \mu_{hp} and \sigma_{hp} used as hyper-parameters for the logistic fixed effect priors. Researchers also came up with a point estimate for the additive genetic variance parameter to be used for the random effect; setting \lambda = 1/(\text{point estimate}) then specifies the prior distribution for \sigma_A.

6 Computation via MCMC

We used Markov chain Monte Carlo to sample from the posterior distribution. The code in this report uses the metrop function in the R package mcmc, which implements the Metropolis random-walk algorithm. This section follows a discussion of MCMC from Geyer (2011) and explains how this was implemented for the aster model fit on the C. fasciculata data.
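The transformation from empirical moments to hyper-parameters can be read off the variance relation stated in section 5.2.1. A minimal Python sketch, assuming the transformation is moment matching (the report does not spell out the formula, and the function names here are hypothetical):

```python
import math

def logistic_hyperparams(sample_mean, sample_var):
    # Moment matching for the logistic prior: the mean equals mu_hp and the
    # variance equals sigma_hp^2 * pi^2 / 3, so invert for sigma_hp.
    return sample_mean, math.sqrt(3.0 * sample_var) / math.pi

def exponential_rate(point_estimate):
    # Exponential prior rate whose mean equals the elicited point estimate.
    return 1.0 / point_estimate
```

With these helpers, each Grey Cloud Dunes fixed-effect estimate maps directly to a (mu_hp, sigma_hp) pair, and the elicited point estimate maps to lambda.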
6.1 Why MCMC?

Both Bayesian and frequentist approaches commonly encounter integrals that cannot be solved analytically. One place where these integrals arise is the likelihood normalizing constant. Another place is in drawing inference. Two common examples are the posterior expectation

E(\theta \mid X) = \int_\Theta \theta\, P(\theta \mid X)\, d\theta

and marginal posterior distributions, which are obtained by integrating undesired components of \theta out of the posterior distribution. MCMC offers the ability to avoid computing these integrals. Integrating to find the normalizing constant is avoided with a cancellation trick, and integrals that arise in the inference step can be approximated using the samples produced by MCMC rather than numerical integration.

Aster models with random effects must address these issues. With a handful of random effects, aster models may use an approximate integrated likelihood with the Breslow-Clayton approximation (Breslow and Clayton, 1993). This approach assumes that the likelihood is nearly quadratic in the random effect parameters and approximates the likelihood with a form that can be integrated analytically. It is implemented in the R package aster (Geyer, 2017a) through the function reaster. However, in high dimensional settings the Breslow-Clayton approximation breaks down and other methods are needed. Geyer et al. (2013, p. 1793) point out that the Breslow-Clayton approximation is not workable for quantitative genetic models with one random effect parameter per individual.

Numerical integration could offer a solution to these integrals, but this technique comes with its own issues. In equation 28 we saw that the posterior distribution can be decomposed into the likelihood times the prior divided by the normalizing constant.

P(\theta \mid X) = \frac{L(\theta; X)\, P(\theta)}{\int_\Theta L(\theta; X)\, P(\theta)\, d\theta}

Computing the normalizing constant \int_\Theta L(\theta; X) P(\theta) d\theta requires integrating over each dimension of the vector \theta.
We will illustrate the computational complexity of this integration in the simplest case possible, when each component of \theta is binary. Suppose \dim(\theta) = d. Then the integral becomes a summation over the d dimensions of \theta.

\int_\Theta L(\theta; X)\, P(\theta)\, d\theta = \sum_{\theta_1 \in \{0,1\}} \cdots \sum_{\theta_d \in \{0,1\}} L(\theta_1, \ldots, \theta_d; X)\, P(\theta_1, \ldots, \theta_d)   (38)

There are d components of \theta taking 2 values each, so there are 2^d terms in the sum. This is an exponential time algorithm and quickly gets out of hand: if d = 30, there are over a billion terms in the sum. This phenomenon of rapidly increasing computation with the dimension of \theta is known as the curse of dimensionality, and it prevents direct computation of the integral in a reasonable amount of time.

In cases where components of \theta are continuous rather than binary, numerical integration techniques such as the trapezoid rule or Simpson's rule suffer even more profoundly from the curse of dimensionality. These methods create a grid over the dimensions of \theta; if there are k grid points in each dimension, the computation runs in exponential time O(k^d). More advanced numerical integration methods such as sparse grids (Heiss and Winschel, 2008) can improve this running time to polynomial in d, but this still does not address the need to compute both the normalizing constant and the other integrals needed for inference.

Another approach, peculiar to the Bayesian perspective, is to use conjugate priors: a prior distribution chosen for the property that the posterior and the prior belong to the same family of distributions. For many simple likelihoods the posterior can be determined analytically to belong to the same family as the prior. For instance, when the likelihood is a binomial
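A minimal Python sketch of the 2^d enumeration in equation 38 makes the blow-up concrete (the joint function here is a hypothetical stand-in for L(\theta; X) P(\theta)):

```python
from itertools import product

def brute_force_normalizing_constant(unnorm_joint, d):
    # Enumerate all 2^d binary parameter vectors and sum the unnormalized
    # joint L(theta; X) * P(theta) over them, counting terms as we go.
    total, n_terms = 0.0, 0
    for theta in product((0, 1), repeat=d):
        total += unnorm_joint(theta)
        n_terms += 1
    return total, n_terms
```

Every extra component doubles n_terms, which is exactly why this enumeration is hopeless at d = 30 and beyond.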
distribution, setting the prior to follow a beta distribution guarantees that the posterior is also a beta distribution. However, no conjugate prior distributions are known for aster models with random effects for each individual. Consequently, we turn to MCMC to draw samples from the posterior distribution without actually knowing what that distribution is. The MCMC sampling algorithm uses a cancellation trick to produce samples from the posterior distribution without computing the normalizing constant.

6.2 Markov Chains

From a Bayesian perspective, data is not random once it has been seen. Instead, the uncertainty in the model comes from the parameter vector \theta. Thus \theta, not X, is random. A Markov chain is a sequence of random vectors \theta_1, \theta_2, \ldots having the property that the conditional distribution of each state depends only on the most recent state in the chain, not on all the previous states. That is,

P(\theta_{n+1} \mid \theta_1, \theta_2, \ldots, \theta_n) = P(\theta_{n+1} \mid \theta_n), \quad n = 1, 2, 3, \ldots   (39)

Equation 39 describes the memoryless property, so named because the process that determines \theta_{n+1} has no memory of the previous states \theta_1, \ldots, \theta_{n-1}. If further we have that P(\theta_2 \mid \theta_1) = P(\theta_3 \mid \theta_2) = \cdots = P(\theta_{n+1} \mid \theta_n), then the Markov chain has stationary transition probabilities. Markov chains with stationary transition probabilities are determined by two simpler distributions: the initial distribution P(\theta_1) and the transition probability distribution P(\theta_{n+1} \mid \theta_n). When the initial distribution and transition probability distribution interact so that the marginal distributions are equal, P(\theta_1) = P(\theta_2) = \cdots = P(\theta_n), we say that the Markov chain is in equilibrium and P(\theta_i) is the equilibrium distribution.

Often in MCMC simulations, the main interest for inference is not in the Markov chain itself, but in a particular function on the state space, g(\cdot). That is,

Raw MC: \theta_1, \theta_2, \ldots
Functional MC: g(\theta_1), g(\theta_2), \ldots   (40)
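A toy illustration of an equilibrium distribution, using a two-state chain (not part of the report's model): starting the chain in the distribution \pi satisfying \pi P = \pi leaves every subsequent marginal equal to \pi.

```python
def stationary_two_state(p01, p10):
    # Closed-form solution of pi P = pi for a two-state chain, where
    # p01 = P(next = 1 | current = 0) and p10 = P(next = 0 | current = 1).
    return (p10 / (p01 + p10), p01 / (p01 + p10))

def step(pi, p01, p10):
    # One application of the transition kernel to a marginal distribution.
    pi0, pi1 = pi
    return (pi0 * (1.0 - p01) + pi1 * p10, pi0 * p01 + pi1 * (1.0 - p10))
```

Applying the kernel to the stationary distribution returns it unchanged, which is exactly the equilibrium property described above.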
6.3 Monte Carlo

A Monte Carlo method is a way of understanding a random variable \theta through simulated data \theta_1, \theta_2, \ldots, \theta_n. The simplest case is known as ordinary Monte Carlo, where \theta_1, \theta_2, \ldots are iid draws from f_\theta(\theta). Monte Carlo uses the law of large numbers to show that the sample average converges to its expected value. That is, for a function g(\cdot),

\overline{g(\theta)}_n = \frac{1}{n} \sum_{i=1}^n g(\theta_i) \to E(g(\theta))   (41)

Since the samples \theta_i on the left-hand side are readily available, Monte Carlo is effective when the expectation involves an integral that is difficult to compute numerically. Furthermore, the central limit theorem gives the approximate normal distribution

\overline{g(\theta)}_n \approx N\left(E(g(\theta)), \frac{\sigma^2}{n}\right)   (42)
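Ordinary Monte Carlo is a few lines of code. The sketch below (Python; a toy example with hypothetical names, not the report's R code) estimates E(g(\theta)) by the sample average of equation 41 and reports the CLT standard error implied by equation 42.

```python
import math
import random

def monte_carlo_mean(g, sampler, n, rng):
    # Ordinary Monte Carlo (equation 41): average g over n iid draws, and
    # report the CLT standard error sqrt(sample variance of g / n) (equation 42).
    draws = [g(sampler(rng)) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / (n - 1)
    return mean, math.sqrt(var / n)
```

For example, with theta ~ N(0, 1) and g(theta) = theta^2, the estimate converges to E(theta^2) = 1, and the reported standard error shrinks at rate 1/sqrt(n).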
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School
More informationMetropolis-Hastings Algorithm
Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to
More informationRepresent processes and observations that span multiple levels (aka multi level models) R 2
Hierarchical models Hierarchical models Represent processes and observations that span multiple levels (aka multi level models) R 1 R 2 R 3 N 1 N 2 N 3 N 4 N 5 N 6 N 7 N 8 N 9 N i = true abundance on a
More informationBayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence
Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns
More informationOne-parameter models
One-parameter models Patrick Breheny January 22 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/17 Introduction Binomial data is not the only example in which Bayesian solutions can be worked
More informationIntroduction to Genetics
Introduction to Genetics The Work of Gregor Mendel B.1.21, B.1.22, B.1.29 Genetic Inheritance Heredity: the transmission of characteristics from parent to offspring The study of heredity in biology is
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative
More informationProbability Review - Bayes Introduction
Probability Review - Bayes Introduction Statistics 220 Spring 2005 Copyright c 2005 by Mark E. Irwin Advantages of Bayesian Analysis Answers the questions that researchers are usually interested in, What
More informationEstimation of Operational Risk Capital Charge under Parameter Uncertainty
Estimation of Operational Risk Capital Charge under Parameter Uncertainty Pavel V. Shevchenko Principal Research Scientist, CSIRO Mathematical and Information Sciences, Sydney, Locked Bag 17, North Ryde,
More informationBayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference
Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Osnat Stramer 1 and Matthew Bognar 1 Department of Statistics and Actuarial Science, University of
More information9-1 The Work of Gregor
9-1 The Work of Gregor 11-1 The Work of Gregor Mendel Mendel 1 of 32 11-1 The Work of Gregor Mendel Gregor Mendel s Peas Gregor Mendel s Peas Genetics is the scientific study of heredity. Gregor Mendel
More informationRobert Collins CSE586, PSU Intro to Sampling Methods
Robert Collins Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Robert Collins A Brief Overview of Sampling Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling
More informationBayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference
1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE
More informationComputing Likelihood Functions for High-Energy Physics Experiments when Distributions are Defined by Simulators with Nuisance Parameters
Computing Likelihood Functions for High-Energy Physics Experiments when Distributions are Defined by Simulators with Nuisance Parameters Radford M. Neal Dept. of Statistics, University of Toronto Abstract
More informationSTA 294: Stochastic Processes & Bayesian Nonparametrics
MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a
More informationBayes: All uncertainty is described using probability.
Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w
More informationCSC 2541: Bayesian Methods for Machine Learning
CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationBayesian Inference and MCMC
Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the
More informationDirected Graphical Models
CS 2750: Machine Learning Directed Graphical Models Prof. Adriana Kovashka University of Pittsburgh March 28, 2017 Graphical Models If no assumption of independence is made, must estimate an exponential
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationBiol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference
Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference By Philip J. Bergmann 0. Laboratory Objectives 1. Learn what Bayes Theorem and Bayesian Inference are 2. Reinforce the properties of Bayesian
More informationGuided Reading Chapter 1: The Science of Heredity
Name Number Date Guided Reading Chapter 1: The Science of Heredity Section 1-1: Mendel s Work 1. Gregor Mendel experimented with hundreds of pea plants to understand the process of _. Match the term with
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationGraphical Models and Kernel Methods
Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationEE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS
EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS Lecture 16, 6/1/2005 University of Washington, Department of Electrical Engineering Spring 2005 Instructor: Professor Jeff A. Bilmes Uncertainty & Bayesian Networks
More informationSampling Algorithms for Probabilistic Graphical models
Sampling Algorithms for Probabilistic Graphical models Vibhav Gogate University of Washington References: Chapter 12 of Probabilistic Graphical models: Principles and Techniques by Daphne Koller and Nir
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More informationMA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems
MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric
More informationProbabilistic Graphical Networks: Definitions and Basic Results
This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical
More informationGeneral Bayesian Inference I
General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for
More information11.1 Traits. Studying traits
11.1 Traits Tyler has free earlobes like his father. His mother has attached earlobes. Why does Tyler have earlobes like his father? In this section you will learn about traits and how they are passed
More informationFigure 36: Respiratory infection versus time for the first 49 children.
y BINARY DATA MODELS We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence, and parameter interpretation. We again consider mixed effects
More informationBiol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016
Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016 By Philip J. Bergmann 0. Laboratory Objectives 1. Learn what Bayes Theorem and Bayesian Inference are 2. Reinforce the properties
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More information1. What is genetics and who was Gregor Mendel? 2. How are traits passed from one generation to the next?
Chapter 11 Heredity The fruits, vegetables, and grains you eat are grown on farms all over the world. Tomato seeds produce tomatoes, which in turn produce more seeds to grow more tomatoes. Each new crop
More information