A Statistical Framework for Expression Trait Loci (ETL) Mapping. Meng Chen


Prelim Paper in partial fulfillment of the requirements for the Ph.D. program in the Department of Statistics, University of Wisconsin-Madison

Committee Members: Professor Christina Kendziorski, Professor Alan Attie, Professor Michael Newton, Professor Brian Yandell

Contents

1 Introduction
2 ETL mapping experiments
3 QTL Mapping Methods
    Single Phenotype - Single QTL Models
    Single Phenotype - Multiple QTL Models
    Multiple Phenotype - Single or Multiple QTL Models
4 ETL Mapping Methods
    Transcript Based Approach
    Transcript Based Approach with FDR control
    Marker Based Approach
    Mixture Over Markers Model
    Current Status of ETL Mapping Methods
5 Research Plan
    ETL Interval Mapping
    Pseudomarker-MOM
    Two-Stage Approach
    Theoretical Result
    Simulation Results
    Multiple ETL mapping
    Simulation Results
6 Future Research Questions
References
Appendix

1 Introduction

Identifying the genetic loci responsible for variation in quantitative traits is of great importance to biologists. Quantitative trait loci (QTL) mapping studies have been carried out for over 80 years, starting with Sax (1923), who proposed that the association between seed weight and seed coat color in beans was due to linkage between the genes controlling weight and the genes controlling color (see also Rasmusson 1933; Thoday 1961). Nevertheless, the vast majority of studies have taken place in the last 20 years. The increased rate was due largely to two major advances in the 1980s: the advent of restriction fragment length polymorphisms (RFLPs) (Botstein et al. 1980), which made it possible to genotype markers on a large scale, and the advent of statistical methods for data analysis (Lander and Botstein 1989).

A recent advance of comparable significance has been made in the area of phenotyping. With high throughput technologies now widely available, investigators can measure thousands of phenotypes at once. Gene expression measurements are particularly amenable to QTL mapping, and much excitement abounds for this field of genetical genomics (Jansen and Nap 2001; Jansen 2003; Cox 2004; Broman 2005). The so-called expression QTL (eQTL) or expression trait loci (ETL) studies have been used to identify candidate genes (Dumas et al. 2000; Eaves et al. 2002; Karp et al. 2000; Wayne et al. 2003; Schadt et al. 2003; Bystrykh et al. 2005; Hubner et al. 2005), to infer not only correlative but also causal relationships among modulator and modulated genes (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003), to better define traditional phenotypes (Schadt et al. 2003), and to serve as a bridge between genetic variation and the traditional complex traits of interest (Schadt et al. 2003). Although successful in many ways, the results obtained from ETL studies to date are limited.
In the early published studies, the ETL mapping problem was addressed by treating each transcript separately as a phenotype for QTL mapping. Single trait QTL analysis was then carried out thousands of times (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003). Notably, although adjustments were made for multiple tests across the genome, no adjustments were considered for multiple tests across transcripts. There are hundreds of test locations across the genome but tens of thousands of transcripts, leading to a potentially serious multiple testing problem and an inflated false discovery rate (FDR). For some labs, an inflated FDR is tolerable, as many genes can be tested quickly for certain properties and discarded if found to be false positives. However, for many labs, such tests are prohibitively expensive. Statistical methods that control error rates and that are more sensitive and more specific are needed.

In a few recent studies, there has been some effort to account for both sets of multiplicities (Chesler et al. 2005; Hubner et al. 2005; Bystrykh et al. 2005). Permutation tests were performed to derive the genome-wide LOD score threshold, and q-values were computed for the set of transcripts declared significant using the corresponding genome-wide empirical p-values. As discussed in Section 4.2, this last approach may not properly control FDR and may suffer from very low power.

The main aim of the proposed thesis is to develop a statistical framework for ETL mapping that properly accounts for multiplicities while maintaining or improving upon the operating characteristics of currently used approaches. Section 2 provides a brief background on the questions addressed and data collected in ETL mapping experiments. Statistical methods for ETL mapping are reviewed in Section 4. As discussed there, the mixture over markers (MOM) model is the only statistically rigorous ETL mapping method developed to account for multiplicities across both markers and transcripts. However, MOM has a number of shortcomings. For experiments with sparse maps, the MOM model is lacking because information between markers is not available. When dense maps are available, the MOM model by itself may not be applicable because the number of mixture components is too large to fit. Finally, MOM does not allow for multiple ETL. Statistical methods to address each of these shortcomings are detailed in Section 5, and preliminary results are demonstrated on simulated data and on data from a study of diabetes in mice. Future directions are discussed in Section 6.

2 ETL mapping experiments

The data collected in an ETL mapping experiment consist of a genetic map, marker genotypes, and microarray data (phenotypes) collected on a set of individuals. A genetic marker is a region of the genome of known location.
These locations make up the genetic map. The distance between markers is given as a genetic distance, in units of centimorgans (cM), defined so that 1 cM corresponds to an expected 0.01 crossovers between the two loci in a single meiosis. At each marker, genotypes are obtained. ETL mapping studies take place in both human and experimental populations. We focus on the latter. For these populations, the set of possible marker genotypes is simplified. For example, studies with experimental populations most often involve arranging a cross between two inbred strains differing substantially in some trait of interest to produce F1 offspring. Segregating progeny are then typically derived from a B1 backcross (F1 x Parent) or an F2 intercross (F1 x F1). Repeated intercrossing (Fn x Fn) can also be done to generate so-called recombinant inbred (RI) lines.

For simplicity of notation, we focus on a backcross population. Consider two inbred parental populations $P_1$ and $P_2$, genotyped as AA and aa, respectively, at $M$ markers. The offspring of the first generation ($F_1$) have genotype Aa at each marker (allele A from parent $P_1$ and allele a from parent $P_2$). In a backcross, the $F_1$ offspring are crossed back to a parental line, say $P_1$, resulting in a population with genotype AA or Aa at a given marker. We denote AA by 0 and Aa by 1.

For each member of the backcross population, phenotypes are collected via microarrays. Microarrays allow us to take a snapshot of the expression of thousands of genes at the same time. Oligonucleotide and cDNA microarrays are the two most widely used technologies. A nice review of microarray technologies can be found in Nguyen et al. (2002). We present a very brief, by no means complete, review here.

Affymetrix is one company that produces oligonucleotide chips, which contain tens of thousands of probe sets, or DNA sequences related to a gene. We will refer to these sequences throughout this paper as transcripts. Each gene is represented by some number (usually 11-20) of features. Many Affymetrix arrays use 20 features (Nguyen et al. 2002). Each feature is a short sequence of oligonucleotides. Present in the features are pairs of perfect match (PM) and mismatch (MM) sequences. The PM is a segment of the gene, 25 nucleotides in length; the corresponding MM is identical to the PM except at the middle (i.e., 13th) position. After some pre-processing and normalization, one summary score is derived for each probe set. There are a number of methods for processing the probe set intensities and for normalization, such as DNA-Chip Analyzer by Li and Wong (2001) and Robust Multi-array Analysis (RMA) by Irizarry et al. (2003).
In a cDNA array experiment, a gene is represented by a long cDNA fragment (500 to 1000 bases). The experimental sample of interest is often labeled with a red fluorescent dye, and a reference sample is labeled green. The amount of cDNA hybridized to each probe is captured by an imaging device that measures the fluorescent intensity. Image files are processed to give a summary expression score, $\log_2(R/G)$. Yang et al. (2002) propose methods for cDNA array data normalization and compare their methods with a number of other approaches. With proper pre-processing and normalization, either technology yields a single summary score of expression for each transcript on each array.
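To make the backcross genotype coding of this section concrete, the following minimal sketch simulates 0/1 marker genotypes (0 = AA, 1 = Aa) along one chromosome. It is an illustration only: the Haldane map function, the no-interference assumption, the marker positions, and the function names are assumptions introduced here and are not specified in this paper.

```python
import math
import random

def haldane_recomb_fraction(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def simulate_backcross(n, marker_pos_cm, rng):
    """Return n rows of 0/1 genotypes (0 = AA, 1 = Aa), one column per marker."""
    genotypes = []
    for _ in range(n):
        row = [rng.randint(0, 1)]  # first marker: AA or Aa with probability 1/2
        for left, right in zip(marker_pos_cm, marker_pos_cm[1:]):
            r = haldane_recomb_fraction(right - left)
            recombined = rng.random() < r  # a recombination flips the genotype
            row.append(1 - row[-1] if recombined else row[-1])
        genotypes.append(row)
    return genotypes

# Hypothetical map: four markers at 0, 10, 25 and 60 cM.
rng = random.Random(1)
geno = simulate_backcross(200, [0.0, 10.0, 25.0, 60.0], rng)
```

Nearby markers share a genotype more often than distant ones, which is the information that interval-based methods exploit between markers.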

3 QTL Mapping Methods

As noted above, ETL studies are very similar to QTL studies, but with thousands of phenotypes. Perhaps it is not surprising, then, that early ETL studies repeatedly applied methods developed for QTL mapping to each transcript. The literature on QTL mapping methods is quite large. We review here only those methods relevant to this proposal and refer the interested reader to Doerge et al. (1997) or Lynch and Walsh (1998) for more information on QTL mapping methods.

3.1 Single Phenotype - Single QTL Models

Consider a backcross with $n$ progeny, with univariate phenotypes $y_j$ measured on all individuals, $j = 1, \ldots, n$, together with genotypes for a set of $M$ markers. Let $m_{ij} = 0$ or $1$ according to whether individual $j$ has genotype AA or Aa at the $i$th marker, $i = 1, \ldots, M$. The simplest method to test for trait-marker association is marker regression (MR), which tests for a difference in mean trait value between the genotype groups at a particular marker. Specifically, for a test at the $i$th marker, the single QTL model is

$$ y_j = \mu + \beta_i m_{ij} + \epsilon_j \tag{3.1} $$

where the $\epsilon_j$ are independent and identically distributed (iid) as Normal$(0, \sigma^2)$, and one can test $H_0: \beta_i = 0$ vs. $H_1: \beta_i \neq 0$, or equivalently $H_0: \mu_0 = \mu_1$ vs. $H_1: \mu_0 \neq \mu_1$, where $\mu_0 = \mu$ and $\mu_1 = \mu + \beta_i$. This is equivalent to an analysis of variance (ANOVA) at each marker (Soller et al. 1976) when there are more than two marker genotype groups, and to a t-test when there are two genotype groups, as in a backcross. Usually, instead of F or t-statistics, geneticists prefer to report a LOD score, which is defined as the base-10 logarithm of the likelihood ratio comparing the two hypotheses. A LOD score is calculated at each marker position, and marker loci giving significant LOD scores are identified as putative QTLs. For these putative QTLs, we loosely say the phenotype is linked to them. This approach is conceptually very simple, but it clearly has problems (Lander and Botstein 1989).
First, if the true QTL is not located exactly at the marker, its effect will likely be underestimated because of recombination between the marker and the true QTL. Second, because the QTL effect is confounded with its distance from the marker, the power for QTL detection decreases, especially when the markers are widely spaced, so that more individuals are required for the test. Third, this approach considers one marker locus at a time, which is not very powerful compared with multiple QTL models when more than one QTL is present.

Lander and Botstein (1989) proposed interval mapping (IM), which addresses the first two problems above. Their approach also assumes a single segregating QTL, but allows for tests between markers, where genotypes are not known. Specifically, for a backcross, the proposed model to test for a QTL in the interval between markers $i$ and $i+1$ is

$$ y_j = \mu + \beta m_{kj} + e_j \tag{3.2} $$

where $m_{kj}$ is the genotype of individual $j$ at the test position $k$ between markers $i$ and $i+1$. It takes value 0 or 1 with probability depending on the genotypes of the flanking markers and the test position (see Table 1), and $\beta$ is the effect of the putative QTL. Technically, (3.2) is a mixture of two normal distributions, since

$$ p(y_j) = p(m_{kj} = 0 \mid m, r)\, p(y_j \mid m_{kj} = 0) + p(m_{kj} = 1 \mid m, r)\, p(y_j \mid m_{kj} = 1) $$

(here, $m$ denotes the flanking marker genotypes and $r$ the flanking marker distances). As in MR, tests are done at each location to test $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$. The test compares the hypothesis of a single QTL at the current locus to the null hypothesis of no QTL. The two likelihood functions must be maximized over their respective parameters. The procedure described above is repeated for each locus in the genome. In practice, test loci are set up every 1 cM or at some other user-defined spacing. The likelihood under the alternative varies with the test locus, so the EM algorithm (Dempster et al. 1977) must be applied at each locus. A LOD score profile can be constructed by plotting the LOD scores against the test positions. The LOD score is then compared to a genome-wide threshold. Wherever the LOD profile exceeds this threshold, we infer that a QTL exists. Generally, the genome-wide threshold is taken to be the 95th percentile of the distribution of the maximum (genome-wide) LOD score under the null hypothesis of no segregating QTLs.
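As an illustration of the LOD score and the genome-wide threshold just described, here is a minimal sketch that computes marker-regression LOD scores under model (3.1) and approximates the 95th percentile of the null distribution of the maximum LOD by permuting phenotypes. The permutation approach is one of several ways to obtain the threshold; the function names, the toy data, and the use of independent (unlinked) toy markers are assumptions made only for this example.

```python
import math
import random

def lod_marker_regression(y, m):
    """LOD = (n/2) * log10(RSS0 / RSS1) for the model y_j = mu + beta * m_j + e_j."""
    n = len(y)
    ybar = sum(y) / n
    rss0 = sum((v - ybar) ** 2 for v in y)  # residual sum of squares under H0
    rss1 = 0.0                              # residual sum of squares under H1
    for g in (0, 1):
        grp = [v for v, mj in zip(y, m) if mj == g]
        if grp:
            gbar = sum(grp) / len(grp)
            rss1 += sum((v - gbar) ** 2 for v in grp)
    return 0.5 * n * math.log10(rss0 / rss1)

def max_lod(y, geno):
    """Maximum LOD over markers; geno[j][i] is the genotype of individual j at marker i."""
    return max(lod_marker_regression(y, [row[i] for row in geno])
               for i in range(len(geno[0])))

def permutation_threshold(y, geno, n_perm, rng, level=0.95):
    """Empirical quantile of the maximum LOD when phenotypes are shuffled."""
    y_perm = list(y)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any phenotype-genotype association
        null_max.append(max_lod(y_perm, geno))
    null_max.sort()
    return null_max[min(n_perm - 1, int(math.ceil(level * n_perm)) - 1)]

# Toy data: 60 individuals, 5 unlinked markers, a QTL effect at marker index 2.
rng = random.Random(0)
geno = [[rng.randint(0, 1) for _ in range(5)] for _ in range(60)]
y = [1.0 * row[2] + rng.gauss(0.0, 1.0) for row in geno]
threshold = permutation_threshold(y, geno, 200, rng)
observed = max_lod(y, geno)
```

Comparing `observed` with `threshold` mirrors the comparison of the LOD profile with the genome-wide cutoff; in a real analysis the same permutation would be applied to the full interval mapping profile rather than to marker positions only.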
Much effort has been expended to derive the appropriate genome-wide LOD score cutoff value (Churchill and Doerge 1994; Dupuis et al. 1995; Dupuis and Siegmund 1999; Feingold et al. 1993; Lander and Botstein 1989; Rebai et al. 1994; Rebai et al. 1995). The major advantage of IM over MR at marker loci is that it gives more precise estimates of the QTL locations and effects. However, the computational cost of IM is higher, and in the case of dense genetic markers and complete genotype data, the advantage tends to be small. More importantly, IM, like MR at marker loci, is still a single-QTL model.
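The interval mapping calculation at a single test position can be sketched as follows. The sketch computes the conditional probability of the QTL genotype given the flanking marker genotypes (the quantities tabulated in Table 1) and then runs EM for the two-component normal mixture of model (3.2). It assumes no crossover interference, a common variance in both genotype groups, and a fixed number of EM iterations; the function names and the toy data are hypothetical.

```python
import math

def haldane(d_cm):
    """Recombination fraction for a distance in cM (Haldane map function, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def qtl_prob(m_left, m_right, r1, r2):
    """P(QTL genotype = 1 | flanking marker genotypes) in a backcross, no interference."""
    def joint(q):
        p_q = (1.0 - r1) if q == m_left else r1           # left marker -> test position
        p_right = (1.0 - r2) if m_right == q else r2      # test position -> right marker
        return p_q * p_right
    return joint(1) / (joint(0) + joint(1))

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def interval_lod(y, p, n_iter=200):
    """Return (LOD, estimated beta) at one test position; p[j] = P(genotype 1) for individual j."""
    n = len(y)
    ybar = sum(y) / n
    var0 = sum((v - ybar) ** 2 for v in y) / n
    loglik0 = sum(math.log(normal_pdf(v, ybar, var0)) for v in y)  # no-QTL model
    mu0 = sum((1 - pj) * v for pj, v in zip(p, y)) / sum(1 - pj for pj in p)
    mu1 = sum(pj * v for pj, v in zip(p, y)) / sum(p)
    var = var0
    for _ in range(n_iter):
        # E-step: posterior probability that individual j has QTL genotype 1
        w = []
        for pj, v in zip(p, y):
            a = pj * normal_pdf(v, mu1, var)
            b = (1 - pj) * normal_pdf(v, mu0, var)
            w.append(a / (a + b))
        # M-step: weighted means and pooled variance
        mu1 = sum(wj * v for wj, v in zip(w, y)) / sum(w)
        mu0 = sum((1 - wj) * v for wj, v in zip(w, y)) / sum(1 - wj for wj in w)
        var = sum(wj * (v - mu1) ** 2 + (1 - wj) * (v - mu0) ** 2
                  for wj, v in zip(w, y)) / n
    loglik1 = sum(math.log(pj * normal_pdf(v, mu1, var) + (1 - pj) * normal_pdf(v, mu0, var))
                  for pj, v in zip(p, y))
    return (loglik1 - loglik0) / math.log(10.0), mu1 - mu0

# Toy example: test position at the midpoint of a 20 cM interval.
r1 = r2 = haldane(10.0)
left = right = [0] * 5 + [1] * 5
y = [0.1, -0.2, 0.0, 0.3, -0.1, 2.1, 1.8, 2.0, 2.3, 1.9]
p = [qtl_prob(a, b, r1, r2) for a, b in zip(left, right)]
lod, beta_hat = interval_lod(y, p)
```

Repeating `interval_lod` over a grid of test positions, with `p` recomputed at each position, gives the LOD profile described in the text.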

3.2 Single Phenotype - Multiple QTL Models

The principal reasons for modelling multiple QTLs are to increase sensitivity and to achieve better separation of linked QTLs. Also, epistasis (i.e., interactions between alleles at different QTLs) can only be identified through multiple-QTL models. When a single QTL model is used but there are in fact multiple QTLs, the genetic variation due to the other segregating QTLs is absorbed into the environmental variation. This reduces sensitivity.

A straightforward extension of MR to model multiple QTLs is multiple regression, which includes a number of different markers in the model rather than looking at them one at a time. Let $m_{ij} = 0$ or $1$ according to whether individual $j$ has genotype AA or Aa at marker $i$. The model becomes

$$ y_j = \mu + \sum_{i \in S} \beta_i m_{ij} + e_j \tag{3.3} $$

where $S$ is the set of markers for which $\beta_i$ is not 0. To implement this, one must find a way to search through the model space to find such a set $S$. As the number of markers gets larger, it becomes impossible to consider every possible model in the model space. In addition, there remains the question of how to choose between candidate models; some form of criterion is needed. Broman (2002) looked at this problem in a model selection framework. Direct use of multiple regression analysis is not easy. Moreover, Zeng (1993) showed that the partial regression coefficient is generally a biased estimate of the relevant QTL effect.

An approach for multiple QTL mapping that combines ideas from IM and multiple regression is composite IM (CIM) (Zeng 1993, 1994). The method attempts to reduce the multidimensional search for QTLs to a series of one-dimensional searches. It conditions on markers outside the region of interest while performing IM, to control for the effects of QTLs in other intervals. In this way, power for QTL detection improves and the QTL effects can be estimated more accurately. CIM can be described as follows.
One chooses a subset of markers, $S$, to control background genetic variation. As in IM, suppose one wants to test for a QTL between markers $i$ and $i+1$. For a test at the $k$th location, between markers $i$ and $i+1$, the statistical model can be written as

$$ y_j = \mu + \beta m_{kj} + \sum_{l \neq i, i+1} \beta_l m_{lj} + e_j $$

where $\beta$ is the effect of the putative QTL. A general guideline in practice is to drop those markers that are within 10 cM of the test position (Broman 2002 calls this subset of markers $S^*$).

Under this model, the contribution of each individual to the likelihood has the form of a mixture of two normal distributions with means $\mu + \sum_{l \in S^*} \beta_l m_{lj}$ and $\mu + \beta + \sum_{l \in S^*} \beta_l m_{lj}$, with mixing proportions equal to the conditional probabilities of the individual having QTL genotype 0 or 1, given the flanking marker genotypes and the test position. Zeng (1994) used the ECM algorithm (Meng and Rubin 1993) to obtain the maximum likelihood estimates. As in IM, a LOD score is calculated at each test position, comparing the likelihood assuming there is a QTL at the putative test locus to the likelihood assuming that there is no QTL there. The LOD score is then plotted as a function of test position in the genome and is compared to a genome-wide threshold to declare significance.

Jansen independently developed a similar approach to handle multiple QTLs by combining IM and MR: multiple-QTL mapping (MQM) (Jansen 1993; Jansen and Stam 1994). It fits single QTL models with selected markers as cofactors in the regression to eliminate the effects of possible QTLs in other intervals.

The major problem with these approaches is how to choose the set of markers to be included in the model. Too many markers will give low power for QTL detection, and too few will cause low accuracy. Zeng (1994) compared, through simulation, the performance of including all other markers with that of including only unlinked markers. He then recommended some combinations of deleting or inserting linked markers in practice. Jansen (1993) and Jansen and Stam (1994) used backward elimination with AIC (Akaike 1969), or a slight variant, to pick the subset of markers in the model. Broman (2002) recommended the use of the $\mathrm{BIC}_\delta$ criterion, with the value of $\delta$ chosen by the approximate correspondence between $\mathrm{BIC}_\delta$ and a genome-wide threshold on the LOD score. Kao et al. (1999) proposed multiple IM (MIM), which uses multiple marker intervals simultaneously to fit multiple QTLs in the model. In MIM, Kao et al.
(1999) adopted stepwise selection with the likelihood ratio test (LRT) as the selection criterion to identify QTLs.

3.3 Multiple Phenotype - Single or Multiple QTL Models

In many QTL mapping studies, more than one trait is measured. Performing single trait analysis repeatedly is clearly not optimal because it does not take into account the correlation structure among the traits. It has been shown that analyzing traits jointly increases the power of QTL detection (Jiang and Zeng 1995; Knott and Haley 2000). In these studies, a joint distribution is imposed on the traits, which requires specification of the covariance structure of the traits (for a review of multi-trait QTL mapping methods, see Lund et al. and references therein).

Multi-trait methods are very attractive in that they try to capture the structure among traits, because many traits are genetically or environmentally correlated. However, as the number of traits grows, so does the number of parameters that need to be estimated.

4 ETL Mapping Methods

To date, most ETL mapping methods consider a single transcript at a time. Multi-trait mapping using available methods has not been attempted. Most investigators recognize that this would be impossible with thousands of traits, since estimation of a phenotype covariance matrix is not feasible. We detail the methods used below.

4.1 Transcript Based Approach

The earliest ETL mapping studies applied single phenotype-single QTL mapping methods to every transcript (Brem et al. 2002; Schadt et al. 2003). We call this type of approach transcript based (TB). In Brem et al. (2002), a Wilcoxon-Mann-Whitney rank sum test was applied to every transcript and marker pair. Nominal p-values were reported, and the number of linkages expected by chance was estimated by permutation tests (Churchill and Doerge 1994). In Schadt et al. (2003), transcript specific LOD score profiles were obtained using standard QTL IM. A common genome-wide LOD score threshold was chosen to account for the potential increase in type I error induced by testing across multiple markers. Neither study accounted for multiple tests across transcripts. We view this as a problem that needs serious attention because ETL studies usually deal with thousands of traits.

4.2 Transcript Based Approach with FDR control

Recently, investigators have made attempts to account for multiplicities across transcripts (Chesler et al. 2005; Hubner et al. 2005). They first computed genome-wide empirical p-values of the maximum LOD score for every transcript using permutation tests and then estimated the q-values (Storey and Tibshirani 2003) accordingly.
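The two-step procedure just described can be sketched as follows: a genome-wide empirical p-value is computed for each transcript from permutation maxima, and the p-values are then adjusted across transcripts. As a stand-in for the Storey and Tibshirani q-value estimator, this sketch uses Benjamini-Hochberg adjusted p-values, which coincide with q-values when the proportion of true nulls is conservatively set to 1. The function names and the numbers in the example are hypothetical.

```python
def empirical_pvalue(observed_max_lod, null_max_lods):
    """Genome-wide empirical p-value of one transcript's maximum LOD score."""
    exceed = sum(1 for v in null_max_lods if v >= observed_max_lod)
    return (1 + exceed) / (1 + len(null_max_lods))

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values with pi0 fixed at 1)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    q = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        q[i] = running_min
    return q

# Hypothetical genome-wide empirical p-values for four transcripts.
pvals = [0.01, 0.04, 0.03, 0.5]
qvals = bh_qvalues(pvals)
```

Because each empirical p-value is bounded below by 1 / (1 + number of permutations), the adjusted values across tens of thousands of transcripts can remain large unless very many permutations are run.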
It should be pointed out that, although an effort was made to adjust for multiple testing across the transcripts, this is by no means a systematic way to approach the multiple testing problem. Preliminary results from simulations such as those described in Kendziorski et al. (2004) show that FDR is not properly controlled and that power is lower than for other approaches. This is consistent with the results of Chesler et al. (2005), where the q-value threshold was increased to 0.25 so that a reasonable number of transcripts could be identified.

4.3 Marker Based Approach

Instead of conducting the analysis at every transcript, ETL mapping can be done by conducting the analysis at every marker. We call these marker based (MB) approaches. They consist of identifying differentially expressed (DE) transcripts across groups of animals, where groups are determined by the genotype at a given marker. Any DE method can be used. Usually, the DE evidence threshold can be chosen so that multiple testing across the transcripts is accounted for. However, the MB approach does not consider multiplicities across the genome.

As was shown in Kendziorski et al. (2004), TB and MB approaches share similar flaws. In TB, separate tests are conducted for each transcript. In MB, each marker is tested separately. For both, the evidence that a transcript maps to a marker is measured against the evidence that it does not map there. Since in reality a transcript can map to any of the marker locations, the evidence that a transcript maps to a particular marker should be judged relative to the possibility that it maps nowhere or to some other marker. This idea motivates what we call the Mixture Over Markers (MOM) model (Kendziorski et al. 2004).

4.4 Mixture Over Markers Model

Let $y_t$ be the expression levels for the $t$th transcript, $y_t = \{y_{t1}, y_{t2}, \ldots, y_{tn}\}$, where $n$ is the number of animals in the ETL study. The MOM model assumes that a transcript $t$ maps nowhere with probability $p_0$ and maps to marker $m$ with probability $p_m$, such that $\sum_{m=0}^{M} p_m = 1$, where $M$ denotes the total number of markers.
The marginal distribution of $y_t$ is then given by

$$ p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t) \tag{4.4} $$

where $f_m$ is the predictive density of the data if transcript $t$ maps to marker $m$, and $f_0$ is the predictive density when the transcript maps nowhere. Specifically, suppose transcript abundance measurements $y_{tj}$ arise independently from some observation distribution $f_{obs}(\cdot \mid \mu_{t,\cdot}, \theta)$. The dependence among the underlying means $\mu_{t,\cdot}$ is captured by a distribution $\pi(\mu)$. With this setting,

$$ f_0(y_t) = \int \Big( \prod_{j=1}^{n} f_{obs}(y_{tj} \mid \mu) \Big) \pi(\mu)\, d\mu. $$

For a transcript that maps to marker $m$, say, the underlying expression means defined by the marker genotype groups are not equal ($\mu_{t,0} \neq \mu_{t,1}$), but both are assumed to come from $\pi(\mu)$. The governing distribution for $y_t$ is then

$$ f_m(y_t) = f_0(y_t^0)\, f_0(y_t^1) $$

where $y_t^{0(1)}$ denotes the set of transcript $t$ values for animals with genotype 0(1). Model fit proceeds via the EM algorithm. Once the parameter estimates are obtained, posterior probabilities of mapping nowhere or to any of the $M$ locations can be calculated via Bayes rule. For instance, the posterior probability that transcript $t$ maps to location $l$, $l = 0, \ldots, M$, is given by

$$ \frac{p_l f_l(y_t)}{p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t)} \tag{4.5} $$

With the MOM approach, a transcript is identified as DE if the posterior probability of DE exceeds some threshold, where the threshold is chosen to control the expected posterior false discovery rate (Newton et al. 2004). In order to make a transcript specific call, the highest posterior density (HPD) region can be constructed in a straightforward fashion. A $1 - \alpha$ HPD region is obtained by including marker locations, in decreasing order of posterior probability, until the sum of the posterior probabilities exceeds $1 - \alpha$.

4.5 Current Status of ETL Mapping Methods

Here we present some of our early simulation results comparing different approaches. We considered a single ETL simulation with two chromosomes. Marker data were obtained from chromosomes 2 and 3 of the F2 data from Dr. Alan Attie's lab on campus. Chromosome 2 has 17 markers, and there are 6 markers on chromosome 3. A single ETL was simulated at marker 5 of chromosome 2. We generated 20 data sets for each of seven values of $\nu_0$, a tuning parameter that controls the variance pattern in the simulated data (for details, see Kendziorski et al. 2004), so that operating characteristics could be evaluated without biasing towards one method. We refer to applying MR to every transcript as transcript based MR (TB-MR).
For TB-MR, the genome-wide type I error rate per transcript is controlled at 5% (Dupuis and Siegmund 1999). We also consider four marker based approaches. The first is EBarrays with the LogNormal-Normal (LNN) model, where posterior probabilities of DE are computed for all the transcripts at every marker (MB-EB); see Kendziorski et al. (2003) for details of the LNN model. The second is a t-test of equal means for every transcript, followed by calculation of q-values (Storey and Tibshirani 2003) at every marker (MB-Q). The third and fourth methods calculate moderated t-statistics using SAM (Tusher et al. 2003) and LIMMA (Smyth 2004), followed by q-value calculation at every marker. We refer to these as MB-SAM and MB-LIMMA. For MB-EB, MB-Q, MB-SAM, and MB-LIMMA, the false discovery rate per marker is controlled at 5%.

One further method attempts to test all the transcripts and all the markers simultaneously. P-values from t-tests for every transcript and marker pair were used to calculate q-values (Storey and Tibshirani 2003) all at once. Note that doing so assumes that a certain dependence structure among tests is satisfied (Storey 2003), which is unlikely to hold here. We nevertheless include this method (Q-ALL) in our simulation because we would like to see the performance of this kind of ad hoc procedure. In addition, Storey and Tibshirani (2003) use the ETL mapping data of Brem et al. (2002) as motivation for calculating all q-values simultaneously. For Q-ALL, FDR control is targeted at 5%.

Power is defined as the probability of calling marker 4, 5 or 6 (the flanking region of the true ETL) on chromosome 2 for mapping transcripts. FDR is the proportion of transcripts identified incorrectly as mapping to chromosome 2 or 3, i.e., transcripts that were equivalently expressed (EE), or that were DE but mapped outside the flanking region of the true ETL. Figure 1 shows the average power and FDR (over the 20 simulated data sets) against each $\nu_0$ value, together with 95% point-wise confidence intervals. As can be seen, the power is between roughly 80% and 90% for all of the methods; however, for all methods except MOM, the FDR is well above 5% for most values of $\nu_0$. These simple and somewhat ad hoc approaches fail to control the FDR at their claimed level because they cannot adjust for multiple tests across the markers and the transcripts simultaneously. Although MOM does adjust appropriately for these multiplicities, the approach has a number of shortcomings.
For experiments with sparse maps, the MOM model is lacking because information between markers is not available. Also, MOM does not allow for multiple ETL. Some dense maps are currently available or under development, particularly those that use single nucleotide polymorphisms (SNPs) (see The SNP Consortium). Because of their proximity to such mutations, SNPs may be shared among groups of people with harmful but unknown mutations and can serve as markers for them. Such markers can help to reveal the mutations and expedite therapeutic drug discovery. When dense maps are available, the MOM model as it stands may not be applicable because the number of mixture components to fit will be huge.
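The MOM calculations of Section 4.4 can be illustrated with a minimal sketch: the posterior probabilities (4.5), the HPD region, and a list of mapping transcripts chosen so that the average posterior probability of the null among called transcripts stays below a target. The sketch takes the mixing proportions and hyperparameters as given rather than fitting them by EM, and it uses a normal observation model with a normal prior on the latent mean so that $f_0$ has a closed form. That conjugate choice, the function names, and the toy data are assumptions made for illustration, not the model fitted in this paper.

```python
import math

def log_f0(y, mu0=0.0, tau2=1.0, sigma2=1.0):
    """Log predictive density of a group sharing one latent mean:
    y_j | mu ~ N(mu, sigma2), mu ~ N(mu0, tau2) (assumed conjugate model)."""
    k = len(y)
    d = [v - mu0 for v in y]
    ss, s = sum(v * v for v in d), sum(d)
    log_det = (k - 1) * math.log(sigma2) + math.log(sigma2 + k * tau2)
    quad = (ss - tau2 * s * s / (sigma2 + k * tau2)) / sigma2
    return -0.5 * (k * math.log(2.0 * math.pi) + log_det + quad)

def mom_posteriors(y, geno_by_marker, p, **hyper):
    """Posterior over l = 0 (maps nowhere), 1..M (maps to marker l), as in (4.5)."""
    log_terms = [math.log(p[0]) + log_f0(y, **hyper)]
    for m, g in enumerate(geno_by_marker, start=1):
        y0 = [v for v, gj in zip(y, g) if gj == 0]
        y1 = [v for v, gj in zip(y, g) if gj == 1]
        log_terms.append(math.log(p[m]) + log_f0(y0, **hyper) + log_f0(y1, **hyper))
    top = max(log_terms)  # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in log_terms]
    total = sum(w)
    return [v / total for v in w]

def hpd_region(post, alpha=0.05):
    """Markers added in decreasing posterior order until the mass exceeds 1 - alpha."""
    ranked = sorted(range(1, len(post)), key=lambda m: post[m], reverse=True)
    region, mass = [], 0.0
    for m in ranked:
        region.append(m)
        mass += post[m]
        if mass > 1.0 - alpha:
            break
    return region

def call_mapping_transcripts(post_null, target_fdr=0.05):
    """Largest list, smallest P(null) first, whose mean posterior null probability <= target."""
    order = sorted(range(len(post_null)), key=lambda t: post_null[t])
    called, running = [], 0.0
    for t in order:
        if (running + post_null[t]) / (len(called) + 1) > target_fdr:
            break
        running += post_null[t]
        called.append(t)
    return called

# Toy transcript whose expression differs between genotype groups at marker 2.
y = [0.1, -0.1, 0.2, 0.0, 2.0, 2.1, 1.9, 2.2]
geno_by_marker = [[0, 1, 0, 1, 0, 1, 0, 1], [0, 0, 0, 0, 1, 1, 1, 1]]
post = mom_posteriors(y, geno_by_marker, [0.8, 0.1, 0.1], mu0=1.0, tau2=1.0, sigma2=0.05)
region = hpd_region(post)
```

Working on the log scale avoids underflow when the predictive densities differ by many orders of magnitude, which is typical when a transcript maps clearly to one marker.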

5 Research Plan

The main objective of the proposed thesis is to develop a powerful statistical framework within which ETL can be localized. The framework relies on the MOM model, in that it simultaneously controls for tests done across genome locations and across all the transcripts. However, the proposed methods, detailed below, significantly increase the utility and applicability of MOM by addressing the first two shortcomings listed above. The effort to develop a method to handle dense maps is ongoing.

5.1 ETL Interval Mapping

The genomic regions identified using MOM are limited by their size, which may be large because analysis is conducted at genotyped markers only. When dense maps are not available (e.g., the Attie lab data; Schadt et al. 2003), this limitation can be a serious one. The biological techniques currently available to search for genes in large genomic regions (e.g., the candidate gene approach, congenic lines) can take years and, as a result, additional statistical methods capable of narrowing down regions are necessary. We here propose a method for IM of ETL.

Consider, for simplicity of notation, a backcross population genotyped as 0 or 1 at $M$ markers. The observed phenotype data $y$ is a $T \times n$ matrix of transcript abundance levels for transcripts $t = 1, \ldots, T$ and individuals $j = 1, \ldots, n$; $m$ is an $M \times n$ matrix of marker genotypes for markers $i = 1, \ldots, M$ and individuals $j = 1, \ldots, n$. Considering a set of $L$ locations spanning the entire genome, we model the expression data for transcript $t$ as an $(L+1)$-component mixture. To be specific, we imagine that the transcript may map nowhere with probability $p_0$, and to any of the $L$ locations with probability $p_l$, $l = 1, 2, \ldots, L$. The $p_l$ are mixing proportions. As noted in Section 3.1, transcript $t$ is said to be linked (mapped) to location $l$ if $\mu^0_{t,l} \neq \mu^1_{t,l}$, where $\mu^{0(1)}_{t,l}$ denotes the latent mean level of expression for transcript $t$ in the population of individuals with genotype 0(1) at location $l$.
Let $z_t^l$ be an indicator of whether transcript $t$ maps to location $l$. If $l$ is at a marker, then we can decompose the predictive density under the alternative hypothesis as $f_l(y_t) = f_0(y_t^0)\, f_0(y_t^1)$, where the grouping is determined by the genotypes at that marker. However, when $l$ is between markers, the decomposition is no longer valid. Instead, we have

$$ f_l(y_t) = \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l = \int\!\!\int\!\!\int \prod_{j \in G^l_0} f_{obs}(y_{tj} \mid \mu_{t,0}) \prod_{j \in G^l_1} f_{obs}(y_{tj} \mid \mu_{t,1})\, \pi(\mu_{t,0})\, \pi(\mu_{t,1})\, p(g^l \mid m)\, d\mu_{t,0}\, d\mu_{t,1}\, dg^l $$

where $g^l = (g^l_1, g^l_2, \ldots, g^l_n)$ denotes the unknown genotype vector at location $l$, and $G^l_{0(1)}$ denotes the set of individuals having genotype 0(1) at location $l$. Under the null hypothesis, the predictive density of the data, $f_0(y_t)$, can be calculated as before since it does not rely on genotype groupings.

Parameter estimates for the mixing proportions and hyperparameters can be obtained via the EM algorithm (see Appendix A). We show that the posterior probability that transcript $t$ maps to location $l$, after integrating out the $\mu$'s, is given by

$$ p(z_t^l = 1 \mid y, m) = \frac{p(z_t^l = 1) \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l}{p(y_t \mid m)} \tag{5.6} $$

where $p(z_t^l = 1)$ is the prior probability that transcript $t$ maps to location $l$. At a particular location $l$, the conditional distribution of the genotype given the marker data, $p(g^l \mid m)$, is assumed to depend only on the two markers flanking $l$. Notice that $g^l$ is a vector of length $n$. Theoretically, there are $2^n$ possible genotype vectors, so the integral in (5.6) is a huge mixture. In practice, one can restrict attention to a smaller number of them (Table 1, Zeng 1994, reproduced here as Table 2), since many have small probabilities. However, when the number of individuals in the study is large, this $2^n$ problem quickly becomes computationally infeasible.

Pseudomarker-MOM

To get around the $2^n$ problem, we propose a general framework for ETL IM using importance sampling and pseudomarker generation. The idea of pseudomarkers was introduced by Sen and Churchill (2001) (see Appendix B).

Let us introduce some slightly more general notation. Recall that a transcript $t$ is linked to location $l$ if $\mu^0_{t,l} \neq \mu^1_{t,l}$, where $\mu^{0(1)}_{t,l}$ denotes the latent mean level of expression for transcript $t$ in the population of individuals with genotype 0(1) at location $l$.
When this is the case, we say that transcript $t$ is in expression pattern P1 at location $l$, denoted $P1^l_t$; similarly, $P0^l_t$ denotes the null pattern of expression at location $l$ ($\mu^0_{t,l} = \mu^1_{t,l}$). It is useful to introduce specific patterns in our framework since, when an F2 population is considered, for example, there are numerous patterns. Each defines a way in which the latent means can differ across genotype groups at a location (all of the non-null patterns imply linkage to that location). Two $T \times L$ matrices, $\theta_0$ and $\theta_1$, contain the latent mean levels of expression ($\theta = (\theta_0, \theta_1)$); $L$ denotes the total number of locations considered. Let $z^l_t = 1$ if transcript $t$ is in expression pattern P1 at location $l$ and 0 otherwise. Then $z$ is a $T \times L$ indicator matrix specifying QTL locations for each transcript.

Let us first consider a simple case where a transcript is associated with at most one genomic location $l$, and consider inference at location $l$. This assumption simplifies the algebraic development and will be relaxed later. At location $l$, of primary interest is the posterior probability that $z^l_t = 1$ for transcript $t$. Re-expressing (5.6) in terms of the patterns, we have

$$
p(Pk^l_t \mid y, m) \propto p^l_{t,Pk} \int f_{Pk}(y_t \mid g^l)\, p(g^l \mid m)\, dg^l \tag{5.7}
$$

where $p^l_{t,Pk}$ denotes the prior probability that transcript $t$ is in pattern $k$ at location $l$ and $f_{Pk}$ describes the predictive density of the data. For a backcross, $k = 0, 1$. A normalizing constant is not required if further calculations of operating characteristics such as FDR are not of interest, which may be the case if a simple ranking of the genes is desired. For the model we propose here, we are interested in calculating the estimated FDR. Therefore, we must specify the normalizing constant $p(y_t \mid m)$. Expanding the probability in terms of the different possible mapping locations, we have

$$
p(y_t \mid m) = p(y_t \mid m, z_t = 0)\, p(z_t = 0) + \sum_{l'=1}^{L} p(y_t \mid m, z^{l'}_t = 1)\, p(z^{l'}_t = 1),
$$

where $p(z_t = 0) + \sum_{l'=1}^{L} p(z^{l'}_t = 1) = 1$ and $z_t = 0$ means that the transcript does not map to any of the $L$ locations, i.e., the transcript is in pattern P0 at every location. Therefore, the exact form of (5.7) is given by

$$
p(Pk^l_t \mid y, m) = \frac{p^l_{t,Pk} \int f_{Pk}(y_t \mid g^l)\, p(g^l \mid m)\, dg^l}{p_{t,P0}\, f_{P0}(y_t) + \sum_{l'=1}^{L} p^{l'}_{t,P1} \int f_{P1}(y_t \mid g^{l'})\, p(g^{l'} \mid m)\, dg^{l'}} \tag{5.8}
$$

The $p_{t,P0}$ and $p^l_{t,P1}$'s are unknown. We estimate them from the data, using the average posterior probability that each transcript belongs to a particular pattern.
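As a small numerical illustration of the normalization in (5.8), the sketch below assumes that the integrated predictive densities have already been evaluated for one transcript and are supplied on the log scale. The function name and its arguments are hypothetical and not part of the original method; working in log space is an implementation choice made here for numerical stability.

```python
import math


def pattern_posteriors(log_f0, log_f1, p0, p1):
    """Normalize pattern probabilities for one transcript, as in (5.8).

    log_f0 : log f_P0(y_t), the null predictive density (assumed given)
    log_f1 : list of length L; entry l is the log of the integral of
             f_P1(y_t | g^l) p(g^l | m) dg^l (assumed given)
    p0     : prior probability of the null pattern
    p1     : list of length L of prior probabilities, p0 + sum(p1) == 1
    Returns (posterior of P0, list of posteriors of P1 at each location).
    """
    log_terms = [math.log(p0) + log_f0]
    log_terms += [math.log(p) + lf for p, lf in zip(p1, log_f1)]
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_terms)
    weights = [math.exp(v - shift) for v in log_terms]
    total = sum(weights)
    post = [w / total for w in weights]
    return post[0], post[1:]
```

With two locations, equal priors of 0.25 at each location and a null prior of 0.5, a predictive density three times larger at the second location gives posteriors of 1/3 (null), 1/6 and 1/2.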
Due to the $2^n$ problem, calculation of $p(g^l \mid y, m)$, the posterior distribution of the unobserved genotype at location $l$ given the expression and marker data, becomes computationally prohibitive. Here we use importance sampling: we first simulate multiple versions of pseudomarkers from $p(g^l \mid m)$, and then replace the exact integral with its Monte Carlo approximation. In simulating the pseudomarkers, one can use a simple Markov chain structure where the putative QTL genotype at a given location depends only on the two flanking markers. However, this might not work well if there are genotyping errors or noninformative markers in the marker data. A hidden Markov model (HMM) can be considered where the true marker genotypes follow a Markov chain, and the observed marker genotypes are characterized by distributions conditional on the underlying state process. Using an HMM, one can account for genotyping error and missing marker data in a coherent way. R/qtl (Broman et al. 2003) has this option as well. Figure 11 gives an example. The upper stripe gives the marker data for one animal: blue shows AA and yellow shows Aa. Missing marker data are shown in light blue. The lower panel shows 20 realizations sampled from the HMM trained on the marker data.

Suppose that for each location $l$, $Q$ genotype vectors are sampled from the proposal distribution $p(g \mid m)$, where $g = \{g^1, g^2, \ldots, g^L\}$, to give $(g^l_1, g^l_2, \ldots, g^l_Q)$ for $l = 1, \ldots, L$. Then equation (5.8) can be approximated by Monte Carlo integration using importance samples (see Appendix C):

$$
p(Pk^l_t \mid y, m) \approx \frac{p^l_{t,Pk} \sum_{q=1}^{Q} f_{Pk}(y_t \mid g^l_q)}{p_{t,P0} \sum_{q=1}^{Q} f_{P0}(y_t) + \sum_{l'=1}^{L} p^{l'}_{t,P1} \sum_{q=1}^{Q} f_{P1}(y_t \mid g^{l'}_q)} \tag{5.9}
$$

This approach is an extension of the MOM model evaluated by averaging over the pseudomarkers. We call it pseudomarker-MOM.

Two-Stage Approach

Pseudomarker-MOM scans the genome at some small step size in order to find potential locations to which transcripts are mapped. But applying it over the entire genome, even for each chromosome separately, can be computationally prohibitive. We propose a two-stage approach where we first apply MOM at markers to identify interesting regions, then follow up by applying pseudomarker-MOM to those regions to better localize ETLs. For each transcript, MOM calculates the posterior probability that it maps to a particular marker or does not map at all. To identify potential ETL regions, we average the linkage evidence across all the transcripts to give a marker-specific linkage score.
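The marker-specific linkage score described above can be sketched as a column average of a transcript-by-marker matrix of posterior probabilities. This is a minimal illustration under the assumption that the MOM posteriors are already available as a list of rows; the function names are hypothetical.

```python
def marker_linkage_scores(posterior):
    """Average linkage evidence across transcripts for each marker.

    posterior[t][m] is assumed to hold the MOM posterior probability
    that transcript t maps to marker m.
    """
    n_transcripts = len(posterior)
    n_markers = len(posterior[0])
    return [
        sum(row[m] for row in posterior) / n_transcripts
        for m in range(n_markers)
    ]


def rank_markers(scores):
    """Return marker indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)
```

Markers at the top of the ranking are the candidates for follow-up in the second stage.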
We can choose the hot-spot markers by thresholding the marker-specific linkage evidence, using an HPD region.

Theoretical Result

Here we justify that picking the regions with the highest average linkage evidence is indeed the correct thing to do under simplified conditions.

Theorem 1: If we assume that transcripts map to at most one genomic location, that the prior probability of mapping to a particular marker is known and equal for all markers, and that hyperparameter values are known, then the expected posterior probability of a transcript mapping to a particular marker is a non-increasing function of the recombination frequency between that marker and the ETL.

Proof: See Appendix D.

As a result of Theorem 1, we can be sure that under these conditions, the interesting marker regions picked by MOM will be the regions closest to the ETL. Once these regions are defined, we set up an equally spaced pseudomarker grid and use pseudomarker-MOM to localize the ETL with greater accuracy. We show by simulation studies in the next section that this approach works quite well, even under more general conditions.

Simulation Results

A simulation was set up with 5000 transcripts and 100 individuals. The proportion of differential expression (DE) was 10%. The hypothetical marker map is composed of one chromosome with 10 equally spaced markers, 10 cM apart (i.e., from 0 cM to 90 cM). Intensity values are simulated as described in Section 4.4. Two simulations were considered: one with a single ETL at 35 cM, and one with two ETLs, at 35 cM and 75 cM. The two-stage approach was applied in the simulation. Specifically, two hot-spot marker regions were selected from the average posterior probability profile at every marker, obtained by MOM. Pseudomarker-MOM was then applied across a 2 cM grid within the hot-spot regions, with 50 pseudomarker realizations ($Q = 50$). For comparison, we applied traditional QTL IM transcript by transcript on the same data set and obtained genome-wide cutoffs on LOD scores based on an approximation formula from Rebai et al. (1994).

Here we show results from the two simulations. Figure 2 (left panel) shows the average posterior probability profile, averaged across the mapping transcripts. The ETL region is identified by both MOM and pseudomarker-MOM. However, MOM picks up a wide peak between marker 4 and marker 5, whereas pseudomarker-MOM identifies the ETL location with much better accuracy.
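The pseudomarker step used in this simulation (genotypes drawn at grid locations, then averaged over $Q$ realizations) can be sketched as follows. This is a simplified illustration, not the implementation used for the results: it assumes a backcross with genotypes coded 0/1, fully observed flanking markers, and the Haldane map function (no interference), which the text does not specify. It uses the simple flanking-marker scheme rather than the HMM, and `f_pred` is a hypothetical placeholder for the predictive density $f_{P1}(y_t \mid g)$.

```python
import math
import random


def haldane(d_cM):
    """Recombination fraction for a map distance in cM (Haldane assumption)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))


def prob_same_as_left(left, right, d_left, d_right):
    """P(genotype at the location equals the left marker genotype),
    given both flanking marker genotypes in a backcross."""
    r1, r2 = haldane(d_left), haldane(d_right)
    r = r1 + r2 - 2.0 * r1 * r2  # recombination fraction between the markers
    if left == right:
        return (1.0 - r1) * (1.0 - r2) / (1.0 - r)
    return (1.0 - r1) * r2 / r


def sample_pseudomarkers(left, right, d_left, d_right, Q, rng):
    """Draw Q genotype vectors (one entry per individual) at one location."""
    draws = []
    for _ in range(Q):
        g = []
        for a, b in zip(left, right):
            p = prob_same_as_left(a, b, d_left, d_right)
            g.append(a if rng.random() < p else 1 - a)
        draws.append(g)
    return draws


def mc_average(f_pred, y, draws):
    """Monte Carlo estimate of the integral of f(y | g) p(g | m) dg."""
    return sum(f_pred(y, g) for g in draws) / len(draws)
```

In (5.9) the common factor $1/Q$ cancels between numerator and denominator, so sums and averages over the draws give the same posterior probabilities.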
To ensure that non-mapping transcripts were not falsely identified, we considered the posterior probability profiles averaged across non-mapping transcripts. They show little structure, as expected (figure not shown). Figure 2 (right panel) shows the 96.8% HPD regions for true mapping transcripts (96.8% is used to compare with the IM results below). As shown, the ETL is identified correctly for most of the mapping transcripts. As in the left panel, pseudomarker-MOM provides good ETL localization.
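One common way to form an HPD region from a posterior profile on a discrete grid is to add locations in order of decreasing posterior probability until the target mass is reached. The sketch below illustrates that construction; it is an assumption about how such regions may be computed, not a description of the exact procedure used for Figure 2, and the function name is hypothetical.

```python
def hpd_region(positions, probs, level=0.968):
    """Smallest set of grid positions whose normalized posterior mass
    reaches `level`, built greedily from the most probable positions."""
    total = sum(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(positions[i])
        mass += probs[i] / total
        if mass >= level:
            break
    return sorted(chosen)
```

A sharply peaked profile yields a short region, while a flat profile yields a region covering most of the grid.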

In comparison, we also considered IM on the same simulated data set, implemented using R/qtl (Broman et al. 2003). Figure 3 (left panel) shows the average LOD score, averaged across mapping transcripts. The region containing the ETL has the highest average LOD, but the average LOD scores are very high overall for mapping transcripts, and it is not clear what cutoff one should use in order to correctly identify the ETL region. For example, a cutoff of 5 selects almost the entire simulated chromosome. To compare with Figure 2, a confidence interval was constructed around the ETL using a 1-LOD drop interval around the peak LOD score (Mangin et al. 1994). This procedure is designed to approximate a 96.8% confidence interval but, in general, these intervals can be biased in that they are too small, and a bootstrap procedure has been recommended (Visscher et al. 1996). In the ETL setting, obtaining bootstrap samples for thousands of transcripts does not seem feasible. On the other hand, confidence intervals that are slightly too small favor IM here, as ETL appear to be better localized. It is not always clear which peaks to construct confidence intervals around. To give IM the best results, we consider a 10 cM window around the true ETL (35 cM) and define the LOD peak as the highest LOD within the window. The 1-LOD drop interval is then constructed. Of course, in practice, one does not have the luxury of knowing where to choose these peaks, and perhaps only the largest peak would be identified. For these reasons, this method of identifying ETL regions favors IM. Even so, in comparison to pseudomarker-MOM, IM does not provide as precise estimates of ETL location.

Similar results were obtained for the 2-ETL case. As shown in Figure 4 (left panel), for a particular simulation, the two ETL regions are identified by both MOM and pseudomarker-MOM.
However, spurious peaks show up in different places in different simulations, although in general they have low average posterior probability compared with the main peaks (e.g., in the bottom row). As shown, pseudomarker-MOM improves localization of the ETL compared with MOM. In Figure 4 (right panel), the distinct ETLs are identified for most of the mapping transcripts. We see that pseudomarker-MOM provides good ETL localization. We again considered IM on the same two simulated data sets. Figure 5 (left panel) shows the average LOD score profile, averaged across mapping transcripts. The regions containing the ETLs have the highest average LOD, but distinct ETLs are not as clear as in Figure 3. As before, the 96.8% confidence intervals were constructed around each ETL using a 1-LOD drop interval around peak LOD scores, with peaks chosen from the two 10 cM windows around the true ETLs (35 cM and 75 cM). IM suffers from the same problem as before in that it has relatively more false positive calls.
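The windowed 1-LOD drop construction used for the IM comparison can be sketched as below. This is an illustration under stated assumptions: the LOD profile is given on a grid, the "10 cM window" is read as 5 cM on either side of the chosen center, and the interval is reported at grid points without interpolation. The function name is hypothetical and this is not the R/qtl implementation.

```python
def lod_drop_interval(positions, lod, center, half_width=5.0, drop=1.0):
    """Return (lower, peak, upper) positions of a LOD drop interval.

    The peak is the highest LOD within `half_width` of `center`; the
    interval extends outward from the peak while LOD >= peak - drop.
    """
    window = [i for i, p in enumerate(positions) if abs(p - center) <= half_width]
    peak = max(window, key=lambda i: lod[i])
    cutoff = lod[peak] - drop
    lo = peak
    while lo > 0 and lod[lo - 1] >= cutoff:
        lo -= 1
    hi = peak
    while hi < len(lod) - 1 and lod[hi + 1] >= cutoff:
        hi += 1
    return positions[lo], positions[peak], positions[hi]
```

Centering the window on the true ETL location, as done in the comparison, is only possible in simulation; in practice the center would have to be chosen from the data.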

5.2 Multiple ETL mapping

The MOM approach can be extended to account for multiple ETL. For example, if transcript $t$ is possibly affected by two genotype locations $l_1$ and $l_2$, then four latent means are of interest: $\mu^{0,0}_{t,(l_1,l_2)}$, $\mu^{0,1}_{t,(l_1,l_2)}$, $\mu^{1,0}_{t,(l_1,l_2)}$ and $\mu^{1,1}_{t,(l_1,l_2)}$, where $\mu^{j,k}_{t,(l_1,l_2)}$ denotes the latent mean level of expression for transcript $t$ for the populations of individuals with genotype $(j, k)$ at locations $l_1$ and $l_2$. These latent means can be arranged in 15 possible expression patterns, all of which may be of interest. For simplicity, we consider:

P0: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$
P1: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$
P2: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{1,0}_{t,(l_1,l_2)} \neq \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$
P3: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,1}_{t,(l_1,l_2)}$
P4: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$
P5: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} \neq \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,1}_{t,(l_1,l_2)}$

Pattern P0 allows for the possibility that a transcript maps to neither location. The latent means of a transcript mapping only to location $l_1$ would satisfy pattern P1, since only allelic differences at $l_1$ affect the mean level of expression. Similarly, the means of transcripts mapping only to location $l_2$ satisfy P2. Patterns P3-P5 describe possible ways in which the allelic variation at both locations can act and interact to affect expression level. P3 describes a scenario in which the alleles at each location have equal, but not dominant, effects. A dominant model would be described by P4 and an additive model by P5. The multiple ETL MOM model (M-MOM) has $5\binom{M}{2}$ patterns. As before, of primary interest is the posterior probability of particular expression patterns.
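The count of 15 possible patterns is the number of ways to partition the four latent means into groups of equal value. The short enumeration below checks that count; the labels "00", "01", "10" and "11" are shorthand introduced here for the four genotype combinations, and the function name is hypothetical.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks.

    Each block collects the latent means constrained to be equal.
    """
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition


means = ["00", "01", "10", "11"]
patterns = list(set_partitions(means))
print(len(patterns))
```

The partition with blocks {00, 01} and {10, 11} corresponds to P1, and the single-block partition corresponds to the null pattern P0.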
These posterior probabilities can be calculated similarly as in (4.5) for any pattern of interest, where $k = 0, 1, \ldots, 5$.

Simulation Results

We apply M-MOM to the simulated 2-ETL data described in the preceding simulation study. We perform a two-dimensional scan by looking at every possible marker pair and all the expression patterns simultaneously. In the simulation, the two ETL effect sizes are generated to be the same, thus corresponding to pattern 3. If we look at the average posterior probabilities for all non-null patterns, most of them are negligible, as expected, except for pattern 3 (figure not shown). Looking closely at pattern 3, we plot the average posterior probabilities for every marker pair in Figure 6. The first ETL location is on the Y axis and the second ETL location is on the X axis. As shown, the plot locates the two ETLs at 30 cM and 80 cM, which is very close to the truth and the best accuracy that M-MOM can achieve, since the true ETLs are not at markers.

For comparison purposes, we implement a 2-D marker regression scan on the same data set (see Figure 7). On the upper triangle, we plot the average LOD score for epistasis, which is very low, as expected. The diagonal is the average LOD score from the 1-D IM scan. The lower triangle gives the average joint LOD scores. Also shown in the plot are the contour lines over the range of the LOD scores obtained. The region spanning 30 cM to 40 cM and 70 cM to 90 cM has relatively high average LOD scores compared to the others. In order to assess significance, we randomly sample 10 transcripts corresponding to every 10th percentile of the log of expression means, and perform a permutation test of size 1000 on each of them. The average, across these transcripts, of the 95th percentile of the permutation LOD scores is about 3.2. Using this as our 2-D LOD score cutoff, we see from the contour lines that it gives a much wider region than the actual ETL. In line with the previous comparisons, using traditional QTL mapping techniques on the ETL data seems to yield more false positives.

6 Future Research Questions

The proposed framework for ETL mapping enables IM of a single ETL and the identification of multiple ETL at markers. Preliminary results from simulated data sets are encouraging. Many questions remain to be explored further.

1. Our methods are extensions of the MOM model, and a number of questions of implementation remain open. One question that is important to our extension is: how sparse is sparse? In other words, how many markers can MOM handle relative to the number of transcripts? More generally, one might also be concerned about whether MOM can fit mixtures with so many components. We did a series of simulations to shed some light on this.
Consider a backcross, and suppose the genome has 400 equally spaced markers, 5 cM apart. Of the transcripts, 25% are DE, mapping to 80% of the markers with equal probabilities. We varied the number of transcripts over 100, 1000, 5000 and 40000, representing very small, small, moderate and large numbers of transcripts in a realistic experiment, with the number of animals being 10, 60 and 100. Ten animals might be a little unrealistic, but we were curious to see what the results would look like.

We plot the posterior probabilities for all the transcripts of a particular mixture component in the MOM model in Figures 8, 9 and 10. The number of transcripts mapping to that mixture component is 1, 1, 4 and 22, corresponding to the four values of the total number of transcripts. The true DE transcripts are colored in red. Surprisingly, it seems that MOM can fit a mixture model with more components than transcripts quite well (first panel in Figure 9 and Figure 10), as long as there are enough animals in the experiment. Our impression is that when the number of animals is small, there tend to be very few recombinations. The degree of distinctiveness between the MOM components is then not very high, resulting in relatively low power (see Figure 8). When the number of transcripts is small, there are also very few DE transcripts mapped to each marker. We might expect the posterior probabilities of those transcripts not to be very accurate. But as seen from Figure 9, this does not seem to be the case. There are 60 animals, but even when we have only 100 transcripts, MOM still detects the one DE transcript, with posterior probability 1.0. When there are 40000 transcripts, among the top 22 transcripts with the highest posterior probabilities, 20 are true DE (22 DE in total). The posterior probabilities range up to 1.0. In Figure 10, when there are 100 animals and 40000 transcripts, 21 out of the top 22 transcripts are DE, and their posterior probabilities range from 0.76 to 1.0. Much more work is required to investigate these results and define precise conditions under which parameters are well estimated.

2. Importance sampling is applied in pseudomarker-MOM. Importance samples are taken from the proposal distribution $p(g^l \mid m)$ to approximate what we really desire, $p(g^l \mid y, m)$. We will investigate ways to choose $Q$ so that we balance computational burden against accuracy. Perhaps a lower bound on $Q$ can be obtained.

3. The proposed two-stage approach relies on the ability of MOM to identify the correct interesting regions for follow-up. We have shown theoretically that, under simplified conditions, the highest posterior probability region obtained using MOM is the region closest to the ETL. The simplified conditions need to be relaxed. In particular, we will investigate the case of multiple ETL and varying prior probabilities.

4. We have developed a method for mapping multiple ETL. Preliminary results from simulations investigating a 2-ETL system are encouraging. We would like to extend this
