A Statistical Framework for Expression Trait Loci (ETL) Mapping. Meng Chen
Prelim Paper in partial fulfillment of the requirements for the Ph.D. program in the Department of Statistics, University of Wisconsin-Madison

Committee Members: Professor Christina Kendziorski, Professor Alan Attie, Professor Michael Newton, Professor Brian Yandell
Contents

1 Introduction
2 ETL mapping experiments
3 QTL Mapping Methods
  3.1 Single Phenotype - Single QTL Models
  3.2 Single Phenotype - Multiple QTL Models
  3.3 Multiple Phenotype - Single or Multiple QTL Models
4 ETL Mapping Methods
  4.1 Transcript Based Approach
  4.2 Transcript Based Approach with FDR control
  4.3 Marker Based Approach
  4.4 Mixture Over Markers Model
  4.5 Current Status of ETL Mapping Methods
5 Research Plan
  5.1 ETL Interval Mapping
    Pseudomarker-MOM
    Two-Stage Approach
    Theoretical Result
    Simulation Results
  Multiple ETL mapping
    Simulation Results
6 Future Research Questions
References
Appendix
1 Introduction

Identifying the genetic loci responsible for variation in quantitative traits is of great importance to biologists. Quantitative trait loci (QTL) mapping studies have been carried out for over 80 years, starting with Sax (1923), who proposed that the association between seed weight and seed coat color in beans was due to linkage between the genes controlling weight and the genes controlling color (Sax 1923; Rasmusson 1933; Thoday 1961). Nevertheless, the vast majority of studies have taken place in the last 20 years. The increased rate was due largely to two major advances in the 1980s: the advent of restriction fragment length polymorphisms (RFLPs) (Botstein et al. 1980), which made it possible to genotype markers on a large scale, and the advent of statistical methods for data analysis (Lander and Botstein 1989). A recent advance of comparable significance has been made in the area of phenotyping. With high throughput technologies now widely available, investigators can measure thousands of phenotypes at once. Gene expression measurements are particularly amenable to QTL mapping, and much excitement abounds for this field of genetical genomics (Jansen and Nap 2001; Jansen 2003; Cox 2004; Broman 2005). The so-called expression QTL (eQTL) or expression trait loci (ETL) studies have been used to identify candidate genes (Dumas et al. 2000; Eaves et al. 2002; Karp et al. 2000; Wayne et al. 2003; Schadt et al. 2003; Bystrykh et al. 2005; Hubner et al. 2005), to infer not only correlative but also causal relationships among modulator and modulated genes (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003), to better define traditional phenotypes (Schadt et al. 2003), and to serve as a bridge between genetic variation and the traditional complex traits of interest (Schadt et al. 2003). Although successful in many ways, the results obtained from ETL studies to date are limited.
In the early published studies, the ETL mapping problem was addressed by treating each transcript separately as a phenotype for QTL mapping. Single trait QTL analysis was then carried out thousands of times (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003). Notably, although adjustments were made for multiple tests across the genome, no adjustments were considered for multiple tests across transcripts. There are hundreds of test locations across the genome but tens of thousands of transcripts, leading to a potentially serious multiple testing problem and an inflated false discovery rate (FDR). For some labs, an inflated FDR is tolerable, as many genes can be tested quickly for certain properties and discarded if found to be false positives. However, for many labs, such tests are prohibitively expensive. Statistical methods that control error rates and
that are more sensitive and more specific are needed. In a few recent studies, there has been some effort to account for both sets of multiplicities (Chesler et al. 2005; Hubner et al. 2005; Bystrykh et al. 2005). Permutation tests were performed to derive the genome-wide LOD score threshold, and q-values were computed for the set of transcripts declared significant using the corresponding genome-wide empirical p-values. As discussed in Section 4.2, this last approach may not properly control FDR and may suffer from very low power.

The main aim of the proposed thesis is to develop a statistical framework for ETL mapping that properly accounts for multiplicities while maintaining or improving upon the operating characteristics of currently used approaches. Section 2 provides a brief background on the questions addressed and data collected in ETL mapping experiments. Statistical methods for ETL mapping are reviewed in Section 4. As discussed there, the mixture over markers (MOM) model is the only statistically rigorous ETL mapping method developed to account for multiplicities across both markers and transcripts. However, MOM has a number of shortcomings. For experiments with sparse maps, the MOM model is lacking, as information between markers is not available. When dense maps are available, the MOM model by itself may not be applicable, as the number of mixture components is too large to fit. Finally, MOM does not allow for multiple ETL. Statistical methods to address each of these shortcomings are detailed in Section 5, and preliminary results are demonstrated on simulated data and data from a study of diabetes in mice. Future directions are discussed in Section 6.

2 ETL mapping experiments

The data collected in an ETL mapping experiment generally consist of a genetic map, marker genotypes, and microarray data (phenotypes) collected on a set of individuals. A genetic marker is a region of the genome of known location.
These locations make up the genetic map. The distance between markers is measured as genetic distance, in units of centimorgans (cM): the distance in cM between two loci is 100 times the expected number of crossovers between them in a single meiosis. At each marker, genotypes are obtained. ETL mapping studies take place in both human and experimental populations. We focus on the latter. For these populations, the set of possible marker genotypes is simplified. For example, studies with experimental populations most often involve arranging a cross between two inbred strains differing substantially in some trait of interest to produce F1 offspring. Segregating progeny are then typically derived from a B1 backcross (F1 x Parent) or an F2
intercross (F1 x F1). Repeated intercrossing (Fn x Fn) can also be done to generate so-called recombinant inbred (RI) lines. For simplicity of notation, we focus on a backcross population. Consider two inbred parental populations $P_1$ and $P_2$, genotyped as AA and aa, respectively, at $M$ markers. The offspring of the first generation ($F_1$) have genotype Aa at each marker (allele A from parent $P_1$ and a from parent $P_2$). In a backcross, the $F_1$ offspring are crossed back to a parental line, say $P_1$, resulting in a population with genotypes AA or Aa at a given marker. We denote AA by 0 and Aa by 1.

For each member of the backcross population, phenotypes are collected via microarrays. Microarrays allow us to take a snapshot of the expression of thousands of genes at the same time. Oligonucleotide and cDNA microarrays are the two most widely used technologies. A nice review of microarray technologies can be found in Nguyen et al. (2002). We present a very brief and by no means complete review here.

Affymetrix is one company that produces oligonucleotide chips, which contain tens of thousands of probe sets, or DNA sequences related to a gene. We will refer to these sequences throughout this paper as transcripts. Each gene is represented by some number (usually 11-20) of features. Many Affymetrix arrays use 20 features (Nguyen et al. 2002). Each feature is a short sequence of oligonucleotides. The features consist of pairs of perfect match (PM) and mismatch (MM) sequences. The PM is a segment of the gene, 25 nucleotides in length; the corresponding MM is identical to the PM except at the middle (i.e., 13th) position. After some pre-processing and normalization, one summary score is derived for each probe set. There are a number of methods for processing the probe set intensities and for normalization, such as DNA chip Analyzer by Li and Wong (2001) and Robust Multi-array Analysis (RMA) by Irizarry et al. (2003).
In a cDNA array experiment, a gene is represented by a long cDNA fragment (500 to 1000 bases). The experimental sample of interest is often labeled with a red fluorescent dye, and a reference sample is labeled green. The amount of cDNA hybridized to each probe is captured by an imaging device that measures fluorescence intensity. Image files are processed to give a summary expression score, $\log_2(R/G)$. Yang et al. (2002) propose methods for cDNA array data normalization and compare their methods with a number of other approaches. With proper pre-processing and normalization, either technology yields a single summary score of expression for each transcript on each array.
3 QTL Mapping Methods

As noted above, ETL studies are very similar to QTL studies, but with thousands of phenotypes. Perhaps it is not surprising then that early ETL studies repeatedly applied methods developed for QTL mapping to each transcript. The literature on QTL mapping methods is quite large. We review here only those methods relevant to this proposal and refer the interested reader to Doerge et al. (1997) or Lynch and Walsh (1998) for more information on QTL mapping methods.

3.1 Single Phenotype - Single QTL Models

Consider a backcross with $n$ progeny, with univariate phenotypes $y_j$ measured on all individuals, $j = 1, \ldots, n$, together with genotypes for a set of $M$ markers. Let $m_{ij} = 0$ or $1$ according to whether individual $j$ has genotype AA or Aa at the $i$th marker, $i = 1, \ldots, M$. The simplest method to test for trait-marker association is marker regression (MR), which tests for differences in mean trait value between the marker genotype groups at a particular marker. Specifically, for a test at the $i$th marker, the single QTL model is

$$y_j = \mu + \beta_i m_{ij} + \epsilon_j \tag{3.1}$$

where the $\epsilon_j$ are independent and identically distributed (iid) as Normal$(0, \sigma^2)$. One can test $H_0: \beta_i = 0$ vs. $H_1: \beta_i \neq 0$, or equivalently $H_0: \mu_0 = \mu_1$ vs. $H_1: \mu_0 \neq \mu_1$, where $\mu_0 = \mu$ and $\mu_1 = \mu + \beta_i$. This is equivalent to an analysis of variance (ANOVA) at each marker (Soller et al. 1976) when there are more than two marker genotype groups (a t-test for two genotype groups, as in a backcross). Usually, instead of F or t-statistics, geneticists prefer to report a LOD score, which is defined as the (base 10) log-likelihood ratio comparing the two hypotheses. A LOD score is calculated at each marker position, and marker loci giving significant LOD scores are identified as putative QTLs. For these putative QTLs, we loosely say the phenotype is linked to them. This approach is conceptually very simple, but it has clear problems (Lander and Botstein 1989).
First, if the true QTL is not located exactly at the marker, its effect will likely be underestimated because of recombination between the marker and the true QTL. Second, because the QTL effect and the QTL location are confounded, the power for QTL detection decreases, especially when markers are widely spaced, so that more individuals are required for the test. Third, this approach considers one
marker locus at a time, which is less powerful than multiple QTL models when more than one QTL is present. Lander and Botstein (1989) proposed interval mapping (IM), which addresses the first two problems above. Their approach also assumes a single segregating QTL, but allows for tests between markers, where genotypes are not known. Specifically, for a backcross, the proposed model to test for a QTL in the interval between markers $i$ and $i + 1$ is

$$y_j = \mu + \beta m_{kj} + e_j \tag{3.2}$$

where $m_{kj}$ is the genotype of individual $j$ at the test position $k$ between markers $i$ and $i + 1$. It takes value 0 or 1 with probability depending on the genotypes of the flanking markers and the test position (see Table 1), and $\beta$ is the effect of the putative QTL. Technically, (3.2) is a mixture of two normal distributions, since

$$p(y_j) = p(m_{kj} = 0 \mid m, r)\, p(y_j \mid m_{kj} = 0) + p(m_{kj} = 1 \mid m, r)\, p(y_j \mid m_{kj} = 1)$$

(here, $r$ denotes flanking marker distances). As in MR, tests are done at each location to test $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$. The test compares the hypothesis of a single QTL at the current locus to the null hypothesis of no QTL. The two likelihood functions must be maximized over their respective parameters. The procedure described above is repeated for each locus in the genome. In practice, test loci are set up every 1 cM or at some other user-defined spacing. The likelihood under the alternative varies with the test locus, so the EM algorithm (Dempster et al. 1977) must be applied at each locus. A LOD score profile can be constructed by plotting the LOD scores against the test positions. The LOD score is then compared to a genome-wide threshold. Wherever the LOD profile exceeds this threshold, we infer that a QTL exists. Generally, the genome-wide threshold is taken to be the 95th percentile of the distribution of the maximum (genome-wide) LOD score under the null hypothesis of no segregating QTLs.
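To make the threshold calculation concrete, the following Python sketch estimates a genome-wide LOD threshold by permutation (Churchill and Doerge 1994). It is an illustration under simplifying assumptions, not the procedure used in any of the cited studies: LOD scores are computed by marker regression under model (3.1) rather than by interval mapping, backcross genotypes are coded 0/1 with no missing data, and the function names `marker_lod` and `permutation_threshold` are hypothetical.

```python
import numpy as np


def marker_lod(y, m):
    """LOD score at every marker for one phenotype (backcross, genotypes 0/1).

    y: phenotype vector of length n.
    m: genotype matrix of shape (M, n).
    Under the normal model (3.1), LOD = (n / 2) * log10(RSS0 / RSS1), where
    RSS0 and RSS1 are the residual sums of squares without and with the marker.
    """
    y = np.asarray(y, dtype=float)
    m = np.asarray(m)
    n = y.size
    rss0 = np.sum((y - y.mean()) ** 2)
    lods = np.zeros(m.shape[0])
    for i, g in enumerate(m):
        in_group1 = g == 1
        if in_group1.all() or not in_group1.any():
            continue  # marker does not segregate: leave LOD at 0
        fitted = np.where(in_group1, y[in_group1].mean(), y[~in_group1].mean())
        rss1 = np.sum((y - fitted) ** 2)
        lods[i] = 0.5 * n * np.log10(rss0 / rss1)
    return lods


def permutation_threshold(y, m, n_perm=1000, level=0.95, seed=0):
    """Quantile of the genome-wide maximum LOD over phenotype permutations."""
    rng = np.random.default_rng(seed)
    max_lods = np.array(
        [marker_lod(rng.permutation(y), m).max() for _ in range(n_perm)]
    )
    return np.quantile(max_lods, level)
```

Permuting the phenotypes breaks any trait-marker association while leaving the correlation among markers intact, so the empirical distribution of the permuted maxima approximates the null distribution of the genome-wide maximum LOD score.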
Much effort has been expended to derive the appropriate genome-wide LOD score cutoff value (Churchill and Doerge 1994; Dupuis et al. 1995; Dupuis and Siegmund 1999; Feingold et al. 1993; Lander and Botstein 1989; Rebai et al. 1994; Rebai et al. 1995). The major advantage of IM over MR at marker loci is that it gives more precise estimates of the QTL locations and effects. However, the computational cost of IM is higher, and in the case of dense genetic markers and complete genotype data, the advantage tends to be very small. More importantly, IM, like MR at marker loci, is still a single-QTL model.
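The interval mapping calculation described in this section can be sketched in Python as follows. This is a minimal illustration, not the implementation of Lander and Botstein (1989): it assumes a backcross with fully observed flanking markers, no crossover interference (so the recombination fraction between the flanking markers is $r_1 + r_2 - 2 r_1 r_2$), and a common error variance in both genotype groups. The function names are hypothetical.

```python
import numpy as np


def qtl_prob_backcross(left, right, r1, r2):
    """P(QTL genotype = 1 | flanking marker genotypes) in a backcross.

    left, right: 0/1 genotypes at the two flanking markers.
    r1, r2: recombination fractions between the QTL and each flanking marker.
    Assumes no crossover interference.
    """
    left = np.asarray(left)
    right = np.asarray(right)
    # Joint probability of the flanking genotypes with QTL genotype 1 or 0.
    p_q1 = np.where(left == 1, 1 - r1, r1) * np.where(right == 1, 1 - r2, r2)
    p_q0 = np.where(left == 0, 1 - r1, r1) * np.where(right == 0, 1 - r2, r2)
    return p_q1 / (p_q1 + p_q0)


def _normal_pdf(y, mu, sigma2):
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)


def interval_lod(y, prior1, n_iter=500, tol=1e-10):
    """LOD score for model (3.2) at one test position, fitted by EM.

    prior1[j] is P(m_kj = 1 | flanking markers) for individual j.
    """
    y = np.asarray(y, dtype=float)
    prior1 = np.asarray(prior1, dtype=float)
    n = y.size
    # Null model: a single normal distribution.
    s2_null = np.mean((y - y.mean()) ** 2)
    ll_null = np.sum(np.log(_normal_pdf(y, y.mean(), s2_null)))
    # Starting weights equal to the priors correspond to an E-step at the null fit.
    w = prior1.copy()
    ll_old = -np.inf
    for _ in range(n_iter):
        # M-step: weighted group means and pooled variance.
        mu1 = np.sum(w * y) / np.sum(w)
        mu0 = np.sum((1 - w) * y) / np.sum(1 - w)
        s2 = np.sum(w * (y - mu1) ** 2 + (1 - w) * (y - mu0) ** 2) / n
        # E-step: posterior probability of QTL genotype 1 for each individual.
        f1 = prior1 * _normal_pdf(y, mu1, s2)
        f0 = (1 - prior1) * _normal_pdf(y, mu0, s2)
        ll = np.sum(np.log(f0 + f1))
        w = f1 / (f0 + f1)
        if ll - ll_old < tol:
            break
        ll_old = ll
    return (ll - ll_null) / np.log(10)
```

When the test position coincides with a marker (`r1 = 0`), the conditional probabilities are exactly 0 or 1 and the LOD score reduces to the marker regression value, which is a convenient sanity check on the EM loop.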
3.2 Single Phenotype - Multiple QTL Models

The principal reasons for modelling multiple QTLs are to increase sensitivity and to achieve better separation of linked QTLs. Also, epistasis (i.e., interactions between alleles at different QTLs) can only be identified through multiple-QTL models. When a single QTL model is used although there are in fact multiple QTLs, the genetic variation due to other segregating QTLs is absorbed into the environmental variation. This reduces sensitivity. A straightforward extension of MR to model multiple QTLs is multiple regression, which includes a number of different markers in the model, rather than looking at them one at a time. Let $m_{ij} = 0$ or $1$, according to whether individual $j$ has genotype AA or Aa at marker $i$. The model becomes

$$y_j = \mu + \sum_{i \in S} \beta_i m_{ij} + e_j \tag{3.3}$$

where $S$ is the set of markers for which $\beta_i$ is not 0. To implement this, one must find a way to search through the model space to find such a set $S$. As the number of markers gets larger, it becomes impossible to consider every possible model in the model space. In addition, there remains the question of how to choose between candidate models; some form of criterion is needed. Broman (2002) looked at this problem in a model selection framework. Direct use of multiple regression analysis is not easy. Moreover, Zeng (1993) showed that the partial regression coefficient is generally a biased estimate of the relevant QTL effect.

An approach for multiple QTL mapping that combines ideas from IM and multiple regression is composite IM (CIM) (Zeng 1993, 1994). The method attempts to reduce the multidimensional search for QTLs to a series of one-dimensional searches. It conditions on markers outside the region of interest while performing IM, to control for the effects of QTLs in other intervals. In this way, power for QTL detection is improved and QTL effects can be estimated more accurately. CIM can be described as follows.
One chooses a subset of markers, $S$, to control background genetic variation. As in IM, suppose one wants to test for a QTL between markers $i$ and $i + 1$. For a test at the $k$th location, between markers $i$ and $i + 1$, the statistical model can be written as

$$y_j = \mu + \beta m_{kj} + \sum_{l \neq i,\, i+1} \beta_l m_{lj} + e_j$$

where $\beta$ is the effect of the putative QTL. A general guideline in practice is to drop those markers that are within 10 cM of the test position (Broman 2002 calls this subset of markers S).
Under this model, the contribution of each individual to the likelihood has the form of a mixture of two normal distributions with means $\mu + \sum_{l \in S} \beta_l m_{lj}$ and $\mu + \beta + \sum_{l \in S} \beta_l m_{lj}$, with mixing proportions equal to the conditional probabilities of the individual having QTL genotype 0 or 1, given flanking marker genotypes and test position. Zeng (1994) used the ECM algorithm (Meng and Rubin 1993) to obtain the maximum likelihood estimates. As in IM, a LOD score is calculated at each test position, comparing the likelihood assuming there is a QTL at the putative test locus to the likelihood assuming that there is no QTL there. The LOD score is then plotted as a function of test position in the genome, and is compared to a genome-wide threshold to declare significance.

Jansen independently developed a similar approach to handle multiple QTLs by combining IM and MR, multiple-QTL mapping (MQM) (Jansen 1993; Jansen and Stam 1994). It fits single QTL models with selected markers as cofactors in the regression to eliminate the effects of possible QTLs in other intervals. The major problem with these approaches is how to choose the set of markers to be included in the model. Too many markers will give low power for QTL detection, and too few will cause low accuracy. Zeng (1994) compared, through simulation, the performance of including all other markers to that of including only unlinked markers. He then recommended some combinations of deleting or inserting linked markers in practice. Jansen (1993) and Jansen and Stam (1994) used backward elimination with AIC (Akaike 1969), or a slight variant, to pick the subset of markers in the model. Broman (2002) recommended the use of the $\mathrm{BIC}_\delta$ criterion, with the value of $\delta$ chosen by the approximate correspondence between $\mathrm{BIC}_\delta$ and a genome-wide threshold on the LOD score. Kao et al. (1999) proposed multiple IM (MIM), which uses multiple marker intervals simultaneously to include multiple QTLs in the model. In MIM, Kao et al.
(1999) adopted stepwise selection with the Likelihood Ratio Test (LRT) as the selection criterion to identify QTLs.

3.3 Multiple Phenotype - Single or Multiple QTL Models

In many QTL mapping studies, more than one trait is measured. Performing single trait analysis repeatedly is clearly not optimal because it does not take into account the correlation structure among the traits. It has been shown that analyzing traits jointly increases the power of QTL detection (Jiang and Zeng 1995; Knott and Haley 2000). In these studies, a joint distribution is imposed on the traits, which requires specification of the covariance structure of the traits (for a review of multi-trait QTL mapping methods, see Lund et al. and references therein).
Multi-trait methods are very attractive in that they try to capture the inner structure among traits, because many traits are genetically or environmentally correlated. However, as the number of traits gets larger, so does the number of parameters that need to be estimated.

4 ETL Mapping Methods

To date, most ETL mapping methods consider a single transcript at a time. Multi-trait mapping using available methods has not been attempted. Most recognize that this would be impossible with thousands of traits, since estimation of a phenotype covariance matrix is not feasible. We detail the methods used below.

4.1 Transcript Based Approach

The earliest ETL mapping studies applied single phenotype-single QTL mapping methods to every transcript (Brem et al. 2002; Schadt et al. 2003). We call this type of approach transcript based (TB). In Brem et al. (2002), a Wilcoxon-Mann-Whitney rank sum test was applied to every transcript and marker pair. Nominal p-values were reported, and the number of linkages expected by chance was estimated by permutation tests (Churchill and Doerge 1994). In Schadt et al. (2003), transcript specific LOD score profiles were obtained using standard QTL IM. A common genome-wide LOD score threshold was chosen to account for the potential increase in type I error induced by testing across multiple markers. Neither study accounted for multiple tests across transcripts. We view this as a problem that needs serious attention because ETL studies usually involve thousands of traits.

4.2 Transcript Based Approach with FDR control

Recently, investigators have made attempts to account for multiplicities across transcripts (Chesler et al. 2005; Hubner et al. 2005). They first computed genome-wide empirical p-values of the maximum LOD score for every transcript using permutation tests and then estimated the q-values (Storey and Tibshirani 2003) accordingly.
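The two steps of this approach can be sketched in Python as follows. The sketch is a simplification written for illustration and is not the code used in the cited studies: the empirical p-value uses the common $(b + 1)/(B + 1)$ convention, the proportion of true nulls is estimated at a single fixed tuning value `lam` rather than by the smoothing over tuning values described by Storey and Tibshirani (2003), and the function names are hypothetical.

```python
import numpy as np


def empirical_pvalues(observed_max, null_max):
    """Genome-wide empirical p-value of the maximum LOD score per transcript.

    observed_max: shape (T,), observed maximum LOD score for each transcript.
    null_max: shape (T, B), maximum LOD scores from B permutations.
    """
    observed_max = np.asarray(observed_max, dtype=float)
    null_max = np.asarray(null_max, dtype=float)
    n_perm = null_max.shape[1]
    exceed = (null_max >= observed_max[:, None]).sum(axis=1)
    return (exceed + 1) / (n_perm + 1)


def qvalues(p, lam=0.5):
    """q-values with a fixed-lambda estimate of the null proportion pi0."""
    p = np.asarray(p, dtype=float)
    n_tests = p.size
    # Fraction of p-values above lam, rescaled; capped at 1.
    pi0 = min(1.0, np.mean(p > lam) / (1 - lam))
    order = np.argsort(p)
    ranked = pi0 * n_tests * p[order] / np.arange(1, n_tests + 1)
    # Enforce monotonicity: q-value is the minimum over all larger p-values.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n_tests)
    q[order] = np.minimum(ranked, 1.0)
    return q
```

With `pi0` capped at 1 the q-values coincide with Benjamini-Hochberg adjusted p-values, which shows how little information about the markers survives in this procedure: each transcript contributes only its single genome-wide p-value.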
It should be pointed out that, even though these studies attempt to adjust for multiple testing across transcripts, the adjustment is by no means a systematic approach to the multiple testing problem. Preliminary results from simulations such as those described in Kendziorski et al. (2004) show that FDR is not properly controlled and power is
relatively lower than for other approaches. This is consistent with the results of Chesler et al. (2005), where the q-value threshold was increased to 0.25 so that a reasonable number of transcripts could be identified.

4.3 Marker Based Approach

Instead of conducting the analysis at every transcript, ETL mapping can be done by conducting the analysis at every marker. We call these marker based (MB) approaches. They consist of identifying differentially expressed (DE) transcripts across groups of animals, where groups are determined by the genotype at a given marker. Any DE method could be used. Usually, the DE evidence threshold can be chosen so that multiple testing across transcripts is accounted for. However, the MB approach does not consider multiplicities across the genome.

As was shown in Kendziorski et al. (2004), TB and MB approaches share similar flaws. In TB, separate tests are conducted for each transcript. In MB, each marker is tested separately. For both, the evidence that a transcript maps to a marker is measured against the evidence that it does not map there. Since in reality a transcript can map to any of the marker locations, the evidence that a transcript maps to a particular marker should be judged relative to the possibility that it maps nowhere or to some other marker. This idea motivates what we call the Mixture Over Markers (MOM) model (Kendziorski et al. 2004).

4.4 Mixture Over Markers Model

Let $y_t$ be the expression level for the $t$th transcript, $y_t = \{y_{t1}, y_{t2}, \ldots, y_{tn}\}$, where $n$ is the number of animals in the ETL study. The MOM model assumes a transcript $t$ maps nowhere with probability $p_0$ and maps to marker $m$ with probability $p_m$, such that $\sum_{m=0}^{M} p_m = 1$, where $M$ denotes the total number of markers.
The marginal distribution of $y_t$ is then given by

$$p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t) \tag{4.4}$$

where $f_m$ is the predictive density of the data if transcript $t$ maps to marker $m$, and $f_0$ is the predictive density when the transcript maps nowhere. Specifically, suppose transcript abundance measurements $y_{tj}$ arise independently from some observation distribution $f_{obs}(\cdot \mid \mu_{t,\cdot}, \theta)$. The dependence among the underlying means $\mu_{t,\cdot}$ is captured by a distribution $\pi(\mu)$. With this setting, $f_0(y_t) = \int \left( \prod_{j=1}^{n} f_{obs}(y_{tj} \mid \mu) \right) \pi(\mu)\, d\mu$. For a transcript that maps to marker $m$, say, the underlying
expression means defined by the marker genotype groups are not equal ($\mu_{t,0} \neq \mu_{t,1}$), but both are assumed to come from $\pi(\mu)$. The governing distribution for $y_t$ is then

$$f_m(y_t) = f_0(y_t^0)\, f_0(y_t^1)$$

where $y_t^{0(1)}$ denotes the set of transcript $t$ values for animals with genotype 0(1). Model fit proceeds via the EM algorithm. Once the parameter estimates are obtained, posterior probabilities of mapping nowhere or to any of the $M$ locations can be calculated via Bayes rule. For instance, the posterior probability that transcript $t$ maps to location $l$, $l = 0, \ldots, M$, is given by

$$\frac{p_l f_l(y_t)}{p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t)} \tag{4.5}$$

With the MOM approach, a transcript is identified as DE if the posterior probability of DE exceeds some threshold, where the threshold is chosen to control the expected posterior false discovery rate (Newton et al. 2004). In order to make a transcript specific call, the highest posterior density (HPD) region can be constructed in a straightforward fashion. A $1 - \alpha$ HPD region is obtained by including marker locations until the sum of their posterior probabilities exceeds $1 - \alpha$.

4.5 Current Status of ETL Mapping Methods

Here we present some of our early simulation results comparing different approaches. We considered a single ETL simulation with two chromosomes. Marker data were obtained from chromosomes 2 and 3 of the F2 data from Dr. Alan Attie's lab on campus. Chromosome 2 has 17 markers and chromosome 3 has 6 markers. A single ETL was simulated at marker 5 of chromosome 2. We generated 20 data sets for each of seven values of $\nu_0$, a tuning parameter that controls the variance pattern in the simulated data (for details, see Kendziorski et al. 2004), so that operating characteristics could be evaluated without biasing towards one method. We refer to applying MR to every transcript as transcript based MR (TB-MR).
For TB-MR, the genome-wide type I error rate per transcript is controlled at 5% (Dupuis and Siegmund 1999). We also consider four marker based approaches. The first is EBarrays with the LogNormal-Normal (LNN) model, where posterior probabilities of DE are computed for all transcripts at every marker (MB-EB). See Kendziorski et al. (2003) for details of the LNN model. The second is a t-test of equal means for every transcript, followed by calculation of q-values (Storey and Tibshirani
2003) at every marker (MB-Q). The third and fourth methods calculate moderated t-statistics using SAM (Tusher et al. 2003) and LIMMA (Smyth 2004), followed by q-value calculation at every marker. We refer to these as MB-SAM and MB-LIMMA. For MB-EB, MB-Q, MB-SAM, and MB-LIMMA, the false discovery rate per marker is controlled at 5%. A fifth method attempts to test all the transcripts and all the markers simultaneously. P-values from t-tests for every transcript and marker pair were used to calculate q-values (Storey and Tibshirani 2003) all at once. Note that this assumes that a certain dependence structure among tests is satisfied (Storey 2003), which is unlikely to hold here. We nevertheless include this method (Q-ALL) in our simulation, because we would like to see the performance of this kind of ad hoc procedure. In addition, Storey and Tibshirani (2003) use the ETL mapping data of Brem et al. (2002) as motivation for calculating all q-values simultaneously. For Q-ALL, FDR control is targeted at 5%.

Power is defined as the probability of calling marker 4, 5 or 6 (flanking region of the true ETL) on chromosome 2 for mapping transcripts. FDR is the proportion of transcripts identified incorrectly as mapping to chromosome 2 or 3, i.e., transcripts that were equivalently expressed (EE), or that were DE but were mapped outside the flanking region of the true ETL. Figure 1 shows the average power and FDR (over the 20 simulated data sets) against each value of $\nu_0$, together with 95% point-wise confidence intervals. As can be seen, the power is between roughly 80% and 90% for all of the methods; however, for all methods except MOM, the FDR is well above 5% for most values of $\nu_0$. These simple and somewhat ad hoc approaches fail to control the FDR at their claimed level because they cannot adjust for multiple tests across the markers and the transcripts simultaneously. Although MOM does adjust appropriately for these multiplicities, the approach has a number of shortcomings.
For experiments with sparse maps, the MOM model is lacking, as information between markers is not available. Also, MOM does not allow for multiple ETL. Some dense maps are currently available or under development, particularly those that use single nucleotide polymorphisms (SNPs) (see The SNP Consortium). Because of their proximity to such mutations, SNPs may be shared among groups of people with harmful but unknown mutations and serve as markers for them. Such markers can help to reveal the mutations and expedite therapeutic drug discovery. When dense maps are available, the MOM model as it stands may not be applicable, as the number of mixture components to fit will be huge.
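Once the log predictive densities are available, the MOM calculations of Section 4.4 reduce to a few array operations, which the Python sketch below illustrates. It is not the MOM implementation: it takes the values $\log f_m(y_t)$ and the mixing proportions as given (in the method itself they come from the EM fit), it assumes that the HPD region is built by adding markers in order of decreasing posterior probability after renormalizing over markers $1, \ldots, M$, and it calls transcripts in the spirit of Newton et al. (2004) by keeping the average posterior null probability of the called list below the target. All function names are hypothetical.

```python
import numpy as np


def mom_posterior(log_f, p):
    """Posterior probabilities (4.5) for every transcript and location.

    log_f: shape (T, M + 1); column 0 holds log f_0(y_t), column m holds log f_m(y_t).
    p: shape (M + 1,), mixing proportions p_0, ..., p_M.
    """
    log_w = np.asarray(log_f, dtype=float) + np.log(p)
    log_w -= log_w.max(axis=1, keepdims=True)  # guard against underflow
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)


def hpd_markers(post_row, alpha=0.05):
    """Markers (numbered 1..M) forming a 1 - alpha HPD region for one transcript.

    Assumption: posterior mass is renormalized over the M markers first.
    """
    marker_post = post_row[1:] / post_row[1:].sum()
    order = np.argsort(marker_post)[::-1]  # decreasing posterior probability
    cum = np.cumsum(marker_post[order])
    k = min(int(np.searchsorted(cum, 1 - alpha)) + 1, order.size)
    return (order[:k] + 1).tolist()


def call_mapping_transcripts(post_null, target=0.05):
    """Largest list whose average posterior null probability is <= target."""
    post_null = np.asarray(post_null, dtype=float)
    order = np.argsort(post_null)
    running_mean = np.cumsum(post_null[order]) / np.arange(1, post_null.size + 1)
    k = int(np.sum(running_mean <= target))  # running mean is nondecreasing
    called = np.zeros(post_null.size, dtype=bool)
    called[order[:k]] = True
    return called
```

For example, `call_mapping_transcripts(mom_posterior(log_f, p)[:, 0])` flags the transcripts declared to map somewhere, and `hpd_markers` applied to each flagged row gives the transcript specific region.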
5 Research Plan

The main objective of the proposed thesis is to develop a powerful statistical framework within which ETL can be localized. The framework relies on the MOM model, in that it simultaneously controls for tests done across genome locations and across all transcripts. However, the proposed methods, detailed below, significantly increase the utility and applicability of MOM by addressing the first two shortcomings listed above. The effort to develop a method that handles dense maps is ongoing.

5.1 ETL Interval Mapping

The genomic regions identified using MOM are limited by their size, which may be large, as analysis is conducted at genotyped markers only. When dense maps are not available (e.g., the Attie lab data and Schadt et al. 2003), this limitation can be a serious one. The biological techniques currently available to search for genes in large genomic regions (e.g., candidate gene approach, congenic lines) can take years and, as a result, additional statistical methods capable of narrowing down regions are necessary. We here propose a method for IM of ETL.

Consider, for simplicity of notation, a backcross population genotyped as 0 or 1 at $M$ markers. The observed phenotype data $y$ is a $T \times n$ matrix of transcript abundance levels for transcripts $t = 1, \ldots, T$ and individuals $j = 1, \ldots, n$; $m$ is an $M \times n$ matrix of marker genotypes for markers $i = 1, \ldots, M$ and individuals $j = 1, \ldots, n$. Considering a set of $L$ locations spanning the entire genome, we model the expression data for transcript $t$ as an $(L + 1)$-component mixture. To be specific, we imagine that the transcript may map nowhere with probability $p_0$, and to any of the $L$ locations with probability $p_l$, $l = 1, 2, \ldots, L$. The $p$'s are mixing proportions. As noted in Section 3.1, transcript $t$ is said to be linked (mapped) to location $l$ if $\mu^0_{t,l} \neq \mu^1_{t,l}$, where $\mu^{0(1)}_{t,l}$ denotes the latent mean level of expression for transcript $t$ for the population of individuals with genotype 0(1) at location $l$.
Let $z_t^l$ be an indicator of whether transcript $t$ maps to location $l$. If $l$ is at a marker, then we can decompose the predictive density under the alternative hypothesis such that $f_l(y_t) = f_0(y_t^0)\, f_0(y_t^1)$, where the grouping is determined by genotypes at that marker. However, when $l$ is between markers, the decomposition
is no longer valid. Instead, we have

$$f_l(y_t) = \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l = \int\!\!\int\!\!\int \prod_{j \in G^l_0} f_{obs}(y_{tj} \mid \mu_{t,0}) \prod_{j \in G^l_1} f_{obs}(y_{tj} \mid \mu_{t,1})\; \pi(\mu_{t,0})\, \pi(\mu_{t,1})\, p(g^l \mid m)\; d\mu_{t,0}\, d\mu_{t,1}\, dg^l$$

where $g^l = (g^l_1, g^l_2, \ldots, g^l_n)$ denotes the unknown genotype vector at location $l$, and $G^l_{0(1)}$ denotes the set of individuals having genotype 0(1) at location $l$. Under the null hypothesis, the predictive density of the data, $f_0(y_t)$, can be calculated as before, since it does not rely on genotype groupings. Parameter estimates for mixing proportions and hyperparameters can be obtained via the EM algorithm (see Appendix A). We show that the posterior probability for transcript $t$ to be mapped to location $l$, after integrating out the $\mu$'s, is given by

$$p(z_t^l = 1 \mid y, m) = \frac{p(z_t^l = 1) \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l}{p(y_t \mid m)} \tag{5.6}$$

where $p(z_t^l = 1)$ is the prior probability that transcript $t$ maps to location $l$. At a particular location $l$, the conditional distribution of the genotype, $p(g^l \mid m)$, is assumed to depend only on the two markers flanking $l$. Notice that $g^l$ is a vector of length $n$. Theoretically, there are $2^n$ possible genotype vectors, so the integral in (5.6) is a huge mixture. In practice, one can restrict attention to a smaller number (Table 1, Zeng 1994, reproduced here as Table 2), since many of them have small probabilities. However, when the number of individuals in the study is large, this $2^n$ problem quickly becomes computationally infeasible.

Pseudomarker-MOM

To get around the $2^n$ problem, we propose here a general framework for ETL IM using importance sampling and pseudomarker generation. The idea of pseudomarkers was introduced by Sen and Churchill (2001) (see Appendix B). Let us introduce some slightly more general notation. Recall that a transcript $t$ is linked to location $l$ if $\mu^0_{t,l} \neq \mu^1_{t,l}$, where $\mu^{0(1)}_{t,l}$ denotes the latent mean level of expression for transcript $t$ for the population of individuals with genotype 0(1) at location $l$.
When this is the case, we say that transcript $t$ is in expression pattern P1 at location $l$, denoted $P1_t^l$; similarly, $P0_t^l$ denotes the null pattern of expression at location $l$ ($\mu^0_{t,l} = \mu^1_{t,l}$). It is useful to introduce specific patterns in our framework since, when an F2 population is considered, for example, there are numerous patterns (each defines a way in which the latent means can differ across genotype groups at a location; all of the non-null patterns imply linkage to that location). Two $T \times L$ matrices, $\theta^0$ and $\theta^1$, contain the latent mean levels of expression ($\theta = (\theta^0, \theta^1)$); $L$ denotes the total number of locations considered. Let $z_t^l = 1$ if transcript $t$ is in expression pattern P1 at location $l$ and 0 otherwise. Then $z$ is a $T \times L$ indicator matrix specifying QTL locations for each transcript.

Let us first consider a simple case where a transcript is associated with at most one genomic location $l$ and consider inference at location $l$. This assumption simplifies the algebraic development and will be relaxed later. At location $l$, of primary interest is the posterior probability that $z_t^l = 1$ for transcript $t$. Reexpressing (5.6) in terms of the patterns, we have

$$p(Pk_t^l \mid y, m) \propto p^l_{t,Pk} \int f_{Pk}(y_t \mid g^l)\, p(g^l \mid m)\, dg^l \tag{5.7}$$

where $p^l_{t,Pk}$ denotes the prior probability that transcript $t$ is in pattern $k$ at location $l$ and $f_{Pk}$ describes the predictive density of the data; $k = 0, 1$ for a backcross. A normalizing constant is not required if further calculations of operating characteristics such as FDR are not of interest, which may be the case if a simple ranking of the genes is desired. For the model we propose here, we are interested in the estimated FDR. Therefore, we must specify the normalizing constant $p(y_t \mid m)$. Expanding the probability in terms of the different possible mapping locations, we have

$$p(y_t \mid m) = p(y_t \mid m, z_t = 0)\, p(z_t = 0) + \sum_{l'=1}^{L} p(y_t \mid m, z_t^{l'} = 1)\, p(z_t^{l'} = 1),$$

where $p(z_t = 0) + \sum_{l'=1}^{L} p(z_t^{l'} = 1) = 1$ and $z_t = 0$ means that the transcript does not map to any of the $L$ locations, i.e., the transcript is in pattern P0 at every location. Therefore, the exact form of (5.7) is given by

$$p(Pk_t^l \mid y, m) = \frac{p^l_{t,Pk} \int f_{Pk}(y_t \mid g^l)\, p(g^l \mid m)\, dg^l}{p_{t,P0}\, f_{P0}(y_t) + \sum_{l'=1}^{L} p^{l'}_{t,P1} \int f_{P1}(y_t \mid g^{l'})\, p(g^{l'} \mid m)\, dg^{l'}} \tag{5.8}$$

The $p_{t,P0}$ and $p^l_{t,P1}$'s are unknown. We estimate them from the data as the average posterior probability that each transcript belongs to a particular pattern.
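As a rough numerical illustration of the normalization in (5.8), the sketch below takes, for a single transcript, a prior probability and an already-integrated predictive density for the null pattern and for each of $L$ locations, and returns the normalized posterior probabilities. The function name `pattern_posteriors` and the toy numbers are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def pattern_posteriors(prior_null, priors_loc, dens_null, dens_loc):
    """Normalize prior-times-density terms as in (5.8) for one transcript.

    prior_null : prior probability of the null pattern P0
    priors_loc : length-L prior probabilities of pattern P1 at each location
    dens_null  : predictive density of y_t under P0
    dens_loc   : length-L integrated predictive densities under P1
    (all inputs are hypothetical placeholders for quantities in the text)
    """
    priors_loc = np.asarray(priors_loc, dtype=float)
    dens_loc = np.asarray(dens_loc, dtype=float)
    numer_null = prior_null * dens_null
    numer_loc = priors_loc * dens_loc
    total = numer_null + numer_loc.sum()   # normalizing constant p(y_t | m)
    return numer_null / total, numer_loc / total

# Toy example with L = 2 locations.
post_null, post_loc = pattern_posteriors(
    prior_null=0.9, priors_loc=[0.05, 0.05],
    dens_null=1.0, dens_loc=[4.0, 16.0])
```

The null and location-specific posteriors sum to one by construction, which is what makes FDR-type summaries possible.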
Due to the $2^n$ problem, calculation of $p(g^l \mid y, m)$, the posterior distribution of the unobserved genotype at location $l$ given the expression and marker data, becomes computationally prohibitive. Here we use importance sampling by first simulating multiple versions of pseudomarkers from $p(g^l \mid m)$ and then replacing the exact integral with its Monte Carlo approximation. In simulating the pseudomarkers, one can use a simple Markov chain structure where the putative QTL genotype at a given location depends only on the two flanking markers. However, this might not work well if there are genotyping errors or noninformative markers in the marker data. A hidden Markov model (HMM) can be considered where the true marker genotypes follow a Markov chain and the observed marker genotypes are characterized by distributions conditional on the underlying state process. Using an HMM, one can account for genotyping error and missing marker data in a coherent way. R/qtl (Broman et al. 2003) has this option as well. Figure 11 gives an example. The upper stripe gives the marker data for one animal. Blue shows AA and yellow shows Aa. There are some missing marker data, represented by light blue. The lower panel has 20 realizations sampled from the HMM trained on the marker data.

Suppose for each location $l$, $Q$ genotype vectors are sampled from the proposal distribution $p(g \mid m)$, where $g = \{g^1, g^2, \ldots, g^L\}$, to give $(g_1^l, g_2^l, \ldots, g_Q^l)$ for $l = 1, \ldots, L$. Then equation (5.8) can be approximated by Monte Carlo integration using importance samples (see Appendix C):

$$p(Pk_t^l \mid y, m) \approx \frac{p^l_{t,Pk} \sum_{q=1}^{Q} f_{Pk}(y_t \mid g_q^l)}{p_{t,P0} \sum_{q=1}^{Q} f_{P0}(y_t) + \sum_{l'=1}^{L} p^{l'}_{t,P1} \sum_{q=1}^{Q} f_{P1}(y_t \mid g_q^{l'})} \tag{5.9}$$

This approach is an extension of the MOM model evaluated by averaging over the pseudomarkers. We call it pseudomarker-MOM.

Two-Stage Approach

Pseudomarker-MOM scans the genome at some small distance step in order to find potential locations to which transcripts are mapped. But applying it over the entire genome, even for each chromosome separately, can be computationally prohibitive. We propose a two-stage approach where we first apply MOM at markers to identify interesting regions, then follow up by applying pseudomarker-MOM to those regions to better localize ETLs. MOM calculates, for each transcript, the posterior probability that it maps to a particular marker or that it does not map at all. To identify potential ETL regions, we average the linkage evidence across all the transcripts to give a marker-specific linkage score.
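The pseudomarker sampling and Monte Carlo averaging behind (5.9) can be sketched as follows for a backcross. This is a minimal sketch under stated assumptions: the simple Markov chain structure (not the HMM), no genotyping error, no crossover interference (Haldane map function), and a user-supplied predictive density. All function names here are hypothetical and do not come from R/qtl or from the original implementation.

```python
import numpy as np

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - np.exp(-2.0 * d_cm / 100.0))

def prob_g1(g_left, g_right, d_left, d_right):
    """P(g = 1 | flanking genotypes) in a backcross, assuming no genotyping
    error and no crossover interference (assumptions of this sketch)."""
    r1, r2 = haldane_r(d_left), haldane_r(d_right)

    def joint(g):
        p_left = 1.0 - r1 if g == g_left else r1     # P(g | g_left)
        p_right = 1.0 - r2 if g == g_right else r2   # P(g_right | g)
        return p_left * p_right

    return joint(1) / (joint(0) + joint(1))

def sample_pseudomarkers(left, right, d_left, d_right, Q, rng):
    """Draw Q genotype vectors (rows) at one location for n individuals."""
    p = np.array([prob_g1(a, b, d_left, d_right) for a, b in zip(left, right)])
    return (rng.random((Q, len(p))) < p).astype(int)

def mc_average(density, y, draws):
    """Monte Carlo average of density(y, g) over sampled genotype vectors."""
    return float(np.mean([density(y, g) for g in draws]))

# Toy example: four individuals, a location midway between markers 10 cM apart.
rng = np.random.default_rng(0)
left = np.array([1, 1, 0, 0])
right = np.array([1, 0, 1, 0])
draws = sample_pseudomarkers(left, right, d_left=5.0, d_right=5.0, Q=50, rng=rng)
```

Because the same $Q$ appears in the numerator and denominator of (5.9), using averages instead of sums leaves the ratio unchanged.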
Hot-spot markers can then be chosen by thresholding the marker-specific linkage evidence, using an HPD region.

Theoretical Result

We here justify that picking the regions with the highest average linkage evidence is indeed the correct thing to do under simplified conditions.

Theorem 1: If we assume that transcripts map to at most one genomic location, that the prior probability of mapping to a particular marker is known and equal for all markers, and that hyperparameter values are known, then the expected posterior probability of a transcript mapping to a particular marker is a non-increasing function of the recombination frequency between that marker and the ETL.

Proof: See Appendix D.

As a result of Theorem 1, we can be sure that under these conditions, the interesting marker regions picked using MOM will be those regions closest to the ETL. Once these regions are defined, we set up an equally spaced pseudomarker grid and use pseudomarker-MOM to localize the ETL with greater accuracy. We show by simulation studies in the next section that this approach works quite well, even under more general conditions.

Simulation Results

A simulation was set up with 5000 transcripts and 100 individuals. The proportion of differential expression (DE) was 10%. The hypothetical marker map is composed of one chromosome with markers equally spaced 10cM apart. There are 10 markers in total (i.e., from 0cM to 90cM). Intensity values are simulated as described in Section 4.4. Two simulations were considered: one with a single ETL at 35cM and one with two ETLs, at 35cM and 75cM. The two-stage approach was applied in the simulation. Specifically, two hot-spot marker regions were selected from the average posterior probability profile at every marker, obtained by MOM. Pseudomarker-MOM was then applied across a 2cM grid within the hot-spot regions, with 50 pseudomarker realizations (Q = 50). For comparison, we applied traditional QTL IM transcript by transcript on the same data set and obtained genome-wide cutoffs on LOD scores based on an approximation formula from Rebai et al. (1994).

Here we show results from the two simulations. Figure 2 (left panel) shows the average posterior probability profile, averaged across the mapping transcripts. The ETL region is identified by both MOM and pseudomarker-MOM. However, MOM picks up a wide peak between marker 4 and marker 5, whereas pseudomarker-MOM identifies the ETL location with much better accuracy.
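The first stage of the two-stage approach used in this simulation can be sketched as follows: average the posterior mapping probabilities over transcripts to obtain a marker-specific score, then keep the smallest set of markers whose normalized scores reach a chosen level. Reading the HPD-region thresholding as this "smallest highest-score set" is an assumption of the sketch, as are the function names and the toy numbers.

```python
import numpy as np

def marker_scores(post):
    """Average a T x M matrix of posterior mapping probabilities over transcripts."""
    return np.asarray(post, dtype=float).mean(axis=0)

def hpd_markers(scores, level=0.9):
    """Smallest set of markers whose normalized scores sum to at least `level`
    (one plausible reading of HPD-region thresholding, assumed here)."""
    w = np.asarray(scores, dtype=float)
    w = w / w.sum()
    order = np.argsort(-w)                     # markers from highest to lowest score
    cum = np.cumsum(w[order])
    k = int(np.searchsorted(cum, level) + 1)   # first k markers reaching the level
    return sorted(order[:k].tolist())

# Toy example: two transcripts, five markers (0-based marker indices).
post = np.array([[0.02, 0.05, 0.5, 0.3, 0.03],
                 [0.02, 0.05, 0.5, 0.3, 0.03]])
scores = marker_scores(post)
hot = hpd_markers(scores, level=0.8)
```

The selected markers would then define the regions over which a finer pseudomarker grid is laid in the second stage.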
To ensure that non-mapping transcripts were not falsely identified, we considered the posterior probability profiles averaged across non-mapping transcripts. They show little structure, as expected (figure not shown). Figure 2 (right panel) shows the 96.8% HPD regions for true mapping transcripts (96.8% is used to compare with IM results below). As shown, the ETL is identified correctly for most of the mapping transcripts. As in the left panel, pseudomarker-MOM provides good ETL localization.
In comparison, we also considered IM on the same simulated data set. This was implemented using R/qtl (Broman et al. 2003). Figure 3 (left panel) shows the average LOD score, averaged across mapping transcripts. The region containing the ETL has the highest average LOD, but the average LOD scores are very high overall for mapping transcripts, and it is not clear what cutoff one should use in order to correctly identify the ETL region. For example, a cutoff of 5 selects almost the entire simulated chromosome.

To compare with Figure 2, a confidence interval was constructed around the ETL using a 1-LOD drop interval around the peak LOD score (Mangin et al. 1994). This procedure is designed to approximate a 96.8% confidence interval, but in general these intervals can be biased in that they are too small, and a bootstrap procedure has been recommended (Visscher et al. 1996). In the ETL setting, obtaining bootstrap samples for thousands of transcripts does not seem feasible. On the other hand, confidence intervals that are slightly too small favor IM here, as ETL appear to be better localized. It is not always clear which peaks to construct confidence intervals around. To give IM the best results, we consider a 10cM window around the true ETL (35cM) and define the LOD peak as the highest LOD within the window. The 1-LOD drop interval is then constructed. Of course, in practice, one does not have the luxury of knowing where to choose these peaks, and perhaps only the largest peak would be identified. For these reasons, this method of identifying ETL regions favors IM. Even so, in comparison to pseudomarker-MOM, IM does not provide as precise estimates of ETL location.

Similar results were obtained for the 2-ETL case. As shown in Figure 4 (left panel), for a particular simulation, the two ETL regions are identified by both MOM and pseudomarker-MOM.
Spurious peaks do show up in different places in different simulations, but in general they have low average posterior probability compared with the main peaks (e.g., in the bottom row). As shown, pseudomarker-MOM improves localization of the ETL compared with MOM. In Figure 4 (right panel), the distinct ETLs are identified for most of the mapping transcripts. We see that pseudomarker-MOM provides good ETL localization.

We again considered IM on the same two simulated data sets. Figure 5 (left panel) shows the average LOD score profile, averaged across mapping transcripts. The regions containing the ETLs have the highest average LOD, but distinct ETLs are not as clear as in Figure 3. As before, the 96.8% confidence intervals were constructed around each ETL using a 1-LOD drop interval around peak LOD scores, with peaks chosen from the two 10cM windows around the true ETLs (35cM and 75cM). IM suffers from the same problem as before in that it has relatively more false positive calls.
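The 1-LOD drop construction used for the IM comparison can be illustrated on a toy LOD profile. The sketch below is a simplified grid-level version (no interpolation between positions); the function name, the toy LOD values and the window argument are assumptions for illustration and do not reproduce the R/qtl implementation.

```python
import numpy as np

def lod_drop_interval(positions, lod, drop=1.0, window=None):
    """Return (left, peak, right) positions of a LOD-drop support interval.

    The peak is the highest LOD, optionally restricted to `window` = (lo, hi);
    the interval extends outward from the peak while LOD >= peak LOD - drop.
    """
    positions = np.asarray(positions, dtype=float)
    lod = np.asarray(lod, dtype=float)
    idx = np.arange(len(lod))
    if window is not None:
        idx = idx[(positions >= window[0]) & (positions <= window[1])]
    peak = idx[np.argmax(lod[idx])]
    cutoff = lod[peak] - drop
    left = peak
    while left > 0 and lod[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lod) - 1 and lod[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[peak], positions[right]

# Toy profile on a 10 cM grid; the peak is searched in a window around 35 cM.
positions = np.arange(0, 100, 10)
lod = np.array([0.5, 1.0, 2.0, 4.2, 4.5, 3.0, 1.5, 2.0, 1.0, 0.4])
interval = lod_drop_interval(positions, lod, window=(30, 40))
```

Restricting the peak search to a window around the true location mirrors the choice described above, which is only possible in a simulation where the truth is known.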
5.2 Multiple ETL mapping

The MOM approach can be extended to account for multiple ETL. For example, if transcript $t$ is possibly affected by two genotype locations $l_1$ and $l_2$, then four latent means are of interest: $\mu^{0,0}_{t,(l_1,l_2)}$, $\mu^{0,1}_{t,(l_1,l_2)}$, $\mu^{1,0}_{t,(l_1,l_2)}$ and $\mu^{1,1}_{t,(l_1,l_2)}$, where $\mu^{j,k}_{t,(l_1,l_2)}$ denotes the latent mean level of expression for transcript $t$ for the population of individuals with genotype $(j, k)$ at locations $l_1$ and $l_2$. These latent means can be arranged in 15 possible expression patterns, all of which may be of interest. For simplicity, we consider:

P0: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$
P1: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$
P2: $\mu^{0,0}_{t,(l_1,l_2)} = \mu^{1,0}_{t,(l_1,l_2)} \neq \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$
P3: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,1}_{t,(l_1,l_2)}$
P4: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} = \mu^{0,1}_{t,(l_1,l_2)} = \mu^{1,1}_{t,(l_1,l_2)}$
P5: $\mu^{0,0}_{t,(l_1,l_2)} \neq \mu^{1,0}_{t,(l_1,l_2)} \neq \mu^{0,1}_{t,(l_1,l_2)} \neq \mu^{1,1}_{t,(l_1,l_2)}$

Pattern P0 allows for the possibility that a transcript maps to neither location. The latent means of a transcript mapping only to location $l_1$ would satisfy pattern P1, since only allelic differences at $l_1$ affect the mean level of expression. Similarly, the means of transcripts mapping only to location $l_2$ satisfy P2. Patterns P3-P5 describe possible ways in which the allelic variation at both locations can act and interact to affect expression level. P3 describes a scenario in which the alleles at each location have equal, but not dominant, effects. A dominant model would be described by P4 and an additive model by P5. The multiple ETL MOM model (M-MOM) has 5 ( ) M patterns. As before, of primary interest is the posterior probability of particular expression patterns.
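The count of 15 possible expression patterns for four latent means can be checked by enumeration, if one treats a pattern as a partition of the four means into groups of equal value (an interpretation assumed for this sketch; means in different groups are taken to differ). The helper name `set_partitions` and the string labels for the means are illustrative.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or give it a block of its own
        yield [[first]] + part

# Labels "jk" stand for the latent mean with genotype (j, k) at (l1, l2).
means = ["00", "10", "01", "11"]
patterns = list(set_partitions(means))
n_patterns = len(patterns)

# Canonical form so that specific patterns can be looked up.
canon = [sorted(sorted(block) for block in p) for p in patterns]
p1 = sorted([sorted(["00", "01"]), sorted(["10", "11"])])   # P1 above
p3 = sorted([["00"], sorted(["10", "01"]), ["11"]])         # P3 above
```

Under this reading, the six patterns P0 to P5 listed above are a subset of the enumerated partitions.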
These posterior probabilities can be calculated similarly as in (4.5) for any pattern of interest, where k = 0, 1,...,

Simulation Results

We apply M-MOM to the simulated 2-ETL data described above. We perform a two-dimensional scan by looking at every possible marker pair and all the expression patterns simultaneously. In the simulation, the two ETL effect sizes are generated to be the same, thus corresponding to pattern 3. If we look at the average posterior probabilities for all non-null patterns, most of them are negligible, as expected, except for pattern 3 (figure not shown). Looking closely at pattern 3, we plot the average posterior probabilities for every marker pair in Figure 6. The first ETL location is on the Y axis and the second ETL location on the X axis. As shown, the plot locates the two ETLs at 30cM and 80cM, which is very close to the truth and the best accuracy that M-MOM can achieve, since the true ETLs are not at markers.

For comparison purposes, we implement a 2-D marker regression scan on the same data set (see Figure 7). On the upper triangle, we plot the average LOD score for epistasis, which has very low probability, as expected. The diagonal is the average LOD score from the 1-D IM scan. The lower triangle gives the average joint LOD scores. Also shown in the plot are the contour lines over the range of the LOD scores obtained. The region from 30cM to 40cM and from 70cM to 90cM has relatively high average probabilities compared to the others. In order to assess the significance, we randomly sample 10 transcripts corresponding to every 10th percentile of the log of expression means and perform permutation tests of size 1000 on each of them. The average 95th percentile of the LOD scores from each of the 1000 permutation tests is about 3.2. Using this as our 2-D LOD score cutoff, we see from the contour lines that it gives a much wider region than the actual ETL. In line with the previous comparisons, using traditional QTL mapping techniques on the ETL data seems to yield more false positives.

6 Future Research Questions

The proposed framework for ETL mapping enables IM of single ETL and the identification of multiple ETL at markers. Preliminary results from simulated data sets are encouraging. Many questions need to be explored further.

1. Our methods are extensions of the MOM model, and a number of questions of implementation remain open. One question that is important to our extension is: How sparse is sparse? In other words, how many markers can MOM handle relative to the number of transcripts? More generally, one might also be concerned about whether MOM can fit mixtures with so many components. We did a series of simulations to shed some light on this.
Consider a backcross, and suppose the genome has 400 equally spaced markers with 5cM in between. 25% of the transcripts are DE, mapping to 80% of the markers with equal probabilities. We varied the number of transcripts over 100, 1000, 5000 and 40000, representing very small, small, moderate and large numbers of transcripts in a realistic experiment, with the number of animals being 10, 60 or 100. Here 10 might be a little unrealistic, but we were curious about what the results would look like.
We plot the posterior probabilities for all the transcripts of a particular mixture component in the MOM model in Figures 8, 9 and 10. The number of transcripts mapping to that mixture component is 1, 1, 4 and 22, corresponding to the four values of the total number of transcripts. The true DE transcripts are colored in red. Surprisingly, it seems that MOM can fit a mixture model with more components than transcripts pretty well (first panel in Figure 9 and Figure 10), as long as there are enough animals in the experiment. Our impression is that when the number of animals is small, there tend to be very few recombinations. The degree of distinctiveness between the MOM components is not very high, resulting in relatively low power (see Figure 8). When the number of transcripts is small, there are also very few DE transcripts mapped to every marker. We might expect the posterior probabilities of those transcripts not to be very accurate. But as seen from Figure 9, this does not seem to be the case. There are 60 animals, but even when we only have 100 transcripts, MOM still detects the one DE transcript with posterior probability 1.0. When there are transcripts, among the top 22 transcripts with the highest posterior probabilities, 20 of them are true DE (22 DE in total). The posterior probabilities range from to 1.0. In Figure 10, when there are 100 animals and transcripts, 21 out of the top 22 transcripts are DE, and their posterior probabilities range from 0.76 to 1.0. Much more work is required to investigate these results and define precise conditions under which parameters are well estimated.

2. Importance sampling is applied in pseudomarker-MOM. Importance samples are taken from the proposal distribution $p(g^l \mid m)$ to approximate what we really desire, $p(g^l \mid y, m)$. We will investigate ways to choose $Q$ so that we balance computational burden against accuracy. Perhaps a lower bound on $Q$ can be obtained.

3. The proposed two-stage approach relies on the ability of MOM to identify the correct interesting regions for follow-up. We have shown theoretically that under simplified conditions, the highest posterior probability region obtained using MOM is the region closest to the ETL. The simplified conditions need to be relaxed. In particular, we will investigate the case of multiple ETL and varying prior probabilities.

4. We have developed a method for mapping multiple ETL. Preliminary results from simulations investigating a 2-ETL system are encouraging. We would like to extend this
More informationComputational statistics
Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More informationTable of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors
The Multiple Testing Problem Multiple Testing Methods for the Analysis of Microarray Data 3/9/2009 Copyright 2009 Dan Nettleton Suppose one test of interest has been conducted for each of m genes in a
More informationBayesian Partition Models for Identifying Expression Quantitative Trait Loci
Journal of the American Statistical Association ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: http://www.tandfonline.com/loi/uasa20 Bayesian Partition Models for Identifying Expression Quantitative
More informationThe optimal discovery procedure: a new approach to simultaneous significance testing
J. R. Statist. Soc. B (2007) 69, Part 3, pp. 347 368 The optimal discovery procedure: a new approach to simultaneous significance testing John D. Storey University of Washington, Seattle, USA [Received
More informationLinear Model Selection and Regularization
Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In
More informationMapping QTL to a phylogenetic tree
Mapping QTL to a phylogenetic tree Karl W Broman Department of Biostatistics & Medical Informatics University of Wisconsin Madison www.biostat.wisc.edu/~kbroman Human vs mouse www.daviddeen.com 3 Intercross
More informationSimultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models
Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models David B. Dahl Department of Statistics Texas A&M University Marina Vannucci, Michael Newton, & Qianxing Mo
More informationO 3 O 4 O 5. q 3. q 4. Transition
Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in
More informationMIXED MODELS THE GENERAL MIXED MODEL
MIXED MODELS This chapter introduces best linear unbiased prediction (BLUP), a general method for predicting random effects, while Chapter 27 is concerned with the estimation of variances by restricted
More informationLecture WS Evolutionary Genetics Part I 1
Quantitative genetics Quantitative genetics is the study of the inheritance of quantitative/continuous phenotypic traits, like human height and body size, grain colour in winter wheat or beak depth in
More informationThe miss rate for the analysis of gene expression data
Biostatistics (2005), 6, 1,pp. 111 117 doi: 10.1093/biostatistics/kxh021 The miss rate for the analysis of gene expression data JONATHAN TAYLOR Department of Statistics, Stanford University, Stanford,
More informationOn the mapping of quantitative trait loci at marker and non-marker locations
Genet. Res., Camb. (2002), 79, pp. 97 106. With 3 figures. 2002 Cambridge University Press DOI: 10.1017 S0016672301005420 Printed in the United Kingdom 97 On the mapping of quantitative trait loci at marker
More informationHigh-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018
High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously
More informationQuantitative Genomics and Genetics BTRY 4830/6830; PBSB
Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 20: Epistasis and Alternative Tests in GWAS Jason Mezey jgm45@cornell.edu April 16, 2016 (Th) 8:40-9:55 None Announcements Summary
More informationLecture 21: October 19
36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 21: October 19 21.1 Likelihood Ratio Test (LRT) To test composite versus composite hypotheses the general method is to use
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationAssociation Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5
Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationQTL model selection: key players
QTL Model Selection. Bayesian strategy. Markov chain sampling 3. sampling genetic architectures 4. criteria for model selection Model Selection Seattle SISG: Yandell 0 QTL model selection: key players
More informationBTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014
BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 Homework 4 (version 3) - posted October 3 Assigned October 2; Due 11:59PM October 9 Problem 1 (Easy) a. For the genetic regression model: Y
More informationA new simple method for improving QTL mapping under selective genotyping
Genetics: Early Online, published on September 22, 2014 as 10.1534/genetics.114.168385 A new simple method for improving QTL mapping under selective genotyping Hsin-I Lee a, Hsiang-An Ho a and Chen-Hung
More informationMultiple Change-Point Detection and Analysis of Chromosome Copy Number Variations
Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem
More informationCausal Network Models for Correlated Quantitative Traits. outline
Causal Network Models for Correlated Quantitative Traits Brian S. Yandell UW Madison October 2012 www.stat.wisc.edu/~yandell/statgen Jax SysGen: Yandell 2012 1 outline Correlation and causation Correlatedtraitsinorganized
More informationMultiple interval mapping for ordinal traits
Genetics: Published Articles Ahead of Print, published on April 3, 2006 as 10.1534/genetics.105.054619 Multiple interval mapping for ordinal traits Jian Li,,1, Shengchu Wang and Zhao-Bang Zeng,, Bioinformatics
More informationTHE data in the QTL mapping study are usually composed. A New Simple Method for Improving QTL Mapping Under Selective Genotyping INVESTIGATION
INVESTIGATION A New Simple Method for Improving QTL Mapping Under Selective Genotyping Hsin-I Lee,* Hsiang-An Ho,* and Chen-Hung Kao*,,1 *Institute of Statistical Science, Academia Sinica, Taipei 11529,
More informationLearning ancestral genetic processes using nonparametric Bayesian models
Learning ancestral genetic processes using nonparametric Bayesian models Kyung-Ah Sohn October 31, 2011 Committee Members: Eric P. Xing, Chair Zoubin Ghahramani Russell Schwartz Kathryn Roeder Matthew
More informationLinkage analysis and QTL mapping in autotetraploid species. Christine Hackett Biomathematics and Statistics Scotland Dundee DD2 5DA
Linkage analysis and QTL mapping in autotetraploid species Christine Hackett Biomathematics and Statistics Scotland Dundee DD2 5DA Collaborators John Bradshaw Zewei Luo Iain Milne Jim McNicol Data and
More informationBustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #
Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either
More informationA CAUSAL GENE NETWORK WITH GENETIC VARIATIONS INCORPORATING BIOLOGICAL KNOWLEDGE AND LATENT VARIABLES. By Jee Young Moon
A CAUSAL GENE NETWORK WITH GENETIC VARIATIONS INCORPORATING BIOLOGICAL KNOWLEDGE AND LATENT VARIABLES By Jee Young Moon A dissertation submitted in partial fulfillment of the requirements for the degree
More informationTECHNICAL REPORT NO December 1, 2008
DEPARTMENT OF STATISTICS University of Wisconsin 300 University Avenue Madison, WI 53706 TECHNICAL REPORT NO. 46 December, 2008 Revised on January 27, 2009 Causal Graphical Models in System Genetics: a
More informationLatent Variable models for GWAs
Latent Variable models for GWAs Oliver Stegle Machine Learning and Computational Biology Research Group Max-Planck-Institutes Tübingen, Germany September 2011 O. Stegle Latent variable models for GWAs
More informationMODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES
MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by
More informationNIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.
NIH Public Access Author Manuscript Published in final edited form as: Stat Sin. 2012 ; 22: 1041 1074. ON MODEL SELECTION STRATEGIES TO IDENTIFY GENES UNDERLYING BINARY TRAITS USING GENOME-WIDE ASSOCIATION
More informationQUANTITATIVE trait analysis has many applica- several crosses, since the correlations may not be the
Copyright 001 by the Genetics Society of America Statistical Issues in the Analysis of Quantitative Traits in Combined Crosses Fei Zou, Brian S. Yandell and Jason P. Fine Department of Statistics, University
More informationComputational statistics
Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated
More informationSingle gene analysis of differential expression. Giorgio Valentini
Single gene analysis of differential expression Giorgio Valentini valenti@disi.unige.it Comparing two conditions Each condition may be represented by one or more RNA samples. Using cdna microarrays, samples
More informationChapter 1 Statistical Inference
Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations
More informationBayesian Inference of Interactions and Associations
Bayesian Inference of Interactions and Associations Jun Liu Department of Statistics Harvard University http://www.fas.harvard.edu/~junliu Based on collaborations with Yu Zhang, Jing Zhang, Yuan Yuan,
More informationAffected Sibling Pairs. Biostatistics 666
Affected Sibling airs Biostatistics 666 Today Discussion of linkage analysis using affected sibling pairs Our exploration will include several components we have seen before: A simple disease model IBD
More informationLecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017
Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction
More information25 : Graphical induced structured input/output models
10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large
More informationOne-week Course on Genetic Analysis and Plant Breeding January 2013, CIMMYT, Mexico LOD Threshold and QTL Detection Power Simulation
One-week Course on Genetic Analysis and Plant Breeding 21-2 January 213, CIMMYT, Mexico LOD Threshold and QTL Detection Power Simulation Jiankang Wang, CIMMYT China and CAAS E-mail: jkwang@cgiar.org; wangjiankang@caas.cn
More informationRegularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics
Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More information