A Statistical Framework for Expression Trait Loci (ETL) Mapping. Meng Chen


A Statistical Framework for Expression Trait Loci (ETL) Mapping

Meng Chen

Prelim Paper in partial fulfillment of the requirements for the Ph.D. program in the Department of Statistics, University of Wisconsin-Madison

Committee Members: Professor Christina Kendziorski, Professor Alan Attie, Professor Michael Newton, Professor Brian Yandell

Contents

1 Introduction
2 ETL mapping experiments
3 QTL Mapping Methods
  3.1 Single Phenotype - Single QTL Models
  3.2 Single Phenotype - Multiple QTL Models
  3.3 Multiple Phenotype - Single or Multiple QTL Models
4 ETL Mapping Methods
  4.1 Transcript Based Approach
  4.2 Transcript Based Approach with FDR control
  4.3 Marker Based Approach
  4.4 Mixture Over Markers Model
  4.5 Current Status of ETL Mapping Methods
5 Research Plan
  5.1 ETL Interval Mapping
    Pseudomarker-MOM
    Two-Stage Approach
    Theoretical Result
    Simulation Results
  Multiple ETL mapping
    Simulation Results
6 Future Research Questions
References
Appendix

1 Introduction

Identifying the genetic loci responsible for variation in quantitative traits is of great importance to biologists. Quantitative trait loci (QTL) mapping studies have been going on for over 80 years (Sax 1923; Rasmusson 1933; Thoday 1961), starting with Sax, who proposed in 1923 that the association between seed weight and seed coat color in beans was due to linkage between the genes controlling weight and the genes controlling color. Nevertheless, the vast majority of studies have taken place in the last 20 years. The increased rate was due largely to two major advances in the 1980s: the advent of restriction fragment length polymorphisms (RFLPs) (Botstein et al. 1980), which made it possible to genotype markers on a large scale, and the advent of statistical methods for data analysis (Lander and Botstein 1989). A recent advance of comparable significance has been made in the area of phenotyping. With high throughput technologies now widely available, investigators can measure thousands of phenotypes at once. Gene expression measurements are particularly amenable to QTL mapping, and much excitement abounds for this field of genetical genomics (Jansen and Nap 2001; Jansen 2003; Cox 2004; Broman 2005). The so-called expression QTL (eQTL) or expression trait loci (ETL) studies have been used to identify candidate genes (Dumas et al. 2000; Eaves et al. 2002; Karp et al. 2000; Wayne et al. 2003; Schadt et al. 2003; Bystrykh et al. 2005; Hubner et al. 2005), to infer not only correlative but also causal relationships among modulator and modulated genes (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003), to better define traditional phenotypes (Schadt et al. 2003), and to serve as a bridge between genetic variation and the traditional complex traits of interest (Schadt et al. 2003). Although successful in many ways, the results obtained from ETL studies to date are limited.
In the early published studies, the ETL mapping problem was addressed by treating each transcript separately as a phenotype for QTL mapping. Single trait QTL analysis was then carried out thousands of times (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003). Notably, although adjustments were made for multiple tests across the genome, no adjustments were considered for multiple tests across transcripts. There are hundreds of test locations across the genome but tens of thousands of transcripts, leading to a potentially serious multiple testing problem and an inflated false discovery rate (FDR). For some labs, an inflated FDR is tolerable, as many genes can be tested quickly for certain properties and discarded if found to be false positives. However, for many labs, such tests are prohibitively expensive. Statistical methods that control error rates and that are more sensitive and more specific are needed. In a few recent studies, there has been some effort to account for both sets of multiplicities (Chesler et al. 2005; Hubner et al. 2005; Bystrykh et al. 2005). Permutation tests were performed to derive the genome-wide LOD score threshold, and q-values were computed for the set of transcripts declared significant using the corresponding genome-wide empirical p-values. As discussed in Section 4.2, this last approach may not properly control FDR and may suffer from very low power.

The main aim of the proposed thesis is to develop a statistical framework for ETL mapping that properly accounts for multiplicities while maintaining or improving upon the operating characteristics of currently used approaches. Section 2 provides a brief background on the questions addressed and data collected in ETL mapping experiments. Statistical methods for ETL mapping are reviewed in Section 4. As discussed there, the mixture over markers (MOM) model is the only statistically rigorous ETL mapping method developed to account for multiplicities across both markers and transcripts. However, MOM has a number of shortcomings. For experiments with sparse maps, the MOM model is lacking, as information between markers is not available. When dense maps are available, the MOM model by itself may not be applicable, as the number of mixture components is too large to fit. Finally, MOM does not allow for multiple ETL. Statistical methods to address each of these shortcomings are detailed in Section 5, and preliminary results are demonstrated on simulated data and on data from a study of diabetes in mice. Future directions are discussed in Section 6.

2 ETL mapping experiments

The data collected in an ETL mapping experiment generally consist of a genetic map, marker genotypes, and microarray data (phenotypes) collected on a set of individuals. A genetic marker is a region of the genome of known location.
These locations make up the genetic map. The distance between markers is given by genetic distance, in units of centimorgans (cM), defined as the expected percentage of crossovers between two loci during meiosis. At each marker, genotypes are obtained. ETL mapping studies take place in both human and experimental populations. We focus on the latter. For these populations, the set of possible marker genotypes is simplified. For example, studies with experimental populations most often involve arranging a cross between two inbred strains differing substantially in some trait of interest to produce F1 offspring. Segregating progeny are then typically derived from a B1 backcross (F1 x Parent) or an F2 intercross (F1 x F1). Repeated intercrossing (Fn x Fn) can also be done to generate so-called recombinant inbred (RI) lines. For simplicity of notation, we focus on a backcross population. Consider two inbred parental populations P1 and P2, genotyped as AA and aa, respectively, at M markers. The offspring of the first generation (F1) have genotype Aa at each marker (allele A from parent P1 and a from parent P2). In a backcross, the F1 offspring are crossed back to a parental line, say P1, resulting in a population with genotypes AA or Aa at a given marker. We denote AA by 0 and Aa by 1.

For each member of the backcross population, phenotypes are collected via microarrays. Microarrays allow us to take a snapshot of the expression of thousands of genes at the same time. Oligonucleotide and cDNA microarrays are the two most widely used technologies. A nice review of microarray technologies can be found in Nguyen et al. (2002). We present a very brief, by no means complete, review here. Affymetrix is one company that produces oligonucleotide chips, which contain tens of thousands of probe sets, or DNA sequences related to a gene. We will refer to these sequences throughout this paper as transcripts. Each gene is represented by some number (usually 11-20) of features. Many Affymetrix arrays use 20 features (Nguyen et al. 2002). Each feature is a short sequence of oligonucleotides. Present in the features are pairs of perfect match (PM) and mismatch (MM) sequences. The PM is a segment of the gene, 25 nucleotides in length; the corresponding MM is identical to the PM except for the middle (i.e., 13th) position. After some pre-processing and normalization, one summary score is derived for each probe set. There are a number of methods for processing the probe set intensities and for normalization, such as DNA-Chip Analyzer by Li and Wong (2001) and Robust Multi-array Analysis (RMA) by Irizarry et al. (2003).
In a cDNA array experiment, a gene is represented by a long cDNA fragment (500 to 1000 bases). The experimental sample of interest is often labeled with a red fluorescent dye, and a reference sample is labeled green. The amount of cDNA hybridized to each probe can be captured through an imaging device that measures the fluorescent intensity. Image files are processed to give a summary expression score, $\log_2(R/G)$. Yang et al. (2002) propose methods for cDNA array data normalization and compare their methods with a number of other approaches. With proper pre-processing and normalization, from either technology we obtain a single summary score of expression for each transcript on each array.

3 QTL Mapping Methods

As noted above, ETL studies are very similar to QTL studies, but with thousands of phenotypes. It is perhaps not surprising, then, that early ETL studies repeatedly applied methods developed for QTL mapping to each transcript. The literature on QTL mapping methods is quite large. We review here only those methods relevant to this proposal and refer the interested reader to Doerge et al. (1997) or Lynch and Walsh (1998) for more information on QTL mapping methods.

3.1 Single Phenotype - Single QTL Models

Consider a backcross with $n$ progeny, with univariate phenotypes $y_j$ measured on all individuals, $j = 1, \ldots, n$, together with genotypes for a set of $M$ markers. Let $m_{ij} = 0$ or $1$ according to whether individual $j$ has genotype AA or Aa at the $i$th marker, $i = 1, \ldots, M$. The simplest method to test for trait-marker association is marker regression (MR), which tests for differences in mean trait value between the genotype groups at a particular marker. Specifically, for a test at the $i$th marker, the single QTL model is

$$y_j = \mu + \beta_i m_{ij} + \epsilon_j \qquad (3.1)$$

where the $\epsilon_j$ are independent and identically distributed (iid) as Normal$(0, \sigma^2)$, and one can test $H_0: \beta_i = 0$ vs. $H_1: \beta_i \neq 0$, or equivalently $H_0: \mu_0 = \mu_1$ vs. $H_1: \mu_0 \neq \mu_1$, where $\mu_0 = \mu$ and $\mu_1 = \mu + \beta_i$. This is equivalent to an analysis of variance (ANOVA) at each marker (Soller et al. 1976) when there are more than two marker genotype groups (a t-test for two genotype groups, as in a backcross). Usually, instead of F or t-statistics, geneticists prefer to report a LOD score, which is defined as the (base 10) log-likelihood ratio comparing the two hypotheses. A LOD score is calculated at each marker position, and marker loci giving significant LOD scores are identified as putative QTLs. For these putative QTLs, we loosely say the phenotype is linked to them. This approach is conceptually very simple, but it has clear problems (Lander and Botstein 1989).
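The LOD score for model (3.1) can be computed directly from residual sums of squares. The following is a minimal sketch, assuming normal errors with $\sigma^2$ replaced by its maximum likelihood estimate under each hypothesis, in which case the base-10 log-likelihood ratio reduces to $(n/2)\log_{10}(\mathrm{RSS}_0/\mathrm{RSS}_1)$. The function name `marker_regression_lod` and the toy data are illustrative and not part of the original text.

```python
import math

def marker_regression_lod(y, m):
    """LOD score for H0: beta_i = 0 in y_j = mu + beta_i * m_ij + e_j.

    y: phenotypes, one per individual.
    m: genotypes at one marker, coded 0 (AA) or 1 (Aa).
    """
    n = len(y)
    grand_mean = sum(y) / n
    # Residual sum of squares under H0 (one common mean).
    rss0 = sum((v - grand_mean) ** 2 for v in y)
    # Residual sum of squares under H1 (one mean per genotype group).
    rss1 = 0.0
    for genotype in (0, 1):
        group = [v for v, g in zip(y, m) if g == genotype]
        if group:
            group_mean = sum(group) / len(group)
            rss1 += sum((v - group_mean) ** 2 for v in group)
    # Base-10 log-likelihood ratio with sigma^2 profiled out.
    return (n / 2.0) * math.log10(rss0 / rss1)

# Toy backcross data (hypothetical): genotype 1 individuals have higher values.
y = [1.0, 1.2, 0.9, 2.1, 2.0, 2.2]
m = [0, 0, 0, 1, 1, 1]
print(marker_regression_lod(y, m))
```

For two genotype groups this LOD is a monotone function of the squared t-statistic, $\mathrm{LOD} = (n/2)\log_{10}\{1 + t^2/(n-2)\}$, which is why the two ways of reporting evidence rank markers identically.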
First, if the true QTL is not located exactly at the marker, its effect will likely be underestimated because of recombination between the marker and the true QTL. Second, because of this confounding of QTL effect and location, the power for QTL detection will decrease, especially when the markers are widely spaced, so more individuals are required for the test. Third, this approach considers one marker locus at a time, which is not very powerful compared with multiple QTL models in the presence of more than one QTL.

Lander and Botstein (1989) proposed interval mapping (IM), which addresses the first two problems above. Their approach also assumes a single segregating QTL, but allows for tests between markers, where genotypes are not known. Specifically, for a backcross, the proposed model to test for a QTL in the interval between markers $i$ and $i+1$ is

$$y_j = \mu + \beta m_{kj} + e_j \qquad (3.2)$$

where $m_{kj}$ is the genotype of individual $j$ at the test position $k$ between markers $i$ and $i+1$. It takes the value 0 or 1 with probability depending on the genotypes of the flanking markers and the test position (see Table 1), and $\beta$ is the effect of the putative QTL. Technically, (3.2) is a mixture of two normal distributions, since

$$p(y_j) = p(m_{kj} = 0 \mid m, r)\, p(y_j \mid m_{kj} = 0) + p(m_{kj} = 1 \mid m, r)\, p(y_j \mid m_{kj} = 1)$$

(here, $r$ denotes flanking marker distances). As in MR, tests are done at each location to test $H_0: \beta = 0$ vs. $H_1: \beta \neq 0$. The test compares the hypothesis of a single QTL at the current locus to the null hypothesis of no QTL. The two likelihood functions must be maximized over their respective parameters. The procedure described above is repeated for each locus in the genome. In practice, test loci are set up every 1 cM or at some other user-defined spacing. The likelihood under the alternative varies with the test locus, so the EM algorithm (Dempster et al. 1977) must be applied at each locus. A LOD score profile can be constructed by plotting the LOD scores against the test positions. The LOD score is then compared to a genome-wide threshold. Wherever the LOD profile exceeds this threshold, we infer that a QTL exists. Generally, the genome-wide threshold is taken to be the 95th percentile of the distribution of the maximum (genome-wide) LOD score under the null hypothesis of no segregating QTLs.
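The mixing probabilities in (3.2) depend only on the two flanking marker genotypes and on where the test position lies between them. The following is a minimal sketch of that calculation for a backcross. It assumes no crossover interference and the Haldane map function for converting distances to recombination fractions; the text does not specify either choice, and the function names are illustrative.

```python
import math

def haldane_recomb_fraction(d_cm):
    """Recombination fraction for a distance in cM (Haldane map function, an assumption)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def qtl_genotype_prob(left, right, d_left_cm, d_right_cm):
    """P(genotype at test position = 1 | flanking marker genotypes) in a backcross.

    left, right: flanking marker genotypes, coded 0 (AA) or 1 (Aa).
    d_left_cm, d_right_cm: distances from the test position to each flanking marker.
    """
    r1 = haldane_recomb_fraction(d_left_cm)
    r2 = haldane_recomb_fraction(d_right_cm)

    def transmit(a, b, r):
        # Probability that two loci carry genotypes a and b, given one of them:
        # no recombination (1 - r) if they agree, recombination (r) otherwise.
        return 1.0 - r if a == b else r

    # Prior P(genotype = 0) = P(genotype = 1) = 1/2 cancels in the ratio.
    w1 = transmit(1, left, r1) * transmit(1, right, r2)
    w0 = transmit(0, left, r1) * transmit(0, right, r2)
    return w1 / (w0 + w1)

# Test position midway in a 10 cM interval (hypothetical numbers).
print(qtl_genotype_prob(1, 1, 5.0, 5.0))  # both flanking markers Aa
print(qtl_genotype_prob(0, 1, 5.0, 5.0))  # recombinant interval
```

When the flanking markers agree, the genotype at the test position is almost certainly the same as theirs; when they disagree, the probability shifts toward whichever marker is closer.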
Much effort has been expended to derive the appropriate genome-wide LOD score cutoff value (Churchill and Doerge 1994; Dupuis et al. 1995; Dupuis and Siegmund 1999; Feingold et al. 1993; Lander and Botstein 1989; Rebai et al. 1994; Rebai et al. 1995). The major advantage of IM over MR at marker loci is that it gives more precise estimates of the QTL locations and effects. However, the computational cost of IM is higher, and in the case of dense genetic markers and complete genotype data, the advantage tends to be small. More importantly, IM, like MR at marker loci, is still a single-QTL model.
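One common way to obtain the genome-wide threshold is by permutation, as in Churchill and Doerge (1994): phenotypes are shuffled relative to genotypes, the maximum LOD over all test positions is recorded, and the 95th percentile of these maxima is used as the cutoff. The following is a minimal sketch restricted to marker regression at genotyped markers; the function names, the simulated toy data, and the number of permutations are illustrative assumptions.

```python
import math
import random

def lod(y, m):
    """Marker regression LOD for one 0/1-coded marker (normal errors)."""
    n = len(y)
    mean_all = sum(y) / n
    rss0 = sum((v - mean_all) ** 2 for v in y)
    rss1 = 0.0
    for genotype in (0, 1):
        group = [v for v, g in zip(y, m) if g == genotype]
        if group:
            mean_g = sum(group) / len(group)
            rss1 += sum((v - mean_g) ** 2 for v in group)
    return (n / 2.0) * math.log10(rss0 / rss1)

def permutation_threshold(y, markers, n_perm=200, level=0.95, seed=0):
    """Empirical quantile of the genome-wide maximum LOD under permutation.

    markers: list of genotype vectors, one per marker.
    """
    rng = random.Random(seed)
    shuffled = list(y)
    max_lods = []
    for _ in range(n_perm):
        # Shuffling phenotypes breaks any trait-marker association while
        # keeping the correlation structure among markers intact.
        rng.shuffle(shuffled)
        max_lods.append(max(lod(shuffled, m) for m in markers))
    max_lods.sort()
    index = min(n_perm - 1, math.ceil(level * n_perm) - 1)
    return max_lods[index]

# Simulated toy data (hypothetical): 40 individuals, 10 unlinked markers, no QTL.
toy_rng = random.Random(1)
markers = [[toy_rng.randint(0, 1) for _ in range(40)] for _ in range(10)]
y = [toy_rng.gauss(0.0, 1.0) for _ in range(40)]
print(permutation_threshold(y, markers))
```

In an ETL study this calculation would have to be repeated for every transcript, which is part of why the transcript-by-transcript strategy is computationally heavy.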

3.2 Single Phenotype - Multiple QTL Models

The principal reasons for modelling multiple QTLs are to increase sensitivity and to achieve better separation of linked QTLs. Also, epistasis (i.e., interactions between alleles at different QTLs) can only be identified through multiple-QTL models. When a single QTL model is used in the presence of multiple QTLs, the genetic variation due to the other segregating QTLs is absorbed into the environmental variation. This reduces sensitivity. A straightforward extension of MR to model multiple QTLs is multiple regression, which includes a number of different markers in the model rather than looking at them one at a time. Let $m_{ij} = 0$ or $1$ according to whether individual $j$ has genotype AA or Aa at marker $i$. The model becomes

$$y_j = \mu + \sum_{i \in S} \beta_i m_{ij} + e_j \qquad (3.3)$$

where $S$ is the set of markers for which $\beta_i$ is not 0. To implement this, one must find a way to search through the model space to find such a set $S$. As the number of markers gets larger, it becomes impossible to consider every possible model in the model space. In addition, there remains the question of how to choose between candidate models; some form of criterion is needed. Broman (2002) looked at this problem in a model selection framework. Direct use of multiple regression analysis is not easy. Moreover, Zeng (1993) showed that the partial regression coefficient is generally a biased estimate of the relevant QTL effect.

An approach for multiple QTL mapping that combines ideas from IM and multiple regression is composite IM (CIM) (Zeng 1993, 1994). The method attempts to reduce the multidimensional search for QTLs to a series of one-dimensional searches. It conditions on markers outside the region of interest while performing IM, to control for the effects of QTLs in other intervals. In this way, power for QTL detection improves and QTL effects can be estimated more accurately. CIM can be described as follows.
One chooses a subset of markers, $S$, to control background genetic variation. As in IM, suppose one wants to test for a QTL between markers $i$ and $i+1$. For a test at the $k$th location, between markers $i$ and $i+1$, the statistical model can be written as

$$y_j = \mu + \beta m_{kj} + \sum_{l \neq i, i+1} \beta_l m_{lj} + e_j$$

where $\beta$ is the effect of the putative QTL. A general guideline in practice is to drop those markers that are within 10 cM of the test position (Broman 2002 calls this subset of markers S).

Under this model, the contribution of each individual to the likelihood has the form of a mixture of two normal distributions with means $\mu + \sum_{l \in S} \beta_l m_{lj}$ and $\mu + \beta + \sum_{l \in S} \beta_l m_{lj}$, with mixing proportions equal to the conditional probabilities of the individual having QTL genotype 0 or 1, given the flanking marker genotypes and the test position. Zeng (1994) used the ECM algorithm (Meng and Rubin 1993) to obtain the maximum likelihood estimates. As in IM, a LOD score is calculated at each test position, comparing the likelihood assuming there is a QTL at the putative test locus to the likelihood assuming that there is no QTL there. The LOD score is then plotted as a function of test position in the genome and compared to a genome-wide threshold to declare significance. Jansen independently developed a similar approach to handle multiple QTLs by combining IM and MR, multiple-QTL mapping (MQM) (Jansen 1993; Jansen and Stam 1994). It fits single QTL models with selected markers as cofactors in the regression to eliminate the effects of possible QTLs in other intervals. The major problem with these approaches is how to choose the set of markers to be included in the model. Too many markers will give low power for QTL detection, and too few will cause low accuracy. Zeng (1994) compared, through simulation, the performance of including all other markers with that of including only unlinked markers. He then recommended some combinations of deleting or inserting linked markers in practice. Jansen (1993) and Jansen and Stam (1994) used backward elimination with AIC (Akaike 1969), or a slight variant, to pick the subset of markers in the model. Broman (2002) recommended the use of the $\mathrm{BIC}_\delta$ criterion, with the value $\delta$ chosen by the approximate correspondence between $\mathrm{BIC}_\delta$ and a genome-wide threshold on the LOD score. Kao et al. (1999) proposed multiple IM (MIM), which uses multiple marker intervals simultaneously to fit multiple QTLs in the model. In MIM, Kao et al. (1999) adopted stepwise selection with the likelihood ratio test (LRT) as the selection criterion to identify QTLs.

3.3 Multiple Phenotype - Single or Multiple QTL Models

In many QTL mapping studies, more than one trait is measured. Performing single trait analysis repeatedly is clearly not optimal because it does not take into account the correlation structure among the traits. It has been shown that analyzing traits jointly increases the power of QTL detection (Jiang and Zeng 1995; Knott and Haley 2000). In these studies, a joint distribution of the traits is imposed, which requires specification of the covariance structure of the traits (for a review of multi-trait QTL mapping methods, see Lund et al. and references therein).

Multi-trait methods are very attractive in that they try to capture the underlying structure among traits, because many traits are genetically or environmentally correlated. However, as the number of traits gets larger, so does the number of parameters that need to be estimated.

4 ETL Mapping Methods

To date, most ETL mapping methods consider a single transcript at a time. Multi-trait mapping using available methods has not been attempted. Most recognize that this would be impossible with thousands of traits, since estimation of a phenotype covariance matrix is not feasible. We detail the methods used below.

4.1 Transcript Based Approach

The earliest ETL mapping studies applied single phenotype-single QTL mapping methods to every transcript (Brem et al. 2002; Schadt et al. 2003). We call this type of approach transcript based (TB). In Brem et al. (2002), a Wilcoxon-Mann-Whitney rank sum test was applied to every transcript and marker pair. Nominal p-values were reported, and the number of linkages expected by chance was estimated by permutation tests (Churchill and Doerge 1994). In Schadt et al. (2003), transcript specific LOD score profiles were obtained using standard QTL IM. A common genome-wide LOD score threshold was chosen to account for the potential increase in type I error induced by testing across multiple markers. Neither study accounted for multiple tests across transcripts. We view this as a problem that needs serious attention because ETL studies usually deal with thousands of traits.

4.2 Transcript Based Approach with FDR control

Recently, investigators have made attempts to account for multiplicities across transcripts (Chesler et al. 2005; Hubner et al. 2005). They first computed genome-wide empirical p-values of the maximum LOD score for every transcript using permutation tests and then estimated the q-values (Storey and Tibshirani 2003) accordingly.
Although this represents an effort to adjust for multiple testing across transcripts, it is by no means a systematic approach to the multiple testing problem. Preliminary results from simulations such as those described in Kendziorski et al. (2004) show that FDR is not properly controlled and that power is relatively lower than for other approaches. This is consistent with the results of Chesler et al. (2005), where the q-value threshold was increased to 0.25 so that a reasonable number of transcripts could be identified.

4.3 Marker Based Approach

Instead of conducting the analysis at every transcript, ETL mapping can be done by conducting the analysis at every marker. We call these marker based (MB) approaches. They consist of identifying differentially expressed (DE) transcripts across groups of animals, where groups are determined by the genotype at a given marker. Any DE method can be used. Usually, the DE evidence threshold can be chosen such that multiple testing across the transcripts is accounted for. However, the MB approach does not consider multiplicities across the genome. As was shown in Kendziorski et al. (2004), TB and MB approaches share similar flaws. In TB, separate tests are conducted for each transcript. In MB, each marker is tested separately. For both, the evidence that a transcript maps to a marker is measured against the evidence that it does not map there. Since in reality a transcript can map to any of the marker locations, the evidence that a transcript maps to a particular marker should be judged relative to the possibility that it maps nowhere or to some other marker. This idea motivates what we call the Mixture Over Markers (MOM) model (Kendziorski et al. 2004).

4.4 Mixture Over Markers Model

Let $y_t$ be the expression level for the $t$th transcript, $y_t = \{y_{t1}, y_{t2}, \ldots, y_{tn}\}$, where $n$ is the number of animals in the ETL study. The MOM model assumes a transcript $t$ maps nowhere with probability $p_0$ and maps to marker $m$ with probability $p_m$, such that $\sum_{m=0}^{M} p_m = 1$, where $M$ denotes the total number of markers.
The marginal distribution of $y_t$ is then given by

$$p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t) \qquad (4.4)$$

where $f_m$ is the predictive density of the data if transcript $t$ maps to marker $m$, and $f_0$ is the predictive density when the transcript maps nowhere. Specifically, suppose transcript abundance measurements $y_{tj}$ arise independently from some observation distribution $f_{obs}(\cdot \mid \mu_{t,\cdot}, \theta)$. The dependence among the underlying means $\mu_{t,\cdot}$ is captured by a distribution $\pi(\mu)$. With this setting,

$$f_0(y_t) = \int \Big( \prod_{j=1}^{n} f_{obs}(y_{tj} \mid \mu) \Big) \pi(\mu)\, d\mu.$$

For a transcript that maps to marker $m$, say, the underlying expression means defined by the marker genotype groups are not equal ($\mu_{t,0} \neq \mu_{t,1}$), but they are both assumed to come from $\pi(\mu)$. The governing distribution for $y_t$ is then

$$f_m(y_t) = f_0(y_t^0)\, f_0(y_t^1)$$

where $y_t^{0(1)}$ denotes the set of transcript $t$ values for animals with genotype 0(1). Model fit proceeds via the EM algorithm. Once the parameter estimates are obtained, posterior probabilities of mapping nowhere or to any of the $M$ locations can be calculated via Bayes rule. For instance, the posterior probability that transcript $t$ maps to location $l$, $l = 0, \ldots, M$, is given by

$$\frac{p_l f_l(y_t)}{p_0 f_0(y_t) + \sum_{m=1}^{M} p_m f_m(y_t)} \qquad (4.5)$$

With the MOM approach, a transcript is identified as DE if its posterior probability of DE exceeds some threshold, where the threshold is chosen to control the expected posterior false discovery rate (Newton et al. 2004). In order to make a transcript specific call, the highest posterior density (HPD) region can be constructed in a straightforward fashion. A $1 - \alpha$ HPD region is obtained by including marker locations until the sum of their posterior probabilities exceeds $1 - \alpha$.

4.5 Current Status of ETL Mapping Methods

Here we present some of our early simulation results comparing different approaches. We considered a single ETL simulation with two chromosomes. Marker data were obtained from chromosomes 2 and 3 of the F2 data from Dr. Alan Attie's lab on campus. Chromosome 2 has 17 markers and chromosome 3 has 6 markers. A single ETL was simulated at marker 5 of chromosome 2. We generated 20 data sets for each of seven values of $\nu_0$, a tuning parameter that controls the variance pattern in the simulated data (for details, see Kendziorski et al. 2004), so that operating characteristics could be evaluated without biasing towards one method. We refer to applying MR to every transcript as transcript based MR (TB-MR).
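The posterior in (4.5), the HPD region, and the expected posterior FDR rule of Section 4.4 can be sketched in a few lines once the predictive densities are available. The sketch below takes log predictive densities as given inputs and does not implement the model fit. Two choices are assumptions not stated in the text: the HPD region is built from posteriors renormalized over the $M$ markers (that is, conditional on the transcript mapping somewhere), and markers are added in decreasing order of posterior probability. All function names are illustrative.

```python
import math

def mom_posterior(log_f, p):
    """Posterior over components 0..M as in (4.5), computed on the log scale.

    log_f[l]: log predictive density log f_l(y_t); l = 0 means "maps nowhere".
    p[l]: mixing proportions, summing to 1.
    """
    terms = [math.log(pl) + lf if pl > 0 else float("-inf")
             for pl, lf in zip(p, log_f)]
    top = max(terms)  # subtract the maximum to avoid underflow in exp
    weights = [math.exp(v - top) for v in terms]
    total = sum(weights)
    return [w / total for w in weights]

def hpd_region(post, alpha=0.05):
    """Markers added by decreasing posterior until (1 - alpha) of marker mass is covered."""
    marker_mass = sum(post[1:])
    order = sorted(range(1, len(post)), key=lambda l: post[l], reverse=True)
    region, covered = [], 0.0
    for l in order:
        region.append(l)
        covered += post[l]
        if covered >= (1.0 - alpha) * marker_mass:
            break
    return sorted(region)

def calls_at_fdr(null_post, target=0.05):
    """Largest list of transcripts whose average posterior null probability is <= target.

    null_post[t]: posterior probability that transcript t maps nowhere.
    """
    order = sorted(range(len(null_post)), key=lambda t: null_post[t])
    called, running = [], 0.0
    for t in order:
        if (running + null_post[t]) / (len(called) + 1) > target:
            break
        running += null_post[t]
        called.append(t)
    return called

# Hypothetical inputs: one transcript, three markers, evidence favoring marker 2.
post = mom_posterior([-10.0, -12.0, -5.0, -11.0], [0.7, 0.1, 0.1, 0.1])
print(post)
print(hpd_region(post))
print(calls_at_fdr([0.01, 0.5, 0.02, 0.09]))
```

Because the denominator of (4.5) sums over every marker as well as the null component, evidence for one marker is automatically weighed against the alternatives, which is the property that separates MOM from the TB and MB approaches.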
For TB-MR, the genome-wide type I error rate per transcript is controlled at 5% (Dupuis and Siegmund 1999). We also have four marker based approaches. The first is EBarrays with the LogNormal-Normal (LNN) model, where posterior probabilities of DE are computed for all the transcripts at every marker (MB-EB). See Kendziorski et al. (2003) for details of the LNN model. The second is a t-test of equal means for every transcript, followed by calculation of q-values (Storey and Tibshirani 2003) at every marker (MB-Q). The third and fourth methods calculate moderated t-statistics using SAM (Tusher et al. 2003) and LIMMA (Smyth 2004), followed by q-value calculation at every marker. We refer to these as MB-SAM and MB-LIMMA. For MB-EB, MB-Q, MB-SAM, and MB-LIMMA, the false discovery rate per marker is controlled at 5%. A fifth method attempts to test all the transcripts and all the markers simultaneously. P-values from t-tests for every transcript and marker pair were used to calculate q-values (Storey and Tibshirani 2003) all at once. Note that by doing so, we assume that a certain dependence structure among tests is satisfied (Storey 2003), which is unlikely to hold here. We nevertheless include this method (Q-ALL) in our simulation because we would like to see the performance of this kind of ad hoc procedure. In addition, Storey and Tibshirani (2003) use the ETL mapping data of Brem et al. (2002) as motivation for considering calculation of all q-values simultaneously. For Q-ALL, FDR control is targeted at 5%.

Power is defined as the probability of calling marker 4, 5 or 6 (the flanking region of the true ETL) on chromosome 2 for mapping transcripts. FDR is the proportion of transcripts identified incorrectly as mapping to chromosome 2 or 3, i.e., transcripts that were equivalently expressed (EE), or that were DE but were mapped outside the flanking region of the true ETL. Figure 1 shows the average power and FDR (over the 20 simulated data sets) against each $\nu_0$ value, together with 95% point-wise confidence intervals. As can be seen, the power is between roughly 80% and 90% for all of the methods; however, for all methods except MOM, the FDR is well above 5% for most values of $\nu_0$. These simple and somewhat ad hoc approaches fail to control the FDR at their claimed level because they cannot adjust for multiple tests across the markers and the transcripts simultaneously. Although MOM does adjust appropriately for these multiplicities, the approach has a number of shortcomings.
For experiments with sparse maps, the MOM model is lacking, as information between markers is not available. Also, MOM does not allow for multiple ETL. Some dense maps are currently available or under development, particularly those that use single nucleotide polymorphisms (SNPs) (see The SNP Consortium). Because of their proximity to harmful but unknown mutations, SNPs may be shared among groups of people carrying those mutations and serve as markers for them. Such markers can help to reveal the mutations and expedite therapeutic drug discovery. When dense maps are available, the MOM model as it stands may not be applicable, as the number of mixture components to fit will be huge.

5 Research Plan

The main objective of the proposed thesis is to develop a powerful statistical framework within which ETL can be localized. The framework relies on the MOM model, in that it simultaneously controls for tests done across genome locations and across all the transcripts. However, the proposed methods, detailed below, significantly increase the utility and applicability of MOM by addressing the first two shortcomings listed above. The effort to develop a method that handles dense maps is ongoing.

5.1 ETL Interval Mapping

The resolution of the genomic regions identified using MOM is limited: the regions may be large, as analysis is conducted at genotyped markers only. When dense maps are not available (Attie lab data; Schadt et al. 2003), this limitation can be a serious one. The biological techniques currently available to search for genes in large genomic regions (e.g., candidate gene approach, congenic lines) can take years and, as a result, additional statistical methods capable of narrowing down regions are necessary. We here propose a method for IM of ETL.

Consider, for simplicity of notation, a backcross population genotyped as 0 or 1 at $M$ markers. The observed phenotype data $y$ is a $T \times n$ matrix of transcript abundance levels for transcripts $t = 1, \ldots, T$ and individuals $j = 1, \ldots, n$; $m$ is an $M \times n$ matrix of marker genotypes for markers $i = 1, \ldots, M$ and individuals $j = 1, \ldots, n$. Considering a set of $L$ locations spanning the entire genome, we model the expression data for transcript $t$ as an $(L + 1)$-component mixture. To be specific, we imagine that the transcript may map nowhere with probability $p_0$, and to any of the $L$ locations with probability $p_l$, $l = 1, 2, \ldots, L$. The $p$'s are mixing proportions. As noted in Section 3.1, transcript $t$ is said to be linked (mapped) to location $l$ if $\mu^0_{t,l} \neq \mu^1_{t,l}$, where $\mu^{0(1)}_{t,l}$ denotes the latent mean level of expression for transcript $t$ for the population of individuals with genotype 0(1) at location $l$.
Let $z_t^l$ be an indicator of whether transcript $t$ maps to location $l$. If $l$ is at a marker, then we can decompose the predictive density under the alternative hypothesis as $f_l(y_t) = f_0(y_t^0)\, f_0(y_t^1)$, where the grouping is determined by genotypes at that marker. However, when $l$ is between markers, the decomposition is no longer valid. Instead, we have

$$f_l(y_t) = \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l$$

$$= \int\!\!\int\!\!\int \prod_{j \in G_0^l} f_{obs}(y_{tj} \mid \mu_{t,0}) \prod_{j \in G_1^l} f_{obs}(y_{tj} \mid \mu_{t,1})\, \pi(\mu_{t,0})\, \pi(\mu_{t,1})\, p(g^l \mid m)\, d\mu_{t,0}\, d\mu_{t,1}\, dg^l$$

where $g^l = (g_1^l, g_2^l, \ldots, g_n^l)$ denotes the unknown genotype vector at location $l$, and $G_{0(1)}^l$ denotes the set of individuals having genotype 0(1) at location $l$. Under the null hypothesis, the predictive density of the data, $f_0(y_t)$, can be calculated as before since it does not rely on genotype groupings. Parameter estimates for mixing proportions and hyperparameters can be obtained via the EM algorithm (see Appendix A). We show that the posterior probability that transcript $t$ maps to location $l$, after integrating out the $\mu$'s, is given by

$$p(z_t^l = 1 \mid y, m) = \frac{p(z_t^l = 1) \int f_l(y_t \mid g^l)\, p(g^l \mid m)\, dg^l}{p(y_t \mid m)} \qquad (5.6)$$

where $p(z_t^l = 1)$ is the prior probability that transcript $t$ maps to location $l$. At a particular location $l$, the conditional distribution of genotype given the expression and marker data, $p(g^l \mid m)$, is assumed to depend only on the two markers flanking $l$. Notice that $g^l$ is a vector of length $n$. Theoretically, there are $2^n$ possible genotype vectors, so the integral in (5.6) is a huge mixture. In practice, one can restrict attention to a smaller number (Table 1, Zeng 1994, reproduced here as Table 2), since many of them have small probabilities. However, as the number of individuals in the study grows, this $2^n$ problem quickly becomes computationally infeasible.

Pseudomarker-MOM

To get around the $2^n$ problem, we propose a general framework for ETL IM using importance sampling and pseudomarker generation. The idea of pseudomarkers was introduced by Sen and Churchill (2001) (see Appendix B). Let us introduce some slightly more general notation. Recall that a transcript $t$ is linked to location $l$ if $\mu^0_{t,l} \neq \mu^1_{t,l}$, where $\mu^{0(1)}_{t,l}$ denotes the latent mean level of expression for transcript $t$ for the population of individuals with genotype 0(1) at location $l$.
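To make the $2^n$ problem concrete, the integral in (5.6) can be approximated by averaging $f_l(y_t \mid g^l)$ over genotype vectors drawn from $p(g^l \mid m)$ instead of enumerating all of them. The following is a minimal sketch of that idea using plain Monte Carlo; it is not the importance sampling scheme of Appendix B. It assumes a normal observation model with known variance and a normal prior on the latent mean, so that $f_0$ has a closed form; this stand-in replaces the model actually used in the text, and the hyperparameter values and function names are illustrative.

```python
import math
import random

def log_f0(values, mu0=0.0, tau2=1.0, sigma2=1.0):
    """Log predictive density of one group: y_j | mu ~ N(mu, sigma2), mu ~ N(mu0, tau2).

    This normal-normal model is an assumed stand-in for the text's f_0.
    """
    k = len(values)
    if k == 0:
        return 0.0
    d = [v - mu0 for v in values]
    s1 = sum(d)
    s2 = sum(x * x for x in d)
    # Quadratic form under covariance sigma2 * I + tau2 * (matrix of ones).
    quad = (s2 - tau2 / (sigma2 + k * tau2) * s1 * s1) / sigma2
    return (-0.5 * k * math.log(2.0 * math.pi * sigma2)
            - 0.5 * math.log(1.0 + k * tau2 / sigma2)
            - 0.5 * quad)

def log_f_given_genotypes(y, g):
    """log f_l(y_t | g^l) = log f_0(y^0) + log f_0(y^1), grouping by genotype."""
    y0 = [v for v, gj in zip(y, g) if gj == 0]
    y1 = [v for v, gj in zip(y, g) if gj == 1]
    return log_f0(y0) + log_f0(y1)

def mc_log_predictive(y, prob_one, n_draws=500, seed=0):
    """Monte Carlo estimate of log of the integral of f_l(y | g) p(g | m) dg.

    prob_one[j]: P(g_j = 1 | flanking markers) for individual j; individuals
    are drawn independently, as in a backcross.
    """
    rng = random.Random(seed)
    logs = []
    for _ in range(n_draws):
        g = [1 if p > rng.random() else 0 for p in prob_one]
        logs.append(log_f_given_genotypes(y, g))
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs) / n_draws)

# Hypothetical transcript with four individuals and uncertain genotypes.
y_t = [0.1, -0.2, 2.0, 2.3]
print(mc_log_predictive(y_t, [0.05, 0.10, 0.90, 0.95]))
```

When every entry of `prob_one` is 0 or 1, the location behaves like a genotyped marker and the estimate reduces exactly to the marker-based decomposition $f_0(y_t^0) f_0(y_t^1)$.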
When this is the case, we say that transcript $t$ is in expression pattern P1 at location $l$, denoted $\mathrm{P}1_t^l$; similarly, $\mathrm{P}0_t^l$ denotes the null pattern of expression at location $l$ ($\mu_{t,l}^0 = \mu_{t,l}^1$). It is useful to introduce specific patterns in our framework since, when an F2 population is considered, for example, there are numerous patterns. Each defines a way in which the latent means can differ across genotype groups at a location. All of the non-null patterns

imply linkage to that location. Two $T \times L$ matrices, $\theta_0$ and $\theta_1$, contain the latent mean levels of expression ($\theta = (\theta_0, \theta_1)$); $L$ denotes the total number of locations considered. Let $z_t^l = 1$ if transcript $t$ is in expression pattern P1 at location $l$ and 0 otherwise. Then $z$ is a $T \times L$ indicator matrix specifying QTL locations for each transcript.

Let us first consider a simple case where a transcript is associated with at most one genomic location, and consider inference at location $l$. This assumption simplifies the algebraic development and will be relaxed later. At location $l$, of primary interest is the posterior probability that $z_t^l = 1$ for transcript $t$. Re-expressing (5.6) in terms of the patterns, we have

$$
p(\mathrm{P}k_t^l \mid y, m) \propto p_{t,Pk}^l \int f_{Pk}(y_t \mid g^l)\, p(g^l \mid m)\, dg^l \tag{5.7}
$$

where $p_{t,Pk}^l$ denotes the prior probability that transcript $t$ is in pattern $k$ at location $l$ and $f_{Pk}$ describes the predictive density of the data; $k = 0, 1$ for a backcross. A normalizing constant is not required if operating characteristics such as the FDR are not of interest, which may be the case if a simple ranking of the genes is desired. For the model we propose here, we are interested in the estimated FDR, so we must specify the normalizing constant $p(y_t \mid m)$. Expanding this probability over the possible mapping locations, we have

$$
p(y_t \mid m) = p(y_t \mid m, z_t = 0)\, p(z_t = 0) + \sum_{l'=1}^{L} p(y_t \mid m, z_t^{l'} = 1)\, p(z_t^{l'} = 1),
$$

where $p(z_t = 0) + \sum_{l'=1}^{L} p(z_t^{l'} = 1) = 1$ and $z_t = 0$ means that the transcript does not map to any of the $L$ locations, i.e., the transcript is in pattern P0 at every location. Therefore, the exact form of (5.7) is given by

$$
p(\mathrm{P}k_t^l \mid y, m) = \frac{p_{t,Pk}^l \int f_{Pk}(y_t \mid g^l)\, p(g^l \mid m)\, dg^l}{p_{t,P0}\, f_{P0}(y_t) + \sum_{l'=1}^{L} p_{t,P1}^{l'} \int f_{P1}(y_t \mid g^{l'})\, p(g^{l'} \mid m)\, dg^{l'}} \tag{5.8}
$$

The $p_{t,P0}$ and $p_{t,P1}^l$ are unknown. We estimate them from the data as the average, across transcripts, of the posterior probability of belonging to a particular pattern.
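
The weights $p(g^l \mid m)$ in (5.7) and (5.8) come from the marker data. As an illustration of the flanking-marker assumption and of the size of the $2^n$ problem, the sketch below computes, for a backcross, the probability of genotype 1 at a location between two markers. It assumes Haldane's map function (no crossover interference) and fully observed, error-free flanking markers; these assumptions and the function names are illustrative and are not taken from the text.

```python
import math

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prob_genotype1(left, right, d_left_cm, d_right_cm):
    """P(g = 1 | flanking marker genotypes) at a locus in a backcross.

    left, right: observed genotypes (0 or 1) at the two flanking markers.
    d_left_cm, d_right_cm: distances from the locus to each marker.
    """
    r1 = haldane_r(d_left_cm)
    r2 = haldane_r(d_right_cm)

    def joint(g):
        # P(left, g, right) up to the common factor 1/2 (Markov chain along
        # the chromosome: no recombination with prob 1 - r, else r).
        a = (1.0 - r1) if g == left else r1
        b = (1.0 - r2) if g == right else r2
        return a * b

    return joint(1) / (joint(0) + joint(1))

# A location 5 cM from each of two markers that are 10 cM apart.
for left, right in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(left, right, round(prob_genotype1(left, right, 5.0, 5.0), 4))

# With n individuals there are 2**n candidate genotype vectors at a location.
print(2 ** 100)
```

Under this sketch, the probability of a whole genotype vector is the product of the per-individual probabilities, which is why most of the $2^n$ vectors carry negligible weight when the flanking markers are informative.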
Due to the $2^n$ problem, calculation of $p(g^l \mid y, m)$, the posterior distribution of the unobserved genotype at location $l$ given the expression and marker data, becomes computationally prohibitive. Here we use importance sampling: we first simulate multiple versions of pseudomarkers from $p(g^l \mid m)$ and then replace the exact integral with its Monte Carlo approximation. In simulating the pseudomarkers, one can use a simple Markov chain structure in which the putative QTL genotype at a given location depends only on the two flanking markers. However, this might not work well if the marker data contain genotyping errors or noninformative markers. A hidden Markov model (HMM) can be considered in which the true marker genotypes follow a

Markov chain, and the observed marker genotypes are characterized by distributions conditional on the underlying state process. Using an HMM, one can account for genotyping error and missing marker data in a coherent way. R/qtl (Broman et al. 2003) has this option as well. Figure 11 gives an example. The upper stripe gives the marker data for one animal; blue denotes AA and yellow denotes Aa. Missing marker data are shown in light blue. The lower panel shows 20 realizations sampled from the HMM trained on the marker data.

Suppose that for each location $l$, $Q$ genotype vectors are sampled from the proposal distribution $p(g \mid m)$, where $g = \{g^1, g^2, \ldots, g^L\}$, to give $(g_1^l, g_2^l, \ldots, g_Q^l)$ for $l = 1, \ldots, L$. Then equation (5.8) can be approximated by Monte Carlo integration using the importance samples (see Appendix C):

$$
p(\mathrm{P}k_t^l \mid y, m) \approx \frac{p_{t,Pk}^l \sum_{q=1}^{Q} f_{Pk}(y_t \mid g_q^l)}{p_{t,P0} \sum_{q=1}^{Q} f_{P0}(y_t) + \sum_{l'=1}^{L} p_{t,P1}^{l'} \sum_{q=1}^{Q} f_{P1}(y_t \mid g_q^{l'})} \tag{5.9}
$$

This approach is an extension of the MOM model, evaluated by averaging over the pseudomarkers. We call it pseudomarker-MOM.

Two-Stage Approach

Pseudomarker-MOM scans the genome in small steps in order to find locations to which transcripts map. But applying it over the entire genome, even one chromosome at a time, can be computationally prohibitive. We propose a two-stage approach in which we first apply MOM at the markers to identify interesting regions and then apply pseudomarker-MOM to those regions to better localize ETLs. For each transcript, MOM calculates the posterior probability that it maps to a particular marker or does not map at all. To identify potential ETL regions, we average the linkage evidence across all transcripts to give a marker-specific linkage score.
We can choose hot-spot markers by thresholding the marker-specific linkage evidence, using an HPD region.

Theoretical Result

Here we justify that, under simplified conditions, picking the regions with the highest average linkage evidence is indeed the correct thing to do.

Theorem 1: If we assume that transcripts map to at most one genomic location, that the prior probability of mapping to a particular marker is known and equal for all markers, and that the hyperparameter values are known, then the expected posterior probability of a transcript mapping

to a particular marker is a non-increasing function of the recombination frequency between that marker and the ETL.

Proof: See Appendix D.

As a result of Theorem 1, we can be sure that under these conditions, the marker regions picked using MOM will be those closest to the ETL. Once these regions are defined, we set up an equally spaced pseudomarker grid and use pseudomarker-MOM to localize the ETL with greater accuracy. We show by simulation in the next section that this approach works quite well, even under more general conditions.

Simulation Results

A simulation was set up with 5000 transcripts and 100 individuals. The proportion of differentially expressed (DE) transcripts was 10%. The hypothetical marker map consists of one chromosome with 10 markers equally spaced at 10cM intervals (i.e., from 0cM to 90cM). Intensity values are simulated as described in Section 4.4. Two simulations were considered: one with a single ETL at 35cM, and one with two ETLs, at 35cM and 75cM. The two-stage approach was applied in each simulation. Specifically, two hot-spot marker regions were selected from the average posterior probability profile at every marker, obtained by MOM. Pseudomarker-MOM was then applied across a 2cM grid within the hot-spot regions, with 50 pseudomarker realizations ($Q = 50$). For comparison, we applied traditional QTL IM transcript by transcript on the same data set and obtained genome-wide cutoffs on LOD scores based on an approximation formula from Rebai et al. (1994). Here we show results from the two simulations.

Figure 2 (left panel) shows the posterior probability profile averaged across the mapping transcripts. The ETL region is identified by both MOM and pseudomarker-MOM. However, MOM picks up a wide peak between marker 4 and marker 5, whereas pseudomarker-MOM identifies the ETL location with much better accuracy.
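
The Monte Carlo average that pseudomarker-MOM evaluates at each grid point, as in (5.9) with $Q = 50$, can be illustrated at a single location. The following is a minimal sketch and not the implementation used for these simulations. It assumes a normal observation model with known variance and a normal prior on each latent mean, so that the predictive density has a closed form. It treats genotypes as independent across individuals given hypothetical per-individual probabilities `p_one` (standing in for $p(g^l \mid m)$), and it reduces (5.9) to one candidate location with prior probability `prior_p1`. All names and numerical values are illustrative assumptions.

```python
import math
import random

def log_marginal(values, mu0=0.0, tau2=1.0, sigma2=1.0):
    """Log predictive density of one genotype group.

    Assumed model (not from the text): y_j ~ N(mu, sigma2), mu ~ N(mu0, tau2).
    """
    n = len(values)
    if n == 0:
        return 0.0
    s = sum(v - mu0 for v in values)
    ss = sum((v - mu0) ** 2 for v in values)
    quad = (ss - tau2 / (sigma2 + n * tau2) * s * s) / sigma2
    return -0.5 * (n * math.log(2 * math.pi * sigma2)
                   + math.log(1 + n * tau2 / sigma2) + quad)

def f_p1(y, g):
    """Predictive density under P1: a separate latent mean per genotype group."""
    group0 = [v for v, gj in zip(y, g) if gj == 0]
    group1 = [v for v, gj in zip(y, g) if gj == 1]
    return math.exp(log_marginal(group0) + log_marginal(group1))

def f_p0(y):
    """Predictive density under P0: one latent mean shared by all individuals."""
    return math.exp(log_marginal(y))

def sample_pseudomarker(p_one, rng):
    """Draw one genotype vector; p_one[j] is P(g_j = 1 | marker data)."""
    return [1 if rng.random() < pj else 0 for pj in p_one]

def posterior_p1(y, p_one, prior_p1, Q=50, seed=0):
    """Single-location version of (5.9): average f_P1 over Q pseudomarkers."""
    rng = random.Random(seed)
    avg_f1 = sum(f_p1(y, sample_pseudomarker(p_one, rng)) for _ in range(Q)) / Q
    num = prior_p1 * avg_f1
    return num / ((1.0 - prior_p1) * f_p0(y) + num)

# Hypothetical expression values for 8 animals and genotype probabilities at
# one grid point: animals 0-3 are probably genotype 0, animals 4-7 genotype 1.
y = [-1.1, -0.8, -1.3, -0.9, 1.0, 1.2, 0.7, 1.1]
p_one = [0.05, 0.05, 0.10, 0.05, 0.95, 0.90, 0.95, 0.95]
print(round(posterior_p1(y, p_one, prior_p1=0.1, Q=50), 3))
```

When every entry of `p_one` is 0 or 1, as at a fully typed marker, all $Q$ draws coincide and the average reduces to the ordinary MOM calculation at that marker.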
To ensure that non-mapping transcripts were not falsely identified, we considered the posterior probability profiles averaged across non-mapping transcripts. They show little structure, as expected (figure not shown). Figure 2 (right panel) shows the 96.8% HPD regions for the true mapping transcripts (96.8% is used for comparison with the IM results below). As shown, the ETL is identified correctly for most of the mapping transcripts. As in the left panel, pseudomarker-MOM provides good ETL localization.
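
For a single transcript, an HPD region on a discrete grid can be formed by adding grid positions in decreasing order of posterior probability until the target mass is reached. The sketch below illustrates this under the assumption that the posterior probabilities over the grid are given; the function name, grid and probabilities are hypothetical and are not the values behind Figure 2.

```python
def hpd_region(positions, probs, level=0.968):
    """Smallest set of grid positions whose posterior mass reaches `level`."""
    total = sum(probs)  # renormalize in case probs do not sum exactly to 1
    ranked = sorted(zip(probs, positions), reverse=True)
    region, mass = [], 0.0
    for prob, pos in ranked:
        region.append(pos)
        mass += prob / total
        if mass >= level:
            break
    return sorted(region)

# Hypothetical posterior over a 2 cM grid between the markers at 30 and 40 cM.
positions_cm = [30, 32, 34, 36, 38, 40]
posterior = [0.01, 0.04, 0.45, 0.40, 0.08, 0.02]
print(hpd_region(positions_cm, posterior))  # [32, 34, 36, 38]
```

A region built this way need not be contiguous if the posterior profile has more than one peak.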

For comparison, we also ran IM on the same simulated data set, implemented using R/qtl (Broman et al. 2003). Figure 3 (left panel) shows the LOD score averaged across mapping transcripts. The region containing the ETL has the highest average LOD, but the average LOD scores are very high overall for mapping transcripts, and it is not clear what cutoff one should use to correctly identify the ETL region. For example, a LOD cutoff of 5 selects almost the entire simulated chromosome. To compare with Figure 2, a confidence interval was constructed around the ETL using a 1-LOD drop interval around the peak LOD score (Mangin et al. 1994). This procedure is designed to approximate a 96.8% confidence interval, but in general these intervals can be biased in that they are too small, and a bootstrap procedure has been recommended (Visscher et al. 1996). In the ETL setting, obtaining bootstrap samples for thousands of transcripts does not seem feasible. On the other hand, confidence intervals that are slightly too small favor IM here, since the ETL appear to be better localized.

It is not always clear which peaks to construct confidence intervals around. To give IM the best results, we consider a 10cM window around the true ETL (35cM) and define the LOD peak as the highest LOD within the window. The 1-LOD drop interval is then constructed. Of course, in practice one does not have the luxury of knowing where to choose these peaks, and perhaps only the largest peak would be identified. For these reasons, this method of identifying ETL regions favors IM. Even so, in comparison to pseudomarker-MOM, IM does not provide as precise estimates of ETL location.

Similar results were obtained for the 2-ETL case. As shown in Figure 4 (left panel), for a particular simulation, the two ETL regions are identified by both MOM and pseudomarker-MOM.
However, spurious peaks show up in different places in different simulations; in general they have low average posterior probability compared with the main peaks (e.g., in the bottom row). As shown, pseudomarker-MOM improves localization of the ETL compared with MOM. In Figure 4 (right panel), the distinct ETLs are identified for most of the mapping transcripts. We see that pseudomarker-MOM provides good ETL localization.

We again considered IM on the same simulated data. Figure 5 (left panel) shows the LOD score profile averaged across mapping transcripts. The regions containing the ETLs have the highest average LOD, but the distinct ETLs are not as clear as in Figure 3. As before, 96.8% confidence intervals were constructed around each ETL using a 1-LOD drop interval around the peak LOD scores, with each peak chosen from a 10cM window around the corresponding true ETL (35cM and 75cM). IM suffers from the same problem as before in that it makes relatively more false positive calls.
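
The windowed 1-LOD drop construction used for the IM comparison can be sketched as follows. This is one simple convention, assumed here for illustration: the interval consists of the grid positions contiguous with the peak whose LOD is within `drop` of the peak. Software such as R/qtl may use a slightly different convention at the interval ends. The function name and the LOD values are hypothetical.

```python
def lod_drop_interval(positions, lod, drop=1.0, window=None):
    """Return (left, peak, right) positions of a LOD drop interval.

    window: optional (low, high) range in which the peak is searched,
    e.g. a 10 cM window around a known ETL position.
    """
    idx = list(range(len(positions)))
    if window is not None:
        low, high = window
        idx = [i for i in idx if low <= positions[i] <= high]
    peak = max(idx, key=lambda i: lod[i])
    cutoff = lod[peak] - drop
    left = peak
    while left > 0 and lod[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lod) - 1 and lod[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[peak], positions[right]

# Hypothetical LOD profile on a 2 cM grid; peak searched within 30-40 cM.
positions_cm = [30, 32, 34, 36, 38, 40]
lod_scores = [3.0, 4.2, 5.1, 5.0, 4.3, 2.9]
print(lod_drop_interval(positions_cm, lod_scores, window=(30, 40)))  # (32, 34, 38)
```

Restricting the peak search to a window around the true ETL, as in the comparison above, uses information that is not available in practice.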

5.2 Multiple ETL mapping

The MOM approach can be extended to account for multiple ETL. For example, if transcript $t$ is possibly affected by two locations $l_1$ and $l_2$, then four latent means are of interest: $\mu_{t,(l_1,l_2)}^{0,0}$, $\mu_{t,(l_1,l_2)}^{0,1}$, $\mu_{t,(l_1,l_2)}^{1,0}$ and $\mu_{t,(l_1,l_2)}^{1,1}$, where $\mu_{t,(l_1,l_2)}^{j,k}$ denotes the latent mean level of expression for transcript $t$ in the population of individuals with genotype $(j, k)$ at locations $l_1$ and $l_2$. These latent means can be arranged in 15 possible expression patterns, all of which may be of interest. For simplicity, writing $\mu^{j,k}$ for $\mu_{t,(l_1,l_2)}^{j,k}$, we consider:

P0: $\mu^{0,0} = \mu^{1,0} = \mu^{0,1} = \mu^{1,1}$
P1: $\mu^{0,0} = \mu^{0,1} \neq \mu^{1,0} = \mu^{1,1}$
P2: $\mu^{0,0} = \mu^{1,0} \neq \mu^{0,1} = \mu^{1,1}$
P3: $\mu^{0,0} \neq \mu^{1,0} = \mu^{0,1} \neq \mu^{1,1}$
P4: $\mu^{0,0} \neq \mu^{1,0} = \mu^{0,1} = \mu^{1,1}$
P5: $\mu^{0,0} \neq \mu^{1,0} \neq \mu^{0,1} \neq \mu^{1,1}$

Pattern P0 allows for the possibility that a transcript maps to neither location. The latent means of a transcript mapping only to location $l_1$ satisfy pattern P1, since only allelic differences at $l_1$ affect the mean level of expression. Similarly, the means of transcripts mapping only to location $l_2$ satisfy P2. Patterns P3-P5 describe possible ways in which the allelic variation at both locations can act and interact to affect expression level. P3 describes a scenario in which the alleles at each location have equal, but not dominant, effects. A dominant model would be described by P4 and an additive model by P5. The multiple ETL MOM model (M-MOM) has $5\binom{M}{2}$ patterns. As before, of primary interest is the posterior probability of particular expression patterns.
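
The count of 15 possible patterns for four latent means can be checked by enumerating every way of grouping the means into blocks of equal value (the set partitions of a four-element set). The following is a minimal sketch; the labels "00", "10", "01", "11" for the genotype pairs and the function name are illustrative choices.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # `first` forms a block of its own ...
        yield [[first]] + partition
        # ... or joins one of the existing blocks.
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]

means = ["00", "10", "01", "11"]
patterns = list(set_partitions(means))
print(len(patterns))  # 15
for blocks in patterns:
    # Means within a block are equal; different blocks have different means.
    print(" != ".join(" = ".join(block) for block in blocks))
```

The same function applied to the two latent means of a single location in a backcross gives 2 patterns, the null pattern P0 and the linkage pattern P1.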
These posterior probabilities can be calculated as in (4.5) for any pattern of interest, where $k = 0, 1, \ldots, 5$.

Simulation Results

We apply M-MOM to the simulated 2-ETL data described in the preceding simulation study. We perform a two-dimensional scan, looking at every possible marker pair and all the expression patterns simultaneously. In the simulation, the two ETL effect sizes are generated to be the same, which corresponds to pattern P3. If we look at the average posterior probabilities for all non-null patterns, most of them are negligible, as expected, except for those of pattern P3 (figure not shown). Looking closely at pattern P3, we plot the average posterior probabilities for every marker pair in Figure 6. The first ETL location is on the Y axis and the second ETL location on the X axis. As

shown, the plot locates the two ETLs at 30cM and 80cM, which is very close to the truth and the best accuracy that M-MOM can achieve, since the true ETLs are not at markers. For comparison, we implement a 2-D marker regression scan on the same data set (see Figure 7). In the upper triangle, we plot the average LOD score for epistasis, which is very low, as expected. The diagonal is the average LOD score from the 1-D IM scan. The lower triangle gives the average joint LOD scores. Also shown are contour lines over the range of the LOD scores obtained. The region from 30cM to 40cM and from 70cM to 90cM has relatively high average LOD scores compared to the others. To assess significance, we randomly sample 10 transcripts corresponding to every 10th percentile of the log of expression means and perform a permutation test with 1000 permutations on each of them. The average of the 95th percentiles of the LOD scores from these permutation tests is about 3.2. Using this as our 2-D LOD score cutoff, we see from the contour lines that it gives a much wider region than the actual ETL. In line with the previous comparisons, using traditional QTL mapping techniques on the ETL data seems to yield more false positives.

6 Future Research Questions

The proposed framework for ETL mapping enables IM of a single ETL and the identification of multiple ETL at markers. Preliminary results from simulated data sets are encouraging, but many questions need to be explored further.

1. Our methods are extensions of the MOM model, and a number of implementation questions remain open. One question that is important to our extension is: how sparse is sparse? In other words, how many markers can MOM handle relative to the number of transcripts? More generally, one might also be concerned about whether MOM can fit mixtures with so many components. We ran a series of simulations to shed some light on this.
Consider a backcross, and suppose the genome has 400 equally spaced markers 5cM apart. A proportion of 25% of the transcripts are DE, mapping to 80% of the markers with equal probabilities. We varied the number of transcripts over 100, 1000, 5000 and 40000, representing very small, small, moderate and large numbers of transcripts in a realistic experiment, with the number of animals being 10, 60 and 100. Here 10 animals might be a little unrealistic, but we were curious to see what the results would look like.

We plot the posterior probabilities of a particular mixture component in the MOM model for all the transcripts in Figures 8, 9 and 10. The number of transcripts mapping to that mixture component is 1, 1, 4 and 22, corresponding to the four values of the total number of transcripts. The true DE transcripts are colored in red. Surprisingly, it seems that MOM can fit a mixture model with more components than transcripts quite well (first panel in Figure 9 and Figure 10), as long as there are enough animals in the experiment. Our impression is that when the number of animals is small, there tend to be very few recombinations. The MOM components are then not very distinct, resulting in relatively low power (see Figure 8). When the number of transcripts is small, there are also very few DE transcripts mapping to each marker. We might expect the posterior probabilities of those transcripts not to be very accurate, but as seen from Figure 9, this does not seem to be the case. There are 60 animals, but even with only 100 transcripts, MOM still detects the one DE transcript, with posterior probability 1.0. When there are 40000 transcripts, among the 22 transcripts with the highest posterior probabilities, 20 are true DE (22 DE in total). Their posterior probabilities range up to 1.0. In Figure 10, with 100 animals and 40000 transcripts, 21 of the top 22 transcripts are DE, and their posterior probabilities range from 0.76 to 1.0. Much more work is required to investigate these results and define precise conditions under which parameters are well estimated.

2. Importance sampling is applied in pseudomarker-MOM. Importance samples are taken from the proposal distribution $p(g^l \mid m)$ to approximate what we really desire, $p(g^l \mid y, m)$. We will investigate ways to choose $Q$ that balance computational burden and accuracy. Perhaps a lower bound on $Q$ can be obtained.

3. The proposed two-stage approach relies on the ability of MOM to identify the correct regions for follow-up. We have shown theoretically that, under simplified conditions, the highest posterior probability region obtained using MOM is the region closest to the ETL. The simplified conditions need to be relaxed. In particular, we will investigate the case of multiple ETL and varying prior probabilities.

4. We have developed a method for mapping multiple ETL. Preliminary results from simulations investigating a 2-ETL system are encouraging. We would like to extend this

Multiple QTL mapping

Multiple QTL mapping Multiple QTL mapping Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] 1 Why? Reduce residual variation = increased power

More information

Gene mapping in model organisms

Gene mapping in model organisms Gene mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Goal Identify genes that contribute to common human diseases. 2

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University kbroman@jhsph.edu www.biostat.jhsph.edu/ kbroman Outline Experiments and data Models ANOVA

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Human vs mouse Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] www.daviddeen.com

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics and Medical Informatics University of Wisconsin Madison www.biostat.wisc.edu/~kbroman [ Teaching Miscellaneous lectures]

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University kbroman@jhsph.edu www.biostat.jhsph.edu/ kbroman Outline Experiments and data Models ANOVA

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl Broman Biostatistics and Medical Informatics University of Wisconsin Madison kbroman.org github.com/kbroman @kwbroman Backcross P 1 P 2 P 1 F 1 BC 4

More information

Inferring Genetic Architecture of Complex Biological Processes

Inferring Genetic Architecture of Complex Biological Processes Inferring Genetic Architecture of Complex Biological Processes BioPharmaceutical Technology Center Institute (BTCI) Brian S. Yandell University of Wisconsin-Madison http://www.stat.wisc.edu/~yandell/statgen

More information

Fast Bayesian Methods for Genetic Mapping Applicable for High-Throughput Datasets

Fast Bayesian Methods for Genetic Mapping Applicable for High-Throughput Datasets Fast Bayesian Methods for Genetic Mapping Applicable for High-Throughput Datasets Yu-Ling Chang A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment

More information

Mapping multiple QTL in experimental crosses

Mapping multiple QTL in experimental crosses Human vs mouse Mapping multiple QTL in experimental crosses Karl W Broman Department of Biostatistics & Medical Informatics University of Wisconsin Madison www.biostat.wisc.edu/~kbroman www.daviddeen.com

More information

Statistical issues in QTL mapping in mice

Statistical issues in QTL mapping in mice Statistical issues in QTL mapping in mice Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Outline Overview of QTL mapping The X chromosome Mapping

More information

Causal Model Selection Hypothesis Tests in Systems Genetics

Causal Model Selection Hypothesis Tests in Systems Genetics 1 Causal Model Selection Hypothesis Tests in Systems Genetics Elias Chaibub Neto and Brian S Yandell SISG 2012 July 13, 2012 2 Correlation and Causation The old view of cause and effect... could only fail;

More information

Causal Graphical Models in Systems Genetics

Causal Graphical Models in Systems Genetics 1 Causal Graphical Models in Systems Genetics 2013 Network Analysis Short Course - UCLA Human Genetics Elias Chaibub Neto and Brian S Yandell July 17, 2013 Motivation and basic concepts 2 3 Motivation

More information

QTL Mapping I: Overview and using Inbred Lines

QTL Mapping I: Overview and using Inbred Lines QTL Mapping I: Overview and using Inbred Lines Key idea: Looking for marker-trait associations in collections of relatives If (say) the mean trait value for marker genotype MM is statisically different

More information

Mapping multiple QTL in experimental crosses

Mapping multiple QTL in experimental crosses Mapping multiple QTL in experimental crosses Karl W Broman Department of Biostatistics and Medical Informatics University of Wisconsin Madison www.biostat.wisc.edu/~kbroman [ Teaching Miscellaneous lectures]

More information

Use of hidden Markov models for QTL mapping

Use of hidden Markov models for QTL mapping Use of hidden Markov models for QTL mapping Karl W Broman Department of Biostatistics, Johns Hopkins University December 5, 2006 An important aspect of the QTL mapping problem is the treatment of missing

More information

R/qtl workshop. (part 2) Karl Broman. Biostatistics and Medical Informatics University of Wisconsin Madison. kbroman.org

R/qtl workshop. (part 2) Karl Broman. Biostatistics and Medical Informatics University of Wisconsin Madison. kbroman.org R/qtl workshop (part 2) Karl Broman Biostatistics and Medical Informatics University of Wisconsin Madison kbroman.org github.com/kbroman @kwbroman Example Sugiyama et al. Genomics 71:70-77, 2001 250 male

More information

Overview. Background

Overview. Background Overview Implementation of robust methods for locating quantitative trait loci in R Introduction to QTL mapping Andreas Baierl and Andreas Futschik Institute of Statistics and Decision Support Systems

More information

Quantile-based permutation thresholds for QTL hotspot analysis: a tutorial

Quantile-based permutation thresholds for QTL hotspot analysis: a tutorial Quantile-based permutation thresholds for QTL hotspot analysis: a tutorial Elias Chaibub Neto and Brian S Yandell September 18, 2013 1 Motivation QTL hotspots, groups of traits co-mapping to the same genomic

More information

Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012

Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012 Quantile based Permutation Thresholds for QTL Hotspots Brian S Yandell and Elias Chaibub Neto 17 March 2012 2012 Yandell 1 Fisher on inference We may at once admit that any inference from the particular

More information

QTL Model Search. Brian S. Yandell, UW-Madison January 2017

QTL Model Search. Brian S. Yandell, UW-Madison January 2017 QTL Model Search Brian S. Yandell, UW-Madison January 2017 evolution of QTL models original ideas focused on rare & costly markers models & methods refined as technology advanced single marker regression

More information

Lecture 8. QTL Mapping 1: Overview and Using Inbred Lines

Lecture 8. QTL Mapping 1: Overview and Using Inbred Lines Lecture 8 QTL Mapping 1: Overview and Using Inbred Lines Bruce Walsh. jbwalsh@u.arizona.edu. University of Arizona. Notes from a short course taught Jan-Feb 2012 at University of Uppsala While the machinery

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia Expression QTLs and Mapping of Complex Trait Loci Paul Schliekelman Statistics Department University of Georgia Definitions: Genes, Loci and Alleles A gene codes for a protein. Proteins due everything.

More information

Selecting explanatory variables with the modified version of Bayesian Information Criterion

Selecting explanatory variables with the modified version of Bayesian Information Criterion Selecting explanatory variables with the modified version of Bayesian Information Criterion Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland in cooperation with J.K.Ghosh,

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Joint Analysis of Multiple Gene Expression Traits to Map Expression Quantitative Trait Loci. Jessica Mendes Maia

Joint Analysis of Multiple Gene Expression Traits to Map Expression Quantitative Trait Loci. Jessica Mendes Maia ABSTRACT MAIA, JESSICA M. Joint Analysis of Multiple Gene Expression Traits to Map Expression Quantitative Trait Loci. (Under the direction of Professor Zhao-Bang Zeng). The goal of this dissertation is

More information

Methods for QTL analysis

Methods for QTL analysis Methods for QTL analysis Julius van der Werf METHODS FOR QTL ANALYSIS... 44 SINGLE VERSUS MULTIPLE MARKERS... 45 DETERMINING ASSOCIATIONS BETWEEN GENETIC MARKERS AND QTL WITH TWO MARKERS... 45 INTERVAL

More information

MULTIPLE-TRAIT MULTIPLE-INTERVAL MAPPING OF QUANTITATIVE-TRAIT LOCI ROBY JOEHANES

MULTIPLE-TRAIT MULTIPLE-INTERVAL MAPPING OF QUANTITATIVE-TRAIT LOCI ROBY JOEHANES MULTIPLE-TRAIT MULTIPLE-INTERVAL MAPPING OF QUANTITATIVE-TRAIT LOCI by ROBY JOEHANES B.S., Universitas Pelita Harapan, Indonesia, 1999 M.S., Kansas State University, 2002 A REPORT submitted in partial

More information

Anumber of statistical methods are available for map- 1995) in some standard designs. For backcross populations,

Anumber of statistical methods are available for map- 1995) in some standard designs. For backcross populations, Copyright 2004 by the Genetics Society of America DOI: 10.1534/genetics.104.031427 An Efficient Resampling Method for Assessing Genome-Wide Statistical Significance in Mapping Quantitative Trait Loci Fei

More information

The Admixture Model in Linkage Analysis

The Admixture Model in Linkage Analysis The Admixture Model in Linkage Analysis Jie Peng D. Siegmund Department of Statistics, Stanford University, Stanford, CA 94305 SUMMARY We study an appropriate version of the score statistic to test the

More information

Prediction of the Confidence Interval of Quantitative Trait Loci Location
