USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA

By Xiaoquan Wen and Matthew Stephens, University of Chicago


Submitted to the Annals of Applied Statistics.

Recently-developed genotype imputation methods are a powerful tool for detecting untyped genetic variants that affect disease susceptibility in genetic association studies. However, existing imputation methods require individual-level genotype data, whereas in practice it is often the case that only summary data are available. For example, this may occur because, for reasons of privacy or politics, only summary data are made available to the research community at large; or because only summary data are collected, as in DNA pooling experiments. In this article, we introduce a new statistical method that can accurately infer the frequencies of untyped genetic variants in these settings, and indeed substantially improve frequency estimates at typed variants in pooling experiments where observations are noisy. Our approach, which predicts each allele frequency using a linear combination of observed frequencies, is statistically straightforward, and related to a long history of the use of linear methods for estimating missing values (e.g. Kriging). The main statistical novelty is our approach to regularizing the covariance matrix estimates, and the resulting linear predictors, which is based on methods from population genetics. We find that, besides being both fast and flexible (allowing new problems to be tackled that cannot be handled by existing imputation approaches purpose-built for the genetic context), these linear methods are also very accurate. Indeed, imputation accuracy using this approach is similar to that obtained by state-of-the-art imputation methods that use individual-level data, but at a fraction of the computational cost.

Keywords and phrases: regularized linear predictor, shrinkage estimation, genotype imputation, genetic association study. This work was supported by NIH grants HG02585 and HL to MS.

1. Introduction. Genotype imputation [Servin and Stephens (2008), Guan and Stephens (2008), Marchini et al. (2007), Howie et al. (2009), Browning and Browning (2007), Huang et al. (2009)] has recently emerged as a useful tool in the analysis of genetic association studies as a way of performing tests of association at genetic variants (specifically Single Nucleotide Polymorphisms, or SNPs) that were not actually measured in the association study. In brief, the idea is to exploit the fact that untyped SNPs are often correlated, in a known way, with one or more typed SNPs.

Imputation-based approaches exploit these correlations, using observed genotypes at typed SNPs to estimate, or impute, genotypes at untyped SNPs, and then test for association between the imputed genotypes and phenotype, taking account of uncertainty in the imputed genotypes. (Although in general statistics applications the term imputation may imply replacing unobserved data with a single point estimate, in the genetic context it is often used more broadly to include methods that consider the full conditional distribution of the unobserved genotypes, and this is the way we use it here.) These approaches have been shown to increase overall power to detect associations by expanding the number of genetic variants that can be tested for association [Servin and Stephens (2008), Marchini et al. (2007)], but perhaps their most important use has been in performing meta-analysis of multiple studies that have typed different, but correlated, sets of SNPs (e.g. Zeggini et al. (2008)).

Existing approaches to imputation in this context have been developed to work with individual-level data: given genotype data at typed SNPs in each individual, they attempt to impute the genotypes of each individual at untyped SNPs. From a general statistical viewpoint, one has a large number of correlated discrete-valued random variables (genotypes), whose means and covariances can be estimated, and the aim is to predict the values of a subset of these variables, given observed values of all the other variables. Although one could imagine applying off-the-shelf statistical methods to this problem (e.g. Yu and Schaid (2007) consider approaches based on linear regression), in practice the most successful methods in this context have used purpose-built methods based on discrete Hidden Markov Models (HMMs) that capture ideas from population genetics (e.g. Li and Stephens (2003), Scheet and Stephens (2005)).

In this paper we consider a related, but different, problem: given the frequency of each allele at all typed SNPs, we attempt to impute the frequency of each allele at each untyped SNP. We have two main motivations for considering this problem. The first is that, although most large-scale association studies collect individual-level data, it is often the case that, for reasons of privacy [Homer et al. (2008b), Sankararaman et al. (2009)] or politics, only the allele frequency data (e.g. in cases vs controls) are made available to the research community at large. The second motivation is an experimental design known as DNA pooling [Homer et al. (2008a), Meaburn et al. (2006)], in which individual DNA samples are grouped into pools and high-throughput genotypings are performed on each pool. This experimental design can be considerably cheaper than separately genotyping each individual, but comes at the cost of providing only (noisy) estimates of the allele frequencies in each pool.

In this setting the methods described here can provide not only estimates of the allele frequencies at untyped SNPs, but also more accurate estimates of the allele frequencies at typed SNPs.

From a general statistical viewpoint this problem of imputing frequencies is not so different from imputing individual genotypes: essentially it simply involves moving from discrete-valued variables to continuous ones. However, this change to continuous variables precludes direct use of the discrete HMM-based methods that have been applied so successfully to impute individual genotypes. The methods we describe here come from our attempts to extend and modify these HMM-based approaches to deal with continuous data. In doing so we end up with a considerably simplified method that might be considered an off-the-shelf statistical approach: in essence, we model the allele frequencies using a multivariate normal distribution, which results in unobserved frequencies being imputed using linear combinations of the observed frequencies (as in Kriging, for example). Some connection with the HMM-based approaches remains, though, in how we estimate the mean and variance-covariance matrix of the allele frequencies. In particular, consideration of the HMM-based approaches leads to a natural way to regularize the estimated variance-covariance matrix, making it sparse and banded: something that is important here for both computational and statistical reasons. The resulting methods are highly computationally efficient, and can easily handle very large panels (phased or unphased). They are also surprisingly accurate, giving estimated allele frequencies that are similar in accuracy to those obtained from state-of-the-art HMM-based methods applied to individual genotype data. That is, one can estimate allele frequencies at untyped SNPs almost as accurately using only the frequency data at typed SNPs as using the individual data at typed SNPs. Furthermore, when individual-level data are available one can also apply our method to imputation of individual genotypes (effectively by treating each individual as a pool of 1), and this results in imputation accuracy very similar to that of state-of-the-art HMM-based methods, at a fraction of the computational cost. Finally, in the context of noisy data from pooling experiments, we show via simulation that the method can produce substantially more accurate estimated allele frequencies at genotyped markers.

2. Methods and models. We begin by introducing some terminology and notation. A SNP (Single Nucleotide Polymorphism) is a genetic marker that usually exhibits two different types (alleles) in a population. We will use 0 and 1 to denote the two alleles at each SNP, with the labeling being essentially arbitrary.

A haplotype is a combination of alleles at multiple SNPs residing on a single copy of a genome. Each haplotype can be represented by a string of binary (0/1) values. Each individual has two haplotypes, one inherited from each parent. Routine technologies read the two haplotypes simultaneously to produce a measurement of the individual's genotype at each SNP, which can be coded as 0, 1 or 2 copies of the 1 allele. Thus the haplotypes themselves are not usually directly observed, although they can be inferred using statistical methods [Stephens et al. (2001)]. Genotype data where the haplotypes are treated as unknown are referred to as unphased, whereas if the haplotypes are measured or estimated they are referred to as phased.

In this paper we consider the following form of imputation problem. We assume that data are available on p SNPs in a reference panel of data on m individuals sampled from a population, and that a subset of these SNPs are typed on a further study sample of individuals taken from a similar population. The goal is to estimate data at untyped SNPs in the study sample, using the information on the correlations among typed and untyped SNPs that is contained in the reference panel data. We let M denote the panel data, and h_1, ..., h_2n denote the 2n haplotypes in the study sample. In the simplest case, the panel data will be a 2m x p binary matrix, and the haplotypes h_1, ..., h_2n can be thought of as additional rows of this matrix with some missing entries whose values need imputing. Several papers have focused on this problem of individual-level imputation [Servin and Stephens (2008), Scheet and Stephens (2005), Marchini et al. (2007), Browning and Browning (2007), Li et al. (2006)].

In this paper, we consider the problem of performing imputation when only summary-level data are available for h_1, ..., h_2n. Specifically, let

(2.1)    y = (y_1, ..., y_p)' = (1/2n) sum_{i=1}^{2n} h_i

denote the vector of allele frequencies in the study sample. We assume that these allele frequencies are measured at a subset of typed SNPs, and consider the problem of using these measurements, together with information in M, to estimate the allele frequencies at the remaining untyped SNPs. More formally, if (y_t, y_u) denotes the partition of y into typed and untyped SNPs, our aim is to estimate the conditional distribution y_u | y_t, M.

Our approach is based on the assumption that h_1, ..., h_2n are independent and identically distributed (i.i.d.) draws from some conditional distribution Pr(h | M). Specifically, in common with many existing approaches to individual-level imputation [Stephens and Scheet (2005), Marchini et al. (2007), Li et al. (2006)], we use the HMM-based conditional distribution from Li and Stephens (2003), although other choices could be considered.

It then follows, by the central limit theorem, that provided the sample size 2n is large, the distribution of y | M can be approximated by a multivariate normal distribution:

(2.2)    y | M ~ N_p(µ, Σ),

where µ = E(h | M) and Σ = (1/2n) Var(h | M). From this joint distribution, the required conditional distribution is easily obtained. Specifically, by partitioning µ and Σ in the same way as y, according to SNPs' typed/untyped status, (2.2) can be written

(2.3)    (y_u, y_t)' | M ~ N_p( (µ_u, µ_t)', [ Σ_uu  Σ_ut ; Σ_tu  Σ_tt ] ),

and

(2.4)    y_u | y_t, M ~ N_q( µ_u + Σ_ut Σ_tt^{-1} (y_t - µ_t),  Σ_uu - Σ_ut Σ_tt^{-1} Σ_tu ).

The mean of this last distribution can be used as a point estimate for the unobserved frequencies y_u, while the variance gives an indication of the uncertainty in these estimates. (In principle the mean can lie outside the range [0, 1], in which case we use 0 or 1, as appropriate, as the point estimate; however this happens very rarely in practice.)

The parameters µ and Σ must be estimated from the panel data. It may seem natural to estimate these using the empirical mean f^panel and the empirical covariance matrix Σ^panel from the panel. However, Σ^panel is highly rank deficient because the sample size m in the panel is far less than the number of SNPs p, and so this empirical matrix cannot be used directly. Use of the conditional distribution from Li and Stephens solves this problem. Indeed, under this conditional distribution E(h | M) = µ̂ and Var(h | M) = Σ̂ can be derived analytically (Appendix A) as:

(2.5)    µ̂ = (1 - θ) f^panel + (θ/2) 1,

(2.6)    Σ̂ = (1 - θ)^2 S + (θ/2)(1 - θ/2) I,

where θ is a parameter relating to mutation, and S is obtained from Σ^panel by shrinking off-diagonal entries towards 0. Specifically,

(2.7)    S_ij = Σ^panel_ij                      if i = j,
         S_ij = exp(-ρ_ij / 2m) Σ^panel_ij      if i ≠ j,

where ρ_ij is an estimate of the population-scaled recombination rate between SNPs i and j (e.g. Hudson (2001), Li and Stephens (2003), McVean et al. (2002)).
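To make the linear-predictor step concrete, the following is a minimal numerical sketch of the conditional computation in (2.3)-(2.4). The function and variable names are ours, and the toy values of µ and Σ stand in for panel-derived estimates such as (2.5)-(2.7); this is an illustration, not the authors' implementation.

```python
# Minimal sketch of the conditional-normal imputation in (2.3)-(2.4).
# mu and Sigma would normally come from the panel via (2.5)-(2.7); here they are toy values.
import numpy as np

def impute_untyped(mu, Sigma, typed_idx, untyped_idx, y_typed):
    """Return E(y_u | y_t) and Var(y_u | y_t) under y ~ N(mu, Sigma)."""
    mu_t, mu_u = mu[typed_idx], mu[untyped_idx]
    S_tt = Sigma[np.ix_(typed_idx, typed_idx)]
    S_ut = Sigma[np.ix_(untyped_idx, typed_idx)]
    S_uu = Sigma[np.ix_(untyped_idx, untyped_idx)]
    # Solve S_tt x = (y_t - mu_t) rather than forming an explicit inverse.
    w = np.linalg.solve(S_tt, y_typed - mu_t)
    cond_mean = mu_u + S_ut @ w
    cond_var = S_uu - S_ut @ np.linalg.solve(S_tt, S_ut.T)
    # Point estimates are truncated to [0, 1], as described in the text.
    return np.clip(cond_mean, 0.0, 1.0), cond_var

# Toy example with 3 SNPs: SNPs 0 and 2 typed, SNP 1 untyped.
mu = np.array([0.30, 0.25, 0.40])
Sigma = np.array([[0.010, 0.006, 0.002],
                  [0.006, 0.012, 0.003],
                  [0.002, 0.003, 0.011]])
mean_u, var_u = impute_untyped(mu, Sigma, [0, 2], [1], np.array([0.35, 0.42]))
print(mean_u, var_u)
```

Using a linear solve rather than an explicit matrix inverse mirrors the Gaussian-elimination remark below and keeps the computation numerically stable.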

We use the value of θ suggested by Li and Stephens (2003),

(2.8)    θ = (sum_{i=1}^{2m-1} 1/i)^{-1} / ( 2m + (sum_{i=1}^{2m-1} 1/i)^{-1} ),

and values of ρ_ij obtained by applying the software PHASE [Stephens and Scheet (2005)] to the HapMap CEU data, which are conveniently distributed with the IMPUTE software package. For SNPs i and j that are distant, exp(-ρ_ij / 2m) ≈ 0. To exploit the benefits of sparsity we set any value that was less than 10^-8 to be 0, which makes Σ̂ sparse and banded: see Figure 1 for illustration. This makes matrix inversion in (2.4) computationally feasible and fast, using standard Gaussian elimination.

Fig 1. Comparison of empirical and shrinkage estimates (based on the Li and Stephens model) of the squared correlation matrix from the panel (panels: Empirical Estimate; Shrinkage Estimate). Both are estimated using the HapMap CEU panel with 120 haplotypes. The region plotted is on chromosome 22 and contains 1000 Affymetrix SNPs which cover a 15Mb genomic region. Squared correlation values in [0.05, 1.00] are displayed using R's heat.colors scheme, with gold representing stronger correlation and red representing weaker correlation.
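As a rough sketch of how the regularized moments (2.5)-(2.8) might be assembled from a phased panel (a 2m x p matrix of 0/1 haplotypes) and pairwise ρ estimates, assuming both are already in memory; the panel, ρ values and threshold below are toy inputs, not the HapMap data or the authors' code.

```python
# Sketch of the shrinkage estimates (2.5)-(2.8) computed from a phased panel.
# `panel` is a (2m x p) 0/1 haplotype matrix; `rho` is a (p x p) matrix of
# population-scaled recombination rates between SNP pairs (assumed given).
import numpy as np

def li_stephens_moments(panel, rho, sparsity_cut=1e-8):
    n_hap, p = panel.shape                                 # n_hap = 2m haplotypes
    theta = 1.0 / np.sum(1.0 / np.arange(1, n_hap))        # (sum_{i=1}^{2m-1} 1/i)^{-1}
    theta = theta / (n_hap + theta)                        # equation (2.8)
    f_panel = panel.mean(axis=0)
    Sigma_panel = np.cov(panel, rowvar=False, bias=True)   # empirical covariance (divide by 2m)
    shrink = np.exp(-rho / n_hap)                          # exp(-rho_ij / 2m)
    shrink[shrink < sparsity_cut] = 0.0                    # enforce sparse banding
    np.fill_diagonal(shrink, 1.0)                          # diagonal left unshrunk
    S = shrink * Sigma_panel                               # equation (2.7)
    mu_hat = (1 - theta) * f_panel + theta / 2.0           # equation (2.5)
    Sigma_hat = (1 - theta) ** 2 * S + (theta / 2) * (1 - theta / 2) * np.eye(p)   # (2.6)
    return mu_hat, Sigma_hat

# Toy panel: 6 haplotypes, 4 SNPs, with arbitrary rho values.
panel = np.array([[0,1,1,0],[1,1,0,0],[0,0,1,1],[1,0,0,1],[0,1,1,1],[1,1,0,1]])
rho = np.array([[0,1,5,50],[1,0,2,40],[5,2,0,30],[50,40,30,0]], dtype=float)
mu_hat, Sigma_hat = li_stephens_moments(panel, rho)
print(mu_hat.round(3)); print(Sigma_hat.round(4))
```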

Incorporating measurement error and over-dispersion. Our treatment above assumes that the allele frequencies of typed SNPs, y_t, are observed without error. In some settings, for example in DNA pooling experiments, this is not the case. We incorporate measurement error by introducing a single parameter ε^2, and assume

(2.9)    y_t^obs | y_t^true ~ N_{p-q}( y_t^true, ε^2 I ),

where the random vectors y_t^obs and y_t^true represent the observed and true sample allele frequencies for typed SNPs respectively, and the subscript p - q denotes the number of typed SNPs. We assume that, given y_t^true, the observations y_t^obs are conditionally independent of the panel data (M) and the allele frequencies at untyped SNPs (y_u^true).

Our treatment in the previous section also makes several other implicit assumptions: for example, that the panel and study individuals are sampled from the same population, and that the parameters ρ and θ are estimated without error. Deviations from these assumptions will cause over-dispersion: the true allele frequencies will lie further from their expected values than the model predicts. To allow for this, we modify (2.2) by introducing an over-dispersion parameter σ^2:

(2.10)    y^true | M ~ N_p( µ̂, σ^2 Σ̂ ).

Over-dispersion models like this are widely used for modeling binomial data [McCullagh and Nelder (1989)]. In our applications below, for settings involving measurement error (i.e. DNA pooling experiments), we estimate σ^2 and ε^2 by maximizing the multivariate normal likelihood

(2.11)    y_t^obs | M ~ N_{p-q}( µ̂_t, σ^2 Σ̂_tt + ε^2 I ).

For settings without measurement error, we set ε^2 = 0 and estimate σ^2 by maximum likelihood.

From the hierarchical model defined by (2.9) and (2.10), the conditional distributions of allele frequencies at untyped and typed SNPs are given by

(2.12)    y_u^true | y_t^obs, M ~ N_q( µ̂_u + Σ̂_ut (Σ̂_tt + (ε^2/σ^2) I)^{-1} (y_t^obs - µ̂_t),  σ^2 ( Σ̂_uu - Σ̂_ut (Σ̂_tt + (ε^2/σ^2) I)^{-1} Σ̂_tu ) )

and

(2.13)    y_t^true | y_t^obs, M ~ N_{p-q}( ( (1/σ^2) Σ̂_tt^{-1} + (1/ε^2) I )^{-1} ( (1/σ^2) Σ̂_tt^{-1} µ̂_t + (1/ε^2) y_t^obs ),  ( (1/σ^2) Σ̂_tt^{-1} + (1/ε^2) I )^{-1} ).
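A small sketch of how σ^2 and ε^2 might be estimated by maximizing the marginal likelihood (2.11). The optimizer and the toy inputs below are our own choices made for illustration; the original software need not work this way.

```python
# Sketch of estimating the over-dispersion and noise parameters (sigma^2, eps^2)
# by maximizing the marginal multivariate normal likelihood (2.11).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def fit_sigma_eps(y_obs_t, mu_t, Sigma_tt):
    """Maximize N(y_obs_t; mu_t, sigma2*Sigma_tt + eps2*I) over (sigma2, eps2)."""
    def neg_loglik(log_params):
        sigma2, eps2 = np.exp(log_params)            # optimize on the log scale
        cov = sigma2 * Sigma_tt + eps2 * np.eye(len(mu_t))
        return -multivariate_normal.logpdf(y_obs_t, mean=mu_t, cov=cov)
    res = minimize(neg_loglik, x0=np.log([1.0, 1e-3]), method="Nelder-Mead")
    return np.exp(res.x)

# Toy data: three typed SNPs observed with a small amount of added noise.
mu_t = np.array([0.30, 0.25, 0.40])
Sigma_tt = np.array([[0.010, 0.006, 0.002],
                     [0.006, 0.012, 0.003],
                     [0.002, 0.003, 0.011]])
y_obs_t = np.array([0.33, 0.22, 0.45])
sigma2_hat, eps2_hat = fit_sigma_eps(y_obs_t, mu_t, Sigma_tt)
print(sigma2_hat, eps2_hat)
```

Working on the log scale keeps both parameters positive without explicit constraints.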

We use (2.12) to impute allele frequencies at untyped SNPs. In particular, we use the conditional mean

(2.14)    ŷ_u^true = µ̂_u + Σ̂_ut (Σ̂_tt + (ε^2/σ^2) I)^{-1} (y_t^obs - µ̂_t)

as a natural point estimate for these allele frequencies. In settings involving measurement error, we use (2.13) to estimate allele frequencies at typed SNPs, again using the mean

(2.15)    ŷ_t^true = ( (1/σ^2) Σ̂_tt^{-1} + (1/ε^2) I )^{-1} ( (1/σ^2) Σ̂_tt^{-1} µ̂_t + (1/ε^2) y_t^obs )

as a point estimate. Note that this mean has an intuitive interpretation as a weighted average of the observed allele frequency at each SNP and information from other nearby, correlated, SNPs. For example, if two SNPs are perfectly correlated, then in the presence of measurement error, the average of the measured frequencies will be a better estimator of the true frequency than either of the single measurements (assuming measurement errors are uncorrelated). The lower the measurement error, ε^2, the greater the weight given to the observed frequencies; and when ε^2 = 0 the estimated frequencies are just the observed frequencies.

Remark. For both untyped and typed SNPs, our point estimates for allele frequencies, (2.14) and (2.15), are linear functions of the observed allele frequencies. Although these linear predictors were developed based on an appeal to the central limit theorem, and the resultant normality assumption, there are alternative justifications for the use of these particular linear functions that do not rely on normality. Specifically, assuming that the two haplotypes making up each individual are i.i.d. draws from a conditional distribution Pr(h | M) with mean µ̂ and variance-covariance matrix σ^2 Σ̂, the linear predictors (2.14) and (2.15) minimize the integrated risk (assuming squared error loss) among all linear predictors [West and Harrison (1997)]. In this sense they are the best linear predictors, and so we refer to this method of imputation as Best Linear IMPutation, or BLIMP.
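The weighted-average interpretation of (2.15) can be seen in a short sketch: with two strongly correlated SNPs measured with noise, the posterior mean pulls each observation toward the other and toward the panel-based mean. The values below are invented for illustration.

```python
# Sketch of the noise-reduction estimator (2.15) for typed SNPs: a precision-weighted
# combination of the panel-based prior mean and the noisy observed frequencies.
import numpy as np

def denoise_typed(y_obs_t, mu_t, Sigma_tt, sigma2, eps2):
    p = len(mu_t)
    prec_prior = np.linalg.inv(sigma2 * Sigma_tt)    # (sigma^2 Sigma_tt)^{-1}
    prec_obs = np.eye(p) / eps2                      # (eps^2 I)^{-1}
    post_cov = np.linalg.inv(prec_prior + prec_obs)
    post_mean = post_cov @ (prec_prior @ mu_t + prec_obs @ y_obs_t)
    return post_mean, post_cov

# Two highly correlated SNPs measured with noise: the two noisy observations are
# pulled toward each other, which reduces the effective measurement error.
mu_t = np.array([0.30, 0.30])
Sigma_tt = np.array([[0.0100, 0.0099],
                     [0.0099, 0.0100]])
post_mean, _ = denoise_typed(np.array([0.36, 0.26]), mu_t, Sigma_tt,
                             sigma2=1.0, eps2=0.05 ** 2)
print(post_mean)
```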

Extension to imputing genotype frequencies. The development above considers imputing unobserved allele frequencies. In some settings one might also want to impute genotype frequencies. A simple way to do this is to use an assumption of Hardy-Weinberg equilibrium: that is, to assume that if y is the allele frequency at the untyped SNP, then the three genotypes have frequencies (1-y)^2, 2y(1-y) and y^2. Under our normal model, the expected values of these three quantities can be computed:

(2.16)    E( (1-y)^2 | y_t^obs, M ) = ( 1 - E(y | y_t^obs, M) )^2 + Var(y | y_t^obs, M),
          E( y^2 | y_t^obs, M ) = ( E(y | y_t^obs, M) )^2 + Var(y | y_t^obs, M),
          E( 2y(1-y) | y_t^obs, M ) = 1 - E( (1-y)^2 | y_t^obs, M ) - E( y^2 | y_t^obs, M ),

where E(y | y_t^obs, M) and Var(y | y_t^obs, M) are given in (2.12). These expectations can be used as estimates of the unobserved genotype frequencies.

The method above uses only allele frequency data at typed SNPs. If data are also available on genotype frequencies, as might be the case if the data are summary data from a regular genome scan in which all individuals were individually genotyped, then an alternative approach that does not assume HWE is possible. In brief, we write the unobserved genotype frequencies as means of the genotype indicators, 1_[g=0] and 1_[g=2] (analogous to expression (2.1)), and then derive expressions for the means and covariances of these indicators both within SNPs and across SNPs. Imputation can then be performed by computing the appropriate conditional distributions using the joint normal assumption, as in (2.4). See Appendix B for more details. In practice, we have found these two methods give similar average accuracy (results not shown), although this could be because our data conform well to Hardy-Weinberg equilibrium.
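A tiny sketch of the HWE-based estimates (2.16), using an assumed posterior mean and variance for an untyped SNP's frequency; the inputs are illustrative only.

```python
# Sketch of the HWE-based genotype-frequency estimates (2.16): given the posterior
# mean and variance of an untyped SNP's allele frequency, form expected genotype
# frequencies that account for the uncertainty in the frequency itself.
import numpy as np

def genotype_freqs(mean_y, var_y):
    f00 = (1.0 - mean_y) ** 2 + var_y      # E((1-y)^2) = (1 - E(y))^2 + Var(y)
    f11 = mean_y ** 2 + var_y              # E(y^2)     = E(y)^2 + Var(y)
    f01 = 1.0 - f00 - f11                  # heterozygote frequency by subtraction
    return np.array([f00, f01, f11])

print(genotype_freqs(0.35, 0.002))         # e.g. posterior mean 0.35, variance 0.002
```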

Individual-level genotype imputation. Although we developed the above model to tackle the imputation problem when individual genotypes are not available, it can also be applied to the problem of individual-level genotype imputation when individual-level data are available, by treating each individual as a pool of two haplotypes (application of these methods to small pool sizes is justified by the Remark above). For example, doubling (2.14) provides a natural estimate of the posterior mean genotype for an untyped SNP. For many applications this posterior mean genotype may suffice; see Guan and Stephens (2008) for the use of such posterior means in downstream association analyses. If an estimate that takes a value in {0, 1, 2} is desired, then a simple ad hoc procedure that we have found works well in practice is to round the posterior mean to the nearest integer. Alternatively, a full posterior distribution on the three possible genotypes can be computed by using the genotypic version of our approach (Appendix B).

Using unphased genotype panel data. Our method can be readily adapted to settings where the panel data are unphased. To do this, we note that the estimates (2.5, 2.6) for µ and Σ depend on the panel data only through the empirical mean and variance-covariance matrix of the panel haplotypes. When the panel data are unphased, we simply replace these with 0.5 times the empirical mean and variance-covariance matrix of the panel genotypes (since, assuming random mating, genotypes are expected to have twice the mean and twice the (co)variance of haplotypes); see Weir (1979) for related discussion.

Imputation without a panel. In some settings, it may be desired to impute missing genotypes in a sample where no individuals are typed at all SNPs (i.e. there is no panel M), and each individual is typed at a different subset of SNPs. For example, this may arise if many individuals are sequenced at low coverage, as in the currently-ongoing 1000 Genomes Project. In the absence of a panel we cannot directly obtain the mean and variance-covariance estimates µ̂ and Σ̂ as in (2.5) and (2.6). An alternative way to obtain these estimates is to treat each individual genotype vector as a random sample from a multivariate normal distribution N_p(µ̂, Σ̂), and apply the ECM algorithm [Meng and Rubin (1993)] to perform maximum likelihood estimation. However, this approach does not incorporate shrinkage. We therefore modify the algorithm in an ad hoc way to incorporate shrinkage in the conditional maximization step. See Appendix C for details.

3. Data application and results. We evaluate the methods described above by applying them to data from a subset of the WTCCC Birth Cohort, consisting of 1376 unrelated British individuals genotyped on the Affymetrix 500K platform [Wellcome Trust Case Control Consortium (2007)]. For demonstration purposes, we use only the 4329 SNPs from chromosome 22. We impute data at these SNPs using the 60 unrelated HapMap CEU parents [The International HapMap Consortium (2005)] as the panel. For the recombination parameters required in (2.7) we use the estimates distributed with the software package IMPUTE v1 [Marchini et al. (2007)], which were estimated from the same panel using the software package PHASE [Stephens and Scheet (2005)]. In our evaluations, we consider three types of application: frequency imputation using summary-level data, individual-level genotype imputation, and noise reduction in DNA pooling experiments. We examine both the accuracy of point estimates and the calibration of the credible intervals.

Frequency imputation using summary-level data. In this section, we evaluate the performance of (2.14) for imputing frequencies at untyped SNPs. The observed data consist of the marginal allele frequencies at each SNP, which we compute from the WTCCC individual-level genotype data.

To assess imputation accuracy, we perform the following cross-validation procedure: we mask the observed data at every 25th SNP, then treat the remaining SNPs as typed and use them to impute the frequencies of the masked SNPs, and compare the imputation results with the actual observed frequencies. We repeat this procedure 25 times by shifting the position of the first masked SNP. Because in this case the observed frequencies are obtained from high-quality individual-level genotype data, we assume the experimental error parameter ε^2 = 0.

To provide a basis for comparison, we also perform the same experiment using the software package IMPUTE v1 [Marchini et al. (2007)], which is among the most accurate of existing methods for this problem. IMPUTE requires individual-level genotype data, and outputs posterior genotype probabilities for each unmeasured genotype. We therefore input the individual-level genotype data to IMPUTE and estimate the allele frequency at each untyped SNP using the posterior expected frequency computed from the posterior genotype probabilities. Like our method, IMPUTE performs imputation using the conditional distribution from Li and Stephens [Li and Stephens (2003)]; however, it uses the full conditional distribution whereas our method uses an approximation based on the first two moments. Furthermore, IMPUTE uses individual-level genotype data. For both these reasons, we would expect IMPUTE to be more accurate than our method, and our aim is to assess how much we lose in accuracy by our approximation and by using summary-level data.

To assess the accuracy of estimated allele frequencies, we use the Root Mean Squared Error (RMSE),

(3.1)    RMSE = sqrt( (1/J) sum_{j=1}^{J} (y_j - ŷ_j)^2 ),

where J is the number of SNPs tested (4329) and y_j, ŷ_j are the observed and imputed allele frequencies for SNP j respectively. The RMSE from our method was very close to that from IMPUTE (Table 1). Thus, for these data, using only summary-level data sacrifices virtually nothing in accuracy of frequency imputation. Furthermore, we found that using an unphased panel (replacing the phased HapMap CEU haplotypes with the corresponding unphased genotypes) resulted in only a very small decrease in imputation accuracy. In all cases the methods are substantially more accurate than a naive method that simply estimates the sample frequency using the panel frequency (Table 1).
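A sketch of the masking scheme and the RMSE (3.1) follows. The `impute_freqs` argument is a placeholder for whatever frequency predictor is being evaluated (for BLIMP it would compute the conditional mean (2.14)); the uniform toy frequencies are purely illustrative.

```python
# Sketch of the cross-validation scheme: mask every 25th SNP, impute its frequency
# from the rest, repeat with 25 offsets, and summarize accuracy with the RMSE (3.1).
import numpy as np

def masked_cv_rmse(y_all, impute_freqs, stride=25):
    p = len(y_all)
    errs = []
    for offset in range(stride):
        untyped_idx = np.arange(offset, p, stride)
        typed_idx = np.setdiff1d(np.arange(p), untyped_idx)
        y_hat = impute_freqs(typed_idx, untyped_idx, y_all[typed_idx])
        errs.append(y_all[untyped_idx] - y_hat)
    errs = np.concatenate(errs)
    return np.sqrt(np.mean(errs ** 2))        # RMSE over all masked SNPs

# Example with a trivial "imputer" that predicts the mean of the typed frequencies.
rng = np.random.default_rng(0)
y_all = rng.uniform(0.05, 0.95, size=500)
naive = lambda t_idx, u_idx, y_t: np.full(len(u_idx), y_t.mean())
print(masked_cv_rmse(y_all, naive))
```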

We also investigated the calibration of the estimated variances of the imputed frequencies from (2.12). To do this, we constructed a Z-statistic for each test SNP j,

(3.2)    Z_j = ( y_j - E(y_j | y_t, M) ) / sqrt( Var(y_j | y_t, M) ),

where y_j is the true observed frequency, and the conditional mean and variance are as in (2.12). If the variances are well calibrated, the Z-scores should follow a standard normal distribution (with slight dependence among Z-scores of neighboring SNPs due to LD). Figure 2a shows that, indeed, the empirical distribution of Z-scores is close to standard normal (results are shown for the phased panel; results for the unphased panel are similar). Note that the over-dispersion parameter plays a crucial role in achieving this calibration. In particular, the Z-scores produced by the model without over-dispersion (2.2) do not follow a standard normal distribution, with many more observations in the tails (Figure 2b), indicating that the variance is under-estimated.

Fig 2. Comparison of variance estimation in models with and without over-dispersion (a: model allowing for over-dispersion; b: model without over-dispersion). The Z-scores are binned according to the standard normal percentiles; e.g. the first bin (0 to 0.05) contains Z-scores below the 0.05 quantile of the standard normal distribution. If the Z-scores are i.i.d. and strictly follow a standard normal distribution, we expect all the bins to have approximately equal heights.
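A sketch of the calibration check behind (3.2) and Figure 2: transform each Z-score to its standard-normal percentile and bin the percentiles; roughly equal bin counts indicate well-calibrated variances. The simulated inputs are our own.

```python
# Sketch of the Z-score calibration check (3.2) / Figure 2.
import numpy as np
from scipy.stats import norm

def calibration_bins(y_true, post_mean, post_var, n_bins=20):
    z = (y_true - post_mean) / np.sqrt(post_var)
    u = norm.cdf(z)                                        # standard-normal percentile of each Z
    counts, _ = np.histogram(u, bins=np.linspace(0.0, 1.0, n_bins + 1))
    return counts        # roughly equal counts indicate well-calibrated variances

# Simulated example in which the stated variances are correct, so bins come out flat.
rng = np.random.default_rng(1)
post_mean = rng.uniform(0.1, 0.9, size=2000)
post_var = rng.uniform(0.0005, 0.002, size=2000)
y_true = post_mean + rng.normal(0.0, np.sqrt(post_var))
print(calibration_bins(y_true, post_mean, post_var))
```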

Comparison with un-regularized linear frequency estimator. To assess the role of regularization, we compared the accuracy of BLIMP with simple un-regularized linear frequency estimators based on a small number of nearby predicting SNPs. (In fact, selecting a small number of SNPs can be viewed as a kind of regularization, but we refer to it as un-regularized for convenience.) The un-regularized linear estimator has the same form as in (2.4), but uses the un-regularized estimates f^panel and Σ^panel for µ and Σ. We consider two schemes to select predictors: the first scheme selects k flanking SNPs on either side of the target SNP (so 2k predictors in total); the second scheme selects the 2k SNPs with the highest marginal correlation with the target SNP (both selection schemes are sketched after Figure 3). Figure 3 shows the RMSE as the number of predicting SNPs increases from 0 to 50. We find that the best performance of the un-regularized methods is achieved by the first scheme, with a relatively large number of predicting SNPs (20-40); however, its RMSE is larger than that of IMPUTE and BLIMP.

Fig 3. Comparison between the BLIMP estimator and un-regularized linear estimators. The lines show the RMSE of each allele frequency estimator vs. the number of predicting SNPs. Results are shown for two schemes for selecting predicting SNPs: flanking SNPs (red line) and correlated SNPs (green line). Neither scheme is as accurate as BLIMP (blue solid line) or IMPUTE (blue dashed line).
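For concreteness, a sketch of the two predictor-selection schemes (flanking vs. most-correlated); the indices and the placeholder correlation matrix are invented for illustration.

```python
# Sketch of the two predictor-selection schemes used in the un-regularized comparison.
import numpy as np

def flanking_predictors(target, typed_idx, k):
    """k typed SNPs on each side of the target SNP (2k predictors in total)."""
    typed_idx = np.asarray(typed_idx)
    left = typed_idx[typed_idx < target][-k:]
    right = typed_idx[typed_idx > target][:k]
    return np.concatenate([left, right])

def most_correlated_predictors(target, typed_idx, corr_matrix, k):
    """The 2k typed SNPs with the highest marginal correlation with the target."""
    typed_idx = np.asarray(typed_idx)
    order = np.argsort(-np.abs(corr_matrix[target, typed_idx]))
    return typed_idx[order[:2 * k]]

# Toy usage: 10 SNPs, SNP 5 untyped, placeholder correlation matrix.
typed = [i for i in range(10) if i != 5]
corr = np.eye(10) * 0.5 + 0.5
print(flanking_predictors(5, typed, k=2))           # [3 4 6 7]
print(most_correlated_predictors(5, typed, corr, k=2))
```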

Individual-level genotype imputation. Although very satisfactory methods already exist for individual-level genotype imputation, BLIMP has the potential advantage of being extremely fast and low on memory usage (see computational comparisons below). We therefore also assessed its performance for individual-level genotype imputation. We used the same cross-validation procedure as in frequency imputation, but using individual-level data as input. As above, we compared results from our approach with those obtained using IMPUTE v1. We again use RMSE to measure the accuracy of imputed (posterior mean) genotypes:

(3.3)    RMSE = sqrt( (1/(mp)) sum_{j=1}^{p} sum_{i=1}^{m} (g_j^i - ĝ_j^i)^2 ),

where m is the number of individuals (1376), p is the total number of tested SNPs (4329) and g_j^i, ĝ_j^i are the observed and estimated (posterior mean) genotypes for individual i at SNP j respectively. For comparison purposes, we also use a different measure of accuracy that is commonly used in this setting: the genotype error rate, which is the number of wrongly imputed genotypes divided by the total number of imputed genotypes. To minimize the expected value of this metric, one should use the posterior mode genotype as the estimated genotype. Thus, for IMPUTE v1 we used the posterior mode genotype for assessments with this metric. However, for simplicity, for our approach we used the posterior mean genotype rounded to the nearest integer. (Obtaining posterior distributions on genotypes using our approach, as outlined in Appendix B, is considerably more complicated, and in fact produced slightly less accurate results; not shown.)

We found that, under either metric, BLIMP provides only very slightly less accurate genotype imputations than IMPUTE (Table 1). Further, as before, replacing the phased panel with an unphased panel produces only a small decrease in accuracy (Table 1). These results show average accuracy when all untyped SNPs are imputed. However, it has been observed previously (e.g. Marchini et al. (2007), Guan and Stephens (2008)) that accuracy of calls at the most confident SNPs tends to be considerably higher than the average. We checked that this is also true for BLIMP. To obtain estimates of the confidence of imputations at each SNP we first estimated σ by maximum likelihood using the summary data across all individuals, and then computed the variance for each SNP using (2.12); note that this variance does not depend on the individual, only on the SNP. We then considered performing imputation only at SNPs whose variance was less than some threshold, plotting the proportion of SNPs imputed ("call rate") against their average genotype error rate as this threshold varies. The resulting curve for BLIMP is almost identical to the corresponding curve for IMPUTE (Figure 4).
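A small sketch of the two accuracy metrics used here: the RMSE (3.3) over posterior mean genotypes, and the error rate of hard calls obtained by rounding each posterior mean to the nearest integer in {0, 1, 2}. The genotype arrays below are toy values.

```python
# Sketch of the genotype-imputation accuracy metrics: RMSE (3.3) and error rate.
import numpy as np

def genotype_accuracy(true_geno, mean_geno):
    rmse = np.sqrt(np.mean((true_geno - mean_geno) ** 2))
    hard_calls = np.clip(np.rint(mean_geno), 0, 2)     # round posterior means to {0,1,2}
    error_rate = np.mean(hard_calls != true_geno)
    return rmse, error_rate

true_geno = np.array([[0, 1, 2, 1], [1, 2, 0, 1]])                 # individuals x SNPs
mean_geno = np.array([[0.1, 1.2, 1.8, 0.9], [1.4, 1.7, 0.2, 1.0]]) # posterior mean genotypes
print(genotype_accuracy(true_geno, mean_geno))
```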

Frequency imputation             RMSE    Error rate
  naive method                           NA
  BLIMP (phased panel)                   NA
  BLIMP (unphased panel)                 NA
  IMPUTE                                 NA
Individual genotype imputation   RMSE    Error rate
  BLIMP (phased panel)
  BLIMP (unphased panel)
  IMPUTE

Table 1. Comparison of accuracy of BLIMP and IMPUTE for frequency and individual-level genotype imputation. The RMSE and error rate, defined in the text, provide different metrics for assessing accuracy; in all cases BLIMP was very slightly less accurate than IMPUTE. The naive method refers to the strategy of estimating the sample frequency of each untyped SNP by its observed frequency in the panel; this ignores information in the observed sample data, and provides a baseline level of accuracy against which the other methods can be compared.

Individual-level genotype imputation without a panel. We use the same WTCCC Birth Cohort data to assess our modified ECM algorithm for performing individual-level genotype imputation without using a panel. To create a data set with data missing at random we mask each genotype, independently, with probability m. We create multiple data sets by varying m from 5% to 50%. For each data set we run our ECM algorithm for 20 iterations. (Results using different starting points for the ECM algorithm were generally very consistent, and so results here are shown for a single starting point.) We compare the imputation accuracy with the software package BIMBAM [Guan and Stephens (2008)], which implements the algorithms from Scheet and Stephens (2005). BIMBAM requires the user to specify a number of clusters, and other parameters related to the EM algorithm it uses: after experimenting with different settings we applied BIMBAM to each data set assuming 20 clusters, with 10 different EM starting points, performing 20 iterations for each EM run. (These settings produced more accurate results than shorter runs.)

Overall, imputation accuracy of the two methods was similar (Table 2), with BLIMP being slightly more accurate for larger amounts of missing data and BIMBAM being slightly more accurate for smaller amounts of missing data.

Fig 4. Controlling individual-level genotype imputation error rate on a per-SNP basis (error rate vs. call rate, for BLIMP and IMPUTE). For BLIMP, the error rate is controlled by thresholding on the estimated variance for imputed SNP frequencies; for IMPUTE the call threshold is determined by the average maximum posterior probability. These two different measures of per-SNP imputation quality are strongly correlated.

We note that in this setting, some of the key computational advantages of our method are lost. In particular, when each individual is missing genotypes at different SNPs, one must effectively invert a different covariance matrix for each individual. Furthermore, this inversion has to be performed multiple times, due to the iterative scheme. For small amounts of missing data the results from BLIMP we present here took less time than the results for BIMBAM, but for larger amounts the run times are similar.

Missing rate    5%      10%     20%     50%
BIMBAM          5.79%   6.35%   7.15%   9.95%
BLIMP ECM       6.07%   6.49%   7.31%   9.91%

Table 2. Comparison of imputation error rates from BLIMP and BIMBAM for individual genotype imputation without a panel.

Noise reduction in pooled experiments. We used simulations to assess the potential for our approach to improve allele frequency estimates from noisy data in DNA pooling experiments (equation (2.13)). To generate noisy observed data we took the allele frequencies of the 4329 genotyped SNPs from the WTCCC Birth Cohort chromosome 22 data as true values, and added independent and identically distributed N(0, ε^2) noise terms to each true allele frequency. Real pooling data will have additional features not captured by these simple simulations (e.g. biases towards one of the alleles), but our aim here is simply to illustrate the potential for methods like ours to reduce noise in this type of setting. We varied ε to examine a range of different noise levels. Actual noise levels in pooling experiments will depend on technology and experimental protocol; to give a concrete example, Meaburn et al. (2006) found differences between allele frequency estimates from pooled genotyping and individual genotyping of the same individuals, at 26 SNPs, with mean difference 0.036.

We applied our method to the simulated data by first estimating σ and ε using (2.11), and then, conditional on these estimated parameters, estimating the allele frequency at each observed SNP using the posterior mean given in equation (2.13). We assessed the accuracy (RMSE) of these allele frequency estimates by comparing them with the known true values. We found that our method was able to reliably estimate the amount of noise present in the data: the estimated values for the error parameter ε show good correspondence with the standard deviation used to simulate the data (Figure 5a), although for high noise levels we underestimate the noise because some of the errors are absorbed by the parameter σ. More importantly, we found our allele frequency estimates were consistently more accurate than the direct (noisy) observations, with the improvement being greatest for higher noise levels (Figure 5b). For example, with ε = 0.05 our method reduced the RMSE by more than half.

Fig 5. a. Detection of experimental noise in simulated data (estimated ε vs. true ε). The simulated data sets are generated by adding Gaussian noise N(0, ε^2) to the actual observed WTCCC frequencies. The estimated ε values are plotted against the true ε values used for simulation; we estimate ε by maximum likelihood using (2.11). b. An illustration of the effect of noise reduction at various noise levels (RMSE of noise-reduced estimates vs. RMSE of direct observations). The noise-reduced frequency estimates are posterior means obtained from model (2.13).
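As a toy version of this simulation, the snippet below uses duplicated (perfectly correlated) SNPs so that the benefit of combining information across correlated, noisily measured SNPs is easy to see; averaging within each pair plays the role of the full estimator (2.13), and the roughly sqrt(2) gain in accuracy is specific to this contrived setup.

```python
# Toy pooling-noise simulation: add N(0, eps^2) noise to true frequencies and
# compare the RMSE of the raw observations with that of estimates that pool
# information across perfectly correlated SNP pairs.
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05
f = rng.uniform(0.05, 0.95, size=1000)
y_true = np.repeat(f, 2)                     # each SNP duplicated: pairs share a true frequency
y_obs = y_true + rng.normal(0.0, eps, size=y_true.size)

# "Denoise" by averaging the two measurements within each perfectly correlated pair.
pair_mean = y_obs.reshape(-1, 2).mean(axis=1)
y_denoised = np.repeat(pair_mean, 2)

rmse_obs = np.sqrt(np.mean((y_obs - y_true) ** 2))
rmse_denoised = np.sqrt(np.mean((y_denoised - y_true) ** 2))
print(rmse_obs, rmse_denoised)               # averaging cuts the error by about 1/sqrt(2)
```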

Computational efficiency. Imputation using our implementation of BLIMP is highly computationally efficient. The computational efficiency is especially notable when dealing with large panels: the panel is used only to estimate µ̂ and Σ̂, which is both quick and done just once, after which imputation computations do not depend on panel size. Our implementation also takes advantage of the sparsity of Σ̂ to increase running speed and reduce memory usage. To give a concrete indication of running speed, we applied BLIMP to the WTCCC Birth Cohort data on chromosome 22, containing 4,329 genotyped and 29,697 untyped HapMap SNPs on 1376 individuals, using a Linux system with eight-core Intel Xeon 2.66GHz processors (although only one processor is used by our implementation). The running time is measured by the real time reported by the Unix time command. For frequency imputation, BLIMP took 9 minutes and 34 seconds, with peak memory usage of 162 megabytes; for individual-level genotype imputation BLIMP took 25 minutes, using under 300 megabytes of memory. As a comparison, IMPUTE v1 took 195 minutes for individual-level genotype imputation, with memory usage exceeding 5.1 gigabytes. Since these comparisons were done, we note that a new version of IMPUTE (v2) has been released [Howie et al. (2009)]. This new version gives similar imputation accuracy in the settings we described above; it runs more slowly than v1 but requires less memory.

4. Conclusion and discussion. Imputation has recently emerged as an important and powerful tool in genetic association studies. In this paper, we propose a set of statistical tools that help solve the following problems:

1. Imputation of allele frequencies at untyped SNPs when only summary-level data are available at typed SNPs.
2. Noise reduction for estimating allele frequencies from DNA pooling-based experiments.
3. Fast and accurate individual-level genotype imputation.

The proposed methods are simple, yet statistically elegant, and computationally extremely efficient. For individual-level genotype imputation the imputed genotypes from this approach are only very slightly less accurate than state-of-the-art methods. When only summary-level data are available we found that imputed allele frequencies were almost as accurate as when using full individual genotype data.

The linear predictor approach to imputation requires only an estimate of the mean and the covariance matrix among SNPs. Our approach to obtaining these estimates is based on the conditional distribution from Li and Stephens (2003); however, it would certainly be possible to consider other estimates, and specifically to use other approaches to shrink the off-diagonal terms in the covariance matrix. An alternative, closely-related, approach is to obtain a linear predictor directly by training a linear regression to predict each SNP, using the panel as a training set, and employing some kind of regularization scheme to solve potential problems with over-fitting caused by "large p, small n". This approach has been used in the context of individual-level genotype imputation by A. Clark (personal communication), and Yu and Schaid (2007). However, the choice of appropriate regularization is not necessarily straightforward, and different regularization schemes can provide different results [Yu and Schaid (2007)]. Our approach of regularizing the covariance matrix using the conditional distribution from Li and Stephens (2003) has the appeal that this conditional distribution has already been shown to be very effective for individual genotype imputation, and for modeling patterns of correlation among SNPs more generally [Li and Stephens (2003), Stephens and Scheet (2005), Servin and Stephens (2008), Marchini et al. (2007)]. Furthermore, the fact that, empirically, BLIMP's accuracy is almost as good as the best available purpose-built methods for this problem suggests that alternative approaches to regularization are unlikely to yield considerable improvements in accuracy.

The accuracy with which linear combinations of typed SNPs can predict untyped SNPs is perhaps somewhat surprising. That said, theoretical arguments for the use of linear combinations have been given in previous work. For example, Clayton et al. (2004) showed by example that, when SNP data are consistent with no recombination (as might be the case for markers very close together on the genome), each SNP can be written as a linear regression on the other SNPs.

Conversely, it is easy to construct hypothetical examples where linear predictors would fail badly. For example, consider the following example from Nicolae (2006a): 3 SNPs form 4 haplotypes, 111, 001, 100 and 010, each at frequency 0.25 in a population. Here the correlation between every pair of SNPs is 0, but knowing any 2 SNPs is sufficient to predict the third SNP precisely. Linear predictors cannot capture this higher-order interaction information, so produce sub-optimal imputations in this situation. In contrast, other methods (including IMPUTE) could use the higher-order information to produce perfect imputations. The fact that, empirically, the linear predictor works well suggests that this kind of situation is rare in real human population genotype data. Indeed, this is not so surprising when one considers that, from population genetics theory, SNPs tend to be uncorrelated only when there is a sufficiently high recombination rate between them, and recombination will tend to break down any higher-order correlations as well as pairwise correlations.
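A quick numerical check of the Nicolae (2006a) example just described: with the four haplotypes at equal frequency, all pairwise correlations vanish, yet the third SNP is a deterministic (non-linear) function of the other two.

```python
# Numerical check of the 3-SNP example from Nicolae (2006a): haplotypes 111, 001,
# 100 and 010 at frequency 0.25 each give zero pairwise correlation between SNPs,
# yet any two SNPs determine the third exactly.
import numpy as np

haps = np.array([[1, 1, 1],
                 [0, 0, 1],
                 [1, 0, 0],
                 [0, 1, 0]], dtype=float)
corr = np.corrcoef(haps, rowvar=False)
print(np.round(corr, 10))                    # all off-diagonal entries are 0
# SNP 3 equals 1 exactly when SNPs 1 and 2 agree (an XNOR), so a predictor built
# only from pairwise covariances cannot recover it, while a haplotype-based method can.
print(np.all(haps[:, 2] == (haps[:, 0] == haps[:, 1]).astype(float)))   # True
```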

Besides HMM-based methods, another type of approach to genotype imputation that has been proposed is to use multi-marker tagging [de Bakker et al. (2005), Purcell et al. (2007), Nicolae (2006b)]. A common feature of these methods is to pre-select a (relatively small) subset of tagging SNPs or haplotypes from all typed SNPs based on some LD measure threshold, and then use a possibly non-linear approach to predicting untyped SNPs from this subset. Thus, compared with our approach, these methods generally use a more complex prediction method based on a smaller number of SNPs. Although we have not compared directly with these methods here, published comparisons [Howie et al. (2009)] suggest that they are generally noticeably less accurate than HMM-based methods like IMPUTE, and thus by implication less accurate than BLIMP. That is, it seems from these results that, in terms of average accuracy, it is more important to make effective use of low-order correlations from all available SNPs that are correlated with the target untyped SNP than to take account of unusual higher-order correlations that may occasionally exist.

Our focus here has been on the accuracy with which untyped SNP allele frequencies can be imputed. In practice an important application of these imputation methods is to test untyped alleles for association with an outcome variable (e.g. case-control status). Because our allele frequency predictors are linear combinations of typed SNP frequencies, each test of an untyped SNP is essentially a test for differences between a given linear combination of typed SNPs in case and control groups. Several approaches to this are possible; for example, the approach in Nicolae (2006b) could be readily applied in this setting. The resulting test would be similar to the test suggested in Homer et al. (2008a), which also uses a linear combination of allele frequencies at typed SNPs to test untyped SNPs for association with case-control status in a pooling context. The main difference is that their proposed linear combinations are ad hoc, rather than being chosen to be the best linear imputations; as such we expect that appropriate use of our linear imputations should result in more powerful tests, although a demonstration of this lies outside the scope of this paper.

Acknowledgments. We thank Yongtao Guan and Bryan Howie for helpful discussions.

APPENDIX A: LEARNING FROM THE PANEL USING THE LI AND STEPHENS MODEL

In this section, we show the calculation of µ̂ and Σ̂ using phased population panel data. Following the Li and Stephens model, we assume that there are K template haplotypes in the panel. A new haplotype h sampled from the same population can be modeled as an imperfect mosaic of the existing template haplotypes in the panel. Let e_j denote the jth unit vector in K dimensions (1 in the jth coordinate, 0s elsewhere); we define the random vector Z_t to be e_j if haplotype h at locus t copies from the jth template haplotype. The model assumes Z_1, ..., Z_p form a Markov chain on the state space {e_1, ..., e_K} with transition probabilities

(A.1)    Pr(Z_t = e_m | Z_{t-1} = e_n, M) = (1 - r_t) 1_[e_m = e_n] + r_t e_m' α,

where α = (1/K) 1 and r_t = 1 - exp(-ρ_t / K) is a parameter that controls the probability that h switches copying template at locus t. The initial-state distribution of the Markov chain is given by

(A.2)    π(Z_1 = e_k | M) = 1/K    for k = 1, ..., K.

It is easy to check that the initial distribution π is also the stationary distribution of the described Markov chain. Because the chain is initiated at the stationary state, it follows that, conditional on M,

(A.3)    Z_1 =_d Z_2 =_d ... =_d Z_p =_d π.

Therefore, marginally, the means and variances of the Z_t's have the following simple forms:

(A.4)    E(Z_1 | M) = ... = E(Z_p | M) = α,

(A.5)    Var(Z_1 | M) = ... = Var(Z_p | M) = diag(α) - α α'.

Let the K-dimensional vector q_t^panel denote the binary allelic states of the panel haplotypes at locus t, and let the scalar parameter θ represent mutation. The emission distribution in the Li and Stephens model is given by

(A.6)    Pr(h_t = 1 | Z_t = e_k, M) = (1 - θ) e_k' q_t^panel + θ/2,

that is, with probability 1 - θ, h perfectly copies the k-th template in the panel at locus t, while with probability θ a mutation occurs and h_t mutates to allele 0 or 1 equally likely. If we define p_t = (1 - θ) q_t^panel + (θ/2) 1, then the emission distribution can be written as

(A.7)    Pr(h_t = 1 | Z_t, M) = E(h_t | Z_t, M) = p_t' Z_t.

The goal here is to find closed-form representations of the first two moments of the joint distribution of (h_1, h_2, ..., h_p) given the observed template panel M. For the marginal mean and variance of h_t, it follows that

(A.8)    E(h_t | M) = E( E(h_t | Z_t, M) | M ) = p_t' E(Z_t | M) = (1 - θ) f_t^panel + θ/2,

(A.9)    Var(h_t | M) = (1 - θ)^2 f_t^panel (1 - f_t^panel) + (θ/2)(1 - θ/2),

where f_t^panel = q_t' α is the observed allele frequency at locus t in the panel M. Finally, to compute Cov(h_s, h_t) for loci s < t, we notice that, conditional on Z_s and M, h_s and h_t are independent and

(A.10)    E(h_s h_t | M) = E( E(h_s h_t | Z_s, M) | M ) = E( E(h_s | Z_s, M) E(h_t | Z_s, M) | M ).

Let r_st denote the switching probability between s and t; then E(h_t | Z_s, M) can be calculated from

(A.11)    E(h_t | Z_s, M) = E( E(h_t | Z_t, Z_s, M) | Z_s, M ) = E(Z_t' p_t | Z_s, M) = ( (1 - r_st) Z_s + r_st α )' p_t.

Therefore,

(A.12)    E(h_s h_t | M) = p_s' E( Z_s ( (1 - r_st) Z_s + r_st α )' | M ) p_t
                         = (1 - r_st) p_s' Var(Z_s) p_t + p_s' α α' p_t
                         = (1 - r_st) p_s' ( diag(α) - α α' ) p_t + E(h_s | M) E(h_t | M).

It follows that

(A.13)    Cov(h_s, h_t | M) = (1 - θ)^2 (1 - r_st) ( f_st^panel - f_s^panel f_t^panel ),

where f_st^panel is the panel frequency of the two-locus haplotype 1-1 at loci s and t.
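These closed forms can be sanity-checked by simulation. The sketch below simulates haplotypes from the copying process just described (with a toy panel, toy switch probabilities and θ) and compares the empirical mean and covariance of the draws with (A.8), (A.9) and (A.13); all names and values are illustrative.

```python
# Monte Carlo check of the moment formulas (A.8), (A.9) and (A.13) under the
# simple mosaic/copying process: per-locus switch probability r_t, mutation theta.
import numpy as np

rng = np.random.default_rng(0)
panel = np.array([[0, 1, 1, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 1]], dtype=float)   # K = 4 template haplotypes, p = 4 SNPs
K, p = panel.shape
r = np.array([0.0, 0.3, 0.1, 0.5])              # per-locus switch probabilities r_t (r_1 unused)
theta = 0.05

def sample_haplotype():
    z = rng.integers(K)                         # start from the stationary (uniform) distribution
    h = np.empty(p)
    for t in range(p):
        if t > 0 and rng.random() < r[t]:
            z = rng.integers(K)                 # switch to a uniformly chosen template
        allele = panel[z, t]
        if rng.random() < theta:
            allele = float(rng.integers(2))     # mutation: allele becomes 0 or 1 equally likely
        h[t] = allele
    return h

draws = np.array([sample_haplotype() for _ in range(100000)])

# Closed-form moments from (A.8), (A.9) and (A.13).
f = panel.mean(axis=0)
mu_hat = (1 - theta) * f + theta / 2
Sigma_hat = np.empty((p, p))
for s in range(p):
    for t in range(p):
        if s == t:
            Sigma_hat[s, t] = (1 - theta) ** 2 * f[s] * (1 - f[s]) + (theta / 2) * (1 - theta / 2)
        else:
            lo, hi = min(s, t), max(s, t)
            one_minus_rst = np.prod(1 - r[lo + 1:hi + 1])   # 1 - r_{st}: no switch between s and t
            f_st = np.mean(panel[:, s] * panel[:, t])       # panel frequency of the 1-1 haplotype
            Sigma_hat[s, t] = (1 - theta) ** 2 * one_minus_rst * (f_st - f[s] * f[t])

print(np.abs(draws.mean(axis=0) - mu_hat).max())                          # close to 0
print(np.abs(np.cov(draws, rowvar=False, bias=True) - Sigma_hat).max())   # close to 0
```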

In conclusion, under the Li and Stephens model, the distribution h | M has expectation

(A.14)    µ̂ = E(h | M) = (1 - θ) f^panel + (θ/2) 1,

where f^panel is the p-vector of observed frequencies of all p SNPs in the panel, and variance

(A.15)    Σ̂ = Var(h | M) = (1 - θ)^2 S + (θ/2)(1 - θ/2) I,

where the matrix S has the structure

(A.16)    S_ij = f_i^panel (1 - f_i^panel)                        if i = j,
          S_ij = (1 - r_ij) ( f_ij^panel - f_i^panel f_j^panel )  if i ≠ j.

APPENDIX B: DERIVATION OF THE JOINT GENOTYPE FREQUENCY DISTRIBUTION

In this section, we derive the joint genotype frequency distribution based on the Li and Stephens model. Let g_it denote the genotype of individual i at locus t. The sample frequency of genotype 0 at locus t is given by

(B.1)    Pr(g_t = 0) = p_0^{g_t} = (1/n) sum_{i=1}^{n} 1_[g_it = 0].

Similarly, the genotype frequencies p_1^{g_t} = Pr(g_t = 1) and p_2^{g_t} = Pr(g_t = 2) can be obtained by averaging the indicators 1_[g_it = 1] and 1_[g_it = 2] over the samples respectively. Because of the restriction

(B.2)    1_[g_it = 0] + 1_[g_it = 1] + 1_[g_it = 2] = 1,

given any two of the three indicators, the third one is uniquely determined. Let g_i denote the 2p-vector (1_[g_i1 = 0], 1_[g_i1 = 2], ..., 1_[g_ip = 0], 1_[g_ip = 2])' and

(B.3)    y^g = (p_0^{g_1}, p_2^{g_1}, ..., p_0^{g_p}, p_2^{g_p})' = (1/n) sum_{i=1}^{n} g_i.


More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans University of Bristol This Session Identity by Descent (IBD) vs Identity by state (IBS) Why is IBD important? Calculating IBD probabilities Lander-Green Algorithm

More information

Dimension Reduction. David M. Blei. April 23, 2012

Dimension Reduction. David M. Blei. April 23, 2012 Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do

More information

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation Instructor: Arindam Banerjee November 26, 2007 Genetic Polymorphism Single nucleotide polymorphism (SNP) Genetic Polymorphism

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle   holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/35195 holds various files of this Leiden University dissertation Author: Balliu, Brunilda Title: Statistical methods for genetic association studies with

More information

p(d g A,g B )p(g B ), g B

p(d g A,g B )p(g B ), g B Supplementary Note Marginal effects for two-locus models Here we derive the marginal effect size of the three models given in Figure 1 of the main text. For each model we assume the two loci (A and B)

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics 1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Case-Control Association Testing. Case-Control Association Testing

Case-Control Association Testing. Case-Control Association Testing Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits. Technological advances have made it feasible to perform case-control association studies

More information

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Lee H. Dicker Rutgers University and Amazon, NYC Based on joint work with Ruijun Ma (Rutgers),

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014 Homework 4 (version 3) - posted October 3 Assigned October 2; Due 11:59PM October 9 Problem 1 (Easy) a. For the genetic regression model: Y

More information

2 Naïve Methods. 2.1 Complete or available case analysis

2 Naïve Methods. 2.1 Complete or available case analysis 2 Naïve Methods Before discussing methods for taking account of missingness when the missingness pattern can be assumed to be MAR in the next three chapters, we review some simple methods for handling

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA JAPANESE BEETLE DATA 6 MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA Gauge Plots TuscaroraLisa Central Madsen Fairways, 996 January 9, 7 Grubs Adult Activity Grub Counts 6 8 Organic Matter

More information

Bayesian construction of perceptrons to predict phenotypes from 584K SNP data.

Bayesian construction of perceptrons to predict phenotypes from 584K SNP data. Bayesian construction of perceptrons to predict phenotypes from 584K SNP data. Luc Janss, Bert Kappen Radboud University Nijmegen Medical Centre Donders Institute for Neuroscience Introduction Genetic

More information

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015 Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2015 1 / 1 Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits.

More information

MIXED MODELS THE GENERAL MIXED MODEL

MIXED MODELS THE GENERAL MIXED MODEL MIXED MODELS This chapter introduces best linear unbiased prediction (BLUP), a general method for predicting random effects, while Chapter 27 is concerned with the estimation of variances by restricted

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University kbroman@jhsph.edu www.biostat.jhsph.edu/ kbroman Outline Experiments and data Models ANOVA

More information

Basics of Modern Missing Data Analysis

Basics of Modern Missing Data Analysis Basics of Modern Missing Data Analysis Kyle M. Lang Center for Research Methods and Data Analysis University of Kansas March 8, 2013 Topics to be Covered An introduction to the missing data problem Missing

More information

Detecting selection from differentiation between populations: the FLK and hapflk approach.

Detecting selection from differentiation between populations: the FLK and hapflk approach. Detecting selection from differentiation between populations: the FLK and hapflk approach. Bertrand Servin bservin@toulouse.inra.fr Maria-Ines Fariello, Simon Boitard, Claude Chevalet, Magali SanCristobal,

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology Massachusetts Institute of Technology 6.867 Machine Learning, Fall 2006 Problem Set 5 Due Date: Thursday, Nov 30, 12:00 noon You may submit your solutions in class or in the box. 1. Wilhelm and Klaus are

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia Expression QTLs and Mapping of Complex Trait Loci Paul Schliekelman Statistics Department University of Georgia Definitions: Genes, Loci and Alleles A gene codes for a protein. Proteins due everything.

More information

(Genome-wide) association analysis

(Genome-wide) association analysis (Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by

More information

The STS Surgeon Composite Technical Appendix

The STS Surgeon Composite Technical Appendix The STS Surgeon Composite Technical Appendix Overview Surgeon-specific risk-adjusted operative operative mortality and major complication rates were estimated using a bivariate random-effects logistic

More information

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

Statistical Methods. Missing Data  snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23 1 / 23 Statistical Methods Missing Data http://www.stats.ox.ac.uk/ snijders/sm.htm Tom A.B. Snijders University of Oxford November, 2011 2 / 23 Literature: Joseph L. Schafer and John W. Graham, Missing

More information

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles John Novembre and Montgomery Slatkin Supplementary Methods To

More information

Quantile POD for Hit-Miss Data

Quantile POD for Hit-Miss Data Quantile POD for Hit-Miss Data Yew-Meng Koh a and William Q. Meeker a a Center for Nondestructive Evaluation, Department of Statistics, Iowa State niversity, Ames, Iowa 50010 Abstract. Probability of detection

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Learning gene regulatory networks Statistical methods for haplotype inference Part I

Learning gene regulatory networks Statistical methods for haplotype inference Part I Learning gene regulatory networks Statistical methods for haplotype inference Part I Input: Measurement of mrn levels of all genes from microarray or rna sequencing Samples (e.g. 200 patients with lung

More information

Introduction to Linkage Disequilibrium

Introduction to Linkage Disequilibrium Introduction to September 10, 2014 Suppose we have two genes on a single chromosome gene A and gene B such that each gene has only two alleles Aalleles : A 1 and A 2 Balleles : B 1 and B 2 Suppose we have

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

An Integrated Approach for the Assessment of Chromosomal Abnormalities

An Integrated Approach for the Assessment of Chromosomal Abnormalities An Integrated Approach for the Assessment of Chromosomal Abnormalities Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 6, 2007 Karyotypes Mitosis and Meiosis Meiosis Meiosis

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Investigation into the use of confidence indicators with calibration

Investigation into the use of confidence indicators with calibration WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Feature Selection via Block-Regularized Regression

Feature Selection via Block-Regularized Regression Feature Selection via Block-Regularized Regression Seyoung Kim School of Computer Science Carnegie Mellon University Pittsburgh, PA 3 Eric Xing School of Computer Science Carnegie Mellon University Pittsburgh,

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

New imputation strategies optimized for crop plants: FILLIN (Fast, Inbred Line Library ImputatioN) FSFHap (Full Sib Family Haplotype)

New imputation strategies optimized for crop plants: FILLIN (Fast, Inbred Line Library ImputatioN) FSFHap (Full Sib Family Haplotype) New imputation strategies optimized for crop plants: FILLIN (Fast, Inbred Line Library ImputatioN) FSFHap (Full Sib Family Haplotype) Kelly Swarts PAG Allele Mining 1/11/2014 Imputation is the projection

More information

Conditional Least Squares and Copulae in Claims Reserving for a Single Line of Business

Conditional Least Squares and Copulae in Claims Reserving for a Single Line of Business Conditional Least Squares and Copulae in Claims Reserving for a Single Line of Business Michal Pešta Charles University in Prague Faculty of Mathematics and Physics Ostap Okhrin Dresden University of Technology

More information

Estimation of Parameters in Random. Effect Models with Incidence Matrix. Uncertainty

Estimation of Parameters in Random. Effect Models with Incidence Matrix. Uncertainty Estimation of Parameters in Random Effect Models with Incidence Matrix Uncertainty Xia Shen 1,2 and Lars Rönnegård 2,3 1 The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden; 2 School

More information

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q) Supplementary information S7 Testing for association at imputed SPs puted SPs Score tests A Score Test needs calculations of the observed data score and information matrix only under the null hypothesis,

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

BTRY 4830/6830: Quantitative Genomics and Genetics

BTRY 4830/6830: Quantitative Genomics and Genetics BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55 Announcements

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies Ian Barnett, Rajarshi Mukherjee & Xihong Lin Harvard University ibarnett@hsph.harvard.edu June 24, 2014 Ian Barnett

More information

Backward Genotype-Trait Association. in Case-Control Designs

Backward Genotype-Trait Association. in Case-Control Designs Backward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs Tian Zheng, Hui Wang and Shaw-Hwa Lo Department of Statistics, Columbia University, New York, New York,

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

Probabilistic Models of Past Climate Change

Probabilistic Models of Past Climate Change Probabilistic Models of Past Climate Change Julien Emile-Geay USC College, Department of Earth Sciences SIAM International Conference on Data Mining Anaheim, California April 27, 2012 Probabilistic Models

More information

Use of hidden Markov models for QTL mapping

Use of hidden Markov models for QTL mapping Use of hidden Markov models for QTL mapping Karl W Broman Department of Biostatistics, Johns Hopkins University December 5, 2006 An important aspect of the QTL mapping problem is the treatment of missing

More information

Gene mapping in model organisms

Gene mapping in model organisms Gene mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Goal Identify genes that contribute to common human diseases. 2

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Stochastic Processes

Stochastic Processes qmc082.tex. Version of 30 September 2010. Lecture Notes on Quantum Mechanics No. 8 R. B. Griffiths References: Stochastic Processes CQT = R. B. Griffiths, Consistent Quantum Theory (Cambridge, 2002) DeGroot

More information