Genetic Association Studies in the Presence of Population Structure and Admixture

Size: px
Start display at page:

Download "Genetic Association Studies in the Presence of Population Structure and Admixture"

Transcription

1 Genetic Association Studies in the Presence of Population Structure and Admixture Purushottam W. Laud and Nicholas M. Pajewski Division of Biostatistics Department of Population Health Medical College of Wisconsin A presentation in the BNP Workshop, INI, Cambridge, August 7, 2007

2 Outline Some genetics background Impact of structure on association studies Remedies A DPM model Clustering success Estimation success A model for admixture Detection of admixed individuals

3 Some Genetics Terminology Chromosome 1 A d GAATTAGTCAAGC Chromosome 2 a D Chromosome Locus / Polymorphism Allele Genotype Two alleles A and a yield three potential genotypes at each locus: AA, aa/aa, or aa Phenotype

4 Propagation, Inheritance and Diversity In a pair of homologous chromosomes, one is inherited from the mother and the other from the father Recombination can result in alleles that differ from those of parents Mechanisms leading to variability in the frequency of genetic variants in human populations Positive Selection Ancestral Admixture Genetic Drift Allele frequency assumptions Hardy-Weinberg Equilibrium - Chromosome independence Linkage Disequilibrium - Locus to locus independence

5 Wright s Fixation Index Wright s F ST - Measures the divergence between populations relative to the overall level of genetic diversity within a species (Wright 1951) Simulation mechanism {Balding & Nichols (1995)} : Let p denote ancestral allele frequency at a locus. This could be drawn from, say, Uniform(0.10,0.90) or fixed Then, π - the allele frequency at this locus is simulated as π Beta (rp, r(1 p)), r = (1 F ST ) F ST E(π) = p, V ar(π) = p(1 p)f ST

6 Common Study Designs Main goal is to determine if a phenotype is associated with a genotype at some locus or a combination of loci Phenotype of interest can be quantitative or binary Sampling design can be cross-sectional or case-control Disease presence/absence, sample of cases and sample of non-cases leads to 2 2 table with column totals fixed Generally, traditional analyses use linear models and contingency tables

7 Population Structure and Impact on Association If sampling from a homogeneous population with stable allele frequencies at loci of interest, traditional methods can appropriately determine association Variation in allele frequencies in human subpopulations can mislead these analyses A disease with higher prevalence in one subpopulation than the other will look spuriously associated with alleles that have higher prevalence in the first Attenuation of a genetic effect is also possible - e.g., high prevalence in a small subpopulation

8 Impact of Structure on Association Genetics Literature Considerable debate in the genetics community about the potential for spurious associations caused by population allele frequency differences (Wacholder et.al. 2000; Cardon and Palmer 2003; Freedmen et.al. 2004) Bidirectional effect - Structure can also mask or reverse genetic effects (Deng 2001) Examples Knowler et.al. (1988) - Type 2 diabetes in American Indians Campbell et.al. (2005) - Lactase (LCT) gene and tall/short status in European Americans

9 Remedies Transmission Disequilibrium Test - Spielman et al. (1993) Genomic Control - Devlin & Roeder (1999) Modeled Classification - Pritchard et al. (2000) Structural Association - Satten et al. (2001)

10 Modeled Classification Two-Stage Approach of Pritchard et.al. Assume we have a sample of N individuals, consisting of n 1 cases and n 0 controls Each subject is genotyped at L null marker loci and the candidate locus Assume each subject originates from one of K distinct subpopulations, where K is fixed. Notation Population pointer i = 1,..., N z i Genotype X (c) il l = 1,..., (L + 1) Genotype frequencies p klj j = 1, 2, 3 Admixture proportions q (i) k Z X P Q

11 Modeled Classification STRUCTURE Intermediate Goal: Inference on P and Z/Q based on the genotypic data X Pritchard et al construct a Markov Chain Monte Carlo (MCMC) algorithm to approximate the posterior distribution P r (P, Z/Q X) P r (Z/Q) P r (P ) P r (X P, Z/Q) Given Z and P, Pritchard et al then test H0 : No association between candidate allele frequencies and phenotype φ H1 : Candidate locus allele frequencies depend on phenotype φ

12 Modeled Classification Likelihood Ratio Statistic: STRAT Define Λ = P r 1 (X; ˆP ) 1, Q ) P r 0 (X; ˆP 0, Q where [ ] P r (i) 0 X (c) i1 = j P 0, φ = k [ ] P r (i) 1 X (c) i1 = j P 1, φ = k q (i) k p k1j q (i) k p[φ(i) ] k1j The maximized likelihoods under each hypothesis are estimated using the EM algorithm Significance levels for Λ are approximated through simulation

13 Modeled Classification Extensions and Other Approaches Falush et.al. (2003) - Extended Pritchard et.al. s original work to allow for linkage disequilibrium and correlated allele frequencies Zhang et.al. (2006) - Parametric Bayesian approach that incorporates the inherent variability in estimating K through Model Averaging Other methods include Frequentist approaches such as Satten et.al. (2001), Epstein et.al. (2007), and Price et.al (2006) and a Bayesian approach for quantitative traits by Hoggart et.al. (2003)

14 Model : Quantitative Trait Part Given a sample of N individuals, genotyped at L null marker loci and C candidates, Y i Normal (µ i, τ P ) µ i = β 0i + G i + γz i C G i = (β il W il + β 2l V il ) l=1 V and W are indicators of AA and aa respectively

15 Model : Genetic Marker Part Let θ il = logit(π il ), L(W i, V i ) = L+C l=1 [ 2 W il ] exp(θil (2V il + W il )) (1 + exp(θ il )) 2 β 0i, θ i1,..., θ i,l+c G G G α, G 0 Dirichlet Process (αg 0 ) α Gamma (α 0, λ 0 ) G 0 = L+C f(β 0 ) f(θ il ) l=1

16 Model Graph l=1,..,c G θ il θ il β 0i (W il, V il ) (W il, V il ) β 1l β 2l l=c+1,..,l τ P Y i i=1,...,n

17 Simulation of Posterior Methods for DPM Conjugate Setting Escobar (1994), Escobar and West (1995), Bush and MacEachern (1996) Split-Merge Samplers: Jain and Neal (2004), Green and Richardson (2001) Non-conjugate Setting No-Gaps algorithm: MacEachern and Muller (1998) Neal (2000), Walker and Damien (1998) Jain and Neal (2007) extended their Split-Merge algorithm to a conditionally conjugate setting

18 Simulation of Posterior Algorithm 5 of Neal (2000) Given K current clusters of distinct values 1. Update configuration indicator for each individual (s i ) by proposing a new cluster membership (s i ) R times P (s i = j s i ) = n i,j /(N + α 1) for j=1,..,k P (s i = K + 1 s i ) = α/(n + α 1) 2. Use Metropolis-Hastings update for proposed move [ L(Yi, W a(s i, s i, V i φ s i i ) = min ) ] L(Y i, W i, V i φ si ), 1 (1) 3. Update other parameters given current configuration. Combination of conjugate updates and Adaptive Rejection Sampling (ARS) (Gilks and Wild 1992)

19 Simulation of Posterior MCMC Scheme Given j=1,2,..,k distinct clusters in the DP, let φ j = (β 0j, θ j1,..., θ jl ), Update configuration using Neal s Algorithm, sampling from [s i (s i, φ)] for i=1,...,n For j=1,2,..,k, Sample [β 0j X j, β 11,.., β 1C, β 21,.., β 2C, τ P ] where X j = {Y i, W i, V i s i = j} (conjugate normal update) For l=1,...,l, update cluster specific allele frequencies, sampling from [θ jl W j, V j ] where (W j, V j ) denote genotype data for subjects in j th cluster (ARS)

20 Simulation of Posterior MCMC Scheme (continued) Update genetic effects at candidate loci, for l=1,2,..,c Sample [ β 1l Y, W, β 01,.., β 0K, β (1 l), β 21,.., β 2C, τ P ] Sample [ β 2l Y, V, β 01,.., β 0K, β (2 l), β 11,.., β 1C ] where β (r l) = β (r1),.., β (r,l 1), β (r,l+1),.., β (rc) (both are conjugate normal updates) Update phenotypic precision Sample [τ P Y, W, V, β 01,.., β 0K, β 11,.., β 1C, β 21,.., β 2C ] (conjugate gamma update)

21 A Two-Population Scenario Sample Size: 400 subjects with 500 markers Pop. 1 (50%) with F ST = 0.01 Pop. 2 (50%) with F ST = Simulated data characteristics The biggest allele frequency difference,δ l = π 1l π 2l, occurs at locus 254, where δ 254 = =.355 The smallest difference occurs at locus 223, δ 223 = =.000

22 Prior for K n=400 and α=1

23 Posterior Distribution for K Two populations scenario: n=400, L=500 and α=1

24 Predictive Density for π 254 π P OP = and π P OP = 0.358

25 Predictive Density for π 223 π P OP = and π P OP = 0.820

26 Predictive Density for π 109 π P OP = and π P OP = 0.598

27 A Four-Population High Heterogeneity Scenario Sample Size: 800 subjects at 1000 markers Population % of sample F (j) ST

28 Prior for K n=800 and α=1

29 Posterior for K 4 Populations - High Heterogeneity

30 Predictive Density for π 43 (20%) π (1) = 0.07 (40%) π (2) = 0.08 (30%) π (3) = 0.37 (10%) π (4) = 0.24

31 Predictive Density for π 47 (20%) π (1) = 0.34 (40%) π (2) = 0.39 (30%) π (3) = 0.42 (10%) π (4) = 0.34

32 A Four-Population Low Heterogeneity Scenario Sample Size: 800 subjects at 1000 markers Population % of sample F (j) ST

33 Posterior for K 4 Populations - Low Heterogeneity

34 Quantitative Trait Model Simulation Three scenarios where individuals originate from two ancestral populations Baseline trait mean differs between populations, with β 01 = and β 02 = Subjects genotyped at 750 markers including 5 candidates for association with the phenotype (generically denoted as loci 1-5) Scenarios I. n=700; Pop. 1 (50%) and Pop. 2 (50%) II. n=1000; Pop. 1 (50%) and Pop. 2 (50%) III. Same as I plus 300 admixed individuals

35 Simulation Setup Locus β 1l β 2l A l D l Y i Normal (µ i, τ P ) where 5 µ i = β 0i + (β 1l W il + β 2l V il ) τ P = 0.01 l=1 A l and D l = Additive / Dominance effects Falconer & MacKay (1996)

36 Generation of Admixed Ancestry Individuals Assume j=1,..,k subpopulations with associated allele frequencies π jl at each locus Let q i denote an admixture vector where q ij denotes the proportion of the i th admixed individual s genome that originates from the j th subpopulation. q i Dirichlet(α q,..., α q ) where α q was set at 1.0 Then the genotype at the l th locus is independently simulated as Binomial(2, K q ij π jl ) j=1

37 Naive Analysis Effect True ˆβ 95% CI* χ 2 2,0.99 = 9.28 β to β to β to β to β to β to 2.24 β to β to β to β to *Simultaneous CI using Bonferroni Correction

38 Scenario I Results Two Distinct Subpopulations (n=700) Parameter True ˆβ 95% Cred. Int.* χ 2 l β to β to β to β to 7.00 β to β to 3.33 β to β to β to β to 1.93 * denotes simultaneous 95% Credible Intervals

39 Scenario II Results Two Distinct Subpopulations (n=1000) Parameter True ˆβ 95% Cred. Int.* χ 2 l β to β to β to β to 7.35 β to β to 4.13 β to β to β to β to 4.64 * denotes simultaneous 95% Credible Intervals

40 Scenario III Results n=1000, Including 300 Admixed Individuals Parameter True ˆβ 95% Cred. Int.* χ 2 l β to β to β to β to 7.06 β to β to 2.31 β to β to β to β to 7.48 * denotes simultaneous 95% Credible Intervals

41 Comparative Results Parameter True N S I S II S III β β β β β β β β β β

42 Model for Admixture Inclusion of highly admixed individuals affects estimation The DPM construction above clusters individuals based on their entire set of genotypes at all L loci. It does not allow for differing ancestries across the genome How to handle admixed individuals more appropriately? Assume, for a moment, that we know which individuals are highly admixed. Divide the sample between N L low admixture subjects and N H high admixture subjects We then consider a separate parametric likelihood for the N H individuals that uses locus-specific information, effectively making an assignment at each locus for each admixed person to the clusters detected by the Dirichlet Process

43 Admixture Model Likelihood For the i th admixed subject, let X (c) il denote the allele at the l th locus on the c th chromosome, c = 1, 2. Let z (c) il denote a subpopulation assignment indicator for that allele Let q ij represent an N L dimensional admixture vector that denotes how much of that person s genome originates from the subpopulations represented by the low admixture sample. L H = N H i=1 L τ 1/2 P 2π exp l=1 c=1j=1 τ P 2 2 N L (q ij e θ jlx (c) il 1 + e θ jl ( ) N 2 L Y i q ij β 0j G i ) I(z (c) il j=1 =j)

44 Admixture Model Continued The model is then the same as before except for the addition of two prior specifications q i Dirichlet (α 1,..., α NH ) for i = 1,..., N H Pr(z (c) il = j)= 1 N L for j = 1,..., N L Note: Although the model is setup to assign alleles to specific low-admixture individuals, because of the inherent Dirichlet Process clustering, the calculations simplify to assigning admixed alleles to the current K distinct clusters

45 Admixture Model Graph l = C + 1,.., L j = 1,.., N L θ jl l = 1,.., C i = 1,.., N H X (c) il G q ij β 0j Y i X (c) θ jl il β 2l β 1l τ P

46 Scenario III Results With Hybrid Model Parameter True ˆβ 95% Cred. Int.* χ 2 l β to β to β to β to 8.17 β to β to 2.73 β to β to β to β to 6.33 * denotes simultaneous 95% Credible Intervals

47 Comparative Results Parameter True N S I S II S III S IIIA β β β β β β β β β β

48 Detecting Admixed Individuals Let s (g) i denote the cluster membership indicator for the i th individual at the g th MCMC iteration, then define Then let ηˆ (g) il = n ɛ (g+1) i = j:s (g) j L l=1 N (2V o jl + W jl ) =s (g) i 2n s (g) i ( ˆη (g+1) il ) 2 ˆη (g) il

49 Results on Detection

50 Results on Detection

51 Modifications for Case-Control Studies Binary Outcomes Simply involves switching to a logistic or probit likelihood. Case-Control Studies The extension to a case-control setting is also relatively straightforward Need to specify a retrospective likelihood such as that proposed by Seaman and Richardson (2001) Müller and Roeder (1997)

52 Extensions 1. Clustering by genome sections 2. Introducing dependence structure amongst neighboring loci, i.e. incorporating linkage disequilibrium patterns 3. Extending to Genomewide Association Studies - Removing candidate loci designation so that all markers are potentially phenotypically associated

53 References In a separate file

Methods for Cryptic Structure. Methods for Cryptic Structure

Methods for Cryptic Structure. Methods for Cryptic Structure Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases

More information

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia Expression QTLs and Mapping of Complex Trait Loci Paul Schliekelman Statistics Department University of Georgia Definitions: Genes, Loci and Alleles A gene codes for a protein. Proteins due everything.

More information

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Binomial Mixture Model-based Association Tests under Genetic Heterogeneity Hui Zhou, Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 April 30,

More information

Phasing via the Expectation Maximization (EM) Algorithm

Phasing via the Expectation Maximization (EM) Algorithm Computing Haplotype Frequencies and Haplotype Phasing via the Expectation Maximization (EM) Algorithm Department of Computer Science Brown University, Providence sorin@cs.brown.edu September 14, 2010 Outline

More information

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation Instructor: Arindam Banerjee November 26, 2007 Genetic Polymorphism Single nucleotide polymorphism (SNP) Genetic Polymorphism

More information

GENETIC association studies aim to map phenotypeinfluencing

GENETIC association studies aim to map phenotypeinfluencing Copyright Ó 2006 by the Genetics Society of America DOI: 10.1534/genetics.105.055335 A General Population-Genetic Model for the Production by Population Structure of Spurious Genotype Phenotype Associations

More information

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees: MCMC for the analysis of genetic data on pedigrees: Tutorial Session 2 Elizabeth Thompson University of Washington Genetic mapping and linkage lod scores Monte Carlo likelihood and likelihood ratio estimation

More information

Bayesian Inference of Interactions and Associations

Bayesian Inference of Interactions and Associations Bayesian Inference of Interactions and Associations Jun Liu Department of Statistics Harvard University http://www.fas.harvard.edu/~junliu Based on collaborations with Yu Zhang, Jing Zhang, Yuan Yuan,

More information

1. Understand the methods for analyzing population structure in genomes

1. Understand the methods for analyzing population structure in genomes MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW3: Population Genetics Due: 24:00 EST, April 4, 2016 by autolab Your goals in this assignment are to 1. Understand the methods for analyzing population

More information

Case-Control Association Testing. Case-Control Association Testing

Case-Control Association Testing. Case-Control Association Testing Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits. Technological advances have made it feasible to perform case-control association studies

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans University of Bristol This Session Identity by Descent (IBD) vs Identity by state (IBS) Why is IBD important? Calculating IBD probabilities Lander-Green Algorithm

More information

p(d g A,g B )p(g B ), g B

p(d g A,g B )p(g B ), g B Supplementary Note Marginal effects for two-locus models Here we derive the marginal effect size of the three models given in Figure 1 of the main text. For each model we assume the two loci (A and B)

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

Multiple QTL mapping

Multiple QTL mapping Multiple QTL mapping Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] 1 Why? Reduce residual variation = increased power

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics This Session IBD vs IBS Why is IBD important? Calculating IBD probabilities

More information

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda 1 Population Genetics with implications for Linkage Disequilibrium Chiara Sabatti, Human Genetics 6357a Gonda csabatti@mednet.ucla.edu 2 Hardy-Weinberg Hypotheses: infinite populations; no inbreeding;

More information

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Xiaoquan Wen Department of Biostatistics, University of Michigan A Model

More information

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works EM algorithm The example in the book for doing the EM algorithm is rather difficult, and was not available in software at the time that the authors wrote the book, but they implemented a SAS macro to implement

More information

QTL model selection: key players

QTL model selection: key players Bayesian Interval Mapping. Bayesian strategy -9. Markov chain sampling 0-7. sampling genetic architectures 8-5 4. criteria for model selection 6-44 QTL : Bayes Seattle SISG: Yandell 008 QTL model selection:

More information

Linear Regression (1/1/17)

Linear Regression (1/1/17) STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

More information

BIOINFORMATICS. Gilles Guillot

BIOINFORMATICS. Gilles Guillot BIOINFORMATICS Vol. 00 no. 00 2008 Pages 1 19 Supplementary material for: Inference of structure in subdivided populations at low levels of genetic differentiation. The correlated allele frequencies model

More information

A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles

A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles Jeremy Gaskins Department of Bioinformatics & Biostatistics University of Louisville Joint work with Claudio Fuentes

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Lecture 13: Population Structure. October 8, 2012

Lecture 13: Population Structure. October 8, 2012 Lecture 13: Population Structure October 8, 2012 Last Time Effective population size calculations Historical importance of drift: shifting balance or noise? Population structure Today Course feedback The

More information

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015 Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2015 1 / 1 Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits.

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Populations in statistical genetics

Populations in statistical genetics Populations in statistical genetics What are they, and how can we infer them from whole genome data? Daniel Lawson Heilbronn Institute, University of Bristol www.paintmychromosomes.com Work with: January

More information

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics 1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More information

Stationary Distribution of the Linkage Disequilibrium Coefficient r 2

Stationary Distribution of the Linkage Disequilibrium Coefficient r 2 Stationary Distribution of the Linkage Disequilibrium Coefficient r 2 Wei Zhang, Jing Liu, Rachel Fewster and Jesse Goodman Department of Statistics, The University of Auckland December 1, 2015 Overview

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models

A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models Journal of Data Science 8(2010), 43-59 A Nonparametric Approach Using Dirichlet Process for Hierarchical Generalized Linear Mixed Models Jing Wang Louisiana State University Abstract: In this paper, we

More information

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

More information

Learning ancestral genetic processes using nonparametric Bayesian models

Learning ancestral genetic processes using nonparametric Bayesian models Learning ancestral genetic processes using nonparametric Bayesian models Kyung-Ah Sohn October 31, 2011 Committee Members: Eric P. Xing, Chair Zoubin Ghahramani Russell Schwartz Kathryn Roeder Matthew

More information

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics: Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

More information

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q) Supplementary information S7 Testing for association at imputed SPs puted SPs Score tests A Score Test needs calculations of the observed data score and information matrix only under the null hypothesis,

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Introduction to Linkage Disequilibrium

Introduction to Linkage Disequilibrium Introduction to September 10, 2014 Suppose we have two genes on a single chromosome gene A and gene B such that each gene has only two alleles Aalleles : A 1 and A 2 Balleles : B 1 and B 2 Suppose we have

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

(Genome-wide) association analysis

(Genome-wide) association analysis (Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

Markov Chain Monte Carlo in Practice

Markov Chain Monte Carlo in Practice Markov Chain Monte Carlo in Practice Edited by W.R. Gilks Medical Research Council Biostatistics Unit Cambridge UK S. Richardson French National Institute for Health and Medical Research Vilejuif France

More information

Affected Sibling Pairs. Biostatistics 666

Affected Sibling Pairs. Biostatistics 666 Affected Sibling airs Biostatistics 666 Today Discussion of linkage analysis using affected sibling pairs Our exploration will include several components we have seen before: A simple disease model IBD

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle   holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/35195 holds various files of this Leiden University dissertation Author: Balliu, Brunilda Title: Statistical methods for genetic association studies with

More information

Bayesian analysis of the Hardy-Weinberg equilibrium model

Bayesian analysis of the Hardy-Weinberg equilibrium model Bayesian analysis of the Hardy-Weinberg equilibrium model Eduardo Gutiérrez Peña Department of Probability and Statistics IIMAS, UNAM 6 April, 2010 Outline Statistical Inference 1 Statistical Inference

More information

Bayesian Regression (1/31/13)

Bayesian Regression (1/31/13) STA613/CBB540: Statistical methods in computational biology Bayesian Regression (1/31/13) Lecturer: Barbara Engelhardt Scribe: Amanda Lea 1 Bayesian Paradigm Bayesian methods ask: given that I have observed

More information

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National

More information

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES: .5. ESTIMATION OF HAPLOTYPE FREQUENCIES: Chapter - 8 For SNPs, alleles A j,b j at locus j there are 4 haplotypes: A A, A B, B A and B B frequencies q,q,q 3,q 4. Assume HWE at haplotype level. Only the

More information

Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models

Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models David B. Dahl Department of Statistics Texas A&M University Marina Vannucci, Michael Newton, & Qianxing Mo

More information

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion

More information

Similarity Measures and Clustering In Genetics

Similarity Measures and Clustering In Genetics Similarity Measures and Clustering In Genetics Daniel Lawson Heilbronn Institute for Mathematical Research School of mathematics University of Bristol www.paintmychromosomes.com Talk outline Introduction

More information

2. Map genetic distance between markers

2. Map genetic distance between markers Chapter 5. Linkage Analysis Linkage is an important tool for the mapping of genetic loci and a method for mapping disease loci. With the availability of numerous DNA markers throughout the human genome,

More information

Nonparametric Bayes Modeling

Nonparametric Bayes Modeling Nonparametric Bayes Modeling Lecture 6: Advanced Applications of DPMs David Dunson Department of Statistical Science, Duke University Tuesday February 2, 2010 Motivation Functional data analysis Variable

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Neutral Theory of Molecular Evolution

Neutral Theory of Molecular Evolution Neutral Theory of Molecular Evolution Kimura Nature (968) 7:64-66 King and Jukes Science (969) 64:788-798 (Non-Darwinian Evolution) Neutral Theory of Molecular Evolution Describes the source of variation

More information

Bayes methods for categorical data. April 25, 2017

Bayes methods for categorical data. April 25, 2017 Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,

More information

Supporting Information Text S1

Supporting Information Text S1 Supporting Information Text S1 List of Supplementary Figures S1 The fraction of SNPs s where there is an excess of Neandertal derived alleles n over Denisova derived alleles d as a function of the derived

More information

Linkage and Linkage Disequilibrium

Linkage and Linkage Disequilibrium Linkage and Linkage Disequilibrium Summer Institute in Statistical Genetics 2014 Module 10 Topic 3 Linkage in a simple genetic cross Linkage In the early 1900 s Bateson and Punnet conducted genetic studies

More information

Problems for 3505 (2011)

Problems for 3505 (2011) Problems for 505 (2011) 1. In the simplex of genotype distributions x + y + z = 1, for two alleles, the Hardy- Weinberg distributions x = p 2, y = 2pq, z = q 2 (p + q = 1) are characterized by y 2 = 4xz.

More information

Mathematical models in population genetics II

Mathematical models in population genetics II Mathematical models in population genetics II Anand Bhaskar Evolutionary Biology and Theory of Computing Bootcamp January 1, 014 Quick recap Large discrete-time randomly mating Wright-Fisher population

More information

The genomes of recombinant inbred lines

The genomes of recombinant inbred lines The genomes of recombinant inbred lines Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman C57BL/6 2 1 Recombinant inbred lines (by sibling mating)

More information

Lecture 16: Mixtures of Generalized Linear Models

Lecture 16: Mixtures of Generalized Linear Models Lecture 16: Mixtures of Generalized Linear Models October 26, 2006 Setting Outline Often, a single GLM may be insufficiently flexible to characterize the data Setting Often, a single GLM may be insufficiently

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling 2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction and Motivating

More information

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition.

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition. Christian P. Robert The Bayesian Choice From Decision-Theoretic Foundations to Computational Implementation Second Edition With 23 Illustrations ^Springer" Contents Preface to the Second Edition Preface

More information

Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency

Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency Bruce Walsh lecture notes Introduction to Quantitative Genetics SISG, Seattle 16 18 July 2018 1 Outline Genetics of complex

More information

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014 Overview - 1 Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014 Elizabeth Thompson University of Washington Seattle, WA, USA MWF 8:30-9:20; THO 211 Web page: www.stat.washington.edu/ thompson/stat550/

More information

SNP Association Studies with Case-Parent Trios

SNP Association Studies with Case-Parent Trios SNP Association Studies with Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health September 3, 2009 Population-based Association Studies Balding (2006). Nature

More information

VCMC: Variational Consensus Monte Carlo

VCMC: Variational Consensus Monte Carlo VCMC: Variational Consensus Monte Carlo Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015 probabilistic models! sky fog bridge water grass object

More information

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo Friday Harbor 2017 From Genetics to GWAS (Genome-wide Association Study) Sept 7 2017 David Fardo Purpose: prepare for tomorrow s tutorial Genetic Variants Quality Control Imputation Association Visualization

More information

The Quantitative TDT

The Quantitative TDT The Quantitative TDT (Quantitative Transmission Disequilibrium Test) Warren J. Ewens NUS, Singapore 10 June, 2009 The initial aim of the (QUALITATIVE) TDT was to test for linkage between a marker locus

More information

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M. STAT 550 Howework 6 Anton Amirov 1. This question relates to the same study you saw in Homework-4, by Dr. Arno Motulsky and coworkers, and published in Thompson et al. (1988; Am.J.Hum.Genet, 42, 113-124).

More information

Down by the Bayes, where the Watermelons Grow

Down by the Bayes, where the Watermelons Grow Down by the Bayes, where the Watermelons Grow A Bayesian example using SAS SUAVe: Victoria SAS User Group Meeting November 21, 2017 Peter K. Ott, M.Sc., P.Stat. Strategic Analysis 1 Outline 1. Motivating

More information

An introduction to PRISM and its applications

An introduction to PRISM and its applications An introduction to PRISM and its applications Yoshitaka Kameya Tokyo Institute of Technology 2007/9/17 FJ-2007 1 Contents What is PRISM? Two examples: from population genetics from statistical natural

More information

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to 1 1 1 1 1 1 1 1 0 SUPPLEMENTARY MATERIALS, B. BIVARIATE PEDIGREE-BASED ASSOCIATION ANALYSIS Introduction We propose here a statistical method of bivariate genetic analysis, designed to evaluate contribution

More information

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013 Lecture 9 Short-Term Selection Response: Breeder s equation Bruce Walsh lecture notes Synbreed course version 3 July 2013 1 Response to Selection Selection can change the distribution of phenotypes, and

More information

Notes on Population Genetics

Notes on Population Genetics Notes on Population Genetics Graham Coop 1 1 Department of Evolution and Ecology & Center for Population Biology, University of California, Davis. To whom correspondence should be addressed: gmcoop@ucdavis.edu

More information

Monte Carlo-based statistical methods (MASM11/FMS091)

Monte Carlo-based statistical methods (MASM11/FMS091) Monte Carlo-based statistical methods (MASM11/FMS091) Jimmy Olsson Centre for Mathematical Sciences Lund University, Sweden Lecture 12 MCMC for Bayesian computation II March 1, 2013 J. Olsson Monte Carlo-based

More information

The Wright-Fisher Model and Genetic Drift

The Wright-Fisher Model and Genetic Drift The Wright-Fisher Model and Genetic Drift January 22, 2015 1 1 Hardy-Weinberg Equilibrium Our goal is to understand the dynamics of allele and genotype frequencies in an infinite, randomlymating population

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

STAT 526 Spring Midterm 1. Wednesday February 2, 2011 STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points

More information

Major Genes, Polygenes, and

Major Genes, Polygenes, and Major Genes, Polygenes, and QTLs Major genes --- genes that have a significant effect on the phenotype Polygenes --- a general term of the genes of small effect that influence a trait QTL, quantitative

More information

Hmms with variable dimension structures and extensions

Hmms with variable dimension structures and extensions Hmm days/enst/january 21, 2002 1 Hmms with variable dimension structures and extensions Christian P. Robert Université Paris Dauphine www.ceremade.dauphine.fr/ xian Hmm days/enst/january 21, 2002 2 1 Estimating

More information

I Have the Power in QTL linkage: single and multilocus analysis

I Have the Power in QTL linkage: single and multilocus analysis I Have the Power in QTL linkage: single and multilocus analysis Benjamin Neale 1, Sir Shaun Purcell 2 & Pak Sham 13 1 SGDP, IoP, London, UK 2 Harvard School of Public Health, Cambridge, MA, USA 3 Department

More information

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies Confounding in gene+c associa+on studies q What is it? q What is the effect? q How to detect it?

More information

Estimating the location and shape of hybrid zones.

Estimating the location and shape of hybrid zones. Estimating the location and shape of hybrid zones. Benjamin Guedj and Gilles Guillot May 24, 2011 Abstract We propose a new model to make use of geo-referenced genetic data for inferring the location and

More information

BTRY 4830/6830: Quantitative Genomics and Genetics

BTRY 4830/6830: Quantitative Genomics and Genetics BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55 Announcements

More information

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies. Theoretical and computational aspects of association tests: application in case-control genome-wide association studies Mathieu Emily November 18, 2014 Caen mathieu.emily@agrocampus-ouest.fr - Agrocampus

More information

Spatial localization of recent ancestors for admixed individuals

Spatial localization of recent ancestors for admixed individuals G3: Genes Genomes Genetics Early Online, published on November 3, 2014 as doi:10.1534/g3.114.014274 Spatial localization of recent ancestors for admixed individuals Wen-Yun Yang 1, Alexander Platt 2, Charleston

More information

Causal Graphical Models in Systems Genetics

Causal Graphical Models in Systems Genetics 1 Causal Graphical Models in Systems Genetics 2013 Network Analysis Short Course - UCLA Human Genetics Elias Chaibub Neto and Brian S Yandell July 17, 2013 Motivation and basic concepts 2 3 Motivation

More information

A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW

A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW Miguel A Gómez-Villegas and Beatriz González-Pérez Departamento de Estadística

More information

Evaluating the performance of a multilocus Bayesian method for the estimation of migration rates

Evaluating the performance of a multilocus Bayesian method for the estimation of migration rates University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Publications, Agencies and Staff of the U.S. Department of Commerce U.S. Department of Commerce 2007 Evaluating the performance

More information

Module 4: Bayesian Methods Lecture 9 A: Default prior selection. Outline

Module 4: Bayesian Methods Lecture 9 A: Default prior selection. Outline Module 4: Bayesian Methods Lecture 9 A: Default prior selection Peter Ho Departments of Statistics and Biostatistics University of Washington Outline Je reys prior Unit information priors Empirical Bayes

More information

Discrete & continuous characters: The threshold model

Discrete & continuous characters: The threshold model Discrete & continuous characters: The threshold model Discrete & continuous characters: the threshold model So far we have discussed continuous & discrete character models separately for estimating ancestral

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs. Supplementary Figure 1 Number of cases and proxy cases required to detect association at designs. = 5 10 8 for case control and proxy case control The ratio of controls to cases (or proxy cases) is 1.

More information

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES Elizabeth A. Thompson Department of Statistics, University of Washington Box 354322, Seattle, WA 98195-4322, USA Email: thompson@stat.washington.edu This

More information