(Genome-wide) association analysis

Similar documents
Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Efficient Bayesian mixed model analysis increases association power in large cohorts

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination)

Linkage and Linkage Disequilibrium

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

I Have the Power in QTL linkage: single and multilocus analysis

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Partitioning the Genetic Variance

Lecture 28: BLUP and Genomic Selection. Bruce Walsh lecture notes Synbreed course version 11 July 2013

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

Genotype Imputation. Biostatistics 666

Methods for Cryptic Structure. Methods for Cryptic Structure

Lecture 11: Multiple trait models for QTL analysis

Lecture WS Evolutionary Genetics Part I 1

Overview. Background

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

GBLUP and G matrices 1

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Detecting selection from differentiation between populations: the FLK and hapflk approach.

Case-Control Association Testing. Case-Control Association Testing

Linear Regression (1/1/17)

The Quantitative TDT

Evolution of phenotypic traits

Introduction to Linkage Disequilibrium

Calculation of IBD probabilities

Stationary Distribution of the Linkage Disequilibrium Coefficient r 2

Prediction of the Confidence Interval of Quantitative Trait Loci Location

Lecture 9. QTL Mapping 2: Outbred Populations

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Calculation of IBD probabilities

Breeding Values and Inbreeding. Breeding Values and Inbreeding

Computational Approaches to Statistical Genetics

Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

INTRODUCTION TO ANIMAL BREEDING. Lecture Nr 2. Genetics of quantitative (multifactorial) traits What is known about such traits How they are modeled

Partitioning the Genetic Variance. Partitioning the Genetic Variance

Notes on Population Genetics

Asymptotic distribution of the largest eigenvalue with application to genetic data

Bayesian Inference of Interactions and Associations

Statistical issues in QTL mapping in mice

p(d g A,g B )p(g B ), g B

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

1. Understand the methods for analyzing population structure in genomes

Multiple QTL mapping

Cover Page. The handle holds various files of this Leiden University dissertation

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013

Multi-population genomic prediction. Genomic prediction using individual-level data and summary statistics from multiple.

Conditions under which genome-wide association studies will be positively misleading

PCA vignette Principal components analysis with snpstats

Mapping QTL to a phylogenetic tree

Accounting for read depth in the analysis of genotyping-by-sequencing data

Evolutionary quantitative genetics and one-locus population genetics

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014

Lecture 24: Multivariate Response: Changes in G. Bruce Walsh lecture notes Synbreed course version 10 July 2013

Introduction to Quantitative Genetics. Introduction to Quantitative Genetics

The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

Inferring Genetic Architecture of Complex Biological Processes

Methods for QTL analysis

SUPPLEMENTARY INFORMATION

Bayesian construction of perceptrons to predict phenotypes from 584K SNP data.

2. Map genetic distance between markers

Computational Systems Biology: Biology X

Lecture 5: BLUP (Best Linear Unbiased Predictors) of genetic values. Bruce Walsh lecture notes Tucson Winter Institute 9-11 Jan 2013

Introduction to QTL mapping in model organisms

Package LBLGXE. R topics documented: July 20, Type Package

Population Genetics I. Bio

ARTICLE MASTOR: Mixed-Model Association Mapping of Quantitative Traits in Samples with Related Individuals

One-week Course on Genetic Analysis and Plant Breeding January 2013, CIMMYT, Mexico LOD Threshold and QTL Detection Power Simulation

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

The concept of breeding value. Gene251/351 Lecture 5

Distinctive aspects of non-parametric fitting

Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials

Lecture 8. QTL Mapping 1: Overview and Using Inbred Lines

BTRY 7210: Topics in Quantitative Genomics and Genetics

Variance Component Models for Quantitative Traits. Biostatistics 666

QTL Mapping I: Overview and using Inbred Lines

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

1. they are influenced by many genetic loci. 2. they exhibit variation due to both genetic and environmental effects.

Using Genomic Structural Equation Modeling to Model Joint Genetic Architecture of Complex Traits

Linear Regression. Volker Tresp 2018

Evolutionary Genetics Midterm 2008

GWAS for Compound Heterozygous Traits: Phenotypic Distance and Integer Linear Programming Dan Gusfield, Rasmus Nielsen.

Association studies and regression

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

Quantitative Genetics & Evolutionary Genetics

Régression en grande dimension et épistasie par blocs pour les études d association

Multidimensional heritability analysis of neuroanatomical shape. Jingwei Li

Genome-wide analysis of zygotic linkage disequilibrium and its components in crossbred cattle

STAT 536: Genetic Statistics

Transcription:

(Genome-wide) association analysis 1

Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by confounding between a marker and other effects (= usually bad); The power of QTL detection by LD depends on the proportion of phenotypic variance explained at a marker; Mixed models are good for performing GWAS Genetic (co)variance can be estimated from GWAS summary statistics 2

Outline Association vs linkage Linkage disequilibrium Analysis: single SNP GWAS: design, power GWAS: analysis 3

Linkage Association Families Populations 4

Linkage disequilibrium around an ancestral mutation 5

LD Non-random association between alleles at different loci Many possible causes mutation drift / inbreeding / founder effects population stratification selection Broken down by recombination 6

Definition of D 2 bi-allelic loci Locus 1, alleles A & a, with freq. p and (1-p) Locus 2, alleles B & b with freq. q and (1-q) Haplotype frequencies p AB, p Ab, p ab, p ab D = p AB - pq 7

r 2 r 2 = D 2 / [pq(1-p)(1-q)] Squared correlation between presence and absence of the alleles in the population Nice statistical properties 8 [Hill and Robertson 1968]

Properties of r and r 2 Population in equilibrium E(r) = 0 E(r 2 ) = var(r) 1/[1 + 4Nc] + 1/n N = effective population size n = sample size (haplotypes) c = recombination rate nr 2 ~ c (1) 2 Human population is NOT in equilibrium LD depends on population size and recombination distance [Sved 1971; Weir and Hill 1980] 9

Analysis Single locus association GWAS Least squares ML Bayesian methods 10

Falconer model for single biallelic QTL a d m -a bb Bb BB Var (X) = Regression Variance + Residual Variance = Additive Variance + Dominance Variance 11

12

Statistical power (linear regression) y = µ + bx + e, x = 0, 1, 2 s y2 = s q2 + s e 2 s x2 = 2p(1-p) regression + residual p = allele frequency for indicator x {HWE: note x is usually considered fixed in regression} s q2 = b 2 s x2 = [a + d(1-2p)] 2 * 2p(1-p) q 2 = s q2 / s y 2 {QTL heritability} 13

Statistical Power c 2 test with 1 df: E(X 2 ) = 1 + n R 2 / (1 R 2 ) = 1 + nq 2 /(1-q 2 ) = 1 + NCP NCP = non-centrality parameter Power of association proportional to q 2 (Power of linkage proportional to q 4 ) 14

Statistical Power (R) alpha= 5e-8 threshold= qchisq(1-alpha,1) q2= 0.005 n= 10000 ncp= n*q2/(1-q2) power= 1-pchisq(threshold,1,ncp) threshold ncp power > alpha= 5e-8 > threshold= qchisq(1-alpha,1) > q2= 0.005 > n= 10000 > ncp= n*q2/(1-q2) > power= 1-pchisq(threshold,1,ncp) > threshold [1] 29.71679 > ncp [1] 50.25126 > power [1] 0.9492371 In-class demo 15

Power by association with SNP (small effect; HWE) NCP(SNP) = n r 2 q 2 = r 2 * NCP(causal variant) = n * {r 2 q 2 } = n * (variance explained by SNP) Power of LD mapping depends on the experimental sample size, variance explained by the causal variant and LD with a genotyped SNP 16

GWAS Same principle as single locus association, but additional information QC Duplications, sample swaps, contamination Power of multi-locus data Unbiased genome-wide association Relatedness Population structure Ancestry More powerful statistical analyses 17

The multiple testing burden P(at least 1 false positive) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 P = 1 (1-a) n per test false positive rate 0.05 per test false positive rate 0.001 = 0.05/50 0 0 5 10 15 20 25 30 35 40 45 50 18 Number of independent tests performed

Population stratification (association unlinked genes) Both populations are in linkage equilibrium; genes unlinked Allele frequency Haplotype frequency p A1 p B1 p A1B1 p A1B2 p A2B1 p A2B2 Pop. 1 0.9 0.9 0.81 0.09 0.09 0.01 Pop. 2 0.1 0.1 0.01 0.09 0.09 0.81 Average 0.5 0.5 0.41 0.09 0.09 0.41 Combined population: D = 0.16 and r 2 = 0.41 19

Population stratification (genes and phenotypes) [Hamer & Sirota 2010 Mol Psych] 20

Population stratification (genes and phenotypes) [Hamer & Sirota 2010 Mol Psych] 21

Population stratification (genes and phenotypes) [Vilhjalmsson & Nordborg 2013 Nature Reviews Genetics ] 22

23

Stratification y = Sg i + Se i r(y,g i ) due to causal association with g i correlation g i and g j and causal association with g j (LTC and height) correlation g i and environmental factor e j (chopsticks) 24

How to deal with structure? Detect and discard outliers Detect, analysis and adjustment E.g. genomic control Account for structure during analysis Fit a few principal components as covariates Fit GRM 25

GWAS using mixed linear models y = Xb + bx + g + e var(g) = Gs g 2 G = genetic relationship matrix (GRM) Model conditions on effects of all other variants Power depends on whether x is included (MLMi) or excluded (MLMe) from the construction of G. [Yang et al. 2014 Nature Genetics] 26

GWAS using mixed linear models: statistical power r 2 here is the squared correlation between g-hat and g [Yang et al. 2014 Nature Genetics] 27

Exploiting GWAS summary statistics Summary stats are typically in the form of SNP, SNP allele, effect size, SE (or p-value) 100s of datasets in the public domain Uses: Prediction Conditional analysis (GCTA-COJO) Estimation of genetic parameters (LD scoring) 28

Exploiting known LD between SNPs N = sample size, M = number of markers Single SNP, assume SNP scores are standardised (w = (x-2p)/ 2p(1-p)) y = wb + e b = Swy / Sw 2 E(Sw 2 ) N var(w) = N E(Swy) Nb 29

Multiple regression y = Wb + e b = (W W) -1 W y Diagonal element of W W are Sw 2 with E(Sw 2 ) N Off-diagonal elements are Sw i w j with E(Sw i w j ) Nr ij à (W W)/N is an LD correlation matrix 30

LD correlation matrix from reference Let R be an LD correlation matrix estimated from a reference sample with individuallevel data Then E(W W) ~ NR 31

Approximate joint analaysis using summary statistics b GWAS = diag(w W) -1 W y We want b joint = (W W) -1 W y Approximate solution: E(W W) ~ NR W y = diag(w W)b GWAS = Nb GWAS b joint = (NR) -1 Nb GWAS = R -1 b GWAS Allows joint fitting and conditional analysis [Yang et al. 2012, Nature Genetics] 32

LD score regression Exploit summary statistics to estimate genetic parameters and detect evidence of population stratification Principle: E(c 2 ) = Nq 2 for single causal variant E(c 2 ) = Nq 2 (1 + r 2 ) for causal variant correlated with another causal variant 33

How does LD shape association A set of markers along a chromosome region: Superimpose LD between markers Consider causal SNPs All markers correlated with a causal variant show association 34 Slide from B Neale

Slide from B Neale How does LD shape association Consider causal SNPs All markers correlated with a causal variant show association. Lonely SNPs only show association if they are causal The more you tag the more likely you are to tag a causal variant Assuming all SNPs gave an equal probability of association given LD status, we expect to see more association for SNPs with more LD friends. This is a reasonable assumption under a polygenic genetic architecture 35

LD score regression ) l " = % r "' ( '*+ Quantifies local LD for SNP i E χ ( l " = Nh ( l " M + Na + 1 Test statistic is linear in LD score à regression of test statistic on LD score provides an estimate of SNP heritability Use GWAS summary statistics and reference sample for LD score estimation [Yang 2011 EJHG; Bulik-Sullivan 2015 NG] 36

Same principle for genetic covariance E z ",+ z ",( l " = N +N ( ρ g M l " + ρn : N + N ( N s is the number of overlapping samples z = test statistics from GWAS summary statistics N = sample size M = number of markers ρ g = genetic covariance between traits ρ = phenotypic correlation between traits [Bulik-Sullivan 2015 Nature Genetics] 37

Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by confounding between a marker and other effects (= usually bad); The power of QTL detection by LD depends on the proportion of phenotypic variance explained at a marker; Mixed models are good for performing GWAS Genetic (co)variance can be estimated from GWAS summary statistics 38