Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Similar documents
1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

(Genome-wide) association analysis

UNIT 8 BIOLOGY: Meiosis and Heredity Page 148

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination)

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Population Genetics I. Bio

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

Lecture WS Evolutionary Genetics Part I 1

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Linear Regression (1/1/17)

Outline of lectures 3-6

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important?

Introduction to Linkage Disequilibrium

The Quantitative TDT

Linkage and Linkage Disequilibrium

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

F1 Parent Cell R R. Name Period. Concept 15.1 Mendelian inheritance has its physical basis in the behavior of chromosomes

Outline of lectures 3-6

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Problems for 3505 (2011)

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Lecture 9. QTL Mapping 2: Outbred Populations

Unit 6 Reading Guide: PART I Biology Part I Due: Monday/Tuesday, February 5 th /6 th

Question: If mating occurs at random in the population, what will the frequencies of A 1 and A 2 be in the next generation?

Computational Systems Biology: Biology X

The phenotype of this worm is wild type. When both genes are mutant: The phenotype of this worm is double mutant Dpy and Unc phenotype.

- mutations can occur at different levels from single nucleotide positions in DNA to entire genomes.

SNP Association Studies with Case-Parent Trios

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

Genetics (patterns of inheritance)

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Constructing a Pedigree

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

2. Map genetic distance between markers

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014

p(d g A,g B )p(g B ), g B

Notes on Population Genetics

SNP-SNP Interactions in Case-Parent Trios

EVOLUTION ALGEBRA Hartl-Clark and Ayala-Kiger

Processes of Evolution

GENETICS - CLUTCH CH.1 INTRODUCTION TO GENETICS.

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

Computational Systems Biology: Biology X

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Dropping Your Genes. A Simulation of Meiosis and Fertilization and An Introduction to Probability

Calculation of IBD probabilities

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Outline for today s lecture (Ch. 14, Part I)

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

When one gene is wild type and the other mutant:

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity,

Case-Control Association Testing. Case-Control Association Testing

Genotype Imputation. Biostatistics 666

Objectives. Announcements. Comparison of mitosis and meiosis

Breeding Values and Inbreeding. Breeding Values and Inbreeding

Microevolution Changing Allele Frequencies

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

Natural Selection. Population Dynamics. The Origins of Genetic Variation. The Origins of Genetic Variation. Intergenerational Mutation Rate

Life Cycles, Meiosis and Genetic Variability24/02/2015 2:26 PM

TOPICS IN STATISTICAL METHODS FOR HUMAN GENE MAPPING

Lesson 4: Understanding Genetics

Lab 12. Linkage Disequilibrium. November 28, 2012

Biology. Revisiting Booklet. 6. Inheritance, Variation and Evolution. Name:

Calculation of IBD probabilities

Q Expected Coverage Achievement Merit Excellence. Punnett square completed with correct gametes and F2.

Biology 211 (1) Exam 4! Chapter 12!

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Bayesian Inference of Interactions and Associations

NOTES CH 17 Evolution of. Populations

Patterns of inheritance

Genetics and Natural Selection

Effective population size and patterns of molecular evolution and variation

Introduction to population genetics & evolution

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Concept 15.1 Mendelian inheritance has its physical basis in the behavior of chromosomes

A. Correct! Genetically a female is XX, and has 22 pairs of autosomes.

The phenotype of this worm is wild type. When both genes are mutant: The phenotype of this worm is double mutant Dpy and Unc phenotype.

Outline of lectures 3-6

Introduction to Genetics

GSBHSRSBRSRRk IZTI/^Q. LlML. I Iv^O IV I I I FROM GENES TO GENOMES ^^^H*" ^^^^J*^ ill! BQPIP. illt. goidbkc. itip31. li4»twlil FIFTH EDITION

LECTURE # How does one test whether a population is in the HW equilibrium? (i) try the following example: Genotype Observed AA 50 Aa 0 aa 50

How robust are the predictions of the W-F Model?

The Chromosomal Basis of Inheritance

Name: Period: EOC Review Part F Outline

Statistical Analysis of Haplotypes, Untyped SNPs, and CNVs in Genome-Wide Association Studies

Solutions to Problem Set 4

X Chromosome Association Testing in Genome-Wide Association Studies

Outline. P o purple % x white & white % x purple& F 1 all purple all purple. F purple, 224 white 781 purple, 263 white

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

Evolutionary Genetics Midterm 2008

Unit 3 - Molecular Biology & Genetics - Review Packet

Heterozygosity is variance. How Drift Affects Heterozygosity. Decay of heterozygosity in Buri s two experiments

Ch 11.Introduction to Genetics.Biology.Landis

Transcription:

Friday Harbor 2017 From Genetics to GWAS (Genome-wide Association Study) Sept 7 2017 David Fardo

Purpose: prepare for tomorrow s tutorial Genetic Variants Quality Control Imputation Association Visualization Prioritization

OUTLINE Goal: be able to answer the following questions What are some of the historical landmarks of GWAS? What is unique about GWAS data and data quality considerations? How do you test for genetic association?

TOWARDS GWAS Evidence for genetic role? Population differences Familial aggregation Linkage?

LINKAGE

One cent Use properties of recombination to localize Track transmissions through families Linkage Analysis: a two cent version Second cent Use principle of similarity Sib-pairs that are phenotypically similar should also be genotypically similar -Penrose, 1935 Identity by state / descent (IBS/IBD) Thomas Hunt Morgan- 1933 Nobel "for his discoveries concerning the role played by the chromosome in heredity".

Recombination Two Loci: A and B Two Alleles at each Locus: {A 1, A 2 }, {B 1, B 2 } Four Possible Haplotypes: A 1 B 1 A 1 B 2 A 2 B 1 A 2 B 2 Ten Possible Diploid Genotypes (sometimes called diplotypes): A 1 B 1 A 1 B 1 A 1 B 1 A 1 B 2 A 1 B 1 A 2 B 1 A 1 B 1 A 2 B 2 A 1 B 2 A 1 B 2 A 1 B 2 A 2 B 1 A 1 B 2 A 2 B 2 A 2 B 1 A 2 B 1 A 2 B 1 A 2 B 2 A 2 B 2 A 2 B 2

Diploid Genotypes 1 2 3 4 5 6 7 8 9 10 A 1 B 1 A 1 B 1 A 1 B 1 A 1 B 2 A 1 B 1 A 2 B 1 A 1 B 1 A 2 B 2 A 1 B 2 A 1 B 2 A 1 B 2 A 2 B 1 A 1 B 2 A 2 B 2 A 2 B 1 A 2 B 1 A 2 B 1 A 2 B 2 A 2 B 2 A 2 B 2 Recombination only detectable in the double heterozygotes

Double Heterozygotes A 1 B 1 A 2 B 2 gametes A 1 B 1 A 1 B 2 A 2 B 1 A 2 B 2 1- q 2 transmission probability q 2 q 2 1- q 2 q = recombination rate (ranges from 0 to 0.5) q = 0 : no recombination q = 0.5 : unlinked

Cardon & Bell, Nat Rev Gen (2001)

LINKAGE DISEQUILIBRIUM (LD)

A definition Linkage Disequilibrium allelic association between two genetic loci

What you need to know about LD It can be defined several ways mathematically, each definition with its own pros/cons (I will show a couple briefly) It degrades over generations Its properties are used for GWAS

Linkage Disequilibrium A 1 B 1 A 2 B 2 gametes A 1 B 1 A 1 B 2 A 2 B 1 A 2 B 2 transmission frequency x 1 x 2 x 3 x 4

Linkage Disequilibrium Gametes A 1 B 1 A 1 B 2 A 2 B 1 A 2 B 2 Frequency x 1 x 2 x 3 x 4 Allele A 1 A 2 B 1 B 2 Frequency p A1 =x 1 +x 2 p A2 =x 3 +x 4 p B1 =x 1 +x 3 p B2 =x 2 +x 4 D = Observed - Expected D = x 1 - p A1 p B1 D = x 1 - (x 1 + x 2 )(x 1 + x 3 ) D = x 1 x 4 - x 2 x 3

Linkage Disequilibrium After one generation of random mating: x 1 x 2 x 3 x 4 = = = = x x x 1 x 2 3 4 -qd + qd + qd -qd D t=1 = x 1 x 4 - x 2 x 3 D t=1 = (1-q)D After t generations: D t = (1-q) t D 0

What does this mean? D t = (1-q) t D 0 D 0 theta t D 10 1 0.5 10 0.001 1 0.1 10 0.35

Normalized LD Parameters D = D D max D max = min(p A1 p B2,p A2 p B1 ) if D is positive = min(p A1 p B1,p A2 p B2 ) if D is negative Now, LD ranges from -1 to +1

r 2 Most commonly used LD measure -- squared correlation coefficient -- r 2 = D 2 p A1 p A 2 p B1 p B 2

LD take home points It can be defined several ways mathematically, each definition with its own pros/cons It degrades over generations Its properties are used for GWAS

IMPUTATION

Historical context of some large-scale initiatives à towards imputation Human Genome Project 2003 (kind of) 2 males, 2 females HapMap 2005 / 2007 / 2009 initially 269; expanded to ~1400 1000 Genomes Project 2010 / 2012 / 2015 guess? Haplotype Reference Consortium 2016 1 st release is ~65k haplotypes All of Us (PMI Initiative)? http://mathgen.stats.ox.ac.uk/impute/impute_v2.html

Human Genome Project Goals: identify all the approximate 30,000 genes in human DNA, determine the sequences of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal, and social issues (ELSI) that may arise from the project. Milestones: 1990: Project initiated as joint effort of U.S. Department of Energy and the National Institutes of Health June 2000: Completion of a working draft of the entire human genome February 2001: Analyses of the working draft are published April 2003: HGP sequencing is completed and Project is declared finished two years ahead of schedule U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003

HapMap An NIH program to chart genetic variation within the human genome Begun in 2002, the project is a 3-year effort to construct a map of the patterns of SNPs (single nucleotide polymorphisms) that occur across populations in Africa, Asia, and the United States. Consortium of researchers from six countries Researchers hope that dramatically decreasing the number of individual SNPs to be scanned will provide a shortcut for identifying the DNA regions associated with common complex diseases Map may also be useful in understanding how genetic variation contributes to responses in environmental factors

What is the 1000 Genomes Project? International multi-center collaboration building on HapMap data to establish the most comprehensive catalogue of human genetic variation available Phase I: 1,092 complete genomes from 14 populations published in Nature, October 2012 Freely accessible public databases Final phase of project brings total genotyped to 2504 individuals from 26 populations worldwide

From Where?

Haplotype Reference Consortium

Take home points Many subjects From many populations Assayed for many variants Quality of reference haplotypes continues to improve Data are publicly available

QUALITY CONTROL

Quality Control As in ANY analysis, we want quality data Garbage in à garbage out So what here is unique? Mendelian inheritance Lab-based protocols Sample duplication for concordance Call rates Chromosomal anomalies Population genetics, e.g., Hardy-Weinberg Equilibrium testing Much research in this area, updated protocols

Hardy-Weinberg Equilibrium Allele and genotype frequencies remain constant over time, when Large population Random mating Sex-independent genotype frequencies No natural selection No migration No mutation No inbreeding Implications Can derive expected genotype frequencies from allele frequencies If these deviate from realized genotype frequencies, then one of the assumptions may not hold OR Genotyping error Association? OR

http://bioconductor.org/packages/release/bioc/html/gwastools.html

Modes of Inheritance

Coding Genotypes Assume a biallelic marker (SNP) A or G Each chromosome will have one of the two possible alleles FID IID A1 A2 0 0001 A A 0 0002 A G 0 0003 G G 0 0004 A A

Mode of inheritance (MOI) A pattern of how a disease is transmitted in families Additive 0 1 2 Carrier Carrier Phenotype value Dominant 1 1 0 Recessive 1 0 0 Non-carrier Carrier Carrier Carrier

Additive Mode b 1 Y b 0 0 1 2 X

Dominant Mode Y 0 1 2 X

Recessive Mode Y 0 1 2 X

FID IID A1 A2 0 0001 A A 0 0002 A G 0 0003 G G 0 0004 A A A: risk allele Additive Dominant Recessive FID IID G1 FID IID G1 FID IID G1 0 0001 2 0 0001 1 0 0001 1 0 0002 1 0 0002 1 0 0002 0 0 0003 0 0 0003 0 0 0003 0 0 0004 2 0 0004 1 0 0004 1

Tests for Association

Tests for Association Discrete Traits Cochrane-Armitage Trend Test Alleles Test General RxC Contingency Table (Chi-square) Other Types Continuous Time-to-event Multivariate

Cochrane-Armitage Copies of Allele 0 1 2 Case A 0 A 1 A 2 M 1 Control U 0 U 1 U 2 M 0 N 0 N 1 N 2 N c 1 2 = N[N(A 1 + 2A 2 ) - M 1 (N 1 + 2N 2 )] 2 M 1 (N - M 1 )[N(N 1 + 4N 2 ) - (N 1 + 2N 2 ) 2 ]

Alleles Test + - Case A 1 +2A 2 A 1 +2A 0 2M 1 Control U 1 +2U 2 U 1 +2U 0 2M 0 N 1 +2N 2 N 1 +2N 0 2N c 1 2 = 2N[2N(A 1 + 2A 2 ) - 2M 1 (N 1 + 2N 2 )] 2 2M 1 2(N - M 1 )[2N(N 1 + 2N 2 ) - (N 1 + 2N 2 ) 2 ] Note: Variance (denominator) assumes HWE!!!

General Chi-Square Copies of Allele 0 1 2 Case A 0 A 1 A 2 M 1 Control U 0 U 1 U 2 M 0 N 0 N 1 N 2 N c 2 2 = [A 0 - E(A 0 )]2 E(A 0 ) + [A 1 - E(A 1 )]2 E(A 1 ) + [A 2 - E(A 2 )]2 E(A 2 ) + [U 0 - E(U 0 )]2 E(U 0 ) + [U 1 - E(U 1 )]2 E(U 1 ) + [U 2 - E(U 2 )]2 E(U 2 )

Logistic Regression æ p ln x ç è 1- p x ö = b 0 + b 1 X ø logit(p x ) = b 0 + b 1 X

Model Interpretation Additive model (X=0, 1 or 2) æ p ln x ö ç = b 0 + b 1 X è 1- p x ø ln(b 1 ) = one - unit increase Note: This is analogous to an odds ratio (OR) from a 2x3 table Genotype Model (indicator variables G i = 0 or 1) æ p ln x ö ç = b 0 + b 1 G 1 + b 2 G 2 è 1- p x ø Subjects with G 0 =1 are the reference group OR for subjects with G 2 compared to G 0 = e b 2

SNP SNP SNP SNP SNP E(y % ) = β * + β, X %. P-value or log p % 1 p % = β * + β, X %. P-value -log 10 (p-value) Top hitter Chromosomal position

Association take home points Many ways to seek out and test for a genetic association Mode of inheritance, while somewhat a misnomer in complex disease genetics, reflects our assumptions on how genotype influences phenotype We will focus largely on the flexible frameworks of linear and logistic regression

Population Stratification

Genetic Associations Truth Causal locus (direct) In LD with causal locus (indirect) Chance If you test 100 times, you ll see ~ 5 tests < 0.05 The association is due to chance - no causal underpinning Bias Association is not causal Yellow fingers associated with lung cancer e.g. Population stratification

Genetic Associations Truth Causal locus (direct) In LD with causal locus (indirect) Chance If you test 100 times, you ll see ~ 5 tests < 0.05 The association is due to chance - no causal underpinning Bias Association is not causal Yellow fingers associated with lung cancer e.g. Population stratification

Truth A genetic association test finds A Phenotype

Truth A genetic association test finds A Phenotype LD: highly correlated A Phenotype G?

Genetic Associations Truth Causal locus (direct) In LD with causal locus (indirect) Chance If you test 100 times, you ll see ~ 5 tests < 0.05 The association is due to chance - no causal underpinning Bias Association is not causal Yellow fingers associated with lung cancer e.g. Population stratification

Chance Each point represents the association for each locus. There are more than 6 million points. Genome wide significance level P = 5 10-8 p = 0.05 = 10-1.30103 Manhattan plot: It is a scatter plot used to display the p-values in genome-wide association studies (GWAS)

https://www.ebi.ac.uk/gwas/diagram

Genetic Associations Truth Causal locus (direct) In LD with causal locus (indirect) Chance If you test 100 times, you ll see ~ 5 tests < 0.05 The association is due to chance - no causal underpinning Bias Association is not causal Yellow fingers associated with lung cancer e.g. Population stratification

Quantile-quantile (QQ) Plots Good way of seeing what s going on overall Any real hits? Any systematic problems? In GWAS, MOST SNPs will not be associated with whatever phenotype is examined, i.e., they are from the null distribution

Quantile-quantile (QQ) Plots

Quantile-quantile (QQ) Plots

Quantile-quantile (QQ) Plots

Bias A genetic association test finds A Phenotype

Bias A genetic association test finds A Spurious association Ancestry Phenotype True risk factor

Stratification Essentially a confounder! Yellow fingers associated with lung cancer How does it happen?

Famous Example Knowler et al (1988)

Cardon et al (2003)

Stratification Happens Historical strategies to deal with it Self-Reported Ancestry Match (design) or Adjust (analysis) Use other genetic markers (ancestry informative) Genomic Control STRUCTURE PCA/Eigenstrat Use a family-based design More later

Thank you! Questions?

Acknowledgements Harvard T. F. Chan School of Public Health Christoph Lange Pete Kraft University of Colorado Matt McQueen University of Kentucky Yuri Katsumata Jennifer Daddysman University of Florida Tyler Smith University of Washington Paul Crane Joey Mukherjee Vanderbilt University Tim Hohman Brigham Young University Keoni Kauwe National Institute of Aging R01 AG042437 K25 AG043546 R01 AG057187 P30 AG028383 R01 AG054060