Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies


Ruth Pfeiffer, Ph.D.; Mitchell Gail
  Biostatistics Branch, Division of Cancer Epidemiology & Genetics, National Cancer Institute, NIH, USA
William Wheeler; David Pee
  Information Management Systems, Silver Spring, USA

Outline
  Genetics background
  Case-control genome-wide association study (GWAS)
  Ranking and selection procedures
  Performance criteria
    Detection probability
    Positive proportion
  Analytic results & simulations
  Extensions: two-stage designs
  Conclusions

The Human Genome
  Four DNA bases (nucleotides): A (adenine), T (thymine), G (guanine), C (cytosine)
  3 billion base pairs
  22+2 chromosomes
  > 99% of human DNA sequences identical
  ~8,000,000 single nucleotide polymorphisms (SNPs): alterations in a single nucleotide, e.g. AAGGC -> ATGGC
  For a variation to be considered a SNP, it must occur in >= 1% of the population
  SNPs occur every 100-300 bases, in coding (gene) and noncoding regions
  Find disease genes: study variation in SNPs

Linkage Disequilibrium
  Genotyping all loci is not possible (not yet!) => use the correlation of alleles at two loci
  [Figure: marker locus with alleles a: 10% ($p_a$) and A: 90%; disease locus with alleles G: 1% ($p_G$) and g: 99%; the two loci are linked through D]

Linkage Disequilibrium
  Bi-allelic disease locus: disease allele G ($p_G$), wild-type allele g ($p_g$)
  Bi-allelic marker: a, A ($p_a$, $p_A$)
  Linkage disequilibrium (LD) is defined as $D = P(A, G) - p_A p_G$
  $D' = D / D_{max} = D / \min(p_G p_a, p_g p_A)$ for $D > 0$
  $r^2 = (D')^2 \, p_a p_G / (p_A p_g)$ when $D_{max} = p_G p_a$
  $D'$ is an upper bound for $r^2$
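The LD measures above can be sketched in a few lines of Python. This is only an illustration: the function name `ld_measures` and its arguments are hypothetical, the branch for D < 0 uses the standard convention for D_max (the slide only covers D > 0), and r^2 is computed from its usual definition D^2 / (p_A p_a p_G p_g).

```python
def ld_measures(p_AG, p_A, p_G):
    """Return (D, D', r^2) for a bi-allelic marker (A/a) and disease locus (G/g).

    p_AG is the frequency of haplotypes carrying both A and G.
    """
    p_a, p_g = 1.0 - p_A, 1.0 - p_G
    D = p_AG - p_A * p_G
    if D >= 0:
        d_max = min(p_G * p_a, p_g * p_A)   # the slide's case, D > 0
    else:
        d_max = min(p_A * p_G, p_a * p_g)   # assumption: standard convention for D < 0
    d_prime = D / d_max if d_max > 0 else 0.0
    r2 = D * D / (p_A * p_a * p_G * p_g)
    return D, d_prime, r2


# Allele frequencies from the slide (a: 10%, G: 1%), with the hypothetical
# assumption that every G allele sits on an "a" haplotype, so P(A, G) = 0.
D, d_prime, r2 = ld_measures(p_AG=0.0, p_A=0.9, p_G=0.01)
```

In this hypothetical configuration |D'| is at its maximum while r^2 stays small, which is why a marker can be in complete LD with a disease locus and still be a weak proxy for it.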

Genome-wide approaches
  Try to get closer to the disease locus by high-density SNP coverage: tag SNPs
  Tag SNP: representative SNP in a region of the genome with high LD ($r^2 > 0.8$) with untyped SNPs in that region (HapMap project)
  Typically 550,000-650,000 tag SNPs per subject
  It is still estimated that 25% of SNPs are not captured adequately by tag SNPs

Case-control Genome-Wide Association Studies
  Retrospective sample of unrelated cases (diseased) and controls (non-diseased)
  Exposure: tag SNPs covering the genome
  Assess correlation between SNP genotypes and disease status
  Common SNPs (minor allele frequencies > 5%)
  Complex diseases (e.g. cancer)
  Small/moderate genetic effects (OR ~ 1.3)

Genetic Association
  [Figure: disease outcome Y = 1 or 0; tag SNP with alleles A, a (a: 10%, $p_a$); disease locus with allele G: 1% ($p_G$)]

Observed data for cases and controls: genotype counts at SNP i

Genotype    aa     Aa     AA     Total
Cases       r_0    r_1    r_2    R
Controls    s_0    s_1    s_2    S
Total       n_0    n_1    n_2    N

Assumed Penetrance Model
  Y = 1 for a diseased individual, Y = 0 for a healthy individual
  $$P(Y=1 \mid X_i) = \frac{\exp(\mu + \beta_i X_i)}{1 + \exp(\mu + \beta_i X_i)}$$
  $X_i$ is a score attached to the genotype of SNP i
    additive model: $X_i$ = number of A alleles (0, 1, 2)
    dominant model: $X_i = 1$ if {Aa, AA}; $X_i = 0$ if {aa}
    recessive model: $X_i = 1$ if {AA}; $X_i = 0$ if {aa, Aa}
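The three genotype codings and the logistic penetrance can be written out as a short sketch. The function names and the parameter names `mu` (intercept) and `beta` (log odds ratio per unit of score) are illustrative choices, not part of the slides.

```python
import math


def genotype_score(n_A, model="additive"):
    """Score X for a genotype carrying n_A copies of allele A (0 = aa, 1 = Aa, 2 = AA)."""
    if n_A not in (0, 1, 2):
        raise ValueError("n_A must be 0, 1 or 2")
    if model == "additive":
        return n_A
    if model == "dominant":
        return 1 if n_A >= 1 else 0
    if model == "recessive":
        return 1 if n_A == 2 else 0
    raise ValueError("unknown model: " + model)


def penetrance(x, mu, beta):
    """Logistic penetrance P(Y = 1 | X = x) = exp(mu + beta*x) / (1 + exp(mu + beta*x))."""
    return 1.0 / (1.0 + math.exp(-(mu + beta * x)))


# Heterozygote (Aa) under each coding.
scores = {m: genotype_score(1, m) for m in ("additive", "dominant", "recessive")}
```

Under this model the odds of disease are multiplied by exp(beta) for each unit increase in the score, so with the additive coding exp(beta) is the per-allele odds ratio.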

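Tests for Association based on tag SNP i
  Test $H_0: \beta_i = 0$ using
  Wald test: $W_i = \hat\beta_i^2 / \widehat{\mathrm{Var}}(\hat\beta_i)$
  Score test: $S_i = U_i^2 / \widehat{\mathrm{Var}}_0(U_i)$, where $U_i = (1 - \hat p) \sum_{\text{cases}} X_i - \hat p \sum_{\text{controls}} X_i$ and $\hat p$ is the proportion of cases in the sample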
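As a rough illustration of the score test, the sketch below computes U, its null variance and the 1-df chi-square statistic from a 2 x 3 table of genotype counts with additive scores 0, 1, 2. It assumes the usual logistic score U = sum over subjects of X (Y - p_hat), with p_hat the proportion of cases, and the corresponding null variance p_hat (1 - p_hat) [sum x^2 n_x - (sum x n_x)^2 / N]; the function name and the example counts are hypothetical.

```python
import math


def score_trend_test(cases, controls):
    """Score (trend) test from genotype counts.

    cases = (r0, r1, r2) and controls = (s0, s1, s2) are counts for X = 0, 1, 2.
    Returns the chi-square statistic (1 df) and its p-value.
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    p_hat = R / N                                   # proportion of cases
    sum_cases = sum(x * r for x, r in enumerate(cases))
    sum_controls = sum(x * s for x, s in enumerate(controls))
    U = (1.0 - p_hat) * sum_cases - p_hat * sum_controls
    totals = [r + s for r, s in zip(cases, controls)]
    sx = sum(x * n for x, n in enumerate(totals))
    sxx = sum(x * x * n for x, n in enumerate(totals))
    var_U = p_hat * (1.0 - p_hat) * (sxx - sx * sx / N)   # null variance of U
    stat = U * U / var_U
    p_value = math.erfc(math.sqrt(stat / 2.0))      # upper tail of chi-square, 1 df
    return stat, p_value


# Hypothetical counts for 1000 cases and 1000 controls.
stat, p_value = score_trend_test(cases=(600, 350, 50), controls=(700, 270, 30))
```

With equal numbers of cases and controls, p_hat = 0.5 and U reduces to half the difference between the summed scores of cases and controls.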

Remark on the null hypothesis
  $H_0$: no association between SNP i and disease. $H_0$ is true if either of the following holds:
  1. Disease has no genetic component
  2. D = 0 (true disease locus not in LD with the observed SNP)

Factors causing false positive associations
  Population stratification
    Confounding by ethnicity (bias)
  Cryptic relatedness
    Variance distortion
  Genomic control methods (Devlin & Roeder, 1999)
  Differential genotyping error (Clayton et al., Nature Genetics, 2005)
    Failure to call a genotype is not independent of case-control status
  Multiple comparisons: test 550,000 hypotheses

Controlling for Multiplicity
  Control experiment-wise Type I error
    Bonferroni correction: $\alpha / N \approx 10^{-7}$ for N = 550,000 tests
    Refinements (e.g. Simes test)
    Adaptations for two-phase designs (Skol et al., 2006)
  Control false discovery rate (FDR) (Benjamini and Hochberg, 1995)
  Ranking and selection
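The Bonferroni figure can be checked with a few lines of arithmetic. The sketch assumes an experiment-wise level of 0.05, which the slide does not state explicitly, and converts the per-test level into the critical value of a 1-df chi-square statistic.

```python
from statistics import NormalDist

alpha = 0.05          # assumption: experiment-wise Type I error
n_tests = 550_000     # number of tag SNPs tested

per_test_alpha = alpha / n_tests          # about 9.1e-8, i.e. roughly 1e-7

# For a 1-df chi-square statistic, P(Z^2 > c) = per_test_alpha
# is equivalent to c = z^2 with z the upper per_test_alpha/2 normal quantile.
z = NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)
chi2_threshold = z * z
```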

Motivating Example: Cancer Genetic Markers of Susceptibility (CGEMS)
  http://cgems.cancer.gov/
  A three-year, $14 million initiative
  Outcome: prostate cancer (finished), breast cancer (ongoing)
  Combine data from 5 large studies
  550K chip: measures N = 550,000 tag SNPs
  Yeager M, et al. Nature Genetics, 2007.

GWAS Designs
  Single-stage design: all markers measured on all samples
  Two-stage design:
    Stage 1: a proportion of the available samples is genotyped on a large number of markers
    Stage 2: a proportion of these markers is followed up by genotyping them on the remaining samples

Ranking and Selection for Single Stage Designs
  Sample n unrelated cases (Y = 1) and n controls (Y = 0)
  For each subject, measure $X_i = 0, 1, 2$, the number of minor alleles at SNP i, for i = 1, ..., N

Models for Genetic Effects in Source Population
  $X_i = 0, 1, 2$: number of minor alleles at the ith SNP
  SNPs 1, 2, ..., M: disease-associated SNPs
  SNPs M+1, ..., N: non-disease-associated SNPs
  Probability of disease in the source population:
  $$P(Y=1 \mid X_1, \ldots, X_N) = \frac{\exp(\mu + \sum_{i=1}^{M} \beta_i X_i)}{1 + \exp(\mu + \sum_{i=1}^{M} \beta_i X_i)}$$

Models for the genetic effects
  Fixed effects model: $\beta_i = \beta$ for i = 1, 2, ..., M; $\beta_i = 0$ for i = M+1, ..., N
  Random effects model: $\beta_i \sim N(0, \tau^2)$ for i = 1, 2, ..., M; $\beta_i = 0$ for i = M+1, ..., N
    $E|\beta_i| = \tau (2/\pi)^{1/2} \approx 0.798\,\tau$
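The factor 0.798 is the mean of the absolute value of a standard normal variable, (2/pi)^{1/2}. The sketch below checks it numerically and shows how one might pick tau so that the average absolute log odds ratio matches a target odds ratio; the helper name `tau_for_mean_abs_log_or` is hypothetical.

```python
import math
import random

mean_abs_factor = math.sqrt(2.0 / math.pi)   # E|beta| / tau for beta ~ N(0, tau^2)


def tau_for_mean_abs_log_or(odds_ratio):
    """Choose tau so that E|beta| = log(odds_ratio), i.e. 0.798 * tau = log(OR)."""
    return math.log(odds_ratio) / mean_abs_factor


# Monte Carlo check for a target odds ratio of 1.2.
rng = random.Random(1)
tau = tau_for_mean_abs_log_or(1.2)
draws = [abs(rng.gauss(0.0, tau)) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)            # should be close to log(1.2)
```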

Chi-Square Statistics for Detecting Disease Association
  Wald test: $W_i = \hat\beta_i^2 / \widehat{\mathrm{Var}}(\hat\beta_i)$
  Score test: $S_i = U_i^2 / \widehat{\mathrm{Var}}_0(U_i)$, where $U_i = 0.5\,(\sum_{\text{cases}} X_i - \sum_{\text{controls}} X_i)$
  Use a 2-sided test with additive scores, X = 0, 1, 2; both tests are unchanged by which allele is designated the minor allele

Definitions for Ranking and Selection in One Stage Design
  SNP i is T-selected if $W_i$ is among the top T values, i.e. rank($W_i$) > N - T
  Detection probability (DP): probability that a disease SNP will be T-selected
  Proportion positive (PP): proportion of true disease SNPs among the T selected SNPs

Show: use of marginal model in cases and controls is appropriate
  Assumption: tag SNPs are independent in the source population, i.e. $X_1$ is independent of $X_2, \ldots, X_N$
  For a rare disease, $P(Y=1 \mid X) \approx \exp(\mu + \sum_{i=1}^{M} \beta_i X_i)$
  For a single disease SNP, say SNP 1, $P(Y=1 \mid X_1) \approx \exp(\mu^* + \beta_1 X_1)$, where $\mu^* = \mu + \log\{E \exp(\sum_{i=2}^{M} \beta_i X_i)\}$

Marginal model in cases and controls
  In the case-control population, $\mathrm{logit}\{P(Y=1 \mid X_1)\} \approx \mu^{**} + \beta_1 X_1$, where $\mu^{**} = \mu^* + \log(\pi_1 / \pi_0)$ and $\pi_i = P(\text{sampled} \mid Y = i)$, i = 0, 1

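If tag SNPs are independent in the source population, then tag SNPs are independent in cases and controls
  Let $\theta_{ki} = P(X_i = k)$ in the source population
  Controls:
    $g_{ki} = P(X_i = k \mid Y = 0) \approx \theta_{ki}$
    $g_{klih} = P(X_i = k, X_h = l \mid Y = 0) \approx \theta_{ki}\,\theta_{lh}$
  Cases:
    $f_{ki} = P(X_i = k \mid Y = 1) = \theta_{ki} \exp(\beta_i k) \{\sum_{l=0}^{2} \theta_{li} \exp(\beta_i l)\}^{-1}$
    $f_{klih} = P(X_i = k, X_h = l \mid Y = 1) = \theta_{ki}\,\theta_{lh} \exp(\beta_i k + \beta_h l) \{\sum_{s_1=0}^{2} \sum_{s_2=0}^{2} \theta_{s_1 i}\,\theta_{s_2 h} \exp(\beta_i s_1 + \beta_h s_2)\}^{-1} = f_{ki}\, f_{lh}$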

Properties of test statistics $W_i$
  Based on independence of genotypes in cases and controls, we compute expected values of the prospective estimating equations and their cross products using the retrospective sampling distributions, and show that the scores are uncorrelated
  Estimating each $\beta_i$ using separate logistic models yields independent estimates $\hat\beta_i$, thus the $W_i$ are independent
  Similar results hold for the score test $S_i$

Analytic Calculation of DP and PP
  Special case: disease SNPs have the same allele frequency and $\beta$
  Then $W_i \sim G$ for i = 1, ..., M: non-central chi-square (1 df) with non-centrality $\beta^2 / \mathrm{Var}(\hat\beta)$ for the fixed effects model
  $W_i \sim F$, central chi-square (1 df), for i = M+1, ..., N
  $$DP = \int_0^\infty g(c) \left[ \sum_{m=0}^{\min(M-1,\,T-1)} \binom{M-1}{m} G(c)^{M-1-m} \{1 - G(c)\}^{m} \sum_{s=0}^{T-m-1} \binom{N-M}{s} \{1 - F(c)\}^{s} F(c)^{N-M-s} \right] dc$$
  PP = (M/T)(DP)
  Generalizes to different $G_i$ for various disease SNPs with differing allele frequencies and $\beta_i$
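A minimal numerical sketch of this DP integral, for the special case of a common non-centrality, is given below. It relies on two facts about 1-df chi-square variables: F(c) = erf(sqrt(c/2)), and a non-central chi-square with non-centrality `ncp` is the square of a N(sqrt(ncp), 1) variable. Substituting c = z^2 removes the singularity of g at zero. The function names, the trapezoid rule and the step count are illustrative choices, and the small N in the example is only there to keep pure-Python run time short.

```python
import math


def _phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _Phi(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def _binom_pmf(k, n, p):
    """Binomial(n, p) probability of k, safe at p = 0 and p = 1."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def detection_probability(M, N, T, ncp, steps=2000):
    """DP for M disease SNPs sharing non-centrality ncp among N SNPs, top T selected."""
    root = math.sqrt(ncp)
    z_max = root + 10.0
    h = z_max / steps
    total = 0.0
    for j in range(steps + 1):
        z = j * h                                  # z = sqrt(c)
        F = math.erf(z / math.sqrt(2.0))           # central chi-square CDF at c
        G = _Phi(z - root) - _Phi(-z - root)       # non-central chi-square CDF at c
        inner = 0.0
        for m in range(min(M - 1, T - 1) + 1):     # other disease SNPs above c
            p_m = _binom_pmf(m, M - 1, 1.0 - G)
            s_max = min(T - 1 - m, N - M)          # null SNPs allowed above c
            p_s = sum(_binom_pmf(s, N - M, 1.0 - F) for s in range(s_max + 1))
            inner += p_m * p_s
        weight = 0.5 if j in (0, steps) else 1.0   # trapezoid rule
        total += weight * (_phi(z - root) + _phi(z + root)) * inner * h
    return total


dp = detection_probability(M=1, N=2000, T=20, ncp=9.0)
pp = 1 * dp / 20                                   # PP = (M/T)(DP)
```

Two sanity checks follow from the definition: with `ncp = 0` all SNPs are exchangeable, so DP must equal T/N, and with T = N every SNP is selected, so DP must equal 1.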

Approximation for M = 1 Disease SNP
  $DP \approx 1 - G\{F^{-1}(1 - T/N)\}$
  In this case, DP is equivalent to the power of a test with the same non-centrality but with type I error (alpha-level) T/N
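The M = 1 approximation is easy to evaluate with the standard library alone. In the sketch below, `approx_ncp` is an added assumption rather than the slides' calculation: it uses the null-information approximation Var(beta_hat) ~ 1 / (n maf (1 - maf)) for n cases and n controls under Hardy-Weinberg proportions, whereas the slides obtain the variance from the full case-control information matrix. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_norm = NormalDist()


def central_chi2_quantile(p):
    """F^{-1}(p) for a central chi-square with 1 df."""
    z = _norm.inv_cdf((1.0 + p) / 2.0)
    return z * z


def noncentral_chi2_cdf(c, ncp):
    """G(c) for a non-central chi-square with 1 df and non-centrality ncp."""
    root_c, root_ncp = math.sqrt(c), math.sqrt(ncp)
    return _norm.cdf(root_c - root_ncp) - _norm.cdf(-root_c - root_ncp)


def dp_single_disease_snp(T, N, ncp):
    """DP ~ 1 - G{F^{-1}(1 - T/N)}: power at alpha-level T/N."""
    return 1.0 - noncentral_chi2_cdf(central_chi2_quantile(1.0 - T / N), ncp)


def approx_ncp(odds_ratio, n, maf):
    """Assumption: ncp = beta^2 * n * maf * (1 - maf) for n cases and n controls."""
    beta = math.log(odds_ratio)
    return beta * beta * n * maf * (1.0 - maf)


# Hypothetical setting: per-allele OR 1.2, 1000 cases and 1000 controls,
# MAF 0.276, top T = 25,000 of N = 550,000 SNPs selected.
dp = dp_single_disease_snp(T=25_000, N=550_000, ncp=approx_ncp(1.2, 1000, 0.276))
```

Because the threshold depends only on T/N, raising T acts exactly like relaxing the alpha-level of a single test.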

Simulations
  Brute-force simulation is not feasible; use analytic results, including independence
  For each simulation ISIM = 1, ..., NSIM (NSIM = 10,000):
    For SNPs i = 1, 2, ..., M obtain fixed or random $\beta_i$
    For SNPs i = M+1, ..., N let $\beta_i = 0$
    For each SNP:
      Draw a random minor allele frequency (MAF) from the distribution of MAFs in CGEMS controls (MAF > 0.05): mean = 0.276, median = 0.26, IQR: 0.15-0.38
      Solve for $\mu^{**}$ such that P(Y=1) = 0.5
      Compute the information matrix from the case-control logistic model, using retrospective sampling distributions
      Compute $\mathrm{Var}(\hat\beta_i)$
      Sample $\hat\beta_i$ from $N(\beta_i, \mathrm{Var}(\hat\beta_i))$
      Compute $W_i = \hat\beta_i^2 / \mathrm{Var}(\hat\beta_i)$

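Simulated Estimates of DP and PP
  Define I(m, ISIM, T) = 1 if rank($W_m$) > N - T in simulation ISIM, 0 otherwise
  $$\widehat{DP} = \frac{1}{NSIM} \frac{1}{M} \sum_{m=1}^{M} \sum_{ISIM=1}^{NSIM} I(m, ISIM, T)$$
  This estimates the probability that a given disease SNP is T-selected, i.e. the proportion of disease SNPs selected
  $\widehat{PP} = M\,\widehat{DP} / T$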
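The simulation recipe and the estimators of DP and PP can be sketched as follows, under simplifying assumptions that differ from the slides: MAFs are drawn uniformly on [0.05, 0.5] as a stand-in for the CGEMS distribution, Var(beta_hat) uses the null-information approximation 1 / (n maf (1 - maf)) instead of the full information matrix, and only the fixed effects model is covered. The function name and the small default sizes are illustrative.

```python
import math
import random


def simulate_dp_pp(M, N, T, n, beta, nsim=500, seed=0):
    """Monte Carlo estimates of DP and PP for a one-stage design.

    SNPs 0..M-1 are disease SNPs with log odds ratio beta; the rest are null.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        stats = []
        for i in range(N):
            maf = rng.uniform(0.05, 0.5)             # assumption: stand-in MAF distribution
            var = 1.0 / (n * maf * (1.0 - maf))      # assumption: approximate Var(beta_hat)
            true_beta = beta if i < M else 0.0
            beta_hat = rng.gauss(true_beta, math.sqrt(var))
            stats.append(beta_hat * beta_hat / var)  # Wald chi-square W_i
        cutoff = sorted(stats, reverse=True)[T - 1]  # T-th largest statistic
        hits += sum(1 for w in stats[:M] if w >= cutoff)   # disease SNPs that are T-selected
    dp_hat = hits / (nsim * M)
    pp_hat = M * dp_hat / T
    return dp_hat, pp_hat


dp_hat, pp_hat = simulate_dp_pp(M=2, N=200, T=10, n=1000, beta=math.log(1.3), nsim=200)
```

Ranking by W_i and taking the top T is the same as the condition rank(W_i) > N - T in the definition of I(m, ISIM, T).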

Results for Fixed Effects Model

DP for Fixed Effects Model. n=1000 black; n=8000 red. Odds ratios 1.1, 1.2, 1.3, 1.5, 2.0. T on log scale.

PP on log scale versus T on log scale. M=1 black; M=100 red. Odds ratios 1.2, 1.3, 1.5, 2.0.

Results for Random Effects Model

DP for Random Effects Model. n=1000 black; n=8000 red. 0.798τ=log of 1.1, 1.2, 1.3, 1.5, 2.0. T on log scale.

Summary: single stage design
  DP and PP are useful criteria for ranking
  Related work: Zaykin & Zhivotovsky, 2005; Satagopan et al., 2004
  For OR = 1.2, a large T is required to assure adequate DP: T = 25,000 yields DP = 0.69 (fixed effects model)
  DP is larger for the fixed effects model than for the random effects model
  DP is decreased by increasing M for M > T
  PP decreases with T for T > M
  Increasing T to increase DP decreases PP and is futile if n is too small

Extension: two stage designs
  Stage 1: a proportion $\pi$ of the samples is genotyped on 550,000 SNPs
  Stage 2: the $T_1$ SNPs with the largest chi-square statistics are followed up by genotyping on the remaining samples
  Replication analysis: view stage 2 as a replication study; the final selection of SNPs depends on ranking only the chi-square statistics from stage 2
  Joint analysis: for each of the $T_1$ SNPs selected in stage 1, compute $\lambda C_1 + (1 - \lambda) C_2$ for $\lambda$ = 0.0, 0.05, 0.1, ..., 1.0, where $C_i$ is the chi-square statistic observed in stage i
    For each value of $\lambda$ estimate DP; present the maximal DP with the corresponding $\lambda$
  Skol et al., Nature Genetics, 2006
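A rough simulation sketch of the two-stage procedure is shown below. It carries the same added assumptions as a simple one-stage simulation (uniform MAFs on [0.05, 0.5], Var(beta_hat) approximated by the inverse of the null information, fixed effects only), splits the information between stages in proportions pi and 1 - pi, and ranks the stage-1 survivors on lam * C1 + (1 - lam) * C2. Setting `lam = 0` gives the replication analysis. The function name, argument names and example sizes are hypothetical.

```python
import math
import random


def two_stage_dp(M, N, T1, T2, n, pi, beta, lam, nsim=200, seed=0):
    """Monte Carlo DP for a two-stage design with stage-1 sample fraction pi (0 < pi < 1).

    T1 SNPs pass stage 1; the final T2 (T2 <= T1) are chosen on lam*C1 + (1-lam)*C2.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        c1, c2 = [], []
        for i in range(N):
            maf = rng.uniform(0.05, 0.5)            # assumption: stand-in MAF distribution
            info = n * maf * (1.0 - maf)            # assumption: null information for beta
            true_beta = beta if i < M else 0.0
            v1 = 1.0 / (pi * info)                  # stage-1 variance of beta_hat
            v2 = 1.0 / ((1.0 - pi) * info)          # stage-2 variance of beta_hat
            b1 = rng.gauss(true_beta, math.sqrt(v1))
            b2 = rng.gauss(true_beta, math.sqrt(v2))
            c1.append(b1 * b1 / v1)
            c2.append(b2 * b2 / v2)
        stage1 = sorted(range(N), key=c1.__getitem__, reverse=True)[:T1]
        final = sorted(stage1, key=lambda i: lam * c1[i] + (1.0 - lam) * c2[i],
                       reverse=True)[:T2]
        hits += sum(1 for i in final if i < M)      # disease SNPs in the final selection
    return hits / (nsim * M)


# Compare replication (lam = 0) with a joint analysis on a coarse grid of weights.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
dp_by_lam = {lam: two_stage_dp(M=2, N=200, T1=40, T2=5, n=1000, pi=0.25,
                               beta=math.log(1.3), lam=lam, nsim=100)
             for lam in grid}
best_lam = max(dp_by_lam, key=dp_by_lam.get)
```

The grid here is much coarser than the 0.05 steps described above, purely to keep the example quick.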

Two stage designs, cont.
  Compute the probability that exactly $M_2$ disease SNPs are selected after stage 1, given that $T_1$ SNPs were selected and there were $K_1$ non-disease SNPs and $M_1$ disease SNPs
  Use the joint densities of the $(M_1 - M_2 + 1)$th and $(M_1 - M_2)$th order statistics of the $M_1$ disease SNPs, and of the $(K_1 - T_1 + M_2 + 1)$th and $(K_1 - T_1 + M_2)$th order statistics of the $K_1$ non-disease SNPs
  $W^1_{(i)}$: order statistics of disease SNPs
  $W^0_{(i)}$: order statistics of non-disease SNPs

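Analytic expression
  Fixed effects model, same allele frequencies for all disease SNPs
  $$P(\text{exactly } M_2 \text{ disease SNPs chosen after stage 1}) = P\left(W^1_{(M_1-M_2+1)} > W^0_{(K_1-T_1+M_2)};\; W^0_{(K_1-T_1+M_2+1)} > W^1_{(M_1-M_2)}\right)$$
  $$= \int_0^\infty \int_0^{w_1} P\left(W^0_{(K_1-T_1+M_2+1)} > w_2;\; w_1 > W^0_{(K_1-T_1+M_2)}\right) dG(w_1, w_2)$$
  $$= \frac{K_1!}{(K_1-T_1+M_2-1)!\,(T_1-M_2-1)!}\; \frac{M_1!}{(M_1-M_2-1)!\,(M_2-1)!} \int_0^\infty \int_0^{w_1} \int_{w_2}^\infty \int_0^{\min(x_1, w_1)} \{1-G(w_1)\}^{M_2-1}\, G(w_2)^{M_1-M_2-1}\, g(w_1)\, g(w_2)\, \{1-F(x_1)\}^{T_1-M_2-1}\, F(x_2)^{K_1-T_1+M_2-1}\, f(x_1)\, f(x_2)\; dx_2\, dx_1\, dw_2\, dw_1$$
  Here $w_1 > w_2$ are the values of $W^1_{(M_1-M_2+1)}$ and $W^1_{(M_1-M_2)}$, and $x_1 > x_2$ are the values of $W^0_{(K_1-T_1+M_2+1)}$ and $W^0_{(K_1-T_1+M_2)}$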

Special cases
  Special case $M_2 = 0$:
  $$P(\text{zero disease SNPs chosen after stage 1}) = P\left(W^0_{(K_1-T_1+1)} > W^1_{(M_1)}\right) = \frac{K_1!}{(K_1-T_1)!\,(T_1-1)!} \int_0^\infty \int_0^{y} \{1-F(y)\}^{T_1-1} F(y)^{K_1-T_1} f(y)\, M_1\, G(x)^{M_1-1} g(x)\, dx\, dy$$
  Special case $M_2 = M_1$:
  $$P(\text{all disease SNPs chosen after stage 1}) = P\left(W^1_{(1)} > W^0_{(K_1-T_1+M_1)}\right) = \frac{K_1!}{(K_1-T_1+M_1-1)!\,(T_1-M_1)!} \int_0^\infty \int_y^\infty F(y)^{K_1-T_1+M_1-1} \{1-F(y)\}^{T_1-M_1} f(y)\, M_1\, \{1-G(x)\}^{M_1-1} g(x)\, dx\, dy$$
  Special case $M_2 = T_1$:
  $$P(\text{only disease SNPs chosen after stage 1}) = P\left(W^1_{(M_1-M_2+1)} > W^0_{(K_1)}\right) = \frac{M_1!}{(M_1-M_2)!\,(M_2-1)!} \int_0^\infty \int_y^\infty K_1 F(y)^{K_1-1} f(y)\, G(x)^{M_1-M_2} \{1-G(x)\}^{M_2-1} g(x)\, dx\, dy$$

Probability of detecting a disease SNP (DP) and optimal stage-1 weight (lambda_opt) for the fixed effects model with log odds ratio per allele $\beta = \log(1.2)$, for 8000 cases and 8000 controls. Number of disease SNPs, M0: 1.

                                   T1 = 1000             T1 = 25,000
Stage-1 fraction  Analysis         T2 = 1    T2 = 100    T2 = 1    T2 = 100
0.125             Replicate        .266      .269        .630      .664
                  Joint            .267      .269        .635      .664
                  lambda_opt       .25       .27         .35       .25
0.25              Replicate        .612      .626        .803      .881
                  Joint            .613      .626        .825      .885
                  lambda_opt       .32       .15         .40       .25
0.50              Replicate        .769      .897        .821      .912
                  Joint            .843      .900        .858      .953
                  lambda_opt       .40       .20         .45       .40
1.00              One stage (a)    .882      .966        .882      .966

Summary, two stage design
  A small first stage can only partially be compensated for by selecting a large number of SNPs for further testing
  Even joint analysis does not improve DP much if the stage-1 sample fraction $\pi \le 0.25$
  As the cost per genotype for stage 1 drops compared to later stages, economic incentives for multi-stage designs decrease
  Advantage of the one-stage design: it offers unbiased estimates of genetic effects that can be combined easily with those from other studies; multi-stage designs introduce selection biases that complicate meta-analyses

References
  Gail MH, Pfeiffer RM, Wheeler W, Pee D. Probability of Detecting Disease-Associated Single Nucleotide Polymorphisms in Case-Control Studies with Whole Genome Scans. Biostatistics, in press.
  Pfeiffer R, Gail MH. 2003. Genetic Epidemiology, 25(2): 136-48.
  Skol et al. 2006. Nature Genetics, 38: 209-13.
  Dudbridge 2006. Am J Hum Genet, 78: 1094-95.
  Marchini et al. 2006. Nature Genetics, 2005: 413-7.
  Freedman et al. 2004. Nature Genetics, 36: 388-395.
  Marchini et al. 2004. Nature Genetics, 36: 512-517.
  Campbell et al. 2005. Nature Genetics, 37: 868-872.
  Devlin and Roeder 1999. Biometrics, 55: 997-1004.
  Price et al. 2006. Nature Genetics, 38: 904-909.
  Pritchard JK et al. 2000. Genetics, 155: 945-959.
  Reich and Goldstein 2001. Genetic Epidemiology, 20: 4-16.
  Satten GA et al. 2001. Am J Hum Genet, 68: 466-477.

Platforms & SNP Chips
  Affymetrix 100K, Affymetrix 500K: essentially random set of SNPs
  Illumina 317K, Illumina 550K: designed using HapMap
  Illumina 650Y (550K + 100K YRI fill-in)