GWAS: Genotype-Phenotype Association. CMSC858P, Spring 2012. Hector Corrada Bravo, University of Maryland.


1 Genotype-Phenotype Association
CMSC858P, Spring 2012
Hector Corrada Bravo, University of Maryland

GWAS
Genome-wide association studies scan for SNPs (or other structural variants) that show association with some phenotype.
categorical phenotypes: e.g., age-related macular degeneration
continuous phenotypes (QTL): e.g., blood pressure
Commonly: 10^3 samples, 10^6 SNPs

Logistic regression
Binary outcome: disease / no disease
Predictors: genotypes
Estimate the log odds ratio; f is linear:
$\theta(x) = \Pr\{y = 1 \mid x\}$
$f(x) = \log \frac{\theta(x)}{1 - \theta(x)} = \beta_0 + \beta_1 x$

Encoding genotype data
We usually think of major/minor alleles, where the minor allele occurs at a lower frequency in the population (e.g., 5%).
haplotype:
  minor allele: AA, Aa -> x=0; aa -> x=1
  major allele: AA, Aa -> x=1; aa -> x=0
  both: Aa -> x1=1, x2=1; aa -> x1=1, x2=0; etc.
genotype (dosage): AA -> x=0; Aa -> x=1; aa -> x=2
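The encodings above can be sketched in a few lines of Python. This is only an illustration: the function names and the scheme labels "minor", "major" and "dosage" are made up here, and the genotype is assumed to be a two-letter string in which lowercase "a" is the minor allele.

```python
def count_minor(genotype, minor="a"):
    """Number of copies of the minor allele in a two-letter genotype."""
    return sum(1 for allele in genotype if allele == minor)


def encode(genotype, scheme):
    """Encode 'AA', 'Aa' or 'aa' under one of three schemes (labels are hypothetical)."""
    k = count_minor(genotype)
    if scheme == "minor":   # AA, Aa -> 0; aa -> 1
        return int(k == 2)
    if scheme == "major":   # AA, Aa -> 1; aa -> 0
        return int(k < 2)
    if scheme == "dosage":  # AA -> 0; Aa -> 1; aa -> 2
        return k
    raise ValueError(f"unknown scheme: {scheme!r}")


genotypes = ["AA", "Aa", "aa"]
table = {s: [encode(g, s) for g in genotypes] for s in ("minor", "major", "dosage")}
for scheme, codes in table.items():
    print(scheme, dict(zip(genotypes, codes)))
```

The dosage code is the one that plugs directly into a single-predictor model $f(x) = \beta_0 + \beta_1 x$; the two binary codes trade that additive assumption for a recessive-style or dominant-style contrast.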

2 Interpretation
Odds of outcome for, e.g., genotype AA:
$\frac{P(Y=1 \mid X=0)}{P(Y=0 \mid X=0)} = e^{\beta_0}$
Odds of outcome for, e.g., genotype Aa:
$\frac{P(Y=1 \mid X=1)}{P(Y=0 \mid X=1)} = e^{\beta_0+\beta_1}$
Odds ratio:
$\frac{P(Y=1 \mid X=1)/P(Y=0 \mid X=1)}{P(Y=1 \mid X=0)/P(Y=0 \mid X=0)} = e^{\beta_1}$

GWAS
Discovering association: how unexpected is this odds ratio?
Expensive and pervasive...
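A small worked example may help with the interpretation. With a single binary predictor, the fitted logistic model reproduces the observed odds in each group, so $\beta_0$ and $\beta_1$ can be read off a 2x2 table directly. The counts below are invented for illustration and do not come from any study.

```python
import math


def logistic(f):
    """Inverse of the log odds: theta = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + math.exp(-f))


# Hypothetical case/control counts by genotype code X (illustrative only).
cases = {0: 100, 1: 150}
controls = {0: 400, 1: 300}

odds_x0 = cases[0] / controls[0]     # odds of disease when X = 0
odds_x1 = cases[1] / controls[1]     # odds of disease when X = 1

beta0 = math.log(odds_x0)            # intercept: log odds at X = 0
beta1 = math.log(odds_x1 / odds_x0)  # slope: log odds ratio

print("odds ratio exp(beta1):", math.exp(beta1))
print("P(Y=1 | X=0):", logistic(beta0))
print("P(Y=1 | X=1):", logistic(beta0 + beta1))
```

Here the odds double between the two genotype groups, so $e^{\beta_1} = 2$; whether an odds ratio of that size is surprising is the question the association test answers.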

Published Genome-Wide Associations through 12/2010: 1212 published GWA at p < 5x10^-8 for 210 traits. NHGRI GWA Catalog, www.genome.


3 GWAS
[Figure: Published Genome-Wide Associations, NHGRI GWA Catalog; annotation: "Most diseases here"]

Testing for marginal effects is limited:
Epistasis, interactions
Environment/risk factors, unaccounted dependencies
Not all SNPs are created equal (annotation)

4 Examining the relative influence of familial, genetic, and environmental covariate information in flexible risk models
Héctor Corrada Bravo (a), Kristine E. Lee (b), Barbara E. K. Klein (b), Ronald Klein (b), Sudha K. Iyengar (c), and Grace Wahba (d)
(a) Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205; (b) Department of Ophthalmology and Visual Science, University of Wisconsin, Madison, WI 53706; (c) Departments of Epidemiology and Biostatistics, Genetics, and Ophthalmology, Case Western Reserve University, Cleveland, OH 44106; and (d) Departments of Statistics, Biostatistics and Medical Informatics, and Computer Sciences, University of Wisconsin, Madison, WI 53706
Contributed by Grace Wahba, March 9, 2009 (sent for review February 22, 2009)

Environment/risk factors, unaccounted dependencies
How to incorporate subject dependence

Splines and BDES
History of Smoothing Spline (SS) models for analyzing BDES data [Wahba et al. 1998a,b, 1999, 2000, 2002, 2006]
In particular, the SS-ANOVA model of pigmentary abnormalities [Ann. Statistics 28 (2000)]

SS-ANOVA [Ann. Statistics 28 (2000)]
Model for pigmentary abnormalities (PA), female BDES I subjects:
$f(t) = \mu + f_1(\mathrm{sysbp}) + f_2(\mathrm{chol}) + f_{12}(\mathrm{sysbp}, \mathrm{chol}) + d_{\mathrm{age}} \cdot \mathrm{age} + d_{\mathrm{bmi}} \cdot \mathrm{bmi} + d_{\mathrm{horm}} I_1(\mathrm{horm}) + d_{\mathrm{hist}} I_2(\mathrm{hist})$
where horm is hormone replacement (yes/no) and hist is a history of heavy drinking.

SS-ANOVA
[Figure: estimated probability vs. cholesterol; panels by bmi (32.2, 28, 24.6) and age (55, 66, 73); curves for sysbp = 109, 124, 139, ...]
Nonlinear protective effect of cholesterol

5 SS-ANOVA (w/ ARMS2)
Recent results link variation in specific genetic regions to AMD (age-related macular degeneration), in particular the CFH and LOC387715 (ARMS2) genes.
[Figure: estimated probability vs. cholesterol; panels by snp2 genotype (22, 12, 11) and age (48.5, 59.5, 69.5, 80.5); curves for sysbp = 109, 124, 139, ...]
Protective effect gone

Pedigrees
[Figure: pedigree diagram; legend: PA present / PA absent, male / female]

Pedigree Distance
Use Malécot's kinship coefficient ($\varphi$): for subjects i and j, the probability that randomly chosen alleles, one from each subject, are identical by descent.
e.g., parent-offspring: 1/4
e.g., siblings: 1/4
Pedigree distance: $d_{ij} = -\log_2(2\varphi_{ij})$
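As a rough numeric illustration of the pedigree distance, the sketch below assumes the dissimilarity $-\log_2(2\varphi)$ (the form used in the PNAS 2009 paper cited in these slides, as best it can be read from the garbled slide text) and the standard kinship coefficients for an outbred pedigree. The function name and the treatment of unrelated pairs as infinitely distant are choices made here, not taken from the slides.

```python
import math


def pedigree_dissimilarity(phi):
    """Assumed form -log2(2 * phi); infinite when the kinship coefficient is 0."""
    if phi <= 0:
        return math.inf
    return -math.log2(2.0 * phi)


# Standard kinship coefficients, assuming no inbreeding.
kinship = {
    "parent-offspring": 1 / 4,
    "siblings": 1 / 4,
    "avuncular": 1 / 8,
    "first-cousins": 1 / 16,
    "unrelated": 0.0,
}

distance = {rel: pedigree_dissimilarity(phi) for rel, phi in kinship.items()}
for rel, d in distance.items():
    print(f"{rel:17s} phi={kinship[rel]:<7} distance={d}")
```

Under this assumption each halving of the kinship coefficient adds one unit of distance, which is what makes the relationship graph natural to embed.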

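6 Relationship Graph
[Figure: example pedigree and the corresponding relationship graph]
[Table: pedigree distance by relationship: sibs, avuncular, first-cousins, unrelated]
We will extend the SS-ANOVA model with an encoding of this relationship graph.

Metric embeddings
[Figure: embedding of the relationship graph]
Interpretation: the embedding gives relationship pseudo-attributes over which smooth functions can be estimated.

Extensions
$f(t) = \mu + d_{\mathrm{SNP1},1} I(X_1 = 12) + d_{\mathrm{SNP1},2} I(X_1 = 22) + d_{\mathrm{SNP2},1} I(X_2 = 12) + d_{\mathrm{SNP2},2} I(X_2 = 22) + f_1(\mathrm{sysbp}) + f_2(\mathrm{chol}) + f_{12}(\mathrm{sysbp}, \mathrm{chol}) + d_{\mathrm{age}} \cdot \mathrm{age} + d_{\mathrm{bmi}} \cdot \mathrm{bmi} + d_{\mathrm{horm}} I_1(\mathrm{horm}) + d_{\mathrm{hist}} I_2(\mathrm{hist}) + d_{\mathrm{smoke}} I_3(\mathrm{smoke}) + h(z(t))$
The SNP indicator terms carry the SNP data, the terms from sysbp through smoke carry the environmental covariates, and h(z(t)) carries the pedigree data.

Comparison to Covariate-Only Model
[Figure: percent change in mean AUC w.r.t. the C-only model, for models C only, S only, S+C, P only, S+P, C+P, S+C+P (S = SNP data, C = environmental covariates, P = pedigree data)]
[Corrada Bravo, et al., PNAS 2009]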

7 Epistasis
Testing marginal effects is limited. We want to test interactions (epistasis).
Modeling is straightforward: add non-linear interaction terms to the logistic regression model.
Computationally, it's a problem: we started with 10^6 SNPs...

RAPID detection of gene-gene interactions in genome-wide association studies
Dumitru Brinza (1), Matthew Schultz (2), Glenn Tesler (3) and Vineet Bafna (4)
(1) Life Technologies, Foster City, CA; (2) Graduate Bioinformatics Program, (3) Department of Mathematics and (4) Department of Computer Science and Engineering, Institute for Genomic Medicine, University of California, San Diego, CA, USA
Bioinformatics, Vol. 26, doi:10.1093/bioinformatics/btq529. Genetics and population analysis. Advance Access publication September 24, 2010. Associate Editor: Jeffrey Barrett

RAPID
A filtering approach:
Discover possible interactions quickly
Test good candidates completely

If two SNPs (x and y) associate with disease (d), then at least one of the following must hold:
1. x associates with d
2. y associates with d
3. x associates with y in cases
4. x associates with y in controls
RAPID finds SNPs where 3 holds.

Look at cases only, and define a vector for each SNP x, with one entry per case. For a case whose value at SNP x is $a \in \{0, 1\}$:
$v_x(a) = \frac{a - P_x}{\sqrt{n P_x (1 - P_x)}}$
where n is the number of cases and $P_x$ is the proportion of 1s.
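To put a number on the computational problem: testing every pair among 10^6 SNPs means on the order of 5 x 10^11 model fits. The sketch below computes the exact count and shows the simplest form an interaction term could take in the logistic model; the function name and coefficient names are illustrative assumptions, not notation from the slides.

```python
import math

n_snps = 10**6
n_pairs = math.comb(n_snps, 2)  # number of unordered SNP pairs
print(f"{n_pairs:,} SNP pairs to test exhaustively")


def log_odds_with_interaction(x, y, b0, bx, by, bxy):
    """Log odds with two marginal terms and one multiplicative interaction term."""
    return b0 + bx * x + by * y + bxy * x * y


# With bxy != 0 the effect of x depends on the value of y (epistasis).
print(log_odds_with_interaction(1, 0, b0=0.0, bx=0.1, by=0.2, bxy=0.5))
print(log_odds_with_interaction(1, 1, b0=0.0, bx=0.1, by=0.2, bxy=0.5))
```

This count is the reason for a filtering approach: a cheap screen discards most pairs so that the full test runs only on good candidates.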

8 RAPID
$\mathrm{dist}(v_x, v_y) = \sqrt{2 - 2\sqrt{\chi^2_{x,y}/n}}$
where $\chi^2_{x,y}$ measures the association between x and y.
Statistical association is now a geometric problem.

RAPID
Use random projections to find possible interacting pairs:
$\mathrm{Hash}(x, r, B) = \left\lfloor \frac{v_x \cdot r}{B} \right\rfloor$
Do this repeatedly, to avoid false positives.
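The link between distance and association can be checked numerically. The sketch below builds the normalized vectors for two made-up binary SNPs, confirms that the squared Euclidean distance equals 2 - 2r with r the correlation, and that the Pearson chi-square of the 2x2 table equals n r^2. The hash function is a generic random-projection bucket (floor of the projection divided by a bucket width), offered as an assumption about the general idea and not as RAPID's exact scheme; all names and data are hypothetical.

```python
import math
import random


def rapid_vector(bits):
    """Centre and scale a 0/1 vector so that it has unit Euclidean norm."""
    n = len(bits)
    p = sum(bits) / n
    scale = math.sqrt(n * p * (1.0 - p))
    return [(b - p) / scale for b in bits]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def chi_square(x, y):
    """Pearson chi-square of the 2x2 table of x against y (no continuity correction)."""
    n = len(x)
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            observed = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b)
            expected = x.count(a) * y.count(b) / n
            total += (observed - expected) ** 2 / expected
    return total


def hash_bucket(v, r, width):
    """Generic random-projection bucket: floor((v . r) / width)."""
    return math.floor(dot(v, r) / width)


# Made-up SNP values for 10 cases (illustrative only).
x = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
n = len(x)

vx, vy = rapid_vector(x), rapid_vector(y)
corr = dot(vx, vy)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vx, vy)))
chi2 = chi_square(x, y)

print("correlation:", corr)
print("chi-square:", chi2, " n * r^2:", n * corr**2)
print("distance:", dist, " sqrt(2 - 2*sqrt(chi2/n)):", math.sqrt(2 - 2 * math.sqrt(chi2 / n)))

rng = random.Random(0)
r = [rng.gauss(0.0, 1.0) for _ in range(n)]
buckets = (hash_bucket(vx, r, 0.5), hash_bucket(vy, r, 0.5))
print("buckets for x and y:", buckets)
```

Note that the distance formula with the inner square root holds as written for positively correlated pairs; for negative correlation the sign of r flips. Strongly associated SNPs lie close together, so they tend to share a bucket under a random projection, and repeating with fresh projections reduces chance collisions.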

9 Interactions/Epistasis
A MAJOR problem
Inherently computational and statistical
We are nowhere close
We will be inundated with data (sequencing)

Learning a Prior on Regulatory Potential from eQTL Data
Su-In Lee (1), Aimée M. Dudley (2), David Drubin (3), Pamela A. Silver (3), Nevan J. Krogan (4), Dana Pe'er (5), Daphne Koller (1)*
(1) Computer Science Department, Stanford University, Stanford, California, United States of America; (2) Institute for Systems Biology, Seattle, Washington, United States of America; (3) Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America; (4) Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, California, United States of America; (5) Department of Biological Sciences, Columbia University, New York, New York, United States of America

SNP annotation
Outcome is gene expression (eQTL)
The goal is to learn regulatory programs
Potential for a mutation to have an effect on expression depends on SNP features

Regulatory programs
Linear regression of gene expression on SNPs:
$y_{m,g} = w_{m,1} x_1 + w_{m,2} x_2 + \dots + w_{m,n} x_n + \epsilon$, for all g's in module m


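10 Regulatory Potential
$\Pr(\text{SNP } n \text{ causes variation in expression levels of genes}) = \mathrm{sigmoid}\left(\sum_k \beta_k f_{n,k}\right)$
where $f_{n,k}$ are the SNP features and $\mathrm{sigmoid}(t) = 1/(1 + \exp(-t))$.
Yeast dataset

Regulatory programs
Linear regression of gene expression on SNPs:
$y_{m,g} = w_{m,1} x_1 + w_{m,2} x_2 + \dots + w_{m,n} x_n + \epsilon$, for all g's in module m

Hierarchical model
$\Pr(w_r) \propto \exp(-C_r |w_r|)$

Estimation
minimize $\sum_{\text{module } m} \left[ \sum_{\text{gene } g} \left( y_{m,g} - \sum_{\text{regulator } r} w_{m,r} x_r \right)^2 + \sum_{\text{regulator } r} C_r |w_{m,r}| + D \sum_{\text{regulator } r} w_{m,r}^2 \right] + E \sum_k \beta_k^2$
The squared-error term is the expression fit.
$C_r = C_1 \Pr(\text{Regulator } r \text{ is causal}) + C_0 \left[1 - \Pr(\text{Regulator } r \text{ is causal})\right]$
SNP selection uses regulatory potential
Parameters estimated iteratively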
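The way regulatory potential feeds into SNP selection can be sketched as follows. The regulatory potential is a sigmoid of a weighted sum of SNP features, and the per-regulator penalty $C_r$ interpolates between two constants according to the probability that the regulator is causal. The sketch assumes that this probability is the regulatory potential itself and that $C_1 < C_0$, so likely-causal regulators are penalized less; the feature values, weights and constants are invented for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def regulatory_potential(features, beta):
    """sigmoid(sum_k beta_k * f_k) for one SNP's feature vector."""
    return sigmoid(sum(b * f for b, f in zip(beta, features)))


def l1_weight(p_causal, c1, c0):
    """C_r = C_1 * Pr(causal) + C_0 * (1 - Pr(causal))."""
    return c1 * p_causal + c0 * (1.0 - p_causal)


# Hypothetical feature weights and binary SNP features (illustrative only).
beta = [1.5, -0.5]
snp_features = {"snp_a": [1.0, 0.0], "snp_b": [0.0, 1.0]}

potential = {name: regulatory_potential(f, beta) for name, f in snp_features.items()}
weights = {name: l1_weight(p, c1=0.1, c0=2.0) for name, p in potential.items()}

for name in snp_features:
    print(f"{name}: potential={potential[name]:.3f}  C_r={weights[name]:.3f}")
```

Under these assumptions a SNP with high regulatory potential receives a smaller L1 penalty, so its weight is less likely to be shrunk to zero. Alternating between fitting the weights w and the feature coefficients beta is one way to read "parameters estimated iteratively".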

11 Where are SNPs with largest regulatory potential?


Examining the Relative Influence of Familial, Genetic and Covariate Information In Flexible Risk Models Grace Wahba Based on a paper of the same name which has appeared in PNAS May 19, 2009, by Hector