GWAS for Compound Heterozygous Traits: Phenotypic Distance and Integer Linear Programming Dan Gusfield, Rasmus Nielsen.

Similar documents
Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

UNIT 8 BIOLOGY: Meiosis and Heredity Page 148

(Genome-wide) association analysis

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Integer Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.!

Genotype Imputation. Biostatistics 666

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

LECTURE # How does one test whether a population is in the HW equilibrium? (i) try the following example: Genotype Observed AA 50 Aa 0 aa 50

Association studies and regression

Outline for today s lecture (Ch. 14, Part I)

QTL Model Search. Brian S. Yandell, UW-Madison January 2017

Chapter 10. Prof. Tesler. Math 186 Winter χ 2 tests for goodness of fit and independence

Linear Regression (1/1/17)

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Calculation of IBD probabilities

Estimating Recombination Rates. LRH selection test, and recombination

p(d g A,g B )p(g B ), g B

Calculation of IBD probabilities

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

The Quantitative TDT

The Pure Parsimony Problem

Latent Variable models for GWAs

Outline. P o purple % x white & white % x purple& F 1 all purple all purple. F purple, 224 white 781 purple, 263 white

SAT in Bioinformatics: Making the Case with Haplotype Inference

GWAS IV: Bayesian linear (variance component) models

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Lecture WS Evolutionary Genetics Part I 1

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

18.2 Continuous Alphabet (discrete-time, memoryless) Channel

Chapter 2: Extensions to Mendel: Complexities in Relating Genotype to Phenotype.

Ch 11.Introduction to Genetics.Biology.Landis

Efficient Algorithms for Detecting Genetic Interactions in Genome-Wide Association Study

Integer Linear Programs

Population Structure

Dropping Your Genes. A Simulation of Meiosis and Fertilization and An Introduction to Probability

Genetics (patterns of inheritance)

Introduction to Genetics

1. Understand the methods for analyzing population structure in genomes

AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity,

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Unit 3 - Molecular Biology & Genetics - Review Packet

SNP Association Studies with Case-Parent Trios

Objective 3.01 (DNA, RNA and Protein Synthesis)

Rational Numbers. Integers. Irrational Numbers

Case-Control Association Testing. Case-Control Association Testing

Results. Experiment 1: Monohybrid Cross for Pea Color. Table 1.1: P 1 Cross Results for Pea Color. Parent Descriptions: 1 st Parent: 2 nd Parent:

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Conditional distributions (discrete case)

Genotype Imputation. Class Discussion for January 19, 2016

Proof: Let the check matrix be

SNP-SNP Interactions in Case-Parent Trios

Family Trees for all grades. Learning Objectives. Materials, Resources, and Preparation

Lecture 25: Cook s Theorem (1997) Steven Skiena. skiena

Week 7.2 Ch 4 Microevolutionary Proceses

The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

a. See the textbook for examples of proving logical equivalence using truth tables. b. There is a real number x for which f (x) < 0. (x 1) 2 > 0.

Name Date Class CHAPTER 10. Section 1: Meiosis

3 Comparison with Other Dummy Variable Methods

Heredity and Genetics WKSH

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

Life Cycles, Meiosis and Genetic Variability24/02/2015 2:26 PM

Efficient Haplotype Inference with Boolean Satisfiability

Computational Approaches to Statistical Genetics

Computational Systems Biology: Biology X

Lesson 4: Understanding Genetics

CPSC 340: Machine Learning and Data Mining. Regularization Fall 2017

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

Introduction to Quantitative Genetics. Introduction to Quantitative Genetics

Section 11 1 The Work of Gregor Mendel

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Introduction to population genetics & evolution

Chapter 1. Vectors, Matrices, and Linear Spaces

genome a specific characteristic that varies from one individual to another gene the passing of traits from one generation to the next

Computational Systems Biology: Biology X

Patterns of inheritance

Show Your Work! Point values are in square brackets. There are 35 points possible. Some facts about sets are on the last page.

BTRY 4830/6830: Quantitative Genomics and Genetics

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

2. Introduction to commutative rings (continued)

The phenotype of this worm is wild type. When both genes are mutant: The phenotype of this worm is double mutant Dpy and Unc phenotype.

Classical Selection, Balancing Selection, and Neutral Mutations

In other words, we are interested in what is happening to the y values as we get really large x values and as we get really small x values.

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Learning gene regulatory networks Statistical methods for haplotype inference Part I

NP-problems continued

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important?

Name Date Class. Meiosis I and Meiosis II

Lecture 9. QTL Mapping 2: Outbred Populations

Multiple Regression. Midterm results: AVG = 26.5 (88%) A = 27+ B = C =

Private Computation with Genomic Data for Genome-Wide Association and Linkage Studies

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2018

FastANOVA: an Efficient Algorithm for Genome-Wide Association Study

10. How many chromosomes are in human gametes (reproductive cells)? 23

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

Transcription:

GWAS for Compound Heterozygous Traits: Phenotypic Distance and Integer Linear Programming Dan Gusfield, Rasmus Nielsen December 11, 2016

GWAS In Genome Wide Association Studies (GWAS) we try to locate mutations that are causal for a particular trait by comparing genomic information from individuals who are believed to have the trait (cases), with information from individuals who are believed to not have the trait (controls). The goal is to find sites in the genome which statistically distinguish the cases from the controls. The method considers each site separately. This has worked well for simple traits caused by a single particular mutation at a single site (pure Mendelian traits).

Complex Traits However, for many complex traits, involving more than one site, or mutations with small effect, or with large errors in determining who has the trait, GWAS have only recovered a small proportion of the variance in trait prevalence known to be caused by genetics. Ex. Diabetes II. The most common explanation is the presence of multiple interacting or causal mutations that cannot be identified individually due to a lack of statistical power. But, the mutations are often concentrated in a small number of genes, or loci. Many are thought to be Compound Heterozygous traits, a subset of the recessive traits.

Compound Heterozygous (CH) Trait X g (SNPs): 0 1 1 0 0 0 1 CH H 1,1 : 1 0 1 0 1 0 0 H 1,2 : 1 1 0 0 1 1 0 1 H 2,1 : 1 0 0 1 1 0 0 H 2,2 : 0 1 1 0 1 0 1 0 Table : Vector X g and two haplotype pairs. CH(1) is 1, and CH(2) is 0. In this talk, we discuss an GWAS approach to finding loci that are causal for CH traits.

Formal Model of a CH-trait at a Causal Locus Binary vector X g denotes which of the m SNP sites are causal (i.e., contribute to the CH-trait), and which are not. H is the set of haplotype pairs for n individuals. Given X g and H, wedefinech(i): CH(i) =[ c (X g (c) H i,1 (c))] [ c (X g (c) H i,2 (c))] (1) In words, CH(i) willhavevalue1ifandonlyifthereisasnpsite c with X g (c) = 1, where site c in haplotype H i,1 also has value 1; and there is also a site c (possibly c) withx g (c ) = 1, where site c in haplotype H i,2 also has value 1.

CH We let CH denote the vector of length n, containingthevalues CH(1),...,CH(n). Phenotypes The phenotypes of the individuals are recorded by the observable vector T.Thecases have T -value of 1; the controls have T -value of 0. CH + noise T. Without false positive or negatives, T = CH. But there are always some false positives or negatives, so T will differ to some extent from CH. In the simulations I will discus later in the talk, about 30% of the cases (T value of 1) are false positives, and about 5% of the controls (T value of 0) are false negatives, at a causal gene.

1001100010001001001 True, but unknown SNPs 0000000000000000000 0 0000000000000000100 0000100000000000000 1 0010000100011001010 0001000000000000000 0 0000000000000000000 0000000000000000000 0 0000000001000000000 0000000000000000000 1 0000001000000000001 0001000000000000000 1 1000000010000100000 0100010000000000000 1 0000000000100000000 Low statistical power.

Hidden Phenotypic Distance Definition: Given CH (which is a function of X g and H), the Hidden Phenotypic Distance is the Hamming Distance (number of positions where the vectors differ) between CH and T.Thisis written HPD(CH,T). Without false positives or negatives, HPD(CH,T) =0.

Compound Heterozygous (CH) Trait X g (SNPs): 0 1 1 1 0 0 1 CH phenotype T H 1,1 : 1 0 1 0 1 0 0 H 1,2 : 1 1 0 0 1 1 0 1 1 H 2,1 : 1 0 0 1 1 0 0 H 2,2 : 0 1 1 0 1 0 1 1 0 Table : Hidden Phenotypic Distance is the Hamming Distance between vectors CH and T. In this example, it is 1.

The Phenotypic Distance Problem Definition: Given only H and T,thePhenotypic Distance Problem is the problem of determining a vector X g,whichinduces avector CH to Minimize the Hamming Distance between CH and T.ThisiswrittenPD(H,T). Loosely, PD(H,T) measures how well the phenotypes fit the CH model, or, the deviation from what is expected (under the CH-model), and what is observed. We want to compute Phenotypic Distance for up to n =4000 haplotype pairs and n =250sites.

Computing PD(H,T) Enumeration of 2 m possible X g is infeasible. But Integer Linear Programming (ILP) works efficiently in practice for this problem. Under 3 seconds for a causal gene, and under 1 minute for a non-causal gene, on a macbook pro (2.3 GH, 4 cpus, $1500 to buy).

Integer Linear Programming (ILP) In ILP, we translate the problem of computing PD(H,T) into a set of linear inequalities on a set of binary variables, andalinear objective function on a subset of the variables. The particulars of the inequalities is where the magic resides. For 4000 haplotype pairs and 250 sites, the ILP has about 10,000 inequalities and 8,000 variables (this is modest size). The translation from problem instance to ILP inequalities is written in (really) bad and simple Perl. Then we use a commercial ILP solver that finds the optimal values of the variables. (GUROBI 6.0, free to academics and researchers)

An ILP Formulation for the Phenotypic Distance Problem The ILP formulation is almost an immediate restatement of the problem requirements, with only a couple of subtlties. Recall that an H i pair with T (i) =1iscalledacase, eventhough, due to false-positives, indivdual i might not actually have the trait; similarly an H i pair with T (i) =0iscalledacontrol.

The ILP Inequalities for Cases For each case H i, the ILP formulation for the Phenotypic Distance will have the following inequalities: CH(i) CH(i) c: H i,1 (c)=1 c: H i,2 (c)=1 X (c) X (c) The first inequality ensures that for any case, CH(i) canbesetto 1 only if some X (c) issetto1foracolumnc where H i,1 (c) =1. The second inequality says the similar thing for the second haplotype of a case, i.e., for H i,2 (c). So, for any case H i, CH(i) will be set to 1 only if the values of X and H i satisfy equation 1.

Errors from Cases It follows that in an ILP solution, [(the number of cases) H i acase CH(i))] is the number of cases (i.e., T (i) = 1), where CH(i) issetto0. That is, it is the number of errors in the solution, contributed by the cases. Next, we consider the inequalities for a control.

The ILP inequalities for Controls Let f i be the number of columns, c, inh i where H i,1 (c) =1,and let s i be the number of columns, c, inh i where H i,2 (c) =1. For each control H i,theilpformulationwillhavethethree inequalities: c: H i,1 (c)=1 c: H i,2 (c)=1 X (c) f i Z i,1 X (c) s i Z i,2 Z i,1 + Z i,2 CH(i) 1 The first inequality ensures, for a control H i,thatz i,1 will be set to 1 if there is a column c where X (c) issetto1andh i,1 (c) =1. The second inequality ensures, for a control H i,thatz i,2 will be set to 1 if there is a column c where X (c) issetto1and H i,2 (c) =1. Thethirdinequalityensuresthat CH(i) willbesetto 1 if both Z i,1 and Z i,2 are set to 1.

The Converse The converse, that for H i a control, CH(i) willbesetto1only if those inequalities are satisfied, is not needed because the objective function has the term + H i acontrol CH(i), and since the objective is a minimization, CH(i) will be set to 0 for any control H i, unless doing so violates one of the three inequalities above. The result is that in an optimal ILP solution, H i acontrol CH(i) is the number of H i pairs where T (i) =0,but CH(i) issetto1.

The Objective Function It follows that in an optimal ILP solution, the Hamming Distance between CH and T is [(The number of cases) H i acase CH(i)] + H i acontrol CH(i). So, the ILP formulation optimizes the objective function: Minimize[(#ofCases) H i acase CH(i)] + H i acontrol CH(i), and hence the optimal solution has value exactly PD(H,T). The formulation has at most 3n + m variables and at most 3n inequalities, and so has modest size.

min Z subject to: +X 5+X 109 + X 131 + X 167 + X 215 + X 227 6Z 0, 1 0 +X 129 1Z 0, 2 0 Z 0, 1+Z 0, 2 W 0 1 +X 118 + X 139 2Z 1, 1 0 0Z 1, 2 0 Z 1, 1+Z 1, 2 W 1 1 +X 52 + X 77 + X 210 3Z 2, 1 0 +X 196 1Z 2, 2 0 Z 2, 1+Z 2, 2 W 2 1 0Z 3, 1 0 +X 7+X 127 + X 130 + X 167 + X 215 + X 227 6Z 3, 2 0 Z 3, 1+Z 3, 2 W 3 1 X 13 X 63 X 80 X 114 X 192 X 217 + W 4 0 X 14 X 36 X 44 X 45 X 59 X 74 X 102 X 115 + W 5 0.about10,000moreinequalities

The GWAS Context and Simulations Let T g denote the phenotype vector at a causal gene. In GWAS simulations, we generate data for many genes, but use T g as the phenotype vector for all genes.

Identifying Causal Genes with GWAS The computations distinguish causal genes from non-causal genes. 1. The phenotypic distance computed at a causal gene is consistently and significantly less than the phenotypic distance computed at every non-causal gene. Meaning: Causal genes fit the CH-model better than non-causal ones. 2. The time needed at a causal gene is generally less than at a non-causal gene. 3. Permutation tests (of T g ) behave very differently at causal and non-causal genes. Large increase in PD(H, T g )whent g is permuted at a causal gene g, butlittlechangewhenpermutedata non-causal gene.

sites HPD PD PD/SNP cases controls secs SNP-dist 233 890 879 3.77 2000, 2000 1.50 64 255 1374 1340 5.25 2000, 2000 4.11 85 239 1324 1301 5.44 2000, 2000 17.33 71 238 1348 1319 5.54 2000, 2000 4.77 77 241 1333 1313 5.44 2000, 2000 47.86 88 242 1337 1305 5.39 2000, 2000 9.67 79 269 1344 1300 4.83 2000, 2000 33.26 88 237 1381 1345 5.67 2000, 2000 13.39 82 236 1378 1345 5.69 2000, 2000 7.58 74 236 1320 1281 5.42 2000, 2000 5.27 83 237 1353 1320 5.56 2000, 2000 1.84 79 247 1297 1280 5.18 2000, 2000 9.61 61 Table : The first few datasets in a GWAS simulation. The first one is the causal gene, and the others are non-causal.

Table : Simulations at causal loci, each followed by a test of the permuted T vector. c/p hp sites HPD PD/ub case con secs SNP-dist c 400 153 91 86 200, 200 0.02 46 p 159 200, 200 0.10 71 c 400 155 96 85 200, 200 0.02 63 p 165 200, 200 0.19 67 c 400 162 80 75 200, 200 0.03 59 p 156 200, 200 0.19 63 c 400 145 100 90 200, 200 0.04 44 p 162 200, 200 0.30 59 c 400 178 91 75 200, 200 0.02 70 p 149 200, 200 0.11 71 c 400 172 99 85 200, 200 0.04 59 p 142 200, 200 0.32 78 AVG c 400 157.6 91.6 83.2 200, 200 0.025 54.7 AVG p 153.8 200, 200 0.172 68.8

Thank You!