A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction


1 A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction Sangseob Leem, Hye-Young Jung, Sungyoung Lee and Taesung Park Bioinformatics and Biostatistics lab Seoul National University

2 Contents 1. Introduction 2. Motivation 3. Method 4. Results 5. Conclusion

3 Interaction
In a single-locus association study:
- SNP 1: no effect
- SNP 2: no effect
Interaction between SNPs that show no single-locus effect is one reason for the missing heritability.

4 MDR method
Ritchie, M.D., et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer, Am. J. Hum. Genet., 69, 138-147.

Input data (first rows):
SNP 1   SNP 2   Class
0       0       0
1       2       0
2       1       1

Procedure:
1. Calculate the case/control ratio in each multi-locus genotype cell.
2. Identify each cell as high risk or low risk.
3. Build the 2 × 2 confusion matrix.

                      true positive   true negative
predicted positive    TP              FP
predicted negative    FN              TN

[Tables: case/control counts by high-risk and low-risk group; only the values 12 and 4 survive in the high-risk row.]
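The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the genotype labels and counts are made up, and the high-risk rule (case/control ratio above a threshold of 1, suitable for balanced data) is an assumption.

```python
from typing import Dict, Tuple

# genotype cell -> (number of cases, number of controls); hypothetical counts
Counts = Dict[str, Tuple[int, int]]

def mdr_confusion(cells: Counts, threshold: float = 1.0):
    """Label each cell high or low risk, then pool counts into TP, FP, FN, TN."""
    tp = fp = fn = tn = 0
    for cases, controls in cells.values():
        # High risk when cases/controls > threshold; written as a product
        # so that cells with zero controls do not divide by zero.
        if cases > threshold * controls:
            tp += cases
            fp += controls
        else:
            fn += cases
            tn += controls
    return tp, fp, fn, tn

def balanced_accuracy(tp, fp, fn, tn):
    """Mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

cells = {"AA/BB": (12, 4), "AA/Bb": (3, 9), "Aa/BB": (20, 10), "Aa/Bb": (5, 17)}
tp, fp, fn, tn = mdr_confusion(cells)
ba = balanced_accuracy(tp, fp, fn, tn)
```

Every individual in a cell is counted entirely in one risk group, which is the hard assignment that the later slides relax.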

5 Weaknesses of MDR
- Biological meaning: are all possible genotype interaction models really plausible in the real world?
  - Log-linear model based MDR (Lee et al. 2007)
- Computation time: increases exponentially with the interaction order
  - Filtering-based approaches: Relief, ReliefF, TuRF, SURF
  - Processing: MDR GPU, cuMDR
- Binary classification (# of cases, # of controls): (2, 1) vs (20, 10), (1, 11) vs (10, 20)
  - See next slide

6 Approaches to overcome simple binary classification
- Model-based MDR (# of cases, # of controls): (2, 1) vs (20, 10)
  - Calle, M.L., et al. (2008) MB-MDR: model-based multifactor dimensionality reduction for detecting interactions in high-dimensional genomic data.
  - Ternary classification: high, low and no-evidence groups
- wba MDR (# of cases, # of controls): (11, 1) vs (20, 10)
  - Namkung, J., et al. (2009) New evaluation measures for multifactor dimensionality reduction classifiers in gene-gene interaction analysis, Bioinformatics, 25.
  - Weighted balanced accuracy MDR

7 Fuzzy set theory
- Extension of classical set theory
  - Zadeh, L.A. (1965) Fuzzy sets, Information and Control, 8.
- Degrees of membership
  - "Rich or poor" vs. "degree of rich"
  - Example: earning 1, 10, 100 or 1000 dollars a day
[Figure: crisp poor/rich split compared with a graded membership between 0 and 1]

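8 Key difference
[Figure: how one genotype cell contributes to the case/control by high-risk/low-risk table under each method]
- Original MDR: the raw count (+1, +20) is added to a single risk group.
- wba MDR: the count is multiplied by a weight (e.g. +1 × 2.5) before it is added.
- Fuzzy MDR: the count is split between the two groups by membership degree (e.g. +1 × 0.05 to high risk and +1 × 0.95 to low risk).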

9 Simple example
[Tables: case/control counts by high-risk and low-risk group under each method. Surviving cell values: 60, 42.5, 40 and 37.3 in the high-risk row; 20 and a truncated value beginning "37." in the low-risk row.]

10 Criteria (x) of membership degree (μ_H, μ_L)
The estimate of the odds ratio (OR):
$$\hat{\theta}_i = \frac{n_{i1} / n_{i0}}{n_{+1} / n_{+0}}$$
- $n_{ij}$: the number of individuals with the $i$-th multi-locus genotype in the $j$-th disease group
- $n_{+j}$: the total number of individuals in the $j$-th disease group
- $i = 1, \dots, 3^k$ for $k$ loci; $j = 1$ for case and $j = 0$ for control
- (# of cases, # of controls): (2, 1) vs (20, 10) give the same OR

Standardization:
$$z = \frac{\log \hat{\theta}_i}{SE}, \qquad SE = \sqrt{\frac{1}{n_{i1}} - \frac{1}{n_{+1}} + \frac{1}{n_{i0}} - \frac{1}{n_{+0}}}$$
$$\log \hat{\theta}_i = \log \frac{n_{i1} / n_{i0}}{n_{+1} / n_{+0}} = \log \frac{n_{i1}}{n_{+1}} - \log \frac{n_{i0}}{n_{+0}}$$
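A short Python sketch shows why standardization separates the (2, 1) and (20, 10) cells even though their odds ratios agree. It assumes the standard error has the usual form for a log ratio of two proportions, and the group totals of 200 cases and 200 controls are hypothetical.

```python
import math

def odds_ratio(n_i1, n_i0, n_tot1, n_tot0):
    """Cell case/control ratio relative to the overall case/control ratio."""
    return (n_i1 / n_i0) / (n_tot1 / n_tot0)

def standardized_log_or(n_i1, n_i0, n_tot1, n_tot0):
    """z = log(OR) / SE; requires non-zero cell counts."""
    log_or = math.log(n_i1 / n_tot1) - math.log(n_i0 / n_tot0)
    # Assumed SE: log ratio of two independent binomial proportions.
    se = math.sqrt(1 / n_i1 - 1 / n_tot1 + 1 / n_i0 - 1 / n_tot0)
    return log_or / se

# Hypothetical totals: 200 cases and 200 controls overall.
small_cell = (2, 1, 200, 200)
large_cell = (20, 10, 200, 200)

or_small = odds_ratio(*small_cell)
or_large = odds_ratio(*large_cell)
z_small = standardized_log_or(*small_cell)
z_large = standardized_log_or(*large_cell)
```

Both cells have the same OR, but the larger cell gets the larger z because its standard error is smaller.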

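11 Membership function
[Figure: membership functions of original MDR and Fuzzy MDR]
Linear:
$$\mu_H(x) = \begin{cases} 0 & x < t_l \\ \dfrac{x - t_l}{t_h - t_l} & t_l \le x < t_h \\ 1 & x \ge t_h \end{cases}, \qquad \mu_L(x) = 1 - \mu_H(x)$$
Sigmoid:
$$\mu_H(x) = \begin{cases} 0 & x < t_l \\ \dfrac{1}{1 + \cdots} & t_l \le x < t_h \\ 1 & x \ge t_h \end{cases}, \qquad \mu_L(x) = 1 - \mu_H(x)$$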
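The linear membership function can be written directly as a small Python function. This is a sketch under assumptions: the thresholds t_low = -2 and t_high = 2 are hypothetical values on an arbitrary x scale, and the function names are illustrative.

```python
def mu_high_linear(x: float, t_low: float, t_high: float) -> float:
    """Degree of membership in the high-risk group (linear ramp)."""
    if x < t_low:
        return 0.0
    if x >= t_high:
        return 1.0
    return (x - t_low) / (t_high - t_low)

def mu_low_linear(x: float, t_low: float, t_high: float) -> float:
    """Degree of membership in the low-risk group: the complement."""
    return 1.0 - mu_high_linear(x, t_low, t_high)

# Hypothetical thresholds for illustration only.
T_LOW, T_HIGH = -2.0, 2.0
degrees = [mu_high_linear(x, T_LOW, T_HIGH) for x in (-3.0, -1.0, 0.0, 1.0, 2.0)]
```

As the two thresholds move closer together the ramp approaches a step, so the hard high/low split of original MDR appears as a limiting case.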

12 Tuning parameters
Notation: F(membership function, standardization, weight, threshold); 80 (2 × 2 × 4 × 5) combinations
- Membership function: l for the linear membership function, s for the sigmoid membership function
- Standardization: 0 for OR, 1 for z
- Weights: w_i = 1 + ln(OR)_i, i = 0, 0.5, 1, 2
- Threshold values: 2, 4, 8, 16 and 32 for OR; …, …, …, … and … for z
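The size of the parameter grid can be checked by enumerating it. In this sketch the tuple layout and the variable names are illustrative, and thresholds are represented by their position (1 to 5) rather than by value.

```python
from itertools import product

memberships = ("l", "s")           # linear, sigmoid
standardizations = (0, 1)          # 0 for OR, 1 for z
weight_settings = (0, 0.5, 1, 2)   # the four weight settings
threshold_indices = (1, 2, 3, 4, 5)  # position in the list of five thresholds

# Every F(membership, standardization, weight, threshold) combination.
grid = list(product(memberships, standardizations, weight_settings, threshold_indices))
n_combinations = len(grid)
```

Iterating over `grid` is one straightforward way to run the method once per parameter setting.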

13 Fuzzy MDR procedure (1)
Consistent case/control ratio
In two-locus interactions, …

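14 Fuzzy MDR procedure (2)
[Figure: classification in original MDR compared with Fuzzy MDR]
Membership degrees depend on the parameter values.
$$TP = \sum_i n_{i1}\, \mu_H(x_i), \qquad FN = \sum_i n_{i1}\, \mu_L(x_i)$$
$$FP = \sum_i n_{i0}\, \mu_H(x_i), \qquad TN = \sum_i n_{i0}\, \mu_L(x_i)$$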
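The four sums can be sketched as a single pass over the genotype cells. The cell values, the linear membership function and its thresholds are all hypothetical choices made for illustration.

```python
def mu_high(x: float, t_low: float = -2.0, t_high: float = 2.0) -> float:
    """Linear membership in the high-risk group (hypothetical thresholds)."""
    if x < t_low:
        return 0.0
    if x >= t_high:
        return 1.0
    return (x - t_low) / (t_high - t_low)

def fuzzy_confusion(cells):
    """cells: iterable of (n_i1 cases, n_i0 controls, x_i criterion value)."""
    tp = fn = fp = tn = 0.0
    for n_i1, n_i0, x_i in cells:
        high = mu_high(x_i)
        low = 1.0 - high
        tp += n_i1 * high   # cases counted as high risk
        fn += n_i1 * low    # cases counted as low risk
        fp += n_i0 * high   # controls counted as high risk
        tn += n_i0 * low    # controls counted as low risk
    return tp, fp, fn, tn

# Two hypothetical cells: (cases, controls, criterion x).
cells = [(20, 10, 1.0), (10, 20, -1.0)]
tp, fp, fn, tn = fuzzy_confusion(cells)
balanced_acc = (tp / (tp + fn) + tn / (tn + fp)) / 2
```

Because the two membership degrees of a cell sum to 1, the four entries still add up to the total sample size.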

15 Empirical studies
- Experiments on simulation data
  - Objectives
    - To compare the power of Fuzzy MDR with original MDR and wba MDR
    - To find optimal parameter values
  - Data
    - Without marginal effects
    - With marginal effects
  - Generation parameters: F(membership function, standardization, weight, threshold)
    - Linear/sigmoid, with/without SE, four weight values and five threshold values
- Experiments on real data
  - Bipolar disorder (BD) data in the Wellcome Trust Case Control Consortium (WTCCC)

16 Data without marginal effects
Structure
- Four sample sizes: 200, 400, 800 and 1600 samples
- 1000 SNPs
- Two causative SNPs
- 70 penetrance tables
  - 7 heritability values
  - 2 minor allele frequencies
  - 5 models
Example of penetrance table (Model 1): columns AA, Aa, aa; rows BB, Bb, bb

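17 200 sample results
[Figure: results by heritability and MAF]

18 400 sample results
[Figure: results by heritability and MAF]

19 800 sample results
[Figure: results by heritability and MAF]

20 1600 sample results
[Figure: results by heritability and MAF]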

21 Data with marginal effects
Structure
- One sample size: 2000 cases and 2000 controls
- 1000 SNPs
- Two causative SNPs
- 18 penetrance tables
  - 3 models
  - 3 minor allele frequencies
  - 2 linkage disequilibrium values

Model 1
      AA         Aa         aa
BB    1          1+θ        (1+θ)^2
Bb    1+θ        (1+θ)^2    (1+θ)^3
bb    (1+θ)^2    (1+θ)^3    (1+θ)^4

Model 2
      AA         Aa         aa
BB    …          …          …
Bb    1          1+θ        (1+θ)^2
bb    1          (1+θ)^2    (1+θ)^4

Model 3
      AA         Aa         aa
BB    …          …          …
Bb    1          1+θ        1+θ
bb    1          1+θ        1+θ
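The Model 1 table follows a simple pattern: each entry is (1+θ) raised to the total number of minor alleles across the two loci. A minimal Python sketch of that pattern, with a hypothetical θ = 1 and an illustrative function name:

```python
def model1_table(theta: float):
    """3 x 3 effect table for Model 1.

    Rows are the B genotypes (BB, Bb, bb) and columns the A genotypes
    (AA, Aa, aa); each is coded 0, 1, 2 by its number of minor alleles.
    """
    return [[(1 + theta) ** (a + b) for a in range(3)] for b in range(3)]

# Hypothetical effect size chosen so that the entries are easy to read.
table = model1_table(1.0)
```

With θ = 1 the rows double from left to right and from top to bottom, which reflects an effect that multiplies both within and between loci.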

22 Results: data with marginal effects
[Figure: results by model, LD and MAF]

23 Real data: BD in WTCCC
- 1868 cases and 2938 controls
- 19 SNPs are selected by a literature review
- Two parameter settings
  1. F(L,0,0,3): linear membership, without SE, without weight, threshold OR = 8
  2. F(S,1,1,2): sigmoid membership, with SE, w_1 = 1 + ln(OR)_1, threshold z = 2 × 1.96

Columns on the slide: index, rs number, MAF, chromosome (position), gene, p-value (rank). The rs number, MAF and chromosome entries are blank in this copy.

index   gene               p-value (rank)
1                          9.82E-06 (8)
2                          1.83E-05 (12)
3       DPP…               …E-05 (10)
4       RNPEPL1            5.03E-06 (3)
5       CMTM8              1.45E-05 (11)
6       LAMP3              5.25E-06 (4)
7       SORCS2             1.13E-01 (17)
8       GLTSCR1L, LOC…     …E-06 (2)
9                          5.39E-05 (14)
10      DFNB…              …E-05 (13)
11      CACNA1C            9.72E-04 (15)
12      TSPAN8             7.22E-02 (16)
13      DGKH               6.23E-01 (19)
14      SLC35F4            1.15E-05 (9)
15      TDRD9              7.69E-06 (6)
16      PALB2              1.33E-07 (1)
17                         9.18E-06 (7)
18      MYO5B              4.79E-01 (18)
19      CDC25B             7.47E-06 (5)

24 Result of BD in WTCCC, F(S,1,1,2)
Columns on the slide: order, SNP combination, training accuracy, testing accuracy, CVC
[Table: only fragments of the SNP combinations survive (", , 6, , 6, 14, , 6, 9, 11,"); the accuracies and CVC values are blank.]

index   gene                                  p-value (rank)
15      TDRD9 (Tudor Domain Containing 9)     7.69E-06 (6)
19      CDC25B (Cell Division Cycle 25B)      7.47E-06 (5)

25 Fuzzy MDR vs original MDR (interaction model)
[Figure: interaction models identified by Fuzzy MDR and by original MDR]
M11 has been discovered in the real world!

26 5. Conclusion
- A novel and powerful Fuzzy MDR for gene-gene interaction analysis
  - Based on fuzzy set theory
  - The H and L risk groups are fuzzy sets
  - Original MDR is a special case of Fuzzy MDR
- More flexible interpretation through the degree of membership of each multi-locus genotype
- Potential for extension
- Future work
  - Determining the optimal tuning parameter values
  - Extensions

27 Thank you.

28 References
- Ritchie, M.D., et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer, Am J Hum Genet, 69, 138-147.
- Velez, D.R., et al. (2007) A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction, Genet Epidemiol, 31.
- Leem, S., et al. (2014) Fast detection of high-order epistatic interactions in genome-wide association studies using information theoretic measure, Computational Biology and Chemistry, 50.
- Burton, P.R., et al. (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls, Nature, 447.
- Li, W. and Reich, J. (2000) A Complete Enumeration and Classification of Two-Locus Disease Models, Human Heredity, 50.

29 Limit of single-locus association studies
Penetrance table (MAF: 0.4, prevalence: 0.1)
- Columns (SNP_A): AA (0.36), Aa (0.48), aa (0.16), with marginal penetrance P(SNP_B)
- Rows (SNP_B): BB (0.36), Bb (0.48), bb (0.16), with marginal penetrance P(SNP_A)
[Table: cell penetrances are blank in this copy]
- Penetrances are the same across genotypes in SNP_A.
- Penetrances are the same across genotypes in SNP_B.
- Penetrances are different in genotype combinations of SNP_A and SNP_B.

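30 Method: confusion matrix calculation
Here $m_{i1}$ and $m_{i0}$ are the membership degrees of genotype $i$ in the high-risk and low-risk groups, and $w_i$ is its weight.

Original:
$$wTP = \sum_{i \in high} n_{i1}, \quad wFP = \sum_{i \in high} n_{i0}, \quad wFN = \sum_{i \in low} n_{i1}, \quad wTN = \sum_{i \in low} n_{i0}$$

Weighted:
$$wTP = \sum_{i \in high} w_i n_{i1}, \quad wFP = \sum_{i \in high} w_i n_{i0}, \quad wFN = \sum_{i \in low} w_i n_{i1}, \quad wTN = \sum_{i \in low} w_i n_{i0}$$

Fuzzy:
$$wTP = \sum_{i \in high} m_{i1} n_{i1} + \sum_{i \in low} m_{i1} n_{i1}, \quad wFN = \sum_{i \in high} m_{i0} n_{i1} + \sum_{i \in low} m_{i0} n_{i1}$$
$$wFP = \sum_{i \in high} m_{i1} n_{i0} + \sum_{i \in low} m_{i1} n_{i0}, \quad wTN = \sum_{i \in high} m_{i0} n_{i0} + \sum_{i \in low} m_{i0} n_{i0}$$

Weighted fuzzy:
$$wTP = \sum_{i \in high} w_i m_{i1} n_{i1} + \sum_{i \in low} w_i m_{i1} n_{i1}, \quad wFN = \sum_{i \in high} w_i m_{i0} n_{i1} + \sum_{i \in low} w_i m_{i0} n_{i1}$$
$$wFP = \sum_{i \in high} w_i m_{i1} n_{i0} + \sum_{i \in low} w_i m_{i1} n_{i0}, \quad wTN = \sum_{i \in high} w_i m_{i0} n_{i0} + \sum_{i \in low} w_i m_{i0} n_{i0}$$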
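The weighted fuzzy case can be sketched as follows. Since the high and low sums together cover every genotype, the sketch loops over all cells once. The counts, membership degrees and weights are hypothetical, and the membership in the low-risk group is taken to be 1 minus the membership in the high-risk group.

```python
def weighted_fuzzy_confusion(cells):
    """cells: iterable of (n_i1 cases, n_i0 controls, m_i1 high membership, w_i weight)."""
    wtp = wfn = wfp = wtn = 0.0
    for n_i1, n_i0, m_i1, w_i in cells:
        m_i0 = 1.0 - m_i1            # assumed: low membership is the complement
        wtp += w_i * m_i1 * n_i1
        wfn += w_i * m_i0 * n_i1
        wfp += w_i * m_i1 * n_i0
        wtn += w_i * m_i0 * n_i0
    return wtp, wfp, wfn, wtn

# Hypothetical cells: (cases, controls, high-risk membership, weight).
cells = [(20, 10, 0.75, 2.0), (10, 20, 0.25, 1.0)]
wtp, wfp, wfn, wtn = weighted_fuzzy_confusion(cells)
```

Setting every weight to 1 gives the plain fuzzy case, and restricting the memberships to 0 or 1 gives the weighted and original cases.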

31 Comparison on example genotype tables
[Tables: example two-SNP genotype tables (a), (b), (c) and (d); cell values are blank in this copy]

method                                             a          b
Chi-square statistic (p-value)                     (0.221)    (0.009)
Balanced accuracy of MDR                           …          …
Balanced accuracy of wba MDR (α = 0.25)            …          …
Balanced accuracy of fuzzy MDR (linear, OR = 8)    …          …

method                                             c          d
Chi-square statistic (p-value)                     (0.025)    (2.6E-5)
Balanced accuracy of MDR                           …          …
Balanced accuracy of wba MDR (α = 0.25)            …          …
Balanced accuracy of fuzzy MDR (linear, OR = 8)    …          …


33 Calculations of an example (B)
Columns on the slide: genotype, # of cases, # of controls; then for original MDR: is high, is low, TP, FP, FN, TN; for wba MDR: OR, log(OR), TP, FP, FN, TN; for Fuzzy MDR: p_high, p_low, TP, FP, FN, TN; with a sum row for each method.

Only the first row survives, for genotype 00 with 1 case and 10 controls:
- Original MDR: is high = 0, is low = 1; TP = 0, FP = 0, FN = 1, TN = 10
- wba MDR: surviving values 0.14, 1.18, 0.00, 0.00, 1.18 and a truncated value beginning "11."

34 Accuracy = 0.6
