A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction

Sangseob Leem, Hye-Young Jung, Sungyoung Lee and Taesung Park
Bioinformatics and Biostatistics Lab, Seoul National University

Contents
1. Introduction
2. Motivation
3. Method
4. Results
5. Conclusion

Interaction
[Table: case/control counts for each genotype combination (0, 1, 2) of SNP 1 and SNP 2, with marginal counts; cell layout not recoverable]
In a single-locus association study:
- no effect of SNP 1
- no effect of SNP 2
One reason for the missing heritability

MDR method
Ritchie, M.D., et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer, Am. J. Hum. Genet., 69, 138-147.
Steps
1. Calculate the case/control ratio of each genotype combination of SNP 1 and SNP 2
2. Identify each genotype combination as high risk or low risk
3. Build a 2 x 2 confusion matrix

                       true positive   true negative
predicted positive     TP              FP
predicted negative     FN              TN

[Figure: worked example; a high-risk genotype cell adds +12 cases and +4 controls to the high-risk row, and a low-risk cell adds +10 cases and +24 controls to the low-risk row]
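The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the genotype counts are taken from the slide "Calculations of an example (B)", and sending ties (including empty cells) to the high-risk group is an assumption chosen to match that slide.

```python
from typing import Dict, Tuple

# genotype label -> (number of cases, number of controls); counts from example (B)
EXAMPLE_B: Dict[str, Tuple[int, int]] = {
    "00": (1, 10), "01": (19, 10), "02": (9, 10),
    "10": (11, 10), "11": (0, 0), "12": (11, 10),
    "20": (9, 10), "21": (19, 10), "22": (1, 10),
}

def mdr_confusion(counts: Dict[str, Tuple[int, int]]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) after labelling every genotype cell high or low risk."""
    total_cases = sum(cases for cases, _ in counts.values())
    total_controls = sum(controls for _, controls in counts.values())
    tp = fp = fn = tn = 0
    for cases, controls in counts.values():
        # Step 1 and 2: compare the cell's case/control ratio with the overall ratio.
        # Cross-multiplying avoids division by zero; ties go to high risk (assumption).
        high_risk = cases * total_controls >= controls * total_cases
        # Step 3: accumulate the 2 x 2 confusion matrix.
        if high_risk:
            tp += cases
            fp += controls
        else:
            fn += cases
            tn += controls
    return tp, fp, fn, tn

def balanced_accuracy(tp: float, fp: float, fn: float, tn: float) -> float:
    """Mean of sensitivity and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

TP, FP, FN, TN = mdr_confusion(EXAMPLE_B)
BA = balanced_accuracy(TP, FP, FN, TN)
```

With these counts the sketch gives the same confusion matrix (60, 40, 20, 40) and balanced accuracy (0.625) that the slides report for original MDR on example (b).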

Weaknesses of MDR
- Biological meaning
  - Are all possible genotype interaction models really possible in the real world?
  - Log-linear model based MDR (Lee et al. 2007)
- Computation time
  - Exponential increase with the interaction order
  - Filtering-based approaches: Relief, ReliefF, TuRF, SURF
  - Processing: MDR GPU, cumdr
- Binary classification
  - (# of cases, # of controls): (2, 1) vs (20, 10), (1, 11) vs (10, 20)
  - See next slide

Approaches to overcome simple binary classification
- Model-based MDR
  - (# of cases, # of controls): (2, 1) vs (20, 10)
  - Calle, M.L., et al. (2008) MB-MDR: model-based multifactor dimensionality reduction for detecting interactions in high-dimensional genomic data.
  - Ternary classification: high, low and no-evidence groups
- wba MDR (weighted balanced accuracy MDR)
  - (# of cases, # of controls): (11, 1) vs (20, 10)
  - Namkung, J., et al. (2009) New evaluation measures for multifactor dimensionality reduction classifiers in gene-gene interaction analysis, Bioinformatics, 25, 338-345.

Fuzzy set theory
- Extension of classical set theory
  - Zadeh, L.A. (1965) Fuzzy sets, Information and Control, 8, 338-353.
- Degrees of membership
  - Rich or poor vs degree of rich
  - 1/10/100/1000 dollars per day
[Figure: membership in the poor and rich sets over daily incomes of 1, 10, 100 and 1000 dollars]

Key difference
Contribution of one genotype cell (+1 case, +20 controls) to the confusion matrix:

Original MDR     Case        Control
  High risk
  Low risk       +1          +20

wba MDR          Case        Control
  High risk
  Low risk       +1*2.5      +20*2.5

Fuzzy MDR        Case        Control
  High risk      +1*0.05     +20*0.05
  Low risk       +1*0.95     +20*0.95

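Simple example
Each cell lists two values in the order shown on the slide (original MDR / fuzzy MDR).

First example    Case          Control
  High risk      60 / 42.5     40 / 37.3
  Low risk       20 / 37.4     40 / 42.7

Second example   Case          Control
  High risk      60 / 44.9     40 / 33.0
  Low risk       20 / 35.1     40 / 47.0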

Criteria (x) of membership degree (μ_H, μ_L)
- The estimate of the odds ratio (OR)

$$\hat{\theta}_i = \frac{n_{i1}/n_{i0}}{n_{+1}/n_{+0}}$$

  - $n_{ij}$: the number of individuals with the $i$th multi-locus genotype in the $j$th disease group
  - $n_{+j}$: the total number of individuals in the $j$th disease group
  - $i = 1, \dots, 3^k$ for $k$ loci; $j = 1$ for case and $j = 0$ for control
  - (# of cases, # of controls): (2, 1) vs (20, 10)
- Standardization

$$z = \frac{\log \hat{\theta}_i}{SE}, \qquad SE = \sqrt{\frac{1}{n_{i1}} - \frac{1}{n_{+1}} + \frac{1}{n_{i0}} - \frac{1}{n_{+0}}}$$

$$\log \hat{\theta}_i = \log \frac{n_{i1}/n_{i0}}{n_{+1}/n_{+0}} = \log \frac{n_{i1}}{n_{+1}} - \log \frac{n_{i0}}{n_{+0}}$$
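To see why standardization matters for the (2, 1) vs (20, 10) comparison, the two criteria can be evaluated side by side. The sketch below is illustrative only: the function names are hypothetical, and the group totals of 100 cases and 100 controls are an assumption, since the slide gives no totals for this example. Cells with a zero count are not handled here.

```python
import math

def log_odds_ratio(n_i1: int, n_i0: int, n_plus1: int, n_plus0: int) -> float:
    """log of (n_i1 / n_i0) / (n_+1 / n_+0), written as a difference of log proportions."""
    return math.log(n_i1 / n_plus1) - math.log(n_i0 / n_plus0)

def standard_error(n_i1: int, n_i0: int, n_plus1: int, n_plus0: int) -> float:
    """SE = sqrt(1/n_i1 - 1/n_+1 + 1/n_i0 - 1/n_+0)."""
    return math.sqrt(1 / n_i1 - 1 / n_plus1 + 1 / n_i0 - 1 / n_plus0)

def z_score(n_i1: int, n_i0: int, n_plus1: int, n_plus0: int) -> float:
    """Standardized criterion z = log(OR) / SE."""
    return (log_odds_ratio(n_i1, n_i0, n_plus1, n_plus0)
            / standard_error(n_i1, n_i0, n_plus1, n_plus0))

# Hypothetical totals (assumption): 100 cases and 100 controls overall.
N_CASES, N_CONTROLS = 100, 100

log_or_small = log_odds_ratio(2, 1, N_CASES, N_CONTROLS)
log_or_large = log_odds_ratio(20, 10, N_CASES, N_CONTROLS)
z_small = z_score(2, 1, N_CASES, N_CONTROLS)
z_large = z_score(20, 10, N_CASES, N_CONTROLS)
```

Both cells have the same odds ratio, but the cell with more individuals has a smaller SE and therefore a larger z, so z separates the two cells where OR alone cannot.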

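Membership function
[Figure: membership functions of original MDR and fuzzy MDR]
- Linear membership function

$$\mu_H(x) = \begin{cases} 0 & x < t_l \\ \dfrac{x - t_l}{t_u - t_l} & t_l \le x < t_u \\ 1 & x \ge t_u \end{cases}, \qquad \mu_L(x) = 1 - \mu_H(x)$$

- Sigmoid membership function

$$\mu_H(x) = \begin{cases} 0 & x < t_l \\ \text{sigmoid function of } x \text{ (expression not recoverable)} & t_l \le x < t_u \\ 1 & x \ge t_u \end{cases}, \qquad \mu_L(x) = 1 - \mu_H(x)$$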
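The linear membership function can be written directly from its three cases. In the sketch below the function names are hypothetical, and applying the function to log(OR) with symmetric thresholds t_u = log 8 and t_l = -log 8 is an assumption; that choice appears consistent with the p_high column of the slide "Calculations of an example (B)", but the slides do not state it explicitly.

```python
import math

def linear_membership_high(x: float, t_l: float, t_u: float) -> float:
    """Degree of membership in the high-risk set for criterion value x."""
    if x < t_l:
        return 0.0
    if x >= t_u:
        return 1.0
    return (x - t_l) / (t_u - t_l)

def membership_low(x: float, t_l: float, t_u: float) -> float:
    """Degree of membership in the low-risk set, the complement of the high-risk degree."""
    return 1.0 - linear_membership_high(x, t_l, t_u)

# Assumed thresholds: OR = 8 on the log scale, mirrored around log(1) = 0.
T_U = math.log(8)
T_L = -T_U

mu_neutral = linear_membership_high(math.log(1), T_L, T_U)      # OR = 1
mu_upper = linear_membership_high(math.log(8), T_L, T_U)        # OR at the upper threshold
mu_lower = linear_membership_high(math.log(1 / 8), T_L, T_U)    # OR at the lower threshold
```

A genotype whose odds ratio equals 1 belongs equally to both sets, and the degree saturates at 0 or 1 once the odds ratio passes a threshold.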

Tuning parameters
- Notation: F(p_m, p_s, p_weight, p_threshold)
  - 80 (2*2*4*5) combinations
- Membership function
  - p_m = l for the linear membership function, p_m = s for the sigmoid membership function
- Standardization
  - p_s = 0 for OR, p_s = 1 for z
- Weights
  - w_i = 1 + ln(or)^i, i = 0, 0.5, 1, 2 [position of i uncertain in the extracted slide]
- Threshold values
  - 2, 4, 8, 16 and 32 for OR
  - 0.5*1.96, 1*1.96, 2*1.96, 4*1.96 and 8*1.96 for z
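The count of 80 settings follows from a Cartesian product of the four parameter ranges. The sketch below only enumerates the grid; the variable names are hypothetical, and treating the threshold parameter as a 1-based index into the threshold list is an assumption taken from the real-data slide, where F(L,0,0,3) is described as threshold OR = 8.

```python
from itertools import product

MEMBERSHIP = ("l", "s")            # linear or sigmoid membership function
STANDARDIZATION = (0, 1)           # 0: OR, 1: z
WEIGHT = (0, 0.5, 1, 2)            # four weight values
THRESHOLD_INDEX = (1, 2, 3, 4, 5)  # assumed 1-based index into the lists below

OR_THRESHOLDS = (2, 4, 8, 16, 32)
Z_THRESHOLDS = tuple(k * 1.96 for k in (0.5, 1, 2, 4, 8))

def threshold_value(standardization: int, index: int) -> float:
    """Look up the threshold for a setting (1-based index, an assumption)."""
    values = Z_THRESHOLDS if standardization == 1 else OR_THRESHOLDS
    return values[index - 1]

# Every combination of the four tuning parameters.
GRID = list(product(MEMBERSHIP, STANDARDIZATION, WEIGHT, THRESHOLD_INDEX))
```

Each tuple in `GRID` corresponds to one F(...) setting; looping over it is one way to run the full parameter comparison.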

Fuzzy MDR procedure (1)
- Consistent case/control ratio
- In two-locus interactions, [remainder of slide not recoverable]

Fuzzy MDR procedure (2)
[Figure: original MDR vs fuzzy MDR]
Membership degrees depend on parameter values.

$$TP = \sum_i n_{i1}\,\mu_H(x_i), \qquad FN = \sum_i n_{i1}\,\mu_L(x_i)$$

$$FP = \sum_i n_{i0}\,\mu_H(x_i), \qquad TN = \sum_i n_{i0}\,\mu_L(x_i)$$
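The four sums can be accumulated in one pass over the genotype cells. The following is a minimal sketch with a hypothetical function name; membership degrees are passed in rather than computed. The first call uses the single-cell numbers from the "Key difference" slide (1 case, 20 controls, high-risk degree 0.05), and the second uses the two cells from the MDR method slide with degrees fixed at 1 and 0.

```python
from typing import Iterable, Tuple

def fuzzy_confusion(cells: Iterable[Tuple[float, float, float]]) -> Tuple[float, float, float, float]:
    """cells holds (cases, controls, mu_high) per genotype; returns (TP, FP, FN, TN)."""
    tp = fp = fn = tn = 0.0
    for cases, controls, mu_high in cells:
        mu_low = 1.0 - mu_high
        tp += cases * mu_high      # cases counted toward the high-risk set
        fn += cases * mu_low       # cases counted toward the low-risk set
        fp += controls * mu_high   # controls counted toward the high-risk set
        tn += controls * mu_low    # controls counted toward the low-risk set
    return tp, fp, fn, tn

# One cell with 1 case and 20 controls, membership degree 0.05 in the high-risk set.
single_cell = fuzzy_confusion([(1, 20, 0.05)])

# Degrees restricted to 0 or 1 reduce to the original MDR counts.
crisp = fuzzy_confusion([(12, 4, 1.0), (10, 24, 0.0)])
```

The second call illustrates the statement that original MDR is a special case of fuzzy MDR: with crisp degrees every individual is counted in exactly one risk group.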

Empirical studies
- Experiments on simulation data
  - Objectives
    - To compare the power of fuzzy MDR with original MDR and wba MDR
    - To find optimal parameter values
  - Data
    - Without marginal effects
    - With marginal effects
  - Generation
  - Parameters: F(p_m, p_s, p_weight, p_threshold)
    - Linear/sigmoid, with/without SE, four weight values and five threshold values
- Experiments on real data
  - Bipolar disorder (BD) data from the Wellcome Trust Case Control Consortium (WTCCC)

Data without marginal effects
- Structure
  - Four sample sizes: 200, 400, 800 and 1600 samples
  - 1000 SNPs
  - Two causative SNPs
  - 70 penetrance tables
    - 7 heritability values
    - 2 minor allele frequencies
    - 5 models
- Example of a penetrance table (Model 1)

        AA       Aa       aa
BB      0.486    0.960    0.538
Bb      0.947    0.004    0.811
bb      0.640    0.606    0.909

Downloaded from http://discovery.dartmouth.edu/epistatic_data

200 sample results
[Figure: results by heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) for MAF 0.2 and MAF 0.4]

400 sample results
[Figure: results by heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) for MAF 0.2 and MAF 0.4]

800 sample results
[Figure: results by heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) for MAF 0.2 and MAF 0.4]

1600 sample results
[Figure: results by heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) for MAF 0.2 and MAF 0.4]

Data with marginal effects
- Structure
  - One sample size: 2000 cases and 2000 controls
  - 1000 SNPs
  - Two causative SNPs
  - 18 penetrance tables
    - 3 models
    - 3 minor allele frequencies
    - 2 linkage disequilibrium values

Model 1   AA         Aa         aa
BB        1          1+θ        (1+θ)^2
Bb        1+θ        (1+θ)^2    (1+θ)^3
bb        (1+θ)^2    (1+θ)^3    (1+θ)^4

Model 2   AA         Aa         aa
BB        1          1          1
Bb        1          1+θ        (1+θ)^2
bb        1          (1+θ)^2    (1+θ)^4

Model 3   AA         Aa         aa
BB        1          1          1
Bb        1          1+θ        1+θ
bb        1          1+θ        1+θ
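All three model tables have the form (1+θ) raised to an exponent that depends on the two genotypes, which makes them easy to generate. The sketch below is illustrative: the function name, the coding of genotypes as 0, 1, 2, and the value θ = 0.5 are assumptions, not values from the slides.

```python
from typing import Callable, Dict, List

# Exponent of (1 + theta) for genotype codes a (AA=0, Aa=1, aa=2) and b (BB=0, Bb=1, bb=2).
EXPONENTS: Dict[int, Callable[[int, int], int]] = {
    1: lambda a, b: a + b,                         # Model 1: exponents add
    2: lambda a, b: a * b,                         # Model 2: exponents multiply
    3: lambda a, b: 1 if a > 0 and b > 0 else 0,   # Model 3: effect only when both codes are nonzero
}

def model_table(model: int, theta: float) -> List[List[float]]:
    """Return a 3 x 3 table; rows are BB, Bb, bb and columns are AA, Aa, aa."""
    exponent = EXPONENTS[model]
    return [[(1 + theta) ** exponent(a, b) for a in range(3)] for b in range(3)]

THETA = 0.5  # hypothetical effect size, for illustration only
TABLES = {m: model_table(m, THETA) for m in (1, 2, 3)}
```

Writing the tables this way makes the difference between the models explicit: only the exponent rule changes.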

Results: data with marginal effects
[Figure: results for Models 1, 2 and 3, each with LD 0.7 and 1.0 and MAF 0.1, 0.2 and 0.5]

Real data
- BD in WTCCC
  - 1868 cases and 2938 controls
  - 19 SNPs selected by a literature review
- Two parameter settings
  1. F(L,0,0,3): linear membership, without SE, without weight, threshold OR = 8
  2. F(S,1,1,2): sigmoid membership, with SE, with weight w = 1 + ln(or), threshold z = 2*1.96

index  rs number    MAF    Chromosome (position)  gene                     p-value (rank)
1      rs4027132    0.440  2 (11897366)                                    9.82E-06 (8)
2      rs7570682    0.230  2 (104366809)                                   1.83E-05 (12)
3      rs1375144    0.320  2 (115483610)          DPP10                    1.19E-05 (10)
4      rs2953145    0.210  2 (240576179)          RNPEPL1                  5.03E-06 (3)
5      rs4276227    0.350  3 (32289194)           CMTM8                    1.45E-05 (11)
6      rs683395     0.090  3 (183152030)          LAMP3                    5.25E-06 (4)
7      rs4411993    0.120  4 (7464739)            SORCS2                   1.13E-01 (17)
8      rs6458307    0.315  6 (42763377)           GLTSCR1L, LOC105375062   1.92E-06 (2)
9      rs2609653    0.060  8 (34379474)                                    5.39E-05 (14)
10     rs10982256   0.455  9 (114498554)          DFNB31                   4.33E-05 (13)
11     rs1006737    0.340  12 (2236129)           CACNA1C                  9.72E-04 (15)
12     rs1705236    0.055  12 (71151778)          TSPAN8                   7.22E-02 (16)
13     rs9315885    0.325  13 (42068674)          DGKH                     6.23E-01 (19)
14     rs10134944   0.095  14 (57652478)          SLC35F4                  1.15E-05 (9)
15     rs11622475   0.285  14 (104042739)         TDRD9                    7.69E-06 (6)
16     rs420259     0.270  16 (23622705)          PALB2                    1.33E-07 (1)
17     rs1344484    0.385  16 (52878387)                                   9.18E-06 (7)
18     rs4939921    0.100  18 (49935958)          MYO5B                    4.79E-01 (18)
19     rs3761218    0.380  20 (3795528)           CDC25B                   7.47E-06 (5)

Result of BD in WTCCC, F(S,1,1,2)

order  SNP combination    training accuracy  testing accuracy  CVC
1      19                 51.723             51.216            5
2      15, 19             52.658             52.434            10
3      2, 6, 14           52.613             51.418            3
4      4, 6, 14, 16       53.633             52.523            5
5      2, 6, 9, 11, 15    53.502             51.675            2

index  rs number    MAF    Chromosome  gene                                p-value (rank)
15     rs11622475   0.285  14          TDRD9 (Tudor Domain Containing 9)   7.69E-06 (6)
19     rs3761218    0.380  20          CDC25B (Cell Division Cycle 25B)    7.47E-06 (5)

Fuzzy MDR vs original MDR (interaction model)

Fuzzy MDR: M11      Original MDR: M110
1 1 0               1 1 0
1 0 0               1 0 0
0 0 0               0 1 1

M11 has been discovered in the real world!

5. Conclusion
- A novel and powerful fuzzy MDR for gene-gene interaction analysis
  - Based on fuzzy set theory
    - The H and L risk groups are fuzzy sets
    - Original MDR is a special case of fuzzy MDR
  - More flexible interpretation through the degree of membership of each multi-locus genotype
  - Potential for extension
- Future work
  - Determining the optimal tuning parameter values
  - Extensions

Thank you.

References
- Ritchie, M.D., et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer, Am J Hum Genet, 69, 138-147.
- Velez, D.R., et al. (2007) A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction, Genet Epidemiol, 31, 306-315.
- Leem, S., et al. (2014) Fast detection of high-order epistatic interactions in genome-wide association studies using information theoretic measure, Computational Biology and Chemistry, 50, 19-28.
- Burton, P.R., et al. (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls, Nature, 447, 661-678.
- Li, W. and Reich, J. (2000) A Complete Enumeration and Classification of Two-Locus Disease Models, Human Heredity, 50, 334-349.

Limit of single-locus association studies
Penetrance table (MAF: 0.4, prevalence: 0.1)

                        SNP_B
SNP_A         BB (0.36)   Bb (0.48)   bb (0.16)   P(SNP_A)
AA (0.36)     0.27        0           0           0.1
Aa (0.48)     0           0.21        0           0.1
aa (0.16)     0           0           0.625       0.1
P(SNP_B)      0.1         0.1         0.1         0.1

- Marginal penetrances are the same across the genotypes of SNP_A.
- Marginal penetrances are the same across the genotypes of SNP_B.
- Penetrances differ across the genotype combinations of SNP_A and SNP_B.
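The marginal penetrances in the last row and column can be checked by averaging the joint penetrances over the genotype frequencies of the other SNP. The sketch below uses the values printed on the slide; the function names are hypothetical, and independence of the two SNPs is an assumption. Because the printed penetrances are rounded, the marginals come out close to 0.1 rather than exactly 0.1.

```python
from typing import List

GENOTYPE_FREQ: List[float] = [0.36, 0.48, 0.16]  # frequencies printed for both SNPs

# Joint penetrances; rows are SNP_A genotypes, columns are SNP_B genotypes.
PENETRANCE: List[List[float]] = [
    [0.27, 0.0, 0.0],
    [0.0, 0.21, 0.0],
    [0.0, 0.0, 0.625],
]

def marginal_by_row(pen: List[List[float]], freq: List[float]) -> List[float]:
    """Penetrance of each SNP_A genotype, averaged over SNP_B genotype frequencies."""
    return [sum(p * f for p, f in zip(row, freq)) for row in pen]

def marginal_by_column(pen: List[List[float]], freq: List[float]) -> List[float]:
    """Penetrance of each SNP_B genotype, averaged over SNP_A genotype frequencies."""
    return [sum(pen[a][b] * freq[a] for a in range(3)) for b in range(3)]

def prevalence(pen: List[List[float]], freq: List[float]) -> float:
    """Overall disease probability, assuming the two SNPs are independent."""
    return sum(freq[a] * freq[b] * pen[a][b] for a in range(3) for b in range(3))

P_SNP_A = marginal_by_row(PENETRANCE, GENOTYPE_FREQ)
P_SNP_B = marginal_by_column(PENETRANCE, GENOTYPE_FREQ)
PREVALENCE = prevalence(PENETRANCE, GENOTYPE_FREQ)
```

Since every marginal is about 0.1, a test of either SNP alone sees no difference between genotypes, even though the joint penetrances range from 0 to 0.625.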

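Method: confusion matrix calculation
- Original

$$wtp = \sum_{i \in \text{high}} n_{i1}, \quad wfp = \sum_{i \in \text{high}} n_{i0}, \quad wfn = \sum_{i \in \text{low}} n_{i1}, \quad wtn = \sum_{i \in \text{low}} n_{i0}$$

- Weighted

$$wtp = \sum_{i \in \text{high}} w_i n_{i1}, \quad wfp = \sum_{i \in \text{high}} w_i n_{i0}, \quad wfn = \sum_{i \in \text{low}} w_i n_{i1}, \quad wtn = \sum_{i \in \text{low}} w_i n_{i0}$$

- Fuzzy

$$wtp = \sum_{i \in \text{high}} m_{i1} n_{i1} + \sum_{i \in \text{low}} m_{i1} n_{i1}, \qquad wfn = \sum_{i \in \text{high}} m_{i0} n_{i1} + \sum_{i \in \text{low}} m_{i0} n_{i1}$$

$$wfp = \sum_{i \in \text{high}} m_{i1} n_{i0} + \sum_{i \in \text{low}} m_{i1} n_{i0}, \qquad wtn = \sum_{i \in \text{high}} m_{i0} n_{i0} + \sum_{i \in \text{low}} m_{i0} n_{i0}$$

- Weighted fuzzy

$$wtp = \sum_{i \in \text{high}} w_i m_{i1} n_{i1} + \sum_{i \in \text{low}} w_i m_{i1} n_{i1}, \qquad wfn = \sum_{i \in \text{high}} w_i m_{i0} n_{i1} + \sum_{i \in \text{low}} w_i m_{i0} n_{i1}$$

$$wfp = \sum_{i \in \text{high}} w_i m_{i1} n_{i0} + \sum_{i \in \text{low}} w_i m_{i1} n_{i0}, \qquad wtn = \sum_{i \in \text{high}} w_i m_{i0} n_{i0} + \sum_{i \in \text{low}} w_i m_{i0} n_{i0}$$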

Examples (a) to (d): two-locus genotype tables
Each cell gives (# of cases, # of controls).

(a)            SNP 2 = 0    SNP 2 = 1    SNP 2 = 2
SNP 1 = 0      (5, 10)      (15, 10)     (5, 10)
SNP 1 = 1      (15, 10)     (0, 0)       (15, 10)
SNP 1 = 2      (5, 10)      (15, 10)     (5, 10)

(b)            SNP 2 = 0    SNP 2 = 1    SNP 2 = 2
SNP 1 = 0      (1, 10)      (19, 10)     (9, 10)
SNP 1 = 1      (11, 10)     (0, 0)       (11, 10)
SNP 1 = 2      (9, 10)      (19, 10)     (1, 10)

method                                             a                b
Chi-square statistic (p-value)                     10.667 (0.221)   20.514 (0.009)
Balanced accuracy of MDR                           0.625            0.625
Balanced accuracy of wba MDR (α = 0.25)            0.629            0.667
Balanced accuracy of fuzzy MDR (linear, OR = 8)    0.531            0.570

[Tables (c) and (d): cell layout not recoverable. Extracted values in reading order, (c): 20 10 10 10 10 0 10 10 10 10 0 0 20 10 10 10 10 0; (d): 50 5 5 0 5 5 5 5 0 0 5 5 50 5 5 0 5 5]

method                                             c                d
Chi-square statistic (p-value)                     17.501 (0.025)   35.029 (2.6E-5)
Balanced accuracy of MDR                           0.625            0.813
Balanced accuracy of wba MDR (α = 0.25)            1.000            1.000
Balanced accuracy of fuzzy MDR (linear, OR = 8)    0.625            0.813
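Examples (a) and (b) receive the same balanced accuracy from original MDR but different chi-square statistics, which can be checked with a short hand-rolled Pearson chi-square over the genotype cells. This is a sketch under assumptions: the function name is hypothetical, the cell counts are read from the genotype tables of examples (a) and (b), and the empty cell is skipped because its expected count is zero. No p-value is computed.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (number of cases, number of controls)

EXAMPLE_A: List[Cell] = [(5, 10), (15, 10), (5, 10),
                         (15, 10), (0, 0), (15, 10),
                         (5, 10), (15, 10), (5, 10)]
EXAMPLE_B: List[Cell] = [(1, 10), (19, 10), (9, 10),
                         (11, 10), (0, 0), (11, 10),
                         (9, 10), (19, 10), (1, 10)]

def chi_square_statistic(cells: List[Cell]) -> float:
    """Pearson chi-square for a genotype-by-status table, ignoring empty cells."""
    used = [(c, k) for c, k in cells if c + k > 0]
    total_cases = sum(c for c, _ in used)
    total_controls = sum(k for _, k in used)
    n = total_cases + total_controls
    stat = 0.0
    for cases, controls in used:
        row_total = cases + controls
        for observed, column_total in ((cases, total_cases), (controls, total_controls)):
            expected = row_total * column_total / n
            stat += (observed - expected) ** 2 / expected
    return stat

CHI_A = chi_square_statistic(EXAMPLE_A)
CHI_B = chi_square_statistic(EXAMPLE_B)
```

The statistics agree with the values 10.667 and 20.514 reported for (a) and (b), while the binary high/low labelling is identical in both tables.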


Calculations of an example (B)

Original MDR
genotype  # of case  # of control  is high  is low  TP  FP  FN  TN
00        1          10            0        1       0   0   1   10
01        19         10            1        0       19  10  0   0
02        9          10            0        1       0   0   9   10
10        11         10            1        0       11  10  0   0
11        0          0             1        0       0   0   0   0
12        11         10            1        0       11  10  0   0
20        9          10            0        1       0   0   9   10
21        19         10            1        0       19  10  0   0
22        1          10            0        1       0   0   1   10
sum                                                 60  40  20  40

wba MDR
genotype  OR    log(or)  TP     FP     FN     TN
00        0.14  1.18     0.00   0.00   1.18   11.81
01        1.86  0.89     16.85  8.87   0.00   0.00
02        0.90  0.56     0.00   0.00   5.06   5.62
10        1.10  0.55     6.04   5.49   0.00   0.00
11        1.00  0.00     0.00   0.00   0.00   0.00
12        1.10  0.55     6.04   5.49   0.00   0.00
20        0.90  0.56     0.00   0.00   5.06   5.62
21        1.86  0.89     16.85  8.87   0.00   0.00
22        0.14  1.18     0.00   0.00   1.18   11.81
sum                      45.79  28.72  12.49  34.87

Fuzzy MDR
genotype  p_high  p_low  TP     FP     FN     TN
00        0.03    0.97   0.03   0.32   0.97   9.68
01        0.65    0.35   12.33  6.49   6.67   3.51
02        0.48    0.52   4.28   4.76   4.72   5.24
10        0.52    0.48   5.74   5.22   5.26   4.78
11        0.50    0.50   0.00   0.00   0.00   0.00
12        0.52    0.48   5.74   5.22   5.26   4.78
20        0.48    0.52   4.28   4.76   4.72   5.24
21        0.65    0.35   12.33  6.49   6.67   3.51
22        0.03    0.97   0.03   0.32   0.97   9.68
sum                      44.77  33.58  35.23  46.42
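The fuzzy MDR columns of this example can be reproduced from the case and control counts alone. The sketch below rests on two assumptions that are inferred from the printed numbers rather than stated in the slides: 0.5 is added to each cell count before the odds ratio is taken, and the linear membership function is applied to log(OR) between -log 8 and log 8. All names are hypothetical.

```python
import math
from typing import Dict, Tuple

EXAMPLE_B: Dict[str, Tuple[int, int]] = {
    "00": (1, 10), "01": (19, 10), "02": (9, 10),
    "10": (11, 10), "11": (0, 0), "12": (11, 10),
    "20": (9, 10), "21": (19, 10), "22": (1, 10),
}
THRESHOLD_OR = 8.0   # linear membership, threshold OR = 8
CORRECTION = 0.5     # assumed continuity correction, inferred from the OR column

def mu_high(cases: int, controls: int, total_cases: int, total_controls: int) -> float:
    """Linear high-risk membership degree of one genotype cell on the log(OR) scale."""
    odds_ratio = ((cases + CORRECTION) / (controls + CORRECTION)) / (total_cases / total_controls)
    t_u = math.log(THRESHOLD_OR)
    t_l = -t_u
    # Clamping to [0, 1] is equivalent to the three-case linear membership function.
    return min(1.0, max(0.0, (math.log(odds_ratio) - t_l) / (t_u - t_l)))

def fuzzy_confusion(counts: Dict[str, Tuple[int, int]]) -> Tuple[float, float, float, float]:
    """Return fuzzy (TP, FP, FN, TN) summed over all genotype cells."""
    total_cases = sum(c for c, _ in counts.values())
    total_controls = sum(k for _, k in counts.values())
    tp = fp = fn = tn = 0.0
    for cases, controls in counts.values():
        high = mu_high(cases, controls, total_cases, total_controls)
        tp += cases * high
        fn += cases * (1.0 - high)
        fp += controls * high
        tn += controls * (1.0 - high)
    return tp, fp, fn, tn

def balanced_accuracy(tp: float, fp: float, fn: float, tn: float) -> float:
    """Mean of sensitivity and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

TP, FP, FN, TN = fuzzy_confusion(EXAMPLE_B)
BA = balanced_accuracy(TP, FP, FN, TN)
```

Under these assumptions the sums match the fuzzy MDR row of the table to two decimals, and the balanced accuracy agrees with the 0.570 reported for example (b).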

Accuracy = 0.6