A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction


Sangseob Leem, Hye-Young Jung, Sungyoung Lee and Taesung Park
Bioinformatics and Biostatistics Lab, Seoul National University

Contents
1. Introduction
2. Motivation
3. Method
4. Results
5. Conclusion

Interaction
[Figure: case/control counts by SNP 1 × SNP 2 genotype, with the single-SNP marginal counts]
- In a single-locus association study:
  - no effect of SNP 1
  - no effect of SNP 2
- A reason for the missing heritability

MDR method
Ritchie, M.D., et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer, Am. J. Hum. Genet., 69, 138-147.

Input: one row per individual with the genotypes of SNP 1 and SNP 2 (0, 1 or 2) and the class label.

Steps:
1. Calculate the case-control ratio of each multi-locus genotype.
2. Identify each genotype as high risk or low risk.
3. Build a 2 × 2 confusion matrix.

Confusion matrix layout:
                        true positive   true negative
  predicted positive         TP              FP
  predicted negative         FN              TN

Example after pooling the genotype cells:
               Case   Control
  High risk     12       4
  Low risk      10      24
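The three MDR steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the nine (case, control) pairs are the counts of the worked example (B) later in this transcript, and the rule "high risk when the cell ratio is at least the overall case:control ratio, with ties and empty cells counted as high risk" is an assumption chosen to be consistent with that example.

```python
# (cases, controls) per two-locus genotype; counts taken from example (B).
cells = {
    "00": (1, 10), "01": (19, 10), "02": (9, 10),
    "10": (11, 10), "11": (0, 0), "12": (11, 10),
    "20": (9, 10), "21": (19, 10), "22": (1, 10),
}

def mdr_confusion(cells):
    """Label each cell high/low risk and pool the counts into TP, FP, FN, TN."""
    total_cases = sum(c for c, _ in cells.values())
    total_controls = sum(k for _, k in cells.values())
    tp = fp = fn = tn = 0
    for cases, controls in cells.values():
        # Assumed rule: high risk if cases/controls >= overall ratio.
        # Cross-multiplied so that empty cells do not divide by zero.
        if cases * total_controls >= controls * total_cases:
            tp += cases
            fp += controls
        else:
            fn += cases
            tn += controls
    return tp, fp, fn, tn

def balanced_accuracy(tp, fp, fn, tn):
    """Mean of sensitivity and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

tp, fp, fn, tn = mdr_confusion(cells)
ba = balanced_accuracy(tp, fp, fn, tn)
print(tp, fp, fn, tn, ba)
```

With these counts the pooled matrix is TP = 60, FP = 40, FN = 20, TN = 40, which agrees with the "sum" row of the original MDR columns in example (B).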

Weaknesses of MDR
- Biological meaning
  - Are all possible genotype interaction models really possible in the real world?
  - Log-linear model based MDR (Lee et al. 2007)
- Computation time
  - Exponential increase with the interaction order
  - Filtering-based approaches: Relief, ReliefF, TuRF, SURF
  - Processing MDR: GPU, cumdr
- Binary classification
  - (# of cases, # of controls): (2, 1) vs (20, 10), (1, 11) vs (10, 20)
  - Next slide

Approaches to overcome simple binary classification
- Model-based MDR (MB-MDR)
  - (# of cases, # of controls): (2, 1) vs (20, 10)
  - Calle, M.L., et al. (2008) MB-MDR: model-based multifactor dimensionality reduction for detecting interactions in high-dimensional genomic data.
  - Ternary classification: high, low and no-evidence groups
- wba MDR (weighted balanced accuracy MDR)
  - (# of cases, # of controls): (11, 1) vs (20, 10)
  - Namkung, J., et al. (2009) New evaluation measures for multifactor dimensionality reduction classifiers in gene-gene interaction analysis, Bioinformatics, 25, 338-345.

Fuzzy set theory
- Extension of classical set theory
  - Zadeh, L.A. (1965) Fuzzy sets, Information and Control, 8, 338-353.
- Degrees of membership
  - "Rich or poor" vs "degree of rich"
  - Example: 1/10/100/1000 dollars for a day
[Figure: crisp poor/rich sets vs membership degree of "rich" over 1, 10, 100 and 1000 dollars]

Key difference
Contribution of one genotype cell (1 case, 20 controls) to the confusion matrix:
- Original MDR
  - Low risk: Case +1, Control +20
- wba MDR
  - Low risk: Case +1*2.5, Control +20*2.5
- Fuzzy MDR
  - High risk: Case +1*0.05, Control +20*0.05
  - Low risk: Case +1*0.95, Control +20*0.95

Simple example
Each cell lists two values as on the slide; the first is the original MDR count.

               Case           Control
  High risk    60 / 42.5      40 / 37.3
  Low risk     20 / 37.4      40 / 42.7

               Case           Control
  High risk    60 / 44.9      40 / 33.0
  Low risk     20 / 35.1      40 / 47.0

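Criteria (x) of membership degree ($\mu_H$, $\mu_L$)
- The estimate of the odds ratio (OR):

$$\hat{\theta}_i = \frac{n_{i1} / n_{i0}}{n_{+1} / n_{+0}}$$

  - $n_{ij}$: the number of individuals with the $i$-th multi-locus genotype in the $j$-th disease group
  - $n_{+j}$: the total number of individuals in the $j$-th disease group
  - $i = 1, \dots, 3^{k}$ for $k$ loci; $j = 1$ for case and $j = 0$ for control
- (# of cases, # of controls): (2, 1) vs (20, 10) have the same OR but different sample sizes
- Standardization:

$$z = \frac{\log \hat{\theta}_i}{SE}, \qquad SE = \sqrt{\frac{1}{n_{i1}} - \frac{1}{n_{+1}} + \frac{1}{n_{i0}} - \frac{1}{n_{+0}}}$$

$$\log \hat{\theta}_i = \log \frac{n_{i1} / n_{i0}}{n_{+1} / n_{+0}} = \log \frac{n_{i1}}{n_{+1}} - \log \frac{n_{i0}}{n_{+0}}$$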
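A short numeric sketch shows why the slide standardizes the odds ratio. It assumes hypothetical group totals of 100 cases and 100 controls, and it assumes the standard error has the form sqrt(1/n_i1 - 1/n_+1 + 1/n_i0 - 1/n_+0); the signs inside the square root are not legible in the transcript, so treat that form as an assumption.

```python
import math

def odds_ratio_and_z(n_i1, n_i0, n_p1, n_p0):
    """Return (theta_hat, z) for one multi-locus genotype cell.

    n_i1, n_i0: cases and controls in the cell.
    n_p1, n_p0: total cases and total controls.
    """
    theta = (n_i1 / n_i0) / (n_p1 / n_p0)
    # Assumed form of the standard error of log(theta).
    se = math.sqrt(1 / n_i1 - 1 / n_p1 + 1 / n_i0 - 1 / n_p0)
    return theta, math.log(theta) / se

# Hypothetical totals: 100 cases and 100 controls.
theta_small, z_small = odds_ratio_and_z(2, 1, 100, 100)
theta_large, z_large = odds_ratio_and_z(20, 10, 100, 100)
print(theta_small, round(z_small, 3))
print(theta_large, round(z_large, 3))
```

Both cells give the same odds ratio of 2, but the cell with 20 cases and 10 controls has the larger z, so the standardized criterion separates the two situations that plain binary classification treats alike.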

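Membership function
[Figure: membership degree as a function of x for original MDR and Fuzzy MDR]

Linear membership function, with lower threshold $t_l$ and upper threshold $t_u$:

$$\mu_H(x) = \begin{cases} 0, & x < t_l \\ \dfrac{x - t_l}{t_u - t_l}, & t_l \le x < t_u \\ 1, & x \ge t_u \end{cases} \qquad \mu_L(x) = 1 - \mu_H(x)$$

Sigmoid membership function, with the same outer branches and a logistic middle branch whose exponent depends on $x$, $t_l$ and $t_u$:

$$\mu_H(x) = \begin{cases} 0, & x < t_l \\ \dfrac{1}{1 + e^{\,\cdots}}, & t_l \le x < t_u \\ 1, & x \ge t_u \end{cases} \qquad \mu_L(x) = 1 - \mu_H(x)$$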
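The contrast between the crisp rule of original MDR and a linear membership function can be sketched as follows. The criterion x is taken to be the log odds ratio and the thresholds are placed symmetrically at -log(8) and +log(8); both choices are assumptions made for illustration, picked because they reproduce the p_high values listed in the worked example (B). The sigmoid variant is omitted because its exact form is not recoverable here.

```python
import math

def mu_high_crisp(x, cut=0.0):
    """Original MDR: a genotype is either fully high risk or fully low risk."""
    return 1.0 if x >= cut else 0.0

def mu_high_linear(x, t_l, t_u):
    """Linear membership: 0 below t_l, 1 at or above t_u, linear in between."""
    if x < t_l:
        return 0.0
    if x >= t_u:
        return 1.0
    return (x - t_l) / (t_u - t_l)

# Assumed symmetric thresholds on the log-OR scale for an OR threshold of 8.
t_u = math.log(8)
t_l = -t_u

for odds_ratio in (0.125, 19.5 / 10.5, 1.0, 8.0):
    x = math.log(odds_ratio)
    mu_h = mu_high_linear(x, t_l, t_u)
    # mu_L is always the complement of mu_H.
    print(round(odds_ratio, 3), mu_high_crisp(x), round(mu_h, 3), round(1 - mu_h, 3))
```

A genotype with OR = 1 receives membership 0.5 in both risk groups, whereas the crisp rule forces it entirely into one group.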

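Tuning parameters
- Notation: $F(\gamma_{\text{membership}}, \gamma_{\text{standardization}}, \gamma_{\text{weight}}, \gamma_{\text{threshold}})$
- 80 (2 × 2 × 4 × 5) combinations
- Membership function: $\gamma_{\text{membership}} = l$ for the linear membership function, $\gamma_{\text{membership}} = s$ for the sigmoid membership function
- Standardization: $\gamma_{\text{standardization}} = 0$ for OR, $\gamma_{\text{standardization}} = 1$ for z
- Weights: $w_i = 1 + \ln(\mathrm{OR})^{i}$, $i = 0, 0.5, 1, 2$
- Threshold values
  - 2, 4, 8, 16 and 32 for OR
  - 0.5 × 1.96, 1 × 1.96, 2 × 1.96, 4 × 1.96 and 8 × 1.96 for z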
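The count of 80 settings follows directly from the Cartesian product of the four parameter lists. The sketch below enumerates them; the tuple layout and variable names are illustrative choices, not the authors' code, and the threshold list is selected according to the standardization flag as described on the slide.

```python
from itertools import product

membership_functions = ("l", "s")      # l = linear, s = sigmoid
standardizations = (0, 1)              # 0 = raw OR, 1 = z
weight_exponents = (0, 0.5, 1, 2)
or_thresholds = (2, 4, 8, 16, 32)
z_thresholds = tuple(m * 1.96 for m in (0.5, 1, 2, 4, 8))

def parameter_grid():
    """Enumerate every (membership, standardization, weight, threshold) setting."""
    grid = []
    for mf, std, w_exp, k in product(
        membership_functions, standardizations, weight_exponents, range(5)
    ):
        # The threshold scale depends on whether the criterion is OR or z.
        threshold = z_thresholds[k] if std == 1 else or_thresholds[k]
        grid.append((mf, std, w_exp, threshold))
    return grid

grid = parameter_grid()
print(len(grid))
```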

Fuzzy MDR procedure(1) Consistent case/control ratio In two loci interactions,?c

Fuzzy MDR procedure (2)
- Original MDR vs Fuzzy MDR: in Fuzzy MDR the membership degrees depend on the parameter values.

$$TP = \sum_i n_{i1}\,\mu_H(x_i), \qquad FN = \sum_i n_{i1}\,\mu_L(x_i)$$

$$FP = \sum_i n_{i0}\,\mu_H(x_i), \qquad TN = \sum_i n_{i0}\,\mu_L(x_i)$$

Empirical studies
- Experiments on simulated data
  - Objectives
    - To compare the power of Fuzzy MDR with original MDR and wba MDR
    - To find optimal parameter values
  - Data
    - Without marginal effects
    - With marginal effects
  - Parameters: $F(\gamma_{\text{membership}}, \gamma_{\text{standardization}}, \gamma_{\text{weight}}, \gamma_{\text{threshold}})$
    - Linear/sigmoid, with/without SE, four weight values and five threshold values
- Experiments on real data
  - Bipolar disorder (BD) data from the Wellcome Trust Case Control Consortium (WTCCC)

Data without marginal effects
- Structure
  - Four sample sizes: 200, 400, 800 and 1600 samples
  - 1000 SNPs
  - Two causative SNPs
  - 70 penetrance tables: 7 heritability values × 2 minor allele frequencies × 5 models
- Example of a penetrance table (Model 1)

          AA       Aa       aa
  BB     0.486    0.960    0.538
  Bb     0.947    0.004    0.811
  bb     0.640    0.606    0.909

- Downloaded from http://discovery.dartmouth.edu/epistatic_data

[Figures: results for 200, 400, 800 and 1600 samples; panels by heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and MAF (0.2, 0.4)]

Data with marginal effects
- Structure
  - One sample size: 2000 cases and 2000 controls
  - 1000 SNPs
  - Two causative SNPs
  - 18 penetrance tables: 3 models × 3 minor allele frequencies × 2 linkage disequilibrium values

Model 1
          AA         Aa         aa
  BB      1          1+θ        (1+θ)^2
  Bb      1+θ        (1+θ)^2    (1+θ)^3
  bb      (1+θ)^2    (1+θ)^3    (1+θ)^4

Model 2
          AA         Aa         aa
  BB      1          1          1
  Bb      1          1+θ        (1+θ)^2
  bb      1          (1+θ)^2    (1+θ)^4

Model 3
          AA         Aa         aa
  BB      1          1          1
  Bb      1          1+θ        1+θ
  bb      1          1+θ        1+θ
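The three model tables follow simple patterns in the number of minor alleles at each locus, which a short sketch makes explicit. Indexing the genotypes by their minor-allele counts a (columns AA, Aa, aa) and b (rows BB, Bb, bb) is an assumption made here for illustration; the value of θ below is likewise hypothetical.

```python
def model_tables(theta):
    """Return the 3 x 3 tables of Models 1-3; rows are BB, Bb, bb, columns AA, Aa, aa."""
    r = 1 + theta
    counts = range(3)  # minor-allele count at a locus: 0, 1 or 2
    # Model 1: the exponent is the total number of minor alleles.
    model1 = [[r ** (a + b) for a in counts] for b in counts]
    # Model 2: the exponent is the product of the two minor-allele counts.
    model2 = [[r ** (a * b) for a in counts] for b in counts]
    # Model 3: a single effect whenever both loci carry a minor allele.
    model3 = [[r if (a > 0 and b > 0) else 1.0 for a in counts] for b in counts]
    return model1, model2, model3

m1, m2, m3 = model_tables(0.5)  # hypothetical theta
for name, table in (("Model 1", m1), ("Model 2", m2), ("Model 3", m3)):
    print(name)
    for row in table:
        print("  ", [round(v, 4) for v in row])
```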

Results: data with marginal effects
[Figure: results by model (1, 2, 3), LD (0.7, 1.0) and MAF (0.1, 0.2, 0.5)]

Real data
- BD in WTCCC
  - 1868 cases and 2938 controls
  - 19 SNPs are selected by a literature review
- Two parameter settings
  1. F(L,0,0,3): linear membership, without SE, without weight, threshold OR = 8
  2. F(S,1,1,2): sigmoid membership, with SE, $w_1 = 1 + \ln(\mathrm{OR})^{1}$, threshold z = 2 × 1.96

  index  rs number    MAF    Chromosome (position)  gene                     p-value (rank)
  1      rs4027132    0.440  2 (11897366)           -                        9.82E-06 (8)
  2      rs7570682    0.230  2 (104366809)          -                        1.83E-05 (12)
  3      rs1375144    0.320  2 (115483610)          DPP10                    1.19E-05 (10)
  4      rs2953145    0.210  2 (240576179)          RNPEPL1                  5.03E-06 (3)
  5      rs4276227    0.350  3 (32289194)           CMTM8                    1.45E-05 (11)
  6      rs683395     0.090  3 (183152030)          LAMP3                    5.25E-06 (4)
  7      rs4411993    0.120  4 (7464739)            SORCS2                   1.13E-01 (17)
  8      rs6458307    0.315  6 (42763377)           GLTSCR1L, LOC105375062   1.92E-06 (2)
  9      rs2609653    0.060  8 (34379474)           -                        5.39E-05 (14)
  10     rs10982256   0.455  9 (114498554)          DFNB31                   4.33E-05 (13)
  11     rs1006737    0.340  12 (2236129)           CACNA1C                  9.72E-04 (15)
  12     rs1705236    0.055  12 (71151778)          TSPAN8                   7.22E-02 (16)
  13     rs9315885    0.325  13 (42068674)          DGKH                     6.23E-01 (19)
  14     rs10134944   0.095  14 (57652478)          SLC35F4                  1.15E-05 (9)
  15     rs11622475   0.285  14 (104042739)         TDRD9                    7.69E-06 (6)
  16     rs420259     0.270  16 (23622705)          PALB2                    1.33E-07 (1)
  17     rs1344484    0.385  16 (52878387)          -                        9.18E-06 (7)
  18     rs4939921    0.100  18 (49935958)          MYO5B                    4.79E-01 (18)
  19     rs3761218    0.380  20 (3795528)           CDC25B                   7.47E-06 (5)

Result of BD in WTCCC, F(S,1,1,2)

  order  SNP combination   training accuracy  testing accuracy  CVC
  1      19                51.723             51.216            5
  2      15, 19            52.658             52.434            10
  3      2, 6, 14          52.613             51.418            3
  4      4, 6, 14, 16      53.633             52.523            5
  5      2, 6, 9, 11, 15   53.502             51.675            2

  index  rs number    MAF    Chromosome  gene                                  p-value (rank)
  15     rs11622475   0.285  14          TDRD9 (Tudor Domain Containing 9)     7.69E-06 (6)
  19     rs3761218    0.380  20          CDC25B (Cell Division Cycle 25B)      7.47E-06 (5)

Fuzzy MDR vs original MDR (interaction model)
- M11 has been discovered in the real world!
[Figure: 3 × 3 high/low-risk genotype grids of the interaction models found by Fuzzy MDR (M11) and original MDR (M110)]

5. Conclusion
- A novel and powerful Fuzzy MDR for gene-gene interaction analysis
  - Based on fuzzy set theory
  - The H and L risk groups are fuzzy sets
  - Original MDR is a special case of Fuzzy MDR
  - More flexible interpretation through the degree of membership of each multi-locus genotype
  - Potential for extension
- Future work
  - Determining the optimal tuning parameter values
  - Extensions

Thank you.

References
Ritchie, M.D., et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer, Am J Hum Genet, 69, 138-147.
Velez, D.R., et al. (2007) A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction, Genet Epidemiol, 31, 306-315.
Leem, S., et al. (2014) Fast detection of high-order epistatic interactions in genome-wide association studies using information theoretic measure, Computational Biology and Chemistry, 50, 19-28.
Burton, P.R., et al. (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls, Nature, 447, 661-678.
Li, W. and Reich, J. (2000) A Complete Enumeration and Classification of Two-Locus Disease Models, Human Heredity, 50, 334-349.

Limit of single-locus association studies
<Penetrance table> MAF: 0.4, Prevalence: 0.1

                          SNP_B
  SNP_A         BB (0.36)   Bb (0.48)   bb (0.16)   P_SNP_A
  AA (0.36)       0.27        0           0           0.1
  Aa (0.48)       0           0.21        0           0.1
  aa (0.16)       0           0           0.625       0.1
  P_SNP_B         0.1         0.1         0.1         0.1

- Penetrances are the same across genotypes in SNP_A.
- Penetrances are the same across genotypes in SNP_B.
- Penetrances are different in genotype combinations of SNP_A and SNP_B.
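The marginal penetrances in the table can be checked numerically. The sketch below assumes that the two SNPs are independent, so that the marginal penetrance of one genotype is the frequency-weighted average over the genotypes of the other SNP; the genotype frequencies 0.36, 0.48 and 0.16 are those printed in the table.

```python
freq = [0.36, 0.48, 0.16]  # genotype frequencies at MAF 0.4, same for both SNPs

# Joint penetrances: rows are SNP_A genotypes, columns are SNP_B genotypes.
penetrance = [
    [0.27, 0.0, 0.0],
    [0.0, 0.21, 0.0],
    [0.0, 0.0, 0.625],
]

# Marginal penetrance of each SNP_A genotype, averaging over SNP_B.
marginal_a = [sum(penetrance[a][b] * freq[b] for b in range(3)) for a in range(3)]
# Marginal penetrance of each SNP_B genotype, averaging over SNP_A.
marginal_b = [sum(penetrance[a][b] * freq[a] for a in range(3)) for b in range(3)]
# Overall prevalence under the independence assumption.
prevalence = sum(
    penetrance[a][b] * freq[a] * freq[b] for a in range(3) for b in range(3)
)

print([round(m, 4) for m in marginal_a])
print([round(m, 4) for m in marginal_b])
print(round(prevalence, 4))
```

Every marginal value comes out close to 0.1 even though the joint penetrances range from 0 to 0.625, which is why a single-locus test sees no effect for either SNP.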

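Method: confusion matrix calculation
Here $w_i$ is the weight of the $i$-th genotype, and $m_{i1}$ and $m_{i0}$ are its membership degrees in the high-risk and low-risk groups.

- Original

$$TP = \sum_{i \in \text{high}} n_{i1}, \quad FP = \sum_{i \in \text{high}} n_{i0}, \quad FN = \sum_{i \in \text{low}} n_{i1}, \quad TN = \sum_{i \in \text{low}} n_{i0}$$

- Weighted

$$wTP = \sum_{i \in \text{high}} w_i n_{i1}, \quad wFP = \sum_{i \in \text{high}} w_i n_{i0}, \quad wFN = \sum_{i \in \text{low}} w_i n_{i1}, \quad wTN = \sum_{i \in \text{low}} w_i n_{i0}$$

- Fuzzy

$$TP = \sum_{i \in \text{high}} m_{i1} n_{i1} + \sum_{i \in \text{low}} m_{i1} n_{i1}, \quad FN = \sum_{i \in \text{high}} m_{i0} n_{i1} + \sum_{i \in \text{low}} m_{i0} n_{i1}$$

$$FP = \sum_{i \in \text{high}} m_{i1} n_{i0} + \sum_{i \in \text{low}} m_{i1} n_{i0}, \quad TN = \sum_{i \in \text{high}} m_{i0} n_{i0} + \sum_{i \in \text{low}} m_{i0} n_{i0}$$

- Weighted fuzzy

$$wTP = \sum_{i \in \text{high}} w_i m_{i1} n_{i1} + \sum_{i \in \text{low}} w_i m_{i1} n_{i1}, \quad wFN = \sum_{i \in \text{high}} w_i m_{i0} n_{i1} + \sum_{i \in \text{low}} w_i m_{i0} n_{i1}$$

$$wFP = \sum_{i \in \text{high}} w_i m_{i1} n_{i0} + \sum_{i \in \text{low}} w_i m_{i1} n_{i0}, \quad wTN = \sum_{i \in \text{high}} w_i m_{i0} n_{i0} + \sum_{i \in \text{low}} w_i m_{i0} n_{i0}$$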
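All four variants can be written as one routine in which every genotype contributes to both risk groups in proportion to its membership degree. The sketch below is an illustration under that reading, with the low-risk membership taken as one minus the high-risk membership; the single cell of 1 case and 20 controls, the weight 2.5 and the membership 0.05 are the values from the "Key difference" slide.

```python
def confusion(cells, mu_high, weights=None):
    """Pool (cases, controls) cells into (TP, FP, FN, TN).

    mu_high: membership degree of each cell in the high-risk group.
    weights: optional per-cell weights (defaults to 1 for every cell).
    """
    if weights is None:
        weights = [1.0] * len(cells)
    tp = fp = fn = tn = 0.0
    for (cases, controls), m, w in zip(cells, mu_high, weights):
        tp += w * m * cases
        fp += w * m * controls
        fn += w * (1 - m) * cases
        tn += w * (1 - m) * controls
    return tp, fp, fn, tn

cells = [(1, 20)]
original = confusion(cells, mu_high=[0.0])                 # crisp low-risk cell
weighted = confusion(cells, mu_high=[0.0], weights=[2.5])  # wba-style weighting
fuzzy = confusion(cells, mu_high=[0.05])                   # fuzzy membership
print(original, weighted, fuzzy)
```

Setting every membership to 0 or 1 and every weight to 1 recovers the original MDR counts, which is the sense in which original MDR is a special case of Fuzzy MDR.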

Examples: (case, control) counts per two-locus genotype

(a)
               SNP 2 = 0   SNP 2 = 1   SNP 2 = 2
  SNP 1 = 0     5, 10       15, 10       5, 10
  SNP 1 = 1    15, 10        0, 0       15, 10
  SNP 1 = 2     5, 10       15, 10       5, 10

(b)
               SNP 2 = 0   SNP 2 = 1   SNP 2 = 2
  SNP 1 = 0     1, 10       19, 10       9, 10
  SNP 1 = 1    11, 10        0, 0       11, 10
  SNP 1 = 2     9, 10       19, 10       1, 10

  method                                             a                b
  Chi-square statistic (p-value)                     10.667 (0.221)   20.514 (0.009)
  Balanced accuracy of MDR                           0.625            0.625
  Balanced accuracy of wba MDR (α = 0.25)            0.629            0.667
  Balanced accuracy of fuzzy MDR (linear, OR = 8)    0.531            0.570

[Figure: genotype count tables for examples (c) and (d)]

  method                                             c                d
  Chi-square statistic (p-value)                     17.501 (0.025)   35.029 (2.6E-5)
  Balanced accuracy of MDR                           0.625            0.813
  Balanced accuracy of wba MDR (α = 0.25)            1.000            1.000
  Balanced accuracy of fuzzy MDR (linear, OR = 8)    0.625            0.813
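The chi-square statistics quoted for examples (a) and (b) can be reproduced with the usual Pearson statistic on the genotype-by-status table. The sketch assumes the nine (case, control) pairs below are the cell counts of the two examples, with four cells of each kind around an empty centre cell in (a), and it skips empty cells because their expected counts are zero. Only the statistic is computed, not the p-value.

```python
def pearson_chi_square(cells):
    """Pearson chi-square statistic for a genotype x (case, control) table."""
    total_cases = sum(c for c, _ in cells)
    total_controls = sum(k for _, k in cells)
    n = total_cases + total_controls
    stat = 0.0
    for cases, controls in cells:
        row_total = cases + controls
        if row_total == 0:
            continue  # an empty genotype contributes nothing
        for observed, column_total in ((cases, total_cases), (controls, total_controls)):
            expected = row_total * column_total / n
            stat += (observed - expected) ** 2 / expected
    return stat

example_a = [(5, 10), (15, 10), (5, 10), (15, 10), (0, 0),
             (15, 10), (5, 10), (15, 10), (5, 10)]
example_b = [(1, 10), (19, 10), (9, 10), (11, 10), (0, 0),
             (11, 10), (9, 10), (19, 10), (1, 10)]

chi_a = pearson_chi_square(example_a)
chi_b = pearson_chi_square(example_b)
print(round(chi_a, 3), round(chi_b, 3))
```

Both examples yield the same original MDR balanced accuracy even though (b) carries much stronger evidence of association, which is the motivation for the weighted and fuzzy variants.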


Calculations of an example (B)

Original MDR
  genotype  # of case  # of control  is high  is low   TP   FP   FN   TN
  00            1          10           0       1       0    0    1   10
  01           19          10           1       0      19   10    0    0
  02            9          10           0       1       0    0    9   10
  10           11          10           1       0      11   10    0    0
  11            0           0           1       0       0    0    0    0
  12           11          10           1       0      11   10    0    0
  20            9          10           0       1       0    0    9   10
  21           19          10           1       0      19   10    0    0
  22            1          10           0       1       0    0    1   10
  sum                                                  60   40   20   40

wba MDR
  genotype   OR    log(OR)    TP      FP      FN      TN
  00        0.14    1.18      0.00    0.00    1.18   11.81
  01        1.86    0.89     16.85    8.87    0.00    0.00
  02        0.90    0.56      0.00    0.00    5.06    5.62
  10        1.10    0.55      6.04    5.49    0.00    0.00
  11        1.00    0.00      0.00    0.00    0.00    0.00
  12        1.10    0.55      6.04    5.49    0.00    0.00
  20        0.90    0.56      0.00    0.00    5.06    5.62
  21        1.86    0.89     16.85    8.87    0.00    0.00
  22        0.14    1.18      0.00    0.00    1.18   11.81
  sum                        45.79   28.72   12.49   34.87

Fuzzy MDR
  genotype  p_high  p_low     TP      FP      FN      TN
  00         0.03    0.97     0.03    0.32    0.97    9.68
  01         0.65    0.35    12.33    6.49    6.67    3.51
  02         0.48    0.52     4.28    4.76    4.72    5.24
  10         0.52    0.48     5.74    5.22    5.26    4.78
  11         0.50    0.50     0.00    0.00    0.00    0.00
  12         0.52    0.48     5.74    5.22    5.26    4.78
  20         0.48    0.52     4.28    4.76    4.72    5.24
  21         0.65    0.35    12.33    6.49    6.67    3.51
  22         0.03    0.97     0.03    0.32    0.97    9.68
  sum                        44.77   33.58   35.23   46.42
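The sums of the wba MDR and Fuzzy MDR columns can be reproduced with a short script. Three ingredients are inferred from the tabulated numbers rather than stated in the transcript, so treat them as assumptions: the odds ratio uses a 0.5 continuity correction in each cell, the wba weight is |ln OR| raised to α = 0.25, and the linear membership uses symmetric thresholds at -ln(8) and +ln(8) on the log-OR scale.

```python
import math

# (cases, controls) for genotypes 00, 01, 02, 10, 11, 12, 20, 21, 22.
cells = [(1, 10), (19, 10), (9, 10), (11, 10), (0, 0),
         (11, 10), (9, 10), (19, 10), (1, 10)]
total_cases = sum(c for c, _ in cells)
total_controls = sum(k for _, k in cells)

def log_odds_ratio(cases, controls):
    # Assumed 0.5 continuity correction; it matches the tabulated OR column.
    odds = (cases + 0.5) / (controls + 0.5)
    return math.log(odds / (total_cases / total_controls))

def wba_confusion(alpha=0.25):
    """Crisp high/low assignment with per-cell weights |ln OR| ** alpha."""
    tp = fp = fn = tn = 0.0
    for cases, controls in cells:
        x = log_odds_ratio(cases, controls)
        w = abs(x) ** alpha
        if x >= 0:
            tp += w * cases
            fp += w * controls
        else:
            fn += w * cases
            tn += w * controls
    return tp, fp, fn, tn

def fuzzy_confusion(or_threshold=8.0):
    """Linear membership between -ln(threshold) and +ln(threshold)."""
    t = math.log(or_threshold)
    tp = fp = fn = tn = 0.0
    for cases, controls in cells:
        x = log_odds_ratio(cases, controls)
        mu_h = min(1.0, max(0.0, (x + t) / (2 * t)))
        tp += mu_h * cases
        fp += mu_h * controls
        fn += (1 - mu_h) * cases
        tn += (1 - mu_h) * controls
    return tp, fp, fn, tn

def balanced_accuracy(tp, fp, fn, tn):
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

wba = wba_confusion()
fuzzy = fuzzy_confusion()
print([round(v, 2) for v in wba], round(balanced_accuracy(*wba), 3))
print([round(v, 2) for v in fuzzy], round(balanced_accuracy(*fuzzy), 3))
```

Under these assumptions the script gives balanced accuracies of about 0.667 for wba MDR and 0.570 for Fuzzy MDR, in line with the values reported for example (b).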

Accuracy = 0.6