INTRODUCTION TO GENETIC EPIDEMIOLOGY (GBIO0015-1) Prof. Dr. Dr. K. Van Steen
2 CHAPTER 7: A WORLD OF INTERACTIONS
1 Beyond main effects
  1.a Dealing with multiplicity
  1.b A bird's eye view on roads less travelled by
  1.c Multi-locus analysis: epistasis analysis
2 Epistasis detection: a challenging task
  2.a Variable selection
  2.b Multifactor dimensionality reduction
  2.c Interpretation
3 Future challenges
K Van Steen 521
3 Introduction to Genetic Epidemiology Chapter 7: A World of Interactions. 1 Beyond main effects, 1.a Dealing with multiplicity. Multiple-testing explosion: ~500,000 SNPs span 80% of common variation in the genome (HapMap), and the number of tests grows rapidly once n-th order interactions are considered. K Van Steen 522
4 Ways to handle multiplicity. Recall that several strategies can be adopted, including:
- clever multiple-testing corrective procedures
- pre-screening strategies
- multi-stage designs
- adopting haplotype tests, or
- multi-locus tests
Which of these approaches is most powerful is still under heavy debate. Does the multiple testing problem become unmanageable when looking at multiple loci jointly? K Van Steen 523
5 1.b A bird's eye view on roads less travelled by. Multiple disease susceptibility loci (mdsl). Dichotomy between:
- improving single-marker strategies to pick up multiple signals at once (PBAT)
- testing groups of markers (FBAT multi-locus tests)
K Van Steen 524
6 PBAT screening for mdsl Little has been done in the context of family-based screening for epistasis First assess how a method is capable of detecting multiple DSL Simulation strategy (10,000 replicates): - Genetic data from Affymetrix SNPChip 10K array on 467 subjects from 167 families - Select 5 regions; 1 DSL in each region - Generate traits according to normal distribution, including up to 5 genetic contributions - For each replicate: generate heritability according to uniform distribution with mean h = 0.03 for all loci considered (or h = 0.05 for all loci) (Van Steen et al 2005) K Van Steen 525
7 General theory on FBAT testing. Test statistic:
- works for any phenotype and genetic model
- uses the covariance between offspring trait and genotype
Test distribution:
- computed assuming H0 is true; the random variable is the offspring genotype
- condition on parental genotypes when available; extend to family configurations (avoids specifying the allele distribution)
- condition on offspring phenotypes (avoids specifying the trait distribution)
(Horvath et al 1998, 2001; Laird et al 2000) K Van Steen 526
8 Screen: use between-family information [f(S,Y)]; calculate conditional power; select the top N SNPs on the basis of power. Test: use within-family information [f(X|S)] while computing the FBAT statistic; this step is independent of the screening step; adjust for N tests (not 500K!). (Van Steen et al 2005) (Lange and Laird 2006) K Van Steen 527
9 Power to detect genes with multiple DSL top : top 5 SNPs in the ranking bottom: top 10 SNPs in the ranking (Van Steen et al 2005) K Van Steen 528
10 Power to detect genes with multiple DSL top : Benjamini-Yekutieli FDR control at 5% (general dependencies) bottom: Benjamini-Hochberg FDR control at 5% (Van Steen et al 2005) K Van Steen 529
11 FBAT multi-locus tests FBAT-SNP-PC attains higher power in candidate genes with lower average pair-wise correlations and moderate to high allele frequencies with large gains (up to 80%). (Rakovski et al 2008) The new test has an overall performance very similar to that of FBAT-LC (FBAT-LC : Xin et al 2008) K Van Steen 530
12 In contrast: popular multi-locus approaches for unrelateds Parametric methods: - Regression - Logistic or (Bagged) logic regression Non-parametric methods: - Combinatorial Partitioning Method (CPM) quantitative phenotypes; interactions - Multifactor-Dimensionality Reduction (MDR) qualitative phenotypes; interactions - Machine learning and data mining The multiple testing problem becomes unmanageable when looking at (genetic) interaction effects? More about this in Chapter 9. K Van Steen 531
13 1.c Multi-locus analysis: epistasis analysis. Epistasis: what's in a name? Interaction is a kind of action that occurs as two or more objects have an effect upon one another. The idea of a two-way effect is essential in the concept of interaction, as opposed to a one-way causal effect. (Wikipedia) (slide: C Amos) K Van Steen 532
14 Epistasis: what's in a name? Distortions of Mendelian segregation ratios due to one gene masking the effects of another (William Bateson). Deviations from linearity in a statistical model (Ronald Fisher). See: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans (Cordell 2002) K Van Steen 533
15 Why is there epistasis? From an evolutionary biology perspective, for a phenotype to be buffered against the effects of mutations, it must have an underlying genetic architecture that is comprised of networks of genes that are redundant and robust. This creates dependencies among the genes in the network and is realized as epistasis. (slide: Y Chen, 2007) K Van Steen 534
16 Different types of interactions. Genotypic means at a single locus: qq = m - a, Qq = m + d, QQ = m + a (Fisher, Wright) (slide: C Amos) K Van Steen 535
17 Interpretation of epistasis. The study of epistasis poses problems of interpretability. Statistically, epistasis is usually defined in terms of deviation from a model of additive multiple effects, but this might be on either a linear or logarithmic scale, which implies different definitions. (Moore 2004)
- Despite the aforementioned concerns, there is evidence that a direct search for epistatic effects can pay dividends.
- Epistasis is expected to play an increasing role in future analyses.
K Van Steen 536
18 The frequency of epistasis. Not a new idea! (Bateson 1909) Complexity of gene regulation and biochemical networks (Gibson 1996; Templeton 2000). Single-gene results don't replicate (Hirschhorn et al. 2002). Gene-gene interactions are commonly found when properly investigated (Templeton 2000). Working hypothesis: single-gene studies don't replicate because gene-gene interactions are more important (Moore and Williams 2002) (Moore 2003) K Van Steen 537
19 Slow shift from main effects towards epistasis effects (Motsinger et al 2007) K Van Steen 538
20 Power of a gene-gene or gene-environment interaction analysis. There is a vast literature on power considerations:
- Most of this literature supports its claims by extensive simulation studies.
There is a need for user-friendly software tools that allow the user to perform hands-on power calculations. The main package targeting interaction analyses is QUANTO (v1.2.1):
- Available study designs for a disease (binary) outcome include the unmatched case-control, matched case-control, case-sibling, case-parent, and case-only designs. Study designs for a quantitative trait include independent individuals and case-parent designs.
- References: Gauderman (2000a), Gauderman (2000b), Gauderman (2003)
K Van Steen 539
21 A simple example of epistasis K Van Steen 540
22 A simple disease model. Penetrance = Pr(affected | genotype). One-locus dominant model: genotypes aa, Aa, AA with their affection status. K Van Steen 541
23 A slightly more complicated two-locus model: a 3x3 penetrance table with genotypes bb, Bb, BB at one locus and aa, Aa, AA at the other. Enumeration of two-locus models: although there are 2^9 = 512 possible models, because of symmetries in the data only 50 of these are unique. The enumeration allows only 0 and 1 as penetrance values ("fully penetrant" models). K Van Steen 542
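The count of 50 can be checked by brute force. The sketch below (Python, illustrative; it assumes that two fully penetrant models are identified when related by allele swaps at either locus, exchange of the two loci, or relabeling of affected/unaffected) enumerates all 512 binary penetrance tables and counts the equivalence classes:

```python
from itertools import product

def dihedral(m):
    """The 8 symmetries of a 3x3 table m: allele swaps at each locus and locus exchange."""
    out = []
    cur = m
    for _ in range(4):
        cur = tuple(zip(*cur[::-1]))               # rotate 90 degrees
        out.append(cur)
        out.append(tuple(r[::-1] for r in cur))    # ... plus a column flip
    return out

def canonical(m):
    """Smallest representative of m's orbit, also allowing case/control relabeling."""
    variants = []
    for d in dihedral(m):
        variants.append(d)
        variants.append(tuple(tuple(1 - x for x in row) for row in d))
    return min(variants)

tables = [tuple(tuple(bits[3 * i + j] for j in range(3)) for i in range(3))
          for bits in product((0, 1), repeat=9)]
orbits = {canonical(t) for t in tables}
nontrivial = [o for o in orbits if 0 < sum(map(sum, o)) < 9]
print(len(orbits), len(nontrivial))  # 51 orbits; dropping the constant (no-disease) table leaves 50
```

Under these symmetries the brute force yields 51 classes, 50 of which are genuine disease models (the remaining class is the constant all-0/all-1 table).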
24 Enumeration of two-locus models (Li and Reich 2000). Each model represents a group of equivalent models under permutations; the representative model is the one with the smallest model number. The six models studied in Neuman and Rice [67] (RR, RD, DD, T, Mod, XOR), as well as two single-locus (1L) models, the recessive (R) and the interference (I) model, are marked. K Van Steen 543
25 Different degrees of epistasis (slide: Motsinger) K Van Steen 544
26 Pure epistasis model for dichotomous traits. Suppose:
- p(A) = p(B) = p(a) = p(b) = 0.5
- HWE (hence p(aa) = 0.5^2 = 0.25 and p(Aa) = 2 x 0.5 x 0.5 = 0.5) and no LD
- penetrances P(affected | genotype) are given by a 3x3 table over the two-locus genotypes (bb, Bb, BB by aa, Aa, AA)
Then make repeated use of Bayes' rule to retrieve the genotype distributions in cases and controls. K Van Steen 545
27 Pure epistasis model for dichotomous traits. Then the marginal genotype distributions P(genotypes | affected) and P(genotypes | unaffected) are the same for cases and controls, and hence one-locus approaches will be powerless! For example, P(aa,BB | D) = P(D | aa,BB) P(aa,BB) / P(D) = 1/4 = 0.25. K Van Steen 546
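A quick numerical check (Python sketch; the checkerboard penetrance table is one illustrative purely epistatic model, not necessarily the one on the original slide) applies Bayes' rule cell by cell and confirms that the one-locus genotype distributions are identical in cases and controls:

```python
from itertools import product

hwe = {0: 0.25, 1: 0.5, 2: 0.25}   # P(genotype), coded as minor-allele count, MAF = 0.5

def penetrance(g1, g2, t=0.5):
    """Checkerboard (XOR-like) purely epistatic model: risk only if g1 + g2 is even."""
    return t if (g1 + g2) % 2 == 0 else 0.0

prevalence = sum(hwe[g1] * hwe[g2] * penetrance(g1, g2)
                 for g1, g2 in product(hwe, hwe))   # = 0.25

def marginal(locus, affected):
    """One-locus genotype distribution among cases (or controls), via Bayes' rule."""
    denom = prevalence if affected else 1 - prevalence
    dist = {}
    for g in hwe:
        num = sum(hwe[g1] * hwe[g2] *
                  (penetrance(g1, g2) if affected else 1 - penetrance(g1, g2))
                  for g1, g2 in product(hwe, hwe)
                  if (g1, g2)[locus] == g)
        dist[g] = num / denom
    return dist

for locus in (0, 1):
    print(marginal(locus, True), marginal(locus, False))
# Both marginals equal the population HWE distribution, so single-locus tests see nothing.
```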
28 Purely epistatic 3-locus disease model for quantitative traits. Assume all allele frequencies are 0.5; heritability is 55% and prevalence is 6.25%. [Penetrance tables for loci L1 x L2 at each level of L3] (Culverhouse et al 2002) K Van Steen 547
29 Expected genotype patterns for the 3-locus model. [Table with columns L1, L2, L3, p(g), E[#affected], E[#unaffected], plus an "Other" row and column sums] (Culverhouse et al 2002) (slide: J Ott 2004) K Van Steen 548
30 2 Epistasis detection: a challenging task. Main challenges: variable selection; modeling; interpretation (making inferences about biological epistasis from statistical epistasis). (slide: Y Chen 2007) K Van Steen 549
31 2.a Variable selection: introduction. The aim is to make clever selections of marker combinations to examine in an epistasis analysis. This may not only aid the interpretation of analysis results, but also reduce the burden of multiple testing and the computational burden. K Van Steen 550
32 Variable selection and multiple testing. Multiple testing is a thorny issue, the bane of statistical genetics.
- The problem is not really the number of tests that are carried out: even if a researcher only tests one SNP for one phenotype, if many other researchers do the same and the nominally significant associations are reported, there will be a problem of false positives. (Balding 2006)
Example:
- Given 3 disease SNPs (e.g., the Culverhouse 3-locus model above), making inferences is not at all an easy task (chi-square with 26 df).
- With 50,000 SNPs there are roughly 2 x 10^13 subsets of size 3, and the Bonferroni-corrected per-test threshold becomes vanishingly small.
- A more manageable approach is to test all possible pairs of loci for interaction effects that differ between cases and controls (Hoh and Ott 2003).
K Van Steen 551
33 Variable selection and multiple testing Pre-screening for subsequent testing: - Independent screening and testing step (PBAT screening; Van Steen et al 2005) - Dependent screening and testing step K Van Steen 552
34 Methods to correct for multiple testing Family-wise error rates (FWER) - In the presence of too many SNPs, the Bonferroni threshold will be extremely low: Bonferroni adjustments are conservative when statistical tests are not independent / Bonferroni adjustments control the error rate associated with the omnibus null hypothesis / The interpretation of a finding depends on how many statistical tests were performed Permutation data sets - It is particularly handy for rare genotypes, small studies, non-normal phenotypes, and tightly linked markers - In case-control data this is relatively straightforward / In family data this is not at all an easy task K Van Steen 553
35 Methods to correct for multiple testing False discovery rate (FDR) - With too many SNPs it starts to break down and the power over Bonferroni is minimal (e.g. see Van Steen et al 2005) False-positive report probability (FPRP) - It is the probability of no true association between a genetic variant and disease given a statistically significant finding, depends not only on the observed p-value but also on both the prior probability that the association between the genetic variant and the disease is real and the statistical power of the test (Wacholder et al 2004) - In general, Bayesian approaches do not yet have a big role in genetic association analyses, possibly because of computational burden? - Not yet well documented / What are the priors? (Balding 2006; Lucke 2008) K Van Steen 554
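As a concrete illustration of FDR control, here is a minimal Benjamini-Hochberg step-up procedure in Python (the p-values are invented for the example):

```python
def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: reject the k smallest p-values, where k is the largest rank r
    such that p_(r) <= q * r / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])   # indices of rejected hypotheses

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))                                   # BH rejects two hypotheses
print([i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)])  # Bonferroni rejects only one
```

Under arbitrary dependence, the Benjamini-Yekutieli variant mentioned earlier replaces q by q / (1 + 1/2 + ... + 1/m), which is more conservative.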
36 Variable selection and computation time. When SNPs do not have independent effects, however, it is impossible for most current computer technologies to analyze the resulting astronomical number of possible combinations. For instance, if SNPs have been measured at a density of 1 SNP every 10 kilobases (kb), and if 10 statistical evaluations can be computed each second, then evaluation of each individual SNP would require about 30,000 seconds (i.e., 8.3 hours) of computer time. Exhaustive evaluation of all pairwise combinations of SNPs would require 1286 years. Although it might be possible for a large supercomputer to complete these computations in a reasonable amount of time, an exhaustive search of all combinations of 3 or 4 SNPs would not be possible even if every computer in the world were simultaneously working on the problem. (Moore and Ritchie 2004) K Van Steen 555
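The scaling argument is easy to reproduce. The sketch below uses hypothetical figures (300,000 SNPs, i.e. one per 10 kb over a roughly 3-Gb genome, and an assumed rate of 10 evaluations per second; the slide's exact counts were lost in transcription):

```python
from math import comb

n, rate = 300_000, 10          # hypothetical: number of SNPs, evaluations per second
for k in (1, 2, 3):
    tests = comb(n, k)         # number of k-SNP subsets to evaluate
    seconds = tests / rate
    print(f"{k}-SNP subsets: {tests:.3e} tests, {seconds / 3600:.3g} hours "
          f"({seconds / (3600 * 24 * 365):.3g} years)")
```

Even the pairwise scan runs to years at this rate, and each extra order of interaction multiplies the count by roughly another factor of n.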
37 2.b Modeling Failure of traditional methods A large number of SNPs are genotyped - multiple comparisons problem, very small p-values required for significance, which is even compounded in gene-environment interaction analyses. Genetic loci may interact (epistasis) in their influence on the phenotype - loci with small marginal effects may go undetected - interested in the interaction itself Curse of dimensionality and sparse cells K Van Steen 556
38 Curse of dimensionality and sparse cells. For 2 SNPs, there are 3^2 = 9 possible two-locus genotype combinations. If the alleles are rare (MAF 10%), some cells will be empty. (slide: C Amos) K Van Steen 557
39 Curse of dimensionality and sparse cells. For 4 SNPs, there are 3^4 = 81 possible combinations, with even more cells possibly empty. (slide: C Amos) K Van Steen 558
40 Modeling, strategy 1: the set association approach. At each SNP, compute an association statistic T. Build sums over the 1, 2, 3, etc. highest values t. Evaluate the significance of a given sum by a permutation test. The sum with the smallest p-value points to the markers to select; this smallest p is itself a single statistic, whose significance level is in turn determined by permutation. The approach is applicable to many SNPs and has also been used in microarray settings. (Hoh et al 2001) K Van Steen 559
41 Strategy 1: Set association approach (Hoh et al 2001) K Van Steen 560
42 Modeling, strategy 2: multi-locus approaches. Case-control studies far too often do not take into account the multi-locus nature of complex traits. When the aim is to analyze multiple SNPs or genes jointly, two classes of approaches emerge:
- Combine (properties of) single-locus statistics over multiple SNPs to obtain a new multivariate test statistic; depending on whether the SNPs are in high LD or not, different measures need to be taken.
- Look for patterns of genotypes at SNPs in different genomic locations.
K Van Steen 561
43 Two frameworks for multi-locus approaches (Onkamo and Toivonen 2006) Parametric methods: - Regression - Logistic or (Bagged) logic regression Non-parametric methods: Tree-based methods: - Recursive Partitioning (Helix Tree) - Random Forests (R, CART) Pattern recognition methods: - Mining association rules - Neural networks (NN) - Support vector machines (SVM) Data reduction methods: - DICE (Detection of Informative Combined Effects) - MDR (Multifactor Dimensionality Reduction) K Van Steen 562
44 Non-parametric chi-square. The question is how to test for epistatic effects above and beyond the (independent) main effects of the single-locus genotypes:
- Use the usual chi-square for interactions independent of main effects; isolate individual df's.
- Assess the difference in interactions between cases and controls, since interactions may then be more indicative of underlying pathways.
For the 3x3 table of Locus 1 (AA, Aa, aa) by Locus 2 (BB, Bb, bb): main effect locus 1, 2 df; main effect locus 2, 2 df; interactions, 4 df; total, 8 df. K Van Steen 563
45 Partitioning chi-squares for one locus: the 2-df genotype chi-square is split into two 1-df components. Example: simple disease model, population frequency K = 0.10, N = 100 cases and 100 controls. [Table of predicted numbers of cases and controls per genotype class and the resulting odds ratios, OR] K Van Steen 564
46 Partitioning chi-squares for two loci. The 3x3 table of genotypes (4 df) may be partitioned into 4 independent components, each with 1 df; do such a partitioning for cases and controls separately (Agresti 2002). The four 2x2 subtables are:
- (AA vs Aa) x (BB vs Bb)
- (AA vs Aa) x (BB+Bb vs bb)
- (AA+Aa vs aa) x (BB vs Bb)
- (AA+Aa vs aa) x (BB+Bb vs bb)
K Van Steen 565
47 Partitioning chi-squares for two loci Compare each of the four 2 by 2 subtables between cases and controls to see whether their odds ratios are the same K Van Steen 566
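A sketch of this comparison in Python (the counts are invented for illustration): collapse a 3x3 genotype count table into the four independent 2x2 subtables and compare their odds ratios between cases and controls.

```python
def subtables(t):
    """Agresti-style partition of a 3x3 count table (rows AA, Aa, aa; cols BB, Bb, bb)
    into four independent 2x2 components."""
    a, b, c = t
    return [
        [[a[0], a[1]], [b[0], b[1]]],                            # (AA vs Aa) x (BB vs Bb)
        [[a[0] + a[1], a[2]], [b[0] + b[1], b[2]]],              # (AA vs Aa) x (BB+Bb vs bb)
        [[a[0] + b[0], a[1] + b[1]], [c[0], c[1]]],              # (AA+Aa vs aa) x (BB vs Bb)
        [[a[0] + a[1] + b[0] + b[1], a[2] + b[2]],
         [c[0] + c[1], c[2]]],                                   # (AA+Aa vs aa) x (BB+Bb vs bb)
    ]

def odds_ratio(t22):
    (w, x), (y, z) = t22
    return (w * z) / (x * y)

cases    = [[30, 20, 10], [20, 40, 20], [10, 20, 30]]   # invented 3x3 genotype counts
controls = [[25, 25, 12], [25, 30, 25], [12, 25, 21]]

for sc, st in zip(subtables(cases), subtables(controls)):
    print(f"OR cases {odds_ratio(sc):.2f} vs controls {odds_ratio(st):.2f}")
```

Each printed pair corresponds to one of the four 1-df components; a formal test would compare the two odds ratios with, e.g., a Breslow-Day-type statistic.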
48 Logistic regression LR is a derivative of linear regression that fits a function to continuous or discrete independent variables based on a dichotomous dependent variable (Hosmer and Lemeshow, 2000). One of the most common procedures for variable selection in a LR analysis is step-wise logistic regression (step LR) [Hosmer and Lemeshow, 2000]. - In the step-wise procedure, each variable is tested for independent effects, and those variables with significant effects are included in the model. - In a second step, interaction terms of those variables with significant main effects are included, and significant effects are included in the model. (Motsinger-Reif et al 2008) K Van Steen 567
49 Logistic regression LR is a de facto standard for traditional association studies. Using independent variables to predict a dichotomous dependent variable, LR by definition lacks the ability to characterize purely interactive effects. Only variables that contain an independent main effect will be included in the final model. To properly evaluate non-linear purely interactive effects, combinations of variables must be encoded as a single variable for inclusion in the analysis. Such an encoding scheme can be computationally expensive, depending on the number of variables used. (Motsinger-Reif et al 2008) K Van Steen 568
50 Strategy 2: look for patterns of genotypes using unrelated individuals. CPM = combinatorial partitioning method (Charlie Sing, U Michigan); applicable to a small number (~50) of SNPs only. MDR = multifactor-dimensionality reduction method (Jason Moore, Vanderbilt U). LAD = logical analysis of data (P. Hammer, Rutgers U). Mining association rules, Apriori algorithm (R. Agrawal). Special approaches for microarray data. (Hoh and Ott 2003) K Van Steen 569
51 The MDR algorithm What is MDR? A data mining approach to identify interactions among discrete variables that influence a binary outcome A nonparametric alternative to traditional statistical methods such as logistic regression Driven by the need to improve the power to detect gene-gene interactions (slide: L Mustavich) K Van Steen 570
52 The 6 steps of MDR K Van Steen 571
53 MDR Step 1 Divide data (genotypes, discrete environmental factors, and affectation status) into 10 distinct subsets K Van Steen 572
54 MDR Step 2 Select a set of n genetic or environmental factors (which are suspected of epistasis together) from the set of all variables in the training set K Van Steen 573
55 MDR Step 3 Create a contingency table for these multi-locus genotypes, counting the number of affected and unaffected individuals with each multi-locus genotype K Van Steen 574
56 MDR Step 4. Calculate the ratio of cases to controls for each multi-locus genotype. Label each multi-locus genotype as high-risk or low-risk, depending on whether the case-control ratio is above a certain threshold. This is the dimensionality-reduction step: it reduces the n-dimensional space to 1 dimension with 2 levels. K Van Steen 575
57 MDR Step 5. To evaluate the model developed in Step 4, use the labels to classify individuals as cases or controls, and calculate the misclassification error. In fact, balanced accuracy is used (the arithmetic mean of sensitivity and specificity), which is mathematically equivalent to classification accuracy when the data are balanced. K Van Steen 576
58 Repeat Steps 2 to 5 All possible combinations of n factors are evaluated sequentially for their ability to classify affected and unaffected individuals in the training data, and the best n-factor model is selected in terms of minimal misclassification error K Van Steen 577
59 MDR Step 6 The independent test data from the cross-validation are used to estimate the prediction error (testing accuracy) of the best model selected K Van Steen 578
60 Towards MDR Final. Steps 1 through 6 are repeated for each possible cross-validation interval. The best model across all 10 training and testing sets is selected on the basis of the criterion:
- Maximize the cross-validation consistency = the number of times a particular model was the best model across the cross-validation subsets.
The end of a cross-validation procedure also allows computing the
- average training accuracy
- average testing accuracy
of the best models over all cross-validation sets, and possibly over multiple runs (with different seeds, to reduce the chance of observing spurious results due to chance divisions of the data). K Van Steen 579
61 MDR final. The entire process is repeated for each k = 1 to N loci combination that is computationally feasible, and an optimal k-locus model is chosen for each level of k considered. The final model is based on two criteria:
- maximizing the (average) prediction accuracy
- maximizing the (average) cross-validation consistency
Statistical significance is obtained by comparing the average cross-validation consistency from the observed data to the distribution of average consistencies under the null hypothesis of no association, derived empirically from 1000 permutations. (Ritchie et al 2001, Ritchie et al 2003, Hahn et al 2003) K Van Steen 580
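Stripped of the cross-validation machinery, the core of the algorithm fits in a few lines. The Python sketch below uses toy, fully deterministic data (one individual per genotype combination of 4 hypothetical SNPs, with status a fully penetrant XOR-like function of SNP0 and SNP1) and performs the dimensionality reduction for every SNP pair:

```python
from itertools import combinations, product

# Toy data: 4 SNPs, one individual per genotype combination (3^4 = 81 in total).
# Status is an XOR-style function of SNP0 and SNP1 only, so the functional pair
# shows no marginal effect yet classifies perfectly when taken jointly.
genotypes = list(product(range(3), repeat=4))
status = [1 if (g[0] + g[1]) % 2 == 0 else 0 for g in genotypes]
n_cases = sum(status)
n_controls = len(status) - n_cases
threshold = n_cases / n_controls          # MDR's usual high-risk threshold

def balanced_accuracy(pair):
    """Core MDR step: pool multi-locus cells into high/low risk, score the 1-D rule."""
    counts = {}                           # cell -> [case count, control count]
    for g, s in zip(genotypes, status):
        cell = (g[pair[0]], g[pair[1]])
        counts.setdefault(cell, [0, 0])[1 - s] += 1
    high = {cell for cell, (ca, co) in counts.items()
            if (ca / co if co else float("inf")) >= threshold}
    tp = sum(s for g, s in zip(genotypes, status)
             if (g[pair[0]], g[pair[1]]) in high)
    tn = sum(1 - s for g, s in zip(genotypes, status)
             if (g[pair[0]], g[pair[1]]) not in high)
    return (tp / n_cases + tn / n_controls) / 2

best = max(combinations(range(4), 2), key=balanced_accuracy)
print(best, balanced_accuracy(best))      # (0, 1) 1.0 -- the functional pair wins
```

The real procedure wraps this inner loop in 10-fold cross-validation, compares the best models across k, and assesses significance by permutation, as described above.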
62 Several measures of fitness to compare models: balanced accuracy. Balanced accuracy (BA) weighs the classification accuracy of the two classes equally and is thought to be more powerful than accuracy alone when the data are imbalanced, i.e. when the counts of cases and controls are not equal (Velez et al 2007).
- BA is calculated from the 2x2 table relating predicted to true status as (sensitivity + specificity)/2, with cells: model case / real case = TP, model case / real control = FP, model control / real case = FN, model control / real control = TN.
When #cases = #controls, then TP + FN = FP + TN and BA = (TP + TN)/(2 x #cases) = (TP + TN)/(total sample size). K Van Steen 581
63 Several measures of fitness to compare models Model-adjusted balanced accuracy Model-adjusted balanced accuracy uses in addition a different threshold in the MDR modeling, one that is based on the actual counts of case and control samples in the data. - When individuals have missing data, it accounts for the precise number of individuals with complete data for that particular multi-locus combination - This makes MDR robust to class imbalances (Velez et al 2007) K Van Steen 582
64 Hypothesis test of the best model. Evaluate the magnitude of the cross-validation consistency and prediction error estimates by adopting a permutation strategy. In particular:
- Randomize the disease labels
- Repeat the MDR analysis many times (1000?) to get distributions of cross-validation consistencies and prediction errors
- Use these distributions to derive the p-values for the actual cross-validation consistencies and prediction errors
K Van Steen 583
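The permutation strategy itself is generic. A minimal sketch (Python; the statistic and data are invented for illustration):

```python
import random

def permutation_pvalue(statistic, values, labels, n_perm=1000, seed=1):
    """Empirical p-value: the share of label shufflings whose statistic is at least
    as extreme as the observed one (with +1 in both terms so p is never exactly 0)."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def mean_difference(values, labels):
    """Toy statistic: mean among 'cases' (label 1) minus mean among 'controls' (label 0)."""
    cases = [v for v, l in zip(values, labels) if l == 1]
    controls = [v for v, l in zip(values, labels) if l == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

values = list(range(20))                  # a strong, noise-free group difference
labels = [0] * 10 + [1] * 10
print(permutation_pvalue(mean_difference, values, labels))  # very small empirical p-value
```

In the MDR setting, `statistic` would be the average cross-validation consistency or testing balanced accuracy of the best model refit on each shuffled dataset.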
65 An example empirical distribution. [Histogram of the permutation distribution, with sample quantiles reported at 0%, 25%, 50%, 75% and 100%] K Van Steen 584
66 The probability that we would see results as or more extreme than the observed value simply by chance is between 5% and 10%. (slide: L Mustavich) The MDR software. The MDR method is described in further detail by Ritchie et al. (2001) and reviewed by Moore and Williams (2002). An MDR software package is available from the authors by request and is described in detail by Hahn et al. (2003); more information can also be found online. K Van Steen 585
67 Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions (Hahn, Ritchie and Moore). Required operating system software:
- Linux (Fedora Core 3): Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_06-b03); Java HotSpot(TM) Client VM (build 1.4.2_06-b03, mixed mode)
- Windows (XP Professional and XP Home): Java(TM) 2 Runtime Environment, Standard Edition (build v1.4.2_05)
Minimum system requirements: 1 GHz processor, 256 MB RAM, 800x600 screen resolution. K Van Steen 586
68 K Van Steen 587
69 Application to simulated data. To show MDR in action, we simulated 200 cases and 200 controls using different multi-locus epistasis models (Evans 2006):
- Scenario 1: 10 SNPs, adapted epistasis model M170
- Scenario 2: 10 SNPs, epistasis model M27, minor allele frequencies of the disease susceptibility pair 0.25
[Penetrance tables of models M170 and M27] All markers were assumed to be in HWE; no LD between the markers. K Van Steen 588
70 Application to simulated data. [Marginal genotype distributions for the controls and for the cases, under models M170 and M27] K Van Steen 589
71 Data format The definition of the format is as follows: - All fields are tab-delimited. - The first line contains a header row. This row assigns a label to each column of data. Labels should not contain whitespace. - Each following line contains a data row. Data values may be any string value which does not contain whitespace. - The right-most column of data is the class, or status, column. The data values for this column must be 1, to represent Affected or Case status, or 0, to represent Unaffected or Control status. No other values are allowed. K Van Steen 590
72 Easy data conversion (R):

M170data <- rbind(M170.cases, M170.controls)       # 400 x 20 matrix: two allele columns per SNP
M170ccdata <- matrix(NA, nrow = ss, ncol = nsnps)  # ss = sample size, nsnps = number of SNPs
for (i in 1:nsnps) {
  # genotype code 0/1/2 = (allele1 + allele2) - 2, from the two allele columns of SNP i
  M170ccdata[, i] <- apply(M170data[, c(2*i - 1, 2*i)], 1, sum) - 2
}
# append the class column: 1 = case (first 200 rows), 0 = control
M170ccdata <- cbind(M170ccdata, c(rep(1, 200), rep(0, 200)))
write.table(M170ccdata, "M170ccdata.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE)

K Van Steen 591
73 M170 case-control data: tab-delimited columns SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 Class. [Data rows] K Van Steen 592
74 Loading a data file (MDR 2.0 beta 3) K Van Steen 593
75 Configuring the analysis K Van Steen 594
76 Reducing the number of cross-validations CV=10 CV=3 (Motsinger and Ritchie 2006) K Van Steen 595
77 Reducing the number of cross-validations (CVs). In general, CV is a useful approach for limiting false positives by assessing the generalizability of models (Coffey et al 2004). The number of CV intervals in an MB-MDR analysis can be reduced from 10 to 5, but not to 3. CV seems to be rather important in the MDR algorithm:
- Motsinger and Ritchie (2003) showed that, without CV, selection of a final model is difficult, but it is encouraging that the false-positive results almost always include at least one correct functional locus.
- This indicates that perhaps, in the case of extremely large datasets such as genome-wide scans, where using any type of CV would be computationally infeasible, MDR could still be used (without CV) to identify at least one functional locus.
K Van Steen 596
78 Search method configuration K Van Steen 597
79 Running the MDR analysis K Van Steen 598
80 Summary of results K Van Steen 599
81 Best MDR model K Van Steen 600
82 MDR best model K Van Steen 601
83 Values calculated by MDR:
- Balanced Accuracy = (Sensitivity + Specificity)/2; the fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class.
- Accuracy = (TP + TN)/(TP + TN + FP + FN); proportion of instances correctly classified (skewed in favor of the larger class).
- Sensitivity = TP/(TP + FN); how likely a true case is classified as positive.
- Specificity = TN/(TN + FP); how likely a true control is classified as negative.
- Odds Ratio = (TP x TN)/(FP x FN); compares whether the probability of a certain event is the same for two groups.
K Van Steen 602
84 Values calculated by MDR (continued):
- Precision = TP/(TP + FP); the proportion of positive classifications that are correct.
- Kappa = 2(TP x TN - FP x FN)/[(TP + FN)(FN + TN) + (TP + FP)(FP + TN)]; a function of total accuracy and random accuracy.
- X²: chi-squared score for the attribute constructed by MDR from this attribute combination.
- F-Measure = 2TP/(2TP + FP + FN); a function of sensitivity and precision.
TP: true positive; TN: true negative; FP: false positive; FN: false negative. K Van Steen 603
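These formulas are easy to sanity-check in code (Python sketch; kappa is written in its usual Cohen form, and the confusion-matrix counts are invented):

```python
def mdr_measures(tp, tn, fp, fn):
    """The fitness measures above, computed from one confusion matrix."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "balanced_accuracy": (sens + spec) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "odds_ratio": (tp * tn) / (fp * fn),
        "precision": tp / (tp + fp),
        "kappa": 2 * (tp * tn - fp * fn) /
                 ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

m = mdr_measures(tp=40, tn=30, fp=20, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```

For the example counts (n = 100), observed agreement is 0.7 and chance agreement is 0.5, so kappa = (0.7 - 0.5)/(1 - 0.5) = 0.4, which the closed form above reproduces.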
85 MDR CV results. [Cross-validation results table with the reported average] K Van Steen 604
86 MDR best model Graphical display on whole data If-then rules on whole data K Van Steen 605
87 The fitness landscape. Gives the fitness landscape across all models as a line chart (the default).
- The models produced are on the x-axis of the chart, in the order in which they were generated (e.g., 1, 2, 3, ..., 12, 13, 14, ...).
- Training accuracy is shown on the y-axis.
K Van Steen 606
88 The fitness landscape. [Line chart values: training accuracy for each single-SNP model (SNP1 ... SNP10) followed by each two-SNP model (SNP1,SNP2; SNP1,SNP3; ...)] K Van Steen 607
89 Locus Dendrogram. The dendrogram provides a graphical representation of the interactions between attributes (and the strength of those interactions) from the MDR analysis, up to the maximum number of interactions asked for. The purpose of the interaction dendrogram is to assist the user in determining the nature of the interactions (redundant, additive, or synergistic). K Van Steen 608
90 Locus Dendrogram The dendrogram is constructed using hierarchical cluster analysis with average-linking. The distance matrix used by the cluster analysis is constructed by calculating the information gained by constructing two attributes using the MDR function (Moore et al 2006, Jakulin and Bratko 2003, Jakulin et al 2003) K Van Steen 609
91 Raw entropy values. Entropy is a measure of randomness or disorder within a system; the higher the entropy, the more probable the state the system is in. A classic example of this principle is a melting glass of ice, whose entropy increases as its state becomes less ordered. A graphical illustration of the relationships between information-theoretic measures on the joint distribution of attributes A and B: the surface area of a section corresponds to the labeled quantity (Jakulin 2003). [I(A;B) = mutual information = the amount of information provided by A about B = information gain; H(A) = entropy of A] K Van Steen 610
92 Raw entropy values. Let us assume an attribute A with observed probability distribution P_A(a). Shannon's entropy, measured in bits, is a measure of the predictability of an attribute and is defined as H(A) = - Σ_a P_A(a) log2 P_A(a). Phrased differently, the higher the entropy, the less reliable our predictions about A: we can understand H(A) as the amount of uncertainty about A, as estimated from its probability distribution. K Van Steen 611
93 Raw entropy values. Single-attribute values:
- H(A): the entropy of the given attribute A
- H(A|C): the entropy of A given the class C
- I(A;C): the information gain of A about the class C
Pairwise values:
- H(AB): the entropy of the constructed attribute AB
- H(AB|C): the entropy of AB given the class C
- I(A;B): the mutual information between attributes A and B
- I(A;B;C): the interaction information of attributes A and B with the class C
- I(AB;C): the information gain of the constructed attribute AB about the class C
K Van Steen 612
94 Raw entropy values. Mutual information I(A;B) as a function of r² (as a measure of LD between markers), for a subset of the Spanish Bladder Cancer data (SBCS); unpublished results. K Van Steen 613
95 K Van Steen 614
96 Locus dendrogram The colors range from red, representing a high degree of synergy (positive information gain), through orange, a lesser degree, to gold, the midway point between synergy and redundancy. On the redundancy end of the spectrum, the highest degree is represented by blue (negative information gain), with a lesser degree represented by green. Synergy: the interaction between two attributes provides more information than the sum of the individual attributes. Redundancy: the interaction between attributes provides redundant information. K Van Steen 615
97 Positive and negative interactions Say I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C). Assume that we are uncertain about the value of C, but we have information about A and B. - Knowledge of A alone eliminates I(A;C) bits of uncertainty from C. - Knowledge of B alone eliminates I(B;C) bits of uncertainty from C. - However, the joint knowledge of A and B eliminates I(A,B;C) bits of uncertainty. Hence, if the interaction information is positive, we benefit from an unexpected synergy. If the interaction information is negative, we suffer diminishing marginal returns by introducing attributes that partly contribute redundant information. K Van Steen 616
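The decomposition above can be checked numerically. The sketch below (my own illustration, with hypothetical helper names) estimates each term from joint counts; on an XOR pattern, where neither attribute alone predicts C, the interaction information is maximally positive:

```python
import math
from collections import Counter

def H(*columns):
    """Joint Shannon entropy (bits) of one or more attributes given as parallel lists."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * math.log2(c / n) for c in Counter(joint).values())

def I2(x, y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(x) + H(y) - H(x, y)

def interaction_information(a, b, c):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C); positive = synergy, negative = redundancy."""
    i_ab_c = H(a, b) + H(c) - H(a, b, c)   # I(A,B;C), treating (A,B) as one attribute
    return i_ab_c - I2(a, c) - I2(b, c)

# XOR: neither A nor B alone informs about C, but jointly they determine it
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]
print(interaction_information(a, b, c))  # 1.0 (pure synergy)
```

Flipping the example so that C simply copies A would instead drive the interaction information negative (redundancy).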
98 Significance of the results We simulated data from a two-locus epistasis model; the remaining SNPs were generated at random. What, then, does it mean that SNP5 was chosen as the best single-effects model? Answer: every k-locus setting will give rise to a best model; MDR forces an optimal model for every k-locus setting. K Van Steen 617
99 Significance of results The best model among all 1-3 locus models is the one with maximal cross-validation consistency and maximum average balanced prediction accuracy But how significant is this result? K Van Steen 618
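One standard way to answer this (a sketch of the general permutation idea, not of the actual MDR implementation) is to shuffle the case/control labels, rerun the model search on each shuffled dataset, and compare the observed best score with the resulting null distribution. The stand-in search below considers single attributes only and scores them by training balanced accuracy; all names are mine:

```python
import random

def balanced_accuracy(pred, labels):
    """Mean of sensitivity and specificity of the high-/low-risk classification."""
    tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 0)
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    return 0.5 * (tp / n_case + tn / n_ctrl)

def best_model_score(attributes, labels):
    """Toy stand-in for the MDR search: over all single attributes, call a
    genotype cell high-risk when cases outnumber controls in it, and return
    the best balanced accuracy achieved."""
    best = 0.0
    for col in attributes:
        cells = {}
        for g, y in zip(col, labels):
            cases, ctrls = cells.get(g, (0, 0))
            cells[g] = (cases + y, ctrls + (1 - y))
        high = {g for g, (ca, co) in cells.items() if ca > co}
        pred = [1 if g in high else 0 for g in col]
        best = max(best, balanced_accuracy(pred, labels))
    return best

def _shuffled(labels, rng):
    out = labels[:]
    rng.shuffle(out)
    return out

def permutation_pvalue(attributes, labels, n_perm=200, seed=0):
    """Build the null distribution of the best score by shuffling labels;
    the p-value is the fraction of permuted best scores >= the observed one."""
    rng = random.Random(seed)
    observed = best_model_score(attributes, labels)
    exceed = sum(1 for _ in range(n_perm)
                 if best_model_score(attributes, _shuffled(labels, rng)) >= observed)
    return observed, (exceed + 1) / (n_perm + 1)

labels = [0] * 20 + [1] * 20
attributes = [labels[:], [0] * 40]          # one perfect SNP plus one uninformative SNP
observed, pval = permutation_pvalue(attributes, labels)
print(observed, pval)
```

Because the best model is re-selected inside every permutation, this procedure automatically accounts for the model-search multiplicity, which is why no further multiple-testing correction is needed.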
100 Configuring the permutation analysis (MDR PT Module alpha) K Van Steen 619
101 Performing the MDR permutation test K Van Steen 620
102 Performing the MDR permutation test K Van Steen 621
103 Performing the MDR permutation test
Model           Testing BA (p-value)   CVC (p-value)
SNP5            (0.0540)               (0.2160)
SNP1-SNP2       (<0.0010)              (0.2160)
SNP1-SNP2-SNP   (<0.0010)              (0.2160)
BA and CVC obtained from the MDR summary table; p-values obtained from the MDR Permutation Testing p-value calculator K Van Steen 622
104 Performing the MDR permutation test
Permutation null distribution for the best k=1-3 models:
Model           Testing BA (p-value)   CVC (p-value)
SNP5            (0.0540)               10 (0.2160)
SNP1-SNP2       (<0.0010)              10 (0.2160)
SNP1-SNP2-SNP   (<0.0010)              10 (0.2160)
Permutation null distribution for the best k-locus model (hence 3 distributions):
Model           Testing BA (p-value)   CVC (p-value)
SNP5            ( )                    10 (0.1720)
SNP1-SNP2       ( )                    10 (0.0570)
SNP1-SNP2-SNP   ( )                    10 (0.0440)
K Van Steen 623
105 What is going on?
Permutation null distribution for the best k-locus model (hence 3 distributions):
Model           Testing BA (p-value)   CVC (p-value)
SNP5            ( )                    10 (0.1720)
SNP1-SNP2       ( )                    10 (0.0570)
SNP1-SNP2-SNP   ( )                    10 (0.0440)
Is the effect of a strong main effect carried through into the higher-order interactions? What will happen for data simulated under M27 (with main effects by simulation)? K Van Steen 624
106 Results for M27 K Van Steen 625
107 Results for M27
Permutation null distribution for the best k=1-3 models:
Model           Testing BA (p-value)   CVC (p-value)
SNP1-SNP        (<0.0010)              10 (0.2310)
SNP1-SNP2-SNP   (<0.0010)              5 (0.9110)
What about SNP2? Why is this not highlighted as an important main effect? Maximizing CVC first and then looking at prediction accuracy highlights SNP1-SNP2. Maximizing prediction accuracy alone would point towards SNP1-SNP2-SNP4. K Van Steen 626
108 Results for M27 Using permutation null distributions per k-locus setting, the following results are obtained:
Permutation null distribution for the best k-locus model (hence 3 distributions):
Model           Testing BA (p-value)   CVC (p-value)
SNP1            (<0.0010)              10 (0.1790)
SNP1-SNP2       (<0.0010)              10 (0.0620)
SNP1-SNP2-SNP   (<0.0010)              5 (0.9110)
Wouldn't it be natural to correct for SNP1 when looking for interactions? What if more than one main effect is present in the data? K Van Steen 627
109 Strengths of MDR Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multi-locus data Non-parametric: no parameters are estimated Assumes no particular genetic model Minimal false-positive rates K Van Steen 628
110 Weaknesses of MDR Computationally intensive (especially with >10 loci) - The original MDR software supports disease models with up to 15 factors at a time, from a list of up to 500 total factors, and a maximum sample size of 4,000 subjects. - Parallel MDR (Bush et al 2006) is a redesign of the initial MDR algorithm that allows an unlimited number of study subjects, total variables and variable states, and removes restrictions on the order of interactions being analyzed. The algorithm yields an approximately 150-fold decrease in runtime for equivalent analyses. The curse of dimensionality: decreased predictive ability with high dimensionality and small samples, due to cells with no data K Van Steen 629
111 Several (other) extensions to the MDR paradigm (CV based) (Lou et al 2008) K Van Steen 630
112 Different measure to score model quality One crucial component of the MDR algorithm measures the percentage of cases and controls incorrectly labelled by the proposed classification the classification error. - The combination of variables that produces the lowest classification error is selected as the best or most fit model. The correctly and incorrectly labelled cases and controls can be expressed as a two-way contingency table. The ability of MDR to detect gene-gene interactions can be improved by replacing classification error with a different measure to score model quality. - Of 10 measures evaluated, Bush et al (2008) found that the likelihood ratio and normalized mutual information (NMI) are measures that consistently improve the detection and power of MDR in simulated data over using classification error. These measures also reduce the inclusion of spurious variables in a multi-locus model. K Van Steen 631
113 Contingency table measures of classification performance (Bush et al 2008) K Van Steen 632
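Two of the contingency-table measures discussed can be sketched directly from the four cell counts. The definitions here are the standard ones and may differ in detail from those evaluated by Bush et al (2008):

```python
def classification_error(tp, fp, tn, fn):
    """Fraction of cases and controls incorrectly labelled by the classification."""
    return (fp + fn) / (tp + fp + tn + fn)

def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity; less sensitive than raw accuracy
    to unbalanced case/control counts."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Example two-way table: 40 true positives, 10 false positives,
# 35 true negatives, 15 false negatives
print(classification_error(40, 10, 35, 15))  # 0.25
print(balanced_accuracy(40, 10, 35, 15))     # ~0.753
```

Minimizing classification error and maximizing balanced accuracy coincide for balanced samples but diverge when cases and controls are unequal, which is one reason alternative scoring measures matter.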
114 Towards an easy-to-adapt framework MB-MDR (Lou et al 2008) FAM-MDR K Van Steen 633
115 MB-MDR as a semi-parametric approach for unrelateds Step 1: New risk cell identification via an association test on each genotype cell c_j - Parametric or non-parametric test of association Step 2: Test the one-dimensional genetic construct X on Y Step 3: Assess significance - W = [b/se(b)]², b = ln(OR) - Derive the correct null distribution for W (Calle et al 2007, Calle et al 2008) K Van Steen 634
116 Motivation 1 for MB-MDR Some important interactions could be missed by MDR due to pooling too many cells together (Calle et al 2008) K Van Steen 635
117 Motivation 2 for MB-MDR MDR cannot deal with main effects / confounding factors / nondichotomous outcomes K Van Steen 636
118 Motivation 3 for MB-MDR MDR has low performance in the presence of genetic heterogeneity (Calle et al 2008) K Van Steen 637
119 A comparison of analytical methods: GENN, RF, FITF, MDR, logistic regression GENN Grammatical evolution neural network (GENN) is a novel pattern recognition method developed to detect main effects or multi-locus models of association without exhaustively searching all possible multi-locus combinations. Grammatical evolution (GE) is a machine-learning algorithm inspired by the biological process of transcription and translation. GE uses a genetic algorithm in combination with a pre-specified grammar (set of translation rules) to automatically evolve an optimal computer program. GENN utilizes GE to evolve the inputs (predictor variables), architecture (arrangement of layers and functions), and weights of a neural network (NN) to optimally classify a given dataset. (Motsinger-Reif et al 2008) K Van Steen 638
120 A schematic overview of the GENN method (Motsinger-Reif 2008) K Van Steen 639
121 Random Forests (RF) RF is a machine-learning technique that builds a forest of classification trees wherein each component tree is grown from a bootstrap sample of the data, and the variable at each tree node is selected from a random subset of all variables in the data (Breiman, 2001). The final classification of an individual is determined by voting over all trees in the forest. RF models may uncover interactions among factors that do not exhibit strong marginal effects, without demanding a pre-specified model (McKinney et al., 2006). Additionally, tree methods are suited to dealing with certain types of genetic heterogeneity, since splits near the root node define separate model subsets in the data. (Motsinger-Reif et al 2008) K Van Steen 640
122 Random Forests (RF) Each tree in the forest is constructed as follows from data having N individuals and M explanatory variables: - Choose a training sample by selecting N individuals, with replacement, from the entire data set. - At each node in the tree, randomly select m variables from the entire set of M variables in the data. The absolute magnitude of m is a function of the number of variables in the data set and remains constant throughout the forest building process. - Choose the best split at the current node from among the subset of m variables selected above. - Iterate the second and third steps until the tree is fully grown (no pruning). Repetition of this algorithm yields a forest of trees, each of which has been trained on bootstrap samples of individuals (Motsinger-Reif et al 2008) K Van Steen 641
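The forest-building loop above can be sketched in miniature. The toy below is my own simplification: each "tree" is kept to depth 1 rather than fully grown, but it keeps the two defining ingredients, bootstrap sampling and random variable subsets of constant size m, and classifies by voting over trees:

```python
import random
from collections import Counter

def train_stump(X, y, m, rng):
    """One 'tree', kept to depth 1 for brevity: draw a bootstrap sample,
    randomly select m of the M variables, and keep the variable whose
    value -> majority-class rule best fits the bootstrap sample."""
    n, M = len(X), len(X[0])
    boot = [rng.randrange(n) for _ in range(n)]          # sample with replacement
    best = None
    for j in rng.sample(range(M), m):                    # random variable subset
        votes = {}
        for i in boot:
            votes.setdefault(X[i][j], Counter())[y[i]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in votes.items()}
        acc = sum(rule[X[i][j]] == y[i] for i in boot) / n
        if best is None or acc > best[0]:
            best = (acc, j, rule)
    return best[1], best[2]

def predict(forest, row):
    """Final classification by voting over all trees in the forest."""
    votes = Counter(rule.get(row[j], 0) for j, rule in forest)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
X = [[rng.randint(0, 2) for _ in range(5)] for _ in range(200)]
y = [int(row[0] > 0) for row in X]                       # variable 0 drives the class
forest = [train_stump(X, y, m=4, rng=rng) for _ in range(25)]
accuracy = sum(predict(forest, row) == label for row, label in zip(X, y)) / len(X)
print(accuracy)
```

In practice one would use an established implementation (e.g. Breiman's CART-based trees grown fully without pruning, as the steps describe) rather than this stump-based miniature.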
123 A schematic overview of the RF method (Motsinger-Reif et al 2008) K Van Steen 642
124 Advantages of the Random Forest method It can handle a large number of input variables. It estimates the relative importance of variables in determining classification, thus providing a metric for feature selection. RF produces a highly accurate classifier with an internal unbiased estimate of generalizability during the forest building process. RF is fairly robust in the presence of etiological heterogeneity and relatively high amounts of missing data (Lunetta et al., 2004). Finally, and of increasing importance as the number of input variables increases, learning is fast and computation time is modest even for very large data sets (Robnik-Sikonja, 2004). (Motsinger-Reif et al 2008) K Van Steen 643
125 Focused Interaction Testing Framework (FITF) The FITF was recently developed to detect epistatic interactions that predict disease risk. Details of the FITF algorithm and software can be found in Millstein et al. (2006). FITF is a modification of the interaction testing framework (ITF) method, which pre-screens all possible gene sets to focus on those that potentially are the most informative and reduce the multiple testing problem by reducing the number of statistical tests performed. FITF has been shown to outperform MDR when interactions involved additive, recessive, or dominant genes (Millstein et al., 2006). (Motsinger-Reif et al 2008) K Van Steen 644
126 Focused Interaction Testing Framework (FITF) The FITF algorithm modifies the ITF approach to reduce the overall number of variants tested with an initial filter process. A chi-square goodness-of-fit statistic that compares the observed with the expected distribution of multi-locus genotype combinations in a combined case-control population is used in the prescreening initial stage. This statistic, referred to as the chi-square subset (CSS), has the form: CSS = Σ_{i=1}^{r} [n_i - E(n_i)]² / E(n_i) where n_i is the observed number of subjects (regardless of case/control status) in the i-th genotype group and r is the total number of genotype groups. The expected n_i, noted E(n_i), is estimated based on the sample marginal genotype frequencies of each gene. K Van Steen 645
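The CSS statistic can be sketched as follows, with expected counts taken as products of the sample marginal genotype frequencies (my reading of the description above; Millstein et al (2006) give the exact definition):

```python
from collections import Counter
from itertools import product

def css(genotype_columns):
    """Chi-square subset statistic: compare observed multi-locus genotype
    counts with the counts expected under independence of the loci
    (product of sample marginal genotype frequencies)."""
    n = len(genotype_columns[0])
    observed = Counter(zip(*genotype_columns))
    marginals = [Counter(col) for col in genotype_columns]
    stat = 0.0
    for combo in product(*[m.keys() for m in marginals]):
        expected = n
        for value, marg in zip(combo, marginals):
            expected *= marg[value] / n
        if expected > 0:
            stat += (observed.get(combo, 0) - expected) ** 2 / expected
    return stat

# Two independent loci give CSS = 0 in this balanced design;
# two perfectly correlated loci give a large CSS
a = [0, 0, 1, 1] * 25
b_ind = [0, 1, 0, 1] * 25
print(css([a, b_ind]))  # 0.0
print(css([a, a]))      # 100.0
```

Gene sets with a large CSS deviate from what the marginal frequencies predict and are the ones the filter passes on to the formal ITF tests.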
127 Conclusion on comparison MDR results in the one- and two-locus models were comparable to GENN GENN performs poorly with the three-locus models considered in Motsinger-Reif et al (2008). - This highlights a disadvantage of an evolutionary computation approach in exploring purely epistatic models: it is much less likely that three loci will be stochastically assembled into a model for evaluation than two loci. Both GENN and MDR outperformed FITF - Because GENN and MDR both utilize permutation distributions for significance testing, correction for multiple testing is unnecessary. While the filter stage of FITF does reduce the number of tests performed with the ITF strategy, there is still a very large number of tests that must be corrected for. K Van Steen 646
128 Conclusion on comparison Both RF and stepLR were unable to detect purely epistatic models. - Both require marginal main effects to perform variable selection tasks. - Future extensions/modifications of these approaches should consider this limitation and modify the variable selection process to capture pure interactions. - Some groups have in fact begun to make modifications in this direction (Bureau et al., 2005) K Van Steen 647
129 2.c Interpretation of multi-locus results It is always a good idea to use several model selection criteria before interpreting results (Ritchie et al 2007) K Van Steen 648
130 A flexible framework for analysis acknowledging interpretation capability The framework contains four steps to detect, characterize, and interpret epistasis - Select interesting combinations of SNPs - Construct new attributes from those selected - Develop and evaluate a classification model using the newly constructed attribute(s) - Interpret the final epistasis model using visual methods (Moore et al 2005) K Van Steen 649
131 Flexible framework Step 1 Attribute selection - Use entropy-based measures of information gain (IG) and interaction - Evaluate the gain in information about a class variable (e.g. case-control status) from merging two attributes together - This measure of IG allows us to gauge the benefit of considering two (or more) attributes as one unit (slide: Chen 2007) K Van Steen 650
132 Information gain Recall McGill's multiple mutual information (Te Sun Han 1980): I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C) (information gain) If I(A;B;C) > 0 - Evidence for an attribute interaction that cannot be linearly decomposed If I(A;B;C) < 0 - The information between A and B is redundant If I(A;B;C) = 0 - Evidence of conditional independence or a mixture of synergy and redundancy K Van Steen 651
133 Illustration of entropy-based measures on Model 1 (Ritchie et al 2001) K Van Steen 652
134 Attribute selection based on entropy Entropy-based IG is estimated for each individual attribute (i.e. main effects) and each pairwise combination of attributes (i.e. interaction effects). Pairs of attributes are sorted and those with the highest IG, or percentage of entropy in the class removed, are selected for further consideration (slide: Chen 2007) K Van Steen 653
135 Attribute selection based on ReliefF The Relief statistic was developed by the computer science community as a powerful method for determining the quality or relevance of an attribute (i.e. variable) for predicting a discrete endpoint or class variable (Kira and Rendell 1992, Kononenko 1994, Robnik-Sikonja and Kononenko 2003). Relief is especially useful when there is an interaction between two or more attributes and the discrete class variable. It is thus superior to univariate filters, such as a chi-square test of independence (see later), when interactions are present. K Van Steen 654
136 Attribute selection based on ReliefF In particular, Relief estimates the quality of attributes through a type of nearest-neighbor algorithm that selects neighbors (instances) from the same class and from the different class, based on the vector of values across attributes. Weights (W) or quality estimates for each attribute (A) are updated according to whether a randomly selected instance (R), its nearest neighbor from the same class (nearest hit, H), and its nearest neighbor from the other class (nearest miss, M) have the same or different values of that attribute. This process of adjusting weights is repeated for m instances. The algorithm produces weights for each attribute ranging from -1 (worst) to +1 (best). K Van Steen 655
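The weight-update scheme just described can be sketched as below (a bare-bones Relief for discrete attributes, without ReliefF's k-nearest-neighbour averaging; all names are mine). On a purely epistatic XOR pattern, the two interacting attributes receive clearly higher weights than the noise attributes:

```python
import random

def relief(X, y, m=150, seed=3):
    """Relief sketch: for m randomly drawn instances, find the nearest hit
    (same class) and nearest miss (other class) using the whole attribute
    vector, then reward attributes that differ on the miss and penalise
    attributes that differ on the hit."""
    rng = random.Random(seed)
    n, M = len(X), len(X[0])
    W = [0.0] * M

    def dist(r1, r2):                      # Hamming distance over all attributes
        return sum(a != b for a, b in zip(r1, r2))

    for _ in range(m):
        i = rng.randrange(n)
        hit = min((j for j in range(n) if j != i and y[j] == y[i]),
                  key=lambda j: dist(X[i], X[j]))
        miss = min((j for j in range(n) if y[j] != y[i]),
                   key=lambda j: dist(X[i], X[j]))
        for a in range(M):
            W[a] += ((X[i][a] != X[miss][a]) - (X[i][a] != X[hit][a])) / m
    return W

# Purely epistatic (XOR) signal in attributes 0 and 1, plus four noise attributes
rng = random.Random(7)
X = [[rng.randint(0, 1) for _ in range(6)] for _ in range(300)]
y = [row[0] ^ row[1] for row in X]
W = relief(X, y)
print(W)
```

Because the nearest miss must differ in at least one of the two interacting attributes, their weights accumulate even though neither shows any marginal effect, which a univariate chi-square filter would miss entirely.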
137 Attribute selection based on ReliefF (applied to M27) Only the top 10% of scores will be returned to the filtered data set K Van Steen 656
138 Attribute selection based on ReliefF applied to M27 For the M27 simulated data, this reduction of the overall attribute count does not make sense, of course (# SNPs = 10!) K Van Steen 657
139 Attribute selection based on TuRF ReliefF is able to capture attribute interactions because it selects nearest neighbors using the entire vector of values across all attributes. However, this advantage is also a disadvantage, because the presence of many noisy attributes can reduce the signal the algorithm is trying to capture. The tuned ReliefF algorithm (TuRF) systematically removes attributes that have low quality estimates, so that the ReliefF values of the remaining attributes can be re-estimated. (Moore and White 2008) K Van Steen 658
140 Attribute selection based on chi-squared The MDR software provides a simple chi-square test of independence as a univariate filter. - The manual specifies that this filter should be used to condition your MDR analysis on those attributes that have an independent main effect. - However, the MDR software itself does not give you a lot of options to actually perform this conditioning The ReliefF filter will be more useful for capturing those attributes that are likely to be involved in an interaction. K Van Steen 659
141 Attribute selection based on chi-squared (applied to M27) K Van Steen 660
142 Attribute selection based on odds ratio The odds ratio (OR) is a way of comparing whether the probability of a certain event is the same in two groups. - An odds ratio of 1 implies that the event is equally likely in both groups. - An odds ratio greater than 1 implies that the event is more likely in the first group, whereas - A value less than 1 implies that the event is less likely in the first group. When an attribute is polytomous (i.e. more than 2 levels), MDR calculates the OR for each possible contrast and then reports the largest OR value. - For 3 levels 0, 1, 2, the following contrasts are considered: 0 vs 1; 0 vs 2; 1 vs 2; 0 vs 1&2; 1 vs 0&2; 2 vs 1&0 K Van Steen 661
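The contrast enumeration can be sketched as below. The 0.5 continuity correction and taking the larger of OR and 1/OR per contrast are my assumptions for the sketch, not documented MDR behaviour:

```python
def odds_ratio(a, b, c, d):
    """OR comparing group 1 (a cases, b controls) with group 2 (c cases,
    d controls); a 0.5 continuity correction (my addition) avoids division
    by zero for empty cells."""
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)

def max_or_polytomous(case_counts, ctrl_counts):
    """For a 3-level attribute, evaluate the six contrasts listed above
    (0 vs 1; 0 vs 2; 1 vs 2; 0 vs 1&2; 1 vs 0&2; 2 vs 1&0) and report the
    largest OR; the larger of OR and 1/OR is taken per contrast (an
    assumption) so that strong protective contrasts are not missed."""
    contrasts = [({0}, {1}), ({0}, {2}), ({1}, {2}),
                 ({0}, {1, 2}), ({1}, {0, 2}), ({2}, {1, 0})]
    best = 1.0
    for g1, g2 in contrasts:
        a = sum(case_counts[g] for g in g1)
        b = sum(ctrl_counts[g] for g in g1)
        c = sum(case_counts[g] for g in g2)
        d = sum(ctrl_counts[g] for g in g2)
        o = odds_ratio(a, b, c, d)
        best = max(best, o, 1 / o)
    return best

case_counts = {0: 10, 1: 20, 2: 70}
ctrl_counts = {0: 40, 1: 30, 2: 30}
print(max_or_polytomous(case_counts, ctrl_counts))  # ~8.92 (the 0 vs 2 contrast)
```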
143 Flexible framework Step 2 Constructive induction, for instance MDR - A multi-locus genotype combination is considered high-risk if the ratio of cases to controls exceeds a given threshold T; otherwise it is considered low-risk - Genotype combinations considered high-risk are labeled G1, while those considered low-risk are labeled G0 - This process constructs a new one-dimensional attribute with levels G0 and G1 (slide: Chen 2007) K Van Steen 662
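This pooling step can be sketched directly (function and variable names are mine):

```python
def mdr_attribute(geno_a, geno_b, labels, T=1.0):
    """Constructive induction sketch: pool each two-locus genotype combination
    into 'G1' (high-risk) when its case/control ratio exceeds threshold T,
    else 'G0', yielding a single one-dimensional attribute."""
    cells = {}
    for a, b, y in zip(geno_a, geno_b, labels):
        cases, ctrls = cells.get((a, b), (0, 0))
        cells[(a, b)] = (cases + y, ctrls + (1 - y))
    high = {cell for cell, (ca, co) in cells.items()
            if (co == 0 and ca > 0) or (co > 0 and ca / co > T)}
    return ['G1' if (a, b) in high else 'G0' for a, b in zip(geno_a, geno_b)]

# A purely epistatic (XOR-like) pattern: only the joint genotype is informative
geno_a = [0, 0, 1, 1] * 10
geno_b = [0, 1, 0, 1] * 10
labels = [a ^ b for a, b in zip(geno_a, geno_b)]
new_attr = mdr_attribute(geno_a, geno_b, labels)
print(new_attr[:4])  # ['G0', 'G1', 'G1', 'G0']
```

The constructed G0/G1 attribute can then be handed to any classifier in Step 3, which is what makes the framework modular.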
144 Flexible framework Step 3 Classification and machine learning - The single attribute obtained in Step 2 can be modeled using machine learning and classification techniques Bayes classifiers as one technique - Mitchell (1997) defines the naive Bayes classifier as v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j) - where v_j is one of a set of V classes and a_i is one of n attributes describing an event or data element. The class associated with a specific attribute list is the one which maximizes the product of the class probability and the probability of each attribute value given the specified class. K Van Steen 663
145 Flexible framework Step 3 - The standard way to apply the naive Bayes classifier to genotype data would be to use the genotype information for each individual as a list of attributes to distinguish between the two hypotheses The subject is high-risk and The subject is low-risk. Alternatively, an odds ratio for the single multilocus attribute can also be estimated using logistic regression to facilitate a traditional epidemiological analysis and interpretation. - Evaluation of the predictor can be carried out using cross-validation (Hastie et al., 2001) and permutation testing (Good, 2000), for example. (Moore et al 2006) K Van Steen 664
146 Flexible framework Step 4 Interpretation interaction graphs - Comprised of a node for each attribute, with pairwise connections between them. - Each node is labeled with the percentage of entropy removed (i.e. IG) by that attribute. - Each connection is labeled with the percentage of entropy removed by the pairwise Cartesian product of the two attributes. (slide: Chen 2007) K Van Steen 665
147 Flexible framework Step 4 Interpretation dendrograms - Hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree. K Van Steen 666
148 Hierarchical clustering with average linkage Here the distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group K Van Steen 667
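The average-linkage rule can be sketched naively as below, on a toy attribute-distance matrix in which A and B (and likewise C and D) are close; the function is my own illustration, not the software's implementation:

```python
def average_linkage(dist, labels):
    """Naive agglomerative clustering with average linkage: at each step merge
    the two clusters whose average pairwise inter-cluster distance is smallest,
    recording the merged members and the merge distance (dendrogram height)."""
    clusters = [[i] for i in range(len(labels))]

    def d(c1, c2):                          # average of all pairwise distances
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        merges.append((sorted(labels[k] for k in clusters[i] + clusters[j]),
                       d(clusters[i], clusters[j])))
        clusters[i] += clusters.pop(j)
    return merges

labels = ['A', 'B', 'C', 'D']
dist = [[0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.85, 0.9],
        [0.9, 0.85, 0.0, 0.2],
        [0.8, 0.9, 0.2, 0.0]]
for members, height in average_linkage(dist, labels):
    print(members, height)
```

Strongly interacting attribute pairs merge at small heights and thus end up close together at the leaves of the resulting dendrogram.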
149 Flexible framework The flexibility of this framework lies in the ability to plug and play - Different attribute selection methods other than the entropy-based ones - Different constructive induction algorithms other than MDR - Different machine learning strategies other than a naïve Bayes classifier (slide: Chen 2007) K Van Steen 668
150 3 Future challenges Integration of omics data in GWAs K Van Steen 669
151 Integration of omics data in GWAs (Hirschhorn 2009) K Van Steen 670
152 Integration of omics data in GWAs A few straightforward examples: Post-analysis - As validation tool in main effects GWAs During the analysis: - Epistasis screening (FAM-MDR) Use expression values to prioritize multi-locus combinations - Main effects screening (PBAT) Construct an overall phenotype for each marker based on the linear combination of expression values (e.g., within 1Mb from the marker) that maximizes heritability and perform FBAT-PC screening to prioritize SNPs K Van Steen 671
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationExperimental Design and Data Analysis for Biologists
Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1
More informationPATTERN CLASSIFICATION
PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS
More informationProbability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies
Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National
More informationVariable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning
Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning Robert B. Gramacy University of Chicago Booth School of Business faculty.chicagobooth.edu/robert.gramacy
More informationBioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.
Bioinformatics Jason H. Moore, Ph.D. Frank Lane Research Scholar in Computational Genetics Associate Professor of Genetics Adjunct Associate Professor of Biological Sciences Adjunct Associate Professor
More informationRelated Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM
Lecture 9 SEM, Statistical Modeling, AI, and Data Mining I. Terminology of SEM Related Concepts: Causal Modeling Path Analysis Structural Equation Modeling Latent variables (Factors measurable, but thru
More informationBackward Genotype-Trait Association. in Case-Control Designs
Backward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs Tian Zheng, Hui Wang and Shaw-Hwa Lo Department of Statistics, Columbia University, New York, New York,
More informationUVA CS 4501: Machine Learning
UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course
More informationText Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data
More informationStephen Scott.
1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training
More informationHERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)
BIRS 016 1 HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) Malka Gorfine, Tel Aviv University, Israel Joint work with Li Hsu, FHCRC, Seattle, USA BIRS 016 The concept of heritability
More informationMultiple QTL mapping
Multiple QTL mapping Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] 1 Why? Reduce residual variation = increased power
More informationFilter Methods. Part I : Basic Principles and Methods
Filter Methods Part I : Basic Principles and Methods Feature Selection: Wrappers Input: large feature set Ω 10 Identify candidate subset S Ω 20 While!stop criterion() Evaluate error of a classifier using
More informationData Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td
Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak
More informationSNP-SNP Interactions in Case-Parent Trios
Detection of SNP-SNP Interactions in Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 2, 2009 Karyotypes http://ghr.nlm.nih.gov/ Single Nucleotide Polymphisms
More informationEnsemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan
Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite
More informationPart I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes
Week 10 Based in part on slides from textbook, slides of Susan Holmes Part I Linear regression & December 5, 2012 1 / 1 2 / 1 We ve talked mostly about classification, where the outcome categorical. If
More informationhsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference
CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science
More informationAggregated Quantitative Multifactor Dimensionality Reduction
University of Kentucky UKnowledge Theses and Dissertations--Statistics Statistics 2016 Aggregated Quantitative Multifactor Dimensionality Reduction Rebecca E. Crouch University of Kentucky, rebecca.crouch@uky.edu
More informationUnsupervised machine learning
Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationMachine Learning Overview
Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression
More informationChap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University
Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics
More informationUniversity of California, Berkeley
University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2008 Paper 241 A Note on Risk Prediction for Case-Control Studies Sherri Rose Mark J. van der Laan Division
More informationDecision Tree Learning Lecture 2
Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over
More informationLecture 2. Judging the Performance of Classifiers. Nitin R. Patel
Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only
More informationFundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur
Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new
More informationPowerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions
Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions Nilanjan Chatterjee, Zeynep Kalaylioglu 2, Roxana Moslehi, Ulrike Peters 3, Sholom Wacholder
More informationClassification and Regression Trees
Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationLecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017
Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 46 Lecture Overview 1. Variance Component
More informationResearch Statement on Statistics Jun Zhang
Research Statement on Statistics Jun Zhang (junzhang@galton.uchicago.edu) My interest on statistics generally includes machine learning and statistical genetics. My recent work focus on detection and interpretation
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationGenotype Imputation. Biostatistics 666
Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives
More informationComparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees
Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationLecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017
Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping
More informationSF2930 Regression Analysis
SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classication 20 February 2017 1 / 30 Idag Overview Regression trees Pruning Bagging, random forests 2 / 30 Today Overview Regression
More informationthe tree till a class assignment is reached
Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal
More informationNonlinear Classification
Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions
More informationOverview. Background
Overview Implementation of robust methods for locating quantitative trait loci in R Introduction to QTL mapping Andreas Baierl and Andreas Futschik Institute of Statistics and Decision Support Systems
More informationAnalysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing
Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver
More informationDecision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro
Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationGene mapping in model organisms
Gene mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Goal Identify genes that contribute to common human diseases. 2
More informationPackage LBLGXE. R topics documented: July 20, Type Package
Type Package Package LBLGXE July 20, 2015 Title Bayesian Lasso for detecting Rare (or Common) Haplotype Association and their interactions with Environmental Covariates Version 1.2 Date 2015-07-09 Author
More informationData Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018
Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model
More informationStatistical Applications in Genetics and Molecular Biology
Statistical Applications in Genetics and Molecular Biology Volume 6, Issue 1 2007 Article 28 A Comparison of Methods to Control Type I Errors in Microarray Studies Jinsong Chen Mark J. van der Laan Martyn
More information1. Understand the methods for analyzing population structure in genomes
MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW3: Population Genetics Due: 24:00 EST, April 4, 2016 by autolab Your goals in this assignment are to 1. Understand the methods for analyzing population
More informationData Warehousing & Data Mining
13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.
More informationPerformance Evaluation
Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,
More informationday month year documentname/initials 1
ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationCS 6375 Machine Learning
CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}
More informationBayesian Regression (1/31/13)
STA613/CBB540: Statistical methods in computational biology Bayesian Regression (1/31/13) Lecturer: Barbara Engelhardt Scribe: Amanda Lea 1 Bayesian Paradigm Bayesian methods ask: given that I have observed
More information