INTRODUCTION TO GENETIC EPIDEMIOLOGY (GBIO0015-1) Prof. Dr. Dr. K. Van Steen


1 INTRODUCTION TO GENETIC EPIDEMIOLOGY (GBIO0015-1) Prof. Dr. Dr. K. Van Steen

2 CHAPTER 7: A WORLD OF INTERACTIONS 1 Beyond main effects 1.a Dealing with multiplicity 1.b A bird's eye view on roads less travelled 1.c Multi-locus analysis / epistasis analysis 2 Epistasis detection: a challenging task 2.a Variable selection 2.b Multifactor dimensionality reduction 2.c Interpretation 3 Future challenges K Van Steen 521

3 1 Beyond main effects 1.a Dealing with multiplicity Multiple testing explosion: ~500,000 SNPs span 80% of common variation in the genome (HapMap); the number of tests explodes further when n-th order interactions are considered K Van Steen 522

4 Ways to handle multiplicity Recall that several strategies can be adopted, including: - clever multiple testing corrective procedures - pre-screening strategies - multi-stage designs - adopting haplotype tests or - multi-locus tests Which of these approaches are more powerful is still under heavy debate. Does the multiple testing problem become unmanageable when looking at multiple loci jointly? K Van Steen 523

5 1.b A bird's eye view on roads less travelled Multiple disease susceptibility loci (multiple DSL) Dichotomy between - improving single-marker strategies to pick up multiple signals at once (PBAT) - testing groups of markers (FBAT multi-locus tests) K Van Steen 524

6 PBAT screening for multiple DSL Little has been done in the context of family-based screening for epistasis. First assess how well a method is capable of detecting multiple DSL. Simulation strategy (10,000 replicates): - genetic data from the Affymetrix 10K SNP chip array on 467 subjects from 167 families - select 5 regions, with 1 DSL in each region - generate traits according to a normal distribution, including up to 5 genetic contributions - for each replicate: generate heritability according to a uniform distribution with mean h = 0.03 for all loci considered (or h = 0.05 for all loci) (Van Steen et al 2005) K Van Steen 525

7 General theory on FBAT testing Test statistic: - works for any phenotype and genetic model - uses the covariance between offspring trait and genotype Test distribution: - computed assuming H0 is true; the random variable is the offspring genotype - condition on parental genotypes when available, extended to other family configurations (avoids specification of the allele distribution) - condition on offspring phenotypes (avoids specification of the trait distribution) (Horvath et al 1998, 2001; Laird et al 2000) K Van Steen 526
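A sketch of the statistic in standard FBAT form (a reconstruction consistent with the cited references, not a verbatim slide formula): U = Σ_i Σ_j (Y_ij − μ)(X_ij − E[X_ij | P_i]) and Z = U / sqrt(Var(U)), where Y_ij is the phenotype of offspring j in family i, X_ij the coded genotype, E[X_ij | P_i] its expectation under Mendelian transmission given the parental genotypes P_i, and μ an offset; under H0, Z is asymptotically standard normal.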

8 Screen: use between-family information [f(S,Y)]; calculate the conditional power of each SNP; select the top N SNPs on the basis of power. Test: use within-family information [f(X|S)] while computing the FBAT statistic; this step is statistically independent of the screening step; adjust for N tests (not 500K!) (Van Steen et al 2005) (Lange and Laird 2006) K Van Steen 527
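A minimal sketch of this screen-then-test logic in R; conditional_power() and fbat_test() are hypothetical placeholders for the PBAT/FBAT computations, not functions of any released package:

# Screen-and-test sketch for family-based designs (Van Steen et al 2005).
screen_and_test <- function(snps, families, N = 10, alpha = 0.05) {
  # Screen: rank SNPs by conditional power computed from between-family
  # information only (parental genotypes and offspring traits).
  power <- vapply(snps, function(s) conditional_power(s, families), numeric(1))
  top <- snps[order(power, decreasing = TRUE)][seq_len(N)]
  # Test: FBAT on the top N SNPs, using within-family information only.
  p <- vapply(top, function(s) fbat_test(s, families), numeric(1))
  data.frame(snp = top, p = p, significant = p < alpha / N)
}

Because the power ranking uses only between-family information and the FBAT statistic only within-family information, the Bonferroni adjustment needs to cover just the N tested SNPs.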

9 Power to detect genes with multiple DSL top: top 5 SNPs in the ranking; bottom: top 10 SNPs in the ranking (Van Steen et al 2005) K Van Steen 528

10 Power to detect genes with multiple DSL top: Benjamini-Yekutieli FDR control at 5% (general dependencies); bottom: Benjamini-Hochberg FDR control at 5% (Van Steen et al 2005) K Van Steen 529

11 FBAT multi-locus tests FBAT-SNP-PC attains higher power in candidate genes with lower average pair-wise correlations and moderate to high allele frequencies, with large gains (up to 80%) (Rakovski et al 2008). The new test has an overall performance very similar to that of FBAT-LC (FBAT-LC: Xin et al 2008) K Van Steen 530

12 In contrast: popular multi-locus approaches for unrelateds Parametric methods: - regression - logistic regression or (bagged) logic regression Non-parametric methods: - Combinatorial Partitioning Method (CPM): quantitative phenotypes; interactions - Multifactor Dimensionality Reduction (MDR): qualitative phenotypes; interactions - machine learning and data mining Does the multiple testing problem become unmanageable when looking at (genetic) interaction effects? More about this in Chapter 9. K Van Steen 531

13 1.c Multi-locus analysis / epistasis analysis Epistasis: what's in a name? Interaction is a kind of action that occurs as two or more objects have an effect upon one another. The idea of a two-way effect is essential in the concept of interaction, as opposed to a one-way causal effect. (Wikipedia) (slide: C Amos) K Van Steen 532

14 Epistasis: what's in a name? Distortions of Mendelian segregation ratios due to one gene masking the effects of another (William Bateson). Deviations from linearity in a statistical model (Ronald Fisher). Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans (Cordell 2002) K Van Steen 533

15 Why is there epistasis? From an evolutionary biology perspective, for a phenotype to be buffered against the effects of mutations, it must have an underlying genetic architecture composed of networks of genes that are redundant and robust. This creates dependencies among the genes in the network and is realized as epistasis. (slide: Y Chen, 2007) K Van Steen 534

16 Different types of interactions (figure: trait means for genotypes qq, Qq, QQ equal to m-a, m+d, m+a) (Fisher, Wright) (slide: C Amos) K Van Steen 535

17 Interpretation of epistasis The study of epistasis poses problems of interpretability. Statistically, epistasis is usually defined in terms of deviation from a model of additive multiple effects, but this might be on either a linear or logarithmic scale, which implies different definitions. (Moore 2004) - Despite the aforementioned concerns, there is evidence that a direct search for epistatic effects can pay dividends. - Epistasis analysis is expected to have an increasing role in future analyses K Van Steen 536

18 The frequency of epistasis Not a new idea! (Bateson 1909) Complexity of gene regulation and biochemical networks (Gibson 1996; Templeton 2000) Single-gene results don't replicate (Hirschhorn et al. 2002) Gene-gene interactions are commonly found when properly investigated (Templeton 2000) Working hypothesis: single-gene studies don't replicate because gene-gene interactions are more important (Moore and Williams 2002) (Moore 2003) K Van Steen 537

19 Slow shift from main effects towards epistasis effects (Motsinger et al 2007) K Van Steen 538

20 Power of a gene-gene or gene-environment interaction analysis There is a vast literature on power considerations - most of this literature supports its claims by extensive simulation studies There is a need for user-friendly software tools that allow the user to perform hands-on power calculations The main package targeting interaction analyses is QUANTO (v1.2.1): - available study designs for a disease (binary) outcome include the unmatched case-control, matched case-control, case-sibling, case-parent, and case-only designs; study designs for a quantitative trait include independent individuals and case-parent designs - references: Gauderman (2000a), Gauderman (2000b), Gauderman (2003) K Van Steen 539

21 A simple example of epistasis K Van Steen 540

22 A simple disease model Penetrance = Pr(affected | genotype) One-locus dominant model: Genotype aa Aa AA Status K Van Steen 541

23 A slightly more complicated two-locus model Genotype combinations: bb Bb BB by aa Aa AA Enumeration of two-locus models Although there are 2^9 = 512 possible models, because of symmetries in the data, only 50 of these are unique. Enumeration allows 0 and 1 only for penetrance values ("fully penetrant"). K Van Steen 542

24 Enumeration of two-locus models (Li and Reich 2000) Each model represents a group of equivalent models under permutations. The representative model is the one with the smallest model number. The six models studied in Neuman and Rice [67] (RR, RD, DD, T, Mod, XOR), as well as two single-locus models (IL), the recessive (R) and the interference (I) model, are marked. K Van Steen 543

25 Different degrees of epistasis (slide: Motsinger) K Van Steen 544

26 Pure epistasis model for dichotomous traits Suppose - p(A) = p(B) = p(a) = p(b) = 0.5 - HWE (hence, p(AA) = 0.5^2 = 0.25, p(Aa) = 2(0.5)(0.5) = 0.5) and no LD - penetrances P(affected | genotype) are given according to a table over the genotype combinations bb, Bb, BB by aa, Aa, AA Then make multiple use of Bayes' rule to retrieve the genotype distributions in cases and controls K Van Steen 545

27 Pure epistasis model for dichotomous traits Then the marginal genotype distributions for cases and controls are the same, and hence one-locus approaches will be powerless! (The slide shows the tables P(genotypes | affected) and P(genotypes | unaffected) over the genotype combinations bb, Bb, BB by aa, Aa, AA.) Worked example via Bayes' rule: P(aa,BB | D) = P(D | aa,BB) P(aa,BB) / P(D) = 1/4 = 0.25 K Van Steen 546
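The slide's penetrance values were lost in transcription; the following R sketch uses an assumed checkerboard (XOR-like) penetrance table with the same property, and verifies via Bayes' rule that the one-locus marginals in cases equal those in the population:

# Checkerboard penetrance table -- an assumed stand-in for the slide's table,
# illustrating the same point: a purely epistatic effect leaves the one-locus
# marginal genotype distributions identical in cases and controls.
p1 <- c(aa = 0.25, Aa = 0.50, AA = 0.25)   # HWE, p(A) = 0.5
p2 <- c(bb = 0.25, Bb = 0.50, BB = 0.25)   # HWE, p(B) = 0.5
f  <- matrix(c(0,   0.1, 0,                # rows: aa, Aa, AA
               0.1, 0,   0.1,              # cols: bb, Bb, BB
               0,   0.1, 0), 3, 3, byrow = TRUE,
             dimnames = list(names(p1), names(p2)))
joint <- outer(p1, p2)                     # P(g1, g2) under no LD
K <- sum(f * joint)                        # disease prevalence
cases    <- f * joint / K                  # Bayes: P(g1, g2 | affected)
controls <- (1 - f) * joint / (1 - K)      # P(g1, g2 | unaffected)
rowSums(cases); p1                         # identical marginals at locus 1
colSums(cases); p2                         # identical marginals at locus 2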

28 Purely epistatic 3-locus disease model for quantitative traits Assume all allele frequencies are 0.5. Heritability is 55% and prevalence is 6.25%. (Table of penetrances by genotypes at L1 and L2 within each stratum L3 = 0, 1, 2.) (Culverhouse et al 2002) K Van Steen 547

29 Expected genotype patterns for 3-locus model (Table with columns L1, L2, L3, p(g), E[#affected], E[#unaffected]; rows for each genotype pattern, plus "Other" and "Sum".) (Culverhouse et al 2002) (slide: J Ott 2004) K Van Steen 548

30 2 Epistasis detection: a challenging task Main challenges: variable selection, modeling, interpretation - making inferences about biological epistasis from statistical epistasis (slide: Chen 2007) K Van Steen 549

31 2.a Variable selection Introduction The aim is to make clever selections of marker combinations to look at in an epistasis analysis. This may not only aid in the interpretation of analysis results, but also reduce the burden of multiple testing and the computational burden K Van Steen 550

32 Variable selection and multiple testing Multiple testing is a thorny issue, the bane of statistical genetics. - The problem is not really the number of tests that are carried out: even if a researcher only tests one SNP for one phenotype, if many other researchers do the same and the nominally significant associations are reported, there will be a problem of false positives. (Balding 2006) Example - Given 3 disease SNPs (e.g., the Culverhouse 3-locus model above), making inferences is not at all an easy task: the full chi-square test has 26 df. With 50,000 SNPs, there are roughly 2 x 10^13 subsets of size 3; applying a Bonferroni correction then requires p-values near 2.4 x 10^-15. A more manageable approach is to test all possible pairs of loci for interaction effects that differ between cases and controls (Hoh and Ott 2003) K Van Steen 551
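The combinatorics can be checked directly in R:

choose(50000, 3)        # ~2.08e13 three-SNP subsets among 50,000 SNPs
0.05 / choose(50000, 3) # Bonferroni threshold ~2.4e-15
choose(50000, 2)        # all pairs: ~1.25e9, far more manageable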

33 Variable selection and multiple testing Pre-screening for subsequent testing: - independent screening and testing steps (PBAT screening; Van Steen et al 2005) - dependent screening and testing steps K Van Steen 552

34 Methods to correct for multiple testing Family-wise error rate (FWER) - In the presence of too many SNPs, the Bonferroni threshold will be extremely low. Bonferroni adjustments are conservative when statistical tests are not independent; they control the error rate associated with the omnibus null hypothesis; and the interpretation of a finding depends on how many statistical tests were performed Permutation data sets - Particularly handy for rare genotypes, small studies, non-normal phenotypes, and tightly linked markers - In case-control data this is relatively straightforward; in family data it is not at all an easy task K Van Steen 553

35 Methods to correct for multiple testing False discovery rate (FDR) - With too many SNPs it starts to break down and the power gain over Bonferroni is minimal (e.g., see Van Steen et al 2005) False-positive report probability (FPRP) - The probability of no true association between a genetic variant and disease given a statistically significant finding; it depends not only on the observed p-value but also on both the prior probability that the association between the genetic variant and the disease is real and the statistical power of the test (Wacholder et al 2004) - In general, Bayesian approaches do not yet have a big role in genetic association analyses, possibly because of the computational burden - Not yet well documented; what are the priors? (Balding 2006; Lucke 2008) K Van Steen 554

36 Variable selection and computation time When SNPs do not have independent effects, however, it is impossible for most current computer technologies to analyze the resulting astronomical number of possible combinations. For instance, if SNPs have been measured at a density of 1 SNP every 10 kilobases (kb), and if 10 statistical evaluations can be computed each second, then evaluation of each individual SNP would require about 30,000 seconds (i.e., 8.3 hours) of computer time. Exhaustive evaluation of all pairwise combinations of SNPs would require 1286 years. Although it might be possible for a large supercomputer to complete these computations in a reasonable amount of time, an exhaustive search of all combinations of 3 or 4 SNPs would not be possible even if every computer in the world were simultaneously working on the problem. (Moore and Ritchie 2004) K Van Steen 555

37 2.b Modeling Failure of traditional methods A large number of SNPs are genotyped - multiple comparisons problem: very small p-values are required for significance, which is compounded further in gene-environment interaction analyses Genetic loci may interact (epistasis) in their influence on the phenotype - loci with small marginal effects may go undetected - we may be interested in the interaction itself Curse of dimensionality and sparse cells K Van Steen 556

38 Curse of dimensionality and sparse cells For 2 SNPs, there are 9 = 3^2 possible two-locus genotype combinations. If the alleles are rare (MAF around 10%), then some cells will be empty (slide: C Amos) K Van Steen 557

39 Curse of dimensionality and sparse cells For 4 SNPs, there are 81 = 3^4 possible combinations, with more cells possibly empty (slide: C Amos) K Van Steen 558

40 Modeling: strategy 1 Strategy 1: Set association approach At each SNP, compute an association statistic T Build sums over the 1, 2, 3, etc. highest values t Evaluate the significance of a given sum by a permutation test The sum with the smallest p-value points towards the markers to select The smallest p-value is itself a single statistic, whose significance level is found by permutation The approach is applicable to many SNPs and has also been used in microarray settings (Hoh et al 2001) K Van Steen 559
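A minimal R sketch of the idea; per_snp_stat() is a hypothetical helper returning one association statistic per SNP:

# Set-association sketch (Hoh et al 2001): sum the largest per-SNP
# statistics and judge each sum size k by permutation.
set_association <- function(stats, pheno, geno, n_perm = 1000, max_k = 10) {
  sums_obs <- cumsum(sort(stats, decreasing = TRUE))[1:max_k]
  # Permutation null: recompute per-SNP statistics on shuffled phenotypes.
  null <- replicate(n_perm, {
    s <- per_snp_stat(geno, sample(pheno))   # hypothetical helper
    cumsum(sort(s, decreasing = TRUE))[1:max_k]
  })
  p <- rowMeans(null >= sums_obs)            # one p-value per sum size k
  list(p = p, best_k = which.min(p))         # smallest p points to the set
}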

41 Strategy 1: Set association approach (Hoh et al 2001) K Van Steen 560

42 Modeling: strategy 2 Strategy 2: Multi-locus approaches Most case-control studies do not take into account the multi-locus nature of complex traits When the aim is to analyze multiple SNPs or genes jointly, two classes of approaches emerge: - combine (properties of) single-locus statistics over multiple SNPs to obtain a new multivariate test statistic; depending on whether SNPs are in high LD or not, different measures need to be taken - look for patterns of genotypes at SNPs in different genomic locations K Van Steen 561

43 Two frameworks for multi-locus approaches (Onkamo and Toivonen 2006) Parametric methods: - Regression - Logistic or (Bagged) logic regression Non-parametric methods: Tree-based methods: - Recursive Partitioning (Helix Tree) - Random Forests (R, CART) Pattern recognition methods: - Mining association rules - Neural networks (NN) - Support vector machines (SVM) Data reduction methods: - DICE (Detection of Informative Combined Effects) - MDR (Multifactor Dimensionality Reduction) K Van Steen 562

44 Non-parametric chi-square The question is how to test for epistatic effects above and beyond the (independent) main effects of single-locus genotypes - Use the usual chi-square for interactions independent of main effects; isolate individual df's - Assess the difference in interactions between cases and controls, since interactions may then be better indicative of underlying pathways 3x3 table (Locus 1: AA, Aa, aa by Locus 2: BB, Bb, bb): main effect locus 1, 2 df; main effect locus 2, 2 df; interactions, 4 df; total 8 df K Van Steen 563

45 Partitioning chi-squares for one locus The 2 df can be split into two 1-df components. Simple disease model, population frequency K = 0.10, N = 100 cases and 100 controls. Predicted numbers of cases and controls in the given genotype classes, and resulting odds ratios (OR) K Van Steen 564

46 Partitioning chi-squares for two loci The 3x3 table of genotypes (4 df) may be partitioned into 4 independent components, each with 1 df. Do such partitioning for cases and controls separately (Agresti 2002). The four 2x2 subtables are: - (AA vs Aa) x (BB vs Bb) - (AA vs Aa) x (BB,Bb vs bb) - (AA,Aa vs aa) x (BB vs Bb) - (AA,Aa vs aa) x (BB,Bb vs bb) K Van Steen 565

47 Partitioning chi-squares for two loci Compare each of the four 2x2 subtables between cases and controls to see whether their odds ratios are the same K Van Steen 566
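A sketch of one such comparison in R, using a Woolf-type test on the difference of log odds ratios (an assumed test choice; the slide does not fix one):

# Compare the OR of one 2x2 genotype subtable between cases and controls.
log_or    <- function(t) log(t[1, 1] * t[2, 2] / (t[1, 2] * t[2, 1]))
se_log_or <- function(t) sqrt(sum(1 / t))   # Woolf standard error

compare_subtable <- function(tab_cases, tab_controls) {
  d  <- log_or(tab_cases) - log_or(tab_controls)
  se <- sqrt(se_log_or(tab_cases)^2 + se_log_or(tab_controls)^2)
  z  <- d / se
  c(z = z, p = 2 * pnorm(-abs(z)))          # H0: same OR in both groups
}

# Example: the (AA vs Aa) x (BB vs Bb) subtable in each group (toy counts)
tab_cases    <- matrix(c(30, 20, 15, 35), 2, 2)
tab_controls <- matrix(c(25, 25, 25, 25), 2, 2)
compare_subtable(tab_cases, tab_controls)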

48 Logistic regression LR is a derivative of linear regression that relates continuous or discrete independent variables to a dichotomous dependent variable (Hosmer and Lemeshow, 2000). One of the most common procedures for variable selection in an LR analysis is step-wise logistic regression (step LR) [Hosmer and Lemeshow, 2000]. - In the step-wise procedure, each variable is tested for independent effects, and those variables with significant effects are included in the model. - In a second step, interaction terms of those variables with significant main effects are added, and the significant ones are retained in the model. (Motsinger-Reif et al 2008) K Van Steen 567
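A minimal sketch of this two-step procedure with base R's glm(); 'snps' is assumed to be a data.frame of 0/1/2-coded genotype columns and 'y' a 0/1 status vector:

step_lr <- function(y, snps, alpha = 0.05) {
  dat <- data.frame(y = y, snps)
  # Step 1: screen each variable for an independent (main) effect.
  p_main <- sapply(names(snps), function(v) {
    fit <- glm(reformulate(v, "y"), family = binomial, data = dat)
    coef(summary(fit))[2, "Pr(>|z|)"]
  })
  keep <- names(snps)[p_main < alpha]
  if (length(keep) < 2) return(keep)
  # Step 2: refit with all pairwise interactions among retained variables;
  # significant interaction terms stay in the final model.
  glm(as.formula(paste("y ~ (", paste(keep, collapse = "+"), ")^2")),
      family = binomial, data = dat)
}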

49 Logistic regression LR is a de facto standard for traditional association studies. Using independent variables to predict a dichotomous dependent variable, LR by definition lacks the ability to characterize purely interactive effects: only variables that contain an independent main effect will be included in the final model. To properly evaluate non-linear, purely interactive effects, combinations of variables must be encoded as a single variable for inclusion in the analysis. Such an encoding scheme can be computationally expensive, depending on the number of variables used. (Motsinger-Reif et al 2008) K Van Steen 568

50 Strategy 2: Look for patterns of genotypes using unrelated individuals CPM = combinatorial partitioning method (Charlie Sing, U Michigan); applicable to a small number (~50) of SNPs only MDR = multifactor-dimensionality reduction method (Jason Moore, Vanderbilt U) LAD = logical analysis of data (P. Hammer, Rutgers U) Mining association rules, Apriori algorithm (R. Agrawal) Special approaches for microarray data (Hoh and Ott 2003) K Van Steen 569

51 The MDR algorithm What is MDR? A data mining approach to identify interactions among discrete variables that influence a binary outcome A nonparametric alternative to traditional statistical methods such as logistic regression Driven by the need to improve the power to detect gene-gene interactions (slide: L Mustavich) K Van Steen 570

52 The 6 steps of MDR K Van Steen 571

53 MDR Step 1 Divide data (genotypes, discrete environmental factors, and affectation status) into 10 distinct subsets K Van Steen 572

54 MDR Step 2 Select a set of n genetic or environmental factors (which are suspected of epistasis together) from the set of all variables in the training set K Van Steen 573

55 MDR Step 3 Create a contingency table for these multi-locus genotypes, counting the number of affected and unaffected individuals with each multi-locus genotype K Van Steen 574

56 MDR Step 4 Calculate the ratio of cases to controls for each multi-locus genotype Label each multi-locus genotype as high-risk or low-risk, depending on whether the case-control ratio is above a certain threshold This is the dimensionality reduction step: it reduces the n-dimensional space to 1 dimension with 2 levels K Van Steen 575

57 MDR Step 5 To evaluate the model developed in Step 4, use the labels to classify individuals as cases or controls, and calculate the misclassification error In fact, balanced accuracy is used (the arithmetic mean of sensitivity and specificity), which is mathematically equivalent to classification accuracy when the data are balanced K Van Steen 576

58 Repeat Steps 2 to 5 All possible combinations of n factors are evaluated sequentially for their ability to classify affected and unaffected individuals in the training data, and the best n-factor model is selected as the one with minimal misclassification error K Van Steen 577

59 MDR Step 6 The independent test data from the cross-validation are used to estimate the prediction error (testing accuracy) of the best model selected K Van Steen 578
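A compact R sketch of Steps 2-5 for one candidate SNP pair and one training/testing split (illustrative only; the MDR software performs this over all combinations and all cross-validation intervals). 'g1' and 'g2' are 0/1/2 genotype vectors, 'y' is 0/1 status, and 'train' is a logical index:

mdr_pair <- function(g1, g2, y, train) {
  cell <- paste(g1, g2)                        # two-locus genotype label
  # Step 3: cases/controls per cell (training part only).
  tab <- table(cell[train], y[train])
  # Step 4: high-risk if case:control ratio exceeds the threshold T
  thr <- sum(y[train] == 1) / sum(y[train] == 0)  # T = 1 for balanced data
  high <- tab[, "1"] / pmax(tab[, "0"], 0.5) > thr
  # Step 5: classify held-out individuals, score by balanced accuracy.
  pred <- as.integer(high[cell[!train]])       # NA for cells unseen in training
  sens <- mean(pred[y[!train] == 1] == 1, na.rm = TRUE)
  spec <- mean(pred[y[!train] == 0] == 0, na.rm = TRUE)
  (sens + spec) / 2
}

Repeating mdr_pair() over all SNP pairs and all 10 splits, and keeping the pair with the lowest misclassification error in training, reproduces Steps 2-6.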

60 Towards MDR Final Steps 1 through 6 are repeated for each possible cross-validation interval The best model across all 10 training and testing sets is selected on the basis of the criterion: - maximize the cross-validation consistency = the number of times a particular model was the best model across the cross-validation subsets The end of a cross-validation procedure also allows one to compute the - average training accuracy - average testing accuracy of the best models over all cross-validation sets, and possibly over multiple runs (with different seeds, to reduce the chance of observing spurious results due to chance divisions of the data) K Van Steen 579

61 MDR final The entire process is repeated for each k = 1 to N loci combination that is computationally feasible, and an optimal k-locus model is chosen for each level of k considered. The final model is based on maximizing two criteria: - the (average) prediction accuracy - the (average) cross-validation consistency Statistical significance is obtained by comparing the average cross-validation consistency from the observed data to the distribution of average consistencies under the null hypothesis of no associations, derived empirically from 1000 permutations (Ritchie et al 2001, Ritchie et al 2003, Hahn et al 2003) K Van Steen 580

62 Several measures of fitness to compare models Balanced accuracy Balanced accuracy (BA) weighs the classification accuracy of the two classes equally and is thought to be more powerful than accuracy alone when data are imbalanced, i.e., when the counts of cases and controls are not equal (Velez et al 2007) - BA is calculated from a 2x2 table relating exposure to status as (sensitivity + specificity)/2:
                 Real case   Real control
Model case       TP          FP
Model control    FN          TN
When #cases = #controls, TP + FN = FP + TN and BA = (TP + TN)/(2 x #cases) = (TP + TN)/(total sample size) K Van Steen 581

63 Several measures of fitness to compare models Model-adjusted balanced accuracy Model-adjusted balanced accuracy uses in addition a different threshold in the MDR modeling, one that is based on the actual counts of case and control samples in the data. - When individuals have missing data, it accounts for the precise number of individuals with complete data for that particular multi-locus combination - This makes MDR robust to class imbalances (Velez et al 2007) K Van Steen 582

64 Hypothesis test of the best model Evaluate the magnitude of the cross-validation consistency and prediction error estimates by adopting a permutation strategy In particular: - randomize the disease labels - repeat the MDR analysis many times (e.g., 1000) to get distributions of cross-validation consistencies and prediction errors - use these distributions to derive the p-values for the actual cross-validation consistency and prediction error K Van Steen 583
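A sketch of that strategy in R; mdr_best_model() is a hypothetical stand-in for one complete MDR model selection run:

# Build permutation null distributions for CVC and testing BA by re-running
# the full model selection on label-shuffled data.
perm_test <- function(y, geno, obs_ba, obs_cvc, n_perm = 1000) {
  null <- replicate(n_perm, {
    res <- mdr_best_model(sample(y), geno)   # randomized disease labels
    c(ba = res$testing_ba, cvc = res$cvc)
  })
  c(p_ba  = mean(null["ba", ]  >= obs_ba),   # empirical p-values
    p_cvc = mean(null["cvc", ] >= obs_cvc))
}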

65 An example empirical distribution (histogram of the permuted statistic, with sample quantiles from 0% to 100%) K Van Steen 584

66 The probability that we would see results as, or more, extreme than the observed value simply by chance is between 5% and 10% (slide: L Mustavich) The MDR software The MDR method is described in further detail by Ritchie et al. (2001) and reviewed by Moore and Williams (2002). An MDR software package is available from the authors by request, and is described in detail by Hahn et al. (2003). K Van Steen 585

67 The authors: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Hahn, Ritchie, and Moore (2003). Required operating system software Linux (Fedora Core 3): Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_06-b03); Java HotSpot(TM) Client VM (build 1.4.2_06-b03, mixed mode) Windows (XP Professional and XP Home): Java(TM) 2 Runtime Environment, Standard Edition (build v1.4.2_05) Minimum system requirements: 1 GHz processor, 256 MB RAM, 800x600 screen resolution K Van Steen 586

68 K Van Steen 587

69 Application to simulated data To show MDR in action, we simulated 200 cases and 200 controls using different multi-locus epistasis models (Evans 2006) - Scenario 1: 10 SNPs, adapted epistasis model M170 - Scenario 2: 10 SNPs, epistasis model M27, minor allele frequencies of the disease susceptibility pair 0.25 (Penetrance tables for M170 and M27 shown on the slide.) All markers were assumed to be in HWE, with no LD between the markers. K Van Steen 588

70 Application to simulated data (Figures: marginal genotype distributions for the controls and the cases under M170 and M27.) K Van Steen 589

71 Data format The definition of the format is as follows: - All fields are tab-delimited. - The first line contains a header row. This row assigns a label to each column of data. Labels should not contain whitespace. - Each following line contains a data row. Data values may be any string value which does not contain whitespace. - The right-most column of data is the class, or status, column. The data values for this column must be 1, to represent Affected or Case status, or 0, to represent Unaffected or Control status. No other values are allowed. K Van Steen 590

72 Easy data conversion The raw simulated data store each SNP as two allele columns (coded 1/2); summing the two columns and subtracting 2 yields the 0/1/2 genotype coding that MDR expects, after which the 0/1 class column is appended:

# combine cases and controls (one row per subject, two allele columns per SNP)
M170data <- rbind(M170.cases, M170.controls)
nsnps <- 10
M170ccdata <- matrix(NA, nrow = nrow(M170data), ncol = nsnps)
for (i in 1:nsnps) {
  # allele columns 2i-1 and 2i are coded 1/2; their sum minus 2 gives 0/1/2
  M170ccdata[, i] <- apply(M170data[, c(2*i - 1, 2*i)], 1, sum) - 2
}
# append the class column: 1 = case (first 200 rows), 0 = control
M170ccdata <- cbind(M170ccdata, c(rep(1, 200), rep(0, 200)))
write.table(M170ccdata, "M170ccdata.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE)

K Van Steen 591

73 M170 case-control data (preview of the converted file, with columns SNP1 through SNP10 and Class) K Van Steen 592

74 Loading a data file (MDR 2.0 beta 3) K Van Steen 593

75 Configuring the analysis K Van Steen 594

76 Reducing the number of cross-validations (figure comparing CV=10 and CV=3) (Motsinger and Ritchie 2006) K Van Steen 595

77 Reducing the number of cross-validations (CVs) In general, CV is a useful approach for limiting false positives by assessing the generalizability of models (Coffey et al 2004) The number of CV intervals in an MB-MDR analysis can be reduced from 10 to 5, but not to 3 CV seems to be rather important in the MDR algorithm: - Motsinger and Ritchie (2003) showed that, without CV, selection of a final model is difficult, but it is encouraging that the false-positive results almost always include at least one correct functional locus - This indicates that perhaps, in the case of extremely large datasets, like genome-wide scans, where using any type of CV would be computationally infeasible, MDR could still be used (without CV) to identify at least one functional locus K Van Steen 596

78 Search method configuration K Van Steen 597

79 Running the MDR analysis K Van Steen 598

80 Summary of results K Van Steen 599

81 Best MDR model K Van Steen 600

82 MDR best model K Van Steen 601

83 Values calculated by MDR (Measure: Formula/Interpretation)
- Balanced Accuracy: (Sensitivity + Specificity)/2; the fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class
- Accuracy: (TP+TN)/(TP+TN+FP+FN); proportion of instances correctly classified (skewed in favor of the larger class)
- Sensitivity: TP/(TP+FN); how likely a positive classification is correct
- Specificity: TN/(TN+FP); how likely a negative classification is correct
- Odds Ratio: (TP*TN)/(FP*FN); compares whether the probability of a certain event is the same for two groups
K Van Steen 602

84 Values calculated by MDR (Measure: Formula/Interpretation)
- Precision: TP/(TP+FP); the proportion of relevant cases returned
- Kappa: 2(TP*TN - FP*FN)/[(TP+FN)(FN+TN)+(TP+FP)(FP+TN)]; a function of total accuracy and random accuracy
- X^2: chi-squared score for the attribute constructed by MDR from this attribute combination
- F-Measure: 2*TP/(2*TP+FP+FN); a function of sensitivity and precision
TP: true positive; TN: true negative; FP: false positive; FN: false negative
K Van Steen 603

85 MDR CV results (table of per-interval results and their average) K Van Steen 604

86 MDR best model Graphical display on whole data If-then rules on whole data K Van Steen 605

87 The fitness landscape Gives the fitness landscape across all models as a line chart (the default). - The models produced are on the x-axis of the chart, in the order in which they were generated (e.g., 1, 2, 3, ..., 12, 13, 14, ...) - Training accuracy is shown on the y-axis. K Van Steen 606

88 The fitness landscape (table of models and their training accuracies: single-SNP models SNP1 through SNP10, followed by pairwise models SNP1,SNP2; SNP1,SNP3; ...) K Van Steen 607

89 Locus Dendrogram The dendrogram provides a graphical representation of the interactions between attributes (and the strength of those interactions) from the MDR analysis, up to the maximum number of interactions asked for, using an interaction dendrogram. The purpose of the interaction dendrogram is to assist the user in determining the nature of the interactions (redundant, additive, or synergistic). K Van Steen 608

90 Locus Dendrogram The dendrogram is constructed using hierarchical cluster analysis with average linkage. The distance matrix used by the cluster analysis is constructed by calculating the information gained by constructing two attributes using the MDR function (Moore et al 2006, Jakulin and Bratko 2003, Jakulin et al 2003) K Van Steen 609

91 Raw entropy values Entropy is a measure of randomness or disorder within a system; the higher the entropy, the more probable (more disordered) the state of the system. A classic example of this principle is a melting glass of ice: as the ice melts, the state becomes less ordered and the entropy increases. A graphical illustration of the relationships between information-theoretic measures on the joint distribution of attributes A and B: the surface area of a section corresponds to the labeled quantity (Jakulin 2003) [I(A;B) = mutual information = the amount of information provided by A about B = information gain; H(A) = entropy of A] K Van Steen 610

92 Raw entropy values Let us assume an attribute, A. We have observed its probability distribution, P_A(a). Shannon's entropy, measured in bits, is a measure of the predictability of an attribute and is defined as H(A) = - Σ_a P_A(a) log2 P_A(a). Hence, phrased differently, the higher the entropy, the less reliable are our predictions about A. We can understand H(A) as the amount of uncertainty about A, as estimated from its probability distribution. K Van Steen 611

93 Raw entropy values Single-attribute values: - H(A): the entropy of the given attribute (A) - H(A|C): the entropy of the given attribute (A) given the class (C) - I(A;C): the information gain of the given attribute (A) given the class (C) Pairwise values: - H(AB): the entropy of the given constructed attribute (AB) - H(AB|C): the entropy of the given constructed attribute (AB) given the class (C) - I(A;B): the information gain of attribute (A) given attribute (B) - I(A;B;C): the information gain for attribute (A) and attribute (B) given the class (C) - I(AB;C): the information gain for the constructed attribute (AB) given the class (C) K Van Steen 612

94 Raw entropy values Mutual information I(A;B) as a function of r^2 (as a measure of LD between markers), for a subset of the Spanish Bladder Cancer Study (SBCS) data; unpublished results K Van Steen 613

95 K Van Steen 614

96 Locus dendrogram The colors range from red, representing a high degree of synergy (positive information gain), through orange for a lesser degree, to gold representing the midway point between synergy and redundancy. On the redundancy end of the spectrum, the highest degree is represented by blue (negative information gain), with a lesser degree represented by green. Synergy: the interaction between two attributes provides more information than the sum of the individual attributes. Redundancy: the interaction between attributes provides redundant information. K Van Steen 615

97 Positive and negative interactions Say I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C) Assume that we are uncertain about the value of C, but we have information about A and B. - Knowledge of A alone eliminates I(A;C) bits of uncertainty from C. - Knowledge of B alone eliminates I(B;C) bits of uncertainty from C. - However, the joint knowledge of A and B eliminates I(A,B;C) bits of uncertainty. Hence, if the interaction information is positive, we benefit from an unexpected synergy. If the interaction information is negative, we suffer diminishing marginal returns by introducing attributes that partly contribute redundant information. K Van Steen 616
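These quantities are easy to estimate from data; a small R sketch (with a purely epistatic toy class variable so that I(A;B;C) comes out clearly positive):

H <- function(x) {                      # Shannon entropy in bits
  p <- table(x) / length(x)
  -sum(p * log2(p))
}
I2 <- function(a, c) H(a) + H(c) - H(paste(a, c))   # I(A;C)
I3 <- function(a, b, c) {                           # I(A;B;C), as defined above
  I2(paste(a, b), c) - I2(a, c) - I2(b, c)
}

# Positive I3: synergy; negative I3: redundancy.
set.seed(1)
a <- rbinom(500, 2, 0.5); b <- rbinom(500, 2, 0.5)
c <- as.integer(xor(a %% 2, b %% 2))    # class depends on a and b jointly only
I3(a, b, c)                             # clearly positive (synergy)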

98 Significance of the results We simulated data from a two-locus epistasis model; the remaining SNPs were generated at random. Hence, what does it mean that SNP5 was chosen as the best single-effects model? Answer: every k-locus setting will give rise to a best model; MDR forces an optimal model for every k-locus setting. K Van Steen 617

99 Significance of results The best model among all 1- to 3-locus models is the one with maximal cross-validation consistency and maximal average balanced prediction accuracy But how significant is this result? K Van Steen 618

100 Configuring the permutation analysis (MDR PT Module alpha) K Van Steen 619

101 Performing the MDR permutation test K Van Steen 620

102 Performing the MDR permutation test K Van Steen 621

103 Performing the MDR permutation test
Model / Testing BA (p-value) / CVC (p-value)
SNP5 / (0.0540) / (0.2160)
SNP1-SNP2 / (<0.0010) / (0.2160)
SNP1-SNP2-SNP / (<0.0010) / (0.2160)
Testing BA obtained from the MDR summary table; p-values obtained from the MDR Permutation Testing p-value calculator K Van Steen 622

104 Performing the MDR permutation test
Permutation null distribution for the best k=1-3 models (Model / Testing BA (p-value) / CVC (p-value)):
SNP5 / (0.0540) / 10 (0.2160)
SNP1-SNP2 / (<0.0010) / 10 (0.2160)
SNP1-SNP2-SNP / (<0.0010) / 10 (0.2160)
Permutation null distribution for the best k-locus model (hence 3 distributions):
SNP5 / ( ) / 10 (0.1720)
SNP1-SNP2 / ( ) / 10 (0.0570)
SNP1-SNP2-SNP / ( ) / 10 (0.0440)
K Van Steen 623

105 What is going on?
Permutation null distribution for the best k-locus model (hence 3 distributions; Model / Testing BA (p-value) / CVC (p-value)):
SNP5 / ( ) / 10 (0.1720)
SNP1-SNP2 / ( ) / 10 (0.0570)
SNP1-SNP2-SNP / ( ) / 10 (0.0440)
Is the effect of a strong main effect carried through in higher-order interactions? What will happen for data simulated under M27 (with main effects by simulation)? K Van Steen 624

106 Results for M27 K Van Steen 625

107 Results for M27
Permutation null distribution for the best k=1-3 models (Model / Testing BA (p-value) / CVC (p-value)):
SNP1-SNP2 / (<0.0010) / 10 (0.2310)
SNP1-SNP2-SNP4 / (<0.0010) / 5 (0.9110)
What about SNP2? Why is this not highlighted as an important main effect? Maximizing CVC first and then looking at prediction accuracy highlights SNP1-SNP2. Maximizing prediction accuracy alone would point towards SNP1-SNP2-SNP4. K Van Steen 626

108 Results for M27 Using permutation null distributions per k-locus setting, the following results are obtained (Model / Testing BA (p-value) / CVC (p-value)):
SNP1 / (<0.0010) / 10 (0.1790)
SNP1-SNP2 / (<0.0010) / 10 (0.0620)
SNP1-SNP2-SNP4 / (<0.0010) / 5 (0.9110)
Wouldn't it be natural to correct for SNP1 when looking for interactions? What if more than one main effect is present in the data? K Van Steen 627

109 Strengths of MDR Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multi-locus data Non-parametric: no parameter values are estimated Assumes no particular genetic model Minimal false-positive rates K Van Steen 628

110 Weaknesses of MDR Computationally intensive (especially with >10 loci) - The original MDR software supports disease models with up to 15 factors at a time from a list of up to 500 total factors and a maximum sample size of 4,000 subjects. - Parallel MDR (Bush et al 2006) is a redesign of the initial MDR algorithm that allows an unlimited number of study subjects, total variables and variable states, and removes restrictions on the order of interactions being analyzed; the algorithm gives an approximately 150-fold decrease in runtime for equivalent analyses. The curse of dimensionality: decreased predictive ability with high dimensionality and small samples, due to cells with no data K Van Steen 629

111 Several (other) extensions to the MDR paradigm (CV based) (Lou et al 2008) K Van Steen 630

112 Different measures to score model quality One crucial component of the MDR algorithm measures the percentage of cases and controls incorrectly labelled by the proposed classification: the classification error. - The combination of variables that produces the lowest classification error is selected as the best, or most fit, model. The correctly and incorrectly labelled cases and controls can be expressed as a two-way contingency table. The ability of MDR to detect gene-gene interactions can be improved by replacing classification error with a different measure to score model quality. - Of 10 measures evaluated, Bush et al (2008) found that the likelihood ratio and normalized mutual information (NMI) consistently improve the detection power of MDR in simulated data over classification error. These measures also reduce the inclusion of spurious variables in a multi-locus model. K Van Steen 631

113 Contingency table measures of classification performance (Bush et al 2008) K Van Steen 632

114 Towards an easy-to-adapt framework MB-MDR (Lou et al 2008) FAM-MDR K Van Steen 633

115 MB-MDR as a semi-parametric approach for unrelateds Step 1: new risk cell identification via an association test on each genotype cell c_j - parametric or non-parametric test of association Step 2: test the one-dimensional genetic construct X on Y Step 3: assess significance - W = [b/se(b)]^2, with b = ln(OR) - derive the correct null distribution for W (Calle et al 2007, Calle et al 2008) K Van Steen 634

116 Motivation 1 for MB-MDR Some important interactions could be missed by MDR due to pooling too many cells together (Calle et al 2008) K Van Steen 635

117 Motivation 2 for MB-MDR MDR cannot deal with main effects / confounding factors / non-dichotomous outcomes K Van Steen 636

118 Motivation 3 for MB-MDR MDR has low performance in the presence of genetic heterogeneity (Calle et al 2008) K Van Steen 637

119 A comparison of analytical methods: GENN, RF, FITF, MDR, logistic regression GENN Grammatical evolution neural network (GENN) is a novel pattern recognition method developed to detect main effects or multi-locus models of association without exhaustively searching all possible multi-locus combinations. Grammatical evolution (GE) is a machine-learning algorithm inspired by the biological processes of transcription and translation. GE uses a genetic algorithm in combination with a pre-specified grammar (a set of translation rules) to automatically evolve an optimal computer program. GENN utilizes GE to evolve the inputs (predictor variables), architecture (arrangement of layers and functions), and weights of a neural network (NN) to optimally classify a given dataset. (Motsinger-Reif et al 2008) K Van Steen 638

120 A schematic overview of the GENN method (Motsinger-Reif 2008) K Van Steen 639

121 Random Forests (RF) RF is a machine-learning technique that builds a forest of classification trees wherein each component tree is grown from a bootstrap sample of the data, and the variable at each tree node is selected from a random subset of all variables in the data (Breiman, 2001). The final classification of an individual is determined by voting over all trees in the forest. RF models may uncover interactions among factors that do not exhibit strong marginal effects, without demanding a pre-specified model (McKinney et al., 2006). Additionally, tree methods are suited to dealing with certain types of genetic heterogeneity, since splits near the root node define separate model subsets in the data. (Motsinger-Reif et al 2008) K Van Steen 640

122 Random Forests (RF) Each tree in the forest is constructed as follows from data having N individuals and M explanatory variables: - Choose a training sample by selecting N individuals, with replacement, from the entire data set. - At each node in the tree, randomly select m variables from the entire set of M variables in the data. The absolute magnitude of m is a function of the number of variables in the data set and remains constant throughout the forest building process. - Choose the best split at the current node from among the subset of m variables selected above. - Iterate the second and third steps until the tree is fully grown (no pruning). Repetition of this algorithm yields a forest of trees, each of which has been trained on bootstrap samples of individuals (Motsinger-Reif et al 2008) K Van Steen 641
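A minimal sketch with the randomForest R package (toy data; parameter choices are illustrative, not the values used in the cited studies):

library(randomForest)

set.seed(42)
n <- 200
geno <- data.frame(matrix(rbinom(n * 10, 2, 0.3), n, 10))  # 10 SNPs, coded 0/1/2
y <- factor(rbinom(n, 1, 0.5))                             # toy case/control labels

fit <- randomForest(x = geno, y = y,
                    ntree = 500,    # number of bootstrap-grown trees
                    mtry = 3,       # m variables tried at each node split
                    importance = TRUE)
fit                # OOB error: the internal estimate of generalizability
importance(fit)    # variable importance, usable as a feature selection metric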

123 A schematic overview of the RF method (Motsinger-Reif et al 2008) K Van Steen 642

124 Advantages of the Random Forest method It can handle a large number of input variables. It estimates the relative importance of variables in determining classification, thus providing a metric for feature selection. RF produces a highly accurate classifier with an internal unbiased estimate of generalizability during the forest building process. RF is fairly robust in the presence of etiological heterogeneity and relatively high amounts of missing data (Lunetta et al., 2004). Finally, and of increasing importance as the number of input variables increases, learning is fast and computation time is modest even for very large data sets (Robnik-Sikonja, 2004). (Motsinger-Reif et al 2008) K Van Steen 643

125 Focused Interaction Testing Framework (FITF) The FITF was recently developed to detect epistatic interactions that predict disease risk. Details of the FITF algorithm and software can be found in Millstein et al. (2006). FITF is a modification of the interaction testing framework (ITF) method; it pre-screens all possible gene sets to focus on those that are potentially the most informative, and reduces the multiple testing problem by reducing the number of statistical tests performed. FITF has been shown to outperform MDR when interactions involve additive, recessive, or dominant genes (Millstein et al., 2006). (Motsinger-Reif et al 2008) K Van Steen 644

126 Focused Interaction Testing Framework (FITF) The FITF algorithm modifies the ITF approach to reduce the overall number of variants tested with an initial filter process. A chi-square goodness-of-fit statistic that compares the observed with the expected distribution of multi-locus genotype combinations in the combined case-control population is used in the prescreening stage. This statistic, referred to as the chi-square subset (CSS), has the form CSS = Σ_{i=1..r} [n_i - E(n_i)]^2 / E(n_i), where n_i is the observed number of subjects (regardless of case/control status) in the i-th genotype group and r is the total number of genotype groups. The expected count, denoted E(n_i), is estimated from the sample marginal genotype frequencies of each gene. K Van Steen 645

127 Conclusion on comparison MDR results in the one- and two-locus models were comparable to GENN GENN performs poorly with the three-locus models considered in Motsinger-Reif et al (2008). - This highlights a disadvantage of an evolutionary computation approach in exploring purely epistatic models: it is much less likely that three loci will be stochastically assembled into a model to evaluate than two loci. Both GENN and MDR outperformed FITF - Because GENN and MDR both utilize permutation distributions for significance testing, correction for multiple testing is unnecessary. While the filter stage of FITF does reduce the number of tests performed with the ITF strategy, there is still a very large number of tests to correct for. K Van Steen 646

128 Conclusion on comparison Both RF and stepLR were unable to detect purely epistatic models, since both require marginal main effects to perform variable selection. - Future extensions/modifications of these approaches should consider this limitation and modify the variable selection process to capture pure interactions. - Some groups have in fact begun to make modifications in this way (Bureau et al., 2005) K Van Steen 647

129 2.c Interpretation of multi-locus results It is always a good idea to use several model selection criteria before interpreting (Ritchie et al 2007) K Van Steen 648

130 A flexible framework for analysis acknowledging interpretation capability The framework contains four steps to detect, characterize, and interpret epistasis - Select interesting combinations of SNPs - Construct new attributes from those selected - Develop and evaluate a classification model using the newly constructed attribute(s) - Interpret the final epistasis model using visual methods (Moore et al 2005) K Van Steen 649

131 Flexible framework Step 1 Attribute selection - Use entropy-based measures of information gain (IG) and interaction - Evaluate the gain in information about a class variable (e.g. case-control status) from merging two attributes together - This measure of IG allows us to gauge the benefit of considering two (or more) attributes as one unit (slide: Chen 2007) K Van Steen 650

132 Information gain Recall McGill's multiple mutual information (Te Sun Han 1980): I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C) (information gain) If I(A;B;C) > 0 - evidence for an attribute interaction that cannot be linearly decomposed If I(A;B;C) < 0 - the information between A and B is redundant If I(A;B;C) = 0 - evidence of conditional independence or a mixture of synergy and redundancy K Van Steen 651

133 Illustration of entropy-based measures on Model 1 (Ritchie et al 2001) K Van Steen 652

134 Attribute selection based on entropy Entropy-based IG is estimated for each individual attribute (i.e. main effects) and each pairwise combination of attributes (i.e. interaction effects). Pairs of attributes are sorted and those with the highest IG, or percentage of entropy in the class removed, are selected for further consideration (slide: Chen 2007) K Van Steen 653

135 Attribute selection based on ReliefF The Relief statistic was developed by the computer science community as a powerful method for determining the quality or relevance of an attribute (i.e. variable) for predicting a discrete endpoint or class variable (Kira and Rendell 1992, Kononenko 1994, Robnik-Sikonja and Kononenko 2003). Relief is especially useful when there is an interaction between two or more attributes and the discrete class variable. It is thus superior to univariate filters such as a chi-square test of independence (see later) when interactions are present. K Van Steen 654

136 Attribute selection based on ReliefF In particular, Relief estimates the quality of attributes through a type of nearest-neighbor algorithm that selects neighbors (instances) from the same class and from the different class based on the vector of values across attributes. Weights (W), or quality estimates, for each attribute (A) are updated based on whether the nearest neighbor from the same class (nearest hit, H) of a randomly selected instance (R) and the nearest neighbor from the other class (nearest miss, M) have the same or different attribute values. This process of adjusting weights is repeated for m instances. The algorithm produces weights for each attribute ranging from -1 (worst) to +1 (best). K Van Steen 655
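A simplified Relief sketch in R (one nearest hit and one nearest miss per sampled instance, as in Kira and Rendell 1992; the ReliefF variants average over k neighbors):

# X: numeric matrix of attributes (rows = instances); y: class vector; m: samples.
relief <- function(X, y, m = nrow(X)) {
  w <- numeric(ncol(X))
  rng <- apply(X, 2, function(v) diff(range(v)) + 1e-12)  # per-attribute scaling
  for (i in sample(nrow(X), m)) {
    d <- rowSums(abs(sweep(X, 2, X[i, ])))   # distance over ALL attributes
    d[i] <- Inf
    hit  <- which.min(ifelse(y == y[i], d, Inf))   # nearest same-class instance
    miss <- which.min(ifelse(y != y[i], d, Inf))   # nearest other-class instance
    # Penalize attributes differing at the hit, reward those differing at the miss.
    w <- w - abs(X[i, ] - X[hit, ]) / rng / m + abs(X[i, ] - X[miss, ]) / rng / m
  }
  w   # roughly in [-1, +1]; higher = more relevant
}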

137 Attribute selection based on ReliefF (applied to M27) Only the top 10% of scores will be returned to the filtered data set K Van Steen 656

138 Attribute selection based on ReliefF applied to M27 For the M27 simulated data, this reduction of the overall attribute count does not make sense, of course (# SNPs = 10!) K Van Steen 657

139 Attribute selection based on TuRF ReliefF is able to capture attribute interactions because it selects nearest neighbors using the entire vector of values across all attributes. However, this advantage is also a disadvantage, because the presence of many noisy attributes can reduce the signal the algorithm is trying to capture. The tuned ReliefF algorithm (TuRF) systematically removes attributes that have low quality estimates so that the ReliefF values of the remaining attributes can be re-estimated. (Moore and White 2008) K Van Steen 658

140 Attribute selection based on chi-squared The MDR software provides a simple chi-square test of independence as a univariate filter. - The manual specifies that this filter should be used to condition your MDR analysis on those attributes that have an independent main effect. - However, the MDR software itself does not give you many options to actually perform this conditioning The ReliefF filter will be more useful for capturing those attributes that are likely to be involved in an interaction. K Van Steen 659

141 Attribute selection based on chi-squared (applied to M27) K Van Steen 660

142 Attribute selection based on odds ratio The odds ratio (OR) is a way of comparing whether the probability of a certain event is the same for two groups. - An odds ratio of 1 implies that the event is equally likely in both groups. - An odds ratio greater than 1 implies that the event is more likely in the first group, whereas - a value less than 1 implies that the event is less likely in the first group. When an attribute is polytomous (i.e. has more than 2 levels), MDR calculates the OR for each possible contrast and then reports the largest OR value. - For 3 levels 0, 1, 2, the following contrasts are considered: 0 vs 1; 0 vs 2; 1 vs 2; 0 vs 1&2; 1 vs 0&2; 2 vs 1&0 K Van Steen 661

143 Flexible framework Step 2 Constructive induction, for instance MDR - A multi-locus genotype combination is considered high-risk if the ratio of cases to controls exceeds a given threshold T; otherwise it is considered low-risk - Genotype combinations considered to be high-risk are labeled G1, while those considered low-risk are labeled G0 - This process constructs a new one-dimensional attribute with levels G0 and G1 (slide: Chen 2007) K Van Steen 662

144 Flexible framework Step 3 Classification and machine learning - The single attribute obtained in Step 2 can be modeled using machine learning and classification techniques, with Bayes classifiers as one technique - Mitchell (1997) defines the naive Bayes classifier as v_NB = argmax over v_j in V of P(v_j) Π_i P(a_i | v_j) - where v_j is one of a set of V classes and a_i is one of n attributes describing an event or data element. The class associated with a specific attribute list is the one which maximizes the product of the probability of the class and the probabilities of each attribute value given the specified class. K Van Steen 663

145 Flexible framework Step 3 - The standard way to apply the naive Bayes classifier to genotype data would be to use the genotype information for each individual as a list of attributes to distinguish between the two hypotheses "the subject is high-risk" and "the subject is low-risk". Alternatively, an odds ratio for the single multi-locus attribute can also be estimated using logistic regression to facilitate a traditional epidemiological analysis and interpretation. - Evaluation of the predictor can be carried out using cross-validation (Hastie et al., 2001) and permutation testing (Good, 2000), for example. (Moore et al 2006) K Van Steen 664
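A small R sketch of the naive Bayes rule above, applied to attribute columns coded 0/1/2 and using Laplace smoothing (an assumed smoothing choice):

# train/test: data.frames of 0/1/2-coded attributes; y: class labels for train.
naive_bayes_predict <- function(train, y, test, levs = 0:2) {
  classes <- sort(unique(y))
  score <- sapply(classes, function(v) {
    log_prior <- log(mean(y == v))                    # log P(v_j)
    log_lik <- rowSums(sapply(names(train), function(a) {
      tab <- table(factor(train[[a]][y == v], levels = levs))
      lp  <- log((tab + 1) / (sum(tab) + length(tab)))  # smoothed log P(a_i|v_j)
      lp[match(test[[a]], levs)]
    }))
    log_prior + log_lik
  })
  classes[max.col(score)]    # argmax over classes, one label per test row
}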

146 Flexible framework Step 4 Interpretation: interaction graphs - Comprised of a node for each attribute, with pairwise connections between them. - Each node is labeled with the percentage of entropy removed (i.e. IG) by that attribute. - Each connection is labeled with the percentage of entropy removed by the corresponding pairwise Cartesian product of attributes. (slide: Chen 2007) K Van Steen 665

147 Flexible framework Step 4 Interpretation: dendrograms - Hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree. K Van Steen 666

148 Hierarchical clustering with average linkage Here the distance between two clusters is defined as the average of the distances between all pairs of objects, where each pair is made up of one object from each cluster K Van Steen 667
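A sketch of how such a dendrogram can be built in R, assuming a matrix 'geno' of attributes, a class vector 'y', and the I3() interaction-information helper sketched earlier; converting synergy into a distance this way is one of several reasonable choices:

# Average-linkage clustering of attributes, with "distance" derived from
# pairwise interaction information (strong synergy = small distance).
p <- ncol(geno)
info <- matrix(0, p, p)
for (i in 1:(p - 1)) for (j in (i + 1):p)
  info[i, j] <- info[j, i] <- I3(geno[, i], geno[, j], y)
d <- as.dist(max(info) - info)           # invert: more synergy, closer together
plot(hclust(d, method = "average"))      # the interaction dendrogram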

149 Flexible framework The flexibility of this framework is the ability to plug and play: - different attribute selection methods other than the entropy-based ones - different constructive induction algorithms other than MDR - different machine learning strategies other than a naïve Bayes classifier (slide: Chen 2007) K Van Steen 668

150 3 Future challenges Integration of omics data in GWAs K Van Steen 669

151 Integration of omics data in GWAs (Hirschhorn 2009) K Van Steen 670

152 Integration of omics data in GWAs A few straightforward examples: Post-analysis - as a validation tool in main-effects GWAs During the analysis: - Epistasis screening (FAM-MDR): use expression values to prioritize multi-locus combinations - Main-effects screening (PBAT): construct an overall phenotype for each marker based on the linear combination of expression values (e.g., within 1 Mb from the marker) that maximizes heritability, and perform FBAT-PC screening to prioritize SNPs K Van Steen 671


More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

SNP Association Studies with Case-Parent Trios

SNP Association Studies with Case-Parent Trios SNP Association Studies with Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health September 3, 2009 Population-based Association Studies Balding (2006). Nature

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Regression tree methods for subgroup identification I

Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

More information

Efficient Algorithms for Detecting Genetic Interactions in Genome-Wide Association Study

Efficient Algorithms for Detecting Genetic Interactions in Genome-Wide Association Study Efficient Algorithms for Detecting Genetic Interactions in Genome-Wide Association Study Xiang Zhang A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial

More information

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15. NIH Public Access Author Manuscript Published in final edited form as: Stat Sin. 2012 ; 22: 1041 1074. ON MODEL SELECTION STRATEGIES TO IDENTIFY GENES UNDERLYING BINARY TRAITS USING GENOME-WIDE ASSOCIATION

More information

Linear Regression (1/1/17)

Linear Regression (1/1/17) STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies. Theoretical and computational aspects of association tests: application in case-control genome-wide association studies Mathieu Emily November 18, 2014 Caen mathieu.emily@agrocampus-ouest.fr - Agrocampus

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu February 12, 2015 Lecture 3:

More information

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018 High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

Causal Model Selection Hypothesis Tests in Systems Genetics

Causal Model Selection Hypothesis Tests in Systems Genetics 1 Causal Model Selection Hypothesis Tests in Systems Genetics Elias Chaibub Neto and Brian S Yandell SISG 2012 July 13, 2012 2 Correlation and Causation The old view of cause and effect... could only fail;

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests Chapter 59 Two Correlated Proportions on- Inferiority, Superiority, and Equivalence Tests Introduction This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National

More information

Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning

Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning Robert B. Gramacy University of Chicago Booth School of Business faculty.chicagobooth.edu/robert.gramacy

More information

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics. Bioinformatics Jason H. Moore, Ph.D. Frank Lane Research Scholar in Computational Genetics Associate Professor of Genetics Adjunct Associate Professor of Biological Sciences Adjunct Associate Professor

More information

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM Lecture 9 SEM, Statistical Modeling, AI, and Data Mining I. Terminology of SEM Related Concepts: Causal Modeling Path Analysis Structural Equation Modeling Latent variables (Factors measurable, but thru

More information

Backward Genotype-Trait Association. in Case-Control Designs

Backward Genotype-Trait Association. in Case-Control Designs Backward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs Tian Zheng, Hui Wang and Shaw-Hwa Lo Department of Statistics, Columbia University, New York, New York,

More information

UVA CS 4501: Machine Learning

UVA CS 4501: Machine Learning UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) BIRS 016 1 HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) Malka Gorfine, Tel Aviv University, Israel Joint work with Li Hsu, FHCRC, Seattle, USA BIRS 016 The concept of heritability

More information

Multiple QTL mapping

Multiple QTL mapping Multiple QTL mapping Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] 1 Why? Reduce residual variation = increased power

More information

Filter Methods. Part I : Basic Principles and Methods

Filter Methods. Part I : Basic Principles and Methods Filter Methods Part I : Basic Principles and Methods Feature Selection: Wrappers Input: large feature set Ω 10 Identify candidate subset S Ω 20 While!stop criterion() Evaluate error of a classifier using

More information

Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td

Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak

More information

SNP-SNP Interactions in Case-Parent Trios

SNP-SNP Interactions in Case-Parent Trios Detection of SNP-SNP Interactions in Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 2, 2009 Karyotypes http://ghr.nlm.nih.gov/ Single Nucleotide Polymphisms

More information

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite

More information

Part I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes

Part I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes Week 10 Based in part on slides from textbook, slides of Susan Holmes Part I Linear regression & December 5, 2012 1 / 1 2 / 1 We ve talked mostly about classification, where the outcome categorical. If

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Aggregated Quantitative Multifactor Dimensionality Reduction

Aggregated Quantitative Multifactor Dimensionality Reduction University of Kentucky UKnowledge Theses and Dissertations--Statistics Statistics 2016 Aggregated Quantitative Multifactor Dimensionality Reduction Rebecca E. Crouch University of Kentucky, rebecca.crouch@uky.edu

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Machine Learning Overview

Machine Learning Overview Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2008 Paper 241 A Note on Risk Prediction for Case-Control Studies Sherri Rose Mark J. van der Laan Division

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new

More information

Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions

Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions Nilanjan Chatterjee, Zeynep Kalaylioglu 2, Roxana Moslehi, Ulrike Peters 3, Sholom Wacholder

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017

Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017 Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 46 Lecture Overview 1. Variance Component

More information

Research Statement on Statistics Jun Zhang

Research Statement on Statistics Jun Zhang Research Statement on Statistics Jun Zhang (junzhang@galton.uchicago.edu) My interest on statistics generally includes machine learning and statistical genetics. My recent work focus on detection and interpretation

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

SF2930 Regression Analysis

SF2930 Regression Analysis SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classication 20 February 2017 1 / 30 Idag Overview Regression trees Pruning Bagging, random forests 2 / 30 Today Overview Regression

More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Overview. Background

Overview. Background Overview Implementation of robust methods for locating quantitative trait loci in R Introduction to QTL mapping Andreas Baierl and Andreas Futschik Institute of Statistics and Decision Support Systems

More information

Analysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing

Analysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

Gene mapping in model organisms

Gene mapping in model organisms Gene mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Goal Identify genes that contribute to common human diseases. 2

More information

Package LBLGXE. R topics documented: July 20, Type Package

Package LBLGXE. R topics documented: July 20, Type Package Type Package Package LBLGXE July 20, 2015 Title Bayesian Lasso for detecting Rare (or Common) Haplotype Association and their interactions with Environmental Covariates Version 1.2 Date 2015-07-09 Author

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 6, Issue 1 2007 Article 28 A Comparison of Methods to Control Type I Errors in Microarray Studies Jinsong Chen Mark J. van der Laan Martyn

More information

1. Understand the methods for analyzing population structure in genomes

1. Understand the methods for analyzing population structure in genomes MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW3: Population Genetics Due: 24:00 EST, April 4, 2016 by autolab Your goals in this assignment are to 1. Understand the methods for analyzing population

More information

Data Warehousing & Data Mining

Data Warehousing & Data Mining 13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.

More information

Performance Evaluation

Performance Evaluation Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,

More information

day month year documentname/initials 1

day month year documentname/initials 1 ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Bayesian Regression (1/31/13)

Bayesian Regression (1/31/13) STA613/CBB540: Statistical methods in computational biology Bayesian Regression (1/31/13) Lecturer: Barbara Engelhardt Scribe: Amanda Lea 1 Bayesian Paradigm Bayesian methods ask: given that I have observed

More information