INTRODUCTION TO GENETIC EPIDEMIOLOGY (GBIO0015-1) Prof. Dr. Dr. K. Van Steen
2 CHAPTER 7: A WORLD OF INTERACTIONS
1 Beyond main effects
  1.a Dealing with multiplicity
  1.b A bird's eye view on roads less travelled by
  1.c Multi-locus analysis: epistasis analysis
2 Epistasis detection: a challenging task
  2.a Variable selection
  2.b Multifactor dimensionality reduction
  2.c Interpretation
3 Future challenges
K Van Steen 521
3 Introduction to Genetic Epidemiology Chapter 7: A World of Interactions. 1 Beyond main effects, 1.a Dealing with multiplicity. Multiple-testing explosion: ~500,000 SNPs span 80% of common variation in the genome (HapMap), and the number of tests grows rapidly once n-th order interactions are considered. K Van Steen 522
4 Ways to handle multiplicity. Recall that several strategies can be adopted, including:
- clever multiple-testing corrective procedures
- pre-screening strategies
- multi-stage designs
- adopting haplotype tests, or
- multi-locus tests
Which of these approaches is most powerful is still under heavy debate. Does the multiple testing problem become unmanageable when looking at multiple loci jointly? K Van Steen 523
5 1.b A bird's eye view on roads less travelled by. Multiple disease susceptibility loci (mdsl). Dichotomy between:
- improving single-marker strategies to pick up multiple signals at once (PBAT)
- testing groups of markers (FBAT multi-locus tests)
K Van Steen 524
6 PBAT screening for mdsl Little has been done in the context of family-based screening for epistasis First assess how a method is capable of detecting multiple DSL Simulation strategy (10,000 replicates): - Genetic data from Affymetrix SNPChip 10K array on 467 subjects from 167 families - Select 5 regions; 1 DSL in each region - Generate traits according to normal distribution, including up to 5 genetic contributions - For each replicate: generate heritability according to uniform distribution with mean h = 0.03 for all loci considered (or h = 0.05 for all loci) (Van Steen et al 2005) K Van Steen 525
7 General theory on FBAT testing. Test statistic:
- works for any phenotype and genetic model
- uses the covariance between offspring trait and genotype
Test distribution:
- computed assuming H0 is true; the random variable is the offspring genotype
- condition on parental genotypes when available; extend to family configurations (avoids specifying the allele distribution)
- condition on offspring phenotypes (avoids specifying the trait distribution)
(Horvath et al 1998, 2001; Laird et al 2000) K Van Steen 526
8 Screen: use between-family information [f(S,Y)]; calculate conditional power; select the top N SNPs on the basis of power. Test: use within-family information [f(X|S)] while computing the FBAT statistic; this step is independent of the screening step; adjust for N tests (not 500K!). (Van Steen et al 2005) (Lange and Laird 2006) K Van Steen 527
9 Power to detect genes with multiple DSL top : top 5 SNPs in the ranking bottom: top 10 SNPs in the ranking (Van Steen et al 2005) K Van Steen 528
10 Power to detect genes with multiple DSL top : Benjamini-Yekutieli FDR control at 5% (general dependencies) bottom: Benjamini-Hochberg FDR control at 5% (Van Steen et al 2005) K Van Steen 529
11 FBAT multi-locus tests FBAT-SNP-PC attains higher power in candidate genes with lower average pair-wise correlations and moderate to high allele frequencies with large gains (up to 80%). (Rakovski et al 2008) The new test has an overall performance very similar to that of FBAT-LC (FBAT-LC : Xin et al 2008) K Van Steen 530
12 In contrast: popular multi-locus approaches for unrelateds Parametric methods: - Regression - Logistic or (Bagged) logic regression Non-parametric methods: - Combinatorial Partitioning Method (CPM) quantitative phenotypes; interactions - Multifactor-Dimensionality Reduction (MDR) qualitative phenotypes; interactions - Machine learning and data mining The multiple testing problem becomes unmanageable when looking at (genetic) interaction effects? More about this in Chapter 9. K Van Steen 531
13 1.c Multi-locus analysis: epistasis analysis. Epistasis: what's in a name? Interaction is a kind of action that occurs as two or more objects have an effect upon one another. The idea of a two-way effect is essential in the concept of interaction, as opposed to a one-way causal effect. (Wikipedia) (slide: C Amos) K Van Steen 532
14 Epistasis: what's in a name? Distortions of Mendelian segregation ratios due to one gene masking the effects of another (William Bateson). Deviations from linearity in a statistical model (Ronald Fisher). See: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans (Cordell 2002) K Van Steen 533
15 Why is there epistasis? From an evolutionary biology perspective, for a phenotype to be buffered against the effects of mutations, it must have an underlying genetic architecture that is comprised of networks of genes that are redundant and robust. This creates dependencies among the genes in the network and is realized as epistasis. (slide: Y Chen, 2007) K Van Steen 534
16 Different types of interactions. Genotypic means at a single locus: qq = m - a, Qq = m + d, QQ = m + a (Fisher, Wright) (slide: C Amos) K Van Steen 535
17 Interpretation of epistasis. The study of epistasis poses problems of interpretability. Statistically, epistasis is usually defined in terms of deviation from a model of additive multiple effects, but this might be on either a linear or logarithmic scale, which implies different definitions. (Moore 2004)
- Despite the aforementioned concerns, there is evidence that a direct search for epistatic effects can pay dividends.
- Epistasis is expected to play an increasing role in future analyses.
K Van Steen 536
18 The frequency of epistasis. Not a new idea! (Bateson 1909) Complexity of gene regulation and biochemical networks (Gibson 1996; Templeton 2000). Single-gene results don't replicate (Hirschhorn et al. 2002). Gene-gene interactions are commonly found when properly investigated (Templeton 2000). Working hypothesis: single-gene studies don't replicate because gene-gene interactions are more important (Moore and Williams 2002) (Moore 2003) K Van Steen 537
19 Slow shift from main effects towards epistasis effects (Motsinger et al 2007) K Van Steen 538
20 Power of a gene-gene or gene-environment interaction analysis. There is a vast literature on power considerations:
- Most of this literature supports its claims by extensive simulation studies.
There is a need for user-friendly software tools that allow the user to perform hands-on power calculations. The main package targeting interaction analyses is QUANTO (v1.2.1):
- Available study designs for a disease (binary) outcome include the unmatched case-control, matched case-control, case-sibling, case-parent, and case-only designs. Study designs for a quantitative trait include independent individuals and case-parent designs.
- References: Gauderman (2000a), Gauderman (2000b), Gauderman (2003)
K Van Steen 539
21 A simple example of epistasis K Van Steen 540
22 A simple disease model. Penetrance = Pr(affected | genotype). One-locus dominant model: genotypes aa, Aa, AA with their affection status. K Van Steen 541
23 A slightly more complicated two-locus model: a 3x3 penetrance table with genotypes bb, Bb, BB at one locus and aa, Aa, AA at the other. Enumeration of two-locus models: although there are 2^9 = 512 possible models, because of symmetries in the data only 50 of these are unique. The enumeration allows only 0 and 1 as penetrance values ("fully penetrant" models). K Van Steen 542
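The count of 50 can be checked by brute force. The sketch below (Python, illustrative; it assumes that two fully penetrant models are identified when related by allele swaps at either locus, exchange of the two loci, or relabeling of affected/unaffected) enumerates all 512 binary penetrance tables and counts the equivalence classes:

```python
from itertools import product

def dihedral(m):
    """The 8 symmetries of a 3x3 table m: allele swaps at each locus and locus exchange."""
    out = []
    cur = m
    for _ in range(4):
        cur = tuple(zip(*cur[::-1]))               # rotate 90 degrees
        out.append(cur)
        out.append(tuple(r[::-1] for r in cur))    # ... plus a column flip
    return out

def canonical(m):
    """Smallest representative of m's orbit, also allowing case/control relabeling."""
    variants = []
    for d in dihedral(m):
        variants.append(d)
        variants.append(tuple(tuple(1 - x for x in row) for row in d))
    return min(variants)

tables = [tuple(tuple(bits[3 * i + j] for j in range(3)) for i in range(3))
          for bits in product((0, 1), repeat=9)]
orbits = {canonical(t) for t in tables}
nontrivial = [o for o in orbits if 0 < sum(map(sum, o)) < 9]
print(len(orbits), len(nontrivial))  # 51 orbits; dropping the constant (no-disease) table leaves 50
```

Under these symmetries the brute force yields 51 classes, 50 of which are genuine disease models (the remaining class is the constant all-0/all-1 table).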
24 Enumeration of two-locus models (Li and Reich 2000). Each model represents a group of equivalent models under permutations; the representative model is the one with the smallest model number. The six models studied in Neuman and Rice [67] (RR, RD, DD, T, Mod, XOR), as well as two single-locus (1L) models, the recessive (R) and the interference (I) model, are marked. K Van Steen 543
25 Different degrees of epistasis (slide: Motsinger) K Van Steen 544
26 Pure epistasis model for dichotomous traits. Suppose:
- p(A) = p(B) = p(a) = p(b) = 0.5
- HWE (hence p(aa) = 0.5^2 = 0.25 and p(Aa) = 2 x 0.5 x 0.5 = 0.5) and no LD
- penetrances P(affected | genotype) are given by a 3x3 table over the two-locus genotypes (bb, Bb, BB by aa, Aa, AA)
Then make repeated use of Bayes' rule to retrieve the genotype distributions in cases and controls. K Van Steen 545
27 Pure epistasis model for dichotomous traits. Then the marginal genotype distributions P(genotypes | affected) and P(genotypes | unaffected) are the same for cases and controls, and hence one-locus approaches will be powerless! For example, P(aa,BB | D) = P(D | aa,BB) P(aa,BB) / P(D) = 1/4 = 0.25. K Van Steen 546
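A quick numerical check (Python sketch; the checkerboard penetrance table is one illustrative purely epistatic model, not necessarily the one on the original slide) applies Bayes' rule cell by cell and confirms that the one-locus genotype distributions are identical in cases and controls:

```python
from itertools import product

hwe = {0: 0.25, 1: 0.5, 2: 0.25}   # P(genotype), coded as minor-allele count, MAF = 0.5

def penetrance(g1, g2, t=0.5):
    """Checkerboard (XOR-like) purely epistatic model: risk only if g1 + g2 is even."""
    return t if (g1 + g2) % 2 == 0 else 0.0

prevalence = sum(hwe[g1] * hwe[g2] * penetrance(g1, g2)
                 for g1, g2 in product(hwe, hwe))   # = 0.25

def marginal(locus, affected):
    """One-locus genotype distribution among cases (or controls), via Bayes' rule."""
    denom = prevalence if affected else 1 - prevalence
    dist = {}
    for g in hwe:
        num = sum(hwe[g1] * hwe[g2] *
                  (penetrance(g1, g2) if affected else 1 - penetrance(g1, g2))
                  for g1, g2 in product(hwe, hwe)
                  if (g1, g2)[locus] == g)
        dist[g] = num / denom
    return dist

for locus in (0, 1):
    print(marginal(locus, True), marginal(locus, False))
# Both marginals equal the population HWE distribution, so single-locus tests see nothing.
```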
28 Purely epistatic 3-locus disease model for quantitative traits. Assume all allele frequencies are 0.5; heritability is 55% and prevalence is 6.25%. [Penetrance tables for loci L1 x L2 at each level of L3] (Culverhouse et al 2002) K Van Steen 547
29 Expected genotype patterns for the 3-locus model. [Table with columns L1, L2, L3, p(g), E[#affected], E[#unaffected], plus an "Other" row and column sums] (Culverhouse et al 2002) (slide: J Ott 2004) K Van Steen 548
30 2 Epistasis detection: a challenging task. Main challenges: variable selection; modeling; interpretation (making inferences about biological epistasis from statistical epistasis). (slide: Y Chen 2007) K Van Steen 549
31 2.a Variable selection: introduction. The aim is to make clever selections of marker combinations to examine in an epistasis analysis. This may not only aid the interpretation of analysis results, but also reduce the burden of multiple testing and the computational burden. K Van Steen 550
32 Variable selection and multiple testing. Multiple testing is a thorny issue, the bane of statistical genetics.
- The problem is not really the number of tests that are carried out: even if a researcher only tests one SNP for one phenotype, if many other researchers do the same and the nominally significant associations are reported, there will be a problem of false positives. (Balding 2006)
Example:
- Given 3 disease SNPs (e.g., the Culverhouse 3-locus model above), making inferences is not at all an easy task (chi-square with 26 df).
- With 50,000 SNPs there are roughly 2 x 10^13 subsets of size 3, and the Bonferroni-corrected per-test threshold becomes vanishingly small.
- A more manageable approach is to test all possible pairs of loci for interaction effects that differ between cases and controls (Hoh and Ott 2003).
K Van Steen 551
33 Variable selection and multiple testing Pre-screening for subsequent testing: - Independent screening and testing step (PBAT screening; Van Steen et al 2005) - Dependent screening and testing step K Van Steen 552
34 Methods to correct for multiple testing Family-wise error rates (FWER) - In the presence of too many SNPs, the Bonferroni threshold will be extremely low: Bonferroni adjustments are conservative when statistical tests are not independent / Bonferroni adjustments control the error rate associated with the omnibus null hypothesis / The interpretation of a finding depends on how many statistical tests were performed Permutation data sets - It is particularly handy for rare genotypes, small studies, non-normal phenotypes, and tightly linked markers - In case-control data this is relatively straightforward / In family data this is not at all an easy task K Van Steen 553
35 Methods to correct for multiple testing False discovery rate (FDR) - With too many SNPs it starts to break down and the power over Bonferroni is minimal (e.g. see Van Steen et al 2005) False-positive report probability (FPRP) - It is the probability of no true association between a genetic variant and disease given a statistically significant finding, depends not only on the observed p-value but also on both the prior probability that the association between the genetic variant and the disease is real and the statistical power of the test (Wacholder et al 2004) - In general, Bayesian approaches do not yet have a big role in genetic association analyses, possibly because of computational burden? - Not yet well documented / What are the priors? (Balding 2006; Lucke 2008) K Van Steen 554
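As a concrete illustration of FDR control, here is a minimal Benjamini-Hochberg step-up procedure in Python (the p-values are invented for the example):

```python
def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: reject the k smallest p-values, where k is the largest rank r
    such that p_(r) <= q * r / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])   # indices of rejected hypotheses

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))                                   # BH rejects two hypotheses
print([i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)])  # Bonferroni rejects only one
```

Under arbitrary dependence, the Benjamini-Yekutieli variant mentioned earlier replaces q by q / (1 + 1/2 + ... + 1/m), which is more conservative.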
36 Variable selection and computation time. When SNPs do not have independent effects, however, it is impossible for most current computer technologies to analyze the resulting astronomical number of possible combinations. For instance, if SNPs have been measured at a density of 1 SNP every 10 kilobases (kb), and if 10 statistical evaluations can be computed each second, then evaluation of each individual SNP would require about 30,000 seconds (i.e., 8.3 hours) of computer time. Exhaustive evaluation of all pairwise combinations of SNPs would require 1286 years. Although it might be possible for a large supercomputer to complete these computations in a reasonable amount of time, an exhaustive search of all combinations of 3 or 4 SNPs would not be possible even if every computer in the world were simultaneously working on the problem. (Moore and Ritchie 2004) K Van Steen 555
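The scaling argument is easy to reproduce. The sketch below uses hypothetical figures (300,000 SNPs, i.e. one per 10 kb over a roughly 3-Gb genome, and an assumed rate of 10 evaluations per second; the slide's exact counts were lost in transcription):

```python
from math import comb

n, rate = 300_000, 10          # hypothetical: number of SNPs, evaluations per second
for k in (1, 2, 3):
    tests = comb(n, k)         # number of k-SNP subsets to evaluate
    seconds = tests / rate
    print(f"{k}-SNP subsets: {tests:.3e} tests, {seconds / 3600:.3g} hours "
          f"({seconds / (3600 * 24 * 365):.3g} years)")
```

Even the pairwise scan runs to years at this rate, and each extra order of interaction multiplies the count by roughly another factor of n.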
37 2.b Modeling Failure of traditional methods A large number of SNPs are genotyped - multiple comparisons problem, very small p-values required for significance, which is even compounded in gene-environment interaction analyses. Genetic loci may interact (epistasis) in their influence on the phenotype - loci with small marginal effects may go undetected - interested in the interaction itself Curse of dimensionality and sparse cells K Van Steen 556
38 Curse of dimensionality and sparse cells. For 2 SNPs, there are 3^2 = 9 possible two-locus genotype combinations. If the alleles are rare (MAF 10%), some cells will be empty. (slide: C Amos) K Van Steen 557
39 Curse of dimensionality and sparse cells. For 4 SNPs, there are 3^4 = 81 possible combinations, with even more cells possibly empty. (slide: C Amos) K Van Steen 558
40 Modeling, strategy 1: the set association approach. At each SNP, compute an association statistic T. Build sums over the 1, 2, 3, etc. highest values t. Evaluate the significance of a given sum by a permutation test. The sum with the smallest p-value points to the markers to select; this smallest p is itself a single statistic, whose significance level is in turn determined by permutation. The approach is applicable to many SNPs and has also been used in microarray settings. (Hoh et al 2001) K Van Steen 559
41 Strategy 1: Set association approach (Hoh et al 2001) K Van Steen 560
42 Modeling, strategy 2: multi-locus approaches. Case-control studies far too often do not take into account the multi-locus nature of complex traits. When the aim is to analyze multiple SNPs or genes jointly, two classes of approaches emerge:
- Combine (properties of) single-locus statistics over multiple SNPs to obtain a new multivariate test statistic; depending on whether the SNPs are in high LD or not, different measures need to be taken.
- Look for patterns of genotypes at SNPs in different genomic locations.
K Van Steen 561
43 Two frameworks for multi-locus approaches (Onkamo and Toivonen 2006) Parametric methods: - Regression - Logistic or (Bagged) logic regression Non-parametric methods: Tree-based methods: - Recursive Partitioning (Helix Tree) - Random Forests (R, CART) Pattern recognition methods: - Mining association rules - Neural networks (NN) - Support vector machines (SVM) Data reduction methods: - DICE (Detection of Informative Combined Effects) - MDR (Multifactor Dimensionality Reduction) K Van Steen 562
44 Non-parametric chi-square. The question is how to test for epistatic effects above and beyond the (independent) main effects of the single-locus genotypes:
- Use the usual chi-square for interactions independent of main effects; isolate individual df's.
- Assess the difference in interactions between cases and controls, since interactions may then be more indicative of underlying pathways.
For the 3x3 table of Locus 1 (AA, Aa, aa) by Locus 2 (BB, Bb, bb): main effect locus 1, 2 df; main effect locus 2, 2 df; interactions, 4 df; total, 8 df. K Van Steen 563
45 Partitioning chi-squares for one locus: the 2-df genotype chi-square is split into two 1-df components. Example: simple disease model, population frequency K = 0.10, N = 100 cases and 100 controls. [Table of predicted numbers of cases and controls per genotype class and the resulting odds ratios, OR] K Van Steen 564
46 Partitioning chi-squares for two loci. The 3x3 table of genotypes (4 df) may be partitioned into 4 independent components, each with 1 df; do such a partitioning for cases and controls separately (Agresti 2002). The four 2x2 subtables are:
- (AA vs Aa) x (BB vs Bb)
- (AA vs Aa) x (BB+Bb vs bb)
- (AA+Aa vs aa) x (BB vs Bb)
- (AA+Aa vs aa) x (BB+Bb vs bb)
K Van Steen 565
47 Partitioning chi-squares for two loci Compare each of the four 2 by 2 subtables between cases and controls to see whether their odds ratios are the same K Van Steen 566
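A sketch of this comparison in Python (the counts are invented for illustration): collapse a 3x3 genotype count table into the four independent 2x2 subtables and compare their odds ratios between cases and controls.

```python
def subtables(t):
    """Agresti-style partition of a 3x3 count table (rows AA, Aa, aa; cols BB, Bb, bb)
    into four independent 2x2 components."""
    a, b, c = t
    return [
        [[a[0], a[1]], [b[0], b[1]]],                            # (AA vs Aa) x (BB vs Bb)
        [[a[0] + a[1], a[2]], [b[0] + b[1], b[2]]],              # (AA vs Aa) x (BB+Bb vs bb)
        [[a[0] + b[0], a[1] + b[1]], [c[0], c[1]]],              # (AA+Aa vs aa) x (BB vs Bb)
        [[a[0] + a[1] + b[0] + b[1], a[2] + b[2]],
         [c[0] + c[1], c[2]]],                                   # (AA+Aa vs aa) x (BB+Bb vs bb)
    ]

def odds_ratio(t22):
    (w, x), (y, z) = t22
    return (w * z) / (x * y)

cases    = [[30, 20, 10], [20, 40, 20], [10, 20, 30]]   # invented 3x3 genotype counts
controls = [[25, 25, 12], [25, 30, 25], [12, 25, 21]]

for sc, st in zip(subtables(cases), subtables(controls)):
    print(f"OR cases {odds_ratio(sc):.2f} vs controls {odds_ratio(st):.2f}")
```

Each printed pair corresponds to one of the four 1-df components; a formal test would compare the two odds ratios with, e.g., a Breslow-Day-type statistic.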
48 Logistic regression LR is a derivative of linear regression that fits a function to continuous or discrete independent variables based on a dichotomous dependent variable (Hosmer and Lemeshow, 2000). One of the most common procedures for variable selection in a LR analysis is step-wise logistic regression (step LR) [Hosmer and Lemeshow, 2000]. - In the step-wise procedure, each variable is tested for independent effects, and those variables with significant effects are included in the model. - In a second step, interaction terms of those variables with significant main effects are included, and significant effects are included in the model. (Motsinger-Reif et al 2008) K Van Steen 567
49 Logistic regression LR is a de facto standard for traditional association studies. Using independent variables to predict a dichotomous dependent variable, LR by definition lacks the ability to characterize purely interactive effects. Only variables that contain an independent main effect will be included in the final model. To properly evaluate non-linear purely interactive effects, combinations of variables must be encoded as a single variable for inclusion in the analysis. Such an encoding scheme can be computationally expensive, depending on the number of variables used. (Motsinger-Reif et al 2008) K Van Steen 568
50 Strategy 2: look for patterns of genotypes using unrelated individuals. CPM = combinatorial partitioning method (Charlie Sing, U Michigan); applicable to a small number (~50) of SNPs only. MDR = multifactor-dimensionality reduction method (Jason Moore, Vanderbilt U). LAD = logical analysis of data (P. Hammer, Rutgers U). Mining association rules, Apriori algorithm (R. Agrawal). Special approaches for microarray data. (Hoh and Ott 2003) K Van Steen 569
51 The MDR algorithm What is MDR? A data mining approach to identify interactions among discrete variables that influence a binary outcome A nonparametric alternative to traditional statistical methods such as logistic regression Driven by the need to improve the power to detect gene-gene interactions (slide: L Mustavich) K Van Steen 570
52 The 6 steps of MDR K Van Steen 571
53 MDR Step 1 Divide data (genotypes, discrete environmental factors, and affectation status) into 10 distinct subsets K Van Steen 572
54 MDR Step 2 Select a set of n genetic or environmental factors (which are suspected of epistasis together) from the set of all variables in the training set K Van Steen 573
55 MDR Step 3 Create a contingency table for these multi-locus genotypes, counting the number of affected and unaffected individuals with each multi-locus genotype K Van Steen 574
56 MDR Step 4. Calculate the ratio of cases to controls for each multi-locus genotype. Label each multi-locus genotype as high-risk or low-risk, depending on whether the case-control ratio is above a certain threshold. This is the dimensionality-reduction step: it reduces the n-dimensional space to 1 dimension with 2 levels. K Van Steen 575
57 MDR Step 5. To evaluate the model developed in Step 4, use the labels to classify individuals as cases or controls, and calculate the misclassification error. In fact, balanced accuracy is used (the arithmetic mean of sensitivity and specificity), which is mathematically equivalent to classification accuracy when the data are balanced. K Van Steen 576
58 Repeat Steps 2 to 5 All possible combinations of n factors are evaluated sequentially for their ability to classify affected and unaffected individuals in the training data, and the best n-factor model is selected in terms of minimal misclassification error K Van Steen 577
59 MDR Step 6 The independent test data from the cross-validation are used to estimate the prediction error (testing accuracy) of the best model selected K Van Steen 578
60 Towards MDR Final. Steps 1 through 6 are repeated for each possible cross-validation interval. The best model across all 10 training and testing sets is selected on the basis of the criterion:
- Maximize the cross-validation consistency = the number of times a particular model was the best model across the cross-validation subsets.
The end of a cross-validation procedure also allows computing the
- average training accuracy
- average testing accuracy
of the best models over all cross-validation sets, and possibly over multiple runs (with different seeds, to reduce the chance of observing spurious results due to chance divisions of the data). K Van Steen 579
61 MDR final. The entire process is repeated for each k = 1 to N loci combination that is computationally feasible, and an optimal k-locus model is chosen for each level of k considered. The final model is based on two criteria:
- maximizing the (average) prediction accuracy
- maximizing the (average) cross-validation consistency
Statistical significance is obtained by comparing the average cross-validation consistency from the observed data to the distribution of average consistencies under the null hypothesis of no association, derived empirically from 1000 permutations. (Ritchie et al 2001, Ritchie et al 2003, Hahn et al 2003) K Van Steen 580
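Stripped of the cross-validation machinery, the core of the algorithm fits in a few lines. The Python sketch below uses toy, fully deterministic data (one individual per genotype combination of 4 hypothetical SNPs, with status a fully penetrant XOR-like function of SNP0 and SNP1) and performs the dimensionality reduction for every SNP pair:

```python
from itertools import combinations, product

# Toy data: 4 SNPs, one individual per genotype combination (3^4 = 81 in total).
# Status is an XOR-style function of SNP0 and SNP1 only, so the functional pair
# shows no marginal effect yet classifies perfectly when taken jointly.
genotypes = list(product(range(3), repeat=4))
status = [1 if (g[0] + g[1]) % 2 == 0 else 0 for g in genotypes]
n_cases = sum(status)
n_controls = len(status) - n_cases
threshold = n_cases / n_controls          # MDR's usual high-risk threshold

def balanced_accuracy(pair):
    """Core MDR step: pool multi-locus cells into high/low risk, score the 1-D rule."""
    counts = {}                           # cell -> [case count, control count]
    for g, s in zip(genotypes, status):
        cell = (g[pair[0]], g[pair[1]])
        counts.setdefault(cell, [0, 0])[1 - s] += 1
    high = {cell for cell, (ca, co) in counts.items()
            if (ca / co if co else float("inf")) >= threshold}
    tp = sum(s for g, s in zip(genotypes, status)
             if (g[pair[0]], g[pair[1]]) in high)
    tn = sum(1 - s for g, s in zip(genotypes, status)
             if (g[pair[0]], g[pair[1]]) not in high)
    return (tp / n_cases + tn / n_controls) / 2

best = max(combinations(range(4), 2), key=balanced_accuracy)
print(best, balanced_accuracy(best))      # (0, 1) 1.0 -- the functional pair wins
```

The real procedure wraps this inner loop in 10-fold cross-validation, compares the best models across k, and assesses significance by permutation, as described above.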
62 Several measures of fitness to compare models: balanced accuracy. Balanced accuracy (BA) weighs the classification accuracy of the two classes equally and is thought to be more powerful than accuracy alone when the data are imbalanced, i.e. when the counts of cases and controls are not equal (Velez et al 2007).
- BA is calculated from the 2x2 table relating predicted to true status as (sensitivity + specificity)/2, with cells: model case / real case = TP, model case / real control = FP, model control / real case = FN, model control / real control = TN.
When #cases = #controls, then TP + FN = FP + TN and BA = (TP + TN)/(2 x #cases) = (TP + TN)/(total sample size). K Van Steen 581
63 Several measures of fitness to compare models Model-adjusted balanced accuracy Model-adjusted balanced accuracy uses in addition a different threshold in the MDR modeling, one that is based on the actual counts of case and control samples in the data. - When individuals have missing data, it accounts for the precise number of individuals with complete data for that particular multi-locus combination - This makes MDR robust to class imbalances (Velez et al 2007) K Van Steen 582
64 Hypothesis test of the best model. Evaluate the magnitude of the cross-validation consistency and prediction error estimates by adopting a permutation strategy. In particular:
- Randomize the disease labels
- Repeat the MDR analysis many times (1000?) to get distributions of cross-validation consistencies and prediction errors
- Use these distributions to derive the p-values for the actual cross-validation consistencies and prediction errors
K Van Steen 583
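The permutation strategy itself is generic. A minimal sketch (Python; the statistic and data are invented for illustration):

```python
import random

def permutation_pvalue(statistic, values, labels, n_perm=1000, seed=1):
    """Empirical p-value: the share of label shufflings whose statistic is at least
    as extreme as the observed one (with +1 in both terms so p is never exactly 0)."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def mean_difference(values, labels):
    """Toy statistic: mean among 'cases' (label 1) minus mean among 'controls' (label 0)."""
    cases = [v for v, l in zip(values, labels) if l == 1]
    controls = [v for v, l in zip(values, labels) if l == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

values = list(range(20))                  # a strong, noise-free group difference
labels = [0] * 10 + [1] * 10
print(permutation_pvalue(mean_difference, values, labels))  # very small empirical p-value
```

In the MDR setting, `statistic` would be the average cross-validation consistency or testing balanced accuracy of the best model refit on each shuffled dataset.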
65 An example empirical distribution. [Histogram of the permutation distribution, with sample quantiles reported at 0%, 25%, 50%, 75% and 100%] K Van Steen 584
66 The probability that we would see results as or more extreme than the observed value simply by chance is between 5% and 10%. (slide: L Mustavich) The MDR software. The MDR method is described in further detail by Ritchie et al. (2001) and reviewed by Moore and Williams (2002). An MDR software package is available from the authors by request and is described in detail by Hahn et al. (2003); more information can also be found online. K Van Steen 585
67 Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions (Hahn, Ritchie and Moore). Required operating system software:
- Linux (Fedora Core 3): Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_06-b03); Java HotSpot(TM) Client VM (build 1.4.2_06-b03, mixed mode)
- Windows (XP Professional and XP Home): Java(TM) 2 Runtime Environment, Standard Edition (build v1.4.2_05)
Minimum system requirements: 1 GHz processor, 256 MB RAM, 800x600 screen resolution. K Van Steen 586
68 K Van Steen 587
69 Application to simulated data. To show MDR in action, we simulated 200 cases and 200 controls using different multi-locus epistasis models (Evans 2006):
- Scenario 1: 10 SNPs, adapted epistasis model M170
- Scenario 2: 10 SNPs, epistasis model M27, minor allele frequencies of the disease susceptibility pair 0.25
[Penetrance tables of models M170 and M27] All markers were assumed to be in HWE; no LD between the markers. K Van Steen 588
70 Application to simulated data. [Marginal genotype distributions for the controls and for the cases, under models M170 and M27] K Van Steen 589
71 Data format The definition of the format is as follows: - All fields are tab-delimited. - The first line contains a header row. This row assigns a label to each column of data. Labels should not contain whitespace. - Each following line contains a data row. Data values may be any string value which does not contain whitespace. - The right-most column of data is the class, or status, column. The data values for this column must be 1, to represent Affected or Case status, or 0, to represent Unaffected or Control status. No other values are allowed. K Van Steen 590
72 Easy data conversion (R):

M170data <- rbind(M170.cases, M170.controls)       # 400 x 20 matrix: two allele columns per SNP
M170ccdata <- matrix(NA, nrow = ss, ncol = nsnps)  # ss = sample size, nsnps = number of SNPs
for (i in 1:nsnps) {
  # genotype code 0/1/2 = (allele1 + allele2) - 2, from the two allele columns of SNP i
  M170ccdata[, i] <- apply(M170data[, c(2*i - 1, 2*i)], 1, sum) - 2
}
# append the class column: 1 = case (first 200 rows), 0 = control
M170ccdata <- cbind(M170ccdata, c(rep(1, 200), rep(0, 200)))
write.table(M170ccdata, "M170ccdata.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE)

K Van Steen 591
73 M170 case-control data: tab-delimited columns SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 Class. [Data rows] K Van Steen 592
74 Loading a data file (MDR 2.0 beta 3) K Van Steen 593
75 Configuring the analysis K Van Steen 594
76 Reducing the number of cross-validations CV=10 CV=3 (Motsinger and Ritchie 2006) K Van Steen 595
77 Reducing the number of cross-validations (CVs). In general, CV is a useful approach for limiting false positives by assessing the generalizability of models (Coffey et al 2004). The number of CV intervals in an MB-MDR analysis can be reduced from 10 to 5, but not to 3. CV seems to be rather important in the MDR algorithm:
- Motsinger and Ritchie (2003) showed that, without CV, selection of a final model is difficult, but it is encouraging that the false-positive results almost always include at least one correct functional locus.
- This indicates that perhaps, in the case of extremely large datasets such as genome-wide scans, where using any type of CV would be computationally infeasible, MDR could still be used (without CV) to identify at least one functional locus.
K Van Steen 596
78 Search method configuration K Van Steen 597
79 Running the MDR analysis K Van Steen 598
80 Summary of results K Van Steen 599
81 Best MDR model K Van Steen 600
82 MDR best model K Van Steen 601
83 Values calculated by MDR:
- Balanced Accuracy = (Sensitivity + Specificity)/2; the fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class.
- Accuracy = (TP + TN)/(TP + TN + FP + FN); proportion of instances correctly classified (skewed in favor of the larger class).
- Sensitivity = TP/(TP + FN); how likely a true case is classified as positive.
- Specificity = TN/(TN + FP); how likely a true control is classified as negative.
- Odds Ratio = (TP x TN)/(FP x FN); compares whether the probability of a certain event is the same for two groups.
K Van Steen 602
84 Values calculated by MDR (continued):
- Precision = TP/(TP + FP); the proportion of positive classifications that are correct.
- Kappa = 2(TP x TN - FP x FN)/[(TP + FN)(FN + TN) + (TP + FP)(FP + TN)]; a function of total accuracy and random accuracy.
- X²: chi-squared score for the attribute constructed by MDR from this attribute combination.
- F-Measure = 2TP/(2TP + FP + FN); a function of sensitivity and precision.
TP: true positive; TN: true negative; FP: false positive; FN: false negative. K Van Steen 603
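These formulas are easy to sanity-check in code (Python sketch; kappa is written in its usual Cohen form, and the confusion-matrix counts are invented):

```python
def mdr_measures(tp, tn, fp, fn):
    """The fitness measures above, computed from one confusion matrix."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "balanced_accuracy": (sens + spec) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "odds_ratio": (tp * tn) / (fp * fn),
        "precision": tp / (tp + fp),
        "kappa": 2 * (tp * tn - fp * fn) /
                 ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

m = mdr_measures(tp=40, tn=30, fp=20, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```

For the example counts (n = 100), observed agreement is 0.7 and chance agreement is 0.5, so kappa = (0.7 - 0.5)/(1 - 0.5) = 0.4, which the closed form above reproduces.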
85 MDR CV results. [Cross-validation results table with the reported average] K Van Steen 604
86 MDR best model Graphical display on whole data If-then rules on whole data K Van Steen 605
87 The fitness landscape. Gives the fitness landscape across all models as a line chart (the default).
- The models produced are on the x-axis of the chart, in the order in which they were generated (e.g., 1, 2, 3, ..., 12, 13, 14, ...).
- Training accuracy is shown on the y-axis.
K Van Steen 606
88 The fitness landscape. [Line chart values: training accuracy for each single-SNP model (SNP1 ... SNP10) followed by each two-SNP model (SNP1,SNP2; SNP1,SNP3; ...)] K Van Steen 607
89 Locus Dendrogram. The dendrogram provides a graphical representation of the interactions between attributes (and the strength of those interactions) from the MDR analysis, up to the maximum number of interactions asked for. The purpose of the interaction dendrogram is to assist the user in determining the nature of the interactions (redundant, additive, or synergistic). K Van Steen 608
90 Locus Dendrogram The dendrogram is constructed using hierarchical cluster analysis with average-linking. The distance matrix used by the cluster analysis is constructed by calculating the information gained by constructing two attributes using the MDR function (Moore et al 2006, Jakulin and Bratko 2003, Jakulin et al 2003) K Van Steen 609
91 Raw entropy values. Entropy is a measure of randomness or disorder within a system; the higher the entropy, the more probable the state the system is in. A classic example of this principle is a melting glass of ice, whose entropy increases as its state becomes less ordered. A graphical illustration of the relationships between information-theoretic measures on the joint distribution of attributes A and B: the surface area of a section corresponds to the labeled quantity (Jakulin 2003). [I(A;B) = mutual information = the amount of information provided by A about B = information gain; H(A) = entropy of A] K Van Steen 610
92 Raw entropy values. Let us assume an attribute A with observed probability distribution P_A(a). Shannon's entropy, measured in bits, is a measure of the predictability of an attribute and is defined as H(A) = - Σ_a P_A(a) log2 P_A(a). Phrased differently, the higher the entropy, the less reliable our predictions about A: we can understand H(A) as the amount of uncertainty about A, as estimated from its probability distribution. K Van Steen 611
93 Raw entropy values. Single-attribute values:
- H(A): the entropy of the given attribute A
- H(A|C): the entropy of A given the class C
- I(A;C): the information gain of A about the class C
Pairwise values:
- H(AB): the entropy of the constructed attribute AB
- H(AB|C): the entropy of AB given the class C
- I(A;B): the mutual information between attributes A and B
- I(A;B;C): the interaction information of attributes A and B with the class C
- I(AB;C): the information gain of the constructed attribute AB about the class C
K Van Steen 612
94 Raw entropy values. Mutual information I(A;B) as a function of r² (as a measure of LD between markers), for a subset of the Spanish Bladder Cancer data (SBCS); unpublished results. K Van Steen 613
95 K Van Steen 614
96 Locus dendrogram The colors range from red, representing a high degree of synergy (positive information gain), through orange, a lesser degree, to gold, the midway point between synergy and redundancy. On the redundancy end of the spectrum, the highest degree is represented by blue (negative information gain), with a lesser degree represented by green. Synergy: the interaction between two attributes provides more information than the sum of the individual attributes. Redundancy: the interaction between attributes provides redundant information. K Van Steen 615
97 Positive and negative interactions Say I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C). Assume that we are uncertain about the value of C, but we have information about A and B. - Knowledge of A alone eliminates I(A;C) bits of uncertainty from C. - Knowledge of B alone eliminates I(B;C) bits of uncertainty from C. - However, the joint knowledge of A and B eliminates I(A,B;C) bits of uncertainty. Hence, if the interaction information is positive, we benefit from an unexpected synergy. If the interaction information is negative, we suffer diminishing marginal returns by introducing attributes that partly contribute redundant information. K Van Steen 616
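The decomposition above can be checked numerically. The sketch below (my own illustration, with hypothetical helper names) estimates each term from joint counts; on an XOR pattern, where neither attribute alone predicts C, the interaction information is maximally positive:

```python
import math
from collections import Counter

def H(*columns):
    """Joint Shannon entropy (bits) of one or more attributes given as parallel lists."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * math.log2(c / n) for c in Counter(joint).values())

def I2(x, y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(x) + H(y) - H(x, y)

def interaction_information(a, b, c):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C); positive = synergy, negative = redundancy."""
    i_ab_c = H(a, b) + H(c) - H(a, b, c)   # I(A,B;C), treating (A,B) as one attribute
    return i_ab_c - I2(a, c) - I2(b, c)

# XOR: neither A nor B alone informs about C, but jointly they determine it
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]
print(interaction_information(a, b, c))  # 1.0 (pure synergy)
```

Flipping the example so that C simply copies A would instead drive the interaction information negative (redundancy).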
98 Significance of the results We simulated data from a two-locus epistasis model; the remaining SNPs were generated at random. What, then, does it mean that SNP5 was chosen as the best single-effects model? Answer: every k-locus setting will give rise to a best model; MDR forces an optimal model for every k-locus setting. K Van Steen 617
99 Significance of results The best model among all 1-3 locus models is the one with maximal cross-validation consistency and maximum average balanced prediction accuracy But how significant is this result? K Van Steen 618
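One standard way to answer this (a sketch of the general permutation idea, not of the actual MDR implementation) is to shuffle the case/control labels, rerun the model search on each shuffled dataset, and compare the observed best score with the resulting null distribution. The stand-in search below considers single attributes only and scores them by training balanced accuracy; all names are mine:

```python
import random

def balanced_accuracy(pred, labels):
    """Mean of sensitivity and specificity of the high-/low-risk classification."""
    tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 0)
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    return 0.5 * (tp / n_case + tn / n_ctrl)

def best_model_score(attributes, labels):
    """Toy stand-in for the MDR search: over all single attributes, call a
    genotype cell high-risk when cases outnumber controls in it, and return
    the best balanced accuracy achieved."""
    best = 0.0
    for col in attributes:
        cells = {}
        for g, y in zip(col, labels):
            cases, ctrls = cells.get(g, (0, 0))
            cells[g] = (cases + y, ctrls + (1 - y))
        high = {g for g, (ca, co) in cells.items() if ca > co}
        pred = [1 if g in high else 0 for g in col]
        best = max(best, balanced_accuracy(pred, labels))
    return best

def _shuffled(labels, rng):
    out = labels[:]
    rng.shuffle(out)
    return out

def permutation_pvalue(attributes, labels, n_perm=200, seed=0):
    """Build the null distribution of the best score by shuffling labels;
    the p-value is the fraction of permuted best scores >= the observed one."""
    rng = random.Random(seed)
    observed = best_model_score(attributes, labels)
    exceed = sum(1 for _ in range(n_perm)
                 if best_model_score(attributes, _shuffled(labels, rng)) >= observed)
    return observed, (exceed + 1) / (n_perm + 1)

labels = [0] * 20 + [1] * 20
attributes = [labels[:], [0] * 40]          # one perfect SNP plus one uninformative SNP
observed, pval = permutation_pvalue(attributes, labels)
print(observed, pval)
```

Because the best model is re-selected inside every permutation, this procedure automatically accounts for the model-search multiplicity, which is why no further multiple-testing correction is needed.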
100 Configuring the permutation analysis (MDR PT Module alpha) K Van Steen 619
101 Performing the MDR permutation test K Van Steen 620
102 Performing the MDR permutation test K Van Steen 621
103 Performing the MDR permutation test
Model           Testing BA (p-value)   CVC (p-value)
SNP5            (0.0540)               (0.2160)
SNP1-SNP2       (<0.0010)              (0.2160)
SNP1-SNP2-SNP   (<0.0010)              (0.2160)
BA and CVC obtained from the MDR summary table; p-values obtained from the MDR Permutation Testing p-value calculator K Van Steen 622
104 Performing the MDR permutation test
Permutation null distribution for the best k=1-3 models:
Model           Testing BA (p-value)   CVC (p-value)
SNP5            (0.0540)               10 (0.2160)
SNP1-SNP2       (<0.0010)              10 (0.2160)
SNP1-SNP2-SNP   (<0.0010)              10 (0.2160)
Permutation null distribution for the best k-locus model (hence 3 distributions):
Model           Testing BA (p-value)   CVC (p-value)
SNP5            ( )                    10 (0.1720)
SNP1-SNP2       ( )                    10 (0.0570)
SNP1-SNP2-SNP   ( )                    10 (0.0440)
K Van Steen 623
105 What is going on?
Permutation null distribution for the best k-locus model (hence 3 distributions):
Model           Testing BA (p-value)   CVC (p-value)
SNP5            ( )                    10 (0.1720)
SNP1-SNP2       ( )                    10 (0.0570)
SNP1-SNP2-SNP   ( )                    10 (0.0440)
Is the effect of a strong main effect carried through into the higher-order interactions? What will happen for data simulated under M27 (with main effects by simulation)? K Van Steen 624
106 Results for M27 K Van Steen 625
107 Results for M27
Permutation null distribution for the best k=1-3 models:
Model           Testing BA (p-value)   CVC (p-value)
SNP1-SNP        (<0.0010)              10 (0.2310)
SNP1-SNP2-SNP   (<0.0010)              5 (0.9110)
What about SNP2? Why is this not highlighted as an important main effect? Maximizing CVC first and then looking at prediction accuracy highlights SNP1-SNP2. Maximizing prediction accuracy alone would point towards SNP1-SNP2-SNP4. K Van Steen 626
108 Results for M27 Using permutation null distributions per k-locus setting, the following results are obtained:
Permutation null distribution for the best k-locus model (hence 3 distributions):
Model           Testing BA (p-value)   CVC (p-value)
SNP1            (<0.0010)              10 (0.1790)
SNP1-SNP2       (<0.0010)              10 (0.0620)
SNP1-SNP2-SNP   (<0.0010)              5 (0.9110)
Wouldn't it be natural to correct for SNP1 when looking for interactions? What if more than one main effect is present in the data? K Van Steen 627
109 Strengths of MDR Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multi-locus data Non-parametric: no parameters are estimated Assumes no particular genetic model Minimal false-positive rates K Van Steen 628
110 Weaknesses of MDR Computationally intensive (especially with >10 loci) - The original MDR software supports disease models with up to 15 factors at a time, from a list of up to 500 total factors, and a maximum sample size of 4,000 subjects. - Parallel MDR (Bush et al 2006) is a redesign of the initial MDR algorithm that allows an unlimited number of study subjects, total variables and variable states, and removes restrictions on the order of interactions being analyzed. The algorithm yields an approximately 150-fold decrease in runtime for equivalent analyses. The curse of dimensionality: decreased predictive ability with high dimensionality and small samples, due to cells with no data K Van Steen 629
111 Several (other) extensions to the MDR paradigm (CV based) (Lou et al 2008) K Van Steen 630
112 Different measure to score model quality One crucial component of the MDR algorithm measures the percentage of cases and controls incorrectly labelled by the proposed classification the classification error. - The combination of variables that produces the lowest classification error is selected as the best or most fit model. The correctly and incorrectly labelled cases and controls can be expressed as a two-way contingency table. The ability of MDR to detect gene-gene interactions can be improved by replacing classification error with a different measure to score model quality. - Of 10 measures evaluated, Bush et al (2008) found that the likelihood ratio and normalized mutual information (NMI) are measures that consistently improve the detection and power of MDR in simulated data over using classification error. These measures also reduce the inclusion of spurious variables in a multi-locus model. K Van Steen 631
113 Contingency table measures of classification performance (Bush et al 2008) K Van Steen 632
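Two of the contingency-table measures discussed can be sketched directly from the four cell counts. The definitions here are the standard ones and may differ in detail from those evaluated by Bush et al (2008):

```python
def classification_error(tp, fp, tn, fn):
    """Fraction of cases and controls incorrectly labelled by the classification."""
    return (fp + fn) / (tp + fp + tn + fn)

def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity; less sensitive than raw accuracy
    to unbalanced case/control counts."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Example two-way table: 40 true positives, 10 false positives,
# 35 true negatives, 15 false negatives
print(classification_error(40, 10, 35, 15))  # 0.25
print(balanced_accuracy(40, 10, 35, 15))     # ~0.753
```

Minimizing classification error and maximizing balanced accuracy coincide for balanced samples but diverge when cases and controls are unequal, which is one reason alternative scoring measures matter.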
114 Towards an easy-to-adapt framework MB-MDR (Lou et al 2008) FAM-MDR K Van Steen 633
115 MB-MDR as a semi-parametric approach for unrelateds Step 1: New risk cell identification via an association test on each genotype cell c_j - Parametric or non-parametric test of association Step 2: Test the one-dimensional genetic construct X on Y Step 3: Assess significance - W = [b/se(b)]², b = ln(OR) - Derive the correct null distribution for W (Calle et al 2007, Calle et al 2008) K Van Steen 634
116 Motivation 1 for MB-MDR Some important interactions could be missed by MDR due to pooling too many cells together (Calle et al 2008) K Van Steen 635
117 Motivation 2 for MB-MDR MDR cannot deal with main effects / confounding factors / nondichotomous outcomes K Van Steen 636
118 Motivation 3 for MB-MDR MDR has low performance in the presence of genetic heterogeneity (Calle et al 2008) K Van Steen 637
119 A comparison of analytical methods: GENN, RF, FITF, MDR, logistic regression GENN Grammatical evolution neural network (GENN) is a novel pattern recognition method developed to detect main effects or multi-locus models of association without exhaustively searching all possible multi-locus combinations. Grammatical evolution (GE) is a machine-learning algorithm inspired by the biological process of transcription and translation. GE uses a genetic algorithm in combination with a pre-specified grammar (set of translation rules) to automatically evolve an optimal computer program. GENN utilizes GE to evolve the inputs (predictor variables), architecture (arrangement of layers and functions), and weights of a neural network (NN) to optimally classify a given dataset. (Motsinger-Reif et al 2008) K Van Steen 638
120 A schematic overview of the GENN method (Motsinger-Reif 2008) K Van Steen 639
121 Random Forests (RF) RF is a machine-learning technique that builds a forest of classification trees wherein each component tree is grown from a bootstrap sample of the data, and the variable at each tree node is selected from a random subset of all variables in the data (Breiman, 2001). The final classification of an individual is determined by voting over all trees in the forest. RF models may uncover interactions among factors that do not exhibit strong marginal effects, without demanding a pre-specified model (McKinney et al., 2006). Additionally, tree methods are suited to dealing with certain types of genetic heterogeneity, since splits near the root node define separate model subsets in the data. (Motsinger-Reif et al 2008) K Van Steen 640
122 Random Forests (RF) Each tree in the forest is constructed as follows from data having N individuals and M explanatory variables: - Choose a training sample by selecting N individuals, with replacement, from the entire data set. - At each node in the tree, randomly select m variables from the entire set of M variables in the data. The absolute magnitude of m is a function of the number of variables in the data set and remains constant throughout the forest building process. - Choose the best split at the current node from among the subset of m variables selected above. - Iterate the second and third steps until the tree is fully grown (no pruning). Repetition of this algorithm yields a forest of trees, each of which has been trained on bootstrap samples of individuals (Motsinger-Reif et al 2008) K Van Steen 641
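The forest-building loop above can be sketched in miniature. The toy below is my own simplification: each "tree" is kept to depth 1 rather than fully grown, but it keeps the two defining ingredients, bootstrap sampling and random variable subsets of constant size m, and classifies by voting over trees:

```python
import random
from collections import Counter

def train_stump(X, y, m, rng):
    """One 'tree', kept to depth 1 for brevity: draw a bootstrap sample,
    randomly select m of the M variables, and keep the variable whose
    value -> majority-class rule best fits the bootstrap sample."""
    n, M = len(X), len(X[0])
    boot = [rng.randrange(n) for _ in range(n)]          # sample with replacement
    best = None
    for j in rng.sample(range(M), m):                    # random variable subset
        votes = {}
        for i in boot:
            votes.setdefault(X[i][j], Counter())[y[i]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in votes.items()}
        acc = sum(rule[X[i][j]] == y[i] for i in boot) / n
        if best is None or acc > best[0]:
            best = (acc, j, rule)
    return best[1], best[2]

def predict(forest, row):
    """Final classification by voting over all trees in the forest."""
    votes = Counter(rule.get(row[j], 0) for j, rule in forest)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
X = [[rng.randint(0, 2) for _ in range(5)] for _ in range(200)]
y = [int(row[0] > 0) for row in X]                       # variable 0 drives the class
forest = [train_stump(X, y, m=4, rng=rng) for _ in range(25)]
accuracy = sum(predict(forest, row) == label for row, label in zip(X, y)) / len(X)
print(accuracy)
```

In practice one would use an established implementation (e.g. Breiman's CART-based trees grown fully without pruning, as the steps describe) rather than this stump-based miniature.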
123 A schematic overview of the RF method (Motsinger-Reif et al 2008) K Van Steen 642
124 Advantages of the Random Forest method It can handle a large number of input variables. It estimates the relative importance of variables in determining classification, thus providing a metric for feature selection. RF produces a highly accurate classifier with an internal unbiased estimate of generalizability during the forest building process. RF is fairly robust in the presence of etiological heterogeneity and relatively high amounts of missing data (Lunetta et al., 2004). Finally, and of increasing importance as the number of input variables increases, learning is fast and computation time is modest even for very large data sets (Robnik-Sikonja, 2004). (Motsinger-Reif et al 2008) K Van Steen 643
125 Focused Interaction Testing Framework (FITF) The FITF was recently developed to detect epistatic interactions that predict disease risk. Details of the FITF algorithm and software can be found in Millstein et al. (2006). FITF is a modification of the interaction testing framework (ITF) method, which pre-screens all possible gene sets to focus on those that potentially are the most informative and reduce the multiple testing problem by reducing the number of statistical tests performed. FITF has been shown to outperform MDR when interactions involved additive, recessive, or dominant genes (Millstein et al., 2006). (Motsinger-Reif et al 2008) K Van Steen 644
126 Focused Interaction Testing Framework (FITF) The FITF algorithm modifies the ITF approach to reduce the overall number of variants tested with an initial filter process. A chi-square goodness-of-fit statistic that compares the observed with the expected distribution of multi-locus genotype combinations in a combined case-control population is used in the prescreening initial stage. This statistic, referred to as the chi-square subset (CSS), has the form: CSS = Σ_{i=1}^{r} [n_i - E(n_i)]² / E(n_i) where n_i is the observed number of subjects (regardless of case/control status) in the i-th genotype group and r is the total number of genotype groups. The expected n_i, noted E(n_i), is estimated based on the sample marginal genotype frequencies of each gene. K Van Steen 645
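The CSS statistic can be sketched as follows, with expected counts taken as products of the sample marginal genotype frequencies (my reading of the description above; Millstein et al (2006) give the exact definition):

```python
from collections import Counter
from itertools import product

def css(genotype_columns):
    """Chi-square subset statistic: compare observed multi-locus genotype
    counts with the counts expected under independence of the loci
    (product of sample marginal genotype frequencies)."""
    n = len(genotype_columns[0])
    observed = Counter(zip(*genotype_columns))
    marginals = [Counter(col) for col in genotype_columns]
    stat = 0.0
    for combo in product(*[m.keys() for m in marginals]):
        expected = n
        for value, marg in zip(combo, marginals):
            expected *= marg[value] / n
        if expected > 0:
            stat += (observed.get(combo, 0) - expected) ** 2 / expected
    return stat

# Two independent loci give CSS = 0 in this balanced design;
# two perfectly correlated loci give a large CSS
a = [0, 0, 1, 1] * 25
b_ind = [0, 1, 0, 1] * 25
print(css([a, b_ind]))  # 0.0
print(css([a, a]))      # 100.0
```

Gene sets with a large CSS deviate from what the marginal frequencies predict and are the ones the filter passes on to the formal ITF tests.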
127 Conclusion on comparison MDR results in the one- and two-locus models were comparable to GENN GENN performs poorly with the three-locus models considered in Motsinger-Reif et al (2008). - This highlights a disadvantage of an evolutionary computation approach in exploring purely epistatic models: it is much less likely that three loci will be stochastically assembled into a model for evaluation than two loci. Both GENN and MDR outperformed FITF - Because GENN and MDR both utilize permutation distributions for significance testing, correction for multiple testing is unnecessary. While the filter stage of FITF does reduce the number of tests performed with the ITF strategy, there is still a very large number of tests that must be corrected for. K Van Steen 646
128 Conclusion on comparison Both RF and stepLR were unable to detect purely epistatic models. - Both require marginal main effects to perform variable selection tasks. - Future extensions/modifications of these approaches should consider this limitation and modify the variable selection process to capture pure interactions. - Some groups have in fact begun to make modifications in this direction (Bureau et al., 2005) K Van Steen 647
129 2.c Interpretation of multi-locus results It is always a good idea to use several model selection criteria before interpreting results (Ritchie et al 2007) K Van Steen 648
130 A flexible framework for analysis acknowledging interpretation capability The framework contains four steps to detect, characterize, and interpret epistasis - Select interesting combinations of SNPs - Construct new attributes from those selected - Develop and evaluate a classification model using the newly constructed attribute(s) - Interpret the final epistasis model using visual methods (Moore et al 2005) K Van Steen 649
131 Flexible framework Step 1 Attribute selection - Use entropy-based measures of information gain (IG) and interaction - Evaluate the gain in information about a class variable (e.g. case-control status) from merging two attributes together - This measure of IG allows us to gauge the benefit of considering two (or more) attributes as one unit (slide: Chen 2007) K Van Steen 650
132 Information gain Recall McGill's multiple mutual information (Te Sun Han 1980): I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C) (information gain) If I(A;B;C) > 0 - Evidence for an attribute interaction that cannot be linearly decomposed If I(A;B;C) < 0 - The information between A and B is redundant If I(A;B;C) = 0 - Evidence of conditional independence or a mixture of synergy and redundancy K Van Steen 651
133 Illustration of entropy-based measures on Model 1 (Ritchie et al 2001) K Van Steen 652
134 Attribute selection based on entropy Entropy-based IG is estimated for each individual attribute (i.e. main effects) and each pairwise combination of attributes (i.e. interaction effects). Pairs of attributes are sorted and those with the highest IG, or percentage of entropy in the class removed, are selected for further consideration (slide: Chen 2007) K Van Steen 653
135 Attribute selection based on ReliefF The Relief statistic was developed by the computer science community as a powerful method for determining the quality or relevance of an attribute (i.e. variable) for predicting a discrete endpoint or class variable (Kira and Rendell 1992, Kononenko 1994, Robnik-Sikonja and Kononenko 2003). Relief is especially useful when there is an interaction between two or more attributes and the discrete class variable. It is thus superior to univariate filters, such as a chi-square test of independence (see later), when interactions are present. K Van Steen 654
136 Attribute selection based on ReliefF In particular, Relief estimates the quality of attributes through a type of nearest-neighbor algorithm that selects neighbors (instances) from the same class and from the different class, based on the vector of values across attributes. Weights (W) or quality estimates for each attribute (A) are updated according to whether a randomly selected instance (R), its nearest neighbor from the same class (nearest hit, H), and its nearest neighbor from the other class (nearest miss, M) have the same or different values of that attribute. This process of adjusting weights is repeated for m instances. The algorithm produces weights for each attribute ranging from -1 (worst) to +1 (best). K Van Steen 655
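The weight-update scheme just described can be sketched as below (a bare-bones Relief for discrete attributes, without ReliefF's k-nearest-neighbour averaging; all names are mine). On a purely epistatic XOR pattern, the two interacting attributes receive clearly higher weights than the noise attributes:

```python
import random

def relief(X, y, m=150, seed=3):
    """Relief sketch: for m randomly drawn instances, find the nearest hit
    (same class) and nearest miss (other class) using the whole attribute
    vector, then reward attributes that differ on the miss and penalise
    attributes that differ on the hit."""
    rng = random.Random(seed)
    n, M = len(X), len(X[0])
    W = [0.0] * M

    def dist(r1, r2):                      # Hamming distance over all attributes
        return sum(a != b for a, b in zip(r1, r2))

    for _ in range(m):
        i = rng.randrange(n)
        hit = min((j for j in range(n) if j != i and y[j] == y[i]),
                  key=lambda j: dist(X[i], X[j]))
        miss = min((j for j in range(n) if y[j] != y[i]),
                   key=lambda j: dist(X[i], X[j]))
        for a in range(M):
            W[a] += ((X[i][a] != X[miss][a]) - (X[i][a] != X[hit][a])) / m
    return W

# Purely epistatic (XOR) signal in attributes 0 and 1, plus four noise attributes
rng = random.Random(7)
X = [[rng.randint(0, 1) for _ in range(6)] for _ in range(300)]
y = [row[0] ^ row[1] for row in X]
W = relief(X, y)
print(W)
```

Because the nearest miss must differ in at least one of the two interacting attributes, their weights accumulate even though neither shows any marginal effect, which a univariate chi-square filter would miss entirely.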
137 Attribute selection based on ReliefF (applied to M27) Only the top 10% of scores will be returned to the filtered data set K Van Steen 656
138 Attribute selection based on ReliefF applied to M27 For the M27 simulated data, this reduction of the overall attribute count does not make sense, of course (# SNPs = 10!) K Van Steen 657
139 Attribute selection based on TuRF ReliefF is able to capture attribute interactions because it selects nearest neighbors using the entire vector of values across all attributes. However, this advantage is also a disadvantage, because the presence of many noisy attributes can reduce the signal the algorithm is trying to capture. The tuned ReliefF algorithm (TuRF) systematically removes attributes that have low quality estimates, so that the ReliefF values of the remaining attributes can be re-estimated. (Moore and White 2008) K Van Steen 658
140 Attribute selection based on chi-squared The MDR software provides a simple chi-square test of independence as a univariate filter. - The manual specifies that this filter should be used to condition your MDR analysis on those attributes that have an independent main effect. - However, the MDR software itself does not give you a lot of options to actually perform this conditioning The ReliefF filter will be more useful for capturing those attributes that are likely to be involved in an interaction. K Van Steen 659
141 Attribute selection based on chi-squared (applied to M27) K Van Steen 660
142 Attribute selection based on odds ratio The odds ratio (OR) is a way of comparing whether the probability of a certain event is the same in two groups. - An odds ratio of 1 implies that the event is equally likely in both groups. - An odds ratio greater than 1 implies that the event is more likely in the first group, whereas - A value less than 1 implies that the event is less likely in the first group. When an attribute is polytomous (i.e. more than 2 levels), MDR calculates the OR for each possible contrast and then reports the largest OR value. - For 3 levels 0, 1, 2, the following contrasts are considered: 0 vs 1; 0 vs 2; 1 vs 2; 0 vs 1&2; 1 vs 0&2; 2 vs 1&0 K Van Steen 661
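The contrast enumeration can be sketched as below. The 0.5 continuity correction and taking the larger of OR and 1/OR per contrast are my assumptions for the sketch, not documented MDR behaviour:

```python
def odds_ratio(a, b, c, d):
    """OR comparing group 1 (a cases, b controls) with group 2 (c cases,
    d controls); a 0.5 continuity correction (my addition) avoids division
    by zero for empty cells."""
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)

def max_or_polytomous(case_counts, ctrl_counts):
    """For a 3-level attribute, evaluate the six contrasts listed above
    (0 vs 1; 0 vs 2; 1 vs 2; 0 vs 1&2; 1 vs 0&2; 2 vs 1&0) and report the
    largest OR; the larger of OR and 1/OR is taken per contrast (an
    assumption) so that strong protective contrasts are not missed."""
    contrasts = [({0}, {1}), ({0}, {2}), ({1}, {2}),
                 ({0}, {1, 2}), ({1}, {0, 2}), ({2}, {1, 0})]
    best = 1.0
    for g1, g2 in contrasts:
        a = sum(case_counts[g] for g in g1)
        b = sum(ctrl_counts[g] for g in g1)
        c = sum(case_counts[g] for g in g2)
        d = sum(ctrl_counts[g] for g in g2)
        o = odds_ratio(a, b, c, d)
        best = max(best, o, 1 / o)
    return best

case_counts = {0: 10, 1: 20, 2: 70}
ctrl_counts = {0: 40, 1: 30, 2: 30}
print(max_or_polytomous(case_counts, ctrl_counts))  # ~8.92 (the 0 vs 2 contrast)
```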
143 Flexible framework Step 2 Constructive induction, for instance MDR - A multi-locus genotype combination is considered high-risk if the ratio of cases to controls exceeds a given threshold T; otherwise it is considered low-risk - Genotype combinations considered high-risk are labeled G1, while those considered low-risk are labeled G0 - This process constructs a new one-dimensional attribute with levels G0 and G1 (slide: Chen 2007) K Van Steen 662
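This pooling step can be sketched directly (function and variable names are mine):

```python
def mdr_attribute(geno_a, geno_b, labels, T=1.0):
    """Constructive induction sketch: pool each two-locus genotype combination
    into 'G1' (high-risk) when its case/control ratio exceeds threshold T,
    else 'G0', yielding a single one-dimensional attribute."""
    cells = {}
    for a, b, y in zip(geno_a, geno_b, labels):
        cases, ctrls = cells.get((a, b), (0, 0))
        cells[(a, b)] = (cases + y, ctrls + (1 - y))
    high = {cell for cell, (ca, co) in cells.items()
            if (co == 0 and ca > 0) or (co > 0 and ca / co > T)}
    return ['G1' if (a, b) in high else 'G0' for a, b in zip(geno_a, geno_b)]

# A purely epistatic (XOR-like) pattern: only the joint genotype is informative
geno_a = [0, 0, 1, 1] * 10
geno_b = [0, 1, 0, 1] * 10
labels = [a ^ b for a, b in zip(geno_a, geno_b)]
new_attr = mdr_attribute(geno_a, geno_b, labels)
print(new_attr[:4])  # ['G0', 'G1', 'G1', 'G0']
```

The constructed G0/G1 attribute can then be handed to any classifier in Step 3, which is what makes the framework modular.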
144 Flexible framework Step 3 Classification and machine learning - The single attribute obtained in Step 2 can be modeled using machine learning and classification techniques Bayes classifiers as one technique - Mitchell (1997) defines the naive Bayes classifier as v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j) - where v_j is one of a set of V classes and a_i is one of n attributes describing an event or data element. The class associated with a specific attribute list is the one which maximizes the product of the class probability and the probability of each attribute value given the specified class. K Van Steen 663
145 Flexible framework Step 3 - The standard way to apply the naive Bayes classifier to genotype data would be to use the genotype information for each individual as a list of attributes to distinguish between the two hypotheses The subject is high-risk and The subject is low-risk. Alternatively, an odds ratio for the single multilocus attribute can also be estimated using logistic regression to facilitate a traditional epidemiological analysis and interpretation. - Evaluation of the predictor can be carried out using cross-validation (Hastie et al., 2001) and permutation testing (Good, 2000), for example. (Moore et al 2006) K Van Steen 664
146 Flexible framework Step 4 Interpretation interaction graphs - Comprised of a node for each attribute, with pairwise connections between them. - Each node is labeled with the percentage of entropy removed (i.e. IG) by that attribute. - Each connection is labeled with the percentage of entropy removed by the pairwise Cartesian product of the two attributes. (slide: Chen 2007) K Van Steen 665
147 Flexible framework Step 4 Interpretation dendrograms - Hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree. K Van Steen 666
148 Hierarchical clustering with average linkage Here the distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group K Van Steen 667
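The average-linkage rule can be sketched naively as below, on a toy attribute-distance matrix in which A and B (and likewise C and D) are close; the function is my own illustration, not the software's implementation:

```python
def average_linkage(dist, labels):
    """Naive agglomerative clustering with average linkage: at each step merge
    the two clusters whose average pairwise inter-cluster distance is smallest,
    recording the merged members and the merge distance (dendrogram height)."""
    clusters = [[i] for i in range(len(labels))]

    def d(c1, c2):                          # average of all pairwise distances
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        merges.append((sorted(labels[k] for k in clusters[i] + clusters[j]),
                       d(clusters[i], clusters[j])))
        clusters[i] += clusters.pop(j)
    return merges

labels = ['A', 'B', 'C', 'D']
dist = [[0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.85, 0.9],
        [0.9, 0.85, 0.0, 0.2],
        [0.8, 0.9, 0.2, 0.0]]
for members, height in average_linkage(dist, labels):
    print(members, height)
```

Strongly interacting attribute pairs merge at small heights and thus end up close together at the leaves of the resulting dendrogram.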
149 Flexible framework The flexibility of this framework lies in the ability to plug and play - Different attribute selection methods other than the entropy-based ones - Different constructive induction algorithms other than MDR - Different machine learning strategies other than a naïve Bayes classifier (slide: Chen 2007) K Van Steen 668
150 3 Future challenges Integration of omics data in GWAs K Van Steen 669
151 Integration of omics data in GWAs (Hirschhorn 2009) K Van Steen 670
152 Integration of omics data in GWAs A few straightforward examples: Post-analysis - As validation tool in main effects GWAs During the analysis: - Epistasis screening (FAM-MDR) Use expression values to prioritize multi-locus combinations - Main effects screening (PBAT) Construct an overall phenotype for each marker based on the linear combination of expression values (e.g., within 1Mb from the marker) that maximizes heritability and perform FBAT-PC screening to prioritize SNPs K Van Steen 671
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationExperimental Design and Data Analysis for Biologists
Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1
More informationPATTERN CLASSIFICATION
PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS
More informationProbability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies
Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National
More informationVariable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning
Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning Robert B. Gramacy University of Chicago Booth School of Business faculty.chicagobooth.edu/robert.gramacy
More informationBioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.
Bioinformatics Jason H. Moore, Ph.D. Frank Lane Research Scholar in Computational Genetics Associate Professor of Genetics Adjunct Associate Professor of Biological Sciences Adjunct Associate Professor
More informationRelated Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM
Lecture 9 SEM, Statistical Modeling, AI, and Data Mining I. Terminology of SEM Related Concepts: Causal Modeling Path Analysis Structural Equation Modeling Latent variables (Factors measurable, but thru
More informationBackward Genotype-Trait Association. in Case-Control Designs
Backward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs Tian Zheng, Hui Wang and Shaw-Hwa Lo Department of Statistics, Columbia University, New York, New York,
More informationUVA CS 4501: Machine Learning
UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course
More informationText Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data
More informationStephen Scott.
1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training
More informationHERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)
BIRS 016 1 HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA) Malka Gorfine, Tel Aviv University, Israel Joint work with Li Hsu, FHCRC, Seattle, USA BIRS 016 The concept of heritability
More informationMultiple QTL mapping
Multiple QTL mapping Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] 1 Why? Reduce residual variation = increased power
More informationFilter Methods. Part I : Basic Principles and Methods
Filter Methods Part I : Basic Principles and Methods Feature Selection: Wrappers Input: large feature set Ω 10 Identify candidate subset S Ω 20 While!stop criterion() Evaluate error of a classifier using
More informationData Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td
Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak
More informationSNP-SNP Interactions in Case-Parent Trios
Detection of SNP-SNP Interactions in Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 2, 2009 Karyotypes http://ghr.nlm.nih.gov/ Single Nucleotide Polymphisms
More informationEnsemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan
Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite
More informationPart I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes
Week 10 Based in part on slides from textbook, slides of Susan Holmes Part I Linear regression & December 5, 2012 1 / 1 2 / 1 We ve talked mostly about classification, where the outcome categorical. If
More informationhsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference
CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science
More informationAggregated Quantitative Multifactor Dimensionality Reduction
University of Kentucky UKnowledge Theses and Dissertations--Statistics Statistics 2016 Aggregated Quantitative Multifactor Dimensionality Reduction Rebecca E. Crouch University of Kentucky, rebecca.crouch@uky.edu
More informationUnsupervised machine learning
Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationMachine Learning Overview
Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression
More informationChap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University
Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics
More informationUniversity of California, Berkeley
University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2008 Paper 241 A Note on Risk Prediction for Case-Control Studies Sherri Rose Mark J. van der Laan Division
More informationDecision Tree Learning Lecture 2
Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over
More informationLecture 2. Judging the Performance of Classifiers. Nitin R. Patel
Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only
More informationFundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur
Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new
More informationPowerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions
Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions Nilanjan Chatterjee, Zeynep Kalaylioglu 2, Roxana Moslehi, Ulrike Peters 3, Sholom Wacholder
More informationClassification and Regression Trees
Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationLecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017
Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 46 Lecture Overview 1. Variance Component
More informationResearch Statement on Statistics Jun Zhang
Research Statement on Statistics Jun Zhang (junzhang@galton.uchicago.edu) My interest on statistics generally includes machine learning and statistical genetics. My recent work focus on detection and interpretation
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationGenotype Imputation. Biostatistics 666
Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives
More informationComparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees
Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationLecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017
Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping
More informationSF2930 Regression Analysis
SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classication 20 February 2017 1 / 30 Idag Overview Regression trees Pruning Bagging, random forests 2 / 30 Today Overview Regression
More informationthe tree till a class assignment is reached
Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal
More informationNonlinear Classification
Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions
More informationOverview. Background
Overview Implementation of robust methods for locating quantitative trait loci in R Introduction to QTL mapping Andreas Baierl and Andreas Futschik Institute of Statistics and Decision Support Systems
More informationAnalysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing
Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver
More informationDecision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro
Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationGene mapping in model organisms
Gene mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Goal Identify genes that contribute to common human diseases. 2
More informationPackage LBLGXE. R topics documented: July 20, Type Package
Type Package Package LBLGXE July 20, 2015 Title Bayesian Lasso for detecting Rare (or Common) Haplotype Association and their interactions with Environmental Covariates Version 1.2 Date 2015-07-09 Author
More informationData Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018
Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model
More informationStatistical Applications in Genetics and Molecular Biology
Statistical Applications in Genetics and Molecular Biology Volume 6, Issue 1 2007 Article 28 A Comparison of Methods to Control Type I Errors in Microarray Studies Jinsong Chen Mark J. van der Laan Martyn
More information1. Understand the methods for analyzing population structure in genomes
MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW3: Population Genetics Due: 24:00 EST, April 4, 2016 by autolab Your goals in this assignment are to 1. Understand the methods for analyzing population
More informationData Warehousing & Data Mining
13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.
More informationPerformance Evaluation
Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,
More informationday month year documentname/initials 1
ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationCS 6375 Machine Learning
CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}
More informationBayesian Regression (1/31/13)
STA613/CBB540: Statistical methods in computational biology Bayesian Regression (1/31/13) Lecturer: Barbara Engelhardt Scribe: Amanda Lea 1 Bayesian Paradigm Bayesian methods ask: given that I have observed
More information