Greedy Biomarker Discovery in the Genome with Applications to Antibiotic Resistance


1 Greedy Biomarker Discovery in the Genome with Applications to Antibiotic Resistance
Alexandre Drouin, Sébastien Giguère, Maxime Déraspe, François Laviolette, Mario Marchand, Jacques Corbeil
Department of Computer Science and Software Engineering, Université Laval; Department of Molecular Medicine, Université Laval; Institute for Research in Immunology and Cancer, Université de Montréal
Greed is Great workshop, ICML 2015, Lille, France, July 10, 2015

2 Outline
1. Introduction: Genomics; Formalization
2. Methods: Data Representation for Genomes; The Set Covering Machine; Risk bounds
3. Results: Dataset Overview; Benchmark
4. Conclusion

3 Introduction

4 Genomics
Study of the entire genetic material of individuals. DNA is composed of four nucleotides (A, T, G, C). DNA molecules are sequenced using DNA sequencers.

5 Cost of sequencing
[figure: cost of genome sequencing over time]
Consequence: more and more data to analyse.

6 Biomarker Discovery: Case-Control Studies
Cancer vs Healthy
Biomarker: a measurable characteristic that is predictive of some biological state.
Motivation: obtain a better understanding of the biological processes involved; develop diagnostic tests, new therapies and drug treatments.


8 Formalization as a Supervised Learning Problem
Data: a sample S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} ~ D^m, where
- x ∈ X = {A, T, G, C}* is a genome
- y ∈ {0, 1} is a label (control or case)
- D is the data-generating distribution
Objective:
1. Define a suitable representation for genomes, φ : X → R^d
2. Find a predictor h : R^d → {0, 1} with good generalization performance, i.e., that minimizes the risk R(h) = Pr_{(x,y)~D} [h(φ(x)) ≠ y]


11 The Biomarker Discovery Problem
Additional objective: model interpretability
- Must have a form that is understandable by domain experts (validation/acceptance)
- Some types of models are more easily understood (rule-based vs linear combination)
- Sparsity is essential (lower costs, faster diagnostics)
Challenges
- Extremely high-dimensional feature spaces (d is often > 10^7)
- Many highly correlated features (genes)
- Small sample size (m ≪ d)


13 Methods Alexandre Drouin (Université Laval) Greedy Biomarker Discovery July 10, / 24

14 Genome Representation
Definition: a k-mer is a sequence of k nucleotides. Note: there are 4^k possible sequences (HUGE).
Definition: K is the set of all k-mers present in at least one genome of S. Note: |K| can STILL be HUGE (tens of millions in our case).
We represent each genome x by a binary vector φ(x) ∈ B^|K|, such that φ(x)_j = 1 if k_j ∈ K is a substring of x, and 0 otherwise.


17 Genome Representation (Example)
Bag-of-words representation with
K = {CAGATA, AGATAG, AACAGC, GATAGA, AGAACA, TAGAAC, GAACAG, ATAGAA, TTTCGG, CGATGA, CCGGCT, AAATAC}
For x = CAGATAGAACAGC:
φ(x) = (CAGATA: 1, TTTCGG: 0, AGATAG: 1, GATAGA: 1, CGATGA: 0, AACAGC: 1, ATAGAA: 1, CCGGCT: 0, TAGAAC: 1, GAACAG: 1, AGAACA: 1, AAATAC: 0)
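The example above can be reproduced in a few lines (a minimal sketch; the function names `kmers`, `build_kmer_set`, and `phi` are ours, not from the authors' implementation):

```python
def kmers(genome, k):
    """All distinct length-k substrings of a genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

def build_kmer_set(genomes, k):
    """K: every k-mer present in at least one genome of the sample."""
    K = set()
    for g in genomes:
        K |= kmers(g, k)
    return sorted(K)

def phi(genome, K, k):
    """Binary vector: phi(x)_j = 1 iff k-mer k_j is a substring of x."""
    present = kmers(genome, k)
    return [1 if kj in present else 0 for kj in K]

# The 12 k-mers and the genome from the slide
K = sorted(["CAGATA", "AGATAG", "AACAGC", "GATAGA", "AGAACA", "TAGAAC",
            "GAACAG", "ATAGAA", "TTTCGG", "CGATGA", "CCGGCT", "AAATAC"])
x = "CAGATAGAACAGC"
vec = phi(x, K, 6)
print(sum(vec))  # 8 of the 12 k-mers occur in x
```

All eight 6-mers of the 13-nucleotide genome appear in K, so exactly eight entries of φ(x) are set.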

18 The Set Covering Machine (Marchand and Shawe-Taylor, 2003)
Learns conjunctions or disjunctions of boolean-valued rules r_i : R^d → {True, False}. We use a presence and an absence rule for each k-mer.
Objective: given a set of boolean-valued rules R, find the predictor that minimizes the empirical risk
R_S(h) = (1/m) Σ_{i=1}^{m} I[h(φ(x_i)) ≠ y_i],
while using the smallest subset of R.
This problem is NP-hard (minimum set cover problem). Solution: use a greedy approximation algorithm inspired by that of Chvátal (1979).


23 The Set Covering Machine (Marchand and Shawe-Taylor, 2003)
Greedy Algorithm (Conjunction Case)
1. Start with an empty conjunction
2. Compute a utility function for each rule of R
3. Select the rule with the greatest utility (r*)
4. Remove all the examples for which r*(φ(x)) = False
5. Go to step 2 until one of the following is true: all the negative examples have been removed, or s iterations have been performed (hyperparameter)
Motivation for step 4: the outcome of the conjunction is definitive for any example that is predicted as negative by at least one rule; there is no need to consider these examples in further iterations.

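The five steps can be sketched directly (an illustrative implementation, not the authors' code; `X` is assumed to be a binary k-mer matrix, each candidate rule tests presence or absence of one feature, and the utility `|A| − p·|B|` is the one defined on the next slide):

```python
import numpy as np

def scm_conjunction(X, y, s=5, p=1.0):
    """Greedy SCM training, conjunction case.

    Returns a list of (feature, value) rules; the conjunction predicts
    the positive class iff every rule X[:, feature] == value holds.
    """
    rules = []
    active = np.ones(len(y), dtype=bool)       # examples still considered
    for _ in range(s):
        neg = active & (y == 0)
        pos = active & (y == 1)
        if not neg.any():                      # all negatives covered: stop
            break
        best, best_u = None, -np.inf
        for j in range(X.shape[1]):
            for v in (1, 0):                   # presence rule, absence rule
                out = X[:, j] == v             # True -> rule predicts positive
                A = (neg & ~out).sum()         # negatives correctly rejected
                B = (pos & ~out).sum()         # positives wrongly rejected
                u = A - p * B
                if u > best_u:
                    best, best_u = (j, v), u
        rules.append(best)
        j, v = best
        active &= (X[:, j] == v)               # step 4: drop rejected examples
    return rules

# Toy data: positives are exactly the rows with feature 0 AND feature 1 set
X = np.array([[1, 1, 0], [1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0]])
y = np.array([1, 1, 0, 0, 0])
print(scm_conjunction(X, y))  # [(0, 1), (1, 1)]
```

On this toy sample the algorithm recovers the two presence rules whose conjunction separates the classes, stopping early once no negative example remains.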

29 The Set Covering Machine (Marchand and Shawe-Taylor, 2003)
Utility Function: for any rule r_i, the utility is given by
U_i = |A_i| − p · |B_i|,
where A_i is the subset of negative examples correctly classified by r_i, B_i is the subset of positive examples incorrectly classified by r_i, and p is a hyperparameter.
Scalability: the complexity is O(m · |R| · s), thus linear in the number of examples and rules. We developed an out-of-core implementation (data is loaded/analysed in blocks).

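Because |A_i| and |B_i| are counts over examples, the utilities are additive across row blocks of the feature matrix, which is the essence of the out-of-core strategy (a sketch under that assumption; the function name and block layout are ours):

```python
import numpy as np

def utilities_out_of_core(blocks, p=1.0):
    """Accumulate U_j = |A_j| - p * |B_j| for the presence rule of each
    feature, streaming (X_block, y_block) pairs so the full matrix
    never has to sit in memory at once."""
    A = B = None
    for Xb, yb in blocks:
        reject = Xb == 0                       # presence rule predicts negative
        a = reject[yb == 0].sum(axis=0)        # negatives correctly rejected
        b = reject[yb == 1].sum(axis=0)        # positives wrongly rejected
        A = a if A is None else A + a
        B = b if B is None else B + b
    return A - p * B

# Two blocks of a tiny 4-example, 2-feature matrix
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 0, 1, 0])
U = utilities_out_of_core([(X[:2], y[:2]), (X[2:], y[2:])])
print(U)
```

Streaming changes nothing about the result: the accumulated utilities equal those computed on the full matrix in one pass.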

32 Can we expect good generalization?
We can bound the risk of a predictor based on its performance on the training set. The following term bounds the risk of every conjunction h of rules in R with probability 1 − δ.
Occam's Razor Bound:
ε = (1 / (m − r)) [ ln C(m, r) + |h| ln(2 · 4^k) − ln(ζ(r) ζ(|h|) δ) ],
where C(m, r) is the binomial coefficient, r is the number of errors on the training set, |h| is the number of rules in the conjunction, and ζ is any function such that Σ_{b ∈ N} ζ(b) ≤ 1.
The combinatorial term dominates the bound even for classifiers that make few errors. The bound seems to indicate bad generalization performance.

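The domination of the combinatorial term can be checked numerically (a sketch; the values of m, r, |h|, and δ are illustrative, and ζ(b) = 6/(π²(b+1)²) is one valid choice since it sums to 1 over the natural numbers):

```python
from math import comb, log, pi

def zeta(b):
    """One valid zeta: 6 / (pi^2 (b+1)^2), summing to 1 over b >= 0."""
    return 6.0 / (pi ** 2 * (b + 1) ** 2)

def occam_bound(m, r, k, h, delta=0.05):
    """epsilon = (1/(m-r)) [ln C(m,r) + |h| ln(2*4^k)
                            - ln(zeta(r) zeta(|h|) delta)]"""
    return (log(comb(m, r)) + h * log(2 * 4 ** k)
            - log(zeta(r) * zeta(h) * delta)) / (m - r)

# Even a 3-rule model with 2 training errors on 100 examples gets a
# vacuous bound (> 1) for k = 31: the |h| ln(2*4^k) term dominates.
print(occam_bound(m=100, r=2, k=31, h=3))
```

With these numbers the |h| ln(2·4^k) term contributes over 130 nats against fewer than 20 from everything else, so the bound exceeds 1 and is uninformative.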

36 Can we expect good generalization?
In the sample compression framework, the predictor h is specified using a small set of training examples (Z_i):
Sample Compression Bound:
ε = (1 / (m − |h| − r)) [ ln C(m, |h|) + ln C(m − |h|, r) + Σ_{x ∈ Z_i} ln(2 |x|) − ln(ζ(|h|) ζ(r) δ) ],
where r is the number of errors made on S \ Z_i.
The bound no longer depends on k. We can consider exponentially more complex feature spaces without any penalty on the generalization error.

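A companion numeric check for the sample compression bound, in the same illustrative setting as before (the genome lengths, here three compression-set genomes of roughly 4 Mbp, are assumed values; note that k never enters the formula):

```python
from math import comb, log, pi

def zeta(b):
    """One valid zeta: 6 / (pi^2 (b+1)^2), summing to 1 over b >= 0."""
    return 6.0 / (pi ** 2 * (b + 1) ** 2)

def compression_bound(m, h, r, genome_lengths, delta=0.05):
    """epsilon = (1/(m-|h|-r)) [ln C(m,|h|) + ln C(m-|h|, r)
                 + sum_{x in Z_i} ln(2|x|) - ln(zeta(|h|) zeta(r) delta)]"""
    return (log(comb(m, h)) + log(comb(m - h, r))
            + sum(log(2 * L) for L in genome_lengths)
            - log(zeta(h) * zeta(r) * delta)) / (m - h - r)

# Same m, r, |h| as the Occam check; complexity is now paid per
# compression-set genome length |x| rather than per k.
print(compression_bound(m=100, h=3, r=2, genome_lengths=[4_000_000] * 3))
```

Under the same m = 100, r = 2, |h| = 3 the bound drops below 1, illustrating the claim that complex k-mer spaces carry no penalty in this framework.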

40 Results

41 Datasets
[table: number of distinct k-mers and number of examples per dataset; the numeric values were lost in extraction]
Clostridium difficile: Azithromycin, Ceftriaxone, Clarithromycin, Clindamycin, Moxifloxacin
Pseudomonas aeruginosa: Amikacin, Doripenem, Meropenem, Levofloxacin
Streptococcus pneumoniae: Benzylpenicillin, Erythromycin, Tetracycline

42 Comparison to other methods
We compared the SCM to CART and to both L1- and L2-regularized support vector machines. For all algorithms except the SCM, the dimensionality of the feature space had to be reduced. Univariate filter: a χ² test to score each feature and the Benjamini-Yekutieli method to correct for multiple testing. We performed 5-fold nested cross-validation and compared the average risk and number of k-mers in the models over the outer folds.

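The multiple-testing correction in the filter can be sketched as follows (the p-values below are illustrative; in the benchmark each one would come from a χ² test of one k-mer feature against the resistance label, e.g. via scikit-learn's `feature_selection.chi2`):

```python
import numpy as np

def benjamini_yekutieli(pvals, alpha=0.05):
    """Benjamini-Yekutieli step-up procedure: return a boolean mask of
    features kept at FDR level alpha, valid under arbitrary dependence
    between the tests (as expected for overlapping k-mers)."""
    p = np.asarray(pvals)
    m = len(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))      # harmonic correction factor
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / (m * c_m)
    below = p[order] <= thresh
    keep = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()      # largest i with p_(i) <= t_i
        keep[order[:cutoff + 1]] = True
    return keep

pvals = [1e-8, 0.2, 1e-5, 0.04, 0.6]
print(benjamini_yekutieli(pvals))  # keeps only the two smallest p-values
```

The extra harmonic factor c_m makes this stricter than Benjamini-Hochberg, which is the price for validity under the strong dependence between k-mer features.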

46 Benchmark
[table: test risk per dataset for each method and a baseline; the risk values were lost in extraction. The average model sizes (number of k-mers, shown in parentheses in the original) survive:]

Dataset               SCM    χ²+SCM  χ²+CART  χ²+L1SVM  χ²+L2SVM
C. difficile
  Azithromycin        3.2    4.8     6.6      494.6
  Ceftriaxone         2.0    5.6     7.2      277.8
  Clarithromycin      3.0    4.6     7.6      522.6
  Clindamycin         2.0    2.4     2.4      702.2
  Moxifloxacin        1.0    1.8     1.0      173.6
P. aeruginosa
  Amikacin            6.0    9.8     18.8     687.8
  Doripenem           1.4    1.6     25.4     44.8
  Meropenem           1.8    1.8     9.2      233.6     3475.6
  Levofloxacin        1.4    1.8     1.0      180.4
S. pneumoniae
  Benzylpenicillin    1.0    1.2     1.8      295.8
  Erythromycin        2.0    5.6     4.4      299.4
  Tetracycline        1.2    2.2     1.0      479.8
Average               2.2    3.6     7.2      366.0

The SCM tends to learn the sparsest models. On most datasets, the SCM generalizes well and outperforms the baseline. Using univariate filters seems to degrade performance (SCM vs χ² + SCM).


50 The Obtained Models are Interpretable
[figure: biological validation of the k-mers selected for the C. difficile antibiotics (azithromycin, ceftriaxone, clarithromycin, clindamycin, moxifloxacin). Annotations recoverable from the figure include: DNA gyrase subunit A; Tn6194-like transposon; two-component sensor histidine kinase; transposon Tn6110 and Clostridium saccharolyticum 23S rRNA m(2)A-2503 methyltransferase; penicillin-binding protein; conjugative transposon FtsK/SpoIIIE-like protein; ErmB rRNA adenine N-6-methyltransferase; other hypothetical proteins and unmatched k-mers]

51 Conclusion

52 Conclusion
We used the Set Covering Machine to learn from extremely high-dimensional feature spaces with small sample sizes.
Scalability: the Set Covering Machine is the only algorithm that did not require feature selection.
Generalization: the obtained models compare favorably to other learning algorithms in terms of prediction error.
Interpretability: the obtained models are sparse and explicitly highlight the importance of small DNA sequences.
For all these reasons, greed is great!


57 Thank you! Come see me at our poster. Thanks to my co-authors: Sébastien Giguère, Maxime Déraspe, François Laviolette, Mario Marchand, Jacques Corbeil.


More information

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel

More information

Variable Selection in Data Mining Project

Variable Selection in Data Mining Project Variable Selection Variable Selection in Data Mining Project Gilles Godbout IFT 6266 - Algorithmes d Apprentissage Session Project Dept. Informatique et Recherche Opérationnelle Université de Montréal

More information

Sparse Approximation and Variable Selection

Sparse Approximation and Variable Selection Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation

More information

Computational Learning Theory

Computational Learning Theory CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

FACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE CASE STUDY ON TYPE 2 DIABETES DETECTION

FACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE CASE STUDY ON TYPE 2 DIABETES DETECTION SunLab Enlighten the World FACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE CASE STUDY ON TYPE 2 DIABETES DETECTION Ioakeim (Kimis) Perros and Jimeng Sun perros@gatech.edu, jsun@cc.gatech.edu COMPUTATIONAL

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

Question of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning

Question of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning Question of the Day Machine Learning 2D1431 How can you make the following equation true by drawing only one straight line? 5 + 5 + 5 = 550 Lecture 4: Decision Tree Learning Outline Decision Tree for PlayTennis

More information

Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Building a Prognostic Biomarker

Building a Prognostic Biomarker Building a Prognostic Biomarker Noah Simon and Richard Simon July 2016 1 / 44 Prognostic Biomarker for a Continuous Measure On each of n patients measure y i - single continuous outcome (eg. blood pressure,

More information

Generalization of the PAC-Bayesian Theory

Generalization of the PAC-Bayesian Theory Generalization of the PACBayesian Theory and Applications to SemiSupervised Learning Pascal Germain INRIA Paris (SIERRA Team) Modal Seminar INRIA Lille January 24, 2017 Dans la vie, l essentiel est de

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Graph-Based Semi-Supervised Learning

Graph-Based Semi-Supervised Learning Graph-Based Semi-Supervised Learning Olivier Delalleau, Yoshua Bengio and Nicolas Le Roux Université de Montréal CIAR Workshop - April 26th, 2005 Graph-Based Semi-Supervised Learning Yoshua Bengio, Olivier

More information

Maximum Margin Interval Trees

Maximum Margin Interval Trees Maximum Margin Interval Trees Alexandre Drouin Département d informatique et de génie logiciel Université Laval, Québec, Canada alexandre.drouin.8@ulaval.ca Toby Dylan Hocking McGill Genome Center McGill

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017 Machine Learning Regularization and Feature Selection Fabio Vandin November 14, 2017 1 Regularized Loss Minimization Assume h is defined by a vector w = (w 1,..., w d ) T R d (e.g., linear models) Regularization

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Support vector machines Lecture 4

Support vector machines Lecture 4 Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Computational Learning Theory. Definitions

Computational Learning Theory. Definitions Computational Learning Theory Computational learning theory is interested in theoretical analyses of the following issues. What is needed to learn effectively? Sample complexity. How many examples? Computational

More information

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions 58669 Supervised Machine Learning (Spring 014) Homework, sample solutions Credit for the solutions goes to mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

Microarray Data Analysis: Discovery

Microarray Data Analysis: Discovery Microarray Data Analysis: Discovery Lecture 5 Classification Classification vs. Clustering Classification: Goal: Placing objects (e.g. genes) into meaningful classes Supervised Clustering: Goal: Discover

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

La théorie PAC-Bayes en apprentissage supervisé

La théorie PAC-Bayes en apprentissage supervisé La théorie PAC-Bayes en apprentissage supervisé Présentation au LRI de l université Paris XI François Laviolette, Laboratoire du GRAAL, Université Laval, Québec, Canada 14 dcembre 2010 Summary Aujourd

More information

Algorithms for sparse analysis Lecture I: Background on sparse approximation

Algorithms for sparse analysis Lecture I: Background on sparse approximation Algorithms for sparse analysis Lecture I: Background on sparse approximation Anna C. Gilbert Department of Mathematics University of Michigan Tutorial on sparse approximations and algorithms Compress data

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Big Data Analytics. Special Topics for Computer Science CSE CSE Feb 24

Big Data Analytics. Special Topics for Computer Science CSE CSE Feb 24 Big Data Analytics Special Topics for Computer Science CSE 4095-001 CSE 5095-005 Feb 24 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Prediction III Goal

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

Comp487/587 - Boolean Formulas

Comp487/587 - Boolean Formulas Comp487/587 - Boolean Formulas 1 Logic and SAT 1.1 What is a Boolean Formula Logic is a way through which we can analyze and reason about simple or complicated events. In particular, we are interested

More information

Decision Trees. Danushka Bollegala

Decision Trees. Danushka Bollegala Decision Trees Danushka Bollegala Rule-based Classifiers In rule-based learning, the idea is to learn a rule from train data in the form IF X THEN Y (or a combination of nested conditions) that explains

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Final Exam. December 11 th, This exam booklet contains five problems, out of which you are expected to answer four problems of your choice.

Final Exam. December 11 th, This exam booklet contains five problems, out of which you are expected to answer four problems of your choice. CS446: Machine Learning Fall 2012 Final Exam December 11 th, 2012 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. Note that there is

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

High-Dimensional Statistical Learning: Introduction

High-Dimensional Statistical Learning: Introduction Classical Statistics Biological Big Data Supervised and Unsupervised Learning High-Dimensional Statistical Learning: Introduction Ali Shojaie University of Washington http://faculty.washington.edu/ashojaie/

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Empirical Risk Minimization Algorithms

Empirical Risk Minimization Algorithms Empirical Risk Minimization Algorithms Tirgul 2 Part I November 2016 Reminder Domain set, X : the set of objects that we wish to label. Label set, Y : the set of possible labels. A prediction rule, h:

More information

Scalable Bayesian Event Detection and Visualization

Scalable Bayesian Event Detection and Visualization Scalable Bayesian Event Detection and Visualization Daniel B. Neill Carnegie Mellon University H.J. Heinz III College E-mail: neill@cs.cmu.edu This work was partially supported by NSF grants IIS-0916345,

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

Multimodal Deep Learning for Predicting Survival from Breast Cancer

Multimodal Deep Learning for Predicting Survival from Breast Cancer Multimodal Deep Learning for Predicting Survival from Breast Cancer Heather Couture Deep Learning Journal Club Nov. 16, 2016 Outline Background on tumor histology & genetic data Background on survival

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors

The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors The Performance of a New Hybrid Classifier Based on Boxes and Nearest Neighbors Martin Anthony Department of Mathematics London School of Economics and Political Science Houghton Street, London WC2A2AE

More information

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar Computational Learning Theory: Probably Approximately Correct (PAC) Learning Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory

More information

Multiclass Multilabel Classification with More Classes than Examples

Multiclass Multilabel Classification with More Classes than Examples Multiclass Multilabel Classification with More Classes than Examples Ohad Shamir Weizmann Institute of Science Joint work with Ofer Dekel, MSR NIPS 2015 Extreme Classification Workshop Extreme Multiclass

More information

Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning Hilary Term 2018 Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html

More information

Support Vector Machines. Machine Learning Fall 2017

Support Vector Machines. Machine Learning Fall 2017 Support Vector Machines Machine Learning Fall 2017 1 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost 2 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost Produce

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

GWAS IV: Bayesian linear (variance component) models

GWAS IV: Bayesian linear (variance component) models GWAS IV: Bayesian linear (variance component) models Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS IV: Bayesian

More information

Today. Calculus. Linear Regression. Lagrange Multipliers

Today. Calculus. Linear Regression. Lagrange Multipliers Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject

More information

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics. Plan Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics. Exercise: Example and exercise with herg potassium channel: Use of

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

Model Averaging With Holdout Estimation of the Posterior Distribution

Model Averaging With Holdout Estimation of the Posterior Distribution Model Averaging With Holdout stimation of the Posterior Distribution Alexandre Lacoste alexandre.lacoste.1@ulaval.ca François Laviolette francois.laviolette@ift.ulaval.ca Mario Marchand mario.marchand@ift.ulaval.ca

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Minimization of Boolean Expressions Using Matrix Algebra

Minimization of Boolean Expressions Using Matrix Algebra Minimization of Boolean Expressions Using Matrix Algebra Holger Schwender Collaborative Research Center SFB 475 University of Dortmund holger.schwender@udo.edu Abstract The more variables a logic expression

More information

Variations sur la borne PAC-bayésienne

Variations sur la borne PAC-bayésienne Variations sur la borne PAC-bayésienne Pascal Germain INRIA Paris Équipe SIRRA Séminaires du département d informatique et de génie logiciel Université Laval 11 juillet 2016 Pascal Germain INRIA/SIRRA

More information

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting Estimating the accuracy of a hypothesis Setting Assume a binary classification setting Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y) Train a binary classifier

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

Machine Learning. Linear Models. Fabio Vandin October 10, 2017

Machine Learning. Linear Models. Fabio Vandin October 10, 2017 Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w

More information

Learning from Examples

Learning from Examples Learning from Examples Data fitting Decision trees Cross validation Computational learning theory Linear classifiers Neural networks Nonparametric methods: nearest neighbor Support vector machines Ensemble

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data

More information

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017 Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,

More information

Introduction to Machine Learning

Introduction to Machine Learning Outline Contents Introduction to Machine Learning Concept Learning Varun Chandola February 2, 2018 1 Concept Learning 1 1.1 Example Finding Malignant Tumors............. 2 1.2 Notation..............................

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 06 - Regression & Decision Trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 14, 2015 Today: The Big Picture Overfitting Review: probability Readings: Decision trees, overfiting

More information