Machine Learning in Action


1 Machine Learning in Action. Tatyana Goldberg, August 16.

2 Machine Learning in Biology. Beijing Genomics Institute in Shenzhen, China (June 2014). GenBank: 173,353,076 DNA sequences; UniProt: 69,560,473 protein sequences; GOLD: 49,092 genomes in total. Biological data is growing much faster than our knowledge of biological processes!

3 Naive Bayes. Car theft example: what are the chances your car gets stolen?

    Color   Brand   Age   Stolen?
    Black   BMW     New   Yes
    Black   BMW     Old   Yes
    Green   BMW     New   Yes
    Green   BMW     Old   No
    Black   Lada    Old   No
    Black   Lada    Old   No
    Green   Lada    New   Yes
    Green   Lada    New   No
    Green   Lada    Old   No
    Green   Lada    New   Yes

Sample vectors x = (x_1, ..., x_n) ∈ X; class labels y ∈ Y.

4-5 Naive Bayes. Sample vectors x = (x_1, ..., x_n) ∈ X; class labels y ∈ Y. (Thomas Bayes)

Bayes' rule: p(y|x) = p(x|y) p(y) / p(x), i.e. posterior = likelihood × prior / evidence.

The "naive" assumption is that the features are conditionally independent given the class:
p(x|y) = ∏_{i=1}^{n} p(x_i|y)

The Naive Bayes classifier adds a decision rule:
y_max = arg max_{y ∈ Y} p(x|y) p(y)

6-12 Naive Bayes: the car theft example worked through with p(y|x) = p(x|y) p(y) / p(x). From the table above:

Priors: p(Yes) = 0.5; p(No) = 0.5

Likelihoods:
p(Black|Yes) = 2/5;  p(Black|No) = 2/5
p(Lada|Yes) = 2/5;   p(Lada|No) = 4/5
p(New|Yes) = 4/5;    p(New|No) = 1/5

Queries (z = p(x) is the evidence; it is identical for both classes, so it never has to be computed):
Black Lada New?  p(Yes|Black, Lada, New) = (0.5 · 2/5 · 2/5 · 4/5)/z = 0.064/z vs. p(No|Black, Lada, New) = (0.5 · 2/5 · 4/5 · 1/5)/z = 0.032/z → predict Yes
Green BMW New?  p(Yes|Green, BMW, New) = 0.14/z vs. p(No|Green, BMW, New) = 0.005/z → predict Yes
Green Lada (Age unknown)?  p(Yes|Green, Lada) = 0.12/z vs. p(No|Green, Lada) = 0.24/z → predict No
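
These numbers can be reproduced with a short Python sketch (my illustration, not code from the lecture). It estimates the priors and per-feature likelihoods by counting and, like the slides, uses no smoothing, so an unseen feature value gets probability 0:

```python
from collections import Counter, defaultdict
from math import prod

def train_naive_bayes(samples, labels):
    """Estimate the priors p(y) and the likelihoods p(x_i|y) by counting."""
    priors = {y: c / len(labels) for y, c in Counter(labels).items()}
    likelihoods = {}
    for y in priors:
        rows = [x for x, lab in zip(samples, labels) if lab == y]
        table = defaultdict(float)  # unseen (feature, value) pairs get probability 0 (no smoothing)
        for i in range(len(samples[0])):
            for value, c in Counter(row[i] for row in rows).items():
                table[(i, value)] = c / len(rows)
        likelihoods[y] = table
    return priors, likelihoods

def classify(x, priors, likelihoods):
    """Return arg max_y p(x|y) p(y); the evidence p(x) cancels out."""
    scores = {y: priors[y] * prod(likelihoods[y][(i, v)] for i, v in enumerate(x))
              for y in priors}
    return max(scores, key=scores.get), scores

# The car theft table from the slides.
cars = [("Black", "BMW", "New"), ("Black", "BMW", "Old"), ("Green", "BMW", "New"),
        ("Green", "BMW", "Old"), ("Black", "Lada", "Old"), ("Black", "Lada", "Old"),
        ("Green", "Lada", "New"), ("Green", "Lada", "New"), ("Green", "Lada", "Old"),
        ("Green", "Lada", "New")]
stolen = ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes"]

priors, likelihoods = train_naive_bayes(cars, stolen)
print(classify(("Black", "Lada", "New"), priors, likelihoods))
# -> ('Yes', {'Yes': ≈0.064, 'No': ≈0.032}), matching the slide
```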

13-16 Support Vector Machine. (Vladimir Vapnik. Figure: price vs. age scatter with two classes and candidate separating lines.) Linear separation by building a hyperplane; the hyperplane with the maximum margin is the best. Cortes C. and Vapnik V., Machine Learning (1995).

17-24 Support Vector Machine. (Figure: price vs. age scatter with the hyperplane, its normal vector w, offset b, the margin boundaries y = +1 and y = -1, and the support vectors marked.)

Sample vectors x_i = (x_1, ..., x_n)_i, i ∈ {1, ..., N}; class labels y_i ∈ {-1, +1}.
Separating hyperplane: w^T x_i + b = 0, with
w^T x_i + b ≥ +1 for y_i = +1
w^T x_i + b ≤ -1 for y_i = -1
These can be combined into one constraint: y_i (w^T x_i + b) ≥ 1 for all i.
For samples that violate the margin, slack variables ξ_i relax this to y_i (w^T x_i + b) ≥ 1 - ξ_i for all i (ξ < 1: inside the margin but correctly classified; ξ > 1: misclassified).
The optimization yields the decision function f(x) = sgn( Σ_{i: α_i > 0} α_i y_i x_i^T x + b ), where the support vectors are the samples x_i with α_i > 0.
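
For illustration (not part of the lecture), a minimal scikit-learn sketch of the same soft-margin formulation on invented price/age data; SVC with a linear kernel solves the optimization above and exposes w, b, and the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Invented toy data: columns are (age, price); labels are +1 / -1.
X = np.array([[1, 90], [2, 80], [3, 85], [8, 30], [9, 25], [10, 35]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)  # C penalizes the slack variables xi_i
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]           # hyperplane w^T x + b = 0
print("support vectors:", clf.support_vectors_)  # the x_i with alpha_i > 0
print("prediction:", clf.predict([[5, 60]]))     # sign of the decision function
```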

25 Localization Predictions Are Indispensable. High-throughput methods cost money and are neither accurate nor complete; hence, computational prediction. (Figure: fluorescence image of the intracellular mitochondrial network, microtubules, and nuclei; a sequence such as MTSHSYYKDRLGFDPNEQQPGSNNSMKRSSSRQTTHHHQSYHHATTSSSQSPARISVSPGGNNGTLEYQQVQRENNW goes into a predictor, which outputs the protein's function.)

26 The Signal Hypothesis. Günter Blobel received the 1999 Nobel Prize in Physiology or Medicine "for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell".

27 Signal Peptide Prediction in Eukaryotes. (Figure: sequence logo created using the SignalP 4.0 data set.) SignalP: prediction of signal peptides and their cleavage sites. Henrik Nielsen, Gunnar von Heijne, Søren Brunak.

28 Data Preprocessing. N-terminal sequences (40 residues) of positive and negative samples:

MKSLKSSTHDVPHPEHVVWAPPAYDEQHHLFFSHGTVLIG +1
MAQSVTAFQAAYISIEVLIALVSVPGNILVIWAVKMNQAL +1
MGRKSLYLLIVGILIAYYIYTPLPDNVEEPWRMMWINAHL +1
MDNDGGAPPPPPTLVVEEPKKAEIRGVAFKELFRFADGLD +1
MGFEPLDWYCKPVPNGVWTKTVDYAFGAYTPCAIDSFVLG +1
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVWQQH -1
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG -1
MVEMLPTVAVLVLAVSVVAKDNTTCDGPCGLRFRQNSQAG -1
MNSNLPAENLTIAVNMTKTLPTAVTHGFNSTNDPPSMSIT -1
MALHMILVMVSLLPLLEAQNPEHVNITIGDPITNETLSWL -1
MANKLFLVSATLAFFFLLTNASIYRTIVEVDEDDATNPAG -1

Convert the sequences into feature vectors, e.g. the amino acid composition:
Sequence MPPPAD → 20-dimensional feature vector [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0]
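
The featurization step admits a very small sketch (assumed code, not from the lecture; the ordering of the 20 positions is a convention, and the slide evidently uses a different ordering than the alphabetical one below):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one fixed ordering of the 20 amino acids

def composition(sequence):
    """Map a sequence to its 20-dimensional amino acid count vector."""
    return [sequence.count(aa) for aa in AMINO_ACIDS]

print(composition("MPPPAD"))  # 1 for A, D, and M, 3 for P, 0 elsewhere
```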

29 Visualize Your Data. (Figure: 2D scatterplots for all attribute pairs, Attr. 1 through Attr. 4 plus Class; colors are the labels, +1/-1.) Some clusters are easily separable (Class against any attribute, Attr. 1 against Attr. 3), whereas the clusters of Attr. 3 against Attr. 4 are much harder to separate. A scatterplot matrix can be produced as shown below.
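
One way to get such a plot is pandas' scatter_matrix; the sketch below uses made-up stand-in data, since the lecture's data set is not included here:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # stand-in for four attributes
y = np.where(X[:, 0] + X[:, 2] > 0, 1, -1)  # stand-in labels

df = pd.DataFrame(X, columns=["Attr. 1", "Attr. 2", "Attr. 3", "Attr. 4"])
colors = np.where(y == 1, "tab:blue", "tab:red")   # color points by class label
scatter_matrix(df, c=colors, diagonal="hist", figsize=(7, 7))
plt.show()
```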

30-31 Performance: Overfitting (theory). From the PredictProtein lecture by Prof. Rost. (Figure: prediction performance as a function of the number of free parameters, on the training data and on a new data set.) Overfitting: the predictor fits the training data too well, typically because there are many more free parameters than samples, and then predicts poorly on new data. Rule of thumb: have more than 10 samples per free parameter. Check for overfitting using cross-validation: split the data set into Train and Test.

32 Stratified k-fold Cross-validation. (Figure: Data split into Train (k-1 folds) and Test (1 fold), yielding a performance estimate.) k-fold: a random partitioning of the data into k equally sized subsets; each subset is used for testing exactly once, with the remaining k-1 subsets used for training. Stratified: each fold has the same proportion of each class value.
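
A hedged scikit-learn sketch of the procedure (stand-in data; the estimator and fold count are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))    # stand-in for 100 feature vectors
y = np.where(X[:, 0] > 0, 1, -1)  # stand-in labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # class proportions preserved per fold
scores = cross_val_score(GaussianNB(), X, y, cv=cv)             # each sample is tested exactly once
print(scores.mean(), scores.std())
```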

33 Performance Metric: AUC.

Confusion matrix (observed vs. predicted):
               predicted +1   predicted -1
observed +1    TP             FN
observed -1    FP             TN

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

ROC: Receiver Operating Characteristic, the curve of Sensitivity against 1 - Specificity obtained by sliding the decision threshold over the classifier's confidence values. AUC: Area Under the ROC Curve. It is threshold independent and better than accuracy, coverage, F1, etc. (Figure: a perfect classifier hugs the top-left corner, a random one follows the diagonal, a bad one falls below it.)
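
Both the ROC curve and the AUC are one call each in scikit-learn; the labels and confidence scores below are invented for illustration:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# y_true: observed labels (+1/-1); y_score: the classifier's confidence for +1.
y_true  = [+1, +1, +1, -1, -1, -1, +1, -1]
y_score = [0.9, 0.8, 0.35, 0.4, 0.1, 0.2, 0.7, 0.6]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # (1 - specificity, sensitivity) per threshold
print("AUC =", roc_auc_score(y_true, y_score))     # 1.0 perfect, 0.5 random; 0.875 for this toy data
```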

34 WEKA. What is Weka? Weka is a bird found only in New Zealand, and the Waikato Environment for Knowledge Analysis: a machine learning workbench for data mining tasks. 100+ algorithms for classification; 75+ for data pre-processing and analysis; 20+ for clustering, finding association rules, etc.; 25 to assist with feature selection. Java-based and supports multiple platforms. Ian H. Witten, Eibe Frank.

35 WEKA in Action: Live Demo Presentation.

36 ML Workflow. (Flowchart:) Data → select a subset of the data → plot the data, check for abnormalities → invent features → run the classifier → get performance. Happy? If not, iterate. If yes: take the remaining data, run the classifier(s), and compare to random and to competitors. Happy? If not, iterate; if yes, publish the predictor.

37 Goldberg T, Nielsen H, Rost B, et al. (2014). LocTree3 prediction of localization. Nucleic Acids Research.
