Machine Learning in Action. Tatyana Goldberg (goldberg@rostlab.org). August 16, 2016.
Machine Learning in Biology. Beijing Genomics Institute, Shenzhen, China, June 2014.

GenBank [1]: 173,353,076 DNA sequences
UniProt [2]: 69,560,473 protein sequences
GOLD [3]: 49,092 genomes in total

Biological data is growing much faster than our knowledge of biological processes!

[1] http://www.ncbi.nlm.nih.gov/genbank/
[2] http://www.uniprot.org/
[3] http://www.genomesonline.org/cgi-bin/gold/index.cgi
Naive Bayes

Car theft example: what are the chances your car gets stolen?

Color  Brand  Age  Stolen?
Black  BMW    New  Yes
Black  BMW    Old  Yes
Green  BMW    New  Yes
Green  BMW    Old  No
Black  Lada   Old  No
Black  Lada   Old  No
Green  Lada   New  Yes
Green  Lada   New  No
Green  Lada   Old  No
Green  Lada   New  Yes

Sample vectors x = (x_1, ..., x_n) ∈ X; class labels y ∈ Y.
(Images: http://www.vistabmw.com, http://ramario.nichost.ru)
Naive Bayes

Sample vectors x = (x_1, ..., x_n) ∈ X; class labels y ∈ Y.

Bayes' rule:  p(y|x) = p(x|y) p(y) / p(x)
where p(y|x) is the posterior, p(x|y) the likelihood, p(y) the prior and p(x) the evidence.

The "naive" assumption: p(x|y) = ∏_{i=1}^{n} p(x_i|y).

The Naive Bayes classifier adds a decision rule:
  y_max = argmax_{y ∈ Y} p(x|y) p(y)

(Thomas Bayes, http://www.wikipedia.org/)
Naive Bayes: worked example

Using the ten training samples from the table above and p(y|x) = p(x|y) p(y) / p(x):

Priors:  p(Yes) = 0.5,  p(No) = 0.5

Likelihoods:
  p(Black|Yes) = 2/5    p(Black|No) = 2/5
  p(Lada|Yes)  = 2/5    p(Lada|No)  = 4/5
  p(New|Yes)   = 4/5    p(New|No)   = 1/5

Queries (z denotes the evidence p(x)):
  Black Lada New:          p(Yes|x) = 0.064/z    p(No|x) = 0.032/z   → Yes
  Green BMW New:           p(Yes|x) = 0.144/z    p(No|x) = 0.012/z   → Yes
  Green Lada, Age unknown: p(Yes|x) = 0.12/z     p(No|x) = 0.24/z    → No
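The worked example can be reproduced in a short pure-Python sketch; the table and the naive-Bayes scoring rule p(y) · ∏ p(x_i|y) are taken directly from the slides above.

```python
# Training data from the slide: (Color, Brand, Age, Stolen?)
data = [
    ("Black", "BMW", "New", "Yes"), ("Black", "BMW", "Old", "Yes"),
    ("Green", "BMW", "New", "Yes"), ("Green", "BMW", "Old", "No"),
    ("Black", "Lada", "Old", "No"), ("Black", "Lada", "Old", "No"),
    ("Green", "Lada", "New", "Yes"), ("Green", "Lada", "New", "No"),
    ("Green", "Lada", "Old", "No"), ("Green", "Lada", "New", "Yes"),
]

def naive_bayes_score(features, label):
    """Unnormalized posterior: p(label) * prod_i p(feature_i | label)."""
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)          # the prior p(label)
    for i, f in enumerate(features):       # one likelihood factor per attribute
        score *= sum(1 for r in rows if r[i] == f) / len(rows)
    return score

score_yes = naive_bayes_score(("Black", "Lada", "New"), "Yes")  # ≈ 0.064
score_no = naive_bayes_score(("Black", "Lada", "New"), "No")    # ≈ 0.032
```

Comparing the two unnormalized scores implements the argmax decision rule; the evidence z = p(x) cancels out.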
Support Vector Machine (Vladimir Vapnik, http://www.nec-labs.com)

- Linear separation by building a hyperplane
- The hyperplane with the maximum margin is the best

(Figure: two classes of points, Price vs. Age, separated by a hyperplane.)
Cortes C. and Vapnik V., Machine Learning (1995)
Support Vector Machine

Sample vectors x_i = (x_1, ..., x_n)_i, i ∈ {1, ..., N}; class labels y_i ∈ {-1, +1}.

Separating hyperplane: w^T x_i + b = 0, with
  w^T x_i + b ≥ +1 for y_i = +1
  w^T x_i + b ≤ -1 for y_i = -1

These can be combined to:  y_i (w^T x_i + b) ≥ 1 for all i.

For non-separable data, slack variables ξ_i relax this to y_i (w^T x_i + b) ≥ 1 - ξ_i for all i; points with 0 < ξ < 1 lie inside the margin, points with ξ > 1 are misclassified.

(Figure: two classes, Price vs. Age, separated by a hyperplane with normal vector w and offset b.)
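The slack variables can be computed directly as ξ_i = max(0, 1 - y_i (w^T x_i + b)). A minimal sketch, assuming a hand-picked (hypothetical) hyperplane w, b rather than one obtained by optimization:

```python
# Assumed 2-D hyperplane x1 = 2, i.e. w = (1, 0), b = -2 (hypothetical values)
w, b = [1.0, 0.0], -2.0

def slack(x, y):
    """xi = max(0, 1 - y*(w.x + b)); 0 means the margin constraint holds."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

xi_outside = slack([4.0, 1.0], +1)  # 0.0: outside the margin
xi_inside = slack([2.5, 0.0], +1)   # 0.5: inside the margin (0 < xi < 1)
xi_wrong = slack([1.0, 0.0], +1)    # 2.0: misclassified (xi > 1)
```

The three sample points illustrate exactly the three cases named above: constraint satisfied, inside the margin, and misclassified.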
Support Vector Machine

("... and then a miracle happens" cartoon: http://media.philly.com/images/and_then_a_miracle_happens_cartoon.jpg)
Support Vector Machine

The optimization (subject to y_i (w^T x_i + b) ≥ 1 - ξ_i for all i) yields coefficients α_i; the support vectors are exactly the samples x_i with α_i > 0.

Decision function:  f(x) = sgn( Σ_{i: α_i > 0} α_i y_i x_i^T x + b )

(Figure: as before, with the support vectors highlighted on the margin.)
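The decision function can be sketched in pure Python. The support vectors, coefficients α_i and bias b below are hand-picked assumptions for illustration, not the result of an actual SVM optimization:

```python
# Hypothetical support vectors as (x_i, y_i, alpha_i); implied boundary x1 + x2 = 2
support = [([1.0, 1.0], +1, 0.5), ([3.0, 3.0], -1, 0.5)]
b = 2.0

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def decide(x):
    """f(x) = sgn( sum_i alpha_i * y_i * <x_i, x> + b ) over support vectors."""
    s = sum(alpha * y * dot(xi, x) for xi, y, alpha in support) + b
    return +1 if s >= 0 else -1

decide([0.0, 0.0])  # +1, below the boundary
decide([4.0, 4.0])  # -1, above the boundary
```

Note that only the support vectors enter the sum; all other training points have α_i = 0 and are irrelevant at prediction time.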
Localization Predictions Indispensable

High-throughput experimental methods cost money, and their results are neither fully accurate nor complete. This motivates computational prediction: a predictor maps a protein sequence to its function/localization.

(Figure: intracellular mitochondrial network, microtubules, nuclei; http://www.microscopyu.com)

Example input sequence:
MTSHSYYKDRLGFDPNEQQPGSNNSMKRSSSRQTTHHHQSYHHATTSSSQSPARISVSPGGNNGTLEYQQVQRENNW
The Signal Hypothesis

The Nobel Prize was awarded to Günter Blobel "for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell" (http://www.nobelprize.org/).
Signal Peptide Prediction in Eukaryotes

(Sequence logo created with http://weblogo.berkeley.edu/ using the SignalP 4.0 data set.)

SignalP v. 1.1-4.0: prediction of signal peptides and their cleavage sites. Henrik Nielsen, Gunnar von Heijne, Søren Brunak (http://www.ebi.ac.uk, http://www.sbc.su.se/, http://www.cbs.dtu.dk)
Data Preprocessing

Take the 40 N-terminal residues of positive and negative samples:

MKSLKSSTHDVPHPEHVVWAPPAYDEQHHLFFSHGTVLIG  +1
MAQSVTAFQAAYISIEVLIALVSVPGNILVIWAVKMNQAL  +1
MGRKSLYLLIVGILIAYYIYTPLPDNVEEPWRMMWINAHL  +1
MDNDGGAPPPPPTLVVEEPKKAEIRGVAFKELFRFADGLD  +1
MGFEPLDWYCKPVPNGVWTKTVDYAFGAYTPCAIDSFVLG  +1
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVWQQH  -1
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG  -1
MVEMLPTVAVLVLAVSVVAKDNTTCDGPCGLRFRQNSQAG  -1
MNSNLPAENLTIAVNMTKTLPTAVTHGFNSTNDPPSMSIT  -1
MALHMILVMVSLLPLLEAQNPEHVNITIGDPITNETLSWL  -1
MANKLFLVSATLAFFFLLTNASIYRTIVEVDEDDATNPAG  -1

Convert sequences into feature vectors, e.g. the amino acid composition: the sequence MPPPAD becomes the 20-dimensional feature vector [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0].
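The composition encoding can be sketched in a few lines. The alphabetical amino-acid ordering below is an assumption; the slide does not specify its column order, so individual indices may differ from the printed vector:

```python
# Assumed alphabetical ordering of the 20 standard amino acids (hypothetical;
# the slide's exact column order is not stated)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Map a protein sequence to a 20-dimensional amino-acid count vector."""
    return [seq.count(aa) for aa in AMINO_ACIDS]

vec = composition("MPPPAD")
# one A, one D, one M and three P; all other counts are zero
```

A composition vector discards the residue order, which is why fixed-length windows (like the 40 N-terminal residues above) are often encoded per position instead when order matters.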
Visualize Your Data

(Figure: 2D scatterplots for all attribute pairs, Attr. 1 to Attr. 4 plus the class; colors are the class labels +1/-1.)

Some clusters are easily separable (class vs. any attribute, Attr. 1 vs. Attr. 3), whereas the clusters of Attr. 3 vs. Attr. 4 are much harder to separate.
Performance: Overfitting

(Theory: PredictProtein lecture, Prof. Rost. Figure: error vs. number of free parameters; data set split into Train and Test.)

- The predictor fits the training data too well
- Many more free parameters than samples
- Poor prediction on new data sets
- Rule of thumb: the number of samples should exceed 10 times the number of free parameters
- Check for overfitting using cross-validation
Stratified k-fold Cross-validation

(Figure: data split into k folds; train on k-1 folds, test on the held-out fold, measure performance.)

- k folds: randomly partition the data into k equally sized subsets; use each subset for testing exactly once and the remaining k-1 subsets for training
- Stratified: each fold has the same proportion of each class value
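The stratified partitioning described above can be sketched in pure Python; dealing each class's shuffled members round-robin across folds is one simple way to keep the class proportions equal:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Return k folds (lists of sample indices) such that each fold keeps
    roughly the overall class proportions and each index appears exactly once."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)                 # randomize within each class
        for pos, idx in enumerate(idxs):  # deal class members round-robin
            folds[pos % k].append(idx)
    return folds

labels = [+1] * 10 + [-1] * 10
folds = stratified_kfold(labels, k=5)
# each of the 5 folds holds 4 samples: 2 positives and 2 negatives
```

Each fold then serves as the test set exactly once, with the union of the other k-1 folds as the training set.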
Performance Metric: AUC

Confusion matrix (observed vs. predicted):

              predicted +1    predicted -1
observed +1        TP              FN
observed -1        FP              TN

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

ROC (Receiver Operating Characteristic): the curve of Sensitivity against 1 - Specificity obtained by sliding the decision threshold over the classifier's confidence values. AUC (Area Under the ROC Curve) is threshold independent and better suited for evaluation than Acc, Cov, F1, etc.; a perfect classifier reaches 1.0, a random one 0.5.
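A minimal sketch of the two metrics, plus a rank-based AUC: the probability that a randomly chosen positive is scored above a randomly chosen negative, which equals the area under the ROC curve:

```python
def sensitivity_specificity(pairs, threshold):
    """pairs: (true_label, confidence); predict +1 when confidence >= threshold."""
    tp = sum(1 for y, s in pairs if y == +1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == +1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == -1 and s < threshold)
    fp = sum(1 for y, s in pairs if y == -1 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(pairs):
    """Rank-based AUC: fraction of positive/negative pairs ranked correctly
    (ties count 1/2); no explicit threshold sweep is needed."""
    pos = [s for y, s in pairs if y == +1]
    neg = [s for y, s in pairs if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions: three positives, three negatives, one ranking error (0.7)
pairs = [(+1, 0.9), (+1, 0.8), (-1, 0.7), (+1, 0.6), (-1, 0.4), (-1, 0.2)]
```

On this toy data, eight of the nine positive/negative pairs are ordered correctly, so the AUC is 8/9; sensitivity and specificity, by contrast, change with the chosen threshold.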
WEKA

What is Weka? Weka is a bird found only in New Zealand; here it stands for the Waikato Environment for Knowledge Analysis, a machine learning workbench for data mining tasks (http://www.cs.waikato.ac.nz/ml/weka/):

- 100+ algorithms for classification
- 75+ for data pre-processing and analysis
- 20+ for clustering, finding association rules, etc.
- 25 to assist with feature selection

Java-based and supports multiple platforms. Ian H. Witten (http://www.sai.com.ar), Eibe Frank (http://arnetminer.org)
WEKA in Action: Live Demo Presentation
ML Workflow

Data → select a subset of the data → invent features → plot the data and check for abnormalities → happy? If not, revise. → take the remaining data → run the classifier(s) → get performance → compare to random and to competitors → happy? If not, revise. → publish the predictor
Goldberg T, Nielsen H, Rost B, et al. (2014). LocTree3 prediction of localization. Nucleic Acids Research.