Machine Learning in Action


Machine Learning in Action. Tatyana Goldberg (goldberg@rostlab.org), August 16, 2016.

Machine Learning in Biology
Beijing Genomics Institute in Shenzhen, China, June 2014
GenBank [1]: 173,353,076 DNA sequences. UniProt [2]: 69,560,473 protein sequences. GOLD [3]: 49,092 genomes in total.
Biological data is growing much faster than our knowledge of biological processes!
[1] http://www.ncbi.nlm.nih.gov/genbank/  [2] http://www.uniprot.org/  [3] http://www.genomesonline.org/cgi-bin/gold/index.cgi

Naive Bayes
Car theft example: what are the chances your car gets stolen?

Color  Brand  Age  Stolen?
Black  BMW    New  Yes
Black  BMW    Old  Yes
Green  BMW    New  Yes
Green  BMW    Old  No
Black  Lada   Old  No
Black  Lada   Old  No
Green  Lada   New  Yes
Green  Lada   New  No
Green  Lada   Old  No
Green  Lada   New  Yes

Sample vectors x = (x_1, ..., x_n) ∈ X; class labels y ∈ Y.
(Images: http://www.vistabmw.com, http://ramario.nichost.ru)

Naive Bayes
Sample vectors x = (x_1, ..., x_n) ∈ X; class labels y ∈ Y.

Bayes' rule:  p(y|x) = p(x|y) p(y) / p(x)
(posterior = likelihood × prior / evidence)

The "naive" assumption is that the features are conditionally independent given the class:
p(x|y) = ∏_{i=1..n} p(x_i|y)

The Naive Bayes classifier adds a decision rule:
y_max = arg max_{y ∈ Y} p(x|y) p(y)

(Thomas Bayes, http://www.wikipedia.org/)

Naive Bayes
Using the car-theft table above and p(y|x) = p(x|y) p(y) / p(x):

p(Yes) = 0.5              p(No) = 0.5
p(Black|Yes) = 2/5        p(Black|No) = 2/5
p(Lada|Yes)  = 2/5        p(Lada|No)  = 4/5
p(New|Yes)   = 4/5        p(New|No)   = 1/5

Queries (z is the evidence p(x), identical for both classes):
Black, Lada, New:  p(Yes|x) = 0.064/z  vs.  p(No|x) = 0.032/z  → Yes
Green, BMW, New:   p(Yes|x) = 0.144/z  vs.  p(No|x) = 0.012/z  → Yes
Green, Lada, ?:    p(Yes|x) = 0.12/z   vs.  p(No|x) = 0.24/z   → No   (the unknown Age attribute is simply dropped)
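A minimal sketch of the computation above in plain Python (the table and the 0.064 vs. 0.032 scores come from the slide; the rest, such as how a missing attribute is skipped, is an implementation choice made here):

```python
# Car-theft toy data from the slide: (color, brand, age) -> stolen?
data = [
    ("Black", "BMW",  "New", "Yes"), ("Black", "BMW",  "Old", "Yes"),
    ("Green", "BMW",  "New", "Yes"), ("Green", "BMW",  "Old", "No"),
    ("Black", "Lada", "Old", "No"),  ("Black", "Lada", "Old", "No"),
    ("Green", "Lada", "New", "Yes"), ("Green", "Lada", "New", "No"),
    ("Green", "Lada", "Old", "No"),  ("Green", "Lada", "New", "Yes"),
]

def naive_bayes_score(x, label):
    """Unnormalized p(y) * prod_i p(x_i | y); attributes given as None are skipped."""
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                      # prior p(y)
    for i, value in enumerate(x):
        if value is None:                              # unknown attribute, e.g. Age = '?'
            continue
        count = sum(1 for r in rows if r[i] == value)  # likelihood p(x_i | y)
        score *= count / len(rows)
    return score

for query in [("Black", "Lada", "New"), ("Green", "BMW", "New"), ("Green", "Lada", None)]:
    scores = {y: naive_bayes_score(query, y) for y in ("Yes", "No")}
    print(query, scores, "->", max(scores, key=scores.get))
# ('Black', 'Lada', 'New') gives 0.064 (Yes) vs. 0.032 (No) -> Yes
```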

Support Vector Machine
Vladimir Vapnik (http://www.nec-labs.com)
[Plot: price vs. age, two classes of points and a query point '?']
- Linear separation by building a hyperplane
- The hyperplane with the maximum margin is the best
Cortes C. and Vapnik V., Machine Learning (1995)


Support Vector Machine
Sample vectors x_i = (x_1, ..., x_n)_i, i ∈ {1, ..., N}; class labels y_i ∈ {-1, +1}.
Separating hyperplane: w^T x + b = 0
[Plot: price vs. age; the two classes (y = +1, y = -1) lie on opposite sides of the hyperplane; w is the normal vector, b the offset.]

w^T x_i + b ≥ +1  for y_i = +1
w^T x_i + b ≤ -1  for y_i = -1

These can be combined to:  y_i (w^T x_i + b) ≥ 1  for all i

With slack variables ξ_i for samples that violate the margin (ξ < 1: inside the margin but correctly classified, ξ > 1: misclassified):
y_i (w^T x_i + b) ≥ 1 - ξ_i  for all i

Support Vector Machine
[Cartoon: "then a miracle occurs", poking fun at the optimization step; http://media.philly.com/images/and_then_a_miracle_happens_cartoon.jpg]

Support Vector Machine
Sample vectors x_i = (x_1, ..., x_n)_i, i ∈ {1, ..., N}; class labels y_i ∈ {-1, +1}.
Separating hyperplane: w^T x + b = 0, with the margin constraints
y_i (w^T x_i + b) ≥ 1 - ξ_i  for all i.

Optimization of the margin yields coefficients α_i; the samples with α_i > 0 are the support vectors (the points on or inside the margin in the plot).

Decision function:  f(x) = sgn( Σ_{i: α_i > 0} α_i y_i x_i^T x + b )
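A minimal sketch, assuming scikit-learn is available; the (age, price) points are invented purely for illustration, and the linear soft-margin SVC stands in for the formulation above (the support vectors are the samples with α_i > 0):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up (age, price) samples with labels +1 / -1, just to illustrate the API
X = np.array([[1, 9], [2, 8], [2, 10], [3, 9],      # class +1
              [7, 2], [8, 3], [8, 1], [9, 2]])      # class -1
y = np.array([+1, +1, +1, +1, -1, -1, -1, -1])

# Linear soft-margin SVM; C controls the penalty on the slack variables xi_i
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors (x_i with alpha_i > 0):", clf.support_vectors_)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])

# Decision function f(x) = sgn(sum_i alpha_i y_i x_i^T x + b) = sgn(w^T x + b)
x_new = np.array([[4, 7]])
print("f(x_new) =", np.sign(clf.decision_function(x_new))[0])
```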

Localization Predictions Are Indispensable
High-throughput methods cost money, are not accurate, and are not complete → computational prediction.
[Image: intracellular mitochondrial network, microtubules, nuclei; http://www.microscopyu.com]
Flow: protein sequence (MTSHSYYKDRLGFDPNEQQPGSNNSMKRSSSRQTTHHHQSYHHATTSSSQSPARISVSPGGNNGTLEYQQVQRENNW) → Predictor → protein function.

The Signal Hypothesis
Günter Blobel was awarded the 1999 Nobel Prize in Physiology or Medicine "for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell". (http://www.nobelprize.org/)

Signal Peptide Prediction (Eukaryotes)
[Sequence logo created with http://weblogo.berkeley.edu/ using the SignalP 4.0 data set]
SignalP v. 1.1–4.0: prediction of signal peptides and their cleavage sites.
Henrik Nielsen, Gunnar von Heijne, Søren Brunak
http://www.ebi.ac.uk  http://www.sbc.su.se/  http://www.cbs.dtu.dk

Data Preprocessing
N-terminal sequences (first 40 residues) of positive and negative samples:
MKSLKSSTHDVPHPEHVVWAPPAYDEQHHLFFSHGTVLIG  +1
MAQSVTAFQAAYISIEVLIALVSVPGNILVIWAVKMNQAL  +1
MGRKSLYLLIVGILIAYYIYTPLPDNVEEPWRMMWINAHL  +1
MDNDGGAPPPPPTLVVEEPKKAEIRGVAFKELFRFADGLD  +1
MGFEPLDWYCKPVPNGVWTKTVDYAFGAYTPCAIDSFVLG  +1
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVWQQH  -1
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG  -1
MVEMLPTVAVLVLAVSVVAKDNTTCDGPCGLRFRQNSQAG  -1
MNSNLPAENLTIAVNMTKTLPTAVTHGFNSTNDPPSMSIT  -1
MALHMILVMVSLLPLLEAQNPEHVNITIGDPITNETLSWL  -1
MANKLFLVSATLAFFFLLTNASIYRTIVEVDEDDATNPAG  -1

Convert the sequences into feature vectors, e.g. amino acid composition:
Sequence MPPPAD → 20-dimensional feature vector [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0]
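A minimal sketch of the conversion, assuming the 20 amino acids are counted in the ARNDCQEGHILKMFPSTWYV order, which reproduces the MPPPAD vector shown above (raw counts, as on the slide; normalizing to frequencies would be an equally valid design choice):

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # order that reproduces the slide's example vector

def aa_composition(sequence):
    """20-dimensional amino-acid-composition feature vector (raw counts)."""
    seq = sequence.upper()
    return [seq.count(aa) for aa in AMINO_ACIDS]

print(aa_composition("MPPPAD"))
# -> [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0]
```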

Visualize Your Data
[Scatterplot matrix: 2D scatterplots for all attribute pairs (Class, Attr. 1 – Attr. 4); colors are the labels (+1/-1).]
Some clusters are easily separable (class vs. any attribute, Attr. 1 vs. Attr. 3), whereas the clusters of Attr. 3 vs. Attr. 4 are much harder to separate.

Performance: Overfitting (theory)
(PredictProtein lecture, Prof. Rost)
[Plot: error vs. number of free parameters, for the training data and for the test data.]
Overfitting: the predictor fits the training data too well; with many more free parameters than samples, prediction on new data sets is poor.
Rule of thumb: #samples > 10 × #free parameters.
Check for overfitting using cross-validation: split the data set into Train and Test.
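A minimal sketch of the effect, assuming only NumPy: a polynomial with as many free parameters as training samples fits them exactly, but held-out points typically expose the poor generalization:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(20)   # noisy toy data

x_train, y_train = x[::2], y[::2]    # 10 training points
x_test, y_test = x[1::2], y[1::2]    # 10 held-out points

for degree in (3, 9):                # 4 vs. 10 free parameters for 10 training samples
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
# The degree-9 fit drives the training error to ~0 but usually shows the larger test error.
```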

Stratified k-fold Cross-validation
Train on k-1 folds, test on the remaining fold → performance.
k-fold: a random partitioning of the data into k equally sized subsets; each subset is used for testing exactly once, with the remaining k-1 subsets used for training.
Stratified: each fold has the same proportion of each class value.
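A minimal sketch, assuming scikit-learn; the Gaussian Naive Bayes model, the 5 folds, and the random toy data are placeholders chosen here — the point is that StratifiedKFold keeps the class proportions equal across folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X = np.random.default_rng(1).normal(size=(100, 20))   # e.g. 100 proteins x 20 composition features
y = np.array([+1] * 30 + [-1] * 70)                   # imbalanced class labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    # each fold keeps ~30% positives, mirroring the full data set
    model = GaussianNB().fit(X[train_idx], y[train_idx])        # train on k-1 folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))    # test on the held-out fold

# With random features the accuracy is near chance; this only shows the mechanics.
print("mean accuracy over 5 folds:", np.mean(accuracies))
```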

Performance Metric: AUC

Confusion matrix (observed vs. predicted):
              predicted +1   predicted -1
observed +1       TP             FN
observed -1       FP             TN

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

[Plot: the classifier's confidence scores for the two classes; moving the threshold trades TP/FN against TN/FP.]

ROC: Receiver Operating Characteristic — Sensitivity plotted against (1 - Specificity) as the threshold is varied.
AUC: Area Under the ROC Curve — threshold independent and better suited for comparison than accuracy, coverage, F1, etc. A perfect classifier has AUC = 1.0; a random one ≈ 0.5.
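A minimal sketch of the definitions, assuming scikit-learn for the AUC; the labels and confidence scores are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([+1, +1, +1, -1, -1, +1, -1, -1])           # observed classes
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])  # classifier's confidence

# Confusion matrix at one fixed threshold
threshold = 0.5
y_pred = np.where(scores >= threshold, +1, -1)
tp = np.sum((y_pred == +1) & (y_true == +1))
fn = np.sum((y_pred == -1) & (y_true == +1))
fp = np.sum((y_pred == +1) & (y_true == -1))
tn = np.sum((y_pred == -1) & (y_true == -1))

sensitivity = tp / (tp + fn)     # TP / (TP + FN)
specificity = tn / (tn + fp)     # TN / (TN + FP)

# AUC sweeps the threshold, so it does not depend on any single cutoff
auc = roc_auc_score(y_true, scores)
print(sensitivity, specificity, auc)   # AUC: 1.0 is perfect, ~0.5 is random
```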

WEKA
What is WEKA? A weka is a bird found only in New Zealand; here it stands for the Waikato Environment for Knowledge Analysis, a machine learning workbench for data mining tasks:
- 100+ algorithms for classification
- 75+ for data pre-processing and analysis
- 20+ for clustering, finding association rules, etc.
- 25 to assist with feature selection
Java-based and supports multiple platforms.
http://www.cs.waikato.ac.nz/ml/weka/
Ian H. Witten (http://www.sai.com.ar), Eibe Frank (http://arnetminer.org)

WEKA in Action: live demo presentation.

ML Workflow
Data → select a subset of the data → plot the data (check for abnormalities) and invent features → run a classifier → get its performance → happy? (nope: revise and repeat; yes: continue) →
run the classifier(s) on the remaining data → compare to random and to competitors → happy? (nope: revise and repeat; yes: publish the predictor).

Goldberg T, Nielsen H, Rost B, et al. (2014). LocTree3 prediction of localization. Nucleic Acids Research.