Class 4: Classification. Quaid Morris, February 11th, 2011. ML4Bio



Overview: basic concepts in classification (overfitting, cross-validation, evaluation); Linear Discriminant Analysis and Quadratic Discriminant Analysis; Logistic Regression; (next class, probably) other classification algorithms (Decision Trees, Support Vector Machines, Multilayer Perceptrons, ensemble classifiers).

Problem Set #2. 1) Recall that the Wilcoxon signed-rank statistic depends on W+, the sum of the ranks of the positive values of X_i, and W-, the sum of the ranks of the negative values of X_i. We talked about two test statistics: S = W+ - W- and min(W-, W+). These test statistics have different null distributions, but given the same data they can be used to generate the same P-values. Is the same true for W-? Why or why not?

Problem Set #2. 2) The Wilcoxon sign test tests whether or not the distribution of X_i has a median of 0 by comparing the number of positive values of X_i to the number of negative values. Intuitively, there should be about an equal number of each. What is the null distribution of the number of positive values of X_i in N samples? 2b) What's the P-value associated with observing 4 positive values of X_i out of 10? How about 2 out of 20?
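
A quick way to see the null distribution in question 2: under the null, each X_i is positive with probability 1/2, so the count of positives in N samples is Binomial(N, 1/2). Below is a minimal plain-Python sketch (the function name is mine, and it uses the two-sided convention of summing all outcomes at least as probable as the observed count):

    from math import comb

    def sign_test_pvalue(k, n):
        # Exact two-sided p-value for k positives out of n under the
        # Binomial(n, 1/2) null distribution of the sign test.
        pmf = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
        # sum the probabilities of all outcomes at least as extreme as k
        return sum(p for p in pmf if p <= pmf[k] + 1e-12)

    print(sign_test_pvalue(4, 10))   # e.g. 4 positives out of 10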

Problem Set #2 3) The P-value is not the probability of the incorrect rejection of the null hypothesis. Can you explain why? 5) Can the FDR correction ever lead to fewer rejections than the FWER correction? If not, then why not? If so, please give an example.

Classification example I: Predicting gene function from expression profiles. Microarray profiles are relatively easily measured and reflect function. Zhang et al (J Biol 2004)

Pattern detection: profiles of RNA splicing genes vs. non-RNA-splicing genes. What distinguishes these two sets of profiles?

Classification example II: Classifying cancer from cellular profile. Microarray profiles can be used to subcategorize cancer (leukemia). (figure: heatmap of normalized expression) Golub et al (Science 1999)

Classification in a nutshell. Input X (features, aka covariates; e.g. a microarray profile) feeds a classification algorithm (e.g. neural network, SVM, KNN) with parameters Θ (aka coefficients, weights), producing an output Y (aka discriminant value, confidence) that is compared against a threshold η. Goal: find parameters that make outputs predictive of targets on a training set of matched inputs and labeled target values.

Formal definition. Given: 1. a training set {(X_1, t_1), (X_2, t_2), ..., (X_N, t_N)} of matched inputs X_i and target labels t_i (t_i = 0 or 1); 2. a classification procedure represented by a discriminant function f(X; Θ) and a threshold η, so that I[f(X; Θ) > η] is the predicted label given input X. Goal: set Θ to maximize the agreement between the predicted target labels and the actual target labels on the training set. I[H] is a function that has value 1 if the statement H is true, and otherwise has value 0.
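
As a concrete sketch of this definition (numpy; the linear form of f, the toy data, and all parameter values here are invented for illustration, not taken from the slides):

    import numpy as np

    def f(X, theta):
        # one illustrative choice of discriminant function: linear in X
        return X @ theta

    def predict(X, theta, eta):
        # I[f(X; theta) > eta]: predicted label is 1 when the
        # discriminant value exceeds the threshold, else 0
        return (f(X, theta) > eta).astype(int)

    def training_agreement(X, t, theta, eta):
        # fraction of training labels matched by the predicted labels
        return np.mean(predict(X, theta, eta) == t)

    # toy training set: four inputs with two features each, binary targets
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, 0.5]])
    t = np.array([1, 1, 0, 0])
    print(training_agreement(X, t, theta=np.array([1.0, 1.0]), eta=0.0))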

Important concepts Training and test sets Uncertainty about classification Overfitting Cross-validation (leave-one-out)

Put yourself in the machine's shoes. (figure: a grid of Genes 1-6 by Features 1-5, e.g. expression level during heat shock) Which uncharacterized genes are involved in tRNA processing?

Training Positives Negatives Known genes

Training Positives Negatives What pattern distinguishes the positives and the negatives?

Training. Positives vs. negatives; candidate patterns: 4 green features; features 1, 3, and 5 are green; features 1 and 3 are green and feature 2 is red; features 1 and 3 are green.

Training. Positives vs. negatives (known genes); remaining candidate patterns: features 1, 3, and 5 are green; features 1 and 3 are green and feature 2 is red; features 1 and 3 are green.

Training. Positives vs. negatives (known genes); remaining candidate patterns: features 1 and 3 are green and feature 2 is red; features 1 and 3 are green.

Training Positives Negatives features 1 and 3 are green Known genes


Prediction. Unknowns: Gene1, Gene2, Gene3, Gene4, Gene5, Gene6. Which genes are involved in tRNA processing?

Prediction. Features 1 and 3 green? Gene1: Yes; Gene2: Yes; Gene3: No; Gene4: Yes; Gene5: No; Gene6: No. Which genes are involved in tRNA processing?

Prediction. Features 1 and 3 green? Gene1: Yes → Involved; Gene2: Yes → Involved; Gene3: No → Not Involved; Gene4: Yes → Involved; Gene5: No → Not Involved; Gene6: No → Not Involved. Which genes are involved in tRNA processing?

Experimental validation. Prediction for Gene1-Gene6: Involved, Involved, Not Involved, Involved, Not Involved, Not Involved.

Experimental validation. Prediction for Gene1-Gene6: Involved, Involved, Not Involved, Involved, Not Involved, Not Involved. Assay: +, +, -, +, -, -. All predictions are correct!

Sparse annotation Positives Negatives What pattern distinguishes the positives and the negatives?

Multiple lines separate the two classes. (figure: two classes plotted in the x1-x2 plane, with several different separating lines)

Training under sparse annotation. Positives vs. negatives; candidate patterns: 4 green features; features 1 and 3 are green. What pattern distinguishes the positives and the negatives?

Prediction under sparse annotation. Four green features? / Features 1 and 3 green? Gene1: Yes / Yes; Gene2: No / Yes; Gene3: No / No; Gene4: No / Yes; Gene5: Yes / No; Gene6: No / No. Which genes are involved in tRNA processing?

Prediction under sparse annotation. Four green features? / Features 1 and 3 green? / Confidence: Gene1: Yes / Yes / 1.0; Gene2: No / Yes / 0.5; Gene3: No / No / 0; Gene4: No / Yes / 0.5; Gene5: Yes / No / 0.5; Gene6: No / No / 0. Legend: 1.0 = definitely involved; 0.5 = may be involved; 0 = definitely not involved.

Prediction under sparse annotation. Confidence for Gene1-Gene6: 1.0, 0.5, 0, 0.5, 0.5, 0. Prediction: Gene1, and probably Genes 2, 4, and 5, are involved in tRNA processing.

Experimental validation. Confidence for Gene1-Gene6: 1.0, 0.5, 0, 0.5, 0.5, 0.

Experimental validation. Labels for Gene1-Gene6: +, +, -, +, -, -. Confidence: 1.0, 0.5, 0, 0.5, 0.5, 0.

Experimental validation. Confidence for Gene1-Gene6: 1.0, 0.5, 0, 0.5, 0.5, 0. One correct confidence-1.0 prediction.

Experimental validation. Confidence for Gene1-Gene6: 1.0, 0.5, 0, 0.5, 0.5, 0. Two out of three confidence-0.5 predictions are correct.

Validation results. Confidence for Gene1-Gene6: 1.0, 0.5, 0, 0.5, 0.5, 0. Confidence cutoff / #True Positives / #False Positives: 1.0 / 1 / 0; 0.5 / 3 / 1; 0 / 3 / 3.
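
This table follows mechanically from the labels and confidences; a short plain-Python sketch (labels and confidences copied from the slides above) that reproduces it:

    labels = ['+', '+', '-', '+', '-', '-']        # assay results, Gene1..Gene6
    confidence = [1.0, 0.5, 0.0, 0.5, 0.5, 0.0]    # classifier confidences

    for cutoff in sorted(set(confidence), reverse=True):
        predicted = [c >= cutoff for c in confidence]
        tp = sum(p and l == '+' for p, l in zip(predicted, labels))
        fp = sum(p and l == '-' for p, l in zip(predicted, labels))
        print(cutoff, tp, fp)   # prints 1.0/1/0, 0.5/3/1, 0.0/3/3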

Noisy features. Positives vs. negatives. One feature has an incorrect measurement: it should be green.

Noisy features. Positives vs. negatives. What distinguishes the positives and the negatives?

Noisy features + sparse data = overfitting. Positives vs. negatives. What distinguishes the positives and the negatives?

Training Positives Negatives 4 green features

Prediction. Four green features? Gene1: Yes; Gene2: No; Gene3: No; Gene4: No; Gene5: Yes; Gene6: No. Which genes are involved in tRNA processing?

Prediction. Four green features? / Confidence: Gene1: Yes / 1.0; Gene5: Yes / 1.0; Genes 2, 3, 4, 6: No / 0. Prediction: Genes 1 and 5 are involved in tRNA processing.

Experimental validation. Four green features? / Confidence: Gene1: Yes / 1.0; Gene5: Yes / 1.0; Genes 2, 3, 4, 6: No / 0.

Experimental validation. Confidence: Gene1: 1.0; Gene5: 1.0; others: 0. One incorrect high-confidence prediction (Gene5), i.e., one false positive.

Experimental validation. Confidence: Gene1: 1.0; Gene5: 1.0; others: 0. Two genes (Genes 2 and 4) missed completely, i.e., two false negatives.

Experimental validation. Confidence: Gene1: 1.0; Gene5: 1.0; others: 0. One incorrect high-confidence prediction, and two genes missed completely.

Validation results. Confidence for Gene1-Gene6: 1.0, 0, 0, 0, 1.0, 0. Confidence cutoff / #True Positives / #False Positives: 1.0 / 1 / 1; 0 / 3 / 3.

What have we learned? Sparse data: many different patterns distinguish positives and negatives.

What have we learned? Sparse data: many different patterns distinguish positives and negatives. Noisy features: the actual distinguishing pattern may not be observable.

What have we learned? Sparse data: many different patterns distinguish positives and negatives. Noisy features: the actual distinguishing pattern may not be observable. Sparse data + noisy features: we may detect, and be highly confident in, spurious, incorrect patterns. Overfitting.

Overfitting. (figure: for a given training/test set, classification error plotted against the (effective) number of parameters, aka complexity, aka VC dimension; training-set error keeps decreasing while generalization (test-set) error eventually rises)
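
The same curve can be reproduced numerically. A small numpy sketch using polynomial regression as a stand-in for a classifier of growing complexity (all data here is synthetic, invented for illustration): training error falls steadily as the number of parameters grows, while held-out error eventually rises.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 30)
    y = np.sin(3 * x) + rng.normal(0, 0.3, 30)             # noisy training data
    x_test = rng.uniform(-1, 1, 200)
    y_test = np.sin(3 * x_test) + rng.normal(0, 0.3, 200)  # held-out data

    for degree in [1, 3, 5, 9]:                  # model complexity
        coeffs = np.polyfit(x, y, degree)        # fit on the training set only
        train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(train_err, 3), round(test_err, 3))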

Validation. Different algorithms assign confidence to their predictions differently. We need to: 1. determine the meaning of each algorithm's confidence score; 2. determine what level of confidence is warranted by the data.

Cross-validation Basic idea: Hold out part of the data and use it to validate confidence levels

Cross-validation Positives Negatives

Cross-validation Positives Negatives Hold-out Label + - +

Cross-validation: training Positives Negatives

Cross-validation: training Positives Negatives Features 1 and 3 are green

Cross-validation: testing Features 1 and 3 green? Hold-out Yes

Cross-validation: testing. Features 1 and 3 green? Hold-out answers: Yes, No, No → confidence 1.0, 0, 0.

Cross-validation: testing. Hold-out confidences: 1.0, 0, 0; labels: +, -, +.

Cross-validation: testing. Hold-out confidences: 1.0, 0, 0; labels: +, -, +. Confidence cutoff / #True Positives / #False Positives: 1.0 / 1 / 0; 0 / 2 / 1.

N-fold cross-validation. Step 1: randomly reorder the rows. Step 2: split into N sets (e.g. N = 5). Step 3: train N times, using each split, in turn, as the hold-out set. (figure: labelled data → permuted data → training splits)


Using N-fold cross-validation to assign confidence to predictions. (figure: each fold splits the positives and negatives into a training set and a test set; the held-out test set of each fold yields classification statistics, a table of threshold, #TP, and #FP per fold, which are pooled across folds 1..N)
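
A sketch of the pooling idea in numpy (the "classifier" here is a trivial class-mean score made up as a stand-in, not any method from the slides): every example gets its confidence from the one fold in which it was held out, and the pooled confidences feed a threshold/#TP/#FP table like the one on the next slide.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(0, 1, (20, 5))                              # synthetic features
    t = (X[:, 0] + 0.5 * rng.normal(size=20) > 0).astype(int)  # synthetic labels

    N_FOLDS = 5
    order = rng.permutation(len(X))              # step 1: reorder rows
    folds = np.array_split(order, N_FOLDS)       # step 2: split into N sets

    confidence = np.empty(len(X))
    for k in range(N_FOLDS):                     # step 3: train N times
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(N_FOLDS) if j != k])
        # stand-in "training": class means of feature 0 on the training split
        mu1 = X[train_idx][t[train_idx] == 1, 0].mean()
        mu0 = X[train_idx][t[train_idx] == 0, 0].mean()
        # score held-out examples by distance past the midpoint of the means
        confidence[test_idx] = 1 / (1 + np.exp(-(X[test_idx, 0] - (mu0 + mu1) / 2)))

    for thres in [0.9, 0.7, 0.5, 0.3]:
        pred = confidence >= thres
        print(thres, int(np.sum(pred & (t == 1))), int(np.sum(pred & (t == 0))))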

Cross-validation results. Confidence cutoff / #True Positives / #False Positives: 1.0 / 3 / 0; 0.75 / 3 / 1; 0.5 / 4 / 2; 0.25 / 5 / 3; 0 / 5 / 5.

Displaying results: ROC curves. Using the table above, plot #TP (y-axis, 0-5) against #FP (x-axis, 0-5) as the confidence cutoff is lowered: the resulting curve is the ROC curve.
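
A sketch in plain Python (rows copied from the cross-validation table above): each cutoff contributes one point of the ROC curve, here expressed as TP and FP rates.

    P, N = 5, 5   # total positives and negatives in the toy example
    rows = [(1.0, 3, 0), (0.75, 3, 1), (0.5, 4, 2), (0.25, 5, 3), (0.0, 5, 5)]

    for cutoff, tp, fp in rows:
        # one ROC point per cutoff: (FP rate, TP rate)
        print(cutoff, fp / N, tp / P)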

Making new predictions. (figure: the same ROC curve with a chosen operating point marked x; the cutoff at that point determines which new predictions to call positive)

Figures of merit. Confusion matrix (rows = actual T/F, columns = predicted T/F): actual T: TP (predicted T), FN (predicted F); actual F: FP (predicted T), TN (predicted F). Precision: #TP / (#TP + #FP) (also known as positive predictive value). Recall: #TP / (#TP + #FN) (also known as sensitivity). Specificity: #TN / (#FP + #TN). Negative predictive value: #TN / (#FN + #TN). Accuracy: (#TP + #TN) / (#TP + #FP + #TN + #FN).
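
A short plain-Python sketch computing all five figures of merit from the four confusion-matrix counts; the example counts are those of the six-gene validation at the 0.5 cutoff (3 TP, 1 FP, 0 FN, 2 TN):

    def figures_of_merit(tp, fp, fn, tn):
        # all five measures from the four confusion-matrix counts
        return {
            'precision (PPV)': tp / (tp + fp),
            'recall (sensitivity)': tp / (tp + fn),
            'specificity': tn / (fp + tn),
            'negative predictive value': tn / (fn + tn),
            'accuracy': (tp + tn) / (tp + fp + tn + fn),
        }

    print(figures_of_merit(tp=3, fp=1, fn=0, tn=2))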

Area under the ROC curve. The Area Under the ROC Curve (AUC) = average proportion of negatives with confidence levels less than a random positive. Quick facts: 0 < AUC < 1; AUC of a random classifier = 0.5. (figure: ROC curve, sensitivity vs. 1-specificity, both axes from 0 to 1)

Area under the ROC curve. AUC: the ratio of positive/negative pairs correctly ordered. Quick facts: 0 < AUC < 1; AUC of a random classifier = 0.5; AUC is equivalent to the (normalized) Mann-Whitney U statistic; my favourite classification error measure: 1 - AUC. (figure: ROC curve, TP rate vs. FP rate, with the shaded area under the curve = AUC)
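
A plain-Python sketch of the pair-ordering reading of the AUC; ties are counted as half, matching the usual Mann-Whitney convention (the scores here are just the six-gene confidences from earlier):

    pos = [1.0, 0.5, 0.5]   # confidences of the three positives (Genes 1, 2, 4)
    neg = [0.0, 0.5, 0.0]   # confidences of the three negatives (Genes 3, 5, 6)

    pairs = [(p, n) for p in pos for n in neg]
    # count a pair as ordered correctly if the positive outscores the negative
    u = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    print(u)                # the Mann-Whitney U statistic
    print(u / len(pairs))   # AUC: fraction of correctly ordered pairs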

Precision-recall curves. (figure: precision vs. recall, both from 0 to 1) Baseline precision: #P / (#P + #N). Often, people report the area under the precision-recall (PR) curve (AUPRC) as a performance metric when the number of positives is low and when you want to make good predictions about which genes are positives. Area = average precision, using thresholds determined by the positives. Unlike the ROC curve, the PR curve is not monotonic, nor is there a statistical test (that I know of) associated with it.
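
A plain-Python sketch of "average precision using thresholds determined by the positives", reusing the six-gene toy labels and confidences:

    labels = [1, 1, 0, 1, 0, 0]                  # Gene1..Gene6 assay results
    scores = [1.0, 0.5, 0.0, 0.5, 0.5, 0.0]      # classifier confidences

    def precision_at(cutoff):
        pred = [s >= cutoff for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        return tp / sum(pred)

    # one threshold per positive example, as in the area definition above
    ap = sum(precision_at(s) for s, l in zip(scores, labels) if l) / sum(labels)
    print(ap)                                    # average precision
    print(sum(labels) / len(labels))             # baseline precision #P/(#P+#N)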

Simple classifier based on Bayes' rule. X is a real-valued feature. We know that if y = 0 then X ~ N(m0, v), and if y = 1 then X ~ N(m1, v). What is P(y = 1 | X = x)? By Bayes' rule, P(y = 1 | X = x) = P(X = x | y = 1) P(y = 1) / P(X = x). Let's say P(y = 1) = p1 [and P(y = 0) = p0 = 1 - p1].

Some definitions. P(X = x | y = 1) = N(x; m1, v) = Z(v)^-1 exp[-0.5 (x - m1)^2 / v] and P(X = x | y = 0) = N(x; m0, v) = Z(v)^-1 exp[-0.5 (x - m0)^2 / v], where Z(v) = (2πv)^(1/2).

P(y = 1 | X = x)
= P(X = x | y = 1) P(y = 1) / [P(X = x | y = 1) P(y = 1) + P(X = x | y = 0) P(y = 0)]
= 1 / (1 + P(X = x | y = 0) P(y = 0) / [P(X = x | y = 1) P(y = 1)])
(assuming p0 = p1)
= 1 / (1 + exp[-0.5 (x - m0)^2 / v + 0.5 (x - m1)^2 / v])
= 1 / (1 + exp[x (m0 - m1) / v - 0.5 (m0^2 - m1^2) / v])    [note that (m0^2 - m1^2) = (m0 + m1)(m0 - m1)]
= 1 / (1 + exp[w (x - (m0 + m1) / 2)])    with w = (m0 - m1) / v
= 1 / (1 + exp[w (x - b)])    with b = (m0 + m1) / 2
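
A small numpy sketch verifying the algebra: the direct Bayes-rule posterior and the logistic form 1/(1 + exp[w(x - b)]) coincide (the means and variance below are arbitrary choices, with p0 = p1 = 0.5 as assumed above):

    import numpy as np

    m0, m1, v = -1.0, 2.0, 1.5          # class means and shared variance

    def gaussian(x, m, v):
        return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

    def posterior_bayes(x):
        # direct Bayes rule with equal class priors
        num = gaussian(x, m1, v)
        return num / (num + gaussian(x, m0, v))

    def posterior_sigmoid(x):
        # the closed form derived above: 1 / (1 + exp[w (x - b)])
        w = (m0 - m1) / v
        b = (m0 + m1) / 2
        return 1 / (1 + np.exp(w * (x - b)))

    x = np.linspace(-5, 5, 11)
    print(np.allclose(posterior_bayes(x), posterior_sigmoid(x)))   # True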