Class 4: Classification. Quaid Morris, February 11th, 2011. ML4Bio
Overview: Basic concepts in classification (overfitting, cross-validation, evaluation). Linear Discriminant Analysis and Quadratic Discriminant Analysis. Logistic Regression. Next class (probably): other classification algorithms (Decision Trees, Support Vector Machines, Multilayer Perceptrons, Ensemble classifiers).
Problem Set #2 1) Recall that the Wilcoxon sign rank statistic depends on W+, the sum of the ranks of the positive values of X_i, and W-, the sum of the ranks of the negative values of X_i. We talked about two test statistics: S = W+ - W- and min(W-, W+). These test statistics have different null distributions, but given the same data they can be used to generate the same P-values. Is the same true for W-? Why or why not?
Problem Set #2 2) The Wilcoxon sign test tests whether or not the distribution of X_i has a median of 0 by comparing the number of positive values of X_i to the number of negative values. Intuitively, there should be about an equal number of each. What is the null distribution of the number of positive values of X_i in N samples? 2b) What's the P-value associated with observing 40 positive values of X_i out of 100? How about 20 out of 20?
Problem Set #2 3) The P-value is not the probability of the incorrect rejection of the null hypothesis. Can you explain why? 5) Can the FDR correction ever lead to fewer rejections than the FWER correction? If not, then why not? If so, please give an example.
Classification example I: Predicting gene function from expression profiles. Microarray profiles are relatively easily measured and reflect function. Zhang et al (J Biol 2004)
Pattern detection: RNA splicing vs. not RNA splicing. What distinguishes these two sets of profiles?
Classification example II: Classifying cancer from cellular profiles. Microarray profiles can be used to subcategorize cancers (leukemia). [Figure: normalized expression profiles.] Golub et al (Science 1999)
Classification in a nutshell. Input X (features, aka covariates), e.g. a microarray profile. Parameters Θ (aka coefficients, weights). Classification algorithm, e.g. neural network, SVM, KNN. Output Y (aka discriminant value, confidence), compared against a threshold η. Goal: find parameters that make outputs predictive of targets on a training set of matched inputs and labeled target values.
Formal definition. Given: 1. a training set {(X_1, t_1), (X_2, t_2), ..., (X_N, t_N)} of matched inputs X_i and target labels t_i (t_i = 0 or 1); 2. a classification procedure represented by a discriminant function f(x; Θ) and a threshold η, so that I[f(X; Θ) > η] is the predicted label given input X. Goal: set Θ to maximize the agreement between the predicted target labels and actual target labels on the training set. I[H] is a function that has value 1 if the statement H is true, otherwise it has value 0.
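To make this definition concrete, here is a minimal sketch in Python, assuming a linear discriminant f(x; Θ) = Θ·x; the toy data, parameter values, and function names are hypothetical illustrations, not part of the slides.

```python
import numpy as np

def predict(X, theta, eta=0.0):
    """Predicted labels I[f(X; theta) > eta] for a linear discriminant f(x; theta) = theta . x."""
    return (X @ theta > eta).astype(int)

def training_agreement(X, t, theta, eta=0.0):
    """Fraction of training examples whose predicted label matches the target label."""
    return np.mean(predict(X, theta, eta) == t)

# Hypothetical toy training set: 4 examples, 3 features, binary targets.
X = np.array([[1.0, 0.2, -0.5],
              [0.9, 0.1, -0.4],
              [-0.8, 0.3, 0.6],
              [-1.1, -0.2, 0.7]])
t = np.array([1, 1, 0, 0])
theta = np.array([1.0, 0.0, -1.0])      # one candidate parameter setting
print(training_agreement(X, t, theta))  # 1.0 on this toy data
```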
Important concepts Training and test sets Uncertainty about classification Overfitting Cross-validation (leave-one-out)
Put yourself in the machine's shoes. [Table: Gene1 through Gene6 scored on Feature1 through Feature5, e.g. expression level during heat shock.] Which uncharacterized genes are involved in tRNA processing?
Training Positives Negatives Known genes
Training Positives Negatives What pattern distinguishes the positives and the negatives?
Training Positives Negatives. Candidate patterns: 4 green features; features 1, 3, and 5 are green; features 1 and 3 are green and feature 2 is red; features 1 and 3 are green.
Training Positives Negatives. Candidate patterns: features 1, 3, and 5 are green; features 1 and 3 are green and feature 2 is red; features 1 and 3 are green. Known genes
Training Positives Negatives. Candidate patterns: features 1 and 3 are green and feature 2 is red; features 1 and 3 are green. Known genes
Training Positives Negatives features 1 and 3 are green Known genes
Prediction. Unknowns: Gene1, Gene2, Gene3, Gene4, Gene5, Gene6. Which genes are involved in tRNA processing?
Prediction. Features 1 and 3 green? Gene1 Yes, Gene2 Yes, Gene3 No, Gene4 Yes, Gene5 No, Gene6 No. Which genes are involved in tRNA processing?
Prediction: Gene1 Involved, Gene2 Involved, Gene3 Not involved, Gene4 Involved, Gene5 Not involved, Gene6 Not involved. Which genes are involved in tRNA processing?
Experimental validation. Prediction: Gene1 Involved, Gene2 Involved, Gene3 Not involved, Gene4 Involved, Gene5 Not involved, Gene6 Not involved.
Experimental validation. Prediction: Gene1 Involved, Gene2 Involved, Gene3 Not involved, Gene4 Involved, Gene5 Not involved, Gene6 Not involved. Assay: +, +, -, +, -, -. All predictions are correct!
Sparse annotation Positives Negatives What pattern distinguishes the positives and the negatives?
Multiple lines separate the two classes. [Figure: two classes plotted against features x_1 and x_2, with several possible separating lines.]
Training under sparse annotation. Positives Negatives. Candidate patterns: 4 green features; features 1 and 3 are green. What pattern distinguishes the positives and the negatives?
Prediction under sparse annotation. Four green features? Features 1 and 3 green? Gene1: Yes, Yes. Gene2: No, Yes. Gene3: No, No. Gene4: No, Yes. Gene5: No, Yes. Gene6: No, No. Which genes are involved in tRNA processing?
Prediction under sparse annotation. Four green features? / Features 1 and 3 green? / Confidence: Gene1 Yes, Yes, 1.0; Gene2 No, Yes, 0.5; Gene3 No, No, 0; Gene4 No, Yes, 0.5; Gene5 No, Yes, 0.5; Gene6 No, No, 0. Legend: 1.0 = definitely involved, 0.5 = may be involved, 0 = definitely not involved.
Prediction under sparse annotation. Four green features? / Features 1 and 3 green? / Confidence: Gene1 Yes, Yes, 1.0; Gene2 No, Yes, 0.5; Gene3 No, No, 0; Gene4 No, Yes, 0.5; Gene5 No, Yes, 0.5; Gene6 No, No, 0. Prediction: Gene1, and probably Genes 2, 4, and 5, are involved in tRNA processing.
Experimental validation. Confidence: Gene1 1.0, Gene2 0.5, Gene3 0, Gene4 0.5, Gene5 0.5, Gene6 0.
Experimental validation. Label: +, +, -, +, -, -. Confidence: Gene1 1.0, Gene2 0.5, Gene3 0, Gene4 0.5, Gene5 0.5, Gene6 0.
Experimental validation. Confidence: Gene1 1.0, Gene2 0.5, Gene3 0, Gene4 0.5, Gene5 0.5, Gene6 0. One correct confidence-1.0 prediction.
Experimental validation. Confidence: Gene1 1.0, Gene2 0.5, Gene3 0, Gene4 0.5, Gene5 0.5, Gene6 0. Two out of three confidence-0.5 predictions correct.
Validation results. Confidence cutoff 1.0: 1 true positive, 0 false positives. Cutoff 0.5: 3 true positives, 1 false positive. Cutoff 0: 3 true positives, 3 false positives. Confidence: Gene1 1.0, Gene2 0.5, Gene3 0, Gene4 0.5, Gene5 0.5, Gene6 0.
Noisy features. Positives Negatives. Incorrect measurement, should be green.
Noisy features. Positives Negatives. What distinguishes the positives and the negatives?
Noisy features + sparse data = overfitting. Positives Negatives. What distinguishes the positives and the negatives?
Training Positives Negatives 4 green features
Prediction. Four green features? Gene1 Yes, Gene2 No, Gene3 No, Gene4 No, Gene5 Yes, Gene6 No. Which genes are involved in tRNA processing?
Prediction. Four green features? / Confidence: Gene1 Yes, 1.0; Gene2 0; Gene3 0; Gene4 0; Gene5 Yes, 1.0; Gene6 0. Prediction: Genes 1 and 5 are involved in tRNA processing.
Experimental validation. Four green features? / Confidence: Gene1 Yes, 1.0; Gene2 0; Gene3 0; Gene4 0; Gene5 Yes, 1.0; Gene6 0.
Experimental validation. Confidence: Gene1 1.0, Gene2 0, Gene3 0, Gene4 0, Gene5 1.0, Gene6 0. One incorrect high-confidence prediction, i.e., one false positive.
Experimental validation. Confidence: Gene1 1.0, Gene2 0, Gene3 0, Gene4 0, Gene5 1.0, Gene6 0. Two genes missed completely, i.e., two false negatives.
Experimental validation. Confidence: Gene1 1.0, Gene2 0, Gene3 0, Gene4 0, Gene5 1.0, Gene6 0. One incorrect high-confidence prediction, two genes missed completely.
Validation results. Confidence cutoff 1.0: 1 true positive, 1 false positive. Cutoff 0: 3 true positives, 3 false positives. Confidence: Gene1 1.0, Gene2 0, Gene3 0, Gene4 0, Gene5 1.0, Gene6 0.
What have we learned? Sparse data: many different patterns distinguish positives and negatives.
What have we learned? Sparse data: many different patterns distinguish positives and negatives. Noisy features: the actual distinguishing pattern may not be observable.
What have we learned? Sparse data: many different patterns distinguish positives and negatives. Noisy features: the actual distinguishing pattern may not be observable. Sparse data + noisy features: may detect, and be highly confident in, spurious, incorrect patterns. Overfitting.
Overfitting. [Figure: for a given training / test set, classification error plotted against the (effective) # of parameters, aka complexity, aka VC dimension: training set error keeps decreasing with complexity, while generalization (test set) error eventually rises.]
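As an illustration of this curve, here is a small sketch using polynomial regression on synthetic 1-D data as a stand-in for classifier complexity; the data, the true function, and the degrees tried are all hypothetical choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 15)
x_test = rng.uniform(-1, 1, 200)
f = lambda x: np.sin(3 * x)                                 # hypothetical "true" function
y_train = f(x_train) + rng.normal(0, 0.3, x_train.size)    # sparse, noisy training data
y_test = f(x_test) + rng.normal(0, 0.3, x_test.size)

for degree in [1, 3, 5, 9]:                                 # "complexity" = polynomial degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")
# Training error shrinks as the degree grows; test (generalization) error typically
# stops improving and then rises, which is the overfitting curve described above.
```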
Validation. Different algorithms assign confidence to their predictions differently. Need to: 1. determine the meaning of each algorithm's confidence score; 2. determine what level of confidence is warranted by the data.
Cross-validation Basic idea: Hold out part of the data and use it to validate confidence levels
Cross-validation Positives Negatives
Cross-validation Positives Negatives Hold-out Label + - +
Cross-validation: training Positives Negatives
Cross-validation: training Positives Negatives Features 1 and 3 are green
Cross-validation: testing Features 1 and 3 green? Hold-out Yes
Cross-validation: testing. Features 1 and 3 green? Yes. Hold-out confidence: 1.0.
Cross-validation: testing. Features 1 and 3 green? Yes. Hold-out confidence: 1.0. Label: +, -, +.
Cross-validation: testing. Hold-out confidence: 1.0 (others 0). Label: +, -, +. Confidence cutoff 1.0: 1 true positive, 0 false positives. Cutoff 0: 2 true positives, 1 false positive.
N-fold cross-validation. Step 1: Randomly reorder rows. Step 2: Split into N sets (e.g. N = 5). Step 3: Train N times, using each split, in turn, as the hold-out set. [Figure: labelled data (+/-) permuted and divided into training splits.]
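A minimal sketch of the three steps above, assuming generic `train` and `score_confidence` callables as hypothetical stand-ins for any classification algorithm; it pools held-out confidences and labels across folds so they can be evaluated as on the following slides.

```python
import numpy as np

def n_fold_cross_validation(X, t, n_folds, train, score_confidence, rng=None):
    """Return held-out confidences and labels, pooled across all folds."""
    if rng is None:
        rng = np.random.default_rng(0)
    order = rng.permutation(len(t))              # Step 1: randomly reorder rows
    folds = np.array_split(order, n_folds)       # Step 2: split into N sets
    held_out_conf, held_out_labels = [], []
    for i in range(n_folds):                     # Step 3: each split is the hold-out set in turn
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = train(X[train_idx], t[train_idx])
        held_out_conf.append(score_confidence(model, X[test_idx]))
        held_out_labels.append(t[test_idx])
    return np.concatenate(held_out_conf), np.concatenate(held_out_labels)
```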
Using N-fold cross-validation to assign confidence to predictions. Each fold has its own training set and test set (positives and negatives); the held-out test set yields a table of classification statistics (threshold, #TP, #FP) for Fold 1 through Fold N. [Table: per-fold thresholds with their #TP and #FP counts.]
Cross-validation results. Confidence cutoff 1.0: 3 true positives, 0 false positives. Cutoff 0.75: 3 true positives, 1 false positive. Cutoff 0.5: 4 true positives, 2 false positives. Cutoff 0.25: 5 true positives, 3 false positives. Cutoff 0: 5 true positives, 5 false positives.
Displaying results: ROC curves. [Figure: ROC curve plotting #TP (0 to 5) against #FP (0 to 5), one point per confidence cutoff.] Confidence cutoff 1.0: 3 TP, 0 FP; 0.75: 3 TP, 1 FP; 0.5: 4 TP, 2 FP; 0.25: 5 TP, 3 FP; 0: 5 TP, 5 FP.
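The ROC-style table can be computed directly from the pooled held-out confidences and labels. Below is a minimal sketch; the example confidences and labels are hypothetical, chosen so that the printed table matches the cross-validation results above.

```python
import numpy as np

def roc_points(confidences, labels):
    """Return (cutoff, #TP, #FP) for every distinct confidence cutoff, highest first."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels)
    points = []
    for cutoff in sorted(set(confidences), reverse=True):
        predicted_pos = confidences >= cutoff
        tp = int(np.sum(predicted_pos & (labels == 1)))
        fp = int(np.sum(predicted_pos & (labels == 0)))
        points.append((cutoff, tp, fp))
    return points

conf = [1.0, 1.0, 1.0, 0.75, 0.5, 0.5, 0.25, 0.25, 0.0, 0.0]
lab  = [1,   1,   1,   0,    1,   0,   1,    0,    0,   0]
for cutoff, tp, fp in roc_points(conf, lab):
    print(f"cutoff {cutoff:.2f}: {tp} TP, {fp} FP")
```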
Making new predictions. [Figure: the same ROC curve, with a chosen operating point marked x.] Confidence cutoff 1.0: 3 TP, 0 FP; 0.75: 3 TP, 1 FP; 0.5: 4 TP, 2 FP; 0.25: 5 TP, 3 FP; 0: 5 TP, 5 FP.
Figures of merit. Confusion matrix of actual vs. predicted labels: TP (actual T, predicted T), FP (actual F, predicted T), FN (actual T, predicted F), TN (actual F, predicted F). Precision: #TP / (#TP + #FP) (also known as positive predictive value). Recall: #TP / (#TP + #FN) (also known as sensitivity). Specificity: #TN / (#FP + #TN). Negative predictive value: #TN / (#FN + #TN). Accuracy: (#TP + #TN) / (#TP + #FP + #TN + #FN).
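A small sketch computing these figures of merit from raw confusion-matrix counts; the example counts are hypothetical.

```python
def figures_of_merit(tp, fp, fn, tn):
    """All five figures of merit defined above, from confusion-matrix counts."""
    return {
        "precision (PPV)": tp / (tp + fp),
        "recall (sensitivity)": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "negative predictive value": tn / (fn + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts:
print(figures_of_merit(tp=3, fp=1, fn=2, tn=4))
```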
Area under the ROC curve. [Figure: ROC curve with sensitivity on the y-axis and 1-specificity on the x-axis, both from 0 to 1.] Area Under the ROC Curve (AUC) = average proportion of negatives with confidence levels less than a random positive. Quick facts: 0 < AUC < 1; AUC of a random classifier = 0.5.
Area under the ROC curve. Area under the ROC curve (AUC): proportion of positive/negative pairs correctly ordered. Quick facts: 0 < AUC < 1; AUC of a random classifier = 0.5; AUC is equivalent to the Mann-Whitney U statistic (up to rescaling). My favourite classification error measure: 1-AUC. [Figure: ROC curve with TP rate on the y-axis and FP rate on the x-axis, both from 0 to 1.]
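A minimal sketch of this pair-ordering view of AUC, counting the proportion of (positive, negative) pairs in which the positive example gets the higher confidence (ties counted as half); the data are the same hypothetical confidences and labels used in the ROC sketch above.

```python
import numpy as np

def auc_pair_ordering(confidences, labels):
    """AUC as the fraction of positive/negative pairs ordered correctly by confidence."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels)
    pos = confidences[labels == 1]
    neg = confidences[labels == 0]
    correct = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (correct + 0.5 * ties) / (len(pos) * len(neg))

conf = [1.0, 1.0, 1.0, 0.75, 0.5, 0.5, 0.25, 0.25, 0.0, 0.0]
lab  = [1,   1,   1,   0,    1,   0,   1,    0,    0,   0]
auc = auc_pair_ordering(conf, lab)
print(auc, 1 - auc)   # 1 - AUC is the error measure mentioned above
```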
Precision-recall curves. [Figure: precision vs. recall, both from 0 to 1, with the baseline precision #P / (#P + #N) marked.] Often, people report the area under the precision-recall (PR) curve (AUPRC) as a performance metric when the number of positives is low and when you want to make good predictions about which genes are positives. Area = average precision using thresholds determined by the positives. Unlike the ROC curve, the PR curve is not monotonic, nor is there a statistical test (that I know of) associated with it.
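A minimal sketch of the "average precision using thresholds determined by the positives" definition above; ties in confidence are broken by input order, and the example data are the same hypothetical confidences and labels as before.

```python
import numpy as np

def average_precision(confidences, labels):
    """Mean precision over cutoffs placed just below each positive example."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-confidences, kind="stable")   # rank by decreasing confidence
    labels = labels[order]
    precisions = []
    for k in range(1, len(labels) + 1):
        if labels[k - 1] == 1:                        # threshold determined by this positive
            precisions.append(labels[:k].sum() / k)
    return float(np.mean(precisions))

conf = [1.0, 1.0, 1.0, 0.75, 0.5, 0.5, 0.25, 0.25, 0.0, 0.0]
lab  = [1,   1,   1,   0,    1,   0,   1,    0,    0,   0]
print(average_precision(conf, lab))
print("baseline precision:", sum(lab) / len(lab))     # #P / (#P + #N)
```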
Simple classifier based on Bayes' rule. X is a real value. We know that if y = 0, then X ~ N(m_0, s), and if y = 1, then X ~ N(m_1, s). What is P(y = 1 | X = x)? P(y = 1 | X = x) = P(X = x | y = 1) P(y = 1) / P(X = x). Let's say P(y = 1) = p_1 [and P(y = 0) = p_0 = 1 - p_1].
Some definitions. P(X = x | y = 1) = N(x; m_1, v) = Z(v)^{-1} exp[-0.5 (x - m_1)^2 / v]. P(X = x | y = 0) = N(x; m_0, v) = Z(v)^{-1} exp[-0.5 (x - m_0)^2 / v]. Where Z(v) = (2πv)^{1/2}.
P(y = 1 | X = x)
= P(X = x | y = 1) P(y = 1) / [ P(X = x | y = 1) P(y = 1) + P(X = x | y = 0) P(y = 0) ]
= 1 / ( 1 + P(X = x | y = 0) P(y = 0) / [ P(X = x | y = 1) P(y = 1) ] )
Assuming p_0 = p_1:
= 1 / ( 1 + exp[ -0.5 (x - m_0)^2 / v + 0.5 (x - m_1)^2 / v ] )
= 1 / ( 1 + exp[ x (m_0 - m_1) / v - 0.5 (m_0^2 - m_1^2) / v ] )    (note (m_0^2 - m_1^2) = (m_0 + m_1)(m_0 - m_1))
= 1 / ( 1 + exp[ w (x - (m_0 + m_1)/2) ] ), where w = (m_0 - m_1) / v
= 1 / ( 1 + exp[ w (x - b) ] ), where b = (m_0 + m_1) / 2
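The result is a logistic (sigmoid) function of x, which is why this form reappears in logistic regression. Below is a small numerical check of the derivation, assuming equal priors (p_0 = p_1), a shared variance v, and hypothetical class means; it verifies that the direct Bayes computation equals the logistic form 1 / (1 + exp[w (x - b)]).

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """N(x; mean, var) with variance parameterization, as in the definitions above."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def posterior_bayes(x, m0, m1, v):
    """P(y = 1 | X = x) computed directly from Bayes' rule with equal priors."""
    p1 = gaussian_pdf(x, m1, v)
    p0 = gaussian_pdf(x, m0, v)
    return p1 / (p1 + p0)

def posterior_logistic(x, m0, m1, v):
    """The same posterior written as a logistic function of x."""
    w = (m0 - m1) / v
    b = (m0 + m1) / 2
    return 1.0 / (1.0 + np.exp(w * (x - b)))

x = np.linspace(-3, 3, 7)
m0, m1, v = -1.0, 1.0, 0.5           # hypothetical class means and shared variance
print(np.allclose(posterior_bayes(x, m0, m1, v), posterior_logistic(x, m0, m1, v)))  # True
```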