Classification and prediction omics style. Dr Nicola Armstrong, Mathematics and Statistics, Murdoch University


1 Classification and prediction omics style. Dr Nicola Armstrong, Mathematics and Statistics, Murdoch University

2 Classification. Discrimination: a learning set (data with known classes) is passed to a classification technique, which produces a classification rule. Prediction: the classification rule is applied to data with unknown classes to give a class assignment.

3 Classification Rule. Components: classification technique; feature selection; parameters [pre-determined, estimable]; distance measure; aggregation methods. The classification rule is like a black box; some methods provide more insight into the contents of the box.

4 Classification Techniques. Decision tree based methods, e.g. random forests (Breiman 2001); rule-based methods; memory based reasoning; neural networks; Naïve Bayes (DLDA) and Bayesian belief networks; support vector machines. Widely used, but parameters are difficult to interpret and most programs are black boxes.

5 Ensemble packages in R. Allow application and evaluation of multiple techniques on a dataset. Simple and easy to use. Examples: CMA, caret, ClassifyR.

6 Classification methods

Classification method description | Function(s) | DM | DV | DD
Wrapper for sparsediscrim's diagonal LDA function dlda. | DLDAtrainInterface, DLDApredictInterface | ✓ | |
Wrapper for PoiClaClu's Poisson LDA function classify. | classifyinterface | ✓ | |
Wrapper for glmnet's elastic net GLM function glmnet. | elasticnetglminterface | ✓ | |
Wrapper for pamr's Nearest Shrunken Centroid functions pamr.train and pamr.predict. | NSCtrainInterface, NSCpredictInterface | ✓ | |
Wrapper for multinomial logistic regression as implemented in CRAN package mnlogit. | logisticregressiontraininterface, logisticregressionpredictinterface | ✓ | |
Fisher's Linear Discriminant Analysis | fisherdiscriminant | ✓ | ✓* |
Feature-wise mixtures of normals and voting | mixmodelstrain, mixmodelspredict | ✓ | ✓ | ✓
Feature-wise kernel density estimation and voting | naivebayeskernel | ✓ | ✓ | ✓
Wrapper for randomforest's function randomforest. | randomforestinterface | ✓ | ✓ | ✓
Wrapper for e1071's Support Vector Machine function svm. | SVMinterface | ✓ | ✓† | ✓†

* If ordinary numeric measurements have been transformed to absolute deviations by subtractfromlocation.
† If kernel is not linear.

7 MODEL PERFORMANCE

8 Validation: performance assessment. Can be based on: cross-validation; a test set; independent testing on a future dataset; independent testing on an existing dataset (integrative analysis).

9 Cross-validation
  For i = 1, ..., 100:
    Partition the data into n disjoint sets S_1, S_2, ..., S_n
    For k = 1, ..., n:
      Omit S_k
      Using all data except S_k, build the classifier
      Use the classifier to predict classes for S_k
  Report summary statistics of performance
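The loop above can be sketched in Python as follows. This is a minimal illustration, not the procedure used in any of the R packages mentioned earlier: the one-feature nearest-centroid classifier, the function names and the toy data are all assumptions made for the example.

```python
import random
from statistics import mean

def fit_centroids(xs, ys):
    """Toy classifier (assumed for illustration): mean feature value per class."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def repeated_cv_accuracy(xs, ys, n_folds=3, n_repeats=100, seed=0):
    """Repeat n_folds-fold cross-validation n_repeats times; summarise accuracy."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        # Partition the shuffled indices into disjoint sets S_1, ..., S_n.
        folds = [indices[k::n_folds] for k in range(n_folds)]
        correct = 0
        for held_out in folds:
            omitted = set(held_out)
            train = [i for i in indices if i not in omitted]
            # Build the classifier on everything except S_k.
            model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
            # Predict the classes of the omitted set S_k.
            correct += sum(predict(model, xs[i]) == ys[i] for i in held_out)
        accuracies.append(correct / len(xs))
    return mean(accuracies), min(accuracies), max(accuracies)

# Toy data: two well-separated classes on a single feature.
xs = [0.0, 0.5, 1.0, 1.5, 0.2, 0.8, 5.0, 5.5, 6.0, 4.5, 5.2, 5.8]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
summary = repeated_cv_accuracy(xs, ys)
print(summary)
```

Every observation is predicted exactly once per repeat, and always by a classifier that never saw it during training.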

10 Metrics for Performance Evaluation. Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc.

Confusion matrix:

                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a                     b
ACTUAL Class=No     c                     d

a: TP (true positive); b: FN (false negative); c: FP (false positive); d: TN (true negative)
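As a small sketch, the four cells of the confusion matrix can be counted directly from paired actual and predicted labels. The function name and the example labels below are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, FN, FP and TN for a two-class problem."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1      # cell a
        elif a == positive:
            counts["FN"] += 1      # cell b
        elif p == positive:
            counts["FP"] += 1      # cell c
        else:
            counts["TN"] += 1      # cell d
    return counts

actual    = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"]
counts = confusion_counts(actual, predicted)
print(counts)
```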

11 Accuracy. Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN). This is a problem if the data are highly skewed/biased. For instance, suppose only 0.5% of the data is in category 1 and the rest is in category 0, and a model has 99.5% accuracy. Because of the skew in the data, that model could simply classify every observation as category 0 and still achieve that accuracy.
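The skew problem is easy to reproduce. The sketch below assumes a made-up dataset of 1000 observations, 5 of which (0.5%) are in category 1, and a "model" that always predicts category 0.

```python
def accuracy(actual, predicted):
    """Fraction of observations whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def sensitivity(actual, predicted):
    """TP / (TP + FN), with category 1 as the positive class."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp / (tp + fn)

actual = [1] * 5 + [0] * 995          # 0.5% of the data is in category 1
always_zero = [0] * len(actual)       # classify everything as category 0

acc = accuracy(actual, always_zero)
sens = sensitivity(actual, always_zero)
print(acc, sens)
```

The accuracy is high even though not a single category 1 observation is detected, which is why the other metrics are needed.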

12 Other metrics. Misclassification rate: 1 - Accuracy. Sensitivity / recall / true positive rate: TP/(TP+FN). Specificity / true negative rate: TN/(TN+FP). Positive predictive value / precision: TP/(TP+FP). Negative predictive value: TN/(TN+FN). F-score: harmonic mean of precision and recall, 2*(precision*recall)/(precision+recall); 1 = good, 0 = bad; doesn't consider TNs.
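A minimal sketch of these formulas, taking the four confusion-matrix counts as input. The function name and the example counts are illustrative, and returning NaN for a zero denominator is a choice made here, not something the slide specifies.

```python
def ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero (an assumption made here)."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(tp, fn, fp, tn):
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    precision = ratio(tp, tp + fp)           # positive predictive value
    recall = ratio(tp, tp + fn)              # sensitivity / true positive rate
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "sensitivity": recall,
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": precision,
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "f_score": ratio(2 * precision * recall, precision + recall),
    }

metrics = classification_metrics(tp=2, fn=1, fp=1, tn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```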

13 ROC (Receiver Operating Characteristic). Developed in the 1950s for signal detection theory to analyze noisy signals. Characterizes the trade-off between positive hits and false alarms. The ROC curve plots sensitivity (on the y-axis) against (1 - specificity) (on the x-axis). The performance of each classifier is represented as a point on the ROC curve; changing the threshold of the algorithm, the sample distribution or the cost matrix changes the location of the point.

14 ROC Curve. Points are written as (sensitivity, 1 - specificity). (0,0): declare everything to be the negative class. (1,1): declare everything to be the positive class. (1,0): ideal. Diagonal line: random guessing. Below the diagonal line: prediction is the opposite of the true class.
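To illustrate how moving the threshold moves the point, the sketch below sweeps a threshold over made-up classifier scores and records one ROC point per threshold. Points are returned in plotting order, (1 - specificity, sensitivity), i.e. (x, y). The area under the curve via the trapezoid rule is an extra summary not mentioned on the slide.

```python
def roc_points(scores, labels):
    """One (1 - specificity, sensitivity) point per threshold, labels in {0, 1}."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: everything negative
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]   # hypothetical classifier scores
labels = [1, 1, 0, 1, 0, 0]               # 1 = positive class
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)
```

The curve starts where everything is declared negative and ends where everything is declared positive; a classifier on the diagonal would give an area of 0.5.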

15 Model Selection. In practice, there is often not much difference in performance between several approaches. Aim to choose the model which is: interpretable (can we see or understand why the model is making the decisions it makes?); simple (easy to explain and understand); accurate; fast (to train and test); scalable (can be applied to a large dataset).

16 EXAMPLES

17 Hidden Markov Models. [Figure: at each position the hidden state π is one of the states 1, 2, ..., K; the hidden states emit the observations x1, x2, x3, ...]

18 Gene finding. [Figure: input sequence AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACG; most probable path; gene prediction: exon 1, exon 2, exon 3]
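The "most probable path" through a hidden Markov model is usually computed with the Viterbi algorithm. The sketch below is a generic log-space implementation on a toy two-state model. The states ("N" for non-coding, "C" for coding), all probabilities and the short sequence are invented for illustration and are not taken from any real gene finder.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical model: coding regions are GC-rich, non-coding regions are AT-rich.
states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
          "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

path = viterbi("AATTGGCCGGAATT", states, start_p, trans_p, emit_p)
print("".join(path))
```

Working with log-probabilities avoids numerical underflow on long sequences such as whole chromosomes.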

19 Crossovers in meiosis. ChromHMM: annotating genomic regions (Ernst & Kellis 2002).

20 MammaPrint. van 't Veer et al., Nature 2002; van de Vijver et al., NEJM 2002. Based on correlation.

21 Breast cancer survival Reyal et al 2008 BCR

22 Kok et al 2009 J Pathology

23 Kirschner et al JTO 2015

24 CONTINUOUS Y

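25 Multiple Regression. Linear: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$. Logistic: $P(Y = 1 \mid X) = \dfrac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}$. X is the genotype, i.e. the SNPs. Y is the phenotype: for linear regression, a continuous phenotypic measurement; for logistic regression, 0 = no, 1 = yes. β are the regression coefficients.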
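As a worked example, the sketch below evaluates both models for one person. The genotype coding (minor-allele counts 0, 1 or 2 per SNP) and the coefficient values are assumptions chosen so that the arithmetic is easy to follow.

```python
import math

def linear_predictor(genotype, beta0, betas):
    """beta0 + sum_j beta_j * X_j: the predicted value in linear regression."""
    return beta0 + sum(b * x for b, x in zip(betas, genotype))

def logistic_probability(genotype, beta0, betas):
    """P(Y = 1 | X) = exp(eta) / (1 + exp(eta)), with eta the linear predictor."""
    eta = linear_predictor(genotype, beta0, betas)
    return math.exp(eta) / (1 + math.exp(eta))

genotype = [0, 1, 2]          # hypothetical minor-allele counts at three SNPs
beta0 = -1.0                  # hypothetical intercept
betas = [0.5, 0.2, 0.4]       # hypothetical per-SNP effects

eta = linear_predictor(genotype, beta0, betas)        # -1 + 0 + 0.2 + 0.8
prob = logistic_probability(genotype, beta0, betas)
print(eta, prob)
```

A linear predictor of zero corresponds to a probability of one half; positive values push the probability towards 1 and negative values towards 0.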

26 Ridge Regression. X is the genotype, i.e. the SNPs. Y is the phenotype, i.e. risk of developing the disease. β^ridge is the ridge regression coefficient. λ ≥ 0 is the tuning parameter that controls the amount of ridge penalty. $\lambda \sum_j \beta_j^2$ is the ridge penalty.
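A common way to write the ridge estimate is sketched below. The indices i (people) and j (SNPs), the intercept $\beta_0$ and the unscaled squared-error term are notation assumed here; some texts and software scale the terms differently.

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \beta_j^{2},
  \qquad \lambda \ge 0 .
```

With λ = 0 this reduces to ordinary least squares; larger λ shrinks all coefficients towards zero without setting them exactly to zero.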

27 LASSO (the Least Absolute Shrinkage and Selection Operator). X is the genotype, i.e. the SNPs. Y is the phenotype, i.e. risk of developing the disease. β^lasso is the lasso regression coefficient. λ ≥ 0 is the tuning parameter that controls the amount of lasso penalty. $\lambda \sum_j |\beta_j|$ is the lasso penalty.
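A common way to write the lasso estimate is sketched below. As an assumption made here, i indexes people, j indexes SNPs, $\beta_0$ is an intercept and the squared-error term is left unscaled.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert,
  \qquad \lambda \ge 0 .
```

Because the penalty uses absolute values, sufficiently large λ sets some coefficients exactly to zero, which is what makes the lasso a selection operator.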

28 Elastic Net. X is the genotype, i.e. the SNPs. Y is the phenotype, i.e. risk of developing the disease. β0, β are the elastic net regression coefficients. λ ≥ 0 is the tuning parameter that controls the amount of elastic net penalty. 0 ≤ α ≤ 1 is the elastic net penalty weight.
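One way to write the elastic net estimate is sketched below. The exact scaling is an assumption: the glmnet package, for example, divides the squared-error term by 2n and halves the squared part of the penalty, which changes the numerical value of λ but not the roles of λ and α.

```latex
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
    + \lambda \Bigl[ \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
                     + (1 - \alpha) \sum_{j=1}^{p} \beta_j^{2} \Bigr],
  \qquad \lambda \ge 0,\; 0 \le \alpha \le 1 .
```

In this form α = 1 leaves only the absolute-value (lasso) penalty and α = 0 leaves only the squared (ridge) penalty.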

29 Ridge vs. LASSO vs. Elastic Net. * LASSO does not perform well when the number of SNPs >> the number of people: the maximum number of variables that LASSO can select before it saturates is equal to the number of people. All regression methods rely on a linearity assumption.
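The different behaviour of the three penalties is easiest to see in a special case. The sketch below assumes orthonormal predictors (X'X = I) and the unscaled objective "residual sum of squares + λ × penalty", with the elastic net penalty α Σ|β_j| + (1 - α) Σ β_j². Under those assumptions each coefficient has a closed form in terms of its least-squares estimate b. The function names and example numbers are made up.

```python
def soft_threshold(b, t):
    """Move b towards zero by t, and set it to exactly zero if |b| <= t."""
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0

def ridge_coef(b, lam):
    """Ridge: proportional shrinkage, never exactly zero for b != 0."""
    return b / (1 + lam)

def lasso_coef(b, lam):
    """Lasso: soft-thresholding at lam / 2, small effects become exactly zero."""
    return soft_threshold(b, lam / 2)

def elastic_net_coef(b, lam, alpha):
    """Elastic net: soft-threshold by the lasso part, then shrink by the ridge part."""
    return soft_threshold(b, lam * alpha / 2) / (1 + lam * (1 - alpha))

ols = [3.0, 0.4, -1.5]        # hypothetical least-squares estimates
lam = 1.0

ridge = [ridge_coef(b, lam) for b in ols]
lasso = [lasso_coef(b, lam) for b in ols]
enet = [elastic_net_coef(b, lam, alpha=0.5) for b in ols]
print(ridge)
print(lasso)
print(enet)
```

Ridge keeps every variable, the lasso drops the small middle effect entirely, and the elastic net does some of both.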

30 Armstrong et al., Epigenomics 2017. Hannum et al. 2014 (elastic net) & Horvath 2013 (elastic net).
