Applied Machine Learning Annalisa Marsico


1 Applied Machine Learning
Annalisa Marsico, OWL RNA Bioinformatics group, Max Planck Institute for Molecular Genetics, Free University of Berlin
22 April, SoSe 2015

2 Goals
Feature selection rather than feature reduction: regularized linear models
From regression to classification
Logistic regression (regularization and partial least squares are also possible)
How to reduce over-fitting
How to evaluate a classification model
Class imbalance

3 The Variance-Bias Tradeoff
The mean squared error
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
can be decomposed as
$$E(\mathrm{MSE}) = \sigma^2 + (\text{Model bias})^2 + \text{Model variance}$$
$\sigma^2$: irreducible noise
Model bias: reflects how close the function of the model is to the real input-output relationship
Model variance: reflects how good the model is at generalizing
[Figure: a low variance / high bias fit next to a low bias / high variance fit]

4 The Variance-Bias Tradeoff
Complex models can have high variance.
Collinearity gives rise to high-variance models -> we can try to reduce the model's variance as a way to mitigate the effect of collinearity -> by reducing the variance we increase the bias of the model.
$$E(\mathrm{MSE}) = \sigma^2 + (\text{Model bias})^2 + \text{Model variance}$$
N.B. Ordinary linear regression produces unbiased coefficient estimates.

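5 Ridge Regression
Controlling, or regularizing, the parameter estimates can be done by adding a penalty to the SSE:
$$\mathrm{SSE} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
$$\mathrm{SSE}_{L_2} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{P} \beta_j^2$$
The penalty parameter λ controls the amount of shrinkage; the added term is the squared L2-norm of the coefficients.
[Figure: path of the regression coefficients for different values of λ]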
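A minimal sketch of the shrinkage effect, assuming the simplest possible case of a single predictor with no intercept (a simplification that is not on the slide). In that case the ridge estimate has the closed form β = Σ x_i y_i / (Σ x_i² + λ). The function name `ridge_coefficient` and the toy data are hypothetical.

```python
def ridge_coefficient(x, y, lam):
    """Ridge estimate for one predictor without intercept.

    Minimizes sum((y_i - beta * x_i)**2) + lam * beta**2.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Toy data (made up for illustration): y is roughly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# The coefficient shrinks towards zero as lambda grows, but never reaches it.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, ridge_coefficient(x, y, lam))
```

With λ = 0 the function returns the ordinary least squares estimate; larger values of λ trade variance for bias.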

6 Ridge Regression: How to choose the penalty
What's happening in this region?

7 Lasso (Least Absolute Shrinkage and Selection Operator) regression
$$\mathrm{SSE}_{L_1} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{P} |\beta_j|$$
It seems like a small modification, but the practical implications are significant. While the regression coefficients are still shrunk towards zero, penalizing the absolute values means that some coefficients are set exactly to zero for some values of λ.
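A minimal sketch of why the absolute-value penalty can set a coefficient exactly to zero, again assuming a single predictor with no intercept (a simplification that is not on the slide). Minimizing Σ (y_i − β x_i)² + λ|β| gives a soft-thresholded estimate: Σ x_i y_i is pulled towards zero by λ/2 and clipped at zero. The function name `lasso_coefficient` and the toy data are hypothetical.

```python
def lasso_coefficient(x, y, lam):
    """Lasso estimate for one predictor without intercept.

    Minimizes sum((y_i - beta * x_i)**2) + lam * abs(beta).
    The solution soft-thresholds sum(x_i * y_i) by lam / 2.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)  # clipped at zero
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


# Toy data (made up for illustration): y is roughly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Unlike ridge, a large enough lambda gives a coefficient of exactly 0.0.
for lam in (0.0, 20.0, 120.0):
    print(lam, lasso_coefficient(x, y, lam))
```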

8 Questions Is PLSR a feature selection method? Is Ridge Regression a feature selection method? Is Lasso a feature selection method?

9 Elastic net
Generalization of the Lasso model. It combines two types of penalty:
$$\mathrm{SSE}_{Enet} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^{P} \beta_j^2 + \lambda_2 \sum_{j=1}^{P} |\beta_j|$$
Advantage: enables regularization via the ridge-type penalty and feature selection via the Lasso-type penalty.
Zou and Hastie (2005) suggested that this model deals effectively with groups of highly correlated predictors.

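10 Comparison between Ridge, Lasso and Elastic Net
Lasso, subject to the constraint: $|\beta_1| + |\beta_2| \le t$
Ridge, subject to the constraint: $\beta_1^2 + \beta_2^2 \le t$
Elastic net penalty (shown for $\alpha = 0.2$): $\sum_j \left( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \right)$
The elliptical regions are the contours of the residual sum of squares function. Their center is the least squares estimate.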

11 Linear models for Classification

12 Classification
The process of predicting categorical / qualitative responses.
Often we predict the probability of belonging to a certain class / category.
We have a set of observations $(x_1, y_1), \ldots, (x_n, y_n)$ to train the classifier.
Example: predict whether an individual will default on a credit card payment, based on annual income and monthly balance measured on a set of individuals.
We want to learn a model that predicts Y (default) from X_1 (balance) and X_2 (income).

13 Classification
Linear regression is not suitable: there is no natural way to convert a qualitative response into a quantitative one.
For binary classes we can use a dummy variable:
$$Y = \begin{cases} 0 & \text{if not default} \\ 1 & \text{if default} \end{cases}$$
The fitted values $\hat{Y}$ are converted to an output class $\hat{G}$:
$$\hat{G} = \begin{cases} \text{default} & \text{if } \hat{Y} > 0.5 \\ \text{no default} & \text{if } \hat{Y} \le 0.5 \end{cases}$$
If we try to predict Y with linear regression, we might not get a number between 0 and 1.
Rather than modeling the response Y directly, we model the probability that Y belongs to a certain class: $P(G = 1 \mid X)$ and $P(G = 0 \mid X)$.

14 Classification
Linear decision boundary: a hyperplane of the form $\{x : \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = c\}$ for some constant $c$

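15 Logistic Regression
How to model the relationship between $p(X) = P(Y = 1 \mid X)$ and X?
Logistic function:
$$p(X) = P(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \qquad P(Y = 0 \mid X) = 1 - p(X)$$
After a bit of manipulation:
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \quad \text{(odds)}$$
$$\log \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X \quad \text{(log-odds or logit)}$$
The logit is linear in X!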
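The relationship between the logistic function and the logit can be checked numerically. This is a minimal sketch; the coefficient values `beta0` and `beta1` are hypothetical and chosen only for illustration.

```python
import math


def logistic(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p; the inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, for illustration only.
beta0, beta1 = -3.0, 0.5

for x in (0.0, 6.0, 12.0):
    p = logistic(beta0 + beta1 * x)
    # logit(p) recovers the linear predictor beta0 + beta1 * x.
    print(x, round(p, 4), round(logit(p), 4))
```

The probability is a non-linear function of x, while its logit changes by exactly `beta1` for every unit increase in x.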

16 Estimating coefficients in logistic regression
Maximum likelihood is used to fit the model and learn the β parameters, i.e. we choose those β that maximize the likelihood function of the data:
$$L(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i : y_i = 0} \left( 1 - p(x_i) \right), \qquad p(x_i) = P(Y = 1 \mid X = x_i)$$
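A minimal sketch of maximum likelihood fitting for one predictor. It assumes plain gradient ascent on the log-likelihood, which is only one of several possible optimizers and is not necessarily what statistical packages use. The function names, learning rate, step count and toy data are all hypothetical.

```python
import math


def log_likelihood(xs, ys, b0, b1):
    """Log of the likelihood: sum of log p(x_i) or log(1 - p(x_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Gradient ascent on the average log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # derivative with respect to b0
            g1 += (y - p) * x    # derivative with respect to b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Toy data (made up): the classes overlap, so the estimate stays finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(b0, b1, log_likelihood(xs, ys, b0, b1))
```

The fitted coefficients give a higher log-likelihood than the starting point β0 = β1 = 0, which corresponds to predicting p = 0.5 for every sample.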

17 Logistic Regression interpretation
What do the coefficients represent?

18 Generalized linear models
Linear model: $\hat{Y} = f(X) = X^T \hat{\beta}$
$\hat{Y}$ only approximates Y: we will always have an error in trying to approximate the real function Y.
Generalized linear model: $\hat{Y} = f(X)$ with $f(X) = g(X^T \hat{\beta})$, where g = activation function
In a linear model, g = identity function
In logistic regression, g = logistic function
The RSS criterion can still be used to find f(x)! Only, f(x) is a more complicated function.

19 Logistic Regression vs Linear Regression
Linear regression: $\hat{y} \in \mathbb{R}$, $y(x) = \beta^T X + \beta_0$, g = identity function
Logistic regression: $\hat{y} \in [0, 1]$, $y(x) = \sigma(\beta^T X + \beta_0)$, g = sigmoid function

20 Regularized Logistic Regression
Classification models can also use a penalty (Ridge, Lasso, etc.) to improve the fit.
E.g. in logistic regression we can maximize a penalized (constrained) likelihood function with a Ridge-like penalty:
$$\log L(\beta) - \lambda \sum_{j=1}^{p} \beta_j^2$$
The glmnet package in R uses a combination of the Ridge and Lasso penalties:
$$\log L(\beta) - \lambda \left[ (1 - \alpha) \frac{1}{2} \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right]$$
α = mixing proportion that toggles between the pure Lasso penalty (α = 1) and the pure Ridge penalty (α = 0).
λ controls the total amount of penalization.
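A minimal sketch of the objective being maximized, assuming the glmnet-style penalty λ[(1 − α)·½·Σ β_j² + α·Σ |β_j|] and leaving the intercept unpenalized. This only evaluates the objective for given coefficients; it is not how glmnet fits the model. The function name and arguments are hypothetical.

```python
import math


def penalized_log_likelihood(beta0, betas, X, y, lam, alpha):
    """Logistic log-likelihood minus an elastic-net style penalty.

    X is a list of rows, y a list of 0/1 labels.
    alpha = 1 gives a pure Lasso penalty, alpha = 0 a pure Ridge penalty.
    The intercept beta0 is not penalized (an assumption of this sketch).
    """
    loglik = 0.0
    for row, yi in zip(X, y):
        z = beta0 + sum(b * v for b, v in zip(betas, row))
        p = 1.0 / (1.0 + math.exp(-z))
        loglik += math.log(p) if yi == 1 else math.log(1.0 - p)
    ridge = 0.5 * sum(b * b for b in betas)
    lasso = sum(abs(b) for b in betas)
    return loglik - lam * ((1.0 - alpha) * ridge + alpha * lasso)


# Toy inputs (made up for illustration).
X = [[0.0, 0.0]]
y = [1]
betas = [1.0, -2.0]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, penalized_log_likelihood(0.0, betas, X, y, lam=0.1, alpha=alpha))
```

In practice both λ and α are tuning parameters and are chosen by resampling.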

21 Regularized Logistic Regression Example: accuracy for different models with different α and λ parameters

22 Can Partial Least Squares be extended to Logistic Regression?
Yes. It will find new variables that simultaneously reduce the dimension and correlate with the response (but here Y = 0, 1).
One tuning parameter: the number of components

23 Over-Fitting and Model Tuning

24 The problem of Over-Fitting
Tendency to over-emphasize patterns (patterns that do not reproduce on new data)
Need to evaluate the model to be confident that it will do well in the future (on new data)
Problems in the data: data quality, limited number of samples

25 The problem of Over-Fitting
We want to use the existing data to find the parameter settings that give not only the best predictive performance but also the most realistic estimate of it.
Originally: split the data into a training set and a test set.
Modern approaches: split the data into multiple sets for training (i.e. parameter tuning) and into one or more distinct sets for evaluation purposes.

26 The problem of Over-Fitting
Over-fitting occurs when the model, in addition to learning the general patterns in the data, also learns the noise.
This kind of model will have poor accuracy when predicting a new sample.
Let's consider the following classification problem: which of these two classifiers is likely to generalize better to new data?

27 The problem of over-fitting

28 Parameter Tuning
Several models have at least one tuning parameter.
We want to find the best set of parameters.
[Figure: general strategy for parameter tuning]

29 Data Splitting
Given a certain amount of data, we have to decide how to spend the data points, i.e. which data are used for tuning / training and which for evaluation.
Important: evaluation must be carried out on samples never used in model tuning.
Many data points -> test set
Few data points -> resampling (cross-validation)
Stratified sampling: random sampling within subgroups (classes) when the classes are disproportionate
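A minimal sketch of a stratified train / test split, assuming the subgroups are simply the class labels. The function name `stratified_split`, the test fraction and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices so each class keeps its proportion in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # At least one sample of every class goes to the test set.
        n_test = max(1, round(len(shuffled) * test_fraction))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


# Toy labels (made up): 90 samples of class 0 and 10 of class 1.
labels = [0] * 90 + [1] * 10
train, test = stratified_split(labels, test_fraction=0.2)
print(len(train), len(test), sum(labels[i] for i in test))
```

Sampling within each class keeps the 9:1 class ratio in both sets, which a purely random split does not guarantee for small data sets.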

30 Resampling techniques

31 K-fold Cross Validation
Example: predicting cancer patients from gene expression & clinical data
Patients are split into k sets of roughly the same size.
The model is fit using all patients except the first subset (first fold).
The held-out patients are used for prediction and estimation of performance.
The first subset is returned to the training set and the procedure is repeated for all k sets.
The k estimates of performance are summarized (usually averaged).
[Figure: schema of the cross-validation process with k = 3]
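The splitting procedure can be sketched as follows. This is a minimal illustration that only produces the train / held-out index sets; fitting and scoring a model on each split is left out. The function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, held_out) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of roughly the same size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out


# 10 samples and k = 3: every sample is held out exactly once.
for train, held_out in k_fold_indices(10, 3):
    print(len(train), sorted(held_out))
```

A performance estimate is computed on each held-out fold, and the k estimates are then averaged.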

32 Leave One Out Cross-Validation (LOOCV)
k-fold cross-validation with k = number of patients (only one patient is held out at a time)
Final performance is computed from the k individual held-out predictions.
Computationally expensive! k = 10 is more attractive computationally, but a larger k reduces the bias, i.e. the difference between the estimated and the real performance.
In practice they give similar results.

33 Bootstrap
Random sample of the patients taken with replacement, i.e. after a data point (patient) is selected for a subset, it can still be selected again for the same subset.
Some patients are represented multiple times in a set, others are not selected at all.
The samples that are not selected (out-of-bag samples) are used for prediction and performance estimation.
[Figure: schema of the bootstrap procedure]
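A minimal sketch of one bootstrap resample, assuming the resample has the same size as the original data set (the usual choice, although the slide does not state it). The function name `bootstrap_sample` is hypothetical.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; return in-bag and out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_sample(20, rng)
print(sorted(in_bag))   # some indices appear more than once
print(out_of_bag)       # indices never drawn, used for evaluation
```

The model is trained on the in-bag samples and evaluated on the out-of-bag samples; repeating this many times gives a distribution of performance estimates.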

34 Choosing final tuning parameters
Pick the parameter setting associated with the best accuracy / minimum error.
This is not always the best choice.
Example: SVM accuracy vs cost parameter, 5-fold cross-validation

35 Practical hints to choose the model
A test set is a single evaluation, so its ability to characterize performance is sometimes limited.
Small sample size: we might want to use all points for model building, so resampling might be a better solution.
No resampling method is uniformly better than the others: the choice depends on the situation, e.g. sample size and computational cost.
Bootstrap performance estimates can have lower variance than k-fold CV estimates.
How to practically choose between models, e.g. SVM or logistic regression? How can you compare different models?

36 Performance in Classification Models

37 Class Prediction
RMSE and R² are not appropriate in the context of classification.
Classification models mainly return a continuous value (e.g. a probability between 0 and 1) -> we still need a (discrete) class prediction.
However, the probability itself can sometimes be useful to gauge the confidence of a prediction.

38 Class Prediction - examples
A message with p = 0.51 and another message with p = 0.9 would both be classified as spam.
Imagine a model that classifies molecules based on toxicity: molecule 1 with class probabilities (0.52, 0.48) and molecule 2 with class probabilities (0.98, 0.02) will both be classified as non-toxic -> but the confidence for molecule 2 is higher.

39 Softmax Transformation
Prediction for the l-th class
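The slide's formula did not survive extraction. The sketch below assumes the standard softmax definition, p_l = exp(y_l) / Σ_k exp(y_k), which turns one raw model score per class into class probabilities. The function name and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The predicted class is the one with the largest probability, and the probability itself indicates how confident the prediction is.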

40 Evaluating Predicted Classes
Confusion matrix: example for a two-class outcome

                 Observed
Predicted        Event    Nonevent
Event            TP       FP
Nonevent         FN       TN

TP, TN: classes are correctly predicted. FP, FN: classes are wrongly predicted.
e.g. Event: healthy, Nonevent: toxic

41 Drawbacks of accuracy measure
It does not distinguish between the types of error being made. E.g. in spam filtering, the cost of deleting an important email is higher than the cost of letting a spam message pass the filter.
It does not consider the frequency of each class. E.g. in a compound screening model the molecules with biological activity are a minority.

42 Other metrics

                 Observed
Predicted        Event    Nonevent
Event            TP       FP
Nonevent         FN       TN

Sensitivity (true positive rate) is the rate of correctly predicting the event of interest among all samples having the event:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
Specificity is the rate of nonevent samples predicted correctly:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
False positive rate = 1 - Specificity
Potential trade-offs between sensitivity and specificity can be made while keeping the same accuracy.
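A minimal sketch of these metrics computed from confusion-matrix counts. The two sets of counts are made up to illustrate that two classifiers can have the same accuracy while trading sensitivity against specificity.

```python
def sensitivity(tp, fn):
    """True positive rate: events correctly predicted as events."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: nonevents correctly predicted as nonevents."""
    return tn / (tn + fp)


def accuracy(tp, fp, fn, tn):
    """Fraction of all samples predicted correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Two hypothetical classifiers evaluated on 50 events and 50 nonevents.
model_a = {"tp": 40, "fn": 10, "tn": 30, "fp": 20}
model_b = {"tp": 30, "fn": 20, "tn": 40, "fp": 10}

for name, m in (("A", model_a), ("B", model_b)):
    print(name,
          accuracy(m["tp"], m["fp"], m["fn"], m["tn"]),
          sensitivity(m["tp"], m["fn"]),
          specificity(m["tn"], m["fp"]))
```

Both models reach an accuracy of 0.7, but model A favors sensitivity (0.8 vs 0.6) and model B favors specificity (0.8 vs 0.6).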

43 Other metrics

                 Observed
Predicted        Event    Nonevent
Event            TP       FP
Nonevent         FN       TN

Sensitivity and specificity are conditional measures -> they depend on the event.
In theory, if the event is rare (prevalence w), this should be taken into account:
$$PPV = \frac{\text{Sensitivity} \cdot w}{\text{Sensitivity} \cdot w + (1 - \text{Specificity}) (1 - w)}$$
$$NPV = \frac{\text{Specificity} \cdot (1 - w)}{(1 - \text{Sensitivity}) \cdot w + \text{Specificity} \cdot (1 - w)}$$
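A minimal sketch of how prevalence changes the predictive values, using PPV = Sens·w / (Sens·w + (1 − Spec)(1 − w)) and NPV = Spec·(1 − w) / ((1 − Sens)·w + Spec·(1 − w)). The sensitivity, specificity and prevalence values are made up for illustration.

```python
def ppv(sens, spec, prevalence):
    """Positive predictive value for a given event prevalence."""
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sens, spec, prevalence):
    """Negative predictive value for a given event prevalence."""
    true_neg = spec * (1.0 - prevalence)
    false_neg = (1.0 - sens) * prevalence
    return true_neg / (true_neg + false_neg)


# Same hypothetical classifier, balanced classes versus a rare event.
for w in (0.5, 0.01):
    print(w, ppv(0.9, 0.9, w), npv(0.9, 0.9, w))
```

With a prevalence of 1%, fewer than one in ten predicted events is a real event, even though sensitivity and specificity are both 0.9.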

44 Receiver Operating Characteristic (ROC) Curves
Given a collection of continuous data points (predicted scores), the ROC curve plots sensitivity against the false positive rate (1 - specificity) at different thresholds.
What is the effect of altering the threshold?
AUC = Area Under the Curve, a quantitative assessment of the model
[Figure: ROC curve for a logistic regression model predicting the toxicity of a molecule]
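A minimal sketch of how the curve and its area can be computed: each distinct score is used as a threshold, and the area is obtained with the trapezoidal rule. The function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy predicted probabilities and true classes (1 = event), made up.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Lowering the threshold moves along the curve towards the top right: sensitivity increases, but so does the false positive rate.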

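45 Precision-Recall curve
[Figure: PR curve with axes recall = TP/(TP+FN) and precision = TP/(TP+FP), next to a ROC curve with axes false positive rate = FP/(FP+TN) and true positive rate = TP/(TP+FN)]
The PR curve is much more sensitive to false positives (e.g. healthy patients predicted to have cancer) in cases where the negative class (e.g. healthy patients) dominates.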

46 Class imbalance
Imbalance: when one or more classes have a very low proportion in the training data.
It can have a significant impact on the effectiveness of the model.
E.g. pharmaceutical research: in high-throughput screening only a few molecules show activity, so the frequency of interesting compounds is low.

47 The effect of class imbalance - example
Three models are used to model the high-throughput screening data and are evaluated on a test set.
The result of class imbalance (most compounds show no activity) is that the models are comparable: they have good specificity but very little sensitivity.

48 Class imbalance
What can be done? Change the cutoff to increase the prediction accuracy of the minority class -> find an appropriate balance between sensitivity and specificity.
How do we determine the new cutoff? Find the point on the ROC curve closest to the perfect model, i.e. min(distance) to the top-left corner (sensitivity = 1, specificity = 1).
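A minimal sketch of this cutoff search, assuming the Euclidean distance to the perfect model (sensitivity = 1, specificity = 1) and using every distinct predicted score as a candidate cutoff. The function name and the toy data are hypothetical.

```python
import math


def best_cutoff(scores, labels):
    """Return (cutoff, distance, sensitivity, specificity) closest to perfect."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        sens = tp / n_pos
        spec = 1.0 - fp / n_neg
        dist = math.hypot(1.0 - sens, 1.0 - spec)
        if best is None or dist < best[1]:
            best = (threshold, dist, sens, spec)
    return best


# Toy imbalanced data (made up): 2 events among 10 samples. With the default
# cutoff of 0.5 no sample is predicted as an event, so sensitivity is 0.
scores = [0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05]
labels = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

print(best_cutoff(scores, labels))
```

The cutoff should be chosen on data that is not used for the final evaluation, otherwise the reported performance is optimistic.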

49 Class Imbalance: sampling methods
Reduce the effect of imbalance during training:
Down-sampling (of the majority class)
Bootstrap (such that the classes are balanced in the bootstrap set)
Up-sampling (of the minority class): some samples from the minority class appear more than once in the set
SMOTE (combination of down-sampling and up-sampling)
[Figure: Class 1: healthy, Class 2: cancer; Predictor A: mutation, Predictor B: patient age]
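A minimal sketch of plain down-sampling and up-sampling on sample indices. SMOTE, which also creates synthetic minority samples, is not covered here. The function names and the toy labels are hypothetical.

```python
import random


def group_by_class(labels):
    """Map each class label to the list of sample indices with that label."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    return by_class


def down_sample(labels, rng):
    """Keep a random subset of every class, as large as the smallest class."""
    by_class = group_by_class(labels)
    n_min = min(len(idx) for idx in by_class.values())
    selected = []
    for idx in by_class.values():
        selected.extend(rng.sample(idx, n_min))
    return sorted(selected)


def up_sample(labels, rng):
    """Repeat random samples of smaller classes until all classes are equal."""
    by_class = group_by_class(labels)
    n_max = max(len(idx) for idx in by_class.values())
    selected = []
    for idx in by_class.values():
        selected.extend(idx)
        selected.extend(rng.choices(idx, k=n_max - len(idx)))
    return sorted(selected)


# Toy labels (made up): 8 majority samples and 2 minority samples.
labels = [0] * 8 + [1] * 2
rng = random.Random(0)
print(down_sample(labels, rng))  # 2 indices per class
print(up_sample(labels, rng))    # 8 indices per class, minority repeated
```

Only the training set should be resampled; the evaluation set should keep the original class frequencies.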

50 Goals
Feature selection rather than feature reduction: regularized linear models
From regression to classification
Logistic regression (regularization and partial least squares are also possible)
How to reduce over-fitting
How to evaluate a classification model
Class imbalance
