Lecture 10: Logistic Regression


1 BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics. Lecture 10: Logistic Regression. Jie Wang, Department of Computational Medicine & Bioinformatics, University of Michigan.

2 Outline
- An Example
- Logistic Regression with One Predictor
- Multiple Logistic Regression
- Sparse Logistic Regression
- Tips for Real Applications: One vs. All for Multi-Class Classification; Imbalanced Data

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

3 An Example. Predict whether an individual will default based on annual income, monthly balance, and student status. Shown is a subset of the Default data set, which contains three predictors with a binary response. Notice that the response is not continuous; it takes binary values (Yes or No). This is a classification problem.

4 Why Not Linear Regression I. For linear regression, the outcome is usually continuous (there is an order on the possible values of the outcome). For a classification problem, the outcome takes discrete/categorical values, for which order is meaningless. An example: suppose you are trying to guess the identity of a superhero based on his/her special talents. He or she can be Captain America, The Hulk, Black Widow, or Thor. How can we encode the outcome? We could set
$Y = 1$ (Captain America), $2$ (The Hulk), $3$ (Black Widow), $4$ (Thor),
or
$Y = 1$ (The Hulk), $2$ (Black Widow), $3$ (Thor), $4$ (Captain America).
Which one is more reasonable?

5 Why Not Linear Regression II. If we apply linear regression to classification problems, different encodings of the outcome produce fundamentally different linear models, leading to different predictions for the same test observations. We have no general principle for converting discrete outcomes with more than two possible levels into continuous ones suitable for linear regression. The good news is that if the outcome is binary, we can code it simply as $Y = 0$ (Captain America) or $Y = 1$ (Thor).

6 Logistic Regression with Binary Outcome. Consider the example in which the outcome $Y$ is binary: default or not default. As $Y$ is binary, we can code it by
$Y = -1$ if default = no, $Y = +1$ if default = yes (notationally convenient for later use),
or
$Y = 0$ if default = no, $Y = 1$ if default = yes.
Different from linear regression, logistic regression models the probabilities of $Y = +1$ and $Y = -1$. For example, given balance, logistic regression models $P(\text{default} = \text{yes} \mid \text{balance})$, or $P(\text{default} = \text{no} \mid \text{balance}) = 1 - P(\text{default} = \text{yes} \mid \text{balance})$.

7 Simple Logistic Regression. For simplicity, we consider only one predictor: balance. As $P(\text{default} = \text{yes} \mid \text{balance}) \in [0, 1]$, we make use of the logistic function to model the probability:
$P(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$,
where $P(X)$ is a simplified form of $P(Y = 1 \mid X)$ and $X$ is the predictor. [Figure: an illustration of the logistic function.]
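To make this concrete, here is a minimal Python (NumPy) sketch of the logistic function; the coefficients below are illustrative values, roughly those ISLR reports for balance on the Default data, not results derived in this lecture:

    import numpy as np

    def logistic(x, beta0, beta1):
        """P(Y = 1 | X = x) under the simple logistic model."""
        return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

    # Illustrative coefficients (close to ISLR's fit of default vs. balance).
    beta0, beta1 = -10.65, 0.0055
    print(logistic(1000.0, beta0, beta1))  # ~0.006: low balance, low default risk
    print(logistic(2000.0, beta0, beta1))  # ~0.59: high balance, high default risk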

8 Odds and Log-Odds. The so-called odds is defined as the ratio of the probability that the outcome is Yes to the probability that the outcome is No:
$\frac{P(X)}{1 - P(X)} = e^{\beta_0 + \beta_1 X}$.
Odds are popular in gambling, e.g., horse racing. The log-odds is defined as the log of the odds:
$\log \frac{P(X)}{1 - P(X)} = \beta_0 + \beta_1 X$.
The log-odds is linear in the predictor, although $P(X)$ is not.
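A quick worked example, with made-up numbers: if $P(X) = 0.2$, the odds are $0.2 / 0.8 = 1/4$ (one default for every four non-defaults) and the log-odds are $\log(1/4) \approx -1.39$; if $P(X) = 0.9$, the odds are $0.9 / 0.1 = 9$ and the log-odds are $\log 9 \approx 2.20$.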

9 Estimating the Regression Coefficients. We estimate the regression coefficients by maximum likelihood. Intuitively, the idea of maximum likelihood is to find a set of coefficients such that the probability of the observations is maximized (under the assumption that the observations are drawn independently). For our example, maximum likelihood means that we try to find $\beta_0$ and $\beta_1$ such that $p(x_i)$ is close to one for all individuals who default, and close to zero for all individuals who do not default.

10 Maximum Likelihood Estimate I. Given a set of observations $\{(x_i, y_i) : i = 1, \dots, n\}$, the likelihood function is
$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{j: y_j = -1} \left(1 - p(x_j)\right) = \prod_{i: y_i = 1} \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \prod_{j: y_j = -1} \frac{1}{1 + e^{\beta_0 + \beta_1 x_j}} = \prod_{i} \frac{1}{1 + e^{-y_i (\beta_0 + \beta_1 x_i)}}$.
(Do not confuse the likelihood function with the probability function.)

11 Maximum Likelihood Estimate II. Maximum likelihood aims to maximize the likelihood function:
$\max_{\beta_0, \beta_1} \ell(\beta_0, \beta_1) = \prod_i \frac{1}{1 + e^{-y_i (\beta_0 + \beta_1 x_i)}}$.
The maximization problem is equivalent to a minimization problem (we first convert the product into a summation by taking the logarithm):
$\min_{\beta_0, \beta_1} -\log \ell(\beta_0, \beta_1) = \sum_i \log\left(1 + e^{-y_i (\beta_0 + \beta_1 x_i)}\right)$.
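In code, this objective is short; a minimal NumPy sketch, assuming the $y_i \in \{-1, +1\}$ coding introduced earlier:

    import numpy as np

    def neg_log_likelihood(beta0, beta1, x, y):
        """-log l(beta0, beta1) for labels y in {-1, +1}."""
        z = y * (beta0 + beta1 * x)
        # np.logaddexp(0, -z) computes log(1 + e^{-z}) in a numerically stable way.
        return np.sum(np.logaddexp(0.0, -z))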

12 Gradient Descent. Let us denote
$(\hat\beta_0, \hat\beta_1) = \operatorname{argmin}_{\beta_0, \beta_1} -\log \ell(\beta_0, \beta_1) = \operatorname{argmin}_{\beta_0, \beta_1} \sum_i \log\left(1 + e^{-y_i (\beta_0 + \beta_1 x_i)}\right)$.
There is no closed-form solution for $(\hat\beta_0, \hat\beta_1)$. One of the simplest methods is to iteratively decrease the objective function value $L(\beta_0, \beta_1) = -\log \ell(\beta_0, \beta_1)$ by gradient descent:
$\beta_0^{k+1} = \beta_0^k - \eta \, \nabla_{\beta_0} L(\beta_0^k, \beta_1^k)$,
$\beta_1^{k+1} = \beta_1^k - \eta \, \nabla_{\beta_1} L(\beta_0^k, \beta_1^k)$,
where $\eta$ is the so-called learning rate. For a properly chosen learning rate $\eta$ (e.g., by backtracking line search), we have $L(\beta_0^{k+1}, \beta_1^{k+1}) \le L(\beta_0^k, \beta_1^k)$, and the sequence $L(\beta_0^k, \beta_1^k)$, $k = 1, 2, \dots$, converges to $L(\hat\beta_0, \hat\beta_1)$.

13 Batch Training. We write the updates of $(\beta_0, \beta_1)$ explicitly as
$\beta_0^{k+1} = \beta_0^k + \eta \sum_i \frac{y_i \, e^{-y_i (\beta_0^k + \beta_1^k x_i)}}{1 + e^{-y_i (\beta_0^k + \beta_1^k x_i)}}$,
$\beta_1^{k+1} = \beta_1^k + \eta \sum_i \frac{y_i x_i \, e^{-y_i (\beta_0^k + \beta_1^k x_i)}}{1 + e^{-y_i (\beta_0^k + \beta_1^k x_i)}}$.
The above updates need to pass through all of the training data; this is therefore called a batch update. If the training data contain millions or billions of samples (very common in real applications), the updates can be very time-consuming.
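A minimal NumPy sketch of these batch updates (fit_batch is a name introduced here for illustration; the fixed learning rate and iteration count are untuned, illustrative choices):

    import numpy as np

    def fit_batch(x, y, eta=0.1, n_iters=1000):
        """Batch gradient descent for simple logistic regression; y in {-1, +1}."""
        b0, b1 = 0.0, 0.0
        for _ in range(n_iters):
            # Per-sample weight e^{-z_i} / (1 + e^{-z_i}) = 1 / (1 + e^{z_i}).
            w = 1.0 / (1.0 + np.exp(y * (b0 + b1 * x)))
            b0 += eta * np.sum(y * w)       # full pass over the data per step
            b1 += eta * np.sum(y * x * w)
        return b0, b1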

14 Online Training. Different from batch training (which involves all of the data samples in each iteration), online training updates the parameters by visiting one data sample per iteration. Let $f_i(\beta_0, \beta_1) = \log\left(1 + e^{-y_i (\beta_0 + \beta_1 x_i)}\right)$. Then
$(\hat\beta_0, \hat\beta_1) = \operatorname{argmin}_{\beta_0, \beta_1} -\log \ell(\beta_0, \beta_1) = \operatorname{argmin}_{\beta_0, \beta_1} \sum_i f_i(\beta_0, \beta_1)$.
For online training (stochastic gradient descent), each iteration uses a single sample $(x_k, y_k)$:
$\beta_0^{k+1} = \beta_0^k - \eta \, \nabla_{\beta_0} f_k(\beta_0^k, \beta_1^k) = \beta_0^k + \eta \, \frac{y_k \, e^{-y_k (\beta_0^k + \beta_1^k x_k)}}{1 + e^{-y_k (\beta_0^k + \beta_1^k x_k)}}$,
$\beta_1^{k+1} = \beta_1^k - \eta \, \nabla_{\beta_1} f_k(\beta_0^k, \beta_1^k) = \beta_1^k + \eta \, \frac{y_k x_k \, e^{-y_k (\beta_0^k + \beta_1^k x_k)}}{1 + e^{-y_k (\beta_0^k + \beta_1^k x_k)}}$.
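The matching online sketch updates from one sample at a time (again with illustrative, untuned hyperparameters; fit_sgd is a hypothetical helper name):

    import numpy as np

    def fit_sgd(x, y, eta=0.01, n_epochs=10, seed=0):
        """Stochastic gradient descent for simple logistic regression; y in {-1, +1}."""
        rng = np.random.default_rng(seed)
        b0, b1 = 0.0, 0.0
        for _ in range(n_epochs):
            for k in rng.permutation(len(x)):   # visit samples in random order
                w = 1.0 / (1.0 + np.exp(y[k] * (b0 + b1 * x[k])))
                b0 += eta * y[k] * w            # update from a single sample
                b1 += eta * y[k] * x[k] * w
        return b0, b1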

15 Batch Training vs Online Training I. [Figure: batch training (gradient descent).]

16 Batch Training vs Online Training II. [Figure: online training (stochastic gradient descent).]

17 Batch Training vs Online Training III. [Figure: online training (improved stochastic gradient descent).]

18 Multiple Logistic Regression. When there is more than one predictor, logistic regression models the outcome by
$P(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$.
The likelihood function is
$\ell(\beta_0, \beta_1, \dots, \beta_p) = \prod_i \frac{1}{1 + e^{-y_i (\beta_0 + \beta_1 x_i^1 + \cdots + \beta_p x_i^p)}}$.
The maximum likelihood estimates of the parameters are
$(\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p) = \operatorname{argmin}_{\beta_0, \beta_1, \dots, \beta_p} -\log \ell(\beta_0, \beta_1, \dots, \beta_p) = \operatorname{argmin}_{\beta_0, \beta_1, \dots, \beta_p} \sum_i \log\left(1 + e^{-y_i (\beta_0 + \beta_1 x_i^1 + \cdots + \beta_p x_i^p)}\right)$.
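The batch update vectorizes directly to $p$ predictors; a minimal sketch that folds the intercept into the coefficient vector (fit_multiple is a name introduced here for illustration):

    import numpy as np

    def fit_multiple(X, y, eta=0.1, n_iters=1000):
        """Batch gradient descent for multiple logistic regression.
        X: (n, p) matrix of predictors; y: (n,) labels in {-1, +1}."""
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend intercept column
        beta = np.zeros(Xb.shape[1])
        for _ in range(n_iters):
            w = 1.0 / (1.0 + np.exp(y * (Xb @ beta)))   # per-sample weights
            beta += eta * (Xb.T @ (y * w))              # gradient step on all coefficients
        return beta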

19 Logistic Regression on the Default Data. The standard error is used to compute confidence intervals: there is approximately a 95% chance that the interval $[\hat\beta_i - 2\,\mathrm{SE}(\hat\beta_i),\ \hat\beta_i + 2\,\mathrm{SE}(\hat\beta_i)]$ contains the true value of $\beta_i$. A large absolute value of the Z-statistic, or a small P-value, indicates strong evidence that the corresponding predictor is associated with the outcome. [Figure: a visualization of the Default data set, showing the annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue.]

20 Motivation for Sparse Models. Revisiting the Default data set: there is no strong association between the predictor income and the outcome default, so the single predictor balance may lead to better generalization performance (on testing data). In many real-world applications (especially biomedical studies), the number of features/predictors (genes, SNPs, etc.) is much larger than the number of samples (e.g., patients), and many studies indicate that only a few of the features contribute to the outcome (symptoms of diseases, etc.).

21 Sparse Logistic Regression. We introduce sparsity to the probability
$P(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}$.
Sparsity means that many coefficients are zero, i.e., they make no contribution to the probability. We can enforce sparsity by the $\ell_1$ regularizer:
$(\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p) = \operatorname{argmin}_{\beta_0, \beta_1, \dots, \beta_p} -\log \ell(\beta_0, \beta_1, \dots, \beta_p) + \lambda \sum_{j=1}^{p} |\beta_j| = \operatorname{argmin}_{\beta_0, \beta_1, \dots, \beta_p} \sum_i \log\left(1 + e^{-y_i (\beta_0 + \beta_1 x_i^1 + \cdots + \beta_p x_i^p)}\right) + \lambda \sum_{j=1}^{p} |\beta_j|$.
Usually, we do not penalize the intercept $\beta_0$.
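In practice one rarely hand-codes the $\ell_1$-penalized solver. A sketch using scikit-learn, which parameterizes the penalty strength as $C = 1/\lambda$, so a smaller $C$ means a stronger penalty and a sparser model:

    from sklearn.linear_model import LogisticRegression

    # L1-penalized logistic regression; the saga solver supports the l1 penalty,
    # and the intercept is not penalized, matching the formulation above.
    model = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
    # After model.fit(X, y), (model.coef_ != 0).sum() counts the surviving predictors.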

22 Projected Gradient Descent. Another popular formulation of sparse logistic regression is
$\min_{\beta_0, \beta_1, \dots, \beta_p} \sum_i \log\left(1 + e^{-y_i (\beta_0 + \beta_1 x_i^1 + \cdots + \beta_p x_i^p)}\right)$ subject to $\sum_{j=1}^{p} |\beta_j| \le r$.
Framework of projected gradient descent in the $k$-th iteration: first update $(\beta_0^k, \beta_1^k, \dots, \beta_p^k)$ from $(\beta_0^{k-1}, \beta_1^{k-1}, \dots, \beta_p^{k-1})$ by gradient descent, then project $(\beta_1^k, \beta_2^k, \dots, \beta_p^k)$ onto the set $S = \{(\beta_1, \beta_2, \dots, \beta_p) : \sum_{j=1}^{p} |\beta_j| \le r\}$.
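The gradient step is the same as before; the new ingredient is the projection. A sketch of the standard sort-based Euclidean projection onto the $\ell_1$ ball (following Duchi et al., 2008):

    import numpy as np

    def project_l1_ball(beta, r):
        """Euclidean projection of beta onto the set {b : sum_j |b_j| <= r}."""
        if np.abs(beta).sum() <= r:
            return beta.copy()                       # already feasible
        u = np.sort(np.abs(beta))[::-1]              # magnitudes, descending
        css = np.cumsum(u)
        ks = np.arange(1, len(u) + 1)
        rho = np.nonzero(u - (css - r) / ks > 0)[0][-1]
        theta = (css[rho] - r) / (rho + 1.0)         # soft-threshold level
        return np.sign(beta) * np.maximum(np.abs(beta) - theta, 0.0)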

23 Logistic Regression for Multi-Class Problems: One vs. All. For each class $k$, we fit a binary logistic regression with $Y = +1$ for the samples in class $k$ and $Y = -1$ for all other samples, obtaining a model $P_k(X)$; class 1 gives $P_1(X)$, class 2 gives $P_2(X)$, class 3 gives $P_3(X)$, and so on. Given a new instance $x$, we set the class label of $x$ to $k$ such that $P_k(x) \ge P_i(x)$ for all $i$.
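A minimal one-vs-all sketch on top of any binary trainer (reusing the illustrative fit_multiple from above; since the logistic function is monotone, choosing the largest $P_k(x)$ is the same as choosing the largest linear score):

    import numpy as np

    def one_vs_all_fit(X, y, classes, fit_binary):
        """Train one binary logistic model per class; y holds class labels."""
        return {c: fit_binary(X, np.where(y == c, 1.0, -1.0)) for c in classes}

    def one_vs_all_predict(X, betas):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        classes = list(betas)
        scores = np.column_stack([Xb @ betas[c] for c in classes])
        return np.asarray(classes)[np.argmax(scores, axis=1)]  # largest P_k(x)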

24 Tip for Imbalanced Data: Undersampling. In many (probably most) real applications, the data are imbalanced; in other words, the number of positive samples is much larger or smaller than the number of negative samples, which can cause serious problems in training the model. A commonly used technique to handle imbalanced data is undersampling: we undersample the larger class (the one with more data samples) so that the two classes have roughly the same number of data samples.
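A minimal undersampling sketch, assuming labels in $\{-1, +1\}$:

    import numpy as np

    def undersample(X, y, seed=0):
        """Randomly drop samples from the larger class until the classes balance."""
        rng = np.random.default_rng(seed)
        pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == -1)
        big, small = (pos, neg) if len(pos) > len(neg) else (neg, pos)
        keep = rng.choice(big, size=len(small), replace=False)
        idx = np.concatenate([small, keep])
        return X[idx], y[idx]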

25 A Synthetic Data Set. Why synthetic data? Because we know the ground truth. Positive samples: $\sim N(5, 2.5)$. Negative samples: $\sim N(-5, 2.5)$.

26 Imbalanced Data. [Figure: fitted models on the imbalanced data.]

27 Undersampling and Ensembling. We undersample the data so that the numbers of data samples in the two classes are roughly the same. [Figure: fitted models on the undersampled data. The fifth panel plots the model obtained by averaging the parameters; the last one plots the model obtained by majority voting.]
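A sketch of the two combination rules above, fitting one model per undersampled copy of the data and then either averaging the parameters or taking a majority vote (reusing the illustrative fit_multiple and undersample helpers):

    import numpy as np

    def fit_ensemble(X, y, n_models=5):
        """One model per independently undersampled copy of the data."""
        return [fit_multiple(*undersample(X, y, seed=s)) for s in range(n_models)]

    def predict_majority(X, betas):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        votes = np.sign(np.column_stack([Xb @ b for b in betas]))  # each model votes +/-1
        return np.sign(votes.sum(axis=1))       # majority vote (0 marks a tie)

    def predict_averaged(X, betas):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        return np.sign(Xb @ np.mean(betas, axis=0))                # averaged parameters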

28 Resources. Books: An Introduction to Statistical Learning, with Applications in R, Section 4.3; Bayesian Reasoning and Machine Learning. Software: SLEP (Sparse Learning with Efficient Projections).
