Logistic Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com

1 Logistic Regression
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova 123RF.com

2 Classification Based on Probability
Instead of just predicting the class, give the probability of the instance being that class, i.e., learn $p(y \mid x)$.
Comparison to perceptron:
  Perceptron doesn't produce a probability estimate.
  Perceptron (and other discriminative classifiers) are only interested in producing a discriminative model.
Recall that:
  $0 \le p(\text{event}) \le 1$
  $p(\text{event}) + p(\neg\text{event}) = 1$

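3 Logistic Regression
Takes a probabilistic approach to learning discriminative functions (i.e., a classifier).
$h_\theta(x)$ should give $p(y = 1 \mid x; \theta)$.
  Want $0 \le h_\theta(x) \le 1$.
  Can't just use linear regression with a threshold.
Logistic regression model:
  $h_\theta(x) = g(\theta^T x)$, where $g(z) = \frac{1}{1 + e^{-z}}$ is the logistic (sigmoid) function.
  Equivalently, $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$.
[Figure: plot of the logistic / sigmoid function g(z)]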
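As a minimal sketch of the model on this slide, the sigmoid and the hypothesis can be written in a few lines of Python. The parameter values and the input are hypothetical, chosen only so that theta^T x is exactly 0, and the input vector is assumed to begin with the bias term x_0 = 1.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to start with x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters and input, for illustration only.
theta = [-3.0, 1.5]
x = [1.0, 2.0]                # x_0 = 1, x_1 = 2
print(hypothesis(theta, x))   # theta^T x = 0, so this prints 0.5
```

The two-branch form of sigmoid is a design choice for numerical safety; it computes the same function as the single formula on the slide.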

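4 Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated $p(y = 1 \mid x; \theta)$
Example: Cancer diagnosis from tumor size
  $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$
  $h_\theta(x) = 0.7$ → Tell the patient there is a 70% chance of the tumor being malignant.
Note that: $p(y = 0 \mid x; \theta) + p(y = 1 \mid x; \theta) = 1$
Therefore, $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$
Based on example by Andrew Ng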

5 Another Interpretation
Equivalently, logistic regression assumes that
  $\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d$
where the ratio inside the logarithm is the odds of y = 1.
Side Note: the odds in favor of an event is the quantity p / (1 - p), where p is the probability of the event. E.g., if I toss a fair die, what are the odds that I will roll a 6?
In other words, logistic regression assumes that the log odds is a linear function of x.
Based on slide by Xiaoli Fern
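The side note and the log-odds claim can be checked numerically. The sketch below works out the fair-die example (probability 1/6 gives odds of 1/5) and then confirms, for hypothetical parameters, that converting the model's probability back to log odds recovers theta^T x. The function names are illustrative and not from the slides.

```python
import math

def odds(p):
    """Odds in favor of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

def linear_score(theta, x):
    """theta^T x, where x is assumed to start with x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))

# Fair six-sided die: p(roll a 6) = 1/6, so the odds are (1/6) / (5/6) = 1/5.
p_six = 1.0 / 6.0
print(odds(p_six))            # approximately 0.2, i.e. 1 to 5

# Hypothetical parameters: the log odds of the model output equal theta^T x.
theta = [-1.0, 0.5]
x = [1.0, 4.0]
z = linear_score(theta, x)    # -1 + 0.5 * 4 = 1
p = 1.0 / (1.0 + math.exp(-z))
print(math.log(odds(p)), z)   # both approximately 1.0
```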

6 Logistic Regression
$h_\theta(x) = g(\theta^T x)$, $g(z) = \frac{1}{1 + e^{-z}}$
[Figure: plot of g(z), with the y = 0 region on the left and the y = 1 region on the right]
$\theta^T x$ should be large negative values for negative instances.
$\theta^T x$ should be large positive values for positive instances.
Assume a threshold and...
  Predict y = 1 if $h_\theta(x) \ge 0.5$
  Predict y = 0 if $h_\theta(x) < 0.5$
Based on slide by Andrew Ng
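A small sketch of the thresholding rule follows. The parameters are hypothetical and place the decision boundary at x_1 = 2, which is where theta^T x = 0 and the hypothesis equals 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x, threshold=0.5):
    """Return 1 if h_theta(x) >= threshold, else 0 (x starts with x_0 = 1)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical parameters: theta^T x = -4 + 2 * x_1, so the boundary is x_1 = 2.
theta = [-4.0, 2.0]
print(predict(theta, [1.0, 3.0]))  # theta^T x = 2  -> 1
print(predict(theta, [1.0, 1.0]))  # theta^T x = -2 -> 0
print(predict(theta, [1.0, 2.0]))  # theta^T x = 0, h = 0.5 -> 1
```

With a threshold of 0.5, predicting y = 1 is the same as checking whether theta^T x is non-negative, so the decision boundary is linear in x.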

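7 Non-Linear Decision Boundary
Can apply basis function expansion to features, same as with linear regression, e.g.,
  $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & x_1 & x_2 & x_1 x_2 & x_1^2 & x_2^2 & x_1^2 x_2 & x_1 x_2^2 & \ldots \end{bmatrix}^T$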
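To illustrate how expanded features give a non-linear boundary, the sketch below uses a degree-2 expansion of two features. Both the choice of terms and the parameter values are assumptions made for this example; the parameters are picked so that theta^T phi(x) = x_1^2 + x_2^2 - 1, which makes the boundary the unit circle.

```python
def expand_degree2(x1, x2):
    """Map (x1, x2) to [1, x1, x2, x1*x2, x1^2, x2^2]."""
    return [1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2]

def predict(theta, features):
    """Predict 1 when theta^T features >= 0, which is where h_theta >= 0.5."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z >= 0 else 0

# Hypothetical parameters: theta^T phi(x) = x1^2 + x2^2 - 1 (unit circle).
theta = [-1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
print(predict(theta, expand_degree2(0.2, 0.3)))  # inside the circle  -> 0
print(predict(theta, expand_degree2(1.5, 0.0)))  # outside the circle -> 1
```

The model is still linear in the expanded features, so the same training procedure applies unchanged.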

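8 Logistic Regression
Given $\left\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)}) \right\}$ where $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \{0, 1\}$
Model: $h_\theta(x) = g(\theta^T x)$, $g(z) = \frac{1}{1 + e^{-z}}$
  with $\theta = \begin{bmatrix} \theta_0 & \theta_1 & \ldots & \theta_d \end{bmatrix}^T$ and $x = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}^T$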

9 Logistic Regression Objective Function
Can't just use squared loss as in linear regression:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Using the logistic regression model $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ results in a non-convex optimization.

10 Deriving the Cost Function via Maximum Likelihood Estimation
Likelihood of data is given by:
  $l(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$
So, looking for the θ that maximizes the likelihood:
  $\theta_{MLE} = \arg\max_\theta l(\theta) = \arg\max_\theta \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$
Can take the log without changing the solution:
  $\theta_{MLE} = \arg\max_\theta \log \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) = \arg\max_\theta \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; \theta)$

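11 Deriving the Cost Function via Maximum Likelihood Estimation
Expand as follows:
  $\theta_{MLE} = \arg\max_\theta \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; \theta)$
  $= \arg\max_\theta \sum_{i=1}^{n} \left[ y^{(i)} \log p(y^{(i)} = 1 \mid x^{(i)}; \theta) + (1 - y^{(i)}) \log \left( 1 - p(y^{(i)} = 1 \mid x^{(i)}; \theta) \right) \right]$
Substitute in model, and take negative to yield
Logistic regression objective: $\min_\theta J(\theta)$
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta(x^{(i)}) \right) \right]$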
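As a numerical check on this derivation, the sketch below computes the objective J(θ) directly and compares it with the negative log of the likelihood product. The dataset and parameter values are made up for illustration, and each input is assumed to start with the bias term x_0 = 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def objective(theta, X, y):
    """J(theta) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = h(theta, x_i)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return -total

def likelihood(theta, X, y):
    """l(theta) = prod_i p(y_i | x_i; theta)."""
    prod = 1.0
    for x_i, y_i in zip(X, y):
        p = h(theta, x_i)
        prod *= p if y_i == 1 else 1.0 - p
    return prod

# Tiny made-up dataset and hypothetical parameters.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0, 0, 1]
theta = [-3.0, 1.0]
print(objective(theta, X, y), -math.log(likelihood(theta, X, y)))  # the two values agree
```

Working with the sum of logs rather than the product also avoids underflow when n is large, which is a practical reason to prefer the log form.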

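12 Intuition Behind the Objective
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta(x^{(i)}) \right) \right]$
Cost of a single instance:
  $\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Can re-write objective function as
  $J(\theta) = \sum_{i=1}^{n} \text{cost}\left( h_\theta(x^{(i)}), y^{(i)} \right)$
Compare to linear regression:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$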

13 Intuition Behind the Objective
  $\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
Aside: Recall the plot of log(z)

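14 Intuition Behind the Objective
  $\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
If y = 1:
  Cost = 0 if prediction is correct
  As $h_\theta(x) \rightarrow 0$, cost $\rightarrow \infty$
  Captures intuition that larger mistakes should get larger penalties, e.g., predict $h_\theta(x) = 0$, but y = 1
[Figure: cost versus $h_\theta(x)$ on [0, 1], curve for y = 1]
Based on example by Andrew Ng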

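15 Intuition Behind the Objective
  $\text{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
If y = 0:
  Cost = 0 if prediction is correct
  As $(1 - h_\theta(x)) \rightarrow 0$, cost $\rightarrow \infty$
  Captures intuition that larger mistakes should get larger penalties
[Figure: cost versus $h_\theta(x)$ on [0, 1], curves for y = 1 and y = 0]
Based on example by Andrew Ng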
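A few evaluated values make the shape of the per-instance cost concrete. The sketch below uses the natural logarithm and three arbitrary predicted probabilities; it shows the cost staying near 0 for confident correct predictions and growing quickly for confident wrong ones.

```python
import math

def cost(h, y):
    """Per-instance cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Columns: predicted h, cost when y = 1, cost when y = 0.
for h in (0.99, 0.5, 0.01):
    print(h, round(cost(h, 1), 3), round(cost(h, 0), 3))
# 0.99 0.01 4.605
# 0.5 0.693 0.693
# 0.01 4.605 0.01
```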

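16 Regularized Logistic Regression
  $J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta(x^{(i)}) \right) \right]$
We can regularize logistic regression exactly as before:
  $J_{\text{regularized}}(\theta) = J(\theta) + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2 = J(\theta) + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$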
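A minimal sketch of the regularized objective follows, assuming the penalty is (λ/2) times the sum of squared parameters with the bias θ_0 left out, as the index range [1:d] indicates. The data, parameters, and the keyword name lam are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def objective(theta, X, y):
    """Unregularized J(theta): negative log-likelihood of the data."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return -total

def regularized_objective(theta, X, y, lam):
    """J(theta) + (lam / 2) * sum_{j=1..d} theta_j^2; theta_0 is not penalized."""
    penalty = 0.5 * lam * sum(t * t for t in theta[1:])
    return objective(theta, X, y) + penalty

# Made-up data and hypothetical parameters.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0, 0, 1]
theta = [-3.0, 1.0]
print(regularized_objective(theta, X, y, lam=0.0) == objective(theta, X, y))  # True
print(regularized_objective(theta, X, y, lam=2.0) - objective(theta, X, y))   # about 1.0
```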

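17 Gradient Descent for Logistic Regression
  $J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$
Want $\min_\theta J(\theta)$
Initialize θ
Repeat until convergence:
  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for j = 0 ... d)
Use the natural logarithm ($\ln = \log_e$) to cancel with the exp() in $h_\theta(x)$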

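18 Gradient Descent for Logistic Regression
  $J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$
Want $\min_\theta J(\theta)$
Initialize θ
Repeat until convergence (simultaneous update for j = 0 ... d):
  $\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$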

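19 Gradient Descent for Logistic Regression
Initialize θ
Repeat until convergence (simultaneous update for j = 0 ... d):
  $\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  $\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$
This looks IDENTICAL to linear regression!!!
  Ignoring the 1/n constant
  However, the form of the model is very different: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$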
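The update rule can be sketched as a short batch gradient descent loop. This is an illustrative version under stated assumptions: it runs for a fixed number of iterations instead of testing convergence, and the data, learning rate, and regularization strength are made up. The bias θ_0 is updated without the λ term.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient_descent(X, y, alpha=0.05, lam=0.0, iterations=3000):
    """Batch gradient descent on the regularized objective."""
    d = len(X[0])                 # number of parameters, including theta_0
    theta = [0.0] * d
    for _ in range(iterations):
        errors = [sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
                  for x_i, y_i in zip(X, y)]
        grad = [sum(e * x_i[j] for e, x_i in zip(errors, X)) for j in range(d)]
        for j in range(1, d):
            grad[j] += lam * theta[j]   # no penalty on theta_0
        # Simultaneous update: every theta_j uses the gradient computed above.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up 1-D data with overlapping classes; each x starts with x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]
theta = gradient_descent(X, y, lam=0.1)
preds = [1 if sigmoid(theta[0] + theta[1] * x[1]) >= 0.5 else 0 for x in X]
print(theta, preds)
```

Because this toy dataset is symmetric about x_1 = 2.5, the fitted boundary should land close to that point.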

20 Stochastic Gradient Descent

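21 Consider Learning with Numerous Data
Logistic regression objective:
  $J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log \left( 1 - h_\theta(x_i) \right) \right]$
Fit via gradient descent:
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$
  where $\left( h_\theta(x_i) - y_i \right) x_{ij} = \frac{\partial}{\partial \theta_j} \text{cost}_\theta(x_i, y_i)$
What is the computational complexity in terms of n?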

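22 Gradient Descent
Batch Gradient Descent:
  Initialize θ
  Repeat {
    $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$   for j = 0...d
  }
  Here $\frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij} = \frac{\partial}{\partial \theta_j} J(\theta)$.
Stochastic Gradient Descent:
  Initialize θ
  Randomly shuffle dataset
  Repeat { (typically 1 to 10 times)
    For i = 1...n, do
      $\theta_j \leftarrow \theta_j - \alpha \left( h_\theta(x_i) - y_i \right) x_{ij}$   for j = 0...d
  }
  Here $\left( h_\theta(x_i) - y_i \right) x_{ij} = \frac{\partial}{\partial \theta_j} \text{cost}_\theta(x_i, y_i)$.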
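The stochastic variant can be sketched as follows. This is a minimal illustration, not a tuned implementation: the synthetic dataset, the learning rate, the seeds, and the choice of 10 passes are all assumptions. Each parameter update uses a single training instance.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sgd(X, y, alpha=0.1, epochs=10, seed=0):
    """Stochastic gradient descent for logistic regression."""
    rng = random.Random(seed)
    data = list(zip(X, y))
    rng.shuffle(data)                    # randomly shuffle the dataset
    theta = [0.0] * len(X[0])
    for _ in range(epochs):              # "Repeat" (typically 1 to 10 passes)
        for x_i, y_i in data:            # for i = 1...n
            error = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
            # Update every theta_j from this one instance.
            theta = [t - alpha * error * v for t, v in zip(theta, x_i)]
    return theta

# Synthetic data: label is 1 when x_1 > 0; each x starts with x_0 = 1.
rng = random.Random(1)
xs = [rng.uniform(-3.0, 3.0) for _ in range(200)]
X = [[1.0, v] for v in xs]
y = [1 if v > 0 else 0 for v in xs]
theta = sgd(X, y)
correct = sum((sigmoid(theta[0] + theta[1] * v) >= 0.5) == (label == 1)
              for v, label in zip(xs, y))
accuracy = correct / len(y)
print(theta, accuracy)
```

Each update touches one instance, so a single step costs O(d) instead of the O(nd) needed for one batch step.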

23 Batch vs Stochastic GD
[Figure: two panels, "Batch GD" and "Stochastic GD"]
Learning rate α is typically held constant.
Can slowly decrease α over time to force θ to converge, e.g.,
  $\alpha_t = \frac{\text{constant1}}{\text{iterationNumber} + \text{constant2}}$
Based on slide by Andrew Ng
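The decaying schedule is simple to evaluate. In the sketch below the two constants are arbitrary values picked for illustration; in practice they would be tuned.

```python
def decayed_alpha(iteration, constant1=1.0, constant2=10.0):
    """alpha_t = constant1 / (iteration + constant2)."""
    return constant1 / (iteration + constant2)

for t in (0, 10, 90, 990):
    print(t, decayed_alpha(t))
# 0 0.1
# 10 0.05
# 90 0.01
# 990 0.001
```

Here constant2 controls how long the rate stays near its starting value constant1 / constant2 before the 1/t decay takes over.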

24 Adagrad

25 New Stochastic Gradient Algorithms
So far, we have considered:
  a constant learning rate α
  a time-dependent learning rate $\alpha_t$ via a pre-set formula
AdaGrad adjusts the learning rate based on historical information
  Frequently occurring features in the gradients get small learning rates and infrequent features get higher ones
  Key idea: learn slowly from frequent features but pay attention to rare but informative features
Define a per-feature learning rate for feature j as:
  $\alpha_{t,j} = \frac{\alpha}{\sqrt{G_{t,j}}}$ where $G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$ and $g_{k,j} = \frac{\partial}{\partial \theta_j} \text{cost}_\theta(x_k, y_k)$
$G_{t,j}$ is the sum of squares of gradients of feature j through time t

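26 New Stochastic Gradient Algorithms
Adagrad per-feature learning rate:
  $\alpha_{t,j} = \frac{\alpha}{\sqrt{G_{t,j}}}$ where $G_{t,j} = \sum_{k=1}^{t} g_{k,j}^2$
Adagrad changes the update rule for SGD at time t from
  $\theta_j \leftarrow \theta_j - \alpha \, g_{t,j}$
to
  $\theta_j \leftarrow \theta_j - \frac{\alpha}{\sqrt{G_{t,j}} + \zeta} \, g_{t,j}$
In practice, we add a small constant ζ > 0 to prevent dividing by zero errors
Adagrad converges quickly:
[Figure: two plots of a 2-norm quantity versus Time, one labeled "With a good choice for α" and one labeled "With a bad choice for α"]
Plots from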
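A hedged sketch of the Adagrad update for logistic regression is shown below. It assumes the small constant ζ is added to the square root in the denominator; implementations differ on whether it sits inside or outside the root. The synthetic data, the base rate α, and the seeds are also assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def adagrad(X, y, alpha=0.5, zeta=1e-8, epochs=10, seed=0):
    """SGD with Adagrad per-feature learning rates alpha / (sqrt(G_j) + zeta)."""
    rng = random.Random(seed)
    data = list(zip(X, y))
    rng.shuffle(data)
    d = len(X[0])
    theta = [0.0] * d
    G = [0.0] * d                    # G[j]: running sum of squared gradients
    for _ in range(epochs):
        for x_i, y_i in data:
            # The error is computed once, so all theta_j use the same old theta.
            error = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
            for j in range(d):
                g = error * x_i[j]   # g_{t,j}: gradient of the per-instance cost
                G[j] += g * g
                theta[j] -= alpha / (math.sqrt(G[j]) + zeta) * g
    return theta, G

# Synthetic data: label is 1 when x_1 > 0; each x starts with x_0 = 1.
rng = random.Random(1)
xs = [rng.uniform(-3.0, 3.0) for _ in range(200)]
X = [[1.0, v] for v in xs]
y = [1 if v > 0 else 0 for v in xs]
theta, G = adagrad(X, y)
correct = sum((sigmoid(theta[0] + theta[1] * v) >= 0.5) == (label == 1)
              for v, label in zip(xs, y))
accuracy = correct / len(y)
print(theta, G, accuracy)
```

Because G[j] only grows, each feature's effective learning rate shrinks over time, and it shrinks fastest for features whose gradients are large or frequent.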

27 Multi-Class Classification

28 Multi-Class Classification
Binary classification: [Figure: scatter plot over $x_1$ and $x_2$]
Multi-class classification: [Figure: scatter plot over $x_1$ and $x_2$]
Disease diagnosis: healthy / cold / flu / pneumonia
Object classification: desk / chair / monitor / bookcase

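29 Multi-Class Logistic Regression
For 2 classes:
  $h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)} = \frac{\exp(\theta^T x)}{1 + \exp(\theta^T x)}$
  In the denominator, the 1 is the weight assigned to y = 0 and $\exp(\theta^T x)$ is the weight assigned to y = 1.
For C classes {1, ..., C}:
  $p(y = c \mid x; \theta_1, \ldots, \theta_C) = \frac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$
  Called the softmax function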
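The relationship between the two formulas can be checked numerically. In the sketch below, a two-class softmax with scores 0 and z reproduces the sigmoid of z; the value of z and the three-class scores are arbitrary stand-ins for θ^T x.

```python
import math

def softmax(scores):
    """Turn a list of scores theta_c^T x into probabilities that sum to 1."""
    m = max(scores)                       # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 1.3                                   # stands in for theta^T x
print(softmax([0.0, z])[1], sigmoid(z))   # the two values agree up to rounding
print(sum(softmax([2.0, -1.0, 0.5])))     # approximately 1.0
```

Subtracting the maximum score before exponentiating does not change the result, since the common factor cancels between numerator and denominator.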

30 Multi-Class Logistic Regression
Split into One vs Rest: [Figure: scatter plot over $x_1$ and $x_2$]
Train a logistic regression classifier for each class c to predict the probability that y = c with
  $h_c(x) = \frac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$

31 Implementing Multi-Class Logistic Regression
Use $h_c(x) = \frac{\exp(\theta_c^T x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^T x)}$ as the model for class c
Gradient descent simultaneously updates all parameters for all models
  Same derivative as before, just with the above $h_c(x)$
Predict class label as the most probable label: $\arg\max_c h_c(x)$
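A compact sketch of these steps follows, under stated assumptions: classes are labelled 0, 1, 2 instead of 1..C, the dataset is made up, there is no regularization, and training runs for a fixed number of iterations. The gradient for class c has the same form as in the binary case, with the error term h_c(x) minus an indicator for y = c.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(thetas, x):
    """h_c(x) for every class c; thetas holds one parameter vector per class."""
    return softmax([sum(t * v for t, v in zip(theta_c, x)) for theta_c in thetas])

def train(X, y, num_classes, alpha=0.05, iterations=1000):
    d = len(X[0])
    thetas = [[0.0] * d for _ in range(num_classes)]
    for _ in range(iterations):
        grads = [[0.0] * d for _ in range(num_classes)]
        for x_i, y_i in zip(X, y):
            probs = class_probabilities(thetas, x_i)
            for c in range(num_classes):
                error = probs[c] - (1.0 if y_i == c else 0.0)
                for j in range(d):
                    grads[c][j] += error * x_i[j]
        # Simultaneous update of all parameters for all classes.
        thetas = [[t - alpha * g for t, g in zip(theta_c, grad_c)]
                  for theta_c, grad_c in zip(thetas, grads)]
    return thetas

def predict(thetas, x):
    """Return the most probable class label."""
    probs = class_probabilities(thetas, x)
    return max(range(len(probs)), key=lambda c: probs[c])

# Made-up data with three well-separated classes; each x starts with x_0 = 1.
X = [[1.0, 0.0, 0.0], [1.0, 0.2, 0.1], [1.0, 3.0, 0.1],
     [1.0, 3.2, -0.1], [1.0, 0.1, 3.0], [1.0, -0.1, 3.3]]
y = [0, 0, 1, 1, 2, 2]
thetas = train(X, y, num_classes=3)
predictions = [predict(thetas, x) for x in X]
print(predictions)
```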
