Logistic Regression

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Robot Image Credit: Viktoriya Sukhanova 123RF.com
Classification Based on Probability

Instead of just predicting the class, give the probability of the instance being that class, i.e., learn $p(y \mid x)$

Comparison to the perceptron:
The perceptron doesn't produce a probability estimate
Perceptrons (and other discriminative classifiers) are only interested in producing a discriminative model

Recall that:
$0 \le p(\text{event}) \le 1$
$p(\text{event}) + p(\neg\text{event}) = 1$
Logistic Regression

Takes a probabilistic approach to learning discriminative functions (i.e., a classifier)

$h_\theta(x)$ should give $p(y = 1 \mid x; \theta)$
Want $0 \le h_\theta(x) \le 1$
Can't just use linear regression with a threshold

Logistic regression model:
$h_\theta(x) = g(\theta^\top x)$
$g(z) = \dfrac{1}{1 + e^{-z}}$   (logistic / sigmoid function)
$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
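As a concrete illustration (not part of the original slides), a minimal NumPy sketch of the sigmoid and the logistic hypothesis; the names `sigmoid` and `h` are just placeholders:

```python
import numpy as np

def sigmoid(z):
    """Logistic / sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); x is assumed to include a leading 1."""
    return sigmoid(theta @ x)
```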
Interpretation of Hypothesis Output

$h_\theta(x)$ = estimated $p(y = 1 \mid x; \theta)$

Example: Cancer diagnosis from tumor size
$x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$
$h_\theta(x) = 0.7$ → tell the patient that there is a 70% chance of the tumor being malignant

Note that: $p(y = 0 \mid x; \theta) + p(y = 1 \mid x; \theta) = 1$
Therefore, $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$

Based on example by Andrew Ng
Another Interpretation

Equivalently, logistic regression assumes that
$\log \dfrac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d$
(the left-hand side is the log odds of $y = 1$)

Side note: the odds in favor of an event is the quantity $p / (1 - p)$, where $p$ is the probability of the event
E.g., if I toss a fair die, what are the odds that I will roll a 6?  $(1/6) / (5/6) = 1/5$

In other words, logistic regression assumes that the log odds is a linear function of $x$

Based on slide by Xiaoli Fern
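A short worked step (added here, not on the original slide) checking that the sigmoid model implies this log-odds form:

```latex
\log\frac{p(y=1 \mid x;\theta)}{p(y=0 \mid x;\theta)}
  = \log\frac{1/(1+e^{-\theta^\top x})}{e^{-\theta^\top x}/(1+e^{-\theta^\top x})}
  = \log e^{\theta^\top x}
  = \theta^\top x
  = \theta_0 + \theta_1 x_1 + \ldots + \theta_d x_d
```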
Logistic Regression

$h_\theta(x) = g(\theta^\top x)$, where $g(z) = \dfrac{1}{1 + e^{-z}}$

$\theta^\top x$ should be large and negative for negative instances
$\theta^\top x$ should be large and positive for positive instances

Assume a threshold and...
Predict $y = 1$ if $h_\theta(x) \ge 0.5$
Predict $y = 0$ if $h_\theta(x) < 0.5$

Based on slide by Andrew Ng
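A small illustrative sketch of this thresholded prediction rule, reusing the `h` helper from the sketch above (the threshold argument is an added convenience):

```python
def predict(theta, x, threshold=0.5):
    """Predict y = 1 when h_theta(x) >= threshold, else y = 0."""
    return 1 if h(theta, x) >= threshold else 0
```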
Non-Linear Decision Boundary

Can apply basis function expansion to the features, same as with linear regression:

$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\rightarrow\; \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \\ x_1^2 x_2 \\ x_1 x_2^2 \\ \vdots \end{bmatrix}$
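One possible way to generate such an expansion for two features (an illustrative helper, not part of the slides; the maximum degree and the term ordering are assumptions):

```python
def expand_features(x1, x2, degree=3):
    """Polynomial basis expansion of (x1, x2): all terms x1^i * x2^j with i + j <= degree,
    e.g. [1, x1, x2, x1^2, x1*x2, x2^2, ...]."""
    terms = []
    for total in range(degree + 1):
        for j in range(total + 1):
            terms.append(x1 ** (total - j) * x2 ** j)
    return np.array(terms)
```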
Logistic Regression

Given $\left\{ \left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right) \right\}$
where $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \{0, 1\}$

Model: $h_\theta(x) = g(\theta^\top x)$, with $g(z) = \dfrac{1}{1 + e^{-z}}$

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \qquad x = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$
Logistic Regression Objective Function

Can't just use squared loss as in linear regression:
$J(\theta) = \dfrac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$

Using the logistic regression model $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$ results in a non-convex optimization problem
Deriving the Cost Function via Maximum Likelihood Estimation

Likelihood of the data is given by:
$l(\theta) = \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$

So, looking for the $\theta$ that maximizes the likelihood:
$\theta_{\text{MLE}} = \arg\max_\theta\, l(\theta) = \arg\max_\theta \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$

Can take the log without changing the solution:
$\theta_{\text{MLE}} = \arg\max_\theta \log \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \arg\max_\theta \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
Deriving the Cost Function via Maximum Likelihood Estimation

Expand as follows:
$\theta_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
$\phantom{\theta_{\text{MLE}}} = \arg\max_\theta \sum_{i=1}^{n} \left[ y^{(i)} \log p\left(y^{(i)} = 1 \mid x^{(i)}; \theta\right) + \left(1 - y^{(i)}\right) \log\left(1 - p\left(y^{(i)} = 1 \mid x^{(i)}; \theta\right)\right) \right]$

Substitute in the model, and take the negative to yield the logistic regression objective: $\min_\theta J(\theta)$, where
$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$
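A minimal NumPy sketch of this objective, reusing the `sigmoid` helper above (the small `eps` guard against log(0) is an added assumption, not from the slides):

```python
def cost(theta, X, y):
    """J(theta) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ].
    X is n x (d+1) with a leading column of ones; y is a 0/1 vector of length n."""
    p = sigmoid(X @ theta)
    eps = 1e-12  # keep log() away from 0
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```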
Intuition Behind the Objective

$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$

Cost of a single instance:
$\text{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$

Can re-write the objective function as
$J(\theta) = \sum_{i=1}^{n} \text{cost}\left(h_\theta\left(x^{(i)}\right), y^{(i)}\right)$

Compare to linear regression:
$J(\theta) = \dfrac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$
Intuition Behind the Objective

$\text{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$

Aside: Recall the plot of $\log(z)$
Intuition Behind the Objective

$\text{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$

If $y = 1$:  [plot: cost vs. $h_\theta(x)$ on $[0, 1]$]
Cost = 0 if the prediction is correct
As $h_\theta(x) \rightarrow 0$, cost $\rightarrow \infty$
Captures the intuition that larger mistakes should get larger penalties
e.g., predict $h_\theta(x) = 0$, but $y = 1$

Based on example by Andrew Ng
Intuition Behind the Objective

$\text{cost}\left(h_\theta(x), y\right) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$

If $y = 0$:  [plot: cost vs. $h_\theta(x)$ on $[0, 1]$, curves for $y = 1$ and $y = 0$]
Cost = 0 if the prediction is correct
As $\left(1 - h_\theta(x)\right) \rightarrow 0$, cost $\rightarrow \infty$
Captures the intuition that larger mistakes should get larger penalties

Based on example by Andrew Ng
Regularized Logistic Regression

$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$

We can regularize logistic regression exactly as before:
$J_{\text{regularized}}(\theta) = J(\theta) + \dfrac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2 = J(\theta) + \dfrac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$
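Continuing the illustrative sketch above, the regularized objective; note the bias $\theta_0$ is left unpenalized, matching the sum over $j = 1 \ldots d$:

```python
def cost_regularized(theta, X, y, lam):
    """J(theta) + (lambda / 2) * sum_{j=1..d} theta_j^2; theta[0] (the bias) is not penalized."""
    return cost(theta, X, y) + (lam / 2.0) * np.sum(theta[1:] ** 2)
```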
Gradient Descent for Logistic Regression

$J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \dfrac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$

Want $\min_\theta J(\theta)$

Initialize $\theta$
Repeat until convergence:
$\theta_j \leftarrow \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)

Use the natural logarithm ($\ln = \log_e$) to cancel with the $\exp()$ in $h_\theta(x)$
Gradient Descent for Logistic Regression

$J_{\text{reg}}(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \dfrac{\lambda}{2} \left\| \theta_{[1:d]} \right\|_2^2$

Want $\min_\theta J(\theta)$

Initialize $\theta$
Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$
$\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$
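A compact NumPy sketch of this update loop, reusing the `sigmoid` helper above (the fixed iteration count and learning rate are illustrative assumptions; a real implementation would test for convergence):

```python
def gradient_descent(X, y, alpha=0.01, lam=0.0, iterations=1000):
    """Batch gradient descent on the regularized logistic regression objective.
    X is n x (d+1) with a leading column of ones; y is a 0/1 vector."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        error = sigmoid(X @ theta) - y     # h_theta(x^(i)) - y^(i), for all i
        grad = X.T @ error                 # sum_i (h_theta(x^(i)) - y^(i)) x^(i)
        grad[1:] += lam * theta[1:]        # regularize every theta_j except theta_0
        theta = theta - alpha * grad       # simultaneous update of all theta_j
    return theta
```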
Gradient Descent for Logistic Regression

Initialize $\theta$
Repeat until convergence (simultaneous update for $j = 0 \ldots d$):
$\theta_0 \leftarrow \theta_0 - \alpha \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)$
$\theta_j \leftarrow \theta_j - \alpha \left[ \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} + \lambda \theta_j \right]$

This looks IDENTICAL to linear regression!!! (ignoring the $1/n$ constant)
However, the form of the model is very different:
$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^\top x}}$
Multi-Class Classification

Binary classification: [plot: two classes in the $(x_1, x_2)$ plane]
Multi-class classification: [plot: several classes in the $(x_1, x_2)$ plane]

Disease diagnosis: healthy / cold / flu / pneumonia
Object classification: desk / chair / monitor / bookcase
Multi-Class Logistic Regression

For 2 classes:
$h_\theta(x) = \dfrac{1}{1 + \exp(-\theta^\top x)} = \dfrac{\exp(\theta^\top x)}{1 + \exp(\theta^\top x)}$
(in the second form, the 1 in the denominator is the weight assigned to $y = 0$, and $\exp(\theta^\top x)$ is the weight assigned to $y = 1$)

For $C$ classes $\{1, \ldots, C\}$:
$p(y = c \mid x; \theta_1, \ldots, \theta_C) = \dfrac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$

Called the softmax function
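An illustrative NumPy sketch of the softmax; subtracting the maximum score before exponentiating is an added numerical-stability trick, not something the slide requires:

```python
def softmax(scores):
    """softmax(s)_c = exp(s_c) / sum_c' exp(s_c'), where s_c = theta_c^T x."""
    scores = scores - np.max(scores)   # shift scores for numerical stability
    e = np.exp(scores)
    return e / np.sum(e)
```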
Multi-Class Logistic Regression

Split into One vs Rest: [plot: one class separated from the rest in the $(x_1, x_2)$ plane]

Train a logistic regression classifier for each class $c$ to predict the probability that $y = c$, with
$h_c(x) = \dfrac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$
Implementing Multi-Class Logistic Regression

Use $h_c(x) = \dfrac{\exp(\theta_c^\top x)}{\sum_{c'=1}^{C} \exp(\theta_{c'}^\top x)}$ as the model for class $c$

Gradient descent simultaneously updates all parameters for all models
Same derivative as before, just with the above $h_c(x)$

Predict the class label as the most probable label: $\arg\max_c h_c(x)$
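A short illustrative sketch of the prediction step, using the `softmax` helper above; stacking the per-class parameter vectors into a matrix `Theta` (one row per class) is an assumed layout, not from the slides:

```python
def predict_multiclass(Theta, x):
    """Theta is C x (d+1), one parameter vector theta_c per row; x includes the leading 1.
    Returns argmax_c h_c(x), the most probable class label."""
    probs = softmax(Theta @ x)
    return int(np.argmax(probs))
```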