Classification: Logistic Regression from Data
Machine Learning: Jordan Boyd-Graber
University of Colorado Boulder
Lecture 3
Slides adapted from Emily Fox
Reminder: Logistic Regression

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i x_i\right)} \quad (1)$$

$$P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i x_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i x_i\right)} \quad (2)$$

Discriminative prediction: p(y | x)
Classification uses: ad placement, spam detection
What we didn't talk about is how to learn β from data
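These two equations are easy to check in code. A minimal sketch in Python (numpy assumed; the weights and features below are made-up values, only there to exercise the function):

```python
import numpy as np

def predict_proba(beta0, beta, x):
    """P(Y = 1 | x), equation (2); P(Y = 0 | x) is 1 minus this, equation (1)."""
    z = beta0 + np.dot(beta, x)        # beta_0 + sum_i beta_i x_i
    return 1.0 / (1.0 + np.exp(-z))    # algebraically equal to exp(z) / (1 + exp(z))

# Hypothetical weights and features, just for illustration
beta0, beta = -1.0, np.array([0.5, 2.0])
x = np.array([1.0, 0.3])
print(predict_proba(beta0, beta, x))   # P(Y = 1 | x), about 0.525
```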
Logistic Regression: Objective Function

$$\ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (3)$$

$$= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right] \quad (4)$$
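Equation (4) follows because p(y = 1) = e^z / (1 + e^z) and p(y = 0) = 1 / (1 + e^z) share a denominator, so for 0/1 labels ln p(y^(j) | x^(j)) = y^(j) z^(j) − ln(1 + e^(z^(j))). A sketch of the computation, assuming a design matrix X with one example per row:

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Conditional log likelihood, equation (4).
    X: one example per row; y: 0/1 labels."""
    z = beta0 + X @ beta                         # z^(j) = beta_0 + sum_i beta_i x_i^(j)
    return np.sum(y * z - np.log1p(np.exp(z)))   # sum_j [ y^(j) z^(j) - ln(1 + e^{z^(j)}) ]
```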
Convexity

Convex function
Doesn't matter where you start: if you walk up the objective, you reach the maximum
Gradient!
Gradient Descent (non-convex)

Goal: Optimize log likelihood with respect to variables W and b

[Figure: objective vs. parameter; steps 0, 1, 2, 3 descend from a starting point toward a local optimum, leaving the rest of the surface as "undiscovered country"]

$$W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}}$$
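A minimal sketch of the pictured iteration: repeat the update w ← w − α ∂J/∂w. The objective below is a made-up non-convex function, chosen only to show that different starting points reach different local minima:

```python
def gradient_descent(grad, w0, alpha=0.01, steps=500):
    """Follow the slide's update w <- w - alpha * dJ/dw from start w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Toy non-convex objective J(w) = w^4 - 3 w^2 + w, so dJ/dw = 4 w^3 - 6 w + 1
grad = lambda w: 4 * w**3 - 6 * w + 1
print(gradient_descent(grad, w0=-2.0))  # ends near w = -1.30 (one local minimum)
print(gradient_descent(grad, w0=+2.0))  # ends near w = +1.14 (a different one)
```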
Gradient for Logistic Regression

Gradient:
$$\nabla_\beta \ell(\beta) = \left[ \frac{\partial \ell(\beta)}{\partial \beta_0}, \ldots, \frac{\partial \ell(\beta)}{\partial \beta_n} \right] \quad (5)$$

Update (η: step size, must be greater than zero):
$$\beta' \leftarrow \beta + \eta \, \nabla_\beta \ell(\beta) \quad (6)$$
$$\beta_i' \leftarrow \beta_i + \eta \, \frac{\partial \ell(\beta)}{\partial \beta_i} \quad (7)$$

Why are we adding? What would we do if we wanted to do descent?
NB: Conjugate gradient is usually better, but harder to implement
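In code, one batch update of equations (6)-(7) might look like the sketch below. It uses the per-example derivative derived later on the regularized-regression slide (equation 13), summed over the data and without regularization; folding β_0 into β via a leading column of ones is an assumed convention, not something the slide specifies.

```python
import numpy as np

def ascent_step(beta, X, y, eta):
    """beta_i <- beta_i + eta * d l / d beta_i (equation 7).
    X has a leading column of ones, so beta[0] acts as beta_0."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # p(y = 1 | x^(j), beta) for each row
    grad = X.T @ (y - p)                   # sum_j x^(j) (y^(j) - p^(j))
    return beta + eta * grad               # adding, because we maximize log likelihood
```

We add because the objective is a log likelihood to be maximized; descending the negative log likelihood would subtract the same quantity.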
Choosing Step Size

[Figure: objective vs. parameter, illustrating the consequences of different choices of step size]
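The figure's lesson can be reproduced numerically. A toy sketch (the convex objective J(w) = w² is my choice for illustration): too small a step crawls, a moderate step converges, and too large a step diverges.

```python
# J(w) = w^2 has gradient 2w and its minimum at w = 0
grad = lambda w: 2 * w
for alpha in (0.01, 0.5, 1.1):   # too small, reasonable, too large
    w = 5.0
    for _ in range(20):
        w = w - alpha * grad(w)
    print(alpha, w)  # 0.01: still ~3.3; 0.5: exactly 0; 1.1: ~1.9e2 and growing
```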
Remaining issues

When to stop?
What if β keeps getting bigger?
Regularized Conditional Log Likelihood

Unregularized:
$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (8)$$

Regularized:
$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \quad (9)$$

µ is the regularization parameter that trades off between likelihood and having small parameters
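A sketch of equation (9)'s objective, extending the earlier log-likelihood function. Leaving β_0 out of the penalty is a common convention I'm assuming here; the slide's sum over i leaves that choice open.

```python
import numpy as np

def regularized_objective(beta0, beta, X, y, mu):
    """Equation (9): conditional log likelihood minus mu * sum_i beta_i^2
    (intercept beta0 left unpenalized by convention)."""
    z = beta0 + X @ beta
    return np.sum(y * z - np.log1p(np.exp(z))) - mu * np.sum(beta ** 2)
```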
Approximating the Gradient

Our datasets are big (too big to fit into memory)
... or data are changing / streaming
Hard to compute true gradient:
$$\nabla \ell(\beta) \approx \mathbb{E}_x \left[ \nabla \ell(\beta, x) \right] \quad (10)$$
Average over all observations
What if we compute an update just from one observation?
Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station
Stochastic Gradient for Logistic Regression

Given a single observation x chosen at random from the dataset,
$$\beta_i \leftarrow \beta_i + \eta \left( -\mu \beta_i + x_i \left[ y - p(y = 1 \mid x, \beta) \right] \right) \quad (11)$$

Examples in class.
Stochastic Gradient for Regularized Regression

$$\ell = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \quad (12)$$

Taking the derivative with respect to β_j, evaluated at a single example x^(i):
$$\frac{\partial \ell}{\partial \beta_j} = \left( y^{(i)} - p(y^{(i)} = 1 \mid x^{(i)}; \beta) \right) x_j^{(i)} - 2 \mu \beta_j \quad (13)$$
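A sketch of one stochastic pass using equation (13), updating on one randomly ordered example at a time. The epoch structure and shuffling are assumptions; the slides only specify the per-example update.

```python
import numpy as np

def sgd_epoch(beta, X, y, eta, mu):
    """One pass over the data with the per-example update from equation (13):
    beta_j <- beta_j + eta * [ (y - p(y=1|x; beta)) x_j - 2 mu beta_j ]."""
    for i in np.random.permutation(len(y)):
        p = 1.0 / (1.0 + np.exp(-(X[i] @ beta)))      # p(y = 1 | x^(i); beta)
        beta = beta + eta * ((y[i] - p) * X[i] - 2 * mu * beta)
    return beta
```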
Proofs about Stochastic Gradient

Depends on convexity of the objective and how close (ε) you want to get to the actual answer
Best bounds depend on changing η over time and per dimension (not all features created equal)
In class

Your questions!
Working through a simple example
Preparing for the logistic regression homework