Classification: Logistic Regression from Data

Size: px

Start display at page:

Download "Classification: Logistic Regression from Data"

Barbra Underwood
6 years ago
Views:

1 Classification: Logistic Regression from Data Machine Learning: Jordan Boyd-Graber University of Colorado Boulder LECTURE 3 Slides adapted from Emily Fox Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 1 of 15

2 Reminder: Logistic Regression 1 P(Y = 0 X) = 1 + exp β0 + i β (1) ix i P(Y = 1 X) = exp β0 + i β ix i 1 + exp β0 + i β (2) ix i Discriminative prediction: p(y x) Classification uses: ad placement, spam detection What we didn t talk about is how to learn β from data Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 2 of 15

3 Logistic Regression: Objective Function lnp(y X,β) = lnp(y (j) x (j),β) (3) j = y β (j) 0 + β i x (j) i ln 1 + exp β 0 + β i x (j) i j i i (4) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 3 of 15

4 Convexity Convex function Doesn t matter where you start, if you walk up objective Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 4 of 15

5 Convexity Convex function Doesn t matter where you start, if you walk up objective Gradient! Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 4 of 15

6 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b Objective 0 Parameter Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

7 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b 0 Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

8 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b 0 Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

9 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b 0 1 Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

10 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b 0 1 Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

11 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

12 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

13 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

14 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b Undiscovered Country Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

15 Gradient Descent (non-convex) Goal Optimize log likelihood with respect to variables W and b α 0 W J W ij (l) W ij (l) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 5 of 15

16 Gradient for Logistic Regression Gradient β ( β) = ( β) β 0,..., ( β) β n (5) Update β η β ( β) (6) β i β i + η ( β) β i (7) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 6 of 15

17 Gradient for Logistic Regression Gradient β ( β) = ( β) β 0,..., ( β) β n (5) Update β η β ( β) (6) β i β i + η ( β) β i (7) Why are we adding? What would well do if we wanted to do descent? Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 6 of 15

18 Gradient for Logistic Regression Gradient β ( β) = ( β) β 0,..., ( β) β n (5) Update η: step size, must be greater than zero β η β ( β) (6) β i β i + η ( β) β i (7) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 6 of 15

19 Gradient for Logistic Regression Gradient β ( β) = ( β) β 0,..., ( β) β n (5) Update β η β ( β) (6) β i β i + η ( β) β i (7) NB: Conjugate gradient is usually better, but harder to implement Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 6 of 15

20 Choosing Step Size Objective Parameter Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 7 of 15

21 Choosing Step Size Objective Parameter Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 7 of 15

22 Choosing Step Size Objective Parameter Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 7 of 15

23 Choosing Step Size Objective Parameter Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 7 of 15

24 Choosing Step Size Objective Parameter Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 7 of 15

25 Remaining issues When to stop? What if β keeps getting bigger? Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 8 of 15

26 Regularized Conditional Log Likelihood Unregularized Regularized β = argmax ln p(y (j) x (j),β) (8) β β = argmax ln p(y (j) x (j),β) µ β i β 2 i (9) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 9 of 15

27 Regularized Conditional Log Likelihood Unregularized Regularized β = argmax ln p(y (j) x (j),β) (8) β β = argmax ln p(y (j) x (j),β) µ β i β 2 i (9) µ is regularization parameter that trades off between likelihood and having small parameters Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 9 of 15

28 Approximating the Gradient Our datasets are big (to fit into memory)... or data are changing / streaming Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 10 of 15

29 Approximating the Gradient Our datasets are big (to fit into memory)... or data are changing / streaming Hard to compute true gradient (β) x [ (β,x)] (10) Average over all observations Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 10 of 15

30 Approximating the Gradient Our datasets are big (to fit into memory)... or data are changing / streaming Hard to compute true gradient (β) x [ (β,x)] (10) Average over all observations What if we compute an update just from one observation? Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 10 of 15

31 Getting to Union Station Pretend it s a pre-smartphone world and you want to get to Union Station Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 11 of 15

32 Stochastic Gradient for Logistic Regression Given a single observation x chosen at random from the dataset, β i β i + η µβ i + x i y p(y = 1 x, β ) (11) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 12 of 15

33 Stochastic Gradient for Logistic Regression Given a single observation x chosen at random from the dataset, Examples in class. β i β i + η µβ i + x i y p(y = 1 x, β ) (11) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 12 of 15

34 Stochastic Gradient for Regularized Regression =logp(y x;β) µ β 2 (12) j j Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 13 of 15

35 Stochastic Gradient for Regularized Regression =logp(y x;β) µ β 2 (12) j Taking the derivative (with respect to example x i ) j β j =(y i p(y i = 1 x i ;β))x j 2µβ j (13) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 13 of 15

36 Proofs about Stochastic Gradient Depends on convexity of objective and how close ε you want to get to actual answer Best bounds depend on changing η over time and per dimension (not all features created equal) Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 14 of 15

37 In class Your questions! Working through simple example Prepared for logistic regression homework Machine Learning: Jordan Boyd-Graber Boulder Classification: Logistic Regression from Data 15 of 15

Introduction to Machine Learning

Introduction to Machine Learning Machine Learning: Jordan Boyd-Graber University of Maryland LOGISTIC REGRESSION FROM TEXT Slides adapted from Emily Fox Machine Learning: Jordan Boyd-Graber UMD Introduction