Introduction to Machine Learning

Maximillian Logan
6 years ago

1 Introduction to Machine Learning
Machine Learning: Jordan Boyd-Graber, University of Maryland
LOGISTIC REGRESSION FROM TEXT
Slides adapted from Emily Fox
Machine Learning: Jordan Boyd-Graber UMD Introduction to Machine Learning 1 / 18

2 Reminder: Logistic Regression
$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left[\beta_0 + \sum_i \beta_i X_i\right]} \tag{1}$$
$$P(Y = 1 \mid X) = \frac{\exp\left[\beta_0 + \sum_i \beta_i X_i\right]}{1 + \exp\left[\beta_0 + \sum_i \beta_i X_i\right]} \tag{2}$$
Discriminative prediction: p(y | x)
Classification uses: ad placement, spam detection
What we didn't talk about is how to learn β from data
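As a concrete reading of equations (1) and (2), here is a minimal Python sketch. The function names, the weights, and the feature vector are hypothetical, chosen only for illustration; the bias β_0 is passed separately from the other weights.

```python
import math

def prob_y1(beta0, beta, x):
    """P(Y = 1 | X = x), equation (2)."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    # Numerically stable logistic function: avoid overflow in exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob_y0(beta0, beta, x):
    """P(Y = 0 | X = x), equation (1)."""
    return 1.0 - prob_y1(beta0, beta, x)

# Hypothetical weights and feature vector, chosen only for illustration.
p1 = prob_y1(0.5, [1.0, -2.0], [1.0, 0.5])
p0 = prob_y0(0.5, [1.0, -2.0], [1.0, 0.5])
print(p1, p0)
```

The two probabilities always sum to one, so only one of them needs to be computed in practice.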

3 Logistic Regression: Objective Function
$$\ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \tag{3}$$
$$= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left[ \beta_0 + \sum_i \beta_i x_i^{(j)} \right] \right) \right] \tag{4}$$
Training data (y, x) are fixed. The objective function is a function of β: which values of β give a good value?
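To make "a function of β" concrete, the sketch below evaluates equation (4) for two candidate weight vectors on a toy dataset. The data and the function name `log_likelihood` are assumptions made for illustration, not part of the slides.

```python
import math

def log_likelihood(beta0, beta, X, y):
    """Conditional log likelihood ln p(Y | X, beta), as in equation (4)."""
    total = 0.0
    for x_j, y_j in zip(X, y):
        z = beta0 + sum(b * v for b, v in zip(beta, x_j))
        # ln(1 + exp(z)), computed stably as max(z, 0) + log1p(exp(-|z|)).
        log_norm = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += y_j * z - log_norm
    return total

# Hypothetical toy data: four examples, two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]

ll_zero = log_likelihood(0.0, [0.0, 0.0], X, y)    # every example gets p = 0.5
ll_better = log_likelihood(0.0, [2.0, -2.0], X, y)  # weights that fit the labels better
print(ll_zero, ll_better)
```

With the data held fixed, a weight vector that fits the labels better yields a larger (less negative) log likelihood.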

5 Convexity
Convex function: it doesn't matter where you start. If you keep walking up the objective, you reach the global optimum. (Strictly, the log likelihood is concave; its negative is convex.)
The gradient tells you which way to walk.

7 Gradient Descent (non-convex)
Goal: Optimize log likelihood with respect to variables β
[Figure: objective plotted against parameter, with points labeled 0, 1, ... marked on the curve and the unexplored region labeled "Undiscovered Country".]

16 Gradient Descent (non-convex)
Goal: Optimize log likelihood with respect to variables β
$$\beta_{ij}^{(l+1)} \leftarrow \beta_{ij}^{(l)} - \alpha \frac{\partial J}{\partial \beta_{ij}}$$
Luckily, (vanilla) logistic regression is convex

18 Gradient for Logistic Regression
To ease notation, let's define
$$\pi_i = \frac{\exp \beta^T x_i}{1 + \exp \beta^T x_i} \tag{5}$$
Our objective function is
$$\ell = \sum_i \log p(y_i \mid x_i) = \sum_i \ell_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \tag{6}$$

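19 Taking the Derivative
Apply chain rule:
$$\frac{\partial \ell}{\partial \beta_j} = \sum_i \frac{\partial \ell_i(\vec{\beta})}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{1}{1 - \pi_i} \left( -\frac{\partial \pi_i}{\partial \beta_j} \right) & \text{if } y_i = 0 \end{cases} \tag{7}$$
If we plug in the derivative, we can merge these two cases (x_{ij} is feature j of example x_i):
$$\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_{ij}, \tag{8}$$
$$\frac{\partial \ell_i}{\partial \beta_j} = (y_i - \pi_i) x_{ij}. \tag{9}$$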
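One way to convince yourself that the merged form in equation (9) is right is to compare it against a finite-difference approximation of the per-example log likelihood. The sketch below does that for a single example; the weights, features, and helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, equation (5), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_lik_one(beta, x, y):
    """Per-example log likelihood: log(pi) if y = 1, log(1 - pi) if y = 0."""
    pi = sigmoid(sum(b * v for b, v in zip(beta, x)))
    return math.log(pi) if y == 1 else math.log(1.0 - pi)

def grad_one(beta, x, y):
    """Analytic gradient from equation (9): (y - pi) * x_j for each feature j."""
    pi = sigmoid(sum(b * v for b, v in zip(beta, x)))
    return [(y - pi) * v for v in x]

def numeric_grad(beta, x, y, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(beta)):
        up = list(beta)
        up[j] += eps
        down = list(beta)
        down[j] -= eps
        grad.append((log_lik_one(up, x, y) - log_lik_one(down, x, y)) / (2 * eps))
    return grad

# Hypothetical weights and a single hypothetical example.
beta = [0.3, -0.7, 0.2]
x = [1.0, 2.0, -1.5]
for label in (0, 1):
    print(label, grad_one(beta, x, label), numeric_grad(beta, x, label))
```

Both labels give matching analytic and numeric gradients, which is what merging the two cases of equation (7) predicts.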

20 Gradient for Logistic Regression
Gradient
$$\nabla_\beta \ell(\vec{\beta}) = \left[ \frac{\partial \ell(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \ell(\vec{\beta})}{\partial \beta_n} \right] \tag{10}$$
Update
$$\Delta\beta \equiv \eta \nabla_\beta \ell(\vec{\beta}) \tag{11}$$
$$\beta_i' \leftarrow \beta_i + \eta \frac{\partial \ell(\vec{\beta})}{\partial \beta_i} \tag{12}$$
Why are we adding? What would we do if we wanted to do descent?
η: step size, must be greater than zero
NB: Conjugate gradient is usually better, but harder to implement
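The update in equation (12) can be sketched as a short batch gradient ascent loop. This is an illustrative sketch, not the course's reference implementation: the toy data, the fixed step count, and the convention that the first feature is a constant 1 (so β_0 is just another weight) are all assumptions.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(beta, X, y):
    total = 0.0
    for x_i, y_i in zip(X, y):
        pi = sigmoid(sum(b * v for b, v in zip(beta, x_i)))
        total += math.log(pi) if y_i == 1 else math.log(1.0 - pi)
    return total

def gradient(beta, X, y):
    """Full gradient: sum over examples of (y_i - pi_i) * x_ij."""
    grad = [0.0] * len(beta)
    for x_i, y_i in zip(X, y):
        pi = sigmoid(sum(b * v for b, v in zip(beta, x_i)))
        for j, x_ij in enumerate(x_i):
            grad[j] += (y_i - pi) * x_ij
    return grad

def gradient_ascent(X, y, eta=0.1, steps=200):
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = gradient(beta, X, y)
        # "+" because we maximize the log likelihood; descent on the
        # negative log likelihood would subtract the same quantity.
        beta = [b + eta * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical, non-separable toy data; the first feature is a constant 1 (bias).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5], [1.0, 1.0], [1.0, 3.0]]
y = [0, 0, 1, 1, 1, 0]

beta_hat = gradient_ascent(X, y)
print(beta_hat, log_likelihood(beta_hat, X, y))
```

Flipping the sign of the update while keeping the same gradient would walk downhill on the log likelihood, which is the "descent" the slide asks about.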

24 Choosing Step Size
[Figure: objective plotted against parameter.]
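The effect of the step size is easy to see on a one-dimensional toy problem. The sketch below runs gradient ascent on the hypothetical objective f(b) = -(b - 3)^2, which is not from the slides, with three step sizes: one too small, one reasonable, and one too large.

```python
def ascend(eta, steps=20, start=0.0):
    """Gradient ascent on the hypothetical objective f(b) = -(b - 3)^2."""
    b = start
    for _ in range(steps):
        grad = -2.0 * (b - 3.0)  # f'(b)
        b = b + eta * grad
    return b

# The maximum is at b = 3. Each step multiplies the error (b - 3) by (1 - 2 * eta).
for eta in (0.01, 0.4, 1.1):
    print(eta, ascend(eta))
```

A tiny step makes slow progress, a moderate step converges quickly, and a step that is too large overshoots the optimum by more each time and diverges.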

29 Remaining issues
When to stop?
What if β keeps getting bigger?

30 Regularized Conditional Log Likelihood
Unregularized
$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \tag{13}$$
Regularized
$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \tag{14}$$
µ is a regularization parameter that trades off between likelihood and having small parameters
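A small sketch shows why the penalty in equation (14) matters. On the hypothetical, perfectly separable dataset below, the unregularized log likelihood keeps improving as the single weight grows, while the regularized objective prefers a moderate weight. The data, the value of µ, and the choice to penalize every weight passed in are assumptions for illustration.

```python
import math

def log_likelihood(beta, X, y):
    total = 0.0
    for x_j, y_j in zip(X, y):
        z = sum(b * v for b, v in zip(beta, x_j))
        # ln(1 + exp(z)) computed stably.
        total += y_j * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def regularized_objective(beta, X, y, mu):
    """Log likelihood minus mu times the sum of squared weights, equation (14)."""
    penalty = mu * sum(b * b for b in beta)
    return log_likelihood(beta, X, y) - penalty

# Hypothetical separable data: one feature, negative for y = 0, positive for y = 1.
X = [[-1.0], [1.0]]
y = [0, 1]

for weight in (1.0, 10.0):
    print(weight,
          log_likelihood([weight], X, y),
          regularized_objective([weight], X, y, mu=0.1))
```

Without the penalty, larger weights always look better on separable data, which is one way β "keeps getting bigger".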

32 Approximating the Gradient
Our datasets are big (too big to fit into memory)
... or data are changing / streaming
Hard to compute true gradient
$$\nabla \ell(\beta) \equiv \mathbb{E}_x \left[ \nabla \ell(\beta, x) \right] \tag{15}$$
Average over all observations
What if we compute an update just from one observation?
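Equation (15) says the true gradient is an average of per-example gradients, so the gradient from one randomly chosen example is a noisy but unbiased estimate of it. The sketch below checks this numerically on hypothetical toy data; the seed, the number of draws, and the helper names are assumptions.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def grad_one(beta, x_i, y_i):
    """Gradient of the log likelihood of a single example."""
    pi = sigmoid(sum(b * v for b, v in zip(beta, x_i)))
    return [(y_i - pi) * v for v in x_i]

def mean_gradient(beta, X, y):
    """Average of the per-example gradients over the whole dataset."""
    total = [0.0] * len(beta)
    for x_i, y_i in zip(X, y):
        total = [t + g for t, g in zip(total, grad_one(beta, x_i, y_i))]
    return [t / len(X) for t in total]

# Hypothetical toy data; the first feature is a constant 1 (bias).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5], [1.0, 1.0], [1.0, 3.0]]
y = [0, 0, 1, 1, 1, 0]
beta = [0.2, -0.1]

exact = mean_gradient(beta, X, y)

# Average many single-example gradients drawn uniformly at random.
rng = random.Random(0)
n_draws = 20000
estimate = [0.0, 0.0]
for _ in range(n_draws):
    i = rng.randrange(len(X))
    g = grad_one(beta, X[i], y[i])
    estimate = [e + gj / n_draws for e, gj in zip(estimate, g)]

print(exact, estimate)
```

Any single draw can point in a poor direction, but the draws average out to the full-data gradient.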

35 Getting to Union Station
Pretend it's a pre-smartphone world and you want to get to Union Station

36 Stochastic Gradient for Logistic Regression
Given a single observation x_i chosen at random from the dataset,
$$\beta_j \leftarrow \beta_j' + \eta \left( -\mu \beta_j' + x_{ij} \left[ y_i - \pi_i \right] \right) \tag{16}$$
Examples in class.
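A single application of the update in equation (16) might look like the sketch below. The function name `sgd_update` and the numeric values are hypothetical; the regularization term uses µ exactly as written in equation (16).

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sgd_update(beta, x_i, y_i, eta, mu):
    """One stochastic update from a single example, following equation (16)."""
    pi = sigmoid(sum(b * v for b, v in zip(beta, x_i)))
    return [b + eta * (-mu * b + x_ij * (y_i - pi))
            for b, x_ij in zip(beta, x_i)]

# Hypothetical example: start at zero, observe one positive example.
print(sgd_update([0.0, 0.0], [1.0, 2.0], 1, eta=0.1, mu=0.5))
# With all-zero features only the regularization term acts, shrinking the weights.
print(sgd_update([1.0, 0.0], [0.0, 0.0], 0, eta=0.1, mu=0.5))
```

The first call moves each weight in proportion to its feature value; the second shows the penalty pulling a weight toward zero when the example carries no information about it.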

38 Stochastic Gradient for Regularized Regression
$$\ell = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \tag{17}$$
Taking the derivative (with respect to example x_i)
$$\frac{\partial \ell}{\partial \beta_j} = (y_i - \pi_i) x_{ij} - 2\mu\beta_j \tag{18}$$

40 Algorithm
1. Initialize a vector β to be all zeros
2. For t = 1, ..., T
   For each example (x_i, y_i) and feature j:
     Compute π_i ≡ Pr(y_i = 1 | x_i)
     Set β[j] = β[j] + λ(y_i − π_i) x_ij (λ is the step size)
3. Output the parameters β_1, ..., β_d.
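The three steps above can be sketched in Python as follows. This is a minimal, unregularized sketch under stated assumptions: the toy data are hypothetical, the first feature is a constant 1 acting as the bias, examples are visited in a fixed order, and π_i is computed once per example before any of its weights are updated.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_sgd(X, y, step=0.1, T=200):
    d = len(X[0])
    beta = [0.0] * d                      # 1. initialize beta to all zeros
    for _ in range(T):                    # 2. for t = 1, ..., T
        for x_i, y_i in zip(X, y):        #    for each example
            pi = sigmoid(sum(b * v for b, v in zip(beta, x_i)))
            for j in range(d):            #    and each feature j
                beta[j] += step * (y_i - pi) * x_i[j]
    return beta                           # 3. output the parameters

# Hypothetical toy data: label is 1 when the second feature is large.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 1]

beta = train_sgd(X, y)
probs = [sigmoid(sum(b * v for b, v in zip(beta, x_i))) for x_i in X]
print(beta, probs)
```

In practice the examples are often shuffled on each pass; the fixed order here just keeps the sketch deterministic.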

41 Proofs about Stochastic Gradient
Depends on convexity of the objective and how close (ε) you want to get to the actual answer
Best bounds depend on changing η over time and per dimension (not all features are created equal)
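One common way to change the step size both over time and per dimension is an AdaGrad-style rule, which divides η by the square root of each dimension's accumulated squared gradients. The slides do not name a specific scheme, so treat the sketch below as one possible illustration; the function name, the gradient values, and the small constant `eps` are assumptions.

```python
import math

def adagrad_step(beta, grad, accum, eta=0.1, eps=1e-8):
    """One ascent step with a per-dimension, shrinking step size (AdaGrad-style)."""
    new_beta, new_accum = [], []
    for b, g, a in zip(beta, grad, accum):
        a = a + g * g                       # running sum of squared gradients
        step = eta / (math.sqrt(a) + eps)   # per-dimension step, shrinks over time
        new_beta.append(b + step * g)       # "+" because we are maximizing
        new_accum.append(a)
    return new_beta, new_accum

beta, accum = [0.0, 0.0], [0.0, 0.0]
grad = [1.0, 0.1]  # hypothetical gradient: one large component, one small
for _ in range(2):
    beta, accum = adagrad_step(beta, grad, accum)
print(beta)
```

Dimensions with consistently small gradients keep a relatively larger step, so here both weights move by nearly the same amount even though their gradients differ by a factor of ten.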

42 In class
Your questions!
Working through a simple example
Prepared for logistic regression homework
