Classification: Logistic Regression from Data

Machine Learning: Alvin Grissom II, University of Colorado Boulder
Slides adapted from Emily Fox
Reminder: Logistic Regression

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i x_i\right)} \qquad (1)$$

$$P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i x_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i x_i\right)} \qquad (2)$$

Discriminative prediction: $p(y \mid x)$
Classification uses: ad placement, spam detection
What we didn't talk about is how to learn β from data
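As a quick sanity check, Eqs. (1) and (2) can be computed directly. A minimal sketch (function and variable names are illustrative, not from the slides):

```python
import math

def predict_proba(beta0, beta, x):
    """P(Y = 1 | x) from Eq. (2): the sigmoid of the linear score."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to exp(z) / (1 + exp(z))

p1 = predict_proba(0.0, [1.0, -2.0], [3.0, 1.0])  # score z = 1*3 + (-2)*1 = 1
p0 = 1.0 - p1                                     # P(Y = 0 | x), Eq. (1)
```

Note the two class probabilities always sum to one, since Eq. (1) and Eq. (2) share the same denominator.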
Logistic Regression: Objective Function

$$\ln p(y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \qquad (3)$$

$$= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right] \qquad (4)$$

Training data (y, x) are fixed. The objective function is a function of β... what values of β give a good objective value?
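Eq. (4) can be evaluated in a few lines. A minimal sketch of the conditional log likelihood, with illustrative names:

```python
import math

def log_likelihood(beta0, beta, X, y):
    """Eq. (4): sum over examples of y*z - log(1 + exp(z)), z the linear score."""
    ll = 0.0
    for xj, yj in zip(X, y):
        z = beta0 + sum(b * xi for b, xi in zip(beta, xj))
        ll += yj * z - math.log(1.0 + math.exp(z))
    return ll

# With all-zero weights, each example contributes -log 2 (probability 1/2 either way).
ll_example = log_likelihood(0.0, [0.0], [[1.0]], [1])
```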
Convexity

Convex function: it doesn't matter where you start; if you walk up the objective, you reach the optimum.
Gradient!
Gradient Descent (non-convex)

Goal: Optimize the log likelihood with respect to the variables β.

[Figure: objective plotted against a parameter; successive iterates 0, 1, 2, 3 climb the curve, with the unexplored region labeled "Undiscovered Country"]
Gradient Descent (non-convex)

Goal: Optimize the log likelihood with respect to the variables β.

$$\beta_{ij}^{(l+1)} = \beta_{ij}^{(l)} - \alpha \frac{\partial J}{\partial \beta_{ij}}, \qquad \alpha > 0$$

Luckily, (vanilla) logistic regression is convex.
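The update rule above is the whole algorithm: step against the gradient until the iterates settle. A minimal sketch on a one-dimensional convex objective (names and the example objective are illustrative):

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeatedly apply x <- x - alpha * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Minimize J(x) = (x - 3)^2, whose gradient is 2(x - 3).
# Because J is convex, any starting point converges to the minimum at x = 3.
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=-10.0)
```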
Gradient for Logistic Regression

To ease notation, let's define

$$\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \qquad (5)$$

Our objective function is

$$\mathcal{L} = \sum_i \log p(y_i \mid x_i) = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \qquad (6)$$
Taking the Derivative

Apply the chain rule:

$$\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\beta)}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ -\frac{1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases} \qquad (7)$$

If we plug in the derivative

$$\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_j, \qquad (8)$$

we can merge these two cases:

$$\frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i) x_j. \qquad (9)$$
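One way to trust a derivation like Eqs. (7)–(9) is to compare the analytic gradient against a finite-difference estimate. A minimal sketch on a single example with $y_i = 1$ (names are illustrative):

```python
import math

def pi(beta, x):
    """pi_i from Eq. (5): exp(beta^T x) / (1 + exp(beta^T x))."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

def grad_j(beta, x, y, j):
    """Analytic partial derivative from Eq. (9): (y_i - pi_i) * x_j."""
    return (y - pi(beta, x)) * x[j]

# Central finite-difference check; with y = 1 the objective is log pi.
beta, x, j, eps = [0.5, -1.0], [2.0, 1.0], 0, 1e-6
analytic = grad_j(beta, x, 1, j)
hi, lo = beta[:], beta[:]
hi[j] += eps
lo[j] -= eps
numeric = (math.log(pi(hi, x)) - math.log(pi(lo, x))) / (2 * eps)
```

Here the score is $z = 0.5 \cdot 2 - 1 = 0$, so $\pi_i = 1/2$ and both gradients come out to $(1 - 1/2) \cdot 2 = 1$.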
Gradient for Logistic Regression

Gradient

$$\nabla_\beta \mathcal{L}(\beta) = \left[ \frac{\partial \mathcal{L}(\beta)}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\beta)}{\partial \beta_n} \right] \qquad (10)$$

Update

$$\Delta \beta \equiv \eta \nabla_\beta \mathcal{L}(\beta) \qquad (11)$$

$$\beta_i \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\beta)}{\partial \beta_i} \qquad (12)$$

η: step size, must be greater than zero.
Why are we adding? What would we do if we wanted to do descent?
NB: Conjugate gradient is usually better, but harder to implement.
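Putting Eq. (9) and Eq. (12) together gives one batch update over the whole dataset. A minimal sketch (names are illustrative); we add the gradient because we are maximizing the log likelihood, where descent on a loss would subtract:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ascent_step(beta, X, y, eta):
    """One batch step of Eq. (12): beta_j += eta * sum_i (y_i - pi_i) * x_ij."""
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * xij for b, xij in zip(beta, xi)))
        for j, xij in enumerate(xi):
            grad[j] += (yi - p) * xij
    return [b + eta * g for b, g in zip(beta, grad)]

# From zero weights, both pi_i = 0.5; the per-example gradients are
# (0.5)*[1,0] and (-0.5)*[1,1], which sum to [0, -0.5].
new_beta = ascent_step([0.0, 0.0], [[1.0, 0.0], [1.0, 1.0]], [1, 0], eta=1.0)
```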
Choosing Step Size

[Figure: objective plotted against a parameter; successive updates shown for different step sizes]
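The figure's point can be reproduced numerically: on the same objective, a small step converges while a too-large step overshoots farther on every iteration. A minimal sketch (objective and names are illustrative, not from the slides):

```python
# Gradient descent on J(x) = x^2, whose gradient is 2x.
# Each step multiplies x by (1 - 2*alpha): convergent when |1 - 2*alpha| < 1.
def run(alpha, steps=20, x=1.0):
    for _ in range(steps):
        x = x - alpha * (2 * x)
    return x

small = run(0.1)   # factor 0.8 per step: shrinks toward the minimum at 0
large = run(1.1)   # factor -1.2 per step: overshoots and diverges
```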
Remaining Issues

When to stop?
What if β keeps getting bigger?
Regularized Conditional Log Likelihood

Unregularized

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \qquad (13)$$

Regularized

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \qquad (14)$$

µ is the regularization parameter that trades off between likelihood and having small parameters.
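Eq. (14) just subtracts a squared-weight penalty from the objective of Eq. (13). A minimal sketch (intercept omitted for brevity; names are illustrative):

```python
import math

def regularized_ll(beta, X, y, mu):
    """Objective of Eq. (14): log likelihood minus mu * sum of squared weights."""
    ll = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * xij for b, xij in zip(beta, xi))
        ll += yi * z - math.log(1.0 + math.exp(z))
    return ll - mu * sum(b * b for b in beta)

# Same data and weights, two settings of mu: the penalty only ever
# lowers the objective for nonzero weights, pushing beta toward zero.
unpenalized = regularized_ll([1.0], [[0.0]], [1], mu=0.0)   # = -log 2
penalized = regularized_ll([1.0], [[0.0]], [1], mu=1.0)     # = -log 2 - 1
```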
Approximating the Gradient

Our datasets are big (too big to fit into memory)... or data are changing / streaming.
Hard to compute the true gradient

$$\nabla \mathcal{L}(\beta) = \mathbb{E}_x\left[ \nabla \mathcal{L}(\beta, x) \right] \qquad (15)$$

Average over all observations.
What if we compute an update just from one observation?
Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station.
Stochastic Gradient for Logistic Regression

Given a single observation $x_i$ chosen at random from the dataset,

$$\beta_j \leftarrow \beta_j + \eta \left( -\mu \beta_j + x_{ij} \left[ y_i - \pi_i \right] \right) \qquad (16)$$

Examples in class.
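A single-observation update is cheap: one dot product, one sigmoid, one pass over the weights. A minimal sketch of an update in the shape of Eq. (16) (names are illustrative, and the exact constant on the regularization term varies by convention):

```python
import math

def sgd_step(beta, xi, yi, eta, mu):
    """One stochastic update from a single example (x_i, y_i)."""
    z = sum(b * xij for b, xij in zip(beta, xi))
    p = math.exp(z) / (1.0 + math.exp(z))  # pi_i, Eq. (5)
    return [b + eta * (xij * (yi - p) - mu * b) for b, xij in zip(beta, xi)]

# From zero weights, pi_i = 0.5; with mu = 0 each weight moves by
# eta * x_ij * (y_i - 0.5).
new = sgd_step([0.0, 0.0], [1.0, 1.0], 1, eta=1.0, mu=0.0)
```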
Stochastic Gradient for Regularized Regression

$$\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \qquad (17)$$

Taking the derivative (with respect to example $x_i$):

$$\frac{\partial \mathcal{L}}{\partial \beta_j} = (y_i - \pi_i) x_j - 2\mu \beta_j \qquad (18)$$
Algorithm

1. Initialize a vector β to be all zeros.
2. For t = 1, ..., T:
   For each example (x_i, y_i) and feature j:
   - Compute π_i ≡ Pr(y_i = 1 | x_i)
   - Set β[j] = β[j] + λ(y_i − π_i)x_{ij}
3. Output the parameters β_1, ..., β_d.
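The algorithm above fits in a few lines. A minimal sketch without the regularization term (names are illustrative; for simplicity this sweeps examples in order rather than sampling at random):

```python
import math

def train(data, dims, T=100, lam=0.1):
    """Steps 1-3 above: start at zero, sweep the examples, nudge each weight."""
    beta = [0.0] * dims
    for _ in range(T):
        for x, y in data:
            z = sum(b * xj for b, xj in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-z))  # pi_i = Pr(y_i = 1 | x_i)
            beta = [b + lam * (y - p) * xj for b, xj in zip(beta, x)]
    return beta

# Tiny separable dataset: the label is 1 exactly when the feature is positive,
# so training should drive the single weight positive.
data = [([1.0], 1), ([2.0], 1), ([-1.0], 0), ([-2.0], 0)]
beta = train(data, dims=1)
```

After training, positive examples get predicted probability above 1/2. On separable data like this, the unregularized weight keeps growing with more epochs, which is exactly the "what if β keeps getting bigger?" issue that regularization addresses.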
Proofs about Stochastic Gradient

Proofs depend on the convexity of the objective and on how close (within ε) you want to get to the actual answer.
The best bounds depend on changing η over time and per dimension (not all features are created equal).
In Class

Your questions!
Working through a simple example
Preparing for the logistic regression homework