Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5


Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27

Quiz question For a test instance (x, y) and a naïve Bayes classifier $\hat{P}$, which of the following statements is true?
(A) $\sum_c \hat{P}(y = c \mid x) = 1$
(B) $\sum_c \hat{P}(x \mid c) = 1$
(C) $\sum_c \hat{P}(x \mid c)\,\hat{P}(c) = 1$
Machine Learning: Chenhao Tan Boulder 2 of 27

Overview Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 3 of 27

Reminder: Logistic Regression
$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (1)$$
$$P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (2)$$
Discriminative prediction: P(y | x). Classification uses: sentiment analysis, spam detection. What we didn't talk about is how to learn β from data.
Machine Learning: Chenhao Tan Boulder 4 of 27
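To make equations (1) and (2) concrete, here is a minimal Python sketch (not part of the original slides; the weights and feature values are invented for illustration) that computes P(Y = 1 | X) for a single example:

```python
import numpy as np

def predict_proba(beta0, beta, x):
    """P(Y = 1 | X = x) under logistic regression; P(Y = 0 | x) is 1 minus this."""
    z = beta0 + np.dot(beta, x)            # beta_0 + sum_i beta_i x_i
    return np.exp(z) / (1.0 + np.exp(z))   # equation (2); equivalently 1 / (1 + exp(-z))

# Illustrative (made-up) weights and features
beta0, beta = -1.0, np.array([0.5, 2.0])
x = np.array([1.0, 0.3])
print(predict_proba(beta0, beta, x))       # ~0.52
```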

Objective function Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 5 of 27

Objective function Logistic Regression: Objective Function
Maximize likelihood:
$$\text{Obj} \equiv \log P(Y \mid X, \beta) = \sum_j \log P(y^{(j)} \mid x^{(j)}, \beta) = \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \log \left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]$$
Machine Learning: Chenhao Tan Boulder 6 of 27


Objective function Logistic Regression: Objective Function
Minimize negative log likelihood (loss):
$$\mathcal{L} \equiv -\log P(Y \mid X, \beta) = -\sum_j \log P(y^{(j)} \mid x^{(j)}, \beta) = \sum_j \left[ -y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) + \log \left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]$$
Training data {(x, y)} are fixed. The objective function is a function of β... what values of β give a good value?
$$\beta^* = \arg\min_\beta \mathcal{L}(\beta)$$
Machine Learning: Chenhao Tan Boulder 7 of 27
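A small Python sketch of this loss (the data below are invented for illustration), matching the form −y(·) + log(1 + exp(·)) summed over examples:

```python
import numpy as np

def neg_log_likelihood(beta0, beta, X, y):
    """L(beta) = sum_j [ -y^(j) z_j + log(1 + exp(z_j)) ], with z_j = beta_0 + beta . x^(j)."""
    z = beta0 + X @ beta
    return np.sum(-y * z + np.log1p(np.exp(z)))

# Toy data (invented): 3 examples, 2 features
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
print(neg_log_likelihood(0.0, np.zeros(2), X, y))  # 3 * log(2) when beta = 0
```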

Objective function Convexity
$\mathcal{L}(\beta)$ is convex for logistic regression.
Proof. The logistic loss $-yv + \log(1 + \exp(v))$ is convex in $v$. Composition with a linear function maintains convexity. A sum of convex functions is convex.
Machine Learning: Chenhao Tan Boulder 8 of 27

Gradient Descent Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 9 of 27

Gradient Descent Convexity Convex function: it doesn't matter where you start, if you go down along the gradient. Gradient! Machine Learning: Chenhao Tan Boulder 10 of 27

Gradient Descent Convexity It would have been much harder if this were not convex. Machine Learning: Chenhao Tan Boulder 11 of 27

Gradient Descent Gradient Descent (non-convex)
Goal: Optimize the loss function with respect to the variables β.
[Figure: objective plotted against the parameter; successive iterates 0, 1, 2, 3 step downhill along the gradient toward the minimum, labeled "Undiscovered Country".]
Machine Learning: Chenhao Tan Boulder 12 of 27

Gradient Descent Gradient Descent (non-convex)
Goal: Optimize the loss function with respect to the variables β:
$$\beta_j^{l+1} = \beta_j^{l} - \eta \frac{\partial \mathcal{L}}{\partial \beta_j}$$
Luckily, (vanilla) logistic regression is convex.
Machine Learning: Chenhao Tan Boulder 12 of 27

Gradient Descent Gradient for Logistic Regression
To ease notation, let's define
$$\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \quad (3)$$
Our objective function is
$$\mathcal{L} = -\sum_i \log p(y_i \mid x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} -\log \pi_i & \text{if } y_i = 1 \\ -\log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \quad (4)$$
Machine Learning: Chenhao Tan Boulder 13 of 27

Gradient Descent Taking the Derivative
Apply the chain rule:
$$\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\beta)}{\partial \beta_j} = \sum_i \begin{cases} -\frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases} \quad (5)$$
If we plug in the derivative
$$\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_{ij}, \quad (6)$$
we can merge these two cases:
$$\frac{\partial \mathcal{L}_i}{\partial \beta_j} = -(y_i - \pi_i) x_{ij}. \quad (7)$$
Machine Learning: Chenhao Tan Boulder 14 of 27
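As a sanity check on equation (7), a short finite-difference comparison can confirm the closed-form derivative; the β, x, and y values below are arbitrary illustrative choices:

```python
import numpy as np

def pi(beta, x):
    """pi_i from equation (3)."""
    z = beta @ x
    return np.exp(z) / (1.0 + np.exp(z))

def loss_i(beta, x, y):
    """Per-example loss L_i from equation (4)."""
    p = pi(beta, x)
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

# Compare the closed form -(y_i - pi_i) x_ij against a numerical derivative
beta = np.array([0.2, -0.5, 1.0])    # arbitrary illustrative values
x, y, j, eps = np.array([1.0, 3.0, 0.5]), 1, 1, 1e-6
analytic = -(y - pi(beta, x)) * x[j]
bp, bm = beta.copy(), beta.copy()
bp[j] += eps
bm[j] -= eps
numeric = (loss_i(bp, x, y) - loss_i(bm, x, y)) / (2 * eps)
print(analytic, numeric)             # the two values should agree closely
```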

Gradient Descent Gradient for Logistic Regression
Gradient:
$$\nabla_\beta \mathcal{L}(\beta) = \left[ \frac{\partial \mathcal{L}(\beta)}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\beta)}{\partial \beta_n} \right] \quad (8)$$
Update:
$$\beta \leftarrow \beta - \eta \nabla_\beta \mathcal{L}(\beta) \quad (9)$$
$$\beta_i \leftarrow \beta_i - \eta \frac{\partial \mathcal{L}(\beta)}{\partial \beta_i} \quad (10)$$
η: step size, must be greater than zero.
Why are we subtracting? What would we do if we wanted to do ascent?
Machine Learning: Chenhao Tan Boulder 15 of 27
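Putting equations (8)–(10) together, a minimal batch gradient descent loop might look like the sketch below (not the course's reference implementation; the toy data are invented):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, iters=1000):
    """Batch gradient descent on the (unregularized) negative log likelihood.
    X is assumed to already contain a leading column of 1s for the bias term beta_0."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))    # predicted P(y = 1 | x) for every example
        grad = -X.T @ (y - pi)                    # dL/dbeta_j = -sum_i (y_i - pi_i) x_ij
        beta -= eta * grad                        # descend: beta <- beta - eta * gradient
    return beta

# Toy run on invented data
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(gradient_descent(X, y))
```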

Gradient Descent Choosing Step Size
[Figure: objective plotted against the parameter, illustrating how different step sizes affect the descent path.]
Machine Learning: Chenhao Tan Boulder 16 of 27

Gradient Descent Remaining issues When to stop? What if β keeps getting bigger? Machine Learning: Chenhao Tan Boulder 17 of 27

Gradient Descent Regularized Conditional Log Likelihood
Unregularized:
$$\beta^* = \arg\min_\beta \left[ -\sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \right] \quad (11)$$
Regularized:
$$\beta^* = \arg\min_\beta \left[ -\sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) + \frac{1}{2} \mu \sum_i \beta_i^2 \right] \quad (12)$$
µ is a regularization parameter that trades off between the likelihood and having small parameters.
Machine Learning: Chenhao Tan Boulder 18 of 27

Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 19 of 27

Approximating the Gradient
Our datasets are too big to fit into memory... or the data are changing / streaming.
It is hard to compute the true gradient
$$\nabla \mathcal{L}(\beta) = \mathbb{E}_x \left[ \nabla \mathcal{L}(\beta, x) \right], \quad (13)$$
an average over all observations.
What if we compute an update just from one observation?
Machine Learning: Chenhao Tan Boulder 20 of 27

Stochastic Gradient Descent Getting to Union Station Pretend it's a pre-smartphone world and you want to get to Union Station. Machine Learning: Chenhao Tan Boulder 21 of 27

Stochastic Gradient for Regularized Regression
$$\mathcal{L} = -\log p(y \mid x; \beta) + \frac{1}{2} \mu \sum_j \beta_j^2 \quad (14)$$
Taking the derivative (with respect to example $x_i$):
$$\frac{\partial \mathcal{L}}{\partial \beta_j} = -(y_i - \pi_i) x_{ij} + \mu \beta_j \quad (15)$$
Machine Learning: Chenhao Tan Boulder 22 of 27

Stochastic Gradient for Logistic Regression
Given a single observation $x_i$ chosen at random from the dataset,
$$\beta_j \leftarrow \beta_j - \eta \left( \mu \beta_j - x_{ij} \left[ y_i - \pi_i \right] \right) \quad (16)$$
Machine Learning: Chenhao Tan Boulder 23 of 27
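A minimal sketch of one stochastic pass implementing update (16) (toy data invented for illustration; note this version also regularizes the bias weight, which practical implementations often avoid):

```python
import numpy as np

def sgd_epoch(beta, X, y, eta=0.1, mu=0.01, seed=0):
    """One stochastic pass over the data with L2 regularization, per equation (16):
    beta_j <- beta_j - eta * (mu * beta_j - x_ij * (y_i - pi_i))."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(y)):              # visit the examples in random order
        pi_i = 1.0 / (1.0 + np.exp(-(beta @ X[i])))
        beta = beta - eta * (mu * beta - X[i] * (y[i] - pi_i))
    return beta

# Toy run on invented data; the first column of 1s is the bias feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(sgd_epoch(np.zeros(2), X, y))
```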

Example Documents
Update rule (without regularization): $\beta_j \leftarrow \beta_j + \eta (y_i - \pi_i) x_{ij}$. Assume step size η = 1.0.
Features: bias, A, B, C, D. Initialize β_bias = 0, β_A = 0, β_B = 0, β_C = 0, β_D = 0, i.e. β = ⟨0, 0, 0, 0, 0⟩.
Documents (feature values are word counts, plus a constant bias feature of 1):
y_1 = 1: A A A A B B B C   (x_1 = ⟨1, 4, 3, 1, 0⟩)
y_2 = 0: B C C C D D D D   (x_2 = ⟨1, 0, 1, 3, 4⟩)
You first see the positive example. First, compute π_1:
π_1 = Pr(y_1 = 1 | x_1) = exp(β^T x_1) / (1 + exp(β^T x_1)) = exp(0) / (exp(0) + 1) = 0.5
Updates from the first example:
β_bias = β_bias + η (y_1 − π_1) x_1,bias = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5
β_A = β_A + η (y_1 − π_1) x_1,A = 0.0 + 1.0 · (1.0 − 0.5) · 4.0 = 2.0
β_B = β_B + η (y_1 − π_1) x_1,B = 0.0 + 1.0 · (1.0 − 0.5) · 3.0 = 1.5
β_C = β_C + η (y_1 − π_1) x_1,C = 0.0 + 1.0 · (1.0 − 0.5) · 1.0 = 0.5
β_D = β_D + η (y_1 − π_1) x_1,D = 0.0 + 1.0 · (1.0 − 0.5) · 0.0 = 0.0
So β = ⟨0.5, 2, 1.5, 0.5, 0⟩.
Now you see the negative example. What's π_2?
π_2 = Pr(y_2 = 1 | x_2) = exp(β^T x_2) / (1 + exp(β^T x_2)) = exp(0.5 + 1.5 + 1.5 + 0) / (exp(0.5 + 1.5 + 1.5 + 0) + 1) ≈ 0.97
Updates from the second example:
β_bias = β_bias + η (y_2 − π_2) x_2,bias = 0.5 + 1.0 · (0.0 − 0.97) · 1.0 = −0.47
β_A = β_A + η (y_2 − π_2) x_2,A = 2.0 + 1.0 · (0.0 − 0.97) · 0.0 = 2.0
β_B = β_B + η (y_2 − π_2) x_2,B = 1.5 + 1.0 · (0.0 − 0.97) · 1.0 = 0.53
β_C = β_C + η (y_2 − π_2) x_2,C = 0.5 + 1.0 · (0.0 − 0.97) · 3.0 = −2.41
β_D = β_D + η (y_2 − π_2) x_2,D = 0.0 + 1.0 · (0.0 − 0.97) · 4.0 = −3.88
Machine Learning: Chenhao Tan Boulder 24 of 27
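The worked example above can be reproduced in a few lines of Python (a verification sketch, not course code; as in the slides, the regularization term is omitted from the update):

```python
import numpy as np

# Features: [bias, A, B, C, D]; values are word counts in each document
x1, y1 = np.array([1.0, 4.0, 3.0, 1.0, 0.0]), 1.0   # "A A A A B B B C"
x2, y2 = np.array([1.0, 0.0, 1.0, 3.0, 4.0]), 0.0   # "B C C C D D D D"

beta, eta = np.zeros(5), 1.0
for x, y in [(x1, y1), (x2, y2)]:
    pi = np.exp(beta @ x) / (1.0 + np.exp(beta @ x))
    beta = beta + eta * (y - pi) * x                 # unregularized stochastic update
    print(np.round(pi, 2), np.round(beta, 2))
# First step:  pi = 0.5,  beta = [ 0.5   2.    1.5   0.5   0.  ]
# Second step: pi = 0.97, beta = [-0.47  2.    0.53 -2.41 -3.88]
```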


Algorithm
1. Initialize a vector β to be all zeros.
2. For t = 1, ..., T:
   For each example (x_i, y_i) and feature j:
     Compute $\pi_i \equiv \Pr(y_i = 1 \mid x_i)$
     Set $\beta_j = \beta_j - \eta \left( \mu \beta_j - (y_i - \pi_i) x_{ij} \right)$
3. Output the parameters β_1, ..., β_d.
Any issues? Inefficiency: we update every j even if x_ij = 0. Lazy sparse updates are covered in the readings.
Machine Learning: Chenhao Tan Boulder 25 of 27
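The lazy sparse update mentioned above can be sketched as follows: only weights of features that are active in the current example are touched, and the accumulated regularization shrinkage from skipped steps is applied when a feature next appears. This is an illustrative sketch of the idea with invented data; the readings' exact bookkeeping may differ.

```python
import numpy as np

def lazy_sparse_sgd(examples, n_features, eta=0.1, mu=0.01, epochs=1):
    """Sketch of lazy sparse updates: only weights of features active in the current
    example are touched; the L2 shrinkage for skipped steps is applied on demand."""
    beta = np.zeros(n_features)
    last = np.zeros(n_features, dtype=int)   # step at which each beta_j was last brought up to date
    t = 0
    for _ in range(epochs):
        for x, y in examples:                # x is a sparse dict {feature index: value}
            t += 1
            for j in x:                      # catch up on shrinkage from skipped steps
                beta[j] *= (1.0 - eta * mu) ** (t - 1 - last[j])
                last[j] = t - 1
            z = sum(beta[j] * v for j, v in x.items())
            pi = 1.0 / (1.0 + np.exp(-z))
            for j, v in x.items():           # regularized step, only for active features
                beta[j] = beta[j] * (1.0 - eta * mu) + eta * (y - pi) * v
                last[j] = t
    beta *= (1.0 - eta * mu) ** (t - last)   # final catch-up before returning
    return beta

# Toy sparse data (invented); feature 0 is the bias
examples = [({0: 1.0, 1: 2.0}, 1.0), ({0: 1.0, 2: 3.0}, 0.0)]
print(lazy_sparse_sgd(examples, n_features=3))
```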

Proofs about Stochastic Gradient Convergence depends on the convexity of the objective and how close (ε) you want to get to the actual answer. The best bounds depend on changing η over time and per dimension (not all features are created equal). Machine Learning: Chenhao Tan Boulder 26 of 27

Running time
Number of iterations to get to accuracy $\mathcal{L}(\beta) - \mathcal{L}(\beta^*) \le \epsilon$:
Gradient descent: if $\mathcal{L}$ is strongly convex, $O(\log(1/\epsilon))$ iterations.
Stochastic gradient descent: if $\mathcal{L}$ is strongly convex, $O(1/\epsilon)$ iterations.
Machine Learning: Chenhao Tan Boulder 27 of 27