Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5

Size: px

Start display at page:

Download "Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5"

Grace Penelope Taylor
6 years ago
Views:

1 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 5 Slides adapted from Jordan Boyd-Graber, Tom Mitchell, Ziv Bar-Joseph Machine Learning: Chenhao Tan Boulder 1 of 27

2 Quiz question For a test instance (x, y) and a naïve Bayes classifier ˆP, which of the following statements is true? (A) c ˆP(y = c x) = 1 (B) c ˆP(x c) = 1 (C) c ˆP(x c)ˆp(c) = 1 Machine Learning: Chenhao Tan Boulder 2 of 27

3 Overview Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 3 of 27

4 Reminder: Logistic Regression 1 P(Y = 0 X) = 1 + exp [β 0 + i β ix i ] P(Y = 1 X) = exp [β 0 + i β ix i ] 1 + exp [β 0 + i β ix i ] (1) (2) Discriminative prediction: P(y x) Classification uses: sentiment analysis, spam detection What we didn t talk about is how to learn β from data Machine Learning: Chenhao Tan Boulder 4 of 27

5 Objective function Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 5 of 27

6 Objective function Logistic Regression: Objective Function Maximize likelihood Obj log P(Y X, β) = j = j log P(y (j) x (j), β)) y (j) ( β 0 + i β i x (j) i ) log [ 1 + exp ( β 0 + i β i x (j) i )] Machine Learning: Chenhao Tan Boulder 6 of 27

7 Objective function Logistic Regression: Objective Function Minimize negative log likelihood (loss) L log P(Y X, β) = j = j log P(y (j) x (j), β)) y (j) ( β 0 + i β i x (j) i ) + log [ 1 + exp ( β 0 + i β i x (j) i )] Machine Learning: Chenhao Tan Boulder 7 of 27

8 Objective function Logistic Regression: Objective Function Minimize negative log likelihood (loss) L log P(Y X, β) = j = j log P(y (j) x (j), β)) y (j) ( β 0 + i β i x (j) i ) + log [ 1 + exp ( β 0 + i β i x (j) i )] Training data {(x, y)} are fixed. Objective function is a function of β... what values of β give a good value? Machine Learning: Chenhao Tan Boulder 7 of 27

9 Objective function Logistic Regression: Objective Function Minimize negative log likelihood (loss) L log P(Y X, β) = j = j log P(y (j) x (j), β)) y (j) ( β 0 + i β i x (j) i ) + log [ 1 + exp ( β 0 + i β i x (j) i )] Training data {(x, y)} are fixed. Objective function is a function of β... what values of β give a good value? β = arg min L (β) β Machine Learning: Chenhao Tan Boulder 7 of 27

10 Objective function Convexity L (β) is convex for logistic regression. Proof. Logistic loss yv + log(1 + exp(v)) is convex. Composition with linear function maintains convexity. Sum of convex functions is convex. Machine Learning: Chenhao Tan Boulder 8 of 27

11 Gradient Descent Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 9 of 27

12 Gradient Descent Convexity Convex function Doesn t matter where you start, if you go down along the gradient Machine Learning: Chenhao Tan Boulder 10 of 27

13 Gradient Descent Convexity Convex function Doesn t matter where you start, if you go down along the gradient Gradient! Machine Learning: Chenhao Tan Boulder 10 of 27

14 Gradient Descent Convexity It would have been much harder if this is not convex. Machine Learning: Chenhao Tan Boulder 11 of 27

15 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective Parameter Machine Learning: Chenhao Tan Boulder 12 of 27

16 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective Parameter Undiscovered Country Machine Learning: Chenhao Tan Boulder 12 of 27

17 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective 0 Parameter Undiscovered Country Machine Learning: Chenhao Tan Boulder 12 of 27

18 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective 0 Parameter Undiscovered Country Machine Learning: Chenhao Tan Boulder 12 of 27

19 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective 0 1 Parameter Undiscovered Country Machine Learning: Chenhao Tan Boulder 12 of 27

20 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective 0 1 Parameter Undiscovered Country Machine Learning: Chenhao Tan Boulder 12 of 27

21 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective Parameter Undiscovered Country Machine Learning: Chenhao Tan Boulder 12 of 27

22 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective Parameter Undiscovered Country Machine Learning: Chenhao Tan Boulder 12 of 27

23 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β Objective Parameter Undiscovered Country Machine Learning: Chenhao Tan Boulder 12 of 27

24 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β β l+1 j = β l j η L β j Machine Learning: Chenhao Tan Boulder 12 of 27

25 Gradient Descent Gradient Descent (non-convex) Goal Optimize loss function with respect to variables β β l+1 j = β l j η L β j Luckily, (vanilla) logistic regression is convex Machine Learning: Chenhao Tan Boulder 12 of 27

26 Gradient Descent Gradient for Logistic Regression To ease notation, let s define π i = exp βt x i 1 + exp β T x i (3) Our objective function is L = i log p(y i x i ) = i L i = i { log π i if y i = 1 log(1 π i ) if y i = 0 (4) Machine Learning: Chenhao Tan Boulder 13 of 27

27 Gradient Descent Taking the Derivative Apply chain rule: L β j = i L i ( β) β j = i 1 π i π i ( ) 1 1 π i π i β j β j if y i = 1 if y i = 0 (5) If we plug in the derivative, we can merge these two cases π i β j = π i (1 π i )x j, (6) L i β j = (y i π i )x j. (7) Machine Learning: Chenhao Tan Boulder 14 of 27

28 Gradient Descent Gradient for Logistic Regression Gradient Update β L ( β) = [ L ( β) ],..., L ( β) β 0 β n (8) β η β L ( β) (9) β i β i η L ( β) β i (10) Machine Learning: Chenhao Tan Boulder 15 of 27

29 Gradient Descent Gradient for Logistic Regression Gradient Update β L ( β) = [ L ( β) ],..., L ( β) β 0 β n (8) β η β L ( β) (9) β i β i η L ( β) β i (10) Why are we subtracting? What would we do if we wanted to do ascent? Machine Learning: Chenhao Tan Boulder 15 of 27

30 Gradient Descent Gradient for Logistic Regression Gradient Update β L ( β) = [ L ( β) ],..., L ( β) β 0 β n (8) η: step size, must be greater than zero β η β L ( β) (9) β i β i η L ( β) β i (10) Machine Learning: Chenhao Tan Boulder 15 of 27

31 Gradient Descent Choosing Step Size Objective Parameter Machine Learning: Chenhao Tan Boulder 16 of 27

32 Gradient Descent Choosing Step Size Objective Parameter Machine Learning: Chenhao Tan Boulder 16 of 27

33 Gradient Descent Choosing Step Size Objective Parameter Machine Learning: Chenhao Tan Boulder 16 of 27

34 Gradient Descent Choosing Step Size Objective Parameter Machine Learning: Chenhao Tan Boulder 16 of 27

35 Gradient Descent Remaining issues When to stop? What if β keeps getting bigger? Machine Learning: Chenhao Tan Boulder 17 of 27

36 Gradient Descent Regularized Conditional Log Likelihood Unregularized Regularized [ ] β = arg min ln p(y (j) x (j), β) β [ ] β = arg min ln p(y (j) x (j), β) + 1 β 2 µ i (11) β 2 i (12) Machine Learning: Chenhao Tan Boulder 18 of 27

37 Gradient Descent Regularized Conditional Log Likelihood Unregularized Regularized [ ] β = arg min ln p(y (j) x (j), β) β [ ] β = arg min ln p(y (j) x (j), β) + 1 β 2 µ i (11) β 2 i (12) µ is regularization parameter that trades off between likelihood and having small parameters Machine Learning: Chenhao Tan Boulder 18 of 27

38 Outline Objective function Gradient Descent Stochastic Gradient Descent Machine Learning: Chenhao Tan Boulder 19 of 27

39 Approximating the Gradient Our datasets are big (to fit into memory)... or data are changing / streaming Machine Learning: Chenhao Tan Boulder 20 of 27

40 Approximating the Gradient Our datasets are big (to fit into memory)... or data are changing / streaming Hard to compute true gradient L (β) E x [ L (β, x)] (13) Average over all observations Machine Learning: Chenhao Tan Boulder 20 of 27

41 Approximating the Gradient Our datasets are big (to fit into memory)... or data are changing / streaming Hard to compute true gradient L (β) E x [ L (β, x)] (13) Average over all observations What if we compute an update just from one observation? Machine Learning: Chenhao Tan Boulder 20 of 27

42 Stochastic Gradient Descent Getting to Union Station Pretend it s a pre-smartphone world and you want to get to Union Station Machine Learning: Chenhao Tan Boulder 21 of 27

43 Stochastic Gradient for Regularized Regression L = log p(y x; β) µ j β 2 j (14) Machine Learning: Chenhao Tan Boulder 22 of 27

44 Stochastic Gradient for Regularized Regression L = log p(y x; β) µ j β 2 j (14) Taking the derivative (with respect to example x i ) L β j = (y i π i )x j + µβ j (15) Machine Learning: Chenhao Tan Boulder 22 of 27

45 Stochastic Gradient for Logistic Regression Given a single observation x i chosen at random from the dataset, β j β j η ( µβ j x ij [y i π i ] ) (16) Machine Learning: Chenhao Tan Boulder 23 of 27

46 Example Documents β[j] =β[j] + η(y i π i )x i β = β bias = 0, β A = 0, β B = 0, β C = 0, β D = 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D You first see the positive example. First, compute π 1 Machine Learning: Chenhao Tan Boulder 24 of 27

47 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D You first see the positive example. First, compute π 1 π 1 = Pr(y 1 = 1 x 1 ) = exp βt x i 1+exp β T x i = Machine Learning: Chenhao Tan Boulder 24 of 27

48 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D You first see the positive example. First, compute π 1 π 1 = Pr(y 1 = 1 x 1 ) = exp βt x i 1+exp β T x i = exp 0 exp 0+1 = 0.5 Machine Learning: Chenhao Tan Boulder 24 of 27

49 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D π 1 = 0.5 What s the update for β bias? Machine Learning: Chenhao Tan Boulder 24 of 27

50 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β bias? β bias = β bias + η (y 1 π 1 ) x 1,bias = ( ) 1.0 Machine Learning: Chenhao Tan Boulder 24 of 27

51 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β bias? β bias = β bias + η (y 1 π 1 ) x 1,bias = ( ) 1.0 =0.5 Machine Learning: Chenhao Tan Boulder 24 of 27

52 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β A? Machine Learning: Chenhao Tan Boulder 24 of 27

53 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β A? β A = β A + η (y 1 π 1 ) x 1,A = ( ) 4.0 Machine Learning: Chenhao Tan Boulder 24 of 27

54 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β A? β A = β A + η (y 1 π 1 ) x 1,A = ( ) 4.0 =2.0 Machine Learning: Chenhao Tan Boulder 24 of 27

55 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β B? Machine Learning: Chenhao Tan Boulder 24 of 27

56 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β B? β B = β B + η (y 1 π 1 ) x 1,B = ( ) 3.0 Machine Learning: Chenhao Tan Boulder 24 of 27

57 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β B? β B = β B + η (y 1 π 1 ) x 1,B = ( ) 3.0 =1.5 Machine Learning: Chenhao Tan Boulder 24 of 27

58 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β C? Machine Learning: Chenhao Tan Boulder 24 of 27

59 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β C? β C = β C + η (y 1 π 1 ) x 1,C = ( ) 1.0 Machine Learning: Chenhao Tan Boulder 24 of 27

60 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β C? β C = β C + η (y 1 π 1 ) x 1,C = ( ) 1.0 =0.5 Machine Learning: Chenhao Tan Boulder 24 of 27

61 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β D? Machine Learning: Chenhao Tan Boulder 24 of 27

62 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β D? β D = β D + η (y 1 π 1 ) x 1,D = ( ) 0.0 Machine Learning: Chenhao Tan Boulder 24 of 27

63 Example Documents β[j] =β[j] + η(y i π i )x i β = 0, 0, 0, 0, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β D? β D = β D + η (y 1 π 1 ) x 1,D = ( ) 0.0 =0.0 Machine Learning: Chenhao Tan Boulder 24 of 27

64 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D Now you see the negative example. What s π 2? Machine Learning: Chenhao Tan Boulder 24 of 27

65 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D Now you see the negative example. What s π 2? π 2 = Pr(y 2 = 1 x 2 ) = exp βt x i 1+exp β T x i = exp{ } exp{ }+1 = Machine Learning: Chenhao Tan Boulder 24 of 27

66 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D Now you see the negative example. What s π 2? π 2 = Pr(y 2 = 1 x 2 ) = exp βt x i 1+exp β T x i = exp{ } exp{ }+1 = 0.97 Machine Learning: Chenhao Tan Boulder 24 of 27

67 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D Now you see the negative example. What s π 2? π 2 = 0.97 What s the update for β bias? Machine Learning: Chenhao Tan Boulder 24 of 27

68 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β bias? β bias = β bias + η (y 2 π 2 ) x 2,bias = ( ) 1.0 Machine Learning: Chenhao Tan Boulder 24 of 27

69 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β bias? β bias = β bias + η (y 2 π 2 ) x 2,bias = ( ) 1.0 =-0.47 Machine Learning: Chenhao Tan Boulder 24 of 27

70 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β A? Machine Learning: Chenhao Tan Boulder 24 of 27

71 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β A? β A = β A + η (y 2 π 2 ) x 2,A = ( ) 0.0 Machine Learning: Chenhao Tan Boulder 24 of 27

72 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β A? β A = β A + η (y 2 π 2 ) x 2,A = ( ) 0.0 =2.0 Machine Learning: Chenhao Tan Boulder 24 of 27

73 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β B? Machine Learning: Chenhao Tan Boulder 24 of 27

74 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β B? β B = β B + η (y 2 π 2 ) x 2,B = ( ) 1.0 Machine Learning: Chenhao Tan Boulder 24 of 27

75 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β B? β B = β B + η (y 2 π 2 ) x 2,B = ( ) 1.0 =0.53 Machine Learning: Chenhao Tan Boulder 24 of 27

76 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β C? Machine Learning: Chenhao Tan Boulder 24 of 27

77 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β C? β C = β C + η (y 2 π 2 ) x 2,C = ( ) 3.0 Machine Learning: Chenhao Tan Boulder 24 of 27

78 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β C? β C = β C + η (y 2 π 2 ) x 2,C = ( ) 3.0 =-2.41 Machine Learning: Chenhao Tan Boulder 24 of 27

79 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β D? Machine Learning: Chenhao Tan Boulder 24 of 27

80 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β D? β D = β D + η (y 2 π 2 ) x 2,D = ( ) 4.0 Machine Learning: Chenhao Tan Boulder 24 of 27

81 Example Documents β[j] =β[j] + η(y i π i )x i β =.5, 2, 1.5, 0.5, 0 y 1 =1 A A A A B B B C (Assume step size η = 1.0.) y 2 =0 B C C C D D D D What s the update for β D? β D = β D + η (y 2 π 2 ) x 2,D = ( ) 4.0 =-3.88 Machine Learning: Chenhao Tan Boulder 24 of 27

82 Algorithm 1. Initialize a vector β to be all zeros 2. For t = 1,..., T For each example x i, y i and feature j: Compute π i Pr(y i = 1 x i) Set β j = β j η(µβ j (y i π i)x i) 3. Output the parameters β 1,..., β d. Machine Learning: Chenhao Tan Boulder 25 of 27

83 Algorithm 1. Initialize a vector β to be all zeros 2. For t = 1,..., T For each example x i, y i and feature j: Compute π i Pr(y i = 1 x i) Set β j = β j η(µβ j (y i π i)x i) 3. Output the parameters β 1,..., β d. Any issues? Machine Learning: Chenhao Tan Boulder 25 of 27

84 Algorithm 1. Initialize a vector β to be all zeros 2. For t = 1,..., T For each example x i, y i and feature j: Compute π i Pr(y i = 1 x i) Set β j = β j η(µβ j (y i π i)x i) 3. Output the parameters β 1,..., β d. Any issues? Inefficiency due to updates on every j even if x ij = 0. Lazy sparse updates in the readings. Machine Learning: Chenhao Tan Boulder 25 of 27

85 Proofs about Stochastic Gradient Depends on convexity of objective and how close ɛ you want to get to actual answer Best bounds depend on changing η over time and per dimension (not all features created equal) Machine Learning: Chenhao Tan Boulder 26 of 27

86 Running time Number of iterations to get to accuracy: L (β) L (β ) ɛ Gradient descent: If L is strongly convex: O(log(1/ɛ)) iterations Stochastic gradient descent: If L is strongly convex: O(1/ɛ) iterations Machine Learning: Chenhao Tan Boulder 27 of 27

Introduction to Machine Learning

Introduction to Machine Learning Machine Learning: Jordan Boyd-Graber University of Maryland LOGISTIC REGRESSION FROM TEXT Slides adapted from Emily Fox Machine Learning: Jordan Boyd-Graber UMD Introduction