Logistic Regression. Stochastic Gradient Descent


Tutorial 8, CPSC 340

Logistic Regression Model

A discriminative probabilistic model for classification, e.g. spam filtering.

Let $x \in \mathbb{R}^d$ be the input and $y \in \{-1, +1\}$ the label.

The probabilistic model uses the sigmoid function:

$p(y = 1 \mid x) = \sigma(w^T x)$, where $\sigma(\eta) = \frac{1}{1 + \exp(-\eta)}$

What is the probability $p(y = -1 \mid x)$?

$p(y = -1 \mid x) = 1 - p(y = 1 \mid x) = \frac{1}{1 + \exp(w^T x)}$
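A minimal numeric sketch of the model in Python (the weight vector w and input x are made-up values for illustration):

```python
import numpy as np

def sigmoid(eta):
    """Sigmoid function: 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

w = np.array([2.0, -1.0])   # illustrative parameters
x = np.array([0.5, 0.3])    # illustrative input
p_pos = sigmoid(w @ x)      # p(y = +1 | x)
p_neg = 1.0 - p_pos         # p(y = -1 | x) = 1 / (1 + exp(w^T x))
print(p_pos + p_neg)        # the two probabilities sum to 1
```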

Learning in Logistic Regression

Let $X \in \mathbb{R}^{n \times d}$ and $y \in \{-1, +1\}^n$ be the training data.

We can use the logistic loss to learn the parameter vector $w$:

$f(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i))$

We want to find $w^* = \arg\min_{w \in \mathbb{R}^d} f(w)$.

Since $f(w)$ is a convex function of $w$, we can use gradient descent (GD) to find $w^*$.

Regularized loss function:

$f(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i)) + \frac{\lambda}{2} \|w\|^2$

Exercise: find the gradient of the regularized $f(w)$.

Solution:

$f_i(w) = \log(1 + \exp(-y_i w^T x_i))$

$\nabla f_i(w) = \frac{-y_i x_i}{1 + \exp(y_i w^T x_i)}$

$\nabla f(w) = -\frac{1}{n} \sum_{i=1}^{n} \frac{y_i x_i}{1 + \exp(y_i w^T x_i)} + \lambda w$

Coding Exercise: we want to write a function that computes the loss value and the gradient. Write the code for the logLoss subfunction (see the sketch below).

Solution:
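The solution code on the slides was an image; here is a minimal Python sketch of such a function (the name log_loss and its signature are assumptions, not the course's starter code):

```python
import numpy as np

def log_loss(w, X, y, lam):
    """Regularized logistic loss and its gradient.

    w: (d,) weights; X: (n, d) data; y: (n,) labels in {-1, +1};
    lam: L2 regularization strength.
    """
    n = X.shape[0]
    yXw = y * (X @ w)  # y_i * w^T x_i for all i, shape (n,)
    # f(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) ||w||^2
    f = np.mean(np.log1p(np.exp(-yXw))) + 0.5 * lam * (w @ w)
    # grad f(w) = -(1/n) sum_i y_i x_i / (1 + exp(y_i w^T x_i)) + lam * w
    g = -(X.T @ (y / (1.0 + np.exp(yXw)))) / n + lam * w
    return f, g
```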

Exercise: Let $Z$ be the transformation of $X$ under some non-linear basis. What are the new probabilistic model, logistic loss, and gradient?

Solution:

$p(y = 1 \mid z) = \sigma(w^T z)$

$f(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T z_i)) + \frac{\lambda}{2} \|w\|^2$

$\nabla f(w) = -\frac{1}{n} \sum_{i=1}^{n} \frac{y_i z_i}{1 + \exp(y_i w^T z_i)} + \lambda w$
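Since only the features change, the same loss code applies to $Z$. A minimal sketch (assuming numpy as np, the log_loss sketch above, a degree-2 polynomial basis as one illustrative choice of non-linear transformation, and that X, y, lam, and w_z already exist with matching shapes):

```python
def poly_basis(X):
    # Degree-2 polynomial basis with a bias column (an assumed example basis).
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])

Z = poly_basis(X)                  # transform the data once, up front
f, g = log_loss(w_z, Z, y, lam)    # identical loss code, with Z in place of X
```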

Stochastic Gradient Descent

Gradient Descent Cost for Big Data

Assume our loss function has the following form:

$f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$, e.g. $f(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i))$

Using GD to find the best $w$:

$w^{t+1} = w^t - \alpha^t \nabla f(w^t)$

But the cost of computing $\nabla f(w)$ is $O(n)$ because of the sum:

$\nabla f(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w)$

The cost of each iteration of GD can be enormous when $n$ is large!
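A minimal GD loop sketch (assuming the log_loss sketch above; the step size and iteration count are illustrative, not tuned values):

```python
def gradient_descent(w, X, y, lam, alpha=0.1, T=100):
    for _ in range(T):
        _, g = log_loss(w, X, y, lam)  # full-data gradient: O(n) per iteration
        w = w - alpha * g
    return w
```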

Stochastic Gradient Descent (SGD) for Big Data

SGD algorithm: in each iteration, pick an $f_i$ uniformly at random and use its gradient:

$i \sim \mathrm{unif}(1, n)$

$w^{t+1} = w^t - \alpha^t \nabla f_i(w^t)$

$\nabla f_i$ is an unbiased approximation of $\nabla f$:

$E[\nabla f_i(w)] = \sum_{i=1}^{n} p(i) \nabla f_i(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w)$

The cost of each iteration of SGD is constant.

An individual iteration need not move toward the minimizer, so SGD makes slower progress per iteration than GD.

GD vs. SGD

Gradient descent: $\theta^t = \theta^{t-1} - \gamma^t \nabla g(\theta^{t-1}) = \theta^{t-1} - \frac{\gamma^t}{n} \sum_{i=1}^{n} \nabla f_i(\theta^{t-1})$

Stochastic gradient descent: $\theta^t = \theta^{t-1} - \gamma^t \nabla f_{i(t)}(\theta^{t-1})$

Variance of SGD

Variance of the stochastic gradient in each iteration:

$\mathrm{Var}[\nabla f_i(w)] = \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(w) - \nabla f(w)\|^2$

If the variance is small, every step moves in roughly the right direction.

If the variance is large, many steps move in the wrong direction!

The variance can be controlled by decreasing the step size or by a variance reduction technique.

To get convergence we need decreasing step sizes, but they cannot shrink too quickly.

Two main conditions on the decreasing step sizes:

$\sum_{t=1}^{\infty} \alpha^t = \infty$ (we can get everywhere)  (1)

$\sum_{t=1}^{\infty} (\alpha^t)^2 < \infty$ (the effect of the variance goes to zero)  (2)

Setting $\alpha^t = O(1/t)$ satisfies the above conditions, but in practice it shrinks too quickly and convergence is slow.

In practice: $\alpha^t = \beta/(t + \gamma)$, $\alpha^t = O(1/\sqrt{t})$, or $\alpha^t = O(1/t^\beta)$ for $\beta \in (0, 1)$.

SGD for Logistic Loss

$\alpha^t = \beta/(t + \gamma)$, $f(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i)) + \frac{\lambda}{2} \|w\|^2$

Coding Exercise: using the logLoss subfunction from the previous exercise, complete the SGD code for the logistic loss. $T$: number of iterations; $w^0$: initial value.

Solution:
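The solution on the slides was an image; a minimal Python sketch under the same setup (assuming numpy as np and the log_loss sketch above; the beta and gamma defaults are illustrative):

```python
def sgd_logistic(w, X, y, lam, T, beta=1.0, gamma=1.0):
    """SGD for the regularized logistic loss with alpha_t = beta / (t + gamma)."""
    n = X.shape[0]
    for t in range(1, T + 1):
        i = np.random.randint(n)       # i ~ unif(1, n)
        alpha = beta / (t + gamma)     # decreasing step size
        # log_loss on a single row gives grad f_i (plus the lam * w term)
        _, g = log_loss(w, X[i:i+1], y[i:i+1], lam)
        w = w - alpha * g
    return w
```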

Variance Reduction Techniques

Using mini-batches:

In each iteration $t$, we sample a random mini-batch $B^t$:

$w^{t+1} = w^t - \frac{\alpha^t}{|B^t|} \sum_{i \in B^t} \nabla f_i(w^t)$

The variance is inversely proportional to the mini-batch size (see the sketch below).
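A minimal mini-batch variant of the SGD sketch above (assuming numpy as np and log_loss; the batch size is an illustrative choice):

```python
def minibatch_sgd(w, X, y, lam, T, batch_size=16, beta=1.0, gamma=1.0):
    n = X.shape[0]
    for t in range(1, T + 1):
        B = np.random.choice(n, size=batch_size, replace=False)  # random mini-batch
        alpha = beta / (t + gamma)
        _, g = log_loss(w, X[B], y[B], lam)  # averages grad f_i over the batch
        w = w - alpha * g
    return w
```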

Using auxiliary memory: the SAG method.

It uses extra memory $y$ with $n$ cells, where each cell $y_i$ stores $d$ values.

In each iteration $t$, we pick an $f_i$ at random and evaluate the gradient $\nabla f_i(w^t)$.

Store $\nabla f_i(w^t)$ in $y_i$.

$w^{t+1} = w^t - \frac{\alpha}{n} \sum_{i=1}^{n} y_i$

The cost of each iteration is constant.

Convergence is fast since we can use a constant step size.

The memory requirement can be restrictive when $n$ is enormous.
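A minimal SAG sketch under the same setup (assuming numpy as np and log_loss; the stored-gradient memory is named G here to avoid clashing with the labels y, and the constant step size is an illustrative value):

```python
def sag_logistic(w, X, y, lam, T, alpha=0.01):
    n, d = X.shape
    G = np.zeros((n, d))   # memory: G[i] holds the last gradient of f_i we saw
    g_sum = np.zeros(d)    # running sum of the n stored gradients
    for _ in range(T):
        i = np.random.randint(n)
        _, g_i = log_loss(w, X[i:i+1], y[i:i+1], lam)
        g_sum += g_i - G[i]          # O(d) update keeps the iteration cost constant
        G[i] = g_i
        w = w - (alpha / n) * g_sum  # constant step size
    return w
```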
