Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set of notes is based on internet resources and KP Murphy (2012). Machine learning: a probabilistic perspective. MIT Press. (Chapter 8) Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT press. www.deeplearningbook.org. (Chapter 4) Andrew Ng. Lecture Notes on Machine Learning. Stanford. Hal Daume. A Course on Machine Learning. http://ciml.info/ Nevin L. Zhang (HKUST) Machine Learning 1 / 50
Logistic Regression Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 2 / 50
Logistic Regression Recap of Probabilistic Linear Regression
Training set $D = \{x_i, y_i\}_{i=1}^N$, where $y_i \in \mathbb{R}$.
Probabilistic model: $p(y \mid x, \theta) = \mathcal{N}(y \mid \mu(x), \sigma^2) = \mathcal{N}(y \mid w^\top x, \sigma^2)$
Learning: determine $w$ by minimizing the negative log-likelihood $\mathrm{NLL}(w, \sigma) = -\log P(D \mid w, \sigma)$
Point estimate of $y$: $\hat{y} = \mu(x) = w^\top x$
Nevin L. Zhang (HKUST) Machine Learning 3 / 50
Logistic Regression Logistic Regression (for Classification)
Training set $D = \{x_i, y_i\}_{i=1}^N$, where $y_i \in \{0, 1\}$.
Probabilistic model: $p(y \mid x, w) = \mathrm{Ber}(y \mid \sigma(w^\top x))$
$\sigma(z)$ is the sigmoid/logistic/logit function: $\sigma(z) = \frac{1}{1 + \exp(-z)} = \frac{e^z}{e^z + 1}$
It maps the real line $\mathbb{R}$ to $[0, 1]$.
Not to be confused with the variance $\sigma^2$ in the Gaussian distribution.
Nevin L. Zhang (HKUST) Machine Learning 4 / 50
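Below is a minimal NumPy sketch of the sigmoid function; the clipping of z is a numerical convenience of mine, not part of the definition above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z)), mapping R into (0, 1)."""
    z = np.clip(z, -500, 500)   # guard against overflow in exp; an implementation detail
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                      # 0.5
print(sigmoid(np.array([-10.0, 10.0])))  # close to [0, 1]
```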
Logistic Regression Logistic Regression
Rephrasing the probabilistic model:
$p(y = 1 \mid x, w) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}$
Equivalently, $\log \frac{p(y = 1 \mid x, w)}{1 - p(y = 1 \mid x, w)} = w^\top x$
The quantity $\log \frac{p}{1 - p}$ is called the logit of $p$.
Learning: determine $w$ by minimizing the negative log-likelihood $\mathrm{NLL}(w) = -\log P(D \mid w)$
Nevin L. Zhang (HKUST) Machine Learning 5 / 50
Logistic Regression Logistic Regression: Decision Rule
To classify instances, we obtain a point estimate of $y$: $\hat{y} = \arg\max_y p(y \mid x, w)$
In other words, the decision/classification rule is: $\hat{y} = 1$ iff $p(y = 1 \mid x, w) > 0.5$
This is called the optimal Bayes classifier: suppose the same situation is to occur many times. You will make some mistakes no matter what decision rule you use. The probability of a mistake is minimized if you use the above rule.
Nevin L. Zhang (HKUST) Machine Learning 6 / 50
Logistic Regression Logistic Regression: Example Solid black dots are the data. Those at the bottom are the SAT scores of applicants rejected by a university, and those at the top are the SAT scores of applicants accepted by a university. The red circles are the predicted probabilities that the applicants would be accepted. Nevin L. Zhang (HKUST) Machine Learning 7 / 50
Logistic Regression Logistic Regression: 2D Example
The decision boundary $p(y = 1 \mid x, w) = 0.5$ is a straight line in the feature space.
Nevin L. Zhang (HKUST) Machine Learning 8 / 50
Logistic Regression Logistic Regression is a Linear Classifier
In fact, the decision/classification rule in logistic regression is equivalent to: $\hat{y} = 1$ iff $w^\top x > 0$
So, it is a linear classifier with a linear decision boundary.
Example: whether a student is admitted based on the results of two exams:
Nevin L. Zhang (HKUST) Machine Learning 9 / 50
Logistic Regression Some Math
Since $P(y = 1 \mid x, w) = \sigma(w^\top x)$, we have $P(y = 0 \mid x, w) = 1 - \sigma(w^\top x)$.
We can combine the two equations into one:
$P(y \mid x, w) = (\sigma(w^\top x))^y (1 - \sigma(w^\top x))^{1 - y}$
Hence,
$\log P(y \mid x, w) = y \log \sigma(w^\top x) + (1 - y) \log(1 - \sigma(w^\top x))$
Nevin L. Zhang (HKUST) Machine Learning 10 / 50
Logistic Regression Model fitting
We would like to find the MLE of $w$, i.e., the value of $w$ that minimizes the negative log-likelihood:
$\mathrm{NLL}(w) = -\log P(D \mid w) = -\sum_{i=1}^N \log P(y_i \mid x_i, w) = -\sum_{i=1}^N \left[ y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i)) \right]$
Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use optimization algorithms to compute it:
Gradient descent
Newton's method
Nevin L. Zhang (HKUST) Machine Learning 11 / 50
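A sketch of this NLL in NumPy, assuming a design matrix X of shape (N, D), labels y in {0, 1}, and a sigmoid helper as before; the small eps constant is my own guard against log(0).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """NLL(w) = -sum_i [ y_i log sigma(w.x_i) + (1 - y_i) log(1 - sigma(w.x_i)) ]."""
    p = sigmoid(X @ w)   # predicted P(y = 1 | x_i, w) for every row of X
    eps = 1e-12          # keeps log() away from 0; an implementation detail
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```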
Gradient Descent Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 12 / 50
Gradient Descent Gradient Descent
Consider a function $y = J(w)$ of a scalar variable $w$. The derivative of $J(w)$ is defined as follows:
$J'(w) = \frac{dJ(w)}{dw} = \lim_{\epsilon \to 0} \frac{J(w + \epsilon) - J(w)}{\epsilon}$
When $\epsilon$ is small, we have
$J(w + \epsilon) \approx J(w) + \epsilon J'(w)$
This equation tells us how to reduce $J(w)$ by changing $w$ in small steps:
If $J'(w) > 0$, make $\epsilon$ negative, i.e., decrease $w$;
If $J'(w) < 0$, make $\epsilon$ positive, i.e., increase $w$.
In other words, move in the opposite direction of the derivative (gradient).
Nevin L. Zhang (HKUST) Machine Learning 13 / 50
Gradient Descent Gradient Descent
To implement the idea, we update $w$ as follows:
$w \leftarrow w - \alpha J'(w)$
The term $-J'(w)$ means that we move in the opposite direction of the derivative, and $\alpha$ determines how much we move in that direction. It is called the step size in optimization and the learning rate in machine learning.
Nevin L. Zhang (HKUST) Machine Learning 14 / 50
Gradient Descent Gradient Descent
Consider a function $y = J(w)$, where $w = (w_0, w_1, \ldots, w_D)$. The gradient of $J$ at $w$ is defined as
$\nabla_w J = \left( \frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_D} \right)$
The gradient is the direction along which $J$ increases the fastest. If we want to reduce $J$ as fast as possible, move in the opposite direction of the gradient.
Nevin L. Zhang (HKUST) Machine Learning 15 / 50
Gradient Descent Gradient Descent
The method of steepest descent/gradient descent for minimizing $J(w)$:
1 Initialize $w$
2 Repeat until convergence: $w \leftarrow w - \alpha \nabla_w J(w)$
The learning rate $\alpha$ usually changes from iteration to iteration.
Nevin L. Zhang (HKUST) Machine Learning 16 / 50
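A generic sketch of this loop in NumPy; grad_J is any function returning the gradient of J at w, and the fixed step size, iteration cap, and tolerance are illustrative choices rather than anything prescribed here.

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, max_iters=1000, tol=1e-6):
    """Repeat w <- w - alpha * grad J(w) until the update becomes tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        step = alpha * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < tol:   # crude convergence test
            break
    return w

# Example: minimize J(w) = ||w - (3, -1)||^2, whose gradient is 2(w - (3, -1)).
w_star = gradient_descent(lambda w: 2 * (w - np.array([3.0, -1.0])), w0=[0.0, 0.0])
print(w_star)   # approximately [3, -1]
```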
Gradient Descent Choice of Learning Rate
A constant learning rate is difficult to set: too small, and convergence will be very slow; too large, and the method can fail to converge at all.
It is better to vary the learning rate. Will discuss this more in L07.
Nevin L. Zhang (HKUST) Machine Learning 17 / 50
Gradient Descent Gradient Descent
Gradient descent can get stuck at local minima or saddle points. Nonetheless, it usually works well.
Nevin L. Zhang (HKUST) Machine Learning 18 / 50
Gradient Descent Derivative of the Sigmoid Function
To apply gradient descent to logistic regression, we need to compute the partial derivative of the NLL w.r.t. each weight $w_j$. Before doing that, first consider the derivative of the sigmoid function:
$\sigma'(z) = \frac{d\sigma(z)}{dz} = \frac{d}{dz} \frac{1}{1 + e^{-z}} = -\frac{1}{(1 + e^{-z})^2} \frac{d(1 + e^{-z})}{dz} = \frac{1}{(1 + e^{-z})^2} e^{-z} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = \sigma(z)(1 - \sigma(z))$
Nevin L. Zhang (HKUST) Machine Learning 19 / 50
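A quick finite-difference check of the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$; the test points and step size are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eps = 1e-6
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # central difference
    analytic = sigmoid(z) * (1 - sigmoid(z))
    print(z, numeric, analytic)   # the two values agree to high precision
```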
Gradient Descent Derivative of NLL
Let $z_i = w^\top x_i$. Then
$\frac{\partial \mathrm{NLL}(w)}{\partial w_j} = -\sum_{i=1}^N \left[ y_i \frac{\partial \log \sigma(z_i)}{\partial w_j} + (1 - y_i) \frac{\partial \log(1 - \sigma(z_i))}{\partial w_j} \right]$
$= -\sum_{i=1}^N \left[ \frac{y_i}{\sigma(z_i)} - \frac{1 - y_i}{1 - \sigma(z_i)} \right] \frac{\partial \sigma(z_i)}{\partial w_j}$
$= -\sum_{i=1}^N \left[ \frac{y_i}{\sigma(z_i)} - \frac{1 - y_i}{1 - \sigma(z_i)} \right] \sigma(z_i)(1 - \sigma(z_i)) \frac{\partial z_i}{\partial w_j}$
$= -\sum_{i=1}^N \left[ y_i (1 - \sigma(z_i)) - (1 - y_i) \sigma(z_i) \right] x_{i,j}$
$= -\sum_{i=1}^N \left[ y_i - \sigma(z_i) \right] x_{i,j}$
Nevin L. Zhang (HKUST) Machine Learning 20 / 50
Gradient Descent Batch Gradient Descent
The batch gradient descent algorithm for logistic regression:
Repeat until convergence: $w_j \leftarrow w_j + \alpha \sum_{i=1}^N [y_i - \sigma(w^\top x_i)] x_{i,j}$
Interpretation (assume the $x_i$ are positive vectors):
If the predicted value $\sigma(w^\top x_i)$ is smaller than the actual value $y_i$, there is reason to increase $w_j$. The increment is proportional to $x_{i,j}$.
If the predicted value $\sigma(w^\top x_i)$ is larger than the actual value $y_i$, there is reason to decrease $w_j$. The decrement is proportional to $x_{i,j}$.
Nevin L. Zhang (HKUST) Machine Learning 21 / 50
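A vectorized sketch of this batch update, assuming X has shape (N, D) (with a leading column of ones if a bias term is wanted) and y is in {0, 1}; the learning rate and iteration count are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_batch(X, y, alpha=0.1, n_iters=500):
    """Batch update: w_j <- w_j + alpha * sum_i (y_i - sigma(w.x_i)) * x_ij, vectorized."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        err = y - sigmoid(X @ w)     # residuals y_i - sigma(w.x_i), shape (N,)
        w = w + alpha * (X.T @ err)  # one full-batch gradient step
    return w
```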
Gradient Descent Stochastic Gradient Descent
Batch gradient descent is costly when $N$ is large. In such cases, we usually use stochastic gradient descent:
Repeat until convergence:
Randomly choose $B \subset \{1, 2, \ldots, N\}$
$w_j \leftarrow w_j + \alpha \sum_{i \in B} [y_i - \sigma(w^\top x_i)] x_{i,j}$
The randomly picked subset $B$ is called a minibatch. Its size is called the batch size.
Nevin L. Zhang (HKUST) Machine Learning 22 / 50
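The same update restricted to a random minibatch B; the batch size, step count, and random seed are illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, alpha=0.1, batch_size=32, n_steps=2000, seed=0):
    """Stochastic gradient descent over randomly chosen minibatches B."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_steps):
        B = rng.choice(N, size=min(batch_size, N), replace=False)   # random minibatch
        err = y[B] - sigmoid(X[B] @ w)
        w = w + alpha * (X[B].T @ err)
    return w
```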
Gradient Descent $L_2$ Regularization
Just as we prefer ridge regression to linear regression, we prefer to add a regularization term to the objective function of logistic regression:
$J(w) = \mathrm{NLL}(w) + \frac{1}{2} \lambda \|w\|_2^2$
In this case, we have
$\frac{\partial J(w)}{\partial w_j} = -\sum_{i=1}^N [y_i - \sigma(z_i)] x_{i,j} + \lambda w_j$
The weight update formula is:
$w_j \leftarrow w_j + \alpha \left[ -\lambda w_j + \sum_{i \in B} (y_i - \sigma(w^\top x_i)) x_{i,j} \right]$
Nevin L. Zhang (HKUST) Machine Learning 23 / 50
Gradient Descent $L_2$ Regularization
Update rule without regularization:
$w_j \leftarrow w_j + \alpha \sum_{i \in B} [y_i - \sigma(w^\top x_i)] x_{i,j}$
Update rule with regularization:
$w_j \leftarrow (1 - \alpha\lambda) w_j + \alpha \sum_{i \in B} (y_i - \sigma(w^\top x_i)) x_{i,j}$
Regularization forces the weights to be smaller: since $0 < \alpha\lambda < 1$, we have $|(1 - \alpha\lambda) w_j| < |w_j|$, i.e., each weight shrinks toward zero.
Nevin L. Zhang (HKUST) Machine Learning 24 / 50
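The regularized update written as weight decay, sketched as a single minibatch step; lam stands for $\lambda$, and the hyperparameter values are again placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step_l2(w, X_B, y_B, alpha=0.1, lam=0.01):
    """One minibatch step: w <- (1 - alpha*lam) * w + alpha * sum_{i in B} (y_i - sigma(w.x_i)) x_i."""
    err = y_B - sigmoid(X_B @ w)
    return (1 - alpha * lam) * w + alpha * (X_B.T @ err)
```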
Newton's Method Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 25 / 50
Newton's Method Newton's Method
Consider again a function $J(w)$ of a scalar variable $w$. We want to minimize $J(w)$. At a minimum point, the derivative $J'(w) = 0$.
Let $f(w) = J'(w)$. We want to find points where $f(w) = 0$. Newton's method (aka the Newton-Raphson method) is a commonly used method to solve this problem:
Start with some initial value $w$
Repeat the following update until convergence: $w \leftarrow w - \frac{f(w)}{f'(w)}$
Nevin L. Zhang (HKUST) Machine Learning 26 / 50
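A tiny sketch of this Newton-Raphson update for a scalar root-finding problem; the example function $w^2 - 2$ is arbitrary.

```python
def newton_root(f, f_prime, w0, n_iters=20, tol=1e-10):
    """Repeat w <- w - f(w) / f'(w) until f(w) is essentially zero."""
    w = w0
    for _ in range(n_iters):
        w = w - f(w) / f_prime(w)
        if abs(f(w)) < tol:
            break
    return w

# Example: the root of f(w) = w^2 - 2 starting from w = 1 (converges to sqrt(2)).
print(newton_root(lambda w: w * w - 2.0, lambda w: 2.0 * w, 1.0))
```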
Newton's Method Newton's Method
Interpretation of Newton's method:
Approximate the function $f$ by the linear function that is tangent to $f$ at the current guess $w_i$:
$y = f'(w_i) w + f(w_i) - f'(w_i) w_i$
Solve for where that linear function equals zero to get the next $w$:
$f'(w_i) w + f(w_i) - f'(w_i) w_i = 0 \implies w_{i+1} = w_i - \frac{f(w_i)}{f'(w_i)}$
Nevin L. Zhang (HKUST) Machine Learning 27 / 50
Newton's Method Newton's Method
Minimize $J(w)$ using Newton's method:
Start with some initial value $w$
Repeat the following update until convergence: $w \leftarrow w - \frac{J'(w)}{J''(w)}$
where $J''$ is the second derivative of $J$.
Nevin L. Zhang (HKUST) Machine Learning 28 / 50
Newton's Method Newton's Method
Now consider minimizing a function $J(w)$ of a vector $w = (w_0, w_1, \ldots, w_D)$.
The first derivative of $J(w)$ is the gradient vector $\nabla_w J(w)$.
The second derivative of $J(w)$ is the Hessian matrix $H = [H_{ij}]$, where $H_{ij} = \frac{\partial^2 J(w)}{\partial w_i \partial w_j}$
Minimize $J(w)$ using Newton's method:
Start with some initial value $w$
Repeat the following update until convergence: $w \leftarrow w - H^{-1} \nabla_w J(w)$
Nevin L. Zhang (HKUST) Machine Learning 29 / 50
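A sketch of this update applied to the logistic-regression NLL. The Hessian formula $X^\top \mathrm{diag}(\sigma_i(1 - \sigma_i)) X$ is the standard result for this model, though it is not derived on these slides; the sketch also assumes the Hessian is invertible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iters=10):
    """Repeat w <- w - H^{-1} grad NLL(w) for the logistic-regression NLL."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)              # gradient of the NLL
        S = p * (1 - p)                   # sigma_i (1 - sigma_i) for each example
        H = X.T @ (X * S[:, None])        # Hessian X^T diag(S) X (assumed invertible)
        w = w - np.linalg.solve(H, grad)  # solve H d = grad rather than inverting H
    return w
```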
Newton's Method Newton's Method: Another Perspective
By Taylor expansion, we have
$J(w + \epsilon) \approx J(w) + J'(w)\epsilon + \frac{1}{2} J''(w)\epsilon^2$
We want $\frac{dJ(w + \epsilon)}{d\epsilon} = 0$. So,
$J'(w) + J''(w)\epsilon = 0$
Solving the equation, we get
$\epsilon = -\frac{J'(w)}{J''(w)}$
Nevin L. Zhang (HKUST) Machine Learning 30 / 50
Newton's Method Newton's Method vs Gradient Descent
In gradient descent, we have only a first-order term in the approximation (gradient):
$J(w + \epsilon) \approx J(w) + \epsilon J'(w)$
In Newton's method, we also have a second-order term in the approximation (curvature):
$J(w + \epsilon) \approx J(w) + J'(w)\epsilon + \frac{1}{2} J''(w)\epsilon^2$
Nevin L. Zhang (HKUST) Machine Learning 31 / 50
Newton's Method Newton's Method vs Gradient Descent
Use of curvature allows us to better predict how $J(w)$ changes when we change $w$:
1 With negative curvature, $J$ actually decreases faster than the gradient predicts.
2 With no curvature, the gradient predicts the decrease correctly.
3 With positive curvature, the function decreases more slowly than expected.
The use of curvature mitigates the problems in cases 1 and 3.
Nevin L. Zhang (HKUST) Machine Learning 32 / 50
Newton's Method Newton's Method vs Gradient Descent
Newton's method (red) typically enjoys faster convergence than (batch) gradient descent (green), and requires many fewer iterations to get very close to the minimum.
One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires computing and inverting the Hessian matrix.
Nevin L. Zhang (HKUST) Machine Learning 33 / 50
Softmax Regression Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 34 / 50
Softmax Regression Multi-Class Logistic Regression
Training set $D = \{x_i, y_i\}_{i=1}^N$, where $y_i \in \{1, 2, \ldots, C\}$ and $C \geq 2$.
Multi-class logistic regression (aka softmax regression) uses the following probability model:
There is a weight vector $w_c = (w_{c,0}, w_{c,1}, \ldots, w_{c,D})$ for each class $c$. $W = (w_1, \ldots, w_C)$ is the weight matrix.
The probability of a particular class $c$ given $x$ is:
$p(y = c \mid x, W) = \frac{1}{Z(x, W)} \exp(w_c^\top x)$
where $Z(x, W) = \sum_{c'=1}^C \exp(w_{c'}^\top x)$ is the normalization constant that ensures the probabilities of all classes sum to 1. It is called the partition function.
The classification rule is: $\hat{y} = \arg\max_y p(y \mid x, W)$
Nevin L. Zhang (HKUST) Machine Learning 35 / 50
Softmax Regression Relationship with Logistic Regression
When $C = 2$, name the two classes $\{0, 1\}$ instead of $\{1, 2\}$:
$p(y = 0 \mid x, W) = \frac{\exp(w_0^\top x)}{\exp(w_0^\top x) + \exp(w_1^\top x)}, \quad p(y = 1 \mid x, W) = \frac{\exp(w_1^\top x)}{\exp(w_0^\top x) + \exp(w_1^\top x)}$
Dividing both numerator and denominator by $\exp(w_1^\top x)$, we get
$p(y = 0 \mid x, W) = \frac{\exp((w_0 - w_1)^\top x)}{\exp((w_0 - w_1)^\top x) + 1}, \quad p(y = 1 \mid x, W) = \frac{1}{\exp((w_0 - w_1)^\top x) + 1}$
Letting $w = w_1 - w_0$, we get the logistic regression model:
$p(y = 0 \mid x, w) = \frac{\exp(-w^\top x)}{1 + \exp(-w^\top x)}, \quad p(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^\top x)} = \sigma(w^\top x)$
Nevin L. Zhang (HKUST) Machine Learning 36 / 50
Softmax Regression Negative Log-likelihood
The negative log-likelihood of the softmax regression model is:
$\mathrm{NLL}(W) = -\log P(D \mid W) = -\sum_{i=1}^N \log P(y_i \mid x_i, W)$
$= -\sum_{i=1}^N \sum_{c=1}^C 1(y_i = c) \log P(y_i = c \mid x_i, W)$
$= -\sum_{i=1}^N \sum_{c=1}^C 1(y_i = c) \log \frac{\exp(w_c^\top x_i)}{\sum_{c'=1}^C \exp(w_{c'}^\top x_i)}$
$= -\sum_{i=1}^N \sum_{c=1}^C 1(y_i = c) \left[ w_c^\top x_i - \log \sum_{c'=1}^C \exp(w_{c'}^\top x_i) \right]$
$= -\sum_{i=1}^N \left[ \sum_{c=1}^C 1(y_i = c) w_c^\top x_i - \log \sum_{c'=1}^C \exp(w_{c'}^\top x_i) \right]$
Nevin L. Zhang (HKUST) Machine Learning 37 / 50
Softmax Regression Partial Derivative of NLL
We would like to find the MLE of $W$. To do so, we need to minimize the NLL. There is no closed-form solution, so we resort to gradient descent.
The partial derivative of the NLL w.r.t. each $w_{c,j}$ is:
$\frac{\partial \mathrm{NLL}(W)}{\partial w_{c,j}} = -\sum_{i=1}^N \frac{\partial}{\partial w_{c,j}} \left[ \sum_{c''=1}^C 1(y_i = c'') w_{c''}^\top x_i - \log \sum_{c'=1}^C \exp(w_{c'}^\top x_i) \right]$
$= -\sum_{i=1}^N \left[ 1(y_i = c) x_{i,j} - \frac{\exp(w_c^\top x_i)}{\sum_{c'=1}^C \exp(w_{c'}^\top x_i)} x_{i,j} \right]$
$= -\sum_{i=1}^N \left[ 1(y_i = c) - p(y_i = c \mid x_i, W) \right] x_{i,j}$
Compare this to the gradient of logistic regression:
$\frac{\partial \mathrm{NLL}(w)}{\partial w_j} = -\sum_{i=1}^N [y_i - \sigma(w^\top x_i)] x_{i,j}$
Nevin L. Zhang (HKUST) Machine Learning 38 / 50
Softmax Regression Gradient of NLL
The gradient of the NLL w.r.t. each $w_c$ is:
$\nabla_{w_c} \mathrm{NLL}(W) = \left( \frac{\partial \mathrm{NLL}(W)}{\partial w_{c,0}}, \frac{\partial \mathrm{NLL}(W)}{\partial w_{c,1}}, \ldots, \frac{\partial \mathrm{NLL}(W)}{\partial w_{c,D}} \right) = -\sum_{i=1}^N \left[ 1(y_i = c) - p(y_i = c \mid x_i, W) \right] x_i$
Update rule in gradient descent:
$w_c \leftarrow w_c + \alpha \sum_{i=1}^N \left[ 1(y_i = c) - p(y_i = c \mid x_i, W) \right] x_i$
Nevin L. Zhang (HKUST) Machine Learning 39 / 50
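A sketch of this update in NumPy, assuming y is an integer array of class indices 0, ..., C-1; shifting the scores by their row maximum is a numerical-stability detail of mine, not part of the slides.

```python
import numpy as np

def fit_softmax(X, y, C, alpha=0.1, n_iters=500):
    """Gradient descent on the softmax-regression NLL; y holds class indices 0..C-1."""
    N, D = X.shape
    W = np.zeros((C, D))
    Y = np.eye(C)[y]                                  # one-hot targets, Y[i, c] = 1(y_i = c)
    for _ in range(n_iters):
        scores = X @ W.T                              # (N, C) matrix of w_c . x_i
        scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)             # p(y = c | x_i, W)
        W = W + alpha * (Y - P).T @ X                 # w_c <- w_c + alpha sum_i [1(y_i=c) - p_ic] x_i
    return W
```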
Optimization Approach to Classification Outline 1 Logistic Regression 2 Gradient Descent 3 Newton's Method 4 Softmax Regression 5 Optimization Approach to Classification Nevin L. Zhang (HKUST) Machine Learning 40 / 50
Optimization Approach to Classification The Probabilistic Approach to Classification
So far, we have been talking about the probabilistic approach to classification:
Start with a training set $D = \{x_i, y_i\}_{i=1}^N$, where $y_i \in \{0, 1\}$.
Learn a probabilistic model $p(y \mid x, w) = \mathrm{Ber}(y \mid \sigma(w^\top x))$ by minimizing the negative log-likelihood $\mathrm{NLL}(w) = -\log P(D \mid w)$
Classify new instances using: $\hat{y} = 1$ iff $p(y = 1 \mid x, w) > 0.5$
Nevin L. Zhang (HKUST) Machine Learning 41 / 50
Optimization Approach to Classification The Optimization Approach
Next, we will briefly talk about the optimization approach to classification:
Start with a training set $D = \{x_i, y_i\}_{i=1}^N$, where $y_i \in \{-1, 1\}$.
Learn a linear classifier $\hat{y} = \mathrm{sign}(w^\top x + b)$, where $x = (x_1, \ldots, x_D)$ and $w = (w_1, \ldots, w_D)$. We drop the convention $x_0 = 1$ and use $b$ in place of $w_0$.
Nevin L. Zhang (HKUST) Machine Learning 42 / 50
Optimization Approach to Classification The Sign Function
$\mathrm{sign}(x) = 1$ if $x > 0$; $\mathrm{sign}(x) = 0$ if $x = 0$; $\mathrm{sign}(x) = -1$ if $x < 0$.
Nevin L. Zhang (HKUST) Machine Learning 43 / 50
Optimization Approach to Classification The Optimization Approach
We determine $w$ and $b$ by minimizing the training/empirical error
$L(w, b) = \sum_{i=1}^N L_{0/1}(y_i, \hat{y}_i)$
where $L_{0/1}(y_i, \hat{y}_i) = 1(y_i \neq \hat{y}_i)$ is called the zero/one loss function.
Let $z = w^\top x + b$. It is easy to see that
$L_{0/1}(y, \hat{y}) = 1(y(w^\top x + b) \leq 0) = 1(yz \leq 0)$
Overloading the notation, we also write it as $L_{0/1}(y, z)$.
So, the loss function of linear classifiers is often written as
$L(w, b) = \sum_{i=1}^N 1(y_i(w^\top x_i + b) \leq 0)$
Nevin L. Zhang (HKUST) Machine Learning 44 / 50
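A minimal sketch of this training error, assuming labels in {-1, +1}.

```python
import numpy as np

def zero_one_training_error(w, b, X, y):
    """L(w, b) = sum_i 1( y_i (w.x_i + b) <= 0 ), with labels y_i in {-1, +1}."""
    z = X @ w + b
    return int(np.sum(y * z <= 0))
```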
Optimization Approach to Classification A Problem with the Zero/One Loss Function
The zero/one loss function, as a function of $w$ and $b$, is not convex and hence difficult to optimize.
The x-axis of the plot is $yz$.
Nevin L. Zhang (HKUST) Machine Learning 45 / 50
Optimization Approach to Classification Convex Function
A real-valued function $f$ is convex if the line segment between any two points on the graph of the function lies above or on the graph: for any two points $x_1 < x_2$ and any $t \in [0, 1]$,
$f(t x_1 + (1 - t) x_2) \leq t f(x_1) + (1 - t) f(x_2)$
$f$ is strictly convex if the equality holds only when $x_1 = x_2$.
Nevin L. Zhang (HKUST) Machine Learning 46 / 50
Optimization Approach to Classification Surrogate Loss Functions Convex functions are easy to minimize. Intuitively, if you drop a ball anywhere in a convex function, it will eventually get to the minimum. This is not true for non-convex functions. Since the zero/one loss is hard to optimize, we want to approximate it using a convex function and optimize that function. This approximating function will be called a surrogate loss. The surrogate losses need to be upper bounds on the true loss function: this guarantees that if you minimize the surrogate loss, you are also pushing down the real loss. Nevin L. Zhang (HKUST) Machine Learning 47 / 50
Optimization Approach to Classification Logistic Loss
One common surrogate loss function is the logistic loss:
$L_{\log}(y, z) = \frac{1}{\log 2} \log(1 + \exp(-yz)) = \begin{cases} \frac{1}{\log 2} \log(1 + \exp(-(w^\top x + b))) & \text{if } y = 1 \\ \frac{1}{\log 2} \log(1 + \exp(w^\top x + b)) & \text{if } y = -1 \end{cases}$
On the other hand, each term in the NLL of logistic regression has the following form:
$-\log P(y \mid x, w) = -\{y \log \sigma(w^\top x) + (1 - y) \log(1 - \sigma(w^\top x))\} = \begin{cases} -\log \frac{1}{1 + \exp(-w^\top x)} & \text{if } y = 1 \\ -\log\left(1 - \frac{1}{1 + \exp(-w^\top x)}\right) & \text{if } y = 0 \end{cases} = \begin{cases} \log(1 + \exp(-w^\top x)) & \text{if } y = 1 \\ \log(1 + \exp(w^\top x)) & \text{if } y = 0 \end{cases}$
So, logistic regression is the same as a linear classifier with the logistic surrogate loss function.
Nevin L. Zhang (HKUST) Machine Learning 48 / 50
Optimization Approach to Classification Other Surrogate Loss Functions
Zero/one loss: $L_{0/1}(y, z) = 1(yz \leq 0)$
Logistic loss: $L_{\log}(y, z) = \frac{1}{\log 2} \log(1 + \exp(-yz))$
Hinge loss: $L_{\mathrm{hin}}(y, z) = \max\{0, 1 - yz\}$
Exponential loss: $L_{\exp}(y, z) = \exp(-yz)$
Squared loss: $L_{\mathrm{sqr}}(y, z) = (y - z)^2$
Nevin L. Zhang (HKUST) Machine Learning 49 / 50
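Minimal NumPy versions of these losses, each as a function of $y \in \{-1, +1\}$ and $z = w^\top x + b$; the comparison point at the end is arbitrary.

```python
import numpy as np

def zero_one_loss(y, z):
    return np.where(y * z <= 0, 1.0, 0.0)

def logistic_loss(y, z):
    return np.log1p(np.exp(-y * z)) / np.log(2)

def hinge_loss(y, z):
    return np.maximum(0.0, 1.0 - y * z)

def exp_loss(y, z):
    return np.exp(-y * z)

def squared_loss(y, z):
    return (y - z) ** 2

# Each convex surrogate upper-bounds the zero/one loss; compare them at y = +1, z = -0.5:
y, z = 1.0, -0.5
for f in (zero_one_loss, logistic_loss, hinge_loss, exp_loss, squared_loss):
    print(f.__name__, float(f(y, z)))
```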
Optimization Approach to Classification Loss and Error
When a surrogate loss function is used:
There are two ways to measure how a classifier performs on the training set: training loss and training error.
There are also two ways to measure how a classifier performs on the test set: test loss and test error.
As in the case of regression, test loss/error can be improved using regularization.
Nevin L. Zhang (HKUST) Machine Learning 50 / 50