Logistic Regression. Seungjin Choi

Size: px

Start display at page:

Download "Logistic Regression. Seungjin Choi"

Morgan Grant McDonald
6 years ago
Views:

1 Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin 1 / 30

2 Binary Classification 2 / 30

3 Binary Classification Consider a binary classification problem, i.e., C = {C 1, C 2 }. Given a particular vector x, we wish to assign it to one of the two classes. To this end, we use Bayes rule and calculate the relevant posterior probability: P(C 1 x) = P(x C 1)P(C 1 ) P(x) P(x C 1 )P(C 1 ) = P(x C 1 )P(C 1 ) + P(x C 2 )P(C 2 ) 1 = 1 + P(x C2) P(C 2) = P(x C 1) P(C 1) { 1 + exp log 1 [ P(x C1) P(x C 2) ] log [ ]} P(C1) P(C 2) 3 / 30

4 It can be written in the form of the logistic function: where P(C 1 x) = e ξ, [ ] [ ] P(x C1 ) P(C1 ) ξ = log + log. P(x C 2 ) P(C 2 ) }{{}}{{} likelihood ratio prior ratio Choose a particular form for the class-conditional density function P(x C i ). 4 / 30

5 Multivariate Gaussian Model Assume that the class-conditional densities are multivariate Gaussian with identical covariance matrix Σ: { 1 P(x C i ) = exp 1 } (2π) D 2 Σ (x µ i) Σ 1 (x µ i ). After a simple calculation, we have where P(C 1 x) = e (w x+b), w = Σ 1 (µ 1 µ 2 ), b = 1 2 (µ 2 + µ 1 ) Σ 1 (µ 2 µ 1 ) + log [ ] P(C1 ). P(C 2 ) 5 / 30

6 Properties of Logistic Function σ(ξ) 0 as ξ. σ(ξ) 1 as ξ. σ( ξ) = 1 σ(ξ). d dξ [σ(ξ)] = σ(ξ)σ( ξ) / 30

7 Maximum Likelihood Formulation 7 / 30

8 Logistic Regression Predict a binary output y n {0, 1} from an input x n. The logistic regression models the input-output by a conditional Bernoulli distribution where E [y n x n ] = P(y n = 1 x n ) = σ(w x n ), σ(ξ) = 1 eξ = 1 + e ξ 1 + e ξ. Discriminative model: Directly model p(y x). 8 / 30

9 Logistic Regression: MLE Given {(x n, y n ) n = 1,..., N}, the likelihood is given by p(y X, w) = = N p(y n = 1 x n ) yn (1 p(y n = 1 x n )) 1 yn N σ(w x n ) ( yn 1 σ(w x n ) ) 1 y n. Then log-likelihood function is given by L = N log p(y n x n ) = where σ n = σ(w x n ). N { } y n log σ n + (1 y n ) log(1 σ n ), This is a nonlinear function of w whose maximum cannot be computed in a closed form. Iterative re-weighted least squares (IRLS) is a popular algorithm, derived from Newton s method. 9 / 30

10 Newton s Method 10 / 30

11 Mathematical Preliminaries Gradient Hessian matrix Gradient descent/ascent Newton s method 11 / 30

12 Gradient Consider a real-valued function f (x), that takes a real-valued vector x R D as an input, f (x) : R D R. The gradient of f (x) is defined by f (x) = = f (x) x [ f,, x 1 ] f. x D 12 / 30

13 Hessian Matrix If f (x) belongs to the class C 2, the Hessian matrix H is defined as the symmetric matrix with the (i, j)-element 2 f (x) x i x j, H = 2 f (x) [ 2 ] f (x) = x i x j = 2 f x f x D x 1 [ f (x) 2 f x 1 x 2 2 f x 1 x D. 2 f x D x 2 2 f xd 2 ]. = x x = [ f (x)] x 13 / 30

14 Gradient Descent/Ascent The gradient descent/ascent learning is a simple (first-order) iterative method for minimization/maximization. Gradient descent: iterative minimization ( ) J w (k+1) = w (k) η. w Gradient ascent: iterative maximization ( ) J w (k+1) = w (k) + η. w Learning rate: η > 0 14 / 30

15 Newton s Method The basic idea of Newton s method is to optimize the quadratic approximation of the objective function J (w) around the current point w (k). The second-order Taylor series expansion of J (w) at the current point w (k) gives [ ] J 2(w) = J (w (k) ) + J (w (k) ) (w w (k)) + 1 ( w w (k)) ( 2 J (w (k) ) w w (k)), 2 where 2 J (w (k) ) is the Hessian of J (w) evaluated at w = w (k). Differentiate this w.r.t. w and set it equal 0, which leads to J (w (k) ) + 2 J (w (k) )w 2 J (w (k) )w (k) = 0, 15 / 30

16 Thus we have [ 1 w = w (k) 2 J (w )] (k) J (w (k) ). Hence the Newton s method is of the form [ 1 w (k+1) = w (k) 2 J (w )] (k) J (w (k) ). Remark: The Hessian 2 J (w (k) ) should be positive definite for all t. 16 / 30

17 Illustration of Newton s Method f(x) f quad (x) f(x) f quad (x) x k x k +d k x k x k +d k (a) convex (b) non-convex [Figure source: Murphy s book] 17 / 30

18 Logistic Regression: Algorithms 18 / 30

19 Logistic Regression: Gradient Ascent The gradient ascent learning has the form w new = w old ( ) L + η. w Calculate the gradient L w = t { yn σ(w x n ) } x n. Hence, the gradient ascent update rule for w is w new = w old + η { ( ( y n σ w old) )} xn x n. t 19 / 30

20 Detailed Calculation of Gradient Recall the log-likelihood L = N [ ] y n log σ n + (1 y n) log(1 σ n). Calculate N L w = [ ] σ n y n x n + (1 y n) σ n x n σ n 1 σ n = = = N [ ] σ n(1 σ n) σn(1 σn) y n (1 y n) x n σ n 1 σ n N [y n(1 σ n) (1 y n)σ n] x n N (y n σ n) x n. 20 / 30

21 Detailed Calculation of Hessian Calculate the Hessian: 2 L = = = w [ L] [ N ] (y n σ n) x n w N σ n(1 σ n)x nx n. 21 / 30

22 We set the objective function J (w) as the negative log-likelihood: J (w) = L(w) = N Thus, the gradient and the Hessian are: J (w) = 2 J (w) = [ ] y n log σ n + (1 y n ) log(1 σ n ). N (y n σ n ) x n, N σ n (1 σ n )x n x n. 22 / 30

23 Logistic Regression: IRLS Newton s update has the form [ N w = σ n (1 σ n ) x n x n ] 1 }{{} inverse of Hessian, [ 2 J (w)] 1 [ ] N (y n σ n ) x n, } {{ } gradient, J (w) Newton s update reduces to iterative re-weighted least squares (IRLS): where S = b = w = ( XSX ) 1 XSb, σ 1 (1 σ 1 ) σ N (1 σ N ) y 1 σ 1 σ 1(1 σ 1). y N σ N σ N (1 σ N )., 23 / 30

24 Alternatively we write w k+1 = w k + ( XS k X ) 1 XSk b k, where S k and b k are computed using w k. That is, w k+1 = w k + ( XS k X ) 1 XSk b k = ( XS k X ) 1 [( XSk X ) ] w k + XS k b k = ( XS k X ) 1 XSk [ X w k + b k ]. 24 / 30

25 Algorithm Outline Algorithm 1 IRLS Input: {(x n, y n ) n = 1,..., N} 1: Initialize w = 0 and w 0 = log (y/(1 y)) 2: repeat 3: for n = 1,..., N do 4: Compute η n = w x n + w 0 5: Compute σ n = σ(η n ) 6: Compute s n = σ n (1 σ n ) 7: Compute z n = η n + yn σn s n 8: end for 9: Construct S = diag(s 1:N ) 10: Update w = ( XSX ) 1 XSz 11: until convergence 12: return w 25 / 30

26 Generalized Linear Models 26 / 30

27 Generalized Linear Models The GLM is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM consists of three elements: 1. A probability distribution from the exponential family. 2. A linear predictor η = X θ 3. A link function g( ) such that E [y X] = µ = g 1 (η). 27 / 30

28 Logistic Regression: Generalized Linear Model Bernoulli distribution for a univariate binary random variable y {0, 1} with mean µ, p(y µ) = µ y (1 µ) 1 y. ( ) Logit transform (log-odds), θ = w x = log µ 1 µ, leads to [0, 1] R. The logistic function, which is the inverse-logit, gives µ = 1 1+e θ = σ(θ). Thus, we have p(y θ) = σ(θ) y (1 σ(θ)) 1 y = σ(θ) y σ( θ) 1 y 28 / 30

29 Multiclass Extension 29 / 30

30 Multiclass Logistic Regression (Multinomial Logistic Regression) Model input-output by a softmax transformation of logits θ k = w k x n: p(y n = k x n) = softmax(θ k ) = exp{w k x n} K j=1 exp{w j x n} Given Y R K N (each column y n R K follows the 1-of-K coding) and X R D N, the likelihood is given by p(y X, w 1,..., w K ) = The log-likelihood is given by L = N k=1 N k=1 K p(y n = k x n) Y k,n. K Y k,n log [p(y n = k x n)]. One can apply Newton s update to derive IRLS, as in logistic regression. 30 / 30

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,