Logistic Regression

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin
Binary Classification
Binary Classification

Consider a binary classification problem, i.e., $\mathcal{C} = \{C_1, C_2\}$. Given a particular vector $x$, we wish to assign it to one of the two classes. To this end, we use Bayes' rule and calculate the relevant posterior probability:

$$
P(C_1 \mid x) = \frac{P(x \mid C_1)\,P(C_1)}{P(x)}
= \frac{P(x \mid C_1)\,P(C_1)}{P(x \mid C_1)\,P(C_1) + P(x \mid C_2)\,P(C_2)}
= \frac{1}{1 + \dfrac{P(x \mid C_2)\,P(C_2)}{P(x \mid C_1)\,P(C_1)}}
= \frac{1}{1 + \exp\left\{-\log\left[\dfrac{P(x \mid C_1)}{P(x \mid C_2)}\right] - \log\left[\dfrac{P(C_1)}{P(C_2)}\right]\right\}}.
$$
It can be written in the form of the logistic function:

$$
P(C_1 \mid x) = \frac{1}{1 + e^{-\xi}},
$$

where

$$
\xi = \underbrace{\log\left[\frac{P(x \mid C_1)}{P(x \mid C_2)}\right]}_{\text{likelihood ratio}} + \underbrace{\log\left[\frac{P(C_1)}{P(C_2)}\right]}_{\text{prior ratio}}.
$$

Choose a particular form for the class-conditional density function $P(x \mid C_i)$.
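As a sanity check, here is a minimal NumPy sketch (the likelihood and prior values are made up for illustration, not taken from the slides) verifying that the direct Bayes-rule posterior and $\sigma(\xi)$ agree:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

# Hypothetical class-conditional densities and priors evaluated at a fixed x.
p_x_c1, p_x_c2 = 0.30, 0.05      # P(x|C1), P(x|C2)
p_c1, p_c2 = 0.4, 0.6            # P(C1), P(C2)

# Direct Bayes rule.
posterior_bayes = p_x_c1 * p_c1 / (p_x_c1 * p_c1 + p_x_c2 * p_c2)

# Logistic form: xi = log likelihood ratio + log prior ratio.
xi = np.log(p_x_c1 / p_x_c2) + np.log(p_c1 / p_c2)
posterior_logistic = sigmoid(xi)

print(posterior_bayes, posterior_logistic)   # both equal 0.8
```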
Multivariate Gaussian Model

Assume that the class-conditional densities are multivariate Gaussian with identical covariance matrix $\Sigma$:

$$
P(x \mid C_i) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_i)^\top \Sigma^{-1} (x - \mu_i)\right\}.
$$

After a simple calculation, we have

$$
P(C_1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}},
$$

where

$$
w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad
b = \frac{1}{2}(\mu_2 + \mu_1)^\top \Sigma^{-1} (\mu_2 - \mu_1) + \log\left[\frac{P(C_1)}{P(C_2)}\right].
$$
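A small NumPy sketch of this result (the 2-D means, covariance, test point, and priors are made up for illustration): it builds $w$ and $b$ from the formulas above and checks the linear form against the direct Bayes-rule posterior.

```python
import numpy as np

def gauss(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = x - mu
    D = len(mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

# Hypothetical 2-D class means with a shared covariance and equal priors.
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
p_c1, p_c2 = 0.5, 0.5

# Linear discriminant parameters from the slide.
w = np.linalg.solve(Sigma, mu1 - mu2)
b = 0.5 * (mu2 + mu1) @ np.linalg.solve(Sigma, mu2 - mu1) + np.log(p_c1 / p_c2)

x = np.array([0.2, -0.4])
post_linear = sigmoid(w @ x + b)
post_bayes = gauss(x, mu1, Sigma) * p_c1 / (
    gauss(x, mu1, Sigma) * p_c1 + gauss(x, mu2, Sigma) * p_c2)
print(post_linear, post_bayes)   # agree up to floating-point error
```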
Properties of Logistic Function

- $\sigma(\xi) \to 0$ as $\xi \to -\infty$.
- $\sigma(\xi) \to 1$ as $\xi \to \infty$.
- $\sigma(-\xi) = 1 - \sigma(\xi)$.
- $\frac{d}{d\xi}[\sigma(\xi)] = \sigma(\xi)\,\sigma(-\xi)$.

[Figure: the logistic function $\sigma(\xi)$ plotted for $\xi \in [-5, 5]$.]
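These identities are easy to confirm numerically; the following NumPy snippet (an illustrative check, not part of the slides) verifies the symmetry and derivative properties on a grid of $\xi$ values.

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

xi = np.linspace(-5.0, 5.0, 11)

# Symmetry: sigma(-xi) = 1 - sigma(xi).
assert np.allclose(sigmoid(-xi), 1.0 - sigmoid(xi))

# Derivative: d sigma / d xi = sigma(xi) * sigma(-xi), checked by central differences.
h = 1e-6
numeric = (sigmoid(xi + h) - sigmoid(xi - h)) / (2 * h)
analytic = sigmoid(xi) * sigmoid(-xi)
assert np.allclose(numeric, analytic, atol=1e-8)
```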
Maximum Likelihood Formulation
Logistic Regression

Predict a binary output $y_n \in \{0, 1\}$ from an input $x_n$. Logistic regression models the input-output relation by a conditional Bernoulli distribution with

$$
\mathbb{E}[y_n \mid x_n] = P(y_n = 1 \mid x_n) = \sigma(w^\top x_n),
\qquad
\sigma(\xi) = \frac{1}{1 + e^{-\xi}} = \frac{e^{\xi}}{1 + e^{\xi}}.
$$

Discriminative model: directly model $p(y \mid x)$.
Logistic Regression: MLE

Given $\{(x_n, y_n) \mid n = 1, \ldots, N\}$, the likelihood is given by

$$
p(y \mid X, w) = \prod_{n=1}^{N} p(y_n = 1 \mid x_n)^{y_n} \left(1 - p(y_n = 1 \mid x_n)\right)^{1 - y_n}
= \prod_{n=1}^{N} \sigma(w^\top x_n)^{y_n} \left(1 - \sigma(w^\top x_n)\right)^{1 - y_n}.
$$

Then the log-likelihood function is given by

$$
\mathcal{L} = \sum_{n=1}^{N} \log p(y_n \mid x_n) = \sum_{n=1}^{N} \left\{ y_n \log \sigma_n + (1 - y_n) \log(1 - \sigma_n) \right\},
$$

where $\sigma_n = \sigma(w^\top x_n)$.

This is a nonlinear function of $w$ whose maximum cannot be computed in closed form. Iteratively reweighted least squares (IRLS), derived from Newton's method, is a popular algorithm.
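As a minimal illustration (assuming NumPy and a made-up synthetic data set; the function name log_likelihood is just a placeholder), the log-likelihood can be evaluated directly from the expression above:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def log_likelihood(w, X, y):
    """L(w) = sum_n y_n log sigma_n + (1 - y_n) log(1 - sigma_n).
    X: (N, D) design matrix, y: (N,) binary labels, w: (D,) weights."""
    sig = sigmoid(X @ w)
    return np.sum(y * np.log(sig) + (1 - y) * np.log(1 - sig))

# Hypothetical toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(float)

print(log_likelihood(np.zeros(3), X, y))   # equals 100 * log(0.5) at w = 0
```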
Newton's Method
Mathematical Preliminaries

- Gradient
- Hessian matrix
- Gradient descent/ascent
- Newton's method
Gradient

Consider a real-valued function $f(x)$ that takes a real-valued vector $x \in \mathbb{R}^D$ as input, $f(x): \mathbb{R}^D \to \mathbb{R}$. The gradient of $f(x)$ is defined by

$$
\nabla f(x) = \frac{\partial f(x)}{\partial x}
= \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_D}\right]^\top.
$$
Hessian Matrix

If $f(x)$ belongs to the class $C^2$, the Hessian matrix $H$ is defined as the symmetric matrix with $(i, j)$-element $\frac{\partial^2 f(x)}{\partial x_i \partial x_j}$:

$$
H = \nabla^2 f(x) = \left[\frac{\partial^2 f(x)}{\partial x_i \partial x_j}\right]
= \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_D} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_D \partial x_1} & \frac{\partial^2 f}{\partial x_D \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_D^2}
\end{bmatrix}
= \frac{\partial^2 f(x)}{\partial x \, \partial x^\top}
= \frac{\partial}{\partial x}\left[\nabla f(x)\right]^\top.
$$
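A quick way to check a hand-derived gradient (or Hessian) against these definitions is a central-difference approximation; the sketch below is illustrative only, using a simple quadratic whose gradient and Hessian are known in closed form.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Example: f(x) = 1/2 x'Ax - b'x, so grad f = Ax - b and the Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x

x0 = np.array([0.5, -0.3])
print(numerical_gradient(f, x0))   # ~ A @ x0 - b
print(A @ x0 - b)
```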
Gradient Descent/Ascent

Gradient descent/ascent learning is a simple (first-order) iterative method for minimization/maximization.

- Gradient descent (iterative minimization):
$$
w^{(k+1)} = w^{(k)} - \eta \left(\frac{\partial \mathcal{J}}{\partial w}\right).
$$
- Gradient ascent (iterative maximization):
$$
w^{(k+1)} = w^{(k)} + \eta \left(\frac{\partial \mathcal{J}}{\partial w}\right).
$$
- Learning rate: $\eta > 0$.
Newton's Method

The basic idea of Newton's method is to optimize the quadratic approximation of the objective function $\mathcal{J}(w)$ around the current point $w^{(k)}$. The second-order Taylor series expansion of $\mathcal{J}(w)$ at the current point $w^{(k)}$ gives

$$
\mathcal{J}_2(w) = \mathcal{J}(w^{(k)})
+ \left[\nabla \mathcal{J}(w^{(k)})\right]^\top \left(w - w^{(k)}\right)
+ \frac{1}{2} \left(w - w^{(k)}\right)^\top \nabla^2 \mathcal{J}(w^{(k)}) \left(w - w^{(k)}\right),
$$

where $\nabla^2 \mathcal{J}(w^{(k)})$ is the Hessian of $\mathcal{J}(w)$ evaluated at $w = w^{(k)}$. Differentiate this w.r.t. $w$ and set it equal to 0, which leads to

$$
\nabla \mathcal{J}(w^{(k)}) + \nabla^2 \mathcal{J}(w^{(k)})\, w - \nabla^2 \mathcal{J}(w^{(k)})\, w^{(k)} = 0.
$$
Thus we have

$$
w = w^{(k)} - \left[\nabla^2 \mathcal{J}(w^{(k)})\right]^{-1} \nabla \mathcal{J}(w^{(k)}).
$$

Hence Newton's method is of the form

$$
w^{(k+1)} = w^{(k)} - \left[\nabla^2 \mathcal{J}(w^{(k)})\right]^{-1} \nabla \mathcal{J}(w^{(k)}).
$$

Remark: the Hessian $\nabla^2 \mathcal{J}(w^{(k)})$ should be positive definite for all $k$.
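A generic NumPy sketch of this update (the grad and hess callables are hypothetical user-supplied functions): on a quadratic objective it reaches the optimum in a single step, as expected.

```python
import numpy as np

def newton(grad, hess, w0, n_iter=20, tol=1e-10):
    """Newton's method: w <- w - [hess(w)]^{-1} grad(w)."""
    w = w0.astype(float)
    for _ in range(n_iter):
        step = np.linalg.solve(hess(w), grad(w))
        w = w - step
        if np.linalg.norm(step) < tol:
            break
    return w

# Example: minimize J(w) = 1/2 w'Aw - b'w; Newton reaches the optimum A^{-1}b in one step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w_star = newton(lambda w: A @ w - b, lambda w: A, np.zeros(2))
print(w_star, np.linalg.solve(A, b))
```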
Illustration of Newton's Method

[Figure: the quadratic approximation $f_{\text{quad}}(x)$ of $f(x)$ at $x_k$ and the Newton step to $x_k + d_k$, for (a) a convex and (b) a non-convex function. Figure source: Murphy's book.]
Logistic Regression: Algorithms
Logistic Regression: Gradient Ascent

Gradient ascent learning has the form

$$
w^{\text{new}} = w^{\text{old}} + \eta \left(\frac{\partial \mathcal{L}}{\partial w}\right).
$$

Calculate the gradient:

$$
\frac{\partial \mathcal{L}}{\partial w} = \sum_{n=1}^{N} \left\{ y_n - \sigma(w^\top x_n) \right\} x_n.
$$

Hence, the gradient ascent update rule for $w$ is

$$
w^{\text{new}} = w^{\text{old}} + \eta \sum_{n=1}^{N} \left\{ y_n - \sigma\!\left((w^{\text{old}})^\top x_n\right) \right\} x_n.
$$
Detailed Calculation of Gradient

Recall the log-likelihood

$$
\mathcal{L} = \sum_{n=1}^{N} \left[ y_n \log \sigma_n + (1 - y_n) \log(1 - \sigma_n) \right].
$$

Calculate, using $\frac{\partial \sigma_n}{\partial w} = \sigma_n(1 - \sigma_n)\, x_n$,

$$
\frac{\partial \mathcal{L}}{\partial w}
= \sum_{n=1}^{N} \left[ \frac{y_n}{\sigma_n} \frac{\partial \sigma_n}{\partial w} - \frac{1 - y_n}{1 - \sigma_n} \frac{\partial \sigma_n}{\partial w} \right]
= \sum_{n=1}^{N} \left[ y_n \frac{\sigma_n(1 - \sigma_n)}{\sigma_n} - (1 - y_n) \frac{\sigma_n(1 - \sigma_n)}{1 - \sigma_n} \right] x_n
= \sum_{n=1}^{N} \left[ y_n(1 - \sigma_n) - (1 - y_n)\sigma_n \right] x_n
= \sum_{n=1}^{N} (y_n - \sigma_n)\, x_n.
$$
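Combining this gradient with the ascent rule from the previous slide gives a very short NumPy implementation; the data below are synthetic, and the step size, scaling by $N$, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def fit_logistic_gradient_ascent(X, y, eta=0.1, n_iter=5000):
    """Maximize the log-likelihood by gradient ascent;
    grad L = sum_n (y_n - sigma_n) x_n = X' (y - sigma)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ w))
        w = w + eta * grad / len(y)      # step scaled by N for stability
    return w

# Hypothetical toy data generated from a logistic model with weights [2, -1].
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (sigmoid(X @ np.array([2.0, -1.0])) > rng.uniform(size=200)).astype(float)

print(fit_logistic_gradient_ascent(X, y))   # roughly recovers [2, -1]
```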
Detailed Calculation of Hessian

Calculate the Hessian:

$$
\nabla^2 \mathcal{L}
= \frac{\partial}{\partial w} \left[\nabla \mathcal{L}\right]^\top
= \frac{\partial}{\partial w} \left[\sum_{n=1}^{N} (y_n - \sigma_n)\, x_n\right]^\top
= -\sum_{n=1}^{N} \sigma_n(1 - \sigma_n)\, x_n x_n^\top.
$$
We set the objective function $\mathcal{J}(w)$ as the negative log-likelihood:

$$
\mathcal{J}(w) = -\mathcal{L}(w) = -\sum_{n=1}^{N} \left[ y_n \log \sigma_n + (1 - y_n) \log(1 - \sigma_n) \right].
$$

Thus, the gradient and the Hessian are:

$$
\nabla \mathcal{J}(w) = -\sum_{n=1}^{N} (y_n - \sigma_n)\, x_n,
\qquad
\nabla^2 \mathcal{J}(w) = \sum_{n=1}^{N} \sigma_n(1 - \sigma_n)\, x_n x_n^\top.
$$
Logistic Regression: IRLS

Newton's update $w \leftarrow w - [\nabla^2 \mathcal{J}(w)]^{-1} \nabla \mathcal{J}(w)$ takes a step

$$
\Delta w =
\underbrace{\left[\sum_{n=1}^{N} \sigma_n(1 - \sigma_n)\, x_n x_n^\top\right]^{-1}}_{\text{inverse of Hessian, } [\nabla^2 \mathcal{J}(w)]^{-1}}
\underbrace{\left[\sum_{n=1}^{N} (y_n - \sigma_n)\, x_n\right]}_{\text{gradient of } \mathcal{L},\ -\nabla \mathcal{J}(w)}.
$$

Newton's update reduces to iteratively reweighted least squares (IRLS):

$$
\Delta w = \left(X S X^\top\right)^{-1} X S b,
$$

where

$$
S = \begin{bmatrix}
\sigma_1(1 - \sigma_1) & & 0 \\
& \ddots & \\
0 & & \sigma_N(1 - \sigma_N)
\end{bmatrix},
\qquad
b = \begin{bmatrix}
\dfrac{y_1 - \sigma_1}{\sigma_1(1 - \sigma_1)} \\
\vdots \\
\dfrac{y_N - \sigma_N}{\sigma_N(1 - \sigma_N)}
\end{bmatrix}.
$$
Alternatively, we write

$$
w_{k+1} = w_k + \left(X S_k X^\top\right)^{-1} X S_k b_k,
$$

where $S_k$ and $b_k$ are computed using $w_k$. That is,

$$
w_{k+1} = w_k + \left(X S_k X^\top\right)^{-1} X S_k b_k
= \left(X S_k X^\top\right)^{-1} \left[\left(X S_k X^\top\right) w_k + X S_k b_k\right]
= \left(X S_k X^\top\right)^{-1} X S_k \left[X^\top w_k + b_k\right].
$$
Algorithm Outline

Algorithm 1: IRLS
Input: $\{(x_n, y_n) \mid n = 1, \ldots, N\}$
1: Initialize $w = 0$ and $w_0 = \log\left(\bar{y} / (1 - \bar{y})\right)$, where $\bar{y}$ is the mean of the $y_n$
2: repeat
3:   for $n = 1, \ldots, N$ do
4:     Compute $\eta_n = w^\top x_n + w_0$
5:     Compute $\sigma_n = \sigma(\eta_n)$
6:     Compute $s_n = \sigma_n(1 - \sigma_n)$
7:     Compute $z_n = \eta_n + \dfrac{y_n - \sigma_n}{s_n}$
8:   end for
9:   Construct $S = \mathrm{diag}(s_{1:N})$
10:  Update $w = \left(X S X^\top\right)^{-1} X S z$
11: until convergence
12: return $w$
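A possible NumPy rendering of Algorithm 1, simplified for illustration: there is no separate bias term $w_0$, the rows of X are the $x_n$ (so the normal equations read $X^\top S X$ rather than $X S X^\top$), and a small clip keeps $s_n$ away from zero.

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def irls(X, y, n_iter=25, tol=1e-8):
    """IRLS for logistic regression; X is (N, D) with rows x_n, y is (N,) binary."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        eta = X @ w                        # eta_n = w' x_n
        sig = sigmoid(eta)                 # sigma_n
        s = np.clip(sig * (1.0 - sig), 1e-10, None)   # s_n, guarded from zero
        z = eta + (y - sig) / s            # working response z_n
        S = np.diag(s)
        w_new = np.linalg.solve(X.T @ S @ X, X.T @ S @ z)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Hypothetical toy data generated from a logistic model with weights [1.5, -0.5, 2.0].
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = (sigmoid(X @ np.array([1.5, -0.5, 2.0])) > rng.uniform(size=300)).astype(float)

print(irls(X, y))   # close to [1.5, -0.5, 2.0]
```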
Generalized Linear Models
Generalized Linear Models

The GLM is a flexible generalization of ordinary linear regression that allows for response variables with error distributions other than the normal distribution. The GLM consists of three elements:

1. A probability distribution from the exponential family.
2. A linear predictor $\eta = X^\top \theta$.
3. A link function $g(\cdot)$ such that $\mathbb{E}[y \mid X] = \mu = g^{-1}(\eta)$.
Logistic Regression: Generalized Linear Model

Bernoulli distribution for a univariate binary random variable $y \in \{0, 1\}$ with mean $\mu$:

$$
p(y \mid \mu) = \mu^{y} (1 - \mu)^{1 - y}.
$$

The logit transform (log-odds), $\theta = w^\top x = \log\left(\frac{\mu}{1 - \mu}\right)$, maps $(0, 1) \to \mathbb{R}$. The logistic function, which is the inverse logit, gives $\mu = \frac{1}{1 + e^{-\theta}} = \sigma(\theta)$. Thus, we have

$$
p(y \mid \theta) = \sigma(\theta)^{y} \left(1 - \sigma(\theta)\right)^{1 - y} = \sigma(\theta)^{y}\, \sigma(-\theta)^{1 - y}.
$$
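A tiny numerical check of this link/inverse-link pair (illustrative NumPy code, not from the slides): the logit maps mean parameters in $(0, 1)$ to the real line, and the logistic function maps them back.

```python
import numpy as np

def logit(mu):
    """Link function g(mu) = log(mu / (1 - mu)), mapping (0, 1) to R."""
    return np.log(mu / (1.0 - mu))

def sigmoid(theta):
    """Inverse link g^{-1}(theta) = 1 / (1 + e^{-theta})."""
    return 1.0 / (1.0 + np.exp(-theta))

mu = np.array([0.1, 0.25, 0.5, 0.9])
assert np.allclose(sigmoid(logit(mu)), mu)   # g^{-1}(g(mu)) = mu
print(logit(mu))   # log-odds: negative below 0.5, zero at 0.5, positive above
```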
Multiclass Extension
Multiclass Logistic Regression (Multinomial Logistic Regression)

Model the input-output relation by a softmax transformation of the logits $\theta_k = w_k^\top x_n$:

$$
p(y_n = k \mid x_n) = \mathrm{softmax}(\theta_k) = \frac{\exp\{w_k^\top x_n\}}{\sum_{j=1}^{K} \exp\{w_j^\top x_n\}}.
$$

Given $Y \in \mathbb{R}^{K \times N}$ (each column $y_n \in \mathbb{R}^K$ follows the 1-of-$K$ coding) and $X \in \mathbb{R}^{D \times N}$, the likelihood is given by

$$
p(Y \mid X, w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(y_n = k \mid x_n)^{Y_{k,n}}.
$$

The log-likelihood is given by

$$
\mathcal{L} = \sum_{n=1}^{N} \sum_{k=1}^{K} Y_{k,n} \log\left[p(y_n = k \mid x_n)\right].
$$

One can apply Newton's update to derive IRLS, as in logistic regression.
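A minimal NumPy sketch of the softmax model and its log-likelihood, using the transposed $N \times K$ / $D \times K$ layout that is more common in code; the toy data are made up and the max-subtraction in softmax is a standard numerical-stability choice.

```python
import numpy as np

def softmax(theta):
    """Row-wise softmax with max-subtraction for numerical stability."""
    theta = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(theta)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_log_likelihood(W, X, Y):
    """L = sum_n sum_k Y_nk log p(y_n = k | x_n), with W: (D, K), X: (N, D),
    Y: (N, K) one-hot labels (the slides use the transposed K x N / D x N layout)."""
    P = softmax(X @ W)                     # (N, K) class probabilities
    return np.sum(Y * np.log(P))

# Hypothetical toy check with K = 3 classes.
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 3))
Y = np.eye(3)[rng.integers(0, 3, size=5)]  # one-hot labels
print(multiclass_log_likelihood(W, X, Y))
```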