Logistic Regression Sargur N. University at Buffalo, State University of New York USA
Topics in Linear Classification using Probabilistic Discriminative Models
- Generative vs Discriminative
1. Fixed basis functions
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions
Topics in Logistic Regression
- Logistic Sigmoid and Logit Functions
- Parameters in the discriminative approach
- Determining logistic regression parameters
- Error function
- Gradient of the error function
- Simple sequential algorithm
- An example
- Generative vs Discriminative Training
- Naïve Bayes vs Logistic Regression
Logistic Sigmoid and Logit Functions
In the two-class case, the posterior of class C_1 can be written as a logistic sigmoid of the feature vector ϕ = [ϕ_1, .., ϕ_M]^T:
p(C_1|ϕ) = y(ϕ) = σ(w^T ϕ),  with p(C_2|ϕ) = 1 - p(C_1|ϕ)
Here σ(·) is the logistic sigmoid function. This model is known as logistic regression in statistics, although it is a model for classification rather than regression.
The logit function is the log of the odds ratio; it links the probability to the predictor variables.
[Figure: plot of the logistic sigmoid σ(a) vs a]
Properties of the logistic sigmoid:
A. Symmetry: σ(-a) = 1 - σ(a)
B. Inverse: a = ln(σ / (1 - σ)), known as the logit. Also known as the log odds, since it is the ratio ln[p(C_1|ϕ)/p(C_2|ϕ)]
C. Derivative: dσ/da = σ(1 - σ)
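A minimal NumPy sketch (not part of the original slides) that numerically checks the three properties above; the helper names sigmoid and logit are my own:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    """Inverse of the sigmoid: a = ln(s / (1 - s)), the log odds."""
    return np.log(s / (1.0 - s))

a = np.linspace(-5.0, 5.0, 101)
s = sigmoid(a)
assert np.allclose(sigmoid(-a), 1.0 - s)        # A. symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(logit(s), a)                 # B. the logit inverts the sigmoid
eps = 1e-6                                      # C. dsigma/da = sigma(1 - sigma)
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric, s * (1.0 - s), atol=1e-8)
```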
Fewer Parameters in the Linear Discriminative Model
- Discriminative approach (logistic regression): for an M-dimensional feature space ϕ, M adjustable parameters.
- Generative approach based on Gaussians (Bayes/NB):
  - 2M parameters for the means
  - M(M+1)/2 parameters for the shared covariance matrix
  - 1 parameter for the class prior
  - Total of M(M+5)/2 + 1 parameters, which grows quadratically with M
- If the features are assumed independent (naïve Bayes), the parameter count still grows linearly with M.
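A quick arithmetic check of the two counts above (an illustrative sketch I added, not from the slides):

```python
def param_counts(M):
    """Number of adjustable parameters for M basis functions / features."""
    discriminative = M                              # logistic regression: one weight per feature
    generative = 2 * M + M * (M + 1) // 2 + 1       # 2M means + M(M+1)/2 covariance + 1 prior
    assert generative == M * (M + 5) // 2 + 1       # matches the closed form on the slide
    return discriminative, generative

print(param_counts(10))    # (10, 76)
print(param_counts(100))   # (100, 5251) -- linear vs quadratic growth in M
```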
Determining Logistic Regression Parameters: Maximum Likelihood Approach for Two Classes
For a data set (ϕ_n, t_n), where t_n ∈ {0, 1} and ϕ_n = ϕ(x_n), n = 1, .., N, the likelihood function can be written as
p(t|w) = ∏_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}
where t = (t_1, .., t_N)^T and y_n = p(C_1|ϕ_n). Here y_n is the probability that t_n = 1.
Error Function for Logistic Regression
The likelihood function is
p(t|w) = ∏_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}
Taking the negative logarithm gives the cross-entropy error function
E(w) = -ln p(t|w) = -∑_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
where y_n = σ(a_n) and a_n = w^T ϕ_n.
We need to minimize E(w). At its minimum the derivative of E(w) is zero, so we need to solve for w in the equation ∇E(w) = 0.
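A hedged NumPy sketch of the cross-entropy error (the helper name and array shapes are my own; Phi is the N×M design matrix of basis-function values, t the vector of 0/1 targets):

```python
import numpy as np

def cross_entropy_error(w, Phi, t):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], with y_n = sigma(w^T phi_n)."""
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
    y = np.clip(y, 1e-12, 1.0 - 1e-12)   # guard against log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```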
Gradient of the Error Function
The error function is
E(w) = -ln p(t|w) = -∑_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) },  where y_n = σ(w^T ϕ_n)
Using the derivative of the logistic sigmoid, dσ/da = σ(1 - σ), the gradient of the error function is
∇E(w) = ∑_{n=1}^{N} (y_n - t_n) ϕ_n    (error × feature vector)
The contribution to the gradient from data point n is the error between the target t_n and the prediction y_n = σ(w^T ϕ_n), times the basis vector ϕ_n.
Proof of the gradient expression (for a single data point, dropping the subscript n):
Let z = z_1 + z_2, where z_1 = t ln σ(w^T ϕ) and z_2 = (1 - t) ln[1 - σ(w^T ϕ)].
Using dσ/da = σ(1 - σ) and the chain rule d/dw ln u(w) = u'(w)/u(w):
dz_1/dw = t σ(w^T ϕ)[1 - σ(w^T ϕ)]ϕ / σ(w^T ϕ) = t [1 - σ(w^T ϕ)] ϕ
dz_2/dw = (1 - t) σ(w^T ϕ)[1 - σ(w^T ϕ)](-ϕ) / [1 - σ(w^T ϕ)] = -(1 - t) σ(w^T ϕ) ϕ
Therefore dz/dw = (t - σ(w^T ϕ)) ϕ, and since the per-point error is E_n = -z, its gradient is (σ(w^T ϕ) - t) ϕ.
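The corresponding gradient in code, again as an illustrative sketch (Phi and t as in the previous snippet):

```python
import numpy as np

def error_gradient(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n, written in matrix form as Phi^T (y - t)."""
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
    return Phi.T @ (y - t)
```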
Simple Sequential Algorithm
Given the gradient of the error function,
∇E(w) = ∑_{n=1}^{N} (y_n - t_n) ϕ_n,  where y_n = σ(w^T ϕ_n),
solve using an iterative approach:
w^(τ+1) = w^(τ) - η ∇E_n,  where ∇E_n = (y_n - t_n) ϕ_n
This takes precisely the same form (error × feature vector) as the gradient of the sum-of-squares error for linear regression. Samples are presented one at a time, and the weight vector is updated after each presentation.
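A minimal sketch of the sequential (stochastic-gradient) update on toy data; the learning rate, number of passes, and synthetic data are arbitrary choices of mine, not from the slides:

```python
import numpy as np

def sequential_update(w, phi_n, t_n, eta=0.1):
    """One stochastic-gradient step: w <- w - eta * (y_n - t_n) * phi_n."""
    y_n = 1.0 / (1.0 + np.exp(-(w @ phi_n)))
    return w - eta * (y_n - t_n) * phi_n

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Phi = np.column_stack([np.ones(len(X)), X])     # phi_0 = 1 dummy feature
t = (X[:, 0] + X[:, 1] > 0).astype(float)       # toy 0/1 targets
w = np.zeros(Phi.shape[1])
for _ in range(20):                              # passes over the data
    for phi_n, t_n in zip(Phi, t):               # samples presented one at a time
        w = sequential_update(w, phi_n, t_n)
```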
The ML Solution Can Over-fit
Maximum likelihood leads to severe over-fitting for linearly separable data, because the ML solution occurs when the boundary σ = 0.5, i.e. a = w^T ϕ = 0, separates the classes exactly, with σ > 0.5 for one class and σ < 0.5 for the other. The magnitude of w goes to infinity and the logistic sigmoid becomes infinitely steep, a Heaviside step function. Penalizing the weights can avoid this.
[Figure: separable data and the logistic sigmoid σ(a) vs a becoming a step function]
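One way to penalize the weights, as the slide suggests, is to add a quadratic term (λ/2)||w||² to E(w); a sketch of the resulting regularized gradient, where λ is a hyperparameter I introduce for illustration:

```python
import numpy as np

def regularized_gradient(w, Phi, t, lam=0.01):
    """Gradient of E(w) + (lam/2) * ||w||^2; the extra lam*w term stops ||w||
    from growing without bound on linearly separable data."""
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
    return Phi.T @ (y - t) + lam * w
```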
An Example of 2-class Logistic Regression
[Figure: input data] The dummy feature ϕ_0(x) = 1 is included as a bias term.
Initial Weight Vector, Gradient and Hessian (2-class)
[Figure: initial values of the weight vector, gradient, and Hessian]
Final Weight Vector, Gradient and Hessian (2-class)
[Figure: final values of the weight vector, gradient, and Hessian]
Number of iterations: 10
Error (initial and final): 15.0642, 1.0000e-09
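The weight vector, gradient, and Hessian shown above correspond to a Newton-Raphson / iterative reweighted least squares (IRLS) style optimization; a compact sketch under that assumption, using ∇E = Φ^T(y - t) and H = Φ^T R Φ with R = diag(y_n(1 - y_n)):

```python
import numpy as np

def irls(Phi, t, n_iter=10):
    """Iterative reweighted least squares: w <- w - H^{-1} grad E,
    with grad E = Phi^T (y - t) and H = Phi^T R Phi, R = diag(y_n (1 - y_n))."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
        R = np.diag(y * (1.0 - y))
        grad = Phi.T @ (y - t)
        H = Phi.T @ R @ Phi
        w = w - np.linalg.solve(H, grad)
    return w
```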
Generative vs Discriminative Training
Variables x = {x_1, .., x_M} and classifier target y.
1. Generative (Naïve Bayes): estimate the parameters of the variables independently.
   [Graphical model: y with arrows to x_1, x_2, .., x_M]
   For classification, determine the joint distribution
   p(y, x) = p(y) ∏_{i=1}^{M} p(x_i|y)
   and from the joint obtain the required conditional p(y|x).
   - Simple estimation: independently estimate M sets of parameters.
   - But the independence assumption is usually false; modelling the correlations would mean estimating an M(M+1)/2-parameter covariance matrix.
2. Discriminative (Logistic Regression): estimate the parameters w_i jointly.
   Potential functions (log-linear): ϕ_i(x_i, y) = exp{w_i x_i I{y=1}}, ϕ_0(y) = exp{w_0 I{y=1}}, where the indicator I has value 1 when y = 1 and 0 otherwise.
   [Graphical model: "naïve Markov" network, y connected to x_1, x_2, .., x_M]
   For classification:
   - Unnormalized: P̃(y=1|x) = exp{w_0 + ∑_{i=1}^{M} w_i x_i},  P̃(y=0|x) = exp{0} = 1
   - Normalized: P(y=1|x) = sigmoid(w_0 + ∑_{i=1}^{M} w_i x_i), where sigmoid(z) = e^z / (1 + e^z)
   - Jointly optimize the M parameters: more complex estimation, but correlations are accounted for.
   - Can use much richer features, e.g. edges or image patches sharing the same pixels.
   - Multiclass case (see the sketch below): p(y_i|ϕ) = y_i(ϕ) = exp(a_i) / ∑_j exp(a_j), where a_j = w_j^T ϕ.
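A short sketch of the multiclass posterior given at the end of the comparison (the array shapes are my own convention: W is M×K with one column w_k per class, Phi is N×M):

```python
import numpy as np

def softmax_posteriors(W, Phi):
    """Multiclass logistic regression: p(C_k | phi) = exp(a_k) / sum_j exp(a_j),
    with activations a_k = w_k^T phi."""
    A = Phi @ W                              # N x K matrix of activations a_nk
    A = A - A.max(axis=1, keepdims=True)     # subtract the row max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)
```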
Logistic Regression is a special architecture of a neural network.