Linear and logistic regression
Guillaume Obozinski
École des Ponts ParisTech
Master MVA
Outline
1. Linear regression
2. Logistic regression
3. Fisher discriminant analysis
Linear regression
Design matrix
Consider a finite collection of vectors $x_i \in \mathbb{R}^d$ for $i = 1, \dots, n$.

Design matrix
$$X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}$$

We will assume that the vectors are centered, i.e. that $\sum_{i=1}^n x_i = 0$, and normalized, i.e. that $\frac{1}{n}\sum_{i=1}^n x_i^2 = 1$ coordinate-wise. If the $x_i$ are not centered, the design matrix of centered data can be constructed with the rows $x_i - \bar{x}$, with $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. Normalization usually consists in dividing each column (feature) by its empirical standard deviation.
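As an illustration, here is a minimal NumPy sketch of this centering and normalization step (the function name `standardize_design` and the guard against zero-variance columns are illustrative choices, not part of the slides):

```python
import numpy as np

def standardize_design(X):
    """Center each column and scale it by its empirical standard deviation.

    X is an (n, d) design matrix whose rows are the observations x_i.
    Returns the standardized matrix together with the column means and
    standard deviations, so the same transform can be applied to new data.
    """
    mean = X.mean(axis=0)                 # x̄ = (1/n) Σ_i x_i
    std = X.std(axis=0)                   # empirical standard deviation of each feature
    std = np.where(std > 0, std, 1.0)     # avoid dividing by zero for constant columns
    X_centered = X - mean                 # rows become x_i − x̄
    X_standardized = X_centered / std     # each column now satisfies (1/n) Σ_i x_ij² = 1
    return X_standardized, mean, std
```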
Generative models vs conditional models
- X is the input variable, Y is the output variable.
- A generative model is a model of the joint distribution p(x, y).
- A conditional model is a model of the conditional distribution p(y | x).

Conditional models (CM) vs generative models:
- CM make fewer assumptions about the data distribution.
- CM require fewer parameters.
- CM are typically computationally harder to learn.
- CM typically cannot handle missing data or latent variables.
Probabilistic version of linear regression
Model the conditional distribution of Y given X by
$$Y \mid X = x \sim \mathcal{N}(w^\top x + b, \sigma^2),$$
or equivalently $Y = w^\top X + b + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.

The offset can be ignored up to a reparameterization: replacing $x$ by $\binom{x}{1}$ and $w$ by $\binom{w}{b}$, we may write $Y = w^\top x + \varepsilon$.

Likelihood for one pair
$$p(y_i \mid x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - w^\top x_i)^2}{2\sigma^2} \right)$$

Negative log-likelihood
$$\ell(w, \sigma^2) = -\sum_{i=1}^n \log p(y_i \mid x_i) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2}\sum_{i=1}^n \frac{(y_i - w^\top x_i)^2}{\sigma^2}.$$
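For concreteness, a short NumPy sketch of this negative log-likelihood, assuming the offset has already been absorbed into the design matrix (the function name is illustrative):

```python
import numpy as np

def gaussian_nll(w, sigma2, X, y):
    """Negative log-likelihood of the linear-Gaussian model Y | x ~ N(w·x, σ²).

    X is the (n, p) design matrix (offset absorbed as a column of ones),
    y the (n,) vector of responses.
    """
    n = X.shape[0]
    residuals = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(residuals ** 2) / sigma2
```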
Probabilistic version of linear regression
Maximum likelihood estimation amounts to
$$\min_{\sigma^2, w} \; \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - w^\top x_i)^2.$$
The minimization problem in $w$,
$$\min_w \; \frac{1}{2\sigma^2}\, \|y - Xw\|_2^2,$$
is the usual linear regression problem, with $y = (y_1, \dots, y_n)^\top$ and $X$ the design matrix with rows equal to $x_i^\top$.

Optimizing over $\sigma^2$, we find
$$\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{w}_{\mathrm{MLE}}^\top x_i)^2.$$
Solving linear regression
To solve $\min_{w \in \mathbb{R}^p} R_n(f_w)$, we note that
$$R_n(f_w) = \frac{1}{2n}\left( w^\top X^\top X w - 2\, w^\top X^\top y + \|y\|^2 \right)$$
is a differentiable convex function, whose minima are therefore characterized by the normal equations
$$X^\top X w - X^\top y = 0.$$
If $X^\top X$ is invertible, then $\hat{w}_{\mathrm{MLE}}$ is given by
$$\hat{w}_{\mathrm{MLE}} = (X^\top X)^{-1} X^\top y.$$
Problem: $X^\top X$ is never invertible for $p > n$, and thus the solution is not unique (and any solution overfits).
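A minimal sketch of the resulting estimator, assuming $X^\top X$ is invertible; it solves the normal equations with a linear solver rather than forming the inverse explicitly, and also returns the $\hat{\sigma}^2_{\mathrm{MLE}}$ from the previous slide (the function name is illustrative):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Maximum likelihood estimates for the linear-Gaussian model.

    Solves the normal equations XᵀX w = Xᵀy (assuming XᵀX is invertible)
    and plugs the residuals into the MLE of σ².
    """
    n = X.shape[0]
    # Solve the linear system instead of forming (XᵀX)⁻¹ explicitly.
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    residuals = y - X @ w_hat
    sigma2_hat = np.sum(residuals ** 2) / n
    return w_hat, sigma2_hat
```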
Logistic regression
Logistic regression
Classification setting: $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \{0, 1\}$.

Key assumption:
$$\log \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = w^\top x$$
This implies that $P(Y = 1 \mid X = x) = \sigma(w^\top x)$ for $\sigma : z \mapsto \frac{1}{1 + e^{-z}}$, the logistic function.

[Figure: plot of the logistic function on $[-10, 10]$.]

The logistic function is part of the family of sigmoid functions. It is often simply called the sigmoid function.

Properties:
$$\forall z \in \mathbb{R}, \quad \sigma(-z) = 1 - \sigma(z), \qquad \sigma'(z) = \sigma(z)(1 - \sigma(z)) = \sigma(z)\,\sigma(-z).$$
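A small NumPy sketch of the logistic function, together with a numerical check of the two properties above (the finite-difference check is only an illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function σ(z) = 1 / (1 + e^{-z})."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Quick numerical check of the two properties from the slide.
z = np.linspace(-10, 10, 5)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))            # σ(−z) = 1 − σ(z)
eps = 1e-6
deriv = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference σ'(z)
assert np.allclose(deriv, sigmoid(z) * sigmoid(-z), atol=1e-6)
```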
Likelihood for logistic regression
Let $\eta := \sigma(w^\top x + b)$. W.l.o.g. we assume $b = 0$. By assumption, $Y \mid X = x \sim \mathrm{Ber}(\eta)$.

Likelihood
$$p(Y = y \mid X = x) = \eta^y (1 - \eta)^{1-y} = \sigma(w^\top x)^y\, \sigma(-w^\top x)^{1-y}.$$

Log-likelihood
$$\ell(w) = y \log \sigma(w^\top x) + (1 - y) \log \sigma(-w^\top x) = y \log \eta + (1 - y)\log(1 - \eta) = y \log \frac{\eta}{1 - \eta} + \log(1 - \eta) = y\, w^\top x + \log \sigma(-w^\top x).$$
Maximizing the log-likelihood
Log-likelihood of a sample: given an i.i.d. training set $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$,
$$\ell(w) = \sum_{i=1}^n y_i\, w^\top x_i + \log \sigma(-w^\top x_i).$$
The log-likelihood is differentiable and concave, so its global maxima are its stationary points.

Gradient of $\ell$
$$\nabla \ell(w) = \sum_{i=1}^n y_i x_i - \frac{\sigma(-w^\top x_i)\,\sigma(w^\top x_i)}{\sigma(-w^\top x_i)}\, x_i = \sum_{i=1}^n (y_i - \eta_i)\, x_i \quad \text{with } \eta_i = \sigma(w^\top x_i).$$
Thus, $\nabla \ell(w) = 0 \iff \sum_{i=1}^n x_i\,(y_i - \sigma(w^\top x_i)) = 0$. No closed-form solution!
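A possible NumPy implementation of this log-likelihood and its gradient for labels in {0, 1} (the stable `logaddexp` form is an implementation choice, not part of the slides):

```python
import numpy as np

def logistic_log_likelihood_and_grad(w, X, y):
    """Log-likelihood ℓ(w) and gradient ∇ℓ(w) = Xᵀ(y − η) for labels y ∈ {0, 1}."""
    scores = X @ w                                   # s_i = w·x_i
    eta = 1.0 / (1.0 + np.exp(-scores))              # η_i = σ(s_i)
    # y_i s_i + log σ(−s_i) = y_i s_i − log(1 + e^{s_i}); logaddexp(0, s) is the stable form.
    loglik = np.sum(y * scores - np.logaddexp(0.0, scores))
    grad = X.T @ (y - eta)
    return loglik, grad
```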
Second order Taylor expansion
We need an iterative method to solve $\sum_{i=1}^n x_i\,(y_i - \sigma(w^\top x_i)) = 0$:
- gradient descent (a.k.a. steepest descent),
- Newton's method.

Hessian of $\ell$
$$H\ell(w) = \sum_{i=1}^n x_i \big( 0 - \sigma(w^\top x_i)\,\sigma(-w^\top x_i) \big) x_i^\top = -\sum_{i=1}^n \eta_i (1 - \eta_i)\, x_i x_i^\top = -X^\top \mathrm{Diag}\big(\eta_i(1 - \eta_i)\big)\, X,$$
where $X$ is the design matrix. Note that $-H\ell(w)$ is p.s.d., so $\ell$ is concave.
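The corresponding Hessian can be sketched as follows (forming $X^\top D_\eta X$ via broadcasting instead of an explicit diagonal matrix is just an implementation choice):

```python
import numpy as np

def logistic_hessian(w, X):
    """Hessian Hℓ(w) = −Xᵀ Diag(η_i (1 − η_i)) X of the logistic log-likelihood."""
    eta = 1.0 / (1.0 + np.exp(-(X @ w)))
    weights = eta * (1.0 - eta)               # η_i (1 − η_i), the diagonal weights
    return -(X * weights[:, None]).T @ X      # equivalent to −Xᵀ D X without forming Diag
```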
Newton's method
Use the second order Taylor expansion
$$\ell(w) \approx \ell(w^t) + (w - w^t)^\top \nabla \ell(w^t) + \frac{1}{2}(w - w^t)^\top H\ell(w^t)(w - w^t)$$
and maximize it w.r.t. $w$. Setting $h = w - w^t$, we get
$$\max_h \; h^\top \nabla_w \ell(w^t) + \frac{1}{2} h^\top H\ell(w^t)\, h,$$
i.e., for logistic regression, writing $D_\eta = \mathrm{Diag}\big((\eta_i(1-\eta_i))_i\big)$,
$$\min_h \; -h^\top X^\top (y - \eta) + \frac{1}{2} h^\top X^\top D_\eta X\, h.$$
Modified normal equations
$$X^\top D_\eta X\, h = X^\top \tilde{y} \quad \text{with} \quad \tilde{y} = y - \eta.$$
Iterative Reweighted Least Squares (IRLS)
Assuming $X^\top D_\eta X$ is invertible, the algorithm takes the form
$$w^{(t+1)} \leftarrow w^{(t)} + \big(X^\top D_{\eta^{(t)}} X\big)^{-1} X^\top \big(y - \eta^{(t)}\big).$$
This is called iterative reweighted least squares because each step is equivalent to solving the reweighted least squares problem
$$\min_h \; \frac{1}{2}\sum_{i=1}^n \frac{1}{\tau_i^2}\big( x_i^\top h - \check{y}_i \big)^2 \quad \text{with} \quad \tau_i^2 = \frac{1}{\eta_i^{(t)}(1 - \eta_i^{(t)})} \quad \text{and} \quad \check{y}_i = \tau_i^2 \big( y_i - \eta_i^{(t)} \big).$$
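A compact sketch of the IRLS / Newton iteration under the slide's assumption that $X^\top D_\eta X$ is invertible; the zero initialization, iteration cap and stopping rule are illustrative choices:

```python
import numpy as np

def irls_logistic_regression(X, y, n_iters=25, tol=1e-8):
    """Fit logistic regression (labels in {0, 1}) by IRLS / Newton's method.

    Each step solves the modified normal equations
        Xᵀ D_η X h = Xᵀ (y − η),   D_η = Diag(η_i (1 − η_i)),
    and updates w ← w + h.
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        eta = 1.0 / (1.0 + np.exp(-(X @ w)))      # η_i = σ(w·x_i)
        d = eta * (1.0 - eta)                      # diagonal of D_η
        grad = X.T @ (y - eta)                     # ∇ℓ(w)
        H = (X * d[:, None]).T @ X                 # Xᵀ D_η X  (= −Hℓ(w))
        h = np.linalg.solve(H, grad)               # Newton step
        w += h
        if np.linalg.norm(h) < tol:                # stop when the step is tiny
            break
    return w
```

In practice one may add a small ridge term to $X^\top D_\eta X$ when it is ill-conditioned, for instance on linearly separable data.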
Alternate formulation of logistic regression
If $y \in \{-1, 1\}$, then
$$P(Y = y \mid X = x) = \sigma(y\, w^\top x).$$
Log-likelihood
$$\ell(w) = \log \sigma(y\, w^\top x) = -\log\big(1 + \exp(-y\, w^\top x)\big).$$
Log-likelihood for a training set
$$\ell(w) = -\sum_{i=1}^n \log\big(1 + \exp(-y_i\, w^\top x_i)\big).$$
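The corresponding negative log-likelihood (the logistic loss) can be written compactly; the stable `logaddexp` form is again an implementation choice:

```python
import numpy as np

def logistic_loss_pm1(w, X, y):
    """Negative log-likelihood Σ_i log(1 + exp(−y_i w·x_i)) for labels y ∈ {−1, +1}."""
    margins = y * (X @ w)
    # logaddexp(0, −m) = log(1 + e^{−m}), stable for large |m|.
    return np.sum(np.logaddexp(0.0, -margins))
```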
Fisher discriminant analysis
Generative classification
$X \in \mathbb{R}^p$ and $Y \in \{0, 1\}$. Instead of modeling $p(y \mid x)$ directly, model $p(y)$ and $p(x \mid y)$ and deduce $p(y \mid x)$ using Bayes' rule. In classification,
$$P(Y = 1 \mid X = x) = \frac{P(X = x \mid Y = 1)\, P(Y = 1)}{P(X = x \mid Y = 1)\, P(Y = 1) + P(X = x \mid Y = 0)\, P(Y = 0)}.$$
For example, one can assume
$$P(Y = 1) = \pi, \qquad P(X = x \mid Y = 1) = \mathcal{N}(x;\, \mu_1, \Sigma_1), \qquad P(X = x \mid Y = 0) = \mathcal{N}(x;\, \mu_0, \Sigma_0).$$
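A sketch of this posterior computation, assuming SciPy is available for the Gaussian densities (the function and argument names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_posterior(x, pi, mu1, Sigma1, mu0, Sigma0):
    """P(Y = 1 | X = x) via Bayes' rule with Gaussian class-conditional densities."""
    p1 = pi * multivariate_normal.pdf(x, mean=mu1, cov=Sigma1)          # P(X=x|Y=1) P(Y=1)
    p0 = (1.0 - pi) * multivariate_normal.pdf(x, mean=mu0, cov=Sigma0)  # P(X=x|Y=0) P(Y=0)
    return p1 / (p1 + p0)
```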
Fisher's discriminant, a.k.a. Linear Discriminant Analysis (LDA)
This is the previous model with the constraint $\Sigma_1 = \Sigma_0 = \Sigma$. Given a training set, the model parameters can be estimated using the maximum likelihood principle, which leads to estimates $(\hat{\pi}, \hat{\mu}_1, \hat{\mu}_0, \hat{\Sigma})$.
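A hedged sketch of these maximum likelihood estimates, using the standard pooled-covariance formula implied by the shared-$\Sigma$ constraint (function and variable names are illustrative):

```python
import numpy as np

def fit_lda(X, y):
    """MLE of the LDA parameters (π, μ1, μ0, Σ) for labels y ∈ {0, 1}.

    π is the class-1 proportion, μ_k the class means, and Σ the pooled
    (shared) covariance obtained by averaging squared deviations of every
    sample from its own class mean.
    """
    n = X.shape[0]
    X1, X0 = X[y == 1], X[y == 0]
    pi_hat = X1.shape[0] / n
    mu1_hat = X1.mean(axis=0)
    mu0_hat = X0.mean(axis=0)
    centered = np.vstack([X1 - mu1_hat, X0 - mu0_hat])
    Sigma_hat = centered.T @ centered / n     # pooled MLE covariance (divides by n)
    return pi_hat, mu1_hat, mu0_hat, Sigma_hat
```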