Sapienza University of Rome, Italy
Master in Artificial Intelligence and Robotics - Machine Learning (27/28)

7. Logistic and Linear Regression

Luca Iocchi, Valsamis Ntouskos
Probabilistic Generative Models

Consider first the case of two classes. Find the conditional probability:

P(C_1 | x) = \frac{p(x|C_1)P(C_1)}{p(x|C_1)P(C_1) + p(x|C_2)P(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a),

with:

a = \ln \frac{p(x|C_1)P(C_1)}{p(x|C_2)P(C_2)}

and \sigma(a) = \frac{1}{1 + \exp(-a)} the logistic sigmoid function.
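The sigmoid identity above can be checked numerically with a short sketch (plain Python; the function name is ours):

```python
import math

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# Here a plays the role of the log-odds ln[p(x|C1)P(C1) / (p(x|C2)P(C2))]:
# when the evidence for the two classes is equal, a = 0 and P(C1|x) = 0.5.
p_equal = sigmoid(0.0)
```

Note the symmetry \sigma(-a) = 1 - \sigma(a), which is exactly the statement P(C_2|x) = 1 - P(C_1|x).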
Probabilistic Generative Models

Assume p(x | C_i) = \mathcal{N}(x | \mu_i, \Sigma) (same covariance matrix for both classes). We get:

P(C_1 | x) = \sigma(w^T x + w_0),

with:

w = \Sigma^{-1}(\mu_1 - \mu_2),
w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{P(C_1)}{P(C_2)}.

Note that: P(C_2 | x) = 1 - P(C_1 | x).

Number of parameters: M(M+5)/2 + 1 = 2M (for \mu_1, \mu_2) + M(M+1)/2 (for the symmetric \Sigma) + 1 (for the prior P(C_1)).
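A minimal sketch of the closed-form w and w_0 above; the means, covariance, and priors are made-up illustrative values, not from the slides:

```python
import numpy as np

# Hypothetical shared-covariance Gaussian class-conditionals in 2-D.
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
P1, P2 = 0.5, 0.5

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)                      # w = Sigma^-1 (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sinv @ mu1               # w0 from the slide formula
      + 0.5 * mu2 @ Sinv @ mu2
      + np.log(P1 / P2))

def posterior_C1(x):
    """P(C1 | x) = sigma(w^T x + w0)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))
```

With symmetric means and equal priors, the posterior at the midpoint between the classes is exactly 0.5.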
Logistic regression - Model

P(C_1 | x) = \sigma(w^T x + w_0)

(Figure: logistic sigmoid curves of P(C_1 | x) as a function of x.)

Decision rule: c = C_1 \iff P(c = C_1 | x) > 0.5.

Number of parameters: M + 1.
Logistic regression - Model

Varying the parameters w in 2D.

(Figure: grid of plots showing how the decision boundary in the (x_1, x_2) plane changes for different weight vectors w = (w_1, w_2).)
Logistic regression - Algorithms

How are the parameters w computed, given a dataset D? We will see this in a while...
Linear Regression

Goal: estimate the value t of a continuous function at x, based on a dataset D composed of N observations \{x_n\}, n = 1, ..., N, together with the corresponding target values \{t_n\}.

Ideally: t = y(x, w).
Linear Regression - Model: Linear Basis Function Models

Simplest case:

y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D = w^T x,

with x = (1, x_1, ..., x_D)^T and w = (w_0, w_1, ..., w_D)^T.

Linear both in the model parameters w and in the variables x. Too limiting!
Example - Line fitting

(Figure: prediction y = w_1 x + w_0 fitted to noisy data, plotted against the ground truth.)
Linear Regression - Model: Linear Basis Function Models

Using nonlinear functions of the input variables:

y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi,

with \phi_0(x) = 1 and \phi = (\phi_0, ..., \phi_{M-1})^T.

Still linear in the parameters w!
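For polynomial basis functions \phi_j(x) = x^j, the feature map can be sketched as follows (the helper name is ours, not from the slides):

```python
import numpy as np

def poly_features(x, M):
    """Map scalar inputs to phi(x) = (1, x, x^2, ..., x^(M-1))^T.
    phi_0(x) = 1 supplies the bias term."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, M, increasing=True)

Phi = poly_features([0.0, 1.0, 2.0], 3)   # rows are [1, x_n, x_n^2]
```

Each row of Phi is the feature vector \phi(x_n) of one observation; this is exactly the design matrix used by the least-squares solution later.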
Example - Polynomial curve fitting

y = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j

(Figure: polynomial fits of the same data (t against x) for M = 0, M = 1, M = 3, and M = 9.)
Linear Regression - Model

Examples of basis functions:

(Figure: polynomial, radial (Gaussian), and sigmoid/tanh basis functions.)
Linear Regression - Algorithms: Maximum likelihood and least squares

The target value t is given by y(x, w) affected by additive noise:

t = y(x, w) + \epsilon.

Assume Gaussian noise p(\epsilon | \beta) = \mathcal{N}(\epsilon | 0, \beta^{-1}), with \beta the precision (inverse variance). We have:

p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1}).
Linear Regression - Algorithms

Assume the observations are independent and identically distributed (i.i.d.). We seek the maximum of the likelihood function:

p(\{t_1, ..., t_N\} | x, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1}),

or, equivalently, minimize the negative log-likelihood:

-\ln p(\{t_1, ..., t_N\} | w, \beta) = -\sum_{n=1}^{N} \ln \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})
= \beta \underbrace{\frac{1}{2} \sum_{n=1}^{N} [t_n - w^T \phi(x_n)]^2}_{E_D(w)} + \frac{N}{2} \ln(2\pi\beta^{-1}).
Linear Regression - Algorithms

Note:

E_D = \frac{1}{2} (t - \Phi w)^T (t - \Phi w),

with t = (t_1, ..., t_N)^T and

\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}.

Optimality condition: \nabla E_D = 0 \iff \Phi^T \Phi w = \Phi^T t. Hence:

w_{ML} = \underbrace{(\Phi^T \Phi)^{-1} \Phi^T}_{\Phi^\dagger:\ \text{pseudo-inverse}} t.
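A minimal sketch of the closed-form solution, on made-up noisy line data (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
t = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(50)   # assumed data: w0 = 1, w1 = 2

Phi = np.column_stack([np.ones_like(x), x])          # phi_0 = 1, phi_1 = x
w_ml = np.linalg.pinv(Phi) @ t                       # w_ML = Phi^dagger t
```

np.linalg.pinv computes the Moore-Penrose pseudo-inverse via an SVD, which is numerically safer than forming (\Phi^T\Phi)^{-1} explicitly.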
Linear Regression - Algorithms: Sequential Learning

Stochastic gradient descent algorithm:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n,

with \tau the iteration number and \eta the learning rate parameter. Therefore:

w^{(\tau+1)} = w^{(\tau)} + \eta (t_n - (w^{(\tau)})^T \phi(x_n)) \phi(x_n).

The algorithm converges for suitable values of \eta.
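The sequential update can be sketched as follows; the learning rate, epoch count, and data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 200)
t = -0.5 + 3.0 * x + 0.05 * rng.standard_normal(200)   # assumed data
Phi = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
eta = 0.1                                    # learning rate (must be small enough)
for epoch in range(100):
    for n in rng.permutation(len(t)):        # one pattern at a time
        w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]
```

Unlike the batch solution, each update touches a single observation, so the method also works when data arrive as a stream.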
Linear Regression - Algorithms: Regularization

Why regularize? To control over-fitting. Minimize:

E_D(w) + \lambda E_W(w).

A common choice:

E_W(w) = \frac{1}{2} w^T w.

Other choices:

E_W(w) = \frac{1}{2} \sum_{j=1}^{M} |w_j|^q.
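For q = 2 (quadratic weight decay, "ridge" regression) the regularized minimizer has a closed form, w = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T t. A sketch on made-up data (helper name and data are ours):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize E_D(w) + lam * E_W(w) with E_W(w) = 0.5 * w^T w (q = 2).
    Closed form: w = (lam I + Phi^T Phi)^-1 Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)  # assumed data
Phi = np.vander(x, 10, increasing=True)                    # degree-9 polynomial

w_small = ridge_fit(Phi, t, 1e-8)   # almost unregularized
w_large = ridge_fit(Phi, t, 10.0)   # heavily shrunk weights
```

Larger \lambda shrinks the weight vector toward zero, trading training error for smoothness and controlling over-fitting.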
Linear Regression - Algorithms: Regularization

(Figure: contours of the regularization term in the (w_1, w_2) plane: circular for q = 2, diamond-shaped for q = 1.)
Linear Regression - Algorithms: Regularization

E_W(w) = \frac{1}{2} w^T w.

(Figure: fits of the same data (t against x) for two values of the regularization parameter, \ln\lambda = -18 and \ln\lambda = 0.)
Linear Regression - Algorithms: Multiple outputs

y(x, W) = W^T \phi(x).

The target variable is given by:

T = y(x, W) + \epsilon, with p(\epsilon | \beta) = \mathcal{N}(\epsilon | 0, \beta^{-1} I).

Similarly to before, we obtain:

W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T.
Logistic regression - Algorithms

Time to see how the parameters of the logistic regression model are estimated. Remember that logistic regression is used for classification!

In the following we will consider the logistic regression problem:

P(C_1 | \phi) = y(\phi) = \sigma(w^T \phi),

with \sigma(\cdot) the logistic sigmoid function.
Logistic regression - Algorithms

Given a data set \{\phi_n, t_n\}, with t_n \in \{0, 1\} and \phi_n = \phi(x_n), n = 1, ..., N.

Likelihood function:

P(t | w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n},    (*)

with y_n = P(C_1 | \phi_n) = \sigma(w^T \phi_n).

(*) Hint: P(c = 1 | \phi_n) = y_n and P(c = 0 | \phi_n) = 1 - y_n.
Logistic regression - Algorithms

Cross-entropy error function:

E(w) = -\ln P(t | w) = -\sum_{n=1}^{N} [t_n \ln y_n + (1 - t_n) \ln(1 - y_n)].

Gradient of the error with respect to w, using \frac{d\sigma}{da} = \sigma(1 - \sigma):

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) \phi_n.
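The error and its gradient can be sketched and checked against finite differences (function names and data are ours):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n = Phi^T (y - t)."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

# Finite-difference check on small made-up data.
Phi = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
t = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, -0.2])
eps = 1e-6
g_num = np.array([(cross_entropy(w + eps * e, Phi, t)
                   - cross_entropy(w - eps * e, Phi, t)) / (2 * eps)
                  for e in np.eye(2)])
```

Agreement between the analytic and numeric gradients confirms the \sigma(1-\sigma) derivative rule used above.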
Iterative reweighted least squares

Apply the Newton-Raphson iterative optimization technique for minimizing E(w):

w^{new} = w^{old} - H^{-1} \nabla E(w),

where H = \nabla\nabla E(w) is the Hessian matrix of E(w).
Iterative reweighted least squares

\nabla E(w) = \Phi^T (y - t),    H = \Phi^T R \Phi,

with t = (t_1, ..., t_N)^T, y = (y_1, ..., y_N)^T, R the diagonal matrix with R_{n,n} = y_n (1 - y_n), and

\Phi = \begin{pmatrix} \phi_1^T \\ \vdots \\ \phi_N^T \end{pmatrix}.
Iterative reweighted least squares

Iterative method:

w^{new} = w^{old} - (\Phi^T R \Phi)^{-1} \Phi^T (y - t).
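A compact IRLS sketch on made-up overlapping data (iteration count and data are assumptions; on perfectly separable data the ML weights diverge and the update would eventually break down):

```python
import numpy as np

def irls(Phi, t, iters=15):
    """Newton-Raphson for logistic regression:
    w <- w - (Phi^T R Phi)^-1 Phi^T (y - t), with R_nn = y_n (1 - y_n)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
        R = np.diag(y * (1.0 - y))
        w = w - np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ (y - t))
    return w

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])
t = np.concatenate([np.zeros(100), np.ones(100)])
Phi = np.column_stack([np.ones_like(x), x])
w = irls(Phi, t)
acc = np.mean(((1.0 / (1.0 + np.exp(-(Phi @ w)))) > 0.5) == (t == 1))
```

Each step is a weighted least-squares problem, hence the name; with the cross-entropy error the Hessian is positive semi-definite, so Newton steps head toward the global minimum.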
Multiclass logistic regression

P(C_k | \phi) = y_k(\phi) = \underbrace{\frac{\exp(a_k)}{\sum_j \exp(a_j)}}_{\text{softmax}},

with a_k = w_k^T \phi.

Discriminative model:

P(T | w_1, ..., w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} P(C_k | \phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}},

with y_{nk} = y_k(\phi_n) and T_{n,k} = t_{nk}.
Multiclass logistic regression

Cross-entropy error function:

E(w_1, ..., w_K) = -\ln P(T | w_1, ..., w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}.

Gradient:

\nabla_{w_j} E(w_1, ..., w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \phi_n.

Hessian matrix:

\nabla_{w_k} \nabla_{w_j} E(w_1, ..., w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) \phi_n \phi_n^T.
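The softmax and the per-class gradient can be sketched as follows (helper names are ours; one-hot target rows assumed):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax: y_k = exp(a_k) / sum_j exp(a_j)."""
    A = A - A.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def grad_wj(Phi, Y, T, j):
    """grad_{w_j} E = sum_n (y_nj - t_nj) phi_n = Phi^T (Y[:, j] - T[:, j])."""
    return Phi.T @ (Y[:, j] - T[:, j])

# Tiny made-up example: N = 2 samples, K = 3 classes, M = 2 features.
Phi = np.array([[1.0, 0.5], [1.0, -0.5]])
W = np.zeros((2, 3))                       # columns are w_1, ..., w_K
Y = softmax(Phi @ W)                       # zero weights -> uniform posteriors
T = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

The rows of Y always sum to 1, and subtracting the row maximum before exponentiating leaves the softmax unchanged while avoiding overflow.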