Lecture 5. Gaussian Models - Part 1. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. November 29, PDF Free Download

Lecture 5 Gaussian Models - Part 1 Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza November 29, 2016 Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 1 / 42

Outline 1 Basics Multivariate Gaussian 2 MLE for an MVN Theorem 3 Gaussian Discriminant Analysis Generative Classifiers Gaussian Discriminant Analysis (GDA) Quadratic Discriminant Analysis (QDA) Linear Discriminant Analysis (LDA) MLE for Gaussian Discriminant Analysis Diagonal LDA Bayesian Procedure Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 2 / 42

Univariate Gaussian (Normal) Distribution X is a continuous RV with values x R X N (µ, σ 2 ), i.e. X has a Gaussian distribution or normal distribution mean E[X ] = µ mode µ N (x µ, σ 2 ) variance var[x ] = σ 2 1 2πσ 2 e 1 2σ 2 (x µ)2 precision λ = 1 σ 2 (µ 2σ, µ + 2σ) is the approx 95% interval (µ 3σ, µ + 3σ) is the approx. 99.7% interval (= P X (X = x)) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 4 / 42

Multivariate Gaussian (Normal) Distribution X is a continuous RV with values x R D X N (µ, Σ), i.e. X has a Multivariate Normal distribution (MVN) or multivariate Gaussian [ 1 N (x µ, Σ) exp 1 ] (2π) D/2 Σ 1/2 2 (x µ)t Σ 1 (x µ) mean: E[x] = µ mode: µ covariance matrix: cov[x] = Σ R D D where Σ = Σ T and Σ 0 precision matrix: Λ Σ 1 spherical isotropic covariance with Σ = σ 2 I D Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 5 / 42

MLE for an MVN Theorem Theorem 1 If we have N iid samples x i N (µ, Σ), then the MLE for the parameters is given by 1 µ MLE = 1 N N x i x i=1 2 Σ MLE = 1 N (x i x)(x i x) T = 1 ( N x i x T i N N i=1 i=1 ) x x T this theorem states the MLE parameter estimates for an MVN are just the empirical mean and the empirical covariance in the univariate case, one has ˆµ = 1 N x i x N ˆσ 2 = 1 N i=1 N (x i x)(x i x) T = 1 ( N N i=1 i=1 x i x T i ) x 2 Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 7 / 42

MLE for an MVN Theorem proof sketch in order to find the MLE one should maximize the log-likelihood of the dataset given that x i N (µ, Σ) p(d µ, Σ) = i N (x i µ, Σ) the log-likelihood (dropping additive constants) is l(µ, Σ) = log p(d µ, Σ) = N 2 log Λ 1 (x i µ)λ(x i µ) T + const 2 the MLE estimates can be obtained by maximizing l(µ, Σ) w.r.t. µ and Σ i homework: continue the proof for the univariate case Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 8 / 42

Generative Classifiers probabilistic classifier we are given a dataset D = {(x i, y i )} N i=1 the goal is to compute the class posterior p(y = c x) which models the mapping y = f (x) generative classifiers p(y = c x) is computed starting from the class-conditional density p(x y = c, θ) and the class prior p(y = c θ) given that p(y = c x, θ) p(x y = c, θ)p(y = c θ) (= p(y = c, x θ)) this is called a generative classifier since it specifies how to generate the feature vector x for each class y = c (by using p(x y = c, θ)) the model is usually fit by maximizing the joint log-likelihood, i.e. one computes θ = arg max θ i log p(y i, x i θ) discriminative classifiers the model p(y = c x) is directly fit to the data the model is usually fit by maximizing the conditional log-likelihood, i.e. one computes θ = arg max θ i log p(y i x i, θ) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 10 / 42

Gaussian Discriminant Analysis GDA we can use the MVN for defining the class conditional densities in a generative classifier p(x y = c, θ) = N (x µ c, Σ c) for c {1,..., C} this means the samples of each class c are characterized by a normal distribution this model is called Gaussian Discriminative Analysis (GDA) but it is a generative classifier (not discriminative) in the case Σ c is diagonal for each c, this model is equivalent to a Naive Bayes Classifier (NBC) since p(x y = c, θ) = D N (x j µ jc, σjc) 2 for c {1,..., C} j=1 once the model is fit to the data, we can classify a feature vector by using the decision rule [ ] ŷ(x) = argmax log p(y = c x, θ) = argmax log p(y = c π) + log p(x y = c, θ c) c c Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 12 / 42

Gaussian Discriminant Analysis GDA decision rule ŷ(x) = argmax c [ ] log p(y = c π) + log p(x y = c, θ c) given that y Cat(π) and x (y = c) N (µ c, Σ c) the decision rule becomes (dropping additive constants) [ ŷ(x) = argmin log π c + 1 c 2 log Σc + 1 ] 2 (x µc)t Σ 1 c (x µ c) which can be thought as a nearest centroid classifier in fact, with an uniform prior and Σ c = Σ ŷ(x) = argmin c (x µ c) T Σ 1 (x µ c) = argmin c x µ c 2 Σ in this case, we select the class c whose center µ c is closest to x (using the Mahalanobis distance x µ c Σ ) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 13 / 42

Mahalanobis Distance the covariance matrix Σ can be diagonalized since it is a symmetric real matrix Σ = UDU T = D λ i u i u T i where U = [u 1,..., u D ] is an orthonormal matrix of eigenvectors (i.e. U T U = I) and λ i are the corresponding eigenvalues (λ i 0 since Σ 0) one has immediately Σ 1 = UD 1 U T = D 1 i=1 u i u T i λ i ( ) 1/2 the Mahalanobis distance is defined as x µ Σ (x µ) T Σ 1 (x µ) one can rewrite i=1 (x µ) T Σ 1 (x µ) = (x µ) T ( D = D i=1 i=1 1 λ i (x µ) T u i u T i (x µ) = where y i u T i (x µ) (or equivalently y U T (x µ)) ) 1 u i u T i (x µ) = λ i Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 14 / 42 D i=1 y 2 i λ i

Mahalanobis Distance Σ = UDU T = D i=1 λ iu i u T i (x µ) T Σ 1 (x µ) = D yi 2 i=1 (where y U T (x µ)) λ i (1) center w.r.t. µ (2) rotate by U T (3) get a norm weighted by the 1 λ i Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 15 / 42

Gaussian Discriminant Analysis GDA left: height/weight data for the two classes male/female right: visualization of 2D Gaussian fit to each class we can see that the features are correlated (tall people tend to weigh more ) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 16 / 42

Quadratic Discriminant Analysis QDA the complete class posterior with Gaussian densities is π c 2πΣ c 1/2 exp[ 1 2 p(y = c x, θ) = (x µc)t Σ 1 c (x µ c)] c π c 2πΣ c 1/2 exp[ 1 (x µ 2 c )T Σ 1 c (x µ c ) the quadratic decision boundaries can be found by imposing or equivalently p(y = c x, θ) = p(y = c x, θ) log p(y = c x, θ) = log p(y = c x, θ) for each pair of adjacent classes (c, c ), which results in the quadratic equation 1 2 (x µ c )T Σ 1 c (x µ c ) = 1 2 (x µ c )T Σ 1 c (x µ c ) + constant Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 18 / 42

Quadratic Discriminant Analysis QDA left: dataset with 2 classes right: dataset with 3 classes Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 19 / 42

Linear Discriminant Analysis LDA we now consider the GDA in the special case Σ c = Σ for c {1,..., C} in this case we have p(y = c x, θ) π c exp [ 1 ] 2 (x µc)t Σ 1 (x µ c) = [ = exp 1 ] 2 xt Σ 1 x exp [µ Tc Σ 1 x 12 ] µtc Σ 1 µ c + log π c note that the quadratic term 1 2 xt Σ 1 x is independent of c and it will cancel out in the numerator and denominator of the complete class posterior equation we define γ c 1 2 µt c Σ 1 µ c + log π c β c Σ 1 µ c we can rewrite p(y = c x, θ) = e βt c x+γc c e βt c x+γ c Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 21 / 42

Linear Discriminant Analysis LDA we have p(y = c x, θ) = e βt c x+γc c e βt c x+γ c S(η) c where η [β1 T x + γ 1,..., βc T x + γ C ] T R C and the function S(η) is the softmax function defined as [ e η 1 η S(η) c e η,..., e C ] T c c e η c and S(η) c R is just its c-th component the softmax function S(η) is so-called since it acts a bit like the max function. To see this, divide each component η c by a temperature T, then 1 if c = argmax η c S(η/T ) c = c as T 0 0 otherwise in other words, at low temperature S(η/T ) c returns the most probable state, whereas at high temperatures S(η/T ) c returns one of the states with a uniform probability (cfr. Bolzmann distribution in physics) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 22 / 42

Linear Discriminant Analysis Softmax softmax distribution S(η/T ), where η = [3, 0, 1] T, at different temperatures T when the temperature is high (left), the distribution is uniform, whereas when the temperature is low (right), the distribution is spiky, with all its mass on the largest element Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 23 / 42

Linear Discriminant Analysis LDA in order to find the decision boundaries we impose p(y = c x, θ) = p(y = c x, θ) which entails e βt c x+γc = e βt c x+γ c in this case, taking the logs returns βc T x + γ c = β T c x + γ c which in turn corresponds to a linear decision boundary 1 (β c β c ) T x = (γ c γ c ) 1 in D dimensions this corresponds to an hyperplane, in 3D to a plane, in 2D to a straight line Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 24 / 42

Linear Discriminant Analysis LDA left: dataset with 2 classes right: dataset with 3 classes Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 25 / 42

Linear Discriminant Analysis two-class LDA let us consider an LDA with just two classes (i.e. y {0, 1}) in this case e βt 1 x+γ 1 p(y = 1 x, θ) = e βt 1 x+γ 1 + e βt 0 x+γ 0 1 = 1 + e (β 0 β 1 ) T x+(γ 0 γ 1 ) that is p(y = 1 x, θ) = sigm((β 0 β 1) T x + (γ 0 γ 1)) where sigm(η) 1 is the sigmoid function (aka logistic function) 1+exp( η) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 26 / 42

Linear Discriminant Analysis two-class LDA the linear decision boundary is if we define (β 0 β 1) T x + (γ 0 γ 1) = 0 w β 1 β 0 = Σ 1 (µ 1 µ 0) x 0 1 2 (µ1 + µ0) (µ1 µ0) log(π 1/π 0) (µ 1 µ 0) T Σ 1 (µ 1 µ 0) we obtain w T x 0 = (γ 1 γ 0) the linear decision boundary can be rewritten as w T (x x 0) = 0 in fact we have p(y = 1 x, θ) = sigm(w T (x x 0)) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 27 / 42

Linear Discriminant Analysis two-class LDA we have w β 1 β 0 = Σ 1 (µ 1 µ 0) x 0 1 2 (µ1 + µ0) (µ1 µ0) log(π 1/π 0) (µ 1 µ 0) T Σ 1 (µ 1 µ 0) the linear decision boundary is w T (x x 0) = 0 in the case Σ 1 = Σ 2 = I and π 1 = π 0, one has w = µ 1 µ 0 and x 0 = 1 (µ1 + µ0) 2 Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 28 / 42

MLE for GDA how to fit the GDA model? the simplest way is to use MLE let s assume iid samples, then it is p(d θ) = N i=1 p(x i, y i θ) one has p(x i, y i θ) = p(x i y i, θ)p(y i π) p(x i y i, θ) = c N (x i µ c, Σ c) I(y i =c) p(y i π) = c π I(y i =c) c where θ is a compound parameter vector containing the parameters π, µ c and Σ c the log-likelihood function is [ N C ] C [ log p(d θ) = I(y i = c) log π c + i=1 c=1 c=1 i:y i =c ] log N (x i µ c, Σ c) which is the sum of C + 1 distinct terms: the first depending on π and the other C terms depending both on µ c and Σ c we can estimate each parameter by optimizing the log-likelihood separately w.r.t. it Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 30 / 42

MLE for GDA the log-likelihood function is [ N C ] C [ log p(d θ) = I(y i = c) log π c + i=1 c=1 c=1 for the class prior, as with the NBC model, we have ˆπ c = Nc N i:y i =c ] log N (x i µ c, Σ c) for the class conditional densities, we partition the data based on its class label, and compute the MLE for each Gaussian term ˆΣ c = 1 N c ˆµ c = 1 N c N c N c i:y i =c i:y i =c x i (x i ˆµ c)(x i ˆµ c) T Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 31 / 42

Posterior Predictive for GDA once the model is fit and the parameters are estimated we can make predictions by using a plug-in approximation p(y = c x, ˆθ) ˆπ c 2π ˆΣ c 1/2 exp[ 1 2 (x ˆµc)T ˆΣ 1 c (x ˆµ c)] Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 32 / 42

Overfitting for GDA the MLE is fast and simple, however it can badly overfit in high dimensions in particular, ˆΣ c = 1 N c N c i:y i =c (x i ˆµ c)(x i ˆµ c) T R D D is singular for N c < D even when N c > D, the MLE can be ill-conditioned (close to singular) possible simple strategies to solve this issue (they reduce the number of parameters) use NBC model/assumption (i.e. Σ c are diagonal) use LDA (i.e. Σ c = Σ) use diagonal LDA (i.e. Σ c = Σ = diag(σ1, 2..., σd)) 2 (following subsection) use Bayesian approach: estimate full covariance by imposing a prior and then integrating out (following subsection) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 33 / 42

Diagonal LDA the diagonal LDA assumes Σ c = Σ = diag(σ 2 1,..., σ 2 D) for c {1,..., C} one has p(x i, y i = c θ) = p(x i y i = c, θ c)p(y i = c π) = N (x i µ c, Σ)π c = and taking the logs log p(x i, y i = c θ) = D (x ij µ cj ) 2 + log π c j=1 typically the estimates of the parameters are ˆµ cj = 1 N c i:y i =c ˆσ 2 j = 1 N C x ij 2σ 2 j D N (x ij µ cj, σj 2 ) j=1 C (x ij ˆµ cj ) 2 (pooled empirical variance) c=1 i:y i =c in high-dimensional settings, this model can work much better than LDA and RDA Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 35 / 42

Bayesian Procedure we now follow the full Bayesian procedure to fit the GDA model let s restart from the expression of the posterior predictive PDF p(y = c x, D) = p(y = c, x D) p(x D) since we are interested in computing = p(x y = c, D)p(y = c D) p(x D) c = argmax c p(y = c x, D) we can neglect the constant p(x D) and use the following simpler expression p(y = c x, D) p(x y = c, D)p(y = c D) note that we didn t use the model parameters in the previous equation now we use the Bayesian procedure in which we integrate out the unknown parameters for simplicity we now consider a vector parameter π for the PMF p(y = c D) and a vector parameter θ c for the PDF p(x y = c, D) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 37 / 42

Bayesian Procedure as for the PMF p(y = c D) we can integrate out π as follows p(y = c D) = p(y = c, π D)dπ we know that y Cat(π) i.e. p(y π) = c πi(y=c) c we can decompose p(y = c, π D) as follows p(y = c, π D) = p(y = c π, D)p(π D) = p(y = c π)p(π D) = π cp(π D) where p(π D) is the posterior w.r.t. π using the previous equation in integral above we have p(y = c D) = p(y = c, π D)dπ = π cp(π D)dπ = E[π c D] = Nc + αc N + α 0 which is the posterior mean computed for the Dirichlet-multinomial model (cfr lecture 4 slides) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 38 / 42

Bayesian Procedure as for the PDF p(x y = c, D) we can integrate out θ c as follows p(x y = c, D) = p(x, θ c y = c, D)dθ c = p(x, θ c D c)dθ c where for simplicity we introduce D c {(x i, y i ) D y i = c} we know that p(x θ c) = N (x µ c, Σ c) where θ c = (µ c, Σ c) we can use the following decomposition p(x, θ c D c) = p(x θ c, D c)p(θ c D c) = p(x θ c)p(θ c D c) where p(θ c D c) is the posterior w.r.t. θ c hence one has p(x y = c, D) = p(x, θ c D c)dθ c = p(x θ c)p(θ c D c)dθ c = = N (x µ c, Σ c)p(µ c, Σ c D c)dµ cdσ c Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 39 / 42

Bayesian Procedure one has p(x y = c, D) = N (x µ c, Σ c)p(µ c, Σ c D c)dµ cdσ c the posterior is (see sect. 4.6.3.3 of the book) p(µ c, Σ c D c) = NIW(m c, Σ c m c N, κ c N, ν c N, S c N) then (see sect. 4.6.3.6) p(x y = c, D) = N (x µ c, Σ c)niw(µ c, Σ c m c N, κ c N, ν c N, S c N)dµ cdσ c = p(x y = c, D) = T (x m c N, κ c N + 1 κ c N (νc N D + 1) Sc N, ν c N D + 1) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 40 / 42

Bayesian Procedure let s summarize what we obtained by applying the Bayesian procedure we first found and then p(y = c D) = E[π c D] = p(x y = c, D) = T (x m c N, Nc + αc N + α 0 κ c N + 1 κ c N (νc N D + 1) Sc N, ν c N D + 1) then combining everything in the starting posterior predictive we have p(y = c x, D) p(x y = c, D)p(y = c D) = = E[π c D]T (x m c N, κ c N + 1 κ c N (νc N D + 1) Sc N, ν c N D + 1) Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 41 / 42

Credits Kevin Murphy s book Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 42 / 42

Lecture 5. Gaussian Models - Part 1. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. November 29, 2016