CS 3 Lec. 18: Multivariate Gaussian Distributions and Linear Discriminant Analysis
AD, March 11
Multivariate Gaussian

Consider data $\{x^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$, and assume they are independent and identically distributed. A standard pdf used to model multivariate real data is the multivariate Gaussian (or normal)
\[
p(x \mid \mu, \Sigma) = \mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\Big( -\tfrac{1}{2} \underbrace{(x-\mu)^T \Sigma^{-1} (x-\mu)}_{\text{Mahalanobis distance}} \Big).
\]
It can be shown that $\mu$ is the mean and $\Sigma$ is the covariance of $\mathcal{N}(x; \mu, \Sigma)$; i.e. $\mathbb{E}(X) = \mu$ and $\operatorname{cov}(X) = \Sigma$.
It will be used extensively in our discussion of unsupervised learning and can also be used for generative classifiers (i.e. discriminant analysis).
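As a sanity check, here is a minimal sketch (assuming NumPy/SciPy are available; the mean and covariance are made up) that evaluates the density directly from this formula and compares it against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    D = mu.shape[0]
    diff = x - mu
    # squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(gaussian_pdf(x, mu, Sigma))             # direct formula
print(multivariate_normal(mu, Sigma).pdf(x))  # SciPy reference, should agree
```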
Special Cases

When $D = 1$, we are back to
\[
p(x \mid \mu, \sigma^2) = \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{1}{2\sigma^2}(x-\mu)^2 \Big).
\]
When $D = 2$ and writing
\[
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
\]
where $\rho = \operatorname{corr}(X_1, X_2) \in [-1, 1]$, we have
\[
p(x \mid \mu, \Sigma) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left\{ \frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} \right\} \right).
\]
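A quick numerical check (made-up numbers, assuming NumPy/SciPy) that the $D = 2$ formula with correlation $\rho$ matches the general multivariate density:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2, s1, s2, rho = 0.0, 1.0, 1.5, 0.8, 0.6
Sigma = np.array([[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]])
x1, x2 = 0.3, 0.7

# the quadratic form inside the exponential of the bivariate formula
z = ((x1-mu1)/s1)**2 + ((x2-mu2)/s2)**2 - 2*rho*(x1-mu1)*(x2-mu2)/(s1*s2)
p_bivariate = np.exp(-z / (2*(1-rho**2))) / (2*np.pi*s1*s2*np.sqrt(1-rho**2))
p_general = multivariate_normal([mu1, mu2], Sigma).pdf([x1, x2])
print(p_bivariate, p_general)  # should agree
```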
Graphical Illustrations

[Figure: illustration of 2D Gaussian pdfs for different covariance matrices (left: full, middle: diagonal, right: spherical).]
Why are the contours of a multivariate Gaussian elliptical?

If we plot the values of $x$ such that $p(x \mid \mu, \Sigma)$ is equal to a constant, i.e. such that $(x-\mu)^T \Sigma^{-1} (x-\mu) = c > 0$ where $c$ is given, then we obtain an ellipse.
$\Sigma$ is a positive definite matrix, so we have $\Sigma = U \Lambda U^T$ where $U$ is an orthonormal matrix of eigenvectors, i.e. $U^T U = I$, and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_D)$ is the diagonal matrix of eigenvalues.
Hence, we have
\[
\Sigma^{-1} = (U^T)^{-1} \Lambda^{-1} U^{-1} = U \Lambda^{-1} U^T = \sum_{k=1}^D \frac{1}{\lambda_k} u_k u_k^T,
\]
so
\[
(x-\mu)^T \Sigma^{-1} (x-\mu) = \sum_{k=1}^D \frac{y_k^2}{\lambda_k}, \quad \text{where } y_k = u_k^T (x-\mu).
\]
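A quick numerical check of this identity (made-up numbers, assuming NumPy): diagonalize $\Sigma$ and verify that the Mahalanobis form equals $\sum_k y_k^2 / \lambda_k$.

```python
import numpy as np

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
x = np.array([0.5, 0.2])

lam, U = np.linalg.eigh(Sigma)      # Sigma = U diag(lam) U^T, columns of U are u_k
y = U.T @ (x - mu)                  # coordinates y_k = u_k^T (x - mu)
maha_eig = np.sum(y**2 / lam)       # sum_k y_k^2 / lambda_k
maha_dir = (x - mu) @ np.linalg.solve(Sigma, x - mu)
print(maha_eig, maha_dir)           # identical up to round-off
```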
Graphical Illustrations

[Figure: level sets of 2D Gaussian pdfs (the points $x$ where $p(x) = c$ for different values of $c$) for different covariance matrices. Left: a full covariance matrix has elliptical contours. Middle: a diagonal covariance matrix gives axis-aligned ellipses. Right: a spherical covariance matrix gives circular contours.]
Properties of Multivariate Gaussians

Marginalization is straightforward. Conditioning is easy; e.g. if $X = (X_1, X_2)$ with $p(x) = p(x_1, x_2) = \mathcal{N}(x; \mu, \Sigma)$ where
\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},
\]
then
\[
p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)} = \mathcal{N}(x_1; \mu_{1|2}, \Sigma_{1|2})
\]
with
\[
\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \quad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
\]
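A small sketch of Gaussian conditioning with scalar blocks (illustrative numbers, assuming NumPy), applying the formulas above:

```python
import numpy as np

mu = np.array([0.0, 1.0])                   # (mu_1, mu_2), both scalar here
Sigma = np.array([[2.0, 0.6], [0.6, 1.5]])  # blocks Sigma_11, Sigma_12, Sigma_21, Sigma_22
x2 = 2.0                                    # observed value of X_2

mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
Sigma_cond = Sigma[0, 0] - Sigma[0, 1] * Sigma[1, 0] / Sigma[1, 1]
print(mu_cond, Sigma_cond)                  # parameters of p(x_1 | x_2)
```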
Independence and Correlation for Gaussian Variables

It is well known that independence implies uncorrelatedness; i.e. if the components $(X_1, \ldots, X_D)$ of a vector $X$ are independent, then they are uncorrelated. However, uncorrelated does not imply independent in the general case.
If the components $(X_1, \ldots, X_D)$ of a vector $X$ distributed according to a multivariate Gaussian are uncorrelated, then they are independent.
Proof. If $(X_1, \ldots, X_D)$ are uncorrelated then $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_D^2)$ and $|\Sigma| = \prod_{k=1}^D \sigma_k^2$, so
\[
p(x \mid \mu, \Sigma) = \prod_{k=1}^D p(x_k \mid \mu_k, \sigma_k^2) = \prod_{k=1}^D \mathcal{N}(x_k; \mu_k, \sigma_k^2).
\]
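A quick numerical illustration of the factorization (illustrative numbers, assuming NumPy/SciPy):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0.0, 1.0, -2.0])
sig = np.array([1.0, 0.5, 2.0])         # standard deviations
x = np.array([0.2, 0.9, -1.5])

joint = multivariate_normal(mu, np.diag(sig**2)).pdf(x)
product = np.prod(norm(mu, sig).pdf(x)) # prod_k N(x_k; mu_k, sigma_k^2)
print(joint, product)                   # equal
```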
ML Parameter Learning for Multivariate Gaussian

Consider data $\{x^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$, and assume they are independent and identically distributed from $\mathcal{N}(x; \mu, \Sigma)$. The ML parameter estimates of $(\mu, \Sigma)$ by definition maximize
\[
\sum_{i=1}^N \log \mathcal{N}(x^i; \mu, \Sigma) = -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x^i - \mu)^T \Sigma^{-1} (x^i - \mu).
\]
We obtain, after painful calculations, the fairly intuitive results
\[
\widehat{\mu} = \frac{\sum_{i=1}^N x^i}{N}, \quad \widehat{\Sigma} = \frac{\sum_{i=1}^N (x^i - \widehat{\mu})(x^i - \widehat{\mu})^T}{N}.
\]
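A minimal sketch of these estimators (assuming NumPy; the data are simulated purely for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)  # N x D data matrix

mu_hat = X.mean(axis=0)                         # sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / X.shape[0]  # ML estimate: 1/N, not 1/(N-1)
print(mu_hat)
print(Sigma_hat)                                # close to the true parameters
```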
Application to Supervised Learning using Bayes Classifier

Assume you are given some training data $\{x^i, y^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$ and $y^i \in \{1, 2, \ldots, C\}$ can take $C$ different values. Given an input test datum $x$, you want to predict/estimate the output $y$ associated to $x$.
Previously we have followed a probabilistic approach
\[
p(y = c \mid x) = \frac{p(y = c)\, p(x \mid y = c)}{\sum_{c'=1}^C p(y = c')\, p(x \mid y = c')}.
\]
This requires modelling and learning the parameters of the class conditional density of features $p(x \mid y = c)$.
Height Weight Data

[Figure: (left) height/weight data, red = female, blue = male; (right) 2D Gaussians fit to each class; 95% of the probability mass is inside each ellipse.]
Supervised Learning using Bayes Classifier

Assume we pick
\[
p(x \mid y = c) = \mathcal{N}(x; \mu_c, \Sigma_c)
\]
and $p(y = c) = \pi_c$; then
\[
p(y = c \mid x) \propto \pi_c\, |\Sigma_c|^{-1/2} \exp\big( -\tfrac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) \big)
= \exp\big( \mu_c^T \Sigma_c^{-1} x - \tfrac{1}{2}\mu_c^T \Sigma_c^{-1}\mu_c + \log \pi_c \big)\, |\Sigma_c|^{-1/2} \exp\big( -\tfrac{1}{2} x^T \Sigma_c^{-1} x \big).
\]
For models where $\Sigma_c = \Sigma$, the factors $|\Sigma|^{-1/2}\exp(-\tfrac{1}{2}x^T\Sigma^{-1}x)$ are common to all classes and cancel in the normalization; this is known as linear discriminant analysis:
\[
p(y = c \mid x) = \frac{\exp\big( \beta_c^T x + \gamma_c \big)}{\sum_{c'=1}^C \exp\big( \beta_{c'}^T x + \gamma_{c'} \big)}
\]
where $\beta_c = \Sigma^{-1}\mu_c$, $\gamma_c = -\tfrac{1}{2}\mu_c^T \Sigma^{-1}\mu_c + \log \pi_c$, and the model is very similar to logistic regression.
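Below is a minimal LDA sketch following these formulas (assuming NumPy; `fit_lda` and `predict_proba` are hypothetical helper names, not a library API, and the demo data are simulated):

```python
import numpy as np

def fit_lda(X, y, C):
    """ML fit of class priors, class means, and a shared (pooled) covariance."""
    N = X.shape[0]
    pis = np.array([(y == c).mean() for c in range(C)])
    mus = np.array([X[y == c].mean(axis=0) for c in range(C)])
    Sigma = sum((X[y == c] - mus[c]).T @ (X[y == c] - mus[c])
                for c in range(C)) / N
    betas = np.linalg.solve(Sigma, mus.T).T              # row c is beta_c = Sigma^{-1} mu_c
    gammas = -0.5 * np.sum(mus * betas, axis=1) + np.log(pis)
    return betas, gammas

def predict_proba(X, betas, gammas):
    """Softmax over the linear scores beta_c^T x + gamma_c."""
    scores = X @ betas.T + gammas                        # shape (N, C)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.repeat([0, 1], 100)
betas, gammas = fit_lda(X, y, C=2)
print(predict_proba(np.array([[1.0, 1.0]]), betas, gammas))
```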
Decision Boundaries

[Figure: decision boundaries in 2D for the 2- and 3-class cases. Panels: a linear boundary; all linear boundaries; a parabolic boundary; some linear, some quadratic boundaries.]

We can write
\[
p(y = c \mid x, \theta) = \frac{e^{\beta_c^T x + \gamma_c}}{\sum_{c'} e^{\beta_{c'}^T x + \gamma_{c'}}},
\]
which is the softmax function; this is equivalent in form to multinomial logistic regression. The decision boundary between two classes, say $c$ and $c'$, is the set of points $x$ for which $p(y = c \mid x) = p(y = c' \mid x)$.
Binary Classification

Consider the case where $C = 2$; then one can check that
\[
p(y = 1 \mid x) = g\big( (\beta_1 - \beta_2)^T x + (\gamma_1 - \gamma_2) \big),
\]
where $g(z) = 1/(1 + e^{-z})$ is the sigmoid function. We have
\[
\gamma_1 - \gamma_2 = -\tfrac{1}{2}(\mu_1 - \mu_2)^T \Sigma^{-1}(\mu_1 + \mu_2) + \log(\pi_1/\pi_2).
\]
Setting
\[
x_0 := \tfrac{1}{2}(\mu_1 + \mu_2) - \frac{(\mu_1 - \mu_2)\log(\pi_1/\pi_2)}{(\mu_1 - \mu_2)^T \Sigma^{-1}(\mu_1 - \mu_2)}, \quad
w := \beta_1 - \beta_2 = \Sigma^{-1}(\mu_1 - \mu_2),
\]
then
\[
p(y = 1 \mid x) = g\big( w^T (x - x_0) \big):
\]
$x$ is shifted by $x_0$ and then projected onto the line $w$.
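A small numerical sketch of this geometry (illustrative parameters, assuming NumPy): compute $w$ and $x_0$ from the formulas above and cross-check $g(w^T(x - x_0))$ against the softmax form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.5, 0.2], [0.2, 1.0]])
pi1, pi2 = 0.3, 0.7

w = np.linalg.solve(Sigma, mu1 - mu2)   # w = Sigma^{-1} (mu1 - mu2)
x0 = 0.5 * (mu1 + mu2) - (mu1 - mu2) * np.log(pi1 / pi2) / ((mu1 - mu2) @ w)

x = np.array([0.4, -0.2])
p1_shifted = sigmoid(w @ (x - x0))      # g(w^T (x - x0))

# cross-check against the softmax form with beta_c, gamma_c
b1, b2 = np.linalg.solve(Sigma, mu1), np.linalg.solve(Sigma, mu2)
g1 = -0.5 * mu1 @ b1 + np.log(pi1)
g2 = -0.5 * mu2 @ b2 + np.log(pi2)
p1_softmax = np.exp(b1 @ x + g1) / (np.exp(b1 @ x + g1) + np.exp(b2 @ x + g2))
print(p1_shifted, p1_softmax)           # should agree
```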
Binary Classification

[Figure: geometry of LDA in the 2-class case where $\Sigma_1 = \Sigma_2 = I$.]

Example where $\Sigma = \sigma^2 I$, so $w$ is in the direction of $\mu_1 - \mu_2$.
Generative or Discriminative Classifiers

When the model for the class conditional densities is correct (resp. incorrect), generative classifiers will typically outperform (resp. underperform) discriminative classifiers for large enough datasets.
Generative classifiers can be difficult to learn, whereas discriminative classifiers try to learn directly the posterior probability of interest.
Generative classifiers can handle missing data easily; discriminative methods cannot.
Discriminative classifiers can be more flexible; e.g. substitute $\Phi(x)$ for $x$.
Application: Bankruptcy Data

[Figure: discriminant analysis on the bankruptcy data set (classes: Bankrupt, Solvent), with Gaussian class conditional densities estimated for each class. Labels are assigned based on the posterior probability of belonging to each class; misclassified points are shown in red, correctly classified points in blue, and training data in black (nerr = 3).]