Lecture 5. Gaussian Models - Part 1. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. November 29, 2016



Outline

1. Basics
   - Multivariate Gaussian
2. MLE for an MVN
   - Theorem
3. Gaussian Discriminant Analysis
   - Generative Classifiers
   - Gaussian Discriminant Analysis (GDA)
   - Quadratic Discriminant Analysis (QDA)
   - Linear Discriminant Analysis (LDA)
   - MLE for Gaussian Discriminant Analysis
   - Diagonal LDA
   - Bayesian Procedure


Univariate Gaussian (Normal) Distribution

- $X$ is a continuous RV with values $x \in \mathbb{R}$
- $X \sim \mathcal{N}(\mu, \sigma^2)$, i.e. $X$ has a Gaussian distribution or normal distribution, with pdf
$$\mathcal{N}(x \mid \mu, \sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2} \quad \big(= p_X(X = x)\big)$$
- mean: $\mathbb{E}[X] = \mu$
- mode: $\mu$
- variance: $\mathrm{var}[X] = \sigma^2$
- precision: $\lambda = 1/\sigma^2$
- $(\mu - 2\sigma,\ \mu + 2\sigma)$ is the approximate 95% interval
- $(\mu - 3\sigma,\ \mu + 3\sigma)$ is the approximate 99.7% interval
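A minimal NumPy sketch (the function name `normal_pdf` is just an illustrative choice) that evaluates the pdf above and numerically checks the approximate 95% interval:

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

# numerically integrate the pdf over (mu - 2*sigma, mu + 2*sigma)
mu, sigma = 1.0, 2.0
x = np.linspace(mu - 2 * sigma, mu + 2 * sigma, 20001)
mass = np.sum(normal_pdf(x, mu, sigma ** 2)) * (x[1] - x[0])
print(mass)  # ~0.954
```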

Multivariate Gaussian (Normal) Distribution

- $X$ is a continuous RV with values $x \in \mathbb{R}^D$
- $X \sim \mathcal{N}(\mu, \Sigma)$, i.e. $X$ has a Multivariate Normal distribution (MVN), or multivariate Gaussian, with pdf
$$\mathcal{N}(x \mid \mu, \Sigma) \triangleq \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]$$
- mean: $\mathbb{E}[x] = \mu$
- mode: $\mu$
- covariance matrix: $\mathrm{cov}[x] = \Sigma \in \mathbb{R}^{D \times D}$, where $\Sigma = \Sigma^T$ and $\Sigma \succ 0$
- precision matrix: $\Lambda \triangleq \Sigma^{-1}$
- spherical/isotropic covariance: $\Sigma = \sigma^2 I_D$
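A corresponding sketch for evaluating the MVN density, assuming plain NumPy (the helper name `mvn_pdf` is illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian N(x | mu, Sigma)."""
    D = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)                  # (x-mu)^T Sigma^{-1} (x-mu)
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, -0.2]), mu, Sigma))
```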


MLE for an MVN: Theorem

Theorem 1. If we have $N$ iid samples $x_i \sim \mathcal{N}(\mu, \Sigma)$, then the MLE for the parameters is given by

1. $\hat{\mu}_{MLE} = \frac{1}{N}\sum_{i=1}^N x_i \triangleq \bar{x}$
2. $\hat{\Sigma}_{MLE} = \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{N}\left(\sum_{i=1}^N x_i x_i^T\right) - \bar{x}\bar{x}^T$

- this theorem states that the MLE parameter estimates for an MVN are just the empirical mean and the empirical covariance
- in the univariate case, one has
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^N x_i = \bar{x} \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^2 = \frac{1}{N}\left(\sum_{i=1}^N x_i^2\right) - \bar{x}^2$$
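A quick numerical illustration of the theorem (a sketch, not part of the original slides): draw samples from a known MVN and compute the empirical mean and covariance as the MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)   # N x D data matrix

mu_mle = X.mean(axis=0)                                        # empirical mean
centered = X - mu_mle
Sigma_mle = centered.T @ centered / X.shape[0]                 # empirical covariance (divide by N, not N-1)

print(mu_mle)
print(Sigma_mle)
```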

MLE for an MVN: proof sketch

- in order to find the MLE, one should maximize the log-likelihood of the dataset; given that $x_i \sim \mathcal{N}(\mu, \Sigma)$,
$$p(\mathcal{D} \mid \mu, \Sigma) = \prod_i \mathcal{N}(x_i \mid \mu, \Sigma)$$
- the log-likelihood (dropping additive constants) is
$$\ell(\mu, \Sigma) = \log p(\mathcal{D} \mid \mu, \Sigma) = \frac{N}{2}\log|\Lambda| - \frac{1}{2}\sum_i (x_i - \mu)^T \Lambda (x_i - \mu) + \text{const}$$
- the MLE estimates can be obtained by maximizing $\ell(\mu, \Sigma)$ w.r.t. $\mu$ and $\Sigma$
- homework: continue the proof for the univariate case


Generative Classifiers

probabilistic classifiers
- we are given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$
- the goal is to compute the class posterior $p(y = c \mid x)$, which models the mapping $y = f(x)$

generative classifiers
- $p(y = c \mid x)$ is computed starting from the class-conditional density $p(x \mid y = c, \theta)$ and the class prior $p(y = c \mid \theta)$, given that
$$p(y = c \mid x, \theta) \propto p(x \mid y = c, \theta)\, p(y = c \mid \theta) \quad \big(= p(y = c, x \mid \theta)\big)$$
- this is called a generative classifier since it specifies how to generate the feature vector $x$ for each class $y = c$ (by using $p(x \mid y = c, \theta)$)
- the model is usually fit by maximizing the joint log-likelihood, i.e. one computes $\hat{\theta} = \arg\max_\theta \sum_i \log p(y_i, x_i \mid \theta)$

discriminative classifiers
- the model $p(y = c \mid x)$ is directly fit to the data
- the model is usually fit by maximizing the conditional log-likelihood, i.e. one computes $\hat{\theta} = \arg\max_\theta \sum_i \log p(y_i \mid x_i, \theta)$
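As a sketch of the generative recipe (the names `class_cond` and `prior` are illustrative assumptions, not notation from the lecture), the class posterior follows from the class-conditional densities and the prior by normalizing the joint:

```python
import numpy as np

def class_posterior(x, class_cond, prior):
    """p(y=c | x) from class-conditional densities p(x | y=c) and prior p(y=c).

    class_cond: list of callables, class_cond[c](x) returns the density p(x | y=c)
    prior:      array of class priors p(y=c), summing to 1
    """
    joint = np.array([p_c * cc(x) for cc, p_c in zip(class_cond, prior)])  # p(y=c, x)
    return joint / joint.sum()                                             # normalize over c
```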


Gaussian Discriminant Analysis (GDA)

- we can use the MVN for defining the class-conditional densities in a generative classifier
$$p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \Sigma_c) \quad \text{for } c \in \{1, ..., C\}$$
- this means the samples of each class $c$ are characterized by a normal distribution
- this model is called Gaussian Discriminant Analysis (GDA), but note that it is a generative classifier (not a discriminative one)
- in the case $\Sigma_c$ is diagonal for each $c$, this model is equivalent to a Naive Bayes Classifier (NBC), since
$$p(x \mid y = c, \theta) = \prod_{j=1}^D \mathcal{N}(x_j \mid \mu_{jc}, \sigma_{jc}^2) \quad \text{for } c \in \{1, ..., C\}$$
- once the model is fit to the data, we can classify a feature vector by using the decision rule
$$\hat{y}(x) = \operatorname*{argmax}_c\, \log p(y = c \mid x, \theta) = \operatorname*{argmax}_c \Big[\log p(y = c \mid \pi) + \log p(x \mid y = c, \theta_c)\Big]$$
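A possible implementation sketch of this decision rule, assuming the per-class parameters are already available (function and variable names are illustrative):

```python
import numpy as np

def gda_predict(x, pis, mus, Sigmas):
    """argmax_c [ log pi_c + log N(x | mu_c, Sigma_c) ] (class-independent constants dropped)."""
    scores = []
    for pi_c, mu_c, Sigma_c in zip(pis, mus, Sigmas):
        diff = x - mu_c
        maha = diff @ np.linalg.solve(Sigma_c, diff)       # Mahalanobis term
        log_det = np.linalg.slogdet(Sigma_c)[1]            # log |Sigma_c|
        scores.append(np.log(pi_c) - 0.5 * log_det - 0.5 * maha)
    return int(np.argmax(scores))
```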

Gaussian Discriminant Analysis (GDA)

- decision rule
$$\hat{y}(x) = \operatorname*{argmax}_c \Big[\log p(y = c \mid \pi) + \log p(x \mid y = c, \theta_c)\Big]$$
- given that $y \sim \mathrm{Cat}(\pi)$ and $x \mid (y = c) \sim \mathcal{N}(\mu_c, \Sigma_c)$, the decision rule becomes (dropping additive constants)
$$\hat{y}(x) = \operatorname*{argmin}_c \Big[-\log \pi_c + \frac{1}{2}\log|\Sigma_c| + \frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1}(x - \mu_c)\Big]$$
which can be thought of as a nearest centroid classifier
- in fact, with a uniform prior and $\Sigma_c = \Sigma$,
$$\hat{y}(x) = \operatorname*{argmin}_c\, (x - \mu_c)^T \Sigma^{-1}(x - \mu_c) = \operatorname*{argmin}_c\, \|x - \mu_c\|_\Sigma^2$$
- in this case, we select the class $c$ whose center $\mu_c$ is closest to $x$ (using the Mahalanobis distance $\|x - \mu_c\|_\Sigma$)

Mahalanobis Distance

- the covariance matrix $\Sigma$ can be diagonalized since it is a symmetric real matrix
$$\Sigma = U D U^T = \sum_{i=1}^D \lambda_i u_i u_i^T$$
where $U = [u_1, ..., u_D]$ is an orthonormal matrix of eigenvectors (i.e. $U^T U = I$) and the $\lambda_i$ are the corresponding eigenvalues ($\lambda_i > 0$ since $\Sigma \succ 0$)
- one has immediately
$$\Sigma^{-1} = U D^{-1} U^T = \sum_{i=1}^D \frac{1}{\lambda_i} u_i u_i^T$$
- the Mahalanobis distance is defined as $\|x - \mu\|_\Sigma \triangleq \big((x - \mu)^T \Sigma^{-1} (x - \mu)\big)^{1/2}$
- one can rewrite
$$(x - \mu)^T \Sigma^{-1} (x - \mu) = (x - \mu)^T \left(\sum_{i=1}^D \frac{1}{\lambda_i} u_i u_i^T\right)(x - \mu) = \sum_{i=1}^D \frac{1}{\lambda_i} (x - \mu)^T u_i u_i^T (x - \mu) = \sum_{i=1}^D \frac{y_i^2}{\lambda_i}$$
where $y_i \triangleq u_i^T (x - \mu)$ (or equivalently $y \triangleq U^T (x - \mu)$)
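The equivalence between the direct computation and the eigendecomposition form can be checked numerically; a small sketch with assumed example values:

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

# direct squared Mahalanobis distance
diff = x - mu
d2_direct = diff @ np.linalg.solve(Sigma, diff)

# via the eigendecomposition Sigma = U diag(lam) U^T
lam, U = np.linalg.eigh(Sigma)          # eigh: for symmetric matrices
y = U.T @ diff                          # rotate the centered vector
d2_eig = np.sum(y ** 2 / lam)           # sum_i y_i^2 / lambda_i

print(d2_direct, d2_eig)                # the two agree
```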

Mahalanobis Distance

$$\Sigma = U D U^T = \sum_{i=1}^D \lambda_i u_i u_i^T \qquad (x - \mu)^T \Sigma^{-1} (x - \mu) = \sum_{i=1}^D \frac{y_i^2}{\lambda_i} \quad \text{where } y \triangleq U^T(x - \mu)$$

Computing the Mahalanobis distance thus amounts to: (1) center w.r.t. $\mu$, (2) rotate by $U^T$, (3) take a norm weighted by the $1/\lambda_i$.

Gaussian Discriminant Analysis (GDA)

- left: height/weight data for the two classes male/female
- right: visualization of the 2D Gaussian fit to each class
- we can see that the features are correlated (tall people tend to weigh more)


Quadratic Discriminant Analysis (QDA)

- the complete class posterior with Gaussian class-conditional densities is
$$p(y = c \mid x, \theta) = \frac{\pi_c\, |2\pi\Sigma_c|^{-1/2} \exp\left[-\frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)\right]}{\sum_{c'} \pi_{c'}\, |2\pi\Sigma_{c'}|^{-1/2} \exp\left[-\frac{1}{2}(x - \mu_{c'})^T \Sigma_{c'}^{-1} (x - \mu_{c'})\right]}$$
- the quadratic decision boundaries can be found by imposing
$$p(y = c \mid x, \theta) = p(y = c' \mid x, \theta)$$
or equivalently $\log p(y = c \mid x, \theta) = \log p(y = c' \mid x, \theta)$, for each pair of adjacent classes $(c, c')$, which results in the quadratic equation
$$\frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) = \frac{1}{2}(x - \mu_{c'})^T \Sigma_{c'}^{-1} (x - \mu_{c'}) + \text{constant}$$
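To make the quadratic nature of the boundary explicit, the log-posterior difference between two classes $c$ and $c'$ can be expanded as $x^T A x + b^T x + k$; a hedged sketch (the helper name is illustrative, and class-independent additive constants are dropped as on the slide):

```python
import numpy as np

def qda_boundary_coeffs(pi_c, mu_c, Sigma_c, pi_d, mu_d, Sigma_d):
    """Coefficients (A, b, k) such that the c-vs-d boundary is x^T A x + b^T x + k = 0."""
    Lc, Ld = np.linalg.inv(Sigma_c), np.linalg.inv(Sigma_d)     # precision matrices
    A = -0.5 * (Lc - Ld)                                         # quadratic term
    b = Lc @ mu_c - Ld @ mu_d                                    # linear term
    k = (np.log(pi_c) - np.log(pi_d)
         - 0.5 * (np.linalg.slogdet(Sigma_c)[1] - np.linalg.slogdet(Sigma_d)[1])
         - 0.5 * (mu_c @ Lc @ mu_c - mu_d @ Ld @ mu_d))          # constant term
    return A, b, k
```

When $\Sigma_c = \Sigma_{c'}$ the quadratic term $A$ vanishes, which is exactly the LDA case of the next slides.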

Quadratic Discriminant Analysis (QDA)

- left: dataset with 2 classes
- right: dataset with 3 classes


Linear Discriminant Analysis (LDA)

- we now consider GDA in the special case $\Sigma_c = \Sigma$ for $c \in \{1, ..., C\}$
- in this case we have
$$p(y = c \mid x, \theta) \propto \pi_c \exp\left[-\frac{1}{2}(x - \mu_c)^T \Sigma^{-1} (x - \mu_c)\right] = \exp\left[-\frac{1}{2} x^T \Sigma^{-1} x\right] \exp\left[\mu_c^T \Sigma^{-1} x - \frac{1}{2}\mu_c^T \Sigma^{-1} \mu_c + \log \pi_c\right]$$
- note that the quadratic term $-\frac{1}{2} x^T \Sigma^{-1} x$ is independent of $c$ and it will cancel out in the numerator and denominator of the complete class posterior equation
- if we define
$$\gamma_c \triangleq -\frac{1}{2}\mu_c^T \Sigma^{-1} \mu_c + \log \pi_c \qquad \beta_c \triangleq \Sigma^{-1} \mu_c$$
we can rewrite
$$p(y = c \mid x, \theta) = \frac{e^{\beta_c^T x + \gamma_c}}{\sum_{c'} e^{\beta_{c'}^T x + \gamma_{c'}}}$$
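A sketch of how $\beta_c$, $\gamma_c$ and the resulting posterior could be computed (plain NumPy; the function names are illustrative):

```python
import numpy as np

def lda_params(pis, mus, Sigma):
    """beta_c = Sigma^{-1} mu_c,  gamma_c = -0.5 mu_c^T Sigma^{-1} mu_c + log pi_c."""
    betas = [np.linalg.solve(Sigma, mu_c) for mu_c in mus]
    gammas = [-0.5 * mu_c @ beta_c + np.log(pi_c)
              for mu_c, beta_c, pi_c in zip(mus, betas, pis)]
    return np.array(betas), np.array(gammas)

def lda_posterior(x, betas, gammas):
    """p(y=c | x) as a softmax over eta_c = beta_c^T x + gamma_c."""
    eta = betas @ x + gammas
    eta -= eta.max()                      # for numerical stability
    p = np.exp(eta)
    return p / p.sum()
```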

Linear Discriminant Analysis (LDA)

- we have
$$p(y = c \mid x, \theta) = \frac{e^{\beta_c^T x + \gamma_c}}{\sum_{c'} e^{\beta_{c'}^T x + \gamma_{c'}}} = \mathcal{S}(\eta)_c$$
where $\eta \triangleq [\beta_1^T x + \gamma_1, ..., \beta_C^T x + \gamma_C]^T \in \mathbb{R}^C$ and the function $\mathcal{S}(\eta)$ is the softmax function, defined as
$$\mathcal{S}(\eta) \triangleq \left[\frac{e^{\eta_1}}{\sum_{c'} e^{\eta_{c'}}}, ..., \frac{e^{\eta_C}}{\sum_{c'} e^{\eta_{c'}}}\right]^T$$
and $\mathcal{S}(\eta)_c \in \mathbb{R}$ is just its $c$-th component
- the softmax function $\mathcal{S}(\eta)$ is so called since it acts a bit like the max function. To see this, divide each component $\eta_c$ by a temperature $T$; then, as $T \to 0$,
$$\mathcal{S}(\eta/T)_c \to \begin{cases} 1 & \text{if } c = \operatorname{argmax}_{c'} \eta_{c'} \\ 0 & \text{otherwise} \end{cases}$$
- in other words, at low temperature $\mathcal{S}(\eta/T)$ puts all its mass on the most probable state, whereas at high temperature it spreads the mass over the states almost uniformly (cf. the Boltzmann distribution in physics)
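A small sketch reproducing the temperature behaviour for $\eta = [3, 0, 1]^T$ (the same values used in the figure on the next slide):

```python
import numpy as np

def softmax(eta, T=1.0):
    """Softmax of eta / T, computed in a numerically stable way."""
    z = np.asarray(eta, dtype=float) / T
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

eta = np.array([3.0, 0.0, 1.0])
print(softmax(eta, T=100.0))   # high temperature: nearly uniform
print(softmax(eta, T=1.0))     # moderate temperature
print(softmax(eta, T=0.1))     # low temperature: almost all mass on the largest element
```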

Linear Discriminant Analysis: Softmax

- softmax distribution $\mathcal{S}(\eta/T)$, where $\eta = [3, 0, 1]^T$, at different temperatures $T$
- when the temperature is high (left), the distribution is roughly uniform, whereas when the temperature is low (right), the distribution is spiky, with all its mass on the largest element

Linear Discriminant Analysis (LDA)

- in order to find the decision boundaries we impose $p(y = c \mid x, \theta) = p(y = c' \mid x, \theta)$, which entails
$$e^{\beta_c^T x + \gamma_c} = e^{\beta_{c'}^T x + \gamma_{c'}}$$
- in this case, taking logs returns $\beta_c^T x + \gamma_c = \beta_{c'}^T x + \gamma_{c'}$, which in turn corresponds to a linear decision boundary¹
$$(\beta_c - \beta_{c'})^T x = -(\gamma_c - \gamma_{c'})$$

¹ in $D$ dimensions this corresponds to a hyperplane, in 3D to a plane, in 2D to a straight line

Linear Discriminant Analysis (LDA)

- left: dataset with 2 classes
- right: dataset with 3 classes

Linear Discriminant Analysis: two-class LDA

- let us consider an LDA with just two classes (i.e. $y \in \{0, 1\}$); in this case
$$p(y = 1 \mid x, \theta) = \frac{e^{\beta_1^T x + \gamma_1}}{e^{\beta_1^T x + \gamma_1} + e^{\beta_0^T x + \gamma_0}} = \frac{1}{1 + e^{-\left[(\beta_1 - \beta_0)^T x + (\gamma_1 - \gamma_0)\right]}}$$
- that is
$$p(y = 1 \mid x, \theta) = \operatorname{sigm}\big((\beta_1 - \beta_0)^T x + (\gamma_1 - \gamma_0)\big)$$
where $\operatorname{sigm}(\eta) \triangleq \frac{1}{1 + \exp(-\eta)}$ is the sigmoid function (aka logistic function)

Linear Discriminant Analysis: two-class LDA

- the linear decision boundary is
$$(\beta_1 - \beta_0)^T x + (\gamma_1 - \gamma_0) = 0$$
- if we define
$$w \triangleq \beta_1 - \beta_0 = \Sigma^{-1}(\mu_1 - \mu_0) \qquad x_0 \triangleq \frac{1}{2}(\mu_1 + \mu_0) - (\mu_1 - \mu_0)\,\frac{\log(\pi_1/\pi_0)}{(\mu_1 - \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0)}$$
we obtain $w^T x_0 = -(\gamma_1 - \gamma_0)$
- the linear decision boundary can then be rewritten as $w^T(x - x_0) = 0$; in fact we have
$$p(y = 1 \mid x, \theta) = \operatorname{sigm}\big(w^T (x - x_0)\big)$$
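A sketch computing $w$ and $x_0$ and evaluating the resulting posterior (illustrative values; with identity covariance and equal priors it reproduces the special case discussed on the next slide):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_class_lda(mu0, mu1, Sigma, pi0, pi1):
    """Return (w, x0) such that p(y=1 | x) = sigm(w^T (x - x0))."""
    diff = mu1 - mu0
    w = np.linalg.solve(Sigma, diff)                              # w = Sigma^{-1} (mu1 - mu0)
    x0 = 0.5 * (mu1 + mu0) - diff * np.log(pi1 / pi0) / (diff @ w)
    return w, x0

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.eye(2)
w, x0 = two_class_lda(mu0, mu1, Sigma, 0.5, 0.5)
print(w, x0)                       # equal priors, identity covariance: w = mu1 - mu0, x0 = midpoint
x = np.array([1.0, 0.5])
print(sigmoid(w @ (x - x0)))       # 0.5: this x lies exactly on the decision boundary
```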

Linear Discriminant Analysis: two-class LDA

- we have
$$w \triangleq \beta_1 - \beta_0 = \Sigma^{-1}(\mu_1 - \mu_0) \qquad x_0 \triangleq \frac{1}{2}(\mu_1 + \mu_0) - (\mu_1 - \mu_0)\,\frac{\log(\pi_1/\pi_0)}{(\mu_1 - \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0)}$$
and the linear decision boundary is $w^T(x - x_0) = 0$
- in the case $\Sigma_1 = \Sigma_0 = I$ and $\pi_1 = \pi_0$, one has
$$w = \mu_1 - \mu_0 \qquad x_0 = \frac{1}{2}(\mu_1 + \mu_0)$$


MLE for GDA

- how to fit the GDA model? the simplest way is to use MLE
- let's assume iid samples; then $p(\mathcal{D} \mid \theta) = \prod_{i=1}^N p(x_i, y_i \mid \theta)$
- one has
$$p(x_i, y_i \mid \theta) = p(x_i \mid y_i, \theta)\, p(y_i \mid \pi) \qquad p(x_i \mid y_i, \theta) = \prod_c \mathcal{N}(x_i \mid \mu_c, \Sigma_c)^{\mathbb{I}(y_i = c)} \qquad p(y_i \mid \pi) = \prod_c \pi_c^{\mathbb{I}(y_i = c)}$$
where $\theta$ is a compound parameter vector containing the parameters $\pi$, $\mu_c$ and $\Sigma_c$
- the log-likelihood function is
$$\log p(\mathcal{D} \mid \theta) = \sum_{i=1}^N \sum_{c=1}^C \mathbb{I}(y_i = c) \log \pi_c + \sum_{c=1}^C \left[\sum_{i: y_i = c} \log \mathcal{N}(x_i \mid \mu_c, \Sigma_c)\right]$$
which is the sum of $C + 1$ distinct terms: the first depending on $\pi$ and the other $C$ terms depending both on $\mu_c$ and $\Sigma_c$
- we can estimate each parameter by optimizing the log-likelihood separately w.r.t. it

MLE for GDA

- the log-likelihood function is
$$\log p(\mathcal{D} \mid \theta) = \sum_{i=1}^N \sum_{c=1}^C \mathbb{I}(y_i = c) \log \pi_c + \sum_{c=1}^C \left[\sum_{i: y_i = c} \log \mathcal{N}(x_i \mid \mu_c, \Sigma_c)\right]$$
- for the class prior, as with the NBC model, we have
$$\hat{\pi}_c = \frac{N_c}{N}$$
- for the class-conditional densities, we partition the data based on its class label and compute the MLE for each Gaussian term
$$\hat{\mu}_c = \frac{1}{N_c}\sum_{i: y_i = c} x_i \qquad \hat{\Sigma}_c = \frac{1}{N_c}\sum_{i: y_i = c} (x_i - \hat{\mu}_c)(x_i - \hat{\mu}_c)^T$$
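These MLE formulas translate directly into code; a minimal sketch (the name `fit_gda` is an illustrative assumption):

```python
import numpy as np

def fit_gda(X, y, n_classes):
    """MLE for GDA: per-class priors, means and covariances."""
    N = X.shape[0]
    pis, mus, Sigmas = [], [], []
    for c in range(n_classes):
        Xc = X[y == c]                          # samples of class c
        Nc = Xc.shape[0]
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        Sigma_c = centered.T @ centered / Nc    # MLE covariance (divide by N_c)
        pis.append(Nc / N)
        mus.append(mu_c)
        Sigmas.append(Sigma_c)
    return np.array(pis), np.array(mus), np.array(Sigmas)
```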

Posterior Predictive for GDA

- once the model is fit and the parameters are estimated, we can make predictions by using a plug-in approximation
$$p(y = c \mid x, \hat{\theta}) \propto \hat{\pi}_c\, |2\pi\hat{\Sigma}_c|^{-1/2} \exp\left[-\frac{1}{2}(x - \hat{\mu}_c)^T \hat{\Sigma}_c^{-1} (x - \hat{\mu}_c)\right]$$

Overfitting for GDA

- the MLE is fast and simple, however it can badly overfit in high dimensions
- in particular, $\hat{\Sigma}_c = \frac{1}{N_c}\sum_{i: y_i = c}(x_i - \hat{\mu}_c)(x_i - \hat{\mu}_c)^T \in \mathbb{R}^{D \times D}$ is singular for $N_c < D$
- even when $N_c > D$, the MLE can be ill-conditioned (close to singular)
- possible simple strategies to solve this issue (they reduce the number of parameters):
  - use the NBC model/assumption (i.e. the $\Sigma_c$ are diagonal)
  - use LDA (i.e. $\Sigma_c = \Sigma$)
  - use diagonal LDA (i.e. $\Sigma_c = \Sigma = \mathrm{diag}(\sigma_1^2, ..., \sigma_D^2)$) (following subsection)
  - use a Bayesian approach: estimate the full covariance by imposing a prior and then integrating it out (following subsection)


Diagonal LDA

- the diagonal LDA model assumes $\Sigma_c = \Sigma = \mathrm{diag}(\sigma_1^2, ..., \sigma_D^2)$ for $c \in \{1, ..., C\}$
- one has
$$p(x_i, y_i = c \mid \theta) = p(x_i \mid y_i = c, \theta_c)\, p(y_i = c \mid \pi) = \mathcal{N}(x_i \mid \mu_c, \Sigma)\,\pi_c = \pi_c \prod_{j=1}^D \mathcal{N}(x_{ij} \mid \mu_{cj}, \sigma_j^2)$$
and taking the logs
$$\log p(x_i, y_i = c \mid \theta) = -\sum_{j=1}^D \frac{(x_{ij} - \mu_{cj})^2}{2\sigma_j^2} + \log \pi_c + \text{const}$$
- typically the estimates of the parameters are
$$\hat{\mu}_{cj} = \frac{1}{N_c}\sum_{i: y_i = c} x_{ij} \qquad \hat{\sigma}_j^2 = \frac{1}{N - C}\sum_{c=1}^C \sum_{i: y_i = c} (x_{ij} - \hat{\mu}_{cj})^2 \quad \text{(pooled empirical variance)}$$
- in high-dimensional settings, this model can work much better than LDA and RDA
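A sketch of these estimates, in particular the pooled empirical variance (names illustrative):

```python
import numpy as np

def fit_diagonal_lda(X, y, n_classes):
    """Per-class means and a single pooled diagonal variance vector."""
    N, D = X.shape
    mus = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    pooled_sq = np.zeros(D)
    for c in range(n_classes):
        pooled_sq += ((X[y == c] - mus[c]) ** 2).sum(axis=0)
    sigma2 = pooled_sq / (N - n_classes)        # pooled empirical variance, per feature
    return mus, sigma2
```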


Bayesian Procedure

- we now follow the full Bayesian procedure to fit the GDA model
- let's restart from the expression of the posterior predictive pdf
$$p(y = c \mid x, \mathcal{D}) = \frac{p(y = c, x \mid \mathcal{D})}{p(x \mid \mathcal{D})} = \frac{p(x \mid y = c, \mathcal{D})\, p(y = c \mid \mathcal{D})}{p(x \mid \mathcal{D})}$$
- since we are interested in computing
$$c^* = \operatorname*{argmax}_c\, p(y = c \mid x, \mathcal{D})$$
we can neglect the constant $p(x \mid \mathcal{D})$ and use the following simpler expression
$$p(y = c \mid x, \mathcal{D}) \propto p(x \mid y = c, \mathcal{D})\, p(y = c \mid \mathcal{D})$$
- note that we didn't use the model parameters in the previous equation
- now we use the Bayesian procedure, in which we integrate out the unknown parameters
- for simplicity we now consider a vector parameter $\pi$ for the pmf $p(y = c \mid \mathcal{D})$ and a vector parameter $\theta_c$ for the pdf $p(x \mid y = c, \mathcal{D})$

Bayesian Procedure

- as for the pmf $p(y = c \mid \mathcal{D})$, we can integrate out $\pi$ as follows
$$p(y = c \mid \mathcal{D}) = \int p(y = c, \pi \mid \mathcal{D})\, d\pi$$
- we know that $y \sim \mathrm{Cat}(\pi)$, i.e. $p(y \mid \pi) = \prod_c \pi_c^{\mathbb{I}(y = c)}$
- we can decompose $p(y = c, \pi \mid \mathcal{D})$ as follows
$$p(y = c, \pi \mid \mathcal{D}) = p(y = c \mid \pi, \mathcal{D})\, p(\pi \mid \mathcal{D}) = p(y = c \mid \pi)\, p(\pi \mid \mathcal{D}) = \pi_c\, p(\pi \mid \mathcal{D})$$
where $p(\pi \mid \mathcal{D})$ is the posterior w.r.t. $\pi$
- using the previous equation in the integral above we have
$$p(y = c \mid \mathcal{D}) = \int p(y = c, \pi \mid \mathcal{D})\, d\pi = \int \pi_c\, p(\pi \mid \mathcal{D})\, d\pi = \mathbb{E}[\pi_c \mid \mathcal{D}] = \frac{N_c + \alpha_c}{N + \alpha_0}$$
which is the posterior mean computed for the Dirichlet-multinomial model (cf. lecture 4 slides), with $\alpha_0 = \sum_c \alpha_c$
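A sketch computing this posterior mean from class counts (the Dirichlet hyper-parameters `alpha` are assumed given; the function name is illustrative):

```python
import numpy as np

def class_posterior_mean(y, n_classes, alpha):
    """E[pi_c | D] = (N_c + alpha_c) / (N + alpha_0) for a Dirichlet prior Dir(alpha)."""
    counts = np.bincount(y, minlength=n_classes)      # N_c
    return (counts + alpha) / (len(y) + alpha.sum())  # alpha_0 = sum_c alpha_c

y = np.array([0, 0, 1, 2, 1, 0])
print(class_posterior_mean(y, 3, np.ones(3)))         # uniform Dirichlet prior (add-one smoothing)
```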

Bayesian Procedure

- as for the pdf $p(x \mid y = c, \mathcal{D})$, we can integrate out $\theta_c$ as follows
$$p(x \mid y = c, \mathcal{D}) = \int p(x, \theta_c \mid y = c, \mathcal{D})\, d\theta_c = \int p(x, \theta_c \mid \mathcal{D}_c)\, d\theta_c$$
where for simplicity we introduce $\mathcal{D}_c \triangleq \{(x_i, y_i) \in \mathcal{D} \mid y_i = c\}$
- we know that $p(x \mid \theta_c) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$, where $\theta_c = (\mu_c, \Sigma_c)$
- we can use the following decomposition
$$p(x, \theta_c \mid \mathcal{D}_c) = p(x \mid \theta_c, \mathcal{D}_c)\, p(\theta_c \mid \mathcal{D}_c) = p(x \mid \theta_c)\, p(\theta_c \mid \mathcal{D}_c)$$
where $p(\theta_c \mid \mathcal{D}_c)$ is the posterior w.r.t. $\theta_c$
- hence one has
$$p(x \mid y = c, \mathcal{D}) = \int p(x, \theta_c \mid \mathcal{D}_c)\, d\theta_c = \int p(x \mid \theta_c)\, p(\theta_c \mid \mathcal{D}_c)\, d\theta_c = \iint \mathcal{N}(x \mid \mu_c, \Sigma_c)\, p(\mu_c, \Sigma_c \mid \mathcal{D}_c)\, d\mu_c\, d\Sigma_c$$

Bayesian Procedure

- one has
$$p(x \mid y = c, \mathcal{D}) = \iint \mathcal{N}(x \mid \mu_c, \Sigma_c)\, p(\mu_c, \Sigma_c \mid \mathcal{D}_c)\, d\mu_c\, d\Sigma_c$$
- the posterior is (see sect. 4.6.3.3 of the book)
$$p(\mu_c, \Sigma_c \mid \mathcal{D}_c) = \mathrm{NIW}(\mu_c, \Sigma_c \mid m_N^c, \kappa_N^c, \nu_N^c, S_N^c)$$
- then (see sect. 4.6.3.6)
$$p(x \mid y = c, \mathcal{D}) = \iint \mathcal{N}(x \mid \mu_c, \Sigma_c)\, \mathrm{NIW}(\mu_c, \Sigma_c \mid m_N^c, \kappa_N^c, \nu_N^c, S_N^c)\, d\mu_c\, d\Sigma_c = \mathcal{T}\left(x \,\Big|\, m_N^c,\ \frac{\kappa_N^c + 1}{\kappa_N^c(\nu_N^c - D + 1)} S_N^c,\ \nu_N^c - D + 1\right)$$
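For completeness, a sketch of the NIW posterior hyper-parameter updates as recalled from the standard conjugate analysis (they are not spelled out on this slide, so treat them as an assumption to be checked against sect. 4.6.3.3 of the book); here $\bar{x}_c$ and $N_c$ are the empirical mean and size of $\mathcal{D}_c$, and $(m_0, \kappa_0, \nu_0, S_0)$ are the prior hyper-parameters:

$$m_N^c = \frac{\kappa_0 m_0 + N_c \bar{x}_c}{\kappa_0 + N_c} \qquad \kappa_N^c = \kappa_0 + N_c \qquad \nu_N^c = \nu_0 + N_c$$
$$S_N^c = S_0 + \sum_{i: y_i = c}(x_i - \bar{x}_c)(x_i - \bar{x}_c)^T + \frac{\kappa_0 N_c}{\kappa_0 + N_c}(\bar{x}_c - m_0)(\bar{x}_c - m_0)^T$$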

Bayesian Procedure

- let's summarize what we obtained by applying the Bayesian procedure
- we first found
$$p(y = c \mid \mathcal{D}) = \mathbb{E}[\pi_c \mid \mathcal{D}] = \frac{N_c + \alpha_c}{N + \alpha_0}$$
and then
$$p(x \mid y = c, \mathcal{D}) = \mathcal{T}\left(x \,\Big|\, m_N^c,\ \frac{\kappa_N^c + 1}{\kappa_N^c(\nu_N^c - D + 1)} S_N^c,\ \nu_N^c - D + 1\right)$$
- then, combining everything in the starting posterior predictive, we have
$$p(y = c \mid x, \mathcal{D}) \propto p(x \mid y = c, \mathcal{D})\, p(y = c \mid \mathcal{D}) = \mathbb{E}[\pi_c \mid \mathcal{D}]\ \mathcal{T}\left(x \,\Big|\, m_N^c,\ \frac{\kappa_N^c + 1}{\kappa_N^c(\nu_N^c - D + 1)} S_N^c,\ \nu_N^c - D + 1\right)$$

Credits: Kevin Murphy's book