CS 340 Lec. 18: Multivariate Gaussian Distributions and Linear Discriminant Analysis

AD, March 11

Multivariate Gaussian

Consider data $\{x^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$, and assume they are independent and identically distributed. A standard pdf used to model multivariate real data is the multivariate Gaussian (or normal)
$$p(x \mid \mu, \Sigma) = \mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}\underbrace{(x-\mu)^T \Sigma^{-1} (x-\mu)}_{\text{Mahalanobis distance}}\Big).$$
It can be shown that $\mu$ is the mean and $\Sigma$ is the covariance of $\mathcal{N}(x; \mu, \Sigma)$; i.e. $\mathbb{E}(X) = \mu$ and $\mathrm{cov}(X) = \Sigma$. It will be used extensively in our discussion of unsupervised learning and can also be used for generative classifiers (i.e. discriminant analysis).
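As a sanity check on this formula, here is a minimal NumPy sketch of the density; the function name mvn_pdf and the toy values of $\mu$, $\Sigma$, $x$ are illustrative only, not part of the lecture.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Evaluate N(x; mu, Sigma) for a single point x in R^D."""
    D = mu.shape[0]
    diff = x - mu
    # Mahalanobis quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm_const

# Toy example in D = 2 (values chosen arbitrarily for illustration)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.5])
print(mvn_pdf(x, mu, Sigma))
```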

Special Cases

When $D = 1$, we are back to
$$p(x \mid \mu, \sigma^2) = \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big).$$
When $D = 2$ and writing
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$
where $\rho = \mathrm{corr}(X_1, X_2) \in [-1, 1]$, we have
$$p(x \mid \mu, \Sigma) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\Big(-\frac{1}{2(1-\rho^2)}\Big\{\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\Big\}\Big).$$
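The bivariate form can be checked against the general $D$-dimensional formula numerically; a small sketch with illustrative parameter values (scipy.stats.multivariate_normal is used only as a reference implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_pdf(x, mu, sigma1, sigma2, rho):
    """2D Gaussian density written in terms of sigma_1, sigma_2 and rho (formula above)."""
    q = ((x[0] - mu[0]) ** 2 / sigma1 ** 2
         + (x[1] - mu[1]) ** 2 / sigma2 ** 2
         - 2 * rho * (x[0] - mu[0]) * (x[1] - mu[1]) / (sigma1 * sigma2))
    norm = 2 * np.pi * sigma1 * sigma2 * np.sqrt(1 - rho ** 2)
    return np.exp(-q / (2 * (1 - rho ** 2))) / norm

# Illustrative parameter values
sigma1, sigma2, rho = 1.5, 0.8, 0.3
mu = np.array([1.0, -1.0])
Sigma = np.array([[sigma1 ** 2,           rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2 ** 2]])
x = np.array([0.7, -0.2])
print(bivariate_pdf(x, mu, sigma1, sigma2, rho))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # agrees with the line above
```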

Graphical Illustrations

[Figure: 2D Gaussian pdfs for different covariance matrices (left: full, middle: diagonal, right: spherical).]

Why are the contours of a multivariate Gaussian elliptical?

If we plot the values of $x$ such that $p(x \mid \mu, \Sigma)$ is equal to a constant, i.e. such that $(x-\mu)^T \Sigma^{-1} (x-\mu) = c > 0$ where $c$ is given, then we obtain an ellipse.

$\Sigma$ is a positive definite matrix, so we have $\Sigma = U\Lambda U^T$ where $U$ is an orthonormal matrix of eigenvectors, i.e. $U^T U = I$, and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_D)$ is the diagonal matrix of eigenvalues. Hence we have
$$\Sigma^{-1} = (U^T)^{-1}\Lambda^{-1}U^{-1} = U\Lambda^{-1}U^T = \sum_{k=1}^D \frac{1}{\lambda_k} u_k u_k^T,$$
so
$$(x-\mu)^T \Sigma^{-1} (x-\mu) = \sum_{k=1}^D \frac{y_k^2}{\lambda_k} \quad \text{where } y_k = u_k^T(x-\mu).$$
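This identity is easy to verify numerically; a short sketch with an arbitrary positive definite $\Sigma$ (all values are illustrative):

```python
import numpy as np

# A positive definite covariance (illustrative values)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([0.0, 0.0])
x = np.array([1.0, 2.0])

# Eigendecomposition Sigma = U Lambda U^T (eigh returns an orthonormal U for symmetric Sigma)
lam, U = np.linalg.eigh(Sigma)

# Mahalanobis quadratic form computed two ways
maha_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
y = U.T @ (x - mu)               # y_k = u_k^T (x - mu)
maha_eig = np.sum(y ** 2 / lam)  # sum_k y_k^2 / lambda_k
print(maha_direct, maha_eig)     # identical up to rounding
```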

Graphical Illustrations

[Figure: Level sets of 2D Gaussian pdfs for different covariance matrices. Left: a full covariance matrix has elliptical contours. Middle: a diagonal covariance matrix gives an axis-aligned ellipse. Right: a spherical covariance matrix gives circular contours.]

Properties of Multivariate Gaussians

Marginalization is straightforward. Conditioning is easy; e.g. if $X = (X_1, X_2)$ with $p(x) = p(x_1, x_2) = \mathcal{N}(x; \mu, \Sigma)$ where
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
then
$$p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)} = \mathcal{N}(x_1; \mu_{1|2}, \Sigma_{1|2})$$
with
$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \quad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
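A minimal sketch of these conditioning formulas; the helper gaussian_conditional and the 3-dimensional example are invented for illustration:

```python
import numpy as np

def gaussian_conditional(mu, Sigma, x2, d1):
    """Parameters of p(x_1 | x_2) when (x_1, x_2) is jointly Gaussian.
    d1 is the dimension of x_1; the remaining coordinates form x_2."""
    mu1, mu2 = mu[:d1], mu[d1:]
    S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
    S21, S22 = Sigma[d1:, :d1], Sigma[d1:, d1:]
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)
    return mu_cond, Sigma_cond

# Illustrative 3D example: condition the first coordinate on the last two
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
print(gaussian_conditional(mu, Sigma, x2=np.array([1.2, -0.5]), d1=1))
```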

Independence and Correlation for Gaussian Variables

It is well known that independence implies uncorrelatedness; i.e. if the components $(X_1, \ldots, X_D)$ of a vector $X$ are independent then they are uncorrelated. However, uncorrelatedness does not imply independence in the general case.

If the components $(X_1, \ldots, X_D)$ of a vector $X$ distributed according to a multivariate Gaussian are uncorrelated, then they are independent.

Proof. If $(X_1, \ldots, X_D)$ are uncorrelated then $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2)$ and $|\Sigma| = \prod_{k=1}^D \sigma_k^2$, so
$$p(x \mid \mu, \Sigma) = \prod_{k=1}^D p(x_k \mid \mu_k, \sigma_k^2) = \prod_{k=1}^D \mathcal{N}(x_k; \mu_k, \sigma_k^2).$$
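The factorization in the proof can also be checked numerically; a tiny sketch with an illustrative diagonal covariance:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Uncorrelated Gaussian components: diagonal covariance (illustrative values)
mu = np.array([0.0, 2.0, -1.0])
sig = np.array([1.0, 0.5, 2.0])
Sigma = np.diag(sig ** 2)
x = np.array([0.3, 1.7, -2.0])

joint = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
product = np.prod(norm.pdf(x, loc=mu, scale=sig))  # product of the univariate marginals
print(joint, product)  # equal: the joint density factorizes, so the components are independent
```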

ML Parameter Learning for Multivariate Gaussian

Consider data $\{x^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$, and assume they are independent and identically distributed from $\mathcal{N}(x; \mu, \Sigma)$. The ML parameter estimates of $(\mu, \Sigma)$ maximize by definition
$$\sum_{i=1}^N \log \mathcal{N}(x^i; \mu, \Sigma) = -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x^i - \mu)^T \Sigma^{-1}(x^i - \mu).$$
We obtain after painful calculations the fairly intuitive results
$$\widehat{\mu} = \frac{\sum_{i=1}^N x^i}{N}, \quad \widehat{\Sigma} = \frac{\sum_{i=1}^N (x^i - \widehat{\mu})(x^i - \widehat{\mu})^T}{N}.$$
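A minimal sketch of these ML estimates on synthetic data; the function name and parameter values are illustrative, and note the $1/N$ normalization rather than the unbiased $1/(N-1)$:

```python
import numpy as np

def gaussian_mle(X):
    """ML estimates of (mu, Sigma) for i.i.d. rows of X (N x D)."""
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / X.shape[0]  # 1/N, not 1/(N-1)
    return mu_hat, Sigma_hat

# Sanity check on synthetic data (illustrative parameters)
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[1.0, 0.3],
                       [0.3, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)
print(gaussian_mle(X))  # close to the true parameters for large N
```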

Application to Supervised Learning using Bayes Classifier

Assume you are given some training data $\{x^i, y^i\}_{i=1}^N$ where $x^i \in \mathbb{R}^D$ and $y^i \in \{1, 2, \ldots, C\}$ can take $C$ different values. Given an input test point $x$, you want to predict/estimate the output $y$ associated with $x$.

Previously we have followed a probabilistic approach
$$p(y = c \mid x) = \frac{p(y = c)\,p(x \mid y = c)}{\sum_{c'=1}^C p(y = c')\,p(x \mid y = c')}.$$
This requires modelling and learning the parameters of the class-conditional density of features $p(x \mid y = c)$.

Height/Weight Data

[Figure: (left) height/weight data for females (red) and males (blue); (right) 2D Gaussians fit to each class, with 95% of the probability mass inside each ellipse.]

Supervised Learning using Bayes Classifier

Assume we pick
$$p(x \mid y = c) = \mathcal{N}(x; \mu_c, \Sigma_c)$$
and $p(y = c) = \pi_c$; then
$$p(y = c \mid x) \propto \pi_c\,|\Sigma_c|^{-1/2}\exp\Big(-\tfrac{1}{2}(x-\mu_c)^T\Sigma_c^{-1}(x-\mu_c)\Big) = \exp\Big(\mu_c^T\Sigma_c^{-1}x - \tfrac{1}{2}\mu_c^T\Sigma_c^{-1}\mu_c + \log\pi_c\Big)\exp\Big(-\tfrac{1}{2}x^T\Sigma_c^{-1}x\Big).$$
For models where $\Sigma_c = \Sigma$, this is known as linear discriminant analysis:
$$p(y = c \mid x) = \frac{\exp\big(\beta_c^T x + \gamma_c\big)}{\sum_{c'=1}^C \exp\big(\beta_{c'}^T x + \gamma_{c'}\big)}$$
where $\beta_c = \Sigma^{-1}\mu_c$, $\gamma_c = -\tfrac{1}{2}\mu_c^T\Sigma^{-1}\mu_c + \log\pi_c$, and the model is very similar to logistic regression.
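A compact sketch of fitting this model and evaluating the softmax posterior above; lda_fit and lda_posterior are illustrative helpers, not code from the course:

```python
import numpy as np

def lda_fit(X, y, n_classes):
    """ML fit of an LDA model: per-class means, shared covariance, class priors."""
    N, D = X.shape
    pis, mus = [], []
    Sigma = np.zeros((D, D))
    for c in range(n_classes):
        Xc = X[y == c]
        pis.append(len(Xc) / N)
        mus.append(Xc.mean(axis=0))
        Sigma += (Xc - mus[-1]).T @ (Xc - mus[-1])   # pooled scatter
    return np.array(pis), np.array(mus), Sigma / N

def lda_posterior(x, pis, mus, Sigma):
    """p(y = c | x) via the softmax of beta_c^T x + gamma_c from the slide above."""
    Sigma_inv = np.linalg.inv(Sigma)
    betas = mus @ Sigma_inv                                      # row c is beta_c^T
    gammas = -0.5 * np.sum(mus @ Sigma_inv * mus, axis=1) + np.log(pis)
    scores = betas @ x + gammas
    scores -= scores.max()                                       # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Illustrative usage with synthetic 2-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([2, 2], np.eye(2), 100)])
y = np.repeat([0, 1], 100)
pis, mus, Sigma = lda_fit(X, y, n_classes=2)
print(lda_posterior(np.array([1.0, 1.0]), pis, mus, Sigma))
```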

Decision Boundaries

[Figure: Decision boundaries in 2D for the 2- and 3-class case. Panels: linear boundary; all linear boundaries; parabolic boundary; some linear, some quadratic.]

We can write
$$p(y = c \mid x, \theta) = \frac{e^{\beta_c^T x + \gamma_c}}{\sum_{c'} e^{\beta_{c'}^T x + \gamma_{c'}}},$$
which is the softmax function. This is equivalent in form to multinomial logistic regression, although the overall method is not equivalent to logistic regression. The decision boundary between two classes, say $c$ and $c'$, is the set of points $x$ for which $p(y = c \mid x, \theta) = p(y = c' \mid x, \theta)$.

Binary Classification

Consider the case where $C = 2$; then one can check that
$$p(y = 1 \mid x) = g\big((\beta_1 - \beta_2)^T x + \gamma_1 - \gamma_2\big),$$
where $g(a) = 1/(1 + e^{-a})$ is the sigmoid function. We have
$$\gamma_1 - \gamma_2 = -\tfrac{1}{2}(\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 + \mu_2) + \log(\pi_1/\pi_2).$$
Defining
$$x_0 := \tfrac{1}{2}(\mu_1 + \mu_2) - (\mu_1 - \mu_2)\frac{\log(\pi_1/\pi_2)}{(\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2)}, \quad w := \beta_1 - \beta_2 = \Sigma^{-1}(\mu_1 - \mu_2),$$
we obtain
$$p(y = 1 \mid x) = g\big(w^T(x - x_0)\big);$$
i.e. $x$ is shifted by $x_0$ and then projected onto the line $w$.
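A short sketch computing $w$ and $x_0$ and evaluating the resulting sigmoid posterior; all parameter values are invented for illustration:

```python
import numpy as np

def binary_lda_geometry(mu1, mu2, Sigma, pi1, pi2):
    """Compute w and x_0 so that p(y = 1 | x) = sigmoid(w^T (x - x_0))."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    shift = (mu1 - mu2) * np.log(pi1 / pi2) / ((mu1 - mu2) @ Sigma_inv @ (mu1 - mu2))
    x0 = 0.5 * (mu1 + mu2) - shift
    return w, x0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative parameters
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])
w, x0 = binary_lda_geometry(mu1, mu2, Sigma, pi1=0.6, pi2=0.4)
x = np.array([0.5, 0.5])
print(sigmoid(w @ (x - x0)))  # p(y = 1 | x)
```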

Binary Classification

[Figure: Geometry of LDA in the two-class case where $\Sigma_1 = \Sigma_2$. Example where $\Sigma = \sigma^2 I$, so $w$ is in the direction of $\mu_1 - \mu_2$.]

Generative or Discriminative Classifiers

When the model for the class-conditional densities is correct (resp. incorrect), generative classifiers will typically outperform (resp. underperform) discriminative classifiers for large enough datasets.

Generative classifiers can be difficult to learn, whereas discriminative classifiers try to learn directly the posterior probability of interest.

Generative classifiers can handle missing data easily; discriminative methods cannot.

Discriminative classifiers can be more flexible; e.g. replace $x$ with a feature vector $\Phi(x)$.

Application: Bankruptcy Data

[Figure: Discriminant analysis on the bankruptcy data set using Gaussian class-conditional densities (black = training data, blue = correct, red = wrong, nerr = 3). Estimated labels are based on the posterior probability of belonging to each class; points that belong to class 1 are shown as triangles. Classes: Bankrupt, Solvent.]