CLASSICAL NORMAL-BASED DISCRIMINANT ANALYSIS

EECS 833, March 2006

Geoff Bohling
Assistant Scientist
Kansas Geological Survey
geoff@kgs.ku.edu
864-093

Overheads and resources available at http://people.ku.edu/~gbohling/eecs833

Example Data

For the next two lectures, we will look at predicting facies from logs for a lower Cretaceous (roughly 145 to 100 million years old) section in north central Kansas. Facies assignments from core are available from the Jones well, along with a suite of logs including neutron and density porosity, photoelectric factor, and the thorium, uranium, and potassium components of the spectral gamma ray log. We will recast the density porosity as apparent matrix density (Rhomaa) and the photoelectric factor as apparent matrix volumetric photoelectric absorption (Umaa), so that the six logs employed for discrimination are: Th, U, K, Rhomaa, Umaa, and φN (neutron porosity). The six facies picked from core are marine shale, paralic (coastal), floodplain, channel sandstone, splay sandstone, and paleosol.
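The lecture does not spell out the Rhomaa and Umaa transforms. One common crossplot-style formulation is sketched below, assuming a limestone-scale density porosity, a neutron-density average porosity, and fresh-water fluid properties; the constants and the variable names (phiD, phiN, Pe) are illustrative assumptions, not necessarily those used for the Jones data.

% Hedged sketch of one common Rhomaa/Umaa formulation; the matrix and
% fluid constants below are typical assumed values, not the lecture's.
rho_ma = 2.71;   % assumed matrix density for the porosity scale (g/cc)
rho_fl = 1.00;   % assumed fluid density (g/cc)
U_fl   = 0.398;  % assumed fluid volumetric photoelectric absorption
rhob   = rho_ma - phiD.*(rho_ma - rho_fl);    % bulk density back-calculated from density porosity
phi    = (phiN + phiD)/2;                     % apparent (crossplot) porosity
Rhomaa = (rhob - phi.*rho_fl)./(1 - phi);     % apparent matrix density
Umaa   = (Pe.*rhob - phi.*U_fl)./(1 - phi);   % apparent matrix photoelectric absorption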

So, we will train on this data from the Jones well and look at predictions both in the Jones well and the Kenyon well.

For the sake of illustration, we will also look at a two-dimensional, two-group sub-example, trying to discriminate marine and paralic facies. There are 8 core samples designated marine and 56 designated paralic. Classical discriminant analysis assumes that the data from each group follow a multivariate normal distribution, so we will take a bit of time to look at some properties of this distribution.

The Normal (Gaussian) Density Function

The probability density function for a single normally distributed variable, X, with a mean of µ and a standard deviation of σ is given by

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right]

or

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -0.5 z^2 \right]

where

z = \frac{x - \mu}{\sigma}

represents a standardized version of x. The standardized random variable, Z, follows a normal distribution with a mean of zero and a standard deviation of one, or Z ~ N(0,1). The appearance of −z², the negative of the squared scaled distance to the mean, in the exponential is very important. This means that squared scaled distances (scaled Euclidean distances) are the natural distance metric for normally distributed variables. This leads, for example, to the close connection between the normal distribution and least-squares regression.
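As a quick numerical check, the two forms above can be evaluated directly and compared against the Statistics Toolbox function normpdf; the values of mu and sigma below are arbitrary choices for illustration, not taken from the lecture data.

% Minimal sketch: evaluate the univariate normal density two ways and
% compare with normpdf. mu and sigma are arbitrary illustrative values.
mu = 10; sigma = 2;
x = 6:0.5:14;
z = (x - mu)./sigma;                            % standardized values
f1 = exp(-(x - mu).^2./(2*sigma^2))./(sigma*sqrt(2*pi));
f2 = exp(-0.5*z.^2)./(sigma*sqrt(2*pi));        % same density via z
f3 = normpdf(x, mu, sigma);
max(abs(f1 - f2)), max(abs(f1 - f3))            % both differences ~ 0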

For the multivariate normal distribution, we now consider a vector of random variables, X, with a vector mean of µ and a covariance matrix Σ. That is, each individual variable, X_i, follows a normal distribution with a mean of µ_i and a variance of σ_i² = Σ_ii, the appropriate diagonal element of the covariance matrix. The covariance between any pair of the variables, X_i and X_j, is given by Σ_ij and the corresponding correlation is given by

\rho_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}}

The multivariate normal density function for X is given by

f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left[ -\tfrac{1}{2}(x-\mu)^{\prime}\,\Sigma^{-1}(x-\mu) \right]

where p is the number of variables or components of X and |Σ| is the determinant of the covariance matrix. The quadratic form

z^2 = (x-\mu)^{\prime}\,\Sigma^{-1}(x-\mu)

now represents a squared distance to the vector mean, scaled according to the variances and covariances specified in Σ. This is the squared Mahalanobis distance to the vector mean. Using z² we can express the multivariate normal density function in the same form as the univariate version, apart from appropriate differences in the normalizing factor out front:

f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left[ -0.5 z^2 \right] = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\left[ -0.5 z^2 \right]

The second form shown will be handy for later development.
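The second form is easy to verify numerically. The sketch below evaluates the density from the squared Mahalanobis distance and cross-checks it against the Statistics Toolbox function mvnpdf; the mean vector, covariance matrix, and evaluation point are arbitrary illustrative values.

% Minimal sketch: evaluate the multivariate normal density from the
% squared Mahalanobis distance and compare with mvnpdf.
mu    = [1 2];                     % illustrative mean vector
Sigma = [2.0 0.8; 0.8 1.0];        % illustrative covariance matrix
x     = [1.5 1.0];                 % evaluation point
p     = length(mu);
d     = (x - mu)';                 % column vector of differences
z2    = d'*(Sigma\d);              % squared Mahalanobis distance
f     = (2*pi)^(-p/2) * det(Sigma)^(-1/2) * exp(-0.5*z2);
f - mvnpdf(x, mu, Sigma)           % should be essentially zero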

If each component variable, X_i, is scaled to zero mean and unit standard deviation in advance of the analysis, the resulting vector mean is µ = 0 and the covariance matrix is the same as the correlation matrix of the original variables, with 1s on the diagonal and correlations, ρ_ij, in the off-diagonal locations. In general, statistical analyses (regression, classification, etc.) using these standardized variables yield results equivalent to those based on the original variables. In particular, Mahalanobis distances between points in the standardized space are the same as those between corresponding points in the original space, so that the fundamental configuration of the data is unchanged by the translation (to zero mean) and scaling (to unit standard deviation). If all the variables are mutually uncorrelated, then the correlation matrix (which is also the covariance matrix in standardized space) is the identity matrix and Mahalanobis distances reduce to Euclidean distances in the standardized space.

This is all to say that Mahalanobis distances are essentially Euclidean distances scaled according to the individual variances (or standard deviations) and adjusted to account for correlations among the variables. The latter adjustment is basically a coordinate rotation and re-scaling in accordance with the principal axes of the correlation matrix. The following plots show contours of the bivariate normal density function and Mahalanobis distances (M.D.) to the points (1, 1) and (1, −1) when the correlation, ρ, between the two variables is 0 and when it is 0.9.

[Figure: contours of the bivariate normal density and of Mahalanobis distance for ρ = 0 and ρ = 0.9.]
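As a small numerical check of the claim that standardization leaves Mahalanobis distances unchanged, the sketch below standardizes an arbitrary simulated bivariate sample with zscore and recomputes the squared M.D. of one point in both spaces (mvnrnd and zscore are Statistics Toolbox functions; the simulated data are illustrative only).

% Minimal sketch: Mahalanobis distances are unchanged by standardizing
% each variable to zero mean and unit standard deviation.
X  = mvnrnd([0 0], [1 0.9; 0.9 1], 200);   % correlated bivariate sample
Xs = zscore(X);                            % standardized variables
C  = cov(X);  Cs = cov(Xs);                % Cs is the correlation matrix of X
x0 = X(1,:)  - mean(X);                    % first point, original space
xs = Xs(1,:) - mean(Xs);                   % same point, standardized space
d1 = x0*(C\x0');                           % squared M.D. in original space
d2 = xs*(Cs\xs');                          % squared M.D. in standardized space
[d1 d2]                                    % identical up to round-off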

Fitting the Normal Distribution to Data

The normal distribution is determined by two parameters, the mean and the variance (or standard deviation). In a multivariate context, this means a mean vector and a covariance matrix. Fitting the normal distribution to a set of N data values is a simple matter of computing the average for each variable, X_i:

\bar{X}_i = \frac{1}{N} \sum_{n=1}^{N} X_{i,n}

and the sample covariance between each pair of variables, X_i and X_j:

\mathrm{Cov}(X_i, X_j) = \frac{1}{N-1} \sum_{n=1}^{N} (X_{i,n} - \bar{X}_i)(X_{j,n} - \bar{X}_j)

These serve as the estimates of the population means and covariances (variances when i = j). The division by (N−1), rather than N, for the covariance gives the usual unbiased estimator. Although it is easy to fit the normal distribution in the sense of computing the sample means and covariances, there is absolutely no guarantee that the resulting normal distribution will actually fit the data well.

Assessing Multivariate Normality

The goodness-of-fit of the normal distribution to the observed data should be assessed prior to applying normal-based procedures, including classical discriminant analysis. Methods for assessing the goodness of fit to a normal distribution include graphical displays such as quantile-quantile plots and numerical tests such as the Kolmogorov-Smirnov test. The Matlab Statistics toolbox contains various functions for testing normality of univariate data (kstest, jbtest, lillietest). You can also assess fits using the distribution fitting tool (dfittool).

In the multivariate case, each variable must be normally distributed for the entire set to follow a multivariate normal distribution, but normality of the individual variables does not guarantee multivariate normality. However, if the data do follow a multivariate normal distribution, then the squared Mahalanobis distances from the data points to the centroid (mean) should follow a chi-squared distribution with p degrees of freedom. As an example, we can compute squared M.D.'s for the marine Umaa-Rhomaa data points, in the N x 2 data matrix URMar, using

N = size(URMar,1);     % number of data points (rows)
mu = mean(URMar);      % 1 x p vector of column means
sigma = cov(URMar);    % p x p covariance matrix
xmd = zeros(N,1);      % initialize vector of squared M.D.'s
for i = 1:N
    z = (URMar(i,:) - mu)';    % difference from the mean, as a column vector
    xmd(i) = z'*(sigma\z);     % squared M.D. to the mean
end
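The loop above makes the computation explicit, but the Statistics Toolbox function mahal returns the same squared Mahalanobis distances in a single call:

% Equivalent one-line computation: mahal(Y,X) returns the squared
% Mahalanobis distance from each row of Y to the mean of X, using the
% sample covariance of X.
xmd2 = mahal(URMar, URMar);
max(abs(xmd - xmd2))      % should be ~ 0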

Then we can use a Kolmogorov-Smirnov test to compare the computed squared M.D.'s to the theoretical chi-squared cumulative density function with p = 2 degrees of freedom:

>> kstest(xmd,[sort(xmd) chi2cdf(sort(xmd),2)])

ans =

     0

The K-S test compares the empirical cumulative density function for the data to the given theoretical CDF (in this case, the chi-squared with 2 degrees of freedom), comparing the maximum difference between the two to a certain test statistic whose distribution is known under the null hypothesis that the data follow the specified distribution. The answer of 0 indicates that, at a 5% significance level, we cannot reject the hypothesis that the observed values follow the chi-squared distribution.

We can also construct a quantile-quantile plot of the observed squared M.D.'s versus chi-squared quantiles using code like:

prb = ((1:N) - 0.5)/N;     % probability values
% degrees of freedom = number of variables:
df = size(URMar,2);
% plot sorted xmd's against chi-squared quantiles:
plot(chi2inv(prb,df), sort(xmd), '.');
hold on
plot([0 14],[0 14],'r-');   % 1-to-1 line

The resulting plot is shown below. This could also be accomplished with Matlab's qqplot function.

[Figure: quantile-quantile plot of the sorted squared M.D.'s against chi-squared quantiles, with a 1-to-1 reference line.]

The plot shows noticeable deviation from the 1-to-1 line, but the deviations are not strong enough for the K-S test to reject the possibility that the squared M.D.'s follow a chi-squared distribution.

Discriminant Analysis

Classical discriminant analysis results from assuming that each data point arises with prior probability q_k from one of K different groups or classes, each characterized by its own group mean vector, µ_k, and covariance matrix, Σ_k. Plugging the multivariate normal density function into Bayes theorem yields the following posterior probability for group k given a vector of observed data values, x:

P_k(x) = \Pr[G = k \mid x] = \frac{q_k f_k(x)}{\sum_{l=1}^{K} q_l f_l(x)} = \frac{q_k\,(2\pi)^{-p/2}\,|\Sigma_k|^{-1/2} \exp[-0.5 z_k^2]}{\sum_{l=1}^{K} q_l\,(2\pi)^{-p/2}\,|\Sigma_l|^{-1/2} \exp[-0.5 z_l^2]}

with

z_k^2 = (x - \mu_k)^{\prime}\,\Sigma_k^{-1}(x - \mu_k)

representing the squared Mahalanobis distance from the data vector x to the k-th group mean. The factor (2π)^{−p/2} is the same for all groups, so that the posterior probability for each group at x is an exponential transform of the negative squared Mahalanobis distance to the group centroid, adjusted by the prior probability and the determinant of the group covariance matrix.
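In code, the posterior probabilities follow directly from the squared Mahalanobis distances, the covariance determinants, and the priors. The sketch below assumes the group means (rows of muK), covariance matrices (pages of SK), and prior probabilities (q) have already been estimated for an observation x; all of these variable names are illustrative, not part of the lecture's Matlab session.

% Minimal sketch: posterior probabilities for one observation x over K
% groups, from group means muK (K x p), covariances SK (p x p x K), and
% prior probabilities q (K x 1).
K = numel(q);
logf = zeros(K,1);
for k = 1:K
    d  = (x - muK(k,:))';                        % column of differences
    z2 = d'*(SK(:,:,k)\d);                       % squared Mahalanobis distance
    logf(k) = log(q(k)) - 0.5*log(det(SK(:,:,k))) - 0.5*z2;
end
logf = logf - max(logf);          % guard against underflow in exp
P = exp(logf)/sum(exp(logf));     % posterior probabilities, sum to 1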

The training process for classical discriminant analysis simply consists of estimating the mean and covariance matrix for each group or class, based on a training dataset with known classes for each data point. The j-th component of the mean vector for group k is simply the mean for variable j over the N_k data points in group k:

\bar{x}_{k;j} = \frac{1}{N_k} \sum_{n \in k} x_{n;j}

where n ∈ k indicates the set of data points in group k. The group covariance matrices are commonly estimated in one of two ways. Either a distinct estimate, S_k, is developed for each group's covariance matrix, with entries given by

S_{k;i,j} = \frac{1}{N_k - 1} \sum_{n \in k} (x_{n;i} - \bar{x}_{k;i})(x_{n;j} - \bar{x}_{k;j})

or the covariance matrices are assumed to be equal and estimated by a single pooled estimate, S, with entries

S_{i,j} = \frac{1}{N - K} \sum_{n=1}^{N} (x_{n;i} - \bar{x}_{k(n);i})(x_{n;j} - \bar{x}_{k(n);j})

where by \bar{x}_{k(n);i} I mean the i-th component of the mean vector for whichever group data point n belongs to, k(n).
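A sketch of those estimates in Matlab, producing the group means muK, the stratified covariances SK, and the pooled estimate (here called Sp) used in the sketch above; the training matrix X and label vector g are illustrative names, not variables from the lecture.

% Minimal sketch: group means, stratified covariances, and the pooled
% covariance from training data X (N x p) and integer labels g (N x 1).
labels = unique(g);
K = numel(labels);  [N,p] = size(X);
muK = zeros(K,p);  SK = zeros(p,p,K);  Sp = zeros(p,p);
for k = 1:K
    Xk = X(g == labels(k), :);        % rows belonging to group k
    Nk = size(Xk,1);
    muK(k,:)  = mean(Xk);             % group mean vector
    SK(:,:,k) = cov(Xk);              % stratified (per-group) estimate
    Sp = Sp + (Nk - 1)*SK(:,:,k);     % accumulate within-group scatter
end
Sp = Sp/(N - K);                      % pooled covariance estimate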

If the group covariance matrices are assumed to be equal and estimated by the pooled data covariance matrix, S, then the squared Mahalanobis distance from a data vector x to the mean of group k is given by

z_k^2 = (x - \bar{x}_k)^{\prime}\,S^{-1}(x - \bar{x}_k)

and the covariance matrix determinants are all equal, so that Bayes formula reduces to

P_k(x) = \frac{q_k \exp[-0.5 z_k^2]}{\sum_{l=1}^{K} q_l \exp[-0.5 z_l^2]}

Computing the Mahalanobis distances using a common covariance estimate and then assigning each data vector, x, to the group with the highest posterior probability results in an allocation rule that draws linear boundaries between regions of space allocated to different groups. Thus, this approach is called linear discriminant analysis. Linear discriminant analysis is implemented by the classify function in Matlab's statistics toolbox, setting the type option to 'linear' (which is the default).
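Because the term in x that is quadratic is common to all groups when a single pooled covariance is used, the allocation can equivalently be expressed through linear discriminant scores; a sketch, reusing the illustrative muK, Sp, and q from the previous sketches, with x a 1 x p observation:

% Minimal sketch: linear discriminant scores with a pooled covariance.
% The group with the largest score has the largest posterior probability.
K = size(muK,1);
delta = zeros(K,1);
for k = 1:K
    a = Sp\muK(k,:)';                                % S^-1 * xbar_k
    delta(k) = x*a - 0.5*muK(k,:)*a + log(q(k));     % score is linear in x
end
[~, kbest] = max(delta);                             % allocated group

The fact that these scores are linear in x is exactly why the decision boundaries drawn by this rule are linear.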

With the Umaa-Rhomaa data in the data matrix URData and the facies numbers (1 for Marine, 2 for Paralic) in the vector URFacies, we can perform linear discriminant analysis, predicting back on the training data, using

>> [class,err,prob] = classify(URData,URData,URFacies,'linear');
>> % the estimated error (misclassification) rate
>> err

err =

    0.0587

>> % compare original and predicted facies
>> crosstab(URFacies,class)

ans =

     6    45

We can get a contour plot of the posterior probability for the marine facies by predicting over a grid of values specified in URGrid (a two-column matrix containing all the combinations of a set of regularly-spaced Umaa values and a set of regularly-spaced Rhomaa values):

[class,err,prob] = classify(URGrid,URData,URFacies,'linear');

and then contouring the first column of the probability matrix against the grid coordinates (Ug, Rg):

>> [cs,h] = contour(Ug,Rg,reshape(prob(:,1),length(Ug),length(Rg))',(0.1:0.1:0.9));
>> clabel(cs,h,[0.1 0.5 0.9])
>> hold on
>> plot(URMar(:,1),URMar(:,2),'r+')
>> plot(URPar(:,1),URPar(:,2),'bo')

[Figure: contours of the posterior probability for the marine facies in Umaa-Rhomaa space, with the marine (+) and paralic (o) training points overplotted.]

(I've also reversed the Rhomaa axis.) Because there are only two classes, the posterior probability for the paralic facies is just one minus that for the marine facies. The Prob = 0.5 contour is the dividing line between the regions in Umaa-Rhomaa space allocated to the two classes by Bayes rule. Misallocations are inevitable unless there is no overlap between the original classes. Bayes rule yields the minimum error or misclassification rate when the density estimates going into it are accurate. Adjusting the prior probabilities shifts the placement of the probability contours but does not change their orientation.
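The prior probabilities used by classify can be set explicitly through its fifth argument; the equal 0.5/0.5 values below are just an illustrative choice, not the priors used in the lecture example.

% Minimal sketch: supplying explicit prior probabilities to classify.
priors = [0.5 0.5];                  % illustrative priors for marine, paralic
[classEq,errEq,probEq] = classify(URGrid,URData,URFacies,'linear',priors);

Contouring the first column of probEq in place of prob would show how the probability contours shift as the priors change.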

If the covariance matrices are not assumed to be equal and each is estimated separately by S_k, then the squared Mahalanobis distance to each group mean is given by

z_k^2 = (x - \bar{x}_k)^{\prime}\,S_k^{-1}(x - \bar{x}_k)

and, in general, the covariance matrix determinants all differ, so that

P_k(x) = \frac{q_k\,|S_k|^{-1/2} \exp[-0.5 z_k^2]}{\sum_{l=1}^{K} q_l\,|S_l|^{-1/2} \exp[-0.5 z_l^2]}

Using Mahalanobis distances computed from the group-specific covariance matrices leads to an allocation rule that draws quadratic boundaries between groups in the variable space, so this is called quadratic discriminant analysis. It can be implemented using the classify function with type set to 'quadratic'. The Matlab documentation refers to the individual-group covariance estimates as stratified by group.

Applying quadratic discriminant analysis to the Umaa-Rhomaa data yields:

>> [class,err,prob] = classify(URData,URData,URFacies,'quadratic');
>> err   % a slightly lower error rate

err =

    0.0544

>> crosstab(URFacies,class)

ans =

     9     9
     6    50

Note that if the prior probabilities for all groups are assumed to be equal (e.g., you have no basis for assigning unequal priors) and the covariance matrix determinants are assumed to be equal (or differences among them are ignored), then Bayes formula simply reduces to

P_k(x) = \frac{\exp[-0.5 z_k^2]}{\sum_{l=1}^{K} \exp[-0.5 z_l^2]}

so that allocating x to the nearest group, in terms of Mahalanobis distance, is equivalent to allocating it to the group with the highest posterior probability. In other words, if you are only interested in allocation, and not probabilities, you can stop after computing Mahalanobis distances and assign each data point to the group with the minimum Mahalanobis distance. This can be done with the classify function by setting type to 'mahalanobis'. Note that this option uses stratified covariance matrix estimates, not a pooled covariance matrix estimate. Since the stratified estimates, S_k, will in general have differing determinants, running classify with type = 'mahalanobis' will produce different allocations than running it using type = 'quadratic', even if you specify equal priors. The Mahalanobis distance allocations will also differ from those produced using type = 'linear', since the latter computes distances based on the pooled covariance matrix.
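A quick way to see that the three options really do produce different allocations on the same training data (a sketch reusing URData and URFacies from above):

% Minimal sketch: compare allocations from the three classify options.
cLin  = classify(URData, URData, URFacies, 'linear');
cQuad = classify(URData, URData, URFacies, 'quadratic');
cMah  = classify(URData, URData, URFacies, 'mahalanobis');
% Number of points allocated differently by each pair of rules:
[sum(cMah ~= cQuad)  sum(cMah ~= cLin)  sum(cQuad ~= cLin)]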

Applying linear discriminant analysis to the full Jones dataset, with six facies (specified in the vector JonesFac) and six logs (in the data matrix JonesVar, one row per sample and one column per log), and predicting back on the Jones data itself yields

>> [class,err,prob] = classify(JonesVar,JonesVar,JonesFac,'linear');
>> err

err =

    0.368

>> crosstab(JonesFac,class)

07 3 0 3 4 0 7 3 4 0 5 8 6 3 5 38 0 7 0 4 3 0 5 9 0 4 0 0 9 8 46

Quadratic discriminant analysis yields:

>> [class,err,prob] = classify(JonesVar,JonesVar,JonesFac,'quadratic');
>> err

err =

    0.44

>> crosstab(JonesFac,class)

0 0 9 3 9 4 8 3 6 40 49 0 6 39 5 0 0 0 35 0 0 6

So, Q.D.A. produces a lower misallocation rate than L.D.A. when the training data are resubstituted into the allocation rule. This is not necessarily a good thing, since reproducing the training data too accurately (overtraining) can lead to poor generalization. More on that next time.

For the sake of illustration, we will look at the Q.D.A. allocations versus depth:

[Figure: Q.D.A. facies allocations plotted against depth in the Jones well.]

Note: The equality of the number of categories (facies) and number of predictor variables (logs) is a purely coincidental aspect of this example.

For all the millions of things I haven't said, see McLachlan, G.J., 1992, Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, Inc., New York, 526 pp.