CLASSICAL NORMAL-BASED DISCRIMINANT ANALYSIS

EECS 833, March 2006

Geoff Bohling
Assistant Scientist
Kansas Geological Survey
geoff@kgs.ku.edu
864-2093

Overheads and resources available at http://people.ku.edu/~gbohling/eecs833
Example Data

For the next two lectures, we will look at predicting facies from logs for a Lower Cretaceous (~145 to 100 million years old) section in north central Kansas. Facies assignments from core are available from the Jones well, along with a suite of logs including neutron and density porosity, photoelectric factor, and thorium, uranium and potassium components of the spectral gamma ray log. We will recast the density porosity as apparent matrix density (Rhomaa) and the photoelectric factor as apparent matrix volumetric photoelectric absorption (Umaa), so that the six logs employed for discrimination are: Th, U, K, Rhomaa, Umaa, and φN. The six facies picked from core are marine shale, paralic (coastal), floodplain, channel sandstone, splay sandstone, and paleosol.
So, we will train on this data from the Jones well and look at predictions both in the Jones well and the Kenyon well.
For the sake of illustration, we will also look at a two-dimensional, two-group sub-example, trying to discriminate marine and paralic facies. There are 8 core samples designated marine and 56 designated paralic. Classical discriminant analysis assumes that the data from each group follow a multivariate normal distribution, so we will take a bit of time to look at some properties of this distribution.
The Normal (Gaussian) Density Function

The probability density function for a single normally distributed variable, X, with a mean of µ and a standard deviation of σ is given by

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]

or

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-0.5\,z^2\right]

where

z = \frac{x-\mu}{\sigma}

represents a standardized version of x. The standardized random variable, Z, follows a normal distribution with a mean of zero and a standard deviation of 1, or Z ~ N(0,1).

The appearance of −z²/2, the negative of half the squared scaled distance to the mean, in the exponential is very important. This means that squared scaled distances (scaled Euclidean distances) are the natural distance metric for normally distributed variables. This leads, for example, to the close connection between the normal distribution and least-squares regression.
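The equivalence of the two forms of the density is easy to verify numerically. Here is a minimal Python/NumPy sketch (the lecture's own examples use Matlab; the values of µ, σ, and x below are arbitrary illustrations, not from the facies data):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 10.0, 2.0   # arbitrary illustrative mean and standard deviation
x = 12.5

# Direct form: f(x) = exp[-(x - mu)^2 / (2 sigma^2)] / (sigma sqrt(2 pi))
f_direct = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Standardized form: z = (x - mu)/sigma, f(x) = exp[-0.5 z^2] / (sigma sqrt(2 pi))
z = (x - mu) / sigma
f_standardized = np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))

print(f_direct, f_standardized, norm.pdf(x, mu, sigma))
```

All three values agree, since the standardized form just substitutes z into the exponent.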
For the multivariate normal distribution, we now consider a vector of random variables, X, with a vector mean of µ and a covariance matrix Σ. That is, each individual variable, X_i, follows a normal distribution with a mean of µ_i and a variance of σ_i² = Σ_ii, the appropriate diagonal element of the covariance matrix. The covariance between any pair of the variables, X_i and X_j, is given by Σ_ij and the corresponding correlation is given by

\rho_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}}
The multivariate normal density function for X is given by

f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x-\mu)'\,\Sigma^{-1}\,(x-\mu)\right]

where p is the number of variables or components of X and |Σ| is the determinant of the covariance matrix. The quadratic form

z^2 = (x-\mu)'\,\Sigma^{-1}\,(x-\mu)

now represents a squared distance to the vector mean, scaled according to the variances and covariances specified in Σ. This is the squared Mahalanobis distance to the vector mean. Using z² we can express the multivariate normal density function in the same form as the univariate version, apart from appropriate differences in the normalizing factor out front:

f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp[-0.5\,z^2] = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp[-0.5\,z^2]

The second form shown will be handy for later development.
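The squared Mahalanobis distance and the density can be computed directly from the formula above. A Python sketch with an arbitrary two-variable example (mean vector, covariance matrix, and evaluation point are all illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])            # illustrative mean vector
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # illustrative covariance matrix
x = np.array([2.0, 0.0])
p = len(mu)

# Squared Mahalanobis distance: z^2 = (x - mu)' Sigma^{-1} (x - mu)
diff = x - mu
z2 = diff @ np.linalg.solve(Sigma, diff)

# Density: (2 pi)^{-p/2} |Sigma|^{-1/2} exp(-0.5 z^2)
f = (2 * np.pi)**(-p / 2) * np.linalg.det(Sigma)**(-0.5) * np.exp(-0.5 * z2)

print(z2, f)
```

The hand-computed density matches scipy's multivariate_normal.pdf, confirming the formula.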
If each component variable, X_i, is scaled to zero mean and unit standard deviation in advance of the analysis, the resulting vector mean is µ = 0 and the covariance matrix is the same as the correlation matrix of the original variables, with 1's on the diagonal and correlations, ρ_ij, in the off-diagonal locations. In general, statistical analyses (regression, classification, etc.) using these standardized variables yield equivalent results to those based on the original variables. In particular, Mahalanobis distances between points in the standardized space are the same as those between corresponding points in the original space, so that the fundamental configuration of the data is unchanged by the translation (to zero mean) and scaling (to unit standard deviation).

If all the variables are mutually uncorrelated, then the correlation matrix (which is also the covariance matrix in standardized space) is the identity matrix and Mahalanobis distances reduce to Euclidean distances in the standardized space. This is all to say that Mahalanobis distances are essentially Euclidean distances scaled according to the individual variances (or standard deviations) and adjusted to account for correlations among the variables. The latter adjustment is basically a coordinate rotation and re-scaling in accordance with the principal axes of the correlation matrix.

The following plots show contours of the bivariate normal density function and Mahalanobis distances (M.D.) to the points (1,1) and (1,−1) when the correlation, ρ, between the two variables is 0 and when it is 0.9.
Fitting the Normal Distribution to Data

The normal distribution is determined by two parameters, mean and variance (or standard deviation). In a multivariate context, this means a mean vector and a covariance matrix. Fitting the normal distribution to a set of N data values is a simple matter of computing the average for each variable, X_i:

\bar{X}_i = \frac{1}{N} \sum_{n=1}^{N} X_{i,n}

and the sample covariance between each pair of variables, X_i and X_j:

\mathrm{Cov}(X_i, X_j) = \frac{1}{N-1} \sum_{n=1}^{N} \left(X_{i,n} - \bar{X}_i\right)\left(X_{j,n} - \bar{X}_j\right)

These serve as the estimates of the population means and covariances (variances when i = j). The division by (N−1), rather than N, for the covariance gives the usual unbiased estimator.

Although it is easy to fit the normal distribution in the sense of computing the sample means and covariances, there is absolutely no guarantee that the resulting normal distribution will actually fit the data well.
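The mean and covariance estimates above can be sketched in a few lines of Python (synthetic data standing in for a real N x p data matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))    # synthetic N x p data matrix (illustration only)
N = X.shape[0]

xbar = X.sum(axis=0) / N         # sample mean of each variable
D = X - xbar                     # deviations from the means
S = D.T @ D / (N - 1)            # unbiased sample covariance matrix, (N-1) divisor

print(xbar)
print(S)
```

This reproduces what np.cov (with rowvar=False) or Matlab's mean and cov functions compute.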
Assessing Multivariate Normality

The goodness-of-fit of the normal distribution to the observed data should be assessed prior to applying normal-based procedures, including classical discriminant analysis. Methods for assessing the goodness of fit to a normal distribution include graphical displays such as quantile-quantile plots and numerical tests such as the Kolmogorov-Smirnov test. The Matlab Statistics toolbox contains various functions for testing normality of univariate data (kstest, jbtest, lillietest). You can also assess fits using the distribution fitting tool (dfittool).

In the multivariate case, each variable must be normally distributed for the entire set to follow a multivariate normal distribution, but normality of the individual variables does not guarantee multivariate normality. However, if the data do follow a multivariate normal distribution, then the squared Mahalanobis distances from the data points to the centroid (mean) should follow a chi-squared distribution with p degrees of freedom.

As an example, we can compute squared M.D.'s for the marine Umaa-Rhomaa data points, in the 8 x 2 (N x p) data matrix URMar, using

N = size(URMar,1);    % number of data points (rows)
mu = mean(URMar);     % 1 x p vector of column means
sigma = cov(URMar);   % p x p covariance matrix
xmd = zeros(N,1);     % initialize vector of squared MD's
for i = 1:N
    z = (URMar(i,:) - mu)';   % difference from mean
    xmd(i) = z'*(sigma\z);    % squared MD to mean
end
Then we can use a Kolmogorov-Smirnov test to compare the computed squared M.D.'s to the theoretical chi-squared cumulative density function with p = 2 degrees of freedom:

>> kstest(xmd,[sort(xmd) chi2cdf(sort(xmd),2)])

ans =

     0

The K-S test compares the empirical cumulative density function for the data to the given theoretical CDF (in this case, the chi-squared with 2 degrees of freedom), comparing the maximum difference between the two to a certain test statistic whose distribution is known under the null hypothesis that the data follow the specified distribution. The answer of 0 indicates that, at a 5% significance level, we cannot reject the hypothesis that the observed values follow the chi-squared distribution.

We can also construct a quantile-quantile plot of the observed squared M.D.'s versus chi-squared quantiles using code like:

prb = ((1:N) - 0.5)/N;   % probability values
% degrees of freedom = number of variables:
df = size(URMar,2);
% plot sorted xmd's against chi-squared quantiles:
plot(chi2inv(prb,df),sort(xmd),'.');
hold on
plot([0 14],[0 14],'r-');   % 1-to-1 line
The resulting plot is shown below. This could also be accomplished with Matlab's qqplot function. The plot shows noticeable deviation from the 1-to-1 line, but the deviations are not strong enough for the K-S test to reject the possibility that the squared M.D.'s follow a chi-squared distribution.
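The same squared-M.D. check against the chi-squared distribution can be sketched in Python; here synthetic bivariate normal data stand in for URMar (the means, covariance, seed, and sample size are all arbitrary assumptions for illustration):

```python
import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(1)
p = 2
# synthetic bivariate normal data standing in for the URMar data matrix
X = rng.multivariate_normal([10.0, 2.7], [[1.0, 0.5], [0.5, 0.4]], size=300)

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
diff = X - mu
# squared Mahalanobis distances from each point to the centroid
xmd = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)

# K-S test of the squared M.D.'s against the chi-squared CDF with p d.o.f.
stat, pval = kstest(xmd, chi2(p).cdf)
print(stat, pval)
```

A large p-value means we cannot reject the chi-squared hypothesis, the analogue of Matlab's kstest returning 0.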
Discriminant Analysis

Classical discriminant analysis results from assuming that each data point arises with prior probability q_k from one of K different groups or classes, each characterized by its own group mean vector, µ_k, and covariance matrix, Σ_k. Plugging the multivariate normal density function into Bayes' theorem yields the following posterior probability for group k given a vector of observed data values, x:

P_k(x) = \Pr[G = k \mid x] = \frac{q_k\, f_k(x)}{\sum_{l=1}^{K} q_l\, f_l(x)} = \frac{q_k\, (2\pi)^{-p/2}\, |\Sigma_k|^{-1/2} \exp[-0.5\,z_k^2]}{\sum_{l=1}^{K} q_l\, (2\pi)^{-p/2}\, |\Sigma_l|^{-1/2} \exp[-0.5\,z_l^2]}

with

z_k^2 = (x - \mu_k)'\, \Sigma_k^{-1}\, (x - \mu_k)

representing the squared Mahalanobis distance from the data vector x to the kth group mean. The factor (2π)^{-p/2} is the same for all groups, so that the posterior probability for each group at x is an exponential transform of the negative squared Mahalanobis distance to the group centroid, adjusted by the prior probability and the determinant of the group covariance matrix.
The training process for classical discriminant analysis simply consists of estimating the mean and covariance matrix for each group or class, based on a training dataset with known classes for each data point. The jth component of the mean vector for group k is simply the mean for variable j over the N_k data points in group k:

\bar{x}_{k;j} = \frac{1}{N_k} \sum_{n \in k} x_{n;j}

where n ∈ k indicates the set of data points in group k. The group covariance matrices are commonly estimated one of two ways. Either a distinct estimate, S_k, is developed for each group's covariance matrix, with entries given by

S_{k;i,j} = \frac{1}{N_k - 1} \sum_{n \in k} \left(x_{n;i} - \bar{x}_{k;i}\right)\left(x_{n;j} - \bar{x}_{k;j}\right)

or the covariance matrices are assumed to be equal and estimated by a single pooled estimate, S, with entries:

S_{i,j} = \frac{1}{N - K} \sum_{n=1}^{N} \left(x_{n;i} - \bar{x}_{k(n);i}\right)\left(x_{n;j} - \bar{x}_{k(n);j}\right)

where by \bar{x}_{k(n);i} I mean the ith component of the mean vector for whichever group data point n belongs to, k(n).
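The group means and the pooled covariance estimate can be sketched as follows in Python, using two synthetic groups as a stand-in for real training data (group sizes, means, and covariance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
# two illustrative groups sharing a common covariance structure
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=40)
X2 = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], size=60)
X = np.vstack([X1, X2])
g = np.array([1] * 40 + [2] * 60)    # known group labels

N, K = len(X), 2
means = {k: X[g == k].mean(axis=0) for k in (1, 2)}   # group mean vectors

# pooled covariance: within-group cross-products summed over all N points,
# divided by (N - K)
D = np.vstack([X[g == k] - means[k] for k in (1, 2)])
S_pooled = D.T @ D / (N - K)
print(S_pooled)
```

Note that the pooled estimate is the same as the weighted average of the individual group covariance matrices, with weights (N_k − 1)/(N − K).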
If the group covariance matrices are assumed to be equal and estimated by the pooled data covariance matrix, S, then the squared Mahalanobis distance from a data vector x to the mean of group k is given by

z_k^2 = (x - \bar{x}_k)'\, S^{-1}\, (x - \bar{x}_k)

and the covariance matrix determinants are all equal, so that Bayes' formula reduces to

P_k(x) = \frac{q_k \exp[-0.5\,z_k^2]}{\sum_{l=1}^{K} q_l \exp[-0.5\,z_l^2]}

Computing the Mahalanobis distances using a common covariance estimate and then assigning each data vector, x, to the group with the highest posterior probability results in an allocation rule that draws linear boundaries between regions of space allocated to different groups. Thus, this approach is called linear discriminant analysis. Linear discriminant analysis is implemented by the classify function in Matlab's statistics toolbox, setting the type option to 'linear' (which is the default).
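The pooled-covariance posterior formula above translates directly into code. A minimal Python sketch (the function name and the two-group example values are assumptions for illustration, not part of Matlab's classify):

```python
import numpy as np

def lda_posteriors(x, means, S_pooled, priors):
    """Posterior probabilities from pooled-covariance Mahalanobis distances:
    P_k(x) = q_k exp(-0.5 z_k^2) / sum_l q_l exp(-0.5 z_l^2)."""
    Sinv = np.linalg.inv(S_pooled)
    z2 = np.array([(x - m) @ Sinv @ (x - m) for m in means])
    w = priors * np.exp(-0.5 * z2)
    return w / w.sum()

# illustrative two-group setup
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
S = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = np.array([0.5, 0.5])

post = lda_posteriors(np.array([1.8, 0.9]), means, S, priors)
print(post, post.argmax())
```

Because the same S appears in every z_k², the log-posterior differences are linear in x, which is where the linear boundaries come from.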
With the Umaa-Rhomaa data in the data matrix URData and the facies numbers (1 for marine, 2 for paralic) in the vector URFacies, we can perform linear discriminant analysis, predicting back on the training data, using

>> [class,err,prob] = classify(URData,URData,URFacies,'linear');
>> % the estimated error (misclassification) rate
>> err

err =

    0.0587

>> % compare original and predicted facies
>> crosstab(URFacies,class)

ans =

     6    45

We can get a contour plot of the posterior probability for the marine facies by predicting over a grid of values specified in URGrid (a 100 x 2 matrix containing all the combinations of 10 regularly-spaced Umaa values and 10 regularly-spaced Rhomaa values):

[class,err,prob] = classify(URGrid,URData,URFacies,'linear');

and then contouring the first column of the probability matrix against the grid coordinates (Ug, Rg):

>> [cs,h] = contour(Ug,Rg,reshape(prob(:,1),10,10)',(0.1:0.1:0.9));
>> clabel(cs,h,[0.1 0.5 0.9])
>> hold on
>> plot(URMar(:,1),URMar(:,2),'r+')
>> plot(URPar(:,1),URPar(:,2),'bo')
(I've also reversed the Rhomaa axis.) Because there are only two classes, the posterior probability for the paralic facies is just one minus that for the marine facies. The Prob = 0.5 contour is the dividing line between the regions in Umaa-Rhomaa space allocated to the two classes by Bayes' rule. Misallocations are inevitable unless there is no overlap between the original classes. Bayes' rule yields the minimum error or misclassification rate when the density estimates going into it are accurate. Adjusting the prior probabilities shifts the placement of the probability contours but does not change their orientation.
If the covariance matrices are not assumed to be equal and each is estimated separately by S_k, then the squared Mahalanobis distance to each group is given by

z_k^2 = (x - \bar{x}_k)'\, S_k^{-1}\, (x - \bar{x}_k)

and, in general, the covariance matrix determinants all differ, so that

P_k(x) = \frac{q_k\, |S_k|^{-1/2} \exp[-0.5\,z_k^2]}{\sum_{l=1}^{K} q_l\, |S_l|^{-1/2} \exp[-0.5\,z_l^2]}

Using Mahalanobis distances computed from the group-specific covariance matrices leads to an allocation rule that draws quadratic boundaries between groups in the variable space, so this is called quadratic discriminant analysis. It can be implemented using the classify function with type set to 'quadratic'. The Matlab documentation refers to the individual-group covariance estimates as stratified by group.
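The quadratic rule differs from the linear one only in using group-specific covariance matrices and the |S_k|^(-1/2) determinant factors. A Python sketch (function name and example values are illustrative assumptions):

```python
import numpy as np

def qda_posteriors(x, means, covs, priors):
    """Posteriors with group-specific covariances:
    P_k(x) proportional to q_k |S_k|^(-1/2) exp(-0.5 z_k^2)."""
    scores = []
    for m, S, q in zip(means, covs, priors):
        d = x - m
        z2 = d @ np.linalg.solve(S, d)   # squared Mahalanobis distance to group k
        scores.append(q * np.exp(-0.5 * z2) / np.sqrt(np.linalg.det(S)))
    scores = np.array(scores)
    return scores / scores.sum()

# illustrative groups with unequal covariance matrices
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.array([[1.0, 0.0], [0.0, 1.0]]),
        np.array([[0.5, 0.2], [0.2, 0.3]])]
priors = [0.5, 0.5]

post = qda_posteriors(np.array([2.1, 1.1]), means, covs, priors)
print(post)
```

Because each group contributes its own quadratic form x'S_k⁻¹x, the boundaries between groups are quadratic surfaces rather than planes.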
Applying quadratic discriminant analysis to the Umaa-Rhomaa data yields:

>> [class,err,prob] = classify(URData,URData,URFacies,'quadratic');
>> err   % a slightly lower error rate

err =

    0.0544

>> crosstab(URFacies,class)

ans =

     9     9
     6    50
Note that if the prior probabilities for all groups are assumed to be equal (e.g., you have no basis for assigning unequal priors) and the covariance matrix determinants are assumed to be equal (or differences among them are ignored), then Bayes' formula simply reduces to

P_k(x) = \frac{\exp[-0.5\,z_k^2]}{\sum_{l=1}^{K} \exp[-0.5\,z_l^2]}

so that allocating x to the nearest group, in terms of Mahalanobis distance, is equivalent to allocating it to the group with the highest posterior probability. In other words, if you are only interested in allocation, and not probabilities, you can stop after computing Mahalanobis distances and assign each data point to the group with minimum Mahalanobis distance. This can be done with the classify function by setting type to 'mahalanobis'. Note that this option uses stratified covariance matrix estimates, not a pooled covariance matrix estimate. Since the stratified estimates, S_k, will in general have differing determinants, running classify with type = 'mahalanobis' will produce different allocations than running it using type = 'quadratic', even if you specify equal priors. The Mahalanobis distance allocations will also differ from those produced using type = 'linear', since the latter computes distances based on the pooled covariance matrix.
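The equivalence between minimum-Mahalanobis-distance allocation and maximum posterior probability (under equal priors and equal determinants) can be sketched as follows; here a single common covariance matrix is used, and the function name and values are illustrative assumptions:

```python
import numpy as np

def mahalanobis_allocate(x, means, S):
    """Assign x to the group with the smallest squared Mahalanobis distance,
    computed with one common covariance matrix S (equal priors, equal
    determinants), which matches the maximum-posterior rule in that case."""
    z2 = np.array([(x - m) @ np.linalg.solve(S, x - m) for m in means])
    return int(z2.argmin())

means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
S = np.array([[1.0, 0.3], [0.3, 1.0]])

x = np.array([0.4, -0.2])
k = mahalanobis_allocate(x, means, S)

# equivalent posterior rule: argmax of exp(-0.5 z_k^2) / sum_l exp(-0.5 z_l^2)
z2 = np.array([(x - m) @ np.linalg.solve(S, x - m) for m in means])
post = np.exp(-0.5 * z2) / np.exp(-0.5 * z2).sum()
print(k, post.argmax())
```

Since exp(−0.5 z²) is monotonically decreasing in z², the smallest distance always gives the largest posterior, so the two rules agree.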
Applying linear discriminant analysis to the full Jones dataset, with six facies (specified in the vector JonesFac) and six logs (in the 88 x 6 data matrix JonesVar), and predicting back on the Jones data itself yields:

>> [class,err,prob] = classify(JonesVar,JonesVar,JonesFac,'linear');
>> err

err =

    0.368

>> crosstab(JonesFac,class)

07 3 0 3 4 0 7 3 4 0 5 8 6 3 5 38 0 7 0 4 3 0 5 9 0 4 0 0 9 8 46

Quadratic discriminant analysis yields:

>> [class,err,prob] = classify(JonesVar,JonesVar,JonesFac,'quadratic');
>> err

err =

    0.44

>> crosstab(JonesFac,class)

0 0 9 3 9 4 8 3 6 40 49 0 6 39 5 0 0 0 35 0 0 6

So, Q.D.A. produces a lower misallocation rate than L.D.A. when the training data are resubstituted into the allocation rule. This is not necessarily a good thing, since reproducing the training data too accurately (overtraining) can lead to poor generalization. More on that next time.
For the sake of illustration, we will look at the Q.D.A. allocations versus depth:

Note: The equality of the number of categories (facies) and number of predictor variables (logs) is a purely coincidental aspect of this example.

For all the millions of things I haven't said, see McLachlan, G.J., 1992, Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, Inc., New York, 526 pp.