Discriminant analysis and supervised classification

Angela Montanari

1 Linear discriminant analysis

Linear discriminant analysis (LDA), also known as Fisher's linear discriminant analysis or as canonical variate analysis, is a widely used method aimed at finding the linear combinations of observed features which best characterize or separate two or more classes of objects or events. The resulting combinations are commonly used for dimensionality reduction and discrimination, prior to subsequent classification.

LDA is closely related to principal component analysis (PCA) in that both look for linear combinations of variables which best explain the data. However, LDA explicitly attempts to model the difference between the classes, while PCA does not take class membership into account at all.

Here are a few examples of possible applications of the method:

Identification. Identifying the type of customer that is likely to buy a certain product in a store. Using simple questionnaire surveys, we can collect features of the customers; discriminant analysis helps us select the features that best describe membership in the "buy" or "not buy" group.

Decision making. A doctor diagnosing an illness may be seen as assessing which disease the patient has. We can transform this into a classification problem by assigning the patient to one of a number of possible disease groups on the basis of the observed symptoms.

Prediction. The question "Will it rain today?" can be thought of as a prediction problem, i.e. as assigning today to one of the two possible groups "rain" and "dry".

Pattern recognition. Distinguishing pedestrians from dogs and cars in a captured image sequence of traffic data is a classification problem.

Learning. Scientists attempting to teach a robot to talk can be seen as facing a classification problem: frequency, pitch, tune, and many other measurements of sound are assigned to groups of words.

The method was first introduced by R.A. Fisher in 1936.

Discrimination

Let's assume we have $G$ groups of units (each composed of $n_g$ units, for $g = 1, \dots, G$, such that $\sum_{g=1}^G n_g = n$) on which a vector random variable $x$ (corresponding to $p$ observed numeric variables) has been observed. We also assume that, for the $G$ populations the groups come from, the homoscedasticity condition holds, i.e. $\Sigma_1 = \Sigma_2 = \dots = \Sigma_g = \dots = \Sigma_G = \Sigma$.

Fisher suggested looking for the linear combination $y$ of the variables in $x$, $y = a^T x$, which best separates the groups. This amounts to looking for the vector $a$ such that, when the units are projected along it, the groups are as separated as possible and, at the same time, as internally homogeneous as possible. In this framework the function to be optimized with respect to $a$ is the ratio of the between-group to the within-group variance of the linear combination $y$.

In the $x$ space the overall average is $\bar{x}$, while each group has average vector $\bar{x}_g$ and covariance matrix $S_g$; because of the properties of the arithmetic mean,

$$\bar{x} = \frac{1}{n} \sum_{g=1}^{G} n_g \bar{x}_g \qquad (1)$$

The variable $y$ will therefore have overall average $\bar{y} = a^T \bar{x}$ and, for each group, average value $\bar{y}_g = a^T \bar{x}_g$ and variance $\mathrm{Var}(y_g) = a^T S_g a$. The within-group variance of $y$ is

$$\mathrm{Var}(y)_{within} = \frac{1}{n-G} \sum_{g=1}^{G} (n_g - 1)\,\mathrm{Var}(y_g) = \frac{1}{n-G} \sum_{g=1}^{G} (n_g - 1)\, a^T S_g a = a^T \left\{ \frac{1}{n-G} \sum_{g=1}^{G} (n_g - 1) S_g \right\} a = a^T W a$$

where $W = \frac{1}{n-G} \sum_{g=1}^{G} (n_g - 1) S_g$ is the within-group covariance matrix (also known as the within-group scatter matrix) in the observed variable space (it is meaningful because of the homoscedasticity assumption). The $n - G$ degrees of freedom derive from the fact that the outer sum involves $G$ terms, each of which has $n_g - 1$ degrees of freedom.
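To make the quantities above concrete, here is a minimal numpy sketch of the pooled within-group covariance matrix $W$; the function name within_scatter and the arguments X (an n x p data matrix) and labels (group membership) are illustrative, not part of the original notes.

```python
import numpy as np

def within_scatter(X, labels):
    """Pooled within-group covariance matrix W = (1/(n-G)) * sum_g (n_g - 1) S_g."""
    groups = np.unique(labels)
    n, p = X.shape
    W = np.zeros((p, p))
    for g in groups:
        Xg = X[labels == g]
        # (n_g - 1) * S_g is the scatter of group g around its own mean
        W += (len(Xg) - 1) * np.cov(Xg, rowvar=False)
    return W / (n - len(groups))
```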

Analogously, the between-group variance of $y$ is

$$\mathrm{Var}(y)_{between} = \frac{1}{G-1} \sum_{g=1}^{G} n_g (\bar{y}_g - \bar{y})^2 = \frac{1}{G-1} \sum_{g=1}^{G} n_g (a^T \bar{x}_g - a^T \bar{x})^2 = a^T \left\{ \frac{1}{G-1} \sum_{g=1}^{G} n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T \right\} a = a^T B a$$

where $B = \frac{1}{G-1} \sum_{g=1}^{G} n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T$ is the between-group covariance matrix (also known as the between-group scatter matrix) in the observed variable space. Its rank is at most $G - 1$.

In the simple two-group case the between variance has one degree of freedom only, and $B$ has the simple expression $B = \sum_{g=1}^{2} n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T$. After writing $\bar{x}$ as in equation (1) and a little algebra it becomes

$$B = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^T$$

This clearly shows that in the two-group case the between-group covariance matrix has rank equal to 1.

In general, the function we need to optimize with respect to $a$ is therefore

$$\phi = \frac{\mathrm{Var}(y)_{between}}{\mathrm{Var}(y)_{within}} = \frac{a^T B a}{a^T W a} \qquad (2)$$

(notice that it coincides with the F statistic of the analysis of variance).
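A companion sketch, under the same illustrative naming as before, computes the between-group matrix $B$ and evaluates Fisher's criterion $\phi$ for a candidate direction a.

```python
import numpy as np

def between_scatter(X, labels):
    """Between-group covariance matrix B = (1/(G-1)) * sum_g n_g (xbar_g - xbar)(xbar_g - xbar)^T."""
    groups = np.unique(labels)
    xbar = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    for g in groups:
        Xg = X[labels == g]
        d = Xg.mean(axis=0) - xbar
        B += len(Xg) * np.outer(d, d)
    return B / (len(groups) - 1)

def fisher_criterion(a, B, W):
    """phi = (a' B a) / (a' W a), the ratio maximized by the discriminant direction."""
    return (a @ B @ a) / (a @ W @ a)
```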

In order to find the vector $a$ for which $\phi$ is maximum we differentiate it with respect to $a$ and set the derivatives to 0:

$$\frac{\partial \phi}{\partial a} = 2\, \frac{B a (a^T W a) - W a (a^T B a)}{(a^T W a)^2} = 0$$

Recalling equation (2), this becomes

$$\frac{B a}{a^T W a} - \frac{W a\, \phi}{a^T W a} = 0$$

and then

$$B a - W a\, \phi = 0 \qquad (3)$$

or equivalently

$$(B - \phi W) a = 0$$

After pre-multiplying both sides by $W^{-1}$ (under the assumption that it is non-singular) we obtain

$$(W^{-1} B - \phi I) a = 0$$

This is a homogeneous linear equation system which admits a non-trivial solution if and only if $\det(W^{-1} B - \phi I) = 0$; this means that $\phi$ is an eigenvalue of $W^{-1} B$ and $a$ is the corresponding eigenvector. As $\phi$ is the function we want to maximize, we choose the largest eigenvalue and the corresponding eigenvector as the best discriminant direction.

We can find up to $G - 1$ different discriminant directions, as the rank of $B$ (and hence of $W^{-1} B$) is at most $G - 1$. Each of them has decreasing discrimination power. They define a vector subspace containing the between-group variability. These vectors are primarily used in feature reduction and can be interpreted in the same way as principal components.
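Numerically, the discriminant directions can be obtained as the leading eigenvectors of $W^{-1}B$. The sketch below (illustrative names, not from the original notes) does exactly that; for better numerical stability one could instead solve the equivalent symmetric generalized eigenproblem, e.g. with scipy.linalg.eigh(B, W).

```python
import numpy as np

def discriminant_directions(B, W, n_groups):
    """Eigen-decomposition of W^{-1} B; returns at most G-1 directions, by decreasing eigenvalue."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
    order = np.argsort(eigvals.real)[::-1][: n_groups - 1]
    return eigvals.real[order], eigvecs[:, order].real
```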

As $W^{-1} B$ is not symmetric, its eigenvectors are not necessarily orthogonal, but it can be proved that the linear combinations $y$ obtained through them (also known as canonical variates) are uncorrelated. Let's go back to equation (3) and consider two generic eigenvalue-eigenvector pairs, say $\{\phi_i, a_i\}$ and $\{\phi_j, a_j\}$. For them the following equalities hold:

$$B a_i = W a_i \phi_i \qquad B a_j = W a_j \phi_j$$

After pre-multiplying the first one by $a_j^T$ and the second one by $a_i^T$ we obtain

$$a_j^T B a_i = a_j^T W a_i \phi_i \qquad a_i^T B a_j = a_i^T W a_j \phi_j$$

But $a_j^T B a_i = a_i^T B a_j$ because they are both scalars, and hence also

$$a_j^T W a_i \phi_i = a_i^T W a_j \phi_j$$

Again $a_j^T W a_i = a_i^T W a_j$ as both are scalars; therefore, unless $\phi_i = \phi_j$ (a rare event in practice), the above equality only holds if

$$a_j^T W a_i = a_i^T W a_j = 0$$

i.e. if the discriminant variables are uncorrelated. In matrix form: $A^T W A = \mathrm{diag}$. Many software packages (R included, see the tutorial) scale $A$ so that $A^T W A = I$: the discriminant variables are then uncorrelated and have unit within-group variance. They are often said to be sphered.

As, from equation (3), $B A = W A \Phi$, where $\Phi$ is the diagonal matrix of the eigenvalues, pre-multiplying by $A^T$ gives, in the within-group sphered case,

$$A^T B A = A^T W A\, \Phi = \Phi$$

i.e. the discriminant variables are uncorrelated between groups too, and have decreasing between-group variance. Being uncorrelated both within and between groups, the discriminant variables are also uncorrelated over the whole set of units.

Fisher proved that in the two-group case the only discriminant direction $a$ can be obtained as $a = W^{-1} (\bar{x}_1 - \bar{x}_2)$. The corresponding linear combination is therefore $y = a^T x = (\bar{x}_1 - \bar{x}_2)^T W^{-1} x$.

Classification

Even if it has been derived for discrimination purposes, Fisher's linear function can also be used to address classification issues, i.e. to define a rule for assigning a unit, whose group membership is unknown, to one of the $G$ known groups. The method is general, but for teaching purposes we will limit our attention to the two-group case only.

Let's denote by $\bar{y}_1$ the projection of the group 1 average on $a$, $\bar{y}_1 = a^T \bar{x}_1 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_1$, and by $\bar{y}_2 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_2$ the projection of the group 2 average on $a$. Let's also assume, without loss of generality, that $\bar{y}_1 > \bar{y}_2$. Denoting by $x_0$ the new unit we want to classify and by $y_0 = a^T x_0 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0$ its projection on $a$, a natural allocation rule consists in assigning $x_0$ to the group whose average it is closest to along $a$: i.e. assign $x_0$ to group 1 if $|y_0 - \bar{y}_1| < |y_0 - \bar{y}_2|$ and to group 2 if the opposite inequality holds. This amounts to assigning $x_0$ to group 1 if $y_0 > \frac{\bar{y}_1 + \bar{y}_2}{2}$.
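As a small illustration of the two-group case, the following sketch computes Fisher's direction and the midpoint threshold just described; the arrays X1 and X2, holding the units of the two groups, are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def fisher_two_group_rule(X1, X2):
    """Fisher's direction a = W^{-1}(xbar1 - xbar2) and the midpoint threshold (ybar1 + ybar2)/2."""
    n1, n2 = len(X1), len(X2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-group covariance matrix, two-group case
    W = ((n1 - 1) * np.cov(X1, rowvar=False) + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(W, xbar1 - xbar2)
    threshold = 0.5 * (a @ xbar1 + a @ xbar2)
    return a, threshold

# assign a new unit x0 to group 1 if a @ x0 > threshold, to group 2 otherwise
```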

Due to the results previously obtained, this defines the following rule: assign $x_0$ to group 1 if

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2} \left\{ (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_1 + (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_2 \right\}$$

or equivalently

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$

which is known as the linear classification rule, as it is a linear function of the observed vector variable $x$. This allocation rule is very popular as it can be obtained by addressing the classification problem from a variety of different perspectives. We will see a couple of them in the following.

2 Classification based on Mahalanobis distance

One common way to measure the distance between points in a $p$-dimensional space is the Euclidean distance. However, the Euclidean distance attaches equal weight to all the axes of the representation; if the variances of the observed variables differ, and if the variables are correlated, this might not be a desirable feature. In Fig. 1 a bivariate point cloud is represented. The variables have different variances and are correlated. The points A and B have the same Euclidean distance from the center of the point cloud (i.e. from the average vector), but while B lies in the core of the distribution, A lies in a low-frequency region.

Figure 1: Euclidean distance

The Mahalanobis distance provides a way to take variances and correlations into account when computing distances, i.e. to take into account the shape of the point cloud. Given a point whose coordinate vector is $x$, its Mahalanobis distance from the center $\bar{x}$ of the point cloud is defined as

$$d_M(x, \bar{x}) = \left\{ (x - \bar{x})^T S^{-1} (x - \bar{x}) \right\}^{1/2}$$

where, as usual, $S$ is the sample covariance matrix. The Mahalanobis distance is thus a weighted Euclidean distance. In Fig. 2 the same point cloud and two points having the same Mahalanobis distance are represented.

Figure 2: Mahalanobis distance

The Mahalanobis distance turns out to be really useful for classification purposes too. Assuming homoscedasticity, and estimating the unknown common covariance matrix by the within-group covariance matrix $W$, the squared Mahalanobis distance between a unit $x_0$ and the group 1 center can be defined as $d_M^2(x_0, \bar{x}_1) = (x_0 - \bar{x}_1)^T W^{-1} (x_0 - \bar{x}_1)$; the same holds for group 2. If $W$ is the identity matrix, i.e. if, within groups, the variables are uncorrelated and have unit variance, then the Mahalanobis distance reduces to the ordinary Euclidean one.
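A minimal sketch of the Mahalanobis distance as just defined (argument names are illustrative):

```python
import numpy as np

def mahalanobis_distance(x, center, S):
    """d_M(x, center) = sqrt((x - center)' S^{-1} (x - center))."""
    d = x - center
    return float(np.sqrt(d @ np.linalg.solve(S, d)))
```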

The allocation rule consisting in assigning a unit to the group whose average it is closest to will assign unit $x_0$ to group 1 if

$$(x_0 - \bar{x}_1)^T W^{-1} (x_0 - \bar{x}_1) < (x_0 - \bar{x}_2)^T W^{-1} (x_0 - \bar{x}_2)$$

and to group 2 if the opposite inequality holds. After a few passages one easily obtains the classification rule: assign $x_0$ to group 1 if

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$

This is exactly Fisher's linear classification rule.

The method can be extended to the classification of units into more than two groups. In the $G$-group case we allocate an object to the group for which (a) the Mahalanobis distance between the object and the class mean is smallest, using the original variables, or (b) the Euclidean distance between the object and the class mean is smallest, in terms of the canonical variates.

Figure 3: Three group case
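The nearest-mean allocation in the $G$-group case, using the common within-group matrix $W$, might be sketched as follows (illustrative names, not from the original notes):

```python
import numpy as np

def allocate_nearest_mean(x0, group_means, W):
    """Assign x0 to the group whose mean is closest in Mahalanobis distance (common matrix W)."""
    d2 = [(x0 - m) @ np.linalg.solve(W, x0 - m) for m in group_means]
    return int(np.argmin(d2))  # index of the selected group
```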

3 Classification based on probability models

So far we have addressed classification issues in a purely distribution-free context. Now we will assume that we know the shape of the probability density functions (pdfs) that have generated our data in the $G$ groups. We will consider the $G = 2$ case only.

Let $x$ be the $p$-dimensional vector of the observed variables and $x_0$ a new unit whose group membership is unknown. Let $\Pi_1$ and $\Pi_2$ denote the two parent populations. The key assumption is that $x$ has a different pdf in $\Pi_1$ and in $\Pi_2$: denote the pdf of $x$ in $\Pi_1$ by $f_1(x)$ and the pdf of $x$ in $\Pi_2$ by $f_2(x)$.

Let $R$ denote the set of all possible values $x$ can assume. As $f_1(x)$ and $f_2(x)$ usually overlap, each point of $R$ can belong both to $\Pi_1$ and to $\Pi_2$, but with different probability. The goal is to partition $R$ into two exhaustive, non-overlapping regions $R_1$ and $R_2$ ($R_1 \cup R_2 = R$ and $R_1 \cap R_2 = \emptyset$) such that the probability of a wrong classification is minimum when a unit falling in $R_1$ is allocated to $\Pi_1$ and a unit falling in $R_2$ is allocated to $\Pi_2$.

Given $x_0$, a very intuitive rule consists in allocating it to $\Pi_1$ if the probability that it comes from $\Pi_1$ is larger than the probability that it comes from $\Pi_2$, and to $\Pi_2$ if the opposite holds. According to this criterion, $R_1$ is the set of the $x$ values such that $f_1(x) > f_2(x)$ and $R_2$ is the set of the $x$ values such that $f_1(x) < f_2(x)$. The ensuing classification rule is therefore:

allocate $x_0$ to $\Pi_1$ if $\frac{f_1(x_0)}{f_2(x_0)} > 1$;

allocate $x_0$ to $\Pi_2$ if $\frac{f_1(x_0)}{f_2(x_0)} < 1$;

allocate $x_0$ randomly to one of the two populations if equality holds.

This classification rule is known as the likelihood ratio rule. However intuitively reasonable, it neglects possibly different prior probabilities of class membership and possibly different misclassification costs.

Let's denote by $\pi_1$ the prior probability that $x_0$ belongs to $\Pi_1$ and by $\pi_2$ the prior probability that $x_0$ belongs to $\Pi_2$ ($\pi_1 + \pi_2 = 1$). Based on likelihoods only, the probability that a unit belonging to $\Pi_1$ is wrongly classified to $\Pi_2$ (this happens when it falls in $R_2$) is

$$\int_{R_2} f_1(x)\,dx$$

If we take prior probabilities into account too, the probability of a wrong classification of a unit to $\Pi_2$ when it effectively comes from $\Pi_1$ is

$$p(2|1) = \pi_1 \int_{R_2} f_1(x)\,dx$$

and the probability of a wrong classification of a unit to $\Pi_1$ when it effectively comes from $\Pi_2$ is

$$p(1|2) = \pi_2 \int_{R_1} f_2(x)\,dx$$

The total probability of a wrong classification is therefore

$$prob = p(2|1) + p(1|2)$$

$R_1$ and $R_2$ should therefore be chosen in such a way that $prob$ is minimum. $prob$ can be written as

$$prob = \pi_1 \int_{R_2} f_1(x)\,dx + \pi_2 \int_{R_1} f_2(x)\,dx \qquad (4)$$

Since $R$ is the complete space and pdfs integrate to 1 over their domain,

$$\int_{R} f_1(x)\,dx = \int_{R} f_2(x)\,dx = 1$$

and, as $R_1 \cup R_2 = R$ and $R_1 \cap R_2 = \emptyset$, we have

$$\int_{R} f_1(x)\,dx = \int_{R_1} f_1(x)\,dx + \int_{R_2} f_1(x)\,dx = 1$$

and hence

$$\int_{R_2} f_1(x)\,dx = 1 - \int_{R_1} f_1(x)\,dx$$

After replacing it in equation (4) we obtain

$$prob = \pi_1 \left[ 1 - \int_{R_1} f_1(x)\,dx \right] + \pi_2 \int_{R_1} f_2(x)\,dx = \pi_1 - \pi_1 \int_{R_1} f_1(x)\,dx + \pi_2 \int_{R_1} f_2(x)\,dx = \pi_1 + \int_{R_1} \left[ \pi_2 f_2(x) - \pi_1 f_1(x) \right] dx$$

As $\pi_1$ is a constant, $prob$ is minimum when the integral is minimum, i.e. when the integrand is negative at every point of $R_1$. This means that $R_1$ should be chosen so that, for the points belonging to it,

$$\pi_2 f_2(x) - \pi_1 f_1(x) < 0$$

that is

$$\pi_1 f_1(x) > \pi_2 f_2(x)$$

The ensuing allocation rule will then be:

allocate $x_0$ to $\Pi_1$ if $\frac{f_1(x_0)}{f_2(x_0)} > \frac{\pi_2}{\pi_1}$;

allocate $x_0$ to $\Pi_2$ if $\frac{f_1(x_0)}{f_2(x_0)} < \frac{\pi_2}{\pi_1}$;

allocate $x_0$ randomly to one of the two populations if equality holds.

In the equal prior case ($\pi_1 = \pi_2 = 1/2$) the likelihood ratio rule is recovered.
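A minimal sketch of this prior-weighted rule, assuming f1 and f2 are callables that return the two density values at a point (these names are illustrative, not from the original notes); ties are broken at random as stated above.

```python
import numpy as np

def bayes_allocate(x0, f1, f2, pi1=0.5, pi2=0.5, rng=None):
    """Allocate x0 by comparing pi1*f1(x0) with pi2*f2(x0), i.e. f1/f2 with pi2/pi1."""
    rng = rng if rng is not None else np.random.default_rng()
    lhs, rhs = pi1 * f1(x0), pi2 * f2(x0)
    if lhs > rhs:
        return 1
    if lhs < rhs:
        return 2
    return int(rng.choice([1, 2]))  # ties broken at random, as in the text
```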

It is worth adding that the classification rule obtained by minimizing the total probability of a wrong classification is equivalent to the one that would be obtained by maximizing the posterior probability of population membership. That is why it is often called the optimal Bayesian rule. Following the same steps as before, in the case of unequal misclassification costs the rule minimizing the total misclassification cost can be obtained.

Gaussian populations

Let's assume that both $f_1(x)$ and $f_2(x)$ are multivariate normal densities:

$$f_1(x) = (2\pi)^{-p/2} |\Sigma_1|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) \right\}$$

$$f_2(x) = (2\pi)^{-p/2} |\Sigma_2|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right\}$$

The likelihood ratio is therefore

$$\frac{f_1(x)}{f_2(x)} = \frac{|\Sigma_1|^{-1/2}}{|\Sigma_2|^{-1/2}} \exp\left\{ -\tfrac{1}{2} \left[ (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right] \right\}$$

$$= \frac{|\Sigma_1|^{-1/2}}{|\Sigma_2|^{-1/2}} \exp\left\{ -\tfrac{1}{2} \left[ x^T \Sigma_1^{-1} x + \mu_1^T \Sigma_1^{-1} \mu_1 - 2 x^T \Sigma_1^{-1} \mu_1 - x^T \Sigma_2^{-1} x - \mu_2^T \Sigma_2^{-1} \mu_2 + 2 x^T \Sigma_2^{-1} \mu_2 \right] \right\}$$

$$= \frac{|\Sigma_1|^{-1/2}}{|\Sigma_2|^{-1/2}} \exp\left\{ -\tfrac{1}{2} \left[ x^T (\Sigma_1^{-1} - \Sigma_2^{-1}) x - 2 x^T (\Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2) + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right] \right\}$$

The expression can be simplified by taking the logarithm of the likelihood ratio. In this way the so-called quadratic discriminant function is obtained:

$$Q(x) = \ln \frac{f_1(x)}{f_2(x)} = \frac{1}{2} \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} \left[ x^T (\Sigma_1^{-1} - \Sigma_2^{-1}) x - 2 x^T (\Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2) + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right]$$

The expression clearly shows that $Q(x)$ is a quadratic function of $x$. The ensuing classification rule allocates $x_0$ to $\Pi_1$ if $Q(x_0) > \ln(H)$, where $H$ is either 1, if equal priors are assumed, or $\pi_2/\pi_1$ in the unequal prior case.
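The quadratic discriminant function can be coded directly from its definition; the sketch below uses the equivalent form of the bracketed term written with the two quadratic forms $(x-\mu_g)^T\Sigma_g^{-1}(x-\mu_g)$, and its parameter names are illustrative.

```python
import numpy as np

def quadratic_discriminant(x, mu1, Sigma1, mu2, Sigma2):
    """Q(x) = ln f1(x)/f2(x) for two multivariate normal densities."""
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    d1, d2 = x - mu1, x - mu2
    q1 = d1 @ np.linalg.solve(Sigma1, d1)  # (x - mu1)' Sigma1^{-1} (x - mu1)
    q2 = d2 @ np.linalg.solve(Sigma2, d2)  # (x - mu2)' Sigma2^{-1} (x - mu2)
    return 0.5 * (logdet2 - logdet1) - 0.5 * (q1 - q2)
```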

When $\Sigma_1 = \Sigma_2 = \Sigma$ the likelihood ratio becomes

$$\frac{f_1(x)}{f_2(x)} = \exp\left\{ (\mu_1 - \mu_2)^T \Sigma^{-1} x - \tfrac{1}{2} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) \right\} = \exp\left\{ (\mu_1 - \mu_2)^T \Sigma^{-1} \left[ x - \tfrac{1}{2} (\mu_1 + \mu_2) \right] \right\}$$

After taking logarithms we obtain

$$L(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} \left[ x - \tfrac{1}{2} (\mu_1 + \mu_2) \right]$$

This is known as the linear discriminant rule, as it is a linear function of $x$. The ensuing classification rule allocates $x_0$ to $\Pi_1$ if $L(x_0) > \ln(H)$, where $H$ is either 1, if equal priors are assumed, or $\pi_2/\pi_1$ in the unequal prior case.

In empirical applications, maximum likelihood estimates of the model parameters are plugged into the classification rules: $\mu_1$ and $\mu_2$ are estimated by $\bar{x}_1$ and $\bar{x}_2$. Furthermore, in the heteroscedastic case $\Sigma_1$ and $\Sigma_2$ are replaced by $S_1$ and $S_2$ respectively, while in the homoscedastic case the common $\Sigma$ is replaced by the within-group covariance matrix $W$. The empirical linear discriminant rule, in the equal prior case, becomes

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} \left[ x_0 - \tfrac{1}{2} (\bar{x}_1 + \bar{x}_2) \right] > 0$$

or equivalently

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \tfrac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$

In it the allocation rule obtained through Fisher's approach can easily be recognized. This means that, for Gaussian populations and equal priors, besides optimizing group separation, Fisher's rule also minimizes the probability of a wrong classification.
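A sketch of the empirical linear discriminant rule with plug-in estimates (illustrative names):

```python
import numpy as np

def linear_discriminant_score(x0, xbar1, xbar2, W):
    """Empirical L(x0) = (xbar1 - xbar2)' W^{-1} [x0 - (xbar1 + xbar2)/2]."""
    a = np.linalg.solve(W, xbar1 - xbar2)
    return float(a @ (x0 - 0.5 * (xbar1 + xbar2)))

# equal priors: allocate x0 to group 1 if the score is positive;
# unequal priors: compare the score with ln(pi2 / pi1) instead
```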

4 Assessment of classification rules

After a classification rule has been devised, a measure of its performance must be obtained. In other words, an answer must be provided to the question: how good is the classification rule at correctly predicting class membership?

The most obvious measure of performance is an estimate of the proportion of individuals that will be misallocated by the rule, and the simplest way to obtain it is by applying the allocation rule to the same sample units that have been used to derive it. The proportion of misallocated individuals provides an estimate of the error rate. The method is called the resubstitution method (because the sample units are used to derive the rule and are then resubstituted into it in order to evaluate its performance) and the corresponding error rate is known as the apparent error rate. Although a very simple procedure, the resubstitution method produces biased error estimates: it gives an optimistic measure of the performance of the rule, which is expected to perform worse on units that have not contributed to its derivation.

A safer strategy is to split the original sample into two subgroups: the training sample (containing 70-75% of the units), on which the classification rule is constructed, and the test sample, on which the classification rule is evaluated. The error rate is estimated by the proportion of misallocated units in the test set.

In case the sample is so small that splitting it into two parts might produce rules based on unstable model parameters, resampling techniques have been suggested. The most common one is k-fold cross-validation: split the sample into k equally sized sub-samples, build the rule using the data of k-1 folds as the training set and the remaining fold as the test set; repeat the procedure excluding one fold at a time and obtain the error rate estimate by averaging the error rates obtained at the various steps. A famous variant is the so-called leave-one-out (or jackknife) estimator, obtained by excluding one observation at a time from the training set and using it as the test set.
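As an illustration, here is a minimal sketch of the k-fold cross-validation error estimate described above; fit and predict are placeholder callables standing for whatever classification rule is being assessed, and all names are illustrative rather than taken from the original notes.

```python
import numpy as np

def kfold_error_rate(X, y, fit, predict, k=5, seed=0):
    """Estimate the misclassification rate by k-fold cross-validation.

    fit(X_train, y_train) should return a fitted rule and
    predict(rule, X_test) its predicted labels; both are placeholders.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        rule = fit(X[train_idx], y[train_idx])
        y_hat = predict(rule, X[test_idx])
        errors.append(np.mean(y_hat != y[test_idx]))
    return float(np.mean(errors))
```

Setting k equal to the sample size gives the leave-one-out (jackknife) variant mentioned above.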