DISCRIMINANT ANALYSIS
Hector Hawkins
1. Introduction

Discrimination and classification are concerned with separating objects from different populations into different groups and with allocating new observations to one of these groups. The goals thus are:

- To describe (graphically or algebraically) the differences between objects from several known populations. We construct discriminants whose numerical values separate the different collections as much as possible.
- To assign objects to one of several labeled classes. We derive a classification rule that can be used to assign (new) objects to one of the labeled classes.

Examples.
- Based on historical bank data, separate the good from the poor credit risks (based on income, age, family size, etc.). Classify new credit applications into one of these two classes to decide whether to grant or reject a loan.
- Distinguish between readers and non-readers of a magazine or newspaper based on, e.g., education level, age, income and profession, so that the publisher knows which categories of people are potential new readers.
A good procedure should result in as few misclassifications as possible. It should take into account the likelihood that objects belong to each of the classes (= prior probability of occurrence). One often also takes the costs of misclassification into account. For example, the cost of not operating on a person who needs surgery is much higher than that of unnecessarily operating on a healthy person, so the first type of misclassification has to be avoided as much as possible.
2. Discrimination and Classification of Two Populations

We now focus on separating objects from two classes and assigning new objects to one of these two classes. The classes will be labeled π₁ and π₂. Each object consists of measurements on p random variables X₁, ..., X_p such that the observed values differ to some extent from one class to the other. The distributions associated with the two populations will be described by their density functions f₁ and f₂ respectively. Now consider an observed value x = (x₁, ..., x_p)^τ of the random variable X = (X₁, ..., X_p)^τ. Then

- f₁(x) is the density in x if x belongs to population π₁
- f₂(x) is the density in x if x belongs to population π₂

The object x must be assigned to either population π₁ or π₂. Denote by Ω the sample space (= the collection of all possible outcomes of X) and partition the sample space as Ω = R₁ ∪ R₂, where R₁ is the subspace of outcomes which we classify as belonging to population π₁ and R₂ = Ω \ R₁ is the subspace of outcomes classified as belonging to π₂. It follows that the (conditional) probability of classifying an object as belonging to π₂ when it is really from π₁ equals

    P(2|1) = P(X ∈ R₂ | X ∈ π₁) = ∫_{R₂} f₁(x) dx

and the (conditional) probability of assigning an object to π₁ when it in fact is from π₂ equals

    P(1|2) = P(X ∈ R₁ | X ∈ π₂) = ∫_{R₁} f₂(x) dx
[Figure: the densities f₁(x) and f₂(x) plotted over the classification regions R₁ and R₂; the shaded tail areas correspond to the misclassification probabilities P(2|1) and P(1|2).]

Similarly, we define the conditional probabilities P(1|1) and P(2|2). To obtain the probabilities of correctly and incorrectly classifying objects we also have to take the prior class probabilities into account. Denote

    p₁ = P(X ∈ π₁) = prior probability of π₁
    p₂ = P(X ∈ π₂) = prior probability of π₂

where p₁ + p₂ = 1. It follows that the overall probabilities of correctly and incorrectly classifying objects are given by
    P(object is correctly classified as π₁) = P(X ∈ π₁ and X ∈ R₁) = P(X ∈ R₁ | X ∈ π₁) P(X ∈ π₁) = P(1|1) p₁
    P(object is misclassified as π₁)        = P(X ∈ π₂ and X ∈ R₁) = P(X ∈ R₁ | X ∈ π₂) P(X ∈ π₂) = P(1|2) p₂
    P(object is correctly classified as π₂) = P(X ∈ π₂ and X ∈ R₂) = P(X ∈ R₂ | X ∈ π₂) P(X ∈ π₂) = P(2|2) p₂
    P(object is misclassified as π₂)        = P(X ∈ π₁ and X ∈ R₂) = P(X ∈ R₂ | X ∈ π₁) P(X ∈ π₁) = P(2|1) p₁

To take the cost of misclassification into account, denote

    c(2|1) = the cost of classifying an object from π₁ as π₂
    c(1|2) = the cost of classifying an object from π₂ as π₁

A classification rule is obtained by minimizing the expected cost of misclassification:

    ECM := c(2|1) P(2|1) p₁ + c(1|2) P(1|2) p₂
Result 1. The regions R₁ and R₂ that minimize the ECM are given by

    R₁ = { x ∈ Ω : f₁(x)/f₂(x) ≥ (c(1|2)/c(2|1)) (p₂/p₁) }
    R₂ = { x ∈ Ω : f₁(x)/f₂(x) < (c(1|2)/c(2|1)) (p₂/p₁) }

Proof. Using that P(1|1) + P(2|1) = 1 (since R₁ ∪ R₂ = Ω) we obtain

    ECM = c(2|1) P(2|1) p₁ + c(1|2) P(1|2) p₂
        = c(2|1) (1 − P(1|1)) p₁ + c(1|2) P(1|2) p₂
        = c(2|1) p₁ + ∫_{R₁} [c(1|2) f₂(x) p₂ − c(2|1) f₁(x) p₁] dx

Since probabilities and densities, as well as misclassification costs (there is no gain by misclassifying objects), are nonnegative, the ECM is minimal if the integrand satisfies

    c(1|2) f₂(x) p₂ − c(2|1) f₁(x) p₁ ≤ 0    for all x ∈ R₁

which yields the regions above.

Note that these regions only depend on ratios:

    f₁(x)/f₂(x)   = density ratio
    c(1|2)/c(2|1) = cost ratio
    p₂/p₁         = prior probability ratio

These ratios are often much easier to determine than the exact values of their components.
Special Cases:

1. Equal (or unknown) prior probabilities: compare the density ratio with the cost ratio:

    R₁: f₁(x)/f₂(x) ≥ c(1|2)/c(2|1)        R₂: f₁(x)/f₂(x) < c(1|2)/c(2|1)

2. Equal (or undetermined) misclassification costs: compare the density ratio with the prior probability ratio:

    R₁: f₁(x)/f₂(x) ≥ p₂/p₁        R₂: f₁(x)/f₂(x) < p₂/p₁

3. Equal prior probabilities and equal misclassification costs (or p₂/p₁ = (c(1|2)/c(2|1))⁻¹):

    R₁: f₁(x)/f₂(x) ≥ 1        R₂: f₁(x)/f₂(x) < 1

Example 1. Suppose we set the cost ratio equal to c(1|2)/c(2|1) = 2 and we know that 20% of all objects belong to π₂. Given that f₁(x₀) = 0.3 and f₂(x₀) = 0.4, do we classify x₀ as belonging to π₁ or π₂? We have that p₂ = 0.2, so p₁ = 0.8 and p₂/p₁ = 0.25. Therefore, we obtain

    R₁: f₁(x)/f₂(x) ≥ 2 (0.25) = 0.5    and    R₂: f₁(x)/f₂(x) < 2 (0.25) = 0.5

For x₀ we have

    f₁(x₀)/f₂(x₀) = 0.3/0.4 = 0.75 > 0.5

so we find x₀ ∈ R₁ and classify x₀ as belonging to π₁.
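The minimum-ECM rule for two populations reduces to a single threshold comparison. A minimal sketch in Python; the function name is ours, and the numbers are those of the worked example (cost ratio 2 and prior ratio 0.25, giving the threshold 0.5):

```python
# Minimum-ECM classification for two populations (Result 1): compare the
# density ratio f1(x)/f2(x) with the threshold (c(1|2)/c(2|1)) * (p2/p1).

def classify_two(f1_x, f2_x, cost_ratio=1.0, prior_ratio=1.0):
    """Return 1 (assign to pi_1) or 2 (assign to pi_2).

    cost_ratio  = c(1|2)/c(2|1), prior_ratio = p2/p1.
    """
    return 1 if f1_x / f2_x >= cost_ratio * prior_ratio else 2

# Example 1: cost ratio 2 and p2 = 0.2, so the threshold is 2 * 0.25 = 0.5.
label = classify_two(f1_x=0.3, f2_x=0.4, cost_ratio=2.0, prior_ratio=0.25)
print(label)  # 0.75 >= 0.5, so x0 is assigned to pi_1 and this prints 1
```

Note that only the three ratios enter the function, mirroring the remark after Result 1.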
3. Classification with Two Multivariate Normal Populations

We now assume that f₁ and f₂ are multivariate normal densities with mean vectors µ₁ and µ₂ and covariance matrices Σ₁ and Σ₂ respectively.

3.1. Σ₁ = Σ₂ = Σ

The density of population πᵢ (i = 1, 2) is now given by

    fᵢ(x) = (2π)^{−p/2} det(Σ)^{−1/2} exp(−½ (x − µᵢ)^τ Σ⁻¹ (x − µᵢ))

Result 2. If the populations π₁ and π₂ both have multivariate normal densities with equal covariance matrices, then the classification rule corresponding to minimizing the ECM becomes: classify x₀ as π₁ if

    (µ₁ − µ₂)^τ Σ⁻¹ x₀ − ½ (µ₁ − µ₂)^τ Σ⁻¹ (µ₁ + µ₂) ≥ ln[(c(1|2)/c(2|1)) (p₂/p₁)]

and classify x₀ as π₂ otherwise.

Proof. We assign x₀ to π₁ if

    f₁(x₀)/f₂(x₀) ≥ (c(1|2)/c(2|1)) (p₂/p₁)

which can be rewritten as

    exp(−½ (x₀ − µ₁)^τ Σ⁻¹ (x₀ − µ₁) + ½ (x₀ − µ₂)^τ Σ⁻¹ (x₀ − µ₂)) ≥ (c(1|2)/c(2|1)) (p₂/p₁)
By taking logarithms on both sides and using the equality

    −½ (x₀ − µ₁)^τ Σ⁻¹ (x₀ − µ₁) + ½ (x₀ − µ₂)^τ Σ⁻¹ (x₀ − µ₂) = (µ₁ − µ₂)^τ Σ⁻¹ x₀ − ½ (µ₁ − µ₂)^τ Σ⁻¹ (µ₁ + µ₂)

we obtain the classification rule.

In practice, the population parameters µ₁, µ₂ and Σ are unknown and have to be estimated from the data. Suppose we have n₁ objects belonging to π₁ (denoted x_1^(1), ..., x_{n₁}^(1)) and n₂ objects from π₂ (denoted x_1^(2), ..., x_{n₂}^(2)), with n₁ + n₂ = n the total sample size. The sample mean vectors and covariance matrices of the two groups are given by

    x̄₁ = (1/n₁) ∑_{j=1}^{n₁} x_j^(1)        S₁ = 1/(n₁ − 1) ∑_{j=1}^{n₁} (x_j^(1) − x̄₁)(x_j^(1) − x̄₁)^τ
    x̄₂ = (1/n₂) ∑_{j=1}^{n₂} x_j^(2)        S₂ = 1/(n₂ − 1) ∑_{j=1}^{n₂} (x_j^(2) − x̄₂)(x_j^(2) − x̄₂)^τ

Since both populations have the same covariance matrix Σ, we combine the two sample covariance matrices S₁ and S₂ to obtain a more precise estimate of Σ, given by

    S_pooled = [(n₁ − 1) / ((n₁ − 1) + (n₂ − 1))] S₁ + [(n₂ − 1) / ((n₁ − 1) + (n₂ − 1))] S₂

By replacing µ₁, µ₂ and Σ with x̄₁, x̄₂ and S_pooled in Result 2 we obtain the sample classification rule: classify x₀ as π₁ if

    (x̄₁ − x̄₂)^τ S_pooled⁻¹ x₀ − ½ (x̄₁ − x̄₂)^τ S_pooled⁻¹ (x̄₁ + x̄₂) ≥ ln[(c(1|2)/c(2|1)) (p₂/p₁)]

and classify x₀ as π₂ otherwise.
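The sample rule can be sketched in Python. The two groups below are synthetic (made up for illustration); note that numpy.cov with its default divisor n − 1 matches the S₁ and S₂ defined above:

```python
import numpy as np

# Sample version of Result 2: pool the two group covariance matrices and
# classify x0 with the linear rule.

def pooled_cov(X1, X2):
    """S_pooled as the weighted average of the two sample covariances."""
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False)  # divides by n1 - 1
    S2 = np.cov(X2, rowvar=False)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def lda_classify(x0, X1, X2, log_threshold=0.0):
    """Classify x0 as 1 if the discriminant score reaches log_threshold,
    where log_threshold = ln[(c(1|2)/c(2|1)) (p2/p1)]."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    a = np.linalg.solve(pooled_cov(X1, X2), xbar1 - xbar2)
    score = a @ x0 - 0.5 * a @ (xbar1 + xbar2)
    return 1 if score >= log_threshold else 2

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))  # synthetic pi_1
X2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))  # synthetic pi_2
print(lda_classify(np.array([0.2, -0.1]), X1, X2))  # near the pi_1 mean -> 1
print(lda_classify(np.array([2.9, 3.2]), X1, X2))   # near the pi_2 mean -> 2
```

Using numpy.linalg.solve instead of explicitly inverting S_pooled is the numerically preferred way to form S_pooled⁻¹(x̄₁ − x̄₂).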
Special Case: equal prior probabilities and equal misclassification costs. Then

    ln[(c(1|2)/c(2|1)) (p₂/p₁)] = ln(1) = 0

such that we assign x₀ to π₁ if

    (x̄₁ − x̄₂)^τ S_pooled⁻¹ x₀ ≥ ½ (x̄₁ − x̄₂)^τ S_pooled⁻¹ (x̄₁ + x̄₂)

Denote â = S_pooled⁻¹ (x̄₁ − x̄₂) ∈ IR^p; then this can be rewritten as

    â^τ x₀ ≥ ½ (â^τ x̄₁ + â^τ x̄₂)

That is, we have to compare the scalar ŷ₀ = â^τ x₀ with the midpoint m̂ = ½ (ȳ₁ + ȳ₂) = ½ (â^τ x̄₁ + â^τ x̄₂). We thus have created two univariate populations (determined by the y-values) by projecting the original data on the direction determined by â. This direction is the (estimated) direction in which the two populations are best separated.

Remark 1. By replacing the unknown parameters with their estimates, there is no guarantee anymore that the resulting classification rule minimizes the expected cost of misclassification. However, we expect that we obtain a good estimate of the optimal rule.

Example 2. To develop a test for potential hemophilia carriers, blood samples were taken from two groups of patients. The two variables measured are AHF activity and AHF-like antigen, where AHF means AntiHemophilic Factor. For both variables we take the logarithm (base 10). The first group of n₁ = 30 patients did not carry the hemophilia gene. The second group consisted of n₂ = 22 known hemophilia carriers. From these samples the following statistics have been derived:

    x̄₁ = (−0.0065, −0.0390)^τ,    x̄₂ = (−0.2483, 0.0262)^τ,

    S_pooled⁻¹ = (  131.158   −90.423 )
                 (  −90.423   108.147 )
Assuming equal costs and equal priors, we compute

    â = S_pooled⁻¹ (x̄₁ − x̄₂) = (37.61, −28.92)^τ

and

    ȳ₁ = â^τ x̄₁ = 0.88,    ȳ₂ = â^τ x̄₂ = −10.10.

The corresponding midpoint thus becomes m̂ = ½ (ȳ₁ + ȳ₂) = −4.61. A new object x = (x₁, x₂)^τ is classified as a non-carrier if

    ŷ = 37.61 x₁ − 28.92 x₂ ≥ m̂ = −4.61

and as a carrier otherwise. A potential hemophilia carrier has values x₁ = −0.210 and x₂ = −0.044. Should this patient be classified as a carrier? We obtain ŷ = −6.62 < −4.61, so we indeed assign this patient to the population of carriers.

Suppose now that it is known that the prior probability of being a hemophilia carrier is 25%. Then a new patient is classified as a non-carrier if ŷ − m̂ ≥ ln(p₂/p₁). We find

    ŷ − m̂ = −6.62 − (−4.61) = −2.01    and    ln(p₂/p₁) = ln(0.25/0.75) = ln(1/3) = −1.10

Since −2.01 < −1.10, we still classify this patient as a carrier.
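The computations of Example 2 can be reproduced numerically. The sample means and pooled inverse covariance below are assumed values for this classic hemophilia data set; with them the derived quantities agree with those quoted in the example up to rounding:

```python
import numpy as np

# Hemophilia example: direction a, midpoint m_hat, and the score of the
# new patient. The statistics below are assumed inputs.

xbar1 = np.array([-0.0065, -0.0390])        # non-carriers (assumed)
xbar2 = np.array([-0.2483, 0.0262])         # carriers (assumed)
Spooled_inv = np.array([[131.158, -90.423],
                        [-90.423, 108.147]])  # assumed pooled inverse

a = Spooled_inv @ (xbar1 - xbar2)
ybar1, ybar2 = a @ xbar1, a @ xbar2
m_hat = 0.5 * (ybar1 + ybar2)
print(np.round(a, 2))   # approximately (37.61, -28.92)
print(round(m_hat, 2))  # approximately -4.61

x_new = np.array([-0.210, -0.044])  # potential carrier (assumed values)
y_new = a @ x_new
print(round(y_new, 2))  # about -6.6, below m_hat, so classified as carrier
```

The agreement of â and m̂ with the text is a useful consistency check on the assumed statistics.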
3.2. Σ₁ ≠ Σ₂

The density of population πᵢ (i = 1, 2) is now given by

    fᵢ(x) = (2π)^{−p/2} det(Σᵢ)^{−1/2} exp(−½ (x − µᵢ)^τ Σᵢ⁻¹ (x − µᵢ))

Result 3. If the populations π₁ and π₂ both have multivariate normal densities with mean vectors and covariance matrices µ₁, Σ₁ and µ₂, Σ₂ respectively, then the classification rule corresponding to minimizing the ECM becomes: classify x₀ as π₁ if

    −½ x₀^τ (Σ₁⁻¹ − Σ₂⁻¹) x₀ + (µ₁^τ Σ₁⁻¹ − µ₂^τ Σ₂⁻¹) x₀ − k ≥ ln[(c(1|2)/c(2|1)) (p₂/p₁)]

and classify x₀ as π₂ otherwise. The constant k is given by

    k = ½ ln(det(Σ₁)/det(Σ₂)) + ½ (µ₁^τ Σ₁⁻¹ µ₁ − µ₂^τ Σ₂⁻¹ µ₂)

Proof. We assign x₀ to π₁ if

    ln(f₁(x₀)/f₂(x₀)) ≥ ln[(c(1|2)/c(2|1)) (p₂/p₁)]

and

    ln(f₁(x₀)/f₂(x₀)) = −½ ln(det(Σ₁)/det(Σ₂)) − ½ (x₀ − µ₁)^τ Σ₁⁻¹ (x₀ − µ₁) + ½ (x₀ − µ₂)^τ Σ₂⁻¹ (x₀ − µ₂)
                      = −½ x₀^τ (Σ₁⁻¹ − Σ₂⁻¹) x₀ + (µ₁^τ Σ₁⁻¹ − µ₂^τ Σ₂⁻¹) x₀ − k
In practice, the parameters µ₁, µ₂, Σ₁ and Σ₂ are unknown and are replaced by the estimates x̄₁, x̄₂, S₁ and S₂, which yields the following sample classification rule: classify x₀ as π₁ if

    −½ x₀^τ (S₁⁻¹ − S₂⁻¹) x₀ + (x̄₁^τ S₁⁻¹ − x̄₂^τ S₂⁻¹) x₀ − k̂ ≥ ln[(c(1|2)/c(2|1)) (p₂/p₁)]

and classify x₀ as π₂ otherwise. The constant k̂ is given by

    k̂ = ½ ln(det(S₁)/det(S₂)) + ½ (x̄₁^τ S₁⁻¹ x̄₁ − x̄₂^τ S₂⁻¹ x̄₂)
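The sample quadratic rule can be sketched directly from the formula; the estimates x̄₁, x̄₂, S₁, S₂ below are hypothetical, chosen so that π₂ has a wider spread than π₁:

```python
import numpy as np

# Sample quadratic rule of Section 3.2 (S1 != S2): classify x0 as pi_1 when
# the quadratic score reaches ln[(c(1|2)/c(2|1)) (p2/p1)].

def qda_score(x0, xbar1, xbar2, S1, S2):
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    k = (0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2))
         + 0.5 * (xbar1 @ S1i @ xbar1 - xbar2 @ S2i @ xbar2))
    return (-0.5 * x0 @ (S1i - S2i) @ x0
            + (xbar1 @ S1i - xbar2 @ S2i) @ x0 - k)

xbar1, xbar2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])  # hypothetical
S1 = np.eye(2)        # pi_1: unit spread (hypothetical)
S2 = 4.0 * np.eye(2)  # pi_2: wider spread (hypothetical)

# Equal costs and priors: threshold ln(1) = 0.
print(qda_score(np.array([0.1, 0.2]), xbar1, xbar2, S1, S2) >= 0.0)  # True -> pi_1
```

Unlike the rule of Section 3.1, the decision boundary here is quadratic in x₀ because S₁⁻¹ − S₂⁻¹ does not vanish.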
4. Evaluating Classification Rules

To judge the performance of a sample classification procedure, we want to calculate its misclassification probability or error rate. A measure of performance that can be calculated for any classification procedure is the apparent error rate (APER), which is defined as the fraction of observations in the sample that are misclassified by the classification procedure. Denote by n₁M and n₂M the number of misclassified π₁ and π₂ objects respectively; then

    APER = (n₁M + n₂M) / (n₁ + n₂)

The APER is intuitively appealing and easy to calculate. Unfortunately, it tends to underestimate the actual error rate (AER) when classifying new objects. This underestimation occurs because we used the sample to build the classification rule (therefore we call it the training sample) as well as to evaluate it.

To obtain a reliable estimate of the AER we ideally consider an independent test sample of new objects for which we know the true class label. This means that we split the original sample into a training sample and a test sample. The AER is then estimated by the proportion of misclassified objects in the test sample, while the training sample is used to construct the classification rule. However, there are two drawbacks to this approach:

- It requires large samples.
- The classification rule is less precise because we do not use the information from the test sample to build the classifier.
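The APER is a one-line computation; a quick arithmetic check with hypothetical counts and group sizes:

```python
# Apparent error rate for a two-group rule. All counts are hypothetical.

n1M, n2M = 14, 3   # misclassified objects from each group (hypothetical)
n1, n2 = 69, 29    # group sizes (hypothetical)

aper = (n1M + n2M) / (n1 + n2)
print(round(100 * aper, 1))  # 17.3 (percent)
```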
An alternative is the (leave-one-out) cross-validation or jackknife procedure, which works as follows:

1. Leave one object out of the sample and construct a classification rule based on the remaining n − 1 objects in the sample.
2. Classify the left-out observation using the classification rule constructed in step 1.
3. Repeat the two previous steps for each of the n objects in the sample.

Denote by n₁M^CV and n₂M^CV the number of left-out observations misclassified in class 1 and class 2 respectively. Then a good estimate of the actual error rate is given by

    ÂER = (n₁M^CV + n₂M^CV) / (n₁ + n₂)

Example 3. We consider a sample of size n = 98 containing the responses to visual stimuli of both eyes, measured for patients suffering from multiple sclerosis and for controls (healthy patients). Based on these measured responses and age we want to develop a rule that allows us to classify potential patients, and we want to estimate the actual error rate as well. The assumption of equal covariances is acceptable. Prior probabilities and costs of misclassification are undetermined and thus considered to be equal.
Analyzing the data in S-Plus we find the group means x̄₁ and x̄₂ and the pooled covariance matrix S_pooled, from which the classification rule becomes: classify a patient as suffering from multiple sclerosis if

    ŷ = â^τ x ≥ m̂

and otherwise the patient is healthy.

Based on the training sample we obtain the misclassification counts n₁M = 14 and n₂M = 3, which yields the apparent error rate

    APER = (14 + 3)/98 = 17.3%.

On the other hand, by using cross-validation we obtain the misclassifications n₁M^CV = 15 and n₂M^CV = 5, which yields the estimated actual error rate

    ÂER = (15 + 5)/98 = 20.4%,

which is 3.1% higher! Note that three times as many persons are misclassified as MS patients than as healthy, even though the misclassification costs were assumed to be equal.
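The three-step leave-one-out procedure can be sketched as a small loop. The two-group data below are synthetic (the MS data themselves are not reproduced), and the rule is the equal-costs, equal-priors linear rule of Section 3.1:

```python
import numpy as np

# Leave-one-out cross-validation estimate of the actual error rate (AER).

def lda_label(x0, X1, X2):
    """Equal-costs, equal-priors linear rule built from two training groups."""
    n1, n2 = len(X1), len(X2)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    Sp = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    a = np.linalg.solve(Sp, xbar1 - xbar2)
    return 1 if a @ x0 >= 0.5 * a @ (xbar1 + xbar2) else 2

def loo_error_rate(X1, X2):
    errors = 0
    for i in range(len(X1)):            # steps 1-2 for each pi_1 object
        rest = np.delete(X1, i, axis=0)
        errors += lda_label(X1[i], rest, X2) != 1
    for i in range(len(X2)):            # and for each pi_2 object
        rest = np.delete(X2, i, axis=0)
        errors += lda_label(X2[i], X1, rest) != 2
    return errors / (len(X1) + len(X2))  # estimated AER

rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))  # synthetic group 1
X2 = rng.normal([2.0, 2.0], 1.0, size=(40, 2))  # synthetic group 2
print(loo_error_rate(X1, X2))  # small error rate for well-separated groups
```

Each left-out object is classified by a rule that never saw it, which is what removes the optimistic bias of the APER.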
5. Classification with Several Populations

We now consider the more general situation of separating objects from g (g ≥ 2) classes and assigning new objects to one of these g classes. For i = 1, ..., g denote

    fᵢ      the density associated with population πᵢ
    pᵢ      the prior probability of population πᵢ
    Rᵢ      the subspace of outcomes assigned to πᵢ
    c(j|i)  the cost of misclassifying an object to πⱼ when it is from πᵢ
    P(j|i)  the conditional probability of assigning an object of πᵢ to πⱼ

The (conditional) expected cost of misclassifying an object of population π₁ is given by

    ECM(1) = P(2|1) c(2|1) + ... + P(g|1) c(g|1) = ∑_{i=2}^{g} P(i|1) c(i|1)

and similarly we can determine the expected cost of misclassifying objects of populations π₂, ..., π_g. It follows that the overall ECM equals

    ECM = ∑_{j=1}^{g} pⱼ ECM(j) = ∑_{j=1}^{g} pⱼ ∑_{i≠j} P(i|j) c(i|j)
Result 4. The classification rule that minimizes the ECM assigns each object x to the population πᵢ for which

    ∑_{j≠i} pⱼ fⱼ(x) c(i|j)

is smallest. If the minimum is not unique, then x can be assigned to any of the populations for which the minimum is attained. (Without proof.)

Special Case: if all misclassification costs are equal (or unknown) we assign x to the population πᵢ for which ∑_{j≠i} pⱼ fⱼ(x) is smallest, or equivalently for which pᵢ fᵢ(x) is largest. We thus obtain: classify x as πᵢ if

    pᵢ fᵢ(x) > pⱼ fⱼ(x)    for all j ≠ i
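The equal-cost rule (assign x to the class with the largest pᵢ fᵢ(x)) can be sketched with three univariate normal populations; all means, spreads and priors below are hypothetical:

```python
import numpy as np

# Equal-cost rule for g populations: pick the class maximizing p_i * f_i(x).

def normal_pdf(x, mu, sigma):
    """Univariate normal density, written out explicitly."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def classify(x, mus, sigmas, priors):
    scores = [p * normal_pdf(x, m, s) for p, m, s in zip(priors, mus, sigmas)]
    return int(np.argmax(scores)) + 1  # population label 1..g

mus, sigmas = [0.0, 3.0, 6.0], [1.0, 1.0, 1.0]  # hypothetical
priors = [0.5, 0.3, 0.2]                        # hypothetical

print(classify(0.4, mus, sigmas, priors))  # closest to mu_1 -> 1
print(classify(5.5, mus, sigmas, priors))  # closest to mu_3 -> 3
```

With unequal priors the boundaries shift toward the rarer classes, since each density is weighted by its pᵢ before the maximum is taken.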
5.1. Classification with Normal Populations

The density of population πᵢ (i = 1, ..., g) is now given by

    fᵢ(x) = (2π)^{−p/2} det(Σᵢ)^{−1/2} exp(−½ (x − µᵢ)^τ Σᵢ⁻¹ (x − µᵢ))

Result 5. If all misclassification costs are equal (or unknown), we assign x to the population πᵢ if the (quadratic) score dᵢ(x) = max_{j=1,...,g} dⱼ(x), where the scores are given by

    dⱼ(x) = −½ ln(det(Σⱼ)) − ½ (x − µⱼ)^τ Σⱼ⁻¹ (x − µⱼ) + ln(pⱼ)    j = 1, ..., g

Proof. We assign x to πᵢ if ln(pᵢ fᵢ(x)) = max_{j=1,...,g} ln(pⱼ fⱼ(x)), and

    ln(pⱼ fⱼ(x)) = ln(pⱼ) − (p/2) ln(2π) − ½ ln(det(Σⱼ)) − ½ (x − µⱼ)^τ Σⱼ⁻¹ (x − µⱼ).

Dropping the second term, which is the same for all j, yields the result.

In practice, the parameters µⱼ and Σⱼ are unknown and will be replaced by the sample means x̄ⱼ and covariances Sⱼ, which yields the sample classification rule: classify x as πᵢ if the (quadratic) score d̂ᵢ(x) = max_{j=1,...,g} d̂ⱼ(x), where the scores are given by

    d̂ⱼ(x) = −½ ln(det(Sⱼ)) − ½ (x − x̄ⱼ)^τ Sⱼ⁻¹ (x − x̄ⱼ) + ln(pⱼ)    j = 1, ..., g
If all covariance matrices are equal, Σⱼ = Σ for j = 1, ..., g, then the quadratic scores dⱼ(x) become

    dⱼ(x) = −½ ln(det(Σ)) − ½ x^τ Σ⁻¹ x + µⱼ^τ Σ⁻¹ x − ½ µⱼ^τ Σ⁻¹ µⱼ + ln(pⱼ).

The first two terms are the same for all dⱼ(x), so they can be left out, which yields the (linear) scores

    dⱼ(x) = µⱼ^τ Σ⁻¹ x − ½ µⱼ^τ Σ⁻¹ µⱼ + ln(pⱼ).

To estimate these scores in practice we use the sample means x̄ⱼ and the pooled estimate of Σ given by

    S_pooled = [(n₁ − 1) S₁ + ... + (n_g − 1) S_g] / [(n₁ − 1) + ... + (n_g − 1)]

which yields the sample classification rule: classify x as πᵢ if the (linear) score d̂ᵢ(x) = max_{j=1,...,g} d̂ⱼ(x), where the scores are given by

    d̂ⱼ(x) = x̄ⱼ^τ S_pooled⁻¹ x − ½ x̄ⱼ^τ S_pooled⁻¹ x̄ⱼ + ln(pⱼ)    j = 1, ..., g

Remark 2. In the case of equal covariance matrices, the scores dⱼ(x) can also be reduced to

    dⱼ(x) = −½ (x − µⱼ)^τ Σ⁻¹ (x − µⱼ) + ln(pⱼ) = −½ d²_Σ(x, µⱼ) + ln(pⱼ)

which can be estimated by d̂ⱼ(x) = −½ d²_{S_pooled}(x, x̄ⱼ) + ln(pⱼ). If the prior probabilities are all equal (or unknown), we thus assign an object x to the closest population.
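With equal priors, maximizing the linear score is the same as minimizing the Mahalanobis distance of Remark 2; a sketch with hypothetical group means and pooled covariance:

```python
import numpy as np

# Linear scores for g groups with a common covariance matrix:
#   d_j(x) = xbar_j' Sp^{-1} x - 0.5 xbar_j' Sp^{-1} xbar_j + ln(p_j)
# and the equivalent closest-group view via the Mahalanobis distance.

def linear_scores(x, xbars, Sp_inv, priors):
    return [xb @ Sp_inv @ x - 0.5 * xb @ Sp_inv @ xb + np.log(p)
            for xb, p in zip(xbars, priors)]

def mahalanobis_sq(x, xb, Sp_inv):
    d = x - xb
    return d @ Sp_inv @ d

xbars = [np.array([0.0, 0.0]),
         np.array([3.0, 0.0]),
         np.array([0.0, 3.0])]                       # hypothetical means
Sp_inv = np.linalg.inv(np.array([[1.0, 0.3],
                                 [0.3, 1.0]]))       # hypothetical S_pooled
priors = [1 / 3] * 3                                 # equal priors

x = np.array([0.4, 2.6])
scores = linear_scores(x, xbars, Sp_inv, priors)
dists = [mahalanobis_sq(x, xb, Sp_inv) for xb in xbars]
print(int(np.argmax(scores)) + 1)  # 3: the group with the largest score ...
print(int(np.argmin(dists)) + 1)   # 3: ... is also the closest group
```

The two answers always coincide under equal priors, because the scores and −½ d² differ only by a term that does not depend on j.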
Example 4. The admission board of a business school uses two measures to decide on the admittance of applicants:

    GPA  = undergraduate grade point average
    GMAT = graduate management aptitude test score

Based on these measures, applicants are categorized as: admit (π₁), do not admit (π₂), and borderline (π₃).

[Figure: scatter plot of the training set in the (GPA, GMAT) plane, showing the three groups.]

Based on the training sample with group sizes n₁ = 31, n₂ = 28 and n₃ = 26 we calculate the group means

    x̄₁ = (3.40, 561.23)^τ,    x̄₂ = (2.48, 447.07)^τ,    and    x̄₃ = (2.99, 446.23)^τ,

and their pooled covariance matrix S_pooled.
With equal prior probabilities we assign a new applicant x = (x₁, x₂)^τ to the closest class, so we compute its (squared) distance to each of the three classes:

    d²_{S_pooled}(x, x̄ⱼ) = (x − x̄ⱼ)^τ S_pooled⁻¹ (x − x̄ⱼ)    j = 1, 2, 3.

Suppose a new applicant has test scores x^τ = (3.21, 497). Then we obtain

    d²_{S_pooled}(x, x̄₁) = 2.58,    d²_{S_pooled}(x, x̄₂) = 17.10,    d²_{S_pooled}(x, x̄₃) = 2.47.

The distance from x to π₃ is thus smallest, so this applicant is a borderline case.
More informationINF Introduction to classifiction Anne Solberg
INF 4300 8.09.17 Introduction to classifiction Anne Solberg anne@ifi.uio.no Introduction to classification Based on handout from Pattern Recognition b Theodoridis, available after the lecture INF 4300
More informationCOWLEY COLLEGE & Area Vocational Technical School
COWLEY COLLEGE & Area Vocational Technical School COURSE PROCEDURE FOR COLLEGE ALGEBRA WITH REVIEW MTH 4421 5 Credit Hours Student Level: This course is open to students on the college level in the freshman
More informationFast and Robust Discriminant Analysis
Fast and Robust Discriminant Analysis Mia Hubert a,1, Katrien Van Driessen b a Department of Mathematics, Katholieke Universiteit Leuven, W. De Croylaan 54, B-3001 Leuven. b UFSIA-RUCA Faculty of Applied
More informationSemiparametric Discriminant Analysis of Mixture Populations Using Mahalanobis Distance. Probal Chaudhuri and Subhajit Dutta
Semiparametric Discriminant Analysis of Mixture Populations Using Mahalanobis Distance Probal Chaudhuri and Subhajit Dutta Indian Statistical Institute, Kolkata. Workshop on Classification and Regression
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam
ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam Linear Regression Models Least Squares Input vectors is an attribute / feature / predictor (independent variable) The
More informationISyE 6416: Computational Statistics Spring Lecture 5: Discriminant analysis and classification
ISyE 6416: Computational Statistics Spring 2017 Lecture 5: Discriminant analysis and classification Prof. Yao Xie H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology
More informationPrincipal component analysis
Principal component analysis Motivation i for PCA came from major-axis regression. Strong assumption: single homogeneous sample. Free of assumptions when used for exploration. Classical tests of significance
More informationLecture 9: Classification, LDA
Lecture 9: Classification, LDA Reading: Chapter 4 STATS 202: Data mining and analysis Jonathan Taylor, 10/12 Slide credits: Sergio Bacallado 1 / 1 Review: Main strategy in Chapter 4 Find an estimate ˆP
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationGenerative classifiers: The Gaussian classifier. Ata Kaban School of Computer Science University of Birmingham
Generative classifiers: The Gaussian classifier Ata Kaban School of Computer Science University of Birmingham Outline We have already seen how Bayes rule can be turned into a classifier In all our examples
More informationBayes Decision Theory
Bayes Decision Theory Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 / 16
More informationFast and Robust Classifiers Adjusted for Skewness
Fast and Robust Classifiers Adjusted for Skewness Mia Hubert 1 and Stephan Van der Veeken 2 1 Department of Mathematics - LStat, Katholieke Universiteit Leuven Celestijnenlaan 200B, Leuven, Belgium, Mia.Hubert@wis.kuleuven.be
More informationAn Alternative Algorithm for Classification Based on Robust Mahalanobis Distance
Dhaka Univ. J. Sci. 61(1): 81-85, 2013 (January) An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance A. H. Sajib, A. Z. M. Shafiullah 1 and A. H. Sumon Department of Statistics,
More informationBayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory
Bayesian decision theory 8001652 Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Jussi Tohka jussi.tohka@tut.fi Institute of Signal Processing Tampere University of Technology
More informationModelling Academic Risks of Students in a Polytechnic System With the Use of Discriminant Analysis
Progress in Applied Mathematics Vol. 6, No., 03, pp. [59-69] DOI: 0.3968/j.pam.955803060.738 ISSN 95-5X [Print] ISSN 95-58 [Online] www.cscanada.net www.cscanada.org Modelling Academic Risks of Students
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More information44 CHAPTER 2. BAYESIAN DECISION THEORY
44 CHAPTER 2. BAYESIAN DECISION THEORY Problems Section 2.1 1. In the two-category case, under the Bayes decision rule the conditional error is given by Eq. 7. Even if the posterior densities are continuous,
More informationStatistical Methods for Particle Physics Lecture 1: parameter estimation, statistical tests
Statistical Methods for Particle Physics Lecture 1: parameter estimation, statistical tests http://benasque.org/2018tae/cgi-bin/talks/allprint.pl TAE 2018 Benasque, Spain 3-15 Sept 2018 Glen Cowan Physics
More informationMinimum Error Rate Classification
Minimum Error Rate Classification Dr. K.Vijayarekha Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur-613 401 Table of Contents 1.Minimum Error Rate Classification...
More informationClassification Methods II: Linear and Quadratic Discrimminant Analysis
Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear
More informationIntroduction to Data Science
Introduction to Data Science Winter Semester 2018/19 Oliver Ernst TU Chemnitz, Fakultät für Mathematik, Professur Numerische Mathematik Lecture Slides Contents I 1 What is Data Science? 2 Learning Theory
More informationLecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides
Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture
More informationCS 590D Lecture Notes
CS 590D Lecture Notes David Wemhoener December 017 1 Introduction Figure 1: Various Separators We previously looked at the perceptron learning algorithm, which gives us a separator for linearly separable
More informationALGEBRA II Grades 9-12
Summer 2015 Units: 10 high school credits UC Requirement Category: c General Description: ALGEBRA II Grades 9-12 Algebra II is a course which further develops the concepts learned in Algebra I. It will
More informationCHAPTER 2. Types of Effect size indices: An Overview of the Literature
CHAPTER Types of Effect size indices: An Overview of the Literature There are different types of effect size indices as a result of their different interpretations. Huberty (00) names three different types:
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationLinear Regression and Discrimination
Linear Regression and Discrimination Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian
More informationLearning From Crowds. Presented by: Bei Peng 03/24/15
Learning From Crowds Presented by: Bei Peng 03/24/15 1 Supervised Learning Given labeled training data, learn to generalize well on unseen data Binary classification ( ) Multi-class classification ( y
More informationStatistical aspects of prediction models with high-dimensional data
Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by
More informationProblem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30
Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature
More informationBayesian Decision and Bayesian Learning
Bayesian Decision and Bayesian Learning Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 30 Bayes Rule p(x ω i
More informationAn Application of Discriminant Analysis On University Matriculation Examination Scores For Candidates Admitted Into Anamabra State University
An Application of Discriminant Analysis On University Matriculation Examination Scores For Candidates Admitted Into Anamabra State University Okoli C.N Department of Statistics Anambra State University,
More informationCOWLEY COLLEGE & Area Vocational Technical School
COWLEY COLLEGE & Area Vocational Technical School COURSE PROCEDURE FOR Student Level: This course is open to students on the college level in their freshman or sophomore year. Catalog Description: INTERMEDIATE
More informationRobot Image Credit: Viktoriya Sukhanova 123RF.com. Dimensionality Reduction
Robot Image Credit: Viktoriya Sukhanova 13RF.com Dimensionality Reduction Feature Selection vs. Dimensionality Reduction Feature Selection (last time) Select a subset of features. When classifying novel
More informationApplication: Can we tell what people are looking at from their brain activity (in real time)? Gaussian Spatial Smooth
Application: Can we tell what people are looking at from their brain activity (in real time? Gaussian Spatial Smooth 0 The Data Block Paradigm (six runs per subject Three Categories of Objects (counterbalanced
More informationDiscrete Mathematics and Probability Theory Fall 2015 Lecture 21
CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about
More informationLinear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.
Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also
More informationProbability Models for Bayesian Recognition
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIAG / osig Second Semester 06/07 Lesson 9 0 arch 07 Probability odels for Bayesian Recognition Notation... Supervised Learning for Bayesian
More informationChapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter
More informationLast updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship
More informationINF Anne Solberg One of the most challenging topics in image analysis is recognizing a specific object in an image.
INF 4300 700 Introduction to classifiction Anne Solberg anne@ifiuiono Based on Chapter -6 6inDuda and Hart: attern Classification 303 INF 4300 Introduction to classification One of the most challenging
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationPrincipal Component Analysis -- PCA (also called Karhunen-Loeve transformation)
Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) PCA transforms the original input space into a lower dimensional space, by constructing dimensions that are linear combinations
More informationDiscrete Multivariate Statistics
Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are
More informationFeature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size
Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Berkman Sahiner, a) Heang-Ping Chan, Nicholas Petrick, Robert F. Wagner, b) and Lubomir Hadjiiski
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic
More information