THE UNIVERSITY OF CHICAGO Graduate School of Business Business 41912, Spring Quarter 2012, Mr. Ruey S. Tsay

Lecture 9: Discrimination and Classification

1 Basic concept

Discrimination is concerned with separating distinct sets of observations, whereas classification allocates new objects to previously defined groups. The goals of these two multivariate techniques are:

1. Discrimination: to describe the differential features of objects (observations) from several known collections (populations), and to find discriminants whose numerical values separate the collections as much as possible.

2. Classification: to sort objects (observations) into two or more labeled classes, and to derive a rule that can be used to optimally assign new objects to the labeled classes.

2 The case of two populations

Denote the two populations by $\pi_1$ and $\pi_2$, and let $X = (X_1, \ldots, X_p)'$ denote the $p$-dimensional random vector of measurements taken on an object from the populations. Let $f_i(x)$ be the probability density function of $X$ under the $i$th population. An object with associated measurements $x$ must be assigned to either $\pi_1$ or $\pi_2$.

Let $\Omega$ be the sample space, i.e. the collection of all possible values of $x$. Let $R_1$ be the set of $x$ values for which we classify objects as $\pi_1$, and let $R_2 = \Omega - R_1$ be the remaining $x$ values, for which we classify objects as $\pi_2$. Since every object must be assigned to one and only one of the two populations, the sets $R_1$ and $R_2$ are mutually exclusive and exhaustive.

The conditional probability, $P(2|1)$, of classifying an object as $\pi_2$ when, in fact, it is from $\pi_1$ is
$$P(2|1) = P(X \in R_2 \mid \pi_1) = \int_{R_2} f_1(x)\,dx.$$
Similarly, the conditional probability, $P(1|2)$, of classifying an object as $\pi_1$ when it is really from $\pi_2$ is
$$P(1|2) = P(X \in R_1 \mid \pi_2) = \int_{R_1} f_2(x)\,dx.$$
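As a small numerical illustration (not part of the original notes), the sketch below evaluates these two probabilities for an assumed univariate example: $\pi_1 = N(0, 1)$, $\pi_2 = N(2, 1)$, and the ad hoc region $R_1 = \{x : x < 1\}$.

```python
# A minimal sketch (assumed example, not from the lecture): two univariate
# normal populations pi_1 = N(0, 1) and pi_2 = N(2, 1), with the ad hoc
# classification regions R_1 = {x : x < 1} and R_2 = {x : x >= 1}.
from scipy.stats import norm

cutoff = 1.0

# P(2|1): an object from pi_1 falls in R_2 = [cutoff, infinity)
p_2_given_1 = 1.0 - norm.cdf(cutoff, loc=0.0, scale=1.0)

# P(1|2): an object from pi_2 falls in R_1 = (-infinity, cutoff)
p_1_given_2 = norm.cdf(cutoff, loc=2.0, scale=1.0)

print(f"P(2|1) = {p_2_given_1:.4f}, P(1|2) = {p_1_given_2:.4f}")
# Both equal about 0.1587 because the cut-off lies halfway between the means.
```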

Suppose that the prior probabilities of the two populations are $P(\pi_i) = p_i$, $i = 1, 2$, where $p_1 + p_2 = 1$. Then we have

P(obs is correctly classified as $\pi_1$) = P(obs comes from $\pi_1$ and is correctly classified as $\pi_1$) $= P(X \in R_1 \mid \pi_1)P(\pi_1) = P(1|1)\,p_1$,

P(obs is incorrectly classified as $\pi_1$) = P(obs comes from $\pi_2$ and is incorrectly classified as $\pi_1$) $= P(X \in R_1 \mid \pi_2)P(\pi_2) = P(1|2)\,p_2$,

P(obs is correctly classified as $\pi_2$) = P(obs comes from $\pi_2$ and is correctly classified as $\pi_2$) $= P(X \in R_2 \mid \pi_2)P(\pi_2) = P(2|2)\,p_2$,

P(obs is incorrectly classified as $\pi_2$) = P(obs comes from $\pi_1$ and is incorrectly classified as $\pi_2$) $= P(X \in R_2 \mid \pi_1)P(\pi_1) = P(2|1)\,p_1$.

The cost of misclassification is defined by a cost matrix:

                              Classify as:
                              $\pi_1$        $\pi_2$
True population:  $\pi_1$     0              $c(2|1)$
                  $\pi_2$     $c(1|2)$       0

Thus, the costs are (1) zero for correct classification, (2) $c(1|2)$ when an observation from $\pi_2$ is incorrectly classified as $\pi_1$, and (3) $c(2|1)$ when a $\pi_1$ observation is incorrectly classified as $\pi_2$. The expected cost of misclassification is
$$\mathrm{ECM} = c(2|1)P(2|1)p_1 + c(1|2)P(1|2)p_2.$$
A reasonable classification rule should have an ECM as small as possible.

Result 11.1. The regions $R_1$ and $R_2$ that minimize the ECM are defined by the values $x$ for which the following inequalities hold:
$$R_1: \ \frac{f_1(x)}{f_2(x)} \ge \left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right), \qquad R_2: \ \frac{f_1(x)}{f_2(x)} < \left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right).$$
Thus, $R_1$ consists of the $x$ for which (density ratio) $\ge$ (cost ratio) $\times$ (prior probability ratio). Proof: see the hint in Exercise 11.3.

Another criterion is the total probability of misclassification (TPM),
$$\mathrm{TPM} = p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx.$$
This criterion is the same as the ECM when the two costs of misclassification are equal.
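The minimum ECM rule is straightforward to apply once $f_1$, $f_2$, the costs, and the priors are specified. A minimal sketch follows, assuming bivariate normal densities and illustrative priors and costs; none of these values come from the lecture.

```python
# Minimal sketch of the minimum-ECM rule (Result 11.1): classify x_0 to pi_1
# when f_1(x_0)/f_2(x_0) >= (c(1|2)/c(2|1)) * (p_2/p_1).  The densities,
# costs, and priors below are assumed values for illustration only.
import numpy as np
from scipy.stats import multivariate_normal

f1 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
f2 = multivariate_normal(mean=[2.0, 1.0], cov=np.eye(2))
p1, p2 = 0.7, 0.3          # assumed prior probabilities
c12, c21 = 1.0, 2.0        # assumed misclassification costs c(1|2), c(2|1)

def classify(x0):
    """Return 1 or 2 according to the minimum-ECM regions R_1 and R_2."""
    threshold = (c12 / c21) * (p2 / p1)
    return 1 if f1.pdf(x0) / f2.pdf(x0) >= threshold else 2

print(classify([0.5, 0.2]))   # falls in R_1 -> 1
print(classify([2.5, 1.5]))   # falls in R_2 -> 2
```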

Finally, a new observation $x_0$ can be allocated to the population with the largest posterior probability $P(\pi_i \mid x_0)$. Specifically, by Bayes's rule, the posterior probabilities are
$$P(\pi_1 \mid x_0) = \frac{P(\pi_1 \text{ occurs and we observe } x_0)}{P(\text{we observe } x_0)}
= \frac{P(\text{we observe } x_0 \mid \pi_1)P(\pi_1)}{P(\text{we observe } x_0 \mid \pi_1)P(\pi_1) + P(\text{we observe } x_0 \mid \pi_2)P(\pi_2)}
= \frac{p_1 f_1(x_0)}{p_1 f_1(x_0) + p_2 f_2(x_0)},$$
$$P(\pi_2 \mid x_0) = \frac{p_2 f_2(x_0)}{p_1 f_1(x_0) + p_2 f_2(x_0)}.$$
We classify an observation $x_0$ as $\pi_1$ if $P(\pi_1 \mid x_0) > P(\pi_2 \mid x_0)$.

3 Classification with two multivariate normal populations

Case 1: equal covariance matrices.

Result 11.2. Let the populations $\pi_i$ ($i = 1, 2$) be described by the density functions $N_p(\mu_i, \Sigma)$. Then the allocation rule that minimizes the ECM is as follows. Allocate $x_0$ to $\pi_1$ if
$$(\mu_1 - \mu_2)'\Sigma^{-1}x_0 - \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2) \ \ge\ \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right];$$
allocate $x_0$ to $\pi_2$ otherwise.

Proof: use the identity
$$-\tfrac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2)
= (\mu_1 - \mu_2)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2).$$
Also, the minimum ECM regions are
$$R_1: \ \exp\left[-\tfrac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2)\right] \ge \frac{c(1|2)\,p_2}{c(2|1)\,p_1},$$
$$R_2: \ \exp\left[-\tfrac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2)\right] < \frac{c(1|2)\,p_2}{c(2|1)\,p_1}.$$
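For concreteness, here is a minimal sketch of the Result 11.2 rule with assumed (known) values of $\mu_1$, $\mu_2$, $\Sigma$, the priors, and the costs; the numbers are purely illustrative.

```python
# Minimal sketch of Result 11.2 with assumed population parameters: allocate
# x_0 to pi_1 when
#   (mu1 - mu2)' Sigma^{-1} x_0 - 0.5 (mu1 - mu2)' Sigma^{-1} (mu1 + mu2)
#       >= ln[(c(1|2)/c(2|1)) (p_2/p_1)].
import numpy as np

mu1 = np.array([1.0, 2.0])            # assumed mean of pi_1
mu2 = np.array([3.0, 1.0])            # assumed mean of pi_2
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])        # assumed common covariance matrix
p1, p2 = 0.5, 0.5                     # assumed priors
c12, c21 = 1.0, 1.0                   # assumed costs

def allocate(x0):
    a = np.linalg.solve(Sigma, mu1 - mu2)      # Sigma^{-1}(mu1 - mu2)
    lhs = a @ x0 - 0.5 * a @ (mu1 + mu2)
    rhs = np.log((c12 / c21) * (p2 / p1))
    return 1 if lhs >= rhs else 2

print(allocate(np.array([1.2, 1.8])))   # close to mu1 -> 1
print(allocate(np.array([3.1, 0.9])))   # close to mu2 -> 2
```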

In practice, $\Sigma$ is estimated by the pooled covariance matrix $S_{\rm pool}$ and $\mu_i$ by the sample mean $\bar{x}_i$ of the sample from population $\pi_i$. Consequently, the sample classification rule is as follows.

The estimated minimum ECM rule for two normal populations: allocate $x_0$ to $\pi_1$ if
$$(\bar{x}_1 - \bar{x}_2)'S_{\rm pool}^{-1}x_0 - \tfrac{1}{2}(\bar{x}_1 - \bar{x}_2)'S_{\rm pool}^{-1}(\bar{x}_1 + \bar{x}_2) \ \ge\ \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right];$$
allocate $x_0$ to $\pi_2$ otherwise.

Fisher's approach to classification with two populations: Fisher arrived at the preceding minimum ECM rule (with equal costs and priors) using an alternative argument. He considered linear combinations of the random vector, so as to transform the multivariate observations into univariate ones, and chose the transformation to achieve maximum separation of the transformed sample means. Let $y_{1i} = a'x_{1i}$ and $y_{2i} = a'x_{2i}$ be the transformed samples. Also, let $\bar{y}_j$ be the sample mean of $\{y_{ji}\}$, $j = 1, 2$, and
$$s_y^2 = \frac{\sum_{i=1}^{n_1}(y_{1i} - \bar{y}_1)^2 + \sum_{i=1}^{n_2}(y_{2i} - \bar{y}_2)^2}{n_1 + n_2 - 2}.$$
The separation is measured by $(\bar{y}_1 - \bar{y}_2)^2 / s_y^2$.

Result 11.3. The linear transformation $\hat{y} = \hat{a}'x = (\bar{x}_1 - \bar{x}_2)'S_{\rm pool}^{-1}x$ maximizes the ratio
$$\frac{(\bar{y}_1 - \bar{y}_2)^2}{s_y^2} = \frac{[\hat{a}'(\bar{x}_1 - \bar{x}_2)]^2}{\hat{a}'S_{\rm pool}\hat{a}} = \frac{(\hat{a}'d)^2}{\hat{a}'S_{\rm pool}\hat{a}}$$
over all possible coefficient vectors $\hat{a}$, where $d = \bar{x}_1 - \bar{x}_2$. The maximum of the above ratio is $D^2 = (\bar{x}_1 - \bar{x}_2)'S_{\rm pool}^{-1}(\bar{x}_1 - \bar{x}_2)$. Proof: use the inequality in Eq. (2-50) of the textbook.

An allocation rule based on Fisher's discriminant function: allocate $x_0$ to $\pi_1$ if
$$\hat{y}_0 = (\bar{x}_1 - \bar{x}_2)'S_{\rm pool}^{-1}x_0 \ \ge\ \hat{m}, \qquad \text{where } \hat{m} = \tfrac{1}{2}(\bar{x}_1 - \bar{x}_2)'S_{\rm pool}^{-1}(\bar{x}_1 + \bar{x}_2);$$
allocate $x_0$ to $\pi_2$ otherwise. This rule is known as the Fisher linear discriminant function.
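A minimal sketch of the estimated rule follows, using simulated training samples in place of real data (an assumption for illustration). With equal costs and priors the cut-off is the midpoint $\hat{m}$; the general estimated ECM rule simply adds $\ln[(c(1|2)/c(2|1))(p_2/p_1)]$ to the right-hand side.

```python
# Minimal sketch of the estimated rule / Fisher's linear discriminant,
# computed from two simulated (assumed) training samples.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.3], [0.3, 2.0]], size=50)
X2 = rng.multivariate_normal([3.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=60)

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
S1 = np.cov(X1, rowvar=False)          # sample covariance of the pi_1 sample
S2 = np.cov(X2, rowvar=False)          # sample covariance of the pi_2 sample
S_pool = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

a_hat = np.linalg.solve(S_pool, xbar1 - xbar2)   # S_pool^{-1}(xbar1 - xbar2)
m_hat = 0.5 * a_hat @ (xbar1 + xbar2)            # midpoint cut-off

def allocate(x0):
    """Allocate x0 to pi_1 if y0 = a_hat' x0 >= m_hat, else to pi_2."""
    return 1 if a_hat @ x0 >= m_hat else 2

print(allocate(np.array([1.0, 2.5])), allocate(np.array([3.2, 0.5])))
```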

Case 2: unequal covariance matrices (quadratic classification rule). Result 11.1 leads to the following classification regions when the covariance matrices are not equal:
$$R_1: \ -\tfrac{1}{2}x'(\Sigma_1^{-1} - \Sigma_2^{-1})x + (\mu_1'\Sigma_1^{-1} - \mu_2'\Sigma_2^{-1})x - k \ \ge\ \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right],$$
$$R_2: \ -\tfrac{1}{2}x'(\Sigma_1^{-1} - \Sigma_2^{-1})x + (\mu_1'\Sigma_1^{-1} - \mu_2'\Sigma_2^{-1})x - k \ <\ \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right],$$
where
$$k = \tfrac{1}{2}\ln\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right) + \tfrac{1}{2}\left(\mu_1'\Sigma_1^{-1}\mu_1 - \mu_2'\Sigma_2^{-1}\mu_2\right).$$
For a new observation $x_0$, we may use sample statistics to obtain the following quadratic classification rule: allocate $x_0$ to $\pi_1$ if
$$-\tfrac{1}{2}x_0'(S_1^{-1} - S_2^{-1})x_0 + (\bar{x}_1'S_1^{-1} - \bar{x}_2'S_2^{-1})x_0 - \hat{k} \ \ge\ \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right],$$
where $\hat{k}$ is $k$ evaluated at the sample statistics; allocate $x_0$ to $\pi_2$ otherwise.
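A minimal sketch of the sample quadratic rule follows; the sample means, covariance matrices, priors, and costs are passed in as assumed inputs (for example, the `xbar1`, `xbar2`, `S1`, and `S2` computed in the previous sketch).

```python
# Minimal sketch of the sample quadratic classification rule, with the two
# sample covariance matrices S1 and S2 allowed to differ.  All inputs are
# assumed to come from training samples such as those simulated above.
import numpy as np

def quadratic_allocate(x0, xbar1, xbar2, S1, S2,
                       p1=0.5, p2=0.5, c12=1.0, c21=1.0):
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    k_hat = 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)) \
        + 0.5 * (xbar1 @ S1_inv @ xbar1 - xbar2 @ S2_inv @ xbar2)
    score = -0.5 * x0 @ (S1_inv - S2_inv) @ x0 \
        + (xbar1 @ S1_inv - xbar2 @ S2_inv) @ x0 - k_hat
    threshold = np.log((c12 / c21) * (p2 / p1))
    return 1 if score >= threshold else 2
```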

4 Evaluating classification functions

If the population distributions are known, the total probability of misclassification is
$$\mathrm{TPM} = p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx.$$
The choices of $R_1$ and $R_2$ that minimize the TPM give rise to the optimum error rate (OER). When $R_1$ and $R_2$ are chosen based on Result 11.1 with $c(1|2) = c(2|1)$, we obtain the OER; thus, the OER is the error rate of the minimum TPM classification rule.

In practice, the population densities are unknown, and the actual error rate (AER) is
$$\mathrm{AER} = p_1 \int_{\hat{R}_2} f_1(x)\,dx + p_2 \int_{\hat{R}_1} f_2(x)\,dx,$$
where the $\hat{R}_i$ are the classification regions determined by the sample statistics.

A measure that does not depend on the form of the populations is the apparent error rate (APER), defined as the fraction of observations in the training sample that are misclassified by the sample classification function. The counts are summarized in a confusion matrix:

                                      Predicted membership
                                      $\pi_1$                     $\pi_2$
Actual membership   $\pi_1$           $n_{1c}$                    $n_{1m} = n_1 - n_{1c}$
                    $\pi_2$           $n_{2m} = n_2 - n_{2c}$     $n_{2c}$

where
$n_{1c}$ = number of $\pi_1$ items correctly classified as $\pi_1$ items,
$n_{1m}$ = number of $\pi_1$ items incorrectly classified as $\pi_2$ items,
$n_{2c}$ = number of $\pi_2$ items correctly classified as $\pi_2$ items,
$n_{2m}$ = number of $\pi_2$ items incorrectly classified as $\pi_1$ items.

The apparent error rate is then
$$\mathrm{APER} = \frac{n_{1m} + n_{2m}}{n_1 + n_2}.$$
Example: see Example 11.8 on the classification of Alaskan and Canadian salmon, page 603 of the textbook.
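The APER is obtained by re-applying the fitted rule to the training samples themselves. A minimal sketch, assuming an `allocate` function such as the one above and training samples `X1` and `X2`:

```python
# Minimal sketch of the apparent error rate (APER): apply the fitted
# classification rule back to its own training samples and count the
# misclassified items.
def apparent_error_rate(X1, X2, allocate):
    n1m = sum(1 for x in X1 if allocate(x) != 1)   # pi_1 items sent to pi_2
    n2m = sum(1 for x in X2 if allocate(x) != 2)   # pi_2 items sent to pi_1
    return (n1m + n2m) / (len(X1) + len(X2))

# Example usage with the earlier (assumed) simulated samples and rule:
# print(apparent_error_rate(X1, X2, allocate))
```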

5 More than two populations

5.1 The minimum expected cost of misclassification method

Let $f_i(x)$ be the density function of the population $\pi_i$, $i = 1, \ldots, g$. Define
$p_i$ = the prior probability of population $\pi_i$;
$c(k|i)$ = the cost of allocating an item to $\pi_k$ when, in fact, it belongs to $\pi_i$, for $k, i = 1, \ldots, g$ (with $c(i|i) = 0$);
$R_i$ = the set of $x$'s classified as $\pi_i$;
$P(k|i)$ = P(classifying an item as $\pi_k \mid \pi_i$) $= \int_{R_k} f_i(x)\,dx$.
Note that the $\{R_i\}$ are mutually exclusive and exhaustive. The expected cost of misclassifying an item from $\pi_i$ is
$$\mathrm{ECM}(i) = P(1|i)c(1|i) + P(2|i)c(2|i) + \cdots + P(g|i)c(g|i) = \sum_{j=1}^{g} P(j|i)c(j|i),$$
where it is understood that $c(i|i) = 0$ and $P(i|i) = 1 - \sum_{j \ne i} P(j|i)$. The overall ECM is
$$\mathrm{ECM} = p_1\,\mathrm{ECM}(1) + p_2\,\mathrm{ECM}(2) + \cdots + p_g\,\mathrm{ECM}(g) = \sum_{i=1}^{g} p_i \sum_{j=1, j \ne i}^{g} P(j|i)c(j|i).$$
It turns out that we have the following result.

Result 11.5. The classification regions that minimize the ECM are defined by allocating $x$ to the population $\pi_k$ for which
$$\sum_{i=1, i \ne k}^{g} p_i f_i(x) c(k|i)$$
is smallest. If a tie occurs, $x$ can be assigned to any of the tied populations.

If the misclassification costs are all equal, the classification rule becomes: allocate $x$ to $\pi_k$ if
$$p_k f_k(x) > p_i f_i(x) \quad \text{for all } i \ne k,$$
or, equivalently, allocate $x$ to $\pi_k$ if
$$\ln(p_k f_k(x)) > \ln(p_i f_i(x)) \quad \text{for all } i \ne k.$$
This is equivalent to the rule that maximizes the posterior probability $P(\pi_k \mid x)$, where
$$P(\pi_k \mid x) = \frac{p_k f_k(x)}{\sum_{i=1}^{g} p_i f_i(x)}.$$

Normal populations: assume that $f_i(x)$ is the $N_p(\mu_i, \Sigma_i)$ density and that the costs of misclassification are all equal. Then
$$\ln(p_i f_i(x)) = \ln(p_i) - \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln(|\Sigma_i|) - \frac{1}{2}(x-\mu_i)'\Sigma_i^{-1}(x-\mu_i).$$
This leads to the definition of the quadratic discriminant score
$$d_i^{Q}(x) = -\frac{1}{2}\ln(|\Sigma_i|) - \frac{1}{2}(x-\mu_i)'\Sigma_i^{-1}(x-\mu_i) + \ln(p_i), \qquad i = 1, \ldots, g.$$
In practice, replacing the theoretical values by their sample estimates, we obtain the estimated minimum total probability of misclassification rule for several normal populations: allocate $x$ to $\pi_i$ if
$$\hat{d}_i^{Q}(x) = \max\{\hat{d}_1^{Q}(x), \ldots, \hat{d}_g^{Q}(x)\},$$
where $\hat{d}_j^{Q}(x)$ is the sample estimate of $d_j^{Q}(x)$.

When the $\Sigma_i$ are equal to a common $\Sigma$, the function $d_i^{Q}(x)$ reduces, apart from terms common to all populations, to the linear discriminant score
$$d_i(x) = \mu_i'\Sigma^{-1}x - \frac{1}{2}\mu_i'\Sigma^{-1}\mu_i + \ln(p_i), \qquad i = 1, \ldots, g.$$
Examples: see Examples 11.11 and 11.12.

Logistic regression and classification: use the salmon data to demonstrate the analysis.
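A minimal sketch of the estimated quadratic and linear discriminant scores for $g$ groups follows; the lists of sample means, sample covariance matrices, and priors are assumed inputs, and the group with the largest estimated score is chosen.

```python
# Minimal sketch of the estimated quadratic and linear discriminant scores for
# g groups, using assumed lists of sample means, covariances, and priors.
import numpy as np

def quadratic_score(x, xbar, S, prior):
    """Estimated d^Q_i(x) for one group."""
    diff = x - xbar
    return (-0.5 * np.log(np.linalg.det(S))
            - 0.5 * diff @ np.linalg.solve(S, diff)
            + np.log(prior))

def linear_score(x, xbar, S_pool, prior):
    """Estimated d_i(x) for one group when a common covariance is assumed."""
    a = np.linalg.solve(S_pool, xbar)              # S_pool^{-1} xbar_i
    return xbar @ np.linalg.solve(S_pool, x) - 0.5 * xbar @ a + np.log(prior)

def allocate_g(x, xbars, covs, priors, pooled=None):
    """Allocate x to the group with the largest estimated discriminant score."""
    if pooled is None:
        scores = [quadratic_score(x, m, S, p)
                  for m, S, p in zip(xbars, covs, priors)]
    else:
        scores = [linear_score(x, m, pooled, p)
                  for m, p in zip(xbars, priors)]
    return int(np.argmax(scores)) + 1              # groups labeled 1, ..., g
```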

Bayesian Classification and Regression Trees (Bayesian CART)

Some references:

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth Statistics/Probability Series, Wadsworth International Group.

Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443).

A recent lecture note by Teresa Jacobson, Department of Applied Mathematics and Statistics, UC Santa Cruz, March 11, 2010 (available on the web).
