Lecture 8: Classification


Slide 1: Lecture 8: Classification

Måns Eriksson, Department of Mathematics, Uppsala University
Multivariate Methods, 19/5 2010

Slide 2: Classification: introductory examples

Goal: classify an observation x as belonging to one of several predefined categories $\pi_1, \pi_2, \ldots, \pi_g$.

Examples:
- Classify insects into one of several sub-species using measurements on external features.
- Use measurements on blood proteins and family history to classify women as carriers or non-carriers of a genetic disorder.
- Classify the quality of a new mobile phone battery as good or bad based on a few preliminary measurements.
- Use information on background, family support, psychological test scores etc. to screen applicants for parole from prison.
- Use information on sex, age, income, education level, marital status, debts etc. to classify a potential borrower as eligible or ineligible for a bank loan.
- Detect spam messages based upon the message header and content.

Slide 3: Classification: discrimination rule

We'd like to find a discrimination rule for classification that:
- in general classifies observations correctly,
- minimizes the probability of misclassification,
- minimizes the expected cost of misclassification,
- ideally is a simple rule.

If the distributions of the populations are known, we can use this knowledge to derive rules. Otherwise, we must use training data to find good rules.

Slide 4: ML approach: assumptions

We've seen many times before that the likelihood approach yields good tests and estimators. It seems reasonable to try to use it for classification.

In the maximum likelihood approach to discrimination, the distributions of the g populations are assumed to be known. "Simplest to analyse theoretically, although the least realistic in practice" - Mardia, Kent & Bibby (1979).

The ML discriminant rule for allocating an observation x to one of the populations $\pi_1, \ldots, \pi_g$ is to allocate x to the population which gives the largest likelihood to x. See blackboard!
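
As an aside not in the original slides, here is a minimal Python sketch of the ML rule: evaluate the density of x under each of the g assumed (known) normal populations and allocate to the largest. The means, covariances and the helper name ml_classify are invented for illustration.

```python
# Sketch of the ML discriminant rule with known multivariate normal densities.
# All parameter values are made-up illustration values, not from the lecture.
import numpy as np
from scipy.stats import multivariate_normal

populations = [
    multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]]),
    multivariate_normal(mean=[2.0, 1.0], cov=[[1.0, 0.3], [0.3, 1.0]]),
    multivariate_normal(mean=[0.5, 3.0], cov=[[1.0, 0.3], [0.3, 1.0]]),
]

def ml_classify(x):
    """Allocate x to the population giving it the largest likelihood."""
    likelihoods = [pop.pdf(x) for pop in populations]
    return int(np.argmax(likelihoods)) + 1  # population label 1..g

print(ml_classify([1.8, 0.9]))  # -> 2
```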

Slide 5: ML approach: univariate case

Consider the univariate case with two populations:
Population 1: $X \sim N(\mu_1, \sigma^2)$
Population 2: $X \sim N(\mu_2, \sigma^2)$

Slide 6: ML approach: likelihood ratio

For these populations, the likelihood ratio at the point x is:

$$\lambda = \frac{\text{likelihood of } x \text{ for Pop. 1}}{\text{likelihood of } x \text{ for Pop. 2}} = \frac{f_1(x)}{f_2(x)} = \frac{\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu_1)^2/2\sigma^2}}{\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu_2)^2/2\sigma^2}}$$

Thus:

$$\lambda = \exp\left\{ -\frac{1}{2}\left( \frac{(x-\mu_1)^2}{\sigma^2} - \frac{(x-\mu_2)^2}{\sigma^2} \right) \right\}$$

Rule: classify x into Pop. 1 if $\lambda > 1$, into Pop. 2 if $\lambda < 1$, and flip a coin if $\lambda = 1$.

Slide 7: ML approach: further elaboration

The rule tells us to classify into Pop. 1 if the standardized distance of x from $\mu_1$ is less than the standardized distance of x from $\mu_2$.

We can rewrite the rule in a simpler form by taking logarithms:

$$-2\ln\lambda = \frac{(x-\mu_1)^2}{\sigma^2} - \frac{(x-\mu_2)^2}{\sigma^2} = -2\,\frac{(\mu_1 - \mu_2)}{\sigma^2}\, x + \frac{\mu_1^2 - \mu_2^2}{\sigma^2} = \beta x + \alpha$$

Rule: classify into Pop. 1 if $\beta x + \alpha < 0$. This is a linear rule.
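
An illustrative sketch (not from the slides) of the linear form of the rule; the values of mu1, mu2 and sigma below are arbitrary.

```python
# Sketch of the univariate ML rule in its linear form -2 ln(lambda) = beta*x + alpha.
mu1, mu2, sigma = 0.0, 3.0, 1.5   # illustrative values

beta = -2.0 * (mu1 - mu2) / sigma**2
alpha = (mu1**2 - mu2**2) / sigma**2

def classify(x):
    """Classify into Pop. 1 if beta*x + alpha < 0, otherwise Pop. 2."""
    return 1 if beta * x + alpha < 0 else 2

print(classify(1.0))  # x closer to mu1 -> 1
print(classify(2.0))  # x closer to mu2 -> 2
```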

Slide 8: ML approach: multivariate case

Now, consider the more general setting where p traits are measured, with
Population 1: $X \sim N_p(\mu_1, \Sigma_1)$,
Population 2: $X \sim N_p(\mu_2, \Sigma_2)$.

Consider the natural logarithm of the likelihood ratio for an observed x for some individual:

$$-2\ln\left(\frac{f_1(x)}{f_2(x)}\right) = -2\ln\left(\frac{\frac{1}{(2\pi)^{p/2}(\det\Sigma_1)^{1/2}}\exp\{-(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1)/2\}}{\frac{1}{(2\pi)^{p/2}(\det\Sigma_2)^{1/2}}\exp\{-(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)/2\}}\right)$$

$$= \Big(\ln(\det\Sigma_1) + (x-\mu_1)'\Sigma_1^{-1}(x-\mu_1)\Big) - \Big(\ln(\det\Sigma_2) + (x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)\Big)$$

Classify into Pop. 1 if this quantity is less than zero, otherwise classify into Pop. 2. This is a quadratic rule, but if the covariance matrices are equal it reduces to a linear rule. See blackboard!
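
Below is an illustrative Python sketch of this quadratic rule; all parameter values are invented for the example.

```python
# Sketch of the multivariate ML rule via -2*ln(f1(x)/f2(x)).
import numpy as np

mu1 = np.array([0.0, 0.0]); Sigma1 = np.array([[1.0, 0.2], [0.2, 1.0]])
mu2 = np.array([2.0, 2.0]); Sigma2 = np.array([[2.0, 0.0], [0.0, 0.5]])

def neg2_log_lr(x):
    """-2 ln(f1(x)/f2(x)) for two multivariate normal densities."""
    d1, d2 = x - mu1, x - mu2
    q1 = np.log(np.linalg.det(Sigma1)) + d1 @ np.linalg.solve(Sigma1, d1)
    q2 = np.log(np.linalg.det(Sigma2)) + d2 @ np.linalg.solve(Sigma2, d2)
    return q1 - q2

def classify(x):
    """Classify into Pop. 1 if the quantity above is negative."""
    return 1 if neg2_log_lr(np.asarray(x, dtype=float)) < 0 else 2

print(classify([0.5, 0.3]), classify([2.1, 1.8]))  # -> 1 2
```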

Slide 9: ML approach: general g

Theorem. If $\pi_i$ is the $N_p(\mu_i, \Sigma)$ population, $i = 1, \ldots, g$, and $\Sigma > 0$, then the ML discriminant rule allocates x to $\pi_j$, where $j \in \{1, \ldots, g\}$ is that value of i which minimizes the square of the Mahalanobis distance

$$(x - \mu_i)'\Sigma^{-1}(x - \mu_i).$$

When g = 2, the rule allocates x to $\pi_1$ if $\alpha'(x - \mu) > 0$, where $\alpha = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \frac{1}{2}(\mu_1 + \mu_2)$, and to $\pi_2$ otherwise.
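
An illustrative sketch of the equal-covariance rule for general g, allocating to the population with the smallest squared Mahalanobis distance; the means and Sigma are made up.

```python
# Sketch of the equal-covariance ML rule for g populations.
import numpy as np

means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu):
    d = x - mu
    return d @ Sigma_inv @ d

def ml_classify(x):
    x = np.asarray(x, dtype=float)
    dists = [mahalanobis_sq(x, mu) for mu in means]
    return int(np.argmin(dists)) + 1  # population label 1..g

print(ml_classify([2.5, 0.5]))  # closest to mean (3, 0) -> 2
```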

Slide 10: ML approach: an example

See blackboard!

Slide 11: Decision theory approach: idea

In many cases, we have some more information or would like to use other criteria for the classification.
- What if we want to minimize the probability of misclassification? Does that change anything?
- What if one kind of misclassification costs more than another? How can we take this into account?
- What if we know beforehand that, say, 80 % of the observations belong to $\pi_1$ and 20 % belong to $\pi_2$? How can we use this knowledge?

Decision theory is the theory concerned with finding optimal decisions given certain information. Estimation and testing can both be viewed in a decision theoretical context, as can classification.

Slide 12: Decision theory approach: misclassification

Suppose that we have some prior probabilities of the observations belonging to the different populations.
Prior probability that an individual comes from Pop. 1: $p_1 = P(\pi_1)$
Prior probability that an individual comes from Pop. 2: $p_2 = P(\pi_2) = 1 - p_1$

Now, let
$P(2|1)$ = conditional probability of misclassifying an observation into Pop. 2 when the observation actually belongs to Pop. 1,
$P(1|2)$ = conditional probability of misclassifying an observation into Pop. 1 when the observation actually belongs to Pop. 2.

Slide 13: Decision theory approach: TPM

Partition the sample space into $R_1$ and $R_2 = R_1^c$ such that:
If $x \in R_1$, classify to $\pi_1$.
If $x \in R_2$, classify to $\pi_2$.

The TPM, the Total Probability of Misclassification, is defined as

$$P(\text{misclassification}) = P(x \text{ is in } \pi_2 \text{ but is classified as } \pi_1) + P(x \text{ is in } \pi_1 \text{ but is classified as } \pi_2)$$

$$= P(\text{classify } x \text{ in } \pi_1 \mid \pi_2)P(\pi_2) + P(\text{classify } x \text{ in } \pi_2 \mid \pi_1)P(\pi_1) = P(1|2)\,p_2 + P(2|1)\,p_1$$

See blackboard!

Slide 14: Decision theory approach: misclassification costs

What if the costs of the different misclassifications differ? For instance, the cost of classifying a patient with a deadly disease as healthy is higher than the cost of classifying a healthy patient as having a deadly disease.

Define the costs $c(1|2)$ and $c(2|1)$:

Cost table:
                          Classify as π_1    Classify as π_2
True population π_1             0                c(2|1)
True population π_2           c(1|2)               0

and study the ECM, the Expected Cost of Misclassification:

$$\mathrm{ECM} = E(\text{cost of decision}) = c(1|2)P(1|2)\,p_2 + c(2|1)P(2|1)\,p_1$$

If $c(1|2) = c(2|1)$, minimizing the ECM is mathematically equivalent to minimizing the TPM.
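
A small worked example with made-up numbers (not from the slides): if $c(1|2) = 10$, $c(2|1) = 1$, $P(1|2) = 0.05$, $P(2|1) = 0.10$, $p_1 = 0.8$ and $p_2 = 0.2$, then

$$\mathrm{ECM} = 10 \cdot 0.05 \cdot 0.2 + 1 \cdot 0.10 \cdot 0.8 = 0.10 + 0.08 = 0.18.$$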

Slide 15: Decision theory approach: minimization of ECM

Result 11.1: The regions $R_1$ and $R_2$ that minimize the ECM are

$$R_1: \left\{ x;\ \frac{f_1(x)}{f_2(x)} \geq \left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right) \right\} \qquad R_2: \left\{ x;\ \frac{f_1(x)}{f_2(x)} < \left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right) \right\}$$

x is classified into $\pi_1$ if $x \in R_1$ and into $\pi_2$ if $x \in R_2$.

The maximum likelihood approach can be viewed as the special case where $p_1 = p_2$ and $c(1|2) = c(2|1)$, or more generally where $c(1|2)\,p_2 = c(2|1)\,p_1$: classify $x_0$ to $\pi_1$ if

$$\frac{f_1(x_0)}{f_2(x_0)} > 1.$$
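
A minimal sketch of Result 11.1 as a classifier, assuming two known univariate normal densities; the priors and costs are illustrative values.

```python
# Sketch of the minimum-ECM rule: classify x to Pop. 1 when
# f1(x)/f2(x) >= (c(1|2)/c(2|1)) * (p2/p1). All numbers are illustrative.
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0).pdf   # density of Pop. 1
f2 = norm(loc=2.0, scale=1.0).pdf   # density of Pop. 2
p1, p2 = 0.8, 0.2                   # prior probabilities
c12, c21 = 5.0, 1.0                 # c(1|2) and c(2|1)

threshold = (c12 / c21) * (p2 / p1)

def classify(x):
    return 1 if f1(x) / f2(x) >= threshold else 2

print(classify(0.5), classify(1.5))  # -> 1 2
```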

Slide 16: Decision theory approach: Bayesian approach

Recall Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}.$$

Bayesian approach: allocate x to the population with the largest posterior probability $P(\pi_i \mid x)$. We find that

$$P(\pi_1 \mid x) = \frac{P(\pi_1, x)}{P(x)} = \frac{P(x \mid \pi_1)P(\pi_1)}{P(x \mid \pi_1)P(\pi_1) + P(x \mid \pi_2)P(\pi_2)} = \frac{p_1 f_1(x)}{p_1 f_1(x) + p_2 f_2(x)}$$

Similarly, we get

$$P(\pi_2 \mid x) = \frac{p_2 f_2(x)}{p_1 f_1(x) + p_2 f_2(x)}$$

Comparing $P(\pi_1 \mid x)$ and $P(\pi_2 \mid x)$ is equivalent to taking $c(1|2) = c(2|1)$ in the decision theory approach.
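
An illustrative sketch of the Bayesian allocation rule; the densities and priors are again made-up values.

```python
# Sketch of the Bayesian rule: compute posteriors P(pi_i | x) and pick the largest.
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf
p1, p2 = 0.8, 0.2

def posteriors(x):
    num1, num2 = p1 * f1(x), p2 * f2(x)
    total = num1 + num2
    return num1 / total, num2 / total

x = 1.5
post1, post2 = posteriors(x)
# With these priors, x = 1.5 is (just) allocated to Pop. 1.
print(round(post1, 3), round(post2, 3), "-> Pop.", 1 if post1 > post2 else 2)
```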

Slide 17: Decision theory approach: normal data

If $\pi_1$ is $N_p(\mu_1, \Sigma)$ and $\pi_2$ is $N_p(\mu_2, \Sigma)$, then the decision theory rule becomes: allocate x to $\pi_1$ if

$$(\mu_1 - \mu_2)'\Sigma^{-1} x - \frac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2) \geq \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]$$

and to $\pi_2$ otherwise.

Now assume that we have two normal populations with unknown means and equal but unknown covariance matrices. $n_1$ observations are available from $\pi_1$ and $n_2$ observations are available from $\pi_2$. The estimated minimum ECM rule is: allocate a new observation $x_0$ to $\pi_1$ if

$$(\bar{x}_1 - \bar{x}_2)' S_{\mathrm{pool}}^{-1} x_0 - \frac{1}{2}(\bar{x}_1 - \bar{x}_2)' S_{\mathrm{pool}}^{-1}(\bar{x}_1 + \bar{x}_2) \geq \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]$$

Allocate $x_0$ to $\pi_2$ otherwise.
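
A sketch of the estimated minimum ECM rule (the sample linear discriminant), assuming simulated training data; none of the numbers come from the lecture.

```python
# Sketch of the estimated minimum-ECM (sample LDA) rule with a pooled covariance.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=30)  # from pi_1
X2 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], size=20)  # from pi_2

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
S_pool = ((n1 - 1) * np.cov(X1, rowvar=False) +
          (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

p1, p2 = 0.5, 0.5          # priors (illustrative)
c12, c21 = 1.0, 1.0        # misclassification costs (illustrative)
cutoff = np.log((c12 / c21) * (p2 / p1))

a = np.linalg.solve(S_pool, xbar1 - xbar2)   # S_pool^{-1} (xbar1 - xbar2)

def classify(x0):
    score = a @ x0 - 0.5 * a @ (xbar1 + xbar2)
    return 1 if score >= cutoff else 2

# Points near each group mean.
print(classify(np.array([0.2, 0.1])), classify(np.array([1.9, 1.2])))
```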

Slide 18: Decision theory approach: quadratic classification rule

Now suppose that we have two normal populations with unequal covariance matrices $\Sigma_1$ and $\Sigma_2$. In this case, the classification rule becomes more complicated: allocate $x_0$ to $\pi_1$ if

$$-\frac{1}{2}\, x_0'(\Sigma_1^{-1} - \Sigma_2^{-1})x_0 + (\mu_1'\Sigma_1^{-1} - \mu_2'\Sigma_2^{-1})x_0 - k \geq \ln\left[\left(\frac{c(1|2)}{c(2|1)}\right)\left(\frac{p_2}{p_1}\right)\right]$$

where

$$k = \frac{1}{2}\ln\left(\frac{\det(\Sigma_1)}{\det(\Sigma_2)}\right) + \frac{1}{2}\left(\mu_1'\Sigma_1^{-1}\mu_1 - \mu_2'\Sigma_2^{-1}\mu_2\right).$$

Allocate $x_0$ to $\pi_2$ otherwise. Replacing $\mu_i$ with $\bar{x}_i$ and $\Sigma_i$ with $S_i$ we obtain the estimated minimum ECM rule.
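
For completeness, an illustrative sketch of the quadratic minimum ECM rule with invented parameters, priors and costs.

```python
# Sketch of the quadratic minimum-ECM rule with unequal covariance matrices.
import numpy as np

mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.0]])
mu2, Sigma2 = np.array([2.0, 1.0]), np.array([[2.0, 0.0], [0.0, 0.5]])
p1, p2, c12, c21 = 0.6, 0.4, 2.0, 1.0   # priors and costs (illustrative)

S1_inv, S2_inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
k = (0.5 * np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma2))
     + 0.5 * (mu1 @ S1_inv @ mu1 - mu2 @ S2_inv @ mu2))
cutoff = np.log((c12 / c21) * (p2 / p1))

def classify(x0):
    score = (-0.5 * x0 @ (S1_inv - S2_inv) @ x0
             + (mu1 @ S1_inv - mu2 @ S2_inv) @ x0 - k)
    return 1 if score >= cutoff else 2

print(classify(np.array([0.1, 0.2])), classify(np.array([2.2, 0.9])))  # -> 1 2
```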

Slide 19: Fisher's approach: introduction

Suppose that observations from g populations are given and that we wish to use these to classify a new observation. It is easier to tell the populations apart if the variation between the groups is larger than the variation within the groups. Let

$$B = \sum_{l=1}^{g} n_l (\bar{x}_l - \bar{x})(\bar{x}_l - \bar{x})', \qquad W = \sum_{l=1}^{g} \sum_{j=1}^{n_l} (x_{lj} - \bar{x}_l)(x_{lj} - \bar{x}_l)'$$

Fisher's idea: look for the linear function $a'x$ which maximizes the ratio of the between-group sum of squares to the within-group sum of squares.

Slide 20: Fisher's approach: linear discriminant function

For the vector a maximizing

$$\frac{a'Ba}{a'Wa}$$

the linear function $a'x$ is called Fisher's linear discriminant function.

Theorem. The vector a in Fisher's linear discriminant function is the eigenvector of $W^{-1}B$ corresponding to the largest eigenvalue.

Slide 21: Fisher's approach: allocation rule

Rule: an observation x is allocated to the population whose mean score is closest to $a'x$. That is, allocate x to $\pi_j$ if

$$|a'x - a'\bar{x}_j| < |a'x - a'\bar{x}_i| \quad \text{for all } i \neq j$$

For g = 2 groups, this becomes: allocate x to $\pi_1$ if

$$d'W^{-1}\left(x - \frac{1}{2}(\bar{x}_1 + \bar{x}_2)\right) > 0$$

where $d = \bar{x}_1 - \bar{x}_2$; allocate to $\pi_2$ otherwise.

This coincides with the classification rule obtained by the ML approach for two multivariate normal populations with equal covariance matrices. However, no assumption of normality was made here.
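
An illustrative sketch of Fisher's procedure on simulated data: build B and W, take the leading eigenvector of $W^{-1}B$, and allocate by closest mean score as in the rule above. The data and the helper name fisher_classify are invented for the example.

```python
# Sketch of Fisher's linear discriminant and the closest-mean-score allocation rule.
import numpy as np

rng = np.random.default_rng(0)
groups = [rng.multivariate_normal(m, np.eye(2), size=25)
          for m in ([0, 0], [3, 1], [1, 4])]

grand_mean = np.vstack(groups).mean(axis=0)
p = grand_mean.size
B = np.zeros((p, p))
W = np.zeros((p, p))
for X in groups:
    xbar = X.mean(axis=0)
    d = (xbar - grand_mean).reshape(-1, 1)
    B += len(X) * d @ d.T          # between-group SSCP
    C = X - xbar
    W += C.T @ C                   # within-group SSCP

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
a = np.real(eigvecs[:, np.argmax(np.real(eigvals))])  # Fisher's direction

def fisher_classify(x):
    """Allocate x to the group whose mean score a'xbar_l is closest to a'x."""
    scores = [abs(a @ x - a @ X.mean(axis=0)) for X in groups]
    return int(np.argmin(scores)) + 1

print(fisher_classify(np.array([2.8, 1.2])))
```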

Slide 22: Discrimination and MANOVA

Of course, before applying our classification methods, we should ask ourselves if it really is meaningful to use a certain dataset for classification.

Consider g multinormal populations, assumed to have the same covariance matrix, $\Sigma_1 = \cdots = \Sigma_g$. To check whether or not a discriminant analysis is worthwhile, test the hypothesis

$$\mu_1 = \cdots = \mu_g$$

This is the MANOVA problem!

Slide 23: Evaluating classification functions

OER (Optimum Error Rate): the smallest value of the TPM.
AER (Actual Error Rate): based on the performance of the sample classification functions.
APER (Apparent Error Rate): the fraction of observations in a training sample that are misclassified by the sample classification function.

Lachenbruch's cross-validation procedure: a method for estimating the AER. See blackboard!
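
An illustrative sketch (with simulated data) of the APER and a Lachenbruch-style leave-one-out estimate of the error rate, using the sample linear discriminant rule from earlier.

```python
# Sketch of APER and a Lachenbruch (leave-one-out) error-rate estimate.
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=40)
X2 = rng.multivariate_normal([2, 1], np.eye(2), size=40)

def lda_rule(X1, X2):
    """Return a classifier built from the two training samples."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S_pool = ((n1 - 1) * np.cov(X1, rowvar=False) +
              (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S_pool, xbar1 - xbar2)
    m = 0.5 * a @ (xbar1 + xbar2)
    return lambda x: 1 if a @ x - m >= 0 else 2

# APER: misclassification rate on the training sample itself.
rule = lda_rule(X1, X2)
errors = sum(rule(x) != 1 for x in X1) + sum(rule(x) != 2 for x in X2)
aper = errors / (len(X1) + len(X2))

# Lachenbruch's holdout: leave one observation out, refit, classify it.
holdout_errors = 0
for i in range(len(X1)):
    r = lda_rule(np.delete(X1, i, axis=0), X2)
    holdout_errors += r(X1[i]) != 1
for i in range(len(X2)):
    r = lda_rule(X1, np.delete(X2, i, axis=0))
    holdout_errors += r(X2[i]) != 2
estimated_aer = holdout_errors / (len(X1) + len(X2))

print(round(aper, 3), round(estimated_aer, 3))
```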

Slide 24: The Museum Gustavianum sword data

Museum Gustavianum in downtown Uppsala has a large collection of antique swords. In a current research project, the researchers are measuring lengths and various other properties of the swords. The goal is to use this to classify swords as coming from different epochs.

They need help from someone with knowledge of classification methods, so they asked the mathematics department for help. This could be a fun project for a bachelor or master thesis, or perhaps just a nice project to work with along with your studies. Talk to Jesper Rydén, jesper@math.uu.se, if you are interested or if you would like to know more!

Slide 25: Classification: a second look at the introductory examples

Can the methods presented today be used in our introductory examples?

Examples:
- Classify insects into one of several sub-species using measurements on external features.
- Use measurements on blood proteins and family history to classify women as carriers or non-carriers of a genetic disorder.
- Classify the quality of a new mobile phone battery as good or bad based on a few preliminary measurements.
- Use information on background, family support, psychological test scores etc. to screen applicants for parole from prison.
- Use information on sex, age, income, education level, marital status, debts etc. to classify a potential borrower as eligible or ineligible for a bank loan.
- Detect spam messages based upon the message header and content.

Slide 26: Classification: next lecture

In the next lecture, we will talk about decision trees (for classification) and other algorithmic methods. These methods are popular within, for instance, the field of data mining and do not require assumptions about distributions. We will also compare ordinary probabilistic methods with algorithmic methods and discuss when the different methods should be used.
