
1 SF2935: MODERN METHODS OF STATISTICAL LEARNING LECTURE 3 SUPERVISED CLASSIFICATION, LINEAR DISCRIMINANT ANALYSIS Tatjana Pavlenko 5 November 2015

2 SUPERVISED LEARNING (REP.) Starting point: we have an outcome measurement Y, quantitative (such as a stock price or blood pressure) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (also called inputs, regressors, covariates, independent variables), X = (X_1, ..., X_p) (such as diet or clinical measurements). We have a training set of data, {(x_1, y_1), ..., (x_n, y_n)}, for a set of n objects (such as patients). On the basis of the training data we would like to build a prediction model/rule, or learner, which will enable us to predict the outcome for new unseen objects, understand which input variables affect the outcome and how, and assess the accuracy of our predictions and inferences. The scenario above is called supervised because of the presence of the outcome variable to guide the learning process.

3 SUPERVISED VS UNSUPERVISED LEARNING (CONT.) In the unsupervised learning problem: there is no outcome variable, just a set of predictors (features) measured on a set of samples; the objective is more fuzzy: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation; it is difficult to know how well we are performing; unsupervised learning is different from supervised learning, but can be useful as a pre-processing step for supervised learning.

4 CLASSIFICATION PROBLEMS EXAMPLES Use information on sex, age, income, education level, marital status, debts, etc. to classify a potential borrower as eligible or ineligible for a bank loan. Use measurements on blood proteins and family history to classify women as carriers or non-carriers of a genetic disorder. Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack; the prediction is to be based on demographic, diet and clinical measurements for that patient. Classify a tissue sample into one of several cancer classes, based on a gene expression profile (see the fragment of the data in the figure on the next slide).

5 [Figure: fragment of a gene expression data set, referenced on the previous slide.]

6 GENERAL CLASSIFICATION SET-UP, NOTATIONS AND OBJECTIVES Given is an observed feature vector X = (X_1, ..., X_p) and a qualitative (response) variable Y (outcome measurement, target) that takes values in an unordered set C of several predefined categories (classes, populations). The classification task is to build a function C(X) that takes as input the feature vector X and predicts a value for Y, i.e., C(X) ∈ C. More often, we focus on estimating the probabilities that X belongs to each category in the set C. To construct C(X), we have training data (x_1, y_1), ..., (x_n, y_n), where each observed x_i is accompanied by its known class membership (supervised framework).

7 DISCRIMINANT ANALYSIS/CLASSIFICATION The goal is to find a discrimination rule for classification which, in general, classifies observations correctly, minimizes the probability of misclassification, and minimizes the expected cost of misclassification. If the population distributions are known, we can use this knowledge to obtain the discriminant function. Otherwise we have to use training data to derive an accurate discrimination rule.

8 CLASSIFICATION: TWO POPULATIONS Let Π_i, i = 1, 2, denote two populations and π_i be the prior probability that a randomly chosen observation comes from the i-th population: π_1 = Pr(Π_1), π_2 = Pr(Π_2) = 1 − π_1. Given is the observed X = (X_1, ..., X_p). We model the distribution of X in each of the Π_i's separately: let f_i(x) denote the probability density function (pdf) of X corresponding to Π_i. Partition the total sample space so that Ω = R_1 ∪ R_2 and R_1 ∩ R_2 = ∅, where R_i = {x : x is assigned to Π_i} represents the region of observed X corresponding to Π_i. Define the misclassification probability p(2|1) = Pr(assign X to Π_2 | actually X ∈ Π_1), and similarly p(1|2).

9 DISCRIMINATION RULES [Figure: training data for two 2-dimensional populations with linear (left) and quadratic (right) decision boundaries; discussed on the next slide.]

10 CLASSIFICATION: TWO POPULATIONS (CONT.) The figure above shows an example of a classification problem for two 2-dimensional populations (with two random variables, p = 2); the training data (with class labels) are shown in the scatter plots. The red-dotted lines show the linear (left) and quadratic (right) decision boundaries that are used to define the decision regions R_1 and R_2. New observations will be assigned to population Π_1 or Π_2 depending on which decision region they fall into. We can already anticipate that our discrimination rule for a new (unseen) observation will not be perfect: some percentage of samples will likely be misclassified. With random X, we can, at best, reduce the probabilities of misclassification.

11 TWO POPULATIONS: MISCLASSIFICATION PROB. Theoretically, the misclassification probabilities can be computed as p(2|1) = Pr(X is in Π_1 but is assigned to Π_2) = ∫_{R_2} f_1(x) dx, where R_2 = Ω − R_1, and analogously p(1|2) = Pr(X ∈ R_1 | Π_2) = ∫_{R_1} f_2(x) dx. Notation: ∫_{R_i} f(x_1, ..., x_p) dx_1 ⋯ dx_p is a p-fold integral. For p = 1, see the figure on the board! The more separated f_1(x) and f_2(x) are, the smaller the misclassification probabilities p(2|1) and p(1|2). Goal: we need to fix an optimality criterion to construct the discrimination rule!
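For p = 1 these integrals are ordinary normal tail areas. A minimal sketch (not from the lecture; the two normal populations and the cut-off t below are illustrative assumptions):

```python
# Misclassification probabilities for p = 1 under the cut-off rule
# R_1 = {x : x < t}, R_2 = {x : x >= t}; all parameter values are made up.
from scipy.stats import norm

mu1, sd1 = 0.0, 1.0   # Pi_1 ~ N(mu1, sd1^2)
mu2, sd2 = 2.0, 1.0   # Pi_2 ~ N(mu2, sd2^2)
t = 1.0               # decision boundary between R_1 and R_2

p_2_given_1 = 1 - norm.cdf(t, mu1, sd1)  # mass of f_1 that falls in R_2
p_1_given_2 = norm.cdf(t, mu2, sd2)      # mass of f_2 that falls in R_1
print(p_2_given_1, p_1_given_2)          # both shrink as f_1, f_2 separate
```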

12 TWO POPULATIONS: OPTIMALITY CRITERIA Decision theory strategy. Decision theory is concerned with finding optimal decisions given certain information. Both estimation and hypothesis testing can be viewed as techniques of decision theory, as can classification. A) What if we know in advance (a priori) that, say, 80% of the observations come from Π_1 and the remaining 20% from Π_2? How can we use this information to optimize the classifier? B) What if one kind of misclassification costs more than the other? How can we use this information when designing the classifier? C) What if we wish to minimize the total probability of misclassification? How does this affect the structure of the classifier?

13 TWO POPULATIONS: OPTIMALITY CRITERIA (CONT.) A) Maximize the posterior probability that the observation X belongs to the i-th population. This is the Bayesian approach (see the corresponding section in ISL), which results in the so-called Bayes classifier: assign a test observation x_0 to the population with the largest posterior probability, given the feature values for that observation. Let p(Π_i | x) denote the posterior probability that an observation X = x belongs to Π_i. We compute the probabilities of the populations Π_1 and Π_2 after observing x (hence the name posterior probabilities); the prior probabilities satisfy π_1 + π_2 = 1. By Bayes' theorem (details on the board), p(Π_1 | x) = π_1 f_1(x) / (π_1 f_1(x) + π_2 f_2(x)), p(Π_2 | x) = 1 − p(Π_1 | x). The discriminant rule that maximizes the class posterior probability is: assign x to Π_1 if p(Π_1 | x) > p(Π_2 | x), otherwise to Π_2.
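As a sketch of how this posterior rule can be evaluated numerically (the Gaussian densities and all parameter values below are illustrative assumptions, not part of the lecture):

```python
# Bayes classifier for two univariate normal populations:
# compute p(Pi_1 | x) via Bayes' theorem and pick the larger posterior.
from scipy.stats import norm

pi1, pi2 = 0.8, 0.2              # prior probabilities, pi1 + pi2 = 1
f1 = norm(0.0, 1.0).pdf          # density f_1 of Pi_1
f2 = norm(2.0, 1.0).pdf          # density f_2 of Pi_2

def posterior_pi1(x):
    num = pi1 * f1(x)
    return num / (num + pi2 * f2(x))

x = 1.3
p1 = posterior_pi1(x)
# p(Pi_1|x) > p(Pi_2|x) is equivalent to p(Pi_1|x) > 1/2
print(p1, 1 if p1 > 0.5 else 2)
```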

14 OPTIMALITY CRITERIA (CONT.) What if the costs of the different misclassifications differ? For example, the cost of classifying a patient with a dangerous or deadly disease as healthy is higher than the cost of classifying a healthy patient as having a dangerous disease. B) Minimize the expected cost of misclassification (ECM). Let c(1|2) and c(2|1) be the costs associated with p(1|2) and p(2|1), and let c(1|1) = c(2|2) = 0 (no cost for correct decisions). The expected cost of misclassification is ECM = c(1|2) p(1|2) π_2 + c(2|1) p(2|1) π_1. The discriminant rule that minimizes the ECM is (see proof on the board): assign x to Π_1 if f_1(x)/f_2(x) > (π_2 c(1|2)) / (π_1 c(2|1)), else to Π_2.

15 OPTIMALITY CRITERIA (CONT.) C) Minimize the total probability of misclassification (TPM): TPM = p(2|1) π_1 + p(1|2) π_2 = π_1 ∫_{R_2} f_1(x) dx + π_2 ∫_{R_1} f_2(x) dx. This leads (the proof is similar to the ECM case) to the discriminant rule: assign x to Π_1 if f_1(x)/f_2(x) > π_2/π_1, else to Π_2. Special cases of the ECM rule: if π_1 = π_2, the rule becomes f_1(x)/f_2(x) > c(1|2)/c(2|1). If c(1|2) = c(2|1), it becomes f_1(x)/f_2(x) > π_2/π_1, the same as the TPM and Bayes classifiers. If c(1|2) = c(2|1) and π_1 = π_2, it becomes f_1(x)/f_2(x) > 1: the likelihood ratio classification rule.
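All three criteria reduce to thresholding the likelihood ratio f_1(x)/f_2(x); only the threshold changes. A hedged sketch (the costs, priors and densities are illustrative assumptions):

```python
# Likelihood-ratio rule: assign x to Pi_1 iff
#   f1(x) / f2(x) > (pi2 * c(1|2)) / (pi1 * c(2|1)).
from scipy.stats import norm

f1, f2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf
pi1, pi2 = 0.8, 0.2
c12, c21 = 5.0, 1.0   # c(1|2): cost of assigning a Pi_2 member to Pi_1

def classify(x):
    threshold = (pi2 * c12) / (pi1 * c21)      # ECM threshold
    return 1 if f1(x) / f2(x) > threshold else 2

# With c12 = c21 the threshold is pi2/pi1 (TPM/Bayes rule); with equal
# priors as well it becomes 1 (pure likelihood ratio rule).
print(classify(0.5), classify(1.9))
```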

16 TWO MULTIVARIATE NORMAL POPULATIONS: LDA AND QDA When we use normal (Gaussian) distributions for each population Π_i, this leads to linear or quadratic discriminant analysis, LDA or QDA. However, the approach is quite general, and other distributions can be used as well. We will focus on normal distributions. Assume that f_i(x) is N_p(µ_i, Σ_i) corresponding to Π_i, i = 1, 2; µ_i is a class-specific mean vector and Σ_i is the class covariance matrix. For X ∼ N_p(µ, Σ) we have E(X) = µ, Cov(X) = Σ, and the density is f(x) = (2π)^{-p/2} |Σ|^{-1/2} exp(−(1/2)(x − µ)′ Σ^{-1} (x − µ)).

17 TWO BIVARIATE NORMAL DENSITY FUNCTIONS. FIGURE: Two two-dimensional (p = 2) density functions. Left: X_1 and X_2 are uncorrelated. Right: Corr(X_1, X_2) = 0.7.

18 TWO MULTIVARIATE NORMAL POPULATIONS: LDA Assume that f_i(x) is N_p(µ_i, Σ) corresponding to Π_i, where Σ is the covariance matrix common to both populations. Then log(f_1(x)/f_2(x)) = (µ_1 − µ_2)′ Σ^{-1} x − (1/2)(µ_1 − µ_2)′ Σ^{-1} (µ_1 + µ_2) = D(x), say. The discriminant rule that minimizes the ECM is: assign x to Π_1 if D(x) > log(π_2 c(1|2) / (π_1 c(2|1))), else to Π_2. Proof on the board. For Σ_1 = Σ_2, D(x) is linear in x, which is the reason for the name linear discriminant function, or linear discriminant analysis (LDA). Σ_1 ≠ Σ_2 results in a quadratic discriminant rule, QDA, which will be discussed more later.
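In matrix form the population rule takes a few lines; a sketch with made-up µ_i and Σ (equal priors and costs, so the threshold is log(1) = 0):

```python
# Population LDA: D(x) = (mu1-mu2)' Sigma^{-1} x
#                        - (1/2)(mu1-mu2)' Sigma^{-1} (mu1+mu2)
import numpy as np

mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])        # common covariance matrix

w = np.linalg.solve(Sigma, mu1 - mu2) # Sigma^{-1} (mu1 - mu2)

def D(x):
    return w @ x - 0.5 * w @ (mu1 + mu2)

x = np.array([0.2, 0.5])
print(1 if D(x) > 0 else 2)           # assign to Pi_1 if D(x) > 0
```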

19 LINEAR DISCRIMINANT FUNCTIONS.

20 TWO MULTIVARIATE NORMAL POPULATIONS: LDA In practice, the population parameters µ_i and Σ are unknown: we estimate D(x) from the data! Estimation technique: plug the estimated µ_i and Σ into D(x). This gives a sample discriminant rule. Given the data X_i : n_i × p from Π_i, calculate x̄_i and S_i (unbiased). Since Σ_1 = Σ_2, use S_pooled = ((n_1 − 1) S_1 + (n_2 − 1) S_2) / (n_1 + n_2 − 2) and obtain D̂(x) = (x̄_1 − x̄_2)′ S_pooled^{-1} x − (1/2)(x̄_1 − x̄_2)′ S_pooled^{-1} (x̄_1 + x̄_2). The sample ECM rule is: assign x to Π_1 if D̂(x) > log(π_2 c(1|2) / (π_1 c(2|1))), else to Π_2.
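A sketch of the plug-in estimation; simulated data stand in for a real training set:

```python
# Sample LDA: pooled covariance estimate and plug-in discriminant D_hat.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=(30, 2)) + np.array([1.0, 1.0])  # n1 x p from Pi_1
X2 = rng.normal(size=(22, 2))                         # n2 x p from Pi_2
n1, n2 = len(X1), len(X2)

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)                 # unbiased sample covariances
S2 = np.cov(X2, rowvar=False)
S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

d = np.linalg.solve(S_pooled, xbar1 - xbar2)  # S_pooled^{-1}(xbar1 - xbar2)

def D_hat(x):
    return d @ x - 0.5 * d @ (xbar1 + xbar2)

x = np.array([0.8, 0.9])
print(1 if D_hat(x) > 0 else 2)               # threshold 0: equal priors/costs
```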

21 SOME REMARKS ON LDA Estimation of the π_i's: usually π̂_i = n_i/(n_1 + n_2) is assumed; otherwise, use a Bayes prior, a guess, etc. With D̂(x) there is no assurance that the resulting rule will minimize the ECM in a particular application, because the optimal rule was derived assuming the population densities f_i(x) were known completely. But D̂(x) is expected to perform well if the sample sizes n_i are large. Denote d̂′ = (x̄_1 − x̄_2)′ S_pooled^{-1} and let ŷ(x) = d̂′ x and ȳ_i = d̂′ x̄_i, i = 1, 2. When π_2 c(1|2) / (π_1 c(2|1)) = 1, the discriminant rule becomes: assign x to Π_1 if ŷ(x) > (1/2)(ȳ_1 + ȳ_2). As ŷ(x) and the ȳ_i's are linear combinations, the multivariate expressions convert to univariate ones.

22 LDA. EXAMPLE The example is adapted from the AMSA book and is concerned with the detection of hemophilia A carriers. The goal is to construct a procedure for classifying patients as hemophilia A carriers or not. Measurements on two variables were conducted for n_1 = 30 women from the non-carrier population Π_1 and n_2 = 22 from the carrier population Π_2: X_1 = log_10(AHF activity), X_2 = log_10(AHF-like antigen). Data: X is approximately bivariate normal on the log-transformed scale. Assume Σ_1 = Σ_2. To construct a sample-based LD function, the following is provided: x̄_1 = (−0.0065, −0.0390)′, x̄_2 = (−0.2483, 0.0262)′, S_pooled^{-1} = (131.158, −90.423; −90.423, 108.147).

23 LDA. EXAMPLE (CONT.) For the sample-based LD function we have ŷ(x) = (x̄_1 − x̄_2)′ S_pooled^{-1} x = 37.61 x_1 − 28.92 x_2, i.e., d̂′ = (37.61, −28.92). Then ȳ_1 = (x̄_1 − x̄_2)′ S_pooled^{-1} x̄_1 = 0.88 and ȳ_2 = (x̄_1 − x̄_2)′ S_pooled^{-1} x̄_2 = −10.10. The midpoint between these two means is (1/2)(ȳ_1 + ȳ_2) = −4.61. Recall that the discriminant rule, when π_2 c(1|2) / (π_1 c(2|1)) = 1, is ŷ(x) > (1/2)(ȳ_1 + ȳ_2). Hence assign x to Π_1 (normal) if ŷ(x) > (1/2)(ȳ_1 + ȳ_2) = −4.61, else to Π_2 (carrier).

24 LDA. EXAMPLE (CONT.) The discriminant rule is: assign x to the normal group Π_1 if ŷ(x) > (1/2)(ȳ_1 + ȳ_2) = −4.61, else to the carrier group Π_2. Q: Measurements of AHF activity and AHF-like antigen on a woman who might be a hemophilia A carrier give x_1 = −0.210, x_2 = −0.044. Should this woman be classified as Π_1 (normal) or Π_2 (carrier)? Given the new observation x_0 = (−0.210, −0.044)′ we obtain ŷ(x_0) = −6.62 < (1/2)(ȳ_1 + ȳ_2) = −4.61. Hence we assign her to the carrier group, i.e., assign the new observation x_0 to Π_2. This classifier assumes equal costs and equal priors! R estimates the prior probabilities from the training data by default if none are specified.
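The slide's arithmetic can be checked directly; a sketch that reuses the estimates printed above (d̂ and the sample means as reconstructed on the previous slides):

```python
# Reproducing the hemophilia A example with the slide's estimates.
import numpy as np

d = np.array([37.61, -28.92])        # d_hat' = (xbar1-xbar2)' S_pooled^{-1}
xbar1 = np.array([-0.0065, -0.0390]) # non-carriers, Pi_1
xbar2 = np.array([-0.2483, 0.0262])  # carriers, Pi_2

ybar1, ybar2 = d @ xbar1, d @ xbar2  # approx 0.88 and -10.10
midpoint = 0.5 * (ybar1 + ybar2)     # approx -4.61

x0 = np.array([-0.210, -0.044])      # the new observation
y0 = d @ x0                          # approx -6.62
print(1 if y0 > midpoint else 2)     # 2: assign x0 to the carrier group
```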

25 SOME REMARKS ON LDA (CONT.) Consider the rule (with π_2 c(1|2) / (π_1 c(2|1)) = 1) again. We have D(x) = log(f_1(x)/f_2(x)) = (µ_1 − µ_2)′ Σ^{-1} x − (1/2)(µ_1 − µ_2)′ Σ^{-1} (µ_1 + µ_2). Rule: assign x to Π_1 if D(x) > 0, otherwise to Π_2. The rule is called the linear discriminant function (LDF). The Bayes decision boundary {x : D(x) = 0} is a hyperplane (of dimension p − 1) dividing the two classes. See ISL, p. 144: the Bayes decision boundary represents the set of values x for which δ_1(x) = δ_2(x), where δ_i(x) = x′ Σ^{-1} µ_i − (1/2) µ_i′ Σ^{-1} µ_i, i = 1, 2 (the term log(π_i) disappears since it is the same for Π_1 and Π_2).

26 EVALUATION OF PERFORMANCE ACCURACY The optimality criteria are based on probabilities of misclassification: the smaller they are, the better the classifier performs. Recall TPM = π_1 ∫_{R_2} f_1(x) dx + π_2 ∫_{R_1} f_2(x) dx, where min(TPM) = the optimum misclassification rate, OMR. For completely known distributions of the Π_i's, the OMR can be computed exactly. For example, for LDA with the Π_i's defined by N_p(µ_i, Σ) we have OMR = min(TPM) = Φ(−Δ/2), where Δ² = (µ_1 − µ_2)′ Σ^{-1} (µ_1 − µ_2), Δ being the Mahalanobis distance between Π_1 and Π_2, and Φ(·) is the cdf of N(0, 1).
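A sketch of the exact OMR computation for known parameters (the µ_i and Σ below are illustrative assumptions):

```python
# Optimum misclassification rate for two N_p(mu_i, Sigma) populations:
# OMR = Phi(-Delta/2), with Delta^2 = (mu1-mu2)' Sigma^{-1} (mu1-mu2).
import numpy as np
from scipy.stats import norm

mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

diff = mu1 - mu2
Delta = np.sqrt(diff @ np.linalg.solve(Sigma, diff))  # Mahalanobis distance
print(norm.cdf(-Delta / 2))  # larger Delta -> smaller optimum error rate
```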

27 MULTI-SAMPLE LDA FIGURE: Three Gaussian classes with p = 2. Left: ellipses represent regions containing 95% of the probability for each of the three classes; the dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are shown as solid lines along with the Bayes boundaries (dashed lines). Overall, the sample-based LDA boundaries are close to the Bayes decision boundaries.
