THE LINEAR DISCRIMINATION PROBLEM

Priscilla Doyle

What exactly is the linear discrimination story?

In the logistic regression problem we have a 0/1 dependent variable, and we set up a model that predicts this from independent variables. Specifically we use

$$\operatorname{logit} P[Y_i = 1] = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}$$

This has assumed k independent variables.

As an alternative, we might try to ask what linear combination of the X's is most useful for distinguishing $Y_i = 1$ from $Y_i = 0$.

The story line is roughly similar. In logistic regression, we are fitting the probability $P[Y_i = 1]$. We are estimating the slopes, determining which predictors are useful, and noting the sensitivity of the response to each predictor. We maintain the illusion that we might be able to influence the outcomes through manipulation of the independent variables.

In linear discrimination, we do not control the X's. The objective is just being able to predict Y with good probability.

The data come to us as samples from two populations. There are $n_1$ values from population 1, in which the distribution of X is normal with mean $\mu_1$ and with variance $\Sigma$. There are $n_2$ values from population 2, in which the distribution of X is normal with mean $\mu_2$ and with variance $\Sigma$. The variance matrix is assumed the same. We're thinking of the populations as 1 and 2, rather than 0 and 1. No big deal.

The probability density for population j (j = 1, 2) is the multivariate normal density, which is

$$f_j(x) = (\text{factor with } \Sigma)\, \exp\left\{ -\tfrac{1}{2} (x - \mu_j)' \Sigma^{-1} (x - \mu_j) \right\}$$

We can also impose prior probabilities on the problem, $\pi_1$ and $\pi_2 = 1 - \pi_1$. These might or might not be related to the sample sizes $n_1$ and $n_2$.
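The density is written above with an unspecified "factor with Σ". A minimal sketch of why that factor can be left vague: because both populations share the same Σ, the factor cancels whenever the two densities are compared. The mean vectors, variance matrix, and point x below are made-up illustration values, and the explicit normalizing constant is the standard one for the multivariate normal.

```python
import numpy as np

def log_density(x, mu, sigma):
    """Log of the multivariate normal density, normalizing factor included."""
    k = len(mu)
    d = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = d @ np.linalg.solve(sigma, d)          # (x - mu)' Sigma^{-1} (x - mu)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + quad)

def exponent_only(x, mu, sigma):
    """Just the term inside exp{...}, ignoring the factor with Sigma."""
    d = x - mu
    return -0.5 * (d @ np.linalg.solve(sigma, d))

# Hypothetical parameters, chosen only for illustration.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
x = np.array([1.2, 0.1])

# Log of f_1(x) / f_2(x), computed with and without the common factor.
full = log_density(x, mu1, sigma) - log_density(x, mu2, sigma)
short = exponent_only(x, mu1, sigma) - exponent_only(x, mu2, sigma)
print(full, short)
```

The two printed values agree up to rounding, which is the reason only the exponent matters in what follows.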

Let's say you get a value x. What population is it from?

$$P[\text{population 1} \mid \text{data } x] = \frac{P[\text{data } x \cap \text{population 1}]}{P[\text{data } x]} = \frac{P[\text{population 1}]\, P[\text{data } x \mid \text{population 1}]}{P[\text{data } x]}$$

$$= \frac{P[\text{popn 1}]\, P[\text{data } x \mid \text{popn 1}]}{P[\text{popn 1}]\, P[\text{data } x \mid \text{popn 1}] + P[\text{popn 2}]\, P[\text{data } x \mid \text{popn 2}]}$$

In a similar style, get $P[\text{population 2} \mid \text{data } x]$. You will then be able to get the ratio

$$\frac{P[\text{population 1} \mid \text{data } x]}{P[\text{population 2} \mid \text{data } x]} = \frac{\pi_1\, P[\text{data } x \mid \text{population 1}]}{(1 - \pi_1)\, P[\text{data } x \mid \text{population 2}]}$$

The substitution of the multivariate normal density will lead to a condition of the form

Classify as population 1 if $a'x \ge c$
Classify as population 2 if $a'x < c$

The vector a is the linear discriminator.

This can go beyond two populations. We have training sets of values from M populations. The data are all vectors of the same form; that is, every vector is K-by-1 and the meanings of all the coordinates are the same. In a medical investigation on human subjects, the first coordinate might be age, the second might be height, and so on.

From population 1, with mean vector $\mu_1$, we have $n_1$ values.
From population 2, with mean vector $\mu_2$, we have $n_2$ values.
...
From population M, with mean vector $\mu_M$, we have $n_M$ values.

It is assumed that the population variance matrices are Σ (all the same). The populations might also be given prior probabilities $\pi_1, \pi_2, \ldots, \pi_M$. (If these are not given, some people use $\pi_j = n_j / n_+$, where $n_+ = n_1 + n_2 + \cdots + n_M$ is the total sample size.)
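For two populations, the step from the posterior ratio to the linear rule can be written out as follows. This is a sketch under the stated model (common Σ, priors π_1 and 1 − π_1); the explicit expressions for a and c are the standard ones for this model and are not quoted from the handout.

```latex
\begin{align*}
\log \frac{\pi_1 f_1(x)}{(1-\pi_1) f_2(x)}
  &= \log\frac{\pi_1}{1-\pi_1}
     - \tfrac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1)
     + \tfrac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2) \\
  &= \log\frac{\pi_1}{1-\pi_1}
     + (\mu_1-\mu_2)'\Sigma^{-1}x
     - \tfrac{1}{2}(\mu_1-\mu_2)'\Sigma^{-1}(\mu_1+\mu_2).
\end{align*}
% The factor with \Sigma and the quadratic term x'\Sigma^{-1}x cancel
% because both populations share the same \Sigma.
% The ratio is at least 1 exactly when a'x \ge c, with
\[
a = \Sigma^{-1}(\mu_1-\mu_2), \qquad
c = \tfrac{1}{2}(\mu_1-\mu_2)'\Sigma^{-1}(\mu_1+\mu_2)
    - \log\frac{\pi_1}{1-\pi_1}.
\]
```

With equal priors the log term vanishes, and c is the value of a'x at the midpoint between the two mean vectors.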

Then, given a new random vector X, the task is to identify which population it came from. The solution will find M vectors $a_1, a_2, \ldots, a_M$. We will classify this as population j if $a_j'X$ is the biggest value.

If M = 2, we will classify population 1 if and only if $a_1'X > a_2'X$. This is of course equivalent to $(a_1 - a_2)'X > 0$. The two-population discrimination problem is commonly described in terms of a single vector.

Let's illustrate the linear discrimination function with the file on the baby weights, LOWBWT.MTP. The columns of this sheet are

ID, LOW, AGE, LWT, RACE, SMOKE, PTL, HT, UI, FTV, BWT

The variable RACE is categorical with three levels, so we'll use Calc > Make Indicator Variables to break it into separate indicators.

Use Stat > Multivariate > Discriminant Analysis. The grouping variable will be LOW, which was coded as 1 = low birth weight and 0 = not low birth weight. If we take the defaults, here is what happens:

Discriminant Analysis: LOW versus AGE, LWT, ...

Linear Method for Response: LOW
Predictors: AGE, LWT, SMOKE, PTL, HT, UI, FTV, RACE1, RACE3

[Group counts and the summary of classification table (True Group by Put into Group, with Total N, N correct, Proportion): entries not recoverable]

N = 189   N Correct = 128   Proportion Correct = 0.677

[Squared Distance Between Groups: entries not recoverable]

[Linear Discriminant Function for Groups: one column per group, with rows Constant, AGE, LWT, SMOKE, PTL, HT, UI, FTV, and the two race indicators; coefficients not recoverable]

We can start by noting that 130/189 = 68.78% are in the group LOW = 0. Thus a naive method, always guess LOW = 0, would be right 68.78% of the time. Observe that the self-classification gets 67.7% correct (which is terrible).

You have two discriminant functions here, and the method of classification is to choose the group that gets the higher value.

You can check off the cross-validation box. If you do, the classification for data row j is based on the discriminant function obtained from the other n − 1 points. In this case, it does slightly worse.

You can see from various displays that these two groups are very badly overlapped. Here's one:

[Figure: Boxplot of AGE, grouped by LOW]
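The ideas on this page (one discriminant function per group, choosing the group with the higher value, and cross-validation that refits on the other n − 1 rows) can be sketched as below. This is an illustration on synthetic data, not a reproduction of the Minitab run: the function names are hypothetical, the group sizes 130 and 59 only mimic the LOW = 0 and LOW = 1 split, equal priors are assumed by default, and the constant term follows the standard form for a common variance matrix.

```python
import numpy as np

def fit_discriminants(X, y, priors=None):
    """Return (groups, A, const): one linear discriminant function per group."""
    groups = np.unique(y)
    means = np.array([X[y == g].mean(axis=0) for g in groups])
    resid = X - means[np.searchsorted(groups, y)]        # subtract own-group mean
    pooled = resid.T @ resid / (len(y) - len(groups))    # common variance estimate
    if priors is None:
        priors = np.full(len(groups), 1.0 / len(groups)) # assumption: equal priors
    A = np.linalg.solve(pooled, means.T).T               # row j is pooled^{-1} mean_j
    const = -0.5 * np.sum(A * means, axis=1) + np.log(np.asarray(priors, dtype=float))
    return groups, A, const

def classify(X_new, groups, A, const):
    """Choose the group whose discriminant function gives the higher value."""
    return groups[np.argmax(X_new @ A.T + const, axis=1)]

def leave_one_out_accuracy(X, y):
    """Classify each row j using functions fitted to the other n - 1 rows."""
    n = len(y)
    hits = 0
    for j in range(n):
        keep = np.arange(n) != j
        fitted = fit_discriminants(X[keep], y[keep])
        hits += int(classify(X[j:j + 1], *fitted)[0] == y[j])
    return hits / n

# Synthetic, badly overlapped groups (illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(130, 3)),
               rng.normal(0.4, 1.0, size=(59, 3))])
y = np.repeat([0, 1], [130, 59])

naive = np.mean(y == 0)                                   # always guess group 0
resub = np.mean(classify(X, *fit_discriminants(X, y)) == y)
loo = leave_one_out_accuracy(X, y)
print(f"naive {naive:.4f}  self-classification {resub:.4f}  cross-validated {loo:.4f}")
```

The naive rate here is 130/189 by construction; the other two rates depend on the simulated data and need not resemble the Minitab figures.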

The Options box can help. Set this up as

[Image: Options dialog box]

This gets much better results:

Discriminant Analysis: LOW versus AGE, LWT, ...

Linear Method for Response: LOW
Predictors: AGE, LWT, SMOKE, PTL, HT, UI, FTV, Race1, Race2

[Group counts, priors, and the summary of classification table (True Group by Put into Group, with Total N, N correct, Proportion): entries not recoverable]

N = 189   N Correct = 136

[Squared Distance Between Groups: entries not recoverable]

[Linear Discriminant Function for Groups: one column per group, with rows Constant, AGE, LWT, SMOKE, PTL, HT, UI, FTV, and the two race indicators; coefficients not recoverable]

You could apply logistic regression to the same set of data. Make the predictions based on the $\hat{p}_j$ values (which Minitab will compute for you). Use the cutoff 0.50 to make the groups. This will make 49 errors out of 189; the probability of correct prediction is 140/189 = 74.07%, so it did slightly better!

Let's try this on the Easton data set (EASTON.mtp). The variables were these:

MONTH, PRICE, SIZE, BEDROOM, AGE, SUBD, AGENCY, Avon, Bellewood, Chelsea

Let's see what discriminates (aside from price) those homes sold by agents (AGENCY = 1) from those that were sold by the builder. Use Stat > Multivariate > Discriminant Analysis. The grouping variable will be AGENCY. Use SIZE, BEDROOM, AGE. The results:

Discriminant Analysis: AGENCY versus SIZE, BEDROOM, AGE

Linear Method for Response: AGENCY
Predictors: SIZE, BEDROOM, AGE

[Group counts, total N, and the summary of classification table (True Group by Put into Group, with Total N, N correct, Proportion): entries not recoverable]

N Correct = 344

[Squared Distance Between Groups and Linear Discriminant Function for Groups (rows Constant, SIZE, BEDROOM, AGE): entries not recoverable]

In this problem, the homes not sold by agents were [percentage not recoverable] of the data. Naive guessing should get this right at least that often.

If we set the prior probabilities to match this, we have

Discriminant Analysis: AGENCY versus SIZE, BEDROOM, AGE

Linear Method for Response: AGENCY
Predictors: SIZE, BEDROOM, AGE

[Group counts, priors, total N, and the summary of classification table: entries not recoverable]

N Correct = 469

[Squared Distance Between Groups: entries not recoverable]

It got this by placing all predictions in group 0.
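Why matching the priors to very unequal group sizes can push every prediction into one group is easiest to see in one dimension. The sketch below uses the two-population rule "classify as group 1 if a·x > c", where the prior enters only through a log-odds shift in c. All numbers are hypothetical (two weakly separated groups with means 0 and 0.2, unit variance, and a 0.1 prior for group 1); they are not the Easton values.

```python
import numpy as np

def assigns_group1(x, mu0, mu1, sigma2, prior1):
    """True where the one-dimensional linear rule puts x in group 1."""
    a = (mu1 - mu0) / sigma2
    # Midpoint term minus the log prior odds for group 1.
    c = 0.5 * (mu1**2 - mu0**2) / sigma2 - np.log(prior1 / (1.0 - prior1))
    return a * x > c

x = np.linspace(-4.0, 4.0, 81)          # covers essentially all plausible data

equal_priors = assigns_group1(x, 0.0, 0.2, 1.0, prior1=0.5)
matched_priors = assigns_group1(x, 0.0, 0.2, 1.0, prior1=0.1)

print("group-1 calls with equal priors:  ", int(equal_priors.sum()))
print("group-1 calls with 0.9/0.1 priors:", int(matched_priors.sum()))
```

With the 0.9/0.1 priors the cutoff for x moves to about 11, far outside the range of the data, so every point is assigned to group 0. The rule then does exactly as well as naive guessing.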

It might be interesting to try three groups. Let's see what distinguishes the three subdivisions.

Tally for Discrete Variables: SUBD

[Counts by SUBD and total N: entries not recoverable]

We had named these three subdivisions as Avon, Bellewood, and Chelsea. Observe the proportions in the three subdivisions [values not recoverable]. We can use these as prior probabilities.

Let's make the discrimination on the basis of Price, Size, Bedroom, Age. The values of Bedroom and Age are small integers, while the values of Price and Size are large numbers. Let's begin by standardizing. We can set this up with Calc > Standardize. (There should be an equivalent operation within Calc > Calculator, but there is not.)

Here's the result.

Discriminant Analysis: SUBD versus ZPrice, ZSize, ZBedroom, ZAge

Linear Method for Response: SUBD
Predictors: ZPrice, ZSize, ZBedroom, ZAge

[Group counts and priors for groups 1, 2, 3, total N, and the summary of classification table: entries not recoverable]

N Correct = 408

[Squared Distance Between Groups: entries not recoverable]

[Linear Discriminant Function for Groups 1, 2, 3, with rows Constant, ZPrice, ZSize, ZBedroom, ZAge: coefficients not recoverable]

The discrimination is very strong on Price (favoring Avon) and very strong on Size (favoring Chelsea). The discrimination on the other variables is borderline. You can see from this summary list:

Descriptive Statistics: PRICE, SIZE, BEDROOM, AGE

[Table of N, Mean, and StDev for PRICE, SIZE, BEDROOM, and AGE within each SUBD: entries not recoverable]

You can see even better from this graph:

[Figure: Scatterplot of PRICE vs SIZE, with points marked by SUBD]
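The standardizing step used before the three-group analysis can be sketched as follows, assuming that standardizing means subtracting the column mean and dividing by the sample standard deviation. The price and bedroom values are made up for illustration and are not taken from EASTON.mtp.

```python
import numpy as np

def standardize(column):
    """Subtract the mean and divide by the sample standard deviation."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std(ddof=1)

# Hypothetical values: one large-number column, one small-integer column.
price = np.array([98000.0, 125500.0, 143000.0, 110250.0, 167900.0])
bedroom = np.array([2, 3, 4, 3, 4])

z_price = standardize(price)
z_bedroom = standardize(bedroom)

print(np.round(z_price, 3))
print(np.round(z_bedroom, 3))
```

After this step both columns have mean 0 and standard deviation 1, so the discriminant coefficients for the Z variables are on a comparable scale. The classifications themselves should not change, since a linear rule with a common variance matrix is unaffected by rescaling the predictors.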
