5. Discriminant analysis


We continue from Bayes's rule presented in Section 3 on p. 85:

$$P(c_i \mid x) = \frac{p(x \mid c_i)\,P(c_i)}{\sum_{j} p(x \mid c_j)\,P(c_j)} \tag{5.1}$$

where $c_i$ is a class, $x$ is a $p$-dimensional vector (data case), and the class-conditional probability (density function) $p(x \mid c_i)$ and the a priori class probability $P(c_i)$ appear in both the numerator and the denominator. See Fig. 5.1.

Fig. 5.1 Converting the a priori class probability $P(c_i)$ to the a posteriori probability $P(c_i \mid x)$ via vector $x$: Bayes's rule combines $p(x \mid c_i)$ and $P(c_i)$ for case $x$.
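The conversion from prior to posterior in Bayes's rule can be sketched in a few lines of Python. This is only an illustration: the function name `posteriors` and the likelihood and prior values are assumptions chosen for the example, not values from the course material.

```python
def posteriors(likelihoods, priors):
    """Return P(c_i | x) for every class from p(x | c_i) and P(c_i) (Bayes's rule)."""
    joint = [lk * pr for lk, pr in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x), the denominator of Bayes's rule
    return [j / evidence for j in joint]


# Hypothetical class-conditional densities at one case x, and equal priors.
post = posteriors(likelihoods=[0.30, 0.10], priors=[0.5, 0.5])
print(post)
```

With equal priors the posterior is proportional to the class-conditional density alone, so the first class receives three times the posterior probability of the second.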

On the basis of Section 3, we know that minimizing the risk or the error probability is equivalent to partitioning the variable space into $M$ regions, one for each class. If regions $R_i$ and $R_j$ happen to be contiguous, they are separated by a decision boundary or decision surface in the multidimensional variable space. For the minimum error probability, this is described by the following equation.

$$P(c_i \mid x) - P(c_j \mid x) = 0$$

On one side of the boundary the difference $P(c_i \mid x) - P(c_j \mid x)$ is positive and on the other it is negative. Sometimes, instead of working directly with probabilities (or risk functions), it may be more convenient to work with an equivalent function of them, for example:

$$g_i(x) = f\big(P(c_i \mid x)\big) \tag{5.2}$$

Here $f(\cdot)$ is a monotonically increasing function, and $g_i(x)$ is known as a discriminant function. Using (5.2), the decision test is stated as

$$\text{classify } x \text{ in } c_i \text{ if } g_i(x) > g_j(x) \text{ for all } j \neq i \tag{5.3}$$

and the decision boundaries separating contiguous regions are defined in the following way.

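$$g_{ij}(x) = g_i(x) - g_j(x) = 0, \quad i, j = 1, 2, \ldots, C, \; j \neq i$$

Next we focus on a particular family of decision boundaries associated with Bayesian classification and the Gaussian density (normal distribution). We employ the Gaussian or normal density function from Section 3, p. 117:

$$p(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\Big) \tag{5.4}$$

where $\mu$ is the mean vector, $\Sigma$ is the $p \times p$ covariance matrix and $|\Sigma|$ its determinant.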
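As a small illustration, the multivariate normal density can be evaluated directly for $p = 2$, where the inverse and determinant of the covariance matrix have closed forms. The sketch below assumes the standard form of the density; the function name and the test values are hypothetical.

```python
import math


def gaussian_density_2d(x, mean, cov):
    """Evaluate the normal density for p = 2 with covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b  # |Sigma|
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via the closed-form 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    norm = (2.0 * math.pi) * math.sqrt(det)  # (2 pi)^{p/2} |Sigma|^{1/2} with p = 2
    return math.exp(-0.5 * quad) / norm


# Density at the mean of a standard bivariate normal distribution.
peak = gaussian_density_2d((0.0, 0.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])
print(peak)
```

At the mean of a standard bivariate normal the quadratic form vanishes, so the value reduces to $1/(2\pi)$.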

5.1 Quadratic discriminant analysis

Let us define a discriminant function for the ith class from (5.1):

$$g_i(x) = P(c_i \mid x) = \frac{p(x \mid c_i)\,P(c_i)}{p(x)}$$

Given a vector $x$ of $p$ variables, classification according to (5.1) and (5.3) is based on finding the greatest discriminant function value. The denominator $p(x)$ is neglected, because for a fixed $x$ it is the same positive constant for every class $c_i$. (We could simplify even further by assuming equal a priori probabilities; this would mean choosing the class for which $p(x \mid c_i)$ is greatest.)

As written above, any monotonically increasing function of $g_i(x)$ is also a valid discriminant function. The logarithmic function meets this requirement and gives an alternative discriminant function:

$$g_i(x) = \ln\big(p(x \mid c_i)\,P(c_i)\big)$$

We now simplify the arrangement to illustrate the solution approach.

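We attain the following form

$$g_i(x) = \ln p(x \mid c_i) + \ln P(c_i) \tag{5.5}$$

where case $x$ follows a multidimensional normal distribution in every class. From (5.4) we derive

$$\ln p(x \mid c_i) = -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) \tag{5.6}$$

where the mean vector $\mu_i$ of the ith class, the covariance matrix $\Sigma_i$ and its determinant $|\Sigma_i|$ are used as usual.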

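By substituting (5.6) into (5.5) we obtain the following form.

$$g_i(x) = -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) + \ln P(c_i) \tag{5.7}$$

The first term is dropped, because it is the same constant for every class. The remaining function is denoted $d_i(x)$.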

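In discriminant analysis a test case $x$ is classified into the class $c_i$ that gives the greatest $d_i(x)$. Let us consider a simple two-dimensional example and assume that the covariance matrix is as follows.

$$\Sigma_i = \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix}$$

Then (5.7), without its constant first term, becomes

$$d_i(x) = -\ln\sigma_i^2 - \frac{1}{2\sigma_i^2}\big(x_1^2 + x_2^2\big) + \frac{1}{\sigma_i^2}\big(\mu_{i1}x_1 + \mu_{i2}x_2\big) - \frac{1}{2\sigma_i^2}\big(\mu_{i1}^2 + \mu_{i2}^2\big) + \ln P(c_i)$$

and obviously the associated decision curves $d_i(x) - d_j(x) = 0$ are quadrics, that is, ellipses, parabolas, hyperbolas, or pairs of lines.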

Fig. 5.2 shows the decision curves corresponding to $P(c_1) = P(c_2) = 1/2$, $\mu_1 = (0\ 0)^T$ and $\mu_2 = (1\ 0)^T$. The covariance matrices of the two classes differ between Fig. 5.2(a) and Fig. 5.2(b).

Fig. 5.2 Quadric decision curves: (a) ellipse, (b) hyperbola ($c_1$ and $c_2$ are the classes).
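The quadratic discriminant rule of this section can be sketched for $p = 2$ as follows. The sketch assumes the usual Gaussian log-discriminant with the class-independent constant dropped. The means and priors follow the setting of Fig. 5.2, but the covariance matrices and the test points are hypothetical values chosen only for illustration.

```python
import math


def quadratic_discriminant(x, mean, cov, prior):
    """Gaussian log-discriminant d_i(x) for p = 2, class-independent constant dropped."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * quad + math.log(prior)


def classify(x, classes):
    """Return the index of the class with the greatest d_i(x)."""
    scores = [quadratic_discriminant(x, *params) for params in classes]
    return max(range(len(scores)), key=lambda i: scores[i])


# (mean, covariance, prior) per class; the covariance matrices are hypothetical.
classes = [
    ((0.0, 0.0), [[0.3, 0.0], [0.0, 0.35]], 0.5),
    ((1.0, 0.0), [[1.2, 0.0], [0.0, 1.85]], 0.5),
]
print(classify((0.0, 0.0), classes), classify((3.0, 0.0), classes), classify((-3.0, 0.0), classes))
```

With these assumed matrices the tightly concentrated first class wins only near its own mean, while points far away on either side go to the second class, which is what an enclosed, ellipse-like decision curve would produce.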

5.2 Linear discriminant analysis

The only quadratic contribution in (5.7) comes from the term $x^T \Sigma_i^{-1} x$. If we now assume that the covariance matrix is the same in all classes, that is, $\Sigma_i = \Sigma$, the quadratic term will be the same in all discriminant functions. Hence, it does not enter into the comparison for computing the maximum and it cancels out in the decision surface equations. Thus, it can be omitted and we redefine

$$d_i(x) = w_i^T x + w_{i0} \tag{5.8}$$

where

$$w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = \ln P(c_i) - \frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i .$$

Hence $d_i(x)$ is a linear discriminant function of $x$ and the respective decision surfaces are hyperplanes.
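A minimal sketch of the linear discriminant function for $p = 2$ follows. It assumes the standard parametrization $w_i = \Sigma^{-1}\mu_i$ and $w_{i0} = \ln P(c_i) - \tfrac{1}{2}\mu_i^T\Sigma^{-1}\mu_i$ for a covariance matrix shared by all classes. The shared covariance, the means and the test case are hypothetical.

```python
import math


def linear_discriminant_params(mean, cov, prior):
    """Return (w_i, w_i0) for p = 2 and a shared covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    # w_i = Sigma^{-1} mu_i, using the closed-form inverse of a 2x2 matrix.
    w = ((c * mean[0] - b * mean[1]) / det, (-b * mean[0] + a * mean[1]) / det)
    # w_i0 = ln P(c_i) - 0.5 * mu_i^T Sigma^{-1} mu_i
    w0 = math.log(prior) - 0.5 * (mean[0] * w[0] + mean[1] * w[1])
    return w, w0


def discriminant(x, w, w0):
    """Linear discriminant d_i(x) = w_i^T x + w_i0."""
    return w[0] * x[0] + w[1] * x[1] + w0


cov = [[1.0, 0.5], [0.5, 2.0]]  # hypothetical shared covariance matrix
params = [linear_discriminant_params(m, cov, 0.5) for m in [(0.0, 0.0), (1.0, 0.0)]]
x = (0.9, 0.2)
scores = [discriminant(x, w, w0) for w, w0 in params]
print(scores.index(max(scores)))
```

Because the quadratic term is common to all classes under this assumption, comparing these linear scores selects the same class as comparing the full Gaussian log-discriminants.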

For a diagonal covariance matrix with equal elements we have $\Sigma = \sigma^2 I$, where $I$ is the $p$-dimensional identity matrix and $\sigma^2$ the variance. Then (5.8) becomes

$$d_i(x) = \frac{1}{\sigma^2}\mu_i^T x + w_{i0}$$

Thus, the corresponding decision hyperplanes can be written as

$$w^T (x - x_0) = 0$$

where

$$w = \mu_i - \mu_j$$

and

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\frac{P(c_i)}{P(c_j)}\,\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$$

where

$$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2} \tag{5.9}$$

is the Euclidean norm of a vector. Now the decision surface is a hyperplane passing through the point $x_0$. If $P(c_i) = P(c_j)$, then $x_0 = (\mu_i + \mu_j)/2$, and the hyperplane passes through the midpoint of the mean vectors $\mu_i$ and $\mu_j$. The geometry is illustrated in Fig. 5.3 for the two-dimensional case. The decision hyperplane (a line here) is orthogonal to $\mu_i - \mu_j$. For any point $x$ lying on the decision hyperplane, the vector $x - x_0$ also lies on the hyperplane.

Fig. 5.3 Decision line (hyperplane) for two classes and normally distributed vectors with $\Sigma = \sigma^2 I$.

If $P(c_i) < P(c_j)$ (or $P(c_i) > P(c_j)$), the hyperplane is closer to $\mu_i$ (or $\mu_j$). If $\sigma^2$ is small with respect to the Euclidean norm of $\mu_i - \mu_j$, the location of the hyperplane is rather insensitive to the values of $P(c_i)$ and $P(c_j)$. A small variance indicates that the random vectors are clustered within a small radius around the means. Fig. 5.4 illustrates this. For each class, the circles show regions in which cases have a high probability, say 98%, of being found.

Fig. 5.4 Decision lines (hyperplanes) for (a) small variance and (b) large variance. The location of the hyperplane is much more critical in the latter than in the former.
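The effect of the priors and of the variance on the hyperplane location can be checked numerically. The sketch below assumes the standard expression for the point $x_0$ under $\Sigma = \sigma^2 I$, namely the midpoint of the means shifted along $\mu_i - \mu_j$ by $\sigma^2 \ln(P(c_i)/P(c_j)) / \|\mu_i - \mu_j\|^2$. The function name and all numeric values are hypothetical.

```python
import math


def decision_point(mu_i, mu_j, sigma2, prior_i, prior_j):
    """Return (w, x0) of the hyperplane w^T (x - x0) = 0 for Sigma = sigma^2 I."""
    w = [a - b for a, b in zip(mu_i, mu_j)]
    norm2 = sum(c * c for c in w)  # squared Euclidean norm of mu_i - mu_j
    shift = sigma2 * math.log(prior_i / prior_j) / norm2
    x0 = [0.5 * (a + b) - shift * c for a, b, c in zip(mu_i, mu_j, w)]
    return w, x0


mu_i, mu_j = (0.0, 0.0), (2.0, 0.0)
w, x0_equal = decision_point(mu_i, mu_j, 1.0, 0.5, 0.5)       # equal priors
_, x0_skewed = decision_point(mu_i, mu_j, 1.0, 0.2, 0.8)      # P(c_i) < P(c_j)
_, x0_small_var = decision_point(mu_i, mu_j, 0.01, 0.2, 0.8)  # same priors, small variance
print(x0_equal, x0_skewed, x0_small_var)
```

With equal priors the point is the midpoint of the means. With the smaller prior on class $c_i$ it moves toward $\mu_i$, and with the small variance the same priors move it only slightly, in line with Fig. 5.4.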

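For a nondiagonal covariance matrix we end up with hyperplanes given by

$$w^T (x - x_0) = 0$$

where

$$w = \Sigma^{-1}(\mu_i - \mu_j)$$

and

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\frac{P(c_i)}{P(c_j)}\,\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}$$

Here the norm from (5.9) is replaced with the Mahalanobis distance (norm)

$$\|x\|_{\Sigma^{-1}} = \big(x^T \Sigma^{-1} x\big)^{1/2}$$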

This is like the case of the diagonal covariance matrix, with one exception: the decision hyperplane is no longer orthogonal to the vector $\mu_i - \mu_j$, but to its linear transformation $\Sigma^{-1}(\mu_i - \mu_j)$. These are illustrated in Fig. 5.5.

Fig. 5.5 Curves of (a) equal Euclidean distance and (b) equal Mahalanobis distance from the mean points of each class. (The $\lambda$'s and $v$'s are eigenvalues and eigenvectors and $c$ is a constant.)
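The difference between the Euclidean and the Mahalanobis case can be made concrete for $p = 2$. The sketch assumes the usual definition of the Mahalanobis norm, $(v^T \Sigma^{-1} v)^{1/2}$. The helper names and the covariance matrix are hypothetical.

```python
import math


def solve_2x2(cov, v):
    """Return Sigma^{-1} v for a 2x2 covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    return ((c * v[0] - b * v[1]) / det, (-b * v[0] + a * v[1]) / det)


def mahalanobis(v, cov):
    """Mahalanobis norm (v^T Sigma^{-1} v)^{1/2}."""
    u = solve_2x2(cov, v)
    return math.sqrt(v[0] * u[0] + v[1] * u[1])


cov = [[2.0, 1.0], [1.0, 1.0]]  # hypothetical nondiagonal covariance matrix
diff = (1.0, 0.0)               # mu_i - mu_j
w = solve_2x2(cov, diff)        # normal vector of the decision hyperplane
print(w, mahalanobis(diff, cov))
```

Here the normal vector $w$ is not parallel to $\mu_i - \mu_j$, so the hyperplane is tilted relative to the line joining the means. With the identity matrix as covariance, `mahalanobis` reduces to the Euclidean norm.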

Fisher's linear discriminant

Let us return to formula (5.8):

$$d_i(x) = w_i^T x + w_{i0}$$

The decision regions are separated by hyperplanes, as seen. Conceptually, a $p$-dimensional vector $x$ is reduced to a single dimension (scalar) that is then used for classification. In the following we consider $C$ equal to 2; the approach can be extended to $C > 2$ by considering the classes pairwise (one-versus-one). Fig. 5.6 summarizes the idea of Fisher's linear discriminant, which involves projecting the $p$-dimensional data onto an appropriate line, thereby reducing the data to a single value on a line.

Fig. 5.6 The classes are $c_1$ and $c_2$, outlined by their probability density functions (pdf). Every data case is projected onto a line in direction $w$ that attempts to best separate the classes.

The projection line is oriented to maximize class separation. Considerable utility is gained when the size of the data vector is large, e.g., $p$ equal to 50, 100 or greater. Let a training or learning set $H = \{x_1, x_2, \ldots, x_N\} = \{H_1, H_2\}$ be partitioned into two subsets corresponding to the two classes, containing $N_1$ and $N_2$ training cases. We get the following formula for the data vector projections.

$$y_k = w^T x_k, \quad k = 1, 2, \ldots, N \tag{5.10}$$

When we further constrain the norm of $w$ to be equal to 1, each $y_k$ is the projection of $x_k$ onto the line in the direction $w$. This line always goes through the origin of the $p$-dimensional real space. The objective is to choose the direction so that the classes are separated optimally. There are several measures of the separation of the projected classes. The simplest one is the difference of the means of the projections. For example, $|E[y \mid c_1] - E[y \mid c_2]|$ is such a measure, where we compute the expected values according to (5.10) as sample means of the projections onto $w$:

$$E[y \mid c_i] \approx \frac{1}{N_i}\sum_{x_k \in H_i} w^T x_k, \quad i = 1, 2$$

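Thus, the projection mean for each class is a scalar, given by

$$\tilde m_i = \frac{1}{N_i}\sum_{y \in Y_i} y = w^T m_i$$

where $m_i$ is the sample mean of the vectors in $H_i$ and $Y_i$ the class-based subset of projections. The difference of the projection means is therefore:

$$|\tilde m_1 - \tilde m_2| = |w^T (m_1 - m_2)|$$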

The difference between the means of the projected data alone is often insufficient for a good classifier. We need well-separated class projections that are not intermingled. To achieve this, we need to consider the variances of the projections $y$ in each $Y_i$ relative to the means. A better class separation measure for $C = 2$ is the ratio (difference of means) / (variance of within-class data), where $s_i^2$ is the sample variance of the projections in $Y_i$:

$$J(w) = \frac{(\tilde m_1 - \tilde m_2)^2}{s_1^2 + s_2^2}$$
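A small sketch shows how such a ratio rates candidate projection directions. It assumes the common form of the criterion, squared difference of the projected means divided by the sum of the projected within-class variances, with variances computed by dividing by the class size. The training cases are hypothetical.

```python
def fisher_criterion(w, class1, class2):
    """(Difference of projected means)^2 / (sum of projected within-class variances)."""
    def project(data):
        return [sum(wi * xi for wi, xi in zip(w, x)) for x in data]  # y = w^T x

    def mean_and_variance(ys):
        m = sum(ys) / len(ys)
        return m, sum((y - m) ** 2 for y in ys) / len(ys)  # variance with divisor N_i

    m1, s1 = mean_and_variance(project(class1))
    m2, s2 = mean_and_variance(project(class2))
    return (m1 - m2) ** 2 / (s1 + s2)


# Hypothetical training cases: the classes differ mainly along the first variable.
h1 = [(0.0, 0.0), (0.2, 1.0), (-0.2, 2.0)]
h2 = [(2.0, 0.0), (2.2, 1.0), (1.8, 2.0)]
j_first = fisher_criterion((1.0, 0.0), h1, h2)
j_second = fisher_criterion((0.0, 1.0), h1, h2)
print(j_first, j_second)
```

Projecting onto the first variable separates the two classes clearly, whereas projecting onto the second variable maps both classes onto the same values and gives a criterion of zero.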

5.3 Logistic discrimination

The basic assumption is that the difference between the logarithms of the class-conditional density functions is linear in the variables of $x$:

$$\ln\frac{p(x \mid c_1)}{p(x \mid c_2)} = \beta_0 + \beta^T x$$

The assumption is satisfied by many families of distributions and is thus applicable to a wide range of real data sets that depart from the normal distribution.

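Since the posterior class probabilities have to sum to 1, we obtain

$$p(c_1 \mid x) = \frac{\exp(\beta_0' + \beta^T x)}{1 + \exp(\beta_0' + \beta^T x)}, \qquad p(c_2 \mid x) = \frac{1}{1 + \exp(\beta_0' + \beta^T x)} \tag{5.11}$$

where $\beta_0$ and $\beta$ are the intercept and the coefficient vector of the assumed linear function and

$$\beta_0' = \beta_0 + \ln\frac{P(c_1)}{P(c_2)}.$$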

Discrimination between two classes depends on the ratio $p(c_1 \mid x)/p(c_2 \mid x)$: assign $x$ to $c_1$ if $p(c_1 \mid x)/p(c_2 \mid x) > 1$, otherwise to $c_2$. Using (5.11) we see that the decision is determined solely by the linear function, as follows: assign $x$ to $c_1$ if $\beta_0' + \beta^T x > 0$, otherwise to $c_2$.
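The two-class rule can be sketched as follows, assuming the usual logistic form in which the posterior of the first class is $\exp(z)/(1 + \exp(z))$ with $z = \beta_0' + \beta^T x$. The coefficient values are hypothetical and are not fitted to any data.

```python
import math


def linear_function(x, beta0_prime, beta):
    """z = beta0' + beta^T x."""
    return beta0_prime + sum(b * xi for b, xi in zip(beta, x))


def posterior_c1(x, beta0_prime, beta):
    """p(c_1 | x) under the two-class logistic model."""
    z = linear_function(x, beta0_prime, beta)
    return 1.0 / (1.0 + math.exp(-z))  # equals exp(z) / (1 + exp(z))


def assign(x, beta0_prime, beta):
    """Assign to class 1 if the linear function is positive, otherwise to class 2."""
    return 1 if linear_function(x, beta0_prime, beta) > 0 else 2


beta0_prime, beta = -1.0, (2.0, 0.5)  # hypothetical coefficients
x = (1.0, 0.0)
print(posterior_c1(x, beta0_prime, beta), assign(x, beta0_prime, beta))
```

The sign test on the linear function and the test "posterior of the first class greater than 1/2" always agree, so the exponential never has to be evaluated for classification alone.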

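Multiclass logistic discrimination

In the multiclass discrimination (regression) problem, the basic assumption is that, for $C$ classes,

$$\ln\frac{p(x \mid c_j)}{p(x \mid c_C)} = \beta_{j0} + \beta_j^T x, \quad j = 1, \ldots, C-1$$

That is, the log-likelihood ratio is linear for any pair of likelihoods. Again, we may show that the posterior probabilities are of the form

$$p(c_j \mid x) = \frac{\exp(\beta_{j0}' + \beta_j^T x)}{1 + \sum_{s=1}^{C-1}\exp(\beta_{s0}' + \beta_s^T x)}, \quad j = 1, \ldots, C-1, \qquad p(c_C \mid x) = \frac{1}{1 + \sum_{s=1}^{C-1}\exp(\beta_{s0}' + \beta_s^T x)}$$

where

$$\beta_{j0}' = \beta_{j0} + \ln\frac{P(c_j)}{P(c_C)}$$

Also, the decision rule depends solely on the linear functions: assign $x$ to $c_m$ if

$$\max_{j = 1, \ldots, C-1}\big(\beta_{j0}' + \beta_j^T x\big) = \beta_{m0}' + \beta_m^T x > 0,$$

otherwise assign $x$ to $c_C$.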
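The multiclass case can be sketched in the same way, assuming the last class $c_C$ serves as the reference class whose linear function is fixed at zero. The function names and coefficient values are hypothetical.

```python
import math


def linear_functions(x, coefficients):
    """z_j = beta_j0' + beta_j^T x for j = 1, ..., C-1."""
    return [b0 + sum(b * xi for b, xi in zip(beta, x)) for b0, beta in coefficients]


def multiclass_posteriors(x, coefficients):
    """Posterior probabilities for C classes; the last class is the reference class."""
    z = linear_functions(x, coefficients)
    denom = 1.0 + sum(math.exp(v) for v in z)
    return [math.exp(v) / denom for v in z] + [1.0 / denom]


def assign(x, coefficients):
    """Return the 1-based class label chosen by the linear decision rule."""
    z = linear_functions(x, coefficients)
    m = max(range(len(z)), key=lambda j: z[j])
    return m + 1 if z[m] > 0 else len(z) + 1


# One (beta_j0', beta_j) pair per non-reference class; hypothetical values, C = 3.
coefficients = [(0.5, (1.0, 0.0)), (-0.5, (0.0, 1.0))]
x = (0.0, 2.0)
post = multiclass_posteriors(x, coefficients)
print(post, assign(x, coefficients))
```

The class selected by the linear rule is the one with the largest posterior probability, and the reference class is chosen exactly when every linear function is non-positive.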

Example 1: Demographic vs. crime variables

Table 5.1 Classification accuracies (%) of discriminant analysis after scaling or without it.

Method      Not scaled   Scaled into [0,1]   Standardized
Linear
Logistic

Returning to the example [3] from p. 122 and p. 153, we obtained results that were either worse or better than those given by naive Bayes, and either worse than or equal to those given by nearest neighbor searching.

[3] X. Li, H. Joutsijoki, J. Laurikkala and M. Juhola: Crime vs. demographic factors: application of data mining methods, Webology, Article 132, 12(1), 1-19,

Example 2: Vertigo data

Let us view the Vertigo data [4] from Section 3 and Section 4. For linear discriminant analysis we obtained a high average accuracy of 95.5%, weighted by the class sizes. This was slightly better than the weighted averages in Section 4. Applying the Mahalanobis or quadratic discriminant function was not successful, because the matrices obtained in the computation were not positive definite, which prevented their use.

[4] M. Juhola: Data Classification, Encyclopedia of Computer Science and Engineering, ed. B. Wah, John Wiley & Sons, 2008 (print version, 2009, Hoboken, NJ),
