Cross-validation of Nearest-neighbour Discriminant Analysis

A.P. White¹
Computer Centre, Elms Road, University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom

Abstract

The SAS statistical package contains a general-purpose discriminant procedure, DISCRIM. Among the options available for this procedure are ones for performing nearest-neighbour discriminant analysis and cross-validation. Each of these works well enough when used separately but, when the two options are used together, an optimistic bias in cross-validated performance emerges. For certain parameter values, this bias can be dramatically large. The cause of the problem is analyzed mathematically for the two-class case with uniformly distributed data and demonstrated by simulation for normal data. The corresponding misbehaviour for multiple classes is also demonstrated by Monte Carlo simulation. A modification to the procedure, which would remove the bias, is proposed.

Key Words: SAS; cross-validation; nearest neighbour discriminant analysis; optimistic bias.

¹ A.P. White is also an Associate Member of the School of Mathematics and Statistics at the University of Birmingham.

1 Introduction

The general discriminant problem is one of deriving a model for classifying observations of unknown class membership by making use of a set of observations from the same population, whose class membership is known. To be more specific, let S be a set of n observations, each of the form (ω, x), where ω denotes membership of one of c classes. Each observation has measurements on m variables, giving the vector x. Let the prior probability for membership of class i be p(ω_i). Also, let the unconditional and class-specific probability densities at x be f(x) and f_i(x), respectively. Bayes' theorem then gives the posterior probability that an observation at x belongs to class i:

    p(\omega_i \mid x) = \frac{p(\omega_i) f_i(x)}{f(x)}    (1)

Because the classes are mutually exclusive and jointly exhaustive, this can be re-written as:

    p(\omega_i \mid x) = \frac{p(\omega_i) f_i(x)}{\sum_{j=1}^{c} p(\omega_j) f_j(x)}

Classification of a new observation at x is then carried out from the posterior probabilities. Thus x is predicted as belonging to class ω_j if

    p(\omega_j \mid x) = \max_i \, p(\omega_i \mid x).    (2)

Now, different approaches to discriminant analysis employ different methods of estimating the class-specific probability densities, f_i(x). In the parametric case, Fisher's linear discriminant analysis derives these estimates by assuming a multivariate normal distribution for the data, based on class-specific sample means and a pooled covariance matrix. (The quadratic version is similar but uses separate covariance matrices for each class.) The k-nearest-neighbour (k-nn) method, on the other hand, is nonparametric and makes no such distributional assumptions. In this approach, a kernel is formed in the measurement space. This kernel has the shape of a hypersphere and is centred on x. The volume, V, of the kernel is such that it is just large enough to contain k observations. Let k_i of these observations belong to class ω_i and let there be n_i observations in S belonging to class ω_i. Thus, summing over all c classes gives Σ k_i = k and Σ n_i = n.

Hand (1981) shows how posterior probabilities can be estimated in such a situation. The essence of his argument is as follows. The class-specific probability density for class ω_i at x is estimated by:

    \hat{f}_i(x) = \frac{k_i}{n_i V}    (3)

Similarly, the unconditional probability density at x is estimated by:

    \hat{f}(x) = \frac{k}{n V}    (4)

If the sample sizes, n_i, are proportional to the prior probabilities, p(ω_i), then the priors can also be estimated by:

    \hat{p}(\omega_i) = \frac{n_i}{n}    (5)

Substituting from Equations 3, 4 and 5 into 1 gives estimates for the posterior probabilities:

    \hat{p}(\omega_i \mid x) = \frac{k_i}{k}    (6)

On the other hand, if the sample sizes are not proportional to the priors, then an adjustment is required. In this case, let:

    p(\omega_i) = r_i \frac{n_i}{n}    (7)

where the various r_i are adjustment factors for the lack of proportionality. The estimates for the posterior probabilities now become:

    \hat{p}(\omega_i \mid x) = \frac{r_i k_i}{\sum_{j=1}^{c} r_j k_j}    (8)

2 Cross-validation Anomaly in SAS

The statistical package SAS contains a multi-purpose discriminant procedure, called DISCRIM. This procedure has options for k-nn discriminant analysis and for cross-validation. These options may be used in combination. However, the way in which this is implemented in SAS is responsible for a rather strange difficulty which arises under cross-validation, in the form of a parameter-dependent bias in the cross-validated error rate estimate.
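As a concrete illustration of Equations 3 to 6, the following short Python sketch estimates k-nn posterior probabilities for a single query point from a labelled sample. It is not the SAS implementation; it is a minimal sketch assuming proportional sampling (so Equation 6 applies), with hypothetical variable names.

```python
import numpy as np

def knn_posteriors(X, y, x0, k):
    """Estimate p(omega_i | x0) as k_i / k (Equation 6).

    X : (n, m) array of training measurements
    y : (n,) array of class labels
    x0: (m,) query point
    k : number of nearest neighbours defining the kernel
    """
    dists = np.linalg.norm(X - x0, axis=1)          # Euclidean distances to x0
    kernel = np.argsort(dists)[:k]                  # indices of the k nearest observations
    classes = np.unique(y)
    k_i = np.array([np.sum(y[kernel] == c) for c in classes])  # class counts in the kernel
    return classes, k_i / k                         # posterior estimates k_i / k

# Tiny usage example with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)
print(knn_posteriors(X, y, np.zeros(2), k=5))
```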

In certain circumstances, this bias can be dramatically large. This anomaly is, perhaps, best introduced by means of an example, leaving a more general treatment of the problem until later in this paper.

Suppose that cross-validation is being performed on a data set in which the measurement space consists of a single uniformly distributed variable, x, and that observations belong to one of two equiprobable classes. Suppose that x contains no information at all about class membership and that the sample sizes n_1 and n_2 are equal. Now, consider the behaviour of an algorithm, operating as previously specified, with a parameter setting of k = 2. The distribution of kernel membership over the cross-validation procedure will be very nearly binomial (k, p), where p = n_1/n. In this case, p = 1/2. (The distribution is not exactly binomial because p changes very slightly, according to the actual class membership of the observation being classified, but this small detail is unimportant here.)

The focus of interest is the consequences which follow from tied class membership in the kernels. In this example, approximately half the kernels would be expected to have one neighbour belonging to each class. Without loss of generality, consider what happens when a member of class 1 is subjected to cross-validatory classification in this situation, in order to estimate e_cv (the cross-validated error rate). From Equation (3), it can be seen that:

    \hat{f}_1(x) = \frac{1}{V(n_1 - 1)}    (9)

and also that:

    \hat{f}_2(x) = \frac{1}{V n_2}    (10)

As the prior probabilities are equal, it follows that p(ω_1|x) > p(ω_2|x) and the case will be classified correctly. Kernels with pure class membership will obviously produce classification in the expected direction. This leads to the expected value for the cross-validated error rate, E(e_cv), being only 1/4, rather than 1/2 as expected under a random assignment of observations to predicted classes.

With these parameter values, it is easy to see that when k is changed, the parity of k has a marked effect on the estimated error rate under cross-validation, because of the effect of ties in the kernel membership when k is even but not when it is odd.
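The mechanics of Equations 9 and 10 can be checked numerically. The sketch below is only an illustration, not the SAS code; it compares the leave-one-out density estimates for a tied k = 2 kernel and shows that the deleted observation's own class always receives the larger estimate when n_1 = n_2.

```python
# Leave-one-out density estimates for a tied kernel (k = 2, one neighbour per class).
# Assumes equal priors and n1 == n2; the kernel volume V cancels from the comparison.
def tied_kernel_densities(n1, n2, deleted_class):
    k1 = k2 = 1                     # one neighbour from each class in the kernel
    V = 1.0
    if deleted_class == 1:
        f1 = k1 / ((n1 - 1) * V)    # Equation 9: own class loses one from its sample size
        f2 = k2 / (n2 * V)          # Equation 10
    else:
        f1 = k1 / (n1 * V)
        f2 = k2 / ((n2 - 1) * V)
    return f1, f2

for cls in (1, 2):
    f1, f2 = tied_kernel_densities(50, 50, deleted_class=cls)
    winner = 1 if f1 > f2 else 2
    print(f"deleted class {cls}: f1={f1:.5f}, f2={f2:.5f} -> classified as {winner}")
# Both cases are classified as their own (true) class, so tied kernels never produce errors.
```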

Thus, for odd values of k, E(e_cv) = 1/2 but, for even values of k, an optimistic bias in E(e_cv) is clearly evident:

    E(e_{cv}) = \frac{1}{2}\left(1 - \binom{k}{k/2} 2^{-k}\right)    (for k even)    (11)

Another disturbing feature of this approach is the relationship that emerges between e_rs (the resubstitution error estimate) and e_cv. Under resubstitution, for the same parameter values, the effect of ties in kernel membership is different. In these circumstances, the class-specific probability densities will be exactly equal, leading to a tie in the posterior probabilities. In SAS these ties are evaluated conservatively (i.e. as classification errors). Consequently, for k > 1, the following relationship holds:

    e_{rs}^{(k+1)} = e_{cv}^{(k)}    (12)

(Of course, for k = 1, e_rs is zero, because each observation is its own nearest neighbour.) In fact, this relationship is quite general and can be shown to hold for any number of equiprobable classes. Moving from cross-validation with k nearest neighbours to resubstitution with k + 1 increases the number of neighbours of the same class as the test case by one. Because of the different consequences of having ties for the majority in kernel membership under resubstitution and cross-validation, this means that the judgement of majority membership will not differ between the two schemes.

Now, all these strange properties follow from the fact that, for tied kernel membership, the density estimates for the two classes are not equal (as might be naively expected) but are biased in favour of the class to which the deleted observation belongs. In the parametric situation, by contrast, this does not happen. Consider a particular observation being classified using Fisher's linear discriminant analysis. Suppose that, under resubstitution, the observation lies at a point exactly mid-way between the two group means (and hence the group-specific densities are equal). Under cross-validation, the group mean of the class to which the deleted case belongs will have moved slightly farther away from the observation itself (because this observation no longer makes a contribution to the computation of the mean) and hence the class-specific density estimate will be somewhat lower for the true class than for the other class.
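The size of the bias in Equation 11 is easy to tabulate. The following sketch (illustrative only) evaluates the expected cross-validated error rate for a range of even k, confirming the value of 1/4 quoted above for k = 2.

```python
from math import comb

# Expected cross-validated error rate under the tie behaviour described above,
# for two equiprobable, completely overlapping classes (Equation 11).
def expected_ecv_even(k):
    assert k % 2 == 0
    return 0.5 * (1 - comb(k, k // 2) * 2 ** (-k))

for k in (2, 4, 6, 8, 10, 12):
    print(k, round(expected_ecv_even(k), 4))
# k = 2 gives 0.25; the bias shrinks towards 1/2 as k grows,
# because exact ties become less likely.
```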

When the sample sizes are not equal, the effect of sample size on the estimates of group-specific density, and hence on the estimated posterior probabilities, is easily calculated for ties in kernel membership. The results are summarised in Table 1. It can easily be seen that, for |n_1 - n_2| ≤ 1, there is an optimistic bias in the classification behaviour under cross-validation. Outside these limits, the mean error rate has the appropriate theoretical value but the performance is markedly different for observations from the two different classes.

                 Actual class
    n_1 - n_2    1          2
    > 1          wrong      correct
      1          tie        correct
      0          correct    correct
     -1          correct    tie
    < -1         correct    wrong

Table 1: Classification behaviour under cross-validation, for ties in kernel membership, as a function of differences in sample size. See text for further explanation.

3 Differences in Class Location

For uniformly distributed data, it is a simple matter to generalise the argument just presented to the situation where there are differences in location between two equiprobable classes. Let one class have uniformly distributed data lying in the range (0, 1) and the other have uniform data in the range (s, 1 + s). Thus s is the separation distance between the class means and the classes overlap for a distance 1 - s on the data line. A Bayes' decision rule will give errors only where the classes overlap in the data space. Thus, the Bayes' error rate, e_b, is given simply by:

    e_b = \frac{1}{2}(1 - s)    (13)
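Table 1 can be reproduced directly from the density estimates of Equations 9 and 10. The short sketch below is an illustration only; it assumes equal prior probabilities, as in the example above, and classifies a deleted case from a tied kernel for a range of sample-size differences.

```python
# Reproduce Table 1: outcome for a deleted case when the kernel is tied (k1 == k2),
# assuming equal prior probabilities, so only the density estimates matter.
def tie_outcome(n1, n2, actual_class):
    # Leave-one-out class-specific densities (Equations 9 and 10; V and k_i cancel).
    f1 = 1.0 / (n1 - 1) if actual_class == 1 else 1.0 / n1
    f2 = 1.0 / (n2 - 1) if actual_class == 2 else 1.0 / n2
    if f1 == f2:
        return "tie"
    predicted = 1 if f1 > f2 else 2
    return "correct" if predicted == actual_class else "wrong"

n2 = 50
print("n1-n2   class 1    class 2")
for diff in (3, 1, 0, -1, -3):
    n1 = n2 + diff
    print(f"{diff:>5}   {tie_outcome(n1, n2, 1):<9}  {tie_outcome(n1, n2, 2)}")
```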

In this situation, the k-nn classification rule should approximate to the Bayes' rate, provided that k is small compared with the sample sizes. In SAS, under cross-validation, this is the case only for odd values of k. For even values of k, the effect of kernel ties in the overlap region produces an optimistic bias in E(e_cv). Just as before, this is clearly evident:

    E(e_{cv}^{(k)}) = e_b\left(1 - \binom{k}{k/2} 2^{-k}\right)    (for k even)    (14)

4 Unequal Class Probabilities

Any attempt to extend the argument just presented to the situation involving unequal class probabilities immediately runs into a complicating factor in the performance of the nearest-neighbour algorithm. This arises from the fact that the k-nn classification rule is optimal only for the special case of equal class probabilities. Cover & Hart (1967) proved that, for any number of classes, the single nearest-neighbour decision rule produces an error rate, e_1, which is bounded below by the Bayes' error rate and above by twice the Bayes' rate:

    e_b \le E(e_1) \le 2 e_b    (15)

This is easily illustrated with the sort of discrimination problem under consideration in this paper. For a two-class problem, with the sample sizes proportional to the prior probabilities, then provided that the data are smoothly distributed, the following error rate analysis is applicable. Within the region of overlap, for an observation in class 1, the probability of a classification error is simply the probability that its nearest neighbour in the data space is in class 2, i.e. p(ω_2). Likewise, the probability of mis-classifying a case in class 2 is p(ω_1). Without loss of generality, let p(ω_1) be the smaller of the two class probabilities, i.e. p(ω_1) ≤ p(ω_2). Taking class separation into account, weighting the classes by their prior probabilities and writing p(ω_2) as 1 - p(ω_1) gives:

    E(e_1) = 2(1 - s)\, p(\omega_1)\,(1 - p(\omega_1))    (16)

By contrast, the Bayes' error rate is given by the simple decision rule of classifying according to the most frequent class. Hence:

    e_b = (1 - s)\, p(\omega_1)    (17)
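A quick numerical check of the ratio implied by Equations 16 and 17 (an illustrative sketch only; the separation s cancels) shows how the single nearest-neighbour rule moves from the Bayes rate towards the Cover-Hart upper bound:

```python
# Ratio of the 1-nn error rate (Eq. 16) to the Bayes rate (Eq. 17); s cancels.
for p1 in (0.5, 0.3, 0.1, 0.01):
    e1_over_eb = 2 * (1 - p1)
    print(f"p(w1) = {p1:<5}  E(e1)/eb = {e1_over_eb:.2f}")
# 1.00 at p1 = 0.5, approaching the upper bound of 2 as p1 -> 0 (Equation 15).
```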

Clearly, E(e_1) = e_b when p(ω_1) = 1/2 and tends to the value 2e_b as p(ω_1) approaches zero. Thus, it is easier to deal with the general values of k without reference to the Bayes' error rate. For the general k-nn rule, within the region of overlap, the expected error rates for odd and even values of k can be obtained from binomial expansions, as shown below. To simplify the notation, let the prior probability of one of the classes be p. Then, by the same sort of argument used earlier, odd values of k give:

    e_o = p \sum_{i=0}^{(k-1)/2} \binom{k}{i} p^i (1-p)^{k-i} + (1-p) \sum_{i=(k+1)/2}^{k} \binom{k}{i} p^i (1-p)^{k-i}    (18)

For even values of k, two quantities can be defined:

    e = p \sum_{i=0}^{k/2-1} \binom{k}{i} p^i (1-p)^{k-i} + (1-p) \sum_{i=k/2+1}^{k} \binom{k}{i} p^i (1-p)^{k-i}    (19)

and

    e_e = e + \binom{k}{k/2} p^{k/2} (1-p)^{k/2}    (20)

Thus the true expected cross-validated error rates are:

    E(e_{cv}) = (1 - s)\, e_o    (for k odd)    (21)

and

    E(e_{cv}) = (1 - s)\, e_e    (for k even)    (22)

However, because of the way that kernel ties are evaluated in SAS:

    E(e_{cv}) = (1 - s)\, e    (for k even, in SAS)    (23)

Thus, for even values of k, an optimistic bias in the cross-validated error rate is noticeable as:

    (1 - s)\,(e_e - e) = (1 - s) \binom{k}{k/2} p^{k/2} (1-p)^{k/2}    (24)
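The binomial expressions in Equations 18 to 24 are straightforward to evaluate. The sketch below (illustrative only, with an arbitrary choice of prior p and separation s) computes the overlap-region error rates and the resulting bias for a range of k.

```python
from math import comb

def binom_pmf(i, k, p):
    return comb(k, i) * p**i * (1 - p)**(k - i)

def overlap_error_rates(k, p):
    """Return (e_o or e, tie term) for the overlap region, following Equations 18-20."""
    if k % 2 == 1:
        e = (p * sum(binom_pmf(i, k, p) for i in range(0, (k - 1)//2 + 1))
             + (1 - p) * sum(binom_pmf(i, k, p) for i in range((k + 1)//2, k + 1)))
        return e, 0.0
    e = (p * sum(binom_pmf(i, k, p) for i in range(0, k//2))
         + (1 - p) * sum(binom_pmf(i, k, p) for i in range(k//2 + 1, k + 1)))
    tie = binom_pmf(k//2, k, p)            # probability of a tied kernel
    return e, tie

p, s = 1/3, 0.0                            # example prior and class separation
for k in range(1, 9):
    e, tie = overlap_error_rates(k, p)
    true_rate = (1 - s) * (e + tie)        # Equation 21 or 22 (tie term is zero for odd k)
    sas_rate = (1 - s) * e                 # Equation 23 for even k
    print(f"k={k}: true={true_rate:.3f}  SAS-style={sas_rate:.3f}  bias={(1 - s) * tie:.3f}")
```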

5 The Normal Error Distribution

So far, the analysis in this paper of cross-validation of the k-nn algorithm in SAS has examined its behaviour in dealing with the uniform error distribution only. This distribution was chosen because of the simplicity that it lends to the theoretical analysis. This simplicity arises from the fact that the range boundaries allow the data space to be partitioned into regions which differ strongly in class-specific density. Also, within each region, the class-specific densities are necessarily constant. If we turn to examining behaviour with other distributions, such as the normal, then theoretical analysis becomes more difficult because of the fact that the class-specific densities change continuously with the value of x. Furthermore, the normal curve cannot be integrated analytically, which adds to the problem. For this reason, it was thought preferable to examine the behaviour of the algorithm with normally-distributed data by means of Monte Carlo simulation.

Twelve thousand observations were drawn from the normal (0, 1) distribution and two binary class membership indicators were simulated independently. One had exactly equal numbers of cases in each class, while the other was arranged to have exactly two-thirds of the cases in the most frequent class. Two situations were examined. In one, x was uncorrelated with class. In the other, a new variable, x1, was derived from x simply by adding the binary class membership indicator, thereby introducing a class separation distance of unity between the respective class means. Thus four conditions were arranged, as follows:

EQRAND: equiprobable classes and no class separation;
EQDIST: equiprobable classes and unit class separation;
NEQRAND: class membership odds of 2 : 1 and no class separation;
NEQDIST: class membership odds of 2 : 1 and unit class separation.

For each condition, the classification performance of the k-nn discriminant algorithm in SAS was estimated with the cross-validation option. Prior probabilities were estimated from the data. Each condition was tested using values of k from 1 to 12. For comparison purposes, Fisher's linear discriminant analysis was also applied to the same data.
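The Monte Carlo experiment can be approximated in a few lines of code. The sketch below is not the SAS DISCRIM procedure; it is a minimal leave-one-out re-implementation of the density and tie behaviour described above, run on a smaller sample (2000 rather than 12000 observations) for the EQRAND condition only, with hypothetical function names.

```python
import numpy as np

def sas_style_loo_knn_error(x, y, k):
    """Leave-one-out k-nn error rate mimicking the behaviour described above:
    priors fixed at n_i/n, but the density for the deleted case's own class
    uses (n_i - 1) observations, and posterior ties are counted as errors."""
    n = len(y)
    classes, counts = np.unique(y, return_counts=True)
    errors = 0
    for i in range(n):
        d = np.abs(x - x[i])
        d[i] = np.inf                                # delete the case itself
        kernel = np.argsort(d)[:k]                   # k nearest remaining observations
        scores = np.empty(len(classes))
        for j, (c, n_c) in enumerate(zip(classes, counts)):
            k_c = np.sum(y[kernel] == c)
            n_eff = n_c - 1 if c == y[i] else n_c    # deletion affects the own class only
            scores[j] = (n_c / n) * k_c / n_eff      # prior times density (V cancels)
        top = np.flatnonzero(scores == scores.max())
        if len(top) > 1 or classes[top[0]] != y[i]:
            errors += 1                              # ties in the posterior count as errors
    return errors / n

rng = np.random.default_rng(1)
n = 2000                                             # smaller than the paper's 12000, for speed
x = rng.normal(size=n)
y = np.repeat([0, 1], n // 2)                        # EQRAND: equiprobable classes, x uninformative
for k in range(1, 7):
    print(k, round(sas_style_loo_knn_error(x, y, k), 3))
# Even k give impossibly optimistic rates (about 0.25 at k = 2); odd k stay near 0.5.
```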

Table 2 (numerical entries not legible in this copy): Cross-validation estimates of error rates as a function of k and experimental condition (EQRAND, EQDIST, NEQRAND, NEQDIST), obtained in Monte Carlo simulation. The parametric cross-validation estimates for Fisher's linear discriminant analysis are also given, in addition to the Bayes error rate. See text for further explanation.

The resulting error rates are shown in Table 2. For each experimental condition, there is the same type of optimistic bias in E(e_cv), for even values of k, as was deduced analytically for the uniform error distribution. It is obvious that these estimates are impossibly optimistic because they are smaller than the Bayes' rate, most noticeably for k = 2. These results are hardly surprising, because the essence of the problem lies in the high frequency of kernel ties and the way that they are evaluated in SAS, and is not dependent on the particular within-class error distribution.

6 Extension to Multiple Classes by Monte Carlo Simulation

If more than two classes are considered, the position becomes more complex because many different possibilities for ties emerge for majority kernel membership. For example, if c = 3 and k = 6, a three-way tie is possible, as well as three possible two-way ties. Also, ties emerge even when k is not a multiple of c. For example, with k = 6 and four classes, a kernel might have two observations from each of two classes and a single observation from each of the other two classes. Thus, an exact analytic approach to examining the effect of ties for more than two classes is extremely tedious and not worth the trouble. For this reason, a simple Monte Carlo simulation was performed, as follows.

Twelve thousand observations were drawn from a uniform distribution and class membership indicator variables for 2, 3, 4, 5 and 6 classes were simulated (independently of the continuous variable), so as to have exactly equal numbers of observations in each class. Thus five data sets were generated, all sharing the same independent variable, which conveyed no information about class membership. The classification performance of k-nn discriminant analysis was estimated using the DISCRIM procedure in SAS, with the cross-validation option. Each data set was tested using values of k from 1 to 12. The resulting error rates are shown in Table 3. The following points should be noted.

1. The error rates for c = 2 are as expected from Equation 11.
2. In all cases where k = 1, the error rates approximate closely to the theoretical expected values.
3. In all situations where c > 2 and k > 1, the results show clearly that the estimated error rates are substantially lower than the corresponding theoretical expected values.
4. Resubstitution estimates of the error rates were also recorded. For k > 1, they confirmed exactly the relationship with the cross-validation estimates stated in Equation 12.
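Although the exact analysis is tedious by hand, the tie probabilities are easy to enumerate by machine. The sketch below is illustrative only; it assumes each neighbour's class is independently and uniformly distributed over the c equiprobable classes (ignoring finite-sample effects), and compares the expected error rate when a tie involving the true class is always resolved in its favour with the unbiased rate obtained by breaking such ties at random.

```python
from itertools import product
from math import factorial

def tie_bias(c, k):
    """Expected error rates for an uninformative variable and c equiprobable classes,
    assuming the k neighbour classes are i.i.d. uniform over the c classes."""
    biased = unbiased = 0.0
    for counts in product(range(k + 1), repeat=c):
        if sum(counts) != k:
            continue
        prob = factorial(k) / (c ** k)        # multinomial probability of this composition
        for m in counts:
            prob /= factorial(m)
        top = max(counts)
        winners = [j for j, m in enumerate(counts) if m == top]
        # Without loss of generality the deleted (true) class is class 0.
        if 0 in winners:
            unbiased += prob * (1 - 1 / len(winners))   # random choice among tied winners
            # biased rule: the true class always wins such a tie, so no error is added
        else:
            biased += prob
            unbiased += prob
    return biased, unbiased

for c in (2, 3, 4):
    for k in (2, 3, 6):
        b, u = tie_bias(c, k)
        print(f"c={c} k={k}: biased={b:.3f}  unbiased={u:.3f}  theoretical={1 - 1/c:.3f}")
# The unbiased rule recovers the theoretical rate 1 - 1/c; the biased rule falls below it
# whenever c > 2 and k > 1, and for even k when c = 2.
```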

Table 3 (numerical entries not legible in this copy): Cross-validation estimates of error rates as a function of k and number of classes, obtained in Monte Carlo simulation. The parametric cross-validation estimates for Fisher's linear discriminant analysis are also given, in addition to the Bayes error rate. See text for further explanation.

7 A Possible Remedy

The basis of the problem lies in peculiarities of the density estimation procedure in the k-nn algorithm under cross-validation, compounded by the high frequency of kernel ties. However, it is possible to compensate for this by making adjustments to the estimates of the prior probabilities. Hence, one solution to the problem is to estimate the prior probabilities from the data after case deletion, rather than fix them from the outset as is done conventionally.² If this course of action is taken then, under cross-validation, if the deleted case belongs to class i, the prior probability for membership of class i is then estimated by:

    \hat{p}(\omega_i) = \frac{n_i - 1}{n - 1}    (25)

² Note that this adaptation is proposed for the nonparametric algorithm only.

However, the prior probability for membership of any of the other classes, j, is given by:

    \hat{p}(\omega_j) = \frac{n_j}{n - 1}    (26)

The corresponding class-specific densities at x are estimated as:

    \hat{f}_i(x) = \frac{k_i}{V(n_i - 1)}    (27)

and

    \hat{f}_j(x) = \frac{k_j}{V n_j}    (28)

and the unconditional density is estimated by:

    \hat{f}(x) = \frac{k}{(n - 1) V}    (29)

Thus the corresponding estimated posterior probabilities become:

    \hat{p}(\omega_i \mid x) = \frac{k_i}{k}    (30)

and

    \hat{p}(\omega_j \mid x) = \frac{k_j}{k}    (31)

In these circumstances, if there is a tie for majority kernel membership involving the class to which the deleted case belongs, then there will also be a tie in the estimated posterior probabilities. It is proposed that a random classification choice is made between the tied classes in these cases. If this is done, then the resulting error rate will be essentially unbiased.

Of course, the approach just described is appropriate only when the samples have been drawn so as to be representative of the populations that they are intended to represent. If the priors are non-proportional, then this approach needs to be modified. In this situation, the priors must be specified initially by the user. To begin with, this additional information is ignored and the computation is performed as previously specified, up to the point where the posterior probabilities are estimated. However, the posterior probabilities then need to be adjusted for the lack of proportionality before the assignment to classes is made. Thus, if π_i is the user-specified prior for class ω_i, then the adjustment factor required is:

    r_i = \frac{\pi_i}{\hat{p}(\omega_i)}    (32)

The appropriately adjusted estimate of the posterior probability is then given by Equation 8.
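A minimal sketch of the proposed remedy (again not SAS code; hypothetical function names, equiprobable classes and a single uninformative variable are assumed) shows that the deletion-adjusted priors together with random tie-breaking remove the even-k bias:

```python
import numpy as np

def remedied_loo_knn_error(x, y, k, rng):
    """Leave-one-out k-nn error with priors re-estimated after case deletion
    (Equations 25-31) and ties in the posterior broken at random."""
    n = len(y)
    classes = np.unique(y)
    errors = 0
    for i in range(n):
        d = np.abs(x - x[i])
        d[i] = np.inf
        kernel = np.argsort(d)[:k]
        k_counts = np.array([np.sum(y[kernel] == c) for c in classes])
        post = k_counts / k                      # Equations 30 and 31: k_i / k for every class
        winners = np.flatnonzero(post == post.max())
        choice = classes[rng.choice(winners)]    # random choice among tied classes
        errors += (choice != y[i])
    return errors / n

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
y = np.repeat([0, 1], n // 2)                    # equiprobable classes, uninformative x
for k in range(1, 7):
    print(k, round(remedied_loo_knn_error(x, y, k, rng), 3))
# With the deletion-adjusted priors the posteriors reduce to k_i / k for every class,
# so even values of k no longer show the optimistic bias; all estimates sit near 0.5.
```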

8 Conclusion

This is an interesting example of a problem occurring in statistical software which is caused, not by a computing error, but by a mathematical one. The misbehaviour of the k-nn algorithm under cross-validation is entirely deducible from the mathematics given in the SAS manual (SAS Institute, 1989). In this respect, the nature of the problem is similar to the one reported by White & Liu (1993), in which a stepwise discriminant algorithm is improperly cross-validated. In both cases, the respective problems arose because of lack of consideration of the effects of combining techniques. In the case of SAS, the k-nn algorithm works well enough when considered in isolation and so does the cross-validation technique. The difficulty arises when the two techniques are used in combination. Apart from SAS, nonparametric discriminant techniques are not available in the commonly used statistical software with which the author is familiar. Hence, problems encountered by combining these two techniques do not seem to have been encountered elsewhere.

The solution offered in this paper keeps as close as possible to the original philosophies of both cross-validation and k-nn discriminant analysis. It involves estimating the prior probabilities from the data after the case deletion which forms part of the cross-validation procedure. The only possibly contentious aspect is the proposed use of random choice between predicted classes in the case of ties in posterior probabilities. One feature of this approach is that the procedure is non-repeatable. However, there are precedents for this type of procedure. Tocher (1950) proposed a modification to Fisher's exact probability test which utilised random choice in order to achieve specified α values for significance testing purposes. Also, the use of approximate randomization techniques for conducting significance tests has been described by Edgington (1980), Still & White (1981) and White & Still (1984, 1987). The proposal here is to make use of random choice to achieve an unbiased estimate of classification performance when kernel ties are encountered.

Acknowledgements

The author would like to thank Prof. J.B. Copas (from the Department of Statistics at the University of Warwick) and Prof. A.J. Lawrance and Dr. P. Davies (both from the School of Mathematics and Statistics at the University of Birmingham) for their helpful comments.

References

Cover, T.M. and Hart, P.E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13.

Edgington, E.S. (1980). Randomization Tests. New York: Marcel Dekker.

Hand, D.J. (1981). Discrimination and Classification. New York: John Wiley & Sons Ltd.

SAS Institute Inc. (1989). SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 1. Cary, NC, USA: SAS Institute Inc.

Still, A.W. and White, A.P. (1981). The approximate randomization test as an alternative to the F test in analysis of variance. British Journal of Mathematical and Statistical Psychology, 34.

Tocher, K.D. (1950). Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika, 37.

White, A.P. and Liu, W.Z. (1993). The jackknife with a stepwise discriminant algorithm - a warning to BMDP users. Journal of Applied Statistics, 20 (1).

White, A.P. and Still, A.W. (1984). Monte Carlo analysis of variance. In Proceedings of the Sixth Symposium in Computational Statistics (Prague). Vienna: Physica-Verlag.

White, A.P. and Still, A.W. (1987). Monte Carlo randomization tests: A reply to Bradbury. British Journal of Mathematical and Statistical Psychology, 40.
