Weighted K-Nearest Neighbor Revisited

M. Bicego
University of Verona, Verona, Italy
Email: manuele.bicego@univr.it

M. Loog
Delft University of Technology, Delft, The Netherlands
Email: m.loog@tudelft.nl

Abstract: In this paper we show that weighted k-Nearest Neighbor (WkNN), a variation of the classic k-Nearest Neighbor (kNN), can be reinterpreted from a classifier combining perspective, specifically as a fixed combiner rule, the sum rule. Subsequently, we experimentally demonstrate that it can be rather beneficial to consider other combining schemes as well. In particular, we focus on trained combiners and illustrate the positive effect these can have on classification performance.

I. INTRODUCTION

The k-nearest neighbor (kNN) rule is a widely used and easy to implement classification rule. It assigns a point x to the class most present among the k points in the training set nearest to x [1], [2], [3], [4]. Deciding which points are nearest is done according to some prespecified distance. In this procedure, all k points within the neighborhood contribute equally to the final decision for x. It seems obvious, therefore, to allow for weighted voting (of some sort) in order to improve performance. Royall was probably the first to seriously consider this option [5]: he demonstrated that improvements are indeed possible in the regression setting under squared loss. In the classification setting, Dudani [6] was the first to introduce a specific distance-weighted rule and provided empirical evidence of its admissibility. He discussed some alternatives to define the weights, all with weights dropping in terms of the distance to x, with a weight of 1 for the first nearest neighbor and a weight of 0 for the k-th. Given the weights, each neighbor of x contributes to the final decision with its own weight: in particular, the Weighted k-Nearest Neighbor (WkNN) rule assigns x to that class for which the weights of the representatives among the k nearest neighbors sum to the greatest value [6] (see Fig. 1).

Fig. 1: Example of (a) k-Nearest Neighbor and (b) Weighted k-Nearest Neighbor (k = 3). With kNN, every neighbor counts in the same way towards the final decision: in the case shown in the figure, the cross is assigned to the circle class, the most frequent class in the neighborhood. On the contrary, with Weighted kNN every neighbor has an associated weight; in the final decision, each neighbor counts with its own weight: in the example, since the sum of the weights of the neighbors from the square class is larger than that of the neighbors of the circle class, the cross is assigned to the square class.

The weighting scheme introduced by Dudani [6], even when weights are cleverly chosen, is not necessarily helpful, as demonstrated for instance in [7]. That paper showed that, asymptotically, unweighted kNN is to be preferred over any weighted version in case we fix k. However, when dealing with the realistic setting of finite samples, improvements are possible (see [8] for instance). Clearly, whether weighting can help also depends on what we consider as improvement [5], [8], [9]. Though weighted kNN rules are used in various applications, few conceptual, theoretical, or methodological advances have been made in the past decades. Two recent additions to this literature are [10] and [11]. In [10], a so-called dual distance function is considered, which turns out to be less sensitive to the choice of k and supposedly avoids degradation of the classification accuracy in the small sample case and when dealing with outliers. In [11], the authors derive an asymptotically optimal way of defining nonnegative weights to be used within the WkNN scheme.
In this work, we reinterpret the Weighted kNN (and the kNN) from a classifier combining perspective [12]: we show that kNN can be seen as a plain majority voting scheme and, more generally, the weighted kNN as a fixed combiner rule (the sum rule). This view opens the door to the use of other classifier combiners, and we show that it can indeed be quite beneficial to consider alternative and more advanced schemes. In particular, here we focus on trained combining schemes [13], [14], for which our experiments demonstrate potentially significant improvements in classification performance over the original weighting scheme by Dudani [6].

A. Outline

Section II introduces the necessary background on kNN, its weighted variant, and classifier combiners, while fixing notation. Section III offers our interpretation of kNN as a combining scheme and sketches how various combiners could be integrated using the terminology of matching scores. The next section, Section IV, describes the experiments that were carried out with our revisited kNN using a trained combiner. It also reports on the results and discusses them. Finally, Section V concludes.

II. PRELIMINARIES AND ADDITIONAL BACKGROUND

In this section we introduce the necessary background on kNN, WkNN, and the theory of classifier combiners, while fixing notation.

A. k-Nearest Neighbor

Let us start with some definitions:
- x: the pattern to be classified;
- {x_i} (with 1 <= i <= N): the set of N points in the training set; each training pattern is equipped with a label y_i (with 1 <= i <= N). The label y_i can take one of the values 1, ..., C, where C is the number of classes of the problem at hand;
- ne_k(x) = {n_1, ..., n_k}: the k points in the training set which are nearest to x according to a certain distance d(.,.); y_{n_1}, ..., y_{n_k} are the corresponding labels. Please note that we consider {n_1, ..., n_k} as ordered according to the distance from x: n_1 is the nearest neighbor, n_k is the farthest of the k nearest neighbors.

Given these definitions, the standard kNN rule assigns x to the class ĉ most frequent in the set ne_k(x), i.e.

    x \to \arg\max_c \left| \{ n_i : y_{n_i} = c \} \right|    (1)

where |X| denotes the cardinality of the set X. Rule (1) can be rewritten as

    x \to \arg\max_c \sum_{i=1}^{k} I_c(n_i)    (2)

where I_c(z) is the indicator function for class c:

    I_c(z) = \begin{cases} 1 & \text{if } z \text{ belongs to class } c \\ 0 & \text{otherwise} \end{cases}    (3)

The summation in (2), for a given c, simply counts the number of points in the neighborhood ne_k(x) belonging to class c.

B. Weighted k-Nearest Neighbor

Within the Weighted k-Nearest Neighbor rule [6], each neighbor n_i in ne_k(x) is equipped with a weight w_{n_i}, which can be computed using, for example, the methods presented in [6], [11]. Note that in the general setting we may have a different set of weights for every point to be classified: when changing the point x to be classified, the neighborhood ne_k(x) also changes and therefore so do the corresponding weights, as they typically depend directly on the relation between the neighbors and the point x. This is clear, for instance, when considering the definition of weights introduced in Equation (2) of [6]:

    w_{n_i} = \frac{1}{d(x, n_i)}    (4)

With this definition, the weight of a given training example changes with the point x to be classified, since it depends on the distance from such x: the more distant the neighbor, the lower its weight/importance in the classification of x. This definition of the weights takes inspiration from ideas typical of the Parzen Windows estimator [15].

Given neighbors and weights, the Weighted k-Nearest Neighbor rule assigns x to the class ĉ for which the weights of its representatives in the neighborhood ne_k(x) sum to the greatest value. Following the notation of Equation (2),

    x \to \arg\max_c \sum_{i=1}^{k} I_c(n_i) w_{n_i}    (5)

Clearly, the kNN and the Weighted kNN rules are equivalent when k = 1.
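To make the two decision rules concrete, here is a minimal NumPy sketch of the plain rule of Equation (2) and of the weighted rule of Equation (5) with the inverse-distance weights of Equation (4). It is an illustrative sketch under our own naming; the paper itself does not provide code.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k, weighted=False):
    """(W)kNN decision for a single pattern x.

    X_train: (N, d) training points; y_train: (N,) integer labels in {0, ..., C-1}.
    weighted=False implements Eq. (2); weighted=True implements Eq. (5)
    with the inverse-distance weights of Eq. (4).
    """
    dists = np.linalg.norm(X_train - x, axis=1)          # d(x, x_i) for all training points
    nn = np.argsort(dists)[:k]                           # indices of the k nearest neighbors
    if weighted:
        w = 1.0 / np.maximum(dists[nn], 1e-12)           # w_{n_i} = 1 / d(x, n_i), Eq. (4)
    else:
        w = np.ones(k)                                   # every neighbor counts equally
    n_classes = int(y_train.max()) + 1
    class_scores = np.bincount(y_train[nn], weights=w, minlength=n_classes)
    return int(np.argmax(class_scores))                  # argmax over classes, Eq. (2)/(5)

# Toy usage on synthetic 2D data
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0).astype(int)
print(knn_predict(np.array([0.3, -0.1]), X, y, k=5, weighted=True))
```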
C. Classifier Combining

The main idea behind Classifier Combining theory [16], [12] is that it is possible to improve classification accuracy by exploiting the diversity present in different pattern recognition systems. Such diversity can derive from the employment of different sensors, different features, different training sets, different classifiers, and so on [12]. In particular, here we focus on the following scenario: we have a set of M different classifiers (experts) E_1, ..., E_M. Given a classification problem involving C classes and a pattern x to be classified, every classifier E_l returns a set of values E_l(x):

    E_l(x) = [ e_{l1}(x), e_{l2}(x), ..., e_{lC}(x) ]

where e_{lc}(x) can be a posterior of class c, i.e. e_{lc}(x) = P(c|x), or simply a matching score, i.e. a number indicating how likely it is that the class of x is c (called confidences in [12]).

A given classifier (expert) E_l takes a decision on x with the following rule:

    x \to \arg\max_c e_{lc}(x)    (6)

Given a pool of M classifiers, the goal is to combine the values collected in the matrix

    E(x) = \begin{pmatrix} e_{11}(x) & e_{12}(x) & \cdots & e_{1C}(x) \\ e_{21}(x) & e_{22}(x) & \cdots & e_{2C}(x) \\ \vdots & \vdots & \ddots & \vdots \\ e_{M1}(x) & e_{M2}(x) & \cdots & e_{MC}(x) \end{pmatrix}

to reach a classification that is potentially better than those of the single classifiers. Many methods have been proposed in the past to address this problem ([16], [12], [17], [18], just to cite a few), based on different ideas, intuitions, or hypotheses. Here we summarize the following three classes of approaches, which will become useful in the remainder of this work.

1) Combination of decisions: In this case, each expert E_l takes its own decision; the final classification is then obtained by combining such decisions. One relevant example is the majority voting rule, where the final decision is taken by looking at the class which received the majority of votes. More formally,

    x \to \arg\max_c \sum_{l=1}^{M} \delta_{lc}(x)    (7)

where

    \delta_{lc}(x) = \begin{cases} 1 & \text{if } e_{lc}(x) = \max_j e_{lj}(x) \\ 0 & \text{otherwise} \end{cases}    (8)

In other words, \delta_{lc}(x) is 1 only if the classifier E_l assigns x to class c.

2) Fixed combination of matching scores: In this case, for a given class c, the matching scores e_{lc}(x) of the different classifiers (with 1 <= l <= M) are combined together, in order to return a unique matching score for the considered class. The combination of these scores follows fixed rules, such as their sum or product, the max or the min among them, a linear combination of them, and similar [12]. The final decision is then taken by looking at these aggregated matching scores. For example, with the Sum Rule, a pattern x is classified with the following rule:

    x \to \arg\max_c \sum_{l=1}^{M} e_{lc}(x)    (9)

whereas with the Prod Rule we have

    x \to \arg\max_c \prod_{l=1}^{M} e_{lc}(x)    (10)

3) Trained Combiners: This represents a more advanced scheme [13], [14], in which the idea is to directly use the scores collected in the matrix E(x) as new features for the pattern x: in this way a classifier is learned on the outputs of other classifiers, following what is sometimes referred to as stacked combination [19]. In more detail, a pattern x is described with vec(E(x)), where the so-called vec(.) operator (vectorization) takes a matrix argument and returns a vector with the matrix elements stacked column by column. In the training phase, the vectorized E(x_i) matrix is computed for all objects x_i of the training set, resulting in a novel training set, which is used to train a classifier f. In the testing phase, the testing object x is first encoded with vec(E(x)) and then classified using the classifier f.

III. THE WEIGHTED kNN RULE REVISITED

In this section we propose an interpretation of the WkNN rule (and of the kNN rule) from a combining classifier perspective. The main idea behind our interpretation is the following: in the (W)kNN the final decision on x is obtained by combining information provided by the k nearest neighbors ne_k(x) = {n_1, ..., n_k} of x. Therefore it seems reasonable to consider these k points as different experts/classifiers, which provide information to be combined to reach the final decision.

Let us clarify our view by first considering the kNN: we will show how to build the E(x) matrix, and which combination rule should be used to obtain exactly the kNN rule. As said before, we have k experts/classifiers, which we indicate as E_{n_1}, ..., E_{n_k}, each one related to one specific neighbor n_l. In the kNN case, the elements of the matrix E^{kNN}(x) are defined, for l in {1, ..., k} and c in {1, ..., C}, as

    e_{n_l c}(x) = \begin{cases} a & \text{if } y_{n_l} = c \\ 0 & \text{otherwise} \end{cases}    (11)

with a a fixed positive number(1) (it can also be 1). For example, if k = 3, C = 4, y_{n_1} = 1, y_{n_2} = 1, and y_{n_3} = 2, the matrix E^{kNN}(x) is

    E^{kNN}(x) = \begin{pmatrix} a & 0 & 0 & 0 \\ a & 0 & 0 & 0 \\ 0 & a & 0 & 0 \end{pmatrix}

Given this formulation, if we apply the majority voting rule defined in Equation (7) we have to perform two steps: i) take a decision for each classifier (each row), which is done by taking the maximum over the row; ii) assign x to the class which received the majority of votes. In this way we obtain exactly the k-nearest neighbor classifier: given the definition in Equation (11), every expert (neighbor) votes for the class corresponding to its label, and the final class is decided by looking at the most voted class, which is exactly the most frequent class in the neighborhood(2).

(1) Please note that e_{n_l c}(x) can be defined in a more compact way using the indicator function I_c(z) used in Equation (2). However, for clarity, here we presented this more verbose formulation.
(2) Please note that in this case we also obtain the kNN rule by applying the sum rule (since a is a constant).
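As a small numerical illustration of this construction (our own helper names, continuing the assumptions of the earlier sketch), the following builds the k x C matrix E^{kNN}(x) of Equation (11) from the neighbor labels and applies the majority voting rule of Equations (7)-(8), reproducing the plain kNN decision.

```python
import numpy as np

def knn_score_matrix(neighbor_labels, n_classes, a=1.0):
    """Eq. (11): row l holds the value a in column y_{n_l} and 0 elsewhere."""
    E = np.zeros((len(neighbor_labels), n_classes))
    E[np.arange(len(neighbor_labels)), neighbor_labels] = a
    return E

def majority_vote(E):
    """Eqs. (7)-(8): each expert (row) votes for its maximal column; the most voted class wins."""
    votes = np.argmax(E, axis=1)
    return int(np.argmax(np.bincount(votes, minlength=E.shape[1])))

# Example from the text with 0-based labels: k = 3, C = 4, neighbor labels 1, 1, 2 -> 0, 0, 1
E_knn = knn_score_matrix(np.array([0, 0, 1]), n_classes=4)
print(E_knn)
print(majority_vote(E_knn))   # -> 0, the most frequent class in the neighborhood
```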
In the case of Weighted kNN, we define the elements of the matrix E^{WkNN}(x), for l in {1, ..., k} and c in {1, ..., C}, as

    e_{n_l c}(x) = \begin{cases} w_{n_l} & \text{if } y_{n_l} = c \\ 0 & \text{otherwise} \end{cases}    (12)

For example, for the problem introduced before (k = 3, C = 4, y_{n_1} = 1, y_{n_2} = 1, and y_{n_3} = 2),

    E^{WkNN}(x) = \begin{pmatrix} w_{n_1} & 0 & 0 & 0 \\ w_{n_2} & 0 & 0 & 0 \\ 0 & w_{n_3} & 0 & 0 \end{pmatrix}

Given this definition, if we apply the Sum Rule described in Equation (9) to E^{WkNN}(x), we have to perform two steps: i) aggregate the scores for every class, which is done by summing the values contained in each column; ii) assign x to the class for which this aggregated score is maximum. It is straightforward to note that this is exactly the decision rule proposed by the Weighted kNN rule described in Equation (5).
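Continuing the same toy example (again with helper names of our own choosing), the sketch below builds E^{WkNN}(x) as in Equation (12) and applies the sum rule of Equation (9); by construction the result coincides with the Weighted kNN decision of Equation (5).

```python
import numpy as np

def wknn_score_matrix(neighbor_labels, weights, n_classes):
    """Eq. (12): row l holds the weight w_{n_l} in column y_{n_l} and 0 elsewhere."""
    E = np.zeros((len(neighbor_labels), n_classes))
    E[np.arange(len(neighbor_labels)), neighbor_labels] = weights
    return E

def sum_rule(E):
    """Eq. (9): aggregate the scores per class (column sums) and pick the maximum."""
    return int(np.argmax(E.sum(axis=0)))

# k = 3, C = 4, neighbor labels 1, 1, 2 (0-based: 0, 0, 1) with weights w_{n_1}, w_{n_2}, w_{n_3}
E_w = wknn_score_matrix(np.array([0, 0, 1]), np.array([1.0, 0.6, 0.9]), n_classes=4)
print(sum_rule(E_w))   # -> 0: the summed weight 1.6 of the first class beats 0.9
```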

A. Normalization of E(x)

In many combination rules, before applying a combination scheme, the matching scores (confidences) E_l(x) should be normalized, in order to get values which are comparable among the different classifiers (see [12], chap. 5); this is especially true when using trained combiners [13], such as those described in the previous section. In the following we provide some intuitions on what happens when using a common and established normalization scheme, the so-called Soft-Max normalization [15]. After this normalization the matching scores are in the range [0, 1]; moreover, for every classifier, they sum to 1, so that they can be interpreted, with some abuse of interpretation, as posterior probabilities; this is especially useful when trying to derive theoretical properties as in [16]. When applied to our case, each e_{lc}(x) of E(x) is transformed into \hat{e}_{lc}(x) via the following formula:

    \hat{e}_{lc}(x) = \frac{e^{e_{lc}(x)}}{\sum_{j=1}^{C} e^{e_{lj}(x)}}    (13)

With this normalization, E^{kNN}(x) is transformed into \hat{E}^{kNN}(x), where

    \hat{e}_{n_l c}(x) = \begin{cases} e^{a}/R & \text{if } y_{n_l} = c \\ 1/R & \text{otherwise} \end{cases}    (14)

where

    R = (C - 1) + e^{a}    (15)

is the normalization factor present in the denominator of Equation (13). It is straightforward to observe that, given this normalized \hat{E}^{kNN}(x), the kNN rule is still obtained by applying the majority voting rule to \hat{E}^{kNN}(x). On the contrary, after this normalization, the Weighted kNN rule becomes equivalent to another fixed rule, namely the prod rule. Indeed, \hat{E}^{WkNN}(x) is defined as

    \hat{e}_{n_l c}(x) = \begin{cases} e^{w_{n_l}}/R_l & \text{if } y_{n_l} = c \\ 1/R_l & \text{otherwise} \end{cases}    (16)

where R_l is again the normalization factor of Equation (13), which in this case is different for different neighbors n_l, and is defined as

    R_l = (C - 1) + e^{w_{n_l}}    (17)

If we consider the prod rule in Equation (10) applied to \hat{E}^{WkNN}(x), we have

    x \to \arg\max_c \prod_{l=1}^{k} \hat{e}_{n_l c}(x)    (18)

Taking the log does not affect the argument of the max, therefore an equivalent rule is

    x \to \arg\max_c \sum_{l=1}^{k} \log \hat{e}_{n_l c}(x)    (19)

which becomes

    x \to \arg\max_c \sum_{l=1}^{k} \log \hat{e}_{n_l c}(x)    (20)
      = \arg\max_c \sum_{l=1}^{k} \left[ e_{n_l c}(x) - \log R_l \right]    (21)
      = \arg\max_c \left[ \sum_{l=1}^{k} e_{n_l c}(x) - \sum_{l=1}^{k} \log R_l \right]    (22)
      = \arg\max_c \sum_{l=1}^{k} e_{n_l c}(x)    (23)

where in the last step we dropped the term \sum_l \log R_l because it is equal among all classes. The resulting rule is equivalent to the Weighted kNN rule of Equation (5).

Summarizing, here we provided a revisitation of the kNN and the Weighted kNN rules from the Classifier Combining perspective: this opens the door to the possibility of using different (even complex) combination strategies. We will provide some evidence, in the experimental section, that using a trained combiner makes it possible to improve the performance of both the kNN and the WkNN rules.
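The equivalence derived in Equations (20)-(23) can be checked numerically. The sketch below (our own helper names) applies the row-wise Soft-Max of Equation (13) to randomly generated E^{WkNN}-style matrices and verifies that the prod rule on the normalized scores selects the same class as the sum rule on the raw weights, i.e. the WkNN decision.

```python
import numpy as np

def softmax_rows(E):
    """Eq. (13): row-wise Soft-Max, so the scores of each expert sum to 1."""
    Z = np.exp(E)
    return Z / Z.sum(axis=1, keepdims=True)

def sum_rule(E):
    """Eq. (9)."""
    return int(np.argmax(E.sum(axis=0)))

def prod_rule(E):
    """Eq. (10), computed through logs as in Eqs. (18)-(19)."""
    return int(np.argmax(np.log(E).sum(axis=0)))

rng = np.random.default_rng(1)
k, C, trials = 5, 4, 1000
agree = 0
for _ in range(trials):
    labels = rng.integers(0, C, size=k)
    E_w = np.zeros((k, C))
    E_w[np.arange(k), labels] = rng.uniform(0.0, 1.0, size=k)   # one weight per neighbor, Eq. (12)
    agree += prod_rule(softmax_rows(E_w)) == sum_rule(E_w)      # Eqs. (20)-(23)
print(agree, "of", trials, "random cases agree")
```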
IV. EXPERIMENTAL RESULTS

In this section, we provide some empirical evidence that the perspective introduced in this paper makes it possible to exploit advanced combination techniques, such as those represented by trained combiners [13], [14]. In particular, in our empirical evaluation we compare three techniques:

1) kNN: this is the classic k-Nearest Neighbor rule. As we have shown in Section III, this corresponds to the majority vote rule applied to the E^{kNN} matrix defined in Equation (11) as well as to the \hat{E}^{WkNN} defined in Equation (16).

2) WkNN: this is the original Weighted k-Nearest Neighbor rule described in [6] and presented in Section II-B. This corresponds to the sum rule applied to the E^{WkNN} matrix defined in Equation (12) or to the prod rule applied to the \hat{E}^{WkNN} defined in Equation (16).

3) WkNN (TrainedComb): in this case we applied a trained combiner scheme: as explained in Section II-C, with this scheme every pattern is described with the vectorization of its corresponding matrix of scores, which is used as a feature vector to represent it. In other words, all the objects of the problem are mapped into a novel feature space, where another classifier is used. Here we adopt the decision template scheme proposed in [12], which represents one of the first and most basic trained combiners. In more detail, for every pattern x_i of the training set we compute the matrix \hat{E}^{WkNN}(x_i), as defined in Equation (16); we used the normalized scores, as suggested in [13]. For every training point x_i, the corresponding neighborhood ne_k(x_i) is determined without considering x_i itself (this can partially prevent the overtraining situation which may occur with trained combiners; for a discussion of these aspects see [13]). Given this novel feature space, the Nearest Mean Classifier [15] is used as the classifier. In particular, for every class c, we compute the mean of the vectorized scores of the x_i belonging to class c: this averaged vectorized score then represents the template t_c of such class:

    t_c = \underset{x_i \; \text{s.t.} \; y_i = c}{\mathrm{mean}} \; \mathrm{vec}(\hat{E}(x_i))    (24)

Finally, the testing object x is classified by looking at the similarity between its vec(\hat{E}(x)) and the templates t_c of the different classes, assigning it to the nearest template:

    x \to \arg\min_c \; d(\mathrm{vec}(\hat{E}(x)), t_c)    (25)

where d(.,.) is a distance between vectors. For more details, interested readers can refer to Subsection 5.3.1 of [12].

The three techniques have been tested using 6 different classic datasets (from the UCI-ML repository), whose characteristics are summarized in Table I. All datasets have been normalized so that every feature has zero mean and unit variance. As the distance to compute neighbors we used the classical Euclidean distance.

TABLE I: Description of the datasets

    Dataset      Objects   Classes   Features
    Sonar        208       2         60
    Soybean2     136       4         35
    Ionosphere   351       2         34
    Wine         178       3         13
    Breast       699       2         9
    Bananas      100       2         2

Weighted kNN weights are computed using Equation (1) of Dudani's paper [6]:

    w_{n_i} = \begin{cases} \dfrac{d(x, n_k) - d(x, n_i)}{d(x, n_k) - d(x, n_1)} & \text{if } d(x, n_k) \neq d(x, n_1) \\ 1 & \text{otherwise} \end{cases}    (26)

In this way the weights are normalized between 0 and 1 (1 being the weight of the nearest neighbor, 0 the weight of the farthest neighbor). We let k vary from 5 to 45 (step 2). Classification errors have been computed using the Averaged Holdout Cross Validation protocol: the dataset is randomly split into two parts, one used for training and the other used for testing; the procedure has been repeated 30 times. Results are shown in Figure 2.

From the plots it can be observed that the Trained Combiner rule improves the accuracy of both kNN and WkNN. This is more evident when the problem lives in a high dimensional space; for moderately dimensional spaces we cannot observe such a drastic improvement. One interesting observation derives from looking at the behavior for large k. Apparently, the Trained Combiner scheme does not suffer too much from a bad choice of k; this may be due to the fact that adding neighbors to the analysis simply corresponds to a different normalization of the feature space induced by vec(\hat{E}(x)). In more detail, adding neighbors changes d(x, n_k), which results in a shift (the numerator) and in a rescaling (the denominator) of the weight defined in Equation (26). Since we consider such weights as features in the novel space, adding neighbors simply results in a different scaling, which seems not to affect the final classification too much.
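For concreteness, here is a compact sketch of the full trained-combiner pipeline as we read it from the text: Dudani weights as in Equation (26), the row-wise Soft-Max normalization of Equation (16), the vectorized score matrix as feature vector, class templates computed as in Equation (24), and the nearest-template decision of Equation (25). All names, and the exact way a training point is excluded from its own neighborhood, are our assumptions, not the authors' code.

```python
import numpy as np

def dudani_weights(d):
    """Eq. (26): d is the sorted (ascending) vector of distances to the k nearest neighbors."""
    if d[-1] == d[0]:
        return np.ones_like(d)
    return (d[-1] - d) / (d[-1] - d[0])

def normalized_score_matrix(labels, weights, n_classes):
    """Eq. (12) followed by the row-wise Soft-Max of Eq. (13), i.e. Eq. (16)."""
    E = np.zeros((len(labels), n_classes))
    E[np.arange(len(labels)), labels] = weights
    Z = np.exp(E)
    return Z / Z.sum(axis=1, keepdims=True)

def features(x, X, y, k, n_classes, exclude=None):
    """vec of the normalized score matrix of x; 'exclude' drops x itself from its neighborhood."""
    d = np.linalg.norm(X - x, axis=1)
    if exclude is not None:
        d[exclude] = np.inf
    nn = np.argsort(d)[:k]
    E_hat = normalized_score_matrix(y[nn], dudani_weights(d[nn]), n_classes)
    return E_hat.ravel(order="F")          # column-by-column stacking, as vec(.) in the text

def fit_templates(X, y, k, n_classes):
    """Eq. (24): per-class mean of the vectorized normalized score matrices (Nearest Mean Classifier)."""
    F = np.stack([features(X[i], X, y, k, n_classes, exclude=i) for i in range(len(X))])
    return np.stack([F[y == c].mean(axis=0) for c in range(n_classes)])

def predict(x, X, y, k, n_classes, templates):
    """Eq. (25): assign x to the class of the nearest template (Euclidean distance)."""
    f = features(x, X, y, k, n_classes)
    return int(np.argmin(np.linalg.norm(templates - f, axis=1)))

# Toy usage on synthetic data
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
T = fit_templates(X, y, k=7, n_classes=2)
print(predict(rng.normal(size=5), X, y, k=7, n_classes=2, templates=T))
```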
V. CONCLUSIONS

In this paper we revisited the Weighted k-Nearest Neighbor (and the k-Nearest Neighbor) scheme under a classifier combining perspective. Assuming this view, WkNN implements a fixed combiner rule, whereas kNN implements a majority voting rule. Then we provided some evidence that classification improvements are possible when using other classifier combining techniques, such as trained combiners.

ACKNOWLEDGEMENTS

This work was partially supported by the University of Verona through the CooperInt Program 2014 Edition. The authors are extremely grateful for all the guidance and inspiration that O. Ai Preti offered.

REFERENCES

[1] E. Fix and J. L. Hodges Jr, "Discriminatory analysis - nonparametric discrimination: consistency properties," DTIC Document, Tech. Rep., 1951.
[2] E. Fix and J. L. Hodges Jr, "Discriminatory analysis - nonparametric discrimination: Small sample performance," DTIC Document, Tech. Rep., 1952.
[3] T. Cover and P. Hart, "The nearest neighbor decision rule," IEEE Trans. Inform. Theory, vol. IT-13, pp. 21-27, 1967.
[4] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Science & Business Media, 2013, vol. 31.
[5] R. M. Royall, "A class of non-parametric estimates of a smooth regression function," Ph.D. dissertation, Dept. of Statistics, Stanford University, 1966.
[6] S. Dudani, "The distance-weighted k-nearest-neighbor rule," IEEE Trans. on Systems, Man, and Cybernetics, vol. SMC-6, no. 4, pp. 325-327, 1976.
[7] T. Bailey and A. Jain, "A note on distance-weighted k-nearest neighbor rules," IEEE Transactions on Systems, Man, and Cybernetics, no. 4, pp. 311-313, 1978.
[8] J. E. MacLeod, A. Luk, and D. M. Titterington, "A re-examination of the distance-weighted k-nearest neighbor classification rule," IEEE Transactions on Systems, Man and Cybernetics, vol. 17, no. 4, pp. 689-696, 1987.
[9] J. F. Banzhaf III, "Weighted voting doesn't work: A mathematical analysis," Rutgers L. Rev., vol. 19, p. 317, 1964.
[10] J. Gou, L. Du, Y. Zhang, and T. Xiong, "A new distance-weighted k-nearest neighbor classifier," Journal of Information & Computational Science, vol. 9, no. 6, pp. 1429-1436, 2012.
[11] R. Samworth, "Optimal weighted nearest neighbor classifiers," The Annals of Statistics, vol. 40, no. 5, pp. 2733-2763, 2012.
[12] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Wiley, 2004.
[13] R. Duin, "The combining classifier: To train or not to train?" in Proc. Int. Conf. on Pattern Recognition, 2002, pp. 765-770.
[14] L. Kuncheva, J. Bezdek, and R. Duin, "Decision templates for multiple classifier fusion: an experimental comparison," Pattern Recognition, vol. 34, no. 2, pp. 299-314, 2001.
[15] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
[16] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226-239, 1998.
[17] A. Ross, K. Nandakumar, and A. Jain, Handbook of Multibiometrics. Springer, 2006.
[18] G. Fumera and F. Roli, "A theoretical and experimental analysis of linear combiners for multiple classifier systems," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 942-956, 2005.
[19] D. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241-259, 1992.

Fig. 2: Cross validation errors of the tested techniques (kNN, WkNN, WkNN (TrainedComb)) for the different datasets: (a) Sonar; (b) Soybean2; (c) Ionosphere; (d) Wine; (e) Breast; (f) Bananas.