Comparison of Relevance Learning Vector Quantization with other Metric Adaptive Classification Methods

Th. Villmann (1), F. Schleif (2), and B. Hammer (3)

(1) University Leipzig, Clinic for Psychotherapy
(2) Bruker Daltonik GmbH Leipzig and University Leipzig, Dept. of Computer Science
(3) Clausthal University of Technology, Department of Mathematics and Computer Science

February 9, 2005

Corresponding author: University Leipzig, Clinic for Psychotherapy, Karl-Tauchnitz-Str. 25, Leipzig, Germany, villmann@informatik.uni-leipzig.de

Abstract

The paper deals with the concept of relevance learning in learning vector quantization and classification. Recent machine learning approaches with the ability of metric adaptation, but based on different concepts, are considered in comparison to variants of relevance learning vector quantization. We compare these methods with respect to their theoretical motivation and demonstrate the differences in their behavior for several real world data sets.

Keywords: learning vector quantization, relevance learning, metric adaptation, classification

1 Introduction

Data in interesting domains such as language processing, logic, chemistry, and bioinformatics often possess an inherent structure. Typical difficulties arise for machine learning within these domains: a very high or varying data dimensionality, correlations of the data elements, or a sparsely covered data space, to name just a few. Because of this, standard vector processing by means of Euclidean vectors faces severe problems in these cases, and several approaches have been developed to deal with these data structures in a more adequate way. A very successful and interesting possibility for recursive data structures is given by the dynamics of recurrent and recursive neural networks [1, 2]. Recently, this idea has been extended to more general graph structures in several ways [3, 4, 5, 6]; however, the data structures which can be tackled in this way are still restricted. Another very general alternative for dealing with structured data is offered by similarity based machine learning approaches. Here, fairly general data structures can be handled as soon as a similarity measure for these structures has been defined or the data are embedded into a metric space [7]. A popular application of this idea can be found in connection with support vector machines (SVM) and other kernel methods, where a variety of different kernels such as string kernels, graph kernels, or kernels derived from a probabilistic model have been defined [8, 9, 10]. The use of specific kernels is not restricted to the SVM but can readily be transferred to general metric based approaches such as the median self-organizing map or nearest neighbor classification [11]. Naturally, the similarity measure plays a crucial role in these approaches, and an appropriate choice of the distance might face severe difficulties.

The design of general similarity measures which can be used for any learning task, i.e. which guarantee a universal approximation ability and distinguishability of arbitrary structures, is one possible line of research; however, the resulting representation of data is often too complex for the concrete task and, moreover, a universal design might not be efficient for complex data structures due to principled problems [12]. Therefore, similarity measures which are constructed for the concrete problem based on the given data are particularly interesting, since they offer an automated design of problem specific representations. A prime example of this idea is the Fisher kernel, which derives a similarity measure based on a statistical model of the data [9]. Still, the resulting kernel is fairly general since it mirrors general statistical properties of the given data set. In the case of supervised learning tasks such as classification, only those properties are relevant which are related to the class labels, whereas statistical information which is independent of the class distribution can be abandoned. The focus of this article is a presentation and comparison of approaches which adapt a similarity measure based on given class information for supervised learning tasks.

Pattern classification plays an important role in data processing. It is used for discrimination of patterns according to certain criteria, which may be of statistical type, structural differences, feature discrimination, etc. Thereby the representation of the objects significantly influences the ability for discrimination. An improper representation of the objects may lead to vanishing differences, whereas a suitable representation offers a clear separation of the object classes. In this sense, classical statistical discriminant analysis techniques like Fisher discriminant analysis project data onto a one-dimensional representation which should deliver the best separation of classes. Obviously, the optimal representation depends on the classification task. It is closely related to the definition of the similarity or metric between the objects used for classification. Hence, the similarity should be chosen adequately for the given classification task. An appropriate choice of the metric can substitute adaptations of the representation and vice versa. Although quite common, the standard Euclidean metric may therefore not be the best choice.

One family of intuitive metric based classification algorithms is learning vector quantization (LVQ). This family comprises prototype based algorithms which try to represent the objects by typical representatives (prototypes, weight vectors). The discriminant property is realized by a labeling of the prototypes such that they adapt specifically according to the given classes. As described in more detail later, non-standard metrics as well as metric adaptation can easily be included in advanced modifications of LVQ. A further famous approach for classification are support vector machines (SVMs). Here, the key idea is to map the data into a possibly high-dimensional representation space which allows a linear separation of the classes. The choice of the mapping (kernel mapping) is crucial, and the incorporation of metric adaptation is thus quite interesting.

In the present paper we compare several machine learning approaches for classification in the light of metric adaptation and the usage of non-standard metrics according to the given classification task. In particular, we compare the recent developments of relevance learning in LVQ, distance metric learning in SVM-like approaches, and relevance learning in information theoretic LVQ. These approaches represent different paradigms: prototype based classification, kernel regression classification, kernel mapping based classification, and mutual information maximization based classification, respectively. We demonstrate the consequences in several real life experiments. Since the approaches have been designed for standard vectorial data, we also present results for vectorial data sets to allow a fair comparison of the methods. However, it turns out that even for this comparably simple vector representation of data a problem adapted metric, which integrates some structure into the problem in the form of relevance information, is superior to a simple Euclidean metric. To demonstrate the principled applicability of the methods to more complex data, we also include an example from bioinformatics with more complex input signals, the classification of spectra. Since these data are obtained as the function values of a spectrum at different wave lengths, typical characteristics of structured data can be observed: a very high dimensionality, a close correlation of subsequent entries, and an only sparsely covered data space. Here, the design of metrics which also take the generalization ability of the classifier into account and which go beyond the standard Euclidean metric is of crucial importance.

The paper is structured as follows: in Sect. 2 the investigated methods are described, including the respective extensions to non-standard metrics and metric adaptation. They are compared for several classification tasks in Sect. 3, followed by a summary.

2 Methods for classification

We concentrate on four methods for classification in machine learning according to the four directions outlined above and consider them in the context of non-standard metrics and metric adaptation. The chosen methods are popular representatives of their respective methodologies:

- prototype based classification: Generalized Learning Vector Quantization (GLVQ), an extension of the basic LVQ algorithms introduced by Sato & Yamada [13], combined with relevance learning and neighborhood cooperativeness (Supervised Relevance Neural Gas, SRNG) [14];
- mutual information maximization: Information Theoretic LVQ (IT-LVQ) according to the information theoretic measures proposed by Torkkola [15], [16];
- kernel regression classification: Regression Parametric Distance Metric Learning (RPDML) introduced by Zhang et al. [17];
- kernel based classification: Kernel Based Distance Learning (KBDL) by Tsang & Kwok [18], [17].

In the following we briefly explain the basic ideas of the algorithms. This is followed by numerical considerations.

2.1 Supervised Neural Gas for Generalized Learning Vector Quantization

Supervised Neural Gas is considered as a representative of prototype based classification approaches. It can be combined with the demanded feature of relevance learning. Moreover, it is a stochastic gradient descent algorithm which acts as a margin optimizer with known bounds on the generalization ability [19].

2.1.1 Basic Model

Standard LVQ2.1 as proposed by Kohonen is a heuristic approach to reduce the classification error in supervised learning [20]. However, the adaptation dynamic does not minimize any continuous cost function and shows instabilities. In addition, the result of LVQ2.1 crucially depends on the initialization of the prototypes, which are commonly initialized with the center points of the classes or with random representatives from the training set. The algorithms use iterative local learning rules which can easily get stuck in local optima. GLVQ avoids the numerical instabilities of LVQ2.1 due to a stochastic gradient descent on a cost function [13] which optimizes the margin [21].

However, it crucially depends on the initialization. To overcome this drawback, a combination of GLVQ with neural gas (NG) has been proposed by the authors in such a way that a cost function is minimized through the learning rule [22]. This cost function leads to a training similar to NG or to simple GLVQ, respectively, depending on the choice of a parameter of the cost function. During training, this parameter is varied so that neighborhood cooperation assures a distribution of the prototypes among the data set at the beginning, and a good separation of the classes is accounted for at the end of training. We now shortly present the respective formal descriptions.

Let us first clarify some notation: let $c_v \in L$ be the label of input $v$, with $L$ a set of labels (classes) and $\#L = N_L$. Let $V \subseteq \mathbb{R}^{D_V}$ be a finite set of inputs $v$. LVQ uses a fixed number of prototypes (weight vectors, codebook vectors) for each class. Let $W = \{w_r\}$ be the set of all codebook vectors and $c_r$ the class label of $w_r$. Furthermore, let $W_c = \{w_r \mid c_r = c\}$ be the subset of prototypes assigned to class $c \in L$. The task of vector quantization is realized by the map $\Psi$ as a winner-take-all rule, i.e. a stimulus vector $v \in V$ is mapped onto that neuron $s \in A$ the pointer $w_s$ of which is closest to the presented stimulus vector $v$,

$$\Psi^{\lambda}_{V \to A} : v \mapsto s(v) = \operatorname{argmin}_{r \in A} d^{\lambda}(v, w_r) \qquad (2.1)$$

with $d^{\lambda}(v, w)$ being an arbitrary differentiable similarity measure which may depend on a parameter vector $\lambda$. (A similarity measure is a non-negative real-valued function of two variables which, in contrast to a distance measure, does not necessarily fulfill the triangle inequality and need not be symmetric; naturally, each distance measure is a similarity measure.) For the moment we take $\lambda$ as fixed. The neuron $s(v)$ is called winner or best matching unit. The subset of the input space

$$\Omega^{\lambda}_{r} = \{v \in V : r = \Psi_{V \to A}(v)\} \qquad (2.2)$$

which is mapped to a particular neuron $r$ according to (2.1) forms the (masked) receptive field of that neuron, yielding a Voronoi tessellation. If the class information of the weight vectors is used, the boundaries of the $\Omega^{\lambda}_{r}$ generate the decision boundaries for the classes. A training algorithm should adapt the prototypes such that for each class $c \in L$ the corresponding codebook vectors $W_c$ represent the class as accurately as possible. This means that the set of points in any given class, $V_c = \{v \in V \mid c_v = c\}$, and the union $U_c = \bigcup_{w_r \in W_c} \Omega_r$ of the receptive fields of the corresponding prototypes should differ as little as possible.

We now consider the Generalized Learning Vector Quantization approach (GLVQ). The main idea is to introduce a cost function such that the learning rule gives a gradient descent on it. At the same time it should assess the number of misclassifications of the prototype based classification. Let $f(x) = (1 + \exp(-x))^{-1}$ be the logistic function. GLVQ minimizes the cost function

$$\mathrm{Cost}_{GLVQ} = \sum_{v} f(\mu_{\lambda}(v)) \qquad (2.3)$$

$$\mu_{\lambda}(v) = \frac{d^{\lambda}_{r^+} - d^{\lambda}_{r^-}}{d^{\lambda}_{r^+} + d^{\lambda}_{r^-}} \qquad (2.4)$$

via stochastic gradient descent, where $d^{\lambda}_{r^+}$ is the squared distance of the input vector $v$ to the nearest codebook vector labeled with $c_{r^+} = c_v$, say $w_{r^+}$, and $d^{\lambda}_{r^-}$ is the squared distance to the best matching prototype labeled with $c_{r^-} \neq c_v$, say $w_{r^-}$.
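To make the cost function concrete, the following minimal numpy sketch evaluates $\mu_{\lambda}(v)$ of (2.4) and one summand of (2.3) for a single labeled sample. It is not the authors' implementation; the scaled squared Euclidean distance introduced later in (2.20) is assumed as the similarity measure, and all function and variable names are illustrative.

```python
import numpy as np

def glvq_mu(v, c_v, prototypes, labels, lam):
    """mu_lambda(v) of eq. (2.4) for one labeled sample, using the scaled
    squared Euclidean distance d_lambda(v, w) = sum_k lam_k (v_k - w_k)^2."""
    d = np.sum(lam * (prototypes - v) ** 2, axis=1)      # d_lambda(v, w_r) for all prototypes
    correct = labels == c_v
    r_plus = np.argmin(np.where(correct, d, np.inf))     # closest prototype with the correct label
    r_minus = np.argmin(np.where(~correct, d, np.inf))   # closest prototype with a wrong label
    return (d[r_plus] - d[r_minus]) / (d[r_plus] + d[r_minus]), r_plus, r_minus

def logistic(x):
    """f(x) = (1 + exp(-x))^-1; one summand of Cost_GLVQ (2.3) is logistic(glvq_mu(...)[0])."""
    return 1.0 / (1.0 + np.exp(-x))
```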

As shown in [23], the usage of the function $\mu_{\lambda}(v)$ yields robust behavior whereas LVQ2.1 does not. The learning rule of GLVQ is obtained by taking the derivatives of the above cost function. Using

$$\frac{\partial \mu_{\lambda}(v)}{\partial w_{r^+}} = \xi^+ \frac{\partial d^{\lambda}_{r^+}}{\partial w_{r^+}} \quad \text{and} \quad \frac{\partial \mu_{\lambda}(v)}{\partial w_{r^-}} = \xi^- \frac{\partial d^{\lambda}_{r^-}}{\partial w_{r^-}}$$

with

$$\xi^+ = \frac{2\, d^{\lambda}_{r^-}}{(d^{\lambda}_{r^+} + d^{\lambda}_{r^-})^2} \quad \text{and} \quad \xi^- = \frac{-2\, d^{\lambda}_{r^+}}{(d^{\lambda}_{r^+} + d^{\lambda}_{r^-})^2} \qquad (2.5)$$

one obtains for the weight updates [13]:

$$\triangle w_{r^+} = -\epsilon^+ \, f'(\mu_{\lambda}(v)) \, \xi^+ \, \frac{\partial d^{\lambda}_{r^+}}{\partial w_{r^+}} \qquad (2.6)$$

$$\triangle w_{r^-} = -\epsilon^- \, f'(\mu_{\lambda}(v)) \, \xi^- \, \frac{\partial d^{\lambda}_{r^-}}{\partial w_{r^-}} \qquad (2.7)$$

$\epsilon^+, \epsilon^-$ are learning rates. As shown in [24], the above learning rules are also valid in case of a continuous data distribution.

The original (unsupervised) Neural Gas (NG) adapts unlabeled prototypes $w_r \in W$ according to a given data set such that the cost function

$$\mathrm{Cost}_{NG}(\gamma) = \frac{1}{C(\gamma, K)} \sum_{r} e_r(\gamma) \qquad (2.8)$$

is minimized, with local costs

$$e_r(\gamma) = \int P(v)\, h_{\gamma}(r, v, W)\, (v - w_r)^2 \, dv \qquad (2.9)$$

and the neighborhood function known from NG:

$$h_{\gamma}(r, v, W) = \exp\left(-\frac{k_r(v, W)}{\gamma}\right). \qquad (2.10)$$

Thereby $k_r(v, W)$ yields the number of prototypes $w_{r'}$ for which the relation $d^{\lambda}(v, w_{r'}) < d^{\lambda}(v, w_r)$ is valid, i.e. $k_r(v, W)$ is the winner rank [25]. $C(\gamma, K)$ is a normalization constant depending on the neighborhood range $\gamma$ and the cardinality $K$ of $W$. The learning rule minimizing the cost function reads

$$\triangle w_r = \epsilon\, h_{\gamma}(r, v, W)\, (v - w_r). \qquad (2.11)$$

The initialization of the prototypes is no longer crucial in NG because of the involved neighborhood cooperation.

As mentioned above, Supervised Neural Gas (SNG) constitutes a combination of GLVQ and NG. Again, let $W_c = \{w_r \mid c_r = c\}$ be the subset of prototypes assigned to class $c \in L$ and $K_c$ its cardinality. Further we assume to have $m$ data vectors $v_i$. As pointed out in [22], the neighborhood learning for a given input $v_i$ with label $c$ is applied to the subset $W_c$. The respective cost function is

$$\mathrm{Cost}_{SNG}(\gamma) = \sum_{i=1}^{m} \sum_{r:\, w_r \in W_{c_i}} f(\mu_{\lambda}(r, v_i)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \qquad (2.12)$$

with $f(x) = (1 + \exp(-x))^{-1}$ and $\mu_{\lambda}(r, v) = \frac{d^{\lambda}_{r} - d^{\lambda}_{r^-}}{d^{\lambda}_{r} + d^{\lambda}_{r^-}}$, whereby $d^{\lambda}_{r^-}$ is defined as in GLVQ above and $d^{\lambda}_{r} = d^{\lambda}(v, w_r)$. The neighborhood cooperativeness makes sure that prototypes are spread faithfully among the data of their respective classes.
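The rank-based neighborhood weighting $h_{\gamma}$ of (2.10), which also enters (2.12), can be sketched as follows; a minimal illustration assuming the scaled squared Euclidean distance, with illustrative names.

```python
import numpy as np

def neighborhood_weights(v, prototypes, gamma, lam):
    """h_gamma(r, v, W) = exp(-k_r(v, W) / gamma) of eq. (2.10), where
    k_r(v, W) is the winner rank of prototype w_r (0 for the winner)."""
    d = np.sum(lam * (prototypes - v) ** 2, axis=1)   # distances to all prototypes in W
    ranks = np.argsort(np.argsort(d))                 # k_r(v, W): rank of each prototype in the sorted order
    return np.exp(-ranks / gamma)
```

In SNG/SRNG this weighting would be evaluated only on the class-specific subset $W_{c_i}$, cf. (2.12).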

Note that $\lim_{\gamma \to 0} \mathrm{Cost}_{SNG}(\gamma) = \mathrm{Cost}_{GLVQ}$ holds [22]. Hence, for vanishing neighborhood the SNG also becomes optimal in the sense of margin analysis, as detailed below. However, if the neighborhood range $\gamma$ is large, typically at the beginning of the training, the prototypes of one class share their responsibilities for a given input. Hence, neighborhood cooperation is involved such that the initialization of the prototypes is no longer crucial. Given a training example $(v_i, c_i)$, all prototypes $w_r \in W_{c_i}$ and the closest wrong prototype $w_{r^-}$ are adapted. Taking now

$$\xi^+_r = \frac{2\, d^{\lambda}_{r^-}}{(d^{\lambda}_{r} + d^{\lambda}_{r^-})^2} \quad \text{and} \quad \xi^-_r = \frac{-2\, d^{\lambda}_{r}}{(d^{\lambda}_{r} + d^{\lambda}_{r^-})^2} \qquad (2.13)$$

we get the update rules

$$\triangle w_r = -\epsilon^+ \, \xi^+_r \, f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \, \frac{\partial d^{\lambda}_{r}}{\partial w_r} \qquad (2.14)$$

$$\triangle w_{r^-} = -\epsilon^- \sum_{r:\, w_r \in W_{c_i}} \xi^-_r \, f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \, \frac{\partial d^{\lambda}_{r^-}}{\partial w_{r^-}} \qquad (2.15)$$

We remark that the neighborhood cooperativeness is applied only to the correct prototypes. One could also include neighborhood cooperation for the wrong prototypes; however, as shown in [22], this yields instabilities of learning.

Margin analysis of an algorithm is important to assess the level of confidence of a classifier with respect to its decision. For example, the sample margin is defined as the distance between the input and the decision boundary. The natural choice of this margin to be maximized in learning vector quantization causes numerical instabilities [21]. An alternative definition is the hypothesis margin: this margin is the distance that the classifier can move without changing the way it labels any sample data. Thereby, the sample margin majorizes the hypothesis margin. The hypothesis margin of a prototype based classifier is given by

$$\left(d^{\lambda}_{r^-}\right)^{\frac{1}{2}} - \left(d^{\lambda}_{r^+}\right)^{\frac{1}{2}}. \qquad (2.16)$$

In fact, GLVQ (and, hence, SNG for vanishing neighborhood) maximizes a cost function closely related to this hypothesis margin and, hence, can be taken as a maximum margin algorithm, provided the similarity measure can be interpreted as a kernelized version of the Euclidean metric, i.e. the similarity measure is symmetric and its negative is conditionally positive definite [19]. Further, any given margin provides an upper bound for the generalization error of the classifier, such that the higher the margin, the lower the generalization error [21]. For a more detailed analysis we refer to [21].

2.1.2 Non-Standard Metrics, Relevance Learning and Metric Adaptation in SNG

We now consider relevance learning in the above introduced SNG, i.e. we study the influence of the parameter vector $\lambda = (\lambda_1, \ldots, \lambda_m)$ in the distance measure $d^{\lambda}(v, w)$ used in (2.1). In other words, we are interested in an adaptation of the distance measure in dependence on the $\lambda_k$ to minimize the cost function; we are looking for the relevance of the parameters [26]. Then an adaptation step for the parameters $\lambda_k$ has to be added to the usual weight vector adaptation. Thereby, we assume that $\lambda_k \ge 0$ and $\sum_k \lambda_k = 1$. Relevance learning for SNG is referred to as SRNG, and we get

$$\triangle \lambda_k = -\epsilon_{\lambda} \sum_{r:\, w_r \in W_{c_i}} f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \left( \xi^+_r \frac{\partial d^{\lambda}_{r}}{\partial \lambda_k} + \xi^-_r \frac{\partial d^{\lambda}_{r^-}}{\partial \lambda_k} \right) \qquad (2.17)$$

using

$$\frac{\partial \mu_{\lambda}(v)}{\partial \lambda_k} = \xi^+ \frac{\partial d^{\lambda}_{r^+}}{\partial \lambda_k} + \xi^- \frac{\partial d^{\lambda}_{r^-}}{\partial \lambda_k}, \qquad (2.18)$$

followed by a renormalization. $\epsilon_{\lambda} > 0$ is the learning rate.
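For illustration, the hypothesis margin (2.16) discussed above can be evaluated sample-wise for a trained prototype classifier; a minimal sketch assuming the scaled squared Euclidean distance, with all names being illustrative.

```python
import numpy as np

def hypothesis_margin(v, c_v, prototypes, labels, lam):
    """Sample-wise hypothesis margin following eq. (2.16):
    sqrt(d_{r-}) - sqrt(d_{r+}); positive values indicate a correct
    classification with a safety margin."""
    d = np.sum(lam * (prototypes - v) ** 2, axis=1)
    d_plus = np.min(d[labels == c_v])    # nearest prototype of the correct class
    d_minus = np.min(d[labels != c_v])   # nearest prototype of any wrong class
    return np.sqrt(d_minus) - np.sqrt(d_plus)
```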

For vanishing neighborhood cooperativeness $\gamma \to 0$, SRNG turns into Generalized Relevance LVQ (GRLVQ), i.e. relevance learning in GLVQ [27], with

$$\triangle \lambda_k = -\epsilon_{\lambda} \, f'(\mu_{\lambda}(r, v)) \, \frac{\partial \mu_{\lambda}(r, v)}{\partial \lambda_k}. \qquad (2.19)$$

Again, as above, the update rule is also valid for the continuous case [24]. The learning rate $\epsilon_{\lambda}$ should in both approaches be at least a magnitude smaller than the learning rates for the weight vectors. Then the weight vector adaptation takes place in a quasi stationary environment with respect to the (slowly changing) metric. Hence, the margin optimization takes place for each level of the parameter vector $\lambda$, and we get an overall optimization of the margin including a relevance weighting of the parameters.

As pointed out before, the similarity measure $d^{\lambda}(v, w)$ is only required to be differentiable with respect to $\lambda$ and $w$. The triangle inequality does not necessarily have to be fulfilled. This leads to a great freedom in the choice of suitable measures and allows the usage of non-standard metrics in a natural way. In particular, kernel based similarity measures are allowed [19]. In this way SNG/SRNG are comparable to SVMs [28]. Interestingly, the margin analysis of this LVQ version also holds for adaptive relevance metrics and kernelizations thereof, as shown in [28].

In case that

$$d^{\lambda}_{V}(v, w) = \sum_{i=1}^{D_V} \lambda_i (v_i - w_i)^2 \qquad (2.20)$$

is the squared, scaled Euclidean distance, whereby again $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, we immediately obtain from (2.6) and (2.7) for the GLVQ update:

$$\triangle w_{r^+} = \epsilon^+ \, 2\, f'(\mu_{\lambda}(v)) \, \xi^+ \, \Lambda (v - w_{r^+}) \qquad (2.21)$$

$$\triangle w_{r^-} = \epsilon^- \, 2\, f'(\mu_{\lambda}(v)) \, \xi^- \, \Lambda (v - w_{r^-}), \qquad (2.22)$$

respectively, with $\Lambda$ being the diagonal matrix with entries $\lambda_1, \ldots, \lambda_{D_V}$, and the relevance parameter update (2.19) in GRLVQ reads

$$\triangle \lambda_k = -\epsilon_{\lambda} \, f'(\mu_{\lambda}(v)) \left( \xi^+ (v - w_{r^+})^2_k + \xi^- (v - w_{r^-})^2_k \right) \qquad (2.23)$$

with $k = 1, \ldots, D_V$. In case of SNG and SRNG we get for the updates with the scaled Euclidean distance

$$\triangle w_r = 2\epsilon^+ \, \xi^+_r \, f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \, \Lambda (v - w_r) \qquad (2.24)$$

$$\triangle w_{r^-} = 2\epsilon^- \sum_{r:\, w_r \in W_{c_i}} \xi^-_r \, f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \, \Lambda (v - w_{r^-}), \qquad (2.25)$$

with $\Lambda$ being again the diagonal matrix with entries $\lambda_1, \ldots, \lambda_{D_V}$, and

$$\triangle \lambda_k = -\epsilon_{\lambda} \sum_{r:\, w_r \in W_{c_i}} f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \left( \xi^+_r (v - w_r)^2_k + \xi^-_r (v - w_{r^-})^2_k \right). \qquad (2.26)$$
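As an illustration of the scaled Euclidean case, here is a minimal numpy sketch of one stochastic GRLVQ step, i.e. the $\gamma \to 0$ limit of SRNG, following the structure of (2.21)-(2.23). It is not the authors' code; the learning rates are illustrative, and the neighborhood weighting of (2.24)-(2.26) is omitted.

```python
import numpy as np

def grlvq_step(v, c_v, prototypes, labels, lam,
               eps_plus=0.1, eps_minus=0.05, eps_lam=0.001):
    """One stochastic GRLVQ update with the scaled Euclidean metric (2.20).
    prototypes (n_proto x dim) and lam (dim,) are modified in place."""
    d = np.sum(lam * (prototypes - v) ** 2, axis=1)
    correct = labels == c_v
    rp = np.argmin(np.where(correct, d, np.inf))     # index r+
    rm = np.argmin(np.where(~correct, d, np.inf))    # index r-
    dp, dm = d[rp], d[rm]
    mu = (dp - dm) / (dp + dm)
    fprime = np.exp(-mu) / (1.0 + np.exp(-mu)) ** 2  # derivative of the logistic function f
    xi_p = 2.0 * dm / (dp + dm) ** 2                 # xi^+ of (2.5)
    xi_m = -2.0 * dp / (dp + dm) ** 2                # xi^- of (2.5), carries the repulsion sign
    diff_p, diff_m = v - prototypes[rp], v - prototypes[rm]
    prototypes[rp] += eps_plus * 2.0 * fprime * xi_p * lam * diff_p   # (2.21): attract w_{r+}
    prototypes[rm] += eps_minus * 2.0 * fprime * xi_m * lam * diff_m  # (2.22): repel w_{r-}
    lam -= eps_lam * fprime * (xi_p * diff_p ** 2 + xi_m * diff_m ** 2)  # (2.23)
    np.clip(lam, 0.0, None, out=lam)                 # keep lambda_k >= 0
    lam /= lam.sum()                                 # renormalize to sum 1
```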

Another popular similarity measure is the (parametrized) Mahalanobis distance

$$d^{\lambda}(v, w) = (v - w)^T C_{\lambda}^{-1} (v - w) \qquad (2.27)$$

with $C_{\lambda}$ being the parametrized covariance matrix defined as follows: let $C = P^T D P$ be the diagonal representation of the usual covariance matrix with diagonal matrix $D$. Then the parametrized diagonal matrix $D_{\lambda}$ is obtained as $D_{\lambda} = \Lambda D$, where $\Lambda$ is the diagonal matrix whose diagonal elements $\Lambda_{ii} = \lambda_i$ form the parameter vector $\lambda$. Finally we define

$$C_{\lambda} = P^T D_{\lambda} P. \qquad (2.28)$$

The application of this measure allows a relevance ranking of the principal components of the data distribution for the given classification task.
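A small numpy sketch of this parametrized Mahalanobis distance, computing (2.27)-(2.28) from an eigendecomposition of the empirical covariance; the names and the particular linear solve are illustrative choices, not the authors' implementation.

```python
import numpy as np

def parametrized_mahalanobis(v, w, data, lam):
    """Parametrized Mahalanobis distance of eqs. (2.27)-(2.28): the usual
    covariance C = P^T D P is rescaled along its principal axes by the
    relevance factors lam (D_lambda = Lambda D) before inversion."""
    C = np.cov(data, rowvar=False)              # empirical covariance of the data set
    eigval, Q = np.linalg.eigh(C)               # C = Q diag(eigval) Q^T, i.e. P = Q^T
    C_lam = Q @ np.diag(lam * eigval) @ Q.T     # relevance-weighted covariance C_lambda
    diff = v - w
    # (v - w)^T C_lambda^{-1} (v - w); the solve can be ill-conditioned for
    # noisy data, cf. the discussion of the PMM in Sec. 3
    return diff @ np.linalg.solve(C_lam, diff)
```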

2.2 Learning Classification According to Information Theoretic Measures

The information theoretic classification is based on maximization of the mutual information between the class information and the data in order to approximate the Bayes error.

2.2.1 Basic Model for Classification Learning

We now describe the information theory based learning vector quantization (IT-LVQ) proposed in [29], [30], [15]. The basic idea is the usage of the mutual information between the class labels of the input vectors, taken as a random variable, and the network output. This approach is based on the following considerations: the ultimate criterion for classification tasks is the Bayes error, yet the mutual information can be taken as a proper approximation. Generally, the mutual information $I(Y, X)$ measures the information transfer between the random variables $X$ and $Y$:

$$I(Y, X) = H(Y) + H(X) - H(Y, X) = H(Y) - H(Y|X) = H(X) - H(X|Y) \qquad (2.29)$$

with $H(Y)$ usually being the Shannon entropy and $H(Y|X)$ its conditional counterpart. As pointed out in [16], the Bayes error $B(Y)$ has an upper bound [31] given by

$$B(Y) \le \frac{1}{2} H(X|Y) = \frac{1}{2}\left(H(X) - I(Y, X)\right) \qquad (2.30)$$

and a lower bound given by Fano's inequality,

$$P\!\left(Y \neq \hat{Y}\right) \ge \frac{H(Y|X) - 1}{\log(\aleph)} = \frac{H(Y) - I(Y, X) - 1}{\log(\aleph)} \qquad (2.31)$$

with $\aleph$ the number of possible instances of $Y$ and $\hat{Y}$ an estimate of $Y$ [32]. Hence, maximizing $I(Y, X)$ is equivalent to minimizing both bounds. Therefore, the mutual information can be taken as an estimate for the Bayes error.

Now we can formulate the problem of classification in the following way: let

$$y_i = g(v_i, w) \qquad (2.32)$$

be a transfer function for a given input $v_i \in V$ with weights $w$. Then the formal stochastic gradient approach according to the mutual information $I$ can be written for the weights $w$ as

$$\triangle w = \varepsilon_w \frac{\partial I}{\partial w} \quad \text{with} \quad \frac{\partial I}{\partial w} = \sum_i \frac{\partial I}{\partial y_i} \frac{\partial y_i}{\partial w}. \qquad (2.33)$$

Hence, the task is to determine $\frac{\partial I}{\partial y_i}$. For this purpose we consider $I(Y, X)$ and specify $X = L$, again with $L$ being the set of labels (classes) and $\#L = N_L$, i.e. we are interested in maximizing the mutual information between the random sequence of labels $c_v$ according to the input sequence of $v$ and the variable $y$ by adaptation of the weights $w$.

For high-dimensional, large data sets the computational costs become intractable. Therefore, a more convenient approximation is necessary. Such an approach is provided in [16] based on the fundamental work [29]. The key idea is to replace the Shannon entropy in the definition of the mutual information to make the computation tractable. In fact, according to [33] the third axiom of additivity/recursiveness of the Shannon entropy is not necessary if one is only interested in maximizing or minimizing the entropy of a system [16]. Therefore, other entropy measures can be used in this case [34]. For easier numerical computation, Renyi's entropy is considered [35],

$$H_{\mathrm{Renyi}}(\alpha) = \frac{1}{1 - \alpha} \ln\left(\sum_{l \in L} (p_l)^{\alpha}\right), \qquad (2.34)$$

instead of the Shannon entropy for use in the mutual information. For $\alpha = 2$ it is called the quadratic entropy. Then the mutual information (2.29) can be written as

$$I(Y, L) = \sum_{l \in L} \int p^2(l, y)\, dy + \left(\sum_{l \in L} p_l^2\right) \int p^2(y)\, dy - 2 \sum_{l \in L} p_l \int p(l, y)\, p(y)\, dy \qquad (2.35)$$

with $p_l$ the a priori probabilities of class $l$, as shown in [16]. One can estimate the probability density using a Parzen window approximation with spherical Gaussian kernels $G$ as

$$p(y) = \frac{1}{N} \sum_{i=1}^{N} G\left(y - y_i,\, \sigma^2 \mathbf{1}\right) \qquad (2.36)$$

with the Gaussian kernel in $d$-dimensional space defined as

$$G(y, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left(-\frac{1}{2} y^T \Sigma^{-1} y\right). \qquad (2.37)$$

Thus, the mutual information $I$ can be estimated as

$$I(Y, L) \approx \frac{1}{N^2} \sum_{l=1}^{N_L} \sum_{i,j=1}^{N_l} G\left(y_{li} - y_{lj},\, 2\sigma^2 \mathbf{1}\right) + \frac{1}{N^2} \left(\sum_{l=1}^{N_L} p_l^2\right) \sum_{i,j=1}^{N} G\left(y_i - y_j,\, 2\sigma^2 \mathbf{1}\right) - \frac{2}{N^2} \sum_{l=1}^{N_L} p_l \sum_{i=1}^{N_l} \sum_{j=1}^{N} G\left(y_{li} - y_j,\, 2\sigma^2 \mathbf{1}\right) \qquad (2.38, 2.39)$$

using the convolution properties of Gaussian kernels.
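A direct transcription of this Parzen estimate into numpy might look as follows; this is a sketch for illustration only (quadratic in the number of samples), and all names are illustrative.

```python
import numpy as np

def gauss(diff, var):
    """Spherical Gaussian kernel G(y, var * 1) of eq. (2.37), evaluated at y = diff."""
    d = diff.shape[-1]
    return (2.0 * np.pi * var) ** (-d / 2.0) * np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / var)

def quadratic_mutual_information(y, labels, sigma):
    """Parzen-window estimate of I(Y, L) following the three-term structure
    of eq. (2.38); y holds the transformed samples row-wise."""
    n = len(y)
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / n                                            # a priori class probabilities p_l
    pair = gauss(y[:, None, :] - y[None, :, :], 2.0 * sigma ** 2)  # G(y_i - y_j, 2 sigma^2 1)
    same = labels[:, None] == labels[None, :]
    term1 = pair[same].sum() / n ** 2                              # within-class interactions
    term2 = (priors ** 2).sum() * pair.sum() / n ** 2              # all interactions, weighted by sum_l p_l^2
    term3 = 2.0 / n ** 2 * sum(p * pair[labels == c].sum() for c, p in zip(classes, priors))
    return term1 + term2 - term3
```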

This formula can be used to carry out the derivative $\frac{\partial I}{\partial y}$. However, the computation is still expensive. Therefore, in [15] a stochastic approximation based on the above Parzen estimate is suggested: it is assumed for the moment that the database $V$ only contains two inputs $v_1$ and $v_2$ with two class possibilities. Then two cases have to be handled separately: (a) the transformed data $y_1$ and $y_2$ both come from the same class, or (b) they come from different classes. In the first case it follows that $I(Y, L) \equiv 0$ holds. In the second case (b) one gets

$$I(Y, L) = \frac{1}{4}\left( G\left(0,\, 2\sigma^2 \mathbf{1}\right) - G\left(y_1 - y_2,\, 2\sigma^2 \mathbf{1}\right) \right) \qquad (2.40)$$

which leads to the derivatives

$$\frac{\partial I}{\partial y_1} = \frac{1}{8\sigma^2}\, G\left(y_1 - y_2,\, 2\sigma^2 \mathbf{1}\right) (y_1 - y_2) = -\frac{\partial I}{\partial y_2}. \qquad (2.41)$$

In this way we can rewrite the general update rule (2.33) for this case and obtain

$$\triangle w = \frac{\varepsilon}{8\sigma^2}\, G\left(y_1 - y_2,\, 2\sigma^2 \mathbf{1}\right) (y_2 - y_1)^T \left( \frac{\partial y_2}{\partial w} - \frac{\partial y_1}{\partial w} \right). \qquad (2.42)$$

It should be mentioned here that the same update rule for the weights can be obtained using the information energy $O(X, Y)$ instead of the mutual information [36, 30]. The information energy is defined as

$$O(X, Y) = E[Y|X] - E[Y] \qquad (2.43)$$

whereby $E$ is the expectation value. The information energy has the following properties:

- in general $O(X, Y) \neq O(Y, X)$;
- $O(X, Y) \ge 0$, and $O(X, Y) = 0$ iff $X$ and $Y$ are statistically independent;
- $O(X, Y) \le 1 - E[X]$, and $O(X, Y) = 1 - E[X]$ iff $X$ is completely dependent on $Y$.

$O(X, Y)$ measures the unilateral dependence of $X$ relative to $Y$.

2.2.2 Non-Standard Metrics, Relevance Learning and Metric Adaptation

The definition of the transfer function (2.32) for the model only requires differentiability with respect to the weights $w$. The specific choice is not pre-determined and, again, allows the usage of non-standard metrics in a natural manner. Further, any transfer function can depend on an additional parameter vector $\lambda$,

$$g(v_i, w) = g(v_i, w, \lambda). \qquad (2.44)$$

According to the idea of relevance learning, we suggest to extend the weight update (2.33), respectively (2.42), by a parameter adaptation, as introduced in the case of the information energy [30]:

$$\triangle \lambda = \varepsilon_{\lambda} \frac{\partial I}{\partial \lambda} \quad \text{with} \quad \frac{\partial I}{\partial \lambda} = \sum_i \frac{\partial I}{\partial y_i} \frac{\partial y_i}{\partial \lambda}. \qquad (2.45)$$

From this we immediately derive the explicit formula

$$\triangle \lambda = \frac{\varepsilon}{8\sigma^2}\, G\left(y_1 - y_2,\, 2\sigma^2 \mathbf{1}\right) (y_2 - y_1)^T \left( \frac{\partial y_2}{\partial \lambda} - \frac{\partial y_1}{\partial \lambda} \right) \qquad (2.46)$$

in complete analogy to the formula for the weight update (2.42). We remark that we obtain the same relevance parameter update rule as known for the information energy (2.43), except for a constant factor [30].
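The two-sample update (2.42) and its relevance counterpart (2.46) can be sketched for a concrete transfer function. Since the paper leaves $g$ unspecified apart from differentiability, the linear choice $g(v, W, \lambda) = W^T(\lambda \odot v)$ used below is purely an assumption for illustration, and all names are illustrative.

```python
import numpy as np

def itlvq_pair_gradients(v1, v2, W, lam, sigma):
    """Gradients of the two-sample mutual information (2.40) with respect to
    the weights W and the relevance vector lam, following (2.41), (2.42) and
    (2.46), for the assumed linear transfer function y = W.T @ (lam * v)."""
    y1, y2 = W.T @ (lam * v1), W.T @ (lam * v2)
    diff = y1 - y2
    d = diff.shape[0]
    g = (4.0 * np.pi * sigma ** 2) ** (-d / 2.0) * np.exp(-np.sum(diff ** 2) / (4.0 * sigma ** 2))
    dI_dy1 = g * diff / (8.0 * sigma ** 2)       # eq. (2.41); dI/dy2 = -dI/dy1
    grad_W = np.outer(lam * (v1 - v2), dI_dy1)   # chain rule behind eq. (2.42)
    grad_lam = (W @ dI_dy1) * (v1 - v2)          # chain rule behind eq. (2.46)
    return grad_W, grad_lam

# Since the mutual information is maximized, the parameters follow a gradient
# ascent, e.g. W += eps_w * grad_W and lam += eps_lam * grad_lam.
```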

2.3 Regression Parametric Distance Metric Learning (RPDML)

Regression Parametric Distance Metric Learning (RPDML), as a regression classification approach, is based on an implicit kernel mapping estimation and a subsequent usual regression. It has been developed by Zhang et al. [17]. In RPDML the label information of the vectors is directly included into a special data dependent similarity measure:

$$s_{ij} = \begin{cases} \exp\!\left(-\frac{\|v_i - v_j\|}{\beta}\right) & c_{v_i} = c_{v_j} \\[4pt] 1 - \exp\!\left(-\frac{\|v_i - v_j\|}{\beta}\right) & c_{v_i} \neq c_{v_j} \end{cases} \qquad (2.47)$$

where $\|\cdot\|$ denotes the Euclidean norm and $\beta > 0$ is a free width parameter. The corresponding dissimilarity measure is

$$d_{ij} = s_{ii} + s_{jj} - 2 s_{ij}. \qquad (2.48)$$

It has been shown that the matrix $D = [d_{ij}]$ is metric, i.e. $d_{ij} \ge 0$, $d_{ii} = 0$, $d_{ij} = d_{ji}$, and the triangle inequality $d_{ik} + d_{jk} \ge d_{ij}$ holds for all $i, j, k$ [17].

The regression model RPDML introduces a mapping $f(v, W) = (f_1(v, W), \ldots, f_l(v, W))$ from the original input space $V \subseteq \mathbb{R}^{D_V}$ to the embedded Euclidean space $\mathbb{R}^{l}$ as $f = W^T \phi$. Each $f_i$ is a linear combination of $p$ linear or non-linear basis functions $\phi_j(v)$:

$$f_i(v, W) = \sum_{j=1}^{p} w_{ji}\, \phi_j(v) \qquad (2.49)$$

where $W = [w_{ji}]$ is an adaptive weight matrix and $\phi = (\phi_1, \ldots, \phi_p)^T$. Let $D(V)$ be given for an arbitrary fixed data set. Then RPDML minimizes the cost function

$$e^2(W) = \sum_{i \neq j} \left( d_{ij} - q_{ij}(W) \right)^2, \qquad \text{where } q_{ij}(W) = \left\| W^T \left(\phi(v_i) - \phi(v_j)\right) \right\|, \qquad (2.50)$$

with respect to $W$ by the iterative majorization algorithm [17, 37]. In this way one obtains a classifier from the regression model $f = W^T \phi$, whereby the weight matrix $W$ contains the information of the metric $D$ derived from the data.
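The label-dependent target dissimilarities of (2.47)-(2.48), which the regression then tries to reproduce, are easy to compute; a short illustrative numpy sketch (not the authors' code):

```python
import numpy as np

def rpdml_target_dissimilarities(X, labels, beta):
    """Similarities s_ij of eq. (2.47) and dissimilarities
    d_ij = s_ii + s_jj - 2 s_ij of eq. (2.48) for all pairs."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # Euclidean distances ||v_i - v_j||
    same = labels[:, None] == labels[None, :]
    s = np.where(same, np.exp(-dist / beta), 1.0 - np.exp(-dist / beta))
    d = np.diag(s)[:, None] + np.diag(s)[None, :] - 2.0 * s
    return d
```

The returned matrix plays the role of $D(V)$ in the cost function (2.50).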

2.4 Kernel Based Distance Learning (KBDL)

In kernel based classification models the data are mapped by a kernel mapping into a possibly high-dimensional space with high separation abilities. The Kernel Based Distance Learning classifier has been developed in the context of classification with metric adaptation. Kernel based distance learning (KBDL) has recently been introduced by Tsang & Kwok for distance metric learning for better classification [18]. The approach is derived from classic kernel methods. In fact, the problem description becomes very similar to that of support vector machines (SVMs). However, the focus of this approach is to automatically achieve an optimum metric for a given proximity setting. It is assumed that similarity/dissimilarity information between the data vectors $v \in V \subseteq \mathbb{R}^{D_V}$ is available: all pairs $(v_i, v_j)$ of similar vectors form the set $S$ of size $N_S$, all other pairs are collected in the dissimilarity set $D$ of size $N_D$. In the sense of a classification task, the set $S$ consists of all vector pairs of which both members belong to the same class, whereas $D$ collects all pairs containing vectors from different classes.

A generalized Mahalanobis distance can be defined by

$$d_{ij} = (v_i - v_j)^T M (v_i - v_j) \qquad (2.51)$$

with $M$ being a positive semi-definite matrix. It is well known that for any non-singular quadratic matrix $A$ the product $AA^T$ is positive semi-definite and, hence, can serve as $M$ in (2.51). The task in KBDL is to find a suitable transformation matrix $A$ of the data such that the distances for vector pairs in $S$ are minimized whereas the distances for vector pairs in $D$ are maximized. Let $d_{ij}$ refer to the distance measure (2.51) with an arbitrarily chosen positive semi-definite matrix $M$, and let $\tilde{d}_{ij}$ denote the distance measure whereby $AA^T$ is used instead of $M$ in (2.51). Now the idea is to vary $A$ in such a way that the above optimization goal is achieved. The problem can then be reformulated as: minimize the distances $\tilde{d}_{ij}$ for vector pairs in $S$, whereas $\tilde{d}_{ij} \ge d_{ij}$ is imposed for vector pairs in $D$. The latter goal is equivalent to the maximization of $\varsigma = \min_{ij} (\tilde{d}_{ij} - d_{ij})$ with $\varsigma \ge 0$. Because this condition may not be enforced perfectly, slack variables like in SVMs are necessary. This leads to the primal problem, which has a similar form as $\nu$-SVMs [18, 38]:

$$\min \ \frac{1}{2}\left\| AA^T \right\|^2 + \frac{C_S}{N_S} \sum_{(v_i, v_j) \in S} \tilde{d}_{ij} - C_D\, \nu \varsigma + \frac{C_D}{N_D} \sum_{(v_i, v_j) \in D} \xi_{ij}$$

with respect to $A$, $\varsigma$ and $\xi_{ij}$, subject to the constraints $\varsigma \ge 0$ and, for each $(v_i, v_j) \in D$,

$$\tilde{d}_{ij} \ge d_{ij} + \varsigma - \xi_{ij}, \qquad \xi_{ij} \ge 0.$$

Thereby, $C_S$, $C_D$ and $\nu$ are tunable parameters. The respective dual problem is obtained as the quadratic programming problem

$$\max_{\alpha} \ \sum_{(v_i, v_j) \in D} \alpha_{ij} (v_i - v_j)^T M (v_i - v_j) - \frac{1}{2} \sum_{(v_i, v_j) \in D}\, \sum_{(v_k, v_l) \in D} \alpha_{ij} \alpha_{kl} \left( (v_i - v_j)^T (v_k - v_l) \right)^2 + \frac{C_S}{N_S} \sum_{(v_i, v_j) \in D}\, \sum_{(v_k, v_l) \in S} \alpha_{ij} \left( (v_i - v_j)^T (v_k - v_l) \right)^2$$

subject to

$$\frac{1}{C_D} \sum_{(v_i, v_j) \in D} \alpha_{ij} \ge \nu, \qquad \alpha_{ij} \in \left[ 0,\, \frac{C_D}{N_D} \right],$$

with Lagrange multipliers $\alpha_{ij}$ [18]. Using the Karush-Kuhn-Tucker conditions

$$\alpha_{ij} \left( \tilde{d}_{ij} - d_{ij} - \varsigma + \xi_{ij} \right) = 0, \qquad \left( \frac{C_D}{N_D} - \alpha_{ij} \right) \xi_{ij} = 0, \qquad \mu \varsigma = 0 \qquad (2.52)$$

with $\nu = C_D^{-1} \left( \sum_{(v_i, v_j) \in D} \alpha_{ij} - \mu \right)$, it can be shown that

$$\tilde{d}_{ij} - d_{ij} \ \begin{cases} = \varsigma, & 0 < \alpha_{ij} < \frac{C_D}{N_D} \\ \ge \varsigma, & \alpha_{ij} = 0 \\ \le \varsigma, & \alpha_{ij} = \frac{C_D}{N_D} \end{cases} \qquad (2.53)$$

holds, which is very similar to SVMs. Further, $\nu$ is an upper bound for the fraction of errors [18].
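The adapted distances $\tilde{d}_{ij}$ appearing in the primal and dual above are simply squared Euclidean distances after the linear transformation $z = A^T v$; a minimal numpy sketch with illustrative names:

```python
import numpy as np

def kbdl_adapted_distances(X, pairs, A):
    """tilde{d}_ij = (v_i - v_j)^T A A^T (v_i - v_j) for the given index pairs,
    computed as squared Euclidean distances of the transformed points A^T v."""
    Z = X @ A                              # rows of Z are (A^T v_i)^T
    i, j = pairs[:, 0], pairs[:, 1]
    return np.sum((Z[i] - Z[j]) ** 2, axis=1)
```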

A kernelized variant of the approach can easily be obtained [18]: let $\kappa_{ij} = \kappa(v_i, v_j) = \phi(v_i) \cdot \phi(v_j)$ be a kernel function with feature map $\phi$. Then the dual problem becomes

$$\max_{\alpha} \ \sum_{(v_i, v_j) \in D} \alpha_{ij} \left( \kappa_{ii} + \kappa_{jj} - 2\kappa_{ij} \right) - \frac{1}{2} \sum_{(v_i, v_j) \in D}\, \sum_{(v_k, v_l) \in D} \alpha_{ij} \alpha_{kl} \left( \kappa_{ik} - \kappa_{il} - \kappa_{jk} + \kappa_{jl} \right)^2 + \frac{C_S}{N_S} \sum_{(v_i, v_j) \in D}\, \sum_{(v_k, v_l) \in S} \alpha_{ij} \left( \kappa_{ik} - \kappa_{il} - \kappa_{jk} + \kappa_{jl} \right)^2 \qquad (2.54)$$

and the modified inner product between $\phi(v_k)$ and $\phi(v_l)$ leads to the new kernel

$$\tilde{\kappa}(v_k, v_l) = \phi(v_k)^T AA^T \phi(v_l) = \sum_{(v_i, v_j) \in D} \alpha_{ij} \left( \kappa_{ki} - \kappa_{kj} \right) \left( \kappa_{il} - \kappa_{jl} \right) - \frac{C_S}{N_S} \sum_{(v_i, v_j) \in S} \left( \kappa_{ki} - \kappa_{kj} \right) \left( \kappa_{il} - \kappa_{jl} \right).$$

This adapted similarity measure can readily be included into standard metric classifiers such as nearest neighbor. The results reported below have been obtained by 1-nearest-neighbor classification.

3 Numerical Experiments

In this section we demonstrate the behavior of the different algorithms on artificial data sets as introduced in [14], on a subset of the public UCI repository of machine learning databases [39], and on the prostate cancer data set as published by the NCI [40]. Thereby, the artificial data sets serve as a demonstration of the effect of a problem adapted metric compared to the Euclidean one for a simple test situation. The UCI data sets have been chosen for two reasons: on the one hand, they cover different typical classification problems, on the other hand, they allow a fair comparison of the different proposals for metric adaptation. Since all methods have originally been proposed for vector data, we first study vectorial representations. We add one data set containing proteomic patterns, an interesting bioinformatics application. This set goes beyond standard vectorial data in the sense that it is obtained as the collection of function values of an underlying spectrum at different wave lengths, i.e. we deal with vectorial representations of functions. Correspondingly, the data are high dimensional and the dimensions show strong correlations. In addition, the data space is only sparsely covered and the number of examples is of the same order as the data dimensionality. Thus, methods which include structure in the form of relevance terms and which emphasize the generalization ability, e.g. in the form of margin maximization such as SRNG or SVM, are crucial for a successful classification. This example serves as a first step to demonstrate the applicability of the methods to non-standard data sets.

3.1 Artificial Data

For comparison, in the first tests for metric adaptation we applied the algorithms to the same artificial data sets as used before in [27], [14]: the artificial data sets 1 to 6 consist of two classes with 50 clusters each and about 30 points per cluster. The centers of the clusters are thereby located on a checkerboard structure in the two-dimensional square $[-1, 1]^2$.

Figure 1: Artificial multi-modal data set 1 (checkerboard) with clear class separation.

Figure 2: Artificial multi-modal data set 5. The structure is a checkerboard too, but now with large overlap between the classes.

Data sets 1, 3 and 5 contain two-dimensional data points from these clusters; the sets differ with respect to the overlap of the classes, as depicted in Figs. 1 and 2. Data sets 2, 4 and 6 are obtained as copies of 1, 3 and 5, respectively, whereby the two-dimensional points are embedded into 8 dimensions as follows: a point $(x_1, x_2)$ is embedded as $(x_1, x_2, x_1 + \nu_1, x_1 + \nu_2, x_1 + \nu_3, \nu_4, \nu_5, \nu_6)$. Thereby $\nu_i$ is uniform noise with support $[-0.05, 0.05]$ for $\nu_1$, $[-0.1, 0.1]$ for $\nu_2$, $[-0.2, 0.2]$ for $\nu_3$, $[-0.1, 0.1]$ for $\nu_4$, $[-0.2, 0.2]$ for $\nu_5$, and $[-0.5, 0.5]$ for $\nu_6$. These sets are randomly divided into training and test sets of the same size.

We train SRNG/SNG with the scaled Euclidean metric (SEM) (2.20) and the parametrized Mahalanobis metric (PMM) (2.27) with 50 prototypes for each class on these sets, whereby the prototypes are initialized randomly with small values. The number of prototypes used results from the minimum number of prototypes needed to achieve a classification accuracy of 100%. The relevance factors/parameters $\lambda_i$ are initialized uniformly. In case of SNG, the parameters/relevances remain constant during the training process. Training is done for 3000 cycles with learning rates $\epsilon^+ = 0.1$ for correct prototypes, $\epsilon^- = 0.05$ for incorrect prototypes, and a much smaller rate $\epsilon_{\lambda}$ for the relevance terms. The neighborhood range in SRNG/SNG (2.10) is initialized at $\gamma = 100$ and is multiplied by a constant decay factor after each cycle.

The results are collected in Table 1. We only give the prediction (test) rates; the recognition rate is omitted here. SNG and SRNG spread the prototypes faithfully among the data. SNG and SRNG usually miss at most 2 or 3 out of 100 clusters, whereas GLVQ and GRLVQ miss about 50% (one missing cluster accounts for an error of about 1%). For the well separated data set 1, SNG and SRNG achieve optimum classification accuracy for both applied metrics, SEM and PMM. For the second data set with SEM, SRNG is capable of classifying 95%. Typical relevance profiles which are achieved are about

$$\lambda = (0.34,\, 0.4,\, 0.18,\, 0, \ldots, 0).$$

Hence the importance of dimensions 1 and 2 is clearly pointed out; irrelevant dimensions are effectively pruned. When applying the PMM to data set 2 it is obvious that for SNG as well as for SRNG the classification is like a random guess. This may be due to the fact that the Mahalanobis distance depends on the inverse of the covariance matrix of the underlying data set. The determination of the inverse is a numerically ill-conditioned problem which may lead to unstable behavior [41]. The data are heavily affected by noise, which hence results in an invalid inverse covariance strongly influencing the class separation. This behavior can also be observed for the noisy data sets 4 and 6. The classification accuracy for data sets 3 and 4 is a bit worse with SEM owing to the larger overlap of the classes. Using PMM the effect of overlap can be handled more easily, so that the prediction values are better than for the SEM. This is especially obvious if we consider data set 5 (see also Figure 2), where the prediction rates for PMM are nearly 20% better than for SEM and therefore no longer a random guess.

In conclusion, the results for the artificial data sets clearly point out that the usage of different metrics can be useful to improve the classification accuracy. The adaptive SEM shows much better results than the fixed Euclidean weighting used in SNG. The PMM can more easily handle the overlapping artificial data sets but fails if the data sets are too much affected by noise.
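
For reference, artificial data of the kind described at the beginning of this subsection can be generated along the following lines. The 10x10 grid matches the 100 clusters mentioned above, but the cluster spread and the exact grid spacing are illustrative assumptions, not taken from [14].

```python
import numpy as np

def make_checkerboard_data(points_per_cluster=30, embed=False, spread=0.03, seed=0):
    """Two-class checkerboard data in [-1, 1]^2 with 100 clusters (50 per class);
    with embed=True the points are lifted to 8 dimensions with the uniform noise
    supports given in the text."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(10):
        for j in range(10):
            center = (-1.0 + (2 * i + 1) / 10.0, -1.0 + (2 * j + 1) / 10.0)
            pts = rng.normal(center, spread, size=(points_per_cluster, 2))
            X.append(pts)
            y.append(np.full(points_per_cluster, (i + j) % 2))   # checkerboard labels
    X, y = np.vstack(X), np.concatenate(y)
    if embed:
        widths = [0.05, 0.1, 0.2, 0.1, 0.2, 0.5]                 # noise supports from the text
        nu = np.column_stack([rng.uniform(-w, w, len(X)) for w in widths])
        X = np.column_stack([X[:, 0], X[:, 1],
                             X[:, 0] + nu[:, 0], X[:, 0] + nu[:, 1], X[:, 0] + nu[:, 2],
                             nu[:, 3], nu[:, 4], nu[:, 5]])
    return X, y
```
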
3.2 Real World Data

In this section we compare the performance of the different classifiers for real world data sets from the UCI machine learning repository and from NCI cancer research. Thereby, we use the achieved prediction accuracy as an estimate for the generalization ability of the models [42].

Further, we emphasize the feature of task (classification) dependent metric adaptation of all the approaches used. The UCI repository consists of a large number of different data sets with different input dimensions $D_V$ and numbers of classes. For comparison we used the same data as in [18]. In particular, we used the Pima Indians diabetes (PIMA, $D_V = 8$), the Soybean (SOYA, $D_V = 35$), the Wisconsin breast cancer (WBC, $D_V = 30$), the Wine (WINE, $D_V = 13$) and the Ionosphere (IONO, $D_V = 34$) data sets. These data sets cover a broad spectrum of properties which frequently occur in real world applications. According to [18], each data set was split into a test and a training data set as shown in Table 2.

Table 1: Prediction rates for the SNG and SRNG algorithms on the artificial data sets using SEM and PMM.

            SNG SEM   SRNG SEM   SNG PMM   SRNG PMM
  data1       98%       99%       100%       100%
  data2       72%       94%        53%        53%
  data3       90%       91%        96%        94%
  data4       64%       92%        51%        53%
  data5       50%       54%        73%        73%
  data6       50%       55%        51%        52%

Table 2: The used UCI data sets (PIMA, SOYA, WINE, WBC, IONO) and their splitting into test and training data.

In Table 3 the achieved prediction rates of the considered approaches for these data sets are shown. For SRNG and IT-LVQ the number of prototypes was chosen as 10% of the data of the respective classes. Learning was done up to convergence. The results for KBDL and RPDML are taken from [17] and [18]. Yet, for KBDL, results are provided for only two of the above data sets. Generally, one immediately observes that SRNG with SEM and RPDML clearly achieve the best results, whereby the differences between them are small. SNG with the Euclidean metric shows moderate accuracy, whereas the application of IT-LVQ as well as of SNG PMM and SRNG PMM leads to reduced accuracy. A general judgment of KBDL is difficult because results are available for only two data sets; for these two, the achieved accuracy is only moderate.

Table 3: Prediction rates for the analysed algorithms on the UCI data sets. KBDL-L refers to KBDL with linear kernel, KBDL-R refers to KBDL with RBF kernel. (1) Results taken from [17]; (2) results taken from [18] for both variants, linear and RBF kernel.

  data set   SNG SEM   SRNG SEM   RPDML (1)   SNG PMM   SRNG PMM   IT-LVQ   KBDL-L (2)   KBDL-R (2)
  PIMA        88.46%    90.00%     75.34%      64.89%    65.83%     65.83%      -            -
  SOYA        100.0%    100.0%     100.0%      57.00%    49.00%     100.0%      -            -
  WINE        93.22%    95.76%     99.15%      75.42%    81.36%     61.86%    72.87%       67.65%
  WBC         88.27%    97.02%     95.10%      78.46%    90.41%     63.33%      -            -
  IONO        72.91%    88.45%     88.05%      73.71%    78.09%     56.20%    80.53%       87.00%

For SNG/SRNG, the relevance learning in SRNG leads to better performance than the unparametrized counterpart (SNG) for both measures, SEM and PMM. However, the PMM as distance measure delivers poor accuracy. This can be attributed to numerical instabilities when computing the inverse of the estimated covariance matrix, which is needed for the distance computation. It is well known that matrix inversion is an ill-conditioned problem which leads to unstable behavior of the inverse matrix [41]. Hence, if the data contain large noise, this drastically influences the estimate of the inverse covariance matrix. This can be seen in the WINE data set, for example: Fig. 3 shows that the input dimensions 7 and 13 account for 54.07% and 25.44% of the overall variance, respectively. Most of the other dimensions contribute significant noise, which leads to the decreased performance using PMM. As we have seen for the artificial data sets with noise (2, 4, 6), we get the same reduced performance here. Thus we can conclude that an accurate estimation of the covariance is essential for a good performance of PMM.

The IT-LVQ approach also shows significantly reduced classification accuracy. This has several reasons: firstly, the Parzen window approximation assumes a roughly Gaussian distribution of the data within the classes, which is not necessarily fulfilled. Secondly, the approach is derived for a specific two-class scenario in which the priors of the classes are assumed to be equal, and the generalization to a multi-class scenario is only heuristic. Moreover, we observed that the performance crucially depends on the initialization. Altogether this leads to unstable behavior and unsatisfying results.

The results for KBDL with linear and RBF kernel are moderate. The performance is significantly behind the top approaches. Yet, a general discussion is not possible because we have results for only two data sets. For the given results we see that the accuracy lies between SRNG SEM / RPDML and SRNG/SNG using PMM for the IONO data set, whereas for the WINE data set it is behind them.

RPDML shows the best performance, like SRNG SEM. Hence, the regression is able to adapt the weights according to the underlying metric provided by the class information of the data. However, the minimization of the cost function (2.50) is realized by the iterative majorization algorithm [17], which iteratively computes the optimal solution. In each optimization step a matrix equation has to be solved, the dimension of which depends on the number of data points [37]. Therefore, the algorithm is computationally expensive, in particular for large data sets.

For the SRNG and SNG algorithms we also investigated their behavior on the proteomic NCI prostate cancer data set. This data set contains mass spectra of prostate cancer specimens. One class is the control class, whereas the other classes contain the data for different cancer stages.
Overall, the data set consists of 222 training and 96 test data with input dimension $D_V = 130$. We compare SRNG and SNG with a standard SVM approach in proteomics, the Unified Maximum Separability Analysis (UMSA) [43].

Figure 3: Data from the WINE data set depicted in the coordinates of dimensions 7 and 13, which account for 54.07% and 25.44% of the overall variance.

Table 4: Prediction rates obtained on the NCI data set using SRNG and SNG.

  data set   SNG-SEM   SRNG-SEM   SNG-PMM   SRNG-PMM   UMSA
  NCI         62.5%     93.7%      27.0%     39.6%     92.7%

The results are shown in Table 4. SRNG achieves an accuracy equivalent to UMSA. Yet, the advantage of SRNG is its reduced computational cost in comparison to SVM-type approaches [44] and, in particular, in comparison to UMSA [45]. Further, SRNG with SEM performs significantly better than the usual SNG. It also becomes obvious that the use of the PMM worsens the results, which may indicate, with respect to the results from the artificial data sets, that a high amount of noise is contained in this real data set. This different behavior can also be seen in the different relevance profiles, Fig. 4 and Fig. 5.

4 Conclusion

We compared different approaches of machine learning methods using adaptive metrics. The capabilities of the considered methods differ significantly. The best results are shown by the regression model RPDML and the prototype based relevance learning classifier SRNG with scaled Euclidean metric. Both models demonstrate robust behavior and consistently good results for all data sets, which represent typical problems in real world data analysis. All other approaches show remarkable drawbacks, such that an application cannot be recommended without specific knowledge and experience or verified assumptions about the data (in case of the PMM).

Figure 4: Relevance profile (relevance over input dimension) for SRNG SEM.

Figure 5: Relevance profile (relevance over input dimension) for SRNG PMM.

From the results for SNG and SRNG we can conclude that the usage of metric adaptation generally improves the classification accuracy. This result is supported by the investigation in [14]. The same property has been shown for RPDML, KBDL and IT-LVQ [17, 18, 30].

References

[1] P. Frasconi, M. Gori, and A. Sperduti, "A general framework of adaptive processing of data structures," IEEE Transactions on Neural Networks, vol. 9, no. 5.

[2] A. Sperduti and A. Starita, "Supervised neural networks for the classification of structures," IEEE Transactions on Neural Networks, vol. 8, no. 3.

[3] A. Ceroni, A. Passerini, P. Frasconi, and A. Vullo, "Predicting the disulfide bonding state of cysteines with combinations of kernel machines," Journal of VLSI Signal Processing, vol. 35, no. 3.

[4] A. Micheli, D. Sona, and A. Sperduti, "Contextual processing of structured data by recursive cascade correlation," IEEE Transactions on Neural Networks, vol. 15, no. 6.

[5] B. Hammer, A. Micheli, and A. Sperduti, "Universal approximation capability of cascade correlation for structures," Neural Computation, to appear.

[6] M. Bianchini, M. Gori, and F. Scarselli, "Processing directed acyclic graphs with recursive neural networks," IEEE Transactions on Neural Networks, vol. 12, no. 6.

[7] B. Hammer and B. J. Jain, "Neural methods for non-standard data," in European Symposium on Artificial Neural Networks 2004, M. Verleysen, Ed., d-side publications, 2004.

[8] H. Lodhi, N. Cristianini, J. Shawe-Taylor, and C. Watkins, "Text classification using string kernels," Journal of Machine Learning Research, vol. 2.

[9] T. Jaakkola, M. Diekhans, and D. Haussler, "A discriminative framework for detecting remote protein homologies," Journal of Computational Biology, vol. 7, no. 1-2.

[10] T. Gärtner, "A survey of kernels for structured data," SIGKDD Explorations.

[11] T. Kohonen and P. Somervuo, "How to make large self-organizing maps for nonvectorial data," Neural Networks, vol. 15, no. 8-9.

[12] T. Gärtner, P. A. Flach, and S. Wrobel, "On graph kernels: Hardness results and efficient alternatives," in Proc. Sixteenth Annual Conference on Computational Learning Theory and Seventh Kernel Workshop (COLT), Springer.

[13] A. Sato and K. Yamada, "Generalized learning vector quantization," in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds., vol. 7, MIT Press.

[14] B. Hammer, M. Strickert, and Th. Villmann, "Supervised neural gas with general similarity measure," Neural Processing Letters, to appear.

[15] K. Torkkola and W. M. Campbell, "Mutual information in learning feature transformations," in Proc. of the International Conference on Machine Learning ICML 2000, Stanford, CA, 2000.

[16] K. Torkkola, "Feature extraction by non-parametric mutual information maximization," Journal of Machine Learning Research, vol. 3.

[17] Z. Zhang, J. T. Kwok, and D.-Y. Yeung, "Parametric distance metric learning with label information," in Proc. of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI 03), Acapulco, Mexico, 2003.

[18] I. W. Tsang and J. T. Kwok, "Distance metric learning with kernels," in Proc. International Conference on Artificial Neural Networks (ICANN 2003), O. Kaynak, Ed., Istanbul, 2003.

[19] B. Hammer, M. Strickert, and Th. Villmann, "Supervised neural gas with general similarity measure," Neural Processing Letters, to appear.

[20] Teuvo Kohonen, Self-Organizing Maps, vol. 30 of Springer Series in Information Sciences, Springer, Berlin, Heidelberg, 1995 (Second Extended Edition).

[21] K. Crammer, R. Gilad-Bachrach, A. Navot, and A. Tishby, "Margin analysis of the LVQ algorithm," in Proc. NIPS 2002, 2.cs.cmu.edu/Groups/NIPS/NIPS2002/NIPS2002preproceedings/index.html.

[22] Th. Villmann and B. Hammer, "Supervised neural gas for learning vector quantization," in Proc. of the 5th German Workshop on Artificial Life (GWAL-5), D. Polani, J. Kim, and T. Martinetz, Eds., Akademische Verlagsgesellschaft - infix - IOS Press, Berlin.

[23] A. Sato and K. Yamada, "A formulation of learning vector quantization using a new misclassification measure," in Proceedings of the Fourteenth International Conference on Pattern Recognition, A. K. Jain, S. Venkatesh, and B. C. Lovell, Eds., vol. 1, IEEE Computer Society, Los Alamitos, CA, USA.

[24] Atsushi Sato and Kenji Yamada, "An analysis of convergence in generalized LVQ," in Proceedings of ICANN 98, the 8th International Conference on Artificial Neural Networks, L. Niklasson, M. Bodén, and T. Ziemke, Eds., vol. 1, Springer, London.

[25] Thomas M. Martinetz, Stanislav G. Berkovich, and Klaus J. Schulten, "Neural-gas network for vector quantization and its application to time-series prediction," IEEE Transactions on Neural Networks, vol. 4, no. 4.

[26] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz, "Relevance determination in learning vector quantization," in 9th European Symposium on Artificial Neural Networks (ESANN) Proceedings, D-Facto, Evere, Belgium, 2001.

[27] B. Hammer and Th. Villmann, "Generalized relevance learning vector quantization," Neural Networks, vol. 15, no. 8-9.


More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Cluster Kernels for Semi-Supervised Learning

Cluster Kernels for Semi-Supervised Learning Cluster Kernels for Semi-Supervised Learning Olivier Chapelle, Jason Weston, Bernhard Scholkopf Max Planck Institute for Biological Cybernetics, 72076 Tiibingen, Germany {first. last} @tuebingen.mpg.de

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

Radial-Basis Function Networks. Radial-Basis Function Networks

Radial-Basis Function Networks. Radial-Basis Function Networks Radial-Basis Function Networks November 00 Michel Verleysen Radial-Basis Function Networks - Radial-Basis Function Networks p Origin: Cover s theorem p Interpolation problem p Regularization theory p Generalized

More information

Chemometrics: Classification of spectra

Chemometrics: Classification of spectra Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3

More information

Comparison of Log-Linear Models and Weighted Dissimilarity Measures

Comparison of Log-Linear Models and Weighted Dissimilarity Measures Comparison of Log-Linear Models and Weighted Dissimilarity Measures Daniel Keysers 1, Roberto Paredes 2, Enrique Vidal 2, and Hermann Ney 1 1 Lehrstuhl für Informatik VI, Computer Science Department RWTH

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

Small sample size generalization

Small sample size generalization 9th Scandinavian Conference on Image Analysis, June 6-9, 1995, Uppsala, Sweden, Preprint Small sample size generalization Robert P.W. Duin Pattern Recognition Group, Faculty of Applied Physics Delft University

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Andreas Maletti Technische Universität Dresden Fakultät Informatik June 15, 2006 1 The Problem 2 The Basics 3 The Proposed Solution Learning by Machines Learning

More information

The Decision List Machine

The Decision List Machine The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca

More information

Supervised Learning Coursework

Supervised Learning Coursework Supervised Learning Coursework John Shawe-Taylor Tom Diethe Dorota Glowacka November 30, 2009; submission date: noon December 18, 2009 Abstract Using a series of synthetic examples, in this exercise session

More information

Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Generative MaxEnt Learning for Multiclass Classification

Generative MaxEnt Learning for Multiclass Classification Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,

More information

A Simple Implementation of the Stochastic Discrimination for Pattern Recognition

A Simple Implementation of the Stochastic Discrimination for Pattern Recognition A Simple Implementation of the Stochastic Discrimination for Pattern Recognition Dechang Chen 1 and Xiuzhen Cheng 2 1 University of Wisconsin Green Bay, Green Bay, WI 54311, USA chend@uwgb.edu 2 University

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Adaptive Metric Learning Vector Quantization for Ordinal Classification

Adaptive Metric Learning Vector Quantization for Ordinal Classification 1 Adaptive Metric Learning Vector Quantization for Ordinal Classification Shereen Fouad and Peter Tino 1 1 School of Computer Science, The University of Birmingham, Birmingham B15 2TT, United Kingdom,

More information

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

EM-algorithm for Training of State-space Models with Application to Time Series Prediction

EM-algorithm for Training of State-space Models with Application to Time Series Prediction EM-algorithm for Training of State-space Models with Application to Time Series Prediction Elia Liitiäinen, Nima Reyhani and Amaury Lendasse Helsinki University of Technology - Neural Networks Research

More information

3.4 Linear Least-Squares Filter

3.4 Linear Least-Squares Filter X(n) = [x(1), x(2),..., x(n)] T 1 3.4 Linear Least-Squares Filter Two characteristics of linear least-squares filter: 1. The filter is built around a single linear neuron. 2. The cost function is the sum

More information

Machine learning for pervasive systems Classification in high-dimensional spaces

Machine learning for pervasive systems Classification in high-dimensional spaces Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version

More information

Self Supervised Boosting

Self Supervised Boosting Self Supervised Boosting Max Welling, Richard S. Zemel, and Geoffrey E. Hinton Department of omputer Science University of Toronto 1 King s ollege Road Toronto, M5S 3G5 anada Abstract Boosting algorithms

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector

More information

Analysis of Multiclass Support Vector Machines

Analysis of Multiclass Support Vector Machines Analysis of Multiclass Support Vector Machines Shigeo Abe Graduate School of Science and Technology Kobe University Kobe, Japan abe@eedept.kobe-u.ac.jp Abstract Since support vector machines for pattern

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Dynamic Time-Alignment Kernel in Support Vector Machine

Dynamic Time-Alignment Kernel in Support Vector Machine Dynamic Time-Alignment Kernel in Support Vector Machine Hiroshi Shimodaira School of Information Science, Japan Advanced Institute of Science and Technology sim@jaist.ac.jp Mitsuru Nakai School of Information

More information

Linear and Non-Linear Dimensionality Reduction

Linear and Non-Linear Dimensionality Reduction Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections

More information

SMO Algorithms for Support Vector Machines without Bias Term

SMO Algorithms for Support Vector Machines without Bias Term Institute of Automatic Control Laboratory for Control Systems and Process Automation Prof. Dr.-Ing. Dr. h. c. Rolf Isermann SMO Algorithms for Support Vector Machines without Bias Term Michael Vogt, 18-Jul-2002

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

Statistical Learning. Dong Liu. Dept. EEIS, USTC

Statistical Learning. Dong Liu. Dept. EEIS, USTC Statistical Learning Dong Liu Dept. EEIS, USTC Chapter 6. Unsupervised and Semi-Supervised Learning 1. Unsupervised learning 2. k-means 3. Gaussian mixture model 4. Other approaches to clustering 5. Principle

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2

More information

Sequential Minimal Optimization (SMO)

Sequential Minimal Optimization (SMO) Data Science and Machine Intelligence Lab National Chiao Tung University May, 07 The SMO algorithm was proposed by John C. Platt in 998 and became the fastest quadratic programming optimization algorithm,

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

Learning Kernel Parameters by using Class Separability Measure

Learning Kernel Parameters by using Class Separability Measure Learning Kernel Parameters by using Class Separability Measure Lei Wang, Kap Luk Chan School of Electrical and Electronic Engineering Nanyang Technological University Singapore, 3979 E-mail: P 3733@ntu.edu.sg,eklchan@ntu.edu.sg

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Change point method: an exact line search method for SVMs

Change point method: an exact line search method for SVMs Erasmus University Rotterdam Bachelor Thesis Econometrics & Operations Research Change point method: an exact line search method for SVMs Author: Yegor Troyan Student number: 386332 Supervisor: Dr. P.J.F.

More information

Supervised locally linear embedding

Supervised locally linear embedding Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,

More information

Adaptive Metric Learning Vector Quantization for Ordinal Classification

Adaptive Metric Learning Vector Quantization for Ordinal Classification ARTICLE Communicated by Barbara Hammer Adaptive Metric Learning Vector Quantization for Ordinal Classification Shereen Fouad saf942@cs.bham.ac.uk Peter Tino P.Tino@cs.bham.ac.uk School of Computer Science,

More information

Microarray Data Analysis: Discovery

Microarray Data Analysis: Discovery Microarray Data Analysis: Discovery Lecture 5 Classification Classification vs. Clustering Classification: Goal: Placing objects (e.g. genes) into meaningful classes Supervised Clustering: Goal: Discover

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information