Comparison of Relevance Learning Vector Quantization with other Metric Adaptive Classification Methods

Th. Villmann (1), F. Schleif (2), and B. Hammer (3)

(1) University Leipzig, Clinic for Psychotherapy
(2) Bruker Daltonik GmbH Leipzig and University Leipzig, Dept. of Computer Science
(3) Clausthal University of Technology, Department of Mathematics and Computer Science

February 9, 2005

Corresponding author: University Leipzig, Clinic for Psychotherapy, Karl-Tauchnitz-Str. 25, Leipzig, Germany, villmann@informatik.uni-leipzig.de

Abstract

The paper deals with the concept of relevance learning in learning vector quantization and classification. Recent machine learning approaches with the ability of metric adaptation, but based on different concepts, are considered in comparison to variants of relevance learning vector quantization. We compare these methods with respect to their theoretical motivation and demonstrate the differences in their behavior for several real world data sets.

Keywords: learning vector quantization, relevance learning, metric adaptation, classification

1 Introduction

Data in interesting domains such as language processing, logic, chemistry, and bioinformatics often possess an inherent structure. Typical difficulties arise for machine learning within these domains: a very high or varying data dimensionality, correlations of the data elements, or a sparsely covered data space, to name just a few. Because of this, standard vector processing by means of Euclidean vectors faces severe problems in these cases, and several approaches have been developed to deal with these data structures in a more adequate way. A very successful and interesting possibility for recursive data structures is given by the dynamics of recurrent and recursive neural networks [1, 2]. Recently, this idea has been extended to more general graph structures in several ways [3, 4, 5, 6]; however, the data structures which can be tackled in this way are still restricted. Another very general alternative for dealing with structured data is offered by similarity based machine learning approaches. Here, fairly general data structures can be handled as soon as a similarity measure for these structures has been defined or the data are embedded into a metric space [7]. A popular application of this idea can be found in connection with support vector machines (SVM) and other kernel methods, where a variety of different kernels such as string kernels, graph kernels, or kernels derived from a probabilistic model have been defined [8, 9, 10]. The use of specific kernels is not restricted to the SVM but can readily be transferred to general metric based approaches such as the median self-organizing map or nearest neighbor classification [11]. Naturally, the similarity measure plays a crucial role in these approaches, and an appropriate choice of the distance might face severe difficulties.

The design of general similarity measures which can be used for any learning task, i.e. which guarantee a universal approximation ability and distinguishability of arbitrary structures, is one possible line of research; however, the resulting representation of data is often too complex for the concrete task and, moreover, a universal design might not be efficient for complex data structures due to principled problems [12]. Therefore, similarity measures which are constructed for the concrete problem based on the given data are particularly interesting, since they offer an automated design of problem specific representations. A prime example of this idea is the Fisher kernel, which derives a similarity measure based on a statistical model of the data [9]. Still, the resulting kernel is fairly general since it mirrors general statistical properties of the given data set. In the case of supervised learning tasks such as classification, only those properties are relevant which are related to the class labels, whereas statistical information which is independent of the class distribution can be abandoned. The focus of this article is a presentation and comparison of approaches which adapt a similarity measure based on given class information for supervised learning tasks.

Pattern classification plays an important role in data processing. It is used for discrimination of patterns according to certain criteria, which may be of statistical type, structural differences, feature discrimination, etc. Thereby the representation of the objects significantly influences the ability for discrimination. An improper representation of the objects may lead to vanishing differences, whereas a suitable representation offers a clear separation of the object classes. In this sense, classical statistical discriminant analysis techniques like Fisher discriminant analysis project data onto a one-dimensional representation which should deliver the best separation of classes. Obviously, the optimal representation depends on the classification task. It is closely related to the definition of the similarity or metric between the objects used for classification. Hence, the similarity should be chosen adequately for the given classification task. An appropriate choice of the metric can substitute adaptations of the representation and vice versa. Although quite common, the standard Euclidean metric may therefore not be the best choice.

One family of intuitive metric based classification algorithms is learning vector quantization (LVQ). This family comprises prototype based algorithms which try to represent the objects by typical representatives (prototypes, weight vectors). The discriminant property is realized by a labeling of the prototypes such that they adapt specifically according to the given classes. As described in more detail later, non-standard metrics as well as metric adaptation can easily be included in advanced modifications of LVQ. A further famous approach for classification are support vector machines (SVMs). Here, the key idea is to map the data into a possibly high-dimensional representation space which allows a linear separation of the classes. The choice of the mapping (kernel mapping) is crucial, and the incorporation of metric adaptation is thus quite interesting.

In the present paper we compare several machine learning approaches for classification in the light of metric adaptation and the usage of non-standard metrics according to the given classification task. In particular, we compare the recent developments of relevance learning in LVQ, distance metric learning in SVM-like approaches, and relevance learning in information theoretic LVQ. These approaches represent different paradigms: prototype based classification, kernel regression classification, kernel mapping based classification, and mutual information maximization based classification, respectively. We demonstrate the consequences in several real life experiments. Since the approaches have been designed for standard vectorial data, we also present results for vectorial data sets to allow a fair comparison of the methods. However, it turns out that even for this comparably simple vector representation of data a problem adapted metric, which integrates some structure into the problem in the form of relevance information, is superior to a simple Euclidean metric. To demonstrate the principled applicability of the methods to more complex data, we also include an example from bioinformatics with more complex input signals, the classification of spectra. Since these data are obtained as the function values of a spectrum at different wave lengths, typical characteristics of structured data can be observed: a very high dimensionality, a close correlation of subsequent entries, and an only sparsely covered data space. Here, the design of metrics which also take the generalization ability of the classifier into account and which go beyond the standard Euclidean metric is of crucial importance.

The paper is structured as follows: in Sect. 2 the investigated methods are described, including the respective extensions to non-standard metrics and metric adaptation. They are compared for several classification tasks in Sect. 3, followed by a summary.

2 Methods for classification

We concentrate on four methods for classification in machine learning according to the four directions outlined above and consider them in the context of non-standard metrics and metric adaptation. The chosen methods are popular representatives of their respective methodologies:

- prototype based classification: Generalized Learning Vector Quantization (GLVQ), an extension of the basic LVQ algorithms introduced by Sato & Yamada [13], combined with relevance learning and neighborhood cooperativeness (Supervised Relevance Neural Gas, SRNG) [14];
- mutual information maximization: Information Theoretic LVQ (IT-LVQ) according to the information theoretic measures proposed by Torkkola [15], [16];
- kernel regression classification: Regression Parametric Distance Metric Learning (RPDML) introduced by Zhang et al. [17];
- kernel based classification: Kernel Based Distance Learning (KBDL) by Tsang & Kwok [18], [17].

In the following we briefly explain the basic ideas of the algorithms. This is followed by numerical considerations.

2.1 Supervised Neural Gas for Generalized Learning Vector Quantization

Supervised Neural Gas is considered as a representative of prototype based classification approaches. It can be combined with the demanded feature of relevance learning. Moreover, it is a stochastic gradient descent algorithm which acts as a margin optimizer with known bounds on the generalization ability [19].

2.1.1 Basic Model

Standard LVQ2.1 as proposed by Kohonen is a heuristic approach to reduce the classification error in supervised learning [20]. However, the adaptation dynamic does not minimize any continuous cost function and shows instabilities. In addition, the result of LVQ2.1 crucially depends on the initialization of the prototypes, which are commonly initialized with the center points of the classes or with random representatives from the training set. The algorithms use iterative local learning rules which can easily get stuck in local optima. GLVQ avoids the numerical instabilities of LVQ2.1 due to a stochastic gradient descent on a cost function [13] which optimizes the margin [21].

However, it crucially depends on the initialization. To overcome this drawback, a combination of GLVQ with neural gas (NG) has been proposed by the authors in such a way that a cost function is minimized through the learning rule [22]. This cost function leads to a training similar to NG or to simple GLVQ, respectively, depending on the choice of a parameter of the cost function. During training, this parameter is varied so that neighborhood cooperation assures a distribution of the prototypes among the data set at the beginning, and a good separation of the classes is accounted for at the end of training. We now shortly present the respective formal descriptions.

Let us first clarify some notation: let $c_v \in L$ be the label of input $v$, with $L$ a set of labels (classes) and $\#L = N_L$. Let $V \subseteq \mathbb{R}^{D_V}$ be a finite set of inputs $v$. LVQ uses a fixed number of prototypes (weight vectors, codebook vectors) for each class. Let $W = \{w_r\}$ be the set of all codebook vectors and $c_r$ the class label of $w_r$. Furthermore, let $W_c = \{w_r \mid c_r = c\}$ be the subset of prototypes assigned to class $c \in L$. The task of vector quantization is realized by the map $\Psi$ as a winner-take-all rule, i.e. a stimulus vector $v \in V$ is mapped onto that neuron $s \in A$ the pointer $w_s$ of which is closest to the presented stimulus vector $v$,

$$\Psi^{\lambda}_{V \to A} : v \mapsto s(v) = \operatorname{argmin}_{r \in A} d^{\lambda}(v, w_r) \qquad (2.1)$$

with $d^{\lambda}(v, w)$ being an arbitrary differentiable similarity measure which may depend on a parameter vector $\lambda$. (A similarity measure is a non-negative real-valued function of two variables which, in contrast to a distance measure, does not necessarily fulfill the triangle inequality and need not be symmetric; naturally, each distance measure is a similarity measure.) For the moment we take $\lambda$ as fixed. The neuron $s(v)$ is called winner or best matching unit. The subset of the input space

$$\Omega^{\lambda}_{r} = \{v \in V : r = \Psi_{V \to A}(v)\} \qquad (2.2)$$

which is mapped to a particular neuron $r$ according to (2.1) forms the (masked) receptive field of that neuron, yielding a Voronoi tessellation. If the class information of the weight vectors is used, the boundaries of the $\Omega^{\lambda}_{r}$ generate the decision boundaries for the classes. A training algorithm should adapt the prototypes such that for each class $c \in L$ the corresponding codebook vectors $W_c$ represent the class as accurately as possible. This means that the set of points in any given class, $V_c = \{v \in V \mid c_v = c\}$, and the union $U_c = \bigcup_{w_r \in W_c} \Omega_r$ of the receptive fields of the corresponding prototypes should differ as little as possible.

We now consider the Generalized Learning Vector Quantization approach (GLVQ). The main idea is to introduce a cost function such that the learning rule gives a gradient descent on it. At the same time it should assess the number of misclassifications of the prototype based classification. Let $f(x) = (1 + \exp(-x))^{-1}$ be the logistic function. GLVQ minimizes the cost function

$$\mathrm{Cost}_{GLVQ} = \sum_{v} f(\mu_{\lambda}(v)) \qquad (2.3)$$

$$\mu_{\lambda}(v) = \frac{d^{\lambda}_{r^+} - d^{\lambda}_{r^-}}{d^{\lambda}_{r^+} + d^{\lambda}_{r^-}} \qquad (2.4)$$

via stochastic gradient descent, where $d^{\lambda}_{r^+}$ is the squared distance of the input vector $v$ to the nearest codebook vector labeled with $c_{r^+} = c_v$, say $w_{r^+}$, and $d^{\lambda}_{r^-}$ is the squared distance to the best matching prototype labeled with $c_{r^-} \neq c_v$, say $w_{r^-}$.
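To make the cost function concrete, the following minimal numpy sketch evaluates $\mu_{\lambda}(v)$ of (2.4) and one summand of (2.3) for a single labeled sample. It is not the authors' implementation; the scaled squared Euclidean distance introduced later in (2.20) is assumed as the similarity measure, and all function and variable names are illustrative.

```python
import numpy as np

def glvq_mu(v, c_v, prototypes, labels, lam):
    """mu_lambda(v) of eq. (2.4) for one labeled sample, using the scaled
    squared Euclidean distance d_lambda(v, w) = sum_k lam_k (v_k - w_k)^2."""
    d = np.sum(lam * (prototypes - v) ** 2, axis=1)      # d_lambda(v, w_r) for all prototypes
    correct = labels == c_v
    r_plus = np.argmin(np.where(correct, d, np.inf))     # closest prototype with the correct label
    r_minus = np.argmin(np.where(~correct, d, np.inf))   # closest prototype with a wrong label
    return (d[r_plus] - d[r_minus]) / (d[r_plus] + d[r_minus]), r_plus, r_minus

def logistic(x):
    """f(x) = (1 + exp(-x))^-1; one summand of Cost_GLVQ (2.3) is logistic(glvq_mu(...)[0])."""
    return 1.0 / (1.0 + np.exp(-x))
```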

As shown in [23], the usage of the function $\mu_{\lambda}(v)$ yields robust behavior whereas LVQ2.1 does not. The learning rule of GLVQ is obtained by taking the derivatives of the above cost function. Using

$$\frac{\partial \mu_{\lambda}(v)}{\partial w_{r^+}} = \xi^+ \frac{\partial d^{\lambda}_{r^+}}{\partial w_{r^+}} \quad \text{and} \quad \frac{\partial \mu_{\lambda}(v)}{\partial w_{r^-}} = \xi^- \frac{\partial d^{\lambda}_{r^-}}{\partial w_{r^-}}$$

with

$$\xi^+ = \frac{2\, d^{\lambda}_{r^-}}{(d^{\lambda}_{r^+} + d^{\lambda}_{r^-})^2} \quad \text{and} \quad \xi^- = \frac{-2\, d^{\lambda}_{r^+}}{(d^{\lambda}_{r^+} + d^{\lambda}_{r^-})^2} \qquad (2.5)$$

one obtains for the weight updates [13]:

$$\triangle w_{r^+} = -\epsilon^+ \, f'(\mu_{\lambda}(v)) \, \xi^+ \, \frac{\partial d^{\lambda}_{r^+}}{\partial w_{r^+}} \qquad (2.6)$$

$$\triangle w_{r^-} = -\epsilon^- \, f'(\mu_{\lambda}(v)) \, \xi^- \, \frac{\partial d^{\lambda}_{r^-}}{\partial w_{r^-}} \qquad (2.7)$$

$\epsilon^+, \epsilon^-$ are learning rates. As shown in [24], the above learning rules are also valid in case of a continuous data distribution.

The original (unsupervised) Neural Gas (NG) adapts unlabeled prototypes $w_r \in W$ according to a given data set such that the cost function

$$\mathrm{Cost}_{NG}(\gamma) = \frac{1}{C(\gamma, K)} \sum_{r} e_r(\gamma) \qquad (2.8)$$

is minimized, with local costs

$$e_r(\gamma) = \int P(v)\, h_{\gamma}(r, v, W)\, (v - w_r)^2 \, dv \qquad (2.9)$$

and the neighborhood function known from NG:

$$h_{\gamma}(r, v, W) = \exp\left(-\frac{k_r(v, W)}{\gamma}\right). \qquad (2.10)$$

Thereby $k_r(v, W)$ yields the number of prototypes $w_{r'}$ for which the relation $d^{\lambda}(v, w_{r'}) < d^{\lambda}(v, w_r)$ is valid, i.e. $k_r(v, W)$ is the winner rank [25]. $C(\gamma, K)$ is a normalization constant depending on the neighborhood range $\gamma$ and the cardinality $K$ of $W$. The learning rule minimizing the cost function reads

$$\triangle w_r = \epsilon\, h_{\gamma}(r, v, W)\, (v - w_r). \qquad (2.11)$$

The initialization of the prototypes is no longer crucial in NG because of the involved neighborhood cooperation.

As mentioned above, Supervised Neural Gas (SNG) constitutes a combination of GLVQ and NG. Again, let $W_c = \{w_r \mid c_r = c\}$ be the subset of prototypes assigned to class $c \in L$ and $K_c$ its cardinality. Further we assume to have $m$ data vectors $v_i$. As pointed out in [22], the neighborhood learning for a given input $v_i$ with label $c$ is applied to the subset $W_c$. The respective cost function is

$$\mathrm{Cost}_{SNG}(\gamma) = \sum_{i=1}^{m} \sum_{r:\, w_r \in W_{c_i}} f(\mu_{\lambda}(r, v_i)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \qquad (2.12)$$

with $f(x) = (1 + \exp(-x))^{-1}$ and $\mu_{\lambda}(r, v) = \frac{d^{\lambda}_{r} - d^{\lambda}_{r^-}}{d^{\lambda}_{r} + d^{\lambda}_{r^-}}$, whereby $d^{\lambda}_{r^-}$ is defined as in GLVQ above and $d^{\lambda}_{r} = d^{\lambda}(v, w_r)$. The neighborhood cooperativeness makes sure that prototypes are spread faithfully among the data of their respective classes.
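The rank-based neighborhood weighting $h_{\gamma}$ of (2.10), which also enters (2.12), can be sketched as follows; a minimal illustration assuming the scaled squared Euclidean distance, with illustrative names.

```python
import numpy as np

def neighborhood_weights(v, prototypes, gamma, lam):
    """h_gamma(r, v, W) = exp(-k_r(v, W) / gamma) of eq. (2.10), where
    k_r(v, W) is the winner rank of prototype w_r (0 for the winner)."""
    d = np.sum(lam * (prototypes - v) ** 2, axis=1)   # distances to all prototypes in W
    ranks = np.argsort(np.argsort(d))                 # k_r(v, W): rank of each prototype in the sorted order
    return np.exp(-ranks / gamma)
```

In SNG/SRNG this weighting would be evaluated only on the class-specific subset $W_{c_i}$, cf. (2.12).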

Note that $\lim_{\gamma \to 0} \mathrm{Cost}_{SNG}(\gamma) = \mathrm{Cost}_{GLVQ}$ holds [22]. Hence, for vanishing neighborhood the SNG also becomes optimal in the sense of margin analysis, as detailed below. However, if the neighborhood range $\gamma$ is large, typically at the beginning of the training, the prototypes of one class share their responsibilities for a given input. Hence, neighborhood cooperation is involved such that the initialization of the prototypes is no longer crucial. Given a training example $(v_i, c_i)$, all prototypes $w_r \in W_{c_i}$ and the closest wrong prototype $w_{r^-}$ are adapted. Taking now

$$\xi^+_r = \frac{2\, d^{\lambda}_{r^-}}{(d^{\lambda}_{r} + d^{\lambda}_{r^-})^2} \quad \text{and} \quad \xi^-_r = \frac{-2\, d^{\lambda}_{r}}{(d^{\lambda}_{r} + d^{\lambda}_{r^-})^2} \qquad (2.13)$$

we get the update rules

$$\triangle w_r = -\epsilon^+ \, \xi^+_r \, f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \, \frac{\partial d^{\lambda}_{r}}{\partial w_r} \qquad (2.14)$$

$$\triangle w_{r^-} = -\epsilon^- \sum_{r:\, w_r \in W_{c_i}} \xi^-_r \, f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \, \frac{\partial d^{\lambda}_{r^-}}{\partial w_{r^-}} \qquad (2.15)$$

We remark that the neighborhood cooperativeness is applied only to the correct prototypes. One could also include neighborhood cooperation for the wrong prototypes; however, as shown in [22], this yields instabilities of learning.

Margin analysis of an algorithm is important to assess the level of confidence of a classifier with respect to its decision. For example, the sample margin is defined as the distance between the input and the decision boundary. The natural choice of this margin to be maximized in learning vector quantization causes numerical instabilities [21]. An alternative definition is the hypothesis margin: this margin is the distance that the classifier can move without changing the way it labels any sample data. Thereby, the sample margin majorizes the hypothesis margin. The hypothesis margin of a prototype based classifier is given by

$$\left(d^{\lambda}_{r^-}\right)^{\frac{1}{2}} - \left(d^{\lambda}_{r^+}\right)^{\frac{1}{2}}. \qquad (2.16)$$

In fact, GLVQ (and, hence, SNG for vanishing neighborhood) maximizes a cost function closely related to this hypothesis margin and, hence, can be taken as a maximum margin algorithm, provided the similarity measure can be interpreted as a kernelized version of the Euclidean metric, i.e. the similarity measure is symmetric and its negative is conditionally positive definite [19]. Further, any given margin provides an upper bound for the generalization error of the classifier, such that the higher the margin, the lower the generalization error [21]. For a more detailed analysis we refer to [21].

2.1.2 Non-Standard Metrics, Relevance Learning and Metric Adaptation in SNG

We now consider relevance learning in the above introduced SNG, i.e. we study the influence of the parameter vector $\lambda = (\lambda_1, \ldots, \lambda_m)$ in the distance measure $d^{\lambda}(v, w)$ used in (2.1). In other words, we are interested in an adaptation of the distance measure in dependence on the $\lambda_k$ to minimize the cost function; we are looking for the relevance of the parameters [26]. Then an adaptation step for the parameters $\lambda_k$ has to be added to the usual weight vector adaptation. Thereby, we assume that $\lambda_k \ge 0$ and $\sum_k \lambda_k = 1$. Relevance learning for SNG is referred to as SRNG, and we get

$$\triangle \lambda_k = -\epsilon_{\lambda} \sum_{r:\, w_r \in W_{c_i}} f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \left( \xi^+_r \frac{\partial d^{\lambda}_{r}}{\partial \lambda_k} + \xi^-_r \frac{\partial d^{\lambda}_{r^-}}{\partial \lambda_k} \right) \qquad (2.17)$$

using

$$\frac{\partial \mu_{\lambda}(v)}{\partial \lambda_k} = \xi^+ \frac{\partial d^{\lambda}_{r^+}}{\partial \lambda_k} + \xi^- \frac{\partial d^{\lambda}_{r^-}}{\partial \lambda_k}, \qquad (2.18)$$

followed by a renormalization. $\epsilon_{\lambda} > 0$ is the learning rate.
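For illustration, the hypothesis margin (2.16) discussed above can be evaluated sample-wise for a trained prototype classifier; a minimal sketch assuming the scaled squared Euclidean distance, with all names being illustrative.

```python
import numpy as np

def hypothesis_margin(v, c_v, prototypes, labels, lam):
    """Sample-wise hypothesis margin following eq. (2.16):
    sqrt(d_{r-}) - sqrt(d_{r+}); positive values indicate a correct
    classification with a safety margin."""
    d = np.sum(lam * (prototypes - v) ** 2, axis=1)
    d_plus = np.min(d[labels == c_v])    # nearest prototype of the correct class
    d_minus = np.min(d[labels != c_v])   # nearest prototype of any wrong class
    return np.sqrt(d_minus) - np.sqrt(d_plus)
```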

For vanishing neighborhood cooperativeness $\gamma \to 0$, SRNG turns into Generalized Relevance LVQ (GRLVQ), i.e. relevance learning in GLVQ [27], with

$$\triangle \lambda_k = -\epsilon_{\lambda} \, f'(\mu_{\lambda}(r, v)) \, \frac{\partial \mu_{\lambda}(r, v)}{\partial \lambda_k}. \qquad (2.19)$$

Again, as above, the update rule is also valid for the continuous case [24]. The learning rate $\epsilon_{\lambda}$ should in both approaches be at least a magnitude smaller than the learning rates for the weight vectors. Then the weight vector adaptation takes place in a quasi stationary environment with respect to the (slowly changing) metric. Hence, the margin optimization takes place for each level of the parameter vector $\lambda$, and we get an overall optimization of the margin including a relevance weighting of the parameters.

As pointed out before, the similarity measure $d^{\lambda}(v, w)$ is only required to be differentiable with respect to $\lambda$ and $w$. The triangle inequality does not necessarily have to be fulfilled. This leads to a great freedom in the choice of suitable measures and allows the usage of non-standard metrics in a natural way. In particular, kernel based similarity measures are allowed [19]. In this way SNG/SRNG are comparable to SVMs [28]. Interestingly, the margin analysis of this LVQ version also holds for adaptive relevance metrics and kernelizations thereof, as shown in [28].

In case that

$$d^{\lambda}_{V}(v, w) = \sum_{i=1}^{D_V} \lambda_i (v_i - w_i)^2 \qquad (2.20)$$

is the squared, scaled Euclidean distance, whereby again $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, we immediately obtain from (2.6) and (2.7) for the GLVQ update:

$$\triangle w_{r^+} = \epsilon^+ \, 2\, f'(\mu_{\lambda}(v)) \, \xi^+ \, \Lambda (v - w_{r^+}) \qquad (2.21)$$

$$\triangle w_{r^-} = \epsilon^- \, 2\, f'(\mu_{\lambda}(v)) \, \xi^- \, \Lambda (v - w_{r^-}), \qquad (2.22)$$

respectively, with $\Lambda$ being the diagonal matrix with entries $\lambda_1, \ldots, \lambda_{D_V}$, and the relevance parameter update (2.19) in GRLVQ reads

$$\triangle \lambda_k = -\epsilon_{\lambda} \, f'(\mu_{\lambda}(v)) \left( \xi^+ (v - w_{r^+})^2_k + \xi^- (v - w_{r^-})^2_k \right) \qquad (2.23)$$

with $k = 1, \ldots, D_V$. In case of SNG and SRNG we get for the updates with the scaled Euclidean distance

$$\triangle w_r = 2\epsilon^+ \, \xi^+_r \, f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \, \Lambda (v - w_r) \qquad (2.24)$$

$$\triangle w_{r^-} = 2\epsilon^- \sum_{r:\, w_r \in W_{c_i}} \xi^-_r \, f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \, \Lambda (v - w_{r^-}), \qquad (2.25)$$

with $\Lambda$ being again the diagonal matrix with entries $\lambda_1, \ldots, \lambda_{D_V}$, and

$$\triangle \lambda_k = -\epsilon_{\lambda} \sum_{r:\, w_r \in W_{c_i}} f'(\mu_{\lambda}(r, v)) \, \frac{h_{\gamma}(r, v_i, W_{c_i})}{C(\gamma, K_{c_i})} \left( \xi^+_r (v - w_r)^2_k + \xi^-_r (v - w_{r^-})^2_k \right). \qquad (2.26)$$
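As an illustration of the scaled Euclidean case, here is a minimal numpy sketch of one stochastic GRLVQ step, i.e. the $\gamma \to 0$ limit of SRNG, following the structure of (2.21)-(2.23). It is not the authors' code; the learning rates are illustrative, and the neighborhood weighting of (2.24)-(2.26) is omitted.

```python
import numpy as np

def grlvq_step(v, c_v, prototypes, labels, lam,
               eps_plus=0.1, eps_minus=0.05, eps_lam=0.001):
    """One stochastic GRLVQ update with the scaled Euclidean metric (2.20).
    prototypes (n_proto x dim) and lam (dim,) are modified in place."""
    d = np.sum(lam * (prototypes - v) ** 2, axis=1)
    correct = labels == c_v
    rp = np.argmin(np.where(correct, d, np.inf))     # index r+
    rm = np.argmin(np.where(~correct, d, np.inf))    # index r-
    dp, dm = d[rp], d[rm]
    mu = (dp - dm) / (dp + dm)
    fprime = np.exp(-mu) / (1.0 + np.exp(-mu)) ** 2  # derivative of the logistic function f
    xi_p = 2.0 * dm / (dp + dm) ** 2                 # xi^+ of (2.5)
    xi_m = -2.0 * dp / (dp + dm) ** 2                # xi^- of (2.5), carries the repulsion sign
    diff_p, diff_m = v - prototypes[rp], v - prototypes[rm]
    prototypes[rp] += eps_plus * 2.0 * fprime * xi_p * lam * diff_p   # (2.21): attract w_{r+}
    prototypes[rm] += eps_minus * 2.0 * fprime * xi_m * lam * diff_m  # (2.22): repel w_{r-}
    lam -= eps_lam * fprime * (xi_p * diff_p ** 2 + xi_m * diff_m ** 2)  # (2.23)
    np.clip(lam, 0.0, None, out=lam)                 # keep lambda_k >= 0
    lam /= lam.sum()                                 # renormalize to sum 1
```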

Another popular similarity measure is the (parametrized) Mahalanobis distance

$$d^{\lambda}(v, w) = (v - w)^T C_{\lambda}^{-1} (v - w) \qquad (2.27)$$

with $C_{\lambda}$ being the parametrized covariance matrix defined as follows: let $C = P^T D P$ be the diagonal representation of the usual covariance matrix with diagonal matrix $D$. Then the parametrized diagonal matrix $D_{\lambda}$ is obtained as $D_{\lambda} = \Lambda D$, where $\Lambda$ is the diagonal matrix whose diagonal elements $\Lambda_{ii} = \lambda_i$ form the parameter vector $\lambda$. Finally we define

$$C_{\lambda} = P^T D_{\lambda} P. \qquad (2.28)$$

The application of this measure allows a relevance ranking of the principal components of the data distribution for the given classification task.
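A small numpy sketch of this parametrized Mahalanobis distance, computing (2.27)-(2.28) from an eigendecomposition of the empirical covariance; the names and the particular linear solve are illustrative choices, not the authors' implementation.

```python
import numpy as np

def parametrized_mahalanobis(v, w, data, lam):
    """Parametrized Mahalanobis distance of eqs. (2.27)-(2.28): the usual
    covariance C = P^T D P is rescaled along its principal axes by the
    relevance factors lam (D_lambda = Lambda D) before inversion."""
    C = np.cov(data, rowvar=False)              # empirical covariance of the data set
    eigval, Q = np.linalg.eigh(C)               # C = Q diag(eigval) Q^T, i.e. P = Q^T
    C_lam = Q @ np.diag(lam * eigval) @ Q.T     # relevance-weighted covariance C_lambda
    diff = v - w
    # (v - w)^T C_lambda^{-1} (v - w); the solve can be ill-conditioned for
    # noisy data, cf. the discussion of the PMM in Sec. 3
    return diff @ np.linalg.solve(C_lam, diff)
```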

2.2 Learning Classification According to Information Theoretic Measures

The information theoretic classification is based on maximization of the mutual information between the class information and the data in order to approximate the Bayes error.

2.2.1 Basic Model for Classification Learning

We now describe the information theory based learning vector quantization (IT-LVQ) proposed in [29], [30], [15]. The basic idea is the usage of the mutual information between the class labels of the input vectors, taken as a random variable, and the network output. This approach is based on the following considerations: the ultimate criterion for classification tasks is the Bayes error, yet the mutual information can be taken as a proper approximation. Generally, the mutual information $I(Y, X)$ measures the information transfer between the random variables $X$ and $Y$:

$$I(Y, X) = H(Y) + H(X) - H(Y, X) = H(Y) - H(Y|X) = H(X) - H(X|Y) \qquad (2.29)$$

with $H(Y)$ usually being the Shannon entropy and $H(Y|X)$ its conditional counterpart. As pointed out in [16], the Bayes error $B(Y)$ has an upper bound [31] given by

$$B(Y) \le \frac{1}{2} H(X|Y) = \frac{1}{2}\left(H(X) - I(Y, X)\right) \qquad (2.30)$$

and a lower bound given by Fano's inequality,

$$P\!\left(Y \neq \hat{Y}\right) \ge \frac{H(Y|X) - 1}{\log(\aleph)} = \frac{H(Y) - I(Y, X) - 1}{\log(\aleph)} \qquad (2.31)$$

with $\aleph$ the number of possible instances of $Y$ and $\hat{Y}$ an estimate of $Y$ [32]. Hence, maximizing $I(Y, X)$ is equivalent to minimizing both bounds. Therefore, the mutual information can be taken as an estimate for the Bayes error.

Now we can formulate the problem of classification in the following way: let

$$y_i = g(v_i, w) \qquad (2.32)$$

be a transfer function for a given input $v_i \in V$ with weights $w$. Then the formal stochastic gradient approach according to the mutual information $I$ can be written for the weights $w$ as

$$\triangle w = \varepsilon_w \frac{\partial I}{\partial w} \quad \text{with} \quad \frac{\partial I}{\partial w} = \sum_i \frac{\partial I}{\partial y_i} \frac{\partial y_i}{\partial w}. \qquad (2.33)$$

Hence, the task is to determine $\frac{\partial I}{\partial y_i}$. For this purpose we consider $I(Y, X)$ and specify $X = L$, again with $L$ being the set of labels (classes) and $\#L = N_L$, i.e. we are interested in maximizing the mutual information between the random sequence of labels $c_v$ according to the input sequence of $v$ and the variable $y$ by adaptation of the weights $w$.

For high-dimensional, large data sets the computational costs become intractable. Therefore, a more convenient approximation is necessary. Such an approach is provided in [16] based on the fundamental work [29]. The key idea is to replace the Shannon entropy in the definition of the mutual information to make the computation tractable. In fact, according to [33] the third axiom of additivity/recursiveness of the Shannon entropy is not necessary if one is only interested in maximizing or minimizing the entropy of a system [16]. Therefore, other entropy measures can be used in this case [34]. For easier numerical computation, Renyi's entropy is considered [35],

$$H_{\mathrm{Renyi}}(\alpha) = \frac{1}{1 - \alpha} \ln\left(\sum_{l \in L} (p_l)^{\alpha}\right), \qquad (2.34)$$

instead of the Shannon entropy for use in the mutual information. For $\alpha = 2$ it is called the quadratic entropy. Then the mutual information (2.29) can be written as

$$I(Y, L) = \sum_{l \in L} \int p^2(l, y)\, dy + \left(\sum_{l \in L} p_l^2\right) \int p^2(y)\, dy - 2 \sum_{l \in L} p_l \int p(l, y)\, p(y)\, dy \qquad (2.35)$$

with $p_l$ the a priori probabilities of class $l$, as shown in [16]. One can estimate the probability density using a Parzen window approximation with spherical Gaussian kernels $G$ as

$$p(y) = \frac{1}{N} \sum_{i=1}^{N} G\left(y - y_i,\, \sigma^2 \mathbf{1}\right) \qquad (2.36)$$

with the Gaussian kernel in $d$-dimensional space defined as

$$G(y, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left(-\frac{1}{2} y^T \Sigma^{-1} y\right). \qquad (2.37)$$

Thus, the mutual information $I$ can be estimated as

$$I(Y, L) \approx \frac{1}{N^2} \sum_{l=1}^{N_L} \sum_{i,j=1}^{N_l} G\left(y_{li} - y_{lj},\, 2\sigma^2 \mathbf{1}\right) + \frac{1}{N^2} \left(\sum_{l=1}^{N_L} p_l^2\right) \sum_{i,j=1}^{N} G\left(y_i - y_j,\, 2\sigma^2 \mathbf{1}\right) - \frac{2}{N^2} \sum_{l=1}^{N_L} p_l \sum_{i=1}^{N_l} \sum_{j=1}^{N} G\left(y_{li} - y_j,\, 2\sigma^2 \mathbf{1}\right) \qquad (2.38, 2.39)$$

using the convolution properties of Gaussian kernels.
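A direct transcription of this Parzen estimate into numpy might look as follows; this is a sketch for illustration only (quadratic in the number of samples), and all names are illustrative.

```python
import numpy as np

def gauss(diff, var):
    """Spherical Gaussian kernel G(y, var * 1) of eq. (2.37), evaluated at y = diff."""
    d = diff.shape[-1]
    return (2.0 * np.pi * var) ** (-d / 2.0) * np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / var)

def quadratic_mutual_information(y, labels, sigma):
    """Parzen-window estimate of I(Y, L) following the three-term structure
    of eq. (2.38); y holds the transformed samples row-wise."""
    n = len(y)
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / n                                            # a priori class probabilities p_l
    pair = gauss(y[:, None, :] - y[None, :, :], 2.0 * sigma ** 2)  # G(y_i - y_j, 2 sigma^2 1)
    same = labels[:, None] == labels[None, :]
    term1 = pair[same].sum() / n ** 2                              # within-class interactions
    term2 = (priors ** 2).sum() * pair.sum() / n ** 2              # all interactions, weighted by sum_l p_l^2
    term3 = 2.0 / n ** 2 * sum(p * pair[labels == c].sum() for c, p in zip(classes, priors))
    return term1 + term2 - term3
```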

This formula can be used to carry out the derivative $\frac{\partial I}{\partial y}$. However, the computation is still expensive. Therefore, in [15] a stochastic approximation based on the above Parzen estimate is suggested: it is assumed for the moment that the database $V$ only contains two inputs $v_1$ and $v_2$ with two class possibilities. Then two cases have to be handled separately: (a) the transformed data $y_1$ and $y_2$ both come from the same class, or (b) they come from different classes. In the first case it follows that $I(Y, L) \equiv 0$ holds. In the second case (b) one gets

$$I(Y, L) = \frac{1}{4}\left( G\left(0,\, 2\sigma^2 \mathbf{1}\right) - G\left(y_1 - y_2,\, 2\sigma^2 \mathbf{1}\right) \right) \qquad (2.40)$$

which leads to the derivatives

$$\frac{\partial I}{\partial y_1} = \frac{1}{8\sigma^2}\, G\left(y_1 - y_2,\, 2\sigma^2 \mathbf{1}\right) (y_1 - y_2) = -\frac{\partial I}{\partial y_2}. \qquad (2.41)$$

In this way we can rewrite the general update rule (2.33) for this case and obtain

$$\triangle w = \frac{\varepsilon}{8\sigma^2}\, G\left(y_1 - y_2,\, 2\sigma^2 \mathbf{1}\right) (y_2 - y_1)^T \left( \frac{\partial y_2}{\partial w} - \frac{\partial y_1}{\partial w} \right). \qquad (2.42)$$

It should be mentioned here that the same update rule for the weights can be obtained using the information energy $O(X, Y)$ instead of the mutual information [36, 30]. The information energy is defined as

$$O(X, Y) = E[Y|X] - E[Y] \qquad (2.43)$$

whereby $E$ is the expectation value. The information energy has the following properties:

- in general $O(X, Y) \neq O(Y, X)$;
- $O(X, Y) \ge 0$, and $O(X, Y) = 0$ iff $X$ and $Y$ are statistically independent;
- $O(X, Y) \le 1 - E[X]$, and $O(X, Y) = 1 - E[X]$ iff $X$ is completely dependent on $Y$.

$O(X, Y)$ measures the unilateral dependence of $X$ relative to $Y$.

2.2.2 Non-Standard Metrics, Relevance Learning and Metric Adaptation

The definition of the transfer function (2.32) for the model only requires differentiability with respect to the weights $w$. The specific choice is not pre-determined and, again, allows the usage of non-standard metrics in a natural manner. Further, any transfer function can depend on an additional parameter vector $\lambda$,

$$g(v_i, w) = g(v_i, w, \lambda). \qquad (2.44)$$

According to the idea of relevance learning, we suggest to extend the weight update (2.33), respectively (2.42), by a parameter adaptation, as introduced in the case of the information energy [30]:

$$\triangle \lambda = \varepsilon_{\lambda} \frac{\partial I}{\partial \lambda} \quad \text{with} \quad \frac{\partial I}{\partial \lambda} = \sum_i \frac{\partial I}{\partial y_i} \frac{\partial y_i}{\partial \lambda}. \qquad (2.45)$$

From this we immediately derive the explicit formula

$$\triangle \lambda = \frac{\varepsilon}{8\sigma^2}\, G\left(y_1 - y_2,\, 2\sigma^2 \mathbf{1}\right) (y_2 - y_1)^T \left( \frac{\partial y_2}{\partial \lambda} - \frac{\partial y_1}{\partial \lambda} \right) \qquad (2.46)$$

in complete analogy to the formula for the weight update (2.42). We remark that we obtain the same relevance parameter update rule as known for the information energy (2.43), except for a constant factor [30].
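The two-sample update (2.42) and its relevance counterpart (2.46) can be sketched for a concrete transfer function. Since the paper leaves $g$ unspecified apart from differentiability, the linear choice $g(v, W, \lambda) = W^T(\lambda \odot v)$ used below is purely an assumption for illustration, and all names are illustrative.

```python
import numpy as np

def itlvq_pair_gradients(v1, v2, W, lam, sigma):
    """Gradients of the two-sample mutual information (2.40) with respect to
    the weights W and the relevance vector lam, following (2.41), (2.42) and
    (2.46), for the assumed linear transfer function y = W.T @ (lam * v)."""
    y1, y2 = W.T @ (lam * v1), W.T @ (lam * v2)
    diff = y1 - y2
    d = diff.shape[0]
    g = (4.0 * np.pi * sigma ** 2) ** (-d / 2.0) * np.exp(-np.sum(diff ** 2) / (4.0 * sigma ** 2))
    dI_dy1 = g * diff / (8.0 * sigma ** 2)       # eq. (2.41); dI/dy2 = -dI/dy1
    grad_W = np.outer(lam * (v1 - v2), dI_dy1)   # chain rule behind eq. (2.42)
    grad_lam = (W @ dI_dy1) * (v1 - v2)          # chain rule behind eq. (2.46)
    return grad_W, grad_lam

# Since the mutual information is maximized, the parameters follow a gradient
# ascent, e.g. W += eps_w * grad_W and lam += eps_lam * grad_lam.
```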

2.3 Regression Parametric Distance Metric Learning (RPDML)

Regression Parametric Distance Metric Learning (RPDML), as a regression classification approach, is based on an implicit kernel mapping estimation and a subsequent usual regression. It has been developed by Zhang et al. [17]. In RPDML the label information of the vectors is directly included into a special data dependent similarity measure:

$$s_{ij} = \begin{cases} \exp\!\left(-\frac{\|v_i - v_j\|}{\beta}\right) & c_{v_i} = c_{v_j} \\[4pt] 1 - \exp\!\left(-\frac{\|v_i - v_j\|}{\beta}\right) & c_{v_i} \neq c_{v_j} \end{cases} \qquad (2.47)$$

where $\|\cdot\|$ denotes the Euclidean norm and $\beta > 0$ is a free width parameter. The corresponding dissimilarity measure is

$$d_{ij} = s_{ii} + s_{jj} - 2 s_{ij}. \qquad (2.48)$$

It has been shown that the matrix $D = [d_{ij}]$ is metric, i.e. $d_{ij} \ge 0$, $d_{ii} = 0$, $d_{ij} = d_{ji}$, and the triangle inequality $d_{ik} + d_{jk} \ge d_{ij}$ holds for all $i, j, k$ [17].

The regression model RPDML introduces a mapping $f(v, W) = (f_1(v, W), \ldots, f_l(v, W))$ from the original input space $V \subseteq \mathbb{R}^{D_V}$ to the embedded Euclidean space $\mathbb{R}^{l}$ as $f = W^T \phi$. Each $f_i$ is a linear combination of $p$ linear or non-linear basis functions $\phi_j(v)$:

$$f_i(v, W) = \sum_{j=1}^{p} w_{ji}\, \phi_j(v) \qquad (2.49)$$

where $W = [w_{ji}]$ is an adaptive weight matrix and $\phi = (\phi_1, \ldots, \phi_p)^T$. Let $D(V)$ be given for an arbitrary fixed data set. Then RPDML minimizes the cost function

$$e^2(W) = \sum_{i \neq j} \left( d_{ij} - q_{ij}(W) \right)^2, \qquad \text{where } q_{ij}(W) = \left\| W^T \left(\phi(v_i) - \phi(v_j)\right) \right\|, \qquad (2.50)$$

with respect to $W$ by the iterative majorization algorithm [17, 37]. In this way one obtains a classifier from the regression model $f = W^T \phi$, whereby the weight matrix $W$ contains the information of the metric $D$ derived from the data.
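The label-dependent target dissimilarities of (2.47)-(2.48), which the regression then tries to reproduce, are easy to compute; a short illustrative numpy sketch (not the authors' code):

```python
import numpy as np

def rpdml_target_dissimilarities(X, labels, beta):
    """Similarities s_ij of eq. (2.47) and dissimilarities
    d_ij = s_ii + s_jj - 2 s_ij of eq. (2.48) for all pairs."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # Euclidean distances ||v_i - v_j||
    same = labels[:, None] == labels[None, :]
    s = np.where(same, np.exp(-dist / beta), 1.0 - np.exp(-dist / beta))
    d = np.diag(s)[:, None] + np.diag(s)[None, :] - 2.0 * s
    return d
```

The returned matrix plays the role of $D(V)$ in the cost function (2.50).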

2.4 Kernel Based Distance Learning (KBDL)

In kernel based classification models the data are mapped by a kernel mapping into a possibly high-dimensional space with high separation abilities. The Kernel Based Distance Learning classifier has been developed in the context of classification with metric adaptation. Kernel based distance learning (KBDL) has recently been introduced by Tsang & Kwok for distance metric learning for better classification [18]. The approach is derived from classic kernel methods. In fact, the problem description becomes very similar to that of support vector machines (SVMs). However, the focus of this approach is to automatically achieve an optimum metric for a given proximity setting. It is assumed that similarity/dissimilarity information between the data vectors $v \in V \subseteq \mathbb{R}^{D_V}$ is available: all pairs $(v_i, v_j)$ of similar vectors form the set $S$ of size $N_S$, all other pairs are collected in the dissimilarity set $D$ of size $N_D$. In the sense of a classification task, the set $S$ consists of all vector pairs of which both members belong to the same class, whereas $D$ collects all pairs containing vectors from different classes.

A generalized Mahalanobis distance can be defined by

$$d_{ij} = (v_i - v_j)^T M (v_i - v_j) \qquad (2.51)$$

with $M$ being a positive semi-definite matrix. It is well known that for any non-singular quadratic matrix $A$ the product $AA^T$ is positive semi-definite and, hence, can serve as $M$ in (2.51). The task in KBDL is to find a suitable transformation matrix $A$ of the data such that the distances for vector pairs in $S$ are minimized whereas the distances for vector pairs in $D$ are maximized. Let $d_{ij}$ refer to the distance measure (2.51) with an arbitrarily chosen positive semi-definite matrix $M$, and let $\tilde{d}_{ij}$ denote the distance measure whereby $AA^T$ is used instead of $M$ in (2.51). Now the idea is to vary $A$ in such a way that the above optimization goal is achieved. The problem can then be reformulated as: minimize the distances $\tilde{d}_{ij}$ for vector pairs in $S$, whereas $\tilde{d}_{ij} \ge d_{ij}$ is imposed for vector pairs in $D$. The latter goal is equivalent to the maximization of $\varsigma = \min_{ij} (\tilde{d}_{ij} - d_{ij})$ with $\varsigma \ge 0$. Because this condition may not be enforced perfectly, slack variables like in SVMs are necessary. This leads to the primal problem, which has a similar form as $\nu$-SVMs [18, 38]:

$$\min \ \frac{1}{2}\left\| AA^T \right\|^2 + \frac{C_S}{N_S} \sum_{(v_i, v_j) \in S} \tilde{d}_{ij} - C_D\, \nu \varsigma + \frac{C_D}{N_D} \sum_{(v_i, v_j) \in D} \xi_{ij}$$

with respect to $A$, $\varsigma$ and $\xi_{ij}$, subject to the constraints $\varsigma \ge 0$ and, for each $(v_i, v_j) \in D$,

$$\tilde{d}_{ij} \ge d_{ij} + \varsigma - \xi_{ij}, \qquad \xi_{ij} \ge 0.$$

Thereby, $C_S$, $C_D$ and $\nu$ are tunable parameters. The respective dual problem is obtained as the quadratic programming problem

$$\max_{\alpha} \ \sum_{(v_i, v_j) \in D} \alpha_{ij} (v_i - v_j)^T M (v_i - v_j) - \frac{1}{2} \sum_{(v_i, v_j) \in D}\, \sum_{(v_k, v_l) \in D} \alpha_{ij} \alpha_{kl} \left( (v_i - v_j)^T (v_k - v_l) \right)^2 + \frac{C_S}{N_S} \sum_{(v_i, v_j) \in D}\, \sum_{(v_k, v_l) \in S} \alpha_{ij} \left( (v_i - v_j)^T (v_k - v_l) \right)^2$$

subject to

$$\frac{1}{C_D} \sum_{(v_i, v_j) \in D} \alpha_{ij} \ge \nu, \qquad \alpha_{ij} \in \left[ 0,\, \frac{C_D}{N_D} \right],$$

with Lagrange multipliers $\alpha_{ij}$ [18]. Using the Karush-Kuhn-Tucker conditions

$$\alpha_{ij} \left( \tilde{d}_{ij} - d_{ij} - \varsigma + \xi_{ij} \right) = 0, \qquad \left( \frac{C_D}{N_D} - \alpha_{ij} \right) \xi_{ij} = 0, \qquad \mu \varsigma = 0 \qquad (2.52)$$

with $\nu = C_D^{-1} \left( \sum_{(v_i, v_j) \in D} \alpha_{ij} - \mu \right)$, it can be shown that

$$\tilde{d}_{ij} - d_{ij} \ \begin{cases} = \varsigma, & 0 < \alpha_{ij} < \frac{C_D}{N_D} \\ \ge \varsigma, & \alpha_{ij} = 0 \\ \le \varsigma, & \alpha_{ij} = \frac{C_D}{N_D} \end{cases} \qquad (2.53)$$

holds, which is very similar to SVMs. Further, $\nu$ is an upper bound for the fraction of errors [18].
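The adapted distances $\tilde{d}_{ij}$ appearing in the primal and dual above are simply squared Euclidean distances after the linear transformation $z = A^T v$; a minimal numpy sketch with illustrative names:

```python
import numpy as np

def kbdl_adapted_distances(X, pairs, A):
    """tilde{d}_ij = (v_i - v_j)^T A A^T (v_i - v_j) for the given index pairs,
    computed as squared Euclidean distances of the transformed points A^T v."""
    Z = X @ A                              # rows of Z are (A^T v_i)^T
    i, j = pairs[:, 0], pairs[:, 1]
    return np.sum((Z[i] - Z[j]) ** 2, axis=1)
```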

A kernelized variant of the approach can easily be obtained [18]: let $\kappa_{ij} = \kappa(v_i, v_j) = \phi(v_i) \cdot \phi(v_j)$ be a kernel function with feature map $\phi$. Then the dual problem becomes

$$\max_{\alpha} \ \sum_{(v_i, v_j) \in D} \alpha_{ij} \left( \kappa_{ii} + \kappa_{jj} - 2\kappa_{ij} \right) - \frac{1}{2} \sum_{(v_i, v_j) \in D}\, \sum_{(v_k, v_l) \in D} \alpha_{ij} \alpha_{kl} \left( \kappa_{ik} - \kappa_{il} - \kappa_{jk} + \kappa_{jl} \right)^2 + \frac{C_S}{N_S} \sum_{(v_i, v_j) \in D}\, \sum_{(v_k, v_l) \in S} \alpha_{ij} \left( \kappa_{ik} - \kappa_{il} - \kappa_{jk} + \kappa_{jl} \right)^2 \qquad (2.54)$$

and the modified inner product between $\phi(v_k)$ and $\phi(v_l)$ leads to the new kernel

$$\tilde{\kappa}(v_k, v_l) = \phi(v_k)^T AA^T \phi(v_l) = \sum_{(v_i, v_j) \in D} \alpha_{ij} \left( \kappa_{ki} - \kappa_{kj} \right) \left( \kappa_{il} - \kappa_{jl} \right) - \frac{C_S}{N_S} \sum_{(v_i, v_j) \in S} \left( \kappa_{ki} - \kappa_{kj} \right) \left( \kappa_{il} - \kappa_{jl} \right).$$

This adapted similarity measure can readily be included into standard metric classifiers such as nearest neighbor. The results reported below have been obtained by 1-nearest-neighbor classification.

3 Numerical Experiments

In this section we demonstrate the behavior of the different algorithms on artificial data sets as introduced in [14], on a subset of the public UCI repository of machine learning databases [39], and on the prostate cancer data set as published by the NCI [40]. Thereby, the artificial data sets serve as a demonstration of the effect of a problem adapted metric compared to the Euclidean one for a simple test situation. The UCI data sets have been chosen for two reasons: on the one hand, they cover different typical classification problems, on the other hand, they allow a fair comparison of the different proposals for metric adaptation. Since all methods have originally been proposed for vector data, we first study vectorial representations. We add one data set containing proteomic patterns, an interesting bioinformatics application. This set goes beyond standard vectorial data in the sense that it is obtained as the collection of function values of an underlying spectrum at different wave lengths, i.e. we deal with vectorial representations of functions. Correspondingly, the data are high dimensional and the dimensions show strong correlations. In addition, the data space is only sparsely covered and the number of examples is of the same order as the data dimensionality. Thus, methods which include structure in the form of relevance terms and which emphasize the generalization ability, e.g. in the form of margin maximization such as SRNG or SVM, are crucial for a successful classification. This example serves as a first step to demonstrate the applicability of the methods to non-standard data sets.

3.1 Artificial Data

For comparison, in the first tests for metric adaptation we applied the algorithms to the same artificial data sets as used before in [27], [14]: the artificial data sets 1 to 6 consist of two classes with 50 clusters each and about 30 points per cluster. The centers of the clusters are thereby located on a checkerboard structure in the two-dimensional square $[-1, 1]^2$.

Figure 1: Artificial multi-modal data set 1 (checkerboard) with clear class separation.

Figure 2: Artificial multi-modal data set 5. The structure is a checkerboard too, but now with large overlap between the classes.

Data sets 1, 3 and 5 contain two-dimensional data points from these clusters; the sets differ with respect to the overlap of the classes, as depicted in Figs. 1 and 2. Data sets 2, 4 and 6 are obtained as copies of 1, 3 and 5, respectively, whereby the two-dimensional points are embedded into 8 dimensions as follows: a point $(x_1, x_2)$ is embedded as $(x_1, x_2, x_1 + \nu_1, x_1 + \nu_2, x_1 + \nu_3, \nu_4, \nu_5, \nu_6)$. Thereby $\nu_i$ is uniform noise with support $[-0.05, 0.05]$ for $\nu_1$, $[-0.1, 0.1]$ for $\nu_2$, $[-0.2, 0.2]$ for $\nu_3$, $[-0.1, 0.1]$ for $\nu_4$, $[-0.2, 0.2]$ for $\nu_5$, and $[-0.5, 0.5]$ for $\nu_6$. These sets are randomly divided into training and test sets of the same size.

We train SRNG/SNG with the scaled Euclidean metric (SEM) (2.20) and the parametrized Mahalanobis metric (PMM) (2.27) with 50 prototypes for each class on these sets, whereby the prototypes are initialized randomly with small values. The number of prototypes used results from the minimum number of prototypes needed to achieve a classification accuracy of 100%. The relevance factors/parameters $\lambda_i$ are initialized uniformly. In case of SNG, the parameters/relevances remain constant during the training process. Training is done for 3000 cycles with learning rates $\epsilon^+ = 0.1$ for correct prototypes, $\epsilon^- = 0.05$ for incorrect prototypes, and a much smaller rate $\epsilon_{\lambda}$ for the relevance terms. The neighborhood range in SRNG/SNG (2.10) is initialized at $\gamma = 100$ and is multiplied by a constant decay factor after each cycle.

The results are collected in Table 1. We only give the prediction (test) rates; the recognition rate is omitted here. SNG and SRNG spread the prototypes faithfully among the data. SNG and SRNG usually miss at most 2 or 3 out of 100 clusters, whereas GLVQ and GRLVQ miss about 50% (one missing cluster accounts for an error of about 1%). For the well separated data set 1, SNG and SRNG achieve optimum classification accuracy for both applied metrics, SEM and PMM. For the second data set with SEM, SRNG is capable of classifying 95%. Typical relevance profiles which are achieved are about

$$\lambda = (0.34,\, 0.4,\, 0.18,\, 0, \ldots, 0).$$

Hence the importance of dimensions 1 and 2 is clearly pointed out; irrelevant dimensions are effectively pruned. When applying the PMM to data set 2 it is obvious that for SNG as well as for SRNG the classification is like a random guess. This may be due to the fact that the Mahalanobis distance depends on the inverse of the covariance matrix of the underlying data set. The determination of the inverse is a numerically ill-conditioned problem which may lead to unstable behavior [41]. The data are heavily affected by noise, which hence results in an invalid inverse covariance strongly influencing the class separation. This behavior can also be observed for the noisy data sets 4 and 6. The classification accuracy for data sets 3 and 4 is a bit worse with SEM owing to the larger overlap of the classes. Using PMM the effect of overlap can be handled more easily, so that the prediction values are better than for the SEM. This is especially obvious if we consider data set 5 (see also Figure 2), where the prediction rates for PMM are nearly 20% better than for SEM and therefore no longer a random guess.

In conclusion, the results for the artificial data sets clearly point out that the usage of different metrics can be useful to improve the classification accuracy. The adaptive SEM shows much better results than the fixed Euclidean weighting used in SNG. The PMM can more easily handle the overlapping artificial data sets but fails if the data sets are too much affected by noise.
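
For reference, artificial data of the kind described at the beginning of this subsection can be generated along the following lines. The 10x10 grid matches the 100 clusters mentioned above, but the cluster spread and the exact grid spacing are illustrative assumptions, not taken from [14].

```python
import numpy as np

def make_checkerboard_data(points_per_cluster=30, embed=False, spread=0.03, seed=0):
    """Two-class checkerboard data in [-1, 1]^2 with 100 clusters (50 per class);
    with embed=True the points are lifted to 8 dimensions with the uniform noise
    supports given in the text."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(10):
        for j in range(10):
            center = (-1.0 + (2 * i + 1) / 10.0, -1.0 + (2 * j + 1) / 10.0)
            pts = rng.normal(center, spread, size=(points_per_cluster, 2))
            X.append(pts)
            y.append(np.full(points_per_cluster, (i + j) % 2))   # checkerboard labels
    X, y = np.vstack(X), np.concatenate(y)
    if embed:
        widths = [0.05, 0.1, 0.2, 0.1, 0.2, 0.5]                 # noise supports from the text
        nu = np.column_stack([rng.uniform(-w, w, len(X)) for w in widths])
        X = np.column_stack([X[:, 0], X[:, 1],
                             X[:, 0] + nu[:, 0], X[:, 0] + nu[:, 1], X[:, 0] + nu[:, 2],
                             nu[:, 3], nu[:, 4], nu[:, 5]])
    return X, y
```
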
3.2 Real World Data

In this section we compare the performance of the different classifiers for real world data sets from the UCI machine learning repository and from NCI cancer research. Thereby, we use the achieved prediction accuracy as an estimate for the generalization ability of the models [42].

Further, we emphasize the feature of task (classification) dependent metric adaptation of all the approaches used. The UCI repository consists of a large number of different data sets with different input dimensions $D_V$ and numbers of classes. For comparison we used the same data as in [18]. In particular, we used the Pima Indians diabetes (PIMA, $D_V = 8$), the Soybean (SOYA, $D_V = 35$), the Wisconsin breast cancer (WBC, $D_V = 30$), the Wine (WINE, $D_V = 13$) and the Ionosphere (IONO, $D_V = 34$) data sets. These data sets cover a broad spectrum of properties which frequently occur in real world applications. According to [18], each data set was split into a test and a training data set as shown in Table 2.

Table 1: Prediction rates for the SNG and SRNG algorithms on the artificial data sets using SEM and PMM.

            SNG SEM   SRNG SEM   SNG PMM   SRNG PMM
  data1       98%       99%       100%       100%
  data2       72%       94%        53%        53%
  data3       90%       91%        96%        94%
  data4       64%       92%        51%        53%
  data5       50%       54%        73%        73%
  data6       50%       55%        51%        52%

Table 2: The used UCI data sets (PIMA, SOYA, WINE, WBC, IONO) and their splitting into test and training data.

In Table 3 the achieved prediction rates of the considered approaches for these data sets are shown. For SRNG and IT-LVQ the number of prototypes was chosen as 10% of the data of the respective classes. Learning was done up to convergence. The results for KBDL and RPDML are taken from [17] and [18]. Yet, for KBDL, results are provided for only two of the above data sets. Generally, one immediately observes that SRNG with SEM and RPDML clearly achieve the best results, whereby the differences between them are small. SNG with the Euclidean metric shows moderate accuracy, whereas the application of IT-LVQ as well as of SNG PMM and SRNG PMM leads to reduced accuracy. A general judgment of KBDL is difficult because results are available for only two data sets; for these two, the achieved accuracy is only moderate.

Table 3: Prediction rates for the analysed algorithms on the UCI data sets. KBDL-L refers to KBDL with linear kernel, KBDL-R refers to KBDL with RBF kernel. (1) Results taken from [17]; (2) results taken from [18] for both variants, linear and RBF kernel.

  data set   SNG SEM   SRNG SEM   RPDML (1)   SNG PMM   SRNG PMM   IT-LVQ   KBDL-L (2)   KBDL-R (2)
  PIMA        88.46%    90.00%     75.34%      64.89%    65.83%     65.83%      -            -
  SOYA        100.0%    100.0%     100.0%      57.00%    49.00%     100.0%      -            -
  WINE        93.22%    95.76%     99.15%      75.42%    81.36%     61.86%    72.87%       67.65%
  WBC         88.27%    97.02%     95.10%      78.46%    90.41%     63.33%      -            -
  IONO        72.91%    88.45%     88.05%      73.71%    78.09%     56.20%    80.53%       87.00%

For SNG/SRNG, the relevance learning in SRNG leads to better performance than the unparametrized counterpart (SNG) for both measures, SEM and PMM. However, the PMM as distance measure delivers poor accuracy. This can be attributed to numerical instabilities when computing the inverse of the estimated covariance matrix, which is needed for the distance computation. It is well known that matrix inversion is an ill-conditioned problem which leads to unstable behavior of the inverse matrix [41]. Hence, if the data contain large noise, this drastically influences the estimate of the inverse covariance matrix. This can be seen in the WINE data set, for example: Fig. 3 shows that the input dimensions 7 and 13 account for 54.07% and 25.44% of the overall variance, respectively. Most of the other dimensions contribute significant noise, which leads to the decreased performance using PMM. As we have seen for the artificial data sets with noise (2, 4, 6), we get the same reduced performance here. Thus we can conclude that an accurate estimation of the covariance is essential for a good performance of PMM.

The IT-LVQ approach also shows significantly reduced classification accuracy. This has several reasons: firstly, the Parzen window approximation assumes a roughly Gaussian distribution of the data within the classes, which is not necessarily fulfilled. Secondly, the approach is derived for a specific two-class scenario in which the priors of the classes are assumed to be equal, and the generalization to a multi-class scenario is only heuristic. Moreover, we observed that the performance crucially depends on the initialization. Altogether this leads to unstable behavior and unsatisfying results.

The results for KBDL with linear and RBF kernel are moderate. The performance is significantly behind the top approaches. Yet, a general discussion is not possible because we have results for only two data sets. For the given results we see that the accuracy lies between SRNG SEM / RPDML and SRNG/SNG using PMM for the IONO data set, whereas for the WINE data set it is behind them.

RPDML shows the best performance, like SRNG SEM. Hence, the regression is able to adapt the weights according to the underlying metric provided by the class information of the data. However, the minimization of the cost function (2.50) is realized by the iterative majorization algorithm [17], which iteratively computes the optimal solution. In each optimization step a matrix equation has to be solved, the dimension of which depends on the number of data points [37]. Therefore, the algorithm is computationally expensive, in particular for large data sets.

For the SRNG and SNG algorithms we also investigated their behavior on the proteomic NCI prostate cancer data set. This data set contains mass spectra of prostate cancer specimens. One class is the control class, whereas the other classes contain the data for different cancer stages.
Overall, the data set consists of 222 training and 96 test data with input dimension $D_V = 130$. We compare SRNG and SNG with a standard SVM approach in proteomics, the Unified Maximum Separability Analysis (UMSA) [43].

Figure 3: Data from the WINE data set depicted in the coordinates of dimensions 7 and 13, which account for 54.07% and 25.44% of the overall variance.

Table 4: Prediction rates obtained on the NCI data set using SRNG and SNG.

  data set   SNG-SEM   SRNG-SEM   SNG-PMM   SRNG-PMM   UMSA
  NCI         62.5%     93.7%      27.0%     39.6%     92.7%

The results are shown in Table 4. SRNG achieves an accuracy equivalent to UMSA. Yet, the advantage of SRNG is its reduced computational cost in comparison to SVM-type approaches [44] and, in particular, in comparison to UMSA [45]. Further, SRNG with SEM performs significantly better than the usual SNG. It also becomes obvious that the use of the PMM worsens the results, which may indicate, with respect to the results from the artificial data sets, that a high amount of noise is contained in this real data set. This different behavior can also be seen in the different relevance profiles, Fig. 4 and Fig. 5.

4 Conclusion

We compared different approaches of machine learning methods using adaptive metrics. The capabilities of the considered methods differ significantly. The best results are shown by the regression model RPDML and the prototype based relevance learning classifier SRNG with scaled Euclidean metric. Both models demonstrate robust behavior and consistently good results for all data sets, which represent typical problems in real world data analysis. All other approaches show remarkable drawbacks, such that an application cannot be recommended without specific knowledge and experience or verified assumptions about the data (in case of the PMM).

Figure 4: Relevance profile (relevance over input dimension) for SRNG SEM.

Figure 5: Relevance profile (relevance over input dimension) for SRNG PMM.

From the results for SNG and SRNG we can conclude that the usage of metric adaptation generally improves the classification accuracy. This result is supported by the investigation in [14]. The same property has been shown for RPDML, KBDL and IT-LVQ [17, 18, 30].

References

[1] P. Frasconi, M. Gori, and A. Sperduti, "A general framework of adaptive processing of data structures," IEEE Transactions on Neural Networks, vol. 9, no. 5.

[2] A. Sperduti and A. Starita, "Supervised neural networks for the classification of structures," IEEE Transactions on Neural Networks, vol. 8, no. 3.

[3] A. Ceroni, A. Passerini, P. Frasconi, and A. Vullo, "Predicting the disulfide bonding state of cysteines with combinations of kernel machines," Journal of VLSI Signal Processing, vol. 35, no. 3.

[4] A. Micheli, D. Sona, and A. Sperduti, "Contextual processing of structured data by recursive cascade correlation," IEEE Transactions on Neural Networks, vol. 15, no. 6.

[5] B. Hammer, A. Micheli, and A. Sperduti, "Universal approximation capability of cascade correlation for structures," Neural Computation, to appear.

[6] M. Bianchini, M. Gori, and F. Scarselli, "Processing directed acyclic graphs with recursive neural networks," IEEE Transactions on Neural Networks, vol. 12, no. 6.

[7] B. Hammer and B. J. Jain, "Neural methods for non-standard data," in European Symposium on Artificial Neural Networks 2004, M. Verleysen, Ed., d-side publications, 2004.

[8] H. Lodhi, N. Cristianini, J. Shawe-Taylor, and C. Watkins, "Text classification using string kernels," Journal of Machine Learning Research, vol. 2.

[9] T. Jaakkola, M. Diekhans, and D. Haussler, "A discriminative framework for detecting remote protein homologies," Journal of Computational Biology, vol. 7, no. 1-2.

[10] T. Gärtner, "A survey of kernels for structured data," SIGKDD Explorations.

[11] T. Kohonen and P. Somervuo, "How to make large self-organizing maps for nonvectorial data," Neural Networks, vol. 15, no. 8-9.

[12] T. Gärtner, P. A. Flach, and S. Wrobel, "On graph kernels: Hardness results and efficient alternatives," in Proc. Sixteenth Annual Conference on Computational Learning Theory and Seventh Kernel Workshop (COLT), Springer.

[13] A. Sato and K. Yamada, "Generalized learning vector quantization," in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds., vol. 7, MIT Press.

[14] B. Hammer, M. Strickert, and Th. Villmann, "Supervised neural gas with general similarity measure," Neural Processing Letters, to appear.

[15] K. Torkkola and W. M. Campbell, "Mutual information in learning feature transformations," in Proc. of the International Conference on Machine Learning ICML 2000, Stanford, CA, 2000.

[16] K. Torkkola, "Feature extraction by non-parametric mutual information maximization," Journal of Machine Learning Research, vol. 3.

[17] Z. Zhang, J. T. Kwok, and D.-Y. Yeung, "Parametric distance metric learning with label information," in Proc. of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI 03), Acapulco, Mexico, 2003.

[18] I. W. Tsang and J. T. Kwok, "Distance metric learning with kernels," in Proc. International Conference on Artificial Neural Networks (ICANN 2003), O. Kaynak, Ed., Istanbul, 2003.

[19] B. Hammer, M. Strickert, and Th. Villmann, "Supervised neural gas with general similarity measure," Neural Processing Letters, to appear.

[20] Teuvo Kohonen, Self-Organizing Maps, vol. 30 of Springer Series in Information Sciences, Springer, Berlin, Heidelberg, 1995 (Second Extended Edition).

[21] K. Crammer, R. Gilad-Bachrach, A. Navot, and A. Tishby, "Margin analysis of the LVQ algorithm," in Proc. NIPS 2002, 2.cs.cmu.edu/Groups/NIPS/NIPS2002/NIPS2002preproceedings/index.html.

[22] Th. Villmann and B. Hammer, "Supervised neural gas for learning vector quantization," in Proc. of the 5th German Workshop on Artificial Life (GWAL-5), D. Polani, J. Kim, and T. Martinetz, Eds., Akademische Verlagsgesellschaft - infix - IOS Press, Berlin.

[23] A. Sato and K. Yamada, "A formulation of learning vector quantization using a new misclassification measure," in Proceedings of the Fourteenth International Conference on Pattern Recognition, A. K. Jain, S. Venkatesh, and B. C. Lovell, Eds., vol. 1, IEEE Computer Society, Los Alamitos, CA, USA.

[24] Atsushi Sato and Kenji Yamada, "An analysis of convergence in generalized LVQ," in Proceedings of ICANN 98, the 8th International Conference on Artificial Neural Networks, L. Niklasson, M. Bodén, and T. Ziemke, Eds., vol. 1, Springer, London.

[25] Thomas M. Martinetz, Stanislav G. Berkovich, and Klaus J. Schulten, "Neural-gas network for vector quantization and its application to time-series prediction," IEEE Transactions on Neural Networks, vol. 4, no. 4.

[26] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz, "Relevance determination in learning vector quantization," in 9th European Symposium on Artificial Neural Networks (ESANN) Proceedings, D-Facto, Evere, Belgium, 2001.

[27] B. Hammer and Th. Villmann, "Generalized relevance learning vector quantization," Neural Networks, vol. 15, no. 8-9.


More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Cluster Kernels for Semi-Supervised Learning

Cluster Kernels for Semi-Supervised Learning Cluster Kernels for Semi-Supervised Learning Olivier Chapelle, Jason Weston, Bernhard Scholkopf Max Planck Institute for Biological Cybernetics, 72076 Tiibingen, Germany {first. last} @tuebingen.mpg.de

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

Radial-Basis Function Networks. Radial-Basis Function Networks

Radial-Basis Function Networks. Radial-Basis Function Networks Radial-Basis Function Networks November 00 Michel Verleysen Radial-Basis Function Networks - Radial-Basis Function Networks p Origin: Cover s theorem p Interpolation problem p Regularization theory p Generalized

More information

Chemometrics: Classification of spectra

Chemometrics: Classification of spectra Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3

More information

Comparison of Log-Linear Models and Weighted Dissimilarity Measures

Comparison of Log-Linear Models and Weighted Dissimilarity Measures Comparison of Log-Linear Models and Weighted Dissimilarity Measures Daniel Keysers 1, Roberto Paredes 2, Enrique Vidal 2, and Hermann Ney 1 1 Lehrstuhl für Informatik VI, Computer Science Department RWTH

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

Small sample size generalization

Small sample size generalization 9th Scandinavian Conference on Image Analysis, June 6-9, 1995, Uppsala, Sweden, Preprint Small sample size generalization Robert P.W. Duin Pattern Recognition Group, Faculty of Applied Physics Delft University

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Andreas Maletti Technische Universität Dresden Fakultät Informatik June 15, 2006 1 The Problem 2 The Basics 3 The Proposed Solution Learning by Machines Learning

More information

The Decision List Machine

The Decision List Machine The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca

More information

Supervised Learning Coursework

Supervised Learning Coursework Supervised Learning Coursework John Shawe-Taylor Tom Diethe Dorota Glowacka November 30, 2009; submission date: noon December 18, 2009 Abstract Using a series of synthetic examples, in this exercise session

More information

Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Generative MaxEnt Learning for Multiclass Classification

Generative MaxEnt Learning for Multiclass Classification Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,

More information

A Simple Implementation of the Stochastic Discrimination for Pattern Recognition

A Simple Implementation of the Stochastic Discrimination for Pattern Recognition A Simple Implementation of the Stochastic Discrimination for Pattern Recognition Dechang Chen 1 and Xiuzhen Cheng 2 1 University of Wisconsin Green Bay, Green Bay, WI 54311, USA chend@uwgb.edu 2 University

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Adaptive Metric Learning Vector Quantization for Ordinal Classification

Adaptive Metric Learning Vector Quantization for Ordinal Classification 1 Adaptive Metric Learning Vector Quantization for Ordinal Classification Shereen Fouad and Peter Tino 1 1 School of Computer Science, The University of Birmingham, Birmingham B15 2TT, United Kingdom,

More information

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

EM-algorithm for Training of State-space Models with Application to Time Series Prediction

EM-algorithm for Training of State-space Models with Application to Time Series Prediction EM-algorithm for Training of State-space Models with Application to Time Series Prediction Elia Liitiäinen, Nima Reyhani and Amaury Lendasse Helsinki University of Technology - Neural Networks Research

More information

3.4 Linear Least-Squares Filter

3.4 Linear Least-Squares Filter X(n) = [x(1), x(2),..., x(n)] T 1 3.4 Linear Least-Squares Filter Two characteristics of linear least-squares filter: 1. The filter is built around a single linear neuron. 2. The cost function is the sum

More information

Machine learning for pervasive systems Classification in high-dimensional spaces

Machine learning for pervasive systems Classification in high-dimensional spaces Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version

More information

Self Supervised Boosting

Self Supervised Boosting Self Supervised Boosting Max Welling, Richard S. Zemel, and Geoffrey E. Hinton Department of omputer Science University of Toronto 1 King s ollege Road Toronto, M5S 3G5 anada Abstract Boosting algorithms

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector

More information

Analysis of Multiclass Support Vector Machines

Analysis of Multiclass Support Vector Machines Analysis of Multiclass Support Vector Machines Shigeo Abe Graduate School of Science and Technology Kobe University Kobe, Japan abe@eedept.kobe-u.ac.jp Abstract Since support vector machines for pattern

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Dynamic Time-Alignment Kernel in Support Vector Machine

Dynamic Time-Alignment Kernel in Support Vector Machine Dynamic Time-Alignment Kernel in Support Vector Machine Hiroshi Shimodaira School of Information Science, Japan Advanced Institute of Science and Technology sim@jaist.ac.jp Mitsuru Nakai School of Information

More information

Linear and Non-Linear Dimensionality Reduction

Linear and Non-Linear Dimensionality Reduction Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections

More information

SMO Algorithms for Support Vector Machines without Bias Term

SMO Algorithms for Support Vector Machines without Bias Term Institute of Automatic Control Laboratory for Control Systems and Process Automation Prof. Dr.-Ing. Dr. h. c. Rolf Isermann SMO Algorithms for Support Vector Machines without Bias Term Michael Vogt, 18-Jul-2002

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

Statistical Learning. Dong Liu. Dept. EEIS, USTC

Statistical Learning. Dong Liu. Dept. EEIS, USTC Statistical Learning Dong Liu Dept. EEIS, USTC Chapter 6. Unsupervised and Semi-Supervised Learning 1. Unsupervised learning 2. k-means 3. Gaussian mixture model 4. Other approaches to clustering 5. Principle

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2

More information

Sequential Minimal Optimization (SMO)

Sequential Minimal Optimization (SMO) Data Science and Machine Intelligence Lab National Chiao Tung University May, 07 The SMO algorithm was proposed by John C. Platt in 998 and became the fastest quadratic programming optimization algorithm,

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

Learning Kernel Parameters by using Class Separability Measure

Learning Kernel Parameters by using Class Separability Measure Learning Kernel Parameters by using Class Separability Measure Lei Wang, Kap Luk Chan School of Electrical and Electronic Engineering Nanyang Technological University Singapore, 3979 E-mail: P 3733@ntu.edu.sg,eklchan@ntu.edu.sg

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Change point method: an exact line search method for SVMs

Change point method: an exact line search method for SVMs Erasmus University Rotterdam Bachelor Thesis Econometrics & Operations Research Change point method: an exact line search method for SVMs Author: Yegor Troyan Student number: 386332 Supervisor: Dr. P.J.F.

More information

Supervised locally linear embedding

Supervised locally linear embedding Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,

More information

Adaptive Metric Learning Vector Quantization for Ordinal Classification

Adaptive Metric Learning Vector Quantization for Ordinal Classification ARTICLE Communicated by Barbara Hammer Adaptive Metric Learning Vector Quantization for Ordinal Classification Shereen Fouad saf942@cs.bham.ac.uk Peter Tino P.Tino@cs.bham.ac.uk School of Computer Science,

More information

Microarray Data Analysis: Discovery

Microarray Data Analysis: Discovery Microarray Data Analysis: Discovery Lecture 5 Classification Classification vs. Clustering Classification: Goal: Placing objects (e.g. genes) into meaningful classes Supervised Clustering: Goal: Discover

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information