Inferring Feature Relevances from Metric Learning


Alexander Schulz, Bassam Mokbel, Michael Biehl, Barbara Hammer
CITEC centre of excellence, Bielefeld University, Germany
University of Groningen, Mathematics and Computing Science, The Netherlands

Preprint of the publication [1], as provided by the authors.

Abstract

Powerful metric learning algorithms have been proposed in recent years which not only greatly enhance the accuracy of distance-based classifiers and nearest neighbor database retrieval, but also enable the interpretability of these operations by assigning explicit relevance weights to the single data components. Starting with the work [2], it has been noticed, however, that this procedure has limited validity in the important case of high data dimensionality or high feature correlations: the resulting relevance profiles are random to a large extent, leading to invalid interpretations and to fluctuations of the accuracy for novel data. While the work [2] proposes a first cure by means of L2-regularization, it only preserves strongly relevant features, leaving weakly relevant and not necessarily unique features undetected. In this contribution, we enhance the technique by an efficient linear programming scheme which enables the unique identification of a relevance interval for every observed feature, this way identifying both strongly and weakly relevant features for a given metric.

1 Introduction

Popular machine learning and data retrieval techniques crucially rely on a distance computation for the given data, including the k-nearest neighbor classifier, prototype-based classification, unsupervised self-organizing maps, k-means clustering, and neighborhood-based retrieval. Often, the standard Euclidean metric is used as a default for these methods, and the techniques fail if the chosen distance is not appropriate for the given task. Due to this fact, metric learning has made great strides in recent years: its aim is to autonomously adjust the parameters of a given distance function such that it better suits the intended task.

A large variety of methods covers techniques for k-nearest neighbor classifiers, metric learning in regression, metric learning in prototype-based classification, metric learning based on side information, learning methods for unsupervised clustering, hybrid techniques, etc.; see e.g. [3-8]. Some approaches can be accompanied by theoretical guarantees concerning their generalization ability or their limit behavior for online learning [9-14]. These techniques greatly enhance the performance of the methods in applications, since they enable practitioners to tailor the metric to the given data and task at hand. The methods have in common that the by far most popular distance measure used for this purpose is a general quadratic form, corresponding to a linear transformation of the feature space. Albeit first techniques consider more general data structures [7, 15], a quadratic form is used in the majority of the approaches. Besides an improved performance, a quadratic form has the benefit that it lends itself to an improved interpretability of the result: its diagonal corresponds to the relevance of the feature dimensions for the linear transformation underlying the quadratic form, such that this relevance profile can give hints about the importance of the observed features. This fact has been heavily used in the context of biomedical data analysis, for example [16-18]. It underlines the increasing importance of the interpretability of machine learning techniques, which should not only provide excellent black-box behavior, but also allow human experts to benefit from the findings buried in the models [19-24].

When interpreting machine learning models based on their parametrization, however, certain minimum conditions have to be fulfilled, a crucial one being the uniqueness of the observed parameters. This is not guaranteed for metric learners if high data dimensionality or large feature correlations are present, as first observed in the contribution [2]: feature correlations give rise to changes of the quadratic form which do not change the resulting mapping. These effects are widely induced by a random initialization of the matrix; hence relevance profiles result which display spurious relevance peaks in large parts. The approach [2] also proposed a first cure for this observation, an L2-regularization of the quadratic form, which is standard if the linear data transformation underlying the matrix is directly learned from given data, e.g. via Tikhonov regularization (also known as ridge regression), but which is not yet part of most metric learning algorithms. Albeit the proposed regularization yields unique results, it diminishes the interpretability of the relevance profiles in the following sense: L2-regularization accounts for the fact that all so-called strongly relevant features (which cannot be replaced by others without a performance loss) correspond to peaks in the relevance profile. So-called weakly relevant features, which contribute to the classification but which could be substituted by alternatives, share their relevance; hence they are no longer distinguishable if a large number of correlated features is present. Thus, interpretability as concerns these potentially useful feature contributions is lost.

The problem of weakly relevant features constitutes a classical hard problem of feature selection, which is particularly complex if sets of features rather than single features carry the relevant information [25-27]. One key technique, pioneered in the frame of the popular lasso, relies on L1-regularization rather than L2-regularization [28]: L1-regularization favors sparse signals, resulting in minimal, but not necessarily unique, feature sets which also include some weakly relevant signals. Based on this insight, an efficient method has recently been proposed which identifies relevance intervals for all given features of a linear mapping based on L1-optimization [29].

In this contribution, we take the approach published in [29] as a starting point and extend this technology towards the setting of relevance weights for metric learners. The key observation is that a quadratic form corresponds to an implicit linear transformation of the data, such that a vectorial extension of the approach [29] allows us to quantify the relevance of a feature dimension for a given distance computation. In the following, we formally introduce this approach and investigate its performance for two popular matrix learners on different data sets, including artificial data with known ground truth as well as a variety of benchmarks.

2 Feature Relevance

Assume a mapping f : R^d -> Y is given, such as a classifier with Y being the set of possible output classes. One very important way of interpreting any such mapping is offered by a judgment of the relevance of the given features for this mapping, i.e. a question related to feature selection. Here, a crucial distinction is offered by the notion of strongly relevant features, i.e. features which are of central relevance for the mapping f and cannot be skipped without loss of information, and weakly relevant features, i.e. features which are of relevance for the mapping but could be substituted by alternatives [27]. Formally, a feature X_i is strongly relevant provided the output f(x) depends on the feature X_i even if all other features are known. A feature is weakly relevant if there exists a subset S of the other features such that the output f(x) depends on X_i provided S is known, but the feature is not strongly relevant. In all other cases, the feature is irrelevant. While strongly relevant features can be detected, e.g., based on efficient estimates of the mutual information, weakly relevant features are harder to detect since they require the investigation of all subsets of features.

In the context of a linear or generalized linear mapping f, a popular technology for feature selection is offered by the lasso and its variants [30]. Assume f(x) = ω^T x. Then the lasso relies on an L1-regularized optimization of the mapping parameters

    \min_\omega \; \frac{1}{2} \sum_{i=1}^{n} (f(x_i) - y_i)^2 \quad \text{s.t.} \quad \sum_{j=1}^{d} |\omega_j| \le s    (1)

for a sparsity constant s > 0. The constraint can be integrated into the objective as a penalty term with a fixed weighting. Ridge regression penalizes the L2 instead of the L1 norm, and the elastic net addresses a mixture of both objectives [30].
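To make the distinction between the three penalties concrete, the following sketch fits L1-, L2- and mixed-regularized linear models on a small synthetic regression task using scikit-learn; the data X, y and the penalty strengths are arbitrary illustrative choices, not the settings used in the experiments below.

```python
# Sketch: read the absolute weights of regularized linear models as (naive)
# feature relevances. Data and penalty strengths are made up for illustration.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                                # hypothetical inputs
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=100)      # hypothetical targets

models = {
    "lasso": Lasso(alpha=0.1),                            # L1 penalty: sparse weight vector
    "ridge": Ridge(alpha=0.1),                            # L2 penalty: shrinks but keeps all features
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),   # mixture of L1 and L2
}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))                 # |coef_j| is read as relevance of feature j
```

In such a toy setting, the lasso typically drives the weights of uninformative features exactly to zero, whereas ridge regression merely shrinks them.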

Depending on the weighting of the penalty, different degrees of sparsity can be enforced. It is possible to infer the relevance of a given feature X_i from the size of the resulting weight ω_i, whereby a varying penalty can also shed some light on the question whether the feature is weakly or strongly relevant. Inspired by this observation, we use L1-regularization for a valid interpretation of the relevance terms of a given mapping. Thereby, we separate the question of how to train the mapping from the question of how to interpret the feature relevances. This strategy, together with a slight reformulation of the regularization, enables us to derive intervals for the possible relevance range of a given feature. The first step is to reduce the problem of feature relevance determination for a metric learner to the equivalent question for a linear data transformation.

3 Metric Learning as Linear Data Transformation

Metric learning has been introduced into distance-based machine learning models as a means to autonomously adjust the underlying distance measure to the given task at hand [3, 6]. Here, we focus on two popular metric learning schemes only. Since the proposed technique for metric interpretation is separated from the metric learning step itself, the proposed regularization for feature relevance determination can be used for every metric adjustment scheme which arrives at a general quadratic form, as utilized in the following. We rely on a distance measure which is given by a general quadratic form

    d : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, \quad (x_i, x_j) \mapsto (x_i - x_j)^T \Lambda (x_i - x_j)    (2)

with the positive semi-definite matrix Λ = Ω^T Ω.

Large margin nearest neighbor (LMNN) [31] adjusts this matrix in such a way that the k-NN error induced by this distance is optimized. More precisely, it fixes the k nearest neighbors for every given data point and adjusts the matrix Ω such that points with the same label in this neighborhood are close, while points with a different label are separated by a margin of at least one. Generalized matrix learning vector quantization (GMLVQ) relies on a prototype-based winner-takes-all scheme rather than lazy learning [9]. Together with the prototype locations, the matrix Ω is adjusted such that the distance of a given data point to the closest prototype with the correct label is minimized as compared to its distance to the closest prototype with an incorrect label. By optimizing Ω, both methods do not only increase the classification accuracy, but they also offer an interpretation of the feature relevances in terms of the diagonal Λ_ii = Σ_j Ω_ji^2 or related terms, since the metric (2) corresponds to the linear data transformation

    x \mapsto \Omega x.    (3)

Hence the interpretation of the matrix relevance terms reduces to the interpretation of this linear mapping.
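The correspondence between the quadratic form (2) and the linear mapping (3) can be checked directly; the following minimal sketch uses a random matrix Ω as a stand-in for a learned metric.

```python
# Sketch: (x_i - x_j)^T Lambda (x_i - x_j) with Lambda = Omega^T Omega equals the
# squared Euclidean distance between Omega x_i and Omega x_j; diag(Lambda) is the
# usual relevance profile. Omega is a random stand-in for a learned metric here.
import numpy as np

rng = np.random.default_rng(1)
d = 5
Omega = rng.normal(size=(2, d))            # e.g. a rank-two metric
Lambda = Omega.T @ Omega                   # positive semi-definite matrix of the quadratic form

x_i, x_j = rng.normal(size=d), rng.normal(size=d)
diff = x_i - x_j
d_quadratic = diff @ Lambda @ diff                        # quadratic form (2)
d_mapped = np.sum((Omega @ x_i - Omega @ x_j) ** 2)       # squared distance after the map (3)
assert np.isclose(d_quadratic, d_mapped)

relevance_profile = np.diag(Lambda)        # Lambda_ii = sum_j Omega_ji^2
print(relevance_profile)
```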

In the following, we employ the unique eigendecomposition of the matrix Λ to define the linear data mapping Ω, i.e. the eigenvectors scaled with the square roots of the eigenvalues.

4 Linear Bounds

Given the parameters Ω that define the linear mapping Ωx underlying a general quadratic form (2), we are interested in the interpretation of the mapping parameters Ω. First, we decompose the problem into one-dimensional mappings based on the following observation: each row ω of Ω constitutes an independent mapping of the data into a one-dimensional subspace. Hence we can interpret each of these rows independently. After having obtained relevance bounds for the individual mappings ω, we can sum their absolute values in order to obtain relevance bounds for the whole mapping Ω.

For a particular mapping, the parameter value ω_j is often directly interpreted as the relevance of feature X_j, provided the input features have the same scaling. However, this can be highly problematic, as pointed out in [2], because features in high-dimensional data are often correlated, and thus the absolute value of ω_j can be misleading. In [2], the authors formalize mapping invariances for the given data to underline this observation, which we briefly recap here. For given data vectors x_i collected as columns of a matrix X, the central notion of invariance is defined as follows: given a mapping f(x) = ω^T x, the parameters ω are equivalent to ω' iff

    \omega^T X = (\omega')^T X,    (4)

i.e. the data mapping remains unchanged under a substitution of ω by ω'. Note that this formulation addresses the behavior of the mapping for the given data, without referring to a predefined criterion such as the accuracy. The conditions under which ω is equivalent to ω' are exactly described in [2]: two vectors ω and ω' are equivalent iff the difference vector ω - ω' is contained in the null space of the data covariance matrix X X^T. This matrix has eigenvectors v_i with eigenvalues λ_1 ≥ ... ≥ λ_I > λ_{I+1} = ... = λ_d = 0, sorted according to their size, where I denotes the number of non-zero eigenvalues. Therefore, before interpreting the mapping parameters, the proposal in [2] is to choose one canonic representative of the equivalence class induced by a given ω, namely Ψω, where the matrix

    \Psi = \mathrm{Id} - \sum_{i=I+1}^{d} v_i v_i^T

corresponds to a projection of ω onto the eigenvectors with non-zero eigenvalues only, induced by the eigenvectors v_i of X X^T. Hence the eigenvectors with eigenvalue zero are divided out. In [2], the authors show that the representative Ψω chosen in this way is the vector in the equivalence class with the smallest L2 norm. It is therefore no longer possible to ascribe a misleadingly high value ω_j to an irrelevant feature, e.g. due to unfavorable effects in the data; instead, only relevant features are expressed in the mapping parameters.
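A minimal sketch of this canonical representative: the null space of X X^T is estimated from an eigendecomposition and projected out of ω. The small data matrix with exactly correlated features is constructed purely for illustration.

```python
# Sketch: dividing out the null space of the data as proposed in [2].
# Columns of X are the data points; omega is a mapping vector to be interpreted.
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 6
base = rng.normal(size=(n, 3))
# feature 4 copies feature 1, feature 5 = feature 2 + feature 3, feature 6 is constant,
# so X X^T has a three-dimensional null space
data = np.column_stack([base, base[:, 0], base[:, 1] + base[:, 2], np.zeros(n)])
X = data.T                                        # shape (d, n), columns are data points

omega = rng.normal(size=d)                        # some mapping vector

eigvals, eigvecs = np.linalg.eigh(X @ X.T)        # eigendecomposition of X X^T
tol = 1e-10 * eigvals.max()
V0 = eigvecs[:, eigvals < tol]                    # eigenvectors spanning the null space
Psi = np.eye(d) - V0 @ V0.T                       # projection that divides out the null space
omega_canonical = Psi @ omega                     # representative with smallest L2 norm

# omega and its canonical representative map the given data identically (Eq. (4)):
assert np.allclose(omega @ X, omega_canonical @ X)
print(np.round(omega, 2), np.round(omega_canonical, 2))
```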

Although this approach provides a unique representative of every equivalence class, it is problematic regarding the direct interpretability of the values: sets of correlated features share their total relevance, which can lead to large groups of weakly expressed relevances. This is undesirable, since a single feature may be weighted consistently low only because it is highly correlated to a large number of others, despite the fact that the information provided by this feature (or any equivalent one) might be of high relevance for the outcome of the linear mapping. Hence, we propose an alternative approach to determine representatives which are equivalent to a given parameter vector ω, while allowing for an intuitive direct interpretation of the weights as feature relevances. In essence, instead of choosing the representative with the smallest L2 norm, we use the L1 norm. Unlike the former, the latter induces a set of equivalent weight vectors with minimal L1 norm. This is beneficial, since we can infer the minimum and maximum relevance of each feature by looking at the minimum and maximum weighting of the feature within this set. In the following, we formalize this concept.

4.1 Formalizing the Objective

Given a parameter vector ω of a mapping, we are looking for equivalent vectors of the form

    \omega' = \omega + \sum_{i=I+1}^{d} \alpha_i v_i    (5)

where the real-valued parameters α_i add the null space of the mapping to the vector ω. Similar to the approach in [2], we choose minimum vectors only, in order to avoid arbitrary scaling effects of the null space. However, we use the L1 norm instead of the L2 norm:

    \mu := \min_\alpha \Big\| \omega + \sum_{i=I+1}^{d} \alpha_i v_i \Big\|_1.    (6)

The minimum value µ is unique per definition. However, uniqueness of the corresponding vector ω + Σ_{i=I+1}^d α_i v_i is not guaranteed. To illustrate this fact, we refer to a simple observation: assume identical features X_i = X_j and a weighting ω_i and ω_j. Then, for weights of equal sign, any weighting ω'_i = t ω_i + (1-t) ω_j and ω'_j = (1-t) ω_i + t ω_j with t in [0, 1] yields an equivalent vector with the same L1 norm.
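The minimal norm µ of Eq. (6) can be computed with any standard LP solver by introducing one auxiliary variable per absolute value; below is a sketch using scipy, together with the two-identical-features example just discussed (V and ω are illustrative placeholders).

```python
# Sketch of Eq. (6): mu = min_alpha || omega + V @ alpha ||_1, where the columns
# of V span the null space of the data. Solved as a linear program with scipy.
import numpy as np
from scipy.optimize import linprog

def minimal_l1_norm(omega, V):
    """Return mu and a minimizing alpha; variables are z = [t_1..t_d, alpha_1..alpha_k]."""
    d, k = V.shape
    c = np.concatenate([np.ones(d), np.zeros(k)])            # minimize sum of t
    # t_j >= omega_j + (V alpha)_j  and  t_j >= -(omega_j + (V alpha)_j)
    A_ub = np.block([[-np.eye(d), V], [-np.eye(d), -V]])
    b_ub = np.concatenate([-omega, omega])
    bounds = [(0, None)] * d + [(None, None)] * k            # alpha is unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.fun, res.x[d:]

# toy example: two identical features; omega puts all weight on the first one
V = np.array([[1.0], [-1.0], [0.0]]) / np.sqrt(2)            # null-space direction for x_1 = x_2
omega = np.array([2.0, 0.0, 1.0])
mu, alpha = minimal_l1_norm(omega, V)
print(mu)   # 3.0: the total weight can only be redistributed between the twins, not reduced
```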

Based on this observation, we can formalize a notion of minimum and maximum feature relevance for a given linear mapping: the minimum feature relevance of feature X_j is the smallest weight ω'_j such that ω' is equivalent to ω and ||ω'||_1 = µ. Analogously, the maximum feature relevance of feature X_j is the largest weight ω'_j such that ω' is equivalent to ω and ||ω'||_1 = µ. Thus, we arrive at the following mathematical optimization problems:

    \omega_j^{\min} := \min_\alpha \Big| \omega_j + \sum_{i=I+1}^{d} \alpha_i (v_i)_j \Big| \quad \text{s.t.} \quad \Big\| \omega + \sum_{i=I+1}^{d} \alpha_i v_i \Big\|_1 = \mu    (7)

and

    \omega_j^{\max} := \max_\alpha \Big| \omega_j + \sum_{i=I+1}^{d} \alpha_i (v_i)_j \Big| \quad \text{s.t.} \quad \Big\| \omega + \sum_{i=I+1}^{d} \alpha_i v_i \Big\|_1 = \mu,    (8)

where (v_i)_j refers to entry j of v_i. As solutions, we obtain a pair (ω_j^min, ω_j^max) for each feature X_j, indicating the minimum and maximum weight of this feature over all equivalent mappings that share the same L1 norm.

In the special case of linear mappings with the objective of mapping invariance, this strongly resembles the concept of strong and weak feature relevance. However, this framework does not realize the notion of strong and weak feature relevance in a strict sense. The reason is that we aim for scaling terms as observed in the linear mapping, which are subject to L1 regularization. Consequently, a set of two features with the same information content as a single feature is not treated as identical to that single feature by this formulation. Instead, our formulation prefers the single feature because of the smaller scaling of the respective weights. A qualitative feature selection objective would treat such variables identically.

Natural relaxations of our optimization problems are possible: by incorporating also eigenvectors which correspond to small (rather than exactly zero) eigenvalues in Eq. (5), we can enable an approximate preservation of mapping equivalence. Further, we can relax the equality with µ from Eq. (6) by allowing values below (1 + ε)µ instead of exactly µ, for some small ε > 0. Such relaxations with small thresholds ε are strongly recommended for practical applications, where noise in the data has to be considered. We refer to these straightforward approximations in our experiments.

4.2 Reformulation as a Linear Programming Problem

To obtain a solution algorithmically, we reformulate our optimization problems as linear programs (LPs). Problem (7) can be rephrased as the following equivalent LP, in which we introduce a new variable ω̃_k for every k, taking the role of |ω_k + Σ_{i=I+1}^d α_i (v_i)_k|:

    \omega_j^{\min} = \min_{\tilde\omega, \alpha} \; \tilde\omega_j    (9)
    \text{s.t.} \quad \sum_{i=1}^{d} \tilde\omega_i \le \mu,
    \quad \tilde\omega_k \ge \omega_k + \sum_{i=I+1}^{d} \alpha_i (v_i)_k \quad \forall k,
    \quad \tilde\omega_k \ge -\Big( \omega_k + \sum_{i=I+1}^{d} \alpha_i (v_i)_k \Big) \quad \forall k,

where µ is computed in (6); the variables ω̃_i are non-negative due to the constraints. For the optimum solution, we can assume that equality holds in one of the two latter constraints for every k; otherwise, the solution could be improved due to the weaker constraints and the minimization of the objective.
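A sketch of the lower-bound LP (9) with scipy, reusing µ from the previous snippet and including the relaxation factor ε discussed above; the toy example again assumes two identical features. The corresponding upper bounds follow analogously from the maximization variant derived next.

```python
# Sketch of LP (9): the smallest weight that feature j can receive over all
# equivalent mappings whose L1 norm stays below (1 + eps) * mu.
import numpy as np
from scipy.optimize import linprog

def lower_relevance_bound(omega, V, mu, j, eps=0.0):
    d, k = V.shape
    c = np.zeros(d + k)
    c[j] = 1.0                                              # minimize omega_tilde_j
    # omega_tilde_k >= +/- (omega_k + (V alpha)_k) for every k
    A_ub = np.block([[-np.eye(d), V], [-np.eye(d), -V]])
    b_ub = np.concatenate([-omega, omega])
    # sum_k omega_tilde_k <= (1 + eps) * mu
    A_norm = np.concatenate([np.ones(d), np.zeros(k)])[None, :]
    A_ub = np.vstack([A_ub, A_norm])
    b_ub = np.concatenate([b_ub, [(1.0 + eps) * mu]])
    bounds = [(0, None)] * d + [(None, None)] * k
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.fun

# toy example with two identical features x_1 = x_2 and mu = 3 from above:
V = np.array([[1.0], [-1.0], [0.0]]) / np.sqrt(2)
omega = np.array([2.0, 0.0, 1.0])
print([round(lower_relevance_bound(omega, V, 3.0, j), 3) for j in range(3)])
# roughly [0, 0, 1]: the two identical features can hand their weight to each other
# (weakly relevant), while the third feature always keeps weight 1 (strongly relevant)
```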

Problem (8) can be rephrased by the equivalent formulation

    \omega_j^{\max} = \max_{\tilde\omega, \alpha} \; \Big| \omega_j + \sum_{i=I+1}^{d} \alpha_i (v_i)_j \Big|    (10)

subject to the same constraints as in (9), where new variables ω̃_k are introduced as well. Again, these serve as the absolute values |ω_k + Σ_{i=I+1}^d α_i (v_i)_k|: any solution for which equality does not hold in one of the constraints can be improved due to the weaker constraints and the maximization objective. Since an absolute value is maximized, this is not an LP yet. To obtain its solution, one can simply solve two LPs, in which the positive and the negative value of the objective are considered, respectively:

    \omega_j^{\pm} = \max_{\tilde\omega, \alpha} \; \pm \Big( \omega_j + \sum_{i=I+1}^{d} \alpha_i (v_i)_j \Big),

where we add the corresponding non-negativity constraint

    \pm \Big( \omega_j + \sum_{i=I+1}^{d} \alpha_i (v_i)_j \Big) \ge 0.

At least one of these LPs has a feasible solution, and the final upper bound is derived thereof as the maximum value ω_j^max = max{ω_j^+, ω_j^-}. For each linear mapping, this formulation requires solving 3d LP problems, each with 2d constraints and d - I free parameters α_i besides the auxiliary variables ω̃. For this purpose, standard solvers can be applied.

Figure 1: Two relevant features of the xor data set (left). Average classification error rates of GMLVQ with regularized metrics for the xor data set (right).

5 Experiments

In this section we apply our proposed method to four data sets from different domains. After describing the data, we explain the experimental setup and, finally, present the results. For the evaluation, we employ the following data sets.

The xor data set is artificially generated and consists of 4 clusters belonging to 2 classes, constituting the XOR problem. One informative dimension is present 3 times with added noise (features 1-3), and two identical irrelevant features are included (features 5 and 6). A plot of features 1 and 4 is depicted in Fig. 1 (left).
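For illustration, an xor-type data set with this structure can be generated along the following lines; sample size, noise level and cluster placement are arbitrary choices and not necessarily those used for the reported experiments.

```python
# Sketch of an xor-type data set: two informative dimensions form the XOR problem,
# the first one appears three times with added noise (features 1-3), feature 4 is
# the second informative dimension, and features 5-6 are identical irrelevant copies.
import numpy as np

rng = np.random.default_rng(3)
n = 200
a = rng.choice([-1.0, 1.0], size=n)              # first informative dimension
b = rng.choice([-1.0, 1.0], size=n)              # second informative dimension
y = (a * b > 0).astype(int)                      # XOR-style labels over the two dimensions
noise = rng.normal(scale=0.1, size=(n, 3))
irrelevant = rng.normal(size=n)
X = np.column_stack([a + noise[:, 0],            # feature 1: noisy copy of a
                     a + noise[:, 1],            # feature 2: noisy copy of a
                     a + noise[:, 2],            # feature 3: noisy copy of a
                     b,                          # feature 4
                     irrelevant,                 # feature 5: irrelevant
                     irrelevant])                # feature 6: identical irrelevant copy
print(X.shape, y[:10])
```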

The wine data set consists of 256 features, which are near-infrared spectra measuring the alcohol content of 124 wine samples [32]. The set is split into 94 training and 30 test samples, where samples number 34, 35 and 84 are discarded as outliers, similar to [33]. Additionally, we switch the roles of training and test set to obtain a more challenging problem in terms of interpretation. Since this is originally a regression problem, we transform it into a classification problem by binning the alcohol levels into 3 classes of similar size.

The tecator data set [34] consists of 100 features that represent absorbances obtained from a spectrometer. The goal is to predict the fat content of 215 meat samples. The set is split into 172 samples for training and 43 samples for evaluation, where we again switch the roles of training and test set. Similarly as for the wine data, we bin the target variable into three classes to obtain a classification problem.

The adrenal data set [16, 35] consists of 147 patients characterized by 32 steroid markers. The goal is to predict whether a patient has a benign or a malignant adrenal tumor. For this purpose, we split the data set into a training set containing 110 instances and a test set containing 37 instances.

Figure 2: Spectra of the data sets wine (left) and tecator (right).

5.1 Experimental Setup

As a pre-processing step, we apply a z-score transformation to all data sets, i.e. we subtract the mean of the training data and divide by its standard deviation. It is important that the features in each data set have the same scaling so that we can interpret the weights of linear mappings. We train the GMLVQ model always using one prototype per class, except for the xor data set, where we use two. For the LMNN model, we use the parameters suggested by the parameter search procedure provided by the original authors.

A crucial parameter in our framework is the size of the assumed null space of the data. In order to obtain a sensible choice for this parameter, we first train a metric learning algorithm and then utilize the following scheme (a sketch of this search loop follows after the list):

1. Create a set S of candidate values for the size of the null space. This can simply be all possible values, or a guess based on the eigenspectrum of the data.

2. For each element in S, apply our proposed interpretation framework to the previously learned metric, resulting in 2d relevance mappings for each row of the trained metric.

3. Compute the classification accuracy on the training and test set for each of these 2d mappings and average them. Select the size of the null space as the one with a small test error along with the largest null space.
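The selection scheme amounts to a small search loop over the candidate set S; in the following sketch, bounds_for_size and error_of are hypothetical placeholders for the interpretation framework and the classifier evaluation, respectively.

```python
# Sketch of the null-space size selection scheme. `bounds_for_size(m)` stands in for
# computing the regularized relevance mappings under a null space of size m, and
# `error_of(mapping)` for their test error; both are placeholders, not real APIs.
from typing import Callable, Iterable, Sequence

def select_null_space_size(candidates: Iterable[int],
                           bounds_for_size: Callable[[int], Sequence],
                           error_of: Callable[[object], float],
                           tolerance: float = 0.01) -> int:
    """Pick the largest null space whose averaged test error stays close to the best one."""
    errors = {}
    for m in candidates:
        mappings = bounds_for_size(m)                        # the 2d regularized mappings
        errors[m] = sum(error_of(f) for f in mappings) / len(mappings)
    best = min(errors.values())
    return max(m for m, e in errors.items() if e <= best + tolerance)

# toy illustration with a made-up error curve: the error jumps once the null space
# grows beyond 6 dimensions, so 6 is selected
fake_bounds = lambda m: [m, m]
fake_error = lambda mapping: 0.10 if mapping <= 6 else 0.35
print(select_null_space_size(range(10), fake_bounds, fake_error))   # -> 6
```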

We also employ the term regularized relevance profile when referring to the resulting relevance bounds of our approach. Since the null space is often large, we will report in the following the effective dimension, which is the number of dimensions minus the size of the null space. Additionally, due to noise, we soften the minimum norm conditions in (9) and (10) by allowing solutions with an L1 norm slightly above µ, i.e. up to (1 + ε)µ, in the following.

As concerns the complexity of the metric learning scheme for high-dimensional data, the computation of a full-rank matrix Λ in R^{d x d} can be costly. However, Λ can be forced to have a low rank [5]. This can be done by defining Λ = Ω^T Ω with Ω in R^{l x d}, where l ≤ d restricts the rank.

5.2 Synthetic Data

In order to demonstrate the problem of directly interpreting linear weights of a trained metric as relevances, we employ the synthetic data set xor. We train a GMLVQ model that results in a zero prediction error on the training and test set. The three mappings of the resulting metric with the largest scaling are depicted in the first row of Fig. 3. Essentially, only one of these mappings has a high scaling, so that the classification model uses approximately a one-dimensional subspace to solve the classification task. A direct interpretation of this linear vector ω_3 would suggest that feature 4 is the most important one, that features 1 and 3 have only half the relevance, and that features 2, 5 and 6 are not useful for the task. However, for this data set we know that features 1-3 have the same explanatory power and, if considered alone, each of them is as important as feature 4. It follows that, for this example, a direct interpretation of the linear weights is misleading, particularly for the weakly relevant features.

In order to obtain a valid interpretation of the feature relevances, we apply our proposed framework. We estimate the classification accuracy of the regularized mappings for all possible sizes of the null space as described in subsection 5.1. The resulting curves are depicted in Fig. 1 (right). It is apparent from the figure that the smallest effective dimension with a zero test error is 3, although the test error for 2 dimensions is only slightly larger. Hence, we employ an effective dimension of 3 for our proposed framework. The resulting relevance bounds are shown in the second row of Fig. 3, where black bars depict the weakly and white bars the strongly relevant features. The results show that the bounds for the first two one-dimensional mappings ω_1 and ω_2 have vanished. Formally speaking, this implies that the same mappings can be obtained with an almost zero L1 norm, meaning that these two mappings map the training data to zero. More interesting are the resulting bounds for ω_3: the framework has identified feature 4 as a strongly relevant feature.

Figure 3: Results of our proposed approach for the xor data set. The first row shows the original linear mappings, the second row depicts the resulting upper (in black) and lower (in white) bounds.

Moreover, the framework has found that features 1-3 can be replaced, but that each of them can explain as much of the target variable as feature 4. This is precisely how we generated the data. Features 5 and 6 have almost vanishing upper bounds, merely reflecting noise.

In order to compare with relevance interpretations from the literature, we apply the methods lasso, elastic net and ridge regression to our resulting mapping by defining ŷ_i = ω^T x_i as regression targets. Then we can apply the formulation in equation (1) for the lasso, and corresponding formulations for elastic net and ridge regression, to obtain an interpretation of the feature weights based on these methods. The results are depicted in Fig. 4 for the lasso (left), elastic net (middle) and ridge regression (right). We interpret only ω_3 this way since, as we saw previously, this mapping contains the relevant information for discriminating the classes. The progression of the coefficient weights for all three frameworks implies that feature 4 is particularly relevant. However, the results differ for the three weakly relevant features 1-3: elastic net and ridge regression weight all three features equally and hence do not show that each of them could actually be neglected. The lasso identifies feature 1 as particularly important, followed by features 3 and 2. This order seems arbitrary and is an artifact of the noise which was added to all three features. Hence, we argue that these frameworks cannot provide the same information as our formalization.

Table 1: Classification error rates (between 0 and 1) on the training and test set for all data sets: xor, wine, tecator, and adrenal (once with GMLVQ and once with LMNN). If not specified differently, the classification model is GMLVQ.

5.3 Near-Infrared Spectral Data

Spectral data have many correlated features and hence a large null space. For such data it can be particularly misleading to directly interpret the weights of linear mappings; hence, our method should be well suited for this case. First, we train a GMLVQ model for each of the two data sets wine and tecator, where we restrict the rank of the matrix Ω to two, since GMLVQ usually tends to utilize only a low-rank matrix in the end. This does not harm the training accuracy, as can be observed in Table 1, but tends to improve the generalization.
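The comparison of Fig. 4 above can be sketched as follows: the projections ŷ_i = ω^T x_i of the learned mapping serve as regression targets, and the coefficient paths of lasso, elastic net and ridge regression are traced with scikit-learn; X and ω below are illustrative placeholders rather than the trained GMLVQ mapping.

```python
# Sketch: coefficient paths of lasso, elastic net and ridge regression when the
# projection of a given linear mapping is used as the regression target.
import numpy as np
from sklearn.linear_model import lasso_path, enet_path, Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))                       # placeholder training data
omega = np.array([0.5, 0.0, 0.5, 1.0, 0.0, 0.0])    # placeholder mapping (one row of Omega)
y_hat = X @ omega                                   # regression targets defined by the mapping

alphas_l, coefs_lasso, _ = lasso_path(X, y_hat)                 # coefs: (n_features, n_alphas)
alphas_e, coefs_enet, _ = enet_path(X, y_hat, l1_ratio=0.5)
coefs_ridge = np.column_stack([Ridge(alpha=a).fit(X, y_hat).coef_
                               for a in np.logspace(3, -3, 50)])

# the L1 norm of the coefficients (the x-axis of Fig. 4) grows as the penalty shrinks
print(np.abs(coefs_lasso).sum(axis=0)[:5])
print(np.abs(coefs_ridge).sum(axis=0)[:5])
```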

Figure 4: Employing the xor data set, estimates of the coefficients for different values of the L1 norm (x-axis) are shown, for the methods lasso (left), elastic net (middle) and ridge regression (right).

Figure 5: Average classification error rates of GMLVQ with regularized metrics for the wine (left) and tecator (right) data set, both for the candidate set S.

Subsequently, we apply our approach to compute relevance bounds for the respective learned metric. As previously, we determine a suitable size of the effective dimension using the scheme described in subsection 5.1. The corresponding curves are shown in Fig. 5: left for the wine and right for the tecator data set. The smallest test error is obtained with an effective dimensionality of 7 for the wine data set and with an effective dimensionality of 9 for the tecator data set. It is particularly interesting that for the wine data set the regularized metric achieves a better performance than the original metric: while the training error stays the same, the test error drops from 0.29 to 0.14, i.e. by a factor of two.

The resulting relevance bounds for the wine data set are depicted in Fig. 6 and for the tecator data set in Fig. 7. For the wine data, the training procedure of the classifier yielded a rank-one matrix; hence we use only a one-dimensional mapping for interpretation in this case. For the tecator data set, both mappings are utilized, and the summed relevance bounds for both mappings are displayed in Fig. 8. Interestingly, the relevance bounds for both data sets contain only very few irreplaceable features, while the upper bounds are peaked, which implies that a few features can already explain the mapping to a large extent. Particularly for the wine data set, much noise is removed from the original mapping, i.e. many features have a low upper bound. Fig. 9 displays the original mapping (top) and the profile averaged over the regularized lower and upper bounds (bottom). Here it is apparent that, while the original mapping has many non-zero values, the averaged regularized profile is extremely sparse.

Figure 6: Results of our proposed approach for the wine data set. The first row shows the original linear mapping, while the second row depicts the resulting upper relevance bounds. The lower bounds are all zero in this case.

Furthermore, the classification error obtained with the averaged map is comparable, both on the training and the test set, to the averaged error of the regularized mappings.

Figure 7: Results of our proposed approach for the tecator data set. The first row shows the original linear mappings, the second row depicts the resulting upper and lower relevance bounds.

5.4 Biomedical Data

For the adrenal data set we compare two metric learning approaches, GMLVQ and LMNN. Both use the same functional form for computing distances; hence, we can apply our approach to both trained relevance matrices. First, we train both models with the restriction to rank-two relevance matrices. This restriction did not harm the classification performance in our experiments, as compared to training without it. The classification errors are depicted in Table 1. Both approaches achieve a comparable performance, where the LMNN model is better on the training data while GMLVQ is superior on the test data.

For these models, we compute the classification errors based on different sizes of the effective dimension. The results are depicted in Fig. 10. Good performances are achieved with an effective dimensionality of 5 for the GMLVQ model and of 2 for the LMNN algorithm. The according relevance bounds for both relevance matrices are shown in Fig. 11. Both trained distance metrics agree on a few features, such as feature 9, while they often emphasize different ones. This might explain the different error curves in Fig. 10 and the different classification performance on the training and test set.

Figure 8: Summed lower and upper bounds for the tecator data set.

Figure 9: Absolute values of the original mapping (top row) together with the absolute values of the averaged regularized mappings (bottom row).

6 Conclusion

In this contribution, we have presented an approach to obtain a valid interpretation of the feature relevances of a trained metric learning model. The results show that this procedure does not only provide a meaningful interpretation of the learned data transformation, but can even improve the classification performance in the case of a particularly large null space. In future work we will employ this proposal to obtain insights into specific data sets, as well as to provide more information about weakly relevant features.

Acknowledgment

Funding from the DFG under grant number HA2719/7-1 and by the CITEC centre of excellence is gratefully acknowledged.

References

[1] A. Schulz, B. Mokbel, M. Biehl, and B. Hammer, "Inferring feature relevances from metric learning," in 2015 IEEE Symposium Series on Computational Intelligence (SSCI), Dec 2015.

[2] M. Strickert, B. Hammer, T. Villmann, and M. Biehl, "Regularization and improved interpretation of linear data mappings and adaptive distance measures," in IEEE SSCI CIDM 2013. IEEE Computational Intelligence Society, 2013.

[3] A. Bellet, A. Habrard, and M. Sebban, "A survey on metric learning for feature vectors and structured data," ArXiv e-prints, Jun. 2013.

Figure 10: Average classification error rates of GMLVQ (left) and LMNN (right) with regularized metrics for the adrenal data set.

Figure 11: Relevance bounds for a GMLVQ model (top) and an LMNN model (bottom), both trained on the adrenal data set.

[4] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, "Large scale metric learning from equivalence constraints," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012. IEEE Computer Society, 2012.

[5] K. Bunte, P. Schneider, B. Hammer, F.-M. Schleif, T. Villmann, and M. Biehl, "Limited rank matrix learning, discriminative dimension reduction and visualization," Neural Networks, vol. 26, 2012.

[6] B. Kulis, "Metric learning: A survey," Foundations and Trends in Machine Learning, vol. 5, no. 4, 2013.

[7] A. Takasu, D. Fukagawa, and T. Akutsu, "Statistical learning algorithm for tree similarity," in IEEE Int. Conf. on Data Mining, ICDM, 2007.

[8] S. S. Bucak, R. Jin, and A. K. Jain, "Multiple kernel learning for visual object recognition: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, 2014.

[9] P. Schneider, M. Biehl, and B. Hammer, "Adaptive relevance matrices in learning vector quantization," Neural Computation, vol. 21, 2009.

[10] A. Bellet and A. Habrard, "Robustness and generalization for metric learning," Neurocomputing, 2015.

[11] R. Jin, S. Wang, and Y. Zhou, "Regularized distance metric learning: Theory and algorithm," in Advances in Neural Information Processing Systems 22 (NIPS 2009), Vancouver, British Columbia, Canada, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, Eds. Curran Associates, Inc., 2009.

[12] Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, Eds., Advances in Neural Information Processing Systems 22, Vancouver, British Columbia, Canada. Curran Associates, Inc., 2009.

[13] B. Arnonkijpanich, A. Hasenfuss, and B. Hammer, "Local matrix adaptation in topographic neural maps," Neurocomputing, vol. 74, no. 4, 2011.

[14] M. Biehl, B. Hammer, F. Schleif, P. Schneider, and T. Villmann, "Stationarity of matrix relevance LVQ," in 2015 International Joint Conference on Neural Networks, IJCNN 2015, Killarney, Ireland, July 12-17, 2015. IEEE, 2015.

[15] B. Mokbel, B. Paassen, F.-M. Schleif, and B. Hammer, "Metric learning for sequences in relational LVQ," Neurocomputing, 2015, to appear.

[16] W. Arlt, M. Biehl, A. E. Taylor, S. Hahner, R. Libe, B. A. Hughes, P. Schneider, D. J. Smith, H. Stiekema, N. Krone, E. Porfiri, G. Opocher, J. Bertherat, F. Mantero, B. Allolio, M. Terzolo, P. Nightingale, C. H. L. Shackleton, X. Bertagna, M. Fassnacht, and P. M. Stewart, "Urine steroid metabolomics as a biomarker tool for detecting malignancy in adrenal tumors," Journal of Clinical Endocrinology and Metabolism, vol. 96, 2011.

[17] G. de Vries, S. C. Pauws, and M. Biehl, "Insightful stress detection from physiology modalities using learning vector quantization," Neurocomputing, 2015.

[18] F.-M. Schleif, B. Hammer, M. Kostrzewa, and T. Villmann, "Exploration of mass-spectrometric data in clinical proteomics using learning vector quantization methods," Briefings in Bioinformatics, vol. 9, no. 2, 2008.

[19] A. A. Freitas, "Comprehensible classification models: A position paper," SIGKDD Explorations Newsletter, vol. 15, no. 1, Mar. 2014.

[20] V. V. Belle and P. Lisboa, "White box radial basis function classifiers with component selection for clinical prediction models," Artificial Intelligence in Medicine, vol. 60, no. 1, 2014.

[21] C. Rudin and K. L. Wagstaff, "Machine learning for science and society," Machine Learning, vol. 95, no. 1, pp. 1-9, 2014.

[22] N. R. Wray, J. Yang, B. J. Hayes, A. L. Price, M. E. Goddard, and P. M. Visscher, "Pitfalls of predicting complex traits from SNPs," Nature Reviews Genetics, vol. 14, no. 7, Jul. 2013.

[23] P. Langley, "The changing science of machine learning," Machine Learning, vol. 82, no. 3, 2011.

[24] P. J. G. Lisboa, "Interpretability in machine learning - principles and practice," in WILF, ser. Lecture Notes in Computer Science, F. Masulli, G. Pasi, and R. R. Yager, Eds. Springer, 2013.

[25] R. Nilsson, J. M. Peña, J. Björkegren, and J. Tegnér, "Consistent feature selection for pattern recognition in polynomial time," Journal of Machine Learning Research, vol. 8, 2007.

[26] B. Frénay, G. Doquire, and M. Verleysen, "Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification," Neurocomputing, 2013.

[27] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, Dec. 2004.

[28] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, pp. 267-288, 1996.

[29] B. Frénay, D. Hofmann, A. Schulz, M. Biehl, and B. Hammer, "Valid interpretation of feature relevance for linear data mappings," in 2014 IEEE SSCI, CIDM 2014, Orlando, FL, USA, December 9-12, 2014.

[30] J. H. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, pp. 1-22, 2010.

[31] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," Journal of Machine Learning Research, vol. 10, pp. 207-244, 2009.

[32] UCL, Spectral wine database, provided by Prof. Marc Meurens.

[33] C. Krier, D. François, F. Rossi, and M. Verleysen, "Feature clustering and mutual information for the selection of variables in spectral data," in 15th European Symposium on Artificial Neural Networks, ESANN 2007, Bruges, Belgium, April 25-27, 2007.

[34] Tecator meat sample dataset.

[35] M. Biehl, P. Schneider, D. Smith, H. Stiekema, A. Taylor, B. Hughes, C. Shackleton, P. Stewart, and W. Arlt, "Matrix relevance LVQ in steroid metabolomics based classification of adrenal tumors," in 20th European Symposium on Artificial Neural Networks, ESANN 2012, Bruges, Belgium, April 25-27, 2012.


More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Distance learning in discriminative vector quantization

Distance learning in discriminative vector quantization Distance learning in discriminative vector quantization Petra Schneider, Michael Biehl, Barbara Hammer 2 Institute for Mathematics and Computing Science, University of Groningen P.O. Box 47, 97 AK Groningen,

More information

Robust Inverse Covariance Estimation under Noisy Measurements

Robust Inverse Covariance Estimation under Noisy Measurements .. Robust Inverse Covariance Estimation under Noisy Measurements Jun-Kun Wang, Shou-De Lin Intel-NTU, National Taiwan University ICML 2014 1 / 30 . Table of contents Introduction.1 Introduction.2 Related

More information

A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives

A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives Paul Grigas May 25, 2016 1 Boosting Algorithms in Linear Regression Boosting [6, 9, 12, 15, 16] is an extremely

More information

Using the Delta Test for Variable Selection

Using the Delta Test for Variable Selection Using the Delta Test for Variable Selection Emil Eirola 1, Elia Liitiäinen 1, Amaury Lendasse 1, Francesco Corona 1, and Michel Verleysen 2 1 Helsinki University of Technology Lab. of Computer and Information

More information

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Heterogeneous mixture-of-experts for fusion of locally valid knowledge-based submodels

Heterogeneous mixture-of-experts for fusion of locally valid knowledge-based submodels ESANN'29 proceedings, European Symposium on Artificial Neural Networks - Advances in Computational Intelligence and Learning. Bruges Belgium), 22-24 April 29, d-side publi., ISBN 2-9337-9-9. Heterogeneous

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

GROUP-SPARSE SUBSPACE CLUSTERING WITH MISSING DATA

GROUP-SPARSE SUBSPACE CLUSTERING WITH MISSING DATA GROUP-SPARSE SUBSPACE CLUSTERING WITH MISSING DATA D Pimentel-Alarcón 1, L Balzano 2, R Marcia 3, R Nowak 1, R Willett 1 1 University of Wisconsin - Madison, 2 University of Michigan - Ann Arbor, 3 University

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Learning to Learn and Collaborative Filtering

Learning to Learn and Collaborative Filtering Appearing in NIPS 2005 workshop Inductive Transfer: Canada, December, 2005. 10 Years Later, Whistler, Learning to Learn and Collaborative Filtering Kai Yu, Volker Tresp Siemens AG, 81739 Munich, Germany

More information

Linear Regression. Volker Tresp 2018

Linear Regression. Volker Tresp 2018 Linear Regression Volker Tresp 2018 1 Learning Machine: The Linear Model / ADALINE As with the Perceptron we start with an activation functions that is a linearly weighted sum of the inputs h = M j=0 w

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Sparse Approximation and Variable Selection

Sparse Approximation and Variable Selection Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

L 2,1 Norm and its Applications

L 2,1 Norm and its Applications L 2, Norm and its Applications Yale Chang Introduction According to the structure of the constraints, the sparsity can be obtained from three types of regularizers for different purposes.. Flat Sparsity.

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Support Vector Machine & Its Applications

Support Vector Machine & Its Applications Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Support Vector Machines

Support Vector Machines Support Vector Machines INFO-4604, Applied Machine Learning University of Colorado Boulder September 28, 2017 Prof. Michael Paul Today Two important concepts: Margins Kernels Large Margin Classification

More information

A RELIEF Based Feature Extraction Algorithm

A RELIEF Based Feature Extraction Algorithm A RELIEF Based Feature Extraction Algorithm Yijun Sun Dapeng Wu Abstract RELIEF is considered one of the most successful algorithms for assessing the quality of features due to its simplicity and effectiveness.

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Building a Prognostic Biomarker

Building a Prognostic Biomarker Building a Prognostic Biomarker Noah Simon and Richard Simon July 2016 1 / 44 Prognostic Biomarker for a Continuous Measure On each of n patients measure y i - single continuous outcome (eg. blood pressure,

More information

Variable Selection and Weighting by Nearest Neighbor Ensembles

Variable Selection and Weighting by Nearest Neighbor Ensembles Variable Selection and Weighting by Nearest Neighbor Ensembles Jan Gertheiss (joint work with Gerhard Tutz) Department of Statistics University of Munich WNI 2008 Nearest Neighbor Methods Introduction

More information

Learning From Data: Modelling as an Optimisation Problem

Learning From Data: Modelling as an Optimisation Problem Learning From Data: Modelling as an Optimisation Problem Iman Shames April 2017 1 / 31 You should be able to... Identify and formulate a regression problem; Appreciate the utility of regularisation; Identify

More information

2.3. Clustering or vector quantization 57

2.3. Clustering or vector quantization 57 Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :

More information

Linear Spectral Hashing

Linear Spectral Hashing Linear Spectral Hashing Zalán Bodó and Lehel Csató Babeş Bolyai University - Faculty of Mathematics and Computer Science Kogălniceanu 1., 484 Cluj-Napoca - Romania Abstract. assigns binary hash keys to

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Adaptive Metric Learning Vector Quantization for Ordinal Classification

Adaptive Metric Learning Vector Quantization for Ordinal Classification 1 Adaptive Metric Learning Vector Quantization for Ordinal Classification Shereen Fouad and Peter Tino 1 1 School of Computer Science, The University of Birmingham, Birmingham B15 2TT, United Kingdom,

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Rapid Object Recognition from Discriminative Regions of Interest

Rapid Object Recognition from Discriminative Regions of Interest Rapid Object Recognition from Discriminative Regions of Interest Gerald Fritz, Christin Seifert, Lucas Paletta JOANNEUM RESEARCH Institute of Digital Image Processing Wastiangasse 6, A-81 Graz, Austria

More information

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN , Relevance determination in learning vector quantization Thorsten Bojer, Barbara Hammer, Daniel Schunk, and Katharina Tluk von Toschanowitz University of Osnabrück, Department of Mathematics/ Computer Science,

More information

Riemannian Metric Learning for Symmetric Positive Definite Matrices

Riemannian Metric Learning for Symmetric Positive Definite Matrices CMSC 88J: Linear Subspaces and Manifolds for Computer Vision and Machine Learning Riemannian Metric Learning for Symmetric Positive Definite Matrices Raviteja Vemulapalli Guide: Professor David W. Jacobs

More information

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Xiaodong Lin 1 and Yu Zhu 2 1 Statistical and Applied Mathematical Science Institute, RTP, NC, 27709 USA University of Cincinnati,

More information

Dimension reduction methods: Algorithms and Applications Yousef Saad Department of Computer Science and Engineering University of Minnesota

Dimension reduction methods: Algorithms and Applications Yousef Saad Department of Computer Science and Engineering University of Minnesota Dimension reduction methods: Algorithms and Applications Yousef Saad Department of Computer Science and Engineering University of Minnesota Université du Littoral- Calais July 11, 16 First..... to the

More information

Immediate Reward Reinforcement Learning for Projective Kernel Methods

Immediate Reward Reinforcement Learning for Projective Kernel Methods ESANN'27 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), 25-27 April 27, d-side publi., ISBN 2-9337-7-2. Immediate Reward Reinforcement Learning for Projective Kernel Methods

More information