Nonlinear Metric Learning with Kernel Density Estimation


Nonlinear Metric Learning with Kernel Density Estimation
Yujie He, Yi Mao, Wenlin Chen, and Yixin Chen, Senior Member, IEEE

Abstract: Metric learning, the task of learning a good distance metric, is a key problem in machine learning with ample applications. This paper introduces a novel framework for nonlinear metric learning, called kernel density metric learning (KDML), which is easy to use and provides nonlinear, probability-based distance measures. KDML constructs a direct nonlinear mapping from the original input space into a feature space based on kernel density estimation. The nonlinear mapping in KDML embodies established distance measures between probability density functions, and leads to accurate classification on datasets for which existing linear metric learning methods would fail. It addresses the severe challenge to distance-based classifiers when features are from heterogeneous domains and, as a result, the Euclidean or Mahalanobis distance between original feature vectors is not meaningful. We also propose two ways to determine the kernel bandwidths, including an adaptive local scaling approach and an integrated optimization algorithm that learns the Mahalanobis matrix and kernel bandwidths together. KDML is a general framework that can be combined with any existing metric learning algorithm. As concrete examples, we combine KDML with two leading metric learning algorithms, large margin nearest neighbors (LMNN) and neighborhood component analysis (NCA). KDML can naturally handle not only numerical features but also categorical ones, which is rarely found in previous metric learning algorithms. Extensive experimental results on various datasets show that KDML significantly improves existing metric learning algorithms in terms of classification accuracy.

Index Terms: classification; metric learning; large margin nearest neighbors; neighborhood components analysis; kernel density estimation.

1 INTRODUCTION

Distance metrics are distance measurements between data points, such as the Euclidean distance or the Manhattan distance. Learning a distance metric is a fundamental problem in machine learning and data mining [1]. In many applications, once we have defined a good distance or similarity measure between all pairs of data points, the data mining task becomes trivial. For example, with a perfect distance metric, the k-nearest neighbor (knn) algorithm can achieve perfect classification [2]-[4]. As a result, ever since metric learning was proposed by Xing et al. [5], there has been extensive research in this area [6]-[10]. These new methods have greatly improved the performance of many metric-based algorithms and gained considerable popularity.

There are several basic desirable properties for any metric learning algorithm: 1) it must reflect the true distance or similarity between data samples; 2) it needs to be flexible enough to support different learning settings and data types; 3) it should generalize to out-of-sample data; 4) it should be easy to use and should not require extensive parameter tuning. Few existing algorithms satisfy all these requirements. Another key challenge for knn classification and metric learning is that, in many cases, the features are from totally different domains and have vastly different scales.

Y. He, W. Chen, and Y. Chen are with the Department of Computer Science and Engineering, Washington University in St. Louis, USA. Y. Mao is with Xidian University, Xi'an, China.
For example, one feature could be velocity in km/h while another is the color of an object. In such cases, it does not make much sense to compute the Euclidean distance between two feature vectors, even under a linear transformation of the features.

The majority of existing methods are based on a linear transformation. Namely, they learn a Mahalanobis distance between two D-dimensional data points $x_i, x_j \in \mathbb{R}^D$ of the form

$$d_L(x_i, x_j) = \|L(x_i - x_j)\|_2, \quad (1)$$

where $\|\cdot\|_2$ is the $l_2$-norm and $L \in \mathbb{R}^{D \times D}$ is a matrix. Therefore, L represents a linear transformation of the input space, which corresponds to rotating and scaling the data points. Many representative metric learning algorithms, such as distance metric learning [5], large margin nearest neighbors (LMNN) [7], information-theoretic metric learning (ITML) [8], neighborhood components analysis (NCA) [6], and SEPAPH [11], are based on such a linear transformation and the $l_2$ Euclidean distance. A main reason for the popularity of linear metric learning is its good off-the-shelf usability.

However, linear metric learning has inherent limits on its mapping capability. Nonlinear metric learning is more general and offers greater separation ability in theory. For example, for the four points in two classes in Figure 1.a, no linear metric learning method can give correct knn classification. The linear transformation in Figure 1.c only rotates and scales the data, and so does the LMNN mapping in Figure 1.d. We can see that knn classification on the mapped data points in Figures 1.c and 1.d still cannot separate the two classes correctly.

However, our nonlinear transformation (to be explained later) can map the four points to the coordinates in Figure 1.b, which enables correct knn classification.

Nonlinear metric learning methods, although more expressive, are far less popular than linear methods. Often, they are not easy to use, since they require complex computation, not only for coefficient training, but also for model selection and hyper-parameter tuning. For example, kernelization methods [10], [12]-[14] are inherently limited by the size of the kernel matrices. Neural-network based methods [15] are also very expensive. Furthermore, these nonlinear methods often require tuning of many hyper-parameters. Their sensitivity to parameter tuning further hinders their off-the-shelf usability, especially for unknown domains.

Recently, Kedem et al. proposed two nonlinear metric learning algorithms, $\chi^2$-LMNN and GB-LMNN [16]. $\chi^2$-LMNN still uses the linear transformation in (1) but employs a non-Euclidean distance. However, $\chi^2$-LMNN has rather limited scope: it can only be applied when the input data are sampled from a simplex $S^D = \{x \in \mathbb{R}^D \mid x \ge 0,\ x^T \mathbf{1} = 1\}$. GB-LMNN learns a nonlinear mapping $\phi(x)$ which is an ensemble over a number of regression trees with different heights. GB-LMNN applies gradient boosting to learn the nonlinear mapping in a function space. However, GB-LMNN is computationally expensive, which limits its practicality.

In this paper, we propose a new metric learning framework called kernel density metric learning (KDML). It uses kernel density regression to nonlinearly map each attribute to a new feature space. The distance is then defined as the Euclidean distance in the new space. Although we focus on integrating KDML with LMNN and NCA in this paper, the framework is general and can be used to support many metric learning algorithms such as ITML and DCV [17], [18].

There are several salient advantages of the KDML approach. 1) It embodies excellent nonlinear distance measures with a sound probabilistic explanation. In fact, the Euclidean distance in the mapped feature space corresponds to established distance measures between probability density functions. As a result, such a nonlinear mapping allows us to correctly classify datasets that are notoriously difficult to tackle with linear metric learning methods. 2) Compared to kernel-based nonlinear metric learning methods, KDML is easier to use and offers good off-the-shelf usability. In fact, end users do not need to tune any parameter in KDML, since we can automatically learn its hyper-parameters using gradient descent. Moreover, KDML can be used as a preprocessing blackbox to map features and be integrated with any supervised metric learning algorithm. Thus, it allows us to leverage the extensive development of efficient and scalable linear metric learning methods. 3) Unlike most existing metric learning methods, which require the attributes to be numerical, KDML is the first metric learning algorithm that can naturally handle numerical, categorical, and mixed attributes in a unified fashion. 4) KDML can handle the cases when features are from heterogeneous domains, as it maps all features to density-based quantities that can be uniformly compared.

Fig. 1. A toy example with four points in two classes, marked in different shapes: (a) the original data; (b) the data after our KDML mapping (the two points in each class are very close to each other); (c) a random linear transformation; (d) the data after the LMNN mapping.
This paper makes the following contributions.
We introduce KDML, a nonlinear metric learning algorithm, by proposing a novel nonlinear mapping that provides a good similarity measure based on kernel density estimation. It can naturally handle both numerical and categorical features and offers good out-of-the-box usability.
We integrate KDML with prominent existing metric learning methods, including LMNN and NCA, and develop an optimization algorithm to train the model in a holistic way. The algorithm automatically finds optimal bandwidths for a Nadaraya-Watson kernel density estimator, which is absent in previous work.
We conduct an extensive evaluation on a collection of real-world datasets for multiway classification, with both numerical and categorical features. We show that KDML improves the performance of state-of-the-art metric learning methods on knn classification tasks.
The rest of this paper is organized as follows. Section 2 gives preliminaries on metric learning. Section 3 presents the proposed KDML model, including its nonlinear mapping and kernel density estimation. Section 4 discusses using KDML as a preprocessing step, and Section 5 combines KDML with two representative metric learning algorithms, LMNN and NCA, and presents the integrated optimization algorithm for learning the transformation matrix and kernel bandwidths. Section 6 surveys related work on metric learning. Section 7 presents experimental results of different metric learning algorithms on various

benchmark datasets. Finally, Section 8 concludes the paper.

2 PRELIMINARIES

In this paper, we focus on a supervised classification setting. The main ideas can also be extended to other settings such as weakly supervised, semi-supervised, or unsupervised ones. We assume we are given a training data set

$$T = \{(x_1, y_1), \ldots, (x_N, y_N)\} \subset \mathcal{D}_1 \times \cdots \times \mathcal{D}_D \times \mathcal{C}, \quad (2)$$

where the d-th feature is defined in a domain $\mathcal{D}_d$ and the labels $y_i$ are from a set of C classes $\mathcal{C} = \{1, \ldots, C\}$. Note that our setting is more general than typical previous settings, because the domain $\mathcal{D}_d$ can be either a numerical set such as $\mathbb{R}$ or a categorical set such as a set of colors.

For each new input x, the k-nearest neighbors (knn) algorithm classifies x by a majority vote among the k neighbors that are closest to x under a certain distance metric. The simple knn algorithm often gives surprisingly good classification quality compared to more complex methods such as SVM. Its decision surfaces are nonlinear, and the quality of the classification improves automatically as the amount of training data increases. However, knn classification relies heavily on the distance metric, and it therefore provides a most natural paradigm for evaluating distance metric learning algorithms [19].

Metric learning, first proposed by Xing et al. [5], aims at automatically learning a good distance metric. Most existing metric learning algorithms are linear methods which learn a Mahalanobis distance

$$d^2_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \quad (3)$$

where $M = L^T L$ is a positive semi-definite matrix. The new distance metric is $d_M(x_i, x_j)$. It can be seen as applying a linear transformation $L x_i$ to each data point $x_i$. Metric learning can also be used for dimensionality reduction by applying singular value decomposition (SVD) to M. Below, we briefly review some important metric learning algorithms. In this paper, we use two representative and popular metric learning algorithms, large margin nearest neighbors (LMNN) [7] and neighborhood components analysis (NCA) [20], as the basic metric learning methods to be integrated with KDML.

2.1 LMNN and ITML

Large margin nearest neighbor (LMNN) is a linear metric learning algorithm tailored for knn classification. LMNN learns a Mahalanobis distance as defined in (3). For each input $(x_i, y_i)$, LMNN specifies a number of target neighbors with the same label as $y_i$. Normally these m target neighbors are simply the m same-label neighbors that are closest to $x_i$ under the Euclidean distance. We use $j \rightsquigarrow i$ to denote that $x_j$ is a target neighbor of $x_i$, and $y_{ij} \in \{0, 1\}$ to denote whether the labels $y_j$ and $y_i$ match ($y_{ij} = 1$ when $y_i = y_j$). The objective of LMNN is to minimize

$$E(M) = (1-\mu) \sum_{i,\, j \rightsquigarrow i} d^2_M(x_i, x_j) + \mu \sum_{i,\, j \rightsquigarrow i,\, l} (1 - y_{il}) \left[1 + d^2_M(x_i, x_j) - d^2_M(x_i, x_l)\right]_+, \quad (4)$$

where $[z]_+ = \max(0, z)$ is the standard hinge loss and $\mu \in (0, 1)$ is a positive constant controlling the relative weights of the two terms. The first term minimizes the distance between each input and its target neighbors; the second term, incorporating the idea of a margin as in SVM, penalizes the distances to those mismatched points that invade the neighborhood of each input. It has been shown that the optimization in (4) can be reformulated as a semidefinite programming (SDP) problem [7]. Weinberger et al. proposed a specialized subgradient descent algorithm to solve this SDP by exploiting the sparsity of active invaders in the second term of (4).
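As a concrete illustration of (4), here is a minimal Python/NumPy sketch that evaluates the LMNN objective for a fixed Mahalanobis matrix M. The function names, the brute-force target-neighbor search, and the exhaustive impostor loop are illustrative choices only, not the specialized subgradient solver or the Matlab implementation used in our experiments.

```python
import numpy as np

def mahalanobis_sq(M, xi, xj):
    """Squared Mahalanobis distance d_M^2(xi, xj) = (xi - xj)^T M (xi - xj)."""
    d = xi - xj
    return float(d @ M @ d)

def lmnn_objective(X, y, M, m=3, mu=0.5):
    """Evaluate the LMNN loss of Eq. (4) for a fixed PSD matrix M.

    X : (N, D) data matrix; y : (N,) integer labels; m : number of target neighbors.
    The loss has a pull term over target neighbors and a hinge push term over impostors.
    """
    N = X.shape[0]
    pull, push = 0.0, 0.0
    for i in range(N):
        # Target neighbors: the m same-class points closest to x_i under the Euclidean distance.
        same = [j for j in range(N) if j != i and y[j] == y[i]]
        targets = sorted(same, key=lambda j: np.sum((X[i] - X[j]) ** 2))[:m]
        for j in targets:
            d_ij = mahalanobis_sq(M, X[i], X[j])
            pull += d_ij
            # Impostors: differently labeled points l whose hinge term is positive.
            for l in range(N):
                if y[l] != y[i]:
                    push += max(0.0, 1.0 + d_ij - mahalanobis_sq(M, X[i], X[l]))
    return (1.0 - mu) * pull + mu * push
```

In practice the double loop over impostors is avoided by caching the few active invaders per point, which is exactly what the specialized solver of Weinberger et al. exploits.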
LMNN has received great attention and popularity due to its good knn classification performance, its efficiency, and its ease of use. Another important work is information-theoretic metric learning (ITML) [8]. ITML also uses the linear transformation in (3), but it exploits a one-to-one correspondence between the Mahalanobis distance parameterized by M and a multivariate Gaussian,

$$P(x; M) = \frac{1}{Z} \exp\!\left(-\frac{1}{2} d^2_M(x, x_0)\right), \quad (5)$$

where Z is a normalization factor and $x_0$ is the mean of the Gaussian. Using this correspondence, the objective of ITML is to minimize

$$\mathrm{KL}\big(p(x; M_0)\,\|\,p(x; M)\big) = \int p(x; M_0) \ln \frac{p(x; M_0)}{p(x; M)}\, dx,$$

where $M_0$ is a fixed matrix such as I or the inverse covariance matrix. The intuition is to regularize M by minimizing the Kullback-Leibler (KL) divergence [21] between the implied distribution $P(x; M)$ and a prior distribution.

2.2 Neighborhood component analysis (NCA)

NCA is another linear metric learning method. Its basic idea is to find a distance metric that maximizes the performance of knn classification, measured by leave-one-out (LOO) validation. NCA also uses the linear transformation in (3) but directly learns the matrix L in $M = L^T L$. Note that we have $d^2_M(x_i, x_j) = (x_i - x_j)^T L^T L (x_i - x_j) = (Lx_i - Lx_j)^T (Lx_i - Lx_j)$. In NCA, each point $x_i$ selects another point $x_j$ as its neighbor with some probability $p_{ij}$, and inherits its class

label from that point. Using a softmax over the Euclidean distance in the transformed space, $p_{ij}$ ($i \ne j$) is defined as

$$p_{ij} = \frac{\exp(-\|Lx_i - Lx_j\|^2)}{\sum_{k \ne i} \exp(-\|Lx_i - Lx_k\|^2)}. \quad (6)$$

Under this definition, we can compute $p_i$, the probability that point i will be correctly classified:

$$p_i = \sum_{j \in C_i} p_{ij}, \quad \text{where } C_i = \{j \mid y_i = y_j\}. \quad (7)$$

According to (6) and (7), we have

$$p_i = \sum_{j \in C_i} p_{ij} = \frac{\sum_{j \in C_i} \exp(-\|Lx_i - Lx_j\|^2)}{\sum_{k \ne i} \exp(-\|Lx_i - Lx_k\|^2)}. \quad (8)$$

As this probability is a function of the transformation matrix L, we find the L that maximizes the expected number of correctly classified points,

$$f(L) = \sum_i p_i. \quad (9)$$

The corresponding gradient is

$$\frac{\partial f}{\partial L} = 2L \sum_i \left( p_i \sum_{k} p_{ik} (x_i - x_k)(x_i - x_k)^T - \sum_{j \in C_i} p_{ij} (x_i - x_j)(x_i - x_j)^T \right). \quad (10)$$

NCA uses gradient ascent to maximize (9). By restricting the size of L, NCA can also be used for dimensionality reduction.

3 THE KDML FRAMEWORK

In this section, we propose the KDML framework for nonlinear metric learning. We assume that we are given inputs $(x, y) \in \mathcal{D}_1 \times \cdots \times \mathcal{D}_D \times \mathcal{C}$, where the d-th feature is defined in a domain $\mathcal{D}_d$ and the label y is from a set of C classes $\mathcal{C} = \{1, \ldots, C\}$.

3.1 KDML feature mapping

Under the KDML framework, we propose two kinds of transformations for each input (x, y).

Density features. For each dimension $d = 1, \ldots, D$ and any $x_d \in \mathcal{D}_d$, there exists a conditional probability density function

$$P_d(c, x_d) = P(y = c \mid x_d), \quad c = 1, \ldots, C. \quad (11)$$

We use $P_d(x_d)$ to denote the vector $[P_d(1, x_d), \ldots, P_d(C, x_d)]$. In this transformation, each input is transformed into a vector

$$\phi_P(x) = [P_1(x_1); \ldots; P_D(x_D)], \quad (12)$$

which is a concatenation of all D probability density vectors. An alternative is to use the square roots of the probabilities,

$$S_d(c, x_d) = \sqrt{P(y = c \mid x_d)}, \quad c = 1, \ldots, C, \quad (13)$$

which leads to a corresponding feature vector $\phi_S(x)$.

Entropy features. For each dimension $d = 1, \ldots, D$ and any $x_d \in \mathcal{D}_d$, we compute the logarithm of the density,

$$E_d(c, x_d) = \ln P(y = c \mid x_d), \quad c = 1, \ldots, C. \quad (14)$$

Let $E_d(x_d)$ denote the vector $[E_d(1, x_d), \ldots, E_d(C, x_d)]$. In this mapping, each input is transformed into a vector

$$\phi_E(x) = [E_1(x_1); \ldots; E_D(x_D)], \quad (15)$$

which is a concatenation of all the entropy vectors.

In KDML, we choose one feature mapping from $\phi_P$, $\phi_S$, and $\phi_E$ and call it $\phi$. We may also include the original variables in the feature vector to make the mapping strictly more general than the original linear mapping. KDML then employs an existing linear metric learning method to learn a linear transformation $L\phi(x)$, which gives rise to a Mahalanobis distance in the mapped feature space.

3.2 Implied distance measures

We now discuss the distance measures implied by the above KDML features. They all correspond to sound distance or similarity measures between probability density functions. As a result, in many cases the Euclidean distance in the mapped feature space reflects a better distance measure than the Euclidean distance in the original input space. One way to view KDML is that it first transforms the original input into a new space. In the new space, before any metric learning, the similarity between two data points is based on the Euclidean distance between their feature vectors:

$$d^2(x_i, x_j) = (\phi(x_i) - \phi(x_j))^T (\phi(x_i) - \phi(x_j)). \quad (16)$$

In fact, since there are D dimensions, each data point corresponds to D probability density functions (PDFs), each with C possible values.
That is, for an input $x_i$, its PDF at the d-th dimension is

$$\mathrm{PDF}_{i,d} = [P_d(1, x_{i,d}), \ldots, P_d(C, x_{i,d})], \quad (17)$$

where we use $x_{i,d}$ to denote the d-th attribute of $x_i$. The distance in (16) can be viewed as a summation over the D dimensions,

$$d^2(x_i, x_j) = \sum_{d=1}^{D} \mathrm{diff}(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}), \quad (18)$$

where $\mathrm{diff}(\cdot, \cdot)$ measures the distance between two PDFs over the set $\mathcal{C}$.
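As a small illustration of the mappings in Section 3.1 and the per-dimension decomposition in (18), the following sketch builds $\phi_P$, $\phi_S$, or $\phi_E$ for one input and checks that the plain Euclidean distance in the mapped space is a sum of per-dimension PDF distances. The function and variable names are ours, and the conditional class probabilities are assumed to be precomputed (for example by the estimators of Section 3.3).

```python
import numpy as np

def kdml_features(prob, kind="entropy", eps=1e-12):
    """Map one input to a KDML feature vector.

    prob : (D, C) array with prob[d, c] ~ P(y = c | x_d) for each dimension d.
    kind : 'density' (phi_P), 'sqrt' (phi_S), or 'entropy' (phi_E).
    Returns the concatenation of the D per-dimension C-vectors.
    """
    if kind == "density":
        feat = prob
    elif kind == "sqrt":
        feat = np.sqrt(prob)
    elif kind == "entropy":
        feat = np.log(prob + eps)   # eps guards against log(0)
    else:
        raise ValueError(kind)
    return feat.reshape(-1)

def squared_distance(prob_i, prob_j, kind="entropy"):
    """Euclidean distance in the mapped space, Eq. (16)."""
    fi = kdml_features(prob_i, kind)
    fj = kdml_features(prob_j, kind)
    return float(np.sum((fi - fj) ** 2))

def squared_distance_per_dim(prob_i, prob_j, kind="entropy"):
    """The same quantity written as the per-dimension sum of Eq. (18)."""
    return sum(squared_distance(prob_i[d:d+1], prob_j[d:d+1], kind)
               for d in range(prob_i.shape[0]))
```

With kind='sqrt' the per-dimension terms are the squared-chord distances discussed next, and with kind='density' they are squared Euclidean distances between PDFs.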

Distance and similarity measures between PDFs have been extensively studied; a good survey of these measures can be found in [22]. To justify the features in KDML, we examine the underlying distance measures they imply.

When the density feature $\phi_S$ is used, the Euclidean distance between two points $x_i$ and $x_j$ is

$$d^2(x_i, x_j) = (\phi_S(x_i) - \phi_S(x_j))^T (\phi_S(x_i) - \phi_S(x_j)) = \sum_{d=1}^{D} \sum_{c=1}^{C} \left[\sqrt{P_d(c, x_{i,d})} - \sqrt{P_d(c, x_{j,d})}\right]^2.$$

Therefore, using $\phi_S$ features, the implied distance measure between the PDFs is

$$\mathrm{diff}_S(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \left[\sqrt{P_d(c, x_{i,d})} - \sqrt{P_d(c, x_{j,d})}\right]^2,$$

which is exactly the well-known squared-chord PDF distance measure [23], and also the square of the Matusita distance measure [24]. Hence, the $\phi_S$ feature implies the squared-chord and Matusita PDF distance measures.

When the density feature $\phi_P$ is used, the Euclidean distance between two points $x_i$ and $x_j$ is

$$d^2(x_i, x_j) = (\phi_P(x_i) - \phi_P(x_j))^T (\phi_P(x_i) - \phi_P(x_j)) = \sum_{d=1}^{D} \sum_{c=1}^{C} \left[P_d(c, x_{i,d}) - P_d(c, x_{j,d})\right]^2. \quad (19)$$

We can see that, using $\phi_P$ features, the implied distance measure between two PDFs is

$$\mathrm{diff}_P(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \left[P_d(c, x_{i,d}) - P_d(c, x_{j,d})\right]^2, \quad (20)$$

which is exactly the commonly used squared Euclidean distance measure between two PDFs [22]. Furthermore, the well-known squared $\chi^2$ PDF distance measure [25] is

$$\mathrm{diff}_{\chi^2}(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \frac{\left[P_d(c, x_{i,d}) - P_d(c, x_{j,d})\right]^2}{P_d(c, x_{i,d}) + P_d(c, x_{j,d})}. \quad (21)$$

Comparing (20) with (21), we can see that $\mathrm{diff}_{\chi^2}$ can be obtained if we apply a linear scaling to $\phi_P(x_i)$ and $\phi_P(x_j)$ (dividing each component by $\sqrt{P_d(c, x_{i,d}) + P_d(c, x_{j,d})}$) and then use the Euclidean distance as the distance measure. In this sense, metric learning is more general, since it learns a transformation M over the entire space of positive semi-definite matrices. Since M is learned under the guidance of an external objective, such as optimizing the knn classification accuracy, we expect it to give a better metric for each specific data mining task than the fixed scaling in the squared $\chi^2$ measure.

Finally, when the entropy feature $\phi_E$ is used, the Euclidean distance between two points $x_i$ and $x_j$ is

$$d^2(x_i, x_j) = (\phi_E(x_i) - \phi_E(x_j))^T (\phi_E(x_i) - \phi_E(x_j)) = \sum_{d=1}^{D} \sum_{c=1}^{C} \left[\ln P_d(c, x_{i,d}) - \ln P_d(c, x_{j,d})\right]^2 = \sum_{d=1}^{D} \sum_{c=1}^{C} \left[\ln \frac{P_d(c, x_{i,d})}{P_d(c, x_{j,d})}\right]^2. \quad (22)$$

Using $\phi_E$ features, the implied distance measure between the PDFs is

$$\mathrm{diff}_E(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \left[\ln \frac{P_d(c, x_{i,d})}{P_d(c, x_{j,d})}\right]^2, \quad (23)$$

which is not a known PDF distance measure to our knowledge, but it embodies, under a linear transformation that can be reflected in the Mahalanobis matrix M, the following squared variant of the KL divergence [21]:

$$\mathrm{diff}_{\mathrm{KL}^2}(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \left[P_d(c, x_{i,d}) \ln \frac{P_d(c, x_{i,d})}{P_d(c, x_{j,d})}\right]^2.$$

In summary, the proposed features correspond to sound distance measures between PDFs. We believe that they usually give a more reasonable distance measure than the original Euclidean distance, and performing metric learning on these transformed features may allow us to improve many learning algorithms. Moreover, the proposed transformations address the problem of heterogeneous features, since they map the data into quantities based on probability densities, which can be directly compared in a uniform space.

3.3 Kernel density estimation for computing features

We have proposed the feature mappings $\phi_P$, $\phi_S$, and $\phi_E$ for KDML. We now estimate the conditional probability densities in these features.
From (11), (13), and (14), all of these features require estimating $P(y = c \mid x_d)$ for each $x_d \in \mathcal{D}_d$ and $c = 1, \ldots, C$. Once we have all the $P(y = c \mid x_d)$, the features can be computed. Given training data $T = \{(x_i, y_i)\}$, $i = 1, \ldots, N$, we partition T into C subsets $T_1, \ldots, T_C$, which contain the data points with labels $y = 1, \ldots, C$, respectively. To estimate $p(y = c \mid x_d)$, we distinguish the cases of categorical and numerical attributes, and we use $\hat{p}(y = c \mid x_d)$ to denote the estimates.

Categorical attributes. If an attribute $x_d$ takes categorical values, $p(y = c \mid x_d)$ can be estimated by the proportion of samples with $y = c$ among all the samples whose d-th attribute equals $x_d$. Thus, it can be computed as

$$\hat{p}(y = c \mid x_d) = \frac{|T_c \cap T_{x_d}|}{|T_{x_d}|}, \quad c = 1, \ldots, C, \quad (24)$$

where $T_{x_d} = \{x_i \mid x_{i,d} = x_d,\ i = 1, \ldots, N\}$ is the set of samples in T whose d-th attribute is $x_d$.

Numerical attributes. If an attribute $x_d$ takes numerical values, we propose to use a Nadaraya-Watson type kernel density regression to estimate $p(y = c \mid x_d)$. According to the Nadaraya-Watson estimator [26], [27], we have

$$\hat{p}(y = c \mid x_d) = \frac{\sum_{i \in T_c} K\!\left(\frac{x_d - x_{i,d}}{h_d}\right)}{\sum_{i=1}^{N} K\!\left(\frac{x_d - x_{i,d}}{h_d}\right)}, \quad (25)$$

where K(x) is a kernel function satisfying $K(x) \ge 0$ and $\int K(x)\, dx = 1$, and $h_d > 0$ is a parameter called the bandwidth of the kernel density function. In this paper, we choose the Gaussian kernel for K(x), namely,

$$K(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right). \quad (26)$$

We can thus compute the KDML features by substituting the estimates in (24) and (25) into (11), (13), and (14). For example, the entropy features for categorical and numerical attributes are, respectively,

$$E_d(c, x_d) = \ln \frac{|T_c \cap T_{x_d}|}{|T_{x_d}|}, \quad (27)$$

and

$$E_d(c, x_d) = \ln \frac{\sum_{i \in T_c} \exp\!\left(-\frac{(x_d - x_{i,d})^2}{2h_d^2}\right)}{\sum_{i \in T} \exp\!\left(-\frac{(x_d - x_{i,d})^2}{2h_d^2}\right)}. \quad (28)$$

We also comment on the difference between the assumptions of ITML and KDML. The main assumption of ITML is that all the data points are drawn from a single Gaussian distribution centered at $x_0$. Such an assumption may be too restrictive in some cases. KDML, in contrast, assumes a nonlinear distribution which is a mixture of multiple Gaussians in each dimension.

4 KDML AS PREPROCESSING

Now we discuss the training algorithm for KDML. In principle, KDML can be combined with any metric learning algorithm. One scheme is to first fix the kernel bandwidths $h_d$ in the Nadaraya-Watson estimator before learning the transformation matrix L. In this case, KDML can be viewed as a preprocessing step that transforms each input $x_d$ into $\phi(x_d)$. We can simply call an existing metric learning algorithm such as LMNN or NCA to learn L based on the new features $\phi(x_d)$. The key question in this static scheme is how to choose the static kernel bandwidth. We discuss two methods below.

Rule of thumb. One way to choose $h_d$ is to use a rule of thumb to set heuristic $h_d$ values. A popular one is Silverman's rule of thumb [28]:

$$h_d = 1.06\, \sigma_d\, N^{-1/5}, \quad (29)$$

where $\sigma_d$ is the standard deviation of $x_d$.

Local scaling. When the data resides on multiple scales (e.g., one cluster is tight and another is sparse), using a single bandwidth may fail to give an accurate estimate of the data distribution [29]. To address this issue, we can use local scaling [30], a technique that assigns a dynamic bandwidth $h_{i,d}$ to each data point $x_i$. Using adaptive scaling for each point allows self-tuning of the proper bandwidth according to the local statistics of the neighborhood surrounding $x_i$. In particular, for point $x_i$ at dimension d, we set its kernel bandwidth to

$$h_{i,d} = |x_{i,d} - x_{K,d}|, \quad (30)$$

where $x_K$ is the K-th nearest neighbor of $x_i$ on the d-th dimension. The selection of K is independent of scale and is a function of the data dimension of the embedding space. As suggested by [30], we set K = 7 in our experiments. Under adaptive scaling, the kernel density estimation for numerical attributes changes accordingly. For example, the entropy feature becomes

$$E_d(c, x_{k,d}) = \ln \frac{\sum_{i \in T_c} \exp\!\left(-\frac{(x_{k,d} - x_{i,d})^2}{2 h_{k,d} h_{i,d}}\right)}{\sum_{i \in T} \exp\!\left(-\frac{(x_{k,d} - x_{i,d})^2}{2 h_{k,d} h_{i,d}}\right)}. \quad (31)$$

Comparing (31) with (6), we can see that both perform a transformation based on kernel densities. The difference is that, by introducing local scaling, the transformation for data in the same column is no longer identical: the affinity between each point and its surrounding neighbors differs.
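To make the estimator (25) and the two static bandwidth choices (29) and (30) concrete, here is a minimal NumPy sketch; the function names are ours and the code is an illustration under these assumptions, not the Matlab implementation used in the experiments.

```python
import numpy as np

def silverman_bandwidth(xd):
    """Silverman's rule of thumb, Eq. (29): h_d = 1.06 * sigma_d * N^(-1/5)."""
    return 1.06 * np.std(xd) * len(xd) ** (-0.2)

def local_bandwidths(xd, K=7):
    """Local scaling, Eq. (30): h_{i,d} = |x_{i,d} - x_{K,d}|, the distance to the
    K-th nearest neighbor on this dimension (K = 7 as suggested by [30])."""
    diffs = np.abs(xd[:, None] - xd[None, :])   # (N, N) pairwise distances
    diffs.sort(axis=1)                          # column 0 is the self-distance 0
    return diffs[:, K]

def conditional_probs(xd, y, query, h, n_classes):
    """Nadaraya-Watson estimate of P(y = c | x_d = query), Eq. (25), with a
    Gaussian kernel and a single bandwidth h for this dimension.
    The Gaussian normalization constant cancels in the ratio."""
    w = np.exp(-((query - xd) ** 2) / (2.0 * h ** 2))
    probs = np.array([w[y == c].sum() for c in range(n_classes)])
    return probs / max(probs.sum(), 1e-12)
```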
In particular, the affinities across clusters are significantly lower than the affinities within any single cluster.

5 INTEGRATED KDML OPTIMIZATION

Instead of using KDML as a preprocessing step, we can also combine it with another metric learning algorithm and learn the kernel bandwidths $h_d$ and the transformation matrix L in an integrated optimization procedure. In principle, KDML, which nonlinearly maps the features in the original space to a new space, is a general framework that can be combined with any existing metric learning algorithm. As concrete examples, we combine KDML with two leading metric learning algorithms, LMNN and NCA, and show how to derive the integrated learning algorithm.

5.1 Integrated optimization algorithm

First, we map each training point x into a feature $\phi(x)$ (which may be $\phi_P(x)$, $\phi_S(x)$, or $\phi_E(x)$). Then, we use the metric learning algorithm to learn a transformation $L\phi(x)$, which leads to a Mahalanobis distance

$$d^2_M(x_i, x_j) = (\phi(x_i) - \phi(x_j))^T M (\phi(x_i) - \phi(x_j)), \quad (32)$$

where $M = L^T L$ is a positive semi-definite matrix.
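As a small illustration of (32), the following sketch (with hypothetical names; the map L is assumed to come from whichever metric learning package is used, rather than a specific API) applies a learned linear transformation on top of the KDML feature vectors.

```python
import numpy as np

def kdml_distance_sq(phi_i, phi_j, L):
    """Squared Mahalanobis distance of Eq. (32) in the KDML feature space.

    phi_i, phi_j : mapped feature vectors phi(x_i), phi(x_j) of length D*C,
                   e.g. the entropy features of Eq. (28).
    L            : (k, D*C) transformation learned by any linear metric
                   learner (LMNN, NCA, ...) on the mapped training features.
    """
    z = L @ (phi_i - phi_j)          # L phi(x_i) - L phi(x_j)
    return float(z @ z)              # equals diff^T M diff with M = L^T L
```

Used as preprocessing (Section 4), the mapped features are computed once under fixed bandwidths and handed to the chosen learner; in the integrated scheme below, they are recomputed whenever h is updated.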

Algorithm 1 Optimization for Kernel Density Metric Learning
1: Initialize h using (29)
2: repeat
3:    compute the feature matrix $\Phi_h$
4:    call an existing metric learning algorithm to optimize M under fixed h and $\Phi_h$
5:    for d = 1 to D do {under fixed M}
6:       if the d-th dimension is numerical then
7:          $h_d \leftarrow h_d - \gamma\, \partial E / \partial h_d$ {gradient descent for h}
8:       end if
9:    end for
10: until h converges
11: output h and M

Although the static methods above often give solid performance, we can in fact derive a novel way to automatically choose the optimal $h_d$ based on the objective; such automatic tuning is absent in previous work. The training of KDML aims at learning the optimal values of the matrix M and the bandwidths h in the Nadaraya-Watson estimators. Applying LMNN or NCA to $\phi(x)$, we solve the corresponding problem of minimizing its objective function E. Our optimization algorithm for training the combination of KDML and a metric learning algorithm is shown in Algorithm 1. It contains two levels of optimization: an outer loop, which optimizes h using subgradient descent, and an inner loop, which learns M using the original LMNN or NCA package under fixed h.

In the outer level, at each iteration the feature matrix $\Phi_h$, composed of the entries $\phi_{c,d}(x_k)$, is updated based on the new h. If the variable at dimension d is categorical, its feature is computed by (24); for a numerical variable $x_d$, we use the kernel density estimation in (25) to compute its feature. Then, entering the inner level, we use the original LMNN or NCA training algorithm to learn the M that minimizes E(M) in (33) under fixed h and $\Phi_h$. Finally, we optimize h for the numerical attributes by performing descent along the subgradient direction based on the validation set. We use a line search with the Armijo rule to choose the step size $\gamma$. As Algorithm 1 learns both M and h, there are no other hyper-parameters to tune.

The complexity of Algorithm 1 mainly depends on the complexity of the metric learning algorithm we choose. Let n be the number of instances in the data set, d the number of dimensions, and c the number of classes, and let O(f(n, d)) be the time complexity of the chosen metric learning algorithm. In each outer loop, computing the feature matrix (Line 3) costs $O(n^2 d)$, the optimization of E(M) (Line 4) costs $O(f(n, dc))$, and updating the gradient of h (Line 7) costs $O(n^2 (dc)^2)$. In practice, the metric learning part (Line 4) is the most time-consuming, while computing the feature matrix and updating the gradient of h are much less expensive. Thus, the metric learning part dominates the running time of Algorithm 1.

5.2 The KD-LMNN approach

We use KD-LMNN to denote the LMNN model combined with the KDML features. Applying LMNN to the KDML feature $\phi(x)$, we solve the problem of minimizing

$$E(M) = (1-\mu) \sum_{i,\, j \rightsquigarrow i} d^2_M(x_i, x_j) + \mu \sum_{i,\, j \rightsquigarrow i,\, l} (1 - y_{il}) \left[1 + d^2_M(x_i, x_j) - d^2_M(x_i, x_l)\right]_+, \quad (33)$$

where $d^2_M(x_i, x_j)$ is defined in (32). For this minimization, a nice fact is that we can obtain the subgradient of E with respect to the bandwidths in closed form and compute it efficiently. There are two terms in (33). Let

$$E_1 = \sum_{i,\, j \rightsquigarrow i} d^2_M(x_i, x_j) \quad (34)$$

and

$$E_2 = \sum_{i,\, j \rightsquigarrow i,\, l} (1 - y_{il}) \left[1 + d^2_M(x_i, x_j) - d^2_M(x_i, x_l)\right]_+, \quad (35)$$

so that

$$E(M) = (1-\mu) E_1 + \mu E_2. \quad (36)$$

We compute the subgradients of these two terms separately, where (with a slight abuse of notation) $x_k$ below denotes the mapped feature vector $\phi(x_k)$. First, since

$$\frac{\partial d^2_M(x_i, x_j)}{\partial x_i} = 2M(x_i - x_j), \quad (37)$$

we have

$$\frac{\partial E_1}{\partial x_k} = 2\sum_{j:\, j \rightsquigarrow k} M(x_k - x_j) + 2\sum_{j:\, k \rightsquigarrow j} M(x_k - x_j). \quad (38)$$

For $E_2$, note that $\partial E_2 / \partial x_k$ is a subgradient, since $E_2$ involves a hinge loss and is non-differentiable whenever the term inside $[\cdot]_+$ is zero.
Therefore, when

$$1 + d^2_M(x_i, x_j) - d^2_M(x_i, x_l) < 0, \quad (39)$$

the corresponding contribution to $\partial E_2 / \partial x_k$ is 0; otherwise, we have

$$\frac{\partial E_2}{\partial x_k} = 2\left[\sum_{j:\, j \rightsquigarrow k} \sum_{l} (1 - y_{kl})\, M(x_l - x_j) + \sum_{i:\, k \rightsquigarrow i} \sum_{l} (1 - y_{il})\, M(x_k - x_i) - \sum_{i,\, j \rightsquigarrow i} (1 - y_{ik})\, M(x_k - x_i)\right], \quad (40)$$

where each sum ranges only over the triplets whose hinge term in (35) is positive. Then, according to (36), for $d = 1, \ldots, D$ we have

$$\frac{\partial E}{\partial h_d} = \sum_k \left[(1-\mu)\frac{\partial E_1}{\partial x_k} + \mu\,\frac{\partial E_2}{\partial x_k}\right]^T \frac{\partial x_k}{\partial h_d}, \quad (41)$$

where $\partial x_k / \partial h_d = \left[0, \ldots, \partial \phi_{c,d}(x_k)/\partial h_d, \ldots, 0\right]^T$ is a column vector whose nonzero entries correspond to dimension d, and $\phi_{c,d}(x_k)$ is the KDML feature value for the d-th dimension and class c of $x_k$. Using the entropy feature in (14) as an example, we know

$$\phi_{c,d}(x_k) = \ln \frac{\sum_{i \in T_c} \exp\!\left(-\frac{(x_{k,d} - x_{i,d})^2}{2 h_d^2}\right)}{\sum_{i \in T} \exp\!\left(-\frac{(x_{k,d} - x_{i,d})^2}{2 h_d^2}\right)}. \quad (42)$$

We now compute $\partial \phi_{c,d}(x_k)/\partial h_d$. Let $r_d = 1/(2 h_d^2)$; then

$$\frac{\partial \phi_{c,d}(x_k)}{\partial r_d} = -\frac{\sum_{i \in T_c} (x_{k,d} - x_{i,d})^2 \exp\!\left(-r_d (x_{k,d} - x_{i,d})^2\right)}{\sum_{i \in T_c} \exp\!\left(-r_d (x_{k,d} - x_{i,d})^2\right)} + \frac{\sum_{i \in T} (x_{k,d} - x_{i,d})^2 \exp\!\left(-r_d (x_{k,d} - x_{i,d})^2\right)}{\sum_{i \in T} \exp\!\left(-r_d (x_{k,d} - x_{i,d})^2\right)} \quad (43)$$

and

$$\frac{\partial \phi_{c,d}(x_k)}{\partial h_d} = \frac{\partial \phi_{c,d}(x_k)}{\partial r_d} \cdot \frac{\partial r_d}{\partial h_d} = -\frac{1}{h_d^3}\,\frac{\partial \phi_{c,d}(x_k)}{\partial r_d}. \quad (44)$$

Putting things together, we obtain the closed form of $\partial E/\partial h_d$ by assembling (41), (38), (40), (43), and (44). This closed form looks complex but can in fact be computed efficiently. It has two parts, involving $\partial E_1/\partial x_k$ and $\partial E_2/\partial x_k$. The first has a simple form and only involves pairs of neighboring points satisfying $j \rightsquigarrow k$ or $k \rightsquigarrow j$. For the second, it is important to note that it is a subgradient: for $E_2$ in (35), for each point i we only need to consider those active l that are invading the neighborhood of i, so that the corresponding $[\cdot]_+$ term is positive. There are typically few such invaders. This is observed and exploited in LMNN to speed up its gradient computation [31]. In [31], it is found that the k target neighbors and the invaders do not change frequently across iterations. The LMNN package maintains such information in a data structure for efficient updates during the optimization process. This data structure is adapted in our implementation to support efficient computation of $\partial E/\partial h_d$.

5.3 The KD-NCA approach

We use KD-NCA to denote the NCA model combined with the KDML features. We still use Algorithm 1 to learn the transformation matrix M and the kernel bandwidths h in KD-NCA. For each input $x_i$, according to NCA, the probability that the point will be correctly classified is

$$p_i = \frac{\sum_{j \in C_i} \exp(-\|M\phi(x_i) - M\phi(x_j)\|^2)}{\sum_{k \ne i} \exp(-\|M\phi(x_i) - M\phi(x_k)\|^2)}. \quad (45)$$

We minimize the negative logarithm of the likelihood,

$$E(M, H) = -\sum_{i=1}^{N} \ln(p_i), \quad (46)$$

where H is the matrix of the bandwidths $h_{i,d}$ ($1 \le i \le N$, $1 \le d \le D$). We minimize E(M, H) by performing gradient descent in the joint space of M and H. Let $\Delta\phi(x_i, x_j) = \phi(x_i) - \phi(x_j)$; the derivative of E with respect to M is

$$\frac{\partial E}{\partial M} = -2M \sum_i \left( \sum_{l} p_{il}\, \Delta\phi(x_i, x_l)\, \Delta\phi(x_i, x_l)^T - \frac{\sum_{j \in C_i} p_{ij}\, \Delta\phi(x_i, x_j)\, \Delta\phi(x_i, x_j)^T}{\sum_{j \in C_i} p_{ij}} \right). \quad (47)$$

In order to find the gradient with respect to H, we first calculate $\partial \phi_{c,d}(x_k)/\partial h_{k,d}$. Let $r_{i,d} = \frac{1}{\sqrt{2}\, h_{i,d}}$, so that under local scaling the entropy feature (31) can be written as

$$\phi_{c,d}(x_k) = \ln\!\left[\frac{\sum_{i \in T_c} \exp\!\left(-(x_{k,d} - x_{i,d})^2\, r_{i,d}\, r_{k,d}\right)}{\sum_{i=1}^{N} \exp\!\left(-(x_{k,d} - x_{i,d})^2\, r_{i,d}\, r_{k,d}\right)}\right]. \quad (48)$$

Moreover, by the chain rule,

$$\frac{\partial \phi_{c,d}(x_k)}{\partial h_{k,d}} = \frac{\partial \phi_{c,d}(x_k)}{\partial r_{k,d}} \cdot \frac{\partial r_{k,d}}{\partial h_{k,d}}, \quad (49)$$

where

$$\frac{\partial \phi_{c,d}(x_k)}{\partial r_{k,d}} = -\frac{\sum_{i \in T_c} r_{i,d} (x_{k,d} - x_{i,d})^2 \exp\!\left(-(x_{k,d} - x_{i,d})^2 r_{i,d} r_{k,d}\right)}{\sum_{i \in T_c} \exp\!\left(-(x_{k,d} - x_{i,d})^2 r_{i,d} r_{k,d}\right)} + \frac{\sum_{i \in T} r_{i,d} (x_{k,d} - x_{i,d})^2 \exp\!\left(-(x_{k,d} - x_{i,d})^2 r_{i,d} r_{k,d}\right)}{\sum_{i \in T} \exp\!\left(-(x_{k,d} - x_{i,d})^2 r_{i,d} r_{k,d}\right)} \quad (50)$$

and

$$\frac{\partial E}{\partial \phi_{c,d}(x_k)} = \sum_i \left( \sum_{l} p_{il}\, \|M\phi(x_i) - M\phi(x_l)\|^2 - \frac{\sum_{j \in C_i} p_{ij}\, \|M\phi(x_i) - M\phi(x_j)\|^2}{\sum_{j \in C_i} p_{ij}} \right). \quad (51)$$

Summarizing (50) and (51), we get

$$\frac{\partial E}{\partial h_{k,d}} = \frac{\partial E}{\partial \phi_{c,d}(x_k)} \cdot \frac{\partial \phi_{c,d}(x_k)}{\partial r_{k,d}} \cdot \frac{\partial r_{k,d}}{\partial h_{k,d}}. \quad (52)$$

6 RELATED WORK

A number of prior works on metric learning have focused on learning a linear transformation in the original input space [5], [6], [8], [20], [31]-[35]. They have achieved great success in improving the performance of learning algorithms by obtaining better Mahalanobis distance measures. The concept of distance metric learning was first proposed by Xing et al. [5]. Their objective is to learn a Mahalanobis matrix such that similar points are clustered together, subject to the constraints that the distances between dissimilar points are larger than a lower bound. Inspired by this general idea, many works have been developed.
LMNN [31] identifies the local target neighbors in the original space for each point and learns a Mahalanobis matrix such that the non-target neighbors for each point are encouraged to be far away from all its target neighbors with a large margin.

9 9 ITML [8] assumes that there exists a bijection between the Mahalanobis distance and a single multivariate Gaussian distribution. It minimizes the KL divergence between a prior distribution and the distribution implies by the Mahalanobis distance, subject to upper bound constraints on the distance between similar points and lower bound constraints on the distance between dissimilar points. Considering both must-link and cannot-link constraints, [36] proposed a linear transformations or equivalently global Mahalanobis metrics. Laplacian Regularized Metric Learning (LRML [37] are used in learning robust distance metrics. The SEPAPH [11] approach also relies on a mapping from the Mahalanobis distance to a probability distribution but extends to semi-supervised metric learning based on regularization. Neighborhood components analysis (NCA [6] maximizes a softmax function that smooths the leaveone-out accuracy of knn classification. However, it has a nonconvex objective function and suffers from local minima. Maximally collapsing metric learning (MCML [32] constructs a convex objective based on the same softmax function to characterize the distribution. It minimizes for each point the KL divergence between a bi-level distribution and the desired distribution under the Mahalanobis distance, where the bi-level distribution is zero for similar points and non-zero for dissimilar points. All the above methods look for a linear transformation. However, the linearly transformed features fail to have satisfactory performance on many cases, such as the example in Figure 1. Another well-known example is the case where the two classes of data points are form two concentric circles [31], which we will illustrate in Section 6. All the above linear methods will fail on this example. There are also some existing work on nonlinear metric learning. One nonlinear extension is to kernelize existing methods and use the Representer s Theorem to represent the nonlinear transformation using the kernel matrix [9], [10], [12]. An nonlinear extension to NCA has also been proposed [19] which tunes a multilayer neural network to learn a nonlinear transformation. Dimensionality Reduction by Learning an Invariant Mapping (DrLIM [38] also learns a nonlinear mapping. Like NCA, DrLIM uses the same-class neighborhood structure to drive the optimization: observations with the same class label are driven to be close-by in the feature space. However, these methods have not replicated the success and out-of-the-box usability of linear metric learning methods. In general, direct kernelization of linear metric learning methods are sensitive to hyper-parameters and their utility is limited inherently by the sizes of kernel matrices [16]. Another nonlinear approach, MM-LMNN [31], uses multiple metrics for different clusters of data to achieve global nonlinearity, where the clusters are obtained by the k-means algorithm. However, the transformation is locally linear with respect to each cluster and the crosscluster distances cannot be easily learned. Two other nonlinear methods are recently proposed in [16]. χ 2 -LMNN uses a nonlinear χ 2 distance measure. It is intended for histogram data and can only be applied when all the data lie on a simplex S D = {x R D x 0, x T 1 = 1}. KDML has much wider applicability than χ 2 -LMNN since it can process any input by transforming categorical attributes into histograms and numerical attributes into probability densities. 
GB-LMNN uses a set of gradient boosting regression trees with different heights and optimizes the objective function of LMNN. GB-LMNN is shown to perform better than its linear counterpart LMNN and MM-LMNN [16]. However, GB- LMNN is computational much more expensive and can handle very limited data sizes. Another nonlinear method, KD-LMNN [39], constructs a nonlinear mapping from the original input space into a feature space based on kernel density estimation. Then they integrate this mapping with LMNN and acquire better knn classification quality. 7 EXPERIMENTAL RESULTS In this section, we report experimental results to evaluate the proposed KD-LMNN and KD-NCA models. Both of them automatically tune h as shown in Algorithm 1. We evaluate the following versions of algorithms: KD-LMNN and KD-NCA with φ P, φ S, or φ E features (denoted as KD-LMNN P H, KD-NCAP H, KD-LMNNS H, KD- NCA S H, and KD-LMNNE H, KD-NCAE H, respectively, and KD-LMNN and KD-NCA using local scaling bandwidths with φ E feature (denoted as KD-LMNN E A and KD-NCA E A, respectively. For comparison, we also evaluate two leading linear metric learning algorithms including LMNN [7] and ITML [8]. We also evaluate a state-of-the-art nonlinear metric learning algorithm MM-LMNN [31], which first groups data into clusters and then uses multiple linear mappings for different clusters to achieve globally nonlinear mapping. For LMNN and MM-LMNN, we use their authors packages. 1. ITML and NCA codes are obtained from their websites 2 3. We also evaluate using the original Euclidean distance as a baseline. We implemented our algorithms inside the LMNN and NCA packages, which are implemented in the Matlab environment. All experiments are performed on a desktop computer with 2.67GHz CPU and 8G memory running Mac OS X Illustrations on toy cases For sanity check and illustration, we first test on a simple example in Figure 1a. This data cannot be correctly 1. kilian/code/lmnn/lmnn.html 2. fowlkes/software/nca/ 3. pjain/itml/

separated by any linear metric learning algorithm. Since KDML maps the data to a higher-dimensional space, to visualize the mapping in 2-D we extract a 2-D transformation $L \in \mathbb{R}^{2 \times D}$ from M using eigendecomposition. Such dimensionality reduction is in fact another main utility of metric learning and is already implemented in LMNN. Figure 1b shows the 2-D mapping result of KD-LMNN$^E_H$, which clearly separates the two classes. Figures 1c and 1d show that linear metric learning methods cannot separate the two classes.

We also test another toy example, shown in Figure 2a. It contains two concentric circles of data from two different classes. This is a very difficult case for metric learning and distance-based classification, since the nearest neighbor of any given data point is from the other class. It is a well-known example in which no linear transformation can separate the two classes [16]. Figures 2b to 2e illustrate the process of KD-LMNN$^E_H$ in Algorithm 1, which automatically tunes the kernel bandwidth h. For better visualization, the results in Figures 2b to 2e are obtained by applying Algorithm 1 and extracting a 2-D mapping using the eigendecomposition of M at each outer-loop iteration. We can see that the knn classification error quickly decreases from 45% after the initial KDML mapping to 0% in just four major iterations of optimizing h using subgradient descent.

Fig. 2. A toy example with two circles in two classes, marked in different colors; KD-LMNN$^E_H$ is used. (a) shows the original data (knn error 100%); (b)-(e) show the data mapping and knn classification error after each outer-loop iteration of Algorithm 1, which tunes h (errors of 45%, 16%, 13%, and 0%, respectively). The classification error quickly decreases to zero as h is optimized using subgradient descent.

TABLE 1. The number of instances N, the number of classes C, the number of numerical features D_n, and the number of categorical features D_c of the tested UCI datasets. The numerical datasets are Glass, Wine, Breast-cancer, Hepatitis, and Handwritten Digits; the mixed datasets are Contraceptive, Statlog Heart, and Hayes-Roth; the categorical datasets are Balance Scale and Car.

7.2 Comparison of KDML features

In this part, we test all the algorithms on benchmark datasets from the UCI repository [40]. We choose datasets mostly with multiple (three or more) classes, since knn has salient advantages over other methods such as SVM on multiway classification. For each dataset, we run a 10-fold cross validation with 90/10 splits and report the average results. We use k = 3 for knn classification in all cases. Table 1 lists the main characteristics of the tested datasets; there are datasets with numerical, categorical, and mixed attributes. Since we have proposed three features for KDML,

including $\Phi_P$, $\Phi_S$, and $\Phi_E$, we first compare the performance of these features. Table 2 shows the results for the three different features. We can see that the entropy features $\Phi_E$ consistently perform well in most cases. Therefore, we use them in the following comparisons against the other metric learning methods.

TABLE 2. knn classification error (in %, ± standard deviation) of the various KDML features (KD-LMNN$^P_H$, KD-LMNN$^S_H$, KD-LMNN$^E_H$, KD-NCA$^P_H$, KD-NCA$^S_H$, KD-NCA$^E_H$) on the UCI datasets Glass, Wine, Contraceptive, Statlog Heart, Hayes-Roth, Balance Scale, and Car, averaged over 10-fold 90/10 training-testing splits. For KD-LMNN and KD-NCA, the best features are shown in bold. We do not compare across different versions of KD-LMNN and KD-NCA since we aim at determining the best feature transformation.

7.3 Results on numerical datasets

Table 3 compares the knn classification errors of the various algorithms on the numerical datasets. For the KDML algorithms, we use the entropy features and test KD-LMNN and KD-NCA with both automatic tuning and local scaling of the kernel bandwidth. We can make a few observations from Table 3. First, directly using the Euclidean distance gives the worst performance; it is clear that metric learning of any kind greatly improves the results. Second, the linear metric learning algorithms, including LMNN, NCA, and ITML, are in general not as good as the nonlinear algorithms, as they tend to give higher classification errors. Third, all KDML algorithms, combined with LMNN or NCA, consistently and significantly outperform the other algorithms, including the nonlinear algorithm MM-LMNN, in all cases. Finally, it is observed that automatic tuning of h is more suitable for KD-LMNN, while local scaling is better for KD-NCA.

TABLE 3. knn classification error (in %, ± standard deviation) of the various methods (Euclidean, LMNN, NCA, ITML, MM-LMNN, KD-LMNN$^E_H$, KD-LMNN$^E_A$, KD-NCA$^E_H$, KD-NCA$^E_A$) on the numerical datasets Glass, Wine, Breast-cancer, Hepatitis, and Handwritten Digits, averaged over 10-fold 90/10 training-testing splits. Best results are shown in bold.

7.4 Results on categorical and mixed datasets

Another major advantage of KDML is its ability to naturally handle categorical variables. We also evaluate our algorithms on datasets with categorical attributes and mixed data types from the UCI repository. To deal with a categorical attribute x, KDML transforms x into the numerical features $\phi_P(x)$, $\phi_S(x)$, or $\phi_E(x)$, as before. For the other algorithms, we use a typical multinomial encoding to handle categorical variables: each categorical attribute x that has m different categories is transformed into m numbers, with only one of the numbers being 1 and the others being 0. Table 4 lists the knn classification results on the datasets with categorical and mixed attributes. Again, we observe that all KDML algorithms consistently perform very well across all cases.

Fig. 3. Sample images in the Yale dataset.

7.5 Results on face recognition

We also carry out experiments on two well-known face recognition datasets, Yale and ORL [41]. The Yale dataset contains 165 gray-scale face images of 15 persons, each having 11 images (illustrated in Figure 3).
For each person, his/her images may have different illumination and facial expression/configuration including centerlight, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. The ORL face dataset contains 400 images of 40 persons, where each person has 10 images. All the images

were taken against a dark homogeneous background with the subjects in an upright frontal position, with tolerance for some tilting and rotation. The images of each person have variations in facial expression (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses), and there is some variation in scale by up to about 10%. For both datasets, all images are cropped to the same size in pixels, and the gray-scale values of all images are rescaled to [0, 1]. A random subset with p (p = 2, 3, ..., 8) images per individual was taken with labels to form the training set, and the rest of the images were used as the testing set. For each given p, we average the results over 50 random splits and report the mean values. Both the training and the testing images are mapped into a low-dimensional subspace using PCA, where recognition is carried out by a knn classifier (k = 3 in all cases). For a fair comparison, we use the same reduced dimensionality for all methods (37 for Yale and 39 for ORL), as suggested in [41]. Figures 4 and 5 compare the performance of the various methods on the two datasets, respectively. We can see that, for both datasets, KD-LMNN and KD-NCA give the best classification results among all the compared metric learning algorithms, over all the different values of p.

TABLE 4. The testing error (in %, ± standard deviation) of the various methods (Euclidean, LMNN, NCA, ITML, MM-LMNN, KD-LMNN$^E_H$, KD-LMNN$^E_A$, KD-NCA$^E_H$, KD-NCA$^E_A$) on the categorical and mixed datasets Contraceptive, Statlog Heart, Hayes-Roth, Balance Scale, and Car, averaged over 10-fold 90/10 training-testing splits. Best results are shown in bold.

Fig. 4. Face recognition results on the Yale dataset: error rate on the testing set versus the number of training images per person (p). knn denotes the original Euclidean distance.

Fig. 5. Face recognition results on the ORL dataset: error rate on the testing set versus the number of training images per person (p). knn denotes the original Euclidean distance.

8 CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed a novel kernel density metric learning (KDML) framework for nonlinear metric learning. KDML is fundamentally different from previous metric learning algorithms, since it introduces a nonlinear mapping from the original input space into a probability density space, based on Nadaraya-Watson kernel density estimation. We have shown that the nonlinear mapping in KDML embodies established distance measures between probability density functions and leads to effective classification on datasets for which linear metric learning methods would fail. KDML can be used as a preprocessing step and combined with existing metric learning algorithms.

KDML addresses a key challenge for knn classification and metric learning. When the features are from heterogeneous domains and have vastly different scales, knn may give poor results, since it does not make much sense to compute the Euclidean distance between two original feature vectors. KDML maps the features into density-based quantities which can be used as a uniform basis for deriving distance measures.

We have integrated KDML with LMNN and NCA. We proposed two static schemes to set the kernel bandwidth for density estimation, including a rule of thumb and a local scaling scheme that adapts to the local neighborhood structure of the given data.
In addition, we have also derived the closed form of the subgradients of the objective function with respect to the kernel bandwidths. We have then derived an integrated optimization algorithm for learning the Mahalanobis matrix and kernel bandwidths. Such automatic learning of kernel bandwidths in a Nadaraya-Watson estimator is

13 13 not found in previous work. Extensive results on real-world numerical and categorical data show that, KDML gives significantly better knn classification quality than other linear and nonlinear metric learning algorithms. Unlike previous metric learning algorithms, KDML can naturally handle both numerical and categorical data. It is also easy to use and offers good off-the-shelf usability. These advantages make KDML an attractive general approach for metric learning. Our ongoing work focuses on combining the nonlinear features in KDML with more expressive parametric forms of the distance function such as that in χ 2 -LMNN and KL-divergence, instead of the simple Euclidean l 2 form. The flexibility in both feature mappings and distance functions may enable us to construct superior distance/similarity measures for a wide range of applications. ACKNOWLEDGMENT This work is partially supported by the CNS , CCF , and IIS grants from the National Science Foundation of the United States, a Microsoft Research New Faculty Fellowship, a Washington University URSA grant, and a Barnes-Jewish Hospital Foundation grant. REFERENCES [1] F. Wang and J. Sun, Survey on distance metric learning and dimensionality reduction, Data Mining and Knowledge Discovery(DMKD, [2] Y. Yang and X. Liu, A re-examination of text categorization methods, in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR 99. New York, NY, USA: ACM, 1999, pp [Online]. Available: [3] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, Neighbourhood components analysis, Advances in Neural Information Processing Systems, vol. 17, pp , [4] W. Shoombuatong, P. Mekha, KitsanaWaiyamai, S. Cheevadhanarak, and J. Chaijaruwanich, Prediction of human leukocyte antigen gene using k-nearest neighbour classifier based on spectrum kernel, ScienceAsia, vol. 39, pp , [5] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell, Distance metric learning with application to clustering with side-information, in Proc. NIPS, 2002, pp [6] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, Neighbourhood components analysis, in Proc. NIPS, 2004, pp [7] K. Weinberger, J. Blitzer, and L. Saul, Distance metric learning for large margin nearest neighbor classification, in Proc. NIPS, [8] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, Information-theoretic metric learning, in Proceedings of the 24th international conference on Machine learning, ser. ICML 07. New York, NY, USA: ACM, 2007, pp [Online]. Available: [9] A. Globerson and S. T. Roweis, Visualizing pairwise similarity via semidefinite programming, Journal of Machine Learning Research - Proceedings Track, vol. 2, pp , [10] L. Torresani and K. chih Lee, Large margin component analysis, in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA: MIT Press, 2007, pp [11] G. Niu, B. Dai, M. Yamada, and M. Sugiyama, Informationtheoretic semi-supervised metric learning via entropy regularization, in Proceedings of the 29th international conference on Machine learning, ser. ICML 12, [12] R. Chatpatanasiri, T. Korsrilabutr, P. Tangchanachaianan, and B. Kijsirikul, A new kernelization framework for mahalanobis distance learning algorithms, Neurocomput., vol. 73, no , pp , Jun [13] C. Galleguillos, B. McFee, S. J. Belongie, and G. R. G. Lanckriet, Multi-class object localization by combining local contextual interactions. in CVPR. IEEE, 2010, pp [14] P. 
Jain, B. Kulis, J. V. Davis, and I. S. Dhillon, Metric and kernel learning using a linear transformation, Journal of Machine Learning Research, vol. 13, pp , [15] S. Chopra, R. Hadsell, and Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 05 - Volume 1 - Volume 01, ser. CVPR 05. Washington, DC, USA: IEEE Computer Society, 2005, pp [Online]. Available: [16] D. Kedem, S. Tyree, K. Weinberger, F. Sha, and G. Lanckriet, Nonlinear metric learning, in Proc. NIPS, [17] H. Cevikalp and M. Wilke, Face recognition by using discriminative common vectors, in In Proceedings of the 17th International Conference on Pattern Recognition, vol. 1, August 2004, pp [18] H. Cevikalp, M. Neamtu, M. Wilkes, and A. Barkana, Discriminative common vectors for face recognition, IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. 27(1, [19] R. Salakhutdinov and G. Hinton, Learning a nonlinear embedding by preserving class neighbourhood structure, AISTATS, vol. 11, [20] A. Globerson and S. Roweis, Metric learning by collapsing classes, NIPS, [21] S. Kullback and R. Leibler, On information and sufficiency, Ann. Math. Statist., vol. 22, pp , [22] S. H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, pp , [23] D. Gavin, W. Osward, E. Wahl, and J. Williams, A statistical approach to evaluating distance metrics and analog assignments for pollen records, vol. 60, pp , [24] K. Matusita, Decision rules, based on the distance, for problems of fit, two samples, and estimation, Ann. Math. Statist., vol. 26, pp , [25] K. Pearson, On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Phil. Mag., vol. 50, pp , [26] E. Nadaraya, On estimating regression, Theory of Probability and its Applications, vol. 9, pp , [27] C. M. Bishop, Pattern Recognition and Machine Learning. Secaucus, NJ, USA: Springer-Verlag New York, Inc., [28] B. W. Silverman and P. J. Green, Density Estimation for Statistics and Data Analysis. Chapman and Hall, [29] C. Yang, X. Zhang, and L. Jiao, Self-tuning semi-supervised spectral clustering, in In Proceedings of the 2008 International Conference on Computational Intelligence and Security, [30] L. Zelnik-manor and P. Perona, Self-tuning spectral clustering, in Advances in Neural Information Processing Systems, vol. 17. MIT Press, 2004, pp [31] K. Weinberger and L. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research, vol. 10, pp , [Online]. Available: [32] A. Globerson and S. Roweis, Metric learning by collapsing classes, in Proc. NIPS, [33] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, Adjustment learning and relevant component analysis, in Proceedings of the 7th European Conference on Computer Vision-Part IV, ser. ECCV 02. London, UK, UK: Springer-Verlag, 2002, pp [Online]. Available: [34] D. Cai, X. He, J. Han, and H.-J. Zhang, Orthogonal laplacianfaces for face recognition, IEEE Transactions on Image Processing, vol. 15, no. 11, pp , 2006.

Yixin Chen is an Associate Professor of Computer Science at Washington University in St. Louis. His research interests include data mining, machine learning, artificial intelligence, optimization, and cyber-physical systems. He received a Ph.D. in Computing Science from the University of Illinois at Urbana-Champaign. He received Best Paper Awards at the AAAI Conference on Artificial Intelligence (2010) and the International Conference on Tools with AI (2005), and a best paper nomination at the ACM KDD Conference (2009). His work on planning has won First Prizes in the International Planning Competitions (2004 & 2006). He has received an Early Career Principal Investigator Award from the Department of Energy (2006) and a Microsoft Research New Faculty Fellowship (2007). He is an Associate Editor for ACM Transactions on Intelligent Systems and Technology and IEEE Transactions on Knowledge and Data Engineering, and serves on the Editorial Board of the Journal of Artificial Intelligence Research.

Yujie He is a Ph.D. student in the Department of Computer Science and Engineering at Washington University in St. Louis. His research interests include data mining and machine learning. He completed his bachelor's degree in bioinformatics and computer science at Shanghai Jiao Tong University. He received a best paper award nomination at the IEEE International Conference on Data Mining.

Yi Mao received her B.S. degree in control technology and instrumentation from Xidian University, Xi'an, P.R. China. She is currently a Ph.D. student in the School of Aerospace Science and Technology, Xidian University. Her research interests include data mining, pattern recognition, signal processing, and machine learning.

Wenlin Chen is a Ph.D. student in the Department of Computer Science and Engineering at Washington University in St. Louis. His research mainly focuses on machine learning and data mining. Before joining WashU, he received his bachelor's degree in computer science from the University of Science and Technology of China (USTC). Wenlin was the runner-up for the Best Student Paper Award at the ACM SIGKDD Conference 2014 and received a best paper award nomination at the IEEE International Conference on Data Mining 2013.
