Nonlinear Metric Learning with Kernel Density Estimation


Nonlinear Metric Learning with Kernel Density Estimation
Yujie He, Yi Mao, Wenlin Chen, and Yixin Chen, Senior Member, IEEE

Abstract: Metric learning, the task of learning a good distance metric, is a key problem in machine learning with ample applications. This paper introduces a novel framework for nonlinear metric learning, called kernel density metric learning (KDML), which is easy to use and provides nonlinear, probability-based distance measures. KDML constructs a direct nonlinear mapping from the original input space into a feature space based on kernel density estimation. The nonlinear mapping in KDML embodies established distance measures between probability density functions, and leads to accurate classification on datasets for which existing linear metric learning methods would fail. It addresses the severe challenge to distance-based classifiers when features are from heterogeneous domains and, as a result, the Euclidean or Mahalanobis distance between original feature vectors is not meaningful. We also propose two ways to determine the kernel bandwidths, including an adaptive local scaling approach and an integrated optimization algorithm that learns the Mahalanobis matrix and kernel bandwidths together. KDML is a general framework that can be combined with any existing metric learning algorithm. As concrete examples, we combine KDML with two leading metric learning algorithms, large margin nearest neighbors (LMNN) and neighborhood component analysis (NCA). KDML can naturally handle not only numerical features but also categorical ones, which is rarely found in previous metric learning algorithms. Extensive experimental results on various datasets show that KDML significantly improves existing metric learning algorithms in terms of classification accuracy.

Index Terms: classification; metric learning; large margin nearest neighbors; neighborhood components analysis; kernel density estimation.

1 INTRODUCTION

Distance metrics are distance measurements between data points, such as the Euclidean distance or the Manhattan distance. Learning a distance metric is a fundamental problem in machine learning and data mining [1]. In many applications, once we have defined a good distance or similarity measure between all pairs of data points, the data mining task becomes trivial. For example, with a perfect distance metric, the k-nearest neighbor (knn) algorithm can achieve perfect classification [2]-[4]. As a result, ever since metric learning was proposed by Xing et al. [5], there has been extensive research in this area [6]-[10]. These new methods have greatly improved the performance of many metric-based algorithms and gained considerable popularity.

There are several basic desirable properties for any metric learning algorithm: 1) it must reflect the true distance or similarity between data samples; 2) it needs to be flexible enough to support different learning settings and data types; 3) it should generalize to out-of-sample data; 4) it should be easy to use and should not require extensive parameter tuning. Few existing algorithms satisfy all these requirements. Another key challenge for knn classification and metric learning is that, in many cases, the features are from totally different domains and have vastly different scales.

Y. He, W. Chen, and Y. Chen are with the Department of Computer Science and Engineering, Washington University in St. Louis, USA. Y. Mao is with Xidian University, Xi'an, China.
For example, one feature could be velocity in km/h while another is the color of an object. In such cases, it does not make much sense to compute the Euclidean distance between two feature vectors, even under a linear transformation of the features.

The majority of existing methods are based on a linear transformation. Namely, they learn a Mahalanobis distance between two D-dimensional data points $x_i, x_j \in \mathbb{R}^D$ of the form

$$d_L(x_i, x_j) = \|L(x_i - x_j)\|_2, \quad (1)$$

where $\|\cdot\|_2$ is the $l_2$-norm and $L \in \mathbb{R}^{D \times D}$ is a matrix. Therefore, L represents a linear transformation of the input space, which corresponds to rotating and scaling the data points. Many representative metric learning algorithms, such as distance metric learning [5], large margin nearest neighbors (LMNN) [7], information-theoretic metric learning (ITML) [8], neighborhood components analysis (NCA) [6], and SEPAPH [11], are based on such a linear transformation and the $l_2$ Euclidean distance. A main reason for the popularity of linear metric learning is its good off-the-shelf usability.

However, linear metric learning has inherent limits on its mapping capability. Nonlinear metric learning is more general and offers greater separation ability in theory. For example, for the four points in two classes in Figure 1.a, no linear metric learning method can give correct knn classification. The linear transformation in Figure 1.c only rotates and scales the data, and so does the LMNN mapping in Figure 1.d. We can see that knn classification on the mapped data points in Figures 1.c and 1.d still cannot separate the two classes correctly.

However, our nonlinear transformation (to be explained later) can map the four points to the coordinates in Figure 1.b, which enables correct knn classification.

Nonlinear metric learning methods, although more expressive, are far less popular than linear methods. Often, they are not easy to use, since they require complex computation, not only for coefficient training, but also for model selection and hyper-parameter tuning. For example, kernelization methods [10], [12]-[14] are inherently limited by the size of the kernel matrices. Neural-network based methods [15] are also very expensive. Furthermore, these nonlinear methods often require tuning of many hyper-parameters. Their sensitivity to parameter tuning further hinders their off-the-shelf usability, especially for unknown domains.

Recently, Kedem et al. proposed two nonlinear metric learning algorithms, $\chi^2$-LMNN and GB-LMNN [16]. $\chi^2$-LMNN still uses the linear transformation in (1) but employs a non-Euclidean distance. However, $\chi^2$-LMNN has rather limited scope: it can only be applied when the input data are sampled from a simplex $S^D = \{x \in \mathbb{R}^D \mid x \ge 0,\ x^T \mathbf{1} = 1\}$. GB-LMNN learns a nonlinear mapping $\phi(x)$ which is an ensemble over a number of regression trees with different heights. GB-LMNN applies gradient boosting to learn the nonlinear mapping in a function space. However, GB-LMNN is computationally expensive, which limits its practicality.

In this paper, we propose a new metric learning framework called kernel density metric learning (KDML). It uses kernel density regression to nonlinearly map each attribute to a new feature space. The distance is then defined as the Euclidean distance in the new space. Although we focus on integrating KDML with LMNN and NCA in this paper, the framework is general and can be used to support many metric learning algorithms such as ITML and DCV [17], [18].

There are several salient advantages of the KDML approach. 1) It embodies excellent nonlinear distance measures with a sound probabilistic explanation. In fact, the Euclidean distance in the mapped feature space corresponds to established distance measures between probability density functions. As a result, such a nonlinear mapping allows us to correctly classify datasets that are notoriously difficult to tackle with linear metric learning methods. 2) Compared to kernel-based nonlinear metric learning methods, KDML is easier to use and offers good off-the-shelf usability. In fact, end users do not need to tune any parameter in KDML, since we can automatically learn its hyper-parameters using gradient descent. Moreover, KDML can be used as a preprocessing blackbox to map features and be integrated with any supervised metric learning algorithm. Thus, it allows us to leverage the extensive development of efficient and scalable linear metric learning methods. 3) Unlike most existing metric learning methods, which require the attributes to be numerical, KDML is the first metric learning algorithm that can naturally handle numerical, categorical, and mixed attributes in a unified fashion. 4) KDML can handle the cases when features are from heterogeneous domains, as it maps all features to density-based quantities that can be uniformly compared.

Fig. 1. A toy example with four points in two classes, marked in different shapes: (a) the original data; (b) the data after our KDML mapping (the two points in each class are very close to each other); (c) a random linear transformation; (d) the data after the LMNN mapping.
This paper makes the following contributions.
We introduce KDML, a nonlinear metric learning algorithm, by proposing a novel nonlinear mapping that provides a good similarity measure based on kernel density estimation. It can naturally handle both numerical and categorical features and offers good out-of-the-box usability.
We integrate KDML with prominent existing metric learning methods, including LMNN and NCA, and develop an optimization algorithm to train the model in a holistic way. The algorithm automatically finds optimal bandwidths for a Nadaraya-Watson kernel density estimator, which is absent in previous work.
We conduct an extensive evaluation on a collection of real-world datasets for multiway classification, with both numerical and categorical features. We show that KDML improves the performance of state-of-the-art metric learning methods on knn classification tasks.
The rest of this paper is organized as follows. Section 2 gives preliminaries on metric learning. Section 3 presents the proposed KDML model, including its nonlinear mapping and kernel density estimation. Section 4 discusses using KDML as a preprocessing step, and Section 5 combines KDML with two representative metric learning algorithms, LMNN and NCA, and presents the integrated optimization algorithm for learning the transformation matrix and kernel bandwidths. Section 6 surveys related work on metric learning. Section 7 presents experimental results of different metric learning algorithms on various

benchmark datasets. Finally, Section 8 concludes the paper.

2 PRELIMINARIES

In this paper, we focus on a supervised classification setting. The main ideas can also be extended to other settings such as weakly supervised, semi-supervised, or unsupervised ones. We assume we are given a training data set

$$T = \{(x_1, y_1), \ldots, (x_N, y_N)\} \subset \mathcal{D}_1 \times \cdots \times \mathcal{D}_D \times \mathcal{C}, \quad (2)$$

where the d-th feature is defined in a domain $\mathcal{D}_d$ and the labels $y_i$ are from a set of C classes $\mathcal{C} = \{1, \ldots, C\}$. Note that our setting is more general than typical previous settings, because the domain $\mathcal{D}_d$ can be either a numerical set such as $\mathbb{R}$ or a categorical set such as a set of colors.

For each new input x, the k-nearest neighbors (knn) algorithm classifies x by a majority vote among the k neighbors that are closest to x under a certain distance metric. The simple knn algorithm often gives surprisingly good classification quality compared to more complex methods such as SVM. Its decision surfaces are nonlinear, and the quality of the classification improves automatically as the amount of training data increases. However, knn classification relies heavily on the distance metric, and it therefore provides a most natural paradigm for evaluating distance metric learning algorithms [19].

Metric learning, first proposed by Xing et al. [5], aims at automatically learning a good distance metric. Most existing metric learning algorithms are linear methods which learn a Mahalanobis distance

$$d^2_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \quad (3)$$

where $M = L^T L$ is a positive semi-definite matrix. The new distance metric is $d_M(x_i, x_j)$. It can be seen as applying a linear transformation $L x_i$ to each data point $x_i$. Metric learning can also be used for dimensionality reduction by applying singular value decomposition (SVD) to M. Below, we briefly review some important metric learning algorithms. In this paper, we use two representative and popular metric learning algorithms, large margin nearest neighbors (LMNN) [7] and neighborhood components analysis (NCA) [20], as the basic metric learning methods to be integrated with KDML.

2.1 LMNN and ITML

Large margin nearest neighbor (LMNN) is a linear metric learning algorithm tailored for knn classification. LMNN learns a Mahalanobis distance as defined in (3). For each input $(x_i, y_i)$, LMNN specifies a number of target neighbors with the same label as $y_i$. Normally these m target neighbors are simply the m same-label neighbors that are closest to $x_i$ under the Euclidean distance. We use $j \rightsquigarrow i$ to denote that $x_j$ is a target neighbor of $x_i$, and $y_{ij} \in \{0, 1\}$ to denote whether the labels $y_j$ and $y_i$ match ($y_{ij} = 1$ when $y_i = y_j$). The objective of LMNN is to minimize

$$E(M) = (1-\mu) \sum_{i,\, j \rightsquigarrow i} d^2_M(x_i, x_j) + \mu \sum_{i,\, j \rightsquigarrow i,\, l} (1 - y_{il}) \left[1 + d^2_M(x_i, x_j) - d^2_M(x_i, x_l)\right]_+, \quad (4)$$

where $[z]_+ = \max(0, z)$ is the standard hinge loss and $\mu \in (0, 1)$ is a positive constant controlling the relative weights of the two terms. The first term minimizes the distance between each input and its target neighbors; the second term, incorporating the idea of a margin as in SVM, penalizes the distances to those mismatched points that invade the neighborhood of each input. It has been shown that the optimization in (4) can be reformulated as a semidefinite programming (SDP) problem [7]. Weinberger et al. proposed a specialized subgradient descent algorithm to solve this SDP by exploiting the sparsity of active invaders in the second term of (4).
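As a concrete illustration of (4), here is a minimal Python/NumPy sketch that evaluates the LMNN objective for a fixed Mahalanobis matrix M. The function names, the brute-force target-neighbor search, and the exhaustive impostor loop are illustrative choices only, not the specialized subgradient solver or the Matlab implementation used in our experiments.

```python
import numpy as np

def mahalanobis_sq(M, xi, xj):
    """Squared Mahalanobis distance d_M^2(xi, xj) = (xi - xj)^T M (xi - xj)."""
    d = xi - xj
    return float(d @ M @ d)

def lmnn_objective(X, y, M, m=3, mu=0.5):
    """Evaluate the LMNN loss of Eq. (4) for a fixed PSD matrix M.

    X : (N, D) data matrix; y : (N,) integer labels; m : number of target neighbors.
    The loss has a pull term over target neighbors and a hinge push term over impostors.
    """
    N = X.shape[0]
    pull, push = 0.0, 0.0
    for i in range(N):
        # Target neighbors: the m same-class points closest to x_i under the Euclidean distance.
        same = [j for j in range(N) if j != i and y[j] == y[i]]
        targets = sorted(same, key=lambda j: np.sum((X[i] - X[j]) ** 2))[:m]
        for j in targets:
            d_ij = mahalanobis_sq(M, X[i], X[j])
            pull += d_ij
            # Impostors: differently labeled points l whose hinge term is positive.
            for l in range(N):
                if y[l] != y[i]:
                    push += max(0.0, 1.0 + d_ij - mahalanobis_sq(M, X[i], X[l]))
    return (1.0 - mu) * pull + mu * push
```

In practice the double loop over impostors is avoided by caching the few active invaders per point, which is exactly what the specialized solver of Weinberger et al. exploits.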
LMNN has received great attention and popularity due to its good knn classification performance, its efficiency, and its ease of use. Another important work is information-theoretic metric learning (ITML) [8]. ITML also uses the linear transformation in (3), but it exploits a one-to-one correspondence between the Mahalanobis distance parameterized by M and a multivariate Gaussian,

$$P(x; M) = \frac{1}{Z} \exp\!\left(-\frac{1}{2} d^2_M(x, x_0)\right), \quad (5)$$

where Z is a normalization factor and $x_0$ is the mean of the Gaussian. Using this correspondence, the objective of ITML is to minimize

$$\mathrm{KL}\big(p(x; M_0)\,\|\,p(x; M)\big) = \int p(x; M_0) \ln \frac{p(x; M_0)}{p(x; M)}\, dx,$$

where $M_0$ is a fixed matrix such as I or the inverse covariance matrix. The intuition is to regularize M by minimizing the Kullback-Leibler (KL) divergence [21] between the implied distribution $P(x; M)$ and a prior distribution.

2.2 Neighborhood component analysis (NCA)

NCA is another linear metric learning method. Its basic idea is to find a distance metric that maximizes the performance of knn classification, measured by leave-one-out (LOO) validation. NCA also uses the linear transformation in (3) but directly learns the matrix L in $M = L^T L$. Note that we have $d^2_M(x_i, x_j) = (x_i - x_j)^T L^T L (x_i - x_j) = (Lx_i - Lx_j)^T (Lx_i - Lx_j)$. In NCA, each point $x_i$ selects another point $x_j$ as its neighbor with some probability $p_{ij}$, and inherits its class

label from that point. Using a softmax over the Euclidean distance in the transformed space, $p_{ij}$ ($i \ne j$) is defined as

$$p_{ij} = \frac{\exp(-\|Lx_i - Lx_j\|^2)}{\sum_{k \ne i} \exp(-\|Lx_i - Lx_k\|^2)}. \quad (6)$$

Under this definition, we can compute $p_i$, the probability that point i will be correctly classified:

$$p_i = \sum_{j \in C_i} p_{ij}, \quad \text{where } C_i = \{j \mid y_i = y_j\}. \quad (7)$$

According to (6) and (7), we have

$$p_i = \sum_{j \in C_i} p_{ij} = \frac{\sum_{j \in C_i} \exp(-\|Lx_i - Lx_j\|^2)}{\sum_{k \ne i} \exp(-\|Lx_i - Lx_k\|^2)}. \quad (8)$$

As this probability is a function of the transformation matrix L, we find the L that maximizes the expected number of correctly classified points,

$$f(L) = \sum_i p_i. \quad (9)$$

The corresponding gradient is

$$\frac{\partial f}{\partial L} = 2L \sum_i \left( p_i \sum_{k} p_{ik} (x_i - x_k)(x_i - x_k)^T - \sum_{j \in C_i} p_{ij} (x_i - x_j)(x_i - x_j)^T \right). \quad (10)$$

NCA uses gradient ascent to maximize (9). By restricting the size of L, NCA can also be used for dimensionality reduction.

3 THE KDML FRAMEWORK

In this section, we propose the KDML framework for nonlinear metric learning. We assume that we are given inputs $(x, y) \in \mathcal{D}_1 \times \cdots \times \mathcal{D}_D \times \mathcal{C}$, where the d-th feature is defined in a domain $\mathcal{D}_d$ and the label y is from a set of C classes $\mathcal{C} = \{1, \ldots, C\}$.

3.1 KDML feature mapping

Under the KDML framework, we propose two kinds of transformations for each input (x, y).

Density features. For each dimension $d = 1, \ldots, D$ and any $x_d \in \mathcal{D}_d$, there exists a conditional probability density function

$$P_d(c, x_d) = P(y = c \mid x_d), \quad c = 1, \ldots, C. \quad (11)$$

We use $P_d(x_d)$ to denote the vector $[P_d(1, x_d), \ldots, P_d(C, x_d)]$. In this transformation, each input is transformed into a vector

$$\phi_P(x) = [P_1(x_1); \ldots; P_D(x_D)], \quad (12)$$

which is a concatenation of all D probability density vectors. An alternative is to use the square roots of the probabilities,

$$S_d(c, x_d) = \sqrt{P(y = c \mid x_d)}, \quad c = 1, \ldots, C, \quad (13)$$

which leads to a corresponding feature vector $\phi_S(x)$.

Entropy features. For each dimension $d = 1, \ldots, D$ and any $x_d \in \mathcal{D}_d$, we compute the logarithm of the density,

$$E_d(c, x_d) = \ln P(y = c \mid x_d), \quad c = 1, \ldots, C. \quad (14)$$

Let $E_d(x_d)$ denote the vector $[E_d(1, x_d), \ldots, E_d(C, x_d)]$. In this mapping, each input is transformed into a vector

$$\phi_E(x) = [E_1(x_1); \ldots; E_D(x_D)], \quad (15)$$

which is a concatenation of all the entropy vectors.

In KDML, we choose one feature mapping from $\phi_P$, $\phi_S$, and $\phi_E$ and call it $\phi$. We may also include the original variables in the feature vector to make the mapping strictly more general than the original linear mapping. KDML then employs an existing linear metric learning method to learn a linear transformation $L\phi(x)$, which gives rise to a Mahalanobis distance in the mapped feature space.

3.2 Implied distance measures

We now discuss the distance measures implied by the above KDML features. They all correspond to sound distance or similarity measures between probability density functions. As a result, in many cases the Euclidean distance in the mapped feature space reflects a better distance measure than the Euclidean distance in the original input space. One way to view KDML is that it first transforms the original input into a new space. In the new space, before any metric learning, the similarity between two data points is based on the Euclidean distance between their feature vectors:

$$d^2(x_i, x_j) = (\phi(x_i) - \phi(x_j))^T (\phi(x_i) - \phi(x_j)). \quad (16)$$

In fact, since there are D dimensions, each data point corresponds to D probability density functions (PDFs), each with C possible values.
That is, for an input $x_i$, its PDF at the d-th dimension is

$$\mathrm{PDF}_{i,d} = [P_d(1, x_{i,d}), \ldots, P_d(C, x_{i,d})], \quad (17)$$

where we use $x_{i,d}$ to denote the d-th attribute of $x_i$. The distance in (16) can be viewed as a summation over the D dimensions,

$$d^2(x_i, x_j) = \sum_{d=1}^{D} \mathrm{diff}(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}), \quad (18)$$

where $\mathrm{diff}(\cdot, \cdot)$ measures the distance between two PDFs over the set $\mathcal{C}$.
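As a small illustration of the mappings in Section 3.1 and the per-dimension decomposition in (18), the following sketch builds $\phi_P$, $\phi_S$, or $\phi_E$ for one input and checks that the plain Euclidean distance in the mapped space is a sum of per-dimension PDF distances. The function and variable names are ours, and the conditional class probabilities are assumed to be precomputed (for example by the estimators of Section 3.3).

```python
import numpy as np

def kdml_features(prob, kind="entropy", eps=1e-12):
    """Map one input to a KDML feature vector.

    prob : (D, C) array with prob[d, c] ~ P(y = c | x_d) for each dimension d.
    kind : 'density' (phi_P), 'sqrt' (phi_S), or 'entropy' (phi_E).
    Returns the concatenation of the D per-dimension C-vectors.
    """
    if kind == "density":
        feat = prob
    elif kind == "sqrt":
        feat = np.sqrt(prob)
    elif kind == "entropy":
        feat = np.log(prob + eps)   # eps guards against log(0)
    else:
        raise ValueError(kind)
    return feat.reshape(-1)

def squared_distance(prob_i, prob_j, kind="entropy"):
    """Euclidean distance in the mapped space, Eq. (16)."""
    fi = kdml_features(prob_i, kind)
    fj = kdml_features(prob_j, kind)
    return float(np.sum((fi - fj) ** 2))

def squared_distance_per_dim(prob_i, prob_j, kind="entropy"):
    """The same quantity written as the per-dimension sum of Eq. (18)."""
    return sum(squared_distance(prob_i[d:d+1], prob_j[d:d+1], kind)
               for d in range(prob_i.shape[0]))
```

With kind='sqrt' the per-dimension terms are the squared-chord distances discussed next, and with kind='density' they are squared Euclidean distances between PDFs.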

Distance and similarity measures between PDFs have been extensively studied; a good survey of these measures can be found in [22]. To justify the features in KDML, we examine the underlying distance measures they imply.

When the density feature $\phi_S$ is used, the Euclidean distance between two points $x_i$ and $x_j$ is

$$d^2(x_i, x_j) = (\phi_S(x_i) - \phi_S(x_j))^T (\phi_S(x_i) - \phi_S(x_j)) = \sum_{d=1}^{D} \sum_{c=1}^{C} \left[\sqrt{P_d(c, x_{i,d})} - \sqrt{P_d(c, x_{j,d})}\right]^2.$$

Therefore, using $\phi_S$ features, the implied distance measure between the PDFs is

$$\mathrm{diff}_S(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \left[\sqrt{P_d(c, x_{i,d})} - \sqrt{P_d(c, x_{j,d})}\right]^2,$$

which is exactly the well-known squared-chord PDF distance measure [23], and also the square of the Matusita distance measure [24]. Hence, the $\phi_S$ feature implies the squared-chord and Matusita PDF distance measures.

When the density feature $\phi_P$ is used, the Euclidean distance between two points $x_i$ and $x_j$ is

$$d^2(x_i, x_j) = (\phi_P(x_i) - \phi_P(x_j))^T (\phi_P(x_i) - \phi_P(x_j)) = \sum_{d=1}^{D} \sum_{c=1}^{C} \left[P_d(c, x_{i,d}) - P_d(c, x_{j,d})\right]^2. \quad (19)$$

We can see that, using $\phi_P$ features, the implied distance measure between two PDFs is

$$\mathrm{diff}_P(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \left[P_d(c, x_{i,d}) - P_d(c, x_{j,d})\right]^2, \quad (20)$$

which is exactly the commonly used squared Euclidean distance measure between two PDFs [22]. Furthermore, the well-known squared $\chi^2$ PDF distance measure [25] is

$$\mathrm{diff}_{\chi^2}(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \frac{\left[P_d(c, x_{i,d}) - P_d(c, x_{j,d})\right]^2}{P_d(c, x_{i,d}) + P_d(c, x_{j,d})}. \quad (21)$$

Comparing (20) with (21), we can see that $\mathrm{diff}_{\chi^2}$ can be obtained if we apply a linear scaling to $\phi_P(x_i)$ and $\phi_P(x_j)$ (dividing each component by $\sqrt{P_d(c, x_{i,d}) + P_d(c, x_{j,d})}$) and then use the Euclidean distance as the distance measure. In this sense, metric learning is more general, since it learns a transformation M over the entire space of positive semi-definite matrices. Since M is learned under the guidance of an external objective, such as optimizing the knn classification accuracy, we expect it to give a better metric for each specific data mining task than the fixed scaling in the squared $\chi^2$ measure.

Finally, when the entropy feature $\phi_E$ is used, the Euclidean distance between two points $x_i$ and $x_j$ is

$$d^2(x_i, x_j) = (\phi_E(x_i) - \phi_E(x_j))^T (\phi_E(x_i) - \phi_E(x_j)) = \sum_{d=1}^{D} \sum_{c=1}^{C} \left[\ln P_d(c, x_{i,d}) - \ln P_d(c, x_{j,d})\right]^2 = \sum_{d=1}^{D} \sum_{c=1}^{C} \left[\ln \frac{P_d(c, x_{i,d})}{P_d(c, x_{j,d})}\right]^2. \quad (22)$$

Using $\phi_E$ features, the implied distance measure between the PDFs is

$$\mathrm{diff}_E(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \left[\ln \frac{P_d(c, x_{i,d})}{P_d(c, x_{j,d})}\right]^2, \quad (23)$$

which is not a known PDF distance measure to our knowledge, but it embodies, under a linear transformation that can be reflected in the Mahalanobis matrix M, the following squared variant of the KL divergence [21]:

$$\mathrm{diff}_{\mathrm{KL}^2}(\mathrm{PDF}_{i,d}, \mathrm{PDF}_{j,d}) = \sum_{c=1}^{C} \left[P_d(c, x_{i,d}) \ln \frac{P_d(c, x_{i,d})}{P_d(c, x_{j,d})}\right]^2.$$

In summary, the proposed features correspond to sound distance measures between PDFs. We believe that they usually give a more reasonable distance measure than the original Euclidean distance, and performing metric learning on these transformed features may allow us to improve many learning algorithms. Moreover, the proposed transformations address the problem of heterogeneous features, since they map the data into quantities based on probability densities, which can be directly compared in a uniform space.

3.3 Kernel density estimation for computing features

We have proposed the feature mappings $\phi_P$, $\phi_S$, and $\phi_E$ for KDML. We now estimate the conditional probability densities in these features.
From (11), (13), and (14), all of these features require estimating $P(y = c \mid x_d)$ for each $x_d \in \mathcal{D}_d$ and $c = 1, \ldots, C$. Once we have all the $P(y = c \mid x_d)$, the features can be computed. Given training data $T = \{(x_i, y_i)\}$, $i = 1, \ldots, N$, we partition T into C subsets $T_1, \ldots, T_C$, which contain the data points with labels $y = 1, \ldots, C$, respectively. To estimate $p(y = c \mid x_d)$, we distinguish the cases of categorical and numerical attributes, and we use $\hat{p}(y = c \mid x_d)$ to denote the estimates.

Categorical attributes. If an attribute $x_d$ takes categorical values, $p(y = c \mid x_d)$ can be estimated by the proportion of samples with $y = c$ among all the samples whose d-th attribute equals $x_d$. Thus, it can be computed as

$$\hat{p}(y = c \mid x_d) = \frac{|T_c \cap T_{x_d}|}{|T_{x_d}|}, \quad c = 1, \ldots, C, \quad (24)$$

where $T_{x_d} = \{x_i \mid x_{i,d} = x_d,\ i = 1, \ldots, N\}$ is the set of samples in T whose d-th attribute is $x_d$.

Numerical attributes. If an attribute $x_d$ takes numerical values, we propose to use a Nadaraya-Watson type kernel density regression to estimate $p(y = c \mid x_d)$. According to the Nadaraya-Watson estimator [26], [27], we have

$$\hat{p}(y = c \mid x_d) = \frac{\sum_{i \in T_c} K\!\left(\frac{x_d - x_{i,d}}{h_d}\right)}{\sum_{i=1}^{N} K\!\left(\frac{x_d - x_{i,d}}{h_d}\right)}, \quad (25)$$

where K(x) is a kernel function satisfying $K(x) \ge 0$ and $\int K(x)\, dx = 1$, and $h_d > 0$ is a parameter called the bandwidth of the kernel density function. In this paper, we choose the Gaussian kernel for K(x), namely,

$$K(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right). \quad (26)$$

We can thus compute the KDML features by substituting the estimates in (24) and (25) into (11), (13), and (14). For example, the entropy features for categorical and numerical attributes are, respectively,

$$E_d(c, x_d) = \ln \frac{|T_c \cap T_{x_d}|}{|T_{x_d}|}, \quad (27)$$

and

$$E_d(c, x_d) = \ln \frac{\sum_{i \in T_c} \exp\!\left(-\frac{(x_d - x_{i,d})^2}{2h_d^2}\right)}{\sum_{i \in T} \exp\!\left(-\frac{(x_d - x_{i,d})^2}{2h_d^2}\right)}. \quad (28)$$

We also comment on the difference between the assumptions of ITML and KDML. The main assumption of ITML is that all the data points are drawn from a single Gaussian distribution centered at $x_0$. Such an assumption may be too restrictive in some cases. KDML, in contrast, assumes a nonlinear distribution which is a mixture of multiple Gaussians in each dimension.

4 KDML AS PREPROCESSING

Now we discuss the training algorithm for KDML. In principle, KDML can be combined with any metric learning algorithm. One scheme is to first fix the kernel bandwidths $h_d$ in the Nadaraya-Watson estimator before learning the transformation matrix L. In this case, KDML can be viewed as a preprocessing step that transforms each input $x_d$ into $\phi(x_d)$. We can simply call an existing metric learning algorithm such as LMNN or NCA to learn L based on the new features $\phi(x_d)$. The key question in this static scheme is how to choose the static kernel bandwidth. We discuss two methods below.

Rule of thumb. One way to choose $h_d$ is to use a rule of thumb to set heuristic $h_d$ values. A popular one is Silverman's rule of thumb [28]:

$$h_d = 1.06\, \sigma_d\, N^{-1/5}, \quad (29)$$

where $\sigma_d$ is the standard deviation of $x_d$.

Local scaling. When the data resides on multiple scales (e.g., one cluster is tight and another is sparse), using a single bandwidth may fail to give an accurate estimate of the data distribution [29]. To address this issue, we can use local scaling [30], a technique that assigns a dynamic bandwidth $h_{i,d}$ to each data point $x_i$. Using adaptive scaling for each point allows self-tuning of the proper bandwidth according to the local statistics of the neighborhood surrounding $x_i$. In particular, for point $x_i$ at dimension d, we set its kernel bandwidth to

$$h_{i,d} = |x_{i,d} - x_{K,d}|, \quad (30)$$

where $x_K$ is the K-th nearest neighbor of $x_i$ on the d-th dimension. The selection of K is independent of scale and is a function of the data dimension of the embedding space. As suggested by [30], we set K = 7 in our experiments. Under adaptive scaling, the kernel density estimation for numerical attributes changes accordingly. For example, the entropy feature becomes

$$E_d(c, x_{k,d}) = \ln \frac{\sum_{i \in T_c} \exp\!\left(-\frac{(x_{k,d} - x_{i,d})^2}{2 h_{k,d} h_{i,d}}\right)}{\sum_{i \in T} \exp\!\left(-\frac{(x_{k,d} - x_{i,d})^2}{2 h_{k,d} h_{i,d}}\right)}. \quad (31)$$

Comparing (31) with (6), we can see that both perform a transformation based on kernel densities. The difference is that, by introducing local scaling, the transformation for data in the same column is no longer identical: the affinity between each point and its surrounding neighbors differs.
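To make the estimator (25) and the two static bandwidth choices (29) and (30) concrete, here is a minimal NumPy sketch; the function names are ours and the code is an illustration under these assumptions, not the Matlab implementation used in the experiments.

```python
import numpy as np

def silverman_bandwidth(xd):
    """Silverman's rule of thumb, Eq. (29): h_d = 1.06 * sigma_d * N^(-1/5)."""
    return 1.06 * np.std(xd) * len(xd) ** (-0.2)

def local_bandwidths(xd, K=7):
    """Local scaling, Eq. (30): h_{i,d} = |x_{i,d} - x_{K,d}|, the distance to the
    K-th nearest neighbor on this dimension (K = 7 as suggested by [30])."""
    diffs = np.abs(xd[:, None] - xd[None, :])   # (N, N) pairwise distances
    diffs.sort(axis=1)                          # column 0 is the self-distance 0
    return diffs[:, K]

def conditional_probs(xd, y, query, h, n_classes):
    """Nadaraya-Watson estimate of P(y = c | x_d = query), Eq. (25), with a
    Gaussian kernel and a single bandwidth h for this dimension.
    The Gaussian normalization constant cancels in the ratio."""
    w = np.exp(-((query - xd) ** 2) / (2.0 * h ** 2))
    probs = np.array([w[y == c].sum() for c in range(n_classes)])
    return probs / max(probs.sum(), 1e-12)
```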
In particular, the affinities across clusters are significantly lower than the affinities within any single cluster.

5 INTEGRATED KDML OPTIMIZATION

Instead of using KDML as a preprocessing step, we can also combine it with another metric learning algorithm and learn the kernel bandwidths $h_d$ and the transformation matrix L in an integrated optimization procedure. In principle, KDML, which nonlinearly maps the features in the original space to a new space, is a general framework that can be combined with any existing metric learning algorithm. As concrete examples, we combine KDML with two leading metric learning algorithms, LMNN and NCA, and show how to derive the integrated learning algorithm.

5.1 Integrated optimization algorithm

First, we map each training point x into a feature $\phi(x)$ (which may be $\phi_P(x)$, $\phi_S(x)$, or $\phi_E(x)$). Then, we use the metric learning algorithm to learn a transformation $L\phi(x)$, which leads to a Mahalanobis distance

$$d^2_M(x_i, x_j) = (\phi(x_i) - \phi(x_j))^T M (\phi(x_i) - \phi(x_j)), \quad (32)$$

where $M = L^T L$ is a positive semi-definite matrix.
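As a small illustration of (32), the following sketch (with hypothetical names; the map L is assumed to come from whichever metric learning package is used, rather than a specific API) applies a learned linear transformation on top of the KDML feature vectors.

```python
import numpy as np

def kdml_distance_sq(phi_i, phi_j, L):
    """Squared Mahalanobis distance of Eq. (32) in the KDML feature space.

    phi_i, phi_j : mapped feature vectors phi(x_i), phi(x_j) of length D*C,
                   e.g. the entropy features of Eq. (28).
    L            : (k, D*C) transformation learned by any linear metric
                   learner (LMNN, NCA, ...) on the mapped training features.
    """
    z = L @ (phi_i - phi_j)          # L phi(x_i) - L phi(x_j)
    return float(z @ z)              # equals diff^T M diff with M = L^T L
```

Used as preprocessing (Section 4), the mapped features are computed once under fixed bandwidths and handed to the chosen learner; in the integrated scheme below, they are recomputed whenever h is updated.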

Algorithm 1 Optimization for Kernel Density Metric Learning
1: Initialize h using (29)
2: repeat
3:    compute the feature matrix $\Phi_h$
4:    call an existing metric learning algorithm to optimize M under fixed h and $\Phi_h$
5:    for d = 1 to D do {under fixed M}
6:       if the d-th dimension is numerical then
7:          $h_d \leftarrow h_d - \gamma\, \partial E / \partial h_d$ {gradient descent for h}
8:       end if
9:    end for
10: until h converges
11: output h and M

Although the static methods above often give solid performance, we can in fact derive a novel way to automatically choose the optimal $h_d$ based on the objective; such automatic tuning is absent in previous work. The training of KDML aims at learning the optimal values of the matrix M and the bandwidths h in the Nadaraya-Watson estimators. Applying LMNN or NCA to $\phi(x)$, we solve the corresponding problem of minimizing its objective function E. Our optimization algorithm for training the combination of KDML and a metric learning algorithm is shown in Algorithm 1. It contains two levels of optimization: an outer loop, which optimizes h using subgradient descent, and an inner loop, which learns M using the original LMNN or NCA package under fixed h.

In the outer level, at each iteration the feature matrix $\Phi_h$, composed of the entries $\phi_{c,d}(x_k)$, is updated based on the new h. If the variable at dimension d is categorical, its feature is computed by (24); for a numerical variable $x_d$, we use the kernel density estimation in (25) to compute its feature. Then, entering the inner level, we use the original LMNN or NCA training algorithm to learn the M that minimizes E(M) in (33) under fixed h and $\Phi_h$. Finally, we optimize h for the numerical attributes by performing descent along the subgradient direction based on the validation set. We use a line search with the Armijo rule to choose the step size $\gamma$. As Algorithm 1 learns both M and h, there are no other hyper-parameters to tune.

The complexity of Algorithm 1 mainly depends on the complexity of the metric learning algorithm we choose. Let n be the number of instances in the data set, d the number of dimensions, and c the number of classes, and let O(f(n, d)) be the time complexity of the chosen metric learning algorithm. In each outer loop, computing the feature matrix (Line 3) costs $O(n^2 d)$, the optimization of E(M) (Line 4) costs $O(f(n, dc))$, and updating the gradient of h (Line 7) costs $O(n^2 (dc)^2)$. In practice, the metric learning part (Line 4) is the most time-consuming, while computing the feature matrix and updating the gradient of h are much less expensive. Thus, the metric learning part dominates the running time of Algorithm 1.

5.2 The KD-LMNN approach

We use KD-LMNN to denote the LMNN model combined with the KDML features. Applying LMNN to the KDML feature $\phi(x)$, we solve the problem of minimizing

$$E(M) = (1-\mu) \sum_{i,\, j \rightsquigarrow i} d^2_M(x_i, x_j) + \mu \sum_{i,\, j \rightsquigarrow i,\, l} (1 - y_{il}) \left[1 + d^2_M(x_i, x_j) - d^2_M(x_i, x_l)\right]_+, \quad (33)$$

where $d^2_M(x_i, x_j)$ is defined in (32). For this minimization, a nice fact is that we can obtain the subgradient of E with respect to the bandwidths in closed form and compute it efficiently. There are two terms in (33). Let

$$E_1 = \sum_{i,\, j \rightsquigarrow i} d^2_M(x_i, x_j) \quad (34)$$

and

$$E_2 = \sum_{i,\, j \rightsquigarrow i,\, l} (1 - y_{il}) \left[1 + d^2_M(x_i, x_j) - d^2_M(x_i, x_l)\right]_+, \quad (35)$$

so that

$$E(M) = (1-\mu) E_1 + \mu E_2. \quad (36)$$

We compute the subgradients of these two terms separately, where (with a slight abuse of notation) $x_k$ below denotes the mapped feature vector $\phi(x_k)$. First, since

$$\frac{\partial d^2_M(x_i, x_j)}{\partial x_i} = 2M(x_i - x_j), \quad (37)$$

we have

$$\frac{\partial E_1}{\partial x_k} = 2\sum_{j:\, j \rightsquigarrow k} M(x_k - x_j) + 2\sum_{j:\, k \rightsquigarrow j} M(x_k - x_j). \quad (38)$$

For $E_2$, note that $\partial E_2 / \partial x_k$ is a subgradient, since $E_2$ involves a hinge loss and is non-differentiable whenever the term inside $[\cdot]_+$ is zero.
Therefore, when

$$1 + d^2_M(x_i, x_j) - d^2_M(x_i, x_l) < 0, \quad (39)$$

the corresponding contribution to $\partial E_2 / \partial x_k$ is 0; otherwise, we have

$$\frac{\partial E_2}{\partial x_k} = 2\left[\sum_{j:\, j \rightsquigarrow k} \sum_{l} (1 - y_{kl})\, M(x_l - x_j) + \sum_{i:\, k \rightsquigarrow i} \sum_{l} (1 - y_{il})\, M(x_k - x_i) - \sum_{i,\, j \rightsquigarrow i} (1 - y_{ik})\, M(x_k - x_i)\right], \quad (40)$$

where each sum ranges only over the triplets whose hinge term in (35) is positive. Then, according to (36), for $d = 1, \ldots, D$ we have

$$\frac{\partial E}{\partial h_d} = \sum_k \left[(1-\mu)\frac{\partial E_1}{\partial x_k} + \mu\,\frac{\partial E_2}{\partial x_k}\right]^T \frac{\partial x_k}{\partial h_d}, \quad (41)$$

where $\partial x_k / \partial h_d = \left[0, \ldots, \partial \phi_{c,d}(x_k)/\partial h_d, \ldots, 0\right]^T$ is a column vector whose nonzero entries correspond to dimension d, and $\phi_{c,d}(x_k)$ is the KDML feature value for the d-th dimension and class c of $x_k$. Using the entropy feature in (14) as an example, we know

$$\phi_{c,d}(x_k) = \ln \frac{\sum_{i \in T_c} \exp\!\left(-\frac{(x_{k,d} - x_{i,d})^2}{2 h_d^2}\right)}{\sum_{i \in T} \exp\!\left(-\frac{(x_{k,d} - x_{i,d})^2}{2 h_d^2}\right)}. \quad (42)$$

We now compute $\partial \phi_{c,d}(x_k)/\partial h_d$. Let $r_d = 1/(2 h_d^2)$; then

$$\frac{\partial \phi_{c,d}(x_k)}{\partial r_d} = -\frac{\sum_{i \in T_c} (x_{k,d} - x_{i,d})^2 \exp\!\left(-r_d (x_{k,d} - x_{i,d})^2\right)}{\sum_{i \in T_c} \exp\!\left(-r_d (x_{k,d} - x_{i,d})^2\right)} + \frac{\sum_{i \in T} (x_{k,d} - x_{i,d})^2 \exp\!\left(-r_d (x_{k,d} - x_{i,d})^2\right)}{\sum_{i \in T} \exp\!\left(-r_d (x_{k,d} - x_{i,d})^2\right)} \quad (43)$$

and

$$\frac{\partial \phi_{c,d}(x_k)}{\partial h_d} = \frac{\partial \phi_{c,d}(x_k)}{\partial r_d} \cdot \frac{\partial r_d}{\partial h_d} = -\frac{1}{h_d^3}\,\frac{\partial \phi_{c,d}(x_k)}{\partial r_d}. \quad (44)$$

Putting things together, we obtain the closed form of $\partial E/\partial h_d$ by assembling (41), (38), (40), (43), and (44). This closed form looks complex but can in fact be computed efficiently. It has two parts, involving $\partial E_1/\partial x_k$ and $\partial E_2/\partial x_k$. The first has a simple form and only involves pairs of neighboring points satisfying $j \rightsquigarrow k$ or $k \rightsquigarrow j$. For the second, it is important to note that it is a subgradient: for $E_2$ in (35), for each point i we only need to consider those active l that are invading the neighborhood of i, so that the corresponding $[\cdot]_+$ term is positive. There are typically few such invaders. This is observed and exploited in LMNN to speed up its gradient computation [31]. In [31], it is found that the k target neighbors and the invaders do not change frequently across iterations. The LMNN package maintains such information in a data structure for efficient updates during the optimization process. This data structure is adapted in our implementation to support efficient computation of $\partial E/\partial h_d$.

5.3 The KD-NCA approach

We use KD-NCA to denote the NCA model combined with the KDML features. We still use Algorithm 1 to learn the transformation matrix M and the kernel bandwidths h in KD-NCA. For each input $x_i$, according to NCA, the probability that the point will be correctly classified is

$$p_i = \frac{\sum_{j \in C_i} \exp(-\|M\phi(x_i) - M\phi(x_j)\|^2)}{\sum_{k \ne i} \exp(-\|M\phi(x_i) - M\phi(x_k)\|^2)}. \quad (45)$$

We minimize the negative logarithm of the likelihood,

$$E(M, H) = -\sum_{i=1}^{N} \ln(p_i), \quad (46)$$

where H is the matrix of the bandwidths $h_{i,d}$ ($1 \le i \le N$, $1 \le d \le D$). We minimize E(M, H) by performing gradient descent in the joint space of M and H. Let $\Delta\phi(x_i, x_j) = \phi(x_i) - \phi(x_j)$; the derivative of E with respect to M is

$$\frac{\partial E}{\partial M} = -2M \sum_i \left( \sum_{l} p_{il}\, \Delta\phi(x_i, x_l)\, \Delta\phi(x_i, x_l)^T - \frac{\sum_{j \in C_i} p_{ij}\, \Delta\phi(x_i, x_j)\, \Delta\phi(x_i, x_j)^T}{\sum_{j \in C_i} p_{ij}} \right). \quad (47)$$

In order to find the gradient with respect to H, we first calculate $\partial \phi_{c,d}(x_k)/\partial h_{k,d}$. Let $r_{i,d} = \frac{1}{\sqrt{2}\, h_{i,d}}$, so that under local scaling the entropy feature (31) can be written as

$$\phi_{c,d}(x_k) = \ln\!\left[\frac{\sum_{i \in T_c} \exp\!\left(-(x_{k,d} - x_{i,d})^2\, r_{i,d}\, r_{k,d}\right)}{\sum_{i=1}^{N} \exp\!\left(-(x_{k,d} - x_{i,d})^2\, r_{i,d}\, r_{k,d}\right)}\right]. \quad (48)$$

Moreover, by the chain rule,

$$\frac{\partial \phi_{c,d}(x_k)}{\partial h_{k,d}} = \frac{\partial \phi_{c,d}(x_k)}{\partial r_{k,d}} \cdot \frac{\partial r_{k,d}}{\partial h_{k,d}}, \quad (49)$$

where

$$\frac{\partial \phi_{c,d}(x_k)}{\partial r_{k,d}} = -\frac{\sum_{i \in T_c} r_{i,d} (x_{k,d} - x_{i,d})^2 \exp\!\left(-(x_{k,d} - x_{i,d})^2 r_{i,d} r_{k,d}\right)}{\sum_{i \in T_c} \exp\!\left(-(x_{k,d} - x_{i,d})^2 r_{i,d} r_{k,d}\right)} + \frac{\sum_{i \in T} r_{i,d} (x_{k,d} - x_{i,d})^2 \exp\!\left(-(x_{k,d} - x_{i,d})^2 r_{i,d} r_{k,d}\right)}{\sum_{i \in T} \exp\!\left(-(x_{k,d} - x_{i,d})^2 r_{i,d} r_{k,d}\right)} \quad (50)$$

and

$$\frac{\partial E}{\partial \phi_{c,d}(x_k)} = \sum_i \left( \sum_{l} p_{il}\, \|M\phi(x_i) - M\phi(x_l)\|^2 - \frac{\sum_{j \in C_i} p_{ij}\, \|M\phi(x_i) - M\phi(x_j)\|^2}{\sum_{j \in C_i} p_{ij}} \right). \quad (51)$$

Summarizing (50) and (51), we get

$$\frac{\partial E}{\partial h_{k,d}} = \frac{\partial E}{\partial \phi_{c,d}(x_k)} \cdot \frac{\partial \phi_{c,d}(x_k)}{\partial r_{k,d}} \cdot \frac{\partial r_{k,d}}{\partial h_{k,d}}. \quad (52)$$

6 RELATED WORK

A number of prior works on metric learning have focused on learning a linear transformation in the original input space [5], [6], [8], [20], [31]-[35]. They have achieved great success in improving the performance of learning algorithms by obtaining better Mahalanobis distance measures. The concept of distance metric learning was first proposed by Xing et al. [5]. Their objective is to learn a Mahalanobis matrix such that similar points are clustered together, subject to the constraints that the distances between dissimilar points are larger than a lower bound. Inspired by this general idea, many works have been developed.
LMNN [31] identifies the local target neighbors in the original space for each point and learns a Mahalanobis matrix such that the non-target neighbors for each point are encouraged to be far away from all its target neighbors with a large margin.

9 9 ITML [8] assumes that there exists a bijection between the Mahalanobis distance and a single multivariate Gaussian distribution. It minimizes the KL divergence between a prior distribution and the distribution implies by the Mahalanobis distance, subject to upper bound constraints on the distance between similar points and lower bound constraints on the distance between dissimilar points. Considering both must-link and cannot-link constraints, [36] proposed a linear transformations or equivalently global Mahalanobis metrics. Laplacian Regularized Metric Learning (LRML [37] are used in learning robust distance metrics. The SEPAPH [11] approach also relies on a mapping from the Mahalanobis distance to a probability distribution but extends to semi-supervised metric learning based on regularization. Neighborhood components analysis (NCA [6] maximizes a softmax function that smooths the leaveone-out accuracy of knn classification. However, it has a nonconvex objective function and suffers from local minima. Maximally collapsing metric learning (MCML [32] constructs a convex objective based on the same softmax function to characterize the distribution. It minimizes for each point the KL divergence between a bi-level distribution and the desired distribution under the Mahalanobis distance, where the bi-level distribution is zero for similar points and non-zero for dissimilar points. All the above methods look for a linear transformation. However, the linearly transformed features fail to have satisfactory performance on many cases, such as the example in Figure 1. Another well-known example is the case where the two classes of data points are form two concentric circles [31], which we will illustrate in Section 6. All the above linear methods will fail on this example. There are also some existing work on nonlinear metric learning. One nonlinear extension is to kernelize existing methods and use the Representer s Theorem to represent the nonlinear transformation using the kernel matrix [9], [10], [12]. An nonlinear extension to NCA has also been proposed [19] which tunes a multilayer neural network to learn a nonlinear transformation. Dimensionality Reduction by Learning an Invariant Mapping (DrLIM [38] also learns a nonlinear mapping. Like NCA, DrLIM uses the same-class neighborhood structure to drive the optimization: observations with the same class label are driven to be close-by in the feature space. However, these methods have not replicated the success and out-of-the-box usability of linear metric learning methods. In general, direct kernelization of linear metric learning methods are sensitive to hyper-parameters and their utility is limited inherently by the sizes of kernel matrices [16]. Another nonlinear approach, MM-LMNN [31], uses multiple metrics for different clusters of data to achieve global nonlinearity, where the clusters are obtained by the k-means algorithm. However, the transformation is locally linear with respect to each cluster and the crosscluster distances cannot be easily learned. Two other nonlinear methods are recently proposed in [16]. χ 2 -LMNN uses a nonlinear χ 2 distance measure. It is intended for histogram data and can only be applied when all the data lie on a simplex S D = {x R D x 0, x T 1 = 1}. KDML has much wider applicability than χ 2 -LMNN since it can process any input by transforming categorical attributes into histograms and numerical attributes into probability densities. 
GB-LMNN uses a set of gradient boosting regression trees with different heights and optimizes the objective function of LMNN. GB-LMNN is shown to perform better than its linear counterpart LMNN and MM-LMNN [16]. However, GB- LMNN is computational much more expensive and can handle very limited data sizes. Another nonlinear method, KD-LMNN [39], constructs a nonlinear mapping from the original input space into a feature space based on kernel density estimation. Then they integrate this mapping with LMNN and acquire better knn classification quality. 7 EXPERIMENTAL RESULTS In this section, we report experimental results to evaluate the proposed KD-LMNN and KD-NCA models. Both of them automatically tune h as shown in Algorithm 1. We evaluate the following versions of algorithms: KD-LMNN and KD-NCA with φ P, φ S, or φ E features (denoted as KD-LMNN P H, KD-NCAP H, KD-LMNNS H, KD- NCA S H, and KD-LMNNE H, KD-NCAE H, respectively, and KD-LMNN and KD-NCA using local scaling bandwidths with φ E feature (denoted as KD-LMNN E A and KD-NCA E A, respectively. For comparison, we also evaluate two leading linear metric learning algorithms including LMNN [7] and ITML [8]. We also evaluate a state-of-the-art nonlinear metric learning algorithm MM-LMNN [31], which first groups data into clusters and then uses multiple linear mappings for different clusters to achieve globally nonlinear mapping. For LMNN and MM-LMNN, we use their authors packages. 1. ITML and NCA codes are obtained from their websites 2 3. We also evaluate using the original Euclidean distance as a baseline. We implemented our algorithms inside the LMNN and NCA packages, which are implemented in the Matlab environment. All experiments are performed on a desktop computer with 2.67GHz CPU and 8G memory running Mac OS X Illustrations on toy cases For sanity check and illustration, we first test on a simple example in Figure 1a. This data cannot be correctly 1. kilian/code/lmnn/lmnn.html 2. fowlkes/software/nca/ 3. pjain/itml/

separated by any linear metric learning algorithm. Since KDML maps the data to a higher-dimensional space, to visualize the mapping in 2-D we extract a 2-D transformation $L \in \mathbb{R}^{2 \times D}$ from M using eigendecomposition. Such dimensionality reduction is in fact another main utility of metric learning and is already implemented in LMNN. Figure 1b shows the 2-D mapping result of KD-LMNN$^E_H$, which clearly separates the two classes. Figures 1c and 1d show that linear metric learning methods cannot separate the two classes.

We also test another toy example, shown in Figure 2a. It contains two concentric circles of data from two different classes. This is a very difficult case for metric learning and distance-based classification, since the nearest neighbor of any given data point is from the other class. It is a well-known example in which no linear transformation can separate the two classes [16]. Figures 2b to 2e illustrate the process of KD-LMNN$^E_H$ in Algorithm 1, which automatically tunes the kernel bandwidth h. For better visualization, the results in Figures 2b to 2e are obtained by applying Algorithm 1 and extracting a 2-D mapping using the eigendecomposition of M at each outer-loop iteration. We can see that the knn classification error quickly decreases from 45% after the initial KDML mapping to 0% in just four major iterations of optimizing h using subgradient descent.

Fig. 2. A toy example with two circles in two classes, marked in different colors; KD-LMNN$^E_H$ is used. (a) shows the original data (knn error 100%); (b)-(e) show the data mapping and knn classification error after each outer-loop iteration of Algorithm 1, which tunes h (errors of 45%, 16%, 13%, and 0%, respectively). The classification error quickly decreases to zero as h is optimized using subgradient descent.

TABLE 1. The number of instances N, the number of classes C, the number of numerical features D_n, and the number of categorical features D_c of the tested UCI datasets. The numerical datasets are Glass, Wine, Breast-cancer, Hepatitis, and Handwritten Digits; the mixed datasets are Contraceptive, Statlog Heart, and Hayes-Roth; the categorical datasets are Balance Scale and Car.

7.2 Comparison of KDML features

In this part, we test all the algorithms on benchmark datasets from the UCI repository [40]. We choose datasets mostly with multiple (three or more) classes, since knn has salient advantages over other methods such as SVM on multiway classification. For each dataset, we run a 10-fold cross validation with 90/10 splits and report the average results. We use k = 3 for knn classification in all cases. Table 1 lists the main characteristics of the tested datasets; there are datasets with numerical, categorical, and mixed attributes. Since we have proposed three features for KDML,

including $\Phi_P$, $\Phi_S$, and $\Phi_E$, we first compare the performance of these features. Table 2 shows the results for the three different features. We can see that the entropy features $\Phi_E$ consistently perform well in most cases. Therefore, we use them in the following comparisons against the other metric learning methods.

TABLE 2. knn classification error (in %, ± standard deviation) of the various KDML features (KD-LMNN$^P_H$, KD-LMNN$^S_H$, KD-LMNN$^E_H$, KD-NCA$^P_H$, KD-NCA$^S_H$, KD-NCA$^E_H$) on the UCI datasets Glass, Wine, Contraceptive, Statlog Heart, Hayes-Roth, Balance Scale, and Car, averaged over 10-fold 90/10 training-testing splits. For KD-LMNN and KD-NCA, the best features are shown in bold. We do not compare across different versions of KD-LMNN and KD-NCA since we aim at determining the best feature transformation.

7.3 Results on numerical datasets

Table 3 compares the knn classification errors of the various algorithms on the numerical datasets. For the KDML algorithms, we use the entropy features and test KD-LMNN and KD-NCA with both automatic tuning and local scaling of the kernel bandwidth. We can make a few observations from Table 3. First, directly using the Euclidean distance gives the worst performance; it is clear that metric learning of any kind greatly improves the results. Second, the linear metric learning algorithms, including LMNN, NCA, and ITML, are in general not as good as the nonlinear algorithms, as they tend to give higher classification errors. Third, all KDML algorithms, combined with LMNN or NCA, consistently and significantly outperform the other algorithms, including the nonlinear algorithm MM-LMNN, in all cases. Finally, it is observed that automatic tuning of h is more suitable for KD-LMNN, while local scaling is better for KD-NCA.

TABLE 3. knn classification error (in %, ± standard deviation) of the various methods (Euclidean, LMNN, NCA, ITML, MM-LMNN, KD-LMNN$^E_H$, KD-LMNN$^E_A$, KD-NCA$^E_H$, KD-NCA$^E_A$) on the numerical datasets Glass, Wine, Breast-cancer, Hepatitis, and Handwritten Digits, averaged over 10-fold 90/10 training-testing splits. Best results are shown in bold.

7.4 Results on categorical and mixed datasets

Another major advantage of KDML is its ability to naturally handle categorical variables. We also evaluate our algorithms on datasets with categorical attributes and mixed data types from the UCI repository. To deal with a categorical attribute x, KDML transforms x into the numerical features $\phi_P(x)$, $\phi_S(x)$, or $\phi_E(x)$, as before. For the other algorithms, we use a typical multinomial encoding to handle categorical variables: each categorical attribute x that has m different categories is transformed into m numbers, with only one of the numbers being 1 and the others being 0. Table 4 lists the knn classification results on the datasets with categorical and mixed attributes. Again, we observe that all KDML algorithms consistently perform very well across all cases.

Fig. 3. Sample images in the Yale dataset.

7.5 Results on face recognition

We also carry out experiments on two well-known face recognition datasets, Yale and ORL [41]. The Yale dataset contains 165 gray-scale face images of 15 persons, each having 11 images (illustrated in Figure 3).
For each person, his/her images may have different illumination and facial expression/configuration including centerlight, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. The ORL face dataset contains 400 images of 40 persons, where each person has 10 images. All the images

were taken against a dark homogeneous background with the subjects in an upright frontal position, with tolerance for some tilting and rotation. The images of each person have variations in facial expression (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses), and there is some variation in scale by up to about 10%. For both datasets, all images are cropped to the same size in pixels, and the gray-scale values of all images are rescaled to [0, 1]. A random subset with p (p = 2, 3, ..., 8) images per individual was taken with labels to form the training set, and the rest of the images were used as the testing set. For each given p, we average the results over 50 random splits and report the mean values. Both the training and the testing images are mapped into a low-dimensional subspace using PCA, where recognition is carried out by a knn classifier (k = 3 in all cases). For a fair comparison, we use the same reduced dimensionality for all methods (37 for Yale and 39 for ORL), as suggested in [41]. Figures 4 and 5 compare the performance of the various methods on the two datasets, respectively. We can see that, for both datasets, KD-LMNN and KD-NCA give the best classification results among all the compared metric learning algorithms, over all the different values of p.

TABLE 4. The testing error (in %, ± standard deviation) of the various methods (Euclidean, LMNN, NCA, ITML, MM-LMNN, KD-LMNN$^E_H$, KD-LMNN$^E_A$, KD-NCA$^E_H$, KD-NCA$^E_A$) on the categorical and mixed datasets Contraceptive, Statlog Heart, Hayes-Roth, Balance Scale, and Car, averaged over 10-fold 90/10 training-testing splits. Best results are shown in bold.

Fig. 4. Face recognition results on the Yale dataset: error rate on the testing set versus the number of training images per person (p). knn denotes the original Euclidean distance.

Fig. 5. Face recognition results on the ORL dataset: error rate on the testing set versus the number of training images per person (p). knn denotes the original Euclidean distance.

8 CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed a novel kernel density metric learning (KDML) framework for nonlinear metric learning. KDML is fundamentally different from previous metric learning algorithms, since it introduces a nonlinear mapping from the original input space into a probability density space, based on Nadaraya-Watson kernel density estimation. We have shown that the nonlinear mapping in KDML embodies established distance measures between probability density functions and leads to effective classification on datasets for which linear metric learning methods would fail. KDML can be used as a preprocessing step and combined with existing metric learning algorithms.

KDML addresses a key challenge for knn classification and metric learning. When the features are from heterogeneous domains and have vastly different scales, knn may give poor results, since it does not make much sense to compute the Euclidean distance between two original feature vectors. KDML maps the features into density-based quantities which can be used as a uniform basis for deriving distance measures.

We have integrated KDML with LMNN and NCA. We proposed two static schemes to set the kernel bandwidth for density estimation, including a rule of thumb and a local scaling scheme that adapts to the local neighborhood structure of the given data.
In addition, we have also derived the closed form of the subgradients of the objective function with respect to the kernel bandwidths. We have then derived an integrated optimization algorithm for learning the Mahalanobis matrix and kernel bandwidths. Such automatic learning of kernel bandwidths in a Nadaraya-Watson estimator is

13 13 not found in previous work. Extensive results on real-world numerical and categorical data show that, KDML gives significantly better knn classification quality than other linear and nonlinear metric learning algorithms. Unlike previous metric learning algorithms, KDML can naturally handle both numerical and categorical data. It is also easy to use and offers good off-the-shelf usability. These advantages make KDML an attractive general approach for metric learning. Our ongoing work focuses on combining the nonlinear features in KDML with more expressive parametric forms of the distance function such as that in χ 2 -LMNN and KL-divergence, instead of the simple Euclidean l 2 form. The flexibility in both feature mappings and distance functions may enable us to construct superior distance/similarity measures for a wide range of applications. ACKNOWLEDGMENT This work is partially supported by the CNS , CCF , and IIS grants from the National Science Foundation of the United States, a Microsoft Research New Faculty Fellowship, a Washington University URSA grant, and a Barnes-Jewish Hospital Foundation grant. REFERENCES [1] F. Wang and J. Sun, Survey on distance metric learning and dimensionality reduction, Data Mining and Knowledge Discovery(DMKD, [2] Y. Yang and X. Liu, A re-examination of text categorization methods, in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR 99. New York, NY, USA: ACM, 1999, pp [Online]. Available: [3] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, Neighbourhood components analysis, Advances in Neural Information Processing Systems, vol. 17, pp , [4] W. Shoombuatong, P. Mekha, KitsanaWaiyamai, S. Cheevadhanarak, and J. Chaijaruwanich, Prediction of human leukocyte antigen gene using k-nearest neighbour classifier based on spectrum kernel, ScienceAsia, vol. 39, pp , [5] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell, Distance metric learning with application to clustering with side-information, in Proc. NIPS, 2002, pp [6] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, Neighbourhood components analysis, in Proc. NIPS, 2004, pp [7] K. Weinberger, J. Blitzer, and L. Saul, Distance metric learning for large margin nearest neighbor classification, in Proc. NIPS, [8] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, Information-theoretic metric learning, in Proceedings of the 24th international conference on Machine learning, ser. ICML 07. New York, NY, USA: ACM, 2007, pp [Online]. Available: [9] A. Globerson and S. T. Roweis, Visualizing pairwise similarity via semidefinite programming, Journal of Machine Learning Research - Proceedings Track, vol. 2, pp , [10] L. Torresani and K. chih Lee, Large margin component analysis, in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA: MIT Press, 2007, pp [11] G. Niu, B. Dai, M. Yamada, and M. Sugiyama, Informationtheoretic semi-supervised metric learning via entropy regularization, in Proceedings of the 29th international conference on Machine learning, ser. ICML 12, [12] R. Chatpatanasiri, T. Korsrilabutr, P. Tangchanachaianan, and B. Kijsirikul, A new kernelization framework for mahalanobis distance learning algorithms, Neurocomput., vol. 73, no , pp , Jun [13] C. Galleguillos, B. McFee, S. J. Belongie, and G. R. G. Lanckriet, Multi-class object localization by combining local contextual interactions. in CVPR. IEEE, 2010, pp [14] P. 
Jain, B. Kulis, J. V. Davis, and I. S. Dhillon, Metric and kernel learning using a linear transformation, Journal of Machine Learning Research, vol. 13, pp , [15] S. Chopra, R. Hadsell, and Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 05 - Volume 1 - Volume 01, ser. CVPR 05. Washington, DC, USA: IEEE Computer Society, 2005, pp [Online]. Available: [16] D. Kedem, S. Tyree, K. Weinberger, F. Sha, and G. Lanckriet, Nonlinear metric learning, in Proc. NIPS, [17] H. Cevikalp and M. Wilke, Face recognition by using discriminative common vectors, in In Proceedings of the 17th International Conference on Pattern Recognition, vol. 1, August 2004, pp [18] H. Cevikalp, M. Neamtu, M. Wilkes, and A. Barkana, Discriminative common vectors for face recognition, IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. 27(1, [19] R. Salakhutdinov and G. Hinton, Learning a nonlinear embedding by preserving class neighbourhood structure, AISTATS, vol. 11, [20] A. Globerson and S. Roweis, Metric learning by collapsing classes, NIPS, [21] S. Kullback and R. Leibler, On information and sufficiency, Ann. Math. Statist., vol. 22, pp , [22] S. H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, pp , [23] D. Gavin, W. Osward, E. Wahl, and J. Williams, A statistical approach to evaluating distance metrics and analog assignments for pollen records, vol. 60, pp , [24] K. Matusita, Decision rules, based on the distance, for problems of fit, two samples, and estimation, Ann. Math. Statist., vol. 26, pp , [25] K. Pearson, On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Phil. Mag., vol. 50, pp , [26] E. Nadaraya, On estimating regression, Theory of Probability and its Applications, vol. 9, pp , [27] C. M. Bishop, Pattern Recognition and Machine Learning. Secaucus, NJ, USA: Springer-Verlag New York, Inc., [28] B. W. Silverman and P. J. Green, Density Estimation for Statistics and Data Analysis. Chapman and Hall, [29] C. Yang, X. Zhang, and L. Jiao, Self-tuning semi-supervised spectral clustering, in In Proceedings of the 2008 International Conference on Computational Intelligence and Security, [30] L. Zelnik-manor and P. Perona, Self-tuning spectral clustering, in Advances in Neural Information Processing Systems, vol. 17. MIT Press, 2004, pp [31] K. Weinberger and L. Saul, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research, vol. 10, pp , [Online]. Available: [32] A. Globerson and S. Roweis, Metric learning by collapsing classes, in Proc. NIPS, [33] N. Shental, T. Hertz, D. Weinshall, and M. Pavel, Adjustment learning and relevant component analysis, in Proceedings of the 7th European Conference on Computer Vision-Part IV, ser. ECCV 02. London, UK, UK: Springer-Verlag, 2002, pp [Online]. Available: [34] D. Cai, X. He, J. Han, and H.-J. Zhang, Orthogonal laplacianfaces for face recognition, IEEE Transactions on Image Processing, vol. 15, no. 11, pp , 2006.

Yixin Chen is an Associate Professor of Computer Science at Washington University in St. Louis. His research interests include data mining, machine learning, artificial intelligence, optimization, and cyber-physical systems. He received a Ph.D. in Computing Science from the University of Illinois at Urbana-Champaign. He received Best Paper Awards at the AAAI Conference on Artificial Intelligence (2010) and the International Conference on Tools with AI (2005), and a best paper nomination at the ACM KDD Conference (2009). His work on planning has won First Prizes in the International Planning Competitions (2004 & 2006). He has received an Early Career Principal Investigator Award from the Department of Energy (2006) and a Microsoft Research New Faculty Fellowship (2007). He is an Associate Editor for ACM Transactions on Intelligent Systems and Technology and IEEE Transactions on Knowledge and Data Engineering, and serves on the Editorial Board of the Journal of Artificial Intelligence Research.

Yujie He is a Ph.D. student in the Department of Computer Science and Engineering at Washington University in St. Louis. His research interests include data mining and machine learning. He completed his bachelor's degree in bioinformatics and computer science at Shanghai Jiao Tong University. He received a best paper award nomination at the IEEE International Conference on Data Mining.

Yi Mao received her B.S. degree in control technology and instrumentation from Xidian University, Xi'an, P.R. China. She is currently a Ph.D. student in the School of Aerospace Science and Technology, Xidian University. Her research interests include data mining, pattern recognition, signal processing, and machine learning.

Wenlin Chen is a Ph.D. student in the Department of Computer Science and Engineering at Washington University in St. Louis. His research mainly focuses on machine learning and data mining. Before joining WashU, he received his bachelor's degree in computer science from the University of Science and Technology of China (USTC). Wenlin was the runner-up for the Best Student Paper Award at the ACM SIGKDD Conference 2014 and received a best paper award nomination at the IEEE International Conference on Data Mining 2013.
