Metric Learning
16th Feb 2017
Rahul Dey, Anurag Chowdhury
Presentation based on:
- Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data." arXiv 2013
- Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." JMLR 2009
- Parameswaran, Shibin, and Kilian Q. Weinberger. "Large margin multi-task metric learning." NIPS 2010
- Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure." AISTATS 2007
- Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "Learning good edit similarities with generalization guarantees." ECML 2011
Outline
- Introduction
- Supervised Mahalanobis Distance Learning
- Non-Linear Methods
- Metric Learning for Structured Data
- Conclusion
- Software Packages
Introduction
- The goal of metric learning is to adapt some pairwise real-valued metric function, say the Mahalanobis distance d_M(x, x') = √((x − x')ᵀ M (x − x')), to a problem of interest using training data.
- The matrix M ⪰ 0 in the Mahalanobis distance is the metric to be learnt/adapted, subject to constraints of the following forms:
- Must-link / cannot-link constraints (sometimes called positive / negative pairs):
  S = {(x_i, x_j) : x_i and x_j should be similar}
  D = {(x_i, x_j) : x_i and x_j should be dissimilar}
- Relative constraints (sometimes called training triplets):
  R = {(x_i, x_j, x_k) : x_i should be more similar to x_j than to x_k}
Introduction
- A metric learning algorithm aims at finding the parameters of the metric M such that it best agrees with the constraints S, D, R.
- This is typically formulated as an optimization problem of the following general form:
  min_M  ℓ(M, S, D, R) + λ R(M)
  where ℓ(M, S, D, R) is a loss function that penalizes violated constraints, R(M) is some regularizer on the parameters M of the learned metric, and λ ≥ 0 is the regularization parameter.
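To make the general form concrete, here is a minimal numpy sketch (not from the survey; the points, the unit margin, the trace regularizer and λ = 0.1 are illustrative assumptions) that evaluates a hinge loss on one relative constraint plus a regularizer, keeping M positive semidefinite through the parametrization M = LᵀL:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))              # five 3-dimensional points

# Keep M positive semidefinite by construction via M = L^T L.
L = rng.normal(size=(3, 3))
M = L.T @ L

def sq_mahalanobis(x, x_prime, M):
    """Squared Mahalanobis distance (x - x')^T M (x - x')."""
    diff = x - x_prime
    return float(diff @ M @ diff)

# One relative constraint (x_i, x_j, x_k): x_i should be closer to x_j than to x_k.
i, j, k = 0, 1, 2
hinge = max(0.0, 1.0 + sq_mahalanobis(X[i], X[j], M) - sq_mahalanobis(X[i], X[k], M))

lam = 0.1                                # regularization parameter lambda
objective = hinge + lam * np.trace(M)    # l(M, S, D, R) + lambda * R(M), with R(M) = tr(M)
print(objective)
```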
Introduction
Figure sourced from: Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."
Applications
- Metric learning can potentially be beneficial whenever the notion of a metric between instances plays an important role.
- Some of the active research areas where metric learning finds use:
  - Computer Vision: image classification, face recognition
  - Information Retrieval: search engines
  - Bioinformatics: comparing DNA sequences
Key Properties of a Metric Learning Algorithm
Figure sourced from: Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."
Supervised Mahalanobis Distance Learning
Introduction
- Learn a Mahalanobis distance metric d_M(x, x') = √((x − x')ᵀ M (x − x')), with M ∈ S_+^d (the cone of symmetric positive semidefinite d × d matrices), from the data.
Figure sourced from: Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."
Large Margin Nearest Neighbour (LMNN)
- Aimed at improving k-NN: minimizes the k-NN leave-one-out classification error.
- Shares similarities with SVMs (margin-based formulation, hinge loss, convex problem).
- Same-class k nearest neighbours x_j of x_i (target neighbours), defined by
  S = {(x_i, x_j) : y_i = y_j and x_j belongs to the k-neighbourhood of x_i},
  are to be pulled together within a margin.
- Instances x_l of other classes (impostors), defined by
  R = {(x_i, x_j, x_l) : (x_i, x_j) ∈ S, y_i ≠ y_l},
  are to be pushed away from the margin.
LMNN Loss Function
- Pull term (keep target neighbours close): ε_pull(L) = Σ_{i, j⇝i} ‖L(x_i − x_j)‖²
- Push term (keep impostors beyond the margin): ε_push(L) = Σ_{i, j⇝i} Σ_l (1 − y_il) [1 + ‖L(x_i − x_j)‖² − ‖L(x_i − x_l)‖²]₊
- Total loss: ε(L) = (1 − μ) ε_pull(L) + μ ε_push(L), with μ ∈ [0, 1]
  (j⇝i means x_j is a target neighbour of x_i; y_il = 1 if y_i = y_l and 0 otherwise)
Figure sourced from: Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification."
Loss Function Minimization using SDP
- Working with M = LᵀL, the loss can be minimized as a semidefinite program:
  min_{M ⪰ 0, ξ ≥ 0}  (1 − μ) Σ_{(x_i, x_j) ∈ S} d_M²(x_i, x_j) + μ Σ_{i,j,l} ξ_ijl
  s.t.  d_M²(x_i, x_l) − d_M²(x_i, x_j) ≥ 1 − ξ_ijl   ∀(x_i, x_j, x_l) ∈ R
- ξ_ijl ≥ 0: slack variable measuring the large-margin inequality violation.
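A small numpy sketch of the LMNN loss above, for intuition only; the paper solves the semidefinite program over M rather than looping in Python, and the unit margin, μ = 0.5 and the toy data below are assumptions:

```python
import numpy as np

def lmnn_loss(L, X, y, target_neighbors, mu=0.5):
    """LMNN pull/push loss for a linear map L (so that M = L^T L).
    target_neighbors[i] lists the indices of x_i's same-class target neighbours."""
    def sq_dist(a, b):
        d = L @ (X[a] - X[b])
        return float(d @ d)

    pull, push = 0.0, 0.0
    for i in range(len(X)):
        for j in target_neighbors[i]:
            pull += sq_dist(i, j)
            for l in range(len(X)):
                if y[l] != y[i]:                      # impostor candidates
                    push += max(0.0, 1.0 + sq_dist(i, j) - sq_dist(i, l))
    return (1 - mu) * pull + mu * push

# Tiny example: two classes in 2-D, one target neighbour per point, L = identity.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
y = np.array([0, 0, 1, 1])
target_neighbors = [[1], [0], [3], [2]]
print(lmnn_loss(np.eye(2), X, y, target_neighbors))
```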
Extensions to LMNN
- Multi-pass LMNN: iteratively use the transformation matrix L_p from the p-th pass to compute new target neighbours in the (p + 1)-th pass.
- Multi-metric LMNN: learn multiple locally linear transformations instead of a single global linear transformation.
- Kernel version: work with K_ij = Φ(x_i)ᵀ Φ(x_j) instead of the raw inputs.
- Pre-processing with PCA to get better distance estimates.
LMNN: Application
Figure sourced from: Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification."
Multi-Task Metric Learning
Multi-Task Learning
- We assume we are given T different but related tasks.
- Each input (x_i, y_i) belongs to exactly one of the tasks 1, ..., T.
- Learn T classifiers {w_1, ..., w_T}, where each classifier w_t is dedicated to task t.
- Learn a global classifier w_0 that captures the commonality among all the tasks.
- An example x_i from task t is classified by the rule y_i = sign(x_iᵀ (w_0 + w_t)).
Multi-Task Learning
- The joint optimization problem is to minimize the following cost:
  min_{w_0, w_1, ..., w_T}  Σ_{t=0}^{T} γ_t ‖w_t‖² + Σ_{t=1}^{T} Σ_{(x_i, y_i) ∈ task t} [1 − y_i x_iᵀ (w_0 + w_t)]₊
  where [a]₊ = max(0, a).
- The constants γ_t ≥ 0 trade off the regularization of the various tasks.
- If γ_0 → +∞ then w_0 = 0 and all tasks are decoupled.
- If γ_0 is small and γ_{t>0} → +∞, we obtain w_{t>0} = 0 and all tasks share the same decision function with weights w_0.
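A minimal numpy sketch of the joint cost as written above (the equation itself is a reconstruction of the slide's missing formula; the data layout and γ values below are illustrative assumptions):

```python
import numpy as np

def multitask_cost(w0, ws, data_per_task, gammas):
    """Joint hinge-loss cost of the shared classifier w0 and per-task classifiers ws.
    data_per_task[t-1] is a list of (x, y) pairs for task t, with y in {-1, +1};
    gammas[0] weights the regularizer of w0 and gammas[t] that of w_t."""
    cost = gammas[0] * float(np.dot(w0, w0))
    for t, pairs in enumerate(data_per_task, start=1):
        w = w0 + ws[t - 1]
        cost += gammas[t] * float(np.dot(ws[t - 1], ws[t - 1]))
        cost += sum(max(0.0, 1.0 - y * float(np.dot(x, w))) for x, y in pairs)
    return cost

def predict(x, w0, wt):
    """Classification rule for an example x from task t."""
    return np.sign(np.dot(x, w0 + wt))

# Two toy tasks in 2-D.
w0 = np.zeros(2)
ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
data = [[(np.array([1.0, 0.0]), +1)], [(np.array([0.0, -1.0]), -1)]]
print(multitask_cost(w0, ws, data, gammas=[1.0, 0.1, 0.1]))
```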
Multi-Task Large Margin Nearest Neighbor (mt-LMNN)
- The goal is to learn a metric d_t(·, ·) for each of the T tasks that minimizes the k-NN leave-one-out classification error.
- The distance for task t is defined by
  d_t(x_i, x_j) = √((x_i − x_j)ᵀ (M_0 + M_t) (x_i − x_j))
  where M_0 ⪰ 0 is the shared metric and M_1, ..., M_T ⪰ 0 are task-specific metrics.
- To balance the learning between the shared metric M_0 and the individual parameters M_1, ..., M_T, we use the regularization given on the next slide.
Multi-Task Large Margin Nearest Neighbor (mt-LMNN)
- Regularization in multi-task metric learning:
  γ_0 ‖M_0 − I‖²_F + Σ_{t=1}^{T} γ_t ‖M_t‖²_F
  (γ_0 controls how far the shared metric may move from the Euclidean metric; γ_t shrinks the task-specific metrics towards zero)
- Convex optimization problem of LMNN: the single-task SDP shown on the earlier slide.
Multi-Task Large Margin Nearest Neighbor (mt-LMNN)
- Convex optimization problem of mt-LMNN: the LMNN objective is summed over the tasks using the task-specific distances d_t, with the multi-task regularizer added:
  min_{M_0, ..., M_T ⪰ 0, ξ ≥ 0}  γ_0 ‖M_0 − I‖²_F + Σ_{t=1}^{T} γ_t ‖M_t‖²_F + Σ_t [ (1 − μ) Σ_{(x_i, x_j) ∈ S_t} d_t²(x_i, x_j) + μ Σ ξ_ijl ]
  s.t.  d_t²(x_i, x_l) − d_t²(x_i, x_j) ≥ 1 − ξ_ijl for the triplets of task t.
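A numpy sketch of the task-specific distance and the multi-task regularizer described above (illustrative only; the matrices and γ values are assumptions):

```python
import numpy as np

def task_distance(x, x_prime, M0, Mt):
    """d_t(x, x') = sqrt((x - x')^T (M0 + Mt) (x - x'))."""
    diff = x - x_prime
    return float(np.sqrt(diff @ (M0 + Mt) @ diff))

def mt_regularizer(M0, Ms, gammas):
    """gamma_0 * ||M0 - I||_F^2  +  sum_t gamma_t * ||Mt||_F^2."""
    d = M0.shape[0]
    reg = gammas[0] * np.linalg.norm(M0 - np.eye(d), "fro") ** 2
    for t, Mt in enumerate(Ms, start=1):
        reg += gammas[t] * np.linalg.norm(Mt, "fro") ** 2
    return reg

# Toy example with two tasks in 3-D.
M0 = np.eye(3)
Ms = [np.zeros((3, 3)), 0.1 * np.eye(3)]
x, x2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(task_distance(x, x2, M0, Ms[1]), mt_regularizer(M0, Ms, gammas=[1.0, 0.5, 0.5]))
```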
An Application of mt-LMNN
- mt-LMNN can be used in text-dependent speech analysis applications.
- The different tasks are speaker recognition, gender recognition, dialect recognition and emotion recognition.
- The tasks are all different from each other but related, in the sense that they all operate on recordings of one common sample of text.
Non-Linear Metric Learning
Non-Linear Methods
- Most work in supervised metric learning has focused on linear metrics, due to the convenience of deriving and optimizing convex formulations, and because they are less prone to over-fitting.
- But linear metrics are unable to capture nonlinear structure in the data.
- Two possible solutions:
  - Kernelization of linear methods
  - Learning nonlinear forms of metrics
- Both involve a non-linear projection of the data into another space where linear metrics work well (e.g., where the classes become linearly separable).
Non-Linear Neighborhood Component Analysis
- The similarity between two input vectors x_a, x_b ∈ X is given by D[f(x_a | W), f(x_b | W)], where f(x | W) is a function f: X → Y mapping the input vectors in X into a feature space Y, parametrized by W.
- If D is the (squared) Euclidean distance and f(x | W) = Wx, the Euclidean distance in the feature space is the Mahalanobis distance in input space:
  D[f(x_a), f(x_b)] = (x_a − x_b)ᵀ WᵀW (x_a − x_b)
Non-Linear Neighborhood Component Analysis
- Given a set of N labelled training cases (x_a, c_a), a = 1, 2, ..., N, where x_a ∈ R^d and c_a ∈ {1, 2, ..., C}.
- For a given training vector x_a, the probability of selecting b as one of its neighbours is
  p_ab = exp(−d_ab) / Σ_{z ≠ a} exp(−d_az)
- Let d_ab = ‖f(x_a | W) − f(x_b | W)‖² be the Euclidean distance metric in the feature space.
- f(· | W) is a multi-layer neural network; W is the weight vector.
Non-Linear Neighborhood Component Analysis
- The probability that point a belongs to class k depends on the relative proximity of all other data points that belong to class k:
  p(c_a = k) = Σ_{b: c_b = k} p_ab
- The NCA objective is to maximize the expected number of correctly classified points on the training data:
  max_W  Σ_{a=1}^{N} log( Σ_{b: c_b = c_a} p_ab )
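A short numpy sketch of p_ab and this objective (illustrative; the embeddings F are taken as given here, whereas in nonlinear NCA they would be the outputs of the network f(· | W) and the gradient would be backpropagated through it):

```python
import numpy as np

def nca_objective(F, labels):
    """F: (N, p) array of feature-space embeddings f(x_a | W); labels: (N,) class ids.
    Returns sum_a log( sum_{b: c_b = c_a} p_ab ) with p_ab = softmax(-d_ab) over b != a."""
    d = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d, np.inf)                          # exclude b = a from the softmax
    P = np.exp(-d)
    P /= P.sum(axis=1, keepdims=True)                    # p_ab
    same = labels[:, None] == labels[None, :]
    return float(np.log((P * same).sum(axis=1)).sum())

rng = np.random.default_rng(0)
F = rng.normal(size=(10, 2))                             # toy 2-D embeddings
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(nca_objective(F, labels))
```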
Non-Linear Neighborhood Component Analysis
Figure sourced from: Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure."
Applications of NNCA
- For any fixed distance metric D, any feature extraction technique can be thought of as learning a similarity metric.
- Simple classification tasks such as MNIST hand-written digit recognition can be solved with NNCA.
- Even more complicated tasks, such as face recognition and object detection, are potential application areas for NNCA.
Structured Data Metric Learning
Introduction
- Distance between structured data: Strings / Graphs
- Edit Distance (Levenshtein):
  - x, x′ ∈ Σ* are two strings made of symbols from an alphabet Σ.
  - Edit script: a sequence of insertions, deletions and substitutions that transforms x into x′.
  - Can be computed in O(|x| · |x′|) time using dynamic programming.
  - Cost matrix C of size (|Σ| + 1) × (|Σ| + 1), where C_ij is the cost of substituting Σ_i with Σ_j (the extra row/column accounts for insertions and deletions).
  - The edit distance is the cost of the cheapest edit script.
- Example: the Levenshtein distance between "kitten" and "sitting" is 3:
  1. kitten → sitten (substitution of s for k)
  2. sitten → sittin (substitution of i for e)
  3. sittin → sitting (insertion of g at the end)
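For reference, a standard dynamic-programming implementation of the unit-cost Levenshtein distance:

```python
def levenshtein(x, y):
    """Unit-cost edit distance between strings x and y via dynamic programming."""
    m, n = len(x), len(y)
    # D[i][j] = edit distance between the first i symbols of x and the first j of y.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                              # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j                              # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + sub) # substitution / match
    return D[m][n]

assert levenshtein("kitten", "sitting") == 3     # matches the example above
```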
Good Similarity Functions
- Balcan et al. introduced the concept of (ε, γ, τ)-good similarity functions.
- A similarity function K is (ε, γ, τ)-good if a (1 − ε) proportion of examples are on average 2γ more similar to reasonable examples of the same class than to reasonable examples of the opposite class, where a τ proportion of examples must be reasonable.
- Such a K can be used to build a linear separator in an explicit projection space (obtained by mapping each example to its similarities to the reasonable examples) that has margin γ and error close to ε.
- A linear classifier α in this space takes the form h(x) = sign( Σ_j α_j K(x, x′_j) ), with the weights α_j learned over a set of landmark examples x′_j.
Good Edit Similarity Learning (GESL)
- #(x, x′) is a (|Σ| + 1) × (|Σ| + 1) matrix such that #_{i,j}(x, x′) is the number of times edit operation (i, j) is used to turn x into x′ (in the Levenshtein script).
- Edit function: e_C(x, x′) = Σ_{i,j} C_{i,j} · #_{i,j}(x, x′)
- Similarity function: K_C(x, x′) = 2 exp(−e_C(x, x′)) − 1, which lies in [−1, 1]
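A toy numpy illustration of e_C and K_C; the two-letter alphabet, the hand-written count matrix #(x, x′) and the cost values in C are made-up assumptions (in practice #(x, x′) comes from backtracking the Levenshtein alignment and C is what GESL learns):

```python
import numpy as np

# Toy alphabet {a, b}; index 0 stands for the empty symbol $, so matrices are 3 x 3.
# Hypothetical count matrix #(x, x') for some pair: two a->a matches,
# one a->b substitution and one insertion of b.
counts = np.array([[0., 0., 1.],     # row $:  $->$, $->a, $->b  (insertions)
                   [0., 2., 1.],     # row a:  a->$, a->a, a->b  (deletion, match, substitution)
                   [0., 0., 0.]])    # row b:  b->$, b->a, b->b

# Edit-cost matrix C: this is what GESL learns; values here are arbitrary.
C = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])

e_C = float(np.sum(C * counts))      # edit function e_C(x, x')
K_C = 2.0 * np.exp(-e_C) - 1.0       # similarity function, in [-1, 1]
print(e_C, K_C)
```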
GESL Continued
- T = {z_i = (x_i, l_i)}_{i=1}^{N_T} is a set of N_T training samples.
- S_L = {z′_j = (x′_j, l′_j)}_{j=1}^{N_L} is a set of N_L landmark examples.
- Optimization criterion: learn the cost matrix C by minimizing a regularized hinge loss over all training/landmark pairs,
  min_{C ≥ 0}  (1 / (N_T N_L)) Σ_{i=1}^{N_T} Σ_{j=1}^{N_L} V(C, z_i, z′_j) + β ‖C‖²_F
  with V(C, z, z′) = [B_1 − e_C(x, x′)]₊ if l ≠ l′ and V(C, z, z′) = [e_C(x, x′) − B_2]₊ if l = l′, where B_1 ≥ B_2 ≥ 0 encode the desired margin.
- Alternatively, the loss can be read as a soft version of the constraints e_C(x, x′) ≥ B_1 for pairs with different labels and e_C(x, x′) ≤ B_2 for pairs with the same label.
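A short sketch of the pairwise hinge loss as reconstructed above (the threshold values B_1 and B_2 are placeholders):

```python
def gesl_pair_loss(e_C_value, label, label_prime, B1=1.0, B2=0.5):
    """Hinge loss on one (training example, landmark) pair:
    pairs with different labels should have edit cost >= B1,
    pairs with the same label should have edit cost <= B2."""
    if label != label_prime:
        return max(0.0, B1 - e_C_value)
    return max(0.0, e_C_value - B2)
```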
GESL: Salient Features
- Can be optimized using stochastic gradient descent.
- Can be generalized to tree edit distance learning.
- Takes advantage of both positive and negative pairs.
- Converges fast and leads to more accurate and sparser classifiers.
- Applications: natural language processing (spelling correction, etc.), DNA sequence matching, etc.
Metric Learning: Conclusions
- Metric learning for numerical data has reached a good level of maturity, with improvements in scalability, accuracy and generalization.
- Much less work has been done on metric learning for structured data. However, recent advances such as GESL are a step towards better theoretical understanding, scalability and flexibility.
- Exploring ways of modelling multimodal similarity, i.e. capturing the different ways in which two instances are similar or dissimilar (similarity due to different features), the degree of similarity, and the reasons for it, would bring learned metrics closer to our own notions of similarity.
References
- Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data." arXiv 2013
- Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." JMLR 2009
- Parameswaran, Shibin, and Kilian Q. Weinberger. "Large margin multi-task metric learning." NIPS 2010
- Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure." AISTATS 2007
- Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "Learning good edit similarities with generalization guarantees." ECML 2011
Software Links
- Metric Learning Toolkit:
  https://github.com/all-umass/metric-learn
  http://shogun-toolbox.org/static/notebook/current/lmnn.html
- GESL:
  http://mloss.org/software/view/552/
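A minimal usage sketch of the metric-learn package linked above; the constructor argument for the number of target neighbours has changed across versions (k in older releases, n_neighbors in newer ones), so adjust to the installed version:

```python
from metric_learn import LMNN                      # pip install metric-learn
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Learn a Mahalanobis metric M = L^T L with LMNN (the neighbours parameter
# may be named k instead of n_neighbors in older metric-learn releases).
lmnn = LMNN(n_neighbors=3)
lmnn.fit(X, y)
X_embedded = lmnn.transform(X)                     # data mapped by the learned L

# k-NN classification in the learned space.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_embedded, y)
print(knn.score(X_embedded, y))
```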
THANK YOU