Metric Learning. 16th Feb 2017. Rahul Dey, Anurag Chowdhury


Presentation based on:
- Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data." arXiv, 2013.
- Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." JMLR, 2009.
- Parameswaran, Shibin, and Kilian Q. Weinberger. "Large margin multi-task metric learning." NIPS, 2010.
- Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure." AISTATS, 2007.
- Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "Learning good edit similarities with generalization guarantees." ECML, 2011.

Outline
- Introduction
- Supervised Mahalanobis Distance Learning
- Non-Linear Methods
- Metric Learning for Structured Data
- Conclusion
- Software Packages

Introduction
The goal of metric learning is to adapt some pairwise real-valued metric function, say the Mahalanobis distance d_M(x, x') = sqrt((x − x')^T M (x − x')), to a problem of interest using training data. The positive semidefinite matrix M in the Mahalanobis distance is the metric to be learnt/adapted, subject to the following kinds of constraints:
- Must-link / cannot-link constraints (sometimes called positive / negative pairs): S = {(x_i, x_j) : x_i and x_j should be similar}, D = {(x_i, x_j) : x_i and x_j should be dissimilar}
- Relative constraints (sometimes called training triplets): R = {(x_i, x_j, x_k) : x_i should be more similar to x_j than to x_k}

Introduction
A metric learning algorithm aims at finding the parameters of the metric M such that it best agrees with the constraints S, D, R. This is typically formulated as an optimization problem of the following general form:

min_M  l(M, S, D, R) + λ R(M)

where l(M, S, D, R) is a loss function that penalizes violated constraints, R(M) is some regularizer on the parameters M of the learned metric, and λ ≥ 0 is the regularization parameter.
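To make the general form concrete, here is a minimal toy sketch in Python/NumPy. It is not any specific published algorithm: the loss simply pulls must-link pairs in S together and pushes cannot-link pairs in D apart, the regularizer is the squared Frobenius norm, and a projection step keeps M positive semidefinite. Function names and hyper-parameters (lam, lr, epochs) are illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(M, x, y):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    d = x - y
    return d @ M @ d

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone so M stays a valid metric."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

def learn_metric(S, D, dim, lam=0.1, lr=0.01, epochs=100):
    """Toy instance of  min_M  loss(M, S, D) + lam * ||M||_F^2,  with M PSD."""
    M = np.eye(dim)
    for _ in range(epochs):
        grad = 2 * lam * M                      # gradient of the regularizer
        for x, y in S:                          # must-link pairs: pull together
            d = (x - y)[:, None]
            grad += d @ d.T
        for x, y in D:                          # cannot-link pairs: push apart
            d = (x - y)[:, None]
            grad -= d @ d.T
        M = project_psd(M - lr * grad)          # gradient step + PSD projection
    return M
```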

Introduction
Figure sourced from: Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."

Applications
Metric learning can potentially be beneficial whenever the notion of a metric between instances plays an important role. Some of the active research areas where metric learning finds use:
- Computer Vision: image classification, face recognition
- Information Retrieval: search engines
- Bioinformatics: comparing sequences of DNA

Key Properties of a Metric Learning Algorithm
Figure sourced from: Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."

Supervised Mahalanobis Distance Learning

Introduction
Learn a Mahalanobis distance metric M ∈ S_+^d (the cone of d × d symmetric positive semidefinite matrices) from the data.
Figure sourced from: Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."

Large Margin Nearest Neighbour (LMNN)
- Aimed at improving kNN: minimizes the kNN leave-one-out classification error
- Similarity with SVM
- The k nearest neighbours x_j of the same class as x_i (target neighbours) are to be pulled together within a margin
- Instances x_l of other classes (impostors) are to be pushed away beyond the margin

LMNN Loss Function
The loss combines a pull term on target neighbours and a push term on impostors, weighted by μ ∈ [0, 1].
Figure sourced from: Weinberger, K. Q., & Saul, L. K. "Distance Metric Learning for Large Margin Nearest Neighbour Classification."

Loss Function Minimization using SDP
The loss can be minimized as a semidefinite program:

minimize over positive semidefinite M:
(1 − μ) Σ_{i, j⇝i} (x_i − x_j)^T M (x_i − x_j) + μ Σ_{i, j⇝i, l: y_l ≠ y_i} ξ_ijl
subject to  (x_i − x_l)^T M (x_i − x_l) − (x_i − x_j)^T M (x_i − x_j) ≥ 1 − ξ_ijl,   ξ_ijl ≥ 0

Here j⇝i denotes that x_j is a target neighbour of x_i, and ξ_ijl ≥ 0 are slack variables measuring the violation of the large-margin inequality.
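As a hedged illustration (not the solver from the paper), the sketch below simply evaluates this LMNN-style objective for a fixed PSD matrix M; target_neighbors is assumed to be a precomputed list of same-class neighbour indices for each point.

```python
import numpy as np

def lmnn_objective(M, X, y, target_neighbors, mu=0.5):
    """Pull target neighbours close; push impostors beyond a unit margin.
    The hinge term plays the role of the slack xi_ijl in the SDP."""
    def d2(i, j):
        diff = X[i] - X[j]
        return diff @ M @ diff

    pull, push = 0.0, 0.0
    for i in range(len(X)):
        for j in target_neighbors[i]:              # same-class target neighbours
            pull += d2(i, j)
            for l in range(len(X)):
                if y[l] != y[i]:                   # impostors from other classes
                    push += max(0.0, 1.0 + d2(i, j) - d2(i, l))
    return (1 - mu) * pull + mu * push
```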

Extensions to LMNN
- Multi-pass LMNN: iteratively use the transformation matrix L_p from the p-th pass to compute new target neighbours in the (p + 1)-th pass
- Multi-metric LMNN: learn multiple locally linear transformations instead of a single global linear transformation
- Kernel version: K_ij = Φ(x_i)^T Φ(x_j)
- Pre-processing with PCA to get better distance estimates

LMNN: Application
Figure sourced from: Weinberger, K. Q., & Saul, L. K. "Distance Metric Learning for Large Margin Nearest Neighbour Classification."

Multi-Task Metric Learning

Multi-Task Learning
- We assume that we are given T different but related tasks
- Each input (x_i, y_i) belongs to exactly one of the tasks 1, ..., T
- Learn T classifiers {w_1, ..., w_T}, where each classifier w_t is specifically dedicated to task t
- Learn a global classifier w_0 that captures the commonality among all the tasks
- An example x_i ∈ T_t is classified by the rule y_i = sign(x_i^T (w_0 + w_t))

Multi-Task Learning
The joint optimization problem is to minimize the following cost:

min over w_0, w_1, ..., w_T of  γ_0 ||w_0||^2 + Σ_{t=1}^T γ_t ||w_t||^2 + Σ_{t=1}^T Σ_{(x_i, y_i) ∈ T_t} [1 − y_i x_i^T (w_0 + w_t)]_+

where a_+ = max(0, a). The constants γ_t ≥ 0 trade off the regularization of the various tasks:
- If γ_0 → +∞ then w_0 = 0 and all tasks are decoupled
- If γ_0 is small and γ_{t>0} → +∞, we obtain w_{t>0} = 0 and all tasks share the same decision function with weights w_0
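A minimal NumPy sketch of this joint cost, under the reconstruction above: X, y and task hold the pooled examples with their task indices, and gamma0/gammas are the trade-off constants (all names are illustrative assumptions).

```python
import numpy as np

def multitask_cost(w0, w, X, y, task, gamma0, gammas):
    """Shared w0 plus task-specific w_t, hinge loss on y_i = sign(x_i^T (w0 + w_t))."""
    reg = gamma0 * (w0 @ w0) + sum(g * (wt @ wt) for g, wt in zip(gammas, w))
    hinge = sum(max(0.0, 1.0 - y_i * (x_i @ (w0 + w[t])))   # a_+ = max(0, a)
                for x_i, y_i, t in zip(X, y, task))
    return reg + hinge
```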

Multi-Task Large Margin Nearest Neighbor (mt-LMNN)
The goal is to learn a metric d_t(·, ·) for each of the T tasks that minimizes the kNN leave-one-out classification error. The distance for task t is defined by

d_t(x_i, x_j) = sqrt( (x_i − x_j)^T (M_0 + M_t) (x_i − x_j) )

where M_0 is the shared metric and M_1, ..., M_T (all positive semidefinite) are task-specific metrics. To balance the learning between the shared metric M_0 and the individual metrics M_1, ..., M_T, we use the regularization given below.

Multi-Task Large Margin Nearest Neighbor (mt-LMNN)
Regularization in multi-task metric learning, encouraging the shared metric to stay close to the identity and the task-specific metrics to stay small:

γ_0 ||M_0 − I||_F^2 + Σ_{t=1}^T γ_t ||M_t||_F^2

This regularizer is combined with the convex optimization problem of LMNN given earlier.

Multi-Task Large Margin Nearest Neighbor (mt-LMNN)
Convex optimization problem of mt-LMNN: the LMNN objective and large-margin constraints, written with the task-specific distances d_t, plus the multi-task regularizer above.
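A small sketch of the task-specific distance and the regularizer, assuming the reconstructed forms above (shared M0 plus a task metric Mt):

```python
import numpy as np

def task_distance_sq(M0, Mt, x_i, x_j):
    """Squared mt-LMNN distance for task t: (x_i - x_j)^T (M0 + Mt) (x_i - x_j)."""
    diff = x_i - x_j
    return diff @ (M0 + Mt) @ diff

def mt_regularizer(M0, Ms, gamma0, gammas):
    """gamma_0 * ||M0 - I||_F^2 + sum_t gamma_t * ||M_t||_F^2."""
    d = M0.shape[0]
    reg = gamma0 * np.linalg.norm(M0 - np.eye(d), 'fro') ** 2
    return reg + sum(g * np.linalg.norm(Mt, 'fro') ** 2 for g, Mt in zip(gammas, Ms))
```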

An Application of mt-LMNN
mt-LMNN can be used in text-dependent speech analysis applications. The different tasks are speaker recognition, gender recognition, dialect recognition and emotion recognition. The tasks are all different from each other, but related in the sense that they all operate on recordings of one common sample of text being read.

Non-Linear Metric Learning

Non-Linear Methods
- Most work in supervised metric learning has focused on linear metrics, due to the convenience of deriving and optimizing convex formulations, and because linear metrics are less prone to over-fitting
- But linear metrics are unable to capture nonlinear structure in the data
- Two possible solutions: kernelization of linear methods, and learning nonlinear forms of metrics
- Both approaches involve a non-linear projection of the data into another space where the data is (closer to) linearly separable, so that linear metrics work well

Non-Linear Neighborhood Component Analysis
The similarity between two input vectors x_a, x_b ∈ X is given by D[f(x_a | W), f(x_b | W)], where f(x | W) is a function f: X → Y mapping the input vectors in X into a feature space Y, parametrized by W.
If D is the Euclidean distance and f(x | W) = Wx, the Euclidean distance in the feature space is the Mahalanobis distance in the input space:

D[f(x_a), f(x_b)] = (x_a − x_b)^T W^T W (x_a − x_b)

Non-Linear Neighborhood Component Analysis
Given a set of N labelled training cases (x_a, c_a), a = 1, 2, ..., N, where x_a ∈ R^d and c_a ∈ {1, 2, ..., C}.
Let d_ab = ||f(x_a | W) − f(x_b | W)||^2 be the Euclidean distance in the feature space, where f(· | W) is a multi-layer neural network and W is the weight vector.
For a given training vector x_a, the probability of selecting b as one of its neighbours is

p_ab = exp(−d_ab) / Σ_{z ≠ a} exp(−d_az)

Non-Linear Neighborhood Component Analysis
The probability that point a belongs to class k depends on the relative proximity of all other data points that belong to class k:

p(c_a = k) = Σ_{b : c_b = k} p_ab

The NCA objective is to maximize the expected number of correctly classified points on the training data:

max_W  Σ_{a=1}^N log( Σ_{b : c_b = c_a} p_ab )
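A minimal NumPy sketch of this objective, assuming F holds the embedded points f(x_a | W) as rows (the network f and its weights W are left out for brevity):

```python
import numpy as np

def nca_objective(F, labels):
    """Sum over a of log( sum_{b: c_b = c_a} p_ab ), to be maximized w.r.t. W."""
    labels = np.asarray(labels)
    d2 = np.square(F[:, None, :] - F[None, :, :]).sum(-1)   # d_ab = ||f_a - f_b||^2
    np.fill_diagonal(d2, np.inf)                             # exclude b = a
    p = np.exp(-d2)
    p /= p.sum(axis=1, keepdims=True)                        # p_ab
    same = labels[:, None] == labels[None, :]                # b with c_b = c_a
    return np.log((p * same).sum(axis=1) + 1e-12).sum()
```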

Non-Linear Neighborhood Component Analysis
Figure sourced from: Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure."

Applications of NNCA
- For any fixed distance metric D, any feature extraction technique can be thought of as learning a similarity metric
- The simple classification task of MNIST hand-written digits can be solved by NNCA
- Even complicated tasks such as face recognition and object detection can be seen as potential application areas for NNCA

Structured Data Metric Learning

Introduction
Distance between structured data (strings, graphs): Edit Distance (Levenshtein)
- x, x' ∈ Σ* are two strings made of symbols from an alphabet Σ
- Edit script: a sequence of insertions, deletions and substitutions transforming x into x'
- Can be computed in O(|x| · |x'|) time using dynamic programming
- Cost matrix C of size (|Σ| + 1) × (|Σ| + 1), where C_ij is the cost of substituting symbol Σ_i with Σ_j
- The edit distance is the cost of the cheapest edit script
Example: the Levenshtein distance between "kitten" and "sitting" is 3
1. kitten → sitten (substitution of "s" for "k")
2. sitten → sittin (substitution of "i" for "e")
3. sittin → sitting (insertion of "g" at the end)
A small dynamic-programming implementation is sketched below.
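A self-contained sketch of the unit-cost Levenshtein distance (all insertions, deletions and substitutions cost 1; a learned cost matrix C would replace these unit costs):

```python
def levenshtein(x, y):
    """Edit distance via dynamic programming, O(|x| * |y|) time."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                                  # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j                                  # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,           # deletion
                          D[i][j - 1] + 1,           # insertion
                          D[i - 1][j - 1] + sub)     # substitution / match
    return D[m][n]

assert levenshtein("kitten", "sitting") == 3         # matches the example above
```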

Good Similarity Functions
- Balcan et al. introduced the concept of (ε, γ, τ)-good similarity functions
- A similarity function K is (ε, γ, τ)-good if a 1 − ε proportion of examples are on average 2γ more similar to reasonable examples of the same class than to reasonable examples of the opposite class, where at least a τ proportion of examples must be reasonable
- K can then be used to build a linear separator in an explicit projection space (the space of similarities to a set of reasonable "landmark" examples) that has margin γ and error close to ε
- The linear classifier α is learned in this projection space (see the sketch below)
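A hedged sketch of the explicit projection space: each example is mapped to its vector of similarities to a set of landmark examples, and a linear classifier alpha is then trained on these features. The landmark set and the classifier step are assumptions for illustration, not the exact construction from the paper.

```python
import numpy as np

def similarity_map(K, x, landmarks):
    """phi(x) = (K(x, x'_1), ..., K(x, x'_L)) for landmark examples x'_1..x'_L."""
    return np.array([K(x, xp) for xp in landmarks])

# A linear classifier alpha on phi(x) then predicts sign(alpha @ phi(x)).
```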

Good Edit Similarity Learning (GESL)
#(x, x') is a matrix of size (|Σ| + 1) × (|Σ| + 1) such that #_{i,j}(x, x') is the number of times edit operation (i, j) is used to turn x into x'.
Edit function: e_C(x, x') = Σ_{i,j} C_ij · #_{i,j}(x, x')
Similarity function: K_C(x, x') = 2 exp(−e_C(x, x')) − 1
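A small sketch of these two functions, assuming the matrix of edit-operation counts #(x, x') has already been computed (e.g. from the optimal Levenshtein script above):

```python
import numpy as np

def edit_function(C, counts):
    """e_C(x, x') = sum_ij C_ij * #_ij(x, x')  (element-wise product, then sum)."""
    return float(np.sum(C * counts))

def edit_similarity(C, counts):
    """K_C(x, x') = 2 * exp(-e_C(x, x')) - 1, mapping similarities into [-1, 1]."""
    return 2.0 * np.exp(-edit_function(C, counts)) - 1.0
```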

GESL Continued
T = {z_i = (x_i, l_i)}_{i=1}^{N_T} is a set of N_T training samples
S_L = {z_j = (x_j, l_j)}_{j=1}^{N_L} is a set of N_L landmark examples
Optimization criterion: a convex, regularized loss over training/landmark pairs that encourages e_C to be small for pairs with the same label and large for pairs with different labels; an alternative, equivalent formulation is also given in the paper.

GESL: Salient Features
- Can be optimized using stochastic gradient descent
- Can be generalized to tree edit distance learning
- Takes advantage of both positive and negative samples
- Has fast convergence and leads to more accurate and sparser classifiers
Applications:
- Natural Language Processing (spelling correction, etc.)
- DNA sequence matching, etc.

Metric Learning: Conclusions
- Metric learning for numerical data has reached a good level of maturity, with improvements in terms of scalability, accuracy and generalization
- Much less work has been done in the field of metric learning for structured data; however, recent advances such as GESL are a step towards better theoretical understanding, scalability and flexibility
- Exploring ways of modelling multimodal similarity, which can capture the different ways in which two instances are similar or dissimilar (similarity due to different features), the degree of similarity, as well as the reasons for similarity, would bring learned metrics closer to our own notions of similarity

References
- Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data." arXiv, 2013.
- Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." JMLR, 2009.
- Parameswaran, Shibin, and Kilian Q. Weinberger. "Large margin multi-task metric learning." NIPS, 2010.
- Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure." AISTATS, 2007.
- Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "Learning good edit similarities with generalization guarantees." ECML, 2011.

Software Links
Metric learning toolkits:
- https://github.com/all-umass/metric-learn
- http://shogun-toolbox.org/static/notebook/current/lmnn.html
GESL:
- http://mloss.org/software/view/552/

THANK YOU