Fantope Regularization in Metric Learning

Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France

Outline: Introduction · Notations & Related work · Metric learning: Fantope regularization · Metric learning: optimization algorithm · Experiments · Conclusion

Introduction Metric learning algorithms produce a linear transformation of data which is optimized to fit semantic relationships between training samples. Different aspects of the learning procedure have recently been investigated: how the dataset is annotated and used in the learning process; design choices for the distance parameterization; extensions to the large-scale context, etc. Surprisingly, few attempts have been made to derive a proper regularization scheme. Regularization is however a critical issue in metric learning, as it limits model complexity (the number of independent parameters to learn) and thus overfitting. Models learned with regularization usually better exploit correlations between features and often have improved predictive accuracy.

Introduction In this paper, we propose a novel regularization approach for metric learning that explicitly controls the rank of the learned distance matrix. The figure below illustrates the relevance of our approach. [Figure: qualitative comparison of our method against LMNN on the PubFig and OSR datasets]

Outline: Introduction · Notations & Related work · Metric learning: Fantope regularization · Metric learning: optimization algorithm · Experiments · Conclusion

Notations
$\mathbb{S}^d$: set of $d \times d$ real-valued symmetric matrices.
$\mathbb{S}^d_+$: set of $d \times d$ real-valued symmetric positive semidefinite (PSD) matrices.
For matrices $A \in \mathbb{S}^d$ and $B \in \mathbb{S}^d$, we denote the Frobenius inner product by $\langle A, B \rangle = \mathrm{tr}(A^T B)$.
$\Pi_{\mathbb{S}^d_+}(A)$ is the orthogonal projection of the matrix $A \in \mathbb{S}^d$ onto the positive semidefinite cone $\mathbb{S}^d_+$.
For a given $a = (a_1, \dots, a_d)^T \in \mathbb{R}^d$, $\mathrm{Diag}(a) = A \in \mathbb{S}^d$ is the square diagonal matrix such that $A_{i,i} = a_i$ for all $i$.
$\lambda(A)$ is the vector of eigenvalues of matrix $A$ arranged in non-increasing order; $\lambda(A)_i$ is the $i$-th largest eigenvalue of $A$.
$x_i \in \mathbb{R}^d$ (resp. $x_j \in \mathbb{R}^d$) is the vector representation of image $p_i$ (resp. $p_j$); we write $x_{ij} = x_i - x_j$.
For $x \in \mathbb{R}$, let $[x]_+ = \max(0, x)$.

Related work We focus in this work on supervised distance metric learning methods, which exploit either similar and dissimilar pairs of images or triplets of images. In this paper, we consider the widely used Mahalanobis distance metric $D_M$, parameterized by the PSD matrix $M \in \mathbb{S}^d_+$ such that:
$D_M^2(p_i, p_j) = (x_i - x_j)^T M (x_i - x_j) = x_{ij}^T M x_{ij}$
It can also be rewritten as:
$D_M^2(p_i, p_j) = \langle M, x_{ij} x_{ij}^T \rangle$
J. V. Davis, et al. ICML, 2007; A. Mignon and F. Jurie. CVPR, 2012; E. Xing, et al. NIPS, 2002; G. Chechik, et al. JMLR, 2010; A. Frome, et al. ICCV, 2007; M. Schultz and T. Joachims. NIPS, 2003; K. Weinberger and L. Saul. JMLR, 2009
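
As a quick illustration of these two equivalent forms, the snippet below (a minimal NumPy sketch, not the authors' code) checks numerically that $x_{ij}^T M x_{ij}$ and $\langle M, x_{ij} x_{ij}^T \rangle$ coincide for a PSD matrix $M = L^T L$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

# Any matrix of the form M = L^T L is PSD and parameterizes a valid Mahalanobis metric.
L = rng.normal(size=(3, d))
M = L.T @ L

x_ij = x_i - x_j
dist_quadratic = x_ij @ M @ x_ij                       # x_ij^T M x_ij
dist_frobenius = np.trace(M.T @ np.outer(x_ij, x_ij))  # <M, x_ij x_ij^T> = tr(M^T A)

assert np.isclose(dist_quadratic, dist_frobenius)
```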

Related work Many approaches prefer working on a specific matrix decomposition, i.e. $M = L^T L$ where $L \in \mathbb{R}^{e \times d}$ and $d$ is the data dimension. The resulting optimization is very fast, but it is not convex w.r.t. $L$ and suffers from local minima. In addition, an explicit regularization term is rarely introduced in the learning scheme. For instance, that lack of regularization makes LMNN prone to overfitting. To limit this shortcoming, many approaches perform early stopping, which stops an iterative optimization process before convergence. However, this method needs to be carefully tuned for each dataset.
T. Mensink, et al. PAMI, 2013; A. Mignon and F. Jurie. CVPR, 2012; K. Weinberger and L. Saul. JMLR, 2009

Related work Schultz and Joachims use the squared Frobenius norm $\|M\|_F^2$, following the SVM framework, to learn a diagonal PSD distance matrix. The ITML method (Information-Theoretic Metric Learning) uses a LogDet regularizer that constrains the distance matrix to be strictly positive definite. Another powerful way to regularize is to control the rank of $M$: some methods add the trace $\mathrm{tr}(M)$ as a regularization term, because it is a convex surrogate for $\mathrm{rank}(M)$.
M. Schultz and T. Joachims. NIPS, 2003; J. V. Davis, et al. ICML, 2007; D. Lim, et al. ICML, 2013; B. McFee and G. Lanckriet. ICML, 2010; C. Shen, et al. NIPS, 2009

Related work In this paper, we investigate a new optimization scheme with a regularization term that explicitly controls the rank of $M$. Such a scheme avoids overfitting without any trick such as early stopping. The main contributions of this paper are: 1) we introduce a new regularization strategy based on the convex hull of rank-k projection matrices, called a Fantope, which allows explicit control of the rank of distance matrices; 2) we propose an efficient algorithm to solve the new optimization scheme; 3) our framework outperforms state-of-the-art metric learning methods on synthetic and challenging real computer vision datasets.

Metric Learning Fantope regularization Objective function
A metric learning algorithm aims at determining $M$ such that the metric satisfies most of the constraints defined by the training information. It is generally formulated as an optimization problem of the form:
$\min_M \ \mu R(M) + \ell(M, \mathcal{A})$
where $\mu \ge 0$ is the regularization parameter, $R(M)$ is a regularization term on the parameter $M$, and $\ell(M, \mathcal{A})$ is a loss function.

Metric Learning Fantope regularization Motivation for the proposed regularization
We want to control the rank of the PSD distance matrix $M$. A standard way is to use the nuclear norm $\|M\|_*$ as a regularization term; in the case of PSD matrices $M \in \mathbb{S}^d_+$, $\|M\|_* = \mathrm{tr}(M)$. However, minimizing it seeks a rank-0 matrix (i.e. $M = 0$). We instead formulate the regularization term $R(M)$ as the sum of the $k$ smallest eigenvalues of $M \in \mathbb{S}^d_+$:
$R(M) = \sum_{i=d-k+1}^{d} \lambda(M)_i$
This limits overfitting and exploits correlations between features.

Metric Learning Fantope regularization Motivation for the proposed regularization
$R(M) = \sum_{i=d-k+1}^{d} \lambda(M)_i$
Minimizing $R(M)$ naturally converges to a subspace corresponding to the $(d - k)$ most significant eigenvalues. As the rank of a PSD matrix $M \in \mathbb{S}^d_+$ is the number of its non-zero eigenvalues, and all the eigenvalues of $M \in \mathbb{S}^d_+$ are non-negative, the proposed regularization term $R(M)$ allows explicit control over the rank of $M$:
$R(M) = 0 \iff \mathrm{rank}(M) \le d - k$
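
The rank-control property is easy to verify numerically. The following sketch (plain NumPy, assuming nothing beyond the definition above) computes $R(M)$ as the sum of the $k$ smallest eigenvalues and shows that it vanishes exactly when $\mathrm{rank}(M) \le d - k$:

```python
import numpy as np

def fantope_regularizer(M, k):
    """R(M): sum of the k smallest eigenvalues of the symmetric matrix M."""
    eigvals = np.linalg.eigvalsh(M)   # ascending order for symmetric matrices
    return eigvals[:k].sum()

rng = np.random.default_rng(0)
d, r = 6, 3
L = rng.normal(size=(r, d))
M = L.T @ L                           # PSD matrix of rank r = 3

print(fantope_regularizer(M, k=d - r))      # ~0, since rank(M) <= d - k
print(fantope_regularizer(M, k=d - r + 1))  # > 0, since rank(M) > d - k
```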

Metric Learning Fantope regularization Explicit rank control regularization
Using Ky Fan's theorem, we can write the sum of the $k$ smallest eigenvalues of any symmetric matrix $M$ as the trace $\mathrm{tr}(WM)$, where $W$ lies in the convex hull of the set of outer products of orthonormal matrices (rank-k projection matrices). This convex hull is called a Fantope. Our regularization term may be expressed as:
$R(M) = \mathrm{tr}(WM) = \langle W, M \rangle$
where the matrix $W \in \mathbb{S}^d_+$ projects the matrix $M$ onto the target $k$-dimensional subspace.
K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations. Proceedings of the National Academy of Sciences of the United States of America, 1949

Metric Learning Fantope regularization Explicit rank control regularization
A simple way to construct such a matrix $W \in \mathbb{S}^d_+$ is to use the eigendecomposition of $M$:
$M = V_M \mathrm{Diag}(\lambda(M)) V_M^T$ (eigenvalues in non-increasing order)
Construct $w = (w_1, \dots, w_d)^T \in \mathbb{R}^d$ such that $w_i = 0$ if $1 \le i \le d-k$ (the first $d-k$ elements) and $w_i = 1$ if $d-k+1 \le i \le d$ (the last $k$ elements), then express $W$ as:
$W = V_M \mathrm{Diag}(w) V_M^T$
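
A minimal NumPy sketch of this construction (not the authors' code; note that `eigh` returns eigenvalues in ascending order, so the $k$ smallest come first rather than last):

```python
import numpy as np

def fantope_W(M, k):
    """Build W = V_M Diag(w) V_M^T selecting the eigenvectors of the k smallest eigenvalues."""
    _, V = np.linalg.eigh(M)          # columns of V: eigenvectors, ascending eigenvalue order
    w = np.zeros(M.shape[0])
    w[:k] = 1.0                       # weight 1 on the k smallest eigenvalues, 0 elsewhere
    return V @ np.diag(w) @ V.T

rng = np.random.default_rng(1)
d, k = 6, 2
A = rng.normal(size=(d, d))
M = A @ A.T                           # PSD test matrix

W = fantope_W(M, k)
assert np.isclose(np.trace(W @ M), np.linalg.eigvalsh(M)[:k].sum())  # <W, M> = R(M)
```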

Metric Learning Fantope regularization Explicit rank control regularization
With $M = V_M \mathrm{Diag}(\lambda(M)) V_M^T$ (eigenvalues in non-increasing order) and $W = V_M \mathrm{Diag}(w) V_M^T$ constructed as above:
$R(M) = \mathrm{tr}(WM) = \mathrm{tr}\big(V_M \mathrm{Diag}(w) V_M^T V_M \mathrm{Diag}(\lambda(M)) V_M^T\big) = \mathrm{tr}\big(\mathrm{Diag}(w) \mathrm{Diag}(\lambda(M))\big) = w^T \lambda(M) = \sum_{i=d-k+1}^{d} \lambda(M)_i$
and $R(M) = 0$ iff $\mathrm{rank}(M) \le d - k$.

Metric Learning Fantope regularization Explicit rank control regularization
Fantope regularization is a generalization of trace regularization. Indeed, for every matrix $M \in \mathbb{S}^d_+$, $\mathrm{tr}(M) = \mathrm{tr}(I_d M)$. Trace regularization is therefore equivalent to a Fantope regularization where $\mathrm{tr}(WM)$ is the sum of the $d$ smallest eigenvalues of $M$, i.e. $W = V_M \mathrm{Diag}(\mathbf{1}) V_M^T = I_d$.
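
A two-line numerical check of this special case (NumPy sketch): with $W = I_d$, the term $\langle W, M \rangle$ is just $\mathrm{tr}(M)$, i.e. the sum of all $d$ eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.normal(size=(d, d))
M = A @ A.T                                      # PSD

assert np.isclose(np.trace(np.eye(d) @ M),       # <I_d, M>
                  np.linalg.eigvalsh(M).sum())   # sum of all eigenvalues = tr(M)
```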

Metric Learning Optimization algorithm Optimization problem
Constraints: quadruplet-wise constraints. For any quadruplet of images $q = (p_i, p_j, p_k, p_l)$:
$\forall q \in \mathcal{A}, \quad D_M^2(p_k, p_l) \ge \delta_q + D_M^2(p_i, p_j)$
where $\delta_q$ is a safety margin.
The triplet-wise constraint $D_M^2(p_i, p_k) \ge 1 + D_M^2(p_i, p_j)$ is obtained with $q = (p_i, p_j, p_i, p_k)$ and $\delta_q = 1$.
The pairwise constraint for a dissimilar pair $(p_i, p_j) \in \mathcal{D}$, $D_M^2(p_i, p_j) \ge l$ (a minimum value), is obtained with $q = (p_i, p_i, p_i, p_j)$ and $\delta_q = l$.
The pairwise constraint for a similar pair $(p_i, p_j) \in \mathcal{S}$, $u \ge D_M^2(p_i, p_j)$ (an upper bound), is obtained with $q = (p_i, p_j, p_i, p_i)$ and $\delta_q = -u$.

Metric Learning Optimization algorithm Optimization problem
Constraints: quadruplet-wise constraints. Using $D_M^2(p_i, p_j) = \langle M, x_{ij} x_{ij}^T \rangle$, the quadruplet-wise constraints for $q = (p_i, p_j, p_k, p_l)$ can be rewritten:
$\forall q \in \mathcal{A}, \quad \langle M, x_{kl} x_{kl}^T - x_{ij} x_{ij}^T \rangle \ge \delta_q$
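
In this form each quadruplet reduces to a constant matrix $C_q = x_{kl} x_{kl}^T - x_{ij} x_{ij}^T$ and a margin $\delta_q$, which is convenient for implementation. A hedged NumPy sketch (the helper names are ours, not the paper's):

```python
import numpy as np

def constraint_matrix(x_i, x_j, x_k, x_l):
    """C_q = x_kl x_kl^T - x_ij x_ij^T for the quadruplet q = (p_i, p_j, p_k, p_l)."""
    x_ij, x_kl = x_i - x_j, x_k - x_l
    return np.outer(x_kl, x_kl) - np.outer(x_ij, x_ij)

def satisfies(M, C_q, delta_q):
    """Check <M, C_q> >= delta_q, i.e. D_M^2(p_k, p_l) >= delta_q + D_M^2(p_i, p_j)."""
    return np.trace(M @ C_q) >= delta_q

rng = np.random.default_rng(3)
d = 4
x_i, x_j, x_k, x_l = rng.normal(size=(4, d))
M = np.eye(d)                                    # e.g. the Euclidean metric

print(satisfies(M, constraint_matrix(x_i, x_j, x_k, x_l), delta_q=1.0))
```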

Metric Learning Optimization algorithm Optimization problem
Optimization. Define a global loss $\ell(M, \mathcal{A}) = \sum_{q \in \mathcal{A}} \ell_M(q)$ and design the loss for a single quadruplet as:
$\ell_M(q) = \max\big(0, \ \delta_q + \langle M, x_{ij} x_{ij}^T - x_{kl} x_{kl}^T \rangle\big)$
By including our regularization term and $\ell(M, \mathcal{A})$, the optimization problem becomes:
$\min_{M \in \mathbb{S}^d_+} f_W(M) = \min_M \ \mu R(M) + \ell(M, \mathcal{A})$
with
$f_W(M) = \mu \langle W, M \rangle + \sum_{q \in \mathcal{A}} \big[\delta_q + \langle M, x_{ij} x_{ij}^T - x_{kl} x_{kl}^T \rangle\big]_+$
where $\mu \ge 0$ is a regularization parameter and $\langle W, M \rangle$ is the sum of the $k$ smallest eigenvalues of $M$.

Metric Learning Optimization algorithm Solving the optimization problem
$\min_{M \in \mathbb{S}^d_+} f_W(M) = \min_M \ \mu R(M) + \ell(M, \mathcal{A})$
$f_W(M) = \mu \langle W, M \rangle + \sum_{q \in \mathcal{A}} \big[\delta_q + \langle M, x_{ij} x_{ij}^T - x_{kl} x_{kl}^T \rangle\big]_+$
$f_W(M)$ is not globally convex, but it is convex w.r.t. $M$ when $W$ is fixed. The subgradient w.r.t. $M$ is:
$\nabla_M = \mu W + \sum_{q \in \mathcal{A}^+} \big(x_{ij} x_{ij}^T - x_{kl} x_{kl}^T\big)$
where $\mathcal{A}^+ \subseteq \mathcal{A}$ is the subset of constraints with a positive hinge loss, and $\mu \ge 0$. $W$ is updated by construction, as explained before, so that $\langle W, M \rangle$ is the sum of the $k$ smallest eigenvalues of $M$. The process stops when the objective value stops decreasing.

Metric Learning Optimization algorithm Solving the optimization problem
The global learning scheme is described in Algorithm 1: with $W$ fixed, minimize $f_W(M)$ over $M \in \mathbb{S}^d_+$ by subgradient descent using $\nabla_M = \mu W + \sum_{q \in \mathcal{A}^+}(x_{ij} x_{ij}^T - x_{kl} x_{kl}^T)$, then update $W = V_M \mathrm{Diag}(w) V_M^T$ from the eigendecomposition of the current $M$, and repeat until the objective $\min_{M \in \mathbb{S}^d_+} f_W(M) = \min_M \ \mu R(M) + \ell(M, \mathcal{A})$ stops decreasing.
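
The sketch below is our reading of this alternating scheme (simplified NumPy, not the authors' released Algorithm 1): constraints are given as pairs $(C_q, \delta_q)$ as above, the inner loop is projected subgradient descent on $M$ with $W$ fixed, and the outer loop rebuilds $W$ from the eigenvectors of the current $M$ until the objective stops decreasing. Step sizes and iteration counts are arbitrary placeholders.

```python
import numpy as np

def psd_projection(A):
    """Orthogonal projection of a symmetric matrix onto the PSD cone."""
    vals, V = np.linalg.eigh(A)
    return V @ np.diag(np.maximum(vals, 0.0)) @ V.T

def fantope_W(M, k):
    _, V = np.linalg.eigh(M)                    # ascending eigenvalue order
    w = np.zeros(M.shape[0]); w[:k] = 1.0       # select the k smallest eigenvalues
    return V @ np.diag(w) @ V.T

def objective(M, W, constraints, mu):
    hinge = sum(max(0.0, d_q - np.trace(M @ C_q)) for C_q, d_q in constraints)
    return mu * np.trace(W @ M) + hinge

def learn_metric(constraints, d, k, mu=0.1, lr=1e-3, outer=20, inner=100):
    M = np.eye(d)
    W = fantope_W(M, k)
    best = objective(M, W, constraints, mu)
    for _ in range(outer):
        for _ in range(inner):                  # subgradient descent on M, W fixed
            grad = mu * W
            for C_q, d_q in constraints:
                if d_q - np.trace(M @ C_q) > 0: # active constraint, i.e. q in A^+
                    grad = grad - C_q           # subgradient of the hinge term
            M = psd_projection(M - lr * grad)   # keep M in the PSD cone
        W = fantope_W(M, k)                     # update W from the current M
        val = objective(M, W, constraints, mu)
        if val >= best:                         # stop when the objective no longer decreases
            break
        best = val
    return M
```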

Metric Learning Optimization algorithm Efficiency discussion
An alternative method to solve the optimization problem is to switch the update between $M$ and $W$ only after a full subgradient descent over $M$, which is computationally demanding. When the input space dimension $d$ is large, the eigendecomposition required at each iteration of the subgradient descent (for the projection onto the PSD cone) also becomes computationally expensive.

Metric Learning Optimization algorithm Efficiency discussion
We propose an adaptation of the Alternating Direction Method of Multipliers (ADMM) [S. Boyd, et al.] to learn the metric. We adapt the optimization problem as:
$\min_{M \in \mathbb{S}^d, Z \in \mathbb{S}^d} \ f_W(M) + g(Z)$
where $g(Z) = 0$ if $Z \in \mathbb{S}^d_+$ and $g(Z) = +\infty$ otherwise.
Introducing a Lagrange multiplier matrix $\Lambda$, we obtain the augmented Lagrangian:
$L_\rho(M, Z, \Lambda) = f_W(M) + g(Z) + \langle \Lambda, M - Z \rangle + \frac{\rho}{2} \|M - Z\|_F^2$
where $\rho > 0$ is a scaling parameter.
S. Boyd, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011

Metric Learning Optimization algorithm Efficiency discussion
Algorithm 2 finds the optimal $M$ before updating $W$, as previously proposed. However, the approximation and speed-up in Algorithm 2 come from replacing the constraint $M \in \mathbb{S}^d_+$ with the constraint $M \in \mathbb{S}^d$, while $g(Z)$ promotes a PSD solution matrix. The scaled dual variable is $U = \frac{1}{\rho} \Lambda$.
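
A schematic version of the resulting ADMM iteration with the scaled dual variable $U = \Lambda / \rho$ (our simplified reading, with $W$ held fixed and the $M$-step approximated by a few subgradient steps; not the authors' Algorithm 2):

```python
import numpy as np

def psd_projection(A):
    vals, V = np.linalg.eigh(A)
    return V @ np.diag(np.maximum(vals, 0.0)) @ V.T

def admm_metric(constraints, W, d, mu=0.1, rho=1.0, lr=1e-3, iters=50, m_steps=20):
    M = np.eye(d); Z = M.copy(); U = np.zeros((d, d))
    for _ in range(iters):
        for _ in range(m_steps):                 # M-step: minimize f_W(M) + (rho/2)||M - Z + U||_F^2
            grad = mu * W + rho * (M - Z + U)
            for C_q, d_q in constraints:
                if d_q - np.trace(M @ C_q) > 0:  # active hinge term
                    grad = grad - C_q
            M_new = M - lr * grad
            M = 0.5 * (M_new + M_new.T)          # keep M symmetric (M in S^d), no PSD projection here
        Z = psd_projection(M + U)                # Z-step: g(Z) enforces the PSD constraint
        U = U + M - Z                            # scaled dual update
    return Z                                     # PSD estimate of the learned metric
```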

Experiments Face verification task Image classification with relative attributes

Experiments Face verification task In the face verification task, we are provided with pairs of face images. The goal is to learn a classifier that determines whether image pairs are similar (represent the same person) or dissimilar (represent two different persons).

Face verification: LFW Experiment setup Dataset and evaluation metric
Labeled Faces in the Wild (LFW) dataset: more than 13,000 images of faces. We use the restricted paradigm, which only provides two sets of image pairs: a set $\mathcal{S}$ of similar pairs and a set $\mathcal{D}$ of dissimilar pairs. We follow the standard evaluation protocol that uses View 2 data for training and testing and View 1 for validation.

Face verification: LFW Experiment setup Dataset and evaluation metric
To generate our constraints, we set the upper bound $u = 0.5$ and the lower bound $l = 1.5$. The pairwise constraints are: for a dissimilar pair $(p_i, p_j) \in \mathcal{D}$, $D_M^2(p_i, p_j) \ge l$ (with $q = (p_i, p_i, p_i, p_j)$ and $\delta_q = l$); for a similar pair $(p_i, p_j) \in \mathcal{S}$, $D_M^2(p_i, p_j) \le u$ (with $q = (p_i, p_j, p_i, p_i)$ and $\delta_q = -u$). The distance of a test pair is compared to the threshold $\frac{l+u}{2} = 1$ to determine whether the pair is similar or dissimilar.
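
The resulting decision rule for a test pair is a simple threshold on the learned distance (a sketch using the slide's values $u = 0.5$, $l = 1.5$):

```python
import numpy as np

def is_same_person(M, x_i, x_j, l=1.5, u=0.5):
    """Declare a test pair similar if D_M^2(p_i, p_j) <= (l + u) / 2 = 1."""
    x_ij = x_i - x_j
    return x_ij @ M @ x_ij <= (l + u) / 2.0

# Example with the Euclidean metric (M = identity) on random descriptors.
rng = np.random.default_rng(5)
x_a, x_b = rng.normal(size=(2, 8))
print(is_same_person(np.eye(8), x_a, x_b))
```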

Face verification: LFW Experiment setup Image representation
We use the same input features as popular metric learning methods [ITML, LDML, PCCA], namely the SIFT descriptors computed by [LDML] and available on their website.
J. V. Davis, et al. Information-theoretic metric learning. ICML, 2007; M. Guillaumin, et al. Is that you? Metric learning approaches for face identification. ICCV, 2009; A. Mignon and F. Jurie. PCCA: A new approach for distance learning from sparse pairwise constraints. CVPR, 2012
Initialization of the distance matrix $M \in \mathbb{S}^d_+$: first compute the matrix $L \in \mathbb{R}^{e \times d}$ composed of the coefficients of the $e$ most dominant principal components of the training data, then set $M = L^T L$.
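
A possible NumPy sketch of this PCA-based initialization (our interpretation; whether the principal directions are additionally scaled is not specified on the slide):

```python
import numpy as np

def pca_init(X, e):
    """X: (n, d) training matrix. Returns M = L^T L with L the top-e principal directions."""
    Xc = X - X.mean(axis=0)                     # center the training data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    L = Vt[:e]                                  # L in R^{e x d}: e most dominant principal components
    return L.T @ L                              # rank-e PSD initialization of M

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
M0 = pca_init(X, e=5)
print(np.linalg.matrix_rank(M0))                # 5
```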

Face verification: LFW Results Impact of regularization
We compare the impact of Fantope regularization against trace regularization. The table shows classification accuracies (mean and standard error) when solving the optimization problem with both regularization methods. This illustrates the importance of having explicit control over the rank of the distance matrix.

Face verification: LFW Results State-of-the-art results
We compare Fantope regularization to other popular metric learning algorithms. The table shows the performances of ITML, LDML and PCCA. Fantope regularization outperforms ITML and LDML and is comparable to PCCA.

Face verification: LFW Results Impact of early stopping
The table reports the accuracies we obtained on LFW by testing the code of PCCA provided by its authors, as a function of the number of iterations of gradient descent. Using the early stopping criterion in our method yields 83.5 ± 0.5%. Conclusion: our regularization scheme makes our method much more robust to early stopping than PCCA.

Face verification: LFW Results Impact of the hyper-parameter μ
[Figure: accuracy as a function of μ: 82.3% vs. 81.2% with μ = 0; the learned matrix approaches the expected rank e = 40 for high values of μ]

Experiments Image classification with relative attributes In the image classification task with attributes, we are provided with images described with attributes. The goal is to assign an image to a predefined class. In particular, we focus on the case where classes are described with attributes. Each image $p_i$ is represented by $x_i \in \mathbb{R}^d$, whose $j$-th element is the score (degree) of presence of the $j$-th attribute in the image.

Metric learning in attribute space Experiment setup Datasets
We use two datasets: Outdoor Scene Recognition (OSR), containing 2688 images from 8 scene categories, and a subset of Public Figure Face (PubFig), containing 771 images from 8 face categories. We use the image features made publicly available by [Parikh and Grauman. ICCV, 2011]: a 512-dimensional GIST [Oliva and Torralba. IJCV, 2001] descriptor for OSR, and a concatenation of the GIST descriptor and a 45-dimensional Lab color histogram for PubFig.
D. Parikh and K. Grauman. Relative attributes. ICCV, 2011

Metric learning in attribute space Experiment setup Baselines
1) The relative attribute learning problem described in [Parikh and Grauman. ICCV, 2011] uses relative attribute annotations on classes to compute high-level representations of images $x_i \in \mathbb{R}^d$; a Gaussian distribution is then learned for each class. 2) The Large Margin Nearest Neighbor (LMNN) method [Weinberger and Saul. JMLR, 2009] is a popular metric learning method used for image classification. The high-level representations $x_i \in \mathbb{R}^d$ are used as input features of the LMNN classifier.
D. Parikh and K. Grauman. Relative attributes. ICCV, 2011; K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 2009

Metric learning in attribute space Experiment setup Integration of regularization
We modify the code of LMNN [Weinberger and Saul. JMLR, 2009] to integrate trace and Fantope regularization; the stopping criterion is the convergence of the algorithm (i.e. the objective function stops decreasing). Learning setup: we use the same experimental setup as [Parikh and Grauman. ICCV, 2011], with N = 30 training images and the rest used for testing.

Metric learning in attribute space Results The table reports the accuracies of the baselines and of our proposed regularization method on both the OSR and PubFig datasets, with gains of about 2% and 3%. These results validate the importance of a proper regularization for predictive accuracy.

Metric learning in attribute space Results The figure illustrates on some examples how our scheme effectively learns semantics. [Figures: qualitative examples comparing our method and LMNN]

Conclusion We proposed a new regularization scheme for metric learning that explicitly controls the rank of the learned distance matrix. Our method generalizes trace regularization and can be applied to various optimization frameworks to impose a meaningful structure on the learned PSD matrix. We derived an efficient metric learning algorithm that combines the regularization term with a loss function that can incorporate constraints between pairs or triplets of images. We demonstrated that regularization greatly improves recognition on real datasets, showing the relevance of this new regularization for limiting overfitting. Future work includes designing a better ADMM formulation that takes into account the fact that the objective function is not globally convex.

Thank You!