Fantope Regularization in Metric Learning

CVPR 2014. Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6, UPMC Sorbonne Universités), Matthieu Cord (LIP6, UPMC Sorbonne Universités), Paris, France

Introduction Notations & Related work Metric learning Fantope regularization Metric learning optimization algorithm Experiments Conclusion

Introduction
Metric learning algorithms produce a linear transformation of the data that is optimized to fit semantic relationships between training samples. Different aspects of the learning procedure have recently been investigated: how the dataset is annotated and used in the learning process; design choices for the distance parameterization; extensions to large-scale contexts, etc. Surprisingly, few attempts have been made to derive a proper regularization scheme. Regularization is nevertheless a critical issue in metric learning, as it often limits model complexity and the number of independent parameters to learn, and thus overfitting. Models learned with regularization usually exploit correlations between features better and often have improved predictive accuracy.

Introduction
In this paper, we propose a novel regularization approach for metric learning that explicitly controls the rank of the learned distance matrix. The figure below illustrates the relevance of our approach.
[Figure: examples on PubFig and OSR, each comparing our method with LMNN.]

Notations
- $\mathcal{S}^d$: the set of $d \times d$ real-valued symmetric matrices.
- $\mathcal{S}_+^d$: the set of $d \times d$ real-valued symmetric positive semidefinite (PSD) matrices.
- For matrices $A \in \mathcal{S}^d$ and $B \in \mathcal{S}^d$, the Frobenius inner product is $\langle A, B \rangle = \operatorname{tr}(A^\top B)$.
- $\Pi_{\mathcal{S}_+^d}(A)$ is the orthogonal projection of the matrix $A \in \mathcal{S}^d$ onto the positive semidefinite cone $\mathcal{S}_+^d$.
- For a given $a = (a_1, \dots, a_d)^\top \in \mathbb{R}^d$, $\operatorname{Diag}(a) = A \in \mathcal{S}^d$ is the square diagonal matrix such that $\forall i,\ A_{i,i} = a_i$.
- $\lambda(A)$ is the vector of eigenvalues of the matrix $A$ arranged in non-increasing order; $\lambda(A)_i$ is the $i$-th largest eigenvalue of $A$.
- $x_i \in \mathbb{R}^d$ (resp. $x_j \in \mathbb{R}^d$) is the vector representation of image $p_i$ (resp. $p_j$), and we write $x_{ij} = x_i - x_j$.
- For $x \in \mathbb{R}$, let $[x]_+ = \max(0, x)$.

Related work
We focus in this work on supervised distance metric learning methods, which learn from:
- similar and dissimilar pairs of images;
- triplets of images.
In this paper, we consider the widely used Mahalanobis distance metric $D_M$, which is parameterized by the PSD matrix $M \in \mathcal{S}_+^d$ such that:
$$D_M^2(p_i, p_j) = (x_i - x_j)^\top M (x_i - x_j) = x_{ij}^\top M x_{ij}$$
It can also be rewritten:
$$D_M^2(p_i, p_j) = \langle M, x_{ij} x_{ij}^\top \rangle$$
References: J. V. Davis, et al. ICML, 2007; A. Mignon and F. Jurie. CVPR, 2012; E. Xing, et al. NIPS, 2002; G. Chechik, et al. JMLR, 2010; A. Frome, et al. ICCV, 2007; M. Schultz and T. Joachims. NIPS, 2003; K. Weinberger and L. Saul. JMLR, 2009
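As a quick numerical illustration of the two equivalent forms of the squared Mahalanobis distance above, here is a minimal NumPy sketch. The function names and the random test data are assumptions made for this example only; they do not come from the talk.

```python
import numpy as np

def mahalanobis_sq(M, x_i, x_j):
    """Squared distance written as x_ij^T M x_ij."""
    x_ij = x_i - x_j
    return float(x_ij @ M @ x_ij)

def mahalanobis_sq_inner(M, x_i, x_j):
    """Same quantity written as the Frobenius inner product <M, x_ij x_ij^T>."""
    x_ij = x_i - x_j
    return float(np.sum(M * np.outer(x_ij, x_ij)))

# Hypothetical toy data: a random PSD matrix and two random points.
rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
M = B.T @ B                      # B^T B is always PSD
x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)

d1 = mahalanobis_sq(M, x_i, x_j)
d2 = mahalanobis_sq_inner(M, x_i, x_j)
```

The inner-product form is the one that makes the distance linear in $M$, which is what the loss and subgradient later in the talk rely on.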

Related work
Many approaches prefer working on a specific matrix decomposition, i.e. $M = L^\top L$ where $L \in \mathbb{R}^{e \times d}$ and $d$ is the data dimension:
- the resulting optimization is very fast;
- but the problem is not convex w.r.t. $L$, which leads to local minima.
In addition, an explicit regularization term is rarely introduced in the learning scheme. For instance, this lack of regularization makes LMNN prone to overfitting. To limit this shortcoming, many approaches perform early stopping, which halts an iterative optimization process before convergence. However, this method needs to be carefully tuned for each dataset.
References: T. Mensink, et al. PAMI, 2013; A. Mignon and F. Jurie. CVPR, 2012; K. Weinberger and L. Saul. JMLR, 2009

Related work
Schultz and Joachims use the squared Frobenius norm $\|M\|_F^2$, following the SVM framework, to learn a diagonal PSD distance matrix. The ITML method (Information-Theoretic Metric Learning) uses a LogDet regularizer that constrains the distance matrix to be strictly positive definite. Another powerful way to regularize is to control the rank of $M$. Some methods add the trace $\operatorname{tr}(M)$ as a regularization term, because it is a convex surrogate for $\operatorname{rank}(M)$.
References: M. Schultz and T. Joachims. NIPS, 2003; J. V. Davis, et al. ICML, 2007; D. Lim, et al. ICML, 2013; B. McFee and G. ICML, 2010; C. Shen, et al. NIPS, 2009

Related work
In this paper, we investigate a new optimization scheme with a regularization term that explicitly controls the rank of $M$. Such a scheme makes it possible to avoid overfitting without any trick such as early stopping. The main contributions of this paper are:
1) We introduce a new regularization strategy based on the convex hull of rank-$k$ projection matrices, called the Fantope, which makes it possible to explicitly control the rank of distance matrices.
2) We propose an efficient algorithm to solve the new optimization scheme.
3) Our framework outperforms state-of-the-art metric learning methods on synthetic and challenging real computer vision datasets.

Metric Learning: Fantope regularization
Objective function
A metric learning algorithm aims at determining $M$ such that the metric satisfies most of the constraints defined by the training information. It is generally formulated as an optimization problem of the form:
$$\min_{M} \ \mu R(M) + \ell(M, \mathcal{A})$$
where:
- $\mu \geq 0$ is the regularization parameter;
- $R(M)$ is a regularization term on the parameter $M$;
- $\ell(M, \mathcal{A})$ is a loss function.

Metric Learning: Fantope regularization
Motivation for the proposed regularization
Goal: controlling the rank of the PSD distance matrix $M$.
- A standard way is to use the nuclear norm $\|M\|_*$ as a regularization term. In the case of PSD matrices $M \in \mathcal{S}_+^d$, $\|M\|_* = \operatorname{tr}(M)$. This seeks a rank-0 matrix (i.e. $M = 0$).
- We formulate the regularization term $R(M)$ as the sum of the $k$ smallest eigenvalues of $M \in \mathcal{S}_+^d$:
$$R(M) = \sum_{i=d-k+1}^{d} \lambda(M)_i$$
Expected benefits: limit overfitting and exploit correlations between features.

Metric Learning: Fantope regularization
Motivation for the proposed regularization
$$R(M) = \sum_{i=d-k+1}^{d} \lambda(M)_i$$
Minimizing $R(M)$ will naturally converge to a subspace corresponding to the $(d-k)$ most significant eigenvalues. As the rank of the PSD matrix $M \in \mathcal{S}_+^d$ is the number of its non-zero eigenvalues, and all the eigenvalues of $M \in \mathcal{S}_+^d$ are non-negative, the proposed regularization term $R(M)$ allows explicit control over the rank of $M$:
$$R(M) = 0 \iff \operatorname{rank}(M) \leq d - k$$
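The equivalence between $R(M) = 0$ and $\operatorname{rank}(M) \leq d - k$ can be checked numerically. The sketch below is only an illustration under assumed toy data; `fantope_reg` is a hypothetical helper name, not something defined in the talk.

```python
import numpy as np

def fantope_reg(M, k):
    """R(M): sum of the k smallest eigenvalues of the symmetric matrix M."""
    eigvals = np.linalg.eigvalsh(M)   # eigvalsh returns eigenvalues in ascending order
    return float(eigvals[:k].sum())

d, k = 6, 2
rng = np.random.default_rng(0)

# A PSD matrix of rank at most d - k: its k smallest eigenvalues are zero.
L_low = rng.standard_normal((d - k, d))
M_low = L_low.T @ L_low

# A full-rank PSD matrix: every eigenvalue of M_low is shifted up by 1.
M_full = M_low + np.eye(d)

r_low = fantope_reg(M_low, k)     # numerically zero
r_full = fantope_reg(M_full, k)   # k eigenvalues equal to 1, so about k
```

Note that `numpy.linalg.eigvalsh` sorts eigenvalues in ascending order, the opposite of the non-increasing convention used on the slides, so the $k$ smallest eigenvalues are the first $k$ entries here.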

Metric Learning: Fantope regularization
Explicit rank control regularization
Using Ky Fan's theorem, we can rewrite the sum of the $k$ smallest eigenvalues of any symmetric matrix $M$ as the trace $\operatorname{tr}(WM)$, where $W$ belongs to the convex hull of the set of outer products of orthonormal matrices (rank-$k$ projection matrices). This convex hull is called a Fantope. Our regularization term may be expressed as:
$$R(M) = \operatorname{tr}(WM) = \langle W, M \rangle$$
where the matrix $W \in \mathcal{S}_+^d$ projects the matrix $M$ onto the target $k$-dimensional subspace.
Reference: K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations. Proceedings of the National Academy of Sciences of the United States of America, 1949

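Metric Learning: Fantope regularization
Explicit rank control regularization
A simple way to construct such a matrix $W \in \mathcal{S}_+^d$ is to use the eigendecomposition of $M$, with eigenvalues in non-increasing order:
$$M = V_M \operatorname{Diag}(\lambda(M)) V_M^\top$$
Construct $w = (w_1, \dots, w_d)^\top \in \mathbb{R}^d$ such that:
$$w_i = \begin{cases} 0 & \text{if } 1 \leq i \leq d-k \quad \text{(the first } d-k \text{ elements)} \\ 1 & \text{if } d-k+1 \leq i \leq d \quad \text{(the last } k \text{ elements)} \end{cases}$$
Then express $W$ as:
$$W = V_M \operatorname{Diag}(w) V_M^\top$$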

Metric Learning: Fantope regularization
Explicit rank control regularization
With $M = V_M \operatorname{Diag}(\lambda(M)) V_M^\top$ (eigenvalues in non-increasing order) and $W = V_M \operatorname{Diag}(w) V_M^\top$, where $w$ has zeros in its first $d-k$ entries and ones in its last $k$ entries:
$$R(M) = \operatorname{tr}(WM) = \operatorname{tr}\left(V_M \operatorname{Diag}(w) V_M^\top V_M \operatorname{Diag}(\lambda(M)) V_M^\top\right) = \operatorname{tr}\left(\operatorname{Diag}(w) \operatorname{Diag}(\lambda(M))\right) = w^\top \lambda(M) = \sum_{i=d-k+1}^{d} \lambda(M)_i$$
Hence $R(M) = 0 \iff \operatorname{rank}(M) \leq d - k$.

Metric Learning: Fantope regularization
Explicit rank control regularization
Fantope regularization is a generalization of trace regularization. Indeed, for every matrix $M \in \mathcal{S}_+^d$, $\operatorname{tr}(M) = \operatorname{tr}(I_d M)$. Trace regularization is equivalent to a Fantope regularization where $\operatorname{tr}(WM)$ is the sum of the $d$ smallest eigenvalues of $M$, i.e. $k = d$ and $W = V_M \operatorname{Diag}(\mathbf{1}) V_M^\top = I_d$, with $V_M$ the eigenvector matrix of $M$.
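The construction of $W$ from the eigendecomposition of $M$, and the trace special case $k = d$, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; `fantope_W` and the random matrix are assumptions for the example.

```python
import numpy as np

def fantope_W(M, k):
    """Build W = V Diag(w) V^T, with w selecting the k smallest eigenvalues of M."""
    _, V = np.linalg.eigh(M)      # eigh returns eigenvalues in ascending order
    V = V[:, ::-1]                # reorder eigenvectors to non-increasing eigenvalue order
    d = M.shape[0]
    w = np.zeros(d)
    w[d - k:] = 1.0               # first d-k entries are 0, last k entries are 1
    return V @ np.diag(w) @ V.T

# Hypothetical toy data.
rng = np.random.default_rng(0)
d, k = 6, 2
B = rng.standard_normal((d, d))
M = B.T @ B

W = fantope_W(M, k)
reg_via_W = float(np.trace(W @ M))                      # tr(WM)
reg_via_eigs = float(np.linalg.eigvalsh(M)[:k].sum())   # sum of the k smallest eigenvalues

# With k = d, every eigenvector is selected and W is the identity (trace regularization).
W_full = fantope_W(M, d)
```

By construction $W$ is a rank-$k$ orthogonal projector, so `W @ W` equals `W` up to rounding.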

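Metric Learning: Optimization algorithm
Optimization problem
Constraints: quadruplet-wise constraints. For any quadruplet of images $q = (p_i, p_j, p_k, p_l)$:
$$\forall q \in \mathcal{A}, \quad D_M^2(p_k, p_l) \geq \delta_q + D_M^2(p_i, p_j)$$
where $\delta_q$ is a safety margin. Triplet-wise and pairwise constraints are special cases:
- The triplet-wise constraint $D_M^2(p_i, p_k) \geq 1 + D_M^2(p_i, p_j)$: $q = (p_i, p_j, p_i, p_k)$ and $\delta_q = 1$.
- The pairwise constraint for a dissimilar pair $(p_i, p_j) \in \mathcal{D}$, $D_M^2(p_i, p_j) \geq l$, where $l$ is a minimum value: $q = (p_i, p_i, p_i, p_j)$ and $\delta_q = l$.
- The pairwise constraint for a similar pair $(p_i, p_j) \in \mathcal{S}$, $u \geq D_M^2(p_i, p_j)$, where $u$ is an upper bound: $q = (p_i, p_j, p_i, p_i)$ and $\delta_q = -u$.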
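To make the encoding of triplet and pairwise constraints as quadruplets concrete, here is a small sketch. The `Quadruplet` type, the helper names and the three toy points are all assumptions for illustration. The similar-pair case uses $\delta_q = -u$, which is what the inequality $D_M^2(p_i, p_i) \geq \delta_q + D_M^2(p_i, p_j)$ requires in order to express $u \geq D_M^2(p_i, p_j)$.

```python
from typing import NamedTuple
import numpy as np

class Quadruplet(NamedTuple):
    """Indices (i, j, k, l) and margin delta for D^2(k, l) >= delta + D^2(i, j)."""
    i: int
    j: int
    k: int
    l: int
    delta: float

def from_triplet(i, j, k):
    # D^2(p_i, p_k) >= 1 + D^2(p_i, p_j)
    return Quadruplet(i, j, i, k, 1.0)

def from_dissimilar_pair(i, j, lower):
    # D^2(p_i, p_j) >= lower, since D^2(p_i, p_i) = 0
    return Quadruplet(i, i, i, j, lower)

def from_similar_pair(i, j, upper):
    # upper >= D^2(p_i, p_j)  <=>  D^2(p_i, p_i) >= -upper + D^2(p_i, p_j)
    return Quadruplet(i, j, i, i, -upper)

def is_satisfied(q, X, M):
    """Check one quadruplet constraint for the points in X under the metric M."""
    x_ij = X[q.i] - X[q.j]
    x_kl = X[q.k] - X[q.l]
    return bool(x_kl @ M @ x_kl >= q.delta + x_ij @ M @ x_ij)

# Toy example: three points on a line, Euclidean metric (M = identity).
X = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
M = np.eye(2)

ok_similar = is_satisfied(from_similar_pair(0, 1, 0.5), X, M)        # 0.25 <= 0.5
ok_dissimilar = is_satisfied(from_dissimilar_pair(0, 2, 1.5), X, M)  # 4.0 >= 1.5
ok_triplet = is_satisfied(from_triplet(0, 1, 2), X, M)               # 4.0 >= 1 + 0.25
bad_dissimilar = is_satisfied(from_dissimilar_pair(0, 1, 1.5), X, M) # 0.25 < 1.5
```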

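Metric Learning: Optimization algorithm
Optimization problem
Define a global loss:
$$\ell(M, \mathcal{A}) = \sum_{q \in \mathcal{A}} \ell_M(q)$$
Design the loss for a single quadruplet:
$$\ell_M(q) = \max\left(0, \ \delta_q + \langle M, x_{ij} x_{ij}^\top - x_{kl} x_{kl}^\top \rangle\right)$$
By including our regularization term and $\ell(M, \mathcal{A})$, the optimization problem becomes:
$$\min_{M \in \mathcal{S}_+^d} f_W(M) = \min_{M \in \mathcal{S}_+^d} \ \mu R(M) + \ell(M, \mathcal{A})$$
$$f_W(M) = \mu \langle W, M \rangle + \sum_{q \in \mathcal{A}} \left[\delta_q + \langle M, x_{ij} x_{ij}^\top - x_{kl} x_{kl}^\top \rangle\right]_+$$
where $\mu \geq 0$ is a regularization parameter and $\langle W, M \rangle$ is the sum of the $k$ smallest eigenvalues of $M$.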
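The hinge loss over quadruplets and the regularized objective $f_W(M)$ can be evaluated as follows. This is a sketch on assumed toy data: quadruplets are plain tuples `(i, j, k, l, delta)`, and `W` is fixed by hand to a rank-1 projector rather than recomputed from `M`.

```python
import numpy as np

def quad_loss(M, X, q):
    """Hinge loss max(0, delta_q + <M, x_ij x_ij^T - x_kl x_kl^T>) for one quadruplet."""
    i, j, k, l, delta = q
    x_ij = X[i] - X[j]
    x_kl = X[k] - X[l]
    C = np.outer(x_ij, x_ij) - np.outer(x_kl, x_kl)
    return max(0.0, delta + float(np.sum(M * C)))

def objective(M, W, X, quads, mu):
    """f_W(M) = mu * <W, M> + sum of the quadruplet losses."""
    return mu * float(np.sum(W * M)) + sum(quad_loss(M, X, q) for q in quads)

# Toy data: three points on a line, Euclidean metric.
X = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
M = np.eye(2)
W = np.diag([0.0, 1.0])           # a hand-picked rank-1 projector (k = 1)

quads = [
    (0, 1, 0, 2, 1.0),            # triplet: satisfied, contributes 0
    (0, 0, 0, 1, 1.5),            # dissimilar pair with l = 1.5: violated by 1.25
]
total_loss = sum(quad_loss(M, X, q) for q in quads)
f_value = objective(M, W, X, quads, mu=0.5)
```

Only violated constraints contribute, so the loss here is 1.25 and the objective adds $\mu \langle W, M \rangle = 0.5$ on top of it.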

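Metric Learning: Optimization algorithm
Solving the optimization problem
$$\min_{M \in \mathcal{S}_+^d} f_W(M) = \min_{M \in \mathcal{S}_+^d} \ \mu R(M) + \ell(M, \mathcal{A})$$
$$f_W(M) = \mu \langle W, M \rangle + \sum_{q \in \mathcal{A}} \left[\delta_q + \langle M, x_{ij} x_{ij}^\top - x_{kl} x_{kl}^\top \rangle\right]_+$$
- $f_W(M)$ is not globally convex, but it is convex w.r.t. $M$ when $W$ is fixed.
- With $W$ fixed, the subgradient w.r.t. $M$ is:
$$\nabla_M = \mu W + \sum_{q \in \mathcal{A}^+} \left(x_{ij} x_{ij}^\top - x_{kl} x_{kl}^\top\right)$$
where $\mathcal{A}^+$ is the subset of constraints in $\mathcal{A}$ whose loss is non-zero, and $\mu \geq 0$.
- $W$ is updated by construction, as explained before, so that $\langle W, M \rangle$ is the sum of the $k$ smallest eigenvalues of $M$.
- The process stops when the objective value stops decreasing.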
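The subgradient for a fixed $W$ only sums over the quadruplets whose hinge term is active. A minimal sketch, with the same assumed tuple encoding `(i, j, k, l, delta)` and hand-picked toy values:

```python
import numpy as np

def subgradient(M, W, X, quads, mu):
    """Subgradient of f_W at M: mu * W plus the sum of C_q over violated quadruplets."""
    G = mu * W
    for i, j, k, l, delta in quads:
        x_ij = X[i] - X[j]
        x_kl = X[k] - X[l]
        C = np.outer(x_ij, x_ij) - np.outer(x_kl, x_kl)
        if delta + np.sum(M * C) > 0:     # q belongs to the active set A+
            G = G + C
    return G

# Toy data (hypothetical): one satisfied triplet and one violated dissimilar pair.
X = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
M = np.eye(2)
W = np.diag([0.0, 1.0])
quads = [(0, 1, 0, 2, 1.0), (0, 0, 0, 1, 1.5)]

G = subgradient(M, W, X, quads, mu=0.5)
# Only the second quadruplet is active: its C is -x_01 x_01^T = -diag(0.25, 0).
expected = np.array([[-0.25, 0.0], [0.0, 0.5]])
```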

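Metric Learning: Optimization algorithm
Solving the optimization problem
The global learning scheme is described in Algorithm 1.
[Algorithm 1: listing shown on the slide, annotated with the following expressions.]
- Update of $W$: $W = V_M \operatorname{Diag}(w) V_M^\top$
- Subgradient: $\nabla_M = \mu W + \sum_{q \in \mathcal{A}^+} \left(x_{ij} x_{ij}^\top - x_{kl} x_{kl}^\top\right)$
- Problem solved: $\min_{M \in \mathcal{S}_+^d} f_W(M) = \min_{M \in \mathcal{S}_+^d} \ \mu R(M) + \ell(M, \mathcal{A})$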
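Since the listing of Algorithm 1 is not available in this transcript, the following is only a plausible sketch of such a scheme: a projected subgradient descent that recomputes $W$ from the current $M$ at every iteration. The fixed step size, the fixed iteration count (instead of stopping when the objective stops decreasing), the identity initialization and the random triplets are all assumptions of this example.

```python
import numpy as np

def project_psd(A):
    """Orthogonal projection of a symmetric matrix onto the PSD cone."""
    A = (A + A.T) / 2
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def fantope_W(M, k):
    """Projector onto the eigenvectors of the k smallest eigenvalues of M."""
    _, vecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    U = vecs[:, :k]
    return U @ U.T

def objective_and_subgradient(M, W, C_list, deltas, mu):
    value, grad = mu * float(np.sum(W * M)), mu * W
    for C, delta in zip(C_list, deltas):
        margin = delta + float(np.sum(M * C))
        if margin > 0:                    # violated constraint
            value += margin
            grad = grad + C
    return value, grad

def learn_metric(C_list, deltas, d, k, mu, step=0.01, n_iter=200):
    M = np.eye(d)
    best_M, best_val = M, np.inf
    for _ in range(n_iter):
        W = fantope_W(M, k)               # update W from the current M
        val, G = objective_and_subgradient(M, W, C_list, deltas, mu)
        if val < best_val:                # subgradient steps are not monotone
            best_M, best_val = M, val
        M = project_psd(M - step * G)     # subgradient step, then PSD projection
    return best_M, best_val

# Hypothetical synthetic triplets q = (p_i, p_j, p_i, p_l) with delta_q = 1.
rng = np.random.default_rng(0)
n, d, k, mu = 30, 5, 3, 0.1
X = rng.standard_normal((n, d))
C_list, deltas = [], []
for _ in range(40):
    i, j, l = rng.choice(n, size=3, replace=False)
    x_ij, x_il = X[i] - X[j], X[i] - X[l]
    C_list.append(np.outer(x_ij, x_ij) - np.outer(x_il, x_il))
    deltas.append(1.0)

val0, _ = objective_and_subgradient(np.eye(d), fantope_W(np.eye(d), k), C_list, deltas, mu)
M_hat, best_val = learn_metric(C_list, deltas, d, k, mu)
```

The sketch keeps the best iterate seen so far because a subgradient step does not guarantee a decrease of the objective at every iteration.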

Metric Learning: Optimization algorithm
Efficiency discussion
- An alternative method to solve the optimization problem is to switch the update between $M$ and $W$ only after a full subgradient descent over $M$. This is computationally demanding.
- When the input space dimension $d$ is large, the eigendecomposition required at each iteration of the subgradient descent also becomes computationally expensive.

Metric Learning: Optimization algorithm
Efficiency discussion
We propose an adaptation of the Alternating Direction Method of Multipliers (ADMM) [S. Boyd, et al.] to learn a metric. We adapt the optimization problem in this way:
$$\min_{M \in \mathcal{S}^d, \ Z \in \mathcal{S}^d} f_W(M) + g(Z) \quad \text{subject to } M - Z = 0$$
where
$$g(Z) = \begin{cases} 0 & \text{if } Z \in \mathcal{S}_+^d \\ +\infty & \text{if } Z \notin \mathcal{S}_+^d \end{cases}$$
Introducing a Lagrange multiplier $\Lambda \in \mathcal{S}_+^d$, we obtain the augmented Lagrangian:
$$L_\rho(M, Z, \Lambda) = f_W(M) + g(Z) + \langle \Lambda, M - Z \rangle + \frac{\rho}{2} \|M - Z\|_F^2$$
where $\rho > 0$ is a scaling parameter.
Reference: S. Boyd, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011

Metric Learning: Optimization algorithm
Efficiency discussion
Algorithm 2 finds the optimal $M$ before updating $W$, as previously proposed. However, the approximation and speed-up in Algorithm 2 come from replacing the constraint $M \in \mathcal{S}_+^d$ by the constraint $M \in \mathcal{S}^d$, while $g(Z)$ promotes a PSD solution matrix. Algorithm 2 uses the scaled variable $U = \frac{1}{\rho} \Lambda$.
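The listing of Algorithm 2 is not part of this transcript, so the sketch below follows the generic scaled-form ADMM iterations from Boyd et al. rather than the authors' exact procedure. $W$ is kept fixed throughout; the $M$-update is approximated by a few unconstrained subgradient steps on $f_W(M) + \frac{\rho}{2}\|M - Z + U\|_F^2$, the $Z$-update is the PSD projection induced by $g$, and $U$ accumulates the residual $M - Z$. Step sizes, iteration counts and the toy constraints are assumptions.

```python
import numpy as np

def project_psd(A):
    """Orthogonal projection of a symmetric matrix onto the PSD cone."""
    A = (A + A.T) / 2
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def admm_step(M, Z, U, W, C_list, deltas, mu, rho, step=0.01, inner=10):
    """One scaled-form ADMM iteration for a fixed W."""
    # M-update (approximate): subgradient steps without any PSD constraint on M.
    for _ in range(inner):
        G = mu * W + rho * (M - Z + U)
        for C, delta in zip(C_list, deltas):
            if delta + np.sum(M * C) > 0:
                G = G + C
        M = M - step * G
    # Z-update: minimizing g(Z) + (rho/2)||M - Z + U||_F^2 is a PSD projection.
    Z = project_psd(M + U)
    # Scaled dual update.
    U = U + M - Z
    return M, Z, U

# Hypothetical toy problem: random triplet constraints, W fixed to a rank-k projector.
rng = np.random.default_rng(0)
n, d, k = 20, 4, 2
X = rng.standard_normal((n, d))
C_list, deltas = [], []
for _ in range(15):
    i, j, l = rng.choice(n, size=3, replace=False)
    x_ij, x_il = X[i] - X[j], X[i] - X[l]
    C_list.append(np.outer(x_ij, x_ij) - np.outer(x_il, x_il))
    deltas.append(1.0)

W = np.diag([0.0] * (d - k) + [1.0] * k)
M, Z, U = np.eye(d), np.eye(d), np.zeros((d, d))
for _ in range(50):
    M, Z, U = admm_step(M, Z, U, W, C_list, deltas, mu=0.1, rho=1.0)
```

In this form only `Z` is guaranteed to be PSD at every iteration; `M` is merely symmetric, which is where the saving over projecting `M` at each subgradient step comes from.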

Experiments
- Face verification task
- Image classification with relative attributes

Experiments Face verification task In the face verification task, we are provided with pairs of face images. The goal is to learn a classifier that determines whether image pairs are similar (represent the same person) or dissimilar (represent two different persons).

Face verification: LFW
Experiment setup: dataset and evaluation metric
- Labeled Faces in the Wild (LFW) dataset: more than 13,000 images of faces.
- Restricted paradigm, which only provides two sets of pairs of images: the set $\mathcal{S}$ of similar pairs and the set $\mathcal{D}$ of dissimilar pairs.
- We follow the standard evaluation protocol that uses View 2 data for training and testing and View 1 for validation.

Face verification: LFW
Experiment setup: dataset and evaluation metric
To generate our constraints, we set the upper bound $u = 0.5$ and the lower bound $l = 1.5$. The distance of a test pair is compared to the threshold $\frac{l+u}{2} = 1$ to determine whether the pair is similar or dissimilar. The pairwise constraints are:
- dissimilar pair $(p_i, p_j) \in \mathcal{D}$, $D_M^2(p_i, p_j) \geq l$: $q = (p_i, p_i, p_i, p_j)$ and $\delta_q = l$;
- similar pair $(p_i, p_j) \in \mathcal{S}$, $u \geq D_M^2(p_i, p_j)$: $q = (p_i, p_j, p_i, p_i)$ and $\delta_q = -u$.
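The decision rule at test time can be written in a couple of lines. The sketch assumes that the threshold is applied to the squared distance $D_M^2$ (the quantity bounded by $u$ and $l$ in the constraints) and that values below the threshold mean "similar"; the function name and toy points are hypothetical.

```python
import numpy as np

u, l = 0.5, 1.5
threshold = (l + u) / 2           # midpoint between the two bounds

def is_similar(M, x_i, x_j, threshold=threshold):
    """Declare a pair similar when its squared Mahalanobis distance is below the threshold."""
    x_ij = x_i - x_j
    return bool(x_ij @ M @ x_ij < threshold)

# Toy check with the Euclidean metric.
M = np.eye(2)
near = is_similar(M, np.array([0.0, 0.0]), np.array([0.5, 0.0]))   # 0.25 < 1
far = is_similar(M, np.array([0.0, 0.0]), np.array([2.0, 0.0]))    # 4.0 >= 1
```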

Face verification: LFW
Experiment setup: image representation
- Use the same input features as popular metric learning methods [ITML, LDML, PCCA].
- Use the SIFT descriptors computed by [LDML], available on their website.
Initialization of the distance matrix $M \in \mathcal{S}_+^d$: first compute the matrix $L \in \mathbb{R}^{e \times d}$ that is composed of the coefficients for the $e$ most dominant principal components of the training data. Then $M = L^\top L$.
References:
- [ITML] J. V. Davis, et al. Information-theoretic metric learning. ICML, 2007
- [LDML] M. Guillaumin, et al. Is that you? Metric learning approaches for face identification. ICCV, 2009
- [PCCA] A. Mignon and F. Jurie. PCCA: A new approach for distance learning from sparse pairwise constraints. CVPR, 2012
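A PCA-based initialization of this kind can be sketched with an SVD of the centered training data. Centering the data and taking the principal directions from the right singular vectors are standard PCA choices assumed here; the talk does not specify them. `pca_init` and the random data are hypothetical.

```python
import numpy as np

def pca_init(X, e):
    """Return M = L^T L, where the rows of L are the e leading principal directions of X."""
    Xc = X - X.mean(axis=0)                          # center the training data (assumption)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    L = Vt[:e]                                       # shape (e, d), orthonormal rows
    return L.T @ L

# Hypothetical training data: 50 samples in dimension 8, keep e = 3 components.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
e = 3
M0 = pca_init(X, e)
rank_M0 = int(np.linalg.matrix_rank(M0))
```

With orthonormal rows in $L$, the initial $M$ is a rank-$e$ projector, so the starting point is PSD and already low-rank.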

Face verification: LFW
Results: impact of regularization
We compare the impact of Fantope regularization with that of trace regularization. The table shows classification accuracies (mean and standard error) when solving the optimization problem with both regularization methods. This illustrates the importance of having explicit control over the rank of the distance matrix.

Face verification: LFW
Results: state-of-the-art results
We compare Fantope regularization to other popular metric learning algorithms. The table shows the performance of ITML, LDML and PCCA. Fantope regularization outperforms ITML and LDML and is comparable to PCCA.

Face verification: LFW
Results: impact of early stopping
The table reports the accuracies we obtained on LFW by testing the code of PCCA provided by its authors, as a function of the number of iterations of gradient descent. With an early stopping criterion in our method: 83.5 ± 0.5%. Conclusion: our regularization scheme makes our method much more robust than PCCA to early stopping.

Face verification: LFW
Results: impact of the hyper-parameter $\mu$
[Figure: impact of $\mu$ on LFW. Annotations: 82.3%; 81.2% with $\mu = 0$; expected rank $e = 40$ for high values of $\mu$.]

Experiments: image classification with relative attributes
In the image classification task with attributes, we are provided with images described with attributes. The goal is to assign an image to a predefined class. In particular, we focus on the case where classes are described with attributes. Image $p_i$ is represented by $x_i \in \mathbb{R}^d$, where the $j$-th element of $x_i$ represents the score (degree) of presence of the $j$-th attribute in $x_i$.

Metric learning in attribute space
Experiment setup: datasets
- Outdoor Scene Recognition (OSR), containing 2688 images from 8 scene categories.
- A subset of Public Figure Face (PubFig), containing 771 images from 8 face categories.
We use the image features made publicly available by [Parikh and Grauman. ICCV, 2011]: a 512-dimensional GIST [Oliva and Torralba. IJCV, 2001] descriptor for OSR, and a concatenation of the GIST descriptor and a 45-dimensional Lab color histogram for PubFig.
Reference: D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.

Metric learning in attribute space
Experiment setup: baselines
1) The relative attribute learning approach described in [Parikh and Grauman. ICCV, 2011] uses relative attribute annotations on classes to compute high-level representations of images $x_i \in \mathbb{R}^d$; a Gaussian distribution is then learned for each class.
2) The Large Margin Nearest Neighbor (LMNN) [Weinberger and Saul. JMLR, 2009] is a popular metric learning method used for image classification. The high-level representations $x_i \in \mathbb{R}^d$ are used as input features of the LMNN classifier.
References: D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011; K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 2009

Metric learning in attribute space
Experiment setup
- Integration of regularization: we modify the code of LMNN [Weinberger and Saul. JMLR, 2009] to integrate trace and Fantope regularization. The stopping criterion is the convergence of the algorithm (i.e. the objective function stops decreasing).
- Learning setup: we use the same experimental setup as [Parikh and Grauman. ICCV, 2011], with N = 30 training images; the rest is used for testing.

Metric learning in attribute space
Results
The table reports accuracies of the baselines and of our proposed regularization method on both the OSR and PubFig datasets (slide annotations: 2%, 3%). These results validate the importance of a proper regularization for predictive accuracy.

Metric learning in attribute space
Results
The figure illustrates on some examples how effective our scheme is at learning semantics.
[Figure: examples comparing our method with LMNN.]

Conclusion
We proposed a new regularization scheme for metric learning that explicitly controls the rank of the learned distance matrix. Our method generalizes trace regularization and can be applied to various optimization frameworks to impose a meaningful structure on the learned PSD matrix. We derived an efficient metric learning algorithm that combines the regularization term with a loss function that can incorporate constraints between pairs or triplets of images. We demonstrated that regularization greatly improves recognition on real datasets, showing the relevance of this new regularization for limiting overfitting. Future work includes designing a better ADMM formulation that takes into account the fact that the objective function is not convex.

Thank You!