Supplemental Material for Discrete Graph Hashing

Wei Liu (IBM T. J. Watson Research Center) weiliu@us.ibm.com
Cun Mu (Columbia University) cm52@columbia.edu
Sanjiv Kumar (Google Research) sanjivk@google.com
Shih-Fu Chang (Columbia University) sfchang@ee.columbia.edu

1 Proofs

Lemma 1. If $\{B^{(j)}\}$ is the sequence of iterates produced by Algorithm 1, then $f(B^{(j+1)}) \ge f(B^{(j)})$ holds for any integer $j \ge 0$, and both $\{f(B^{(j)})\}$ and $\{B^{(j)}\}$ converge.

Proof. Since the B-subproblem in Eq. (5) of the main paper maximizes a continuous function over a compact (i.e., closed and bounded) set, its optimal objective function value $f^\star$ is finite. As $B^{(j)}$ is feasible for each $j$, the sequence $\{f(B^{(j)})\}$ is bounded from above. By definition, $\hat{f}_j(B^{(j)}) = f(B^{(j)})$. Since $B^{(j+1)} \in \arg\max_{B \in \{\pm 1\}^{n\times r}} \hat{f}_j(B)$, we have $\hat{f}_j(B^{(j+1)}) \ge \hat{f}_j(B^{(j)})$. Also, as $A$ is positive semidefinite, $f(B)$ is a convex function; therefore $f(B) \ge \hat{f}_j(B)$ for any $B \in \{\pm 1\}^{n\times r}$, which implies $f(B^{(j+1)}) \ge \hat{f}_j(B^{(j+1)})$. Putting all of the above together, we have
$$
f(B^{(j+1)}) \;\ge\; \hat{f}_j(B^{(j+1)}) \;\ge\; \hat{f}_j(B^{(j)}) \;=\; f(B^{(j)}), \qquad \forall\, j \in \mathbb{Z},\ j \ge 0.
$$
Together with the fact that $\{f(B^{(j)})\}$ is bounded from above, the monotonically non-decreasing sequence $\{f(B^{(j)})\}$ converges.

The convergence of $\{B^{(j)}\}$ is established from the following three key facts. First, if $B^{(j_0+1)} = B^{(j_0)}$, then due to the update in Eq. (7) of the main paper, we have $B^{(j)} = B^{(j_0)}$ for any integer $j \ge j_0 + 1$. Second, if $B^{(j+1)} \neq B^{(j)}$, then the entries of the gradient $\nabla f(B^{(j)})$ are not all zeros, and there exists at least one entry $(i,k)$ ($1 \le i \le n$, $1 \le k \le r$) such that $\nabla_{i,k} f(B^{(j)}) \neq 0$ and $B^{(j+1)}_{i,k} \neq B^{(j)}_{i,k}$. Due to the update in Eq. (7), at such an entry $(i,k)$ we have $B^{(j+1)}_{i,k} = \operatorname{sgn}\!\big(\nabla_{i,k} f(B^{(j)})\big)$, which implies $\nabla_{i,k} f(B^{(j)})\,\big(B^{(j+1)}_{i,k} - B^{(j)}_{i,k}\big) > 0$. Then we can derive
$$
\begin{aligned}
\hat{f}_j(B^{(j+1)}) - \hat{f}_j(B^{(j)}) &= \big\langle \nabla f(B^{(j)}),\, B^{(j+1)} - B^{(j)} \big\rangle \\
&= \sum_{(i,k):\, B^{(j+1)}_{i,k} \neq B^{(j)}_{i,k}} \nabla_{i,k} f(B^{(j)})\,\big(B^{(j+1)}_{i,k} - B^{(j)}_{i,k}\big) \;+\; \sum_{(i,k):\, B^{(j+1)}_{i,k} = B^{(j)}_{i,k}} \nabla_{i,k} f(B^{(j)})\,\big(B^{(j+1)}_{i,k} - B^{(j)}_{i,k}\big) \\
&= \sum_{(i,k):\, B^{(j+1)}_{i,k} \neq B^{(j)}_{i,k}} \nabla_{i,k} f(B^{(j)})\,\big(B^{(j+1)}_{i,k} - B^{(j)}_{i,k}\big) \;+\; 0 \;>\; 0.
\end{aligned}
$$
As such, once $B^{(j+1)} \neq B^{(j)}$ we have
$$
f(B^{(j+1)}) \;\ge\; \hat{f}_j(B^{(j+1)}) \;>\; \hat{f}_j(B^{(j)}) \;=\; f(B^{(j)}),
$$
which causes a strict increase from $f(B^{(j)})$ to $f(B^{(j+1)})$. Third, the cost sequence $\{f(B^{(j)})\}$ can take only finitely many values because the feasible set $\{\pm 1\}^{n\times r}$ contains finitely many points. Taking the second and third facts together, there must exist an integer $j_0$ such that $B^{(j_0+1)} = B^{(j_0)}$, for otherwise infinitely many distinct values would appear in $\{f(B^{(j)})\}$. Incorporating the first fact, it follows that there exists an integer $j_0$ such that $B^{(j)} = B^{(j_0)}$ for any integer $j \ge j_0 + 1$, which immediately indicates the convergence of $\{B^{(j)}\}$.

Lemma 2. $Y^\star = \sqrt{n}\,[U\ \bar{U}][V\ \bar{V}]^\top$ is an optimal solution to the Y-subproblem in Eq. (6).

Proof. We first prove that $Y^\star$ is feasible for Eq. (6) in the main paper, i.e., $Y^\star \in \Omega$. Note that $J\mathbf{1} = \mathbf{0}$ and hence $\mathbf{1}^\top J B = \mathbf{0}^\top$. Because $JB$ and $U$ have the same column (range) space, we have $\mathbf{1}^\top U = \mathbf{0}^\top$. Moreover, as we construct $\bar{U}$ such that $\mathbf{1}^\top \bar{U} = \mathbf{0}^\top$, we have $\mathbf{1}^\top [U\ \bar{U}] = \mathbf{0}^\top$, which implies $\mathbf{1}^\top Y^\star = \mathbf{0}^\top$. Together with the fact that
$$
(Y^\star)^\top Y^\star = n\,[V\ \bar{V}][U\ \bar{U}]^\top [U\ \bar{U}][V\ \bar{V}]^\top = n I_r,
$$
the feasibility of $Y^\star$ for Eq. (6) does hold.

Now we consider an arbitrary $Y \in \Omega$. Due to $\mathbf{1}^\top Y = \mathbf{0}^\top$, we have $JY = Y - \frac{1}{n}\mathbf{1}\mathbf{1}^\top Y = Y$, and hence $\langle B, Y \rangle = \langle B, JY \rangle = \langle JB, Y \rangle$. Moreover, by using von Neumann's trace inequality [2] and the fact that $Y^\top Y = n I_r$, we have $\langle JB, Y \rangle \le \sqrt{n}\sum_{k=1}^{r}\sigma_k$. On the other hand,
$$
\langle B, Y^\star \rangle = \langle JB, Y^\star \rangle = \Big\langle [U\ \bar{U}]\begin{bmatrix}\Sigma & 0\\ 0 & 0\end{bmatrix}[V\ \bar{V}]^\top,\ \sqrt{n}\,[U\ \bar{U}][V\ \bar{V}]^\top \Big\rangle = \sqrt{n}\,\Big\langle \begin{bmatrix}\Sigma & 0\\ 0 & 0\end{bmatrix},\ I_r \Big\rangle = \sqrt{n}\sum_{k=1}^{r}\sigma_k.
$$
Therefore, we can quickly derive
$$
\operatorname{tr}(B^\top Y) = \langle B, Y \rangle = \langle JB, Y \rangle \le \sqrt{n}\sum_{k=1}^{r}\sigma_k = \langle B, Y^\star \rangle = \operatorname{tr}(B^\top Y^\star)
$$
for any $Y \in \Omega$, which completes the proof.
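For readers who prefer a computational view of the two updates analyzed above, the following NumPy sketch mirrors the sign-based B-update of Eq. (7) and the closed-form Y-update of Lemma 2. It is only an illustrative sketch: the concrete form f(B) = tr(B^T A B) + rho tr(Y^T B) is taken from the expression of Q(B, Y) used in the proof of Theorem 1 below, the helper names are ours, and the rank-deficient case of JB (which requires the extra blocks U-bar and V-bar) is not handled.

```python
import numpy as np

def b_step(A, Y, B, rho, max_iter=50):
    """Sign-based ascent for the B-subproblem (cf. Lemma 1 and the Eq. (7)-style update).

    Assumes f(B) = tr(B^T A B) + rho * tr(Y^T B), so grad f(B) = 2 A B + rho * Y.
    Entries with zero gradient keep their previous sign.
    """
    for _ in range(max_iter):
        G = 2.0 * A @ B + rho * Y
        B_new = np.where(G != 0, np.sign(G), B)
        if np.array_equal(B_new, B):        # fixed point: no further increase possible
            break
        B = B_new                           # Lemma 1: f strictly increases here
    return B

def y_step(B):
    """Closed-form Y-subproblem solution (cf. Lemma 2), assuming JB has full rank r."""
    n, r = B.shape
    JB = B - B.mean(axis=0, keepdims=True)  # JB with J = I_n - (1/n) 1 1^T
    U, s, Vt = np.linalg.svd(JB, full_matrices=False)
    assert s[-1] > 1e-10, "rank-deficient JB needs the extra [U-bar, V-bar] blocks"
    return np.sqrt(n) * U @ Vt              # Y* = sqrt(n) U V^T when rank(JB) = r
```

In the full-rank case the returned matrix coincides with the solution of Lemma 2, because the blocks U-bar and V-bar are then empty.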

Theorem. If { k, Y k ) } is the sequence generated by Algorithm 2, then Q k+, Y k+ ) Q k, Y k ) holds for any integer k, and { Q k, Y k ) } converges starting with any feasible initial point, Y ). Proof. Let us use 2 to denote matrix 2-norm. {±} n r and Y Ω, we can derive Q, Y) =, A + ρ, Y F A F + ρ F Y F F A 2 F + ρ F Y F = nr nr + ρ nr nr = + ρ)nr, where the second line is due to Cauchy-Schwarz inequality, the third line follows from the inequality CD F C 2 D F for any compatible matrices C and D, and the last line holds as F = Y F = nr and A 2 =. Since k, Y k ) is feasible at each k, the sequence { Q k, Y k ) } is bounded from above. Applying Lemma and Lemma 2, we quickly have Q k+, Y k+ ) Q k+, Y k ) Q k, Y k ). Together with the boundedness of { Q k, Y k ) }, the monotonically non-decreasing sequence { Qk, Y k ) } must converge. Proposition. For any orthogonal matrix R R r r and any binary matrix {±} n r, we have tr A ) r tr2 R H A ). Proof. Since A is positive semidefinite, it suffices to write A = E E with some proper E R m n m m). Moreover, because the operator norm A 2 =, we have E 2 =. y taking into account the fact H H = I r, we can derive that for any orthogonal matrix R R r r and any binary matrix {±} n r, the following inequality holds: tr R H A ) = tr R H E E ) = EHR, E EHR F E F E 2 HR F E F = tr R H HR ) E F = tri r ) E F = r E F. The above inequality gives E F tr R H A ) / r, leading to which completes the proof. tr A ) = tr E E ) = E 2 F tr R H A ) / r ) 2 = r tr2 R H A ), { It is also easy to prove the convergence of the sequence tr R j ) H A j ) } generated in optimizing problem 8) of the main paper, by following the spirit of Theorem j.

2 More Discussions

2.1 Orthogonal Constraint

Like Spectral Hashing (SH) [7, 6] and Anchor Graph Hashing (AGH) [4], we impose the orthogonal constraints on the target hash bits so that the redundancy among these bits is minimized, which leads to uncorrelated hash bits if the orthogonal constraints are strictly satisfied. However, the orthogonality of hash bits is traditionally difficult to achieve, so our proposed Discrete Graph Hashing (DGH) model pursues nearly orthogonal (uncorrelated) hash bits by softening the orthogonal constraints. Here we clarify that the performance drop of linear-projection based hashing methods such as [1][3] is not due to the orthogonal constraints imposed on hash bits, but to those imposed on the projection directions. When using orthogonal projections, the quality of the constructed hash functions typically degrades rapidly, since most of the variance of the training data is captured by the first few orthogonal projections. In contrast, in this work we impose the orthogonal constraints on hash bits. Even though recall is expected to drop with long codes for all the considered hashing techniques, including our proposed DGH, orthogonal/uncorrelated hash bits have been found to yield better search accuracy (e.g., precision/recall) for longer code lengths since they minimize the bit redundancy. Previous hashing methods [4, 6, 7] using continuous relaxations deteriorate with longer code lengths because of the large errors introduced by discretizing their continuously relaxed solutions. Our hashing technique (both versions, DGH-I and DGH-R) generates nearly uncorrelated hash bits via direct discrete optimization, so its recall decreases much more slowly and remains much higher in comparison to the other methods (see Figure 3). The hash lookup success rate of our DGH stays at almost 100% with only a tiny drop (see Figure 1 in the main paper). Overall, we have shown that the orthogonal constraints on hash bits, which are nearly satisfied by enforcing direct discrete optimization, do not hurt but rather improve the recall and success rate of hash lookup for relatively long code lengths.

2.2 Spectral Relaxation

Spectral methods have been well studied in the literature, and the effect of the spectral relaxations widely adopted in clustering, segmentation, and hashing problems can be bounded. However, as those existing bounds are typically concerned with the worst case, such theoretical results may be very conservative and without clear practical implications. Moreover, since we use the solution of the spectral method AGH [4], obtained via continuous relaxation plus discrete rounding, as an initial point for running our discrete method DGH-I, the proven monotonicity (Theorem 1) guarantees that DGH (Algorithm 2 in the main paper) leads to a solution which is certainly no worse than the initial point (i.e., the spectral solution) in terms of the objective function value of Eq. (4) in the main paper. Note that the deviation (e.g., the ℓ1 distance) of the continuously relaxed solution from the discrete solution grows as the code length increases, as disclosed by the normalized cuts work [5]. Quantitatively, we find that the hash lookup F-measure (Figure 2 in the main paper) and recall (Figure 3) achieved by the spectral methods relying on continuous relaxations (e.g., SH, 1-AGH and 2-AGH) either drop drastically or become very poor when the code length surpasses 48. Therefore, in the main paper, we argued that continuous relaxations applied to solving hashing problems may lead to poor hash codes at longer code lengths.
2.3 Initialization

Our proposed initialization schemes are motivated by previous work. In our first initialization scheme, Y_0 originates from the spectral embedding of the graph Laplacian, and its binarization (thresholding Y_0 at zero) has been used as the final hash codes by the spectral method AGH [4]. Our second initialization scheme modifies the first one to seek the optimally rotated spectral embedding Y_0, where the motivation is provided by Proposition 1. While the two initialization schemes are heuristic, we find that both of them consistently result in much better performance than random initialization.
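As a concrete illustration of the rotation-based scheme, the sketch below alternates a sign step in B with an orthogonal Procrustes step in R. It assumes that problem (8) of the main paper has the form of maximizing tr(R^T H^T A B) over binary B and orthogonal R, i.e., the quantity whose convergence is remarked upon after Proposition 1; the function name, iteration budget, and tie-breaking rule are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def rotation_init(A, H, iters=50, seed=0):
    """Hypothetical sketch of the rotation-based initialization motivated by Proposition 1.

    Assumes problem (8) is max over B in {+-1}^{n x r} and orthogonal R of
    tr(R^T H^T A B), with H^T H = I_r. Alternates:
      - B-step: B = sgn(A H R), the binary maximizer for fixed R;
      - R-step: orthogonal Procrustes, R = U V^T from the SVD of H^T A B.
    Each step maximizes its own block, so tr(R^T H^T A B) is non-decreasing.
    """
    r = H.shape[1]
    rng = np.random.default_rng(seed)
    R = np.linalg.qr(rng.standard_normal((r, r)))[0]   # random orthogonal start
    for _ in range(iters):
        B = np.sign(A @ H @ R)
        B[B == 0] = 1                                  # break zero ties arbitrarily
        U, _, Vt = np.linalg.svd(H.T @ A @ B)          # Procrustes step for R
        R = U @ Vt
    return B, R                                        # initial codes and learned rotation
```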

2.4 Out-of-Sample Hashing

We proposed a novel out-of-sample extension for hashing in Section 2 of the main paper, which is essentially discrete and completely different from the continuous out-of-sample extensions suggested by the spectral methods including SH [7], MDSH [6], 1-AGH and 2-AGH [4]. Our proposed extension neither makes any assumption nor involves any approximation, but directly produces the hash code of any given query in a discrete manner. Consequently, this discrete hashing extension makes the whole hashing scheme asymmetric, in the sense that hash codes are generated differently for samples in the dataset and for queries outside the dataset. In contrast, the overall hashing schemes of the above spectral methods [4, 6, 7] are all symmetric. We gave a brief discussion of the proposed asymmetric hashing mechanism in Section 4 of the main paper.

3 More Experimental Results

Figure 1: Convergence curves of Q(B, Y) starting with two different initial points on CIFAR-10 ((a) r = 24, (b) r = 48, (c) r = 96).

Figure 2: Convergence curves of Q(B, Y) starting with two different initial points on SUN397 ((a) r = 24, (b) r = 48, (c) r = 96).

In Figures 1 and 2, we plot the convergence curves of {Q(B_k, Y_k)} starting with the two suggested initial points (B_0, Y_0). Specifically, DGH-I uses the first initial point, and DGH-R uses the second initial point, which is obtained by learning a rotation matrix R. Note that if R equals the identity matrix I_r, DGH-R degenerates to DGH-I. Figures 1 and 2 indicate that when learning 24, 48, and 96 hash bits, DGH-R consistently gives a higher objective value Q(B, Y) than DGH-I at the start, and also reaches a higher objective value Q(B, Y) than DGH-I at convergence, though DGH-R tends to need more iterations to converge. Typically, more than one iteration is needed for both DGH-I and DGH-R to converge, and the first iteration usually yields the largest increase in the objective value Q(B, Y).

We show the recall results of hash lookup within Hamming radius 2 in Figure 3, which reveal that when the number of hash bits satisfies r ≥ 16, DGH-I achieves the highest recall except on YouTube Faces, where DGH-R is the highest and DGH-I the second highest. Among the competing hashing techniques: when r ≥ 48, BRE and several other methods suffer from poor recall (< 0.65), and even give close-to-zero recall on the first three datasets (CIFAR-10, SUN397 and YouTube Faces); some attain much lower recall than DGH-I and DGH-R, and even give close-to-zero recall when r ≥ 32 on the first two datasets; LSH usually gives higher recall than the other methods, but is notably inferior to DGH-I when r ≥ 16; although using the same constructed anchor graph on each dataset, DGH-I and DGH-R achieve much higher recall than 1-AGH and 2-AGH, which give close-to-zero recall when r ≥ 48 on CIFAR-10 and Tiny-1M.

Table 1: Hamming ranking performance (mean average precision for r = 24, 48, and 96 hash bits, plus training and test times in seconds at r = 96) on the CIFAR-10 and SUN397 datasets.

From the recall results, we can conclude that the discrete optimization procedure exploited by our DGH technique preserves the neighborhood structure inherent in massive data in a discrete code space better than the relaxed optimization procedures employed by SH, 1-AGH and 2-AGH. We argue that the spectral methods SH, 1-AGH and 2-AGH do not really minimize the Hamming distances between the neighbors in the input space, since the discrete binary constraints which should be imposed on the hash codes were discarded.

Finally, we report the Hamming ranking results on CIFAR-10 and SUN397 in Table 1, which clearly show the superiority of DGH-R over the competing hashing techniques in mean average precision (MAP). Regarding the reported training time, most of the competing methods such as AGH are non-iterative, while BRE and our proposed DGH-I/DGH-R involve iterative optimization procedures and are therefore slower. Admittedly, our DGH method is slower to train than the other methods except BRE, and DGH-R is slower than DGH-I due to the longer initialization for optimizing the rotation R. However, considering that training happens in the offline mode and that substantial quality improvements are found in the learned hash codes, we believe that a modest sacrifice in training time is acceptable. Note that, as reported in the main paper, even for the one million samples of the Tiny-1M dataset, the training time of DGH-I/DGH-R is less than one hour.

The search/query time of all the referred hashing methods includes two components: coding time and table lookup time. Compared to the coding time, the latter is constant (dependent only on the number of hash bits r) and small enough to be ignored, since a single hash table with no reordering is used for all the methods. Hence, we only report the coding time as the test time per query for all the compared methods. In most information retrieval applications, test time is usually the main concern since search mostly happens in the online mode, and our DGH method has test time comparable to the other methods.

To achieve the best performance for DGH-I/DGH-R, we need to tune the involved (hyper)parameters via cross-validation; however, in our experience they are normally easy to tune. Regarding the parameters needed for anchor graph construction, we simply fix the number of anchors m and the number of nearest anchors s, and set the other parameters according to [4], so as to make a fair comparison with AGH.
Regarding the budget iteration numbers required by the alternating maximization procedures of our DGH method, we simply fix them to proper constants (T_R, T_B, and T_G), within which the initialization procedure for optimizing the rotation R and the DGH procedure (Algorithm 2 in the main paper) empirically converge. For the penalty parameter ρ, which controls the balance and uncorrelatedness of the learned hash codes, we tune it over a small range of values (up to 5) on each dataset. Our choices for these parameters are found to work reasonably well over all datasets.

Figure 3: Mean recall of hash lookup within Hamming radius 2 for different hashing techniques: (a) CIFAR-10, (b) SUN397, (c) YouTube Faces, (d) Tiny-1M.

References

[1] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI, 35(12):2916-2929, 2013.

[2] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 2012.

[3] W. Kong and W.-J. Li. Isotropic hashing. In NIPS 25, 2012.

[4] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In Proc. ICML, 2011.

[5] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888-905, 2000.

[6] Y. Weiss, R. Fergus, and A. Torralba. Multidimensional spectral hashing. In Proc. ECCV, 2012.

[7] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS 21, 2008.