Spectral Hashing: Learning to Leverage 80 Million Images


1 Spectral Hashing: Learning to Leverage 80 Million Images Yair Weiss, Antonio Torralba, Rob Fergus Hebrew University, MIT, NYU

2 Outline Motivation: Brute Force Computer Vision. Semantic Hashing. Spectral Hashing.

3 Motivation: Brute Force Computer Vision using millions of labeled images (Torralba et al., Hays and Efros, Snavely et al.)

4 Tiny Images dataset: Query search engines with 80K nouns in English. One thousand images each: 80 million images.

5 Twin Jet

6 Mohammed

7 Killer Whale

8 Brute Force Recognition?

9 Brute Force Recognition?

10 Why this won't work: Grandmother cell reborn. Similarity between images. Noisy labels. Efficient search.

11 Some Inspiration

12 Why this won't work: Grandmother cell reborn. Similarity between images. Noisy labels. Efficient search.

13 Semantic Hashing. [Figure: a semantic hash function maps the query image to a query address in the address space; semantically similar images in the database sit at nearby addresses.] Short (32-bit) codes. Hamming distance ≈ semantic distance. (Salakhutdinov and Hinton, 2007)
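To illustrate why short binary codes make lookup cheap, here is a minimal sketch that compares a 32-bit query code against a small database using XOR and a bit count. The function name `hamming_distances` and the toy codes are assumptions made for this example; they do not come from the slides.

```python
import numpy as np

def hamming_distances(query, codes):
    """Hamming distance between one 32-bit code and an array of 32-bit codes."""
    diff = np.bitwise_xor(codes, np.uint32(query))   # differing bits are set to 1
    as_bytes = diff.view(np.uint8).reshape(-1, 4)    # 4 bytes per 32-bit code
    return np.unpackbits(as_bytes, axis=1).sum(axis=1)

# Toy database of four 32-bit codes (hypothetical values).
codes = np.array([0b0000, 0b0001, 0b0111, 0xFFFFFFFF], dtype=np.uint32)
query = 0b0001

dists = hamming_distances(query, codes)
neighbors = np.flatnonzero(dists < 2)   # codes within Hamming distance < 2
print(dists, neighbors)
```

Each comparison costs one XOR and one bit count, regardless of the dimensionality of the original image descriptor.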

14 Constructing Codes so that Hamming distance ≈ semantic distance. Deep Neural Network (Salakhutdinov and Hinton 07). Random Projections / LSH (Andoni and Indyk 06). Boosting (Shakhnarovich et al. 03).

15 Deep Neural Network (Salakhutdinov and Hinton 07)

[Figure: network diagram with layer sizes 2000, 500, 500 and a 32-unit code layer, in three panels: The Deep Generative Model, Recursive Pretraining, Fine-tuning.]

Figure 2: Left panel: The deep generative model. Middle panel: Pretraining consists of learning a stack of RBMs in which the feature activations of one RBM are treated as data by the next RBM. Right panel: After pretraining, the RBMs are unrolled to create a multi-layer autoencoder that is fine-tuned by backpropagation.

First, there are limitations on the types of structure that can be represented efficiently by a single layer of hidden variables. We will show that a network with multiple hidden layers and with millions of parameters can discover latent representations that work much better for information retrieval. Second, all of these text retrieval algorithms are based on computing a similarity measure between a query document and other documents in the collection. The similarity is computed either directly in the word space or in a low-dimensional latent space. If this is done naively, the retrieval time complexity of these models is O(NV), where N is the size of the document corpus and V is the size of vocabulary or dimensionality [...]

[...] of hidden variables at a time [8]. After learning is complete, the mapping from a word-count vector to the states of the top-level variables is fast, requiring only a matrix multiplication followed by a componentwise non-linearity for each hidden layer. After the greedy, layer-by-layer training, the generative model is not significantly better than a model with only one hidden layer. To take full advantage of the multiple hidden layers, the layer-by-layer learning must be treated as a pretraining stage that finds a good region of the parameter space. Starting in this region, a gradient search can then fine-tune the model parameters to produce a much better model [10].

16 LSH. Claim: if the y_k are random linear thresholds, then Hamming distance is asymptotically monotonic with Euclidean distance. (Andoni and Indyk 06)
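A small sketch of codes built from random linear thresholds may make the claim concrete. It moves a point away from a reference along a fixed direction and counts how many bits change. The Gaussian projections, the Gaussian thresholds, and the name `lsh_codes` are assumptions chosen for illustration. The experiment only checks monotonicity along one ray; it is not the asymptotic result cited on the slide.

```python
import numpy as np

def lsh_codes(X, W, b):
    """One bit per random linear threshold: bit = 1 if w.x + b > 0."""
    return (X @ W + b > 0).astype(np.uint8)

rng = np.random.default_rng(0)
dim, n_bits = 16, 256
W = rng.standard_normal((dim, n_bits))   # random projection directions (assumed Gaussian)
b = rng.standard_normal(n_bits)          # random thresholds (assumed Gaussian)

x = rng.standard_normal(dim)             # reference point
u = rng.standard_normal(dim)
u /= np.linalg.norm(u)                   # unit direction, so the step size is the Euclidean distance

steps = np.linspace(0.0, 4.0, 9)
points = x + steps[:, None] * u          # points at increasing distance from x
codes = lsh_codes(points, W, b)
hamming = (codes != codes[0]).sum(axis=1)

for t, h in zip(steps, hamming):
    print(f"Euclidean distance {t:.1f} -> Hamming distance {h}")
```

Along a straight line each threshold is crossed at most once, so the Hamming distance to the reference code never decreases as the Euclidean distance grows.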

17 Our Approach: Optimization problem for best hashing code. NP-hard → spectral relaxation → eigenvectors → eigenfunctions → simple algorithm. State-of-the-art results.

18 Optimization. Input: $\{x_i\}$ in semantic feature space, $W_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma^2)$. Output: $y_i \in \{-1, 1\}^k$. Good code: (1) small Hamming distance between neighbors; (2) each bit fires 50% of the time and bits are independent.

19 Graph Partitioning. minimize: $\sum_{ij} W_{ij} \|y_i - y_j\|^2$ subject to: $y_i \in \{-1, 1\}^k$, $\sum_i y_i = 0$, $\frac{1}{N} \sum_i y_i y_i^T = I$
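The objective and constraints can be checked numerically on a toy example. The sketch below assumes four 2D points and a hand-picked 2-bit code; the helper names (`affinity`, `objective`, `is_balanced`, `is_uncorrelated`) are made up for this illustration. It also relies on the standard identity that the weighted sum of squared code differences equals $2\,\mathrm{tr}(Y^T L Y)$ for the graph Laplacian $L = D - W$.

```python
import numpy as np

def affinity(X, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma**2)

def objective(Y, W):
    """sum_ij W_ij ||y_i - y_j||^2 for codes Y with entries in {-1, 1}."""
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return float((W * sq).sum())

def is_balanced(Y):
    """Each bit fires 50% of the time: sum_i y_i = 0."""
    return bool(np.all(Y.sum(axis=0) == 0))

def is_uncorrelated(Y):
    """Bits are uncorrelated: (1/N) sum_i y_i y_i^T = I."""
    return bool(np.allclose(Y.T @ Y / len(Y), np.eye(Y.shape[1])))

# Four corners of the unit square and a 2-bit code (toy data, assumed).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])

W = affinity(X, sigma=1.0)
L = np.diag(W.sum(axis=1)) - W           # graph Laplacian D - W
print(objective(Y, W), 2 * np.trace(Y.T @ L @ Y))
print(is_balanced(Y), is_uncorrelated(Y))
```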

20 Graph Partitioning. minimize: $\sum_{ij} W_{ij} \|y_i - y_j\|^2$ subject to: $y_i \in \{-1, 1\}^k$, $\sum_i y_i = 0$, $\frac{1}{N} \sum_i y_i y_i^T = I$. Observation: NP-hard even for one bit.

21 Graph Partitioning. minimize: $\sum_{ij} W_{ij} \|y_i - y_j\|^2$ subject to: $y_i \in \{-1, 1\}^k$, $\sum_i y_i = 0$, $\frac{1}{N} \sum_i y_i y_i^T = I$. Relaxation: smallest eigenvectors of the graph Laplacian.
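One way to picture the relaxation is to drop the binary constraint, take the eigenvectors of the graph Laplacian with the smallest nonzero eigenvalues, and threshold them at zero. The sketch below does this for two well-separated toy clusters; the function name `relaxed_codes`, the cluster layout, and the choice `sigma=1.0` are assumptions for illustration only.

```python
import numpy as np

def relaxed_codes(X, n_bits, sigma):
    """Threshold the smallest nontrivial eigenvectors of the graph Laplacian."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / sigma**2)            # affinity matrix
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian D - W
    _, vecs = np.linalg.eigh(L)           # eigenvalues returned in ascending order
    Y = vecs[:, 1:n_bits + 1]             # skip the constant eigenvector (eigenvalue 0)
    return np.where(Y > 0, 1, -1)

rng = np.random.default_rng(0)
left = rng.normal(0.0, 0.1, size=(20, 2))
right = rng.normal(0.0, 0.1, size=(20, 2)) + np.array([2.0, 0.0])
X = np.vstack([left, right])              # two tight clusters, 20 points each

codes = relaxed_codes(X, n_bits=1, sigma=1.0)
print(codes[:20, 0], codes[20:, 0])
```

With two clusters, the single bit separates them: every point in one cluster receives the same sign and the other cluster receives the opposite sign. This only produces codes for the training points, which is why an out-of-sample extension is needed.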

22 Out of Sample Extension. Nyström? Too expensive: calculating the Nyström extension is as expensive as exhaustive nearest-neighbor search.

23 Out of Sample Extension Using Eigenfunctions. Assume the x are IID samples from p(x). Calculate the limit of the eigenvectors as the number of points → ∞. (Coifman et al. 05, Belkin and Niyogi 07, Bengio et al. 04, Nadler et al. 08)

24 Graph Partitioning. minimize: $\sum_{ij} W_{ij} \|y_i - y_j\|^2$ subject to: $y_i \in \{-1, 1\}^k$, $\sum_i y_i = 0$, $\frac{1}{N} \sum_i y_i y_i^T = I$. Relaxation: smallest eigenvectors of the graph Laplacian.

25 Out of Sample Extensions with Eigenfunctions. minimize: $\int \|y(x_1) - y(x_2)\|^2 \, W(x_1 - x_2) \, p(x_1) p(x_2) \, dx_1 dx_2$ subject to: $y(x) \in \{-1, 1\}^k$, $\int y(x) p(x) \, dx = 0$, $\int y(x) y(x)^T p(x) \, dx = I$. Relaxation: smallest eigenfunctions of Laplace-Beltrami.

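26 Analytical Eigenfunctions for ND uniform. If each dimension is uniform on $[a_i, b_i]$ then the eigenfunctions are products of 1D sinusoids. $\Phi_k(x) = \sin\left(\frac{\pi}{2} + \frac{k\pi}{b - a} x\right)$, $\lambda_k = 1 - e^{-\frac{\epsilon^2}{2} \left| \frac{k\pi}{b - a} \right|^2}$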

27 Pairwise independence is too weak. Thresholded eigenfunctions can be deterministic functions of each other. Current solution: use only single-dimension eigenfunctions.
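The analytical eigenfunctions suggest a very short coding procedure, sketched below under several assumptions: the box $[a_i, b_i]$ is estimated from the data range, coordinates are shifted so each interval starts at zero, only single-dimension eigenfunctions are used, and the bits come from the modes with the smallest eigenvalues. The function name `spectral_hash` and the default `eps=1.0` are illustrative choices, not the authors' released code.

```python
import numpy as np

def spectral_hash(X, n_bits, eps=1.0):
    """Binary codes from thresholded 1D sinusoidal eigenfunctions (sketch)."""
    lo, hi = X.min(axis=0), X.max(axis=0)          # box [a_i, b_i] estimated from data
    width = hi - lo
    modes = np.arange(1, n_bits + 1)               # candidate k = 1..n_bits per dimension
    omega = modes[None, :] * np.pi / width[:, None]      # k*pi/(b - a), shape (d, n_bits)
    lam = 1.0 - np.exp(-(eps**2 / 2.0) * omega**2)       # eigenvalue of each (dimension, k)
    order = np.argsort(lam, axis=None)[:n_bits]          # pick the smallest eigenvalues
    dims, ks = np.unravel_index(order, lam.shape)
    phase = (X[:, dims] - lo[dims]) * omega[dims, ks]    # shifted so intervals start at 0
    Y = np.sin(np.pi / 2.0 + phase)                      # eigenfunction values
    return np.where(Y > 0, 1, -1)                        # threshold at zero

rng = np.random.default_rng(0)
X = rng.uniform([0.0, 0.0], [4.0, 1.0], size=(2000, 2))  # uniform on a 4-by-1 rectangle
codes = spectral_hash(X, n_bits=3)
print(codes[:5])
print(codes.mean(axis=0))   # values near 0 mean a bit fires about half the time
```

Because the eigenvalue grows with $k / (b - a)$, long dimensions receive bits before short ones; in this toy rectangle all three bits come from the dimension of length 4, and the first bit simply splits it at the midpoint.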


28 Experiments - Synthetic. [Figure: proportion of good neighbors within a Hamming distance threshold versus number of bits, on a 2D uniform distribution, for spectral hashing, RBM (two hidden layers), stumps boosting SSC, LSH, RBM + spectral hashing, and Boosting + spectral hashing.] [Figure: codes from LSH, Boosting SSC, RBM (two hidden layers), and spectral hashing at a) 3 bits, b) 7 bits, c) 15 bits.]

29 Experiments - Real Data. Approximate p(x) with a multidimensional rectangle. Works well despite the bad assumption. Semantic distance: Euclidean distance in GIST descriptor space.

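30 LabelMe dataset. [Figure: proportion of good neighbors within a Hamming distance threshold versus number of bits, for spectral hashing, RBM, Boosting SSC, and LSH.] [Figure: retrieval examples showing the input, Gist neighbors, spectral hashing (10 bits), and Boosting (10 bits).]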

31 80 Million Image dataset. [Figure: Gist neighbors compared with spectral hashing neighbors at 32 bits and 64 bits.] Retrieval time: microseconds.

32 Limitations Three professors, no students. p(x) uniform assumption. Higher order dependencies between bits. Rounding problem.

33 Why this won't work: Grandmother cell reborn. Similarity between images. Noisy labels. Efficient search.

34 Conclusions Brute force computer vision using hundreds of millions of images. Hashing allows retrieval in microseconds. Spectral hashing - simple learning that outperforms the state-of-the-art. Code Available: Google spectral hashing
