Stochastic Generative Hashing



B. Dai (1), R. Guo (2), S. Kumar (2), N. He (3) and L. Song (1)
(1) Georgia Institute of Technology, (2) Google Research, NYC, (3) University of Illinois at Urbana-Champaign
Discussion by Ikenna Odinaka, September 8

Outline

1. Introduction
2. Stochastic Generative Hashing
3. Distributional SGD
4. Experiments

Binary Hashing

- Represent real-valued signals $x \in \mathbb{R}^d$ using binary vectors (hash codes) $h \in \{0, 1\}^l$.
- This enables faster search and retrieval, e.g. L2 nearest neighbor search (L2NNS):
  - It reduces time and storage requirements.
  - Search with binary vectors based on Hamming distance can be done efficiently.
- Hashing can be data-dependent or randomized.
- Previous approaches to data-dependent hashing have two major shortcomings:
  - The objective function is constructed heuristically.
  - Binary constraints are handled crudely by relaxation, leading to subpar results.
- Stochastic Generative Hashing tackles both problems.
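To make the point about efficient Hamming-distance search concrete, here is a minimal sketch in plain Python. It assumes codes are packed into integers and searched with a linear scan; the helper names (`pack_bits`, `hamming`, `nearest_by_hamming`) are illustrative and do not come from the slides.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one int (first bit is the most significant)."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code


def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


def nearest_by_hamming(query, database):
    """Index of the database code closest to `query` (linear scan, ties go to the lowest index)."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))


database = [pack_bits(b) for b in ([0, 0, 1, 1], [1, 1, 1, 0], [0, 1, 0, 1])]
query = pack_bits([1, 1, 0, 0])
print(nearest_by_hamming(query, database))  # prints 1
```

Each comparison is one XOR plus a bit count, which is why storing and scanning binary codes is much cheaper than computing Euclidean distances on the original vectors.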

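Generative Model $p(x \mid h)$

- $p(x \mid h)$ can be chosen based on the problem domain.
- A simple Gaussian distribution for $p(x \mid h)$ achieves state-of-the-art performance.
- The joint distribution is given by
  $$p(x, h) = p(x \mid h)\, p(h), \tag{1}$$
  where $p(x \mid h) = \mathcal{N}(Uh, \rho^2 I)$, $p(h) \sim \mathcal{B}(\theta) = \prod_{i=1}^{l} \theta_i^{h_i} (1 - \theta_i)^{1 - h_i}$, $U = \{u_i\}_{i=1}^{l}$ with $u_i \in \mathbb{R}^d$ is a codebook with $l$ codewords, and $\theta = [\theta_i]_{i=1}^{l} \in [0, 1]^l$.
- The joint distribution can be written as
  $$p(x, h) \propto \exp\left(-\frac{1}{2\rho^2}\,\|x - Uh\|_2^2 + \left(\log\frac{\theta}{1 - \theta}\right)^{\top} h\right). \tag{2}$$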
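As a rough illustration of equation (2), the sketch below evaluates the log of the joint distribution up to an additive constant. It assumes the codebook is stored as a list of `l` codewords of length `d`; the function name and the toy values are hypothetical.

```python
import math


def log_joint_unnormalized(x, h, U, theta, rho):
    """log p(x, h) up to an additive constant, following equation (2).

    U is a list of l codewords (each of length d), so Uh is the sum of the
    codewords whose bit h_i is 1.
    """
    d = len(x)
    Uh = [sum(U[i][j] * h[i] for i in range(len(h))) for j in range(d)]
    sq_err = sum((x[j] - Uh[j]) ** 2 for j in range(d))
    prior = sum(math.log(t / (1 - t)) * hi for t, hi in zip(theta, h))
    return -sq_err / (2 * rho ** 2) + prior


U = [[1.0, 0.0], [0.0, 1.0]]   # two codewords in R^2 (toy values)
theta = [0.5, 0.5]             # uniform prior, so the prior term vanishes
x = [1.0, 0.0]
print(log_joint_unnormalized(x, [1, 0], U, theta, rho=1.0))  # exact reconstruction
print(log_joint_unnormalized(x, [0, 1], U, theta, rho=1.0))  # worse reconstruction, lower value
```

With a uniform prior the score depends only on the squared reconstruction error, so the code that reconstructs `x` exactly scores highest.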

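Encoding Model $q(h \mid x)$

- Computing the posterior $p(h \mid x)$ is intractable.
- Getting the MAP solution from $p(h \mid x)$ requires integer programming.
- Approximate $p(h \mid x)$ using $q(h \mid x)$ as the encoder:
  $$q(h \mid x) = \prod_{k=1}^{l} q(h_k = 1 \mid x)^{h_k}\, q(h_k = 0 \mid x)^{1 - h_k}. \tag{3}$$
- Use the linear parametrization $h = [h_k]_{k=1}^{l} \sim \mathcal{B}(\sigma(W^{\top} x))$, where $W = [w_k]_{k=1}^{l} \in \mathbb{R}^{d \times l}$.
- The MAP solution of $q(h \mid x)$ is $h(x) = \operatorname{argmax}_h q(h \mid x) = \frac{\operatorname{sign}(W^{\top} x) + 1}{2}$, so each bit is 1 exactly when $w_k^{\top} x > 0$.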
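The encoder above is simple enough to sketch directly. The code below assumes `W` is stored as a list of its `l` columns `w_k` and breaks the tie `w_k^T x = 0` toward 0; both choices are assumptions made for this illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def encoder_probs(x, W):
    """q(h_k = 1 | x) = sigmoid(w_k^T x) for each column w_k of W."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w_k, x))) for w_k in W]


def map_code(x, W):
    """MAP code under the factorized encoder: bit k is 1 iff w_k^T x > 0."""
    return [1 if sum(wj * xj for wj, xj in zip(w_k, x)) > 0 else 0 for w_k in W]


W = [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]  # l = 3 columns, d = 2 (toy values)
x = [2.0, 1.0]
print(map_code(x, W))  # prints [1, 0, 1]
```

Because the encoder factorizes over bits, the MAP code is obtained by thresholding each probability at 0.5, with no combinatorial search.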

Training Objective

- The objective function is based on the Minimal Description Length (MDL) principle.
- The MDL principle tries to minimize the expected amount of information needed to communicate $x$:
  $$L(x) = \sum_h q(h \mid x)\,\big(L(h) + L(x \mid h)\big),$$
  where $L(h) = -\log p(h) + \log q(h \mid x)$ is the description length of the hash code $h$, and $L(x \mid h) = -\log p(x \mid h)$ is the description length of $x$ with $h$ already known.
- For a given bit size $l$, the MDL principle finds the best parameters to maximally compress the training data.
- The training objective is
  $$\min_{\Theta = \{W, U, \beta, \rho\}} H(\Theta) := \sum_x H(\Theta; x) = -\sum_x \sum_h q(h \mid x)\,\big(\log p(x, h) - \log q(h \mid x)\big), \tag{4}$$
  where $\beta := \log\frac{\theta}{1 - \theta}$.
- $W$ comes from $q(h \mid x)$, while $U$, $\beta$, $\rho$ come from $p(x, h)$.
- The objective function is also called the Helmholtz (variational) free energy.
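For a very small code length, the objective in equation (4) can be evaluated exactly by summing over all $2^l$ codes. The sketch below does this for a single sample and compares the result with $-\log p(x)$, which the free energy upper-bounds. The data layout (`W` and `U` as lists of `l` vectors of length `d`) and all numeric values are assumptions for illustration only.

```python
import itertools
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def log_p_joint(x, h, U, theta, rho):
    """Normalized log p(x, h) = log N(x; Uh, rho^2 I) + log B(h; theta)."""
    d = len(x)
    Uh = [sum(U[i][j] * h[i] for i in range(len(h))) for j in range(d)]
    sq_err = sum((x[j] - Uh[j]) ** 2 for j in range(d))
    log_lik = -0.5 * d * math.log(2 * math.pi * rho ** 2) - sq_err / (2 * rho ** 2)
    log_prior = sum(math.log(t) if hi else math.log(1 - t) for t, hi in zip(theta, h))
    return log_lik + log_prior


def log_q(x, h, W):
    """log q(h | x) for the factorized Bernoulli encoder of equation (3)."""
    total = 0.0
    for w_k, hk in zip(W, h):
        z = sigmoid(sum(wj * xj for wj, xj in zip(w_k, x)))
        total += math.log(z) if hk else math.log(1 - z)
    return total


def free_energy(x, W, U, theta, rho):
    """H(Theta; x) from equation (4), computed by enumerating all 2^l codes."""
    total = 0.0
    for h in itertools.product((0, 1), repeat=len(W)):
        lq = log_q(x, h, W)
        total -= math.exp(lq) * (log_p_joint(x, h, U, theta, rho) - lq)
    return total


def neg_log_evidence(x, U, theta, rho):
    """-log p(x), where p(x) is the sum of p(x, h) over all codes h."""
    codes = itertools.product((0, 1), repeat=len(U))
    return -math.log(sum(math.exp(log_p_joint(x, h, U, theta, rho)) for h in codes))


x = [1.0, 0.5]
W = [[0.3, -0.2], [0.1, 0.4]]
U = [[1.0, 0.0], [0.0, 1.0]]
theta = [0.4, 0.6]
print(free_energy(x, W, U, theta, rho=1.0), neg_log_evidence(x, U, theta, rho=1.0))
```

Enumeration is only feasible for toy values of `l`, which is why training relies on stochastic estimates instead.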

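Reparametrization via Stochastic Neuron

- Computing gradients w.r.t. $W$ requires back-propagating through the stochastic nodes of the binary variable $h$.
- REINFORCE can be used, but it suffers from high variance.
- REINFORCE combined with variance reduction techniques is either biased or expensive to compute.
- Instead, reparametrize the Bernoulli distribution using a stochastic neuron: $h_k$ is reparametrized as $\tilde{h}_k(z)$ with $z \in (0, 1)$, $k = 1, \dots, l$.
- The stochastic neuron is defined as
  $$\tilde{h}(z, \xi) := \begin{cases} 1 & \text{if } z \ge \xi, \\ 0 & \text{if } z < \xi, \end{cases} \tag{5}$$
  where $\xi \sim \mathcal{U}(0, 1)$.
- $\tilde{h}(z, \xi) \sim \mathcal{B}(z)$, because $P(\tilde{h}(z, \xi) = 1) = z$.
- Replacing $h_k \sim \mathcal{B}(\sigma(w_k^{\top} x))$ with $\tilde{h}_k(\sigma(w_k^{\top} x), \xi_k)$ in equation (4) gives
  $$\tilde{H}(\Theta) = \sum_x \tilde{H}(\Theta; x) := \sum_x \mathbb{E}_{\xi}\big[\ell(\tilde{h}, x)\big], \tag{6}$$
  where $\ell(\tilde{h}, x) = -\log p(x, \tilde{h}(\sigma(W^{\top} x), \xi)) + \log q(\tilde{h}(\sigma(W^{\top} x), \xi) \mid x)$.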
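The claim that the stochastic neuron of equation (5) is Bernoulli with parameter $z$ can be checked empirically. The following sketch uses a fixed seed and an arbitrary value $z = 0.3$; the sample size is an assumption chosen only to make the estimate stable.

```python
import random


def stochastic_neuron(z, xi):
    """Equation (5): returns 1 if z >= xi, else 0."""
    return 1 if z >= xi else 0


rng = random.Random(0)
z = 0.3
n = 100_000
# Empirical frequency of the neuron firing when xi ~ U(0, 1).
freq = sum(stochastic_neuron(z, rng.random()) for _ in range(n)) / n
print(freq)  # should be close to z
```

All randomness now lives in `xi`, which does not depend on the parameters, so the expectation in equation (6) is taken over a fixed distribution.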

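Theory

Proposition 1 (Neighborhood Preservation). If $\|U\|_F$ is bounded, then the Gaussian reconstruction error $\|x - U h_x\|_2$ is a surrogate for Euclidean neighborhood preservation.

Definition 2 (Distributional derivative). Let $\Omega \subset \mathbb{R}^d$ be an open set, let $C_0^{\infty}(\Omega)$ denote the space of functions that are infinitely differentiable with compact support in $\Omega$, and let $\mathcal{D}'(\Omega)$ be the space of continuous linear functionals on $C_0^{\infty}(\Omega)$. Let $u \in \mathcal{D}'(\Omega)$; then a distribution $v$ is called the distributional derivative of $u$, denoted $v = Du$, if it satisfies
$$\int_{\Omega} v\,\phi \, dx = -\int_{\Omega} u\,\partial\phi \, dx, \quad \forall \phi \in C_0^{\infty}(\Omega).$$

Lemma 3. For a given sample $x$, the distributional derivative of the function $\tilde{H}(\Theta; x)$ w.r.t. $W$ is given by
$$D_W \tilde{H}(\Theta; x) = \mathbb{E}_{\xi}\big[\Delta_{\tilde{h}} \ell(\tilde{h}(\sigma(W^{\top} x), \xi)) \odot \sigma(W^{\top} x) \odot (1 - \sigma(W^{\top} x))\, x^{\top}\big], \tag{7}$$
where $\odot$ denotes the point-wise product and $\Delta_{\tilde{h}} \ell(\tilde{h})$ denotes the finite difference defined as $[\Delta_{\tilde{h}} \ell(\tilde{h})]_k = \ell(\tilde{h}_k^1) - \ell(\tilde{h}_k^0)$. Here $\tilde{h}_k^i$ is the code $\tilde{h}$ with its $k$-th bit set to $i$: $[\tilde{h}_k^i]_j = \tilde{h}_j$ if $j \ne k$, and $[\tilde{h}_k^i]_j = i$ otherwise, for $i \in \{0, 1\}$.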
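A single-sample version of the estimator in equation (7) can be sketched as follows. The loss is passed in as a generic callable, `W` is assumed to be stored as a list of its `l` columns, and the result is returned in the same layout; these conventions, and the linear toy loss used in the example, are assumptions rather than the authors' implementation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def distributional_grad_W(x, W, xi, loss):
    """Single-sample estimate of equation (7).

    W: list of l columns w_k (each of length d); xi: l uniform noise values;
    loss: callable mapping a binary code (list of 0/1) to a float.
    Returns l gradient columns laid out like W.
    """
    z = [sigmoid(sum(wj * xj for wj, xj in zip(w_k, x))) for w_k in W]
    h = [1 if zk >= xik else 0 for zk, xik in zip(z, xi)]
    grad = []
    for k in range(len(W)):
        # Finite difference: flip bit k to 1 and to 0, keeping the other bits fixed.
        h1 = h[:k] + [1] + h[k + 1:]
        h0 = h[:k] + [0] + h[k + 1:]
        delta_k = loss(h1) - loss(h0)
        scale = delta_k * z[k] * (1 - z[k])
        grad.append([scale * xj for xj in x])
    return grad


c = [2.0, -1.0]
linear_loss = lambda h: sum(ci * hi for ci, hi in zip(c, h))  # toy loss
x = [1.0, 2.0]
W = [[0.5, -0.25], [0.1, 0.2]]
g = distributional_grad_W(x, W, xi=[0.9, 0.1], loss=linear_loss)
print(g[0])  # w_0^T x = 0, so sigma = 0.5 and the column is 2 * 0.25 * x
```

For the linear toy loss the finite difference equals `c[k]` whatever the noise is, so the estimate matches the ordinary gradient of the expected loss. Note that the loop evaluates the loss twice per bit.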

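Distributional Derivative of Stochastic Neuron

- $\tilde{H}(\Theta; x)$ is not differentiable w.r.t. $W$ because $\tilde{h}(z, \xi)$ is discontinuous.
- An unbiased stochastic estimator of the gradient of $\tilde{H}$ at $\Theta_i$, using the sample $(x_i, \xi_i)$, is given by
  $$\hat{\nabla}_{\Theta} \tilde{H}(\Theta_i; x_i) = \big[\hat{D}_W \tilde{H}(\Theta_i; x_i),\ \hat{\nabla}_{U,\beta,\rho} \tilde{H}(\Theta_i; x_i)\big]. \tag{8}$$
- The estimator of $D_W \tilde{H}(\Theta; x)$ needs two forward passes of the model for each dimension of $\Theta$.
- An approximate distributional derivative $\tilde{D}_W \tilde{H}(\Theta; x)$ can compute each dimension with only one forward pass:
  $$\tilde{D}_W \tilde{H}(\Theta; x) := \mathbb{E}_{\xi}\big[\nabla_{\tilde{h}} \ell(\tilde{h}(\sigma(W^{\top} x), \xi)) \odot \sigma(W^{\top} x) \odot (1 - \sigma(W^{\top} x))\, x^{\top}\big]. \tag{9}$$
- The approximate stochastic estimator of the gradient of $\tilde{H}$ is given by
  $$\hat{\tilde{\nabla}}_{\Theta} \tilde{H}(\Theta_i; x_i) = \big[\hat{\tilde{D}}_W \tilde{H}(\Theta_i; x_i),\ \hat{\nabla}_{U,\beta,\rho} \tilde{H}(\Theta_i; x_i)\big]. \tag{10}$$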
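The approximate derivative of equation (9) replaces the per-bit finite difference with a gradient of the loss with respect to the code, evaluated once at the sampled code. The sketch below assumes the loss is only the reconstruction term $\|x - Uh\|_2^2 / (2\rho^2)$, treated as a function of a continuous $h$. That simplification, the data layout, and the function names are assumptions for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def recon_grad(h, x, U, rho):
    """Gradient of ||x - Uh||^2 / (2 rho^2) w.r.t. h, i.e. U^T (Uh - x) / rho^2.

    U is a list of l codewords of length d.
    """
    d = len(x)
    residual = [sum(U[i][j] * h[i] for i in range(len(h))) - x[j] for j in range(d)]
    return [sum(u_k[j] * residual[j] for j in range(d)) / rho ** 2 for u_k in U]


def approx_distributional_grad_W(x, W, xi, grad_loss):
    """Single-sample estimate of equation (9); grad_loss maps a code to l partials."""
    z = [sigmoid(sum(wj * xj for wj, xj in zip(w_k, x))) for w_k in W]
    h = [1 if zk >= xik else 0 for zk, xik in zip(z, xi)]
    g = grad_loss(h)  # one evaluation yields all l partial derivatives
    return [[g[k] * z[k] * (1 - z[k]) * xj for xj in x] for k in range(len(W))]


x = [1.0, 2.0]
W = [[0.5, -0.25], [0.1, 0.2]]
U = [[1.0, 0.0], [0.0, 1.0]]
grad = approx_distributional_grad_W(
    x, W, xi=[0.9, 0.1], grad_loss=lambda h: recon_grad(h, x, U, rho=1.0)
)
print(grad[0])
```

The cost is a single gradient evaluation per sample instead of two loss evaluations per bit, at the price of an estimator that is only approximate.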

Distributional SGD Algorithm

Convergence of Distributional SGD

Proposition 4. The distributional derivative $D_W \tilde{H}(\Theta; x)$ is equivalent to the traditional gradient $\nabla_W H(\Theta; x)$.

Theorem 5 (Convergence of Exact Distributional SGD). Under the assumption that $H$ is $L$-Lipschitz smooth and the variance of the stochastic distributional gradient (8) is bounded by $\sigma^2$ in the distributional SGD, for the solution $\Theta_R$ sampled from the trajectory $\{\Theta_i\}_{i=1}^{t}$ with probability $P(R = i) = \frac{2\gamma_i - L\gamma_i^2}{\sum_{j=1}^{t} (2\gamma_j - L\gamma_j^2)}$, where $\gamma_i \sim O(1/\sqrt{t})$, we have
$$\mathbb{E}\big[\|\nabla_{\Theta} \tilde{H}(\Theta_R)\|^2\big] \sim O\left(\frac{1}{\sqrt{t}}\right).$$

Theorem 6 (Convergence of Approximate Distributional SGD). Under the assumption that the variance of the approximate stochastic distributional gradient (10) is bounded by $\sigma^2$, for the solution $\Theta_R$ sampled from the trajectory $\{\Theta_i\}_{i=1}^{t}$ with probability $P(R = i) = \frac{\gamma_i}{\sum_{j=1}^{t} \gamma_j}$, where $\gamma_i \sim O(1/\sqrt{t})$, we have
$$\mathbb{E}\big[(\Theta_R - \Theta^*)^{\top}\, \tilde{\nabla}_{\Theta} \tilde{H}(\Theta_R)\big] \sim O\left(\frac{1}{\sqrt{t}}\right),$$
where $\Theta^*$ denotes the optimal solution.
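Both theorems describe a solution picked at random from the SGD trajectory, with probabilities determined by the step sizes. A small sketch of that selection rule follows; the function names are hypothetical, and the positivity check reflects the assumption that each weight $2\gamma_i - L\gamma_i^2$ must be positive for the probabilities to be well defined.

```python
import random


def stopping_probabilities(gammas, lipschitz=None):
    """P(R = i) for each iterate.

    With `lipschitz` (the constant L) the weights are 2*gamma_i - L*gamma_i^2,
    as in Theorem 5; without it they are gamma_i, as in Theorem 6.
    """
    if lipschitz is None:
        weights = list(gammas)
    else:
        weights = [2 * g - lipschitz * g * g for g in gammas]
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive (Theorem 5 needs gamma_i < 2 / L)")
    total = sum(weights)
    return [w / total for w in weights]


def sample_iterate(trajectory, gammas, lipschitz=None, rng=random):
    """Draw one iterate from the trajectory according to P(R = i)."""
    probs = stopping_probabilities(gammas, lipschitz)
    return rng.choices(trajectory, weights=probs, k=1)[0]


print(stopping_probabilities([0.5, 0.5, 0.5, 0.5], lipschitz=1.0))
print(stopping_probabilities([0.1, 0.3]))
```

With a constant step size, both rules reduce to choosing an iterate uniformly at random.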

Reconstruction Error and Training Time

Figure: Comparison of Stochastic Generative Hashing (SGH) against iterative quantization (ITQ) and binary autoencoder (BA). The panels show the convergence of the reconstruction error with the number of SGD training samples, and a training time comparison between BA and SGH on the SIFT-1M dataset over different bit lengths.

Large Scale Nearest Neighbor Retrieval

Figure: Comparison of SGH, ITQ, K-means hashing (KMH), spectral hashing (SH), spherical hashing (SpH), binary autoencoder (BA), and scalable graph hashing (GH). L2NNS comparison on MNIST, SIFT-1M, GIST-1M, and SIFT-1B with varying binary code lengths. Performance is based on Recall 10@M (the fraction of the top 10 ground-truth neighbors found among the M retrieved points) as M increases.
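The Recall 10@M metric described in the caption can be written in a few lines. This is a minimal sketch that assumes both inputs are ranked lists of item identifiers and that the ground-truth list is non-empty; the function name is illustrative.

```python
def recall_k_at_m(true_neighbors, retrieved, k=10, m=None):
    """Fraction of the top-k ground-truth neighbors among the first m retrieved items."""
    if m is None:
        m = len(retrieved)
    top_true = set(true_neighbors[:k])
    hits = top_true.intersection(retrieved[:m])
    return len(hits) / len(top_true)


true_neighbors = list(range(10))   # ids of the 10 exact nearest neighbors
retrieved = [0, 11, 1, 12, 2, 13]  # ids returned by a hash-based search, in rank order
print(recall_k_at_m(true_neighbors, retrieved, m=4))  # prints 0.2
print(recall_k_at_m(true_neighbors, retrieved, m=6))  # prints 0.3
```

Recall can only grow as M increases, which is why the curves are plotted against M.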

Reconstruction Visualization

Figure: Original and reconstructed samples using 64 bits for SGH and ITQ, and 64 real components (64 × 32 bits) for PCA. The original MNIST image uses ... bits; the original CIFAR-10 image uses ... bits.

arXiv: v2 [cs.LG] 12 Aug 2017

Stochastic Generative Hashing
Bo Dai, Ruiqi Guo, Sanjiv Kumar, Niao He, Le Song
Georgia Institute of Technology bodai@gatech.edu, lsong@cc.gatech.edu
Google Research, NYC {guorq,