Fully Understanding the Hashing Trick

1 Fully Understanding the Hashing Trick Lior Kamma, Aarhus University Joint work with Casper Freksen and Kasper Green Larsen.

2 Recommendation and Classification PG-13 Comic Book Super Hero Sci Fi Adventure Action Violent Scary Comedy Drama Horror

3 Recommendation and Classification PG-13 Comic Book Super Hero Sci Fi Adventure Action Violent Scary Comedy Drama Horror Categorical Variables How do we decide these are close?

4 Feature Vectors Boolean vectors. Denote the feature dimension by $n$.

5 $k$-nearest Neighbours Storing a corpus of $M$ items requires $\Omega(nM)$ memory. (Figure: Corpus)

6 $k$-nearest Neighbours New Movie: How do we find the $k$ closest movies?
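
The question on this slide can be made concrete with a brute-force baseline. The sketch below is an illustration only: it assumes Boolean feature vectors compared by Hamming distance (the slides do not fix a distance), and the names `hamming` and `k_nearest` and the toy corpus are hypothetical. It scans all $M$ items, so it costs on the order of $nM$ operations per query, which is what dimensionality reduction aims to cut down.

```python
import heapq

def hamming(u, v):
    """Number of coordinates in which two Boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def k_nearest(corpus, query, k):
    """Indices of the k corpus items closest to query; ties broken by index."""
    return heapq.nsmallest(
        k,
        range(len(corpus)),
        key=lambda i: (hamming(corpus[i], query), i),
    )

# Hypothetical toy corpus: 4 items, feature dimension n = 4.
corpus = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
]
print(k_nearest(corpus, [1, 0, 1, 0], 2))
```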

7 Dimensionality Reduction Given $\varepsilon \in (0,1)$ (approximation ratio) and $\delta \in (0,1)$ (error probability), find

8 Dimensionality Reduction Given $\varepsilon, \delta \in (0,1)$ find random $f\colon \mathbb{R}^n \to \mathbb{R}^m$, for some small $m$, such that for every $x, y \in \mathbb{R}^n$ (think of $n$ as HUGE)

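9 Dimensionality Reduction Given $\varepsilon, \delta \in (0,1)$ find random $f\colon \mathbb{R}^n \to \mathbb{R}^m$ such that for every $x, y \in \mathbb{R}^n$: $\Pr\left[\|f(x) - f(y)\|_2^2 \notin (1 \pm \varepsilon)\|x - y\|_2^2\right] \le \delta$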

10 Dimensionality Reduction Focus on linear projections. Given $\varepsilon, \delta \in (0,1)$ find random $A \in \mathbb{R}^{m \times n}$ such that for every $x, y \in \mathbb{R}^n$: $\Pr\left[\|A(x - y)\|_2^2 \notin (1 \pm \varepsilon)\|x - y\|_2^2\right] \le \delta$. Why linear? Cool math. Streaming (updates). Good in practice.

11 Dimensionality Reduction Focus on linear projections. Given $\varepsilon, \delta \in (0,1)$ find random $A \in \mathbb{R}^{m \times n}$ such that for every $x \in \mathbb{R}^n$: $\Pr\left[\|Ax\|_2^2 \notin (1 \pm \varepsilon)\|x\|_2^2\right] \le \delta$. Why linear? Cool math. Streaming (updates). Good in practice.
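
The step from the two-vector statement to the single-vector statement is not spelled out on the slides. A short worked derivation, using only linearity of $A$ (the name $z$ is introduced here for illustration):

```latex
% Linearity turns a guarantee about pairs into a guarantee about single vectors.
\begin{align*}
  Ax - Ay &= A(x - y) && \text{(linearity of } A\text{)} \\
  \text{so, with } z &:= x - y, \\
  \Pr\left[\|Ax - Ay\|_2^2 \notin (1 \pm \varepsilon)\|x - y\|_2^2\right]
    &= \Pr\left[\|Az\|_2^2 \notin (1 \pm \varepsilon)\|z\|_2^2\right] \le \delta .
\end{align*}
```

Since $z$ ranges over all of $\mathbb{R}^n$ as $x, y$ do, it suffices to prove the guarantee for every single vector.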

12 Johnson Lindenstrauss Lemma [JL 84] Given $\varepsilon, \delta \in (0,1)$ there exists a random linear $A \in \mathbb{R}^{m \times n}$ with $m = O\left(\lg(1/\delta)/\varepsilon^2\right)$ such that for every $x$: $\Pr\left[\|Ax\|_2^2 \notin (1 \pm \varepsilon)\|x\|_2^2\right] \le \delta$. In most proofs the matrix is as dense as possible. Embedding takes $O(mn)$ operations.

13 Johnson Lindenstrauss Lemma [JL 84] Given $\varepsilon, \delta \in (0,1)$ there exists a random linear $A \in \mathbb{R}^{m \times n}$ such that for every $x$: $\Pr\left[\|Ax\|_2^2 \notin (1 \pm \varepsilon)\|x\|_2^2\right] \le \delta$. In most proofs the matrix is as dense as possible. Embedding takes $O(mn)$ operations. If $A$ is sparse, this can be made faster.
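
To make the $O(mn)$ embedding cost concrete, here is a minimal sketch of a dense projection. It assumes i.i.d. Gaussian entries with variance $1/m$, which is one standard construction and not necessarily the one the slides have in mind. The constant `c = 8.0` in `jl_dimension` is an arbitrary illustrative choice; the lemma only gives the $O(\cdot)$ form. All function names are hypothetical.

```python
import math
import random

def jl_dimension(eps, delta, c=8.0):
    """Target dimension ceil(c * ln(1/delta) / eps^2); c is an assumed constant."""
    return math.ceil(c * math.log(1.0 / delta) / eps ** 2)

def dense_projection(n, m, rng):
    """m x n matrix with i.i.d. N(0, 1/m) entries, so E[||Ax||^2] = ||x||^2."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]

def project(A, x):
    """Dense matrix-vector product: m rows times n columns, so O(m * n) work."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(t * t for t in v)

rng = random.Random(0)
n, eps, delta = 200, 0.5, 0.1
m = jl_dimension(eps, delta)
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
A = dense_projection(n, m, rng)
ratio = sq_norm(project(A, x)) / sq_norm(x)
print(m, ratio)  # ratio should typically be close to 1
```

Every entry of `A` is non-zero here, which is why each embedding touches all $mn$ entries; a sparse matrix avoids that.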

14 Feature Hashing [Weinberger et al. 2009] General Idea: Shuffle the entries of $x$. Add random signs. (Figure: $x$)

15 Feature Hashing [Weinberger et al. 2009] General Idea: Shuffle the entries of $x$. Add random signs. (Figure: $x \mapsto f(x)$, with $m = 3$)

17 Feature Hashing [Weinberger et al. 2009] General Idea: Shuffle the entries of $x$. Add random signs. Observation: This operation is linear. Moreover, every column has exactly one non-zero entry. (Figure: $x \mapsto f(x)$ with signs $+$ and $-$, $m = 3$)
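
The "shuffle and add random signs" idea can be sketched in a few lines. This is an illustrative version only: it stores the hash $h$ and the signs as explicit random tables, whereas practical implementations typically compute them with a hash function. The names `make_hash_and_signs` and `feature_hash` are hypothetical.

```python
import random

def make_hash_and_signs(n, m, rng):
    """Draw h: {0..n-1} -> {0..m-1} and sign: {0..n-1} -> {-1, +1} uniformly."""
    h = [rng.randrange(m) for _ in range(n)]
    sign = [rng.choice((-1, 1)) for _ in range(n)]
    return h, sign

def feature_hash(x, h, sign, m):
    """Add sign[i] * x[i] into bucket h[i]; only non-zero entries of x cost work."""
    out = [0.0] * m
    for i, xi in enumerate(x):
        if xi != 0:
            out[h[i]] += sign[i] * xi
    return out

rng = random.Random(1)
n, m = 8, 3
h, sign = make_hash_and_signs(n, m, rng)
x = [1, 0, 2, 0, 0, 3, 0, 0]
print(feature_hash(x, h, sign, m))
```

Written as a matrix, column $i$ has the single entry `sign[i]` in row `h[i]`, which matches the observation on the slide: the map is linear and each column has exactly one non-zero entry.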

18 The Hashing Trick With High Prob. Observation: If $m$ is large enough, and the mass of $x$ is not concentrated in few entries, then the trick works with high probability. Example: $\varepsilon = 0.1$, $h\colon \{1,2,\dots,n\} \to \{1,2,\dots,m\}$, $\Pr[h(1) = h(2)] = \frac{1}{m}$, $\frac{\|x\|_\infty}{\|x\|_2} = \frac{1}{\sqrt{2}}$.

19 The Hashing Trick With High Prob. Observation: If $m$ is large enough, and the mass of $x$ is not concentrated in few entries, then the trick works with high probability. Example: $\varepsilon = 0.1$, $h\colon \{1,2,\dots,n\} \to \{1,2,\dots,m\}$, $\Pr[h(1) = h(2)] = \frac{1}{m}$, $\frac{\|x\|_\infty}{\|x\|_2} = \frac{1}{\sqrt{2}}$. Success iff no collision occurs. To succeed we need $m \ge \frac{1}{\delta}$.
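
The collision example can be checked numerically. The sketch below assumes the example vector has exactly two equal non-zero entries, which is consistent with the ratio $1/\sqrt{2}$ on the slide but is an assumption about the figure. Under that assumption a collision sends the squared norm to $0$ or $2\|x\|_2^2$, so the failure probability equals the collision probability $1/m$. Function names are hypothetical.

```python
import math
import random

def hashed_sq_norm(x, m, rng):
    """Squared norm of f(x) for a fresh random hash and fresh random signs."""
    out = [0.0] * m
    for xi in x:
        if xi != 0.0:
            out[rng.randrange(m)] += rng.choice((-1.0, 1.0)) * xi
    return sum(v * v for v in out)

def failure_rate(x, m, eps, trials, rng):
    """Fraction of trials where ||f(x)||^2 leaves (1 +- eps) * ||x||^2."""
    norm2 = sum(v * v for v in x)
    bad = 0
    for _ in range(trials):
        if abs(hashed_sq_norm(x, m, rng) - norm2) > eps * norm2:
            bad += 1
    return bad / trials

rng = random.Random(0)
m, eps = 20, 0.1
# Assumed example vector: two equal non-zero entries, zeros elsewhere.
x = [1 / math.sqrt(2), 1 / math.sqrt(2)] + [0.0] * 98
print(failure_rate(x, m, eps, 20000, rng), 1 / m)
```

The two printed numbers should be close, which is the reason $m \ge 1/\delta$ is needed for such concentrated vectors.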

20 Tight Bounds Formal Problem Fix $m, \varepsilon, \delta$. Define $\nu(m, \varepsilon, \delta)$ to be the maximum $\nu$ such that whenever $\|x\|_\infty \le \nu \|x\|_2$ then feature hashing works.

21 Tight Bounds Formal Problem Fix $m, \varepsilon, \delta$. Define $\nu(m, \varepsilon, \delta)$ to be the maximum $\nu$ such that whenever $\|x\|_\infty \le \nu \|x\|_2$ then feature hashing works. We have a fixed budget, and a fixed room for error. Evaluating $\nu$ has been an open question for almost a decade.

22 Tight Bounds Our Result Fix $m, \varepsilon, \delta$. Theorem. 1. If $m < c \log(1/\delta)/\varepsilon^2$ then $\nu = 0$. Essentially, this means our budget is too small to do anything meaningful.

23 Tight Bounds Our Result Fix $m, \varepsilon, \delta$. Theorem. 1. If $m < c \log(1/\delta)/\varepsilon^2$ then $\nu = 0$. 2. If $m \ge 2/(\delta\varepsilon^2)$ then $\nu = 1$. Essentially, this means our budget is rich enough to do anything.

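24 Tight Bounds Our Result Fix $m, \varepsilon, \delta$. Theorem. 1. If $m < c \log(1/\delta)/\varepsilon^2$ then $\nu = 0$. 2. If $m \ge 2/(\delta\varepsilon^2)$ then $\nu = 1$. 3. If $C \log(1/\delta)/\varepsilon^2 \le m < 2/(\delta\varepsilon^2)$ then $\nu = \Theta\left(\sqrt{\varepsilon} \cdot \min\left\{ \frac{\log\frac{\varepsilon m}{\log(1/\delta)}}{\log(1/\delta)}, \sqrt{\frac{\log\frac{\varepsilon^2 m}{\log(1/\delta)}}{\log(1/\delta)}} \right\}\right)$. This is tight, which means this is the right expression.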
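
The three regimes of the theorem can be evaluated numerically. The sketch below is a rough calculator, not the authors' code: it assumes natural logarithms, sets the hidden $\Theta$-constant to 1, and uses a single assumed threshold constant `c_low = 1.0` in place of both unspecified constants $c$ and $C$. The function name `nu_estimate` is hypothetical.

```python
import math

def nu_estimate(m, eps, delta, c_low=1.0):
    """Evaluate the three-regime expression with all hidden constants assumed."""
    L = math.log(1.0 / delta)
    if m < c_low * L / eps ** 2:
        return 0.0  # regime 1: budget too small
    if m >= 2.0 / (delta * eps ** 2):
        return 1.0  # regime 2: budget large enough for every vector
    # Regime 3: sqrt(eps) * min{first term, second term}; clamp logs at 0 for safety.
    first = max(0.0, math.log(eps * m / L)) / L
    second = math.sqrt(max(0.0, math.log(eps ** 2 * m / L)) / L)
    return min(1.0, math.sqrt(eps) * min(first, second))

for m in (100, 5000, 10000, 30000):
    print(m, nu_estimate(m, eps=0.1, delta=0.01))
```

With these assumed constants the estimate grows with $m$ between the two thresholds, moving from the "nothing works" regime toward the "everything works" regime.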

25 Empirical Analysis Results show that the $\Theta$-constant is close to 1. This implies that Feature Hashing's performance can be very well predicted in practice using our formula: $\nu = \Theta\left(\sqrt{\varepsilon} \cdot \min\left\{ \frac{\log\frac{\varepsilon m}{\log(1/\delta)}}{\log(1/\delta)}, \sqrt{\frac{\log\frac{\varepsilon^2 m}{\log(1/\delta)}}{\log(1/\delta)}} \right\}\right)$. (Figure label: 0.725)

26 Questions? Come see poster Read the paper Talk offline All of the above Tight Cell-Probe Bounds for Succinct Boolean Matrix-Vector Multiplication

27 Questions? Come see poster Read the paper Talk offline All of the above Thank you Tight Cell-Probe Bounds for Succinct Boolean Matrix-Vector Multiplication
