Fully Understanding the Hashing Trick

Fully Understanding the Hashing Trick. Lior Kamma, Aarhus University. Joint work with Casper Freksen and Kasper Green Larsen.

Recommendation and Classification. Items are described by categorical variables, e.g. a movie tagged PG-13, Comic Book, Super Hero, Sci Fi, Adventure, Action, Violent, Scary, Comedy, Drama, Horror. How do we decide two items are close?

Feature Vectors. Encode each item as a Boolean vector, e.g. 1 0 1 1 0 0 1 1 0 … and 1 0 1 1 0 0 1 0 1 …. Denote the feature dimension by n.
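
To make the encoding concrete, here is a minimal Python sketch (not from the talk; the feature vocabulary, the tags, and the encode helper are illustrative, and in practice n is huge):

    # Boolean feature encoding: one coordinate per categorical feature.
    FEATURES = ["PG-13", "Comic Book", "Super Hero", "Sci Fi", "Adventure",
                "Action", "Violent", "Scary", "Comedy", "Drama", "Horror"]

    def encode(tags):
        # 1 if the item has the feature, 0 otherwise; dimension n = len(FEATURES).
        return [1 if feature in tags else 0 for feature in FEATURES]

    x = encode({"PG-13", "Comic Book", "Super Hero", "Action"})
    print(x)  # [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]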

k-Nearest Neighbours. Storing a corpus of M items requires Ω(Mn) memory.

k-Nearest Neighbours. Given a new movie, how do we find the k closest movies in the corpus?
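
A brute-force baseline, as a sketch (the sizes are made up): both the memory and the per-query cost scale with M·n, which is what motivates reducing n.

    import numpy as np

    def k_nearest(corpus, query, k):
        # Brute force: storing the corpus is Omega(M*n) memory and every
        # query scans all M rows in O(M*n) time.
        dists = np.linalg.norm(corpus - query, axis=1)  # distance to each item
        return np.argsort(dists)[:k]                    # indices of the k closest

    rng = np.random.default_rng(0)
    corpus = rng.integers(0, 2, size=(1000, 5000)).astype(np.float32)  # M=1000, n=5000
    print(k_nearest(corpus, corpus[0], k=5))  # corpus[0] itself comes first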

Dimensionality Reduction. Given ε, δ ∈ (0,1) (an approximation ratio and an error probability), find a random f: ℝⁿ → ℝᵐ, for some small m, such that for every x, y ∈ ℝⁿ: Pr[ ‖f(x) − f(y)‖₂² ∈ (1 ± ε)·‖x − y‖₂² ] ≥ 1 − δ. Think of n as HUGE.

Dimensionality Reduction. Focus on linear maps: find a random matrix A ∈ ℝ^(m×n) such that for every x, y ∈ ℝⁿ: Pr[ ‖A(x − y)‖₂² ∈ (1 ± ε)·‖x − y‖₂² ] ≥ 1 − δ. Since A(x − y) = Ax − Ay, it suffices to preserve norms: for every x ∈ ℝⁿ, Pr[ ‖Ax‖₂² ∈ (1 ± ε)·‖x‖₂² ] ≥ 1 − δ. Why linear? Cool math, it supports streaming (updates), and it is good in practice.

Johnson-Lindenstrauss Lemma [JL 84]. Given ε, δ ∈ (0,1), there exists a random linear A ∈ ℝ^(m×n) such that for every x: Pr[ ‖Ax‖₂² ∈ (1 ± ε)·‖x‖₂² ] ≥ 1 − δ, with m = O(lg(1/δ)/ε²). In most proofs the matrix is as dense as possible, so the embedding takes O(mn) operations. If A is sparse, this can be made faster.
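
A minimal sketch of the dense construction, assuming the common i.i.d. Gaussian instantiation (the constant 2 in the choice of m is an illustrative pick, not the tight one):

    import numpy as np

    def jl_matrix(n, eps, delta, rng):
        # m = O(lg(1/delta) / eps^2) rows; entries N(0, 1/m) so that
        # E[||Ax||^2] = ||x||^2. The matrix is fully dense: O(m*n) per embedding.
        m = int(np.ceil(2 * np.log(1 / delta) / eps**2))
        return rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

    rng = np.random.default_rng(1)
    n, eps, delta = 5000, 0.1, 0.01
    A = jl_matrix(n, eps, delta, rng)
    x = rng.normal(size=n)
    print((A @ x) @ (A @ x) / (x @ x))  # typically within 1 +/- eps of 1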

Feature Hashing [Weinberger et al. 2009]. General idea: shuffle the entries of x into m buckets and add random signs. Example with m = 3: x = (1 0 1 1 0 0 1 0 1) is mapped to f(x) = (0 1 1). Observation: this operation is linear, and moreover, as a matrix, every column has exactly one non-zero entry (a random sign in a random row).
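
A minimal sketch of the map (not the authors' code; here the bucket assignment h and the signs are drawn once per seed, whereas in practice both come from hash functions of the index, so nothing per-coordinate needs to be stored):

    import numpy as np

    def feature_hash(x, m, seed=0):
        # Each coordinate i goes to bucket h(i) with sign s(i), so the map is
        # linear and its matrix has exactly one non-zero (+/-1) per column.
        rng = np.random.default_rng(seed)
        h = rng.integers(0, m, size=len(x))       # bucket h(i)
        s = rng.choice([-1.0, 1.0], size=len(x))  # sign s(i)
        f = np.zeros(m)
        np.add.at(f, h, s * x)  # f[h(i)] += s(i) * x[i], accumulating collisions
        return f

    x = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1], dtype=float)  # the slide's vector
    print(feature_hash(x, m=3))  # one signed sum per bucket; depends on the seed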

The Hashing Trick, With High Probability. Observation: if m is large enough, and the mass of x is not concentrated in a few entries, then the trick works with high probability. A bad example: x = (1, 1, 0, …, 0) with ε = 0.1, so ‖x‖∞/‖x‖₂ = 1/√2. Here the embedding succeeds iff no collision occurs, and Pr over a random h: {1,…,n} → {1,…,m} that h(1) = h(2) is 1/m. So to succeed with probability 1 − δ we need m ≥ 1/δ.
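
A quick Monte Carlo check of that example (the parameter choices are arbitrary): a collision sends ‖f(x)‖₂² to 0 or 4 instead of ‖x‖₂² = 2, far outside (1 ± 0.1)·2, so the failure rate equals the collision rate.

    import numpy as np

    rng = np.random.default_rng(2)
    m, trials = 20, 100_000
    h1 = rng.integers(0, m, size=trials)  # bucket of entry 1 in each trial
    h2 = rng.integers(0, m, size=trials)  # bucket of entry 2 in each trial
    print(np.mean(h1 == h2), 1 / m)       # empirical failure rate vs. predicted 1/m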

Tight Bounds: The Formal Problem. Fix m, ε, δ. Define ν(m, ε, δ) to be the maximum ν such that feature hashing works whenever ‖x‖∞ ≤ ν·‖x‖₂. In other words, we have a fixed budget and a fixed room for error. Evaluating ν has been an open question for almost a decade.

Tight Bounds: Our Result. Fix m, ε, δ. Theorem.
1. If m < c·lg(1/δ)/ε², then ν = 0. Essentially, this means our budget is too small to do anything meaningful.
2. If m ≥ 2/(δε²), then ν = 1. Essentially, this means our budget is rich enough to handle every vector.
3. If C·lg(1/δ)/ε² ≤ m < 1/(δε²), then ν = Θ( √ε · min{ lg(εm/lg(1/δ)) / lg(1/δ), √( lg(ε²m/lg(1/δ)) / lg(1/δ) ) } ).
This bound is tight, which means this is the right expression.

Empirical Analysis. Experiments show that the hidden Θ-constant is close to 1 (roughly 0.725 in the reported fits). This implies that feature hashing's performance can be predicted very well in practice using the formula ν = Θ( √ε · min{ lg(εm/lg(1/δ)) / lg(1/δ), √( lg(ε²m/lg(1/δ)) / lg(1/δ) ) } ).
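
A small helper for evaluating that prediction, assuming base-2 logarithms and using 0.725 as a stand-in for the Θ-constant (both assumptions are mine, for illustration; the parameters below sit in the theorem's intermediate regime):

    import numpy as np

    def nu(m, eps, delta, C=0.725):
        # C * sqrt(eps) * min{ lg(eps*m/L)/L, sqrt(lg(eps^2*m/L)/L) }, L = lg(1/delta)
        L = np.log2(1 / delta)
        t1 = np.log2(eps * m / L) / L
        t2 = np.sqrt(np.log2(eps**2 * m / L) / L)
        return C * np.sqrt(eps) * min(t1, t2)

    # For eps = 0.1, delta = 0.01: lg(1/delta)/eps^2 ~ 664 <= m = 8192 < 1/(delta*eps^2) = 10000.
    print(nu(m=8192, eps=0.1, delta=0.01))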

Questions? Come see the poster, read the paper, talk offline, or all of the above. Thank you! See also: Tight Cell-Probe Bounds for Succinct Boolean Matrix-Vector Multiplication.