Sparse Johnson-Lindenstrauss Transforms

Sparse Johnson-Lindenstrauss Transforms
Jelani Nelson (MIT), May 24, 2011
Joint work with Daniel Kane (Harvard)

Metric Johnson-Lindenstrauss lemma

Metric JL (MJL) Lemma, 1984. Every set of n points in Euclidean space can be embedded into O(ε^{-2} log n)-dimensional Euclidean space so that all pairwise distances are preserved up to a 1 ± ε factor.

Uses:
- Speed up geometric algorithms by first reducing the dimension of the input [Indyk-Motwani, 1998], [Indyk, 2001]
- Low-memory streaming algorithms for linear algebra problems [Sarlós, 2006], [LWMRT, 2007], [Clarkson-Woodruff, 2009]
- Essentially equivalent to RIP matrices from compressive sensing [Baraniuk et al., 2008], [Krahmer-Ward, 2011] (used for sparse recovery of signals)

How to prove the JL lemma

Distributional JL (DJL) Lemma. For any 0 < ε, δ < 1/2 there exists a distribution D_{ε,δ} on R^{k×d} for k = O(ε^{-2} log(1/δ)) such that for any x ∈ S^{d−1},

    Pr_{S ~ D_{ε,δ}}[ | ‖Sx‖_2^2 − 1 | > ε ] < δ.

Proof of MJL: Set δ = 1/n^2 in DJL and take x to be the (normalized) difference vector of a pair of points. Union bound over the (n choose 2) pairs.

Theorem (Alon, 2003). For every n, there exists a set of n points requiring target dimension k = Ω((ε^{-2}/log(1/ε)) log n).

Theorem (Jayram-Woodruff, 2011; Kane-Meka-N., 2011). For DJL, k = Θ(ε^{-2} log(1/δ)) is optimal.
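
In more detail, a minimal spelled-out version of that union-bound step: for each pair of points p, q, apply DJL to the unit vector x = (p − q)/‖p − q‖_2, so that

    Pr[ | ‖S(p − q)‖_2^2 − ‖p − q‖_2^2 | > ε·‖p − q‖_2^2 ] < δ = 1/n^2.

A union bound over the (n choose 2) < n^2/2 pairs leaves failure probability < 1/2, so some S preserves all pairwise (squared) distances up to a 1 ± ε factor, with k = O(ε^{-2} log(1/δ)) = O(ε^{-2} log n).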

Proving the JL lemma

Older proofs:
- [Johnson-Lindenstrauss, 1984], [Frankl-Maehara, 1988]: Random rotation, then projection onto the first k coordinates.
- [Indyk-Motwani, 1998], [Dasgupta-Gupta, 2003]: Random matrix with independent Gaussian entries.
- [Achlioptas, 2001]: Independent Bernoulli (±1) entries.
- [Clarkson-Woodruff, 2009]: O(log(1/δ))-wise independent Bernoulli entries.
- [Arriaga-Vempala, 1999], [Matousek, 2008]: Independent entries with mean 0, variance 1/k, and subgaussian tails (for a Gaussian with variance 1/k).

Downside: Performing the embedding is a dense matrix-vector multiplication, taking O(k·‖x‖_0) time.
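
For concreteness, a minimal numpy sketch of the dense approach (Gaussian entries with mean 0 and variance 1/k; the dimensions and test vector below are arbitrary demo choices, not from the talk):

    import numpy as np

    def dense_jl(d, k, rng=np.random.default_rng(0)):
        """Dense JL matrix: independent N(0, 1/k) entries."""
        return rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(k, d))

    # Embedding is a dense matrix-vector product: O(k * nnz(x)) time even if x is sparse.
    d, k = 10_000, 500
    S = dense_jl(d, k)
    x = np.random.default_rng(1).normal(size=d)
    x /= np.linalg.norm(x)                        # put x on the unit sphere
    print(abs(np.linalg.norm(S @ x) ** 2 - 1.0))  # small with high probability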

Fast JL Transforms

- [Ailon-Chazelle, 2006]: x ↦ PHDx, O(d log d + k^3) time. P is a random sparse matrix, H is the Hadamard transform, D has random ±1 entries on the diagonal.
- [Ailon-Liberty, 2008]: O(d log k + k^2) time, also based on the fast Hadamard transform.
- [Ailon-Liberty, 2011], [Krahmer-Ward]: O(d log d) time for MJL, but with suboptimal k = O(ε^{-2} log n log^4 d).

Downside: Slow to embed sparse vectors: running time is Ω(min{k·‖x‖_0, d}) even if ‖x‖_0 = 1.
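
A minimal sketch of the x ↦ PHDx pattern (here "P" just samples k coordinates uniformly and rescales, a simplification of the actual sparse P in [Ailon-Chazelle, 2006]; the helper names are mine):

    import numpy as np

    def fwht(a):
        """Orthonormal fast Walsh-Hadamard transform; len(a) must be a power of 2."""
        a = a.astype(float)
        h = 1
        while h < len(a):
            for i in range(0, len(a), 2 * h):
                x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
                a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
            h *= 2
        return a / np.sqrt(len(a))

    def fjlt_like(x, k, rng=np.random.default_rng(0)):
        d = len(x)
        D = rng.choice([-1.0, 1.0], size=d)    # random signs on the diagonal
        z = fwht(D * x)                        # HDx in O(d log d) time
        rows = rng.integers(0, d, size=k)      # simplified "P": sample k coordinates
        return np.sqrt(d / k) * z[rows]

    x = np.zeros(1024)
    x[3] = 1.0                                 # a 1-sparse unit vector still costs Omega(d) here
    print(np.linalg.norm(fjlt_like(x, k=128)) ** 2)   # concentrates around 1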

Where Do Sparse Vectors Show Up?

- Documents as bags of words: x_i = number of occurrences of word i. Compare documents using cosine similarity. d = lexicon size; most documents aren't dictionaries.
- Network traffic: x_{i,j} = #bytes sent from i to j. d = 2^64 (2^256 in IPv6); most servers don't talk to each other.
- User ratings: x_i is a user's score for movie i on Netflix. d = #movies; most people haven't watched all movies.
- Streaming: x receives updates x ← x + v·e_i in a stream. Maintaining Sx requires calculating S·e_i.
- ...

Sparse JL transforms

One way to embed sparse vectors faster: use sparse matrices. Let s = #non-zero entries per column (so embedding time is s·‖x‖_0).

    reference                        value of s                  type
    [JL84], [FM88], [IM98], ...      k ≈ 4ε^{-2} log(1/δ)        dense
    [Achlioptas01]                   k/3                         sparse Bernoulli
    [WDALS09]                        no proof                    hashing
    [DKS10]                          Õ(ε^{-1} log^3(1/δ))        hashing
    [KN10a], [BOR10]                 Õ(ε^{-1} log^2(1/δ))
    [KN10b]                          O(ε^{-1} log(1/δ))          hashing (random codes)

Sparse JL Constructions

[DKS, 2010]: s = Θ(ε^{-1} log^2(1/δ))
[this work]: s = Θ(ε^{-1} log(1/δ))
[this work] (block version, blocks of k/s rows): s = Θ(ε^{-1} log(1/δ))

Sparse JL Constructions (in matrix form)

[Figure: the k×d embedding matrices. Each black cell is ±1/√s at random; each column has s nonzero cells; in the block construction the k rows are split into blocks of k/s rows, with one nonzero per column per block.]

Sparse JL Constructions (nicknames)

[Figure: the same two pictures, labeled "Graph construction" and "Block construction" (blocks of k/s rows).]

Sparse JL intuition

Let h(j, r), σ(j, r) be the random hash location and random sign for the r-th copy of x_j. Then

    (Sx)_i = (1/√s) · Σ_{(j,r): h(j,r)=i} x_j σ(j, r)

    ‖Sx‖_2^2 = ‖x‖_2^2 + (1/s) · Σ_{(j,r) ≠ (j',r')} x_j x_{j'} σ(j, r) σ(j', r') · 1[h(j,r) = h(j',r')]

Example: x = (1/√2, 1/√2, 0, ..., 0) with t < (1/2)·log(1/δ) collisions. All signs agree with probability 2^{-t} > √δ ≫ δ, giving error t/s. So we need s = Ω(t/ε). (Collisions are bad.)
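
A minimal numpy sketch of the block construction described above, using fully independent h and σ for simplicity (the construction analyzed in the talk needs only limited independence / a good code); the parameter values are arbitrary:

    import numpy as np

    def sparse_jl_block(x, k, s, rng=np.random.default_rng(0)):
        """Block construction: the k rows are split into s blocks of k//s rows;
        column j places sigma(j, r)/sqrt(s) in row h(j, r) of block r."""
        d = len(x)
        q = k // s                                   # rows per block
        y = np.zeros(k)
        for r in range(s):                           # r-th copy of every coordinate
            h = rng.integers(0, q, size=d)           # hash location inside block r
            sigma = rng.choice([-1.0, 1.0], size=d)  # random signs
            # (Sx)_i = (1/sqrt(s)) * sum_{h(j,r)=i} sigma(j,r) x_j
            np.add.at(y, r * q + h, sigma * x)
        return y / np.sqrt(s)

    x = np.random.default_rng(1).normal(size=100_000)
    x /= np.linalg.norm(x)
    y = sparse_jl_block(x, k=800, s=16)
    print(np.linalg.norm(y) ** 2)                    # close to 1

As written this loops over all d coordinates; for a sparse x one would loop over its nonzero indices only, giving the s·‖x‖_0 embedding time from the table above.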

Sparse JL via Codes

[Figure: the two construction matrices again.]

Graph construction: the columns form a constant-weight binary code of weight s.
Block construction: the columns form a code over a q-ary alphabet, q = k/s.

Will show: it suffices that the code have minimum distance s − O(s^2/k), i.e. that any two codewords agree on at most O(s^2/k) coordinates.
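
A quick numerical illustration of that code condition, using a random q-ary code (the parameters below are arbitrary demo values): the worst pair of codewords agrees noticeably more often than the average s^2/k, reflecting the union-bound loss discussed in the caveat slide below.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    d, k, s = 200, 256, 16                    # arbitrary demo parameters
    q = k // s                                # alphabet size = rows per block
    code = rng.integers(0, q, size=(d, s))    # one random length-s codeword per column

    max_agree = max(int((code[a] == code[b]).sum())
                    for a, b in combinations(range(d), 2))
    print(max_agree, s * s / k)               # want every pair to agree on O(s^2/k) coords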

Analysis (block construction)

Let η_{i,j,r} indicate whether coordinates i and j collide in the r-th chunk. Then

    ‖Sx‖_2^2 = ‖x‖_2^2 + Z,    Z = (1/s) Σ_r Z_r,    Z_r = Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}

Plan: Pr[|Z| > ε] < ε^{-ℓ} E[Z^ℓ] (for even ℓ).

Z is a quadratic form in σ, so apply known moment bounds for quadratic forms.

Analysis

Theorem (Hanson-Wright, 1971). Let z_1, ..., z_n be independent Bernoulli (±1) and let B ∈ R^{n×n} be symmetric. For ℓ ≥ 2,

    E[ | z^T B z − trace(B) |^ℓ ] < C^ℓ · max{ √ℓ · ‖B‖_F, ℓ · ‖B‖_2 }^ℓ

Reminder: ‖B‖_F = (Σ_{i,j} B_{i,j}^2)^{1/2}, and ‖B‖_2 is the largest magnitude of an eigenvalue of B.

Analysis

    Z = (1/s) Σ_{r=1}^{s} Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r} = σ^T T σ,

where T is the block-diagonal matrix

    T = (1/s) · diag(T_1, T_2, ..., T_s),    (T_r)_{i,j} = x_i x_j η_{i,j,r}

Analysis (cont'd)

    T = (1/s) · diag(T_1, ..., T_s),    (T_r)_{i,j} = x_i x_j η_{i,j,r}

    ‖T‖_F^2 = (1/s^2) Σ_{i≠j} x_i^2 x_j^2 · (#times i, j collide) < O(1/k) · ‖x‖_2^4 = O(1/k)    (good code!)

    ‖T‖_2 = (1/s) · max_r ‖T_r‖_2 ≤ 1/s

    Pr[|Z| > ε] < C^ℓ · max{ √ℓ/(ε√k), ℓ/(εs) }^ℓ

Take ℓ = log(1/δ), k = Ω(ℓ/ε^2), s = Ω(ℓ/ε). QED
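
Written out, the last step combines Markov's inequality on the ℓ-th moment with Hanson-Wright and the two norm bounds above (note trace(T) = 0, since the i ≠ j sums leave zeros on the diagonal):

    Pr[|Z| > ε] ≤ ε^{-ℓ} E|Z|^ℓ = ε^{-ℓ} E| σ^T T σ − trace(T) |^ℓ
                ≤ ε^{-ℓ} C^ℓ max{ √ℓ·‖T‖_F, ℓ·‖T‖_2 }^ℓ          (Hanson-Wright)
                ≤ C^ℓ max{ √ℓ/(ε√k), ℓ/(εs) }^ℓ.

With ℓ = log(1/δ), k ≥ C'·ε^{-2} log(1/δ), and s ≥ C'·ε^{-1} log(1/δ) for a large enough constant C', each term in the max is at most 1/2, so the probability is at most 2^{-ℓ} = δ.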

Code-based Construction: Caveat

Need a sufficiently good code: each pair of codewords should agree on O(s^2/k) coordinates.

Can get this with a random code by Chernoff + a union bound over pairs, but then we need s^2/k ≳ log(d/δ), i.e.

    s ≳ √(k·log(d/δ)) = Ω(ε^{-1} √(log(d/δ)·log(1/δ))).

Can assume d = O(ε^{-2}/δ) by first embedding into this dimension with s = 1 and 4-wise independent σ, h (analysis: Chebyshev's inequality). Then we can get away with

    s = O(ε^{-1} √(log(1/(εδ))·log(1/δ))).

Can we avoid the loss incurred by this union bound?
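
A minimal sketch of that preliminary s = 1 embedding (a CountSketch-style map with one ±1 entry per column). The degree-3 polynomial hash modulo a prime is one standard way to get 4-wise independence and is used here only for illustration; the final mod-m / mod-2 reductions introduce a negligible bias that this sketch ignores, and the class/function names are mine.

    import numpy as np

    P = 2_147_483_647                            # the Mersenne prime 2^31 - 1

    class FourWiseHash:
        """4-wise independent hash: random degree-3 polynomial mod P, then mod m."""
        def __init__(self, m, rng):
            self.m = m
            self.coef = rng.integers(0, P, size=4)

        def __call__(self, j):
            j = np.asarray(j, dtype=np.int64) % P
            v = np.zeros_like(j)
            for a in self.coef:                  # Horner evaluation mod P
                v = (v * j + a) % P
            return v % self.m

    def s1_embed(x, m, rng=np.random.default_rng(0)):
        """One nonzero per column: (Sx)_i = sum_{h(j)=i} sigma(j) x_j."""
        d = len(x)
        h, sgn = FourWiseHash(m, rng), FourWiseHash(2, rng)
        idx = h(np.arange(d))
        sigma = 2.0 * sgn(np.arange(d)) - 1.0    # {0,1} -> {-1,+1}
        y = np.zeros(m)
        np.add.at(y, idx, sigma * x)
        return y

    x = np.random.default_rng(1).normal(size=50_000)
    x /= np.linalg.norm(x)
    print(np.linalg.norm(s1_embed(x, m=4_000)) ** 2)   # within O(1/sqrt(m)) of 1, typically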

Improving the Construction

Idea: Random hashing gives a good code, but it gives much more (it's random!).

Pick h at random. Analysis: directly bound the ℓ = log(1/δ)-th moment of the error term Z, then apply Markov to Z^ℓ.

    Z = (1/s) Σ_{r=1}^{s} Z_r,    Z_r = Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}

    E_{h,σ}[Z^ℓ] = (1/s^ℓ) Σ_{r_1 < ... < r_n; t_1,...,t_n ≥ 2; Σ_i t_i = ℓ} (ℓ; t_1, ..., t_n) Π_{i=1}^{n} E_{h,σ}[Z_{r_i}^{t_i}]

(Terms with some t_i = 1 vanish since E_σ[Z_r] = 0; (ℓ; t_1, ..., t_n) is the multinomial coefficient.)

Bound the t-th moment of any Z_r, then get the ℓ-th moment bound for Z by plugging into the above.

Bounding E[Z_r^t]

    Z_r = Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) η_{i,j,r}

Monomials appearing in the expansion of Z_r^t are in correspondence with directed multigraphs. For example:

    (x_1 x_2)(x_3 x_4)(x_3 x_8)(x_4 x_8)(x_2 x_1)  ↔  [figure: the corresponding directed multigraph on the 5 distinct indices]

A monomial contributes to the expectation iff all degrees are even.

Analysis: group the monomials appearing in Z_r^t according to isomorphism class, then do some combinatorics.

Bounding E[Z_r^t]

Let m = #connected components, v = #vertices, and d_u = the degree of vertex u (for a multigraph G ∈ G_t). Then

    E_{h,σ}[Z_r^t] = Σ_{G ∈ G_t} Σ_{i_1≠j_1, ..., i_t≠j_t : f((i_u,j_u)_{u=1}^t) = G} E[ Π_{u=1}^t η_{i_u,j_u,r} ] · Π_{u=1}^t x_{i_u} x_{j_u}

                  = Σ_{G ∈ G_t} (s/k)^{v−m} · Σ_{i_1≠j_1, ..., i_t≠j_t : f((i_u,j_u)_{u=1}^t) = G} Π_{u=1}^t x_{i_u} x_{j_u}

                  ≤ Σ_{G ∈ G_t} (s/k)^{v−m} · (1/v!) · (t; d_1/2, ..., d_v/2)

                  ≤ 2^{O(t)} · Σ_{v,m} (s/k)^{v−m} · (t^t/v^v) · Σ_G Π_u d_u^{-d_u/2}

Bounding E[Z_r^t]

Can bound the sum by forming G one edge at a time, in increasing order of edge label. For example, if we didn't worry about connected components:

    S_{i+1}/S_i ≤ Σ_{u,v} d_u d_v / (Σ_u d_u)^2 ≤ C

In the end, one can show

    E[Z_r^t] ≤ 2^{O(t)} · { s/k                 if t < log(k/s)
                            (t/log(k/s))^t      otherwise }

Plug this into the formula for E[Z^ℓ]. QED

Tightness of Analysis

The analysis of the required s is tight:

- s ≤ 1/(2ε): Look at a vector with t = 1/(sε) non-zero coordinates, each of value 1/√t, and show that the probability of exactly one collision is Ω(δ), with > ε error when this happens and the signs agree.
- 1/(2ε) < s < c·ε^{-1}·log(1/δ): Look at the vector (1/√2, 1/√2, 0, ..., 0) and show that the probability of exactly 2sε collisions is ≥ √δ, all signs agree with probability ≥ √δ, and there is > ε error when this happens.

Open Problems

OPEN: Devise a distribution which can be sampled using few random bits.
Current record: O(log d + log(1/ε)·log(1/δ) + log(1/δ)·log log(1/δ)) [Kane-Meka-N.]
Existential: O(log d + log(1/δ))

OPEN: Can we embed a k-sparse vector into R^k in k·polylog(d) time with the optimal k? This would give a fast amortized streaming algorithm without blowing up space (batch k updates at a time, since we're already spending k space storing the embedding). Note: the embedding should be correct for any vector, but the running time should depend on the sparsity.

OPEN: Embed any vector in Õ(d) time into the optimal k.