Compressing Tabular Data via Pairwise Dependencies

1 Compressing Tabular Data via Pairwise Dependencies
Amir Ingber, Yahoo! Research
TCE Conference, June 22, 2017
Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford)

2 Huge datasets: everywhere
- Internet
- Science
- Media
- At Yahoo:
  - More than 100k servers in ~36 clusters
  - More than 800PB of storage
- Lots of data, always want to store more

3 Compressing big data: does it matter?
- It's big:
  - Example: storing an event log of ~1B events / day x 6 months
  - Stored for analytics / machine learning
- It's expensive:
  - Cost of storing 1 PB: around $300k / year, e.g. on AWS

4 Lossless compression: dictionary methods
- Typical compression: gzip (DEFLATE)
  - Based on LZ77 + Huffman
  - Popular, fast
  - Recent variants: zstd (FB, 2015), Brotli (Google, 2015)
  - Good at: detecting temporal dependencies, e.g. text
- Main idea: find repetitions in a sliding window and replace each one with a (distance back, length) reference
  - Input: the brown fox jumped over the brownish jumping bear
  - Output: the brown fox jumped over (26,9) ish (25,5) ing bear
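The back-references in the example above can be checked with a small decoder. This is a minimal sketch of LZ77-style decoding only, not of DEFLATE itself; the token format (a literal string or a `(distance, length)` pair) and the name `lz77_decode` are assumptions made for illustration.

```python
def lz77_decode(tokens):
    """Expand a list of literals and (distance, length) back-references."""
    out = []
    for token in tokens:
        if isinstance(token, str):
            # Literal text is copied through unchanged.
            out.extend(token)
        else:
            distance, length = token
            start = len(out) - distance
            # Copy one character at a time so overlapping matches also work.
            for k in range(length):
                out.append(out[start + k])
    return "".join(out)


# The slide's example: (26, 9) points back at "the brown", (25, 5) at " jump".
tokens = ["the brown fox jumped over ", (26, 9), "ish", (25, 5), "ing bear"]
decoded = lz77_decode(tokens)
print(decoded)
```

Reading the pairs as (distance back, length) reproduces the original sentence, which is why that interpretation is used here.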

5 Tabular data
- Typical dataset: a table
  - Each row has several fields, complex dependencies
  - Example: fields UserID, Age, Location, Device, Time, DocID (e.g. Device = iphone 7, Time = 9pm; Device = ipad pro, Time = 10am)
- Temporal dependencies?
- Cross-field dependencies?

6 Entropy coding 101
- Given: a stream of i.i.d. symbols of a R.V. $X$
- Encode each symbol as a (prefix-free) bit string of variable length
- More frequent symbols → shorter codeword
- Theorem: avg. code length $\ge H(X) = \sum_x p(x) \log \frac{1}{p(x)}$
- Huffman code: optimal among per-symbol prefix codes. Rate $\le H(X) + 1$ bit
- Better: arithmetic coding
  - Approaches entropy
  - Requires: the distribution $p(x)$ (the coder itself is a black box)
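The bound $H(X) \le$ avg. code length $< H(X) + 1$ can be checked numerically. The following is a minimal sketch under assumed toy symbol counts; the function names are hypothetical, and only Huffman code lengths are computed (no bitstream is produced).

```python
import heapq
import math


def entropy_bits(counts):
    """Empirical entropy in bits per symbol."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_lengths(counts):
    """Return {symbol: codeword length} for a Huffman code."""
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}
    # Heap entries: (weight, tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**left, **right}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def average_length(counts, lengths):
    total = sum(counts.values())
    return sum(counts[s] * lengths[s] for s in counts) / total


# Dyadic probabilities (1/2, 1/4, 1/8, 1/8): Huffman meets the entropy exactly.
counts = {"a": 4, "b": 2, "c": 1, "d": 1}
lengths = huffman_lengths(counts)
h = entropy_bits(counts)
avg = average_length(counts, lengths)
print(lengths, h, avg)
```

For skewed distributions such as counts of 9 and 1, Huffman still spends a full bit per symbol while the entropy is below half a bit, which is the gap arithmetic coding closes.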

7 Assumptions:
- Records in the table are i.i.d.
- Only dependence between fields $[X_1, X_2, \ldots, X_n]$
- Example: independence, $P_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P_{X_i}(x_i)$
- Expected compression rate: $\sum_i H(X_i)$ per record
- Fine print: for each RV, need to save
  - The distribution and/or codebook (Huffman / arith. coder)
  - A dictionary (to translate back to the original values)

8 Fancier models: Bayesian networks
- Bayes net: DAG with $n$ nodes
  - Nodes are the RVs, edges model (conditional) independence
  - Example DAG with edges $X_1 \to X_2$, $X_1 \to X_3$, $X_2 \to X_4$, $X_3 \to X_4$:
    $P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1) P(x_4 \mid x_2, x_3)$
  - Dense graph → more general
- Compression rate: $H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1) + H(X_4 \mid X_2, X_3)$
- Usage for compression: compress according to the graph edges
- Metadata: larger codebooks / distributions (conditional!)
- Not a new idea [e.g. Davies & Moore, KDD 99]
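To make the compression rate of a Bayes net concrete, the sketch below evaluates $\sum_i H(X_i \mid \text{parents}(X_i))$ empirically on a tiny table. The table, the column indices and the helper names are all hypothetical; the DAG is the four-node example from the slide.

```python
import math
from collections import Counter


def entropy(rows, cols):
    """Empirical joint entropy (bits) of the given column indices."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum(k / n * math.log2(k / n) for k in counts.values())


def cond_entropy(rows, child, parents):
    """H(child | parents) = H(child, parents) - H(parents)."""
    if not parents:
        return entropy(rows, [child])
    return entropy(rows, [child] + list(parents)) - entropy(rows, list(parents))


def bayes_net_rate(rows, parents_of):
    """Bits per record when each column is coded given its parents."""
    return sum(cond_entropy(rows, c, ps) for c, ps in parents_of.items())


# Hypothetical table: x2 copies x1, x3 is a free bit, x4 = x2 XOR x3.
rows = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 0)]

# Slide's DAG: X1 -> X2, X1 -> X3, (X2, X3) -> X4 (0-based column indices).
dag = {0: [], 1: [0], 2: [0], 3: [1, 2]}
independent = {0: [], 1: [], 2: [], 3: []}

rate_dag = bayes_net_rate(rows, dag)
rate_independent = bayes_net_rate(rows, independent)
print(rate_dag, rate_independent)
```

On this toy table the DAG model needs half the bits of the independent model, because two of the four columns are fully determined by their parents.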

9 How to choose a Bayes Net for compression?
- Another assumption: each node can only have a single parent
  - DAG ⇒ tree
- Simpler compression: conditioned only on a single RV
  - Example tree with edges $X_1 \to X_2$, $X_1 \to X_3$, $X_2 \to X_4$:
    $P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1) P(x_4 \mid x_2)$
- Compression rate: $H(X_{\text{root}}) + \sum_{\text{Edges}(i,j)} H(X_i \mid X_j) = \sum_{i=1}^{n} H(X_i) - \sum_{\text{Edges}(i,j)} I(X_i; X_j)$
- Best tree?

10 Searching for the best tree
- Rate: $H(X_{\text{root}}) + \sum_{\text{Edges}(i,j)} H(X_i \mid X_j) = \sum_{i=1}^{n} H(X_i) - \sum_{\text{Edges}(i,j)} I(X_i; X_j)$
- Algorithm:
  - Calculate $I(X_i; X_j)$, $1 \le i, j \le n$
  - Set $w_{ij} = -I(X_i; X_j)$
  - Find minimum spanning tree! (Equivalently, a maximum spanning tree on the mutual information.)
  - Efficient algorithms exist, $O(n^2)$ [Fredman & Tarjan, 1987]
- Also: minimizes the KL divergence w.r.t. the true distribution
- Known as a Chow-Liu tree [Chow & Liu, 1968]
- Extensions exist [e.g. Williamson, 2000]
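The search can be sketched in a few lines: estimate all pairwise mutual informations, negate them so that a minimum spanning tree minimizes the rate, and run a spanning-tree algorithm. The sketch below uses Kruskal's algorithm with a small union-find for brevity, which is a substitution and not the $O(n^2)$ method cited on the slide; the toy table and all names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(rows, cols):
    """Empirical joint entropy (bits) of the given column indices."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum(k / n * math.log2(k / n) for k in counts.values())


def mutual_information(rows, i, j):
    return entropy(rows, [i]) + entropy(rows, [j]) - entropy(rows, [i, j])


def chow_liu_edges(rows, n_cols):
    """Minimum spanning tree on weights w_ij = -I(X_i; X_j) (Kruskal)."""
    weighted = sorted((-mutual_information(rows, i, j), i, j)
                      for i, j in combinations(range(n_cols), 2))
    root = list(range(n_cols))

    def find(a):
        # Union-find lookup with path halving.
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # The edge joins two components, so no cycle is created.
            root[ri] = rj
            edges.append((i, j))
    return edges


# Hypothetical table: column 0 is uniform on 0..3, columns 1 and 2 are its bits.
rows = [(a, a % 2, a // 2) for a in range(4)]
edges = chow_liu_edges(rows, 3)
rate = (sum(entropy(rows, [c]) for c in range(3))
        - sum(mutual_information(rows, i, j) for i, j in edges))
print(edges, rate)
```

Here both bit columns attach to column 0, and the resulting rate equals the entropy of column 0 alone, since the other two columns are functions of it.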

11 Example: MST with Mutual Information Weights
[Figure: the example table (UserID, Age, Location, Device, Time, DocID) next to a spanning tree over its six fields.]

12 Chow-Liu compression in real life
- Compressing $X_i$ given $X_j$: for each possible $x_j$, store $P_{X_i \mid X_j = x_j}(\cdot)$
- Dataset not infinite → metadata takes space!
- Example: 1B records, two variables with alphabet sizes 10k and 100k
  → Conditional distribution of size 10k x 100k = 1B values (comparable to the dataset itself)
  → Then maybe choosing these two is not the best idea
[Figure labels: entropy code, metadata.]

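13 Revised Chow-Liu tree
- Take into account model size
- Actual rate: $H(X_{\text{root}}) + \sum_{\text{Edges}(i,j)} H(X_i \mid X_j) + \frac{1}{\#\text{rows}} \sum_{\text{Edges}(i,j)} \text{Size}(P_{X_i \mid X_j}) = \sum_{i=1}^{n} H(X_i) - \sum_{\text{Edges}(i,j)} I(X_i; X_j) + \frac{1}{\#\text{rows}} \sum_{\text{Edges}(i,j)} \text{Size}(P_{X_i \mid X_j})$
  → Revised weights for the Chow-Liu tree: $w_{ij} = -I(X_i; X_j) + \frac{1}{\#\text{rows}} \text{Size}(P_{X_i \mid X_j})$
- Negative gain (i.e. $w_{ij} \ge 0$)? → might opt to drop dependencies → forest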
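A sketch of the revised search follows. It assumes the conditional distribution is stored as a dense table with a fixed number of bits per entry, which is only one possible choice of Size(·); the mutual-information values, alphabet sizes and the name `revised_forest` are hypothetical. Edges whose metadata cost outweighs their mutual information are skipped, so the result may be a forest.

```python
def revised_forest(mi_bits, alphabet_sizes, n_rows, bits_per_entry=32):
    """Kruskal on w_ij = -I(X_i; X_j) + Size(P_{X_i|X_j}) / n_rows.

    mi_bits: {(i, j): mutual information in bits}, one entry per field pair.
    Size is ASSUMED to be a dense |X_i| x |X_j| table of bits_per_entry bits.
    """
    def weight(i, j):
        table_bits = alphabet_sizes[i] * alphabet_sizes[j] * bits_per_entry
        return -mi_bits[(i, j)] + table_bits / n_rows

    candidates = sorted((weight(i, j), i, j) for (i, j) in mi_bits)
    root = list(range(len(alphabet_sizes)))

    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a

    edges = []
    for w, i, j in candidates:
        if w >= 0:
            break  # Remaining edges cost more metadata than they save.
        ri, rj = find(i), find(j)
        if ri != rj:
            root[ri] = rj
            edges.append((i, j))
    return edges


# Hypothetical numbers: fields 0 and 1 are the 10k / 100k alphabets from the
# slide's example; field 2 has a small alphabet of 50 values.
alphabet_sizes = [10_000, 100_000, 50]
mi_bits = {(0, 1): 8.0, (0, 2): 2.0, (1, 2): 1.5}

forest_large = revised_forest(mi_bits, alphabet_sizes, n_rows=1_000_000_000)
forest_small = revised_forest(mi_bits, alphabet_sizes, n_rows=1_000)
print(forest_large, forest_small)
```

With 1B rows the pair (0, 1) is dropped even though it has the highest mutual information, because its dense table would cost 32 bits per record; with only 1,000 rows no dependency is worth storing and the forest is empty.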

14 Example: MST with Mutual Information Weights
[Figure: two spanning trees over the fields UserID, DocID, Location, Time, Device, Age, one built with $w_{ij} = -I(X_i; X_j)$ and one with $w_{ij} = -I(X_i; X_j) + \frac{1}{\#\text{rows}} \text{Size}(P_{X_i \mid X_j})$, each with its entropy-code and metadata sizes.]

15 Storing the metadata
- How to store the distribution $P(X \mid Y)$?
  - Naïve: save entire matrix
  - Lossless compression: gzip / utilize sparsity
  - Lossy compression!

16 Improvements: Lossy model compression
- Compressing $X$ given $Y$ (compression is still lossless)
- True distribution: $P_{X \mid Y}$
- Lossy representation results in a distorted distribution: $Q_{X \mid Y}$
- Code rate: $H(X \mid Y) + D(P_{X \mid Y} \,\|\, Q_{X \mid Y} \mid P_Y) + \frac{1}{\#\text{rows}} \text{Size}(Q_{X \mid Y})$
- Want $Q_{X \mid Y}$ to minimize both model storage size and divergence!
- Related to MDL
- Can be used to modify edge weights
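The first two terms of the code rate come from the standard cross-entropy decomposition. Assuming an ideal entropy coder that spends $\log \frac{1}{Q_{X \mid Y}(x \mid y)}$ bits on symbol $x$ given $y$, while the data actually follow $P_{XY}$, the expected cost per record works out as:

```latex
\begin{aligned}
\mathbb{E}_{P_{XY}}\!\left[\log \frac{1}{Q_{X|Y}(X \mid Y)}\right]
&= \sum_{x,y} P_{XY}(x,y) \log \frac{1}{P_{X|Y}(x \mid y)}
 + \sum_{x,y} P_{XY}(x,y) \log \frac{P_{X|Y}(x \mid y)}{Q_{X|Y}(x \mid y)} \\
&= H(X \mid Y) + D\!\left(P_{X|Y} \,\|\, Q_{X|Y} \mid P_Y\right)
\end{aligned}
```

The third term, $\frac{1}{\#\text{rows}} \text{Size}(Q_{X \mid Y})$, is the cost of storing the model itself, amortized over the records.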

17 Proposed approach:
- Add a virtual variable $Z$ with a small alphabet size, s.t. $X - Z - Y$ (Markov chain)
- $P_{XY}(x, y) \approx \sum_z Q_{Y \mid Z}(y \mid z) \, Q_{X \mid Z}(x \mid z) \, Q_Z(z)$
- Storage size decreased from $|X| |Y|$ to $(|X| + |Y|) |Z|$
- $|Z|$: controls tradeoff between the two objectives
- Finding $\{Q_{Y \mid Z}(y \mid z), Q_{X \mid Z}(x \mid z), Q_Z(z)\}$:
  - Iterate through the three terms, minimize KL divergence, repeat until convergence
  - Not optimal! Optimization is hard
  - Similar in spirit to [Lee & Seung, NIPS 2001]
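One way to fit such a factorization is sketched below. It uses EM updates for the latent-variable mixture, which decrease $D(P_{XY} \,\|\, Q_{XY})$ at every step; this is a stand-in chosen for illustration, since the slide does not spell out the exact update rule, and all names and the toy joint distribution are hypothetical.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def model_prob(qz, qx, qy, x, y):
    """Q(x, y) = sum_z Q_Z(z) Q_{X|Z}(x|z) Q_{Y|Z}(y|z)."""
    return sum(qz[z] * qx[z][x] * qy[z][y] for z in range(len(qz)))


def kl_bits(p, qz, qx, qy):
    """D(P_XY || Q_XY) in bits."""
    return sum(pxy * math.log2(pxy / model_prob(qz, qx, qy, x, y))
               for x, row in enumerate(p) for y, pxy in enumerate(row) if pxy > 0)


def fit_latent(p, k, iters=100, seed=0):
    """EM for P(x, y) ~ sum_z Q_Z Q_{X|Z} Q_{Y|Z}, with |Z| = k."""
    rng = random.Random(seed)
    nx, ny = len(p), len(p[0])
    qz = [1.0 / k] * k
    qx = [normalize([rng.random() + 0.1 for _ in range(nx)]) for _ in range(k)]
    qy = [normalize([rng.random() + 0.1 for _ in range(ny)]) for _ in range(k)]
    for _ in range(iters):
        mass_z = [0.0] * k
        mass_xz = [[0.0] * nx for _ in range(k)]
        mass_yz = [[0.0] * ny for _ in range(k)]
        for x in range(nx):
            for y in range(ny):
                if p[x][y] == 0:
                    continue
                # E-step: posterior of z for this cell, weighted by P(x, y).
                joint = [qz[z] * qx[z][x] * qy[z][y] for z in range(k)]
                total = sum(joint)
                for z in range(k):
                    w = p[x][y] * joint[z] / total
                    mass_z[z] += w
                    mass_xz[z][x] += w
                    mass_yz[z][y] += w
        # M-step: renormalize the accumulated mass into the three factors.
        for z in range(k):
            if mass_z[z] > 0:
                qx[z] = [m / mass_z[z] for m in mass_xz[z]]
                qy[z] = [m / mass_z[z] for m in mass_yz[z]]
        qz = mass_z
    return qz, qx, qy


# Hypothetical 4 x 4 joint distribution with two blocks.
p = [[0.20, 0.05, 0.00, 0.00],
     [0.05, 0.20, 0.00, 0.00],
     [0.00, 0.00, 0.10, 0.15],
     [0.00, 0.00, 0.15, 0.10]]
for k in (1, 2):
    print(k, kl_bits(p, *fit_latent(p, k)))
```

The toy table is too small to save storage. With the alphabet sizes from the earlier example (10k and 100k) and a hypothetical $|Z| = 100$, the factored model needs (10,000 + 100,000) x 100 = 11M values instead of 1B.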

18 Example: Criteo dataset
- A Kaggle competition for click prediction by Criteo
- Dataset: 45M records
- [Figure: mutual information]
- [Figure: Chow-Liu tree]

19 Example: Criteo dataset
- Variables 3 and 8 have large alphabets: 5,500 and 14k (vs 16M records) → can't store the conditional distribution
- [Figure: results of NNMF]

20 Experiments
- Datasets: machine learning, US census, etc.
- #features:
- #lines: 60K to 45M
- Current version:
  - MST with adjusted weights
  - Sparse encoding of metadata + lossless comp.

22 Speed vs. compression efficiency

23 Summary
- Dataset compression via probabilistic assumptions
  - Bayes nets, Chow-Liu trees
  - Metadata encoding + weight modification
- Lossless compression via lossy model compression
  - Add a new RV with a Markov restriction
  - Balance metadata size vs. model inaccuracy
- Take-home message:
  - Choose the right metric
  - Revisit old ideas
