Random Sampling on Big Data: Techniques and Applications Ke Yi


1 Random Sampling on Big Data: Techniques and Applications
Ke Yi, Hong Kong University of Science and Technology

2 Big Data in one slide
- The 3 V's:
  - Volume
  - Velocity
  - Variety: integers and real numbers; points in a multi-dimensional space; records in a relational database; graph-structured data

3 Dealing with Big Data
- The first approach: scale up / out the computation
- Many great technical innovations:
  - Distributed/parallel systems
  - Simpler programming models: MapReduce, Pregel, Dremel, Spark, BSP
  - Failure tolerance and recovery
  - Drop certain features: ACID, CAP, NoSQL
- This talk is not about this approach!

4 Downsizing data
- A second approach to computational scalability: scale down the data!
  - A compact representation of a large data set
  - Too much redundancy in big data anyway
  - What we finally want is small: human-readable analysis / decisions
  - Necessarily gives up some accuracy: approximate answers
  - Examples: samples, sketches, histograms, various transforms
  - See the tutorial by Graham Cormode for other data summaries
- Complementary to the first approach
  - Can scale out computation and scale down data at the same time
  - Algorithms need to work under new system architectures
  - Good old RAM model no longer applies

5 Outline for the talk
- Simple random sampling
  - Sampling from a data stream
  - Sampling from distributed streams
  - Sampling for range queries
- Not-so-simple sampling
  - Importance sampling: frequency estimation on distributed data
  - Paired sampling: medians and quantiles
  - Random walk sampling: SQL queries (joins)
- Will jump back and forth between theory and practice

6 Simple Random Sampling
- Sampling without replacement
  - Randomly draw an element
  - Don't put it back
  - Repeat s times
- Sampling with replacement
  - Randomly draw an element
  - Put it back
  - Repeat s times
- Trivial in the RAM model
- The statistical difference is very small for $n \gg s$

7 Random Sampling from a Data Stream
- A stream of elements coming in at high speed
- Limited memory
- Need to maintain the sample continuously
- Applications
  - Data stored on disk
  - Network traffic

9 Reservoir Sampling
- Maintain a sample of size s drawn (without replacement) from all elements in the stream so far
- Keep the first s elements in the stream; set $n \leftarrow s$
- Algorithm for a new element:
  - $n \leftarrow n + 1$
  - With probability $s/n$, use it to replace an item in the current sample chosen uniformly at random
  - With probability $1 - s/n$, throw it away
- Perhaps the first streaming algorithm [Waterman ??; Knuth's book]
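The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the function name `reservoir_sample` and the `rng` parameter are assumptions introduced here.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Keep a without-replacement sample of size s from a stream of unknown length."""
    sample = []
    n = 0  # number of elements seen so far
    for item in stream:
        n += 1
        if n <= s:
            # Keep the first s elements unconditionally.
            sample.append(item)
        elif rng.random() < s / n:
            # With probability s/n, replace a uniformly chosen slot.
            sample[rng.randrange(s)] = item
        # Otherwise (probability 1 - s/n) the element is thrown away.
    return sample

if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

Passing a seeded `random.Random` makes a run reproducible. The memory used is `s` slots regardless of the stream length.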

10 Correctness Proof
- By induction on n
  - $n = s$: trivially correct
  - Assume each element so far is sampled with probability $s/n$
  - Consider $n + 1$:
    - The new element is sampled with probability $\frac{s}{n+1}$
    - Any element in the current sample is sampled with probability $\frac{s}{n}\left(1 - \frac{s}{n+1}\right) + \frac{s}{n}\cdot\frac{s}{n+1}\cdot\frac{s-1}{s} = \frac{s}{n+1}$. Yeah!
- This is a wrong (incomplete) proof
  - Each element being sampled with probability $s/n$ is not a sufficient condition for random sampling
  - Counterexample: divide the elements into groups of s and pick one group at random

12 Reservoir Sampling Correctness Proof
- Many proofs found online are actually wrong
  - They only show that each item is sampled with probability $s/n$
  - Need to show that every subset of size s has the same probability of being the sample
- The correct proof relates to the Fisher-Yates shuffle
[Figure: Fisher-Yates shuffle illustration with s = 2 on the elements a, b, c, d]
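The gap between "every item has probability s/n" and "every size-s subset is equally likely" can be checked exactly on a tiny instance. The sketch below computes the exact distribution of the reservoir with rational arithmetic and compares it with the group-picking counterexample. The function names and the choice n = 4, s = 2 are assumptions made for illustration, and the check is not a proof.

```python
from fractions import Fraction

def reservoir_distribution(n, s):
    """Exact distribution over reservoir contents after elements 0..n-1 have arrived."""
    dist = {tuple(range(s)): Fraction(1)}  # reservoir as an ordered tuple of slots
    for m in range(s + 1, n + 1):
        new_item = m - 1
        nxt = {}
        for res, p in dist.items():
            # The new element is thrown away with probability 1 - s/m.
            nxt[res] = nxt.get(res, 0) + p * (1 - Fraction(s, m))
            # Otherwise it replaces each slot with probability (s/m) * (1/s) = 1/m.
            for slot in range(s):
                repl = res[:slot] + (new_item,) + res[slot + 1:]
                nxt[repl] = nxt.get(repl, 0) + p * Fraction(1, m)
        dist = nxt
    # Only the set of sampled elements matters, not the slot order.
    by_set = {}
    for res, p in dist.items():
        key = frozenset(res)
        by_set[key] = by_set.get(key, 0) + p
    return by_set

def group_distribution(n, s):
    """The counterexample: split 0..n-1 into consecutive groups of s and pick one group."""
    groups = [frozenset(range(i, i + s)) for i in range(0, n, s)]
    return {g: Fraction(1, len(groups)) for g in groups}

def inclusion_probability(dist, element):
    """Probability that a given element ends up in the sample."""
    return sum(p for subset, p in dist.items() if element in subset)

n, s = 4, 2
reservoir = reservoir_distribution(n, s)
grouped = group_distribution(n, s)
for name, dist in (("reservoir", reservoir), ("grouped", grouped)):
    print(name, "subsets:", len(dist),
          "inclusion prob. of element 0:", inclusion_probability(dist, 0))
```

Both schemes give every element inclusion probability s/n, but only the reservoir spreads its probability over all size-s subsets.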

13 Sampling from Distributed Streams
- One coordinator and k sites
- Each site can communicate with the coordinator
- Goal: maintain a random sample of size s over the union of all streams with minimum communication
- Difficulty: we don't know n, so we can't run the reservoir sampling algorithm
- Key observation: we don't have to know n in order to sample!
[Cormode, Muthukrishnan, Yi, Zhang, PODS '10, JACM '12] [Woodruff, Tirthapura, DISC '11]

14 Reduction from Coin Flip Sampling
- Flip a fair coin for each element until we get 1
- An element is active on a level if its coin flip on that level is 0
- If a level has at least s active elements, we can draw a sample from those active elements
- Key: the coordinator does not want all the active elements, which are too many!
  - Choose a level appropriately

15 The Algorithm
- Initialize $i \leftarrow 0$
- In round i:
  - Sites send in every item w.p. $2^{-i}$ (this is a coin-flip sample with prob. $2^{-i}$)
  - The coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (the lower sample is a sample with prob. $2^{-(i+1)}$)
  - When the lower sample reaches size s, the coordinator broadcasts to advance to round $i \leftarrow i + 1$:
    - Discard the higher sample
    - Split the lower sample into a new lower sample and a higher sample
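A single-process simulation can make the round structure concrete. The sketch below is an assumption-laden illustration, not the published protocol: the name `simulate`, the flat `arrivals` list (which site an item arrives at does not affect the logic), and the message accounting (one message per forwarded item, k per broadcast) are all choices made here.

```python
import random

def simulate(arrivals, k, s, rng):
    """Simulate the coordinator protocol; return (sample, final round, messages)."""
    i = 0                  # current round: sites forward items w.p. 2**-i
    lower, higher = [], []
    messages = 0
    for item in arrivals:
        # Site side: forward the item with probability 2^-i.
        if rng.random() >= 2.0 ** -i:
            continue
        messages += 1
        # Coordinator side: the item goes to either sample with equal probability.
        (lower if rng.random() < 0.5 else higher).append(item)
        while len(lower) >= s:
            # Advance the round: broadcast, discard higher, split lower.
            i += 1
            messages += k
            old, lower, higher = lower, [], []
            for x in old:
                (lower if rng.random() < 0.5 else higher).append(x)
    # lower + higher is a coin-flip sample with prob. 2^-i; draw s items from it.
    pool = lower + higher
    return rng.sample(pool, min(s, len(pool))), i, messages

if __name__ == "__main__":
    sample, rounds, messages = simulate(range(100_000), k=10, s=20,
                                        rng=random.Random(7))
    print(sample, rounds, messages)
```

After every round change the pool holds exactly the s items of the old lower sample, so it never drops below s items once s items have arrived.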

16 Communication Cost of Algorithm
- Communication cost of each round: $O(k + s)$
  - Expect to receive $O(s)$ sampled items before the round ends
  - Broadcast to end the round: $O(k)$
- Number of rounds: $O(\log n)$
  - In each round, need $\Theta(s)$ items being sampled to end the round
  - Each item has prob. $2^{-i}$ to contribute: need $\Theta(2^i s)$ items
- Total communication: $O((k + s) \log n)$
  - Can be improved to $O(k \log_{k/s} n + s \log n)$
  - A matching lower bound
  - Sliding windows

17 Random Sampling for Range Queries
[Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD '15 Best Demo Award]

18 Online Range Sampling
- Problem definition: preprocess a set of points in the plane so that, for any range query, we can return samples (with or without replacement) drawn from all points in the range until the user terminates the query
- Parameters:
  - n: data size
  - q: query size
  - s: sample size (not known beforehand)
  - $n \gg q \gg s$
- Naïve solutions:
  - Query then sample: $O(f(n) + q)$
  - Sample then query: $O(sn/q)$ (store data in random order)
- New solution: $O(f(sn/q) + s)$
  - $f(x)$: # canonical nodes in a tree of size x, between $\log x$ and $\sqrt{x}$
[Wang, Christensen, Li, Yi, VLDB '16]

19 Indexing Spatial Data
- Numerous spatial indexing structures in the literature
- R-tree

20 RS-tree
- Attach a sample to node u drawn from leaves below u
- Total space: $O(n)$
- Construction time: $O(n)$

21-26 RS-tree: A 1D Example
[Figure sequence: a 1D RS-tree answering a range query step by step, showing the reported samples and the active nodes at each step]
- Step annotations: "Pick 7 or 14 with equal prob."; "Pick 3, 8, or 14 with prob. 1:1:"; "Pick 3, 8, or 12 with equal prob."

27 Not-So-Simple Random Sampling
- When simple random sampling is not optimal/feasible

28 Frequency Estimation on Distributed Data
- Given: a multiset S of n items drawn from the universe [u]
  - For example: IP addresses of network packets
- S is partitioned arbitrarily and stored on k nodes
  - Local count $x_{ij}$: frequency of item i on node j
  - Global count $y_i = \sum_j x_{ij}$
- Goal: estimate $y_i$ with additive error $\varepsilon n$ for all i
  - Can't hope for relative error for all $y_i$
  - Heavy hitters are estimated well
[Huang, Yi, Liu, Chen, INFOCOM '11]

29 Frequency Estimation: Standard Solutions
- Local heavy hitters
  - Let $n_j = \sum_i x_{ij}$ be the data size at node j
  - Node j sends in all items with frequency at least $\varepsilon n_j$
  - Total error is at most $\sum_j \varepsilon n_j = \varepsilon n$
  - Communication cost: $O(k/\varepsilon)$
- Simple random sampling
  - A simple random sample of size $O(1/\varepsilon^2)$ can be used to estimate the frequency of any item with error $\varepsilon n$
  - Extra log factor for all items
  - Algorithm:
    - Coordinator first gets $n_j$ for all j
    - Decides how many samples to get from each j
    - Gets the samples from the nodes
  - Communication cost: $O(k + 1/\varepsilon^2)$
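The local-heavy-hitters baseline is simple enough to sketch directly. In the illustration below, the function name `local_heavy_hitters` and the representation of each node as a dict from item to local count are assumptions made here.

```python
from collections import Counter

def local_heavy_hitters(nodes, eps):
    """Each node reports only items whose local frequency is at least eps * n_j."""
    estimate = Counter()
    messages = 0
    for counts in nodes:
        n_j = sum(counts.values())          # data size at this node
        for item, x in counts.items():
            if x >= eps * n_j:
                estimate[item] += x         # the exact local count is sent
                messages += 1
    return estimate, messages

nodes = [
    {"a": 90, "b": 6, "c": 4},
    {"a": 12, "b": 80, "c": 8},
]
estimate, messages = local_heavy_hitters(nodes, eps=0.1)
print(dict(estimate), messages)
```

Every unreported local count is below eps * n_j, so each global estimate undercounts by less than eps * n in total, and each node sends at most 1/eps items.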

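30 Importance Sampling
- Each local count $x_{ij}$ is sampled with probability $g(x_{ij})$
- Horvitz-Thompson estimator: $X_{ij} = x_{ij} / g(x_{ij})$ if $x_{ij}$ is sampled, and $X_{ij} = 0$ otherwise
- Estimator for the global count $y_i$: $Y_i = X_{i,1} + \cdots + X_{i,k}$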

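31 Importance Sampling: What is a good g(x)?
- Natural choice: $g_1(x) = \frac{\sqrt{k}}{\varepsilon n} x$
  - More precisely: $g_1(x) = \min\left\{\frac{\sqrt{k}}{\varepsilon n} x, 1\right\}$, since a sampling probability cannot exceed 1
  - Can show: $\mathrm{Var}[Y_i] = O((\varepsilon n)^2)$ for any i
  - Communication cost: $O(\sqrt{k}/\varepsilon)$
  - This is (worst-case) optimal
- Interesting discovery: $g_2(x) = g_1(x)^2$
  - Also has $\mathrm{Var}[Y_i] = O((\varepsilon n)^2)$ for any i
  - Also has communication cost $O(\sqrt{k}/\varepsilon)$ in the worst case
  - But can be much lower than $g_1(x)$ on some inputs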
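The two sampling functions and the Horvitz-Thompson estimator can be sketched as follows. This is an illustrative sketch only: the function names, the dict-per-node input format, and the assumption that the coordinator already knows n and k are introduced here.

```python
import math
import random

def g1(x, k, eps, n):
    """Linear sampling probability sqrt(k) * x / (eps * n), capped at 1."""
    return min(math.sqrt(k) * x / (eps * n), 1.0)

def g2(x, k, eps, n):
    """Quadratic sampling probability: the square of g1."""
    return g1(x, k, eps, n) ** 2

def estimate_counts(nodes, eps, g, rng):
    """Each node sends x_ij w.p. g(x_ij); the coordinator adds up x_ij / g(x_ij)."""
    k = len(nodes)
    n = sum(sum(counts.values()) for counts in nodes)
    estimates = {}
    messages = 0
    for counts in nodes:
        for item, x in counts.items():
            p = g(x, k, eps, n)
            if rng.random() < p:
                # Horvitz-Thompson: scale a sampled count by 1 / (its sampling prob.).
                estimates[item] = estimates.get(item, 0.0) + x / p
                messages += 1
    return estimates, messages

if __name__ == "__main__":
    nodes = [{"a": 190, "b": 60, "c": 3}, {"a": 190, "b": 60, "d": 2}]
    for g in (g1, g2):
        print(g.__name__, estimate_counts(nodes, 0.1, g, random.Random(3)))
```

Counts of at least eps * n / sqrt(k) are always sent and contribute no variance. Smaller counts are sent less often under `g2` than under `g1`, which is where the communication savings on some inputs come from.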

32 $g_2(x)$ is Instance-Optimal

33 Median and Quantiles (order statistics)
- Exact quantiles: $F^{-1}(\phi)$ for $0 < \phi < 1$, where F is the CDF
- Approximate version: tolerate any answer between $F^{-1}(\phi - \varepsilon)$ and $F^{-1}(\phi + \varepsilon)$

34 Estimating Median by Random Sampling
- Simple random sampling
  - An ε-approximation needs a sample of size $\Theta(1/\varepsilon^2)$
- Paired sampling
  - Divide the data into chunks of size $s = O(1/\varepsilon)$
  - Sort each chunk
  - Do binary merges into one chunk
  - Each merge takes the odd-positioned or the even-positioned elements with equal probability
  - Similar ideas are used in discrepancy methods
- This needs $O(n \log s)$ time; how is it useful?
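The merge step of paired sampling can be sketched as below. The function names are hypothetical, and for simplicity the sketch assumes the number of chunks is a power of two so that all chunks being merged carry the same weight; a real implementation would need to handle other sizes.

```python
import random

def merge_halve(a, b, rng):
    """Merge two sorted chunks, then keep the even or odd positions with equal prob."""
    merged = sorted(a + b)
    offset = rng.randrange(2)  # 0 keeps positions 0, 2, 4, ...; 1 keeps 1, 3, 5, ...
    return merged[offset::2]

def paired_sample(data, s, rng):
    """Reduce data to one chunk of size s by repeated binary merges."""
    chunks = [sorted(data[i:i + s]) for i in range(0, len(data), s)]
    count = len(chunks)
    if len(data) % s != 0 or count == 0 or count & (count - 1) != 0:
        raise ValueError("this sketch assumes len(data) = s * 2**h")
    while len(chunks) > 1:
        chunks = [merge_halve(chunks[i], chunks[i + 1], rng)
                  for i in range(0, len(chunks), 2)]
    return chunks[0]

if __name__ == "__main__":
    rng = random.Random(5)
    data = list(range(1024))
    rng.shuffle(data)
    summary = paired_sample(data, 64, rng)
    print("estimated median:", summary[len(summary) // 2])
```

Each element of the final chunk stands for `len(data) / s` original elements, so a rank in the summary is scaled by that factor to estimate a rank in the data.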

35 Application 1: Streaming Computation
- Can merge chunks up as items arrive in the stream
- At any time, keep at most $O(\log n)$ chunks
- Space: $O(\frac{1}{\varepsilon} \log n)$
  - Can be improved to $O(\frac{1}{\varepsilon} \log \frac{1}{\varepsilon})$ by combining with random sampling
  - Can find all quantiles [Felber, Ostrovsky, '15]
- Reservoir sampling needs $O(1/\varepsilon^2)$ space
- Best deterministic algorithm needs $O(\frac{1}{\varepsilon} \log n)$ space [Greenwald, Khanna, '01]
[Wang, Luo, Yi, Cormode, SIGMOD '13]

36 Application 2: Distributed Data
- Data stored on k nodes
- Each node reduces its data to size $O(\frac{1}{\varepsilon \sqrt{k}})$ using paired sampling and sends it to the coordinator
- The coordinator reduces all the data received to a size of $O(1/\varepsilon)$ using paired sampling
- Communication cost: $O(\sqrt{k}/\varepsilon)$
  - Looks familiar?

37 Generalization: ε-approximations
- A sample that preserves the density of point sets
  - For any range (e.g., a circle), |fraction of sample points - fraction of all points| ≤ ε
- A simple random sample needs size $O(1/\varepsilon^2)$
- Paired sampling yields size $O(1/\varepsilon^{2d/(d+1)})$

38 ε-approximations on distributed data
[Huang, Yi, FOCS '14]

39 Database Workloads
- Transactional (OLTP)
  - Deduct x dollars from account A, credit x dollars to account B
  - Challenge: efficiency and correctness (ACID)
- Analytical (OLAP)
  - Large fraction of data
  - Many tables
  - Complex conditions
  - Challenge: efficiency; correctness?
Wander Join: Online Aggregation via Random Walks

40 Complex Analytical Queries (TPC-H)
    SELECT SUM(l_extendedprice * (1 - l_discount))
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
- This query finds the total revenue loss due to returned orders in a given region.

41 Online Aggregation [Haas, Hellerstein, Wang, SIGMOD '97]
    SELECT ONLINE SUM(l_extendedprice * (1 - l_discount))
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
    WITHTIME CONFIDENCE 95 REPORTINTERVAL 1000
[Figure: the running estimate $\tilde{Y}$ with a confidence band from $\tilde{Y} - \varepsilon$ to $\tilde{Y} + \varepsilon$]
- Confidence interval: $\Pr[\tilde{Y} - \varepsilon < Y < \tilde{Y} + \varepsilon] > 0.95$

42 Ripple Join [Haas, Hellerstein, SIGMOD '99]
- Store the tuples in each table in random order
- In each step:
  - Read the next tuple from a table in a round-robin fashion
  - Join it with the sampled tuples from the other tables
- Works well for a full Cartesian product
  - But most joins are sparse

43 A Running Example
- Three tables with columns (Nation, CID), (BuyerID, OrderID), and (OrderID, ItemID, Price)
- What's the total revenue of all orders from customers in China?
- N: size of each table, e.g., $10^9$
- n: # tuples taken from each table
- s: # estimators; $n^3 / N^2 = s$, so $n = N^{2/3} s^{1/3}$
[Figure: sample rows of the three tables]

44-45 Join as a Graph
- Conceptual only
- Never materialized
[Figure: join graph over the tables $R_1$, $R_2$, $R_3$]

46 Join as a Graph
  Nation  CID
  US      1
  US      2
  China   3
  UK      4
  China   5
  US      6
  China   7
  UK      8
  Japan   9
  UK      10
[Figure: the (Nation, CID) table above connected to the (BuyerID, OrderID) and (OrderID, ItemID, Price) tables as a join graph]

47-49 Sampling by Random Walks
[Figure sequence: random walks over the join graph of the (Nation, CID), (BuyerID, OrderID), and (OrderID, ItemID, Price) tables]

50 Sampling by Random Walks
- N: size of each table
- n: # tuples taken from each table = # random walks
- s: # estimators (n = s)
- Unbiased estimator: the value at the end of the walk divided by the walk's sampling probability, e.g., $500 / (sampling prob.), where the sampling probability is the product of the per-step probabilities along the walk (1/3, 1/4, ... in the figure)
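A random-walk estimator for the running example can be sketched over small in-memory tables. Everything concrete below is an assumption for illustration: the function name `wander_join_sum`, the tuple layouts, the use of Python dicts in place of database indexes, and the toy data. It is not the PostgreSQL implementation described in the talk.

```python
import random
from collections import defaultdict

def wander_join_sum(customers, orders, items, nation, walks, rng):
    """Estimate SUM(price) over customers x orders x items for one nation."""
    # Stand-ins for indexes on Nation, BuyerID and OrderID.
    start = [cid for nat, cid in customers if nat == nation]
    orders_of = defaultdict(list)
    for buyer_id, order_id in orders:
        orders_of[buyer_id].append(order_id)
    prices_of = defaultdict(list)
    for order_id, _item_id, price in items:
        prices_of[order_id].append(price)
    if not start:
        return 0.0

    total = 0.0
    for _ in range(walks):
        cid = rng.choice(start)                 # step probability 1 / len(start)
        my_orders = orders_of.get(cid, [])
        if not my_orders:
            continue                            # failed walk contributes 0
        order_id = rng.choice(my_orders)        # step probability 1 / len(my_orders)
        my_prices = prices_of.get(order_id, [])
        if not my_prices:
            continue
        price = rng.choice(my_prices)           # step probability 1 / len(my_prices)
        # Dividing by the walk's probability = multiplying by the fan-outs.
        total += price * len(start) * len(my_orders) * len(my_prices)
    return total / walks

if __name__ == "__main__":
    customers = [("US", 1), ("China", 3), ("China", 5), ("China", 7)]
    orders = [(3, 30), (3, 31), (5, 50), (1, 10)]
    items = [(30, 301, 500), (31, 302, 100), (50, 303, 200), (50, 304, 50), (10, 305, 75)]
    print(wander_join_sum(customers, orders, items, "China", 10_000, random.Random(1)))
```

Each walk is independent of the others, which is what makes the confidence interval straightforward to compute from the per-walk values.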

51 Walk Plan Optimization
- Structure of the data graph
- Selection predicates
  - Starting table: use index
  - Table in the middle: reject random walk
- Data distribution
  - Non-uniformity may not be a bad thing!
[Figure: two example join graphs; in one, $\mathrm{Var}(R_1 \to R_2) < \mathrm{Var}(R_2 \to R_1)$, and in the other, $\mathrm{Var}(R_1 \to R_2) > \mathrm{Var}(R_2 \to R_1)$]

52 Walk Plan Optimizer
- Enumerate all plans
- Conduct ~100 trial random walks using each plan
- Measure the variance of each plan
- Select the best plan
- All trial runs are still useful

53 Convergence Comparison
[Li, Wu, Yi, Zhao, SIGMOD '16 Best Paper Award]

54 Wander Join in PostgreSQL
- Logarithmic growth due to B-tree lookup to find random neighbours

55 Running on Insufficient Memory (4GB)
- Insufficient memory incurs a heavy, one-time penalty
- Growth is still logarithmic
- Fundamentally: random sampling is at odds with hard disks
- But does it matter?
  - Spark, in-memory DB, RAM cloud
- The algorithm is embarrassingly parallel
- Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB '09]

56 Accuracy Achieved in 1/10 Time of Full Join

57 Wander Join vs Ripple Join
                                          Wander Join                   Ripple Join
  Sampling methodology                    Independent but non-uniform   Uniform but non-independent
  Index needed?                           Yes                           Index or random storage
  Confidence interval computation         Easy, O(n) time               Complicated, O(n^k) time (k: # tables)
  Convergence time (20GB data, 3 tables)  ~3s                           ~50s
  Scalability                             Logarithmic                   Slightly less than linear
  System implementation                   PostgreSQL (finished),        Informix (internal project),
                                          Oracle (in progress),         DBO
                                          SparkSQL (in progress)

58 Online Aggregation vs Data Cube
                    Online Aggregation                          Data Cube
  Queries           Online, ad hoc                              Offline, fixed
  Latency           Seconds                                     Hours, then milliseconds
  Query mode        One at a time                               Batch
  Accuracy          Small error                                 No error
  Data schema       Any (relational, graph)                     Multidimensional cube
  Work with OLTP    Integrated                                  Separate
  Target scenario   Online, ad hoc, interactive data analytics  Monthly report

59 Thank you!

60 Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD '90]
[Figure: the (Nation, CID), (BuyerID, OrderID), and (OrderID, ItemID, Price) tables from the running example]

61 Sampling from a B-tree [Olken, '93]
- Sampling from an aggregate (ranked) B-tree is easy
- But it incurs a heavy cost for transactions
- Need to modify existing B-tree implementations

62 Rejection Sampling [Olken, '93]
- Imagine each node has maximum fanout
- Reject as soon as the walk goes out of bounds
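The rejection idea can be sketched on a toy tree. The representation below is an assumption for illustration: internal nodes are Python lists, leaves are plain values, all leaves sit at the same depth (as in a B-tree), and `MAX_FANOUT` is a hypothetical bound on the number of children.

```python
import random

MAX_FANOUT = 4  # hypothetical maximum fanout of any node

def try_walk(tree, choices):
    """Follow one root-to-leaf walk; return the leaf, or None if the walk is rejected."""
    node = tree
    for c in choices:
        if c >= len(node):
            return None        # walked out of bounds: reject
        node = node[c]
    return node

def sample_leaf(tree, depth, rng):
    """Retry walks until one is accepted; every leaf is then equally likely."""
    while True:
        choices = [rng.randrange(MAX_FANOUT) for _ in range(depth)]
        leaf = try_walk(tree, choices)
        if leaf is not None:
            return leaf

if __name__ == "__main__":
    tree = [[1, 2, 3], [4], [5, 6]]   # depth 2, uneven fanout
    rng = random.Random(9)
    print([sample_leaf(tree, 2, rng) for _ in range(10)])
```

Because every walk picks among `MAX_FANOUT` slots at each level, each leaf is reached with the same probability `MAX_FANOUT ** -depth` per attempt, regardless of how uneven the actual fanouts are. The price is the rejected attempts.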

63 Non-Uniform Sampling
- As long as we can compute the sampling probability, wander join still works!

64 Compare with BlinkDB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, '13]
                     Wander Join              BlinkDB
  Methodology        Query -> Sampling        Sampling -> Query
  Sampling method    Random walks             Stratified sampling
  Joins supported    Any                      Big table joining a small table (no sampling on small table)
  Error              Reduces over time        Fixed
  Data schema        Any (relational, graph)  Star / snowflake
  Work with OLTP     Integrated               Separate
  Group-by support   Unbalanced               Balanced
