Acceptably inaccurate. Probabilistic data structures

Alyson Harmon
5 years ago
1 Acceptably inaccurate Probabilistic data structures

2 Hello

3 Today's talk: Motivation, Bloom filters, Count-Min Sketch, HyperLogLog

4 Motivation

5 Tape HDD SSD Memory

6 Tape HDD SSD Memory Speed

7 Tape HDD SSD Memory Cost

8 Tape HDD SSD Memory Ease of use (as a developer)

9 Tape HDD SSD Memory Set<String>

10 HDD SSD Memory Storage per node

11 HDD SSD Memory How can we do more here?

12 Probabilistic data structures

13 I, as a developer, accept a predictable level of inaccuracy.

14 Bloom filters

15 Bloom filters are for set membership

16 Set<String> visitors = new HashSet<>();
visitors.add(" ");
visitors.add(" ");
visitors.add(" ");
visitors.contains(" "); // true
visitors.contains(" "); // false

17 Table: Number of UUIDs | JVM heap used (MB)

18 What is a Bloom filter? Space/time trade-offs in hash coding with allowable errors (Bloom, 1970)

19 A bit array of size n

20 k hash functions h1() h2()

21 Adding an element " " h1() h2()

22 Checking for the presence of an element " " h1() h2() No!

23 Checking for the presence of an element " " h1() h2() Maybe!
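
The add and check steps on the slides above can be sketched in a few lines of Java. This is a toy illustration, not Guava's implementation: the class name `SimpleBloomFilter`, the FNV-style hashes and the way k hash functions are derived from two base hashes are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Toy Bloom filter: a bit array of size n plus k hash functions. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;       // n: number of bits
    private final int hashCount;  // k: number of hash functions

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Adding an element sets the bit chosen by each hash function. */
    void put(String element) {
        for (int index : indexes(element)) {
            bits.set(index);
        }
    }

    /** false means "No!" (definitely absent); true means "Maybe!". */
    boolean mightContain(String element) {
        for (int index : indexes(element)) {
            if (!bits.get(index)) {
                return false;
            }
        }
        return true;
    }

    // Assumption: the i-th hash is h1 + i * h2, built from two base hashes.
    private int[] indexes(String element) {
        byte[] bytes = element.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(bytes, 0x811C9DC5);
        int h2 = fnv1a(bytes, 0x01000193) | 1; // forced odd so it is never zero
        int[] result = new int[hashCount];
        for (int i = 0; i < hashCount; i++) {
            result[i] = Math.floorMod(h1 + i * h2, size);
        }
        return result;
    }

    // FNV-1a style hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomFilter visitors = new SimpleBloomFilter(1024, 3);
        visitors.put("alice");
        visitors.put("bob");
        System.out.println(visitors.mightContain("alice")); // true: added elements are never missed
        System.out.println(visitors.mightContain("carol")); // false, unless a false positive occurs
    }
}
```

An element that was added always comes back as "Maybe!", because its bits were set; an element that was never added comes back as "No!" unless other elements happened to set all of its bits.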

24 <dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>19.0</version>
</dependency>

25 private BloomFilter<String> visitors = BloomFilter.create(
    Funnels.stringFunnel(Charset.forName("UTF-8")), 10000);
visitors.put(" ");
visitors.put(" ");
visitors.put(" ");
visitors.mightContain(" "); // true
visitors.mightContain(" "); // false

26 Table: Number of UUIDs | Set JVM heap used (MB) | Bloom filter JVM heap used (MB)

27 Table: Suggested size of Bloom filter | Bit count

28 3% false positive probability by default.

29 private BloomFilter<String> visitors = BloomFilter.create(
    Funnels.stringFunnel(Charset.forName("UTF-8")), 10000, 0.005);

30 Table: Suggested FPP of Bloom filter | Hash functions
3% | 5
1% | 7
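
The relationship between expected insertions, false positive probability, bit count and hash-function count can be worked through with the textbook sizing formulas, m = -n ln(p) / (ln 2)² and k = (m / n) ln 2. The sketch below is only an illustration under that assumption; the class and method names are invented for the example and are not part of Guava's public API.

```java
/** Textbook Bloom filter sizing; names are illustrative. */
final class BloomFilterSizing {

    /** m = ceil(-n * ln(p) / (ln 2)^2): bits needed for n insertions at false positive probability p. */
    static long optimalBits(long expectedInsertions, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedInsertions * Math.log(fpp) / (ln2 * ln2));
    }

    /** k = round(m / n * ln 2), but never fewer than one hash function. */
    static int optimalHashFunctions(long expectedInsertions, long bits) {
        return Math.max(1, (int) Math.round((double) bits / expectedInsertions * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10_000;
        for (double fpp : new double[] {0.03, 0.01}) {
            long m = optimalBits(n, fpp);
            int k = optimalHashFunctions(n, m);
            System.out.println(fpp + " -> " + m + " bits, " + k + " hash functions");
        }
    }
}
```

For 10,000 expected insertions these formulas give 5 hash functions at 3% and 7 at 1%, in line with the first two rows of the slide's table.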

31 Use cases
"One hit wonders"
Avoiding lookups (HBase, Cassandra)
Real-time matching
Algorithmic nuggets in content delivery (Maggs, 2015)

32 Count-Min Sketch

33 Count-Min Sketch for count tracking

34 Multiset<String> hits = HashMultiset.create();
hits.add(" ");
hits.add(" ");
hits.add(" ");
hits.add(" ");
hits.count(" "); // 2
hits.count(" "); // 0

35 Table: Number of UUIDs | JVM heap used (MB)

36 What is a Count-Min Sketch? Approximating data with the count-min data structure (Cormode, Muthukrishnan, 2012)

37 An array of counters: depth (d) rows by width (w) columns h1() h2() h3() h4() h5()

38 Adding an element " " h1() h2() h3()

39 Adding the same element again " " h1() h2() h3()

40 Adding a different element " " h1() h2() h3()

41 Getting the count of the first element " " h1() h2() h3() min(2, 3, 2) = 2

42 Getting the count of the second element " " h1() h2() h3() min(1, 3, 1) = 1
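
The add and min-lookup steps walked through above might look roughly like this in Java. It is a minimal sketch, not the stream-lib implementation: the class name and the per-row hash (the row number mixed into `String.hashCode()`) are assumptions chosen to keep the example short.

```java
/** Toy count-min sketch: depth rows of width counters, one hash function per row. */
final class SimpleCountMinSketch {
    private final long[][] counters;
    private final int width;
    private final int depth;

    SimpleCountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.counters = new long[depth][width];
    }

    /** Adding an element increments one counter in every row. */
    void add(String item, long count) {
        for (int row = 0; row < depth; row++) {
            counters[row][index(item, row)] += count;
        }
    }

    /** The estimate is the minimum over the rows; collisions can only inflate counters. */
    long estimateCount(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counters[row][index(item, row)]);
        }
        return min;
    }

    // Assumption: a row-specific hash obtained by mixing the row number into hashCode().
    // Items with identical hashCode() values collide in every row, which a real sketch avoids.
    private int index(String item, int row) {
        int h = item.hashCode() ^ (row * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    public static void main(String[] args) {
        SimpleCountMinSketch hits = new SimpleCountMinSketch(2000, 7);
        hits.add("alice", 1);
        hits.add("alice", 1);
        hits.add("bob", 1);
        System.out.println(hits.estimateCount("alice")); // at least 2
        System.out.println(hits.estimateCount("bob"));   // at least 1
    }
}
```

Because every collision adds to a counter and never subtracts, the estimate can overshoot the true count but never undershoot it.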

43 Initialising a sketch
Epsilon: accepted error added to counts with each item
Delta: confidence, i.e. the probability that an estimate is within the accepted error
this.width = (int) Math.ceil(2 / epsilon);
this.depth = (int) Math.ceil(-Math.log(1 - delta) / Math.log(2));
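
As a worked example of the two formulas on this slide, the snippet below plugs in epsilon = 0.001 and delta = 0.99. The class and method names are made up for the illustration; only the two expressions come from the slide.

```java
/** Worked example of the width and depth formulas; names are illustrative. */
final class SketchDimensions {

    /** Number of counters per row. */
    static int width(double epsilon) {
        return (int) Math.ceil(2 / epsilon);
    }

    /** Number of rows, i.e. number of hash functions. */
    static int depth(double delta) {
        return (int) Math.ceil(-Math.log(1 - delta) / Math.log(2));
    }

    public static void main(String[] args) {
        int w = width(0.001);
        int d = depth(0.99);
        System.out.println("width = " + w + ", depth = " + d + ", counters = " + (long) w * d);
    }
}
```

With these inputs the sketch has 2,000 columns and 7 rows, so 14,000 counters in total, independent of how many items are added.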

44 <dependency>
  <groupId>com.clearspring.analytics</groupId>
  <artifactId>stream</artifactId>
  <version>2.9.2</version>
</dependency>

45 private CountMinSketch cms = new CountMinSketch(0.001, 0.99, 1);
cms.add(" ", 1);
cms.add(" ", 1);
cms.add(" ", 2);
cms.estimateCount(" "); // 2
cms.estimateCount(" "); // 0

46 Number of UUIDs | Multiset JVM heap used (MB) | CMS JVM heap used (MB)
10 | < 2 | N/A
100 | < 2 | N/A
1,000 | 3 | N/A
10,000 | 9 | N/A
100, | | N/A
1,000, | | N/A

47 Table: Epsilon | Delta | Width | Depth | CMS JVM heap used (MB)

48 Use cases
Any kind of frequency tracking!
NLP
Extension: Heavy-hitters
Extension: Range-query

49 HyperLogLog

50 HyperLogLog is for cardinality

51 Set<String> visitors = new HashSet<>();
visitors.add(" ");
visitors.add(" ");
visitors.add(" ");
visitors.add(" ");
visitors.add(" ");
visitors.size(); // 3

52 Table: Number of UUIDs | JVM heap used (MB)

53 A gentle walk into HyperLogLog

54 Linear counting A linear-time probabilistic counting algorithm for database applications (Whang, 1990)

55 A bit array of size m

56 Adding an element " " hash()

57 Adding another element " " hash()

58 Estimating cardinality Cardinality (estimate) = -m * ln((m - w) / m)
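
A small Java sketch of linear counting, assuming that w in the formula above is the number of bits set so far, so that (m - w) / m is the fraction of bits still zero. The class name and the hash-mixing step are assumptions for the example.

```java
import java.util.BitSet;

/** Toy linear counter: one hash function over a bit array of size m. */
final class LinearCounter {
    private final BitSet bits;
    private final int m;

    LinearCounter(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    /** Adding an element sets a single bit; duplicates hit the same bit. */
    void offer(String element) {
        bits.set(Math.floorMod(mix(element.hashCode()), m));
    }

    /** Estimate = -m * ln((m - w) / m), with w the number of set bits (assumed meaning of w). */
    double cardinality() {
        int w = bits.cardinality();
        return -m * Math.log((m - w) / (double) m);
    }

    // Assumption: a simple integer mixer to spread String.hashCode() values.
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        LinearCounter visitors = new LinearCounter(10_000);
        for (int i = 0; i < 500; i++) {
            visitors.offer("user-" + i);
            visitors.offer("user-" + i); // duplicates do not change the bit array
        }
        System.out.println(visitors.cardinality()); // close to 500
    }
}
```

Once every bit is set the logarithm diverges and the estimate becomes infinite, which is why the bit array has to be sized in proportion to the expected cardinality.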

59 LogLog LogLog counting of large cardinalities (Durand, Flajolet, 2003)

60 Flipping coins

61 Doing it with hashing " "

62 Doing it with hashing " " " "

63 Doing it with hashing " " " " " "

64 Doing it with hashing " " " " " " 2ⁿ = 2¹ = 2
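
The coin-flipping idea can be sketched as follows: treat each hash as a run of coin flips, remember the longest run of leading zeros n seen so far, and guess 2ⁿ distinct elements. The class name is invented, and the method takes ready-made 32-bit hash values (an assumption) so that the arithmetic is easy to follow.

```java
/** Single-estimator version of the coin-flipping idea. */
final class LeadingZeroEstimator {

    /** Returns 2^n, where n is the longest run of leading zero bits among the hashes. */
    static long estimate(int[] hashes) {
        int maxZeros = 0;
        for (int hash : hashes) {
            maxZeros = Math.max(maxZeros, Integer.numberOfLeadingZeros(hash));
        }
        return 1L << maxZeros;
    }

    public static void main(String[] args) {
        int[] hashes = {
            0x80000000, // 1000... no leading zeros
            0x40000000, // 0100... one leading zero
            0xC0000000  // 1100... no leading zeros
        };
        System.out.println(estimate(hashes)); // 2^1 = 2
    }
}
```

A single estimator like this is very noisy, since one unlucky hash with a long run of zeros doubles the guess several times over.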

65 Improving it

66 Stochastic averaging 0 " " " " " " Bucket ID

67 2^(sum(max_zeros) / buckets) * buckets * ESTIMATION_FACTOR

68 Average error = 1.3/sqrt(buckets)

69 1.3/sqrt(1024) = 0.04

70 1,024 buckets × 5 bits per bucket = 5,120 bits (5.12 Kbit)!
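
Stochastic averaging and the estimate formula above can be put together in a short sketch. Several details are assumptions for the illustration: the class name, taking the bucket ID from the top bits of a 64-bit hash, the `mix` helper used to produce hashes, and the value 0.79402 for ESTIMATION_FACTOR (twice the LogLog constant 0.39701, because this sketch counts zeros rather than the 1-based position of the first 1-bit).

```java
/** Toy LogLog counter with stochastic averaging. */
final class SimpleLogLog {
    // Assumption: 2 * 0.39701, since registers hold zero counts rather than ranks.
    private static final double ESTIMATION_FACTOR = 0.79402;

    private final int bucketBits;  // number of hash bits used as the bucket ID
    private final int[] maxZeros;  // longest run of leading zeros seen per bucket

    SimpleLogLog(int bucketBits) {
        if (bucketBits < 1 || bucketBits > 16) {
            throw new IllegalArgumentException("bucketBits must be between 1 and 16");
        }
        this.bucketBits = bucketBits;
        this.maxZeros = new int[1 << bucketBits];
    }

    /** Takes an already-hashed 64-bit value. */
    void offer(long hash) {
        int bucket = (int) (hash >>> (64 - bucketBits)); // top bits select the bucket
        long rest = hash << bucketBits;                  // remaining bits, left-aligned
        int zeros = Math.min(Long.numberOfLeadingZeros(rest), 64 - bucketBits);
        maxZeros[bucket] = Math.max(maxZeros[bucket], zeros);
    }

    /** 2^(sum(max_zeros) / buckets) * buckets * ESTIMATION_FACTOR */
    double cardinality() {
        int buckets = maxZeros.length;
        double sum = 0;
        for (int zeros : maxZeros) {
            sum += zeros;
        }
        return Math.pow(2, sum / buckets) * buckets * ESTIMATION_FACTOR;
    }

    /** Assumption: a 64-bit mixer (SplitMix64-style finalizer) standing in for a real hash. */
    static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public static void main(String[] args) {
        SimpleLogLog visitors = new SimpleLogLog(10); // 1,024 buckets
        for (long i = 0; i < 100_000; i++) {
            visitors.offer(mix(i));
        }
        System.out.println(visitors.cardinality()); // an estimate of 100,000
    }
}
```

The estimate is only meaningful when there are many more distinct elements than buckets; with few elements most buckets stay at zero and the formula overshoots.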

71 HyperLogLog HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (Flajolet, Fusy, Gandouet, 2007)

72 Average error = 1.05/sqrt(buckets)

73 1.05/sqrt(1024) = 0.032

74 1.04/sqrt(1024) = 0.03
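
For comparison, here is a minimal sketch of the HyperLogLog estimator. It assumes the form given in the cited paper: registers hold the 1-based position of the first 1-bit, and the estimate is alpha * m² divided by the sum of 2^(-register), a harmonic mean, with alpha = 0.7213 / (1 + 1.079 / m) for m of at least 128. The small-range and large-range corrections are left out, and the class name, bucket selection and `mix` helper are assumptions for the example rather than stream-lib's code.

```java
/** Toy HyperLogLog: raw estimator only, without range corrections. */
final class SimpleHyperLogLog {
    private final int bucketBits;
    private final int[] registers; // max rank (1-based position of the first 1-bit) per bucket

    SimpleHyperLogLog(int bucketBits) {
        if (bucketBits < 7 || bucketBits > 16) {
            throw new IllegalArgumentException("bucketBits must be between 7 and 16");
        }
        this.bucketBits = bucketBits;
        this.registers = new int[1 << bucketBits];
    }

    /** Takes an already-hashed 64-bit value. */
    void offer(long hash) {
        int bucket = (int) (hash >>> (64 - bucketBits));
        long rest = hash << bucketBits;
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - bucketBits) + 1;
        registers[bucket] = Math.max(registers[bucket], rank);
    }

    /** alpha * m^2 / sum(2^-register): a harmonic mean instead of LogLog's geometric mean. */
    double cardinality() {
        int m = registers.length;
        double alpha = 0.7213 / (1 + 1.079 / m);
        double sum = 0;
        for (int register : registers) {
            sum += Math.pow(2, -register);
        }
        return alpha * m * m / sum;
    }

    /** Assumption: a 64-bit mixer (SplitMix64-style finalizer) standing in for a real hash. */
    static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public static void main(String[] args) {
        SimpleHyperLogLog visitors = new SimpleHyperLogLog(10); // 1,024 buckets
        for (long i = 0; i < 100_000; i++) {
            visitors.offer(mix(i));
        }
        System.out.println(visitors.cardinality()); // an estimate of 100,000
    }
}
```

The harmonic mean damps the effect of a few buckets with unusually long runs of zeros, which is where the smaller error constant comes from.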

75 <dependency>
  <groupId>com.clearspring.analytics</groupId>
  <artifactId>stream</artifactId>
  <version>2.9.2</version>
</dependency>

76 HyperLogLog visitors = new HyperLogLog(0.03);
visitors.offer(" ");
visitors.offer(" ");
visitors.offer(" ");
visitors.offer(" ");
visitors.offer(" ");
visitors.cardinality(); // 3

77 Table: Number of UUIDs | Set JVM heap used (MB) | HyperLogLog JVM heap used (MB)

78 Use cases
Anywhere you need cardinality in O(n)!
Unique site visitors
Estimates of massive tables
Streams of data

79 That's it!

80 We learned about Bloom filters, Count-Min Sketch and HyperLogLog (and some of the story)
